WO2009123288A1 - Word Classification System, Method, and Program - Google Patents
Word Classification System, Method, and Program
- Publication number
- WO2009123288A1 (PCT/JP2009/056900)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- classification
- classified
- pair
- words
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Definitions
- the present invention relates to a word classification system, a word classification method, and a word classification program.
- An example of a vocabulary classification technique is described in Non-Patent Document 1.
- Named entity extraction, one of the vocabulary classification methods, classifies words into categories called named entities, such as organization names, place names, person names, and dates. This method reduces manual rule-generation work by taking as input learning data in which named entities have been annotated in text beforehand and learning word classification rules from it.
- In Non-Patent Document 1, word classification rules are learned from the context information around the appearance position of each word individually.
- Here, the periphery is roughly two words before and after the appearance position, and the context information consists of the word itself, its part of speech, and its character type.
- Word classification rules are learned for each named entity category: one rule for determining whether a name is an organization name, another for determining whether it is a place name, and so on. Because each rule is stored as the binary model of a learning method called Support Vector Machines, it cannot be inspected directly by humans; conceptually, however, the rule for organization names can be thought of as capturing word patterns such as "hosted by <organization name>" and "system developed by <organization name>".
- Patent Document 1 describes a word classification technique related to the present invention.
- The technique described in Patent Document 1 prepares a core word dictionary, which stores multiple sets of core words (words representative of a category) together with values indicating the degree to which each core word belongs to the category, and a document database. A classification target word is searched for in the documents stored in the document database, and words having a co-occurrence relationship with it are extracted. Each extracted co-occurring word is then looked up in the core word dictionary, a category ranking value is computed from the values of the matched core words, and the category to which the classification target word belongs is determined.
- the core word is a category-specific and representative word.
- For the category "art", for example, core words are typical words that represent "art" well and are related to the category, such as "movie", "music", and "director".
- Non-Patent Document 1: Yamada, Kudo, Matsumoto, "Japanese Named Entity Extraction Using Support Vector Machines", IPSJ SIG Notes - Natural Language Processing, Vol. 2001, No. 20, pp. 121-128.
- Patent Document 1: JP 2004-334766 A
- The problem with the technique described in Non-Patent Document 1 is that only coarse classification is possible, because only the context information at the appearance position of each word is used as a clue. For example, when trying to distinguish Japanese professional baseball teams from American professional baseball teams, the context information around each word's appearances, such as "Team A" and "Team B", is very similar, so the two cannot be classified apart.
- A problem of the technique described in Patent Document 1 is that words that can serve as core words must be prepared in advance. This preparation is very time-consuming, and if there are not enough core words, only coarse classification can be performed.
- The present invention was made in view of the above problems, and an object thereof is to provide a technique capable of classifying words in fine-grained detail.
- The present invention for solving the above problems is a word classification system having an inter-word pattern learning unit that, based on the relationship between classification-known words co-occurring in a document, learns at least one of the context information and layout information between those co-occurring classification-known words, and creates an inter-word pattern for determining whether the data of a given word pair is data of a same-classification word pair (a pair of words with the same classification) or data of a different-classification word pair (a pair of words with different classifications).
- The present invention for solving the above problems is also a word classification system comprising: an inter-word pattern learning unit that creates an inter-word pattern for determination from positive-example learning data consisting of the context information and layout information of same-classification known word pairs (pairs of known words with the same classification) co-occurring in a document, and negative-example learning data consisting of the context information and layout information of different-classification known word pairs; an inter-word pattern storage unit that stores the created inter-word patterns; a pattern application unit that, based on the relationship between an input classification-unknown word and classification-known words co-occurring with it in a document, creates application data consisting of the context information and layout information of each word pair (a pair of the classification-unknown word and a classification-known word), analyzes the application data with reference to the inter-word patterns, and outputs whether the application data of each word pair is a positive example or a negative example; and a classification determination unit that determines the classification of the classification-unknown word based on the number of positive or negative examples among the application data of the word pairs and the classification of the classification-known word in each pair.
- The present invention for solving the above problems is also a learning data generation device for generating learning data used to determine the classification of a classification-unknown word (a word whose classification is unknown), the device comprising a learning data generation unit that generates learning data including at least one of the context information and layout information between classification-known words co-occurring in a document.
- The present invention for solving the above problems is also a word classification method in which at least one of the context information and layout information between co-occurring classification-known words is learned based on the relationship between classification-known words co-occurring in a document, and the resulting inter-word pattern, which determines whether the data of a given word pair is data of a same-classification word pair or of a different-classification word pair, is used to determine the classification of a classification-unknown word.
- The present invention for solving the above problems is also a word classification method comprising: creating an inter-word pattern from positive-example learning data consisting of the context information and layout information of same-classification known word pairs co-occurring in a document and negative-example learning data consisting of the context information and layout information of different-classification known word pairs; generating, for each word pair consisting of an input classification-unknown word and a classification-known word, application data consisting of its context information and layout information; analyzing the application data with reference to the inter-word pattern to determine whether it is a positive example or a negative example; and determining the classification of the classification-unknown word from the results.
- The present invention for solving the above problems is also a learning data generation method for generating learning data used to determine the classification of a classification-unknown word, the method generating learning data including at least one of the context information and layout information between classification-known words co-occurring in a document.
- The present invention for solving the above problems is also a program for causing an information processing apparatus to execute an inter-word pattern learning process that, based on the relationship between classification-known words co-occurring in a document, learns at least one of the context information and layout information between the co-occurring classification-known words and creates an inter-word pattern for determining whether the data of a given word pair is data of a same-classification word pair or of a different-classification word pair.
- The present invention for solving the above problems is also a program for causing an information processing apparatus to execute: a process of creating an inter-word pattern from positive-example learning data consisting of the context information and layout information of same-classification known word pairs co-occurring in a document and negative-example learning data consisting of the context information and layout information of different-classification known word pairs; a process of creating, based on the relationship between a classification-unknown word and classification-known words co-occurring with it in a document, application data consisting of the context information and layout information of each word pair, and analyzing the application data with reference to the inter-word patterns to determine whether each is a positive example or a negative example; and a process of determining the classification of the classification-unknown word based on the number of positive or negative examples among the application data of the word pairs and the classification of the classification-known word in each pair.
- FIG. 1 is a block diagram of a word classification system according to the first embodiment.
- FIG. 2 is a diagram showing an example of the classified known word group database 1.
- FIG. 3 is a diagram showing an example of the document group database 2.
- FIG. 4 is a diagram illustrating an example of learning data according to the first embodiment.
- FIG. 5 is a diagram illustrating an example of learning data according to the first embodiment.
- FIG. 6 is a block diagram of a word classification system according to the second embodiment.
- FIG. 7 is a diagram illustrating an example of learning data according to the second embodiment.
- FIG. 8 is a block diagram of the word classification system according to the second embodiment.
- Words with the same classification often co-occur in one sentence, and words and parts of speech between words of the same classification are often the same. Therefore, more detailed word classification can be performed by considering not only the context information of each word but also the relationship between words of the same classification.
- the relationship between words indicates a surface character string between two words, the number of appearances, the part of speech, the number of co-occurrence, layout information, and the like.
- Japanese professional baseball team names are often described as opponents in news article sentences, or are often described consecutively with punctuation marks or symbols.
- By contrast, an American professional baseball team name and a Japanese professional baseball team name co-occur only in limited contexts such as trade news, and such co-occurrence is infrequent. In addition, the two are rarely written consecutively with punctuation marks or symbols.
- The present invention determines the classification of an unclassified word (hereinafter, a classification-unknown word) by creating pattern data based on the relationships between words whose classification is known (hereinafter, classification-known words).
- Suppose classifications A: {a, b, c}, B: {d, e}, and C: {g, h} are given, where a, b, c, d, e, g, and h are words and "classification name: {word set}" denotes a classification name and its word set.
- The words a, b, c, d, e, g, and h need not be the above-mentioned core words; the movie names themselves, such as "AAA Wars" and "BBB Story", or the team names themselves, such as "Team C", can be used.
- Word pairs of classification-known words with the same classification are positive examples, and word pairs of classification-known words with different classifications are negative examples.
- For example, since the classification-known words a and b are in the same classification, their word pair (hereinafter also written a-b) is a same-classification known word pair and therefore a positive example.
- Since the classification-known words a and d are in different classifications, their word pair (hereinafter also written a-d) is a different-classification known word pair and therefore a negative example.
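- The enumeration of positive and negative word pairs described above can be sketched as follows. This is a hypothetical Python illustration; the function and variable names are not part of the specification.

```python
from itertools import combinations

# Hypothetical classification-known word groups, following the example
# "classification name: {word set}" with classifications A, B, and C.
known_words = {
    "A": {"a", "b", "c"},
    "B": {"d", "e"},
    "C": {"g", "h"},
}

def make_labeled_pairs(groups):
    """Enumerate all word pairs of classification-known words.

    Pairs of words from the same classification are positive examples;
    pairs of words from different classifications are negative examples.
    """
    word_to_class = {w: c for c, words in groups.items() for w in words}
    all_words = sorted(word_to_class)
    pairs = []
    for w1, w2 in combinations(all_words, 2):
        label = "positive" if word_to_class[w1] == word_to_class[w2] else "negative"
        pairs.append((w1, w2, label))
    return pairs

pairs = make_labeled_pairs(known_words)
# The pair a-b is a same-classification pair -> positive example;
# the pair a-d is a different-classification pair -> negative example.
```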
- the relationship between words refers to context information and layout information between words a and b in a document in which words a and b appear.
- Context information includes the words themselves, surrounding words, the surface character strings between the words, appearance counts, parts of speech, character types, co-occurrence frequency, the distance between the words, dependency relations, and whether the words appear in a natural sentence.
- Layout information includes whether the words are arranged vertically or horizontally, whether their character sizes are the same, and, for tree-structured data such as HTML, whether they appear at the same tree depth.
- Learning data including at least one of the context information and layout information between classification-known words is created based on the relationships between the classification-known words. Subsequently, based on this learning data, an inter-word pattern is created for determining whether the data of a word pair consisting of an input classification-unknown word and a classification-known word is a positive example or a negative example.
- FIG. 1 is a block diagram of the word classification system in the first embodiment.
- As shown in FIG. 1, the word classification system comprises: a classified known word group database 1 storing word groups with known classifications; a document group database 2 storing document groups; an inter-word pattern database 3 storing learned inter-word patterns; an inter-word pattern learning unit 4 that learns the context information and layout information between classification-known words from the word groups stored in the classified known word group database 1 and the documents in the document group database 2; an inter-word pattern application unit 5 that evaluates application data of word pairs including a classification-unknown word using the inter-word patterns stored in the inter-word pattern database 3; a classification determination unit 6 that determines the classification of the classification-unknown word based on the determination results of the inter-word pattern application unit 5; an input unit 7, such as a keyboard, for inputting a classification-unknown word; and an output unit 8 for outputting the determination result of the classification determination unit 6.
- a known word group is stored for each classification.
- An example of the classified known word group database 1 is shown in FIG.
- A plurality of classifications are held in the format "classification name: {word group}".
- “Category A: ⁇ a, b, c ⁇ ” indicates that the word a, the word b, and the word c are classified into the classification A.
- This storage method is an example, and one word may belong to a plurality of classifications, or a classification name may be described for each word.
- the document group database 2 stores a plurality of documents.
- An example of the document group database 2 is shown in FIG. In FIG. 3, one document is shown in one record, and “...” In the figure represents omission.
- the inter-word pattern database 3 stores the inter-word patterns created by the inter-word pattern learning unit 4.
- This inter-word pattern is a pattern indicating whether or not two words are data of the same classification when the data of the relationship between the two words is input.
- The description method of the inter-word pattern differs depending on the learning algorithm; for example, when Support Vector Machines are used, the pattern is stored as binary model data. The inter-word pattern is described in more detail later.
- the inter-word pattern learning unit 4 inputs the classified known word group stored in the classified known word group database 1 and creates learning data while referring to the document group in the document group database 2. Then, an interword pattern is created based on the learning data.
- the inter-word pattern learning unit 4 includes a learning data creation unit 11 and a pattern creation unit 12.
- The learning data creation unit 11 takes each word pair of classification-known words stored in the classified known word group database 1 and, for each position where the pair co-occurs in a document stored in the document group database 2, learns the context information and layout information from the relationship between the words, creating learning data consisting of that context information and layout information.
- the word pair of the same classification known word is a positive example
- the word pair of a different classification known word is a negative example.
- For example, since both word a and word b belong to classification A, the pair a-b is a same-classification pair and is a positive example.
- Since word a belongs to classification A and word d belongs to classification B, the pair a-d is a different-classification pair and is a negative example.
- The context information includes the words themselves, surrounding character strings, the surface character strings between the words, appearance counts, parts of speech, character types, co-occurrence frequency, the distance between the words, dependency relations, and whether the words appear in natural sentences.
- The layout information includes whether the words a and b are arranged vertically or horizontally, whether their character sizes are the same, and, for tree-structured data such as HTML, the depth of the tree at which they appear.
- For each position where a word pair co-occurs in a document stored in the document group database 2, learning data is created that includes the context information and layout information of the word pair together with its positive/negative example label.
- The pattern creation unit 12 learns a pattern based on the learning data created by the learning data creation unit 11 and creates inter-word pattern data. For example, an inter-word pattern is created so that a positive example is determined if the application data of an input word pair indicates the same classification, and a negative example if the classifications differ. The created inter-word pattern is then registered in the inter-word pattern database 3. For creating the inter-word pattern data itself, an existing learning method such as Support Vector Machines is used.
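- The pattern creation step can be illustrated with a minimal sketch. The specification uses Support Vector Machines; the simple perceptron below is only a stand-in that likewise learns a separator between positive (same-classification) and negative (different-classification) word-pair data, and all feature names are hypothetical.

```python
# A minimal stand-in for the pattern creation unit 12 and pattern
# application unit 22: learn a weight vector ("inter-word pattern")
# over sparse feature dicts, then apply it to new word-pair data.

def train_pattern(examples, epochs=20):
    """Perceptron training over labeled word-pair learning data:
    (feature_dict, label) with label +1 (positive) or -1 (negative)."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            score = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
            if label * score <= 0:  # misclassified: update the pattern
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + label * v
                bias += label
    return weights, bias

def apply_pattern(pattern, feats):
    """Return +1 (positive example / same classification) or -1."""
    weights, bias = pattern
    score = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1

# Toy learning data: in this sketch, a punctuation mark between the
# words ("between:punct") marks same-classification pairs.
learning_data = [
    ({"between:punct": 1.0, "dist": 1.0}, +1),
    ({"between:punct": 1.0, "dist": 2.0}, +1),
    ({"between:verb": 1.0, "dist": 5.0}, -1),
    ({"between:verb": 1.0, "dist": 7.0}, -1),
]
pattern = train_pattern(learning_data)
```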
- the inter-word pattern application unit 5 inputs a word whose classification is unknown from the input unit 7, and determines application data of a word pair including the classification unknown word.
- the inter-word pattern application unit 5 includes an application data creation unit 21 and a pattern application unit 22.
- The application data creation unit 21 creates application data for the input classification-unknown word by referring to the word groups with known classifications stored in the classified known word group database 1 and the documents in the document group database 2.
- First, word pairs of the classification-unknown word with each classification-known word stored in the classified known word group database 1 are created. For example, when a classification-unknown word f is given, the word pairs f-a, f-b, f-c, f-d, f-e, f-g, and f-h are created from f and the classification-known words a, b, c, d, e, g, and h.
- Next, for each position where these word pairs co-occur in a document stored in the document group database 2, application data consisting of the context information and layout information of the word pair is created based on the relationship between the words.
- this application data is the same as that obtained by excluding information on positive examples and negative examples of word pairs from the learning data in the learning data creating unit 11 described above.
- The pattern application unit 22 takes as input the application data of the word pairs of the classification-unknown word and classification-known words, and analyzes the application data with reference to the inter-word patterns stored in the inter-word pattern database 3. It then outputs whether each application datum of a word pair is a positive example or a negative example.
- Like the pattern creation unit 12, the pattern application unit 22 uses an existing learning method such as Support Vector Machines.
- The classification determination unit 6 takes as input the positive/negative example results for the application data of each word pair of the classification-unknown word and a classification-known word, together with the classification of the classification-known word in the pair, calculates a classification score, and determines the classification of the classification-unknown word.
- As a classification determination method, for example, the classification can be determined by calculating with which classification's words the unknown word forms the most positive examples.
- the inter-word pattern learning unit 4 inputs a word group with a known classification stored in the classification known word group database 1 and creates learning data while referring to the document group in the document group database 2. Then, an interword pattern is created based on the learning data.
- First, the learning data creation unit 11 takes as input the classification-known word groups stored in the classified known word group database 1 and the documents of the document group database 2. For each word pair formed from all combinations of the classification-known words stored in the classified known word group database 1, and for each position where the words of the pair co-occur in a document, it learns the context information and layout information based on the relationship between the words and creates learning data from that context information and layout information.
- FIG. 4 is an example of learning data created from the appearance position of the word pair a-b in the document of the first record in FIG. 3, and FIG. 5 is an example of learning data created from the appearance position of the word pair a-b in the document of the second record.
- The context information includes the surface character strings between the words, the parts of speech between the words, the character types between the words (hiragana, kanji, numbers, symbols, alphabetic characters, tags), whether the words appear in a dependency clause or in a parallel clause, and the number of morphemes between the words.
- the layout information indicates whether it appears in a natural sentence, whether it is aligned vertically and horizontally, whether the character size is the same, and whether the left and right character strings are the same.
- An existing morphological analysis tool such as ChaSen can be used for word segmentation, part-of-speech tagging, and counting the morphemes between words.
- The output of an existing dependency parsing tool such as CaboCha can be used to recognize dependency and parallel relations.
- Whether text is aligned vertically or horizontally is determined from the layout position, using the drawing tool of each document format rendered at a standard size; for example, the output of an existing Web browser laid out at a screen size of 1024x768 is used.
- In HTML, the character-string size and drawing position are often determined by tag information, so the determination may be made from the tag information without actually performing the layout.
- Words of the same classification often appear in the same document, and the text between them often contains words expressing parallel relationships, such as the Japanese particle "to" (と), and symbols such as punctuation marks and the middle dot "・".
- Context information such as these surface character strings, parts of speech, character types, and dependency relations is therefore used as learning data.
- The above is only an example; for semi-structured data such as HTML, the features may also include whether the tree depths are the same. Further, not only the context information and layout information between words but also the context information of each individual word may be included, as in the related art.
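- As a rough illustration of this per-occurrence feature extraction, the sketch below derives a few inter-word context features from a tokenized sentence. A real implementation would rely on a morphological analyzer such as ChaSen and a parser such as CaboCha; the whitespace tokenization, feature names, and example sentence here are all simplifying assumptions.

```python
import re

def extract_context_features(tokens, w1, w2):
    """Context features for one co-occurrence of w1 and w2 in a
    sentence given as a token list: the surface string between them,
    the token count between them (a stand-in for the morpheme count),
    and coarse character types of the in-between text."""
    i, j = tokens.index(w1), tokens.index(w2)
    lo, hi = min(i, j), max(i, j)
    between = tokens[lo + 1:hi]
    text = "".join(between)
    return {
        "between_surface": " ".join(between),
        "between_token_count": len(between),
        "has_digit_between": bool(re.search(r"\d", text)),
        "has_symbol_between": bool(re.search(r"[^\w\s]", text)),
    }

# One co-occurrence position of a hypothetical pair in a toy sentence.
sentence = "team-a vs team-b tonight".split()
feats = extract_context_features(sentence, "team-a", "team-b")
# The surface string between the two words is "vs", one token apart.
```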
- learning data is created for every appearance position of each word pair.
- learning data is created by adding positive example information to the word pair learning data of the same classification known word and negative example information to the word pair learning data of the different classification known word.
- the pattern creation unit 12 creates an interword pattern based on the learning data created by the learning data creation unit 11.
- For example, Support Vector Machines are used. Support Vector Machines learn an inter-word pattern that separates the positive examples from the negative examples. Thus, an inter-word pattern is learned that judges the application data of a word pair as a positive example when the pair has the same classification and as a negative example when the classifications differ.
- Next, the inter-word pattern application unit 5 receives a word whose classification is unknown from the input unit 7 and outputs, for each classification-known word paired with it, whether the two are of the same classification.
- The application data creation unit 21 creates application data for the classification-unknown word input from the input unit 7, referring to the word groups with known classifications stored in the classified known word group database 1 and the documents in the document group database 2.
- Application data is created for every combination (word pair) of the unknown word f with the known words a, b, c, d, e, g, and h, one datum for each position in the documents where the pair appears.
- That is, the pairs of the classification-unknown word f with the classification-known words a, b, c, d, e, g, and h are f-a, f-b, f-c, f-d, f-e, f-g, and f-h.
- Application data is created for each position where the words of the pair co-occur in a document stored in the document group database 2. For example, if the word pair f-a co-occurs at three locations in the document group, three application data are created.
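- The per-position creation of application data can be sketched as follows; the documents and the pair f-a are hypothetical, and sentences stand in for appearance positions.

```python
# A sketch of how the application data creation unit 21 might
# enumerate co-occurrence positions: one application datum per
# position (here, per sentence) where both words of a pair appear.

documents = [
    "f and a played today. f won against a again.",
    "a hosted the match. later f faced a once more.",
    "nothing relevant in this document.",
]

def cooccurrence_positions(docs, w1, w2):
    """Return (doc_index, sentence) for each sentence in which both
    words co-occur; each such position yields one application datum."""
    positions = []
    for di, doc in enumerate(docs):
        for sent in doc.split("."):
            tokens = sent.split()
            if w1 in tokens and w2 in tokens:
                positions.append((di, sent.strip()))
    return positions

positions = cooccurrence_positions(documents, "f", "a")
# Three co-occurring sentences -> three application data for f-a.
```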
- the application data is created using the same method as the learning data.
- The pattern application unit 22 refers to the inter-word patterns stored in the inter-word pattern database 3 for each application datum received from the application data creation unit 21, and outputs whether each application datum of a word pair indicates the same classification.
- When Support Vector Machines are used in the pattern application unit 22, each application datum of a word pair is judged to be a positive or negative example: a positive example means the words of the pair have the same classification, and a negative example means they have different classifications. For example, of three occurrences of the word pair f-a (three application data), two may be positive examples and one a negative example, while all four occurrences of the word pair f-b (four application data) may be positive examples.
- the classification determination unit 6 receives the result from the pattern application unit 22 and determines the classification of the classification unknown word. For example, the classification of the classification unknown word is determined based on the probability that the word pair of the classification unknown word and the classification known word obtained from the number of positive examples or negative examples of the application data is the word pair of the same classification word.
- the determination result (classification name) is output to the output unit 8.
- the classification score is obtained by the following formula.
- Classification score(classification) = (number of positive-example word-pair application data for the classification) / (total number of word-pair application data for the classification)
- Here, the number of positive-example word-pair application data for a classification is the number of application data judged positive among the application data of word pairs of the classification-unknown word with known words of that classification. The total number of word-pair application data for a classification is the total number of application data of those word pairs.
- For classification A, the application data of the word pairs f-a, f-b, and f-c are counted: their total number is the total number of word-pair application data (classification A), and the number among them judged to be positive examples is the number of positive-example word-pair application data (classification A). Suppose, for example, that the number of positive-example application data (classification A) is 1 and the total (classification A) is 3; the number of positive-example application data (classification B) is 2 and the total (classification B) is 2; and the number of positive-example application data (classification C) is 0 and the total (classification C) is 2.
- Then the classification score (classification A) is 1/3, the classification score (classification B) is 2/2, and the classification score (classification C) is 0/2. Therefore, the classification-unknown word f is assigned to classification B, which has the highest classification score.
- This classification score is only an example; another method, such as simply totaling the positive-example application data, may be used. If no classification score reaches a threshold, the classification is left undetermined.
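- The classification score computation above can be sketched as follows, using the counts of the worked example (classification A: 1 positive of 3 data, B: 2 of 2, C: 0 of 2); the function name and result encoding are hypothetical.

```python
def classification_scores(results):
    """results: classification -> list of +1/-1 judgments over the
    application data of word pairs with that classification's words.
    Score = positive-example count / total application-data count."""
    return {c: sum(1 for r in judgments if r == 1) / len(judgments)
            for c, judgments in results.items()}

# Judgments from the pattern application step, mirroring the example.
results = {"A": [1, -1, -1], "B": [1, 1], "C": [-1, -1]}
scores = classification_scores(results)
best = max(scores, key=scores.get)
# scores: A = 1/3, B = 1.0, C = 0.0 -> f is classified as B.
```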
- As described above, the inter-word pattern learning unit learns classification rules from the context information and layout information between same-classification and different-classification words, rather than from the context information of each word alone, so words can be classified in finer detail than with conventional methods.
- FIG. 6 is a block diagram of the word classification system according to the second embodiment.
- The second embodiment differs from the first in that the learning data creation unit 11 is replaced with a second learning data creation unit 31 and the application data creation unit 21 with a second application data creation unit 41. The differences are mainly described below.
- The second learning data creation unit 31 differs in that learning data is created once for each word pair, rather than every time the word pair appears in a document.
- The second learning data creation unit 31 further adds, as context information, statistics such as the appearance frequency of the word pair and the mutual information that can be calculated from the appearance frequency.
- As context information, the following are used: character strings frequently appearing between the two words, together with their parts of speech and character types; the number of dependency relations divided by the number of appearances; the number of parallel (side-by-side) relations divided by the number of appearances; and the average number of morphemes between the words.
- As layout information, the following are used: the number of appearances within a natural sentence divided by the number of appearances; the number of vertical or horizontal alignments divided by the number of appearances; the number of times the character sizes match divided by the number of appearances; and the number of times the character strings to the left and right match divided by the number of appearances.
- The second application data creation unit 41 differs in that application data is created once for each word pair, rather than every time the word pair appears in a document.
- the application data creation method is the same as the method of the second learning data creation unit 31.
- The second learning data creation unit 31 creates, for each word pair drawn from the known classification words stored in the known classification word group database 1, learning data consisting of the context information and layout information of that pair. For example, learning data is created for every combination of the known classification words a, b, c, d, e, g, and h shown in the figure.
- FIG. 7 shows an example of the learning data for the word pair a-b. Assume that the pair a-b appears in the first and second documents of the document group database 2. The context information of the pair a-b shown in FIG. 7 then consists of three frequently occurring inter-word character strings with their parts of speech and character types, the number of occurrences in a dependency relation divided by the number of appearances, the number of occurrences in a parallel phrase divided by the number of appearances, the average number of morphemes between the words, and the number of co-occurrences. The layout information of the pair a-b shown in FIG. 7 consists of the number of appearances within a natural sentence divided by the number of appearances, the number of vertical or horizontal alignments divided by the number of appearances, the number of times the character sizes match divided by the number of appearances, and the number of times the left and right character strings match divided by the number of appearances.
- These learning data are only examples; features that represent counts, such as dependency relations, parallel relations, and appearances within natural sentences, may instead be set to 1 if the pair appears even once, and features may be combined.
- A statistical quantity such as mutual information may be used in place of the raw co-occurrence count.
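As a minimal sketch of how such per-pair features might be assembled, assuming pointwise mutual information as the "statistical quantity" (the field names and the PMI formula are illustrative assumptions, not taken from the patent):

```python
import math

def pair_features(pair_stats, corpus_size):
    """Build a per-word-pair feature dict from raw counts.  `pair_stats`
    holds, for one word pair, the appearance count `n` and raw counts for
    each context/layout event observed over those appearances."""
    n = pair_stats["n"]
    f = {
        # context information: event counts normalised by appearance count
        "dependency_ratio": pair_stats["dependency"] / n,
        "parallel_ratio": pair_stats["parallel"] / n,
        "avg_morphemes_between": pair_stats["morphemes_between"] / n,
        # layout information, normalised the same way
        "natural_sentence_ratio": pair_stats["in_sentence"] / n,
        "aligned_ratio": pair_stats["aligned"] / n,
        "same_char_size_ratio": pair_stats["same_char_size"] / n,
        "same_neighbour_string_ratio": pair_stats["same_neighbours"] / n,
    }
    # Pointwise mutual information, one possible co-occurrence statistic:
    # PMI = log( p(a,b) / (p(a) * p(b)) )
    p_ab = n / corpus_size
    p_a = pair_stats["count_a"] / corpus_size
    p_b = pair_stats["count_b"] / corpus_size
    f["pmi"] = math.log(p_ab / (p_a * p_b))
    return f

stats = {"n": 2, "dependency": 1, "parallel": 0, "morphemes_between": 6,
         "in_sentence": 2, "aligned": 0, "same_char_size": 2,
         "same_neighbours": 1, "count_a": 4, "count_b": 5}
features = pair_features(stats, corpus_size=100)
```

Each ratio corresponds to one "count divided by the number of appearances" feature listed above; the same routine can serve both the learning data and the application data, since the text states they are created by the same method.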
- The second application data creation unit 41 receives a classification-unknown word from the input unit 7 and creates its application data by referring to the known classification words stored in the known classification word group database 1 and to the documents in the document group database 2.
- Word pairs are formed between the classification-unknown word and each known classification word stored in the known classification word group database 1. For example, when the classification-unknown word f is given, the word pairs f-a, f-b, f-c, f-d, f-e, f-g, and f-h are created with the known classification words a, b, c, d, e, g, and h.
- application data including context information and layout information is created for each word pair.
- the application data is created using the same method as the second learning data creation unit 31.
- the second learning data creation unit can further add the co-occurrence frequency between words to the learning data.
- FIG. 8 is a block diagram of the word classification system according to the third embodiment.
- the learning data creation unit 11 is replaced with a third learning data creation unit 51.
- When negative-example learning data greatly outnumber the positive examples, applying the inter-word pattern to an input classification-unknown word may fail to match it to any known classification word, leaving it unclassified.
- The number of negative-example learning data becomes very large because every word pair of differently classified known words is treated as a negative example.
- In the third embodiment, the amount of negative-example learning data is reduced by treating as negative examples only those word pairs of differently classified known words that satisfy a specific condition.
- The specific condition is, for example, that the appearance frequency, co-occurrence probability, or mutual information of the two words is high.
- The third learning data creation unit 51 does not create learning data for all word pairs; for word pairs of differently classified known words, it creates learning data only for pairs that satisfy the specific condition.
- The specific condition is that the appearance frequency, co-occurrence probability, or mutual information of the word pair is high.
- the learning data creation method is the same as in the first or second embodiment.
- The third learning data creation unit 51 takes the known classification word group in the known classification word group database 1 as input and, referring to the document group, creates learning data for every word pair of the known classification words.
- Learning data is created from the inter-word context information and layout information as in the first or second embodiment. For example, word pairs are formed from every combination of the known classification words a, b, c, d, e, g, and h in the known classification word group database 1, and learning data is created for them. However, for word pairs of differently classified known words, learning data is created only for pairs that satisfy the specific condition.
- For same-classification word pairs such as a-b, a-c, d-e, and g-h, learning data is created as in the first or second embodiment.
- For the differently classified word pairs a-d, a-e, a-g, a-h, b-d, b-e, b-g, b-h, c-d, c-e, c-g, c-h, d-g, d-h, e-g, and e-h, learning data is created only for the pairs that satisfy the specific condition.
- The specific condition is that the word pair's appearance frequency, co-occurrence probability, or mutual information exceeds a certain threshold, or that the pair ranks among the top several pairs by such a measure.
- The threshold value and the rank cutoff are stored in the system in advance.
- learning data is created in the same manner as the learning data creation unit 11 or the second learning data creation unit 31 in the first or second embodiment.
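A minimal sketch of this negative-example filtering, assuming pointwise mutual information as the selection measure (the function names, data layout, and threshold are illustrative, not the patent's):

```python
import math

def pmi(pair_count, count_a, count_b, corpus_size):
    """Pointwise mutual information of a word pair."""
    return math.log((pair_count / corpus_size) /
                    ((count_a / corpus_size) * (count_b / corpus_size)))

def select_negative_pairs(diff_class_pairs, threshold=1.0):
    """Keep only differently classified pairs whose PMI exceeds the
    threshold; the rest are dropped from the negative examples.
    Each entry is (pair, pair_count, count_a, count_b, corpus_size)."""
    kept = []
    for pair, pair_count, count_a, count_b, corpus_size in diff_class_pairs:
        if pmi(pair_count, count_a, count_b, corpus_size) > threshold:
            kept.append(pair)
    return kept

pairs = [
    (("a", "d"), 8, 10, 10, 100),   # PMI = log(8): kept as a negative example
    (("a", "g"), 1, 10, 10, 100),   # PMI = log(1) = 0: dropped
]
print(select_negative_pairs(pairs))
```

Only the strongly associated pair a-d survives as a "typical" negative example; rarely co-occurring pairs like a-g are discarded, which is the mechanism the third embodiment uses to keep the negative set small.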
- By using only typical negative examples as learning data, the third learning data creation unit suppresses the growth in the number of negative examples. This can raise classification accuracy.
- In the embodiments above, both context information and layout information are used to represent the relationship between words, but only one of them may be used.
- In the embodiments above, the learning data creation unit, the pattern creation unit, the application data creation unit, and the pattern application unit are implemented as hardware, but some or all of them may instead be implemented by a CPU or the like operating under a program.
- The first aspect of the present invention is a word classification system having an inter-word pattern learning unit that, based on the relationship between classified known words co-appearing in a document, learns at least one of the context information and layout information between those words and creates an inter-word pattern for determining whether the data of a word pair, which is a set of words, is data of a same-classification word pair, which is a set of words of the same classification, or data of a different-classification word pair, which is a set of words of different classifications.
- The second aspect of the present invention, in the above aspect, further has an inter-word pattern application unit that, based on the relationship between an input classification-unknown word and classified known words co-appearing in a document, generates application data of word pairs of the unknown word and the known words, consisting of at least one of the context information and layout information between them, analyzes the application data with reference to the inter-word pattern, and determines whether the application data is data of a same-classification word pair or data of a different-classification word pair; and a classification determination unit that determines the classification of the unknown word based on the determination result of the inter-word pattern application unit.
- In another aspect, the inter-word pattern learning unit has a learning data creation unit that creates, based on the relationship between same-classification words co-appearing in a document, learning data consisting of at least one of the context information and layout information of same-classification known word pairs, and, based on the relationship between differently classified known words co-appearing in a document, learning data consisting of at least one of the context information and layout information of different-classification known word pairs; and an inter-word pattern creation unit that, based on the learning data, creates an inter-word pattern for determining whether the data of an input word pair containing the classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- the learning data creation unit creates learning data for each appearance position of a document in which the same-classified known word pair or the different-classified known word pair co-occurs.
- the learning data creation unit creates learning data for each same-classified known word pair or for each different-classified known word pair.
- In another aspect, the learning data creation unit uses the learning data of same-classification known word pairs as positive-example learning data and the learning data of different-classification known word pairs as negative-example learning data.
- the learning data creation unit sets learning data of a word pair satisfying a specific condition among learning data of differently classified word pairs as negative example learning data.
- In another aspect, the specific condition is that the co-occurrence frequency between the words is higher than a predetermined value, that the co-occurrence probability is higher than a predetermined value, or that the mutual information is higher than a predetermined value.
- In the ninth aspect of the present invention, the inter-word pattern application unit has an application data creation unit that, based on the relationship between the classification-unknown word and classified known words co-appearing in a document, creates application data consisting of at least one of the context information and layout information of word pairs of the unknown word and the known words; and a pattern application unit that analyzes the application data with reference to the inter-word pattern and determines whether the application data is data of a same-classification word pair or data of a different-classification word pair.
- In another aspect, the inter-word pattern learning unit has a learning data creation unit that creates, based on the relationships between same-classification words and between differently classified known words co-appearing in a document, learning data consisting of at least one of the context information and layout information of same-classification known word pairs and of different-classification known word pairs; and an inter-word pattern creation unit that, based on the learning data, creates an inter-word pattern for determining whether the data of an input word pair containing the classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- the pattern application unit outputs whether the application data of the word pair is a positive example or a negative example.
- In another aspect, the inter-word pattern creation unit is Support Vector Machines.
- In another aspect, the pattern application unit is Support Vector Machines.
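The patent names Support Vector Machines as the learner behind both units; as a self-contained illustration we substitute a simple perceptron-style linear classifier (an assumption for the sketch, not the patent's implementation). Positive examples are same-classification pairs, negative examples different-classification pairs:

```python
def train_linear(X, y, epochs=100, lr=0.1):
    """Perceptron-style linear classifier, standing in for the Support
    Vector Machines learner the patent names.  y uses 1 = positive
    (same-classification pair), 0 = negative (different classification)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            err = yi - pred
            if err:
                w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
                b += lr * err
    return w, b

def apply_pattern(w, b, x):
    """Pattern application unit: output positive/negative for one pair."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b > 0

# Toy feature vectors, e.g. [dependency_ratio, parallel_ratio, sentence_ratio].
X_train = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9],   # same-classification pairs
           [0.1, 0.0, 0.2], [0.2, 0.1, 0.1]]   # different-classification pairs
y_train = [1, 1, 0, 0]
w, b = train_linear(X_train, y_train)
print(apply_pattern(w, b, [0.85, 0.15, 0.9]))  # a positive-looking pair
```

The learned weights play the role of the stored inter-word pattern; `apply_pattern` corresponds to the pattern application unit's positive/negative output, which the classification determination unit then aggregates into scores.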
- In another aspect, the classification determination unit determines the classification of the classification-unknown word based on the probability that a word pair of the unknown word and a classified known word is a word pair of same-classification words.
- Another aspect of the present invention is a word classification system having: an inter-word pattern learning unit that creates positive-example learning data consisting of the context information and layout information of same-classification known word pairs and negative-example learning data consisting of the context information and layout information of different-classification known word pairs, and, based on the learning data, creates an inter-word pattern for determining whether the data of a word pair of an input classification-unknown word and a classified known word is a positive example or a negative example; an inter-word pattern storage unit that stores the created inter-word pattern; an inter-word pattern application unit that, based on the relationship between the input classification-unknown word and classified known words co-appearing in a document, creates application data consisting of the context information and layout information of word pairs of the unknown word and the known words, analyzes the application data with reference to the inter-word pattern, and outputs whether the application data is a positive example or a negative example; and a classification determination unit that determines the classification of the unknown word based on the number of positive or negative examples among the word-pair application data and the classifications of the known words in the word pairs.
- Another aspect of the present invention is a learning data generation device for generating learning data used to determine the classification of a classification-unknown word, the device having a learning data generation unit that generates learning data consisting of at least one of the context information and layout information between classified known words co-appearing in a document.
- Another aspect of the present invention is a word classification method in which at least one of the context information and layout information between co-appearing classified known words is learned based on the relationship between classified known words co-appearing in a document, and an inter-word pattern for determining whether the data of a word pair is data of a same-classification word pair or data of a different-classification word pair is used to determine the classification of a classification-unknown word.
- In the eighteenth aspect of the present invention, based on the relationship between an input classification-unknown word and classified known words co-appearing in a document, application data of word pairs of the unknown word and the known words, consisting of at least one of the context information and layout information between them, is generated; the application data is analyzed with reference to the inter-word pattern to determine whether it is data of a same-classification word pair or data of a different-classification word pair; and the classification of the unknown word is determined based on the determination result.
- In another aspect, learning data consisting of at least one of the context information and layout information of same-classification known word pairs is created based on the relationship between same-classification words co-appearing in a document, and learning data consisting of at least one of the context information and layout information of different-classification known word pairs is created based on the relationship between differently classified known words co-appearing in a document; based on this learning data, an inter-word pattern is created for determining whether the data of an input word pair containing the classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- learning data is created for each appearance position of a document in which the same class known word pair or the different class known word pair co-occurs.
- learning data is created for each known-word pair of the same classification or for each known-word pair of different classification.
- In another aspect, the learning data of same-classification known word pairs is generated as positive-example learning data, and the learning data of different-classification known word pairs is generated as negative-example learning data.
- learning data of word pairs satisfying a specific condition among learning data of differently classified word pairs is set as negative example learning data.
- In another aspect, the specific condition is that the co-occurrence frequency between the words is higher than a predetermined value, that the co-occurrence probability is higher than a predetermined value, or that the mutual information is higher than a predetermined value.
- In another aspect, based on the relationship between the classification-unknown word and classified known words co-appearing in a document, application data consisting of at least one of the context information and layout information of word pairs of the unknown word and the known words is created; the application data is analyzed with reference to the inter-word pattern to determine whether it is data of a same-classification word pair or data of a different-classification word pair.
- In another aspect, learning data consisting of at least one of the context information and layout information of same-classification known word pairs, and learning data consisting of at least one of the context information and layout information of different-classification known word pairs co-appearing in a document, are created; based on this learning data, an inter-word pattern is created for determining whether the data of a word pair containing the input classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- the application data is analyzed, and whether the application data of the word pair is a positive example or a negative example is output.
- the classification of the unknown classification word is determined based on the probability that the word pair of the unknown classification word and the known classification word is a word pair of the same classification word.
- In another aspect, application data consisting of the context information and layout information of word pairs of the classification-unknown word and classified known words is created; the application data is analyzed with reference to the inter-word pattern to determine whether it is a positive example or a negative example; and the classification of the unknown word in the word pairs is determined based on the number of positive or negative examples among the word-pair application data and the classifications of the known words in the word pairs.
- Another aspect of the present invention is a learning data generation method for generating learning data used to determine the classification of a classification-unknown word, in which learning data consisting of at least one of the context information and layout information between classified known words co-appearing in a document is generated.
- Another aspect of the present invention is a program that causes an information processing device to execute inter-word pattern learning processing that learns at least one of the context information and layout information between classified known words co-appearing in a document, based on the relationship between them, and creates an inter-word pattern for determining whether the data of a word pair is data of a same-classification word pair, which is a set of words of the same classification, or data of a different-classification word pair, which is a set of words of different classifications.
- Another aspect of the present invention is a program that causes an information processing device to execute: processing that creates positive-example learning data consisting of the context information and layout information of same-classification known word pairs, and negative-example learning data consisting of the context information and layout information of different-classification known word pairs co-appearing in a document; processing that, based on the learning data, creates an inter-word pattern for determining whether the data of a word pair of an input classification-unknown word and a classified known word is a positive example or a negative example; processing that, based on the relationship between the classification-unknown word and classified known words co-appearing in a document, creates application data consisting of the context information and layout information of each such word pair; processing that analyzes the application data with reference to the inter-word pattern and determines whether it is a positive example or a negative example; and processing that determines the classification of the unknown word based on the number of positive or negative examples and the classifications of the known words in the word pairs.
- the present invention can be applied to automatic dictionary classification as a base for a morphological analysis tool or the like.
- the present invention can also be applied to uses such as search navigation by presenting similar terms in a search system.
Description
Yamada, Kudo, Matsumoto, "Japanese Named Entity Extraction Using Support Vector Machines", IPSJ SIG Notes: Natural Language Processing, Vol. 2001, No. 20, pp. 121-128
2 Document group database
3 Inter-word pattern database
4 Inter-word pattern learning unit
5 Inter-word pattern application unit
6 Classification determination unit
7 Input unit
8 Output unit
<First Embodiment>
The first embodiment will be described.
The positive word-pair application data count (classification) is the number of word-pair application data items, between a known word of that classification and the classification-unknown word, that were judged to be positive examples. The total word-pair application data count (classification) is the total number of word-pair application data items between known words of that classification and the unknown word.
If no classification score reaches the threshold, the classification is left undetermined.
<Second Embodiment>
The second embodiment will be described.
<Third Embodiment>
The third embodiment will be described.
Claims (32)
- A word classification system comprising an inter-word pattern learning unit that, based on the relationship between classified known words co-appearing in a document, learns at least one of the context information and layout information between the co-appearing classified known words, and creates an inter-word pattern for determining whether the data of a word pair, which is a set of words, is data of a same-classification word pair, which is a set of words of the same classification, or data of a different-classification word pair, which is a set of words of different classifications.
- The word classification system according to claim 1, comprising: an inter-word pattern application unit that, based on the relationship between an input classification-unknown word and classified known words co-appearing in a document, generates application data of word pairs, each a set of the classification-unknown word and a classified known word, consisting of at least one of the context information and layout information between them, analyzes the application data of the word pairs with reference to the inter-word pattern, and determines whether the application data is data of a same-classification word pair or data of a different-classification word pair; and a classification determination unit that determines the classification of the classification-unknown word of the word pairs based on the determination result of the inter-word pattern application unit.
- The word classification system according to claim 1 or 2, wherein the inter-word pattern learning unit comprises: a learning data creation unit that creates, based on the relationship between same-classification words co-appearing in a document, learning data consisting of at least one of the context information and layout information of same-classification known word pairs, each a set of the same-classification known words, and, based on the relationship between differently classified known words co-appearing in a document, learning data consisting of at least one of the context information and layout information of different-classification known word pairs, each a set of the differently classified known words; and an inter-word pattern creation unit that, based on the learning data, creates an inter-word pattern for determining whether the data of an input word pair containing a classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- The word classification system according to claim 3, wherein the learning data creation unit creates learning data for each appearance position in documents where a same-classification known word pair or a different-classification known word pair co-appears.
- The word classification system according to claim 3, wherein the learning data creation unit creates learning data for each same-classification known word pair or for each different-classification known word pair.
- The word classification system according to any one of claims 3 to 5, wherein the learning data creation unit uses the learning data of same-classification known word pairs as positive-example learning data and the learning data of different-classification known word pairs as negative-example learning data.
- The word classification system according to claim 6, wherein the learning data creation unit uses, as the negative-example learning data, the learning data of word pairs satisfying a specific condition among the learning data of different-classification word pairs.
- The word classification system according to claim 7, wherein the specific condition is that the co-occurrence frequency between the words is higher than a predetermined value, that the co-occurrence probability is higher than a predetermined value, or that the mutual information is higher than a predetermined value.
- The word classification system according to claim 2, wherein the inter-word pattern application unit comprises: an application data creation unit that, based on the relationship between the classification-unknown word and classified known words co-appearing in a document, creates application data consisting of at least one of the context information and layout information of word pairs, each a set of the classification-unknown word and a classified known word; and a pattern application unit that analyzes the application data of the word pairs with reference to the inter-word pattern and determines whether the application data is data of a same-classification word pair or data of a different-classification word pair.
- The word classification system according to claim 9, wherein the inter-word pattern learning unit comprises: a learning data creation unit that creates, based on the relationship between same-classification words co-appearing in a document, learning data consisting of at least one of the context information and layout information of same-classification known word pairs, and, based on the relationship between differently classified known words co-appearing in a document, learning data consisting of at least one of the context information and layout information of different-classification known word pairs; and an inter-word pattern creation unit that, based on the learning data, creates an inter-word pattern for determining whether the data of an input word pair containing a classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- The word classification system according to claim 9 or 10, wherein the pattern application unit outputs whether the application data of the word pair is a positive example or a negative example.
- The word classification system according to any one of claims 3 to 8 and claim 10, wherein the inter-word pattern creation unit is Support Vector Machines.
- The word classification system according to any one of claims 9 to 11, wherein the pattern application unit is Support Vector Machines.
- The word classification system according to claim 2 or any one of claims 9 to 13, wherein the classification determination unit determines the classification of the classification-unknown word based on the probability that a word pair of the classification-unknown word and a classified known word is a word pair of same-classification words.
- A word classification system comprising: an inter-word pattern learning unit that, based on the context information and layout information between same-classification known words co-appearing in a document, creates positive-example learning data consisting of the context information and layout information of same-classification known word pairs, each a set of the same-classification known words, and, based on the context information and layout information between differently classified known words co-appearing in a document, creates negative-example learning data consisting of the context information and layout information of different-classification known word pairs, each a set of the differently classified known words, and, based on the learning data, creates an inter-word pattern for determining whether the data of a word pair, which is a set of an input classification-unknown word and a classified known word, is a positive example or a negative example; an inter-word pattern storage unit that stores the created inter-word pattern; an inter-word pattern application unit that, based on the relationship between the input classification-unknown word and classified known words co-appearing in a document, creates application data consisting of the context information and layout information of word pairs, each a set of the classification-unknown word and a classified known word, analyzes the application data of the word pairs with reference to the inter-word pattern, and outputs whether the application data is a positive example or a negative example; and a classification determination unit that determines the classification of the classification-unknown word of the word pairs based on the number of positive or negative examples among the application data of the word pairs and the classifications of the classified known words of the word pairs.
- A learning data generation device for generating learning data used to determine the classification of a classification-unknown word whose classification is unknown, the device comprising a learning data generation unit that generates learning data consisting of at least one of the context information and layout information between classified known words co-appearing in a document.
- A word classification method in which at least one of the context information and layout information between co-appearing classified known words is learned based on the relationship between classified known words co-appearing in a document, and an inter-word pattern for determining whether the data of a word pair, which is a set of words, is data of a same-classification word pair, which is a set of words of the same classification, or data of a different-classification word pair, which is a set of words of different classifications, is used to determine the classification of a classification-unknown word.
- The word classification method according to claim 17, wherein application data of word pairs, each a set of an input classification-unknown word and a classified known word, consisting of at least one of the context information and layout information between them, is generated based on the relationship between the classification-unknown word and classified known words co-appearing in a document; the application data of the word pairs is analyzed with reference to the inter-word pattern to determine whether it is data of a same-classification word pair or data of a different-classification word pair; and the classification of the classification-unknown word is determined based on the determination result.
- The word classification method according to claim 17 or 18, wherein learning data consisting of at least one of the context information and layout information of same-classification known word pairs is created based on the relationship between same-classification words co-appearing in a document, learning data consisting of at least one of the context information and layout information of different-classification known word pairs is created based on the relationship between differently classified known words co-appearing in a document, and, based on the learning data, an inter-word pattern is created for determining whether the data of an input word pair containing a classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- The word classification method according to claim 19, wherein learning data is created for each appearance position in documents where a same-classification known word pair or a different-classification known word pair co-appears.
- The word classification method according to claim 19, wherein learning data is created for each same-classification known word pair or for each different-classification known word pair.
- The word classification method according to any one of claims 19 to 21, wherein the learning data of same-classification known word pairs is created as positive-example learning data and the learning data of different-classification known word pairs is created as negative-example learning data.
- The word classification method according to claim 22, wherein the learning data of word pairs satisfying a specific condition among the learning data of different-classification word pairs is used as the negative-example learning data.
- The word classification method according to claim 23, wherein the specific condition is that the co-occurrence frequency between the words is higher than a predetermined value, that the co-occurrence probability is higher than a predetermined value, or that the mutual information is higher than a predetermined value.
- The word classification method according to claim 18, wherein application data consisting of at least one of the context information and layout information of word pairs, each a set of the classification-unknown word and a classified known word, is created based on the relationship between the classification-unknown word and classified known words co-appearing in a document, and the application data of the word pairs is analyzed with reference to the inter-word pattern to determine whether it is data of a same-classification word pair or data of a different-classification word pair.
- The word classification method according to claim 25, wherein learning data consisting of at least one of the context information and layout information of same-classification known word pairs is created based on the relationship between same-classification words co-appearing in a document, learning data consisting of at least one of the context information and layout information of different-classification known word pairs is created based on the relationship between differently classified known words co-appearing in a document, and, based on the learning data, an inter-word pattern is created for determining whether the data of an input word pair containing a classification-unknown word is data of a same-classification word pair or data of a different-classification word pair.
- The word classification method according to claim 25 or 26, wherein the application data is analyzed and whether the application data of the word pair is a positive example or a negative example is output.
- The word classification method according to claim 18 or any one of claims 25 to 27, wherein the classification of the classification-unknown word is determined based on the probability that a word pair of the classification-unknown word and a classified known word is a word pair of same-classification words.
- A word classification method comprising: creating, based on the context information and layout information between same-classification known words co-appearing in a document, positive-example learning data consisting of the context information and layout information of same-classification known word pairs, and, based on the context information and layout information between differently classified known words co-appearing in a document, negative-example learning data consisting of the context information and layout information of different-classification known word pairs; creating, based on the learning data, an inter-word pattern for determining whether the data of a word pair, which is a set of an input classification-unknown word and a classified known word, is a positive example or a negative example; creating, based on the relationship between a classification-unknown word and classified known words co-appearing in a document, application data consisting of the context information and layout information of word pairs, each a set of the classification-unknown word and a classified known word, analyzing the application data of the word pairs with reference to the inter-word pattern, and judging whether the application data is a positive example or a negative example; and determining the classification of the classification-unknown word of the word pairs based on the number of positive or negative examples among the application data of the word pairs and the classifications of the classified known words of the word pairs.
- A learning data generation method for generating learning data used to determine the classification of a classification-unknown word whose classification is unknown, the method generating learning data consisting of at least one of the context information and layout information between classified known words co-appearing in a document.
- A program that causes an information processing device to execute inter-word pattern learning processing that, based on the relationship between classified known words co-appearing in a document, learns at least one of the context information and layout information between the co-appearing classified known words and creates an inter-word pattern for determining whether the data of a word pair, which is a set of words, is data of a same-classification word pair, which is a set of words of the same classification, or data of a different-classification word pair, which is a set of words of different classifications.
- A program that causes an information processing device to execute: processing that creates, based on the context information and layout information between same-classification known words co-appearing in a document, positive-example learning data consisting of the context information and layout information of same-classification known word pairs, and, based on the context information and layout information between differently classified known words co-appearing in a document, negative-example learning data consisting of the context information and layout information of different-classification known word pairs; processing that creates, based on the learning data, an inter-word pattern for determining whether the data of a word pair, which is a set of an input classification-unknown word and a classified known word, is a positive example or a negative example; processing that creates, based on the relationship between a classification-unknown word and classified known words co-appearing in a document, application data consisting of the context information and layout information of word pairs, each a set of the classification-unknown word and a classified known word; processing that analyzes the application data of the word pairs with reference to the inter-word pattern and judges whether the application data is a positive example or a negative example; and processing that determines the classification of the classification-unknown word of the word pairs based on the number of positive or negative examples among the application data of the word pairs and the classifications of the classified known words of the word pairs.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/920,920 US8504356B2 (en) | 2008-04-03 | 2009-04-02 | Word classification system, method, and program |
JP2010505983A JP5447862B2 (ja) | 2008-04-03 | 2009-04-02 | 単語分類システム、方法およびプログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008097520 | 2008-04-03 | ||
JP2008-097520 | 2008-04-03 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009123288A1 true WO2009123288A1 (ja) | 2009-10-08 |
Family
ID=41135655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/056900 WO2009123288A1 (ja) | 2008-04-03 | 2009-04-02 | 単語分類システム、方法およびプログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US8504356B2 (ja) |
JP (1) | JP5447862B2 (ja) |
WO (1) | WO2009123288A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012173809A (ja) * | 2011-02-17 | 2012-09-10 | Nippon Telegr & Teleph Corp <Ntt> | Apparatus, method, and program for determining the presence or absence of a concrete subject |
JP2012173810A (ja) * | 2011-02-17 | 2012-09-10 | Nippon Telegr & Teleph Corp <Ntt> | Subject extraction apparatus, method, and program |
JP2018010532A (ja) * | 2016-07-14 | 2018-01-18 | 株式会社レトリバ | Information processing apparatus, program, and information processing method |
WO2020144736A1 (ja) * | 2019-01-08 | 2020-07-16 | 三菱電機株式会社 | Semantic relation learning device, semantic relation learning method, and semantic relation learning program |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10339214B2 (en) * | 2011-11-04 | 2019-07-02 | International Business Machines Corporation | Structured term recognition |
KR101508059B1 (ko) * | 2013-06-26 | 2015-04-07 | 숭실대학교산학협력단 | Apparatus and method for predicting the pleasantness-unpleasantness index of words |
US20150309987A1 (en) | 2014-04-29 | 2015-10-29 | Google Inc. | Classification of Offensive Words |
KR101567789B1 (ko) * | 2014-08-26 | 2015-11-11 | 숭실대학교산학협력단 | Apparatus and method for predicting the pleasantness-unpleasantness index of words using relative emotion similarity |
US9529898B2 (en) * | 2014-08-26 | 2016-12-27 | Google Inc. | Clustering classes in language modeling |
JP2017134693A (ja) * | 2016-01-28 | 2017-08-03 | 富士通株式会社 | Semantic information registration support program, information processing apparatus, and semantic information registration support method |
JP6729232B2 (ja) * | 2016-09-20 | 2020-07-22 | 富士通株式会社 | Message distribution program, message distribution apparatus, and message distribution method |
US11270082B2 (en) | 2018-08-20 | 2022-03-08 | Verint Americas Inc. | Hybrid natural language understanding |
US11217226B2 (en) | 2018-10-30 | 2022-01-04 | Verint Americas Inc. | System to detect and reduce understanding bias in intelligent virtual assistants |
US11604927B2 (en) | 2019-03-07 | 2023-03-14 | Verint Americas Inc. | System and method for adapting sentiment analysis to user profiles to reduce bias |
WO2020247586A1 (en) | 2019-06-06 | 2020-12-10 | Verint Americas Inc. | Automated conversation review to surface virtual assistant misunderstandings |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08147307A (ja) * | 1994-11-22 | 1996-06-07 | Gijutsu Kenkyu Kumiai Shinjoho Shiyori Kaihatsu Kiko | Semantic knowledge acquisition apparatus |
JP2007004458A (ja) * | 2005-06-23 | 2007-01-11 | National Institute Of Information & Communication Technology | Binary relation extraction apparatus, information retrieval apparatus using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6311152B1 (en) * | 1999-04-08 | 2001-10-30 | Kent Ridge Digital Labs | System for chinese tokenization and named entity recognition |
US7299180B2 (en) * | 2002-12-10 | 2007-11-20 | International Business Machines Corporation | Name entity extraction using language models |
JP3847273B2 (ja) | 2003-05-12 | 2006-11-22 | Oki Electric Industry Co., Ltd. | Word classification device, word classification method, and word classification program |
WO2005116866A1 (en) * | 2004-05-28 | 2005-12-08 | Agency For Science, Technology And Research | Method and system for word sequence processing |
US8280719B2 (en) * | 2005-05-05 | 2012-10-02 | Ramp, Inc. | Methods and systems relating to information extraction |
US9135238B2 (en) * | 2006-03-31 | 2015-09-15 | Google Inc. | Disambiguation of named entities |
CN101075228B (zh) * | 2006-05-15 | 2012-05-23 | Panasonic Corporation | Method and apparatus for recognizing named entities in natural language |
US20080052262A1 (en) * | 2006-08-22 | 2008-02-28 | Serhiy Kosinov | Method for personalized named entity recognition |
- 2009
- 2009-04-02 US US12/920,920 patent/US8504356B2/en active Active
- 2009-04-02 WO PCT/JP2009/056900 patent/WO2009123288A1/ja active Application Filing
- 2009-04-02 JP JP2010505983A patent/JP5447862B2/ja not_active Expired - Fee Related
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012173809A (ja) * | 2011-02-17 | 2012-09-10 | Nippon Telegr & Teleph Corp <Ntt> | Apparatus, method, and program for determining the presence or absence of a concrete topic |
JP2012173810A (ja) * | 2011-02-17 | 2012-09-10 | Nippon Telegr & Teleph Corp <Ntt> | Topic extraction apparatus, method, and program |
JP2018010532A (ja) * | 2016-07-14 | 2018-01-18 | Retrieva, Inc. | Information processing apparatus, program, and information processing method |
WO2020144736A1 (ja) * | 2019-01-08 | 2020-07-16 | Mitsubishi Electric Corporation | Semantic relation learning device, semantic relation learning method, and semantic relation learning program |
JPWO2020144736A1 (ja) * | 2019-01-08 | 2021-02-18 | Mitsubishi Electric Corporation | Semantic relation learning device, semantic relation learning method, and semantic relation learning program |
Also Published As
Publication number | Publication date |
---|---|
US20110029303A1 (en) | 2011-02-03 |
US8504356B2 (en) | 2013-08-06 |
JPWO2009123288A1 (ja) | 2011-07-28 |
JP5447862B2 (ja) | 2014-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5447862B2 (ja) | Word classification system, method, and program | |
CA2777520C (en) | System and method for phrase identification | |
TWI536181B (zh) | 在多語文本中的語言識別 | |
US7877383B2 (en) | Ranking and accessing definitions of terms | |
Pillay et al. | Authorship attribution of web forum posts | |
JP2008165598A (ja) | Reputation information extraction device and reputation information extraction method | |
JP4911599B2 (ja) | Reputation information extraction device and reputation information extraction method | |
Ashraf et al. | Cross-Genre Author Profile Prediction Using Stylometry-Based Approach. | |
Zheng et al. | Dynamic knowledge-base alignment for coreference resolution | |
Tschuggnall et al. | Enhancing authorship attribution by utilizing syntax tree profiles | |
JP2007047974A (ja) | Information extraction apparatus and information extraction method | |
JP2005301856A (ja) | Document search method, document search program, and document search apparatus for executing the same | |
Jha et al. | Hsas: Hindi subjectivity analysis system | |
JP4005343B2 (ja) | Information retrieval system | |
Dianati et al. | Words stemming based on structural and semantic similarity | |
WO2009113289A1 (ja) | New case generation device, new case generation method, and new case generation program | |
Batanović et al. | Sentiment classification of documents in Serbian: The effects of morphological normalization | |
Heidary et al. | Automatic Persian text summarization using linguistic features from text structure analysis | |
Hollingsworth | Syntactic stylometry: using sentence structure for authorship attribution | |
Kyjánek et al. | Constructing a lexical resource of Russian derivational morphology | |
JP5506482B2 (ja) | Named entity extraction device, string-named entity class pair database creation device, named entity extraction method, string-named entity class pair database creation method, and program | |
Bosch et al. | Memory-based morphological analysis and part-of-speech tagging of Arabic | |
JP4088171B2 (ja) | Text analysis apparatus, method, program, and recording medium storing the program | |
JP4341077B2 (ja) | Document processing apparatus, document processing method, and document processing program | |
Vasili et al. | A study of summarization techniques in Albanian language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09728588; Country of ref document: EP; Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 12920920; Country of ref document: US |
WWE | Wipo information: entry into national phase | Ref document number: 2010505983; Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 09728588; Country of ref document: EP; Kind code of ref document: A1 |