JP2008242626A

JP2008242626A - Term registration apparatus

Info

Publication number: JP2008242626A
Application number: JP2007079627A
Authority: JP
Inventors: Makoto Imamura; 誠今村; Yasuhiro Takayama; 泰博高山
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-03-26
Filing date: 2007-03-26
Publication date: 2008-10-09

Abstract

PROBLEM TO BE SOLVED: To register unregistered terms in appropriate semantic categories by increasing the accuracy of semantic category determination even when sentence patterns are ambiguous with respect to semantic category.

SOLUTION: A sentence pattern certainty calculation part 3 calculates the certainty of a sentence pattern in consideration of the relationships between the semantic categories of the plurality of terms constituting the sentence pattern. A term/semantic category similarity calculation part 6 calculates similarities between an unregistered term and a plurality of semantic categories in a thesaurus using the sentence pattern certainty. A term registration part 8 identifies the semantic category with the highest similarity to the unregistered term by referring to the results of the term/semantic category similarity calculation part 6, and registers the unregistered term in that semantic category.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a term registration device that extracts terms from a corpus and registers them in a thesaurus.

In recent years, as the volume of business documents stored electronically has grown, there has been increasing demand for technology that searches and analyzes business-specific documents and feeds the search and analysis results back into the business.
One example is a technique that checks for descriptions likely to lead to defects by searching the relevant parts of a new design document using terms related to past defects recorded in defect reports.
Another example is a technique that, when analyzing defect reports, analyzes the correlation (text mining) between terms for hardware such as devices and components and terms for attributes such as current and voltage, and feeds information on the component attributes that lead to defects back into the design.

To make it possible to search business documents in a specific field for field-specific terms, it is necessary to use a thesaurus (a dictionary describing the semantic categories to which terms belong, the broader-narrower relationships between terms, and so on) that provides semantic categories for the field-specific terms.
In general, the terms used in the target documents differ from field to field, as do the semantic categories to which those terms belong. An existing thesaurus for another field therefore cannot be used, and a thesaurus applicable to the field in question must be created.
Because creating such a thesaurus individually by hand is inefficient, there is a need for technology that registers terms in a thesaurus with high accuracy.
Here, "term" is a concept that includes both single words and compound words.

For example, Patent Documents 1 and 2 below disclose term registration devices that register terms in a thesaurus.
Patent Document 1 discloses a technique that selects the semantic category (registration node) of the thesaurus in which to register an unregistered word (term) by calculating the similarity of the unregistered word based on the occurrence frequencies of "verb + case" and "noun".
However, with the term registration device disclosed in Patent Document 1, when a "verb + case" sentence pattern is ambiguous with respect to semantic category, the semantic category of the thesaurus in which to register the unregistered word cannot be selected accurately.
Here, ambiguity of semantic category refers to a situation in which, for example, given the sentence pattern "fix X" (Xを固定する) consisting of the verb "fix" (固定する) and the case particle "wo" (を), the term X may belong to the semantic category [component] in some cases and to the semantic category [signal] in others.

Patent Document 2 below discloses a method of constructing a thesaurus of words by obtaining distances between words based on co-occurrence data between words.
Specifically, the similarity of "nouns" is calculated based on the occurrence frequencies of "verb + case" and "noun". In addition, when a "verb" belongs to a plurality of semantic categories, the ambiguity of the semantic category to which the "verb" belongs is resolved, and the verb in each semantic category is treated as a separate word.
For these separate words, the occurrence frequencies of "verb + case" and "noun" are recalculated, and the semantic category to which each "noun" belongs is recalculated, which improves the accuracy of calculating the semantic category of the "noun".
In Patent Document 2, the ambiguity of the semantic category to which a "verb" belongs is resolved by clustering based only on the distances between the elements of the set that fills the sentence pattern.
For example, for the sentence pattern "fix X", the ambiguity is resolved by clustering on the distances between the elements of the set of terms X that take the case particle "wo" (を).

Patent Document 1: JP 2005-326952 A (paragraphs [0028] to [0029], FIG. 1)
Patent Document 2: JP 2001-331515 A (paragraphs [0023] to [0049], FIG. 1)

Because conventional term registration devices are configured as described above, Patent Document 1 cannot accurately select the semantic category of the thesaurus in which to register an unregistered word when a "verb + case" sentence pattern is ambiguous with respect to semantic category. Patent Document 2, on the other hand, can improve the selection accuracy by resolving the ambiguity of the semantic category to which a "verb" belongs when the "verb" belongs to a plurality of semantic categories. However, it uses no new information other than the "verb + case" sentence pattern when recalculating the semantic category to which a "noun" belongs. As a result, when the differences in distance between the elements of the term X in the sentence pattern are small, for instance because the available corpus is small and sufficient statistics cannot be obtained, or when the ambiguity cannot be resolved without using a plurality of cases, the ambiguity of the semantic category to which the "verb" belongs cannot be resolved well, and recalculating the semantic category of the "noun" does not improve the selection accuracy.

For example, suppose terms are extracted with the sentence pattern "fix X" from electrical technical documents containing sentences such as "fix the unit to the base" (基盤にユニットを固定する) and "fix DWN to the address" (アドレスにＤＷＮを固定する). The term "unit", which belongs to the semantic category [component], and the term "DWN", which belongs to the semantic category [signal], are both extracted.
If the distance between the terms "unit" and "DWN", both of which take the case particle "wo" (を), is then calculated, the distance may turn out to be small, so that the ambiguity of the semantic category cannot be resolved.
In this case, the ambiguity could potentially be resolved by calculating the distance between terms using not only the terms that take the case "wo" but also the terms "base" and "address", which take the case particle "ni" (に). Patent Document 2, however, does not perform such a distance calculation.

The present invention has been made to solve the problems described above, and its object is to obtain a term registration device that can improve the accuracy of semantic category selection and register unregistered terms in appropriate semantic categories even when sentence patterns are ambiguous with respect to semantic category.

The term registration device according to the present invention includes: term extraction means for extracting terms from a corpus using a sentence pattern that expresses constraints on the relationships between terms; sentence pattern certainty calculation means for calculating the certainty of the sentence pattern in consideration of the relationships between the semantic categories to which the plurality of terms constituting the sentence pattern belong; and term/semantic category similarity calculation means for calculating, using the sentence pattern certainty calculated by the sentence pattern certainty calculation means, the similarity between an unregistered term extracted by the term extraction means and each of a plurality of semantic categories in a thesaurus. Term registration means refers to the calculation results of the term/semantic category similarity calculation means, identifies the semantic category with the highest similarity to the unregistered term extracted by the term extraction means, and registers the unregistered term in that semantic category.

According to the present invention, the device is provided with term extraction means for extracting terms from a corpus using a sentence pattern that expresses constraints on the relationships between terms, sentence pattern certainty calculation means for calculating the certainty of the sentence pattern in consideration of the relationships between the semantic categories to which the plurality of terms constituting the sentence pattern belong, and term/semantic category similarity calculation means for calculating, using the sentence pattern certainty calculated by the sentence pattern certainty calculation means, the similarity between an unregistered term extracted by the term extraction means and each of a plurality of semantic categories in a thesaurus. The term registration means refers to the calculation results of the term/semantic category similarity calculation means, identifies the semantic category with the highest similarity to the unregistered term extracted by the term extraction means, and registers the unregistered term in that semantic category. This has the effect that, even when a sentence pattern is ambiguous with respect to semantic category, the accuracy of semantic category selection is improved and unregistered terms can be registered in appropriate semantic categories.

Embodiment 1.
FIG. 1 is a block diagram showing a term registration device according to Embodiment 1 of the present invention. In the figure, a corpus 1 is a memory storing the document data of digitized business documents.
A term extraction unit 2 extracts terms from the corpus 1 using sentence patterns that express constraints on the relationships between terms, that is, constraints on the dependency relationships between terms. The term extraction unit 2 constitutes the term extraction means.

A sentence pattern certainty calculation unit 3 calculates the certainty of a sentence pattern (the sentence pattern used by the term extraction unit 2) in consideration of the relationships between the semantic categories to which the plurality of terms constituting the sentence pattern belong. The sentence pattern certainty calculation unit 3 constitutes the sentence pattern certainty calculation means.
An inter-term similarity calculation unit 4 calculates the similarity between terms using the sentence pattern certainty calculated by the sentence pattern certainty calculation unit 3. The inter-term similarity calculation unit 4 constitutes the inter-term similarity calculation means.
An inter-term similarity DB 5 is a database that stores the similarities between terms calculated by the inter-term similarity calculation unit 4.

A term/semantic category similarity calculation unit 6 uses the similarities between terms calculated by the inter-term similarity calculation unit 4 to calculate the similarity between an unregistered term extracted by the term extraction unit 2 and each of the plurality of semantic categories in a thesaurus 9. The term/semantic category similarity calculation unit 6 constitutes the term/semantic category similarity calculation means.
A term/semantic category similarity DB 7 is a database that stores the similarities calculated by the term/semantic category similarity calculation unit 6.

A term registration unit 8 refers to the calculation results of the term/semantic category similarity calculation unit 6, identifies, among the plurality of semantic categories in the thesaurus 9, the semantic category with the highest similarity to the unregistered term extracted by the term extraction unit 2, and registers the unregistered term in that semantic category. The term registration unit 8 constitutes the term registration means.
The thesaurus 9 is a memory that stores a plurality of semantic categories together with the terms belonging to those semantic categories.
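The selection performed by the term registration unit 8 amounts to an argmax over per-category similarity scores. The following is a minimal sketch of that step, not the patented implementation: the function name `register_term`, the dictionary layout of the thesaurus, and the numeric scores are all assumptions made for illustration.

```python
def register_term(term, similarities, thesaurus):
    """Add `term` to the category with the highest similarity and return its label."""
    # `similarities` maps a category label to a score for `term` (hypothetical values).
    best = max(similarities, key=similarities.get)
    thesaurus.setdefault(best, set()).add(term)
    return best


# Hypothetical thesaurus: category label -> set of registered terms.
thesaurus = {"[component]": {"metal fitting", "switch"}, "[signal]": {"DWN"}}

# Hypothetical scores, standing in for the output of the
# term/semantic category similarity calculation unit 6.
scores = {"[component]": 0.8, "[signal]": 0.3}

chosen = register_term("unit", scores, thesaurus)
```

Ties are broken here by dictionary order, which the source does not specify.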

FIG. 2 is a block diagram showing the hardware resources of a computer that implements the term registration device according to Embodiment 1 of the present invention.
The computer that implements the term registration device of FIG. 1 comprises the following hardware resources.
・An input device 11 consisting of a keyboard, a mouse, and the like
・A communication device 12 used for communication with other control devices
・A control device 13 comprising a CPU (Central Processing Unit) 14, a main storage device 15, and the like
The term extraction unit 2, sentence pattern certainty calculation unit 3, inter-term similarity calculation unit 4, term/semantic category similarity calculation unit 6, and term registration unit 8 of FIG. 1 are implemented in the control device 13.

・A secondary storage device 16 that stores programs describing the processing of the term extraction unit 2, sentence pattern certainty calculation unit 3, inter-term similarity calculation unit 4, term/semantic category similarity calculation unit 6, and term registration unit 8 of FIG. 1, and that also stores the inter-term similarity DB 5, term/semantic category similarity DB 7, and thesaurus 9 of FIG. 1
The secondary storage device 16 may also be used as auxiliary storage during calculation processing in the control device 13.
・An output device 17 comprising a display device 18 such as a display and a printing device 19 such as a printer
The output device 17 is used to output intermediate results of the calculation processing in the control device 13, the similarities, the contents of the thesaurus 9, and so on.

When the programs and data describing the processing of the term extraction unit 2, sentence pattern certainty calculation unit 3, inter-term similarity calculation unit 4, term/semantic category similarity calculation unit 6, and term registration unit 8 of FIG. 1 are recorded on a recording medium 20 such as a CD-ROM, a recording medium drive device 21 reads the programs and data from the recording medium 20, and the control device 13 records them in the secondary storage device 16.

FIG. 3 is a flowchart showing the processing of the term registration device according to Embodiment 1 of the present invention.
FIG. 4 is an explanatory diagram showing an example of a corpus.
For simplicity, FIG. 4 shows only the sentences, but information such as the number of the source document and the number of the sentence within that document may be attached.

FIG. 5 is an explanatory diagram showing an example of sentence patterns.
FIG. 5 shows an example in which sentence patterns are written in the form "Φ(cs) ⇒ semantic category".
Φ denotes a verb and cs a particle expressing a case; Φ(cs) means "extract the elements that are in a dependency relationship with the verb Φ through the case cs".
The semantic category is the label of a semantic category (a set of terms) in the thesaurus 9.
For simplicity, FIG. 5 shows sentence patterns that use a single case, but sentence patterns that use a plurality of cases may also be used.
In the following description, for brevity, the Φ(cs) part alone is sometimes called the sentence pattern. Semantic category labels are written enclosed in [].
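As an illustration of the "Φ(cs) ⇒ semantic category" form, the sketch below models a sentence pattern as a small record and applies it to dependency triples. It assumes the corpus has already been parsed into (noun, case particle, verb) triples; the class name `SentencePattern`, the function `match`, the romanized particle names, and the English glosses are all hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePattern:
    verb: str      # Φ: the governing verb
    case: str      # cs: the case particle linking the term to the verb
    category: str  # semantic category label, written in [] as in the text


def match(pattern, dependencies):
    """Return the nouns that depend on pattern.verb through pattern.case."""
    return [noun for noun, case, verb in dependencies
            if verb == pattern.verb and case == pattern.case]


# Hypothetical pre-parsed dependencies: (noun, case particle, verb).
deps = [
    ("unit", "wo", "fix"),
    ("base", "ni", "fix"),
    ("DWN", "wo", "fix"),
]

pattern = SentencePattern(verb="fix", case="wo", category="[component]")
matched = match(pattern, deps)
```

Both "unit" and "DWN" match the pattern fix(wo), which is exactly the ambiguity the sentence pattern certainty is meant to handle.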

FIG. 6 is an explanatory diagram showing an example of the term data extracted by the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
Each item of term data consists of a "headword" and a "frequency".
FIG. 6(a) shows an example in which the terms are verbs, where V denotes the headword of a verb. FIG. 6(b) shows an example in which the terms are nouns, where N denotes the headword of a noun.

FIG. 7 is an explanatory diagram showing the relationships between the terms and cases extracted by the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
In FIG. 7, the terms are predicates such as verbs and adjectives, and the case relationships are expressed by surface case particles. Terms of other parts of speech, or deep case relationships (cases obtained by deep syntactic and semantic analysis), may also be used.

FIG. 8 is an explanatory diagram showing an example of the thesaurus 9 of the term registration device according to Embodiment 1 of the present invention.
In FIG. 8, each node of the thesaurus 9 is either a semantic category label or a term. Items enclosed in [] are semantic category labels, and items not enclosed in [] are terms.
From each node, the nodes reached by following "−" and "+" toward the lower right are narrower terms of that node, and the node reached toward the upper left is its broader term.
In general, semantic categories are hierarchical, but for simplicity the following description treats each leaf term as belonging to the semantic category immediately above it.
Also for simplicity, only broader-narrower relationships are shown here, but the thesaurus may include other relationships such as synonymy and part-whole relationships.
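Under the simplification that each leaf term belongs to the category immediately above it, the thesaurus can be pictured as a mapping from category labels to sets of terms. The sketch below is one possible representation; the contents are drawn from the examples in the text rather than from FIG. 8, and the helper names `category_of` and `is_registered` are hypothetical.

```python
# Hypothetical flat thesaurus: category label -> leaf terms directly under it.
thesaurus = {
    "[component]": {"metal fitting", "switch"},
    "[signal]": {"DWN"},
}


def category_of(term, thesaurus):
    """Return the label of the category that directly contains `term`, or None."""
    for label, terms in thesaurus.items():
        if term in terms:
            return label
    return None


def is_registered(term, thesaurus):
    """A term is 'registered' if some category already contains it."""
    return category_of(term, thesaurus) is not None
```

For example, `category_of("switch", thesaurus)` returns `"[component]"`, while `is_registered("unit", thesaurus)` is false, making "unit" a candidate for registration.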

Next, the operation will be described.
The term extraction unit 2 extracts terms from the corpus 1 of FIG. 4 using the sentence patterns of FIG. 5, which express constraints on the dependency relationships between terms (step ST1).
Here, as shown in FIG. 6, the term extraction unit 2 extracts the headwords V and N of terms that are verbs and terms that are nouns, together with the frequency of each term.
The term extraction unit 2 also extracts the relationships between terms and cases, as shown in FIG. 7.

As described above, the term extraction unit 2 extracts terms and the relationships between terms and cases. It does so by performing morphological analysis and syntactic analysis on the corpus 1. Methods for morphological and syntactic analysis are widely known, so a detailed description is omitted here.
The syntactic analysis is assumed to include dependency analysis, which determines which phrase in a sentence depends on which other phrase.
Here, a phrase (bunsetsu) consists of an independent word (noun, verb, etc.) and attached words (particles, auxiliary verbs, etc.). Typical results of dependency analysis include a noun + case particle depending on a verb, and a noun + case particle phrase standing in a coordinate relationship with the phrase of another noun.
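Once morphological and dependency analysis have been run, step ST1 reduces to counting. The sketch below assumes a hypothetical parser output in which each clause is a verb plus a mapping from case particle to noun; it then tallies verb frequencies, noun frequencies, and term-case relations in the spirit of FIGS. 6 and 7. The clause data and the name `extract_terms` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical parser output: one (verb, {case particle: noun}) pair per clause.
clauses = [
    ("fix", {"wo": "unit", "ni": "base"}),
    ("fix", {"wo": "DWN", "ni": "address"}),
    ("fix", {"wo": "unit", "de": "screw"}),
]


def extract_terms(clauses):
    """Count verb headwords, noun headwords, and (noun, case, verb) relations."""
    verb_freq, noun_freq, relations = Counter(), Counter(), Counter()
    for verb, args in clauses:
        verb_freq[verb] += 1
        for case, noun in args.items():
            noun_freq[noun] += 1
            relations[(noun, case, verb)] += 1
    return verb_freq, noun_freq, relations


verb_freq, noun_freq, relations = extract_terms(clauses)
```

A real system would obtain `clauses` from a Japanese morphological analyzer and dependency parser; that step is outside this sketch.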

When the term extraction unit 2 has extracted terms from the corpus 1, the sentence pattern certainty calculation unit 3 calculates the certainty of the sentence pattern (Φ, cs) in consideration of the relationships between the semantic categories to which the plurality of terms constituting the sentence pattern of FIG. 5, used by the term extraction unit 2, belong (step ST2).
In doing so, the sentence pattern certainty calculation unit 3 uses the similarities between terms stored in the inter-term similarity DB 5.
In the initial state, before the inter-term similarity calculation unit 4 has calculated any similarities, the inter-term similarity DB 5 holds predetermined initial values (similarities between terms).
The calculation of the sentence pattern certainty in the sentence pattern certainty calculation unit 3 is described in detail below.

First, to explain the certainty of a sentence pattern, the following equation (1), which calculates the "certainty between terms using a sentence pattern", is described.

Here, Φ denotes a verb, cs a case, and e1 and e2 terms.
cs' is a case other than the case cs.
W(ei, Φ(cs), cs') is the set of words that, in the corpus 1, co-occur with the sentence pattern Φ(cs) in the case cs'; it is called the "word bag" that takes the case cs' in the sentence pattern (Φ, cs).

Dw(W1, W2) is the distance between word bag W1 and word bag W2, and is defined below.
For example, when Φ1 = "fix" (固定する), cs = "wo" (を), e1 = "unit" (ユニット), and e2 = "metal fitting" (金具), the cases cs' are cs1' = "ni" (に) and cs2' = "de" (で); that is, cs1' and cs2' are cases other than "wo".
The word bags are then as follows.
W1(unit, fix(wo), ni) = {基盤, キバン} (two spellings of "base")
W2(metal fitting, fix(wo), ni) = {制御盤, セイギョバン} (two spellings of "control panel")
W3(unit, fix(wo), de) = {フック, 留め金} ("hook", "clasp")
W4(unit, fix(wo), de) = {ねじ, ネジ} (two spellings of "screw")
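The word bags listed above can be sketched as follows. The construction assumes, based on the example, that W(e, Φ(cs), cs') collects the words filling the case cs' in clauses where the verb is Φ and the case cs is filled by the term e. The clause format and the function name `word_bag` are hypothetical, and a `Counter` is used so that the per-word frequency within a bag is available as well.

```python
from collections import Counter


def word_bag(term, verb, case, other_case, clauses):
    """Collect the words in `other_case` that co-occur with `term` in verb(case)."""
    bag = Counter()
    for clause_verb, args in clauses:
        if (clause_verb == verb and args.get(case) == term
                and other_case in args):
            bag[args[other_case]] += 1
    return bag


# Hypothetical parsed clauses: (verb, {case particle: noun}).
clauses = [
    ("fix", {"wo": "unit", "ni": "base"}),
    ("fix", {"wo": "unit", "de": "hook"}),
    ("fix", {"wo": "metal fitting", "ni": "control panel"}),
    ("attach", {"wo": "unit", "ni": "rack"}),
]

w1 = word_bag("unit", "fix", "wo", "ni", clauses)
w2 = word_bag("metal fitting", "fix", "wo", "ni", clauses)
```

The clause with the verb "attach" is ignored for both bags because it does not instantiate the pattern fix(wo).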

The distance Dw(W1, W2) between word bag W1 and word bag W2 is defined by the following equation (2).

Here, N(w|W) is the frequency of word w in word bag W.
If either W1 or W2 is the empty set, Dw(W1, W2) = 0.

Here, D_g(F)(w1, w2) is the distance between term w1 and term w2; its initial value is obtained in advance by a conventional technique.
Alternatively, it may be obtained from the inter-term similarities stored in the inter-term similarity DB 5 (the similarities calculated by the inter-term similarity calculation unit 4), or a predetermined set value may be used.
In the iteration phase, however, it is obtained from the term similarities stored in the inter-term similarity DB 5.

For example, the "certainty between terms using a sentence pattern" for the terms "unit" and "metal fitting" is calculated as in the following equation (3).

Next, the certainty of a sentence pattern is expressed in the form I_(Φ,cs)(e, semCat), and is calculated from the certainty between terms using the sentence pattern, as in the following equation (4).

Here, e denotes a term and semCat the semantic category to which the term belongs.
e_j ∈ semCat denotes the terms whose semantic category is known to be "semCat" (all such terms in the corpus 1).

The description above has the sentence pattern certainty calculation unit 3 calculate the certainty of a sentence pattern as the sum of the certainties between terms over the words in the corpus 1 whose semantic category is known. The certainty of the sentence pattern may instead be calculated by multiplying the certainties between terms by suitable coefficients or by applying other operations to them.
For example, when the corpus 1 contains the terms "metal fitting" and "switch", whose semantic category is known to be [component], the certainty that a term e taking the sentence pattern Φ(cs) = fix(wo) is a [component] is calculated by the following equation (5).
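The formulas for equations (4) and (5) do not appear in this text. Based only on the description that the sentence pattern certainty is the sum of the pairwise certainties over the terms e_j known to belong to semCat, they would plausibly take the following form. This is a hedged reconstruction; the exact notation and any normalizing factors in the original formulas are assumptions.

```latex
% Assumed form of equation (4): sum of pairwise certainties over known members of semCat
I_{(\Phi,cs)}(e, \mathit{semCat}) = \sum_{e_j \in \mathit{semCat}} I_{(\Phi,cs)}(e, e_j)

% Assumed form of equation (5): the two known [component] terms in the example
I_{(\text{fix},\,\text{wo})}(e, [\text{component}])
  = I_{(\text{fix},\,\text{wo})}(e, \text{metal fitting})
  + I_{(\text{fix},\,\text{wo})}(e, \text{switch})
```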

Once the sentence pattern certainty calculation unit 3 has calculated the certainty of the sentence pattern as described above, the inter-term similarity calculation unit 4 uses that certainty to calculate the similarity between terms, and stores the result in the inter-term similarity DB 5 (step ST3).
For example, the similarity D(e1, e2) between terms is calculated by the following equation (6).

Equation (6) corresponds to calculating the certainty between terms I_(Φ,cs)(e1, e2) for the terms e1 and e2 under every sentence pattern Φ(cs) and taking the sum of all the results.
That is, when Φ ranges over 固定する (fix), 装着する (mount), and 取り付ける (attach), and the case cs ranges over を (wo), に (ni), and で (de), it corresponds to calculating the following equation (7).
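The sum over every combination of verb and case can be sketched as below. This is an illustration only: `term_similarity` and `toy_certainty` are hypothetical names, and the toy certainty values are invented, since the definition of the inter-term certainty is not reproduced here.

```python
from itertools import product
from typing import Callable, Sequence


def term_similarity(
    pair_certainty: Callable[[str, str, str, str], float],
    e1: str,
    e2: str,
    predicates: Sequence[str],
    cases: Sequence[str],
) -> float:
    """Sum the inter-term certainty of (e1, e2) over every pattern (predicate, case)."""
    return sum(pair_certainty(p, c, e1, e2) for p, c in product(predicates, cases))


VERBS = ["固定する", "装着する", "取り付ける"]
CASES = ["を", "に", "で"]


def toy_certainty(verb: str, case: str, e1: str, e2: str) -> float:
    """Hypothetical certainty: 1.0 for patterns assumed to be shared by both terms."""
    shared = {("固定する", "を"), ("装着する", "を"), ("取り付ける", "に")}
    return 1.0 if (verb, case) in shared else 0.0


# Nine (verb, case) patterns are visited; three of them contribute in this toy data.
similarity = term_similarity(toy_certainty, "金具", "スイッチ", VERBS, CASES)
```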

Once the inter-term similarity calculation unit 4 has calculated the similarity between terms, the term/semantic category similarity calculation unit 6 uses that similarity (or the certainty of the sentence pattern calculated by the sentence pattern certainty calculation unit 3) to calculate the similarity between each unregistered term extracted by the term extraction unit 2 and the plurality of semantic categories in the thesaurus 9, and stores the result in the term/semantic category similarity DB 7 (step ST4).
The similarity between an unregistered term and the semantic categories in the thesaurus 9 can be obtained from the co-occurrence frequency of the unregistered term with sentence patterns and the co-occurrence frequency of registered terms (terms already registered in a semantic category of the thesaurus 9) with sentence patterns.

That is, the similarity between an unregistered term e and a semantic category semCat can be calculated by the following equation (8).

Here, e denotes a term and semCat denotes the semantic category to which the term belongs.
Further, e_j ∈ semCat denotes a term whose semantic category is known to be "semCat" (all such terms in corpus 1).

Once the term/semantic category similarity calculation unit 6 has calculated the similarity between the unregistered term and the plurality of semantic categories in the thesaurus 9, the term registration unit 8 refers to that calculation result, identifies, among the semantic categories in the thesaurus 9, the one having the highest similarity to the unregistered term extracted by the term extraction unit 2, and registers the unregistered term in that semantic category (step ST5).
That is, the unregistered term is registered as a child node of the semantic category label in the thesaurus 9 that is closest in distance to the unregistered term.
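The registration step can be sketched as picking the category with the highest score and appending the term to it. The sketch assumes that the term-to-category similarity is the sum of inter-term similarities over the known members of the category, which is one plausible reading and is not confirmed by the text shown here. The names `category_similarity` and `register_term`, the term ボルト (bolt), and the scores are hypothetical.

```python
from typing import Callable, Dict, List


def category_similarity(
    term: str, members: List[str], term_sim: Callable[[str, str], float]
) -> float:
    """Assumed aggregation: sum of inter-term similarity over the known members."""
    return sum(term_sim(term, member) for member in members)


def register_term(
    term: str, thesaurus: Dict[str, List[str]], term_sim: Callable[[str, str], float]
) -> str:
    """Register `term` under the most similar category and return that category."""
    best = max(
        thesaurus,
        key=lambda category: category_similarity(term, thesaurus[category], term_sim),
    )
    thesaurus[best].append(term)  # the new term becomes a child of the category label
    return best


# Toy thesaurus and hypothetical similarity scores; not taken from the specification.
thesaurus = {"部品": ["金具", "スイッチ"], "属性": ["電圧", "電流"]}
toy_sim = {("ボルト", "金具"): 0.75, ("ボルト", "スイッチ"): 0.5, ("ボルト", "電圧"): 0.25}
chosen = register_term("ボルト", thesaurus, lambda a, b: toy_sim.get((a, b), 0.0))
```

In this toy data the term is placed under 部品 (component) because its summed similarity there (1.25) exceeds that for 属性 (attribute) (0.25).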

When the processing of steps ST1 to ST5 has finished as described above, the term registration device calculates the difference between the similarity between terms newly calculated by the inter-term similarity calculation unit 4 and the similarity between terms that was stored in the inter-term similarity DB 5 before that calculation. It likewise calculates the difference between the similarity between the unregistered term and the semantic category newly calculated by the term/semantic category similarity calculation unit 6 and the similarity that was stored in the term/semantic category similarity DB 7 before that calculation. If either difference is equal to or greater than a predetermined threshold, the process returns to step ST2 and steps ST2 to ST5 are executed again.
If, on the other hand, both differences are less than the predetermined threshold, the series of processing ends.
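The repeat-until-stable control flow can be sketched as follows. For brevity the sketch tracks a single dictionary of scores rather than the two databases described above, and it adds a `max_rounds` safety limit that the specification does not mention. `iterate_until_stable` and `toy_step` are hypothetical names.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def iterate_until_stable(
    step: Callable[[Scores], Scores],
    initial: Scores,
    threshold: float,
    max_rounds: int = 100,
) -> Tuple[Scores, int]:
    """Re-run `step` until no score changes by `threshold` or more.

    Returns the final scores and the number of rounds executed.
    """
    previous = initial
    for round_no in range(1, max_rounds + 1):
        current = step(previous)
        largest_change = max(
            (abs(value - previous.get(key, 0.0)) for key, value in current.items()),
            default=0.0,
        )
        previous = current
        if largest_change < threshold:
            return previous, round_no
    return previous, max_rounds


def toy_step(scores: Scores) -> Scores:
    """Hypothetical update that moves every score halfway toward 1.0."""
    return {key: (value + 1.0) / 2.0 for key, value in scores.items()}


final_scores, rounds = iterate_until_stable(toy_step, {"a": 0.0}, threshold=0.1)
```

With this toy update the changes are 0.5, 0.25, 0.125, and 0.0625, so the loop stops after the fourth round.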

As is clear from the above, Embodiment 1 provides the term extraction unit 2, which extracts terms from corpus 1 using a sentence pattern that expresses a constraint on the relationship between terms; the sentence pattern certainty calculation unit 3, which calculates the certainty of the sentence pattern in consideration of the relationship between the semantic categories to which the plurality of terms constituting the sentence pattern belong; and the term/semantic category similarity calculation unit 6, which uses the certainty calculated by the sentence pattern certainty calculation unit 3 to calculate the similarity between an unregistered term extracted by the term extraction unit 2 and the plurality of semantic categories in the thesaurus. The term registration unit 8 refers to the calculation result of the term/semantic category similarity calculation unit 6, identifies the semantic category having the highest similarity to the unregistered term, and registers the unregistered term in that semantic category. Consequently, even when a sentence pattern is ambiguous with respect to semantic category, the accuracy of selecting the semantic category is improved and the unregistered term can be registered in an appropriate semantic category.
That is, according to Embodiment 1, the certainty of a sentence pattern is calculated in consideration of the relationship between the semantic categories to which the plurality of terms constituting the sentence pattern belong (taking into account not only the case を but also cases such as に), so that even when a sentence pattern is ambiguous with respect to semantic category, the accuracy of selecting the semantic category is improved and the unregistered term can be registered in an appropriate semantic category.

Further, according to Embodiment 1, the sentence pattern certainty calculation unit 3 calculates the certainty of the sentence pattern using the similarity between terms calculated in the previous round by the inter-term similarity calculation unit 4, which improves the accuracy of the sentence pattern certainty.

Embodiment 2.
In Embodiment 1 above, the term extraction unit 2 extracts terms from corpus 1 using sentence patterns that express constraints on the dependency relationship between terms. Alternatively, the term extraction unit 2 may extract terms from corpus 1 using sentence patterns that express constraints on the similarity relationship between terms.
Specifically, this is done as follows.

FIG. 9 is an explanatory diagram showing an example of a corpus.
For simplicity, FIG. 9 shows only the sentences, but information such as the number of the source document and the number of the sentence within that document may be attached.
FIG. 10 is an explanatory diagram showing an example of sentence patterns.
FIG. 10 shows an example in which sentence patterns are written in the form "Φ(cs) ⇒ semantic category".
In Embodiment 1 above, Φ denotes a verb, whereas in Embodiment 2, Φ denotes the relationship between terms, represented by its surface expression.

FIG. 11 is a flowchart showing the processing of the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
FIG. 12 is an explanatory diagram showing the relationships between terms extracted by the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.

Next, the operation will be described.
The term extraction unit 2 extracts terms from the corpus 1 of FIG. 9 using the sentence patterns of FIG. 10, which express constraints on the similarity relationship between terms.
The term extraction processing in the term extraction unit 2 is described in detail below.

First, the term extraction unit 2 performs morphological analysis and syntactic analysis on the sentences stored in corpus 1 (step ST11).
Methods of morphological analysis and syntactic analysis are widely known, so a detailed description is omitted here.

After performing morphological analysis and syntactic analysis on the sentences, the term extraction unit 2 extracts from the analysis results, as data matching the sentence patterns of FIG. 10, inter-term relationship data such as that shown in FIG. 12 (step ST12).
In FIG. 12, the Rel column stores the relationship between terms, the E1 column stores one of the terms in the sentence pattern, and the E2 column stores the other term.
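The Rel / E1 / E2 records can be illustrated with a toy matcher over tokens that are assumed to be already segmented. The sketch skips real morphological and syntactic analysis, and it treats each connective as a single token, which a real analyzer might not do. The `CONNECTIVES` set, the function `extract_relations`, and the sample tokens are hypothetical and do not reproduce the contents of FIG. 12.

```python
from typing import Dict, List

# Surface connectives treated as relationships between terms (assumed to be single tokens).
CONNECTIVES = {"または", "の", "と", "のような"}


def extract_relations(tokens: List[str]) -> List[Dict[str, str]]:
    """Return one Rel/E1/E2 record for each 'term connective term' triple."""
    rows = []
    for left, rel, right in zip(tokens, tokens[1:], tokens[2:]):
        if rel in CONNECTIVES and left not in CONNECTIVES and right not in CONNECTIVES:
            rows.append({"Rel": rel, "E1": left, "E2": right})
    return rows


# Hypothetical pre-segmented phrases.
single = extract_relations(["コンデンサ", "の", "電圧"])
chained = extract_relations(["リモコン", "の", "ボタン", "と", "スイッチ"])
```

A chained phrase yields one record per adjacent pair of terms, so the second call produces two records.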

As shown in FIG. 12, the sentence patterns in Embodiment 2 include patterns such as "AまたはB" (A or B), "AのB" (B of A), "AとB" (A and B), and "AのようなB" (B such as A).
For example, when the relationship between terms is "の", it is ambiguous as to the relationship between semantic categories. It may be a thing-attribute relation, "<component> の <attribute>" (e.g., コンデンサの電圧, the voltage of a capacitor), or a part-whole relation, "<component> の <component>" (e.g., リモコンのボタン, the button of a remote control).

In Embodiment 2, the parts of Embodiment 1 that describe Φ as a verb are read with Φ replaced by the relationship between terms stored in the Rel column of FIG. 12, so that the sentence pattern certainty calculation unit 3 calculates the certainty of a sentence pattern having the relationship Rel.
As in Embodiment 1, once the inter-term similarity calculation unit 4 has calculated the similarity between terms, the term/semantic category similarity calculation unit 6 uses that similarity (or the certainty of the sentence pattern calculated by the sentence pattern certainty calculation unit 3) to calculate the similarity between the relationship Rel extracted by the term extraction unit 2 and the plurality of semantic categories in the thesaurus 9, and stores the result in the term/semantic category similarity DB 7.

Once the term/semantic category similarity calculation unit 6 has calculated the similarity between the relationship Rel and the plurality of semantic categories in the thesaurus 9, the term registration unit 8 refers to that calculation result, identifies, among the semantic categories in the thesaurus 9, the one having the highest similarity to the relationship Rel extracted by the term extraction unit 2, and registers the unregistered term in that semantic category.
That is, the unregistered term is registered as a child node of the semantic category label in the thesaurus 9 that is closest in distance to the relationship Rel.

As is clear from the above, according to Embodiment 2, the term extraction unit 2 extracts terms from corpus 1 using the sentence patterns of FIG. 10, which express constraints on the similarity relationship between terms, so that unregistered terms can be registered in appropriate semantic categories even when corpus 1 contains similar terms.

Embodiment 3.
In Embodiment 1 above, the sentence pattern certainty calculation unit 3 calculates the certainty of the sentence pattern using the similarity between terms stored in the inter-term similarity DB 5. When terms appear infrequently, however, the appearance frequencies may be smoothed.
That is, in Embodiment 3, the certainty of the sentence pattern is corrected according to the similarity between certain terms extracted by the term extraction unit 2, for example between terms that are verbs.
Although Embodiment 3 describes correcting the certainty of the sentence pattern according to the similarity between verbs, the correction is not limited to verbs. For example, the certainty of the sentence pattern may be corrected according to the similarity between terms that are adjectives or between terms that are adjectival verbs.

FIG. 13 is an explanatory diagram showing an example of a corpus.
FIG. 14 is a flowchart showing the processing of the sentence pattern certainty calculation unit 3 of the term registration device according to Embodiment 1 of the present invention.

Next, the operation will be described.
As in Embodiment 1, the term extraction unit 2 extracts terms from the corpus 1 of FIG. 13 using the sentence patterns of FIG. 5, which express constraints on the dependency relationship between terms (step ST1 in FIG. 3).
Here, as shown in FIG. 6, the term extraction unit 2 extracts the headwords V and N of terms that are verbs and terms that are nouns, together with the frequency of each term.
The term extraction unit 2 also extracts the relationship between terms and cases, as shown in FIG. 7.

Once the term extraction unit 2 has extracted terms from corpus 1, the sentence pattern certainty calculation unit 3, as in Embodiment 1, calculates the certainty of the sentence pattern (Φ, cs) in consideration of the relationship between the semantic categories to which the plurality of terms constituting the sentence pattern of FIG. 5 used by the term extraction unit 2 belong (step ST2 in FIG. 3).
The sentence pattern certainty calculation unit 3 also calculates the similarity between the verbs constituting sentence patterns from the similarity between the nouns that serve as their case elements (step ST21 in FIG. 14).

In step ST2 of FIG. 3, the sentence pattern certainty calculation unit 3 calculates the certainty by treating the sentence pattern Φ(cs) as "extract the elements in a dependency relationship with the verb Φ in the case cs" (hereinafter, a "sentence pattern for nouns"). In step ST21 of FIG. 14, by contrast, it calculates the certainty by treating the sentence pattern Φ(cs) as "extract the elements in a dependency relationship with the noun Φ in the case cs" (hereinafter, a "sentence pattern for verbs").

Once the sentence pattern certainty calculation unit 3 has calculated the certainty of a sentence pattern for nouns, the inter-term similarity calculation unit 4, as in Embodiment 1, uses that certainty to calculate the similarity between terms and stores it in the inter-term similarity DB 5 (step ST3 in FIG. 3).
Likewise, once the sentence pattern certainty calculation unit 3 has calculated the certainty of a sentence pattern for verbs, the inter-term similarity calculation unit 4 uses that certainty, as in Embodiment 1, to calculate the similarity between verbs and stores it in the inter-term similarity DB 5.

Once the inter-term similarity calculation unit 4 has stored the similarity between verbs in the inter-term similarity DB 5, the sentence pattern certainty calculation unit 3 corrects the certainty of the sentence pattern for nouns according to that similarity (step ST22 in FIG. 14).
In Embodiment 1 above, W(e_i, Φ(cs), cs'), the set of terms in corpus 1 that co-occur with the sentence pattern Φ(cs) in the case cs', is called the "word bag" that takes the case cs' under the sentence pattern (Φ, cs), and the distance between word bags is defined by the following equation (9).

However, when either W1 or W2 is the empty set, Dw(W1, W2) = 0.

Whereas Embodiment 1 defines N(w|W) as the frequency of the word w in the word bag W, Embodiment 3 obtains N(w|W) by the following equation (10).

Here, Σ is taken over all Φ_j other than Φ_i.
Further, sim(Φ_1, Φ_2) is the similarity between the terms Φ_1 and Φ_2 of the sentence patterns Φ_1(cs) and Φ_2(cs).

For example, suppose that, for each verb, the noun ユニット (unit) co-occurs in the case を and the noun 制御盤 (control panel) co-occurs with the case particle に.
Suppose also that the inter-term similarity calculation unit 4 has calculated a similarity of 0.9 between the verbs 固定する (fix) and 装着する (mount), and a similarity of 0.8 between the verbs 固定する (fix) and 割り付ける (allocate).
W1(e_i, Φ(cs), cs') = W(ユニット, 固定する(を), に)
W2(e_i, Φ(cs), cs') = W(ユニット, 装着する(を), に)
W3(e_i, Φ(cs), cs') = W(ユニット, 割り付ける(を), に)
In this case, N(制御盤 | 固定する) is calculated as follows.
N(制御盤 | 固定する) = frequency of the word 制御盤 in word bag W1
+ frequency of the word 制御盤 in word bag W2 × 0.9
+ frequency of the word 制御盤 in word bag W3 × 0.8
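The smoothed frequency in this example can be sketched as the raw count in the verb's own word bag plus similarity-weighted counts from the word bags of the other verbs. The function name `smoothed_frequency` and the raw counts are hypothetical; only the similarity values 0.9 and 0.8 come from the example above.

```python
from typing import Callable, Dict

WordBag = Dict[str, int]


def smoothed_frequency(
    word: str,
    verb: str,
    bags: Dict[str, WordBag],
    verb_sim: Callable[[str, str], float],
) -> float:
    """Own-bag frequency plus similarity-weighted frequencies from the other verbs' bags."""
    total = float(bags[verb].get(word, 0))
    for other_verb, bag in bags.items():
        if other_verb != verb:
            total += verb_sim(verb, other_verb) * bag.get(word, 0)
    return total


# Hypothetical raw counts of 制御盤 in the word bags W1, W2, and W3.
bags = {
    "固定する": {"制御盤": 2},
    "装着する": {"制御盤": 1},
    "割り付ける": {"制御盤": 1},
}
similarities = {("固定する", "装着する"): 0.9, ("固定する", "割り付ける"): 0.8}


def verb_sim(v1: str, v2: str) -> float:
    """Look up the similarity of two verbs, defaulting to 0.0 for unknown pairs."""
    return similarities.get((v1, v2), 0.0)


n_control_panel = smoothed_frequency("制御盤", "固定する", bags, verb_sim)
```

With these toy counts the result is 2 + 1 × 0.9 + 1 × 0.8, so a noun seen only rarely with 固定する still borrows evidence from similar verbs.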

In the above calculation of N(w|W), in order to remove noise, the addition of the second term, the frequency multiplied by the similarity sim(Φ_i, Φ_j), may be restricted to cases where sim(Φ_i, Φ_j) is equal to or greater than a predetermined threshold.
Also, when two verbs such as 積載する (load) and 積む (pile up) are related such that one has a narrow meaning (lower in the thesaurus hierarchy) and the other a broad meaning (higher in the thesaurus hierarchy), the calculation of the frequency of the word w in the word bag W may be restricted so that the frequency multiplied by the similarity of the narrower word is added only for the word with the broad meaning.
The rest is the same as in Embodiment 1, so its description is omitted.

As is clear from the above, according to Embodiment 3, the certainty of the sentence pattern for nouns is corrected according to the similarity between terms extracted by the term extraction unit 2, so that the accuracy of selecting the semantic category can be improved even further than in Embodiment 1.

A block diagram showing the term registration device according to Embodiment 1 of the present invention.
A block diagram showing the hardware resources of a computer that implements the term registration device according to Embodiment 1 of the present invention.
A flowchart showing the processing of the term registration device according to Embodiment 1 of the present invention.
An explanatory diagram showing an example of a corpus.
An explanatory diagram showing an example of sentence patterns.
An explanatory diagram showing an example of the term data extracted by the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
An explanatory diagram showing the relationship between terms and cases extracted by the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
An explanatory diagram showing an example of the thesaurus 9 of the term registration device according to Embodiment 1 of the present invention.
An explanatory diagram showing an example of a corpus.
An explanatory diagram showing an example of sentence patterns.
A flowchart showing the processing of the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
An explanatory diagram showing the relationships between terms extracted by the term extraction unit 2 of the term registration device according to Embodiment 1 of the present invention.
An explanatory diagram showing an example of a corpus.
A flowchart showing the processing of the sentence pattern certainty calculation unit 3 of the term registration device according to Embodiment 1 of the present invention.

Explanation of symbols

1 corpus, 2 term extraction unit (term extraction means), 3 sentence pattern certainty calculation unit (sentence pattern certainty calculating means), 4 inter-term similarity calculation unit (inter-term similarity calculating means), 5 inter-term similarity DB, 6 term/semantic category similarity calculation unit (term/semantic category similarity calculating means), 7 term/semantic category similarity DB, 8 term registration unit (term registration means), 9 thesaurus, 11 input device, 12 communication device, 13 control device, 14 CPU, 15 main storage device, 16 secondary storage device, 17 output device, 18 display device, 19 printing device.

Claims

1. A term registration device comprising: term extraction means for extracting terms from a corpus using a sentence pattern that expresses a constraint on the relationship between terms; sentence pattern certainty calculating means for calculating the certainty of the sentence pattern in consideration of the relationship between the semantic categories to which the plurality of terms constituting the sentence pattern belong; term/semantic category similarity calculating means for calculating, using the certainty of the sentence pattern calculated by the sentence pattern certainty calculating means, the similarity between an unregistered term extracted by the term extraction means and a plurality of semantic categories in a thesaurus; and term registration means for referring to the calculation result of the term/semantic category similarity calculating means, identifying the semantic category having the highest similarity to the unregistered term extracted by the term extraction means, and registering the unregistered term in that semantic category.

2. A term registration device comprising: term extraction means for extracting terms from a corpus using a sentence pattern that expresses a constraint on the relationship between terms; sentence pattern certainty calculating means for calculating the certainty of the sentence pattern in consideration of the relationship between the semantic categories to which the plurality of terms constituting the sentence pattern belong; inter-term similarity calculating means for calculating the similarity between terms using the certainty of the sentence pattern calculated by the sentence pattern certainty calculating means; term/semantic category similarity calculating means for calculating, using the similarity between terms calculated by the inter-term similarity calculating means, the similarity between an unregistered term extracted by the term extraction means and a plurality of semantic categories in a thesaurus; and term registration means for referring to the calculation result of the term/semantic category similarity calculating means, identifying the semantic category having the highest similarity to the unregistered term extracted by the term extraction means, and registering the unregistered term in that semantic category.

3. The term registration device according to claim 2, wherein the sentence pattern certainty calculating means calculates the certainty of the sentence pattern using the similarity between terms previously calculated by the inter-term similarity calculating means.

4. The term registration device according to claim 1, wherein the term extraction means extracts terms from the corpus using a sentence pattern that expresses a constraint on the dependency relationship between terms.

5. The term registration device according to any one of claims 1 to 3, wherein the term extraction means extracts terms from the corpus using a sentence pattern that expresses a constraint on the similarity relationship between terms.

6. The term registration device, wherein the sentence pattern certainty calculating means corrects the certainty of the sentence pattern according to the similarity between terms extracted by the term extraction means.