JP2794998B2

JP2794998B2 - Morphological analyzer and phrase dictionary generator

Info

Publication number: JP2794998B2
Application number: JP3228056A
Authority: JP
Inventors: 仁司大樫; 和弘小池; 敦史金枝上
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1991-09-09
Filing date: 1991-09-09
Publication date: 1998-09-10
Anticipated expiration: 2013-09-10
Also published as: JPH0567073A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a morphological analyzer that divides a Japanese sentence into a sequence of words, and to a phrase dictionary generator that creates the phrase (bunsetsu) dictionary used by that analyzer.

[0002]

[Prior Art] Methods for the morphological analysis of Japanese sentences fall broadly into two types: methods that use a grammar and methods that use heuristic rules.

[0003] Grammar-based methods have a high accuracy rate, to the extent that they are safe when an exhaustive search of all solutions is performed. However, they are costly to build and slow to execute. Here, "safe" means that the correct answer is always included among the multiple output solutions.

[0004] Heuristic methods have the advantages of low construction cost and high execution speed, but they are not complete; that is, their accuracy rate is low.

[0005] Heuristic methods include the minimum phrase count method, the longest match method, and the character-type segmentation method. In execution speed, the minimum phrase count method is the slowest, the longest match method is intermediate, and character-type segmentation is the fastest. Character-type segmentation is therefore the fastest of all morphological analysis methods. In accuracy, on the other hand, character-type segmentation is the worst, the longest match method is intermediate, and the minimum phrase count method is the best. Character-type segmentation is therefore also the least accurate of all morphological analysis methods.

[0006] The character-type segmentation method is described here. FIG. 31 shows the conventional morphological analyzer based on character-type segmentation presented in 長尾真, 辻井潤一, 山上明, 建部周二: 「国語辞書の記憶と日本語文の自動分割」 ("Storage of a Japanese-language dictionary and automatic segmentation of Japanese sentences"), 情報処理 (Information Processing), Vol. 19, No. 6, pp. 514-521 (1978). In the figure, 1 is an input unit, 2 is a Japanese sentence, 3 is a phrase division unit using character-type segmentation, 4 is a phrase string, 10 is a word division unit using dictionary lookup, 11 is a word dictionary, 12 is a word string, and 13 is an output unit.

[0007] Next, the operation is described. The Japanese sentence 2 created at the input unit 1 is input to the phrase division unit 3. The phrase division unit 3 divides the Japanese sentence 2 into the phrase string 4 according to the character-type heuristic. The word division unit 10 divides the phrase string 4 into the word string 12 by dictionary lookup. The output unit 13 then outputs the word string 12 to a storage device or a display device.

[0008] Next, the character-type heuristic used in the phrase division unit 3 is described. FIG. 32 shows the operation of the phrase division unit 3. Each character of the Japanese sentence is classified into one of three character types (punctuation, hiragana, and other), and the sentence is split before and after each punctuation mark and at each point where the character type changes from hiragana to another type. The result is output as the phrase string.
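The heuristic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of Nagao et al.: the function names `char_class` and `split_phrases`, the particular punctuation set, and the use of the Unicode hiragana block are all assumptions made for the example.

```python
def char_class(ch: str) -> str:
    """Classify a character as 'punct', 'hira', or 'other' (assumed classes)."""
    if ch in "、。，．":  # assumed punctuation set; the source does not list it
        return "punct"
    if "\u3041" <= ch <= "\u309f":  # Unicode hiragana block
        return "hira"
    return "other"


def split_phrases(text: str) -> list[str]:
    """Split before/after punctuation and where hiragana changes to another type."""
    phrases: list[str] = []
    current = ""
    for ch in text:
        if current:
            prev, cur = char_class(current[-1]), char_class(ch)
            if prev == "punct" or cur == "punct" or (prev == "hira" and cur == "other"):
                phrases.append(current)
                current = ""
        current += ch
    if current:
        phrases.append(current)
    return phrases


print("：".join(split_phrases("溝を深く掘り、書き換える。")))
# 溝を：深く：掘り：、：書き：換える：。
```

The last line reproduces the over-division that the rest of the document sets out to repair: 書き換える comes out as two phrases.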

[0009] Finally, the word division unit 10 is described. There are several methods of word division; the method of Nagao et al. first attempts to divide into words within the division points of the phrase string 4 produced by the phrase division unit 3. Only when no possible interpretation is obtained is the division of the phrase string 4 regarded as erroneous and phrase division performed again.

[0010]

[Problems to Be Solved by the Invention] Because the conventional device is configured as described above, the accuracy rate of the word string it outputs is low.

[0011] For example, FIG. 33 shows cases in which the phrase division unit 3 outputs an incorrect phrase string. In the figure, 「：」 is the phrase division symbol and marks the phrase division points produced by the phrase division unit 3. As these examples show, a word of the form "kanji + hiragana + kanji + hiragana" is over-divided by the phrase division unit 3 into two phrases. Moreover, if word division succeeds inside each "kanji + hiragana" part, the string is never analyzed as a single word. In the example phrase string 「書き：換える：」, 「書き：」 can be interpreted as the continuative form of the verb 「書く」 (to write) and 「換える：」 as the terminal or attributive form of the verb 「換える」 (to change), so the string is not interpreted as the single word 「書き換える」 (to rewrite).

[0012] Likewise, in the incorrectly divided example 「実験をくり：返す：」, 「実験をくり：」 can be interpreted as the noun 「実験」 (experiment) + the case particle 「を」 + the continuative form of the verb 「くる」, and 「返す：」 as the terminal or attributive form of the verb 「返す」 (to return), so the string is not interpreted as containing the single word 「くり返す」 (to repeat).

[0013] Note that in the last example, 「昨年度日本の：」, the phrase division error can be corrected even by the method of Nagao et al., giving the correct division 「昨年度：日本の：」.

[0014] The present invention was made to solve the above problems, and its object is to provide a morphological analyzer that is both fast in execution and high in accuracy.

[0015]

[Means for Solving the Problems] The morphological analyzer according to the first invention has the following elements:
(a) an input unit that inputs a Japanese sentence;
(b) a phrase division unit that divides the input Japanese sentence into a phrase string;
(c) a word dictionary that stores words;
(d) a phrase dictionary in which the words of the word dictionary are input and divided into phrase strings, words to be registered are selected from among the words so divided, and the selected words are stored as words that cause division errors in the phrase division unit;
(e) a phrase linking unit that uses the phrase dictionary to correct erroneous divisions made by the phrase division unit;
(f) a word processing unit having a word division unit that divides the phrase string into a word string and a word recognition unit that uses the word dictionary to correct erroneous divisions made by the word division unit; and
(g) an output unit that outputs the word string to a storage device or a display device.
In particular, the phrase linking unit makes it possible to correct the division errors of the phrase division unit.

[0016] The phrase dictionary generator according to the second invention makes it possible to create automatically, from the word dictionary, the phrase dictionary used by the phrase linking unit of the first invention. The phrase dictionary consists of those words of the word dictionary that the phrase division unit erroneously divides into two or more phrases, as extracted by a registration word selection unit. In experiments, the number of entries in the phrase dictionary was just under 3% of the number of entries in the word dictionary.

[0017]

[Operation] The phrase linking unit of the first invention compares the phrase string against the phrase dictionary as character strings and, when a string registered in the phrase dictionary is found, deletes the phrase division point. This removes the erroneous phrase division. Because the phrase dictionary has just under 3% as many entries as the word dictionary, which is very few, adding the phrase linking unit hardly reduces the execution speed of the morphological analyzer as a whole.

[0018] The phrase dictionary generator according to the second invention is a device that automatically creates the phrase dictionary from the word dictionary. It inputs each word of the word dictionary to the phrase division unit of the first invention, extracts the words that are themselves erroneously divided into phrases, and registers them in the phrase dictionary.
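The generation step can be illustrated with a short Python sketch. It assumes a character-type splitter equivalent to the phrase division unit; the names `split_phrases` and `build_phrase_dictionary` are hypothetical, the punctuation set is an assumption, and words are registered exactly as they appear, with no special handling of inflected stems.

```python
def is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u309f"  # Unicode hiragana block


def split_phrases(text: str) -> list[str]:
    """Character-type phrase division (assumed stand-in for the division unit)."""
    punct = "、。，．"  # assumed punctuation set
    phrases: list[str] = []
    current = ""
    for ch in text:
        if current:
            prev = current[-1]
            hira_to_other = is_hiragana(prev) and not is_hiragana(ch) and ch not in punct
            if prev in punct or ch in punct or hira_to_other:
                phrases.append(current)
                current = ""
        current += ch
    if current:
        phrases.append(current)
    return phrases


def build_phrase_dictionary(words: list[str]) -> list[str]:
    """Keep only words that the splitter breaks into two or more phrases."""
    entries = []
    for word in words:
        phrases = split_phrases(word)
        if len(phrases) >= 2:
            entries.append("：".join(phrases))  # store in phrase-string form
    return entries


print(build_phrase_dictionary(["半導体", "埋め込む", "埋込み金属", "突き抜ける"]))
# ['埋め：込む', '埋込み：金属', '突き：抜ける']
```

Only words that mix hiragana and other character types in the order "hiragana then non-hiragana" are selected, which is consistent with the phrase dictionary being a small fraction of the word dictionary.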

[0019]

[Embodiments]

Embodiment 1. An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the morphological analyzer according to the first invention. In the figure, 1 is an input unit, 2 is a Japanese sentence, 3 is a phrase division unit using character-type segmentation, 4 is the phrase string output by the phrase division unit, 5 is a phrase linking unit, 6 is a phrase dictionary, 7 is the phrase string output by the phrase linking unit, 8 is a word division unit using character-type segmentation, 9 is the word string output by the word division unit, 10 is a word recognition unit using dictionary lookup, 11 is a word dictionary, 12 is the word string output by the word recognition unit, and 13 is an output unit.

[0020] Next, the operation of the morphological analyzer according to the first invention is outlined. The Japanese sentence 2 created at the input unit 1 is input to the phrase division unit 3. The phrase division unit 3 divides the Japanese sentence 2 into the phrase string 4 according to the character-type heuristic. The phrase linking unit 5 uses the phrase dictionary 6 to delete the erroneous division points of the phrase string 4 and outputs the phrase string 7. The word division unit 8 divides the phrase string 7 into the word string 9 according to the character-type heuristic. The word recognition unit 10 uses the word dictionary 11 to correct the erroneous division points of the word string 9 and outputs the word string 12. The output unit 13 then outputs the word string 12 to a storage device or a display device.

[0021] Next, the algorithm of this morphological analyzer is described in detail, unit by unit, for units 3, 5, 8, and 10.

[0022] The morphological analysis algorithm divides text into a word string by performing the following four stages of processing in order, corresponding to units 3, 5, 8, and 10. It also assigns parts of speech and, for inflected words, attaches the attributive form (= standard form). Stages (1) and (3) below are driven by heuristics, whereas stages (2) and (4) use dictionaries to correct the exceptions of (1) and (3), respectively.

[0023] (1) Phrase division unit (by heuristics)
The text is divided into phrases using heuristics. The rules are as follows.
[1] Split where hiragana changes to kanji.
[2] Split where hiragana changes to katakana.
[3] Split where hiragana changes to alphanumeric characters.
[4] Split where hiragana changes to a symbol.
[5] Split before and after punctuation marks.
An example of the result of applying these rules is shown below; 「：」 marks a phrase division point. Most phrases are divided correctly, but the part 「埋め：込むことにより：」 has wrongly become two phrases. Heuristics based on character types inevitably produce wrong results of this kind in some cases. This is the incompleteness of the heuristics, and it has been the greatest difficulty of Japanese morphological analysis algorithms that use character-type heuristics.
「基板と：反対導電形の：半導体層の：一部の：第２の：溝を：第１の：溝より：深く：掘り：、：第２の：溝の：側壁に：絶縁膜を：成長させ：、：第２の：溝に：半導体層を：成長させて：埋め：込むことにより：集積化の：向上を：図る：。：」
The heuristics do, however, have two useful properties: they never fail on any text, and the original text can be reproduced by deleting the phrase division points 「：」.

[0024] (2) Phrase linking unit (by phrase dictionary)
The phrase dictionary is used to correct the exceptions to the phrase division of (1). In phrase division, for example, a word that contains both hiragana and kanji is cut in the middle of the word as shown below. Such words are held in the phrase dictionary, and the over-division is undone.
埋込み：層
埋込み：金属
埋め：込む
歩留り：性能
突き：抜ける
読み：出す
An example of the result of phrase linking is shown below. The part 「埋め込むことにより：」, which was two phrases in (1), has been correctly restored to one phrase. This resolves the over-division problem and guarantees that the present morphological analysis algorithm is complete with respect to phrase division.
「基板と：反対導電形の：半導体層の：一部の：第２の：溝を：第１の：溝より：深く：掘り：、：第２の：溝の：側壁に：絶縁膜を：成長させ：、：第２の：溝に：半導体層を：成長させて：埋め込むことにより：集積化の：向上を：図る：。：」
Although it has no bearing on completeness, over-division may remain in some cases, namely when the phrase dictionary is incomplete. However, the phrase dictionary is data, and it is customary to disregard the completeness of data when discussing the completeness of an algorithm. This problem therefore does not detract in any way from the completeness of the present Japanese morphological analysis algorithm. Also unrelated to completeness, the problem of under-division remains: a string that should be two or more phrases is analyzed as a single phrase. This is handled by the subsequent processing. Completeness is therefore guaranteed even though under-division can occur, because completeness is defined as the absence of unrecoverable errors caused by the algorithm. This processing, too, never fails on any text, and the original text can be reproduced by deleting the phrase division points 「：」.

[0025] (3) Word division unit (by heuristics)
Using heuristics alone, the phrases are further divided into words. The rules are as follows.
[1] Split where kanji changes to hiragana.
[2] Split where katakana changes to hiragana.
[3] Split where alphanumeric characters change to hiragana.
[4] Split where a symbol changes to hiragana.
An example of the result of applying these rules is shown below. 「／」 is the word division symbol and marks a word division point. Most words are divided correctly, but the part 「埋／め込／むことにより：」 has wrongly become three words. Heuristics based on character types inevitably produce wrong results of this kind in some cases. As in (1), this is the incompleteness of the heuristics, and it has been a difficulty of Japanese morphological analysis algorithms that use character-type heuristics.
「基板／と：反対導電形／の：半導体層／の：一部／の：第２／の：溝／を：第１／の：溝／より：深／く：掘／り：、：第２／の：溝／の：側壁／に：絶縁膜／を：成長／させ：、：第２／の：溝／に：半導体層／を：成長／させて：埋／め込／むことにより：集積化／の：向上／を：図る：。：」
This processing, however, never fails on any text, and the original text can be reproduced by deleting the phrase division points 「：」 and the word division points 「／」.
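The four word division rules all reduce to "split where a non-hiragana character is followed by hiragana". A minimal Python sketch under that reading follows; the function names are hypothetical and the hiragana test uses the Unicode hiragana block as an assumption.

```python
def is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u309f"  # Unicode hiragana block


def split_words(phrase: str) -> list[str]:
    """Split one phrase wherever a non-hiragana character is followed by hiragana."""
    words: list[str] = []
    current = ""
    for ch in phrase:
        if current and not is_hiragana(current[-1]) and is_hiragana(ch):
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words


for phrase in ["基板と", "埋め込むことにより"]:
    print("／".join(split_words(phrase)))
# 基板／と
# 埋／め込／むことにより
```

The second line shows the three-way over-division of 埋め込むことにより that the word recognition stage is meant to repair. Joining the pieces without the separator always gives back the input, which matches the reproducibility property stated in the text.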

[0026] (4) Word recognition unit (by word dictionary)
The word dictionary is used to check the result of the word division of (3) and to correct errors. Parts of speech are also assigned: words registered in the word dictionary are tagged with （part of speech） and unknown words with ＜part of speech＞. For inflected words, the attributive form (= standard form) is additionally attached as ［attributive form］. The result is shown below, one word per line for readability. The part marked with ＊, which was three words in (3), has been correctly restored to two words. This resolves the over-division problem and guarantees that the present Japanese morphological analysis algorithm is complete with respect to word division. (Tags: 名詞 noun, 付属 adjunct word, 形容 adjective, 動詞 verb, 読点 comma, 句点 full stop.)
（名詞）基板／
＜付属＞と：
（名詞）反対導電形／
＜付属＞の：
（名詞）半導体層／
＜付属＞の：
（名詞）一部／
＜付属＞の：
（名詞）第２／
＜付属＞の：
（名詞）溝／
＜付属＞を：
（名詞）第１／
＜付属＞の：
（名詞）溝／
＜付属＞より：
（形容）深く〔深い〕：
（動詞）掘り〔掘る〕：
（読点）、：
（名詞）第２／
＜付属＞の：
（名詞）溝／
＜付属＞の：
（名詞）側壁／
＜付属＞に：
（名詞）絶縁膜／
＜付属＞を：
（動詞）成長さ〔成長する〕／
＜付属＞せ：
（読点）、：
（名詞）第２／
＜付属＞の：
（名詞）溝／
＜付属＞に：
（名詞）半導体層／
＜付属＞を：
（動詞）成長さ〔成長する〕／
＜付属＞せて：
＊（動詞）埋め込む〔埋め込む〕／
＊＜付属＞ことにより：
（名詞）集積化／
＜付属＞の：
（名詞）向上／
＜付属＞を：
（動詞）図る〔図る〕：
（句点）。：
Although it has no bearing on completeness, over-division may remain: a string that should be one word is analyzed as two or more words. This depends on the completeness of the word dictionary. However, the word dictionary is data, and it is customary to disregard the completeness of data when discussing the completeness of an algorithm. This problem therefore does not detract in any way from the completeness of the present Japanese morphological analysis algorithm. Under-division may likewise remain, but it does not detract from the completeness of the algorithm either. This processing, too, never fails on any text, and the original text can be reproduced by deleting the phrase division points 「：」, the word division points 「／」, the （part of speech） and ＜part of speech＞ tags, and the ［attributive form］ annotations.

[0027] If no word is found at all, the part before the word division point is output as an unknown independent word and the part after it as an adjunct word. Unknown independent words are then subjected to the following unknown word processing, which includes inference of part of speech and inflection.

[0028] Unknown word processing (by heuristics)
Unknown word processing infers the part of speech, inflection, and attributive form of a word that is not registered in the word dictionary. At present 181 rules are envisaged, but they do not cover every case. Details of the unknown word inference rules are shown in FIGS. 2 to 6. In the figures, 「漢漢」, 「漢」, and 「：」 have the following meanings.
漢漢 = one or more non-hiragana characters
漢 = one non-hiragana character
： = phrase division symbol
The "inference condition" and "inference result" in the figures work as follows: when there is a word of the form 「漢漢させ」, this is taken as the inference condition and the inference result 「漢漢する」 is derived. For example, the word 「実行させ」 is inferred to be the word 「実行する」 (to execute).

[0029] Next, the phrase dictionary 6 and the word dictionary 11 used for morphological analysis are described.
(1) Phrase dictionary
This is the dictionary used for phrase linking. The format is one word per line. Because the phrase dictionary can be created automatically from the word dictionary by a program, based on the second invention described later, no development work is required. Examples follow.
埋込み：層
埋込み：金属
埋め：込む
歩留り：性能
突き：抜ける
読み：出す

[0030] (2) Word dictionary
This is the dictionary used for word division. The format is one word per line. For inflected words, the stem and the ending are separated by a single full-width space. Examples follow.
半導体
半導体ウエーハ
半導体ウエハ
半導体ウエハー
安価な
加える
加わる
強い
極めて（副詞）
しきい値
しきい値電圧
に関する（助詞）
突き抜ける
埋め込む
埋込み金属
In conventional word dictionaries, various kinds of additional information, such as part of speech, inflection, forward connection information, and backward connection information, had to be attached to every word in addition to the headword. The present word dictionary has almost no such additional information and is in a format that makes word registration easy for users. That is, the word dictionary consists of headwords and a small amount of information, and it is very small: for example, 113 KB per 10,000 words. Nouns are registered as the headword alone, and all of them are assumed to take sa-row irregular (する) conjugation. Verbs, adjectives, and adjectival verbs are registered in their attributive form, with the stem and the ending written separately. For other parts of speech, 「（part of speech）」 is written after the headword. Because so little needs to be written, the dictionary is easy to create even for someone who is not a natural language processing specialist.

[0031] Next, the specification for registering words in the word dictionary is described. The general rules are as follows.
・One word per line. Do not register two or more words on one line, and do not spread one word over two or more lines.
・Use full-width characters only. Half-width characters must not be used.
・Use only full-width spaces for spaces. Two half-width spaces must not be substituted for a full-width space.
The registration specification for each part of speech is given in FIGS. 7 to 16.
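The general registration rules lend themselves to a small line validator and parser. The sketch below is an assumption-laden illustration: `parse_entry` is a hypothetical name, full-width characters are detected with `unicodedata.east_asian_width` (classes `W` and `F`), and the sample entry 「強　い」 with its stem and ending split is made up for the example, since the per-part-of-speech rules are only given in FIGS. 7 to 16.

```python
from unicodedata import east_asian_width

FULL_WIDTH_SPACE = "\u3000"


def parse_entry(line: str) -> dict:
    """Parse one dictionary line into stem, ending, and optional part of speech."""
    # Rule: full-width characters only (including the space and the parentheses).
    if not line or any(east_asian_width(ch) not in ("W", "F") for ch in line):
        raise ValueError(f"entry must contain only full-width characters: {line!r}")
    head, pos = line, None
    # Optional trailing 「（品詞）」 annotation in full-width parentheses.
    if line.endswith("）") and "（" in line:
        head, _, tag = line.partition("（")
        pos = tag[:-1]
    # Inflected words: stem and ending separated by one full-width space.
    stem, _, ending = head.partition(FULL_WIDTH_SPACE)
    return {"stem": stem, "ending": ending, "pos": pos}


print(parse_entry("極めて（副詞）"))
# {'stem': '極めて', 'ending': '', 'pos': '副詞'}
print(parse_entry("強\u3000い"))
# {'stem': '強', 'ending': 'い', 'pos': None}
```

A line containing a half-width space or ASCII letters raises `ValueError`, which mirrors the rule that half-width characters must not be used.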

[0032] Embodiment 2. Embodiment 1 above described the units 3, 5, 8, and 10 and the dictionaries 6 and 11. Next, the characteristic features of the present invention are described with examples.

[0033] The phrase dictionary 6 is described with reference to FIG. 17, which shows an example of the contents of the phrase dictionary. Among the words registered in the word dictionary, those that the phrase division unit 3 erroneously divides into two or more phrases are registered in phrase-string form. In the example, 「”溶け：込”，」 represents the stem of the verb 「溶け込む」 (to blend in).

[0034] Next, the phrase linking unit is described. FIG. 18 shows the operation of the phrase linking unit 5. In the phrase string 4, 「誤った：文節分割点を：書き：換えて：、：正しい：文節列を：作る：」 ("rewrite the wrong phrase division points and make a correct phrase string"), the part 「：書き：換えて：」 should correctly be 「：書き換えて：」 but has been over-divided. Processing begins at step (1) by finding the first phrase division point in the phrase string; in the example, the division point of 「誤った：」 is found. Next, step (2) takes the first word from the phrase dictionary 6, and step (3) compares strings to check whether the phrase string matches that word. This is done for every word in the phrase dictionary 6, and none matches, so this division point is found not to be an erroneous division. Step (7) then finds the next phrase division point; in the example, the division point of 「文節分割点を：」 is found. This does not match any word in the phrase dictionary 6 either, so it too is found not to be an erroneous division. Step (7) then finds the next phrase division point; in the example, the division point of 「書き：」 is found. When this is compared in turn against the words of the phrase dictionary 6, it matches 「”書き：換え”，」. The string comparison aligns the phrase division point in the phrase string 4 with the first phrase division point of the dictionary word and tests for a substring match over the length of the word. Since 「書き：換え」 matches, this phrase division point is deleted. Processing is then repeated for the remaining phrase division points in order, but no further erroneous divisions are found and processing ends. The output phrase string 7 is the correct phrase string 「誤った：文節分割点を：書き換えて：、：正しい：文節列を：作る：」.
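The aligned substring comparison can be sketched as follows. This is an illustration under assumptions, not the procedure of FIG. 18 itself: `link_phrases` is a hypothetical name, the phrase string is represented as plain text containing 「：」 markers, and when an entry contains several division points the sketch removes all of them at once.

```python
SEP = "："  # phrase division symbol


def link_phrases(phrase_string: str, phrase_dict: list[str]) -> str:
    """Delete division points that match a phrase dictionary entry."""
    pos = phrase_string.find(SEP)
    while pos != -1:
        for entry in phrase_dict:
            offset = entry.find(SEP)  # first division point inside the entry
            if offset == -1:
                continue
            start = pos - offset  # align the two division points
            if start >= 0 and phrase_string.startswith(entry, start):
                joined = entry.replace(SEP, "")
                phrase_string = (
                    phrase_string[:start] + joined + phrase_string[start + len(entry):]
                )
                pos = start + len(joined) - 1  # resume after the linked text
                break
        pos = phrase_string.find(SEP, pos + 1)
    return phrase_string


text = "誤った：文節分割点を：書き：換えて：、：正しい：文節列を：作る："
print(link_phrases(text, ["書き：換え", "溶け：込"]))
# 誤った：文節分割点を：書き換えて：、：正しい：文節列を：作る：
```

The loop scans the dictionary linearly at every division point, as the description does; with a dictionary that is only a few percent of the word dictionary this stays cheap, though a hash lookup keyed on the text around the division point would be an obvious alternative.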

[0035] Next, the word division unit based on the character-type cutting method is described. FIG. 19 shows the operation of the word division unit 8. Each character of the phrase string 7 is classified into one of three character types (hiragana, phrase division point, or other), and the string is divided wherever the type changes from "other" to hiragana. The result is output as the word string 9. Phrase division points are left as they are.
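The rule above is simple enough to express directly. The following Python sketch assumes that hiragana can be detected by the Unicode range U+3041 to U+309F, that the phrase division point is written 「：」, and that a word division point is written 「／」; the function names are hypothetical.

```python
PHRASE_SEP = "："  # phrase division point (assumed representation)
WORD_SEP = "／"    # word division point (assumed representation)


def char_type(ch: str) -> str:
    """Classify a character as 'hiragana', 'boundary', or 'other'."""
    if ch == PHRASE_SEP:
        return "boundary"
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    return "other"


def divide_words(phrase_string: str) -> str:
    """Insert a word division point at each change from 'other' to hiragana."""
    out = []
    prev = None
    for ch in phrase_string:
        kind = char_type(ch)
        if prev == "other" and kind == "hiragana":
            out.append(WORD_SEP)
        out.append(ch)   # phrase division points are copied through unchanged
        prev = kind
    return "".join(out)


if __name__ == "__main__":
    print(divide_words("埋め込ん"))
```

Applied to 「埋め込ん」 this yields three pieces, which is the over-division that the dictionary lookup stage later repairs.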

[0036] Next, the word recognition unit based on dictionary lookup is described. Several implementations are conceivable for this part; this embodiment describes one that uses the longest match method. FIG. 20 shows the operation of the word recognition unit 10. Character strings are taken from the front of the word string 9 one phrase at a time, and words are determined by the longest match method. When the unanalyzed remainder consists only of hiragana, it is output as an adjunct word. Division by longest match takes priority over the word division points, but if no word is found at all, the part before the word division point is output as an unknown independent word and the part after it as an adjunct word. When a dictionary-registered word that inflects is found, inflection-ending processing is performed. Unknown independent words undergo unknown-word processing with inference of part of speech and inflection. The result is the word string 12, correctly divided into words and annotated with part-of-speech and inflection information.
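A simplified sketch of this recognition step for a single phrase is shown below. It is an illustration under stated assumptions, not the embodiment itself: the dictionary is a plain Python set, inflection-ending processing and part-of-speech inference are omitted, the labels "known", "unknown" and "adjunct" are invented for the example, and the dictionary is consulted before the hiragana-only check.

```python
WORD_SEP = "／"  # word division point (assumed representation)


def is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u309f"


def recognize_phrase(phrase: str, word_dictionary: set[str]) -> list[tuple[str, str]]:
    """Longest-match word recognition for one phrase (simplified)."""
    # Remember where the word division points fall in the unmarked text.
    cuts, n = [], 0
    for ch in phrase:
        if ch == WORD_SEP:
            cuts.append(n)
        else:
            n += 1
    plain = phrase.replace(WORD_SEP, "")

    result, pos = [], 0
    while pos < len(plain):
        # Longest match takes priority over the word division points.
        for end in range(len(plain), pos, -1):
            if plain[pos:end] in word_dictionary:
                result.append((plain[pos:end], "known"))
                pos = end
                break
        else:
            rest = plain[pos:]
            if all(is_hiragana(c) for c in rest):
                result.append((rest, "adjunct"))
            else:
                # No word found: fall back on the next word division point.
                cut = next((c for c in cuts if c > pos), len(plain))
                result.append((plain[pos:cut], "unknown"))
                if cut < len(plain):
                    result.append((plain[cut:], "adjunct"))
            break
    return result


if __name__ == "__main__":
    print(recognize_phrase("しきい値電圧／の", {"しきい値", "電圧"}))
```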

[0037] Embodiment 3. The word division unit 8 and the word recognition unit 10 may be merged into a single word processing unit. In that case the accuracy rate and the execution speed are essentially unchanged.

[0038] Embodiment 4. In Embodiment 1 the word recognition unit 10 uses the longest match method, but a grammar-based method may be used instead. In that case the accuracy rate improves further, but execution becomes slower.

[0039] Embodiment 5. The word recognition unit 10 may also apply a grammar-based method, as in Nagao et al., only to the processing of hiragana strings. In that case the accuracy rate for adjunct word strings improves, but execution becomes slightly slower.

[0040] Embodiment 6. The word recognition unit 10 may also take the properties of kanji strings into account when processing them, as in Nagao et al. In that case the accuracy rate improves, particularly for kanji strings that are unknown words.

[0041] Embodiment 7. Next, the phrase dictionary creation device according to the second invention is described. FIG. 21 shows the phrase dictionary creation device. In the figure, 11 is the word dictionary, 14 is a word of the word dictionary, 3 is the phrase division unit based on the character-type cutting method, 4 is the phrase string obtained by dividing a word into phrases, 15 is the registered word selection unit, 16 is the phrase string of a word to be registered, and 6 is the phrase dictionary.

[0042] Next, the operation is described. Each word 14 is taken from the word dictionary 11 and converted into a phrase string 4 by the phrase division unit 3. From these, the registered word selection unit 15 selects the phrase strings 16 of words that end up divided in the middle of the word, and registers them in the phrase dictionary 6. FIG. 22 shows a concrete example of a phrase dictionary created from a word dictionary. For instance, the word 「しきい値」 ("threshold value") is divided by the phrase division unit 3 into the phrase string 「しきい：値」, selected by the registered word selection unit 15, and registered in the phrase dictionary 6. Although not shown in the figure, words such as 「基板」 ("substrate"), 「机」 ("desk"), and 「名前」 ("name") remain 「基板」, 「机」, and 「名前」 after passing through the phrase division unit 3. Words whose phrase string is identical to the word in the word dictionary are not selected by the registered word selection unit 15 and are not registered in the phrase dictionary 6.
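The selection step can be sketched as below. The splitter here is a deliberately simplified stand-in for the phrase division unit 3: it assumes that a division point is placed wherever a hiragana character is followed by a non-hiragana character, which reproduces the 「しきい：値」 example but is not claimed to be the full rule. All function names are hypothetical.

```python
SEP = "："  # phrase division point (assumed representation)


def is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u309f"


def split_phrases(text: str) -> str:
    """Simplified character-type cutting: cut after hiragana before non-hiragana."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else None
        if nxt is not None and is_hiragana(ch) and not is_hiragana(nxt):
            out.append(SEP)
    return "".join(out)


def build_phrase_dictionary(words: list[str]) -> list[str]:
    """Register only the words that the splitter cuts in the middle."""
    phrase_dictionary = []
    for word in words:
        phrase_string = split_phrases(word)
        if SEP in phrase_string:      # the word was divided internally
            phrase_dictionary.append(phrase_string)
    return phrase_dictionary


if __name__ == "__main__":
    print(build_phrase_dictionary(["しきい値", "基板", "机", "名前"]))
```

Because words that survive the splitter intact are skipped, the phrase dictionary stays much smaller than the word dictionary it is derived from.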

[0043] Embodiment 8. In the embodiments above the phrase dictionary 6 registers phrase strings only, but the distinction between independent words and adjunct words may be registered alongside each phrase string. The following embodiment registers that distinction as a flag, 0 for an independent word and 1 for an adjunct word, and makes use of it. As shown in FIG. 23, the phrase dictionary in this embodiment attaches the independent/adjunct flag (0/1) to each entry. The word dictionary records the part of speech for every word whose part of speech is something other than noun, verb, adjective, or adjectival verb, so it can be determined whether a word is an independent word or an adjunct word. For example, in FIG. 23 the word 「に関する」 ("concerning") is recorded as a particle, so it is judged to be an adjunct word and flag 1 is set. The other words are independent words such as nouns and verbs, so flag 0 is set.

[0044] Next, the use of the phrase dictionary 6 carrying the independent/adjunct flag (0/1) is described. Suppose the phrase division unit 3 produces the result shown in FIG. 24. The phrase connection unit 5 corrects these division errors using the phrase dictionary 6. When an erroneous division is found and the word is an independent word, the division points inside the word are deleted and, if there is no division point before the word, one is added. For an adjunct word, only the division points inside the word are deleted. FIG. 25 shows an example of the phrase recognition result. 「埋め：込ん」, which was two phrases in FIG. 24, is correctly restored to one phrase. 「しきい値：電圧の」 is likewise corrected to a proper phrase division; because FIG. 23 marks it as an independent word (flag = 0) and no division point precedes the word, one is added, giving 「：しきい値電圧の」. 「変動防止」 is left as it is. In this way phrase recognition corrects 100% of the over-divisions and misplaced divisions of known words, which raises the accuracy rate, and for independent words it forcibly inserts a phrase division point before the word, which remedies under-division.
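The flag-dependent behaviour can be sketched as an extension of the matching loop. The entries below are hypothetical reconstructions modelled on the words mentioned for FIG. 23 (the figure itself is not reproduced in the text), and the function and constant names are assumptions.

```python
SEP = "："        # phrase division point (assumed representation)
INDEPENDENT = 0   # flag values as described for FIG. 23
ADJUNCT = 1


def connect_with_flags(text: str, phrase_dictionary: list[tuple[str, int]]) -> str:
    """Delete division points inside matched entries; for independent words,
    also make sure a division point precedes the word."""
    pos = text.find(SEP)
    while pos != -1:
        for entry, flag in phrase_dictionary:
            k = entry.find(SEP)
            start = pos - k
            if k > 0 and start >= 0 and text.startswith(entry, start):
                merged = entry.replace(SEP, "")
                prefix = text[:start]
                if flag == INDEPENDENT and not prefix.endswith(SEP):
                    prefix += SEP          # forced division before the word
                text = prefix + merged + text[start + len(entry):]
                pos = len(prefix) + len(merged) - 1
                break
        pos = text.find(SEP, pos + 1)
    return text


if __name__ == "__main__":
    dictionary = [("しきい：値", INDEPENDENT), ("に：関する", ADJUNCT)]
    print(connect_with_flags("このしきい：値", dictionary))
    print(connect_with_flags("電圧に：関する", dictionary))
```

In this sketch an independent word at the very start of the input also receives a leading division point, which mirrors the 「：しきい値電圧の」 form quoted in the description.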

[0045] FIGS. 26 and 27 show the results of the word division unit 8 and the word recognition unit 10 for the example of FIGS. 24 and 25. Word recognition is performed by searching the word dictionary with the longest match method within each phrase. The word dictionary is loaded into main memory in advance, and a binary search is performed in main memory. Adjunct words, however, are not subjected to word recognition, because they are unlikely to be important keywords. Next, parts of speech are assigned. Words registered in the word dictionary are given "(part of speech)", and for inflecting words the attributive form (= canonical form) is appended in the format "[attributive form]". Unknown words are given one of <noun>, <verb>, <adjective>, <adjectival verb>, or <adjunct>. In FIG. 27, 「埋／め込／ん」, which was three words in FIG. 26, correctly becomes one word. 「変動防止」 is recognized as two words.
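As an illustration of an in-memory binary search combined with longest match, the following Python sketch uses the standard `bisect` module on a sorted list. The word list and function names are assumptions made for the example; the experiment's actual data structures are not described in the text.

```python
from bisect import bisect_left


def contains(sorted_words: list[str], key: str) -> bool:
    """Binary search for `key` in a sorted word list."""
    i = bisect_left(sorted_words, key)
    return i < len(sorted_words) and sorted_words[i] == key


def longest_match(sorted_words: list[str], text: str, start: int = 0):
    """Return the longest dictionary word beginning at `start`, or None."""
    for end in range(len(text), start, -1):
        candidate = text[start:end]
        if contains(sorted_words, candidate):
            return candidate
    return None


if __name__ == "__main__":
    words = sorted(["変動", "防止", "電圧", "しきい値"])
    first = longest_match(words, "変動防止")
    second = longest_match(words, "変動防止", len(first))
    print(first, second)
```

Each lookup costs a number of comparisons logarithmic in the dictionary size, which is consistent with the scaling reported in the experimental results.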

[0046] Embodiment 9. In FIG. 18, when the phrase connection unit searches the phrase strings of the phrase dictionary, it uses an algorithm that loops over (3) to (6) and compares the phrase string with the words one by one. With this method it is desirable, when creating the phrase dictionary, to register the phrase strings sorted in the reverse of the character code order used by the word dictionary, as shown in FIG. 22. For example, consider a phrase dictionary generated with the phrase strings sorted in ascending character code order, the same as the word dictionary:

埋め：込み
埋め：込み：金属
埋め：込み：構造
埋め：込み：領域

If the word 「埋め込み構造」 ("embedded structure") is searched for in this phrase dictionary, it matches 「埋め込み」 on the first line even though the phrase string for 「埋め込み構造」 is present on the third line, and the result is 「埋め込み：構造」. Now consider the same entries in reverse character code order:

埋め：込み：領域
埋め：込み：構造
埋め：込み：金属
埋め：込み

In this case a search for 「埋め込み構造」 matches on the second line, and the word is correctly recognized as a single phrase string.

[0047] Embodiment 10. Embodiment 9 describes a phrase dictionary in which the phrase strings are registered in reverse character code order, but a dictionary registered in ascending order may instead be searched in reverse order.

[0048] Experimental results. The program was implemented on a workstation and experiments were carried out. FIG. 28 shows the hardware and software configuration, FIG. 29 the dictionary load time, and FIG. 30 the morphological analysis time. As FIG. 29 shows, the dictionary load time increases linearly with the number of words in the word dictionary. The CPU time is about 5 seconds for 80,000 words, a negligible share of the total when processing large volumes of documents. As FIG. 30 shows, the morphological analysis time is proportional to the logarithm of the number of words in the word dictionary, because a binary search is used. Analyzing 100,000 characters with an 80,000-word dictionary takes just over 12 seconds, which is extremely fast.

[0049] The embodiments above describe a method in which the character-type cutting method is split into separate division stages (phrase division, followed by word division in the embodiments) and the result of each stage is validated against a dictionary. We consider that a practical technique has been obtained with a comparatively simple scheme. Its features can be summarized in the following five points.
(1) High speed
(2) Phrase division errors are corrected with a small dictionary
(3) Success rate of 100% (a solution is always output)
(4) The word dictionary is easy for users to create
(5) Dedicated to Japanese text written in mixed kana and kanji (it is based on the character-type cutting method)

【００５０】[0050]

[Effects of the Invention] As described above, according to the first invention, a morphological analyzer with both high execution speed and a high accuracy rate can be obtained. According to the second invention, the phrase dictionary used by that analyzer can be created automatically by the phrase dictionary creation device.

[Brief description of the drawings]

FIG. 1 is a diagram showing the morphological analyzer according to the first invention.

FIG. 2 is a diagram showing inference rules for unknown word processing.

FIG. 3 is a diagram showing inference rules for unknown word processing.

FIG. 4 is a diagram showing inference rules for unknown word processing.

FIG. 5 is a diagram showing inference rules for unknown word processing.

FIG. 6 is a diagram showing inference rules for unknown word processing.

FIG. 7 is a diagram showing the registration specification for the word dictionary.

FIG. 8 is a diagram showing the registration specification for the word dictionary.

FIG. 9 is a diagram showing the registration specification for the word dictionary.

FIG. 10 is a diagram showing the registration specification for the word dictionary.

FIG. 11 is a diagram showing the registration specification for the word dictionary.

FIG. 12 is a diagram showing the registration specification for the word dictionary.

FIG. 13 is a diagram showing the registration specification for the word dictionary.

FIG. 14 is a diagram showing the registration specification for the word dictionary.

FIG. 15 is a diagram showing the registration specification for the word dictionary.

FIG. 16 is a diagram showing the registration specification for the word dictionary.

FIG. 17 is a diagram showing an example of the contents of the phrase dictionary according to the first invention.

FIG. 18 is a diagram showing the operation of the phrase connection unit according to the first invention.

FIG. 19 is a diagram showing the operation of the word division unit according to the first invention.

FIG. 20 is a diagram showing the operation of the word recognition unit according to the first invention.

FIG. 21 is a diagram showing the phrase dictionary creation device according to the second invention.

FIG. 22 is a diagram showing another creation device for the phrase dictionary according to the second invention.

FIG. 23 is a diagram showing another creation device for the phrase dictionary according to the second invention.

FIG. 24 is a diagram showing an example of a phrase string from the phrase division unit according to the first invention.

FIG. 25 is a diagram showing another example of the operation of the phrase connection unit according to the first invention.

FIG. 26 is a diagram showing another example of the operation of the word division unit according to the first invention.

FIG. 27 is a diagram showing another example of the operation of the word recognition unit according to the first invention.

FIG. 28 is a diagram showing the configuration of the experiment.

FIG. 29 is a diagram showing experimental results.

FIG. 30 is a diagram showing experimental results.

FIG. 31 is a diagram showing a morphological analyzer using the conventional character-type cutting method.

FIG. 32 is a diagram showing the operation of a phrase division unit using the conventional character-type cutting method.

FIG. 33 is a diagram showing an example of a malfunction of a phrase division unit using the conventional character-type cutting method.

[Explanation of symbols]

1 Input unit
2 Japanese sentence
3 Phrase division unit based on the character-type cutting method
4 Phrase string output by the phrase division unit
5 Phrase connection unit
6 Phrase dictionary
7 Phrase string output by the phrase connection unit
8 Word division unit based on the character-type cutting method (an example of one part of the word processing unit)
9 Word string output by the word division unit based on the character-type cutting method
10 Word recognition unit based on dictionary lookup (an example of one part of the word processing unit)
11 Word dictionary
12 Word string output after dictionary lookup
13 Output unit
14 Word registered in the word dictionary
15 Registered word selection unit that selects the words to be registered in the phrase dictionary
16 Phrase string of a word to be registered in the phrase dictionary

Continuation of the front page. (56) References: JP-A-1-295369 (JP, A); JP-A-2-28873 (JP, A). (58) Fields searched (Int. Cl.⁶, DB name): G06F 17/20 - 17/28; JICST file (JOIS).

Claims

(57) [Claims]

1. A morphological analyzer having the following elements: (a) an input unit for inputting a Japanese sentence; (b) a phrase division unit for dividing the input Japanese sentence into a phrase string; (c) a word dictionary for storing words; (d) a phrase dictionary in which the words of the word dictionary are input and divided into phrase strings, the words to be registered are selected from among those words, and the selected words are stored as words that cause a division error in the phrase division unit; (e) a phrase connection unit for correcting erroneous divisions made by the phrase division unit, using the phrase dictionary; (f) a word processing unit for dividing the phrase string into a word string using the word dictionary; and (g) an output unit for outputting the word string to a storage device or a display device.

2. A phrase dictionary creation device that has the following elements and automatically creates a phrase dictionary from a word dictionary: (a) a phrase division unit that receives the words of the word dictionary and divides them into phrase strings; and (b) a registered word selection unit that selects, from among the divided words, the words to be registered in the phrase dictionary and stores them in the phrase dictionary as words that cause a division error in the phrase division unit.