JP2000040085A

JP2000040085A - Method and device for post-processing for Japanese morpheme analytic processing

Info

Publication number: JP2000040085A
Application number: JP10205932A
Authority: JP
Inventors: Toru Hisamitsu; 徹久光
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-07-22
Filing date: 1998-07-22
Publication date: 2000-02-08

Abstract

PROBLEM TO BE SOLVED: To provide a method for acquiring rewrite rules capable of automatically correcting analysis errors in morphological analysis for natural language processing. SOLUTION: A set of rewrite rules capable of automatically correcting errors is obtained by providing a method that generates rewrite rules by comparing analysis results containing errors with the correct answers and then generalizes those rules, together with a method that measures the reliability of the generated rules. Further, in order to resolve ambiguity that cannot be resolved by local matching alone, and to make use of rules that, despite low reliability, indicate an unregistered word appearing repeatedly in a set of sentences within a fixed range, morphological analysis and its post-processing are applied to each unit called a 'window'.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a post-processing method for Japanese morphological analysis, which divides a Japanese sentence into its constituent words, and is used, for example, to improve the accuracy of machine translation and of indexing for information retrieval.

[0002]

2. Description of the Related Art An overview of the conventional techniques is given with reference to References 1 and 2.

Reference 1: Toru Hisamitsu and Yoshihiko Nitta, "A general-purpose algorithm for morphological analysis with likelihood and a comparison of likelihood criteria using it" (in Japanese), IEICE Transactions D-II, Vol. J77-D-II, No. 5, pp. 959-969 (1994).

Reference 2: Brill, E., "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging", Computational Linguistics, Vol. 21, No. 4, pp. 543-565 (1995).

Japanese morphological analysis refers to the procedure of recognizing an input Japanese character string as a sequence of pairs, each consisting of the character string of a constituent word and its part of speech, while referring to a dictionary and a grammar. Since the following description does not depend on any particular morphological analysis method, the details of such methods are left to Reference 1 and are not described here.

[0003] In terms of the likelihood criterion, morphological analysis methods are broadly divided into those based on rules given a priori and those based on probabilities (for example, models of adjacency probabilities between words).

[0004] Of these, rule-based morphological analysis systems are widely used, for two reasons: because they exploit human intuition, a relatively accurate system can be built by referring only to small-scale experimental data; and the rule set on which the choice of solution is based is compact and human-readable.

[0005] In recent years, as morphological analysis has become faster, it is often used for indexing in information retrieval of newspapers, patents, and the like, and there is a growing need to improve not only speed but also analysis accuracy across a variety of documents.

[0006] Errors in morphological analysis are caused by the presence of unregistered words and by deficiencies in the grammar and the likelihood criterion. An important countermeasure against errors caused by unregistered words is to acquire such words offline and register them in the dictionary. This alone, however, cannot deal with unregistered words that appear in unseen documents, or with errors caused by deficiencies in the grammar or the likelihood criterion. For this reason, there has been research on manually creating templates and rewrite rules from an analysis of error examples. However, as the number of rewrite rules grows to cover various errors, it becomes extremely difficult to maintain consistency among the rules and to evaluate their side effects, so rewrite rules need to be acquired and evaluated automatically.

[0007] Reference 2 proposes, for a rewrite-rule-based English POS tagger (part-of-speech tagging program), a method that automatically discovers a group of rewrite rules for correcting the output of an initially given POS tagger by comparing the analysis results with the correct answers (a set of sentences to which correct parts of speech have been assigned by hand). The method is called "error-driven rewrite rule acquisition" and is built on the following three elements:

1. An arbitrary POS tagger.
2. A set of allowed rewrite rules (each rewrite rule comprises a part-of-speech rewrite and its application condition).
3. An evaluation function that compares the results before and after applying each rewrite rule with the correct answers and evaluates the rule's effect.

[0008] The set of allowed rewrite rules is generally enormous and cannot be written out by hand, so in practice it is given in the form of templates that generate the rules. For example, the template

z a => z b

represents the rewrite rule "if the part of speech of the word preceding word a is z, rewrite a to b". Substituting concrete values for a, b, and z yields many concrete rewrite rules. In Reference 2, 14 kinds of templates are prepared by hand, and the steps for rewriting the output of the given POS tagger with rewrite rules instantiated from them are discovered automatically as follows:

step 1) The given POS tagger assigns parts of speech to a set of training sentences prepared in advance; the result is called T.
step 2) Each allowed rewrite rule is applied to T individually, the rewrite rule whose result comes closest to the correct answers is selected by the evaluation function, and it is recorded in an ordered list L. If no rewrite rule improves the result, the procedure terminates and returns L.
step 3) T is rewritten with this rewrite rule, the result is taken as the new T, and the procedure returns to step 2.
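Steps 1 to 3 above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of Reference 2: it supports the single template "z a => z b" applied to part-of-speech tags, and the evaluation function simply counts the positions that agree with the correct answers. The names `apply_rule`, `score`, and `learn_rules` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical rule type for the template "z a => z b":
# if the previous tag is z, rewrite tag a to tag b.
Rule = Tuple[str, str, str]


def apply_rule(rule: Rule, tags: Sequence[str]) -> List[str]:
    """Apply one rule to a tag sequence, testing conditions on the input tags."""
    z, a, b = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i - 1] == z and tags[i] == a:
            out[i] = b
    return out


def score(tags: Sequence[str], gold: Sequence[str]) -> int:
    """Assumed evaluation function: number of tags that match the correct answer."""
    return sum(t == g for t, g in zip(tags, gold))


def learn_rules(initial: Sequence[str], gold: Sequence[str],
                candidates: Sequence[Rule]) -> List[Rule]:
    """Greedy error-driven acquisition of an ordered rule list L."""
    T = list(initial)              # step 1: output of the given tagger
    L: List[Rule] = []
    while True:
        best_rule: Optional[Rule] = None
        best_score = score(T, gold)
        for rule in candidates:    # step 2: try every allowed rule on T
            s = score(apply_rule(rule, T), gold)
            if s > best_score:
                best_rule, best_score = rule, s
        if best_rule is None:      # no rule improves the result
            return L
        L.append(best_rule)
        T = apply_rule(best_rule, T)   # step 3: rewrite T and repeat
```

Because the score strictly increases on every iteration and is bounded by the number of tags, the loop always terminates.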

[0009] Once L has been obtained in this way, the POS tagger can be improved by rewriting its output with the rewrite rules in L, applied in order (the rewriting may be repeated several dozen times).

[0010] Because this method assumes English, in which word boundaries are given, it presupposes only the rewriting of the part of speech of each word. In Japanese, there are more kinds of errors than in English, and the situation is more complicated. In fact, errors in Japanese morphological analysis fall into the following four types:

(A) Segmentation errors, which are divided into three types:
(A-1) Over-segmentation
Correct: 今日/の/金/相場/は、...
Incorrect: 今日/の/金/相/場/は、...
(A-2) Under-segmentation
Correct: ユニックス/ワークステーション
Incorrect: ユニックスワークステーション
(A-3) Other errors (crossing word boundaries)
Correct: 病気/が/まん延/。
Incorrect: 病気/がまん/延/。

[0011] (B) Errors in part of speech only
Correct: ...」/と (quotative particle)/い/う
Incorrect: ...」/と (coordinating particle)/い/う

Consequently, changes to word boundaries must also be considered in rewrite rules, so the set of allowed rewrite rules becomes astronomically large, and preparing appropriate templates by hand in advance becomes nearly impossible.

[0012] Even if rewrite rules could be constructed by hand, there is a problem that the conventional method fundamentally cannot handle: it cannot eliminate errors arising from ambiguity of the following kind. A character string such as 「価格差」 (price difference) can be segmented in more than one way, for example 「価／格差」 or 「価格／差」, but because morphological analysis generally decides on the basis of local information, it may be unable to determine which is correct. It is then forced to output one of them arbitrarily, which leads to errors. In general, such ambiguity can be resolved by referring to a somewhat wider context, but it is difficult to capture errors caused by such ambiguity with local rewrite rules alone. Handling it with a probabilistic model of word sequences is also impractical, because the required data would become far too large.

[0013]

PROBLEMS TO BE SOLVED BY THE INVENTION As described above, conventional post-processing techniques have difficulty with methods for automatically acquiring rewrite rules and with eliminating errors caused by ambiguity. The present invention provides a post-processing technique that is applicable even to Japanese, which has complicated error patterns because it places no delimiters between words, and that also eliminates errors caused by ambiguity.

[0014]

MEANS FOR SOLVING THE PROBLEMS In the present invention, the accuracy of morphological analysis is improved by the following steps.

[0015] The following four steps form the basis:

(1) Automatic generation of rewrite rules at various levels of abstraction: rewrite rules at various levels of abstraction (including changes to word boundaries) are generated automatically by comparing the erroneous parts of the morphological analysis results with the correct answers.
(2) Automatic judgment of the effect of rewrite rules: the effect of each generated rewrite rule is judged automatically, and effective rules are recorded.
(3) Output of the maximum-likelihood solution of the morphological analysis: for sentences in a predetermined range, the maximum-likelihood solution of the morphological analysis is output.
(4) Correction by applying rewrite rules: the output of the morphological analysis of the target sentences is corrected by applying the rewrite rules.

[0016] In addition, introducing the concept of a "window" can further improve the analysis results. That is, steps (3) and (4) above are modified as follows, and a second-stage correction step (5) is added.

[0017] (3)' Output of the maximum-likelihood solution of the morphological analysis: for the sentences in a predetermined range called a "window", the maximum-likelihood solution of the morphological analysis is output and, at the same time, the possible solutions for ambiguous parts are retained.
(4)' Correction by applying rewrite rules: a first-stage correction is made by applying the rewrite rules to the morphological analysis output for the sentences in the window.
(5) Second-stage correction: a second-stage correction, which fixes unregistered words and resolves ambiguity, is applied to the result of the first-stage correction of the analysis of the sentences in the window.

[0018]

DESCRIPTION OF THE PREFERRED EMBODIMENTS The invention is described concretely below for the case where a window is used.

[0019] (1) Step of generating rewrite rules

The generation of rewrite rules assumes that correctly analyzed data is available. Rewrite rules are generated automatically in the following stages by comparing the analysis results with the correct data.

- Stage 1 -
Let K and L be natural numbers. For every erroneous part, a rewrite rule consisting of concrete words that rewrites the erroneous part correctly is generated in the following form:

a1 ... aK W1 W2 ... Wn b1 ... bL => a1 ... aK W1' W2' ... Wm' b1 ... bL

Here W1 W2 ... Wn, W1' W2' ... Wm', a1 ... aK, and b1 ... bL are word sequences: W1 W2 ... Wn is the erroneous part, W1' W2' ... Wm' is the correct analysis corresponding to W1 W2 ... Wn, and a1 ... aK and b1 ... bL are the environment (also called the context) in which the rewrite applies.

[0020] Hereinafter, W1 W2 ... Wn is called the first pattern, W1' W2' ... Wm' the second pattern, a1 ... aK the front environment, and b1 ... bL the rear environment. Because data sparseness arises when K and L are too large, and for simplicity, the following examples use K = L = 1.
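Stage 1 with K = L = 1 can be sketched as follows. This is a minimal sketch under assumptions that are not stated in the text: each analysis is a list of (surface form, part of speech) pairs covering the same character string, the two analyses are aligned by character offset, directly adjacent errors are merged into one erroneous part, and the hypothetical sentinels `BOS` and `EOS` stand in for a missing environment at a sentence boundary.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (surface form, part of speech)
BOS: Token = ("", "BOS")  # assumed sentinel for a missing front environment
EOS: Token = ("", "EOS")  # assumed sentinel for a missing rear environment


def with_offsets(tokens: List[Token]) -> List[Tuple[int, int, str, str]]:
    """Attach character offsets so that two segmentations can be aligned."""
    out, pos = [], 0
    for surface, tag in tokens:
        out.append((pos, pos + len(surface), surface, tag))
        pos += len(surface)
    return out


def extract_rules(analysis: List[Token], gold: List[Token]):
    """Return (front, first_pattern, rear, second_pattern) per erroneous part."""
    a, g = with_offsets(analysis), with_offsets(gold)
    rules, i, j = [], 0, 0
    while i < len(a) and j < len(g):
        if a[i] == g[j]:
            i, j = i + 1, j + 1
            continue
        i2, j2 = i + 1, j + 1
        while True:
            # Grow whichever side ends earlier until both end at the same offset.
            while a[i2 - 1][1] != g[j2 - 1][1]:
                if a[i2 - 1][1] < g[j2 - 1][1]:
                    i2 += 1
                else:
                    j2 += 1
            # Merge a directly following error into the same erroneous part.
            if i2 < len(a) and j2 < len(g) and a[i2] != g[j2]:
                i2, j2 = i2 + 1, j2 + 1
                continue
            break
        front = analysis[i - 1] if i > 0 else BOS
        rear = analysis[i2] if i2 < len(a) else EOS
        rules.append((front, tuple(analysis[i:i2]), rear, tuple(gold[j:j2])))
        i, j = i2, j2
    return rules
```

Applied to the over-segmentation example (A-1), with illustrative part-of-speech labels for 今日 and の, the sketch yields one rule whose front environment is 金, whose first pattern is 相/場, whose second pattern is 相場, and whose rear environment is は.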

[0021] For example, the following rewrite rule R1 is obtained from error (A-1):

(R1) 金:common noun / 相:noun affix / 場:noun affix / は:adverbial particle => 金:common noun / 相場:common noun / は:adverbial particle

- Stage 2 -
All rewrite rules obtained in Stage 1 are generalized using combinations of the following generalization patterns G1, G2, G3, ...:

(G1) Consider only the parts of speech of the first and second patterns.

[0022] (G2) Replace each character of the first and second patterns with its character type.

[0023] (G3) Consider only the part of speech of the front environment.

[0024] (G4) Replace the characters of the front environment with their character types.

[0025] (G5) Ignore the front environment.

[0026] (G6) Consider only the part of speech of the rear environment.

[0027] (G7) Replace the characters of the rear environment with their character types.

[0028] (G8) Ignore the rear environment.

[0029] For example, applying generalization patterns G2, G3, and G6 rewrites rule R1 as follows:

(R1) common noun / "C1":noun affix / "C2":noun affix / particle => common noun / "C1C2":common noun / particle

Here "Ci" denotes a kanji character. Generalized rewrite rules can be generated with arbitrary combinations of patterns and then screened by the rewrite-rule evaluation procedure, but in practice it is better to choose, through preliminary experiments, combinations that appear effective, in order to reduce the screening work. We, for example, limited ourselves to 22 combinations.
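The combination of G2, G3, and G6 can be sketched as follows. The sketch makes several assumptions for illustration only: a rule is a tuple (front, first pattern, rear, second pattern) of (surface form, part of speech) pairs, character types are detected from Unicode block ranges, the symbols "H", "K", and "O" for non-kanji characters are hypothetical (the text only defines "Ci" for kanji), and a part-of-speech-only environment is represented with `None` as its surface form.

```python
from typing import List, Optional, Tuple

Token = Tuple[Optional[str], str]  # (surface form or None, part of speech)


def char_class(ch: str) -> str:
    """Classify one character by Unicode block (an assumed, coarse scheme)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "C"   # kanji (CJK Unified Ideographs)
    if "\u3040" <= ch <= "\u309f":
        return "H"   # hiragana (hypothetical symbol)
    if "\u30a0" <= ch <= "\u30ff":
        return "K"   # katakana (hypothetical symbol)
    return "O"       # anything else (hypothetical symbol)


def g2(pattern) -> Tuple[Token, ...]:
    """G2: replace each character with its character type, numbered by position."""
    out: List[Token] = []
    n = 1
    for surface, tag in pattern:
        symbols = []
        for ch in surface:
            symbols.append(f"{char_class(ch)}{n}")
            n += 1
        out.append(("".join(symbols), tag))
    return tuple(out)


def generalize_g2_g3_g6(rule):
    """Apply G2 to both patterns, G3 to the front and G6 to the rear environment."""
    front, first, rear, second = rule
    return ((None, front[1]), g2(first), (None, rear[1]), g2(second))
```

Because the first and second patterns cover the same characters, numbering by position gives both sides consistent symbols ("C1", "C2" on the left and "C1C2" on the right). Unlike the example in paragraph [0029], this sketch keeps the fine-grained tag of the rear environment rather than mapping it to the coarser "particle".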

[0030] (2) Procedure for judging the effect of the modified rewrite rules

In Brill's method, rewrite rules are stored in an ordered list and applied in order to successively rewrite the output of the initial state annotator. In contrast, we intend post-processing to finish with a single scan of the morphological analysis output. Rewrite rules are therefore applied by a "greedy method" (described later) that uses the reliability of each rewrite rule and the length of its rewrite pattern, and no ordered list is created.

[0031] To express the effect of a rewrite rule, a numerical value called its "reliability" is assigned as follows. First, for a rewrite rule R, a triple of natural numbers {T, CE, EC} is defined as follows:

T: the number of words rewritten by the rewrite rule.
CE: the number of words that were analyzed correctly before rewriting and are rewritten incorrectly.
EC: the number of words that were analyzed incorrectly before rewriting and are rewritten correctly.

From the triple {T, CE, EC}, the reliability r of R is defined by:

[0032]

(Equation 1)

[0033] As it stands, however, the reliability of low-frequency rewrite rules tends to be greatly overestimated, so it was corrected by a simple technique called the "+1 rule". That is, {T, CE, EC} is recalculated on the assumption that one additional correct-to-incorrect rewrite, one additional incorrect-to-correct rewrite, and one additional incorrect-to-incorrect rewrite have each occurred. If n is the number of words rewritten by the rewrite rule, the recalculated triple {T', CE', EC'} is

EC' = EC + n, CE' = CE + n, T' = T + 3n

and these are used in the following equation.

[0034]

(Equation 2)

[0035] The reliability is redefined by the above equation. Rules with negative reliability were discarded.
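The count adjustment of the "+1 rule" in paragraph [0033] is plain arithmetic and can be sketched as below. The function name `plus_one` is hypothetical, and the sketch deliberately stops at the adjusted triple, because the reliability formulas themselves (Equations 1 and 2) are not reproduced in this text.

```python
from typing import Tuple


def plus_one(T: int, CE: int, EC: int, n: int) -> Tuple[int, int, int]:
    """Return the adjusted triple (T', CE', EC') for a rule that rewrites n words.

    One virtual correct-to-incorrect, one incorrect-to-correct and one
    incorrect-to-incorrect rewrite are added, each covering n words.
    """
    return T + 3 * n, CE + n, EC + n
```

For example, a rule that rewrites n = 2 words and was observed once, correctly (T = 2, CE = 0, EC = 2), is adjusted to T' = 8, CE' = 2, EC' = 4, so a single favorable observation no longer looks perfectly reliable.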

[0036] The following are examples of rewrite rules and their reliabilities. Where the front or rear environment is ignored by applying generalization pattern G5 or G8, it is written as "*", which matches anything:

Example 1) * / と:coordinating particle / い:consonant-verb stem => * / と:quotative particle / い:consonant-verb stem ........ reliability: 0.947
Example 2) particle / "C1":common noun / "C2":unregistered word / particle => particle / "C1C2":common noun / particle ........ reliability: 0.667
Example 3) * / "C1C2C3":proper noun / "C4":common noun / * => * / "C1C2":common noun / "C3C4":common noun / * ........ reliability: -1.58

Of these, rules whose reliability is negative, such as Example 3, are discarded.

[0037] Thus, according to the present invention, morphological analysis is improved by introducing the reliability of rewrite rules into their application. In addition, where necessary, the concept of a "window" is introduced so that the sentences within the window can be evaluated together. That is, the steps described above are followed by the steps below.

[0038] (3)' For the sentences in the window, outputting the maximum-likelihood solution of the morphological analysis while retaining the possible solutions for ambiguous parts

With a given likelihood criterion, the morphological analysis sometimes cannot uniquely determine the most likely solution and has to output one solution arbitrarily, which has often caused errors. In contrast, here a single solution is output as before, and at the same time the ambiguous locations and their multiple solutions are recorded. The method of retaining multiple solutions is not restricted, but a data structure called a "lattice", shown schematically in FIG. 1, is generally used.

[0039] FIG. 1 schematically shows how ambiguity is retained by a lattice. In the figure, the character position at the start of the sentence is 0, and numbers increasing by 1 are inserted between characters, so that the character positions occupied by each word in the sentence are indicated.

[0040] In "製品の価格差" (product price difference), the part "価格差" has two possible segmentations, {"価", "格差"} and {"価格", "差"}, and this ambiguity cannot be resolved by the existing rules alone. The two are therefore retained, respectively, as the path connecting character positions 3 and 4 and then 4 and 6, and the path connecting character positions 3 and 5 and then 5 and 6.
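The lattice of FIG. 1 can be sketched as a set of edges between character positions. This is a minimal sketch, assuming each edge is a hypothetical (start, end, word) triple with end greater than start; the text does not prescribe any particular representation. The function `all_paths` enumerates every segmentation retained by the lattice.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str]  # (start position, end position, word)


def all_paths(edges: List[Edge], start: int, end: int) -> List[List[str]]:
    """Enumerate all word sequences that lead from `start` to `end`."""
    by_start: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for s, e, word in edges:
        by_start[s].append((e, word))

    def walk(pos: int) -> List[List[str]]:
        if pos == end:
            return [[]]
        # Every edge moves forward (e > s), so the recursion terminates.
        return [[word] + rest
                for e, word in by_start[pos]
                for rest in walk(e)]

    return walk(start)


# Lattice for "製品の価格差" with the two readings of "価格差".
edges = [
    (0, 2, "製品"), (2, 3, "の"),
    (3, 4, "価"), (4, 6, "格差"),
    (3, 5, "価格"), (5, 6, "差"),
]
paths = all_paths(edges, 0, 6)
```

Here `paths` contains the two segmentations 製品/の/価/格差 and 製品/の/価格/差, which correspond to the two retained paths described above.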

[0041] Analysis and retention of its results are performed not sentence by sentence but on a set of sentences called a "window". A window may be defined either by a logical unit (such as an article or a chapter) or by a purely quantitative unit (for example, 10 consecutive sentences), but it is intended to contain a set of sentences that are related in content; for newspaper articles, it is natural to treat one article as a window.

[0042] A window is used because, by paying attention to expressions that appear repeatedly within a certain range of sentences, one can expect effects that cannot be obtained by processing one sentence at a time with simple rewrite rules. Here we assume that, because the sentences in a window are related in content, particular content words tend to be used repeatedly within it.

[0043] (4)' First-stage correction of the morphological analysis output for the sentences in the window

The analysis result is scanned from the beginning, and at each position an applicable rule is applied. A rule applied at this stage must satisfy the following conditions:

(i) Its reliability is higher than a predetermined threshold.
(ii) The rewrite does not produce a word sequence containing an ungrammatical connection.

Condition (ii) needs to be checked when applying rewrite rules of the type that ignores the context.

[0044] When several rewrite rules are applicable, the one with the highest reliability is selected; if there are several such rules, the one that rewrites the largest number of words is chosen. After a rewrite rule has been applied, matching of the word sequence against the rewrite rules resumes from the position following the rewritten word sequence. If no rewrite rule was applied, the word position at which rule application starts is advanced by one word.
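The single-pass greedy application described in paragraphs [0043] and [0044] can be sketched as follows. The sketch is simplified under stated assumptions: a rule matches an exact sequence of (surface form, part of speech) pairs, the front and rear environments and the grammaticality check of condition (ii) are omitted, and the `Rule` type, the reliability values, and the threshold are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple

Token = Tuple[str, str]  # (surface form, part of speech)


class Rule(NamedTuple):
    lhs: Tuple[Token, ...]   # word sequence to be rewritten
    rhs: Tuple[Token, ...]   # replacement word sequence
    reliability: float


def apply_greedy(tokens: Sequence[Token], rules: Sequence[Rule],
                 threshold: float) -> List[Token]:
    """Scan once from the start, applying the best matching rule at each position."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        matches = [
            r for r in rules
            if r.lhs                                    # skip empty patterns
            and r.reliability > threshold               # condition (i)
            and tuple(tokens[i:i + len(r.lhs)]) == r.lhs
        ]
        if matches:
            # Highest reliability first; ties go to the rule rewriting more words.
            best = max(matches, key=lambda r: (r.reliability, len(r.lhs)))
            out.extend(best.rhs)
            i += len(best.lhs)      # resume after the rewritten word sequence
        else:
            out.append(tokens[i])
            i += 1                  # no rule applied: advance by one word
    return out
```

Because every step advances the scan position by at least one word, the output is produced in a single pass and no ordered rule list is needed.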

[0045] If a rule of the type that changes word boundaries and creates a new word is the most reliable rewrite rule at some location but fails only condition (i) above, the rule and the location are recorded as a "potentially applicable rewrite rule" and a "location that may potentially be rewritten", respectively, and are referred to during the second-stage correction.
It is recorded as a "potentially applicable rewrite rule" and a "potentially rewriteable location", and is referred to at the time of modification in the second stage.

【００４６】なお、書き換えルールはTRIE構造により記
録することにより効率的に検索することができる。The rewrite rules can be searched efficiently by recording them in a TRIE structure.
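One way such a trie could look is sketched below. The text says only that a TRIE structure is used, so the details here are assumptions: rules are keyed by the item sequence of their left-hand side, and `RuleTrie`, `insert`, and `matches` are hypothetical names. A lookup walks the trie along the input from a given position and collects every rule whose key is a prefix of the remaining items.

```python
from typing import Any, Dict, Hashable, List, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[Hashable, "TrieNode"] = {}
        self.rules: List[Any] = []   # rules whose key ends at this node


class RuleTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: Sequence[Hashable], rule: Any) -> None:
        """Store `rule` under the item sequence `key`."""
        node = self.root
        for item in key:
            node = node.children.setdefault(item, TrieNode())
        node.rules.append(rule)

    def matches(self, items: Sequence[Hashable], start: int) -> List[Any]:
        """Return all rules whose key is a prefix of items[start:]."""
        node, found = self.root, []
        for item in items[start:]:
            node = node.children.get(item)
            if node is None:
                break
            found.extend(node.rules)
        return found
```

With this layout, finding all candidate rules at one position costs time proportional to the longest matching key rather than to the number of stored rules.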

[0047] (5) Second-stage correction, which fixes unregistered words and resolves ambiguity using the result of applying the first-stage correction to the analysis of the sentences in the window

Suppose that the first rewriting of the analysis results in the window has been completed by the first-stage correction (4)' above. Some rewrites change word boundaries, such as

particle / "C1":common noun / "C2":unregistered word / particle => particle / "C1C2":common noun / particle

(these may create an unregistered word; in the case above, if "C1C2" is not registered in the dictionary, an unregistered word has been acquired), while others only rewrite the part of speech, such as

* / と:coordinating particle / い:consonant-verb stem => * / と:quotative particle / い:consonant-verb stem

("*" indicates a match with any word).

[0048] In the second-stage correction, the reliability is re-evaluated within the window for rewrite rules of the type that changes word boundaries and that were not actually applied because their reliability was low.

[0049] Specifically, let {R1, ..., Rk} be the set of rewrite rules that generate a word w as a result of changing word boundaries, and let ri be the reliability of Ri (i = 1, ..., k). The reliability r(w) of the event "word w appears in the window" is then defined as follows:

[0050]

(Equation 3)

[0051] In this definition of r(w), rewrite rules in {R1, ..., Rk} that are identical and whose application sites share the same preceding and following context are not counted more than once in the product. This reflects the idea that the existence of w is strongly suggested only when it appears in different kinds of ways.

[0052] When r(w) exceeds a predetermined threshold, those rewrite rules in {R1, ..., Rk} that were not applied in the first rewriting are applied, and the analysis result is rewritten. If w is an unregistered word, an unregistered word is considered to have been detected.
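The window-level re-evaluation can be sketched as follows. Equation 3 is not reproduced in this text, so the sketch takes the combining function as a parameter rather than fixing it. The `noisy_or` function shown is only an illustrative stand-in chosen because the text mentions a product; it should not be read as the patent's definition of r(w). All names and the tuple layout of `candidates` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# (word w, rule id, front context, rear context, reliability of the rule)
Candidate = Tuple[str, str, str, str, float]


def noisy_or(reliabilities: List[float]) -> float:
    """Illustrative stand-in for Equation 3 (an assumption, not the source formula)."""
    p = 1.0
    for r in reliabilities:
        p *= 1.0 - r
    return 1.0 - p


def reevaluate(candidates: Iterable[Candidate],
               combine: Callable[[List[float]], float],
               threshold: float) -> Set[str]:
    """Return the words whose combined window-level reliability exceeds the threshold."""
    by_word: Dict[str, Dict[Tuple[str, str, str], float]] = {}
    for word, rule_id, front, rear, r in candidates:
        # The same rule applied in the same context is counted only once.
        by_word.setdefault(word, {})[(rule_id, front, rear)] = r
    return {w for w, seen in by_word.items()
            if combine(list(seen.values())) > threshold}
```

The deduplication step implements the point made in paragraph [0051]: repeated occurrences of the same rule in the same context add no evidence, whereas occurrences in different contexts do.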

[0053] As described in (3)' above, ambiguous locations are stored in lattice form; ambiguity that has not been resolved by the stages described so far is resolved at this point.

[0054] In general, most of the window is analyzed without ambiguity. For each ambiguous location, therefore, the path containing the most content words (independent words) that also occur in the unambiguously analyzed parts of the window is determined to be the optimal path, and if it differs from the solution output in step (3)', the solution is replaced with it. If this method does not yield a unique optimal path, no rewriting is performed.

[0055] In the example of FIG. 1, if, as shown in FIG. 2, the total number of occurrences of "価格" and "差" in the unambiguous parts of the window is greater than the total number of occurrences of "価" and "格差", then {"価格", "差"} is selected, and the provisional analysis result "製品/の/価/格差" is corrected to "製品/の/価格/差". In other words, the ambiguity can be resolved by reflecting the analysis results obtained from the other sentences appearing in the window of FIG. 2.
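The path selection of paragraphs [0054] and [0055] can be sketched as follows. This is a minimal sketch assuming that each candidate path is given as a list of content words and that the words of the unambiguously analyzed parts of the window have already been collected (function words are assumed to have been filtered out beforehand); `choose_path` is a hypothetical name.

```python
from collections import Counter
from typing import List, Optional, Sequence


def choose_path(paths: Sequence[List[str]],
                unambiguous_words: Sequence[str]) -> Optional[List[str]]:
    """Pick the path whose words occur most often in the unambiguous part.

    Returns None when no unique best path exists, in which case the
    provisional analysis is left unchanged.
    """
    if not paths:
        return None
    counts = Counter(unambiguous_words)
    scores = [sum(counts[w] for w in path) for path in paths]
    best = max(scores)
    winners = [p for p, s in zip(paths, scores) if s == best]
    return winners[0] if len(winners) == 1 else None
```

For the example above, if the unambiguous parts of the window contain "価格" twice, "差" once, and "格差" once, the path {"価格", "差"} scores 3 against 1 for {"価", "格差"} and is selected.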

[0056] Although this method is simple, experiments showed that it is effective in improving the accuracy of the analysis results.

【００５７】Note that even a fixed rewrite rule, when combined with simple surface information, returns different results depending on the context.

【００５８】[Embodiment] In the following embodiment, the acquisition of rewrite rules and the post-processing are described by example, following the processing flows shown in FIGS. 4, 5 and 6. The user interface for listing and presenting rewrite rules, setting windows and so on is described with reference to FIGS. 7 to 11.

【００５９】It is self-evident that the present invention can be implemented on a computer with a general hardware configuration such as that shown in FIG. 3, so no detailed description is given. Here, 3011 is a document data file with correct answers, 3012 is a dictionary file, 3013 is a morphological analysis program file, 3014 is an additional dictionary file, 3015 is a rewrite rule storage file, 3016 is a rewrite rule extraction program file, 3017 is a post-processing program file, and 3018 is a document file; together these make up the storage device 301. 302 is an input device, 303 is a main memory, 305 is an arithmetic unit, 306 is a display, and 303 is a communication device for cooperating with other computers.

【００６０】First, the step of automatically generating rewrite rules by comparing the erroneous portions of a morphological analysis with the correct answers is described with reference to FIG. 4.

【００６１】Before rewrite rules are generated, the user must supply a training text together with correct-answer data obtained by analyzing that text correctly. One way to do this, as shown in display area 9012 of FIG. 10, is to specify, under a particular directory, a text file (for example "text1") and its correct analysis result (for example "text1.sol") as a pair. In this example, a text and its analysis result are associated through the extension ".sol".

【００６２】After the text to be used for extracting rewrite rules has been specified, the working memories M1, M2, M3 and M4 are cleared in step 401.

【００６３】Next, in step 402, the sentences contained in the training text are morphologically analyzed and the result is recorded in memory M1.

【００６４】Next, in step 403, the analysis result of the training text specified by the user is compared with the analysis result recorded in M1. Pairs {W1...Wn, W'1...W'm} of portions that differ as word strings are extracted in order from the beginning and recorded in working memory M2 together with their preceding and following contexts A1...AK and B1...BL. (Here W1...Wn and W'1...W'm satisfy W1≠W'1, ..., Wn≠W'm, and n and m are the shortest lengths for which the words immediately before and after W1...Wn and W'1...W'm both match.) K and L depend on how many words before and after the differing portion are referenced by the method, referred to in step 4042, that generalizes a rewrite rule made up of concrete word strings into a rewrite rule that does not depend on individual words; the largest such K and L are used. The default is K = L = 1, but the user can also specify them, as described later.
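
The extraction in step 403 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the source's procedure: the function name `diff_segmentations` is hypothetical, words are compared by surface form only, and both word lists are assumed to segment exactly the same character string.

```python
def diff_segmentations(system, gold, k=1, l=1):
    """Return (before, wrong, right, after) for every span where two
    segmentations of the same character string disagree.

    `system` is the analyzer output, `gold` the correct answer; `k` and `l`
    are the numbers of context words kept before and after the span."""
    diffs = []
    i = j = 0
    while i < len(system) and j < len(gold):
        if system[i] == gold[j]:
            i += 1
            j += 1
            continue
        start_i, start_j = i, j
        len_s, len_g = len(system[i]), len(gold[j])
        i += 1
        j += 1
        # Grow the shorter side until both spans cover the same characters
        # and the following words agree again.
        while True:
            if len_s < len_g:
                len_s += len(system[i])
                i += 1
            elif len_g < len_s:
                len_g += len(gold[j])
                j += 1
            elif i < len(system) and j < len(gold) and system[i] != gold[j]:
                len_s += len(system[i])
                len_g += len(gold[j])
                i += 1
                j += 1
            else:
                break
        diffs.append((system[max(0, start_i - k):start_i],
                      system[start_i:i], gold[start_j:j],
                      system[i:i + l]))
    return diffs

system = ["製品", "の", "価", "格差"]
gold = ["製品", "の", "価格", "差"]
diffs = diff_segmentations(system, gold)
```

For the example sentence, the single differing span is "価/格差" against "価格/差", with "の" as the preceding context and no following context.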

【００６５】In step 404, steps 4041 and 4042 are performed for every {A1...AK, W1...Wn, W'1...W'm, B1...BL} recorded in M2.

【００６６】In step 4041, a rewrite rule made up of concrete words is generated from {A1...AK, W1...Wn, W'1...W'm, B1...BL} in the following form and recorded in working memory M3:

A1...AK W1 W2...Wn B1...BL => A1...AK W'1 W'2...W'm B1...BL

Next, in step 4042, generalized rewrite rules are generated from all the concrete-word rewrite rules recorded in M2, using the generalization method specified in advance as mentioned for step 403, and are added to working memory M3.
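
As a rough Python sketch of steps 4041 and 4042, a rule can be held as a small record and generalized by replacing its context words with their parts of speech. The `Rule` type, the `<...>` notation and the part-of-speech table are hypothetical names introduced for illustration; replacing context words by part of speech is only one possible generalization pattern.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    before: tuple  # context patterns A1...AK preceding the error
    wrong: tuple   # erroneous word sequence W1...Wn
    right: tuple   # correct word sequence W'1...W'm
    after: tuple   # context patterns B1...BL following the error

def concrete_rule(before, wrong, right, after):
    """Step 4041: keep every context word exactly as it appeared."""
    return Rule(tuple(before), tuple(wrong), tuple(right), tuple(after))

def generalize(rule, pos_of):
    """Step 4042, one possible pattern: replace each context word by its
    part of speech so the rule no longer depends on individual words."""
    return rule._replace(
        before=tuple("<" + pos_of[w] + ">" for w in rule.before),
        after=tuple("<" + pos_of[w] + ">" for w in rule.after),
    )

# Hypothetical part-of-speech table for the context words.
pos_of = {"の": "particle", "が": "particle"}
rule = concrete_rule(["の"], ["価", "格差"], ["価格", "差"], ["が"])
general = generalize(rule, pos_of)
```

The generalized rule keeps the erroneous and correct word sequences unchanged and matches any particle on either side.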

【００６７】Next, in step 405, steps 4051 to 4054 are repeated for every rewrite rule R recorded in M3.

【００６８】In step 4051, R is applied to the analysis result in M1 and the result is recorded in memory M4. Next, in step 4052, the contents of M4 and M1 are compared, the numbers of improvements, deteriorations and total rewritten words are counted, and the reliability r of rewrite rule R is calculated by the method described in step (2), automatic judgment of the effect of rewrite rules.

【００６９】In step 4053, only the rewrite rules R whose reliability r is positive are stored, paired with their reliability, in the rewrite rule storage memory.

【００７０】M4 is cleared, and if any rewrite rule remains whose reliability has not been verified, processing returns to step 4051.
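
The loop of steps 4051 to 4054 can be sketched in Python as follows. The reliability formula used here, (improvements - deteriorations) / total rewrites, is an assumption made only for illustration; the actual formula is the one defined in step (2) and is not reproduced in this sketch. The function names and sample counts are likewise hypothetical.

```python
def reliability(improved, worsened, total_rewritten):
    """Assumed score in [-1, 1]; NOT the formula of step (2) in the source."""
    if total_rewritten == 0:
        return 0.0
    return (improved - worsened) / total_rewritten

def keep_reliable(rule_stats):
    """Step 4053: keep only rules with positive reliability, paired with it."""
    kept = {}
    for rule, (improved, worsened, total) in rule_stats.items():
        r = reliability(improved, worsened, total)
        if r > 0:
            kept[rule] = r
    return kept

# Hypothetical counts per rule: (improvements, deteriorations, total rewrites).
stats = {"R1": (8, 2, 10), "R2": (1, 4, 5), "R3": (0, 0, 0)}
kept = keep_reliable(stats)
```

Only R1 survives in this example, because R2 does more harm than good and R3 never fires on the training data.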

【００７１】This concludes the description of the rewrite rule acquisition steps.

【００７２】In morphological analysis, when the input sentence set has no breaks, a buffer is used and the analysis is performed on batches of sentences not exceeding a fixed amount. For simplicity, the following assumes that the input received by the morphological analysis process is already of a manageable size.

【００７３】First, the post-processing method that simply uses rewrite rules alone, without a "window", is described with reference to FIG. 5.

【００７４】In post-processing, the working memories M5 and M6 are first cleared in step 504. Next, in step 505, the input sentences are stored in working memory M5.

【００７５】In step 506, steps 5061 to 5064 are repeated for every sentence stored in M5. In step 5061, each sentence of the input sentence set is morphologically analyzed and the output is recorded in working memory M6.

【００７６】In step 5062, the morphological analysis result recorded in M6 is rewritten with the rewrite rules stored in the rewrite rule storage memory, using the method described in correction step (4), which applies rewrite rules. The threshold for applying a rewrite rule was set to 0.65.
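
Step 5062 can be sketched in Python as below. This is a simplified illustration, not the method of step (4) itself: the function name `apply_rules` and the rule layout are hypothetical, patterns are matched on surface forms only, and a rule is applied when its reliability is at or above the threshold.

```python
def apply_rules(words, rules, threshold=0.65):
    """Rewrite `words` with every rule whose reliability meets the threshold.

    Each rule is (before, wrong, right, after, reliability); the four
    patterns are lists of surface forms in this sketch."""
    out = list(words)
    for before, wrong, right, after, r in rules:
        if r < threshold:
            continue  # unreliable rules are skipped
        pattern = before + wrong + after
        i = 0
        while i + len(pattern) <= len(out):
            if out[i:i + len(pattern)] == pattern:
                start = i + len(before)
                out[start:start + len(wrong)] = right
                # Continue after the rewritten words, always moving forward.
                i = max(i + 1, start + len(right))
            else:
                i += 1
    return out

words = ["製品", "の", "価", "格差", "が"]
rule = (["の"], ["価", "格差"], ["価格", "差"], ["が"], 0.9)
fixed = apply_rules(words, [rule])
```

Lowering the rule's reliability below 0.65 leaves the analysis result untouched.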

【００７７】Next, in step 5063, the contents of working memory M6 are output, and then in step 5064 working memory M6 is cleared.

【００７８】This concludes the description of the ordinary post-processing method.

【００７９】Next, the post-processing method that uses a "window" is described with reference to FIG. 6. Morphological analysis and its post-processing are performed in units of sentence sets called "windows". A window consists of at least one sentence and can be specified either by a number of sentences or by a structure that does not depend on the number of sentences, such as "one article". When a window is specified by a structure such as "one article", it must be defined in accordance with the format of the input text. The manner of definition is not prescribed, but it is important that the window can be defined explicitly.
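
The two ways of defining a window can be sketched in Python as follows. Both functions are hypothetical helpers, and the `<article>` marker line is an assumed input convention used only to illustrate a structure-based window; the source deliberately leaves the definition open.

```python
def windows_by_count(sentences, size):
    """Group sentences into consecutive windows of at most `size` sentences."""
    if size < 1:
        raise ValueError("a window holds at least one sentence")
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def windows_by_marker(lines, is_boundary):
    """Group sentences into windows that start at each boundary line,
    for example one window per article; boundary lines are dropped."""
    windows, current = [], []
    for line in lines:
        if is_boundary(line):
            if current:
                windows.append(current)
            current = []
        else:
            current.append(line)
    if current:
        windows.append(current)
    return windows

by_count = windows_by_count(["s1", "s2", "s3"], 2)
by_article = windows_by_marker(
    ["<article>", "s1", "s2", "<article>", "s3"],
    lambda line: line == "<article>",
)
```

Passing the boundary test as a function keeps the window definition explicit and replaceable, in line with the requirement stated above.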

【００８０】In post-processing, the working memories M5, M6, M7 and M8 are first cleared in step 501. Next, in step 502, the input sentences are divided into the defined windows and stored in working memory M5.

【００８１】In step 503, steps 5031 to 5036 are repeated for every window stored in M5. For the window fixed in step 503, the sentences in the window are morphologically analyzed in step 5031 and the first output is recorded in working memory M6. The locations where the minimum-cost solution is not uniquely determined, together with data preserving the ambiguity, are recorded in working memory M7.

【００８２】In step 5032, the morphological analysis result recorded in M6 is rewritten with the rewrite rules stored in the rewrite rule storage memory, using the method described in correction step (4)', which applies rewrite rules. The threshold for applying a rewrite rule was set to 0.65. At this time, for each rewrite rule of the type that changes word boundaries and was not used because its reliability was low, the rule, the location where it would apply, and the word W that the rewrite would produce are recorded as a set in working memory M8, provided that W satisfies a predetermined condition on part of speech or the like (for example, "W is a noun").

【００８３】In step 5033, step 50331 is performed for every word W recorded in M8: for the rewrite rules {R1, ..., Rk} that may generate W, the reliability is re-evaluated as described in the second-stage correction step (5), and if the re-evaluated value exceeds a predetermined threshold (0.65 in the experiment), the rewrite rules are applied and the analysis result in M6 is rewritten.

【００８４】Next, in step 5034, the condition check of step 50341 is performed for every ambiguous location recorded in M7: "Has the ambiguity been eliminated by the rewriting up to step 5033?" If the ambiguity has not been resolved, then in step 50342, following the method described in the second-stage correction step (5), the candidate among the possible solutions of the ambiguous portion that contains the most content words appearing at unambiguously analyzed positions in M6 is selected, and if it differs from the current result, the result is rewritten.

【００８５】In step 5035, the contents of working memory M6, having passed through all post-processing, are output, and then in step 5036 the working memories M6, M7 and M8 are cleared. This concludes the description of the post-processing method using a "window".

【００８６】Whether or not windows are used, when unknown words are detected as a result of post-processing, this post-processing method makes it easy to record them all in a working memory, present the recorded unknown words to the user after all post-processing has finished, have the user select the words to register, register those words in the additional dictionary 3014, and use them in subsequent morphological analysis. Likewise, when the reliability of an unknown word is extremely high, that is, higher than a predetermined threshold close to 1, the word can be recorded in the additional dictionary 3014 automatically and used in morphological analysis from the point of detection onward.
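
The two registration paths can be combined in a short Python sketch. The function name `register_unknown_words`, the set standing in for the additional dictionary 3014, and the value 0.95 for "a threshold close to 1" are all assumptions made for illustration.

```python
def register_unknown_words(unknown, dictionary, auto_threshold=0.95):
    """Add unknown words whose reliability exceeds `auto_threshold` to
    `dictionary` at once; return the remaining words for user review."""
    pending = []
    for word, r in unknown:
        if r > auto_threshold:
            dictionary.add(word)   # usable in later analysis immediately
        else:
            pending.append(word)   # shown to the user after post-processing
    return pending

additional_dictionary = {"東京"}
pending = register_unknown_words(
    [("携帯電話", 0.99), ("格差", 0.7)], additional_dictionary
)
```

Here the first word is registered automatically, while the second is left for the user to accept or reject.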

【００８７】It is also easy to let the user interactively choose between these unknown-word registration methods (including, in the latter case, setting the threshold).

【００８８】As additional functions, it is desirable to be able to specify the patterns of the rewrite rules used for post-processing and to specify whether individual rewrite rules are used or not. The interface is therefore described below with reference to FIGS. 7 to 11.

【００８９】First, a rewrite rule list window such as that shown in FIG. 7 is provided for listing the automatically acquired rewrite rules. The rewrite rules in the rewrite rule storage memory can be sorted by a user-specified method as in step 6011, or a search condition can be specified in step 6014, and the desired set of rewrite rules is presented as in step 6012. Each rewrite rule has an identification number, preceding and following contexts, the error pattern and correct pattern enclosed between them, and a reliability, all of which can be listed.

【００９０】Requesting detailed information on a particular rewrite rule, as in step 6013, opens a detailed information display window as in step 602. This window can present the basis for extracting the rewrite rule (step 6021), the locations in the training set where applying the rule succeeded (step 6022), and the locations where it failed (step 6023).

【００９１】It is also possible to specify that an individual rewrite rule is not to be used; as in step 6024, a button for disabling the rule is provided. The rule in the example of FIG. 7 concerns the disambiguation of the particle "と". It is unnecessary when only content words matter, as in information retrieval for example, so it is disabled to improve speed.

【００９２】As mentioned in the description of FIG. 4, the method used in step 4042 to generalize a concrete-word rewrite rule into a rewrite rule that does not depend on individual words is specified in advance in a file. Because such methods may need to be listed and edited in order to improve post-processing, a list/edit interface is required.

【００９３】In FIG. 8, 701 is an example of a window that presents the rewrite rule generalization methods. For example, looking at the character type and part of speech of the K-th word before the differing portion is written as "-K, character type, part of speech". "No omission" means that the word string is carried into the rewrite rule as is, and "＊" indicates that any word can be matched.

【００９４】To add a pattern other than the generalization patterns the system provides by default, the user requests pattern editing through area 7011, which calls up a new pattern creation window 801 as shown in FIG. 9. There the user can specify, for example as in area 8011, that the character type and part of speech of the two words before and after the error location are to be examined.

【００９５】Text with correct answers for training is assumed to have been supplied already. To supply a new text/correct-answer pair explicitly, the user presses the "Specify correct answer data" button 8012 to call up the training data specification window 901 shown in FIG. 10, specifies the new pair in area 9014, and presses the add button 9013 to add it to the training data designated for rewrite rule extraction, which is shown in area 9012. Conversely, the delete button 9011 removes correct-answer data that has been in use.

【００９６】The user then returns to the new pattern creation window 801 shown in FIG. 9 and operates button 8014, which requests extraction and re-evaluation of rewrite rules, to run the re-evaluation. Operating the erase button 8013 cancels a new pattern created in the new pattern creation window 801.

【００９７】As already mentioned, it is more convenient for the user if the definition of the "window" can also be changed. For this purpose it is desirable to provide, for example, a window 101 such as that shown in FIG. 11, with which a "window" can be specified by the number of sentences in simple cases, or by a structure such as "article".

【００９８】Area 1011 shows the type of window currently in use, and its definition can be checked in area 1012. A new window can also be defined by entering data in area 1013.

【００９９】Correct-answer data was created by morphologically analyzing 1,500 sentences taken from actual newspapers and correcting the results by hand. Rules were extracted and evaluated using 750 sentences, and the effect of post-processing was examined using the remaining 750. Compared with the accuracy before post-processing, accuracy improved by 5% on the training data and by 3% on the test data.

【０１００】Post-processing correctly fixed 603 of the errors, 246 of them through rewrites of the type that changes word boundaries. About 7% of all corrections were due to the "window".

【０１０１】[0101]

[Effects of the Invention] Rewrite rules for correcting morphological analysis that are difficult to prepare by hand, or whose mutual interactions are difficult to tune, are generated automatically together with their reliabilities. This makes possible post-processing that could not easily be performed before.

【０１０２】Furthermore, performing post-processing on each sentence set called a "window" makes it possible to refer to the analysis results of other sentences. This enables the resolution of ambiguities that were difficult to resolve with conventional methods, and the repair of morphological analysis errors caused by unregistered words that cannot be recovered by local pattern matching alone.

[Brief description of the drawings]

FIG. 1 is a diagram schematically showing an example of ambiguity in morphological analysis, using an example in which the ambiguity is retained as a lattice.

FIG. 2 is a diagram showing example word occurrences within a window, illustrating that ambiguity can be resolved by reflecting the analysis results obtained from other sentences appearing in the "window".

FIG. 3 is a diagram showing an example hardware configuration of a general computer on which the present invention can be run.

FIG. 4 is a diagram showing an example processing flow for acquiring rewrite rules.

FIG. 5 is a diagram illustrating an example of the post-processing method that simply uses rewrite rules alone, without a "window".

FIG. 6 is a diagram illustrating an example of the post-processing method that uses rewrite rules together with a "window".

FIG. 7 is a diagram illustrating an example of the rewrite rule list window, one part of the user interface.

FIG. 8 is a diagram illustrating an example of the generalization pattern list window, one part of the user interface.

FIG. 9 is a diagram illustrating an example of the new pattern creation window, one part of the user interface.

FIG. 10 is a diagram illustrating an example of the window for specifying the addition and deletion of correct-answer data, one part of the user interface.

FIG. 11 is a diagram illustrating an example of the window for specifying the definition and modification of windows, one part of the user interface.

[Explanation of symbols]

301: storage device, 302: input device, 303: main memory, 305: arithmetic unit, 306: display, 3011: document data file with correct answers, 3012: dictionary file, 3013: morphological analysis program file, 3014: additional dictionary file, 3015: rule storage file, 3016: rule extraction program file, 3017: post-processing program file, 3018: document file, 9012: display area, M1, M2, M3, M4: working memories, 401-405, 4041-4054: processing steps, 504-506, 5061-5064: processing steps, 501-503, 5031-5036, 50331-50342: processing steps, 601, 602, 701, 801, 901, 101: windows.

Claims

[Claims]

1. A post-processing method for Japanese morphological analysis, used in a Japanese morphological analysis process that divides unsegmented Japanese character string data into the words constituting it by using data in a storage device that holds a word dictionary and a storage device that holds a data table defining connection relationships between words, the method comprising: providing a storage device that holds document data to which correct Japanese morphological analysis results are attached; comparing the morphological analysis result of a document to be analyzed with the correct Japanese morphological analysis result held in the storage device, and identifying each erroneous portion of the analysis, the corresponding correct portion, and a predetermined number of words before and after the erroneous portion; generating a morphological analysis result rewriting rule that, in the morphological analysis result of an arbitrary document, replaces a word string that matches the erroneous portion and is sandwiched between word groups matching the preceding and following word groups with the correct word string; converting the morphological analysis result rewriting rule into a rewriting rule whose application conditions are relaxed by a predetermined matching-condition relaxation method, such as replacing a word with its part of speech; evaluating the reliability of the morphological analysis result rewriting rules; and rewriting the morphological analysis result of an arbitrary document using the morphological analysis result rewriting rules whose reliability is not less than a predetermined threshold.

2. The post-processing method of Japanese morphological analysis according to claim 1, wherein the morphological analysis processing and the subsequent processing are performed for each predetermined sentence set.

3. The described post-processing method for Japanese morphological analysis, wherein, for a group of rewriting rules of the type that generates a word different from the analysis result by changing a word boundary of the morphological analysis result, when applying the group would generate a specific word at a plurality of positions with different surrounding environments in the morphological analysis result of the document, the reliability of the group is re-evaluated in a manner that increases with the number of occurrences of the word, and the group is applied when the re-evaluated reliability exceeds a predetermined threshold.

4. A post-processing method for Japanese morphological analysis, used in a Japanese morphological analysis process that divides unsegmented Japanese character string data into the words constituting it by using a data storage device that holds a word dictionary and a data storage device that holds a data table defining connection relationships between words, wherein the analysis result of each sentence is retained together with its ambiguity, and, for an ambiguous portion, the solution candidate whose content words appear more frequently in the portions that have been analyzed without ambiguity is output preferentially.

5. A post-processing device for Japanese morphological analysis that performs Japanese morphological analysis processing on a computer having a storage device comprising a document data file with correct answers, a dictionary file, a morphological analysis program file, an additional dictionary file, a rewrite rule storage file, a rewrite rule extraction program file, a post-processing program file and a document file, an input device, a main memory, an arithmetic unit, and a display, wherein listing of the rewriting rules for morphological analysis results, designation of their use or non-use, and display and designation of the matching-condition relaxation method can be performed interactively on the display.

6. The post-processing device for Japanese morphological analysis according to claim 5, wherein the operations that the user can perform interactively include confirming and redefining the definition of the sentence set on which morphological analysis and post-processing are performed.