JP2798683B2

JP2798683B2 - Natural language processing system

Info

Publication number: JP2798683B2
Application number: JP63289599A
Authority: JP
Inventors: 美佐子島田; 佐敏山内; 中島　　勝; 和博井上; 延幸大呂
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-11-16
Filing date: 1988-11-16
Publication date: 1998-09-17
Anticipated expiration: 2013-09-17
Also published as: JPH038051A

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical field

The present invention relates to a natural language processing system and, more specifically, to a kana-kanji conversion device and a natural language analysis device that apply Dempster-Shafer (D-S) probability theory.

Prior art

Three methods are conventionally known for handling unregistered words (words not found in the dictionary):

1. Ignoring the first character of the unanalyzed character string and restarting the analysis.

2. Finding an attached word (function word) in the unanalyzed (subsequent) character string and restarting the analysis from there.

3. Attaching kan-on (Sino-Japanese reading units) to detect the words that make up a compound word.

These three methods are known as prior art, but they are applied only when conversion is impossible however the string is analyzed, that is, when no connectable candidate can be found. Consequently, when part of the character string matches a dictionary entry, the analysis simply continues as long as that candidate can be connected by local grammatical processing, even if it is not the desired candidate, and unregistered word processing is never entered. As for the timing of entering unregistered word processing, a method is also known in which the analysis is redone if the analysis result is a run of single-character reading notations, even when conversion has not become impossible as described above (see, for example, JP-A-59-144934).

Even in this case, however, the three kinds of unregistered word processing above are performed only once no candidate other than a run of single-character reading notations can be conceived. If any other candidate exists, unregistered word processing is not entered, which is a drawback.

Object

The present invention has been made to solve the drawbacks described above, and its object is to provide a natural language processing system that uses means for detecting and identifying the range of an unregistered word.

Constitution

To achieve the above object, the present invention performs kana-kanji conversion processing and comprises: a word dictionary that holds at least kana character strings representing readings and information on the corresponding notations; dictionary search means for searching the kana character strings representing readings in the word dictionary on the basis of a target kana character string to be converted; candidate extraction means for extracting notation character string candidates using the retrieved words; means for determining whether the extracted preceding and following words can be connected; means for giving a plurality of pieces of evidence as basic probabilities and combining them on the basis of Dempster-Shafer theory in order to decide, from the extracted word candidates, the word notation to be output; means for giving at least two pieces of evidence; means for extracting at least one candidate that forms a phrase (bunsetsu), using the likelihood and certainty of the result of the Dempster-Shafer combination for sequences of words that can be connected; and means for treating as subject to unknown word processing any range in which the result of the Dempster-Shafer combination does not reach the certainty threshold. The invention is described concretely below on the basis of its embodiments.

FIG. 1 is a block diagram showing an embodiment of the kana-kanji conversion device. The input unit 1 is where a character string is entered in kana or romaji, for example a keyboard. The dictionary storage unit is a dictionary that stores a plurality of words. The grammar table may be included here, or may be held in the form of a table or the like inside the kana-kanji conversion unit described next. The dictionary search unit searches the dictionary to obtain the required words. The kana-kanji conversion unit 10 converts the kana character string obtained from the input unit into mixed kanji-kana text and consists of a candidate extraction unit 2, a candidate evaluation unit 3, a D-S calculation unit 4, an unregistered word determination unit 5, a candidate determination unit 6, and an unregistered word processing unit 7. The candidate extraction unit 2 extracts the conceivable candidates from the input character string. The candidate evaluation unit 3 evaluates the candidates extracted by the candidate extraction unit 2. The D-S calculation unit 4 combines the evaluation results obtained by the candidate evaluation unit 3, applying the probability theory of Dempster and Shafer (D-S) to do so. A method of combining evaluation values by applying Dempster-Shafer probability theory has already been proposed as a "kana-kanji conversion device", so it is described only briefly here.

The evaluation values given by the various evaluation criteria proposed so far express how plausible a candidate is. Each extracted candidate can also be regarded as a possible event for the reading to be converted. Therefore, if the evaluation value from each criterion is given as a basic probability between 0 and 1 and these are combined, the probability of occurrence of a candidate can be obtained. The basic probabilities are combined using Dempster's rule of combination. Applying D-S probability theory to the basic probabilities combined in this way, that is, to the evaluation values, makes it possible to evaluate candidates by their certainty (how sure a candidate is; the belief Bel) and their likelihood (how plausible a candidate is; the plausibility Pl). The certainty and likelihood of a candidate A_k are obtained as follows.

Likelihood: $\mathrm{Pl}(A_k) = 1 - \mathrm{Bel}(\overline{A_k})$ (1 minus the probability that the candidate is something other than $A_k$)

The unregistered word determination unit 5 uses the certainty values obtained by the D-S calculation unit 4 to determine whether the character string may contain an unregistered word; it is a feature of the present invention and is explained later. The candidate determination unit 6 selects and fixes, from among the candidates other than those judged by the unregistered word determination unit 5 to possibly contain an unregistered word, the candidate most likely to be the correct conversion. The unregistered word processing unit 7 analyzes the string and determines the range of the unregistered word when the unregistered word determination unit 5 finds that every candidate may contain an unregistered word. The output unit 8 outputs the result thus converted into mixed kanji-kana text and is, for example, a display device.
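
As an illustration of the combination step, the following minimal Python sketch applies Dempster's rule of combination to two basic probability assignments and then reads off the certainty (Bel) and likelihood (Pl) of one candidate. The function names, the candidate labels, and the mass values are hypothetical and are not taken from the patent; the sketch shows the standard textbook rule, not necessarily the exact computation performed by the D-S calculation unit 4.

```python
from itertools import product


def combine(m1, m2):
    """Combine two basic probability assignments with Dempster's rule.

    Each assignment maps a frozenset of hypotheses to its mass (masses sum to 1).
    """
    combined = {}
    conflict = 0.0
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        common = s1 & s2
        if common:
            combined[common] = combined.get(common, 0.0) + v1 * v2
        else:
            conflict += v1 * v2  # mass falling on contradictory evidence
    if conflict >= 1.0:
        raise ValueError("the two pieces of evidence conflict completely")
    # Renormalise so that the remaining masses sum to 1.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def belief(m, hypothesis):
    """Certainty: total mass of the focal sets contained in the hypothesis."""
    return sum(v for s, v in m.items() if s <= hypothesis)


def plausibility(m, hypothesis):
    """Likelihood: total mass of the focal sets that intersect the hypothesis."""
    return sum(v for s, v in m.items() if s & hypothesis)


# Hypothetical evidence from two evaluation criteria over candidates A, B, C.
THETA = frozenset({"A", "B", "C"})
m_first = {frozenset({"A"}): 0.6, frozenset({"B"}): 0.1, THETA: 0.3}
m_second = {frozenset({"A"}): 0.5, frozenset({"C"}): 0.2, THETA: 0.3}

m = combine(m_first, m_second)
a = frozenset({"A"})
print(belief(m, a), plausibility(m, a))
```

The mass left on the whole set `THETA` represents ignorance; it is what separates the likelihood of a candidate from its certainty, and `plausibility(m, a)` equals `1 - belief(m, THETA - a)`, in line with the likelihood relation given in the text.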

The unregistered word determination unit 5 and the unregistered word processing unit 7, which constitute the present invention, are explained using this embodiment. For the kana character string entered from the input unit 1, the candidate extraction unit 2 extracts and stores independent-part candidates. The candidate evaluation unit 3 then evaluates each candidate, for example by the length of the independent part or by an evaluation formula. Attached words that can connect to the independent part are extracted, and phrase candidates are created. The evaluation results obtained by the candidate evaluation unit 3 are combined by the D-S calculation unit 4 to obtain the certainty and likelihood. Certainty and likelihood are assigned to the extracted candidates A, B, and C and to the case "other than these". (D-S theory presupposes that the correct answer is always among the hypotheses, but in kana-kanji conversion the extracted candidates do not always include the correct answer. With a hypothesis "other than these", the correct answer is always covered even when only wrongly analyzed candidates were extracted.) The unregistered word determination unit 5 then uses the certainty and likelihood to determine whether the character string being analyzed contains an unregistered word.

FIG. 5 shows a flowchart of the unregistered word determination unit 5. The certainty thresholds are set to, for example, A₁ = 0.5 and A₂ = about 0.2. The likelihood threshold B is set to, for example, about 0.5, and the threshold for the difference between likelihood and certainty to, for example, about 0.4. If any of the candidates falls under pattern 1 in the figure, the string contains no unregistered word. Conversely, if every candidate falls under pattern 3 or pattern 4 in the figure, or if the likelihood of the "other than these" candidate is greater than the likelihoods of the other candidates, the character string is regarded as containing an unregistered word and is processed by the unregistered word processing unit 7. The candidate to be finally output is decided by the candidate determination unit 6 using the certainty and likelihood obtained by the D-S calculation unit 4.
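
The decision described above can be sketched as follows. This is only a rough Python illustration under several assumptions: the pattern numbers (1 to 4) are taken as already assigned to each candidate by the FIG. 5 flowchart, whose exact conditions are not reproduced in the text; the function name and return values are hypothetical; checking pattern 1 first is an assumed ordering; and cases the text does not cover are reported as "undecided".

```python
def judge_unregistered(patterns, pl_candidates, pl_other):
    """Decide whether the string is treated as containing an unregistered word.

    patterns      -- pattern number (1-4) of each extracted candidate
    pl_candidates -- likelihood of each extracted candidate
    pl_other      -- likelihood of the "other than these" hypothesis
    """
    # Any candidate in pattern 1 means no unregistered word is present.
    if any(p == 1 for p in patterns):
        return "no_unregistered_word"
    # Every candidate is in pattern 3 or 4.
    all_weak = all(p in (3, 4) for p in patterns)
    # "Other than these" is more likely than every extracted candidate.
    other_dominates = all(pl_other > pl for pl in pl_candidates)
    if all_weak or other_dominates:
        return "unregistered_word"
    return "undecided"  # not specified in the text; assumption of this sketch


print(judge_unregistered([3, 4], [0.3, 0.2], 0.1))
```

In this sketch a result of "unregistered_word" corresponds to handing the string to the unregistered word processing unit 7.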

FIG. 2 shows a flowchart of one method for the unregistered word processing unit 7. The unregistered word processing unit 7 performs re-analysis when the number of candidates passed from the unregistered word determination unit 5 to the candidate determination unit 6 is zero, and it assumes that the character string being analyzed contains an independent word of reading length X. It is known empirically that most unregistered words are independent words such as katakana words, technical terms, colloquialisms, and slang, so the analysis proceeds on the assumption that the unregistered word is an independent word. To obtain independent word candidates of reading length X, candidates with reading lengths of 1 to n characters are created; n is set to, for example, about five characters. For each of these candidates, the possible attached parts and the independent parts of the following phrase are searched for. These searches can be carried out in much the same way as in the candidate extraction unit 2, but since it is not even known whether the part of speech of the independent part of reading length X is a nominal (taigen) or a declinable word (yogen), the search and evaluation assume that it can take any independent-word part of speech. For the following independent part, it is sufficient to keep the single candidate with the highest evaluation, as in normal analysis. As in normal analysis, the evaluation results are combined by the D-S calculation to obtain the certainty and likelihood. The reading length Y (1 ≤ Y ≤ n) of the candidate with the highest certainty and likelihood is taken as the range of the unregistered word, and the phrase is fixed. If the certainty is still sufficiently small at this point, that is, if no candidate has a certainty higher than Bel2 in the figure, the reading length X is regarded as satisfying X < n, only the reading length Y of the candidate with the highest certainty is fixed, and unregistered word processing is applied again to the subsequent character string (Bel2 may be equal to Bel1 or different). Table 1 shows how the analysis proceeds.
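
The candidate generation and the final choice in this method can be sketched in Python as follows. The function names, the default value used for Bel2, and the tie-breaking order (certainty first, then likelihood) are assumptions made for illustration; the search for attached parts and following independent parts, and the D-S combination that produces the scores, are taken as given.

```python
def reading_length_candidates(text, n=5):
    """Prefixes of reading length 1..n of the string to be re-analyzed."""
    return [text[:k] for k in range(1, min(n, len(text)) + 1)]


def choose_range(scored, bel2=0.2):
    """Pick the unregistered-word range from (prefix, certainty, likelihood) triples.

    Returns (prefix, reapply). reapply is True when no candidate's certainty
    exceeds bel2, meaning unregistered word processing should be applied
    again to the text that follows the chosen prefix.
    """
    prefix, certainty, _ = max(scored, key=lambda c: (c[1], c[2]))
    return prefix, not (certainty > bel2)


print(reading_length_candidates("せんりゃくはっそう"))
# Hypothetical scores for two of the prefixes.
print(choose_range([("せ", 0.05, 0.3), ("せんりゃく", 0.4, 0.7)]))
```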

Note that analysis is abandoned for reading-length (1 to n) candidates from which no phrase candidate can be generated, and for phrase candidates from which no second independent-part candidate can be generated.

Underlined portions indicate items that could be found in the dictionary.

FIG. 3 shows a flowchart of another method for the unregistered word processing unit 7. When the number of candidates passed from the unregistered word determination unit 5 to the candidate determination unit 6 is zero, it is assumed that the character string to be re-analyzed contains an independent word of reading length X.

To obtain independent word candidates of reading length X, kan-on are attached to the character string to be re-analyzed. The kan-on are listed in Table 2; they are almost the same as the on'yomi (Sino-Japanese readings) of kanji.

When unregistered words are technical terms, Sino-Japanese words (kango) are known to account for 60% or more of them. Kan-on are therefore used to extract candidates effectively. An example of a character string with kan-on attached is shown below.

せんりゃくはっそう → せん・りゃく・はっ・そう

The boundaries between these kan-on are used to create independent word candidates of reading length X. In the example above, the candidates take the forms せん, せん・りゃく, せん・りゃく・はっ, and so on. Considering that one kan-on corresponds to one character and that a four-character idiom contains at least one two-character Sino-Japanese word, the candidates may be limited to, for example, at most three kan-on. For each of these candidates, the possible attached parts and the independent parts of the following phrase are searched for. The subsequent evaluation and fixing can be carried out in the same way as in the unregistered word processing of FIG. 2. Table 3 shows how the analysis proceeds.
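
Candidate generation from the kan-on boundaries can be sketched in Python as follows. The function name is hypothetical, and the input is assumed to be already split into kan-on units; the splitting itself relies on Table 2 and is not shown.

```python
def kanon_candidates(units, max_units=3):
    """Independent word candidates built from the first 1..max_units kan-on."""
    limit = min(max_units, len(units))
    return ["".join(units[:k]) for k in range(1, limit + 1)]


# The example string from the text, already split at kan-on boundaries.
units = ["せん", "りゃく", "はっ", "そう"]
print(kanon_candidates(units))
```

Compared with taking every prefix of 1 to n characters, this produces fewer candidates because only kan-on boundaries are considered as possible word ends.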

Note that analysis is abandoned for kan-on (1 to m) candidates from which no phrase candidate can be generated, and for phrase candidates from which no second independent-part candidate can be generated.

FIG. 4 shows a flowchart of yet another method for the unregistered word processing unit 7. As in the examples of FIGS. 2 and 3, it is assumed that there is an independent word of reading length X. If a characteristic character, for example a long-vowel mark or a small "あ" (ぁ), is detected in the character string to be re-analyzed, the string is regarded as a katakana word: the span from the re-analysis start position up to the characteristic character (denoted C₁ below) is treated as an unregistered katakana word, and the search for an attached word or a following independent word starts from position (C₁ + 1). However, if the character at (C₁ + 1) is one that cannot begin a word, such as "ん" or a small "ゃ", it is also included in the range of the unregistered katakana word. The evaluation and fixing of this word can be carried out in the same way as in the unregistered word processing of FIG. 2. Once the range of the unregistered katakana word has been fixed, that range may be output in katakana. If no characteristic character is detected in the character string to be re-analyzed, the range of the unregistered word of reading length X can be fixed by the method of FIG. 2 or FIG. 3. The methods of FIG. 2 and FIG. 3 are not only independent of each other; it is even more effective to run the methods in combination, in the order FIG. 4, then FIG. 3, then FIG. 2, so that the most probable candidates are searched and evaluated first.
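
The katakana heuristic can be sketched in Python as follows. The character sets contain only the examples named in the text (a real system would presumably use larger sets), and the function names and the hiragana-to-katakana conversion by code point offset are assumptions made for illustration.

```python
# Only the examples named in the text; larger sets would be an extension.
CHARACTERISTIC = {"ー", "ぁ"}  # long-vowel mark, small "a"
NON_INITIAL = {"ん", "ゃ"}     # characters that cannot begin a word


def katakana_range_end(text, start=0):
    """Return the end index (exclusive) of the presumed katakana word, or None."""
    for i in range(start, len(text)):
        if text[i] in CHARACTERISTIC:
            end = i + 1  # position C1 + 1
            # A character that cannot begin a word is absorbed into the range.
            if end < len(text) and text[end] in NON_INITIAL:
                end += 1
            return end
    return None  # no characteristic character: fall back to another method


def to_katakana(hiragana):
    """Convert hiragana to katakana; other characters are left unchanged."""
    return "".join(
        chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in hiragana
    )


text = "めにゅーをみる"
end = katakana_range_end(text)
print(to_katakana(text[:end]), text[end:])
```

When `katakana_range_end` returns `None`, the string would be handed to the kan-on method of FIG. 3 and then to the reading-length method of FIG. 2, following the order suggested in the text.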

FIG. 6 is a block diagram showing an embodiment of the natural language analysis device according to claim 2. The input unit 11 is means for obtaining text that carries no delimiter information, for example mixed kanji-kana Japanese text or Chinese, and is, for example, an OCR device. The dictionary storage unit is a dictionary that stores a plurality of words. The grammar table may be included here, or may be held in the form of a table or the like inside the analysis unit described next. The dictionary search unit searches the dictionary to obtain the required words. The analysis unit 20 performs, for example, morphological analysis of the input character string and consists of the following six parts.

12 is a candidate extraction unit, 13 a candidate evaluation unit, 14 a D-S calculation unit, 15 an unknown word determination unit, 16 a candidate determination unit, and 17 an unknown word processing unit.

The candidate extraction unit 12 extracts analysis candidates from the input character string. The extracted candidates are evaluated by the candidate evaluation unit 13, for example by computing an evaluation value with an evaluation formula. The evaluation results obtained in this way are combined by the D-S calculation unit 14 to obtain the likelihood and certainty. For the plurality of candidates, the unknown word determination unit 15 uses the likelihood and certainty to determine whether an unknown word is contained. FIG. 7 shows a flowchart of the unknown word determination unit 15. The certainty thresholds A₁ and A₂ may be set to, for example, about 0.5 and 0.2, respectively. The likelihood threshold B is set to, for example, 0.5, and the threshold for the difference between likelihood and certainty to, for example, about 0.4. If any of the candidates falls under pattern 1, the character string contains no unknown word. Conversely, if every candidate falls under pattern 3 or pattern 4, the character string contains an unknown word, and the unknown word processing unit 17 fixes the range of the unknown word. Several methods of fixing the range of an unknown word are known; for example, a method that extracts attached words may be used. In this way, unknown word processing can be performed by the natural language analysis device using almost the same method as in claim 1. Although the embodiments above describe a kana-kanji conversion device and a natural language analysis device, the invention can also be applied to various office automation (OA) equipment, for example word processors, the input units of OA equipment, and translation machines.

Effects

As is clear from the above description, the kana-kanji conversion device uses the certainty and likelihood values obtained by the D-S calculation to determine whether a word is unregistered. Unlike conventional devices, it does not force a conversion when candidates exist but their certainty is low; by outputting the string unconverted as an unregistered word, it suppresses odd conversions. Furthermore, when a character string judged by the unregistered word determination unit to contain an unregistered word is re-analyzed, the several analysis results obtained are also judged using the certainty and likelihood obtained by the D-S calculation, so the range of the unregistered word is fixed appropriately and the propagation of conversion errors caused by unregistered words is suppressed.

In the natural language analysis device, the certainty and likelihood values obtained by applying the D-S calculation are likewise used in the analysis. Unlike conventional devices, it does not force candidates to be produced when an unknown word appears in the character string; the range of the unknown word is fixed appropriately, and analysis errors caused by the presence of unknown words are suppressed.

[Brief description of the drawings]

FIG. 1 is a block diagram of an embodiment of the kana-kanji conversion device; FIG. 2 is a flowchart of one method for the unregistered word processing unit; FIG. 3 is a flowchart of another method for the unregistered word processing unit; FIG. 4 is a flowchart of yet another method for the unregistered word processing unit; FIG. 5 is a flowchart of one method for the unregistered word determination unit; FIG. 6 is a block diagram of an embodiment of the natural language analysis device; and FIG. 7 is a flowchart of the unknown word determination unit.

1, 11: input unit; 2, 12: candidate extraction unit; 3, 13: candidate evaluation unit; 4, 14: D-S calculation unit; 5: unregistered word determination unit; 15: unknown word determination unit; 6, 16: candidate determination unit; 7: unregistered word processing unit; 17: unknown word processing unit; 8, 18: output unit; 9, 19: dictionary; 10: kana-kanji conversion unit; 20: analysis unit.

Continuation of front page

(72) Inventor: Kazuhiro Inoue (井上和博), c/o Ricoh Tottori Technology Development Co., Ltd., 342 Minamikuma, Tottori-shi, Tottori
(72) Inventor: 大呂延幸, c/o Ricoh Tottori Technology Development Co., Ltd., 342 Minamikuma, Tottori-shi, Tottori
(56) References cited: JP-A-57-141775 (JP, A); JP-A-62-251960 (JP, A); JP-A-57-127269 (JP, A); JP-A-62-130458 (JP, A); Journal of the Institute of Electronics and Communication Engineers of Japan, Vol. 66, No. 9 (September 1983), pp. 900-903
(58) Fields searched (Int. Cl.6, DB name): G06F 17/21 - 17/28; JICST file (JOIS)

Claims

(57) [Claims]

1. A natural language processing system that performs kana-kanji conversion processing, comprising: a word dictionary holding at least kana character strings representing readings and information on the notations corresponding thereto; dictionary search means for searching the kana character strings representing readings in the word dictionary on the basis of a target kana character string to be converted; candidate extraction means for extracting notation character string candidates using the retrieved words; means for determining whether the extracted preceding and following words can be connected; means for giving a plurality of pieces of evidence as basic probabilities and performing a combination operation based on Dempster-Shafer theory in order to decide, from the extracted word candidates, the word notation to be output; means for giving at least two pieces of evidence; means for extracting at least one candidate that forms a phrase, using the likelihood and certainty of the result of the combination operation based on Dempster-Shafer theory for sequences of words that can be connected; and means for treating as subject to unknown word processing any range in which the result of the combination operation based on Dempster-Shafer theory does not reach a certainty threshold.