JPH06149872A

JPH06149872A - Text input device

Info

Publication number: JPH06149872A
Application number: JP4302638A
Authority: JP
Inventors: Shinji Miwa; 真司三輪
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1992-11-12
Filing date: 1992-11-12
Publication date: 1994-05-31

Abstract

PURPOSE: To provide a text input device capable of post-processing input text effectively with far fewer computer resources than before. CONSTITUTION: A high-likelihood word extraction unit 104 extracts from the input unit string only words with a high word extraction likelihood. A morphological analysis unit 106 performs morphological analysis using the substrings delimited by those words as the unit of processing. This reduces errors when the input unit string is divided into substrings and allows morphological analysis to run at high speed. It also makes it possible to select input candidates and detect errors by making effective use of the continuity of the text at the front and rear ends of each processing unit.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a text input device that inputs text as electronic data and has a mechanism for detecting errors when the input text contains them.

[0002]

[Prior Art] In recent years there has been a growing need to convert documents, whether existing or new, into electronic data. However, input based on pattern recognition of characters, speech, and the like can suffer misrecognition in the recognition unit, and even when an operator types a document on a keyboard, the possibility that errors enter the input text through, for example, kana-kanji misconversion cannot be ruled out. Various text input devices have therefore been proposed that include a post-processing unit for text input, in order to find such errors in digitized text and either correct them automatically or assist the operator's correction work.

[0003] Such a post-processing unit generally performs morphological analysis: it compares the input unit string with a word dictionary to extract substrings that may be words and then, for word sequences formed by combining non-overlapping word candidates, determines the word sequence that appears correct on the basis of grammatical connectability between words and statistical co-occurrence frequencies. Conventional devices of this kind performed morphological analysis on all candidates of every input unit. With this approach, processing a string of N characters with M candidate characters at each position requires matching M^N (M to the N-th power) combinations against the word dictionary. Matching therefore takes time, and wrong combinations of candidate characters may be extracted as word candidates during matching, which increases the load of the subsequent process that checks connectability between words.
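The M^N growth described above can be made concrete with a minimal sketch. The function name and the three-position lattice below are illustrative assumptions; in particular the candidates "太" and "重" are invented for the example and do not come from the patent.

```python
from itertools import product

def candidate_strings(lattice):
    """Yield every string obtainable by picking one candidate per position."""
    for combo in product(*lattice):
        yield "".join(combo)

# Hypothetical recognizer output: N = 3 positions, M = 3 candidates each.
lattice = [["大", "犬", "太"], ["容", "客", "各"], ["量", "章", "重"]]
strings = list(candidate_strings(lattice))
print(len(strings))  # 27, i.e. 3 ** 3 strings to match against the dictionary
```

At N = 10 positions the same lattice shape already yields 3 ** 10 = 59049 strings, which is the cost the invention sets out to avoid.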

[0004]

[Problems to be Solved by the Invention] For this reason, techniques have been devised that prevent a combinatorial explosion in the number of candidate strings by splitting the input character string at punctuation marks or at changes of character type (kanji, hiragana, katakana, symbols, and so on) and using the resulting short substrings as the unit of processing. However, when the input string is split at punctuation marks, a substring can be several tens of characters long, so the processing load may not be reduced sufficiently. When it is split at changes of character type, the string may be cut in the middle of a single word by mistake, and that part may be rejected by the subsequent morphological analysis. Moreover, splitting at punctuation marks or character-type changes loses the context of each substring that serves as a processing unit, which makes it difficult to select the correct candidate from the input character candidates at the ends of the substring.

[0005] Furthermore, if word candidates are extracted from the input character string for every word in the word dictionary, short hiragana words in particular are often wrongly cut out as word candidates, which also increases the time required for matching and the load on subsequent processing.

[0006] The present invention solves the problems of the conventional post-processing techniques described above, namely that word candidates are likely to be wrongly extracted from the input character string during morphological analysis and that processing takes a great deal of time. Its object is to provide a text input device that can post-process input text effectively with fewer computer resources and that an operator can use comfortably.

[0007]

[Means for Solving the Problems] The text input device of the present invention inputs text as a string of predetermined input units and comprises: an input device that obtains one or more input candidates for each input unit; a high-likelihood word extraction unit that takes as word candidates those substrings of the input unit string that match the notation data of words whose word extraction likelihood is higher than a set value, and attaches word information to each word candidate; a morphological analysis unit that, from the word information of the words extracted by the high-likelihood word extraction unit and the input candidates of the substrings not recognized as words, calculates the likelihood of combinations of input candidates using at least one of a word dictionary, connection rules for input units, connection rules for words, grammar rules, and semantic rules; and an erroneous input detection unit that detects, as input errors, substrings of the input text for which the likelihood obtained by the combination likelihood calculation unit is lower than a fixed value.

[0008] The text input device according to claim 1 of the present invention comprises: a word likelihood dictionary containing at least notation data of words, written as strings of input units, and word extraction likelihood data; and a high-likelihood word extraction unit that uses the word likelihood dictionary to take as word candidates those substrings of the input unit string that match the notation data of words whose word extraction likelihood is higher than a set value, and attaches word information to each word candidate.

[0009] The text input device according to claim 2 of the present invention comprises: a high-likelihood word dictionary built by extracting, from a word dictionary holding the general vocabulary handled by the device, only those words whose precomputed word extraction likelihood is greater than a set value; and a high-likelihood word extraction unit that takes as word candidates those substrings of the input unit string that match the notation data of words in the high-likelihood word dictionary, and attaches word information to each word candidate.

[0010] The text input device according to claim 3 of the present invention comprises: a high-likelihood word dictionary containing one or more of alphabetic words of at least K characters (K = 2, 3, ...), hiragana words of at least L characters (L = 2, 3, ...), katakana words of at least M characters (M = 2, 3, ...), and mixed kana-kanji words of at least N characters (N = 2, 3, ...); and a high-likelihood word extraction unit that takes as word candidates those substrings of the input unit string that match the notation data of words in the high-likelihood word dictionary, and attaches word information to each word candidate.

[0011] In the text input device according to claim 4 of the present invention, entries for inflecting words in the word likelihood dictionary or the high-likelihood word dictionary carry inflection information. The device comprises an inflected-form generation unit that refers to this information and generates, for each inflecting word, forms in which an inflectional ending is attached to the stem. The high-likelihood word extraction unit treats as targets of word extraction both the non-inflecting words listed in the word likelihood dictionary or high-likelihood word dictionary and the inflected forms generated by the inflected-form generation unit.

[0012] In the text input device according to claim 5 of the present invention, when the character input means outputs multiple candidate characters, only the character string formed by the first-ranked candidate characters is processed by the high-likelihood word extraction unit.

[0013]

[Operation] In the text input device of the present invention, the high-likelihood word extraction unit extracts words with a high word extraction likelihood, the input unit string is divided at the high-likelihood words so extracted, and the morphological analysis unit processes each resulting substring.

[0014]

[Embodiment] FIG. 1 is an overall block diagram of one embodiment of the present invention.

[0015] Reference numeral 101 denotes an input means that converts input units into codes; it may be an OCR, an online handwriting input device, a character input device such as a keyboard, a speech recognition device, or the like. These input devices are assumed to yield one or more ordered input candidates on the basis of at least one of the recognition score and a similar-character table that reflects the characteristics of the input means.

[0016] Reference numeral 102 denotes the word likelihood dictionary.

[0017] Reference numeral 103 denotes the inflected-form generation unit, which generates the possible inflected forms of the inflecting words contained in the word likelihood dictionary.

[0018] Reference numeral 104 denotes the high-likelihood word extraction unit.

[0019] Reference numeral 105 denotes the part-of-speech connection rule table.

[0020] Reference numeral 106 denotes the morphological analysis unit. For combinations of the word information attached to word candidates by the high-likelihood word extraction unit and the input character candidates of the parts not recognized as words, it performs morphological analysis that calculates likelihoods using the word dictionary and the part-of-speech connection rules.

[0021] Reference numeral 107 denotes the erroneous input detection unit, which detects as input errors those substrings whose likelihood, as calculated by the combination likelihood calculation unit, is lower than a fixed value.

[0022] FIG. 2 shows the processing flow of the high-likelihood word extraction unit, FIG. 3 the processing flow of the combination likelihood calculation unit, FIG. 4 part of the word likelihood dictionary, and FIG. 5 part of the part-of-speech connection likelihood table.

[0023] The operation of the text input device of the present invention is described below, taking as an example the input of the Japanese sentence shown in FIG. 6.

[0024] This embodiment assumes as the input means an OCR that outputs up to the third-ranked candidate character; in this case the input unit is a character. Other possible input means include an online handwriting input device, a character input device such as a keyboard, and a speech recognition device. For a character input device the input unit is a character; for a speech recognition device it is a recognition unit specific to that device, such as a phoneme or a syllable. FIG. 7 shows the result of reading the Japanese sentence of FIG. 6 with an OCR that outputs up to the third-ranked candidate character.

[0025] The high-likelihood word extraction unit matches the input character string against dictionary entries whose first character is each character of the input string in turn, and registers word information in a word candidate table whenever an entry matches. When this has been done for every character of the input string, it outputs the word candidate table and information on the unprocessed character strings, and terminates. As a concrete example, the processing of the character candidate string of FIG. 7 is described following the flow of FIG. 2. The unit first initializes its internal state (steps 201, 202). In the initial state the pointer into the input string is n = 1 and the character is C(1) = "光", so words whose first character is "光" and whose word extraction likelihood is at or above the set value are obtained from the word likelihood dictionary and the inflected-form generation unit. The word extraction likelihood of a word is the proportion of cases in which, when word candidates are generated simply by pattern matching between the input string and the notation data in the dictionary, that word is included in the correct morphological analysis result. In this embodiment, entries of the word likelihood dictionary of FIG. 4 whose word extraction likelihood exceeds 0.5 are processed by the high-likelihood word extraction unit. Accordingly, the entry ""光" (noun), i = 1, p = 0.3" in the dictionary of FIG. 4 is rejected when the high-likelihood word extraction unit looks up words.
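The definition of the word extraction likelihood and the 0.5 threshold can be sketched as follows. This is a minimal illustration, assuming the likelihood is estimated from occurrence counts over an analyzed corpus; the `Entry` type, the function names, and the counting scheme are hypothetical, while the three entries and their p values are the ones quoted in the description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    surface: str  # notation data
    pos: str      # part of speech
    p: float      # word extraction likelihood

def extraction_likelihood(matched: int, correct: int) -> float:
    """Share of pattern-matched occurrences that appear in the correct analysis."""
    return correct / matched if matched else 0.0

THRESHOLD = 0.5  # value used in the embodiment

def is_high_likelihood(entry: Entry) -> bool:
    """The embodiment processes entries whose likelihood exceeds the threshold."""
    return entry.p > THRESHOLD

entries = [
    Entry("光", "noun", 0.3),
    Entry("光", "verb", 0.6),
    Entry("ディスク", "noun", 0.7),
]
accepted = [e for e in entries if is_high_likelihood(e)]
print([(e.surface, e.pos) for e in accepted])
```

Only the verb entry for "光" and the noun "ディスク" pass the filter; the noun entry for "光" is rejected, as stated in the text.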

[0026] The next dictionary entry, ""光" (verb), i = 1, p = 0.6, ra-row godan (five-grade) conjugation", is an inflecting word, so the inflected-form generation unit generates "光ら", "光り", "光っ", "光る", and "光れ" and passes them to the high-likelihood word extraction unit. Besides appending the inflectional ending to the notation data in the dictionary and replacing the word length i with the length of the generated form, the inflected-form generation unit may also process the word extraction likelihood p, for example by varying it according to the inflected form. Each word obtained in this way (steps 203, 204) is compared with the substring of the input string {C(n), C(n+1)} = "光デ" (step 205). None matches (step 206), so if further words whose first character is "光" (for example "光学" or "光速") are registered in the dictionary, step 203 onward is repeated for them. When this has been done for all words beginning with "光", the pointer n into the input string is advanced by one (step 208). For n = 2, that is, for the character C(2) = "デ", the same processing finds that the registered word ""ディスク", i = 4, p = 0.7" matches the input string, so ""ディスク" (noun), start position = 2, end position = 5" is registered in the word candidate table (step 207). Processing is repeated in this way up to n = 51. When the pointer n is advanced again it points past the end of the input string (step 209), so the word candidate table and the information on unprocessed character strings are output (step 210) and processing ends.
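The scan of steps 201 to 210 can be sketched as below. This is a minimal sketch under stated assumptions: the tuple layout of the dictionary, the "ra-godan" tag, and the function names are hypothetical, and only the first five characters of the example input, as implied by the positions quoted above, are used. The five endings are the ones listed in the text.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    surface: str
    pos: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

# Endings attached to the stem of a ra-row godan verb, as listed in the text.
RA_GODAN_ENDINGS = ["ら", "り", "っ", "る", "れ"]

def surface_forms(stem, conjugation):
    """Return the forms to match: inflected forms for verbs, the word itself otherwise."""
    if conjugation == "ra-godan":
        return [stem + ending for ending in RA_GODAN_ENDINGS]
    return [stem]

def extract_high_likelihood(text, dictionary, threshold=0.5):
    """Scan every start position and register dictionary matches above the threshold."""
    table = []
    for n in range(1, len(text) + 1):  # pointer n, 1-based as in the text
        for stem, pos, p, conjugation in dictionary:
            if p <= threshold:
                continue  # low-likelihood entries are skipped at this stage
            for form in surface_forms(stem, conjugation):
                if text.startswith(form, n - 1):
                    table.append(Candidate(form, pos, n, n + len(form) - 1))
    return table

dictionary = [
    ("光", "noun", 0.3, None),
    ("光", "verb", 0.6, "ra-godan"),
    ("ディスク", "noun", 0.7, None),
]
table = extract_high_likelihood("光ディスク", dictionary)
print(table)  # [Candidate(surface='ディスク', pos='noun', start=2, end=5)]
```

A production implementation would index the dictionary by first character, as the description suggests, instead of testing every entry at every position.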

[0027] FIG. 8 schematically shows the output of the high-likelihood word extraction unit. In the figure, boxes indicate the word candidates extracted by the high-likelihood word extraction unit together with their part-of-speech names. For an input string of length 51, the longest unprocessed string in this result has length 6, which shows that the input string has been divided effectively.
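The unprocessed strings left between the extracted words can be computed from the word candidate table as in the following sketch. The function name is an assumption, and the spans cover only the positions quoted in the description (2 to 5, 12 to 13, 14 to 15) within a 15-character prefix; the full contents of FIG. 8 are not reproduced, so the resulting runs are illustrative only.

```python
def unprocessed_runs(length, spans):
    """Return (start, end) runs, 1-based inclusive, not covered by any candidate span."""
    covered = [False] * (length + 1)  # index 0 is unused
    for start, end in spans:
        for i in range(start, end + 1):
            covered[i] = True
    runs, run_start = [], None
    for i in range(1, length + 1):
        if not covered[i] and run_start is None:
            run_start = i                    # a new uncovered run begins
        elif covered[i] and run_start is not None:
            runs.append((run_start, i - 1))  # the run ended just before i
            run_start = None
    if run_start is not None:
        runs.append((run_start, length))
    return runs

spans = [(2, 5), (12, 13), (14, 15)]
runs = unprocessed_runs(15, spans)
longest = max(end - start + 1 for start, end in runs)
print(runs, longest)  # [(1, 1), (6, 11)] 6
```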

[0028] In this embodiment, in the process of obtaining words from the dictionary in step 203, the inflected-form generation unit generates inflected forms from the stem recorded in the dictionary and its inflection information, and all inflected forms are passed to the high-likelihood word extraction unit. An alternative is to extract stems and inflectional endings separately and have the subsequent morphological analysis unit test whether they connect.

[0029] Also, in this embodiment the string formed by only the first-ranked candidates of the input characters is processed by the high-likelihood word extraction unit. Alternatives are for the high-likelihood word extraction unit to extract words from combinations of candidate characters down to a set rank, or to extract high-likelihood words from the first-ranked string and then extract high-likelihood words from combinations of candidate characters in the unprocessed parts.
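The two input choices discussed here, the first-ranked string and a lattice limited to a set rank, can be sketched as follows. The function names and the candidate lattice are hypothetical; only "光" and "デ" as first-ranked characters come from the description.

```python
def first_candidate_string(lattice):
    """Build the string formed by the first-ranked candidate at each position."""
    return "".join(position[0] for position in lattice)

def truncate_lattice(lattice, max_rank):
    """Keep only candidates up to the given rank at each position."""
    return [position[:max_rank] for position in lattice]

# Hypothetical OCR output with three ranked candidates per position.
lattice = [["光", "米", "先"], ["デ", "テ", "ヂ"], ["ィ", "イ", "ノ"]]
print(first_candidate_string(lattice))  # 光ディ
print(truncate_lattice(lattice, 2))
```

Matching against the first-ranked string costs one pass over N characters, whereas a lattice truncated to rank r still has r ** N combinations, which is why the embodiment starts from the first-ranked string.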

[0030] The morphological analysis unit extracts word candidates for the substrings left unprocessed by the high-likelihood word extraction unit, then calculates connection likelihoods between word candidates on the basis of part-of-speech connection likelihoods, and ranks the combinations of word candidates obtained from the combinations of input character candidates. As a concrete example, the processing of character positions n = 9 to 15 in the output of the high-likelihood word extraction unit shown in FIG. 8, that is, the part "大容量記憶装置" (mass storage device), is described following the flow of FIG. 3. For n = 9 to 11 the high-likelihood word extraction unit found no word matching the string, so the morphological analysis unit extracts word candidates (step 301). Here, all words in the word likelihood dictionary (FIG. 4) are targets of word candidate extraction.

[0031] In this example, the following seven word candidates are extracted (FIG. 9):

  "大" (prefix noun): start position = 9, end position = 9
  "犬" (noun): start position = 9, end position = 9
  "容量" (noun): start position = 10, end position = 11
  "客" (noun): start position = 10, end position = 10
  "各" (prefix noun): start position = 10, end position = 10
  "章" (noun / suffix numeral): start position = 11, end position = 11
  "量" (noun): start position = 11, end position = 11

For n = 12 to 13 and n = 14 to 15, the high-likelihood word extraction unit has already extracted "記憶" (noun) and "装置" (noun), so these are used as the word candidates. Next, connection likelihoods are calculated for the combinations of adjoining word candidates (step 302). Various methods of calculation are conceivable; as one example, for each combination of word candidates in a given section, the product of the connection likelihoods between all adjacent word candidates is taken as the connection likelihood of that combination. FIG. 10 shows an example of applying this method to the combinations of the above word candidates using the part-of-speech connection likelihood table of FIG. 5. As a result, the word sequence "大・容量・記憶・装置" becomes the most likely combination, which agrees with the input character string.
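The product-of-connection-likelihoods method can be sketched as follows. The word candidates and positions are those listed above, but the part-of-speech connection likelihoods are hypothetical values chosen for illustration, because the table of FIG. 5 is not reproduced in the text. "章" is simplified to a plain noun, and the tag "prefix" stands for prefix noun.

```python
from math import prod

# Word candidates as (surface, part of speech, start, end) for positions 9 to 15.
candidates = [
    ("大", "prefix", 9, 9), ("犬", "noun", 9, 9),
    ("容量", "noun", 10, 11), ("客", "noun", 10, 10), ("各", "prefix", 10, 10),
    ("章", "noun", 11, 11), ("量", "noun", 11, 11),
    ("記憶", "noun", 12, 13), ("装置", "noun", 14, 15),
]

# Hypothetical connection likelihoods keyed by (left POS, right POS).
CONNECT = {
    ("prefix", "noun"): 0.8, ("noun", "noun"): 0.5,
    ("prefix", "prefix"): 0.1, ("noun", "prefix"): 0.2,
}

def paths(start, end):
    """Enumerate all candidate sequences that tile positions start..end exactly."""
    if start > end:
        yield []
        return
    for cand in candidates:
        if cand[2] == start and cand[3] <= end:
            for rest in paths(cand[3] + 1, end):
                yield [cand] + rest

def connection_likelihood(path):
    """Product of the connection likelihoods of every adjacent pair in the path."""
    return prod(CONNECT[(a[1], b[1])] for a, b in zip(path, path[1:]))

ranked = sorted(paths(9, 15), key=connection_likelihood, reverse=True)
best = "・".join(c[0] for c in ranked[0])
print(best)  # 大・容量・記憶・装置
```

With these assumed values the top sequence scores 0.8 × 0.5 × 0.5 = 0.2, ahead of "犬・容量・記憶・装置" at 0.125. Exhaustive enumeration is adequate for short processing units; a dynamic-programming search would be the usual choice for longer ones.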

[0032] In this example the above processing selects the correct combination of candidate characters from the input candidate strings. However, morphological analysis may fail: a substring that cannot be recognized as a word may arise, or the connection likelihood may fall short of the permitted range for every combination of word candidates. One cause of such failure is that the high-likelihood word extraction unit wrongly extracts a word candidate, which then disrupts the morphological analysis unit. In that case, when morphological analysis fails and a high-likelihood word candidate lies adjacent to the point of failure, a correct result can be obtained by rejecting that word candidate, extracting word candidates again from the input character candidates with low-likelihood words as targets, and repeating the morphological analysis (step 303).

[0033] When the high-likelihood word extraction unit extracts several overlapping high-likelihood word candidates, as in the section n = 33 to 37 of FIG. 8, connection likelihoods are calculated and ranked both for the case where "オンライン" (online) is the word candidate and for the case where "ライン" (line) is the word candidate. In the latter case, word candidates for n = 33 and 34 are extracted from the combinations of input character candidates.

[0034] In actual processing, the processing unit of the morphological analysis unit is taken to be a substring left unprocessed by the high-likelihood word extraction unit together with the high-likelihood words before and after it, such as n = 2 to 13 or n = 17 to 27 in FIG. 8. This shortens the processing unit and keeps the number of combinations of character and word candidates small. Substrings cut out in this way may contain punctuation marks and symbols. By treating such special characters as special parts of speech when calculating connection likelihoods, candidate selection and error detection for special characters, which were difficult with conventional methods, become possible.
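Forming a processing unit from an unprocessed run and its neighboring high-likelihood words can be sketched as follows. The function name is an assumption; the spans are the word positions quoted in the description (2 to 5, 12 to 13, 14 to 15), and the runs are the gaps they leave in a 15-character prefix.

```python
def processing_units(runs, spans):
    """Extend each unprocessed run to cover the high-likelihood word on either side."""
    units = []
    for run_start, run_end in runs:
        unit_start, unit_end = run_start, run_end
        for start, end in spans:
            if end == run_start - 1:    # word ends immediately before the run
                unit_start = min(unit_start, start)
            if start == run_end + 1:    # word begins immediately after the run
                unit_end = max(unit_end, end)
        units.append((unit_start, unit_end))
    return units

spans = [(2, 5), (12, 13), (14, 15)]
runs = [(1, 1), (6, 11)]
print(processing_units(runs, spans))  # [(1, 5), (2, 13)]
```

The second unit, positions 2 to 13, agrees with the n = 2 to 13 example given in the text. Because neighboring units share the boundary word, its part of speech is available when scoring both sides.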

[0035] Substrings that still cannot be morphologically analyzed after the above processing are detected and output by the erroneous input detection unit. The correct string is then obtained through, for example, correction by the operator. During this correction, information such as the word candidates generated in the course of processing by the high-likelihood word extraction unit and the morphological analysis unit can be displayed as appropriate, making it possible to provide a user interface with excellent operability.
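The thresholding performed by the erroneous input detection unit can be sketched as follows. The function name, the segment texts, the likelihood values, and the threshold are all hypothetical; `None` stands for a segment for which no analysis was found.

```python
def detect_input_errors(segments, min_likelihood):
    """Return the segments whose best analysis is missing or below the threshold."""
    flagged = []
    for text, best_likelihood in segments:
        if best_likelihood is None or best_likelihood < min_likelihood:
            flagged.append(text)
    return flagged

# Hypothetical results: (segment text, likelihood of its best word sequence).
segments = [("大容量記憶装置", 0.2), ("光デイスク", None), ("犬容量", 0.01)]
print(detect_input_errors(segments, min_likelihood=0.05))  # ['光デイスク', '犬容量']
```

The flagged segments are what would be passed to the operator, together with the word candidates produced earlier in the pipeline.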

[0036] The text input device according to claim 2 of the present invention includes a high-likelihood word dictionary built by extracting words with a high word extraction likelihood from a word dictionary holding the general vocabulary handled by the device. By performing high-likelihood word extraction with this dictionary, it achieves the same effect as the text input device of this embodiment while reducing the amount of computation at run time.

[0037] The text input device according to claim 3 of the present invention includes a high-likelihood word dictionary composed of words whose precomputed word extraction likelihood is greater than a set value, and performs high-likelihood word extraction by pattern matching between the words in this dictionary and the input character string.

[0038] The text input device according to claim 4 of the present invention includes a high-likelihood word dictionary composed of a word group that includes one or more of the following: a word group of alphabetic words with a word length of K characters or more (K ≧ 2, preferably K ≧ 3), a word group of hiragana words with a word length of L characters or more (L ≧ 2, preferably L ≧ 3), a word group of katakana words with a word length of M characters or more (M ≧ 2), and a word group of mixed kana-kanji words with a word length of N characters or more (N ≧ 2). The high-likelihood word extraction process is performed by pattern matching between the words listed in this dictionary and the input character string. In general, long words, and words written in a character type that contains a large number of distinct characters, such as kanji, are seldom extracted as word candidates by mistake. Therefore, by setting a minimum word length for each character type and building the high-likelihood word dictionary from words at least that long, the same effect as that of the text input device of the present embodiment is obtained.
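The per-character-type length rule can be illustrated with a short sketch. The minimum lengths below (K = 3, L = 3, M = 2, N = 2) are example values picked from the ranges stated above, the Unicode ranges are a simplification, and treating any word that contains a kanji and otherwise only kana or kanji as a "mixed kana-kanji word" is an assumption of this sketch.

```python
# Example minimum lengths per character type (assumed values within the stated ranges).
MIN_LENGTH = {"alpha": 3, "hiragana": 3, "katakana": 2, "kanji_kana": 2}


def _is_hiragana(ch):
    return "\u3041" <= ch <= "\u309f"


def _is_katakana(ch):
    return "\u30a0" <= ch <= "\u30ff"


def _is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def char_type(word):
    """Classify a word by character type, or return None if it fits no class."""
    if not word:
        return None
    if all(ch.isascii() and ch.isalpha() for ch in word):
        return "alpha"
    if all(_is_hiragana(ch) for ch in word):
        return "hiragana"
    if all(_is_katakana(ch) for ch in word):
        return "katakana"
    if any(_is_kanji(ch) for ch in word) and all(
        _is_kanji(ch) or _is_hiragana(ch) or _is_katakana(ch) for ch in word
    ):
        return "kanji_kana"
    return None


def build_dictionary(words, min_length=MIN_LENGTH):
    """Keep words that meet the minimum length for their character type."""
    kept = set()
    for word in words:
        kind = char_type(word)
        if kind is not None and len(word) >= min_length[kind]:
            kept.add(word)
    return kept


sample = ["OCR", "of", "これ", "しかし", "データ", "入力", "は", "読み取り"]
selected = build_dictionary(sample)
```

Short hiragana words such as particles are dropped, which reflects the observation that short words in a small character set are the ones most likely to be matched by accident.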

【００３９】[0039]

[Effects of the Invention] As described above, in the text input device of the present invention, the high-likelihood word extraction unit extracts words with high word extraction likelihood, the input unit string is delimited by the extracted high-likelihood words, and the morphological analysis unit processes each resulting substring. Errors such as splitting in the middle of a word are therefore rare, and the input unit string is divided into substrings effectively. As a result, morphological analysis that uses these substrings as its processing units can be performed at high speed. In addition, the part-of-speech information and other attributes of the high-likelihood words that mark the boundaries between processing units can be used in the morphological analysis, so input candidates can be selected and errors detected at the front and rear ends of each processing unit by making effective use of the continuity of the text. Portions such as punctuation marks and symbols, for which candidate selection and error detection have conventionally been difficult, can also be processed effectively.
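The division of the input into processing units at the extracted high-likelihood words might look like the sketch below. It is a simplification under stated assumptions: the input is a single string rather than a sequence of multi-candidate input units, the anchor positions are given as hypothetical (start, end) pairs, and overlapping anchors are resolved by keeping the earlier one.

```python
def split_at_anchors(text, anchors):
    """Split text into ("anchor", str) and ("segment", str) pieces.

    anchors: iterable of (start, end) spans of high-likelihood words.
    An anchor that overlaps an earlier accepted anchor is skipped.
    """
    pieces = []
    pos = 0
    for start, end in sorted(anchors):
        if start < pos:
            continue  # overlaps the previous anchor, so ignore it
        if start > pos:
            pieces.append(("segment", text[pos:start]))
        pieces.append(("anchor", text[start:end]))
        pos = end
    if pos < len(text):
        pieces.append(("segment", text[pos:]))
    return pieces


pieces = split_at_anchors("文章入力の装置です", [(0, 4), (5, 7)])
```

Each "segment" piece would then be analyzed on its own, with the neighboring "anchor" pieces supplying the part-of-speech context at its two ends.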

[Brief description of drawings]

FIG. 1 is an overall configuration diagram of an embodiment of the present invention.

FIG. 2 is a diagram showing the processing flow of the high-likelihood word extraction unit.

FIG. 3 is a diagram showing the processing flow of the combination likelihood calculation unit.

FIG. 4 is a diagram showing a part of the word likelihood dictionary.

FIG. 5 is a diagram showing a part of the part-of-speech connection likelihood table.

FIG. 6 is a diagram showing an example of a Japanese sentence used to explain the operation of the present invention.

FIG. 7 is a diagram showing the result of reading the Japanese sentence of FIG. 6 with an OCR that outputs up to the third-ranked candidate character.

FIG. 8 is a diagram schematically showing the output of the high-likelihood word extraction unit.

FIG. 9 is a diagram showing an example in which the morphological analysis unit extracts word candidates from combinations of input character candidates.

FIG. 10 is a diagram showing a calculation example of the connection likelihood of a combination of word candidates.

[Explanation of symbols]

101: Input means for converting an input unit into a code
102: Word likelihood dictionary
103: Inflected-form generation unit
104: High-likelihood word extraction unit
105: Part-of-speech connection rule table
106: Combination likelihood calculation unit
107: Erroneous input detection unit

Claims

[Claims]

1. A text input device for inputting text as a sequence of set input units, comprising: input means for obtaining one or more input candidates for each input unit; a word likelihood dictionary containing, for each word, at least notation data described as a sequence of input units and word extraction likelihood data; a high-likelihood word extraction unit that takes as word candidates the substrings of the input unit sequence that match the notation data of words in the word likelihood dictionary whose word extraction likelihood is higher than a set value, and attaches word information to each word candidate; and a morphological analysis unit that calculates the likelihood of combinations of input unit candidates from the word information of the word candidates extracted by the high-likelihood word extraction unit and the candidate strings of the input units in the remaining portions, using at least one of a word dictionary, input unit connection rules, part-of-speech connection rules, word connection rules, grammar rules, and semantic rules.

2. A text input device for inputting text as a sequence of set input units, comprising: input means for obtaining one or more input candidates for each input unit; a high-likelihood word dictionary that is a subset of a general word dictionary; a high-likelihood word extraction unit that takes as word candidates the substrings of the input unit sequence that match the notation data of words in the high-likelihood word dictionary, and attaches word information to each word candidate; and a morphological analysis unit that calculates the likelihood of combinations of input unit candidates from the word information of the word candidates extracted by the high-likelihood word extraction unit and the candidate strings of the input units in the remaining portions, using at least one of a word dictionary, input unit connection rules, part-of-speech connection rules, word connection rules, grammar rules, and semantic rules.

3. The text input device according to claim 2, wherein the high-likelihood word dictionary is composed only of words whose word extraction likelihood is larger than a set value.

4. The text input device according to claim 2, wherein the high-likelihood word dictionary is composed of a word group that includes one or more of: a word group of alphabetic words with a word length of K characters or more (K ≧ 2); a word group of hiragana words with a word length of L characters or more (L ≧ 2); a word group of katakana words with a word length of M characters or more (M ≧ 2); and a word group of mixed kana-kanji words with a word length of N characters or more (N ≧ 2).

5. The text input device according to claim 1 or claim 2, wherein inflection information is attached to the entries of inflecting words in the word likelihood dictionary or the high-likelihood word dictionary; an inflected-form generation unit is provided that refers to the inflection information and generates the inflected forms of an inflecting word by attaching inflectional endings to its stem; and the high-likelihood word extraction unit extracts both the non-inflecting words listed in the word likelihood dictionary or the high-likelihood word dictionary and the inflected forms generated by the inflected-form generation unit.

6. The text input device according to claim 1, claim 2, or claim 5, wherein, when the input means outputs a plurality of input candidates, the high-likelihood word extraction unit processes only the input unit string composed of the first-ranked input candidates.

7. The text input device according to claim 2, claim 5, or claim 6, further comprising an erroneous input detection unit that detects, as an input error, any substring of the input text for which the likelihood obtained by the morphological analysis unit is lower than a certain value.