JPH0275059A

JPH0275059A - Error correction processor for Japanese sentence

Info

Publication number: JPH0275059A
Application number: JP63227841A
Authority: JP
Inventors: Masako Mochizuki; 望月　雅子
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-09-12
Filing date: 1988-09-12
Publication date: 1990-03-14

Abstract

PURPOSE: To improve the efficiency of correction candidate generation by using a correction character dictionary that stores pairs of erroneous characters and their correct candidate characters, ordered so that the characters most easily mistyped come first. CONSTITUTION: A processing unit 2 includes a morphological analysis unit 4, an error detection unit 5, and a correction candidate generation unit 6. The morphological analysis unit 4 uses a word dictionary 7 and a word connection table 8 to process the Japanese document received from an input unit 1, and the error detection unit 5 detects erroneous character strings. The correction candidate generation unit 6 then works with a character connection table 9 and a correction character dictionary 10 in addition to the dictionary 7 and the table 8. With this constitution, characters are not all processed uniformly; those that are easily mistyped are given priority, so correction candidates are generated efficiently.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a Japanese text error correction processing device that detects erroneous character strings in Japanese text and presents correct word candidates.

Prior Art

Japanese text error correction of this kind is described, for example, in the paper "Automatic Detection of Erroneous Characters in Japanese Text by a Word Analysis Program and Extraction of Correction Candidates by a Second-Order Markov Model," Journal of the Information Processing Society of Japan, Vol. 25, No. 2, Mar. 1984. In outline, that method estimates correction candidates from the chain probability between a candidate correction character and the characters before and after the erroneous part.

Problems to Be Solved by the Invention

With this conventional method, however, all characters are treated equally, so the method cannot reflect which character combinations are easily confused or the operating conditions at input time.

Furthermore, because the method works character by character, it may extract as a candidate even a character that does not connect grammatically to the preceding word.

Means for Solving the Problems

In a Japanese text error correction processing device that detects erroneous character strings in a Japanese document by means of a morphological analysis unit and an error detection unit using a word dictionary and a word connection table, and that presents, for hiragana errors, the words that are correct for the erroneous character string, a correction character dictionary is provided that stores pairs of an erroneous character and its correct candidate characters, ordered by how easily the characters are mistyped.

Operation

Among errors in Japanese text, some occur easily because of the key layout of the keyboard, a forgotten shift key operation, and similar causes, while other characters are rarely mistyped. The correction character dictionary takes this into account and stores pairs of an erroneous character and its correct candidate characters in order of how easily the error occurs. As a result, characters are not all processed uniformly; error-prone characters are given priority, so correction candidates can be generated efficiently.

Embodiments

A first embodiment of the present invention is described with reference to FIGS. 1 to 7. FIG. 1 is a block diagram of the Japanese text error correction processing device, which consists broadly of an input unit 1, a processing unit 2, and an output unit 3. The processing unit 2 consists of a morphological analysis unit 4, an error detection unit 5, and a correction candidate generation unit 6. The morphological analysis unit 4 processes the Japanese document from the input unit 1 using a word dictionary 7 and a word connection table 8, and the error detection unit 5 detects erroneous character strings. The correction candidate generation unit 6 performs its processing using the word dictionary 7 and the word connection table 8 together with a character connection table 9 and a correction character dictionary 10, the latter being the characteristic feature of this embodiment.

The correction candidate generation unit 6 may also be provided with dictionary data specifying means 11, which switches the correction character dictionary 10 according to a user entry stating whether the document to be processed from the input unit 1 was entered by kana input or by romaji input.

Here, the word dictionary 7 consists, as shown for example in FIG. 3, of a part-of-speech code, the reading of the word, and the written form of the word.

The word connection table 8, as shown for example in FIG. 4, is a two-dimensional table that represents word connections in terms of connection to the following word and connection from the preceding word.

As illustrated, "○" is stored where two words connect and "×" where they do not. Using this word connection table 8, the morphological analysis unit 4 checks connections between words, and the correction candidate generation unit 6 narrows down the correction candidates.
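A word connection table of this kind can be pictured as a lookup keyed by a pair of connection classes. The following is only a minimal sketch: the class names (`prefix`, `verbal_noun`, `common_noun`) and the two entries are hypothetical, since the contents of FIG. 4 are not reproduced in this text.

```python
# Hypothetical connection classes; the real table of FIG. 4 is not available.
# Key: (class of the preceding word, class of the following word).
# True corresponds to "○" in the table, False to "×".
WORD_CONNECTS = {
    ("prefix", "verbal_noun"): True,
    ("prefix", "common_noun"): False,
}


def words_connect(left_class: str, right_class: str) -> bool:
    """Return True only for pairs explicitly marked as connectable."""
    # Assumption: pairs missing from the table are treated as "×".
    return WORD_CONNECTS.get((left_class, right_class), False)
```

Treating unlisted pairs as non-connecting is a design choice of this sketch, not something the description specifies.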

The character connection table 9 represents connections between characters and, like the word connection table 8, is a two-dimensional table. As shown for example in FIG. 5, it indicates whether the character of a column can follow the character of a row. As illustrated, "○" is stored where the characters connect and "×" where they do not. The table is used mainly to check sokuon (the geminate marker っ) and yōon (sounds written with small ゃ, ゅ, ょ).
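The character connection check can be sketched in the same way. The entries below are illustrative assumptions rather than the contents of FIG. 5: 「は」 followed by 「っ」 or 「ち」 is taken as connectable, and 「は」 followed by small 「ゃ」 as not connectable, because small ゃ normally follows only i-row kana.

```python
# Illustrative entries only; the real table of FIG. 5 is not available.
# Key: (preceding character, following character); True = "○", False = "×".
CHAR_CONNECTS = {
    ("は", "っ"): True,
    ("は", "ち"): True,
    ("は", "ゃ"): False,
}


def chars_connect(prev_char: str, next_char: str, default: bool = True) -> bool:
    """Return whether next_char may directly follow prev_char."""
    # Assumption: pairs not listed fall back to `default`.
    return CHAR_CONNECTS.get((prev_char, next_char), default)
```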

The correction character dictionary 10, which characterizes this embodiment, is broadly divided into two sections, kana input and romaji input, as shown in FIG. 2. Characters are classified according to key position on the word processor and whether the shift key is used, and pairs of a mistyped character (erroneous character) and the characters that are candidates for its correction (candidate characters) are arranged in order of how easily the erroneous character is mistyped at input. Each entry consists of the input method (kana input or romaji input), the erroneous character, and the candidate characters. If the erroneous character string contains an erroneous character listed in the correction character dictionary 10, the correction candidate generation unit 6 replaces it with a candidate character to generate a correction candidate.
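One way to picture the dictionary layout described here is an ordered list of entries, where list order encodes priority. This is a sketch under assumptions: the type name `CorrectionEntry`, the helper `entries_for`, and the sample entries are hypothetical. The candidates for 「つ」 follow the worked example later in the description; the candidates for 「ゆ」 and the first candidate for 「う」 are guesses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CorrectionEntry:
    input_method: str            # "kana" or "romaji"
    wrong_char: str              # the mistyped (erroneous) character
    candidates: Tuple[str, ...]  # candidate characters, most likely first


# List order is the priority order: most error-prone characters first.
CORRECTION_DICT: List[CorrectionEntry] = [
    CorrectionEntry("kana", "つ", ("っ", "ち")),
    CorrectionEntry("kana", "ゆ", ("ゅ",)),        # assumed candidate
    CorrectionEntry("kana", "う", ("ぅ", "あ")),   # first candidate assumed
]


def entries_for(input_method: str) -> List[CorrectionEntry]:
    """Select the section of the dictionary for one input method, keeping order."""
    return [e for e in CORRECTION_DICT if e.input_method == input_method]
```

Selecting by input method here plays the role that the dictionary data specifying means 11 plays in the device.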

With this configuration, the error correction processing is outlined in the flowchart of FIG. 6. First, the dictionary data specifying means 11 is used to select the input method, that is, whether the Japanese document to be corrected was entered by kana input or by romaji input. This selection is needed because this embodiment infers the erroneous character from the key position at input time, so kana input and romaji input require different correction data. After the input method is selected, the device checks whether there is an input character string to process. If there is, the morphological analysis unit 4 performs morphological analysis on the input sentence, and the error detection unit 5 performs error detection on the result. The correction candidate generation unit 6 then generates correction candidates for each character string detected as erroneous, and the resulting correction candidates are displayed on the screen.

Within this processing, the work of the correction candidate generation unit 6 using the correction character dictionary 10 is described with reference to the flowchart of FIG. 7. Processing ends once all erroneous characters in the correction character dictionary 10 have been examined, or once the number of candidate words has grown large enough (10 in this embodiment). Until then, the unit checks, in order from the top of the correction character dictionary 10, whether the erroneous part contains an erroneous character listed in the dictionary. If it does not, the next erroneous character is checked in the same way. If the erroneous part does contain an erroneous character from the correction character dictionary 10 and a candidate character is available, the character connection table 9 is used to check whether the candidate character connects to the character immediately preceding the erroneous character in the erroneous part. If it connects, the character in the erroneous part that matches the erroneous character is replaced with the candidate character.

If the word produced by the replacement is in the word dictionary 7 and, according to the word connection table 8, connects to the word immediately preceding the erroneous part, the word is adopted as a correction candidate word, and processing moves on to the next candidate word.

In this way, correction candidates are generated by scanning the correction character dictionary 10 from the top and replacing any character in the erroneous part that matches an erroneous character. As stated above, processing ends when all erroneous characters have been examined or when the number of candidate words reaches a fixed value.

As a concrete example, take the target sentence 「定期券を再はつこうする手続き」 ("the procedure for reissuing a commuter pass," with はっこう mistyped as はつこう). When the morphological analysis unit 4 analyzes this sentence, 「はつこう」 becomes an unregistered word, and the error detection unit 5 identifies it as an erroneous part. It is then passed to the correction candidate extraction processing of the correction candidate generation unit 6 shown in FIG. 7. As the processing proceeds as described above, the erroneous part 「はつこう」 is found to contain 「つ」, which is listed in the correction character dictionary 10 and has candidate characters. Since 「は」 and 「っ」 connect, the 「つ」 in 「はつこう」 is replaced with 「っ」, the first candidate character. The word dictionary 7 is then searched for the reading 「はっこう」. As illustrated in FIG. 3, the word dictionary 7 contains 「発行」 (issuance), 「発酵」 (fermentation), 「薄幸」 (misfortune), 「発効」 (coming into effect), 「発光」 (light emission), and so on. The word 「再」 (re-) immediately preceding the erroneous part connects to sa-hen nouns (verbal nouns that take する), so 「薄幸」, which does not connect, is removed. The remaining words become correction candidate words.

Processing then moves on to the next candidate character for the erroneous character 「つ」 in the correction character dictionary 10, namely 「ち」, which follows 「っ」 in the candidate list.

「は」 can connect to 「ち」, so 「つ」 is replaced with 「ち」. For 「はちこう」 the word dictionary 7 contains 「八高」, but it does not connect to 「再」 and is therefore not adopted as a candidate. The erroneous character 「つ」 is examined in the same way until its candidates are exhausted, after which the next erroneous character in the correction character dictionary 10, 「ゆ」, is examined.

「ゆ」 does not occur in the erroneous part 「はつこう」, so processing moves on to the next erroneous character, 「う」. Since 「う」 does occur in the erroneous part 「はつこう」, the first candidate character for 「う」, 「ぅ」, is examined. 「こ」 and 「ぅ」 do not connect, so the next candidate character, 「あ」, is examined in the same way.

Processing continues in the same way and ends for this erroneous part when the number of candidates exceeds 10 or when all erroneous characters in the correction character dictionary 10 have been examined.
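The procedure of FIG. 7 and the 「はつこう」 example can be sketched roughly as follows. This is not the patented implementation. The function name `make_candidates`, the connection classes, and all table contents are hypothetical. The sketch also assumes that only the first occurrence of an erroneous character is replaced and that generation stops as soon as the limit of 10 candidates is reached; the description leaves both points open.

```python
MAX_CANDIDATES = 10  # limit used in the embodiment


def make_candidates(error_part, prev_word_class, corrections, char_connects,
                    word_dict, words_connect, limit=MAX_CANDIDATES):
    """Scan `corrections` in priority order and collect candidate words."""
    candidates = []
    for wrong_char, candidate_chars in corrections:
        pos = error_part.find(wrong_char)
        if pos < 0:
            continue  # this erroneous character is not in the erroneous part
        for cand_char in candidate_chars:
            # The candidate must connect to the character just before it.
            if pos > 0 and not char_connects(error_part[pos - 1], cand_char):
                continue
            reading = error_part[:pos] + cand_char + error_part[pos + 1:]
            # Keep dictionary words that connect to the preceding word.
            for surface, word_class in word_dict.get(reading, []):
                if words_connect(prev_word_class, word_class):
                    candidates.append(surface)
                    if len(candidates) >= limit:
                        return candidates
    return candidates


# Hypothetical data modeled on the worked example.
corrections = [("つ", ["っ", "ち"]), ("ゆ", ["ゅ"]), ("う", ["ぅ", "あ"])]
word_dict = {
    "はっこう": [("発行", "verbal_noun"), ("発酵", "verbal_noun"),
                 ("薄幸", "common_noun"), ("発効", "verbal_noun"),
                 ("発光", "verbal_noun")],
}


def char_connects(prev_char, next_char):
    return (prev_char, next_char) != ("こ", "ぅ")


def words_connect(left_class, right_class):
    return (left_class, right_class) == ("prefix", "verbal_noun")


result = make_candidates("はつこう", "prefix", corrections,
                         char_connects, word_dict, words_connect)
print(result)  # ['発行', '発酵', '発効', '発光']
```

Because the dictionary is scanned from the top, candidates derived from the most error-prone characters are produced first, which is where the claimed efficiency gain comes from.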

Displaying the correction candidates obtained in this way allows the text to be corrected to proper Japanese such as 「定期券を再発行する手続き」. Thus, because this embodiment uses the correction character dictionary 10, which stores candidates in order of how easily characters are mistyped, taking key layout and the like into account, correction candidates can be generated more efficiently than with the conventional approach of treating all characters equally. In addition, because word connections are also checked to narrow down the candidates, words that do not connect grammatically are not extracted as candidates, which improves accuracy.

Next, a second embodiment of the present invention is described with reference to FIGS. 8 to 10. This embodiment adds a frequency information dictionary 12, which stores frequency information for the words in the Japanese document being processed, and uses that frequency information to assign priorities to the correct-word candidates.

That is, the morphological analysis unit 4 performs its processing using the word dictionary 7 and the word connection table 8 and stores the frequency of each analyzed word in the frequency information dictionary 12. The correction candidate generation unit 6 then performs its processing using this frequency information dictionary 12 together with the word dictionary 7, the word connection table 8, the character connection table 9, and the correction character dictionary 10.

The frequency information dictionary 12 consists, as shown for example in FIG. 9, of pairs of a written form and a frequency. Each time a word is determined during morphological analysis, its frequency is incremented by 1.

FIG. 10 is a flowchart showing the correction candidate generation processing of this embodiment with the frequency information dictionary 12 added. The processing is basically the same as in the first embodiment. However, when words produced by the replacement are in the word dictionary 7 and can connect to the word immediately preceding the erroneous part, they are sorted in order of their frequency in the frequency information dictionary 12 and then adopted as correction candidate words.

As a concrete example, consider the target text 「一度発行された定期券を紛失した場合を以下に記す。定期券を再はつこうする手続きは」 ("The case where a commuter pass that has once been issued is lost is described below. The procedure for reissuing the commuter pass is ..."). As in the first embodiment, the 「つ」 in 「はつこう」 is replaced with 「っ」, and a search of the word dictionary 7 yields 「発行」, 「発酵」, 「薄幸」, 「発光」, and 「発効」. Following the check of whether the part of speech can connect, 「薄幸」, which does not connect, is removed.

For each of the remaining words, the frequency information dictionary 12 built from the morphological analysis results is consulted, and the words are arranged in descending order of frequency.

Because 「発行」 already occurs in the first sentence and therefore has the highest frequency, it is placed at the head of the correction candidate words.
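The frequency-based ordering of the second embodiment can be sketched with a plain counter. The function names `record_words` and `rank_candidates` are hypothetical, and the word list passed to `record_words` only stands in for the output of morphological analysis, which is not modeled here.

```python
from collections import Counter


def record_words(frequency: Counter, words) -> None:
    """Add 1 to the count of each word determined by morphological analysis."""
    frequency.update(words)


def rank_candidates(candidates, frequency: Counter):
    """Sort by descending frequency; equally frequent words keep their order."""
    # sorted() is stable, and a Counter returns 0 for words it has not seen.
    return sorted(candidates, key=lambda word: -frequency[word])


frequency = Counter()
# Words assumed to have been determined in the first sentence of the example.
record_words(frequency, ["一度", "発行", "定期券", "紛失", "場合"])
ranked = rank_candidates(["発酵", "発効", "発行", "発光"], frequency)
print(ranked)  # ['発行', '発酵', '発効', '発光']
```

Words that never appeared in the document keep their dictionary order, so the ranking only promotes candidates for which the document itself provides evidence.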

Effects of the Invention

As described above, the present invention provides a correction character dictionary that stores pairs of an erroneous character and its correct candidate characters in order of how easily the error occurs. Characters that are error-prone because of the key layout of a keyboard or similar device, the need for a shift key operation, and the like can therefore be given priority, so correction candidate generation can be performed more efficiently.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the first embodiment of the present invention; FIG. 2 shows the structure of the correction character dictionary; FIG. 3 shows the structure of the word dictionary; FIG. 4 shows the structure of the word connection table; FIG. 5 shows the structure of the character connection table; FIG. 6 is a flowchart showing the overall processing; FIG. 7 is a flowchart showing the correction candidate generation processing; FIG. 8 is a block diagram showing the second embodiment of the present invention; FIG. 9 shows the structure of the frequency information dictionary; and FIG. 10 is a flowchart showing the correction candidate generation processing.

4: morphological analysis unit; 5: error detection unit; 7: word dictionary; 8: word connection table; 10: correction character dictionary.

Applicant: Ricoh Co., Ltd.

Claims

[Claims]

1. A Japanese text error correction processing device that detects erroneous character strings in a Japanese document by means of a morphological analysis unit and an error detection unit using a word dictionary and a word connection table, and that presents, for hiragana errors, words that are correct for the erroneous character string, characterized in that the device is provided with a correction character dictionary storing pairs of an erroneous character and its correct candidate characters, ordered by how easily the characters are mistyped.