JPH05225183A

JPH05225183A - Automatic error detector for words in Japanese sentence

Info

Publication number: JPH05225183A
Application number: JP4023801A
Authority: JP
Inventors: Shinichiro Takagi; 伸一郎高木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1992-02-10
Filing date: 1992-02-10
Publication date: 1993-09-03

Abstract

PURPOSE: To improve error detection accuracy by detecting errors with character concatenation probability in kanji string parts and with inter-word part-of-speech connection probability in hiragana (kana) string parts. CONSTITUTION: A character concatenation probability dictionary 70 stores the concatenation probability of each N-character pattern, calculated in advance from appearance-frequency information on patterns of N consecutive characters in a standard document that contains no erroneous characters. An inter-word part-of-speech connection dictionary 100 records, in advance, each part of speech, the parts of speech that can follow it, and the probability of each connection. A morphological analysis processing unit 50 performs word recognition on the Japanese character string of a source document file 20, and a kanji string part/hiragana string part separation processing unit 60 separates the string into kanji string parts and hiragana string parts. In a kanji string part, any portion whose character concatenation probability value, looked up in the character concatenation probability dictionary 70, does not satisfy a predetermined value is judged to be an erroneous character. In a hiragana string part, any portion whose part-of-speech connection value, looked up in the inter-word part-of-speech connection dictionary 100, does not satisfy a predetermined value is pointed out as an error.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to an automatic Japanese word error detection device that automatically detects input errors, such as erroneous characters, omitted characters, and word errors, contained in Japanese document data entered from a document input device such as a Japanese word processor.

[0002]

[Prior Art] Errors such as erroneous characters, omitted characters, and erroneous conversions can creep in when large volumes of Japanese documents, such as newspaper articles, manuscripts for publication, and scientific and technical papers, are converted into electronic files with a character reader to build a Japanese document database, or when Japanese text is written on a Japanese word processor. Proofreading, in which these errors are found and corrected by hand, takes a great deal of time, and its accuracy varies with the proofreader's ability and mental state. To support human proofreading, the following word error detection methods, which apply a Japanese word dictionary and Japanese morphological analysis by computer to Japanese documents converted into electronic files, have been put into practice.

[0003] (1) A means that recognizes words and parts of speech and identifies erroneous words by matching them against a list of commonly mistaken words prepared in advance.

(2) A means that, at the word recognition stage of Japanese morphological analysis, examines the positional and grammatical connection relations of the word sequence extracted from the Japanese word dictionary, and detects errors by extracting any character region where the word connections are discontinuous as an unanalyzable character string.

The basic idea of the former is used in English spell checkers, but Japanese has a large number of character types, so the list of erroneous words must become enormous to achieve sufficient detection accuracy. The latter is shown, for example, in "Japanese Sentence Processing System" of Japanese Patent Application No. Sho 62-315698. However, the grammar dictionary that this means uses to recognize grammatical connection relations records only whether a grammatical connection exists. Consequently, even when an erroneous word is present and the resulting grammatical connection occurs only rarely, the connection is treated as "grammatical connection exists" and word recognition completes normally, so error detection accuracy is insufficient.

[0004] In addition, the following error detection methods, which use statistical information, have been put into practice.

(3) A means that extracts N-character patterns from a large volume of Japanese documents containing no erroneous characters, creates in advance a character concatenation probability dictionary holding, keyed by each N-character pattern, the character concatenation probability value calculated from the appearance-frequency information for that pattern, and uses this dictionary to detect an erroneous character when the character concatenation probability value of consecutive characters in the source text does not satisfy a cutoff value.

(4) A means that uses a large volume of Japanese documents containing no erroneous characters to create in advance an inter-word part-of-speech connection dictionary holding part-of-speech connection probabilities between words, calculated from appearance-frequency information on the parts of speech of each word and its neighboring words, and uses this dictionary to detect a word as an error when the part-of-speech connection probability between two consecutive words in the source character string does not satisfy a predetermined value.

The former is shown, for example, in "Automatic Japanese Erroneous Character Detection System" of Japanese Patent Application No. Sho 61-024992. However, in character strings such as hiragana, where there are few character types and each occurs frequently, the character concatenation probability is generally high even when an erroneous character is present, so erroneous characters are difficult to detect. The latter is shown, for example, in "Japanese Word Error Checking Device" of Japanese Patent Application No. Hei 2-202295. However, in a kanji string made up of kanji words, the parts of speech of the words are mostly high-frequency ones such as nouns, prefixes, and suffixes, so once a string containing an erroneous character has been recognized as a noun or the like, the part-of-speech connection probability cannot detect it.

[0005]

[Problems to Be Solved by the Invention] As described above, in the grammatical detection method based on morphological analysis, even when an erroneous word is present and the resulting grammatical connection occurs only rarely, the connection is treated as "grammatical connection exists" and word recognition completes normally, so error detection accuracy is insufficient.

[0006] Among the error detection methods that use statistical information, the method based on character concatenation probability has the problem that, in character strings such as hiragana, where there are few character types and each occurs frequently, the character concatenation probability is generally high even when an erroneous character is present, so erroneous characters are difficult to detect.

[0007] Furthermore, in the detection method based on inter-word part-of-speech connection probability, the parts of speech of the words in a kanji string made up of kanji words are mostly high-frequency ones such as nouns, prefixes, and suffixes. Once a string containing an erroneous character has been recognized as a noun or the like, the inter-word part-of-speech connection probability cannot detect it, so error detection accuracy is insufficient.

[0008] An object of the present invention is to improve error detection accuracy by processing kanji string parts and hiragana string parts separately.

[0009]

[Means for Solving the Problems] The present invention provides an automatic Japanese word error detection device that operates on the Japanese character strings of Japanese document data entered from a document input device such as a Japanese word processor and containing input errors such as erroneous characters, omitted characters, and word errors. The device comprises: means for performing morphological analysis that recognizes words, parts of speech, and character types; means for separating the text into kanji string parts and hiragana string parts according to character type; means for setting an erroneous character determination section for each kanji string part and detecting erroneous characters by applying a cutoff to character concatenation probability values taken from a character concatenation probability dictionary created in advance; and means for setting an inter-word part-of-speech determination section for each hiragana string part and detecting word errors by applying a cutoff to inter-word part-of-speech connection probabilities taken from an inter-word part-of-speech connection dictionary created in advance.

[0010]

[Operation] In the present invention, morphological analysis is performed when detecting errors in a Japanese character string that contains errors. Depending on character type, kanji string parts are checked with an error detection method that uses statistical information, namely character concatenation probability, and hiragana string parts are checked with a detection method that uses a word connection property, namely inter-word part-of-speech connection probability. Error detection accuracy is therefore higher than when either conventional error detection method is applied uniformly to the whole text.

[0011]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 shows a configuration example of the device of the present invention.

[0012] In FIG. 1, the processing device 150 consists of a CPU and memory and has the following functional units: a morphological analysis processing unit 50 that recognizes words, parts of speech, and character types in the Japanese character string of the source document file 20, entered from a document input device 10 such as a Japanese word processor, using a Japanese word dictionary 30 and a grammar dictionary 40; a kanji string part/hiragana string part separation processing unit 60 that separates the text into kanji string parts and hiragana string parts according to character type; a character concatenation probability value extraction processing unit 80 that searches the character concatenation probability dictionary 70 and extracts character concatenation probability values; an erroneous character determination processing unit 90 that judges erroneous characters from the retrieved character concatenation probability values; an inter-word part-of-speech connection value extraction processing unit 110 that searches the inter-word part-of-speech connection dictionary 100 and extracts part-of-speech connection values; an inter-word part-of-speech verification processing unit 120 that points out errors based on the retrieved part-of-speech connection values; and a word error detection processing unit 130 that integrates the error locations extracted from the kanji string parts and the hiragana string parts.

[0013] In this device, two dictionaries are created in advance: the character concatenation probability dictionary 70, which stores, keyed by each N-character pattern, the concatenation probability information for each N characters calculated from appearance-frequency information on patterns of N consecutive characters in a standard document containing no erroneous characters; and the inter-word part-of-speech connection dictionary 100, which records each part of speech, the parts of speech that can be connected after it, and the probability of that connection. The morphological analysis processing unit 50 then recognizes the words, parts of speech, and character types in the Japanese character string of the source document file 20 using the Japanese word dictionary 30 and the grammar dictionary 40, and the kanji string part/hiragana string part separation processing unit 60 separates the result into kanji string parts and hiragana string parts according to character type.
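The two dictionaries described above could be built roughly as follows. This is a minimal sketch, not the method of the specification: the text says only that the probabilities are calculated from appearance-frequency information, so the conditional relative-frequency estimate used here, the function names, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def build_char_concat_dict(clean_lines, n=2):
    """Map each N-character pattern to an estimated concatenation probability.

    Assumption: probability = count(pattern) / count(patterns sharing its
    first N-1 characters). The specification does not give the formula.
    """
    pattern_counts = Counter()
    prefix_counts = Counter()
    for line in clean_lines:
        for i in range(len(line) - n + 1):
            pattern = line[i:i + n]
            pattern_counts[pattern] += 1
            prefix_counts[pattern[:-1]] += 1
    return {p: c / prefix_counts[p[:-1]] for p, c in pattern_counts.items()}


def build_pos_connection_dict(tagged_sentences):
    """Map (part of speech, following part of speech) to an estimated probability.

    Each sentence is a list of part-of-speech labels in word order.
    """
    pair_counts = Counter()
    left_counts = Counter()
    for tags in tagged_sentences:
        for left, right in zip(tags, tags[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
    return {pair: c / left_counts[pair[0]] for pair, c in pair_counts.items()}


# Toy data, invented for this sketch only.
char_dict = build_char_concat_dict(["電話網を", "電話を"])
pos_dict = build_pos_connection_dict(
    [["noun", "particle", "verb"], ["noun", "particle", "noun"]]
)
```

Keying the character dictionary by the pattern itself mirrors the description, so a missing key can later be read as "no record".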

[0014] Subsequent processing is carried out separately for kanji string parts and hiragana string parts. For each separated kanji string part, the character concatenation probability value extraction processing unit 80 searches the previously created character concatenation probability dictionary 70 and extracts character concatenation probability values, and the erroneous character determination processing unit 90 judges any location whose character concatenation probability value does not satisfy the predetermined value to be an erroneous character.

[0015] For each separated hiragana string part, the inter-word part-of-speech connection value extraction processing unit 110 searches the previously created inter-word part-of-speech connection dictionary 100 and extracts part-of-speech connection values, and the inter-word part-of-speech verification processing unit 120 points out an error when a part-of-speech connection value does not satisfy the predetermined value. The word error detection processing unit 130 then integrates the error locations extracted from the kanji string parts and the hiragana string parts. Finally, a checked document file 140 containing these error detection results is created.

[0016] FIG. 2 shows the schematic flow of the process for detecting erroneous words in the configuration example of FIG. 1. The operation is described below following this flow.

(Step 1): The character string to be morphologically analyzed is read from the source document file 20.

[0017] (Step 2): The morphological analysis processing unit 50 performs morphological analysis using the Japanese word dictionary 30 and the grammar dictionary 40, and recognizes words, parts of speech, and character types.

[0018] (Step 3): Using the word recognition results, the text is separated by character type into hiragana string parts (hiragana and punctuation marks) and kanji string parts (all character strings other than hiragana string parts).

(Step 4): Processing branches depending on whether the part is a kanji string part or a hiragana string part. Kanji string parts undergo the separate processing of steps 5 to 7.
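The separation in step 3 can be sketched as below. This is a simplified illustration that works character by character, whereas the device works on the word recognition results; the function names, the Unicode range used for hiragana letters, and the choice of 、 and 。 as the punctuation marks are assumptions.

```python
def in_hiragana_part(ch):
    """True for hiragana letters (U+3041 to U+3096) and the marks 、 and 。."""
    return "\u3041" <= ch <= "\u3096" or ch in "、。"


def split_parts(text):
    """Split text into alternating ("kanji" | "hiragana", substring) runs.

    Following step 3, everything that is not a hiragana string part
    (kanji, katakana, and so on) is treated as a kanji string part.
    """
    parts = []
    for ch in text:
        kind = "hiragana" if in_hiragana_part(ch) else "kanji"
        if parts and parts[-1][0] == kind:
            parts[-1] = (kind, parts[-1][1] + ch)
        else:
            parts.append((kind, ch))
    return parts


print(split_parts("電網を持システムに"))
```

Note that the katakana in 持システム stays in the kanji string part, which matches the definition of a kanji string part as anything other than a hiragana string part.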

[0019] (Step 5): One character before and one character after the kanji string part are added to it to set the erroneous character determination section.

(Step 6): Starting from the beginning of the section, a character concatenation probability value is extracted from the character concatenation probability dictionary 70 for each two-character sequence.

[0020] (Step 7): Based on the result of step 6, any two characters whose character concatenation probability value is below the predetermined value (0.05 in the embodiment of FIG. 3) are detected as an error. Hiragana string parts undergo the separate processing of steps 8 to 10.

[0021] (Step 8): One word before and one word after the hiragana string part are added to it to set the inter-word part-of-speech determination section.

(Step 9): Starting from the beginning of the section, the inter-word part-of-speech connection dictionary 100 is searched for each two-word sequence, and the inter-word part-of-speech connection value is extracted.

[0022] (Step 10): Based on the search result of step 9, any two words whose inter-word part-of-speech connection value is below the predetermined value (0.05 in the embodiment of FIG. 4) are detected as an error.

(Step 11): The erroneous character determination sections and inter-word part-of-speech determination sections that contain errors are extracted as word error locations.
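Step 8, which widens each hiragana string part by one word on either side, might look like the following sketch. The word list is assumed to come from morphological analysis; the segmentation shown, the function names, and the hiragana test are illustrative assumptions rather than details given in the text.

```python
def is_hiragana_word(word):
    """True when every character is a hiragana letter or one of 、 and 。."""
    return all("\u3041" <= ch <= "\u3096" or ch in "、。" for ch in word)


def pos_determination_sections(words):
    """Return each run of hiragana words widened by one word on each side."""
    sections = []
    i = 0
    while i < len(words):
        if is_hiragana_word(words[i]):
            j = i
            while j < len(words) and is_hiragana_word(words[j]):
                j += 1
            # Add one preceding and one following word where they exist.
            sections.append(words[max(0, i - 1):min(len(words), j + 1)])
            i = j
        else:
            i += 1
    return sections


# Assumed segmentation, for illustration only.
words = ["本体", "な", "い", "の", "装置", "で", "動"]
for section in pos_determination_sections(words):
    print("".join(section))
```

Because neighbouring sections share their boundary word, a word such as 装置 is examined both as the end of one section and as the start of the next.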

[0023] (Step 12): The word error locations extracted in step 11 are marked and output to the checked document file 140.

(Step 13): It is determined whether processing has reached the end of the character string under morphological analysis. If not, processing returns to step 3 and continues. If so, processing ends.
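Steps 11 and 12 combine the error locations from both branches and mark them in the output. The text does not say how locations are represented or marked, so the sketch below makes its own assumptions: locations are (start, end) character offsets, overlapping locations are merged, and the markers `[[` and `]]` are hypothetical.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans into sorted order."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mark_errors(text, spans, open_mark="[[", close_mark="]]"):
    """Return text with each merged error span wrapped in the given markers."""
    out = []
    pos = 0
    for start, end in merge_spans(spans):
        out.append(text[pos:start])
        out.append(open_mark + text[start:end] + close_mark)
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Spans from the kanji branch and the hiragana branch would simply be
# concatenated into one list before marking.
print(mark_errors("電網を持システムに", [(0, 2), (3, 5)]))
```

Merging first keeps the markers from nesting when the two branches flag overlapping locations.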

[0024] As an embodiment using the configuration example of FIG. 1, the error determination processes for kanji string parts and hiragana string parts are described with reference to FIG. 3 and FIG. 4, respectively. FIG. 3 illustrates, with a concrete example, the process of detecting erroneous characters contained in kanji string parts.

[0025] Here, 200 is a Japanese character string containing erroneous characters, extracted from the source document file; 210 is the location of an omitted or erroneous character in the Japanese character string; 220 is the correct character; 230 is the word and part-of-speech information resulting from morphological analysis; 240 is a kanji string part separated by character type; 250 is an erroneous character determination section; 260 is a two-character sequence, cut out from the beginning of the section, for which the character concatenation probability value is looked up; 270 is the character concatenation probability value retrieved from the character concatenation probability dictionary 70; 280 is a location whose character concatenation probability value is below the predetermined value; and 290 is the two characters judged to be an error.

[0026] In this embodiment, the source Japanese character string 200 contains the omission of 『話』 from 『電話網』, the omission of 『つ』 from 『持つ』, and the erroneous character 『か』 in 『共通化』. After morphological analysis of this string, the kanji string parts 240, whose character type is anything other than a hiragana string, are extracted, and the erroneous character determination sections 250 are set from them. Next, in each of the erroneous character determination sections 『電網を』, 『を持システムに』, and 『に共通か』, each two-character sequence from the beginning is used as a key to search the character concatenation probability dictionary 70 and extract a character concatenation probability value. For example, the search for 『網を』 returns a character concatenation probability value of 0.25, whereas 『電網』 has no corresponding record in the character concatenation probability dictionary, so its character concatenation probability value is taken to be 0.0. Two-character locations 280 whose retrieved character concatenation probability value is below the predetermined value (0.05 in this embodiment) are detected as errors. This indicates that 『電網』 may contain an erroneous character, and it is detected as two characters 290 containing an error. Similarly, 『持シ』 and 『通か』 are also below the predetermined value and are detected as errors.
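The kanji string part check of steps 5 to 7 can be sketched with the values quoted in this example. Only the probability 0.25 for 『網を』, the missing record for 『電網』, and the threshold 0.05 come from the text; the function names and the reading of "every two characters from the beginning" as an overlapping window are assumptions, the latter suggested by the fact that both 『電網』 and 『網を』 are looked up within 『電網を』.

```python
THRESHOLD = 0.05  # predetermined value used in the embodiment


def determination_section(text, start, end):
    """Widen the kanji string part text[start:end] by one character per side."""
    return text[max(0, start - 1):min(len(text), end + 1)]


def flag_low_probability_pairs(section, concat_dict, threshold=THRESHOLD):
    """Return adjacent two-character sequences whose probability is too low.

    A pair with no dictionary record is treated as probability 0.0.
    """
    flagged = []
    for i in range(len(section) - 1):
        pair = section[i:i + 2]
        if concat_dict.get(pair, 0.0) < threshold:
            flagged.append(pair)
    return flagged


# Dictionary containing only the one value stated in the text.
concat_dict = {"網を": 0.25}
section = determination_section("電網を持システムに", 0, 2)
print(section)
print(flag_low_probability_pairs(section, concat_dict))
```

Treating a missing record as 0.0 means that any character pair never seen in the clean corpus is flagged, which is how the omission in 『電網』 surfaces.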

[0027] FIG. 4 illustrates, with a concrete example, the process of detecting erroneous words contained in hiragana string parts. Here, 300 is a Japanese character string containing erroneous characters, extracted from the source document file; 310 is the location of an error in the Japanese character string; 320 is the correct character; 330 is the word and part-of-speech information resulting from morphological analysis; 340 is a hiragana string part separated by character type; 350 is an inter-word part-of-speech determination section; 360 is the text, cut out two words at a time from the beginning of the section, for which the inter-word part-of-speech connection probability value is looked up; 370 is the inter-word part-of-speech connection probability value retrieved from the inter-word part-of-speech connection dictionary 100; 380 is a location whose inter-word part-of-speech connection probability value is below the predetermined value; and 390 is the two words judged to be an error.

[0028] In this embodiment, the source Japanese character string 300 contains the erroneous 『ない』 in 『本体内』 (written as 『本体ない』) and the erroneous insertion of 『く』 into 『動かすため』. After morphological analysis of this string, the hiragana string parts 340, whose character type is hiragana, are extracted, and one word before and one word after each are added to set the inter-word part-of-speech determination sections 350. Next, in each of the inter-word part-of-speech determination sections 『本体ないの装置』, 『装置で動』, and 『動くかすために』, each two-word sequence from the beginning is used as a key to search the inter-word part-of-speech connection dictionary 100 and extract an inter-word part-of-speech connection probability value.

[0029] For example, a search for the connection between the parts of speech "adjective stem" and "adjective ending" returns an inter-word part-of-speech connection probability value of 0.31. In 『本体ないの装置』, on the other hand, the connection between 『本体』 ("general noun") and 『な』 ("adjective stem") has no corresponding record in the inter-word part-of-speech connection dictionary, so its inter-word part-of-speech connection probability value is taken to be 0.0. Two-word locations 380 whose retrieved inter-word part-of-speech connection probability value is below the predetermined value (0.05 in this embodiment) are detected as errors. This indicates that 『本体な』 may contain an error, and it is detected as two words 390 containing an error. Similarly, the connection between the parts of speech of 『かす』 and 『ため』 is also below the predetermined value and is detected as an error. In this case the actual error is the erroneously inserted 『く』, but because the flagged location lies within the inter-word part-of-speech determination section, the error can still be detected.
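The lookup in steps 9 and 10 can be sketched with the two values given in this example. The probability 0.31 for "adjective stem" followed by "adjective ending", the missing record for "general noun" followed by "adjective stem", and the threshold 0.05 come from the text; the function name, the English part-of-speech labels used as keys, and the tagging of 『い』 as an adjective ending are assumptions for illustration.

```python
THRESHOLD = 0.05  # predetermined value used in the embodiment


def flag_low_probability_word_pairs(tagged_words, connection_dict,
                                    threshold=THRESHOLD):
    """Return adjacent word pairs whose part-of-speech connection is too rare.

    tagged_words is a list of (surface form, part of speech) tuples.
    A part-of-speech pair with no record is treated as probability 0.0.
    """
    flagged = []
    for (w1, pos1), (w2, pos2) in zip(tagged_words, tagged_words[1:]):
        if connection_dict.get((pos1, pos2), 0.0) < threshold:
            flagged.append((w1, w2))
    return flagged


connection_dict = {("adjective stem", "adjective ending"): 0.31}
tagged = [
    ("本体", "general noun"),
    ("な", "adjective stem"),
    ("い", "adjective ending"),  # assumed tag, not stated in the text
]
print(flag_low_probability_word_pairs(tagged, connection_dict))
```

The lookup key is the pair of parts of speech rather than the words themselves, so the dictionary stays small even for a large vocabulary.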

[0030] As described above, morphological analysis is performed when detecting errors in a Japanese character string that contains errors. Depending on character type, kanji string parts are checked with an error detection method that uses statistical information, namely character concatenation probability, and hiragana string parts are checked with a detection method that uses a word connection property, namely inter-word part-of-speech connection probability. Error detection accuracy can therefore be improved compared with applying either conventional error detection method uniformly.

[0031] The method of setting the character concatenation probability, the method of setting its predetermined value, and the method of judging erroneous words by character concatenation probability, as well as the method of setting the inter-word part-of-speech connection probability, the method of setting its predetermined value, and the method of judging erroneous words by inter-word part-of-speech connection probability, may be changed as appropriate according to the field of the document, the frequency of the words used, and so on.

[0032]

[Effects of the Invention] As described above, according to the present invention, depending on character type, kanji string parts are checked with an error detection method that uses statistical information, namely character concatenation probability, and hiragana string parts are checked with a detection method that uses a word connection property, namely inter-word part-of-speech connection probability. Error detection accuracy can therefore be improved compared with applying either conventional error detection method uniformly.

[Brief description of drawings]

FIG. 1 shows a configuration example of the device of the present invention.

FIG. 2 shows the schematic processing flow of the device of FIG. 1.

FIG. 3 shows an example of the process of detecting erroneous characters contained in a kanji string part.

FIG. 4 is an explanatory diagram showing an example of the process of detecting erroneous words contained in a hiragana string part.

[Explanation of symbols]

10 Document input device
20 Source document file
30 Japanese word dictionary
40 Grammar dictionary
50 Morphological analysis processing unit
60 Kanji string part/hiragana string part separation processing unit
70 Character concatenation probability dictionary
80 Character concatenation probability value extraction processing unit
90 Erroneous character determination processing unit
100 Inter-word part-of-speech connection dictionary
110 Inter-word part-of-speech connection value extraction processing unit
120 Inter-word part-of-speech verification processing unit
130 Word error detection processing unit
140 Checked document file
150 Processing device
200 Japanese character string
210 Error location
220 Correct character
230 Morphological analysis result
240 Kanji string part
250 Erroneous character determination section
260 Search target characters
270 Character concatenation probability value
280 Location below the predetermined value
290 Detected characters
300 Japanese character string
310 Error location
320 Correct character
330 Morphological analysis result
340 Hiragana string part
350 Inter-word part-of-speech determination section
360 Search target characters
370 Inter-word part-of-speech connection probability value
380 Location below the predetermined value
390 Detected words

Claims

[Claims]

1. An automatic Japanese word error detection device for detecting input errors, such as erroneous characters, omitted characters, and word errors, contained in Japanese document data input from a document input device, the device comprising: a Japanese word dictionary that stores Japanese words; a grammar dictionary that describes grammatical analysis rules; a morphological analysis processing unit that identifies words, parts of speech, and character types using the Japanese word dictionary and the grammar dictionary; a kanji string part / hiragana string part separation processing unit that separates the identified words into kanji string parts and hiragana string parts according to character type; for the separated kanji string parts, a character concatenation probability dictionary that stores, keyed by each N-character pattern, the concatenation probability information of each N characters calculated in advance from a standard document containing no erroneous characters on the basis of appearance frequency information on patterns of N consecutive characters, a character concatenation probability value extraction processing unit that searches the character concatenation probability dictionary and extracts character concatenation probability values, and an erroneous character determination processing unit that determines as an erroneous character any location whose extracted character concatenation probability value does not satisfy a predetermined value; for the separated hiragana string parts, an inter-word part-of-speech connection dictionary that stores, for a given part of speech, the parts of speech that can be connected after it and the probability of that connection, an inter-word part-of-speech connection value extraction processing unit that searches the inter-word part-of-speech connection dictionary and extracts part-of-speech connection values, and an inter-word part-of-speech test processing unit that points out an error when the extracted part-of-speech connection value does not satisfy a predetermined value; and a word error detection processing unit that integrates the error locations extracted from the kanji string parts and the hiragana string parts.
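
As an illustration of how a character concatenation probability dictionary keyed by N-character patterns might be built from an error-free standard document, the following is a minimal sketch. The function name `build_char_concat_dict` is hypothetical, and the probability definition used here (the count of the pattern divided by the count of its first N-1 characters) is an assumption; the claim states only that the probabilities are calculated from appearance frequency information.

```python
from collections import Counter


def build_char_concat_dict(standard_text: str, n: int = 2) -> dict[str, float]:
    """Map each N-character pattern to an estimated concatenation probability.

    Assumption (not stated in the claim): the probability of a pattern is its
    count divided by the count of its leading N-1 characters, i.e. a
    conditional relative frequency.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    last_start = len(standard_text) - n + 1
    # Count every run of N consecutive characters in the standard document.
    patterns = Counter(standard_text[i:i + n] for i in range(last_start))
    # Count the leading N-1 characters of those same runs.
    prefixes = Counter(standard_text[i:i + n - 1] for i in range(last_start))
    return {p: count / prefixes[p[:n - 1]] for p, count in patterns.items()}


# Usage with a tiny stand-in for the standard document.
concat_dict = build_char_concat_dict("ababac", n=2)
```

A pattern that never occurs in the standard document has no entry, so a lookup has to decide how to treat a missing key; treating it as probability zero is one possible choice.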

2. The automatic Japanese word error detection device according to claim 1, comprising: means for performing morphological analysis processing that identifies words, parts of speech, and character types in the Japanese character string of the Japanese document data; means for separating the identified words into kanji string parts and hiragana string parts according to character type; means for setting, for each separated kanji string part, an erroneous character determination part that includes the kanji string and the characters before and after it, searching the character concatenation probability dictionary stored in advance two characters at a time from the beginning to extract character concatenation probability values, and pointing out an erroneous character location when a value does not satisfy the predetermined value; means for setting, for each separated hiragana string part, an inter-word part-of-speech determination part that includes the hiragana string and one word before and after it, searching the inter-word part-of-speech connection dictionary stored in advance two words at a time from the beginning to extract inter-word part-of-speech connection values, and pointing out a word error location when a value does not satisfy the predetermined value; and means for performing word error detection by integrating the error locations extracted from the kanji string parts and the hiragana string parts.
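
The two checks recited in claim 2 can be sketched as follows. This is an illustrative sketch only: all function names are hypothetical, the determination parts are assumed to have been set already by the morphological analysis and separation steps, the two-at-a-time search is assumed to advance one character (or one word) per step, and a pair missing from a dictionary is assumed to have probability zero.

```python
def detect_kanji_errors(determination_part: str,
                        concat_dict: dict[str, float],
                        threshold: float) -> list[int]:
    """Return start offsets of two-character pairs below the threshold.

    `determination_part` is the kanji string plus the characters before and
    after it. Pairs absent from the dictionary count as probability 0.0
    (an assumption).
    """
    flagged = []
    for i in range(len(determination_part) - 1):
        pair = determination_part[i:i + 2]
        if concat_dict.get(pair, 0.0) < threshold:
            flagged.append(i)
    return flagged


def detect_pos_errors(pos_sequence: list[str],
                      pos_dict: dict[tuple[str, str], float],
                      threshold: float) -> list[int]:
    """Return indices of adjacent word pairs whose part-of-speech
    connection value is below the threshold.

    `pos_sequence` holds the parts of speech of the hiragana string's words
    plus one word before and after it.
    """
    flagged = []
    for i in range(len(pos_sequence) - 1):
        key = (pos_sequence[i], pos_sequence[i + 1])
        if pos_dict.get(key, 0.0) < threshold:
            flagged.append(i)
    return flagged


def integrate_errors(kanji_hits: list[int],
                     pos_hits: list[int]) -> list[tuple[str, int]]:
    """Combine both result lists, tagging each hit with its origin."""
    return ([("kanji", i) for i in kanji_hits]
            + [("hiragana", i) for i in pos_hits])
```

Both checks share the same shape: slide over adjacent pairs, look each pair up, and compare the value with the predetermined value. Only the unit differs, characters for kanji string parts and parts of speech for hiragana string parts.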