JPH05225183A - Automatic error detector for words in Japanese sentence

Info

Publication number
JPH05225183A
JPH05225183A JP4023801A JP2380192A JPH05225183A JP H05225183 A JPH05225183 A JP H05225183A JP 4023801 A JP4023801 A JP 4023801A JP 2380192 A JP2380192 A JP 2380192A JP H05225183 A JPH05225183 A JP H05225183A
Authority
JP
Japan
Prior art keywords
character
word
error
string
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP4023801A
Other languages
Japanese (ja)
Inventor
Shinichiro Takagi
伸一郎 高木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP4023801A
Publication of JPH05225183A

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

PURPOSE: To improve error detection accuracy by detecting errors with character link probabilities in kanji (Chinese character) string parts and with inter-word part-of-speech connection probabilities in hiragana (kana) string parts. CONSTITUTION: A character link probability dictionary 70 stores link probability information for each N-character pattern, calculated in advance from the appearance frequencies of consecutive N-character patterns in a standard document that contains no erroneous characters. An inter-word part-of-speech connection dictionary 100 records, in advance, each part of speech, the parts of speech that can follow it, and the probability of that connection. A morphological analysis processing unit 50 identifies the words in the Japanese character string of a source document file 20, and a kanji string part / hiragana string part separation processing unit 60 separates the string into kanji string parts and hiragana string parts. In a kanji string part, any portion whose character link probability value, looked up in the character link probability dictionary 70, falls below a predetermined value is judged to be an erroneous character. In a hiragana string part, any portion whose part-of-speech connection value, looked up in the inter-word part-of-speech connection dictionary 100, falls below a predetermined value is pointed out as an error.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an automatic Japanese word error detector that automatically detects input errors, such as erroneous characters, omitted characters, and word errors, contained in Japanese document data entered from a document input device such as a Japanese word processor.

[0002]

[Prior Art] When large volumes of Japanese documents such as newspaper articles, manuscripts for publication, and scientific and technical papers are converted into electronic files with a character reader to build a Japanese document database, or when Japanese text is written on a Japanese word processor, errors such as erroneous characters, omitted characters, and misconversions can creep in. Proofreading, in which these errors are found and corrected by hand, takes a great deal of time, and its accuracy varies with the proofreader's ability and mental state. To support human proofreading, the following word error detection methods, which apply a Japanese word dictionary and Japanese morphological analysis by computer to Japanese documents converted into electronic files, have been put into practice.

[0003] (1) A means that identifies words and parts of speech and then identifies erroneous words by matching them against a previously prepared list of commonly mistaken words.

(2) A means that, at the word identification stage of Japanese morphological analysis, examines the positional and grammatical connection relations of the word sequence extracted from a Japanese word dictionary and detects errors by extracting, as unanalyzable character strings, the character regions where the words fail to connect.

The basic idea of the former is used in English spelling checkers, but Japanese has so many distinct characters that the list of erroneous words needed for adequate detection accuracy becomes enormous. The latter is shown, for example, in "Japanese Sentence Processing System," Japanese Patent Application No. Sho 62-315698. In this approach, however, the grammar dictionary used to identify grammatical connection relations records only whether a grammatical connection exists. Even when an erroneous word is present and the grammatical connection in question occurs only rarely, the connection is treated as "grammatically connectable" and word identification completes normally, so error detection accuracy is insufficient.

[0004] The following error detection methods, which use statistical information, have also been put into practice.

(3) A means that extracts N-character patterns from a large volume of Japanese documents containing no erroneous characters, prepares in advance a character concatenation probability dictionary that holds, keyed by each N-character pattern, the character concatenation probability value calculated from the appearance frequency of that pattern, and uses the dictionary to detect an erroneous character when the character concatenation probability value of consecutive characters in the source text does not meet a cutoff value.

(4) A means that uses a large volume of Japanese documents containing no erroneous characters to prepare in advance an inter-word part-of-speech connection dictionary that holds part-of-speech connection probabilities between words, calculated from appearance frequency information on the parts of speech of each word and its neighboring words, and uses the dictionary to detect a word as an error when the part-of-speech connection probability between two consecutive words in the source character string does not meet a predetermined value.

The former is shown, for example, in "Automatic Japanese Erroneous Character Detection System," Japanese Patent Application No. Sho 61-024992. However, in character strings such as hiragana, where there are few distinct characters and each appears frequently, the character concatenation probability is generally high even when an erroneous character is present, so erroneous characters are hard to detect. The latter is shown, for example, in "Japanese Word Error Checking Device," Japanese Patent Application No. Hei 2-202295. However, in a kanji string composed of kanji words, the parts of speech of the words are mostly high-frequency ones such as nouns, prefixes, and suffixes, so once a word containing an erroneous character has been identified as, say, a noun, the error cannot be detected from the part-of-speech connection probability.

[0005]

[Problems to Be Solved by the Invention] As described above, the grammatical detection method based on morphological analysis treats a connection as "grammatically connectable" and completes word identification normally even when an erroneous word is present and the grammatical connection occurs only rarely, so its error detection accuracy is insufficient.

[0006] Among the error detection methods that use statistical information, the method based on character concatenation probability also has a problem: in character strings such as hiragana, where there are few distinct characters and each appears frequently, the character concatenation probability is generally high even when an erroneous character is present, so erroneous characters are hard to detect.

[0007] Furthermore, in the detection method based on inter-word part-of-speech connection probability, the parts of speech of the words in a kanji string composed of kanji words are mostly high-frequency ones such as nouns, prefixes, and suffixes. Once a word containing an erroneous character has been identified as, say, a noun, the error cannot be detected from the inter-word part-of-speech connection probability, so error detection accuracy is insufficient.

[0008] An object of the present invention is to improve error detection accuracy by processing kanji string parts and hiragana string parts separately.

[0009]

[Means for Solving the Problems] The present invention is an automatic Japanese word error detector for Japanese character strings of Japanese document data that is entered from a document input device such as a Japanese word processor and contains input errors such as erroneous characters, omitted characters, and word errors. The detector comprises: means for performing morphological analysis that identifies words, parts of speech, and character types; means for separating the string into kanji string parts and hiragana string parts according to character type; means for setting, for each kanji string part, an erroneous character judgment part and detecting erroneous characters by applying a cutoff to character concatenation probability values obtained from a previously prepared character concatenation probability dictionary; and means for setting, for each hiragana string part, an inter-word part-of-speech judgment part and detecting word errors by applying a cutoff to inter-word part-of-speech connection probabilities obtained from a previously prepared inter-word part-of-speech connection dictionary.

[0010]

[Operation] In the present invention, morphological analysis is performed when detecting errors in a Japanese character string that contains errors, and the detection method is chosen according to character type. Kanji string parts use an error detection method based on statistical information, namely character concatenation probability. Hiragana string parts use a detection method based on the connection properties of words, namely inter-word part-of-speech connection probability. Error detection accuracy is therefore higher than when either conventional method is applied uniformly to the whole text.

[0011]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows a configuration example of the device of the present invention.

[0012] In FIG. 1, a processing device 150 consists of a CPU and memory and has the following functional units: a morphological analysis processing unit 50 that uses a Japanese word dictionary 30 and a grammar dictionary 40 to identify the words, parts of speech, and character types in the Japanese character string of a source document file 20 entered from a document input device 10 such as a Japanese word processor; a kanji string part / hiragana string part separation processing unit 60 that separates the string into kanji string parts and hiragana string parts according to character type; a character concatenation probability value extraction processing unit 80 that searches a character concatenation probability dictionary 70 and extracts character concatenation probability values; an erroneous character judgment processing unit 90 that judges erroneous characters from the retrieved character concatenation probability values; an inter-word part-of-speech connection value extraction processing unit 110 that searches an inter-word part-of-speech connection dictionary 100 and extracts part-of-speech connection values; an inter-word part-of-speech verification processing unit 120 that points out errors from the retrieved part-of-speech connection values; and a word error detection processing unit 130 that integrates the error locations extracted from the kanji string parts and the hiragana string parts.

[0013] In this device, two dictionaries are prepared in advance. The character concatenation probability dictionary 70 stores, keyed by each N-character pattern, the concatenation probability information of that pattern, calculated from appearance frequency information on consecutive N-character patterns in a standard document that contains no erroneous characters. The inter-word part-of-speech connection dictionary 100 records each part of speech, the parts of speech that can follow it, and the probability of that connection. The morphological analysis processing unit 50 then uses the Japanese word dictionary 30 and the grammar dictionary 40 to identify the words, parts of speech, and character types in the Japanese character string of the source document file 20, and the kanji string part / hiragana string part separation processing unit 60 separates the string into kanji string parts and hiragana string parts according to character type.
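The two dictionaries described above might be built along the following lines. This is only a sketch under stated assumptions: the text says the probabilities are calculated from appearance frequencies but does not give the formula, so the conditional relative frequency used here (count of the pattern divided by count of its leading characters) is an assumption, as are the function names and the tiny corpus.

```python
from collections import Counter


def build_char_concat_dict(corpus, n=2):
    """Map each N-character pattern to an estimated concatenation probability.

    Assumption: probability = count(pattern) / count(patterns sharing the same
    first N-1 characters). The source does not specify the formula.
    """
    pattern_counts = Counter()
    prefix_counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            pattern = sentence[i:i + n]
            pattern_counts[pattern] += 1
            prefix_counts[pattern[:-1]] += 1
    return {p: c / prefix_counts[p[:-1]] for p, c in pattern_counts.items()}


def build_pos_connection_dict(tagged_corpus):
    """Map (part of speech, following part of speech) to an estimated probability.

    tagged_corpus is a list of sentences, each a list of (word, pos) pairs
    taken from error-free text. The same relative-frequency assumption applies.
    """
    pair_counts = Counter()
    left_counts = Counter()
    for sentence in tagged_corpus:
        tags = [pos for _, pos in sentence]
        for left, right in zip(tags, tags[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
    return {pair: c / left_counts[pair[0]] for pair, c in pair_counts.items()}


# Hypothetical error-free corpus, far too small for real use.
char_dict = build_char_concat_dict(["電話網を持つ", "電話を持つ"])
pos_dict = build_pos_connection_dict([
    [("本体", "general noun"), ("の", "particle"), ("装置", "general noun")],
    [("装置", "general noun"), ("で", "particle")],
])
```

Patterns that never occur in the error-free corpus, such as "電網" here, simply have no record, which matches the later treatment of a missing record as probability 0.0.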

[0014] Subsequent processing is carried out separately for kanji string parts and hiragana string parts. For a separated kanji string part, the character concatenation probability value extraction processing unit 80 searches the previously prepared character concatenation probability dictionary 70 and extracts the character concatenation probability values, and the erroneous character judgment processing unit 90 judges any location whose character concatenation probability value is below the predetermined value to be an erroneous character.

[0015] For a separated hiragana string part, the inter-word part-of-speech connection value extraction processing unit 110 searches the previously prepared inter-word part-of-speech connection dictionary 100 and extracts the part-of-speech connection values, and the inter-word part-of-speech verification processing unit 120 points out an error when a part-of-speech connection value is below the predetermined value. The word error detection processing unit 130 then integrates the error locations extracted from the kanji string parts and the hiragana string parts. Finally, a detected document file 140 containing these error detection results is created.

[0016] FIG. 2 shows a schematic flow of the process for detecting erroneous words in the configuration example shown in FIG. 1. The operation is described below following this flow.

(Step 1): The character string to be morphologically analyzed is read from the source document file 20.

[0017] (Step 2): The morphological analysis processing unit 50 performs morphological analysis using the Japanese word dictionary 30 and the grammar dictionary 40 and identifies words, parts of speech, and character types.

[0018] (Step 3): Using the word identification results, the string is separated by character type into hiragana string parts (hiragana and punctuation marks) and kanji string parts (character strings other than hiragana string parts).

(Step 4): Processing branches depending on whether the part is a kanji string part or a hiragana string part. Kanji string parts undergo the separate processing of steps 5 to 7.
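The separation in step 3 can be sketched as follows, assuming that the morphological analysis result is available as a list of (word, part of speech) pairs. The function names, the use of the Unicode hiragana range U+3041 to U+309F, the choice of "、" and "。" as the punctuation marks, and most of the part-of-speech labels in the example are assumptions made for illustration.

```python
from itertools import groupby

# Assumption: only these two marks count as punctuation for the hiragana part.
PUNCTUATION = set("、。")


def is_hiragana_word(surface):
    """True if every character is hiragana (U+3041..U+309F) or punctuation."""
    return all("\u3041" <= ch <= "\u309f" or ch in PUNCTUATION for ch in surface)


def split_runs(morphemes):
    """Group consecutive words into (kind, start, end) runs.

    kind is "hiragana" or "kanji" (meaning everything that is not a hiragana
    string part); start and end are word indices, end exclusive.
    """
    runs = []
    index = 0
    for is_hira, group in groupby(morphemes, key=lambda m: is_hiragana_word(m[0])):
        length = len(list(group))
        runs.append(("hiragana" if is_hira else "kanji", index, index + length))
        index += length
    return runs


# Hypothetical analysis result; apart from 本体 and な the labels are illustrative.
morphemes = [
    ("本体", "general noun"), ("な", "adjective stem"), ("い", "adjective ending"),
    ("の", "particle"), ("装置", "general noun"), ("で", "particle"),
    ("動", "verb stem"),
]
runs = split_runs(morphemes)
```

Each run can then be dispatched to the kanji-side or hiragana-side check, which is the branch described in step 4.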

[0019] (Step 5): One character before and one character after the kanji string part are added to it to set the erroneous character judgment part.

(Step 6): Starting from the beginning, the character concatenation probability value of each two-character sequence is extracted from the character concatenation probability dictionary 70.

[0020] (Step 7): Among the results of step 6, two-character sequences whose character concatenation probability value is below the predetermined value (0.05 in the embodiment of FIG. 3) are detected as errors. Hiragana string parts undergo the separate processing of steps 8 to 10.
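Steps 5 to 7 for a kanji string part can be sketched as below. The function name and the character-offset interface are assumptions. The sketch reads "each two-character sequence from the beginning" as a sliding window of overlapping pairs and treats a missing dictionary record as probability 0.0; the dictionary holds a single illustrative entry.

```python
def check_kanji_part(text, start, end, char_dict, threshold=0.05):
    """Return (offset, two_chars, probability) for each suspicious pair.

    text[start:end] is the kanji string part. Step 5 widens it by one
    character on each side, step 6 looks up every overlapping two-character
    sequence, and step 7 flags those below the threshold.
    """
    lo = max(start - 1, 0)
    hi = min(end + 1, len(text))
    flagged = []
    for i in range(lo, hi - 1):
        pair = text[i:i + 2]
        prob = char_dict.get(pair, 0.0)  # no record is treated as 0.0
        if prob < threshold:
            flagged.append((i, pair, prob))
    return flagged


# Illustrative dictionary: "網を" is known, "電網" has no record.
char_dict = {"網を": 0.25}
flagged = check_kanji_part("電網を", 0, 2, char_dict)
```

With these inputs the judgment part is "電網を", and only the pair "電網" falls below the 0.05 threshold.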

[0021] (Step 8): One word before and one word after the hiragana string part are added to it to set the inter-word part-of-speech judgment part.

(Step 9): Starting from the beginning, the inter-word part-of-speech connection dictionary 100 is searched for each pair of consecutive words, and the inter-word part-of-speech connection value is extracted.

[0022] (Step 10): Among the search results of step 9, two-word sequences whose inter-word part-of-speech connection value is below the predetermined value (0.05 in the embodiment of FIG. 4) are detected as errors.

(Step 11): The erroneous character judgment parts and inter-word part-of-speech judgment parts that contain errors are extracted as word error locations.
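Steps 8 to 10 for a hiragana string part follow the same pattern at the word level. In this sketch the function name and the word-index interface are assumptions, and a missing dictionary record is again treated as 0.0. The value 0.31 for "adjective stem" followed by "adjective ending" is the one quoted in the embodiment; the other two dictionary entries are hypothetical.

```python
def check_hiragana_part(morphemes, start, end, pos_dict, threshold=0.05):
    """Return (word_index, two_words, probability) for each suspicious pair.

    morphemes[start:end] is the hiragana string part. Step 8 widens it by one
    word on each side, step 9 looks up the parts of speech of every pair of
    consecutive words, and step 10 flags pairs below the threshold.
    """
    lo = max(start - 1, 0)
    hi = min(end + 1, len(morphemes))
    flagged = []
    for i in range(lo, hi - 1):
        (word1, pos1), (word2, pos2) = morphemes[i], morphemes[i + 1]
        prob = pos_dict.get((pos1, pos2), 0.0)  # no record is treated as 0.0
        if prob < threshold:
            flagged.append((i, word1 + word2, prob))
    return flagged


morphemes = [
    ("本体", "general noun"), ("な", "adjective stem"), ("い", "adjective ending"),
    ("の", "particle"), ("装置", "general noun"),
]
pos_dict = {
    ("adjective stem", "adjective ending"): 0.31,
    ("adjective ending", "particle"): 0.2,   # hypothetical value
    ("particle", "general noun"): 0.4,       # hypothetical value
}
flagged = check_hiragana_part(morphemes, 1, 4, pos_dict)
```

Because the dictionary has no record for "general noun" followed by "adjective stem", the pair "本体な" is the only one flagged.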

[0023] (Step 12): The word error locations extracted in step 11 are marked and output to the detected document file 140.

(Step 13): It is determined whether the end of the character string under morphological analysis has been reached. If not, processing returns to step 3 and continues; if so, processing ends.
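The integration and marking of steps 11 and 12 might look like the sketch below. The text does not say how error locations are marked in the output file, so the "[[" and "]]" markers, the function names, and the assumption that both checks report character offsets as (start, end) spans are all hypothetical. The sample string is assembled from the three judgment parts of the FIG. 3 embodiment purely for illustration.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans; end is exclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mark_errors(text, spans, open_mark="[[", close_mark="]]"):
    """Return text with every merged error span wrapped in markers."""
    out = []
    pos = 0
    for start, end in merge_spans(spans):
        out.append(text[pos:start])
        out.append(open_mark + text[start:end] + close_mark)
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Spans for the flagged pairs 電網, 持シ, and 通か.
marked = mark_errors("電網を持システムに共通か", [(0, 2), (3, 5), (10, 12)])
```

Merging before marking keeps the output readable when the kanji-side and hiragana-side checks flag overlapping locations.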

[0024] As embodiments using the configuration example of FIG. 1, the error judgment processes for kanji string parts and hiragana string parts are described with reference to FIG. 3 and FIG. 4, respectively. FIG. 3 illustrates, with a concrete example, the process of detecting erroneous characters contained in a kanji string part.

[0025] Here, 200 is a Japanese character string containing erroneous characters, extracted from the source document file; 210 marks the locations of omitted and erroneous characters in the Japanese character string; 220 is the correct characters; 230 is the word and part-of-speech information produced by morphological analysis; 240 is the kanji string parts separated by character type; 250 is the erroneous character judgment parts; 260 is the characters, cut out two at a time from the beginning, used to search for character concatenation probability values; 270 is the character concatenation probability values retrieved from the character concatenation probability dictionary 70; 280 is the locations whose character concatenation probability value is below the predetermined value; and 290 is the two-character sequences judged to be errors.

[0026] In this embodiment, the source Japanese character string 200 contains two omitted characters, the "話" of "電話網" and the "つ" of "持つ", and the erroneous character "か" in "共通化". After morphological analysis of this string, the kanji string parts 240, whose character type is anything other than hiragana, are extracted, and the erroneous character judgment parts 250 are set from them. Next, in each of the erroneous character judgment parts "電網を", "を持システムに", and "に共通か", the character concatenation probability dictionary 70 is searched with each two-character sequence from the beginning as the key, and the character concatenation probability value is extracted. For example, the search for "網を" returns a character concatenation probability value of 0.25, whereas the dictionary has no record for "電網", so its value is taken to be 0.0. Two-character locations 280 whose retrieved value is below the predetermined value (0.05 in this embodiment) are detected as errors. This indicates that "電網" may contain an erroneous character, and it is detected as a two-character sequence 290 containing an error. Likewise, "持シ" and "通か" are below the predetermined value and are detected as errors.

[0027] FIG. 4 illustrates, with a concrete example, the process of detecting erroneous words contained in a hiragana string part. Here, 300 is a Japanese character string containing erroneous characters, extracted from the source document file; 310 marks the error locations in the Japanese character string; 320 is the correct characters; 330 is the word and part-of-speech information produced by morphological analysis; 340 is the hiragana string parts separated by character type; 350 is the inter-word part-of-speech judgment parts; 360 is the words, cut out two at a time from the beginning, used to search for inter-word part-of-speech connection probability values; 370 is the inter-word part-of-speech connection probability values retrieved from the inter-word part-of-speech connection dictionary 100; 380 is the locations whose inter-word part-of-speech connection probability value is below the predetermined value; and 390 is the two-word sequences judged to be errors.

[0028] In this embodiment, the source Japanese character string 300 contains an erroneous "ない" in "本体内" and an erroneously inserted "く" in "動かすため". After morphological analysis of this string, the hiragana string parts 340, whose character type is hiragana, are extracted, and one word before and one word after each are added to set the inter-word part-of-speech judgment parts 350. Next, in each of the inter-word part-of-speech judgment parts "本体ないの装置", "装置で動", and "動くかすために", the inter-word part-of-speech connection dictionary 100 is searched with each two-word sequence from the beginning as the key, and the inter-word part-of-speech connection probability value is extracted.

[0029] For example, a search for the pair of parts of speech "adjective stem" and "adjective ending" returns an inter-word part-of-speech connection probability value of 0.31. In "本体ないの装置", on the other hand, the inter-word part-of-speech connection dictionary has no record for the pair "本体" (general noun) and "な" (adjective stem), so the value is taken to be 0.0. Two-word locations 380 whose retrieved value is below the predetermined value (0.05 in this embodiment) are detected as errors. This indicates that "本体な" may contain an error, and it is detected as a two-word sequence 390 containing an error. Likewise, the pair "かす" and "ため" is below the predetermined value and is detected as an error. In this case the actual error is the inserted "く", but because the flagged location lies within the inter-word part-of-speech judgment part, the error can still be detected.

[0030] In this way, morphological analysis is performed when detecting errors in a Japanese character string that contains errors, and the detection method is chosen according to character type. Kanji string parts use an error detection method based on statistical information, namely character concatenation probability, and hiragana string parts use a detection method based on the connection properties of words, namely inter-word part-of-speech connection probability. Error detection accuracy is therefore higher than when either conventional method is applied uniformly.

[0031] The method of setting the character concatenation probabilities, the method of setting their predetermined value, and the method of judging erroneous words from character concatenation probabilities, as well as the method of setting the inter-word part-of-speech connection probabilities, the method of setting their predetermined value, and the method of judging erroneous words from inter-word part-of-speech connection probabilities, may be changed as appropriate according to the field of the document, the frequency of the words used, and so on.

[0032]

[Effects of the Invention] As described above, according to the present invention, the detection method is chosen according to character type: kanji string parts use an error detection method based on statistical information, namely character concatenation probability, and hiragana string parts use a detection method based on the connection properties of words, namely inter-word part-of-speech connection probability. Error detection accuracy can therefore be improved compared with applying either conventional method uniformly.

[Brief Description of the Drawings]

FIG. 1 shows a configuration example of the device of the present invention.

FIG. 2 shows a schematic flow of the processing of the device of FIG. 1.

FIG. 3 shows an embodiment of the process for detecting erroneous characters contained in a kanji string part.

FIG. 4 is an explanatory diagram showing an embodiment of the process for detecting erroneous words contained in a hiragana string part.

[Explanation of Symbols]

10 Document input device
20 Source document file
30 Japanese word dictionary
40 Grammar dictionary
50 Morphological analysis processing unit
60 Kanji string part / hiragana string part separation processing unit
70 Character concatenation probability dictionary
80 Character concatenation probability value extraction processing unit
90 Erroneous character determination processing unit
100 Inter-word part-of-speech connection dictionary
110 Inter-word part-of-speech connection value extraction processing unit
120 Inter-word part-of-speech test processing unit
130 Word error detection processing unit
140 Detected document file
150 Processing device
200 Japanese sentence character string
210 Error location
220 Correct character
230 Morphological analysis processing result
240 Kanji string part
250 Erroneous character determination part
260 Search target character
270 Character concatenation probability value
280 Location below the default value
290 Detected character
300 Japanese sentence character string
310 Error location
320 Correct character
330 Morphological analysis processing result
340 Hiragana string part
350 Inter-word part-of-speech determination part
360 Search target character
370 Inter-word part-of-speech connection probability value
380 Location below the default value
390 Detected word

Claims (2)

[Claims]
1. An automatic Japanese word error detection device for detecting input errors such as erroneous characters, omitted characters, and word errors contained in Japanese document data input from a document input device, comprising:
a Japanese word dictionary containing Japanese words;
a grammar dictionary describing grammatical analysis rules;
a morphological analysis processing unit that identifies words, parts of speech, and character types using the Japanese word dictionary and the grammar dictionary;
a kanji string part / hiragana string part separation processing unit that separates the identified words into a kanji string part and a hiragana string part according to character type;
for the separated kanji string part, a character concatenation probability dictionary that stores, keyed by each N-character pattern, concatenation probability information for each N characters calculated in advance from appearance frequency information on patterns of N consecutive characters in a standard document containing no erroneous characters;
a character concatenation probability value extraction processing unit that searches the character concatenation probability dictionary and extracts a character concatenation probability value;
an erroneous character determination processing unit that treats a location where the retrieved character concatenation probability value does not satisfy a default value as an erroneous character;
for the separated hiragana string part, an inter-word part-of-speech connection dictionary that stores in advance a part of speech, the parts of speech that can be connected after it, and the probability of each such connection;
an inter-word part-of-speech connection value extraction processing unit that searches the inter-word part-of-speech connection dictionary and extracts a part-of-speech connection value;
an inter-word part-of-speech test processing unit that points out an error when the retrieved part-of-speech connection value does not satisfy a default value; and
a word error detection processing unit that integrates the error locations extracted from the kanji string part and the hiragana string part.
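The separation by character type recited in claim 1 can be illustrated with a short sketch. It works directly on characters and Unicode code point ranges, which is a simplification: the claimed device separates words already identified by morphological analysis. The names `char_type` and `split_by_char_type` and the chosen ranges are assumptions for illustration only.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character as 'kanji', 'hiragana', or 'other'.

    Uses the CJK Unified Ideographs block (U+4E00 to U+9FFF) and the
    Hiragana block (U+3041 to U+309F). Characters outside these ranges,
    such as the iteration mark U+3005, fall into 'other' in this sketch.
    """
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    return "other"


def split_by_char_type(text):
    """Split text into maximal runs of characters of the same type."""
    return [(kind, "".join(group)) for kind, group in groupby(text, key=char_type)]


if __name__ == "__main__":
    for kind, run in split_by_char_type("日本語の文章"):
        print(kind, run)
```

Each kanji run would then be checked against the character concatenation probability dictionary, and each hiragana run against the inter-word part-of-speech connection dictionary.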
2. The automatic Japanese word error detection device according to claim 1, comprising:
means for performing morphological analysis processing that identifies words, parts of speech, and character types for a Japanese character string of Japanese document data;
means for separating the identified words into a kanji string part and a hiragana string part according to character type;
means for, with respect to the separated kanji string part, setting an erroneous character determination part that includes the kanji string and the characters before and after it, searching the prestored character concatenation probability dictionary for every two characters from the beginning to extract a character concatenation probability value, and pointing out an erroneous character location when the value does not satisfy a default value;
means for, with respect to the separated hiragana string part, setting an inter-word part-of-speech determination part that includes the hiragana string and one word before and after it, searching the prestored inter-word part-of-speech connection dictionary for every two words from the beginning to extract an inter-word part-of-speech connection value, and pointing out a word error location when the value does not satisfy a default value; and
means for performing word error detection by integrating the error locations extracted from the kanji string part and the hiragana string part.
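The two lookups recited in claim 2 might be sketched as below. The sketch assumes that "every two characters" and "every two words" mean overlapping adjacent pairs, that a missing dictionary entry counts as probability 0.0, and that the dictionaries are plain mappings; none of this is fixed by the claim. The function names, probability values, and part-of-speech labels are hypothetical.

```python
def find_char_errors(window, char_probs, threshold):
    """Return each index i where the character pair window[i:i+2] has a
    concatenation probability below the default value.

    `window` is the erroneous character determination part: the kanji
    string plus the characters before and after it.
    """
    return [
        i for i in range(len(window) - 1)
        if char_probs.get(window[i:i + 2], 0.0) < threshold
    ]


def find_pos_errors(tagged_words, pos_probs, threshold):
    """Return each index i where the part-of-speech pair of words i and
    i+1 has a connection probability below the default value.

    `tagged_words` is the inter-word part-of-speech determination part:
    a list of (word, part_of_speech) tuples covering the hiragana string
    plus one word before and after it.
    """
    return [
        i for i in range(len(tagged_words) - 1)
        if pos_probs.get((tagged_words[i][1], tagged_words[i + 1][1]), 0.0) < threshold
    ]


if __name__ == "__main__":
    # Hypothetical dictionaries; real values would come from a corpus.
    char_probs = {"の日": 0.2, "日本": 0.9, "本語": 0.5, "語を": 0.3}
    pos_probs = {
        ("noun", "particle"): 0.6,
        ("particle", "verb"): 0.5,
        ("particle", "particle"): 0.01,
    }
    # "木" stands in for a mistyped "本".
    print(find_char_errors("の日木語を", char_probs, threshold=0.1))
    tagged = [("本", "noun"), ("を", "particle"), ("が", "particle"), ("読む", "verb")]
    print(find_pos_errors(tagged, pos_probs, threshold=0.05))
```

Integrating the two result lists into one set of error locations would additionally require mapping both back to offsets in the original sentence, which this sketch leaves out.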
JP4023801A 1992-02-10 1992-02-10 Automatic error detector for words in japanese sentence Pending JPH05225183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP4023801A JPH05225183A (en) 1992-02-10 1992-02-10 Automatic error detector for words in japanese sentence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP4023801A JPH05225183A (en) 1992-02-10 1992-02-10 Automatic error detector for words in japanese sentence

Publications (1)

Publication Number Publication Date
JPH05225183A true JPH05225183A (en) 1993-09-03

Family

ID=12120431

Family Applications (1)

Application Number Title Priority Date Filing Date
JP4023801A Pending JPH05225183A (en) 1992-02-10 1992-02-10 Automatic error detector for words in japanese sentence

Country Status (1)

Country Link
JP (1) JPH05225183A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065384A (en) * 2009-09-16 2011-03-31 Nippon Telegr & Teleph Corp <Ntt> Text analysis device, method, and program coping with wrong letter and omitted letter

Similar Documents

Publication Publication Date Title
WO1997004405A1 (en) Method and apparatus for automated search and retrieval processing
JPH0211934B2 (en)
JP2536633B2 (en) Compound word extraction device
JPH05225183A (en) Automatic error detector for words in japanese sentence
JP2681663B2 (en) Japanese sentence correction candidate character extraction method
JP2599973B2 (en) Japanese sentence correction candidate character extraction device
JPS6394364A (en) Automatic correction device for wrong character in japanese sentence
JPH01281561A (en) Method for extracting japanese sentence correcting candidate character
JPH087046A (en) Document recognition device
JP2575947B2 (en) Phrase extraction device
JPH0362260A (en) Detecting/correcting device for katakana word error
JP3233283B2 (en) Japanese sentence analyzer
JP2592993B2 (en) Phrase extraction device
JPH08221443A (en) Method and device for retrieving text including kanji
JPH03156589A (en) Method for detecting and correcting erroneously read character
JPH077412B2 (en) Japanese sentence correction candidate character extraction device
JPH02253370A (en) Morpheme analyzing device
JPS6358570A (en) Automatic detecting system for wrong character of japanese sentence
JPH0546612A (en) Sentence error detector
JPH0614376B2 (en) Japanese sentence error detection device
JPH03242755A (en) Katakana word error detecting and correcting device
JPS62182982A (en) Automatic detection system for wrong word in japanese text
JPH02105968A (en) Automatic test and correction system for japanese sentence error
JPH05233619A (en) Method for correcting error of japanese language sentence and device therefor
JPH0432958A (en) Japanese sentence error word detecting device