JPS6358570A

JPS6358570A - Automatic detecting system for wrong character of Japanese sentence

Info

Publication number: JPS6358570A
Application number: JP61203220A
Authority: JP
Inventors: Shinichiro Takagi; 伸一郎高木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1986-08-29
Filing date: 1986-08-29
Publication date: 1988-03-14

Abstract

PURPOSE: To greatly improve the overall accuracy of wrong character detection and to greatly reduce the overall processing load by cascading two stages: detection of wrong characters by applying a cut-off to character concatenation probabilities, and detection of wrong characters by morphological analysis. CONSTITUTION: For each character of an input character string, the product of its character concatenation probabilities with the preceding and following characters is calculated using a character concatenation probability dictionary 8 and compared with a prescribed cut-off value. A wrong character deciding part 22 thereby detects wrong characters and sends the clauses containing them to a correction processing part 9. Clauses in which the part 22 found no wrong character are processed by a wrong character detecting part 4 using morphological analysis, which extracts the areas containing wrong characters. A wrong character position detecting part 7 then uses the dictionary 8 to find the position with a low character concatenation probability within each area and sends it to the part 9. In the example, the 1st and 2nd wrong characters are detected at the part 22, and the 3rd wrong character is detected by the parts 4 and 7.

Description

[Detailed Description of the Invention]

(1) Technical field to which the invention belongs

The present invention relates to an automatic Japanese typo detection method that, for the purpose of building Japanese document databases, automatically detects typos contained in Japanese character strings of mixed kanji and kana read in from an input device.

(2) Prior art

When a large volume of Japanese documents, such as newspaper articles, manuscripts for publication, and scientific and technical papers, is converted into electronic files to build a Japanese document database, the reading results contain typos. A grammatical method has been realized that detects such typos, by morphological analysis using a word dictionary and a grammar dictionary, as clause-level regions containing the typo (typo-containing regions). See, for example, Ikehara and Shirai, "Automatic detection of Japanese typos using a word analysis program and extraction of correction candidates using a second-order Markov model," Transactions of the Information Processing Society of Japan, vol. 25, no. 2, 1984.

FIG. 3 shows a configuration example of a conventional typo detection method based on the grammatical approach. Reference numeral 1 denotes an input device for reading documents, such as a kanji OCR or pen-touch input device; 2, an input processing unit that performs the reading; 3, an input Japanese text database holding the reading results that were read by the input device 1 and recorded on a magnetic storage device as character codes; 4, a typo detection unit that extracts typo-containing regions by morphological analysis; 5 and 6, the word dictionary and grammar dictionary used by the typo detection unit 4; 7, a typo position detection unit that uses inter-character concatenation probabilities to find, within the typo-containing region extracted by the typo detection unit 4, the position regarded as the typo; 8, a character concatenation probability dictionary; 9, a correction processing unit that corrects detected typos; and 10, a processing device.

In this example, typo-containing regions are extracted by morphological analysis, such as word candidate extraction and part-of-speech connection tests between words. The typo position is then detected using the character concatenation probability dictionary 8, which is created in advance from a large volume of error-free documents of the same kind.

FIG. 4 shows an example of typo detection by the grammatical method, in which two randomly chosen wrong characters have been inserted into the original text. Reference numeral 11 denotes an inserted typo; 12, the correct character; 13, the part-of-speech connection status around the typo; and 14, the position at which the part-of-speech connection test fails.

In FIG. 4, after exhaustive word extraction, a part-of-speech connection test is performed between adjacent words, and a clause is extracted as containing a typo when the test fails. In the case of 「評価負社会的」, the character 「負」 was assigned the part of speech of a suffix, so the part-of-speech connections with the preceding and following words hold and the typo cannot be detected.

In short, such a grammatical method has the following problems.

■ Because checking relies on part-of-speech connections, many typos remain undetected.

■ Looking up a huge word dictionary and executing complex test logic take a great deal of processing time.

In contrast to the grammatical method, there is also a method such as the one described in Japanese Patent Application No. 61-024992. In that method, N-character patterns are extracted from a large volume of typo-free Japanese documents, and a character concatenation probability dictionary is created in advance that holds, keyed by each N-character pattern, the character concatenation probability calculated from the occurrence frequencies of the patterns. Typos are then detected by applying a certain cutoff value to the character concatenation probabilities of consecutive character strings (for example, two characters) in the text under examination.

FIG. 5 shows an example of typo detection by applying a cutoff to character concatenation probability values. Reference numeral 15 denotes the character string around a typo; 16, the consecutive two-character patterns; 17, the forward two-character concatenation probability value of each two-character pattern 16; 18, the product of the forward two-character concatenation probability values; 19, a marker indicating that the product 18 is less than the cutoff value of 10^-9; and 20, an example in which detection fails because the product is higher than the cutoff value.

In FIG. 5, using a character concatenation probability dictionary compiled statistically in advance, a character is extracted as a typo when the product 18 of the forward two-character concatenation probability values is smaller than the predetermined cutoff value of 10^-9. Even with such probabilistic methods, however, a problem remains. In character strings such as hiragana, where the number of distinct characters is small and each occurs frequently, the character concatenation probability generally remains high even when a typo is mixed in, so the typo cannot be detected, as in the example 「生まれつい」.

(3) Object of the invention

The object of the present invention is to provide an automatic Japanese typo detection method that improves overall typo detection accuracy and reduces processing load. To this end, a method that detects typos by applying a cutoff to character concatenation probability values from a character concatenation probability dictionary created in advance is connected in cascade with the conventional method that detects typos by morphological analysis. Clauses containing no typo detected by the typo determination unit of the former are examined again by the typo detection unit of the latter, which uses morphological analysis.

(4) Constitution of the invention

In the present invention, using a large volume of typo-free documents of the same kind as the documents subject to automatic typo detection, consecutive N-character sequences are read out in order from the beginning and the occurrence frequency of each N-character pattern is obtained. The ratio of the frequency of each N-character pattern to the sum of the frequencies of all N-character patterns sharing the same leading (N-1) characters is defined as the inter-character concatenation probability and is stored in advance as a character concatenation probability dictionary. For each character subject to typo detection in a character string input from the input Japanese text database, the character concatenation probability values with the preceding and following characters are extracted from the dictionary, and the character is judged to be a typo when their product is smaller than a preset cutoff value. Clauses that do not contain a typo so identified are then processed again by the typo detection unit based on morphological analysis, so that typo detection is performed in cascade. This is the principal feature of the invention.
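The dictionary construction and the cutoff test described above can be sketched as follows for N = 2. This is a minimal illustration under stated assumptions, not the patented implementation: the names `build_dictionary` and `flag_typos`, the floor probability `unseen` used for pairs absent from the dictionary, and the decision to skip the first and last characters (which have only one neighbour) are all hypothetical choices that the source does not specify.

```python
from collections import Counter


def build_dictionary(corpus: str) -> dict[str, float]:
    """Map each 2-character pattern to its concatenation probability.

    Probability = count(pattern) / total count of all patterns that share
    the same leading character (the leading N-1 characters for N = 2).
    """
    pair_counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    prefix_totals: Counter = Counter()
    for pair, count in pair_counts.items():
        prefix_totals[pair[0]] += count
    return {pair: count / prefix_totals[pair[0]]
            for pair, count in pair_counts.items()}


def flag_typos(text: str, dictionary: dict[str, float],
               cutoff: float, unseen: float = 1e-6) -> list[int]:
    """Return indices whose product of adjacent pair probabilities is below cutoff.

    `unseen` is an assumed floor for pairs missing from the dictionary.
    """
    flagged = []
    for i in range(1, len(text) - 1):
        with_previous = dictionary.get(text[i - 1:i + 1], unseen)
        with_next = dictionary.get(text[i:i + 2], unseen)
        if with_previous * with_next < cutoff:
            flagged.append(i)
    return flagged


# Toy corpus standing in for the large typo-free document collection.
toy_dictionary = build_dictionary("abcabcabcabc")
print(flag_typos("abxabc", toy_dictionary, cutoff=1e-9))  # [2]
```

With a floor of 1e-6 and a cutoff of 1e-9, the character whose two adjacent pairs are both unseen is flagged (product 1e-12), while its neighbours, each with only one unseen pair (product 1e-6), are not.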

The invention differs from the prior art in the following points.

■ A typo detection method that applies a cutoff to character concatenation probability values from a character concatenation probability dictionary created in advance is connected in cascade with the conventional typo detection method based on morphological analysis.

■ Only clauses that contain no character judged to be a typo by the former method are passed to the latter method.

(5) Embodiment

FIG. 1 shows a configuration example of the present invention. Reference numeral 21 denotes a character concatenation probability value extraction unit that extracts character concatenation probability values from the character concatenation probability dictionary 8 using two-character patterns as keys; 22, a typo determination unit that detects typos from the character concatenation probability values and a predetermined cutoff value; 23, a cutoff value storage unit that stores the predetermined cutoff value; 24, a character judged to be a typo by the typo determination unit 22; 25, a clause that does not contain a typo detected by the typo determination unit 22; 26, a typo-containing region judged by the typo detection unit 4 to contain a typo; and 27, a clause in which the typo detection unit 4 detected no typo.

In this configuration example, the following series of processes is performed. For the input character string, the product of the character concatenation probability values with the preceding and following characters is first calculated using the character concatenation probability dictionary 8 and compared with the predetermined cutoff value. The typo determination unit 22 thereby detects typos and sends the clauses containing them to the correction processing unit 9. Clauses that contain no typo identified by the typo determination unit 22 are processed by the typo detection unit 4 based on morphological analysis, which extracts typo-containing regions. The typo position detection unit 7 then uses the character concatenation probability dictionary 8 again to find the position with a low character concatenation probability value and sends it to the correction processing unit 9.
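The control flow of this cascade can be sketched as below. It is only an outline under assumptions: the three stages are passed in as hypothetical callables (`cutoff_stage` for unit 22, `morph_stage` for unit 4, `locate_lowest` for unit 7), the input is assumed to be already split into clauses, and the source does not say how clause boundaries are obtained before morphological analysis.

```python
from typing import Callable

# (clause index, character index within the clause, stage that found it)
Finding = tuple[int, int, str]


def detect_cascaded(
    clauses: list[str],
    cutoff_stage: Callable[[str], list[int]],
    morph_stage: Callable[[str], bool],
    locate_lowest: Callable[[str], int],
) -> list[Finding]:
    """Run the probability cutoff first; run morphological analysis only on
    clauses in which the cutoff stage found nothing."""
    findings: list[Finding] = []
    for index, clause in enumerate(clauses):
        positions = cutoff_stage(clause)
        if positions:
            # Clause goes straight to correction; the costly stage is skipped.
            findings.extend((index, pos, "cutoff") for pos in positions)
            continue
        if morph_stage(clause):
            # Typo-containing region: pick the lowest-probability position.
            findings.append((index, locate_lowest(clause), "morph"))
    return findings


# Stand-in stages for illustration only: "X" plays a typo the cutoff catches,
# "?" plays a typo that only the morphological stage catches.
result = detect_cascaded(
    ["abc", "aXc", "a?c"],
    cutoff_stage=lambda c: [i for i, ch in enumerate(c) if ch == "X"],
    morph_stage=lambda c: "?" in c,
    locate_lowest=lambda c: c.index("?"),
)
print(result)  # [(1, 1, 'cutoff'), (2, 1, 'morph')]
```

The early `continue` is the point of the design: any clause already flagged by the cheap stage never reaches the expensive one.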

FIG. 2 shows an example of typo detection according to the present invention, for comparison with the example of detection by the character concatenation probability cutoff in FIG. 5. With the configuration of the present invention, the first and second typos are detected by applying the cutoff to the product of the character concatenation probability values with the preceding and following characters. The third typo is sent to the typo detection unit 4 within a clause judged to contain no typo, where a part-of-speech connection test error in the morphological analysis causes the typo-containing region to be extracted. The typo position detection unit 7 then extracts the candidate typo position within the region, and the typo is detected.

With this structure and operation, by setting an appropriate cutoff value, most of the typos contained in the input text can be detected, together with their positions, by the typo detection method that uses the character concatenation probability dictionary. Moreover, where the cutoff cannot detect a typo because the character concatenation probability remains generally high even with the typo present, as with a hiragana error within a hiragana string, the typo can be detected by the morphological analysis performed in cascade, such as the part-of-speech connection test.

The effects include the following advantages.

■ Typos that the detection method using the character concatenation probability dictionary fails to detect can be detected by morphological analysis, so typo detection accuracy is greatly improved.

■ Most typos can be handled by the detection method using the character concatenation probability dictionary, which has a small processing load. The processing done by the morphological analysis method, which has a large processing load, is therefore kept to a minimum, and the overall processing load is greatly reduced.

(6) Effects of the invention

As explained above, according to the present invention, the method that detects typos by applying a cutoff to character concatenation probability values from the character concatenation probability dictionary is connected in cascade with the conventional method that detects typos by morphological analysis, and clauses that contain no typo detected by the typo determination unit of the former are processed again by the typo detection unit of the latter. This provides the following advantages.

■ Typos that the former method, using the character concatenation probability dictionary, cannot detect can be detected by the latter method's morphological analysis, so overall typo detection accuracy is greatly improved.

■ Most typos can be handled by the former method, which has a small processing load, so the processing done by the latter method, which has a large processing load, is kept to a minimum and the overall processing load is greatly reduced.

The effects of the present invention can be evaluated quantitatively as follows.

The input text was made by inserting randomly selected characters as typos into original newspaper articles at a rate of about 10%. On this text, a prototype of the typo detection method based on the character concatenation probability cutoff, with the cutoff value printed as 「１０１」 (presumably a power of ten whose exponent was lost), achieved a typo detection rate of 92% and a false detection rate (correct characters detected as typos) of 0.87%. For the same input text, the typo detection method based on morphological analysis achieved a typo detection rate of 86.1% and a false detection rate of 0.475%. Combining the two methods in cascade as in the present invention therefore raises the typo detection rate to 98.9%, but the false detection rate also rises to 1.18%. For this reason, a comprehensive evaluation function is defined in terms of the number of typos inserted (the formula itself is not preserved in this text). When the cutoff value of the first-stage character concatenation probability is suitably adjusted, the final typo content rate is about 8% for the present invention, against about 18% for the typo detection method based on morphological analysis alone, an improvement in typo detection accuracy of about 50%.
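The quoted figures can be cross-checked with a little arithmetic. The sketch below is an observation about the numbers, not a formula taken from the source, and rests on three assumptions: that the two stages miss typos independently, that the lost evaluation function has the form (undetected typos + falsely detected characters) / typos inserted, and that false detection rates apply to the correct characters only. The text size (1406 characters, 144 typos) comes from the performance discussion that follows.

```python
def combined_detection_rate(first: float, second: float) -> float:
    """Detection rate of two stages in cascade, assuming independent misses."""
    return 1.0 - (1.0 - first) * (1.0 - second)


def final_typo_content_rate(typos: int, correct_chars: int,
                            detection_rate: float,
                            false_detection_rate: float) -> float:
    """Assumed form: (undetected typos + false detections) / typos inserted."""
    undetected = typos * (1.0 - detection_rate)
    false_detections = correct_chars * false_detection_rate
    return (undetected + false_detections) / typos


# 92% and 86.1% in cascade.
print(f"{combined_detection_rate(0.92, 0.861):.1%}")  # 98.9%

# Morphological analysis alone: 144 typos among 1406 characters.
print(f"{final_typo_content_rate(144, 1406 - 144, 0.861, 0.00475):.1%}")  # 18.1%
```

Under these assumptions the 98.9% detection rate and the roughly 18% final typo content rate for morphological analysis alone are both reproduced. The 1.18% combined false detection rate and the 8% figure, which depends on the adjusted cutoff, are not derived here.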

Furthermore, in terms of performance, the ratio of per-character processing cost between the typo detection method based on the character concatenation probability cutoff and the typo detection method based on morphological analysis is estimated at 1:10 from the number of program steps and the total dictionary size. Let Tx and Ty denote the per-character processing cost of the two methods, respectively, so that Tx / Ty = 0.1. In the present invention, the morphological analysis stage processes only the clauses other than those containing characters recognized as typos in the first stage. For the input text mentioned above (total characters: 1406; typos: 144), about 143 characters are recognized as typos in the first stage once undetected characters and falsely detected characters are taken into account. Assuming an average clause length of 4 characters, the performance of the present invention is

Performance of the present invention = 1406 × Tx + (1406 − 143 × 4) × Ty

where (1406 − 143 × 4) is the number of characters processed in the second stage. The performance of the conventional typo detection method based on morphological analysis is

Conventional performance = 1406 × Ty

Therefore, with Tx / Ty = 0.1, the performance ratio is

Performance of the present invention / Conventional performance = (1406 × 0.1 + (1406 − 143 × 4)) / 1406 ≈ 0.69

that is, the performance of the present invention is improved by about 30% over the conventional method.
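The cost comparison above can be reproduced numerically. This is a small worked calculation using only the figures given in the text; the function name `relative_cost` and its parameter names are illustrative, and costs are expressed in units of Ty.

```python
def relative_cost(total_chars: int, flagged_chars: int,
                  clause_len: int, tx_over_ty: float) -> float:
    """Cost of the cascaded method divided by the cost of running
    morphological analysis over the whole text."""
    # Characters left for the second stage after removing flagged clauses.
    second_stage_chars = total_chars - flagged_chars * clause_len
    cascaded = total_chars * tx_over_ty + second_stage_chars  # units of Ty
    conventional = total_chars                                # units of Ty
    return cascaded / conventional


ratio = relative_cost(total_chars=1406, flagged_chars=143,
                      clause_len=4, tx_over_ty=0.1)
print(f"{ratio:.2f}")  # 0.69
```

A ratio of about 0.69 corresponds to the roughly 30% improvement stated in the text. Most of the saving comes from the 572 characters (143 clauses of 4 characters) that never reach the morphological stage.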

[Brief explanation of drawings]

FIG. 1 shows a configuration example of the present invention; FIG. 2, an example of typo detection according to the present invention; FIG. 3, a configuration example of typo detection by the conventional grammatical method; FIG. 4, an example of typo detection by the conventional grammatical method; and FIG. 5, an example of typo detection by the character concatenation probability cutoff.

1: input device; 2: input processing unit; 3: input Japanese text database; 4: typo detection unit; 5: word dictionary; 6: grammar dictionary; 7: typo position detection unit; 8: character concatenation probability dictionary; 9: correction processing unit; 10: processing device; 11: inserted typo; 12: correct character; 13: part-of-speech connection status; 14: part-of-speech connection test error position; 15: character string around a typo; 16: two-character pattern; 17: forward two-character concatenation probability value; 18: product of forward two-character concatenation probability values; 19: marker indicating that 18 is less than the cutoff value of 10^-9; 20: example of typo detection failure; 21: character concatenation probability value extraction unit; 22: typo determination unit; 23: cutoff value storage unit; 24: character judged to be a typo; 25: clause that does not contain a detected typo; 26: typo-containing region; 27: clause in which no typo was detected.

Claims

[Scope of Claims] An automatic Japanese typo detection method in which a data processing device automatically detects typos contained in Japanese document data input from an input device, characterized by comprising: a character concatenation probability dictionary that holds, keyed by each N-character pattern, character concatenation probability information calculated in advance for each N-character pattern on the basis of occurrence frequency information on consecutive N-character patterns extracted in order from a standard document containing no typos; means for sequentially extracting consecutive character strings from the Japanese document data subject to typo detection, extracting character concatenation probability values from the character concatenation probability dictionary using the first N characters and the last N characters of each extracted string as keys, and detecting a typo when the product of the two character concatenation probability values is less than a predetermined cutoff value; means for extracting, by morphological analysis using a word dictionary and a grammar dictionary, a typo-containing region from a clause that does not contain a typo detected by said means; and means for detecting the typo position within the extracted typo-containing region on the basis of character concatenation probability values from the character concatenation probability dictionary; the method being configured to automatically detect typos contained in the Japanese document data.