JPS6358570A - Automatic detecting system for wrong character of japanese sentence - Google Patents

Automatic detecting system for wrong character of japanese sentence

Info

Publication number
JPS6358570A
JPS6358570A JP61203220A JP20322086A JPS6358570A JP S6358570 A JPS6358570 A JP S6358570A JP 61203220 A JP61203220 A JP 61203220A JP 20322086 A JP20322086 A JP 20322086A JP S6358570 A JPS6358570 A JP S6358570A
Authority
JP
Japan
Prior art keywords
character
characters
typographical
wrong
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP61203220A
Other languages
Japanese (ja)
Inventor
Shinichiro Takagi
伸一郎 高木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP61203220A
Publication of JPS6358570A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

PURPOSE: To greatly improve overall typographical error detection accuracy and greatly reduce overall processing load by cascading error detection based on a cutoff of character concatenation probabilities with error detection based on morphological analysis.
CONSTITUTION: For each character of the input character string, the product of its concatenation probabilities with the preceding and following characters is computed using a character concatenation probability dictionary 8, and the product is compared against a prescribed cutoff value. A typographical error determination part 22 thereby detects errors and sends the phrases containing them to a correction processing part 9. Phrases in which part 22 found no error are processed by a typographical error detection part 4 using morphological analysis, which extracts the regions containing errors. An error position detection part 7 then uses dictionary 8 to find the positions with a low concatenation probability within those regions and sends them to part 9. In the example, the first and second errors are detected at part 22, and the third error is detected by parts 4 and 7.

Description

[Detailed Description of the Invention]
(1) Technical field of the invention
The present invention relates to an automatic Japanese typographical error detection method that, for the purpose of creating a Japanese document database, automatically detects typographical errors contained in Japanese character strings of mixed kanji and kana read in from an input device.

(2) Prior art
When a large volume of Japanese documents such as newspaper articles, publication manuscripts, and scientific and technical papers is converted into electronic files to create a Japanese document database, a grammatical method has been realized that detects typographical errors contained in the reading results as phrase-level (bunsetsu) regions containing the error (error-containing regions), by morphological analysis using a word dictionary and a grammar dictionary (for example, Ikehara and Shirai, "Automatic detection of Japanese typographical errors by a word analysis program and extraction of correction candidates by a second-order Markov model," Transactions of the Information Processing Society of Japan, vol. 25, no. 2, 1984).

Fig. 3 shows a configuration example of a conventional typographical error detection scheme using the grammatical method. 1 is an input device for reading documents, such as a kanji OCR or a pen-touch input device; 2 is an input processing unit that performs the reading; 3 is the input Japanese text database, which holds the reading results read in by input device 1 and recorded on a magnetic device as character codes; 4 is a typographical error detection unit that extracts error-containing regions by morphological analysis; 5 and 6 are the word dictionary and grammar dictionary used by detection unit 4; 7 is an error position detection unit that uses inter-character concatenation probabilities to detect, within the error-containing region extracted by detection unit 4, the position regarded as the error; 8 is the character concatenation probability dictionary; 9 is a correction processing unit that corrects the detected errors; and 10 is the processing device.

In this example, error-containing regions are extracted by morphological analysis such as word candidate extraction and part-of-speech connection tests between words, and the error position is then detected using the character concatenation probability dictionary 8, which is created in advance from a large volume of documents of the same kind that contain no errors.

Fig. 4 is an example of typographical error detection by the grammatical method, in which two randomly chosen erroneous characters have been inserted into the original text. 11 is an inserted erroneous character, 12 is the correct character, 13 is the part-of-speech connection status around the error, and 14 indicates the position where the part-of-speech connection test fails.

In Fig. 4, after exhaustive word extraction, a part-of-speech connection test is performed between the words, and a phrase containing a typographical error is extracted when the test produces an error.

In 「評価負社会的」, however, the character 「負」 was recognized as a suffix during part-of-speech assignment, so the part-of-speech connections with the words before and after it hold and the error cannot be detected.

In other words, this kind of grammatical method has the following problems.

■ Because the check relies on part-of-speech connections, many typographical errors remain undetected.

■ Looking up a huge word dictionary and executing complex test logic requires a great deal of processing time.

As an alternative to the grammatical method, there is a scheme, such as the one described in Japanese Patent Application No. Sho 61-024992, in which N-character patterns are extracted from a large volume of Japanese documents that contain no typographical errors, and a character concatenation probability dictionary is created in advance that holds, keyed by each N-character pattern, the character concatenation probability computed from the occurrence frequencies of those patterns. Typographical errors are then detected by applying a cutoff value to the character concatenation probabilities of consecutive character strings (for example, two characters) in the original text.

Fig. 5 is an example of typographical error detection by a cutoff on character concatenation probability values. 15 is the character string around the error, 16 is a consecutive two-character pattern, 17 is the forward two-character concatenation probability value of pattern 16, 18 is the product of forward two-character concatenation probability values, 19 is a marker indicating that product 18 is below the cutoff value 10^-9, and 20 is an example in which detection fails because the product is higher than the cutoff value.

In Fig. 5, using the character concatenation probability dictionary collected statistically in advance, a character is extracted as a typographical error when the product 18 of the forward two-character concatenation probability values is smaller than the predetermined cutoff value 10^-9. Even with this probabilistic method, however, for character strings such as hiragana, where the number of distinct characters is small and occurrence frequencies are high, the concatenation probability generally remains high even when an error is mixed in, so the error cannot be detected, as in the example 「生まれつい」.

(3) Object of the invention
The object of the present invention is to provide an automatic Japanese typographical error detection method that improves overall detection accuracy and reduces processing load. It does so by connecting in cascade a method that detects typographical errors by a cutoff on character concatenation probability values, using a character concatenation probability dictionary created in advance, and the conventional method that detects typographical errors by morphological analysis, so that phrases containing no error detected by the determination unit of the former are checked again by the morphological-analysis detection unit of the latter.

(4) Constitution of the invention
The present invention uses a large volume of documents that contain no typographical errors and are of the same kind as the documents targeted for automatic error detection. N consecutive characters are read out in sequence from the beginning, and the occurrence frequency of each N-character pattern is obtained. The inter-character concatenation probability is defined as the ratio of the frequency of each N-character pattern to the sum of the occurrence frequencies of all N-character patterns that share the same first (N-1) characters, and is stored in advance as the character concatenation probability dictionary. For each character under test in a character string input from the input Japanese text database, the character concatenation probability values with the preceding and following characters are extracted from the dictionary, and the character is recognized as a typographical error when their product is smaller than a preset cutoff value. The principal feature of the invention is that phrases containing no recognized error are then processed again by the typographical error detection unit based on morphological analysis, so that error detection is performed in cascade.
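As a rough illustration of the dictionary construction and cutoff test described here, the following Python sketch works through the N = 2 case. The function names, the `unseen` floor probability for patterns missing from the dictionary, and the choice to skip the first and last characters (which lack one neighbor) are assumptions made for the example; the source does not specify how those cases are handled.

```python
from collections import Counter


def build_bigram_dictionary(corpus: str) -> dict[str, float]:
    """Map each 2-character pattern to its concatenation probability.

    The probability is the pattern's frequency divided by the total frequency
    of all patterns sharing the same first (N-1) = 1 character.
    """
    pair_counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    prefix_totals: Counter = Counter()
    for pair, count in pair_counts.items():
        prefix_totals[pair[0]] += count
    return {pair: count / prefix_totals[pair[0]]
            for pair, count in pair_counts.items()}


def detect_typos(text: str, dictionary: dict[str, float],
                 cutoff: float, unseen: float = 0.0) -> list[int]:
    """Return indices whose before/after probability product is below cutoff.

    `unseen` is a hypothetical floor used for patterns absent from the
    dictionary; it is not part of the source description.
    """
    flagged = []
    for i in range(1, len(text) - 1):
        p_prev = dictionary.get(text[i - 1:i + 1], unseen)  # previous char -> this char
        p_next = dictionary.get(text[i:i + 2], unseen)      # this char -> next char
        if p_prev * p_next < cutoff:
            flagged.append(i)
    return flagged


# Toy usage: the stray "x" is the only character whose two probabilities
# are both low, so only its index falls below the cutoff.
dictionary = build_bigram_dictionary("abcabcabc")
print(detect_typos("abxabc", dictionary, cutoff=1e-5, unseen=1e-3))  # [2]
```

In the toy run, the neighbors of the stray character each have one low factor and one high factor, so their products stay above the cutoff. This is why the test multiplies the probabilities on both sides rather than thresholding a single pattern.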

This differs from the prior art in the following points.

■ A typographical error detection method that applies a cutoff to character concatenation probability values, using a character concatenation probability dictionary created in advance, is connected in cascade with the conventional detection method based on morphological analysis.

■ Only phrases that contain no character recognized as an error by the former method are passed to the latter method.

(5) Embodiment
Fig. 1 shows a configuration example of the present invention. 21 is a character concatenation probability value extraction unit that extracts character concatenation probability values from the character concatenation probability dictionary 8 using two-character patterns as keys; 22 is a typographical error determination unit that detects errors from the concatenation probability values and a predetermined cutoff value; 23 is a cutoff value storage unit that stores the predetermined cutoff value; 24 is a character determined to be an error by determination unit 22; 25 is a phrase containing no error detected by determination unit 22; 26 is an error-containing region that detection unit 4 recognized as containing an error; and 27 is a phrase in which detection unit 4 detected no error.

In this configuration example, the following series of processes is performed. For the input character string, the product of each character's concatenation probability values with the preceding and following characters is first computed using the character concatenation probability dictionary 8 and compared against the predetermined cutoff value, whereby the typographical error determination unit 22 detects errors and sends the phrases containing them to the correction processing unit 9. Phrases containing no error recognized by determination unit 22 are processed by the morphological-analysis typographical error detection unit 4, which extracts error-containing regions. The error position detection unit 7 then uses the character concatenation probability dictionary 8 again to detect positions with low concatenation probability values and sends them to the correction processing unit 9.
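The cascade described in this passage can be sketched in Python as follows. This is only an outline under stated assumptions: the three callables stand in for determination unit 22, detection unit 4, and position detection unit 7, their names and signatures are hypothetical, and the input is assumed to be already split into phrases (bunsetsu) by some upstream step.

```python
from typing import Callable, Iterable


def cascade_detect(
    phrases: Iterable[str],
    statistical_check: Callable[[str], list[int]],
    morphological_check: Callable[[str], bool],
    locate_in_region: Callable[[str], list[int]],
) -> list[tuple[str, list[int]]]:
    """Return (phrase, error positions) pairs to hand to the correction stage."""
    to_correct = []
    for phrase in phrases:
        # Stage 1 (unit 22): cheap cutoff test on concatenation probabilities.
        positions = statistical_check(phrase)
        if positions:
            to_correct.append((phrase, positions))
            continue
        # Stage 2 (unit 4): only phrases with no stage-1 detection reach the
        # more expensive morphological analysis.
        if morphological_check(phrase):
            # Unit 7: find low-probability positions inside the flagged region.
            to_correct.append((phrase, locate_in_region(phrase)))
    return to_correct
```

Passing the stages in as callables keeps the sketch independent of any particular dictionary or morphological analyzer; the point it illustrates is only the routing, namely that the second stage never sees a phrase the first stage has already flagged.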

Fig. 2 is an example of typographical error detection according to the present invention, shown for comparison with the example of detection by a cutoff on character concatenation probability values in Fig. 5. With the configuration of the present invention, the first and second errors can be detected by the cutoff on the product of the concatenation probability values with the preceding and following characters. The third error is sent to the typographical error detection unit 4 within a phrase regarded as containing no error, and an error-containing region is extracted from the part-of-speech connection test error raised by morphological analysis. The error position detection unit 7 then extracts error position candidates within the error-containing region, and the error is detected.

With this structure and operation, setting an appropriate cutoff value allows most of the typographical errors contained in the input text to be detected, together with their positions, by the detection method that uses the character concatenation probability dictionary.

Furthermore, in cases such as a hiragana error within a hiragana string, where the cutoff cannot detect the error because character concatenation probabilities are generally high even when an error is included, the error can be detected by the morphological analysis that is applied in cascade, such as the part-of-speech connection test.

The resulting advantages include the following.

■ Errors that cannot be detected by the method using the character concatenation probability dictionary can be detected by morphological analysis, so detection accuracy can be greatly improved.

■ Most errors can be handled by the detection method using the character concatenation probability dictionary, which has a small processing load. The processing by the morphological-analysis detection method, which has a large processing load, is therefore kept to a minimum, and the overall processing load can be greatly reduced.

(6) Effects of the invention
As described above, according to the present invention, a method that detects typographical errors by a cutoff on character concatenation probability values using a character concatenation probability dictionary and the conventional method that detects typographical errors by morphological analysis are connected in cascade, and phrases containing no error detected by the determination unit of the former are processed again by the morphological-analysis detection unit of the latter. This gives the following advantages.

■ Errors that cannot be detected by the former method, which uses the character concatenation probability dictionary, can be detected by the morphological analysis of the latter, so overall detection accuracy can be greatly improved.

■ Most errors can be handled by the former method, which has a small processing load. The processing by the latter method, which has a large processing load, is therefore kept to a minimum, and the overall processing load can be greatly reduced.

A quantitative evaluation of the effects of the present invention is considered to be as follows.

The input text was prepared by inserting randomly selected characters as errors, at a rate of about 10%, into original newspaper article text. On this input, a prototype of the detection method based on a cutoff on character concatenation probability values, with the cutoff value set to 10^-n (the exponent is illegible in the source text), achieved an error detection rate of 92% and a false detection rate, in which correct characters are detected as errors, of 0.87%. For the same input text, the detection method based on morphological analysis achieved an error detection rate of 86.1% and a false detection rate of 0.475%. When the two methods are combined in cascade as in the present invention, the error detection rate therefore rises to 98.9%, but the false detection rate also increases to 1.18%. For this reason, an overall evaluation function is defined (the final error content rate, expressed as a ratio to the number of errors set; the rest of the formula is lost in the source text), and the cutoff value on the character concatenation probability values in the first stage is adjusted appropriately. The final error content rate of the present invention is then about 8%, whereas that of the detection method based on morphological analysis alone is about 18%, so detection accuracy improves by roughly 50%.
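The evaluation function itself is not legible in the source text; only its denominator, the number of errors set, survives. As a plausibility check, the Python sketch below assumes that the numerator is the number of undetected errors plus the number of correct characters falsely flagged, and that the false detection rate is taken relative to the correct characters. Both are assumptions for illustration, not the patent's stated definition. The totals of 1406 characters and 144 errors are the ones quoted for this input text in the performance discussion.

```python
def final_error_rate(n_errors: int, n_chars: int,
                     detection_rate: float, false_detection_rate: float) -> float:
    """Assumed form: (undetected errors + false detections) / errors set."""
    undetected = n_errors * (1.0 - detection_rate)
    # Assumption: the false detection rate applies to the correct characters only.
    false_detections = (n_chars - n_errors) * false_detection_rate
    return (undetected + false_detections) / n_errors


# Morphological analysis alone, using the rates quoted in the text.
morph_only = final_error_rate(144, 1406, 0.861, 0.00475)
# Cascade before any adjustment of the first-stage cutoff value.
cascade_unadjusted = final_error_rate(144, 1406, 0.989, 0.0118)
print(f"{morph_only:.3f} {cascade_unadjusted:.3f}")
```

Under these assumptions the morphological-only figure comes out at roughly 18%, which matches the value stated in the text, and the unadjusted cascade at roughly 11%, which is consistent with the text reaching about 8% only after the cutoff value is tuned.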

In terms of performance, the ratio of per-character processing time between the detection method based on a cutoff on character concatenation probability values and the detection method based on morphological analysis is estimated at 1:10 from the number of program steps and the total dictionary size. Let Tx and Ty be the per-character processing times of the two methods. In the present invention, the morphological-analysis stage processes only the phrases other than those containing a character recognized as an error in the first stage. For the input text above (1406 characters in total, 144 errors), the number of characters recognized as errors in the first stage is about 143, the sum of the detected errors and the falsely detected characters. Assuming an average phrase length of 4 characters, the processing time of the present invention is

  processing time of the present invention = 1406 × Tx + (1406 − 143 × 4) × Ty

where (1406 − 143 × 4) is the number of characters processed in the second stage.

The processing time of the conventional detection method based on morphological analysis is

  processing time of the conventional method = 1406 × Ty

Therefore, with Tx / Ty = 0.1, the ratio is

  (processing time of the present invention) / (processing time of the conventional method) = (1406 × 0.1 + (1406 − 143 × 4)) / 1406 ≈ 0.69

That is, the performance of the present invention is improved by about 30% compared with the conventional method.
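The processing-time comparison in this passage can be checked numerically. The short Python sketch below expresses both costs in units of Ty, using the figures quoted in the text (1406 characters, about 143 characters flagged in the first stage, an average phrase length of 4, and Tx / Ty = 0.1). The function and parameter names are chosen for the example.

```python
def cost_ratio(total_chars: int, flagged_chars: int,
               avg_phrase_len: int, tx_over_ty: float) -> float:
    """Cost of the cascade divided by cost of morphological analysis alone."""
    # Characters left for the second stage once flagged phrases are removed.
    stage2_chars = total_chars - flagged_chars * avg_phrase_len
    cascade = total_chars * tx_over_ty + stage2_chars  # in units of Ty
    conventional = total_chars                         # in units of Ty
    return cascade / conventional


ratio = cost_ratio(1406, 143, 4, 0.1)
print(f"{ratio:.2f}")  # 0.69
```

The result of about 0.69 corresponds to the roughly 30% improvement stated in the text. Most of the saving comes from the 572 characters (143 phrases of 4 characters) that never reach the second stage.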

[Brief Description of the Drawings]

Fig. 1 shows a configuration example of the present invention; Fig. 2, an example of typographical error detection according to the present invention; Fig. 3, a configuration example of typographical error detection by the conventional grammatical method; Fig. 4, an example of typographical error detection by the conventional grammatical method; and Fig. 5, an example of typographical error detection by a cutoff on character concatenation probability values.

1: input device, 2: input processing unit, 3: input Japanese text database, 4: typographical error detection unit, 5: word dictionary, 6: grammar dictionary, 7: error position detection unit, 8: character concatenation probability dictionary, 9: correction processing unit, 10: processing device, 11: inserted erroneous character, 12: correct character, 13: part-of-speech connection status, 14: position of the part-of-speech connection test error, 15: character string around the error, 16: two-character pattern, 17: forward two-character concatenation probability value, 18: product of forward two-character concatenation probability values, 19: marker indicating that 18 is below the cutoff value 10^-9, 20: example of failed error detection, 21: character concatenation probability value extraction unit, 22: typographical error determination unit, 23: cutoff value storage unit, 24: character determined to be an error, 25: phrase containing no detected error, 26: error-containing region, 27: phrase in which no error was detected.

Claims (1)

[Claims]
An automatic Japanese typographical error detection method in which typographical errors contained in Japanese document data input from an input device are automatically detected by a data processing device, characterized by comprising:
a character concatenation probability dictionary that holds, keyed by each N-character pattern, character concatenation probability information calculated in advance for each N-character pattern on the basis of occurrence frequency information on patterns of N consecutive characters sequentially extracted from standard documents containing no typographical errors;
means for sequentially extracting consecutive character strings from the Japanese document data subject to error detection and extracting character concatenation probability values from the character concatenation probability dictionary, using the first N characters and the last N characters of each extracted string as respective keys;
means for detecting a typographical error when the product of the two character concatenation probability values is less than a predetermined cutoff value;
means for extracting error-containing regions by morphological analysis using a word dictionary and a grammar dictionary, applied to phrases that contain no error detected by the preceding means; and
means for detecting the error position within each error-containing region from the character concatenation probability values given by the character concatenation probability dictionary,
whereby typographical errors contained in Japanese document data are detected automatically.
JP61203220A 1986-08-29 1986-08-29 Automatic detecting system for wrong character of japanese sentence Pending JPS6358570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP61203220A JPS6358570A (en) 1986-08-29 1986-08-29 Automatic detecting system for wrong character of japanese sentence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP61203220A JPS6358570A (en) 1986-08-29 1986-08-29 Automatic detecting system for wrong character of japanese sentence

Publications (1)

Publication Number Publication Date
JPS6358570A true JPS6358570A (en) 1988-03-14

Family

ID=16470446

Family Applications (1)

Application Number Title Priority Date Filing Date
JP61203220A Pending JPS6358570A (en) 1986-08-29 1986-08-29 Automatic detecting system for wrong character of japanese sentence

Country Status (1)

Country Link
JP (1) JPS6358570A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019145023A (en) * 2018-02-23 2019-08-29 株式会社リクルート Document revision device and program

Similar Documents

Publication Publication Date Title
US6035268A (en) Method and apparatus for breaking words in a stream of text
EP0415000B1 (en) Method and apparatus for spelling error detection and correction
US4813010A (en) Document processing using heading rules storage and retrieval system for generating documents with hierarchical logical architectures
CN101315622B (en) System and method for detecting file similarity
JPH0519184B2 (en)
JP2001034623A (en) Information retrievel method and information reteraval device
JPH0724055B2 (en) Word division processing method
Zitouni et al. The impact of morphological stemming on Arabic mention detection and coreference resolution
JPH0211934B2 (en)
JPS6033665A (en) Automatic extracting system of keyword
JPS6358570A (en) Automatic detecting system for wrong character of japanese sentence
JP2681663B2 (en) Japanese sentence correction candidate character extraction method
Kumar et al. Applications of stemming algorithms in information retrieval-a review
Ren et al. A hybrid approach to automatic Chinese text checking and error correction
JP3932912B2 (en) Character string shaping device, method and program
JP2599973B2 (en) Japanese sentence correction candidate character extraction device
CN113033188B (en) Tibetan grammar error correction method based on neural network
JPS6394364A (en) Automatic correction device for wrong character in japanese sentence
JPH05225183A (en) Automatic error detector for words in japanese sentence
JPH06149872A (en) Text input device
JP2575947B2 (en) Phrase extraction device
JPH077412B2 (en) Japanese sentence correction candidate character extraction device
Cao et al. The Research of Chinese Text Proofreading base on Pattern Matching
JPH0614376B2 (en) Japanese sentence error detection device
Wen et al. Ambiguity solution of pinyin segmentation in continuous pinyin-to-character conversion