JPH03156589A

JPH03156589A - Method for detecting and correcting erroneously read character

Info

Publication number: JPH03156589A
Application number: JP2011695A
Authority: JP
Inventors: Akiko Konno; 紺野　章子; Yasuo Hongo; 本郷　保夫; Shinji Matsui; 松井　伸二
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1989-08-23
Filing date: 1990-01-23
Publication date: 1991-07-04

Abstract

PURPOSE: To detect and correct misread characters with high accuracy by using a concatenation table of the characters before and after one-character postpositional particles to decide whether an extracted character is in fact a particle, dividing the character string at the confirmed particles, and collating the pieces with a word dictionary. CONSTITUTION: The feature pattern of an input character, obtained through a scanner 1 and a character recognizing device 2, is compared with standard patterns by a personal computer 4, and the character is recognized. The recognized character string is then partitioned at punctuation marks. From each partitioned string, the one-character hiragana particles (ka, ga, shi, te, de, to, ni, no, wa, ba, he, mo, ya, wo) are extracted, and the characters immediately before and after each one are compared with a stored table of two-character concatenations to determine whether the character is a particle. The string is divided at the confirmed particles, and misread characters are detected and corrected by collating each divided string with the word dictionary using the longest-match method.

Description

[Detailed description of the invention]

[Industrial application field]

The present invention relates to a method for detecting and correcting misread characters.

[Conventional technology]

To increase the reliability of the results produced by a character recognition device, it is common practice to morphologically analyze the recognized text as Japanese sentences, match words against a dictionary at the level of the segmented words, and detect and correct misread characters.

Conventionally, the morphological analysis first treats punctuation marks as reliable delimiters. The character string within each punctuation-delimited range is then divided further at the points where the character type (kanji, hiragana, katakana) changes, and the resulting strings are analyzed by the longest-match method while being checked against a word dictionary. A change from kanji to hiragana is tolerated, however, because of okurigana.
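The conventional division by character type can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the Unicode ranges used to classify characters are assumptions, and the sample string is modeled on the FIG. 3 example discussed later in the description.

```python
def char_type(ch):
    """Classify a character by Unicode block (assumed ranges, for illustration)."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def split_by_char_type(text):
    """Split at every character-type change except kanji -> hiragana (okurigana)."""
    segments = []
    current = ""
    for ch in text:
        if current:
            prev_type = char_type(current[-1])
            this_type = char_type(ch)
            okurigana = prev_type == "kanji" and this_type == "hiragana"
            if this_type != prev_type and not okurigana:
                segments.append(current)
                current = ""
        current += ch
    if current:
        segments.append(current)
    return segments


print(split_by_char_type("昨年一年間にソ達人のなかで"))
```

Because the katakana 「ソ」 ends up in a segment of its own, a dictionary lookup can never see 「ソ」 together with the kanji that follow it, which is the weakness the next section describes.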

[Problem to be solved by the invention]

With the method described above, however, words whose character type changes in some way other than from kanji to hiragana, such as 「ソ連」 (Soviet Union) and 「さ迷う」 (to wander), are not extracted as words, so a misread character inside such a word cannot be detected or corrected. In addition, some kanji and katakana characters have very similar shapes (夕 and タ, 力 and カ, 工 and エ, and so on), and when such a character is misread as its look-alike in the other character type, the misreading cannot be detected or corrected.

[Means for solving the problem]

After characters are recognized by comparing the feature pattern of each input character with the prestored standard pattern of each character, the recognized character string is divided at punctuation marks. From each punctuation-delimited string, the one-character hiragana particles (か, が, し, て, で, と, に, の, は, ば, へ, も, や, を) are extracted. The characters before and after each extracted character are compared with a prestored concatenation table of two-character sequences (the particle character together with its preceding or following character) taken from words that contain one of these hiragana characters, to determine whether the extracted character is a particle. The string is further divided at the characters confirmed as particles, and misread characters are detected and corrected within each divided string by matching against a word dictionary with the longest-match method.
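One way to picture the concatenation table is as a set of two-character sequences harvested from dictionary words. The sketch below is an assumption about the data structure, since the text does not specify how the table is stored; `build_adjacency_table` and the two sample words are hypothetical.

```python
# The 14 one-character hiragana particles listed in the text.
PARTICLES = set("かがしてでとにのはばへもやを")


def build_adjacency_table(words):
    """Collect every (previous char + particle char) and (particle char + next char)
    pair that occurs inside a dictionary word."""
    table = set()
    for word in words:
        for i, ch in enumerate(word):
            if ch not in PARTICLES:
                continue
            if i > 0:
                table.add(word[i - 1] + ch)
            if i + 1 < len(word):
                table.add(ch + word[i + 1])
    return table


# Hypothetical two-entry word list, chosen to produce the two table entries
# that the description later mentions (「なか」 and 「って」).
table = build_adjacency_table(["なか", "走って"])
print(sorted(table))
```

A set gives constant-time membership tests, which is all the later lookup step needs.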

[Effect]

A preceding/following-character concatenation table is prepared in advance: from a dictionary of words that contain one of the frequently occurring one-character hiragana words (か, が, し, て, で, と, に, の, は, ば, へ, も, や, を), the characters that can come immediately before or after that hiragana character are extracted. After the text is divided at punctuation marks, every place where one of these 14 hiragana characters appears is located, and the characters before and after it are compared with the concatenation table. If either the preceding or the following combination matches an entry, the character is marked as possibly being a hiragana character inside a word. If nothing matches, the character is determined to be a one-character hiragana particle, and morphological analysis is performed with that position as a delimiter to detect and correct misread characters. This makes it possible to detect and correct words that consist of more than one character type, and it reduces misreadings across character types (katakana to kanji, kanji to katakana).

[Example]

FIG. 1 is a flowchart showing an embodiment of the invention, and FIG. 2 is a block diagram showing a system to which the invention is applied.

FIG. 2 is described first. In the figure, 1 is a scanner that inputs a document image into an image memory by optical scanning; 2 is a character recognition device (OCR) consisting of a recognition unit, which extracts the image of each character from the document image and recognizes it, and a correction unit, which corrects misread characters using a word dictionary and a connection table (map); 3 is a display for displaying and checking the recognition results; 4 is a data processing device such as a personal computer; and 5 is a keyboard for data entry and the like.

FIG. 1 is described next.

First, each character of the recognition result is checked to see whether it is a punctuation mark; if it is, processing ends without doing anything (see steps ■ to ■). If it is not a punctuation mark, it is checked to see whether it is a one-character hiragana particle (see ■); if it is not, processing returns to step ■.

If it is a one-character hiragana particle, it is matched against the preceding/following-character concatenation table (see ■); if a match is found, processing returns to step ■.

If it is not in the concatenation table, the character is confirmed as a one-character particle (see ■), and the character string delimited by this particle is morphologically analyzed (see ■). If a misreading is detected as a result, it is corrected (see ■).

FIG. 3 shows a specific example of a character string, and FIG. 4 shows part of the preceding/following-character concatenation table.

In the case of FIG. 3, the input character string is 「昨年一年間にソ連人のなかで」 ("among Soviet people during the past year"), whereas the character recognition result (first candidates) is 「昨年一年間にソ達人のなかで」. If such a string is divided by character type as in the conventional method (kanji-to-hiragana changes are excluded because they may be okurigana), the result is 「昨年一年間に／ソ／達人のなかで／」, where 「／」 marks a division. When word matching and correction are performed within each divided string, a misreading in the 「年間」 part can be corrected, but the misreading 「達人」 cannot.

With the method of the invention, 「に」, 「の」, 「か」 and 「で」 are picked up from the string as one-character particle candidates, and the characters before and after each of them are compared with the preceding/following-character concatenation table for one-character particles shown in FIG. 4. For 「に」, 「の」 and 「で」, the neighbouring characters do not match any entry in the table, so these are confirmed as particles. For 「か」, the table contains the sequence 「なか」, so 「か」 is marked as possibly being a hiragana character inside a word. Dividing the string at the confirmed particles gives 「昨年一年間／に／ソ／達人／のなか／で／」, which permits more accurate morphological analysis than division at character-type change points. That is, two word divisions are conceivable for the 「ソ連人」 part, 「ソ／達人」 and 「ソ連／人」, but because 「ソ」 is an abbreviation of 「ソビエト連邦」 (the Soviet Union), 「ソ連人」 is taken as the correct reading and the misreading can be corrected.
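The correction step in this example can be illustrated with per-position recognition candidates. The sketch below is an assumption about how candidates might be combined: the text does not give the candidate lists or the selection rule, so `correct_with_candidates`, the candidate lists, and the tiny dictionary are all hypothetical.

```python
from itertools import product


def correct_with_candidates(candidates, dictionary):
    """Try every combination of per-position candidates (first candidates first)
    and return the first combination that is a dictionary word, else None."""
    for combo in product(*candidates):
        word = "".join(combo)
        if word in dictionary:
            return word
    return None


# Hypothetical candidates for the range between the particles 「に」 and 「の」:
# the second position was read as 「達」 first, with 「連」 as a lower-ranked candidate.
candidates = [["ソ"], ["達", "連"], ["人"]]
dictionary = {"ソ連人", "達人"}  # hypothetical dictionary entries

print(correct_with_candidates(candidates, dictionary))
```

The lookup only works because the whole range 「ソ達人」 is examined as one unit; after division by character type, 「ソ」 and 「達人」 would be looked up separately and 「達人」 would be accepted as a valid word.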

FIG. 5 shows another example of a character string. Here the input character string is 「トロイカが走って来た」 ("the troika came running"), and the character recognition result (first candidates) is 「トロイ力が走って来た」, in which the katakana 「カ」 has been read as the kanji 「力」. Dividing this string by character type as in the conventional method gives 「トロイ／力／が／走って／来た」. Word matching within the ranges divided by character type therefore does not correct the misreading of 「カ」 as 「力」.

With the method of the invention, by contrast, 「が」 and 「て」 are raised as one-character particle candidates, but 「て」 is not confirmed as a particle because 「って」 exists in the concatenation table of FIG. 4. The division result is therefore 「トロイ力／が／走って来た」, and the longest-match method can correct 「トロイ力」 to 「トロイカ」.
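The division in this example can be reproduced with a short sketch. It assumes the concatenation table is a set of two-character strings and that punctuation means 「、」 and 「。」; the function name `split_at_confirmed_particles` and the one-entry table are hypothetical.

```python
import re

# The 14 one-character hiragana particles listed in the text.
PARTICLES = set("かがしてでとにのはばへもやを")


def split_at_confirmed_particles(text, table):
    """Divide text at punctuation, then at particle characters whose neighbouring
    two-character sequences are absent from the concatenation table."""
    pieces = []
    for clause in re.split("[、。]", text):
        current = ""
        for i, ch in enumerate(clause):
            before = clause[i - 1] + ch if i > 0 else None
            after = ch + clause[i + 1] if i + 1 < len(clause) else None
            maybe_in_word = before in table or after in table
            if ch in PARTICLES and not maybe_in_word:
                # Confirmed particle: close the running piece and emit the particle.
                if current:
                    pieces.append(current)
                pieces.append(ch)
                current = ""
            else:
                current += ch
        if current:
            pieces.append(current)
    return pieces


# Recognition result with the kanji 力 in place of the katakana カ.
print(split_at_confirmed_particles("トロイ力が走って来た", {"って"}))
```

With 「って」 in the table, 「て」 stays inside its word while 「が」 is confirmed, so the misread string 「トロイ力」 survives as a single piece that a dictionary lookup can repair.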

OCR recognition results contain misread characters with a certain probability, and not every word in a text is in the dictionary. Words containing misread characters and unregistered words (hereinafter collectively called unknown words) must therefore be extracted by morphological analysis. When an unknown word appears within a punctuation-delimited range, morphological analysis from the beginning of the range reveals where the unknown word starts, but finding where it ends is difficult. Even if the analysis is retried while shifting one character at a time from the start of the unknown word, words often contain other words, so a reliable range may not be obtained. The following approach is therefore considered.

FIG. 6 is a flowchart showing an embodiment of the invention based on this viewpoint.

First, in steps ■ to ■ of FIG. 6, the following are extracted from the character string within each punctuation-delimited range: the one-character hiragana particles described above, the verb-ending character 「る」, the one-character auxiliary verbs 「た」 and 「だ」, and frequently occurring two-character hiragana words. FIG. 7 shows the kinds of frequent two-character hiragana words; 23 kinds are covered here. The one-character particles may be confirmed with a concatenation table such as that of FIG. 4, as in the embodiment of FIG. 1, but this method need not be used.

In steps ■ to ■ of FIG. 6, each character extracted in this way is checked to see whether its relationship to the character types before and after it fits one of the following patterns. Where a pattern is matched, the punctuation-delimited range is divided into smaller ranges.

a) Before and after the one-character particle 「を」: ・・・／を／・・・
b) Between any other one-character particle and a non-hiragana character: ・・・の／場合・・・, ・・・が／スタート・・・
c) Between the verb ending 「る」 or the one-character auxiliary verb 「た」 and a non-hiragana character: ・・・起きる／時間・・・, ・・・起きた／時間・・・
d) Between a two-character particle and a non-hiragana character: ・・・／から／始まる・・・
e) Between a non-hiragana character and a two-character particle: ・・・読書／など／・・・
f) Between a non-hiragana character and a two-character auxiliary verb: ・・・本当／です／・・・
g) Before and after a two-character formal noun followed by the one-character auxiliary verb 「だ」 or a one-character particle: ・・・／もの／だ／・・・, ・・・／ため／に／・・・
h) Before and after a two-character verb followed by a two-character formal noun: ・・・／いる／ため／・・・, ・・・／いる／もの／・・・
i) Before and after a two-character sa-row irregular verb followed by a two-character formal noun: ・・・／する／ごと／・・・
j) Before a two-character verb at the end of a sentence: ・・・／です。, ・・・／ます。
k) Before a two-character auxiliary verb at the end of a sentence: ・・・／ある。, ・・・／なる。

The terms two-character particle, two-character formal noun and so on follow the classification in FIG. 7.
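As an illustration of how such patterns translate into cut points, the sketch below implements only patterns a) and b). It is a simplification under stated assumptions: every occurrence of one of the 14 characters is treated as an extracted particle (the text allows confirming them first with the concatenation table), hiragana is detected by Unicode range, and the function names are hypothetical.

```python
# The 14 one-character hiragana particles listed in the text.
PARTICLES = set("かがしてでとにのはばへもやを")


def is_hiragana(ch):
    """Assumed Unicode range for hiragana."""
    return "\u3041" <= ch <= "\u309f"


def split_by_patterns_a_b(text):
    """Pattern a): cut before and after 「を」.
    Pattern b): cut between any other one-character particle and a following
    non-hiragana character."""
    cuts = set()
    for i, ch in enumerate(text):
        if ch == "を":
            cuts.update((i, i + 1))
        elif ch in PARTICLES and i + 1 < len(text) and not is_hiragana(text[i + 1]):
            cuts.add(i + 1)
    pieces = []
    start = 0
    for cut in sorted(cuts):
        if cut > start:
            pieces.append(text[start:cut])
            start = cut
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_by_patterns_a_b("本を読む"))
print(split_by_patterns_a_b("その場合"))
```

The remaining patterns c) to k) would add further cut rules keyed on the two-character words of FIG. 7, whose full list is not reproduced in the text.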

When a range divided in this way is itself one of the one-character or two-character hiragana words mentioned above, that word is taken as confirmed. Otherwise, the divided range is morphologically analyzed while being checked against the word dictionary. If the range contains an unknown word, the analysis often cannot be completed, because, as noted above, the beginning of an unknown word can be determined but its end is difficult to determine.

In this invention, therefore, the divided range is also analyzed in the reverse direction from its end by the longest-match method (see FIG. 6), which fixes the end of the unknown word. The range determined in this way is checked against the word dictionary. If it is a word containing a misreading, the misread characters are checked and corrected; if it is an unregistered word, it is added to an unregistered-word list that is used later to confirm unregistered words.
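The two-directional analysis can be sketched as a forward longest match that stops at the first failure, followed by a backward longest match from the end of the range. This is an illustration under assumptions: the dictionary is a plain set of strings, and `find_unknown_span`, the sample range, and its two dictionary words are hypothetical.

```python
def longest_prefix(text, pos, dictionary, max_len):
    """Length of the longest dictionary word starting at pos, or 0."""
    for n in range(min(max_len, len(text) - pos), 0, -1):
        if text[pos:pos + n] in dictionary:
            return n
    return 0


def longest_suffix(text, end, dictionary, max_len):
    """Length of the longest dictionary word ending just before end, or 0."""
    for n in range(min(max_len, end), 0, -1):
        if text[end - n:end] in dictionary:
            return n
    return 0


def find_unknown_span(text, dictionary):
    """Return (start, end) of the unknown stretch, or None if everything matches."""
    max_len = max(map(len, dictionary), default=0)
    start = 0
    while start < len(text):          # forward pass fixes the beginning
        n = longest_prefix(text, start, dictionary, max_len)
        if n == 0:
            break
        start += n
    if start == len(text):
        return None
    end = len(text)
    while end > start:                # backward pass fixes the end
        n = longest_suffix(text, end, dictionary, max_len)
        if n == 0 or end - n < start:
            break
        end -= n
    return start, end


text = "昨年ソ達人来日"                 # hypothetical divided range
span = find_unknown_span(text, {"昨年", "来日"})
print(span, text[span[0]:span[1]])
```

The stretch between the two passes is the unknown-word range that is then either corrected from the candidate characters or placed on the unregistered-word list.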

FIG. 8 is an explanatory diagram illustrating the above method.

Here the input character string is 「××電工事件で逮捕されたために」 ("because of being arrested in the ×× Denko case"), as shown in part (イ) of the figure, and the recognition result (first candidates) is 「××電工事件で逮補さ礼たために」, as shown in part (ロ): 「捕」 has been misread as 「補」 and 「れ」 as 「礼」.

Dividing this by the conventional method gives the division shown in part (ハ). Dividing it by the method of the invention gives the result shown in part (ニ): compared with (ハ), no long hiragana strings remain, and word division by the longest-match method is easier. Furthermore, after the beginning and end of each unknown word are extracted as in parts (ホ) and (ヘ) (where 「？」 marks a beginning or an end), the candidate characters of the unknown word are combined; if a word that fits the grammar can be extracted, the text is corrected. If no word can be extracted, the string is listed as an unregistered word, confirmation of it as an unregistered word is requested later, and a correction is requested wherever one is needed. The case in part (ト) is registered as an unregistered word, and the case in part (チ) is corrected as a correctable case.

[Effect of the invention]

According to this invention, attention is paid to one-character particles, which provides the advantage that misreadings in words made up of more than one character type (other than kanji followed by hiragana), and misreadings across character types, can be detected and corrected. Moreover, by also extracting the one-character verb ending 「る」, the one-character auxiliary verbs 「た」 and 「だ」, and frequently occurring two-character hiragana words, the accuracy of detection and correction can be raised further.

[Brief explanation of the drawing]

FIG. 1 is a flowchart showing an embodiment of the invention; FIG. 2 is a block diagram showing a system to which the invention is applied; FIG. 3 is an explanatory diagram showing a specific example of an input character string and its recognition result; FIG. 4 is an explanatory diagram showing an example of the preceding/following-character concatenation table; FIG. 5 is an explanatory diagram showing another example of an input character string and its recognition result; FIG. 6 is a flowchart showing another embodiment of the invention; FIG. 7 is an explanatory diagram showing examples of frequently occurring two-character hiragana words; and FIG. 8 is an explanatory diagram showing a further example of an input character string and its recognition result.

Description of symbols: 1: scanner; 2: character recognition device; 3: display; 4: data processing device (personal computer); 5: keyboard.

Claims

[Claims]

1) A method for detecting and correcting misread characters, characterized in that, after characters are recognized by comparing the feature pattern of each input character with the prestored standard pattern of each character, the recognized character string is divided at punctuation marks; one-character hiragana particles (ka, ga, shi, te, de, to, ni, no, wa, ba, he, mo, ya, wo) are extracted from each punctuation-delimited string; the characters before and after each extracted character are compared with a prestored concatenation table of preceding and following two-character sequences extracted from words containing a one-character hiragana particle, to determine whether the extracted character is a particle; the string is further divided at the characters confirmed as particles; and misread characters are detected and corrected by matching each divided string against a word dictionary using the longest-match method.

2) A method for detecting and correcting misread characters, characterized in that, after characters are recognized by comparing the feature pattern of each input character with the prestored standard pattern of each character, the recognized character string is divided at punctuation marks; one-character hiragana particles (ka, ga, shi, te, de, to, ni, no, wa, ba, he, mo, ya, wo), the one-character verb ending "ru", the one-character auxiliary verbs "ta" and "da", and frequently occurring two-character hiragana words (koto, thing, when, to, this, that, that, which, this, it, those, which, from, than, etc., until, not, desu, masu, aru, etc.) are extracted from each punctuation-delimited string; depending on whether these words and the character types before and after them satisfy predetermined conditions, the character string is further divided into ranges of one to several words; and misreadings are detected and corrected by matching words within each range using the longest-match method.

3) The method for detecting and correcting misread characters according to claim 2, characterized in that, when a word containing a misreading or an unregistered word not in the word dictionary appears, word ranges are determined by the longest-match method from both the front and the back of the range delimited as above, and the remaining portion is extracted and compared with the word dictionary to detect and correct the word containing the misread characters.