JPH03156589A - Method for detecting and correcting erroneously read character - Google Patents

Method for detecting and correcting erroneously read character

Info

Publication number
JPH03156589A
JPH03156589A JP2011695A JP1169590A JPH03156589A JP H03156589 A JPH03156589 A JP H03156589A JP 2011695 A JP2011695 A JP 2011695A JP 1169590 A JP1169590 A JP 1169590A JP H03156589 A JPH03156589 A JP H03156589A
Authority
JP
Japan
Prior art keywords
character
characters
letter
word
hiragana
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2011695A
Other languages
Japanese (ja)
Inventor
Akiko Konno
紺野 章子
Yasuo Hongo
本郷 保夫
Shinji Matsui
伸二 松井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuji Electric Co Ltd
Fuji Facom Corp
Original Assignee
Fuji Electric Co Ltd
Fuji Facom Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Electric Co Ltd, Fuji Facom Corp
Priority to JP2011695A
Publication of JPH03156589A

Landscapes

  • Character Discrimination (AREA)

Abstract

PURPOSE: To detect and correct misread characters with high accuracy by using a linkage table of the two-character sequences that precede and follow single-character postpositional particles to decide whether an extracted character is in fact a particle, dividing the character string at the confirmed particles, and collating the result with a word dictionary. CONSTITUTION: The feature pattern of an input character, obtained through a scanner 1 and a character recognition device 2, is compared with standard patterns by a personal computer 4 and the character is recognized. The recognized character string is then partitioned at punctuation marks. From each partitioned string, the single-character Japanese postpositional particles (ka, ga, shi, te, de, to, ni, no, ha, ba, he, mo, ya, wo) are extracted, and the characters before and after each one are compared with the stored two-character linkage table to determine whether the character is a particle. The character string is divided at the confirmed particles, and misread characters are detected and corrected by collating each divided string with the word dictionary using the longest-match method, so that detection and correction can be executed with high accuracy.

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to a method for detecting and correcting misread characters.

[Prior Art]

To improve the reliability of the results produced by a character recognition device, it is common practice to analyze the recognized text morphologically as Japanese sentences, match words against a dictionary at the level of the resulting word segmentation, and thereby detect and correct misread characters.

Conventionally, morphological analysis first treats punctuation marks as reliable delimiters. The character string within each punctuation-delimited range is then split further at the points where the character type (kanji, hiragana, katakana) changes, and each resulting string is analyzed by the longest-match method while being checked against a word dictionary. A transition from kanji to hiragana is not treated as a split point, because it may be okurigana.
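The conventional character-type split described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names `char_type` and `split_by_char_type` and the Unicode ranges used to classify characters are assumptions made for the example.

```python
def char_type(ch):
    """Classify one character as hiragana, katakana, kanji, or other (assumed ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA or ch == "\u30fc":  # include the long-vowel mark
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_char_type(text):
    """Split at every character-type change except kanji followed by hiragana."""
    chunks = []
    current = ""
    for ch in text:
        if current:
            prev_t = char_type(current[-1])
            cur_t = char_type(ch)
            okurigana_like = prev_t == "kanji" and cur_t == "hiragana"
            if prev_t != cur_t and not okurigana_like:
                chunks.append(current)
                current = ""
        current += ch
    if current:
        chunks.append(current)
    return chunks


# A recognition result in which the last kanji of the katakana+kanji word was misread.
print(split_by_char_type("昨年一年間にソ連大のなかで"))
```

Because the katakana character and the kanji that follows it always land in different chunks, a dictionary lookup inside a single chunk never sees the mixed-type word as a whole.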

[Problems to Be Solved by the Invention]

However, morphological analysis performed in this way has two problems. First, words that contain a character-type change other than kanji to hiragana, such as 「ソ連」 (Soviet Union, katakana followed by kanji) and 「さ迷う」 (to wander, hiragana followed by kanji), are not extracted as words, so misread characters inside them cannot be detected or corrected. Second, some kanji and katakana have nearly identical shapes (夕 and タ, 力 and カ, 工 and エ, and so on), and when such a character is misread as its look-alike in the other character type, the misreading cannot be detected or corrected.

[Means for Solving the Problems]

Characters are recognized by comparing the feature pattern of each input character with the prestored standard pattern of each character, and the recognized character string is then split at punctuation marks. From each punctuation-delimited string, the single-character hiragana particles (か, が, し, て, で, と, に, の, は, ば, へ, も, や, を) are extracted. The characters before and after each extracted character are compared with a prestored linkage table of preceding and following two-character sequences, extracted from words that contain the single-character hiragana particles, to determine whether the extracted character is a particle. The string is split further at the characters confirmed as particles, and misread characters are detected and corrected by checking each resulting string against a word dictionary using the longest-match method.

[Operation]

A preceding/following-character linkage table is prepared in advance by extracting, from a dictionary of words that contain the frequently occurring single-character hiragana words (か, が, し, て, で, と, に, の, は, ば, へ, も, や, を), the characters that can come before or after those hiragana. After the text is split at punctuation marks, every position where one of these 14 hiragana occurs is located, and the adjacent characters are compared with the linkage table. If either the preceding or the following sequence matches an entry, the character is marked as possibly being a hiragana inside a word. If nothing matches, the character is determined to be a single-character particle, and morphological analysis and the detection and correction of misread characters are carried out using it as a delimiter. This makes it possible to detect and correct words composed of multiple character types, and reduces misreadings across character types (katakana read as kanji, or kanji read as katakana).
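The table construction and the particle check described above might look like the following sketch. The patent does not give the table layout, so the representation as two sets of two-character strings, the function names, and the two-word toy dictionary are all assumptions made for illustration.

```python
# The 14 single-character hiragana particles listed in the text.
PARTICLES = set("かがしてでとにのはばへもやを")


def build_linkage_table(words):
    """Collect the two-character sequences around particle characters inside words.

    Returns (before_pairs, after_pairs): strings of the form
    previous-character + particle and particle + next-character.
    """
    before_pairs, after_pairs = set(), set()
    for word in words:
        for i, ch in enumerate(word):
            if ch not in PARTICLES:
                continue
            if i > 0:
                before_pairs.add(word[i - 1:i + 1])
            if i + 1 < len(word):
                after_pairs.add(word[i:i + 2])
    return before_pairs, after_pairs


def classify_candidates(text, table):
    """Label every particle candidate as a confirmed particle or a possible in-word hiragana."""
    before_pairs, after_pairs = table
    labels = []
    for i, ch in enumerate(text):
        if ch not in PARTICLES:
            continue
        hit_before = i > 0 and text[i - 1:i + 1] in before_pairs
        hit_after = i + 1 < len(text) and text[i:i + 2] in after_pairs
        labels.append((i, ch, "maybe_in_word" if (hit_before or hit_after) else "particle"))
    return labels


# Hypothetical toy dictionary; a real table would be built from a full word dictionary.
table = build_linkage_table(["なか", "まって"])
for entry in classify_candidates("昨年一年間にソ連大のなかで", table):
    print(entry)
```

With this toy table, 「に」, 「の」 and 「で」 are labelled as particles and 「か」 is only marked, because the table contains the sequence 「なか」.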

[Embodiments]

FIG. 1 is a flowchart showing an embodiment of the invention, and FIG. 2 is a block diagram showing a system to which the invention is applied.

FIG. 2 is described first. In the figure, 1 is a scanner that inputs a document image into an image memory by optical scanning; 2 is a character recognition device (OCR) consisting of a recognition unit, which extracts and recognizes the image of each character from the document image, and a correction unit, which corrects misread characters using a word dictionary and a connection table (map); 3 is a display for showing and checking the recognition results; 4 is a data processing device such as a personal computer; and 5 is a keyboard used for data entry and similar operations.

Next, FIG. 1 is described.

First, each character of the recognition result is checked to see whether it is a punctuation mark; if it is, processing ends without doing anything (see ■ to ■). If it is not a punctuation mark, it is checked to see whether it is a single-character hiragana particle (see ■); if it is not, processing returns to step ■.

If it is a single-character hiragana particle, it is matched against the preceding/following-character linkage table (see ■); if a match is found, processing returns to step ■.

If it is not in the linkage table, the character is confirmed as a single-character particle (see ■), and morphological analysis is performed on the character strings delimited by this particle (see ■). If a misreading is detected as a result, it is corrected (see ■).
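The splitting part of the FIG. 1 flow can be sketched as below. The step numbers of the flowchart are not recoverable from the text, so this is only one possible reading: the function name, the set-based table, and the sample table entries 「なか」 and 「って」 (the two entries the description mentions for FIG. 4) are assumptions.

```python
PARTICLES = set("かがしてでとにのはばへもやを")


def split_at_confirmed_particles(text, before_pairs, after_pairs):
    """Cut a punctuation-free string at every particle candidate that is not in the table."""
    segments = []
    current = ""
    for i, ch in enumerate(text):
        in_table = (
            (i > 0 and text[i - 1:i + 1] in before_pairs)
            or (i + 1 < len(text) and text[i:i + 2] in after_pairs)
        )
        if ch in PARTICLES and not in_table:
            # Confirmed particle: close the running segment and emit the particle itself.
            if current:
                segments.append(current)
            segments.append(ch)
            current = ""
        else:
            # Ordinary character, or a candidate that may be part of a word.
            current += ch
    if current:
        segments.append(current)
    return segments


before_pairs = {"なか", "って"}  # assumed sample entries
after_pairs = set()
print(split_at_confirmed_particles("昨年一年間にソ連大のなかで", before_pairs, after_pairs))
```

Each segment between confirmed particles would then be handed to dictionary-based morphological analysis, which is outside the scope of this sketch.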

FIG. 3 shows a concrete example of a character string, and FIG. 4 shows part of the preceding/following-character linkage table.

In the case of FIG. 3, the input character string is 「昨年一年間にソ連人のなかで」, while the character recognition result (first candidate) is 「昨年一年間にソ連大のなかで」. If such a string is split by character type as in the conventional method (kanji-to-hiragana transitions are excluded because they may be okurigana), it becomes 「昨年一年間に/ソ/達人のなかで/」, where 「/」 marks a split. When word matching and correction are performed within each delimited string, 「年間」 can be corrected, but the misreading 「達人」 cannot.

With the method of this invention, on the other hand, 「に, の, か, で」 are picked up from the string as single-character particle candidates, and the characters before and after each of them are compared with the linkage table shown in FIG. 4. For 「に」, 「の」 and 「で」, the adjacent characters do not match any entry in the table, so these can be confirmed as particles. For 「か」, the table contains the sequence 「なか」, so 「か」 is marked as possibly being a hiragana inside a word. Splitting the string at the characters confirmed as particles gives 「昨年一年間/に/ソ/達人/のなか/で/」, which allows more accurate morphological analysis than splitting at character-type change points. For the portion 「ソ連人」, two word segmentations are conceivable, 「ソ/達人」 and 「ソ連/人」, but because 「ソ」 is an abbreviation of 「ソビエト連邦」 (Soviet Union), 「ソ連人」 is taken as correct and the misreading can be corrected.

FIG. 5 shows another example of a character string. Here the input string is 「トロイカが走って来た」 and the character recognition result (first candidate) is 「トロイ力が走って来た」, in which the katakana 「カ」 has been misread as the kanji 「力」. Splitting this string by character type as in the conventional method gives 「トロイ/力/が/走って/来た」. Word matching within the ranges produced by the character-type split therefore does not correct the misreading of 「カ」 as 「力」.

With the method of this invention, in contrast, 「が」 and 「て」 are raised as single-character particle candidates, but because 「って」 exists in the linkage table of FIG. 4, 「て」 is not confirmed as a particle. The split result is therefore 「トロイ力/が/走って来た」, and the longest-match method can correct 「トロイ力」 to 「トロイカ」.
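Why the wider segment makes the correction possible can be illustrated with a small sketch. The patent does not describe how replacement candidates are generated, so the look-alike table, the function name `correct_segment`, and the one-word dictionaries below are assumptions; Unicode escapes are used so that the kanji and katakana look-alikes stay distinguishable in the source.

```python
from itertools import product

# Look-alike pairs named in the text: kanji on the left, katakana on the right.
KANJI_KATAKANA = [("\u529b", "\u30ab"),   # chikara (kanji) / ka (katakana)
                  ("\u5915", "\u30bf"),   # yuu (kanji) / ta (katakana)
                  ("\u5de5", "\u30a8")]   # kou (kanji) / e (katakana)
LOOKALIKES = {}
for kanji, kana in KANJI_KATAKANA:
    LOOKALIKES[kanji] = kana
    LOOKALIKES[kana] = kanji


def correct_segment(segment, dictionary):
    """Return the segment if it is a known word, else a look-alike variant that is, else None."""
    if segment in dictionary:
        return segment
    # For each position, allow the original character and, if any, its look-alike.
    options = [(ch, LOOKALIKES[ch]) if ch in LOOKALIKES else (ch,) for ch in segment]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate in dictionary:
            return candidate
    return None


# Segment produced by the particle-based split: katakana word ending in a misread kanji.
print(correct_segment("トロイ\u529b", {"トロイ\u30ab"}))
# Fragment produced by the character-type split: the lone kanji is itself a word.
print(correct_segment("\u529b", {"\u529b", "トロイ\u30ab"}))
```

In the second call the isolated kanji matches the dictionary as it stands, so nothing flags it as an error; only the undivided segment exposes the mismatch.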

OCR recognition results contain misread characters with a certain probability, and not every word in a text is in the dictionary. Words containing misread characters and unregistered words (hereinafter collectively called unknown words) must therefore be extracted by morphological analysis. When an unknown word appears within a punctuation-delimited range, morphological analysis from the beginning of the range can locate the start of the unknown word, but extracting its end is difficult. Even if the analysis is repeated while shifting one character at a time from the start of the unknown word, words often contain other words, so a reliable range may not be obtained. The following approach is therefore considered.

FIG. 6 is a flowchart showing an embodiment of the invention based on this viewpoint.

First, in steps ■ to ■ of FIG. 6, in addition to the single-character hiragana particles described above, the verb-final character 「る」, the single-character auxiliary verbs 「た」 and 「だ」, and frequently occurring two-character hiragana words are extracted from the character string within each punctuation-delimited range. FIG. 7 shows the kinds of frequently occurring two-character hiragana words; 23 kinds are covered here. Single-character particles may be confirmed using a linkage table like that of FIG. 4, as in the embodiment of FIG. 1, but this method need not be used.

In steps ■ to ■ of FIG. 6, each character extracted as above is checked to see whether its relationship with the character types before and after it fits one of the following patterns. Where a pattern matches, the punctuation-delimited range is split into smaller ranges.

a) Before and after the single-character particle 「を」: …/を/…
b) Between any other single-character particle and a non-hiragana character: …の/場合…, …が/スタート…
c) Between the verb ending 「る」 or the single-character auxiliary verb 「た」 and a non-hiragana character: …起きる/時間…, …起きた/時間…
d) Between a two-character particle and a non-hiragana character: …/から/始まる…
e) Between a non-hiragana character and a two-character particle: …読書/など/…
f) Between a non-hiragana character and a two-character auxiliary verb: …本当/です/…
g) Before and after a two-character formal noun followed by the single-character auxiliary verb 「だ」 or a single-character particle: …/もの/だ/…, …/ため/に/…
h) Before and after a two-character verb followed by a two-character formal noun: …/いる/ため/…, …/いる/もの/…
i) Before and after a two-character sa-row irregular verb followed by a two-character formal noun: …/する/こと/…
j) Before a two-character verb at the end of a sentence: …/です。, …/ます。
k) Before a two-character auxiliary verb at the end of a sentence: …/ある。, …/なる。

The terms two-character particle, two-character formal noun, and so on follow the classification in FIG. 7.
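Three of the patterns above, a), b) and c), can be sketched as position-based cut rules. This is a simplified illustration: it applies the rules to every occurrence of the characters rather than only to characters already extracted as particles or verb endings, and the names `split_points` and `apply_cuts` are hypothetical.

```python
OTHER_PARTICLES = set("かがしてでとにのはばへもや")  # single-character particles except を
VERB_TAILS = {"る", "た"}  # verb ending and single-character auxiliary verb


def is_hiragana(ch):
    return "\u3041" <= ch <= "\u3096"


def split_points(text):
    """Return the set of cut positions implied by patterns a), b) and c)."""
    cuts = set()
    for i, ch in enumerate(text):
        next_is_non_hiragana = i + 1 < len(text) and not is_hiragana(text[i + 1])
        if ch == "を":
            cuts.update({i, i + 1})          # a) cut on both sides of を
        elif ch in OTHER_PARTICLES and next_is_non_hiragana:
            cuts.add(i + 1)                  # b) particle followed by non-hiragana
        elif ch in VERB_TAILS and next_is_non_hiragana:
            cuts.add(i + 1)                  # c) る or た followed by non-hiragana
    return cuts


def apply_cuts(text, cuts):
    """Split text at the given positions, ignoring cuts at the very start or end."""
    pieces = []
    start = 0
    for cut in sorted(c for c in cuts if 0 < c < len(text)):
        pieces.append(text[start:cut])
        start = cut
    pieces.append(text[start:])
    return pieces


for sample in ["本を読む", "その場合", "起きた時間"]:
    print(apply_cuts(sample, split_points(sample)))
```

The remaining patterns d) to k) depend on the two-character word classes of FIG. 7 and would need that word list to be modelled.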

When a finely delimited range obtained in this way is one of the single-character or two-character hiragana words described above, that word is treated as confirmed. Otherwise, the delimited range is analyzed morphologically while being checked against the word dictionary. When an unknown word is present, morphological analysis often cannot complete the analysis. As noted above, this is because the beginning of the unknown word can be determined but its end is hard to determine.

The invention therefore analyzes from the end of the delimited range in the reverse direction using the longest-match method (see ■) to determine the end of the unknown word. The determined range is then checked against the word dictionary. For a word containing a misreading, the characters are checked for misreading and corrected. For an unregistered word, an unregistered-word list is created and used later to confirm the unregistered words.
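Bracketing an unknown word from both ends might be sketched as follows. The function names and the three-word dictionary are assumptions for illustration; the patent only states that longest-match analysis is run from the front and from the back of the delimited range.

```python
def forward_longest_match(text, dictionary):
    """Consume known words from the left; return (words, position where matching stopped)."""
    words = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):      # longest candidate first
            if text[pos:end] in dictionary:
                words.append(text[pos:end])
                pos = end
                break
        else:
            break                                   # nothing matches here: unknown word starts
    return words, pos


def backward_longest_match(text, dictionary, stop):
    """Consume known words from the right, never going left of index `stop`."""
    words = []
    pos = len(text)
    while pos > stop:
        for start in range(stop, pos):              # longest candidate first
            if text[start:pos] in dictionary:
                words.insert(0, text[start:pos])
                pos = start
                break
        else:
            break                                   # nothing matches here: unknown word ends
    return words, pos


def isolate_unknown(text, dictionary):
    """Return (known head words, unknown middle span, known tail words)."""
    head, left = forward_longest_match(text, dictionary)
    tail, right = backward_longest_match(text, dictionary, left)
    return head, text[left:right], tail


# Hypothetical dictionary; the middle span contains a misread character.
print(isolate_unknown("事件で逮補された", {"事件", "で", "された"}))
```

The middle span returned here is the part that would then be checked for misread characters or added to the unregistered-word list.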

FIG. 8 is an explanatory diagram illustrating the above method.

In this case the input character string is 「××電工事件で逮捕されたために」, as shown in (イ) of the figure, and the recognition result (first candidate) is 「××電工事件で逮補さ礼たために」, as shown in (ロ): 「捕」 has been misread as 「補」, and 「れ」 as 「礼」.

Splitting this by the conventional method gives the segmentation shown in (ハ) of the figure. Splitting it by the method of this invention, as shown in (ニ), leaves shorter hiragana strings than in (ハ), which makes word segmentation by the longest-match method easier. Furthermore, after the beginning and end of each unknown word are extracted as in (ホ) and (ヘ) (「?」 marks the beginning or the end), the candidate characters for the unknown word are combined. If a word that fits the grammar can be extracted, the correction is made. If no word can be extracted, the string is listed as an unregistered word, confirmation is requested later, and correction is requested wherever it is needed. The case of (ト) is registered as an unregistered word, and the case of (チ) is corrected as a correctable case.
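The final step, combining candidate characters and falling back to an unregistered-word list, can be sketched as below. It assumes that the recognizer supplies a ranked list of candidate characters for each position, which the text implies by speaking of first candidates but does not specify; the function name and sample data are hypothetical.

```python
from itertools import product


def resolve_unknown(candidates, dictionary, unregistered):
    """Try all candidate combinations; return a dictionary word or record the span as unregistered.

    candidates: one list per character position, first entry = first-ranked candidate.
    """
    first_choice = "".join(position[0] for position in candidates)
    for combo in product(*candidates):
        word = "".join(combo)
        if word in dictionary:
            return word                      # correctable case
    unregistered.append(first_choice)        # keep for later confirmation by the operator
    return first_choice


unregistered_words = []
dictionary = {"逮捕"}

# Second position: the misread character ranked first, the correct one ranked second.
print(resolve_unknown([["逮"], ["補", "捕"]], dictionary, unregistered_words))
# No combination is a known word, so the span goes to the unregistered list.
print(resolve_unknown([["電"], ["工"]], dictionary, unregistered_words))
print(unregistered_words)
```

The number of combinations grows multiplicatively with the number of candidates per position, so a practical version would presumably cap the candidate rank considered.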

[Effects of the Invention]

According to this invention, focusing on single-character particles makes it possible to detect and correct misreadings in words composed of multiple character types (other than kanji followed by hiragana), as well as misreadings across character types. In addition, by also focusing on and extracting the single-character verb ending 「る」, the single-character auxiliary verbs 「た」 and 「だ」, and frequently occurring two-character hiragana words, the accuracy of detection and correction can be raised further.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing an embodiment of the invention; FIG. 2 is a block diagram showing a system to which the invention is applied; FIG. 3 is an explanatory diagram of a concrete example of an input character string and its recognition result; FIG. 4 is an explanatory diagram of an example of the preceding/following-character linkage table; FIG. 5 is an explanatory diagram of another example of an input character string and its recognition result; FIG. 6 is a flowchart showing another embodiment of the invention; FIG. 7 is an explanatory diagram of examples of frequently occurring two-character hiragana words; and FIG. 8 is an explanatory diagram of yet another example of an input character string and its recognition result.

Description of reference numerals: 1: scanner; 2: character recognition device; 3: display; 4: data processing device (personal computer); 5: keyboard.

Claims (1)

[Claims]

1) A method for detecting and correcting misread characters, characterized in that: characters are recognized by comparing the feature pattern of each input character with the prestored standard pattern of each character; the recognized character string is split at punctuation marks; the single-character hiragana particles (か, が, し, て, で, と, に, の, は, ば, へ, も, や, を) are extracted from each punctuation-delimited string; the characters before and after each extracted character are compared with a prestored linkage table of preceding and following two-character sequences, extracted from words containing the single-character hiragana particles, to determine whether the extracted character is a particle; the string is split further at the characters confirmed as particles; and misread characters are detected and corrected by checking each resulting string against a word dictionary using the longest-match method.

2) A method for detecting and correcting misread characters, characterized in that: characters are recognized by comparing the feature pattern of each input character with the prestored standard pattern of each character; the recognized character string is split at punctuation marks; the single-character hiragana particles (か, が, し, て, で, と, に, の, は, ば, へ, も, や, を), the single-character verb ending 「る」, the single-character auxiliary verbs 「た」 and 「だ」, and the frequently occurring two-character hiragana words (こと, もの, とき, ため, この, その, あの, どの, これ, それ, あれ, どれ, から, より, など, まで, ない, です, ます, ある, なる, いる, する) are each extracted from each punctuation-delimited string; the string is split further into ranges of one to several words according to whether these words and the character types before and after them satisfy predetermined conditions; and misreadings are detected and corrected by matching words within each range using the longest-match method.

3) The method for detecting and correcting misread characters according to claim 2, characterized in that, when a word containing a misreading or an unregistered word not in the word dictionary appears, words are confirmed successively by the longest-match method from the front and from the back within the range delimited as described above in order to determine the extent of the word, the remaining portion is extracted, and words containing misreadings are detected and corrected while checking against the word dictionary.
JP2011695A 1989-08-23 1990-01-23 Method for detecting and correcting erroneously read character Pending JPH03156589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011695A JPH03156589A (en) 1989-08-23 1990-01-23 Method for detecting and correcting erroneously read character

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP1-214929 1989-08-23
JP21492989 1989-08-23
JP2011695A JPH03156589A (en) 1989-08-23 1990-01-23 Method for detecting and correcting erroneously read character

Publications (1)

Publication Number Publication Date
JPH03156589A true JPH03156589A (en) 1991-07-04

Family

ID=26347183

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011695A Pending JPH03156589A (en) 1989-08-23 1990-01-23 Method for detecting and correcting erroneously read character

Country Status (1)

Country Link
JP (1) JPH03156589A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06301781A (en) * 1993-02-03 1994-10-28 Internatl Business Mach Corp <Ibm> Method and equipment for image transformation for pattern recognition by computer
US6246793B1 (en) 1993-02-03 2001-06-12 International Business Machines Corp. Method and apparatus for transforming an image for classification or pattern recognition
US5872725A (en) * 1994-12-05 1999-02-16 International Business Machines Corporation Quasi-random number generation apparatus and method, and multiple integration apparatus and method of function f

Similar Documents

Publication Publication Date Title
US7680649B2 (en) System, method, program product, and networking use for recognizing words and their parts of speech in one or more natural languages
US5161245A (en) Pattern recognition system having inter-pattern spacing correction
US8725497B2 (en) System and method for detecting and correcting mismatched Chinese character
Zhang et al. Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm
Uthayamoorthy et al. Ddspell-a data driven spell checker and suggestion generator for the tamil language
Tufiş et al. DIAC+: A professional diacritics recovering system
US10515148B2 (en) Arabic spell checking error model
Kanwar et al. N-GRAMS SOLUTION FOR ERROR DETECTION AND CORRECTION IN HINDI LANGUAGE.
JPH03156589A (en) Method for detecting and correcting erroneously read character
Octaviano et al. A spell checker for a low-resourced and morphologically rich language
KS et al. Automatic error detection and correction in malayalam
JPS6239793B2 (en)
Mon Spell checker for Myanmar language
JP4318223B2 (en) Document proofing apparatus and program storage medium
JP3924899B2 (en) Text search apparatus and text search method
JP2908460B2 (en) Error recognition correction method and apparatus
Mon et al. Myanmar spell checker
JP2006294069A (en) Document corrector and program storage medium
Byun et al. Automatic spelling correction rule extraction and application for spoken-style korean text
JPH0614376B2 (en) Japanese sentence error detection device
JP2939945B2 (en) Roman character address recognition device
JPH0362260A (en) Detecting/correcting device for katakana word error
JPH087046A (en) Document recognition device
JPH0589281A (en) Erroneous read correcting and detecting method
JPS6394364A (en) Automatic correction device for wrong character in japanese sentence