JPS63249282A - Multifont printed character reader

Info

Publication number: JPS63249282A
Application number: JP62083314A
Authority: JP
Inventors: Atsushi Shimoyama; 霜山　篤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1987-04-03
Filing date: 1987-04-03
Publication date: 1988-10-17

Abstract

PURPOSE: To process documents efficiently by strictly collating one page of input characters against each font, taking the font with the lowest reject (unreadable) rate as the font of that page, and then collating the same page of input characters against that font alone using a relaxed criterion to output the answers. CONSTITUTION: A collating means 2 calculates the degree of noncoincidence between the feature data of an input character and the feature data in a dictionary. It outputs an answer when the degree of noncoincidence is lower than a prescribed threshold and the difference between the first and second candidates is larger than a prescribed threshold, and outputs a reject otherwise. A font determining means 3 counts, for each page of the input document, the number of unreadable characters obtained with each font of the dictionaries 11, 12, ..., and determines the font with the fewest unreadable characters to be the font of that page. A font selecting means 4 selects each font in turn for every page of the input document in the first stage, and selects the font determined by the font determining means 3 in the second stage. A threshold selecting means 5 selects decision thresholds that differ between the first stage and the second stage. Thus, characters are read efficiently and with high precision.

Description

[Detailed Description of the Invention] [Summary] A device for reading printed characters in multiple fonts. It holds a separate dictionary for each font, strictly collates one page of input characters against each font, regards the font with the lowest reject (unreadable) rate as the font of that page, and then collates the page against that font alone using a looser criterion to output the answers.

[Industrial Field of Application] The present invention relates to a printed character reader, and in particular to a reader for documents printed with characters in several kinds of fonts.

Literature, patent information, foreign periodicals, and similar material can be entered into a computer as character-code information and used for classification, searching, fast reference, and so on. Machine translation, automatic distribution, and automatic design by computer have also become possible. However, most of this information is entered by keyboard, which requires an enormous amount of labor and time.

One way to solve this is to convert these documents into character codes for the computer simply by feeding them to a scanner, but the technology for converting them into code information that a computer can understand is currently inadequate.

One of the problems is that character shapes vary enormously across countries and fields, and technology that can adapt to them immediately is lacking. Technology that can cope with this is in demand.

[Prior Art] In conventional printed character readers, the operator normally registers the font used in the document to be read before reading is performed.

There is an example of multi-font support for line-printer characters consisting of numerals only. In that method, dictionaries are prepared for all fonts, the dictionaries of all fonts to be read are stored together without being separated by font, and the answer indicated by the dictionary entry with the smallest degree of mismatch with the input pattern is output as is.
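The pooled-dictionary approach described above can be sketched as follows. This is only an illustration: the feature vectors, the `mismatch` measure (a sum of absolute differences), and the names `read_pooled` and `pooled` are assumptions, since the text does not specify how features or the degree of mismatch are computed.

```python
def mismatch(a, b):
    """Hypothetical degree of mismatch: sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def read_pooled(features, pooled_dictionary):
    """Return the category of the pooled entry with the smallest mismatch.

    pooled_dictionary is a list of (category, template) pairs in which the
    entries of all fonts are mixed together, as in the prior-art method.
    No reject decision is made: the best entry is always output as is.
    """
    category, _ = min(pooled_dictionary, key=lambda e: mismatch(features, e[1]))
    return category


# Two numeral categories, each with templates from two fonts (made-up values).
pooled = [
    ("1", [0, 9, 0]), ("1", [1, 8, 1]),
    ("7", [9, 9, 0]), ("7", [8, 7, 1]),
]
print(read_pooled([1, 9, 0], pooled))  # prints 1
```

Because every font's templates compete in one pool, adding fonts or character types enlarges the set of candidates that a single input must be separated from.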

[Problems to Be Solved by the Invention] The conventional multi-font technique described above can cope when the only category is numerals, but it has the drawback that it can no longer cope when the character types to be read extend to many categories such as alphanumerics, kana, and symbols.

Naturally, with multiple fonts, characters of the same category have different shapes and can no longer be read with a single dictionary. This is equivalent to an increase in the number of categories, and high-accuracy reading becomes impossible.

The present invention aims to provide a novel multi-font printed character reader that eliminates these problems of the prior art.

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the multi-font printed character reader of the present invention.

In the figure, 11, 12, and 13 are dictionaries, stored separately for each font.

2 is a collation means. It calculates the degree of mismatch between the feature data of an input character and the feature data of a dictionary, outputs an answer when the degree of mismatch is smaller than a predetermined threshold and the difference between the first and second candidates is larger than a predetermined threshold, and outputs "unreadable" otherwise.
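The decision rule of the collation means can be sketched as below. This is a minimal sketch under stated assumptions: `mismatch` (sum of absolute differences), the `REJECT` marker, and the parameter names `max_mismatch` and `min_gap` are hypothetical, and treating a one-entry dictionary as passing the gap test is a choice made here, not something the text states.

```python
REJECT = None  # hypothetical marker for "unreadable"


def mismatch(a, b):
    """Hypothetical degree of mismatch: sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def collate(features, dictionary, max_mismatch, min_gap):
    """Collate one character against one font dictionary.

    dictionary maps category -> template feature vector.
    An answer is returned only when the best mismatch is smaller than
    max_mismatch AND the gap to the second candidate is larger than min_gap.
    """
    ranked = sorted((mismatch(features, t), c) for c, t in dictionary.items())
    best_score, best_category = ranked[0]
    if best_score >= max_mismatch:
        return REJECT
    # Assumption: with a single candidate there is no gap to test.
    if len(ranked) > 1 and ranked[1][0] - best_score <= min_gap:
        return REJECT
    return best_category


font_dictionary = {"A": [0, 0, 9], "B": [9, 0, 0]}
print(collate([1, 0, 9], font_dictionary, max_mismatch=5, min_gap=3))   # A
print(collate([4, 0, 5], font_dictionary, max_mismatch=20, min_gap=3))  # None
```

The second call is rejected because the two candidates score 8 and 10, so the gap of 2 does not exceed `min_gap` even though the best mismatch is under the limit.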

3 is a font determining means. For each font of the dictionaries 11, 12, 13, ..., it counts the number of unreadable characters on each page of the input document and determines the font with the fewest unreadable characters to be the font of that page.
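The counting performed by the font determining means can be sketched as follows. The function name `determine_font`, the use of `None` for an unreadable character, and the tie-breaking behavior (the first font listed wins) are assumptions made for illustration; the text does not say how ties are resolved.

```python
def determine_font(page_answers):
    """Pick the font with the fewest unreadable characters on one page.

    page_answers maps font name -> list of answers for the page,
    where None stands for an unreadable (rejected) character.
    Returns the chosen font and the per-font reject counts.
    """
    reject_counts = {
        font: sum(1 for answer in answers if answer is None)
        for font, answers in page_answers.items()
    }
    # Assumption: on a tie, the font listed first is chosen.
    chosen = min(reject_counts, key=reject_counts.get)
    return chosen, reject_counts


page_answers = {
    "font1": ["A", None, None, "D"],
    "font2": ["A", "B", None, "D"],
    "font3": [None, None, None, "D"],
}
print(determine_font(page_answers))
# ('font2', {'font1': 2, 'font2': 1, 'font3': 3})
```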

4 is a font selection means. In the first stage it selects each font in turn for each page of the input document, and in the second stage it selects the font determined by the font determining means 3.

5 is a threshold selection means, which selects different decision thresholds for the first stage and the second stage.

[Operation] The multi-font printed character reader of the present invention is intended for documents in which multiple fonts are not mixed within a page but are mixed between pages.

The recognition dictionaries 11, 12, 13, ... are prepared separately for each font, and the dictionaries of all fonts to be read are stored separately as the No. 1 font dictionary, the No. 2 font dictionary, and the No. 3 font dictionary.

In the first stage, characters are cut out one at a time from the image information input from the scanner, and the data to be compared with the dictionaries is extracted. The data is then collated with the dictionary of each font. An answer is output when the degree of mismatch is less than a first threshold and the difference between the first and second candidates is greater than a second threshold; otherwise the character is treated as unreadable.

After the answers for all characters on the page have been output for each font, the font determining means 3 counts the number of unreadable characters on the page, compares the counts obtained with each font, and determines the font with the fewest unreadable characters to be the reading font for that page.

In the second stage, the dictionary of the font determined by the font determining means 3 is selected, the page is read again and collated with that dictionary, and the degree of mismatch is obtained. The first threshold is switched to a third threshold with a smaller value, and the second threshold is switched to a fourth threshold with a larger value. An answer is output when the degree of mismatch is smaller than the third threshold and the difference between the first and second candidates is larger than the fourth threshold; otherwise the character is treated as unreadable, and the result is output as the final answer.

This makes it possible to read multi-font printed characters with high accuracy and a low reject rate.

Of course, for documents whose font is clearly known, the collation font can be set in advance and reading can start from the second stage.
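The two stages, including the option of starting from the second stage when the font is already known, can be sketched as follows. Everything named here is hypothetical: `mismatch`, `collate`, `read_page`, the `known_font` parameter, and all numeric values. The threshold pairs are passed in by the caller, so which pair is the stricter one is left open; the example values assume, as the Summary states, that the first stage is strict and the second stage is looser.

```python
def mismatch(a, b):
    """Hypothetical degree of mismatch: sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def collate(features, dictionary, max_mismatch, min_gap):
    """Return the best category, or None (unreadable) if either test fails."""
    ranked = sorted((mismatch(features, t), c) for c, t in dictionary.items())
    best_score, best_category = ranked[0]
    if best_score >= max_mismatch:
        return None
    if len(ranked) > 1 and ranked[1][0] - best_score <= min_gap:
        return None
    return best_category


def read_page(page, dictionaries, stage1, stage2, known_font=None):
    """Read one page; stage1 and stage2 are (max_mismatch, min_gap) pairs."""
    if known_font is None:
        # Stage 1: collate the whole page against every font dictionary
        # and count the unreadable characters obtained with each font.
        rejects = {}
        for font, dictionary in dictionaries.items():
            answers = [collate(f, dictionary, *stage1) for f in page]
            rejects[font] = answers.count(None)
        font = min(rejects, key=rejects.get)
    else:
        font = known_font  # font set in advance: skip stage 1
    # Stage 2: collate again with the chosen font and the other thresholds.
    return font, [collate(f, dictionaries[font], *stage2) for f in page]


dictionaries = {
    "font1": {"A": [0, 0, 9], "B": [9, 0, 0]},
    "font2": {"A": [0, 9, 9], "B": [9, 9, 0]},
}
page = [[0, 8, 9], [9, 9, 1], [2, 7, 7]]  # made-up features, close to font2
print(read_page(page, dictionaries, stage1=(3, 5), stage2=(8, 2)))
# ('font2', ['A', 'B', 'A'])
```

In this example the third character is rejected in stage 1 (its mismatch of 6 is not below 3), yet font2 still has the fewest rejects, and the same character is read as "A" in stage 2 under the second threshold pair.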

[Embodiment] The present invention is described more concretely below with reference to the embodiment shown in FIGS. 2 to 5.

FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention.

In the figure, 01 is a scanner, which optically scans the input document and performs photoelectric conversion to obtain binarized image data.

02 is an image memory, which stores the image data output by the scanner 01.

11, 12, and 13 are the dictionaries for font No. 1, font No. 2, and font No. 3, respectively.

21 is a control processor, which cuts characters out of the image data in the image memory 02, extracts the data to be collated with the dictionaries, and controls the overall data flow.

22 is a collation circuit, which collates the collation data extracted by the control processor 21 with the data of the dictionaries 11, 12, and 13 and outputs an answer or "unreadable" according to the thresholds selected from the threshold memory 51 for the current recognition stage.

31 is a work memory, which temporarily stores the collation data, the first-stage answers, the font, and so on.

61 is a memory that stores the final answer output.

FIGS. 3 to 5 show the flow of processing in an embodiment of the present invention.

FIG. 3 shows the procedure of the collation process in the first stage.

Characters are separated from one page of image data from the scanner, the collation data is extracted and collated with the dictionary of each font, and one page of answers for each font is stored in the work memory.

The operation is described below with reference to the numbers attached to the data flows shown in FIG. 2.

One page of image data ■ from the scanner is stored in the image memory.

The control processor fetches the image data from the image memory ■, separates the characters, extracts the collation data, and temporarily stores it in the work memory ■.

The control processor reads the first and second thresholds from the threshold memory and sets them in the collation circuit ■.

The collation circuit collates the one page of collation data in the work memory with the dictionary data ■ of font No. 1 and temporarily stores the answers in the work memory.

It then collates the data with the dictionary data ■ of font No. 2 and temporarily stores the answers in the work memory. Next, it collates the data with the dictionary data ■ of font No. 3 and temporarily stores the answers in the work memory.

FIG. 4 shows the procedure of the font determination process.

The control processor counts the number of rejected characters in the one page of answers obtained with each font, compares the counts across fonts, and determines the font with the fewest rejected characters to be the reading font.

FIG. 5 explains the collation in the first stage (font determination) and the second stage (final answer output) in this embodiment.

For collation during font determination, the thresholds are set so that the reject rate is high and the misread rate is low. For collation during final answer output, they are set so that the reject rate is lowered within the permissible range.
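The trade-off between the two rates can be made concrete with a small sketch. The text gives no formulas, so the definitions below are the conventional ones and are assumptions: the reject rate is the fraction of characters output as unreadable, and the misread rate is the fraction answered with the wrong category. The function name `rates` and the sample answers are invented for illustration.

```python
def rates(answers, truth):
    """Return (reject_rate, misread_rate) for one page.

    answers: recognized categories, with None for unreadable characters.
    truth:   the correct category of each character.
    """
    total = len(truth)
    rejects = sum(1 for a in answers if a is None)
    misreads = sum(1 for a, t in zip(answers, truth) if a is not None and a != t)
    return rejects / total, misreads / total


truth = ["A", "B", "C", "D"]
font_determination_answers = ["A", None, None, "D"]  # strict: rejects, no misreads
final_output_answers = ["A", "B", "X", "D"]          # looser: no rejects, one misread
print(rates(font_determination_answers, truth))  # (0.5, 0.0)
print(rates(final_output_answers, truth))        # (0.0, 0.25)
```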

[Effects of the Invention] As described above, according to the present invention, printed characters in which multiple fonts are not mixed within a page but are mixed between pages can be read with high accuracy and a low reject rate, which greatly improves the efficiency of document processing.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the principle of the present invention; FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention; FIG. 3 shows the procedure of the first-stage collation process in the embodiment; FIG. 4 shows the font determination procedure in the embodiment; and FIG. 5 explains the collation in the first and second stages.

In the drawings, 11, 12, and 13 are dictionaries; 2 is the collation means; 3 is the font determining means; 4 is the font selection means; 5 is the threshold selection means; 01 is the scanner; 02 is the image memory; 21 is the control processor; 22 is the collation circuit; 31 is the work memory; 51 is the threshold memory; and 61 is the answer output memory.

Claims

[Claims] A multi-font printed character reading device comprising: dictionaries (11, 12, 13) stored separately for each font; a collation means (2) that calculates the degree of mismatch between the feature data of an input character and the feature data of a dictionary and outputs either the answer category or "unreadable"; a font determining means (3) that counts, for each font, the number of unreadable characters on each page of the input document and determines the font with the fewest unreadable characters to be the font of that page; a font selection means (4) that, in the first stage, selects the dictionary of each font (11, 12, 13) in sequence and, in the second stage, selects the dictionary (11, 12, 13) of the font determined by the font determining means (3); and a threshold selection means (5) that selects different decision thresholds for the first stage and the second stage, characterized in that, for each page of the input document, the font is determined in the first stage and, in the second stage, the page is collated with the dictionary of the determined font and read using decision thresholds different from those of the first stage.