JPS63265377A

JPS63265377A - Production of dictionary for optical character reader

Info

Publication number: JPS63265377A
Application number: JP62318598A
Authority: JP
Inventors: Kenji Yasujima; 安島　顕司; Masao Hashimoto; 政雄橋本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-12-19
Filing date: 1987-12-18
Publication date: 1988-11-01
Anticipated expiration: 2012-10-15
Also published as: JP2662404B2

Abstract

PURPOSE: To make it easy to create, using the optical character reader itself, not only dictionaries for printed type but also dictionaries for idiosyncratic handwritten characters, by writing the characters to be registered side by side on a single line of an original and adding a mark that indicates the height of the line. CONSTITUTION: To create a dictionary for idiosyncratic handwriting, a character to be registered, e.g. 'i', is written about ten times at intervals along a single line of an original, and a mark M indicating the height of the line is placed slightly apart from the last 'i'. The original is scanned by an image scanner 2, and the line is segmented by taking the span from where the mark M is first detected to where it is no longer detected as the height of one character. As a result, even characters such as 'i' and 'j', which are separated vertically into two parts, are treated as a single character, so the character pattern is segmented correctly. In the character registration mode, even idiosyncratic handwritten characters are smoothed and then registered, via a keyboard 1, in a character recognition dictionary file on a hard disk device 7. In this way a dictionary for idiosyncratic handwriting is created easily.

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

This invention relates to a method of creating a character recognition dictionary for an optical character reader, commonly abbreviated as "OCR".

Prior Art

Various optical character readers have been developed that scan a document, on which image information including characters is printed or handwritten on paper, with an image scanner, capture the image information of the document as image data, recognize the characters in that image data, and convert them into character code data.

If such an optical character reader is used as the means of entering character information into processing systems that handle characters, such as word processors, automatic translation devices, tabulation devices, and devices that create data files for retrieval, or into communication systems such as data communication links that transmit character data, input efficiency can be improved greatly compared with keyboard entry.

The optical character reader is provided with a character recognition dictionary in which image data of character fonts is registered in advance as reference image information. The character recognition means refers to this dictionary, compares the image data of each input character with the image data in the dictionary by pattern matching, recognizes the input as a specific character, and generates the corresponding character code data.

The designs of commonly used character sets such as printed type, that is, fonts, come in many varieties.

A character recognition dictionary therefore has to be provided for each set of the several character sets in ordinary use.

Conventionally, however, creating such a character recognition dictionary, or maintaining it by correcting or changing it, required a separate machine dedicated to dictionary creation and maintenance, so not everyone could easily create a new dictionary or modify an existing one.

Moreover, creating a dictionary for handwritten characters was subject to many constraints, and an individual's idiosyncratic handwriting could not be registered in a dictionary as it was.

Object

The object of this invention is to eliminate these problems of the conventional method of creating character recognition dictionaries, and to make it possible to create easily, using the optical character reader itself, not only dictionaries for printed type but also dictionaries for idiosyncratic handwriting.

Constitution

To achieve the above object, in an optical character reader of the kind described above, this invention proceeds as shown in FIG. 1. A document on which the same character is written many times along one line, together with a mark M indicating the height of the characters on that line (A), is scanned by the scanner and its image data is read in (B). Character pattern data processing (C) then overlays or averages the dot patterns of the individual characters within the height bounded by the mark M to create a single character pattern. That character pattern, or any one character of the line that was read, is displayed (D), a character code corresponding to the character pattern is assigned (E), and the result is registered in a dictionary file for character recognition (F).
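Step (C) can be illustrated with a minimal sketch. It assumes that the character samples have already been cut out to bitmaps of identical size (lists of rows of 0/1 dots); the 0.5 averaging threshold and the use of a plain `dict` as the dictionary file are assumptions made for illustration and are not stated in the source.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # rows of 0/1 dots; all samples are assumed to share one size


def overlay(samples: List[Bitmap]) -> Bitmap:
    """Superimpose the samples: a dot is black if it is black in any sample."""
    rows, cols = len(samples[0]), len(samples[0][0])
    return [[1 if any(s[r][c] for s in samples) else 0 for c in range(cols)]
            for r in range(rows)]


def average(samples: List[Bitmap], threshold: float = 0.5) -> Bitmap:
    """Average the samples: a dot is black if the fraction of samples in which
    it is black reaches `threshold` (an assumed value)."""
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    return [[1 if sum(s[r][c] for s in samples) / n >= threshold else 0
             for c in range(cols)]
            for r in range(rows)]


def register(dictionary: Dict[str, Bitmap], char_code: str, pattern: Bitmap) -> None:
    """Steps (E) and (F): associate the operator-supplied code with the pattern."""
    dictionary[char_code] = pattern


# Three hand-made 3x3 samples of a dotted stroke, two of them with one stray dot.
a = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
b = [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
c = [[0, 1, 1], [0, 0, 0], [0, 1, 0]]

dictionary: Dict[str, Bitmap] = {}
register(dictionary, "i", average([a, b, c]))
```

Overlaying keeps every stray dot, while averaging suppresses dots that appear in only a minority of the samples, which is one reason a choice between the two is useful.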

The invention is described concretely below on the basis of one embodiment.

FIG. 2 is an external perspective view showing an example of a document processing system that has the optical character reader function embodying this invention and can be used as a word processor, office computer, automatic translation device, form processing device, or the like.

As input devices, this document processing system has a keyboard 1, which has character keys such as alphanumeric and kana keys, cursor movement keys, various function keys, and so on, and through which the operator enters instructions, and an image scanner 2, which optically scans a document and inputs image information including characters as image data.

As output devices, it has a CRT display device 3 (hereinafter simply "CRT"), which displays various characters and image information including guidance for the operator, and a printer 4, such as a laser printer, for printing out the various information processed by the system.

The main body 5 contains a floppy disk drive (FDD) 6 and a hard disk drive (HDD) 7 as data storage devices. As shown in FIG. 3, it further contains a control unit (CPU) 10, consisting of a microcomputer or the like, that supervises the operation of the whole system; a ROM 11 serving as program memory; a RAM 12 (256 Kbit or more) serving as data memory; a keyboard interface 13; a scanner interface 14; a CRT controller 15; an FDD controller 16; an HDD controller 17; a printer controller 18; and so on.

With this system, the image data of a document image read by the image scanner 2 is taken into the main body 5 via the scanner interface 14 and, either directly or after being stored in an image data file on the FDD 6 or HDD 7, is displayed on the CRT 3 or used for processing such as typeface discrimination and character code determination.

The plural character recognition dictionaries, in which the image data of each character is registered as reference image information for each typeface, are normally stored on the HDD 7.

A dictionary can also be created for idiosyncratic handwriting (a typeface that has not been formalized), as described later, and once registered it can be used in the same way as a dictionary for printed type.

Before the dictionary creation method of this invention is explained, the operation of the optical character reader that uses the dictionaries so created is described with reference to the flowchart of FIG. 4 and other figures.

When characters are recognized from document image data taken in directly from the image scanner 2, the darkness of the characters written on the document is first determined in step 1, and in step 2 the brightness of the fluorescent lamp that illuminates the document in the image scanner 2 is set according to the result.

The character darkness can be determined from a level specified by the operator, or automatically from the detection level obtained by partially scanning the document with the image scanner.

The fluorescent lamp is set to light more dimly if the characters on the document are written darkly, and more brightly if they are written faintly.

This prevents characters from being crushed or broken.

Then, in step 3, the image scanner 2 scans the entire surface of the document at a preset reading density, and the image data is taken into the main body 5 and written to memory (RAM 12).

On the other hand, when character recognition is performed on image data stored in an image data file on the HDD 7 or the like, the data is read from that image data file and written to memory (RAM 12).

In this case, the image data read by the image scanner 2 must have been stored in the image data file beforehand.

At that time, a header section is provided in the image data file, as shown in FIG. 5, and holds the reading density and the vertical and horizontal lengths of the scanned document.

The vertical and horizontal lengths are needed for line segmentation and character segmentation during character recognition, and the reading density is needed for character code determination.
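A header of this kind might be written and read as in the sketch below. The source states only which three items the header holds; the field order, the 16-bit little-endian encoding, and the function names are assumptions for illustration.

```python
import struct

# Assumed layout: reading density (dpi), vertical length, horizontal length,
# each stored as an unsigned 16-bit little-endian integer.
HEADER_FORMAT = "<HHH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def write_image_file(density_dpi: int, height: int, width: int, image: bytes) -> bytes:
    """Prefix the raw image bytes with the header."""
    return struct.pack(HEADER_FORMAT, density_dpi, height, width) + image


def read_image_file(blob: bytes):
    """Split a stored file back into (density, height, width, image bytes)."""
    density_dpi, height, width = struct.unpack_from(HEADER_FORMAT, blob, 0)
    return density_dpi, height, width, blob[HEADER_SIZE:]


blob = write_image_file(300, 2, 16, b"\xff\x00\x00\xff")
density_dpi, height, width, image = read_image_file(blob)
```

Keeping the dimensions and density with the image means that a file read back later can be segmented and matched without asking the operator how it was scanned.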

Next, the process proceeds from step 3 or step 4 to step 5, where automatic line segmentation is performed, followed by character segmentation in step 6.

The image scanner 2 acquires image data by scanning the document in the horizontal direction, so the image data is held in the image data file, or in the image data storage area of the RAM 12, sequentially in byte units as shown in FIG. 6.
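Because the dots arrive as a flat run of bytes, locating a given dot requires the row width. The sketch below assumes 8 dots per byte with the most significant bit leftmost and rows padded to whole bytes; the source says only that the data is stored sequentially in byte units, so this packing and the function names are illustrative.

```python
from typing import List


def bytes_per_row(width: int) -> int:
    """Number of bytes holding one scan line of `width` dots (rows padded to a byte)."""
    return (width + 7) // 8


def get_dot(data: bytes, width: int, x: int, y: int) -> int:
    """Return the dot (0 or 1) at column x, row y of the packed image."""
    byte = data[y * bytes_per_row(width) + x // 8]
    return (byte >> (7 - x % 8)) & 1


def crop(data: bytes, width: int, left: int, top: int, right: int, bottom: int) -> List[List[int]]:
    """Cut out the rectangle [left, right) x [top, bottom) as rows of dots."""
    return [[get_dot(data, width, x, y) for x in range(left, right)]
            for y in range(top, bottom)]


# A 10-dot-wide, 2-row image: each row occupies 2 bytes.
data = bytes([0b10000000, 0b01000000,
              0b00000001, 0b00000000])
```

Without the width, the byte offset of row `y` cannot be computed, which is why the header information is needed before any character can be cut out.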

Consequently, when line segmentation and character segmentation are performed, the image of a single character cannot be cut out without the vertical and horizontal length information.

Also, as the reading density increases, the height and width of one character in dots increase, so the matching data naturally changes as well.

FIGS. 6(A) and 6(B) show the state of the image data when the same character is read at reading densities of 200 dpi and 300 dpi, respectively.

When lines are segmented, the horizontal projection is taken and each span from one inter-line space to the next is cut out as a line. However, if the document is set slightly askew on the image scanner, the image data that is read becomes as shown in FIG. 7, and the spaces between lines may disappear when the horizontal projection is taken across the whole width.

In such a case, as shown enclosed by thin lines in FIG. 7, each line is divided into blocks in which the horizontal projection does leave a space between lines, and line segmentation is performed on those blocks.
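The effect of skew on the projection, and the block-wise remedy, can be sketched as follows. The page is modelled as rows of 0/1 dots, and the blocks are taken to be fixed-width vertical strips; the strip width and the function names are assumptions, since the source does not say how the block boundaries are chosen.

```python
from typing import List, Tuple

Bitmap = List[List[int]]


def horizontal_projection(bitmap: Bitmap) -> List[int]:
    """Count the black dots in every scan row."""
    return [sum(row) for row in bitmap]


def segment_lines(bitmap: Bitmap) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, where the projection is non-zero."""
    lines, start = [], None
    for y, count in enumerate(horizontal_projection(bitmap)):
        if count and start is None:
            start = y
        elif not count and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(bitmap)))
    return lines


def segment_lines_by_block(bitmap: Bitmap, block_width: int):
    """Segment lines separately inside each vertical strip of `block_width` columns."""
    width = len(bitmap[0])
    return [(left, segment_lines([row[left:left + block_width] for row in bitmap]))
            for left in range(0, width, block_width)]


# Two skewed text lines: the right half of each sits one row lower than the left half.
page = [[1, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [0, 0, 1, 1]]
```

Across the full width every row contains ink, so the two lines merge into one; inside each strip the blank rows reappear and the lines separate again.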

Next, because this embodiment is also intended to read handwritten characters, the character type is discriminated in step 7 of FIG. 4.

For handwritten characters, in order to raise the recognition rate, the document is divided into a plurality of fields, for example as shown in FIG. 8, and information on the length of each field and a designation of the kinds of characters in each field (alphabetic characters, numerals, symbols, hiragana, kanji, etc.) are supplied in advance.

Handwritten and printed characters can therefore be distinguished by whether or not this information is present.

For printed characters, the process proceeds to step 8, where character font discrimination (typeface discrimination) is performed to decide which character font dictionary to use for recognition, and in step 12 character code determination is performed using that dictionary, as described in detail later.

For handwritten characters, the process proceeds to step 9, where smoothing is performed to correct jaggedness, and in step 10 the character size is corrected by normalization.

Here the characters are enlarged or reduced to a uniform size only to the extent that, for example, letters whose uppercase and lowercase forms have the same shape remain distinguishable.

Then, in step 11, the handwriting style is discriminated to decide which handwriting dictionary to use for recognition, and in step 12 character code determination is performed using that dictionary.

Next, in step 13, it is judged whether character recognition is finished. If not, the process returns to step 5 and repeats the processing from segmentation of the next line through character code determination.

When character recognition has been completed through the last line, the process ends.

The character font (typeface) discrimination and character code determination mentioned above are now explained in detail with reference to FIGS. 9 and 10.

Character font discrimination is performed with a plurality of character font dictionaries as shown in the flowchart of FIG. 9. The dictionaries provided are given variable priorities, for example as shown in the following table.

The image data of one line of characters is matched first against the dictionary with the highest priority, and then against dictionaries of successively lower priority, until a matching result at or above a certain level is obtained.

When matching against a dictionary fails, that dictionary is given the lowest priority and the priority of every other dictionary is raised by one.

If matching against the dictionaries in turn yields a result at or above the required level, the dictionary in use at that time is selected; if no dictionary reaches that level, the dictionary that gave the best match is selected. The character font is discriminated in this way.

A single document is normally printed in one character font, so prioritizing the dictionaries in this way speeds up character recognition (character font discrimination and character code determination) from the next line onward.
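The priority handling described above can be sketched as a rotation of an ordered list. The pass/fail test for a line is supplied by the caller as `line_passes`; that callback, the dictionary names, and the function name are hypothetical and are not part of the source.

```python
from typing import Callable, List, Optional, Tuple


def choose_dictionary(priority: List[str],
                      line_passes: Callable[[str], bool]) -> Tuple[Optional[str], List[str]]:
    """Try dictionaries in priority order for one line.

    A dictionary that fails is moved to the lowest priority, which raises
    every other dictionary by one place. Returns the dictionary that passed
    (or None if none did) together with the updated priority order.
    """
    order = list(priority)
    for _ in range(len(order)):
        candidate = order[0]
        if line_passes(candidate):
            return candidate, order
        order.append(order.pop(0))  # demote the failed dictionary
    return None, order


# The document is printed in the font of dictionary "B".
chosen, new_order = choose_dictionary(["A", "B", "C"], lambda name: name == "B")
```

After the first line, the matching dictionary sits at the head of the list, so later lines of the same document succeed on the first attempt.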

To explain this character font discrimination with reference to FIG. 9: first, the dictionary with the first priority (dictionary A in the table above) is read in. The first character of the line is then read and pattern matching (or feature matching) is performed. If a match is obtained and the character is recognizable (OK), a constant α is subtracted from the register value X (initially 0); if no match is obtained, a constant β (α < β) is added to X.

It is then judged whether the end of the line has been reached; if not, the next character is read and the same processing is performed.

When this has been done through the last character of the line, the value of X at that point is stored and it is judged whether X < 0. If YES, the dictionary used at that time (dictionary A) is selected.

If X < 0 does not hold, it is judged whether any unused dictionary remains. If one does, the dictionary priorities are changed, for example as at the second determination in the table above, the new first-priority dictionary (dictionary B) is read in, matching is performed in order from the first character of the line to the last in the same way as before, and X < 0 is judged. If YES, the dictionary used at that time (dictionary B) is selected.

If X < 0 still does not hold, it is again judged whether any unused dictionary remains. If one does, the priorities are changed again, for example as at the third determination in the table above, and the new first-priority dictionary (dictionary C) is read in and processed in the same way.

If X < 0 then holds, the dictionary used at that time (dictionary C) is selected. If X < 0 still does not hold and no unused dictionary remains, the stored values of X for each dictionary are compared and the dictionary with the smallest value is selected.
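The register arithmetic of FIG. 9 can be sketched as below. The source requires only that α < β; the values 1 and 2, the `match_line` callback that reports per-character match results, and the function names are assumptions. Reordering of the priorities is left out so that the scoring rule stands alone.

```python
from typing import Callable, Dict, Sequence


def line_score(matches: Sequence[bool], alpha: int = 1, beta: int = 2) -> int:
    """Compute X for one line: subtract alpha per matched character,
    add beta per unmatched character (alpha < beta)."""
    x = 0
    for matched in matches:
        x += -alpha if matched else beta
    return x


def discriminate_font(priority: Sequence[str],
                      match_line: Callable[[str], Sequence[bool]],
                      alpha: int = 1, beta: int = 2) -> str:
    """Return the first dictionary with X < 0, else the one with the smallest X.
    `priority` is assumed to be non-empty."""
    scores: Dict[str, int] = {}
    for name in priority:
        scores[name] = line_score(match_line(name), alpha, beta)
        if scores[name] < 0:
            return name
    return min(scores, key=lambda name: scores[name])


# Hypothetical per-character match results for one four-character line.
results = {
    "A": [True, False, False, True],   # X = -1 + 2 + 2 - 1 = 2
    "B": [True, True, True, False],    # X = -1 - 1 - 1 + 2 = -1
    "C": [True, True, True, True],     # never reached
}
selected = discriminate_font(["A", "B", "C"], lambda name: results[name])
```

Because β exceeds α, a dictionary is accepted only when matched characters outnumber unmatched ones by more than β to α; with these values, more than two thirds of the line must match.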

Next, character code determination is performed according to the flowchart of FIG. 10. The dictionary selected by character font discrimination is read in first; however, when this processing follows directly on the character font discrimination of FIG. 9 and the dictionary was selected because X < 0, that dictionary is already loaded, so this step can be omitted.

Then the first character of the line is read and pattern matching against the dictionary (for example, 24-dimensional matching) is performed. If a match is obtained and the character code can be determined (OK), the character code is output. If the code cannot be determined (NG), pattern matching by other methods (4×4×8-dimensional matching, 3×3×8-dimensional matching, pattern matching by the multi-layer directional histogram method, etc.) is performed next; if the character code can then be determined it is output, and if it still cannot be determined, an unreadable code is output.
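The fallback order described above amounts to trying a list of matchers in turn. In the sketch below each matcher is a callable that returns a character code or `None`; the toy matchers, the placeholder used as the unreadable code, and the function names are all assumptions, and the actual matching methods named in the source are not implemented.

```python
from typing import Callable, Hashable, List, Optional, Sequence

Matcher = Callable[[Hashable], Optional[str]]

UNREADABLE = "\ufffd"  # assumed placeholder for the "unreadable" code


def determine_code(char_image: Hashable, matchers: Sequence[Matcher]) -> str:
    """Return the code from the first matcher that succeeds, else UNREADABLE."""
    for matcher in matchers:
        code = matcher(char_image)
        if code is not None:
            return code
    return UNREADABLE


def determine_line(char_images: Sequence[Hashable], matchers: Sequence[Matcher]) -> List[str]:
    """Apply the cascade to every character of one line, in order."""
    return [determine_code(image, matchers) for image in char_images]


# Toy stand-ins: the primary matcher knows one image, the secondary another.
primary: Matcher = lambda image: {"img-a": "a"}.get(image)
secondary: Matcher = lambda image: {"img-b": "b"}.get(image)

codes = determine_line(["img-a", "img-b", "img-?"], [primary, secondary])
```

Ordering the matchers from cheapest to most expensive keeps the common case fast while still giving difficult characters a second chance.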

This processing is carried out in order through the last character of the line, completing code determination for one line.

Character font discrimination and character code determination are performed line by line in this way until the whole document has been read. When the character data that was read is displayed or printed, a mark indicating an unreadable character is displayed or printed wherever an unreadable code occurs.

The function of automatically discriminating the font using a plurality of character font dictionaries and reading the characters, without the font of the printed type used in the document being specified, is hereinafter called "multi-font".

Handwriting style discrimination and character code determination for handwriting are performed in much the same way as for printed type, except that pattern matching also makes use of the field length and character type information shown in FIG. 8.

The method of creating the dictionary for idiosyncratic handwriting used there is described later.

The character font discrimination and the handwriting style discrimination in FIG. 4 can also be performed together as a single typeface discrimination process.

[Dictionary creation and maintenance according to this invention]

Next, the method of creating and maintaining the dictionaries used in the optical character reader described above is explained.

In the document processing apparatus having the optical character reader function shown in FIGS. 2 and 3, when "dictionary creation and maintenance" is commanded by key input from the keyboard 1, the utility selection processing shown in the flowchart of FIG. 11 is started and a main menu listing the kinds of processing is displayed.

When a selection is made by key input, it is identified and one of the following is carried out: "single dictionary creation and maintenance utility", "multi-font dictionary creation and maintenance utility", "dictionary file name list", "handwriting dictionary creation and maintenance utility", or "end".

<Single dictionary creation and maintenance utility>

This is a program for creating a single dictionary for printed type used in character recognition. As shown in FIG. 12, it has the functions of registering a file name for each dictionary file, registering and adding characters, deleting characters, and printing a list of registered characters, each of which reads from and writes to the dictionary file.

"File name registration" is executed according to the flowchart shown in FIG. 13; it allocates the file area and registers the file name in the directory.

"Character registration/addition" is executed according to the flowchart shown in FIG. 14 and is the core of the character creation function.

Here, the number of characters in the line and the reading darkness level are entered. When permission to proceed is given and the document is set, the scanner reads the one line of characters, image processing overlays or averages the patterns of the individual characters to create a single character pattern, and that character pattern or any one character of the line is displayed on the CRT.

When the operator looks at the pattern and enters the corresponding character with the character keys, the character code and the data of the displayed character pattern are associated with each other and written to the dictionary file.

This character registration is explained in more detail later.

"Character deletion" is executed according to the flowchart of FIG. 15 and deletes registered characters from the dictionary file.

"Registered character list printing" is executed according to the flowchart of FIG. 16; it outputs the characters registered in the dictionary file to the CRT 3 or the printer 4 (FIGS. 2 and 3) for display or printing.

<Multi-font dictionary creation and maintenance utility>

This is a program for creating and maintaining the file in which the names of the dictionary files to be used are registered, so that the multi-font function can recognize characters without a font being specified, as described above.

As shown in FIG. 17, this program consists of processing programs for multi-font file name registration, dictionary file name registration, dictionary file name deletion, printing of the characters registered in a dictionary file, dictionary file name addition, and dictionary file name swapping.

Each of these processes is executed according to the flowcharts shown in FIGS. 18 to 23, respectively.

In this example, up to six dictionary file names can be registered in the multi-font file, and the registered dictionary file names are prioritized as described above.

For example, when all the dictionary file names are first registered, they are prioritized in the order of registration. As they are used, the priorities change as described above. When a dictionary file name is added, that dictionary is given the lowest priority, and when a dictionary file name is deleted, the dictionary file names with lower priority than the deleted one are each moved up and the priorities reassigned.

Priorities are reassigned in the same way when dictionary file names are swapped.
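The registration rules above map naturally onto an ordered list in which position encodes priority. The class below is a minimal sketch; its name, its methods, and the choice to raise `ValueError` when a seventh name is added are assumptions, and swapping is omitted because the source does not describe it in detail.

```python
from typing import Iterable, List

MAX_ENTRIES = 6  # the example in the text allows six dictionary file names


class MultiFontFile:
    """Ordered list of dictionary file names; index 0 is the highest priority."""

    def __init__(self, names: Iterable[str] = ()) -> None:
        self.names: List[str] = []
        for name in names:          # initial registration: priority = registration order
            self.add(name)

    def add(self, name: str) -> None:
        """A newly added dictionary receives the lowest priority."""
        if len(self.names) >= MAX_ENTRIES:
            raise ValueError("multi-font file is full")
        self.names.append(name)

    def delete(self, name: str) -> None:
        """Deleting an entry moves every lower-priority entry up by one."""
        self.names.remove(name)

    def priority_of(self, name: str) -> int:
        """1-based priority of a registered dictionary file name."""
        return self.names.index(name) + 1


fonts = MultiFontFile(["A", "B", "C"])
fonts.add("D")
fonts.delete("B")
```

With a list, "moving up" after a deletion needs no extra bookkeeping, because the remaining entries close the gap automatically.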

<Dictionary file name list>

This is a program that displays a list of the file names of the character recognition dictionaries on the current disk (the disk being worked on), and is executed according to the flowchart shown in FIG. 24.

In this example, eight dictionary file names can be displayed on one screen (one page). When the total number of registered dictionary files is eight or more, pressing the N (next) key displays the dictionary file names on the next page, and pressing the B (back) key returns to the dictionary file names on the previous page. Pressing the E (end) key ends this processing.
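The paging behaviour can be sketched as follows. The page size of eight and the N, B, and E keys come from the text; ignoring N on the last page and B on the first page is an assumption, as are the function names and file names.

```python
from typing import List, Sequence

PAGE_SIZE = 8  # dictionary file names per screen


def page(names: Sequence[str], index: int) -> List[str]:
    """Return the names shown on page `index` (0-based)."""
    return list(names[index * PAGE_SIZE:(index + 1) * PAGE_SIZE])


def browse(names: Sequence[str], keys: str) -> List[List[str]]:
    """Return every page displayed, in order, for a sequence of key presses."""
    last = max(0, (len(names) - 1) // PAGE_SIZE)
    index = 0
    shown = [page(names, index)]
    for key in keys:
        if key == "E":                      # end the listing
            break
        if key == "N" and index < last:     # next page
            index += 1
        elif key == "B" and index > 0:      # previous page
            index -= 1
        else:
            continue                        # key ignored (assumed behaviour)
        shown.append(page(names, index))
    return shown


names = [f"DICT{i:02d}" for i in range(10)]
pages = browse(names, "NBE")
```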

<Handwriting dictionary creation and maintenance utility>

This is a program for creating a handwriting dictionary used in character recognition. As shown in FIG. 25, it has the functions of registering a file name for a dictionary file, registering and adding characters, deleting characters, and printing a list of registered characters.

These functions are the same as those of the single dictionary creation and maintenance utility shown in FIG. 12, and the flowcharts of FIGS. 26 to 29 showing each process are substantially the same as the processes of FIGS. 13 to 16 in that utility.

However, the character deletion processing shown in FIG. 28 includes a "deletion character type input" step, which allows the kind of characters that may be deleted to be specified (three choices: printed only, handwritten only, or both).

Next, the method of creating a dictionary for printed type or handwriting is explained concretely with reference to FIG. 30 and the subsequent figures.

When a desired character (including symbols and the like) is registered in a dictionary, a document in which that same character is written many times in one line is scanned by the image scanner, the image data is captured, and its horizontal projection is taken.

As shown in FIG. 31, the projection is taken in order to view each character from the horizontal direction (the direction of arrow H) perpendicular to the scanning direction S, to determine the start of the character (where a white area changes to the black area that is the shadow of the character) and the end of the character (where the black area changes back to a white area), and thereby to determine the character height and cut out the line.
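As a rough illustration of this row-wise (horizontal) projection, the sketch below counts black dots per row of a binary image and reports where the projection changes from white to black and back. The image representation (lists of 0/1 values) and the function names are assumptions made for the example.

```python
def horizontal_projection(image):
    """Count black dots (1s) in each row of a binary image."""
    return [sum(row) for row in image]

def find_runs(profile):
    """Return (start, end) row ranges where the projection is non-zero.

    start is the first black row after white; end is exclusive.
    """
    runs, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                 # white -> black: character begins
        elif count == 0 and start is not None:
            runs.append((start, y))   # black -> white: character ends
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

# A tiny "i": a dot in row 1 and a body in rows 3-4
image = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
profile = horizontal_projection(image)
runs = find_runs(profile)
```

With no tolerance for white rows, the dot and the body of the "i" come out as two separate runs, which is the problem the following paragraphs address.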

This horizontal projection is given a certain height in advance so that, for example, when a slightly faded character is read, the point where the projection disappears at the faded part is not mistaken for the end of the character. Therefore, if this initial height is made large, characters consisting of vertically separated parts, such as "i", "j", or ":", can be judged to be a single character.

However, if this is done, then when a small character is to be registered, part of another adjacent character may be judged to belong to the same character, so the height can be set only within the minimum necessary range.
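The trade-off described here can be illustrated with a small sketch that merges black row ranges separated by only a few white rows. The `max_gap` parameter stands in for the preset height of the projection; the name and the sample ranges are assumptions.

```python
def merge_runs(runs, max_gap):
    """Merge (start, end) row ranges separated by at most max_gap white rows."""
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)  # bridge the white gap
        else:
            merged.append((start, end))
    return merged

# Dot of an "i", body of the "i", then part of a neighbouring small character
runs = [(1, 2), (3, 5), (9, 12)]

small_tolerance = merge_runs(runs, 1)  # joins only the dot and the body
large_tolerance = merge_runs(runs, 4)  # also swallows the neighbour
```

A small tolerance joins the two parts of the "i" but keeps the neighbour separate, while a large tolerance merges everything into one range, which is the misjudgment the text warns about.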

Therefore, in the example described below, as shown in FIG. 32, the character to be registered (in the illustrated example, "i") is written, for example, ten times on one line of the document at intervals in the horizontal direction, and a mark M (a vertical line in this example) indicating the height of the characters in this line is added at a position slightly apart from the last character.

When this document is scanned by the image scanner and its image data is captured, the span from when the mark M is first detected until it is no longer detected can be accurately judged to be the height of one character, and the line can be cut out accordingly. As a result, even a character consisting of two vertically separated parts, such as "i" or "j", can be correctly cut out as the pattern data of a single character.

Even in the case of a small character, adding a mark M that matches the height of the character allows only the pattern data of that character to be correctly cut out.

In addition, when handwritten characters are registered, this makes it possible to register even peculiar characters, for example an "i" whose dot is written too far from its body, without restriction.

To prevent dust or dot-like noise from being mistaken for the mark, it is desirable to make the mark M somewhat thick so that black-level data several dots wide is obtained during each horizontal scan.

The process of registering characters in a dictionary using this method will be explained with reference to the flowchart of FIG. 30.

As shown in FIG. 32, a document in which the character to be registered is written ten times on one line (printed or handwritten), with a mark M indicating the character height added slightly apart from the last character, is set in the scanner. When "reading" in FIG. 14 or FIG. 27 is started, the process of FIG. 30 starts.

First, the scanner is started and scanning of the document begins. As long as a black level of at least a predetermined number of dots (smaller than the number of dots obtained when the mark M is detected) is not detected in one horizontal scan, it is judged that a white part (space) of the document is being scanned, and scanning continues without any further action.

When a black level of at least the predetermined number of dots is detected in one horizontal scan, it is judged that the black part of the mark M has been detected, and cutting out of one line of image data starts. The cutting out continues until the black part is no longer detected, at which point the cutting out of the line ends.
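The mark-driven line cut can be sketched as below. This is one reading of the description, under the assumption that "a black level of at least a predetermined number of dots" means the count of black dots in a scan line; the function name, the image layout, and the threshold value are illustrative only.

```python
def cut_line_by_mark(image, min_dots):
    """Return (top, bottom) rows of the first line delimited by the mark.

    A scan line counts as 'mark present' when it holds at least
    min_dots black dots; bottom is exclusive. Returns None if no
    scan line reaches the threshold.
    """
    top = None
    for y, row in enumerate(image):
        mark_seen = sum(row) >= min_dots
        if mark_seen and top is None:
            top = y                # mark first detected: start cutting
        elif not mark_seen and top is not None:
            return top, y          # mark no longer detected: stop cutting
    if top is not None:
        return top, len(image)
    return None

# Columns 5-6 hold a two-dot-thick mark M over rows 1-5; column 1 holds an "i"
# (dot in row 1, body in rows 3-5); row 0 contains a one-dot noise speck.
image = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
with_threshold = cut_line_by_mark(image, 2)
without_threshold = cut_line_by_mark(image, 1)
```

With a threshold of two dots the speck in row 0 is ignored and the line spans the whole "i", including the white row between its dot and body. With a threshold of one dot the speck is wrongly taken as the start of the line.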

Characters are then cut out from the one line of image data, and the character pattern (dot pattern) data of each character contained in that line (in this example, ten instances of the same character) is extracted.
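The text does not say how the individual characters are cut out of the line. One common approach, shown here purely as an assumption, is to split the line at blank columns and to treat the rightmost segment as the mark M.

```python
def cut_characters(line_image):
    """Split a line image into per-segment dot patterns at blank columns."""
    width = len(line_image[0])
    col_has_ink = [any(row[x] for row in line_image) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(col_has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, width))
    return [[row[a:b] for row in line_image] for a, b in spans]

# Two one-column "i" shapes followed by a two-column mark
line_image = [
    [1, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1],
]
segments = cut_characters(line_image)
characters = segments[:-1]  # assumption: the last segment is the mark M
```

Because the line height was already fixed by the mark, each extracted pattern keeps both parts of the "i" together.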

In the case of handwritten characters, it is desirable at this point to perform a smoothing process that corrects irregularities in the character pattern and a normalization process that slightly enlarges or reduces the entire character pattern to unify its size.
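The normalization step could look like the following sketch, which rescales a binary dot pattern to a fixed size by nearest-neighbour sampling. The sampling method and the target size are assumptions, since the text only says that the pattern is enlarged or reduced; smoothing is not covered here.

```python
def normalize(pattern, out_h, out_w):
    """Rescale a binary dot pattern to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(pattern), len(pattern[0])
    return [
        [pattern[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

small = [
    [1, 0],
    [0, 1],
]
enlarged = normalize(small, 4, 4)
```

Bringing every extracted pattern to the same size is what allows the patterns to be compared dot by dot afterwards.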

Next, a superposition process is performed in which the data ("1" or "0") of the corresponding dots of the character patterns are ORed and overlaid. If, at a given dot position, the number of patterns with black-level data is equal to or less than a preset number, that dot is regarded as white level. This removes the influence of noise and, in the case of handwritten characters, reduces the influence of variations in writing style, so that an averaged character pattern is obtained.
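A minimal sketch of this superposition, assuming all patterns have already been brought to the same size: each dot is black only if more than a preset number of the patterns are black there. The function and parameter names are illustrative; with a preset number of zero the result is a plain OR.

```python
def superimpose(patterns, max_white_count=0):
    """Overlay equal-sized dot patterns.

    A dot becomes black (1) only when the number of patterns that are
    black at that position exceeds max_white_count.
    """
    height, width = len(patterns[0]), len(patterns[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            black = sum(p[y][x] for p in patterns)
            row.append(1 if black > max_white_count else 0)
        result.append(row)
    return result

a = [[1, 0, 0], [1, 1, 0]]
b = [[1, 0, 0], [1, 0, 0]]
c = [[1, 0, 1], [1, 1, 0]]  # the top-right dot is noise seen in only one sample

plain_or = superimpose([a, b, c])                      # every black dot survives
denoised = superimpose([a, b, c], max_white_count=1)   # single-sample dots are dropped
```

Raising the preset number removes dots that appear in only a few samples, which is how noise and individual quirks are suppressed.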

The character pattern obtained in this way, or any one of the characters in the line, is displayed on the screen of the CRT 3, for example as shown in FIG. 33.

When the creator checks this display and enters the character corresponding to the character pattern ("i" in this example) with the character keys of the keyboard 1, a character code representing that character is generated, attached to the character pattern data obtained as described above, and registered in the character recognition dictionary file on the HDD 7.
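The registration step can be sketched as pairing a character code with the pattern data and storing the pair. The text does not give the code system or the file format, so the use of `ord` for the code and JSON for serialization are assumptions chosen only to keep the example runnable.

```python
import json

def register(dictionary, character, pattern):
    """Attach the character's code to the pattern and add it to the dictionary."""
    code = ord(character)  # stand-in for the device's own character code
    dictionary[code] = pattern
    return code

def dump(dictionary):
    """Serialize the dictionary; JSON is a hypothetical file format."""
    return json.dumps({str(code): pat for code, pat in dictionary.items()})

dictionary = {}
pattern = [[0, 1, 0], [0, 0, 0], [0, 1, 0], [0, 1, 0]]  # a tiny "i"
code = register(dictionary, "i", pattern)
serialized = dump(dictionary)
```

Writing `serialized` to a file would correspond to saving the entry in the dictionary file on the hard disk.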

Instead of entering the character from the keyboard, it is also possible to register it by entering the character code directly.

With this dictionary creation method, not only printed characters but also peculiar handwritten characters can easily be registered in the dictionary.

A larger number of characters per line improves accuracy but lengthens the processing time for the character pattern data, so about ten characters is appropriate.

[Effects]

As explained above, according to the present invention, no dedicated device is needed to create the character recognition dictionary used in an optical character reader, and anyone can easily create and maintain the dictionary using the optical character reader itself. Moreover, even peculiar handwritten characters can be registered in the dictionary without any special restrictions.

Furthermore, since the character pattern created from the characters of the read line, or any one of the characters in that line, is displayed, the operator can easily recognize the character.

[Brief explanation of the drawing]

FIG. 1 is a flow diagram showing the procedure of the dictionary creation method according to the present invention.
FIG. 2 is an external perspective view of a document processing system that is one embodiment of the present invention.
FIG. 3 is a block diagram of the same system.
FIG. 4 is a flow diagram showing its operation related to character reading.
FIGS. 5 to 8 are explanatory diagrams accompanying the explanation of the operation in FIG. 4.
FIG. 9 is a flow diagram showing details of the character font discrimination process in FIG. 4.
FIG. 10 is a flow diagram showing details of the character code determination process in FIG. 4.
FIGS. 11 to 30 are flow diagrams for explaining various processes related to the creation and maintenance of the character recognition dictionary.
FIGS. 31 to 33 are explanatory diagrams accompanying the explanation of the dictionary registration process of FIG. 30.

A: document; 1: keyboard; 2: image scanner; 3: CRT display device; 4: printer; 5: main unit; 6: floppy disk device; 7: hard disk device; 10: control unit (CPU)

Claims

[Claims]

1. A dictionary creation method for an optical character reader that scans a document with a scanner, captures image information including characters as image data, recognizes characters from the image data, and converts them into character code data, the method being characterized in that a document bearing characters in a line together with a mark indicating the height of the characters in that line is scanned by the scanner to capture its image data, the dot patterns of the characters within the height defined by the mark are superimposed or averaged to create one character pattern, any character in the line is then displayed, and a character code corresponding to the character pattern is assigned and registered in a dictionary file for character recognition.