JPH0514952B2 - Character cutting-out device - Google Patents

Publication number
JPH0514952B2
JPH0514952B2 JP58070911A JP7091183A JPH0514952B2 JP H0514952 B2 JPH0514952 B2 JP H0514952B2 JP 58070911 A JP58070911 A JP 58070911A JP 7091183 A JP7091183 A JP 7091183A JP H0514952 B2 JPH0514952 B2 JP H0514952B2
Authority
JP
Japan
Prior art keywords
character
characters
character string
image data
cutting device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP58070911A
Other languages
Japanese (ja)
Other versions
JPS59197971A (en)
Inventor
Teruo Akiyama
Seiichiro Naito
Isao Masuda
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP58070911A
Publication of JPS59197971A
Publication of JPH0514952B2
Granted

Description

[Detailed Description of the Invention] (Field of the Invention) The present invention relates to a character cutting device for OCR (Optical Character Recognition) equipment, which reads characters on a document directly by machine. The device can efficiently cut out individual characters, symbols, and the like even from a character string that contains, in addition to characters arranged at a nearly constant pitch, runs of consecutive alphanumeric characters, symbols, and the like whose pitch is only about half that of ordinary characters.

(Prior Art) In conventional methods, when alphanumeric characters, symbols, and the like with a pitch of about half the ordinary character pitch appear consecutively in a character string, several of them are cut out together as a single character, so the cutting process does not work correctly. A method that uses information from the character recognition device to redo the cutting of rejected characters has also been considered. Because that method performs recognition on every character, however, the recognition dictionary becomes large, or the overall system becomes large for a character cutting device.

(Object of the Invention) To solve these drawbacks, the present invention provides a recognition dictionary that contains only characters whose pitch differs from that of ordinary characters, such as alphanumeric characters and symbols, or only such characters together with separated characters. From the figures that have once been cut out as characters, only those that can be separated into a plurality of partial figures are extracted. Recognition processing is applied to each partial figure to determine whether it is rejected, and based on that result the device determines whether the figure cut out as a character can be separated into a plurality of alphanumeric characters, symbols, and the like.

(Structure and Operation of the Invention) FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention. In the figure, 1 is a character string image data storage device that reads in, over signal line a, the image data of a character string extracted from a document image and stores it. 2 is a primary character cutting device that, based on the estimated pitch input over signal line b, reads the character string image data stored in the character string image data storage device 1 over signal line c and performs primary cutting of the characters. 3 is a secondary character cutting device that, based on the primary cutting result input from the primary character cutting device 2 over signal line d, reads the character string image data stored in the character string image data storage device 1 over signal line e, performs the necessary character recognition using the standard character patterns read from the recognition processing dictionary 4 over signal line f, and corrects the primary cutting result. The secondary cutting result is output over signal line g. Each device is described below.

The character string image data storage device 1 is an image memory that stores the image data for one character string extracted by an ordinary OCR or by, for example, the separately filed Japanese Patent Application No. Sho 55-126845, "Two-Dimensional Character Area Extraction Apparatus". The primary character cutting device 2 cuts out individual characters based on an estimated character pitch input from outside, and can be realized by, for example, the separately filed Japanese Patent Application No. Sho 56-74015, "Character Cutting Device". The estimated character pitch may be entered directly as a numerical value. It can also be estimated easily by using the property that, for example, kanji have a height-to-width ratio of approximately 1: the character string is projected along the line direction and the width of the character string is obtained.
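The projection-based pitch estimate described above can be sketched as follows. This is a minimal illustration, not the method of the referenced applications: the function name `estimate_pitch`, the list-of-rows binary image format, and the choice to ignore inter-character spacing are all assumptions made for the example.

```python
def estimate_pitch(line_image):
    """Estimate the character pitch of one horizontally written line.

    line_image: list of rows, each a list of 0 (white) / 1 (black) pixels.
    (Hypothetical format chosen for this sketch.)
    """
    # Project along the line direction: a row counts as inked
    # if it contains at least one black pixel.
    inked_rows = [y for y, row in enumerate(line_image) if any(row)]
    if not inked_rows:
        return 0
    # The extent of the inked rows is the width of the character string.
    # Assumption taken from the text: kanji are roughly square, so this
    # extent approximates the character pitch.
    return inked_rows[-1] - inked_rows[0] + 1


example_line = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
estimated = estimate_pitch(example_line)
```

For vertical writing the same idea would apply to columns instead of rows.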

FIG. 2 is an explanatory diagram of a character string image array and the results extracted by the primary and secondary character cutting devices. Part (1) shows the character string image array, and parts (2) and (3) show examples of the extraction results of the primary character cutting device 2 and the secondary character cutting device 3, respectively.

In FIG. 2(2), the digits "34" and the symbols ")。" of FIG. 2(1) are each extracted as a single character. Each of these should properly be cut out as two separate characters, but because the individual digits and symbols are only about half the size of an ordinary character and appear consecutively, two digits or two symbols were cut out together. This is a case that conventional OCR has difficulty handling. Conversely, the character "門" is cut out correctly because the cutting uses the character pitch estimated in advance.
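The merging behaviour described for FIG. 2(2) can be reproduced with a toy fixed-pitch cutter. The sketch below is only a hypothetical stand-in for the primary character cutting device 2 (whose actual method is in a separate application); `primary_cut` and the one-dimensional "column has ink" representation are assumptions for illustration.

```python
def primary_cut(column_has_ink, pitch):
    """Naive fixed-pitch cutting over a column projection.

    column_has_ink: list of booleans, one per pixel column of the line.
    pitch: estimated character pitch in columns.
    Returns a list of (start, end) column ranges, end exclusive.
    """
    cuts = []
    n = len(column_has_ink)
    x = 0
    while x < n:
        if not column_has_ink[x]:
            x += 1          # skip blank columns between characters
            continue
        end = min(x + pitch, n)
        # Trim trailing blank columns inside the pitch window.
        while end > x and not column_has_ink[end - 1]:
            end -= 1
        cuts.append((x, end))
        x = min(x + pitch, n)
    return cuts


# One full-width character, two half-width digits, one full-width character.
row = "#######." + "###.###." + "#######."
columns = [c == "#" for c in row]
cuts = primary_cut(columns, 8)
```

With a pitch of 8 columns the two half-width figures fall inside one window and come out as a single cut, which is the failure that the secondary cutting stage is meant to correct.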

The secondary character cutting device 3 reads the output of the primary character cutting device 2 and extracts, from the figures cut out as characters, those that can be separated into a plurality of partial figures. It then performs recognition processing on the individual partial figures, or on the whole figure and the individual partial figures, and determines whether the figure cut out as a character is a single character as it stands or can be decomposed into a plurality of figures. Whether a character is separated can be determined, for example, by projecting in the vertical direction of the character string (for horizontal writing) or in the horizontal direction (for vertical writing) and checking whether the central part of the figure extracted as a character by the primary character cutting device 2 contains a portion with no black pixels.
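The separation check can be sketched on a column projection of a single cut-out figure. The names `has_central_gap` and `split_at_gaps`, and the `margin` parameter that defines what counts as the "central part", are assumptions for this example; the text does not specify how the central region is delimited.

```python
def has_central_gap(column_has_ink, margin=0.25):
    """True if a blank column exists in the central part of the figure.

    column_has_ink: booleans for the columns of one cut-out figure.
    margin: fraction of the width excluded at each side (assumed value).
    """
    n = len(column_has_ink)
    lo = int(n * margin)
    hi = n - lo
    return any(not column_has_ink[x] for x in range(lo, hi))


def split_at_gaps(column_has_ink):
    """Return (start, end) ranges of consecutive inked columns (end exclusive)."""
    runs, start = [], None
    for x, ink in enumerate(column_has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(column_has_ink)))
    return runs


merged = [c == "#" for c in "###.###"]   # two half-width figures
solid = [c == "#" for c in "#######"]    # one connected figure
```

Only figures for which the check succeeds need to go on to recognition, which is what keeps the amount of recognition work small.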

FIG. 2(3) shows the result of cutting by the secondary character cutting device 3. The partial figures "〓" and "〓" are rejected by the recognition processing, so the "門" cut out by the primary character cutting device 2 is adopted as it is. The figures "3", "4", ")", and "。" are recognized as digits and symbols, so the result of the primary character cutting device 2 is corrected and they are cut out as separate characters.
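The decision rule illustrated by FIG. 2(3) can be sketched as below. This is a simplified reading of the text, covering only the variant that recognizes the individual partial figures: `secondary_cut`, `toy_recognize`, and the use of strings in place of image data are all hypothetical choices for the example.

```python
def secondary_cut(parts, recognize):
    """Decide whether one primary cut is a single character or several.

    parts: partial figures of one primary cut (any objects).
    recognize: function returning a label, or None when the figure is rejected.
    Returns a list of groups; each group is cut out as one character.
    """
    if len(parts) < 2:
        # Not separable: keep the primary result unchanged.
        return [list(parts)]
    labels = [recognize(p) for p in parts]
    if all(label is not None for label in labels):
        # Every partial figure matched the dictionary: cut them separately.
        return [[p] for p in parts]
    # At least one partial figure was rejected: keep the primary result.
    return [list(parts)]


# Toy dictionary holding only narrow-pitch digits and symbols (assumption).
NARROW_SYMBOLS = {"3", "4", ")", "。"}


def toy_recognize(part):
    return part if part in NARROW_SYMBOLS else None


digits = secondary_cut(["3", "4"], toy_recognize)
gate = secondary_cut(["left half", "right half"], toy_recognize)
```

The variant in which the whole figure is also recognized against registered separated characters is not covered by this sketch.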

The recognition processing dictionary 4 holds the standard patterns of the characters used in the recognition processing of the secondary character cutting device 3. Only standard patterns for characters whose pitch differs from that of ordinary characters, such as alphanumeric characters and symbols, are registered, so the dictionary capacity can be made extremely small. By also registering standard patterns for separated characters such as "門", recognition can be performed not only on the individual partial figures but also on the figure of the whole separated character, which can improve accuracy. Characters whose constituent figures are completely separated, such as "門" and "兆", are few, so even in this case far less dictionary capacity is needed than for holding standard patterns of all characters. The standard patterns registered in the recognition processing dictionary 4 may be written in advance to a ROM or the like, or they may be learned progressively during cutting through interactive processing with a human operator.
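As a rough illustration of recognition against a small dictionary of standard patterns with rejection, the sketch below uses nearest-template matching by Hamming distance. The text does not state which matching method the device uses, so the distance measure, the `max_distance` threshold, the 3x3 toy patterns, and the names `hamming` and `recognize` are all assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def recognize(pattern, dictionary, max_distance):
    """Return the best-matching label, or None when the figure is rejected.

    pattern: flattened binary image as a string of '#' and '.'.
    dictionary: mapping from label to standard pattern of the same length.
    max_distance: largest distance still accepted (assumed reject rule).
    """
    best_label, best_dist = None, None
    for label, template in dictionary.items():
        if len(template) != len(pattern):
            continue
        d = hamming(pattern, template)
        if best_dist is None or d < best_dist:
            best_label, best_dist = label, d
    if best_dist is None or best_dist > max_distance:
        return None
    return best_label


# Toy 3x3 standard patterns for two narrow characters.
STANDARD_PATTERNS = {
    "1": ".#." ".#." ".#.",
    "7": "###" "..#" "..#",
}

clean = recognize(".#..#..#.", STANDARD_PATTERNS, 1)
noisy = recognize(".#..#..##", STANDARD_PATTERNS, 1)
unknown = recognize("###.#.###", STANDARD_PATTERNS, 1)
```

Because only narrow-pitch characters (and optionally a few separated characters) are registered, ordinary characters are expected to fall outside the threshold and be rejected.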

(Effects) As described above, this device extracts, from the figures cut out of a character string as characters, only those whose figure is separated, and performs recognition processing on the separated individual partial figures, or on the individual partial figures and the whole figure. As a result, individual characters can be cut out efficiently even from character strings that contain, in addition to ordinary characters arranged at a nearly constant pitch, consecutive alphanumeric characters and the like whose pitch is only about half that of ordinary characters. The dictionary required for recognition processing can also be extremely small. Moreover, the method used in this device is clearly applicable not only to correcting a result that has already been cut out, but also during the process of cutting out characters one by one from the end of a character string.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention, and FIG. 2 is an explanatory diagram of a character string image array and the results extracted by the primary and secondary character cutting devices.
1: character string image data storage device; 2: primary character cutting device; 3: secondary character cutting device; 4: recognition processing dictionary; a: signal line for transferring character string image data; b: signal line for inputting the estimated character pitch required for primary character cutting; c: signal line by which the primary character cutting device reads the character string image data; d: signal line for outputting the primary character cutting result; e: signal line by which the secondary character cutting device reads the character string image data; f: signal line by which the secondary character cutting device reads the recognition processing dictionary; g: output signal line.

Claims (1)

[Claims] 1. A character cutting device characterized by comprising: a character string image data storage device that receives the image data of a character string portion in a document image and stores the image of that character string; a primary character cutting device that performs character cutting on the character string image data stored in the character string image data storage device according to an estimated character pitch input from outside; a secondary character cutting device that extracts, from the figures cut out as characters by the primary character cutting device, only those whose figure is separated and can be decomposed into two or more partial figures, performs recognition processing on each partial figure, or on each partial figure and the whole figure, and determines whether the figure cut out by the primary character cutting device is to be cut out as a single character or as a plurality of characters; and a recognition processing dictionary in which the standard patterns of the characters used in the recognition processing of the secondary character cutting device are registered; whereby individual characters are cut out from a character string in which, in addition to ordinary characters arranged at a nearly constant pitch, alphanumeric characters, symbols, and the like having a pitch of about half that of ordinary characters appear consecutively.
JP58070911A 1983-04-23 1983-04-23 Character cutting-out device Granted JPS59197971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP58070911A JPS59197971A (en) 1983-04-23 1983-04-23 Character cutting-out device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP58070911A JPS59197971A (en) 1983-04-23 1983-04-23 Character cutting-out device

Publications (2)

Publication Number Publication Date
JPS59197971A (en) 1984-11-09
JPH0514952B2 (en) 1993-02-26

Family

ID=13445165

Family Applications (1)

Application Number Title Priority Date Filing Date
JP58070911A Granted JPS59197971A (en) 1983-04-23 1983-04-23 Character cutting-out device

Country Status (1)

Country Link
JP (1) JPS59197971A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61220081A (en) * 1985-03-27 1986-09-30 Hitachi Ltd Segmentation and recognition system for pattern
JPH07107700B2 (en) * 1987-04-28 1995-11-15 松下電器産業株式会社 Character recognition device
JPH02220188A (en) * 1989-02-22 1990-09-03 Nec Corp Character recognizing device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5210635A (en) * 1975-07-09 1977-01-27 Ibm Pattern separator

Also Published As

Publication number Publication date
JPS59197971A (en) 1984-11-09
