JPH04105178A

JPH04105178A - Document picture processor

Info

Publication number: JPH04105178A
Application number: JP22381390A
Authority: JP
Inventors: Naoki Kuwata; 直樹鍬田
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1990-08-24
Filing date: 1990-08-24
Publication date: 1992-04-07

Abstract

PURPOSE: To reduce storage capacity by separating and extracting a character area, a graphic area, and an image area from document information captured as image information, converting the information in each area into code data suited to that information, and then reconstructing and outputting the document. CONSTITUTION: The device comprises a picture input part 101, a picture memory 102, a preprocessing part 103, a feature extracting part 104, an area dividing part 105, a character area processing part 106, a graphic area processing part 107, an image area processing part 108, a reconstituting part 109, and a picture output part 110. An input document picture is separated into a character area, a graphic area, and an image area; characters are recognized in the character area, and characters and geometric graphics are recognized in the graphic area; each is converted into code data suited to its kind of information, and the document is stored. As a result, only a small storage capacity is needed to store the document.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a document image processing device that separates and extracts character areas, diagram areas, and image areas from document information captured as image information, converts the information in each area into code data suited to that information, and then reconstructs and outputs the document.

[Prior Art] In general, when document information written on paper is to be stored, an image captured with an image scanner or the like is handled as image information as-is and saved to an external storage device or the like. Alternatively, when only the character areas are extracted, recognized, and stored as character codes, the user has had to specify the character areas in the input document image by hand.

[Problems to be Solved by the Invention] As described above, storing documents as image information requires an enormous storage capacity and does not allow editing such as rewriting part of a document. The approach that relies on area specification requires a person to be present at all times to give instructions, which is laborious.

The present invention is intended to solve the above problems. Its object is to provide a device that, without relying on manual operation, separates an input document image into a character area, a diagram area, and an image area, recognizes characters in the character area, recognizes characters and geometric figures in the diagram area, converts each into code data suited to that kind of information, and stores the document.

[Means for Solving the Problems] The document image processing device of the present invention is characterized by comprising: an image input unit that captures, as image information, a document containing at least one character, diagram, or image area; a preprocessing unit that removes noise from the image information captured by the image input unit and binarizes it; a feature extraction unit that extracts the features of the character, diagram, and image areas in the image information; an area division unit that divides the image information into a character area, a diagram area, and an image area based on the features extracted by the feature extraction unit; a character area processing unit that recognizes characters in the character area; a diagram area processing unit that recognizes geometric figures and characters in the diagram area; an image area processing unit that processes the image information in the image area; a reconstruction unit that reconstructs the data from each of these processing units; and an image output unit that outputs the reconstructed document information.

[Embodiment] The present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of the document image processing device of the present invention. Reference numeral 101 denotes an image input unit that captures a document image as image information, using a scanner, a camera, or the like. When an optical disk or similar medium on which images are already stored is used, the input unit is the corresponding playback device. Reference numeral 102 denotes an image memory that temporarily stores the captured image information. Reference numeral 103 denotes a preprocessing unit that removes noise from the image information and binarizes it. Isolated noise is removed with a median filter or the like. If the input image is skewed, the skew angle is also corrected at this stage to simplify later processing. Reference numeral 104 denotes a feature extraction unit that extracts the feature quantities used to separate the character, diagram, and image areas. The feature extraction method is described later.
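The noise removal and binarization performed by preprocessing unit 103 can be illustrated with a minimal sketch. The patent names only "a median filter or the like"; the 3 × 3 window, the gray-level threshold of 128, the border handling, and the function names below are assumptions made for illustration, and skew correction is not shown.

```python
from statistics import median

def median_filter_3x3(gray):
    """Apply a 3x3 median filter to a 2-D list of gray levels (border pixels are copied)."""
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [gray[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out

def binarize(gray, threshold=128):
    """Map gray levels below the (assumed) threshold to 1 (black), others to 0 (white)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

# A white page (gray level 255) with one isolated dark speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
clean = binarize(median_filter_3x3(page))
```

Without the filter the speck would survive binarization as a single black pixel; with it, the page comes out entirely white.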

Reference numeral 105 denotes an area division unit that divides the input image information into a character area, a diagram area, and an image area based on the feature quantities extracted by the feature extraction unit 104. Reference numeral 106 denotes a character area processing unit that, within each character area obtained by the division, extracts character strings, cuts out individual characters, and recognizes the characters that were cut out.

Reference numeral 107 denotes a diagram area processing unit that extracts the geometric figures and characters in the diagram area and then recognizes them. Reference numeral 108 denotes an image area processing unit that either keeps the portion judged to be an image area as image data or compresses it.

Reference numeral 109 denotes a reconstruction unit that lays out the areas converted into code data and the areas kept as image data on the page. Reference numeral 110 denotes an image output unit that outputs the reconstructed document information; specifically, this is a printing device, a display device, or an external storage device.

Next, the method of extracting features from the input image information is described. First, the input image is divided into groups of m pixels vertically by n pixels horizontally. The number of black pixels in each group (m × n pixels) is then counted. This operation is performed over the entire input image.
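The block counting described above can be sketched as follows. The function name and the handling of partial blocks at the image edge (they are simply counted as smaller blocks) are assumptions; the text does not say how edges are treated.

```python
def count_black_per_block(binary, m, n):
    """Return a grid of black-pixel counts, one entry per block of m rows by n columns.

    `binary` is a 2-D list in which 1 is a black pixel and 0 is a white pixel.
    """
    rows, cols = len(binary), len(binary[0])
    counts = []
    for top in range(0, rows, m):
        row_counts = []
        for left in range(0, cols, n):
            block = [binary[y][left:left + n] for y in range(top, min(top + m, rows))]
            row_counts.append(sum(sum(r) for r in block))
        counts.append(row_counts)
    return counts

image = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
counts = count_black_per_block(image, 2, 2)
```

Each entry of `counts` is the density feature for one block; these values are what the later histogram and labeling steps operate on.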

FIG. 2 shows a histogram obtained by dividing the input image information into groups of m × n pixels and counting the black pixels in each group over the entire input image, with the density (number of black pixels) on the horizontal axis and its frequency of occurrence on the vertical axis. In general, a diagram area contains many white parts and has low density, whereas an image area contains many black parts and has high density.

The character area lies between the two. As shown in the figure, if the ranges separated by suitable thresholds (t1, t2, t3) are numbered 0, 1, 2, 3 in order of increasing density, range 0 is a blank area where nothing is written, 1 is a diagram area, 2 is a character area, and 3 is an image area. In this example, a document with a large character area was used as the sample, so the frequency of the portion corresponding to the character area is high.
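A minimal sketch of the threshold labeling follows. The threshold values, the treatment of counts that fall exactly on a threshold (assigned to the higher label), and the names `label_blocks` and `LABEL_NAMES` are illustrative assumptions; the patent says only that the thresholds are chosen appropriately.

```python
from bisect import bisect_right
from collections import Counter

LABEL_NAMES = {0: "blank", 1: "diagram", 2: "character", 3: "image"}

def label_blocks(counts, thresholds):
    """Assign each block the number of thresholds that its black-pixel count has reached."""
    t = sorted(thresholds)
    return [[bisect_right(t, c) for c in row] for row in counts]

# Black-pixel counts for six blocks of 4 x 4 pixels (maximum count 16).
counts = [[0, 3, 9], [14, 8, 2]]

# Histogram as in FIG. 2: density on one axis, frequency on the other.
histogram = Counter(c for row in counts for c in row)

# Assumed thresholds t1 = 1, t2 = 6, t3 = 12.
labels = label_blocks(counts, (1, 6, 12))
```

In practice the thresholds would be read off the valleys of the histogram for the document at hand rather than fixed in advance.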

FIG. 3 shows an example in which an input document has been labeled in units of m × n pixels by the method shown in FIG. 2. The areas labeled 1 correspond to diagram areas, those labeled 2 to character areas, and those labeled 3 to image areas. Area division is performed by grouping blocks that carry the same label.
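The grouping of same-label blocks can be sketched as a connected-component search over the label grid. The patent does not state how blocks are grouped; 4-connectivity, the bounding-box output, and the function name are assumptions made for this sketch.

```python
from collections import deque

def group_regions(labels):
    """Group 4-connected blocks sharing a non-zero label.

    Returns a list of (label, (top, left, bottom, right)) in block coordinates.
    """
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or labels[r0][c0] == 0:
                continue  # already grouped, or a blank block
            lab = labels[r0][c0]
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            top, left, bottom, right = r0, c0, r0, c0
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and labels[nr][nc] == lab):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append((lab, (top, left, bottom, right)))
    return regions

labels = [
    [2, 2, 0, 3],
    [2, 2, 0, 3],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
regions = group_regions(labels)
```

Each resulting region can then be handed to the processing unit that matches its label.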

FIG. 4 is a block diagram showing the details of the character area processing unit 106. Reference numeral 41 denotes a character string extraction unit that extracts character strings within the character area; 42, a character extraction unit that cuts individual characters out of an extracted character string; and 43, a character recognition unit that recognizes the extracted characters by referring to a character recognition dictionary 44. Characters that could not be recognized here are passed on as image data to the word matching unit 45. The word matching unit refers to a word dictionary 46 to check whether the recognized characters form a meaningful word; if a meaningless word has resulted from misrecognition, it is converted to the correct word where correction is possible. Characters that the character recognition unit could not recognize are also determined here when they can be resolved by referring to the word dictionary. Characters that the word matching unit cannot determine are left as images. Reference numeral 47 denotes a character encoding unit that encodes the recognized characters. Reference numeral 48 denotes a page format addition unit that adds the information needed to reconstruct the encoded character strings on the page. For example, as shown in FIG. 5, take the upper left corner of the page as the origin, the X axis in the vertical direction, and the Y axis in the horizontal direction, and suppose that a character string exists in the area bounded by (x1, y1) and (x2, y2).

In this case, the page format describing this area is, for example, as shown in FIG. 6(a). In this example, "Char" indicates that the area is a character area, and the remaining fields indicate, respectively, that the character area lies within the rectangle bounded by (x1, y1) and (x2, y2), that the typeface is Mincho, that the character size is 10 points, and that the recognized character string is the content shown as (@@ ... @@).

Based on this information, the contents of the character area are output to a printing device or display device, or saved to an external storage device.
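The fields carried by the page format for a character area can be modeled as a small record. The exact syntax of FIG. 6(a) is not reproduced in the text, so the class name `CharRecord`, its field names, and the one-line serialization below are hypothetical; only the set of fields (area type, bounding rectangle, typeface, size, recognized string) comes from the description.

```python
from dataclasses import dataclass

@dataclass
class CharRecord:
    """Hypothetical page-format entry for one character area."""
    x1: int
    y1: int
    x2: int
    y2: int
    typeface: str
    size_pt: int
    text: str

    def to_line(self):
        # Illustrative serialization only; not the layout used in FIG. 6(a).
        return (f'Char ({self.x1},{self.y1}) ({self.x2},{self.y2}) '
                f'{self.typeface} {self.size_pt}pt "{self.text}"')

record = CharRecord(10, 20, 60, 200, "Mincho", 10, "sample")
line = record.to_line()
```

A record of this kind occupies a few dozen bytes regardless of the pixel dimensions of the area, which is the source of the storage saving the invention aims at.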

FIG. 7 is a block diagram showing the details of the diagram area processing unit 107. Reference numeral 71 denotes a geometric figure extraction unit that extracts geometric figures in the diagram area, using methods such as the Hough transform and connected-component extraction of black pixels.
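As an illustration of the Hough transform named above, the following minimal sketch votes in (theta, rho) space and returns the strongest straight line. The 1-degree angular step, the rounding of rho to whole pixels, the conventional (x, y) point order, and the function name are assumptions; the patent does not describe its implementation.

```python
import math
from collections import Counter

def hough_strongest_line(points, theta_step_deg=1):
    """Return ((theta_deg, rho), votes) for the line with the most votes.

    A line is parameterized as rho = x*cos(theta) + y*sin(theta).
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes.most_common(1)[0]

# Black pixels on the line y = 3, plus two stray pixels.
points = [(x, 3) for x in range(50)] + [(7, 20), (31, 11)]
(theta_deg, rho), count = hough_strongest_line(points)
```

The stray pixels vote for other cells of the accumulator and do not affect the peak, which is why the method tolerates residual noise in the diagram area.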

Reference numeral 72 denotes a character extraction unit that extracts the characters contained in the diagram area, and 73 denotes a character recognition unit that recognizes the extracted characters by referring to a character recognition dictionary 74. Characters that could not be recognized here are passed on as image data to the word matching unit 75.

The word matching unit refers to a word dictionary 76 to check whether the recognized characters form a meaningful word; if a meaningless word has resulted from misrecognition, it is converted to the correct word where correction is possible. Characters that the character recognition unit could not recognize are also determined here when they can be resolved by referring to the word dictionary. Characters that the word matching unit cannot determine are left as images. Reference numeral 77 denotes a character encoding unit that encodes the recognized characters. Reference numeral 78 denotes a page format addition unit that adds the information needed to reconstruct the encoded character strings on the page. For example, as shown in FIG. 5, take the upper left corner of the page as the origin, the X axis in the vertical direction, and the Y axis in the horizontal direction, and suppose that a straight line connects (x3, y3) and (x4, y4). In this case, the page format describing this line is, for example, as shown in FIG. 6(b). In this example, "Line" indicates that the figure is a straight line, and the remaining fields indicate, respectively, that the line runs between (x3, y3) and (x4, y4), that the line type is solid, and that the line width is 0.5 mm. In FIG. 7, the parts denoted 72 to 78 may be shared with the corresponding parts of the character area processing unit 106.
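The way the word matching unit resolves characters that recognition left undetermined can be sketched as a dictionary lookup. This is a simplified assumption: unrecognized characters are represented as `None`, a word is accepted only when exactly one dictionary entry fits, and the function name and sample dictionary are hypothetical. Correcting characters that were misrecognized (rather than unrecognized) would need an approximate-matching step that is not shown.

```python
def resolve_word(chars, dictionary):
    """Fill unrecognized characters (None) using a word dictionary.

    Returns the unique dictionary word consistent with the recognized
    characters, or None when no word or more than one word fits.
    """
    candidates = [
        word for word in dictionary
        if len(word) == len(chars)
        and all(c is None or c == w for c, w in zip(chars, word))
    ]
    return candidates[0] if len(candidates) == 1 else None

WORDS = ["document", "diagram", "image", "cat", "cut"]

resolved = resolve_word(["d", "o", None, "u", "m", "e", "n", "t"], WORDS)
ambiguous = resolve_word(["c", None, "t"], WORDS)
```

When the lookup is ambiguous or fails, the sketch returns `None`, which corresponds to the case in the text where the character is left as an image.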

The following applications of the present invention are conceivable. In an electronic filing system, dividing the input image into areas and encoding them compresses the data and reduces the required storage capacity. In combination with desktop publishing, the device can be used to rewrite the text and figures of an input image to create a different document.

When performing machine translation, document input that people previously typed manually on a keyboard or the like can be automated. In a copier, repeatedly copying a document as raw image information has conventionally caused characters and figures to blur under the influence of noise until they finally become unreadable. Once the document has been passed through this document processing device and the encodable parts have been encoded, however, the encoded parts yield a sharp image no matter how many times the copy is repeated. For the same reason, applying the device to facsimile input images makes the images sharper and reduces the amount of data to be transmitted.

[Effects of the Invention] As described above, the document image processing device of the present invention recognizes characters and figures that were conventionally handled as image data and converts them into code data suited to them, so less storage capacity is needed to store a document. In addition, because the content has been converted into code data, editing operations such as changing or reusing some of the characters or figures become possible.

Furthermore, because the character, diagram, and image areas are separated and extracted automatically, processing requires less labor. In addition, by setting up a program in advance, selective operations become possible, such as erasing a logo in the margin or saving only the text or only the figures.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of the document image processing device of the present invention; FIG. 2 shows the distribution of black pixels in each area; FIG. 3 shows an input image divided into areas; FIG. 4 is a block diagram of the character area processing unit of the present invention; FIG. 5 shows an example of an input image; FIG. 6 shows FIG. 5 expressed in page format; and FIG. 7 is a block diagram of the diagram area processing unit of the present invention.

101: image input unit; 102: image memory; 103: preprocessing unit; 104: feature extraction unit; 105: area division unit; 106: character area processing unit; 107: diagram area processing unit; 108: image area processing unit; 109: reconstruction unit; 110: image output unit; 41: character string extraction unit; 42, 72: character extraction unit; 43, 73: character recognition unit; 44, 74: character recognition dictionary; 45, 75: word matching unit; 46, 76: word dictionary; 47, 77: character encoding unit; 48, 78: page format addition unit; 71: geometric figure extraction unit.

Applicant: Seiko Epson Corporation. Agent: Patent Attorney Kisaburo Suzuki (and one other).

Claims

[Claims]

1. A document image processing device comprising: an image input unit that captures, as image information, a document containing at least one character, diagram, or image area; a preprocessing unit that removes noise from the image information captured by the image input unit and binarizes it; a feature extraction unit that extracts the features of the character, diagram, and image areas in the image information; an area division unit that divides the image information into a character area, a diagram area, and an image area based on the features extracted by the feature extraction unit; a character area processing unit that recognizes characters in the character area; a diagram area processing unit that recognizes geometric figures and characters in the diagram area; an image area processing unit that processes the image information in the image area; a reconstruction unit that reconstructs the data from each of these processing units; and an image output unit that outputs the reconstructed document information.