JPH04106670A

JPH04106670A - Document picture processor

Info

Publication number: JPH04106670A
Application number: JP2224835A
Authority: JP
Inventors: Naoki Kuwata; 直樹鍬田
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1990-08-27
Filing date: 1990-08-27
Publication date: 1992-04-08

Abstract

PURPOSE: To reduce the labor required for document processing by recognizing characters and graphics that have conventionally been handled as image information and converting them into suitable code data. CONSTITUTION: An area dividing section 105 divides the input image information into a character area, a graphic area, and an image area by finding the black pixel density of the areas of the input picture other than the blank areas extracted by a blank area extracting section 104. A character area processing section 106 extracts character strings, segments individual characters, and recognizes the segmented characters in the character area, and a graphic area processing section 107 extracts geometric figures and characters and then recognizes them. An image area processing section 108 leaves the image area as image information or compresses it, and a reconstituting section 109 reconstitutes the areas converted into code data and the image-data areas on the page. A picture outputting section 110 outputs the reconstituted document information. A document can therefore be stored without manual intervention.

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a document image processing device that separates and extracts character areas, diagram areas, and image areas from document information captured as image information, converts the information in each area into code data suited to that information, and then reconstructs and outputs the document.

[Prior Art] In general, when document information written on paper is stored, an image captured with an image scanner or the like is treated as image information as-is and saved to an external storage device or the like. Alternatively, when only the character areas are extracted, recognized, converted to character codes, and stored, the user has had to specify the character areas in the input document image by hand.

[Problems to be Solved by the Invention] As described above, storing a document as image information requires an enormous storage capacity and does not allow editing such as rewriting part of the document. Specifying areas by hand requires a person to be present at all times to give instructions, which is laborious.

The present invention is intended to solve these problems. Its object is to provide a device that, without relying on manual work, separates an input document image into character areas, diagram areas, and image areas, recognizes the characters in the character areas and the characters and geometric figures in the diagram areas, converts each into code data suited to that information, and stores the document.

[Means for Solving the Problems] The document image processing device of the present invention comprises: an image input unit that captures, as image information, a document containing at least one character, diagram, or image area; a preprocessing unit that removes noise from the image information captured by the image input unit and binarizes it; a blank area extraction unit that extracts blank areas from the data processed by the preprocessing unit; an area dividing unit that divides the image information into character areas, diagram areas, and image areas by determining the black pixel density of the areas other than the blank areas; a character area processing unit that recognizes the characters in the character areas; a diagram area processing unit that recognizes the geometric figures and characters in the diagram areas; an image area processing unit that processes the image information in the image areas; a reconstruction unit that reconstructs the data from these processing units; and an image output unit that outputs the reconstructed document information.

[Embodiment] The present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of the document image processing device of the present invention. 101 is an image input unit that captures a document image as image information, using a scanner, a camera, or the like. When an optical disk or other medium on which images have been stored in advance is used, the image input unit is the corresponding playback device. 102 is an image memory that temporarily stores the captured image information. 103 is a preprocessing unit that removes noise contained in the image information and binarizes it. Isolated noise is removed with a median filter or the like. If the input image is skewed, the skew angle is also corrected at this stage to simplify later processing. 104 is a blank area extraction unit that extracts the blank areas that normally separate the character, diagram, and image areas, so that these areas can be separated and extracted. The method of extracting blank areas is described later.
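
The preprocessing step can be illustrated with a minimal sketch. It assumes a grayscale image held as a list of rows of 0-255 integers, a 3x3 median window, and a fixed threshold of 128; none of these values come from the source, which only names "a median filter or the like" and binarization. Skew correction is omitted.

```python
from statistics import median


def median_filter3(gray):
    """Apply a 3x3 median filter; border pixels are copied unchanged."""
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [gray[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


def binarize(gray, threshold=128):
    """Map grayscale to binary: 1 = black pixel, 0 = white pixel.

    The threshold value is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# An isolated dark speck on a white page disappears after filtering.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
clean = binarize(median_filter3(page))
```

Filtering before binarization is one possible ordering; the source does not say which comes first.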

105 is an area dividing unit that divides the input image information into character areas, diagram areas, and image areas by determining the black pixel density of each area other than the blank areas extracted by the blank area extraction unit 104. 106 is a character area processing unit that, within each character area, extracts character strings, segments individual characters, and recognizes the segmented characters. 107 is a diagram area processing unit that extracts the geometric figures and characters in a diagram area and then recognizes them. 108 is an image area processing unit that either leaves a portion judged to be an image area as image information or compresses it. 109 is a reconstruction unit that reconstructs the areas converted into code data and the image data areas on the page. 110 is an image output unit that outputs the reconstructed document information; specifically, a printing device, a display device, or an external storage device.

Next, the method of extracting blank portions from the input image information is described with reference to FIG. 2.

In FIG. 2, 21 is the image information after it has passed through the preprocessing unit 103. As shown in the figure, a rectangle of m × n pixels is formed and scanned sequentially in the direction of the arrow (horizontally), and the number of black pixels inside the rectangle is counted. The result is shown at 22. At position a the rectangle contains only blank area, so the black pixel count is 0. At position b the rectangle overlaps an area containing characters, so the count increases. At position c the rectangle lies entirely within the character area, so the count is larger still. Thus, at a transition from a blank area to, for example, a character area, the number of black pixels in the rectangle changes from 0 to some nonzero value.
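
The window scan can be sketched as follows. The sketch assumes a binary image stored as rows of 0/1 values (1 = black), takes m as the window width and n as its height, and advances the window by its own width at each step; the step size and the function names are assumptions, since the source only says the rectangle is scanned sequentially.

```python
def count_black(img, top, left, m, n):
    """Count black pixels in the window m wide and n high at (top, left)."""
    return sum(sum(row[left:left + m]) for row in img[top:top + n])


def scan_row(img, top, m, n):
    """Slide the window horizontally in steps of m; return the counts."""
    width = len(img[0])
    return [count_black(img, top, left, m, n)
            for left in range(0, width - m + 1, m)]


# Two rows: columns 0-4 are blank, columns 5-11 are black.
img = [[0] * 5 + [1] * 7 for _ in range(2)]
counts = scan_row(img, top=0, m=4, n=2)
print(counts)  # [0, 6, 8], matching positions a, b, and c
```

The first nonzero count after a run of zeros marks the window that contains the area boundary.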

The boundary of the area lies within the rectangle where this change occurs (position b). To locate the boundary precisely, rectangle b is moved to the left, one pixel at a time, until the number of black pixels inside it becomes 0. The position of the right side of the rectangle at that point is stored as the boundary. The same scan is performed for the transition from a character area to a blank area (the portion indicated by x, y, z) to find that boundary.
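
A minimal sketch of this refinement step, under the same assumptions as before (binary image as rows of 0/1, m = window width, n = window height). The function names and the choice to return `None` when the window reaches the image edge without becoming empty are illustrative, not taken from the source.

```python
def count_black(img, top, left, m, n):
    """Count black pixels in the window m wide and n high at (top, left)."""
    return sum(sum(row[left:left + m]) for row in img[top:top + n])


def refine_left_boundary(img, top, left, m, n):
    """Move the window left one pixel at a time until it is empty.

    Returns the x coordinate of the window's right side (exclusive),
    which is stored as the boundary, or None if no empty position exists.
    """
    while count_black(img, top, left, m, n) > 0:
        if left == 0:
            return None
        left -= 1
    return left + m


# Columns 0-4 are blank and columns 5-11 are black; window b starts at x = 4.
img = [[0] * 5 + [1] * 7 for _ in range(2)]
boundary = refine_left_boundary(img, top=0, left=4, m=4, n=2)
print(boundary)  # 5: the black region begins at column 5
```

The transition from a character area back to a blank area would use the mirror image of this loop, moving the window to the right and recording its left side.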

The shape of the scanning rectangle is generally chosen as follows. For horizontal scanning, m ≥ n, with m large enough that several lines or several characters fit inside the rectangle; for vertical scanning, m ≤ n, with n large enough that several lines or several characters fit inside the rectangle. With this choice, the blank portions separating the areas can be extracted with good sensitivity even in documents with multi-column layouts.

FIG. 3 shows the result of applying the above operation to the input image of FIG. 2 in both the horizontal and vertical directions. As FIG. 3 shows, the boundary positions are somewhat uneven, depending on the state of the input image. Because this would complicate later processing, the circumscribed rectangle of each such region is taken, and the region is separated and extracted from the input image. In this way the blank areas are separated from the remaining areas of the input image. Next, for each non-blank portion extracted in this way (the rectangles shown in FIG. 3), the number of black pixels inside it is counted and divided by its area to obtain the density.
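
The circumscribed rectangle and density calculation, together with the threshold classification that the description turns to next, can be sketched as below. The source gives only the general ordering of densities (image area highest, then character area, then diagram area); the threshold values 0.5 and 0.1 and all function names are hypothetical.

```python
def bounding_box(points):
    """Circumscribed rectangle of (y, x) boundary points.

    Returns (top, left, bottom, right) with bottom and right exclusive.
    """
    ys = [y for y, _ in points]
    xs = [x for _, x in points]
    return min(ys), min(xs), max(ys) + 1, max(xs) + 1


def black_density(img, box):
    """Black pixel count inside the box divided by the box area."""
    top, left, bottom, right = box
    black = sum(sum(row[left:right]) for row in img[top:bottom])
    return black / ((bottom - top) * (right - left))


def classify(density, t_image=0.5, t_char=0.1):
    """Label a region by density; both thresholds are assumed values."""
    if density > t_image:
        return "image"
    if density > t_char:
        return "character"
    return "diagram"


img = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
box = bounding_box([(1, 1), (2, 1), (1, 2)])
density = black_density(img, box)
print(box, density, classify(density))  # (1, 1, 3, 3) 0.75 image
```

In practice the thresholds would have to be tuned to the scanner resolution and the kind of documents being processed.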

Let D1, D2, and D3 be the black pixel densities of an image area such as a photograph, a character area, and a diagram area, respectively. In general, D1 > D2 > D3. The areas can therefore be separated and extracted by setting appropriate threshold values.

FIG. 4 is a block diagram showing the details of the character area processing unit 106. 41 is a character string extraction unit that extracts character strings within the character area; 42 is a character extraction unit that segments individual characters from an extracted character string; 43 is a character recognition unit that recognizes the extracted characters by referring to a character recognition dictionary 44. Characters that could not be recognized here are passed as image data to the following word matching unit 45. The word matching unit refers to a word dictionary 46 to check whether the recognized characters form meaningful words; where misrecognition has produced a meaningless word that can be corrected, it is converted to the correct word. Characters that the character recognition unit could not recognize are also decided here when they can be determined from the word dictionary. Characters that the word matching unit cannot determine are left as images. 47 is a character encoding unit that encodes the recognized characters. 48 is a page format adding unit that adds the information needed to reconstruct the encoded character strings on the page. For example, as shown in FIG. 5, take the upper-left corner of the page as the origin, with the X axis in the vertical direction and the Y axis in the horizontal direction, and suppose that a character string lies in the area bounded by (x1, y1) and (x2, y2). The page format describing this area is then, for example, as shown in FIG. 6(a). In this example, "Char" indicates that the area is a character area, and the remaining entries indicate that the character area lies within the rectangle bounded by (x1, y1) and (x2, y2), that the typeface is Mincho, that the character size is 10 points, and that the recognized character string is the content shown as (@@...@@). Based on this information, the contents of the character area are output to a printing device or display device, or saved to an external storage device.

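The page format entry for a character area can be pictured as a small record. FIG. 6(a) is not reproduced in this text, so the field names, field order, and the one-line serialization below are assumptions chosen only to carry the items the description lists: the "Char" tag, the two corner coordinates, the typeface, the point size, and the recognized string.

```python
from dataclasses import dataclass


@dataclass
class CharRegion:
    """Hypothetical page format record for one character area."""
    x1: int
    y1: int
    x2: int
    y2: int
    typeface: str
    point_size: int
    text: str

    def to_record(self) -> str:
        """Serialize to a single line; the layout is an assumed one."""
        return (f"Char ({self.x1},{self.y1}) ({self.x2},{self.y2}) "
                f"{self.typeface} {self.point_size}pt ({self.text})")


region = CharRegion(10, 20, 110, 40, "Mincho", 10, "abc")
print(region.to_record())  # Char (10,20) (110,40) Mincho 10pt (abc)
```

A record of this kind is what lets the reconstruction unit place encoded text back at its original position on the page.
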
FIG. 7 is a block diagram showing the details of the diagram area processing unit 107. 71 is a geometric figure extraction unit that extracts geometric figures within the diagram area, using methods such as the Hough transform and extraction of connected components of black pixels.

72 is a character extraction unit that extracts the characters contained in the diagram area, and 73 is a character recognition unit that recognizes the extracted characters by referring to a character recognition dictionary 74. Characters that could not be recognized here are passed as image data to the following word matching unit 75.

The word matching unit refers to a word dictionary 76 to check whether the recognized characters form meaningful words; where misrecognition has produced a meaningless word that can be corrected, it is converted to the correct word. Characters that the character recognition unit could not recognize are also decided here when they can be determined from the word dictionary. Characters that the word matching unit cannot determine are left as images. 77 is a character encoding unit that encodes the recognized characters. 78 is a page format adding unit that adds the information needed to reconstruct the encoded character strings on the page. For example, as shown in FIG. 5, take the upper-left corner of the page as the origin, with the X axis in the vertical direction and the Y axis in the horizontal direction, and suppose that a straight line connects (x3, y3) and (x4, y4). The page format describing this line is then, for example, as shown in FIG. 6(b). In this example, "Line" indicates that the figure is a straight line, and the remaining entries indicate that the line runs between (x3, y3) and (x4, y4), that the line type is solid, and that the line width is 0.5 mm. The portions 72 to 78 in FIG. 7 may be shared with the corresponding portions of the character area processing unit 106.
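
The word matching step described above can be sketched in a few lines. The source does not specify how candidates are chosen, so this sketch makes its own assumptions: unrecognized characters are represented by `None`, a dictionary entry is a candidate when it has the same length and differs from the recognized characters in at most one position, and a word is decided only when exactly one candidate exists.

```python
def match_word(chars, dictionary, max_mismatch=1):
    """Return the dictionary word consistent with chars, or None.

    chars holds recognized characters, with None where recognition failed.
    The mismatch limit and the uniqueness rule are assumptions.
    """
    if None not in chars and "".join(chars) in dictionary:
        return "".join(chars)  # already a meaningful word
    candidates = []
    for entry in dictionary:
        if len(entry) != len(chars):
            continue
        mismatches = sum(1 for c, e in zip(chars, entry)
                         if c is not None and c != e)
        if mismatches <= max_mismatch:
            candidates.append(entry)
    # Decide only when the dictionary leaves no ambiguity.
    return candidates[0] if len(candidates) == 1 else None


words = {"image", "input", "paper"}
print(match_word(["i", "m", None, "g", "e"], words))  # image
print(match_word(list("imaqe"), words))               # image
print(match_word(list("zzzzz"), words))               # None
```

When the function returns `None`, the corresponding characters would be left as image data, as the description states.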

The following are possible applications of the present invention. In an electronic filing system, dividing the input image into areas and encoding them compresses the data and reduces the required storage capacity. In combination with desktop publishing, the text and figures of an input image can be rewritten to create a different document.

In machine translation, document input that people have conventionally performed with a keyboard or the like can be automated. In a copying machine, repeatedly copying a document as image information has conventionally made characters and figures progressively less distinct under the influence of noise and the like, until they can no longer be read; by passing the document through this document processing device once and encoding the portions that can be encoded, a sharp image of the encoded portions is obtained no matter how many times copying is repeated. For the same reason, applying the device to facsimile input images yields sharper images and reduces the amount of data to be transmitted.

[Effects of the Invention] As described above, the document image processing device of the present invention recognizes characters and figures that were conventionally handled as image information and converts them into suitable code data, so less storage capacity is needed to store a document.

In addition, because the content has been converted into code data, editing operations such as changing or reusing some of the characters or figures become possible. Furthermore, because character, diagram, and image areas are separated and extracted automatically, processing labor is reduced, and by programming the device in advance it can also perform special operations such as deleting a logo in the margin or saving only the text or only the figures.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of the document image processing device of the present invention; FIG. 2 illustrates the principle of the blank area extraction method of the present invention; FIG. 3 shows the result of applying the method of FIG. 2 to a document; FIG. 4 is a block diagram of the character area processing unit of the present invention; FIG. 5 shows an example of an input image; FIG. 6 shows FIG. 5 expressed in page format; and FIG. 7 is a block diagram of the diagram area processing unit of the present invention.

101: image input unit
102: image memory
103: preprocessing unit
104: feature extraction unit
105: area dividing unit
106: character area processing unit
107: diagram area processing unit
108: image area processing unit
109: reconstruction unit
110: image output unit
21: input image information
22: distribution of black pixel counts
41: character string extraction unit
42, 72: character extraction unit
43, 73: character recognition unit
44, 74: character recognition dictionary
45, 75: word matching unit
46, 76: word dictionary
47, 77: character encoding unit
48, 78: page format adding unit
71: geometric figure extraction unit

Applicant: Seiko Epson Corporation
Agent: Patent attorney Kisaburo Suzuki (and one other)

Claims

[Claims]

1. A document image processing device comprising: an image input unit that captures, as image information, a document containing at least one character, diagram, or image area; a preprocessing unit that removes noise from the image information captured by the image input unit and binarizes it; a blank area extraction unit that extracts blank areas from the data processed by the preprocessing unit; an area dividing unit that divides the image information into character areas, diagram areas, and image areas by determining the black pixel density of the areas other than the blank areas; a character area processing unit that recognizes the characters in the character areas; a diagram area processing unit that recognizes the geometric figures and characters in the diagram areas; an image area processing unit that processes the image information in the image areas; a reconstruction unit that reconstructs the data from the three processing units; and an image output unit that outputs the reconstructed document information.