JP2606560B2

JP2606560B2 - Document image storage device

Info

Publication number: JP2606560B2
Application number: JP5199408A
Authority: JP
Inventors: 洋一白川; 健上村; 淳津雲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-08-11
Filing date: 1993-08-11
Publication date: 1997-05-07
Anticipated expiration: 2012-05-07
Also published as: JPH0757046A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a document image storage system in a character recognition device, and more particularly to a document image storage system in a character recognition device used to digitize and manage documents.

[0002]

[Prior Art] To manage large volumes of existing documents, document images have conventionally been digitized. Going beyond simply compressing the document image data and attaching keywords, analyzing the image and encoding the contents of its text areas by character recognition can make storage and retrieval more efficient. Document image analysis is therefore regarded as an important future technology for realizing such functions.

[0003] In a conventional document image storage system in a character recognition device, the text areas and chart areas that make up a document are separated from the image obtained by scanning the document with an image scanner, and the inclusion and arrangement relations among the areas are stored as layout information. Text areas in particular are further separated into single-character areas, on which character recognition is performed.

[0004] A first example of such a conventional document image storage system in a character recognition device is Japanese Patent Application No. Sho 62-29207, "Document Image Analysis Method". This method has the advantage that any document can be encoded regardless of its format, such as vertical writing, horizontal writing, or multiple columns.

[0005] FIG. 6 is a block diagram showing this first example of a conventional document image storage system in a character recognition device. The document image memory 61 shown in FIG. 6 stores image information obtained by digitizing a document image. The area dividing unit 62 divides the image information in the document image memory 61 from global areas down to local areas while preserving their vertical and horizontal arrangement relations, determines from the result of the division whether the document is written vertically or horizontally, and stores that result in the vertical/horizontal information storage unit 63.

[0006] When a partial area obtained by the area dividing unit 62 is a text area, the data of that text area is sent to the character separation unit 64. The character separation unit 64 extracts the line areas that make up the text area, then extracts the area information of each character in each line area, and stores it sequentially in the structured data storage unit 65.

[0007] The area search unit 67 then reads the conditions on the arrangement relations of the areas to be extracted, which are stored in advance in the area definition storage unit 66, and, according to those conditions, searches the structured data held in the structured data storage unit 65, which describes the arrangement structure among the areas in the document image. The area search unit 67 further reads the areas that satisfy the conditions from the structured data storage unit 65 and stores them in a predetermined order in the extraction result storage unit 68.

[0008] In this way, the first example of a conventional document image storage system in a character recognition device can divide a document image into its constituent elements and store the layout information of each.

[0009] A second example of a conventional document image storage system in a character recognition device reads the image data of a form, recognizes and corrects the characters contained in that image data, and then creates a data file that links the character data with the form image data, so that the form image and the character codes in it are managed in association with each other.

[0010] This second example is described in Japanese Patent Application No. Hei 2-39374, "Form Processing Device". This device has the advantage that a form image can be retrieved by a character string written in it.

[0011] As these examples show, conventional document image storage systems in character recognition devices encode and store characters, but they do not capture character attributes such as the typeface and font size of the characters that make up the document. In other words, the techniques to date go no further than storing the positional relations of characters and the layout information of the document.

[0012]

[Problems to be Solved by the Invention] The conventional document image storage systems described above retain layout information and character codes, but identifying typefaces and font sizes is difficult with current technology and has not been realized, so this information is lost.

[0013] A document, however, carries higher-level information, such as emphasis indicated by a typeface or font size. To retain such higher-level information when a document is digitized, it is necessary to store the character attributes in the document, such as font information, in addition to the character codes obtained by character recognition and the layout information describing the positional and inclusion relations of the text areas and chart areas.

[0014] An object of the present invention is therefore to provide a document image storage system in a character recognition device that can store character attributes in association with character images, so that higher-level document information conveyed by fonts is retained.

[0015]

[Means for Solving the Problems] The document image storage system in a character recognition device according to the first invention comprises: document image storage means for storing a document image, which is digitized image information data; layout analysis means for extracting constituent elements such as text and charts from the document image stored in the document image storage means, determining whether each constituent element is a text area or a chart area, outputting layout information including the inclusion relations among the constituent elements and their vertical and horizontal arrangement relations, and, when a constituent element is a text area, outputting the contents of the character areas obtained by dividing it into single characters as character images; layout information storage means for storing the layout information obtained by the layout analysis means; character recognition means for recognizing the character images obtained by the layout analysis means and outputting the recognition results as character codes; character code correction means having a function of correcting some of the character codes obtained by the character recognition means by replacing them with the appropriate character codes; and character image storage means for storing the character images obtained by the layout analysis means in association with the character codes obtained after correction by the character code correction means.

[0016] The document image storage system in a character recognition device according to the second invention has the following in addition to the elements of the first invention. The layout analysis means of the first invention extracts constituent elements such as text and charts from the document image stored in the document image storage means of the first invention, determines whether each constituent element is a text area or a chart area, and, when a constituent element is a chart area, has a function of outputting the contents of that constituent element as a chart image. The system further comprises chart image storage means for storing the chart images output by the layout analysis means, and layout editing means for reading the layout information stored in the layout information storage means of the first invention, changing the arrangement relations, reading the character images stored in the character image storage means of the first invention and the chart images stored in the chart image storage means, rearranging them according to the changed arrangement relations, and outputting the result.

[0017] The document image storage system in a character recognition device according to the third invention includes: storing a document image, which is digitized image information data; extracting constituent elements such as text and charts from the document image and determining whether each constituent element is a text area or a chart area, thereby outputting layout information including the inclusion relations among the constituent elements and their vertical and horizontal arrangement relations; when a constituent element is a text area, outputting the contents of the character areas obtained by dividing it into single characters as character images; storing the obtained layout information; recognizing the obtained character images and outputting the recognition results as character codes; correcting some of the obtained character codes by replacing them with the appropriate character codes; and storing the obtained character images in association with the character codes obtained after correction.

[0018] The document image storage system in a character recognition device according to the fourth invention includes: storing a document image, which is digitized image information data; extracting constituent elements such as text and charts from the document image and determining whether each constituent element is a text area or a chart area, thereby outputting layout information including the inclusion relations among the constituent elements and their vertical and horizontal arrangement relations; when a constituent element is a text area, outputting the contents of the character areas obtained by dividing it into single characters as character images; when a constituent element is a chart area, outputting the contents of that constituent element as a chart image; storing the obtained layout information; recognizing the obtained character images and outputting the recognition results as character codes; correcting some of the obtained character codes by replacing them with the appropriate character codes; storing the obtained character images in association with the character codes obtained after correction; storing the obtained chart images; reading the layout information and changing the arrangement relations; and reading the character images and the chart images, rearranging them according to the changed arrangement relations, and outputting the result.

[0019]

[Embodiments] Embodiments of the present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing a first embodiment of the document image storage system in a character recognition device of the present invention. The document image storage means 10 shown in FIG. 1 stores a document image, which is digitized image information data.

[0020] The layout analysis means 11 reads the document image 100 obtained from the document image storage means 10, divides it into constituent elements such as text areas and chart areas, and outputs layout information 101 describing the inclusion relations and the vertical and horizontal arrangement relations of those elements. When the layout analysis means 11 determines that a constituent element is a text area, it extracts the line areas contained in the text area and the single-character areas contained in each line area, outputs their layout information 101, and outputs each character area as a character image 102.

[0021] The layout information storage means 12 stores the layout information 101 obtained from the layout analysis means 11. The character recognition means 13 recognizes the character images 102 obtained from the layout analysis means 11 and outputs character codes 103 as the recognition results. The character code correction means 14 replaces the character codes 103 obtained from the character recognition means 13 with corrected character codes and outputs the resulting character codes 104.

[0022] The character image storage means 15 stores the character images 102 obtained from the layout analysis means 11 in association with the character codes 104 obtained from the character code correction means 14.
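The data flow of the first embodiment (document image 100, layout information 101, character images 102, character codes 103, corrected codes 104) can be sketched as follows. This is only an illustrative sketch: the function name `store_document`, the callable parameters, and the use of `bytes`, `dict`, and `str` as data types are assumptions made for the example and do not appear in the patent.

```python
from typing import Callable, List, Tuple


def store_document(
    document_image: bytes,
    analyze_layout: Callable[[bytes], Tuple[dict, List[bytes]]],
    recognize: Callable[[bytes], str],
    correct: Callable[[List[str]], List[str]],
) -> Tuple[dict, List[Tuple[bytes, str]]]:
    """Hypothetical wiring of means 11, 13, 14 and 15 (types are assumptions)."""
    # Means 11: layout information (101) and per-character images (102).
    layout_info, char_images = analyze_layout(document_image)
    # Means 13: one recognized character code (103) per character image.
    codes = [recognize(image) for image in char_images]
    # Means 14: corrected character codes (104).
    corrected = correct(codes)
    # Means 12 and 15: keep the layout, and pair each image with its code.
    return layout_info, list(zip(char_images, corrected))
```

The recognizer and the correction step are passed in as callables because the patent does not fix how either is implemented.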

[0023] Next, the operation of the first embodiment will be described. First, the digitized document image held in the document image storage means 10 is sent to the layout analysis means 11. The layout analysis means 11 divides the document into its constituent text areas and chart areas and stores their inclusion and arrangement relations in the layout information storage means 12.

[0024] For an area determined to be a text area, the line areas that make up the text area, and then the single-character areas that make up each line area, are extracted, for example by division processing. Their positional and arrangement relations are stored as layout information in the layout information storage means 12.
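The patent names only "division processing" and does not specify an algorithm. One common way to split a horizontally written text area into line areas and then character areas is projection-profile splitting, sketched below under that assumption. The binary-image representation and the helper names `runs`, `split_lines`, `split_characters`, and `segment` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of 0 (background) / 1 (ink); an assumed representation


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def split_lines(text_area: Image) -> List[Tuple[int, int]]:
    """Row ranges of the line areas, found from the per-row ink counts."""
    return runs([sum(row) for row in text_area])


def split_characters(text_area: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Column ranges of the character areas inside one line area."""
    top, bottom = line
    width = len(text_area[0])
    column_profile = [
        sum(text_area[y][x] for y in range(top, bottom)) for x in range(width)
    ]
    return runs(column_profile)


def segment(text_area: Image):
    """Return (line_range, [character_ranges]) pairs in top-to-bottom order."""
    return [(line, split_characters(text_area, line)) for line in split_lines(text_area)]
```

For vertically written text the same idea applies with rows and columns swapped. Touching or broken characters would need additional handling that this sketch omits.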

[0025] FIG. 2 shows an example of character images cut out by the layout analysis means 11. For example, the text area shown in FIG. 2(a) is separated into the character areas shown in FIG. 2(b), each of which is output as a character image.

[0026] FIG. 3 shows an example of the data stored in the layout information storage means 12. The inclusion and arrangement relations of the character images are represented, for example, by hierarchical layout information as shown in FIG. 3. Here, pointer 30 points to a text area, pointer 31 points to a line, and pointer 32 points to a character, so that the physical structure of the layout is expressed hierarchically. The character code recognized by the character recognition means 13 is attached to each character image.
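A hierarchy of the kind described for FIG. 3 (text area, then line, then character) might be modeled as nested records. The following is a minimal sketch; the class names, the bounding-box convention, and the `text_of` and `contains` helpers are assumptions for illustration, since the patent does not give the actual record layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); an assumed convention


@dataclass
class CharArea:
    box: Box
    code: Optional[str] = None  # character code attached after recognition


@dataclass
class LineArea:
    box: Box
    chars: List[CharArea] = field(default_factory=list)  # plays the role of pointer 32


@dataclass
class TextArea:
    box: Box
    lines: List[LineArea] = field(default_factory=list)  # plays the role of pointer 31


def contains(outer: Box, inner: Box) -> bool:
    """Inclusion relation: True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def text_of(area: TextArea) -> str:
    """Walk text area -> lines -> characters and join the attached codes."""
    return "\n".join(
        "".join(char.code or "?" for char in line.chars) for line in area.lines
    )
```

Object references stand in for the pointers 30 to 32; a file-based store would more likely use offsets or record identifiers instead.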

[0027] The character code correction means 14 then corrects the character codes. This includes a function for manually correcting misrecognized characters. Each character image is stored in the character image storage means 15 together with the character code corrected by the character code correction means 14.

[0028] FIG. 4 shows an example of the data stored in the character image storage means 15. As shown in FIG. 4, for example, each character image is stored in association with its character code. This achieves storage in which character codes are associated with character images.
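The correction and storage steps can be illustrated with a small sketch. It assumes that manual corrections arrive as a mapping from character position to the corrected code and that each stored record is an image and code pair; the names `CharRecord`, `apply_corrections`, and `store` are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CharRecord:
    image: bytes  # pixels of one character area (the format is an assumption)
    code: str     # character code after correction


def apply_corrections(recognized: List[str], corrections: Dict[int, str]) -> List[str]:
    """Replace only the codes at positions an operator has corrected."""
    return [corrections.get(i, code) for i, code in enumerate(recognized)]


def store(images: List[bytes], codes: List[str]) -> List[CharRecord]:
    """Pair every character image with exactly one character code."""
    if len(images) != len(codes):
        raise ValueError("each character image needs exactly one character code")
    return [CharRecord(image, code) for image, code in zip(images, codes)]
```

Because the image is kept alongside the code, the original glyph shape remains available for output even though only the code is used for searching.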

[0029] FIG. 5 is a block diagram showing a second embodiment of the document image storage system in a character recognition device of the present invention. In the second embodiment of FIG. 5, in addition to the components of the first embodiment, chart image storage means 17 is connected to the layout analysis means 16, and layout editing means 18 is connected to the layout information storage means 12, the character image storage means 15, and the chart image storage means 17.

[0030] Next, the operation of the second embodiment will be described. First, the layout analysis means 16 reads the document image 100 obtained from the document image storage means 10, divides the document into constituent elements such as text areas and chart areas, and outputs layout information 101 describing the inclusion relations and the vertical and horizontal arrangement relations of those elements.

[0031] When the layout analysis means 16 determines that a constituent element is a text area, it extracts the line areas contained in the text area and the single-character areas contained in each line area, and outputs the character images 102 and their layout information 101. When it determines that a constituent element is a figure or a table, it outputs the chart image 105.

[0032] The chart image storage means 17 stores the chart image 105. The layout editing means 18 reads the document's layout information 106 from the layout information storage means 12, modifies it, and, according to the modified arrangement information, rearranges the character images 107 held in the character image storage means 15 and the chart images 108 held in the chart image storage means 17, and outputs the edited result 109.
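As one illustration of rearranging stored character images under a changed arrangement, the sketch below lays out character images, taken in reading order, from left to right with wrapping at a page width. The patent does not describe how the layout editing means 18 computes positions; the function name `reflow_horizontal`, the size tuples, and the wrapping rule are assumptions made for this example.

```python
from typing import List, Tuple


def reflow_horizontal(sizes: List[Tuple[int, int]], page_width: int) -> List[Tuple[int, int]]:
    """Return the (x, y) top-left position for each character image.

    `sizes` holds the (width, height) of each character image in reading order.
    Images are placed left to right and wrapped to a new line at `page_width`.
    """
    positions = []
    x = y = 0
    line_height = 0
    for width, height in sizes:
        # Start a new line when the next image would cross the right edge.
        if x > 0 and x + width > page_width:
            x, y = 0, y + line_height
            line_height = 0
        positions.append((x, y))
        x += width
        line_height = max(line_height, height)
    return positions
```

Since the stored images themselves are placed, their original shape and size carry over into the new layout unchanged.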

[0033]

[Effects of the Invention] As described above, with the document image storage system in a character recognition device of the present invention, when an existing document that has been digitized and stored in a database or the like is retrieved and output, it can be reproduced in the same form as the original, including attribute information such as character emphasis.

[0034] The document image storage system in a character recognition device of the present invention also allows output with a changed layout, such as printing a vertically written document horizontally, while preserving the character attributes.

[Brief description of the drawings]

[FIG. 1] A block diagram showing a first embodiment of the document image storage system in a character recognition device of the present invention.

[FIG. 2] A diagram showing an example of character images cut out by the layout analysis means 11.

[FIG. 3] A diagram showing an example of data stored in the layout information storage means 12.

[FIG. 4] A diagram showing an example of data stored in the character image storage means 15.

[FIG. 5] A block diagram showing a second embodiment of the document image storage system in a character recognition device of the present invention.

[FIG. 6] A block diagram showing a first example of a document image storage system in a conventional character recognition device.

[Explanation of symbols]

10 Document image storage means
11 Layout analysis means
12 Layout information storage means
13 Character recognition means
14 Character code correction means
15 Character image storage means
16 Layout analysis means
17 Chart image storage means
18 Layout editing means
30, 31, 32 Pointers
61 Document image memory
62 Area dividing unit
63 Vertical/horizontal information storage unit
64 Character separation unit
65 Structured data storage unit
66 Area definition storage unit
67 Area search unit
68 Extraction result storage unit

Continuation of front page

(56) References: JP-A-5-108866 (JP, A); JP-A-3-161886 (JP, A); JP-A-3-131992 (JP, A)

Claims

(57) [Claims]

1. A document image storage device comprising: document image storage means for storing a document image, which is digitized image information data; layout analysis means for extracting constituent elements such as text and charts from the document image stored in the document image storage means, determining whether each constituent element is a text area or a chart area, outputting layout information including the inclusion relations among the constituent elements and their vertical and horizontal arrangement relations, and, when a constituent element is a text area, outputting the contents of the character areas obtained by dividing it into single characters as character images; layout information storage means for storing the layout information obtained by the layout analysis means; character recognition means for recognizing the character images obtained by the layout analysis means and outputting the recognition results as character codes; character code correction means having a function of correcting some of the character codes obtained by the character recognition means by replacing them with the appropriate character codes; and character image storage means for storing the character images obtained by the layout analysis means in association with the character codes obtained after correction by the character code correction means.

2. The document image storage device according to claim 1, wherein the layout analysis means extracts constituent elements such as text and charts from the document image stored in the document image storage means, determines whether each constituent element is a text area or a chart area, and, when a constituent element is a chart area, has a function of outputting the contents of that constituent element as a chart image, the device further comprising: chart image storage means for storing the chart images output by the layout analysis means; and layout editing means for reading the layout information stored in the layout information storage means, changing the arrangement relations, reading the character images stored in the character image storage means and the chart images stored in the chart image storage means, rearranging them according to the changed arrangement relations, and outputting the result.

3. A document image storage device characterized by: storing a document image, which is digitized image information data; extracting constituent elements such as text and charts from the document image and determining whether each constituent element is a text area or a chart area, thereby outputting layout information including the inclusion relations among the constituent elements and their vertical and horizontal arrangement relations; when a constituent element is a text area, outputting the contents of the character areas obtained by dividing it into single characters as character images; storing the obtained layout information; recognizing the obtained character images and outputting the recognition results as character codes; correcting some of the obtained character codes by replacing them with the appropriate character codes; and storing the obtained character images in association with the character codes obtained after correction.

4. A document image storage device characterized by: storing a document image, which is digitized image information data; extracting constituent elements such as text and charts from the document image and determining whether each constituent element is a text area or a chart area, thereby outputting layout information including the inclusion relations among the constituent elements and their vertical and horizontal arrangement relations; when a constituent element is a text area, outputting the contents of the character areas obtained by dividing it into single characters as character images; when a constituent element is a chart area, outputting the contents of that constituent element as a chart image; storing the obtained layout information; recognizing the obtained character images and outputting the recognition results as character codes; correcting some of the obtained character codes by replacing them with the appropriate character codes; storing the obtained character images in association with the character codes obtained after correction; storing the obtained chart images; reading the layout information and changing the arrangement relations; and reading the character images and the chart images, rearranging them according to the changed arrangement relations, and outputting the result.