JP2008108114A

JP2008108114A - Document processor and document processing method

Info

Publication number: JP2008108114A
Application number: JP2006291180A
Authority: JP
Inventors: Yoji Kawasaki (川崎洋治); Makiko Katagiri (片桐牧子)
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2006-10-26
Filing date: 2006-10-26
Publication date: 2008-05-08

Abstract

PROBLEM TO BE SOLVED: To remove restrictions on the table structure that an apparatus performing table recognition can process.
SOLUTION: In the document processing apparatus, a table to be processed, extracted from the image data of a document, is acquired (S10), and foreground such as ruled lines, character strings, and numeric strings is removed from it to generate a foreground-removed image that carries information only on the title area (S12). The foreground-removed image is divided into partial tables by pattern matching against template data having basic table structures (S14, S16). Each partial table is divided into cells on the basis of ruled lines or the spacing of character strings and numeric strings (S18), and by recognizing the character string or numeric string in each cell, data associating the strings in the title cells with the numeric strings in the data cells is extracted (S20).
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to document analysis technology, and more particularly to a document processing apparatus that reads a table appearing in a document and acquires its data, and to a document processing method applied to that apparatus.

In recent years, OCR (Optical Character Reader) technology, which mechanically reads handwritten or printed documents and recognizes the characters, has become commonplace. Users can now save content written on paper as electronic data, or load the output into spreadsheet software to perform calculations.

Technology for recognizing tables on paper has also proven convenient in everyday settings such as automated bookkeeping and automatic cash transfers. In a typical table, a rectangular area enclosed by ruled lines is subdivided by further ruled lines into multiple rectangular areas, which are used as item name fields (hereinafter, title cells) and data fields (hereinafter, data cells) to express the correspondence between items and data. Recognizing a table therefore requires distinguishing title cells from data cells and determining how they correspond. In the simplest form of table recognition, a form filled in only in its title cells is read in advance, and the positions of the title cells and the corresponding data cells are stored together with the item names. The correspondence between items and data can then be obtained easily by reading the character strings, numeric strings, and so on at the data cell positions of a form that has actually been filled in.

This approach can recognize only forms with the same layout as the form read in advance. Techniques that tolerate variations in table structure have also been developed. Examples include distinguishing title cells from data cells by comparing, for instance, the side lengths of the frame of each rectangular area, and registering character strings likely to appear in title cells, such as "name" and "address", in a dictionary so that a cell containing a registered string is judged to be a title cell (see, for example, Patent Document 1, Patent Document 2, and Non-Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open No. H10-116314
Patent Document 2: Japanese Patent Laid-Open No. 2005-275830
Non-Patent Document 1: 駱琴, Toyohide Watanabe, Noboru Sugie, "Automatic acquisition of format structure knowledge for structure recognition of form documents", IEICE Transactions D-II, Vol. J76-D-II, No. 3, pp. 534-546, March 1993

However, although the techniques described above tolerate some variation, the kinds of structure and the item names they handle are limited, and they presuppose tables within an initially assumed range, so they lack versatility. Improving versatility requires preparing a large amount of information in advance for the various kinds of table, which raises development cost. Moreover, because these techniques recognize each cell as a rectangle enclosed by ruled lines, they cannot recognize tables drawn with horizontal ruled lines only, or tables with no ruled lines in which cells are indicated only by the spacing between characters.

The present invention has been made in view of these circumstances, and its object is to provide a table recognition technique that is highly versatile and inexpensive to introduce.

One aspect of the present invention relates to a document processing apparatus. This document processing apparatus recognizes a table contained in the image data of a document and reads out the content of the table, and comprises: a table extraction unit that extracts image data of a table to be processed from the image data of the document; an area dividing unit that identifies the area of the item name fields in the image data of the table to be processed by a predetermined determination method and, by performing image analysis on the overall shape of that area, divides the table into partial tables, each of which is contained in the table to be processed and has the form of an independent table; and a data extraction unit that reads the content of the item name fields and data fields of each partial table and creates data in which that content is associated on the basis of the correspondence between the item name fields and the data fields.

Here, the "overall shape" may be the shape of the "cluster" of item name fields, carrying no information about divisions by ruled lines, but even a single isolated item name field can constitute an "overall shape". The "overall shape" may be the shape of one continuous area, or may include the shapes of two or more areas.

Another aspect of the present invention relates to a document processing method. This document processing method recognizes a table contained in the image data of a document and reads out the content of the table, and includes the steps of: extracting image data of a table to be processed from the image data of the document; identifying the area of the item name fields in the image data of the table to be processed by a predetermined determination method and, by performing image analysis on the overall shape of that area, dividing the table into partial tables, each of which is contained in the table to be processed and has the form of an independent table; and reading the content of the item name fields and data fields of each partial table and creating data in which that content is associated on the basis of the correspondence between the item name fields and the data fields.

Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a system, and so on, is also valid as an aspect of the present invention.

According to the present invention, a highly versatile table recognition technique can be realized at low cost.

FIG. 1 shows the overall configuration of the document processing apparatus according to this embodiment. The document processing apparatus 100 includes an input unit 62 for inputting document image data and the like, a document analysis unit 10 that recognizes a table contained in a document image and reads the data it presents, a storage unit 60 that stores information needed for table recognition, and an output unit 64 that outputs the data read from the table in an appropriate format. These functional blocks exchange data with one another via a bus 66.

The input unit 62 is a user interface through which the user makes inputs related to processing, and includes any one or a combination of common input devices such as a keyboard and a pointing device. It may also include a scanner that reads a document and acquires it as two-dimensional image data. The input unit 62 may further include the output block of an image document processing function (not shown) that processes documents converted to images. The document image data acquired from the scanner or from the output block of the image document processing function, or the file name of document image data that the user has stored in advance in the storage unit 60 or elsewhere and specified with the keyboard or the like, is provided to the document analysis unit 10 as information on the document image to be processed.

The document analysis unit 10 is the block responsible for the main operation of the document processing apparatus 100. It extracts table data from the document image data, analyzes it by applying predetermined processing, and obtains the content written in the title cells and data cells together with their correspondence. In doing so, the document analysis unit 10 first analyzes the table globally to divide it into partial tables. A partial table is a portion of the table extracted from the document image that has title cells and data cells and can stand on its own as one independent table. If the table cannot be divided any further, it is treated as a single partial table without division. Local analysis is then performed on each partial table to obtain the data and their correspondence.

The storage unit 60 includes any one or a combination of storage devices such as a hard disk, recording media such as a CD-ROM or MD, and their readers. The storage unit 60 stores the table structure templates used in the matching process by which the document analysis unit 10 divides a table into partial tables. It may also store computer programs for operating the document analysis unit 10 and other blocks, and the data of document images to be processed.

The output unit 64 includes a display and a display controller that controls it. It also serves to assist the input unit 62, for example by displaying a screen for accepting instructions to start processing or to specify the file name of a document image. It may further be an interface that takes the content of the title cells and data cells obtained by the document analysis unit 10, together with their correspondence, converts them into data in an appropriate format, and outputs the result to another functional block (not shown). Such a functional block is, for example, the input block of a system that uses the data for further processing, such as spreadsheet calculation or document creation. An appropriate format is therefore one that such a functional block can process. Under the control of the output unit 64, the obtained data may also be output to the storage unit 60 or another storage device and stored as a database.

FIG. 2 shows the structure of the document analysis unit 10 in more detail. In hardware terms, each block shown here can be realized by elements such as a computer's CPU and by mechanical devices; in software terms, it is realized by a computer program or the like. The figure depicts functional blocks realized through their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The document analysis unit 10 includes an analysis processing unit 12 and a memory 30. The analysis processing unit 12 includes an image acquisition unit 14, a table extraction unit 16, an area dividing unit 18, a cell dividing unit 20, and a data extraction unit 22. The image acquisition unit 14 reads the document image data input from the input unit 62, or the document image data specified through the input unit 62 and stored in the storage unit 60, and saves it in the memory 30.

The table extraction unit 16 extracts the data of the table area from the document image data. For example, it scans the document image to find a set of connected ruled lines, recognizes the circumscribed rectangle of that set as the table area, and extracts it. The functional blocks described below basically follow the procedure of acquiring data stored in the memory 30, processing it, and saving it back to the memory 30, but the description of input to and output from the memory 30 may be omitted.
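The circumscribed-rectangle step can be illustrated with a minimal Python sketch. It assumes the ruled lines have already been detected and are given as axis-aligned segments. The names `touches` and `table_boxes` and the segment representation are hypothetical, introduced only for illustration; the description does not prescribe a particular grouping method.

```python
from typing import List, Tuple

# Hypothetical representation: an axis-aligned ruled line (x0, y0, x1, y1)
# with x0 <= x1 and y0 <= y1.
Segment = Tuple[int, int, int, int]

def touches(a: Segment, b: Segment) -> bool:
    """True if two axis-aligned segments intersect or share an endpoint."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def table_boxes(segments: List[Segment]) -> List[Segment]:
    """Group connected ruled lines and return one bounding box per group."""
    parent = list(range(len(segments)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if touches(segments[i], segments[j]):
                parent[find(i)] = find(j)

    boxes = {}
    for i, (x0, y0, x1, y1) in enumerate(segments):
        root = find(i)
        if root in boxes:
            b = boxes[root]
            boxes[root] = (min(b[0], x0), min(b[1], y0), max(b[2], x1), max(b[3], y1))
        else:
            boxes[root] = (x0, y0, x1, y1)
    return sorted(boxes.values())
```

Each returned box is a candidate table area. Lines that merely lie near one another without touching stay in separate groups, so a tolerance would be needed for scanned images with broken lines.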

The area dividing unit 18 identifies, within the whole area of the table, the area in which title cells exist (hereinafter, the title area), and determines the boundaries of the partial tables and divides the table by analyzing the overall shape of that area as a two-dimensional figure. The title area is identified using any one or a combination of the following as the determination method: an area with a background color; an area whose background attributes, such as color, differ from the rest; an area containing only character strings when both character strings and numeric strings are recognized; an area whose character strings are in a predetermined typeface, such as bold; and an area whose adjacent ruled lines are thicker than the others.

Other criteria for determining the title area may include: the character type of a string, such as Japanese kana and kanji or the alphabet; string attributes such as character size, character color, and decorations such as underlining; string placement such as right, left, or center alignment; the string itself, such as its literal text, its number of characters, or a specific pattern such as a date; ruled line attributes such as line type and color; and the presence or absence of ruled lines.

Positional information about an area, such as whether it lies on the left or at the top of the table as a whole, may also be taken into account. The title area may be identified by trying all of the above determination methods and judging comprehensively, or the final decision may be made by having the user select a determination method or choose from candidate title areas derived by the various methods.

Partial tables are recognized with reference to table structure templates read from the storage unit 60 into the memory 30. A template is a candidate structure for a partial table, and basic tables are prepared as image data. Here, the title area identified in the preceding step is compared with the shape of the title area in each template, and an area that matches one of the templates is identified as the area of a partial table. Because this process focuses on the global shape of the title area, that is, its arrangement, the ruled line information and cell contents of the table need not be used. A specific example is described in detail later.

The cell dividing unit 20 divides each partial table into cells by adding ruled line information to the area of the partial table. Where no ruled lines are drawn, cell boundaries are determined from the spacing between character strings and numeric strings. In this embodiment, ruled lines are not used to identify the title area but only to determine the vertical and horizontal cell boundaries, so they can easily be replaced by other information such as the spacing of the data.

The data extraction unit 22 reads and recognizes the character strings, numeric strings, and so on in each of the divided cells. Any commonly used character recognition method may be adopted here. Because the positions of the title cells are already known from the division into partial tables by the area dividing unit 18, the character strings and the like in the title cells are associated with the numeric strings and the like in the data cells to form the output data, which is provided to the output unit 64.

Next, the processing performed by the area dividing unit 18, the cell dividing unit 20, and the data extraction unit 22 is described concretely. The table and processing procedure shown here are examples and do not limit this embodiment. FIG. 3 shows an example of the structure of a table to be processed that the table extraction unit 16 has extracted from the document image data. The processing target table 70 includes title cells 72a, 72b, and 72c and data cells 74a, 74b, and 74c. In the figure, hatched cells are title cells and white cells are data cells; to avoid cluttering the figure, only three representative cells of each kind are labeled. Each cell contains a character string, numeric string, or the like, but these are not shown. In this example, the processing target table 70 consists of cells in five rows and five columns.

The title cells are the five cells making up the leftmost column, which includes title cell 72a; the top three cells of the fourth column from the left, which include title cell 72b; and the cells making up the third row from the top, which includes title cell 72c. The remaining cells are data cells: the four cells in two rows and two columns that include data cell 74a, the two cells in two rows and one column that include data cell 74b, and the eight cells in two rows and four columns that include data cell 74c.

The area dividing unit 18, the cell dividing unit 20, and the data extraction unit 22 perform the following processing on such a processing target table 70. FIGS. 4 to 7 schematically show how the area dividing unit 18 and the cell dividing unit 20 divide the processing target table 70. First, as shown in FIG. 4, the area dividing unit 18 identifies the title area 73 in the processing target table 70 shown in FIG. 3. As described above, identification is based on the presence or absence of a background color, whether a cell contains a character string or a numeric string, and so on, or a combination of these; the description here mainly covers, as an example, the case in which only the title cells have a background color.

At this point, the area dividing unit 18 generates the image data of the foreground-removed image 76 shown in FIG. 4, in which the foreground (the ruled lines and the character strings, numeric strings, and so on written in each cell) has been removed from the image data of the processing target table 70, and saves it in the memory 30. A common removal method may be used, for example generating a histogram of pixel densities to find a density threshold separating foreground from background, and replacing the values of pixels whose density exceeds the threshold with the values of neighboring pixels. Because the foreground-removed image 76 consists almost entirely of pixels having either the background color value or the white value, the overall shape of the title area 73 is obtained as a result. A binary image obtained by additionally applying noise removal may be used as the foreground-removed image 76.
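The histogram-and-replace approach can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of the embodiment: the image is a list of rows of density values in which a higher value is darker, the foreground is assumed to occupy only the single darkest level in the histogram, and the threshold is placed midway between the two darkest occupied levels. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional

Image = List[List[int]]  # density values; assumption: higher means darker

def foreground_threshold(img: Image) -> Optional[float]:
    """Pick a threshold between the two darkest occupied histogram levels."""
    histogram = Counter(v for row in img for v in row)
    levels = sorted(histogram)
    if len(levels) < 2:
        return None  # uniform image: nothing to separate
    return (levels[-1] + levels[-2]) / 2

def remove_foreground(img: Image) -> Image:
    """Replace pixels darker than the threshold with a neighboring value."""
    out = [row[:] for row in img]
    t = foreground_threshold(img)
    if t is None:
        return out
    h, w = len(out), len(out[0])
    pending = {(y, x) for y in range(h) for x in range(w) if out[y][x] > t}
    while pending:
        filled = {}
        for y, x in pending:
            # Look for a 4-neighbor that is not foreground (up, left, down, right).
            for dy, dx in ((-1, 0), (0, -1), (1, 0), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in pending:
                    filled[(y, x)] = out[ny][nx]
                    break
        if not filled:
            break  # safety stop; not expected when a background pixel exists
        for (y, x), v in filled.items():
            out[y][x] = v
        pending.difference_update(filled)
    return out
```

Filling proceeds inward from the edges of each foreground stroke, so thick ruled lines take several passes. A real scan would need a more robust threshold choice, since noise spreads the densities over many levels.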

When character type or shape is used as the title area determination method, the title area 73 may be identified using ruled lines or boundary lines based on cell spacing, and a binary image in which a predetermined pixel value, for example 1, is assigned to that area and a different pixel value, for example 0, is assigned to the other areas may be used as the foreground-removed image 76.
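As a small illustration of such a binary image, the Python sketch below builds a mask for the five-by-five table of FIG. 3 from the title cell positions given in the description. It works at cell resolution rather than pixel resolution, which is a simplification, and the name `title_mask` is hypothetical.

```python
from typing import Iterable, List, Tuple

def title_mask(n_rows: int, n_cols: int,
               title_cells: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Return a binary grid with 1 in title cells and 0 elsewhere."""
    mask = [[0] * n_cols for _ in range(n_rows)]
    for r, c in title_cells:
        mask[r][c] = 1
    return mask

# Title cells of the FIG. 3 example as (row, column), counted from 0.
fig3_titles = (
    [(r, 0) for r in range(5)]    # leftmost column (contains 72a)
    + [(r, 3) for r in range(3)]  # top three cells of the fourth column (72b)
    + [(2, c) for c in range(5)]  # third row from the top (72c)
)
fig3_mask = title_mask(5, 5, fig3_titles)

for row in fig3_mask:
    print("".join(str(v) for v in row))
```

The printed rows are 10010, 10010, 11111, 10000, and 10000, which is the shape of the title area 73. A pixel-level version would fill the rectangle of each title cell instead of a single grid entry.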

Next, as shown in FIG. 5, the area dividing unit 18 identifies the partial tables by matching the table structure template data 78 stored in the storage unit 60 against the foreground-removed image 76. As template data, image data is prepared for, for example, a table 78a containing only a title column, a table 78b containing only a title row, and a table 78c containing both a title row and a title column. The division into partial tables is then carried out by performing ordinary template matching starting from the upper left of the foreground-removed image 76. The template data is not limited to these structures; the templates needed may be determined in various ways, for example by trials on processing target tables of various structures. Templates consisting only of a title area or only of a data area may be included. Because the templates can be stretched or shrunk vertically and horizontally during pattern matching, the template data may be, for example, image data of square tables.
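The matching step can be approximated with the Python sketch below. It does not perform pixel-level template matching; as a swapped-in simplification, it checks the structure of a rectangular area of a cell-resolution title mask (1 for title, 0 for data) against the three template shapes 78a, 78b, and 78c, which has a similar effect to matching a freely stretched template. The function name and the use of the reference numerals as return labels are assumptions made for illustration.

```python
from typing import List, Optional

def match_template(mask: List[List[int]], top: int, left: int,
                   bottom: int, right: int) -> Optional[str]:
    """Classify the area mask[top:bottom][left:right] as 78a, 78b, 78c, or None."""
    rows = [row[left:right] for row in mask[top:bottom]]
    if not rows or not rows[0]:
        return None
    n_rows, n_cols = len(rows), len(rows[0])
    title_row = all(rows[0])                # first row entirely title
    title_col = all(r[0] for r in rows)     # first column entirely title

    def rest_is_data(skip_row: bool, skip_col: bool) -> bool:
        # True if every cell outside the skipped row/column is a data cell.
        return not any(
            rows[i][j]
            for i in range(1 if skip_row else 0, n_rows)
            for j in range(1 if skip_col else 0, n_cols)
        )

    if (n_rows >= 2 and n_cols >= 2 and title_row and title_col
            and rest_is_data(True, True)):
        return "78c"  # title row and title column
    if n_cols >= 2 and title_col and rest_is_data(False, True):
        return "78a"  # title column only
    if n_rows >= 2 and title_row and rest_is_data(True, False):
        return "78b"  # title row only
    return None

# Title mask of the FIG. 3 table (1 = title cell).
fig3 = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
print(match_template(fig3, 0, 0, 2, 3))  # upper-left area
print(match_template(fig3, 0, 3, 2, 5))  # upper-right area
print(match_template(fig3, 2, 0, 5, 5))  # lower area
```

With this mask, the upper-left and upper-right areas are classified as 78a and the lower area as 78c, while the table as a whole matches no template and therefore has to be divided.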

In FIG. 5, the upper-left area 82a and the upper-right area 82b of the foreground-removed image 76 match the table 78a containing only a title column, and the lower area 82c matches the table 78c containing both a title row and a title column. If an area matches none of the template data, for example because a title area lies at the far right of the table, that portion can be regarded as data cells and included in an adjacent partial table. Using template data in this way also makes it possible to screen out misrecognized title areas.

When the foreground-removed image 76 is matched against the template data 78, areas sharing the same upper-left corner may match more than one template, or there may be more than one matching area. To prepare for such cases, rules are set in advance as to which template and which area take priority. For example, a matching area that is wider than it is tall is given priority over one that is taller than it is wide. For the foreground-removed image 76a in FIG. 6, the areas matched by a template like 78c in FIG. 5 are the area 90 enclosed by the dotted line and the area 92 enclosed by the dash-dot line; the horizontally long area 90 is given priority and becomes the partial table. This rule is a rule of thumb drawn from experience with accurate partial table division.

The matching area with the larger area may also be given priority. Templates containing both a title area and a data area may be given priority, or a priority order may be assigned to the templates. The best of these rules, or the best combination of them, is set in advance through experiments or the like that take the expected processing target tables 70 into account.
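One possible way to encode two of these priority rules is sketched below in Python: horizontally long areas are preferred over vertically long ones, and among equally ranked areas the larger one wins. The combination and ordering of the two rules, the area representation `(top, left, bottom, right)`, and the name `choose_area` are assumptions for illustration; the description leaves the actual rule set to experiment.

```python
from typing import List, Tuple

Area = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive

def choose_area(candidates: List[Area]) -> Area:
    """Pick the candidate to keep: landscape first, then larger area."""
    def priority(area: Area) -> Tuple[bool, int]:
        top, left, bottom, right = area
        width, height = right - left, bottom - top
        return (width > height, width * height)
    return max(candidates, key=priority)

# Two areas sharing an upper-left corner, one tall and one wide.
tall = (0, 0, 4, 2)
wide = (0, 0, 2, 4)
print(choose_area([tall, wide]))
```

Here the wide area is selected even though both candidates cover the same number of cells, mirroring the preference for area 90 over area 92 in FIG. 6.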

Matching also need not start from the upper left of the foreground-removed image 76. For example, matching may be performed at each of the four corners of the foreground-removed image 76 (upper left, lower left, upper right, and lower right), the portion whose matched area is largest may be split off as a partial table, and the same matching may then be repeated.

The area dividing unit 18 saves the image data of the areas 82a, 82b, and 82c, which form the partial tables identified by matching against the template data as described above, in the memory 30 as data of separate, independent tables. The position of each of these areas relative to the processing target table 70 is also saved.

Next, for each partial table area saved in the memory 30, the cell dividing unit 20 applies the ruled lines, character strings, and numeric strings that were present in the processing target table 70, and generates the data of the partial tables 84a, 84b, and 84c, divided into cells as shown in FIG. 7. The character strings and numeric strings in each cell are again omitted from the figure. The partial tables 84a, 84b, and 84c are obtained by restoring each pixel value to the value of the corresponding pixel in the processing target table 70, based on the position of each of the areas 82a, 82b, and 82c relative to the processing target table 70. Where ruled lines exist, they are taken as the cell boundaries; where they do not, only the character strings and numeric strings are applied, and boundaries such as the center lines of the gaps between them are used to divide the table into cells.
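For the case without ruled lines, the center-line rule can be sketched in Python as follows. The sketch assumes the horizontal extents of the recognized strings are already available as `(start, end)` pixel ranges; the function name `column_boundaries` and this input format are hypothetical. The same function applied to vertical extents would give row boundaries.

```python
from typing import List, Tuple

def column_boundaries(spans: List[Tuple[int, int]]) -> List[float]:
    """Return the x positions of boundaries placed at the center of each gap."""
    merged: List[List[int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps the current column: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # One boundary midway between each pair of neighboring columns.
    return [(a[1] + b[0]) / 2 for a, b in zip(merged, merged[1:])]

# Strings from several rows that line up in three columns.
spans = [(10, 40), (12, 38), (60, 90), (110, 150), (62, 88)]
print(column_boundaries(spans))  # [50.0, 100.0]
```

Strings whose extents overlap are treated as belonging to the same column, so a string that spans two columns would merge them; handling such cases is outside this sketch.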

The data extraction unit 22 reads the character strings and numeric strings by applying ordinary character recognition to each cell of the partial tables 84a, 84b, and 84c. Because the boundary between title cells and data cells is already known, the character strings and the like in the title cells can easily be associated with the numeric strings and the like in the other cells. Because the arrangement of the title cells is also known, it is easy to tell whether the correspondence holds within a row or column or arises from the intersection of a row and a column.

FIG. 8 shows the procedure performed by the area dividing unit 18, the cell dividing unit 20, and the data extraction unit 22 described above. First, the area dividing unit 18 acquires the processing target table 70 extracted by the table extraction unit 16 (S10). Next, it identifies the title cell area based on the background color, the type of character string, and so on, and then generates the foreground-removed image 76 by removing foreground elements such as ruled lines, character strings, and number sequences from the processing target table 70 by a predetermined method (S12). It then identifies the partial-table regions in the foreground-removed image 76 and divides the image into them by pattern matching against the template data stored in the storage unit 60 (S14, S16).
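The matching in S14 and S16 can be illustrated with a deliberately simplified sketch. It assumes that the foreground-removed image has already been reduced to a small binary grid in which 1 marks the title area and 0 marks everything else, and it uses naive exact matching of a fixed-size template at every offset. The embodiment does not specify the matching algorithm at this level, so the grid representation, the function name `find_template`, and the two example templates are all assumptions made for illustration.

```python
def find_template(mask, template):
    """Return every (row, col) offset at which `template` matches `mask` exactly.

    Both arguments are lists of equal-length rows containing 0 or 1,
    where 1 stands for the title area and 0 for any other area.
    """
    mask_h, mask_w = len(mask), len(mask[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    hits = []
    for r in range(mask_h - tmpl_h + 1):
        for c in range(mask_w - tmpl_w + 1):
            # Compare the template row by row with the window at (r, c).
            if all(mask[r + i][c:c + tmpl_w] == template[i] for i in range(tmpl_h)):
                hits.append((r, c))
    return hits


# Hypothetical foreground-removed image: a title row on top and,
# further down, a title column on the left.
mask = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
header_row = [[1, 1], [0, 0]]   # titles above the data
header_col = [[1, 0], [1, 0]]   # titles to the left of the data

row_hits = find_template(mask, header_row)
col_hits = find_template(mask, header_col)
```

A practical implementation would need to tolerate scale differences, noise, and skew, which exact matching does not; the sketch only shows how the shape of the title area alone can locate candidate partial tables.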

Next, the cell dividing unit 20 divides each partial table into cells by applying to it the information that was present in the original processing target table 70, such as ruled lines, character strings, and number sequences (S18). The data extraction unit 22 then reads the character string or number sequence in each cell by a predetermined character recognition method and extracts the contents of the title cells and the data cells while associating them according to the correspondence defined by the matched template (S20). The data generated in this way, for example a csv file, can be fed into other software or compiled into a database, so that the contents of the table can be processed electronically as needed.
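The association step in S20 and the csv output can be sketched as follows. The example assumes that character recognition has already produced a grid of cell strings for each partial table and that the matched template tells us which of two layouts applies: titles in the top row only, or titles in both the top row and the left column with data at the intersections. The function names, the record layout, and the sample values are hypothetical and serve only to illustrate the two kinds of correspondence.

```python
import csv
import io


def column_headed_records(cells):
    """Titles occupy the first row; every later row is one record."""
    titles, *rows = cells
    return [dict(zip(titles, row)) for row in rows]


def cross_headed_records(cells):
    """Titles occupy the top row and the left column; data sit at intersections."""
    col_titles = cells[0][1:]
    records = []
    for row in cells[1:]:
        row_title, values = row[0], row[1:]
        for col_title, value in zip(col_titles, values):
            records.append({"row": row_title, "column": col_title, "value": value})
    return records


def to_csv(records):
    """Serialize a non-empty list of uniform dicts as csv text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical recognition results for two partial tables.
simple = column_headed_records([["Item", "Price"], ["Pen", "120"], ["Ink", "300"]])
crossed = cross_headed_records([["", "2005", "2006"], ["Sales", "10", "12"]])
simple_csv = to_csv(simple)
```

Flattening the intersection layout into one record per data cell is one possible design choice; it keeps both titles attached to each value so that downstream software does not need to know the original layout.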

According to the present embodiment described above, pattern matching is applied to the image data of the table to be processed with attention paid only to the shape that represents the arrangement of the title cells, so that partial tables are identified by an image-processing approach. As a result, even a table without ruled lines can be divided into partial tables with simple structures, which simplifies the subsequent analysis, namely reading the data and associating it. In addition, because the processing target table is analyzed globally, processing can proceed through data extraction without any special countermeasures even when the original image is distorted or rotated.

Furthermore, because the title area is derived graphically, the character strings written in the title cells need not be registered in a dictionary in advance. Because the processing target table is divided into partial regions using basic table structures prepared as template data, even a table with a complicated structure or a large table can be processed in the same way, and the overall table structure need not be registered in advance. Because the title area is identified by various methods that do not rely on the written character strings, preparatory steps such as having the apparatus read a table in which only the item names are filled in can be omitted. As a result, a highly versatile table recognition technique can be realized at low introduction cost.

The present invention has been described above based on an embodiment. Those skilled in the art will understand that this embodiment is an example, that various modifications can be made to the combinations of its constituent elements and processing steps, and that such modifications also fall within the scope of the present invention.

For example, the processing target table in the present embodiment has been described on the assumption that it is a raster image carrying only pixel value information, but the present invention can also be applied to image data carrying higher-order information. Higher-order information is, for example, rectangle fill information with position information and rectangular character string information with position information. For example, the data may consist of information that expresses each of the following by the xy coordinates of its upper-left and lower-right corners: a rectangular area with a background color, a rectangular area forming a ruled line with the line's width, and the minimum bounding rectangle enclosing a character string; together with character string information that includes character attribute information. For an image carrying such information as well, the foreground-removed image consisting of the information on the background-colored rectangular areas is first collated with the template data to divide the table into partial tables. The table is then divided into cells using the information on the rectangles forming ruled lines and the rectangles enclosing character strings, and the character strings and other content are read out cell by cell. In this way, as in the present embodiment, a highly versatile table recognition technique can be realized easily.
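One way to picture this higher-order representation is a list of rectangles, each carrying its corner coordinates and a kind. The sketch below builds a binary foreground-removed grid from the background-colored rectangles only, ignoring ruled-line and character-string rectangles. The `Rect` type, its field names, the `kind` labels, and the choice of inclusive lower-right coordinates are all assumptions introduced for this example and are not defined in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int    # inclusive x of the lower-right corner
    bottom: int   # inclusive y of the lower-right corner
    kind: str     # "fill" (background color), "rule" (ruled line) or "text"


def title_mask(rects, width, height):
    """Rasterize only the background-colored rectangles into a 0/1 grid."""
    mask = [[0] * width for _ in range(height)]
    for r in rects:
        if r.kind != "fill":
            continue  # ruled lines and text are foreground and are left out
        # Clip the rectangle to the grid before filling it.
        for y in range(max(r.top, 0), min(r.bottom, height - 1) + 1):
            for x in range(max(r.left, 0), min(r.right, width - 1) + 1):
                mask[y][x] = 1
    return mask


rects = [
    Rect(0, 0, 3, 0, "fill"),   # shaded title row
    Rect(0, 1, 3, 1, "rule"),   # ruled line under the titles
    Rect(1, 2, 2, 2, "text"),   # bounding box of a data string
]
mask = title_mask(rects, width=4, height=3)
```

The resulting grid can then be collated with template data in the same way as a foreground-removed raster image, while the "rule" and "text" rectangles remain available for the later division into cells.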

Similarly, the template data is not limited to a raster image and may be any information that represents the shape of the title area. For example, it may be a vector image, or text data containing character codes that represent rectangles together with line-break information. In the latter case, a black-filled rectangle can represent a title area and a white-filled rectangle a data area, for example. To collate a foreground-removed image with template data in a different data format, the higher-order data may be converted to match the side carrying lower-order information, for example, or some other general analysis method may be used. The data format of the template data is determined in view of the analysis method used for collation, the capacity of the storage unit that stores the data, and so on. This makes it possible to realize table recognition that accommodates a wider variety of data formats and storage capacities.
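A text-based template of the kind mentioned here could be converted into a binary grid for collation as sketched below. The specific characters `■` (title area) and `□` (data area) and the function name `parse_text_template` are assumptions chosen for illustration; the text only says that a black-filled rectangle and a white-filled rectangle can stand for the two kinds of area and that line breaks separate the rows.

```python
def parse_text_template(text, title_char="■", data_char="□"):
    """Convert a text template into rows of 1 (title area) and 0 (data area)."""
    grid = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines such as a trailing newline
        row = []
        for ch in line:
            if ch == title_char:
                row.append(1)
            elif ch == data_char:
                row.append(0)
            else:
                raise ValueError(f"unexpected character {ch!r} in template")
        grid.append(row)
    return grid


# A title row above two rows of data cells.
template = parse_text_template("■■■\n□□□\n□□□\n")
```

Storing templates as a few characters per row keeps them very small, which fits the remark that the data format may be chosen with the storage capacity in mind.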

FIG. 1 is a diagram showing the overall configuration of the document processing apparatus according to the present embodiment.
FIG. 2 is a diagram showing the structure of the document analysis unit of the present embodiment in more detail.
FIG. 3 is a diagram showing an example of the structure of a table to be processed that the table extraction unit has extracted from document image data in the present embodiment.
FIG. 4 is a diagram schematically showing the foreground-removed image that the area dividing unit of the present embodiment generates when dividing the processing target table.
FIG. 5 is a diagram schematically showing how the area dividing unit of the present embodiment divides the processing target table.
FIG. 6 is a diagram explaining an example of priorities in collation with templates.
FIG. 7 is a diagram schematically showing how the cell dividing unit of the present embodiment divides the processing target table.
FIG. 8 is a flowchart showing the procedure performed by the area dividing unit, the cell dividing unit, and the data extraction unit in the present embodiment.

Explanation of symbols

10 document analysis unit, 12 analysis processing unit, 14 image acquisition unit, 16 table extraction unit, 18 area dividing unit, 20 cell dividing unit, 22 data extraction unit, 60 storage unit, 62 input unit, 64 output unit, 70 processing target table, 73 title area, 76 foreground-removed image, 78 template data, 84 partial table, 100 document processing apparatus.

Claims

A document processing apparatus that recognizes a table included in image data of a document and reads out the contents written in the table, comprising:
a table extraction unit that extracts image data of a table to be processed from the image data of the document;
an area dividing unit that identifies the area of the item name column in the image data of the table to be processed by a predetermined determination method and, by performing image analysis on the overall shape of that area, divides the table to be processed into partial tables, each of which is included in the table to be processed and independently has a table format; and
a data extraction unit that reads the written contents from the item name column and the data column of each partial table and creates data in which the written contents are associated based on the correspondence between the item name column and the data column.

The document processing apparatus according to claim 1, further comprising a storage unit that stores one or more types of partial table structure candidates,
wherein the area dividing unit refers to the storage unit and matches the overall shape of the item name column area in the table to be processed against the shape of the item name column area in each structure candidate, thereby specifying the boundaries of the partial tables in the table to be processed and dividing the table.

The document processing apparatus according to claim 1 or 2, wherein the area dividing unit specifies, as the area of the item name column, an area in the table to be processed that satisfies any one of the following or a combination thereof: an area in which a background color is set, or an area whose background color attribute differs from the others.

The document processing apparatus according to claim 1, wherein the area dividing unit specifies, as the area of the item name column, an area in the table to be processed that satisfies any one of the following or a combination thereof: an area in which only a character string is written, an area in which the character string has a predetermined font, an area in which the character string has a predetermined character type, an area in which the character string attribute is different from the others, an area where the character string is different from the other, an area where the character string is different from the others, an area in which the character string has a predetermined number or range of characters, or an area in which the character string has a predetermined character pattern.

The document processing apparatus according to claim 1, wherein the area dividing unit specifies, as the area of the item name column, an area in the table to be processed that satisfies any one of the following or a combination thereof: an area adjacent to which a ruled line exists, or an area whose adjacent ruled line has an attribute that differs from the others.

A document processing method for recognizing a table included in image data of a document and reading out the contents written in the table, comprising:
extracting image data of a table to be processed from the image data of the document;
identifying the area of the item name column in the image data of the table to be processed by a predetermined determination method and, by performing image analysis on the overall shape of that area, dividing the table to be processed into partial tables, each of which is included in the table to be processed and independently has a table format; and
reading the written contents from the item name column and the data column of each partial table and creating data in which the written contents are associated based on the correspondence between the item name column and the data column.

The document processing method according to claim 6, wherein the dividing step includes:
generating binary image data in which the area of the item name column specified in the table to be processed and the other areas have different pixel values; and
specifying the boundaries of the partial tables in the table to be processed by matching the binary image data against template data that is stored in advance in a storage device and that represents one or more types of partial table structure candidates, with the item name column area and the other areas distinguished in the same manner as in the binary image.

A computer program for causing a computer to realize a function of recognizing a table included in image data of a document and reading out the contents written in the table, the computer program causing the computer to realize:
a function of extracting image data of a table to be processed from the image data of the document stored in a memory;
a function of identifying the area of the item name column in the image data of the table to be processed by a predetermined determination method and, by performing image analysis on the overall shape of that area, dividing the table to be processed into partial tables, each of which is included in the table to be processed and independently has a table format; and
a function of reading the written contents from the item name column and the data column of each partial table and creating data in which the written contents are associated based on the correspondence between the item name column and the data column.