JP2001331764A

JP2001331764A - Method for recognizing character

Info

Publication number: JP2001331764A
Application number: JP2001064972A
Authority: JP
Inventors: Junji Kashioka; 潤二柏岡; Satoshi Naoi; 聡直井
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-03-13
Filing date: 2001-03-08
Publication date: 2001-11-30

Abstract

PROBLEM TO BE SOLVED: To make character recognition results more convenient by recognizing the characters of cells that stand in a fixed relationship within the same row as one character string. SOLUTION: The table structure of a tabular document is recognized and its ruled lines are extracted; when the ruled line dividing adjacent cells in a row is a dotted line, the adjacent cells are integrated and recognized as one cell. After the adjacent cells are integrated, the dotted line dividing them may be deleted and the integrated cell recognized as a whole. Alternatively, after integration, the adjacent cells may be recognized individually and the recognition results combined. Cells can also be integrated when each adjacent cell is smaller than a fixed threshold and their shapes are similar. Furthermore, for cells in rows below an item area of the tabular document, the plural cells lying between the left and right ruled lines of the item area can be integrated row by row and recognized.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] In recent years, demand for character recognition processing as an input peripheral has been increasing. The present invention relates to a character recognition method that makes the character recognition results for the cells of a tabular document more convenient to use.

[0002]

[Description of the Related Art] Conventionally, to recognize the cells of a tabular document such as those shown in FIGS. 9 and 10, the ruled lines of the table were extracted, character recognition was performed for each cell delimited by the ruled lines, and the recognition results were held individually.

[0003]

[Problems to be Solved by the Invention] However, the characters in plural cells that stand in a fixed relationship within the same row, such as region A enclosed by a dotted line in FIGS. 9 and 10, are linked to one another. For example, "123456" in region A of FIG. 9 is not merely the individual digits "1", "2", ..., "6" placed side by side; it means 123,456. That is, the digits become meaningful as a character string only when they are linked together and recognized as a multi-digit number. Because the conventional method performed character recognition on each cell delimited by ruled lines and held the results individually, it could not obtain the recognition result as such a meaningful character string. Consequently, when results recognized by the conventional method were processed with, for example, spreadsheet software, the per-cell results had to be linked together and converted into a single meaningful digit string (character string). The present invention has been made in view of the above circumstances. Its object is to make character recognition results more convenient, for example by allowing them to be used directly in spreadsheet calculation, by automatically recognizing as a single character string the characters of cells in regions that stand in a fixed relationship within the same row.

[0004]

[Means for Solving the Problems] FIG. 1 outlines the present invention. As shown in FIG. 1, the present invention solves the above problem as follows.

(1) As shown in FIG. 1(a), the table structure of a tabular document is recognized and its ruled lines are extracted. Based on the extraction result, the ruled lines separating adjacent cells in each row are examined, and when the ruled line separating adjacent cells is a dotted line, the adjacent cells are integrated and character recognition is performed on them as one cell. For this recognition, after the adjacent cells are integrated, the dotted line separating them may be deleted from the image and the integrated cell recognized as a whole; alternatively, the adjacent cells may be recognized individually and the recognition results combined into one character string.

(2) As shown in FIG. 1(b), the table structure of a tabular document is recognized and its ruled lines are extracted. When each of the adjacent cells in a row is smaller than a fixed threshold and their shapes are similar, the cells are integrated and character recognition is performed on them as one cell. For this recognition, after the adjacent cells are integrated, the ruled line separating them may be deleted from the image and the integrated cell recognized as a whole; alternatively, the adjacent cells may be recognized individually and the recognition results combined into one character string.

(3) As shown in FIG. 1(c), the table structure of a tabular document is recognized and its ruled lines are extracted. An item area (for example, "Amount" (金額) in FIG. 1) is then extracted, and when the character recognition result for the item area matches a character string registered in advance, the plural cells lying between the left and right ruled lines of the item area are integrated row by row for the rows below the item area, and character recognition is performed.

By performing character recognition as described above, the characters of adjacent cells in a tabular document can be obtained as one meaningful character string.

[0005]

[Embodiments of the Invention] FIG. 2 is a processing block diagram of a first embodiment of the present invention. The present invention can be realized by an ordinary computer system equipped with a CPU, memory, an external storage device, an input/output device, a scanner for reading images, a recording medium reader, a communication interface, and the like. A program that performs the character recognition processing of the present invention is stored in the external storage device or the like. At run time the program is loaded into memory, the character recognition processing of the present invention is applied to an image read by the scanner or the like, and the recognition result is output from the input/output device.

[0006]

Next, the first embodiment of the present invention is described with reference to FIG. 2. This embodiment applies, for example, to character recognition in a region where characters are separated by vertical dotted ruled lines, such as region A in FIG. 9. In this embodiment, as shown in FIG. 2, a table structure recognition unit 11 first extracts ruled lines from the input image, obtains information about each ruled line, such as its length, its position, and attributes such as dotted or solid, and recognizes the structure of the table. As the ruled line extraction method, a known method such as the one described in Japanese Unexamined Patent Application Publication No. H9-50527 can be used. Any method of extracting ruled lines and recognizing the table structure may be used, provided that it yields, for the ruled lines, their positions and attributes such as dotted or solid and, for the table, its structure (the cells and rows it contains) and information on the ruled lines that form each cell.
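The information that table structure recognition is expected to deliver can be pictured as a small data model. The following is a minimal sketch only; the names `LineStyle`, `RuledLine`, `Cell`, and `row_from_lines` are hypothetical and do not come from the patent, and the sketch assumes that ruled line extraction has already been done by some other method.

```python
from dataclasses import dataclass
from enum import Enum


class LineStyle(Enum):
    """Attribute of a ruled line (assumed to be reported by the extractor)."""
    SOLID = "solid"
    DOTTED = "dotted"


@dataclass(frozen=True)
class RuledLine:
    """A vertical ruled line: position, extent, and dotted/solid attribute."""
    x: int        # horizontal position in pixels
    top: int      # y coordinate of the upper end
    bottom: int   # y coordinate of the lower end
    style: LineStyle

    @property
    def length(self) -> int:
        return self.bottom - self.top


@dataclass(frozen=True)
class Cell:
    """A cell of one row, described by the ruled lines on its two sides."""
    left: RuledLine
    right: RuledLine

    @property
    def width(self) -> int:
        return self.right.x - self.left.x


def row_from_lines(lines):
    """Build the cells of one row from its vertical ruled lines."""
    ordered = sorted(lines, key=lambda line: line.x)
    return [Cell(a, b) for a, b in zip(ordered, ordered[1:])]


lines = [
    RuledLine(40, 0, 20, LineStyle.DOTTED),
    RuledLine(0, 0, 20, LineStyle.SOLID),
    RuledLine(60, 0, 20, LineStyle.SOLID),
]
row = row_from_lines(lines)
```

Each cell keeps a reference to the ruled lines that bound it, so a later step can ask whether the line between two neighbouring cells is dotted without going back to the image.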

[0007]

Next, a cell integration processing unit 12 performs cell integration on the regions of the tabular document that are divided by vertical dotted lines (for example, region A in FIG. 9). As shown in the flowchart of FIG. 4, described later, cell integration proceeds row by row over the cells of the table: it checks whether the vertical ruled line separating adjacent cells is a dotted line and, if so, integrates the cells. This is done for all cells in the row. Next, a separating dotted line deletion unit 13 deletes the vertical dotted lines between the adjacent cells, and a character recognition unit 14 recognizes the integrated cell as a whole. In this way, the recognition result for the integrated cell is obtained as a character string. Any of the various character recognition methods proposed to date can be used.

[0008]

In FIG. 2, the separating dotted line deletion unit 13 deletes the vertical dotted lines between adjacent cells from the image. Alternatively, a character recognition result combining unit may be provided so that character recognition is performed cell by cell and the recognized characters are then joined into a character string. That is, as shown in FIG. 3, the table structure recognition unit 11 obtains information about each ruled line, such as its length, its position, and attributes such as dotted or solid, and recognizes the structure of the table as described above, and the cell integration processing unit 12 performs cell integration. The character recognition unit 14 then recognizes each individual cell of the integrated cell, and a character recognition result combining unit 15 joins the characters of the individual recognition results to obtain the character string for the integrated cell.
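The combining step of FIG. 3 amounts to concatenating the per-cell results in left-to-right order. A minimal sketch follows; the function name `combine_results` is hypothetical, and stripping surrounding whitespace from each result is an assumption of this sketch rather than something the text specifies.

```python
def combine_results(cell_results):
    """Join per-cell recognition results, given in left-to-right order."""
    # Whitespace stripping is an assumption; blank cells contribute nothing.
    return "".join(text.strip() for text in cell_results)


# The six single-digit cells of region A in FIG. 9.
digits = ["1", "2", "3", "4", "5", "6"]
amount = combine_results(digits)
```

The joined string can then be handed to spreadsheet software as one value, which is the convenience the description aims for.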

[0009]

Next, the cell integration processing of this embodiment is described with reference to the flowchart of FIG. 4. First, the cells in a row are sorted from left to right. That is, i is set to 1 (step S1) and it is checked whether i ≤ [number of rows] (step S2); if not, the processing ends. If i ≤ [number of rows], the cells of row i are sorted from left to right (step S3). Next, the cells are taken out in order, and it is checked whether the ruled line separating each cell from the next (adjacent) cell is a dotted line. That is, j is set to 1 (step S4) and it is checked whether j < [number of cells in row i] (step S5); if not, i is set to i + 1 (step S6) and the processing returns to step S2. If j < [number of cells in row i], k is set to j + 1 (step S7) and it is checked whether k ≤ [number of cells in row i] (step S8). If k ≤ [number of cells in row i] does not hold, the processing goes to step S6, i is set to i + 1 as above, and the processing returns to step S2.

[0010]

If k ≤ [number of cells in row i], the processing goes to step S9, which checks whether the rightmost cell integrated into the j-th cell is adjacent to the k-th cell and whether the ruled line separating the two is a vertical dotted line. If this condition is satisfied, these cells (the j-th cell and the k-th cell) are integrated in step S11, k is set to k + 1 (step S12), and the processing returns to step S8. If the condition is not satisfied, j is set to k (step S10) and the processing returns to step S5. The integration is repeated in this way for all cells in a row, and the row-wise integration is repeated for all rows.
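The loop of FIG. 4 can be sketched for a single row as follows. This is an illustrative sketch, not the patent's implementation: the `Cell` class and the function `merge_dotted` are hypothetical names, indices are 0-based rather than 1-based, and, unlike the flowchart, the sketch also returns cells that were not merged as single-cell groups.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    left: int          # x position of the left border
    right: int         # x position of the right border
    right_style: str   # "dotted" or "solid": style of the right border


def merge_dotted(row):
    """Group the cells of one row that are separated by dotted lines."""
    cells = sorted(row, key=lambda c: c.left)      # step S3
    groups = []
    j = 0                                          # step S4
    while j < len(cells):
        group = [cells[j]]
        k = j + 1                                  # step S7
        # Step S9: the rightmost cell already in the group must touch
        # cell k, and the line between them must be dotted.
        while (k < len(cells)
               and group[-1].right == cells[k].left
               and group[-1].right_style == "dotted"):
            group.append(cells[k])                 # step S11
            k += 1                                 # step S12
        groups.append(group)
        j = k                                      # step S10
    return groups


row = [
    Cell(0, 50, "solid"),
    Cell(50, 60, "dotted"),
    Cell(60, 70, "dotted"),
    Cell(70, 80, "solid"),
    Cell(80, 120, "solid"),
]
groups = merge_dotted(row)
```

Running the function once per row of the table reproduces the outer loop over i (steps S1, S2, and S6).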

[0011]

Next, a second embodiment of the present invention is described. FIG. 5 is a processing block diagram of the second embodiment. This embodiment applies to character recognition in a region, such as region A in FIG. 10, where the characters are separated by vertical ruled lines, the character size is at or below a threshold, and the characters have similar shapes. In this embodiment, as shown in FIG. 5, a table structure recognition unit 21 first extracts ruled lines from the input image, obtains information about each ruled line, such as its length, its position, and attributes such as dotted or solid, and recognizes the structure of the table. As noted above, a known method such as the one described in Japanese Unexamined Patent Application Publication No. H9-50527 can be used for ruled line extraction. As in the first embodiment, any method of extracting ruled lines and recognizing the table structure may be used, provided that it yields, for the ruled lines, their positions and attributes such as dotted or solid and, for the table, its structure (the cells and rows it contains) and information on the ruled lines that form each cell.

[0012]

Next, a cell integration processing unit 22 performs cell integration. As shown in the flowchart of FIG. 7, described later, cell integration proceeds row by row. That is, for each row, cells are taken out in order; if the cell taken out (cell 1) is at or below the threshold size, the next cell (cell 2) is taken out, and if cell 2 is at or below the threshold size and similar in shape to cell 1, the two cells are integrated. Whether two cells have the same shape can be judged, for example, by comparing their horizontal lengths, because cells in the same row have the same height: if the lengths differ by no more than a fixed amount, the cells can be treated as similar. Next, a separating ruled line deletion unit 23 deletes the vertical ruled lines between the adjacent cells, and a character recognition unit 24 recognizes the integrated cell as a whole. In this way, the recognition result for the integrated cell is obtained as a character string.
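The size and similarity test described above reduces to two comparisons on cell widths. The sketch below is illustrative only: the function names and the parameters `max_width` and `tolerance` are hypothetical, and the description does not give concrete values for either threshold.

```python
def is_small(width, max_width):
    """True when a cell is at or below the size threshold."""
    return width <= max_width


def is_similar(width_a, width_b, tolerance):
    """Cells in one row share a height, so only the widths are compared."""
    return abs(width_a - width_b) <= tolerance


def can_merge(width_a, width_b, max_width, tolerance):
    """Both cells must be small and their widths must nearly match."""
    return (is_small(width_a, max_width)
            and is_small(width_b, max_width)
            and is_similar(width_a, width_b, tolerance))
```

For example, with `max_width=15` and `tolerance=2`, two cells of widths 10 and 11 qualify for integration, while cells of widths 10 and 40 do not.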

[0013]

In FIG. 5, the separating ruled line deletion unit 23 deletes the vertical ruled lines between adjacent cells from the image. As in the first embodiment, however, a character recognition result combining unit may be provided so that character recognition is performed cell by cell and the recognized characters are then joined into a character string. That is, as shown in FIG. 6, the table structure recognition unit 21 obtains information about each ruled line, such as its length, its position, and attributes such as dotted or solid, and recognizes the structure of the table as described above, and the cell integration processing unit 22 performs cell integration. The character recognition unit 24 then recognizes each individual cell of the integrated cell, and a character recognition result combining unit 25 joins the characters of the individual recognition results to obtain the character string for the integrated cell.

[0014]

Next, the cell integration processing of this embodiment is described with reference to the flowchart of FIG. 7. First, the cells in a row are sorted from left to right. That is, i is set to 1 (step S1) and it is checked whether i ≤ [number of rows] (step S2); if not, the processing ends. If i ≤ [number of rows], the cells of row i are sorted from left to right (step S3). Next, the cells are taken out in order (cell 1). If cell 1 is at or below the threshold size, the next cell (the adjacent cell: cell 2) is taken out, and if cell 2 is at or below the threshold size and similar in shape to cell 1, the two cells are integrated. That is, j is set to 1 (step S4) and it is checked whether j < [number of cells in row i] (step S5); if not, i is set to i + 1 (step S6) and the processing returns to step S2. If j < [number of cells in row i], it is checked whether the j-th cell is at or below the threshold size (step S7); if it is, k is set to j + 1 (step S8) and it is checked whether k ≤ [number of cells in row i] (step S9). If k ≤ [number of cells in row i] does not hold, the processing goes via step S5 to step S6, i is set to i + 1 as above, and the processing returns to step S2. If, in step S7, the j-th cell is not at or below the threshold size, the processing goes to step S10, j is set to j + 1, and the processing returns to step S5.

[0015]

If k ≤ [number of cells in row i] in step S9, the processing goes to step S11, which checks whether the k-th cell is at or below the threshold size, is adjacent to the rightmost cell integrated into the j-th cell, and is similar to it in shape. If this condition is satisfied, these cells (the j-th cell and the k-th cell) are integrated in step S13, k is set to k + 1 (step S14), and the processing returns to step S9. If the condition is not satisfied, j is set to k (step S12) and the processing returns to step S5. The integration is repeated in this way for all cells in a row, and all rows of the table are processed. The integrated cells are then taken out and character recognition is performed on each of them.
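The loop of FIG. 7 differs from that of FIG. 4 only in the merge condition. The following single-row sketch is illustrative: `merge_small_similar`, `max_width`, and `tolerance` are hypothetical names, cells are plain `(left, right)` tuples, indices are 0-based, similarity is judged against the rightmost cell already in the group, and unmerged cells are returned as single-cell groups.

```python
def width(cell):
    """Width of a cell given as a (left, right) tuple."""
    return cell[1] - cell[0]


def merge_small_similar(row, max_width, tolerance):
    """Group adjacent cells of one row that are small and similar in width."""
    cells = sorted(row)                              # step S3
    groups = []
    j = 0                                            # step S4
    while j < len(cells):
        group = [cells[j]]
        k = j + 1                                    # step S8
        if width(cells[j]) <= max_width:             # step S7
            # Step S11: cell k must be small, touch the rightmost cell
            # of the group, and have nearly the same width.
            while (k < len(cells)
                   and width(cells[k]) <= max_width
                   and group[-1][1] == cells[k][0]
                   and abs(width(group[-1]) - width(cells[k])) <= tolerance):
                group.append(cells[k])               # step S13
                k += 1                               # step S14
        groups.append(group)
        j = k                                        # steps S10 and S12
    return groups


row = [(0, 50), (50, 60), (60, 70), (70, 81), (81, 130)]
groups = merge_small_similar(row, max_width=15, tolerance=2)
```

In this example the three narrow cells in the middle form one group, while the wide cells on either side stay on their own.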

[0016]

Next, a third embodiment of the present invention is described. FIG. 8 is a processing block diagram of the third embodiment. This embodiment applies to character recognition of a tabular document in which the first row of the table contains character strings indicating the attributes of the cells in the rows below, as in regions B1, B2, and B3 of FIG. 9. In this embodiment, as shown in FIG. 8, a table structure recognition unit 31 first recognizes the table structure of the input image and obtains the cell information and ruled line information of the table. As in the first and second embodiments, a known method such as the one described in Japanese Unexamined Patent Application Publication No. H9-50527 can be used for table structure recognition. Any method of extracting ruled lines and recognizing the table structure may be used, provided that it yields, for the ruled lines, information such as their positions and lengths and, for the table, its structure (the cells and rows it contains) and information on the ruled lines that form each cell.

[0017]

Next, an item area extraction unit 32 extracts the item areas of the table. Here, an item area is a cell in the first row of the table, such as regions B1, B2, and B3 in FIG. 9, that contains a character string indicating the attribute of the cells in the second and subsequent rows. For example, in FIG. 9 the character string "Amount" (金額) in region B3 indicates that the cells below it contain numbers representing amounts of money. An item area need not be a cell in the first row of the table; it may be any cell, extracted according to preset conditions, that contains a character string indicating the attribute of a group of other related cells. Next, an item area character recognition unit 33 performs character recognition on the extracted item area, and an item area character recognition result collation unit 34 collates the result with character strings registered in advance. For example, in the case of FIG. 9, when the characters "Amount" in the item area are recognized, they are collated with the registered character strings.

[0018]

If the collation finds a match, a cell integration unit 35 integrates the plural cells that lie between the left and right vertical ruled lines of the item area in each row below the item area. In FIG. 9, for example, the digits "1", "2", ..., "6" in a row below the "Amount" item are thereby integrated into "123456". Once the cells have been integrated as described above, a character recognition unit 36 performs character recognition on the integrated cell. In this recognition, as shown in FIGS. 2 and 5, the vertical lines separating the individual cells of the integrated cell may be deleted from the image and the integrated cell recognized as a whole, so that the recognition result for the integrated cell is obtained as a character string. Alternatively, as shown in FIGS. 3 and 6, each individual cell of the integrated cell may be recognized separately and the recognition results combined to obtain the character string. The character recognition methods of the first to third embodiments may each be used alone on the tabular document to be recognized, or they may be used in combination.
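The third embodiment can be sketched as a header check followed by a row-wise merge of the cells under the header. This is an illustrative sketch under stated assumptions: the function `merge_under_header` and the registered set are hypothetical, cells are `(left, right, text)` tuples whose text is assumed to have been recognized cell by cell already, and exact string equality is assumed for the collation.

```python
def merge_under_header(header, header_text, body_rows, registered):
    """Merge, row by row, the cells lying between the header's side lines.

    header      -- (left, right) x positions of the item area's ruled lines
    header_text -- recognition result for the item area
    body_rows   -- rows below the item area; each row is a list of
                   (left, right, text) tuples
    registered  -- set of character strings registered in advance
    """
    if header_text not in registered:
        return None                      # no match: nothing is merged
    left, right = header
    merged = []
    for row in body_rows:
        inside = sorted(c for c in row if left <= c[0] and c[1] <= right)
        merged.append("".join(text for _, _, text in inside))
    return merged


registered = {"金額"}
row1 = [(0, 100, "pen")] + [(100 + 10 * i, 110 + 10 * i, str(i + 1))
                            for i in range(6)]
row2 = [(0, 100, "ink")] + [(100 + 10 * i, 110 + 10 * i, t)
                            for i, t in enumerate(["", "", "7", "8", "9", "0"])]
result = merge_under_header((100, 160), "金額", [row1, row2], registered)
```

Here the sketch follows the combine-after-recognition route; the alternative of deleting the separating lines and recognizing the merged region as one image would replace the string join with a single recognition call.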

[0019]

[Effects of the Invention] As described above, according to the present invention, the individual characters of cells in the same row that are separated by dotted lines can be obtained automatically as a meaningful character string; the individual characters of adjacent cells in the same row that are smaller than a threshold and similar in shape can be obtained automatically as a meaningful character string; and, when the character string of an item area matches a character string registered in advance, the characters of the plural cells in each row below the item area can be obtained automatically, row by row, as a meaningful character string. Consequently, when the recognition results are used in, for example, spreadsheet software, no processing to integrate plural cells is needed, and the character recognition results are more convenient to use.

[Brief description of the drawings]

FIG. 1 is a diagram outlining the present invention.

FIG. 2 is a processing block diagram of the first embodiment of the present invention.

FIG. 3 is a processing block diagram for the case where the character string recognition of the first embodiment is replaced by a method that recognizes each individual cell of the integrated cell separately and combines the recognized characters.

FIG. 4 is a flowchart of the cell integration processing of the first embodiment of the present invention.

FIG. 5 is a processing block diagram of the second embodiment of the present invention.

FIG. 6 is a processing block diagram for the case where the character recognition of the second embodiment is replaced by a method that recognizes each individual cell of the integrated cell separately and combines the recognized characters.

FIG. 7 is a flowchart of the cell integration processing of the second embodiment of the present invention.

FIG. 8 is a processing block diagram of the third embodiment of the present invention.

FIG. 9 is a diagram (1) showing an example of a tabular document to be recognized by the present invention.

FIG. 10 is a diagram (2) showing an example of a tabular document to be recognized by the present invention.

[Explanation of symbols]

11, 21, 31: Table structure recognition unit
12, 22: Cell integration processing unit
13: Separating dotted line deletion unit
14, 24, 36: Character recognition unit
15, 25: Character recognition result combining unit
23: Separating ruled line deletion unit
32: Item area extraction unit
33: Item area character recognition unit
34: Item area character recognition result collation unit
35: Cell integration unit

Continued from the front page. F-terms (reference): 5B029 AA01 BB02 CC18 CC27 CC30 EE12; 5B064 AA01 AB09 AB13 BA01 CA08

Claims

1. A character recognition method for performing character recognition on cells of a tabular document that are delimited by ruled lines, wherein, when a ruled line separating adjacent cells in a row of the tabular document is a dotted line, the adjacent cells are integrated and character recognition is performed on them as one cell.

2. A character recognition method for performing character recognition on cells of a tabular document that are delimited by ruled lines, wherein, when adjacent cells in a row of the tabular document are each smaller than a fixed threshold and are similar in shape, the cells are integrated and character recognition is performed on them as one cell.

3. A character recognition method for performing character recognition on cells of a tabular document that are delimited by ruled lines, wherein the table structure of the tabular document is recognized, an item area of the table is extracted from the table structure recognition result, and, when the character recognition result for the item area matches a character string registered in advance, plural cells lying between the left and right ruled lines of the item area are integrated row by row for the rows below the item area and character recognition is performed on them as one cell.

4. A character recognition program for performing character recognition on cells of a tabular document that are delimited by ruled lines, the program causing a computer to execute processing that, when a ruled line separating adjacent cells in a row of the tabular document is a dotted line, integrates the adjacent cells and performs character recognition on them as one cell.

5. A character recognition program for performing character recognition on cells of a tabular document that are delimited by ruled lines, the program causing a computer to execute processing that, when adjacent cells in a row of the tabular document are each smaller than a fixed threshold and are similar in shape, integrates the cells and performs character recognition on them as one cell.

6. A character recognition program for performing character recognition on cells of a tabular document that are delimited by ruled lines, the program causing a computer to execute processing that recognizes the table structure of the tabular document, extracts an item area of the table from the table structure recognition result, and, when the character recognition result for the item area matches a character string registered in advance, integrates row by row, for the rows below the item area, plural cells lying between the left and right ruled lines of the item area and performs character recognition on them as one cell.

7. A recording medium on which is recorded a character recognition program for performing character recognition on cells of a tabular document that are delimited by ruled lines, wherein the character recognition program, when a ruled line separating adjacent cells in a row of the tabular document is a dotted line, integrates the adjacent cells and performs character recognition on them as one cell.

8. A recording medium on which is recorded a character recognition program for performing character recognition on cells of a tabular document that are delimited by ruled lines, wherein the character recognition program, when adjacent cells in a row of the tabular document are each smaller than a fixed threshold and are similar in shape, integrates the cells and performs character recognition on them as one cell.

9. A recording medium on which is recorded a character recognition program for performing character recognition on cells of a tabular document that are delimited by ruled lines, wherein the character recognition program recognizes the table structure of the tabular document, extracts an item area of the table from the table structure recognition result, and, when the character recognition result for the item area matches a character string registered in advance, integrates row by row, for the rows below the item area, plural cells lying between the left and right ruled lines of the item area and performs character recognition on them as one cell.