JP2002236921A - Document image recognition method, document image recognition device and recording medium

Info

Publication number: JP2002236921A
Application number: JP2001031216A
Authority: JP
Inventors: Tsukasa Kouchi; 司幸地
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-02-07
Filing date: 2001-02-07
Publication date: 2002-08-23

Abstract

PROBLEM TO BE SOLVED: To specify a character color without performing OCR by identifying the regions of a color document image. SOLUTION: A document image is input as a color digital image (step S1). The background color of the document image is specified (step S2). Pixels outside the background region are extracted from the document image by using the background color, connected components are generated by merging those pixels, and the connected components are classified by their shape and color features into regions such as characters, ruled lines, figures, and photographs. Character rectangle data is acquired for each region identified as characters (step S4). A representative color is calculated from the values of the pixels that remain after the pixels equivalent to the background color are removed from the character rectangle, and this representative color is taken as the character color of the character rectangle.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a document image recognition method, a document image recognition device, and a computer-readable recording medium, and more particularly to a technique for identifying the regions of a color document image and specifying the character color. The specified character color information is applied to title/keyword extraction and to the generation of summaries.

[0002]

2. Description of the Related Art As a conventional technique for extracting characters from a color document image and specifying their color, Japanese Patent Laid-Open No. Hei 7-121733 (document image processing apparatus), for example, identifies the regions of a color image and then classifies the small regions belonging to a character region by color, cluster by cluster. It determines a discrimination criterion from the average density of each of the C, M, Y, and K colors in a small region and decides the character color according to whether the density is above or below that criterion. However, because this criterion is based on density, it is difficult to apply to character regions in which the density difference between the background and the characters is small (for example, black characters on a green background).

[0003] Japanese Patent Laid-Open No. Hei 8-123901 (character extraction device and character recognition device using the device) first binarizes a color image and then extracts characters from the binary image. Color information is therefore not reflected in character extraction at all. As with the technique above, it is difficult to accurately extract character regions in which the density difference between the background and the characters is small, and the method naturally cannot handle character regions written in white characters on a black background.

[0004]

SUMMARY OF THE INVENTION The present invention has been made in view of the problems described above, and its objects are: 1. to specify the character color without performing OCR; 2. to handle characters whose density differs only slightly from that of the background; 3. to specify white characters as well; 4. to provide a method of presenting the character color specification result that is easy for the user to understand; 5. to extract document titles, keywords, and the like without performing OCR; and 6. to create a summary efficiently while keeping OCR to a minimum. Objects 2 and 3 can be achieved because the method described in Japanese Patent Application No. 2000-032298 makes it possible to extract arbitrary character rectangles from a color image, to which a character color is then assigned.

[0005]

Means for Solving the Problems The invention of claim 1 is a document image recognition method in which a document image is input as a color digital image, color information is extracted from the document image to specify its background color, pixels outside the background region are extracted from the document image as small chunks by using the background color, the extracted chunks are merged to generate connected components, and the connected components are classified, on the basis of their shape and color features, into regions such as characters, ruled lines, figures, and photographs, the method being characterized in that, for a character region, which is a region classified as characters, the character color of the character region is specified from the color digital image.

[0006] The invention of claim 2 is characterized in that, in the invention of claim 1, when the character color is specified and a background color is present within a character rectangle classified as the character region, pixels whose values are approximately equal to the background color in the character rectangle are set to 0 and all other pixels are set to 1, a representative color is determined from the set of pixels having the value 1, and the determined representative color is taken as the character color of the character rectangle.

[0007] The invention of claim 3 is characterized in that, in the invention of claim 1, once the character colors have been specified, a character color is assigned to each classified character rectangle, a distribution chart of the character colors is created, and a color satisfying a specific condition based on the distribution chart is taken as the representative character color of the document.

[0008] The invention of claim 4 is characterized in that, in the invention of claim 1, when regions such as characters, ruled lines, figures, and photographs are each presented to the user as rectangles in the document, the rectangle representing a character region is drawn in the character color specified according to claim 1.

[0009] The invention of claim 5 is characterized in that, in the invention of claim 3, if there is a character string written in a color other than the representative character color of the document, that character string is taken as a title or keyword of the document.

[0010] The invention of claim 6 is characterized in that, in the invention of claim 3, if there is a character string written in a color other than the representative character color of the document, the entire region containing that character string is subjected to OCR, and a summary of the document is created from the character codes obtained from the plurality of OCR-processed regions.

[0011] The invention of claim 7 is a document image recognition device having image input means for inputting a document image as a color digital image, and region classification means for extracting color information from the document image to specify its background color, extracting pixels outside the background region from the document image as small chunks by using the background color, merging the extracted chunks to generate connected components, and classifying the connected components, on the basis of their shape and color features, into regions such as characters, ruled lines, figures, and photographs, the device being characterized by having character color specifying means for specifying, for a character region, which is a region classified as characters, the character color of the character region from the color digital image.

[0012] The invention of claim 8 is characterized in that, in the invention of claim 7, when the character color is specified and a background color is present within a character rectangle classified as the character region, pixels whose values are approximately equal to the background color in the character rectangle are set to 0 and all other pixels are set to 1, a representative color is determined from the set of pixels having the value 1, and the determined representative color is taken as the character color of the character rectangle.

[0013] The invention of claim 9 is characterized in that, in the invention of claim 7, once the character colors have been specified, a character color is assigned to each classified character rectangle, a distribution chart of the character colors is created, and a color satisfying a specific condition based on the distribution chart is taken as the representative character color of the document.

[0014] The invention of claim 10 is characterized in that, in the invention of claim 7, when regions such as characters, ruled lines, figures, and photographs are each presented to the user as rectangles in the document, the rectangle representing a character region is drawn in the character color specified according to claim 7.

[0015] The invention of claim 11 is characterized in that, in the invention of claim 9, the device has title/keyword extraction means that, if there is a character string written in a color other than the representative character color of the document, takes that character string as a title or keyword of the document.

[0016] The invention of claim 12 is characterized in that, in the invention of claim 9, the device has OCR means that, if there is a character string written in a color other than the representative character color of the document, performs OCR on the entire region containing that character string, and summary creation means for creating a summary of the document from the character codes obtained from the plurality of OCR-processed regions.

[0017] The invention of claim 13 is a computer-readable recording medium recording a program that causes a computer to execute a document image recognition method comprising an image input step of inputting a document image as a color digital image, and a region classification step of extracting color information from the document image to specify its background color, extracting pixels outside the background region from the document image as small chunks by using the background color, merging the extracted chunks to generate connected components, and classifying the connected components, on the basis of their shape and color features, into regions such as characters, ruled lines, figures, and photographs, the recording medium recording a program that causes the computer to execute a character color specifying step of specifying, for a character region, which is a region classified as characters, the character color of the character region from the color image.

[0018] The invention of claim 14 is a computer-readable recording medium recording a program that causes a computer to execute the document image recognition method according to claim 2, the program causing the computer to execute a step in which, when the character color is specified and a background color is present within a character rectangle classified as the character region, pixels whose values are approximately equal to the background color in the character rectangle are set to 0 and all other pixels are set to 1, a representative color is determined from the set of pixels having the value 1, and the determined representative color is taken as the character color of the character rectangle.

[0019] The invention of claim 15 is a computer-readable recording medium recording a program that causes a computer to execute the document image recognition method according to claim 3, the program causing the computer to execute a step in which, once the character colors have been specified, a character color is assigned to each classified character rectangle, a distribution chart of the character colors is created, and a color satisfying a specific condition based on the distribution chart is taken as the representative character color of the document.

[0020] The invention of claim 16 is a computer-readable recording medium recording a program that causes a computer to execute the document image recognition method according to claim 4, the program causing the computer to execute a step in which, when regions such as characters, ruled lines, figures, and photographs are each presented to the user as rectangles in the document, the rectangle representing a character region is drawn in the character color specified according to claim 1.

[0021] The invention of claim 17 is a computer-readable recording medium recording a program that causes a computer to execute the document image recognition method according to claim 5, the program causing the computer to execute a title/keyword extraction step in which, if there is a character string written in a color other than the representative character color of the document, that character string is taken as a title or keyword of the document.

[0022] The invention of claim 18 is a computer-readable recording medium recording a program that causes a computer to execute the document image recognition method according to claim 6, the program causing the computer to execute an OCR step in which, if there is a character string written in a color other than the representative character color of the document, the entire region containing that character string is subjected to OCR, and a summary creation step of creating a summary of the document from the character codes obtained from the plurality of OCR-processed regions.

[Detailed description of the invention]

[0023]

BEST MODE FOR CARRYING OUT THE INVENTION As an application of the method described in Japanese Patent Application No. 2000-032298, the present invention automatically specifies the character colors in a color document image. Because the character color can be specified directly from the color image, the invention has the advantage that OCR is unnecessary. For relatively large characters, the character color can be reproduced almost character by character; for small characters, it can be specified line by line or region by region. In particular, the method copes well with reversed (white-on-dark) characters.

[0024] (Embodiment) FIG. 1 is a diagram showing a configuration example of a document image recognition device to which the present invention is applied. In the figure, 1 is image input means, 2 is region classification means, 3 is character color specifying means, 4 is title/keyword extraction means, 5 is a title/keyword DB, 6 is OCR means, 7 is summary creation means, and 8 is a summary DB. First, a color document image is input through the image input means 1. The region classification means 2 then specifies the background color of the whole document, performs region identification on the color document image, and obtains character rectangle information. Finally, the character color specifying means 3 specifies the character color of each character rectangle obtained.

[0025] FIG. 2 is a flowchart illustrating an example of the process of specifying a character color in the document image recognition device to which the present invention is applied. First, a color document image is input through the image input means 1 (step S1). The region classification means 2 specifies the background color of the whole document (step S2), performs region identification on the color document image (step S3), and obtains character rectangle information (step S4). Next, the pixels corresponding to the background color are removed from within the character rectangle on the original color image, and the values of the remaining pixels are obtained (steps S5 and S6). The representative color of the obtained pixel values is taken as the character color of the character rectangle (step S7). In the embodiment shown in FIG. 2, the average of all the obtained pixel values is used as the character color, but a histogram of density values may be generated instead and its median used as the character color.
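Steps S5 to S7 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in the patent: the function names, the per-channel `tolerance` used to decide that a pixel is "substantially the same" as the background, and the representation of a character rectangle as a flat list of RGB tuples are all assumptions made for the example.

```python
from statistics import mean
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]  # (R, G, B), each 0-255


def is_background(pixel: Color, background: Color, tolerance: int = 30) -> bool:
    """Return True when every channel lies within `tolerance` of the background.

    The tolerance value is an assumption; the patent text only speaks of
    pixels with "substantially the same" value as the background color.
    """
    return all(abs(p - b) <= tolerance for p, b in zip(pixel, background))


def character_color(rect_pixels: List[Color], background: Color,
                    tolerance: int = 30) -> Optional[Color]:
    """Steps S5-S7: discard background-like pixels, then average the rest."""
    foreground = [p for p in rect_pixels
                  if not is_background(p, background, tolerance)]
    if not foreground:
        return None  # nothing but background inside the rectangle
    # zip(*foreground) regroups the pixels into R, G and B channel sequences.
    r, g, b = (round(mean(channel)) for channel in zip(*foreground))
    return (r, g, b)


# Dark-gray background with one white and one light-gray character pixel.
pixels = [(40, 40, 40), (42, 38, 41), (255, 255, 255), (205, 205, 205)]
print(character_color(pixels, background=(40, 40, 40)))  # (230, 230, 230)
```

Because the background is removed by color similarity rather than by a density threshold, the same sketch works unchanged for white characters on a dark background, which is the case the prior art could not handle.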

[0026] FIG. 3 is a diagram showing an example of how the character color of an arbitrary character rectangle is specified. In the figure, 10 is a character rectangle, which consists of pixels 10a corresponding to the background color (dark gray pixels) and pixels 10b (white pixels) and 10c (light gray pixels) constituting the character. When the character color is specified, only regions identified as characters by region identification are processed. In the left part of FIG. 3, the original character rectangle 10 contains the character "A". The background color of the character "A" has been obtained in advance in step S2 described above. The middle part of FIG. 3 shows the state in which the pixels 10a corresponding to the background color have been removed from the character rectangle 10 and the remaining pixels (pixels 10b and 10c) have been obtained. A single representative color is determined from these pixels and taken as the character color of the character rectangle 10. The right part of FIG. 3 shows the state in which the color of the pixels 10b has been specified as the character color. Here, the character color may be specified by taking the average of each RGB plane, or by generating a histogram of density values and taking its median.

[0027] To specify the representative character color of the document, the character colors of all the character rectangles obtained by the region classification means 2 are specified first, and then a single representative character color for the document is determined. For example, a histogram of luminance is created and the color corresponding to the median is taken as the representative character color.
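One way to read the luminance-median example is sketched below. It is illustrative only and rests on several assumptions that the text does not state: luminance is computed with the ITU-R BT.601 weights, each character rectangle counts once regardless of its size, and the lower median is used when the number of rectangles is even. The function names are hypothetical.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]  # (R, G, B), each 0-255


def luminance(color: Color) -> float:
    """Approximate luminance using the BT.601 weights (an assumption)."""
    r, g, b = color
    return 0.299 * r + 0.587 * g + 0.114 * b


def representative_character_color(rect_colors: List[Color]) -> Color:
    """Return the rectangle color whose luminance is the (lower) median."""
    if not rect_colors:
        raise ValueError("no character rectangles")
    ordered = sorted(rect_colors, key=luminance)
    return ordered[(len(ordered) - 1) // 2]


# Five black body-text rectangles, one red and one blue emphasized rectangle.
colors = [(0, 0, 0)] * 5 + [(255, 0, 0), (0, 0, 255)]
print(representative_character_color(colors))  # (0, 0, 0)
```

Returning an actual rectangle color, rather than a synthetic average, keeps the result comparable with the per-rectangle colors when emphasized strings are searched for later.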

[0028] FIG. 4 is a diagram showing an example of a GUI screen that presents the character color specification result to the user. After all the character colors of the input document have been obtained from the character color specifying means 3, the result is presented to the user. FIG. 4 shows an actual Windows screen. When all the character rectangles identified by the region classification means 2 are displayed on this screen, each rectangle is drawn in the color specified by the character color specifying means 3. Although this is hard to see because the drawing is printed in black and white, in the upper part of FIG. 4 the first character string, "Erfatec", is enclosed by a blue rectangle and the second character string by a red rectangle. The lower part of FIG. 4 shows a screen displaying the whole document.

[0029] FIG. 5 is a diagram showing an example of the behavior when the mouse pointer is over a character rectangle. FIG. 5(A) shows the state in which the mouse pointer is not over a character rectangle; normally the rectangle is either not displayed or is displayed with a dotted line or in a light color. FIG. 5(B) shows the state in which the mouse pointer is over a character rectangle; the rectangle either changes to the character color or is filled semi-transparently with the character color.

[0030] FIG. 6 is a flowchart illustrating an example of the process of specifying titles and keywords in the document image recognition device to which the present invention is applied. The following describes a method of extracting titles and keywords from a document image at high speed by using the result obtained from the character color specifying means 3. First, character rectangle data is obtained from the region classification means 2 (step S11). The character rectangle data whose color differs from the representative character color of the document is obtained from the character color specifying means 3 (step S12). These are called title/keyword candidate character rectangles. A title character rectangle is then selected (step S13); for example, the cluster of character rectangle data located nearest the upper left of the document is taken as the title character rectangle. Whether to perform OCR on the title/keyword candidate character strings is selected according to an instruction from the user (step S14).

[0031] If the answer in step S14 is YES (OCR is performed), the OCR means 6 performs OCR on the title/keyword candidate character strings (step S15). The final title/keyword character strings are obtained from the OCR result (step S16). In this case, if, for example, the OCR means 6 provides an OCR confidence value, character strings whose confidence is below a certain level may be excluded from the final result. The final title/keyword character strings are stored in the title/keyword DB 5 (step S17).

[0032] If the answer in step S14 is NO (no OCR), the final title/keyword character strings are obtained as images (step S18) and stored in the title/keyword DB 5 (step S17). Table 1 below shows an example of the data format used when storing them in the title/keyword DB 5. When OCR is not performed, the character string field is left empty and the image is presented to the user.

[0033]

[Table 1]
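The flow of steps S11 to S18 can be sketched as follows. This is a hedged illustration under several assumptions: the `CharRect` type, the `tolerance` used to decide that a color differs from the representative color, the ordering by `(y, x)` to find the "upper-left" rectangle, and the `ocr` callable are all hypothetical. The record fields are also invented for the example, because the contents of Table 1 are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Color = Tuple[int, int, int]


@dataclass
class CharRect:
    """A character rectangle with its specified character color (hypothetical)."""
    x: int
    y: int
    width: int
    height: int
    color: Color


def differs(a: Color, b: Color, tolerance: int = 30) -> bool:
    """True when any channel differs by more than `tolerance` (an assumption)."""
    return any(abs(p - q) > tolerance for p, q in zip(a, b))


def extract_title_keywords(
    rects: List[CharRect],
    representative: Color,
    ocr: Optional[Callable[[CharRect], str]] = None,
    tolerance: int = 30,
) -> List[Dict[str, object]]:
    # Step S12: keep rectangles whose color is not the representative color.
    candidates = [r for r in rects if differs(r.color, representative, tolerance)]
    if not candidates:
        return []
    # Step S13: treat the candidate nearest the top-left as the title.
    title = min(candidates, key=lambda r: (r.y, r.x))
    records = []
    for rect in candidates:
        # Steps S15/S18: run OCR if requested, otherwise leave the text empty
        # so that the image itself can be shown to the user.
        text = ocr(rect) if ocr is not None else ""
        kind = "title" if rect is title else "keyword"
        records.append({"kind": kind, "rect": rect, "text": text})
    return records


rects = [
    CharRect(10, 10, 200, 40, (255, 0, 0)),   # red heading
    CharRect(10, 80, 400, 20, (0, 0, 0)),     # black body text
    CharRect(50, 120, 60, 20, (0, 0, 255)),   # blue emphasized word
]
records = extract_title_keywords(rects, representative=(0, 0, 0))
print([rec["kind"] for rec in records])  # ['title', 'keyword']
```

Passing `ocr=None` corresponds to the NO branch of step S14: candidates are found from color alone, and every `text` field stays empty.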

[0034] FIG. 7 is a diagram showing an example of selecting the character regions needed to generate a summary according to the present invention. In the figure, 11 is an input document consisting of six regions that have been identified and whose character colors have been specified; 11a is region R1, 11b is region R3, and 11c is region R6. The following describes a method of generating a summary from a document image at high speed by using the result obtained from the character color specifying means 3.

[0035] FIG. 8 is a flowchart illustrating an example of the process of generating a summary in the document image recognition device to which the present invention is applied. First, character rectangle data is obtained from the region classification means 2 (step S21). The character rectangle data whose color differs from the representative character color of the document is obtained from the character color specifying means 3 (step S22). These are called title/keyword candidate character rectangles. The regions containing the obtained title/keyword candidate character rectangles are then obtained (step S23); in FIG. 7 described above, this corresponds to obtaining regions R1, R3, and R6. The obtained regions (R1, R3, and R6 in FIG. 7) are subjected to OCR. A summary is generated from the OCR result (step S25). Since summary generation itself is not the point of the present invention, it is realized with a suitable conventional technique.
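The region selection of step S23 can be sketched as a simple containment test. The box representation `(left, top, right, bottom)`, the function names, and the coordinates below are assumptions made for illustration; the patent does not specify how regions and rectangles are stored.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def contains(region: Box, rect: Box) -> bool:
    """True when `rect` lies entirely inside `region`."""
    return (region[0] <= rect[0] and region[1] <= rect[1]
            and rect[2] <= region[2] and rect[3] <= region[3])


def regions_to_ocr(regions: Dict[str, Box], candidates: List[Box]) -> List[str]:
    """Step S23: names of regions holding at least one candidate rectangle."""
    return [name for name, box in regions.items()
            if any(contains(box, c) for c in candidates)]


regions = {
    "R1": (0, 0, 600, 100),
    "R2": (0, 110, 600, 200),
    "R3": (0, 210, 600, 300),
}
# Candidate rectangles: text whose color differs from the representative color.
candidates = [(20, 20, 200, 60), (40, 230, 120, 250)]
print(regions_to_ocr(regions, candidates))  # ['R1', 'R3']
```

Only the returned regions would be passed to OCR, which is where the saving in processing time comes from.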

[0036] FIG. 9 is a diagram showing an example of generating a summary from a plurality of pages. As illustrated, selecting only the predetermined regions Ra and Rb from the pages and applying OCR to them, for example, greatly reduces the execution time.

[0037]

Effects of the Invention As is clear from the above description, according to the inventions of claims 1, 7, and 13, the character color can be specified directly from a color document image without performing OCR.

[0038] According to the inventions of claims 2, 8, and 14, removing the previously specified background color from each character rectangle makes it possible to specify accurately not only black characters but also light character colors, which have been difficult to handle in the past. In particular, reversed (white-on-dark) characters can also be handled.

[0039] According to the inventions of claims 3, 9, and 15, specifying a representative character color for the document allows character strings written in any other color to be regarded as emphasized character strings in the document. This enables the applications of claims 5 and 11 and of claims 6 and 12, and in particular has the effect that emphasized character strings can be identified without performing OCR.

[0040] According to the inventions of claims 4, 10, and 16, the character color specification result can be presented to the user instantly and in an easily understandable form.

[0041] According to the inventions of claims 5, 11, and 17, title/keyword candidate character rectangles can be specified without performing OCR.

[0042] According to the inventions of claims 6, 12, and 18, as with claims 5 and 11, the character regions needed to generate a summary sentence of the document can be specified without performing OCR. Therefore, a summary sentence of the document can be generated efficiently without actually performing OCR on every character in the document.
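The selective OCR described here can be sketched as below, assuming the emphasized character rectangles have already been identified by color. The data layout (a mapping from region id to the character rectangles it contains) and the names `regions_to_ocr` and `summarize_page` are hypothetical, and `ocr` and `summarize` are placeholder callables standing in for whatever conventional OCR and summarization techniques are used.

```python
def regions_to_ocr(region_rects, emphasized):
    """Select only the regions that contain an emphasized character rectangle.

    region_rects: dict mapping a region id to the list of character
                  rectangle ids located inside that region.
    emphasized:   iterable of rectangle ids written in a non-representative color.
    """
    marked = set(emphasized)
    return [region for region, rects in region_rects.items()
            if marked.intersection(rects)]


def summarize_page(region_rects, emphasized, ocr, summarize):
    """Run OCR on the selected regions only, then build the summary.

    `ocr` takes a region id and returns its text; `summarize` takes the
    list of recognized texts. Both are supplied by the caller.
    """
    selected = regions_to_ocr(region_rects, emphasized)
    texts = [ocr(region) for region in selected]
    return summarize(texts)
```

With six regions R1 to R6 of which only R1, R3, and R6 contain emphasized strings, `ocr` is called three times instead of six, which is where the reduction in execution time comes from.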

[Brief description of the drawings]

FIG. 1 is a diagram illustrating a configuration example of a document image recognition device to which the present invention is applied.

FIG. 2 is a flowchart illustrating an example of a process for specifying a character color in a document image recognition device to which the present invention is applied.

FIG. 3 is a diagram illustrating an example of how the character color of an arbitrary character rectangle is specified according to the present invention.

FIG. 4 is a diagram illustrating an example of a GUI screen for presenting a character color specification result to a user.

FIG. 5 is a diagram illustrating an example of the operation when the mouse pointer is over a character rectangle.

FIG. 6 is a flowchart illustrating an example of a process for specifying a title/keyword in a document image recognition device to which the present invention is applied.

FIG. 7 is a diagram illustrating an example of selecting the character regions needed to generate a summary sentence according to the present invention.

FIG. 8 is a flowchart illustrating an example of a process for generating a summary sentence in a document image recognition device to which the present invention is applied.

FIG. 9 is a diagram illustrating an example of generating a summary sentence from a plurality of pages.

[Explanation of symbols]

1 ... image input means, 2 ... region classification means, 3 ... character color specification means, 4 ... title/keyword extraction means, 5 ... title/keyword DB, 6 ... OCR means, 7 ... summary sentence creation means, 8 ... summary sentence DB, 10 ... character rectangle, 11 ... input document.

Continued from the front page
(51) Int.Cl.⁷, identification symbol, FI, theme code (reference): H04N 1/40, H04N 1/40 F, 5L096, 1/60, D, 1/46, 1/46 Z
F-terms (reference): 5B029 AA02 BB02 CC28 CC29; 5B050 BA16 EA06 EA09; 5B064 AA01 AA07 BA01 CA08; 5C077 MP05 MP06 MP08 PP27 PP31 PP32 SS01; 5C079 HA16 LA02 LA31; 5L096 AA02 BA18 FA15 FA19 GA38

Claims

[Claims]

1. A document image recognition method comprising: inputting a document image as a color digital image; extracting color information from the document image and specifying a background color of the document image; extracting pixels other than the background region from the document image as small blocks using the background color; integrating the extracted small blocks to generate connected components; and classifying the connected components into regions such as characters, ruled lines, figures, and photographs based on their shape features and color features, wherein, for a character region that is a region classified as characters, a character color of the character region is specified from the color digital image.

2. The document image recognition method according to claim 1, wherein, when the character color is specified and a background color is present in a character rectangle classified as the character region, pixels in the character rectangle whose pixel values are substantially equal to the background color are set to 0 and the other pixels are set to 1, a representative color is determined from the set of pixels having the value 1, and the determined representative color is used as the character color of the character rectangle.

3. The document image recognition method according to claim 1, wherein, when the character color is specified, a character color is assigned to each of the classified character rectangles, a distribution map of the character colors is created, and, based on the distribution map, a color satisfying a specific condition is set as a representative character color of the document.

4. A document image recognition method, wherein, when regions such as characters, ruled lines, figures, and photographs are presented to a user as rectangles in the document, the color of a rectangle representing the character region is set to the character color specified by claim 1.

5. The document image recognition method according to claim 3, wherein, if there is a character string written in a color other than the representative character color of the document, the character string is used as a title or keyword of the document.

6. The document image recognition method according to claim 3, wherein, if there is a character string written in a color other than the representative character color of the document, OCR is performed on the entire region including the character string, and a summary sentence of the document is created from the character codes obtained from the plurality of OCR-processed regions.

7. A document image recognition device comprising: image input means for inputting a document image as a color digital image; and region classification means for extracting color information from the document image, specifying a background color of the document image, extracting pixels other than the background region from the document image as small blocks using the background color, integrating the extracted small blocks to generate connected components, and classifying the connected components into regions such as characters, ruled lines, figures, and photographs based on their shape features and color features, the device further comprising character color specifying means for specifying, for a character region that is a region classified as characters, a character color of the character region from the color digital image.

8. The document image recognition device according to claim 7, wherein, when the character color is specified and a background color is present in a character rectangle classified as the character region, pixels in the character rectangle whose pixel values are substantially equal to the background color are set to 0 and the other pixels are set to 1, a representative color is determined from the set of pixels having the value 1, and the determined representative color is used as the character color of the character rectangle.

9. The document image recognition device according to claim 7, wherein, when the character color is specified, a character color is assigned to each of the classified character rectangles, a distribution map of the character colors is created, and, based on the distribution map, a color satisfying a specific condition is set as a representative character color of the document.

10. A document image recognition device, wherein, when regions such as characters, ruled lines, figures, and photographs are presented to a user as rectangles in the document, the color of a rectangle representing the character region is set to the character color specified by claim 7.

11. The document image recognition device according to claim 9, further comprising title/keyword extraction means for using, if there is a character string written in a color other than the representative character color of the document, the character string as a title or keyword of the document.

12. The document image recognition device according to claim 9, further comprising: OCR means for performing OCR, if there is a character string written in a color other than the representative character color of the document, on the entire region including the character string; and summary sentence creation means for creating a summary sentence of the document from the obtained character codes.

13. A computer-readable recording medium on which is recorded a program for causing a computer to execute a document image recognition method including: an image input step of inputting a document image as a color digital image; and a region classification step of extracting color information from the document image, specifying a background color of the document image, extracting pixels other than the background region from the document image as small blocks using the background color, integrating the extracted small blocks to generate connected components, and classifying the connected components into regions such as characters, ruled lines, figures, and photographs, wherein the program further causes the computer to execute a character color specifying step of specifying a character color of a character region from the color image.

14. A computer-readable recording medium on which is recorded a program for causing a computer to execute the document image recognition method according to claim 2, the program causing the computer to execute a step of, when the character color is specified and a background color is present in a character rectangle classified as the character region, setting pixels in the character rectangle whose pixel values are substantially equal to the background color to 0 and the other pixels to 1, determining a representative color from the set of pixels having the value 1, and setting the determined representative color as the character color of the character rectangle.

15. A computer-readable recording medium on which is recorded a program for causing a computer to execute the document image recognition method according to claim 3, the program causing the computer to execute a step of creating a distribution map of the character colors and setting, based on the distribution map, a color satisfying a specific condition as a representative character color of the document.

16. A computer-readable recording medium on which is recorded a program for causing a computer to execute the document image recognition method according to claim 4, the program causing the computer to execute a step of setting, when regions such as characters, ruled lines, figures, and photographs are presented to a user as rectangles in the document, the color of a rectangle representing the character region to the character color specified by claim 1.

17. A computer-readable recording medium on which is recorded a program for causing a computer to execute the document image recognition method according to claim 5, the program causing the computer to execute a title/keyword extraction step of using, if there is a character string written in a color other than the representative character color of the document, the character string as a title or keyword of the document.

18. A computer-readable recording medium on which is recorded a program for causing a computer to execute the document image recognition method according to claim 6, the program causing the computer to execute: an OCR step of performing OCR, if there is a character string written in a color other than the representative character color of the document, on the entire region including the character string; and a summary sentence creation step of creating a summary sentence of the document from the character codes obtained from the plurality of OCR-processed regions.