JP2010225112A

JP2010225112A - Data generation device and data generation method

Info

Publication number: JP2010225112A
Application number: JP2009074648A
Authority: JP
Inventors: Sachiaki Yamagishi; 祥晃山岸; Minoru Masutani; 稔桝谷; Yoshiyuki Mitsumori; 芳幸三ツ森
Original assignee: Toppan Printing Co Ltd; Media Drive Co Ltd
Current assignee: Media Drive Co Ltd; Toppan Inc
Priority date: 2009-03-25
Filing date: 2009-03-25
Publication date: 2010-10-07
Anticipated expiration: 2029-03-25
Also published as: JP5368141B2

Abstract

PROBLEM TO BE SOLVED: To extract character information from printed matter printed in a plurality of colors.

SOLUTION: The data generation device includes: a reading part reading a printed matter including an image representing a character in a plurality of colors and generating image data; a color image data generation part separating the image data generated by the reading part by color, and generating color image data that are image data for each of the plurality of colors; a character area detection part detecting a character area recognized as the character from each of the color image data generated by the color image data generation part; an analysis part analyzing the character area in the color image data detected by the character area detection part, and detecting the character information included in the character area; a position information detection part detecting position information indicating coordinates in the image data of the character area to be analyzed when the analysis part detects the character information; and an analysis character information output part associating and outputting the character information detected by the analysis part and the position information detected by the position information detection part.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a data generation device that extracts character information from printed matter printed in a plurality of colors and generates data.

Optical character readers, such as OCR (Optical Character Reader) devices, irradiate printed matter bearing characters and images with light, analyze the image data generated from the reflected light, and extract character information.
Such optical character readers typically target printed matter printed in black and white, and binary recognition is the mainstream approach. To recognize characters in color printed matter, the printed matter is first copied in black and white and binarized, and character recognition is performed on the binarized image. One technique for this kind of character recognition is disclosed in Patent Document 1 below.
Meanwhile, insert flyers are distributed with newspapers and the like. Supermarkets, home-appliance mass retailers, and similar businesses use insert flyers as an advertising medium, and the flyers are printed on paper. Services that digitize such insert flyers and deliver them over a network as electronic flyers are becoming available. A delivered electronic flyer is displayed on the screen of the receiving terminal device and viewed by consumers.
Most insert flyers are printed in color. There is a need to read such a flyer with the optical character reader described above and perform character recognition in order to determine which products appear on the flyer and where on its printed surface each one is placed.

Patent Document 1: JP-A-9-269971 (Japanese Unexamined Patent Application Publication No. H9-269971)

However, because insert flyers are printed in color, copying them in black and white, binarizing the copy, and then performing character recognition yields low recognition accuracy, so character information such as the products listed on the flyer cannot be obtained accurately. Specifically, if the part of the image that should be recognized as a character is rendered white together with the background when the color print is copied to black and white, that character cannot be recognized. Moreover, characters on insert flyers are often laid out over photographs, illustrations, backgrounds, and the like, so binarization does not necessarily assign the character color and the color of the underlying photograph, illustration, or background to different values. When it does not, character recognition accuracy falls.
Thus, when characters overlap images such as photographs, and especially when characters and images printed in multiple colors are laid out in a complex way as on insert flyers, it is difficult to distinguish characters from images and recognize the characters, and it has been difficult to extract only the character information with high accuracy.

The present invention has been made in view of these circumstances to solve the above problem, and its object is to provide a data generation device and a data generation method that can accurately recognize characters from image data that includes images representing characters.

To solve the above problem, a data generation device according to the present invention comprises: a reading unit that reads, in a plurality of colors, printed matter including an image representing characters and generates image data; a color image data generation unit that separates the image data generated by the reading unit by color and generates color image data, which is image data for each of a plurality of colors; a character region detection unit that detects, from each piece of the color image data generated by the color image data generation unit, a character region recognized as a character; an analysis unit that analyzes the character region in the color image data detected by the character region detection unit and detects character information included in that character region; a position information detection unit that, when the analysis unit detects character information, detects position information indicating the coordinates, in the image data, of the analyzed character region; and an analysis character information output unit that outputs the character information detected by the analysis unit in association with the position information detected by the position information detection unit.

Further, in the data generation device described above, the color image data generation unit separates the image data by color to generate the color image data for predetermined colors, or for a predetermined number of top-ranked colors in a color histogram of the image data.

Further, in the data generation device described above, the character region detection unit detects the character region by comparing how much edge information, detected from the light and dark gradation of the image data, is contained in each region of the color image data.

Further, in the data generation device described above, when the analysis unit detects a plurality of characters in the character region, it determines from the arrangement of the characters in the character region whether the text is written vertically or horizontally, and detects a character string that continues in the determined writing direction as character information.

Further, the data generation device described above comprises: an input unit to which a search character representing a character to be searched for is input; an analysis character information storage unit that stores the character information detected by the analysis unit in association with the position information detected by the position information detection unit; a search unit that searches the analysis character information storage unit for the search character input to the input unit; and a search character information output unit that outputs the position information found by the search unit.

Further, the data generation device described above further comprises a similar character storage unit that stores similar character groups, each containing characters that resemble one another. The search unit determines whether the similar characters stored in the similar character storage unit include a similar character corresponding to the search character input to the input unit. If so, it generates a similar search character by replacing each character of the search character that is a similar character with another similar character from the same similar character group, and searches the analysis character information storage unit based on both the generated similar search character and the search character.
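The similar-character search can be illustrated with a minimal Python sketch. It is not taken from the publication: the function names `similar_queries` and `search`, the contents of `SIMILAR_GROUPS`, and the choice to generate every combination of substitutions are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical similar character groups (assumed examples, not from the source):
# katakana "ト" vs. kanji "卜", katakana "ロ" vs. kanji "口", katakana "カ" vs. kanji "力".
SIMILAR_GROUPS = [{"ト", "卜"}, {"ロ", "口"}, {"カ", "力"}]


def similar_queries(keyword, groups=SIMILAR_GROUPS):
    """Return the keyword plus every variant obtained by swapping a character
    for another member of its similar character group."""
    options = []
    for ch in keyword:
        group = next((g for g in groups if ch in g), None)
        # A character without a group can only stand for itself.
        options.append(sorted(group) if group else [ch])
    return {"".join(combo) for combo in product(*options)}


def search(words, keyword, groups=SIMILAR_GROUPS):
    """Return the stored words that match the keyword or any similar variant."""
    queries = similar_queries(keyword, groups)
    return [word for word in words if word in queries]
```

With these assumed groups, `search(["卜マト", "みかん"], "トマト")` finds the stored word whose first character was recognized as the kanji "卜" instead of the katakana "ト".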

Further, according to the present invention, in the data generation device described above: the reading unit reads, in a plurality of colors, printed matter including an image representing characters and generates image data; the color image data generation unit separates the image data generated by the reading unit by color and generates color image data, which is image data for each of a plurality of colors; the character region detection unit detects, from each piece of the color image data generated by the color image data generation unit, a character region recognized as a character; the analysis unit analyzes the character region in the color image data detected by the character region detection unit and detects character information included in that character region; the position information detection unit, when the analysis unit detects character information, detects position information indicating the coordinates, in the image data, of the analyzed character region; and the analysis character information output unit outputs the character information detected by the analysis unit in association with the position information detected by the position information detection unit.

According to the present invention, image data obtained by reading printed matter is separated by color, character recognition is performed on the separated image data, and the recognized character information is obtained together with position information indicating the coordinates at which the characters are placed on the printed matter. Because characters can thus be separated from the surrounding imagery before recognition, characters can be recognized accurately, and their positions determined, even on printed matter printed in a plurality of colors.

A block diagram illustrating an example of the configuration of the data generation device according to the present embodiment.
A schematic diagram illustrating an example of the data generation method performed by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of photograph region removal by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of analysis region detection by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of position information extraction by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of the data generation method performed by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of similar character search by the data generation device according to the present embodiment.
A flowchart illustrating an example of the data generation method performed by the data generation device according to the present embodiment.
A flowchart illustrating an example of the data generation method performed by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of discrimination between vertical and horizontal writing by the data generation device according to the present embodiment.
A schematic diagram illustrating an example of discrimination of a tilted character string by the data generation device according to the present embodiment.

FIG. 1 is a block diagram showing the configuration of a data generation device 100 according to the present embodiment. The data generation device 100 comprises a reading unit 101, a color setting unit 102, a color image data generation unit 103, an image processing unit 104, a character region detection unit 105, an analysis unit 106, a position information detection unit 107, an analysis character information storage unit 108, an analysis character information output unit 113, a search unit 109, an input unit 110, a search character information output unit 111, and a similar character storage unit 112.

The reading unit 101 is an image reading device, such as a color scanner, that reads in a plurality of colors printed matter printed in a plurality of colors. It irradiates the printed matter to be read with light, receives the reflected light, converts the reflected light into an electrical signal using photoelectric conversion elements or the like, and generates image data representing the image printed on the printed surface. The printed matter to be read has characters and images printed in a plurality of colors and includes images representing characters. In this embodiment, the printed matter is an insert flyer distributed with a newspaper or the like. Product information, that is, information about a product, includes the product's name, ingredients, features, and price as well as photographs and illustrations of the product. On the flyer, the name, ingredients, features, price, and similar items are printed as images of characters, and product images such as photographs and illustrations of the product are printed as pictures; the flyer is distributed to consumers for advertising and promotion.
The insert flyer itself may be distributed, but this embodiment describes the case where it is delivered as an electronic flyer. An electronic flyer is image data obtained by reading the insert flyer with an image reading device. It is digitized advertising information that includes image information, character information, audio information, and the like about the products handled or services offered by the advertising store. It may also include sales promotion information not tied to a specific product or service, such as fairs held at individual stores and campaign information. The electronic flyer is delivered in response to a request from a user's terminal device.

The color setting unit 102 detects, as extraction colors, a predetermined number of top-ranked colors in the color histogram of the image data 200 representing the flyer read by the reading unit 101. "Top-ranked" means that colors are taken from the top of the per-color histogram down to a predetermined rank. For example, the unit detects extraction colors that occupy a large share of the printed area of the flyer.
Based on the generated color histogram, the color setting unit 102 determines which colors are used across the whole flyer and what share of the printed area each occupies, and detects the colors with the largest area shares as extraction colors. In this embodiment, for example, the color setting unit 102 is configured in advance to extract the top five, so it detects the five colors most used in the image data 200 (for example, black, blue, red, white, and yellow) as extraction colors.
The color setting unit 102 then outputs the extraction colors to the color image data generation unit 103 as the extraction targets used by that unit.
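The top-ranked color selection can be sketched in Python as follows. This is only an illustration under stated assumptions: the publication does not say how pixels are mapped to named colors, so the `PALETTE` values, the nearest-color rule, and the function names `nearest_color` and `top_colors` are hypothetical.

```python
from collections import Counter

# Hypothetical palette. The color names follow the example in the text;
# the RGB values and the extra "green" entry are assumptions.
PALETTE = {
    "black": (0, 0, 0),
    "blue": (0, 0, 255),
    "red": (255, 0, 0),
    "white": (255, 255, 255),
    "yellow": (255, 255, 0),
    "green": (0, 128, 0),
}


def nearest_color(pixel):
    """Map an (R, G, B) pixel to the closest palette name (squared distance)."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )


def top_colors(pixels, n=5):
    """Build a color histogram and return the n most used colors,
    each paired with its share of the total area."""
    histogram = Counter(nearest_color(p) for p in pixels)
    total = sum(histogram.values())
    return [(name, count / total) for name, count in histogram.most_common(n)]
```

Calling `top_colors(pixels, n=5)` corresponds to the top-five setting described for this embodiment; the returned shares play the role of the printed-area ratio.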

The color image data generation unit 103 separates, from the image data 200 generated by the reading unit 101, the image information printed in each extraction color, where the extraction colors are, for example, predetermined colors or the colors detected by the color setting unit 102, and generates color image data, which is image data for each of the plurality of colors. For example, as shown in FIG. 2, the color image data generation unit 103 separates from the image data 200 only the image data printed in each of the extraction colors detected by the color setting unit 102 (black, blue, red, white, and yellow) and generates a plurality of pieces of color image data (black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205). Each piece of separated color image data is binarized according to its extraction color: the black image data 201, blue image data 202, red image data 203, and yellow image data 205 are binarized data in which the image regions shown in the respective extraction color are rendered black. The white image data 204, in turn, is binarized data in which the image regions shown in white in the image data 200 are rendered black.
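A minimal Python sketch of the per-color separation is shown below. The representation is an assumption: each piece of color image data is modeled as a 2D list in which 1 stands for the pixels rendered black in the binarized data and 0 for everything else. The `PALETTE` values and the names `classify` and `separate_by_color` are hypothetical.

```python
# Hypothetical palette with assumed RGB values.
PALETTE = {
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "white": (255, 255, 255),
}


def classify(pixel):
    """Map an (R, G, B) pixel to the closest palette name."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )


def separate_by_color(image, extracted_colors):
    """Return one binary mask per extraction color.

    image is a 2D list of (R, G, B) pixels. In each mask, 1 marks a pixel
    whose nearest palette color is that extraction color.
    """
    return {
        color: [[1 if classify(pixel) == color else 0 for pixel in row] for row in image]
        for color in extracted_colors
    }
```

Note that the white mask marks white pixels with 1 in the same way as the other colors, which matches the statement that white regions are rendered black in the white image data 204.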

The image processing unit 104 applies image processing to the pieces of color image data extracted by the color image data generation unit 103 to remove image data that should not be recognized as character information, and outputs the processed color image data to the character region detection unit 105. For example, the image processing unit 104 performs processing to remove photograph regions and processing to remove noise using a filter or the like.
The processing to remove photograph regions is described with reference to FIG. 3. FIG. 3(a) shows part of the flyer image data 200. It contains a product name in black, a product price in yellow, and the text "税込" (tax included) in white inside a black rectangle; the background is a dark-colored product image of soy sauce, and the area around the product image is red and blue.
FIG. 3(b) shows part of the black image data 201 generated by the color image data generation unit 103. As shown in FIG. 3(b), the portions corresponding to black are the product name, part of the soy sauce product image, and the tax-included label.
The image processing unit 104 applies edge detection to the image data 200, detecting edge portions as features from the light and dark gradation of the image data, and generates edge image data 211 as shown in FIG. 3(c).
The image processing unit 104 then compares the black image data 201 with the edge image data 211, deletes from the black image data 201 every region that contains less than T% (T is a number smaller than 100) of the edge information representing edge portions in the edge image data 211, and generates processed image data 221 as the result of the photograph region removal. The parameter T can be set in advance or changed dynamically according to, for example, the size of the region.
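The comparison against the threshold T can be sketched as follows. The publication does not specify how regions are delimited or how the percentage is computed, so this sketch makes two assumptions: regions are fixed-size square tiles, and the edge percentage is the share of a tile's foreground pixels that coincide with edge pixels. The function name and parameters are hypothetical.

```python
def remove_low_edge_blocks(mask, edges, block=4, t_percent=30.0):
    """Clear every block x block tile of mask whose foreground pixels
    overlap edge pixels by less than t_percent percent.

    mask and edges are 2D lists of 0/1 with the same dimensions.
    The tile-based region definition is an assumption of this sketch.
    """
    height, width = len(mask), len(mask[0])
    result = [row[:] for row in mask]
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Foreground pixels of the color mask inside this tile.
            cells = [
                (y, x)
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
                if mask[y][x]
            ]
            if not cells:
                continue
            on_edge = sum(1 for y, x in cells if edges[y][x])
            if 100.0 * on_edge / len(cells) < t_percent:
                for y, x in cells:
                    result[y][x] = 0
    return result
```

The intuition is that character strokes are thin and therefore edge-dense, while a flat dark area of a photograph contributes many foreground pixels but few edges, so it falls below T and is cleared.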

The character region detection unit 105 detects, for each piece of color image data separated by the color image data generation unit 103, an analysis region (character region) to be recognized as characters. For example, it performs layout analysis for detecting characters on the processed image data 221, from which the image processing unit 104 has largely removed the portions of the black image data 201 corresponding to photographic images, and detects image regions that can be identified as character string images. Based on those image regions, the character region detection unit 105 detects a character region for each individual character and outputs all of these character regions to the analysis unit 106 as an analysis region 300 (see FIG. 4). The character region detection unit 105 detects an analysis region 300 for each piece of color image data.

The analysis unit 106 analyzes the image data of the analysis region 300 detected by the character region detection unit 105 and detects the character information corresponding to each character region in the analysis region 300. For example, as shown in FIG. 4, when the character region detection unit 105 has detected character regions 301 to 307, the analysis unit 106 analyzes the image data of those regions and detects the character information 401 to 407: "北", "海", "道", "産", "ト", "マ", and "ト" (together, "北海道産トマト", tomatoes from Hokkaido).
The analysis unit 106 also determines the direction in which a character string continues by performing processing, described later, that distinguishes vertically written, horizontally written, and tilted character strings according to, for example, the spacing between the character regions detected by the character region detection unit 105.
Based on that direction, the analysis unit 106 detects clusters of characters formed by combinations of the character information 401 to 407; for example, it detects word information 601 consisting of the character information 401 to 404 and word information 602 consisting of the character information 405 to 407. The analysis unit 106 extracts word information and character information for each piece of color image data in which the character region detection unit 105 detected an analysis region 300.
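One simple way to picture the direction decision and the grouping into words is the Python sketch below. It is not the publication's procedure, which is described later in the document: the heuristic (compare how far the character centers spread along each axis, then split where the gap exceeds a threshold), the `max_gap` parameter, and the function names are all assumptions, and tilted strings are not handled.

```python
def writing_direction(boxes):
    """Guess the writing direction from character boxes (left, top, right, bottom).

    Assumed heuristic: the axis along which the box centers spread further
    is taken as the direction in which the string continues.
    """
    xs = [(left + right) / 2 for left, top, right, bottom in boxes]
    ys = [(top + bottom) / 2 for left, top, right, bottom in boxes]
    return "horizontal" if max(xs) - min(xs) >= max(ys) - min(ys) else "vertical"


def group_words(chars, max_gap):
    """Group recognized characters into words.

    chars is a list of (text, box). Characters are ordered along the writing
    direction, and a new word starts whenever the gap to the previous
    character exceeds max_gap (a hypothetical threshold).
    """
    if not chars:
        return []
    axis = 0 if writing_direction([box for _, box in chars]) == "horizontal" else 1
    ordered = sorted(chars, key=lambda c: c[1][axis])
    words, current = [], ordered[0][0]
    for (_, prev), (text, box) in zip(ordered, ordered[1:]):
        # Gap between the start of this box and the end of the previous one.
        gap = box[axis] - prev[axis + 2]
        if gap > max_gap:
            words.append(current)
            current = text
        else:
            current += text
    words.append(current)
    return words
```

With made-up coordinates in which "北", "海", "道", "産" sit close together and "ト", "マ", "ト" follow after a wider gap on the same line, `group_words` yields the two words that the text calls word information 601 and 602.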

The position information detection unit 107 detects position information indicating the coordinates, in the image data, of each analyzed character region. That is, it detects, from the color image data from which the character information 401 to 407 was extracted, position information representing where on the flyer the character information 401 to 407 analyzed by the analysis unit 106 is printed. For example, as shown in FIG. 5, for the character information 405 ("ト") analyzed by the analysis unit 106, the position information detection unit 107 detects the upper-left coordinates (X1_L, Y1_L) and lower-right coordinates (X1_R, Y1_R) of the analysis region 300 of the character information 405 as position information 505. Similarly, for the character information 406 ("マ") it detects the upper-left coordinates (X2_L, Y2_L) and lower-right coordinates (X2_R, Y2_R) as position information 506, and for the character information 407 ("ト") it detects the upper-left coordinates (X3_L, Y3_L) and lower-right coordinates (X3_R, Y3_R) as position information 507.
Although this embodiment describes an example in which the position information detection unit 107 detects the coordinates of two diagonally opposite points of the analysis region 300, the present invention is not limited to this; the position information may be four points of the analysis region 300 or a single point at its center.
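The two-point position information, and the single center point mentioned as an alternative, can be sketched in Python as follows. The input representation (a 2D 0/1 mask whose foreground pixels form one character region, with x increasing to the right and y increasing downward) and the function names are assumptions of this sketch.

```python
def bounding_box(mask):
    """Return ((x_left, y_top), (x_right, y_bottom)) for the foreground
    pixels of a 2D 0/1 mask, or None if the mask has no foreground."""
    points = [(x, y) for y, row in enumerate(mask) for x, value in enumerate(row) if value]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))


def center(box):
    """Return the center point of a box given as two diagonal corners."""
    (x_left, y_top), (x_right, y_bottom) = box
    return ((x_left + x_right) / 2, (y_top + y_bottom) / 2)
```

The pair returned by `bounding_box` corresponds to the upper-left and lower-right coordinates such as (X1_L, Y1_L) and (X1_R, Y1_R).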

The analysis character information output unit 113 outputs the character information detected by the analysis unit 106 in association with the position information detected by the position information detection unit 107.

Based on the information output from the analysis character information output unit 113, the analysis character information storage unit 108 stores the character information extracted by the analysis unit 106 in association with the position information detected by the position information detection unit 107. For example, as shown in FIG. 6, it stores the word information 602 "トマト" (tomato), which consists of character information extracted by the analysis unit 106, in association with the position information 505 to 507 of the character information 405 to 407 that make up the word information 602: "X1_L, Y1_L, X1_R, Y1_R, X2_L, Y2_L, X2_R, Y2_R, X3_L, Y3_L, X3_R, Y3_R".

The input unit 110 accepts input of a search keyword (search characters) representing the characters to be searched for. For example, it does so by receiving a search keyword transmitted over a network from the terminal device of a user viewing the electronic leaflet.
The search unit 109 searches the analysis character information storage unit 108 for the search keyword input from the input unit 110. For example, when the search keyword "トマト" (tomato) is input via the input unit 110, the search unit 109 checks whether "トマト" exists as word information (character information) among the information stored in the analysis character information storage unit 108. When it finds "トマト" as stored word information (character information), the search unit 109 retrieves the position information 505 to 507 associated with it.
The search character information output unit 111 outputs the word information (character information) found by the search unit 109 together with the position information associated with that word information. For example, when the search unit 109 finds the word information (character information) "トマト", the search character information output unit 111 outputs the position information 505 to 507, which is stored in the analysis character information storage unit 108 in association with "トマト", as the position information of the word information "トマト".

The similar character storage unit 112 stores at least one similar character group containing similar characters that resemble one another. For example, as shown in FIG. 7, the similar character storage unit 112 stores, as similar character group 701, the katakana "ン" (n), "ソ" (so), and "ノ" (no), and, as similar character group 702, the long-vowel mark "ー", the dash "―", the kanji numeral "一" (one), and the minus sign "−".

The search unit 109 determines whether the search keyword (search characters) input via the input unit 110 contains a similar character. If it does, the search unit 109 generates similar search keywords (similar search characters) by replacing the similar character in the search keyword with the other similar characters in the similar character group stored in the similar character storage unit 112. The search unit 109 then searches the analysis character information storage unit 108 for the character information associated with the generated similar search keywords and the character information associated with the search keyword.
For example, when the search keyword "ラーメン" (ramen) is input via the input unit 110, the search unit 109 detects that "ラーメン" contains the similar characters "ー" and "ン". Based on these similar characters, the search unit 109 looks up the similar character groups 701 and 702 stored in the similar character storage unit 112 and generates similar search keywords by replacing the similar characters in "ラーメン" with the other similar characters in groups 701 and 702. For example, the search unit 109 generates similar search keywords such as "ラーメン", "ラーメソ", "ラ−メン", and "ラ−メソ", and searches the analysis character information storage unit 108 for the character information associated with the generated similar search keywords and with the search keyword "ラーメン".
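
One way to picture the generation of similar search keywords is as a Cartesian product over per-character candidates. The sketch below assumes the groups of FIG. 7 are held as plain lists; the function name `similar_keywords` and the use of `itertools.product` are illustrative choices, not the method stated in the specification.

```python
from itertools import product

# Similar character groups as described for FIG. 7.
SIMILAR_GROUPS = [
    ["ン", "ソ", "ノ"],        # group 701
    ["ー", "―", "一", "−"],    # group 702
]


def similar_keywords(keyword, groups=SIMILAR_GROUPS):
    """Return the keyword followed by every variant made by swapping similar characters."""
    lookup = {ch: group for group in groups for ch in group}
    # A character that belongs to a group may become any member of that group;
    # any other character stays as it is.
    choices = [lookup.get(ch, [ch]) for ch in keyword]
    variants = ["".join(combo) for combo in product(*choices)]
    # List the original keyword first, then the remaining variants.
    return [keyword] + [v for v in variants if v != keyword]


candidates = similar_keywords("ラーメン")
```

With four candidates for "ー" and three for "ン", the keyword "ラーメン" expands to 4 × 3 = 12 search strings, so the number of lookups grows multiplicatively with the number of similar characters in the keyword.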

Next, an example of the data generation method of the data generation device according to the present invention will be described with reference to FIG. 8.
As shown in FIG. 8, the reading unit 101 reads, for example, a leaflet containing characters (product names, etc.) and images (product images, etc.) printed in a plurality of colors, and generates image data 200 (step ST1). Next, for example, based on the image data 200, the color setting unit 102 outputs the top five colors (black, blue, red, white, and yellow) to the color image data generation unit 103 as the extraction colors to be used by that unit (step ST2). In step ST2, instead of relying on the color setting unit 102, predetermined colors may be input to the color image data generation unit 103 as extraction targets via an operation unit (not shown), or predetermined colors may be output to the color image data generation unit 103 as extraction colors from another external device.
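
The text does not say how the color setting unit 102 ranks colors. The following sketch assumes the top colors are chosen from a histogram of coarsely quantized RGB values. The function name `top_colors`, the bucket width `step`, and the pixel values are assumptions made for illustration.

```python
from collections import Counter


def top_colors(pixels, n=5, step=64):
    """Return the n most frequent colors after coarse quantization."""
    def quantize(rgb):
        # Snap each channel to the center of its bucket so that nearly
        # identical shades (scanner noise, anti-aliasing) count as one color.
        return tuple((c // step) * step + step // 2 for c in rgb)

    histogram = Counter(quantize(p) for p in pixels)
    return [color for color, _ in histogram.most_common(n)]


# Made-up pixel data: mostly near-black, some near-white, a little red.
sample_pixels = ([(0, 0, 0)] * 5 + [(250, 250, 250)] * 3
                 + [(255, 0, 0)] * 2 + [(10, 5, 0)])
```

A finer `step` distinguishes more shades at the cost of splitting one printed ink across several buckets.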

Then, the color image data generation unit 103 extracts image data for each extraction color, based for example on the extraction colors input from the color setting unit 102, and generates a plurality of color image data (step ST3). That is, the color image data generation unit 103 separates the image data by extraction color (black, blue, red, white, and yellow) and generates black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205. As a result, binarized image data is generated for each extraction color.
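
How the separation in step ST3 is carried out is not detailed here. As one possible reading, the sketch below assigns each pixel to its nearest extraction color and sets the corresponding bit in that color's binary mask, leaving pixels that are far from every extraction color unset. The function name `separate_by_color`, the RGB values, and the `tolerance` threshold are all assumptions.

```python
def separate_by_color(image, extraction_colors, tolerance=60):
    """Return {color name: binary mask}; a 1 marks a pixel close to that color.

    image is a list of rows of (r, g, b) tuples, and extraction_colors maps
    a color name to its (r, g, b) value.
    """
    def distance_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    masks = {name: [[0] * len(row) for row in image] for name in extraction_colors}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            # Pick the extraction color nearest to this pixel.
            name, rgb = min(extraction_colors.items(),
                            key=lambda item: distance_sq(pixel, item[1]))
            # Pixels far from every extraction color stay 0 in all masks.
            if distance_sq(pixel, rgb) <= tolerance ** 2:
                masks[name][y][x] = 1
    return masks


colors = {"black": (0, 0, 0), "red": (255, 0, 0), "white": (255, 255, 255)}
image = [[(0, 0, 0), (250, 5, 5)],
         [(255, 255, 255), (128, 128, 128)]]
masks = separate_by_color(image, colors)
```

Each mask is already binarized, so it can be passed to later steps such as character area detection one color at a time.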

Next, the image processing unit 104 performs processing, for example to remove photograph regions, on the black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205 extracted by the color image data generation unit 103 (step ST4). Then, the character area detection unit 105 detects, from each of the black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205, the analysis region 300 (character area) that is to be analyzed as characters (step ST5).
The analysis unit 106 analyzes the image data of the analysis region 300 detected in step ST5 and extracts character information (step ST6). For example, as shown in FIG. 4, the analysis unit 106 analyzes the image data of the analysis region 300 to extract the character information 405 to 407, "ト" (to), "マ" (ma), and "ト" (to), and treats them as word information 602, a single character string. The analysis unit 106 analyzes the detected analysis region 300 for each of the black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205.

The position information detection unit 107 detects the position in the leaflet at which the character information (word information) analyzed by the analysis unit 106 is printed, and detects position information representing that position (step ST7). The position information detection unit 107 detects position information representing the position of character information from each of the plurality of color image data, for example the black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205. For example, based on the character information "ト", "マ", and "ト", the position information detection unit 107 detects the position information 505 to 507 "X1_L, Y1_L, X1_R, Y1_R, X2_L, Y2_L, X2_R, Y2_R, X3_L, Y3_L, X3_R, Y3_R" of the character information 405 to 407.
Then, the analysis unit 106 and the position information detection unit 107 associate the character information extracted by the analysis unit 106 with the position information detected by the position information detection unit 107 and store them in the analysis character information storage unit 108 (step ST8). The analysis character information storage unit 108 stores the character information and position information extracted from the black image data 201, blue image data 202, red image data 203, white image data 204, and yellow image data 205, all of which are generated from the image data 200 of a single leaflet, in association with one another as being based on that one image data 200.

Next, an example of the data search method of the data generation device according to the present embodiment will be described with reference to FIG. 9.
As shown in FIG. 9, when a search keyword is input from the input unit 110 (step ST10), the search unit 109 searches the similar character storage unit 112 to detect whether any similar character contained in the search keyword is stored there (step ST11). If it is detected in step ST11 that the search keyword contains a similar character (step ST11: YES), the search unit 109 reads out from the similar character storage unit 112 the similar character group containing the detected similar character (step ST12). For example, when the search keyword is "ラーメン" (ramen), the search unit 109 detects the similar characters "ー" and "ン" in the similar character storage unit 112 and reads out the similar character group containing "ー" and the similar character group containing "ン".

Then, the search unit 109 generates similar search keywords by, for example, replacing the corresponding similar character "ン" in the search keyword with the other similar characters in the read similar character group 701, namely the katakana "ソ" and "ノ" (step ST13). For example, the search unit 109 generates similar search keywords such as "ラーメン", "ラーメソ", "ラーメノ", "ラ−メン", "ラ−メソ", and "ラ−メノ".
Next, the search unit 109 searches the analysis character information storage unit 108 for the character information associated with the search keyword input in step ST10 and with each similar search keyword generated in step ST13 (step ST14).
The search character information output unit 111 then outputs the character information found by the search unit 109 together with the position information associated with that character information (step ST15).

On the other hand, if it is determined in step ST11 that the search keyword contains no similar character (step ST11: NO), the search unit 109 searches the analysis character information storage unit 108 for character information matching the search keyword (step ST16) and, if matching character information is found, outputs that character information together with the position information associated with it.

In the embodiment described above, products and the like contained in an electronic leaflet can be searched for by search keyword, and their position information can be obtained. By entering a search keyword, a user can therefore find where on the electronic leaflet the desired product is listed, that is, where its product information is laid out. Some electronic leaflets are scans of the printed faces of A1-size or A2-size folded insert leaflets, so a large number of products appear when the front and back are combined. Entering a search keyword as described above makes it easy to find a product.

Next, an example of detecting the direction in which the character string of the character information detected by the analysis unit 106 runs (the writing direction) will be described with reference to FIGS. 10 and 11. FIG. 10 is an explanatory diagram illustrating an example of a method for determining whether a character string is vertical or horizontal, and FIG. 11 is an explanatory diagram illustrating an example of a method for recognizing a tilted character string.

First, an example of a method for determining whether a character string is vertical or horizontal will be described with reference to FIG. 10.
When the analysis unit 106 detects a plurality of character areas in the analysis region 300, it detects whether the text is written vertically or horizontally based on the arrangement of the characters in the character areas, and detects the character string running in the detected writing direction as character information. For example, as shown in FIG. 10, based on the character areas 301 to 307 contained in the analysis region 300, the analysis unit 106 detects the horizontal (X-direction) gap Gx and the vertical gap Gy between the character areas 301 to 307 and the aspect ratio Rxy (the ratio of the horizontal size Rx to the vertical size Ry) of each of the character areas 301 to 307, and determines from the results whether the character areas 301 to 307 form a vertically written character string or a horizontally written character string.
The analysis unit 106 also detects the horizontal gap Gx and the vertical gap Gy between the character areas 301 to 307 and calculates the ratio of Gx to Gy for each of the character areas 301 to 307. For a given character area, if the horizontal gap Gx is larger than the vertical gap Gy by M% or more (M is a positive integer), the analysis unit 106 judges the character area to be vertical writing; if the vertical gap Gy is larger than the horizontal gap Gx by N% or more (N is a positive integer), it judges the character area to be horizontal writing.
The analysis unit 106 also calculates the aspect ratio Rxy from the horizontal size Rx and the vertical size Ry of each of the character areas 301 to 307. If all the character areas are rectangles elongated in the vertical direction (Y direction), it judges the text to be English or a tall typeface and changes the parameters corresponding to M% and N%.
The parameters corresponding to M% and N% can be changed according to, for example, the average size of the characters in the leaflet and the tendency of where character information appears in the leaflet.
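
The gap comparison can be sketched as below. The sketch reads "larger by M% or more" as Gx ≥ Gy × (1 + M/100), which is an interpretation rather than something the text states, and the default thresholds of 20 are placeholders. The aspect-ratio adjustment of M and N is left out because the text does not say how the parameters change.

```python
def writing_direction(gx, gy, m_percent=20, n_percent=20):
    """Classify a character area as 'vertical', 'horizontal', or 'undetermined'.

    gx is the gap to the horizontally adjacent character area and gy is the
    gap to the vertically adjacent one, in the same units.
    """
    if gx == gy:
        return "undetermined"
    if gx >= gy * (1 + m_percent / 100):
        # Wide horizontal gaps: the characters are stacked top to bottom.
        return "vertical"
    if gy >= gx * (1 + n_percent / 100):
        # Wide vertical gaps: the characters run left to right.
        return "horizontal"
    return "undetermined"
```

For instance, a character area whose neighbor to the right is 2 units away and whose neighbor below is 10 units away is classified as horizontal writing under these assumed thresholds.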

Thus, for character image data such as that shown in FIG. 10, the analysis unit 106 detects the character areas 301 to 307 and detects, for example, the horizontal gap Gx1 between the character area 301 and the adjacent character area 302, the vertical gap Gy1 between the character area 301 and the adjacent character area 305, and the horizontal gap Gx2 between the character area 305 and its horizontally adjacent character area.
The analysis unit 106 compares the horizontal gap Gx1 with the vertical gap Gy1, detects that Gy1 is larger than Gx1 by N% or more, and judges that the character area 301 forms a horizontally written character string together with the character area 302.
The analysis unit 106 also compares the vertical gap Gy1 with the horizontal gap Gx1 and detects that Gy1 is larger than Gx1 by N% or more. Because the character area 301 can therefore be judged to be horizontal writing, the analysis unit 106 judges that the character area 301 belongs to a different character string from the character area 305.

Next, an example of a method for recognizing a tilted character string will be described with reference to FIG. 11.
As shown in FIG. 11, based on the character areas 311 to 317 contained in the analysis region 300, the analysis unit 106 detects the horizontal gap Gx and the vertical gap Gy between the character areas 311 to 317 and the horizontal size Rx and the vertical size Ry of each of the character areas 311 to 317, and judges from the results whether the character areas 311 to 317 form a tilted character string.
For example, the analysis unit 106 detects as adjacent character areas those of the character areas 311 to 317 whose horizontal gap Gx and vertical gap Gy are at or below a fixed value, calculates the overlap of the vertical sizes Ry of the adjacent character areas, and, if the overlap is at or above a fixed proportion, judges that the adjacent character areas form a tilted character string.

In the example shown in FIG. 11, the analysis unit 106 judges, based on the horizontal gap Gx and the vertical gap Gy, that the character areas 311 and 312 are adjacent to each other and that the character areas 311 and 315 are adjacent to each other. The analysis unit 106 calculates the overlap size W1 by which the vertical size Ry1 of the character area 311 and the vertical size Ry2 of the adjacent character area 312 overlap in the vertical direction, and judges whether W1 is L% or more of the vertical size Ry1 of the character area 311 (L is a number smaller than 100). Here, because the overlap size W1 is L% or more, the analysis unit 106 judges that the character areas 311 and 312 form a tilted character string that runs in the horizontal direction.
The analysis unit 106 likewise calculates the overlap size by which the vertical size Ry1 of the character area 311 and the vertical size Ry3 of the adjacent character area 315 overlap in the vertical direction. As shown in FIG. 11, the character areas 311 and 315 have no overlapping portion, so the analysis unit 106 judges that they belong to different character strings.
In this way, the analysis unit 106 can detect the character strings "北海道産" (produced in Hokkaido) and "トマト" (tomato) as tilted character strings.
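
The overlap test can be sketched as follows, assuming each character area is a box `(x_l, y_l, x_r, y_r)` with Y increasing downward and that the proximity check on Gx and Gy has already been done. The function names, the default `l_percent` of 50, and the box coordinates are illustrative assumptions.

```python
def vertical_overlap(box_a, box_b):
    """Length of the Y range shared by two boxes (0 if they do not overlap)."""
    return max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))


def same_tilted_string(box_a, box_b, l_percent=50):
    """True if the shared Y range is at least l_percent of box_a's vertical size."""
    ry_a = box_a[3] - box_a[1]
    if ry_a <= 0:
        return False
    return vertical_overlap(box_a, box_b) * 100 >= l_percent * ry_a


# Made-up boxes: area_312 sits to the right of area_311 and slightly lower,
# while area_315 sits entirely below area_311.
area_311 = (0, 0, 20, 20)
area_312 = (22, 6, 42, 26)
area_315 = (0, 24, 20, 44)
```

With these boxes, `area_311` and `area_312` share 14 of 20 vertical units and are joined into one string, whereas `area_311` and `area_315` share none and are kept apart.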

The analysis unit 106 according to the present embodiment can use general character recognition techniques, for example those used in OCR (Optical Character Reader). When the character extraction accuracy of the data generation device 100 according to the present embodiment was checked, the conventional approach of binary recognition of black based on black-and-white image data gave a correct answer rate of about 40%, whereas the character extraction accuracy of the data generation device 100 according to the present embodiment was double that of the conventional method.

As described above, the data generation device 100 according to the present embodiment separates a printed matter containing characters and images printed in a plurality of colors, such as a leaflet, by extraction color to generate a plurality of color image data, and performs character recognition on each color image data. This improves the accuracy of character extraction from printed matter in which characters are printed together with multicolor photographs and the like. For example, in a printed matter in which black characters appear over a photograph, the characters and the photograph can be told apart while they are displayed in color. However, if both are dark in tone, or if the background photograph is a complex image, binarizing the printed matter makes it difficult to recognize the boundary between the characters and the photograph. With the configuration described above, the present invention solves this problem and can extract the character portions and perform character recognition even for printed matter in which characters and photographs in a plurality of colors are printed overlapping one another.

In addition, because the data generation device 100 according to the present embodiment stores the extracted character information in the analysis character information storage unit 108 in association with its position information in the printed matter, the image data of the printed matter read by the reading unit 101 can be used in association with the extracted character information, and the extracted character information can be put to effective use in a variety of ways. For example, using the information in the analysis character information storage unit 108, in which the extracted character information is associated with position information in the printed matter, the printed matter corresponding to the search keyword input to the search unit 109 can be displayed. The position within the printed matter at which the search keyword appears can also be identified and displayed on a display unit.
Furthermore, by storing the extracted character information in the analysis character information storage unit 108 in association with its position information in the printed matter, images such as photographs, which are not extracted, can be managed in association with the extracted characters.

In addition, the data generation device 100 according to the present embodiment can improve character recognition accuracy by having the image processing unit 104 perform, on the plurality of color image data, image processing that removes image data not to be recognized as character information.
Furthermore, the data generation device 100 according to the present embodiment can detect whether a character string is written vertically or horizontally and detect the character string running in the detected direction as character information. It can also detect a "tilted character string", in which characters run in a direction different from the mutually orthogonal vertical and horizontal directions of the printed matter, as a tilted character string. As a result, even for printed matter such as a leaflet in which vertical, horizontal, and tilted character strings are mixed, characters can be extracted as specific words or meaningful character strings.
In addition, because the data generation device 100 according to the present embodiment further includes the similar character storage unit 112, which stores at least one similar character group containing similar characters that resemble one another, similar character information can be retrieved even when part of the character information extracted as a character string was recognized incorrectly.

Note that the data generation device 100 may omit the color setting unit 102. In that case, arbitrary extraction colors selected in advance may be stored in a storage unit (not shown), and the color image data generation unit 103 may read those extraction colors from it.

DESCRIPTION OF SYMBOLS 100 ... data generation device, 101 ... reading unit, 102 ... color setting unit, 103 ... color image data generation unit, 104 ... image processing unit, 105 ... character area detection unit, 106 ... analysis unit, 107 ... position information detection unit, 108 ... analysis character information storage unit, 109 ... search unit, 110 ... input unit, 111 ... search character information output unit, 112 ... similar character storage unit, 113 ... analysis character information output unit

Claims

A data generation device comprising:
a reading unit that reads a printed matter including an image representing characters in a plurality of colors and generates image data;
a color image data generation unit that separates the image data generated by the reading unit by color and generates color image data, which is image data for each of the plurality of colors;
a character region detection unit that detects a character region recognized as characters from each of the color image data generated by the color image data generation unit;
an analysis unit that analyzes the character region in the color image data detected by the character region detection unit and detects character information included in the character region;
a position information detection unit that, when the analysis unit detects character information, detects position information indicating coordinates in the image data of the character region being analyzed; and
an analysis character information output unit that outputs the character information detected by the analysis unit in association with the position information detected by the position information detection unit.

2. The data generation device according to claim 1, wherein the color image data generation unit separates the image data by color, for predetermined colors or for predetermined top-ranked colors in a color histogram of the image data, and generates the color image data.

The data generation device according to claim 1, wherein the character region detection unit detects the character region by comparing how much edge information, detected from the density of the image data, is contained in each region of the color image data.
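As a rough illustration of comparing edge content between regions, the following Python sketch marks an edge wherever the density of neighboring pixels differs sharply, then measures the fraction of edge pixels in fixed-size blocks. The block size, both thresholds, and all function names are assumptions for the example; the claim does not specify how regions are formed or compared.

```python
def edge_map(gray, threshold=64):
    """Mark pixels whose density differs sharply from the right or lower neighbor."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = abs(gray[y][x] - gray[y][x + 1]) if x + 1 < w else 0
            down = abs(gray[y][x] - gray[y + 1][x]) if y + 1 < h else 0
            if max(right, down) >= threshold:
                edges[y][x] = 1
    return edges

def edge_density_per_block(edges, block=4):
    """Return {(block_row, block_col): fraction of edge pixels in that block}."""
    h, w = len(edges), len(edges[0])
    densities = {}
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cells = [edges[y][x]
                     for y in range(by, min(by + block, h))
                     for x in range(bx, min(bx + block, w))]
            densities[(by // block, bx // block)] = sum(cells) / len(cells)
    return densities

def character_blocks(gray, block=4, min_density=0.3):
    """Blocks whose edge density reaches min_density are treated as character candidates."""
    densities = edge_density_per_block(edge_map(gray), block)
    return sorted(key for key, d in densities.items() if d >= min_density)

# Left 4x4 block: alternating dark and light columns (stroke-like); right block: blank.
gray = [[0, 255, 0, 255, 255, 255, 255, 255] for _ in range(4)]
densities = edge_density_per_block(edge_map(gray), 4)
candidates = character_blocks(gray)
```

The striped block contains many density transitions and is kept as a candidate, while the blank block contains none.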

The data generation device according to claim 1, wherein, when a plurality of characters are detected from the character region, the analysis unit detects whether the characters are written vertically or horizontally based on the arrangement of the characters in the character region, and detects a character string that is continuous in the detected writing direction as the character information.
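One simple way to picture the direction detection is to compare how far the character centers spread along each axis. The Python sketch below does this for a single line or column of characters. The tuple layout `(character, x, y, width, height)` and the spread heuristic are assumptions for illustration; the claim states only that the arrangement of the characters is used.

```python
def detect_direction(chars):
    """chars: list of (character, x, y, width, height). Returns 'horizontal' or 'vertical'."""
    xs = [x + w / 2 for _, x, y, w, h in chars]
    ys = [y + h / 2 for _, x, y, w, h in chars]
    # The axis along which the character centers spread further is the writing direction.
    return "horizontal" if max(xs) - min(xs) >= max(ys) - min(ys) else "vertical"

def read_string(chars):
    """Order the characters along the detected direction and join them into one string."""
    direction = detect_direction(chars)
    key = (lambda c: c[1]) if direction == "horizontal" else (lambda c: c[2])
    return direction, "".join(c[0] for c in sorted(chars, key=key))

# Characters are deliberately given out of order, as a recognizer might return them.
vertical_chars = [("京", 10, 40, 20, 20), ("東", 10, 10, 20, 20), ("都", 10, 70, 20, 20)]
horizontal_chars = [("B", 30, 5, 10, 12), ("A", 10, 5, 10, 12), ("C", 50, 5, 10, 12)]

vertical_result = read_string(vertical_chars)      # ("vertical", "東京都")
horizontal_result = read_string(horizontal_chars)  # ("horizontal", "ABC")
```

A region holding several lines or columns would need an additional grouping step before the characters are joined, which this sketch omits.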

The data generation device according to claim 1, further comprising:
An input unit for inputting a search character representing a character to be searched for;
An analysis character information storage unit that stores the character information detected by the analysis unit and the position information detected by the position information detection unit in association with each other;
A search unit that searches the analysis character information storage unit for position information corresponding to the search character input to the input unit; and
A search character information output unit that outputs the position information retrieved by the search unit.
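The storage and search units of this claim amount to a lookup from recognized text to coordinates. A minimal Python sketch follows, assuming substring matching and a rectangle given as `x`, `y`, `width`, `height`. The class names `AnalyzedText` and `AnalysisStore`, the matching rule, and the sample records are hypothetical; the claim says only that position information corresponding to the search character is retrieved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalyzedText:
    """Character information stored together with the position of its character region."""
    text: str
    x: int
    y: int
    width: int
    height: int

class AnalysisStore:
    """In-memory stand-in for the analysis character information storage unit."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def search(self, query):
        """Return (x, y, width, height) for every record whose text contains query."""
        return [(r.x, r.y, r.width, r.height) for r in self._records if query in r.text]

store = AnalysisStore()
store.add(AnalyzedText("Tokyo Station", 120, 40, 90, 14))
store.add(AnalyzedText("Tokyo Tower", 300, 210, 80, 14))
store.add(AnalyzedText("Ueno Park", 60, 330, 70, 14))

hits = store.search("Tokyo")
misses = store.search("Osaka")
```

Returning positions rather than text is what lets a caller highlight the matching locations on the scanned image.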

The data generation device according to claim 5, further comprising a similar character storage unit that stores similar character groups, each made up of characters that are similar to one another, wherein the search unit detects whether any of the similar characters stored in the similar character storage unit corresponds to a character in the search character input to the input unit, generates a similar search character by replacing the corresponding character with another similar character from the similar character group that contains the detected similar character, and searches the analysis character information storage unit based on the generated similar search character and the search character.
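The similar-character search can be sketched in Python as query expansion: each character that belongs to a confusable group is swapped for the other members of that group, and the store is searched with the original query and every variant. The groups in `SIMILAR_GROUPS`, the function names, and the choice to generate all combinations at once are assumptions for the example; the claim requires only that a replaced query be generated and used alongside the original.

```python
from itertools import product

# Hypothetical similar character groups; a real table would be tuned to the recognizer.
SIMILAR_GROUPS = [{"0", "O"}, {"1", "l", "I"}]

def similar_queries(query, groups=SIMILAR_GROUPS):
    """Return the query plus every variant obtained by swapping confusable characters."""
    options = []
    for ch in query:
        group = next((g for g in groups if ch in g), None)
        options.append(sorted(group) if group else [ch])
    return {"".join(combo) for combo in product(*options)}

def search_with_similar(records, query):
    """records: list of (text, position). Match the query or any of its variants."""
    variants = similar_queries(query)
    return [pos for text, pos in records if any(v in text for v in variants)]

# "R00M" stands for a label where the recognizer read the letter O as the digit 0.
records = [("R00M", (10, 20)), ("ROOM", (30, 40)), ("HALL", (50, 60))]
variants = similar_queries("ROOM")
positions = search_with_similar(records, "ROOM")
```

Searching for "ROOM" therefore also finds the misread "R00M". Note that the number of variants grows quickly with query length, so a practical version would probably cap it.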

A data generation method in which:
A reading unit reads a printed matter including an image representing characters in a plurality of colors and generates image data;
A color image data generation unit separates the image data generated by the reading unit by color and generates color image data, which is image data for each of the plurality of colors;
A character region detection unit detects a character region recognized as a character from each of the color image data generated by the color image data generation unit;
An analysis unit analyzes the character region in the color image data detected by the character region detection unit and detects character information included in the character region;
A position information detection unit, when the analysis unit detects character information, detects position information indicating the coordinates, in the image data, of the character region being analyzed; and
An analysis character information output unit outputs the character information detected by the analysis unit in association with the position information detected by the position information detection unit.
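The order of the steps in the method can be summarized in a short Python sketch in which each stage is passed in as a function. The name `generate_data` and the three callables `separate`, `detect_regions`, and `recognize` are hypothetical placeholders, and the stub stages below return fixed values purely to show the data flow; they perform no real image processing.

```python
def generate_data(image, separate, detect_regions, recognize):
    """Run the steps in order and return [(text, (x, y, width, height)), ...]."""
    results = []
    for color, layer in separate(image).items():   # color image data generation
        for region in detect_regions(layer):       # character region detection
            text = recognize(layer, region)        # analysis of the character region
            if text:                               # position is recorded only when text was detected
                results.append((text, region))     # character information tied to its position
    return results

# Stub stages standing in for the real units (assumptions for illustration only).
def separate(image):
    return {"red": "red-layer", "blue": "blue-layer"}

def detect_regions(layer):
    if layer == "red-layer":
        return [(0, 0, 10, 10)]
    return [(5, 20, 30, 10), (0, 50, 4, 4)]

def recognize(layer, region):
    known = {("red-layer", (0, 0, 10, 10)): "SALE",
             ("blue-layer", (5, 20, 30, 10)): "OPEN"}
    return known.get((layer, region), "")

output = generate_data("scanned-page", separate, detect_regions, recognize)
```

The third region yields no text, so no position is recorded for it, which mirrors the condition that position information is detected when the analysis unit detects character information.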