JP2006277104A

JP2006277104A - Image reading device, extraction method for dictionary registration object word/phrase and program

Info

Publication number: JP2006277104A
Application number: JP2005092626A
Authority: JP
Inventors: Kei Tanaka; 圭田中; Shoichi Tateno; 昌一舘野; Toshiya Koyama; 俊哉小山; Takashi Nagao; 隆長尾; Masayoshi Sakakibara; 正義榊原; Teruka Saito; 照花斎藤; Shinu Ho; 新宇彭; Kotaro Nakamura; 浩太郎中村
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-03-28
Filing date: 2005-03-28
Publication date: 2006-10-12

Abstract

PROBLEM TO BE SOLVED: To provide an image reading device that extracts not only words but also their attribute information from a document.

SOLUTION: The image reading device comprises: an image reading means for reading the image of an original and generating input image data; a layout analyzing means for generating layout information from the input image data; an image dividing means for dividing the input image data into a plurality of small regions based on the layout information; a database that stores, in association with each other, an identifier specifying the small region that has a title character string or image together with that title character string or image, and an identifier specifying the small region that has an information character string or image together with that information character string or image; an information extracting means for extracting, based on the information stored in the database, dictionary registration object words and phrases from the small region having the title character string or image and their attribute information from the small region having the information character string or image; and an output means for outputting the dictionary registration object words and phrases.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for generating and updating a dictionary by extracting words and idioms from a document.

With the recent spread of cross-border activities such as global information distribution and economic activity, demand for translation from one language (for example, English) into another (for example, Japanese) is increasing. However, commissioning a translation from a professional translator is generally expensive and time-consuming, so demand is growing for automatic translation (machine translation) apparatuses that translate automatically using a machine such as a computer.

To accurately translate proper nouns in a text, such as company names, personal names, and product names, as well as the technical terms of a specific technical field, dedicated dictionaries such as proper noun dictionaries and technical term dictionaries are generally used. These dictionaries are often created by hand. Dictionaries also need to be updated (maintained), especially for constantly changing information such as proper nouns, and this too has generally been done by hand.

Some documents arrange information according to a predetermined layout, for example a book of company information or certain forms owned by individual companies. It would be convenient if information could be extracted from such documents and used to update a dictionary. One technique for extracting information from a document having a predetermined layout is described in Patent Document 1. Patent Document 1 discloses a technique that acquires the information needed to create a delivery slip based on the layout of an application form, and then creates the slip or updates a customer database.
Patent Document 1: Japanese Patent Application Laid-Open No. 5-298341

However, the technique described in Patent Document 1 has a problem: although it can extract words from a document, it is difficult to extract additional attribute information.
The present invention has been made in view of the above circumstances, and its object is to provide an image reading apparatus that can extract additional attribute information in addition to words from a document.

To solve the above problem, the present invention provides an image reading apparatus comprising: image reading means for reading an image of a document and generating input image data; layout analysis means for performing layout analysis processing on the input image data generated by the image reading means and generating layout information; image dividing means for dividing the input image data into a plurality of small areas based on the layout information generated by the layout analysis means; a layout database that stores, in association with each other, a first identifier specifying a small area, among the plurality of small areas, that has a heading character string or heading image together with that heading character string or heading image, and a second identifier specifying a small area that has an information character string or information image together with that information character string and information image; information extraction means for extracting a dictionary registration target phrase from the small area specified by the first identifier stored in the layout database, and attribute information of that dictionary registration target phrase from the small area specified by the second identifier stored in the layout database; and output means for outputting the dictionary registration target phrase extracted by the information extraction means.
According to this image reading apparatus, the information needed to update a dictionary (the registration target phrases and their attribute information) is output automatically from a document, based on the layout of that document.

In a preferred aspect, the image reading apparatus may further comprise: definition storage means that stores definitions of the heading character string or heading image and of the information character string or information image; small area specifying means for specifying, according to the definitions stored in the definition storage means, the small area having a heading character string or heading image and the small area having an information character string or information image among the plurality of small areas; and database updating means for updating the contents of the layout database based on information on the small areas specified by the small area specifying means.
According to this image reading apparatus, the layout database is updated automatically, so a layout database matching the layout of the document to be processed can be created automatically.

The present invention also provides a method for extracting dictionary registration target phrases, comprising: an image reading step of reading an image of a document and generating input image data; a layout analysis step of performing layout analysis processing on the input image data and generating layout information; an image dividing step of dividing the input image data into a plurality of small areas based on the layout information; an information extraction step of extracting a dictionary registration target phrase from the small area specified by a first identifier stored in a layout database, and attribute information of that dictionary registration target phrase from the small area specified by a second identifier stored in the layout database, the layout database storing, in association with each other, the first identifier specifying a small area, among the plurality of small areas, that has a heading character string or heading image together with that heading character string or heading image, and the second identifier specifying a small area that has an information character string or information image together with that information character string and information image; and an output step of outputting the dictionary registration target phrase extracted in the information extraction step.
The present invention also provides a program that causes a computer device to execute the above method for extracting dictionary registration target phrases.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of a dictionary update system 1 according to an embodiment of the present invention. The image reading unit 10 reads an image of the document DOC and generates input image data. The layout analysis unit 20 performs layout analysis of the image data and extracts layout information. The area dividing unit 30 divides the input image data into image data of small areas based on the layout information. Based on the information in the layout database, the area dividing unit 30 also picks out, from among the small areas, those corresponding to attribute information and those corresponding to dictionary registration target phrases. The attribute information extraction unit 40 extracts attribute information from the small areas identified as corresponding to attribute information. The registration target phrase extraction unit 60 extracts dictionary registration target phrases from the small areas identified as corresponding to dictionary registration target phrases. The dictionary data registration unit 50 registers the extracted dictionary registration target phrases and attribute information in the dictionary DIC.

FIG. 2 is a diagram illustrating the configuration of the dictionary update system 1. The dictionary update system 1 includes a multifunction device 100 and a server 200. The multifunction device 100 and the server 200 are connected via a network 300 such as the Internet, a LAN (Local Area Network), or a WAN (Wide Area Network). To keep the drawing simple, FIG. 2 shows only one multifunction device 100 and one server 200, but the dictionary update system 1 may include a plurality of multifunction devices 100 or a plurality of servers 200.

FIG. 3 is a diagram illustrating the hardware configuration of the multifunction device 100. The multifunction device 100 mainly consists of a control system including a CPU (Central Processing Unit) 110, an image reading system 160 that reads an image of a document, and an image forming system 170 that forms an image on a sheet (recording material). The CPU 110 controls each component of the multifunction device 100 by reading and executing a control program stored in the storage unit 120. The storage unit 120 consists of a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and the like, and stores various programs such as the control program and a translation program, and various data such as image data and text data. The display unit 130 and the operation unit 140 form the user interface. The display unit 130 is, for example, a liquid crystal display, and displays messages to the user, images showing the work status, and the like according to control signals from the CPU 110. The operation unit 140 consists of a numeric keypad, a start button, a stop button, a touch panel installed on the liquid crystal display, and the like, and outputs a signal corresponding to the user's operation input and the screen displayed at that time. The user can input instructions to the multifunction device 100 by operating the operation unit 140 while viewing the images and messages displayed on the display unit 130.

The I/F 150 is an interface for transmitting and receiving control signals and data to and from other devices. By connecting to, for example, a public telephone line via the I/F 150, the multifunction device 100 can send and receive faxes. By connecting to a network such as the Internet via the I/F 150, the multifunction device 100 can also send and receive e-mail messages. It can also function as a printer by receiving image data from a computer device connected via a network and forming an image on a sheet.

The image reading system 160 includes a document conveying unit 161 that conveys a document to a reading position, an image reading unit 162 that optically reads the document at the reading position and generates an analog image signal, and an image processing unit 163 that converts the analog image signal into digital image data and performs the necessary image processing. The document conveying unit 161 is a document feeder such as an ADF (Automatic Document Feeder). The image reading unit 162 includes a platen glass on which a document is placed, optical devices such as a light source and a CCD (Charge Coupled Device) sensor, and an optical system such as lenses and mirrors (none of which are shown). The image processing unit 163 includes an A/D conversion circuit that performs analog-to-digital conversion and an image processing circuit that performs processing such as shading correction and color space conversion (none of which are shown).

The image forming system 170 includes a sheet conveying unit 171 that conveys a sheet to an image forming position and an image forming unit 172 that forms an image on the conveyed sheet. The sheet conveying unit 171 includes a paper tray that stores sheets, conveying rollers that convey sheets one at a time from the paper tray to a predetermined position, and the like (none of which are shown). The image forming unit 172 includes, for example, photosensitive drums on which toner images of the Y, M, C, and K colors are formed, chargers that charge the photosensitive drums, an exposure device that forms electrostatic images on the charged photosensitive drums, developing devices that form the toner images of the respective colors on the photosensitive drums, and the like (none of which are shown).

The above components are connected to one another by a bus 190. For example, when the image reading system 160 generates image data from a document and the image forming system 170 forms an image on a sheet according to that image data, the multifunction device 100 functions as a copying machine. When the image reading system 160 generates image data from a document and the image data is output to another device via the I/F 150, the multifunction device 100 functions as a scanner. When the image forming system 170 forms an image on a sheet according to image data received via the I/F 150, the multifunction device 100 functions as a printer. When the image reading system 160 generates fax data from a document and the fax data is transmitted to a fax receiving device via the I/F 150 and a public telephone line, the multifunction device 100 functions as a fax transmitter. When the image reading system 160 generates image data from a document, text data is generated from the image data by character recognition processing, and a translation of the text data is generated by executing the translation program, the multifunction device 100 functions as a scan translator. Although not shown, a plurality of computer devices are connected to the multifunction device 100 via the I/F 150. Users of these computer devices can use the multifunction device 100 as a printer, a fax transceiver, or the like by exchanging data with it from their own computer devices. Alternatively, by setting a document directly on the multifunction device 100, they can use it as a copying machine, a fax transceiver, or the like.

FIG. 4 is a diagram illustrating the hardware configuration of the server 200. The CPU 210 executes programs stored in the ROM 220 or the HDD 250, using the RAM 230 as a work area. The HDD 250 is a storage device that stores various programs and data. The user can input data and the like to the server 200 by operating the keyboard 260 and the mouse 270. The server 200 is connected to the multifunction device 100 via the I/F 240 and can exchange data with the multifunction device 100. The display 280 displays images and messages indicating program execution results and the like under the control of the CPU 210. These components are connected to one another by a bus 290. The HDD 250 stores a translation program and the dictionary DIC, so that the server 200 is a translation server having a translation function.

FIG. 5 is a flowchart showing the operation of the dictionary update system 1. When the power (not shown) is turned on, the CPU 110 of the multifunction device 100 reads the control program from the storage unit 120 and executes it. On executing the control program, the CPU 110 controls the display unit 130 to display a menu screen. At this point, the multifunction device 100 waits for a user operation input. Similarly, when the power (not shown) of the server 200 is turned on, the CPU 210 reads a control program from the HDD 250 and executes it. On executing the control program, the CPU 210 waits to receive data. By having the CPU 110 of the multifunction device 100 and the CPU 210 of the server 200 execute their control programs, the dictionary update system 1 provides the functions shown in FIG. 1.

FIG. 6 is a diagram illustrating a document DOC used in the dictionary update processing of the present embodiment. The document DOC is, for example, a company guide in which various items of information such as the company name, business type, head office location, URL, and securities code are arranged according to a predetermined layout. The document DOC is divided into a plurality of small areas, for example by ruled lines. Each small area contains a character string that specifies the type of information written in that small area (hereinafter, "heading character string"; for example, [Business type], [Company name], [Head office], [URL], [Securities code]) and the content of the information (hereinafter, "information character string"; for example, Manufacturing, ABC Industrial Co., Ltd., 100-0000 Tokyo xx-ku yy 1-1, http://www.xxx.yyy.co.jp/, 0000).

The description continues with reference to FIG. 5 again. The user sets the document DOC on the ADF or the platen glass and operates the keyboard 260 and the mouse 270 of the server 200 to input an instruction to execute the dictionary update processing. The execution instruction for the dictionary update processing includes information specifying the multifunction device 100. When execution of the dictionary update processing is instructed, the CPU 210 of the server 200 outputs a signal instructing image reading to the multifunction device 100 specified by the execution instruction.

On receiving the signal instructing image reading, the CPU 110 of the multifunction device 100 controls the image reading system 160 to read the image of the document DOC and generate input image data (step S110). The CPU 110 stores the generated input image data in the storage unit 120. When the document DOC consists of a plurality of pages, the image data of each page is stored in the storage unit 120 together with information indicating its page number.

Next, the CPU 110 performs well-known layout analysis processing on the input image data and extracts layout information (step S120). The layout analysis processing determines how the input image data is to be divided into a plurality of small areas. The layout information includes, for example, the coordinates of the vertices of each small area in a two-dimensional orthogonal coordinate system and the character size in each small area. The CPU 110 stores the extracted layout information in the storage unit 120.
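As an illustration only, the layout information for one small area could be held in a record like the following. This is a minimal sketch: the class name `RegionLayout`, its field names, and the sample coordinates are hypothetical and do not come from the specification, which only states that vertex coordinates and character size are included.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegionLayout:
    """Layout information for one small area (hypothetical structure)."""
    # Corner coordinates of the small area in a 2D orthogonal coordinate system.
    vertices: Tuple[Tuple[int, int], ...]
    # Character size found in the small area (the unit is an assumption).
    char_size_pt: float

    def bounding_box(self) -> Tuple[int, int, int, int]:
        """Return (left, top, right, bottom) enclosing all vertices."""
        xs = [x for x, _ in self.vertices]
        ys = [y for _, y in self.vertices]
        return min(xs), min(ys), max(xs), max(ys)


# Made-up sample values for a rectangular small area.
layout = RegionLayout(
    vertices=((40, 100), (560, 100), (560, 140), (40, 140)),
    char_size_pt=10.5,
)
print(layout.bounding_box())
```

Storing the vertices rather than only a bounding box keeps the record usable for small areas that are not axis-aligned rectangles.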

Next, the CPU 110 divides the input image data into image data of a plurality of small areas based on the layout information (step S130). Each small area is given an identifier that distinguishes it from the other small areas. The CPU 110 stores the image data, layout information, and identifier of each small area in the storage unit 120 in association with one another.
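The division in step S130 can be sketched as cropping each bounding box out of the page and numbering the results. The sketch below assumes the page is a plain list of pixel rows and that each small area is an axis-aligned box; the function name `split_page` and the `R1`, `R2`, ... identifier scheme are hypothetical.

```python
from itertools import count


def split_page(page, boxes):
    """Crop each (left, top, right, bottom) box and assign it an identifier."""
    ids = count(1)
    regions = {}
    for left, top, right, bottom in boxes:
        region_id = f"R{next(ids)}"  # hypothetical identifier format
        pixels = [row[left:right] for row in page[top:bottom]]
        # Keep image data, layout information, and identifier together.
        regions[region_id] = {"box": (left, top, right, bottom), "pixels": pixels}
    return regions


# Dummy 6x4 "page" whose pixel value encodes its position (x + 10 * y).
page = [[x + 10 * y for x in range(6)] for y in range(4)]
regions = split_page(page, [(0, 0, 3, 2), (3, 0, 6, 2)])
print(sorted(regions))
```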

Next, the CPU 110 searches each small area for heading character strings (step S140). This processing is performed as follows. First, the CPU 110 performs character recognition processing on the image data of each small area to generate text data, and stores the generated text data in the storage unit 120. The CPU 110 then searches the text data of each small area for the heading character strings. The heading character strings to be searched for may be input by the user when inputting the instruction for the dictionary update processing, or a database, table, function, or the like that defines them may be stored in advance in the storage unit 120 or the HDD 250. Here, the character strings "[Company name]" and "[Business type]" are defined as the heading character strings to be searched for.
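A minimal sketch of the search in step S140, assuming character recognition has already produced one text string per small area. The function name `find_headings`, the region identifiers, and the use of plain substring matching are assumptions made for illustration; the specification does not say how the search is implemented.

```python
# Heading character strings to search for, as defined in the embodiment.
HEADINGS = ("[Company name]", "[Business type]")


def find_headings(region_texts, headings=HEADINGS):
    """Return (region_id, heading) pairs for every heading found in a region."""
    found = []
    for region_id, text in region_texts.items():
        for heading in headings:
            if heading in text:
                found.append((region_id, heading))
    return found


# Recognized text per small area, using the example values of FIG. 6.
region_texts = {
    "R1": "[Business type] Manufacturing",
    "R2": "[Company name] ABC Industrial Co., Ltd.",
    "R3": "[URL] http://www.xxx.yyy.co.jp/",
}
hits = find_headings(region_texts)
print(hits)
```

Region `R3` is not reported because `[URL]` is not among the headings defined as search targets.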

Next, the CPU 110 updates the layout database (step S150). That is, the CPU 110 associates each heading character string found by the search with the identifier specifying the small area in which it was found and with a flag indicating whether that heading character string corresponds to a dictionary registration target phrase (for example, a proper noun or technical term) or to attribute information, and stores them in the storage unit 120 as the layout database. The definition of whether a heading character string corresponds to a dictionary registration target phrase or to attribute information may be input by the user when inputting the instruction for the dictionary update processing, or a database, table, function, or the like that defines this relationship may be stored in advance in the storage unit 120 or the HDD 250. Here, the relationships [Company name] = dictionary registration target phrase and [Business type] = attribute information are defined.
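The layout database update of step S150 can be pictured as writing one record per matched small area. In this sketch the database is an in-memory dictionary keyed by region identifier, and the flag is modeled as an `Enum`; both choices, along with all names, are assumptions rather than the patent's implementation.

```python
from enum import Enum


class Role(Enum):
    """Flag stored with each heading (hypothetical representation)."""
    TERM = "dictionary registration target phrase"
    ATTRIBUTE = "attribute information"


# Relationship defined in the embodiment.
ROLE_BY_HEADING = {
    "[Company name]": Role.TERM,
    "[Business type]": Role.ATTRIBUTE,
}


def update_layout_db(layout_db, hits, role_by_heading=ROLE_BY_HEADING):
    """Store heading, region identifier, and flag in association with each other."""
    for region_id, heading in hits:
        layout_db[region_id] = {
            "heading": heading,
            "role": role_by_heading[heading],
        }
    return layout_db


layout_db = update_layout_db({}, [("R1", "[Business type]"), ("R2", "[Company name]")])
print(layout_db["R2"]["role"].value)
```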

Next, the CPU 110 extracts the dictionary registration target phrases and attribute information (step S160). Referring to the layout database, the CPU 110 extracts the dictionary registration target phrase from the text data of the small area corresponding to the dictionary registration target phrase, and the attribute information from the small area corresponding to the attribute information. The CPU 110 stores the extracted dictionary registration target phrase and attribute information in the storage unit 120 in association with each other. The CPU 110 then determines whether this extraction has been completed for all pages of the document DOC (step S170). If processing has not been completed for all pages (step S170: NO), the CPU 110 repeats steps S160 to S170 until all pages have been processed.
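The extraction in step S160 might look like the following for a single page. The sketch assumes that each small area's text consists of its heading character string followed by the information character string, so the value is obtained by removing the heading; this rule, the string flags `"term"` and `"attribute"`, and the function name `extract_entry` are all hypothetical.

```python
def extract_entry(layout_db, region_texts):
    """Return (term, attributes) for one page using the layout database."""
    term, attributes = None, {}
    for region_id, record in layout_db.items():
        heading = record["heading"]
        # Assumption: the information string is whatever follows the heading.
        value = region_texts[region_id].replace(heading, "", 1).strip()
        if record["role"] == "term":
            term = value
        else:
            attributes[heading.strip("[]")] = value
    return term, attributes


layout_db = {
    "R1": {"heading": "[Business type]", "role": "attribute"},
    "R2": {"heading": "[Company name]", "role": "term"},
}
region_texts = {
    "R1": "[Business type] Manufacturing",
    "R2": "[Company name] ABC Industrial Co., Ltd.",
}
term, attributes = extract_entry(layout_db, region_texts)
print(term, attributes)
```

For a multi-page document, the same function would be called once per page, which corresponds to the loop over steps S160 to S170.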

When processing has been completed for all pages (step S170: YES), the CPU 110 updates the dictionary (step S180). That is, the CPU 110 transmits the dictionary registration target phrases and attribute information stored in the storage unit 120 to the server 200 that sent the execution instruction for the dictionary update processing. The CPU 210 of the server 200 adds the received information to the dictionary DIC stored in the HDD 250. If no dictionary DIC is stored in the HDD 250, the CPU 210 creates a new dictionary DIC from the received information.
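The add-or-create behavior of step S180 on the server side can be sketched as below. Storing the dictionary DIC as a JSON file is purely an assumption for illustration, as are the function name `register` and the second sample entry ("XYZ Trading Co., Ltd." is made up); the specification does not describe the dictionary's storage format.

```python
import json
import tempfile
from pathlib import Path


def register(dictionary_path, term, attributes):
    """Add an entry to the dictionary file, creating the file if it is missing."""
    path = Path(dictionary_path)
    if path.exists():
        dictionary = json.loads(path.read_text(encoding="utf-8"))
    else:
        dictionary = {}  # no dictionary stored yet: start a new one
    dictionary.setdefault(term, {}).update(attributes)
    path.write_text(
        json.dumps(dictionary, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return dictionary


with tempfile.TemporaryDirectory() as tmp:
    dic_path = Path(tmp) / "dic.json"
    # First call creates the dictionary, second call adds to it.
    first = register(dic_path, "ABC Industrial Co., Ltd.", {"Business type": "Manufacturing"})
    second = register(dic_path, "XYZ Trading Co., Ltd.", {"Business type": "Wholesale"})
print(sorted(second))
```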

FIG. 7 is a diagram illustrating the contents of the dictionary DIC. In this example, "Japanese company name" is recorded as the dictionary registration target phrase, and "English company name" and "business type" are recorded as its attribute information. Thus, according to the present embodiment, a dictionary in which dictionary registration target phrases are associated with their attribute information can be created and updated automatically.

<Modifications>
The present invention is not limited to the embodiment described above, and various modifications are possible.
The embodiment described above searches the small areas for heading character strings and information character strings. Instead of character strings, however, images having a fixed pattern (graphics created according to fixed rules, such as barcodes or Calra codes) may be searched for. That is, heading graphics and information graphics may be used in place of heading character strings and information character strings.

In the embodiment described above, the server 200 outputs a dictionary update instruction, and the multifunction peripheral 100 extracts the information needed to update the dictionary and transmits it to the server 200. However, the division of functions between the multifunction peripheral 100 and the server 200 is not limited to that arrangement. Some or all of the functions described as functions of the multifunction peripheral 100 may be executed by the server 200. Conversely, some or all of the functions described as functions of the server 200 may be executed by the multifunction peripheral 100. For example, the multifunction peripheral 100 may have all the functions of the dictionary update system 1 described in the embodiment. This is effective when the multifunction peripheral 100 itself stores a translation program and a dictionary in the storage unit 120 and functions as a translation machine.

In the embodiment described above, the multifunction peripheral 100 has each of the functions described. Instead of a multifunction peripheral, however, an image reading apparatus without an image forming function, such as a scanner, may be used.

In the embodiment described above, the multifunction peripheral 100 updates the layout database automatically, but it need not do so. The layout database may instead be determined in advance and left unchanged. Alternatively, the user may supply the layout database when instructing execution of the dictionary update process.

FIG. 1 is a block diagram showing the functional configuration of the dictionary update system 1 according to one embodiment.
FIG. 2 is a diagram showing the configuration of the dictionary update system 1.
FIG. 3 is a diagram showing the hardware configuration of the multifunction peripheral 100.
FIG. 4 is a diagram showing the hardware configuration of the server 200.
FIG. 5 is a flowchart showing the operation of the dictionary update system 1.
FIG. 6 is a diagram illustrating the document DOC.
FIG. 7 is a diagram illustrating the contents of the dictionary DIC.

Explanation of symbols

1: dictionary update system, 10: image reading unit, 20: layout analysis unit, 30: area division unit, 40: attribute information extraction unit, 50: dictionary data registration unit, 60: registration target phrase extraction unit, 100: multifunction peripheral, 110: CPU, 120: storage unit, 130: display unit, 140: operation unit, 150: I/F, 160: image reading system, 161: document conveying unit, 162: image reading unit, 163: image processing unit, 170: image forming system, 171: paper conveying unit, 172: image forming unit, 190: bus, 200: server, 210: CPU, 220: ROM, 230: RAM, 240: I/F, 250: HDD, 260: keyboard, 270: mouse, 280: display, 290: bus, 300: network

Claims

An image reading apparatus comprising:
image reading means for reading an image of a document and generating input image data;
layout analysis means for performing layout analysis processing on the input image data generated by the image reading means and generating layout information;
image dividing means for dividing the input image data into a plurality of small areas based on the layout information generated by the layout analysis means;
a layout database that stores, in association with each other, a first identifier specifying, among the plurality of small areas, a small area having a heading character string or heading image, that heading character string or heading image, a second identifier specifying a small area having an information character string or information image, and that information character string or information image;
information extracting means for extracting a dictionary registration target phrase from the small area specified by the first identifier stored in the layout database, and for extracting attribute information of the dictionary registration target phrase from the small area specified by the second identifier stored in the layout database; and
output means for outputting the dictionary registration target phrase extracted by the information extracting means.

The image reading apparatus according to claim 1, further comprising:
definition storage means for storing definitions of the heading character string or heading image and of the information character string or information image;
small area specifying means for specifying, among the plurality of small areas and according to the definitions stored in the definition storage means, a small area having a heading character string or heading image and a small area having an information character string or information image; and
database update means for updating the contents of the layout database based on information on the small areas specified by the small area specifying means.

A dictionary registration target phrase extraction method comprising:
an image reading step of reading an image of a document and generating input image data;
a layout analysis step of performing layout analysis processing on the input image data and generating layout information;
an image dividing step of dividing the input image data into a plurality of small areas based on the layout information;
an information extraction step of referring to a layout database that stores, in association with each other, a first identifier specifying, among the plurality of small areas, a small area having a heading character string or heading image, that heading character string or heading image, a second identifier specifying a small area having an information character string or information image, and that information character string or information image, extracting a dictionary registration target phrase from the small area specified by the first identifier stored in the layout database, and extracting attribute information of the dictionary registration target phrase from the small area specified by the second identifier stored in the layout database; and
an output step of outputting the dictionary registration target phrase extracted in the information extraction step.

A program for causing a computer device to execute:
an image reading step of reading an image of a document and generating input image data;
a layout analysis step of performing layout analysis processing on the input image data and generating layout information;
an image dividing step of dividing the input image data into a plurality of small areas based on the layout information;
an information extraction step of referring to a layout database that stores, in association with each other, a first identifier specifying, among the plurality of small areas, a small area having a heading character string or heading image, that heading character string or heading image, a second identifier specifying a small area having an information character string or information image, and that information character string or information image, extracting a dictionary registration target phrase from the small area specified by the first identifier stored in the layout database, and extracting attribute information of the dictionary registration target phrase from the small area specified by the second identifier stored in the layout database; and
an output step of outputting the dictionary registration target phrase extracted in the information extraction step.