JPH0757040A - Filing device provided with OCR

Info

Publication number: JPH0757040A
Application number: JP5202253A
Authority: JP
Inventors: Yutaka Katsuyama; 裕勝山
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1993-08-16
Filing date: 1993-08-16
Publication date: 1995-03-03

Abstract

PURPOSE: To improve the recognition performance of the recognition dictionary used for character recognition by continuously running a learning process based on the character recognition results stored in a file device. CONSTITUTION: The device comprises an image input unit 2A, a document structure analysis unit 4 that analyzes the input image and separates it into areas with different attributes, a character recognition unit 5 that performs character recognition using a recognition dictionary 9, and a file device 7. It is further provided with a sentence analysis processing unit 24 that analyzes the character data of the character areas stored in the file device using a word dictionary and grammar. The character recognition unit 5 updates the recognition dictionary 9 based on information about the misrecognized characters, thereby optimizing the dictionary. The sentence analysis process and the dictionary update process run continuously as background processes.

Description

Detailed Description of the Invention

[0001]

Field of Industrial Application: The present invention relates to a filing device with OCR. Such a device reads a document, for example a newspaper clipping, with a scanner, separates the input image into areas with different attributes (character areas, chart areas, photo areas, and so on), and stores them in a file device. It also performs character recognition on the character areas and stores the recognition result data in the same file device.

[0002] The device also allows the data accumulated in the file device to be re-edited, for example by running keyword searches over it.

[0003]

Description of the Related Art: FIG. 9 is an explanatory diagram of the prior art. In FIG. 9, 1 is a filing device with OCR, 2 is a scanner, 3 is an image memory, 4 is a document structure analysis unit, 5 is a character recognition unit, 7 is a file device, 8 is a keyword search unit, 9 is a recognition dictionary, and 10 is a display.

[0004] A device such as the one shown in FIG. 9 has conventionally been known as a filing device with OCR. As shown in the figure, the filing device 1 with OCR includes a scanner 2, an image memory 3, a document structure analysis unit 4, a character recognition unit 5, a file device 7, a keyword search unit 8, a display 10, and so on. The character recognition unit 5 also holds a recognition dictionary 9 used in the character recognition process.

[0005] In this device, a document such as a newspaper clipping is input, and the "character areas" (body text, headlines, and so on), "chart areas", and "photo areas" are extracted from the input image (image data), separated automatically, and stored in the file device. Character recognition is performed on each "character area", and the resulting character data is also stored in the file device. The details are as follows.

[0006] First, the scanner 2 reads the document, and the input image (image data) is binarized and stored in the image memory 3. Next, the document structure analysis unit 4 automatically analyzes the input image stored in the image memory 3, identifies and separates the "character areas", "chart areas", and "photo areas", and stores (files) them in the file device 7.

[0007] Each separated "character area" is sent to the character recognition unit 5. The character recognition unit 5 performs character recognition on the "character area" using the recognition dictionary 9, and the recognition result data (character codes) is also stored (filed) in the file device 7.

[0008] When data is stored in the file device 7, the attribute data for each area obtained during the analysis by the document structure analysis unit 4 (information on body text, headlines, charts, photographs, and so on) is stored along with it.

[0009] Character data accumulated in the file device 7 in this way can be searched by entering a keyword through the keyword search unit 8. If the keyword search finds matching data, the corresponding image information, attribute data, and so on are shown on the display 10.

[0010]

Problems to Be Solved by the Invention: The conventional device described above has the following problem. The character recognition function that generates the searchable character data is not perfect, and an error rate of several percent is usual. As a result, a search can fail to find a document even though the original contains the keyword.
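The failure mode can be illustrated with a minimal sketch. The file numbers, the sample strings, and the helper `search` are hypothetical and only stand in for the character data held in the file device; plain substring matching is assumed for the keyword search.

```python
def search(files, keyword):
    """Return the numbers of the files whose recognized text contains the keyword."""
    return [number for number, text in files.items() if keyword in text]


# Hypothetical recognition results. File 2 contains one misrecognized
# character ("rec0gnition" instead of "recognition").
files = {
    1: "character recognition improves filing",
    2: "character rec0gnition of newspaper clippings",
}

# File 2 is missed although the original document contains the keyword.
hits = search(files, "recognition")
```

A single wrong character is enough to hide a document from an exact-match search, which is why the recognition error rate matters for retrieval.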

[0011] The present invention aims to solve this problem by improving the recognition performance of the recognition dictionary used for character recognition: the dictionary is trained continuously on the character recognition results stored in the file device.

[0012]

Means for Solving the Problems: FIG. 1 illustrates the principle of the present invention. In FIG. 1, parts that are the same as in FIG. 9 carry the same reference numerals. In addition, 2A is an image input unit, 24 is a sentence analysis processing unit, and 25 is a word dictionary/grammar unit.

[0013] To solve the above problem, the present invention is configured as follows. A filing device with OCR comprises an image input unit 2A that inputs a document image; a document structure analysis unit 4 that analyzes the input document image, separates it into areas with different attributes (character areas, chart areas, photo areas), and extracts various information; a character recognition unit 5 that performs character recognition on the character areas among the separated areas using a recognition dictionary 9; and a file device 7 that stores the document data (image data, character data, and so on) obtained by these units. The device is further provided with a sentence analysis processing unit 24 that performs sentence analysis on the character data in the character areas of the document data stored in the file device 7, extracts misrecognized characters, and determines the correct character for each. Based on the information about the misrecognized characters, the character recognition unit 5 updates the recognition dictionary 9, which makes it possible to optimize the dictionary.

[0014] In the above configuration, a word dictionary/grammar unit 25 that stores a word dictionary and grammar information is provided, and the sentence analysis processing unit 24 performs sentence analysis using the word dictionary and grammar held in that unit.

[0015] In the above configuration, when the character recognition unit 5 updates the recognition dictionary, it uses the information about each misrecognized character to cut out the corresponding image data, computes a feature quantity from it, and updates the recognition dictionary 9 based on that feature quantity.

[0016] In the above configuration, internal processing is multitasked so that the sentence analysis by the sentence analysis processing unit 24 and the dictionary update by the character recognition unit 5 can run continuously as background jobs, independently of other processing.

[0017]

Operation: The operation of the present invention based on the above configuration is described with reference to FIG. 1. When document data is input and saved in the file device, processing proceeds as follows.

[0018] First, the image input unit 2A reads the document and inputs an image. The input image is temporarily stored in the image memory 3, after which the document structure analysis unit 4 analyzes it. The document structure analysis unit 4 scans the data in the image memory 3 and extracts and separates the "character areas", "chart areas", and "photo areas". It also extracts information about the areas, such as their number, positions, widths, and heights. This data is then transferred to the file device 7 and stored.

[0019] For each area judged to be a "character area", the character recognition unit 5 performs character recognition using the recognition dictionary 9, and the recognition result data is also transferred to the file device 7 and stored. The file device 7 thus holds both image data and character data.

[0020] Optimization of the recognition dictionary by a background job: the recognition dictionary is optimized by a background job as follows.

[0021] First, the sentence analysis processing unit 24 performs sentence analysis on the character data of the character areas stored in the file device 7. Using the word dictionary and grammar of the word dictionary/grammar unit 25, it extracts the misrecognized characters, determines the correct character for each, and sends this information to the character recognition unit 5.

[0022] The character recognition unit 5 then cuts out from the file device 7 the image data corresponding to each misrecognized character and converts it into a feature quantity. From the converted feature quantity and the feature quantities of the categories stored in the recognition dictionary 9, it updates the recognition dictionary 9 so that the dictionary becomes optimal for recognition.

[0023] By running the above processing continuously as a background job, the recognition dictionary is trained and optimized. As a result, the recognition performance of the recognition dictionary improves.
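The background-job arrangement can be sketched as follows. This is only an illustration under stated assumptions: a Python thread stands in for the multitasking described here, and `background_learner`, `run_demo`, and the `learn` callback are hypothetical names, with `learn` standing in for the sentence analysis and dictionary update steps.

```python
import queue
import threading


def background_learner(pending, learn, stop):
    """Consume stored recognition results and feed them to `learn` until stopped."""
    while not stop.is_set():
        try:
            item = pending.get(timeout=0.05)
        except queue.Empty:
            continue  # nothing to learn from yet; keep waiting
        try:
            learn(item)
        finally:
            pending.task_done()


def run_demo(items):
    """Start the learner, queue items as foreground work would, then shut down."""
    pending = queue.Queue()
    stop = threading.Event()
    learned = []
    worker = threading.Thread(
        target=background_learner,
        args=(pending, learned.append, stop),
        daemon=True,
    )
    worker.start()
    for item in items:
        pending.put(item)   # foreground work continues while the worker learns
    pending.join()          # wait until every queued item has been processed
    stop.set()
    worker.join(timeout=1.0)
    return learned
```

Calling `run_demo(["file-1", "file-2"])` processes both items in the worker thread while the caller remains free to do other work until it chooses to wait.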

[0024]

Embodiments: Embodiments of the present invention are described below with reference to the drawings. FIGS. 2 to 8 show an embodiment of the present invention. In FIGS. 2 to 8, parts that are the same as in FIGS. 1 and 9 carry the same reference numerals.

[0025] In addition, 11 is a CPU (central processing unit), 12 and 13 are input/output control units (hereinafter simply "I/O"), 14 is a memory, 17 is an optical disk device input/output control unit (hereinafter simply "optical disk device I/O"), 18 is an optical disk device, 21 is an image processing unit, 22 is a matching unit (or collation unit), and 23 is a keyword recording/search unit.

[0026] §1: Configuration of the filing device with OCR (see FIG. 2)

FIG. 2 shows the device configuration of the embodiment. As shown, the filing device 1 with OCR includes a CPU 11, I/Os 12 and 13, a scanner 2, an image memory 3, a display 10, a memory 14, an optical disk device I/O 17, an optical disk device 18, a document structure analysis unit 4, a character recognition unit 5, a keyword recording/search unit 23, a sentence analysis processing unit 24, and a word dictionary/grammar unit 25.

[0027] The character recognition unit 5 contains an image processing unit 21, a matching unit 22, and the recognition dictionary 9. The functions of the units are as follows.

(1): The CPU 11 is a processor that performs the various kinds of control within the filing device with OCR.

[0028] (2): The I/Os 12 and 13 perform input/output control for the scanner 2 and the display.

(3): The scanner 2 optically reads a document and inputs a document image.

[0029] (4): The display 10 displays various information.

(5): The optical disk device I/O 17 performs input/output control for the optical disk device 18.

[0030] (6): The optical disk device 18 is a file device capable of recording and reproducing data (for example, a magneto-optical disk device).

(7): The document structure analysis unit 4 analyzes the input image, separates it into "character areas", "chart areas", "photo areas", and so on, and detects the coordinates of each area (details are given later).

[0031] (8): The image processing unit 21 processes the image data, for example by feature extraction, when the character recognition unit 5 performs character recognition.

[0032] (9): The matching unit 22 performs character recognition by comparing the information processed by the image processing unit 21 (feature quantities and so on) against the recognition dictionary 9.

(10): The keyword recording/search unit 23 records keywords for character data on the optical disk device 18 and performs keyword searches over the character data stored on the optical disk device 18.

[0033] (11): The sentence analysis processing unit 24 performs sentence analysis on the character data in the optical disk device 18 using the information in the word dictionary/grammar unit 25 (details are given later).

[0034] (12): The word dictionary/grammar unit 25 stores a word dictionary and the grammar information needed for sentence analysis.

(13): The memory 14 is a working memory.

[0035] §2: Data in the optical disk device (see FIG. 2)

The data stored in the optical disk device 18 includes, for example, the file number, the image information, the total number of areas, the number of image areas, and the number of character areas.

[0036] The image information is made up of the image data, a binary code string, and so on, along with information such as "width" and "height". The number-of-image-areas entry consists of an "image area number", an "image area start position", a "width", a "height", and so on.

[0037] The number-of-character-areas entry consists of the "character area contents" and "attributes". The "attributes" comprise information such as "vertical/horizontal writing", "area start position", "width", "height", and "relationship with image area".
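One way to picture the stored data is as a record per file. The sketch below is an assumption-laden illustration, not the on-disk format: all class and field names are hypothetical, positions are assumed to be (x, y) pixel pairs, and the total number of areas is assumed to be the sum of the image and character areas.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ImageArea:
    number: int                 # "image area number"
    start: Tuple[int, int]      # "image area start position" as (x, y)
    width: int
    height: int


@dataclass
class CharacterArea:
    content: str                # recognized text ("character area contents")
    vertical: bool              # True for vertical writing, False for horizontal
    start: Tuple[int, int]      # "area start position" as (x, y)
    width: int
    height: int
    caption_of: Optional[int] = None  # image area number this text describes, if any


@dataclass
class FileRecord:
    file_number: int
    image: bytes                # binarized page image
    width: int
    height: int
    image_areas: List[ImageArea] = field(default_factory=list)
    character_areas: List[CharacterArea] = field(default_factory=list)

    @property
    def total_areas(self) -> int:
        # Assumption: the total is the sum of image and character areas.
        return len(self.image_areas) + len(self.character_areas)


record = FileRecord(
    file_number=1, image=b"", width=1000, height=1400,
    image_areas=[ImageArea(1, (100, 100), 400, 300)],
    character_areas=[
        CharacterArea("Caption text", False, (100, 410), 400, 40, caption_of=1)
    ],
)
```

Keeping the image areas and character areas in the same record mirrors the pairing of image data and character data that the later keyword search relies on.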

[0038] §3: Document data saving procedure (see FIG. 2)

The document data saving procedure is described below with reference to FIG. 2.

(1): When the device starts, the CPU 11 activates the scanner 2, which reads the document and begins inputting the image. The input image (image data) is binarized and then temporarily stored in the image memory 3.

[0039] (2): Next, the CPU 11 activates the document structure analysis unit 4, which analyzes the input image (image data) in the image memory 3. The document structure analysis unit 4 scans the data in the image memory 3 and extracts and separates the "character areas", "chart areas", and "photo areas". It also extracts the number, position, width, and height of the areas.

[0040] Under the control of the CPU 11, this data is transferred to the optical disk device 18 and stored on the medium (optical disk). For each "character area", the device automatically determines whether the content is written vertically or horizontally, and stores that information on the optical disk device 18 as well.

[0041] Furthermore, for a character area beside a photograph, chart, or the like, the device determines, for example by histogram calculation and threshold tests, whether the area has the attribute of a caption for that photograph or chart, and transfers the result to the optical disk device 18 for storage on the medium (optical disk).

[0042] That is, the relationship between an "image area" (a "chart area" or "photo area") and a "character area" is determined automatically, using conditions such as whether the two areas lie within a certain threshold distance of each other and whether the character area is sufficiently small compared with the image area. For a "character area" judged to be the caption of an "image area", that information is also stored on the optical disk device 18.

[0043] Each area judged to be a "character area" in the above processing undergoes character recognition by the character recognition unit 5, and the result is stored on the optical disk device 18.

(3): For each area judged to be a "character area" in the analysis by the document structure analysis unit 4, the CPU 11 activates the character recognition unit 5 to perform character recognition.

[0044] Here, the image processing unit 21 of the character recognition unit 5 performs image processing (feature extraction and so on) only on the part of the image memory 3 judged to be a "character area", and the matching unit 22 performs character recognition using the recognition dictionary 9.

[0045] (4): When character recognition by the character recognition unit 5 finishes, the character data of the recognition result (character codes and so on) is stored on the optical disk device 18 together with the attributes obtained by the document structure analysis unit 4. Within the optical disk device 18, the "image data" and the "character data" are managed as a pair.

[0046] (5): The above steps are carried out, for example, for each page of the document being read, and repeated until the unit of the document specified by the operator is finished.

§4: Keyword search procedure (see FIG. 2)

A keyword search over the data stored on the optical disk device is performed as follows.

[0047] (1): First, a keyword thought to be contained in the desired document data is entered using an input device such as a keyboard.

(2): Based on the entered keyword, the keyword recording/search unit 23 performs a full-text search over all the character data stored on the optical disk device 18. Only the files whose character data contains the keyword remain as targets of subsequent searches.

[0048] (3): This process is repeated to narrow down the candidate files from the large number of stored files. Once few candidates remain, the image data corresponding to the character data is extracted at the operator's instruction.

[0049] Under the control of the CPU 11, the search result data is then sent to the display 10 and displayed. The operator looks at the display screen and checks whether it is the desired file.

[0050] (4): If it is the desired file, the data is output by means such as printing, transfer to another disk device, or FAX transmission.

§5: Document structure search procedure (see FIG. 2)

(1): First, the operator selects one of the displayed document structure keywords, or types one in, and starts a search with it. The CPU 11 then activates the keyword recording/search unit 23.

[0051] (2): The keyword recording/search unit 23 searches the character data of all documents stored on the optical disk device 18, and only the documents containing the document structure keyword remain as targets.

[0052] (3): This process is repeated to narrow down the candidate files from the large number of stored files. Once few candidates remain, the image data corresponding to the character data is extracted at the operator's instruction.

[0053] Under the control of the CPU 11, the search result data is then sent to the display 10 and displayed. The operator looks at the display screen and checks whether it is the desired file.

[0054] (4): If it is the desired file, the data is output by means such as printing, transfer to another disk device, or FAX transmission.

§6: Dictionary optimization by a background job (automatic dictionary update) (see FIG. 2)

The recognition dictionary is optimized by a background job (automatic dictionary update) as follows.

[0055] (1): When the CPU 11 activates the sentence analysis processing unit 24, that unit reads the character data in the character data area of a file stored on the optical disk device 18 and stores it in the memory 14.

[0056] (2): At the same time, the image data representing this character area within the file's image information is stored in the memory 14.

(3): The sentence analysis processing unit 24 performs sentence analysis on the character data in the memory 14, using the word dictionary and grammar of the word dictionary/grammar unit 25, and extracts the misrecognized characters.

[0057] The position of each such character, together with its character index counted from the start of the area, is stored in the memory 14. The correct category of the character is also stored in the memory 14.

[0058] (4): Next, using the information in the memory 14 about the misrecognized characters, the CPU 11 cuts out from the optical disk device 18 the image data (original image data) corresponding to each misrecognized character, sends it to the character recognition unit 5, and activates the character recognition unit 5.

[0059] (5): In the character recognition unit 5, the image processing unit 21 extracts a feature quantity from the cut-out image data. From the extracted feature quantity and the feature quantities of the categories stored in the recognition dictionary 9, the character recognition unit 5 then updates the recognition dictionary so that it becomes optimal for recognition.
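Step (3) of this procedure, finding misrecognized characters with a word dictionary, can be sketched as below. The source does not specify the analysis algorithm, so this is a deliberately simple stand-in: it assumes space-separated words (real Japanese text would need morphological analysis and the grammar information), and it flags a character only when exactly one dictionary word of the same length differs from the recognized word in exactly one position. The function name `find_misrecognitions` is hypothetical.

```python
def find_misrecognitions(text, word_dict):
    """Return (index_in_area, recognized_char, correct_char) for each suspect character."""
    findings = []
    pos = 0  # index of the current word's first character within the area
    for word in text.split(" "):
        if word and word not in word_dict:
            # Dictionary words of equal length that differ in exactly one character.
            candidates = [
                w for w in word_dict
                if len(w) == len(word)
                and sum(a != b for a, b in zip(w, word)) == 1
            ]
            if len(candidates) == 1:  # only act on an unambiguous correction
                correct = candidates[0]
                i = next(k for k in range(len(word)) if word[k] != correct[k])
                findings.append((pos + i, word[i], correct[i]))
        pos += len(word) + 1  # +1 for the separating space
    return findings


words = {"the", "quick", "brown", "fox"}
found = find_misrecognitions("the qu1ck brown fox", words)
```

Each tuple carries what the procedure says is kept in memory: where the character sits in the area and what its correct category is, which is enough to cut out the matching image data in step (4).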

[0060] The recognition dictionary can be updated by methods such as the following (all of them conventionally used). One method takes a weighted average of the dictionary's feature quantity for the correct character category and the feature quantity of the misrecognized character.

[0061] Another method clusters the feature quantity of the misrecognized character against the feature quantities of all categories in the dictionary and finds the closest category. If that category is not the correct category of the misrecognized character, the feature quantity of the misrecognized character is added to the dictionary as a new character entry.
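The two update methods can be sketched as follows. This is a minimal illustration under assumptions not stated in the source: the dictionary is modeled as a mapping from category label to a list of feature vectors, the weight value is arbitrary, squared Euclidean distance is used, and a plain nearest-template search stands in for the clustering step. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_by_weighted_average(dictionary, correct, feature, weight=0.1):
    """Pull the first template of the correct category toward the new sample."""
    old = dictionary[correct][0]
    dictionary[correct][0] = [
        (1 - weight) * o + weight * f for o, f in zip(old, feature)
    ]


def update_by_adding_template(dictionary, correct, feature):
    """Add the sample as a new template if the nearest category is the wrong one."""
    nearest = min(
        ((label, t) for label, templates in dictionary.items() for t in templates),
        key=lambda pair: sq_dist(pair[1], feature),
    )[0]
    if nearest != correct:
        dictionary.setdefault(correct, []).append(list(feature))
    return nearest


# Weighted average: the template for "i" moves a quarter of the way to the sample.
averaged = {"i": [[0.0, 1.0]]}
update_by_weighted_average(averaged, "i", [1.0, 0.0], weight=0.25)

# Added template: the sample looks like "1" but is known to be "i".
extended = {"i": [[0.0, 1.0]], "1": [[1.0, 0.0]]}
nearest_before = update_by_adding_template(extended, "i", [0.8, 0.2])
```

Averaging keeps the dictionary the same size but shifts an existing template, while adding a template preserves the old one and lets a category cover visually distinct variants of the same character.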

[0062] The above processing is applied to all files stored on the optical disk device 18 to train the recognition dictionary.

§7: Document data and the processing of the document structure analysis unit (see FIG. 3)

FIG. 3 is an explanatory diagram of document data. The document data stored on the optical disk device (image data and character data) and the processing of the document structure analysis unit are described below with reference to FIG. 3.

[0063] As described above, a document (printed document) such as a newspaper clipping is input with the scanner 2, binarized, and stored in the image memory 3 as image data. The binarization is performed as follows.

[0064] For example, a histogram of the scanned densities is taken, and a density slightly darker than the most frequent density is used as the threshold. Recently it has become usual to use the binarization algorithm supplied with the scanner itself.
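The histogram-based thresholding can be sketched as below. The details are assumptions made for illustration: densities are taken to be integers from 0 (white) to 255 (black), the most frequent density is assumed to be the paper background, and the `offset` that makes the threshold "slightly darker" is an arbitrary value. The function name `binarize` is hypothetical.

```python
from collections import Counter


def binarize(pixels, offset=32):
    """Binarize a 2D list of densities (0 = white, 255 = black).

    The threshold is set slightly darker than the most frequent density,
    which is assumed to be the background.
    """
    histogram = Counter(value for row in pixels for value in row)
    # Most frequent density; ties are resolved toward the lighter value.
    background = max(histogram, key=lambda v: (histogram[v], -v))
    threshold = background + offset
    return [[1 if value > threshold else 0 for value in row] for row in pixels]


page = [
    [10, 10, 10, 200],
    [10, 12, 220, 10],
]
binary = binarize(page)
```

Anchoring the threshold to the background peak lets the same rule work on paper of different shades, since only the offset from the dominant density matters.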

[0065] Such algorithms range from those in which the user sets a variable threshold to those that use automatic exposure measurement to vary the threshold for each scanned line, so that the background density and the pattern density are separated as well as possible.

[0066] As illustrated, the document data input as described above consists of "character area 1", "character area 2", "character area 3", "chart area 1", "photograph area 1", and so on. That is, a "character area" is a group of characters whose circumscribing outline forms a rectangle, a "photograph area" is a photograph enclosed by its circumscribing rectangle, and a "chart area" is a drawing whose circumscribing outline forms a rectangle.

[0067] The document structure analysis unit 4 performs the following three processes. (1) For each "character area", "photograph area", and "chart area" in the document, the coordinates and size are obtained and stored.

[0068] (2) For each area determined to be a "character area", whether the text is written vertically or horizontally is further determined from the direction of the character strings (lines) in the area, and the result is also stored. (3) For a character area lying within a "photograph area" or "chart area", if the distance between the photograph or chart and the character area is smaller than a certain threshold, the character area is determined to be the caption of that photograph or chart, and that information is also stored.

[0069] §8: Description of the area extraction processing for document data (see FIG. 4)

FIG. 4 is a flowchart of the area extraction/determination processing. The area extraction/determination processing is described below with reference to FIG. 4. S1 to S9 denote the processing step numbers.

[0070] This processing is performed by the document structure analysis unit 4, which scans the input image (image data) stored in the image memory 3 when a document is input and extracts the "character areas", "chart areas", and "photograph areas".

[0071] First, the outline of each cluster of black pixels is traced to obtain the coordinates of its circumscribing rectangle (S1). Next, a histogram of the rectangle heights is computed, and the range of heights from the threshold up to the point where the histogram peak breaks off is judged to be the character portion (S2). Only the rectangles falling within this range are then extracted (S3). At this point only small rectangles remain.

[0072] When small rectangles lie within a threshold distance of each other and can be combined into a larger rectangle, the two small rectangles are merged into one. This is repeated to build up large character rectangle areas, which are stored in the memory 14 (S4).
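The repeated merging of step S4 can be sketched as below. Rectangles are `(x1, y1, x2, y2)` tuples (upper-left and lower-right corners). The distance measure (the larger of the horizontal and vertical gaps) and the function names are assumptions for this illustration; the patent does not specify how the distance between rectangles is measured.

```python
def gap(a, b):
    """Distance between two rectangles: the larger of the x and y gaps.

    Overlapping or touching rectangles have a gap of 0 (assumed metric).
    """
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)

def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_rectangles(rects, max_gap):
    """Merge rectangles pairwise until no two are within `max_gap`."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if gap(rects[i], rects[j]) <= max_gap:
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects
```

For instance, three character boxes spaced 2 pixels apart on one line collapse into a single line-sized rectangle when `max_gap` is 3, while a distant box is left alone.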

[0073] The character rectangle areas are then removed from the original image (S5), and the portions made up of lines are first extracted from the remaining image. For example, the image is scanned with a 3×3 line segment detection mask to mark locations that appear to be line segments.
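A scan with a 3×3 line segment detection mask might look like the following sketch. The patent does not give the mask coefficients; the two masks below are the common textbook horizontal and vertical line detectors, and the response threshold of 6 is likewise an assumption chosen for a 0/1 image.

```python
# Textbook 3x3 line-detection masks (assumed; not specified in the patent).
HORIZONTAL = ((-1, -1, -1), (2, 2, 2), (-1, -1, -1))
VERTICAL = ((-1, 2, -1), (-1, 2, -1), (-1, 2, -1))

def mask_response(image, y, x, mask):
    """Sum of mask coefficients times pixel values in the 3x3 window at (y, x)."""
    return sum(
        mask[dy][dx] * image[y + dy - 1][x + dx - 1]
        for dy in range(3)
        for dx in range(3)
    )

def line_pixels(image, min_response=6):
    """Return the set of interior (y, x) positions that look like line segments."""
    hits = set()
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            if any(
                mask_response(image, y, x, m) >= min_response
                for m in (HORIZONTAL, VERTICAL)
            ):
                hits.add((y, x))
    return hits
```

A one-pixel-wide stroke gives the maximum response of 6, whereas the interior of a solid black block gives 0, which is what lets ruled lines in charts be told apart from filled regions.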

[0074] Within a region containing many apparent line segments, the portion connected by black pixels is stored as a "chart area" (S6). After this processing, the "chart areas" are removed from the original image (S7). The remaining image is regarded as "photograph areas", and the circumscribing rectangles of the connected black pixels are obtained and stored (S8). If the areas overlap, the overlap is removed with priority given to the "character area" (S9).

[0075] §9: Description of the vertical/horizontal writing determination processing for character areas (see FIG. 5)

FIG. 5 is a flowchart of the vertical/horizontal writing determination processing for character areas. The vertical/horizontal writing determination processing is described below with reference to FIG. 5.

[0076] S11 to S14 denote the processing step numbers. This processing is performed by the document structure analysis unit 4. First, the arrangement of the circumscribing rectangles of the black pixels representing characters in the character area is examined (S11). Then, assuming horizontal writing, the rows of rectangles are scanned horizontally using the rectangle coordinates. If, in this scan, the vertical coordinates of the upper-left and upper-right points fall within a certain threshold range, the area is judged to be written horizontally (S12).

[0077] Similarly, assuming vertical writing, the columns of rectangles are scanned vertically, and if the horizontal coordinates of the upper-left and upper-right points fall within a certain threshold range, the area is judged to be written vertically (S13).
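One possible reading of the horizontal and vertical tests (S12, S13), together with the average-distance tie-break of step S14, is sketched below. Rectangles are `(x1, y1, x2, y2)` tuples with non-zero width and height. Treating two rectangles as aligned when their upper-left coordinates differ by at most `tolerance`, and the function names, are assumptions of this example rather than details given in the patent.

```python
def neighbour_gaps(rects, tolerance, axis):
    """Gaps from each rectangle to its nearest aligned successor.

    axis=0 scans left to right (tops must align within `tolerance`);
    axis=1 scans top to bottom (left edges must align within `tolerance`).
    """
    other = 1 - axis
    gaps = []
    for a in rects:
        candidates = [
            b[axis] - a[axis + 2]
            for b in rects
            if abs(b[other] - a[other]) <= tolerance and b[axis] >= a[axis + 2]
        ]
        if candidates:
            gaps.append(min(candidates))
    return gaps

def writing_direction(rects, tolerance):
    """Return "horizontal", "vertical", or None if neither test succeeds."""
    horizontal = neighbour_gaps(rects, tolerance, axis=0)
    vertical = neighbour_gaps(rects, tolerance, axis=1)
    if horizontal and not vertical:
        return "horizontal"
    if vertical and not horizontal:
        return "vertical"
    if not horizontal and not vertical:
        return None
    # Both tests succeeded: prefer the direction with the smaller mean gap.
    mean_h = sum(horizontal) / len(horizontal)
    mean_v = sum(vertical) / len(vertical)
    return "horizontal" if mean_h <= mean_v else "vertical"
```

In regularly typeset text the characters line up in both directions, so the tie-break does the real work: the spacing between characters on a line is normally smaller than the spacing between lines.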

[0078] If only one of the two judgments is made, it is adopted. If both are made, the direction in which the average distance to the adjacent rectangle is smaller is selected (S14).

§10: Description of document data (see FIG. 6)

FIG. 6 is an explanatory diagram of document data. The illustrated example is one example of the document data stored in the optical disk device 18.

[0079] In this example, the "page", "photograph area", "chart area", and "character area" items of the document data are each held in a list structure, because the number of each kind of element cannot be predicted in advance. Because the recognized characters are stored in the structures shown in the figure, and each carries a character ID, it is possible to identify which character in which character area of the original (input) image a given recognition result came from.
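The linked-list layout of FIG. 6 can be mirrored in Python roughly as follows. The field names follow the struct members quoted in this section (with `next` standing in for each struct's pointer to the following element), while the class names, the `Rect` alias, and the `find_char` helper are assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x1, y1, x2, y2): upper-left, lower-right

@dataclass
class Chars:  # struct CHARS: one recognized character
    rect: Rect
    char_id: int  # "int ID": identifies the character within its area
    code: int     # "int code": recognition result
    next: Optional["Chars"] = None

@dataclass
class CharArea:  # struct CHAR: one character area
    rect: Rect
    chars: Optional[Chars] = None
    next: Optional["CharArea"] = None

@dataclass
class Photo:  # struct PHOTO
    rect: Rect
    beside: int  # number of the character area holding the caption
    next: Optional["Photo"] = None

@dataclass
class Picture:  # struct PICTURE (chart area)
    rect: Rect
    beside: int
    next: Optional["Picture"] = None

@dataclass
class Page:  # struct PAGE
    photo: Optional[Photo] = None
    picture: Optional[Picture] = None
    char: Optional[CharArea] = None
    next: Optional["Page"] = None

@dataclass
class Image:  # struct IMAGE
    rect: Rect
    data: bytes  # binarized image content

@dataclass
class Book:  # struct BOOK
    page: Optional[Page] = None
    image: Optional[Image] = None

def find_char(area: CharArea, char_id: int) -> Optional[Chars]:
    """Walk the character list of one area and return the matching entry."""
    node = area.chars
    while node is not None:
        if node.char_id == char_id:
            return node
        node = node.next
    return None
```

Because every recognized character keeps its own rectangle and ID, a correction found later can be traced back to the exact pixels it came from.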

[0080] The items of the document data example are as follows. (1) Under "struct BOOK", "struct PAGE *page;" is a pointer to the data of the first page, and "struct IMAGE *image;" is a pointer to the image data.

[0081] (2) Under "struct PAGE", "struct PAGE *page;" is a pointer to the data of the next page, "struct PHOTO *photo;" is a pointer to the data of the first photograph area, "struct PICTURE *picture;" is a pointer to the data of the first chart area, and "struct CHAR *char;" is a pointer to the data of the first character area.

[0082] (3) Under "struct PHOTO", "struct PHOTO *photo;" is a pointer to the data of the next photograph area, "int x1,y1,x2,y2;" gives the rectangle coordinates (upper left, lower right) of the photograph area, and "int beside;" is the number of the character area containing its caption.

[0083] (4) Under "struct PICTURE", "struct PICTURE *picture;" is a pointer to the data of the next chart area, "int x1,y1,x2,y2;" gives the rectangle coordinates (upper left, lower right) of the chart area, and "int beside;" is the number of the character area containing its caption.

[0084] (5) Under "struct CHAR", "struct CHAR *char;" is a pointer to the data of the next character area, "int x1,y1,x2,y2;" gives the rectangle coordinates (upper left, lower right) of the character area, and "struct CHARS *chars;" is a pointer to the recognition result character string information.

[0085] (6) Under "struct CHARS", "struct CHARS *chars;" is a pointer to the next item of recognition result character string information, "int x1,y1,x2,y2;" gives the rectangle coordinates (upper left, lower right) of the image making up the character, "int ID;" is the character ID that identifies the character within the character area, and "int code;" is the code of this character (the recognition result).

[0086] (7) Under "struct IMAGE", "int x1,y1,x2,y2;" is the extent of the image, and "char image[1000000];" is the content of the image information (binarized).

§11: Description of background job 1 (see FIG. 7)

FIG. 7 is the first explanatory diagram of the background jobs: A shows an example of the structure of the word dictionary, and B is a flowchart of the misrecognized character detection processing using the word dictionary.

[0087] (1) Description of the word dictionary (see FIG. 7A)

As shown in FIG. 7A, the word dictionary of the word dictionary/grammar unit 25 consists of a set of words, arranged in the order prescribed by JIS.

[0088] In this case, the words are separated by a specific code (0, for example), as in "亜細亜 (Asia), 0, 亜熱帯 (subtropics), 0, ..." in the figure.

(2) Description of the misrecognized character detection processing using the word dictionary (see FIG. 7B)

The misrecognized character detection processing performed by the sentence analysis processing unit 24 using the word dictionary is described below with reference to FIG. 7B. S21 to S26 denote the processing step numbers.

[0089] First, sequences of consecutive kanji are extracted from the recognition result character strings stored in the optical disk device 18 (S21). Each extracted string is then compared with the word dictionary (S22).

[0090] If the two do not match (S23), the extracted string is left as it is (S24) and the processing is repeated from S21. If they do match, that is, if the degree of match is greater than a threshold (S23), the positions within the word at which the extracted string differs from the word dictionary entry are located, and the "character ID" of each such character is obtained. The correct character code for that position is also obtained from the word dictionary (S25). The "character ID" and the correct code are then stored in the memory 14 (S26), and the processing ends.
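Steps S21 to S26 can be sketched as follows. This is only an illustration under several assumptions: the recognized text is given as `(character ID, character)` pairs, kanji are detected by the Unicode range U+4E00 to U+9FFF rather than by JIS code, the degree of match is the fraction of equal characters between strings of the same length, and the threshold of 0.6 is arbitrary. The grammar check mentioned in the patent is not modelled.

```python
def parse_word_dictionary(raw):
    """Split a dictionary whose words are separated by a 0 code ("\\0")."""
    return [word for word in raw.split("\0") if word]

def is_kanji(ch):
    """True for CJK unified ideographs (assumed stand-in for the JIS test)."""
    return "\u4e00" <= ch <= "\u9fff"

def kanji_runs(recognized):
    """S21: group (char_id, char) pairs into runs of two or more kanji."""
    runs, current = [], []
    for item in recognized:
        if is_kanji(item[1]):
            current.append(item)
        else:
            if len(current) > 1:
                runs.append(current)
            current = []
    if len(current) > 1:
        runs.append(current)
    return runs

def find_corrections(recognized, words, min_match=0.6):
    """S22-S26: return (char_id, correct_char) for each suspected error."""
    corrections = []
    for run in kanji_runs(recognized):
        text = "".join(ch for _, ch in run)
        best, best_score = None, 0.0
        for word in words:
            if len(word) != len(text):
                continue
            score = sum(a == b for a, b in zip(word, text)) / len(word)
            if score > best_score:
                best, best_score = word, score
        # S24: leave the string alone if nothing is close enough,
        # or if it already matches a dictionary word exactly.
        if best is None or best_score <= min_match or best_score == 1.0:
            continue
        for (char_id, ch), correct in zip(run, best):
            if ch != correct:
                corrections.append((char_id, correct))
    return corrections
```

With the dictionary "亜細亜, 0, 亜熱帯, 0", the recognized string 唖熱帯 matches 亜熱帯 in two of three positions, so the first character is reported with 亜 as its correct value. A real implementation would store the character code rather than the character itself.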

[0091] When performing the above processing, the sentence analysis processing unit 24 also uses the grammar of the word dictionary/grammar unit 25 in its analysis.

§12: Description of background job 2 (see FIG. 8)

FIG. 8 is the second explanatory diagram of the background jobs: A shows an example of the structure of the recognition dictionary, and B is a flowchart of the recognition dictionary update processing.

[0092] (1) Description of an example of the structure of the recognition dictionary (see FIG. 8A)

As illustrated, the recognition dictionary is a set of pairs of a "character code" and "feature values". For example, the character code of the character "亜" is "3021", and its feature values are 3, 5, 7, 90, 3, ...

[0093] Likewise, the character code of the character "唖" is "3022", and its feature values are 1, 4, 7, 8, 453, ...

(2) Description of the recognition dictionary update processing (see FIG. 8B)

The recognition dictionary update processing is described below with reference to FIG. 8B. S31 to S35 denote the processing step numbers.

[0094] This processing is performed by the character recognition unit 5 on the basis of the information (information on misrecognized characters) obtained by the sentence analysis of the sentence analysis processing unit 24. The image processing unit 21 of the character recognition unit 5 first retrieves from the memory 14 the character ID and correct code stored by the sentence analysis processing unit 24 (S31). It then extracts from the image information the portion corresponding to the character ID and generates a feature vector (S32). The feature vector obtained in this way is used as the "update feature vector" (S33).

[0095] Next, the dictionary feature vector for the correct code in the recognition dictionary 9 is added, with weighting, to the update feature vector, and the sum is normalized (S34). The updated feature vector is then used as the new dictionary vector (S35).
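Steps S34 and S35 amount to a weighted blend followed by normalization, which might be sketched as below. The weight of 0.1 given to the update vector and the choice of unit-length (Euclidean) normalization are assumptions for this example; the patent does not state either.

```python
import math

def update_dictionary_vector(dictionary_vec, update_vec, weight=0.1):
    """Blend the update vector into the dictionary vector and normalize.

    `weight` is the share given to the update vector (hypothetical value);
    the result is scaled to unit Euclidean length (assumed normalization).
    """
    blended = [
        (1.0 - weight) * d + weight * u
        for d, u in zip(dictionary_vec, update_vec)
    ]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended] if norm else blended
```

A small weight moves the dictionary entry only slightly toward each misrecognized sample, so a single unusual sample cannot overturn an entry built from many earlier ones.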

[0096]

[Effects of the Invention] As described above, the present invention has the following effects. (1) Because the recognition dictionary is trained on the large volume of stored document data and the optimized dictionary is then used, character recognition accuracy improves.

[0097] (2) Because the dictionary update processing runs as a background job, the recognition dictionary can be optimized without the user being aware of it, even while the user is not using the device.

[Brief description of drawings]

FIG. 1 is a diagram explaining the principle of the present invention.

FIG. 2 is a configuration diagram of the device of the embodiment.

FIG. 3 is an explanatory diagram of document data in the embodiment.

FIG. 4 is a flowchart of the area extraction/determination processing in the embodiment.

FIG. 5 is a flowchart of the vertical/horizontal writing determination processing for character areas in the embodiment.

FIG. 6 is an explanatory diagram of document data in the embodiment.

FIG. 7 is the first explanatory diagram of the background jobs in the embodiment.

FIG. 8 is the second explanatory diagram of the background jobs in the embodiment.

FIG. 9 is an explanatory diagram of the prior art.

[Explanation of symbols]

2A: image input unit
3: image memory
4: document structure analysis unit
5: character recognition unit
7: file device
9: recognition dictionary
24: sentence analysis processing unit
25: word dictionary/grammar unit

Claims

1. A filing device with OCR comprising: an image input unit (2A) for inputting a document image; a document structure analysis unit (4) that analyzes the input document image, separates it into areas with different attributes (character area, chart area, photograph area), and extracts various kinds of information; a character recognition unit (5) that performs character recognition on the character areas using a recognition dictionary (9); and a file device (7) that stores the document data (image data, character data, etc.) obtained by these units, wherein a sentence analysis processing unit (24) is provided that performs sentence analysis on the character data of the character areas in the document data stored in the file device (7), extracts misrecognized characters, and obtains the correct character information, and wherein the character recognition unit (5) updates the recognition dictionary (9) on the basis of information on the misrecognized characters, so that the recognition dictionary can be optimized.

2. The filing device with OCR according to claim 1, further comprising a word dictionary/grammar unit (25) storing a word dictionary and grammar information, wherein the sentence analysis processing unit (24) performs the sentence analysis using the word dictionary and grammar of the word dictionary/grammar unit (25).

3. The filing device with OCR according to claim 1, wherein, when the character recognition unit (5) performs the recognition dictionary update processing, the image data corresponding to each misrecognized character is cut out on the basis of the information on the misrecognized character, its feature values are obtained, and the recognition dictionary (9) is updated on the basis of those feature values.

4. The filing device with OCR according to claim 1, wherein the internal processing is multitasked so that the sentence analysis by the sentence analysis processing unit (24) and the recognition dictionary update by the character recognition unit (5) can be executed at all times as background jobs, independently of other processing.