JP4872285B2

JP4872285B2 - Document management apparatus, document management system, and document management method

Info

Publication number: JP4872285B2
Application number: JP2005267400A
Authority: JP
Inventors: 雅弘加藤; 明男山下; 勝彦糸乘; 一成橋本
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-09-14
Filing date: 2005-09-14
Publication date: 2012-02-08
Anticipated expiration: 2025-09-14
Also published as: JP2007079979A

Description

The present invention relates to a technique for converting electronic documents generated by various application software into electronic documents of a predetermined format and managing them centrally.

In general, in an electronic document created by document-creation application software, each character is basically represented by a character code, which is convenient for processing such as correction, editing, and searching. Some applications also form an electronic document as image data and then store or output it; in that case the original character code string is attached to the image data. The image data of the electronic document can therefore be viewed, and various kinds of processing can be applied to the character code string.

However, when characters are decorated, the character codes of the decorated portion may be replaced with special codes, so that the original character codes are lost. In an application that expresses decoration by combining character codes, the character codes may remain but no longer make sense as a sentence. Furthermore, when a character image is pasted into the document, no character codes corresponding to that image exist in the first place. In these cases, the various kinds of processing that rely on character codes are hindered.

Patent Document 1 therefore performs character recognition such as OCR on the image data of an electronic document and stores or outputs the resulting character code string as a new electronic document.

Patent Document 1: JP 2004-178438 A

However, the invention described in Patent Document 1 discards the entire original character code string attached to the electronic document and adopts the character code string obtained by character recognition. Consequently, if a character recognition error occurs, for example, inaccurate character codes are generated that differ from the character codes of the original text.

The present invention has been made in view of the above circumstances, and its object is to provide a document management apparatus, a document management system, and a document management method that generate an appropriate character code string from an electronic document containing a character code string and image data obtained by rendering that string as an image.

To achieve this object, the document management apparatus of the present invention comprises: input means for receiving an electronic document that has character code information and image information including information obtained by rendering the character code information as an image; document image generation means for expanding the image information of the electronic document received by the input means into an image; layout analysis means for analyzing the layout of the image expanded by the document image generation means and recognizing regions; determination means for determining whether each region recognized by the layout analysis means is a character region; character recognition means for performing character recognition processing on a region determined to be a character region by the determination means and generating character code information; character code selection means for analyzing and evaluating, in accordance with a predetermined algorithm, corresponding portions of the character code information received by the input means and of the character code information generated by the character recognition means, and selecting one of the two on the basis of the evaluation result; and character code information creating means for joining the portions of character code information selected by the character code selection means to form the character code information of the electronic document.

With this configuration, the character code information creating means generates character code information by joining the character code information obtained through the layout analysis means, the character recognition means, and the determination means. Portions of the character code information attached to the image information that are inaccurate, such as where the character code string does not make sense as a sentence or where a character image has been pasted into the image information and no character codes exist, can therefore be complemented with the character codes obtained by this processing, and appropriate character code information can be obtained.

The document management apparatus preferably includes electronic document storage means for converting the image information into another format and storing the converted image information together with the character code information created by the character code information creating means.

In the document management apparatus, the predetermined algorithm is preferably an algorithm based on natural language analysis.

The document management system of the present invention has a terminal device and a management server connected to the terminal device via a network.
The management server comprises: input means for receiving an electronic document that has character code information and image information including information obtained by rendering the character code information as an image; document image generation means for expanding the image information of the electronic document received by the input means into an image; layout analysis means for analyzing the layout of the image expanded by the document image generation means and recognizing regions; determination means for determining whether each region recognized by the layout analysis means is a character region; character recognition means for performing character recognition processing on a region determined to be a character region by the determination means and generating character code information; character code selection means for analyzing and evaluating, in accordance with a predetermined algorithm, corresponding portions of the character code information received by the input means and of the character code information generated by the character recognition means, and selecting one of the two on the basis of the evaluation result; and character code information creating means for joining the portions of character code information selected by the character code selection means to form the character code information of the electronic document.
The terminal device comprises electronic document transmission means for transmitting an electronic document to the input means of the management server.

The document management method of the present invention comprises: an input step of receiving an electronic document that has character code information and image information including information obtained by rendering the character code information as an image; a document image generation step of expanding the image information of the electronic document received in the input step into an image; a layout analysis step of analyzing the layout of the image expanded in the document image generation step and recognizing regions; a determination step of determining whether each region recognized in the layout analysis step is a character region; a character recognition step of performing character recognition processing on a region determined to be a character region in the determination step and generating character code information; a character code selection step of analyzing and evaluating, in accordance with a predetermined algorithm, corresponding portions of the character code information received in the input step and of the character code information generated in the character recognition step, and selecting one of the two on the basis of the evaluation result; and a character code information creating step of joining the portions of character code information selected in the character code selection step to form the character code information of the electronic document.

According to the document management apparatus of the present invention, the character code information creating means generates character code information by joining the character code information obtained through the layout analysis means, the character recognition means, and the determination means. Even where the character codes do not make sense as a sentence, or where a character image has been pasted and no character codes exist, the character codes of that portion can be complemented, so that appropriate character code information is obtained while the character code information of the electronic document is still put to use.

<A. Embodiment>
A first embodiment of the present invention will be described below with reference to the drawings. FIG. 1 shows the overall configuration of a document management system 1 according to this embodiment. The system 1 includes a user terminal 100, a management server 200, and a database 300 connected via a network 400.
The user terminal 100 has a function of generating electronic documents with various application software. An electronic document generated by such software has character code information (for example, JIS code, EUC, or Shift JIS) and image information including information obtained by rendering that character code information as an image.

The management server 200 converts, by the processing described later, the image information of an electronic document that was generated by some application software and transmitted from the user terminal 100 into image information of a predetermined format, and stores this image information together with the character code information in the database 300.
Specifically, the server identifies the format of the image information from the file name of the transmitted electronic document and, if the format is convertible, converts it into the predetermined format (for example, PDF (Portable Document Format)).
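The file-name check described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not list which formats are convertible, so `CONVERTIBLE_EXTENSIONS` and the function name `plan_conversion` are hypothetical, and the actual conversion to PDF is left out.

```python
from pathlib import Path
from typing import Optional

# Hypothetical set of source formats the server can convert; the patent
# does not enumerate them.
CONVERTIBLE_EXTENSIONS = {".doc", ".xls", ".ppt", ".txt"}
TARGET_FORMAT = "pdf"  # the predetermined format given as an example in the text


def plan_conversion(file_name: str) -> Optional[str]:
    """Return the target format if the file name indicates a convertible
    format, otherwise None."""
    extension = Path(file_name).suffix.lower()
    if extension in CONVERTIBLE_EXTENSIONS:
        return TARGET_FORMAT
    return None
```

A file such as `report.DOC` would be scheduled for conversion, while a file with an unknown or missing extension would be left alone.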

As described in the background section, an electronic document generated by application software is accompanied by character codes corresponding to the characters of the document, and this character code information remains attached even after conversion into the predetermined format. As shown in FIG. 2, the database 300 therefore stores, for each electronic document file name, image information and character code information corresponding to the characters of that image information.
Specifically, for the file name “001” it stores image information “001-D” and character code information “001-C”; for the file name “002” it stores image information “002-D” and character code information “002-C”; and so on.
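As a rough illustration of the layout in FIG. 2, the sketch below models the database as a mapping from file name to a record holding the two pieces of information. The class and function names are assumptions made for this example; only the "-D" and "-C" naming pattern comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDocument:
    file_name: str       # e.g. "001"
    image_info: str      # identifier of the image information, e.g. "001-D"
    char_code_info: str  # identifier of the character code information, e.g. "001-C"


def make_record(file_name: str) -> StoredDocument:
    """Build a record using the naming pattern shown in the text."""
    return StoredDocument(file_name, f"{file_name}-D", f"{file_name}-C")


# Hypothetical in-memory stand-in for database 300.
database = {name: make_record(name) for name in ("001", "002")}
```

Looking up `database["002"]` then yields both the image information and the character code information stored for that file name.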

<A-1. Configuration of Management Server 200>
FIG. 3 is a block diagram showing the functional configuration of the management server 200. The management server 200 includes a data transmission/reception unit 210, a document image generation unit 220, a layout analysis unit 230, a determination unit 240, an OCR processing unit 250, a character code information selection unit 260, a character code information creation unit 270, and a control unit 280 that controls them. The control unit 280 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory), none of which are shown, and governs the various processing operations.

The data transmission/reception unit 210 exchanges data with external devices (not shown) connected to the network 400, such as the user terminal 100.
The document image generation unit 220 renders the image information of an electronic document generated by application software as an image; that is, it converts the electronic data into the image data that would be printed on paper. The layout analysis unit 230 performs layout analysis on the image data produced by the document image generation unit 220 and divides it into character regions and graphic regions (figures and photographs). The determination unit 240 determines whether each region recognized by the layout analysis unit 230 is a character region. When the determination unit 240 determines that a region is a character region, the OCR (Optical Character Reader) processing unit 250 performs character recognition on that region and determines the character code information for each character of the document data.

The character code information selection unit 260 compares the character code information attached to the electronic document with the character code information determined by the OCR processing unit 250 and selects the optimum character code information by natural language analysis. The character code information creation unit 270 then joins the portions of character code information selected by the character code information selection unit 260 to create the character code information of the electronic document, which is stored in the database 300 together with the image information as an electronic document.
Here, natural language analysis is a technique for analyzing text using known morphological analysis and syntactic analysis methods; in this embodiment it is used to analyze whether the text makes sense as language.
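The selection and joining performed by units 260 and 270 can be sketched roughly as below. Note the loud assumption: the patent relies on morphological and syntactic analysis, which is not reproduced here. The `plausibility` function is a deliberately crude stand-in (it only penalizes immediately repeated characters), and all function names are hypothetical.

```python
from typing import List, Optional


def plausibility(text: str) -> float:
    """Crude stand-in for natural language analysis (an assumption, not the
    patent's method): the fraction of adjacent character pairs that are not
    immediate repeats."""
    if not text:
        return 0.0
    if len(text) == 1:
        return 1.0
    pairs = list(zip(text, text[1:]))
    return sum(a != b for a, b in pairs) / len(pairs)


def select_segment(attached: Optional[str], recognized: str) -> str:
    """Choose between the attached codes and the OCR result for one portion."""
    if not attached:
        # No character codes were attached (e.g. a pasted character image).
        return recognized
    # Ties go to the attached codes, so OCR misreadings do not replace them.
    if plausibility(attached) >= plausibility(recognized):
        return attached
    return recognized


def build_character_code_info(attached_segments: List[Optional[str]],
                              recognized_segments: List[str]) -> str:
    """Join the selected portions into one character code string."""
    return "".join(select_segment(a, r)
                   for a, r in zip(attached_segments, recognized_segments))
```

With this heuristic, an attached portion such as "wwoorrlldd" loses to the recognized "world", while an attached "Title" is kept over a misrecognized "Tit1e". A real implementation would replace `plausibility` with a proper language-analysis score.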

<A-2. Operation of document management system>
Next, the character code information creation processing of the document management system 1 according to this embodiment will be described with reference to the flowchart of FIG. 4.
The management server 200 starts operating when an electronic document generated by given application software is transmitted from the user terminal 100 (step S1; YES).
This description covers the case where a single electronic document is transmitted from the user terminal 100, but the user may of course transmit several electronic documents together.

The control unit 280 temporarily stores the character code information C1 attached to the image information D1 of the transmitted electronic document (step S2). The control unit 280 also transfers the image information D1 received by the data transmission/reception unit 210 to the document image generation unit 220, which rasterizes it to generate image information D2 (step S3).

Next, the control unit 280 transfers the image information D2 generated by the document image generation unit 220 to the layout analysis unit 230, which divides the image information D2 into character regions and graphic regions (step S4).
The control unit 280 then has the determination unit 240 determine whether each of the regions separated by the layout analysis unit 230 is a character region, and the information of the character regions is transferred to the OCR processing unit 250. The OCR processing unit 250 determines a character code for each character in the transferred character regions and generates character code information C2 (step S5).

The control unit 280 reads the character code information C1 stored in step S2 and sends it to the character code information selection unit 260 (step S6). The control unit 280 also sends the character code information C2 determined by the OCR processing unit 250 to the character code information selection unit 260. The character code information selection unit 260 compares the character code information C1 with the character code information C2 and selects the optimum character code information by natural language analysis (step S7).

The control unit 280 then has the character code information creation unit 270 join the character code information selected by the character code information selection unit 260 to create the character code information of the electronic document (step S8).

After that, the image information D1 is converted into other image information D2 in the predetermined format, and the created character code information is stored in the database 300 together with the image information D2.
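The flow of steps S2 to S8 can be summarized in a short sketch. Everything here is illustrative: the function and parameter names are hypothetical, the individual units are passed in as callables rather than implemented, and the sketch assumes that the attached character code information C1 is already split into portions aligned one-to-one with the regions found by layout analysis, which the patent does not spell out.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class ElectronicDocument:
    image_info: Any                      # D1, as produced by the application
    attached_codes: List[Optional[str]]  # C1 per region (None where absent)


def create_character_code_info(
    doc: ElectronicDocument,
    rasterize: Callable[[Any], Any],          # stands in for unit 220
    analyze_layout: Callable[[Any], List[Any]],  # stands in for unit 230
    is_character_region: Callable[[Any], bool],  # stands in for unit 240
    ocr: Callable[[Any], str],                # stands in for unit 250
    select: Callable[[Optional[str], str], str],  # stands in for unit 260
) -> str:
    c1 = list(doc.attached_codes)        # S2: hold the attached codes C1
    d2 = rasterize(doc.image_info)       # S3: rasterize D1 into D2
    regions = analyze_layout(d2)         # S4: split D2 into regions
    # S5: run OCR only on character regions; other regions yield None.
    c2 = [ocr(r) if is_character_region(r) else None for r in regions]
    # S6-S7: for each character region, pick C1 or C2.
    chosen = [select(a, r) for a, r in zip(c1, c2) if r is not None]
    return "".join(chosen)               # S8: join the selected portions
```

Passing the units in as parameters keeps the sketch independent of any particular rasterizer, layout analyzer, or OCR engine.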

<B. Specific example>
Next, a specific example of this embodiment will be described with reference to FIGS. 5 and 6.
As shown in FIG. 5, the electronic document contains a sentence X1, 「これは影付き文字です。」 (“This is shadowed text.”), decorated as shadowed characters, and a sentence X2, 「これは中抜き文字です。」 (“This is outlined text.”), decorated as outlined characters. Two application software packages, F1 and F2, are considered.
In application software F1, the sentence X1 「これは影付き文字です。」 becomes 「ここれれはは影影付付きき文文字字でですす。。」, in which every character appears twice, and character code information is assigned to these 22 characters. In application software F2, the sentence X1 remains 「これは影付き文字です。」, and accurate character code information is assigned to its 11 characters.
The sentence X2, 「これは中抜き文字です。」, is treated as a character image, so neither F1 nor F2 assigns character code information to it.

The operation of this embodiment for the sentence X1 will now be described with reference to FIG. 6.
As stated above, the input sentence X1 is 「これは影付き文字です。」 decorated as shadowed characters, and in application software F1 it becomes 「ここれれはは影影付付きき文文字字でですす。。」. Character code information in which character codes are assigned to these 22 characters is therefore attached to the image information.

In terms of the processing of FIG. 4, character code information corresponding to the 22 characters is stored in step S2. In step S3, image data (center of FIG. 6) is obtained by rendering the document data of the input sentence X1. Steps S4 to S6 yield the 11 character codes shown on the right side of FIG. 6. In step S7, natural language analysis is applied to the 22-character code string and the 11-character code string, and the optimum one is selected; here, naturally, the 11-character code string is selected.
In step S8, the 22 character codes in the character code information attached to the image information are replaced with the 11 character codes, and the resulting character code information is attached to the image information.
For reference, the JIS codes corresponding to 「これは影付き文字です。」 are 2433, 246C, 244F, 3146, 4955, 242D, 4A38, 3B7A, 2447, 2439, and 2123.
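The character counts and JIS codes in this example can be checked with a few lines of Python. The sketch relies on the fact that, for JIS X 0208 characters, the EUC-JP encoding is the JIS code with 0x80 added to each byte; the function name `jis_x0208_codes` is chosen here for illustration.

```python
from typing import List


def jis_x0208_codes(text: str) -> List[str]:
    """Return the JIS X 0208 code of each character as four hex digits.

    For JIS X 0208 characters, EUC-JP is the JIS code plus 0x8080.
    """
    codes = []
    for ch in text:
        euc = ch.encode("euc_jp")
        # Two bytes, both with the high bit set, means a JIS X 0208 character.
        if len(euc) != 2 or euc[0] < 0xA1 or euc[1] < 0xA1:
            raise ValueError(f"{ch!r} is not a JIS X 0208 character")
        codes.append(format(int.from_bytes(euc, "big") - 0x8080, "04X"))
    return codes


sentence = "これは影付き文字です。"
# Application F1 stores each shadowed character twice.
doubled = "".join(ch * 2 for ch in sentence)

print(len(sentence), len(doubled))
print(jis_x0208_codes(sentence))
```

The first line printed gives the lengths of the original and doubled strings, and the second lists the JIS codes for comparison with those given in the text.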

For the sentence X2, 「これは中抜き文字です。」, no character codes have been assigned, so the character code information obtained by the processing of steps S3 to S7 is attached to the image information.

In this way, this embodiment makes use of the accurate character code information originally attached to electronic documents generated by various application software, while complementing portions that have no character code information for their characters, or that do not make sense as text, with the character codes obtained by the processing of steps S3 to S6. Moreover, because natural language analysis is performed in the character code selection processing, the accurate character codes can be selected even when the character recognition processing (OCR processing) has misrecognized characters.
Furthermore, the accurate character code information can also be attached to other image information obtained by converting the image information into the predetermined format. Even for image information in another format, functions such as searching and translation that use the character code information, as well as re-editing, become possible.

<C. Modification>
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment, and various other forms are possible.
The embodiment was described as a document management system in which the user terminal 100 and the management server 200 are connected by the network 400. The present invention is not limited to this: the functions of the management server 200 may be built into the user terminal 100, so that the user terminal 100 serves as the document management apparatus. The management server 200 may be installed inside or outside a company to perform document management. Furthermore, a translation unit and a translation dictionary unit may be provided in the management server 200, and the management server 200 so equipped may be used as the management server.

Even when the document data contains annotation data such as notes and sticky notes, the management server 200 of this embodiment can create character code information from the characters written in the annotations, which makes it possible to search using annotations as keywords.

In the embodiment, FIG. 1 shows only one user terminal 100 and one management server 200, but the document management system 1 may of course have a plurality of user terminals 100 and management servers 200.

In the embodiment described above, the user terminal 100 is realized by its control unit executing a program, and the management server 200 by the control unit 280 executing a program, but some or all of this may be implemented in hardware.

FIG. 1 is a diagram showing the configuration of a document management system according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing data stored in the database used in the embodiment.
FIG. 3 is a block diagram showing the configuration of the management server used in the embodiment.
FIG. 4 is a flowchart showing the operation of the document management system according to the embodiment.
FIG. 5 is a diagram showing a specific example according to the embodiment.
FIG. 6 is a diagram showing, together with FIG. 5, a specific example according to the embodiment.

Explanation of symbols

1: document management system; 100: user terminal; 200: management server; 210: data transmission/reception unit; 220: document image generation unit; 230: layout analysis unit; 240: determination unit; 250: OCR processing unit; 260: character code information selection unit; 270: character code information creation unit; 280: control unit; 300: database; 400: network.

Claims

A document management apparatus comprising:
input means for receiving an electronic document that has character code information and image information including information obtained by rendering the character code information as an image;
document image generation means for expanding the image information of the electronic document received by the input means into an image;
layout analysis means for analyzing the layout of the image expanded by the document image generation means and recognizing regions;
determination means for determining whether each region recognized by the layout analysis means is a character region;
character recognition means for performing character recognition processing on a region determined to be a character region by the determination means and generating character code information;
character code selection means for analyzing and evaluating, in accordance with a predetermined algorithm, corresponding portions of the character code information received by the input means and of the character code information generated by the character recognition means, and selecting one of the two on the basis of the evaluation result; and
character code information creating means for joining the portions of character code information selected by the character code selection means to form the character code information of the electronic document.

The document management apparatus according to claim 1,
further comprising electronic document storage means for converting the image information into another format and storing the converted image information together with the character code information created by the character code information creating means.

The document management apparatus according to claim 1 or 2,
The document management apparatus, wherein the predetermined algorithm is an algorithm according to natural language analysis.

A document management system comprising a terminal device and a management server connected to the terminal device via a network,
wherein the management server comprises:
input means for receiving an electronic document that has character code information and image information including information obtained by rendering the character code information as an image;
document image generation means for expanding the image information of the electronic document received by the input means into an image;
layout analysis means for analyzing the layout of the image expanded by the document image generation means and recognizing regions;
determination means for determining whether each region recognized by the layout analysis means is a character region;
character recognition means for performing character recognition processing on a region determined to be a character region by the determination means and generating character code information;
character code selection means for analyzing and evaluating, in accordance with a predetermined algorithm, corresponding portions of the character code information received by the input means and of the character code information generated by the character recognition means, and selecting one of the two on the basis of the evaluation result; and
character code information creating means for joining the portions of character code information selected by the character code selection means to form the character code information of the electronic document,
and wherein the terminal device comprises
electronic document transmission means for transmitting an electronic document to the input means of the management server.

A document management method comprising:
an input step of receiving an electronic document that has character code information and image information including information obtained by rendering the character code information as an image;
a document image generation step of expanding the image information of the electronic document received in the input step into an image;
a layout analysis step of analyzing the layout of the image expanded in the document image generation step and recognizing regions;
a determination step of determining whether each region recognized in the layout analysis step is a character region;
a character recognition step of performing character recognition processing on a region determined to be a character region in the determination step and generating character code information;
a character code selection step of analyzing and evaluating, in accordance with a predetermined algorithm, corresponding portions of the character code information received in the input step and of the character code information generated in the character recognition step, and selecting one of the two on the basis of the evaluation result; and
a character code information creating step of joining the portions of character code information selected in the character code selection step to form the character code information of the electronic document.