JP2005173686A

JP2005173686A - Document detection method and system

Info

Publication number: JP2005173686A
Application number: JP2003408786A
Authority: JP
Inventors: Tadashi Takizawa; 正滝沢
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-12-08
Filing date: 2003-12-08
Publication date: 2005-06-30

Abstract

PROBLEM TO BE SOLVED: To speed up the detection of an original digitized document from a scanned paper document, and to allow detection that does not depend on the reading order of document blocks, by performing a concept vector search on documents and detecting similar documents.

SOLUTION: The system comprises a document storage means for storing an original digitized document; a document reading means for reading, as image data, a document printed from the original digitized document; a document recognition means for recognizing the read image data as character codes; a document analysis means for analyzing the recognized document data and extracting layout information of the document; a document detection means for performing a concept vector search on documents, using the analyzed document data, to detect a similar document among the stored original digitized documents; and a detection result output means for outputting the detected result.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a document detection method and system for detecting, from a scanned paper document, the original digitized document stored on a PC hard disk or the like, and in particular to document detection that uses the similarity obtained by a concept vector search on documents.

When an original digitized document is to be detected from a scanned paper document, the printed document is read as image data, the read image data is recognized as character codes, the recognized document data is analyzed, and a concept vector search on documents is performed between the analyzed document data and the stored original digitized documents to detect similar documents. The method and system thereby enable fast document detection.

In conventional document detection methods and systems, as in Patent Document 1, an original digitized document is detected from a scanned paper document as follows: text information is extracted from the scanned paper document by OCR (Optical Character Recognition), the text information is analyzed, the analyzed document data is matched against the character strings of the stored original digitized documents, and a document whose character strings match is detected.

Because OCR accuracy cannot be guaranteed to be 100%, the text information extracted from the scanned paper document contains recognition errors. Conventional systems therefore applied ambiguous matching between the analyzed document data and the character strings of the stored original digitized documents, that is, they judged whether character strings matched while allowing for ambiguity.

Patent Document 1: JP-A-8-263512

However, the prior art described above has the following drawback. Even though the OCR recognition rate cannot be guaranteed to be 100%, an original digitized document is detected only when the character strings of the document data, obtained by extracting text information from the scanned paper document by OCR and analyzing it, exactly match the character strings of the stored original digitized document.

Ambiguous matching of character strings has a drawback as well. For example, when a document contains both character string data and image data, character string blocks and image blocks are extracted separately, and when several character string blocks are extracted, a character string may be split across blocks. In that case the document cannot be detected even by ambiguous matching.

In addition, because character string matching must be performed on every character string in the document, the detection process takes a long time.

The present invention has been made to solve the above problems of the prior art. Its object is to detect the stored original digitized document from a scanned paper document at high speed, and to enable a search that ignores the reading order, that is, the order of the document blocks.

To achieve the above object, the document detection method and system of the present invention, which detect a document, comprise: document storage means for storing an electronically created original digitized document; document reading means for reading, as image data, a document printed from the original digitized document; document recognition means for recognizing the image data read by the document reading means as character codes; document detection means for performing a concept search, using the document data recognized by the document recognition means, on the original digitized documents stored by the document storage means and detecting the corresponding document; and detection result output means for outputting the result detected by the document detection means.

Further, a document detection method and system for detecting a document comprise: document storage means for storing an electronically created original digitized document; document reading means for reading, as image data, a document printed from the original digitized document; document recognition means for recognizing the image data read by the document reading means as character codes; document analysis means for analyzing the document data recognized by the document recognition means and extracting layout information of the document; document detection means for performing a concept search, using the document data analyzed by the document analysis means, on the original digitized documents stored by the document storage means and detecting the corresponding document; and detection result output means for outputting the result detected by the document detection means.

Further, a document detection method and system for detecting a document comprise: document storage means for storing electronically created digitized documents; document reading means for reading a printed or handwritten document as image data; document recognition means for recognizing the image data read by the document reading means as character codes; document detection means for performing a concept search, using the document data recognized by the document recognition means, on the digitized documents stored by the document storage means and detecting similar documents; and detection result output means for outputting the result detected by the document detection means.

Further, a document detection method and system for detecting a document comprise: document storage means for storing electronically created digitized documents; document reading means for reading a printed or handwritten document as image data; document recognition means for recognizing the image data read by the document reading means as character codes; document analysis means for analyzing the document data recognized by the document recognition means and extracting layout information of the document; document detection means for performing a concept search, using the document data analyzed by the document analysis means, on the digitized documents stored by the document storage means and detecting similar documents; and detection result output means for outputting the result detected by the document detection means.

Further, in the document detection means, the concept search is a concept vector search on documents.

Further, in the document detection means, the target of detection is a whole document.

Further, in the document detection means, the target of detection is a document layout unit extracted by the document analysis means.

As described above, according to the present invention, when an original digitized document is to be detected from a scanned paper document, the printed document is read as image data, the read image data is recognized as character codes, the recognized document data is analyzed, and a concept vector search on documents is performed between the analyzed document data and the stored original digitized documents to detect similar documents. This enables fast document detection and also enables a search that ignores the reading order, that is, the order of the document blocks.

(First embodiment)
A first embodiment of the present invention is described below with reference to FIGS. 1 to 15.

FIG. 1 is a block diagram showing the configuration of a character processing apparatus to which the present invention is applied.

In the illustrated configuration, the CPU is a microprocessor that performs the arithmetic operations, logical decisions, and the like for document detection processing and controls the components connected to the bus via the bus. The microprocessor CPU also operates as document detection display means.

BUS is a bus that transfers address signals and control signals designating the components to be controlled by the microprocessor CPU. It also transfers data between the components.

RAM is a writable random access memory used for primary storage of various data from the components.

ROM is a read-only fixed memory. It stores the boot program executed by the microprocessor CPU. At system startup, the boot program loads the control program stored on the hard disk into RAM and causes the microprocessor CPU to execute it. The control program is described in detail later with reference to flowcharts.

The input device is a keyboard, a mouse, or the like.

The display device is a CRT, a liquid crystal display, or the like.

The scanner reads paper documents and digitizes them.

HD is a hard disk that stores the control program executed by the CPU, a morphological analysis dictionary for performing morphological analysis, a document database storing electronically created digitized documents, a word vector dictionary in which each word of a document is expressed as a concept vector, and a concept search index used as the index when a concept search is performed.

The removable external storage device is a drive or the like for accessing external storage media such as floppy disks, CDs, and DVDs. It can be used in the same way as the HD, and data is exchanged with other document processing apparatuses through these recording media. The control program stored on the hard disk can also be copied from these external storage devices to the HD as needed.

The communication device is a network controller that exchanges data with the outside over a communication line.

The document detection processing apparatus of the present invention, composed of these components, operates in response to various inputs from the input device. When an input is supplied from the input device, an interrupt signal is first sent to the microprocessor CPU; the CPU then reads the instructions stored in ROM or RAM and executes them to perform the various kinds of control.

FIG. 2 is a diagram showing the flow of processing in the first embodiment of the present invention.

The first embodiment of the present invention consists of a digitized document 1, a document storage unit 2, a document printing unit 3, a paper document 4, a document reading unit 5, a document recognition unit 6, a document analysis unit 7, a document detection unit 8, and a detection result output unit 9.

Reference numeral 1 denotes a digitized document, namely an electronically created original digitized document.

Reference numeral 2 denotes a document storage unit, which stores the electronically created original digitized document 1.

Reference numeral 3 denotes a document printing unit, which prints the electronically created original digitized document 1.

Reference numeral 4 denotes a paper document, which is output when the document printing unit 3 prints the electronically created original digitized document 1.

Reference numeral 5 denotes a document reading unit, which reads the paper document 4, that is, the document printed from the original digitized document, as image data.

Reference numeral 6 denotes a document recognition unit, which recognizes the image data read by the document reading unit 5 as character codes.

Reference numeral 7 denotes a document analysis unit, which analyzes the document data recognized by the document recognition unit 6 and extracts layout information of the document.

Reference numeral 8 denotes a document detection unit, which performs a concept search, using the document data analyzed by the document analysis unit 7, on the original digitized documents stored in the document storage unit 2 and detects the corresponding document.

Reference numeral 9 denotes a detection result output unit, which outputs the result of the document detected by the document detection unit 8.

FIGS. 3 to 5 are flowcharts showing the flow of processing in the first embodiment of the present invention.

FIG. 3 is a flowchart showing the flow of processing for storing the electronically created original digitized document 1 in the document storage unit 2.

In step S01, a list of electronically created original digitized documents 1 is displayed. For example, the digitized documents present on the hard disk of a PC or the like are displayed as a digitized document list as shown in FIG. 6.

In step S02, the original digitized document 1 to be stored in the document storage unit 2 is selected from the digitized document list displayed in step S01. For example, in the first embodiment, “経済001.doc” is selected as the original digitized document 1 to be stored in the document storage unit 2.

In step S03, the original digitized document 1 selected in step S02 is stored in the document storage unit 2. For example, when the document storage unit 2 is a document database, the document is stored in the document database as shown in FIG. 7.

In step S04, if further original digitized documents 1 are to be stored in the document storage unit 2, the process returns to step S01 and steps S01 to S03 are repeated. If storing is finished, the process ends.

FIG. 4 is a flowchart showing the flow of processing for printing, with the document printing unit 3, the electronically created original digitized document 1 stored in the document storage unit 2.

In step S11, a list of electronically created original digitized documents 1 is displayed. For example, as in step S01, the digitized documents present on the hard disk of a PC or the like are displayed as a digitized document list as shown in FIG. 6.

In step S12, the original digitized document 1 to be printed by the document printing unit 3 is selected from the digitized document list displayed in step S11. For example, in the first embodiment, “経済001.doc” is selected as the original digitized document 1 to be printed by the document printing unit 3.

In step S13, the original digitized document 1 selected in step S12 is printed by the document printing unit 3. For example, when the document printing unit 3 is a printer, the paper document 4 is printed as shown in FIG. 8.

In step S14, if further original digitized documents 1 are to be printed by the document printing unit 3, the process returns to step S11 and steps S11 to S13 are repeated. If printing is finished, the process ends.

FIG. 5 is a flowchart showing the flow of processing for detecting, from the paper document 4 printed by the document printing unit 3, the corresponding original digitized document 1 stored in the document storage unit 2.

In step S21, the paper document 4 printed in step S13 is set on the document reading unit 5. For example, when the document reading unit 5 is a scanner device, the paper document 4 is placed so that it can be read.

In step S22, the paper document 4 set in step S21 is read by the document reading unit 5. For example, when the document reading unit 5 is a scanner device, the paper document 4 is read as image data.

In step S23, the image data of the paper document 4 read in step S22 is recognized as character codes by the document recognition unit 6. For example, when the document recognition unit 6 is OCR (Optical Character Recognition), only the character codes of what was recognized as characters are output.

In step S24, the document data recognized in step S23 is analyzed by the document analysis unit 7, and layout information of the document is extracted. For example, if the layout analysis shows that the text cannot be treated as a single document block in terms of layout, as in FIG. 9, it is output as multiple document blocks.

In step S25, the original digitized documents stored in step S03 are read from the document storage unit 2. For example, when the document storage unit 2 is a document database, they are read from a document database such as the one in FIG. 7.

In step S26, concept vector search processing on documents is performed between the document analyzed in step S24 (the document obtained by scanning the paper document) and the documents read in step S25 (the electronically created original digitized documents).

In step S27, the document detection unit 8 detects the corresponding document from the result of the concept search processing in step S26.

In step S28, the corresponding document detected in step S27 is output by the detection result output unit 9.
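The flow of steps S21 to S28 can be sketched as a chain of interchangeable units. The following Python sketch is illustrative only: the class name DetectionPipeline, its field names, and the stub callables in the usage part are assumptions introduced here and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class DetectionPipeline:
    """Steps S21-S28, with every unit supplied as a plain callable (all hypothetical)."""
    read_image: Callable[[str], bytes]            # document reading unit 5 (S22)
    recognize: Callable[[bytes], str]             # document recognition unit 6 (S23)
    analyze_layout: Callable[[str], List[str]]    # document analysis unit 7 (S24)
    load_originals: Callable[[], Sequence[str]]   # document storage unit 2 (S25)
    concept_search: Callable[[List[str], Sequence[str]], List[str]]  # unit 8 (S26-S27)

    def detect(self, paper_document: str) -> List[str]:
        image = self.read_image(paper_document)
        text = self.recognize(image)
        blocks = self.analyze_layout(text)
        originals = self.load_originals()
        # The returned list is what the detection result output unit 9 would show (S28).
        return self.concept_search(blocks, originals)


# Usage with trivial stand-ins; a real system would plug in a scanner driver,
# an OCR engine, a layout analyzer, and a concept vector search here.
pipeline = DetectionPipeline(
    read_image=lambda page: page.encode("utf-8"),
    recognize=lambda image: image.decode("utf-8"),
    analyze_layout=lambda text: text.split("\n\n"),
    load_originals=lambda: ["経済001.doc"],
    concept_search=lambda blocks, originals: list(originals),
)
result = pipeline.detect("first block\n\nsecond block")
```

Keeping each unit behind a callable mirrors the block structure of FIG. 2 and lets a single unit be replaced without touching the others.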

FIG. 6 is an example of a digitized document list. For example, the names of the digitized documents present on the hard disk of a PC or the like are displayed as a list, from which a document is selected.

FIG. 7 shows a document database as an example of the document storage unit 2. This document database stores the electronically created original digitized documents that are the targets of document detection. Its columns are “document name” and “document data”, and each document name corresponds one-to-one to its document data.
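A two-column document database of this kind could be sketched with SQLite as follows. The table name documents, the column names, and the helper functions are assumptions made for illustration; the specification does not name a particular database product or schema.

```python
import sqlite3
from typing import List, Tuple

# In-memory database standing in for the document storage unit 2 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "document_name TEXT PRIMARY KEY, "  # primary key keeps name and data one-to-one
    "document_data TEXT NOT NULL)"
)


def store_document(name: str, data: str) -> None:
    """Step S03: store one selected original digitized document."""
    conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?)", (name, data))
    conn.commit()


def load_documents() -> List[Tuple[str, str]]:
    """Step S25: read back every stored original digitized document."""
    cursor = conn.execute(
        "SELECT document_name, document_data FROM documents ORDER BY document_name"
    )
    return cursor.fetchall()


store_document("経済001.doc", "text of the original document")
```

Storing the same document name twice overwrites the earlier row, which is one simple way to preserve the one-to-one correspondence described above.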

FIG. 8 is an example of the paper document 4 printed by the document printing unit 3. For example, it is the paper document obtained when a document to be printed is selected from the digitized document list of electronically created original digitized documents and printed by a printer or the like.

FIG. 9 is an example of the document block structure obtained after text information is extracted from the paper document scanned by the document reading unit 5 by OCR, which serves as the document recognition unit 6, and the document layout is analyzed by the document analysis unit 7. For example, the paper document of FIG. 8 contains both character string data and image data. When OCR is performed on this document, only the character codes of what was recognized as characters are output. When the output document data is then analyzed and the document layout is extracted, the text cannot be treated as a single document block in terms of layout, so it is output as two document blocks.
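Why splitting into blocks hurts string matching but need not hurt an order-independent comparison can be illustrated with a simple word count. This is only an analogy under stated assumptions: the specification uses morphological analysis and concept vectors, whereas the sketch below uses whitespace tokenization and plain word counts, and the sample block texts are invented.

```python
from collections import Counter
from typing import List


def bag_of_words(blocks: List[str]) -> Counter:
    """Count words across all blocks; the order of the blocks does not change the counts."""
    counts: Counter = Counter()
    for block in blocks:
        counts.update(block.split())
    return counts


# Two document blocks as they might come out of layout analysis (invented text).
blocks_as_scanned = ["stock prices rose", "exports fell sharply"]
blocks_reordered = list(reversed(blocks_as_scanned))

# Order-independent representation: identical for both block orders.
same_bag = bag_of_words(blocks_as_scanned) == bag_of_words(blocks_reordered)

# Concatenated string: differs as soon as the block order differs.
same_string = " ".join(blocks_as_scanned) == " ".join(blocks_reordered)
```

Here same_bag is True while same_string is False, which is the property that a search ignoring the reading order relies on.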

FIG. 10 is an example of detection result output. For example, the documents found by detecting the stored original digitized documents from the scanned paper document are output. In the first embodiment the detection result is output by display, and “経済001.doc” and “経済001コピー.doc” are displayed as the names of the corresponding documents.

FIGS. 11 to 15 are used to describe the concept vector search processing on documents.

FIG. 11 shows the structure of the word vector dictionary. A word vector expresses the meaning of a word, on a per-sense basis, as a semantic vector (a list of feature values, one per semantic category), and the word vector dictionary is the collection of these vectors. Each dimension represents one semantic category. For each word (sense), the dictionary stores a value (= feature value) indicating how strongly the word implies the semantic category of each dimension. For example, dimension 3 represents “space/sky”, dimension 4 represents “transaction/trading”, and dimension 7 represents “gesture/motion”. Word 7 is the word “form” in the sense of a business form. Its value in dimension 3 is 0, which means that “form (business form)” has no meaning related to the category “space/sky”. Its value in dimension 4 is large and its value in dimension 7 is small, which means that “form (business form)” strongly carries the meaning “transaction/trading” but only weakly the meaning “gesture/motion”. In contrast, word 8 has a small value in dimension 4 and a large value in dimension 7, which means that “form (posture)” strongly carries the meaning “gesture/motion” but only weakly the meaning “transaction/trading”. In this way, the word vector dictionary shows what each word means, sense by sense.
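A dictionary of this shape could be held in memory as a mapping from word sense to per-dimension feature values. In the sketch below, the names CATEGORIES, WORD_VECTORS, and strongest_category are hypothetical, and every numeric value is invented for illustration except the 0 of “form (business form)” in dimension 3, which the description states.

```python
from typing import Dict

# Dimension number -> semantic category (only these three are named in the text).
CATEGORIES: Dict[int, str] = {
    3: "space/sky",
    4: "transaction/trading",
    7: "gesture/motion",
}

# Word sense -> feature value per dimension. Values are made up for illustration.
WORD_VECTORS: Dict[str, Dict[int, float]] = {
    "form (business form)": {3: 0.0, 4: 0.8, 7: 0.1},  # word 7 in the description
    "form (posture)": {3: 0.0, 4: 0.1, 7: 0.8},        # word 8 in the description
}


def strongest_category(sense: str) -> str:
    """Return the semantic category with the largest feature value for a word sense."""
    vector = WORD_VECTORS[sense]
    best_dimension = max(vector, key=vector.get)
    return CATEGORIES[best_dimension]
```

Keeping one entry per sense rather than per surface word is what lets the two meanings of “form” receive different vectors.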

FIG. 12 shows the concept search index. The concept search index stores, for each document, the concept vector (= document vector) corresponding to that document. Each document vector indicates what semantic content the document expresses. For example, for document ID = 6949 the feature values of dimensions 4 and 7 are 0.009 and 0.425, respectively, and for document ID = 6953 the feature values of dimensions 4 and 7 are 0.362 and 0.008, respectively. This shows that document ID = 6949 is a text containing almost nothing of the category “transaction/trading”, and that document ID = 6953 is a text containing almost nothing of the category “gesture/motion”.
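How a document vector is derived from word vectors is not spelled out at this point in the description. One common approach, assumed here purely for illustration, is to sum the vectors of the words in the document dimension by dimension and scale the result to unit length. The function name document_vector, the index layout, and all numbers below are hypothetical.

```python
import math
from typing import Dict, List


def document_vector(word_vectors: List[List[float]]) -> List[float]:
    """Sum word vectors per dimension, then scale to norm 1 (assumed method)."""
    dimensions = len(word_vectors[0])
    total = [sum(vec[d] for vec in word_vectors) for d in range(dimensions)]
    norm = math.sqrt(sum(value * value for value in total))
    # An all-zero sum cannot be normalized; it is returned unchanged.
    return [value / norm for value in total] if norm > 0 else total


# Concept search index: document ID -> document vector (IDs and values invented).
index: Dict[int, List[float]] = {}
index[1] = document_vector([[0.0, 0.1, 0.9], [0.0, 0.0, 0.7]])
index[2] = document_vector([[0.5, 0.5, 0.0]])
```

Because the sum does not depend on the order in which words are added, a vector built this way is unaffected by the order of the document blocks.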

FIG. 13 shows how the concept similarity is calculated when judging the conceptual similarity between two documents. Concept search processing finds, in the document database, documents that are conceptually similar to a search query. In this embodiment the search query is regarded as one document and its document vector is obtained (= query vector = document vector of the scanned document). The cosine measure between this query vector and the document vector of each document in the target document database (= document vector of an original digitized document) is then computed and used as the concept similarity.

The document vector X is an n-dimensional vector with the values x1 to xn in its dimensions. The query vector Q is likewise an n-dimensional vector. The similarity given by the cosine measure is written SD(X, Q). The cosine measure SD(X, Q) is the inner product of the two vectors divided by the product of their norms. In this embodiment both vectors are normalized to norm 1, so SD(X, Q) equals the inner product itself and can be obtained as the sum, over all dimensions, of the products of the feature values in the same dimension.
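The cosine measure and its simplification for unit vectors can be sketched as follows. The function names cosine_measure and inner_product and the sample vectors are assumptions made for illustration; returning 0.0 for an all-zero vector is also a choice made here, not something the description specifies.

```python
import math
from typing import Sequence


def inner_product(x: Sequence[float], q: Sequence[float]) -> float:
    """Sum of the products of the feature values in the same dimension."""
    return sum(xi * qi for xi, qi in zip(x, q))


def cosine_measure(x: Sequence[float], q: Sequence[float]) -> float:
    """SD(X, Q): inner product divided by the product of the two norms."""
    norm_x = math.sqrt(inner_product(x, x))
    norm_q = math.sqrt(inner_product(q, q))
    if norm_x == 0.0 or norm_q == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return inner_product(x, q) / (norm_x * norm_q)


# Both sample vectors have norm 1, so the two functions agree (up to rounding).
x = [0.6, 0.8]
q = [0.8, 0.6]
similarity = cosine_measure(x, q)
shortcut = inner_product(x, q)
```

With pre-normalized vectors the shortcut saves two square roots and a division per comparison, which matters when every stored document is scored.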

FIG. 14 is a flowchart showing the flow of similar document detection by concept vector search processing on documents.

In step S31, the document analyzed in step S24 (the document obtained by scanning the paper document) is input.

In step S32, document concept vector generation processing is performed on the document (scanned document) input in step S31.

In step S33, the document vector generated in step S32 is processed into search-index form to create the concept search index.

In step S34, a document read in step S25 (an electronically created original digitized document) is input.

In step S35, document concept vector generation processing is performed on the document (original document) input in step S34.

In step S36, the document concept vector of the scanned document generated in step S32 is compared with the document concept vector of the original document generated in step S35, and the concept similarity between the two vectors is calculated. The result is output to a sorted list.

In the first embodiment of the present invention, the comparison is made per document, that is, on the concept vector of the entire document.

In step S37, if any of the documents read in step S25 (electronically created original digitized documents) remains unprocessed, processing returns to step S34 and steps S34 to S36 are repeated. If no unprocessed document remains, processing proceeds to step S38.

In step S38, similar documents are detected from the concept similarity results for the document concept vectors obtained in step S36.
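
The loop of FIG. 14 can be sketched as below, assuming the document vectors have already been generated and normalized to norm 1. The text does not say how step S38 selects documents from the sorted list, so reporting the top-ranked entry above a `min_similarity` threshold is an assumption of this sketch, as are all function and parameter names.

```python
def dot(x, q):
    """Inner product; equals the cosine measure when both vectors have norm 1."""
    return sum(a * b for a, b in zip(x, q))


def rank_originals(query_vector, originals):
    """Steps S34 to S37: score every stored original against the scanned document.

    `originals` maps a document name to its unit-length document vector.
    Returns (similarity, name) pairs, highest similarity first.
    """
    sort_list = []
    for name, vector in originals.items():
        sort_list.append((dot(query_vector, vector), name))  # step S36
    sort_list.sort(key=lambda pair: pair[0], reverse=True)
    return sort_list


def detect_similar(query_vector, originals, min_similarity=0.0):
    """Step S38 (assumed rule): report the top-ranked document, if any."""
    ranked = rank_originals(query_vector, originals)
    if not ranked or ranked[0][0] < min_similarity:
        return None
    return ranked[0][1]


if __name__ == "__main__":
    # Hypothetical two-dimensional vectors, for illustration only.
    stored = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}
    print(rank_originals([0.6, 0.8], stored))
    print(detect_similar([0.6, 0.8], stored))
```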

FIG. 15 is a flowchart showing the flow of document concept vector generation.

Step S41 extracts words from the document by performing morphological analysis with a morphological analysis dictionary.

In step S42, after the morphological analysis, each word is disambiguated on the basis of the analysis result. Various word sense disambiguation methods have been proposed, and any of them may be used: for example, disambiguation by matching the dependency analysis result against a co-occurrence database, or disambiguation based on concept matching with a user profile. For a word that could not be sufficiently disambiguated, multiple senses are output.

In step S43, the document concept vector is generated. The word vector dictionary is searched using the words extracted and the senses identified in steps S41 and S42, the feature value of each dimension is obtained for each word, and the document vector is generated from their sum. For a word whose sense could not be identified, the word vectors of all senses having that surface form are added, weighted by frequency.
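
Step S43 can be sketched as follows. Morphological analysis and disambiguation (steps S41 and S42) are not implemented here; the input is assumed to be a list of (surface form, sense) pairs, with `None` as the sense when the word could not be disambiguated. The tiny dictionary, the sense frequencies and all names are hypothetical values invented for this example.

```python
import math

DIMENSIONS = 3

# Hypothetical word vector dictionary: (surface form, sense) -> feature values.
WORD_VECTORS = {
    ("bank", "finance"): [0.9, 0.1, 0.0],
    ("bank", "river"): [0.0, 0.2, 0.8],
    ("loan", "finance"): [1.0, 0.0, 0.0],
}

# Hypothetical relative frequency of each sense, used as the weight
# for words whose sense could not be identified.
SENSE_FREQUENCY = {
    ("bank", "finance"): 0.75,
    ("bank", "river"): 0.25,
    ("loan", "finance"): 1.0,
}


def word_vector(surface, sense=None):
    """Look up one word; unresolved words get a frequency-weighted sum of all senses."""
    if sense is not None:
        return list(WORD_VECTORS.get((surface, sense), [0.0] * DIMENSIONS))
    total = [0.0] * DIMENSIONS
    for (form, sense_name), vector in WORD_VECTORS.items():
        if form == surface:
            weight = SENSE_FREQUENCY.get((form, sense_name), 0.0)
            for i, value in enumerate(vector):
                total[i] += weight * value
    return total


def document_vector(tokens):
    """Sum the word vectors of all tokens and normalize the result to norm 1."""
    total = [0.0] * DIMENSIONS
    for surface, sense in tokens:
        for i, value in enumerate(word_vector(surface, sense)):
            total[i] += value
    length = math.sqrt(sum(x * x for x in total))
    if length == 0.0:
        return total  # no known words: leave the zero vector as it is
    return [x / length for x in total]


if __name__ == "__main__":
    print(document_vector([("loan", "finance"), ("bank", None)]))
```

The normalization at the end follows the statement that both vectors have norm 1 in this embodiment; words missing from the dictionary simply contribute nothing in this sketch.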

(Second embodiment)
The configuration and the flow of processing are the same as in the first embodiment of the present invention.
In step S36, the document concept vector of the scanned document generated in step S32 is compared with the document concept vector of the original document generated in step S35, and the concept similarity between the two vectors is calculated. The result is output to a sorted list.

In the second embodiment of the present invention, however, the comparison is made per document layout unit extracted in step S24: the concept vector of each document block making up the whole document is compared block by block.
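
One way to compare documents block by block is sketched below. The text states only that the concept vectors of the document blocks are compared for each block; matching every scanned block to its best-scoring original block and averaging the scores is an assumption of this sketch, chosen because it does not depend on the order in which the blocks are read. All names are hypothetical.

```python
def dot(x, q):
    """Inner product; equals the cosine measure for unit-length vectors."""
    return sum(a * b for a, b in zip(x, q))


def block_similarity(scan_blocks, original_blocks):
    """Average, over scanned blocks, of the best similarity to any original block.

    Both arguments are lists of unit-length block concept vectors.
    """
    if not scan_blocks or not original_blocks:
        return 0.0
    best_scores = [
        max(dot(scan, original) for original in original_blocks)
        for scan in scan_blocks
    ]
    return sum(best_scores) / len(best_scores)


if __name__ == "__main__":
    # Hypothetical block vectors: the same two blocks, read in a different order.
    scanned = [[1.0, 0.0], [0.0, 1.0]]
    original = [[0.0, 1.0], [1.0, 0.0]]
    print(block_similarity(scanned, original))
```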

(Third embodiment)
A third embodiment of the present invention is described below with reference to FIGS. 16 to 20.

The block diagram showing the configuration of the character processing apparatus to which the third embodiment of the present invention is applied is the same as the block diagram of FIG. 1 for the first embodiment.

FIG. 16 shows the flow of processing in the third embodiment of the present invention.

The third embodiment of the present invention comprises a digitized document 2-1, a document storage unit 2-2, a paper document 2-3, a document reading unit 2-4, a document recognition unit 2-5, a document detection unit 2-6, and a detection result output unit 2-7.

Reference numeral 2-1 denotes a digitized document, that is, a document created electronically.

Reference numeral 2-2 denotes the document storage unit, which stores the electronically created digitized document 2-1.

Reference numeral 2-3 denotes a paper document, either printed or handwritten.

Reference numeral 2-4 denotes the document reading unit, which reads the printed or handwritten paper document 2-3 as image data.

Reference numeral 2-5 denotes the document recognition unit, which recognizes the image data of the printed or handwritten paper document read by the document reading unit 2-4 as character codes.

Reference numeral 2-6 denotes the document detection unit, which performs a concept search to detect, among the digitized documents stored in the document storage unit 2-2, documents similar to the document data recognized by the document recognition unit 2-5.

Reference numeral 2-7 denotes the detection result output unit, which outputs the result for the documents detected by the document detection unit 2-6.

The flowchart of the processing that stores electronically created digitized documents in the document storage unit in the third embodiment is the same as the flowchart of FIG. 3 for the first embodiment.

FIG. 17 is a flowchart showing the flow of processing that detects, from the printed or handwritten paper document 2-3, a document among the electronically created digitized documents 2-1 stored in the document storage unit 2-2.

In step S2-11, the printed or handwritten paper document 2-3 is set in the document reading unit 2-4. For example, when the document reading unit 2-4 is a scanner device, the paper document 2-3 is placed so that it can be read.

In step S2-12, the paper document 2-3 set in step S2-11 is read by the document reading unit 2-4. For example, when the document reading unit 2-4 is a scanner device, the printed or handwritten paper document 2-3 is read as image data.

In step S2-13, the image data of the printed or handwritten paper document 2-3 read in step S2-12 is recognized as character codes by the document recognition unit 2-5. For example, when the document recognition unit 2-5 is an OCR (Optical Character Recognition) unit, it outputs the character codes recognized as characters.

In step S2-14, the electronically created digitized documents stored in step S2-03 are read from the document storage unit 2-2. For example, when the document storage unit 2-2 that stores the digitized documents 2-1 is a document database, they are read from the document database.

The document database of the third embodiment is the same as the document database of FIG. 7 for the first embodiment.

In step S2-15, concept vector search processing on documents is performed on the document recognized in step S2-13 (the document obtained by scanning the paper document) and the documents read in step S2-14 (electronically created digitized documents).

In step S2-16, the document detection unit 2-6 detects the corresponding document from the result of the concept search processing in step S2-15.

In step S2-17, the detection result output unit 2-7 outputs the corresponding document detected in step S2-16.

FIG. 18 shows an example of the electronically created digitized documents stored in the document storage unit 2-2. In this example, digitized documents including "経済001.doc" ("Economy 001") and "天気.doc" ("Weather") are stored.

FIG. 19 shows an example of the document data after text information has been extracted, by the OCR serving as the document recognition unit 2-5, from the paper document scanned by the document reading unit 2-4. When OCR is performed on a printed or handwritten paper document, the character codes recognized as characters are output. In the third embodiment, the output concerns consumption expenditure.

FIG. 20 shows an example of the detection result output. As the result of detecting stored digitized documents from the scanned paper document, similar documents are output. In the third embodiment the detection result is output as a display. Because the content of the scanned document is document data about consumption expenditure, its similarity to "economy" is judged to be high, and "経済001.doc" is displayed as the name of the similar document.

The description of the concept vector search processing on documents in the third embodiment is the same as that given for the first embodiment with reference to FIGS. 11 to 15.

FIG. 1 is a block diagram showing the overall configuration of the first embodiment of the present invention.
FIG. 2 is a diagram showing the flow of processing in the first embodiment.
FIG. 3 is a flowchart showing the flow of document storage processing in the first embodiment.
FIG. 4 is a flowchart showing the flow of document printing processing in the first embodiment.
FIG. 5 is a flowchart showing the flow of document detection processing in the first embodiment.
FIG. 6 is an example of the list of digitized documents in the first embodiment.
FIG. 7 shows a document database as an example of the document storage unit 2 of the first embodiment.
FIG. 8 is an example of the paper document 4 printed by the document printing unit 3 of the first embodiment.
FIG. 9 is an example of the document block structure obtained in the first embodiment after text information is extracted, by the OCR serving as the document recognition unit 6, from the paper document scanned by the document reading unit 5, and the document layout is analyzed by the document analysis unit 7.
FIG. 10 is an example of the detection result output in the first embodiment.
FIG. 11 shows the structure of the word vector dictionary of the first embodiment.
FIG. 12 shows the concept search index of the first embodiment.
FIG. 13 shows how the concept similarity is calculated in the first embodiment when judging the conceptual similarity between two documents.
FIG. 14 is a flowchart showing the flow of similar document detection by concept vector search processing on documents in the first embodiment.
FIG. 15 is a flowchart showing the flow of document concept vector generation in the first embodiment.
FIG. 16 is a diagram showing the flow of processing in the third embodiment of the present invention.
FIG. 17 is a flowchart showing the flow of document detection processing in the third embodiment.
FIG. 18 is an example of the printed or handwritten paper document 2-3 of the third embodiment.
FIG. 19 is an example of the document data obtained in the third embodiment after text information is extracted, by the OCR serving as the document recognition unit 2-5, from the paper document scanned by the document reading unit 2-4.
FIG. 20 is an example of the detection result output in the third embodiment.

Explanation of symbols

1 Digitized document
2 Document storage unit
3 Document printing unit
4 Paper document
5 Document reading unit
6 Document recognition unit
7 Document analysis unit
8 Document detection unit
9 Detection result output unit
2-1 Digitized document
2-2 Document storage unit
2-3 Paper document
2-4 Document reading unit
2-5 Document recognition unit
2-6 Document detection unit
2-7 Detection result output unit

Claims

1. A document detection method and system for detecting a document, comprising:
document storage means for storing an electronically created original digitized document;
document reading means for reading, as image data, a document printed from the original digitized document;
document recognition means for recognizing the image data read by the document reading means as character codes;
document detection means for performing a concept search to detect, from the document data recognized by the document recognition means, the original digitized document stored by the document storage means, and detecting the corresponding document; and
detection result output means for outputting the result detected by the document detection means.

2. The document detection method and system according to claim 1, wherein in the document detection means, the concept search is a concept vector search for a document.

3. A document detection method and system for detecting a document, comprising:
document storage means for storing an electronically created original digitized document;
document reading means for reading, as image data, a document printed from the original digitized document;
document recognition means for recognizing the image data read by the document reading means as character codes;
document analysis means for analyzing the document data recognized by the document recognition means and extracting document layout information;
document detection means for performing a concept search to detect, from the document data analyzed by the document analysis means, the original digitized document stored by the document storage means, and detecting the corresponding document; and
detection result output means for outputting the result detected by the document detection means.

4. The document detection method and system according to claim 3, wherein in the document detection means, the concept search is a concept vector search for a document.

5. The document detection method and system according to claim 3, wherein the document detection means detects documents in units of whole documents.

6. The document detection method and system according to claim 3, wherein the document detection means detects documents in units of the document layout units extracted by the document analysis means.

7. A document detection method and system for detecting a document, comprising:
document storage means for storing electronically created digitized documents;
document reading means for reading a printed or handwritten document as image data;
document recognition means for recognizing the image data read by the document reading means as character codes;
document detection means for performing a concept search to detect, among the digitized documents stored by the document storage means, documents similar to the document data recognized by the document recognition means; and
detection result output means for outputting the result detected by the document detection means.

8. The document detection method and system according to claim 7, wherein in the document detection means, the concept search is a concept vector search for a document.

9. A document detection method and system for detecting a document, comprising:
document storage means for storing electronically created digitized documents;
document reading means for reading a printed or handwritten document as image data;
document recognition means for recognizing the image data read by the document reading means as character codes;
document analysis means for analyzing the document data recognized by the document recognition means and extracting document layout information;
document detection means for performing a concept search to detect, among the digitized documents stored by the document storage means, documents similar to the document data analyzed by the document analysis means; and
detection result output means for outputting the result detected by the document detection means.

10. The document detection method and system according to claim 9, wherein in the document detection means, the concept search is a concept vector search for a document.

11. The document detection method and system according to claim 9, wherein the document detection means detects documents in units of whole documents.

12. The document detection method and system according to claim 9, wherein the document detection means detects documents in units of the document layout units extracted by the document analysis means.