JP2011221894A

JP2011221894A - Secure document detection method, secure document detection program, and optical character reader

Info

Publication number: JP2011221894A
Application number: JP2010092071A
Authority: JP
Inventors: Takeshi Nagasaki; 健永崎; Masakazu Fujio; 正和藤尾; Shoji Ikeda; 尚司池田; Toshiyuki Kuwana; 利幸桑名
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-04-13
Filing date: 2010-04-13
Publication date: 2011-11-04
Anticipated expiration: 2030-04-13
Also published as: JP5629908B2

Abstract

PROBLEM TO BE SOLVED: To detect a secure document with high accuracy using a simple definition. SOLUTION: A secure document detection method is executed by a secure document detector that includes an arithmetic unit and a storage device holding a dictionary. The dictionary registers a plurality of keyword pairs, each including at least two keywords, together with information indicating the positional relationship in a document of the two keywords included in each keyword pair. The secure document detection method includes a first procedure of extracting a keyword pair registered in the dictionary from input document data, and a second procedure of determining whether or not the input document data is a secure document on the basis of the positional relationship, in the input document data, of the two keywords included in the extracted keyword pair.

Description

The present invention relates to a technique for managing information security, and more particularly to a technique for detecting secure documents among documents stored in a storage device or among printed documents.

With growing public concern about information security, there is demand for a technique that automatically and accurately detects whether secure information exists in the large volume of electronic documents stored on servers or on personally owned personal computers (PCs). Here, secure information is information that must be kept confidential, such as a company's own confidential information, another company's confidential information, or personal information. Patent Document 1, for example, discloses such an automatic detection technique.

The confidential document detection system described in Patent Document 1 divides an input document into a plurality of areas, detects the feature elements of each area by referring to a dictionary corresponding to that area, and determines the confidential information category to which each document belongs on the basis of the detected feature elements.

Japanese Unexamined Patent Application Publication No. 2006-209649 (JP 2006-209649 A)

When a conventional text search is used, as in tools for checking personal PCs, secure documents are often falsely detected, so considerable human effort is needed for inspection. In addition, conventional secure document detection lets the user specify the keywords to be detected, but it has difficulty handling documents in diverse formats.

For example, the confidential document detection system described in Patent Document 1 detects feature elements area by area, but cannot detect a feature element that appears in an undefined area. Furthermore, this system determines the confidential information category from the correspondence between a keyword and the area in which it was detected, but cannot determine the category from the relationship among a plurality of keywords.

A representative example of the present invention is as follows. It is a secure document detection method executed by a secure document detection device, wherein the secure document detection device includes an arithmetic device and a storage device that holds a dictionary; the dictionary registers a plurality of keyword pairs, each including at least two keywords, and information indicating the positional relationship in a document of the two keywords included in each keyword pair; and the secure document detection method includes a first procedure of extracting a keyword pair registered in the dictionary from input document data, and a second procedure of determining whether or not the input document data is a secure document on the basis of the positional relationship, in the input document data, of the two keywords included in the extracted keyword pair.

According to an embodiment of the present invention, secure documents can be detected with high accuracy using a simple definition, even when the variety of input documents increases.

A block diagram showing an overview of an embodiment of the present invention.
A block diagram showing the hardware configuration of the secure electronic document management system of the first embodiment of the present invention.
An explanatory diagram showing the overall processing executed by the secure document detection device of the first embodiment of the present invention.
A flowchart explaining the detailed procedures of the document element extraction process and the secure document determination process of the first embodiment of the present invention.
An explanatory diagram of a specific example of a secure document input to the secure document detection device of the first embodiment of the present invention.
An explanatory diagram of secure document detection based on combinations of a plurality of keywords, as executed by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram of the blocks identified by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram showing a first specific example of the keyword extraction and secure document determination executed by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram showing a second specific example of the keyword extraction and secure document determination executed by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram showing a third specific example of the keyword extraction and secure document determination executed by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram showing a fourth specific example of the keyword extraction and secure document determination executed by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram showing a fifth specific example of the keyword extraction and secure document determination executed by the secure document detection device of the first embodiment of the present invention.
An explanatory diagram of the placement cost table included in the secure document dictionary of the first embodiment of the present invention.
An explanatory diagram of the secure document dictionary of the first embodiment of the present invention.
A block diagram showing the hardware configuration of the OCR-integrated secure document detection device of the second embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing an overview of an embodiment of the present invention.

First, conventional secure paper document management is described.

An optical character reader (OCR device) 0302 reads an input paper document 0301 and creates a document file 0303. The document file 0303 contains the image data, the text data, or both, of what is written on the paper document 0301. The document file may be, for example, a PDF (Portable Document Format) (registered trademark) file. The user views the document file 0303 displayed by a computer 0304, judges whether or not the document file 0303 is secure, and inputs the result to the computer 0304. On judging that the document file 0303 is secure, the user may input to the computer 0304 an instruction to lock the document file 0303. The computer 0304 locks the document file 0303, thereby creating a locked document file 0305, and outputs the document file 0305. Here, locking means processing that restricts viewing of the document file 0305; a typical example is encryption.

Next, OCR-integrated secure paper document management, which is one embodiment of the present invention, is described.

The input paper document 0306 may be the same kind of document as the paper document 0301 described above. An OCR device 0307 reads the input paper document 0306 and extracts the image information and text information it contains. A computer 0308 determines whether or not the extracted information includes secure information. If it does, the computer 0308 creates and outputs a locked document file 0309 containing the extracted information. Because the determination and the creation are executed automatically without user intervention, the computer 0308 does not need to display the document file before it is locked. Consequently, neither the OCR device 0307 nor the computer 0308 needs to create an unlocked document file before the determination, and the user does not need to input an instruction to lock the document file to the computer 0308.

Note that if the created document file is secure, the original paper document 0306 is also secure. The computer 0308 may therefore control how the OCR device 0307 ejects the paper document 0306 on the basis of the result of determining whether or not the document file is secure.

Details of this OCR-integrated secure paper document management are described later as the second embodiment of the present invention.

Next, secure electronic document management, which is another embodiment of the present invention, is described.

When a document file 0310 is input, a computer 0311 determines whether or not the document file 0310 is secure. The document file 0310 may be, for example, the same kind of file as the document file 0303. If the document file 0310 is determined to be secure, the computer 0311 locks it, thereby creating a locked document file 0312, and outputs that file. As with the OCR-integrated secure paper document management described above, the determination and the creation are executed automatically without user intervention.

Details of this secure electronic document management are described later as the first embodiment of the present invention.

Although the description above uses a PDF file as an example of a document file containing data such as text, the document files may be document files or drawing files in a format other than PDF.

<First Embodiment>
FIG. 2 is a block diagram showing the hardware configuration of the secure document detection device 0100 of the first embodiment of the present invention.

The secure document detection device 0100 is an example of a device that implements the secure electronic document management of the present invention shown in FIG. 1.

The secure document detection device 0100 of this embodiment includes an operation terminal device 0101, a display terminal device 0102, an external storage device 0103, a memory 0104, a central processing unit 0105, a communication device 0107, and a communication line 0106 that interconnects them. The secure document detection device 0100 may be, for example, an ordinary personal computer.

The operation terminal device 0101 is, for example, a keyboard or a mouse, and is used by the user to input instructions, data, and the like to the secure document detection device 0100.

The display terminal device 0102 is a device that displays text, images, and the like, such as a liquid crystal display.

The external storage device 0103 is a storage device such as a hard disk drive or a flash memory, and stores input document data (for example, the document file 0310) and output document data (for example, the locked document file 0312). It may also store the programs and the like that the central processing unit 0105 executes to implement this embodiment.

The memory 0104 is, for example, a semiconductor memory, and stores the programs executed by the central processing unit 0105 and the data they refer to. At least part of the programs, data, and the like stored in the external storage device 0103 may be copied to the memory 0104 as needed.

The central processing unit 0105 executes the programs stored in the memory 0104 and, as needed, controls the operation terminal device 0101, the display terminal device 0102, the external storage device 0103, and the communication device 0107. In the following description, processing attributed to the secure document detection device 0100 is actually executed by the central processing unit 0105.

The communication device 0107 is an interface that is connected to a network (not shown) and communicates with other devices (not shown) connected to that network. For example, the communication device 0107 may receive the document file 0310 as input data and transmit the locked document file 0312.

FIG. 3 is an explanatory diagram showing the overall processing executed by the secure document detection device 0100 of the first embodiment of the present invention.

The processing executed by the secure document detection device 0100 is divided into a learning phase 0500 and a usage phase 0510.

In the learning phase 0500, the secure document detection device 0100 creates a secure document dictionary 0504 on the basis of input information.

Specifically, for example, the user inputs secure document examples 0501 and a secure term definition 0502 to the secure document detection device 0100.

A secure document example 0501 is an actual document file that the user considers should be detected as a secure document. The secure document example 0501 may be input to the secure document detection device 0100 via, for example, the communication device 0107.

The secure term definition 0502 is a list of keywords used for secure document detection. The user can input to the secure document detection device 0100, as the secure term definition 0502, a list of keywords consisting of character strings contained in documents that should be detected as secure documents. In particular, when documents containing a combination of a character string indicating the creator or owner of the document, such as "XX Manufacturing", and a character string indicating the type of document, such as "design document", need to be detected as secure documents, the user can input such a combination of keywords (hereinafter also called a keyword pair) to the secure document detection device 0100 as the secure term definition 0502. The secure term definition 0502 may be input via, for example, the communication device 0107 or the operation terminal device 0101.

The secure document detection device 0100 executes a secure dictionary learning process 0503 on the basis of the input secure document examples 0501 and secure term definition 0502. As a result, the secure document dictionary 0504 is created. As described later (see FIG. 10), the secure document dictionary 0504 contains information indicating, among other things, the combinations of character strings registered as keywords and the positional relationship on the document of the two keywords included in each keyword pair. The information indicating the positional relationship of two keywords is, for example, a vector representing the direction and distance in which the keywords are placed relative to each other. The positional relationship of the two keywords included in such a keyword pair is hereinafter also called "the positional relationship of the keyword pair".
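As an illustration only, the dictionary contents described above could be held in a structure like the following. The class and field names (`KeywordPair`, `dx`, `dy`, `SecureDocumentDictionary`) and the Euclidean `offset_deviation` measure are assumptions made for this sketch; they are not taken from the specification, and the deviation is not the secure information likelihood defined later.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KeywordPair:
    """Two keywords plus the registered offset vector from first to second."""
    first: str
    second: str
    dx: float  # registered horizontal offset (hypothetical page units)
    dy: float  # registered vertical offset (hypothetical page units)


@dataclass
class SecureDocumentDictionary:
    """Hypothetical container for the registered keyword pairs."""
    pairs: list = field(default_factory=list)

    def keywords(self) -> set:
        """All distinct keywords, usable as search keys during extraction."""
        return {kw for p in self.pairs for kw in (p.first, p.second)}


def offset_deviation(pair: KeywordPair, pos_first, pos_second) -> float:
    """Distance between the observed offset and the registered vector.

    An illustrative measure only; smaller means a closer match.
    """
    observed_dx = pos_second[0] - pos_first[0]
    observed_dy = pos_second[1] - pos_first[1]
    return math.hypot(observed_dx - pair.dx, observed_dy - pair.dy)


pair = KeywordPair("XX Manufacturing", "design document", dx=0.0, dy=40.0)
dictionary = SecureDocumentDictionary(pairs=[pair])
deviation = offset_deviation(pair, (10.0, 10.0), (13.0, 54.0))
```

Storing the relationship as a vector keeps the definition independent of where on the page the pair appears, which matches the stated goal of handling documents in diverse formats.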

Next, the usage phase 0510 is described. The user inputs an unmanaged document 0511 to the secure document detection device 0100. The unmanaged document 0511 is a document that the user is about to bring under management; in other words, it is a document for which it must be determined whether or not it contains secure information. How the document is managed (for example, whether or not it is locked) is decided according to the result of that determination. The unmanaged document 0511 corresponds to, for example, the document file 0310 in FIG. 1.

The secure document detection device 0100 executes a document element extraction process 0512 on the input unmanaged document 0511. This extracts document elements from the unmanaged document 0511, namely text, keywords (KW), ruled lines, information indicating keyword positions, information indicating the arrangement of blocks, and the like. To extract the keywords and their positions, the keyword information 0513 included in the secure document dictionary 0504 is referenced.

If the file format of the input unmanaged document 0511 differs from the file formats that the document element extraction process 0512 can handle, the secure document detection device 0100 executes a document conversion process 0517 to convert the file format of the input unmanaged document 0511. For example, if the document element extraction process 0512 can handle only PDF files and a file in another format (for example, a document file created with general-purpose word processing software) is input as the unmanaged document 0511, the document conversion process 0517 converts the unmanaged document 0511 to PDF.

Next, the secure document detection device 0100 executes a secure document determination process 0515 on the document information 0518 extracted by the document element extraction process 0512. Specifically, the secure document detection device 0100 refers to the document information 0518 and to the pattern information and placement likelihood 0514 included in the secure document dictionary 0504, calculates the secure information likelihood of the input unmanaged document 0511, and on that basis determines whether or not the unmanaged document 0511 is a secure document (that is, whether or not it contains secure information).

The secure document detection device 0100 then outputs the result 0516 of the secure document determination process 0515. This result includes information indicating whether or not the unmanaged document 0511 is a secure document, and may further include information indicating the secure information likelihood or a risk level based on it.

If the secure document detection device 0100 already holds a secure document dictionary 0504, it can execute only the usage phase 0510 without executing the learning phase 0500. For example, the user may obtain a secure document dictionary 0504 created by the manufacturer of the secure document detection device 0100, or one created by another user who executed the learning phase 0500.

FIG. 4 is a flowchart explaining the detailed procedures of the document element extraction process 0512 and the secure document determination process 0515 of the first embodiment of the present invention.

The secure document detection device 0100 extracts document elements from the input electronic document file 0411 (step 0401). Specifically, the secure document detection device 0100 extracts from the electronic document file 0411 the character information of the text contained in the electronic document, the positions on the page at which the characters are written, the positions of ruled lines, and so on. In this way each character is extracted, and the character string corresponding to each line is identified from the positions of the characters and the ruled lines. The electronic document file 0411 corresponds to the unmanaged document 0511 in FIG. 3.

Next, the secure document detection device 0100 analyzes the document structure using the extracted document elements (step 0402). Specifically, the secure document detection device 0100 divides the characters on the document into blocks on the basis of the positions of the extracted characters and ruled lines. For example, if the document consists of a header, a footer, and a body, each of these is identified as one block. If the body is set in columns, each column is identified as one block. If the document contains a table, the table is identified as one block. The secure document detection device 0100 may refer to a document structure dictionary (not shown) in step 0402. This identifies the block to which each line extracted in step 0401 belongs. Such document structure analysis can be performed by known methods, for example the X-Y recursive analysis method or the method of minimizing the movement distance between character strings.

Next, the secure document detection device 0100 reorders the blocks and lines so that they match the reading order of the text (in other words, the order in which they appear in the document) (step 0403). The lines within each block are sorted into reading order, and the blocks themselves are also sorted into reading order. For example, if the body consists of a plurality of blocks, those blocks are sorted into reading order. Like step 0402, this reordering can be performed by known methods.
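The specification leaves the reordering of step 0403 to known methods. As a minimal sketch, assuming horizontally written text, page coordinates whose y value grows downward, and blocks that do not sit side by side in a way that needs column logic, a plain geometric sort could look like this. The `Line` and `Block` types and `sort_reading_order` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    text: str
    x: float  # left edge of the line
    y: float  # top edge of the line (grows downward)


@dataclass
class Block:
    x: float
    y: float
    lines: list = field(default_factory=list)


def sort_reading_order(blocks):
    """Return new blocks ordered top-to-bottom, then left-to-right.

    Lines inside each block are ordered the same way. This is a simple
    stand-in for the known reading-order methods the text refers to.
    """
    ordered = []
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        lines = sorted(block.lines, key=lambda ln: (ln.y, ln.x))
        ordered.append(Block(block.x, block.y, lines))
    return ordered


body = Block(0, 100, [Line("second line", 0, 120), Line("first line", 0, 100)])
header = Block(0, 0, [Line("XX Manufacturing", 0, 0)])
ordered_blocks = sort_reading_order([body, header])
reading_text = [ln.text for b in ordered_blocks for ln in b.lines]
```

Multi-column bodies would need the column blocks from step 0402 to be ordered left to right before their lines are read, which this simple sort does not attempt.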

Next, the secure document detection device 0100 extracts document elements (step 0404). Specifically, the secure document detection device 0100 extracts the ruled lines and the layout formed by those ruled lines and the like. The secure document detection device 0100 also extracts keywords from the character strings extracted in step 0401. Specifically, it searches the character strings extracted in step 0401 using the keywords registered in the keyword information 0412 of the secure document dictionary as search keys. The keyword information 0412 corresponds to the keyword information 0513 in FIG. 3.
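The keyword search of step 0404 can be sketched as a substring search over the line strings from step 0401. The function name `find_keywords` and the `(keyword, line index, character offset)` result shape are assumptions for illustration; the specification does not state how hits are represented, and exact, case-sensitive matching is likewise an assumption.

```python
def find_keywords(lines, keywords):
    """Search each line for every registered keyword.

    Returns a list of (keyword, line_index, char_offset) tuples, in line
    order. Matching is exact and case-sensitive in this sketch.
    """
    hits = []
    for line_index, text in enumerate(lines):
        for keyword in sorted(keywords):  # sorted for a stable result order
            if not keyword:
                continue  # ignore empty search keys
            start = text.find(keyword)
            while start != -1:
                hits.append((keyword, line_index, start))
                start = text.find(keyword, start + 1)
    return hits


lines = ["XX Manufacturing", "Design document for pump", "design document rev 2"]
hits = find_keywords(lines, {"XX Manufacturing", "design document"})
```

In a full pipeline the line index and character offset would be mapped back to page coordinates so that the positional relationship of a keyword pair can be evaluated.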

Next, the secure document detection device 0100 calculates the secure information likelihood using the pattern information 0413 included in the secure document dictionary (step 0405). The secure information likelihood is an index of how likely the input document is to be a secure document (details are given later). The pattern information 0413 corresponds to part of the pattern information and placement likelihood 0514 in FIG. 3.

Next, the secure document detection device 0100 calculates the secure information likelihood using the placement likelihood information 0414 included in the secure document dictionary (step 0406). The placement likelihood information 0414 corresponds to part of the pattern information and placement likelihood 0514 in FIG. 3.

The calculation of the secure information likelihood is described later (see FIGS. 8 to 9 and equations (1) to (3), among others).

The secure document detection device 0100 determines whether or not the input electronic document is a secure document on the basis of the secure information likelihood calculated in steps 0405 and 0406 (step 0407). For example, the secure document detection device 0100 may determine that the input electronic document is a secure document if the calculated secure information likelihood is greater than a predetermined threshold. The user may set this threshold.

If it is determined in step 0407 that the input electronic document is a secure document (that is, "Yes"), the secure document detection device 0100 locks the input electronic document file 0411 (step 0408). If it is determined that the input electronic document is not a secure document (that is, "No"), the secure document detection device 0100 does not execute step 0408.

次に、セキュア文書検出装置０１００は、電子文書を出力する（ステップ０４０９）。具体的には、セキュア文書検出装置０１００は、ステップ０４０７で「Ｙｅｓ」の場合、ロックされた電子文書を出力し、「Ｎｏ」の場合、ロックされていない電子文書（すなわち入力された電子文書ファイル０４１１そのもの）を出力する。出力された電子文書０４１５（図１の文書ファイル０３１２に相当）は、外部記憶装置０１０３に格納される。さらに、セキュア文書検出装置０１００は、セキュア情報尤度そのものを出力してもよいし、セキュア情報尤度に基づいて決定される危険度（又は要求される保護レベル）を出力してもよい。 Next, the secure document detection device 0100 outputs an electronic document (step 0409). Specifically, if the result of step 0407 is “Yes”, the secure document detection device 0100 outputs the locked electronic document; if the result is “No”, it outputs the unlocked electronic document (that is, the input electronic document file 0411 itself). The output electronic document 0415 (corresponding to the document file 0312 in FIG. 1) is stored in the external storage device 0103. In addition, the secure document detection device 0100 may output the secure information likelihood itself, or a risk level (or required protection level) determined from the secure information likelihood.
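The flow of steps 0407 to 0409 can be sketched in Python as follows. This is only an illustration of the control flow described above, not the device's implementation: the names `OutputDocument`, `lock` and `judge_and_output` are hypothetical, and `lock` is a placeholder that merely tags the bytes where a real system would encrypt the file or restrict access to it.

```python
from dataclasses import dataclass


@dataclass
class OutputDocument:
    """Hypothetical container for what step 0409 outputs."""
    data: bytes
    locked: bool
    likelihood: float  # the likelihood may be output alongside the document


def lock(data: bytes) -> bytes:
    # Placeholder for the real locking step (for example, encryption).
    # The bytes are only tagged here so that the control flow is visible.
    return b"LOCKED:" + data


def judge_and_output(data: bytes, likelihood: float, threshold: float) -> OutputDocument:
    # Step 0407: the document is judged secure when the likelihood is
    # strictly greater than the (user-configurable) threshold.
    if likelihood > threshold:
        # Step 0408: only documents judged secure are locked.
        return OutputDocument(lock(data), True, likelihood)
    # "No" branch: step 0408 is skipped and the input is output unchanged.
    return OutputDocument(data, False, likelihood)
```

Returning the likelihood together with the document mirrors the remark that the device may also output the likelihood itself.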

なお、上記はステップ０４０７において文書がセキュア文書であるか否かを判定する例を示したが、ステップ０４０７においてこのような二値判定の代わりに多値判定が行われてもよい。例えば、セキュア文書検出装置０１００は、算出されたセキュア情報尤度と複数の閾値とを比較することで、セキュア情報尤度のランクを判定してもよい。その場合、判定されたランクに応じて電子文書の出力方法（例えば使用する暗号の強度等）が選択されてもよい。例えば、セキュア文書検出装置０１００は、より高いランクの電子文書ファイル０４１１を暗号化するために、より長い暗号鍵を使用してもよい。 In the above, an example in which it is determined in step 0407 whether or not the document is a secure document has been described. However, in step 0407, multivalue determination may be performed instead of such binary determination. For example, the secure document detection device 0100 may determine the rank of the secure information likelihood by comparing the calculated secure information likelihood with a plurality of threshold values. In this case, an electronic document output method (for example, the strength of encryption used) may be selected according to the determined rank. For example, the secure document detection device 0100 may use a longer encryption key to encrypt the higher-ranked electronic document file 0411.
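The multi-level determination can be sketched as follows. The threshold values, the key lengths and the function names are all assumptions chosen for the example; the text only states that several thresholds may be used and that a higher rank may call for a longer encryption key.

```python
from bisect import bisect_left

# Hypothetical thresholds separating ranks 0..3 (must be sorted ascending).
RANK_THRESHOLDS = [0.25, 0.5, 0.75]
# Hypothetical key length in bits for each rank; rank 0 means "not encrypted".
KEY_BITS_BY_RANK = [0, 128, 192, 256]


def likelihood_rank(likelihood: float) -> int:
    # bisect_left returns the number of thresholds strictly smaller than
    # the likelihood, i.e. how many thresholds the likelihood exceeds.
    return bisect_left(RANK_THRESHOLDS, likelihood)


def key_bits_for(likelihood: float) -> int:
    # A higher rank selects a longer key.
    return KEY_BITS_BY_RANK[likelihood_rank(likelihood)]
```

Using `bisect_left` keeps the comparison strict, consistent with the binary case in which a document is judged secure only when the likelihood is greater than the threshold.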

以下、図４の処理の詳細を説明する。 Hereinafter, details of the processing of FIG. 4 will be described.

図５は、本発明の第１の実施形態のセキュア文書検出装置０１００に入力されるセキュア文書の具体例の説明図である。 FIG. 5 is an explanatory diagram of a specific example of a secure document input to the secure document detection apparatus 0100 according to the first embodiment of this invention.

本発明は、アクセスを制限する必要があるセキュア文書に適用することができる。そのようなセキュア文書の典型例は、自社が作成した自社の機密情報を含む文書、他社から取得した当該他社の機密情報を含む文書、又は顧客等の個人情報を含む文書、等である。このような典型例について説明する。 The present invention can be applied to secure documents in which access needs to be restricted. A typical example of such a secure document is a document including the confidential information of the company created by the company, a document including confidential information of the other company acquired from another company, or a document including personal information of a customer or the like. Such a typical example will be described.

図５（ａ）〜図５（ｃ）は、文書のタイトル及び特定の企業の名称が表示されたセキュア文書の例である。例えば、文書の表紙のタイトルに「設計書」、「仕様書」又は「アライアンス」等の特定の文字列が含まれ、さらに、その表紙に（例えばその文書の作成者又はその文書の配布先として）特定の企業名「××」又は「××製作所」が含まれる。なお、図５に表示されたアンダーライン０６０１は、各文書に表示された特定の文字列及び企業名等を指し示して本実施形態を説明するために表示したものであり、そのアンダーライン０６０１自体が文書に表示されているわけではない。 FIGS. 5(a) to 5(c) are examples of secure documents on which a document title and the name of a specific company are displayed. For example, the title on the cover page of the document contains a specific character string such as “design document”, “specification” or “alliance”, and the cover page also contains a specific company name “XX” or “XX Manufacturing” (for example, as the creator of the document or as its distribution destination). Note that the underlines 0601 shown in FIG. 5 are added only to point out the specific character strings, company names and so on in each document for the purpose of explaining this embodiment; the underlines 0601 themselves do not appear in the documents.

図５（ｄ）〜図５（ｆ）は、ヘッダ等に特定の文字列（例えば企業名）を含み、さらにその文字列の隣に特定の接頭辞又は接尾辞を含むセキュア文書の例である。図５（ｄ）の例では、特定の文字列「（株）××」の隣に特定の接尾辞「作成」が表示される。図５（ｅ）の例では、特定の文字列「××」の隣に特定の接尾辞「ｃｏｎｆｉｄｅｎｔｉａｌ」が表示される。図５（ｆ）の例では、特定の文字列「××」の隣に特定の接尾辞「Ｐｒｅｐａｒｅｄ」が表示される。 FIGS. 5(d) to 5(f) are examples of secure documents that contain a specific character string (for example, a company name) in a header or the like, with a specific prefix or suffix next to that character string. In the example of FIG. 5(d), the specific suffix “created” is displayed next to the specific character string “XX Co., Ltd.”. In the example of FIG. 5(e), the specific suffix “confidential” is displayed next to the specific character string “XX”. In the example of FIG. 5(f), the specific suffix “Prepared” is displayed next to the specific character string “XX”.

図５（ｇ）及び図５（ｈ）は、それぞれ設計図面及び製品仕様書の例である。この種の文書は、必ずしも特定の文字列を含んでいないが、罫線を用いた特定のフォーマットを有する場合が多い。 FIG. 5G and FIG. 5H are examples of design drawings and product specifications, respectively. This type of document does not necessarily include a specific character string, but often has a specific format using ruled lines.

図５（ｉ）は、機密情報を含むことを示す文字列又は図形（例えば、「秘」のような文字を含む印影）が表示された文書の例である。 FIG. 5I is an example of a document in which a character string or a graphic (for example, an imprint including a character such as “secret”) indicating that confidential information is included is displayed.

図５（ｊ）及び図５（ｋ）は、文書中に特定の文字列と特定の接頭辞又は接尾辞とが混在している例を示す。 FIG. 5 (j) and FIG. 5 (k) show an example in which a specific character string and a specific prefix or suffix are mixed in a document.

図５（ｊ）の例では、本文中に「北海道」及びそれに連続して「札幌市」と表示され、フッタに「北海道」及びそれに連続して「製作所」が表示されている。この場合、本文中の「北海道」は単なる地名であるが、フッタの「北海道」は特定の企業名（又はその一部）である。 In the example of FIG. 5 (j), “Hokkaido” and “Sapporo City” are displayed in the text, and “Hokkaido” and “Manufacturing” are displayed in the footer. In this case, “Hokkaido” in the text is merely a place name, but “Hokkaido” in the footer is a specific company name (or part thereof).

図５（ｋ）の例では、本文中に人名を示す特定の文字列「××△△」が表示され、さらにその前後に隣接して文字列「出席」及び「様」が表示されている。 In the example of FIG. 5(k), a specific character string “××△△” representing a person's name is displayed in the body text, and the character strings “出席” (“attendee”) and “様” (an honorific roughly equivalent to “Mr./Ms.”) are displayed immediately before and after it.

本実施形態のセキュア文書検出装置０１００は、入力された文書に含まれるキーワード、そのキーワードが記載された位置、及びその文書のフォーマット等に基づいてこれらのセキュア文書を検出する。 The secure document detection apparatus 0100 according to the present embodiment detects these secure documents based on a keyword included in the input document, a position where the keyword is described, a format of the document, and the like.

図５に示す文書は、例えばセキュア文書例０５０１としてセキュア文書検出装置０１００に入力されてもよいし、非管理文書０５１１（すなわち電子文書ファイル０４１１）としてセキュア文書検出装置０１００に入力されてもよい。 The document shown in FIG. 5 may be input to the secure document detection apparatus 0100 as the secure document example 0501, for example, or may be input to the secure document detection apparatus 0100 as the unmanaged document 0511 (that is, the electronic document file 0411).

例えば、図５（ａ）に示す文書がセキュア文書例０５０１として入力され、さらに、文字列「設計書」及び「××製作所」がセキュア用語定義０５０２として入力された場合、それらの入力に基づいてセキュア辞書学習処理０５０３が実行される。その結果、文字列「設計書」及び「××製作所」がキーワードとしてセキュア文書辞書０５０４に登録される。さらに、それらのキーワードの位置関係（例えばそれらの間の距離及びそれらが配置される方向を表すベクトル）もセキュア文書例０５０１から抽出され、セキュア文書辞書０５０４に登録される。このとき、例えば「××製作所」が主キーワード、「設計書」が補助キーワードとして、それらの組（キーワードペア）が登録されてもよい。 For example, when the document shown in FIG. 5(a) is input as the secure document example 0501 and the character strings “design document” and “XX Manufacturing” are input as the secure term definition 0502, the secure dictionary learning process 0503 is executed based on those inputs. As a result, the character strings “design document” and “XX Manufacturing” are registered in the secure document dictionary 0504 as keywords. In addition, the positional relationship between those keywords (for example, a vector representing the distance between them and the direction in which they are arranged) is extracted from the secure document example 0501 and registered in the secure document dictionary 0504. At this time, the two keywords may be registered as a set (keyword pair), for example with “XX Manufacturing” as the main keyword and “design document” as the auxiliary keyword.
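The registration described above can be pictured with the following Python sketch. The class and field names (`KeywordPair`, `SecureDocumentDictionary`, `register`) and the page coordinates are hypothetical; the only points taken from the text are that a main keyword, an auxiliary keyword and a vector describing their relative position are stored together.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class KeywordPair:
    main: str                     # main keyword, e.g. a company name
    auxiliary: str                # auxiliary keyword, e.g. a document type
    offset: tuple[float, float]   # vector from main to auxiliary keyword (dx, dy)


@dataclass
class SecureDocumentDictionary:
    pairs: list[KeywordPair] = field(default_factory=list)

    def register(self, main: str, main_pos: tuple[float, float],
                 auxiliary: str, aux_pos: tuple[float, float]) -> KeywordPair:
        # The positional relationship is stored as the displacement from the
        # main keyword to the auxiliary keyword in the example document.
        offset = (aux_pos[0] - main_pos[0], aux_pos[1] - main_pos[1])
        pair = KeywordPair(main, auxiliary, offset)
        self.pairs.append(pair)
        return pair


# Example with made-up coordinates: title near the top, company name below it.
dictionary = SecureDocumentDictionary()
registered = dictionary.register("XX Manufacturing", (100.0, 400.0),
                                 "design document", (100.0, 150.0))
```

Storing a displacement rather than two absolute positions means that the same pair can later be matched even if both keywords are shifted together on the page.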

なお、本実施形態では主に会社名「××製作所」のような固有名詞を主キーワード、「設計書」のような普通名詞を補助キーワードとして扱う例を示すが、実際には任意の文字列を主キーワード及び補助キーワードとして登録することができる。例えば、補助キーワード「××製作所」と主キーワード「設計書」とからなるキーワードペアが登録されてもよい。 This embodiment mainly shows examples in which a proper noun such as the company name “XX Manufacturing” is treated as the main keyword and a common noun such as “design document” as the auxiliary keyword, but in practice any character string can be registered as a main keyword or an auxiliary keyword. For example, a keyword pair consisting of the auxiliary keyword “XX Manufacturing” and the main keyword “design document” may be registered.

図５（ａ）に示す文書が非管理文書０５１１（すなわち電子文書ファイル０４１１）として入力された場合、その文書から抽出された複数のキーワード及びそれらの位置関係と、登録されている複数のキーワード及びそれらの位置関係とが参照され、その文書がセキュア文書であるか否かが判定される。 When the document shown in FIG. 5(a) is input as an unmanaged document 0511 (that is, an electronic document file 0411), the keywords extracted from the document and their positional relationships are checked against the registered keywords and their positional relationships, and it is determined whether or not the document is a secure document.

図５（ｂ）〜図５（ｆ）、図５（ｊ）及び図５（ｋ）に示す文書も上記と同様である。すなわち、それらの文書に含まれる会社名、文書タイトル、接頭辞及び接尾辞等の文字列が主キーワード又は補助キーワードとして登録され、それらのキーワードに基づいて入力された文書がセキュア文書であるか否かが判定される。 The same applies to the documents shown in FIGS. 5(b) to 5(f), 5(j) and 5(k). That is, character strings contained in those documents, such as company names, document titles, prefixes and suffixes, are registered as main keywords or auxiliary keywords, and whether or not an input document is a secure document is determined based on those keywords.

なお、図５に示す文書は典型例に過ぎず、本発明はあらゆる種類のセキュア文書に適用することができる。 The document shown in FIG. 5 is merely a typical example, and the present invention can be applied to all types of secure documents.

図６は、本発明の第１の実施形態のセキュア文書検出装置０１００が実行する複数のキーワードの組み合わせに基づくセキュア文書検出の説明図である。 FIG. 6 is an explanatory diagram of secure document detection based on a combination of a plurality of keywords executed by the secure document detection apparatus 0100 according to the first embodiment of this invention.

図６に示す文書０７０１が非管理文書０５１１として入力されると、セキュア文書検出装置０１００は、入力された文書から、自社名を示す主キーワード「××」と補助キーワード「ｃｏｎｆｉｄｅｎｔｉａｌ」とからなるキーワードペア、及び、他社名を示すキーワード「北海道」と補助キーワード「作成」とからなるキーワードペアを抽出する。そして、セキュア文書検出装置０１００は、抽出されたキーワードペア及び各キーワードペアの位置関係を、セキュア文書辞書０５０４に登録された情報と比較することによって、セキュア情報尤度を算出する。 When the document 0701 shown in FIG. 6 is input as an unmanaged document 0511, the secure document detection device 0100 extracts from it a keyword pair consisting of the main keyword “XX”, which indicates the user's own company name, and the auxiliary keyword “confidential”, and a keyword pair consisting of the keyword “Hokkaido”, which indicates another company's name, and the auxiliary keyword “created”. The secure document detection device 0100 then calculates the secure information likelihood by comparing the extracted keyword pairs and the positional relationship within each pair with the information registered in the secure document dictionary 0504.

図７は、本発明の第１の実施形態のセキュア文書検出装置０１００によって識別されるブロックの説明図である。 FIG. 7 is an explanatory diagram of blocks identified by the secure document detection device 0100 according to the first embodiment of this invention.

具体的には、図７には、図４のステップ０４０２において抽出され、ステップ０４０３において並べ替えられたブロックの具体例を示す。 Specifically, FIG. 7 shows a specific example of blocks extracted in step 0402 of FIG. 4 and rearranged in step 0403.

図７（ａ）に示す文書０８２０は、タイトル０８５１、著者名０８５２及び本文０８５３からなる。この文書０８２０が電子文書ファイル０４１１として入力された場合、セキュア文書検出装置０１００は、ブロックＢ１＿０８０１、ブロックＢ２＿０８０２及びブロックＢ３＿０８０３を抽出する（ステップ０４０２）。ブロックＢ１＿０８０１はタイトル０８５１が表示された領域に、ブロックＢ２＿０８０２は著者名０８５２が表示された領域に、ブロックＢ３＿０８０３は本文０８５３が表示された領域に相当する。 A document 0820 shown in FIG. 7A is composed of a title 0851, an author name 0852, and a text 0853. When this document 0820 is input as the electronic document file 0411, the secure document detection apparatus 0100 extracts the block B1_0801, the block B2_0802, and the block B3_0803 (step 0402). Block B1_0801 corresponds to the area where the title 0851 is displayed, block B2_0802 corresponds to the area where the author name 0852 is displayed, and block B3_0803 corresponds to the area where the body text 0853 is displayed.

図７（ｂ）に示す文書０８３０は、本文０８５５及び本文０８５６を含む。この例において本文は段組みされており、本文０８５５及び本文０８５６が各段に相当し、本文０８５６は本文０８５５の次に読まれるべきものである。この文書０８３０が電子文書ファイル０４１１として入力された場合、セキュア文書検出装置０１００は、本文０８５５が表示された領域に相当するブロックＢ５＿０８０５、及び、本文０８５６が表示された領域に相当するブロックＢ６＿０８０６を抽出する（ステップ０４０２）。さらに、セキュア文書検出装置０１００は、本文の読み順と同様、ブロックＢ６＿０８０６がブロックＢ５＿０８０５の後に続くようにこれらのブロックを並べ替える（ステップ０４０３）。 The document 0830 shown in FIG. 7(b) contains body text 0855 and body text 0856. In this example the body text is laid out in columns: body text 0855 and body text 0856 each correspond to one column, and body text 0856 is to be read after body text 0855. When this document 0830 is input as the electronic document file 0411, the secure document detection device 0100 extracts block B5_0805, corresponding to the area in which body text 0855 is displayed, and block B6_0806, corresponding to the area in which body text 0856 is displayed (step 0402). The secure document detection device 0100 then reorders these blocks so that block B6_0806 follows block B5_0805, matching the reading order of the body text (step 0403).

図７（ｃ）に示す文書０８４０は、本文０８５７、本文０８５８、脚注０８５９、ヘッダ０８６０及びフッタ０８６１を含む。この例において本文は段組みされており、本文０８５７及び本文０８５８が各段に相当し、本文０８５８は本文０８５７の次に読まれるべきものである。 The document 0840 shown in FIG. 7(c) contains body text 0857, body text 0858, a footnote 0859, a header 0860 and a footer 0861. In this example the body text is laid out in columns: body text 0857 and body text 0858 each correspond to one column, and body text 0858 is to be read after body text 0857.

この文書０８４０が電子文書ファイル０４１１として入力された場合、セキュア文書検出装置０１００は、ブロックＢ７＿０８０７、ブロックＢ８＿０８０８、ブロックＢ９＿０８０９、ブロックＢ１０＿０８１０及びブロックＢ１１＿０８１１を抽出する（ステップ０４０２）。ブロックＢ７＿０８０７及びブロックＢ８＿０８０８はそれぞれ本文０８５７及び本文０８５８が表示された領域に、ブロックＢ９＿０８０９は脚注０８５９が表示された領域に、ブロックＢ１０＿０８１０及びブロックＢ１１＿０８１１はそれぞれヘッダ０８６０及びフッタ０８６１が表示された領域に相当する。 When this document 0840 is input as the electronic document file 0411, the secure document detection device 0100 extracts block B7_0807, block B8_0808, block B9_0809, block B10_0810 and block B11_0811 (step 0402). Blocks B7_0807 and B8_0808 correspond to the areas in which body text 0857 and body text 0858 are displayed, respectively; block B9_0809 corresponds to the area in which the footnote 0859 is displayed; and blocks B10_0810 and B11_0811 correspond to the areas in which the header 0860 and the footer 0861 are displayed, respectively.

さらに、セキュア文書検出装置０１００は、本文の読み順と同様、ブロックＢ８＿０８０８がブロックＢ７＿０８０７の後に続くようにこれらのブロックを並べ替える（ステップ０４０３）。 Further, the secure document detection device 0100 rearranges these blocks so that the block B8_0808 follows the block B7_0807, similarly to the reading order of the text (step 0403).
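The reordering in step 0403 can be illustrated with a deliberately simple Python sketch. It rests on several assumptions that are not stated in the text: each block is described only by the left and top coordinates of its bounding box, the y coordinate grows downward as in image coordinates, blocks in the same column share the same left edge, and columns are read left to right. The `Block` class and `reading_order` function are hypothetical names, and headers, footers and footnotes are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    left: float   # x coordinate of the block's left edge
    top: float    # y coordinate of the block's top edge (y grows downward)


def reading_order(blocks):
    # Assumed reading order: left column first, then top to bottom
    # within a column. Real layouts need a more robust column detector.
    return sorted(blocks, key=lambda b: (b.left, b.top))


# Two-column body text as in FIG. 7(b), given in arbitrary order.
blocks = [Block("B6", left=300.0, top=50.0), Block("B5", left=50.0, top=50.0)]
ordered_names = [b.name for b in reading_order(blocks)]
```

With these inputs the right-hand column block B6 is placed after the left-hand column block B5, which is the order the text requires.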

図８Ａ〜図８Ｅは、本発明の第１の実施形態のセキュア文書検出装置０１００が実行するキーワード抽出及びセキュア文書判定の具体例を示す説明図である。 8A to 8E are explanatory diagrams illustrating specific examples of keyword extraction and secure document determination executed by the secure document detection device 0100 according to the first embodiment of this invention.

図８Ａの例では、電子文書ファイル０４１１として文書０６１０が入力される。文書０６１０は、図５（ａ）に示したものと同じである。この文書０６１０には文字列「設計書」０６１１及び「××製作所」０６１２が含まれる。例えば、会社名に相当する主キーワード「××製作所」と、補助キーワード「設計書」との組み合わせ（キーワードペア）がキーワード情報０４１２に登録されている場合、セキュア文書検出装置０１００は、ステップ０４０４のキーワード抽出処理によって文字列「設計書」０６１１及び「××製作所」０６１２をそれぞれ補助キーワード０６１３及び主キーワード０６１４として抽出する。 In the example of FIG. 8A, a document 0610 is input as the electronic document file 0411. The document 0610 is the same as the one shown in FIG. 5(a). This document 0610 contains the character strings “design document” 0611 and “XX Manufacturing” 0612. For example, when the combination (keyword pair) of the main keyword “XX Manufacturing”, which corresponds to a company name, and the auxiliary keyword “design document” is registered in the keyword information 0412, the secure document detection device 0100 extracts the character strings “design document” 0611 and “XX Manufacturing” 0612 as the auxiliary keyword 0613 and the main keyword 0614, respectively, by the keyword extraction process of step 0404.

なお、図８Ａの左側の文書０６１０は、入力される文書に実際に表示されている文字等を示す。一方、中央及び右側の文書０６１０は、キーワード抽出処理を説明するための図面である。すなわち、二重線の楕円及び二重線の長方形等の図形、並びに、「会社名」及び「補助ＫＷ」等の文字は、実際に文書０６１０に表示されているものではなく、キーワード抽出処理を説明する便宜上付与したものである。これは、図８Ｂ〜図８Ｅについても同様である。 Note that a document 0610 on the left side of FIG. 8A indicates characters or the like actually displayed in the input document. On the other hand, the center and right side documents 0610 are diagrams for explaining the keyword extraction processing. That is, figures such as double-line ellipses and double-line rectangles, and characters such as “company name” and “auxiliary KW” are not actually displayed in the document 0610, and keyword extraction processing is performed. It is given for convenience of explanation. The same applies to FIGS. 8B to 8E.

さらに、セキュア文書検出装置０１００は、抽出された主キーワード０６１４及び補助キーワード０６１３の位置関係に基づいて、両者の関連の強さを算出し、その関連の強さ等に基づいて、抽出されたキーワードペアが連携キーワードペアであるか否かを判定する。本実施形態では、二つのキーワード間のユークリッド距離、及び、それぞれのキーワードの文脈上の距離に基づいて、両者の関連の強さが算出される。連携キーワードペアの意義については図８Ｃ等を参照して、連携キーワードペアの判定基準については数式（１）等を参照してそれぞれ後述する。 Furthermore, the secure document detection device 0100 calculates the strength of the relationship between the extracted main keyword 0614 and auxiliary keyword 0613 based on their positional relationship and, based on that strength and other factors, determines whether or not the extracted keyword pair is a linked keyword pair. In this embodiment, the strength of the relationship is calculated from the Euclidean distance between the two keywords and the contextual distance between them. The significance of linked keyword pairs is described later with reference to FIG. 8C and elsewhere, and the criterion for deciding whether a pair is a linked keyword pair is described later with reference to equation (1) and elsewhere.

抽出された主キーワード０６１４及び補助キーワード０６１３が連携キーワードペアである場合、それらの位置関係、具体的には位置関係を表すベクトル０６１５が抽出される。このベクトル０６１５は、主キーワード０６１４から補助キーワード０６１３に向かう方向、及び、それらの間の距離を表す。このベクトル０６１５と、セキュア文書辞書０５０４に登録されている主キーワード「××製作所」と補助キーワード「設計書」との位置関係を示すベクトルとの類似度が所定の閾値より高い場合、文書０６１０がセキュア文書であると判定される。 When the extracted main keyword 0614 and auxiliary keyword 0613 form a linked keyword pair, their positional relationship, specifically a vector 0615 representing it, is extracted. This vector 0615 represents the direction from the main keyword 0614 to the auxiliary keyword 0613 and the distance between them. When the similarity between this vector 0615 and the vector registered in the secure document dictionary 0504 for the positional relationship between the main keyword “XX Manufacturing” and the auxiliary keyword “design document” is higher than a predetermined threshold, the document 0610 is determined to be a secure document.
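The comparison between the extracted vector and the registered vector can be sketched as follows. The text does not say how the similarity is computed, so the measure used here, a value that is 1 for identical vectors and decays with the distance between the two vectors, is purely an assumption, as are the function names, the `scale` parameter and the example threshold.

```python
import math


def vector_similarity(v, ref, scale=100.0):
    """Assumed similarity in (0, 1]: 1.0 when v equals ref, smaller as they differ."""
    # Length of the difference between the two displacement vectors.
    dist = math.hypot(v[0] - ref[0], v[1] - ref[1])
    return 1.0 / (1.0 + dist / scale)


def matches_registered(v, ref, threshold=0.8, scale=100.0):
    # The pair supports a "secure document" decision only when the
    # similarity is strictly higher than the threshold.
    return vector_similarity(v, ref, scale) > threshold


registered = (0.0, -250.0)   # hypothetical vector stored in the dictionary
observed = (0.0, -240.0)     # hypothetical vector measured in the input document
is_match = matches_registered(observed, registered)
```

Because the difference of the two vectors is used, both a change in direction and a change in distance between the keywords lower the similarity.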

なお、ユーザが予め学習フェーズ０５００において文書０６１０をセキュア文書例０５０１としてセキュア文書検出装置０１００に入力し、さらに、主キーワード「××製作所」と補助キーワード「設計書」との組み合わせをセキュア用語定義０５０２として入力すれば、主キーワード「××製作所」と補助キーワード「設計書」との組み合わせ、及び、文書０６１０におけるそれらのキーワードの位置関係を示す情報がキーワード情報０４１２としてセキュア文書辞書０５０４に登録される。その後、文書０６１０（又は、文書０６１０と同様に文字列「設計書」及び「××製作所」を含む文書）が入力された場合、セキュア文書検出装置０１００は、キーワード情報０４１２を参照して、入力された文書から上記のように主キーワード０６１４及び補助キーワード０６１３を抽出し、それらに基づいて文書０６１０がセキュア文書か否かを判定することができる。これは、続いて説明する図８Ｂ及び図８Ｃについても同様である。 Note that if the user, in the learning phase 0500, inputs the document 0610 to the secure document detection device 0100 in advance as the secure document example 0501 and also inputs the combination of the main keyword “XX Manufacturing” and the auxiliary keyword “design document” as the secure term definition 0502, then that combination, together with information indicating the positional relationship of those keywords in the document 0610, is registered in the secure document dictionary 0504 as keyword information 0412. When the document 0610 (or a document that, like the document 0610, contains the character strings “design document” and “XX Manufacturing”) is subsequently input, the secure document detection device 0100 refers to the keyword information 0412, extracts the main keyword 0614 and the auxiliary keyword 0613 from the input document as described above, and can determine from them whether or not the document 0610 is a secure document. The same applies to FIGS. 8B and 8C described next.

図８Ｂの例では、電子文書ファイル０４１１として文書０６２０が入力される。文書０６２０は、図５（ｄ）に示したものと同じである。この文書０６２０には文字列「（株）××」０６２１及び「作成」０６２２が含まれる。例えば、会社名に相当する主キーワード「（株）××」と、補助キーワード「作成」との組み合わせがキーワード情報０４１２として登録されている場合、ステップ０４０４のキーワード抽出処理によって文字列「（株）××」０６２１及び「作成」０６２２がそれぞれ主キーワード０６２３及び補助キーワード０６２４として抽出される。この場合も、図８Ａの場合と同様、抽出されたキーワード間の位置関係を示すベクトル０６２５が特定され、それに基づいて文書０６２０がセキュア文書であるか否かが判定される。 In the example of FIG. 8B, a document 0620 is input as the electronic document file 0411. The document 0620 is the same as the one shown in FIG. 5(d). This document 0620 contains the character strings “XX Co., Ltd.” 0621 and “created” 0622. For example, when the combination of the main keyword “XX Co., Ltd.”, which corresponds to a company name, and the auxiliary keyword “created” is registered as keyword information 0412, the character strings “XX Co., Ltd.” 0621 and “created” 0622 are extracted as the main keyword 0623 and the auxiliary keyword 0624, respectively, by the keyword extraction process of step 0404. In this case too, as in FIG. 8A, a vector 0625 indicating the positional relationship between the extracted keywords is identified, and whether or not the document 0620 is a secure document is determined based on it.

図８Ｃの例では、電子文書ファイル０４１１として文書０６３０が入力される。文書０６３０は、図５（ｊ）に示したものと同じである。この文書０６３０には文字列「北海道」０６３１、「製作所」０６３２及び「北海道」０６３３が含まれる。例えば、会社名「北海道製作所」の前半部分に相当する主キーワード「北海道」と、後半部分に相当する補助キーワード「製作所」との組み合わせがキーワード情報０４１２として登録されている場合、ステップ０４０４のキーワード抽出処理によって文字列「北海道」０６３１及び「製作所」０６３２がそれぞれ会社名０６３４を構成する主キーワード０６３５及び補助キーワード０６３６として抽出される。この場合も、図８Ａの場合と同様、抽出されたキーワード間の位置関係を示すベクトル０６３７が特定され、それに基づいて文書０６３０がセキュア文書であるか否かが判定される。 In the example of FIG. 8C, a document 0630 is input as the electronic document file 0411. The document 0630 is the same as the one shown in FIG. 5(j). This document 0630 contains the character strings “Hokkaido” 0631, “Manufacturing” 0632 and “Hokkaido” 0633. For example, when the combination of the main keyword “Hokkaido”, which corresponds to the first half of the company name “Hokkaido Manufacturing”, and the auxiliary keyword “Manufacturing”, which corresponds to the second half, is registered as keyword information 0412, the character strings “Hokkaido” 0631 and “Manufacturing” 0632 are extracted by the keyword extraction process of step 0404 as the main keyword 0635 and the auxiliary keyword 0636 that make up the company name 0634. In this case too, as in FIG. 8A, a vector 0637 indicating the positional relationship between the extracted keywords is identified, and whether or not the document 0630 is a secure document is determined based on it.

なお、文書０６３０には、会社名の前半部分と同一の文字列「北海道」０６３３も含まれている。この場合、「北海道」０６３３と「製作所」０６３２との組み合わせもキーワードペアとして抽出される。しかし、文字列「北海道」０６３３の後に文字列「札幌市」が続いていることからわかるように、この文字列「北海道」０６３３は会社名の一部ではなく単なる地名である。 Note that the document 0630 also contains the character string “Hokkaido” 0633, which is identical to the first half of the company name. In this case, the combination of “Hokkaido” 0633 and “Manufacturing” 0632 is also extracted as a keyword pair. However, as the character string “Sapporo City” that follows it shows, this “Hokkaido” 0633 is not part of the company name but simply a place name.

例えば、「北海道製作所」なる会社が作成した資料のフッタ部分には、例えば図８Ｃに示すように「北海道製作所」という文字列が印刷され、そのような文書をセキュア文書として検出する必要がある場合、ユーザは、主キーワード「北海道」と補助キーワード「製作所」とを含むキーワードペア、及び、それらの位置関係を表すベクトル（例えばベクトル０６３７と同等のベクトル）をセキュア文書辞書０５０４に登録することができる。 For example, suppose that the character string “Hokkaido Manufacturing” is printed in the footer of materials created by a company named “Hokkaido Manufacturing”, as shown in FIG. 8C, and that such documents need to be detected as secure documents. In that case, the user can register in the secure document dictionary 0504 a keyword pair containing the main keyword “Hokkaido” and the auxiliary keyword “Manufacturing”, together with a vector representing their positional relationship (for example, a vector equivalent to the vector 0637).

しかし、その後、文書０６３０が電子文書ファイル０４１１として入力されると、上記のように「北海道」０６３１と「製作所」０６３２との組み合わせだけでなく、「北海道」０６３３と「製作所」０６３２との組み合わせもキーワードペアとして抽出される。この例において、「北海道」０６３１と「製作所」０６３２とは会社名「北海道製作所」の一部であるからそれらの間の関連が強いが、「北海道」０６３３と「製作所」０６３２とはそれぞれ全く異なる文脈に属するからそれらの間に関連はない。このような場合に「北海道」０６３３と「製作所」０６３２との組み合わせについても位置関係を表すベクトルを特定し、そのベクトルとセキュア文書辞書０５０４に登録されたベクトルとを比較しても、その比較はセキュア文書の検出に寄与しない。このため、「北海道」０６３３と「製作所」０６３２との組み合わせをベクトルの比較の対象から除外することが望ましい。 However, when the document 0630 is subsequently input as the electronic document file 0411, not only the combination of “Hokkaido” 0631 and “Manufacturing” 0632 but also the combination of “Hokkaido” 0633 and “Manufacturing” 0632 is extracted as a keyword pair, as described above. In this example, “Hokkaido” 0631 and “Manufacturing” 0632 are both parts of the company name “Hokkaido Manufacturing” and are therefore strongly related, whereas “Hokkaido” 0633 and “Manufacturing” 0632 belong to entirely different contexts and are unrelated. In such a case, identifying a vector representing the positional relationship for the combination of “Hokkaido” 0633 and “Manufacturing” 0632 and comparing it with the vector registered in the secure document dictionary 0504 does not contribute to detecting secure documents. It is therefore desirable to exclude the combination of “Hokkaido” 0633 and “Manufacturing” 0632 from the vector comparison.

本実施形態のセキュア文書検出装置０１００は、抽出されたキーワードペアからさらに、セキュア文書辞書０５０４に登録されたベクトルとの比較の対象とするキーワードペア（以下、連携キーワードペアと記載）を抽出する。抽出されたキーワードペアが連携キーワードペアであるか否かは、そのキーワードペアに含まれる二つのキーワードの関連の強さ、及び、それらのキーワードについて予め定められた重要度等に基づいて判定される。このように抽出された連携キーワードペアの位置関係がセキュア文書辞書０５０４に登録されたベクトルと比較される。 The secure document detection device 0100 of this embodiment further extracts, from the extracted keyword pairs, the keyword pairs to be compared with the vectors registered in the secure document dictionary 0504 (hereinafter referred to as linked keyword pairs). Whether or not an extracted keyword pair is a linked keyword pair is determined based on the strength of the relationship between the two keywords in the pair, the importance defined in advance for those keywords, and so on. The positional relationship of each linked keyword pair extracted in this way is compared with the vectors registered in the secure document dictionary 0504.

例えば、セキュア文書検出装置０１００は、一つの文書から抽出された全てのキーワードペアについてそれらに含まれる二つのキーワードの関連の強さを算出し、その値の順位が所定の閾値より高いものを、連携キーワードペアとして抽出してもよい。あるいは、セキュア文書検出装置０１００は、上記のように算出された関連の強さが所定の閾値を超えるものを連携キーワードペアとして抽出してもよい。「北海道」０６３３と「製作所」０６３２との関連の強さが十分に低ければ、「北海道」０６３３と「製作所」０６３２との組み合わせは連携キーワードペアとして抽出されない。 For example, the secure document detection device 0100 may calculate, for every keyword pair extracted from one document, the strength of the relationship between its two keywords, and extract as linked keyword pairs those whose rank by that value is higher than a predetermined threshold. Alternatively, the secure document detection device 0100 may extract as linked keyword pairs those whose calculated strength exceeds a predetermined threshold. If the strength of the relationship between “Hokkaido” 0633 and “Manufacturing” 0632 is sufficiently low, that combination is not extracted as a linked keyword pair.
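The two selection strategies just described, by rank and by absolute strength, can be sketched in Python. The function names, the representation of a scored pair as a `(pair, strength)` tuple and the example strengths are assumptions made for illustration.

```python
def select_by_threshold(scored_pairs, threshold):
    """Keep pairs whose relationship strength exceeds the threshold."""
    return [pair for pair, strength in scored_pairs if strength > threshold]


def select_top_k(scored_pairs, k):
    """Keep the k pairs with the highest relationship strength."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]


# Hypothetical strengths for the two pairs extracted from document 0630.
scored = [
    (("Hokkaido (footer)", "Manufacturing"), 0.9),  # parts of the company name
    (("Hokkaido (body)", "Manufacturing"), 0.1),    # place name, unrelated context
]
linked_by_threshold = select_by_threshold(scored, 0.5)
linked_by_rank = select_top_k(scored, 1)
```

In both cases the pair that involves the place name in the body text is dropped, so it never reaches the vector comparison.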

さらに、本実施形態のセキュア文書検出装置０１００は、連携キーワードペアとして抽出されるべきでないキーワードを積極的に排除することもできる。 Furthermore, the secure document detection apparatus 0100 of this embodiment can also positively exclude keywords that should not be extracted as linked keyword pairs.

例えば、学習フェーズ０５００において、ユーザは、文字列「北海道」と文字列「札幌市」との組み合わせを、連携キーワードペアとして抽出されるべきでないキーワードペアとしてキーワード情報０４１２に登録してもよい。そのような情報が登録されていれば、文書０６３０が入力された場合、文字列「北海道」０６３３は、文字列「札幌市」との関連が強いものであると判定され、連携キーワードペアとしては抽出されない。 For example, in the learning phase 0500, the user may register the combination of the character string “Hokkaido” and the character string “Sapporo City” in the keyword information 0412 as a keyword pair that should not be extracted as a linked keyword pair. With that information registered, when the document 0630 is input, the character string “Hokkaido” 0633 is determined to be strongly related to the character string “Sapporo City” and is therefore not extracted as part of a linked keyword pair.
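A minimal sketch of such an exclusion rule follows. It assumes, for simplicity, that "strongly related" means the excluded partner string immediately follows the keyword in the text; the actual device judges relatedness from the positional and contextual distances, so this adjacency test and the function name `usable_occurrences` are stand-ins chosen for the example.

```python
def usable_occurrences(text, keyword, excluded_followers):
    """Return start offsets of keyword occurrences not followed by an excluded partner."""
    positions = []
    start = text.find(keyword)
    while start != -1:
        # Text right after this occurrence, ignoring leading whitespace.
        rest = text[start + len(keyword):].lstrip()
        if not any(rest.startswith(f) for f in excluded_followers):
            positions.append(start)
        start = text.find(keyword, start + 1)
    return positions


text = "Meeting in Hokkaido Sapporo City. Prepared by Hokkaido Manufacturing"
kept = usable_occurrences(text, "Hokkaido", ["Sapporo City"])
```

Only the second occurrence of "Hokkaido" survives, so only it can be combined with "Manufacturing" to form a linked keyword pair.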

図８Ｄの例では、電子文書ファイル０４１１として文書０６４０が入力される。文書０６４０は、図５（ｇ）に示したものと同じである。文書０６４０は図面０６４１を含む。図面０６４１は、例えば部品等の図面（図示省略）、図面のタイトル０６４２及び図面の作成年月日０６４３等を含み、それらの要素のレイアウトは罫線０６４４によって定義される。ステップ０４０４の罫線・レイアウト抽出処理によって文書０６４０のレイアウトが特定様式０６４５として抽出される。この特定様式０６４５とパタン情報０４１３とを比較することによって、文書０６４０のセキュア情報尤度を算出することができる（ステップ０４０５）。 In the example of FIG. 8D, a document 0640 is input as the electronic document file 0411. The document 0640 is the same as the one shown in FIG. 5(g). The document 0640 contains a drawing 0641. The drawing 0641 includes, for example, a drawing of a part or the like (not shown), a drawing title 0642 and a drawing creation date 0643, and the layout of these elements is defined by ruled lines 0644. The layout of the document 0640 is extracted as a specific format 0645 by the ruled line / layout extraction process of step 0404. The secure information likelihood of the document 0640 can be calculated by comparing this specific format 0645 with the pattern information 0413 (step 0405).

なお、ユーザが予め学習フェーズ０５００において文書０６４０をセキュア文書例０５０１としてセキュア文書検出装置０１００に入力することによって、特定様式０６４５をパタン情報０４１３としてセキュア文書辞書０５０４に登録することができる。その後、文書０６４０（又は、文書０６４０と同様のレイアウトを有する文書）が入力された場合、セキュア文書検出装置０１００は、パタン情報０４１３を参照して、入力された文書から上記のように特定様式０６４５を抽出することができる。これは、続いて説明する図８Ｅについても同様である。 It should be noted that the specific format 0645 can be registered in the secure document dictionary 0504 as the pattern information 0413 when the user inputs the document 0640 as the secure document example 0501 to the secure document detection device 0100 in advance in the learning phase 0500. Thereafter, when the document 0640 (or a document having a layout similar to that of the document 0640) is input, the secure document detection device 0100 refers to the pattern information 0413 and specifies the specific format 0645 as described above from the input document. Can be extracted. The same applies to FIG. 8E described later.

図８Ｅの例では、電子文書ファイル０４１１として文書０６５０が入力される。文書０６５０は、図５（ｉ）に示したものと同じである。文書０６５０は印影０６５１を含む。印影０６５１は、それが表示された文書が機密情報を含むことを意味する「秘」の文字を含む。ステップ０４０４の罫線・レイアウト抽出処理によってこの印影０６５１が特定様式０６５２として抽出される。この特定様式０６５２とパタン情報０４１３とを比較することによって、文書０６５０のセキュア情報尤度を算出することができる（ステップ０４０５）。 In the example of FIG. 8E, a document 0650 is input as the electronic document file 0411. The document 0650 is the same as the one shown in FIG. 5(i). The document 0650 contains a seal impression 0651. The seal impression 0651 includes the character “秘” (“secret”), which indicates that the document bearing it contains confidential information. This seal impression 0651 is extracted as a specific format 0652 by the ruled line / layout extraction process of step 0404. The secure information likelihood of the document 0650 can be calculated by comparing this specific format 0652 with the pattern information 0413 (step 0405).

次に、図４のステップ０４０６において実行されるセキュア情報尤度算出について説明する。 Next, secure information likelihood calculation executed in step 0406 of FIG. 4 will be described.

抽出された主キーワードｍｗ_i及び補助キーワードｈｗ_jの組み合わせ（ペア）の連携度を示す指標Ｌ_pair（ｍｗ_i，ｈｗ_j）は、次の数式（１）によって算出される。 The index L_pair(mw_i, hw_j), which indicates the degree of cooperation of a combination (pair) of an extracted main keyword mw_i and an extracted auxiliary keyword hw_j, is calculated by the following equation (1).

ここで、Ｄ_BLK（ｍｗ_i，ｈｗ_j）は、主キーワードと補助キーワードとの間の文書型ブロック距離である。文書型ブロック距離とは、言い換えるとすれば、二つのキーワードの文脈中の距離であり、二つのキーワードの文脈上の関連の強さを示す指標である。一般には、二つのキーワードが読まれる順が近ければ、それらの文脈上の関連が強い。例えば二つのキーワードが一つのブロックに属する場合と、それぞれが別のブロックに属する場合との文書型ブロック距離を比較すると、両者におけるキーワード間のユークリッド距離が同じであっても、一般に、後者の文書型ブロック距離は前者の文書型ブロック距離より大きくなる。 Here, D_BLK(mw_i, hw_j) is the document-type block distance between the main keyword and the auxiliary keyword. In other words, the document-type block distance is the distance between the two keywords within the context of the document, and it serves as an index of how strongly the two keywords are related in context. In general, the closer two keywords are in reading order, the stronger their contextual relationship. For example, comparing the case in which two keywords belong to the same block with the case in which they belong to different blocks, the document-type block distance is generally larger in the latter case, even if the Euclidean distance between the keywords is the same in both.

Here, the block distance is an extension, for documents, of the concept of distance used in fields such as image processing. In image processing, block distance generally refers to a family of distance measures that express the distance between two points as a sum of the difference in the X direction and the difference in the Y direction, such as |X| + |Y| or |X + Y| (also called the Manhattan distance). A simple block distance is a uniform measure between any two points on a document and does not reflect the layout structure, such as paragraphs and tables, that carries semantic information. The document type block distance changes the weights of the distance measure so that the strength of the contextual relationship implied by the document structure is reflected.
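The difference between the plain measures and a layout-aware one can be sketched as follows. The function names and the `cross_block_penalty` parameter are hypothetical; the source does not say how the weights are changed, only that a pair spanning two blocks generally ends up farther apart than a pair inside one block.

```python
import math


def block_distance(p, q):
    """Plain block (Manhattan) distance |dx| + |dy| between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def euclidean_distance(p, q):
    """Straight-line distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def document_block_distance(p, q, same_block, cross_block_penalty=2.0):
    """Assumed layout-aware variant: stretch the block distance when the
    two points lie in different layout blocks (paragraph, table, ...)."""
    d = block_distance(p, q)
    return d if same_block else d * cross_block_penalty


a, b = (0, 0), (3, 4)
print(block_distance(a, b), euclidean_distance(a, b))
print(document_block_distance(a, b, same_block=False))
```

With these illustrative settings, two keyword positions at the same geometric distance are treated as twice as far apart when they belong to different blocks.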

α is a weighting coefficient for calculating the degree of cooperation based on the document type block distance. The user can set any value for α, but it is desirable to set a value appropriate to the arrangement of the two keywords. Examples of the value of α are described later with reference to FIG. 9.

D_EUC(mw_i, hw_j) is the Euclidean distance between the main keyword and the auxiliary keyword, that is, the Euclidean distance between the position in the document where the main keyword appears and the position where the auxiliary keyword appears.

β is a weighting coefficient for calculating the degree of cooperation based on the Euclidean distance. The user can set any value for β.

Note that the “+1” in the denominators of the first and second terms on the right-hand side of equation (1) is added to prevent the value from diverging when a distance is zero.

L_word(mw_i, hw_j) is an index representing the importance of keywords (the importance of each keyword or of the keyword pair) and is determined in advance by the user. For example, the user may set a higher value of L_word(mw_i, hw_j) for keyword combinations that include the name of an important customer than for other combinations.

γ is a weighting coefficient for calculating the degree of cooperation based on the importance of the keywords. The user can set any value for γ.

In short, the degree of cooperation between the main keyword mw_i and the auxiliary keyword hw_j becomes higher as the document type block distance D_BLK(mw_i, hw_j) becomes smaller, as the Euclidean distance D_EUC(mw_i, hw_j) becomes smaller, as the predetermined keyword importance becomes higher, and as the values of the weighting coefficients (α, β and γ) become larger.
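The behavior summarized above can be illustrated with a small function. This is a sketch only: it assumes that equation (1) is a weighted sum of 1/(D_BLK + 1), 1/(D_EUC + 1) and L_word, which matches the description of the terms but is not confirmed by the text, and the name `pair_score` and the default weights are hypothetical.

```python
def pair_score(d_blk, d_euc, l_word, alpha=1.0, beta=1.0, gamma=1.0):
    """Assumed form of the degree of cooperation L_pair.

    d_blk  : document type block distance between the two keywords
    d_euc  : Euclidean distance between the two keywords
    l_word : user-defined importance of the keyword pair
    The "+ 1.0" keeps both fractions finite when a distance is zero.
    """
    return alpha / (d_blk + 1.0) + beta / (d_euc + 1.0) + gamma * l_word


# Closer keywords give a higher score for the same importance.
near = pair_score(d_blk=0, d_euc=0, l_word=0.5)
far = pair_score(d_blk=9, d_euc=9, l_word=0.5)
print(near, far)
```

Under this assumed form, each distance term contributes at most its weight (α or β), so the importance term can dominate if γ is chosen large.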

The secure document detection device 0100 may calculate L_word(mw_i, hw_j) for all keyword pairs extracted from the input document and determine that those with large values are linked keyword pairs. Specifically, for example, a keyword pair may be determined to be a linked keyword pair when its L_word(mw_i, hw_j) value exceeds a predetermined threshold. Alternatively, among all the L_word(mw_i, hw_j) values calculated for a document, the keyword pairs whose values rank above a predetermined threshold may be determined to be linked keyword pairs.
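The two selection rules can be sketched as below. The sketch assumes each keyword pair already has a numeric score (here taken to be its degree of cooperation), and it reads "rank above a threshold" as "among the top-ranked pairs"; both readings, and all names and sample values, are assumptions rather than the method defined in the source.

```python
def select_by_threshold(scores, threshold):
    """Keep pairs whose score is strictly greater than the threshold."""
    return [pair for pair, score in scores.items() if score > threshold]


def select_top_ranked(scores, max_rank):
    """Keep the max_rank highest-scoring pairs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:max_rank]]


# Hypothetical scores for (main keyword, auxiliary keyword) pairs.
scores = {
    ("XX Corp", "created"): 1.8,
    ("XX Corp", "specification"): 0.9,
    ("Hokkaido", "Sapporo"): 0.1,
}
print(select_by_threshold(scores, 0.5))
print(select_top_ranked(scores, 1))
```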

The secure information likelihood L_sequre(doc_i) of an input document doc_i (that is, an index of how likely the document is to be a secure document) is calculated by the following equation (2).

Here, dic_j is the j-th document included in the secure document examples 0501 (that is, the document examples 0702) input to the secure document dictionary 0504.

L_format(doc_i, dic_j) is an index of how likely the document doc_i is to be a secure document, based on a comparison between the format of the document doc_i and the format of the document dic_j. Specifically, the higher the similarity between the layout of the document doc_i extracted in step 0404 of FIG. 4 and the layout of the document dic_j, the larger the value of L_format(doc_i, dic_j).

L_keyword(doc_i, dic_j) is an index of how likely the document doc_i is to be a secure document, based on a comparison between the keywords included in the document doc_i and the keywords included in the document dic_j. Specifically, the similarity is calculated between the positional relationship of a keyword pair extracted from the document doc_i in step 0404 of FIG. 4 and the positional relationship of a keyword pair included in the document dic_j (that is, a keyword combination extracted from the secure document examples 0501 or a keyword combination input as the secure term definition 0502), and the higher that similarity, the larger the value of L_keyword(doc_i, dic_j). The method of calculating L_keyword(doc_i, dic_j) is described later (see equation (3)).

L_sequre(doc_i) is the maximum value of L_format(doc_i, dic_j) + L_keyword(doc_i, dic_j) calculated over all documents dic_j.
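The image of equation (2) is not reproduced in this text. Written out from the verbal definition given here, it would read as follows (a transcription of that definition, with the symbol names kept as they appear in the source):

```latex
L_{sequre}(doc_i) = \max_{j} \left[ L_{format}(doc_i, dic_j) + L_{keyword}(doc_i, dic_j) \right]
\tag{2}
```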

Note that the secure information likelihood of the document doc_i may be calculated based only on keyword combinations, regardless of the document layout. In that case, L_format(doc_i, dic_j) need not be calculated, and L_sequre(doc_i) is the maximum value of L_keyword(doc_i, dic_j).
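Both variants of the document-level score (with and without the layout term) can be sketched in one function. The scoring callables, the dictionary representation and the sample numbers are hypothetical placeholders; only the "maximum over all dictionary documents" rule comes from the text.

```python
def secure_likelihood(doc, dictionary_docs, keyword_score, format_score=None):
    """Maximum over dictionary documents of keyword (+ optional format) score.

    keyword_score(doc, dic) and format_score(doc, dic) are caller-supplied
    stand-ins for L_keyword and L_format. Passing format_score=None gives
    the keyword-only variant.
    """
    def total(dic):
        score = keyword_score(doc, dic)
        if format_score is not None:
            score += format_score(doc, dic)
        return score

    return max((total(dic) for dic in dictionary_docs), default=0.0)


# Toy dictionary: each entry carries precomputed scores for one input document.
dictionary_docs = [{"kw": 0.25, "fmt": 0.75}, {"kw": 0.5, "fmt": 0.125}]
kw = lambda doc, dic: dic["kw"]
fmt = lambda doc, dic: dic["fmt"]

print(secure_likelihood("doc", dictionary_docs, kw, fmt))
print(secure_likelihood("doc", dictionary_docs, kw))
```

In this toy data the best-matching dictionary document differs between the two variants, which is why the choice of whether to include the layout term matters.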

L_keyword(doc_i, dic_j) is calculated by equation (3).

Equation (3) calculates the distance between a vector representing the positional relationship of a keyword pair extracted from the document doc_i and a vector representing the positional relationship of a keyword pair included in the document dic_j, and the secure information likelihood is calculated based on that distance. Equation (3) may be evaluated only for the linked keyword pairs rather than for all keyword pairs extracted from the document doc_i. In that case, L_keyword(doc_i, dic_j) is the sum of the above likelihoods calculated for all linked keyword pairs extracted from the document doc_i.

The intent of L_format is to find documents that have keywords similar to those of a document registered in the dictionary, arranged in a similar way. Equation (3) shows that the similarity between keyword pairs can be derived from a simple Euclidean distance.

That is, if a distance measure exists between keywords, between keyword placement positions, and between similarity measures, the distance between two keyword pairs can be calculated by regarding these as a vector, as in equation (3), and computing the Euclidean distance. A distance measure between keywords can be defined, for example, by regarding auxiliary keywords that can indicate the issuer of a document, such as “confidential” and “prepared”, as the same class with a distance of 0, and regarding auxiliary keywords that express an honorific or an addressee, such as “御中” and “宛先”, as a different class at a large distance. Assigning a distance of 0 to keywords with the same part of speech in morphological analysis and 1 otherwise is another possible distance measure. For placement positions, the document type block distance described above is an example.

Furthermore, a likelihood can be introduced here. If a high likelihood is to be given when the relevance L_word of each of the two keyword pairs is high and their arrangements are similar, the above distance converted into the range 0 to 1 can be regarded as the likelihood. In other words, for any calculation formula whose inputs are terms that carry likelihoods, an accompanying likelihood can be calculated.
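The image of equation (3) is not reproduced in this text, so the following is only a sketch of the idea described: treat the component distances between two keyword pairs as a vector, take its Euclidean norm, and map the result into the range 0 to 1. The three components and the mapping 1 / (1 + d) are assumptions chosen for illustration; the source does not specify the conversion.

```python
import math


def pair_distance(keyword_dist, placement_dist, score_dist):
    """Euclidean norm of the component distances between two keyword pairs.

    keyword_dist   : e.g. 0 if the keywords are of the same class, else 1
    placement_dist : e.g. difference in document type block distance
    score_dist     : e.g. difference between the two pairs' relevance values
    """
    return math.sqrt(keyword_dist ** 2 + placement_dist ** 2 + score_dist ** 2)


def distance_to_likelihood(distance):
    """Assumed monotone mapping from [0, inf) onto (0, 1]."""
    return 1.0 / (1.0 + distance)


d = pair_distance(keyword_dist=0, placement_dist=3, score_dist=4)
print(d, distance_to_likelihood(d))
```

Identical pairs (distance 0) get likelihood 1, and the likelihood falls toward 0 as the pairs differ more in class, placement or relevance.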

FIG. 9 is an explanatory diagram of the arrangement cost table 0900 included in the secure document dictionary 0504 according to the first embodiment of the present invention.

The arrangement cost table 0900 is a table that associates the positions in a document of the two keywords of a keyword pair extracted from that document (that is, the regions of the document from which the keywords were extracted) with the weighting coefficient α of equation (1). Specifically, the arrangement cost table 0900 consists of rows 0901 to 0903 and columns 0911 to 0913.

Row 0901 holds the values of the weighting coefficient α for the case where the main keyword is extracted from a table in the document. Row 0902 holds the values of α for the case where the main keyword is extracted from the body or the title of the document. Row 0903 holds the values of α for the case where the main keyword is extracted from the header or footer of the document.

Column 0911 holds the values of the weighting coefficient α for the case where the auxiliary keyword is extracted from a table in the document. Column 0912 holds the values of α for the case where the auxiliary keyword is extracted from the body of the document. Column 0913 holds the values of α for the case where the auxiliary keyword is extracted from the header or footer of the document.

In the example of FIG. 9, column 0912 corresponds to the body, but column 0912 may instead correspond to the case where the auxiliary keyword is extracted from the body or the title. The classification of regions described above is only an example. For example, when a document is laid out in multiple columns, each column may be treated as an independent region. Alternatively, the header and the footer may each be treated as an independent region.

In the example of FIG. 9, when both the main keyword and the auxiliary keyword are extracted from a table in the document, the value of the weighting coefficient α is “α11”. When the main keyword is extracted from the title or the body and the auxiliary keyword is extracted from a table, the value of α is “α21”.

The user can register any value as the weighting coefficient α in the arrangement cost table 0900. In general, however, when the main keyword and the auxiliary keyword are extracted from the same region of the document (for example, the title, the body, a table, the header or the footer), the contextual distance between them is presumed to be shorter than otherwise. As described later, the shorter the contextual distance between two keywords, the more likely they are strongly related. For this reason, α is typically set larger when the main keyword and the auxiliary keyword are extracted from the same region than when they are not. For example, the value of α11 is typically larger than the value of α21.
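The table of FIG. 9 can be pictured as a lookup keyed by the regions of the two keywords. In this sketch the region labels and every numeric value are illustrative assumptions (the source leaves all values to the user); the entries the source does not name, such as α12, follow the same row-column pattern as α11 and α21.

```python
# Illustrative values only: same-region entries are set larger than the rest.
ARRANGEMENT_COST = {
    # (main keyword region, auxiliary keyword region): alpha
    ("table", "table"): 1.0,                  # alpha11
    ("table", "body"): 0.3,                   # alpha12
    ("table", "header_footer"): 0.3,          # alpha13
    ("body", "table"): 0.3,                   # alpha21
    ("body", "body"): 1.0,                    # alpha22
    ("body", "header_footer"): 0.3,           # alpha23
    ("header_footer", "table"): 0.3,          # alpha31
    ("header_footer", "body"): 0.3,           # alpha32
    ("header_footer", "header_footer"): 1.0,  # alpha33
}


def lookup_alpha(main_region, aux_region, table=ARRANGEMENT_COST):
    """Return alpha for the regions the two keywords were extracted from."""
    # Row 0902 covers both the body and the title for the main keyword.
    if main_region == "title":
        main_region = "body"
    return table[(main_region, aux_region)]


print(lookup_alpha("table", "table"), lookup_alpha("title", "table"))
```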

When the main keyword and the auxiliary keyword are extracted from the same region, the value of the weighting coefficient α may further be determined according to the positional relationship between the keywords.

Specifically, when the main keyword is extracted from the body or the title and the auxiliary keyword is extracted from the body, the weighting coefficient α is further determined according to the positions of the main keyword and the auxiliary keyword within the title and the body.

For example, when the main keyword and the auxiliary keyword are words adjacent to each other, the value of the weighting coefficient α is “α22a”. When they are not adjacent but are on the same line, the value of α is “α22b”. When they are on different lines but in the same paragraph, the value of α is “α22c”. When they are in different paragraphs, the value of α is “α22d”.

The user can set these values arbitrarily. However, the contextual distance between two keywords is shorter when they are in the same paragraph than in different paragraphs, shorter when they are on the same line than on different lines, and shorter when they are adjacent than when they are not. The shorter the contextual distance, the more likely the keywords are strongly related.

For example, when the main keyword “（株）××” (a company name, “×× Co., Ltd.”) and the auxiliary keyword “作成” (“created”) are extracted from the continuous character string “（株）××作成”, these keywords are adjacent to each other. In this case, the contextual meanings of “（株）××” and “作成” are usually related. Specifically, the character string means that the company “（株）××” “created” something, and, as in the example of FIG. 5(d), the document containing those keywords may itself have been created by the company “（株）××”.

On the other hand, even when the main keyword “（株）××” (a company name) and the auxiliary keyword “作成” (“created”) are extracted from different paragraphs, the keywords may still be related to each other. However, the auxiliary keyword “作成” may instead have been extracted from a character string such as “○○製作所作成” (“created by ○○ Seisakusho”). In that case, the extracted keyword combination does not mean that “（株）××” “created” something. That is, there is no contextual relationship between “（株）××” and “作成”, and the document containing those keywords is unlikely to have been created by the company “（株）××”.

When documents created by the company “（株）××” need to be judged as secure documents, it is desirable to extract the keyword pair consisting of the main keyword “（株）××” and the auxiliary keyword “作成” (“created”) as a linked keyword pair. Considering the examples above, it is desirable that the degree of cooperation be calculated higher when the main keyword “（株）××” and the auxiliary keyword “作成” are adjacent than when they are not. For this reason, the value of the weighting coefficient α is typically set so that α22c is larger than α22d, α22b is larger than α22c, and α22a is larger than α22b.
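The four body-text cases α22a to α22d can be sketched as a classification followed by a lookup. The position tuple `(paragraph, line, word_index)` and the numeric values are hypothetical; only the four cases and their typical ordering come from the text.

```python
# Illustrative values respecting the typical ordering a > b > c > d.
ALPHA_22 = {
    "adjacent": 1.0,             # alpha22a
    "same_line": 0.7,            # alpha22b
    "same_paragraph": 0.4,       # alpha22c
    "different_paragraph": 0.1,  # alpha22d
}


def body_relation(main_pos, aux_pos):
    """Classify two keyword positions given as (paragraph, line, word_index)."""
    if main_pos[:2] == aux_pos[:2]:  # same paragraph and same line
        if abs(main_pos[2] - aux_pos[2]) == 1:
            return "adjacent"
        return "same_line"
    if main_pos[0] == aux_pos[0]:
        return "same_paragraph"
    return "different_paragraph"


def alpha_for_body_pair(main_pos, aux_pos):
    """Weighting coefficient for a pair where both keywords are in body text."""
    return ALPHA_22[body_relation(main_pos, aux_pos)]


print(alpha_for_body_pair((0, 0, 3), (0, 0, 4)))  # adjacent words
print(alpha_for_body_pair((0, 0, 3), (1, 0, 1)))  # different paragraphs
```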

In practice, however, there are cases where the degree of cooperation between a main keyword and an auxiliary keyword extracted from regions far from each other should be calculated high. For example, in the example of FIG. 5(b), the main keyword “××” is extracted from the footer and the auxiliary keyword “仕様書” (“specification”) is extracted from the title. To extract a keyword pair arranged in this way as a linked keyword pair, the value of α32 in the arrangement cost table 0900 corresponding to that keyword pair may be set larger than the other values. In that case, however, column 0912 corresponds to the title as well as the body.

When both the main keyword and the auxiliary keyword are extracted from tables, the weighting coefficient α is likewise further determined according to the positions of the main keyword and the auxiliary keyword within the tables, as in the case of α22 described above.

For example, when the main keyword and the auxiliary keyword are extracted from cells adjacent to each other in a table, the value of the weighting coefficient α is “α11a”. When they are extracted from the same table (but not from adjacent cells), the value of α is “α11b”. When they are extracted from different tables, the value of α is “α11c”. As in the case of α22, the user can set these values arbitrarily. For example, for the same reason as in the case of α22, the values may be set so that α11b is larger than α11c and α11a is larger still than α11b.

Similarly, when both the main keyword and the auxiliary keyword are extracted from the header or footer, the weighting coefficient α is further determined according to the positions of the main keyword and the auxiliary keyword within the header or footer.

For example, when the main keyword and the auxiliary keyword are extracted from the same line, the value of the weighting coefficient α is “α33a”, and when they are extracted from different lines, the value of α is “α33b”. As in the case of α22, the user can set these values arbitrarily. For example, for the same reason as in the case of α22, α33a may be set larger than α33b.

The secure document dictionary 0504 may include a plurality of arrangement cost tables 0900. For example, as shown in FIG. 8C, suppose that the character string “北海道製作所” (“Hokkaido Seisakusho”) should be extracted as a linked keyword pair, but the “北海道” contained in the character string “北海道札幌市” (“Sapporo City, Hokkaido”) should not be included in a linked keyword pair. In that case, an additional arrangement cost table 0900 may be created for the combination of “北海道” and “札幌市” (or the names of other municipalities in Hokkaido), and the value of α22a in that table may be set smaller (for example, “0”) than the value given to the arrangements of keyword pairs that should be extracted as linked keyword pairs. The degree of cooperation for a character string such as “北海道札幌市” then becomes low, which suppresses the influence of such character strings on the secure information likelihood determination of the document.
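One way to picture a pair-specific table is as an override consulted before the general table. The names and the default value below are hypothetical, and only the adjacent-word coefficient α22a is modeled; pairing “北海道” with “製作所” as main and auxiliary keywords is also an assumption made for the example.

```python
DEFAULT_ALPHA_22A = 1.0  # illustrative value from the general table

# Pair-specific tables, reduced here to their alpha22a entry.
PAIR_SPECIFIC_ALPHA_22A = {
    ("北海道", "札幌市"): 0.0,  # an address, not a document issuer
}


def alpha_adjacent(main_kw, aux_kw):
    """alpha22a for an adjacent pair, preferring a pair-specific table."""
    return PAIR_SPECIFIC_ALPHA_22A.get((main_kw, aux_kw), DEFAULT_ALPHA_22A)


print(alpha_adjacent("北海道", "製作所"))  # general table applies
print(alpha_adjacent("北海道", "札幌市"))  # suppressed by the specific table
```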

FIG. 10 is an explanatory diagram of the secure document dictionary 0504 according to the first embodiment of the present invention.

The secure document dictionary 0504 includes a secure dictionary header 1001, a plurality of keywords 1011 and the like, one or more arrangement cost tables 1021 and the like, and one or more specific forms 1031 and the like.

The secure dictionary header 1001 includes information indicating the version of the secure document dictionary 0504 and information describing the contents of the dictionary.

Each of the keywords 1011 and the like includes a character string designated as a keyword and additional information about that keyword. The additional information includes information indicating whether the keyword is a main keyword or an auxiliary keyword, information specifying the main keyword or auxiliary keyword with which it is combined, information indicating the part of speech of the keyword (for example, a proper noun such as a company name, or a common noun such as “秘” (“secret”)), information indicating the importance of the keyword, and the like. The additional information may further include information specifying keyword combinations that should not affect the secure information likelihood determination, as in the “北海道札幌市” example above.

Furthermore, each of the keywords 1011 and the like may include vector data indicating the positional relationship of a keyword pair consisting of a main keyword and an auxiliary keyword. This vector data is compared, for example in step 0406 of FIG. 4, with vector data indicating the positional relationship of a keyword pair extracted from the input document.

FIG. 10 shows keyword 1_1011 and keyword 2_1012 as examples of the keywords 1011 and the like, but the secure document dictionary 0504 may include more keywords.

Each of the arrangement cost tables 1021 and the like corresponds to the arrangement cost table 0900 described with reference to FIG. 9, and includes information indicating the types of main keyword and auxiliary keyword to which the table applies and information indicating their weights (importance). As described with reference to FIG. 9, a plurality of arrangement cost tables 0900 may be created. For example, an arrangement cost table 1021 or the like may be created for each type of keyword pair. Alternatively, an arrangement cost table 1021 or the like may be created for a specific keyword pair only.

FIG. 10 shows arrangement cost table 1_1021 and arrangement cost table 2_1022 as examples of the arrangement cost tables 1021 and the like, but the secure document dictionary 0504 may include more arrangement cost tables.

Each of the specific forms 1031 and the like includes the method and range for extracting the form of a document from that document (specifically, vector data corresponding to a specific format, figure or the like, as shown in FIGS. 8D and 8E), and form vector data to be compared with the extracted vector data (that is, vector data such as ruled lines or seal impressions extracted in advance from the secure document examples 0501 and registered). FIG. 10 shows specific form 1_1031 and specific form 2_1032 as examples of the specific forms 1031 and the like, but the secure document dictionary 0504 may include more specific forms.
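The dictionary layout of FIG. 10 can be sketched as a set of plain data classes. All class and field names below are hypothetical, chosen to mirror the components listed in the text (header, keywords, arrangement cost tables, specific forms); the source does not define a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Keyword:
    text: str
    is_main: bool                                       # main or auxiliary
    partners: List[str] = field(default_factory=list)   # keywords it pairs with
    part_of_speech: str = ""
    importance: float = 1.0
    pair_vector: Optional[Tuple[float, ...]] = None     # positional relationship


@dataclass
class ArrangementCostTable:
    main_kind: str                                       # keyword types covered
    aux_kind: str
    alpha: Dict[Tuple[str, str], float] = field(default_factory=dict)


@dataclass
class SpecificForm:
    extraction_method: str
    extraction_range: Tuple[float, float, float, float]
    form_vector: Tuple[float, ...]                       # registered ruled lines, seals


@dataclass
class SecureDocumentDictionary:
    version: str
    description: str
    keywords: List[Keyword] = field(default_factory=list)
    cost_tables: List[ArrangementCostTable] = field(default_factory=list)
    specific_forms: List[SpecificForm] = field(default_factory=list)


dictionary = SecureDocumentDictionary(version="1.0", description="example")
dictionary.keywords.append(Keyword(text="秘", is_main=True, part_of_speech="common noun"))
dictionary.cost_tables.append(
    ArrangementCostTable("company", "verb", {("table", "table"): 1.0})
)
print(len(dictionary.keywords), len(dictionary.cost_tables))
```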

The first embodiment described above shows an example in which, for a keyword pair consisting of two keywords extracted from a document, the extracted positional relationship is compared with a positional relationship registered in advance. However, the same processing may be executed for a group of three or more keywords. For example, three keywords and vector data indicating their mutual positional relationship may be registered in the secure document dictionary 0504. In that case, those three keywords and vector data indicating their mutual positional relationship are extracted from the input document. Whether the input document is a secure document is then determined based on the similarity between the extracted vector data and the registered vector data.

According to the first embodiment of the present invention described above, whether an input document is secure is determined automatically based on the keywords and the like contained in the document, and a secure document can be stored after encryption or similar processing. In particular, according to this embodiment, secure document determination takes into account the contextual relationship among multiple keywords, based on combinations of keywords and the contextual distances within those combinations. Whether a keyword combination extracted from a document is used for secure document determination is decided based on the degree of cooperation of that combination. This reduces both missed detections of documents that should be secure and false detections of documents that are not, and realizes highly accurate secure document determination. As a result, secure documents can be reliably protected while the management cost borne by the user is kept low.

<Second Embodiment>
FIG. 11 is a block diagram illustrating the hardware configuration of the OCR integrated secure document detection device 0200 according to the second embodiment of the present invention.

The OCR integrated secure document detection device 0200 is an example of a device that realizes the OCR integrated paper document management of the present invention shown in FIG. 1.

The OCR integrated secure document detection device 0200 of this embodiment includes an operation terminal device 0201, a display terminal device 0202, an external storage device 0203, a memory 0204, a central processing unit 0205, a communication device 0207, an image capturing device 0208, a sorter device 0209, and a communication line 0206 that interconnects them.

The operation terminal device 0201, the display terminal device 0202, the external storage device 0203, the memory 0204, the central processing unit 0205, the communication device 0207, and the communication line 0206 are the same as the operation terminal device 0101, the display terminal device 0102, the external storage device 0103, the memory 0104, the central processing unit 0105, the communication device 0107, and the communication line 0106 of the first embodiment, respectively, so a detailed description of them is omitted.

The image capturing device 0208 includes an optical scanner that reads the characters, ruled lines, graphics, and the like on the input paper document 0306 and converts them into data. The data read at this time may be stored in the external storage device 0203 as a file containing text data and image data.

The sorter device 0209 is a device that discharges the paper document 0306 after the image capturing device 0208 has finished reading it. For example, the sorter device 0209 may include a plurality of shelves as discharge destinations for the paper document 0306. In this case, the sorter device 0209 can discharge the paper document 0306 to a shelf selected as necessary.

Note that the OCR-integrated secure document detection device 0200 may be realized by adding a conventional OCR device, serving as the image capturing device 0208 and the sorter device 0209, to the secure document detection device 0100 of the first embodiment. In that case, for example, the image capturing device 0208 and the sorter device 0209 in FIG. 11 correspond to the OCR device 0307 in FIG. 1, and the remaining portion of FIG. 11 corresponds to the computer 0308 in FIG. 1. Alternatively, the entire OCR-integrated secure document detection device 0200 may be realized as a single OCR device.

The central processing unit 0205 of the OCR-integrated secure document detection device 0200 uses the data read by the image capturing device 0208 to determine whether the input document is a secure document. This processing is the same as in the first embodiment described with reference to FIGS. 3 to 10, so its description is omitted.

Furthermore, based on the determination result of step 0407, the OCR-integrated secure document detection device 0200 of this embodiment can not only output the secured electronic document 0415 but also select how the paper document 0306 is discharged after the image capturing device 0208 has finished reading it. The purpose is to protect the secure information contained in the paper document 0306 from leaking when the paper document 0306 is a secure document.

For example, the OCR-integrated secure document detection device 0200 may discharge a paper document 0306 determined to be a secure document to a position different from that of a paper document 0306 that is not. Here, a "different position" may be, for example, a "different shelf" or a "different position within the same shelf".

Alternatively, the OCR-integrated secure document detection device 0200 may process a paper document 0306 determined to be a secure document before discharging it. Here, "processing" may mean printing a marking indicating that the paper document 0306 is a secure document (for example, characters such as "Secret"), printing a predetermined graphic pattern or the like that makes the characters on the paper document 0306 difficult to read, or shredding the paper document 0306 so that those characters cannot be read at all. In this case, the sorter device 0209 includes a document processing device such as a printing device or a shredder.

Alternatively, the OCR-integrated secure document detection device 0200 may discharge a paper document 0306 determined not to be a secure document to a normal shelf and not discharge a paper document 0306 determined to be a secure document. In this case, the paper document 0306 determined to be a secure document is held inside the OCR-integrated secure document detection device 0200 until it is taken out by a user with appropriate authority.
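The handling options described above can be summarized in a minimal Python sketch. It is illustrative only and not part of the disclosed device: the names `SecureHandling` and `choose_discharge` and the string labels are hypothetical, and the sketch assumes that one handling option is configured in advance for documents judged to be secure.

```python
from enum import Enum


class SecureHandling(Enum):
    """Hypothetical names for the handling options described in the text."""
    SEPARATE_POSITION = "separate_position"  # different shelf or position
    PRINT_MARK = "print_mark"                # print a marking such as "Secret"
    PRINT_PATTERN = "print_pattern"          # overprint a pattern to hinder reading
    SHRED = "shred"                          # make the characters unreadable
    RETAIN = "retain"                        # hold inside the device


def choose_discharge(is_secure: bool, handling: SecureHandling) -> str:
    """Return a label for how the sheet leaves (or stays in) the device."""
    if not is_secure:
        # Documents judged not secure always go to the normal shelf.
        return "normal_shelf"
    # Secure documents follow the configured handling option.
    return handling.value
```

Keeping the binary decision separate from the handling option mirrors the text, where the result of step 0407 only selects whether the special handling applies.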

The above describes an example in which step 0407 determines whether or not the document is a secure document, but a multi-level determination may be performed in step 0407 instead of this binary determination. For example, the OCR-integrated secure document detection device 0200 may determine the rank of the secure information likelihood by comparing the calculated secure information likelihood with a plurality of thresholds. In that case, the method of discharging the paper document 0306 may be selected according to the determined rank. For example, the OCR-integrated secure document detection device 0200 may shred and discharge paper documents 0306 of the highest rank and discharge paper documents 0306 of the other ranks to the shelves assigned to their respective ranks.
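The multi-level determination can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold values, the shelf labels, and the function names `likelihood_rank` and `select_discharge` are hypothetical and do not appear in the specification. The rank is taken to be the number of thresholds that the secure information likelihood exceeds.

```python
from bisect import bisect_left

# Hypothetical thresholds on the secure information likelihood (ascending).
RANK_THRESHOLDS = [0.3, 0.6, 0.9]

# Hypothetical discharge method per rank; the highest rank is shredded.
DISCHARGE_BY_RANK = {0: "shelf_0", 1: "shelf_1", 2: "shelf_2", 3: "shred"}


def likelihood_rank(likelihood, thresholds=RANK_THRESHOLDS):
    """Return how many thresholds are strictly below the likelihood."""
    return bisect_left(thresholds, likelihood)


def select_discharge(likelihood):
    """Map a secure information likelihood to a discharge method."""
    return DISCHARGE_BY_RANK[likelihood_rank(likelihood)]
```

With these assumed values, a likelihood exactly equal to a threshold does not count as exceeding it, which matches the "greater than a predetermined threshold" wording used for the binary case.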

According to the second embodiment of the present invention described above, the same highly accurate secure document determination as in the first embodiment is achieved for documents printed on paper, so that secure electronic documents can be reliably protected. Furthermore, by linking the result of the secure document determination to the discharge of the input paper, secure paper documents can be reliably protected while keeping the user's management cost low.

0100 Secure document detection device
0101, 0201 Operation terminal device
0102, 0202 Display terminal device
0103, 0203 External storage device
0104, 0204 Memory
0105, 0205 Central processing unit
0106, 0206 Communication line
0107, 0207 Communication device
0200 OCR-integrated secure document detection device
0208 Image capturing device
0209 Sorter device
0301, 0306 Paper document
0302 Optical character reader (OCR device)
0303, 0305, 0309, 0310, 0312 Document file
0304, 0308, 0311 Computer
0501 Secure document example
0502 Secure term definition
0504 Secure document dictionary
0511 Unmanaged document
0900 Placement cost table

Claims

A secure document detection method executed by a secure document detection device, wherein
the secure document detection device includes an arithmetic device and a storage device that holds a dictionary,
a plurality of keyword pairs, each including at least two keywords, and information indicating the positional relationship in a document between the two keywords included in each keyword pair are registered in the dictionary, and
the secure document detection method comprises:
a first procedure of extracting, from input document data, a keyword pair registered in the dictionary; and
a second procedure of determining whether or not the input document data is a secure document based on the positional relationship, in the input document data, between the two keywords included in the extracted keyword pair.

The secure document detection method according to claim 1, wherein the second procedure includes:
a third procedure of determining whether or not the extracted keyword pair is a cooperative keyword pair based on the strength of association between the two keywords included in the extracted keyword pair;
a fourth procedure of calculating a secure information likelihood of the input document data such that the likelihood increases as the similarity between the positional relationship of the two keywords included in the cooperative keyword pair and the positional relationship of the two keywords registered in the dictionary increases; and
a fifth procedure of determining that the input document data is a secure document when the secure information likelihood is greater than a predetermined threshold.

The secure document detection method according to claim 2, further comprising a procedure of classifying text included in the input document data into a plurality of regions,
wherein the third procedure includes:
a sixth procedure of calculating the strength of association between the two keywords based on the regions that include the two keywords; and
a seventh procedure of determining that the extracted keyword pair is a cooperative keyword pair when the strength of association exceeds a predetermined threshold, or when the rank of the strength of association within the input document data exceeds a predetermined threshold.

The secure document detection method according to claim 3, wherein the plurality of regions include at least one of a title, body text, a table, a header, or a footer,
the dictionary further includes placement cost information that associates each combination of the regions with a predetermined weight, and
the sixth procedure includes a procedure of calculating the strength of association between the two keywords such that it increases as the predetermined weight corresponding to the combination of regions to which the two keywords belong increases.

The described secure document detection method, wherein keyword pairs that should not be extracted as cooperative keyword pairs are further registered in the dictionary, and
in the placement cost information, a value smaller than the weights corresponding to other keyword pairs is registered as the weight corresponding to a keyword pair that should not be extracted as a cooperative keyword pair.

The secure document detection method according to claim 3, further comprising a procedure of rearranging text included in the input document data into the order in which the text is read,
wherein the sixth procedure includes a procedure of calculating a contextual distance between the two keywords based on the order in which the text is read, and calculating the strength of association between the two keywords such that it increases as the calculated contextual distance decreases.

The secure document detection method according to claim 3, wherein the sixth procedure includes a procedure of calculating the strength of association between the two keywords such that it increases as the Euclidean distance between the two keywords in the input document data becomes shorter.

The secure document detection method according to claim 1, further comprising a procedure of encrypting and outputting the input document data when it is determined that the input document data is a secure document.

The secure document detection method according to claim 1, further comprising a procedure of, when a document to be determined to be a secure document and information specifying two keywords included in that document are input, extracting information indicating the positional relationship between the two specified keywords and registering the specified keywords and the extracted information indicating the positional relationship in the dictionary.

The described secure document detection method, wherein the information indicating the positional relationship between the two keywords included in a keyword pair includes vector data indicating the direction of the arrangement of the two keywords in the document and the distance between them.

A secure document detection program executed by a computer, wherein
the computer includes an arithmetic device, a memory that stores the secure document detection program, and a storage device that holds a dictionary,
a plurality of keyword pairs, each including at least two keywords, and information indicating the positional relationship in a document between the two keywords included in each keyword pair are registered in the dictionary, and
the secure document detection program causes the computer to execute:
a first procedure of extracting, from input document data, a keyword pair registered in the dictionary; and
a second procedure of determining whether or not the input document data is a secure document based on the positional relationship, in the input document data, between the two keywords included in the extracted keyword pair.

The secure document detection program according to claim 11, wherein the second procedure includes:
a third procedure of determining whether or not the extracted keyword pair is a cooperative keyword pair based on the strength of association between the two keywords included in the extracted keyword pair;
a fourth procedure of calculating a secure information likelihood of the input document data such that the likelihood increases as the similarity between the positional relationship of the two keywords included in the cooperative keyword pair and the positional relationship of the two keywords registered in the dictionary increases; and
a fifth procedure of determining that the input document data is a secure document when the secure information likelihood is greater than a predetermined threshold.

The secure document detection program according to claim 12, further causing the computer to execute a procedure of classifying text included in the input document data into a plurality of regions,
wherein the third procedure includes:
a sixth procedure of calculating the strength of association between the two keywords based on the regions that include the two keywords; and
a seventh procedure of determining that the extracted keyword pair is a cooperative keyword pair when the strength of association exceeds a predetermined threshold, or when the rank of the strength of association within the input document data exceeds a predetermined threshold.

The secure document detection program according to claim 13, wherein the plurality of regions include at least one of a title, body text, a table, a header, or a footer,
the dictionary further includes placement cost information that associates each combination of the regions with a predetermined weight, and
the sixth procedure includes a procedure of calculating the strength of association between the two keywords such that it increases as the predetermined weight corresponding to the combination of regions to which the two keywords belong increases.

The secure document detection program according to claim 11, further causing the computer to execute a procedure of encrypting and outputting the input document data when it is determined that the input document data is a secure document.

The secure document detection program according to claim 11, further causing the computer to execute a procedure of, when a document to be determined to be a secure document and information specifying two keywords included in that document are input, extracting information indicating the positional relationship between the two specified keywords and registering the specified keywords and the extracted information indicating the positional relationship in the dictionary.

An optical character reader that reads character information from an input paper document, comprising:
an arithmetic device, a storage device that holds a secure information dictionary, an image capturing device that reads the input paper document, and a paper discharge device that discharges the input paper document, wherein
a plurality of keyword pairs, each including two keywords, and information indicating the positional relationship in a document between the two keywords included in each keyword pair are registered in the secure information dictionary, and
the optical character reader:
creates document data by reading character information from the input paper document;
extracts, from the document data, a keyword pair registered in the secure document dictionary;
determines whether or not the input paper document is a secure document based on the positional relationship, in the document data, between the two keywords included in the extracted keyword pair; and
controls the method of discharging the input paper document in accordance with the result of the determination.

The optical character reader according to claim 17, wherein the paper discharge device includes a plurality of shelves, and
the optical character reader discharges the input paper document such that a paper document determined to be a secure document and a paper document determined not to be a secure document are discharged to different shelves.

The optical character reader according to claim 17, wherein the paper discharge device includes a printing device, and
when it is determined that the input paper document is a secure document, the optical character reader prints, on the input paper document, a marking indicating that the input paper document is a secure document, and discharges the printed paper document.

The optical character reader according to claim 17, wherein the paper discharge device includes a processing device that processes the input paper document so that it is difficult to read, and
when it is determined that the input paper document is a secure document, the optical character reader processes the input paper document so that it is difficult to read, and discharges the processed paper document.