JP2006221560A

JP2006221560A - Data substitution device, data substitution method, and data substitution program

Info

Publication number: JP2006221560A
Application number: JP2005036616A
Authority: JP
Inventors: Katsuya Mimuro; 克哉三室; Eisuke Sudo; 英介須藤
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2005-02-14
Filing date: 2005-02-14
Publication date: 2006-08-24

Abstract

PROBLEM TO BE SOLVED: To precisely detect personal information to be substituted with other data.

SOLUTION: This data substitution device 1 has a reading means for reading document data, an extraction means for parsing the document data to extract personal information, a substitution means for substituting each piece of personal information extracted by the extraction means with data different from that personal information, and an output means for outputting the document data substituted by the substitution means.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

本発明は、データを置換する技術に関する。 The present invention relates to a technique for replacing data.

近年、個人情報など秘匿すべき情報の取り扱いに対して、関心が高まっている。すなわち、個人情報などの秘匿すべき情報が含まれる文書を公開する場合、秘匿すべき情報の記載箇所を削除またはマスク（黒く塗り潰すなど）処理を行う必要がある。例えば、特許文献１では、秘匿すべき箇所の情報に、マスク画面を埋め込む情報秘匿化方法が記載されている。 In recent years, there has been an increasing interest in handling confidential information such as personal information. That is, when a document including information to be concealed, such as personal information, is disclosed, it is necessary to delete or mask (for example, black out) a portion where the information to be concealed is described. For example, Patent Document 1 describes an information concealment method in which a mask screen is embedded in information on a portion to be concealed.
特開２００３−２９８８３２ JP 2003-298732 A

ところで、企業のデータベースには、営業活動を報告するための営業日報や、コールセンターに寄せられる消費者からの問合せ情報などが、テキストデータとして電子化され、大量に蓄積されている。このようなテキストデータには、一般的に、氏名、電話番号、メールアドレス等の個人情報が含まれている。そのため、個人情報を含むテキストデータを社外または社内で公開し、知識の共有化を図る場合、個人情報が漏洩する危険性がある。 By the way, in a company database, daily business reports for reporting business activities, inquiries from consumers sent to call centers, etc. are digitized as text data and accumulated in large quantities. Such text data generally includes personal information such as name, telephone number, and mail address. Therefore, when text data including personal information is disclosed outside or inside the company to share knowledge, there is a risk that personal information may be leaked.

したがって、担当者は、個人情報を含むテキストデータを公開する場合、個人情報に該当する部分を目で見て探し出し、探し出した個人情報の部分にマスク処理を行う必要がある。しかしながら、対象となるテキストデータが膨大な場合、担当者が個人情報に該当する部分を検知してマスク処理を行うことは、作業負荷が大きく、また、個人情報の検知漏れやミスが発生するおそれがある。 Therefore, when publishing text data that includes personal information, the person in charge needs to visually search for the portions corresponding to personal information and mask the portions found. However, when the volume of text data is enormous, having the person in charge detect the portions corresponding to personal information and mask them imposes a heavy workload, and there is a risk that personal information will be overlooked or that mistakes will occur.

本発明は上記事情に鑑みてなされたものであり、本発明の目的は、個人情報をより高い精度で抽出し、抽出した個人情報を他のデータに置換することにある。 The present invention has been made in view of the above circumstances, and an object of the present invention is to extract personal information with higher accuracy and replace the extracted personal information with other data.

上記課題を解決するために、本発明は、例えば、データ置換装置であって、文書データを読み込む読込み手段と、文書データを構文解析して、個人情報を抽出する抽出手段と、抽出手段が抽出した個人情報各々を、当該個人情報とは異なるデータに置換する置換手段と、置換手段が置換した置換後の文書データを出力する出力手段と、を有する。 In order to solve the above problem, the present invention is, for example, a data replacement device comprising: reading means for reading document data; extraction means for parsing the document data to extract personal information; replacement means for replacing each piece of personal information extracted by the extraction means with data different from that personal information; and output means for outputting the document data after replacement by the replacement means.

本発明により、個人情報をより高い精度で検出し、他のデータに置換することができる。 According to the present invention, personal information can be detected with higher accuracy and replaced with other data.

以下、本発明の実施の形態について説明する。 Hereinafter, embodiments of the present invention will be described.

図１は、本発明の一実施形態が適用されたデータ置換装置１の機能構成図である。図示するデータ置換装置１は、テキストデータを含む文書ファイルの中から、個人情報を抽出し、他のデータに置換する装置である。なお、個人情報は、個人に関する情報であって、特定の個人を識別することができる情報（また、他の情報と照合することにより、特定の個人を識別することができる情報を含む）である。個人情報には、例えば、氏名、住所（郵便番号を含む）、電話番号、メールアドレス、生年月日、クレジット番号、口座番号などが考えられる。 FIG. 1 is a functional configuration diagram of a data replacement device 1 to which an embodiment of the present invention is applied. The illustrated data replacement device 1 extracts personal information from a document file containing text data and replaces it with other data. Personal information is information relating to an individual that can identify a specific individual (including information that can identify a specific individual when collated with other information). Examples of personal information include a name, address (including postal code), telephone number, e-mail address, date of birth, credit card number, and account number.

本実施形態のデータ置換装置１は、図示するように、データ読込部１１と、データ抽出部１２と、置換処理部１３と、データ出力部１４と、データ管理部１５と、入力ファイル２１と、出力ファイル２２と、ユーザ辞書２３と、置換ルールテーブル２４と、構文解析ルール２５と、を有する。 As shown in the figure, the data replacement device 1 of the present embodiment includes a data reading unit 11, a data extraction unit 12, a replacement processing unit 13, a data output unit 14, a data management unit 15, an input file 21, An output file 22, a user dictionary 23, a replacement rule table 24, and a syntax analysis rule 25 are included.

データ読込部１１は、入力ファイル２１に記憶されている文書ファイルのデータを読み込む。データ抽出部１２は、データ読込部１１が読み込んだデータの中から、個人情報を抽出する。置換処理部１３は、データ抽出部１２が抽出した個人情報を、他のデータに置換する。データ出力部１４は、個人情報を置換したデータを、出力ファイル２２または出力装置（不図示）に出力する。データ管理部１５は、ユーザ辞書２３、置換ルールテーブル２４および構文解析ルール２５のデータを管理（追加、更新、削除）する。 The data reading unit 11 reads data of a document file stored in the input file 21. The data extraction unit 12 extracts personal information from the data read by the data reading unit 11. The replacement processing unit 13 replaces the personal information extracted by the data extracting unit 12 with other data. The data output unit 14 outputs the data in which the personal information is replaced to the output file 22 or an output device (not shown). The data management unit 15 manages (adds, updates, deletes) data in the user dictionary 23, the replacement rule table 24, and the syntax analysis rule 25.

入力ファイル２１には、各種の文書ファイルが記憶されているものとする。なお、文書ファイルは、例えば、営業活動を報告するための営業日報、コールセンターに寄せられる消費者からの問合せ情報、顧客への返信メール、電子掲示板（Bulletin Board System）への書込みデータ、外部の業者への委託データ、または、顧客毎に作成されるＨＴＭＬデータなどが考えられる。そして、このような文書ファイルには、一般的に、氏名、電話番号等の個人情報が含まれているものとする。 It is assumed that various document files are stored in the input file 21. Examples of document files include daily business reports for reporting sales activities, inquiry information from consumers received at a call center, reply e-mails to customers, data written to an electronic bulletin board (Bulletin Board System), data entrusted to external vendors, and HTML data created for each customer. Such document files are assumed to generally contain personal information such as names and telephone numbers.

なお、文書ファイルには、テキストデータ以外のデータ（イメージデータ、タブや改行などの制御コードなど）が含まれていてもよい。また、入力ファイル２１は、このような文書ファイルが記憶されたデータベースであってもよい。また、入力ファイル２１に記憶された文書ファイルは、データ置換装置１の入力手段（不図示）を用いて入力・作成される場合、ＭＯ（Magneto-Optical disk）やＣＤ−ＲＯＭなどの記憶装置（記憶媒体）のデータを入力する場合、または、ネットワークを介して他のコンピュータシステム（不図示）からファイル転送される場合、などが考えられる。 Note that the document file may include data other than text data (image data, control codes such as tabs and line feeds, and so on). The input file 21 may also be a database in which such document files are stored. The document files stored in the input file 21 may be input or created using an input unit (not shown) of the data replacement device 1, may be read from a storage device (storage medium) such as an MO (Magneto-Optical disk) or a CD-ROM, or may be transferred as files from another computer system (not shown) via a network.

出力ファイル２２には、置換処理部１３が個人情報を置換した後の文書ファイルが記憶される。ユーザ辞書２３は、個人情報に該当する任意の文字列が登録されたデータベースである。置換ルールテーブル２４は、個人情報を置換する際の規則が設定されたテーブルである。構文解析ルール２５には、個人情報の信頼度算出に必要な各種のデータが記憶されている。なお、ユーザ辞書２３、置換ルールテーブル２４および構文解析ルール２５については後述する。 The output file 22 stores a document file after the replacement processing unit 13 replaces personal information. The user dictionary 23 is a database in which an arbitrary character string corresponding to personal information is registered. The replacement rule table 24 is a table in which rules for replacing personal information are set. The parsing rule 25 stores various data necessary for calculating the reliability of personal information. The user dictionary 23, the replacement rule table 24, and the syntax analysis rule 25 will be described later.

上記説明した、データ置換装置１は、例えば図２に示すようなＣＰＵ９０１と、メモリ９０２と、ＨＤＤ等の外部記憶装置９０３と、キーボードやマウスなどの入力装置９０４と、ディスプレイやプリンタなどの出力装置９０５と、ネットワーク接続するための通信制御装置９０６と、を備えた汎用的なコンピュータシステムを用いることができる。このコンピュータシステムにおいて、ＣＰＵ９０１がメモリ９０２上にロードされた所定のプログラムを実行することにより、データ置換装置１の各機能が実現される。すなわち、データ置換装置１の各機能は、データ置換装置１のＣＰＵ９０１が、データ置換装置１用のプログラムを実行することにより実現される。なお、データ置換装置１の入力ファイル２１、出力ファイル２２、ユーザ辞書２３、置換ルールテーブル２４および構文解析ルール２５は、メモリ９０２または外部記憶装置９０３に記憶される。また、通信制御装置９０６は、必要に応じて備えるものとする。 The data replacement device 1 described above can be implemented using, for example, a general-purpose computer system as shown in FIG. 2, which includes a CPU 901, a memory 902, an external storage device 903 such as an HDD, an input device 904 such as a keyboard and a mouse, an output device 905 such as a display and a printer, and a communication control device 906 for connecting to a network. In this computer system, each function of the data replacement device 1 is realized by the CPU 901 executing a predetermined program loaded on the memory 902. That is, each function of the data replacement device 1 is realized by the CPU 901 of the data replacement device 1 executing a program for the data replacement device 1. The input file 21, the output file 22, the user dictionary 23, the replacement rule table 24, and the syntax analysis rule 25 of the data replacement device 1 are stored in the memory 902 or the external storage device 903. The communication control device 906 is provided as needed.

次に、ユーザ辞書２３について説明する。 Next, the user dictionary 23 will be described.

ユーザ辞書２３は、個人情報に該当する任意の文字列が登録されたデータベースである。本実施形態のユーザ辞書２３は、登録された文字列毎に、当該文字列の属するグループ名と、置換適用フラグとを有する。グループ名は、文字列が示す単語を分類するための種別（カテゴリー）である。例えば、登録された文字列が顧客名の場合はグループ名に「顧客」が、登録された文字列が社員名の場合はグループ名に「社員」が、また、登録された文字列が取引先の会社名の場合はグループ名に「取引先」が設定される。 The user dictionary 23 is a database in which arbitrary character strings corresponding to personal information are registered. For each registered character string, the user dictionary 23 of this embodiment holds the name of the group to which the character string belongs and a replacement application flag. The group name is a type (category) for classifying the word indicated by the character string. For example, "customer" is set as the group name when the registered character string is a customer name, "employee" is set when it is an employee name, and "business partner" is set when it is the company name of a business partner.

置換適用フラグは、ユーザ辞書２３に登録された文字列を、後述する置換処理で置換の対象とするか否かを判別するためのフラグである。例えば、置換適用フラグが「ＯＮ」の場合は当該文字列を置換の対象として抽出する。一方、置換適用フラグが「ＯＦＦ」の場合は、ユーザ辞書２３に登録された文字列であっても当該文字列を置換の対象外とする。 The replacement application flag is a flag for determining whether or not a character string registered in the user dictionary 23 is a replacement target in a replacement process described later. For example, if the replacement application flag is “ON”, the character string is extracted as a replacement target. On the other hand, when the replacement application flag is “OFF”, even the character string registered in the user dictionary 23 is excluded from the replacement target.
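The user dictionary described above can be pictured as a small table of records. The following is only a minimal sketch in Python; the class name `DictEntry`, the field names, and the sample entries are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass

@dataclass
class DictEntry:
    text: str         # registered character string
    group: str        # group name, e.g. "customer", "employee", "business partner"
    replace_on: bool  # replacement application flag (True = ON, False = OFF)

def replacement_targets(entries):
    """Return only the entries whose replacement application flag is ON."""
    return [e for e in entries if e.replace_on]

# Hypothetical sample data for illustration only.
user_dictionary = [
    DictEntry("Tanaka Taro", "customer", True),
    DictEntry("Suzuki Hanako", "employee", False),  # registered, but excluded from replacement
]
targets = replacement_targets(user_dictionary)
```

An entry whose flag is OFF stays registered in the dictionary and is simply skipped when replacement targets are collected.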

図３（Ａ）は、ユーザ辞書２３の設定画面（または参照画面）の一例を示した図である。データ管理部１５は、ユーザが入力装置を用いて入力した指示を受け付けて、図示する設定画面を出力装置に表示するものとする。 FIG. 3A is a diagram showing an example of a setting screen (or reference screen) for the user dictionary 23. The data management unit 15 receives an instruction input by the user using the input device, and displays a setting screen illustrated in the drawing on the output device.

図示する設定画面には、ユーザ辞書２３に登録されている文字列毎に、置換適用フラグを設定するためのチェックボックス３１と、グループ名３２と、文字列３３とが表示されている。本実施形態では、チェックボックス３１にチェックが入力されている場合、データ管理部１５は、ユーザ辞書２３の置換適用フラグに「ＯＮ（置換の対象）」を設定する。一方、チェックボックス３１にチェックが入力されていない場合、データ管理部１５は、置換適用フラグに「ＯＦＦ（置換の対象外）」を設定する。 On the illustrated setting screen, a check box 31, a group name 32, and a character string 33 for setting a replacement application flag are displayed for each character string registered in the user dictionary 23. In the present embodiment, when a check is input in the check box 31, the data management unit 15 sets “ON (replacement target)” in the replacement application flag of the user dictionary 23. On the other hand, when a check is not input in the check box 31, the data management unit 15 sets “OFF (not subject to replacement)” in the replacement application flag.

チェックボックス３１を表示してユーザの入力（指示）を受け付けることにより、ユーザ辞書２３に登録された文字列であっても、個別にまた一時的に置換の対象外とすることができる。すなわち、ユーザは、置換処理を実行する文書ファイルの種類および属性などに応じて、柔軟にユーザ辞書２３を変更することができる。 By displaying the check box 31 and accepting a user input (instruction), even a character string registered in the user dictionary 23 can be individually and temporarily excluded from replacement. That is, the user can flexibly change the user dictionary 23 according to the type and attribute of the document file for which the replacement process is executed.

また、図示する設定画面には、追加ボタン３４と、編集ボタン３５と、削除ボタン３６と、ＯＫボタン３７と、ＣＡＮＣＥＬボタン３８とが表示されている。データ管理部１５は、追加ボタン３４が押下された場合、新たな文字列を新規に登録するための追加用ダイアログボックス（dialog box）を表示する。また、データ管理部１５は、編集ボタン３５が押下された場合、カーソルで選択された文字列を編集するための編集用ダイアログボックスを表示する。また、データ管理部１５は、削除ボタン３６が押下された場合、カーソルで選択された文字列をユーザ辞書２３から削除する。 In addition, an add button 34, an edit button 35, a delete button 36, an OK button 37, and a CANCEL button 38 are displayed on the setting screen shown in the figure. When the add button 34 is pressed, the data management unit 15 displays an add dialog box (dialog box) for registering a new character string. Further, when the edit button 35 is pressed, the data management unit 15 displays an edit dialog box for editing the character string selected by the cursor. Further, when the delete button 36 is pressed, the data management unit 15 deletes the character string selected with the cursor from the user dictionary 23.

図３（Ｂ）は、編集ボタン３５を押下した場合に表示される編集用ダイアログボックスの一例を示した図である。編集用ダイアログボックスは、ドロップダウンリストを用いたグループ入力欄３５１と、文字列入力欄３５２とを有する。なお、編集用ダイアログボックスの場合、グループ入力欄３５１および文字列入力欄３５２には、図３（Ａ）の設定画面で選択された編集対象のグループ名および文字列が、あらかじめ表示されているものとする。ユーザは、グループ名および文字列の入力後（変更後）、ＯＫボタン３５３を押下する。データ管理部１５は、ＯＫボタン３５３の押下を受け付けて、入力された内容でユーザ辞書２３を更新する。また、データ管理部１５は、ＣＡＮＣＥＬボタン３５４の押下を受け付けて、ユーザ辞書２３を更新することなく、図３（Ａ）の設定画面に遷移する。 FIG. 3B is a diagram showing an example of an edit dialog box displayed when the edit button 35 is pressed. The edit dialog box has a group input field 351 using a drop-down list and a character string input field 352. In the edit dialog box, the group input field 351 and the character string input field 352 are pre-filled with the group name and character string of the entry selected for editing on the setting screen of FIG. 3A. After entering (changing) the group name and the character string, the user presses the OK button 353. The data management unit 15 accepts the pressing of the OK button 353 and updates the user dictionary 23 with the input content. When the CANCEL button 354 is pressed, the data management unit 15 returns to the setting screen of FIG. 3A without updating the user dictionary 23.

次に、置換ルールテーブル２４について説明する。 Next, the replacement rule table 24 will be described.

図４は、置換ルールテーブル２４の一例を示した図である。置換ルールテーブル２４は、個人情報を置換する際の規則が設定されたテーブルである。図示する置換ルールテーブル２４は、抽出方法４１と、ルール名４２と、正規表現文字パターン４３と、置換文字列４４と、ＩＤフラグ４５と、適用フラグ４６と、優先順位４７と、を有する。 FIG. 4 is a diagram showing an example of the replacement rule table 24. The replacement rule table 24 is a table in which rules for replacing personal information are set. The replacement rule table 24 shown in the figure has an extraction method 41, a rule name 42, a regular expression character pattern 43, a replacement character string 44, an ID flag 45, an application flag 46, and a priority 47.

抽出方法４１には、個人情報に該当する文字列（個人情報の可能性がある文字列を含む）を抽出する方法が設定される。本実施形態では、構文解析による抽出、正規表現による抽出、および、ユーザ辞書２３による抽出の３種類の抽出方法を用いることとする。構文解析による抽出方法は、構文解析により文章の構造（例えば、係り受け構造）を解析し、個人情報の可能性が高い文字列を抽出する方法である。なお、構文解析による抽出方法については後述する。 In the extraction method 41, a method for extracting a character string corresponding to personal information (including a character string that may be personal information) is set. In this embodiment, three types of extraction methods are used: extraction by syntax analysis, extraction by regular expressions, and extraction by the user dictionary 23. The extraction method by syntactic analysis is a method of analyzing a sentence structure (for example, dependency structure) by syntactic analysis and extracting a character string having a high possibility of personal information. The extraction method by syntax analysis will be described later.

正規表現による抽出方法は、個人情報に該当する情報であって、番号体系および使用される文字の種別（文字パターン）などから特定が可能な正規表現の文字列を抽出する方法である。正規表現のデータとしては、例えば、電話番号、メールアドレス（ドメイン名）、郵便番号、生年月日、クレジット番号、口座番号などが考えられる。ユーザは、後述する設定画面を用いて、任意の番号体系および文字パターンを置換ルールテーブル２４に登録できるものとする。ユーザ辞書２３による抽出方法は、前述のユーザ辞書２３に登録されている各文字列と同一の文字列をマッチングして抽出する方法である。 The regular expression extraction method is information that corresponds to personal information, and is a method of extracting a regular expression character string that can be identified from the numbering system, the type of character used (character pattern), and the like. As regular expression data, for example, a telephone number, a mail address (domain name), a postal code, a date of birth, a credit number, an account number, and the like are conceivable. It is assumed that the user can register an arbitrary number system and character pattern in the replacement rule table 24 using a setting screen described later. The extraction method by the user dictionary 23 is a method of matching and extracting the same character string as each character string registered in the user dictionary 23 described above.
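As an illustration of extraction by regular expression, the following Python sketch replaces telephone numbers and e-mail addresses using the standard `re` module. The two patterns and the labels `[phone]` and `[email]` are simplified assumptions made for this example; they are not the patterns registered in the replacement rule table 24 of FIG. 4.

```python
import re

# Assumed, simplified character patterns (illustrative only).
RULES = {
    "[phone]": re.compile(r"\d{2,4}-\d{2,4}-\d{4}"),
    "[email]": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

def replace_by_regex(text):
    """Replace every substring matching a registered pattern with its label."""
    for replacement, pattern in RULES.items():
        text = pattern.sub(replacement, text)
    return text

sample = "Call 03-1234-5678 or mail taro@example.com."
masked = replace_by_regex(sample)
```

Because each pattern is tied to a numbering system or character set, a user could register further patterns (for example, a postal code) by adding entries to the table in the same way.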

ルール名４２には、個人情報の種別（カテゴリー）と当該種別の個人情報を置換するか否かが設定されている。図示する置換ルールテーブル２４では、抽出方法４１が構文解析の場合、人名を置換するルールと、地名を置換するルールとが設定されている。また、抽出方法４１が正規表現の場合、電話番号を置換するルールと、メールアドレスを置換するルールとが設定されている。 In the rule name 42, the type (category) of personal information and whether to replace the personal information of the type are set. In the replacement rule table 24 shown in the figure, when the extraction method 41 is syntax analysis, a rule for replacing a person name and a rule for replacing a place name are set. When the extraction method 41 is a regular expression, a rule for replacing a telephone number and a rule for replacing an e-mail address are set.

また、抽出方法４１がユーザ辞書の場合、グループ名が「顧客」の文字列を置換するルールと、グループ名が「社員」の文字列を置換しないルールとが設定されている。例えば、社内でのみ公開される文書ファイルの場合（例えば、所定の部署内で知識の共有化を図る場合、上司が部下の仕事振りを評価する場合など）、社員名を知りたい場合がある。このような場合、グループ名が「社員」の文字列を置換しないルールを適用することにより、社員名を置換の対象外とすることができる。 When the extraction method 41 is a user dictionary, a rule for replacing a character string whose group name is “customer” and a rule for not replacing a character string whose group name is “employee” are set. For example, in the case of a document file that is disclosed only within the company (for example, when sharing knowledge within a predetermined department, when a supervisor evaluates the work of subordinates), there is a case where it is desired to know the employee name. In such a case, by applying a rule that does not replace the character string whose group name is “employee”, the employee name can be excluded from replacement.

正規表現文字パターン４３には、正規表現の番号体系および使用される文字の種別（文字パターン）が設定される。図示する正規表現文字パターン４３には、電話番号に使用される数字・記号および番号体系と、メールアドレスに使用される英数字・記号および名称体系とが設定されている。置換文字列４４には、抽出した文字列の変換（置換）後の文字列が設定される。なお、ルール名４２が「社員（非置換）」の場合、社員名は置換されないため、置換文字列４４にはスペースが設定される。 In the regular expression character pattern 43, the regular expression numbering system and the type of character used (character pattern) are set. In the illustrated regular expression character pattern 43, numbers / symbols and number systems used for telephone numbers and alphanumeric characters / symbols and name systems used for mail addresses are set. In the replacement character string 44, a character string after conversion (replacement) of the extracted character string is set. When the rule name 42 is “employee (non-replacement)”, the employee name is not replaced, and a space is set in the replacement character string 44.

ＩＤフラグ４５は、同じルールが適用される各文字列を識別するための識別情報を置換文字列４４に付加するか否かを判別するためのフラグである。図示する例では、「人名（置換）」のルール名４２のＩＤフラグ４５が「有」である。この状態において、１つの文書ファイルの中で「田中」と「鈴木」の２つの人名の文字列が、構文解析により抽出された場合、置換処理部１３は、置換文字列（［人名］）の直後に識別情報を付与することとする。本実施形態では、識別情報として抽出した順番を示す連番を付与することとする。すなわち、置換処理部１３は、先に抽出された「田中」を「［人名］１」、次に抽出された「鈴木」を「［人名］２」に置換する。 The ID flag 45 is a flag for determining whether or not to add, to the replacement character string 44, identification information that distinguishes the individual character strings to which the same rule is applied. In the illustrated example, the ID flag 45 of the rule name 42 "person name (replacement)" is "present". In this state, when the two person-name character strings "Tanaka" and "Suzuki" are extracted by syntax analysis from one document file, the replacement processing unit 13 appends identification information immediately after the replacement character string ([person name]). In the present embodiment, a serial number indicating the order of extraction is used as the identification information. That is, the replacement processing unit 13 replaces "Tanaka", which was extracted first, with "[person name]1" and "Suzuki", which was extracted next, with "[person name]2".
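The serial numbering performed when the ID flag 45 is set can be sketched as follows in Python. The function name `replace_with_serial` is hypothetical, and two points are assumptions of this sketch rather than statements of the specification: extraction order is taken to be the order of first appearance in the text, and every occurrence of the same name receives the same number.

```python
def replace_with_serial(text, names, label="[person name]"):
    """Replace each extracted name with the label followed by a serial number."""
    # Order the names by the position at which they first appear in the text.
    found = sorted((text.find(n), n) for n in names if n in text)
    for serial, (_, name) in enumerate(found, start=1):
        # Simplification: names that are substrings of one another are not handled.
        text = text.replace(name, f"{label}{serial}")
    return text

original = "Tanaka called Suzuki. Suzuki replied to Tanaka."
masked = replace_with_serial(original, ["Suzuki", "Tanaka"])
```

Keeping the number attached to the label lets a reader of the masked document still tell which statements refer to the same person.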

適用フラグ４６は、置換ルールテーブル２４に登録された各ルールを、後述する置換処理で適用するか否かを判別するためのフラグである。例えば、適用フラグ４６が「ＯＮ」の場合は当該ルールを適用して置換処理を行う。一方、適用フラグ４６が「ＯＦＦ」の場合は、当該ルールを適用せず置換処理を行う。 The application flag 46 is a flag for determining whether or not each rule registered in the replacement rule table 24 is applied in a replacement process described later. For example, when the application flag 46 is “ON”, the replacement process is performed by applying the rule. On the other hand, when the application flag 46 is “OFF”, the replacement process is performed without applying the rule.

優先順位４７は、同一の文字列が複数のルールで重複して抽出された場合の優先順位を示したものである。例えば、ユーザ辞書２３の「顧客」のグループ名と、「社員」のグループ名に、同姓同名の氏名が重複して登録されているものとする。この場合、図示する置換ルールテーブル２４では、ルール名４２が「顧客（置換）」の優先順位４７の方が、「社員（非置換）」の優先順位より高い。したがって、置換処理部１３は、「顧客（置換）」のルールを優先し、重複する氏名を［顧客］に置換する。 The priority order 47 indicates the priority order when the same character string is extracted in duplicate by a plurality of rules. For example, it is assumed that the same name and the same name are registered in the group name “customer” and the group name “employee” in the user dictionary 23. In this case, in the illustrated replacement rule table 24, the priority order 47 of the rule name 42 “customer (replacement)” is higher than the priority order of “employee (non-replacement)”. Accordingly, the replacement processing unit 13 gives priority to the rule “customer (replacement)” and replaces the duplicate name with [customer].
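The priority handling above can be sketched in Python as follows. The class `Rule`, its field names, and the convention that a smaller number means a higher priority are assumptions made for this illustration; the specification only states that the rule with the higher priority wins.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    name: str
    replacement: Optional[str]  # None means "do not replace"
    priority: int               # assumption: smaller number = higher priority
    applied: bool = True        # application flag (ON/OFF)

def choose_rule(matching_rules):
    """Among the rules that matched the same string, pick the active rule with the highest priority."""
    active = [r for r in matching_rules if r.applied]
    return min(active, key=lambda r: r.priority) if active else None

customer = Rule("customer (replacement)", "[customer]", priority=1)
employee = Rule("employee (non-replacement)", None, priority=2)
chosen = choose_rule([employee, customer])
```

With these sample values, a name registered under both groups is handled by the customer rule and is therefore replaced with [customer].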

図５（Ａ）は、置換ルールテーブル２４の設定画面（または参照画面）の一例を示した図である。なお、データ管理部１５は、ユーザが入力装置を用いて入力した指示を受け付けて、図示する設定画面を出力装置に表示するものとする。 FIG. 5A is a diagram showing an example of a setting screen (or reference screen) for the replacement rule table 24. Note that the data management unit 15 receives an instruction input by the user using the input device, and displays a setting screen illustrated in the drawing on the output device.

図示する設定画面は、置換ルールテーブル２４に登録されたルール毎に、適用フラグ４６を設定するためのチェックボックス５１と、置換ルールテーブル２４に設定された設定内容が表示される表示欄５２とを有する。なお、表示欄５２には、ルール名、抽出方法、正規表現文字パターンなどが表示されている。 The illustrated setting screen has, for each rule registered in the replacement rule table 24, a check box 51 for setting the application flag 46 and a display field 52 that shows the contents set in the replacement rule table 24. The display field 52 shows the rule name, extraction method, regular expression character pattern, and so on.

本実施形態では、データ管理部１５は、チェックボックス５１にチェックが入力されている場合は適用フラグ４６に「ＯＮ」を設定し、チェックボックス５１にチェックが入力されていない場合は適用フラグ４６に「ＯＦＦ」を設定する。チェックボックス５１を表示してユーザの入力（指示）を受け付けることにより、置換ルールテーブル２４に登録されたルールであっても、一時的に置換処理の適用対象外とすることができる。すなわち、ユーザは、置換処理を実行する文書ファイルの種類および属性などに応じて、適用するルールを選択することができる。 In this embodiment, the data management unit 15 sets the application flag 46 to "ON" when the check box 51 is checked and sets the application flag 46 to "OFF" when the check box 51 is not checked. By displaying the check box 51 and accepting a user input (instruction), even a rule registered in the replacement rule table 24 can be temporarily excluded from the replacement process. That is, the user can select the rules to be applied according to the type and attributes of the document file on which the replacement process is executed.

また、図示する設定画面には、追加ボタン５３と、編集ボタン５４と、削除ボタン５５と、ＯＫボタン５６と、ＣＡＮＣＥＬボタン５７とが表示されている。データ管理部１５は、追加ボタン５３が押下された場合、新たなルールを新規に登録するための追加用ダイアログボックスを表示する。また、データ管理部１５は、編集ボタン５４が押下された場合、カーソルで選択されたルールを編集するための編集用ダイアログボックスを表示する。また、データ管理部１５は、削除ボタン５５が押下された場合、カーソルで選択されたルールを置換ルールテーブル２４から削除する。 In addition, an add button 53, an edit button 54, a delete button 55, an OK button 56, and a CANCEL button 57 are displayed on the setting screen shown in the figure. When the add button 53 is pressed, the data management unit 15 displays an addition dialog box for newly registering a new rule. Further, when the edit button 54 is pressed, the data management unit 15 displays an edit dialog box for editing the rule selected by the cursor. Further, when the delete button 55 is pressed, the data management unit 15 deletes the rule selected with the cursor from the replacement rule table 24.

図５（Ｂ）は、編集ボタン５４を押下した場合に表示される編集用ダイアログボックスの一例を示した図である。図示する編集用ダイアログボックスには、置換ルールテーブル２４の各項目を変更するための入力欄、ラジオボタンおよびチェックボックスが表示されている。すなわち、図示する編集用ダイアログボックスには、ドロップダウンリストを用いた抽出方法入力欄５４１と、ルール名入力欄５４２と、正規表現文字パターン入力欄５４３と、置換するか否かを入力するラジオボタン５４４と、置換文字列にＩＤを付加するか否かを指示するチェックボックス５４５と、置換文字列入力欄５４６と、優先順位入力欄５４７と、が表示されている。 FIG. 5B is a diagram showing an example of an edit dialog box displayed when the edit button 54 is pressed. The illustrated edit dialog box displays input fields, radio buttons, and a check box for changing each item of the replacement rule table 24. Specifically, it displays an extraction method input field 541 using a drop-down list, a rule name input field 542, a regular expression character pattern input field 543, radio buttons 544 for specifying whether or not to replace, a check box 545 for specifying whether or not to add an ID to the replacement character string, a replacement character string input field 546, and a priority input field 547.

なお、編集用ダイアログボックスの各入力欄等には、図５（Ａ）の設定画面で選択された編集対象のルールの設定内容が、あらかじめ表示されているものとする。ユーザは、所望の設定内容を入力後（変更後）、ＯＫボタン５４８を押下する。データ管理部１５は、ＯＫボタン５４８の押下を受け付けて、入力された内容で置換ルールテーブル２４を更新する。また、データ管理部１５は、ＣＡＮＣＥＬボタン５４９の押下を受け付けて、置換ルールテーブル２４を更新することなく図５（Ａ）の設定画面に戻る。 It is assumed that the setting contents of the rule to be edited selected on the setting screen of FIG. 5A are displayed in advance in the input fields of the editing dialog box. The user presses an OK button 548 after inputting desired setting contents (after change). The data management unit 15 accepts the pressing of the OK button 548 and updates the replacement rule table 24 with the input content. Further, the data management unit 15 accepts pressing of the CANCEL button 549, and returns to the setting screen of FIG. 5A without updating the replacement rule table 24.

次に、構文解析ルール２５について説明する。 Next, the syntax analysis rule 25 will be described.

構文解析ルール２５には、人の姓および名が登録された人名辞書、各地の県名や市町村名が登録された地名辞書、信頼度算出に用いる信頼度データなどが記憶されている。信頼度データには、本実施形態では、人を示す接尾詞、人の行為を示す述語、地名を修飾する語、所定の格助詞などを用いることとする。なお、信頼度データの具体例については後述する。 The parsing rule 25 stores a personal name dictionary in which a person's first name and surname are registered, a place name dictionary in which names of prefectures and municipalities are registered, reliability data used for reliability calculation, and the like. In the present embodiment, a suffix indicating a person, a predicate indicating a person's action, a word that modifies a place name, a predetermined case particle, and the like are used for the reliability data. A specific example of reliability data will be described later.

次に、構文解析による個人情報に該当する文字列の抽出方法について説明する。 Next, a method for extracting a character string corresponding to personal information by syntax analysis will be described.

なお、データ抽出部１２は、例えば、形態素解析エンジンおよび構文解析エンジンを用いて、以下に説明する個人情報に該当する文字列の抽出を行う。形態素解析エンジンは、自然言語処理の技術である形態素解析（Morphological Analysis）を行うことにより、文を最小の文字列である形態素（単語）に分解し、分解した形態素各々の品詞を特定する。構文解析エンジンは、自然言語処理の技術である構文解析（Syntactic Analysis）を行うことにより、文の構造（例えば、係り受け構造）を解析する。 The data extraction unit 12 extracts a character string corresponding to personal information described below using, for example, a morphological analysis engine and a syntax analysis engine. The morpheme analysis engine performs morphological analysis (Morphological Analysis), which is a technology of natural language processing, to decompose a sentence into morphemes (words) that are the minimum character strings, and specify the part of speech of each decomposed morpheme. The parsing engine analyzes a sentence structure (for example, a dependency structure) by performing syntactic analysis, which is a natural language processing technique.

なお、形態素解析および構文解析については、例えば以下に記述されている。 Note that morphological analysis and syntax analysis are described below, for example.

「形態素解析・構文解析入門」、［平成１７年１月２６日検索］、インターネット＜ＵＲＬ：http://www.unixuser.org/~euske/doc/nlpintro/＞ “Introduction to morphological and syntactic analysis”, [searched on January 26, 2005], Internet <URL: http://www.unixuser.org/~euske/doc/nlpintro/>

まず、データ抽出部１２は、処理対象の文書ファイルの文（センテンス）毎に、形態素解析を行う。すなわち、データ抽出部１２は、形態素解析エンジンを用いて、文を最小の文字列である形態素（単語）に分解し、分解した形態素各々の品詞を特定する。そして、データ抽出部１２は、特定した品詞が名詞の形態素の文字列のみを抽出する。 First, the data extraction unit 12 performs morphological analysis for each sentence of a document file to be processed. That is, the data extraction unit 12 uses a morpheme analysis engine to decompose a sentence into morphemes (words) that are the minimum character strings, and specifies the part of speech of each decomposed morpheme. Then, the data extraction unit 12 extracts only the character strings of the morphemes whose specified part of speech is a noun.

そして、データ抽出部１２は、抽出した文字列各々について、構文解析ルール２５の人名辞書または地名辞書に登録されている文字列と一致するか否かを判別する。そして、データ抽出部１２は、人名辞書または地名辞書に登録されている文字列を、個人情報としてさらに抽出する。そして、データ抽出部１２は、人名辞書に登録されている文字列（以下、「人名データ」）、および、地名辞書に登録されている文字列（以下、「地名データ」）各々の信頼度を算出する。信頼度は、個人情報に該当する可能性（信頼性）を示す指標である。 Then, the data extraction unit 12 determines whether each extracted character string matches a character string registered in the personal name dictionary or the place name dictionary of the parsing rule 25. The data extraction unit 12 further extracts the character strings registered in the personal name dictionary or the place name dictionary as personal information. Then, the data extraction unit 12 calculates the reliability of each character string registered in the personal name dictionary (hereinafter "personal name data") and each character string registered in the place name dictionary (hereinafter "place name data"). The reliability is an index indicating the likelihood that a character string corresponds to personal information.
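The noun filtering and dictionary lookup described above can be sketched as follows in Python. The sketch assumes that a morphological analysis engine has already split the sentence into (surface, part-of-speech) pairs; the tag names, the function name `extract_candidates`, and the sample dictionaries are hypothetical.

```python
def extract_candidates(morphemes, person_names, place_names):
    """Keep nouns only, then keep those found in the person or place name dictionary."""
    candidates = []
    for surface, pos in morphemes:
        if pos != "noun":
            continue  # only noun morphemes are considered
        if surface in person_names:
            candidates.append((surface, "person"))
        if surface in place_names:
            candidates.append((surface, "place"))
    return candidates

# Assumed output of a morphological analyzer for one sentence (illustrative only).
morphemes = [("田中", "noun"), ("様", "suffix"), ("が", "particle"),
             ("千葉", "noun"), ("に", "particle"), ("行く", "verb")]
person_names = {"田中", "千葉"}  # 千葉 can be a family name
place_names = {"千葉"}           # ...and also a place name
candidates = extract_candidates(morphemes, person_names, place_names)
```

In this sketch a string registered in both dictionaries yields two candidates, one per kind, so that a reliability can be computed for each reading.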

図６は、人名データおよび地名データの信頼度を算出する処理フロー図である。なお、データ抽出部１２は、図示する処理フローを行う前に、人名データまたは地名データを含む文を、構文解析エンジンを用いて構文解析を行う。 FIG. 6 is a process flow diagram for calculating the reliability of personal name data and place name data. Note that the data extraction unit 12 parses a sentence including personal name data or place name data using a parsing engine before performing the illustrated processing flow.

まず、データ抽出部１２は、処理対象の対象データ（人名データまたは地名データ）を入力し、初期値の信頼度を設定する（Ｓ２１）。なお、本実施形態では、初期値の信頼度を５０％とする。そして、データ抽出部１２は、対象データが含まれる文を参照し、対象データの直後に人を示す接尾詞（例えば、「様」、「さん」など）が付加されているか否かを判別する（Ｓ２２）。なお、人を示す接尾詞については、構文解析ルール２５の信頼度データに記憶されているものとする。 First, the data extraction unit 12 inputs target data (person name data or place name data) to be processed, and sets the reliability of the initial value (S21). In the present embodiment, the reliability of the initial value is 50%. Then, the data extraction unit 12 refers to a sentence including the target data, and determines whether or not a suffix indicating a person (for example, “sama” or “san”) is added immediately after the target data. (S22). Note that a suffix indicating a person is stored in the reliability data of the parsing rule 25.

接尾詞が付加されている場合であって（Ｓ２２：ＹＥＳ）、対象データが人名データの場合、データ抽出部１２は、初期値の信頼度に２５％を加算する。また、対象データが地名データの場合、データ抽出部１２は、初期値の信頼度から２５％を減算する（Ｓ２３）。一方、接尾詞が付加されていない場合（Ｓ２２：ＮＯ）、データ抽出部１２は、Ｓ２４に進む。 If the suffix is added (S22: YES) and the target data is personal name data, the data extraction unit 12 adds 25% to the reliability of the initial value. If the target data is place name data, the data extraction unit 12 subtracts 25% from the reliability of the initial value (S23). On the other hand, when no suffix is added (S22: NO), the data extraction unit 12 proceeds to S24.

そして、データ抽出部１２は、対象データが含まれる文を構文解析することにより、対象データと係り受けの関係にある述語を特定する。そして、対象データと係り受けの関係にある述語が、人の行為を示す述語（例えば、「言う」、「話す」、など）であるか否かを判別する（Ｓ２４）。なお、人の行為を示す述語については、構文解析ルール２５の信頼度データに記憶されているものとする。 Then, the data extraction unit 12 parses the sentence containing the target data to identify the predicate that has a dependency relationship with the target data. It then determines whether that predicate indicates a human action (for example, “say” or “speak”) (S24). Predicates indicating human actions are assumed to be stored in the reliability data of the parsing rule 25.

人の行為を示す述語が係り受けの関係にあって（Ｓ２４：ＹＥＳ）、対象データが人名データの場合、データ抽出部１２は、信頼度に２５％を加算する。また、対象データが地名データの場合、データ抽出部１２は、信頼度から２５％を減算する（Ｓ２５）。一方、人の行為を示す述語が係り受けの関係にない場合（Ｓ２４：ＮＯ）、データ抽出部１２は、Ｓ２６に進む。 If the predicate indicating a human action is in a dependency relationship (S24: YES), and the target data is personal name data, the data extraction unit 12 adds 25% to the reliability. If the target data is place name data, the data extraction unit 12 subtracts 25% from the reliability (S25). On the other hand, when the predicate indicating the human action is not in the dependency relationship (S24: NO), the data extraction unit 12 proceeds to S26.

そして、データ抽出部１２は、対象データが含まれる文を参照し、対象データの直後に地名を修飾する語（例えば、「発祥」、「育ち」など）が付加されているか否かを判別する（Ｓ２６）。なお、地名を修飾する語については、構文解析ルール２５の信頼度データに記憶されているものとする。地名を修飾する語が付加されている場合であって（Ｓ２６：ＹＥＳ）、対象データが人名データの場合、データ抽出部１２は、信頼度から２５％を減算する。また、対象データが地名データの場合、データ抽出部１２は、信頼度に２５％を加算する（Ｓ２７）。一方、地名を修飾する語が付加されていない場合（Ｓ２６：ＮＯ）、データ抽出部１２は、Ｓ２８に進む。 Then, the data extraction unit 12 refers to a sentence including the target data, and determines whether or not a word that modifies the place name (for example, “origin”, “bred”) is added immediately after the target data. (S26). It is assumed that the word that modifies the place name is stored in the reliability data of the parsing rule 25. If a word that modifies the place name is added (S26: YES) and the target data is personal name data, the data extraction unit 12 subtracts 25% from the reliability. If the target data is place name data, the data extraction unit 12 adds 25% to the reliability (S27). On the other hand, when the word which modifies a place name is not added (S26: NO), the data extraction part 12 progresses to S28.

そして、データ抽出部１２は、対象データが含まれる文を構文解析することにより、人の行為かあるいは場所を表すかによって格助詞が異なる単語が、対象データと係り受けの関係にあるか否か判別する（Ｓ２８）。格助詞が異なる単語には、例えば「行く」がある。すなわち、「行く」の格助詞が「が」の場合（Ａが「行く」）、Ａは人である。また、「行く」の格助詞が「に」の場合（Ａに「行く」）、Ａは地名である。なお、このような格助詞が異なる単語、および、当該単語の格助詞（人を示す格助詞、地名を示す格助詞）については、構文解析ルール２５の信頼度データに記憶されているものとする。格助詞が異なる単語が対象データと係り受けの関係にない場合（Ｓ２８：ＮＯ）、データ抽出部１２は、Ｓ３２に進む。 Then, by parsing the sentence containing the target data, the data extraction unit 12 determines whether a word whose case particle differs depending on whether it expresses a human action or a place has a dependency relationship with the target data (S28). An example of such a word is “go”. When the case particle used with “go” is “ga” (A “ga” goes, that is, A goes), A is a person. When the case particle is “ni” (goes “ni” A, that is, goes to A), A is a place name. Such words and their case particles (those indicating a person and those indicating a place name) are assumed to be stored in the reliability data of the parsing rule 25. When no such word has a dependency relationship with the target data (S28: NO), the data extraction unit 12 proceeds to S32.

格助詞が異なる単語が対象データと係り受けの関係にある場合（Ｓ２８：ＹＥＳ）、データ抽出部１２は、当該単語の格助詞が人の行為を示す格助詞か否かを判別する（Ｓ２９）。人の行為を示す格助詞であって（Ｓ２９：ＹＥＳ）、対象データが人名データの場合、データ抽出部１２は、信頼度に２５％を加算する。また、対象データが地名データの場合、データ抽出部１２は、信頼度から２５％を減算する（Ｓ３０）。一方、地名を示す格助詞であって（Ｓ２９：ＮＯ）、対象データが人名データの場合、データ抽出部１２は、信頼度から２５％を減算する。また、対象データが地名データの場合、データ抽出部１２は、信頼度に２５％を加算する（Ｓ３１）
そして、データ抽出部１２は、Ｓ２１からＳ３１の処理により算出した信頼度が、１００％を超えるか否かを判別する（Ｓ３２）。１００％を超える場合（Ｓ３２：ＹＥＳ）、データ抽出部１２は、対象データの信頼度を１００％とする（Ｓ３３）。一方、１００％以下の場合（Ｓ３２：ＮＯ）、抽出部１００は、対象データの信頼度に、Ｓ２１からＳ３１の処理により算出した信頼度を設定する。 When a word having a different case particle has a dependency relationship with the target data (S28: YES), the data extraction unit 12 determines whether or not the case particle of the word is a case particle indicating a human action (S29). . If it is a case particle indicating a human action (S29: YES), and the target data is personal name data, the data extraction unit 12 adds 25% to the reliability. If the target data is place name data, the data extraction unit 12 subtracts 25% from the reliability (S30). On the other hand, if it is a case particle indicating a place name (S29: NO) and the target data is personal name data, the data extraction unit 12 subtracts 25% from the reliability. If the target data is place name data, the data extraction unit 12 adds 25% to the reliability (S31).
Then, the data extraction unit 12 determines whether the reliability calculated in S21 to S31 exceeds 100% (S32). If it exceeds 100% (S32: YES), the data extraction unit 12 sets the reliability of the target data to 100% (S33). If it is 100% or less (S32: NO), the data extraction unit 12 sets the reliability of the target data to the value calculated in S21 to S31.
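The flow of S21 to S33 can be summarized in a short sketch. The function name and its parameters are hypothetical; each boolean stands for the outcome of one of the checks that the description performs against the reliability data of the parsing rule 25. The flow specifies only an upper cap of 100%, so this sketch does not clamp values below 0%.

```python
def reliability(kind, *, person_suffix=False, person_predicate=False,
                place_modifier=False, case_particle=None):
    """Score one candidate following steps S21 to S33.

    kind: "person" (personal name data) or "place" (place name data).
    case_particle: None when no case-dependent word governs the candidate
        (S28: NO); otherwise "person" or "place" (outcome of S29).
    """
    if kind not in ("person", "place"):
        raise ValueError("kind must be 'person' or 'place'")
    # Evidence for a person raises a personal name and lowers a place name;
    # evidence for a place does the opposite.
    sign = 1 if kind == "person" else -1

    score = 50                    # S21: initial value
    if person_suffix:             # S22 -> S23
        score += 25 * sign
    if person_predicate:          # S24 -> S25
        score += 25 * sign
    if place_modifier:            # S26 -> S27
        score -= 25 * sign
    if case_particle == "person":   # S29: YES -> S30
        score += 25 * sign
    elif case_particle == "place":  # S29: NO -> S31
        score -= 25 * sign
    return min(score, 100)        # S32 -> S33: cap at 100%


suzuki = reliability("person", person_suffix=True, person_predicate=True)
```

With a person suffix and a human-action predicate, a personal name scores 50 + 25 + 25 = 100, while a place name with the same cues would drop to 0.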

以上説明した信頼度の値に応じて、データ抽出部１２は、個人情報である可能性が高い人名データおよび地名データを最終的に抽出する。すなわち、データ抽出部１２は、信頼度が所定の値（例えば、５０％）以上の人名データおよび地名データを、個人情報であると判別し、他のデータに置換するために抽出する。 Based on the reliability value described above, the data extraction unit 12 finally extracts the personal name data and place name data that are highly likely to be personal information. That is, the data extraction unit 12 judges personal name data and place name data whose reliability is equal to or higher than a predetermined value (for example, 50%) to be personal information, and extracts them for replacement with other data.

次に、データ置換装置１が行う置換処理について説明する。 Next, replacement processing performed by the data replacement device 1 will be described.

なお、本処理を行う前に、ユーザは、ユーザ辞書２３および置換ルールテーブル２４を、各設定画面（図３、図５参照）を用いて、処理対象の文書ファイルの種類および属性を考慮した内容に変更する。
図７は、置換処理の処理フロー図である。 Before performing this process, the user uses the setting screens (see FIGS. 3 and 5) to change the user dictionary 23 and the replacement rule table 24 to contents that take into account the type and attributes of the document file to be processed.
FIG. 7 is a flowchart of the replacement process.

まず、データ読込部１１は、入力ファイルに記憶（蓄積）された文書ファイルを読み込む（Ｓ４１）。そして、データ抽出部１２は、前述の構文解析による個人情報の抽出および信頼度の算出（図６参照）を行う（Ｓ４２）。 First, the data reading unit 11 reads a document file stored (accumulated) in an input file (S41). Then, the data extraction unit 12 performs extraction of personal information and calculation of reliability (see FIG. 6) by the above-described syntax analysis (S42).

そして、データ抽出部１２は、正規表現による個人情報の抽出および信頼度の設定を行う（Ｓ４３）。すなわち、データ抽出部１２は、置換ルールテーブル２４に登録された正規表現文字パターン４３のいずれかに該当する文字列を、個人情報として抽出する。そして、データ抽出部１２は、正規表現により抽出した文字列の信頼度に所定の値（例えば、９９％）を設定する。 Then, the data extraction unit 12 extracts personal information using a regular expression and sets the reliability (S43). That is, the data extraction unit 12 extracts a character string corresponding to one of the regular expression character patterns 43 registered in the replacement rule table 24 as personal information. Then, the data extraction unit 12 sets a predetermined value (for example, 99%) for the reliability of the character string extracted by the regular expression.
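Step S43 can be illustrated with a small sketch. The two patterns below are hypothetical examples of what the regular expression character patterns 43 might contain; the actual patterns registered in the replacement rule table 24 are not given in this passage. The fixed reliability of 99 follows the example value in the text.

```python
import re

# Hypothetical rows of the replacement rule table: (rule name, pattern).
REGEX_RULES = [
    ("phone", re.compile(r"\d{2,4}-\d{2,4}-\d{4}")),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
]
REGEX_RELIABILITY = 99  # example value given in the description


def extract_by_regex(text):
    """Return one record per regular-expression match found in text."""
    hits = []
    for rule_name, pattern in REGEX_RULES:
        for m in pattern.finditer(text):
            hits.append({
                "text": m.group(),
                "start": m.start(),
                "rule": rule_name,
                "method": "regex",
                "reliability": REGEX_RELIABILITY,
            })
    return hits


hits = extract_by_regex("Call 03-1234-5678 or mail a.b@example.com")
```

The record fields loosely mirror the columns of the extracted character string table (extracted string, position, rule name, extraction method, reliability).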

そして、データ抽出部１２は、ユーザ辞書２３による個人情報の抽出および信頼度の設定を行う（Ｓ４４）。すなわち、データ抽出部１２は、ユーザ辞書２３に登録されたいずれかの文字列と一致する文字列を、個人情報として抽出する。なお、ユーザ辞書２３に登録されている文字列であっても、グループ名が「社員」で置換しないルールの場合、データ抽出部は当該文字列を抽出しない。そして、データ抽出部１２は、ユーザ辞書２３を用いて抽出した各文字列の信頼度に、所定の値を設定する。なお、ユーザ辞書２３に登録されているの文字列は個人情報に該当するため、高い信頼度（例えば、１００％）を設定する。 Then, the data extraction unit 12 extracts personal information using the user dictionary 23 and sets the reliability (S44). That is, the data extraction unit 12 extracts, as personal information, any character string that matches a character string registered in the user dictionary 23. However, even if a character string is registered in the user dictionary 23, the data extraction unit 12 does not extract it when its group name is “employee” and the applicable rule is a non-replacement rule. The data extraction unit 12 then sets a predetermined reliability for each character string extracted using the user dictionary 23. Because character strings registered in the user dictionary 23 correspond to personal information, a high reliability (for example, 100%) is set.
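A rough sketch of step S44 follows. The dictionary entries, the group-to-rule mapping, and the record layout are assumptions made for illustration; only the behavior (skip groups under a non-replacement rule, assign a fixed high reliability) comes from the description.

```python
# Hypothetical user dictionary: registered string -> group name.
USER_DICTIONARY = {"山本太郎": "customer", "須藤次郎": "employee"}
# Hypothetical per-group setting: False marks a non-replacement rule.
GROUP_IS_REPLACED = {"customer": True, "employee": False}
USER_DICT_RELIABILITY = 100  # example value given in the description


def extract_by_user_dictionary(text):
    """Return one record per occurrence of a registered, replaceable string."""
    hits = []
    for entry, group in USER_DICTIONARY.items():
        if not GROUP_IS_REPLACED.get(group, True):
            continue  # non-replacement group: do not extract
        start = text.find(entry)
        while start != -1:
            hits.append({
                "text": entry,
                "start": start,
                "rule": group,
                "method": "user dictionary",
                "reliability": USER_DICT_RELIABILITY,
            })
            start = text.find(entry, start + len(entry))
    return hits


hits = extract_by_user_dictionary("山本太郎様の件は須藤次郎が担当。山本太郎様に連絡。")
```

In this example both occurrences of the customer name are extracted, while the employee name is left in place.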

なお、Ｓ４２、Ｓ４３およびＳ４４の処理の順番は、これに限定されず、どのような順番で行ってもよい。 Note that the order of the processing of S42, S43, and S44 is not limited to this, and may be performed in any order.

そして、置換処理部１３は、データ抽出部１２が抽出した文字列の中で、信頼度が所定の値を超える文字列を置換する（Ｓ４５）。すなわち、置換処理部１３は、データ抽出部１２が算出した信頼度に基づいて、個人情報に該当する可能性が高い文字列のみを他のデータ（文字列）に置換する。これにより、個人情報に該当する可能性が高い文字列がマスクされ、置換後の文書ファイルを公開した場合であっても、個人情報の漏洩リスクを防止することができる。 Then, the replacement processing unit 13 replaces a character string whose reliability exceeds a predetermined value in the character string extracted by the data extracting unit 12 (S45). That is, based on the reliability calculated by the data extraction unit 12, the replacement processing unit 13 replaces only a character string that is highly likely to be personal information with other data (character string). As a result, a character string that is highly likely to correspond to personal information is masked, and the risk of leakage of personal information can be prevented even when the replaced document file is disclosed.

また、置換処理部１３は、置換ルールテーブル２４を参照し、抽出した文字列のルール名４２に対応する置換文字列４４を特定する。そして、置換処理部１３は、抽出した文字列を特定した置換文字列４４に置換する。また、置換処理部１３は、置換ルールテーブル２４の対応するＩＤフラグ４５を参照し、「有」の場合は置換文字列４４に識別情報（例えば、連番）を付加する。 Further, the replacement processing unit 13 refers to the replacement rule table 24 and specifies a replacement character string 44 corresponding to the rule name 42 of the extracted character string. Then, the replacement processing unit 13 replaces the extracted character string with the specified replacement character string 44. Further, the replacement processing unit 13 refers to the corresponding ID flag 45 of the replacement rule table 24, and adds “identification information” (for example, serial number) to the replacement character string 44 when “Yes”.
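The rule lookup and ID flag handling can be sketched as follows. The rule names, replacement strings, and the way the serial number is appended are hypothetical. Reusing the same serial number for a repeated original string follows the behavior the embodiment attributes to the ID flag 45 (identical strings receive identical identification information).

```python
# Hypothetical rule table: rule name -> (replacement string, ID flag).
REPLACEMENT_RULES = {
    "person": ("[person]", True),
    "phone": ("[phone]", False),
}


def build_replacements(extracted):
    """Map (original string, rule name) pairs to replacement strings.

    When a rule's ID flag is set, a serial number is appended, and the
    same original string always receives the same number.
    """
    serials = {}  # rule name -> {original string: serial number}
    replacements = []
    for original, rule_name in extracted:
        replacement, id_flag = REPLACEMENT_RULES[rule_name]
        if id_flag:
            per_rule = serials.setdefault(rule_name, {})
            # Assign the next number only the first time a string is seen.
            serial = per_rule.setdefault(original, len(per_rule) + 1)
            replacement = f"{replacement}{serial}"
        replacements.append(replacement)
    return replacements


result = build_replacements([
    ("鈴木", "person"),
    ("佐藤", "person"),
    ("鈴木", "person"),
    ("03-1234-5678", "phone"),
])
```

Here the two occurrences of the same name share one numbered label, so a reader of the masked document can still tell that the same person is mentioned twice.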

そして、データ出力部１４は、置換処理部１３が置換処理を行った置換後の文書ファイルを、出力ファイル２２に出力する（Ｓ４６）。そして、データ読込部１１は、入力ファイル２１に未処理の文書ファイルが存在するか否かを判別する（Ｓ４７）。未処理の文書ファイルが存在する場合（Ｓ４７：ＹＥＳ）、データ読込部１１は、Ｓ４１に戻り未処理の文書ファイルを読み込む。また、未処理の文書ファイルが存在しない場合（Ｓ４７：ＮＯ）、データ読込部１１は、置換処理を終了する。 Then, the data output unit 14 outputs the replaced document file subjected to the replacement process by the replacement processing unit 13 to the output file 22 (S46). Then, the data reading unit 11 determines whether or not an unprocessed document file exists in the input file 21 (S47). If there is an unprocessed document file (S47: YES), the data reading unit 11 returns to S41 and reads the unprocessed document file. If there is no unprocessed document file (S47: NO), the data reading unit 11 ends the replacement process.

次に、図８（Ａ）に示す文書ファイルのデータを例に、図７の処理を具体的に説明する。なお、図８（Ｂ）は、図８（Ａ）の文書ファイルの置換後の文書ファイルである。また、図４に示す置換ルールテーブル２４および図３に示すユーザ辞書２３を用いて置換処理を行うこととする。 Next, the processing of FIG. 7 will be described concretely, using the document file shown in FIG. 8(A) as an example. FIG. 8(B) shows the document file of FIG. 8(A) after replacement. The replacement process uses the replacement rule table 24 shown in FIG. 4 and the user dictionary 23 shown in FIG. 3.

まず、データ読込部１１は、図示する文書ファイルを読み込む（Ｓ４１）。そして、データ抽出部１２は、Ｓ４２、Ｓ４３およびＳ４４の処理を行い、個人情報に該当する可能性のある文字列を抽出する。 First, the data reading unit 11 reads the illustrated document file (S41). Then, the data extraction unit 12 performs the processes of S42, S43, and S44, and extracts a character string that may correspond to personal information.

図９は、データ抽出部１２が図示する文書ファイルから抽出した文字列が設定された抽出文字列テーブルの一例を示す図である。なお、データ抽出部１２は、必要に応じて抽出文字列テーブルを作成し、メモリまたは外部記憶装置に一時的に記憶する。図示する抽出文字列テーブルは、抽出文字列６１と、抽出した文字列の位置６２と、適用されるルール名６３と、抽出方法６４と、信頼度６５と、置換文字列６６とを有する。 FIG. 9 is a diagram illustrating an example of an extracted character string table in which character strings extracted from the document file illustrated by the data extraction unit 12 are set. The data extraction unit 12 creates an extracted character string table as necessary and temporarily stores it in a memory or an external storage device. The extracted character string table shown in the figure includes an extracted character string 61, a position 62 of the extracted character string, an applied rule name 63, an extraction method 64, a reliability 65, and a replacement character string 66.

データ抽出部１２は、構文解析により、例えば、図示する文書ファイルから「鈴木」８１を人名データの文字列として抽出し、抽出文字列６１ａに設定する。そして、データ抽出部１２は、「鈴木」を抽出した位置と、置換ルールテーブル２４の対応するルール名および抽出方法とを、抽出文字列テーブルに設定する。 By syntax analysis, the data extraction unit 12 extracts, for example, “Suzuki” 81 from the illustrated document file as a character string of personal name data and sets it in the extracted character string 61a. The data extraction unit 12 then sets the position from which “Suzuki” was extracted, together with the corresponding rule name and extraction method of the replacement rule table 24, in the extracted character string table.

そして、データ抽出部１２は、「鈴木」の信頼度を図６の処理により算出する。すなわち、データ抽出部１２、人名データの初期値（５０％）を設定する（Ｓ２１）。そして、データ抽出部１２は、図示する文書ファイルを参照し、「鈴木」８１の直後に人を示す接尾詞（「さん」）があるため、初期値（５０％）に２５％を加算する（Ｓ２３）。そして、データ抽出部１２は、「鈴木」８１を含む文を構文解析し、行為を示す述語（「聞く」）が「鈴木」８１と係り受けの関係にあるため、信頼度に２５％を加算する（Ｓ２５）。したがって、データ抽出部１２は、「鈴木」８１の信頼度を１００％とする。 Then, the data extraction unit 12 calculates the reliability of “Suzuki” by the process of FIG. 6. That is, the data extraction unit 12 sets the initial value (50%) for personal name data (S21). Referring to the illustrated document file, the data extraction unit 12 finds a suffix indicating a person (“san”) immediately after “Suzuki” 81 and therefore adds 25% to the initial value (S23). The data extraction unit 12 then parses the sentence containing “Suzuki” 81 and, because a predicate indicating a human action (“hear”) has a dependency relationship with “Suzuki” 81, adds a further 25% to the reliability (S25). The data extraction unit 12 therefore sets the reliability of “Suzuki” 81 to 100% (50% + 25% + 25%).

また、データ抽出部１２は、ユーザ辞書２３（グループ名：顧客）に一致する「山本太郎」８２の文字列を図示する文書ファイルから抽出し、抽出文字列６２ｂに設定する。そして、データ抽出部１２は、抽出した位置と、置換ルールテーブル２４の対応するルール名および抽出方法とを、抽出文字列テーブルに設定する。また、データ抽出部１２は、ユーザ辞書の所定の信頼度（例えば、１００％）を抽出文字列テーブルに設定する。 In addition, the data extraction unit 12 extracts from the illustrated document file the character string “Taro Yamamoto” 82, which matches the user dictionary 23 (group name: customer), and sets it in the extracted character string 62b. The data extraction unit 12 then sets the extracted position, together with the corresponding rule name and extraction method of the replacement rule table 24, in the extracted character string table. The data extraction unit 12 also sets the predetermined reliability for the user dictionary (for example, 100%) in the extracted character string table.

なお、図示する文書ファイルの「須藤次郎」８３の文字列は、ユーザ辞書２３（グループ名：社員）に一致する。しかしながら、図４に示す置換ルールテーブル２４の場合、社員（非置換）のルールが適用されるため、データ抽出部１２は、「須藤次郎」８３を抽出しない。また、「須藤」および「次郎」は、構文解析による人名データとして抽出される。しかしながら、図４に示す置換ルールテーブル２４の場合、社員（非置換）のルールの方が人名（置換）のルールより優先順位が高い。そのため、データ抽出部１２は、「須藤」および「次郎」を抽出しない。 The character string “Jiro Sudo” 83 in the illustrated document file matches the user dictionary 23 (group name: employee). However, in the case of the replacement rule table 24 shown in FIG. 4, the employee (non-replacement) rule is applied, so the data extraction unit 12 does not extract “Jiro Sudo” 83. “Sudo” and “Jiro” are extracted as personal name data by parsing. However, in the replacement rule table 24 shown in FIG. 4, the employee (non-replacement) rule has a higher priority than the person name (replacement) rule. Therefore, the data extraction unit 12 does not extract “Sudo” and “Jiro”.

また、データ抽出部１２は、置換ルールテーブル２４の正規表現文字パターンと一致する電話番号８４の文字列を図示する文書ファイルから抽出し、抽出文字列６２ｃに設定する。そして、データ抽出部１２は、抽出した位置と、置換ルールテーブル２４の対応するルール名および抽出方法とを、抽出文字列テーブルに設定する。また、データ抽出部１２は、正規表現の所定の信頼度（例えば、９９％）を抽出文字列テーブルに設定する。 In addition, the data extraction unit 12 extracts from the illustrated document file the character string of the telephone number 84, which matches a regular expression character pattern in the replacement rule table 24, and sets it in the extracted character string 62c. The data extraction unit 12 then sets the extracted position, together with the corresponding rule name and extraction method of the replacement rule table 24, in the extracted character string table. The data extraction unit 12 also sets the predetermined reliability for regular expressions (for example, 99%) in the extracted character string table.

そして、置換処理部１３は、置換ルールテーブル２４の対応する置換文字列を抽出文字列テーブルにそれぞれ設定する。そして、置換処理部３１は、信頼度６５が所定の値（例えば、５０％以上）の抽出文字列６１を、置換文字列６６に変換する。なお、置換処理部１３は、置換ルールテーブル２４のＩＤフラグを参照し、［人名］の置換文字列６６に識別情報（連番）を付加する。また、置換処理部１３は、重複して抽出された文字列が存在する場合（「山本太郎」６２ｂ、「山本」・「太郎」６２ｄ）、信頼度の高い置換文字列（［顧客］）に置換する。 Then, the replacement processing unit 13 sets the corresponding replacement character strings of the replacement rule table 24 in the extracted character string table. The replacement processing unit 13 then converts each extracted character string 61 whose reliability 65 meets a predetermined condition (for example, 50% or more) into its replacement character string 66. The replacement processing unit 13 refers to the ID flag of the replacement rule table 24 and adds identification information (a serial number) to the replacement character string 66 of [person name]. When the same text has been extracted more than once (“Taro Yamamoto” 62b and “Yamamoto” / “Taro” 62d), the replacement processing unit 13 uses the replacement character string with the higher reliability ([customer]).
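One way to realize the threshold check and the handling of overlapping extractions is sketched below. The record fields, the greedy selection strategy, and the use of "50 or more" as the threshold are assumptions; the description only states that the replacement character string with the higher reliability is used when extractions overlap.

```python
def select_for_replacement(hits, threshold=50):
    """Keep hits at or above the threshold, dropping any hit that overlaps
    a hit of higher reliability. Returns hits ordered by position."""
    eligible = [h for h in hits if h["reliability"] >= threshold]
    chosen = []
    # Visit the most reliable hits first so they win any overlap.
    for h in sorted(eligible, key=lambda h: -h["reliability"]):
        if all(h["end"] <= c["start"] or h["start"] >= c["end"] for c in chosen):
            chosen.append(h)
    return sorted(chosen, key=lambda h: h["start"])


def apply_replacements(text, hits, threshold=50):
    """Rebuild text with each selected span swapped for its replacement."""
    parts, pos = [], 0
    for h in select_for_replacement(hits, threshold):
        parts.append(text[pos:h["start"]])
        parts.append(h["replacement"])
        pos = h["end"]
    parts.append(text[pos:])
    return "".join(parts)


text = "山本太郎から電話"
hits = [
    {"start": 0, "end": 4, "reliability": 100, "replacement": "[customer]"},
    {"start": 0, "end": 2, "reliability": 50, "replacement": "[person]1"},
    {"start": 2, "end": 4, "reliability": 50, "replacement": "[person]2"},
]
masked = apply_replacements(text, hits)
```

In this example the full-name match from the user dictionary outranks the two partial name matches found by syntax analysis, so only the [customer] label appears in the output.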

そして、データ出力部１４は、置換後の文書ファイル（図８（Ｂ））を出力ファイル２２に出力する。 Then, the data output unit 14 outputs the replaced document file (FIG. 8B) to the output file 22.

以上、本発明の一実施形態を説明した。 The embodiment of the present invention has been described above.

本実施形態のデータ抽出部１２は、文書ファイルの構文解析を行い、解析した文の構造に基づいて、抽出した文字列（データ）の信頼度を算出する（図６参照）。これにより、個人情報をより高い精度で抽出することができる。すなわち、個人情報の抽出漏れ、および抽出ミスを減らし、個人情報の漏洩リスクをより低減することができる。 The data extraction unit 12 of the present embodiment performs syntax analysis of the document file and calculates the reliability of the extracted character string (data) based on the analyzed sentence structure (see FIG. 6). Thereby, personal information can be extracted with higher accuracy. In other words, it is possible to reduce the leakage of personal information and extraction errors, and further reduce the risk of leakage of personal information.

また、本実施形態では、個人情報に該当する可能性が高い文字列を、他の文字列に置換（マスク）する。これにより、置換後の文書ファイルを公開した場合であっても、個人情報を隠蔽することができる。 In the present embodiment, a character string that is highly likely to correspond to personal information is replaced (masked) with another character string. Thereby, even when the document file after replacement is disclosed, the personal information can be concealed.

また、本実施形態では、個人情報に該当する可能性が高い文字列を、当該文字列の種別・属性（人名、地名、電話番号など）に応じた置換文字列に変換する。これにより、個人情報を置換（隠蔽）した場合であっても、元の個人情報の種別または属性を把握することができる。 In the present embodiment, a character string that is highly likely to correspond to personal information is converted into a replacement character string corresponding to the type / attribute (person name, place name, telephone number, etc.) of the character string. Thereby, even when personal information is replaced (hidden), the type or attribute of the original personal information can be grasped.

また、本実施形態の置換ルールテーブル２４は、ＩＤフラグ４５を有する。これにより、個人情報が置換文字列に置換された場合であっても、置換前の文字列が同じか否かを識別することができる。例えば、１つの文書ファイルの中で、同じ文字列の個人情報が複数回出現する場合、置換後の文書ファイルには同一の識別情報が付加された置換文字列に変換されるため、同じ人名（または、同じ地名など）が繰り返し記載されていることを認識することができる。 Further, the replacement rule table 24 of the present embodiment has an ID flag 45. Thereby, even when the personal information is replaced with the replacement character string, it is possible to identify whether or not the character string before the replacement is the same. For example, if personal information of the same character string appears multiple times in one document file, it is converted to a replacement character string with the same identification information added to the replaced document file, so the same personal name ( Alternatively, it can be recognized that the same place name is repeatedly described.

なお、本発明は上記の実施形態に限定されるものではなく、その要旨の範囲内で数々の変形が可能である。例えば、上記の実施形態では、個人情報を当該個人情報の種別に応じた文字列に置換した。しかしながら、個人情報を文字以外のイメージデータに変換することとしてもよい。また、個人情報を削除または黒く塗りつぶすなどのマスク処理を行うこととしてもよい。 The present invention is not limited to the embodiment described above, and many modifications are possible within the scope of its gist. For example, in the above embodiment, personal information is replaced with a character string corresponding to the type of the personal information. However, the personal information may instead be converted into non-character image data. Masking, such as deleting the personal information or blacking it out, may also be performed.

また、本実施形態では、データ出力部１４は、置換後の文書ファイルを出力ファイル２２に出力する（図７：Ｓ４６）。しかしながら、データ出力部１４は、出力装置に置換後の文書ファイルを出力することとしてもよい。これにより、ユーザは、逐次、置換後の文書ファイルを確認することができる。また、データ出力部１４は、置換後の文書ファイルを出力する際に、図９に示すような抽出文字列テーブルを、置換後の文書ファイルとともに、出力装置に出力することとしてもよい。これにより、ユーザは、信頼度が低い文字列を、置換後の文書データを参照しながら逐次、置換指示を入力して置換することができる。 In this embodiment, the data output unit 14 outputs the replaced document file to the output file 22 (FIG. 7: S46). However, the data output unit 14 may output the replaced document file to the output device. Thereby, the user can confirm the document file after replacement sequentially. Further, when outputting the replaced document file, the data output unit 14 may output the extracted character string table as shown in FIG. 9 to the output device together with the replaced document file. Thus, the user can replace the character string with low reliability by sequentially inputting a replacement instruction while referring to the replaced document data.

本発明の一実施形態が適用されたデータ置換装置の構成を示す図である。 A diagram showing the configuration of a data replacement device to which an embodiment of the present invention is applied.
データ置換装置のハードウェア構成例を示す図である。 A diagram showing an example hardware configuration of the data replacement device.
ユーザ辞書設定画面の一例を示す図である。 A diagram showing an example of the user dictionary setting screen.
置換ルールテーブルの一例を示す図である。 A diagram showing an example of the replacement rule table.
置換ルールテーブルの設定画面の一例を示す図である。 A diagram showing an example of the setting screen for the replacement rule table.
構文解析の信頼度を算出する処理フロー図である。 A process flow diagram for calculating the reliability in syntax analysis.
置換処理の処理フロー図である。 A process flow diagram of the replacement process.
文書ファイルの一例を示す図である。 A diagram showing an example of a document file.
抽出文字列テーブルの一例を示す図である。 A diagram showing an example of the extracted character string table.

Explanation of symbols

１：データ置換装置、１１：データ読込部、１２：データ抽出部、１３：置換処理部、１４：データ出力部、１５：データ管理部、２１：入力ファイル、２２：出力ファイル、２３：ユーザ辞書、２４：置換ルールテーブル、２５：構文解析ルール
1: data replacement device, 11: data reading unit, 12: data extraction unit, 13: replacement processing unit, 14: data output unit, 15: data management unit, 21: input file, 22: output file, 23: user dictionary, 24: replacement rule table, 25: parsing rule

Claims

A data replacement device,
Reading means for reading document data;
Extracting means for parsing the document data and extracting personal information;
Replacement means for replacing each piece of personal information extracted by the extraction means with data different from the personal information;
Output means for outputting post-replacement document data replaced by the replacement means.

The data replacement device according to claim 1, wherein
A storage means for storing a personal name dictionary and a place name dictionary;
The extraction means determines whether each character string of the document data matches the personal name dictionary or the place name dictionary, and calculates, based on the structure of the sentence analyzed by the syntax analysis, a reliability indicating whether a character string that matches the personal name dictionary or the place name dictionary corresponds to personal information,
The data replacing device, wherein the replacing means replaces a character string having the reliability exceeding a predetermined value with data different from the character string.

The data replacement device according to claim 1, wherein
The replacing means replaces the personal information extracted by the extraction means with predetermined data according to the type of the personal information, and adds, to the predetermined data, identification information for identifying the personal information within each type.

The data replacement device according to claim 1, wherein
A storage means for storing a user dictionary in which at least one character string corresponding to personal information is registered, and a regular expression list in which a predetermined numbering system or character pattern is registered;
The extraction means determines a character string that matches the character string registered in the user dictionary as personal information, extracts it from the document data,
A data replacement device characterized in that a character string that matches a numbering system or a character pattern registered in the regular expression list is identified as personal information and extracted from the document data.

A data replacement method performed by an information processing apparatus,
The information processing apparatus includes a storage unit and a processing unit,
The processing unit performs:
A reading step of reading the document data stored in the storage unit;
An extraction step of parsing the document data to extract personal information;
A replacement step of replacing each piece of personal information extracted in the extraction step with data different from the personal information;
An output step of outputting the replaced document data replaced in the replacing step.

A data replacement program executed by the information processing apparatus,
The information processing apparatus includes a storage unit and a processing unit,
In the processing unit,
A reading step of reading the document data stored in the storage unit;
An extraction step of parsing the document data to extract personal information;
A replacement step of replacing each piece of personal information extracted in the extraction step with data different from the personal information;
An output step of outputting the replaced document data replaced in the replacing step is executed.