JP2015041145A

JP2015041145A - Personal information detection device and computer program

Info

Publication number: JP2015041145A
Application number: JP2013170585A
Authority: JP
Inventors: 俊彦佐々木; Toshihiko Sasaki
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2013-08-20
Filing date: 2013-08-20
Publication date: 2015-03-02
Anticipated expiration: 2033-08-20
Also published as: JP5420099B1

Abstract

PROBLEM TO BE SOLVED: To make the process of identifying the portions of a document to be masked more efficient. SOLUTION: An original document acquisition unit 32 acquires a document in which character strings for a plurality of items are recorded. A determination unit 54 determines whether personal information is included in the character strings of the items that remain after excluding the personal information items designated by a user, and detects each item that contains personal information as a personal information item. A detection result output unit 56 outputs the personal information items designated by the user and the personal information items detected by the determination unit 54 as the items to undergo mask processing of personal information. The determination unit 54 sequentially determines whether personal information is included in the character string of each item of a plurality of records in the document. When personal information is included in a specific item of a certain record, it determines that item to be a personal information item and excludes it from the determination for the character strings of the same item in subsequent records.

Description

The present invention relates to data processing technology, and more particularly to a technique for detecting personal information recorded in a document.

In the operation phase of an information system, data accumulated in the production environment (in other words, the commercial environment) may be extracted as test data for maintenance work. The extracted data may then be loaded into a development environment (in other words, a test environment), where various tests are performed.

Data accumulated in the production environment may contain personal information. Now that the protection of personal information is emphasized, techniques for masking personal information have been proposed.

JP 2013-105274 A; JP 2011-034264 A

Before personal information in a document can be masked, it is necessary to identify where in the document the character strings representing the personal information to be masked are written. Documents containing personal information accumulated in the production environment can be very large, and identifying the locations of personal information in such a document can take a great deal of time. The period allowed for testing within a system development schedule is limited, and the present inventor recognized the need to make the process of identifying the mask target portions in a document more efficient.

The present invention has been made based on the inventor's recognition of the above problem, and its main object is to provide a technique that makes the process of identifying mask target portions in a document more efficient.

To solve the above problem, a personal information detection device according to one aspect of the present invention includes: a document acquisition unit that acquires a document in which character strings for a plurality of items are recorded; a designation acquisition unit that acquires information indicating the personal information items, that is, the items designated by a user as containing personal information among the plurality of items; a determination unit that determines whether personal information is included in the character strings of the other items, excluding the personal information items designated by the user, and detects each item that contains personal information as a personal information item; and an output unit that outputs the personal information items designated by the user and the personal information items detected by the determination unit as the items on which mask processing of personal information should be performed. The document acquired by the document acquisition unit includes a plurality of records, and each record includes character strings for the plurality of items. The determination unit sequentially determines whether personal information is included in the character string of each item of the plurality of records. When it determines that personal information is included in a specific item of a certain record, it detects that item as a personal information item and excludes the character strings of that item in the remaining records from the determination of whether personal information is included.
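The record-by-record determination with early exclusion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `detect_personal_items`, the item names, and the telephone-number regex used as a stand-in detector are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Sequence

def detect_personal_items(
    records: Iterable[Sequence[str]],
    item_names: Sequence[str],
    user_designated: set,
    contains_personal_info: Callable[[str], bool],
) -> set:
    """Return the items to mask: user-designated items plus detected ones."""
    detected = set()
    # Only items the user did not designate need to be examined at all.
    pending = [i for i, name in enumerate(item_names) if name not in user_designated]
    for record in records:
        if not pending:
            break  # every remaining item has already been classified
        still_pending = []
        for i in pending:
            if contains_personal_info(record[i]):
                detected.add(item_names[i])  # skipped in all later records
            else:
                still_pending.append(i)
        pending = still_pending
    return set(user_designated) | detected

# Hypothetical detector: treats anything resembling a phone number as personal.
PHONE_RE = re.compile(r"\d{2,4}-\d{2,4}-\d{4}")
calls = []

def looks_personal(text: str) -> bool:
    calls.append(text)
    return bool(PHONE_RE.search(text))

items = ["ID", "NAME", "MEMO", "PHONE"]
rows = [
    ["1", "A", "call 03-1234-5678", "none"],
    ["2", "B", "x", "03-1111-2222"],
    ["3", "C", "y", "03-0000-0000"],
]
to_mask = detect_personal_items(rows, items, {"NAME"}, looks_personal)
```

In this sample the detector runs on six field values instead of twelve, because NAME is never examined and MEMO and PHONE drop out of the scan as soon as one hit is found.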

Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, a system, a program, a recording medium storing the program, and the like, are also effective as aspects of the present invention.

According to the present invention, the process of identifying mask target portions in a document can be made more efficient.

A diagram showing the configuration of the information system of the first embodiment.
A block diagram showing the functional configuration of the data conversion device of FIG. 1.
A diagram showing an extraction result of personal information.
A diagram showing a correspondence rule.
A diagram showing a replacement rule.
A diagram showing a user setting screen.
A flowchart showing the operation of the data conversion device.
A flowchart showing the operation following FIG. 7(a).
A diagram schematically showing mask processing of a plurality of tables.
A diagram schematically showing mask processing of log data.
A diagram showing original document data before personal information is masked.
A block diagram showing the functional configuration of the data conversion device of the second embodiment.

Below, the first embodiment describes a technique for suppressing the loss of quality as test data in a document whose personal information has been masked. The second embodiment describes a technique for making the process of identifying mask target portions in a document more efficient, in order to reduce the total time required for mask processing of personal information.

(First Embodiment)
FIG. 1 shows the configuration of an information system 100 according to the first embodiment. The information system 100 provides, for example, an information processing service for retailers or financial institutions, and an SI company is responsible for its development, operation, and maintenance. The information system 100 includes a production machine 10, a development machine 12, and a data conversion device 14.

The production machine 10 is an information processing apparatus installed in the production environment, such as a web server, an application server, or a database server. The production machine 10 provides commercial information processing services to client companies and end users, and holds document data containing various information that constitutes personal information of those client companies and end users. This document data includes data of tables managed by the database server. It also includes CSV files, free-format log files, fixed-length files, and the like.

The development machine 12 is an information processing apparatus installed in the development environment. It is used to perform work such as trouble analysis, bug fixing, and feature addition (hereinafter collectively referred to as "maintenance work") on the applications installed on the production machine 10. In the embodiment, to make maintenance work on the development machine 12 more efficient, document data corresponding to the document data held in the production environment is used as the test data for that work.

The data conversion device 14 acquires document data in the production environment (hereinafter also referred to as "original document data") from the production machine 10. It then converts the original document data into document data in which the personal information is masked (hereinafter also referred to as "test document data") and records the result on a recording medium 16. The data conversion device 14 may be an ordinary PC. A staff member of the SI company loads the test document data recorded on the recording medium 16 into the development machine 12 and carries out maintenance work there.

FIG. 2 is a block diagram showing the functional configuration of the data conversion device 14 of FIG. 1. The data conversion device 14 includes a data holding unit 20, which is a storage area holding various data, and a data processing unit 30, which executes various data processing. The data holding unit 20 includes an extracted data holding unit 22, a correspondence holding unit 24, a replacement rule holding unit 26, and an exclusion rule holding unit 28. The data processing unit 30 includes an original document acquisition unit 32, a personal information detection unit 34, a replacement rule determination unit 36, a replacement data acquisition unit 38, a document conversion unit 40, a converted document output unit 42, and a user setting support unit 44.

Each block shown in the block diagrams of this specification can be realized in hardware by elements such as a computer's CPU or by mechanical devices, and in software by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software. For example, each block in FIG. 2 may be stored on a recording medium as a program module and installed from that medium into the storage of the data conversion device 14. The data conversion device 14 may then realize the function of each block by reading the corresponding program module into main memory as needed and executing it on the CPU.

The extracted data holding unit 22 holds the results of extracting the personal information contained in the original document data. FIG. 3 shows an extraction result of personal information. The record number field holds the record number in the original document data. This may be, for example, a number indicating the line position in a CSV file, or an identification number assigned to each record of a table managed in a database. The item name field holds the name of the information item of the original document data in which the personal information is set. The character string field holds the character string detected as personal information. The position field holds the position at which the personal information string is set within the information item of the original document data, specifically the byte offset counting the first byte as 1. The detection type field holds information identifying the kind of personal information, for example whether it is a person name, place name, telephone number, organization, email address, or the like.

Returning to FIG. 2, the correspondence holding unit 24 holds correspondence rules that associate each detection type of personal information with a mask pattern, which indicates the kind and form of the replacement used when that personal information is replaced with another character string (in this embodiment, "character string" includes digit strings). The correspondence rules are predetermined in the data conversion device 14, but the user can also change them through the user setting support unit 44 described later. FIG. 4 shows the correspondence rules. As shown in the figure, each detection type is in principle associated with a mask pattern that matches the attributes of the character string detected as personal information (for example, a random kanji string mask pattern for personal information written in kanji).

When the mask pattern is a "custom pattern", for which the user can set an arbitrary format, the rule additionally holds a custom character string indicating the form of the string decided by the user. For example, the custom character string "<2n>-<4n>-<4n>" denotes the concatenation of a random digit string of length 2, "-", a random digit string of length 4, "-", and a random digit string of length 4. Likewise, the custom character string "<5a>@<3a>.<2a>.<2a>" denotes the concatenation of a random ASCII string of length 5, "@", a random ASCII string of length 3, ".", a random ASCII string of length 2, ".", and a random ASCII string of length 2.
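As a rough illustration of how such a custom character string might be expanded into mask data, the sketch below replaces each "<Nn>" token with N random digits and each "<Na>" token with N random ASCII letters. The function name `expand_custom_pattern` and the choice of lowercase letters for "a" tokens are assumptions; the text does not specify which ASCII characters are drawn.

```python
import random
import re
import string

# Matches tokens such as <2n> (digits) or <5a> (ASCII letters).
_TOKEN = re.compile(r"<(\d+)([na])>")

def expand_custom_pattern(pattern: str, rng: random.Random) -> str:
    """Expand every <Nn>/<Na> token; all other characters are kept literally."""
    def fill(match):
        length = int(match.group(1))
        # Assumption: "a" tokens draw from lowercase ASCII letters only.
        alphabet = string.digits if match.group(2) == "n" else string.ascii_lowercase
        return "".join(rng.choice(alphabet) for _ in range(length))
    return _TOKEN.sub(fill, pattern)

rng = random.Random(0)
phone_mask = expand_custom_pattern("<2n>-<4n>-<4n>", rng)
mail_mask = expand_custom_pattern("<5a>@<3a>.<2a>.<2a>", rng)
```

Literal separators such as "-", "@", and "." pass through unchanged, so the masked value keeps the shape of a telephone number or email address.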

Returning to FIG. 2, the replacement rule holding unit 26 holds replacement rules that associate each information item in the original document data with a mask pattern. FIG. 5 shows the replacement rules. The item name field holds the name of the information item in the original document data. The target field holds information indicating whether the information item contains personal information (TRUE) or not (FALSE). The maximum detection type field holds the kind of personal information most frequently determined as the detection type for the information item, and the detection count field holds the number of detections of that type. The mask pattern field holds the mask pattern used to replace the personal information; for a custom pattern, the custom character string is also recorded.

In the replacement rules of FIG. 5, a hash value is specified as the mask pattern for the item name "ACCOUNT_NUMBER". This is because the data of the item "ACCOUNT_NUMBER" is also used in another table, not shown in FIG. 5, and serves as the key that relates the two tables. How masking personal information with a hash value preserves, in the test document data, the relationships among multiple tables in the original document data is described later with reference to FIG. 8 and other figures.

Returning to FIG. 2, the exclusion rule holding unit 28 holds exclusion rules for identifying, among the character strings detected as personal information, those that should be excluded from mask processing. The exclusion rules of this embodiment define one or more character strings that should not be masked (hereinafter also referred to as "non-mask-target strings"). In this embodiment, a character string detected as personal information is excluded from mask processing when it exactly matches a non-mask-target string. As a modification, a character string that contains a non-mask-target string as a part may be excluded from mask processing, and when a non-mask-target string is given as a regular expression, the character strings matched by that regular expression may be excluded from mask processing.
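The three matching behaviours described above (exact match in the embodiment, partial match and regular expression as modifications) can be sketched as follows. The function name `is_excluded` and the `mode` parameter are hypothetical and only serve to contrast the variants.

```python
import re
from typing import Sequence

def is_excluded(candidate: str, exclusions: Sequence[str], mode: str = "exact") -> bool:
    """Return True if the detected string should be left unmasked."""
    if mode == "exact":      # behaviour of the embodiment
        return candidate in exclusions
    if mode == "contains":   # modification: exclusion string appears as a part
        return any(ex in candidate for ex in exclusions)
    if mode == "regex":      # modification: exclusion given as a regular expression
        return any(re.fullmatch(ex, candidate) is not None for ex in exclusions)
    raise ValueError(f"unknown mode: {mode}")

exact_hit = is_excluded("Nomura", ["Nomura"])
exact_miss = is_excluded("Nomura Taro", ["Nomura"])
partial_hit = is_excluded("Nomura Taro", ["Nomura"], mode="contains")
regex_hit = is_excluded("TEST-001", [r"TEST-\d+"], mode="regex")
```

Exact matching is the most conservative choice, since a partial or regular-expression match can leave more detected strings unmasked than intended.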

The original document acquisition unit 32 acquires the original document data from the production machine 10. As already described, the original document data may be records stored in database tables, or various kinds of file data such as CSV files, fixed-length files, and free-format log files.

The personal information detection unit 34 detects the personal information contained in the original document data and records the detection results in the extracted data holding unit 22 in the format shown in FIG. 3. The personal information detection unit 34 may be realized by known personal information extraction means. For example, it may be realized by "TRUE TELLER Personal Information Filter (registered trademark)", a software product provided by Nomura Research Institute, Ltd.

The replacement rule determination unit 36 refers to the personal information detection results stored in the extracted data holding unit 22 and the correspondence rules stored in the correspondence holding unit 24, determines the replacement rules for the personal information contained in the original document data, and records them in the replacement rule holding unit 26. Specifically, for each information item of the original document data, it determines whether personal information was detected (for example, whether a detection type was recorded) and records the result. For each information item, it also counts the number of detections of each detection type, then determines and records the maximum detection type. It then identifies and records the mask pattern (and custom character string) associated with the maximum detection type according to the correspondence rules stored in the correspondence holding unit 24.

To keep the attributes of the character string before masking (a string containing personal information, hereinafter also referred to as the "original character string") close to those of the string after masking, the replacement rule determination unit 36, when determining the maximum detection type, refines the detection type identified by the personal information detection unit 34 according to the attributes of the characters. For example, if the maximum detection type in the detection results is [person name] and the string set in the character string field is in kanji, it records the maximum detection type [person name]KANJI. If the maximum detection type is [person name] and the string set in the character string field is in hiragana, it records the maximum detection type [person name]KANA.
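The counting and refinement steps above might look like the following sketch. The tuple layout of `detections`, the English type labels, and the Unicode ranges used to recognize kanji and hiragana are assumptions made for illustration; unlike the text, which gives only person names as an example, the sketch applies the refinement to any detection type.

```python
from collections import Counter, defaultdict

def refine(detection_type: str, text: str) -> str:
    """Append KANJI or KANA when the whole string is written in that script."""
    if text and all("\u4e00" <= ch <= "\u9fff" for ch in text):
        return detection_type + "KANJI"   # CJK Unified Ideographs block
    if text and all("\u3040" <= ch <= "\u309f" for ch in text):
        return detection_type + "KANA"    # Hiragana block
    return detection_type

def decide_rules(detections, correspondence):
    """Build one replacement rule per item from (item, type, string) detections."""
    counts = defaultdict(Counter)
    for item, detection_type, text in detections:
        counts[item][refine(detection_type, text)] += 1
    rules = {}
    for item, counter in counts.items():
        max_type, count = counter.most_common(1)[0]
        rules[item] = {
            "target": True,
            "max_type": max_type,
            "count": count,
            "mask_pattern": correspondence.get(max_type),
        }
    return rules

detections = [
    ("NAME", "[person name]", "佐藤"),
    ("NAME", "[person name]", "鈴木"),
    ("NAME", "[person name]", "さとう"),
    ("TEL", "[telephone]", "03-1234-5678"),
]
correspondence = {
    "[person name]KANJI": "random kanji string",
    "[telephone]": "<2n>-<4n>-<4n>",
}
rules = decide_rules(detections, correspondence)
```

Here NAME receives the kanji mask pattern because two of its three detections are kanji names, which mirrors the "most frequently determined" rule for the maximum detection type.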

The replacement data acquisition unit 38 refers to the replacement rules stored in the replacement rule holding unit 26 and, for each record and each information item of the original document data, acquires the replacement data used to mask the personal information (hereinafter also referred to as "mask data"). For example, when the mask pattern is a random character string (kanji), it acquires as mask data a random kanji string whose length corresponds to the length of the original string (the same length in this embodiment).

When the mask pattern is a hash value, the original character string is input to a hash function (SHA-2 in the embodiment), and a character string representing the resulting hash value (hereinafter also referred to as a "hash string") is acquired as mask data. Another kind of hash function may be used instead, for example SHA-1 or MD5. The hash string may be a HEX string giving a hash value of predetermined length in hexadecimal notation, or it may be a digit string. The replacement data acquisition unit 38 may also acquire, as mask data, the result of trimming the hash string to a length corresponding to the length of the original string.
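A minimal sketch of hash-based mask data follows, assuming SHA-256 as the SHA-2 variant, UTF-8 encoding of the original string, and a hexadecimal hash string; the text fixes none of these details, and the function name `hash_mask` is hypothetical.

```python
import hashlib
from typing import Optional

def hash_mask(original: str, length: Optional[int] = None) -> str:
    """Return the hex hash string of `original`, trimmed to the requested length.

    When `length` is omitted, the result is trimmed to the length of the
    original string (it cannot exceed the 64 hex characters of SHA-256).
    """
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
    n = len(original) if length is None else length
    return digest[:n]

masked_account = hash_mask("1002003")
masked_again = hash_mask("1002003")
eight_chars = hash_mask("佐藤太郎", length=8)
```

The same input always yields the same mask data, which is the property that random mask patterns lack. Trimming shortens the hash and therefore raises the chance that two different originals receive the same mask.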

As described later with reference to FIG. 6, a hash value may also be specified when the mask pattern is a custom pattern. In that case, the replacement data acquisition unit 38 computes the hash value of the original character string, trims the string representing that hash value to the length specified by the custom character string, and acquires the result as mask data.

For each record and each information item of the original document data, the document conversion unit 40 replaces the character string detected as personal information by the personal information detection unit 34 with the mask data acquired by the replacement data acquisition unit 38. The original document data is thereby converted into test document data in which the personal information is masked.

The document conversion unit 40 also determines whether an original character string to be converted in the original document data matches a non-mask-target string defined by the exclusion rules of the exclusion rule holding unit 28. If it does not match, the original character string is replaced with the mask data. If it matches, replacement of that original character string is suppressed. In other words, mask processing of that original character string is skipped, and processing moves on to the next string to be converted.

The converted document output unit 42 records the test document data generated by the document conversion unit 40 on the recording medium 16. For example, when the original document data is table data of a database, the table data with personal information masked is stored on the recording medium 16 as test document data. When the original document data is a free-format log file, the log file data with personal information masked is stored on the recording medium 16 as test document data. The test document data recorded on the recording medium 16 is loaded into the development machine 12 and used in maintenance work there (for example, as input data or reference data for tests).

The user setting support unit 44 supports the user's setting operations on the correspondence rules held in the correspondence holding unit 24 and the replacement rules held in the replacement rule holding unit 26. Specifically, it displays a user setting screen for editing the correspondence rules and replacement rules on a predetermined display, and reflects the information the user enters on that screen in the correspondence rules of the correspondence holding unit 24 and the replacement rules of the replacement rule holding unit 26.

FIG. 6 shows a user setting screen. The figure shows the screen for editing the correspondence rules held in the correspondence holding unit 24. When the user enters the contents shown in the figure, the correspondence is updated so that the detection type [person name]KANJI is associated with a mask pattern consisting of an 8-character hash string. A hash string can also be selected from the pull-down menu of mask patterns.

The user setting support unit 44 also displays a user setting screen on which the user enters non-mask-target strings. It acquires the non-mask-target strings the user entered on that screen and adds them to the exclusion rules of the exclusion rule holding unit 28.

The operation of the data conversion device 14 configured as above is described below.
FIG. 7(a) is a flowchart showing the operation of the data conversion device 14. When the data conversion device 14 detects a user operation instructing it to determine mask patterns for document data in the production environment (Y in S10), the original document acquisition unit 32 acquires the original document data designated by that operation from the production machine 10 (S12). The personal information detection unit 34 detects the personal information written in the original document data and records its attribute information, including its position, in the extracted data holding unit 22 (S14). The replacement rule determination unit 36 determines the mask pattern for each piece of personal information based on the attributes of the personal information detected by the personal information detection unit 34 and the correspondence rules held in the correspondence holding unit 24, and records the replacement rule for each piece of personal information in the replacement rule holding unit 26 (S16). If no user operation instructing the determination of mask patterns is detected (N in S10), S12 to S16 are skipped.

When the data conversion device 14 detects a user operation instructing a change to the mask processing settings (Y in S18), the user setting support unit 44 displays the user setting screen (S20). The user setting support unit 44 reflects the correspondence rule updates entered on the user setting screen in the correspondence holding unit 24, or reflects the replacement rule updates entered on the user setting screen in the replacement rule holding unit 26 (S22). If no user operation instructing a change to the mask processing settings is detected (N in S18), S20 and S22 are skipped. Although not shown in FIG. 7(a), when a non-mask-target string is entered on the user setting screen, the user setting support unit 44 records it in the exclusion rule holding unit 28.

Typically, the user of the data conversion device 14 sets a hash value as the mask pattern for personal information items that are described in a plurality of locations in the original document data and whose mutual relationship should be maintained. More specifically, a hash value is set as the mask pattern for key information used to associate a plurality of tables in a relational database (for example, a foreign key in a first table that is a primary key in a second table). As a result, the relationships between items in the original document data, for example the relations between a plurality of tables in a relational database, can be maintained in the test document data after masking.

FIG. 7B is a flowchart showing the operation following FIG. 7A. When a user operation instructing the start of mask processing is detected in the data conversion device 14 (Y in S24), the replacement data acquisition unit 38 acquires, for each original character string of personal information described in the original document data, mask data based on the mask pattern corresponding to its attribute. If an original character string detected as personal information does not match any non-maskable character string recorded in the exclusion rule holding unit 28 (N in S26), the document conversion unit 40 replaces the original character string with the mask data (S28). Specifically, starting from the head position of the original character string recorded in the extracted data holding unit 22, data corresponding to the length of the original character string (that is, the original character string itself) is replaced with the character string of the mask data. If the original character string detected as personal information matches a non-maskable character string (Y in S26), S28 is skipped.

When replacement, or skipping of replacement, has been completed for the original character strings of all personal information described in the original document data (Y in S30), the converted document output unit 42 outputs the test document data to the recording medium 16 (S32). If an unprocessed original character string of personal information remains (N in S30), the process returns to S26, and the replacement data acquisition unit 38 acquires mask data for the unprocessed original character string. If no user operation instructing the start of mask processing is detected (N in S24), S26 to S32 are skipped.
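The replacement loop of S26 to S30 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the `Detection` record, the `mask_document` function, and the `mask_for` callback are hypothetical names, and the sample log line is invented. The sketch processes detections from the end of the document backwards, which is a design choice made here so that recorded positions stay valid even if mask data differs in length from the original.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass
class Detection:
    """One detected piece of personal information (hypothetical record layout)."""
    start: int      # head position of the original character string
    text: str       # the original character string itself
    attribute: str  # attribute such as "name" or "address"


def mask_document(
    document: str,
    detections: Iterable[Detection],
    mask_for: Callable[[Detection], str],
    excluded: Set[str],
) -> str:
    """Replace each detected string with mask data unless it is non-maskable."""
    result = document
    # Work from the last detection to the first so earlier offsets stay valid.
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        if det.text in excluded:  # corresponds to Y in S26: skip replacement
            continue
        end = det.start + len(det.text)
        result = result[:det.start] + mask_for(det) + result[end:]  # S28
    return result


# Invented sample data: a person name and a product name that looks like one.
log = "user=Yamada item=Yamada300"
found = [
    Detection(log.index("Yamada"), "Yamada", "name"),
    Detection(log.index("Yamada300"), "Yamada300", "name"),
]
masked = mask_document(log, found, lambda d: "X" * len(d.text), {"Yamada300"})
print(masked)
```

Registering the product name as a non-maskable character string leaves it untouched while the person name is still replaced.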

According to the data conversion device 14 of the present embodiment, the user does not need to define the correspondence between each information item of the original document data and a mask pattern one by one. That is, the personal information included in the original document data is detected automatically, and the mask pattern of each piece of personal information is determined automatically based on the correspondence rules between personal information attributes and mask patterns. This suppresses human error (typically, omissions in mask settings). For example, a spare item may be provided in the original document data, left unused during initial development, and later changed to hold personal information when a function is added. When mask patterns are set manually, the spare item is easily overlooked, and omissions in mask settings are likely to occur. The data conversion device 14 suppresses such human error by automating both the detection of character strings to be masked and the determination of the mask data that should replace them.

In addition, on the user setting screen, the user can edit not only the replacement rules but also the correspondence rules on which the replacement rules are based. By editing a correspondence rule, the user can set the mask pattern for a plurality of information items of the original document data that share the same personal information attribute in a single operation, which makes the user's work for mask processing more efficient.

Further, according to the data conversion device 14, each of a plurality of personal information items that are described in different locations in the original document data, and whose mutual relationship should be maintained even after masking, is masked with the hash value of the character string indicating the personal information. As a result, the relationships between information items in the original document data can be maintained in the test document data after the personal information has been masked, and deterioration in quality as test data can be suppressed.

FIG. 8 schematically shows mask processing of a plurality of tables. The name information table and the balance information table are related to each other with the account number as the key. FIG. 8A shows the name information table and the balance information table as original document data and indicates, for example, that the balance for the name "Sasaki" is "300,000". FIG. 8B shows the result of masking the account numbers of the name information table and the account numbers of the balance information table with random values; the relationship between the name information table and the balance information table has been lost.

In contrast, FIG. 8C shows the result of masking the account numbers of the name information table and the account numbers of the balance information table with hash values. As shown in the figure, the original values of the account numbers in both tables are concealed, yet the relationship between the tables is maintained. For example, it can still be identified after masking that the balance for the name "XXX" (the masked form of "Sasaki") is "300,000". Using the name information table and the balance information table of FIG. 8C as test document data makes it easier to carry out tests in the development environment that reflect the production environment.
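The effect of hash-based masking on the table relationship can be illustrated with a short sketch. The table contents, the column name `account_no`, the choice of SHA-256, and the helper names `hash_mask` and `mask_column` are assumptions made for illustration; the description does not specify a particular hash function or the values shown in FIG. 8.

```python
import hashlib
from typing import Dict, List


def hash_mask(value: str) -> str:
    """Mask a value with a hash, trimmed to the original length (assumed policy)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[: len(value)]


def mask_column(rows: List[Dict], column: str) -> List[Dict]:
    """Return copies of the rows with the given column replaced by its hash mask."""
    return [{**row, column: hash_mask(row[column])} for row in rows]


# Hypothetical stand-ins for the name information and balance information tables.
names = [
    {"account_no": "1000001", "name": "XXX"},  # name already masked
    {"account_no": "1000002", "name": "YYY"},
]
balances = [
    {"account_no": "1000002", "balance": 120000},
    {"account_no": "1000001", "balance": 300000},
]

masked_names = mask_column(names, "account_no")
masked_balances = mask_column(balances, "account_no")


def balance_of(name: str) -> int:
    """Join the two masked tables on the masked account number."""
    key = next(r["account_no"] for r in masked_names if r["name"] == name)
    return next(r["balance"] for r in masked_balances if r["account_no"] == key)


print(balance_of("XXX"))
```

Because the same input always yields the same hash, both tables receive the same masked key and the join still succeeds, which independent random values would not guarantee.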

Further, according to the data conversion device 14, personal information can be masked automatically even in free-format file data (for example, log data) in which the position where personal information appears is not fixed. In addition, the user can designate an arbitrary character string as a non-maskable character string, which helps to achieve both protection of personal information and suppression of deterioration in the quality of test data.

FIG. 9 schematically shows mask processing of log data. FIG. 9A shows original log data in the production environment containing two log messages, which differ in format. FIG. 9B shows the test data obtained after masking the personal information included in the log data of FIG. 9A. As described above, the personal information detection unit 34 records the attribute and description position of each piece of personal information included in the original log data. The replacement data acquisition unit 38 then acquires mask data corresponding to the attribute of the personal information, and the document conversion unit 40 identifies the description position of the personal information and replaces it with the mask data. In this way, masking of personal information in free-format file data can be realized.

Note that in FIG. 9B, the product name "Yamada 300", which should not be masked, has also been masked as personal information. By designating the character string "Yamada 300" as a non-maskable character string, the user can have the product name "Yamada 300" output as is in the test log data.

Although not mentioned above, the personal information detection unit 34 may additionally output a personal information detection list recording the character strings detected as personal information, as shown in FIG. 9C. The personal information detection list may be shown on a display and presented to the user. The personal information extraction result stored in the extracted data holding unit 22 may be used in place of this list. Presenting the personal information detection list to the user helps the user designate non-maskable character strings appropriately.

The present invention has been described above based on the first embodiment. This embodiment is an example, and those skilled in the art will understand that various modifications can be made to the combinations of its constituent elements and processing steps, and that such modifications are also within the scope of the present invention. Modifications are described below.

A first modification of the first embodiment will be described.
In the above embodiment, the user designates the personal information items to be masked with a hash value. Alternatively, the replacement rule determination unit 36 may automatically determine the personal information items to be masked with a hash value and record them in the replacement rules. For example, it may refer to the definition information of a relational database and detect referential integrity constraints set between tables in that database. It may then set a hash value as the mask pattern for the columns on which a referential integrity constraint is set (typically, both the foreign key column in the first table and the primary key column in the second table).
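As one possible illustration of this modification, the sketch below reads foreign key definitions from a SQLite database and collects the columns on both sides of each constraint. SQLite and its `PRAGMA foreign_key_list` statement are used here only as a convenient example of "definition information"; the description does not name a database product. The function name `hash_mask_columns` and the table and column names are hypothetical.

```python
import sqlite3
from typing import Set, Tuple


def hash_mask_columns(conn: sqlite3.Connection) -> Set[Tuple[str, str]]:
    """Collect (table, column) pairs that take part in a foreign key constraint."""
    columns: Set[Tuple[str, str]] = set()
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    for table in tables:
        # Each row is (id, seq, referenced_table, from_column, to_column, ...).
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            ref_table, from_col, to_col = row[2], row[3], row[4]
            columns.add((table, from_col))       # foreign key side
            if to_col is not None:               # None if the parent column is implicit
                columns.add((ref_table, to_col)) # primary key side
    return columns


# Hypothetical schema modelled on the name and balance information tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE name_info (account_no TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE balance_info (
        account_no TEXT REFERENCES name_info(account_no),
        balance INTEGER
    );
    """
)
print(sorted(hash_mask_columns(conn)))
```

The returned pairs are the columns for which a hash value would be recorded as the mask pattern in the replacement rules.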

A second modification of the first embodiment will be described.
As partly described in the above embodiment, when the mask data is obtained by trimming a hash character string to a length corresponding to the length of the original character string, different original character strings may be converted to the same mask data. As a modification, therefore, the replacement data acquisition unit 38 may hold a table (hereinafter, "allocation history table") that associates each original character string with the mask data obtained by trimming its hash value.

When the replacement data acquisition unit 38 is to acquire mask data (here, a character string obtained by trimming a hash character string) for a certain original character string (referred to as "the original character string in question"), it refers to the allocation history table and determines whether mask data has already been allocated to a character string that matches the original character string in question. If so, it allocates that already-allocated mask data to the original character string in question.

If no character string matching the original character string in question is recorded in the allocation history table, the replacement data acquisition unit 38 acquires mask data by trimming the hash value of the original character string in question. It then refers to the allocation history table and determines whether that mask data has already been allocated to another original character string. If it has not, the mask data is allocated to the original character string in question and recorded in the allocation history table. If it has already been allocated to another original character string, the hash value of the original character string in question is input to the hash function, and the resulting hash value is acquired as new mask data. This processing is repeated until unique mask data is obtained.

According to this modification, when the mask pattern is set to a hash value and the hash character string is trimmed, allocating the same mask data to different original character strings can be avoided. This prevents information items that are unrelated in the original document data from appearing related in the test document data.
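The allocation history table of this modification can be sketched as below. This is an illustrative sketch under several assumptions: the class name `AllocationHistory`, the use of SHA-256, trimming the re-hashed value again to the original length, and the `max_rehash` guard are all choices made here and are not stated in the description. The guard exists because a very short trimmed mask has only a few possible values, so uniqueness may be unattainable.

```python
import hashlib
from typing import Callable, Dict, Optional, Set


def _sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class AllocationHistory:
    """Maps each original string to a unique trimmed-hash mask (assumed design)."""

    def __init__(self, hash_fn: Optional[Callable[[str], str]] = None,
                 max_rehash: int = 1000) -> None:
        self._hash = hash_fn or _sha256_hex
        self._max_rehash = max_rehash
        self._by_original: Dict[str, str] = {}  # original -> allocated mask
        self._used: Set[str] = set()            # masks already allocated

    def mask_for(self, original: str) -> str:
        # Reuse the mask if this original string was seen before.
        if original in self._by_original:
            return self._by_original[original]
        digest = self._hash(original)
        mask = digest[: len(original)]
        attempts = 0
        # On collision with another original, hash the hash value again.
        while mask in self._used:
            attempts += 1
            if attempts > self._max_rehash:
                raise RuntimeError("no unique mask found within max_rehash attempts")
            digest = self._hash(digest)
            mask = digest[: len(original)]
        self._by_original[original] = mask
        self._used.add(mask)
        return mask


history = AllocationHistory()
first = history.mask_for("1000001")
second = history.mask_for("1000002")
print(first == history.mask_for("1000001"), first != second)
```

Repeated lookups of the same original return the same mask, so relationships survive masking, while distinct originals never share a mask.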

(Second Embodiment)
As described above, the second embodiment describes a technique for reducing the total time required for the personal information mask processing described in the first embodiment by making more efficient the process of identifying the mask target locations in a document, in other words, the recording positions of personal information in the document. The configuration of the information system in the second embodiment is the same as that of the information system 100 shown in FIG. 1.

FIG. 10 shows original document data before personal information is masked. In the second embodiment, data in the table format of a relational database is used as an example of original document data. However, the original document data is not limited to the table format and may be document data or a document file in another format in which information is recorded electronically under a plurality of items, such as CSV or a spreadsheet.

Four records (ID001 to ID004) are recorded in the original document data of FIG. 10, and each record includes character strings for five items: ID, name, account number, address, and remarks. An item can be regarded as a heading or index indicating the type or category of data and corresponds, for example, to a column name in a table. In an actual large-scale information system 100, the original document data acquired from the production machine 10 may amount to several million records or more, and each record may contain tens to hundreds of items.

In conventional personal information detection processing, the character string recorded in each of the plurality of items was acquired (scanned) record by record and checked to see whether it contained any of an enormous number of keywords indicating personal information stored in advance in a database. This check was repeated for every record. Consequently, the time required for personal information detection grew as the number of records in the document increased and as the number of items in each record increased.

Therefore, the data conversion device 14 of the second embodiment makes the personal information detection process more efficient by combining the following two measures:
(1) items that, by the design of the original document data (in other words, of the production machine 10), clearly should contain personal information are excluded from the personal information detection process; and
(2) once the presence of personal information has been detected in an item, that item is thereafter excluded from the personal information detection process.
This helps the masking of personal information in the enormous original document data accumulated in the actual production machine 10 to be completed within the time allowed in actual system development.

FIG. 11 is a block diagram showing the functional configuration of the data conversion device 14 of the second embodiment. Among the functional blocks shown in FIG. 11, those that are the same as or correspond to functional blocks already described in the first embodiment are denoted by the same reference numerals. Content already described in the first embodiment is omitted below as appropriate.

The data holding unit 20 of the data conversion device 14 further includes a determination criterion holding unit 50. The personal information detection unit 34 of the data processing unit 30 includes a user designation acquisition unit 52, a determination unit 54, and a detection result output unit 56.

The determination criterion holding unit 50 holds data serving as the criterion by which the determination unit 54, described later, determines whether a character string contains personal information. In the embodiment, a plurality of keywords indicating personal information are held as this criterion data; in other words, the determination criterion holding unit 50 holds dictionary data of personal information. For example, keywords such as "Yamada" and "Taro" may be held as personal information relating to personal names, and keywords such as "Tokyo" and "Shibuya-ku" may be held as personal information relating to addresses.

The original document acquisition unit 32 acquires from the production machine 10, and reads, original document data such as that shown in FIG. 10, that is, document data that includes a plurality of records, in each of which character strings relating to a plurality of items are recorded electronically.

The user designation acquisition unit 52 acquires, from the person in charge of maintenance and testing work (hereinafter simply "user"), information designating those items among the plurality of items in the original document data that, by the design of the original document data, should contain personal information (hereinafter "personal information designated items"). For example, the user may record in a predetermined electronic file information indicating that the name column, the account number column, and the address column of the table in FIG. 10 are personal information designated items. The user designation acquisition unit 52 may read that electronic file and identify the name column, the account number column, and the address column as personal information designated items.

For the items that remain after excluding the personal information designated items specified by the user from the plurality of items in the original document data (hereinafter "scan target items"), the determination unit 54 determines, record by record, whether a character string indicating personal information is recorded. Specifically, for each record of the original document data, the determination unit 54 reads the character string recorded in each of the one or more scan target items and determines whether the character string contains any of the personal information keywords stored in the determination criterion holding unit 50.

In the embodiment, the determination unit 54 scans one record of the original document data illustrated in FIG. 10 horizontally and, when it has finished the determination for the character strings of all scan target items in that record, moves on to the next record. That is, taking one record as the unit, it sequentially determines the presence or absence of personal information for each item of the record.

However, when the determination unit 54 determines that the character string of a particular scan target item in a certain record contains a personal information keyword, it identifies that scan target item as an item containing personal information (hereinafter "personal information detected item"). A personal information detected item is an item in which personal information is recorded in at least one record of the original document data. Once the determination unit 54 has identified a personal information detected item, it excludes that item from the scan target items. In other words, in the determination processing for the remaining records, for which the presence or absence of personal information has not yet been determined, the determination unit 54 excludes the personal information detected item from the targets of the determination.

In this way, the determination unit 54 skips the determination of the presence or absence of personal information for both the personal information designated items statically specified by the user and the personal information detected items dynamically detected by its own determination processing. Typically, as the record-by-record determination of the original document data proceeds, the number of personal information detected items increases while the number of scan target items decreases.
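The record-by-record scan with static and dynamic exclusion can be sketched as follows. The function name `detect_pii_items`, the item names, the keyword list, and the four sample records are assumptions modelled loosely on FIG. 10; the actual values in the figure are not reproduced here. The function also returns the number of item checks performed so that the saving can be observed.

```python
from typing import Dict, List, Sequence, Tuple


def detect_pii_items(
    records: Sequence[Dict[str, str]],
    items: Sequence[str],
    designated: Sequence[str],
    keywords: Sequence[str],
) -> Tuple[List[str], int]:
    """Scan row by row; return (items to mask, number of item checks made)."""
    # Static exclusion: designated items are never scanned.
    scan_targets = [item for item in items if item not in designated]
    detected: List[str] = []
    checks = 0
    for record in records:
        if not scan_targets:  # nothing left to scan in later records
            break
        for item in list(scan_targets):
            checks += 1
            if any(keyword in record[item] for keyword in keywords):
                detected.append(item)
                scan_targets.remove(item)  # dynamic exclusion from later records
    return list(designated) + detected, checks


# Hypothetical data: the remarks of the third record mention a person.
items = ["id", "name", "account_no", "address", "remarks"]
records = [
    {"id": "001", "name": "Yamada Taro", "account_no": "1000001",
     "address": "Tokyo", "remarks": ""},
    {"id": "002", "name": "Suzuki Jiro", "account_no": "1000002",
     "address": "Osaka", "remarks": "new account"},
    {"id": "003", "name": "Sato Hanako", "account_no": "1000003",
     "address": "Nagoya", "remarks": "referred by Yamada"},
    {"id": "004", "name": "Tanaka Ken", "account_no": "1000004",
     "address": "Kyoto", "remarks": "contact Taro"},
]
mask_items, checks = detect_pii_items(
    records, items, ["name", "account_no", "address"], ["Yamada", "Taro", "Tokyo"]
)
print(mask_items, checks)
```

With these sample records the remarks item is detected at the third record, so only the ID item is checked in the fourth record: seven checks in total, compared with twenty if all five items were checked in every record.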

As a modification, the determination unit 54 may scan the original document data illustrated in FIG. 10 vertically, item by item, and move on to the next item when it has finished the determination for the character strings of all records for one item. That is, taking one item as the unit, it may sequentially determine the presence or absence of personal information across the plurality of records. In this case as well, when the presence of personal information is detected in an item, the determination unit 54 may identify that item as a personal information detected item, end the determination for that item, and start the determination for the next item.
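The item-by-item variant can be sketched in the same way. As before, the function name `detect_pii_items_by_column`, the item names, the keywords, and the sample records are hypothetical and chosen only to illustrate the early termination for each item.

```python
from typing import Dict, List, Sequence


def detect_pii_items_by_column(
    records: Sequence[Dict[str, str]],
    items: Sequence[str],
    designated: Sequence[str],
    keywords: Sequence[str],
) -> List[str]:
    """Scan column by column; stop scanning a column at its first hit."""
    detected: List[str] = []
    for item in items:
        if item in designated:  # statically designated items are skipped
            continue
        for record in records:
            if any(keyword in record[item] for keyword in keywords):
                detected.append(item)
                break  # end this item's determination and move to the next item
    return list(designated) + detected


# Hypothetical records with three items.
sample = [
    {"id": "001", "name": "Yamada Taro", "remarks": ""},
    {"id": "002", "name": "Suzuki Jiro", "remarks": "referred by Yamada"},
    {"id": "003", "name": "Sato Hanako", "remarks": ""},
]
result = detect_pii_items_by_column(
    sample, ["id", "name", "remarks"], ["name"], ["Yamada", "Taro"]
)
print(result)
```

For the remarks item the scan stops at the second record, so the third record is never read for that item.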

The detection result output unit 56 outputs, to an external functional block, a personal information detection result, which is information indicating both the personal information designated items acquired by the user designation acquisition unit 52 and the personal information detected items detected by the determination unit 54. The personal information detection result indicates which of the plurality of items in the original document data store personal information and therefore which items should have their stored character strings masked.

The functional block that receives the personal information detection result performs masking on the character strings recorded in the personal information designated items and the personal information detected items of the original document data. Specifically, all character strings recorded in each personal information designated item and each personal information detected item are masked according to a predetermined rule; in other words, they are replaced with other character strings from which the personal information has been removed (hereinafter also "mask character strings").

For example, the detection result output unit 56 may output the personal information detection result to the replacement rule determination unit 36. The replacement rule determination unit 36 may determine the replacement rule for the character strings of each personal information designated item and each personal information detected item uniformly on a per-item basis. The document conversion unit 40 may convert the character strings recorded in each personal information designated item and each personal information detected item into mask character strings according to the replacement rules determined by the replacement rule determination unit 36.

The operation of the data conversion device 14 configured as above will be described.
When a user operation instructing mask processing of document data accumulated in the production environment is detected in the data conversion device 14, the original document acquisition unit 32 acquires the original document data designated by the user operation (here, the table data of FIG. 10) from the production machine 10 via the communication network. The user designation acquisition unit 52 acquires information indicating the personal information designated items specified in advance by the user. Here, it is assumed that the name column, the account number column, and the address column of FIG. 10 have been specified as personal information designated items.

The determination unit 54 determines, record by record, whether the character string of each item of the original document data contains personal information. In doing so, it excludes the character strings of the personal information designated items specified by the user from the determination. Accordingly, the presence or absence of personal information is determined for the ID column and the remarks column of FIG. 10. Further, once the determination unit 54 detects that a character string indicating personal information is stored in a certain item, it identifies that item as a personal information detected item and excludes the character strings of that item in subsequent records from the determination. For example, for the record ID003 in FIG. 10, it detects that the remarks column contains personal information (for example, a telephone number) and identifies the remarks column as a personal information detected item. Then, for the record ID004 and subsequent records, it determines the presence or absence of personal information only for the ID column.

The detection result output unit 56 notifies another functional block that performs mask processing, for example the replacement rule determination unit 36, of information indicating the personal information designation items and the personal information detection items, in other words, the items in which mask target character strings are stored. The notification may be made by recording the information in a storage area referred to by the replacement rule determination unit 36. Thereafter, the conversion process of replacing the personal information included in the original document data with mask character strings (for example, S16 and subsequent steps in FIG. 7(a)) is executed. For the original document data of FIG. 10, masking may be performed on the character strings of the name column, the account number column, the address column, and the remarks column. Finally, test document data in which the personal information in the original document data is masked is generated.

According to the data conversion device 14 of the embodiment, the statically designated personal information designation items are excluded from the outset from the items for which the presence or absence of personal information is determined in the original document data, and the dynamically detected personal information detection items are additionally excluded as they are found. Thus, as the item-by-item personal information scan of the original document data proceeds, the number of items to be scanned is reduced in advance and then further reduced dynamically, which makes the process of identifying the portions of the original document data to be masked more efficient. In other words, the process of specifying the recording positions of personal information in the original document data can be executed efficiently. As a result, even if the size of the original document data (the number of records or the number of items) increases, an increase in the time required to identify the mask target portions in the document can be suppressed.

Further, according to the data conversion device 14, the possibility of failing to detect personal information recorded in the original document data can be brought very close to zero. The accuracy of the automatic detection of personal information by the determination unit 54 can be estimated as follows. As a premise, it is difficult to give an exact figure for the detection rate of a dictionary. For example, Japanese surnames can be almost completely covered, but given names vary widely and no coverage figures are available. Therefore, assume that the detection rate of the dictionary is 80% and that the number of records to be examined is 10,000. Here, the presence or absence of personal information is detected for one specific item. The detection accuracy of personal information in this case, that is, the probability that the personal information is detected in at least one record of the item, is
1 - ((1 - 0.8) ^ 10000) ("^" represents exponentiation)
which is almost 100%. Even in the case of 10 records, it is 99.99998976%.
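The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate as stated in the text; it assumes, as the text implicitly does, that every record holds personal information in the item and that each record is detected independently with the same rate. The function name is hypothetical.

```python
def item_detection_probability(per_record_rate, num_records):
    """Probability that at least one of num_records values is detected.

    Assumes independent detection per record with the same rate
    (an assumption of this sketch, following the estimate in the text).
    """
    miss_all = (1.0 - per_record_rate) ** num_records
    return 1.0 - miss_all


if __name__ == "__main__":
    for n in (1, 10, 10000):
        p = item_detection_probability(0.8, n)
        print(f"{n} records: {p * 100:.8f}%")
```

With a rate of 0.8 and 10 records, the probability of missing every record is 0.2 ^ 10 = 1.024e-7, which gives the 99.99998976% figure quoted above.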

Thus, the possibility of failing to detect personal information depends on the data volume but can be said to be almost zero. In the data conversion device 14 of the embodiment, excluding the personal information designation items designated by the user from the detection targets reduces the number of items to be scanned in advance and further lowers the possibility of a detection failure. In addition, items for which it is difficult for a human to judge whether personal information is recorded, such as the remarks column of FIG. 10, are covered by having the determination unit 54 automatically detect the presence of personal information. That is, by combining the user's judgment based on the design with machine detection based on the dictionary, the data conversion device 14 can bring the occurrence of detection failures of personal information very close to zero.

The present invention has been described above based on the second embodiment. This embodiment is illustrative, and it will be understood by those skilled in the art that various modifications can be made to the combinations of its constituent elements and processing processes, and that such modifications are also within the scope of the present invention. Modifications are described below.

A first modification of the second embodiment will be described.
Although not mentioned in the above embodiment, the personal information detection process of the determination unit 54 may be multiprocessed so that a plurality of personal information detection processes are executed concurrently by a plurality of processes. Specifically, when the personal information detection process of the determination unit 54 is to be executed, the operating system (OS) of the data conversion device 14 may generate a plurality of processes that execute the personal information detection process. Each of a plurality of CPUs included in the data conversion device 14 may then execute one of the processes in parallel. It is desirable that the number of processes can be set arbitrarily, and changed, by the user.

In addition to or instead of multiprocessing, the personal information detection process of the determination unit 54 may be multithreaded so that a plurality of personal information detection processes are executed concurrently by a plurality of threads. Specifically, when the personal information detection process of the determination unit 54 is to be executed, the OS of the data conversion device 14 may generate a plurality of threads that execute the personal information detection process in parallel. The CPU included in the data conversion device 14 may then execute the threads in parallel. It is desirable that the number of threads can be set arbitrarily, and changed, by the user.

For example, each of a plurality of processes may execute, in parallel, the personal information detection process on a different one of a plurality of original document data. Of course, the executing entities may instead be a plurality of threads. Alternatively, each of a plurality of threads may execute, in parallel, the personal information detection process on a different group of records, each group being a part of the records of the same original document data (document file). Of course, the executing entities may instead be a plurality of processes. According to this modification, multiprocessing and multithreading allow the personal information detection process to be executed in parallel on a plurality of files, and also on a plurality of records within the same file, thereby increasing the processing speed.

When a plurality of threads execute the personal information detection process on different record groups within the same original document data and one thread detects that a specific item includes personal information, that thread notifies the other threads of the detected personal information detection item. The notification may use a known inter-thread communication mechanism (for example, setting a flag in a memory area that can be referred to by the plurality of threads). Thereafter, the other threads exclude the notified personal information detection item from the items to be scanned. The same applies when the executing entities are a plurality of processes, in which case the personal information detection items may be mutually notified using a known inter-process communication mechanism.
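The notification between threads can be sketched with a shared set protected by a lock, which plays the role of the flag in a commonly referenced memory area. This is a minimal sketch under stated assumptions: the function names, the interleaved split of records, and the `is_personal` callback are hypothetical and chosen only for illustration.

```python
import threading


def scan_chunk(records, columns, detected, lock, is_personal):
    """Scan one record group, skipping columns that any thread has flagged."""
    for record in records:
        for column in columns:
            with lock:
                already_flagged = column in detected
            if already_flagged:
                continue  # another thread (or this one) found it earlier
            if is_personal(record[column]):
                with lock:
                    detected.add(column)  # visible to all threads from now on


def parallel_detect(records, columns, is_personal, num_threads=2):
    """Split the records into groups and scan them with several threads."""
    detected = set()
    lock = threading.Lock()
    chunks = [records[i::num_threads] for i in range(num_threads)]
    threads = [
        threading.Thread(
            target=scan_chunk,
            args=(chunk, columns, detected, lock, is_personal),
        )
        for chunk in chunks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return detected
```

The set of detected columns is the same regardless of how the threads interleave; only the number of cell checks saved varies. Note that in standard CPython builds threads do not run Python bytecode on several cores at once, so a process-based variant with an inter-process mechanism would be the natural choice when CPU parallelism matters.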

A second modification of the second embodiment will be described.
The determination criterion holding unit 50 may hold keywords indicating personal information in association with categories indicating their types (for example, the detection types of FIG. 3). For example, the keywords "Yamada", "Taro", and the like may be held in association with the category "person name". Likewise, the keywords "Tokyo", "Shibuya-ku", and the like may be held in association with the category "address".

After specifying the personal information detection items in the original document data, the determination unit 54 may execute, record by record, a process of specifying the recording positions of personal information for the personal information designation items and the personal information detection items. Specifically, the determination unit 54 may refer to the determination criterion holding unit 50 to specify the position in the document (in the record) at which a personal information keyword appears and the category of that keyword. The determination unit 54 may then output the personal information extraction result shown in FIG. 3, that is, the combination of record number, item name, character string, position, and detection type, to the replacement rule determination unit 36. When only one category of information is stored in one item and the position need not be specified (typically, when the entire character string is replaced), the category may simply be specified together with the personal information detection item in the determination process described in the embodiment.
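The position-specifying process can be sketched as a lookup against a category-to-keyword table. This is only an illustration under stated assumptions: the table contents reuse the example keywords of this modification, while the function name, the dictionary keys of the result, and the 0-based character offset are hypothetical choices and not the format of FIG. 3.

```python
# Hypothetical determination criteria: category -> keywords.
CRITERIA = {
    "person name": ["Yamada", "Taro"],
    "address": ["Tokyo", "Shibuya-ku"],
}


def locate_personal_info(record_no, record, target_columns, criteria=CRITERIA):
    """Return one entry per keyword occurrence in the target columns.

    Each entry combines record number, item name, character string,
    position (0-based offset, an assumption of this sketch), and type.
    """
    results = []
    for column in target_columns:
        text = record[column]
        for category, keywords in criteria.items():
            for keyword in keywords:
                start = text.find(keyword)
                while start != -1:
                    results.append({
                        "record": record_no,
                        "item": column,
                        "string": keyword,
                        "position": start,
                        "type": category,
                    })
                    start = text.find(keyword, start + len(keyword))
    return results


if __name__ == "__main__":
    row = {"remarks": "Yamada lives in Tokyo"}
    for entry in locate_personal_info(1, row, ["remarks"]):
        print(entry)
```

Only the columns already identified as mask targets are passed in, so the more expensive position search runs on a small part of each record.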

A third modification of the second embodiment will be described.
In the above embodiment, a single device provides both the personal information detection function and the masking function, but these functions may be divided among a plurality of physically separate devices. That is, the data conversion device 14 described in the second embodiment may be realized by a first device that executes the personal information detection process described in the second embodiment and a second device that executes the personal information mask process described in the first embodiment, the two devices cooperating via a communication network.

Any combination of the first embodiment, the second embodiment, and the modifications described above is also useful as an embodiment of the present invention. A new embodiment resulting from such a combination has the effects of each of the combined embodiments and modifications. It will also be understood by those skilled in the art that the functions to be fulfilled by the constituent features recited in the claims are realized by the constituent elements shown in the first embodiment, the second embodiment, and the modifications of each embodiment, either individually or in cooperation.

10 production machine, 14 data conversion device, 34 personal information detection unit, 50 determination criterion holding unit, 52 user designation acquisition unit, 54 determination unit, 56 detection result output unit, 100 information system.

Claims

A personal information detection apparatus comprising:
a document acquisition unit that acquires a document in which character strings relating to a plurality of items are recorded;
a designation acquisition unit that acquires information indicating a personal information item that is designated by a user from among the plurality of items and that includes personal information;
a determination unit that determines whether personal information is included in the character strings of the other items excluding the personal information item designated by the user, and detects an item including personal information as a personal information item; and
an output unit that outputs the personal information item designated by the user and the personal information item detected by the determination unit as items to be subjected to mask processing of personal information,
wherein the document acquired by the document acquisition unit includes a plurality of records, each record including character strings relating to the plurality of items, and
the determination unit sequentially determines whether personal information is included in the character string of each item of the plurality of records and, upon determining that a specific item of a certain record includes personal information, detects that item as a personal information item and excludes the character strings of that item in the remaining records from the targets of the determination of whether personal information is included.

A computer program causing a computer to realize:
an acquisition function of acquiring a document in which character strings relating to a plurality of items are recorded;
a function of acquiring information indicating a personal information item that is designated by a user from among the plurality of items and that includes personal information;
a determination function of determining whether personal information is included in the character strings of the other items excluding the personal information item designated by the user, and detecting an item including personal information as a personal information item; and
an output function of outputting the personal information item designated by the user and the personal information item detected by the determination function as items to be subjected to mask processing of personal information,
wherein the document acquired by the acquisition function includes a plurality of records, each record including character strings relating to the plurality of items, and
the determination function sequentially determines whether personal information is included in the character string of each item of the plurality of records and, upon determining that a specific item of a certain record includes personal information, detects that item as a personal information item and excludes the character strings of that item in the remaining records from the targets of the determination of whether personal information is included.