JP2011034264A

JP2011034264A - Personal information masking system

Info

Publication number: JP2011034264A
Application number: JP2009178690A
Authority: JP
Inventors: Michihiro Nishide; 通啓西出
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2009-07-31
Filing date: 2009-07-31
Publication date: 2011-02-17

Abstract

PROBLEM TO BE SOLVED: To convert personal information into character strings from which no individual can be identified, while preserving the patterns and characteristics of the personal information. SOLUTION: All character strings of the personal information to be masked that are registered in a database (DB) are acquired, and the patterns and characteristics of the personal information are extracted. Based on those patterns and characteristics, the character strings are replaced so that the following five conditions hold. (1) No character string that identifies an individual exists in the post-mask DB. (2) Every character that is an element of a character string identifying an individual exists in the post-mask DB. (3) When a character string identifying an individual occurs several times in the DB, the corresponding masked character string occurs the same number of times in the post-mask DB. (4) When characters that do not identify an individual, such as symbols or spaces, are present in a character string or part of one, they are present in the same places and in the same number in the post-mask DB. (5) Each character that is an element of a character string identifying an individual exists, in the post-mask DB, at the same position from the start of a masked character string. COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a masking system that makes individuals unidentifiable while preserving the patterns and characteristics of personal information registered in a database.

In recent years, amid growing concern over leaks of personal information, companies must handle personal information with the utmost care. At the same time, in-house computer systems are frequently outsourced to system development companies. In development work such as replacing an existing system, the system development company temporarily receives the existing system's data, with personal information masked, from the customer company and proceeds with development.

Personal information is generally masked by having the customer (the company that uses the in-house computer system) confirm where the personal information is located and then, manually or with a tool, deleting the characters that contain personal information, replacing them with meaningless strings such as "***" or "XXX", or, for names, performing a bulk replacement in an agreed format such as "Taro Yamada 1, Taro Yamada 2", so that the personal information can no longer be referenced. To carry out such masking, methods and tools have been studied, developed, and used that locate personal information within large volumes of text, such as databases and text files, and recognize whether it consists of characters representing a name or an address. Patent Document 1 is a known prior-art document on such masking and character recognition.

Patent Document 1: JP 2004-94542 A

The purpose of masking is to hide characters so that personal information cannot be identified. How the personal information should look after masking is therefore rarely considered. As a result, the system development company can understand the data only through the fixed replacement patterns applied by the customer (the user of the in-house computer system) and through what it learns in interviews with the customer. Development then proceeds without regard to patterns and characteristics hidden in the data that the customer itself is unaware of, and these latent patterns and characteristics go undetected until later phases such as data migration or the customer's operational testing. This leads to problems such as the following.

For example, suppose the customer believes that the data registered in the database has the following six characteristics.
(1) All kanji names are registered in full-width kanji (JIS level 1 and level 2).
(2) No two kanji names are the same.
(3) In kanji names, one full-width blank character is registered between the surname and the given name.
(4) All kana names are registered in full-width kana.
(5) In kana names, no full-width blank character is registered between the surname and the given name.
(6) All kanji addresses are registered with "東京都" (Tokyo) omitted.

Suppose further that, in the data received from the customer, all 10,000 kanji names, for example, have been replaced with "山田太郎1" through "山田太郎10000" (Taro Yamada 1 to 10000 in kanji), all kana names with "ヤマダタロウ1" through "ヤマダタロウ10000" (the same in kana), and all addresses with "XXXXX".

Now suppose that the data actually registered in the database the customer had been operating had the following problems.
(1) Some kanji names contained user-defined characters (gaiji).
(2) Some kanji names were identical.
(3) Some kanji names had no blank between the surname and the given name, and some had two or more full-width blank characters.
(4) Some kana names contained half-width kana.
(5) Some kana names had one or more half-width blank characters between the surname and the given name.
(6) Some kanji addresses were registered beginning with "東京都" (Tokyo).

As a result, system impacts such as the following occur in the operational test phase.
(1) Because user-defined characters were registered in kanji names, the names were not printed correctly on reports.
(2) Reports were output using the kanji name as the page-break condition, but no page break occurred where identical names were registered.
(3) Kanji names were split into surname and given name and positioned on the report according to the character count of each (an example is shown in FIG. 8), but they could not be laid out and output correctly.
(4) Because half-width kana were mixed into kana names, the process that converts full-width kana to half-width kana for report output terminated abnormally.
(5) The specification was to output the kana name from the stored string up to the first blank, but because one or more half-width blank characters were registered between the surname and the given name, only the surname was printed on the report.
(6) Because kanji addresses were assumed to be registered with "東京都" omitted, the specification was for the program to prepend "東京都" when printing an address on a report. However, because some kanji addresses were registered beginning with "東京都", some addresses were printed as "東京都東京都...".
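Impacts (5) and (6) can be reproduced in a few lines. The following is only an illustrative sketch: the function names and sample values are hypothetical and do not come from the patent. It assumes one report routine that prints the kana name up to the first blank and another that always prepends "東京都".

```python
# Hypothetical report helpers illustrating impacts (5) and (6); not from the patent.

def kana_name_for_report(kana_name: str) -> str:
    """Output the stored string up to the first blank (half- or full-width)."""
    for i, ch in enumerate(kana_name):
        if ch in (" ", "\u3000"):
            return kana_name[:i]
    return kana_name


def address_for_report(address: str) -> str:
    """Always prepend the prefecture, assuming it was omitted in the database."""
    return "東京都" + address


# Data that matches the customer's assumptions behaves as designed.
print(kana_name_for_report("ヤマダタロウ"))
print(address_for_report("千代田区丸の内"))

# Latent patterns in the real data break the same routines.
print(kana_name_for_report("ヤマダ タロウ"))        # only the surname is printed
print(address_for_report("東京都千代田区丸の内"))  # the prefecture is printed twice
```

Masked data such as "ヤマダタロウ1" or "XXXXX" would never exercise the two failing cases, which is why they surface only in operational testing.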

To prevent such problems, system development should be able to take the patterns and characteristics of customer data into account from early phases such as design and coding. This calls for a technique that masks personal information while preserving the patterns and characteristics hidden in the data, rather than the conventional approach of simply obliterating the personal information.

An object of the present invention is to provide a system that converts information registered in a database into character strings from which individuals cannot be identified, while preserving the patterns and characteristics of that information.

To solve the above problem, the masking method of the present invention acquires all of the personal information to be masked that is registered in a table and replaces it with different character strings while preserving its patterns and characteristics. Specifically, all personal information extracted from the database is replaced so that the following five conditions hold.

(Condition 1) A "character string" that identifies one individual before masking must not exist in the database after masking.
(Condition 2) Every "character" that is an element of a "character string" identifying one individual before masking must exist in the database after masking.
(Condition 3) When a "character string" identifying one individual before masking occurs more than once in the database, the corresponding masked "character string" must occur the same number of times in the database after masking.
(Condition 4) When "characters" that cannot identify an individual, such as symbols (shown in FIG. 7) or spaces, are present before masking in a "character string" or part of one, they must be present in the same places and in the same number in the database after masking.
(Condition 5) Each "character" that is an element of a "character string" identifying one individual before masking must exist, in the database after masking, at the same position from the start of a masked "character string".
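The five conditions can be expressed as checks over the pre-mask and post-mask lists of strings. The sketch below is one possible reading, not the patent's implementation: `check_conditions` and `FIXED_CHARS` are hypothetical names, the set of non-identifying characters is an assumed stand-in for FIG. 7, and Condition 5 is interpreted as "each character position (column) contains the same set of characters before and after masking".

```python
# Assumed stand-in for the non-identifying characters listed in FIG. 7.
FIXED_CHARS = set(" \u3000-()・")


def check_conditions(before, after, fixed=FIXED_CHARS):
    """Return {condition number: bool} for two equally long lists of strings."""
    assert len(before) == len(after)
    original = set(before)
    results = {}
    # (1) No pre-mask string survives anywhere in the masked data.
    results[1] = not any(s in original for s in after)
    # (2) Every character used before masking is still used afterwards.
    results[2] = set("".join(before)) <= set("".join(after))
    # (3) Rows that duplicated each other still do, and no new duplicates appear.
    results[3] = ([before.index(s) for s in before]
                  == [after.index(s) for s in after])
    # (4) Non-identifying characters stay at the same positions in the same rows.
    results[4] = all(
        len(b) == len(a)
        and all((x in fixed) == (y in fixed) and (x not in fixed or x == y)
                for x, y in zip(b, a))
        for b, a in zip(before, after)
    )
    # (5) Each column holds the same set of characters before and after
    #     (an interpretation of "same position from the start of the string").
    width = max(map(len, before), default=0)
    results[5] = all(
        {s[i] for s in before if len(s) > i} == {s[i] for s in after if len(s) > i}
        for i in range(width)
    )
    return results


before = ["山田 太郎", "鈴木 花子", "山田 太郎"]
after = ["鈴田 花郎", "山木 太子", "鈴田 花郎"]
print(check_conditions(before, after))
```

In this made-up sample the characters are exchanged between rows within the same column, the blank stays in place, and the duplicated name remains duplicated, so all five checks pass.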

When the five conditions hold, the patterns and characteristics of the personal information are preserved, for the following reasons.
(Condition 1) If a "character string" identifying an individual before masking still exists in the database after masking, the data has simply not been masked. After masking, no character string identifying that individual may remain.
(Condition 2) Keeping every "character" that is an element of an identifying "character string" in the database after masking preserves the full set of characters in use, and hence every character-level characteristic. This resolves problems (1) and (4) of the example given under the problem to be solved by the invention.
(Condition 3) Keeping the same number of corresponding masked "character strings" in the database after masking preserves whether information identifying the same individual is present and how often it appears. This resolves problem (2) of the example.
(Condition 4) Keeping symbols and spaces that cannot identify an individual in the same places and in the same number in the database after masking preserves the patterns and characteristics expressed by the character strings. This resolves problems (3) and (5) of the example.
(Condition 5) Keeping each character at the same position from the start of a masked "character string" in the database after masking preserves the data pattern of the masked item. This resolves problem (6) of the example.

The masking method of the present invention has the following effects.
(1) When an existing system is replaced, the patterns and characteristics of its production data can be understood without viewing the personal information in that data, so design considerations, the need for data cleansing, and the data migration approach can be examined efficiently in the early phases of system development.
(2) Test data close to the real personal information can be created easily without disclosing the personal information.

FIG. 1 is a system configuration diagram showing an embodiment of the masking system according to the present invention.
FIG. 2 is a concrete example of masking character strings of personal information.
FIG. 3 is a configuration diagram of the management tables used in the masking process.
FIG. 4 is a flowchart showing an overview of the masking process.
FIG. 5 is a flowchart of the character string analysis process.
FIG. 6 is a flowchart of the character string replacement process.
FIG. 7 shows the characters treated as symbols that cannot identify an individual.
FIG. 8 is a concrete example of name layout.
FIG. 9 is a concrete example of the input file 108 and the parameter 109.

An embodiment of an auto-masking tool (masking apparatus) to which the present invention is applied is described below.

FIG. 1 is a system configuration diagram showing an embodiment of the masking apparatus according to the present invention. Masking is performed by running mask processing software 102 on a mask processing terminal 101. The mask processing software 102 consists of a data reading module 103, a data analysis module 104, a data replacement module 105, a data verification module 106, and a data output module 107. The data reading module 103 reads an input file 108 and a parameter 109 and passes them to the data analysis module 104, which reads all of the character strings to be masked and extracts their patterns and characteristics. Based on the analysis result of the data analysis module 104, the data replacement module 105 replaces the character strings so that personal information cannot be identified and passes the masked character strings to the data verification module 106. The data verification module 106 checks whether the strings have been replaced with strings from which personal information cannot be identified at all. Finally, the data output module 107 outputs the masked output file 110. A utility 111 is a tool provided with the DBMS that exports the contents of a database 112 to a text file; the data in the database 112 is written out as comma-separated text.

As described above, the mask processing terminal 101 in FIG. 1 is a computer that loads and executes the computer program called mask processing software 102. The five modules from the data reading module 103 to the data output module 107 are parts of that computer program, but if the state in which the CPU loads and executes each program is regarded as an apparatus, each module in FIG. 1 can also be regarded as a means or a device. The configuration of FIG. 1 therefore corresponds to the wording of the claims.

FIG. 2 shows a concrete example of masking character strings of personal information. When the character strings 201 that identify individuals before masking are input to the mask processing software, they are replaced with strings such as the masked character strings 202, and the personal information is masked. This masking satisfies the five conditions defined under the means for solving the problem. Comparing the pre-mask strings in FIG. 2(a) with the masked strings in FIG. 2(b), the positions of blanks and of symbols that cannot identify an individual are the same, so it is apparent at a glance that (Condition 4) is satisfied. The masking also satisfies the remaining conditions: (Condition 1), (Condition 2), (Condition 3), and (Condition 5).

FIG. 3 shows the structure of the management tables used in the masking process. The character string data table 301 is created from the input file 108 by the data reading module 103. Its width equals the maximum length of the character strings to be masked: the first character 302 holds the first character of each string, and the second character 303 holds the second character. Likewise, when a string has three or more characters, the third and subsequent characters are held in the following columns.

The replacement management table 304 and the duplicate management table 307 are created by the data analysis module 104. The first character 305 of the replacement management table 304 corresponds to the first character 302 of the character string data table 301, and the second character 306 corresponds to the second character 303 of the character string data table 301. Likewise, when a string has three or more characters, the remaining columns correspond to the third and subsequent characters of the character string data table 301.

Row 308 of the duplicate management table 307 corresponds to the row number of the character string data table 301, and duplicate row 309 stores the row numbers of the character string data table 301 that duplicate that row. The character string data table 301 holds all data of the items in the input file 108 that the parameter 109 designates as mask targets, together with the position of each character within its string. The replacement management table 304 corresponds one-to-one to the character positions held in the character string data table 301: it holds "1" where the character in the character string data table 301 is to be replaced and "0" where it is not. The duplicate management table 307 holds, for each row of the character string data table 301, the numbers of the rows whose identifying character string duplicates it. In FIG. 3, the first row indicates that information identifying the same individual as row 1 was also held in rows 3, 5, and 19 of the character string data table 301.

FIG. 4 is a flowchart showing an overview of the masking process. First, the input file 108 is read (step 401). Next, the parameter 109 is read (step 402). The character string analysis process is performed on the basis of these contents (step 403); its details are shown in FIG. 5. The character string replacement process is then performed on the basis of the analysis result (step 404); its details are shown in FIG. 6. After all replacements are complete, the converted character strings are checked to see whether the same character string still exists among them. If it does, not all personal information has been made unidentifiable, so the character string replacement process is performed again (steps 405, 406). If the check finds no such character string, all personal information is judged to have been masked, the masked file is output, and the process ends (step 407).
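The replace, verify, and retry loop of steps 404 to 407 can be sketched as follows. This is a minimal illustration under stated assumptions: `mask_with_verification`, `shuffle_columns`, and the `max_attempts` limit are hypothetical, the verification is read as "no pre-mask string may appear in the masked output", and the toy replacement function simply permutes each character column of equal-length strings rather than following FIG. 6.

```python
import random


def mask_with_verification(strings, replace_fn, max_attempts=100):
    """Steps 404-407: replace, verify, and retry until no original string survives."""
    originals = set(strings)
    for _ in range(max_attempts):
        masked = replace_fn(strings)            # step 404: replacement
        if not originals.intersection(masked):  # steps 405-406: verification
            return masked                       # step 407: ready for output
    raise RuntimeError("could not mask every string; the input may be too small")


def shuffle_columns(strings, rng):
    """Toy replacement: permute each character column (equal lengths assumed)."""
    columns = [list(col) for col in zip(*strings)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(chars) for chars in zip(*columns)]


names = ["アイウ", "カキク", "サシス", "タチツ"]
rng = random.Random(0)
masked = mask_with_verification(names, lambda rows: shuffle_columns(rows, rng))
print(masked)
```

The attempt limit is a design choice of this sketch: the flowchart loops until verification succeeds, which would not terminate for inputs that cannot change, such as a single row.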

FIG. 5 is a flowchart of the character string analysis process (step 403 in FIG. 4). Based on the information obtained in steps 401 and 402 of FIG. 4 (the input file and the parameter), all character strings of the items to be masked are read, the number of input records and the maximum string length are calculated, and the character string data table 301 is created (step 501). Each character string to be masked is then split into individual characters, which are stored at the corresponding positions in the character string data table 301 (step 502).

Next, based on the contents of the character string data table 301, the replacement management table 304 is created by assigning "1" to the positions of characters that are to be replaced and "0" to the positions of characters that are not. A character is judged not to be a replacement target if it is one of the symbols or spaces shown in FIG. 7, or if its row is held as a duplicate row in the duplicate management table 307; every other character is judged to be a replacement target (steps 503, 504, 505, 506, 507).

Next, based on the contents of the character string data table 301, if duplicate character strings exist, the corresponding duplicate rows are added to the duplicate management table 307 (steps 508, 509, 510, 511, 512). For example, if the same character string as in row 1 of the character string data table 301 also appears in rows 3, 5, and 19, the values "3, 5, 19" are stored in duplicate row 309 for the entry whose row 308 is "1". If there is no duplicate character string, "0" is stored in duplicate row 309 of the duplicate management table 307 for the corresponding row of the character string data table 301 (step 513).
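The analysis steps can be pictured with a small sketch that builds the three tables from a list of mask-target strings. The names `analyze` and `FIXED_CHARS` are hypothetical, the set of non-identifying characters is an assumed stand-in for FIG. 7, and the in-memory representation (lists and a dictionary, with `None` as padding) is a choice made for this example rather than something the patent specifies.

```python
# Assumed stand-in for the non-identifying characters listed in FIG. 7.
FIXED_CHARS = set(" \u3000-()・")


def analyze(strings, fixed=FIXED_CHARS):
    """Build tables 301, 304, and 307 for a list of mask-target strings."""
    width = max(map(len, strings), default=0)

    # Character string data table 301: one character per cell, padded with None.
    data = [list(s) + [None] * (width - len(s)) for s in strings]

    # Duplicate management table 307: 1-based row -> later rows with the same string.
    first_seen = {}
    duplicates = {row: [] for row in range(1, len(strings) + 1)}
    for row, s in enumerate(strings, start=1):
        if s in first_seen:
            duplicates[first_seen[s]].append(row)
        else:
            first_seen[s] = row
    copied_rows = {r for rows in duplicates.values() for r in rows}
    duplicates = {row: rows or [0] for row, rows in duplicates.items()}  # 0: no duplicate

    # Replacement management table 304: 1 = replace, 0 = keep (padding, symbol,
    # space, or a row that duplicates an earlier row).
    flags = [
        [0 if (ch is None or ch in fixed or row in copied_rows) else 1 for ch in cells]
        for row, cells in enumerate(data, start=1)
    ]
    return data, flags, duplicates


data, flags, duplicates = analyze(["山田 太郎", "鈴木 一", "山田 太郎"])
print(flags)
print(duplicates)
```

How the duplicate management table represents rows that are themselves duplicates is not stated in the text; this sketch stores "0" for them and relies on the entry of the first occurrence.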

FIG. 6 is a flowchart of the character string replacement process (step 404 in FIG. 4). A value is read from the replacement management table 304; if it is "1", the cell is a replacement target, so the corresponding cell of the character string data table 301 is swapped with the cell of the character string data table 301 in a randomly obtained row. If "1" exists in the first row of the replacement management table 304 and all rows from the second onward are "0", no replacement target exists, so the process ends (steps 601, 602, 603).

The randomly obtained row is a number drawn at random from "1" to the last row of the replacement management table 304 that differs from the current row number of the replacement management table 304. If the cell value of the replacement management table 304 for the row so obtained is "0", that character string is not a replacement target, so the random number is drawn again; the replacement is performed if the cell value of the duplicate management table 307 for the resulting row is "1".

Next, "0" is assigned to the value of the replacement management table 304 corresponding to the replaced cell (step 604), to indicate that the replacement has been completed. If the value read from the replacement management table 304 is "0", the cell is not a replacement target, so the next value is read from the replacement management table 304 (step 605). When all values in the replacement management table 304 have become "0", for each row of the duplicate management table 307 the value of the character string data table 301 for that row is assigned to every row held in its duplicate row 309 (step 607). If not all values in the replacement management table 304 are "0", processing returns to step 602 and continues.
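A compact sketch of the replacement step follows. It is not the patent's exact procedure: `swap_replace` is a hypothetical name, the partner row is chosen directly among rows whose flag in the same column is still "1" (standing in for the re-drawn random number), only the current cell's flag is cleared after a swap, and all strings are assumed to have the same length.

```python
import random


def swap_replace(data, flags, duplicates, rng=None):
    """Swap flagged cells within each column, then copy results to duplicate rows.

    data, flags: lists of rows of equal shape; duplicates: {1-based row: [rows]}.
    Both tables are modified in place, and data is returned.
    """
    rng = rng or random.Random()
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0
    for col in range(n_cols):
        for row in range(n_rows):
            if flags[row][col] != 1:
                continue                      # step 605: not a replacement target
            partners = [r for r in range(n_rows)
                        if r != row and flags[r][col] == 1]
            if partners:                      # otherwise nothing is left to swap with
                other = rng.choice(partners)  # stands in for the re-drawn random row
                data[row][col], data[other][col] = data[other][col], data[row][col]
            flags[row][col] = 0               # step 604: mark the cell as done
    # Step 607: duplicate rows receive the masked value of the row they duplicate.
    for row, dup_rows in duplicates.items():
        for dup in dup_rows:
            if dup:                           # 0 means "no duplicate"
                data[dup - 1] = list(data[row - 1])
    return data


data = [list("山田 太郎"), list("鈴木 花子"), list("山田 太郎"), list("佐藤 次郎")]
flags = [[1, 1, 0, 1, 1], [1, 1, 0, 1, 1], [0] * 5, [1, 1, 0, 1, 1]]
duplicates = {1: [3], 2: [0], 3: [0], 4: [0]}
masked = ["".join(r) for r in swap_replace(data, flags, duplicates, random.Random(0))]
print(masked)
```

Because characters are only exchanged within a column, blanks and symbols stay in place and every character keeps its position from the start of the string; a row can still end up unchanged by chance, which is what the verification step is for.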

FIG. 7 shows the characters treated as symbols that cannot identify an individual. Characters in the mask target that match any of these are not replaced.

FIG. 8 shows a concrete example of name layout. For example, when the surname in the first row is one character and the given name is one character, six blanks are inserted between them when the name is printed on the report, as shown under "layout". This aligns the start and end positions of the kanji names.

FIG. 9 shows concrete examples of the input file 108 and the parameter 109. The input file 108 is a text file containing the contents of the database 112 in comma-separated form. The parameter 109 specifies, by number, the positions of the items in the comma-separated text file that are to be masked. In this example, two items are specified as mask targets.
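Reading the comma-separated input and selecting the mask-target items might look like the sketch below. The function names and the sample records are hypothetical, and the parameter is assumed to be a list of 1-based column positions; the actual file layout in FIG. 9 is not reproduced in the text.

```python
import csv
import io


def read_mask_targets(csv_text, target_positions):
    """Return (records, {position: [values]}) for 1-based column positions."""
    records = list(csv.reader(io.StringIO(csv_text)))
    targets = {pos: [rec[pos - 1] for rec in records] for pos in target_positions}
    return records, targets


def write_masked(records, masked_by_position):
    """Put masked values back into their columns and return comma-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, rec in enumerate(records):
        row = list(rec)
        for pos, values in masked_by_position.items():
            row[pos - 1] = values[i]
        writer.writerow(row)
    return out.getvalue()


text = "0001,山田 太郎,ヤマダタロウ,千代田区\n0002,鈴木 花子,スズキハナコ,港区\n"
records, targets = read_mask_targets(text, [2, 3])  # two items are mask targets
print(targets[2])
print(write_masked(records, {2: ["鈴田 花郎", "山木 太子"]}))
```

Columns that are not named in the parameter pass through unchanged, so the output file keeps the layout of the input file.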

Because information registered in a database is converted into character strings from which individuals cannot be identified while its patterns and characteristics are preserved, the converted data can be entrusted to a system development company and is highly useful as test data for system development.

101 Mask processing terminal
102 Mask processing software
103 Data reading module
104 Data analysis module
105 Data replacement module
106 Data verification module
107 Data output module
108 Input file
109 Parameter
110 Output file
111 Utility
112 Database
201 Character strings identifying individuals before masking
202 Character strings after masking
301 Character string data table
302 First character of the character string data table
303 Second character of the character string data table
304 Replacement management table
305 First character of the replacement management table
306 Second character of the replacement management table
307 Duplicate management table
308 Row of the duplicate management table
309 Duplicate row of the duplicate management table

Claims

A masking system that converts character strings containing personal information registered in a database of an existing system into character strings that are equally valuable as test data for system development but from which no individual can be identified, the masking system comprising:
a data reading device that reads an input file and a parameter indicating mask target items;
a character string analysis device that analyzes the character strings to be masked read by the data reading device and extracts patterns and characteristics;
a character string replacement device that, based on the analysis result of the character string analysis device, replaces the character strings to be masked with character strings from which personal information cannot be identified;
a data verification device that verifies whether the masked character strings produced by the character string replacement device have been completely replaced with information from which personal information cannot be identified; and
a data output device that outputs the data of the masked character strings verified by the data verification device as an output file.

The masking system according to claim 1, wherein
the character string analysis device includes:
character string data table creation means for calculating the number of items of input information read by the data reading device and the maximum character length, and creating a character string data table to store the input information;
input information storage means for storing the input information in the character string data table created by the character string data table creation means;
replacement management table creation means for creating a replacement management table that manages which characters are to be replaced and which are not; and
duplicate management table creation means for creating a duplicate management table that holds and manages the row numbers of duplicate rows, and
the character string replacement device includes:
value reading means for reading a value from the replacement management table created by the replacement management table creation means;
replacement means for, when the value read by the value reading means is 1, swapping the cell of the character string data table corresponding to the replacement management table with the cell of the character string data table corresponding to a randomly obtained row;
zero assignment means for assigning 0 to the entry of the replacement management table corresponding to the cell replaced by the replacement means; and
duplicate row replacement means for, when all values of the replacement management table have become 0, referring to the duplicate management table and assigning the value of the character string data table to the rows corresponding to that value.