JP6779854B2

JP6779854B2 - Anonymization device, anonymization method and anonymization program

Info

Publication number: JP6779854B2
Application number: JP2017232733A
Authority: JP
Inventors: Tomoaki Mimoto (三本 知明); Shinsaku Kiyomoto (清本 晋作)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2020-11-04
Anticipated expiration: 2037-12-04
Also published as: JP2019101809A

Description

The present invention relates to a device, a method, and a program for anonymizing data sets.

Conventionally, when a data set that contains personal information such as movement history or purchase history together with user attributes is analyzed and used for purposes such as advertisement distribution, it has been necessary to avoid the risk that an individual is identified from a record and personal information is leaked. For this reason, a data set containing personal information is provided only after anonymization.
When a data set is anonymized automatically, one of two approaches is used: records that are close to each other are rounded and clustered, or each attribute is given a tree structure and generalization is repeated until the data set is k-anonymous (see, for example, Non-Patent Documents 1 and 2).

Non-Patent Document 1: Ji-Won Byun, Ashish Kamra, Elisa Bertino, Ninghui Li, "Efficient k-anonymization using clustering techniques," Proceedings of the 12th International Conference on Database Systems for Advanced Applications, April 9-12, 2007, Bangkok, Thailand.
Non-Patent Document 2: Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, "Incognito: efficient full-domain K-anonymity," Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, June 14-16, 2005, Baltimore, Maryland.

However, the method that automatically rounds nearby records clusters them purely by inter-record distance, without considering what the data means. This produces generalizations of little use to the data user, such as an age range of "17-21".
Even the tree-based method has k-anonymity as its only goal, so as the number of attributes grows, most attributes are heavily generalized, and the resulting data set often has little practical value.

An object of the present invention is to provide an anonymization device, an anonymization method, and an anonymization program that can output only safe records while preserving their usefulness.

The anonymization device according to the present invention comprises: a data set input unit that periodically accepts input of a data set consisting of a plurality of records having the same attributes; a hierarchical tree input unit that accepts, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes; an anonymization rule input unit that accepts input of an anonymization rule corresponding to how the data set will be used; an output rule input unit that accepts input of an output rule defining the conditions under which a record may be output, based on the risk that an individual is identified; an anonymization processing unit that anonymizes the entire data set based on the anonymization rule; an anonymized data output unit that outputs, from the data set anonymized by the anonymization processing unit, only the records that satisfy the output rule; and a saved record storage unit that stores the records that did not satisfy the output rule and were set aside, in their state before anonymization. The anonymization processing unit adds the records stored in the saved record storage unit to the data set accepted by the data set input unit and then performs anonymization.

When a record includes a date and time attribute, the saved record storage unit may delete that date and time information before storing the record.

The saved record storage unit may delete records for which a predetermined period has elapsed.

In the anonymization method according to the present invention, a computer executes: a data set input step of periodically accepting input of a data set consisting of a plurality of records having the same attributes; a hierarchical tree input step of accepting, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes; an anonymization rule input step of accepting input of an anonymization rule corresponding to how the data set will be used; an output rule input step of accepting input of an output rule defining the conditions under which a record may be output, based on the risk that an individual is identified; an anonymization processing step of anonymizing the entire data set based on the anonymization rule; an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rule; and a saved record storage step of storing the records that did not satisfy the output rule and were set aside, in their state before anonymization. In the anonymization processing step, the records stored in the saved record storage step are added to the data set accepted in the data set input step, and anonymization is then performed.

The anonymization program according to the present invention causes a computer to execute: a data set input step of periodically accepting input of a data set consisting of a plurality of records having the same attributes; a hierarchical tree input step of accepting, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes; an anonymization rule input step of accepting input of an anonymization rule corresponding to how the data set will be used; an output rule input step of accepting input of an output rule defining the conditions under which a record may be output, based on the risk that an individual is identified; an anonymization processing step of anonymizing the entire data set based on the anonymization rule; an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rule; and a saved record storage step of storing the records that did not satisfy the output rule and were set aside, in their state before anonymization. In the anonymization processing step, the program causes the computer to add the records stored in the saved record storage step to the data set accepted in the data set input step and then perform anonymization.

According to the present invention, when a data set is anonymized, only safe records can be output while their usefulness is preserved.

FIG. 1 is a diagram showing the functional configuration of the anonymization device according to the first embodiment.
FIG. 2 is a diagram showing the input and output information of the anonymization device according to the first embodiment.
FIG. 3 is a diagram showing the input and output information of the anonymization device according to the second embodiment.

Hereinafter, a first embodiment of the present invention will be described.
FIG. 1 is a diagram showing the functional configuration of the anonymization device 1 according to the present embodiment.
The anonymization device 1 is an information processing device (computer) such as a server or a personal computer, and includes a control unit 10, a storage unit 20, and various input/output devices.

The control unit 10 controls the anonymization device 1 as a whole and implements the functions of the present embodiment by reading and executing, as needed, the various programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the various programs and data that make the hardware function as the anonymization device 1, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores the anonymization program that causes the control unit 10 to execute the functions of the present embodiment, as well as the data sets to be processed and various files.

The control unit 10 includes a data set input unit 11, a hierarchical tree input unit 12, an anonymization rule input unit 13, an output rule input unit 14, a setting information input unit 15, an anonymization processing unit 16, and an anonymized data output unit 17.

The data set input unit 11 periodically accepts, by batch processing or the like, input of a data set consisting of a plurality of records having the same attributes. For example, one day's worth of data is taken in once a day and passed to the anonymization processing unit 16.

Each record of the data set to be anonymized consists of a plurality of attributes. The data types of the attributes include qualitative data, quantitative data, and code-type data.
Qualitative data is, for example, an address such as "Tokyo" or "Kyoto".
Quantitative data is, for example, a numerical value such as "1.5" or "30".
Code-type data is data in which each digit carries meaning, such as a postal code.

The hierarchical tree input unit 12 accepts, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes.
In a generalization hierarchy tree, for example, nodes such as "Kanto" and "Kansai" are placed above the qualitative-data nodes "Tokyo" and "Kyoto", respectively. Above quantitative-data nodes such as "13", "14", and "15", a node such as "13-15" or "minor" is placed. Above a code-type node such as "123-4567", a node with some digits masked, such as "123-45**", is placed.
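As an illustration only, a generalization hierarchy tree could be held as child-to-parent links and climbed a given number of levels. The sketch below is not taken from the patent: the names `REGION_PARENT`, `AGE_PARENT`, `generalize`, and `generalize_postal_code` are hypothetical, the root node "Japan" is an added assumption, and placing "minor" above "13-15" is one possible reading of the example.

```python
from typing import Dict

# Hypothetical child -> parent links (assumption: "Japan" as root, "minor" above "13-15").
REGION_PARENT: Dict[str, str] = {
    "Tokyo": "Kanto", "Kyoto": "Kansai", "Kanto": "Japan", "Kansai": "Japan",
}
AGE_PARENT: Dict[str, str] = {
    "13": "13-15", "14": "13-15", "15": "13-15", "13-15": "minor",
}

def generalize(parent: Dict[str, str], value: str, levels: int) -> str:
    """Climb `levels` steps up the tree, stopping early at the root."""
    for _ in range(levels):
        if value not in parent:
            break  # already at the root (or an unknown value)
        value = parent[value]
    return value

def generalize_postal_code(code: str, masked_digits: int) -> str:
    """Code-type data: replace the last `masked_digits` digits with '*'."""
    chars = list(code)
    remaining = masked_digits
    for i in range(len(chars) - 1, -1, -1):
        if remaining == 0:
            break
        if chars[i].isdigit():  # separators such as '-' are kept
            chars[i] = "*"
            remaining -= 1
    return "".join(chars)

print(generalize(REGION_PARENT, "Tokyo", 1))
print(generalize(AGE_PARENT, "14", 1))
print(generalize_postal_code("123-4567", 2))
```

Keeping one small mapping per attribute lets qualitative and quantitative attributes share the same climbing function, while code-type attributes are generalized by masking digits rather than by table lookup.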

The anonymization rule input unit 13 accepts file input of an anonymization rule corresponding to how the data set will be used.
An anonymization rule defines generalization rules or conditional generalization rules, for example: generalize attribute x up to tree height h; delete part or all of attribute y; or, for records whose number of identical records is n or more, generalize attribute z up to tree height h. A rule may have multiple conditions, combined using, for example, "and" or "or".
The anonymization rule is not limited to generalization based on a generalization hierarchy tree; other anonymization techniques such as sampling, swapping, and noise addition may also be used.
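The three kinds of rule mentioned above (generalize to a height, delete an attribute, generalize conditionally on the number of identical records) can be sketched as a small rule list applied in order. The rule format (`op`, `attr`, `height`, `min_same`) and the function names are hypothetical; the patent does not specify a file format.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical child -> parent links for two attributes.
PARENT: Dict[str, Dict[str, str]] = {
    "region": {"Tokyo": "Kanto", "Kyoto": "Kansai"},
    "age": {"13": "13-15", "14": "13-15", "15": "13-15"},
}

def climb(attr: str, value: str, height: int) -> str:
    """Generalize `value` by moving up `height` levels in the attribute's tree."""
    for _ in range(height):
        value = PARENT[attr].get(value, value)
    return value

def apply_rules(records: List[Record], rules: List[dict]) -> List[Record]:
    """Apply each rule in order to a copy of the records."""
    out = [dict(r) for r in records]
    for rule in rules:
        # Count identical records before this rule runs (used by conditional rules).
        keys = [tuple(sorted(r.items())) for r in out]
        counts = Counter(keys)
        for r, key in zip(out, keys):
            if counts[key] < rule.get("min_same", 1):
                continue  # condition not met: leave this record as it is
            attr = rule["attr"]
            if rule["op"] == "delete":
                r.pop(attr, None)
            elif rule["op"] == "generalize" and attr in r:
                r[attr] = climb(attr, r[attr], rule["height"])
    return out

rules = [
    {"op": "delete", "attr": "name"},
    {"op": "generalize", "attr": "region", "height": 1},
    # Conditional rule: only records with at least 2 identical copies are generalized.
    {"op": "generalize", "attr": "age", "height": 1, "min_same": 2},
]
records = [
    {"name": "A", "region": "Tokyo", "age": "13"},
    {"name": "B", "region": "Tokyo", "age": "13"},
    {"name": "C", "region": "Kyoto", "age": "15"},
]
result = apply_rules(records, rules)
print(result)
```

In this sketch the conditional rule leaves the single Kyoto record's age untouched, because after the first two rules it has no identical partner.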

The output rule input unit 14 accepts file input of an output rule that defines the conditions under which a record may be output, so that the risk of an individual being identified from the record information in the data set is kept below a predetermined level.
The output rule may be expressed as a threshold on, for example, the number k of duplicate records or the individual identification probability p.
The output rule may also be defined independently for each record, and may further be defined according to the recipient of the data set. For example, a conditional threshold may be defined as the output rule: a given record may be disclosed to companies of size xx or larger if the number of identical records satisfies k ≥ 2, and to companies of size yy or smaller only if k > 10.
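A recipient-dependent threshold of this kind might be written as follows. This is a sketch under stated assumptions: `OutputRule` and its fields are hypothetical names, the concrete sizes stand in for the unspecified "xx" and "yy", and withholding records for recipients that fall between the two sizes is a choice made here, not something the text states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OutputRule:
    large_company_size: int  # stands in for "xx" (value is an assumption)
    small_company_size: int  # stands in for "yy" (value is an assumption)

    def allows(self, same_record_count: int, recipient_size: int) -> bool:
        """Return True if a record with `same_record_count` identical records may be disclosed."""
        if recipient_size >= self.large_company_size:
            return same_record_count >= 2   # k >= 2 for large recipients
        if recipient_size <= self.small_company_size:
            return same_record_count > 10   # k > 10 for small recipients
        # Assumption: the text does not cover sizes in between, so withhold.
        return False

rule = OutputRule(large_company_size=1000, small_company_size=100)
print(rule.allows(same_record_count=2, recipient_size=5000))
print(rule.allows(same_record_count=10, recipient_size=50))
```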

The setting information input unit 15 accepts input of a file that specifies where to save the various output information, such as the anonymized data set and log files.

The anonymization processing unit 16 anonymizes the entire data set based on the anonymization rule.
The resulting data set is anonymized to a level suited to the intended use of the data.

The anonymized data output unit 17 outputs, from the data set anonymized by the anonymization processing unit 16, only the records that satisfy the output rule.
For records that do not satisfy the output rule, the anonymized data output unit 17 applies appropriate processing, such as deleting them from the data set, masking them, or repeating generalization until they satisfy the output rule.
As a result, an anonymized data set whose safety is assured is output.
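Assuming the output rule is a plain duplicate-count threshold k, the output step can be sketched as a split of the anonymized records into those released and those withheld. The function name `split_by_k` is hypothetical; it returns indices rather than records so that a withheld record can still be traced back to its original, pre-anonymization form.

```python
from collections import Counter
from typing import Dict, List, Tuple

def split_by_k(anonymized: List[Dict[str, str]], k: int) -> Tuple[List[int], List[int]]:
    """Return (indices to release, indices to withhold) under a k-duplicate threshold."""
    keys = [tuple(sorted(r.items())) for r in anonymized]
    counts = Counter(keys)
    released = [i for i, key in enumerate(keys) if counts[key] >= k]
    withheld = [i for i, key in enumerate(keys) if counts[key] < k]
    return released, withheld

data = [
    {"region": "Kanto", "age": "13-15"},
    {"region": "Kanto", "age": "13-15"},
    {"region": "Kansai", "age": "15"},
]
released, withheld = split_by_k(data, k=2)
print(released, withheld)
```

The withheld indices are the records that would then be deleted, masked, or generalized further.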

The anonymized data output unit 17 also outputs various log files and reports together with the anonymized data set and stores them in the storage unit 20.

FIG. 2 is a diagram showing the input and output information of the anonymization device 1 according to the present embodiment.
As described above, the anonymization device 1 accepts as input, in addition to the data set to be anonymized, the generalization hierarchy trees, an anonymization rule file, an output rule file, and other setting files.
When the anonymization device 1 outputs the anonymized data set, it also outputs an anonymization log file, an error log file, and an anonymization report.

The anonymization log file records, for each record that did not satisfy the output rule, an ID that links the record to the anonymized data set and the record's original attribute information before anonymization.
The error log file records error messages produced when the anonymization process does not finish normally.
The anonymization report describes, for security control purposes, what anonymization was performed under the anonymization rule and how much risk remains under the output rule.

According to the present embodiment, the anonymization device 1 performs anonymization processing such as generalization based on an anonymization rule suited to the purpose of use, and then outputs only the anonymized records that satisfy the output rule, which keeps the risk of identifying an individual below a predetermined level.
Conventional anonymization methods process the entire data set so that it meets the output conditions, so outliers are heavily generalized together with the other records and usefulness drops. The anonymization device 1 of the present embodiment uses the anonymization rule to make explicit the information to be retained, which differs by use case, and after anonymization outputs only the safe records that satisfy the output rule. It can thereby exclude outliers whose safety is below the predetermined level and maintain high usefulness.
As a result, unlike conventional automatic anonymization methods, the anonymization device 1 can guarantee both a fixed set of processing rules and a level of safety, and can therefore generate a data set that is useful to users of the anonymized data.

Furthermore, by defining the output rule independently for each record of the data set, the anonymization device 1 can define the safety of each record more appropriately.
By defining the output rule according to the recipient of the data set, the anonymization device 1 can output a data set suited to the purpose of use.

[Second Embodiment]
Hereinafter, a second embodiment of the present invention will be described.
Components that are the same as in the first embodiment are given the same reference numerals, and their description is omitted or simplified.

When a data set is anonymized automatically, a record that has no similar records and is therefore an outlier loses much of its information through heavy generalization or through deletion of all or part of the record. In the first embodiment as well, usefulness may drop if some records are deleted or heavily generalized in order to satisfy the safety required by the output rule.
However, when data sets are input periodically, records similar to an outlier can be expected to accumulate over time, so the degree of generalization may be reduced.
Therefore, in the present embodiment, the anonymization device 1 maintains usefulness by combining records that did not satisfy the output rule with data sets input later and processing them together.

In the present embodiment, the functions of the anonymization processing unit 16 and the anonymized data output unit 17 differ from those of the first embodiment.
The anonymized data output unit 17 stores the saved records, that is, the records that did not satisfy the output rule and were not output, in the storage unit 20 (saved record storage unit) in their original state before anonymization, as an over-anonymization target data set.
The anonymization processing unit 16 adds the over-anonymization target data set stored in the storage unit 20 to a data set accepted by the data set input unit 11 on a subsequent occasion, and then performs anonymization.
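The carry-over behavior of the second embodiment can be sketched as one function per periodic run. This is a minimal sketch, assuming a duplicate-count threshold k as the output rule and a single-attribute generalization as the anonymization rule; `run_cycle` and `to_region` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

def run_cycle(
    new_batch: List[Record],
    saved: List[Record],
    anonymize: Callable[[Record], Record],
    k: int,
) -> Tuple[List[Record], List[Record]]:
    """One periodic run: merge saved raw records, anonymize, release safe ones, save the rest raw."""
    merged = saved + new_batch
    anonymized = [anonymize(r) for r in merged]
    keys = [tuple(sorted(r.items())) for r in anonymized]
    counts = Counter(keys)
    released = [a for a, key in zip(anonymized, keys) if counts[key] >= k]
    # Records that fail the output rule are kept in their pre-anonymization form.
    next_saved = [raw for raw, key in zip(merged, keys) if counts[key] < k]
    return released, next_saved

# Hypothetical anonymization rule: generalize the city to its region.
REGION = {"Tokyo": "Kanto", "Kyoto": "Kansai", "Osaka": "Kansai"}

def to_region(record: Record) -> Record:
    return {"region": REGION.get(record["region"], record["region"])}

day1 = [{"region": "Tokyo"}, {"region": "Tokyo"}, {"region": "Kyoto"}]
out1, saved1 = run_cycle(day1, [], to_region, k=2)

day2 = [{"region": "Osaka"}]
out2, saved2 = run_cycle(day2, saved1, to_region, k=2)
print(out1, saved1)
print(out2, saved2)
```

In this example the Kyoto record is withheld on the first run and released on the second, once an Osaka record arrives that generalizes to the same region.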

When sampling is adopted as an anonymization rule, some records are excluded from output regardless of the output rule; these records need not be included in the over-anonymization target data set.

Here, when a saved record includes a date and time attribute, the anonymized data output unit 17 deletes the date and time information before adding the record to the over-anonymization target data set. Date and time information never has the same value as the records of a data set input on a later occasion (for example, the next day). Therefore, when the date and time information is subject to processing for anonymization, deleting it from the saved record keeps the record from remaining an outlier in subsequent anonymization runs.
During anonymization processing, the anonymization processing unit 16 may delete, from the over-anonymization target data set, records for which a predetermined period has elapsed.
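Deleting the date and time attribute and expiring old saved records could look like the sketch below. The names `SavedRecord`, `save_record`, and `drop_expired` are hypothetical. The `saved_on` field is an added bookkeeping assumption: the text does not say how the elapsed period is measured once the record's own date and time attribute has been removed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Tuple

@dataclass
class SavedRecord:
    attributes: Dict[str, str]
    saved_on: date  # bookkeeping only; not part of the record's attributes (assumption)

def save_record(
    raw: Dict[str, str],
    today: date,
    datetime_attrs: Tuple[str, ...] = ("timestamp",),
) -> SavedRecord:
    """Drop date and time attributes, then wrap the record with the day it was saved."""
    kept = {name: value for name, value in raw.items() if name not in datetime_attrs}
    return SavedRecord(kept, today)

def drop_expired(saved: List[SavedRecord], today: date, max_age: timedelta) -> List[SavedRecord]:
    """Keep only records saved within `max_age` of `today`."""
    return [s for s in saved if today - s.saved_on <= max_age]

s = save_record({"region": "Kyoto", "timestamp": "2017-12-04T10:00"}, date(2017, 12, 4))
print(s.attributes)
print(len(drop_expired([s], date(2017, 12, 20), timedelta(days=7))))
```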

FIG. 3 is a diagram showing the input and output information of the anonymization device 1 according to the present embodiment.
In the second embodiment, the anonymization device 1 outputs an over-anonymization target data set in addition to the output data of the first embodiment, and stores it in a save database (DB) in the storage unit 20. Date and time information is deleted from the data set stored in the save DB.
At the next anonymization run, the anonymization device 1 adds the data set stored in the save DB to the data set to be anonymized and then performs anonymization.

According to the present embodiment, the anonymization device 1 stores the saved records that did not satisfy the output rule in the save DB in their state before anonymization, and reuses them in anonymization processing by adding them to data sets input on subsequent occasions.
Therefore, when anonymization is performed repeatedly on data sets having the same attributes, the anonymization device 1 temporarily sets aside high-risk records and anonymizes them later. A record that is an outlier this time is then more likely to be output in a later run, which improves usefulness.

Furthermore, when a record includes a date and time attribute, the anonymization device 1 deletes the date and time information before storing the record in the save DB. This prevents the record from becoming an outlier in the next data set (for example, the next day's) and makes it more likely to be output.
By deleting records for which a predetermined period has elapsed from the save DB, the anonymization device 1 can also exclude records that are not appropriate to merge into the processing target, which improves the usefulness of the output data.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments. The effects described in the embodiments are merely a list of the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

The anonymization method of the anonymization device 1 is implemented in software. In that case, the programs constituting the software are installed on an information processing device (computer). These programs may be distributed to users on removable media such as a CD-ROM, or by download to the user's computer over a network. They may also be provided to the user's computer as a web service over a network, without being downloaded.

1 Anonymization device
10 Control unit
11 Data set input unit
12 Hierarchical tree input unit
13 Anonymization rule input unit
14 Output rule input unit
15 Setting information input unit
16 Anonymization processing unit
17 Anonymized data output unit
20 Storage unit

Claims

An anonymization device comprising:
a data set input unit that periodically accepts input of a data set consisting of a plurality of records having the same attributes;
a hierarchical tree input unit that accepts, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes;
an anonymization rule input unit that accepts input of an anonymization rule corresponding to how the data set will be used;
an output rule input unit that accepts input of an output rule defining the conditions under which a record may be output, based on the risk that an individual is identified;
an anonymization processing unit that anonymizes the entire data set based on the anonymization rule;
an anonymized data output unit that outputs, from the data set anonymized by the anonymization processing unit, only the records that satisfy the output rule; and
a saved record storage unit that stores, in their state before anonymization, the records that did not satisfy the output rule and were set aside,
wherein the anonymization processing unit adds the records stored in the saved record storage unit to the data set accepted by the data set input unit and then performs anonymization.

The anonymization device according to claim 1, wherein, when a record includes a date and time attribute, the saved record storage unit deletes the date and time information before storing the record.

The anonymization device according to claim 1 or 2, wherein the saved record storage unit deletes records for which a predetermined period has elapsed.

An anonymization method in which a computer executes:
a data set input step of periodically accepting input of a data set consisting of a plurality of records having the same attributes;
a hierarchical tree input step of accepting, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes;
an anonymization rule input step of accepting input of an anonymization rule corresponding to how the data set will be used;
an output rule input step of accepting input of an output rule defining the conditions under which a record may be output, based on the risk that an individual is identified;
an anonymization processing step of anonymizing the entire data set based on the anonymization rule;
an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rule; and
a saved record storage step of storing, in their state before anonymization, the records that did not satisfy the output rule and were set aside,
wherein, in the anonymization processing step, the records stored in the saved record storage step are added to the data set accepted in the data set input step and anonymization is then performed.

An anonymization program that causes a computer to execute:
a data set input step of periodically accepting input of a data set consisting of a plurality of records having the same attributes;
a hierarchical tree input step of accepting, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes;
an anonymization rule input step of accepting input of an anonymization rule corresponding to how the data set will be used;
an output rule input step of accepting input of an output rule defining the conditions under which a record may be output, based on the risk that an individual is identified;
an anonymization processing step of anonymizing the entire data set based on the anonymization rule;
an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rule; and
a saved record storage step of storing, in their state before anonymization, the records that did not satisfy the output rule and were set aside,
wherein, in the anonymization processing step, the program causes the computer to add the records stored in the saved record storage step to the data set accepted in the data set input step and then perform anonymization.