JP2016184213A

JP2016184213A - Method for anonymizing numeric data, and numeric data anonymization server

Info

Publication number: JP2016184213A
Application number: JP2015062981A
Authority: JP
Inventors: 謙英田辺; Kenei Tanabe; 和明井堀; Kazuaki Ihori
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2015-03-25
Filing date: 2015-03-25
Publication date: 2016-10-20

Abstract

PROBLEM TO BE SOLVED: To suppress the degradation of data quality caused by anonymizing numeric data.

SOLUTION: When creating the first level of a generalization hierarchy, an anonymization server 100 generates combinations of a first node, which belongs to the zeroth level one level below the first level, with each node adjacent to the first node in numerical order. For each generated combination, the server calculates the deviation between (a) a statistical indicator of the numeric data as anonymized using the first level, assuming that the node obtained by merging the nodes of the combination is adopted as the first-level node corresponding to each node of the combination, and (b) the same statistical indicator of the original numeric data. The server then adopts the node obtained by merging the combination with the smallest deviation as the first-level node corresponding to each node of that combination.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a method for anonymizing numerical data and to a numerical data anonymization server.

In recent years, cloud computing has spread rapidly as a means of consolidating low-cost services and information. However, security and privacy in cloud computing have not yet necessarily received sufficient attention.

k-anonymity is an index for evaluating the anonymity of anonymized information from which individuals cannot be identified. Background art in this technical field includes International Publication No. 2011/145401 (Patent Document 1). That publication states: "Each individual's information includes the individual's attribute values for a plurality of attributes. Anonymization is achieved by obscuring these attribute values; a tree structure that represents the obscured attribute values according to their degree of obscurity is called a generalization hierarchy tree. This personal information anonymization device constructs the tree automatically by using frequency information of the attribute values. In addition, by defining a means for measuring the amount of lost information, the device uses the generalization hierarchy tree to quantitatively determine the information loss between two sets of anonymized data, or between data sets in the middle of anonymization" (see the abstract).

Patent Document 1: International Publication No. 2011/145401

The technique described in Patent Document 1 creates a Hu-Tucker code tree for each attribute of the data to be anonymized, and performs k-anonymization by generalizing the attribute values of each attribute using the generalization hierarchy represented by that Hu-Tucker code tree. The Hu-Tucker code tree is generated from the frequency information and the order information of the attribute values. However, the attribute values themselves are not taken into account when the Hu-Tucker code tree is created. Consequently, a statistical index calculated from attribute values generalized with the Hu-Tucker code tree by the technique of Patent Document 1 may deviate significantly from the statistical index of the attribute values before generalization. In other words, k-anonymization by the technique of Patent Document 1 may significantly degrade data quality.

To solve the above problem, one aspect of the present invention adopts, for example, the following configuration: a method by which a numerical data anonymization server holding numerical data that includes a plurality of numerical values anonymizes the numerical data. The numerical data anonymization server includes a processor that executes a program and a memory that the processor accesses. In the method, the processor creates a generalized hierarchy consisting of a plurality of levels based on the numerical values included in the numerical data, and anonymizes the numerical data using the generalized hierarchy. In the process of creating the first level of the generalized hierarchy, the processor generates combinations of a first node, included in the 0th level one level below the first level, with each node adjacent to the first node in numerical order. For each generated combination, the processor calculates the divergence between the statistical index of the numerical data anonymized using the first level, on the assumption that the node obtained by merging the nodes of the combination is adopted as the first-level node corresponding to each node of the combination, and the statistical index of the original numerical data. The processor then adopts the node obtained by merging the nodes of the combination with the smallest calculated divergence as the first-level node corresponding to each node of that combination.

According to one aspect of the present invention, the degradation of data quality caused by anonymization can be suppressed.

Problems, configurations, and effects other than those described above will become apparent from the following description of embodiments.

FIG. 1 is a block diagram showing a configuration example of the anonymization server in Example 1.
FIG. 2 is an example of the anonymization target data in Example 1.
FIG. 3 is an example of the generalized hierarchy data in Example 1.
FIG. 4 is an example of the anonymization result data in Example 1.
FIG. 5 is a flowchart showing an example of the overall k-anonymization process in Example 1.
FIG. 6 is a flowchart showing an example of the generalized hierarchy creation process for numerical quasi-identifiers in Example 1.
FIG. 7 is an example of age hierarchy data in a conventional method.
FIG. 8 is a diagram showing an example of a comparison between anonymization of a numerical quasi-identifier using the generalized hierarchy of Example 1 and anonymization of a numerical quasi-identifier using a conventional generalized hierarchy.

Embodiments of the present invention are described below with reference to the accompanying drawings. The anonymization server of this embodiment creates the generalized hierarchy of a quasi-identifier expressed as numerical values so that the divergence between the statistical indices of that quasi-identifier before and after anonymization is small. By anonymizing the values of the quasi-identifier using this generalized hierarchy, the anonymization server can suppress the degradation of data quality caused by anonymization.

FIG. 1 shows a configuration example of the anonymization server of this example. The anonymization server 100 of this example is implemented by a computer having a processor (CPU) 101, a memory 102, an auxiliary storage device 103, and a communication interface 104.

The processor 101 executes programs stored in the memory 102. The memory 102 includes a ROM, which is a nonvolatile storage element, and a RAM, which is a volatile storage element. The ROM stores immutable programs (for example, the BIOS). The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the programs executed by the processor 101 and the data used while those programs run.

The auxiliary storage device 103 is a large-capacity, nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD), and stores the programs executed by the processor 101 and the data used while those programs run. That is, a program is read from the auxiliary storage device 103, loaded into the memory 102, and executed by the processor 101.

The anonymization server 100 may have an input interface 105 and an output interface 108. The input interface 105 is an interface to which a keyboard 106, a mouse 107, and the like are connected and which receives input from an operator. The output interface 108 is an interface to which a display device 109, a printer, and the like are connected and which outputs program execution results in a form the operator can view.

The communication interface 104 is a network interface device that controls communication with other devices according to a predetermined protocol. The communication interface 104 also includes, for example, a serial interface such as USB.

The programs executed by the processor 101 are provided to the anonymization server 100 via removable media (CD-ROM, flash memory, etc.) or a network, and are stored in the nonvolatile auxiliary storage device 103, which is a non-transitory storage medium. For this reason, the anonymization server 100 preferably has an interface for reading data from removable media.

The anonymization server 100 is a computer system configured on one physical computer or on a plurality of logically or physically configured computers. It may run in separate threads on the same computer, or on a virtual machine built on a plurality of physical computer resources.

The processor 101 includes an anonymization target data capturing unit 111, a generalized hierarchy creating unit 112, a statistical index calculation unit 113, a k-anonymization execution unit 114, and a k-anonymization result output unit 115. For example, the processor 101 functions as the anonymization target data capturing unit 111 by operating according to an anonymization target data capturing program loaded into the memory 102, and functions as the generalized hierarchy creating unit 112 by operating according to a generalized hierarchy creation program loaded into the memory 102. The same applies to the other units included in the processor 101.

The anonymization target data capturing unit 111 captures the data to be anonymized via the communication interface 104 or the input interface 105, and stores the captured data in the auxiliary storage device 103 as anonymization target data 121. The generalized hierarchy creating unit 112 creates a generalized hierarchy for k-anonymization for each quasi-identifier of the anonymization target data 121. A generalized hierarchy is given, for example, as a tree structure, and shows hierarchically the relationship between data values before and after anonymization. The generalized hierarchy creating unit 112 stores information describing each created generalized hierarchy in the auxiliary storage device 103 as generalized hierarchy data 122.

The k-anonymization execution unit 114 refers to the generalized hierarchy data 122, k-anonymizes the anonymization target data 121, and stores the k-anonymized data in the auxiliary storage device 103 as anonymization result data 123. The k-anonymization result output unit 115 outputs the anonymization result data 123, for example, to the display device 109 via the output interface 108 or to another device via the communication interface 104.

The auxiliary storage device 103 holds the anonymization target data 121, the generalized hierarchy data 122, and the anonymization result data 123. The anonymization target data 121 is the original data to be k-anonymized. The generalized hierarchy data 122 describes the hierarchy used to generalize each quasi-identifier included in the anonymization target data 121. A quasi-identifier is information that can identify an individual indirectly when combined with other quasi-identifiers. An identifier, by contrast, is information that can identify an individual directly on its own. The anonymization result data 123 is the anonymization target data 121 after k-anonymization based on the generalized hierarchy data 122.

In this embodiment, the information used by the anonymization server 100 does not depend on any particular data structure and may be expressed in any data structure. For example, a data structure appropriately selected from a table, a list, a database, or a queue can store the information. FIGS. 3 to 6, described later, show examples in which the data held in the auxiliary storage device 103 is expressed in a table structure.

FIG. 2 shows an example of the anonymization target data 121. The anonymization target data 121 includes one or more types of quasi-identifiers. In the example of FIG. 2, the anonymization target data 121 includes age 1210 and address 1211, both of which are quasi-identifiers. Age 1210 stores the age of the individual corresponding to the record. Address 1211 stores the address of the individual corresponding to the record.

The quasi-identifiers included in the anonymization target data 121 are not limited to the example of FIG. 2 and may be chosen freely. The anonymization target data 121 may also include one or more types of identifiers. In that case, the anonymization target data 121 includes, for example, information indicating whether each column is an identifier or a quasi-identifier. The anonymization target data 121 may also consist of a plurality of tables.

FIG. 3 shows an example of the generalized hierarchy data 122. The generalized hierarchy data 122 is information describing the generalized hierarchy, expressed as a tree structure, of each quasi-identifier included in the anonymization target data 121. FIG. 3 is an example of the generalized hierarchy data 122 corresponding to the quasi-identifiers included in the anonymization target data 121 of FIG. 2.

The generalized hierarchy data 122 includes age hierarchy data 1220 and address hierarchy data 1224. The age hierarchy data 1220 describes the generalized hierarchy of age 1210 as a tree structure. The age hierarchy data 1220 includes a 0th level 1221, a first level 1222, and a root node 1223.

The 0th level 1221 is the lowest level of the tree structure, and each node of the 0th level 1221 is a leaf node of the tree. Each leaf node of the 0th level 1221 corresponds to one distinct value (duplicates removed) contained in age 1210.

The first level 1222 is the level immediately above the 0th level 1221 in the tree structure. Each node of the first level 1222 is the parent of the corresponding node of the 0th level 1221, and is either identical to that corresponding node or the result of merging that node with one of its adjacent nodes. In other words, every value contained in a node of the 0th level 1221 is contained in the corresponding node of the first level 1222. Each node of the first level 1222 has the same or a higher degree of anonymization than the corresponding node of the 0th level 1221.

The root node 1223 is the top level of the tree structure, and each node at the root node 1223 level is a node into which all nodes of the 0th level 1221 are merged. That is, each node at the root node 1223 level is the most abstract concept, containing every value contained in the nodes of the other levels. Note that the number of levels in the age hierarchy data 1220 increases with, for example, the number of distinct values contained in age 1210.

The address hierarchy data 1224 describes the generalized hierarchy of address 1211 as a tree structure. The address hierarchy data 1224 includes a 0th level 1225, a first level 1226, and a root node 1227. The columns of the address hierarchy data 1224 are analogous to those of the age hierarchy data 1220, so their description is omitted.

In the age hierarchy data 1220, a node displayed as a range, for example "20 to 25", indicates that the age is at least 20 and at most 25. In the address hierarchy data 1224, a node listing several values separated by ";", for example "Kanagawa Prefecture; Tokyo; Okinawa Prefecture", indicates that the address is Kanagawa Prefecture, Tokyo, or Okinawa Prefecture.
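The node semantics just described can be illustrated with a minimal Python sketch. The class names `RangeNode` and `SetNode` are hypothetical and do not appear in the patent; the sketch only assumes that a numeric node is an inclusive range and that an address node is a set of alternatives.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RangeNode:
    """Hypothetical numeric node: an inclusive range such as "20 to 25"."""
    low: int
    high: int

    def contains(self, value: int) -> bool:
        # Both endpoints are inclusive, matching "at least 20 and at most 25".
        return self.low <= value <= self.high


@dataclass(frozen=True)
class SetNode:
    """Hypothetical categorical node: values that were separated by ';'."""
    members: FrozenSet[str]

    def contains(self, value: str) -> bool:
        return value in self.members


age_node = RangeNode(20, 25)
address_node = SetNode(frozenset({"Kanagawa", "Tokyo", "Okinawa"}))
print(age_node.contains(25), age_node.contains(26))
print(address_node.contains("Tokyo"))
```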

FIG. 4 shows an example of the anonymization result data 123. The anonymization result data 123 is the anonymization target data 121 after k-anonymization. FIG. 4 shows the result of k-anonymizing the anonymization target data 121 of FIG. 2.

Each quasi-identifier included in the anonymization result data 123 has been k-anonymized based on the hierarchy data for that quasi-identifier in the generalized hierarchy data 122. In the example of FIG. 4, the anonymization result data 123 includes age 1230 and address 1231. Age 1230 is age 1210 after k-anonymization based on the age hierarchy data 1220. Address 1231 is address 1211 after k-anonymization based on the address hierarchy data 1224.

FIG. 5 shows an example of the process by which the anonymization server 100 of this example k-anonymizes the anonymization target data 121. First, the anonymization target data capturing unit 111 captures the anonymization target data 121, for example via the input interface 105 or the communication interface 104, and stores it in the auxiliary storage device 103 (S501). Next, the generalized hierarchy creating unit 112 calculates the number of records in the anonymization target data 121 (S502).

Next, the generalized hierarchy creating unit 112 creates a generalized hierarchy for each quasi-identifier included in the anonymization target data 121 (S503). The method of creating a generalized hierarchy for quasi-identifiers that are, for example, on an ordinal scale and expressed as numerical values (age 1210 in the example of FIG. 2) is described later with reference to FIG. 6. Alternatively, the generalized hierarchy creation method of FIG. 6 may be applied to all quasi-identifiers expressed as numerical values. Hereinafter, a quasi-identifier to which the generalized hierarchy creation method of FIG. 6 is applied is referred to as a numerical quasi-identifier. The anonymization server 100 is assumed to hold information on whether each quasi-identifier is a numerical quasi-identifier. This information may, for example, be included in the anonymization target data 121 or be stored in the auxiliary storage device 103 as separate data.

For each quasi-identifier other than a numerical quasi-identifier (address 1211 in the example of FIG. 2), the generalized hierarchy creating unit 112 constructs, for example, a Hu-Tucker code tree from the frequency information of the values of that quasi-identifier, and uses it as the generalized hierarchy.

Next, the k-anonymization execution unit 114 refers to the generalized hierarchy data 122 and, based on, for example, a predetermined anonymization plan, performs on each column of the anonymization target data 121 a k-anonymization that satisfies the value of k specified by an operator or the like (S504). The anonymization plan describes the processing used to anonymize the values of the quasi-identifiers.

A specific example of the k-anonymization execution process of step S504 follows. The k-anonymization execution unit 114 creates, for example, a node list consisting of the root node and the non-leaf nodes of each level of all the hierarchy data included in the generalized hierarchy data 122. The k-anonymization execution unit 114 sorts the nodes of the node list in ascending order of the amount of information that would be lost if the values of the anonymization target data 121 were generalized using that node. For example, the lower the frequency of the values of the anonymization target data 121 that belong to a node, the smaller the amount of information lost when those values are generalized using that node.

The k-anonymization execution unit 114 selects, for example, the node with the smallest information loss from the node list and adds it to an initially empty selected-node list. The k-anonymization execution unit 114 then determines whether the anonymization result obtained by generalizing the values of the anonymization target data 121 using the nodes of the selected-node list satisfies k. If the anonymization result satisfies k, the unit stores it in the auxiliary storage device 103 as the anonymization result data 123.

If the anonymization result does not satisfy k, the k-anonymization execution unit 114 selects the node with the smallest information loss from the not-yet-selected nodes of the node list, adds it to the selected-node list, and repeats the determination of whether k is satisfied using the updated selected-node list. The same processing is repeated until the anonymization result satisfies k. If the selected-node list contains nodes in a parent-child relationship, the k-anonymization execution unit 114 generalizes the values of the anonymization target data 121 using the node at the highest level.
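The selection loop described above can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the function names, the representation of records as tuples, and the helper `generalize_ages` (which handles a single age column and treats the widest covering range as the highest-level node) are all assumptions introduced here.

```python
from collections import Counter


def satisfies_k(records, k):
    """True if every distinct combination of quasi-identifier values
    occurs in at least k records."""
    return all(count >= k for count in Counter(records).values())


def generalize_ages(records, selected):
    """Hypothetical generalizer for 1-tuples of ages.

    `selected` holds (low, high) range nodes. When several selected nodes
    cover a value (parent and child), the widest range is used, standing in
    for "the node at the highest level".
    """
    result = []
    for (age,) in records:
        covering = [n for n in selected if n[0] <= age <= n[1]]
        if covering:
            widest = max(covering, key=lambda n: n[1] - n[0])
            result.append((widest,))
        else:
            result.append((age,))  # not covered: value stays as-is
    return result


def greedy_k_anonymize(records, sorted_nodes, k, generalize):
    """Add nodes one at a time (assumed pre-sorted by ascending information
    loss) until the generalized records satisfy k."""
    selected = []
    for node in sorted_nodes:
        selected.append(node)
        result = generalize(records, selected)
        if satisfies_k(result, k):
            return result, selected
    return None, selected  # k could not be reached with the given nodes


records = [(20,), (21,), (30,), (35,)]
result, selected = greedy_k_anonymize(
    records, [(20, 21), (30, 35)], 2, generalize_ages
)
print(result)
```

In this toy input, the first node alone leaves ages 30 and 35 unique, so the loop adds the second node before k = 2 is satisfied.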

If the anonymization target data 121 includes identifiers, the k-anonymization execution unit 114 performs processing such as deleting the identifiers or replacing each identifier value with a different value.

Next, the k-anonymization result output unit 115 outputs the anonymization result data 123 to, for example, the display device 109 or another computer (S505).

FIG. 6 shows an example of the process by which the generalized hierarchy creating unit 112 creates the generalized hierarchy of each numerical quasi-identifier. The generalized hierarchy creating unit 112 starts a loop consisting of steps S602 to S606 for each numerical quasi-identifier included in the anonymization target data 121 (S601). The generalized hierarchy creating unit 112 obtains the frequency of each value of the numerical quasi-identifier and sorts the distinct values (duplicates removed) in ascending order (S602). The generalized hierarchy creating unit 112 adopts the sorted result as the 0th level of the hierarchy data for that numerical quasi-identifier.
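Step S602 amounts to counting each value and sorting the distinct values. A minimal sketch follows; the function name `build_level0` and the (value, frequency) pair representation are illustrative assumptions, not taken from the patent.

```python
from collections import Counter


def build_level0(values):
    """Return the 0th level as (value, frequency) pairs, with duplicates
    removed and values in ascending order."""
    freq = Counter(values)
    return [(value, freq[value]) for value in sorted(freq)]


print(build_level0([5, 10, 5, 8, 5]))
```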

The generalized hierarchy creating unit 112 starts a loop consisting of steps S604 and S605 over the nodes of the current topmost level of the hierarchy data, and repeats it until the root node of the hierarchy data for that numerical quasi-identifier has been created, that is, until a node into which all nodes of the 0th level are merged has been created (S603).

For each of some or all of the nodes of that level, the generalized hierarchy creating unit 112 calculates the statistical index divergence A for the combination of the node with each of its adjacent nodes (S604). The statistical index divergence A is the divergence between the statistical index of the numerical quasi-identifier before anonymization and the statistical index of the anonymization result obtained when the values of the numerical quasi-identifier are anonymized using the node that merges the adjacent nodes.

Formula (1) below gives the statistical index divergence A for a combination of two adjacent nodes when the statistical index is the mean.

In formula (1), n is the total frequency of the values belonging to the nodes of the combination; l_i is each value of the anonymization target data 121 corresponding to the nodes of the combination; V is the mean of the minimum and maximum values contained in the nodes of the combination; and R is the number of records in the anonymization target data 121. For example, if the nodes of the combination are "5" and "10", the minimum value contained in the combination is 5 and the maximum is 10, so V = (5 + 10) / 2 = 7.5. If the nodes of the combination are "1 to 5" and "6 to 10", the minimum value contained in the combination is 1 and the maximum is 10, so V = (1 + 10) / 2 = 5.5.
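The two values of V worked out above can be reproduced with a short sketch. The function name `representative_value` and the (low, high) tuple representation of a node are assumptions made for illustration; a single-value node such as "5" is written as (5, 5).

```python
def representative_value(node_a, node_b):
    """Midpoint V of the smallest and largest values covered by two
    adjacent nodes, each given as an inclusive (low, high) range."""
    lowest = min(node_a[0], node_b[0])
    highest = max(node_a[1], node_b[1])
    return (lowest + highest) / 2


print(representative_value((5, 5), (10, 10)))   # nodes "5" and "10"
print(representative_value((1, 5), (6, 10)))    # nodes "1 to 5" and "6 to 10"
```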

As a further example, consider the case where the nodes of the combination are “5” and “6 to 10”. Suppose the numerical type quasi-identifier of the anonymization target data 121 contains “5” as the value corresponding to “5” and “6”, “8”, and “10” as the values corresponding to “6 to 10”, and the frequencies of “5”, “6”, “8”, and “10” are 3, 1, 2, and 1, respectively. Then n = 3 + 1 + 2 + 1 = 7, l_1 = 5, l_2 = 5, l_3 = 5, l_4 = 6, l_5 = 8, l_6 = 8, and l_7 = 10.
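As a minimal sketch of how the divergence could be evaluated for this example when the statistical index is the average value, the Python below represents each node as a hypothetical `(min, max)` pair and the data as a value-to-frequency map. The function name `divergence`, the node representation, and the choice R = 7 (which assumes the data holds no records other than the seven listed) are illustrative assumptions, not part of the description.

```python
from typing import Dict, Tuple

# Hypothetical node representation: (smallest value, largest value) covered.
# A leaf node for the single value 5 is (5, 5).
Node = Tuple[int, int]


def divergence(left: Node, right: Node, freq: Dict[int, int], total_records: int) -> float:
    """Divergence A for merging two adjacent nodes, with the average as the index."""
    lo, hi = left[0], right[1]
    v = (lo + hi) / 2  # V: average of the merged node's minimum and maximum
    # Sum (l_i - V) over every record whose value falls inside the merged node.
    shift = sum((value - v) * count for value, count in freq.items() if lo <= value <= hi)
    return abs(shift) / total_records  # divide by R, the total number of records


# Frequencies from the example: "5" x3, "6" x1, "8" x2, "10" x1 (n = 7).
freq = {5: 3, 6: 1, 8: 2, 10: 1}
# Assumption: the data contains only these seven records, so R = 7.
a = divergence((5, 5), (6, 10), freq, total_records=7)
print(a)
```

Working from frequencies rather than expanding l_1 to l_7 gives the same sum, because each distinct value contributes its frequency times (value - V).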

Next, the generalized hierarchy creating unit 112 takes the node obtained by integrating the nodes of the combination with the smallest divergence A and determines it to be the node one hierarchy up, that is, the parent node, corresponding to each node of that combination. The sum of the frequencies of the nodes of the combination becomes the frequency of the parent node (S605). For each node outside that combination, the generalized hierarchy creating unit 112 determines, for example, a node identical to that node as its corresponding node in the hierarchy one level up.

When several combinations share the smallest divergence A, the generalized hierarchy creating unit 112 integrates, for example, the nodes of one combination selected from among them. In making the selection, it may, for example, pick one combination at random, or pick the combination whose maximum or minimum included value is the largest or the smallest.

While calculating the divergence A, if the generalized hierarchy creating unit 112 determines that the divergence A of some combination is at or below a predetermined threshold, it may integrate the nodes of that combination and stop calculating A for the remaining combinations. This lets the generalized hierarchy creating unit 112 cut the amount of computation while still reducing the divergence of the statistical index before and after anonymization. When the predetermined threshold is 0, the divergence of the statistical index before and after anonymization can be reduced in particular.
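The selection in steps S604 and S605, including the tie handling and the optional threshold, can be sketched as follows. This is an illustration under stated assumptions: `choose_merge`, the `(min, max)` node pairs, and the tie rule (keep the pair with the smallest values, one of the options mentioned above) are hypothetical choices, and the sample data reuses the age values 5, 20, and 25 with frequencies 2, 1, and 3 from the worked example in the description.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int]  # hypothetical (min, max) representation of a node


def divergence(left: Node, right: Node, freq: Dict[int, int], total_records: int) -> float:
    """Divergence A for merging two adjacent nodes (average as the index)."""
    lo, hi = left[0], right[1]
    v = (lo + hi) / 2
    shift = sum((value - v) * count for value, count in freq.items() if lo <= value <= hi)
    return abs(shift) / total_records


def choose_merge(level: List[Node], freq: Dict[int, int], total_records: int,
                 threshold: Optional[float] = None) -> int:
    """Return i such that level[i] and level[i + 1] are the pair to integrate."""
    best_i, best_a = 0, float("inf")
    for i in range(len(level) - 1):
        a = divergence(level[i], level[i + 1], freq, total_records)
        if threshold is not None and a <= threshold:
            return i  # early stop: skip the remaining combinations
        if a < best_a:  # strict '<' keeps the lowest-valued pair on a tie
            best_i, best_a = i, a
    return best_i


level = [(5, 5), (20, 20), (25, 25)]
freq = {5: 2, 20: 1, 25: 3}
print(choose_merge(level, freq, total_records=6))                 # full search
print(choose_merge(level, freq, total_records=6, threshold=1.5))  # early stop
```

With a generous threshold the first acceptable pair wins even though a later pair has a smaller divergence, which is the trade between computation and accuracy that the paragraph above describes.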

Once the root node has been created, the generalized hierarchy creating unit 112 ends creation of the hierarchy data for the numerical type quasi-identifier (S606). If the root node has not yet been created, it performs steps S604 and S605 on the topmost hierarchy.

Once hierarchy data has been created for every numerical type quasi-identifier, the generalized hierarchy creating unit 112 ends the generalized hierarchy creation process (S607). If a numerical type quasi-identifier without hierarchy data remains, it performs steps S602 to S606 on that quasi-identifier.

The following is an example of the process (corresponding to steps S602 to S606) by which the generalized hierarchy creating unit 112 creates the age hierarchy data 1220 from the age 1210 of the anonymization target data 121, with the average value as the statistical index. The generalized hierarchy creating unit 112 acquires the frequencies 3, 2, and 1 of the values “25”, “5”, and “20” included in the age 1210. It then sorts the values included in the age 1210 in ascending order, that is, “5”, “20”, “25”, and takes the sorted result as the 0th hierarchy 1221.

The generalized hierarchy creating unit 112 calculates the divergence A_1 for the combination of the adjacent nodes “5” and “20”, and the divergence A_2 for the combination of the adjacent nodes “20” and “25”.

A_1 is calculated as follows. Since R = 6 and V = (5 + 20) / 2 = 12.5, A_1 = |{(5 - 12.5) + (5 - 12.5) + (20 - 12.5)} / 6| = 1.25.

A_2 is calculated as follows. Since R = 6 and V = (20 + 25) / 2 = 22.5, A_2 = |{(20 - 22.5) + (25 - 22.5) + (25 - 22.5) + (25 - 22.5)} / 6| = 0.833...

Since A_1 > A_2, the generalized hierarchy creating unit 112 creates the node “20 to 25” by integrating “20” and “25”. It determines the created node “20 to 25” to be the parent node of the nodes “20” and “25” of the 0th hierarchy 1221, that is, the node of the first hierarchy 1222 corresponding to each of “20” and “25”. For the node “5”, which was not integrated, it determines “5” itself to be the node of the first hierarchy 1222 corresponding to “5” of the 0th hierarchy 1221.

Since the first hierarchy 1222 consists of only the two nodes “5” and “20 to 25”, the generalized hierarchy creating unit 112 determines the node “5 to 25”, which integrates them, to be the root node 1223 corresponding to all nodes of the first hierarchy 1222.
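The age example can be replayed end to end with a short sketch of the loop in steps S603 to S606. The names `build_hierarchy` and `divergence`, the `(min, max)` node pairs, and the tie rule (the first minimal pair wins) are assumptions made for illustration. One pair is integrated per hierarchy and every other node is carried up unchanged, as in the description.

```python
from collections import Counter
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # hypothetical (min, max) representation of a node


def divergence(left: Node, right: Node, freq: Dict[int, int], total_records: int) -> float:
    """Divergence A for merging two adjacent nodes (average as the index)."""
    lo, hi = left[0], right[1]
    v = (lo + hi) / 2
    shift = sum((value - v) * count for value, count in freq.items() if lo <= value <= hi)
    return abs(shift) / total_records


def build_hierarchy(values: List[int]) -> List[List[Node]]:
    """Return the hierarchies from the 0th (distinct sorted values) up to the root."""
    freq = Counter(values)
    total = len(values)
    level: List[Node] = [(v, v) for v in sorted(freq)]  # 0th hierarchy
    levels = [level]
    while len(level) > 1:  # loop until a single root node remains
        # Pick the adjacent pair with the smallest divergence (first one on ties).
        i = min(range(len(level) - 1),
                key=lambda j: divergence(level[j], level[j + 1], freq, total))
        merged = (level[i][0], level[i + 1][1])
        # Integrate that pair; all other nodes are carried up unchanged.
        level = level[:i] + [merged] + level[i + 2:]
        levels.append(level)
    return levels


ages = [25, 5, 20, 25, 5, 25]  # values 25, 5, 20 with frequencies 3, 2, 1
levels = build_hierarchy(ages)
for depth, nodes in enumerate(levels):
    print(depth, nodes)
```

For this data the sketch yields the nodes 5, 20, 25 at the 0th hierarchy, 5 and 20 to 25 at the first hierarchy, and 5 to 25 at the root, matching the age hierarchy data 1220.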

FIG. 7 is an example of age hierarchy data for the age 1210 of the anonymization target data 121 in FIG. 2, created with a conventional technique such as a Hu-Tucker code tree. The age hierarchy data 700 includes a 0th hierarchy 701, a first hierarchy 702, and a root node 703. In the conventional generalized hierarchy creation process, nodes with low frequencies are integrated with each other to create the nodes of the higher hierarchy. Because the frequencies of the values “5”, “20”, and “25” are 2, 1, and 3, respectively, “5” and “20” are integrated in the first hierarchy 702. In this respect, the age hierarchy data 700 differs from the age hierarchy data 1220 of the present embodiment.

FIG. 8 shows an example comparing anonymization of a numerical type quasi-identifier using the generalized hierarchy of the present embodiment with anonymization using a conventional generalized hierarchy. Specifically, FIG. 8 shows the anonymization server 100 applying k-anonymization satisfying k = 2 to data consisting only of the age 1210, once using the age hierarchy data 700 of the conventional technique and once using the age hierarchy data 1220 of the present embodiment. FIG. 8 also compares, between the present embodiment and the conventional technique, how far the average value and variance of the k-anonymized result data diverge from the average value and variance of the age 1210. The average value of the age 1210 is 17.5 and its variance is 81.25.

First, k-anonymization using the age hierarchy data 700 of the conventional technique is described. The k-anonymization execution unit 114 can achieve k-anonymization satisfying k = 2 by anonymizing the age 1210 with the first hierarchy 702, which yields the result data 802. The range-converted data 803 is the data obtained by converting the value expressed as a range in the result data 802, namely “5 to 20”, into the average value of that range, 12.5, so that the average value and variance of the result data 802 can be calculated.

The average value of the range-converted data 803 is 18.75 and its variance is 39.0625. The divergence between the age 1210 and the range-converted data 803 is 1.25 for the average value and 42.1875 for the variance.

Next, k-anonymization using the age hierarchy data 1220 of the present embodiment is described. The k-anonymization execution unit 114 can achieve k-anonymization satisfying k = 2 by anonymizing the age 1210 with the first hierarchy 1222, which yields the result data 804. The range-converted data 805 is the data obtained by converting the value expressed as a range in the result data 804, namely “20 to 25”, into the average value of that range, 22.5, so that the average value and variance of the result data 804 can be calculated.

The average value of the range-converted data 805 is 16.6667 and its variance is 68.0556. The divergence between the age 1210 and the range-converted data 805 is 0.83333 for the average value and 13.1944 for the variance. The divergence of the average value before and after anonymization is therefore smaller in the present embodiment than in the conventional technique, and the same holds for the divergence of the variance.
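The figures quoted for FIG. 8 can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names `generalize`, `mean`, `variance`, and `is_k_anonymous` are hypothetical, the variance is the population variance (dividing by the number of records, which is what reproduces 81.25), and each range is replaced by the midpoint of its minimum and maximum as in the range-converted data 803 and 805.

```python
from collections import Counter
from typing import List, Sequence, Tuple

ages = [5, 5, 20, 25, 25, 25]  # age 1210: values 5, 20, 25 with frequencies 2, 1, 3


def generalize(values: Sequence[float], ranges: List[Tuple[int, int]]) -> List[float]:
    """Replace each value with the midpoint of the (min, max) range containing it."""
    out = []
    for v in values:
        for lo, hi in ranges:
            if lo <= v <= hi:
                out.append((lo + hi) / 2)
                break
    return out


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def variance(xs: Sequence[float]) -> float:
    """Population variance (divide by the number of records)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def is_k_anonymous(xs: Sequence[float], k: int) -> bool:
    """True if every generalized value occurs at least k times."""
    return min(Counter(xs).values()) >= k


conventional = generalize(ages, [(5, 20), (25, 25)])  # first hierarchy 702
proposed = generalize(ages, [(5, 5), (20, 25)])       # first hierarchy 1222

for name, data in (("conventional", conventional), ("proposed", proposed)):
    print(name,
          is_k_anonymous(data, 2),
          abs(mean(ages) - mean(data)),          # divergence of the average value
          abs(variance(ages) - variance(data)))  # divergence of the variance
```

Both generalizations satisfy k = 2, and the printed divergences should agree with the 1.25 / 42.1875 and 0.83333 / 13.1944 pairs quoted above, up to rounding.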

As described above, in creating the generalized hierarchy for a numerical type quasi-identifier, the anonymization server 100 of the present embodiment can reduce the divergence of the statistical index of the quasi-identifier before and after k-anonymization, and thus suppress the degradation of data quality caused by k-anonymization. Moreover, because the anonymization server 100 calculates the divergence of the statistical index without calculating the statistical index itself before and after k-anonymization, it can suppress the degradation of data quality with a small amount of computation.

In addition, by using the average value as the statistical index, the anonymization server 100 can reduce, with a small amount of computation, the divergence before and after k-anonymization of the average value, which is one of the important basic statistics.

Furthermore, replacing the generalized hierarchy creation process of an existing anonymization server with step S503 of the present embodiment improves, at low cost, the quality of the data after k-anonymization by that server.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above are described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations having all of the elements described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a given embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as integrated circuits. They may also be implemented in software, with a processor interpreting and executing programs that realize the respective functions. Information such as the programs, tables, and files that realize each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.

The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines of a product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

DESCRIPTION OF SYMBOLS: 100 anonymization server, 101 processor, 102 memory, 103 auxiliary storage device, 111 anonymization target data acquisition unit, 112 generalized hierarchy creating unit, 113 statistical index calculation unit, 114 k-anonymization execution unit, 121 anonymization target data, 122 generalized hierarchy data, 123 anonymization result data

Claims

A method by which a numerical data anonymization server, which holds numerical data including a plurality of numerical values, anonymizes the numerical data,
The numerical data anonymization server includes a processor that executes a program, and a memory that the processor accesses.
The method
The processor creates a generalized hierarchy consisting of a plurality of hierarchies based on numeric values included in the numeric data,
The processor anonymizes the numerical data using the generalized hierarchy;
In the process of creating the first hierarchy of the generalized hierarchy,
Generating a combination of the first node included in the 0th hierarchy, which is one level below the first hierarchy, and each of the nodes adjacent to the first node in a numerical magnitude relationship;
For each of the generated combinations, the numerical value anonymized using the first hierarchy when a node obtained by integrating the nodes of the combination is determined as a node of the first hierarchy corresponding to each of the nodes of the combination Calculate the divergence between the statistical index of the data and the statistical index of the numerical data,
A method of determining a node obtained by integrating a combination of nodes having the smallest calculated divergence as a node of the first hierarchy corresponding to each node of the combination.

The method of claim 1, comprising:
The statistical index is an average value,
The method
The processor calculates the divergence for each of the generated combinations using the following equation:

In the above formula, A is the divergence with respect to the combination, n is the total frequency in the numerical data of the numerical values corresponding to the nodes of the combination, V is the average value of the minimum and maximum values included in the nodes of the combination, l _i is a numerical value corresponding to a leaf node that is a descendant node of each node of the combination, and R is the number of records of the numerical data.

The method of claim 1, comprising:
The processor is
In the creation process of the first hierarchy,
Calculating the divergence for a first combination of the first node and one of the nodes adjacent to the first node in a numerical magnitude relationship;
When the calculated divergence is not more than a threshold value,
Without calculating the divergence for a second combination of the first node and the other of the nodes adjacent to the first node,
A method in which a node obtained by integrating the nodes of the first combination is determined as the node of the first hierarchy corresponding to each of the nodes of the first combination.

The method of claim 1, comprising:
The processor is
Repeat the creation process of creating the generalized hierarchy level one by one in order from the lowest level until the level indicating the root node is created,
The numerical data excluding duplication is determined as the lowest hierarchy of the generalized hierarchy,
A method in which, in the creation process of each hierarchy other than the lowest hierarchy, the hierarchy is the first hierarchy.

A numerical data anonymization server that anonymizes numerical data including a plurality of numerical values,
A processor and a storage device;
The storage device holds the numerical data,
The processor is
Create a generalized hierarchy consisting of multiple hierarchies based on the numeric values contained in the numeric data,
Using the generalized hierarchy, the numerical data is anonymized,
In creating the first hierarchy of the generalized hierarchy,
For each combination of the first node included in the 0th hierarchy, which is one hierarchy below the first hierarchy, and each of the nodes adjacent to the first node in a numerical magnitude relationship, calculate the divergence between the statistical index of the numerical data anonymized using the first hierarchy when a node obtained by integrating the nodes of the combination is determined as the node of the first hierarchy corresponding to each node of the combination, and the statistical index of the numerical data,
A numerical data anonymization server that determines a node obtained by integrating the first combination of nodes having the smallest calculated deviation as a node of the first hierarchy corresponding to each of the first combination of nodes.