JP6681799B2 - Generating apparatus, method and program for generalized hierarchical tree

Info

Publication number: JP6681799B2
Application number: JP2016138351A
Authority: JP
Inventors: 知明三本; 清本　晋作; 晋作清本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2016-07-13
Filing date: 2016-07-13
Publication date: 2020-04-15
Anticipated expiration: 2036-07-13
Also published as: JP2018010453A

Description

The present invention relates to an apparatus, a method, and a program for generating a generalized hierarchical tree used in data set anonymization.

Conventionally, for data sets containing attributes that can identify an individual, anonymization methods have been proposed that, from the standpoint of privacy protection, generalize some of the attribute values so that individuals cannot be identified even when data are combined.
For example, the method called k-anonymization requires constructing a generalization hierarchy for generalizing attribute values (see, for example, Non-Patent Documents 1 to 3).

Non-Patent Document 1: Kunihiko Harada, Yoshinori Sato, "k-anonymization method with automatic generation of generalized hierarchical tree and distortion evaluation by information entropy," Research Report Computer Security (CSEC), 2010-CSEC-50(47), 1-7, 2010-06-24.
Non-Patent Document 2: Iwuchukwu, T. and Naughton, J. F. (2007), "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB, pages 746-757.
Non-Patent Document 3: Byun, J.-W., Kamra, A., Bertino, E., and Li, N. (2007), "Efficient k-Anonymity Using Clustering Technique," In Proc. of the International Conference on Database Systems for Advanced Applications, pages 188-200.

However, anonymization that uses a generalization hierarchy constructed by existing methods sometimes generalizes the data more than necessary, resulting in a large loss of information.

An object of the present invention is to provide a generalized hierarchical tree generation device, method, and program capable of reducing the loss of information.

A generalized hierarchical tree generation device according to the present invention includes: a selection unit that selects, for a first attribute included in a data set composed of a plurality of attributes, a second attribute having the highest correlation with the first attribute; a first generation unit that generates nodes of the lowest level such that, among the records whose second attribute falls in a specific value range, the number of records contained in each value range into which the values of the first attribute are partitioned falls within a predetermined range; and a second generation unit that merges nodes of a lower level to generate nodes of an upper level that are fewer in number than the nodes of the lower level.

The second generation unit may select a not-yet-selected attribute and generate the upper-level nodes such that, among the records whose selected attribute falls in a specific value range, the values of the first attribute are partitioned into value ranges that are fewer in number than the lower-level nodes and that each contain a substantially equal number of records. The generation device may further include an adjustment unit that adjusts the number of upper-level nodes based on the inclusion relationship between the value ranges of the first attribute in the lower-level nodes and in the upper-level nodes.

When the total number of lower-level nodes and upper-level nodes that are in an inclusion relationship exceeds a threshold, the adjustment unit may divide the entire value range of those nodes into two to form the adjusted nodes; when the total number is less than or equal to the threshold, it may treat the entire value range of those nodes as a single adjusted node.

The generation device may include an output unit that outputs, for the value range of each node at each level, the specific value range of the selected attribute as a generalization condition.

In a generalized hierarchical tree generation method according to the present invention, a computer executes: a selection step of selecting, for a first attribute included in a data set composed of a plurality of attributes, a second attribute having the highest correlation with the first attribute; a first generation step of generating nodes of the lowest level such that, among the records whose second attribute falls in a specific value range, the number of records contained in each value range into which the values of the first attribute are partitioned falls within a predetermined range; and a second generation step of merging nodes of a lower level to generate nodes of an upper level that are fewer in number than the nodes of the lower level.

A generalized hierarchical tree generation program according to the present invention causes a computer to execute: a selection step of selecting, for a first attribute included in a data set composed of a plurality of attributes, a second attribute having the highest correlation with the first attribute; a first generation step of generating nodes of the lowest level such that, among the records whose second attribute falls in a specific value range, the number of records contained in each value range into which the values of the first attribute are partitioned falls within a predetermined range; and a second generation step of merging nodes of a lower level to generate nodes of an upper level that are fewer in number than the nodes of the lower level.

According to the present invention, the loss of information in data set anonymization is reduced.

FIG. 1 is a block diagram showing the functional configuration of the generation device according to the embodiment.
FIG. 2 is a diagram showing an example of a method for representing the correlation and the distribution of attribute values according to the embodiment.
FIG. 3 is a diagram illustrating the method for adjusting an upper level according to the embodiment.
FIG. 4 is a flowchart showing the method for generating a generalized hierarchical tree according to the embodiment.

An example of an embodiment of the present invention is described below.
When a data set composed of a plurality of attributes is anonymized by generalizing the attribute values of a combination of attributes that can identify an individual (a quasi-identifier), the generation device 1 according to the present embodiment generates a generalized hierarchical tree that defines the attribute values for each generalization level.

The generation device 1 is an information processing device (computer), such as a server or a PC, that includes a control unit (for example, a CPU) and a storage unit (for example, an HDD). The various functions of the present embodiment are realized when the control unit reads and executes software (a generation program) stored in the storage unit.

FIG. 1 is a block diagram showing the functional configuration of the generation device 1 according to the present embodiment.
The generation device 1 includes a selection unit 11, a first generation unit 12, a second generation unit 13, an adjustment unit 14, and an output unit 15.

The selection unit 11 receives as input a data set D to be evaluated, the quasi-identifier ATTR contained in the data set D, and an attribute attr_f ∈ ATTR for which a generalized hierarchical tree is to be generated.
For the given first attribute attr_f, the selection unit 11 selects the second attribute attr_g that has the highest correlation with it.

Specifically, the selection unit 11 computes, over all records r_p in the data set D, the correlation between the attribute value r_p[attr_f] of the attribute attr_f and the attribute value r_p[attr] of each attribute attr ∈ ATTR \ {attr_f}, and selects the attribute with the highest correlation.
The correlation between attributes is obtained either from a correlation coefficient computed from a scatter plot of the records, or from the degree of association based on statistical information. For example, the attribute may be selected for which the mean or median of r_p[attr_f], taken over the records for each value or value range of r_p[attr], differs the most.
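One possible reading of the median-based criterion above can be sketched in Python as follows. The record layout (a list of dicts), the function names, and the sample values are illustrative assumptions and do not come from the patent; the score used here is simply the gap between the largest and smallest per-group median of attr_f.

```python
from collections import defaultdict
from statistics import median


def group_median_spread(records, target, candidate):
    """Gap between the largest and smallest median of `target`,
    taken over the groups of records sharing a `candidate` value."""
    groups = defaultdict(list)
    for r in records:
        groups[r[candidate]].append(r[target])
    medians = [median(values) for values in groups.values()]
    return max(medians) - min(medians)


def select_second_attribute(records, quasi_identifiers, target):
    """Pick attr_g: the quasi-identifier (other than the target attr_f)
    whose groups separate the target values the most."""
    candidates = [a for a in quasi_identifiers if a != target]
    return max(candidates, key=lambda a: group_median_spread(records, target, a))


# Hypothetical sample data (not from the patent).
records = [
    {"height": 175, "sex": "M", "history": "yes"},
    {"height": 172, "sex": "M", "history": "no"},
    {"height": 178, "sex": "M", "history": "no"},
    {"height": 160, "sex": "F", "history": "yes"},
    {"height": 158, "sex": "F", "history": "no"},
    {"height": 163, "sex": "F", "history": "yes"},
]
attr_g = select_second_attribute(records, ["height", "sex", "history"], "height")
```

With this sample, the medians of height differ more between the two values of "sex" than between the two values of "history", so "sex" is chosen. A correlation coefficient could be substituted for the score without changing the selection logic.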

The first generation unit 12 generates the nodes of the lowest level such that, among the records whose second attribute falls in a specific value range (a single value or a range of values), the number of records contained in each value range into which the values of the first attribute are partitioned falls within a predetermined range. Specifically, an algorithm for the packing problem is used, and as a result each lowest-level node of the generalized hierarchical tree contains at least a certain number of records.
The various parameters of this processing may be received as input from the user as appropriate.
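The text does not say which packing algorithm is used. As a rough stand-in, the following Python sketch greedily groups consecutive values of the first attribute until a node holds at least min_count records. The function name, the min_count parameter, and the sample heights are assumptions made for illustration only.

```python
from collections import Counter


def lowest_level_nodes(values, min_count):
    """Greedily partition sorted attribute values into (low, high, count)
    nodes that each hold at least `min_count` records.

    This is a simple greedy substitute, not the patent's packing algorithm.
    A trailing group that is too small is folded into the previous node.
    """
    counts = sorted(Counter(values).items())
    nodes = []
    low, total = None, 0
    for value, n in counts:
        if low is None:
            low = value
        total += n
        if total >= min_count:
            nodes.append([low, value, total])
            low, total = None, 0
    if low is not None:  # leftover records that did not reach min_count
        if nodes:
            nodes[-1][1] = counts[-1][0]
            nodes[-1][2] += total
        else:
            nodes.append([low, counts[-1][0], total])
    return [tuple(node) for node in nodes]


# Hypothetical heights of the records with attr_g (sex) = male.
heights = [160, 161, 161, 163, 165, 165, 165, 168, 170]
leaves = lowest_level_nodes(heights, min_count=3)
```

Here the first node covers 160-161 with three records, and the remaining six records end up in a single node covering 163-170 because the last two values alone would fall below min_count.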

In the node generation by the selection unit 11 and the first generation unit 12, an interface may be provided for selecting the second attribute and for generating the lowest-level nodes of the generalized hierarchical tree based on input from the user.
For example, the generation device 1 produces output that visually represents the correlation and the distribution of attribute values, and accepts selection input, parameter input, and the like from the user.

FIG. 2 is a diagram showing an example of a method for representing the correlation and the distribution of attribute values according to the present embodiment.
In this example, a heat map is created and displayed with height, the attribute of interest attr_f, on one axis and sex, the attribute attr_q whose correlation is examined, on the other.
The user can thereby judge visually that the distributions of attr_f (height) for the individual values of attr_q (sex) overlap little and that their medians or means are far apart, that is, that the correlation is high. The user can also adjust the node boundaries appropriately while viewing the distribution of attribute values together with the boundary lines.

The second generation unit 13 merges nodes of a lower level to generate nodes of an upper level that are fewer in number than the nodes of the lower level.
For example, the second generation unit 13 may construct the generalized hierarchical tree according to a rule such as merging two lower-level nodes into one upper-level node; in the present embodiment, however, the procedure described next is adopted as an example.
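As an aside, the simple pairwise rule mentioned above (which is not the procedure this embodiment adopts) can be sketched in Python as follows. Nodes are represented as (low, high) tuples, and the function names and sample ranges are illustrative assumptions.

```python
def merge_pairs(nodes):
    """Merge adjacent (low, high) nodes two at a time.
    An odd trailing node is carried up unchanged."""
    merged = []
    for i in range(0, len(nodes), 2):
        group = nodes[i:i + 2]
        merged.append((group[0][0], group[-1][1]))
    return merged


def build_levels(leaves):
    """Repeat the pairwise merge until a single root node remains."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        levels.append(merge_pairs(levels[-1]))
    return levels


# Hypothetical lowest-level nodes.
leaves = [(150, 162), (163, 164), (165, 166), (167, 167), (168, 175)]
levels = build_levels(leaves)
```

Each level has fewer nodes than the one below it (5, 3, 2, 1 in this example), but the merge ignores how the records are distributed, which is the limitation the embodiment's procedure addresses.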

The second generation unit 13 selects the not-yet-selected attributes in order. For each one, it generates nodes such that, among the records whose selected attribute falls in a specific value range (a single value or a range of values), the number of records contained in each value range into which the values of the first attribute are partitioned is substantially equal; the adjustment unit 14 then adjusts these nodes into the upper-level nodes.
Here, the number of nodes generated by the second generation unit 13 is smaller than the number of lower-level nodes already generated.
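A partition with substantially equal record counts can be approximated by equal-frequency binning. The Python sketch below closes a node each time the cumulative record count reaches the next quantile boundary. The function name, the n_nodes parameter, and the sample heights are assumptions made for illustration; the patent does not prescribe this particular method.

```python
from collections import Counter


def equal_frequency_nodes(values, n_nodes):
    """Partition attribute values into at most `n_nodes` (low, high, count)
    nodes holding roughly the same number of records.
    Records sharing a value always stay in the same node."""
    counts = sorted(Counter(values).items())
    total = len(values)
    nodes = []
    low, in_node, seen = None, 0, 0
    for value, n in counts:
        if low is None:
            low = value
        in_node += n
        seen += n
        # Close the node once the cumulative count reaches the next quantile.
        if seen >= total * (len(nodes) + 1) / n_nodes:
            nodes.append((low, value, in_node))
            low, in_node = None, 0
    return nodes


# Hypothetical heights of the records with attr_q (medical history) = yes.
heights = [155, 157, 158, 160, 162, 163, 165, 166, 170, 171, 174, 175]
upper = equal_frequency_nodes(heights, n_nodes=3)
```

To satisfy the condition stated above, n_nodes would be chosen smaller than the number of lower-level nodes already generated. Heavily repeated values can cause fewer than n_nodes nodes to be returned.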

The adjustment unit 14 adjusts the number of upper-level nodes based on the inclusion relationship between the value ranges of the first attribute in the lower-level nodes already generated and in the nodes generated by the second generation unit 13.
For example, when the total number of lower-level nodes and upper-level nodes that are in an inclusion relationship exceeds a threshold (for example, four), the adjustment unit 14 divides the entire value range of those nodes into two to form the adjusted upper-level nodes. When the total number is less than or equal to the threshold, the entire value range of those nodes becomes a single adjusted node.
The various parameters of this processing may be received as input from the user as appropriate.

FIG. 3 is a diagram illustrating the method by which the second generation unit 13 and the adjustment unit 14 according to the present embodiment adjust an upper level.
In this adjustment method, the upper-level nodes are adjusted based on the first attribute (attr_f) and another attribute (attr_q ∈ ATTR \ {attr_f, attr_g}).

For example, let attr_f be height, attr_g be sex, and attr_q be medical history. Suppose the first generation unit 12 has produced, for "attr_g (sex) = male", the node partition (A): "-162 (cm)", "163-164", "165-166", "167", "168", ..., "175", "176-178", "179-".
Suppose also that the second generation unit 13 has produced from attr_f and attr_q, for "attr_q (medical history) = yes", the node partition (B): "-157", "158-162", "163-165", "166-170", "171-174", "175-".

Here, where the overlapping ranges of partition (A) and partition (B) amount to five or more segments in total, the segment is divided into two (X); where they amount to fewer than five, the segments are combined into the single larger one (Y). As a result, the partition (C) of upper-level nodes is generated.
Thus, for example, the attribute value r_p[attr_f] = 172 of a record with "attr_g (sex) = male" and "attr_q (medical history) = yes" is generalized to "171-172".
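The adjustment rule leaves some details open, so the following Python sketch is only one possible reading of it. It assumes that a lower-level node belongs to the candidate upper node that contains its lower bound, that the total is the number of such lower nodes plus the candidate itself, and that a split divides the member nodes into two halves. The open-ended ranges "-162" and "179-" are replaced by the finite bounds 140 and 200, and the elided nodes between "168" and "175" are assumed to be single values. All names are hypothetical.

```python
def adjust_upper_nodes(lower, candidates, threshold=4):
    """Adjust candidate upper nodes against existing lower nodes.

    lower, candidates: sorted lists of (low, high) ranges.
    Returns the adjusted upper-level nodes as (low, high) ranges whose
    boundaries coincide with lower-node boundaries.
    """
    adjusted = []
    for c_low, c_high in candidates:
        # Assumption: membership is decided by the lower node's lower bound.
        members = [n for n in lower if c_low <= n[0] <= c_high]
        if not members:
            continue
        total = len(members) + 1  # member lower nodes plus the candidate
        if total > threshold and len(members) >= 2:
            mid = len(members) // 2  # (X): divide the range into two
            adjusted.append((members[0][0], members[mid - 1][1]))
            adjusted.append((members[mid][0], members[-1][1]))
        else:  # (Y): keep one node covering the whole range
            adjusted.append((members[0][0], members[-1][1]))
    return adjusted


# Partition (A), with assumed finite bounds and assumed single-value nodes.
part_a = [(140, 162), (163, 164), (165, 166)] + \
         [(h, h) for h in range(167, 176)] + [(176, 178), (179, 200)]
# Partition (B), with the same assumed finite bounds.
part_b = [(140, 157), (158, 162), (163, 165), (166, 170), (171, 174), (175, 200)]
part_c = adjust_upper_nodes(part_a, part_b)
```

Under these assumptions the candidate "171-174" has four member nodes, giving a total of five, so it is split into "171-172" and "173-174". This agrees with the generalization of 172 to "171-172" in the example, although the actual partition (C) in FIG. 3 may differ elsewhere.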

The output unit 15 outputs the generalized hierarchical tree obtained after all attributes have been selected in order and the adjustment has been performed.
At this time, for the value range of each node at each level (the generalized attribute value), it also outputs the specific value range of the selected attribute as a generalization condition.
For example, the information that the attribute value r_p[attr_f] = 172 of a record with "attr_g (sex) = male" and "attr_q (medical history) = yes" is generalized to "171-172" is output as a generalization condition based on the hierarchical tree.
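To show how such conditions might be consumed during anonymization, the Python sketch below pairs each condition (a mapping from attribute to required value) with the node ranges that apply under it. The rule format, the function name, and the sample rule are assumptions made for illustration; the patent does not define an output format.

```python
def generalize(record, target, rules):
    """Return the generalized value of `target` for `record`, or None
    if no rule applies.

    rules: list of (conditions, nodes) pairs, where `conditions` maps an
    attribute name to its required value and `nodes` is a list of
    (low, high) ranges for the target attribute.
    """
    for conditions, nodes in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            for low, high in nodes:
                if low <= record[target] <= high:
                    return f"{low}-{high}"
    return None


# Hypothetical rule matching the example in the text.
rules = [
    ({"sex": "male", "history": "yes"}, [(171, 172), (173, 174)]),
]
record = {"height": 172, "sex": "male", "history": "yes"}
generalized = generalize(record, "height", rules)
```

For the sample record the lookup yields "171-172". A record that matches no condition returns None, leaving the caller to fall back to a coarser level of the tree.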

FIG. 4 is a flowchart showing the method by which the generation device 1 according to the present embodiment generates a generalized hierarchical tree.

In step S1, the selection unit 11 calculates the correlation between the attribute attr_f, for which the generalized hierarchical tree is to be generated, and each of the other attributes.
In step S2, the selection unit 11 selects the attribute attr_g with the highest correlation calculated in step S1.

In step S3, the first generation unit 12 generates the lowest-level nodes of the generalized hierarchical tree for each value range of the attribute attr_g (for example, male or female for sex).

In step S4, the second generation unit 13 selects an attribute attr_q other than the attributes attr_f and attr_g.
In step S5, the second generation unit 13 generates, for each value range of the attribute attr_q (for example, presence or absence of a medical history), nodes of the level above the nodes already generated in the generalized hierarchical tree.

In step S6, the adjustment unit 14 adjusts the number and the value ranges of the upper-level nodes based on the inclusion relationship, across levels, between the nodes already generated and the nodes generated in step S5.

In step S7, the second generation unit 13 determines whether all attributes have been selected in step S4. If the determination is YES, the process proceeds to step S8; if NO, the process returns to step S4.

In step S8, the output unit 15 outputs the generalized hierarchical tree of the attribute attr_f together with the generalization conditions to be applied when using this hierarchical tree.
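The control flow of steps S1 to S8 can be summarized as the following Python skeleton. The four callables passed in are hypothetical placeholders for the processing of the selection unit, the first and second generation units, and the adjustment unit; their names and signatures are assumptions, and the skeleton ignores the per-value-range handling of steps S3 and S5.

```python
def generate_tree(records, attrs, target, select, make_leaves, make_upper, adjust):
    """Skeleton of steps S1 to S8.

    Returns a list of (attribute, nodes) pairs ordered from the lowest
    level upward, one pair per attribute used as a condition.
    """
    # S1-S2: choose attr_g, the attribute most correlated with the target.
    second = select(records, attrs, target)
    # S3: generate the lowest-level nodes conditioned on attr_g.
    levels = [(second, make_leaves(records, target, second))]
    # S4 and S7: visit every remaining attribute exactly once.
    for attr in attrs:
        if attr in (target, second):
            continue
        # S5: candidate upper-level nodes conditioned on attr_q.
        candidate = make_upper(records, target, attr, levels[-1][1])
        # S6: adjust the candidates against the level below.
        levels.append((attr, adjust(levels[-1][1], candidate)))
    # S8: the caller outputs the levels together with their conditions.
    return levels
```

Passing the units in as functions keeps the loop independent of how correlation, partitioning, and adjustment are actually implemented.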

According to the present embodiment, by selecting the attribute with the highest correlation, the generation device 1 can maximize the number of nodes at the lowest level, which reduces the loss of information caused by generalization in data set anonymization.

In addition, in the process of generating the upper levels of the generalized hierarchical tree, the generation device 1 partitions the nodes according to the values of each attribute and then adjusts the number of nodes based on the inclusion relationship with the nodes already generated. The generation device 1 can therefore generate the upper levels of the generalized hierarchical tree appropriately, based on the association with each attribute, and can reduce the loss of information when the generalization level is raised during anonymization.

Furthermore, in the process of adjusting the number of nodes, the generation device 1 determines the adjusted number of nodes by comparing the total number of nodes in an inclusion relationship with a threshold. The generation device 1 can therefore suppress the loss of information that would result from generalization reducing the number of nodes too far.

Moreover, by outputting the generalization conditions for using the hierarchical tree together with the generalized hierarchical tree itself, the generation device 1 can make the anonymization process more efficient.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described for the present embodiment merely list the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described for the present embodiment.

The functional units of the generation device 1 of the present embodiment may be distributed across a plurality of information processing devices (computers). The functions of the present embodiment may also be provided by a cloud system in which the load is distributed across a plurality of servers.

The generation method performed by the generation device 1 is realized by software. When it is realized by software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer via a network. Furthermore, these programs may be provided to the user's computer as a Web service over a network without being downloaded.

DESCRIPTION OF SYMBOLS
1 Generation device
11 Selection unit
12 First generation unit
13 Second generation unit
14 Adjustment unit
15 Output unit

Claims

A generalized hierarchical tree generation device comprising:
a selection unit that selects, for a first attribute included in a data set composed of a plurality of attributes, a second attribute having the highest correlation with the first attribute;
a first generation unit that partitions the values of the first attribute such that, among the records of the data set whose second attribute falls in a specific value range, the number of records contained in each resulting value range falls within a predetermined range, and that generates lowest-level nodes each indicating one of the partitioned value ranges; and
a second generation unit that, proceeding in order from the lowest level upward, merges a plurality of nodes of an already generated lower level to generate upper-level nodes fewer in number than the nodes of the lower level.

The generalized hierarchical tree generation device according to claim 1, wherein the second generation unit selects a not-yet-selected attribute and generates the upper-level nodes such that, among the records of the data set whose selected attribute falls in a specific value range, the values of the first attribute are partitioned into value ranges that are fewer in number than the lower-level nodes and that each contain a substantially equal number of records,
the device further comprising an adjustment unit that adjusts the number of upper-level nodes based on the inclusion relationship between the value ranges of the first attribute in the lower-level nodes and in the upper-level nodes.

The generalized hierarchical tree generation device according to claim 2, wherein, when the total number of lower-level nodes and upper-level nodes that are in an inclusion relationship exceeds a threshold, the adjustment unit divides the entire value range of those nodes into two to form the adjusted nodes, and when the total number is less than or equal to the threshold, the adjustment unit treats the entire value range of those nodes as a single adjusted node.

The generalized hierarchical tree generation device according to claim 1, further comprising an output unit that outputs, for the value range of each node at each level, the specific value range of the selected attribute as a generalization condition.

A method for generating a generalized hierarchical tree by a computer, the method comprising:
a selection step in which a selection unit selects, for a first attribute included in a data set composed of a plurality of attributes, a second attribute having the highest correlation with the first attribute;
a first generation step in which a first generation unit partitions the values of the first attribute such that, among the records of the data set whose second attribute falls in a specific value range, the number of records contained in each resulting value range falls within a predetermined range, and generates lowest-level nodes each indicating one of the partitioned value ranges; and
a second generation step in which a second generation unit, proceeding in order from the lowest level upward, merges a plurality of nodes of an already generated lower level to generate upper-level nodes fewer in number than the nodes of the lower level.

A generalized hierarchical tree generation program for causing a computer to function as the generalized hierarchical tree generation device according to any one of claims 1 to 4.