JP2012022315A

JP2012022315A - Method and device for anonymizing data

Info

Publication number: JP2012022315A
Application number: JP2011137185A
Authority: JP
Inventors: Jianqiang Li; Yu Zhao; Bo Liu
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2010-07-02
Filing date: 2011-06-21
Publication date: 2012-02-02
Anticipated expiration: 2031-06-21
Also published as: CN102314565A; JP5282121B2; CN102314565B

Abstract

PROBLEM TO BE SOLVED: To provide a method and device for anonymizing data.
SOLUTION: A data anonymizing device 1 includes: a distance calculation unit 10 which calculates the distance between each pair of records among a plurality of data records; a complete graph construction unit 12 which uses each record as a vertex, connects every pair of vertices by an edge, and weights each edge by the distance between the corresponding pair of records, thereby constructing a complete graph including all records; an edge cut unit 14 which cuts the edges in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices; a large component decomposition unit 16 which decomposes each component including more than 2k-1 vertices into a plurality of clusters so that the number of vertices in each cluster is between k and 2k-1; and a generalizing unit 18 which generalizes the records corresponding to the vertices of each cluster so that the records in the cluster cannot be distinguished from one another.

Description

The present invention relates to the technical field of data protection, and more particularly to a method and apparatus for data anonymization.

As society becomes increasingly information-driven, data sharing is becoming ever more widespread. However, an attacker may obtain or infer confidential information about an individual or organization from the entries of shared data records, which poses a security threat to data that should be protected, such as personal privacy and organizational secrets.

In general, the attributes of data records (for example, records in table form) can be classified into four categories.
Explicit identifiers, which directly identify a subject, such as an individual's name or ID, or a company's registered name.
Quasi-identifiers, which can be combined with relevant external information to infer the corresponding subject, such as an individual's age, gender, educational background, and birthplace, or a company's category and location.
Sensitive attributes, which subjects generally wish to keep confidential, such as income and medical history.
Non-sensitive attributes, whose disclosure generally has little effect on the subject.

Assuming that the values of the sensitive attributes are to be preserved for data analysis, data anonymization means an operation that hides the identity of the individual to whom the sensitive data belongs. To protect the privacy of an individual or organization, explicit identifiers are generally hidden completely or deleted, for example by replacing them with "*". Non-sensitive attributes can be disclosed in full. A quasi-identifier can be regarded as the minimum set of data needed to infer the corresponding subject in combination with relevant external information, and therefore needs to be protected. However, if quasi-identifiers were completely hidden or deleted in the same way as explicit identifiers, the data records could no longer provide useful information about the subject, and almost all of the information contained in the resulting data records would be lost. Such data records would no longer have any practical value.

Data record protection therefore focuses mainly on how to keep information loss as small as possible, and how to protect the quasi-identifiers in data records from potential attacks, without reducing the availability of the data records. Data anonymization techniques have been proposed for this purpose. There are two basic data anonymization techniques.
1) Generalization: a number of quasi-identifiers, attributes, or attribute values are replaced with generalized versions. For example, the city names "Beijing" and "Shanghai" are generalized to the country name "China".
2) Suppression: a number of quasi-identifiers, attributes, or attribute values are replaced with characters or symbols such as "*". Suppression can be regarded as a special case of generalization.
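The two techniques can be illustrated with a minimal Python sketch. The lookup table `CITY_TO_COUNTRY`, the helper names, and the sample records are hypothetical and serve only as an illustration; the patent does not prescribe any particular generalization hierarchy.

```python
# Hypothetical generalization hierarchy (city -> country), for illustration only.
CITY_TO_COUNTRY = {"Beijing": "China", "Shanghai": "China"}


def generalize_city(value):
    """Generalization: replace a city with its country.

    Unknown cities fall back to "*", i.e. suppression as the extreme case.
    """
    return CITY_TO_COUNTRY.get(value, "*")


def suppress(value):
    """Suppression: replace the value with "*" regardless of its content."""
    return "*"


records = [
    {"name": "Alice", "city": "Beijing"},
    {"name": "Bob", "city": "Shanghai"},
]

# Explicit identifier (name) is suppressed; quasi-identifier (city) is generalized.
anonymized = [
    {"name": suppress(r["name"]), "city": generalize_city(r["city"])}
    for r in records
]
```

After this step both records carry the same city value, which is what makes them indistinguishable on that attribute.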

Information loss is unavoidable in generalization, and suppression causes the information to be lost completely. Various anonymization methods have been proposed to reduce information loss, and a widely used one is the k-anonymization method. A table AT is k-anonymous if, for each record in AT, there are at least k-1 other records identical to it with respect to the given quasi-identifier. For given data records such as a table T, an optimized k-anonymization method computes a k-anonymous table AT with respect to the quasi-identifier Q while minimizing information loss.
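The k-anonymity condition can be checked mechanically. The following is a minimal sketch, assuming the table is a list of dictionaries; the function name `is_k_anonymous` and the sample table are illustrative and not taken from the patent.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times.

    This is equivalent to: each record has at least k-1 other records that
    are identical to it on the quasi-identifier attributes.
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


# Illustrative table: age and city are quasi-identifiers, income is sensitive.
table_at = [
    {"age": "20-29", "city": "China", "income": 5000},
    {"age": "20-29", "city": "China", "income": 7000},
    {"age": "30-39", "city": "China", "income": 6000},
    {"age": "30-39", "city": "China", "income": 9000},
]

two_anonymous = is_k_anonymous(table_at, ["age", "city"], 2)
three_anonymous = is_k_anonymous(table_at, ["age", "city"], 3)
```

Here each quasi-identifier combination appears exactly twice, so the table satisfies the condition for k = 2 but not for k = 3.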

One important class of k-anonymization methods is based on clustering. It comprises two basic procedures. First, the data records are divided by clustering into a plurality of clusters, each having at least k records. Each cluster is then generalized so that the quasi-identifiers of all records in the cluster have the same value. With this method, mutually related records can be grouped into a single cluster, and the resulting clusters can be generalized separately. Compared with global generalization without clustering, this clustering-based local generalization retains more information and reduces information loss. An optimized clustering process divides the data records into clusters in a way that reduces information loss still further.
The clustering-based k-anonymization method therefore faces a clustering problem: how to divide the records optimally while minimizing information loss.
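The second procedure, generalizing one cluster so that its records share the same quasi-identifier values, can be sketched as follows. Replacing a numeric attribute by the cluster's (min, max) range is one common choice and is an assumption here; the function name and the sample cluster are illustrative.

```python
def generalize_cluster(cluster, numeric_attrs):
    """Replace each listed numeric attribute by the cluster-wide (min, max) range.

    After this step all records in the cluster carry identical values for
    the listed attributes; other attributes are left untouched.
    """
    ranges = {
        attr: (min(r[attr] for r in cluster), max(r[attr] for r in cluster))
        for attr in numeric_attrs
    }
    return [{**record, **ranges} for record in cluster]


cluster = [
    {"age": 23, "income": 5000},
    {"age": 27, "income": 7000},
]
generalized = generalize_cluster(cluster, ["age"])
```

The narrower the range, the less information is lost, which is why grouping nearby records into the same cluster matters.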

To address this problem, conventional clustering-based k-anonymization methods generally adopt a local optimization approach. Non-Patent Document 1 provides a polynomial-time approximation algorithm for k-anonymity that uses a local optimization method for record partitioning. Patent Document 1 provides a dynamic programming method for k-anonymity that considers all generalized versions and finds an optimal solution for an arbitrary metric.

In these earlier methods, clustering and generalization of records are performed with only local optimization of information loss in mind. The method disclosed in Non-Patent Document 1 performs bottom-up clustering, treating each record as a vertex. Specifically, in this bottom-up method, every vertex is first regarded as a subgraph.
For any subgraph containing fewer than k vertices, if a vertex u has no directed edge toward another vertex, a directed edge (u, v) is created, where v is one of the k-1 nearest neighbors of u (for example, nearest in terms of a distance computed from attributes or attribute values).
In this process, it must be guaranteed that the graph stays loop-free and that every vertex has only one directed edge toward another vertex (although one or more other vertices may point toward it). The process is repeated until every directed graph contains at least k vertices.
The edge directions are then removed, converting the directed graphs into undirected graphs. For a graph with max(2k-1, 3k-5) or more vertices (any graph obtained by the above method can be regarded as a tree), a vertex x is selected at random from the graph and regarded as the root node for merging subtrees with vertex x. In this way, the graph is decomposed into two subgraphs, each having a size greater than k. If such a decomposition is not possible, another vertex y is selected and the same processing is performed, until the graph can be decomposed into two parts that each satisfy the condition on the number of vertices. The process is repeated until every finally obtained subgraph contains fewer than max(2k-1, 3k-5) vertices.

Patent Document 1: US20100027780 A1, “Systems and methods for anonymizing personally identifiable information associated with epigenetic information”

Non-Patent Document 1: G. Aggarwal, A. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu, Approximation Algorithms for k-Anonymity, Journal of Privacy Technology, 2005.

In the bottom-up method described above, vertices and their neighbors are selected at random during tree construction, without any ordering or sequence control mechanism. When a graph with max(2k-1, 3k-5) or more vertices (referred to as a large component) is decomposed, optimization of information loss is not considered. Moreover, these methods focus mainly on local optimization of information loss and do not consider global optimization over all records or vertices. Although such local optimization can reduce information loss to some extent, global optimization cannot be achieved because the global situation is not taken into account. The resulting information loss remains unacceptable for subsequent data analysis with strict requirements.

There is therefore a need for a data anonymization method that can achieve global optimization and further reduce information loss.

A data anonymization device according to the present invention comprises: a distance calculation unit that calculates the distance between every two records among a plurality of data records; a complete graph construction unit that constructs a complete graph including all records by using each record as a vertex, connecting every two vertices with an edge, and weighting each edge with the distance between the two corresponding records; an edge cut unit that cuts the edges one by one in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices (k being a predetermined natural number); a large component decomposition unit that decomposes each component including more than 2k-1 vertices into a plurality of clusters so that the number of vertices in each cluster is between k and 2k-1; and a generalization unit that generalizes the records corresponding to the vertices of each cluster so that the records in the cluster cannot be distinguished from one another. A component including more than 2k-1 vertices is a large component, and a component including at least k and at most 2k-1 vertices is a cluster.

A data anonymization method according to the present invention comprises: a distance calculation step of calculating the distance between every two records among a plurality of data records; a complete graph construction step of constructing a complete graph including all records by using each record as a vertex, connecting every two vertices with an edge, and weighting each edge with the distance between the two corresponding records; an edge cut step of cutting the edges one by one in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices (k being a predetermined natural number); a large component decomposition step of decomposing each component including more than 2k-1 vertices into a plurality of clusters so that the number of vertices in each cluster is between k and 2k-1; and a generalization step of generalizing the records corresponding to the vertices of each cluster so that the records in the cluster cannot be distinguished from one another. A component including more than 2k-1 vertices is a large component, and a component including at least k and at most 2k-1 vertices is a cluster.

According to the data anonymization device and method of the present invention, record partitioning/clustering is performed in a top-down manner. By using a carefully designed sequence control mechanism, the edges in the graph are cut in a determined order. When a large component is decomposed, edges are likewise cut in a determined order. In this way, global optimization as well as local optimization can be achieved, and information loss can be reduced further.

The above and other features and advantages of the present invention will become more apparent from the following preferred embodiments described with reference to the drawings.
FIG. 1 is a block diagram showing the schematic configuration of a data anonymization device according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram showing the schematic configuration of the large component decomposition unit in the data anonymization device of FIG. 1.
FIG. 3 is a flowchart explaining a data anonymization method according to a preferred embodiment of the present invention.
FIG. 4 is a flowchart explaining the large component decomposition procedure in the data anonymization method of FIG. 3.
FIG. 5 is a schematic diagram showing the complete graph construction process for records according to a preferred embodiment of the present invention.
FIG. 6 is a schematic diagram showing the edge cut process into which the sequence control mechanism according to a preferred embodiment of the present invention is introduced.
FIG. 7 is a schematic diagram showing the large component decomposition process according to a preferred embodiment of the present invention.
FIG. 8 is a diagram schematically showing the final result of record partitioning according to a preferred embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not, however, limited to these embodiments. To keep the explanation of the basic concept clear, only the components, functions, and steps related to the solution of the present invention are illustrated. Detailed descriptions of existing techniques, functions, components, and steps are omitted.

FIG. 1 shows a block diagram of a data anonymization device 1 according to a preferred embodiment of the present invention. The data anonymization device 1 is configured to anonymize a plurality of data records in order to obtain an anonymized version of those records. The terms "data record", "record", and "record item" have the same meaning here and are used interchangeably.

According to a preferred embodiment of the present invention, the data anonymization device 1 is mainly concerned with, for example, data records that contain quasi-identifiers.
Adopting a k-anonymization method (where k is a predetermined natural number), the data anonymization device 1 generates, by generalization, a k-anonymous table AT from, for example, a table T containing a plurality of data records, so as to avoid privacy leakage caused by quasi-identifiers while minimizing information loss.
As described in the background section, explicit identifiers can simply be hidden or deleted, and non-sensitive attributes require no protection. A quasi-identifier can be regarded as the minimum set of data needed to infer the corresponding subject in combination with relevant external information. Quasi-identifiers therefore need to be protected.
A quasi-identifier is multidimensional and contains a number of attributes. For example, it can be expressed as Q = {A1, A2, ..., Am}, where A1, A2, ..., Am denote the individual attributes of the quasi-identifier. Alternatively, it can be expressed in a form such as Q = {Rec ID, A1, A2, ..., Am} or Q = {Rec ID, A1-value1, A2-value2, ..., Am-valuem}, where Rec ID denotes the index of the quasi-identifier among all data records and each value denotes the value of the corresponding attribute.
In the following, terms such as "data record", "record", and "record item" refer to records of quasi-identifiers. The present invention is not limited to this, however, and can be applied to the anonymization and protection of any other data records. Furthermore, although the following description takes the k-anonymization method as an example, the present invention is applicable to any data anonymization method based on clustering.

Data records may take various forms, such as lists or tables. The data records are stored in a record storage unit 20.
As shown in FIG. 1, the record storage unit 20 may be a stand-alone unit accessible by the data anonymization device 1, or it may be a part of the data anonymization device 1.
The data anonymization device 1 includes: a distance calculation unit 10 for calculating the distance between every two records among a plurality of data records; a complete graph construction unit 12 for constructing a complete graph including all records by using each record as a vertex, connecting every two vertices with an edge, and weighting each edge with the distance between the two corresponding records; an edge cut unit 14 for cutting the edges one by one in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices; a large component decomposition unit 16 for decomposing each component including more than 2k-1 vertices into a plurality of clusters so that the number of vertices in each cluster is between k and 2k-1; and a generalization unit 18 for generalizing the records corresponding to the vertices of each cluster so that the records in the cluster cannot be distinguished from one another.
Here, a component including more than 2k-1 vertices is referred to as a "large component", and a component including at least k and at most 2k-1 vertices is referred to as a "cluster".
The generalization unit 18 generalizes the records corresponding to the vertices of each cluster obtained by either the edge cut unit 14 or the large component decomposition unit 16. In the present invention, each component is a tree.

The distance calculation unit 10 is configured to calculate the distance between records based on the name, attributes, or attribute values of each record stored in the record storage unit.
For example, the name or attributes of each record can be quantized according to a defined criterion, and the distance between records can be calculated from the quantized values (for example, by using the well-known Euclidean distance). If possible, the calculated distances may be stored in a distance storage device (not shown).
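The distance calculation can be sketched in Python as follows. The sketch assumes the records have already been quantized into numeric vectors; the quantization criterion itself is not specified in the text, and the function name and sample vectors are illustrative.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Return {(i, j): Euclidean distance} for every pair i < j of records."""
    return {
        (i, j): math.dist(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# Hypothetical quantized records, for example (age, education level).
quantized = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = pairwise_distances(quantized)
```

For n records this produces n(n-1)/2 distances, one per pair, which is exactly the set of edge weights needed for the complete graph.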

The complete graph construction unit 12 is configured to construct a complete graph including all records, that is, a graph in which an edge exists between any two records.
As described above, the present invention adopts a top-down record partitioning/clustering process in place of the bottom-up process of the existing techniques, in which construction starts from the individual records.
This is why a complete graph is introduced in the present invention. The complete graph contains the distance between any two records.
By dividing or decomposing the complete graph, all records are partitioned in a top-down manner into subgraphs or components (each subgraph can be regarded as a component here).
If possible, the constructed complete graph may be stored in a storage device (not shown).
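A minimal sketch of complete graph construction is given below. Representing the graph as a list of `(weight, i, j)` tuples is an assumption made for illustration, as are the function name and the one-dimensional sample records.

```python
from itertools import combinations


def build_complete_graph(records, distance):
    """Return weighted edges (weight, i, j) for every pair of record indices."""
    return [
        (distance(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    ]


# Hypothetical one-dimensional quantized records with absolute difference
# as the distance.
records = [1, 4, 9]
edges = build_complete_graph(records, lambda a, b: abs(a - b))
```

Storing the weight first in each tuple makes it easy to sort the edges by weight in a later step.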

The edge cut unit 14 is configured to sort the edges by weight and to cut them in descending order of weight.
In contrast to existing schemes, in which components are built by selecting the neighbors of a vertex arbitrarily, the present invention thus introduces a sequence control mechanism (sequential control mechanism) in order to achieve global optimization while preserving local optimization.
For example, suppose that for the three records "Master", "Doctor", and "Engineer" it is determined, based on a defined criterion, that the distance between "Master" and "Doctor" is shorter than the distance between "Master" and "Engineer", and that the distance between "Doctor" and "Engineer" is the longest.
In this case, the edge between "Doctor" and "Engineer" is cut first, the edge between "Master" and "Engineer" is cut next, and the edge between "Master" and "Doctor" is retained.
In this way, the subgraph or component containing "Master" and "Doctor" is separated from "Engineer".
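The cut order in this three-record example can be reproduced with a short sketch. The numeric distances below are hypothetical; only their relative order follows the text.

```python
# Hypothetical distances that respect the ordering described in the text:
# Master-Doctor < Master-Engineer < Doctor-Engineer.
edge_weights = {
    ("Master", "Doctor"): 1.0,
    ("Master", "Engineer"): 2.0,
    ("Doctor", "Engineer"): 3.0,
}

# Edges are visited from the heaviest to the lightest.
cut_order = sorted(edge_weights, key=edge_weights.get, reverse=True)

# The two heaviest edges are cut; the lightest edge is retained.
cut = cut_order[:2]
kept = cut_order[2:]
```

Whether a visited edge is actually cut also depends on the conditions on bridges and on k that the edge cut unit applies; this sketch shows only the ordering.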

When cutting edges one after another, the edge cut unit 14 is configured to cut an edge if one of the following conditions is satisfied.
1) The edge is a bridge (that is, cutting the edge divides the graph containing it into two subgraphs), and each subgraph obtained by cutting the edge contains at least k vertices.
2) The edge is not a bridge (that is, cutting the edge does not divide the graph containing it into two subgraphs).
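The two cutting conditions can be combined into a small, unoptimized sketch. The bridge test by breadth-first search, the data layout, and all names are assumptions made for illustration; the patent does not specify how bridges are detected.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of vertices reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def cut_edges(n, weighted_edges, k):
    """Cut edges in descending weight order under conditions 1) and 2).

    weighted_edges is a list of (weight, u, v) over vertices 0..n-1.
    Returns the resulting components as sorted vertex lists.
    """
    adj = {v: set() for v in range(n)}
    for _, u, v in weighted_edges:
        adj[u].add(v)
        adj[v].add(u)
    for _, u, v in sorted(weighted_edges, reverse=True):
        adj[u].discard(v)  # tentatively cut the edge
        adj[v].discard(u)
        side_u = reachable(adj, u)
        if v in side_u:
            continue  # condition 2: not a bridge, the cut stands
        side_v = reachable(adj, v)
        if len(side_u) < k or len(side_v) < k:
            adj[u].add(v)  # bridge, but one side would fall below k: restore
            adj[v].add(u)
    components, seen = [], set()
    for v in range(n):
        if v not in seen:
            comp = reachable(adj, v)
            seen |= comp
            components.append(sorted(comp))
    return components


# Hypothetical records on a line: two well-separated groups of three.
points = [0, 1, 2, 10, 11, 12]
edges = [
    (abs(points[i] - points[j]), i, j)
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
components = cut_edges(len(points), edges, k=2)
```

With k = 2 the single remaining edge between the two groups is a bridge whose removal leaves three vertices on each side, so it is cut and two components remain. Inside each group, the last two edges are bridges whose removal would isolate a single vertex, so they are kept and each component ends up as a tree.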

The operation of the edge cut unit 14 is described later with reference to a specific example and the drawings.
The result output by the edge cut unit 14 may be stored in an edge cut result storage unit (not shown).

Some of the components obtained after the operation of the edge cut unit 14 may contain at least k and at most 2k-1 vertices. Each such component is regarded as a cluster that requires no further decomposition.
Other components obtained after the operation of the edge cut unit 14 contain more than 2k-1 vertices. Such a component is referred to as a large component and must be decomposed so that the number of vertices in each resulting part is between k and 2k-1.
According to a preferred embodiment of the present invention, either of the following two methods is adopted for decomposing a large component.
1) Any suitable method from the existing techniques is used. A detailed description is omitted so as not to obscure the present invention unnecessarily.
2) Unlike the existing random merge method, a sequence control mechanism similar to the one used in the edge cut process is introduced, taking global optimization and information loss into account. This is described further below.

FIG. 2 is a block diagram showing the configuration of the large component decomposition unit 16 in the data anonymization device 1 of FIG. 1.
The large component decomposition unit 16 includes a k-center vertex detection unit 160, a subcomponent distance calculation unit 162, a subcomponent complete graph construction unit 164, a subcomponent complete graph edge cut unit 166, and a merge unit 168. Here, each subcomponent is a tree.

According to a preferred embodiment of the present invention, the k-center vertex (k-central-vertex) is introduced for large component decomposition. A vertex in a component is defined as a k-center vertex if, when that vertex is removed, each resulting subcomponent (which can also be regarded as a subgraph) contains at most k-1 vertices.
The following lemma is now introduced.
Lemma
A component with more than 2k-1 vertices has only one k-center vertex.
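The definition of the k-center vertex translates directly into a brute-force search. The sketch below assumes the component is a connected tree given as an adjacency dictionary; the function names and the sample trees are illustrative and not taken from the patent.

```python
def subcomponent_sizes(adj, removed):
    """Sizes of the pieces left after deleting `removed` from a connected tree."""
    sizes = []
    seen = {removed}
    for start in adj[removed]:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        sizes.append(size)
    return sizes


def find_k_center_vertex(adj, k):
    """Return a vertex whose removal leaves only pieces of at most k-1 vertices.

    Returns None if no vertex satisfies the definition.
    """
    for vertex in adj:
        if all(size <= k - 1 for size in subcomponent_sizes(adj, vertex)):
            return vertex
    return None


# Hypothetical large component for k = 3: 7 vertices (more than 2k-1 = 5),
# with vertex 0 joining three branches of two vertices each.
tree = {0: [1, 3, 5], 1: [0, 2], 2: [1], 3: [0, 4], 4: [3], 5: [0, 6], 6: [5]}
center = find_k_center_vertex(tree, 3)

# A path of six vertices has no vertex meeting the definition for k = 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
no_center = find_k_center_vertex(path, 3)
```

The path example would not survive the edge cut step with k = 3, since its middle edge is a bridge that splits it into two parts of three vertices each; it is included only to show the function's behavior when the definition cannot be met.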

Proof
Suppose that one component with more than 2k-1 vertices contains two k-center vertices v1 and v2 as defined above.
When the edge between v1 and v2 is cut, each resulting subcomponent has at most k-1 vertices. The large component would then have at most 2k-2 vertices, which contradicts the assumption. This proves the lemma.

The k-center vertex detection unit 160 is configured to detect the k-center vertex in each large component and to cut all edges connected to the detected k-center vertex, thereby obtaining a plurality of subcomponents (or subgraphs) that exclude the k-center vertex.
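
The detection performed by the k-center vertex detection unit 160 can be illustrated with a minimal Python sketch. It is not taken from the embodiment: the adjacency-dictionary representation and the names `piece_sizes_without` and `find_k_center_vertices` are assumptions made for illustration only. The sketch deletes each vertex in turn and checks whether every remaining piece contains at most k-1 vertices.

```python
from collections import deque


def piece_sizes_without(adj, removed):
    """Sizes of the connected pieces that remain after deleting `removed`."""
    seen = {removed}
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sizes


def find_k_center_vertices(adj, k):
    """Vertices whose deletion leaves only pieces with at most k-1 vertices."""
    return [v for v in adj
            if all(s <= k - 1 for s in piece_sizes_without(adj, v))]


# Hypothetical star-shaped component: Q9 is joined to Q0, Q7 and Q8.
component = {"Q9": ["Q0", "Q7", "Q8"], "Q0": ["Q9"], "Q7": ["Q9"], "Q8": ["Q9"]}
print(find_k_center_vertices(component, 2))
```

For this hypothetical component and k = 2, only Q9 qualifies, since deleting any other vertex leaves a piece of three vertices.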

The subcomponent distance calculation unit 162 is configured to calculate the center of each subcomponent and the distance between every two subcomponent centers. According to the preferred embodiment of the present invention, the k-center vertex is excluded, so each resulting subcomponent contains at most k-1 vertices. It is therefore unnecessary to decompose any of the subcomponents, and in the large component decomposition process each subcomponent as a whole can be represented by its center. The center of a subcomponent is the mean or median of the quantized values or attribute values of the records corresponding to the vertices contained in the subcomponent, or another suitable measure. The distance between the centers of two subcomponents can likewise be calculated by any suitable existing method (for example, the Euclidean distance).
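
As a hedged illustration of the calculation performed by the subcomponent distance calculation unit 162, the following Python sketch takes the attribute-wise mean as the subcomponent center and the Euclidean distance as the distance between centers. Both are only two of the options the description allows, and the function names and the tuple-of-numbers record format are assumptions.

```python
import math


def subcomponent_center(records):
    """Attribute-wise mean of the (already quantized) records of one subcomponent."""
    count = len(records)
    return tuple(sum(column) / count for column in zip(*records))


def euclidean_distance(a, b):
    """Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical subcomponents, each holding numeric records.
center_a = subcomponent_center([(0, 0), (2, 4)])
center_b = subcomponent_center([(4, 6)])
print(center_a, center_b, euclidean_distance(center_a, center_b))
```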

According to the preferred embodiment of the present invention, as described above, the top-down partitioning/clustering process is also introduced into the large component decomposition, thereby ensuring global optimization. The subcomponent complete graph construction unit 164 in the large component decomposition unit 16 treats each subcomponent as a whole as a vertex. Specifically, the subcomponent complete graph construction unit 164 is configured to construct a complete graph by using the calculated center of each subcomponent as a vertex and connecting every pair of vertices with an edge.
In the graph constructed in this way, each vertex is weighted by the size of the corresponding subcomponent (that is, the number of vertices contained in the subcomponent), and each edge is weighted by the distance between the two corresponding subcomponent centers.

The sequential cutting method is also introduced into the large component decomposition. Like the edge cut unit 14 described above, the subcomponent complete graph edge cut unit 166 is configured to cut edges one by one in order of edge weight and thereby divide the subcomponents into a plurality of clusters. The sum of the weights of all vertices included in each cluster is at least k and at most 2k-1.

The subcomponent complete graph edge cut unit 166 cuts edges one by one in descending order of edge weight. During this sequential cutting, the subcomponent complete graph edge cut unit 166 cuts an edge if one of the following conditions is satisfied.
1) The edge is a bridge (that is, cutting the edge divides the graph containing it into two subgraphs), and the sum of the weights of the vertices included in each component obtained after cutting the edge is at least k.
2) The edge is not a bridge (that is, cutting the edge does not divide the graph containing it into two subgraphs).
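
The two cut conditions can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the dictionary-based graph representation and the names `reachable` and `sequential_edge_cut` are hypothetical, and the bridge test uses a plain graph search for brevity. The sketch enforces only the two conditions listed above (which guarantee a lower bound of k on each cluster weight); it does not separately check the upper bound of 2k-1.

```python
def reachable(start, edges):
    """Set of vertices reachable from `start` using the given undirected edges."""
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for a, b in edges:
            other = b if a == x else a if b == x else None
            if other is not None and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def sequential_edge_cut(vertex_weight, edge_weight, k):
    """Cut edges in descending weight order under the two conditions."""
    kept = set(edge_weight)
    for edge in sorted(edge_weight, key=edge_weight.get, reverse=True):
        u, v = edge
        trial = kept - {edge}
        side_u = reachable(u, trial)
        if v in side_u:
            kept = trial          # condition 2: the edge is not a bridge
            continue
        side_v = reachable(v, trial)
        if (sum(vertex_weight[x] for x in side_u) >= k
                and sum(vertex_weight[x] for x in side_v) >= k):
            kept = trial          # condition 1: bridge, both sides weigh >= k
    clusters, seen = [], set()
    for x in vertex_weight:
        if x not in seen:
            comp = reachable(x, kept)
            seen |= comp
            clusters.append(comp)
    return clusters


# Hypothetical data: four subcomponents of weight 1, two close pairs, k = 2.
weights = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = {("a", "b"): 1, ("c", "d"): 1, ("a", "c"): 10,
         ("a", "d"): 11, ("b", "c"): 12, ("b", "d"): 13}
print(sequential_edge_cut(weights, edges, 2))
```

With every vertex weight set to 1, the same sketch corresponds to the conditions applied by the edge cut unit 14 to the complete graph of records.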

The operation of the subcomponent complete graph edge cut unit 166 will be described later with reference to a specific example and drawings.

After the operation of the subcomponent complete graph edge cut unit 166 is complete, the previously excluded k-center vertex must be merged into one of the clusters. The merge unit 168 is configured to merge the k-center vertex into the cluster closest to the k-center vertex. If the sum of the weights of all vertices in the merged cluster equals 2k, that cluster is further decomposed into two clusters so that the sum of the weights of all vertices in each cluster equals k.
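
The merge step can be sketched as below. The description does not state how the distance between the k-center vertex and a cluster is measured, so this sketch assumes the Euclidean distance to the cluster's mean; the function name `merge_k_center`, the record-id lists, and the `points` dictionary are likewise hypothetical. The sketch only reports whether the merged cluster has reached 2k records and would need the further split; how that split is chosen is left open here.

```python
import math


def merge_k_center(clusters, k_center_id, points, k):
    """Merge the excluded k-center vertex into the nearest cluster.

    clusters: list of lists of record ids; points: dict id -> numeric tuple.
    Returns (new cluster list, index of the merged cluster, needs_split).
    """
    def center(ids):
        columns = zip(*(points[i] for i in ids))
        return tuple(sum(col) / len(ids) for col in columns)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    target = min(range(len(clusters)),
                 key=lambda i: dist(center(clusters[i]), points[k_center_id]))
    merged = [list(c) for c in clusters]
    merged[target].append(k_center_id)
    # A merged cluster of exactly 2k records must be split into two of size k.
    needs_split = len(merged[target]) == 2 * k
    return merged, target, needs_split


# Hypothetical coordinates for a single three-record cluster and k = 2.
pts = {"Q0": (0, 0), "Q7": (1, 0), "Q8": (0, 1), "Q9": (0.5, 0.5)}
print(merge_k_center([["Q0", "Q7", "Q8"]], "Q9", pts, 2))
```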

The clusters obtained by the large component decomposition unit 16, together with the clusters obtained by the edge cut unit 14, constitute the result of the record partitioning/clustering. The number of vertices (records) included in each resulting cluster is at least k and at most 2k-1. If possible, the resulting records may be stored in a record partition storage device (not shown).

The generalization unit 18 is configured to generalize, for each resulting cluster, the records corresponding to its vertices, so that the records in each cluster cannot be distinguished from one another. The generalization unit 18 can use any known generalization method. As one example, a plurality of numerical values can be generalized to their least common multiple. For example, the values 2, 4 and 10 can be generalized to 20.
As another example, a plurality of city names can be generalized to the name of the province to which the cities belong. For example, the city names "Chengdu", "Mianyang" and "Leshan" can be generalized to the province name "Sichuan". In general, different attribute values can be generalized to the lowest level of the category to which they all belong. This makes the attribute values indistinguishable from one another while keeping information loss to a minimum. The output of the generalization unit 18 is stored in an anonymous record storage unit (not shown) in various formats such as an anonymous table or list.
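
The two generalization examples given above can be reproduced with a short Python sketch. The function names and the `parent` dictionary (a hypothetical category hierarchy) are assumptions for illustration; the description allows any known generalization method.

```python
from functools import reduce
from math import gcd


def generalize_numbers(values):
    """Generalize positive integers to their least common multiple."""
    return reduce(lambda a, b: a * b // gcd(a, b), values)


def generalize_categories(values, parent):
    """Return the lowest category shared by all values, or None if there is none.

    parent: dict mapping a value or category to its parent category.
    """
    def chain(value):
        out = [value]
        while out[-1] in parent:
            out.append(parent[out[-1]])
        return out

    chains = [chain(v) for v in values]
    for candidate in chains[0]:
        if all(candidate in c for c in chains):
            return candidate
    return None


# Hypothetical hierarchy: city -> province -> country.
hierarchy = {"Chengdu": "Sichuan", "Mianyang": "Sichuan",
             "Leshan": "Sichuan", "Sichuan": "China"}
print(generalize_numbers([2, 4, 10]))                                      # 20
print(generalize_categories(["Chengdu", "Mianyang", "Leshan"], hierarchy))  # Sichuan
```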

The data anonymization device 1 according to the preferred embodiment of the present invention has been described above.
FIG. 3 is a flowchart illustrating a data anonymization method 300 according to a preferred embodiment of the present invention. The data anonymization method 300 is executed by the data anonymization device 1.
In step 302, the distance between every pair of records in a table containing a plurality of data records is calculated.
In step 304, a complete graph containing all records is constructed by using each record as a vertex, connecting every pair of vertices with an edge, and weighting each edge by the distance between the two corresponding records.
In step 306, edges are cut one by one in order of edge weight to divide the complete graph into a plurality of components each including at least k vertices.
In step 308, each large component including more than 2k-1 vertices is decomposed into a plurality of clusters such that the number of vertices included in each cluster is between k and 2k-1.
In step 310, the records corresponding to the vertices of each cluster are generalized so that the records in each resulting cluster cannot be distinguished from one another.
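
Steps 302 and 304 can be sketched in Python as follows, assuming numeric records and the Euclidean distance. The function name `build_complete_graph` and the representation of the graph as a dictionary keyed by index pairs are assumptions made for illustration.

```python
import math
from itertools import combinations


def build_complete_graph(records):
    """Map every unordered pair of record indices to the distance between them."""
    graph = {}
    for i, j in combinations(range(len(records)), 2):
        graph[(i, j)] = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(records[i], records[j])))
    return graph


# Three hypothetical two-attribute records.
graph = build_complete_graph([(0, 0), (3, 4), (6, 8)])
print(graph)
```

A table of n records yields n(n-1)/2 weighted edges, so a table of 10 records produces 45 edges to be considered in step 306.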

FIG. 4 is a flowchart showing the large component decomposition step 308 in the data anonymization method 300 of FIG. 3.
In step 402, the k-center vertex in each large component including more than 2k-1 vertices is detected, and all edges connected to the detected k-center vertex are cut so as to obtain a plurality of subcomponents that exclude the k-center vertex.
In step 404, the center of each subcomponent and the distance between every two subcomponent centers are calculated.
In step 406, a complete graph having each subcomponent as a vertex is constructed by using the calculated center of each subcomponent as a vertex, weighting each vertex by the size of the corresponding subcomponent, connecting every pair of vertices with an edge, and weighting each edge by the distance between the two corresponding subcomponent centers.
In step 408, edges are cut one by one in order of edge weight to divide the subcomponents into a plurality of clusters such that the sum of the weights of all vertices included in each cluster is at least k and at most 2k-1.
In step 410, the k-center vertex is merged into the cluster closest to the k-center vertex, and if the sum of the weights of all vertices included in the merged cluster equals 2k, that cluster is further decomposed into two clusters so that the sum of the weights of all vertices included in each cluster equals k.

To illustrate the embodiment of the present invention more clearly, a specific example according to the embodiment is described below with reference to FIGS. 5 to 8. This example serves only to illustrate the preferred embodiment of the present invention and does not limit the present invention.

For example, assume a table containing 10 data records T = [Q0, Q1, …, Q9], Qi = {A1, A2, …, Am}, i = {0, 1, …, 9}, where m is a natural number. These 10 data records need to be anonymized to form a k-anonymous table with k = 2.

First, the distance calculation unit 10 calculates the distance between every pair of the records Q0, Q1, …, Q9 by using the Euclidean distance. The complete graph construction unit 12 then constructs a complete graph.

FIG. 5 is a schematic diagram illustrating the complete graph construction according to a preferred embodiment of the present invention. As shown in FIG. 5, the left side of the figure shows the vertices representing the records.
These vertices are connected pairwise by edges to form the complete graph shown on the right side of FIG. 5. For convenience of explanation, the length of each edge between two vertices is assumed to represent the weight of that edge (that is, the distance between the two records corresponding to the two vertices).

Next, the edge cut unit 14 cuts edges in descending order of edge weight according to the conditions described above. FIG. 6 is a schematic diagram illustrating the edge cut processing in which the sequence control mechanism according to a preferred embodiment of the present invention is introduced. As can be seen from FIG. 6, the edge between Q3 and Q8, edge38, is the longest and has the largest weight. The edge between Q0 and Q4, edge04, is the second longest, and so on, and the edge between Q8 and Q9, edge89, is the shortest.

First, the edge cut unit 14 determines whether edge38 satisfies one of the two conditions described above. In this example, edge38 is not a bridge, so it is cut. Next, the edge cut unit 14 determines whether edge04 satisfies one of the two conditions. Since edge04 is not a bridge, it is cut.
The edge cut unit 14 continues in this way until it evaluates edge01, the edge between Q0 and Q1.
If edge01 were cut, the graph would be divided into two components (subgraphs), so edge01 is a bridge. It is therefore necessary to determine whether each of the two resulting components has at least k vertices. As shown, the two components obtained by cutting edge01 contain six and four vertices, respectively. Therefore, the edge cut unit 14 cuts edge01.
When evaluating edge09, the edge between Q0 and Q9, the edge cut unit 14 determines that cutting edge09 would divide a component of the complete graph into two further parts.
It is therefore again necessary to determine whether each of the two resulting parts has at least k vertices. As shown, the two parts obtained by cutting edge09 would contain one and five vertices, respectively, so the condition is not satisfied. Therefore, the edge cut unit 14 does not cut edge09. In this way, the parts CQ1, CQ2, CQ3 and CQ4 are finally obtained, as shown in the lower right part of FIG. 6.

Here, the parts CQ2, CQ3 and CQ4 each contain two vertices and therefore require no further decomposition. That is, each of the parts CQ2, CQ3 and CQ4 is a cluster. The part CQ1, however, contains four vertices Q0, Q7, Q8 and Q9, which is more than 2k-1 = 3 (with k = 2), so it needs to be decomposed. The large component decomposition unit 16 therefore performs the large component decomposition process on the part CQ1.
FIG. 7 is a schematic diagram illustrating the large component decomposition process according to a preferred embodiment of the present invention.
First, the k-center vertex detection unit 160 detects the k-center vertex. As illustrated, the k-center vertex of the large component CQ1 is the vertex Q9. All edges connected to the vertex Q9 are therefore cut to obtain the individual separated subcomponents. In this example, each subcomponent contains only one vertex, as shown in the second partial view of FIG. 7. This is merely an example chosen for convenience; in other situations, a subcomponent may contain more than one vertex.

Since each subcomponent contains only one vertex, the subcomponent distance calculation unit 162 directly uses the representative value of that vertex as the subcomponent center and calculates the distances between the subcomponent centers. As shown in the third partial view of FIG. 7, the subcomponent complete graph construction unit 164 connects all subcomponent centers, excluding the k-center vertex, to obtain a complete graph containing all subcomponents. As shown in the fourth partial view of FIG. 7, the subcomponent complete graph edge cut unit 166 applies the two conditions described above on the basis of the edge weights (that is, the edge lengths shown in the figure), cuts the edge between Q0 and Q7, and leaves the other edges uncut. As shown in the fifth partial view of FIG. 7, the merge unit 168 merges the vertex Q9 into the component on the left to obtain a cluster containing four vertices. The merge unit 168 then cuts the edge between Q7 and Q8, as shown in the sixth partial view of FIG. 7, and obtains two clusters each containing two vertices.

FIG. 8 schematically shows the final result of the record partitioning according to a preferred embodiment of the present invention.
With the k-anonymization method (k = 2) according to the preferred embodiment of the present invention, the records Q0 to Q9 are divided into five clusters each containing two records.
Thereafter, the generalization unit 18 performs generalization on each cluster and obtains the k-anonymity table AT = [AQ0, AQ1, …, AQ4], where AQi represents the generalized value of the records in each cluster.
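
The size constraint on the final partition can be checked with a small Python sketch. The function name `is_valid_partition` is an assumption, and the cluster memberships below are placeholders: the text states only that five clusters of two records each are obtained, not which records fall into which cluster.

```python
def is_valid_partition(clusters, records, k):
    """True if every record appears exactly once and each cluster has k..2k-1 records."""
    flat = [r for cluster in clusters for r in cluster]
    covers_all = sorted(flat) == sorted(records)
    sizes_ok = all(k <= len(cluster) <= 2 * k - 1 for cluster in clusters)
    return covers_all and sizes_ok


records = ["Q%d" % i for i in range(10)]
# Placeholder grouping into five pairs (memberships are hypothetical).
clusters = [["Q0", "Q1"], ["Q2", "Q3"], ["Q4", "Q5"], ["Q6", "Q7"], ["Q8", "Q9"]]
print(is_valid_partition(clusters, records, 2))
```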

The data anonymization device and method according to the preferred embodiment of the present invention have been described above. Although the preferred embodiment is illustrated in the above description only by specific examples, this does not mean that the present invention is limited to the procedures and unit configurations described. These steps and elements can be adjusted, selected and combined as needed. Furthermore, some of these steps and elements are not indispensable for realizing the inventive concept of the present invention. Thus, the technical features essential to the present invention are limited not to the specific examples above but only to the minimum requirements for realizing the inventive concept of the present invention.

Although the present invention has been described with reference to its preferred embodiments, it will be apparent to those skilled in the art that various other modifications, changes and additions can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the present invention is not limited to the specific embodiments described above but only by the appended claims.

Further, part or all of the above embodiment can also be described as in the following supplementary notes, but is not limited thereto.

(Supplementary Note 1)
A data anonymization device comprising:
a distance calculation unit that calculates the distance between every pair of records among a plurality of data records;
a complete graph construction unit that constructs a complete graph containing all records by using each record as a vertex, connecting every pair of vertices with an edge, and weighting each edge by the distance between the two corresponding records;
an edge cut unit that cuts edges one by one in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices, where k is a predetermined natural number;
a large component decomposition unit that decomposes each component including more than 2k-1 vertices into a plurality of clusters such that the number of vertices included in each cluster is between k and 2k-1; and
a generalization unit that generalizes the records corresponding to the vertices of each cluster so that the records in each cluster cannot be distinguished from one another,
wherein a component including more than 2k-1 vertices is a large component, and a component including at least k and at most 2k-1 vertices is a cluster.

(Supplementary Note 2)
The data anonymization device according to Supplementary Note 1, wherein the edge cut unit sorts the edges according to their weights and cuts the edges in descending order of weight.

(Supplementary Note 3)
The data anonymization device according to Supplementary Note 1, wherein the edge cut unit cuts an edge if one of the following conditions is satisfied:
1) the edge is a bridge, and each subgraph obtained after cutting the edge includes at least k vertices;
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing the edge into two separate parts, and is not a bridge if cutting it does not divide the graph containing the edge into two separate parts.

(Supplementary Note 4)
The data anonymization device according to Supplementary Note 1, wherein the large component decomposition unit comprises:
a k-center vertex detection unit that detects the k-center vertex in each large component and cuts all edges connected to the detected k-center vertex so as to obtain a plurality of subcomponents that exclude the k-center vertex;
a subcomponent distance calculation unit that calculates the center of each subcomponent and the distance between two subcomponent centers;
a subcomponent complete graph construction unit that constructs a subcomponent complete graph by using the center of each subcomponent as a vertex, weighting each vertex by the size of the corresponding subcomponent (expressed as the number of record vertices contained in the subcomponent), connecting every pair of vertices with an edge, and weighting each edge by the distance between the two corresponding subcomponent centers;
a subcomponent complete graph edge cut unit that cuts edges one by one in order of edge weight to divide the subcomponent complete graph into a plurality of clusters such that the sum of the weights of all vertices included in each cluster is at least k and at most 2k-1; and
a merge unit that merges the k-center vertex into the cluster closest to the k-center vertex and, if the sum of the weights of all vertices included in the merged cluster equals 2k, decomposes that cluster into two clusters such that the sum of the weights of all vertices included in each cluster equals k.

(Supplementary Note 5)
The data anonymization device according to Supplementary Note 4, wherein the subcomponent complete graph edge cut unit sorts the edges according to their weights and cuts the edges in descending order of weight.

(Supplementary Note 6)
The data anonymization device according to Supplementary Note 4, wherein the subcomponent complete graph edge cut unit cuts an edge if one of the following conditions is satisfied:
1) the edge is a bridge, and the sum of the weights of the vertices included in each component obtained after cutting the edge is at least k;
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing the edge into two subgraphs, and is not a bridge if cutting it does not divide the graph containing the edge into two subgraphs.

(Supplementary Note 7)
A data anonymization method comprising:
a distance calculation step of calculating the distance between every pair of records among a plurality of data records;
a complete graph construction step of constructing a complete graph containing all records by using each record as a vertex, connecting every pair of vertices with an edge, and weighting each edge by the distance between the two corresponding records;
an edge cut step of cutting edges one by one in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices, where k is a predetermined natural number;
a large component decomposition step of decomposing each component including more than 2k-1 vertices into a plurality of clusters such that the number of vertices included in each cluster is between k and 2k-1; and
a generalization step of generalizing the records corresponding to the vertices of each cluster so that the records in each cluster cannot be distinguished from one another,
wherein a component including more than 2k-1 vertices is a large component, and a component including at least k and at most 2k-1 vertices is a cluster.

(Supplementary Note 8)
The data anonymization method according to Supplementary Note 7, wherein in the edge cut step the edges are sorted according to their weights and cut in descending order of weight.

(Supplementary Note 9)
The data anonymization method according to Supplementary Note 7, wherein in the edge cut step an edge is cut if one of the following conditions is satisfied:
1) the edge is a bridge, and each subgraph obtained after cutting the edge includes at least k vertices;
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing the edge into two separate parts, and is not a bridge if cutting it does not divide the graph containing the edge into two separate parts.

(Appendix 10)
The data anonymization method according to appendix 7, wherein the large component decomposition step comprises:
a k-center vertex detection step of detecting the k-center vertex in each large component and cutting all edges connected to the detected k-center vertex to obtain a plurality of sub-components other than the k-center vertex;
a sub-component distance calculation step of calculating the center of each sub-component and the distance between each pair of sub-component centers;
a sub-component complete graph construction step of constructing a sub-component complete graph by using the center of each sub-component as a vertex, weighting each vertex by the size of the corresponding sub-component (the number of record vertices the sub-component contains), connecting every two vertices with an edge, and weighting each edge by the distance between the two corresponding sub-component centers;
a sub-component complete graph edge cutting step of cutting edges successively in order of edge weight to divide the sub-component complete graph into a plurality of clusters such that the sum of the weights of all vertices in each cluster is at least k and at most 2k-1; and
a merging step of merging the k-center vertex into the cluster closest to the k-center vertex and, if the sum of the weights of all vertices in the merged cluster equals 2k, splitting that cluster into two clusters such that the sum of the weights of all vertices in each of the two clusters equals k.
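As a non-limiting illustration, the sub-component distance calculation and sub-component complete graph construction steps can be sketched in Python as follows. This sketch assumes that records are numeric tuples, that the center of a sub-component is the attribute-wise mean, and that the distance between centers is Euclidean; these choices, and the names `center` and `build_subcomponent_graph`, are assumptions for illustration and are not fixed by this appendix.

```python
import math
from itertools import combinations


def center(records):
    """Attribute-wise mean of a non-empty list of numeric tuples (assumed)."""
    dims = len(records[0])
    return tuple(sum(r[d] for r in records) / len(records) for d in range(dims))


def build_subcomponent_graph(subcomponents):
    """Build the sub-component complete graph.

    Returns (vertex_weights, edge_weights), where vertex_weights[i] is the
    number of records in sub-component i and edge_weights[(i, j)], i < j,
    is the distance between the centers of sub-components i and j.
    """
    centers = [center(s) for s in subcomponents]
    vertex_weights = [len(s) for s in subcomponents]
    edge_weights = {
        (i, j): math.dist(centers[i], centers[j])
        for i, j in combinations(range(len(centers)), 2)
    }
    return vertex_weights, edge_weights


# Illustrative sub-components left after removing a k-center vertex.
subcomponents = [
    [(0, 0), (2, 0)],
    [(4, 4)],
    [(1, 4), (1, 6), (1, 8)],
]
vertex_weights, edge_weights = build_subcomponent_graph(subcomponents)
```

Each vertex of the resulting graph stands for a whole sub-component, so the later cutting step works with sums of vertex weights rather than vertex counts.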

(Appendix 11)
The data anonymization method according to appendix 10, wherein, in the sub-component complete graph edge cutting step, the edges are sorted by weight and cut in descending order of weight.

(Appendix 12)
The data anonymization method according to appendix 10, wherein, in the sub-component complete graph edge cutting step, an edge is cut if either of the following conditions is satisfied:
1) the edge is a bridge, and the sum of the weights of the vertices included in each component obtained after cutting the edge is at least k; or
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing it into two subgraphs, and is not a bridge if cutting it does not divide the graph containing it into two subgraphs.
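As a non-limiting illustration, the weighted cutting condition for a single edge of the sub-component complete graph can be sketched in Python as follows. The names `reachable` and `try_cut`, the adjacency-set representation, and the sample weights are assumptions for this sketch only. The check differs from the unweighted case in that each part is measured by the sum of its vertex weights, not by its vertex count.

```python
def reachable(adj, start):
    """Return the set of vertices reachable from `start`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def try_cut(adj, weight, i, j, k):
    """Cut edge (i, j) in place if the condition allows it.

    The cut stands when the edge is not a bridge, or when it is a bridge
    and the vertex weights on each side sum to at least k. Otherwise the
    edge is restored. Returns True if the edge was cut.
    """
    adj[i].discard(j)
    adj[j].discard(i)
    side_i = reachable(adj, i)
    if j in side_i:
        return True  # not a bridge
    side_j = reachable(adj, j)
    if (sum(weight[v] for v in side_i) >= k
            and sum(weight[v] for v in side_j) >= k):
        return True  # bridge, but both parts are heavy enough
    adj[i].add(j)
    adj[j].add(i)
    return False


# Illustrative path graph 0 - 1 - 2 with sub-component sizes as weights.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
weight = {0: 2, 1: 1, 2: 3}
first = try_cut(adj, weight, 0, 1, k=3)   # would isolate weight 2 < 3
second = try_cut(adj, weight, 1, 2, k=3)  # leaves weights 3 and 3
```

Note that this condition only enforces the lower bound k; keeping each cluster at or below 2k-1 is a requirement of the cutting step as a whole and is not checked by this sketch.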

1: Data anonymization device
10: Distance calculation unit
12: Complete graph construction unit
14: Edge cut unit
16: Large component decomposition unit
18: Generalization unit
20: Record recording unit
160: k-center vertex detection unit
162: Sub-component distance calculation unit
164: Sub-component complete graph construction unit
166: Sub-component complete graph edge cut unit
168: Merge unit

Claims

1. A data anonymization device comprising:
a distance calculation unit that calculates the distance between every two records of a plurality of data records;
a complete graph construction unit that constructs a complete graph containing all the records by using each record as a vertex, connecting every two vertices with an edge, and weighting each edge by the distance between the two corresponding records;
an edge cut unit that cuts edges successively in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices, k being a predetermined natural number;
a large component decomposition unit that decomposes each component including more than 2k-1 vertices into a plurality of clusters such that the number of vertices included in each cluster is between k and 2k-1; and
a generalization unit that generalizes the records corresponding to the vertices of each cluster so that the records within each cluster cannot be distinguished from one another,
wherein a component including more than 2k-1 vertices is a large component, and a component including at least k and at most 2k-1 vertices is a cluster.
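As a non-limiting illustration, the work of the distance calculation unit and the complete graph construction unit can be sketched in Python as follows. The sketch assumes numeric records and a Euclidean distance; the claim itself does not fix a distance measure, and the name `build_complete_graph` is hypothetical.

```python
import math
from itertools import combinations


def build_complete_graph(records):
    """Return {(i, j): distance} for every pair i < j of record indices.

    Each record is a vertex; every pair of vertices is joined by an edge
    weighted with the distance between the two records (Euclidean here,
    as an assumption of this sketch).
    """
    return {
        (i, j): math.dist(records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
    }


# Illustrative records with two numeric attributes each.
records = [(0, 0), (3, 4), (6, 8)]
graph = build_complete_graph(records)
```

A complete graph on n records has n(n-1)/2 edges, so this representation grows quadratically with the number of records.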

2. The data anonymization device according to claim 1, wherein the edge cut unit sorts the edges by weight and cuts them in descending order of weight.

3. The data anonymization device according to claim 1, wherein the edge cut unit cuts an edge if either of the following conditions is satisfied:
1) the edge is a bridge, and each subgraph obtained after the edge is cut includes at least k vertices; or
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing it into two separate parts, and is not a bridge if cutting it does not divide the graph containing it into two separate parts.

4. The data anonymization device according to claim 1, wherein the large component decomposition unit comprises:
a k-center vertex detection unit that detects the k-center vertex in each large component and cuts all edges connected to the detected k-center vertex to obtain a plurality of sub-components other than the k-center vertex;
a sub-component distance calculation unit that calculates the center of each sub-component and the distance between each pair of sub-component centers;
a sub-component complete graph construction unit that constructs a sub-component complete graph by using the center of each sub-component as a vertex, weighting each vertex by the size of the corresponding sub-component (the number of record vertices the sub-component contains), connecting every two vertices with an edge, and weighting each edge by the distance between the two corresponding sub-component centers;
a sub-component complete graph edge cut unit that cuts edges successively in order of edge weight to divide the sub-component complete graph into a plurality of clusters such that the sum of the weights of all vertices in each cluster is at least k and at most 2k-1; and
a merge unit that merges the k-center vertex into the cluster closest to the k-center vertex and, if the sum of the weights of all vertices in the merged cluster equals 2k, splits that cluster into two clusters such that the sum of the weights of all vertices in each of the two clusters equals k.

5. The data anonymization device according to claim 4, wherein the sub-component complete graph edge cut unit sorts edges according to edge weights and cuts edges in descending order of the weights.

6. The data anonymization device according to claim 4, wherein the sub-component complete graph edge cut unit cuts an edge if either of the following conditions is satisfied:
1) the edge is a bridge, and the sum of the weights of the vertices included in each component obtained after the edge is cut is at least k; or
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing it into two subgraphs, and is not a bridge if cutting it does not divide the graph containing it into two subgraphs.

7. A data anonymization method comprising:
a distance calculation step of calculating the distance between every two records of a plurality of data records;
a complete graph construction step of constructing a complete graph containing all the records by using each record as a vertex, connecting every two vertices with an edge, and weighting each edge by the distance between the two corresponding records;
an edge cutting step of cutting edges successively in order of edge weight so as to divide the complete graph into a plurality of components each including at least k vertices, k being a predetermined natural number;
a large component decomposition step of decomposing each component including more than 2k-1 vertices into a plurality of clusters such that the number of vertices included in each cluster is between k and 2k-1; and
a generalization step of generalizing the records corresponding to the vertices of each cluster so that the records within each cluster cannot be distinguished from one another,
wherein a component including more than 2k-1 vertices is a large component, and a component including at least k and at most 2k-1 vertices is a cluster.
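As a non-limiting illustration, the generalization step can be sketched in Python as follows. The sketch assumes numeric attributes and generalizes each attribute to the (minimum, maximum) range observed in the cluster; the claim does not prescribe this particular generalization, and the name `generalize_cluster` and the sample values are hypothetical.

```python
def generalize_cluster(records):
    """Replace every record of a cluster by the same attribute ranges.

    Each attribute becomes the (min, max) pair over the cluster, so all
    records of the cluster end up identical and cannot be told apart.
    Range generalization is an assumption of this sketch.
    """
    dims = range(len(records[0]))
    ranges = tuple(
        (min(r[d] for r in records), max(r[d] for r in records))
        for d in dims
    )
    return [ranges for _ in records]


# Illustrative cluster of three records: (age, postal code).
cluster = [(34, 10001), (36, 10005), (35, 10003)]
generalized = generalize_cluster(cluster)
```

Because every cluster holds between k and 2k-1 records, each generalized record is shared by at least k records, while the upper bound limits how wide the ranges need to become.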

8. The data anonymization method according to claim 7, wherein, in the edge cutting step, the edges are sorted by weight and cut in descending order of weight.

9. The data anonymization method according to claim 7, wherein, in the edge cutting step, an edge is cut if either of the following conditions is satisfied:
1) the edge is a bridge, and each subgraph obtained after cutting the edge includes at least k vertices; or
2) the edge is not a bridge,
where an edge is a bridge if cutting it divides the graph containing it into two separate parts, and is not a bridge if cutting it does not divide the graph containing it into two separate parts.

10. The data anonymization method according to claim 7, wherein the large component decomposition step comprises:
a k-center vertex detection step of detecting the k-center vertex in each large component and cutting all edges connected to the detected k-center vertex to obtain a plurality of sub-components other than the k-center vertex;
a sub-component distance calculation step of calculating the center of each sub-component and the distance between each pair of sub-component centers;
a sub-component complete graph construction step of constructing a sub-component complete graph by using the center of each sub-component as a vertex, weighting each vertex by the size of the corresponding sub-component (the number of record vertices the sub-component contains), connecting every two vertices with an edge, and weighting each edge by the distance between the two corresponding sub-component centers;
a sub-component complete graph edge cutting step of cutting edges successively in order of edge weight to divide the sub-component complete graph into a plurality of clusters such that the sum of the weights of all vertices in each cluster is at least k and at most 2k-1; and
a merging step of merging the k-center vertex into the cluster closest to the k-center vertex and, if the sum of the weights of all vertices in the merged cluster equals 2k, splitting that cluster into two clusters such that the sum of the weights of all vertices in each of the two clusters equals k.