JP7096145B2

JP7096145B2 - Clustering device, clustering method and clustering program

Info

Publication number: JP7096145B2
Application number: JP2018232364A
Authority: JP
Inventors: 知明三本; 晋作清本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2022-07-05
Anticipated expiration: 2038-12-12
Also published as: JP2020095437A

Description

The present invention relates to a device, a method, and a program for clustering an encrypted data set.

Conventionally, many k-anonymization algorithms have been proposed as data anonymization techniques, for example as shown in Non-Patent Documents 1 to 3. Because these techniques assume that users anonymize their own data sets, all operations are performed on plaintext.

Non-Patent Document 1: L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," in J. Uncertainty, Fuzziness, and Knowledge-Base Systems, vol. 10(5), 2002, pp. 571-588.
Non-Patent Document 2: K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient full-domain k-anonymity," in Proc. of SIGMOD 2005, 2005, pp. 49-60.
Non-Patent Document 3: K. LeFevre et al., "Mondrian multidimensional k-anonymity," in Proc. of the 22nd International Conference on Data Engineering (ICDE '06), IEEE, 2006, pp. 25-35.
Non-Patent Document 4: Kohlmayer F, Prasser F, Eckert C, Kuhn KA. "A flexible approach to distributed data anonymization," J Biomed Inform 2013 Dec 12.

When anonymization is outsourced, or when data held by multiple institutions is anonymized together, the data must be encrypted. For example, the technique of Non-Patent Document 4 assumes anonymization across multiple institutions, but it has drawbacks: the data holders must participate in the protocol, and deterministic encryption must be used, which reduces anonymity.

Furthermore, although the k-anonymization algorithms mentioned above run quickly on plaintext, applying the same algorithms unchanged to encrypted data raises the following problem.
In secure computation, addition and multiplication can be performed while the data remains concealed, so in theory any computation can be carried out on concealed data. However, whichever technique is used (garbled circuits, secret sharing, homomorphic encryption, and so on), processing is known to be slow, and it is difficult to achieve the intended function within a practical time for large-scale data.

An object of the present invention is to provide a clustering device, a clustering method, and a clustering program capable of clustering encrypted data at high speed.

The clustering device according to the present invention includes: an initial processing unit that, when dividing the records of an encrypted data set into a predetermined number of clusters, assigns at least one initial record to each cluster; a center of gravity calculation unit that calculates, for each of the clusters, the center of gravity of all the records belonging to it; and a division processing unit that executes in parallel, for each cluster, a division process that makes records of the data set belong to that cluster in order from the record closest to the center of gravity.

The center of gravity calculation unit may calculate the center of gravity of a cluster after the division process has been executed a plurality of times for that cluster.

The initial processing unit may select the initial records at random.

The clustering device may include a first adjustment unit that, when the processing by the division processing unit results in records that belong to a plurality of clusters at once, executes in parallel a first adjustment process that makes each such record belong only to the cluster whose center of gravity is closest.

The clustering device may include a second adjustment unit that, when the processing by the division processing unit results in records that belong to no cluster, executes in parallel a second adjustment process that makes each such record belong to the cluster whose center of gravity is closest.

The clustering device may include a third adjustment unit that, when the processing by the division processing unit results in clusters smaller than a predetermined size, executes in parallel a third adjustment process that, for each such cluster, makes the records belonging to it belong to the cluster whose center of gravity is closest to its own.

In the clustering method according to the present invention, a computer executes: an initial processing step of assigning, when dividing the records of an encrypted data set into a predetermined number of clusters, at least one initial record to each cluster; a center of gravity calculation step of calculating, for each of the clusters, the center of gravity of all the records belonging to it; and a division processing step of executing in parallel, for each cluster, a division process that makes records of the data set belong to that cluster in order from the record closest to the center of gravity.

The clustering program according to the present invention causes a computer to execute: an initial processing step of assigning, when dividing the records of an encrypted data set into a predetermined number of clusters, at least one initial record to each cluster; a center of gravity calculation step of calculating, for each of the clusters, the center of gravity of all the records belonging to it; and a division processing step of executing in parallel, for each cluster, a division process that makes records of the data set belong to that cluster in order from the record closest to the center of gravity.

According to the present invention, encrypted data can be clustered at high speed.

FIG. 1 is a block diagram showing the functional configuration of the clustering device according to the embodiment.
FIG. 2 illustrates the algorithm of the division process in the clustering method according to the embodiment.
FIG. 3 illustrates an algorithm that includes the first adjustment process and the second adjustment process according to the embodiment.
FIG. 4 illustrates an algorithm that includes the third adjustment process and the anonymization process according to the embodiment.

An example of an embodiment of the present invention is described below.
FIG. 1 is a block diagram showing the functional configuration of the clustering device 1 according to the present embodiment.
The clustering device 1 is an information processing device (computer) such as a server or a personal computer, and includes a control unit 10 and a storage unit 20 as well as input/output devices for various data, communication devices, and the like.

The control unit 10 controls the clustering device 1 as a whole, and realizes each function of the present embodiment by reading and executing, as appropriate, the various programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the various programs and data that make the hardware function as the clustering device 1, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores a program (the clustering program) that causes the control unit 10 to execute each function of the present embodiment, together with the data set processed by this program, various variables, flags, and the like.

The control unit 10 includes an initial processing unit 11, a center of gravity calculation unit 12, a division processing unit 13, a first adjustment unit 14, a second adjustment unit 15, and a third adjustment unit 16. Using these functional units, the control unit 10 divides the data set to be processed into a plurality of clusters while the data set remains encrypted.
The clustering device 1 treats each record of the data set as a data point and realizes k-anonymization by replacing the points belonging to each resulting cluster with a representative point such as the center of gravity.
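As a small illustration of why replacing points with a cluster representative yields k-anonymity, the following plaintext sketch checks that every released value is shared by at least k records. The function name `is_k_anonymous` and the sample data are hypothetical and are not taken from the patent; no encryption is involved here.

```python
from collections import Counter


def is_k_anonymous(released, k):
    """Return True if every distinct released value occurs at least k times."""
    return all(count >= k for count in Counter(released).values())


# Two clusters whose members were replaced by their representative point.
released = [(2.0, 3.0), (2.0, 3.0), (11.0, 12.0), (11.0, 12.0), (11.0, 12.0)]
print(is_k_anonymous(released, 2))  # smallest group has 2 records
print(is_k_anonymous(released, 3))  # the first group is too small for k = 3
```

The check only looks at the released values, so it applies regardless of how the clusters were formed.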

When dividing the plurality of records of the encrypted data set into a predetermined number of clusters, the initial processing unit 11 assigns at least one initial record to each cluster.
The method of assigning the initial records is not limited, but in the present embodiment processing speed is prioritized, and the initial processing unit 11 selects the initial records at random.

The center of gravity calculation unit 12 calculates, for each cluster, the center of gravity of all the records belonging to it.
The number of records belonging to a cluster grows each time the division processing unit 13 performs the division process. The center of gravity may therefore be updated after every division process, but in the present embodiment processing speed is prioritized, and the center of gravity calculation unit 12 calculates and updates the center of gravity of each cluster after the division process has been executed a plurality of times. For example, the center of gravity calculation unit 12 may skip the per-iteration update during the loop and update the center of gravity once after the predetermined number of division processes has completed, or a few times before the predetermined number of division processes completes.
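The center of gravity computation and the deferred update schedule can be sketched in plaintext as follows. This is only an illustration under assumptions: the names `centroid` and `update_rounds` and the parameter `update_every` are hypothetical, membership is given as plain 0/1 flags rather than ciphertexts, and the patent does not prescribe a particular schedule.

```python
def centroid(points, cluster_flags):
    """Mean of the points whose membership flag for this cluster is 1."""
    members = [p for p, flag in zip(points, cluster_flags) if flag == 1]
    dim = len(points[0])
    return tuple(sum(p[d] for p in members) / len(members) for d in range(dim))


def update_rounds(total_rounds, update_every):
    """1-based division rounds after which the center of gravity is recomputed."""
    return [t for t in range(1, total_rounds + 1)
            if t % update_every == 0 or t == total_rounds]


points = [(0.0, 0.0), (2.0, 4.0), (9.0, 9.0)]
print(centroid(points, [1, 1, 0]))   # mean of the first two points
print(update_rounds(5, 5))           # a single update after the last round
print(update_rounds(5, 2))           # a few updates during the loop
```

A larger `update_every` means fewer recomputations, which is the speed-for-accuracy trade-off the text describes.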

For each cluster, the division processing unit 13 executes, in parallel and a predetermined number of times, a division process that makes records of the data set belong to the cluster in order from the record closest to its center of gravity. The number of executions may be, for example, k + α, where k is the minimum number of records per cluster in k-anonymization and α is an adjustment value set as appropriate.
In this way, the division processing unit 13 parallelizes the search for records close to a center of gravity across the clusters, which improves processing speed.
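The per-cluster parallelism can be sketched with one worker per cluster. This is a simplified plaintext illustration, not the patented procedure: the function names are hypothetical, squared Euclidean distance is an assumption (the text does not name a distance), the centers of gravity are held fixed during the search, and each cluster searches the whole data set independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_records(center, points, count):
    """Indices of the `count` points closest to `center`, nearest first."""
    order = sorted(range(len(points)),
                   key=lambda i: squared_distance(points[i], center))
    return order[:count]


def divide_in_parallel(centers, points, count):
    """Run the nearest-record search once per cluster, one worker per cluster."""
    with ThreadPoolExecutor(max_workers=len(centers)) as pool:
        return list(pool.map(lambda g: nearest_records(g, points, count), centers))


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0), (5.0, 5.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
print(divide_in_parallel(centers, points, count=3))
```

With these sample points, record 4 is claimed by both clusters, which is the kind of overlap that a later adjustment step has to resolve.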

When the processing by the division processing unit 13 results in records that belong to a plurality of clusters at once, the first adjustment unit 14 executes in parallel a first adjustment process that, for each such record, keeps it only in the cluster whose center of gravity is closest and cancels its membership in the other clusters.

When the processing by the division processing unit 13 results in records that belong to no cluster, the second adjustment unit 15 executes in parallel a second adjustment process that makes each such record belong to the cluster whose center of gravity is closest.

When the processing by the division processing unit 13 results in clusters smaller than a predetermined size, that is, with fewer records than the minimum number k in k-anonymization, the third adjustment unit 16 executes in parallel a third adjustment process that, for each such cluster, makes the records belonging to it belong to the other cluster whose center of gravity is closest to its own.

FIG. 2 illustrates Algorithm A, the division process in the clustering method according to the present embodiment.
Algorithm A takes as input the encrypted data set E(D), the k-anonymization parameter k, and the adjustment parameter α.

First, let r_i[0] be the value (data point) of the i-th record of the data set E(D). In addition, let r_i[j] be a flag indicating whether the record r_i belongs to the j-th cluster (= E(1)) or not (= E(0)) (step 2).
With n records in the data set and K clusters, r_i[j] is initialized to E(0) for 1 ≦ i ≦ n and 1 ≦ j ≦ K (step 3).

Also, let d_i[j] be the distance between r_i[0] and the center of gravity g(c_j) of cluster c_j (step 4).
For 1 ≦ i ≦ n and 1 ≦ j ≦ K, d_i[j] is initialized to E(∞) (step 5).

Next, K = floor(n/k) records are selected at random from the data set E(D) (step 6).
Assuming that the i-th record is selected for the i-th cluster, r_i[i] is updated to E(1) and g(c_i) is updated to E(r_i[0]) (step 7).
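Steps 2 to 7 can be sketched in plaintext as follows. This is a sketch under assumptions: E() is treated as the identity (no encryption), indices are 0-based, the record values are kept in `points` instead of r_i[0], and the names `initialize`, `flags`, `dist`, and `seeds` are hypothetical.

```python
import random


def initialize(points, k, seed=None):
    """Set up flags r_i[j], distances d_i[j], and one random seed record per cluster."""
    n = len(points)
    K = n // k                                     # K = floor(n / k) clusters
    flags = [[0] * K for _ in range(n)]            # r_i[j] starts at 0 (step 3)
    dist = [[float("inf")] * K for _ in range(n)]  # d_i[j] starts at infinity (step 5)
    rng = random.Random(seed)
    seeds = rng.sample(range(n), K)                # K distinct records (step 6)
    centers = []
    for j, i in enumerate(seeds):                  # step 7
        flags[i][j] = 1
        centers.append(points[i])
    return flags, dist, centers, seeds


points = [(float(i), 0.0) for i in range(7)]
flags, dist, centers, seeds = initialize(points, k=3, seed=0)
print(len(centers), seeds)
```

With n = 7 and k = 3 this produces K = 2 clusters, each starting from a single record whose value is also the initial center of gravity.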

In steps 9 to 15, the division process is repeated k + α times for each of the K clusters. The division process is parallelized with a degree of parallelism equal to the number of clusters, and steps 12 and 13 are executed in parallel, one cluster per parallel unit.

For cluster c_j, the division processing unit 13 selects, from among the records that do not belong to any cluster, the record i' whose distance to the center of gravity g(c_j) is smallest (step 12). The division processing unit 13 then updates r_i'[j] and d_i'[j] (step 13).
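A single pass of steps 12 and 13 for one cluster might look like the plaintext sketch below. The function name `division_step` is hypothetical, squared Euclidean distance is an assumption, and flags and distances are plain numbers rather than ciphertexts.

```python
def division_step(j, center, points, flags, dist):
    """Add the nearest not-yet-assigned record to cluster j; return its index or None."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(points[i], center))

    free = [i for i, row in enumerate(flags) if sum(row) == 0]
    if not free:
        return None
    chosen = min(free, key=sq_dist)   # step 12: closest record with no cluster yet
    flags[chosen][j] = 1              # step 13: set the membership flag
    dist[chosen][j] = sq_dist(chosen) # step 13: remember the distance
    return chosen


inf = float("inf")
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
flags = [[1, 0], [0, 0], [0, 0]]
dist = [[0.0, inf], [inf, inf], [inf, inf]]
print(division_step(0, (0.0, 0.0), points, flags, dist))
print(flags, dist[1][0])
```

When several clusters run this step at the same time, they can pick the same record, because none of them sees the others' updates until the step has finished.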

When the k + α iterations are complete, cluster c_j contains the records r_i whose flag r_i[j] is E(1).
The center of gravity calculation unit 12 updates the center of gravity g(c_j) of every cluster c_j (step 17).

The number of iterations is adjusted by the parameter α. The more iterations, the more records end up belonging to several clusters at once; the fewer iterations, the more records end up belonging to no cluster.
The parameter α is set as appropriate for the situation, and records that were not clustered correctly are adjusted by the following processes.
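To see how a given choice of α played out, one can count the records in each of the two problem categories from the membership flags. This plaintext helper is hypothetical (the name `membership_summary` is not from the patent) and works on plain 0/1 flags.

```python
def membership_summary(flags):
    """Count records that belong to several clusters and records that belong to none."""
    counts = [sum(row) for row in flags]
    return {
        "duplicated": sum(1 for c in counts if c > 1),
        "unassigned": sum(1 for c in counts if c == 0),
    }


flags = [[1, 0], [1, 1], [0, 0], [0, 1]]
print(membership_summary(flags))
```

The two counts correspond to the records handled by the first and second adjustment processes respectively.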

FIG. 3 illustrates Algorithm B, which includes the first adjustment process and the second adjustment process according to the present embodiment.

The first adjustment process in steps 23 to 27 adjusts the records that belong to a plurality of clusters, that is, the records r_i for which the sum of r_i[j] over 1 ≦ j ≦ K is neither E(0) nor E(1). The first adjustment unit 14 processes these records in parallel.

For each such record r_i, the first adjustment unit 14 selects the cluster c_j' whose distance d_i[j] from the center of gravity is smallest (step 25), updates the flag r_i[j'] to E(1), and updates all other flags r_i[j] (j ≠ j') to E(0) (step 26).
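The first adjustment can be sketched in plaintext as follows. The function name `first_adjustment` is hypothetical, the loop is sequential although each record could be handled independently, and the stored distances are assumed to be infinite for clusters the record was never added to.

```python
def first_adjustment(flags, dist):
    """Keep each multiply-assigned record only in its closest cluster."""
    for i, row in enumerate(flags):
        if sum(row) > 1:                                     # belongs to several clusters
            best = min(range(len(row)), key=lambda j: dist[i][j])
            flags[i] = [1 if j == best else 0 for j in range(len(row))]
    return flags


inf = float("inf")
flags = [[1, 1, 0], [0, 1, 0]]
dist = [[4.0, 1.0, inf], [inf, 2.0, inf]]
print(first_adjustment(flags, dist))
```

Records that already belong to exactly one cluster are left untouched.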

The second adjustment process in steps 29 to 33 adjusts the records that belong to no cluster, that is, the records r_i for which the sum of r_i[j] over 1 ≦ j ≦ K is E(0). The second adjustment unit 15 processes these records in parallel.

For each such record r_i, the second adjustment unit 15 selects the cluster c_j' whose center of gravity is closest (step 31) and updates the flag r_i[j'] to E(1) (step 32).
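The second adjustment can be sketched in plaintext in the same style. The function name `second_adjustment` is hypothetical, squared Euclidean distance is an assumption, and the loop is sequential even though each record could be processed independently.

```python
def second_adjustment(points, flags, centers):
    """Put each record that has no cluster into the cluster with the closest center."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    for i, row in enumerate(flags):
        if sum(row) == 0:                                    # belongs to no cluster
            best = min(range(len(centers)),
                       key=lambda j: sq_dist(points[i], centers[j]))
            row[best] = 1
    return flags


points = [(0.0, 0.0), (9.0, 9.0), (1.0, 1.0)]
flags = [[1, 0], [0, 1], [0, 0]]
centers = [(0.0, 0.0), (9.0, 9.0)]
print(second_adjustment(points, flags, centers))
```

After this pass every record has exactly one flag set, provided the first adjustment has already removed duplicates.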

When the first adjustment process and the second adjustment process have finished, the center of gravity g(c_j) of every cluster c_j is updated (step 34).

FIG. 4 illustrates Algorithm C, which includes the third adjustment process and the anonymization process according to the present embodiment.

The third adjustment process in steps 43 to 47 adjusts the clusters that contain fewer than k records, that is, the clusters c_j for which the sum of r_i[j] over 1 ≦ i ≦ n is smaller than E(k). The third adjustment unit 16 processes these clusters in parallel.

For each cluster c_j to be adjusted, the third adjustment unit 16 selects the other cluster c_h' whose center of gravity is closest to that of c_j (step 45) and, to move all records of cluster c_j into cluster c_h', updates the flag r_i[h'] to E(1) and the flag r_i[j] to E(0) (step 46).
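The third adjustment can be sketched in plaintext as follows. This is an illustration under assumptions: the function name `third_adjustment` is hypothetical, cluster sizes are taken from a snapshot before any records move (mimicking independent per-cluster processing), squared Euclidean distance is assumed, and the case where the target cluster is itself undersized is not handled because the text does not describe it.

```python
def third_adjustment(flags, centers, k):
    """Move all records of each cluster smaller than k into the nearest other cluster."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    n, K = len(flags), len(centers)
    sizes = [sum(flags[i][j] for i in range(n)) for j in range(K)]
    for j in range(K):
        if sizes[j] < k:
            others = [c for c in range(K) if c != j]
            h = min(others, key=lambda c: sq_dist(centers[j], centers[c]))
            for i in range(n):
                if flags[i][j] == 1:
                    flags[i][h] = 1
                    flags[i][j] = 0
    return flags


flags = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
centers = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(third_adjustment(flags, centers, k=2))
```

In the sample, the single-record middle cluster is absorbed by the first cluster, whose center is closer than that of the third.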

Next, for k-anonymization, the control unit 10 updates the value r_i[0] of every record to the center of gravity g(c_j) of the cluster c_j to which the record belongs (steps 48 to 50).
The control unit 10 then outputs the set of r_i[0] as the clustered data set E(D') (step 51).
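One way to express the final replacement using only multiplications and additions is to weight each center of gravity by the record's membership flag. Whether Algorithm C computes it this way is not stated in the text; the sketch below is a plaintext assumption with a hypothetical function name, and it relies on each record having exactly one flag set.

```python
def anonymize(flags, centers):
    """New value of record i = sum over j of flag[i][j] * center[j]."""
    dim = len(centers[0])
    return [
        tuple(sum(row[j] * centers[j][d] for j in range(len(centers)))
              for d in range(dim))
        for row in flags
    ]


flags = [[1, 0], [0, 1], [1, 0]]
centers = [(2.0, 3.0), (11.0, 12.0)]
print(anonymize(flags, centers))
```

Because the selection is done arithmetically, no branch depends on which cluster a record belongs to.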

According to the present embodiment, when dividing the records of an encrypted data set into clusters, the clustering device 1 fixes the number of clusters from the outset, assigns at least one initial record to each cluster, and then executes in parallel, for each cluster, a division process that makes records belong to the cluster in order from the record closest to its center of gravity.

In conventional clustering methods, a decision (Argmin) about which cluster each record belongs to is required for every record. For example, the distance between each record and each cluster is calculated, the cluster the record belongs to is determined, and the cluster is updated, and this is done for every record. Such an algorithm is difficult to parallelize, and Argmin or Argmax must be evaluated at least as many times as there are records.
By contrast, the clustering device 1 calculates the distances from each cluster to the records and only afterwards updates the center of gravity of the cluster, so the work can be parallelized to a degree equal to the number of clusters. By parallelizing the expensive search for records close to a center of gravity, clustering of a data set (for example, k-anonymization) can be executed at high speed even on ciphertext.

The clustering device 1 can speed up processing by recalculating the center of gravity of a cluster less often than the cluster is updated. The recalculation frequency may be set as appropriate in view of the trade-off with clustering accuracy.
The clustering device 1 can also speed up processing by selecting the initial record of each cluster at random.

When the division process leaves records that belong to a plurality of clusters at once, the clustering device 1 executes in parallel a first adjustment process that makes each such record belong only to the cluster whose center of gravity is closest.
When the division process leaves records that belong to no cluster, the clustering device 1 executes in parallel a second adjustment process that makes each such record belong to the cluster whose center of gravity is closest.
When the division process leaves clusters smaller than a predetermined size, the clustering device 1 executes in parallel a third adjustment process that, for each such cluster, makes its records belong to the other cluster whose center of gravity is closest to its own.
The clustering device 1 can therefore correct the parts that the division process did not cluster adequately, and can execute these adjustment processes at high speed by parallelizing them.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. In addition, the effects described in the embodiment are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

The clustering method performed by the clustering device 1 is implemented in software. When it is implemented in software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer over a network. Furthermore, these programs may be provided to a user's computer as a Web service over a network without being downloaded.

1 Clustering device
10 Control unit
11 Initial processing unit
12 Center of gravity calculation unit
13 Division processing unit
14 First adjustment unit
15 Second adjustment unit
16 Third adjustment unit
20 Storage unit

Claims

A clustering device comprising:
an initial processing unit that, when dividing the records of an encrypted data set into a predetermined number of clusters, assigns at least one initial record to each cluster;
a center of gravity calculation unit that calculates, on the ciphertext, for each of the clusters, the center of gravity of all the records belonging to it; and
a division processing unit that executes, for each cluster, a division process that makes records of the data set belong to that cluster in order from the record whose distance to the center of gravity, calculated on the ciphertext, is smallest, the division process being executed in parallel with a degree of parallelism equal to the number of clusters.

The clustering device according to claim 1, wherein the center of gravity calculation unit calculates the center of gravity of a cluster after the division process has been executed a plurality of times for that cluster.

The clustering device according to claim 1 or 2, wherein the initial processing unit selects the initial record at random.

The clustering device according to any one of claims 1 to 3, further comprising a first adjustment unit that, when the processing by the division processing unit results in records that belong to a plurality of clusters at once, executes in parallel a first adjustment process that makes each such record belong only to the cluster whose center of gravity is closest.

The clustering device according to any one of claims 1 to 4, further comprising a second adjustment unit that, when the processing by the division processing unit results in records that belong to no cluster, executes in parallel a second adjustment process that makes each such record belong to the cluster whose center of gravity is closest.

The clustering device according to any one of claims 1 to 5, further comprising a third adjustment unit that, when the processing by the division processing unit results in clusters smaller than a predetermined size, executes in parallel a third adjustment process that, for each such cluster, makes the records belonging to it belong to the cluster whose center of gravity is closest to its own.

A clustering method in which a computer executes:
an initial processing step of assigning, when dividing the records of an encrypted data set into a predetermined number of clusters, at least one initial record to each cluster;
a center of gravity calculation step of calculating, on the ciphertext, for each of the clusters, the center of gravity of all the records belonging to it; and
a division processing step of executing, for each cluster, a division process that makes records of the data set belong to that cluster in order from the record whose distance to the center of gravity, calculated on the ciphertext, is smallest, the division process being executed in parallel with a degree of parallelism equal to the number of clusters.

A clustering program for causing a computer to execute:
an initial processing step of assigning, when dividing the records of an encrypted data set into a predetermined number of clusters, at least one initial record to each cluster;
a center of gravity calculation step of calculating, on the ciphertext, for each of the clusters, the center of gravity of all the records belonging to it; and
a division processing step of executing, for each cluster, a division process that makes records of the data set belong to that cluster in order from the record whose distance to the center of gravity, calculated on the ciphertext, is smallest, the division process being executed in parallel with a degree of parallelism equal to the number of clusters.