JP6015661B2

JP6015661B2 - Data division apparatus, data division system, data division method, and program

Info

Publication number: JP6015661B2
Application number: JP2013534778A
Authority: JP
Inventors: 隆夫竹之内
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-09-21
Filing date: 2012-09-14
Publication date: 2016-10-26
Anticipated expiration: 2032-09-14
Also published as: JPWO2013042788A1; WO2013042788A1

Description

The present invention relates to a technique for appropriately dividing data that is held in a distributed manner, without the holders disclosing the data to one another.

Techniques are known for dividing data held in a distributed manner without mutual disclosure. For example, Non-Patent Document 1 discloses a technique for generating a classification tree by data mining over distributed data that is subject to privacy protection.
In a situation where multiple operators hold different kinds of personal information about users, the technique described in Non-Patent Document 1 implements the ID3 (Iterative Dichotomiser 3) algorithm in a distributed environment so that the operators can generate a classification tree without disclosing their data to each other. The technique determines the division points of the classification tree by entropy calculation using MPC (Multi Party Computation).
The classification tree generation operation of the technique described in Non-Patent Document 1 is described below.
FIG. 46 is a diagram showing an example of data held by operator S. As shown in FIG. 46, operator S holds data on “userID”, “waist circumference (X)”, and “Class”. “userID” is the identifier of a user registered in the data. Operator S holds data on nine users in total, user1 to user9. “Waist circumference (X)” is data indicating the user's waist circumference. “Class” takes the value “A” or “B”: “A” indicates that the user is non-metabolic, and “B” indicates that the user is metabolic.
FIG. 47 is a diagram showing an example of data held by operator T. As shown in FIG. 47, operator T holds data on “userID” and “blood pressure (Y)”. Operator T holds data on eight users in total, user1 to user8. “Blood pressure (Y)” is data indicating the user's blood pressure.
For the data of user1 to user8, the users common to operator S and operator T, the technique described in Non-Patent Document 1 calculates entropy using MPC and determines “X = 90”, the point at which the entropy after division is smallest, as the division point.
FIG. 48 is a diagram showing the data held by operator S and operator T divided into two at “X = 90”. FIG. 49 is a diagram showing the data distribution divided at “X = 90”, with waist circumference on the X axis and blood pressure on the Y axis. As shown in FIGS. 48 and 49, the data of user1 to user8 is divided, with “X = 90” as the division point, into a group of “user1 to 3, 7, 8” and a group of “user4 to 6”.
FIG. 50 is a diagram showing the data of FIG. 48 further divided at “Y = 130”. FIG. 51 is a diagram showing the data distribution divided at “Y = 130”. As shown in FIGS. 50 and 51, the technique described in Non-Patent Document 1 judges, by entropy calculation using MPC, that dividing at “Y = 130” yields the least mixing of data and is appropriate as a division point.
FIG. 52 is a diagram illustrating an example of a classification tree finally generated by the technique described in Non-Patent Document 1. In the case of the above example, according to the technique described in Non-Patent Document 1, the classification tree shown in FIG. 52 is generated.

Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Crypto 2000.
Noman Mohammed, Benjamin C. M. Fung, Ke Wang, Patrick C. K. Hung, “Privacy-Preserving Data Mashup”, in EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009.

The problem with the technique described in Non-Patent Document 1 is that, when the data of non-common users held by one operator (device) is significant, that data is not reflected in the division process and classification accuracy deteriorates. This is because the technique processes only the data of users common to both operators (devices) and does not process the data of non-common users held by only one operator (device). Moreover, even if the other operator (device) randomly assigns values for the non-common users whose data is held by one operator (device), classification accuracy does not necessarily improve.

To achieve the above object, a data dividing device according to the present invention divides user data that includes first personal information of the own device, user identifiers assigned to the first personal information of the own device, and user identifiers assigned to second personal information of another device. The device includes: transmission/reception means for acquiring the user identifiers of the other device; storage means for storing the user identifiers of the own device and the values of the first personal information associated with those identifiers; setting means for setting a dummy value of the first personal information as dummy data associated with each acquired user identifier of the other device that does not match any user identifier of the own device; dividing means for dividing predetermined user data including the dummy data into groups at a division point determined on the basis of the values of the first personal information, including the dummy values, or the values of the second personal information; and correction means for correcting the set dummy values of the first personal information on the basis of the first personal information values of the user data, among the user data belonging to a group after division, whose user identifiers match between the own device and the other device.
To achieve the above object, a data dividing system according to the present invention divides user data that includes first personal information of a first data dividing device, user identifiers assigned to the first personal information of the first data dividing device, and user identifiers assigned to second personal information of a second data dividing device. The first data dividing device includes: first transmission/reception means for acquiring the user identifiers of the second data dividing device; first storage means for storing the user identifiers of the first data dividing device and the values of the first personal information associated with those identifiers; first setting means for setting a first dummy value of the first personal information as dummy data associated with each acquired user identifier of the second data dividing device that does not match any user identifier of the first data dividing device; first dividing means for dividing predetermined user data including the dummy data into groups at a division point determined on the basis of the values of the first personal information, including the first dummy values, or the values of the second personal information, including second dummy values set by the second data dividing device; and correction means for correcting the set first dummy values on the basis of the first personal information values of the user data, among the user data belonging to a group after division, whose user identifiers match between the first data dividing device and the second data dividing device. The second data dividing device includes: second transmission/reception means for acquiring the user identifiers of the first data dividing device; second storage means for storing the user identifiers of the second data dividing device and the values of the second personal information associated with those identifiers; second setting means for setting a second dummy value of the second personal information as dummy data associated with each acquired user identifier of the first data dividing device that does not match any user identifier of the second data dividing device; second dividing means for dividing predetermined user data including the dummy data into groups at a division point determined on the basis of the values of the second personal information, including the second dummy values, or the values of the first personal information, including the first dummy values; and correction means for correcting the set second dummy values on the basis of the second personal information values of the user data, among the user data belonging to a group after division, whose user identifiers match between the second data dividing device and the first data dividing device.
To achieve the above object, a data dividing method according to the present invention divides user data that includes first personal information of the own device, user identifiers assigned to the first personal information of the own device, and user identifiers assigned to second personal information of another device. The method includes: acquiring the user identifiers of the other device; storing the user identifiers of the own device and the values of the first personal information associated with those identifiers; setting a dummy value of the first personal information as dummy data associated with each acquired user identifier of the other device that does not match any user identifier of the own device; dividing predetermined user data including the dummy data into groups at a division point determined on the basis of the values of the first personal information, including the dummy values, or the values of the second personal information; and correcting the set dummy values of the first personal information on the basis of the first personal information values of the user data, among the user data belonging to a group after division, whose user identifiers match between the own device and the other device.
To achieve the above object, a program according to the present invention realizes a data dividing device that divides user data including first personal information of the own device, user identifiers assigned to the first personal information of the own device, and user identifiers assigned to second personal information of another device. The program causes a computer to execute processing of: acquiring the user identifiers of the other device; storing the user identifiers of the own device and the values of the first personal information associated with those identifiers; setting a dummy value of the first personal information as dummy data associated with each acquired user identifier of the other device that does not match any user identifier of the own device; dividing predetermined user data including the dummy data into groups at a division point determined on the basis of the values of the first personal information, including the dummy values, or the values of the second personal information; and correcting the set dummy values of the first personal information on the basis of the first personal information values of the user data, among the user data belonging to a group after division, whose user identifiers match between the own device and the other device.

One example of the effect of the present invention is that, in distributed processing of data by multiple devices, data can be divided accurately by making effective use of the non-common user data held by one of the devices.

FIG. 1 is a diagram showing an example of data held by operator S.
FIG. 2 is a diagram showing an example of data held by operator T.
FIG. 3 is a diagram showing the data held by operator S and operator T divided into two at “Y = 130”.
FIG. 4 is a diagram showing the data distribution divided at “Y = 130”, with waist circumference on the X axis and blood pressure on the Y axis.
FIG. 5 is a diagram showing an example of the classification tree finally generated from the data shown in FIGS. 1 and 2.
FIG. 6 is a diagram showing the classification results for user1 to user15 using the classification tree generated by the technique of Non-Patent Document 1.
FIG. 7 is a diagram showing the data held by operator T with the values of user7 to user15 determined by resampling.
FIG. 8 is a diagram showing the data of user1 to user15 divided into two at “X = 90”.
FIG. 9 is a diagram showing the distribution of the data of user1 to user15, including data with dummy values set, divided at “X = 90”.
FIG. 10 is a diagram showing the data of FIG. 8 further divided at “Y = 120”.
FIG. 11 is a diagram showing the distribution of the data, including data with dummy values set, divided at “Y = 120”.
FIG. 12 is a diagram showing an example of the classification tree finally generated from the data of user1 to user15, including data with dummy values set.
FIG. 13 is a diagram showing the classification results for user1 to user15 using the classification tree generated from data that includes data with dummy values set.
FIG. 14 is a block diagram showing the configuration of the first data dividing device 100 according to the first embodiment.
FIG. 15 is a block diagram showing the configuration of the second data dividing device 200 according to the first embodiment.
FIG. 16 is a flowchart showing the operation of the first data dividing device 100 according to the first embodiment of the present invention.
FIG. 17 is a diagram showing data obtained by correcting the dummy values of the data of FIG. 7.
FIG. 18 is a diagram showing the state of division when the dummy values are corrected.
FIG. 19 is a diagram showing the data distribution and the division when the dummy values are corrected.
FIG. 20 is a diagram showing the group {user4 to 6, 10, 13, 15} in the data of FIG. 18 further divided at “Y = 130”.
FIG. 21 is a diagram showing the distribution of the data with corrected dummy values, with the group {user4 to 6, 10, 13, 15} divided at “Y = 130”.
FIG. 22 is a diagram showing an example of the classification tree finally generated from the data of user1 to user15 when the dummy values are corrected.
FIG. 23 is a diagram showing the classification results for user1 to user15 using the classification tree generated from the data with corrected dummy values.
FIG. 24 is a block diagram showing the configuration of the first data dividing device 300 according to the second embodiment.
FIG. 25 is a flowchart showing the operation of the first data dividing device 300 according to the second embodiment of the present invention.
FIG. 26 is a diagram showing data obtained by adjusting the dummy values of the corrected data of FIG. 17.
FIG. 27 is a diagram showing the state of division when the dummy values are adjusted.
FIG. 28 is a diagram showing the data distribution and the division when the dummy values are adjusted.
FIG. 29 is a diagram showing an example of the classification tree finally generated from the data of user1 to user15 when the dummy values are adjusted.
FIG. 30 is a diagram showing the classification results for user1 to user15 using the classification tree generated from the data with adjusted dummy values.
FIG. 31 is a block diagram showing the configuration of the first data dividing device 400 according to the third embodiment.
FIG. 32 is a block diagram showing the configuration of the second data dividing device 500 according to the third embodiment.
FIG. 33 is a flowchart showing the operation of the first data dividing device 400 in the third embodiment.
FIG. 34 is a diagram showing an example of data held by operator T in the third embodiment.
FIG. 35 is a diagram showing data in which the values of user7 to user15 are determined by resampling in the third embodiment.
FIG. 36 is a diagram showing an example of the initial anonymous data generated by operator S.
FIG. 37 is a diagram showing an example of the initial anonymous data generated by operator T.
FIG. 38 is a diagram showing the data of FIG. 36 divided at “X = 90”.
FIG. 39 is a diagram showing the data of FIG. 37 divided at “X = 90”.
FIG. 40 is a diagram showing data obtained by correcting the dummy values of the data of FIG. 35.
FIG. 41 is a diagram showing the data of FIG. 38 divided at “Y = 130”.
FIG. 42 is a diagram showing the data of FIG. 39 divided at “Y = 130”.
FIG. 43 is a diagram showing the final combined anonymized data generated by the present invention according to the third embodiment.
FIG. 44 is a block diagram showing the configuration of the data dividing device 600 according to the fourth embodiment.
FIG. 45 is a block diagram showing an example of the hardware configuration of the first data dividing device 100 according to the first embodiment.
FIG. 46 is a diagram showing an example of data held by operator S, for explaining the background art.
FIG. 47 is a diagram showing an example of data held by operator T, for explaining the background art.
FIG. 48 is a diagram showing the data held by operator S and operator T divided into two at “X = 90”.
FIG. 49 is a diagram showing the data distribution divided at “X = 90”, with waist circumference on the X axis and blood pressure on the Y axis.
FIG. 50 is a diagram showing the data of FIG. 48 further divided at “Y = 130”.
FIG. 51 is a diagram showing the data distribution divided at “Y = 130”.
FIG. 52 is a diagram showing an example of the classification tree finally generated by the technique described in Non-Patent Document 1.

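First, to facilitate understanding of the embodiments of the present invention, the background of the invention is described.
As in the example described in the background art, consider a case where operator S and operator T generate a classification tree from the personal information each holds. The personal information held by each operator may be associated with common identifiers managed by an identifier management operator.
The technique of Non-Patent Document 1 is assumed to be used for classification tree generation.
First, for example, the identifier management operator notifies each operator of the identifiers of the users who are the target of classification tree generation. Here, assume that the identifiers user1 to user15 have been notified to each operator.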
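Assume that operator S holds the data shown in FIG. 1 for the users with the notified identifiers. As shown in FIG. 1, operator S holds personal information (“waist circumference (X)” and “Class” data) on the users with identifiers user1 to user15. “Class” takes the value “A” or “B”: “A” indicates that the user is non-metabolic, and “B” indicates that the user is metabolic.
Assume that operator T holds the data shown in FIG. 2 for the users with the notified identifiers. As shown in FIG. 2, operator T holds personal information (“blood pressure (Y)” data) on the users with identifiers user1 to user6.
The technique described in Non-Patent Document 1 uses the data of user1 to user6, the users common to operator S and operator T, as the target of the classification tree generation process, and does not use the data of user7 to user15. By calculating entropy with MPC over the data of user1 to user6, the technique determines “Y = 130”, the point at which the entropy after division is smallest, as the division point.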
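FIG. 3 is a diagram showing the data held by operator S and operator T divided into two at “Y = 130”. FIG. 4 is a diagram showing the data distribution divided at “Y = 130”, with waist circumference on the X axis and blood pressure on the Y axis. As shown in FIGS. 3 and 4, the data of user1 to user6 is divided, with “Y = 130” as the division point, into a group of “user1 to 4” and a group of “user5, 6”.
FIG. 5 is a diagram showing an example of the classification tree finally generated from the data shown in FIGS. 1 and 2. In the above example, the technique described in Non-Patent Document 1 judges that no classification beyond that shown in FIGS. 3 and 4 is possible, and generates the classification tree shown in FIG. 5. With the classification tree shown in FIG. 5, the data of user1 to user6 is classified correctly. Specifically, user1 to user4 are classified as “A”, that is, “non-metabolic”, and user5 and user6 are classified as “B”, that is, “metabolic”. However, the data of user7 to user15 has no “blood pressure (Y)” value and therefore cannot be classified.
FIG. 6 is a diagram showing the classification results for user1 to user15 using the classification tree generated by the technique of Non-Patent Document 1. “Correct” in FIG. 6 indicates whether “A” or “B” is classified correctly when classification is performed with the classification tree shown in FIG. 5. “○” indicates correct classification. “Unknown” indicates either that the data cannot be classified because it lacks a value needed for classification, or that it falls into a group in which “A” and “B” are mixed, so that it is unclear whether it corresponds to “A” or “B”.
Viewed across all the data of user1 to user15, the classification result shown in FIG. 6 can hardly be called accurate. Consider now the data of user7 to user15, which was not used to generate the classification tree. This data shows that most users with “waist circumference (X)” in the 80s are “A”, and that most users with “waist circumference (X)” in the 90s are “B”. Although the data of user7 to user15 is significant for classification tree generation, the technique of Non-Patent Document 1 does not use it.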
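To make use of the data of user7 to user15 in classification tree generation, a technique called “resampling” could be used. “Resampling” is a method of determining values on the basis of an existing sample. In the above example, the values of user7 to user15 are determined on the basis of the distribution of the values of user1 to user6, for whom “blood pressure (Y)” values are actually held.
FIG. 7 is a diagram showing the data held by operator T with the values of user7 to user15 determined by resampling. Operator T holds the “blood pressure (Y)” values of user1 to user6 in the ratio “110s : 120s : 130s = 1 : 1 : 1”. The device of operator T therefore resamples the “blood pressure (Y)” values of user7 to user15 in the ratio “110s : 120s : 130s = 1 : 1 : 1” and sets dummy values, as shown in FIG. 7. In this example, 115 is assigned for the 110s, 125 for the 120s, and 135 for the 130s.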
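The resampling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `bucket_representative` and `resample_dummy_values`, the proportional allocation rule, and the six blood pressure values for user1 to user6 are hypothetical and are not taken from the figures. Only the bucket representatives 115, 125, and 135 come from the example in the text.

```python
import random
from collections import Counter

def bucket_representative(value):
    """Map a value to the midpoint of its 10-wide bucket (e.g. 112 -> 115)."""
    return (value // 10) * 10 + 5

def resample_dummy_values(real_values, missing_ids, rng):
    """Give each missing user a dummy value so that the buckets appear in
    (approximately) the same ratio as in the real values."""
    counts = Counter(bucket_representative(v) for v in real_values)
    total = sum(counts.values())
    n = len(missing_ids)
    # Proportional share per bucket, rounded down; any leftover slots go to
    # the buckets with the largest fractional remainder.
    shares = {b: n * c / total for b, c in counts.items()}
    alloc = {b: int(s) for b, s in shares.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda b: shares[b] - alloc[b], reverse=True)
    for b in by_remainder[:leftover]:
        alloc[b] += 1
    pool = [b for b, k in alloc.items() for _ in range(k)]
    rng.shuffle(pool)  # which user receives which bucket is random
    return dict(zip(missing_ids, pool))

# Hypothetical blood pressure values for user1..user6 (two per bucket).
real = [112, 118, 121, 127, 133, 138]
missing = [f"user{i}" for i in range(7, 16)]
dummy = resample_dummy_values(real, missing, random.Random(0))
```

Because the assignment of buckets to individual users is random, a dummy value may land far from the value the user would really have, which is the weakness the text goes on to discuss.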
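For the data of user1 to user15, who have become users common to operator S and operator T through the setting of dummy values, the technique described in Non-Patent Document 1 calculates entropy using MPC and determines “X = 90”, the point at which the entropy after division is smallest, as the division point.
FIG. 8 is a diagram showing the data of user1 to user15 divided into two at “X = 90”. FIG. 9 is a diagram showing the distribution of the data of user1 to user15, including data with dummy values set, divided at “X = 90”. As shown in FIGS. 8 and 9, the data of user1 to user15 is divided, with “X = 90” as the division point, into a group of “user1 to 3, 7 to 9, 11, 12, 14” and a group of “user4 to 6, 10, 13, 15”.
FIG. 10 is a diagram showing the data of FIG. 8 further divided at “Y = 120”. FIG. 11 is a diagram showing the distribution of the data, including data with dummy values set, divided at “Y = 120”. As shown in FIGS. 10 and 11, the technique described in Non-Patent Document 1 judges, by entropy calculation using MPC, that in this example dividing at “Y = 120” yields the least mixing of data and is appropriate as a division point.
FIG. 12 is a diagram showing an example of the classification tree finally generated from the data of user1 to user15, including data with dummy values set. When resampling is performed, the technique described in Non-Patent Document 1 generates the classification tree shown in FIG. 12.
FIG. 13 is a diagram showing the classification results for user1 to user15 using the classification tree generated from data that includes data with dummy values set. As shown in FIG. 13, compared with the result of FIG. 6, resampling makes it possible to generate a classification tree that correctly classifies the data of user7 to 9, 11, 12, and 14 among the data of user7 to user15, whose classification result was “unknown”. On the other hand, because the resampled values are set randomly, even though they follow the distribution of the sample, data that could originally be classified correctly, such as that of user4 to 6, may become “unknown”.
As described above, data division that uses only the data of common users cannot generate an accurate classification tree. Even if dummy values are set randomly by resampling, an accurate classification tree is not necessarily generated. In other words, simply applying resampling to the technique of Non-Patent Document 1 does not divide the data accurately.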
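The first embodiment of the present invention described below solves the problems described so far.
＜First Embodiment＞
First, the configuration of the data dividing device according to the first embodiment of the present invention is described with reference to FIGS. 14 and 15.
FIG. 14 is a block diagram showing the configuration of the first data dividing device 100 according to the first embodiment. As shown in FIG. 14, the first data dividing device 100 includes a first transmission/reception unit 110, a first storage unit 120, a first setting unit 130, a first dividing unit 140, and a first correction unit 150.
FIG. 15 is a block diagram showing the configuration of the second data dividing device 200 according to the first embodiment. As shown in FIG. 15, the configuration of the second data dividing device 200 may be the same as that of the first data dividing device 100. In this embodiment, the first data dividing device 100 and the second data dividing device 200 constitute a data dividing system. The following description focuses on the configuration of the first data dividing device 100.
Although this embodiment is described with two data dividing devices, the number is not limited to two, and the system may include two or more devices.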
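The first transmission/reception unit 110 communicates with external devices to acquire information on the user identifiers held by the own device (first data dividing device 100) and by the other device (second data dividing device 200). Because privacy protection is a premise, personal information such as “waist circumference” and “blood pressure” is not disclosed between the devices; instead, for example, userID information is exchanged.
The first transmission/reception unit 110 may also receive a notification from, for example, an identifier management device held by the identifier management operator, and acquire information on the predetermined identifiers that form the population. Among the acquired predetermined identifiers, the first transmission/reception unit 110 may determine that identifiers not held by the first data dividing device 100 (own device) are non-common identifiers held by the second data dividing device 200 (other device).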
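The identifier check described above amounts to a set difference. The sketch below assumes identifiers are plain strings; the function name `split_identifiers` and the sample identifier lists are hypothetical, chosen to mirror operator T in the running example (user1 to user6 held out of user1 to user15).

```python
def split_identifiers(population_ids, own_ids):
    """Split the notified population into identifiers the own device holds
    (common) and identifiers it does not hold (non-common)."""
    own = set(own_ids)
    common = [u for u in population_ids if u in own]
    non_common = [u for u in population_ids if u not in own]
    return common, non_common

# Hypothetical population notified by the identifier management device.
population = [f"user{i}" for i in range(1, 16)]
held_by_own_device = [f"user{i}" for i in range(1, 7)]
common, non_common = split_identifiers(population, held_by_own_device)
```

Only identifiers are compared here; no personal information values are exchanged, which matches the privacy premise stated above.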
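Alternatively, the first transmission/reception unit 110 may acquire the userID information held by the second data dividing device 200 (other device) by exchanging userID information directly with the second transmission/reception unit 210.
Alternatively, the first data dividing device 100 may store in advance, in the first storage unit 120, the userID information held by the second data dividing device 200. In this case, the first transmission/reception unit 110 need not perform the process of acquiring the userID information of the second data dividing device 200.
The identifier used in this embodiment may be, for example, a national ID. Alternatively, the identifier may be the OpenID described in Non-Patent Document 2, and is not limited to these.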
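The first storage unit 120 stores user data in which a plurality of user identifiers are associated with values of first personal information. Here, a “user identifier” means the identifier of a user stored by each data dividing device. For example, “user identifiers stored by the first storage unit 120” means the identifiers of users stored in the first storage unit 120 of the first data dividing device 100, and does not include identifiers of users that are stored in the second storage unit 220 of the second data dividing device 200 but not in the first storage unit 120.
“First personal information” refers to one item of the personal information stored by the first data dividing device 100 (own device). For example, when the first storage unit 120 stores the user data shown in FIG. 1, the “first personal information” may be “waist circumference”.
The first setting unit 130 acquires, from the first storage unit 120, the user data stored by the first data dividing device 100 (own device), that is, the user identifiers and the personal information values associated with them. The first setting unit 130 also acquires the user identifiers held by the second data dividing device 200 (other device) from the first transmission/reception unit 110 or the first storage unit 120.
For user data that is stored by the second data dividing device (other device) and whose user identifiers are not stored in the first storage unit 120, the first setting unit 130 sets a dummy value as the value of the first personal information, thereby creating dummy data. The dummy value is set, for example, by resampling according to the distribution of the first personal information values associated with the user identifiers. The method by which the first setting unit 130 sets dummy values is not limited to resampling, and may be any other suitable method.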
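One way to picture the setting unit's output is a table that marks which values are real and which are dummies. The sketch below is an assumption-laden illustration: the name `add_dummy_data`, the `(value, is_dummy)` record layout, and the sample values are hypothetical, and drawing each dummy uniformly from the real values is just one simple form of resampling.

```python
import random

def add_dummy_data(own_data, other_ids, rng):
    """own_data: {user_id: value} actually stored by the own device.
    other_ids: user identifiers acquired from the other device.
    Returns {user_id: (value, is_dummy)}; identifiers known only to the
    other device receive a dummy value drawn from the own real values."""
    real_values = list(own_data.values())
    table = {u: (v, False) for u, v in own_data.items()}
    for u in other_ids:
        if u not in own_data:
            table[u] = (rng.choice(real_values), True)
    return table

# Hypothetical waist circumference values held by the own device.
own = {"user1": 82, "user2": 95}
table = add_dummy_data(own, ["user1", "user3"], random.Random(0))
```

Keeping the `is_dummy` flag alongside each value makes it possible to revise only the dummy values after each division.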
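The first setting unit 130 outputs predetermined user data including the dummy data to the first dividing unit 140.
The first dividing unit 140 divides the predetermined user data including the dummy data, output from the first setting unit 130, into groups at a division point. A division point is a threshold for dividing the user data, and is determined on the basis of the values of the first personal information, including dummy values, or the values of the second personal information.
In addition, the first dividing unit 140 may communicate with the second dividing unit 240 via the first transmission/reception unit 110 and determine which item of the personal information held by the first data dividing device 100 and the second data dividing device 200 is most suitable as the axis of division. In this case, the first dividing unit 140 may communicate with the second dividing unit 240 and determine the most suitable division point among the values of that personal information.
When dummy values have also been set in the second data dividing device 200 (other device), the values of the second personal information used to determine the division point include the dummy values (second dummy values) set by the second data dividing device (other device).
The division method is not particularly limited. The first dividing unit 140 may divide the user data into two groups using the mean of the values of predetermined personal information as the division point. In this case, the first dividing unit 140 may transmit the contents of the groups after division to the second dividing unit 240 via the first transmission/reception unit 110. The first dividing unit 140 and the second dividing unit 240 may repeat division in turn, each using the mean of its own personal information values as the division point. Alternatively, the first dividing unit 140 may determine the division point using a well-known heuristic function.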
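The mean-based division mentioned above could look like the following sketch. The function name `split_by_mean` and the sample values are hypothetical; the convention that values equal to the mean go to the upper group is an assumption chosen to match the “at or above / below” grouping used elsewhere in the text.

```python
def split_by_mean(values):
    """values: {user_id: value}. Use the mean as the division point and
    return (point, ids_below, ids_at_or_above)."""
    point = sum(values.values()) / len(values)
    below = [u for u, v in values.items() if v < point]
    at_or_above = [u for u, v in values.items() if v >= point]
    return point, below, at_or_above

# Hypothetical waist circumference values.
waist = {"user1": 80, "user2": 85, "user3": 95, "user4": 100}
point, below, at_or_above = split_by_mean(waist)
```

The two identifier lists are the kind of group contents that one device could send to the other without revealing the underlying values.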
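The first dividing unit 140 may also determine the division point by taking into account the entropy that results from dividing the user data. By considering entropy, the first dividing unit 140 may determine the division point so that the user data belonging to each group after division is less mixed.
For example, the entropy of a group after division may be calculated by the following formula.
Entropy = Σ { −1 × P(Class) × log(P(Class)) }
Here, when “Class” is classified as “A” or “B”, P(Class) is given as follows.
P(A) = (number of “A” in the group after division) / (total number of “A” and “B” in the group after division)
P(B) = (number of “B” in the group after division) / (total number of “A” and “B” in the group after division)
In this case, the first dividing unit 140 calculates the entropy of a group after division as follows.
Entropy = { −1 × P(A) × log(P(A)) } + { −1 × P(B) × log(P(B)) }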
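The entropy formula above translates directly into a few lines of code. The text does not state the base of the logarithm, so base 2 is assumed here; the function name `entropy` is also an assumption. Classes that do not occur in the group are skipped, following the usual convention that a term with P = 0 contributes nothing.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a group, given the Class label ("A" or "B") of each member.
    Assumes log base 2; absent classes contribute zero."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

mixed = entropy(["A", "A", "B", "B"])  # evenly mixed group
pure = entropy(["A", "A", "A"])        # group with a single class
```

An evenly mixed group gives the largest value and a single-class group gives zero, which is why a lower entropy corresponds to less mixing.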
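For example, the first dividing unit 140 calculates the above entropy for the two groups that result from division at a suitable candidate division point (the group at or above the division point and the group below it). Candidate division points may be determined by a predetermined rule (algorithm), and a well-known method may be used.
The first dividing unit 140 may divide the data into two groups at each candidate division point and, letting S be the sum of the entropies of the two groups, determine the point at which S is smallest as the division point. Although it is preferable to choose the point at which S is smallest, the division point is not limited to this and may be a point at which S is close to its smallest value.
A small value of S means that there is little mixing of data (mixing of “A” and “B”) within the two groups.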
または、第一の分割部１４０は、所定の分割候補点のうちで、分割点以上と未満の２つのグループのいずれかが最小の値を取るグループを含むように分割する分割候補点を、分割点として決定しても良い。なお、前述のように、分割候補点のうちで、分割点以上と未満の２つのグループのいずれかが最小の値をとるグループを含むように分割することが好ましいが、これに限られず最小の値に近似した値であってもよい。エントロピーを用いた分割点の決定方法は上述の方法には限定されず、他の方法でも良い。
なお上述したように、第一のデータ分割装置１００と第二のデータ分割装置２００とは、互いの個人情報の値がわからない。具体的には第一のデータ分割装置１００は、第二のデータ分割装置２００が保持する第二の個人情報の真の値がわからない。
そこで、第一の分割部１４０は、ＭＰＣ（ＭｕｌｔｉＰａｒｔｙＣｏｍｐｕｔａｔｉｏｎ）又はＳＭＰＣ（ＳｅｃｕｒｅＭｕｌｔｉＰａｒｔｙＣｏｍｐｕｔａｔｉｏｎ）を用いて、第二の個人情報の値も考慮して、分割点を計算しても良い。第一の分割部１４０は、ＭＰＣ等を用いることで、第一のデータ分割装置１００及び第二のデータ分割装置２００の互いの個人情報の値を一切出さずに分割点を計算することができる。
以降は説明の便宜のため、第一の分割部１４０は、ＭＰＣを用いてエントロピーを計算することで分割点を決定し、データを分割するものとする。
第一の分割部１４０は、分割した各グループにおける識別子の内容を示す分割情報を、第一の送受信部１１０を介して第二のデータ分割装置２００に送信する。分割情報は、例えば分割点で分割したｕｓｅｒＩＤのリストでも良い。
また、第一の分割部１４０は、第二の送受信部２１０から送信される分割情報を、第一の送受信部１１０を介して受信する。第一の分割部１４０は、受信した分割情報に基づいてデータを分割する。
第一の分割部１４０は、分割後のデータを第一の修正部１５０に出力する。
第一の修正部１５０は、第一の分割部１４０による分割の度に、分割後のグループに属するユーザデータのうちダミーデータ以外のデータの第一の個人情報の値に基づいて、ダミー値を修正する。例えば、第一の修正部１５０は、分割後のグループにおいて、ユーザ識別子に対応するデータの第一の個人情報の値の分布に従って、そのグループ内でダミー値の値を修正しても良い。
第一の修正部１５０は、ダミー値の修正後のデータを、再び第一の分割部１４０に出力する。第一の分割部１４０は、ダミー値が修正されたグループ内のユーザデータを、さらに２つのグループに分割することが可能か否かを判定する。第一の分割部１４０は、グループ内において周知の手法により分割候補点が存在するか否かを判定することで、さらに分割可能か否かを判定しても良い。
さらに分割可能であると判定した場合、第一の分割部１４０は、第一の修正部１５０から出力されたデータを、さらにグループに分割する。
分割不可能と判定した場合、第一のデータ分割装置１００は、処理を終了する。この場合、第一のデータ分割装置１００は、分割後のユーザデータ、生成した分類木等を出力装置、外部の他のシステム等に出力しても良い。
次に図１６を参照して、本発明の第１実施形態に係る第一のデータ分割装置１００の動作について説明する。
図１６は、本発明の第１実施形態に係る第一のデータ分割装置１００の動作を示すフローチャート図である。図１６に示すように、例えば第一の送受信部１１０は、第二のデータ分割装置２００が保持しているデータのｕｓｅｒＩＤの情報（ユーザ識別子）を取得する（ステップＳ１１）。
第一の送受信部１１０が第二のデータ分割装置２００（他装置）のｕｓｅｒＩＤの情報を取得すると、第一の設定部１３０は、取得したｕｓｅｒＩＤの情報を含んだユーザデータのうち第一の記憶部１２０が実際に記憶しているｕｓｅｒＩＤ（ユーザ識別子）に対応するデータ以外のデータに対し、ダミーデータとして第一の個人情報の値にダミー値を設定する（ステップＳ１２）。この時、第二のデータ分割装置２００においても同様に、第二の送受信部２１０が第一のデータ分割装置１００が保持するｕｓｅｒＩＤの情報を取得し、第二の設定部２３０が第二の個人情報の値にダミー値を設定する。当然のことながら、第一の記憶部１２０と第二の記憶部２２０とが記憶しているデータは異なるので、互いのダミーデータは異なる。
次に、第一の分割部１４０は、ダミー値が設定されたデータを含めた所定のユーザデータの分割にあたり、分割候補点が存在するか否かを判定する。分割候補点が存在すると判定すると、第一の分割部１４０は、所定の分割候補点について、分割した場合の２つのグループのエントロピーの合計を計算する。第一の分割部１４０は、エントロピーの合計が最も低い点を分割点として決定し、その点で所定のユーザデータを２つのグループに分割する（ステップＳ１３）。
次に、第一の修正部１５０は、分割後のグループ内のユーザデータに対しダミー値を修正する（ステップＳ１４）。第一の修正部１５０は、ユーザ識別子に対応するデータの第一の個人情報の値の分布に従って、分割後のグループ内でダミー値を修正しても良い。
次に、第一の修正部１５０は、ダミー値を修正した分割後のグループ内のユーザデータを、再び第一の分割部１４０に出力する。第一の分割部１４０は、分割後にダミー値が修正されたユーザデータを、さらに分割可能か否かを判定する（ステップＳ１５）。第一の分割部１４０は、分割候補点が存在するか否かを判定することで、さらに分割可能か否かを判定しても良い。
分割候補点が存在すると判定した場合は、処理は、ステップＳ１３に進む。ステップＳ１３において、第一の分割部１４０は、分割後にダミー値を修正したユーザデータのグループを、さらに分割する。分割候補点が存在しないと判定した場合は、処理は終了する。
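The loop of steps S13 to S15 in FIG. 16 can be sketched as a recursion. This is a minimal single-process sketch under stated assumptions, not the patented implementation: `find_split` and `correct_dummies` are hypothetical callbacks standing in for the dividing unit and the correction unit, and the MPC exchange between the two devices is omitted.

```python
def divide(group, find_split, correct_dummies):
    """Recursively split a group, correcting dummy values after every split.

    find_split(group)      -> None when no division candidate exists (S15),
                              otherwise a (left, right) pair (S13).
    correct_dummies(group) -> the group with its dummy values corrected (S14).
    Both callbacks are hypothetical placeholders.
    """
    split = find_split(group)
    if split is None:
        return [group]                      # no candidate point: stop here
    result = []
    for part in split:
        part = correct_dummies(part)        # S14: correct dummies per group
        result.extend(divide(part, find_split, correct_dummies))
    return result


def halve(group):
    """Toy splitter used only to exercise the recursion."""
    if len(group) < 2:
        return None
    mid = len(group) // 2
    return group[:mid], group[mid:]


leaves = divide([1, 2, 3, 4], halve, lambda g: g)
```

In a real deployment the two callbacks would wrap the entropy-based split and the per-group correction described in the text; the toy `halve` function only shows that the recursion terminates when no candidate remains.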
次に、図１７〜図２２を参照して、図１６の各ステップを、具体的に例を用いて説明する。前提として、第一のデータ分割装置１００は事業者Ｓの装置であるものとする。また、第二のデータ分割装置２００は事業者Ｔの装置であるものとする。
また、以降の例は、上述した例と同様の状況を前提とする。具体的には、事業者Ｓ（第一のデータ分割装置１００）は、「腹囲」と「Ｃｌａｓｓ」に関する個人情報（図１に示すデータ）を保持しているとする。事業者Ｔ（第二のデータ分割装置２００）は、「血圧」に関する個人情報（図２に示すデータ）を保持しているとする。各事業者が保持する個人情報は、識別子管理事業者が管理する共通の識別子で対応している。また、以降の例では「腹囲」を第一の個人情報とし、「血圧」を第二の個人情報とする。
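In step S11 of FIG. 16, the first data dividing device 100 and the second data dividing device 200 exchange the userID information of the data that each of them holds. Specifically, the first transmission/reception unit 110 transmits the identifiers of user1 to user15 to the second transmission/reception unit 210 and receives the identifiers of user1 to user6 from the second transmission/reception unit 210.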
図１６のステップＳ１２において、事業者Ｓにおける第一の送受信部１１０がｕｓｅｒ１〜ｕｓｅｒ６の識別子を受信すると、第一の設定部１３０は、図１に示すデータと照合する。照合した結果、第一の設定部１３０は、第二のデータ分割装置２００が記憶しているｕｓｅｒ１〜ｕｓｅｒ６の識別子は、第一のデータ分割装置１００が記憶しているｕｓｅｒ１〜ｕｓｅｒ１５の識別子に含まれていると判定する。従って第一の設定部１３０は、ダミーデータの設定を行わない。
一方、事業者Ｔにおける第二の設定部２３０は、受信したｕｓｅｒ１〜ｕｓｅｒ１５の識別子と、図２に示す情報とを照らし合わせた結果、第一のデータ分割装置１００は、第二のデータ分割装置２００が記憶していないｕｓｅｒ７〜ｕｓｅｒ１５のデータを記憶していると判定する。従って、第二の設定部２３０は、図７に示すようにｕｓｅｒ７〜ｕｓｅｒ１５のデータに対し、リサンプリングの手法により第二の個人情報にダミー値を設定する。
図１６のステップＳ１３において、第一のデータ分割装置１００の第一の分割部１４０及び第二のデータ分割装置２００の第二の分割部２４０は、互いに通信を行いながら、個人情報の分割候補点が存在するか否かを判定する。第一の分割部１４０及び第二の分割部２４０は、第一の個人情報に関して「Ｘ＝９０」が、第二の個人情報に関して「Ｙ＝１２０」及び「Ｙ＝１３０」が、分割候補点であると判定する。
第一の分割部１４０及び第二の分割部２４０は、互いに通信を行いながら、ＭＰＣを用いて個人情報の値を開示することなく、各分割候補点で分割した場合の２つのグループのエントロピーの合計を計算する。第一の分割部１４０及び第二の分割部２４０は、「Ｘ＝９０」を、エントロピーの合計が最小となる点であると判定し、分割点に決定する。
分割点が決定すると、腹囲（Ｘ）の個人情報（第一の個人情報）を保持している第一のデータ分割装置１００において、第一の分割部１４０が所定のユーザデータを｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝及び｛ｕｓｅｒ４〜６，１０，１３，１５｝の２つのグループに分割する。
第一の分割部１４０は、データの分割情報（｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝、｛ｕｓｅｒ４〜６，１０，１３，１５｝という２つのグループにユーザデータを分割することを示す情報）を事業者Ｔの第二の送受信部２１０に送信する。分割情報を受信すると、第二の分割部２４０も所定のユーザデータを分割する。図８及び図９は分割後のユーザデータの状態を示す図である。
図１６のステップＳ１４において、第一の修正部１５０及び第二の修正部２５０は、分割後のグループ内でダミー値の修正を行う。本例においては、事業者Ｓ（第一のデータ分割装置１００）はダミーデータを保持しないためダミー値の修正は行わず、事業者Ｔ（第二のデータ分割装置２００）の第二の修正部２５０が、ダミー値の修正を行う。
図１７は、図７のユーザデータに対し、ダミー値を修正したデータを表す図である。第二のデータ分割装置２００の第二の修正部２５０は、グループ毎にダミーデータ（ｕｓｅｒ７〜ｕｓｅｒ１５のデータ）以外の識別子（ユーザ識別子）のデータ（ｕｓｅｒ１〜ｕｓｅｒ６のデータ）の第二の個人情報の値の分布に従って、ダミー値を修正する。
まず、｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝のグループについて考える。当該グループに属する第二のデータ分割装置２００におけるユーザ識別子のデータはｕｓｅｒ１〜ｕｓｅｒ３のデータである。ｕｓｅｒ１〜ｕｓｅｒ３のデータは、第二の個人情報（血圧）に関し「１１０台：１２０台＝２：１」に分布している。そのため、第二の修正部２５０は、｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝のグループに属するダミーデータ（ｕｓｅｒ７〜９，１１，１２，１４のデータ）のダミー値を「１１０台：１２０台＝２：１」になるように修正する。
図１７に示すように、第二の修正部２５０は、例えばｕｓｅｒ９を「１３５」から「１１５」に、ｕｓｅｒ１２を「１３５」から「１１５」に、ｕｓｅｒ１４を「１２５」から「１１５」に修正する。
次に、｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループについて考える。当該グループに属する第二のデータ分割装置２００におけるユーザ識別子のデータはｕｓｅｒ４〜ｕｓｅｒ６のデータである。ｕｓｅｒ４〜ｕｓｅｒ６のデータは、第二の個人情報（血圧）に関し「１２０台：１３０台＝１：２」に分布している。そのため、第二の修正部２５０は、｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループに属するダミーデータ（ｕｓｅｒ１０，１３，１５のデータ）のダミー値を「１２０台：１３０台＝１：２」になるように修正する。
図１７に示すように、第二の修正部２５０は、例えばｕｓｅｒ１０を「１１５」から「１２５」に、ｕｓｅｒ１３を「１１５」から「１３５」に修正する。
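The per-group correction of dummy values can be sketched as follows. The patent does not fix a specific rule, so this is only one possible procedure, chosen because it reproduces the example above: dummy values that already fit the target ratio are kept, and the rest are reassigned. The function names are hypothetical, and the assignment of individual real values to user1 to user6 is an assumption (only the ratios 2:1 and 1:2 are given in the text).

```python
from collections import Counter


def target_counts(real_values, n_dummies):
    """Share n_dummies among the observed values in proportion to their
    frequency (largest-remainder rounding)."""
    freq = Counter(real_values)
    total = len(real_values)
    exact = {v: n_dummies * c / total for v, c in freq.items()}
    quota = {v: int(x) for v, x in exact.items()}
    leftover = n_dummies - sum(quota.values())
    by_remainder = sorted(exact, key=lambda v: exact[v] - quota[v], reverse=True)
    for v in by_remainder[:leftover]:
        quota[v] += 1
    return quota


def correct_dummies(real_values, dummies):
    """dummies maps user -> current dummy value; returns the corrected mapping."""
    quota = target_counts(real_values, len(dummies))
    corrected, deferred = {}, []
    for uid, value in dummies.items():
        if quota.get(value, 0) > 0:         # value still fits the target ratio
            corrected[uid] = value
            quota[value] -= 1
        else:
            deferred.append(uid)
    remaining = [v for v in sorted(quota) for _ in range(quota[v])]
    for uid, value in zip(deferred, remaining):
        corrected[uid] = value
    return corrected


# Group {user1-3, 7-9, 11, 12, 14}: real values assumed to be 115, 115, 125.
group1 = correct_dummies(
    [115, 115, 125],
    {"user7": 115, "user8": 125, "user9": 135,
     "user11": 125, "user12": 135, "user14": 125},
)
# Group {user4-6, 10, 13, 15}: real values assumed to be 125, 135, 135.
group2 = correct_dummies(
    [125, 135, 135],
    {"user10": 115, "user13": 115, "user15": 135},
)
```

With these inputs the sketch changes user9, user12 and user14 to 115 in the first group, and user10 to 125 and user13 to 135 in the second, matching the corrections described for FIG. 17.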
図１８は、ダミー値を修正した場合の分割の状態を示す図である。図１９は、ダミー値を修正した場合の、ユーザデータの分布と分割の様子を示す図である。図１９において、丸で囲ったデータが、ダミー値が修正されたデータである。
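After the dummy values are corrected, in step S15 the first dividing unit 140 of the first data dividing device 100 and the second dividing unit 240 of the second data dividing device 200 determine whether a division candidate point exists in each of the two groups. They determine that the group {user1 to 3, 7 to 9, 11, 12, 14} has no division candidate point. For the group {user4 to 6, 10, 13, 15}, they determine that “Y = 120” and “Y = 130” are division candidate points, and the process returns to step S13 of FIG. 16. The first dividing unit 140 and the second dividing unit 240 then calculate the total entropy after division using MPC and, as a result, determine “Y = 130” as the division point.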
図２０は、図１８に示すデータにおける｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループが、さらに「Ｙ＝１３０」で分割された状態を示す図である。図２１は、ダミー値が修正されたユーザデータの分布と、｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループの「Ｙ＝１３０」で分割した様子を示す図である。図２０において、｛ｕｓｅｒ５，６，１３，１５｝のグループは、「Ａ」と「Ｂ」とが混じっているが、「Ｂ」の数が明かに多いため、当該グループは「Ｂ」のグループであると判定している。
図２２は、ダミー値を修正した場合のｕｓｅｒ１〜ｕｓｅｒ１５のデータから最終的に生成される分類木の例を示す図である。本発明によれば、図２２に示される分類木が生成される。分類木は、図１４及び図２５には図示しない生成部によって、データの分割の過程を情報としてまとめることで生成されても良い。
図２３は、ダミー値を修正したユーザデータにより生成された分類木を用いた、ｕｓｅｒ１〜ｕｓｅｒ１５の分類結果を示す図である。図２３に示すように、図１３の結果と比較して、ダミー値の修正を行うことで、分類結果が「不明」であったｕｓｅｒ４〜６のデータのうち、ｕｓｅｒ５，６のデータを正しく分類することが可能な分類木を生成することができる。当該分類木は、図５、図１２に示す分類木と比較して、最も精度の良い分類木である。
以上説明したように、第１実施形態に係るデータ分割装置１００によれば、複数の装置によるデータの分散処理において、一方の装置が保持する非共通データを有効に活用することで、精度の良いデータの分割が可能となる。第１実施形態に係るデータ分割装置１００においては、非共通データにダミー値を設定して活用し、分割の度にダミー値を修正することで、非共通データを有効に活用する。
＜第２実施形態＞
次に本発明の第２実施形態に係る第一のデータ分割装置３００の機能構成を説明する。
図２４は、第２実施形態に係る第一のデータ分割装置３００の構成を示すブロック図である。図２４に示すように、第一のデータ分割装置３００は、第１実施形態における第一のデータ分割装置１００と比較して、第一の調整部３１０を含む点で異なる。第一の調整部３１０以外の構成部については第１実施形態と同様の構成であるため、同様の番号を付し、説明を省略する。
第一の調整部３１０は、第一の修正部１５０によるダミー値の修正の後に、当該修正前のダミー値と、当該修正後のダミー値との値の変化量に基づいてダミー値を調整する。データの分布に相関関係などの特徴がある場合、第一の調整部３１０がダミー値の変化量に基づいてダミー値を調整することで、第一のデータ分割装置３００は、その特徴を反映してより精度良くデータを分割する。
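FIG. 25 is a flowchart showing the operation of the first data dividing device 300 according to the second embodiment of the present invention. As shown in FIG. 25, the operation of the first data dividing device 300 differs from that of the first data dividing device 100 shown in FIG. 16 in that it includes step S16, in which the first adjustment unit 310 adjusts the dummy values based on the amount of change before and after their correction.
After step S16, the first dividing unit 140 determines in step S15 whether the data whose dummy values have been corrected and adjusted after division can be divided further.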
図２６及び図２７は第一の調整部３１０の機能を説明するための図である。
図２６は、修正されたデータを表す図１７のデータに対し、ダミー値を調整したデータを表す図である。第一の調整部３１０は、第一の修正部１５０によるダミー値の修正の後に、グループ毎に当該修正前のダミーデータ（ｕｓｅｒ７〜ｕｓｅｒ１５のデータ）のダミー値と、当該修正後のダミーデータ（ｕｓｅｒ７〜ｕｓｅｒ１５のデータ）のダミー値との値の変化量に基づいてダミー値を調整する。修正前後のダミー値の変化量に基づく調整の方法はどのような方法でも良いが、以下ではダミー値の重心の値（ダミー値の平均値）の変化に基づいて、調整する方法を例に説明する。
まず、｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝のグループについて考える。当該グループに属するダミーデータはｕｓｅｒ７〜９，１１，１２，１４のデータである。
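In the user data shown in FIG. 7, before correction by the first correction unit 150, the dummy values of user7 to 9, 11, 12, and 14 are “115”, “125”, “135”, “125”, “135”, and “125”, respectively. The first adjustment unit 310 therefore calculates the centroid (mean) of the dummy values in this group before correction as (115 + 125 + 135 + 125 + 135 + 125) ÷ 6 = 126.666.
In the data shown in FIG. 17, after correction by the first correction unit 150, the dummy values of user7 to 9, 11, 12, and 14 are “115”, “125”, “115”, “125”, “115”, and “115”, respectively. The first adjustment unit 310 therefore calculates the centroid (mean) of the dummy values in this group after correction as (115 + 125 + 115 + 125 + 115 + 115) ÷ 6 = 118.333.
Because the correction moved the centroid from 126.666 to 118.333, the change in the centroid of the dummy values is approximately −8. The first adjustment unit 310 therefore subtracts 10 from predetermined dummy values among the dummy data (the data of user7 to 9, 11, 12, and 14) belonging to the group {user1 to 3, 7 to 9, 11, 12, 14}. Because the dummy values take values in steps of 10 starting from 115, the change of about −8 is treated as −10. In this embodiment, the dummy values are confined to the range 115 to 135 and take no value below 115 or above 135.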
図２６に示すように、第一の調整部３１０は、例えばｕｓｅｒ８及びｕｓｅｒ１１のダミー値を「１２５」から「１１５」に修正する。
同様に、｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループについて考える。当該グループに属するダミーデータはｕｓｅｒ１０，１３，１５のデータである。
図７に示す第一の修正部１５０による修正前のデータにおいて、ｕｓｅｒ１０，１３，１５のダミー値はそれぞれ、「１１５」、「１１５」及び「１３５」である。従って、第一の調整部３１０は、当該グループにおける修正前のダミー値の重心の値（平均値）を、（１１５＋１１５＋１３５）÷３により「１２１．６６６」であると算出する。
また、図１７に示す第一の修正部１５０による修正後のデータにおいて、ｕｓｅｒ１０，１３，１５のダミー値はそれぞれ、「１２５」、「１３５」及び「１３５」である。従って、第一の調整部３１０は、当該グループにおける修正後のダミー値の重心の値（平均値）を、（１２５＋１３５＋１３５）÷３により「１３１．６６６」であると算出する。
ダミー値の修正によって、重心の値が「１２１．６６６」から「１３１．６６６」に変化したため、ダミー値の重心の値の変化量は「＋１０」である。そのため、第一の調整部３１０は、｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループに属するダミーデータ（ｕｓｅｒ１０，１３，１５のデータ）所定のダミー値に「１０」を加算する。
図２６に示すように、第一の調整部３１０は、例えばｕｓｅｒ１０のダミー値を「１２５」から「１３５」に修正する。
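The centroid-based adjustment can be checked with a short sketch. It assumes, as one reading of the text, that the shift is rounded to the step of 10 and applied to every dummy value in the group, with the result clamped to the range 115 to 135; the function name `adjust_dummies` and its parameters are hypothetical.

```python
def adjust_dummies(before, after, step=10, low=115, high=135):
    """Shift corrected dummy values by the (rounded) change of their mean.

    before / after map user -> dummy value before and after correction.
    The shift is rounded to a multiple of `step`, and every adjusted value
    is clamped to [low, high], as stated for this embodiment.
    """
    mean_before = sum(before.values()) / len(before)
    mean_after = sum(after.values()) / len(after)
    shift = step * round((mean_after - mean_before) / step)
    return {uid: min(high, max(low, v + shift)) for uid, v in after.items()}


# Group {user1-3, 7-9, 11, 12, 14}: dummy values from FIG. 7 and FIG. 17.
g1_before = {"user7": 115, "user8": 125, "user9": 135,
             "user11": 125, "user12": 135, "user14": 125}
g1_after = {"user7": 115, "user8": 125, "user9": 115,
            "user11": 125, "user12": 115, "user14": 115}
# Group {user4-6, 10, 13, 15}.
g2_before = {"user10": 115, "user13": 115, "user15": 135}
g2_after = {"user10": 125, "user13": 135, "user15": 135}

g1_adjusted = adjust_dummies(g1_before, g1_after)
g2_adjusted = adjust_dummies(g2_before, g2_after)
```

Under these assumptions the first group's shift rounds to −10, so user8 and user11 move from 125 to 115, and the second group's shift is +10, so user10 moves from 125 to 135; values already at a range boundary stay where they are.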
図２７は、ダミー値を調整した場合の分割の状態を示す図である。図２８は、ダミー値を調整した場合の、ユーザデータの分布と分割の様子を示す図である。図２８において、丸で囲ったデータが、ダミー値が調整されたデータである。
図２９は、ダミー値を調整した場合のｕｓｅｒ１〜ｕｓｅｒ１５のデータから最終的に生成される分類木の例を示す図である。第２実施形態に係る本発明によれば、図２９に示される分類木が生成される。
図３０は、ダミー値を調整したデータにより生成された分類木を用いた、ｕｓｅｒ１〜ｕｓｅｒ１５の分類結果を示す図である。図３０に示すように、図２３の結果と比較して、ダミー値の調整を行うことで、分類結果が「不明」であったｕｓｅｒ４のデータを正しく分類することが可能な分類木を生成することができる。当該分類木は、図５、図１２、図２２に示す分類木と比較して、最も精度の良い分類木である。
以上説明したように、第２実施形態に係るデータ分割装置３００によれば、データの特徴を反映したより精度の良いデータの分割が可能となる。その理由は、例えばデータの分布に相関関係があるとか、一定の範囲にデータが固まっている等の特徴がある場合に、第一の調整部３１０が、その特徴が強調されるように修正後のデータを調整するからである。
＜第３実施形態＞
次に本発明の第３実施形態について説明する。本発明の第３実施形態は、第一のデータ分割装置４００及び第二のデータ分割装置５００を用いて分散匿名化を行う分散匿名化システムである。
分散匿名化とは、分散して保持されている情報を結合する際における、個人の特定や属性の推定を防ぐための匿名化の技術である。分散匿名化技術は、例えば、非特許文献２に記載されている。
非特許文献２の技術は、２つの事業者の間でデータを結合する際に、まず２つの事業者がそれぞれ保持する個人情報を抽象化して初期匿名データを生成する。非特許文献２の技術は、抽象化された個人情報を、匿名性を満たすことを確認しながら徐々に具体化していく。
個人情報の具体化のために、個人情報の分割点を決定し、データを分割する。非特許文献２に記載の技術は、分割の際に、ｋ−匿名性（ｋ−ａｎｏｎｙｍｉｔｙ）とｌ−多様性（ｌ−ｄｉｖｅｒｓｉｔｙ）という二つの指標が満たされるか否かを、センシティブ情報を保持している事業者で確認する。
ここでセンシティブ情報とは、結合後のデータの情報処理に用いるため、変更したくない情報のことを言う。
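k-anonymity is an index requiring that at least k users share the same combination of quasi-identifiers. l-diversity is an index requiring that the users who share the same combination of quasi-identifiers have at least l distinct values of sensitive information. By providing users with data that satisfies both indices, the technique of Non-Patent Document 2 prevents individuals from being identified from the provided data and prevents an individual's sensitive information from becoming known.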
以降の本実施形態の説明では、個人情報のデータが２−匿名性を満たすことを要求する。
図３１は、第３実施形態に係る第一のデータ分割装置４００の構成を示すブロック図である。図３１に示すように、第一のデータ分割装置４００は、第１実施形態における第一のデータ分割装置１００と比較して、第一の判定部４１０を含む点で異なる。第一の判定部４１０以外の構成部については第１実施形態と同様の構成であるため、同様の番号を付し、説明を省略する。
第一の判定部４１０は、自装置（第一のデータ分割装置４００）と、他装置（第二のデータ分割装置５００）とのいずれにも存在するユーザ識別子の割合が、予め定められた匿名指標を満たすか否かを第一の分割部１４０による分割後のグループ毎に判定する。
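FIG. 32 is a block diagram showing the configuration of the second data dividing device 500 according to the third embodiment. As shown in FIG. 32, the configuration of the second data dividing device 500 may be the same as that of the first data dividing device 400.
FIG. 33 is a flowchart showing the operation of the first data dividing device 400 in the third embodiment. As shown in FIG. 33, the operation of the first data dividing device 400 differs from that of the first data dividing device 100 shown in FIG. 16 in that it includes step S18, in which the first determination unit 410 determines whether the groups after division satisfy a predetermined anonymity index. In addition, the operation of the first data dividing device 400 does not include step S12 and includes step S17 instead. Operations that are the same as those in FIG. 16 are given the same reference signs, and their description is omitted.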
次に、図３４〜図４３を参照して、図３３の各ステップを、具体的に例を用いて説明する。前提として、第一のデータ分割装置４００は事業者Ｓが有するものとする。また、第二のデータ分割装置５００は事業者Ｔが有するものとする。
また、事業者Ｓは、「腹囲」と「Ｃｌａｓｓ」に関する個人情報（図１に示すデータ）を保持しているとする。事業者Ｔは、「血圧」と「病気」に関する個人情報を保持しているとする。
図３４は、第３実施形態において事業者Ｔが保持しているデータの例を示す図である。
各事業者が保持する個人情報は、識別子管理事業者が管理する共通の識別子で対応している。また、以降の例では「腹囲」を第一の個人情報とし、「血圧」を第二の個人情報とし、「病気」をセンシティブ情報とする。
図３３のステップＳ１１において、第一のデータ分割装置４００及び第二のデータ分割装置５００は、互いに、互いが保持するユーザの識別子を取得する。
図３５は、ｕｓｅｒ７〜ｕｓｅｒ１５のデータをリサンプリングにより決定したデータを示す図である。より具体的に説明すると、図３３のステップＳ１７において、事業者Ｔ（第二のデータ分割装置５００）における第二の設定部２３０は、受信した図１に示すデータと、図３４に示すデータとを照らし合わせた結果、図３５に示すようにｕｓｅｒ７〜ｕｓｅｒ１５のデータに対し、リサンプリングの手法により第二の個人情報にダミー値を設定し、センシティブ情報を適当に設定する。
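In step S17, the first setting unit 130 and the second setting unit 230 also generate initial anonymous data. For example, the first setting unit 130 generates the initial anonymous data shown in FIG. 36 from the data of FIG. 1, and the second setting unit 230 generates the initial anonymous data shown in FIG. 37 from the data of FIG. 34 held by the business operator T.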
図３６及び図３７に示すように、初期匿名データは、ｕｓｅｒＩＤ、準識別子（血圧、腹囲、及び、ｃｌａｓｓに関する情報）及びセンシティブ情報（病気に関する情報）を含む。
図３３のステップＳ１３において、第１実施形態と同様に、第一の分割部１４０及び第二の分割部２４０は、「Ｘ＝９０」を、エントロピーの合計が最小となる点であると判定し、分割点に決定する。腹囲（Ｘ）に関する情報を保持している事業者Ｓは、「Ｘ＝９０」で分割した場合のグループの内容に関する情報を第一の送受信部１１０を介して事業者Ｔに送信する。具体的には、第一の分割部１４０は、データの分割情報（｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝、｛ｕｓｅｒ４〜６，１０，１３，１５｝という２つのグループにデータを分割することを示す情報）を事業者Ｔの第二の送受信部２１０に送信する。
図３８は、図３６のデータを「Ｘ＝９０」で分割したユーザデータを表す図である。図３９は、図３７のユーザデータを「Ｘ＝９０」で分割したユーザデータを表す図である。
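Next, in step S18 of FIG. 33, the first determination unit 410 and the second determination unit 420 determine, for each group resulting from the division by the first dividing unit 140 of the first data dividing device 400, whether the proportion of user identifiers present in both the first data dividing device 400 (own device) and the second data dividing device 500 (other device) satisfies a predetermined anonymity index.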
「ユーザ識別子」とは、自装置が記憶しているユーザの識別子を意味する。具体的には、事業者Ｓのユーザ識別子はｕｓｅｒ１〜１５である。事業者Ｔのユーザ識別子はｕｓｅｒ１〜６である。
本実施形態において、予め定められた匿名指標は、２−匿名性である。図３９の｛ｕｓｅｒ１〜３，７〜９，１１，１２，１４｝のグループ（一行目のグループ）は、ユーザ９名のうち６名がダミーなので３−匿名である。また、｛ｕｓｅｒ４〜６，１０，１３，１５｝のグループ（二行目のグループ）は、ユーザ６名のうち３名がダミーなので３−匿名である。従って、いずれのグループも２−匿名性を満たす。
なお、本例においては、センシティブ情報を保持しているのは事業者Ｔなので、匿名指標の確認は事業者Ｔが行えばよい。
また、本例においては、ダミーデータは事業者Ｔのユーザデータに含まれているので、指標を満たしていることの確認は難しくない。仮に事業者Ｓのデータにもダミーデータが含まれている場合には、第１の判定部４１０が、ＭＰＣを用いて事業者Ｓのデータ及び事業者Ｔのデータが共に匿名指標を満たしていることを確認しても良い。
事業者Ｔが保持するデータの匿名指標が保たれていることを確認すると、図３３のステップＳ１４において第二の修正部２５０は、第１実施形態と同様の手順でダミー値の修正を行う。
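FIG. 40 is a diagram showing user data in which the dummy values of the data of FIG. 35 have been corrected. Because the second correction unit 250 corrects the dummy values, the user data can be divided accurately at an appropriate point in the next step.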
次に、第一の分割部１４０及び第二の分割部２４０は、第１実施形態と同様の方法により、分割後にダミー値が修正されたユーザデータを、さらに分割可能か否かを判定する（ステップＳ１５）。分割候補点が存在すると判定すると、第一の分割部１４０及び第二の分割部２４０は、ＭＰＣを用いて分割後のエントロピーの合計を計算して、第１実施形態と同様に「Ｙ＝１３０」を分割点に決定する。
再び図３３のステップＳ１３において、第一の分割部１４０及び第二の分割部２４０は、ユーザデータを分割する。図４１は、図３８のデータを「Ｙ＝１３０」で分割したデータを表す図である。図４２は、図３９のデータを「Ｙ＝１３０」で分割したユーザデータを表す図である。
図３３のステップＳ１８において、第二の判定部４２０は、図４２のユーザデータについて匿名指標が保たれているかを確認する。第二の判定部４２０は、｛ｕｓｅｒ４，１０｝のグループ（二行目のグループ）は、ユーザ２名のうち１名がダミーなので１匿名であり、２匿名性を満たさないと判定する。
第二の判定部４２０が匿名指標を満たさないと判定すると、第一のデータ分割装置４００及び第二のデータ分割装置５００は、最後に行った分割をキャンセルする。第一のデータ分割装置４００及び第二のデータ分割装置５００は、キャンセルしたそれぞれのデータについて、ＭＰＣを用いて双方に存在する人数を計算し、結合匿名化データを生成する。
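The anonymity check of step S18 and the cancellation of a split can be illustrated as below. This sketch follows the counting used in the example, where a group is treated as n-anonymous when it contains n users that are not dummies; the function names and the `users` helper are hypothetical, and the MPC-based variant mentioned in the text is not modeled.

```python
def users(*numbers):
    """Helper: build a set of identifiers such as {'user4', 'user10'}."""
    return {f"user{n}" for n in numbers}


def anonymity_level(group, real_ids):
    """Number of users in the group that actually exist in the device's
    own data, i.e. that are not dummy data."""
    return sum(1 for uid in group if uid in real_ids)


def accept_split(groups, real_ids, k=2):
    """True if every group produced by a split keeps at least k real users;
    otherwise the split should be cancelled."""
    return all(anonymity_level(g, real_ids) >= k for g in groups)


t_real = users(1, 2, 3, 4, 5, 6)            # operator T really holds user1-6

# Split at X = 90 (FIG. 39): 3 real users in each group, so it is accepted.
split_x90 = [users(1, 2, 3, 7, 8, 9, 11, 12, 14), users(4, 5, 6, 10, 13, 15)]
# Further split at Y = 130 (FIG. 42): {user4, user10} has only 1 real user.
split_y130 = [users(4, 10), users(5, 6, 13, 15)]

ok_first = accept_split(split_x90, t_real)
ok_second = accept_split(split_y130, t_real)
```

Here `ok_first` is true and `ok_second` is false, which corresponds to keeping the split at “X = 90” and cancelling the split at “Y = 130”.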
図４３は、第３実施形態に係る本発明により生成された最終的な結合された匿名化データ（結合匿名化データ）を示す図である。
ここで、最終的に出力される図４３に示すユーザデータを参照しても、事業者Ｓはどのユーザのデータが確実に事業者Ｔのデータに存在するかはわからない。また、事業者Ｔはどのユーザのデータが確実に事業者Ｓのデータに存在するかはわからない。
以上説明したように、第３実施形態に係る本願発明である分散匿名化システムによれば、他の事業者にユーザのデータの存在が漏洩する危険性なく、かつ一方の装置が保持する非共通データを有効に活用して、データ分割の精度良く分散匿名化処理を実行することができる。その理由は、本願発明の分散匿名化システムは、他の事業者に送信するデータの中に、実際には存在しないダミーのデータを含めて送信することで、分散匿名化処理の過程における、他の事業者へのユーザのデータの存在の漏洩が防止できるからである。また、分割の度でダミー値を修正することで、適切な分割点で分割が可能となるからである。
＜第４実施形態＞
次に図４４を参照して、本発明の第４実施形態に係るデータ分割装置６００の機能構成を説明する。
図４４は、第４実施形態に係るデータ分割装置６００の構成を示すブロック図である。図４４に示すように、データ分割装置６００は、送受信部６１０と、記憶部６２０と、設定部６３０と、分割部６４０と、修正部６５０とを含む。なお、これらは上述した第一の送受信部１１０、第一の記憶部１２０、第一の設定部１３０、第一の分割部１４０及び第一の修正部１５０と同様の構成である。
データ分割装置６００は、第一の個人情報を記憶しており、第二の個人情報を記憶している他装置と、送受信部６１０を介して互いに通信を行いながらデータを分割する。
送受信部６１０は、他装置とデータを送受信する。
記憶部６２０は、複数のユーザの識別子と、第一の個人情報の値とを関連付けてユーザデータとして記憶する。
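The setting unit 630 sets a first dummy value in the first personal information, as dummy data, for data that the transmission/reception unit 610 has received as being stored in the other device and whose identifiers are not stored in the storage unit 620.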
分割部６４０は、ダミーデータを含む所定のデータを、ダミー値を含む第一の個人情報の値又は第二の個人情報の値に基づいて決定された分割点によって、グループに分割する。
修正部６５０は、分割の度に、分割後のグループに属するデータのうちダミーデータ以外のデータの第一の個人情報の値に基づいて、ダミー値を修正する。
以上説明したように、第４実施形態に係るデータ分割装置６００によれば、複数の装置によるデータの分散処理において、一方の装置が保持する非共通データを有効に活用することで、精度の良いデータの分割が可能となる。
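The present invention has been described above with reference to the embodiments, but it is not limited to these embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.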
なお、本発明に係る二以上の異なる事業者が保持するデータ分割装置は、それぞれ管理上分離された装置であれば良く、例えば、仮想的に分離された装置であっても良い。また、例えば、各事業者のデータ分割装置の記憶部が同一のデータベースで保持されており、異なる事業者が保持するデータであることが分かるような管理形態で保持されていても良い。
図４５は、第１実施形態に係る第一のデータ分割装置１００のハードウェア構成の一例を示すブロック図である。
図４５に示すように、第一のデータ分割装置１００を構成する各部は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１と、ネットワーク接続用の通信ＩＦ２（通信インターフェース２）と、メモリ３と、プログラムを格納するハードディスク等の記憶装置４とを含む、コンピュータ装置によって実現される。ただし、第一のデータ分割装置１００の構成は、図４５に示すコンピュータ装置に限定されない。
例えば、第一の送受信部１１０は、通信ＩＦ２によって実現されても良い。
ＣＰＵ１は、オペレーティングシステムを動作させて第一のデータ分割装置１００を制御する。また、ＣＰＵ１は、例えばドライブ装置などに装着された記録媒体からメモリ３にプログラムやデータを読み出し、これにしたがって各種の処理を実行する。
例えば第一の設定部１３０、第一の分割部１４０及び第一の修正部１５０は、ＣＰＵ１及びプログラムによって実現されても良い。
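The storage device 4 is, for example, an optical disk, a flexible disk, a magneto-optical disk, an external hard disk, or a semiconductor memory, and records a computer program in a computer-readable manner. The computer program may be downloaded from an external computer (not shown) connected to a communication network.
For example, the first storage unit 120 may be realized by the storage device 4.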
なお、これまでに説明した各実施形態において利用するブロック図は、ハードウェア単位の構成ではなく、機能単位のブロックを示している。これらの機能ブロックはハードウェア及びソフトウェアの任意の組み合わせによって実現される。また、第一のデータ分割装置１００の構成部の実現手段は特に限定されない。すなわち、第一のデータ分割装置１００は、物理的に結合した一つの装置により実現されても良いし、物理的に分離した二つ以上の装置を有線又は無線で接続し、これら複数の装置により実現されても良い。
本発明のプログラムは、上記の各実施形態で説明した各動作を、コンピュータに実行させるプログラムであれば良い。
この出願は、２０１１年９月２１日に出願された日本出願特願２０１１−２０５５１９を基礎とする優先権を主張し、その開示の全てをここに取り込む。
First, in order to facilitate understanding of the embodiments of the present invention, the background of the present invention will be described.
As in the example described in the background art, it is assumed that the business operator S and the business operator T generate a classification tree from personal information held respectively. The personal information held by each business operator may correspond to a common identifier managed by the identifier management business operator.
The technique of Non-Patent Document 1 is used as a technique for generating a classification tree.
First, for example, the identifier management business operator notifies each business operator of the identifier of the user who is the target of classification tree generation. For example, it is assumed that the identifiers of user1 to user15 are notified to each business operator.
It is assumed that the business operator S holds the data shown in FIG. 1 regarding the user with the notified identifier. As shown in FIG. 1, the business operator S holds personal information (data of “abdominal girth (X)” and “Class”) regarding users with identifiers of user 1 to user 15. “Class” is displayed as “A” or “B”, “A” indicates that the user is non-metabolic, and “B” indicates that the user is metabolic.
It is assumed that the business operator T holds the data shown in FIG. 2 regarding the user with the notified identifier. As shown in FIG. 2, the business operator T holds personal information (“blood pressure (Y)” data) related to the users having the identifiers of user 1 to user 6.
In the technique described in Non-Patent Document 1, only the data of user1 to user6, the users common to the business operators S and T, is subject to the classification tree generation process; the data of user7 to user15 is not used. The technique described in Non-Patent Document 1 calculates entropy using MPC over the data of user1 to user6 and determines the point “Y = 130”, which gives the smallest entropy after division, as the division point.
FIG. 3 is a diagram illustrating a state in which the data held by the business operator S and the business operator T is divided into two at “Y = 130”. FIG. 4 is a diagram illustrating how the data distribution is divided at “Y = 130” when the abdominal circumference is the X axis and the blood pressure is the Y axis. As shown in FIGS. 3 and 4, the data of user1 to user6 is divided into a group of user1 to user4 and a group of user5 and user6, with “Y = 130” as the division point.
FIG. 5 is a diagram illustrating an example of the classification tree that is finally generated from the data illustrated in FIGS. 1 and 2. In the above example, the technique described in Non-Patent Document 1 determines that the groups shown in FIGS. 3 and 4 cannot be divided further and generates the classification tree shown in FIG. 5. According to the classification tree shown in FIG. 5, the data of user1 to user6 is correctly classified. Specifically, user1 to user4 are classified as “A” (non-metabolic), and user5 and user6 are classified as “B” (metabolic). However, because the data of user7 to user15 has no “blood pressure (Y)” value, it cannot be classified.
FIG. 6 is a diagram illustrating the classification results of user1 to user15 using the classification tree generated by the technique of Non-Patent Document 1. The “correct answer” column in FIG. 6 indicates whether each user is correctly classified as “A” or “B” when the classification tree shown in FIG. 5 is used. “◯” indicates that the classification is correct. “Unknown” indicates that the data cannot be classified, either because it lacks a value needed for classification or because it falls into a group in which “A” and “B” are mixed, so that it cannot be determined whether it is “A” or “B”.
Viewed across all the data of user1 to user15, the classification result shown in FIG. 6 cannot be called accurate. Consider the data of user7 to user15, which was not used to generate the classification tree. This data shows that most users with “abdominal circumference (X) = 80” are “A” and that most users with “abdominal circumference (X) = 90” are “B”. Although the data of user7 to user15 carries information that is important for generating a classification tree, the technique of Non-Patent Document 1 does not make use of it.
In generating the classification tree, it is conceivable to use a technique called “resampling” in order to use the data of user7 to user15. “Resampling” refers to a method of determining a sample value based on a specimen. In the above example, the values of user7 to user15 are determined based on the distribution of values of user1 to user6 that actually hold the value of “blood pressure (Y)”.
FIG. 7 is a diagram illustrating the data held by the business operator T after the values of user7 to user15 have been determined by resampling. The business operator T holds the “blood pressure (Y)” values of user1 to user6 in the ratio “110s : 120s : 130s = 1 : 1 : 1”. Therefore, as shown in FIG. 7, the device of the business operator T resamples the “blood pressure (Y)” values of user7 to user15 in the same 1 : 1 : 1 ratio and sets them as dummy values. In this example, 115 is assigned for the 110s, 125 for the 120s, and 135 for the 130s.
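Resampling of this kind can be sketched by drawing each dummy value at random from the values that are actually held. This is a minimal sketch, not the method of the patent or of Non-Patent Document 1: the function name `resample_dummies`, the fixed seed, and the assignment of 115, 125, and 135 to individual users are assumptions made for illustration.

```python
import random


def resample_dummies(real_values, missing_ids, seed=0):
    """Draw a dummy value for every missing user from the empirical
    distribution of the values that are really held.

    real_values: dict user -> real value held by this device.
    missing_ids: users the other device holds but this device does not.
    """
    rng = random.Random(seed)               # seeded only for repeatability
    observed = list(real_values.values())
    return {uid: rng.choice(observed) for uid in missing_ids}


# Representative blood-pressure values for user1-6 (assumed assignment;
# only the 1:1:1 ratio of 110s, 120s and 130s comes from the text).
t_real = {"user1": 115, "user2": 115, "user3": 125,
          "user4": 125, "user5": 135, "user6": 135}
missing = [f"user{i}" for i in range(7, 16)]

dummies = resample_dummies(t_real, missing)
```

Because the draws are random, the dummy values follow the 1 : 1 : 1 ratio only in expectation, which is the source of the accuracy problem discussed for plain resampling.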
With the dummy values set, user1 to user15 all become common users of the business operator S and the business operator T. The technique described in Non-Patent Document 1 calculates entropy using MPC over the data of user1 to user15 and determines the point “X = 90”, which gives the smallest entropy after division, as the division point.
FIG. 8 is a diagram illustrating a state in which the data of user1 to user15 is divided into two at “X = 90”. FIG. 9 is a diagram illustrating how the distribution of the data of user1 to user15, including the data for which dummy values are set, is divided at “X = 90”. As shown in FIGS. 8 and 9, the data of user1 to user15 is divided, with “X = 90” as the division point, into the group “user1 to 3, 7 to 9, 11, 12, 14” and the group “user4 to 6, 10, 13, 15”.
FIG. 10 is a diagram showing a state in which the data shown in FIG. 8 is further divided at “Y = 120”. FIG. 11 is a diagram illustrating how the distribution of the data, including the data for which dummy values are set, is divided at “Y = 120”. As shown in FIGS. 10 and 11, in this example the technique described in Non-Patent Document 1 determines, by entropy calculation using MPC, that “Y = 120” is appropriate as the division point.
FIG. 12 is a diagram illustrating an example of a classification tree that is finally generated from data of users 1 to 15 including data for which dummy values are set. When resampling is performed, according to the technique described in Non-Patent Document 1, the classification tree shown in FIG. 12 is generated.
FIG. 13 is a diagram illustrating the classification results of user1 to user15 using the classification tree generated from data that includes the dummy values. As shown in FIG. 13, compared with the result of FIG. 6, resampling makes it possible to generate a classification tree that correctly classifies the data of user7 to 9, 11, 12, and 14 among the data of user7 to user15, whose classification result was previously “unknown”. On the other hand, because the resampled values follow the sample distribution but are otherwise set at random, data that could previously be classified correctly, such as that of user4 to user6, may become “unknown”.
As described above, data division that uses only the common users' data cannot generate an accurate classification tree. Even if dummy values are set at random by resampling, an accurate classification tree is not necessarily generated. In other words, simply applying resampling to the technique of Non-Patent Document 1 does not divide the data accurately.
According to the first embodiment of the present invention described below, the problems described so far are solved.
<First Embodiment>
First, the configuration of the data dividing apparatus according to the first embodiment of the present invention will be described with reference to FIGS.
FIG. 14 is a block diagram showing the configuration of the first data dividing device 100 according to the first embodiment. As shown in FIG. 14, the first data dividing device 100 includes a first transmission/reception unit 110, a first storage unit 120, a first setting unit 130, a first dividing unit 140, and a first correction unit 150.
FIG. 15 is a block diagram illustrating a configuration of the second data dividing device 200 according to the first embodiment. As shown in FIG. 15, the configuration of the second data dividing device 200 may be the same as that of the first data dividing device 100. In the present embodiment, the first data dividing device 100 and the second data dividing device 200 constitute a data dividing system. In the following description, the configuration of the first data dividing device 100 will be mainly described.
In the present embodiment, two data dividing devices are described. However, the number of data dividing devices is not limited to two, and a system including a plurality of two or more devices may be used.
The first transmission/reception unit 110 communicates with an external device and thereby acquires information on the user identifiers held by its own device (first data dividing device 100) and by the other device (second data dividing device 200). Because privacy protection is assumed, personal information such as “abdominal circumference” and “blood pressure” is not disclosed between the devices; what is exchanged is, for example, userID information.
For example, the first transmission/reception unit 110 may acquire information on the predetermined identifiers that form the population by receiving a notification from an identifier management device held by an identifier management business operator. Among the acquired predetermined identifiers, the first transmission/reception unit 110 may determine that the identifiers not held by the first data dividing device 100 (own device) are non-common identifiers held by the second data dividing device 200 (other device).
Alternatively, the first transmission/reception unit 110 may acquire the userID information held by the second data dividing device 200 (other device) by exchanging userID information directly with the second transmission/reception unit 210.
Alternatively, the first data dividing device 100 may store information on the userID held by the second data dividing device 200 in the first storage unit 120 in advance. In this case, the first transmission / reception unit 110 may not perform the process of acquiring the userID information of the second data division device 200.
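Whichever way the identifiers are obtained, deciding which users need dummy data reduces to a set difference. The sketch below uses the identifier ranges of the running example (the business operator S holds user1 to user15, the business operator T holds user1 to user6); the function name `non_common_ids` is hypothetical.

```python
def non_common_ids(own_ids, other_ids):
    """Identifiers that the other device holds but this device does not.
    These are the users for which this device must set dummy values."""
    return set(other_ids) - set(own_ids)


s_ids = {f"user{i}" for i in range(1, 16)}   # operator S: user1 to user15
t_ids = {f"user{i}" for i in range(1, 7)}    # operator T: user1 to user6

dummies_needed_by_s = non_common_ids(s_ids, t_ids)
dummies_needed_by_t = non_common_ids(t_ids, s_ids)
```

In this example the device of S needs no dummy data, while the device of T needs dummy values for user7 to user15.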
The identifier used in this embodiment may be a national ID, for example. Alternatively, the identifier may be OpenID described in Non-Patent Document 2, and is not limited thereto.
The first storage unit 120 stores user data in which a plurality of user identifiers are associated with values of the first personal information. Here, a “user identifier” means the identifier of a user stored in the respective data dividing device. For example, “the user identifiers stored in the first storage unit 120” means the identifiers of users stored in the first storage unit 120 of the first data dividing device 100, and does not include the identifiers of users that are stored in the second storage unit 220 of the second data dividing device 200 but not in the first storage unit 120.
Further, “first personal information” refers to one piece of personal information stored in the first data dividing apparatus 100 (self apparatus). For example, when the first storage unit 120 stores the user data shown in FIG. 1, the “first personal information” may be “abdominal circumference”.
The first setting unit 130 acquires, from the first storage unit 120, the user data (user identifiers and the personal information values associated with them) stored in the first data dividing device 100 (own device). The first setting unit 130 also acquires the user identifiers held by the second data dividing device 200 (other device) from the first transmission/reception unit 110 or the first storage unit 120.
For user data that is stored in the second data dividing device 200 (other device) but whose user identifiers are not stored in the first storage unit 120, the first setting unit 130 sets a dummy value as the value of the first personal information, creating dummy data. The dummy value is set, for example, by resampling according to the distribution of the first personal information values associated with the stored user identifiers. The method by which the first setting unit 130 sets dummy values is not limited to resampling; any other suitable method may be used.
The first setting unit 130 outputs predetermined user data including dummy data to the first dividing unit 140.
The first dividing unit 140 divides predetermined user data including dummy data output from the first setting unit 130 into groups according to division points. The division point is a threshold value for dividing the user data, and is determined based on the value of the first personal information including the dummy value or the value of the second personal information.
In addition, the first dividing unit 140 may communicate with the second dividing unit 240 via the first transmission/reception unit 110 and determine which item of personal information, among those held by the first data dividing device 100 and the second data dividing device 200, is most appropriate as the division axis. In this case, the first dividing unit 140 may communicate with the second dividing unit 240 to determine the most appropriate division point among the values of that personal information.
When dummy values have also been set in the second data dividing device 200 (other device), the values of the second personal information used to determine the division point include the dummy values set by the second data dividing device 200 (second dummy values).
The division method is not particularly limited. The first dividing unit 140 may divide the user data into two groups using the average of the values of a predetermined item of personal information as the division point. In this case, the first dividing unit 140 may transmit the contents of the resulting groups to the second dividing unit 240 via the first transmission/reception unit 110. The first dividing unit 140 and the second dividing unit 240 may repeat the division in turn, each using the average of its own personal information values as the division point. Alternatively, the first dividing unit 140 may determine the division point using a known heuristic function.
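The mean-based option is the simplest of these to write down. The following sketch splits one device's own values at their mean; the function name `split_at_mean` and the sample abdominal-circumference values are hypothetical and are not taken from the figures.

```python
def split_at_mean(values):
    """Split users into two groups at the mean of their values.

    values: dict user -> numeric personal-information value.
    Returns (mean, users below the mean, users at or above the mean).
    """
    mean = sum(values.values()) / len(values)
    below = sorted(u for u, v in values.items() if v < mean)
    at_or_above = sorted(u for u, v in values.items() if v >= mean)
    return mean, below, at_or_above


# Hypothetical abdominal-circumference values.
waist = {"user1": 80, "user2": 85, "user3": 95, "user4": 100}
mean, below, at_or_above = split_at_mean(waist)
```

Only the two lists of identifiers would need to be sent to the other device, which matches the idea of sharing group contents rather than the values themselves.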
Further, the first dividing unit 140 may determine the dividing point in consideration of entropy when the user data is divided. By considering entropy, the first dividing unit 140 may determine the dividing points so that the user data belonging to the group after the division is less mixed.
For example, the entropy in the divided group may be calculated by the following formula.
Entropy = Σ {−1 × P(Class) × log(P(Class))}
Here, when “Class” takes the value “A” or “B”, P(Class) is given as follows.
P(A) = (number of “A” in the group after division) / (total number of “A” and “B” in the group after division)
P(B) = (number of “B” in the group after division) / (total number of “A” and “B” in the group after division)
In this case, the first dividing unit 140 calculates the entropy of a group after division as follows.
Entropy = {−1 × P(A) × log(P(A))} + {−1 × P(B) × log(P(B))}
For example, the first division unit 140 calculates the above entropy for two groups after division at the appropriate division candidate points (two groups greater than or less than the division points). The division candidate points may be determined by a predetermined rule (algorithm), and may be a known method.
The first dividing unit 140 may divide the data into two groups at each division candidate point and, letting S be the sum of the entropies of the two groups, determine the point at which S is smallest as the division point. Although it is preferable to determine the point at which S is smallest as the division point, the division point is not limited to this and may be a point at which S is close to its smallest value.
A small value of S means that there is little data mixing (mixing of “A” and “B”) in the two groups.
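The selection of a division point by the total entropy S can be illustrated with a minimal sketch. The function names, the toy data, the use of a base-2 logarithm, and the convention that a value equal to the candidate point falls into the upper group are all assumptions made for illustration; the description above does not fix any of them, and MPC is not modelled here.

```python
import math
from collections import Counter

def group_entropy(labels):
    """Entropy = sum over classes of -P(class) * log(P(class)); 0 for an empty group."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Assumption: base-2 logarithm (the text only says "log").
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def best_split(values, labels, candidates):
    """Return (candidate point, S) for the candidate with the smallest S,
    where S is the sum of the entropies of the two groups after division."""
    best = None
    for point in candidates:
        below = [c for v, c in zip(values, labels) if v < point]
        above = [c for v, c in zip(values, labels) if v >= point]
        s = group_entropy(below) + group_entropy(above)
        if best is None or s < best[1]:
            best = (point, s)
    return best

# Hypothetical toy data (not taken from the figures).
waist = [80, 82, 85, 92, 95, 98]
classes = ["A", "A", "A", "B", "B", "A"]
point, s = best_split(waist, classes, candidates=[84, 90])
print(point, round(s, 3))
```

With this toy data the candidate 90 yields the smaller S, because the lower group then contains only “A”.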
Alternatively, the first division unit 140 may determine, as the division point, the division candidate point at which one of the two groups (the group above or the group below the division point) has the smallest entropy value among the predetermined division candidate points. As described above, it is preferable to divide so that one of the two groups has the smallest value among the division candidate points, but a value close to the smallest value may also be used. The division point determination method using entropy is not limited to the above-described methods, and other methods may be used.
As described above, the first data dividing device 100 and the second data dividing device 200 do not know the values of the personal information of each other. Specifically, the first data dividing device 100 does not know the true value of the second personal information held by the second data dividing device 200.
Therefore, the first dividing unit 140 may calculate the dividing point while taking the values of the second personal information into account by using MPC (Multi-Party Computation) or SMPC (Secure Multi-Party Computation). By using MPC or the like, the first dividing unit 140 can calculate the dividing point without the first data dividing device 100 and the second data dividing device 200 disclosing any of their personal information values to each other.
Hereinafter, for convenience of explanation, it is assumed that the first dividing unit 140 determines a dividing point by calculating entropy using MPC and divides the data.
The first division unit 140 transmits division information indicating the contents of the identifier in each divided group to the second data division device 200 via the first transmission / reception unit 110. The division information may be, for example, a list of user IDs divided at division points.
Further, the first division unit 140 receives the division information transmitted from the second transmission / reception unit 210 via the first transmission / reception unit 110. The first division unit 140 divides the data based on the received division information.
The first dividing unit 140 outputs the divided data to the first correcting unit 150.
At each division by the first division unit 140, the first correction unit 150 corrects the dummy value based on the values of the first personal information of the data other than the dummy data among the user data belonging to each group after the division. For example, the first correction unit 150 may correct the dummy values in a group after the division according to the distribution of the values of the first personal information of the data corresponding to the user identifiers.
The first correcting unit 150 outputs the data after correcting the dummy value to the first dividing unit 140 again. The first dividing unit 140 determines whether or not the user data in the group whose dummy value is corrected can be further divided into two groups. The first dividing unit 140 may determine whether or not further division is possible by determining whether or not there is a candidate division point by a known method in the group.
If it is determined that the data can be further divided, the first dividing unit 140 further divides the data output from the first correcting unit 150 into groups.
If it is determined that the data cannot be divided, the first data dividing device 100 ends the process. In this case, the first data dividing device 100 may output the divided user data, the generated classification tree, and the like to an output device, another external system, and the like.
Next, the operation of the first data dividing apparatus 100 according to the first embodiment of the present invention will be described with reference to FIG.
FIG. 16 is a flowchart showing the operation of the first data dividing apparatus 100 according to the first embodiment of the present invention. As illustrated in FIG. 16, for example, the first transmission / reception unit 110 acquires userID information (user identifiers) of the data held by the second data division device 200 (step S11).
When the first transmission / reception unit 110 acquires the userID information of the second data dividing device 200 (the other device), the first setting unit 130 sets, among the user data containing the acquired userID information, a dummy value as the value of the first personal information for the data other than the data corresponding to the userIDs (user identifiers) actually stored in the first storage unit 120, and treats those data as dummy data (step S12). At this time, similarly in the second data dividing device 200, the second transmission / reception unit 210 acquires the userID information held by the first data dividing device 100, and the second setting unit 230 sets a dummy value as the value of the second personal information. Naturally, since the data stored in the first storage unit 120 and the second storage unit 220 differ, the dummy data also differ from each other.
Next, the first dividing unit 140 determines whether or not there is a division candidate point when dividing predetermined user data including data for which dummy values are set. If it is determined that there is a division candidate point, the first division unit 140 calculates the total entropy of the two groups when the predetermined division candidate point is divided. The first dividing unit 140 determines a point having the lowest total entropy as a dividing point, and divides predetermined user data into two groups at that point (step S13).
Next, the first correction unit 150 corrects the dummy values of the user data in each group after the division (step S14). The first correction unit 150 may correct the dummy values in each divided group according to the distribution of the values of the first personal information of the data corresponding to the user identifiers.
Next, the first correction unit 150 outputs the user data in the group after the division whose dummy value has been corrected to the first division unit 140 again. The first dividing unit 140 determines whether or not the user data whose dummy value has been corrected after the division can be further divided (step S15). The first division unit 140 may determine whether or not further division is possible by determining whether or not division candidate points exist.
If it is determined that there is a candidate for division, the process proceeds to step S13. In step S13, the first dividing unit 140 further divides the group of user data whose dummy values have been corrected after the division. If it is determined that there is no division candidate point, the process ends.
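The loop formed by steps S13 to S15 can be sketched as a recursion. This is only a control-flow illustration under stated assumptions: `divide_recursively`, `split_at_mean`, and the stopping rule are hypothetical names and choices, the split rule is the simple average-value division mentioned earlier rather than the entropy-based one, and the dummy-value correction is passed in as a placeholder function.

```python
def divide_recursively(group, find_split, correct_dummies):
    """Repeat division (S13), dummy correction (S14) and the
    'can it be divided further?' check (S15) until no candidate remains."""
    parts = find_split(group)              # S13; None means no division candidate point
    if parts is None:
        return [group]                     # the process ends for this group
    leaves = []
    for part in parts:
        corrected = correct_dummies(part)  # S14
        # S15: the recursive call checks for further candidates and returns to S13
        leaves.extend(divide_recursively(corrected, find_split, correct_dummies))
    return leaves

def split_at_mean(values):
    """Hypothetical split rule: divide at the average value.
    Assumed stopping rule: fewer than three distinct values means no candidate."""
    if len(set(values)) < 3:
        return None
    mean = sum(values) / len(values)
    return ([v for v in values if v < mean], [v for v in values if v >= mean])

# Toy data; the correction step is a no-op placeholder here.
leaves = divide_recursively([1, 2, 3, 10, 11, 12], split_at_mean, lambda g: g)
print(leaves)
```

Passing the split rule and the correction step as parameters keeps the flow of FIG. 16 separate from whichever dividing and correcting methods are actually chosen.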
Next, with reference to FIGS. 17 to 22, each step of FIG. 16 will be specifically described with an example. As a premise, it is assumed that the first data dividing device 100 is a device of the business operator S. Further, it is assumed that the second data dividing device 200 is a device of the business operator T.
Further, the following examples are based on the same situation as the above-described example. Specifically, it is assumed that the business operator S (first data division device 100) holds personal information (data shown in FIG. 1) regarding “abdominal circumference” and “Class”. It is assumed that the business operator T (second data division device 200) holds personal information (data shown in FIG. 2) regarding “blood pressure”. The personal information held by each business operator is associated with a common identifier managed by the identifier management business operator. In the following examples, “abdominal circumference” is the first personal information, and “blood pressure” is the second personal information.
In step S11 of FIG. 16, the first data dividing device 100 and the second data dividing device 200 exchange the userID information of the data held by each other. Specifically, the first transmission / reception unit 110 transmits the identifiers of user1 to user15 to the second transmission / reception unit 210, and receives the identifiers of user1 to user6 from the second transmission / reception unit 210.
In step S12 of FIG. 16, when the first transmission / reception unit 110 of the business operator S receives the identifiers of user1 to user6, the first setting unit 130 collates them with the data shown in FIG. As a result of the collation, the first setting unit 130 determines that the identifiers of user1 to user6 stored in the second data dividing device 200 are included in the identifiers of user1 to user15 stored in the first data dividing device 100. Therefore, the first setting unit 130 does not set dummy data.
On the other hand, as a result of collating the received identifiers of user1 to user15 with the information shown in FIG. 2, the second setting unit 230 of the business operator T determines that the first data dividing device 100 stores the data of user7 to user15, which the second data dividing device 200 does not store. Accordingly, the second setting unit 230 sets dummy values for the second personal information of the data of user7 to user15 by the resampling method, as shown in FIG.
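The setting of dummy values for identifiers held only by the other device can be sketched as follows. The description does not define the resampling method in detail, so this sketch assumes it means drawing values with replacement from the values the device actually holds; the function name, the seed, and the blood-pressure values are hypothetical and are not taken from the figures.

```python
import random

def set_dummy_values(own_values, other_ids, rng):
    """Fill in a dummy value for every identifier the other device holds
    but this device does not.

    own_values: dict mapping user ID -> value actually stored on this device.
    other_ids:  iterable of user IDs received from the other device.
    Returns (values including dummies, list of IDs that carry a dummy value).
    """
    observed = list(own_values.values())
    filled = dict(own_values)
    dummy_ids = []
    for uid in other_ids:
        if uid not in filled:
            # Assumption: "resampling" = drawing with replacement from own values.
            filled[uid] = rng.choice(observed)
            dummy_ids.append(uid)
    return filled, dummy_ids

# Hypothetical blood-pressure values for user1..user6 held by operator T.
blood_pressure = {"user1": 112, "user2": 118, "user3": 124,
                  "user4": 128, "user5": 131, "user6": 136}
ids_from_operator_s = [f"user{i}" for i in range(1, 16)]
filled, dummy_ids = set_dummy_values(blood_pressure, ids_from_operator_s,
                                     random.Random(0))
print(dummy_ids)
```

Operator S, which already holds every identifier it receives, would obtain an empty list of dummy IDs from the same function.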
In step S13 of FIG. 16, the first dividing unit 140 of the first data dividing device 100 and the second dividing unit 240 of the second data dividing device 200 communicate with each other and determine whether division candidate points exist for the personal information. The first dividing unit 140 and the second dividing unit 240 determine that “X = 90” exists as a division candidate point for the first personal information, and that “Y = 120” and “Y = 130” exist as division candidate points for the second personal information.
The first dividing unit 140 and the second dividing unit 240 communicate with each other and, using MPC so that the values of the personal information are not disclosed, calculate the total entropy of the two groups obtained by dividing at each division candidate point. The first dividing unit 140 and the second dividing unit 240 determine that “X = 90” is the point at which the total entropy is minimized, and determine it as the dividing point.
When the dividing point is determined, in the first data dividing device 100, which holds the personal information on the waist circumference (X) (first personal information), the first dividing unit 140 divides the predetermined user data into {user 1 to 3, 7 to 9, 11, 12, 14} and {user 4 to 6, 10, 13, 15}.
The first dividing unit 140 transmits the division information (information indicating that the user data are divided into the two groups {user 1 to 3, 7 to 9, 11, 12, 14} and {user 4 to 6, 10, 13, 15}) to the second transmission / reception unit 210 of the business operator T. On receiving the division information, the second dividing unit 240 also divides the predetermined user data. FIGS. 8 and 9 are diagrams showing the state of the user data after the division.
In step S14 of FIG. 16, the first correction unit 150 and the second correction unit 250 correct the dummy values within each divided group. In this example, the business operator S (first data dividing device 100) holds no dummy data and therefore does not correct any dummy value; the second correction unit 250 of the business operator T (second data dividing device 200) corrects the dummy values.
FIG. 17 is a diagram illustrating data obtained by correcting the dummy values of the user data in FIG. For each group, the second correction unit 250 of the second data dividing device 200 corrects the dummy values according to the distribution of the values of the second personal information of the data of the user identifiers (data of user1 to user6) other than the dummy data (data of user7 to user15).
First, consider the group {user 1 to 3, 7 to 9, 11, 12, 14}. The data of the user identifiers in the second data division device 200 that belong to this group are the data of user1 to user3. For the second personal information (blood pressure), the data of user1 to user3 are distributed with a ratio of “values in the 110s : values in the 120s = 2 : 1”. Therefore, the second correction unit 250 corrects the dummy values of the dummy data belonging to the group (data of user7 to user9, user11, user12, and user14) so that they are also distributed with a ratio of “values in the 110s : values in the 120s = 2 : 1”.
As illustrated in FIG. 17, the second correction unit 250 corrects, for example, the dummy value of user9 from “135” to “115”, that of user12 from “135” to “115”, and that of user14 from “125” to “115”.
Next, consider the group {user4-6,10,13,15}. The data of the user identifiers in the second data division device 200 that belong to this group are the data of user4 to user6. For the second personal information (blood pressure), the data of user4 to user6 are distributed with a ratio of “values in the 120s : values in the 130s = 1 : 2”. Therefore, the second correction unit 250 corrects the dummy values of the dummy data belonging to the group (data of user10, user13, and user15) so that they are also distributed with a ratio of “values in the 120s : values in the 130s = 1 : 2”.
As illustrated in FIG. 17, the second correction unit 250 corrects, for example, the dummy value of user10 from “115” to “125” and that of user13 from “115” to “135”.
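One possible way to carry out such a distribution-matching correction is sketched below. The description does not say which dummy values are to be changed, so the greedy rule used here (keep a dummy value while its bin still has room, move the rest to the bins that are still short) is an assumption, as are the function names, the decade-wide bins (values from 110 to 119, 120 to 129, and so on), and the real blood-pressure values, which are hypothetical and merely reproduce the stated ratios. The dummy values before correction are the ones listed in the text.

```python
from collections import Counter

def bin_center(value):
    """Map a value to the centre of its decade-wide bin, e.g. 112 -> 115."""
    return (value // 10) * 10 + 5

def correct_dummies(real_values, dummies):
    """Reassign dummy values so that their bin ratio matches that of the real values.

    real_values: non-dummy values in the group (must be non-empty).
    dummies:     dict mapping user ID -> current dummy value.
    """
    real_bins = Counter(bin_center(v) for v in real_values)
    total = sum(real_bins.values())
    n = len(dummies)
    # Target number of dummies per bin (largest-remainder rounding).
    raw = {b: c * n / total for b, c in real_bins.items()}
    quota = {b: int(r) for b, r in raw.items()}
    by_remainder = sorted(raw, key=lambda b: raw[b] - quota[b], reverse=True)
    for b in by_remainder[: n - sum(quota.values())]:
        quota[b] += 1
    corrected, pending = {}, []
    for uid, value in dummies.items():
        b = bin_center(value)
        if quota.get(b, 0) > 0:
            corrected[uid] = value      # bin still has room: keep as is
            quota[b] -= 1
        else:
            pending.append(uid)         # must be moved to another bin
    open_bins = [b for b in sorted(quota) for _ in range(quota[b])]
    for uid, b in zip(pending, open_bins):
        corrected[uid] = b
    return corrected

# Hypothetical real values with ratio 110s : 120s = 2 : 1 (user1 to user3).
group1 = correct_dummies([112, 118, 124],
                         {"user7": 115, "user8": 125, "user9": 135,
                          "user11": 125, "user12": 135, "user14": 125})
# Hypothetical real values with ratio 120s : 130s = 1 : 2 (user4 to user6).
group2 = correct_dummies([128, 131, 136],
                         {"user10": 115, "user13": 115, "user15": 135})
print(group1, group2)
```

Under these assumptions the sketch changes user9, user12, and user14 to 115 in the first group and user10 and user13 to 125 and 135 in the second, which agrees with the corrections given as examples in the text.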
FIG. 18 is a diagram illustrating a division state when the dummy value is corrected. FIG. 19 is a diagram illustrating the distribution and division of user data when the dummy value is corrected. In FIG. 19, the data surrounded by a circle is data in which the dummy value is corrected.
After the correction of the dummy values, in step S15, the first dividing unit 140 of the first data dividing device 100 and the second dividing unit 240 of the second data dividing device 200 determine whether division candidate points exist in each of the two groups. The first dividing unit 140 and the second dividing unit 240 determine that no division candidate point exists in the group {user 1 to 3, 7 to 9, 11, 12, 14}. The first dividing unit 140 and the second dividing unit 240 determine that “Y = 120” and “Y = 130” are division candidate points for the group {user4 to 6,10,13,15}. Then, the process returns to step S13 in FIG. The first dividing unit 140 and the second dividing unit 240 calculate the total entropy after the division using MPC. As a result, the first dividing unit 140 and the second dividing unit 240 determine “Y = 130” as the dividing point.
FIG. 20 is a diagram illustrating a state where the group {user4 to 6, 10, 13, 15} in the data illustrated in FIG. 18 is further divided at “Y = 130”. FIG. 21 is a diagram illustrating the distribution of the user data with the dummy values corrected and the division of the group {user4 to 6,10,13,15} at “Y = 130”. In FIG. 20, “A” and “B” are mixed in the group {user5, 6, 13, 15}, but since the number of “B” is clearly larger, the group is determined to be a group of “B”.
FIG. 22 is a diagram illustrating an example of a classification tree that is finally generated from the data of user1 to user15 when the dummy value is corrected. According to the present invention, the classification tree shown in FIG. 22 is generated. The classification tree may be generated by collecting data division processes as information by a generation unit (not shown in FIGS. 14 and 25).
FIG. 23 is a diagram illustrating the classification result of user1 to user15 using the classification tree generated from the user data whose dummy values are corrected. As shown in FIG. 23, compared with the result of FIG. 13, correcting the dummy values makes it possible to generate a classification tree that correctly classifies the data of user5 and user6 among the data of user4 to user6, whose classification result was “unknown”. The classification tree is the most accurate classification tree compared to the classification trees shown in FIGS.
As described above, according to the data dividing apparatus 100 according to the first embodiment, in the distributed processing of data by a plurality of apparatuses, the non-common data held by one apparatus is effectively used, thereby achieving high accuracy. Data can be divided. In the data dividing apparatus 100 according to the first embodiment, dummy values are set and used for non-common data, and the non-common data is effectively used by correcting the dummy value for each division.
<Second Embodiment>
Next, the functional configuration of the first data dividing device 300 according to the second embodiment of the present invention will be described.
FIG. 24 is a block diagram illustrating a configuration of the first data dividing device 300 according to the second embodiment. As shown in FIG. 24, the first data dividing device 300 is different from the first data dividing device 100 in the first embodiment in that it includes a first adjustment unit 310. Since components other than the first adjustment unit 310 have the same configuration as that of the first embodiment, the same reference numerals are given and description thereof is omitted.
After the dummy value is corrected by the first correction unit 150, the first adjustment unit 310 adjusts the dummy value based on the amount of change between the dummy value before the correction and the dummy value after the correction. When the data distribution has a characteristic such as a correlation, the first adjustment unit 310 adjusts the dummy value based on the amount of change in the dummy value, so that the first data dividing device 300 can divide the data more accurately, reflecting that characteristic.
FIG. 25 is a flowchart showing the operation of the first data dividing device 300 according to the second embodiment of the present invention. As shown in FIG. 25, the operation of the first data dividing device 300 differs from the operation of the first data dividing device 100 shown in FIG. in that it has step S16, in which the dummy value is adjusted based on the amount of change.
After step S16, the first dividing unit 140 determines whether the data whose dummy values have been corrected and adjusted after the division can be further divided (step S15).
FIGS. 26 and 27 are diagrams for explaining the function of the first adjustment unit 310.
FIG. 26 is a diagram illustrating data obtained by adjusting the dummy values of the corrected data illustrated in FIG. 17. After the dummy values are corrected by the first correction unit 150, the first adjustment unit 310 adjusts the dummy values for each group based on the amount of change between the dummy values of the dummy data (data of user7 to user15) before the correction and those after the correction. Any adjustment method based on the amount of change in the dummy values before and after the correction may be used. In the following, an adjustment method based on the change in the center of gravity (average value) of the dummy values is described as an example.
First, consider the groups {user 1 to 3, 7 to 9, 11, 12, 14}. The dummy data belonging to the group is data of users 7 to 9, 11, 12, and 14.
In the user data before correction by the first correction unit 150 illustrated in FIG. 7, the dummy values of the users 7 to 9, 11, 12, and 14 are “115”, “125”, “135”, “125”, “ 135 "and" 125 ". Therefore, the first adjusting unit 310 calculates the value (average value) of the dummy values before correction in the group as “126.666” by (115 + 125 + 135 + 125 + 135 + 125) ÷ 6.
In addition, in the data corrected by the first correction unit 150 shown in FIG. 17, the dummy values of user7 to user9, user11, user12, and user14 are “115”, “125”, “115”, “125”, “115”, and “115”, respectively. Accordingly, the first adjustment unit 310 calculates the center of gravity (average value) of the corrected dummy values in the group as “118.333” by (115 + 125 + 115 + 125 + 115 + 115) ÷ 6.
Since the center of gravity has changed from “126.666” to “118.333” as a result of the correction, the amount of change in the center of gravity of the dummy values is approximately “−8.3”. Therefore, the first adjustment unit 310 subtracts “10” from predetermined dummy values among the dummy data (data of user7 to user9, user11, user12, and user14) belonging to the group {user 1 to 3, 7 to 9, 11, 12, 14}. Here, since the dummy values are assumed to take values in increments of 10 starting from “115”, “−8.3” is rounded to “−10”. In the present embodiment, the dummy value takes a value in the range of “115 to 135” and does not take a value below 115 or above 135.
As illustrated in FIG. 26, the first adjustment unit 310 adjusts, for example, the dummy values of user8 and user11 from “125” to “115”.
Similarly, consider the group {user4-6,10,13,15}. The dummy data belonging to the group is data of users 10, 13, and 15.
In the data before correction by the first correction unit 150 shown in FIG. 7, the dummy values of the users 10, 13, and 15 are “115”, “115”, and “135”, respectively. Therefore, the first adjustment unit 310 calculates the value (average value) of the dummy values before correction in the group as “121.666” by (115 + 115 + 135) ÷ 3.
In addition, in the data corrected by the first correction unit 150 shown in FIG. 17, the dummy values of the users 10, 13, and 15 are “125”, “135”, and “135”, respectively. Accordingly, the first adjustment unit 310 calculates the value (average value) of the corrected dummy values in the group as “131.666” by (125 + 135 + 135) ÷ 3.
Since the center of gravity has changed from “121.666” to “131.666” as a result of the correction, the amount of change in the center of gravity of the dummy values is “+10”. Therefore, the first adjustment unit 310 adds “10” to predetermined dummy values among the dummy data (data of user10, user13, and user15) belonging to the group {user4 to 6,10,13,15}.
As shown in FIG. 26, the first adjustment unit 310 adjusts, for example, the dummy value of user10 from “125” to “135”.
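The adjustment based on the change in the center of gravity can be sketched as follows. The dummy values are the ones listed in the text; the function name and the rule for choosing which dummy values to shift are assumptions. Here the rounded shift is applied to every dummy value and the result is clamped to the range 115 to 135, which leaves values already at the boundary unchanged.

```python
def adjust_dummies(before, after, step=10, lo=115, hi=135):
    """Shift corrected dummy values by the change in their average,
    rounded to a multiple of `step` and clamped to [lo, hi].

    before: dict user ID -> dummy value before correction.
    after:  dict user ID -> dummy value after correction (same IDs).
    """
    ids = list(after)
    mean_before = sum(before[u] for u in ids) / len(ids)
    mean_after = sum(after[u] for u in ids) / len(ids)
    # Round the change in the center of gravity to the dummy-value increment.
    shift = round((mean_after - mean_before) / step) * step
    return {u: min(max(after[u] + shift, lo), hi) for u in ids}

# Group {user1-3, 7-9, 11, 12, 14}: dummy values before and after correction.
g1_before = {"user7": 115, "user8": 125, "user9": 135,
             "user11": 125, "user12": 135, "user14": 125}
g1_after = {"user7": 115, "user8": 125, "user9": 115,
            "user11": 125, "user12": 115, "user14": 115}
# Group {user4-6, 10, 13, 15}.
g2_before = {"user10": 115, "user13": 115, "user15": 135}
g2_after = {"user10": 125, "user13": 135, "user15": 135}

g1_adjusted = adjust_dummies(g1_before, g1_after)
g2_adjusted = adjust_dummies(g2_before, g2_after)
print(g1_adjusted, g2_adjusted)
```

With these inputs the shift is −10 for the first group and +10 for the second, so only user8, user11, and user10 actually change, which agrees with the adjustments given as examples in the text.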
FIG. 27 is a diagram illustrating a division state when the dummy value is adjusted. FIG. 28 is a diagram showing the distribution and division of user data when the dummy value is adjusted. In FIG. 28, data surrounded by a circle is data in which the dummy value is adjusted.
FIG. 29 is a diagram illustrating an example of a classification tree that is finally generated from the data of user1 to user15 when the dummy value is adjusted. According to the present invention related to the second embodiment, the classification tree shown in FIG. 29 is generated.
FIG. 30 is a diagram illustrating the classification result of user1 to user15 using the classification tree generated from the data with adjusted dummy values. As shown in FIG. 30, adjusting the dummy values makes it possible to generate a classification tree that correctly classifies the data of user4, whose classification result was “unknown”, as compared with the result of FIG. The classification tree is the most accurate classification tree as compared to the classification trees shown in FIGS.
As described above, according to the data dividing apparatus 300 according to the second embodiment, it is possible to divide data with higher accuracy, reflecting the characteristics of the data. The reason is that, for example, when the data distribution has a feature such as a correlation, or a feature that the data fall within a certain range, the first adjustment unit 310 adjusts the corrected data so that the feature is emphasized.
<Third Embodiment>
Next, a third embodiment of the present invention will be described. The third embodiment of the present invention is a distributed anonymization system that performs distributed anonymization using the first data dividing device 400 and the second data dividing device 500.
Distributed anonymization is an anonymization technique for preventing individual identification and attribute estimation when combining information held in a distributed manner. The distributed anonymization technique is described in Non-Patent Document 2, for example.
In the technique of Non-Patent Document 2, when data is combined between two businesses, first, the personal information held by the two businesses is abstracted to generate initial anonymous data. The technique of Non-Patent Document 2 gradually embodies abstracted personal information while confirming that anonymity is satisfied.
In order to make the personal information more specific, a division point of the personal information is determined and the data are divided. In the technology described in Non-Patent Document 2, at each division, the business operator that holds the sensitive information checks whether the two indices, k-anonymity and l-diversity, are satisfied.
Here, the sensitive information is information that is not to be changed because it is used for information processing of the combined data.
k-anonymity is an index that requires the number of users with the same combination of quasi-identifiers to be k or more. l-diversity is an index that requires the number of kinds of sensitive information among users with the same combination of quasi-identifiers to be l or more. By providing users with data that satisfy the two indices, it is possible to prevent individuals from being identified from the data provided by the technology of Non-Patent Document 1, and to prevent personal sensitive information from becoming known.
In the following description of the present embodiment, it is required that personal information data satisfy 2-anonymity.
FIG. 31 is a block diagram showing a configuration of the first data dividing device 400 according to the third embodiment. As shown in FIG. 31, the first data dividing device 400 is different from the first data dividing device 100 in the first embodiment in that it includes a first determination unit 410. Since components other than the first determination unit 410 have the same configuration as that of the first embodiment, the same reference numerals are given and description thereof is omitted.
The first determination unit 410 determines, for each group after the division by the first dividing unit 140, whether the ratio of user identifiers existing in both the own device (first data division device 400) and the other device (second data division device 500) satisfies a predetermined anonymous index.
FIG. 32 is a block diagram illustrating a configuration of the second data dividing device 500 according to the third embodiment. As shown in FIG. 32, the configuration of the second data dividing device 500 may be the same as that of the first data dividing device 400.
FIG. 33 is a flowchart showing the operation of the first data dividing device 400 in the third embodiment. As shown in FIG. 33, the operation of the first data dividing device 400 differs from the operation of the first data dividing device 100 shown in FIG. in that it has step S18 for determining whether the anonymous index is satisfied. Further, the operation of the first data dividing device 400 does not have step S12 but has step S17 instead. The same operations as those in FIG. 16 are denoted by the same reference numerals, and the description thereof is omitted.
Next, with reference to FIGS. 34 to 43, each step of FIG. 33 will be described using a specific example. As a premise, the business operator S has the first data dividing device 400. Further, it is assumed that the business operator T has the second data dividing device 500.
Further, it is assumed that the business operator S holds personal information (data shown in FIG. 1) relating to “abdominal circumference” and “Class”. It is assumed that the business operator T holds personal information regarding “blood pressure” and “disease”.
FIG. 34 is a diagram illustrating an example of data held by the operator T in the third embodiment.
The personal information held by each business operator is associated with a common identifier managed by the identifier management business operator. In the following examples, “abdominal circumference” is the first personal information, “blood pressure” is the second personal information, and “disease” is the sensitive information.
In step S11 of FIG. 33, the first data dividing device 400 and the second data dividing device 500 obtain user identifiers held by each other.
FIG. 35 is a diagram illustrating data obtained by resampling the data of user7 to user15. More specifically, in step S17 of FIG. 33, based on the data shown in FIG. 1 and the data shown in FIG., the second setting unit 230 of the business operator T (second data dividing device 500) sets dummy values for the second personal information of the data of user7 to user15 by the resampling method and sets the sensitive information appropriately, as shown in FIG. 35.
In step S17, the first setting unit 130 and the second setting unit 230 generate initial anonymous data. For example, the first setting unit 130 generates the initial anonymous data shown in FIG. 36 from the data in FIG. Likewise, the second setting unit 230 generates the initial anonymous data shown in FIG. 37 from the data of FIG.
As shown in FIGS. 36 and 37, the initial anonymous data includes a user ID, a quasi-identifier (information on blood pressure, waist circumference, and class), and sensitive information (information on disease).
In step S13 of FIG. 33, as in the first embodiment, the first dividing unit 140 and the second dividing unit 240 determine that “X = 90” is the point at which the total entropy is minimized, and determine it as the dividing point. The business operator S, which holds the information on the waist circumference (X), transmits information on the contents of the groups obtained by dividing at “X = 90” to the business operator T via the first transmission / reception unit 110. Specifically, the first dividing unit 140 transmits the division information (information indicating that the data are divided into the two groups {user 1 to 3, 7 to 9, 11, 12, 14} and {user 4 to 6, 10, 13, 15}) to the second transmission / reception unit 210 of the business operator T.
FIG. 38 is a diagram illustrating user data obtained by dividing the data in FIG. 36 by “X = 90”. FIG. 39 is a diagram illustrating user data obtained by dividing the user data in FIG. 37 by “X = 90”.
Next, in step S18 of FIG. 33, the first determination unit 410 and the second determination unit 420 determine, for each group after the division by the first dividing unit 140 of the first data dividing device 400, whether the ratio of user identifiers existing in both the first data dividing device 400 (own device) and the second data dividing device 500 (other device) satisfies the predetermined anonymous index.
“User identifier” means an identifier of a user stored in the device itself. Specifically, the user identifiers of the business operators S are users 1 to 15. User identifiers of the business operator T are users 1 to 6.
In the present embodiment, the predetermined anonymous index is 2-anonymity. The group {user 1 to 3, 7 to 9, 11, 12, 14} in FIG. 39 (the group in the first row) is 3-anonymous because 6 of the 9 users are dummy and the remaining 3 are not. Likewise, the group {user4-6,10,13,15} (the group in the second row) is 3-anonymous because 3 of the 6 users are dummy. Therefore, both groups satisfy 2-anonymity.
In this example, since the business operator T holds the sensitive information, the business operator T may confirm the anonymity index.
In this example, since the dummy data is included in the user data of the business operator T, it is not difficult to confirm that the index is satisfied. If dummy data is also included in the data of the business operator S, the first determination unit 410 may use MPC to confirm that both the data of the business operator S and the data of the business operator T satisfy the anonymity index.
If it is confirmed that the anonymity index of the data held by the business operator T is maintained, the second correction unit 250 corrects the dummy values by the same procedure as in the first embodiment in step S14 of FIG.
FIG. 40 is a diagram showing user data obtained by correcting dummy values with respect to the data shown in FIG. By correcting the dummy value by the second correction unit 250, it becomes possible to divide the user data with high accuracy at the next step.
Next, the first dividing unit 140 and the second dividing unit 240 determine, by the same method as in the first embodiment, whether the user data whose dummy values have been corrected after the division can be divided further (step S15). When it is determined that there is a division candidate point, the first dividing unit 140 and the second dividing unit 240 calculate the total entropy after division by using MPC and, as in the first embodiment, determine “Y = 130” as the dividing point.
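Choosing the dividing point by minimum total entropy can be illustrated with the plaintext sketch below. It is not the MPC protocol of the embodiment: both attributes are assumed to be visible in one place, the total entropy is assumed to be the size-weighted sum used in ID3, and the rows, candidate points, and names (`entropy`, `total_entropy_after_split`, `best_split`) are hypothetical toy values rather than the data of the figures.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def total_entropy_after_split(rows, attr, threshold):
    """Size-weighted entropy of the two groups produced by attr < threshold."""
    left = [r["cls"] for r in rows if r[attr] < threshold]
    right = [r["cls"] for r in rows if r[attr] >= threshold]
    n = len(rows)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)


def best_split(rows, candidates):
    """Pick the (attribute, threshold) candidate with minimum total entropy."""
    return min(candidates,
               key=lambda c: total_entropy_after_split(rows, c[0], c[1]))


# Hypothetical toy rows: x = waist circumference, y = blood pressure.
rows = [
    {"x": 80, "y": 120, "cls": "A"},
    {"x": 85, "y": 125, "cls": "A"},
    {"x": 88, "y": 140, "cls": "B"},
    {"x": 82, "y": 145, "cls": "B"},
]
candidates = [("x", 84), ("y", 130)]
print(best_split(rows, candidates))  # ('y', 130)
```

With this toy data, dividing at y = 130 separates the two classes completely (total entropy 0), whereas dividing at x = 84 leaves both groups mixed (total entropy 1), so the former is chosen.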
In step S13 in FIG. 33 again, the first dividing unit 140 and the second dividing unit 240 divide the user data. FIG. 41 is a diagram illustrating data obtained by dividing the data in FIG. 38 by “Y = 130”. FIG. 42 is a diagram illustrating user data obtained by dividing the data in FIG. 39 by “Y = 130”.
In step S18 of FIG. 33, the second determination unit 420 confirms whether the anonymity index is maintained for the user data of FIG. The second determination unit 420 determines that the {user 4, 10} group (the group in the second row) is 1-anonymous because one of its two users is a dummy, and therefore does not satisfy 2-anonymity.
If the second determination unit 420 determines that the anonymity index is not satisfied, the first data dividing device 400 and the second data dividing device 500 cancel the most recent division. The first data dividing device 400 and the second data dividing device 500 then use MPC to calculate the number of persons existing in each group of the data whose division was canceled, and generate combined anonymized data.
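The cancellation of a division that violates the anonymity index can be sketched as follows. This is a simplified illustration under stated assumptions: the rollback is applied to a single group rather than to the whole division step, the MPC-based head count is not modeled, and the function name `split_with_rollback` and the example group memberships are hypothetical.

```python
def split_with_rollback(group, left_ids, real_ids, k):
    """Try to split `group`; keep the split only if both parts stay k-anonymous.

    Assumption: a part is k-anonymous when it holds at least k non-dummy users.
    Returns [left, right] on success, or [group] when the split is canceled.
    """
    left = group & left_ids
    right = group - left_ids
    for part in (left, right):
        if len(part & real_ids) < k:
            return [group]  # cancel the division performed last
    return [left, right]


real_ids_t = set(range(1, 7))  # business operator T actually holds users 1 to 6
group = {4, 5, 6, 10, 13, 15}

# Splitting off {user 4, 10} leaves only one real user in that part,
# so 2-anonymity fails and the division is canceled.
canceled = split_with_rollback(group, {4, 10}, real_ids_t, k=2)
print(len(canceled))  # 1

# A hypothetical split that keeps two real users on each side is accepted.
accepted = split_with_rollback({1, 2, 3, 4, 7, 8}, {1, 2, 7}, real_ids_t, k=2)
print(len(accepted))  # 2
```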
FIG. 43 is a diagram showing final combined anonymized data (joined anonymized data) generated by the present invention according to the third embodiment.
Here, even by referring to the finally output user data shown in FIG. 43, the business operator S cannot tell for certain which users' data actually exist in the data of the business operator T. Likewise, the business operator T cannot tell for certain which users' data actually exist in the data of the business operator S.
As described above, the distributed anonymization system according to the third embodiment of the present invention eliminates the risk that the existence of a user's data leaks to other business operators, and can execute distributed anonymization processing with highly accurate data division by effectively using the non-common data held by one device. One reason is that the system includes dummy data, which does not actually exist, in the data transmitted to the other business operator, and thereby prevents the existence of user data from leaking to the other business operator during the distributed anonymization processing. Another reason is that correcting the dummy values at each division makes it possible to divide the data at an appropriate division point.
<Fourth embodiment>
Next, with reference to FIG. 44, a functional configuration of a data dividing device 600 according to the fourth embodiment of the present invention will be described.
FIG. 44 is a block diagram showing a configuration of a data dividing device 600 according to the fourth embodiment. As illustrated in FIG. 44, the data dividing device 600 includes a transmission / reception unit 610, a storage unit 620, a setting unit 630, a dividing unit 640, and a correcting unit 650. These are the same configurations as the first transmission / reception unit 110, the first storage unit 120, the first setting unit 130, the first division unit 140, and the first correction unit 150 described above.
The data dividing device 600 stores first personal information, and divides data while communicating with another device storing second personal information via the transmission / reception unit 610.
The transmission / reception unit 610 transmits / receives data to / from other devices.
The storage unit 620 stores a plurality of user identifiers in association with values of the first personal information as user data.
The setting unit 630 sets a dummy value of the first personal information, as dummy data, for data that is stored in the other device, as received by the transmission / reception unit 610, and is not stored in the storage unit 620.
The dividing unit 640 divides predetermined data including dummy data into groups by dividing points determined based on the value of the first personal information including the dummy value or the value of the second personal information.
The correction unit 650 corrects the dummy value for each division based on the value of the first personal information of data other than the dummy data among the data belonging to the group after the division.
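The setting and correction steps performed by the setting unit 630 and the correction unit 650 can be sketched as follows. This is a minimal illustration, not the patented implementation: replacing each dummy value with the mean of the non-dummy values in its group is an assumption (the description only requires that the correction be based on those values), and the function names, the initial dummy value of 0, and the sample data are hypothetical.

```python
from statistics import mean


def set_dummy_values(own, other_ids, initial_dummy):
    """Add a dummy value for every identifier the other device holds but we do not.

    Returns the extended data and the set of identifiers that are dummies.
    """
    data = dict(own)
    dummy_ids = set(other_ids) - set(own)
    for uid in dummy_ids:
        data[uid] = initial_dummy
    return data, dummy_ids


def correct_dummy_values(data, dummy_ids, group):
    """Replace dummy values in `group` with the mean of its non-dummy values.

    Assumption: the mean stands in for "based on the value of the first
    personal information of data other than the dummy data".
    """
    real = [data[u] for u in group if u not in dummy_ids]
    if not real:
        return dict(data)  # nothing to base a correction on
    corrected = dict(data)
    fill = mean(real)
    for u in group:
        if u in dummy_ids:
            corrected[u] = fill
    return corrected


own = {1: 120, 2: 140}   # own device: user identifier -> first personal information
other_ids = [1, 2, 3]    # identifiers received from the other device
data, dummy_ids = set_dummy_values(own, other_ids, initial_dummy=0)
corrected = correct_dummy_values(data, dummy_ids, group={1, 2, 3})
print(corrected[3])  # 130
```

Correcting the dummy value toward the values of the actual users in the same group keeps the dummy from distorting the choice of the next division point.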
As described above, the data dividing device 600 according to the fourth embodiment can divide data with high accuracy in distributed processing of data by a plurality of devices, by effectively using the non-common data held by one device.
Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
Note that the data dividing devices held by two or more different business operators according to the present invention may be devices that are separated from each other in terms of management; for example, they may be virtually separated devices. Further, for example, the storage units of the data dividing devices of the respective business operators may be held in the same database, managed in such a way that it is clear that the data is held by different business operators.
FIG. 45 is a block diagram illustrating an example of a hardware configuration of the first data dividing device 100 according to the first embodiment.
As shown in FIG. 45, each unit constituting the first data dividing device 100 is realized by a computer device including a CPU (Central Processing Unit) 1, a communication IF 2 (communication interface 2) for network connection, a memory 3, and a storage device 4, such as a hard disk, that stores a program. However, the configuration of the first data dividing device 100 is not limited to the computer device shown in FIG.
For example, the first transmission / reception unit 110 may be realized by the communication IF 2.
The CPU 1 controls the first data dividing device 100 by operating the operating system. Further, the CPU 1 reads a program and data from a recording medium mounted on, for example, a drive device to the memory 3 and executes various processes according to the program and data.
For example, the first setting unit 130, the first dividing unit 140, and the first correcting unit 150 may be realized by the CPU 1 and a program.
The storage device 4 is, for example, an optical disk, a flexible disk, a magneto-optical disk, an external hard disk, or a semiconductor memory, and records a computer program in a computer-readable manner. The computer program may also be downloaded from an external computer (not shown) connected to a communication network.
For example, the first storage unit 120 may be realized by the storage device 4.
The block diagrams used in the embodiments described so far show blocks in functional units rather than configurations in hardware units. These functional blocks are realized by any combination of hardware and software. In addition, the means for realizing the components of the first data dividing device 100 is not particularly limited. That is, the first data dividing device 100 may be realized by one physically integrated device, or by two or more physically separate devices connected by wire or wirelessly.
The program of the present invention may be a program that causes a computer to execute the operations described in the above embodiments.
This application claims priority based on Japanese Patent Application No. 2011-205519 filed on September 21, 2011, the entire disclosure of which is incorporated herein.

1 CPU
2 Communication IF
3 Memory
4 Storage device
100, 300, 400 First data dividing device
110 First transmission / reception unit
120 First storage unit
130 First setting unit
140 First dividing unit
150 First correction unit
310 First adjustment unit
410 First determination unit
600 Data dividing device
610 Transmission / reception unit
620 Storage unit
630 Setting unit
640 Dividing unit
650 Correction unit
200, 500 Second data dividing device
210 Second transmission / reception unit
220 Second storage unit
230 Second setting unit
240 Second dividing unit
250 Second correction unit
420 Second determination unit

Claims

A data dividing device that divides user data including first personal information of an own device, a user identifier assigned to the first personal information of the own device, and a user identifier assigned to second personal information of another device, the data dividing device comprising:
transmitting / receiving means for acquiring the user identifiers of the other device;
storage means for storing the user identifiers of the own device and the values of the first personal information associated with those identifiers;
setting means for setting a dummy value of the first personal information, as dummy data, in association with each acquired user identifier of the other device that does not match any user identifier of the own device;
dividing means for dividing predetermined user data including the dummy data into groups at a division point determined based on the value of the first personal information including the dummy value or the value of the second personal information; and
correction means for correcting the set dummy value of the first personal information based on the values of the first personal information, among the user data belonging to each group after the division, for which the user identifiers of the own device and the other device match.

The data dividing device according to claim 1,
wherein the dividing means determines, as the division point, the point at which the entropy obtained when the predetermined user data is divided at division candidate points determined by a predetermined rule is minimum, or a point approximating the minimum, and divides the predetermined user data into groups.

The data dividing device according to claim 1 or 2,
wherein the correction means corrects the dummy values in each group after the division according to the distribution of the values of the first personal information, among the user data belonging to that group, for which the user identifiers of the own device and the other device match.

The data dividing device according to any one of claims 1 to 3, further comprising:
adjusting means for adjusting, after the correction means corrects the dummy value, the dummy value based on the amount of change between the dummy value before the correction and the dummy value after the correction.

The data dividing device according to claim 4,
wherein the amount of change is the difference between the average of predetermined dummy values belonging to the group before the correction and the average of the predetermined dummy values after the correction.

The data dividing device according to any one of claims 1 to 4, further comprising:
determination means for determining, for each group after the division by the dividing means, whether the ratio of user identifiers existing in both the own device and the other device satisfies a predetermined anonymity index.

A data dividing system that divides user data including first personal information of a first data dividing device, a user identifier assigned to the first personal information of the first data dividing device, and a user identifier assigned to second personal information of a second data dividing device, wherein
the first data dividing device includes:
first transmitting / receiving means for acquiring the user identifiers of the second data dividing device;
first storage means for storing the user identifiers of the first data dividing device and the values of the first personal information associated with those identifiers;
first setting means for setting a first dummy value of the first personal information, as dummy data, in association with each acquired user identifier of the second data dividing device that does not match any user identifier of the first data dividing device;
first dividing means for dividing predetermined user data including the dummy data into groups at a division point determined based on the value of the first personal information including the first dummy value or the value of the second personal information including a second dummy value set by the second data dividing device; and
correction means for correcting the set first dummy value based on the values of the first personal information, among the user data belonging to each group after the division, for which the user identifiers of the first data dividing device and the second data dividing device match, and
the second data dividing device includes:
second transmitting / receiving means for acquiring the user identifiers of the first data dividing device;
second storage means for storing the user identifiers of the second data dividing device and the values of the second personal information associated with those identifiers;
second setting means for setting a second dummy value of the second personal information, as dummy data, in association with each acquired user identifier of the first data dividing device that does not match any user identifier of the second data dividing device;
second dividing means for dividing predetermined user data including the dummy data into groups at a division point determined based on the value of the second personal information including the second dummy value or the value of the first personal information including the first dummy value; and
correction means for correcting the set second dummy value based on the values of the second personal information, among the user data belonging to each group after the division, for which the user identifiers of the second data dividing device and the first data dividing device match.

A data dividing method for dividing user data including first personal information of an own device, a user identifier assigned to the first personal information of the own device, and a user identifier assigned to second personal information of another device, the method comprising:
acquiring the user identifiers of the other device;
storing the user identifiers of the own device and the values of the first personal information associated with those identifiers;
setting a dummy value of the first personal information, as dummy data, in association with each acquired user identifier of the other device that does not match any user identifier of the own device;
dividing predetermined user data including the dummy data into groups at a division point determined based on the value of the first personal information including the dummy value or the value of the second personal information; and
correcting the set dummy value of the first personal information based on the values of the first personal information, among the user data belonging to each group after the division, for which the user identifiers of the own device and the other device match.

A program for realizing a data dividing device that divides user data including first personal information of an own device, a user identifier assigned to the first personal information of the own device, and a user identifier assigned to second personal information of another device, the program causing a computer to execute processing of:
acquiring the user identifiers of the other device;
storing the user identifiers of the own device and the values of the first personal information associated with those identifiers;
setting a dummy value of the first personal information, as dummy data, in association with each acquired user identifier of the other device that does not match any user identifier of the own device;
dividing predetermined user data including the dummy data into groups at a division point determined based on the value of the first personal information including the dummy value or the value of the second personal information; and
correcting the set dummy value of the first personal information based on the values of the first personal information, among the user data belonging to each group after the division, for which the user identifiers of the own device and the other device match.