WO2013042788A1 - Data dividing device, data dividing system, data dividing method, and program

Data dividing device, data dividing system, data dividing method, and program

Info

Publication number
WO2013042788A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
personal information
user
value
dummy
Prior art date
Application number
PCT/JP2012/074311
Other languages
English (en)
Japanese (ja)
Inventor
隆夫 竹之内
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2013534778A (patent JP6015661B2)
Publication of WO2013042788A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2123 Dummy operation

Definitions

  • The present invention relates to a technique for appropriately dividing data that is held in a distributed manner, without the holders disclosing their data to each other.
  • Non-Patent Document 1 discloses a classification tree generation technique by data mining regarding data that is subject to privacy protection that is distributed and held.
  • The technology described in Non-Patent Document 1 realizes the ID3 (Iterative Dichotomiser 3) algorithm in a distributed environment, in a situation where a plurality of business operators hold different types of personal information about users. Each business operator generates a classification tree without disclosing its data.
  • The technique described in Non-Patent Document 1 determines the division points of the classification tree by entropy calculation using MPC (Multi-Party Computation).
  • FIG. 46 is a diagram illustrating an example of data held by the operator S.
  • the business operator S holds data on “userID”, “abdominal circumference (X)”, and “Class”.
  • “UserID” is an identifier of a user registered as data.
  • the business operator S holds data relating to a total of nine users, user1 to user9.
  • “Abdominal circumference (X)” is data indicating the user's abdominal circumference.
  • “Class” takes the value “A” or “B”: “A” indicates that the user is non-metabolic, and “B” indicates that the user is metabolic.
  • FIG. 47 is a diagram illustrating an example of data held by the operator T.
  • FIG. 52 is a diagram illustrating an example of a classification tree finally generated by the technique described in Non-Patent Document 1. In the case of the above example, according to the technique described in Non-Patent Document 1, the classification tree shown in FIG. 52 is generated.
  • The problem with the technology described in Non-Patent Document 1 is that, when the data of a non-common user held by only one operator (device) is significant, that data is not reflected in the division process and the classification accuracy deteriorates. This is because the technology processes only the common user data held by both operators (devices); non-common user data held by only one operator (device) is not subject to processing. In addition, even if values are randomly assigned at the other operator (device) for the non-common users held by one operator (device), the classification accuracy is not necessarily good.
  • the data dividing device includes the first personal information of the own device, the user identifier assigned to the first personal information of the own device, and the second individual of the other device.
  • a data dividing device for dividing user data including a user identifier assigned to information, the transmitting / receiving means for obtaining a user identifier of the other device, the user identifier of the own device, and the first associated with the identifier
  • the dummy data associated with the user identifier of the other device that does not match the user identifier of the own device among the obtained user identifiers of the other device.
  • a data division system includes first personal information of a first data division device, and a user identifier assigned to the first personal information of the first data division device.
  • First transmission / reception means for acquiring a user identifier, first storage means for storing a user identifier of the first data dividing device and a value of the first personal information associated with the identifier, and Of the user identifiers of the second data dividing device, the dummy associated with the user identifier of the second data dividing device that does not match the user identifier of the first data dividing device
  • a first dividing means for dividing into groups based on a dividing point determined based on the value of the second personal information including the value of the second data dividing device or the second dummy value set by the second data dividing device; and Among the user data belonging to the
  • the second storage means for storing the associated second personal information value and the user identifier of the second data dividing device among the acquired user identifiers of the first data dividing device
  • the second setting means for setting the second dummy value of the second personal information as dummy data associated with the user identifier of the first data dividing device, and predetermined user data including the dummy data
  • the second division into groups by the division point determined based on the value of the second personal information including the second dummy value or the value of the first personal information including the first dummy value Based on the value of the second personal information in which the user identifiers of the second data dividing device and the first data dividing device match among the dividing means and user data belonging to the group after the division.
  • A data dividing method includes: first personal information of an own device; a user identifier assigned to the first personal information of the own device; and second personal information of another device.
  • a data dividing method for dividing user data including a user identifier assigned to a device, wherein the user identifier of the other device is acquired, the user identifier of the own device and the first personal information associated with the identifier
  • a dummy value of the first personal information is set as dummy data associated with the user identifier of the other device that does not match the user identifier of the own device among the acquired user identifiers of the other device.
  • the predetermined user data including the dummy data is divided based on the value of the first personal information or the value of the second personal information including the dummy value. Therefore, it is divided into groups, and among the user data belonging to the group after the division, based on the value of the first personal information where the user identifiers of the own device and the other device match, The dummy value of personal information is corrected.
  • the program according to the present invention includes the first personal information of the own device, the user identifier assigned to the first personal information of the own device, and the second personal information of the other device.
  • a program for realizing a data dividing device that divides user data including an assigned user identifier, obtains a user identifier of the other device, and obtains a user identifier of the own device and the first identifier associated with the identifier.
  • the dummy information of the first personal information is stored as dummy data associated with the user identifier of the other device that does not match the user identifier of the own device among the acquired user identifiers of the other device.
  • a value is set, and predetermined user data including the dummy data is set based on the value of the first personal information or the value of the second personal information including the dummy value.
  • the computer is caused to execute a process of correcting the dummy value of the set first personal information.
  • An example of the effect of the present invention is that, in distributed processing of data by a plurality of devices, data can be divided with high accuracy by making effective use of non-common user data held by only one device.
  • FIG. 14 is a diagram illustrating data obtained by determining values of user7 to user15 by resampling in the third embodiment.
  • Non-Patent Document 1 is used as a technique for generating a classification tree.
  • the identifier management business operator notifies each business operator of the identifier of the user who is the target of classification tree generation. For example, it is assumed that the identifiers of user1 to user15 are notified to each business operator. It is assumed that the business operator S holds the data shown in FIG.
  • the business operator S holds personal information (data of “abdominal circumference (X)” and “Class”) regarding users with identifiers of user 1 to user 15. “Class” is displayed as “A” or “B”, “A” indicates that the user is non-metabolic, and “B” indicates that the user is metabolic. It is assumed that the business operator T holds the data shown in FIG. 2 regarding the user with the notified identifier. As shown in FIG. 2, the business operator T holds personal information (data on “blood pressure (Y)”) related to users with identifiers of user 1 to user 6.
  • FIG. 5 is a diagram illustrating an example of a classification tree that is finally generated from the data illustrated in FIGS. 1 and 2.
  • the technique described in Non-Patent Document 1 determines that the above classification shown in FIGS. 3 and 4 is impossible, and generates the classification tree shown in FIG. According to the classification tree shown in FIG.
  • FIG. 6 is a diagram showing the classification results of user1 to user15 using the classification tree generated by the technique of Non-Patent Document 1.
  • the “correct answer” in FIG. 6 is data indicating whether “A” or “B” is correctly classified when the classification tree shown in FIG. 5 is used for classification. “ ⁇ ” indicates that the classification is correct.
  • Non-Patent Document 1 does not use these data in generating the classification tree.
  • “Resampling” refers to a method of determining new sample values based on an existing sample.
  • the values of user7 to user15 are determined based on the distribution of values of user1 to user6 that actually hold the value of “blood pressure (Y)”.
  • FIG. 7 is a diagram illustrating data in which values of user7 to user15 are determined by resampling with respect to data held by the operator T.
  • 115 is assigned as the value for the 110s
  • 125 is assigned as the value for the 120s
  • 135 is assigned as the value for the 130s.
  • the technique described in Non-Patent Document 1 calculates the entropy using MPC for the data of user1 to user15 that have become common users of the operator S and the operator T by setting a dummy value.
  • FIG. 12 is a diagram illustrating an example of a classification tree that is finally generated from data of user1 to user15 including data in which dummy values are set. When resampling is performed, according to the technique described in Non-Patent Document 1, the classification tree shown in FIG. 12 is generated.
  • FIG. 13 is a diagram illustrating a classification result of user1 to user15 using a classification tree generated from data including data for which dummy values are set.
  • As shown in FIG. 13, compared with the result of FIG. 6, performing resampling makes it possible to generate a classification tree that correctly classifies the data of users 7 to 9, 11, 12, and 14 among the data of user7 to user15 whose classification result was “unknown”.
  • However, because the resampled values are set randomly, although they follow the sample distribution, data that could otherwise be classified correctly, such as that of users 4 to 6, may become “unknown”.
  • FIG. 14 is a block diagram showing a configuration of the first data dividing device 100 according to the first embodiment.
  • The first data dividing device 100 includes a first transmission/reception unit 110, a first storage unit 120, a first setting unit 130, a first dividing unit 140, and a first correction unit 150.
  • FIG. 15 is a block diagram illustrating a configuration of the second data dividing device 200 according to the first embodiment. As shown in FIG. 15, the configuration of the second data dividing device 200 may be the same as that of the first data dividing device 100. In the present embodiment, the first data dividing device 100 and the second data dividing device 200 constitute a data dividing system. In the following description, the configuration of the first data dividing device 100 will be mainly described. In the present embodiment, two data dividing devices are described.
  • The first transmission/reception unit 110 communicates with an external device and thereby acquires information about the user identifiers held by its own device (first data dividing device 100) and by another device (second data dividing device 200). Because privacy is assumed to be protected, personal information such as “abdominal circumference” and “blood pressure” is not disclosed to the other party; only userID information, for example, is exchanged.
  • the first transmission / reception unit 110 may acquire information on a predetermined identifier serving as a population upon receiving a notification from an identifier management device held by an identifier management company, for example.
  • In this case, among the acquired predetermined identifier information, identifier information that the first data dividing device 100 (own device) does not hold may be determined to be information on non-common identifiers held by the second data dividing device 200.
  • The first transmission/reception unit 110 may acquire the userID information held by the second data dividing device 200 (another device) by directly exchanging userID information with the second transmission/reception unit 210.
  • the first data dividing device 100 may store information on the userID held by the second data dividing device 200 in the first storage unit 120 in advance.
  • In that case, the first transmission/reception unit 110 need not perform the process of acquiring the userID information of the second data dividing device 200.
  • the identifier used in this embodiment may be a national ID, for example. Alternatively, the identifier may be OpenID described in Non-Patent Document 2, and is not limited thereto.
  • the first storage unit 120 stores user data in which a plurality of user identifiers are associated with values of first personal information.
  • the “user identifier” means a user identifier stored in each data dividing device.
  • a user identifier stored in the first storage unit 120 means a user identifier stored in the first storage unit 120 of the first data dividing device 100, and the second data dividing device.
  • first personal information refers to one piece of personal information stored in the first data dividing apparatus 100 (self apparatus). For example, when the first storage unit 120 stores the user data shown in FIG. 1, the “first personal information” may be “abdominal circumference”.
  • The first setting unit 130 acquires, from the first storage unit 120, the user data (user identifiers and the values of the personal information associated with those identifiers) stored in the first data dividing device 100 (own device).
  • the first setting unit 130 acquires the user identifier held by the second data dividing device 200 (another device) from the first transmission / reception unit 110 or the first storage unit 120.
  • For user identifiers that are stored in the second data dividing device 200 (another device) but not in the first storage unit 120, the first setting unit 130 sets a dummy value as the value of the first personal information, thereby creating dummy data.
  • the dummy value is set by a resampling method according to the distribution of the value of the first personal information corresponding to the user identifier, for example.
  • the dummy value setting method by the first setting unit 130 is not limited to the resampling method, and may be another appropriate method.
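  • The resampling-based dummy setting described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `set_dummy_values`, the `(value, is_dummy)` record layout, and the sample blood pressure values are hypothetical and do not come from the specification, and drawing with replacement from the stored values is one simple way to follow their distribution.

```python
import random


def set_dummy_values(own_data, other_ids, rng=None):
    """Extend own_data with resampled dummy values for identifiers held only elsewhere.

    own_data  : dict mapping user identifier -> real personal-information value
    other_ids : iterable of user identifiers held by the other device
    Returns a dict mapping identifier -> (value, is_dummy).
    """
    rng = rng or random.Random()
    real_values = list(own_data.values())
    merged = {uid: (value, False) for uid, value in own_data.items()}
    for uid in other_ids:
        if uid not in own_data:
            # Draw with replacement from the empirical distribution of real values.
            merged[uid] = (rng.choice(real_values), True)
    return merged


# Hypothetical data: the operator holds blood pressure for user1 to user6 only.
blood_pressure = {"user1": 115, "user2": 115, "user3": 115,
                  "user4": 125, "user5": 135, "user6": 135}
all_ids = [f"user{i}" for i in range(1, 16)]
extended = set_dummy_values(blood_pressure, all_ids, random.Random(0))
```

  • Keeping an `is_dummy` flag next to each value makes the later correction step straightforward, because real values must never be overwritten.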
  • the first setting unit 130 outputs predetermined user data including dummy data to the first dividing unit 140.
  • the first dividing unit 140 divides predetermined user data including dummy data output from the first setting unit 130 into groups according to division points.
  • the division point is a threshold value for dividing the user data, and is determined based on the value of the first personal information including the dummy value or the value of the second personal information.
  • The first dividing unit 140 may communicate with the second dividing unit 240 via the first transmission/reception unit 110 and determine, as the division axis, the most appropriate personal information among the personal information held by the first data dividing device 100 and the second data dividing device 200.
  • the first dividing unit 140 may communicate with the second dividing unit 240 to determine the most appropriate dividing point among the values of the personal information.
  • The value of the second personal information used for determining the division point includes the dummy value (second dummy value) set by the second data dividing device 200 (another device).
  • the dividing method is not particularly limited.
  • the first dividing unit 140 may divide user data into two groups using an average value of predetermined personal information values as a dividing point. In this case, the first division unit 140 may transmit the contents of the group after division to the second division unit 240 via the first transmission / reception unit 110.
  • the first dividing unit 140 and the second dividing unit 240 may repeat the division in turn with the average value of the values of the personal information as the division points.
  • P (Class) is as follows.
  • the division candidate points may be determined by a predetermined rule (algorithm), and may be a known method.
  • the first dividing unit 140 divides the data into two groups at the division candidate points, and when the value obtained by adding the entropy of the two groups is S, the point where the value of S is the smallest is the division point. You may decide. Although it is preferable to determine the point at which the value of S is the smallest as the dividing point, the value is not limited to this and may be a value that approximates the value of S that is the smallest.
  • a small value of S means that there is little data mixing (mixing of “A” and “B”) in the two groups.
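  • The selection of a division point by minimizing S can be sketched as follows. The sketch assumes the standard Shannon entropy over the “Class” labels and takes midpoints between adjacent distinct values as division candidate points; both are assumptions, since the specification leaves the candidate rule to known methods. The function names and the sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels; 0.0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_division_point(values, labels):
    """Return (point, S) with the smallest S, or None if no candidate point exists.

    S is the sum of the entropies of the two groups produced by the split.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        point = (lo + hi) / 2  # assumed candidate rule: midpoint of neighbours
        below = [lab for v, lab in zip(values, labels) if v < point]
        at_or_above = [lab for v, lab in zip(values, labels) if v >= point]
        s = entropy(below) + entropy(at_or_above)
        if best is None or s < best[1]:
            best = (point, s)
    return best


# Hypothetical abdominal circumference values with their classes.
values = [70, 72, 75, 88, 90, 95]
labels = ["A", "A", "A", "B", "B", "B"]
point, s = best_division_point(values, labels)
```

  • With this sample, the split between 75 and 88 separates “A” from “B” completely, so S is zero there, which matches the statement that a small S means little mixing.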
  • Alternatively, among the predetermined division candidate points, the first dividing unit 140 may determine as the division point the candidate for which one of the two groups (the group at or above the point, or the group below it) has the minimum entropy. As above, choosing the candidate with the minimum value is preferable, but a candidate whose value approximates the minimum may be used.
  • the division point determination method using entropy is not limited to the above-described method, and other methods may be used. As described above, the first data dividing device 100 and the second data dividing device 200 do not know the values of the personal information of each other.
  • The first data dividing device 100 does not know the true value of the second personal information held by the second data dividing device 200. Therefore, the first dividing unit 140 may calculate the division point while taking the value of the second personal information into account by using MPC (Multi-Party Computation) or SMPC (Secure Multi-Party Computation). By using MPC or the like, the first dividing unit 140 can calculate the division point without the first data dividing device 100 and the second data dividing device 200 revealing the values of their personal information at all.
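  • As a rough intuition for how MPC lets parties compute on values they do not disclose, the sketch below uses additive secret sharing to obtain a sum of private integers. It is a toy illustration and not the protocol of Non-Patent Document 1; the modulus, the function names, and the example counts are assumptions, and all parties are simulated in one process.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible sum


def make_shares(secret, n_parties=2):
    """Split a non-negative integer into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    """Simulate a secret-sharing sum: each party only ever sees one share per input."""
    n = len(private_inputs)
    all_shares = [make_shares(x, n) for x in private_inputs]
    # Party j receives the j-th share of every input and publishes only their sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS


# Hypothetical private counts held by three parties.
total = secure_sum([4, 7, 9])
```

  • Each individual share is uniformly random, so it reveals nothing on its own. Note that with only two parties the published sum and one's own input together determine the other input, which is why practical protocols keep intermediate results such as entropy terms hidden as well.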
  • the first division unit 140 transmits division information indicating the contents of the identifier in each divided group to the second data division device 200 via the first transmission / reception unit 110.
  • the division information may be, for example, a list of user IDs divided at division points.
  • the first division unit 140 receives the division information transmitted from the second transmission / reception unit 210 via the first transmission / reception unit 110.
  • the first division unit 140 divides the data based on the received division information.
  • the first dividing unit 140 outputs the divided data to the first correcting unit 150.
  • At each division by the first dividing unit 140, the first correction unit 150 corrects the dummy values based on the values of the first personal information of the data other than the dummy data among the user data belonging to each group after the division.
  • the first correction unit 150 may correct the value of the dummy value in the group after the division according to the distribution of the value of the first personal information of the data corresponding to the user identifier.
  • the first correcting unit 150 outputs the data after correcting the dummy value to the first dividing unit 140 again.
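  • The per-group correction can be sketched as follows, assuming that a dummy value is simply redrawn from the real values that ended up in the same group. The function name, the `(value, is_dummy)` layout, and the real values given to user1 to user3 are hypothetical; the specification only requires that the correction follow the distribution of the real values in the group.

```python
import random


def correct_dummy_values(group, rng=None):
    """Redraw every dummy value in a group from the real values of that group.

    group : dict mapping user identifier -> (value, is_dummy)
    Returns a new dict; real values are left untouched.
    """
    rng = rng or random.Random()
    real_values = [value for value, is_dummy in group.values() if not is_dummy]
    if not real_values:
        return dict(group)  # no real data in this group, so nothing to correct against
    return {
        uid: ((rng.choice(real_values), True) if is_dummy else (value, False))
        for uid, (value, is_dummy) in group.items()
    }


# Hypothetical group: user1 to user3 hold real values, the others hold dummy values.
group = {"user1": (115, False), "user2": (115, False), "user3": (115, False),
         "user9": (135, True), "user12": (135, True), "user14": (125, True)}
corrected = correct_dummy_values(group, random.Random(0))
```

  • Because every real value in this sample group is 115, all dummy values move to 115 regardless of the random draw.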
  • the first dividing unit 140 determines whether or not the user data in the group whose dummy value is corrected can be further divided into two groups.
  • the first dividing unit 140 may determine whether or not further division is possible by determining whether or not there is a candidate division point by a known method in the group. If it is determined that the data can be further divided, the first dividing unit 140 further divides the data output from the first correcting unit 150 into groups.
  • FIG. 16 is a flowchart showing the operation of the first data dividing apparatus 100 according to the first embodiment of the present invention.
  • The first transmission/reception unit 110 acquires the userID information (user identifiers) of the data held by the second data dividing device 200 (step S11).
  • Among the user data including the acquired userID information, the first setting unit 130 sets a dummy value as the value of the first personal information, as dummy data, for data other than the data corresponding to the userIDs (user identifiers) actually stored by the first storage unit 120 (step S12).
  • The second transmission/reception unit 210 acquires the userID information held by the first data dividing device 100, and the second setting unit 230 sets a dummy value for the value of the second personal information.
  • the first dividing unit 140 determines whether or not there is a division candidate point when dividing predetermined user data including data for which dummy values are set. If it is determined that there is a division candidate point, the first division unit 140 calculates the total entropy of the two groups when the predetermined division candidate point is divided. The first dividing unit 140 determines a point having the lowest total entropy as a dividing point, and divides predetermined user data into two groups at that point (step S13).
  • The first correction unit 150 corrects the dummy values of the user data in each group after the division.
  • the first correction unit 150 may correct the dummy value in the divided group according to the distribution of the value of the first personal information of the data corresponding to the user identifier.
  • the first correction unit 150 outputs the user data in the group after the division whose dummy value has been corrected to the first division unit 140 again.
  • the first dividing unit 140 determines whether or not the user data whose dummy value has been corrected after the division can be further divided (step S15).
  • the first division unit 140 may determine whether or not further division is possible by determining whether or not division candidate points exist.
  • In step S13, the first dividing unit 140 further divides the group of user data whose dummy values have been corrected after the division. If it is determined that there is no division candidate point, the process ends.
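  • The loop of dividing, correcting the dummy values, and dividing again can be sketched in a single process as follows. This is a simplification under several assumptions: one attribute only, plain values instead of MPC, midpoints as division candidate points, a stop when a group is already pure, and a majority label at each leaf. All names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0


def find_division_point(rows):
    """rows: list of (value, is_dummy, label). Lowest-entropy midpoint, or None."""
    distinct = sorted({value for value, _, _ in rows})
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        point = (lo + hi) / 2
        below = [lab for v, _, lab in rows if v < point]
        at_or_above = [lab for v, _, lab in rows if v >= point]
        total = entropy(below) + entropy(at_or_above)
        if best is None or total < best[1]:
            best = (point, total)
    return None if best is None else best[0]


def correct(rows, rng):
    """Redraw dummy values from the real values in the same group."""
    real = [v for v, is_dummy, _ in rows if not is_dummy]
    if not real:
        return rows
    return [(rng.choice(real), True, lab) if is_dummy else (v, False, lab)
            for v, is_dummy, lab in rows]


def divide(rows, rng):
    """Divide, correct the dummy values in each group, and repeat while possible."""
    labels = [lab for _, _, lab in rows]
    point = find_division_point(rows) if len(set(labels)) > 1 else None
    if point is None:  # no division candidate point: end of processing for this group
        return {"leaf": Counter(labels).most_common(1)[0][0], "size": len(rows)}
    below = correct([r for r in rows if r[0] < point], rng)
    at_or_above = correct([r for r in rows if r[0] >= point], rng)
    return {"point": point,
            "below": divide(below, rng),
            "at_or_above": divide(at_or_above, rng)}


# Hypothetical rows: (value, is_dummy, class label).
rows = [(100, False, "A"), (105, False, "A"),
        (135, True, "B"), (140, False, "B"), (150, False, "B")]
tree = divide(rows, random.Random(0))
```

  • Every split leaves both groups non-empty, so the group size shrinks at each level and the recursion terminates even though the corrected dummy values change between levels.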
  • the first data dividing device 100 is a device of the business operator S.
  • the second data dividing device 200 is a device of the business operator T.
  • the following examples are based on the same situation as the above-described example. Specifically, it is assumed that the business operator S (first data division device 100) holds personal information (data shown in FIG.
  • In step S11 of FIG. 16, the first data dividing device 100 and the second data dividing device 200 exchange information on the userIDs of the data each of them holds.
  • The first transmission/reception unit 110 transmits the identifiers of user1 to user15 to the second transmission/reception unit 210, and receives the identifiers of user1 to user6 from the second transmission/reception unit 210.
  • In step S12 of FIG. 16, when the first transmission/reception unit 110 in the business operator S receives the identifiers of user1 to user6, the first setting unit 130 collates with the data shown in FIG. As a result of the collation, the first setting unit 130 determines that the identifiers of user1 to user6 stored in the second data dividing device 200 are included in the identifiers of user1 to user15 stored in the first data dividing device 100. Therefore, the first setting unit 130 does not set dummy data.
  • The second setting unit 230 in the operator T determines that the first data dividing device 100 stores the data of user7 to user15, which are not stored in the second data dividing device 200. Accordingly, as shown in FIG. 7, the second setting unit 230 sets dummy values of the second personal information for the data of user7 to user15 by the resampling method.
  • The first dividing unit 140 of the first data dividing device 100 and the second dividing unit 240 of the second data dividing device 200 communicate with each other and determine whether or not a division candidate point exists for the personal information.
  • The first dividing unit 140 transmits division information that divides the predetermined user data into {user1 to 3, 7 to 9, 11, 12, 14} and {user4 to 6, 10, 13, 15}.
  • The first dividing unit 140 divides the user data into two groups according to the division information ({user1 to 3, 7 to 9, 11, 12, 14}, {user4 to 6, 10, 13, 15}).
  • The second dividing unit 240 also divides the predetermined user data. FIGS. 8 and 9 are diagrams showing the state of the user data after division. In step S14 of FIG.
  • FIG. 17 is a diagram illustrating data obtained by correcting a dummy value with respect to the user data in FIG.
  • For each group, the second correction unit 250 of the second data dividing device 200 corrects the dummy values according to the distribution of the values of the second personal information of the data of identifiers (user identifiers) other than the dummy data (data of user7 to user15), that is, the data of user1 to user6.
  • the data of the user identifier in the second data division device 200 belonging to the group is data of user1 to user3.
  • The second correction unit 250 corrects, for example, user9 from “135” to “115”, user12 from “135” to “115”, and user14 from “125” to “115”.
  • FIG. 18 is a diagram illustrating a division state when the dummy value is corrected.
  • FIG. 19 is a diagram illustrating the distribution and division of user data when the dummy value is corrected.
  • the data surrounded by a circle is data in which the dummy value is corrected.
  • The first dividing unit 140 of the first data dividing device 100 and the second dividing unit 240 of the second data dividing device 200 determine whether or not a division candidate point exists in each of the two groups.
  • The first dividing unit 140 and the second dividing unit 240 determine that there is no division candidate point in the group {user1 to 3, 7 to 9, 11, 12, 14}.
  • FIG. 20 is a diagram illustrating a state in which the group {user4 to 6, 10, 13, 15} in the data illustrated in FIG. 18 is further divided by “
  • In FIG. 20, the group {user5, 6, 13, 15} contains a mixture of “A” and “B”, but because “B” is clearly more numerous, the group is determined to be a “B” group.
  • FIG. 22 is a diagram illustrating an example of a classification tree finally generated from the data of user1 to user15 when the dummy value is corrected. According to the present invention, the classification tree shown in FIG. 22 is generated.
  • the classification tree may be generated by collecting data division processes as information by a generation unit (not shown in FIGS. 14 and 25).
  • FIG. 23 is a diagram illustrating a classification result of user1 to user15 using a classification tree generated from user data whose dummy values are corrected.
  • By correcting the dummy values, a classification tree can be generated that correctly classifies the data of users 5 and 6, among the data of users 4 to 6 whose classification result was “unknown”, as compared with the result of FIG.
  • the classification tree is the most accurate classification tree compared to the classification trees shown in FIGS.
  • According to the first data dividing device 100, in distributed processing of data by a plurality of devices, the non-common data held by only one device is used effectively, so that data can be divided with high accuracy.
  • FIG. 24 is a block diagram illustrating a configuration of the first data dividing device 300 according to the second embodiment.
  • the first data dividing device 300 is different from the first data dividing device 100 in the first embodiment in that it includes a first adjustment unit 310. Since components other than the first adjustment unit 310 have the same configuration as that of the first embodiment, the same reference numerals are given and description thereof is omitted.
  • After the dummy value has been corrected by the first correction unit 150, the first adjustment unit 310 adjusts the dummy value based on the amount of change between the dummy value before the correction and the dummy value after the correction.
  • Because the first adjustment unit 310 adjusts the dummy value based on the amount of change in the dummy value, the first data dividing device 300 can reflect that characteristic and divide the data more accurately.
  • FIG. 25 is a flowchart showing the operation of the first data dividing device 300 according to the second embodiment of the present invention. As shown in FIG. 25, the operation of the first data dividing device 300 is different from the operation of the first data dividing device 100 shown in FIG.
  • Step S16 is provided for adjusting the dummy values based on the amount of change.
  • In step S15, the first dividing unit 140 determines whether the data whose dummy values have been corrected and adjusted after division can be further divided.
  • FIGS. 26 and 27 are diagrams for explaining the function of the first adjustment unit 310.
  • FIG. 26 is a diagram illustrating data obtained by adjusting the dummy values of the corrected data illustrated in FIG. 17.
  • After the dummy values are corrected by the first correction unit 150, the first adjustment unit 310 adjusts the dummy values, for each group, based on the amount of change between the dummy values of the dummy data (data of user7 to user15) before the correction and the dummy values of the same dummy data after the correction. Any adjustment method based on the amount of change in the dummy values before and after the correction may be used. In the following, an adjustment method based on the change in the center of gravity (average value) of the dummy values is described as an example. First, consider the group {user1 to 3, 7 to 9, 11, 12, 14}.
  • The dummy data belonging to the group is the data of user7 to 9, 11, 12, and 14.
  • Before the correction, the dummy values of user7 to 9, 11, 12, and 14 are "115", "125", "135", "125", "135", and "125", respectively. Therefore, the first adjustment unit 310 calculates the center of gravity (average value) of the dummy values before correction in the group as "126.666" by (115 + 125 + 135 + 125 + 135 + 125) ÷ 6.
  • After the correction, the dummy values of user7 to 9, 11, 12, and 14 are "115", "125", "115", "125", "115", and "115", respectively.
  • The first adjustment unit 310 calculates the center of gravity (average value) of the corrected dummy values in the group as "118.333" by (115 + 125 + 115 + 125 + 115 + 115) ÷ 6. Since the center of gravity has changed from "126.666" to "118.333" as a result of the correction, the amount of change in the center of gravity of the dummy values is approximately "−8". Therefore, the first adjustment unit 310 subtracts "10" from predetermined dummy values among the dummy data belonging to the group {user1 to 3, 7 to 9, 11, 12, 14} (data of user7 to 9, 11, 12, 14).
  • Here, since the dummy values are assumed to take values in increments of 10 starting from "115", "−8" is rounded to "−10".
  • The dummy values take values in the range "115 to 135" and do not take a value less than 115 or greater than 135.
  • Accordingly, the first adjustment unit 310 adjusts the dummy values of user8 and user11 from "125" to "115", for example.
  • Next, consider the group {user4 to 6, 10, 13, 15}. The dummy data belonging to the group is the data of user10, 13, and 15. In the data before correction by the first correction unit 150 shown in FIG. 7, the dummy values of user10, 13, and 15 are "115", "115", and "135", respectively.
  • The first adjustment unit 310 calculates the center of gravity (average value) of the dummy values before correction in the group as "121.666" by (115 + 115 + 135) ÷ 3.
  • After the correction, the dummy values of user10, 13, and 15 are "125", "135", and "135", respectively. Accordingly, the first adjustment unit 310 calculates the center of gravity (average value) of the corrected dummy values in the group as "131.666" by (125 + 135 + 135) ÷ 3. Since the center of gravity has changed from "121.666" to "131.666" as a result of the correction, the amount of change in the center of gravity of the dummy values is "+10".
  • Therefore, the first adjustment unit 310 adds "10" to predetermined dummy values among the dummy data belonging to the group {user4 to 6, 10, 13, 15} (data of user10, 13, 15). As shown in FIG. 26, the first adjustment unit 310 adjusts the dummy value of user10 from "125" to "135", for example.
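  • The center-of-gravity adjustment walked through above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `adjust_dummies` is hypothetical, the shift is rounded to the nearest multiple of 10, and it is applied to every dummy value in the group and then clamped to the range 115 to 135. The description only says that "predetermined" dummy values are adjusted, so applying the shift to all of them is one possible reading that happens to reproduce the example values.

```python
from statistics import mean

# Dummy values are assumed to lie on the grid 115, 125, 135.
STEP, LOW, HIGH = 10, 115, 135


def adjust_dummies(before, after):
    """Shift corrected dummy values by the rounded change in their mean.

    `before` and `after` hold the dummy values of one group before and
    after correction, in the same user order.
    """
    shift = mean(after) - mean(before)       # change in center of gravity
    delta = round(shift / STEP) * STEP       # e.g. about -8 becomes -10
    # Clamp so that no value leaves the allowed range (assumption).
    return [min(HIGH, max(LOW, value + delta)) for value in after]


# Group {user1-3, 7-9, 11, 12, 14}: dummy users 7, 8, 9, 11, 12, 14.
group1 = adjust_dummies([115, 125, 135, 125, 135, 125],
                        [115, 125, 115, 125, 115, 115])

# Group {user4-6, 10, 13, 15}: dummy users 10, 13, 15.
group2 = adjust_dummies([115, 115, 135], [125, 135, 135])
```

  • Under these assumptions, the values of user8 and user11 move from 125 to 115 in the first group and the value of user10 moves from 125 to 135 in the second group, while the values already at a range boundary stay where they are.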
  • FIG. 27 is a diagram illustrating a division state when the dummy value is adjusted.
  • FIG. 28 is a diagram showing the distribution and division of user data when the dummy value is adjusted. In FIG. 28, data surrounded by a circle is data in which the dummy value is adjusted.
  • FIG. 29 is a diagram illustrating an example of a classification tree that is finally generated from the data of user1 to user15 when the dummy values are adjusted.
  • FIG. 30 is a diagram illustrating the classification results of user1 to user15 using the classification tree generated from the data with the dummy values adjusted.
  • Compared with the result of FIG. 23, adjusting the dummy values makes it possible to generate a classification tree that correctly classifies the data of user4, whose classification result was "unknown".
  • the classification tree is the most accurate classification tree as compared to the classification trees shown in FIGS. As described above, according to the data dividing apparatus 300 according to the second embodiment, it is possible to divide data with higher accuracy reflecting the characteristics of the data.
  • the third embodiment of the present invention is a distributed anonymization system that performs distributed anonymization using the first data dividing device 400 and the second data dividing device 500.
  • Distributed anonymization is an anonymization technique for preventing individual identification and attribute estimation when combining distributed and held information.
  • the distributed anonymization technique is described in Non-Patent Document 2, for example.
  • In the technology of Non-Patent Document 2, when data is combined between two business operators, the personal information held by the two business operators is first abstracted to generate initial anonymous data.
  • The technology of Non-Patent Document 2 then gradually makes the abstracted personal information more specific while confirming that anonymity is still satisfied.
  • In this process, a division point of the personal information is determined and the data is divided.
  • During division, the technology described in Non-Patent Document 2 has the business operator that holds the sensitive information check whether two indices, k-anonymity and l-diversity, are satisfied.
  • the sensitive information is information that is not to be changed because it is used for information processing of the combined data.
  • FIG. 31 is a block diagram showing a configuration of the first data dividing device 400 according to the third embodiment. As shown in FIG.
  • the first data dividing device 400 is different from the first data dividing device 100 in the first embodiment in that it includes a first determination unit 410. Since components other than the first determination unit 410 have the same configuration as that of the first embodiment, the same reference numerals are given and description thereof is omitted.
  • The first determination unit 410 determines, for each group after division by the first dividing unit 140, whether the proportion of user identifiers existing in both the own device (first data dividing device 400) and the other device (second data dividing device 500) satisfies a predetermined anonymity index.
  • FIG. 32 is a block diagram illustrating a configuration of a second data dividing device 500 according to the third embodiment. As shown in FIG.
  • FIG. 33 is a flowchart showing the operation of the first data dividing device 400 in the third embodiment. As shown in FIG. 33, the operation of the first data dividing device 400 differs from the operation of the first data dividing device 100 in that it has step S18 for determining whether the anonymity index is satisfied. Further, the operation of the first data dividing device 400 does not have step S12, but has step S17 instead. The same operations as those in FIG. 16 are denoted by the same reference numerals, and their description is omitted. Next, with reference to FIGS. 34 to 43, each step of FIG. 33 will be described using a specific example.
  • the business operator S has the first data dividing device 400. Further, it is assumed that the business operator T has the second data dividing device 500. Further, it is assumed that the business operator S holds personal information (data shown in FIG. 1) relating to “abdominal circumference” and “Class”. It is assumed that the business operator T holds personal information regarding “blood pressure” and “disease”.
  • FIG. 34 is a diagram illustrating an example of data held by the operator T in the third embodiment. The personal information held by each business operator is associated with a common identifier managed by the identifier management business operator. In the following examples, “abdominal circumference” is the first personal information, “blood pressure” is the second personal information, and “disease” is the sensitive information. In step S11 of FIG.
  • FIG. 35 is a diagram illustrating data obtained by resampling data of user7 to user15. More specifically, in step S17 of FIG. 33, the second setting unit 230 in the operator T (second data dividing device 500) receives the data shown in FIG. 1 and the data shown in FIG. As a result of the comparison, as shown in FIG. 35, a dummy value is set for the second personal information by the resampling method for the data of user7 to user15, and the sensitive information is appropriately set. In step S17, the first setting unit 130 and the second setting unit 230 generate initial anonymous data. For example, the first setting unit 130 generates initial anonymous data shown in FIG.
  • the 2nd operation part 240 produces
  • The initial anonymous data includes a user ID, quasi-identifiers (information on blood pressure, abdominal circumference, and class), and sensitive information (information on disease).
  • The first dividing unit 140 transmits data division information (information indicating that the data is to be divided into the two groups {user1 to 3, 7 to 9, 11, 12, 14} and {user4 to 6, 10, 13, 15}) to the second transmission/reception unit 210 of the business operator T.
  • The first determination unit 410 and the second determination unit 420 determine, for each group after division by the first dividing unit 140 of the first data dividing device 400, whether the proportion of user identifiers existing in both the first data dividing device 400 (own device) and the second data dividing device 500 (other device) satisfies a predetermined anonymity index.
  • “User identifier” means the identifier of a user stored in the device itself. Specifically, the user identifiers of the business operator S are user1 to user15, and the user identifiers of the business operator T are user1 to user6.
  • Here, the predetermined anonymity index is 2-anonymity. The group {user1 to 3, 7 to 9, 11, 12, 14} in FIG. 39 (the group in the first row) is 3-anonymous because 6 of its 9 users are dummies.
  • The group {user4 to 6, 10, 13, 15} (the group in the second row) is 3-anonymous because three of its six users are dummies. Therefore, both groups satisfy 2-anonymity.
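  • The anonymity check described above amounts to counting, in each group, the users that are not dummies. The following sketch illustrates that count for the two groups named in the text, taking user1 to user6 as the users actually held by the business operator T. The function names are hypothetical, and a plain local count is used here in place of any secure computation between the operators.

```python
def real_user_count(group, real_ids):
    """Number of users in the group that exist in the operator's own data."""
    return sum(1 for uid in group if uid in real_ids)


def satisfies_k_anonymity(groups, real_ids, k):
    """True if every group contains at least k non-dummy users."""
    return all(real_user_count(group, real_ids) >= k for group in groups)


# Operator T holds user1..user6; user7..user15 are dummy data on its side.
REAL_IDS_T = set(range(1, 7))

groups = [
    {1, 2, 3, 7, 8, 9, 11, 12, 14},   # first-row group: 6 of 9 are dummies
    {4, 5, 6, 10, 13, 15},            # second-row group: 3 of 6 are dummies
]

counts = [real_user_count(group, REAL_IDS_T) for group in groups]
ok_for_2_anonymity = satisfies_k_anonymity(groups, REAL_IDS_T, k=2)
```

  • Both groups contain three non-dummy users, so both are 3-anonymous and the 2-anonymity index is satisfied.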
  • Since the business operator T holds the sensitive information, the business operator T may confirm the anonymity index.
  • Since the dummy data is included in the user data of the business operator T, it is not difficult to confirm that the index is satisfied. If dummy data is also included in the data of the business operator S, the first determination unit 410 may use MPC to confirm that both the data of the business operator S and the data of the business operator T satisfy the anonymity index.
  • FIG. 40 is a diagram showing user data obtained by correcting dummy values with respect to the data shown in FIG.
  • The first dividing unit 140 and the second dividing unit 240 determine whether the user data whose dummy values have been corrected after the division can be further divided, by the same method as in the first embodiment (step S15).
  • the first dividing unit 140 and the second dividing unit 240 divide the user data.
  • the second determination unit 420 confirms whether the anonymous index is maintained for the user data of FIG.
  • The second determination unit 420 determines that the {user4, 10} group (the group in the second row) is 1-anonymous because one of its two users is a dummy, and that it therefore does not satisfy 2-anonymity. If the second determination unit 420 determines that the anonymity index is not satisfied, the first data dividing device 400 and the second data dividing device 500 cancel the most recent division. The first data dividing device 400 and the second data dividing device 500 then use MPC to calculate the number of persons existing in each group of the data whose division was canceled, and generate combined anonymized data.
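  • The cancellation of a division that breaks the anonymity index can be sketched as a split that is undone when any resulting subgroup falls below k non-dummy users. The function name `split_with_rollback` is hypothetical, and pairing {user4, 10} with {user5, 6, 13, 15} as the two halves of {user4 to 6, 10, 13, 15} is an inference from the groups named in the description. The MPC-based counting step is not modelled here.

```python
def split_with_rollback(group, parts, real_ids, k):
    """Accept a proposed split only if every part keeps k-anonymity.

    `parts` is the proposed division of `group`. If any part has fewer
    than k non-dummy users, the split is canceled and the original group
    is returned unchanged.
    """
    for part in parts:
        real_users = sum(1 for uid in part if uid in real_ids)
        if real_users < k:
            return [group]          # cancel the most recent division
    return list(parts)


REAL_IDS_T = set(range(1, 7))        # operator T holds user1..user6
parent = {4, 5, 6, 10, 13, 15}
proposed = [{4, 10}, {5, 6, 13, 15}]

result = split_with_rollback(parent, proposed, REAL_IDS_T, k=2)
```

  • Here the {user4, 10} part contains only one non-dummy user, so the split is rejected under 2-anonymity and the parent group is kept as it was.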
  • FIG. 43 is a diagram showing final combined anonymized data (joined anonymized data) generated by the present invention according to the third embodiment. Here, even if the user data shown in FIG.
  • the business operator S does not know which user's data surely exists in the data of the business operator T. Further, the business operator T does not know which user's data is surely present in the data of the business operator S.
  • As described above, according to the distributed anonymization system of the third embodiment, there is no risk that the existence of a user's data leaks to other business operators, and distributed anonymization can be executed with highly accurate data division by making effective use of the non-common data held by one device.
  • FIG. 44 is a block diagram showing a configuration of a data dividing device 600 according to the fourth embodiment. As illustrated in FIG.
  • the data dividing device 600 includes a transmission / reception unit 610, a storage unit 620, a setting unit 630, a dividing unit 640, and a correcting unit 650. These are the same configurations as the first transmission / reception unit 110, the first storage unit 120, the first setting unit 130, the first division unit 140, and the first correction unit 150 described above.
  • the data dividing device 600 stores first personal information, and divides data while communicating with another device storing second personal information via the transmission / reception unit 610.
  • the transmission / reception unit 610 transmits / receives data to / from other devices.
  • Storage unit 620 associates a plurality of user identifiers with the value of the first personal information and stores them as user data.
  • The setting unit 630 sets a dummy value for the first personal information, as dummy data, for data that is stored in the other device, as received by the transmission/reception unit 610, but is not stored in the storage unit 620.
  • the dividing unit 640 divides predetermined data including dummy data into groups by dividing points determined based on the value of the first personal information including the dummy value or the value of the second personal information.
  • the correction unit 650 corrects the dummy value for each division based on the value of the first personal information of data other than the dummy data among the data belonging to the group after the division.
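  • The roles of the setting unit and the correction unit can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: the function names, the sample values, and in particular the choice to draw dummy values by resampling from the non-dummy values (from all local data when setting, from the group's non-dummy data when correcting). The description only requires that the correction be based on the non-dummy values in the group.

```python
import random


def set_dummy_values(local, other_ids, rng):
    """Add dummy values for user IDs known only to the other device."""
    observed = list(local.values())
    dummy_ids = {uid for uid in other_ids if uid not in local}
    data = dict(local)
    for uid in sorted(dummy_ids):
        data[uid] = rng.choice(observed)    # resample from real values
    return data, dummy_ids


def correct_dummy_values(data, dummy_ids, group, rng):
    """Redraw the dummy values in one group from that group's real values."""
    real_values = [data[uid] for uid in sorted(group) if uid not in dummy_ids]
    corrected = dict(data)
    if not real_values:
        return corrected                    # nothing to base a correction on
    for uid in sorted(group):
        if uid in dummy_ids:
            corrected[uid] = rng.choice(real_values)
    return corrected


# Hypothetical first personal information held locally.
local = {"user1": 85, "user2": 90, "user3": 70}
other_ids = ["user1", "user2", "user3", "user4", "user5"]

data, dummy_ids = set_dummy_values(local, other_ids, random.Random(0))
corrected = correct_dummy_values(data, dummy_ids, ["user1", "user4"],
                                 random.Random(0))
```

  • In this example the group ["user1", "user4"] has a single non-dummy value, so the dummy value of user4 is replaced by 85, while user5, which lies outside the group, is left unchanged.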
  • According to the data dividing device 600, in distributed processing of data by a plurality of devices, the non-common data held by one device is used effectively, so that data can be divided with high accuracy.
  • Although the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments.
  • Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
  • the data dividing devices held by two or more different business operators according to the present invention may be devices that are separated from each other in terms of management, for example, may be devices that are virtually separated.
  • FIG. 45 is a block diagram illustrating an example of a hardware configuration of the first data dividing device 100 according to the first embodiment.
  • Each unit constituting the first data dividing device 100 is realized by a computer device including a CPU (Central Processing Unit) 1, a communication IF 2 (communication interface 2) for network connection, a memory 3, and a storage device 4, such as a hard disk, that stores a program.
  • the configuration of the first data dividing device 100 is not limited to the computer device shown in FIG.
  • the first transmission / reception unit 110 may be realized by the communication IF2.
  • the CPU 1 controls the first data dividing device 100 by operating the operating system. Further, the CPU 1 reads a program and data from a recording medium mounted on, for example, a drive device to the memory 3 and executes various processes according to the program and data.
  • the first setting unit 130, the first dividing unit 140, and the first correcting unit 150 may be realized by the CPU 1 and a program.
  • The recording device 4 is, for example, an optical disk, a flexible disk, a magneto-optical disk, an external hard disk, or a semiconductor memory, and records a computer program so that it can be read by a computer.
  • the computer program may be downloaded from an external computer (not shown) connected to the communication network.
  • the first storage unit 120 may be realized by the recording device 4.
  • The block diagrams used in the embodiments described so far show blocks in functional units rather than configurations in hardware units. These functional blocks are realized by any combination of hardware and software.
  • the means for realizing the components of the first data dividing device 100 is not particularly limited. That is, the first data dividing device 100 may be realized by one physically coupled device, or two or more physically separated devices are connected by wire or wirelessly, and these multiple devices are used. It may be realized.
  • The program of the present invention may be any program that causes a computer to execute the operations described in the above embodiments. This application claims priority based on Japanese Patent Application No. 2011-205519, filed on September 21, 2011, the entire disclosure of which is incorporated herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a technique that, in distributed processing of data by a plurality of devices, enables highly accurate data division by making effective use of non-shared data held by one device. The present invention provides a data dividing device comprising: setting means for setting dummy values for first personal information, as dummy data, in association with those user identifiers of other devices that do not match the identifiers of the data dividing device, among all the obtained identifiers of the other devices; dividing means for dividing prescribed user data including the dummy data into groups by means of division points determined on the basis of the values of the first personal information including the dummy values or the values of second personal information; and correction means for correcting the dummy values set for the first personal information, on the basis of the values of the first personal information of the data dividing device and of the other devices for which the user identifiers match, among the user data belonging to the divided groups.
PCT/JP2012/074311 2011-09-21 2012-09-14 Appareil de partitionnement de données, système de partitionnement de données, procédé de partitionnement de données, et programme WO2013042788A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013534778A JP6015661B2 (ja) 2011-09-21 2012-09-14 データ分割装置、データ分割システム、データ分割方法及びプログラム

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011-205519 2011-09-21
JP2011205519 2011-09-21

Publications (1)

Publication Number Publication Date
WO2013042788A1 true WO2013042788A1 (fr) 2013-03-28

Family

ID=47914546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/074311 WO2013042788A1 (fr) 2011-09-21 2012-09-14 Appareil de partitionnement de données, système de partitionnement de données, procédé de partitionnement de données, et programme

Country Status (2)

Country Link
JP (1) JP6015661B2 (fr)
WO (1) WO2013042788A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015162748A1 (fr) * 2014-04-24 2015-10-29 株式会社日立製作所 Dispositif de conversion de données et procédé de conversion de données
WO2019138584A1 (fr) * 2018-01-15 2019-07-18 日本電気株式会社 Procédé, dispositif et programme de génération d'arbres de classification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001216307A (ja) * 2000-01-31 2001-08-10 Teijin Ltd リレーショナルデータベース管理システム及びそれを記憶した記憶媒体
JP2004086782A (ja) * 2002-08-29 2004-03-18 Hitachi Ltd 異種データベース統合支援装置
JP2005011049A (ja) * 2003-06-19 2005-01-13 Nec Soft Ltd データベース統合装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015162748A1 (fr) * 2014-04-24 2015-10-29 株式会社日立製作所 Dispositif de conversion de données et procédé de conversion de données
WO2019138584A1 (fr) * 2018-01-15 2019-07-18 日本電気株式会社 Procédé, dispositif et programme de génération d'arbres de classification
JPWO2019138584A1 (ja) * 2018-01-15 2020-12-17 日本電気株式会社 分類木生成方法、分類木生成装置および分類木生成プログラム
JP6992821B2 (ja) 2018-01-15 2022-01-13 日本電気株式会社 分類木生成方法、分類木生成装置および分類木生成プログラム

Also Published As

Publication number Publication date
JP6015661B2 (ja) 2016-10-26
JPWO2013042788A1 (ja) 2015-03-26

Similar Documents

Publication Publication Date Title
US11323347B2 (en) Systems and methods for social graph data analytics to determine connectivity within a community
US10454901B2 (en) Systems and methods for enabling data de-identification and anonymous data linkage
Daubert et al. A view on privacy & trust in IoT
US9405787B2 (en) Distributed anonymization system, distributed anonymization device, and distributed anonymization method
AU2024200809A1 (en) Data protection via aggregation-based obfuscation
US11601437B2 (en) Account access security using a distributed ledger and/or a distributed file system
US20160246981A1 (en) Data secrecy statistical processing system, server device for presenting statistical processing result, data input device, and program and method therefor
EP2879069A2 (fr) Système pour rendre anonymes et regrouper des informations de santé protégées
US10176340B2 (en) Abstracted graphs from social relationship graph
CN110086817B (zh) 可靠的用户服务系统和方法
US20170277907A1 (en) Abstracted Graphs from Social Relationship Graph
US11775656B2 (en) Secure multi-party information retrieval
KR20200053613A (ko) 데이터 통계 방법 및 장치
US10783277B2 (en) Blockchain-type data storage
JP2014211607A (ja) 情報処理装置およびその方法
JP6015661B2 (ja) データ分割装置、データ分割システム、データ分割方法及びプログラム
WO2013121738A1 (fr) Dispositif d'anonymisation distribuée, et procédé d'anonymisation distribuée
CN112860790B (zh) 数据管理方法、系统、装置
US20220407706A1 (en) Generation device, generation method, and verification device
US10970417B1 (en) Differential privacy security for benchmarking
JP2020003989A (ja) 個人情報分析システム、及び個人情報分析方法
WO2021118413A2 (fr) Procédé de traitement de données comprenant des procédés de calcul multilatéral sécurisé et d'analyse de données
JP2017033305A (ja) 情報処理システム及び情報処理方法
US20220147651A1 (en) Data management method, non-transitory computer readable medium, and data management system
EP3901808B1 (fr) Système de réponse d'interrogation d'analyse, dispositif d'exécution d'interrogation d'analyse, dispositif de vérification d'interrogation d'analyse, procédé de réponse d'interrogation d'analyse et programme

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12833023

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2013534778

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12833023

Country of ref document: EP

Kind code of ref document: A1