CN111444544B - Method and device for clustering private data of multiple parties - Google Patents


Info

Publication number
CN111444544B
Authority
CN
China
Prior art keywords
data
target
cluster
party
central data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010536190.4A
Other languages
Chinese (zh)
Other versions
CN111444544A (en)
Inventor
陈超超
周俊
王力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010536190.4A
Publication of CN111444544A
Application granted
Publication of CN111444544B
Priority to PCT/CN2021/099485 (WO2021249502A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present specification provides a method and an apparatus for clustering private data of multiple parties. The method includes: a first party determines first fragments of the pieces of central data currently corresponding to the respective clusters; taking each piece of central data in turn as target central data, the first party performs, based on local first private data and the first fragment of the target central data, a first joint calculation with a second fragment of the target central data held by a second party by means of secret sharing, to obtain a first fragment of a first target distance between the first private data and the target central data; based on the first fragments of the first target distances, the first party performs a joint comparison with second fragments of the first target distances held by the second party by means of secret sharing, to determine the closest of the first target distances; and the cluster corresponding to the closest first target distance is determined as the cluster to which the first private data currently belongs. Leakage of private data can thereby be prevented.

Description

Method and device for clustering private data of multiple parties
Technical Field
One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for clustering private data for multiple parties.
Background
Clustering is a very common technique in machine learning and is often applied to tasks such as community discovery and anomaly detection. Common clustering algorithms are unsupervised learning algorithms that aim to group similar objects into the same class. The more similar the objects within a cluster, the better the clustering. The main difference between clustering and classification is that the target classes of classification are known a priori, whereas those of clustering are not. Clustering produces the same kind of result as classification, except that the classes are not predefined.
In some scenarios, the data is distributed horizontally across multiple parties. The data that the parties have may be private data, i.e. the private data that one party has cannot be disclosed to the other party. In this case, the prior art does not provide a suitable clustering method.
Accordingly, it would be desirable to have an improved scheme for preventing privacy data from being revealed when clustering is performed on privacy data of multiple parties.
Disclosure of Invention
One or more embodiments of the present specification describe a method and an apparatus for clustering private data of multiple parties, which can prevent private data from being leaked when clustering private data of multiple parties.
In a first aspect, a method for clustering private data of multiple parties is provided. The multiple parties include a first party and a second party; the first party has a first private data set that includes a plurality of pieces of first private data. The method is performed by the first party and includes multiple rounds of iteration, where any one round of iteration includes:
determining first fragments of the pieces of central data currently corresponding to the respective clusters, where the second party has a second fragment of each piece of central data, and the sum of the first fragment and the second fragment of any piece of central data equals that central data;
taking each piece of central data in turn as target central data and, based on local first private data and the first fragment of the target central data, performing a first joint calculation with the second fragment of the target central data held by the second party by means of secret sharing, to obtain a first fragment of a first target distance between the first private data and the target central data, where the second party has a second fragment of the first target distance;
based on the first fragments of the first target distances, performing a joint comparison with the second fragments of the first target distances held by the second party by means of secret sharing, to determine the closest of the first target distances;
and determining the cluster corresponding to the closest first target distance as the cluster to which the first private data currently belongs.
In one possible implementation, the first joint calculation includes:
locally calculating a first distance between the first private data and the first fragment of the target central data;
multiplying, by means of secret sharing, the difference between the first fragment of the target central data and the first private data by the second fragment of the target central data held by the second party, to obtain a first fragment of the product;
determining the first fragment of the first target distance between the first private data and the target central data according to the first distance and the first fragment of the product.
In a possible implementation, the iteration is the first round of iteration, and the first fragments of the pieces of central data currently corresponding to the respective clusters are randomly initialized data.
In one possible embodiment, the joint comparison comprises:
based on the first fragments of any two first target distances in the first target distances, performing joint comparison with the second fragments of the any two first target distances in the second party in a secret sharing manner, and determining a comparison result of the distance between the any two first target distances;
and determining the nearest first target distance in the first target distances according to the comparison results.
In a possible implementation manner, after determining the class cluster corresponding to the closest first target distance as the class cluster to which the first privacy data currently belongs, the method further includes:
and updating the first fragment of the central data of the cluster according to the average value of the first private data of the same cluster.
Further, after updating the first fragment of the central data of the cluster, the method further includes:
judging whether the amount of change in the central data of each cluster satisfies a preset iteration stop condition;
and if the amount of change in the central data of each cluster does not satisfy the preset iteration stop condition, performing the next round of iteration in the multi-round iteration process.
Further, the method further comprises:
and if the judgment result is that the variation of the central data of each cluster meets the preset iteration stopping condition, determining the cluster to which the first privacy data belongs currently as the cluster to which the first privacy data belongs finally.
Further, the determining whether the variation of the central data of each cluster satisfies a preset iteration stop condition includes:
and taking any one of the various clusters as a target cluster, and performing joint comparison with a second fragment of the central data before updating of the target cluster and a second fragment of the central data after updating of the target cluster in the second party in a secret sharing manner according to the first fragment of the central data before updating of the target cluster and the first fragment of the central data after updating of the target cluster, so as to judge whether the variation of the central data of the target cluster meets a preset iteration stop condition.
In a possible embodiment, the second party has a second set of privacy data, the second set of privacy data comprising a plurality of second privacy data, the method further comprising:
taking each piece of central data in turn as target central data and, based on the local first fragment of the target central data, performing a second joint calculation with the second private data and the second fragment of the target central data held by the second party by means of secret sharing, to obtain a second fragment of a second target distance between the second private data and the target central data, where the second party has a first fragment of the second target distance.
Further, the second joint calculation includes:
locally calculating the square of the first fragment of the target central data;
multiplying, by means of secret sharing, the first fragment of the target central data by the difference between the second fragment of the target central data and the second private data held by the second party, to obtain a second fragment of the product;
determining the second fragment of the second target distance between the second private data and the target central data according to the square and the second fragment of the product.
In a second aspect, an apparatus for clustering privacy data of multiple parties is provided, where the multiple parties include a first party and a second party, the first party has a first privacy data set, and the first privacy data set includes multiple first privacy data, and the apparatus is provided in the first party, and is configured to perform multiple rounds of iteration processes, and includes the following units for performing any one round of iteration:
the center determining unit is used for determining first fragments of the center data corresponding to the various cluster types at present; the second party has a second shard of the central data; the sum of the first shard of any central data and the second shard of the central data is equal to the central data;
a first joint calculation unit, configured to take each piece of central data determined by the center determining unit in turn as target central data and, based on local first private data and the first fragment of the target central data, perform a first joint calculation with the second fragment of the target central data held by the second party by means of secret sharing, to obtain a first fragment of a first target distance between the first private data and the target central data, where the second party has a second fragment of the first target distance;
a joint comparison unit, configured to perform, based on the first fragments of the first target distances obtained by the first joint calculation unit, a joint comparison with the second fragments of the first target distances held by the second party by means of secret sharing, to determine the closest of the first target distances;
and a class cluster determining unit, configured to determine, as the class cluster to which the first privacy data currently belongs, the class cluster corresponding to the closest first target distance determined by the joint comparing unit.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
With the method and apparatus provided by the embodiments of this specification, the central data of each cluster is not determined by any single party. Instead, the first party determines the first fragment of each piece of central data currently corresponding to each cluster, and the second party determines the second fragment; the sum of the first fragment and the second fragment of any piece of central data equals that central data. Subsequently, when the first target distance between the first private data and the target central data is determined, secret sharing is used so that the first party obtains the first fragment of the first target distance and the second party obtains the second fragment. Secret sharing is likewise used when determining the closest of the first target distances. Finally, the cluster corresponding to the closest first target distance is determined as the cluster to which the first private data currently belongs. The whole process is based on secret sharing, so private data can be prevented from being disclosed when the private data of multiple parties is clustered.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a method of clustering private data for multiple parties, according to one embodiment;
FIG. 3 shows a schematic block diagram of an apparatus for clustering private data of multiple parties, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. This implementation scenario involves clustering private data for multiple parties. It is understood that the above-mentioned parties may be two or more parties, e.g., three parties, four parties, etc. In the embodiment of the present specification, clustering of private data of two parties is taken as an example. Referring to fig. 1, a first party 11 has private data 1, private data 2, private data 3, private data 4, private data 5; the second party 12 has private data 6, private data 7, private data 8, private data 9. The first party and the second party are only used for distinguishing the two parties, and the first party can be called party A, the second party can be called party B, and the like.
In the embodiments of the present specification, the information included in the private data is not limited and may be any information that must not be disclosed externally, for example, personal information of a user or a trade secret. For example, the private data may be personal information of a user, including the user's name, age, income, and the like; see Table one, which lists the information contained in each piece of private data.
Table one: information contained in each piece of private data
[Table one is provided as an image in the original publication and is not reproduced here.]
As Table one shows, different rows of data may be held by different parties: for example, private data 1 is held by the first party and private data 8 by the second party. Distributing data across multiple parties by rows in this way may be referred to as horizontal splitting.
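As a small illustration of horizontal splitting, the sketch below divides nine rows between two parties in the way the scenario of Fig. 1 describes (private data 1 to 5 with the first party, 6 to 9 with the second). The field values are placeholders invented for the example, since the contents of Table one are not available; only the field names (name, age, income) come from the text, and `make_row` is a hypothetical helper.

```python
from typing import Dict, List

def make_row(i: int) -> Dict[str, object]:
    # Placeholder values only; they do not come from Table one.
    return {"id": i, "name": f"user_{i}", "age": 20 + i, "income": 1000 * i}

all_rows: List[Dict[str, object]] = [make_row(i) for i in range(1, 10)]

# Horizontal splitting: each party holds complete rows (all columns),
# but only for its own subset of records.
first_party_rows = [r for r in all_rows if r["id"] <= 5]
second_party_rows = [r for r in all_rows if r["id"] >= 6]

print(len(first_party_rows), len(second_party_rows))
```

Both parties see the same columns; what differs is which records each one holds.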
In this embodiment of the specification, the private data of multiple parties needs to be clustered. Taking Fig. 1 as an example, private data 1 through private data 9 are clustered, and private data held by different parties may be assigned to the same cluster. For example, private data 1, private data 3, private data 6, and private data 7 are assigned to cluster 1, and private data 2, private data 4, private data 5, private data 8, and private data 9 are assigned to cluster 2. In the embodiments of this specification, the private data of multiple parties is clustered by means of secret sharing without revealing the private data.
Fig. 2 shows a flow diagram of a method for clustering private data of multiple parties according to one embodiment, which may be based on the implementation scenario shown in Fig. 1. The multiple parties include a first party and a second party; the first party has a first private data set that includes a plurality of pieces of first private data. The method is executed by the first party and includes multiple rounds of iteration. As shown in Fig. 2, any one round of iteration includes the following steps:
step 21, determining first fragments of the pieces of central data currently corresponding to the respective clusters, where the second party has a second fragment of each piece of central data, and the sum of the first fragment and the second fragment of any piece of central data equals that central data;
step 22, taking each piece of central data in turn as target central data and, based on local first private data and the first fragment of the target central data, performing a first joint calculation with the second fragment of the target central data held by the second party by means of secret sharing, to obtain a first fragment of a first target distance between the first private data and the target central data, where the second party has a second fragment of the first target distance;
step 23, based on the first fragments of the first target distances, performing a joint comparison with the second fragments of the first target distances held by the second party by means of secret sharing, to determine the closest of the first target distances;
and step 24, determining the cluster corresponding to the closest first target distance as the cluster to which the first private data currently belongs.
Specific execution modes of the above steps are described below.
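For orientation, steps 21 to 24 amount to the assignment step of a k-means-style iteration. The following non-private reference, with scalar data and the hypothetical function name `assign_plain`, shows what the secret-shared protocol is meant to compute. It is a sketch of the plaintext counterpart only, not of the protected method itself.

```python
from typing import List

def assign_plain(points: List[float], centers: List[float]) -> List[int]:
    """For each point, return the index of the center with the smallest squared distance."""
    assignments = []
    for x in points:
        distances = [(c - x) ** 2 for c in centers]
        assignments.append(distances.index(min(distances)))
    return assignments

print(assign_plain([1.0, 2.0, 9.0, 10.0], [0.0, 10.0]))  # prints [0, 0, 1, 1]
```

In the protected method, neither the centers nor the distances are ever available in the clear to a single party; each quantity exists only as two fragments.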
First, in step 21, the first fragments of the pieces of central data currently corresponding to the respective clusters are determined; the second party has a second fragment of each piece of central data, and the sum of the first fragment and the second fragment of any piece of central data equals that central data. It is to be understood that the number of clusters may be preset; for example, the private data of the multiple parties may be preset to be divided into two clusters, three clusters, and so on.
In this embodiment of the present specification, each piece of central data is jointly determined by the first party and the second party: the first party can determine only the first fragment of each piece of central data, the second party determines the second fragment, and neither party can determine the central data alone.
In one example, the iteration is the first round of iteration, and the first fragments of the pieces of central data currently corresponding to the respective clusters are randomly initialized data.
For example, assuming that the number of clusters is 2, the first party randomly initializes the first fragments (shares) of the 2 pieces of central data, denoted (<c1>1, <c2>1); correspondingly, the second party randomly initializes the second fragments of the 2 pieces of central data, denoted (<c1>2, <c2>2).
Further, the first party may initialize, for each piece of first private data, a K-dimensional cluster vector that marks the cluster to which the first private data belongs, where K is the number of clusters. When K = 2, a 2-dimensional cluster vector is initialized, for example the all-zero vector [0, 0].
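The fragment relationship in step 21 can be sketched with additive secret sharing over a finite ring. The modulus `Q`, the integer encoding, the initialization range, and the helper names `share`, `decode`, and `reconstruct` are assumptions made for this illustration; the text only requires that the two fragments sum to the central data.

```python
import secrets
from typing import List, Tuple

Q = 2**61 - 1  # assumed modulus; the text does not fix one

def decode(v: int) -> int:
    """Map a ring element back to a signed integer."""
    v %= Q
    return v - Q if v > Q // 2 else v

def share(value: int) -> Tuple[int, int]:
    """Split value into two fragments whose sum equals value modulo Q."""
    s1 = secrets.randbelow(Q)
    return s1, (value - s1) % Q

def reconstruct(s1: int, s2: int) -> int:
    return decode(s1 + s2)

# First-round initialization as described in the text: each party draws its
# own fragments at random, so no party ever sees a complete center.
K = 2
first_party_fragments: List[int] = [(secrets.randbelow(201) - 100) % Q for _ in range(K)]
second_party_fragments: List[int] = [(secrets.randbelow(201) - 100) % Q for _ in range(K)]

# Each piece of first private data starts with an all-zero cluster vector.
cluster_vector = [0] * K
```

A center exists only as the sum `reconstruct(first_party_fragments[k], second_party_fragments[k])`, which neither party computes in the protocol; the function is shown here only to make the sum relationship explicit.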
Then, in step 22, each piece of central data is taken in turn as target central data and, based on the local first private data and the first fragment of the target central data, a first joint calculation is performed with the second fragment of the target central data held by the second party by means of secret sharing, to obtain a first fragment of a first target distance between the first private data and the target central data; the second party has a second fragment of the first target distance. It is to be appreciated that the sum of the first fragment and the second fragment of the target central data is the target central data.
In this embodiment, assuming that the target central data is denoted c1 and the first private data is denoted x1, the first target distance between the first private data and the target central data may be expressed as (c1-x1)^2. With the first fragment of the target central data denoted <c1>1 and the second fragment denoted <c1>2, the following derivation holds:
(c1-x1)^2
=(<c1>1+<c1>2-x1)^2
=(<c1>1-x1)^2 + 2 (<c1>1-x1)<c1>2 + (<c1>2)^2
From this derivation, computing (c1-x1)^2 can be converted into computing (<c1>1-x1)^2, (<c1>1-x1)<c1>2 (the cross term, which enters the sum with a factor of 2), and (<c1>2)^2.
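The expansion can be checked numerically. In the sketch below, `c_1` and `c_2` stand for the fragments <c1>1 and <c1>2 and `x` for x1; the function names are hypothetical and the data is assumed to be scalar.

```python
def direct(c_1: int, c_2: int, x: int) -> int:
    """(c1 - x1)^2 with c1 reconstructed as the sum of its two fragments."""
    return (c_1 + c_2 - x) ** 2

def expanded(c_1: int, c_2: int, x: int) -> int:
    """The three-term form from the derivation."""
    a = c_1 - x                      # computable by the first party alone
    return a * a + 2 * a * c_2 + c_2 * c_2

for c_1, c_2, x in [(7, 3, 4), (-5, 15, 4), (0, 0, 9), (100, -98, -3)]:
    assert direct(c_1, c_2, x) == expanded(c_1, c_2, x)
```

Only the middle term mixes values held by different parties, which is why it is the one term that needs a secret-shared multiplication.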
In one example, the first joint calculation includes:
locally calculating a first distance between the first private data and the first fragment of the target central data;
multiplying, by means of secret sharing, the difference between the first fragment of the target central data and the first private data by the second fragment of the target central data held by the second party, to obtain a first fragment of the product;
determining the first fragment of the first target distance between the first private data and the target central data according to the first distance and the first fragment of the product.
It is understood that the first distance corresponds to (<c1>1-x1)^2 in the derivation above; the product corresponds to (<c1>1-x1)<c1>2, and the first fragment of the product can be expressed as <(<c1>1-x1)<c1>2>1. The first party may sum (<c1>1-x1)^2 and <(<c1>1-x1)<c1>2>1 to obtain the first fragment of the first target distance, which may be denoted <x1c1>1.
Accordingly, the second party may determine the second fragment of the first target distance as follows:
the second party takes each piece of central data in turn as target central data and, based on the local second fragment of the target central data, performs a joint calculation with the first private data and the first fragment of the target central data held by the first party by means of secret sharing, to obtain the second fragment of the first target distance between the first private data and the target central data.
Further, the joint calculation includes:
the second party locally calculates the square of the second fragment of the target central data;
multiplying, by means of secret sharing, the second fragment of the target central data by the difference between the first fragment of the target central data and the first private data held by the first party, to obtain a second fragment of the product;
determining the second fragment of the first target distance between the first private data and the target central data according to the square and the second fragment of the product.
It is understood that the square corresponds to (<c1>2)^2 in the derivation above; the product corresponds to (<c1>1-x1)<c1>2, and the second fragment of the product can be expressed as <(<c1>1-x1)<c1>2>2. The second party may sum (<c1>2)^2 and <(<c1>1-x1)<c1>2>2 to obtain the second fragment of the first target distance, which may be denoted <x1c1>2.
In the embodiments of the present specification, if c2 denotes another piece of target central data besides c1 and x1 denotes the first private data, the distance between x1 and c2 can be determined in the same way as the distance between x1 and c1.
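The text does not say which protocol performs the secret-shared multiplication. The sketch below assumes Beaver multiplication triples handed out by a trusted dealer, a commonly used technique that is swapped in here purely for illustration. Both parties are simulated in one process, the data is scalar and integer-valued, and the factor 2 of the cross term is folded into the shared product so that the two distance fragments sum exactly to (c1-x1)^2. The modulus `Q` and all function names are assumptions.

```python
import secrets
from typing import Tuple

Q = 2**61 - 1  # assumed prime modulus for the additive fragments

def share(value: int) -> Tuple[int, int]:
    s1 = secrets.randbelow(Q)
    return s1, (value - s1) % Q

def decode(v: int) -> int:
    v %= Q
    return v - Q if v > Q // 2 else v

def beaver_triple():
    """Trusted-dealer stand-in: fragments of random u, v and of w = u * v."""
    u, v = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(u), share(v), share(u * v % Q)

def shared_product(a: int, b: int) -> Tuple[int, int]:
    """a is known only to the first party, b only to the second party.
    Returns additive fragments of a * b; only masked values are opened."""
    (u1, u2), (v1, v2), (w1, w2) = beaver_triple()
    # View a as the sharing (a, 0) and b as the sharing (0, b).
    d = ((a - u1) + (0 - u2)) % Q    # opened value a - u
    e = ((0 - v1) + (b - v2)) % Q    # opened value b - v
    z1 = (w1 + d * v1 + e * u1 + d * e) % Q   # first party's fragment
    z2 = (w2 + d * v2 + e * u2) % Q           # second party's fragment
    return z1, z2

def distance_fragments(x1: int, c_frag1: int, c_frag2: int) -> Tuple[int, int]:
    """Fragments of (c1 - x1)^2 where c1 = c_frag1 + c_frag2."""
    a = (c_frag1 - x1) % Q                       # first party, local
    p1, p2 = shared_product(2 * a % Q, c_frag2)  # cross term, factor 2 included
    d1 = (a * a + p1) % Q                        # first distance + product fragment
    d2 = (c_frag2 * c_frag2 + p2) % Q            # square + product fragment
    return d1, d2

d1, d2 = distance_fragments(x1=4, c_frag1=7, c_frag2=3)
print(decode(d1 + d2))  # reconstruction shown only to check the sketch
```

Each fragment on its own is masked by the random triple and so reveals nothing about the distance; in the protocol the fragments stay with their respective parties and are not added together as in the last line.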
Next, in step 23, based on the first fragments of the first target distances, a joint comparison is performed with the second fragments of the first target distances held by the second party by means of secret sharing, and the closest of the first target distances is determined. It is to be understood that each first target distance is the distance between the first private data and one piece of central data, and that the sum of the first fragment and the second fragment of a first target distance is that first target distance.
In this embodiment, when the number of clusters is two, there are two pieces of central data and accordingly two first target distances; comparing these two distances determines the closest one. For example, the values of x1c1 and x1c2 are compared, and the cluster corresponding to the smaller value is the cluster to which x1 belongs. Assuming that x1c2 is the smaller value, x1 is closest to c2, and the cluster vector becomes [0, 1].
When the number of clusters is three or more, there are three or more pieces of central data and correspondingly three or more first target distances; the closest of the first target distances can be determined by comparing the first target distances pairwise.
In one example, the joint comparison includes:
based on the first fragments of any two of the first target distances, performing a joint comparison with the second fragments of those two first target distances held by the second party by means of secret sharing, to determine which of the two first target distances is smaller;
and determining the closest of the first target distances according to the comparison results.
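The pairwise comparison can be sketched as follows. The function `less_than` is an idealized stand-in: it adds the two locally computed difference fragments in the clear, whereas a real secure comparison protocol would reveal only the comparison result. That protocol is not specified in the text and is not implemented here. Plain integer fragments without a modulus, the linear scan, and the names `nearest_cluster` and `one_hot` are likewise assumptions for illustration.

```python
from typing import List, Tuple

Fragments = Tuple[int, int]  # (first party's fragment, second party's fragment)

def less_than(dist_a: Fragments, dist_b: Fragments) -> bool:
    """Idealized comparison: True if distance a is smaller than distance b."""
    diff1 = dist_a[0] - dist_b[0]   # computed locally by the first party
    diff2 = dist_a[1] - dist_b[1]   # computed locally by the second party
    # Stand-in for a secure comparison on the shared difference.
    return diff1 + diff2 < 0

def nearest_cluster(distances: List[Fragments]) -> int:
    """Index of the smallest distance, using K - 1 pairwise comparisons."""
    best = 0
    for k in range(1, len(distances)):
        if less_than(distances[k], distances[best]):
            best = k
    return best

def one_hot(best: int, num_clusters: int) -> List[int]:
    """Cluster vector with a 1 at the position of the assigned cluster."""
    return [1 if k == best else 0 for k in range(num_clusters)]

# Fragments of the distances 36, 4 and 25.
example = [(40, -4), (-6, 10), (100, -75)]
print(one_hot(nearest_cluster(example), len(example)))  # prints [0, 1, 0]
```

Subtracting fragments locally before comparing means the parties never need the individual distances, only the sign of their difference.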
Finally, in step 24, the cluster corresponding to the closest first target distance is determined as the cluster to which the first private data currently belongs. It is to be understood that the cluster to which the first private data belongs may differ between rounds of iteration.
In an example, after determining the class cluster corresponding to the closest first target distance as the class cluster to which the first privacy data currently belongs, the method further includes:
and updating the first fragment of the central data of the cluster according to the average value of the first private data of the same cluster.
It is to be understood that the same cluster as described above is any one of the various clusters described above.
For example, the first party and the second party update the central data (c1 and c2) according to the cluster vectors of all the private data. Taking c1 as an example, the update process is as follows:
the first party calculates the mean of all its private data whose cluster vector is [1, 0], denoted <c1>1;
the second party calculates the mean of all its private data whose cluster vector is [1, 0], denoted <c1>2.
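A hedged sketch of the update step follows. The text has each party average its own members of a cluster. For the two fragments to sum to the mean over the members held by both parties, this sketch instead divides each party's local sum by the combined member count, and it assumes that the parties are allowed to learn that count. That count exchange is an assumption of the example and is not stated in the source; the function names are hypothetical and the data is scalar.

```python
from typing import List

def local_members(points: List[float], vectors: List[List[int]], k: int) -> List[float]:
    """Points of one party whose cluster vector marks cluster k."""
    return [x for x, v in zip(points, vectors) if v[k] == 1]

def center_fragment(points: List[float], vectors: List[List[int]],
                    k: int, total_count: int) -> float:
    """One party's fragment of the updated center of cluster k."""
    if total_count == 0:
        return 0.0
    return sum(local_members(points, vectors, k)) / total_count

# Hypothetical data: cluster 0 contains 1.0 and 3.0 (first party) and 5.0 (second party).
a_points, a_vectors = [1.0, 3.0, 10.0], [[1, 0], [1, 0], [0, 1]]
b_points, b_vectors = [5.0, 20.0], [[1, 0], [0, 1]]

total = len(local_members(a_points, a_vectors, 0)) + len(local_members(b_points, b_vectors, 0))
frag_a = center_fragment(a_points, a_vectors, 0, total)
frag_b = center_fragment(b_points, b_vectors, 0, total)
print(frag_a + frag_b)  # sum shown only to check the sketch
```

If the member counts themselves must stay private, the division would also have to be carried out on fragments, which this sketch does not attempt.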
Further, after updating the first fragment of the central data of the cluster, the method further includes:
judging whether the amount of change in the central data of each cluster satisfies a preset iteration stop condition;
and if the amount of change in the central data of each cluster does not satisfy the preset iteration stop condition, performing the next round of iteration in the multi-round iteration process.
Further, the method further comprises:
and if the judgment result is that the variation of the central data of each cluster meets the preset iteration stopping condition, determining the cluster to which the first privacy data belongs currently as the cluster to which the first privacy data belongs finally.
Further, the determining whether the variation of the central data of each cluster satisfies a preset iteration stop condition includes:
and taking any one of the various clusters as a target cluster, and performing joint comparison with a second fragment of the central data before updating of the target cluster and a second fragment of the central data after updating of the target cluster in the second party in a secret sharing manner according to the first fragment of the central data before updating of the target cluster and the first fragment of the central data after updating of the target cluster, so as to judge whether the variation of the central data of the target cluster meets a preset iteration stop condition.
For example, the above-mentioned iteration stop condition is $\lVert C(t+1) - C(t) \rVert^2 < \delta$, where $\delta$ may be a preset value, $C(t)$ represents the central data before updating, and $C(t+1)$ represents the central data after updating.
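A minimal sketch of this check is given below, reading the stop condition as "the squared norm of the change in the central data is below delta". The function name and sample values are assumptions. The sketch adds the two parties' differences in the clear purely for illustration; in the described method this comparison would be carried out jointly in a secret sharing manner, so that neither party sees the reconstructed change.

```python
from typing import Sequence


def variation_below_threshold(old_share_1: Sequence[float],
                              new_share_1: Sequence[float],
                              old_share_2: Sequence[float],
                              new_share_2: Sequence[float],
                              delta: float) -> bool:
    """Check ||C(t+1) - C(t)||^2 < delta from additive fragments of C(t) and C(t+1)."""
    # Each party can form the difference of its own fragments locally.
    diff_1 = [n - o for n, o in zip(new_share_1, old_share_1)]
    diff_2 = [n - o for n, o in zip(new_share_2, old_share_2)]
    # ILLUSTRATION ONLY: summing the differences reveals C(t+1) - C(t).
    # A real protocol would replace this with a secret-shared comparison.
    change = [a + b for a, b in zip(diff_1, diff_2)]
    return sum(v * v for v in change) < delta


# Hypothetical fragments: C(t) = [2.0, 2.0], C(t+1) = [2.25, 2.25].
stop = variation_below_threshold([0.5, 1.0], [0.5, 1.25],
                                 [1.5, 1.0], [1.75, 1.0], delta=0.2)
print(stop)  # True, since the squared change is 0.125
```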
In this embodiment of the present specification, steps 21 to 24 above mainly describe how the first party determines, for its own first private data, the class cluster to which the first private data belongs. In addition, the first party needs to cooperate with the second party, in a secret sharing manner, to determine the class cluster to which the second private data belongs.
In one example, the second party has a second set of privacy data comprising a plurality of second privacy data, the method further comprising:
the first party respectively takes the central data as target central data, and performs second joint calculation with second private data in the second party and second fragments of the target central data by using a secret sharing mode based on first fragments of local target central data to obtain second fragments of second target distances between the second private data and the target central data; the second party has a first segment of the second target distance.
Further, the second joint calculation includes:
locally calculating a square of a first slice of the target-centric data;
performing multiplication operation on the first fragment of the target center data and the difference value between the second fragment of the target center data in the second party and the second private data in a secret sharing mode to obtain a second fragment of the product;
determining a second fragment of the second target distance between the second privacy data and the target central data according to the square and the second fragment of the product.
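The three steps of the second joint calculation can be sketched as below. Several things here are assumptions of the sketch rather than statements of the source: values are integers (for example after fixed-point encoding), additive fragments live modulo a prime `P`, and the secret-shared multiplication uses correlated randomness from a hypothetical trusted dealer, simulated inside one process. The factor 2 follows from expanding (y - c1 - c2)^2 = c1^2 + 2*c1*(c2 - y) + (c2 - y)^2.

```python
import secrets

P = 2**61 - 1  # prime modulus for additive fragments (assumed for this sketch)


def shared_product(x: int, y: int):
    """Return (z1, z2) with z1 + z2 = x * y (mod P).

    x is known only to party 1 and y only to party 2. A trusted dealer
    (assumed) gives party 1 (a, c1) and party 2 (b, c2) with c1 + c2 = a * b.
    """
    a, b, c1 = (secrets.randbelow(P) for _ in range(3))
    c2 = (a * b - c1) % P
    e = (x - a) % P  # sent by party 1; x is masked by a
    f = (y - b) % P  # sent by party 2; y is masked by b
    z1 = (a * f + c1) % P  # computed by party 1
    z2 = (e * y + c2) % P  # computed by party 2
    return z1, z2


def second_target_distance_shares(c_share_1, c_share_2, y):
    """Fragments of ||y - c||^2, where c = c_share_1 + c_share_2 coordinate-wise.

    The first party holds c_share_1; the second party holds c_share_2 and y.
    """
    first_party, second_party = 0, 0
    for c1, c2, yv in zip(c_share_1, c_share_2, y):
        w = (c2 - yv) % P                   # difference, local to the second party
        z1, z2 = shared_product(c1 % P, w)  # fragments of c1 * (c2 - y)
        first_party = (first_party + c1 * c1 + 2 * z1) % P  # local square + product fragment
        second_party = (second_party + w * w + 2 * z2) % P
    return first_party, second_party


# Hypothetical values: center c = [4, 4], second private data y = [1, 8].
s1, s2 = second_target_distance_shares([3, -2], [1, 6], [1, 8])
print((s1 + s2) % P)  # 25
```

Each fragment on its own is masked by the random product fragments; only their sum modulo `P` equals the squared distance.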
It can be understood that, in the method for clustering private data of multiple parties, the first party and the second party have equal status, and their processing procedures are not substantially different. In the embodiments of the present specification, the first party is mainly used as the execution subject to describe the corresponding processing procedures.
By the method provided by the embodiment of the specification, the central data of each cluster is not independently determined by any party, but the first party determines the first fragment of each central data corresponding to each cluster at present, and the second party determines the second fragment of each central data; the sum of the first shard of any central data and the second shard of the central data is equal to the central data; and subsequently, when determining a first target distance between the first private data and the target center data, determining a first segment of the first target distance by the first party and determining a second segment of the first target distance by the second party by using a secret sharing mode; when the nearest first target distance in the first target distances is determined, a secret sharing mode is also utilized; and finally, determining the cluster corresponding to the closest first target distance as the cluster to which the first privacy data belongs currently. The whole process is based on secret sharing, and when the private data of multiple parties are clustered, the private data can be prevented from being disclosed.
According to an embodiment of another aspect, an apparatus for clustering privacy data of multiple parties is further provided, and the apparatus is configured to perform the method for clustering privacy data of multiple parties provided by the embodiments of the present specification. The multiple parties include a first party and a second party, the first party is provided with a first privacy data set, the first privacy data set comprises a plurality of first privacy data, and the device is arranged on the first party and used for executing a plurality of rounds of iteration processes. Fig. 3 shows a schematic block diagram of an apparatus for clustering private data for multiple parties, according to one embodiment. As shown in fig. 3, the apparatus 300 includes the following units for performing any one iteration:
a center determining unit 31, configured to determine first segments of each piece of center data respectively corresponding to each class cluster at present; the second party has a second shard of the central data; the sum of the first shard of any central data and the second shard of the central data is equal to the central data;
a first joint calculation unit 32, configured to perform, on the basis of local first private data and a first segment of target central data, a first joint calculation with a second segment of the target central data in the second party in a secret sharing manner by using each piece of central data determined by the central determination unit 31 as target central data, respectively, to obtain a first segment of a first target distance between the first private data and the target central data; the second party has a second segment of the first target distance;
a joint comparison unit 33, configured to perform joint comparison with the second segment of each first target distance in the second party in a secret sharing manner based on the first segment of each first target distance obtained by the first joint calculation unit 32, and determine a closest first target distance among the first target distances;
a class cluster determining unit 34, configured to determine, as the class cluster to which the first privacy data currently belongs, the class cluster corresponding to the closest first target distance determined by the joint comparing unit 33.
Optionally, as an embodiment, the first joint calculation unit 32 includes:
a local calculation subunit, configured to locally calculate a first distance between the first private data and a first segment of the target center data;
a joint calculation subunit, configured to perform, with the second segment of the target center data in the second party, a multiplication operation in a secret sharing manner on a difference between the first segment of the target center data and the first private data, to obtain a first segment of a product;
a determining subunit, configured to determine a first segment of a first target distance between the first privacy data and the target center data according to the first distance obtained by the local calculation subunit and the first segment of the product obtained by the joint calculation subunit.
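One decomposition consistent with these three subunits is written out below as a worked identity. The symbols are assumptions introduced for illustration: $x$ is a first private data record, $\langle c\rangle_1$ and $\langle c\rangle_2$ are the two segments of the target center data $c$, and $\langle p\rangle_1$, $\langle p\rangle_2$ are the two segments of the secret-shared product $p$.

```latex
\begin{aligned}
\lVert x - c \rVert^2
  &= \lVert x - \langle c\rangle_1 - \langle c\rangle_2 \rVert^2 \\
  &= \underbrace{\lVert x - \langle c\rangle_1 \rVert^2}_{\text{local, first party}}
   + 2\,\underbrace{(\langle c\rangle_1 - x)\cdot\langle c\rangle_2}_{p \,=\, \langle p\rangle_1 + \langle p\rangle_2}
   + \underbrace{\lVert \langle c\rangle_2 \rVert^2}_{\text{local, second party}} \\[4pt]
\langle d\rangle_1 &= \lVert x - \langle c\rangle_1 \rVert^2 + 2\,\langle p\rangle_1,
\qquad
\langle d\rangle_2 = \lVert \langle c\rangle_2 \rVert^2 + 2\,\langle p\rangle_2
\end{aligned}
```

Under this reading, $\langle d\rangle_1 + \langle d\rangle_2 = \lVert x - c \rVert^2$, so each party ends up with one segment of the first target distance without learning the other party's inputs.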
Optionally, as an embodiment, the any one iteration is a first iteration, and the center determining unit 31 is specifically configured to determine that first segments of each piece of center data currently and respectively corresponding to each class cluster are randomly initialized data.
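A minimal sketch of such a random initialization follows. The function name, the value range, and the seed are assumptions of the example; the source only states that the first segments are randomly initialized in the first iteration.

```python
import random
from typing import List, Optional


def init_center_shares(num_clusters: int, dim: int,
                       low: float, high: float,
                       seed: Optional[int] = None) -> List[List[float]]:
    """Draw one random fragment per class cluster, each with `dim` coordinates."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(dim)]
            for _ in range(num_clusters)]


# The first party draws its fragments locally; the second party would do the
# same on its side, and each center is implicitly the sum of the two fragments.
first_party_shares = init_center_shares(num_clusters=2, dim=3,
                                        low=-1.0, high=1.0, seed=7)
print(len(first_party_shares), len(first_party_shares[0]))  # 2 3
```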
Optionally, as an embodiment, the joint comparison unit 33 includes:
a joint comparison subunit, configured to perform joint comparison with a second segment of any two first target distances in the second party in a secret sharing manner based on first segments of any two first target distances among the first target distances, and determine a comparison result of a distance between any two first target distances;
and the determining subunit is configured to determine, according to each comparison result determined by the joint comparison subunit, a closest first target distance among the first target distances.
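The pairwise selection described above can be sketched as a running minimum over the class clusters. The function names and sample fragments are assumptions. The comparison itself is simulated by adding the two parties' differences in the clear, which is for illustration only; in the described apparatus that step is the joint comparison performed in a secret sharing manner.

```python
from typing import Sequence


def closer(i: int, j: int,
           shares_1: Sequence[int], shares_2: Sequence[int]) -> int:
    """Return whichever of i, j indexes the smaller first target distance."""
    # Each party forms its fragment of d_i - d_j locally.
    diff_1 = shares_1[i] - shares_1[j]
    diff_2 = shares_2[i] - shares_2[j]
    # ILLUSTRATION ONLY: stand-in for a secure comparison of the shared
    # difference with zero; only the comparison result should be revealed.
    return i if diff_1 + diff_2 <= 0 else j


def nearest_cluster(shares_1: Sequence[int], shares_2: Sequence[int]) -> int:
    """Index of the closest first target distance, found by pairwise comparisons."""
    best = 0
    for k in range(1, len(shares_1)):
        best = closer(best, k, shares_1, shares_2)
    return best


# Hypothetical fragments of the distances 25, 4 and 9 to three class clusters.
first_party_fragments = [40, -7, 100]
second_party_fragments = [-15, 11, -91]
print(nearest_cluster(first_party_fragments, second_party_fragments))  # 1
```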
Optionally, as an embodiment, the apparatus further includes:
an updating unit, configured to update the first segment of the central data of the cluster according to an average value of each first privacy data of the same cluster after the cluster determining unit 34 determines the cluster corresponding to the closest first target distance as the cluster to which the first privacy data currently belongs.
Further, the apparatus further comprises:
the judging unit is used for judging whether the variation of the central data of each type of cluster meets the preset iteration stopping condition or not after the updating unit updates the first fragment of the central data of each type of cluster;
and the iteration triggering unit is used for performing the next iteration in the multi-round iteration process if the judgment result of the judging unit is that the variable quantity of the central data of each cluster does not meet the preset iteration stopping condition.
Further, the apparatus further comprises:
a final determining unit, configured to determine, if the determination result of the determining unit is that the variation of the central data of each type of cluster meets a preset iteration stop condition, the type cluster to which the first privacy data determined by the type cluster determining unit 34 belongs currently as the type cluster to which the first privacy data belongs finally.
Further, the determining unit is specifically configured to take any one of the class clusters as a target cluster and, according to the first fragment of the pre-update central data of the target cluster and the first fragment of the post-update central data of the target cluster, perform joint comparison in a secret sharing manner with the second fragment of the pre-update central data of the target cluster and the second fragment of the post-update central data of the target cluster in the second party, to determine whether the variation of the central data of the target cluster meets the preset iteration stop condition.
Optionally, as an embodiment, the second party has a second privacy data set, and the second privacy data set includes a plurality of second privacy data, and the apparatus further includes:
a second joint calculation unit, configured to perform a second joint calculation with a second private data in the second party and a second segment of the target center data in a secret sharing manner based on a first segment of local target center data, using the respective center data as target center data, and obtain a second segment of a second target distance between the second private data and the target center data; the second party has a first segment of the second target distance.
Further, the second joint calculation unit includes:
a local computation subunit for locally computing a square of a first segment of the target centric data;
a joint calculation subunit, configured to perform a multiplication operation in a secret sharing manner on the first segment of the target center data and a difference between the second segment of the target center data in the second party and the second private data, to obtain a second segment of the product;
a determining subunit, configured to determine a second segment of a second target distance between the second privacy data and the target center data according to the square obtained by the local calculation subunit and the second segment of the product obtained by the joint calculation subunit.
With the apparatus provided in the embodiments of the present specification, the central data of each class cluster is not determined independently by any single party. Instead, the center determining unit 31 of the first party determines the first fragment of each central data currently corresponding to each class cluster, and the second party determines the second fragment of each central data; the sum of the first fragment of any central data and the second fragment of that central data is equal to the central data. Subsequently, when the first joint calculation unit 32 determines the first target distance between the first private data and the target central data, secret sharing is used so that the first party determines a first fragment of the first target distance and the second party determines a second fragment of the first target distance. When the joint comparison unit 33 determines the closest first target distance among the first target distances, secret sharing is also used. Finally, the class cluster determining unit 34 determines the class cluster corresponding to the closest first target distance as the class cluster to which the first private data currently belongs. The whole process is based on secret sharing, so that private data can be prevented from being disclosed when the private data of multiple parties is clustered.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (22)

1. A method of clustering privacy data for a plurality of parties, the plurality of parties including a first party and a second party, the first party having a first set of privacy data including a plurality of first privacy data, the method performed by the first party comprising a plurality of iterations, wherein any one iteration comprises:
determining first fragments of the central data currently corresponding to the respective class clusters; the second party has a second fragment of the central data; the sum of the first fragment of any central data and the second fragment of the central data is equal to the central data;
respectively taking each central data as target central data, and, based on local first private data and a first fragment of the target central data, performing first joint calculation with a second fragment of the target central data in the second party in a secret sharing manner, to obtain a first fragment of a first target distance between the first private data and the target central data; the second party has a second fragment of the first target distance;
based on the first fragments of the first target distances, performing joint comparison with the second fragments of the first target distances in the second party in a secret sharing mode to determine the nearest first target distance in the first target distances;
and determining the cluster corresponding to the closest first target distance as the cluster to which the first privacy data belongs currently.
2. The method of claim 1, wherein the first joint computation comprises:
locally calculating a first distance between the first private data and a first segment of the target-centric data;
performing multiplication operation on the difference value between the first fragment of the target center data and the first private data and the second fragment of the target center data in the second party in a secret sharing mode to obtain a first fragment of the product;
determining a first fragment of the first target distance between the first privacy data and the target central data according to the first distance and the first fragment of the product.
3. The method according to claim 1, wherein the any one iteration is a first iteration, and the first fragment of each central data currently corresponding to each class cluster is randomly initialized data.
4. The method of claim 1, wherein the joint comparison comprises:
based on the first fragments of any two first target distances in the first target distances, performing joint comparison with the second fragments of the any two first target distances in the second party in a secret sharing manner, and determining a comparison result of the distance between the any two first target distances;
and determining the nearest first target distance in the first target distances according to the comparison results.
5. The method of claim 1, wherein after determining the class cluster corresponding to the closest first target distance as the class cluster to which the first privacy data currently belongs, the method further comprises:
and updating the first fragment of the central data of the cluster according to the average value of the first private data of the same cluster.
6. The method of claim 5, wherein after updating the first fragment of the central data of the class cluster, the method further comprises:
judging whether the variable quantity of the central data of each cluster meets preset iteration stopping conditions or not;
and if the judgment result is that the variable quantity of the central data of each cluster does not meet the preset iteration stopping condition, performing next iteration in the multi-round iteration process.
7. The method of claim 6, wherein the method further comprises:
and if the judgment result is that the variation of the central data of each cluster meets the preset iteration stopping condition, determining the cluster to which the first privacy data belongs currently as the cluster to which the first privacy data belongs finally.
8. The method according to claim 6, wherein the determining whether the variation of the central data of each cluster satisfies a predetermined condition for stopping iteration includes:
and taking any one of the various clusters as a target cluster, and performing joint comparison with a second fragment of the central data before updating of the target cluster and a second fragment of the central data after updating of the target cluster in the second party in a secret sharing manner according to the first fragment of the central data before updating of the target cluster and the first fragment of the central data after updating of the target cluster, so as to judge whether the variation of the central data of the target cluster meets a preset iteration stop condition.
9. The method of claim 1, wherein the second party has a second set of privacy data, the second set of privacy data including a plurality of second privacy data, the method further comprising:
respectively taking the central data as target central data, and performing second joint calculation with second privacy data in the second party and second fragments of the target central data by using a secret sharing mode based on first fragments of local target central data to obtain second fragments of second target distances of the second privacy data and the target central data; the second party has a first segment of the second target distance.
10. The method of claim 9, wherein the second joint calculation comprises:
locally calculating a square of a first slice of the target-centric data;
performing multiplication operation on the first fragment of the target center data and the difference value between the second fragment of the target center data in the second party and the second private data in a secret sharing mode to obtain a second fragment of the product;
determining a second fragment of the second target distance between the second privacy data and the target central data according to the square and the second fragment of the product.
11. An apparatus for clustering privacy data of multiple parties, the multiple parties including a first party and a second party, the first party having a first privacy data set, the first privacy data set including a plurality of first privacy data, the apparatus being provided for the first party, and configured to perform a multiple-round iteration process, including the following units for performing any one round of iteration:
a center determining unit, configured to determine first fragments of the central data currently corresponding to the respective class clusters; the second party has a second fragment of the central data; the sum of the first fragment of any central data and the second fragment of the central data is equal to the central data;
a first joint calculation unit, configured to respectively use each piece of central data determined by the central determination unit as target central data, perform, based on first local private data and a first segment of the target central data, a first joint calculation with a second segment of the target central data in the second party in a secret sharing manner, so as to obtain first segments of first target distances between the first private data and the target central data; the second party has a second segment of the first target distance;
a joint comparison unit, configured to perform joint comparison with second segments of the first target distances in the second party in a secret sharing manner based on the first segments of the first target distances obtained by the first joint calculation unit, and determine a closest first target distance among the first target distances;
and a class cluster determining unit, configured to determine, as the class cluster to which the first privacy data currently belongs, the class cluster corresponding to the closest first target distance determined by the joint comparing unit.
12. The apparatus of claim 11, wherein the first joint computation unit comprises:
a local computation subunit to locally compute a first distance between the first private data and a first patch of the target-centric data;
a joint calculation subunit, configured to perform, with the second segment of the target center data in the second party, multiplication operation in a secret sharing manner on a difference between the first segment of the target center data and the first private data, to obtain a first segment of a product;
a determining subunit, configured to determine a first segment of a first target distance between the first privacy data and the target center data according to the first distance obtained by the local calculation subunit and the first segment of the product obtained by the joint calculation subunit.
13. The apparatus according to claim 11, wherein the any one iteration is a first iteration, and the center determining unit is specifically configured to determine that first segments of the pieces of central data respectively corresponding to the various clusters are randomly initialized data.
14. The apparatus of claim 11, wherein the joint comparison unit comprises:
a joint comparison subunit, configured to perform joint comparison with a second segment of any two first target distances in the second party in a secret sharing manner based on first segments of any two first target distances among the first target distances, and determine a comparison result of a distance between any two first target distances;
and the determining subunit is configured to determine, according to each comparison result determined by the joint comparison subunit, a closest first target distance among the first target distances.
15. The apparatus of claim 11, wherein the apparatus further comprises:
and an updating unit, configured to update the first segment of the central data of the cluster according to an average value of each first privacy data of the same cluster after the cluster determining unit determines the cluster corresponding to the closest first target distance as the cluster to which the first privacy data currently belongs.
16. The apparatus of claim 15, wherein the apparatus further comprises:
the judging unit is used for judging whether the variation of the central data of each type of cluster meets the preset iteration stopping condition or not after the updating unit updates the first fragment of the central data of each type of cluster;
and the iteration triggering unit is used for performing the next iteration in the multi-round iteration process if the judgment result of the judging unit is that the variable quantity of the central data of each cluster does not meet the preset iteration stopping condition.
17. The apparatus of claim 16, wherein the apparatus further comprises:
and a final determining unit, configured to determine, if a determination result of the determining unit is that the variation of the central data of each type of cluster meets a preset iteration stop condition, a type cluster to which the first privacy data determined by the type cluster determining unit currently belongs as a type cluster to which the first privacy data finally belongs.
18. The apparatus according to claim 16, wherein the determining unit is specifically configured to determine, with respect to any one of the various clusters as a target cluster, whether a variation of the central data of the target cluster satisfies a preset iteration stop condition by jointly comparing, in a secret sharing manner, a first slice of pre-update central data of the target cluster and a first slice of post-update central data of the target cluster with a second slice of pre-update central data of the target cluster and a second slice of post-update central data of the target cluster in the second party.
19. The apparatus of claim 11, wherein the second party has a second set of privacy data, the second set of privacy data including a plurality of second privacy data, the apparatus further comprising:
a second joint calculation unit, configured to perform a second joint calculation with a second private data in the second party and a second segment of the target center data in a secret sharing manner based on a first segment of local target center data, using the respective center data as target center data, and obtain a second segment of a second target distance between the second private data and the target center data; the second party has a first segment of the second target distance.
20. The apparatus of claim 19, wherein the second joint calculation unit comprises:
a local computation subunit for locally computing a square of a first segment of the target centric data;
a joint calculation subunit, configured to perform a multiplication operation in a secret sharing manner on the first segment of the target center data and a difference between the second segment of the target center data in the second party and the second private data, to obtain a second segment of the product;
a determining subunit, configured to determine a second segment of a second target distance between the second privacy data and the target center data according to the square obtained by the local calculation subunit and the second segment of the product obtained by the joint calculation subunit.
21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.
CN202010536190.4A 2020-06-12 2020-06-12 Method and device for clustering private data of multiple parties Active CN111444544B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010536190.4A CN111444544B (en) 2020-06-12 2020-06-12 Method and device for clustering private data of multiple parties
PCT/CN2021/099485 WO2021249502A1 (en) 2020-06-12 2021-06-10 Method and apparatus for clustering privacy data of multiple parties

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010536190.4A CN111444544B (en) 2020-06-12 2020-06-12 Method and device for clustering private data of multiple parties

Publications (2)

Publication Number Publication Date
CN111444544A CN111444544A (en) 2020-07-24
CN111444544B true CN111444544B (en) 2020-09-11

Family

ID=71653625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010536190.4A Active CN111444544B (en) 2020-06-12 2020-06-12 Method and device for clustering private data of multiple parties

Country Status (2)

Country Link
CN (1) CN111444544B (en)
WO (1) WO2021249502A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444544B (en) * 2020-06-12 2020-09-11 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN112560107B (en) * 2021-02-20 2021-05-14 支付宝(杭州)信息技术有限公司 Method and device for processing private data
CN113257378B (en) * 2021-06-16 2021-09-28 湖南创星科技股份有限公司 Medical service communication method and system based on micro-service technology
CN114282076B (en) * 2022-03-04 2022-06-14 支付宝(杭州)信息技术有限公司 Sorting method and system based on secret sharing
CN116094844B (en) * 2023-04-10 2023-06-20 蓝象智联(杭州)科技有限公司 Address checking method for multiparty security calculation

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996198B (en) * 2009-08-31 2016-06-29 中国移动通信集团公司 Cluster realizing method and system
CN104809242B (en) * 2015-05-15 2018-03-02 成都睿峰科技有限公司 A kind of big data clustering method and device based on distributed frame
CN105138923B (en) * 2015-08-11 2019-01-08 苏州大学 A kind of time series similarity calculation method for protecting privacy
CN106874367A (en) * 2016-12-30 2017-06-20 江苏号百信息服务有限公司 A kind of sampling distribution formula clustering method based on public sentiment platform
CN110609831B (en) * 2019-08-27 2020-07-03 浙江工商大学 Data link method based on privacy protection and safe multi-party calculation
CN111159406A (en) * 2019-12-30 2020-05-15 内蒙古工业大学 Big data text clustering method and system based on parallel improved K-means algorithm
CN111444544B (en) * 2020-06-12 2020-09-11 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties

Also Published As

Publication number Publication date
CN111444544A (en) 2020-07-24
WO2021249502A1 (en) 2021-12-16

Similar Documents

Publication Publication Date Title
CN111444544B (en) Method and device for clustering private data of multiple parties
CN111523143B (en) Method and device for clustering private data of multiple parties
CN111444545B (en) Method and device for clustering private data of multiple parties
Ban et al. A PTAS for ℓp-low rank approximation
US10552471B1 (en) Determining identities of multiple people in a digital image
CN111445007A (en) Training method and system for resisting generation of neural network
Hegde et al. Sok: Efficient privacy-preserving clustering
CN112487489B (en) Joint data processing method and device for protecting privacy
Hayes et al. Bounding training data reconstruction in dp-sgd
CN112084500A (en) Method and device for clustering virus samples, electronic equipment and storage medium
US20180239993A1 (en) Face recognition in big data ecosystem using multiple recognition models
CN111523674B (en) Model training method, device and system
CN108363740B (en) IP address analysis method and device, storage medium and terminal
Garcia-Magarinos et al. Lasso logistic regression, GSoft and the cyclic coordinate descent algorithm: application to gene expression data
CN104765776B (en) The clustering method and device of a kind of data sample
CN116028832A (en) Sample clustering processing method and device, storage medium and electronic equipment
CN116629376A (en) Federal learning aggregation method and system based on no data distillation
CN115545215A (en) Decentralized federal cluster learning method, decentralized federal cluster learning device, decentralized federal cluster learning equipment and decentralized federal cluster learning medium
CN111667394B (en) Map scaling inference method based on feature description
CN113537308A (en) Two-stage k-means clustering processing system and method based on localized differential privacy
CN107248929B (en) Strong correlation data generation method of multi-dimensional correlation data
CN113361055B (en) Privacy processing method, device, electronic equipment and storage medium in extended social network
WO2023216900A1 (en) Model performance evaluating method, apparatus, device, and storage medium
US20230342420A1 (en) Approximate maximal clique enumeration for dynamic graphs
US7975044B1 (en) Automated disambiguation of fixed-serverport-based applications from ephemeral applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40034074; Country of ref document: HK)