WO2022116491A1 - DBSCAN clustering method based on horizontal federation, and related device therefor


Publication number
WO2022116491A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
data set
feature
sum
encrypted
Prior art date
Application number
PCT/CN2021/096851
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
李泽远
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a DBSCAN clustering method, device, computer equipment and storage medium based on horizontal federation.
  • the embodiment of the present application provides a DBSCAN clustering method based on horizontal federation, and adopts the following technical solutions:
  • the first data set includes first features of several first objects
  • the federated variance selection algorithm performs feature screening on the second data set to obtain a second to-be-clustered data set, wherein the second data set includes the second features of several second objects;
  • DBSCAN clustering is performed on the current first object according to the obtained Euclidean distance to obtain an object clustering result.
  • a data set obtaining module configured to obtain a first data set, wherein the first data set includes the first features of several first objects
  • the feature screening module is used to perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes the second features of several second objects;
  • an object traversal module configured to traverse the first object in the first to-be-clustered data set
  • a distance calculation module used to calculate the Euclidean distance between the current first object and each first object, and calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm
  • the object clustering module is configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distance to obtain an object clustering result.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • the federated variance selection algorithm performs feature screening on the second data set to obtain a second to-be-clustered data set, wherein the second data set includes the second features of several second objects;
  • Step S20222: encrypt the accumulated sum of the first feature values and the first number of objects in the first data set by using the first homomorphic key pair.
  • the second server draws a random message $z \in M$, computes the product of $z$ and the accumulated sum of the second feature values as well as the product $z \cdot N_B$ of $z$ and the second number of objects $N_B$, and then encrypts both products with the first encryption key $E_{k1}$, obtaining the encrypted accumulated-sum product and $E_{k1}(z \cdot N_B)$.
  • Step S20241: generate a second homomorphic key pair.
  • the first server generates a second homomorphic key pair $(E_{k2}, D_{k2})$, where $E_{k2}$ is the second encryption key and $D_{k2}$ is the second decryption key.
  • the second homomorphic key pair $(E_{k2}, D_{k2})$ supports homomorphic encryption.
  • Step S20242: encrypt the first accumulated error sum and the first number of objects in the first data set by using the second homomorphic key pair.
  • Step S20243: send the second encryption key in the second homomorphic key pair, the encrypted first accumulated error sum, and the encrypted first number of objects to the second server, to instruct the second server to calculate the encrypted joint accumulated error sum and the encrypted joint number of objects from the second encryption key, the encrypted first accumulated error sum, the encrypted first number of objects, the second accumulated error sum, and the second number of objects in the second data set.
  • Steps S20241-S20244 implement the homomorphic-encryption weighted average algorithm.
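The exchange described in these steps can be illustrated with a minimal sketch. It assumes an additively homomorphic scheme; a toy Paillier implementation with tiny primes is used here purely for illustration and is not secure. The function names, the integer-valued sums, the range of the random message `z`, and the way the second server folds its blinded values into the first server's ciphertexts are assumptions, not details confirmed by the text above.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer m with 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add(n, c1, c2):
    """Ciphertext of the sum of the two underlying plaintexts."""
    return (c1 * c2) % (n * n)

def scale(n, c, k):
    """Ciphertext of k times the underlying plaintext."""
    return pow(c, k, n * n)

def second_server_step(n, enc_sum_a, enc_n_a, sum_b, n_b):
    # Assumed blinding: scale the first server's values by z and add z * own values.
    z = random.randrange(1, 1000)
    enc_joint_sum = add(n, scale(n, enc_sum_a, z), encrypt(n, z * sum_b))
    enc_joint_cnt = add(n, scale(n, enc_n_a, z), encrypt(n, z * n_b))
    return enc_joint_sum, enc_joint_cnt

def joint_mean(sum_a, n_a, sum_b, n_b):
    # First server: generate keys, encrypt its sum and object count.
    n, priv = keygen(10007, 10009)
    enc_sum_a, enc_n_a = encrypt(n, sum_a), encrypt(n, n_a)
    # Second server: works only on ciphertexts plus its own plaintext values.
    c_sum, c_cnt = second_server_step(n, enc_sum_a, enc_n_a, sum_b, n_b)
    # First server: decrypt; the common factor z cancels in the ratio.
    return decrypt(n, priv, c_sum) / decrypt(n, priv, c_cnt)

print(joint_mean(120, 4, 180, 6))
```

In this sketch the first server only ever sees the two sums multiplied by the unknown factor `z`, so it learns the joint mean without learning the second server's sum or object count separately.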
  • the first data set and the second data set on the second server undergo feature screening, which retains the features most useful for clustering and at the same time reduces the feature dimensionality, so that the data suit the DBSCAN algorithm.
  • Step S203: traverse the first objects in the first to-be-clustered data set.
  • Step S204: calculate the Euclidean distance between the current first object and each first object, and calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm.
  • Step S2041: calculate the Euclidean distance between the current first object and each first object.
  • the first server calculates the Euclidean distance between the current first object and each other first object in the first to-be-clustered data set. With feature dimension d, this distance is the square root of the sum, over the d features, of the squared differences between the two objects' feature values.
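For reference, the standard Euclidean distance in d dimensions can be written as below. The symbols $x_i$ and $x_j$ for the two first objects and the feature index $k$ are assumed here, because the notation used in the source did not survive in this text.

```latex
\[
\mathrm{dist}(x_i, x_j) = \sqrt{\sum_{k=1}^{d} \left( x_{ik} - x_{jk} \right)^{2}}
\]
```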
  • Step S2042: calculate the first feature square sum of the current first object.
  • the first server calculates the first feature square sum of the current first object.
  • Step S2043: for each second object, calculate the feature cross product sum of the current first object and the second object through the product algorithm with the second server, and instruct the second server to calculate the second feature square sum of the second object.
  • the Euclidean distance between the current first object and each second object needs to be calculated.
  • the first server needs to supply its input to the product algorithm multiple times.
  • the second server likewise supplies multiple inputs together with random numbers $r_j$, $j \in [1, d]$; it needs to generate $d$ random numbers $r_1, r_2, \ldots, r_d$ that satisfy a constraint whose formula is not legible in this text.
  • the first server and the second server use the product algorithm to calculate the cross product of each feature and sum the results to get the feature cross product sum of the first object and the second object.
  • the second server calculates the second feature square sum of the second object.
  • the calculation is based on the product algorithm, so the underlying feature values do not need to be exchanged.
  • step S2043 may include:
  • Step S20431: generate a first random number, and generate a third homomorphic key pair based on the Paillier encryption algorithm.
  • Step S20435: decrypt each encrypted feature cross product by using the third decryption key in the third homomorphic key pair to obtain the feature cross product sum of the current first object and the second object.
  • the feature cross product sum of the current first object and the second object is calculated, which makes the Euclidean distance calculation between the current first object and the second object possible.
  • Steps S2042-S2044 constitute the federated Euclidean distance algorithm.
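The three quantities named above combine into a distance through the identity ‖a − b‖² = Σa² − 2Σab + Σb². The sketch below shows only that arithmetic: the secure product algorithm is replaced by a plain-text stand-in, and all function names are hypothetical.

```python
import math

def first_square_sum(a):
    """Computed locally by the first server on its own object."""
    return sum(v * v for v in a)

def second_square_sum(b):
    """Computed locally by the second server on its own object."""
    return sum(v * v for v in b)

def cross_product_sum(a, b):
    # Plain-text stand-in for the secure product algorithm (assumption):
    # in the federated setting neither side would see the other's values.
    return sum(x * y for x, y in zip(a, b))

def federated_distance(sq_a, cross, sq_b):
    """Combine the three sums; clamp at zero to absorb rounding error."""
    return math.sqrt(max(sq_a - 2 * cross + sq_b, 0.0))

a = [1.0, 2.0, 3.0]  # current first object
b = [4.0, 6.0, 3.0]  # one second object
print(federated_distance(first_square_sum(a), cross_product_sum(a, b), second_square_sum(b)))
```

Only the two square sums and the cross product sum cross the boundary between the servers in this decomposition, which is why individual feature values need not be exchanged.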
  • DBSCAN clustering can be performed on the current first object according to the DBSCAN algorithm to obtain a clustering result.
  • the clustering result can be regarded as group division of the objects in the first to-be-clustered data set and the second to-be-clustered data set.
  • Step S2051: according to the obtained Euclidean distances and a preset threshold of the number of neighboring objects, determine whether the current first object is a core point.
  • Noise points: samples that are neither core points nor boundary points.
  • Directly density-reachable: if $x_i$ lies in the ε-neighborhood of $x_j$ and $x_j$ is a core point, then $x_i$ is directly density-reachable from $x_j$.
  • the first server queries, according to the calculated Euclidean distances, the objects in the clustering neighborhood (i.e., the ε-neighborhood) of the current first object; these objects can come from the first to-be-clustered data set or from the second to-be-clustered data set.
  • the number of these objects is compared with the preset threshold MinPts of the number of neighboring objects to determine whether the current first object is a core point.
  • Step S2052: when the current first object is a core point, determine the density-reachable points in the clustering neighborhood of the current first object and obtain the object clustering result, wherein the density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set.
  • density-reachable points are searched for in its clustering neighborhood; they include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set.
  • the density-reachable points found form a cluster. If the current first object is a boundary point or a noise point, it is not processed, and the next core point is searched for until all first objects in the first to-be-clustered data set have been processed and the object clustering result is obtained; each cluster is one clustering result.
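The core-point test and cluster expansion described above can be sketched as follows, assuming all pairwise distances (local and federated) have already been collected into one matrix. The function name, the parameter names `eps` and `min_pts`, and the convention that an object counts as its own neighbor are assumptions made for illustration.

```python
def dbscan(dist, eps, min_pts):
    """dist: symmetric matrix of pairwise distances.
    Returns one label per object; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may later become a boundary point
            continue
        labels[i] = cluster  # i is a core point and starts a new cluster
        pending = [j for j in neighbors if j != i]
        while pending:
            j = pending.pop()
            if labels[j] == -1:
                labels[j] = cluster  # reachable from a core point: boundary point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:
                pending.extend(j_neighbors)  # j is itself a core point: keep expanding
        cluster += 1
    return labels

points = [0, 1, 2, 10, 11, 12, 50]  # one-dimensional toy objects
dist = [[abs(a - b) for b in points] for a in points]
print(dbscan(dist, 1.5, 2))
```

In the federated setting the rows for the first server's objects would hold both the locally computed distances and those obtained through the federated Euclidean distance algorithm.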
  • the second server can perform DBSCAN clustering on the second object according to the same operation as the first server.
  • the horizontal federation-based DBSCAN clustering method of the present application realizes object clustering, and the objects within each clustering result have a certain degree of similarity. For example, in a financial marketing scenario, after horizontal federation-based DBSCAN clustering is performed on users according to user data, each clustering result can be a group of users with similar behaviors.
  • the horizontal federation-based DBSCAN clustering method is thus equivalent to dividing the users into communities.
  • DBSCAN clustering is performed on the current first object, which makes it possible to cluster objects using the data sets of different institutions, breaking the data barriers and improving the accuracy of DBSCAN clustering.
  • horizontal federated learning is performed with the second server, and the federated variance selection algorithm screens the features of the first data set and of the second data set on the second server without exchanging specific data.
  • for the current first object traversed in the first to-be-clustered data set, the Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated by the federated Euclidean distance algorithm, so that the Euclidean distances between objects in the two separate data sets are obtained without exchanging specific data.
  • DBSCAN clustering is then performed on these Euclidean distances, which breaks the data barrier, achieves object clustering using the data sets of different institutions without infringing data privacy, and improves the accuracy of object clustering.
  • the present application provides an embodiment of a DBSCAN clustering apparatus based on horizontal federation, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 .
  • the device can be specifically applied to various electronic devices.
  • the horizontal federation-based DBSCAN clustering device 300 in this embodiment includes: a data set acquisition module 301, a feature screening module 302, an object traversal module 303, a distance calculation module 304, and an object clustering module 305, wherein:
  • the feature screening module 302 is used to perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects.
  • the object traversal module 303 is configured to traverse the first object in the first to-be-clustered data set.
  • the distance calculation module 304 is configured to calculate the Euclidean distance between the current first object and each first object, and calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm.
  • the feature value calculation submodule is configured to, for each first feature in the first data set, calculate the accumulated sum of the first feature values of the first feature, and instruct the second server to calculate the accumulated sum of the second feature values of the second feature corresponding to the first feature.
  • the accumulated sum calculation submodule is used to combine the accumulated sum of the first feature values and the accumulated sum of the second feature values with the second server through the homomorphic-encryption weighted average algorithm to obtain the joint mean value of the first feature.
  • the error calculation submodule is configured to calculate the first accumulated error sum of the first feature based on the joint mean value, and instruct the second server to calculate the second accumulated error sum of the second feature based on the joint mean value.
  • the mean square error calculation submodule is used to combine the first accumulated error sum and the second accumulated error sum with the second server through the homomorphic-encryption weighted average algorithm to obtain the joint mean square error of the first feature.
  • the first data set and the second data set on the second server undergo feature screening, which retains the features most useful for clustering and at the same time reduces the feature dimensionality, so that the data suit the DBSCAN algorithm.
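The arithmetic carried out by these four submodules can be sketched in plain text as follows. Encryption is deliberately omitted, so this shows only what is computed, not how it is protected. The squared deviation as the per-object error, the rule of keeping features whose joint mean square error exceeds a threshold, and all names are assumptions.

```python
def weighted_average(sum_a, n_a, sum_b, n_b):
    """Plain-text stand-in for the homomorphic-encryption weighted average."""
    return (sum_a + sum_b) / (n_a + n_b)

def select_features(data_a, data_b, threshold):
    """data_a, data_b: lists of rows, one row per object, same feature order.
    Returns the indices of the features whose joint mean square error exceeds threshold."""
    n_a, n_b = len(data_a), len(data_b)
    kept = []
    for f in range(len(data_a[0])):
        col_a = [row[f] for row in data_a]  # held by the first server
        col_b = [row[f] for row in data_b]  # held by the second server
        # Accumulated sums of feature values -> joint mean value.
        mean = weighted_average(sum(col_a), n_a, sum(col_b), n_b)
        # Accumulated error sums around the joint mean -> joint mean square error.
        err_a = sum((v - mean) ** 2 for v in col_a)
        err_b = sum((v - mean) ** 2 for v in col_b)
        mse = weighted_average(err_a, n_a, err_b, n_b)
        if mse > threshold:
            kept.append(f)
    return kept

data_a = [[1.0, 5.0], [3.0, 5.0]]
data_b = [[5.0, 5.0], [7.0, 5.0]]
print(select_features(data_a, data_b, 0.5))
```

The second feature is constant across both data sets, so its joint mean square error is zero and it is screened out.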
  • the accumulated sum calculation submodule may include: a first generating unit, a first encryption unit, a first sending unit, and a mean value calculation unit, wherein:
  • the first generating unit is configured to generate a first homomorphic key pair.
  • the first encryption unit is configured to encrypt the accumulated sum of the first feature values and the first object quantity of the first data set by using the first homomorphic key pair.
  • the first sending unit is configured to send the first encryption key in the first homomorphic key pair, the encrypted accumulated sum of the first feature values, and the encrypted first object quantity to the second server, so as to instruct the second server to calculate the encrypted joint accumulated sum and the encrypted joint object quantity from the first encryption key, the encrypted accumulated sum of the first feature values, the encrypted first object quantity, the accumulated sum of the second feature values, and the second object quantity of the second data set.
  • the mean value calculation unit is configured to calculate the joint mean value of the first feature according to the encrypted joint accumulated sum and the encrypted joint object quantity returned by the second server.
  • the joint average value of the feature is obtained by combining the first data set and the second data set.
  • the mean square error calculation submodule may include: a second generating unit, a second encryption unit, a second sending unit, and a mean square error calculation unit, wherein:
  • the second generating unit is configured to generate a second homomorphic key pair.
  • the second sending unit is configured to send the second encryption key in the second homomorphic key pair, the encrypted first accumulated error sum, and the encrypted first object quantity to the second server, so as to instruct the second server to calculate the encrypted joint accumulated error sum and the encrypted joint object quantity from the second encryption key, the encrypted first accumulated error sum, the encrypted first object quantity, the second accumulated error sum, and the second object quantity of the second data set.
  • the mean square error calculation unit is configured to calculate the joint mean square error of the first feature according to the encrypted joint accumulated error sum and the encrypted joint object quantity returned by the second server.
  • the joint mean square error of the feature is calculated in combination with the first data set and the second data set.
  • the distance calculation module 304 may include: a distance calculation submodule, a sum of squares calculation submodule, a cross calculation submodule, and a Euclidean calculation submodule, wherein:
  • the distance calculation submodule is used to calculate the Euclidean distance between the current first object and each first object.
  • the sum of squares calculation submodule is used to calculate the first feature square sum of the current first object.
  • the cross calculation submodule is configured to, for each second object, use the product algorithm with the second server to calculate the feature cross product sum of the current first object and the second object, and instruct the second server to calculate the second feature square sum of the second object.
  • the Euclidean calculation submodule is configured to calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.
  • through the federated Euclidean distance algorithm, the Euclidean distances between objects in the first to-be-clustered data set and objects in the second to-be-clustered data set are obtained without infringing data privacy, which makes DBSCAN clustering across the two data sets possible.
  • the cross calculation submodule may include: a generating unit, a joint encryption unit, an encrypted value sending unit, a receiving unit, and a decrypting unit, wherein:
  • the generating unit is used for generating the first random number and generating the third homomorphic key pair based on the Paillier encryption algorithm.
  • the joint encryption unit is configured to jointly encrypt each first characteristic value of the current first object and the first random number by using the third encryption key in the third homomorphic key pair to obtain a joint encrypted value.
  • the encrypted value sending unit is configured to send the joint encrypted value to the second server and, for each second object, instruct the second server to calculate each encrypted feature cross product from the joint encrypted value, each second feature value of the second object, and the generated second random number, and instruct the second server to calculate the second feature square sum of the second object.
  • the receiving unit is configured to receive each encrypted feature cross product and the second feature square sum of the second object returned by the second server.
  • the decryption unit is configured to decrypt each encrypted feature cross product by using the third decryption key in the third homomorphic key pair to obtain the feature cross product sum of the current first object and the second object.
  • the feature cross product sum of the current first object and the second object is calculated, which makes the Euclidean distance calculation between the current first object and the second object possible.
  • the object clustering module 305 may include: an object determination submodule and a reachable point determination submodule, wherein:
  • the object determination sub-module is configured to determine whether the current first object is a core point according to the obtained Euclidean distance and a preset threshold of the number of neighboring objects.
  • the reachable point determination sub-module is used to determine the density-reachable points in the clustering neighborhood of the current first object when the current first object is a core point, and obtain the object clustering result, wherein the density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set.
  • DBSCAN clustering is performed on the current first object, which makes it possible to cluster objects using the data sets of different institutions, breaking the data barriers and improving the accuracy of DBSCAN clustering.

Abstract

A DBSCAN clustering method and apparatus (300) based on horizontal federation, and a computer device (4) and a storage medium, which belong to the field of artificial intelligence. The method comprises: acquiring a first data set, wherein the first data set comprises first features of several first objects (S201); performing horizontal federated learning with a second data set of a second server, and performing feature screening on the first data set by means of a federated variance selection algorithm, so as to obtain a first data set to be clustered (S202); traversing the first objects in the first data set to be clustered (S203); calculating a Euclidean distance between the current first object and each of the other first objects, and calculating a Euclidean distance between the current first object and each second object by means of a federated Euclidean distance algorithm (S204); and performing DBSCAN clustering on the current first object according to the obtained Euclidean distances, so as to obtain an object clustering result (S205). In addition, a first data set can be stored in a blockchain. By means of the method, the accuracy of object clustering is improved.

Description

DBSCAN clustering method based on horizontal federation, and related device therefor
This application claims priority to the Chinese patent application No. 202011388364.3, filed with the China Patent Office on December 1, 2020 and entitled "DBSCAN clustering method based on horizontal federation, and related device therefor", the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the technical field of artificial intelligence, and in particular to a DBSCAN clustering method, apparatus, computer device, and storage medium based on horizontal federation.
Background
With the in-depth development of computer technology, computers are used in a wide range of data mining scenarios. Object clustering is one kind of data mining: by analyzing the data of each dimension of the objects, the objects are clustered, and identical or similar objects are grouped into one category. For example, in financial marketing scenarios, financial institutions obtain a large amount of user data every day, and these data contain a great deal of personal private information and business secrets. By clustering the user data, users can be classified so that services can be provided to different categories of users.
The DBSCAN algorithm is a density-based clustering algorithm. It defines a cluster as the largest set of density-connected points, can divide regions of sufficient density into clusters, and can find clusters of arbitrary shape in noisy spatial data sets. However, the inventors realized that the traditional DBSCAN algorithm cannot break the data barriers between different institutions and can only cluster an institution's internal data; it is also unsuitable for high-dimensional data, so the clustering accuracy is low.
Summary of the invention
The purpose of the embodiments of the present application is to propose a DBSCAN clustering method, apparatus, computer device, and storage medium based on horizontal federation, so as to solve the problem of low DBSCAN clustering accuracy.
To solve the above technical problem, an embodiment of the present application provides a DBSCAN clustering method based on horizontal federation, which adopts the following technical solution:
acquiring a first data set, wherein the first data set includes first features of several first objects;
performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first to-be-clustered data set, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects;
traversing the first objects in the first to-be-clustered data set;
calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
To solve the above technical problem, an embodiment of the present application further provides a DBSCAN clustering apparatus based on horizontal federation, which adopts the following technical solution:
a data set acquisition module, configured to acquire a first data set, wherein the first data set includes first features of several first objects;
a feature screening module, configured to perform horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects;
an object traversal module, configured to traverse the first objects in the first to-be-clustered data set;
a distance calculation module, configured to calculate the Euclidean distance between the current first object and each first object, and to calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
an object clustering module, configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
To solve the above technical problem, an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
acquiring a first data set, wherein the first data set includes first features of several first objects;
performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first to-be-clustered data set, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects;
traversing the first objects in the first to-be-clustered data set;
calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
To solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium storing computer-readable instructions, wherein the following steps are implemented when the computer-readable instructions are executed by a processor:
acquiring a first data set, wherein the first data set includes first features of several first objects;
performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first to-be-clustered data set, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects;
traversing the first objects in the first to-be-clustered data set;
calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects. After the first data set is acquired, horizontal federated learning is performed with the second server, and the federated variance selection algorithm screens the features of the first data set and of the second data set on the second server without exchanging the underlying data, which reduces the feature dimension and thereby adapts the data to the DBSCAN algorithm. In addition, for the current first object traversed in the first to-be-clustered data set, the Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated through the federated Euclidean distance algorithm. The Euclidean distances between objects in two separate data sets are thus obtained without exchanging the underlying data and are used for DBSCAN clustering. This breaks down data barriers, allows data sets from different institutions to be used for object clustering without violating data privacy, and improves the accuracy of object clustering.
Description of drawings
To explain the solutions in the present application more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of one embodiment of the horizontal-federation-based DBSCAN clustering method according to the present application;
FIG. 3 is a schematic structural diagram of one embodiment of the horizontal-federation-based DBSCAN clustering apparatus according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to the present application.
Detailed description
The terminology used in the specification of the present application is only for the purpose of describing specific embodiments and is not intended to limit the application. The terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish different objects, not to describe a particular order.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101 and 102, a network 103, a first server 104 and a second server 105. The network 103 is the medium that provides communication links between the terminal devices 101 and 102, the first server 104 and the second server 105. The network 103 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user can use the terminal device 101 to interact with the first server 104 through the network 103 to receive or send messages, and can likewise use the terminal device 102 to interact with the second server 105 through the network 103. Various communication client applications may be installed on the terminal devices 101 and 102.
The terminal devices 101 and 102 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers.
The first server 104 and the second server 105 may be servers that provide various services, and they can implement the horizontal-federation-based DBSCAN clustering service.
It should be noted that the horizontal-federation-based DBSCAN clustering method provided in the embodiments of the present application is generally executed by the first server and the second server; accordingly, the horizontal-federation-based DBSCAN clustering apparatus is generally provided in the first server and the second server. In this application, the description is given from the perspective of the first server.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, depending on implementation needs.
With continued reference to FIG. 2, a flowchart of one embodiment of the horizontal-federation-based DBSCAN clustering method according to the present application is shown. The method includes the following steps:
Step S201: acquire a first data set, wherein the first data set includes first features of several first objects.
In this embodiment, the electronic device on which the horizontal-federation-based DBSCAN clustering method runs (for example, the first server shown in FIG. 1) can communicate through various wired or wireless connections.
Specifically, during horizontal-federation-based DBSCAN clustering, the first server and the second server perform clustering at the same time: the first server acquires the first data set stored on the first server, and the second server acquires the second data set stored on the second server.
The first data set and the second data set may be the feature sets of objects held by two participants. The two data sets share the same features, and each feature describes the same type of information in both, but the objects described by the two data sets differ. For example, in a financial marketing scenario, the first data set and the second data set may be the user data of two companies, and the features may include a user's gender, education, employer, past consumption data, and so on. The first data set and the second data set provide the data basis for object clustering.
The first data set is denoted [Figure PCTCN2021096851-appb-000001], where [Figure PCTCN2021096851-appb-000002] is the feature set of the i-th object and the feature dimension is q. The number of objects in a data set is called its object count, and the first object count is $N_A$. Similarly, the second data set is denoted [Figure PCTCN2021096851-appb-000003].
It should be emphasized that, to further ensure the privacy and security of the first data set, the first data set may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Step S202: perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second to-be-clustered data set, wherein the second data set includes second features of several second objects.
Specifically, the first server and the second server may form a federated network and perform federated learning, in which the two servers complete data operations without exchanging the underlying data. Through the federated variance selection algorithm, the first server and the second server screen the features of the first data set and remove some of them to obtain the first to-be-clustered data set $I_A$. Likewise, the second server screens the features of the second data set through the federated variance selection algorithm and removes some of them to obtain the second to-be-clustered data set $I_B$.
Further, the above step S202 may include:
Step S2021: for each first feature in the first data set, calculate the first feature value cumulative sum of the first feature, and instruct the second server to calculate the second feature value cumulative sum of the second feature corresponding to the first feature.
Specifically, for each feature j, $j \in [1, q]$, the first server calculates the first feature value cumulative sum [Figure PCTCN2021096851-appb-000004] for the first feature, and the second server calculates the second feature value cumulative sum [Figure PCTCN2021096851-appb-000005] for the second feature.
Step S2022: together with the second server, apply the homomorphic-encryption weighted average algorithm to the first feature value cumulative sum and the second feature value cumulative sum to obtain the joint mean of the first feature.
Specifically, the first server and the second server apply the homomorphic-encryption weighted average algorithm to the first feature value cumulative sum [Figure PCTCN2021096851-appb-000006] and the second feature value cumulative sum [Figure PCTCN2021096851-appb-000007] to obtain the joint mean [Figure PCTCN2021096851-appb-000008] of the first feature.
Further, the above step S2022 may include:
Step S20221: generate a first homomorphic key pair.
Specifically, the first server generates a first homomorphic key pair $(E_{k1}, D_{k1})$, where $E_{k1}$ is the first encryption key and $D_{k1}$ is the first decryption key. The first homomorphic key pair $(E_{k1}, D_{k1})$ satisfies the homomorphic encryption property.
Step S20222: encrypt the first feature value cumulative sum and the first object count of the first data set with the first homomorphic key pair.
Specifically, the first server uses the first encryption key $E_{k1}$ of the first homomorphic key pair $(E_{k1}, D_{k1})$ to encrypt the first feature value cumulative sum [Figure PCTCN2021096851-appb-000009], obtaining [Figure PCTCN2021096851-appb-000010], and uses $E_{k1}$ to encrypt the first object count $N_A$ of the first data set, obtaining $n_1 = E_{k1}(N_A)$.
Step S20223: send the first encryption key of the first homomorphic key pair, the encrypted first feature value cumulative sum and the encrypted first object count to the second server, so as to instruct the second server to calculate the encrypted joint cumulative sum and the encrypted joint object count from the first encryption key, the encrypted first feature value cumulative sum, the encrypted first object count, the second feature value cumulative sum and the second object count of the second data set.
Specifically, the first server sends the first encryption key $E_{k1}$ of the first homomorphic key pair, the encrypted first feature value cumulative sum [Figure PCTCN2021096851-appb-000011] and the encrypted first object count $n_1 = E_{k1}(N_A)$ to the second server.
The second server generates a random message $z \in M$ and calculates the product [Figure PCTCN2021096851-appb-000013] of $z$ and the second feature value cumulative sum [Figure PCTCN2021096851-appb-000012], as well as the product $z \cdot N_B$ of $z$ and the second object count $N_B$. It then uses the first encryption key $E_{k1}$ to encrypt [Figure PCTCN2021096851-appb-000014] and $z \cdot N_B$, obtaining [Figure PCTCN2021096851-appb-000015] and $z_2 = E_{k1}(z \cdot N_B)$. Working on ciphertexts, the second server calculates the encrypted joint cumulative sum [Figure PCTCN2021096851-appb-000016] and the encrypted joint object count $m_2 = E_{k1}(z \cdot N_A + z \cdot N_B)$, and then sends [Figure PCTCN2021096851-appb-000017] and $m_2 = E_{k1}(z \cdot N_A + z \cdot N_B)$ to the first server.
Step S20224: calculate the joint mean of the first feature according to the encrypted joint cumulative sum and the encrypted joint object count returned by the second server.
After receiving [Figure PCTCN2021096851-appb-000018] and $m_2 = E_{k1}(z \cdot N_A + z \cdot N_B)$, the first server uses the first decryption key $D_{k1}$ of the first homomorphic key pair to decrypt the encrypted joint cumulative sum [Figure PCTCN2021096851-appb-000019] and the encrypted joint object count $m_2 = E_{k1}(z \cdot N_A + z \cdot N_B)$, obtaining [Figure PCTCN2021096851-appb-000020] and $z \cdot (N_A + N_B)$. It then calculates the joint mean [Figure PCTCN2021096851-appb-000021] of the first feature and sends [Figure PCTCN2021096851-appb-000022] to the second server. It can be understood that [Figure PCTCN2021096851-appb-000023] also serves as the joint mean of the corresponding second feature.
Steps S20221 to S20224 together implement the homomorphic-encryption weighted average algorithm.
In this embodiment, the homomorphic-encryption weighted average algorithm combines the first data set and the second data set to obtain the joint mean of the feature without exchanging the underlying data.
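The exchange in steps S20221 to S20224 can be illustrated with a minimal Python sketch. It assumes an additively homomorphic scheme and uses a toy Paillier construction with small fixed primes, which is insecure and chosen only so the example runs with the standard library; the application does not name a particular scheme. All function names (`keygen`, `encrypt`, `decrypt`, `server_b_respond`, `federated_mean`) are hypothetical, and feature values are assumed to be non-negative integers. Because both decrypted quantities carry the same random factor z, the factor cancels in the final division.

```python
import math
import random
from fractions import Fraction

def keygen(p=104729, q=1299709):
    """Toy Paillier keys (assumption: tiny primes, NOT secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    # Public key: n. Private key: (phi, phi^-1 mod n).
    return n, (phi, pow(phi, -1, n))

def encrypt(n, m):
    """Encrypt a non-negative integer m < n under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    phi, mu = priv
    return ((pow(c, phi, n * n) - 1) // n * mu) % n

def server_b_respond(n, enc_sum_a, enc_n_a, sum_b, n_b):
    """Second server: blind with random z and add its own terms on ciphertexts."""
    n2 = n * n
    z = random.randrange(1, 1000)
    # E(S_A)^z = E(z*S_A); multiplying ciphertexts adds the plaintexts.
    m1 = (pow(enc_sum_a, z, n2) * encrypt(n, z * sum_b)) % n2
    m2 = (pow(enc_n_a, z, n2) * encrypt(n, z * n_b)) % n2
    return m1, m2

def federated_mean(values_a, values_b):
    """First server's view: encrypt, send, decrypt the reply, divide."""
    n, priv = keygen()
    enc_sum_a = encrypt(n, sum(values_a))
    enc_n_a = encrypt(n, len(values_a))
    m1, m2 = server_b_respond(n, enc_sum_a, enc_n_a,
                              sum(values_b), len(values_b))
    # z * (S_A + S_B) / (z * (N_A + N_B)): the factor z cancels.
    return Fraction(decrypt(n, priv, m1), decrypt(n, priv, m2))
```

In this sketch both parties run inside one process; in a deployment the values passed to `server_b_respond` would travel over the network, and real-valued features would need a fixed-point encoding.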
Step S2023: calculate the first error cumulative sum of the first feature based on the joint mean, and instruct the second server to calculate the second error cumulative sum of the second feature based on the joint mean.
Specifically, the first server calculates the first error cumulative sum [Figure PCTCN2021096851-appb-000026] from the joint mean [Figure PCTCN2021096851-appb-000024] and each first feature value [Figure PCTCN2021096851-appb-000025] in the first data set, and the second server calculates the second error cumulative sum [Figure PCTCN2021096851-appb-000029] from the joint mean [Figure PCTCN2021096851-appb-000027] and each second feature value [Figure PCTCN2021096851-appb-000028] in the second data set.
Step S2024: together with the second server, apply the homomorphic-encryption weighted average algorithm to the first error cumulative sum and the second error cumulative sum to obtain the joint mean square error of the first feature.
Specifically, the first server and the second server apply the homomorphic-encryption weighted average algorithm to the first error cumulative sum [Figure PCTCN2021096851-appb-000030] and the second error cumulative sum [Figure PCTCN2021096851-appb-000031] to obtain the joint mean square error [Figure PCTCN2021096851-appb-000032] of the first feature.
Further, the above step S2024 may include:
Step S20241: generate a second homomorphic key pair.
Specifically, the first server generates a second homomorphic key pair $(E_{k2}, D_{k2})$, where $E_{k2}$ is the second encryption key and $D_{k2}$ is the second decryption key. The second homomorphic key pair $(E_{k2}, D_{k2})$ satisfies the homomorphic encryption property.
Step S20242: encrypt the first error cumulative sum and the first object count of the first data set with the second homomorphic key pair.
Specifically, the first server uses the second encryption key $E_{k2}$ of the second homomorphic key pair $(E_{k2}, D_{k2})$ to encrypt the first error cumulative sum [Figure PCTCN2021096851-appb-000033], obtaining [Figure PCTCN2021096851-appb-000034], and uses $E_{k2}$ to encrypt the first object count $N_A$ of the first data set, obtaining $n_1 = E_{k2}(N_A)$.
Step S20243: send the second encryption key of the second homomorphic key pair, the encrypted first error cumulative sum and the encrypted first object count to the second server, so as to instruct the second server to calculate the encrypted joint error cumulative sum and the encrypted joint object count from the second encryption key, the encrypted first error cumulative sum, the encrypted first object count, the second error cumulative sum and the second object count of the second data set.
Specifically, the first server sends the second encryption key $E_{k2}$ of the second homomorphic key pair, the encrypted first error cumulative sum [Figure PCTCN2021096851-appb-000035] and the encrypted first object count $n_1 = E_{k2}(N_A)$ to the second server.
The second server generates a random message $z \in M$ and calculates the product [Figure PCTCN2021096851-appb-000037] of the random message and the second error cumulative sum [Figure PCTCN2021096851-appb-000036], as well as the product $z \cdot N_B$ of the random message and the second object count $N_B$. It then uses the second encryption key $E_{k2}$ to encrypt [Figure PCTCN2021096851-appb-000038] and $z \cdot N_B$, obtaining [Figure PCTCN2021096851-appb-000039]. Working on ciphertexts, the second server calculates the encrypted joint error cumulative sum [Figure PCTCN2021096851-appb-000040] and the encrypted joint object count $m_2 = E_{k2}(z \cdot N_A + z \cdot N_B)$, and then sends [Figure PCTCN2021096851-appb-000041] and $m_2 = E_{k2}(z \cdot N_A + z \cdot N_B)$ to the first server.
Step S20244: calculate the joint mean square error of the first feature according to the encrypted joint error cumulative sum and the encrypted joint object count returned by the second server.
Specifically, after receiving the encrypted joint error cumulative sum [Figure PCTCN2021096851-appb-000042] and the encrypted joint object count $m_2 = E_{k2}(z \cdot N_A + z \cdot N_B)$, the first server uses the second decryption key $D_{k2}$ of the second homomorphic key pair to decrypt the encrypted joint error cumulative sum [Figure PCTCN2021096851-appb-000043] and the encrypted joint object count $m_2 = E_{k2}(z \cdot N_A + z \cdot N_B)$, obtaining [Figure PCTCN2021096851-appb-000044] and $z \cdot (N_A + N_B)$. It then calculates the joint mean square error [Figure PCTCN2021096851-appb-000045] of the first feature and sends [Figure PCTCN2021096851-appb-000046] to the second server. It can be understood that [Figure PCTCN2021096851-appb-000047] also serves as the joint mean square error of the corresponding second feature.
Steps S20241 to S20244 together implement the homomorphic-encryption weighted average algorithm.
In this embodiment, the homomorphic-encryption weighted average algorithm combines the first data set and the second data set to obtain the joint mean square error of the feature without exchanging the underlying data.
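The arithmetic behind steps S2023 and S2024 can be sketched in plain Python. The sketch assumes that the "error" accumulated on each server is the squared deviation from the joint mean, which is what the term joint mean square error suggests; the formula images themselves are not reproduced here. Encryption is deliberately omitted, so the function shows only what each side contributes and how the contributions are combined. All function names are hypothetical.

```python
def local_sum_and_count(values):
    """What one server contributes to the joint mean for a single feature."""
    return sum(values), len(values)

def local_squared_error_sum(values, joint_mean):
    """Sum of squared deviations of one server's values from the joint mean."""
    return sum((v - joint_mean) ** 2 for v in values)

def joint_mean_square_error(values_a, values_b):
    """Combine both servers' contributions (in the clear, for illustration only)."""
    sum_a, n_a = local_sum_and_count(values_a)
    sum_b, n_b = local_sum_and_count(values_b)
    joint_mean = (sum_a + sum_b) / (n_a + n_b)
    err_a = local_squared_error_sum(values_a, joint_mean)
    err_b = local_squared_error_sum(values_b, joint_mean)
    return (err_a + err_b) / (n_a + n_b)
```

Under this assumption the result equals the population variance of the pooled values, although neither server ever sees the other's individual values when the sums are exchanged in encrypted form.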
Step S2025: screen the first features in the first data set according to the obtained joint mean square errors to obtain the first to-be-clustered data set, and instruct the second server to screen the second features in the second data set according to the obtained joint mean square errors to obtain the second to-be-clustered data set.
Specifically, the joint mean square error of a feature can serve as a measure of the feature's importance. The joint mean square error is calculated for every feature, the q joint mean square errors are sorted in descending order, and the first d features are kept as the screened features. Both the first server and the second server perform this screening, obtaining the first to-be-clustered data set $I_A$ and the second to-be-clustered data set $I_B$, respectively.
In this embodiment, the federated variance selection algorithm screens the features of the first data set and of the second data set on the second server without exchanging the underlying data, retaining the features most useful for clustering while reducing the feature dimension, thereby adapting the data to the DBSCAN algorithm.
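The screening in step S2025 amounts to ranking the q features by joint mean square error and keeping the top d. A minimal sketch follows; `select_top_features` and `project` are hypothetical names, and a data set is assumed to be a list of equal-length rows.

```python
def select_top_features(joint_mse, d):
    """Return the indices of the d features with the largest joint MSE."""
    ranked = sorted(range(len(joint_mse)),
                    key=lambda j: joint_mse[j], reverse=True)
    # Keep the selected indices in their original column order.
    return sorted(ranked[:d])

def project(dataset, kept):
    """Keep only the selected feature columns of every object."""
    return [[row[j] for j in kept] for row in dataset]
```

Because both servers receive the same joint mean square errors, each can run the same selection locally and arrive at the same d features.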
Step S203: traverse the first objects in the first to-be-clustered data set.
Specifically, the first server traverses the first objects in the first to-be-clustered data set so as to perform clustering on each first object in turn.
Step S204: calculate the Euclidean distance between the current first object and each first object, and calculate the Euclidean distance between the current first object and each second object through the federated Euclidean distance algorithm.
Specifically, the first object currently being traversed is taken as the current first object. The Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated through the federated Euclidean distance algorithm. With the federated Euclidean distance algorithm, the first server and the second server do not need to exchange the real underlying data when calculating Euclidean distances.
Further, the above step S204 may include:
Step S2041: calculate the Euclidean distance between the current first object and each first object.
Specifically, the first server calculates the Euclidean distance between the current first object and each first object in the first to-be-clustered data set. Let the current first object be [Figure PCTCN2021096851-appb-000048] and another first object be [Figure PCTCN2021096851-appb-000049], with feature dimension d; then [Figure PCTCN2021096851-appb-000050] and [Figure PCTCN2021096851-appb-000051], and the Euclidean distance [Figure PCTCN2021096851-appb-000052] is:
[Figure PCTCN2021096851-appb-000053]
Within a single data set there is no data-privacy restriction, so the Euclidean distance between the current first object and each first object can be calculated directly from the first feature values of each first feature.
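Step S2041 is an ordinary local computation. The following sketch uses the standard Euclidean distance formula over d feature values; the function names are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two objects with the same d features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distances_from(current, objects):
    """Distances from the current first object to every first object."""
    return [euclidean_distance(current, other) for other in objects]
```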
Step S2042: Calculate the first feature square sum of the current first object.
Specifically, when calculating the Euclidean distance between the current first object and each second object in the second to-be-clustered data set, let the current first object be [Figure PCTCN2021096851-appb-000054] and denote a second object as [Figure PCTCN2021096851-appb-000055]. With feature dimension d, the two objects are [Figure PCTCN2021096851-appb-000056] and [Figure PCTCN2021096851-appb-000057], and the Euclidean distance [Figure PCTCN2021096851-appb-000058] is:
Figure PCTCN2021096851-appb-000059
It can be understood that [Figure PCTCN2021096851-appb-000060] is the feature value of the j-th feature of the first object, and [Figure PCTCN2021096851-appb-000061] is the feature value of the j-th feature of the second object.
The first server calculates the first feature square sum [Figure PCTCN2021096851-appb-000062] of the current first object.
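Expanding the squared distance shows why each server calculates a square sum. The formula images are not reproduced in this text, so the symbols below are assumptions: $a_j$ stands for the j-th feature value of the current first object and $b_j$ for the j-th feature value of a second object.

```latex
\sum_{j=1}^{d} (a_j - b_j)^2
  = \sum_{j=1}^{d} a_j^2 \;-\; 2 \sum_{j=1}^{d} a_j b_j \;+\; \sum_{j=1}^{d} b_j^2
```

The first term is the first feature square sum, which the first server can calculate alone; the last term is the second feature square sum, which the second server can calculate alone; only the middle cross term needs feature values from both servers.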
Step S2043: For each second object, calculate, together with the second server and through the product algorithm, the feature cross-product sum of the current first object and the second object, and instruct the second server to calculate the second feature square sum of the second object.
Specifically, the Euclidean distance must be calculated between the current first object and every second object. When calculating the Euclidean distance to one of the second objects, the first server inputs [Figure PCTCN2021096851-appb-000063] multiple times, and the second server inputs [Figure PCTCN2021096851-appb-000064] and a random number $r_j$, $j \in [1,d]$, multiple times, where the second server needs to generate d random numbers $r_1, r_2, \ldots, r_d$ satisfying [Figure PCTCN2021096851-appb-000065]. Through the product algorithm, the first server and the second server calculate [Figure PCTCN2021096851-appb-000066] and sum the results to obtain the feature cross-product sum [Figure PCTCN2021096851-appb-000067] of the first object and the second object. Meanwhile, the second server calculates the second feature square sum [Figure PCTCN2021096851-appb-000068] of the second object.
When calculating the feature cross-product sum, the first server and the second server rely on the product algorithm and do not have to exchange the underlying feature values.
Further, the above step S2043 may include:
Step S20431: Generate a first random number, and generate a third homomorphic key pair based on the Paillier encryption algorithm.
Specifically, the first server generates a first random number v and, based on the Paillier encryption algorithm, generates a third homomorphic key pair $(E_{k3}, D_{k3})$, where $E_{k3}$ is the third encryption key and $D_{k3}$ is the third decryption key. The Paillier encryption algorithm is a homomorphic encryption scheme that satisfies additive homomorphism and scalar-multiplication homomorphism.
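The two homomorphic properties named above can be illustrated with a toy Paillier implementation. This is only a sketch under stated assumptions: the primes 1009 and 1013 are far too small for real use, the generator is fixed to n + 1 (a common simplification), and the names `keygen`, `encrypt`, `decrypt`, `add`, and `scalar_mul` are hypothetical rather than taken from the application. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Return (public modulus n, private key); p and q are tiny demo primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (taken modulo n) under the public modulus n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(n + 1, m % n, n2) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    """Recover the plaintext (modulo n) from ciphertext c."""
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

def add(n, c1, c2):
    """Additive homomorphism: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (n * n)

def scalar_mul(n, c, k):
    """Scalar-multiplication homomorphism: ciphertext ** k multiplies the plaintext by k."""
    return pow(c, k % n, n * n)
```

With `n, priv = keygen()`, decrypting `add(n, encrypt(n, 12), encrypt(n, 30))` recovers the sum of the plaintexts, and decrypting `scalar_mul(n, encrypt(n, 12), 5)` recovers the plaintext multiplied by 5, without either operation using the private key.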
Step S20432: Using the third encryption key in the third homomorphic key pair, jointly encrypt each first feature value of the current first object with the first random number to obtain a joint encrypted value.
Specifically, the first server uses the third encryption key $E_{k3}$ in the third homomorphic key pair $(E_{k3}, D_{k3})$ to jointly encrypt each first feature value [Figure PCTCN2021096851-appb-000069] of the current first object with the first random number r, obtaining the joint encrypted value [Figure PCTCN2021096851-appb-000070].
Step S20433: Send the joint encrypted value to the second server and, for each second object, instruct the second server to calculate each encrypted feature cross product according to the joint encrypted value, the second feature values of the second object, and the generated second random numbers, and instruct the second server to calculate the second feature square sum of the second object.
Specifically, the first server sends the third encryption key $E_{k3}$, the first random number r, and the joint encrypted value [Figure PCTCN2021096851-appb-000071] to the second server. The second server generates second random numbers $r_j$, $j \in [1,d]$, with [Figure PCTCN2021096851-appb-000072]. From the joint encrypted value [Figure PCTCN2021096851-appb-000073], the second feature values [Figure PCTCN2021096851-appb-000074] of the second object, and the generated second random numbers $r_j$, $j \in [1,d]$, the second server calculates each encrypted feature cross product [Figure PCTCN2021096851-appb-000075]. Meanwhile, the second server calculates the second feature square sum [Figure PCTCN2021096851-appb-000076] of the second object.
Step S20434: Receive each encrypted feature cross product and the second feature square sum of the second object returned by the second server.
Specifically, the second server sends each encrypted feature cross product [Figure PCTCN2021096851-appb-000077] and the second feature square sum [Figure PCTCN2021096851-appb-000078] of the second object to the first server. When sending the second feature square sum [Figure PCTCN2021096851-appb-000079], the second server may send [Figure PCTCN2021096851-appb-000080] so that the real second feature square sum is encrypted; the influence of the first random number v can be cancelled by the first server.
Step S20435: Using the third decryption key in the third homomorphic key pair, decrypt each encrypted feature cross product to obtain the feature cross-product sum of the current first object and the second object.
Specifically, the first server uses the third decryption key $D_{k3}$ in the third homomorphic key pair $(E_{k3}, D_{k3})$ to decrypt each encrypted feature cross product [Figure PCTCN2021096851-appb-000081]: $u = D_{k3}(u')$. By the inherent properties of the Paillier encryption algorithm, the decrypted result is [Figure PCTCN2021096851-appb-000082], that is, the feature cross-product sum of the current first object and the second object plus the second random numbers $r_j$. The influence of the second random numbers $r_j$ can be cancelled when the Euclidean distance is calculated.
Steps S20431 to S20435 are the implementation steps of the product algorithm.
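The flow of steps S20431 to S20435 can be sketched in a single process as follows. This is not the application's exact protocol, whose formulas appear only as images here; it is a simplified illustration under loudly stated assumptions: feature values are integers, a toy Paillier scheme with tiny primes stands in for the third homomorphic key pair, the first random number is omitted, and the second random numbers are chosen to sum to zero modulo N so that they cancel when the decrypted terms are added. All function names are hypothetical.

```python
import math
import random

# Toy Paillier parameters (assumption: tiny demo primes, generator N + 1).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def encrypt(m):
    """Public-key operation: encrypt integer m modulo N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2

def decrypt(c):
    """Private-key operation, available to the first server only."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def second_server_cross_terms(enc_a, b):
    """Second server: turn E(a_j) into E(a_j * b_j + r_j) without seeing a_j."""
    masks = [random.randrange(N) for _ in range(len(b) - 1)]
    masks.append(-sum(masks) % N)  # assumption: masks sum to 0 modulo N
    return [pow(c, bj % N, N2) * encrypt(rj) % N2
            for c, bj, rj in zip(enc_a, b, masks)]

def federated_cross_product_sum(a, b):
    """Cross-product sum of a (first server) and b (second server)."""
    enc_a = [encrypt(aj) for aj in a]                 # first server encrypts
    enc_terms = second_server_cross_terms(enc_a, b)   # second server computes
    total = sum(decrypt(c) for c in enc_terms) % N    # first server decrypts
    return total if total <= N // 2 else total - N    # map back to signed range
```

Each decrypted term is masked by its own random number, so the first server learns only the sum of the cross products, not any single product. The results must stay well below N in absolute value, which is why a real deployment would use a much larger modulus.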
In this embodiment, the product algorithm makes it possible to calculate the feature cross-product sum of the current first object and the second object while protecting the data privacy of the first to-be-clustered data set and the second to-be-clustered data set, which in turn allows the Euclidean distance between the current first object and the second object to be calculated.
Step S2044: Calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross-product sum, and the second feature square sum returned by the second server.
Specifically, according to the first feature square sum [Figure PCTCN2021096851-appb-000083], the feature cross-product sum [Figure PCTCN2021096851-appb-000084], and the second feature square sum [Figure PCTCN2021096851-appb-000085] returned by the second server, the first server calculates the Euclidean distance [Figure PCTCN2021096851-appb-000086] between the current first object and the second object.
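The final combination in step S2044 can be sketched as follows, assuming the masking random numbers have already been cancelled so that the three inputs are the plain first feature square sum, feature cross-product sum, and second feature square sum. The function name `distance_from_parts` is hypothetical.

```python
import math

def distance_from_parts(first_square_sum, cross_product_sum, second_square_sum):
    """Euclidean distance from sum(a_j^2), sum(a_j * b_j) and sum(b_j^2)."""
    squared = first_square_sum - 2 * cross_product_sum + second_square_sum
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))
```

For objects (1, 2, 3) and (4, 6, 3), the three parts are 14, 25, and 61, and `distance_from_parts(14, 25, 61)` agrees with the distance calculated directly from the feature values.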
Steps S2042 to S2044 constitute the federated Euclidean distance algorithm.
In this embodiment, the federated Euclidean distance algorithm calculates the Euclidean distances between objects in the first to-be-clustered data set and objects in the second to-be-clustered data set without infringing data privacy, which makes DBSCAN clustering across two privacy-protected data sets possible.
Step S205: Perform DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
Specifically, after the Euclidean distances between the current first object and each first object and each second object are obtained, DBSCAN clustering can be performed on the current first object according to the DBSCAN algorithm to obtain a clustering result. The clustering result can be regarded as a division into groups of the objects in the first to-be-clustered data set and the second to-be-clustered data set.
Further, the above step S205 may include:
Step S2051: Determine whether the current first object is a core point according to the obtained Euclidean distances and a preset neighborhood object number threshold.
Specifically, in the DBSCAN algorithm, given a data set $D=\{x_1, x_2, \ldots, x_m\}$, the following definitions apply:
(1) $N_\varepsilon(x_j)$: for $x_j \in D$, its ε-neighborhood is the subset of samples in D whose Euclidean distance from $x_j$ is not greater than ε, that is, $N_\varepsilon(x_j)=\{x_i \in D \mid \mathrm{distance}(x_i, x_j) \le \varepsilon\}$; $|N_\varepsilon(x_j)|$ denotes the number of samples in the ε-neighborhood of $x_j$.
(2) Core point: for any sample $x_j \in D$, if its ε-neighborhood $N_\varepsilon(x_j)$ contains at least MinPts samples, that is, if $|N_\varepsilon(x_j)| \ge \mathrm{MinPts}$, then $x_j$ is a core point.
(3) Boundary point: if the ε-neighborhood $N_\varepsilon(x_j)$ of sample $x_j \in D$ contains fewer than MinPts samples, but $x_j$ lies in the neighborhood of another core point, then $x_j$ is a boundary point.
(4) Noise point: a sample that is neither a core point nor a boundary point.
(5) Directly density-reachable: if $x_i$ lies in the ε-neighborhood of $x_j$ and $x_j$ is a core point, then $x_i$ is directly density-reachable from $x_j$.
(6) Density-reachable: for $x_i$ and $x_j$, if there is a sample sequence $p_1, p_2, \ldots, p_T$ with $p_1 = x_i$, $p_T = x_j$, and every $p_{t+1}$ directly density-reachable from $p_t$, then $x_j$ is density-reachable from $x_i$; that is, density-reachability is transitive.
(7) Density-connected: for $x_i$ and $x_j$, if there is a core point $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are density-connected.
In summary, the first server uses the calculated Euclidean distances to count the objects in the clustering neighborhood (that is, the ε-neighborhood) of the current first object, where the objects may come from the first to-be-clustered data set or from the second to-be-clustered data set, and compares that count with the preset neighborhood object number threshold MinPts to determine whether the current first object is a core point.
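The definitions of core point, boundary point, and noise point can be sketched over a precomputed distance matrix. This is a minimal illustration, assuming `dist[i][j]` already holds the Euclidean distance between objects i and j (whether it was calculated locally or through the federated algorithm); the function names are hypothetical.

```python
def epsilon_neighborhood(dist_row, eps):
    """Indices of all objects within eps of one object (the object itself included)."""
    return [i for i, dist in enumerate(dist_row) if dist <= eps]

def classify_points(dist, eps, min_pts):
    """Label every object as 'core', 'border' or 'noise'."""
    n = len(dist)
    neighborhoods = [epsilon_neighborhood(dist[i], eps) for i in range(n)]
    core = {i for i in range(n) if len(neighborhoods[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighborhoods[i]):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels
```

Because the neighborhood is defined as a subset of D, each object counts itself among its own neighbors in this sketch.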
Step S2052: When the current first object is a core point, determine the density-reachable points in the clustering neighborhood of the current first object to obtain the object clustering result, where the density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set.
Specifically, when the current first object is a core point, density-reachable points are searched for in its clustering neighborhood according to the definitions of the DBSCAN algorithm and the calculated Euclidean distances. The density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set, and the points found form a cluster. If the current first object is a boundary point or a noise point, it is not processed and the next core point is sought, until all first objects in the first to-be-clustered data set have been processed and the object clustering result is obtained, where each cluster can be one clustering result.
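The cluster expansion described above can be sketched as a standard DBSCAN pass over a precomputed distance matrix. This is an illustration rather than the application's implementation: it assumes all pairwise distances are already available in one matrix `dist`, and the function name `dbscan_precomputed` is hypothetical.

```python
from collections import deque

def dbscan_precomputed(dist, eps, min_pts):
    """Return one cluster label per object; -1 marks noise."""
    n = len(dist)
    neighborhoods = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        # Only an unassigned core point can start a new cluster.
        if labels[i] != -1 or len(neighborhoods[i]) < min_pts:
            continue
        labels[i] = cluster
        queue = deque(neighborhoods[i])
        while queue:
            j = queue.popleft()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            # Only core points extend the search; boundary points stop it.
            if len(neighborhoods[j]) >= min_pts:
                queue.extend(neighborhoods[j])
        cluster += 1
    return labels
```

Objects that are never reached from any core point keep the label -1, which corresponds to noise points.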
It can be understood that the second server can perform DBSCAN clustering on the second objects by the same operations as the first server. The horizontal-federation-based DBSCAN clustering method of the present application realizes object clustering, and within each clustering result the objects have a certain degree of similarity. For example, in a financial marketing scenario, after users are clustered by horizontal-federation-based DBSCAN according to user data, each clustering result may consist of users with similar behavior, so the method is equivalent to dividing the users into communities.
In this embodiment, when the current first object is determined to be a core point according to the Euclidean distances and the preset neighborhood object number threshold, DBSCAN clustering is performed on the current first object. Object clustering thus uses the data sets of different institutions, which breaks down data barriers and improves the accuracy of DBSCAN clustering.
In this embodiment, after the first data set is acquired, horizontal federated learning is performed with the second server. Through the federated variance selection algorithm, feature screening is performed on the first data set and on the second data set in the second server without exchanging specific data, which reduces the feature dimension and thereby adapts the data to the DBSCAN algorithm. Meanwhile, for the current first object traversed in the first to-be-clustered data set, the Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated through the federated Euclidean distance algorithm, so that Euclidean distances between objects in two separate data sets are obtained without exchanging specific data. These Euclidean distances are used for DBSCAN clustering, which breaks down data barriers, allows object clustering over the data sets of different institutions without infringing data privacy, and improves the accuracy of object clustering.
The horizontal-federation-based DBSCAN clustering method and its related devices in the present application relate to cluster analysis in the field of artificial intelligence, and may also relate to asset management in financial technology.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Although the steps in the flowcharts of the accompanying drawings are displayed in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a horizontal-federation-based DBSCAN clustering apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 3, the horizontal-federation-based DBSCAN clustering apparatus 300 of this embodiment includes a data set acquisition module 301, a feature screening module 302, an object traversal module 303, a distance calculation module 304, and an object clustering module 305, wherein:
The data set acquisition module 301 is configured to acquire a first data set, where the first data set includes first features of several first objects.
The feature screening module 302 is configured to perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second to-be-clustered data set, where the second data set includes second features of several second objects.
The object traversal module 303 is configured to traverse the first objects in the first to-be-clustered data set.
The distance calculation module 304 is configured to calculate the Euclidean distance between the current first object and each first object, and to calculate the Euclidean distance between the current first object and each second object through the federated Euclidean distance algorithm.
The object clustering module 305 is configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
In this embodiment, after the first data set is acquired, horizontal federated learning is performed with the second server. Through the federated variance selection algorithm, feature screening is performed on the first data set and on the second data set in the second server without exchanging specific data, which reduces the feature dimension and thereby adapts the data to the DBSCAN algorithm. Meanwhile, for the current first object traversed in the first to-be-clustered data set, the Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated through the federated Euclidean distance algorithm, so that Euclidean distances between objects in two separate data sets are obtained without exchanging specific data. These Euclidean distances are used for DBSCAN clustering, which breaks down data barriers, allows object clustering over the data sets of different institutions without infringing data privacy, and improves the accuracy of object clustering.
In some optional implementations of this embodiment, the feature screening module 302 may include a feature value calculation submodule, a cumulative sum calculation submodule, an error calculation submodule, a mean square error calculation submodule, and a feature screening submodule, wherein:
The feature value calculation submodule is configured to, for each first feature in the first data set, calculate the first feature value cumulative sum of the first feature, and instruct the second server to calculate the second feature value cumulative sum of the second feature corresponding to the first feature.
The cumulative sum calculation submodule is configured to calculate, together with the second server and through the homomorphic encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain the joint mean of the first feature.
The error calculation submodule is configured to calculate the first error cumulative sum of the first feature based on the joint mean, and instruct the second server to calculate the second error cumulative sum of the second feature based on the joint mean.
The mean square error calculation submodule is configured to calculate, together with the second server and through the homomorphic encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain the joint mean square error of the first feature.
The feature screening submodule is configured to screen the first features in the first data set according to the obtained joint mean square error to obtain the first to-be-clustered data set, and instruct the second server to screen the second features in the second data set according to the obtained joint mean square error to obtain the second to-be-clustered data set.
In this embodiment, the federated variance selection algorithm performs feature screening on the first data set and on the second data set in the second server without exchanging the underlying data, retaining the features most useful for clustering while reducing the feature dimension, thereby adapting the data to the DBSCAN algorithm.
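The arithmetic behind the five submodules can be sketched in plaintext as follows, leaving out the encryption layer. The sketch rests on assumptions that the text above does not spell out: the "error" is taken to be the squared deviation from the joint mean, and screening keeps a feature when its joint mean square error exceeds a threshold. The function names and the `threshold` parameter are hypothetical.

```python
def joint_mean(sum_a, count_a, sum_b, count_b):
    """Joint mean from each party's feature value cumulative sum and object count."""
    return (sum_a + sum_b) / (count_a + count_b)

def error_sum(values, mean):
    """Error cumulative sum of one party (assumption: squared deviations)."""
    return sum((v - mean) ** 2 for v in values)

def joint_mse(values_a, values_b):
    """Joint mean square error of one feature across both data sets."""
    mean = joint_mean(sum(values_a), len(values_a), sum(values_b), len(values_b))
    total_error = error_sum(values_a, mean) + error_sum(values_b, mean)
    return total_error / (len(values_a) + len(values_b))

def select_features(data_a, data_b, threshold):
    """Keep the features whose joint MSE exceeds threshold (assumed screening rule).

    data_a and data_b map each feature name to that party's list of values.
    """
    return [name for name in data_a
            if joint_mse(data_a[name], data_b[name]) > threshold]
```

Only the per-party sums, counts, and error sums enter the joint quantities, which is what allows the two additions to be carried out under homomorphic encryption instead of in the clear.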
In some optional implementations of this embodiment, the cumulative sum calculation submodule may include a first generation unit, a first encryption unit, a first sending unit, and a mean calculation unit, wherein:
The first generation unit is configured to generate a first homomorphic key pair.
The first encryption unit is configured to encrypt the first feature value cumulative sum and the first object number of the first data set using the first homomorphic key pair.
The first sending unit is configured to send the first encryption key in the first homomorphic key pair, the encrypted first feature value cumulative sum, and the encrypted first object number to the second server, so as to instruct the second server to calculate, according to the first encryption key, the encrypted first feature value cumulative sum, the encrypted first object number, the second feature value cumulative sum, and the second object number of the second data set, the encrypted joint cumulative sum and the encrypted joint object number.
The mean calculation unit is configured to calculate the joint mean of the first feature according to the encrypted joint cumulative sum and the encrypted joint object number returned by the second server.
In this embodiment, the homomorphic encryption weighted average algorithm obtains the joint mean of a feature from the first data set and the second data set without exchanging the underlying data.
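The exchange carried out by these four units can be sketched in a single process as follows. The text above does not name the homomorphic scheme used for the first key pair, so this sketch assumes a toy Paillier scheme with tiny demo primes and non-negative integer feature values; all function names are hypothetical.

```python
import math
import random

# Toy Paillier parameters (assumption: tiny demo primes, generator N + 1).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def encrypt(m):
    """Public-key operation: either server can encrypt."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2

def decrypt(c):
    """Private-key operation: only the first server can decrypt."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def second_server_aggregate(enc_sum_a, enc_count_a, values_b):
    """Second server: add its own sum and count to the encrypted values."""
    enc_joint_sum = enc_sum_a * encrypt(sum(values_b)) % N2
    enc_joint_count = enc_count_a * encrypt(len(values_b)) % N2
    return enc_joint_sum, enc_joint_count

def federated_joint_mean(values_a, values_b):
    """Joint mean of one feature held partly by each server."""
    enc_sum_a = encrypt(sum(values_a))      # first encryption unit
    enc_count_a = encrypt(len(values_a))
    enc_sum, enc_count = second_server_aggregate(enc_sum_a, enc_count_a, values_b)
    return decrypt(enc_sum) / decrypt(enc_count)  # mean calculation unit
```

The second server works only on ciphertexts of the first server's sum and count, and the joint sum must stay below N, which a real deployment would ensure with a much larger modulus.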
In some optional implementations of this embodiment, the mean square error calculation submodule may include a second generation unit, a second encryption unit, a second sending unit, and a mean square error calculation unit, wherein:
The second generation unit is configured to generate a second homomorphic key pair.
The second encryption unit is configured to encrypt the first error cumulative sum and the first object number of the first data set using the second homomorphic key pair.
The second sending unit is configured to send the second encryption key in the second homomorphic key pair, the encrypted first error cumulative sum, and the encrypted first object number to the second server, so as to instruct the second server to calculate, according to the second encryption key, the encrypted first error cumulative sum, the encrypted first object number, the second error cumulative sum, and the second object number of the second data set, the encrypted joint error cumulative sum and the encrypted joint object number.
The mean square error calculation unit is configured to calculate the joint mean square error of the first feature according to the encrypted joint error cumulative sum and the encrypted joint object number returned by the second server.
In this embodiment, the homomorphic encryption weighted average algorithm obtains the joint mean square error of a feature from the first data set and the second data set without exchanging the underlying data.
In some optional implementations of this embodiment, the distance calculation module 304 may include a distance calculation submodule, a square sum calculation submodule, a cross calculation submodule, and a Euclidean calculation submodule, wherein:
The distance calculation submodule is configured to calculate the Euclidean distance between the current first object and each first object.
The square sum calculation submodule is configured to calculate the first feature square sum of the current first object.
The cross calculation submodule is configured to, for each second object, calculate, together with the second server and through the product algorithm, the feature cross-product sum of the current first object and the second object, and instruct the second server to calculate the second feature square sum of the second object.
The Euclidean calculation submodule is configured to calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross-product sum, and the second feature square sum returned by the second server.
In this embodiment, the federated Euclidean distance algorithm calculates the Euclidean distances between objects in the first to-be-clustered data set and objects in the second to-be-clustered data set without infringing data privacy, which makes DBSCAN clustering across two privacy-protected data sets possible.
In some optional implementations of this embodiment, the sum-of-squares calculation submodule may include a generating unit, a joint encryption unit, an encrypted value sending unit, a receiving unit, and a decryption unit, wherein:
The generating unit is configured to generate a first random number and to generate a third homomorphic key pair based on the Paillier encryption algorithm.
The joint encryption unit is configured to jointly encrypt each first feature value of the current first object with the first random number, using the third encryption key of the third homomorphic key pair, to obtain joint encrypted values.
The encrypted value sending unit is configured to send the joint encrypted values to the second server and, for each second object, to instruct the second server to perform a calculation on the joint encrypted values, the second feature values of the second object, and a generated second random number to obtain the encrypted feature cross products, and to instruct the second server to calculate the second feature sum of squares of the second object.
The receiving unit is configured to receive the encrypted feature cross products and the second feature sum of squares of the second object returned by the second server.
The decryption unit is configured to decrypt the encrypted feature cross products, using the third decryption key of the third homomorphic key pair, to obtain the feature cross-product sum of the current first object and the second object.
In this embodiment, the product algorithm calculates the feature cross-product sum of the current first object and the second object while protecting the data privacy of the first data set to be clustered and the second data set to be clustered, which makes it possible to calculate the Euclidean distance between the current first object and the second object.
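The homomorphic property that the product algorithm relies on can be shown with a toy Paillier implementation. This is a hedged sketch under several assumptions: the key is far too small to be secure, feature values are non-negative integers, and the first and second random numbers are omitted because this passage does not state how they enter the calculation. The names `keygen`, `encrypt`, `decrypt`, and `encrypted_cross_products` are hypothetical.

```python
import math
import secrets

# Two known Mersenne primes; toy-sized and NOT secure, for illustration only.
P = 2**31 - 1
Q = 2**61 - 1

def keygen(p=P, q=Q):
    """Return (public modulus n, private key (lam, mu)) with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, valid because g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Paillier encryption of a non-negative integer m smaller than n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n mod n^2.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, private_key, c):
    lam, mu = private_key
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n

def encrypted_cross_products(n, encrypted_first, second_features):
    """Second-server side: raising Enc(a) to the power b yields Enc(a * b)."""
    n2 = n * n
    return [pow(c, b, n2) for c, b in zip(encrypted_first, second_features)]

# First server: encrypt its feature values and send the ciphertexts.
n, private_key = keygen()
first_features = [1, 2, 3]
encrypted_first = [encrypt(n, a) for a in first_features]

# Second server: multiply in its own features without seeing the plaintexts.
second_features = [4, 6, 3]
returned = encrypted_cross_products(n, encrypted_first, second_features)

# First server: decrypt each product and add them up.
cross_product_sum = sum(decrypt(n, private_key, c) for c in returned)
print(cross_product_sum)  # 25
```

A production system would use a vetted cryptographic library with keys of appropriate length and would encode fractional or negative feature values explicitly.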
In some optional implementations of this embodiment, the object clustering module 305 may include an object determination submodule and a reachable point determination submodule, wherein:
The object determination submodule is configured to determine whether the current first object is a core point according to the obtained Euclidean distances and a preset threshold on the number of neighborhood objects.
The reachable point determination submodule is configured to, when the current first object is a core point, determine the density-reachable points in the clustering neighborhood of the current first object to obtain an object clustering result, wherein the density-reachable points include first objects in the first data set to be clustered and second objects in the second data set to be clustered.
In this embodiment, when the current first object is determined to be a core point according to the Euclidean distances and the preset threshold on the number of neighborhood objects, DBSCAN clustering is performed on the current first object. Objects are thereby clustered using the data sets of different institutions, which breaks down data barriers and improves the accuracy of DBSCAN clustering.
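The core point test and the search for density-reachable points can be sketched as a standard DBSCAN expansion. This is an illustration under assumptions: the neighborhood radius `eps` is the usual DBSCAN parameter and is not named in this passage, a point is counted in its own neighborhood, and `dist` stands in for whichever distance source applies (local or federated). All names and sample data are hypothetical.

```python
from collections import deque

def region_query(point, all_points, dist, eps):
    """All points within eps of `point`, including the point itself."""
    return [q for q in all_points if dist(point, q) <= eps]

def expand_cluster(seed, all_points, dist, eps, min_pts):
    """Return the points density-reachable from `seed`, or an empty set
    when `seed` is not a core point."""
    seed_neighbors = region_query(seed, all_points, dist, eps)
    if len(seed_neighbors) < min_pts:
        return set()  # not a core point, so no cluster grows from it
    cluster = {seed}
    queue = deque(seed_neighbors)
    while queue:
        q = queue.popleft()
        if q in cluster:
            continue
        cluster.add(q)
        q_neighbors = region_query(q, all_points, dist, eps)
        # Only core points extend the cluster further.
        if len(q_neighbors) >= min_pts:
            queue.extend(p for p in q_neighbors if p not in cluster)
    return cluster

# Hypothetical one-dimensional objects, tagged by the data set they come from.
coordinates = {
    ("first", 0): 0.0,
    ("first", 1): 1.0,
    ("second", 0): 1.5,
    ("second", 1): 10.0,
}

def dist(a, b):
    return abs(coordinates[a] - coordinates[b])

cluster = expand_cluster(("first", 0), list(coordinates), dist, eps=1.0, min_pts=2)
```

In this toy data the resulting cluster contains both first objects and one second object, while the distant second object is left out, which mirrors a cluster spanning both data sets.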
To solve the above technical problems, the embodiments of the present application further provide a computer device. Refer to FIG. 4, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to one another through a system bus. Note that the figure shows only the computer device 4 with components 41-43; not all of the illustrated components are required, and more or fewer components may be implemented instead. The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions.
The memory 41 includes at least one type of computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 4. The memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is generally used to store the operating system and the various application software installed on the computer device 4, such as the computer-readable instructions of the horizontal-federation-based DBSCAN clustering method. In addition, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions of the horizontal-federation-based DBSCAN clustering method.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.
In this embodiment, after the first data set is acquired, horizontal federated learning is performed with the second server. Through the federated variance selection algorithm, feature screening is performed on the first data set and on the second data set in the second server without exchanging specific data, which reduces the feature dimensionality and thereby adapts the data to the DBSCAN algorithm. In addition, for the current first object reached while traversing the first data set to be clustered, the Euclidean distance between the current first object and each first object in the first data set to be clustered is calculated, and the Euclidean distance between the current first object and each second object in the second data set to be clustered is calculated through the federated Euclidean distance algorithm, so that the distances between objects in two separate data sets are obtained without exchanging specific data. These Euclidean distances are used for DBSCAN clustering. Data barriers are thereby broken down, objects are clustered using the data sets of different institutions without infringing data privacy, and the accuracy of object clustering is improved.
The present application further provides another embodiment, namely a computer-readable storage medium that stores computer-readable instructions, the computer-readable instructions being executable by at least one processor so that the at least one processor performs the steps of the horizontal-federation-based DBSCAN clustering method described above.
In this embodiment, after the first data set is acquired, horizontal federated learning is performed with the second server. Through the federated variance selection algorithm, feature screening is performed on the first data set and on the second data set in the second server without exchanging specific data, which reduces the feature dimensionality and thereby adapts the data to the DBSCAN algorithm. In addition, for the current first object reached while traversing the first data set to be clustered, the Euclidean distance between the current first object and each first object in the first data set to be clustered is calculated, and the Euclidean distance between the current first object and each second object in the second data set to be clustered is calculated through the federated Euclidean distance algorithm, so that the distances between objects in two separate data sets are obtained without exchanging specific data. These Euclidean distances are used for DBSCAN clustering. Data barriers are thereby broken down, objects are clustered using the data sets of different institutions without infringing data privacy, and the accuracy of object clustering is improved.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
Obviously, the embodiments described above are only some of the embodiments of the present application, not all of them. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be implemented in many different forms. Any equivalent structure made using the contents of the description and drawings of the present application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. A DBSCAN clustering method based on horizontal federation, comprising the following steps:
    acquiring a first data set, wherein the first data set includes first features of several first objects;
    performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second data set to be clustered, wherein the second data set includes second features of several second objects;
    traversing the first objects in the first data set to be clustered;
    calculating the Euclidean distance between a current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
    performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
  2. The DBSCAN clustering method based on horizontal federation according to claim 1, wherein the step of performing horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second data set to be clustered, wherein the second data set includes second features of several second objects, comprises:
    for each first feature in the first data set, calculating a first feature value cumulative sum of the first feature, and instructing the second server to calculate a second feature value cumulative sum of the second feature corresponding to the first feature;
    calculating, together with the second server through a homomorphic-encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain a joint mean of the first feature;
    calculating a first error cumulative sum of the first feature based on the joint mean, and instructing the second server to calculate a second error cumulative sum of the second feature based on the joint mean;
    calculating, together with the second server through the homomorphic-encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain a joint mean square error of the first feature;
    screening the first features in the first data set according to the obtained joint mean square errors to obtain the first data set to be clustered, and instructing the second server to screen the second features in the second data set according to the obtained joint mean square errors to obtain the second data set to be clustered.
  3. The DBSCAN clustering method based on horizontal federation according to claim 2, wherein the step of calculating, together with the second server through the homomorphic-encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain the joint mean of the first feature comprises:
    generating a first homomorphic key pair;
    encrypting the first feature value cumulative sum and a first object quantity of the first data set with the first homomorphic key pair;
    sending a first encryption key of the first homomorphic key pair, the encrypted first feature value cumulative sum, and the encrypted first object quantity to the second server, so as to instruct the second server to perform a calculation according to the first encryption key, the encrypted first feature value cumulative sum, the encrypted first object quantity, the second feature value cumulative sum, and a second object quantity of the second data set to obtain an encrypted joint cumulative sum and an encrypted joint object quantity;
    calculating the joint mean of the first feature according to the encrypted joint cumulative sum and the encrypted joint object quantity returned by the second server.
  4. The DBSCAN clustering method based on horizontal federation according to claim 2, wherein the step of calculating, together with the second server through the homomorphic-encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain the joint mean square error of the first feature comprises:
    generating a second homomorphic key pair;
    encrypting the first error cumulative sum and the first object quantity of the first data set with the second homomorphic key pair;
    sending a second encryption key of the second homomorphic key pair, the encrypted first error cumulative sum, and the encrypted first object quantity to the second server, so as to instruct the second server to perform a calculation according to the second encryption key, the encrypted first error cumulative sum, the encrypted first object quantity, the second error cumulative sum, and the second object quantity of the second data set to obtain an encrypted joint error cumulative sum and an encrypted joint object quantity;
    calculating the joint mean square error of the first feature according to the encrypted joint error cumulative sum and the encrypted joint object quantity returned by the second server.
  5. The DBSCAN clustering method based on horizontal federation according to claim 1, wherein the step of calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through the federated Euclidean distance algorithm comprises:
    calculating the Euclidean distance between the current first object and each first object;
    calculating a first feature sum of squares of the current first object;
    for each second object, calculating, together with the second server through a product algorithm, a feature cross-product sum of the current first object and the second object, and instructing the second server to calculate a second feature sum of squares of the second object;
    calculating the Euclidean distance between the current first object and the second object according to the first feature sum of squares, the feature cross-product sum, and the second feature sum of squares returned by the second server.
  6. The DBSCAN clustering method based on horizontal federation according to claim 5, wherein the step of, for each second object, calculating, together with the second server through the product algorithm, the feature cross-product sum of the current first object and the second object, and instructing the second server to calculate the second feature sum of squares of the second object comprises:
    generating a first random number, and generating a third homomorphic key pair based on the Paillier encryption algorithm;
    jointly encrypting each first feature value of the current first object with the first random number, using a third encryption key of the third homomorphic key pair, to obtain joint encrypted values;
    sending the joint encrypted values to the second server and, for each second object, instructing the second server to perform a calculation according to the joint encrypted values, the second feature values of the second object, and a generated second random number to obtain encrypted feature cross products, and instructing the second server to calculate the second feature sum of squares of the second object;
    receiving the encrypted feature cross products and the second feature sum of squares of the second object returned by the second server;
    decrypting the encrypted feature cross products, using a third decryption key of the third homomorphic key pair, to obtain the feature cross-product sum of the current first object and the second object.
  7. The DBSCAN clustering method based on horizontal federation according to claim 1, wherein the step of performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain the object clustering result comprises:
    determining whether the current first object is a core point according to the obtained Euclidean distances and a preset threshold on the number of neighborhood objects;
    when the current first object is a core point, determining the density-reachable points in the clustering neighborhood of the current first object to obtain the object clustering result, wherein the density-reachable points include first objects in the first data set to be clustered and second objects in the second data set to be clustered.
  8. A DBSCAN clustering apparatus based on horizontal federation, comprising:
    a data set acquisition module, configured to acquire a first data set, wherein the first data set includes first features of several first objects;
    a feature screening module, configured to perform horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first data set to be clustered, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second data set to be clustered, wherein the second data set includes second features of several second objects;
    an object traversal module, configured to traverse the first objects in the first data set to be clustered;
    a distance calculation module, configured to calculate the Euclidean distance between a current first object and each first object, and to calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
    an object clustering module, configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
  9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
    acquiring a first data set, wherein the first data set includes first features of several first objects;
    performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second data set to be clustered, wherein the second data set includes second features of several second objects;
    traversing the first objects in the first data set to be clustered;
    calculating the Euclidean distance between a current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
    performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
  10. The computer device according to claim 9, wherein the step of performing horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second data set to be clustered comprises:
    for each first feature in the first data set, calculating a first feature value cumulative sum of the first feature, and instructing the second server to calculate a second feature value cumulative sum of the second feature corresponding to the first feature;
    calculating, together with the second server through a homomorphic-encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain a joint mean of the first feature;
    calculating a first error cumulative sum of the first feature based on the joint mean, and instructing the second server to calculate a second error cumulative sum of the second feature based on the joint mean;
    calculating, together with the second server through the homomorphic-encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain a joint mean square error of the first feature;
    screening the first features in the first data set according to the obtained joint mean square errors to obtain the first data set to be clustered, and instructing the second server to screen the second features in the second data set according to the obtained joint mean square errors to obtain the second data set to be clustered.
  11. The computer device according to claim 10, wherein the step of calculating, together with the second server through the homomorphic-encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain the joint mean of the first feature comprises:
    generating a first homomorphic key pair;
    encrypting the first feature value cumulative sum and a first object quantity of the first data set with the first homomorphic key pair;
    sending a first encryption key of the first homomorphic key pair, the encrypted first feature value cumulative sum, and the encrypted first object quantity to the second server, so as to instruct the second server to perform a calculation according to the first encryption key, the encrypted first feature value cumulative sum, the encrypted first object quantity, the second feature value cumulative sum, and a second object quantity of the second data set to obtain an encrypted joint cumulative sum and an encrypted joint object quantity;
    calculating the joint mean of the first feature according to the encrypted joint cumulative sum and the encrypted joint object quantity returned by the second server.
  12. The computer device according to claim 10, wherein the step of calculating, together with the second server through the homomorphic-encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain the joint mean square error of the first feature comprises:
    generating a second homomorphic key pair;
    encrypting the first error cumulative sum and the first object quantity of the first data set with the second homomorphic key pair;
    sending a second encryption key of the second homomorphic key pair, the encrypted first error cumulative sum, and the encrypted first object quantity to the second server, so as to instruct the second server to perform a calculation according to the second encryption key, the encrypted first error cumulative sum, the encrypted first object quantity, the second error cumulative sum, and the second object quantity of the second data set to obtain an encrypted joint error cumulative sum and an encrypted joint object quantity;
    calculating the joint mean square error of the first feature according to the encrypted joint error cumulative sum and the encrypted joint object quantity returned by the second server.
  13. The computer device according to claim 9, wherein the step of calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through the federated Euclidean distance algorithm, comprises:
    calculating the Euclidean distance between the current first object and each first object;
    calculating a first feature sum of squares of the current first object;
    for each second object, calculating, with the second server and through a product algorithm, a feature cross-product sum of the current first object and the second object, and instructing the second server to calculate a second feature sum of squares of the second object;
    calculating the Euclidean distance between the current first object and the second object from the first feature sum of squares, the feature cross-product sum, and the second feature sum of squares returned by the second server.
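The final step of this claim relies on the identity ||a - b||^2 = sum(a_i^2) - 2 * sum(a_i * b_i) + sum(b_i^2), which lets the distance be assembled from three scalars without either side seeing the other's feature vector. The sketch below is a minimal illustration; the function name and sample vectors are hypothetical, and here all three quantities are computed in the clear purely to check the arithmetic.

```python
import math

def distance_from_parts(sq_sum_a, cross_sum, sq_sum_b):
    # ||a - b||^2 = sum(a^2) - 2 * sum(a * b) + sum(b^2).
    # Clamp at zero to absorb tiny negative values from floating-point rounding.
    return math.sqrt(max(sq_sum_a - 2 * cross_sum + sq_sum_b, 0.0))

a = [1.0, 2.0, 3.0]   # current first object (held by the first server)
b = [4.0, 6.0, 3.0]   # one second object (held by the second server)

sq_a = sum(x * x for x in a)               # first feature sum of squares
sq_b = sum(y * y for y in b)               # second feature sum of squares
cross = sum(x * y for x, y in zip(a, b))   # feature cross-product sum

d = distance_from_parts(sq_a, cross, sq_b)
print(d, math.dist(a, b))
```

In the claimed method only the cross-product sum requires a joint protocol; each sum of squares is computed locally by its owner.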
  14. The computer device according to claim 13, wherein the step of, for each second object, calculating, with the second server and through a product algorithm, the feature cross-product sum of the current first object and the second object, and instructing the second server to calculate the second feature sum of squares of the second object, comprises:
    generating a first random number, and generating a third homomorphic key pair based on the Paillier encryption algorithm;
    jointly encrypting each first feature value of the current first object with the first random number, using the third encryption key of the third homomorphic key pair, to obtain joint encrypted values;
    sending the joint encrypted values to the second server and, for each second object, instructing the second server to perform a calculation based on the joint encrypted values, the second feature values of the second object, and a generated second random number, to obtain the encrypted feature cross products, and instructing the second server to calculate the second feature sum of squares of the second object;
    receiving the encrypted feature cross products and the second feature sum of squares of the second object returned by the second server;
    decrypting the encrypted feature cross products with the third decryption key of the third homomorphic key pair to obtain the feature cross-product sum of the current first object and the second object.
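The claim does not spell out how the two random numbers enter the computation. One possible reading, sketched below as an assumption, treats the first random number as the Paillier encryption randomizer and the second random number as a re-randomizing factor applied by the second server. The toy primes, function names, and integer feature values are hypothetical and chosen only to keep the example small.

```python
import math
import random

# Toy Paillier with g = n + 1; fixed Mersenne primes, illustration only.
P, Q = 2**31 - 1, 2**61 - 1

def keygen(p=P, q=Q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))

def encrypt(n, m, r):
    # c = (1 + n)^m * r^n mod n^2, with a caller-supplied random number r.
    return (1 + m * n) * pow(r, n, n * n) % (n * n)

def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def second_server_products(n, enc_a, b, r2):
    # Enc(a_i)^b_i is an encryption of a_i * b_i; multiplying by r2^n
    # re-randomizes the ciphertext without changing the plaintext.
    mask = pow(r2, n, n * n)
    return [pow(c, b_i, n * n) * mask % (n * n) for c, b_i in zip(enc_a, b)]

def cross_product_sum(a, b):
    # First server: key pair, first random number, joint encrypted values.
    n, sk = keygen()
    r1 = random.randrange(1, n)
    enc_a = [encrypt(n, a_i, r1) for a_i in a]
    # Second server: works on ciphertexts with its own features and r2.
    r2 = random.randrange(1, n)
    enc_products = second_server_products(n, enc_a, b, r2)
    # First server: decrypt each product and add them up.
    return sum(decrypt(n, sk, c) for c in enc_products)

print(cross_product_sum([1, 2, 3], [4, 6, 3]))
```

Reusing one randomizer for every feature value follows the singular "first random number" in the claim wording; Paillier deployments normally draw fresh randomness per ciphertext so that equal plaintexts do not yield equal ciphertexts.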
  15. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring a first data set, wherein the first data set comprises first features of a plurality of first objects;
    performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second data set to be clustered, wherein the second data set comprises second features of a plurality of second objects;
    traversing the first objects in the first data set to be clustered;
    calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;
    performing DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
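Once all pairwise distances are available, the clustering step needs nothing beyond a distance lookup. The sketch below is a minimal textbook DBSCAN over a precomputed distance matrix, not the patent's exact procedure. The parameter names `eps` and `min_pts` and the one-dimensional sample data are assumptions, and in the federated setting the matrix entries for remote objects would come from the federated Euclidean distance algorithm.

```python
def dbscan(dist, eps, min_pts):
    """Cluster objects given a full pairwise distance matrix.

    Returns one label per object: 0, 1, ... for clusters, -1 for noise.
    """
    n = len(dist)
    labels = [None] * n          # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1       # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned, do not expand
            labels[j] = cluster
            nbrs_j = [k for k in range(n) if dist[j][k] <= eps]
            if len(nbrs_j) >= min_pts:   # j is a core point: keep expanding
                queue.extend(nbrs_j)
    return labels

# Hypothetical 1-D objects; the matrix holds their absolute differences.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(p - q) for q in points] for p in points]
print(dist and dbscan(dist, eps=1.5, min_pts=2))
```

Each object counts itself among its neighbors here, which matches the common convention for `min_pts`.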
  16. The computer-readable storage medium according to claim 15, wherein the step of performing horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second data set to be clustered, comprises:
    for each first feature in the first data set, calculating a first feature value cumulative sum of the first feature, and instructing the second server to calculate a second feature value cumulative sum of the second feature corresponding to the first feature;
    performing a calculation, with the second server and through a homomorphic encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain a joint mean of the first feature;
    calculating a first error cumulative sum of the first feature based on the joint mean, and instructing the second server to calculate a second error cumulative sum of the second feature based on the joint mean;
    performing a calculation, with the second server and through the homomorphic encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain a joint mean square error of the first feature;
    screening the first features in the first data set according to the obtained joint mean square errors to obtain the first data set to be clustered, and instructing the second server to screen the second features in the second data set according to the obtained joint mean square errors to obtain the second data set to be clustered.
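The arithmetic behind these steps can be sketched as follows. This is a plain-Python illustration in which the per-server sums are combined in the clear; in the claimed method those sums would be exchanged under homomorphic encryption. The threshold rule (keep a feature when its joint mean square error exceeds a cutoff), the function name, and the sample rows are assumptions, since the claim only says that features are screened according to the joint mean square error.

```python
def select_features(rows1, rows2, threshold):
    """Return indices of features whose joint variance exceeds `threshold`.

    rows1 / rows2 are the local data sets of the first and second server,
    one row per object, with the same feature columns (horizontal split).
    """
    n1, n2 = len(rows1), len(rows2)
    kept = []
    for f in range(len(rows1[0])):
        # Round 1: each server sums its own feature values.
        sum1 = sum(row[f] for row in rows1)
        sum2 = sum(row[f] for row in rows2)
        joint_mean = (sum1 + sum2) / (n1 + n2)
        # Round 2: each server sums squared errors against the joint mean.
        err1 = sum((row[f] - joint_mean) ** 2 for row in rows1)
        err2 = sum((row[f] - joint_mean) ** 2 for row in rows2)
        joint_mse = (err1 + err2) / (n1 + n2)
        if joint_mse > threshold:
            kept.append(f)
    return kept

rows1 = [[1.0, 5.0], [3.0, 5.0]]   # first server's objects
rows2 = [[5.0, 5.0], [7.0, 5.0]]   # second server's objects
print(select_features(rows1, rows2, threshold=0.5))
```

Feature 1 is constant across both servers, so its joint mean square error is zero and it is dropped, while feature 0 is kept.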
  17. The computer-readable storage medium according to claim 16, wherein the step of performing a calculation, with the second server and through the homomorphic encryption weighted average algorithm, on the first feature value cumulative sum and the second feature value cumulative sum to obtain the joint mean of the first feature comprises:
    generating a first homomorphic key pair;
    encrypting the first feature value cumulative sum and the first object count of the first data set with the first homomorphic key pair;
    sending the first encryption key of the first homomorphic key pair, the encrypted first feature value cumulative sum, and the encrypted first object count to the second server, so as to instruct the second server to perform a calculation based on the first encryption key, the encrypted first feature value cumulative sum, the encrypted first object count, the second feature value cumulative sum, and the second object count of the second data set, to obtain an encrypted joint cumulative sum and an encrypted joint object count;
    calculating the joint mean of the first feature from the encrypted joint cumulative sum and the encrypted joint object count returned by the second server.
  18. The computer-readable storage medium according to claim 16, wherein the step of performing a calculation, with the second server and through the homomorphic encryption weighted average algorithm, on the first error cumulative sum and the second error cumulative sum to obtain the joint mean square error of the first feature comprises:
    generating a second homomorphic key pair;
    encrypting the first error cumulative sum and the first object count of the first data set with the second homomorphic key pair;
    sending the second encryption key of the second homomorphic key pair, the encrypted first error cumulative sum, and the encrypted first object count to the second server, so as to instruct the second server to perform a calculation based on the second encryption key, the encrypted first error cumulative sum, the encrypted first object count, the second error cumulative sum, and the second object count of the second data set, to obtain an encrypted joint error cumulative sum and an encrypted joint object count;
    calculating the joint mean square error of the first feature from the encrypted joint error cumulative sum and the encrypted joint object count returned by the second server.
  19. The computer-readable storage medium according to claim 15, wherein the step of calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through the federated Euclidean distance algorithm, comprises:
    calculating the Euclidean distance between the current first object and each first object;
    calculating a first feature sum of squares of the current first object;
    for each second object, calculating, with the second server and through a product algorithm, a feature cross-product sum of the current first object and the second object, and instructing the second server to calculate a second feature sum of squares of the second object;
    calculating the Euclidean distance between the current first object and the second object from the first feature sum of squares, the feature cross-product sum, and the second feature sum of squares returned by the second server.
  20. The computer-readable storage medium according to claim 19, wherein the step of, for each second object, calculating, with the second server and through a product algorithm, the feature cross-product sum of the current first object and the second object, and instructing the second server to calculate the second feature sum of squares of the second object, comprises:
    generating a first random number, and generating a third homomorphic key pair based on the Paillier encryption algorithm;
    jointly encrypting each first feature value of the current first object with the first random number, using the third encryption key of the third homomorphic key pair, to obtain joint encrypted values;
    sending the joint encrypted values to the second server and, for each second object, instructing the second server to perform a calculation based on the joint encrypted values, the second feature values of the second object, and a generated second random number, to obtain the encrypted feature cross products, and instructing the second server to calculate the second feature sum of squares of the second object;
    receiving the encrypted feature cross products and the second feature sum of squares of the second object returned by the second server;
    decrypting the encrypted feature cross products with the third decryption key of the third homomorphic key pair to obtain the feature cross-product sum of the current first object and the second object.
PCT/CN2021/096851 2020-12-01 2021-05-28 Dbscan clustering method based on horizontal federation, and related device therefor WO2022116491A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011388364.3 2020-12-01
CN202011388364.3A CN112508075A (en) 2020-12-01 2020-12-01 Horizontal federation-based DBSCAN clustering method and related equipment thereof

Publications (1)

Publication Number Publication Date
WO2022116491A1 (en) 2022-06-09

Family

ID=74969352

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096851 WO2022116491A1 (en) 2020-12-01 2021-05-28 Dbscan clustering method based on horizontal federation, and related device therefor

Country Status (2)

Country Link
CN (1) CN112508075A (en)
WO (1) WO2022116491A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271733A (en) * 2022-09-28 2022-11-01 深圳市迪博企业风险管理技术有限公司 Privacy-protecting block chain transaction data anomaly detection method and equipment
CN117640253A (en) * 2024-01-25 2024-03-01 济南大学 Federal learning privacy protection method and system based on homomorphic encryption

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN112508075A (en) * 2020-12-01 2021-03-16 平安科技(深圳)有限公司 Horizontal federation-based DBSCAN clustering method and related equipment thereof
CN113487041B (en) * 2021-07-15 2024-05-07 深圳市与飞科技有限公司 Transverse federal learning method, device and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
US20190020477A1 (en) * 2017-07-12 2019-01-17 International Business Machines Corporation Anonymous encrypted data
CN110827924A (en) * 2019-09-23 2020-02-21 平安科技(深圳)有限公司 Clustering method and device for gene expression data, computer equipment and storage medium
CN111339212A (en) * 2020-02-13 2020-06-26 深圳前海微众银行股份有限公司 Sample clustering method, device, equipment and readable storage medium
CN111507481A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Federated learning system
US20200358599A1 (en) * 2019-05-07 2020-11-12 International Business Machines Corporation Private and federated learning
CN112508075A (en) * 2020-12-01 2021-03-16 平安科技(深圳)有限公司 Horizontal federation-based DBSCAN clustering method and related equipment thereof

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN115271733A (en) * 2022-09-28 2022-11-01 深圳市迪博企业风险管理技术有限公司 Privacy-protecting block chain transaction data anomaly detection method and equipment
CN117640253A (en) * 2024-01-25 2024-03-01 济南大学 Federal learning privacy protection method and system based on homomorphic encryption
CN117640253B (en) * 2024-01-25 2024-04-05 济南大学 Federal learning privacy protection method and system based on homomorphic encryption

Also Published As

Publication number Publication date
CN112508075A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
WO2022116491A1 (en) Dbscan clustering method based on horizontal federation, and related device therefor
WO2021249086A1 (en) Multi-party joint decision tree construction method, device and readable storage medium
WO2021114911A1 (en) User risk assessment method and apparatus, electronic device, and storage medium
US11263344B2 (en) Data management method and registration method for an anonymous data sharing system, as well as data manager and anonymous data sharing system
Parra-Arnau et al. Measuring the privacy of user profiles in personalized information systems
WO2022021696A1 (en) Multi-information source-based whole-process blockchain system
CN111081337B (en) Collaborative task prediction method and computer readable storage medium
CN110866546B (en) Method and device for evaluating consensus node
US20230239134A1 (en) Data processing permits system with keys
JP2016511891A (en) Privacy against sabotage attacks on large data
Amelkin et al. A distance measure for the analysis of polar opinion dynamics in social networks
CN109615021A (en) A kind of method for protecting privacy based on k mean cluster
WO2023216494A1 (en) Federated learning-based user service strategy determination method and apparatus
WO2023071105A1 (en) Method and apparatus for analyzing feature variable, computer device, and storage medium
WO2022237175A1 (en) Graph data processing method and apparatus, device, storage medium, and program product
CN104077723A (en) Social network recommending system and social network recommending method
CN108154048B (en) Asset information processing method and device
CN112968873B (en) Encryption method and device for private data transmission
CN113434906A (en) Data query method and device, computer equipment and storage medium
Vlachos et al. On data publishing with clustering preservation
CN114741595A (en) Information pushing method and device
CN111061695B (en) File sharing method and system based on block chain
Xu et al. FedG2L: a privacy-preserving federated learning scheme base on “G2L” against poisoning attack
CN110069924B (en) Computer user behavior monitoring method and computer-readable storage medium
Jain et al. An Approach to Identify Vulnerable Features of Instant Messenger

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21899534; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21899534; Country of ref document: EP; Kind code of ref document: A1)