WO2022116491A1

WO2022116491A1 - Dbscan clustering method based on horizontal federation, and related device therefor

Info

Publication number: WO2022116491A1
Application number: PCT/CN2021/096851
Authority: WO
Inventors: 王健宗; 李泽远
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-01
Filing date: 2021-05-28
Publication date: 2022-06-09
Also published as: CN112508075A

Abstract

A DBSCAN clustering method and apparatus (300) based on horizontal federation, and a computer device (4) and a storage medium, which belong to the field of artificial intelligence. The method comprises: acquiring a first data set, wherein the first data set comprises first features of several first objects (S201); performing horizontal federated learning with a second data set of a second server, and performing feature screening on the first data set by means of a federated variance selection algorithm, so as to obtain a first data set to be clustered (S202); traversing the first objects in the first data set to be clustered (S203); calculating a Euclidean distance between the current first object and each of the other first objects, and calculating a Euclidean distance between the current first object and each second object by means of a federated Euclidean distance algorithm (S204); and performing DBSCAN clustering on the current first object according to the obtained Euclidean distances, so as to obtain an object clustering result (S205). In addition, a first data set can be stored in a blockchain. By means of the method, the accuracy of object clustering is improved.

Description

DBSCAN clustering method based on horizontal federation and its related equipment

This application claims priority to the Chinese patent application filed on December 1, 2020, with application number 202011388364.3 and titled "DBSCAN clustering method based on horizontal federation and related equipment", the entire content of which is incorporated into this application by reference.

Technical field

The present application relates to the technical field of artificial intelligence, and in particular, to a DBSCAN clustering method, device, computer equipment and storage medium based on horizontal federation.

Background technique

With the in-depth development of computer technology, computers are used in various data mining scenarios. Object clustering is one type of data mining: by analyzing the data of each dimension of the objects, the objects are clustered so that identical or similar objects are grouped into one category. For example, in financial marketing scenarios, financial institutions obtain a large amount of user data every day, which contains a great deal of personal privacy or business secrets. By clustering user data, users can be classified so that services can be provided to different categories of users.

The DBSCAN algorithm is a density-based clustering algorithm. It defines a cluster as the largest set of density-connected points, can divide regions of sufficient density into clusters, and can find clusters of arbitrary shape in noisy spatial data sets. However, the inventor realized that the traditional DBSCAN algorithm cannot break the data barriers between different institutions: it can only cluster the internal data of one institution, and it is not suited to high-dimensional data, so the accuracy of clustering is low.

SUMMARY OF THE INVENTION

The purpose of the embodiments of the present application is to propose a DBSCAN clustering method, device, computer equipment and storage medium based on horizontal federation, so as to solve the problem of low accuracy of DBSCAN clustering.

In order to solve the above-mentioned technical problems, the embodiment of the present application provides a DBSCAN clustering method based on horizontal federation, and adopts the following technical solutions:

acquiring a first data set, wherein the first data set includes first features of several first objects;

Performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first to-be-clustered data set, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects;

Traversing the first object in the first data set to be clustered;

Calculating the Euclidean distance between the current first object and each of the other first objects, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;

DBSCAN clustering is performed on the current first object according to the obtained Euclidean distance to obtain an object clustering result.

In order to solve the above-mentioned technical problems, the embodiment of the present application also provides a DBSCAN clustering device based on horizontal federation, which adopts the following technical solutions:

a data set obtaining module, configured to obtain a first data set, wherein the first data set includes the first features of several first objects;

a feature screening module, configured to perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes the second features of several second objects;

an object traversal module, configured to traverse the first object in the first to-be-clustered data set;

a distance calculation module, configured to calculate the Euclidean distance between the current first object and each of the other first objects, and to calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;

The object clustering module is configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distance to obtain an object clustering result.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:

Traversing the first object in the first data set to be clustered;

In order to solve the above technical problems, the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:

Traversing the first object in the first data set to be clustered;

Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects: after the first data set is obtained, horizontal federated learning is performed with the second server, and the federated variance selection algorithm is used to perform feature screening on the first data set and on the second data set in the second server without exchanging specific data, which realizes feature dimensionality reduction and thereby adapts the data to the DBSCAN algorithm. At the same time, for the current first object traversed in the first to-be-clustered data set, the Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated through the federated Euclidean distance algorithm. In this way, the Euclidean distance between objects in two separate data sets is calculated without exchanging specific data, and the Euclidean distances are used for DBSCAN clustering. This breaks the data barrier and makes it possible to use data from different institutions without violating data privacy, thereby improving the accuracy of object clustering.

Description of drawings

In order to illustrate the solutions in the present application more clearly, the following will briefly introduce the accompanying drawings used in the description of the embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;

FIG. 2 is a flow chart of one embodiment of the DBSCAN clustering method based on horizontal federation according to the present application;

FIG. 3 is a schematic structural diagram of an embodiment of a horizontal federation-based DBSCAN clustering device according to the present application;

FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.

Detailed description

The terminology used herein in the specification of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "first", "second" and the like in the description and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101 and 102, a network 103, a first server 104 and a second server 105. The network 103 is used to provide a medium of communication links between the terminal devices 101 and 102, the first server 104 and the second server 105. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user can use the terminal device 101 to interact with the first server 104 through the network 103 to receive or send messages, and the user can also use the terminal device 102 to interact with the second server 105 through the network 103 to receive or send messages. Various communication client applications may be installed on the terminal devices 101 and 102.

The terminal devices 101 and 102 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers and desktop computers.

The first server 104 and the second server 105 may be servers that provide various services, and the first server 104 and the second server 105 may implement the DBSCAN clustering service based on horizontal federation.

It should be noted that the DBSCAN clustering method based on horizontal federation provided in the embodiments of the present application is generally executed by the first server and the second server; accordingly, the DBSCAN clustering device based on horizontal federation is generally set in the first server and the second server. In this application, the first server is taken as the main body for description.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.

Continuing to refer to FIG. 2 , a flow chart of one embodiment of the horizontal federation-based DBSCAN clustering method according to the present application is shown. The described DBSCAN clustering method based on horizontal federation includes the following steps:

Step S201, acquiring a first data set, wherein the first data set includes first features of several first objects.

In this embodiment, the electronic device (for example, the first server shown in FIG. 1 ) on which the DBSCAN clustering method based on horizontal federation runs can communicate through various wired connection methods or wireless connection methods.

Specifically, when performing horizontal federation-based DBSCAN clustering, the first server and the second server perform clustering at the same time: the first server obtains the first data set stored in the first server, and the second server obtains the second data set stored in the second server.

The first data set and the second data set may be the feature sets of the objects held by the two parties. The features of the first data set and the second data set, and the kind of information described by each feature, are the same, but the objects described by the first data set and the second data set are different. For example, in a financial marketing scenario, the first data set and the second data set may be user data of two companies, and the features may include the user's gender, education, work unit, past consumption data, and so on. The first data set and the second data set provide the data basis for object clustering.

The first data set is recorded as $X^A = \{x_i^A\}_{i=1}^{N^A}$, where $x_i^A = (x_{i,1}^A, \dots, x_{i,q}^A)$ is the feature set of the i-th object, the feature dimension is q, and the number of first objects in the data set is recorded as the first number of objects $N^A$. Similarly, the second data set is recorded as $X^B = \{x_i^B\}_{i=1}^{N^B}$, where $N^B$ is the second number of objects.

It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned first data set, the above-mentioned first data set may also be stored in a node of a blockchain.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.

Step S202, perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects.

Specifically, the first server and the second server may form a federated network and perform federated learning. In federated learning, the first server and the second server complete data operations without exchanging specific data. The first server and the second server can perform feature screening on the first data set through the federated variance selection algorithm, removing a part of the features to obtain the first to-be-clustered data set $I^A$. Similarly, the second server also performs feature screening on the second data set through the federated variance selection algorithm, removing a part of the features to obtain the second to-be-clustered data set $I^B$.

Further, the above step S202 may include:

Step S2021, for each first feature in the first data set, calculate the first feature-value cumulative sum of the first feature, and instruct the second server to calculate the second feature-value cumulative sum of the second feature corresponding to the first feature.

Specifically, for each feature j, j∈[1,q], the first server calculates the first feature-value cumulative sum of the first feature, $S_j^A = \sum_{i=1}^{N^A} x_{i,j}^A$, and the second server calculates the second feature-value cumulative sum of the second feature, $S_j^B = \sum_{i=1}^{N^B} x_{i,j}^B$, where $x_{i,j}^A$ and $x_{i,j}^B$ are the values of feature j for the i-th first object and the i-th second object respectively.

Step S2022 , calculating the cumulative sum of the first feature value and the cumulative sum of the second feature value with the second server through the weighted average algorithm of homomorphic encryption to obtain the joint mean value of the first feature.

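Specifically, the first server and the second server apply the homomorphic-encryption weighted average algorithm to the first feature-value cumulative sum $S_j^A$ and the second feature-value cumulative sum $S_j^B$ to obtain the joint mean of the first feature, $\bar{x}_j = \dfrac{S_j^A + S_j^B}{N^A + N^B}$, where $N^A$ and $N^B$ are the first and second numbers of objects.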

Further, the above step S2022 may include:

Step S20221, generate a first homomorphic key pair.

Specifically, the first server generates a first homomorphic key pair $(E_{k1}, D_{k1})$, where $E_{k1}$ is the first encryption key and $D_{k1}$ is the first decryption key. The first homomorphic key pair $(E_{k1}, D_{k1})$ satisfies homomorphic encryption.

Step S20222: Encrypt the accumulated sum of the first eigenvalues and the first number of objects in the first data set by using the first homomorphic key pair.

Specifically, the first server uses the first encryption key $E_{k1}$ in the first homomorphic key pair $(E_{k1}, D_{k1})$ to encrypt the first feature-value cumulative sum $S_j^A$, obtaining $E_{k1}(S_j^A)$, and uses the first encryption key $E_{k1}$ to encrypt the first number of objects $N^A$ of the first data set, obtaining $n_1 = E_{k1}(N^A)$.

Step S20223, sending the first encryption key in the first homomorphic key pair, the encrypted first feature-value cumulative sum, and the encrypted first number of objects to the second server, to instruct the second server to perform calculation according to the first encryption key, the encrypted first feature-value cumulative sum, the encrypted first number of objects, the second feature-value cumulative sum, and the second number of objects of the second data set, so as to obtain the encrypted joint cumulative sum and the encrypted number of joint objects.

Specifically, the first server sends the first encryption key $E_{k1}$ in the first homomorphic key pair, the encrypted first feature-value cumulative sum $E_{k1}(S_j^A)$, and the encrypted first number of objects $n_1 = E_{k1}(N^A)$ to the second server.

The second server generates a random message z∈M, and calculates the product $z \cdot S_j^B$ of the random message z and the second feature-value cumulative sum $S_j^B$, and the product $z \cdot N^B$ of the random message z and the second number of objects $N^B$. It then uses the first encryption key $E_{k1}$ to encrypt $z \cdot S_j^B$ and $z \cdot N^B$, obtaining $z_1 = E_{k1}(z \cdot S_j^B)$ and $z_2 = E_{k1}(z \cdot N^B)$. The second server calculates, in the ciphertext state, the encrypted joint cumulative sum $m_1 = E_{k1}(z \cdot S_j^A + z \cdot S_j^B)$ and the encrypted number of joint objects $m_2 = E_{k1}(z \cdot N^A + z \cdot N^B)$, and then sends $m_1$ and $m_2$ to the first server.

Step S20224: Calculate the joint mean value of the first feature according to the encrypted joint cumulative sum and the number of encrypted joint objects returned by the second server.

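After the first server obtains the encrypted joint cumulative sum $m_1 = E_{k1}(z \cdot S_j^A + z \cdot S_j^B)$ and the encrypted number of joint objects $m_2 = E_{k1}(z \cdot N^A + z \cdot N^B)$, where $S_j^A$ and $S_j^B$ are the first and second feature-value cumulative sums, it uses the first decryption key $D_{k1}$ in the first homomorphic key pair to decrypt $m_1$ and $m_2$, obtaining $z \cdot (S_j^A + S_j^B)$ and $z \cdot (N^A + N^B)$. It then computes the joint mean of the first feature, $\bar{x}_j = \dfrac{z \cdot (S_j^A + S_j^B)}{z \cdot (N^A + N^B)} = \dfrac{S_j^A + S_j^B}{N^A + N^B}$, and sends $\bar{x}_j$ to the second server. Understandably, $\bar{x}_j$ also serves as the joint mean of the corresponding second feature.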

Steps S20221-S20224 implement the weighted average algorithm of homomorphic encryption.

In this embodiment, through the weighted average algorithm of homomorphic encryption, on the premise of not exchanging the underlying data, the joint average value of the feature is obtained by combining the first data set and the second data set.
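The exchange in steps S20221-S20224 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the description does not name the homomorphic scheme used for the weighted average, so a toy Paillier cryptosystem with small fixed primes stands in for it; the function names (`keygen`, `encrypt`, `decrypt`, `he_add`, `he_scale`, `second_server_aggregate`, `joint_mean`) are hypothetical; and the sums are assumed to be non-negative integers (real-valued features would first need fixed-point scaling).

```python
import math
import random
from fractions import Fraction

# Toy Paillier cryptosystem (additively homomorphic).
# Fixed, publicly known primes: for illustration only, NOT secure.
P, Q = 2**31 - 1, 2**61 - 1

def keygen():
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
    mu = pow(lam, -1, n)               # modular inverse, valid for generator g = n + 1
    return n, (n, lam, mu)             # public key n, private key (n, lam, mu)

def encrypt(n, m):
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(priv, c):
    n, lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

def he_add(n, c1, c2):                 # E(a) combined with E(b) gives E(a + b)
    return c1 * c2 % (n * n)

def he_scale(n, c, k):                 # E(a) scaled by plaintext k gives E(k * a)
    return pow(c, k, n * n)

def second_server_aggregate(n, enc_sum_a, enc_cnt_a, sum_b, cnt_b):
    """Second server: mask everything with a random z and combine in ciphertext."""
    z = random.randrange(1, 10**6)
    m1 = he_add(n, he_scale(n, enc_sum_a, z), encrypt(n, z * sum_b))
    m2 = he_add(n, he_scale(n, enc_cnt_a, z), encrypt(n, z * cnt_b))
    return m1, m2

def joint_mean(sum_a, cnt_a, sum_b, cnt_b):
    pub, priv = keygen()                                               # S20221
    enc_sum_a, enc_cnt_a = encrypt(pub, sum_a), encrypt(pub, cnt_a)    # S20222
    m1, m2 = second_server_aggregate(pub, enc_sum_a, enc_cnt_a,
                                     sum_b, cnt_b)                     # S20223
    # S20224: the first server decrypts z*(S_A + S_B) and z*(N_A + N_B); z cancels.
    return Fraction(decrypt(priv, m1), decrypt(priv, m2))

print(joint_mean(120, 4, 300, 6))
```

Because the first server only ever sees the two masked totals, it learns the joint mean but not the second server's individual sum or object count; the same routine can be reused for the joint mean square error by passing accumulated error sums instead of feature-value sums.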

Step S2023: Calculate the first accumulated error sum of the first feature based on the joint mean value, and instruct the second server to calculate the second accumulated error sum of the second feature based on the joint mean value.

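Specifically, the first server calculates the first accumulated error sum $V_j^A = \sum_{i=1}^{N^A} (x_{i,j}^A - \bar{x}_j)^2$ according to the joint mean $\bar{x}_j$ and each first feature value $x_{i,j}^A$ in the first data set. The second server calculates the second accumulated error sum $V_j^B = \sum_{i=1}^{N^B} (x_{i,j}^B - \bar{x}_j)^2$ according to the joint mean $\bar{x}_j$ and each second feature value $x_{i,j}^B$ in the second data set.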

Step S2024 , calculating the accumulated sum of the first error and the accumulated sum of the second error with the second server through the weighted average algorithm of homomorphic encryption, to obtain the joint mean square error of the first feature.

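Specifically, the first server and the second server apply the homomorphic-encryption weighted average algorithm to the first accumulated error sum $V_j^A$ and the second accumulated error sum $V_j^B$ to obtain the joint mean square error of the first feature, $\sigma_j^2 = \dfrac{V_j^A + V_j^B}{N^A + N^B}$, where $N^A$ and $N^B$ are the first and second numbers of objects.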

Further, the above step S2024 may include:

Step S20241, generate a second homomorphic key pair.

Specifically, the first server generates a second homomorphic key pair $(E_{k2}, D_{k2})$, where $E_{k2}$ is the second encryption key and $D_{k2}$ is the second decryption key. The second homomorphic key pair $(E_{k2}, D_{k2})$ satisfies homomorphic encryption.

Step S20242: Encrypt the first accumulated error sum and the first number of objects in the first data set by using the second homomorphic key pair.

Specifically, the first server uses the second encryption key $E_{k2}$ in the second homomorphic key pair $(E_{k2}, D_{k2})$ to encrypt the first accumulated error sum $V_j^A$, obtaining $E_{k2}(V_j^A)$, and uses the second encryption key $E_{k2}$ to encrypt the first number of objects $N^A$ of the first data set, obtaining $n_1 = E_{k2}(N^A)$.

Step S20243, sending the second encryption key in the second homomorphic key pair, the encrypted first accumulated error sum, and the encrypted first number of objects to the second server, to instruct the second server to perform calculation according to the second encryption key, the encrypted first accumulated error sum, the encrypted first number of objects, the second accumulated error sum, and the second number of objects of the second data set, so as to obtain the encrypted joint accumulated error sum and the encrypted number of joint objects.

Specifically, the first server sends the second encryption key $E_{k2}$ in the second homomorphic key pair, the encrypted first accumulated error sum $E_{k2}(V_j^A)$, and the encrypted first number of objects $n_1 = E_{k2}(N^A)$ to the second server.

The second server generates a random message z∈M, and calculates the product $z \cdot V_j^B$ of the random message and the second accumulated error sum $V_j^B$, and the product $z \cdot N^B$ of the random message and the second number of objects $N^B$. It then uses the second encryption key $E_{k2}$ to encrypt $z \cdot V_j^B$ and $z \cdot N^B$, obtaining $z_1 = E_{k2}(z \cdot V_j^B)$ and $z_2 = E_{k2}(z \cdot N^B)$. The second server calculates, in the ciphertext state, the encrypted joint accumulated error sum $m_1 = E_{k2}(z \cdot V_j^A + z \cdot V_j^B)$ and the encrypted number of joint objects $m_2 = E_{k2}(z \cdot N^A + z \cdot N^B)$, and then sends $m_1$ and $m_2$ to the first server.

Step S20244: Calculate the joint mean square error of the first feature according to the encrypted cumulative sum of joint errors returned by the second server and the number of encrypted joint objects.

Specifically, after the first server receives the encrypted joint accumulated error sum $m_1 = E_{k2}(z \cdot V_j^A + z \cdot V_j^B)$ and the encrypted number of joint objects $m_2 = E_{k2}(z \cdot N^A + z \cdot N^B)$, where $V_j^A$ and $V_j^B$ are the first and second accumulated error sums, it uses the second decryption key $D_{k2}$ in the second homomorphic key pair to decrypt $m_1$ and $m_2$, obtaining $z \cdot (V_j^A + V_j^B)$ and $z \cdot (N^A + N^B)$. It then computes the joint mean square error of the first feature, $\sigma_j^2 = \dfrac{z \cdot (V_j^A + V_j^B)}{z \cdot (N^A + N^B)} = \dfrac{V_j^A + V_j^B}{N^A + N^B}$, and sends $\sigma_j^2$ to the second server. Understandably, $\sigma_j^2$ also serves as the joint mean square error of the corresponding second feature.

Steps S20241-S20244 implement the weighted average algorithm of homomorphic encryption.

In this embodiment, through the weighted average algorithm of homomorphic encryption, on the premise of not exchanging the underlying data, the joint mean square error of the feature is calculated in combination with the first data set and the second data set.

Step S2025: Screen the first features in the first data set according to the obtained joint mean square errors to obtain the first to-be-clustered data set, and instruct the second server to screen the second features in the second data set according to the obtained joint mean square errors to obtain the second to-be-clustered data set.

Specifically, the joint mean square error of a feature can be used as a measure of the importance of the feature. The joint mean square error is calculated for each feature, the q joint mean square errors are sorted in descending order, and the first d features are selected as the screened features. The first server and the second server both perform the above screening operation, and obtain the first to-be-clustered data set $I^A$ and the second to-be-clustered data set $I^B$ respectively.

In this embodiment, through the federated variance selection algorithm, feature screening is performed on the first data set and on the second data set in the second server without exchanging the underlying data; the features most useful for clustering are retained and feature dimensionality reduction is achieved, so as to adapt the data to the DBSCAN algorithm.
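The overall flow of steps S2021-S2025 can be illustrated with the following Python sketch. It is a plaintext simulation under stated assumptions: the function `weighted_average` simply adds the two parties' totals and stands in for the homomorphic-encryption weighted average exchange, so no privacy protection is modeled here; all function names and the sample data are hypothetical.

```python
def local_sums(rows):
    """Per-feature cumulative sums of one party's rows (S2021)."""
    return [sum(col) for col in zip(*rows)]

def local_squared_errors(rows, means):
    """Per-feature accumulated squared errors around the joint means (S2023)."""
    return [sum((v - m) ** 2 for v in col) for col, m in zip(zip(*rows), means)]

def weighted_average(total_a, n_a, total_b, n_b):
    # Plaintext stand-in for the homomorphic-encryption weighted average.
    return (total_a + total_b) / (n_a + n_b)

def federated_variance_select(rows_a, rows_b, d):
    """Return the kept feature indices and both parties' reduced data sets."""
    n_a, n_b = len(rows_a), len(rows_b)
    # S2022: joint mean of every feature.
    means = [weighted_average(sa, n_a, sb, n_b)
             for sa, sb in zip(local_sums(rows_a), local_sums(rows_b))]
    # S2024: joint mean square error of every feature.
    mse = [weighted_average(va, n_a, vb, n_b)
           for va, vb in zip(local_squared_errors(rows_a, means),
                             local_squared_errors(rows_b, means))]
    # S2025: sort in descending order and keep the first d features.
    ranked = sorted(range(len(mse)), key=lambda j: mse[j], reverse=True)
    keep = sorted(ranked[:d])

    def project(rows):
        return [[row[j] for j in keep] for row in rows]

    return keep, project(rows_a), project(rows_b)

# Hypothetical data: 3 features, 2 objects per party.
rows_a = [[1.0, 10.0, 5.0], [2.0, 30.0, 5.0]]
rows_b = [[3.0, 50.0, 5.0], [2.0, 70.0, 6.0]]
keep, reduced_a, reduced_b = federated_variance_select(rows_a, rows_b, d=2)
print(keep, reduced_a, reduced_b)
```

In this sample, the third feature varies least across the pooled objects, so it is the one dropped; each party applies the same index list to its own rows and never needs the other party's values.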

Step S203, traverse the first object in the first data set to be clustered.

Specifically, the first server traverses the first objects in the first to-be-clustered data set to perform clustering processing on each first object respectively.

Step S204: Calculate the Euclidean distance between the current first object and each of the other first objects, and calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm.

Specifically, the first object currently being traversed is taken as the current first object; the Euclidean distance between the current first object and each first object in the first to-be-clustered data set is calculated, and the Euclidean distance between the current first object and each second object in the second to-be-clustered data set is calculated through the federated Euclidean distance algorithm. Based on the federated Euclidean distance algorithm, the first server and the second server do not have to exchange real underlying data when calculating the Euclidean distance.

Further, the above step S204 may include:

Step S2041: Calculate the Euclidean distance between the current first object and each first object.

Specifically, the first server calculates the Euclidean distance between the current first object and each first object in the first to-be-clustered data set. Let the current first object be $x_i^A$ and another first object be $x_k^A$, with feature dimension d; then the Euclidean distance between $x_i^A$ and $x_k^A$ is:

$$\mathrm{dist}(x_i^A, x_k^A) = \sqrt{\sum_{j=1}^{d} \left(x_{i,j}^A - x_{k,j}^A\right)^2}$$

There is no data privacy restriction within the first to-be-clustered data set, so the Euclidean distance between the current first object and each first object can be calculated directly by substituting the first feature values of each first feature.
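Since both objects are held by the first server, step S2041 is an ordinary distance computation. A minimal Python sketch, with a hypothetical function name and sample values, could look like this:

```python
import math

def local_distances(current, others):
    """Euclidean distance from the current first object to each local first object."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(current, other)))
        for other in others
    ]

# Hypothetical objects with d = 2 screened features.
print(local_distances([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]]))
```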

Step S2042: Calculate the sum of squares of the first feature of the current first object.

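Specifically, when calculating the Euclidean distance between the current first object and each second object in the second to-be-clustered data set, let the current first object be $x_i^A$ and the second object be $x_k^B$, with feature dimension d; then the Euclidean distance between $x_i^A$ and $x_k^B$ is:

$$\mathrm{dist}(x_i^A, x_k^B) = \sqrt{\sum_{j=1}^{d} \left(x_{i,j}^A - x_{k,j}^B\right)^2} = \sqrt{\sum_{j=1}^{d} \left(x_{i,j}^A\right)^2 - 2\sum_{j=1}^{d} x_{i,j}^A x_{k,j}^B + \sum_{j=1}^{d} \left(x_{k,j}^B\right)^2}$$

Understandably, $x_{i,j}^A$ is the feature value of the j-th feature of the first object, and $x_{k,j}^B$ is the feature value of the j-th feature of the second object.

The first server calculates the first feature square sum of the current first object, $\sum_{j=1}^{d} \left(x_{i,j}^A\right)^2$.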

Step S2043, for each second object, calculate the cross product sum of features of the current first object and the second object through the product algorithm with the second server, and instruct the second server to calculate the second feature square sum of the second object.

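Specifically, the Euclidean distance needs to be calculated between the current first object and each second object. When calculating the Euclidean distance to one of the second objects, the first server provides multiple inputs $x_{i,j}^A$, j∈[1,d], which are the first feature values of the current first object, while the second server provides multiple inputs $x_{k,j}^B$, j∈[1,d], which are the second feature values of the second object,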

and random numbers r _j ,j∈[1,d], where the second server needs to generate d random numbers r ₁ , r ₂ ,...r _d , and satisfy

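The first server and the second server use the product algorithm to calculate the product of each pair of corresponding feature values, and sum the products to obtain the feature cross product sum of the first object and the second object, $\sum_{j=1}^{d} x_{i,j}^A x_{k,j}^B$, where $x_{i,j}^A$ and $x_{k,j}^B$ are the j-th feature values of the current first object and the second object. At the same time, the second server calculates the second feature square sum of the second object, $\sum_{j=1}^{d} \left(x_{k,j}^B\right)^2$.

When the first server and the second server calculate the feature cross product sum, the calculation is performed based on the product algorithm, and the underlying feature values do not need to be exchanged.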

Further, the above step S2043 may include:

Step S20431, generate a first random number, and generate a third homomorphic key pair based on the Paillier encryption algorithm.

Specifically, the first server generates a first random number v, and generates a third homomorphic key pair $(E_{k3}, D_{k3})$ based on the Paillier encryption algorithm, where $E_{k3}$ is the third encryption key and $D_{k3}$ is the third decryption key. The Paillier encryption algorithm is a homomorphic encryption scheme that supports addition of ciphertexts and multiplication of a ciphertext by a plaintext value.

Step S20432: Jointly encrypt each first characteristic value of the current first object and the first random number by using the third encryption key in the third homomorphic key pair to obtain a joint encrypted value.

Specifically, the first server uses the third encryption key $E_{k3}$ in the third homomorphic key pair $(E_{k3}, D_{k3})$ to jointly encrypt each first feature value of the current first object with the first random number v, obtaining a joint encrypted value

Step S20433: Send the joint encrypted value to the second server, and for each second object, instruct the second server to perform calculation according to the joint encrypted value, each second feature value of the second object, and the generated second random numbers to obtain each encrypted feature cross product, and instruct the second server to calculate the second feature square sum of the second object.

Specifically, the first server sends the third encryption key $E_{k3}$, the first random number v and the joint encrypted value to the second server. The second server generates second random numbers $r_j$, j∈[1,d]. The second server then performs calculation according to the joint encrypted value, each second feature value of the second object, and the generated second random numbers $r_j$, j∈[1,d], to obtain each encrypted feature cross product u'.

Step S20434: Receive the cross product of each encrypted feature returned by the second server and the square sum of the second feature of the second object.

Specifically, the second server sends each encrypted feature cross product and the second feature square sum of the second object to the first server. When sending the second feature square sum, the second server may instead send a value that is masked using the first random number v, so that the real second feature square sum is not exposed; the influence of the first random number v can be canceled by the first server.

Step S20435: Decrypt each encrypted feature cross product by using the third decryption key in the third homomorphic key pair to obtain the feature cross product sum of the current first object and the second object.

Specifically, the first server uses the third decryption key D_k3 of the third homomorphic key pair (E_k3, D_k3) to decrypt each encrypted feature cross product u': u = D_k3(u'). Owing to the inherent properties of the Paillier encryption algorithm, the decrypted result is the feature cross product sum of the current first object and the second object plus the second random numbers r_j. The influence of the second random numbers r_j can be cancelled when calculating the Euclidean distance.

Steps S20431-S20435 are the realization steps of the product algorithm.

In this embodiment, the product algorithm calculates the feature cross product sum of the current first object and the second object while protecting the data privacy of the first to-be-clustered data set and the second to-be-clustered data set, which ensures that the Euclidean distance between the current first object and the second object can be calculated.
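The product algorithm of steps S20431-S20435 can be illustrated with a minimal sketch. It assumes a toy Paillier implementation built on two small, well-known Mersenne primes (not secure), non-negative integer feature values, and one plausible masking scheme: each encrypted product is masked with r_j, and the second sum of squares is returned offset by twice the sum of the masks so that the masks cancel in the squared distance. The exact masking formulas and the role of the first random number r are not recoverable from the text, so all function names and the masking scheme here are assumptions.

```python
import math
import random

# Toy Paillier parameters from two Mersenne primes (illustration only, not secure).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
MU = pow(LAM, -1, N)  # modular inverse; valid because the generator is n + 1


def encrypt(m):
    """Stand-in for E_k3: encrypt a non-negative integer m < N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Stand-in for D_k3: recover the plaintext integer."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def second_server(enc_x, y):
    """Runs on the second server, which only sees ciphertexts of x."""
    masks = [random.randrange(1, 1000) for _ in y]  # assumed second random numbers r_j
    # Paillier property: E(x_j)^y_j * E(r_j) = E(x_j * y_j + r_j)
    enc_products = [(pow(c, yj, N2) * encrypt(rj)) % N2
                    for c, yj, rj in zip(enc_x, y, masks)]
    # Assumed masking: offset the sum of squares by 2 * sum(r_j).
    masked_sq_sum = sum(v * v for v in y) + 2 * sum(masks)
    return enc_products, masked_sq_sum


def federated_squared_distance(x, y):
    """First server's view: x is local, y stays on the second server."""
    enc_x = [encrypt(v) for v in x]
    enc_products, masked_sq2 = second_server(enc_x, y)
    masked_cross = sum(decrypt(c) for c in enc_products)  # sum(x_j*y_j) + sum(r_j)
    sq1 = sum(v * v for v in x)
    return sq1 - 2 * masked_cross + masked_sq2  # masks cancel here
```

In this sketch the first server never sees the individual values of y, and the second server only handles ciphertexts of x. A production system would use a vetted Paillier library, full-size keys, and a fixed-point encoding for real-valued features.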

Step S2044: Calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.

Specifically, the first server calculates the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.

Steps S2042-S2044 constitute the federated Euclidean distance algorithm.

In this embodiment, the federated Euclidean distance algorithm calculates the Euclidean distance between objects in the first data set to be clustered and objects in the second data set to be clustered without infringing on data privacy, which ensures that DBSCAN clustering can be performed across the two data sets.
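The calculation in step S2044 rests on the standard expansion of the squared Euclidean distance. As a sketch, write x_j and y_j for the j-th feature values of the current first object and of the second object, with j∈[1,d]; these symbols are chosen here for illustration and are not taken from the original formulas.

```latex
\begin{aligned}
\mathrm{dist}(x, y)^2 &= \sum_{j=1}^{d} (x_j - y_j)^2 \\
                      &= \sum_{j=1}^{d} x_j^2 \;-\; 2\sum_{j=1}^{d} x_j y_j \;+\; \sum_{j=1}^{d} y_j^2
\end{aligned}
```

The three terms correspond to the first feature square sum, the feature cross product sum, and the second feature square sum, so the first server only needs these three aggregate values, not the individual second feature values.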

Step S205, DBSCAN clustering is performed on the current first object according to the obtained Euclidean distance to obtain an object clustering result.

Specifically, after obtaining the Euclidean distance between the current first object and each first object and each second object, DBSCAN clustering can be performed on the current first object according to the DBSCAN algorithm to obtain a clustering result. The clustering result can be regarded as group division of the objects in the first to-be-clustered data set and the second to-be-clustered data set.

Further, the above step S205 may include:

Step S2051, according to the obtained Euclidean distance and a preset threshold of the number of neighboring objects, determine whether the current first object is a core point.

Specifically, in the DBSCAN algorithm, given a data set D = {x_1, x_2, ..., x_m}, the following definitions apply:

(1) ε-neighborhood N_ε(x_j): for x_j ∈ D, the ε-neighborhood is the subset of samples in D whose Euclidean distance from x_j is not greater than ε, that is, N_ε(x_j) = {x_i ∈ D | distance(x_i, x_j) ≤ ε}; |N_ε(x_j)| denotes the number of samples in the ε-neighborhood of sample x_j.

(2) Core point: for any sample x_j ∈ D, if its ε-neighborhood N_ε(x_j) contains at least MinPts samples, that is, |N_ε(x_j)| ≥ MinPts, then x_j is a core point.

(3) Boundary point: if the ε-neighborhood N_ε(x_j) of sample x_j ∈ D contains fewer than MinPts samples, but x_j lies in the ε-neighborhood of another core point, then x_j is a boundary point.

(4) Noise point: a sample that is neither a core point nor a boundary point.

(5) Directly density-reachable: if x_i lies in the ε-neighborhood of x_j, and x_j is a core point, then x_i is directly density-reachable from x_j.

(6) Density-reachable: for x_i and x_j, if there is a sample sequence p_1, p_2, ..., p_T with p_1 = x_i and p_T = x_j, in which every p_{t+1} is directly density-reachable from p_t, then x_j is density-reachable from x_i; that is, density reachability is transitive.

(7) Density-connected: for x_i and x_j, if there is a core point x_k such that both x_i and x_j are density-reachable from x_k, then x_i and x_j are density-connected.

In summary, the first server uses the calculated Euclidean distances to find the objects in the clustering neighborhood (i.e. the ε-neighborhood) of the current first object; these objects may come from the first data set to be clustered or from the second data set to be clustered. The number of these objects is compared with the preset threshold MinPts of the number of neighboring objects to determine whether the current first object is a core point.
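The core point test of step S2051 can be sketched as follows. The sketch assumes the distances from the current first object to every object (local and remote, including itself at distance 0) have already been collected into one list; the function names are hypothetical.

```python
def epsilon_neighborhood(distances, eps):
    """Indices of all objects whose distance to the current object is <= eps.

    `distances[i]` is the distance from the current object to object i;
    the current object itself is expected to appear with distance 0.
    """
    return [i for i, dist in enumerate(distances) if dist <= eps]


def is_core_point(distances, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts objects."""
    return len(epsilon_neighborhood(distances, eps)) >= min_pts
```

For example, with distances `[0.0, 0.5, 1.2, 0.9, 3.0]` and `eps = 1.0`, three objects fall inside the neighborhood, so the object is a core point for `min_pts = 3` but not for `min_pts = 4`.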

Step S2052, when the current first object is a core point, determine the density-reachable points in the clustering neighborhood of the current first object, and obtain the object clustering result, wherein the density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set.

Specifically, when the current first object is a core point, density-reachable points are searched for in its clustering neighborhood according to the definitions of the DBSCAN algorithm and the calculated Euclidean distances. The density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set, and the density-reachable points found form a cluster. If the current first object is a boundary point or a noise point, it is not processed, and the next core point is searched for. This continues until all first objects in the first to-be-clustered data set have been processed, yielding the object clustering result, in which each cluster is one clustering result.
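The cluster expansion described above can be sketched with the textbook DBSCAN procedure over a precomputed distance matrix. This is the standard algorithm, not necessarily the exact traversal order of the application. It assumes that `dist[i][j]` already holds the Euclidean distance between objects i and j, whether computed locally or through the federated algorithm, and the label -1 for noise is a convention chosen here.

```python
from collections import deque

NOISE = -1


def dbscan(dist, eps, min_pts):
    """Return one cluster label per object; NOISE marks noise points."""
    n = len(dist)
    labels = [None] * n
    cluster = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # may later turn out to be a boundary point
            continue
        labels[p] = cluster  # p is a core point: start a new cluster
        queue = deque(neighbors)
        while queue:
            q = queue.popleft()
            if labels[q] == NOISE:
                labels[q] = cluster  # boundary point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:  # q is also a core point: keep expanding
                queue.extend(q_neighbors)
        cluster += 1
    return labels
```

With one-dimensional points 0, 1, 2, 10, 11, 12 and 50, `eps = 1.5` and `min_pts = 2`, the sketch yields two clusters and marks the isolated point as noise.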

It can be understood that the second server can perform DBSCAN clustering on the second objects by the same operations as the first server. The horizontal federation-based DBSCAN clustering method of the present application realizes object clustering, and the objects within each clustering result have a certain degree of similarity. For example, in a financial marketing scenario, after horizontal federation-based DBSCAN clustering is performed on users according to user data, each clustering result can be a group of users with similar behaviors. The horizontal federation-based DBSCAN clustering method is thus equivalent to dividing the users into communities.

In this embodiment, when it is determined according to the Euclidean distance and the preset threshold of the number of neighboring objects that the current first object is a core point, DBSCAN clustering is performed on the current first object, which makes it possible to cluster objects using the data sets of different institutions, breaks down data barriers, and improves the accuracy of DBSCAN clustering.

In this embodiment, after the first data set is acquired, horizontal federated learning is performed with the second server, and the federated variance selection algorithm performs feature screening on the first data set and on the second data set in the second server at the same time, without exchanging specific data. For the current first object traversed in the first data set to be clustered, the Euclidean distance between the current first object and each first object in the first data set to be clustered is calculated, and the Euclidean distance between the current first object and each second object in the second data set to be clustered is calculated by the federated Euclidean distance algorithm, so that the distances between objects in the two separate data sets are obtained without exchanging specific data. These Euclidean distances are used for DBSCAN clustering, which breaks down data barriers, allows objects to be clustered using the data sets of different institutions without infringing on data privacy, and improves the accuracy of object clustering.

The horizontal federation-based DBSCAN clustering method and related devices in this application relate to cluster analysis in the field of artificial intelligence, and may also relate to asset management in financial technology.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Although the steps in the flowcharts of the accompanying drawings are shown in the sequence indicated by the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and they may be performed in other orders.

Further referring to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a DBSCAN clustering apparatus based on horizontal federation. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.

As shown in FIG. 3, the horizontal federation-based DBSCAN clustering device 300 in this embodiment includes: a data set acquisition module 301, a feature screening module 302, an object traversal module 303, a distance calculation module 304, and an object clustering module 305, wherein:

The data set obtaining module 301 is configured to obtain a first data set, wherein the first data set includes first features of several first objects.

The feature screening module 302 is configured to perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects.

The object traversal module 303 is configured to traverse the first object in the first to-be-clustered data set.

The distance calculation module 304 is configured to calculate the Euclidean distance between the current first object and each first object, and calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm.

The object clustering module 305 is configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distance to obtain an object clustering result.

In this embodiment, after the first data set is acquired, horizontal federated learning is performed with the second server, and the federated variance selection algorithm performs feature screening on the first data set and on the second data set in the second server at the same time, without exchanging specific data. For the current first object traversed in the first data set to be clustered, the Euclidean distance between the current first object and each first object in the first data set to be clustered is calculated, and the Euclidean distance between the current first object and each second object in the second data set to be clustered is calculated by the federated Euclidean distance algorithm, so that the distances between objects in the two separate data sets are obtained without exchanging specific data. These Euclidean distances are used for DBSCAN clustering, which breaks down data barriers, allows objects to be clustered using the data sets of different institutions without infringing on data privacy, and improves the accuracy of object clustering.

In some optional implementations of this embodiment, the feature screening module 302 may include: a feature value calculation submodule, an accumulation sum calculation submodule, an error calculation submodule, a mean square error calculation submodule, and a feature screening submodule, wherein:

The feature value calculation submodule is configured to, for each first feature in the first data set, calculate the accumulated sum of the first feature values of the first feature, and instruct the second server to calculate the accumulated sum of the second feature values of the second feature corresponding to the first feature.

The accumulation sum calculation submodule is configured to perform calculation on the accumulated sum of the first feature values and the accumulated sum of the second feature values with the second server through the homomorphic encryption weighted average algorithm, to obtain the joint mean value of the first feature.

The error calculation submodule is configured to calculate the first accumulated error sum of the first feature based on the joint mean value, and instruct the second server to calculate the second accumulated error sum of the second feature based on the joint mean value.

The mean square error calculation submodule is configured to perform calculation on the first accumulated error sum and the second accumulated error sum with the second server through the homomorphic encryption weighted average algorithm, to obtain the joint mean square error of the first feature.

The feature screening submodule is configured to screen the first features in the first data set according to the obtained joint mean square error to obtain the first data set to be clustered, and to instruct the second server to screen the second features in the second data set according to the obtained joint mean square error to obtain a second data set to be clustered.
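The data flow of the five submodules above can be sketched without the encryption layer. The sketch assumes that each server holds a mapping from feature name to its local values, and that features whose joint mean square error exceeds a threshold are kept. The application does not state the screening rule or threshold at this point, so `threshold`, the keep-if-greater rule, and all function names are assumptions; in the actual scheme the two sums would be combined under homomorphic encryption rather than in the clear.

```python
def local_sum(values):
    """Accumulated sum of feature values and object count held by one server."""
    return sum(values), len(values)


def joint_mean(sum1, n1, sum2, n2):
    """Weighted average of both servers' sums (done under encryption in the scheme)."""
    return (sum1 + sum2) / (n1 + n2)


def local_error_sum(values, mean):
    """Accumulated squared error of one server's values around the joint mean."""
    return sum((v - mean) ** 2 for v in values)


def joint_mean_square_error(err1, n1, err2, n2):
    return (err1 + err2) / (n1 + n2)


def select_features(first, second, threshold):
    """Keep features whose joint mean square error exceeds `threshold` (assumed rule)."""
    kept = []
    for name in first:
        s1, n1 = local_sum(first[name])
        s2, n2 = local_sum(second[name])
        mean = joint_mean(s1, n1, s2, n2)
        e1 = local_error_sum(first[name], mean)
        e2 = local_error_sum(second[name], mean)
        if joint_mean_square_error(e1, n1, e2, n2) > threshold:
            kept.append(name)
    return kept
```

Only aggregate values (sums, counts, and error sums) cross the server boundary in this flow, which is what allows the later steps to wrap them in homomorphic encryption.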

In some optional implementations of this embodiment, the accumulation sum calculation submodule may include: a first generating unit, a first encryption unit, a first sending unit, and a mean value calculation unit, wherein:

The first generating unit is used to generate a first homomorphic key pair.

The first encryption unit is configured to encrypt the accumulated sum of the first feature values and the first object quantity of the first data set by using the first homomorphic key pair.

The first sending unit is configured to send the first encryption key in the first homomorphic key pair, the encrypted accumulated sum of the first feature values, and the encrypted first object quantity to the second server, so as to instruct the second server to perform calculation according to the first encryption key, the encrypted accumulated sum of the first feature values, the encrypted first object quantity, the accumulated sum of the second feature values, and the second object quantity of the second data set, to obtain the encrypted joint accumulated sum and the encrypted joint object quantity.

The mean value calculation unit is configured to calculate the joint mean value of the first feature according to the encrypted joint cumulative sum and the number of encrypted joint objects returned by the second server.
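The exchange handled by these four units can be sketched as follows. It assumes a toy Paillier scheme built on two small Mersenne primes (not secure) and integer-valued sums; the application does not name the homomorphic scheme used for the first key pair, so Paillier and all function names here are assumptions.

```python
import math
import random

# Toy Paillier parameters (illustration only, not secure).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
MU = pow(LAM, -1, N)  # modular inverse; valid because the generator is n + 1


def encrypt(m):
    """Encrypt a non-negative integer m < N with the public key."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt with the private key, held only by the first server."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def second_server_combine(enc_sum1, enc_n1, sum2, n2):
    """Second server adds its own sum and count without decrypting anything.

    Paillier property: E(a) * E(b) mod n^2 = E(a + b).
    """
    enc_joint_sum = (enc_sum1 * encrypt(sum2)) % N2
    enc_joint_n = (enc_n1 * encrypt(n2)) % N2
    return enc_joint_sum, enc_joint_n


def federated_joint_mean(sum1, n1, sum2, n2):
    """First server encrypts, second server combines, first server decrypts."""
    enc_joint_sum, enc_joint_n = second_server_combine(
        encrypt(sum1), encrypt(n1), sum2, n2)
    return decrypt(enc_joint_sum) / decrypt(enc_joint_n)
```

The same exchange would apply to the accumulated error sums when computing the joint mean square error. Real-valued sums would need a fixed-point encoding before encryption.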

In some optional implementations of this embodiment, the mean square error calculation submodule may include: a second generating unit, a second encryption unit, a second sending unit, and a mean square error calculation unit, wherein:

The second generating unit is configured to generate a second homomorphic key pair.

The second encryption unit is configured to encrypt the first accumulated error sum and the first object quantity of the first data set by using the second homomorphic key pair.

The second sending unit is configured to send the second encryption key in the second homomorphic key pair, the encrypted first accumulated error sum, and the encrypted first object quantity to the second server, so as to instruct the second server to perform calculation according to the second encryption key, the encrypted first accumulated error sum, the encrypted first object quantity, the second accumulated error sum, and the second object quantity of the second data set, to obtain the encrypted joint accumulated error sum and the encrypted joint object quantity.

The mean square error calculation unit is configured to calculate the joint mean square error of the first feature according to the encrypted cumulative sum of the joint errors returned by the second server and the number of encrypted joint objects.

In some optional implementations of this embodiment, the distance calculation module 304 may include: a distance calculation submodule, a sum of squares calculation submodule, a cross calculation submodule, and a Euclidean calculation submodule, wherein:

The distance calculation submodule is used to calculate the Euclidean distance between the current first object and each first object.

The sum of squares calculation submodule is used to calculate the sum of squares of the first feature of the current first object.

The cross calculation submodule is configured to, for each second object, use the product algorithm with the second server to calculate the feature cross product sum of the current first object and the second object, and instruct the second server to calculate the second feature sum of squares of the second object.

The Euclidean calculation submodule is configured to calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.

In some optional implementations of this embodiment, the cross calculation submodule may include: a generating unit, a joint encryption unit, an encrypted value sending unit, a receiving unit, and a decrypting unit, wherein:

The generating unit is configured to generate the first random number and to generate the third homomorphic key pair based on the Paillier encryption algorithm.

The joint encryption unit is configured to jointly encrypt each first characteristic value of the current first object and the first random number by using the third encryption key in the third homomorphic key pair to obtain a joint encrypted value.

The encrypted value sending unit is configured to send the joint encrypted value to the second server, and for each second object, instruct the second server to perform calculation according to the joint encrypted value, each second characteristic value of the second object, and the generated second random number to obtain each encrypted feature cross product, and instruct the second server to calculate the second feature square sum of the second object.

The receiving unit is configured to receive the cross product of each encrypted feature and the square sum of the second feature of the second object returned by the second server.

The decryption unit is configured to decrypt each encrypted feature cross product by using the third decryption key in the third homomorphic key pair to obtain the feature cross product sum of the current first object and the second object.

In some optional implementations of this embodiment, the object clustering module 305 may include: an object determination submodule and a reachable point determination submodule, wherein:

The object determination sub-module is configured to determine whether the current first object is a core point according to the obtained Euclidean distance and a preset threshold of the number of neighboring objects.

The reachable point determination submodule is configured to, when the current first object is a core point, determine the density-reachable points in the clustering neighborhood of the current first object and obtain the object clustering result, wherein the density-reachable points include first objects in the first data set to be clustered and second objects in the second data set to be clustered.

To solve the above technical problems, the embodiments of the present application also provide computer equipment. For details, please refer to FIG. 4 , which is a block diagram of a basic structure of a computer device according to this embodiment.

The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that communicate with each other through a system bus. It should be pointed out that the figure only shows the computer device 4 having the components 41-43; it is not required to implement all of the shown components, and more or fewer components may be implemented instead. The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions.

The memory 41 includes at least one type of computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash memory card (Flash Card). Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions of the DBSCAN clustering method based on horizontal federation. In addition, the memory 41 can also be used to temporarily store various types of data that have been output or will be output.

The processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. This processor 42 is typically used to control the overall operation of the computer device 4 . In this embodiment, the processor 42 is configured to execute computer-readable instructions stored in the memory 41 or process data, for example, the computer-readable instructions for executing the horizontal federation-based DBSCAN clustering method.

The network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.

The present application also provides another embodiment, namely a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the horizontal federation-based DBSCAN clustering method described above.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented in hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Obviously, the above-described embodiments are only a part of the embodiments of the present application, rather than all of the embodiments. The accompanying drawings show the preferred embodiments of the present application, but do not limit the scope of the patent of the present application. The present application may be implemented in many different forms. Any equivalent structure made by using the contents of the description and drawings of the present application, which is directly or indirectly used in other related technical fields, is also within the scope of protection of the patent of the present application.

Claims

A DBSCAN clustering method based on horizontal federation, comprising the following steps:

acquiring a first data set, wherein the first data set includes first features of several first objects;

Perform horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through the federated variance selection algorithm to obtain the first data set to be clustered, and instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes the second features of several second objects;

Traversing the first object in the first data set to be clustered;

Calculate the Euclidean distance between the current first object and each first object, and calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;

DBSCAN clustering is performed on the current first object according to the obtained Euclidean distance to obtain an object clustering result.
The DBSCAN clustering method based on horizontal federation according to claim 1, wherein the step of performing horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain the first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second data set to be clustered, wherein the second data set includes the second features of several second objects, comprises:

For each first feature in the first data set, compute a cumulative sum of first feature values for the first feature, and instruct the second server to compute a cumulative sum of second feature values for the second feature corresponding to the first feature;

Calculate the cumulative sum of the first feature values and the cumulative sum of the second feature values through a homomorphic encryption weighted average algorithm with the second server to obtain a joint mean value of the first feature;

calculating a first cumulative sum of errors for the first feature based on the joint mean, and instructing the second server to calculate a second cumulative sum of errors for the second feature based on the joint mean;

Using the homomorphic encryption weighted average algorithm with the second server to calculate the first cumulative sum of errors and the second cumulative sum of errors to obtain the joint mean square error of the first feature;

Screen the first features in the first data set according to the obtained joint mean square error to obtain a first data set to be clustered, and instruct the second server to screen the second features in the second data set according to the obtained joint mean square error to obtain a second data set to be clustered.
The DBSCAN clustering method based on horizontal federation according to claim 2, wherein the step of calculating the cumulative sum of the first feature values and the cumulative sum of the second feature values through a homomorphic encryption weighted average algorithm with the second server to obtain the joint mean value of the first feature comprises:

generating a first homomorphic key pair;

Encrypting the accumulated sum of the first feature values and the first object number of the first data set by using the first homomorphic key pair;

Sending the first encryption key in the first homomorphic key pair, the encrypted accumulated sum of the first feature values, and the encrypted first object number to the second server, to instruct the second server to perform calculation according to the first encryption key, the encrypted accumulated sum of the first feature values, the encrypted first object number, the accumulated sum of the second feature values, and the second object number of the second data set, to obtain an encrypted joint cumulative sum and an encrypted joint object number;

The joint mean value of the first feature is calculated according to the encrypted joint cumulative sum and the encrypted joint object number returned by the second server.
The horizontal federation-based DBSCAN clustering method according to claim 2, wherein the step of calculating the first cumulative sum of errors and the second cumulative sum of errors through the homomorphic encryption weighted average algorithm with the second server to obtain the joint mean square error of the first feature comprises:

generating a second homomorphic key pair;

encrypting the first cumulative sum of errors and the first number of objects of the first data set with the second homomorphic key pair;

Send the second encryption key in the second homomorphic key pair, the encrypted first cumulative sum of errors, and the encrypted first object number to the second server, to instruct the second server to perform calculation according to the second encryption key, the encrypted first cumulative sum of errors, the encrypted first object number, the second cumulative sum of errors, and the second object number of the second data set, to obtain an encrypted joint cumulative sum of errors and an encrypted joint object number;

Calculate the joint mean square error of the first feature according to the encrypted cumulative sum of joint errors and the number of encrypted joint objects returned by the second server.
The DBSCAN clustering method based on horizontal federation according to claim 1, wherein the step of calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm, comprises:

Calculate the Euclidean distance between the current first object and each first object;

calculating the first characteristic square sum of the current first object;

For each second object, use the product algorithm with the second server to calculate the feature cross product sum of the current first object and the second object, and instruct the second server to calculate the second feature square sum of the second object;

Calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.
The DBSCAN clustering method based on horizontal federation according to claim 5, wherein, for each second object, the feature of the current first object and the second object is calculated by a product algorithm with the second server The step of cross-product sum and instructing the second server to calculate the second feature square sum of the second object includes:

generating a first random number, and generating a third homomorphic key pair based on the Paillier encryption algorithm;

using the third encryption key in the third homomorphic key pair, jointly encrypting each first feature value of the current first object with the first random number to obtain a joint encrypted value;

sending the joint encrypted value to the second server and, for each second object, instructing the second server to generate a second random number, to perform a calculation according to the joint encrypted value, each second feature value of the second object, and the generated second random number so as to obtain each encrypted feature cross product, and to calculate the second feature square sum of the second object;

receiving the encrypted feature cross products and the second feature square sum of the second object returned by the second server;

decrypting the encrypted feature cross products using the third decryption key in the third homomorphic key pair to obtain the feature cross-product sum of the current first object and the second object.
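As a non-limiting illustration, the product algorithm above can be sketched with a toy Paillier scheme. The claim does not specify how the two random numbers enter the calculation; this sketch assumes that the first random number is the blinding factor used when encrypting the first feature values and that the second random number re-randomizes the ciphertexts produced by the second server. The tiny primes, the integer-valued features, and the helper names are likewise assumptions for this sketch only. Python 3.8 or later is required for the modular inverse via `pow`.

```python
import math
import random

def random_unit(n, rng):
    """Draw a random number coprime to n, usable as a Paillier blinding factor."""
    r = rng.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(2, n)
    return r

def keygen(p=1009, q=1013):
    """Toy Paillier key pair; the tiny primes are for illustration only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    return n, (lam, pow(lam, -1, n))  # generator is g = n + 1

def encrypt(n, m, r):
    n2 = n * n
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

rng = random.Random(0)
n, priv = keygen()
n2 = n * n

# First server: encrypt each first feature value with the first random number.
first_features = [1, 2, 3]  # assumed integer-encoded features
r1 = random_unit(n, rng)
joint_encrypted = [encrypt(n, a, r1) for a in first_features]

# Second server: raise each ciphertext to its own feature value, which
# multiplies the hidden plaintext, then blind with the second random number.
second_features = [4, 6, 3]
r2 = random_unit(n, rng)
encrypted_cross = [
    pow(c, b, n2) * pow(r2, n, n2) % n2
    for c, b in zip(joint_encrypted, second_features)
]
second_sq_sum = sum(b * b for b in second_features)

# First server: decrypt each cross product and accumulate the sum.
cross_sum = sum(decrypt(n, priv, c) for c in encrypted_cross)
print(cross_sum, second_sq_sum)
```

Raising a Paillier ciphertext to a plaintext power multiplies the underlying value, which is what allows the second server to form each product without seeing the first feature values.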
The DBSCAN clustering method based on horizontal federation according to claim 1, wherein the step of performing DBSCAN clustering on the current first object according to the obtained Euclidean distance, and obtaining an object clustering result comprises:

Determine whether the current first object is a core point according to the obtained Euclidean distance and a preset threshold of the number of neighboring objects;

when the current first object is a core point, determining the density-reachable points in the clustering neighborhood of the current first object to obtain an object clustering result, wherein the density-reachable points include first objects in the first to-be-clustered data set and second objects in the second to-be-clustered data set.
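As a non-limiting illustration, the core point test above can be sketched as a neighborhood count over the combined local and federated distances. The names `is_core_point`, `eps`, and `min_pts`, and the convention that the object itself (distance 0) counts toward the threshold, are assumptions for this sketch; the application refers only to a clustering neighborhood and a preset threshold of the number of neighboring objects.

```python
def is_core_point(distances, eps, min_pts):
    """Return (is_core, neighbor_indices) for one object.

    distances: distances from the current object to every candidate object,
               local and remote alike, including itself at distance 0.
    eps:       assumed radius of the clustering neighborhood.
    min_pts:   assumed preset threshold on the number of neighboring objects.
    """
    neighbors = [i for i, d in enumerate(distances) if d <= eps]
    return len(neighbors) >= min_pts, neighbors

# Hypothetical distances: indices 0-2 are first objects, 3-4 are second objects.
distances = [0.0, 0.5, 1.2, 0.3, 2.0]
core, neighbors = is_core_point(distances, eps=0.6, min_pts=3)
print(core, neighbors)
```

Because the neighbor list mixes indices from both data sets, a cluster grown from a core point can contain first objects and second objects alike.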
A DBSCAN clustering device based on horizontal federation, comprising:

a data set obtaining module, configured to obtain a first data set, wherein the first data set includes the first features of several first objects;

a feature screening module, configured to perform horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first to-be-clustered data set, and to instruct the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second to-be-clustered data set, wherein the second data set includes second features of several second objects;

an object traversal module, configured to traverse the first object in the first to-be-clustered data set;

a distance calculation module, configured to calculate the Euclidean distance between the current first object and each first object, and to calculate the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;

an object clustering module, configured to perform DBSCAN clustering on the current first object according to the obtained Euclidean distances to obtain an object clustering result.
A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor implements the following steps when executing the computer-readable instructions:

acquiring a first data set, wherein the first data set includes first features of several first objects;

performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second data set to be clustered, wherein the second data set includes second features of several second objects;

Traversing the first object in the first data set to be clustered;

calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;

DBSCAN clustering is performed on the current first object according to the obtained Euclidean distance to obtain an object clustering result.
The computer device according to claim 9, wherein the step of performing horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain the first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second data set to be clustered, comprises:

for each first feature in the first data set, calculating a cumulative sum of first feature values of the first feature, and instructing the second server to calculate a cumulative sum of second feature values of the second feature corresponding to the first feature;

calculating, together with the second server and through a homomorphic encryption weighted average algorithm, a joint mean of the first feature from the cumulative sum of first feature values and the cumulative sum of second feature values;

calculating a first cumulative sum of errors for the first feature based on the joint mean, and instructing the second server to calculate a second cumulative sum of errors for the second feature based on the joint mean;

Using the homomorphic encryption weighted average algorithm with the second server to calculate the first cumulative sum of errors and the second cumulative sum of errors to obtain the joint mean square error of the first feature;

screening the first features in the first data set according to the obtained joint mean square errors to obtain a first data set to be clustered, and instructing the second server to screen the second features in the second data set to obtain a second data set to be clustered.
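As a non-limiting illustration, the arithmetic of the federated variance selection steps above can be sketched in plaintext, with the homomorphic encryption layer deliberately omitted. The function name `select_features`, the sample data, and the screening rule (keep a feature when its joint mean square error exceeds a `threshold`) are assumptions for this sketch; the application states only that features are screened according to the joint mean square error.

```python
def select_features(data_a, data_b, threshold):
    """Return indices of features whose joint mean square error exceeds threshold.

    data_a, data_b: row-major feature tables held by the first and second
    server; both are assumed to share the same feature columns.
    """
    kept = []
    total = len(data_a) + len(data_b)
    for j in range(len(data_a[0])):
        col_a = [row[j] for row in data_a]
        col_b = [row[j] for row in data_b]
        # Each party contributes only its cumulative sum to the joint mean.
        joint_mean = (sum(col_a) + sum(col_b)) / total
        # Each party then accumulates squared errors against the joint mean.
        err_a = sum((x - joint_mean) ** 2 for x in col_a)
        err_b = sum((x - joint_mean) ** 2 for x in col_b)
        joint_mse = (err_a + err_b) / total
        if joint_mse > threshold:
            kept.append(j)
    return kept

# Hypothetical data: feature 0 varies across parties, feature 1 is constant.
data_a = [[1.0, 5.0], [3.0, 5.0]]
data_b = [[5.0, 5.0], [7.0, 5.0]]
kept = select_features(data_a, data_b, threshold=0.1)
print(kept)
```

Only sums and counts cross the party boundary in this scheme, which is what makes the two exchanges suitable for an additively homomorphic encryption layer.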
The computer device according to claim 10, wherein the step of calculating, together with the second server and through the homomorphic encryption weighted average algorithm, the joint mean of the first feature from the cumulative sum of first feature values and the cumulative sum of second feature values comprises:

generating a first homomorphic key pair;

encrypting the cumulative sum of first feature values and the number of first objects in the first data set using the first homomorphic key pair;

sending the first encryption key in the first homomorphic key pair, the encrypted cumulative sum of first feature values, and the encrypted number of first objects to the second server, to instruct the second server to perform a calculation according to the first encryption key, the encrypted cumulative sum of first feature values, the encrypted number of first objects, the cumulative sum of second feature values, and the number of second objects in the second data set, so as to obtain an encrypted joint cumulative sum and an encrypted number of joint objects;

calculating the joint mean of the first feature according to the encrypted joint cumulative sum and the encrypted number of joint objects returned by the second server.
The computer device according to claim 10, wherein the step of calculating, together with the second server and through the homomorphic encryption weighted average algorithm, the joint mean square error of the first feature from the first cumulative sum of errors and the second cumulative sum of errors comprises:

generating a second homomorphic key pair;

encrypting the first cumulative sum of errors and the first number of objects of the first data set with the second homomorphic key pair;

sending the second encryption key in the second homomorphic key pair, the encrypted first cumulative sum of errors, and the encrypted number of first objects to the second server, to instruct the second server to perform a calculation according to the second encryption key, the encrypted first cumulative sum of errors, the encrypted number of first objects, the second cumulative sum of errors, and the number of second objects in the second data set, so as to obtain an encrypted joint cumulative sum of errors and an encrypted number of joint objects;

calculating the joint mean square error of the first feature according to the encrypted joint cumulative sum of errors and the encrypted number of joint objects returned by the second server.
The computer device according to claim 9, wherein the step of calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm, comprises:

Calculate the Euclidean distance between the current first object and each first object;

calculating the first feature square sum of the current first object;

for each second object, calculating, with the second server and through a product algorithm, the feature cross-product sum of the current first object and the second object, and instructing the second server to calculate the second feature square sum of the second object;

Calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.
The computer device according to claim 13, wherein the step of, for each second object, calculating the feature cross-product sum of the current first object and the second object with the second server through a product algorithm, and instructing the second server to calculate the second feature square sum of the second object, comprises:

generating a first random number, and generating a third homomorphic key pair based on the Paillier encryption algorithm;

using the third encryption key in the third homomorphic key pair, jointly encrypting each first feature value of the current first object with the first random number to obtain a joint encrypted value;

sending the joint encrypted value to the second server and, for each second object, instructing the second server to generate a second random number, to perform a calculation according to the joint encrypted value, each second feature value of the second object, and the generated second random number so as to obtain each encrypted feature cross product, and to calculate the second feature square sum of the second object;

receiving the encrypted feature cross products and the second feature square sum of the second object returned by the second server;

decrypting the encrypted feature cross products using the third decryption key in the third homomorphic key pair to obtain the feature cross-product sum of the current first object and the second object.
A computer-readable storage medium on which computer-readable instructions are stored; wherein the computer-readable instructions are executed by a processor to achieve the following steps:

acquiring a first data set, wherein the first data set includes first features of several first objects;

performing horizontal federated learning with a second data set of a second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain a first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain a second data set to be clustered, wherein the second data set includes second features of several second objects;

Traversing the first object in the first data set to be clustered;

calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm;

DBSCAN clustering is performed on the current first object according to the obtained Euclidean distance to obtain an object clustering result.
The computer-readable storage medium according to claim 15, wherein the step of performing horizontal federated learning with the second data set of the second server, so as to perform feature screening on the first data set through a federated variance selection algorithm to obtain the first data set to be clustered, and instructing the second server to perform feature screening on the second data set through the federated variance selection algorithm to obtain the second data set to be clustered, comprises:

for each first feature in the first data set, calculating a cumulative sum of first feature values of the first feature, and instructing the second server to calculate a cumulative sum of second feature values of the second feature corresponding to the first feature;

calculating, together with the second server and through a homomorphic encryption weighted average algorithm, a joint mean of the first feature from the cumulative sum of first feature values and the cumulative sum of second feature values;

calculating a first cumulative sum of errors for the first feature based on the joint mean, and instructing the second server to calculate a second cumulative sum of errors for the second feature based on the joint mean;

Using the homomorphic encryption weighted average algorithm with the second server to calculate the first cumulative sum of errors and the second cumulative sum of errors to obtain the joint mean square error of the first feature;

screening the first features in the first data set according to the obtained joint mean square errors to obtain a first data set to be clustered, and instructing the second server to screen the second features in the second data set to obtain a second data set to be clustered.
The computer-readable storage medium according to claim 16, wherein the step of calculating, together with the second server and through the homomorphic encryption weighted average algorithm, the joint mean of the first feature from the cumulative sum of first feature values and the cumulative sum of second feature values comprises:

generating a first homomorphic key pair;

encrypting the cumulative sum of first feature values and the number of first objects in the first data set using the first homomorphic key pair;

sending the first encryption key in the first homomorphic key pair, the encrypted cumulative sum of first feature values, and the encrypted number of first objects to the second server, to instruct the second server to perform a calculation according to the first encryption key, the encrypted cumulative sum of first feature values, the encrypted number of first objects, the cumulative sum of second feature values, and the number of second objects in the second data set, so as to obtain an encrypted joint cumulative sum and an encrypted number of joint objects;

calculating the joint mean of the first feature according to the encrypted joint cumulative sum and the encrypted number of joint objects returned by the second server.
The computer-readable storage medium according to claim 16, wherein the step of calculating, together with the second server and through the homomorphic encryption weighted average algorithm, the joint mean square error of the first feature from the first cumulative sum of errors and the second cumulative sum of errors comprises:

generating a second homomorphic key pair;

encrypting the first cumulative sum of errors and the first number of objects of the first data set with the second homomorphic key pair;

sending the second encryption key in the second homomorphic key pair, the encrypted first cumulative sum of errors, and the encrypted number of first objects to the second server, to instruct the second server to perform a calculation according to the second encryption key, the encrypted first cumulative sum of errors, the encrypted number of first objects, the second cumulative sum of errors, and the number of second objects in the second data set, so as to obtain an encrypted joint cumulative sum of errors and an encrypted number of joint objects;

calculating the joint mean square error of the first feature according to the encrypted joint cumulative sum of errors and the encrypted number of joint objects returned by the second server.
The computer-readable storage medium according to claim 15, wherein the step of calculating the Euclidean distance between the current first object and each first object, and calculating the Euclidean distance between the current first object and each second object through a federated Euclidean distance algorithm, comprises:

Calculate the Euclidean distance between the current first object and each first object;

calculating the first feature square sum of the current first object;

for each second object, calculating, with the second server and through a product algorithm, the feature cross-product sum of the current first object and the second object, and instructing the second server to calculate the second feature square sum of the second object;

Calculate the Euclidean distance between the current first object and the second object according to the first feature square sum, the feature cross product sum, and the second feature square sum returned by the second server.
The computer-readable storage medium according to claim 19, wherein the step of, for each second object, calculating the feature cross-product sum of the current first object and the second object with the second server through a product algorithm, and instructing the second server to calculate the second feature square sum of the second object, comprises:

generating a first random number, and generating a third homomorphic key pair based on the Paillier encryption algorithm;

using the third encryption key in the third homomorphic key pair, jointly encrypting each first feature value of the current first object with the first random number to obtain a joint encrypted value;

sending the joint encrypted value to the second server and, for each second object, instructing the second server to generate a second random number, to perform a calculation according to the joint encrypted value, each second feature value of the second object, and the generated second random number so as to obtain each encrypted feature cross product, and to calculate the second feature square sum of the second object;

receiving the encrypted feature cross products and the second feature square sum of the second object returned by the second server;

decrypting the encrypted feature cross products using the third decryption key in the third homomorphic key pair to obtain the feature cross-product sum of the current first object and the second object.