CN112508203B - Data clustering processing method, device, equipment and medium based on federal learning - Google Patents

Data clustering processing method, device, equipment and medium based on federal learning Download PDF

Info

Publication number
CN112508203B
CN112508203B (application CN202110170456.2A)
Authority
CN
China
Prior art keywords
clustering
local
sample
center
collaborative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110170456.2A
Other languages
Chinese (zh)
Other versions
CN112508203A (en)
Inventor
汪宏
陈玲慧
岑园园
李宏宇
李晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongdun Technology Co ltd
Tongdun Holdings Co Ltd
Original Assignee
Tongdun Technology Co ltd
Tongdun Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongdun Technology Co Ltd and Tongdun Holdings Co Ltd
Priority to CN202110170456.2A
Publication of CN112508203A
Application granted
Publication of CN112508203B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a federated-learning-based data clustering method, apparatus, device, and medium. The method is executed by one of multiple participants, where each local cross-feature sample held by a participant contains part of the features of a corresponding clustering sample to be clustered. The method comprises the following steps: initializing each local cluster center with an initial value, and computing the distance information between each local cross-feature sample and each local cluster center as local clustering information; secret-sharing the local clustering information with the other participants, and receiving the local clustering information secret-shared by the other participants; computing reference-distance information fragments between each clustering sample and each collaborative cluster center, performing secure multi-party computation with the other participants to determine the nearest collaborative cluster center of each clustering sample, and updating the value of each local cluster center; and returning to the operation of computing the local clustering information of each clustering sample based on the updated local cluster centers until a clustering end condition is met.

Description

Data clustering processing method, device, equipment and medium based on federal learning
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data clustering processing method and device based on federal learning, computer equipment and a storage medium.
Background
Federated learning, as a distributed machine-learning paradigm, can effectively solve the data-island problem: participants jointly build models without sharing data, enabling AI (Artificial Intelligence) collaboration. In cross-feature federated learning, the features of a sample are distributed among different participants, each of which owns only a portion of the features. The goal of cross-feature federated clustering is for every participant to obtain the cluster centers corresponding to its own features without exposing any party's input.
At present, among the schemes supporting cross-feature federated clustering, some use a homomorphic encryption algorithm; this kind of federated clustering is inefficient because of the large amount of computation involved. Others use secure multi-party computation; this kind of federated clustering suffers from low data-processing efficiency, exposes intermediate results of comparing samples with cluster centers, and requires at least 4 parties to participate in the clustering process simultaneously. Such a demand on the number of parties is too high, so reliability and practicality are low.
Disclosure of Invention
The embodiment of the invention provides a data clustering processing method and device based on federal learning, computer equipment and a storage medium, so as to improve the processing efficiency, reliability and practicability of the federal data clustering processing.
In a first aspect, an embodiment of the present invention provides a data clustering processing method based on federal learning, which is performed by one of multiple participants, where each local cross-feature sample held by each participant is a partial feature of a corresponding clustering sample to be clustered; the method comprises the following steps:
initializing each local clustering center, giving an initial value, and calculating the distance information between each local cross-feature sample and each local clustering center as local clustering information;
carrying out secret sharing on the local clustering information and other participants, and receiving the local clustering information from the secret sharing of the other participants;
calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the secret sharing result;
performing safe multi-party calculation with other participants by using the reference distance information fragments, determining the nearest collaborative clustering center of each clustering sample, and updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample;
and returning to execute the operation of calculating the local clustering information of each clustering sample based on the updated local clustering center until the clustering end condition is met.
In a second aspect, an embodiment of the present invention further provides a federated-learning-based data clustering apparatus, configured in one of a plurality of participant devices, where each local cross-feature sample held by each participant device is a partial feature of a corresponding clustering sample to be clustered; the apparatus comprises the following modules:
the local clustering information calculation module is used for initializing each local clustering center to give an initial value and calculating the distance information between each local cross-feature sample and each local clustering center as local clustering information;
the local clustering information sharing module is used for carrying out secret sharing on the local clustering information and other participants and receiving the local clustering information from the secret sharing of the other participants;
the reference distance information fragment calculation module is used for calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the secret sharing result;
the local clustering center updating module is used for performing safe multi-party calculation with other participants by using the reference distance information fragments, determining the nearest collaborative clustering center of each clustering sample, and updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample;
and the circular execution module is used for returning and executing the operation of calculating the local clustering information of each clustering sample based on the updated local clustering center until the clustering end condition is met.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the federated-learning-based data clustering method provided in any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the federated learning-based data clustering method provided in any embodiment of the present invention.
In the embodiment of the invention, each of the multiple participants computes local clustering information for each clustering sample, namely the distance information between each local cross-feature sample and each local cluster center. After obtaining the local clustering information, each participant secret-shares it with the other participants, computes reference-distance information fragments between each clustering sample and each collaborative cluster center based on the secret-sharing result, then uses the fragments to perform secure multi-party computation with the other participants, determines the nearest collaborative cluster center of each clustering sample, and updates each local cluster center accordingly. Each participant can execute this data clustering operation in a loop based on the updated local cluster centers until a clustering end condition is met, so that the clustering samples are clustered according to the different local cross-feature samples. Throughout the clustering process, the security of the reference-distance information is guaranteed without encrypting it with a homomorphic encryption algorithm, and there is no excessive requirement on the number of participants. This solves the low processing efficiency, reliability, and practicality of the existing federated clustering methods, and improves the processing efficiency, reliability, and practicability of federated data clustering.
Drawings
Fig. 1 is a flowchart of a data clustering processing method based on federal learning according to an embodiment of the present invention;
fig. 2 is a flowchart of a data clustering processing method based on federal learning according to a second embodiment of the present invention;
fig. 3 is a schematic diagram of a data clustering processing apparatus based on federal learning according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a federated-learning-based data clustering method according to an embodiment of the present invention, applicable to performing efficient and reliable federated data clustering according to each participant's local clustering information. The method can be executed by a federated-learning-based data clustering apparatus, which can be implemented in software and/or hardware, can generally be integrated into computer equipment, and is used in cooperation with the other participants' equipment. As shown in fig. 1, the method comprises the following operations:
and S110, initializing each local clustering center, and calculating the distance information between each local cross-feature sample and each local clustering center as local clustering information.
A clustering sample may be a complete sample formed by concatenating the local cross-feature samples held locally by the participants. The participants collaborate in federated data clustering; in the embodiment of the present invention there may be 2 or more participants, and the specific number is not limited. A local cross-feature sample may be sample data held locally by a participant, the sample data having cross-feature attributes. A local cluster center may be a local cluster center point generated by a participant clustering its local cross-feature samples. The local clustering information may be the squared distances between the local cross-feature samples and each local cluster center, computed by each participant. During initialization, each participant sets an initial value for each local cluster center; the initial values may be a set number of randomly selected cluster center points.
The technical solution of this embodiment is mainly applicable to the following application scenario: multiple participants perform cross-feature data clustering on the local cross-feature samples they hold, in a federated-learning mode. In the embodiment of the invention, the number of participants may be at least two, and the participants hold different local cross-feature samples. Suppose R participants jointly perform clustering in a federated-learning mode, each holding N local cross-feature samples, and the local cross-feature samples held by the participants are aligned; the alignment process screens out the common samples from the original local samples held by each participant. Each of the R participants may then compute local clustering information from its locally held local cross-feature samples. Specifically, each participant computes, as local clustering information, the squared distance between each local cross-feature sample and each local cluster center (the centers initially carrying the set initial values). After the local cluster centers are updated, each participant recomputes the squared distances between each local cross-feature sample and each updated local cluster center as new local clustering information. Because a local cross-feature sample contains only part of the features of the corresponding clustering sample, the clustering information of the clustering sample is obtained by combining the local clustering information computed by all participants.
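The squared-distance computation described above can be sketched as follows. This is a minimal illustration with hypothetical samples and centers, not the patent's prescribed implementation:

```python
import numpy as np

def local_clustering_info(samples, centers):
    """Squared Euclidean distance between each local cross-feature
    sample and each local cluster center (N samples x K centers)."""
    # (N, 1, d) - (1, K, d) -> (N, K, d); sum over the feature axis
    diff = samples[:, None, :] - centers[None, :, :]
    return (diff ** 2).sum(axis=2)

# N = 4 local cross-feature samples with d = 2 local features,
# K = 2 local cluster centers (all values hypothetical)
samples = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
info = local_clustering_info(samples, centers)
print(info)  # row i holds the squared distances from sample i to each center
```

Each row of `info` is one sample's local clustering information; after every center update this table is simply recomputed with the new centers.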
It should be noted that the local cross-feature samples held by each participant include part of the features of the clustering samples. For example, assume R takes the value 3 and the 3 clustering samples are A, B, and C. Then the 3 local cross-feature samples held by party 1 may be a1, b1 and c1, those held by party 2 may be a2, b2 and c2, and those held by party 3 may be a3, b3 and c3, where a1, a2 and a3 combine into A, b1, b2 and b3 combine into B, and c1, c2 and c3 combine into C. That is, the embodiment of the present invention can realize a cross-feature learning scenario within a federated-learning scenario.
It should also be noted that the original local samples held locally by different participants may differ. Therefore, before clustering with the participants' local cross-feature samples, the original local samples locally held by each participant must first be aligned to obtain each participant's local cross-feature samples. In a specific example, assume the original local samples of party 1 are sample A, sample B, sample C, and sample D, and those of party 2 are sample B, sample C, sample D, and sample E. After alignment, the local cross-feature samples held by both party 1 and party 2 are sample B, sample C, and sample D, though the two parties' samples differ in their feature attributes. For example, the local cross-feature samples held by party 1 are user samples, specifically the age and income information of user B, user C, and user D, while those held by party 2 are also user samples, specifically the height and expenditure information of the same users. The local cross-feature samples of parties 1 and 2 thus form the clustering samples, i.e., users B, C, and D with feature data including age, income, height, expenditure, etc.
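The alignment step amounts to keeping only the sample IDs common to every participant. In practice this screening is typically done with a privacy-preserving protocol such as private set intersection; the plain intersection below is only a functional sketch with hypothetical IDs:

```python
def align_samples(id_lists):
    """Keep only the sample IDs common to every participant, in sorted
    order, so the parties' local cross-feature samples line up row by row."""
    common = set(id_lists[0])
    for ids in id_lists[1:]:
        common &= set(ids)
    return sorted(common)

party1_ids = ["A", "B", "C", "D"]  # hypothetical original local samples of party 1
party2_ids = ["B", "C", "D", "E"]  # hypothetical original local samples of party 2
aligned = align_samples([party1_ids, party2_ids])
print(aligned)  # ['B', 'C', 'D']
```

After alignment, row i of every party's local feature matrix refers to the same underlying clustering sample.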
In the embodiment of the present invention, each participant may first calculate the local clustering information corresponding to the clustering samples according to the local cluster centers carrying the set initial values, the local cross-feature samples, and other information. It is understood that the initial values of the local cluster centers may be set according to the type of clustering algorithm. For example, when the K-means clustering algorithm is adopted, K cluster center points may be randomly generated from the local cross-feature samples as local cluster centers. The clustering algorithm may also be the k-modes or k-prototypes clustering algorithm; the embodiment of the present invention does not limit the specific type of the clustering algorithm.
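The K-means-style initialization mentioned here, i.e. randomly picking K locally held samples as the initial local cluster centers, can be sketched as follows (the seed and data are illustrative assumptions):

```python
import numpy as np

def init_local_centers(local_samples, k, seed=0):
    """Randomly pick k distinct local cross-feature samples as the
    initial local cluster centers (seed fixed only for reproducibility)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(local_samples), size=k, replace=False)
    return local_samples[idx].copy()

# 4 local cross-feature samples with 2 local features each (hypothetical data)
local_samples = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
initial_centers = init_local_centers(local_samples, k=2)
```

Each party runs this independently on its own feature slice; concatenating the parties' initial centers feature-wise yields the initial collaborative cluster centers.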
In an optional embodiment of the present invention, the cluster sample is a group data sample of a set area.
The set area may be a geographic area to which the clustering applies; for example, a certain city or a certain urban district. The embodiment of the present invention does not limit the extent of the set area. A group data sample may be a data sample of a certain type of group in the set area; the group may be, for example, a user group, a group of objects of the same type (such as animals, buildings, or other physically existing objects), or a group of virtual objects of the same type (such as image resources). The embodiment of the present invention does not limit the sample type or group type of the group data samples.
The embodiment of the invention can realize a cross-feature learning scenario within a federated-learning scenario: federated learning can be performed on the local samples held by different participants in a certain set area, and the local samples held by the participants can be combined into group data samples for clustering. In a specific application scenario, a supermarket and a bank in the same area may act as participants, and the users reached by the different participants are residents of that area; that is, the local samples of the participants are the same, but because the supermarket and the bank provide different services to the residents, the types of user data held by the different participants differ, i.e., the local samples have different features. The supermarket and the bank can therefore perform federated learning on the user data they hold, realizing clustering of the user samples (group data samples) in that area.
And S120, carrying out secret sharing on the local clustering information and other participants, and receiving the local clustering information from the secret sharing of the other participants.
Correspondingly, after a participant obtains its local clustering information, in order to improve the security and confidentiality of the local cluster centers, the other participants do not directly obtain the local clustering information; instead, the local clustering information held by each participant is secret-shared with the other participants. That is, a participant may share parts of its local clustering information with the other participants, and may also receive the local clustering information secret-shared by the other participants.
Secret sharing is a technique for sharing secret data. The idea is to split the secret data in a suitable way, with each split share managed by a different participant; a single participant cannot recover the secret information, and only when multiple participants cooperate can the secret data be recovered. Moreover, the secret data can still be fully recovered even if some participants within the tolerated range fail. Secret sharing is formally defined as S(s, t, n) → {<s0>, <s1>, ..., <sn>}, where s denotes the secret data to be split, t the recovery threshold, <si> the individual shares, and n the number of shares. There exists a recovery function R such that for any m ≥ t, R(<s0>, <s1>, ..., <sm>) → s, where m denotes the number of participants cooperating to recover the secret data.
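As a concrete instance of this definition, additive secret sharing, the n-of-n special case (threshold t equal to n) commonly used in secure multi-party computation, can be sketched as follows. The modulus is an illustrative choice, not taken from the patent:

```python
import random

P = 2**61 - 1  # a large prime modulus (illustrative choice)

def share(secret, n):
    """Split `secret` into n additive shares mod P; all n shares
    are needed to recover the secret (an n-of-n scheme, t = n)."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def recover(shares):
    """Combining all shares recovers the secret; any proper subset
    is uniformly distributed and reveals nothing about it."""
    return sum(shares) % P

secret = 123456
pieces = share(secret, 3)
assert recover(pieces) == secret
```

In the clustering protocol, each entry of a party's local clustering information would be split this way before being distributed to the other parties.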
And S130, calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the secret sharing result.
The secret-sharing result is what each participant holds after the secret sharing of the locally held local clustering information is completed, i.e., the shared content of the local clustering information. The collaborative cluster center is the complete cluster center formed by concatenating the local cluster centers, namely the cluster center corresponding to a clustering sample. The reference-distance information may be the reference distance between a clustering sample and a collaborative cluster center determined by each participant from the data secret-shared by the other participants. The reference distance is not the actual distance between the clustering sample and the collaborative cluster center, but a distance value that each participant computes by combining the secret-shared data of the local clustering information. The reference-distance information fragments may be the pieces generated by secret-sharing the participants' reference-distance information; combining these fragments yields the complete reference-distance information.
Specifically, the participants may generate a plurality of pieces of secret sharing fragmentation information for the local clustering information in a secret sharing manner, and send the pieces of secret sharing fragmentation information to other participants respectively, and each participant may hold one piece of secret sharing fragmentation information. Correspondingly, each participant can determine reference distance information between each cluster sample and each collaborative cluster center according to the secret sharing fragment information corresponding to each cluster sample as a secret sharing result.
Still further illustrated by the above example, participant 1 may generate a certain number of cluster centers as the local cluster centers of participant 1 according to the age and income information of user B, user C, and user D, etc. local cross-feature samples. Similarly, the participating party 2 may generate a certain number of clustering center points as the local clustering centers of the participating party 2 according to the height and expenditure information of the user B, the user C, and the user D. And the local clustering center of the participant 1 and the local clustering center of the participant 2 are spliced to obtain the collaborative clustering center corresponding to the clustering sample. That is, the local clustering center generated by each participant can be used as the fragment of the collaborative clustering center of the corresponding clustering sample, and the fragments of all collaborative clustering centers corresponding to the same clustering sample are spliced to obtain the complete collaborative clustering center.
For example, assuming that there are 3 participants, participant 1 may divide locally generated local clustering information into 3 pieces of secret sharing fragmentation information, and share 2 pieces of secret sharing fragmentation information among the 3 pieces of secret sharing fragmentation information to participant 2 and participant 3, where participant 2 and participant 3 respectively obtain 1 piece of secret sharing fragmentation information. Similarly, party 2 and party 3 may share locally generated local clustering information in the same manner and in a secret manner among the parties. Correspondingly, taking the participant 1 as an example, the participant 1 may obtain 1 piece of secret sharing fragment information of the local clustering information, 1 piece of secret sharing fragment information shared by the participant 2 according to the local clustering information, and 1 piece of secret sharing fragment information shared by the participant 3 according to the local clustering information. The participant 1 can determine reference distance information between each cluster sample and each collaborative cluster center according to the held secret sharing fragmentation information.
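Continuing the 3-party example, the arithmetic behind the fragments can be sketched as follows. Because the features of a clustering sample are partitioned across the parties, the reference squared distance to a collaborative center is the sum of the parties' local squared distances, so each party can sum the additive shares it holds to obtain one fragment of it (all values hypothetical):

```python
import random

P = 2**61 - 1  # illustrative prime modulus for additive sharing

def share(x, n):
    """Split x into n additive shares mod P."""
    sh = [random.randrange(P) for _ in range(n - 1)]
    sh.append((x - sum(sh)) % P)
    return sh

# Local squared distances from one clustering sample to one collaborative
# center, computed privately by 3 parties (hypothetical values):
local_info = [9, 16, 25]  # true reference distance = 50

# Each party i secret-shares its value; party j keeps shares[i][j].
shares = [share(v, 3) for v in local_info]

# Each party sums the shares it holds -> one reference-distance fragment.
fragments = [sum(shares[i][j] for i in range(3)) % P for j in range(3)]

# Only by combining all fragments does the reference distance appear.
assert sum(fragments) % P == 50
```

No single fragment reveals the true distance, which is why the subsequent nearest-center comparison must be done via secure multi-party computation rather than locally.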
In this way, each participant obtains only fragments of the reference-distance information from each clustering sample to each collaborative cluster center, based on the secret-sharing result of the local clustering information, without learning the true values of the local clustering information. This realizes secrecy protection of the local clustering information and improves its security and reliability.
S140, performing safe multi-party calculation with other participants by using the reference distance information fragments, determining the nearest collaborative clustering center of each clustering sample, and updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample.
Wherein, the nearest collaborative cluster center may be the collaborative cluster center nearest to the cluster sample. It can be understood that one cluster sample may correspond to one nearest collaborative cluster center, and the nearest collaborative cluster centers corresponding to the cluster samples may be the same or different. In general, the nearest collaborative cluster centers corresponding to cluster samples belonging to a cluster are the same.
Correspondingly, after each participant acquires the reference-distance information fragments between each clustering sample and each collaborative cluster center, the participants with clustering demands can use the fragments to perform secure multi-party computation with the other participants, so as to determine the nearest collaborative cluster center of each clustering sample. It should be noted that a participant without a clustering demand may merely assist the other participants in the clustering process, e.g., by calculating local clustering information and secret-sharing it, without performing the operation of obtaining the nearest collaborative cluster centers; that is, it may take part in the intermediate steps of clustering without obtaining the final clustering result. After a participant obtains the nearest collaborative cluster center of each clustering sample, it can, similarly to existing clustering-algorithm principles, update each local cluster center according to the nearest collaborative cluster centers.
In one specific example, assume that the local clustering centers of participant 1 are local clustering center 1 and local clustering center 2, and the clustering samples are sample 1, sample 2, sample 3, and sample 4. According to the acquired reference distance information fragments, participant 1 determines that the nearest collaborative clustering center of samples 1 and 2 is collaborative clustering center 1, and that of samples 3 and 4 is collaborative clustering center 2. Participant 1 then calculates the sample mean of local cross-feature samples 1 and 2 (the samples belonging to local clustering center 1) to obtain local clustering center 3, and the sample mean of local cross-feature samples 3 and 4 (the samples belonging to local clustering center 2) to obtain local clustering center 4, thereby updating local clustering center 1 to local clustering center 3 and local clustering center 2 to local clustering center 4.
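The center update in this example follows the plain K-means rule. A minimal NumPy sketch of that step (function and variable names are illustrative, not from the patent; in the protocol each participant applies it only to its own local feature slices):

```python
import numpy as np

def update_local_centers(samples, assignments, n_clusters):
    """Replace each local clustering center with the mean of the local
    cross-feature samples whose nearest collaborative center is that cluster."""
    return np.array([samples[assignments == k].mean(axis=0)
                     for k in range(n_clusters)])

# Participant 1's four local cross-feature samples (1-D slices for brevity)
samples = np.array([[1.0], [2.0], [10.0], [12.0]])
# Samples 1-2 -> nearest collaborative center 0, samples 3-4 -> center 1
assignments = np.array([0, 0, 1, 1])
new_centers = update_local_centers(samples, assignments, 2)
# new_centers holds the two updated local centers ("center 3" and "center 4")
```
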
S150, judging whether a clustering end condition is met, if so, executing S160; otherwise, returning to execute S110.
And S160, terminating the federal data clustering processing flow.
The clustering end condition may be a condition for determining whether to terminate the federated data clustering process of each participant. Optionally, the clustering end condition may be that the local clustering center tends to be stable and no longer changes, or reaches a set number of loop iterations, and the like.
In clustering algorithms, the participants performing the clustering processing usually need to iterate in a loop until a clustering end condition is satisfied. Correspondingly, after each update of the local clustering centers, each participant can judge whether the current clustering result satisfies the clustering end condition. If it does, the federated data clustering processing flow can be terminated; otherwise, the flow returns to the operation of calculating the local clustering information of each clustering sample, until the clustering end condition is satisfied. In this way, the participants can cluster all clustering samples according to their different local cross-feature samples.
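The loop structure can be sketched as follows, with one full round (S110–S140) abstracted into a step function; the names, the stability tolerance, and the iteration cap are illustrative assumptions standing in for the two end conditions named above:

```python
import numpy as np

def cluster_until_stable(step_fn, centers, max_iter=100, tol=1e-6):
    """Repeat one clustering round until the local centers stop changing
    (stability) or the set number of loop iterations is reached."""
    for _ in range(max_iter):
        new_centers = step_fn(centers)
        if np.max(np.abs(new_centers - centers)) < tol:
            return new_centers  # end condition: centers are stable
        centers = new_centers
    return centers              # end condition: iteration cap reached

# A toy "round" that pulls the centers halfway toward a fixed point
target = np.array([1.0, -2.0])
final = cluster_until_stable(lambda c: (c + target) / 2, np.zeros(2))
```
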
Therefore, the data clustering processing method based on federated learning provided by this embodiment of the invention can support cross-feature federated clustering using only secure multi-party computation, without a homomorphic encryption algorithm, and places no strict limitation on the number of participants (as few as 2 participants can complete the whole process), thereby improving the processing efficiency and practicability of federated data clustering. Meanwhile, because the shared data is protected by secret sharing, each participant can only learn which collaborative clustering center is closest to each clustering sample during the clustering process, and cannot actually obtain the local clustering information computed by the other participants; the local clustering information of each participant is therefore protected, improving the reliability of federated data clustering.
In this embodiment of the invention, each of a plurality of participants calculates the local clustering information of each clustering sample, namely the distance information between each local cross-feature sample and each local clustering center. After obtaining the local clustering information, each participant secret-shares it with the other participants, calculates the reference distance information fragments between each clustering sample and each collaborative clustering center based on the secret sharing result, then performs secure multi-party computation with the other participants using the reference distance information fragments to determine the nearest collaborative clustering center of each clustering sample, and updates each local clustering center accordingly. Each participant can execute this data clustering operation in a loop based on the updated local clustering centers until the clustering end condition is satisfied, realizing the clustering of all clustering samples according to the participants' different local cross-feature samples. Throughout the clustering process, the security of the reference distance information can be ensured without encrypting it with a homomorphic encryption algorithm, and there is no strict limit on the number of participants, which solves the problems of low processing efficiency, reliability, and practicability of existing federated clustering methods and improves the processing efficiency, reliability, and practicability of federated data clustering.
Example two
Fig. 2 is a flowchart of a data clustering method based on federal learning according to a second embodiment of the present invention, which is embodied on the basis of the above embodiments, and in this embodiment, a specific optional implementation manner is provided for calculating reference distance information fragments between each cluster sample and each collaborative clustering center based on a secret sharing result, determining a nearest collaborative clustering center of each cluster sample, and updating a value of each local clustering center according to the nearest collaborative clustering center of each cluster sample. Correspondingly, as shown in fig. 2, the method of the present embodiment may include:
S210, initializing each local clustering center with an initial value, and calculating the distance information between each local cross-feature sample and each local clustering center as the local clustering information.
Specifically, each participant may calculate a distance square value between the local cross-feature sample and each local clustering center according to the local cross-feature sample held locally, and the distance square value is used as local clustering information.
In one specific example, assume that there are a total of 3 participants P_1, P_2, P_3 and a third-party arbiter, where the third-party arbiter is responsible for providing auxiliary calculation. Alternatively, the third-party arbiter may also be one of the participants, which is not limited in this embodiment of the present invention. The local cross-feature samples of each participant include N samples. Denote the input of the r-th participant as {x_n^r | n = 1, …, N}, where each x_n^r is typically a vector. Assuming the K-means algorithm is selected as the federated clustering algorithm, its basic principle is as follows: select random cluster centers, compare the distance between each sample and each cluster center, assign each sample to the cluster center closest to it, and compute the mean of the points assigned to each cluster to obtain the new cluster centers; repeat these steps until the cluster centers are stable. According to this principle, in the initial stage the three participants P_1, P_2 and P_3 each randomly select K of their local cross-feature samples as local clustering centers. One local clustering center represents one cluster, so the N local cross-feature samples of each of P_1, P_2 and P_3 can be divided into K clusters. For convenience of description, write x_n^r ∈ C_k to denote that x_n^r, the n-th local cross-feature sample of the r-th participant P_r, is assigned to the k-th cluster C_k. Accordingly, each participant P_r independently and randomly generates K local clustering centers c_1^r, …, c_K^r, where c_k^r denotes the k-th local clustering center of the r-th participant, and the concatenation c_k = (c_k^1, c_k^2, c_k^3) of the k-th local clustering centers of all participants constitutes the k-th true, complete collaborative clustering center.

Accordingly, each participant P_r computes the squared distance d_{n,k}^r = ||x_n^r − c_k^r||^2 between each local cross-feature sample x_n^r and each local clustering center c_k^r as its local clustering information, where d_{n,k}^r denotes the squared distance from x_n^r to c_k^r. In particular, the r-th participant P_r computes K squared distances for each local cross-feature sample it owns. Correspondingly, each participant secret-shares the K squared distances (d_{n,1}^r, …, d_{n,K}^r) corresponding to each local cross-feature sample with the other participants.
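The reason the later share-summation step works is that the squared Euclidean distance decomposes over the feature partition: the full squared distance from a sample to a collaborative center is the sum of each participant's local squared distances. A small plaintext check of this property (names and data are illustrative):

```python
import numpy as np

def local_cluster_info(local_samples, local_centers):
    """d[n, k] = squared distance from the local feature slice of sample n
    to the local slice of center k -- the local clustering information."""
    diff = local_samples[:, None, :] - local_centers[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Two participants, each holding one feature of the same 3 samples / 2 centers
x1, c1 = np.array([[0.0], [1.0], [4.0]]), np.array([[0.0], [4.0]])
x2, c2 = np.array([[1.0], [0.0], [2.0]]), np.array([[1.0], [2.0]])
# Summing the local squared distances reproduces the distances computed
# on the concatenated (full-feature) samples and centers
full = local_cluster_info(np.hstack([x1, x2]), np.hstack([c1, c2]))
partial_sum = local_cluster_info(x1, c1) + local_cluster_info(x2, c2)
```
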
S220, secret sharing is carried out on the local clustering information and other participants, and the local clustering information which is secret shared by the other participants is received.
Accordingly, after each participant secret-shares its local clustering information, each participant holds one secret-sharing fragment of its own local clustering information and one secret-sharing fragment of the local clustering information of each of the other participants.
Further illustrating with the above example, each participant P_i can obtain one secret-sharing fragment ⟨d_{n,k}^r⟩_i of the local clustering information of every participant P_r, with Σ_i ⟨d_{n,k}^r⟩_i = d_{n,k}^r, where ⟨d_{n,k}^r⟩_i denotes the secret-sharing fragment generated from the local clustering information d_{n,k}^r and held by the i-th participant.
And S230, summing secret shared fragment information of each local clustering information corresponding to the same clustering sample to the same collaborative clustering center to obtain a reference distance information fragment between each clustering sample and each collaborative clustering center.
In the embodiment of the present invention, after obtaining the secret shared fragment information of the local clustering information of each participant, each participant may sum the secret shared fragment information of each local clustering information to obtain the reference distance information fragment between each clustering sample and each collaborative clustering center.
Taking the above example further, each participant P_i can sum the received secret-sharing fragments to obtain ⟨D_{n,k}⟩_i = Σ_r ⟨d_{n,k}^r⟩_i, where ⟨D_{n,k}⟩_i denotes a reference distance information fragment, i.e. one fragment of the full squared distance D_{n,k} = Σ_r d_{n,k}^r from the n-th clustering sample to the k-th collaborative clustering center.
In one specific example, assume that there are three participants a, b and c, two clusters, and three clustering samples A, B and C. Taking clustering sample A as an example, the distribution of the reference distance information between clustering sample A and a certain collaborative clustering center, as calculated by the different participants, can refer to Table 1. As shown in Table 1, the local clustering information obtained by each participant by calculating the distance from its local cross-feature slice of clustering sample A to its local clustering center is A_a, A_b and A_c respectively. Taking participant a as an example, participant a splits its local clustering information A_a into three fragments according to the number of participants, namely A_a1, A_a2 and A_a3. Participant a keeps A_a1 for itself, secret-shares A_a2 to participant b, and secret-shares A_a3 to participant c. Similarly, participant a also receives A_b1 and A_c1 secret-shared by participants b and c. Correspondingly, participant a's reference distance information fragment from clustering sample A to the collaborative clustering center is A_a1 + A_b1 + A_c1 = A_1. Similarly, participants b and c can also obtain their fragments of the reference distance information from clustering sample A to the collaborative clustering center.
TABLE 1 Relationships between participants, local clustering information, and reference distance fragments

Participant | Local clustering information | Fragments held after sharing | Reference distance information fragment
a | A_a (split into A_a1, A_a2, A_a3) | A_a1, A_b1, A_c1 | A_1 = A_a1 + A_b1 + A_c1
b | A_b (split into A_b1, A_b2, A_b3) | A_a2, A_b2, A_c2 | A_2 = A_a2 + A_b2 + A_c2
c | A_c (split into A_c1, A_c2, A_c3) | A_a3, A_b3, A_c3 | A_3 = A_a3 + A_b3 + A_c3
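The sharing-and-summing pattern in this example is plain additive secret sharing. A minimal sketch with illustrative integer distances (A_a = 9, A_b = 4, A_c = 16) over an assumed modulus:

```python
import random

MOD = 2 ** 31

def share(value, n_parties=3):
    """Split value into n additive shares summing to value modulo MOD."""
    head = [random.randrange(MOD) for _ in range(n_parties - 1)]
    return head + [(value - sum(head)) % MOD]

# Local clustering information of participants a, b, c for cluster sample A
local_info = {"a": 9, "b": 4, "c": 16}
shards = {p: share(v) for p, v in local_info.items()}
# Participant i's reference-distance fragment: A_a_i + A_b_i + A_c_i
fragments = [sum(shards[p][i] for p in shards) % MOD for i in range(3)]
# Recombining all fragments recovers the full reference distance 9 + 4 + 16,
# yet no single fragment reveals any participant's local distance
```
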
S240, receiving a random number set sent by a first auxiliary party and a local random number set fragment sent by the first auxiliary party in a secret sharing mode.
Wherein each random number in the random number set is used to identify one of the collaborative clustering centers; the first auxiliary party is a third party or any one of the plurality of participants.
The random number set may include a set number of random numbers, each uniquely identifying one collaborative clustering center, so the number of random numbers equals the number of collaborative clustering centers. The local random number set fragment may be the fragment information obtained when the first auxiliary party, which generates the random numbers, secret-shares each random number.
In an embodiment of the present invention, each participant may receive the random number set sent by the first auxiliary party and the local random number set fragment sent by the first auxiliary party in a secret sharing manner. The first auxiliary party may be a coordinator that does not participate in the federated data clustering process, such as a third-party arbiter. Alternatively, the first auxiliary party may also be one of the participants, e.g. any one of the 3 participants P_1, P_2, P_3. Alternatively, the random number set may be generated jointly by all participants using the same rule, so that each participant generates the same random number set. To ensure the public credibility of the random numbers, it is preferable that a coordinator acts as the first auxiliary party and independently generates the random number set. The random numbers are used to hide the intermediate information of the clustering comparison in the subsequent comparison of squared distances between clustering samples and collaborative clustering centers, thereby protecting the clustering information. After generating the random number set, the first auxiliary party may secret-share fragments of each random number in the set to each participant. Meanwhile, to ensure that the collaborative clustering center can finally be recovered, the first auxiliary party also sends the original, complete random number set to each participant.
It should be noted that, after each participant generates local clustering information, the first auxiliary party may generate a new random number set, secret-share it to each participant, and send the complete random number set to each participant.
Taking the above example further, the random number set generated by the first auxiliary party includes K random numbers (t_1, …, t_K), respectively identifying the K collaborative clustering centers. After the first auxiliary party secret-shares the random number set, each participant P_r can obtain the local random number set fragments (⟨t_1⟩_r, …, ⟨t_K⟩_r), where ⟨t_k⟩_r denotes the r-th fragment of the k-th random number t_k. The first auxiliary party may also send the originally generated random number set (t_1, …, t_K) itself to each participant.
And S250, performing secure multi-party computation with the other participants using the reference distance information fragments, the random number set, and the local random number set fragments, and determining the nearest collaborative clustering center of each clustering sample.
Accordingly, each participant can perform secure multi-party computation with other participants according to the locally held reference distance information, the random number set and the local random number set fragments to determine the nearest collaborative clustering center of each clustering sample.
In an optional embodiment of the present invention, performing secure multi-party computation with the other participants using the reference distance information fragments, the random number set, and the local random number set fragments to determine the nearest collaborative clustering center of each clustering sample includes determining, between any two collaborative clustering centers, the nearest collaborative clustering center of the currently processed target clustering sample as follows: calculating the distance information difference fragment from the target clustering sample to the two collaborative clustering centers according to the reference distance information fragments from the target clustering sample to the two collaborative clustering centers; performing a first secure multi-party computation using the distance information difference fragment and the data correspondingly held by the other participants, to obtain the corresponding distance information comparison result fragment; and performing a second secure multi-party computation, according to the distance information comparison result fragment, the reference distance information fragments, and the random number fragments identifying the two collaborative clustering centers in the local random number set fragments, with the data correspondingly held by the other participants, to obtain the target reference distance information fragment and the target random number fragment corresponding to the nearest collaborative clustering center; the computation target of the second secure multi-party computation is to select, of the two collaborative clustering centers, the one closest to the target clustering sample.
The target cluster sample is also the cluster sample currently being processed. The distance information difference may be a difference between the first reference distance information and the second reference distance information. The distance information difference fragment may be a fragment of a distance information difference calculated by the participant according to reference distance information fragments from the target cluster sample to the first collaborative cluster center and the second collaborative cluster center. The first and second collaborative clustering centers may be two different collaborative clustering centers. The distance information comparison result may be a comparison result of the target cluster sample calculated by the participant using the first secure multiparty calculation method and a distance difference between the first collaborative cluster center and the second collaborative cluster center. The distance information comparison result shard may be a shard generated by secret sharing of the distance information comparison result by the participant. The first secure multi-party computation and the second secure multi-party computation may be two different secure multi-party computation methods, and an appropriate secure multi-party computation method may be selected according to a computation target. The target reference distance information fragment is a fragment held by all the participants after secret sharing is carried out on the reference distance information corresponding to the nearest collaborative clustering center among all the participants again, namely a fragment of a distance square value between the nearest collaborative clustering center and a target clustering sample; the target random number fragment is the fragment held by all the participants after secret sharing is carried out among all the participants by the random number used for identifying the nearest collaborative clustering center.
Correspondingly, when performing clustering processing, a participant may first determine a target clustering sample, and then calculate the distance information difference fragment from the target clustering sample to the two collaborative clustering centers according to the reference distance information fragments from the target clustering sample to those centers. After obtaining the distance information difference fragment, the participant can perform the first secure multi-party computation using its fragment together with the distance information difference fragments held by the other participants, obtaining the corresponding distance information comparison result fragment. It can then perform the second secure multi-party computation using the distance information comparison result fragment, the reference distance information fragments, and the random number fragments identifying the two collaborative clustering centers in the local random number set fragments, together with the corresponding data held by the other participants, to obtain the target reference distance information fragment and the target random number fragment corresponding to the nearest collaborative clustering center.
In an optional embodiment of the present invention, performing secure multi-party computation with the other participants using the reference distance information fragments, the random number set, and the local random number set fragments to determine the nearest collaborative clustering center of each clustering sample may include: selecting two non-traversed collaborative clustering centers as the first collaborative clustering center and the second collaborative clustering center respectively; calculating the distance information difference fragment from the target clustering sample to the first and second collaborative clustering centers according to the reference distance information fragments from the target clustering sample to the first and second collaborative clustering centers; performing the first secure multi-party computation on this distance information difference fragment and the distance information difference fragments correspondingly held by the other participants, to obtain the corresponding distance information comparison result fragment; performing the second secure multi-party computation, according to the distance information comparison result fragment, the reference distance information fragments from the target clustering sample to the first and second collaborative clustering centers, and the random number fragments identifying the first and second collaborative clustering centers in the local random number set fragments, with the corresponding data held by the other participants, to obtain the target reference distance information fragment and the target random number fragment corresponding to the nearest collaborative clustering center, the computation target of the second secure multi-party computation being to select, of the first and second collaborative clustering centers, the one closest to the target clustering sample; and updating the first collaborative clustering center to the nearest collaborative clustering center and the second collaborative clustering center to a newly selected non-traversed collaborative clustering center, and returning, for the updated first and second collaborative clustering centers, to the operation of calculating the distance information difference fragment from the target clustering sample to the first and second collaborative clustering centers, until the traversal of the last collaborative clustering center is completed.
In the embodiment of the present invention, after a participant obtains the reference distance information fragments, the random number set, and the local random number set fragments, it may repeatedly compare the differences between the distances from each clustering sample to different collaborative clustering centers, so as to obtain from those differences the collaborative clustering center closest to each clustering sample. It can be understood that, for K collaborative clustering centers, each clustering sample needs K−1 comparisons. It should be noted that the comparison between clustering samples and collaborative clustering center distances may be performed in a certain order over the clustering samples, or by randomly selecting one clustering sample at a time, as long as all clustering samples are eventually processed, which is not limited in this embodiment of the present invention. Meanwhile, the operation of determining the nearest collaborative clustering center for each clustering sample may be performed sequentially or in parallel, which is likewise not limited in this embodiment of the present invention.
Specifically, the participant may select two non-traversed collaborative clustering centers as the first and second collaborative clustering centers respectively, calculate the distance information difference fragment from the first reference distance information fragment (target clustering sample to first collaborative clustering center) and the second reference distance information fragment (target clustering sample to second collaborative clustering center), perform the first secure multi-party computation using this fragment together with the distance information difference fragments correspondingly held by the other participants to determine the distance information comparison result, and secret-share the distance information comparison result fragments with the other participants. Because each participant can only obtain a fragment of the distance information comparison result and cannot obtain the real distance information difference, the distance information difference is protected.
At this time, each participant may perform the second secure multi-party computation using the distance information comparison result fragment it holds, the reference distance information fragments from the target clustering sample to the first and second collaborative clustering centers, and the random number fragments identifying the first and second collaborative clustering centers in its local random number set fragments, together with the corresponding data held by the other participants, so as to determine from the result which of the first and second collaborative clustering centers is closest to the target clustering sample.
At this time, the participant may update the first collaborative clustering center to be the nearest collaborative clustering center of the target clustering sample, and select one non-traversed collaborative clustering center as the new second collaborative clustering center. After updating the first and second collaborative clustering centers, the participant continues the operation of calculating the distance information difference fragment from the target clustering sample to the first and second collaborative clustering centers according to the corresponding reference distance information fragments, until the last collaborative clustering center has been traversed, at which point the collaborative clustering center with the smallest distance to the target clustering sample has been selected as the nearest collaborative clustering center. It can be understood that, as the traversal over the target clustering sample and the first and second collaborative clustering centers proceeds, the nearest collaborative clustering center of the target clustering sample is updated in real time; after the traversal against every collaborative clustering center is complete, the most recently obtained center is the collaborative clustering center closest to the target clustering sample among all collaborative clustering centers.
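The traversal described above is a running pairwise tournament costing K−1 comparisons per sample. A plaintext analogue of the control flow (in the protocol every comparison and selection is carried out on secret shares via the first and second secure multi-party computations; the names and values here are illustrative):

```python
def nearest_center_tournament(distances, center_ids):
    """Keep a running winner (nearest-so-far center) and compare it against
    each not-yet-traversed center; return its distance and identifier."""
    best_d, best_id = distances[0], center_ids[0]
    for d, cid in zip(distances[1:], center_ids[1:]):
        if d < best_d:  # sign of the distance difference decides the winner
            best_d, best_id = d, cid
    return best_d, best_id

dists = [7.0, 2.5, 4.0]      # squared distances: sample to the 3 centers
rand_ids = [811, 4732, 95]   # random numbers identifying the centers
winner = nearest_center_tournament(dists, rand_ids)
# winner carries the target reference distance and the target random number
```
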
In an optional embodiment of the present invention, the determining a nearest collaborative clustering center of each of the clustering samples using the reference distance information fragment, the random number set, and the local random number set fragment for secure multi-party computation with other participants may further include: performing multi-party safe calculation with other participants to recover the target random number fragments to obtain a target random number; and determining the collaborative clustering center identified by the target random number based on the recovered target random number and the random number set, and using the collaborative clustering center as the nearest collaborative clustering center of the target clustering sample.
The target reference distance information fragment can be a fragment generated by secret sharing of a distance square value between the target clustering sample and the nearest collaborative clustering center and other participants by the participants. The target random number shard may be a shard of target random numbers, which may be random numbers that identify the nearest collaborative cluster center.
Correspondingly, after determining which of the first and second collaborative clustering centers is closest to the target clustering sample, each participant can also obtain, from the result of the second secure multi-party computation, the fragment information of the squared distance between the target clustering sample and that nearest collaborative clustering center, that is, the target reference distance information fragment, which the participant can reuse when subsequently computing the nearest collaborative clustering center for the target clustering sample against the updated first and second collaborative clustering centers. Meanwhile, each participant can also obtain, from the same result, the fragment information of the target random number corresponding to the nearest collaborative clustering center, that is, the target random number fragment. It can be understood that each time a participant invokes the second secure multi-party computation to update the nearest collaborative clustering center, that computation simultaneously generates a new target reference distance information fragment and a new target random number fragment.
Since the nearest collaborative clustering center determined via the first secure multi-party computation method is held only in fragment form, it is unknown in the clear to any single participant. After the participant finally determines the nearest collaborative clustering center of the target clustering sample using the first secure multi-party computation method, it therefore needs to recover the matched target random number. Because each random number identifies one collaborative clustering center, the participant can determine the nearest collaborative clustering center of the target clustering sample from the recovered target random number. Specifically, the participant performs secure multi-party computation with the other participants using the finally updated target random number fragment, recovers the concrete value of the target random number, and determines the nearest collaborative clustering center of the target clustering sample from the center that this random number identifies.
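The additive secret sharing that underlies all of the fragment operations described above can be sketched as follows. This is an illustrative Python simulation only; the modulus Q and the function names are assumptions of this sketch, not part of the claimed method:

```python
import random

Q = 2**31 - 1  # assumed modulus for additive sharing

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n additive shards whose sum mod Q recovers it."""
    shards = [random.randrange(Q) for _ in range(n_parties - 1)]
    shards.append((secret - sum(shards)) % Q)
    return shards

def recover(shards: list[int]) -> int:
    """Recover the secret by summing all shards mod Q."""
    return sum(shards) % Q

shards = share(42, 3)
assert recover(shards) == 42   # no single shard reveals the secret
```

Any linear combination of secrets can be computed fragment-wise by applying the same combination to each participant's shards, which is what the protocol exploits throughout.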
To further illustrate with the above example, the participants need to compare the distance between each clustering sample and the different collaborative clustering centers to determine the nearest collaborative clustering center of each clustering sample. First, a participant selects a clustering sample that has not yet been processed as the target clustering sample, and starts by comparing the distances from the target clustering sample to the first collaborative clustering center C_1 and the second collaborative clustering center C_2. Each participant P_i already holds the reference distance fragments [d_{n,1}]_i and [d_{n,2}]_i. Each participant P_i can therefore locally compute u_i = [d_{n,1}]_i − [d_{n,2}]_i, where u_i denotes the fragment of the distance information difference of the target clustering sample between the first and second collaborative clustering centers, [d_{n,1}]_i denotes the first reference distance information fragment, and [d_{n,2}]_i denotes the second reference distance information fragment.
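The local computation of the difference fragments u_i can be sketched as follows (illustrative Python; the modulus and the distance values are assumed for demonstration):

```python
import random

Q = 2**31 - 1  # assumed sharing modulus

def share(x, n):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

# hypothetical true squared distances to the two collaborative centers
d1, d2 = 90, 40
d1_shards, d2_shards = share(d1, 3), share(d2, 3)

# each participant locally subtracts its two shards: u_i = [d1]_i - [d2]_i
u = [(a - b) % Q for a, b in zip(d1_shards, d2_shards)]

# summing all u_i recovers the true difference d1 - d2 = 50 > 0,
# i.e. the sample is closer to the second center
assert sum(u) % Q == (d1 - d2) % Q
```

No communication is needed for this step: subtraction of shards is a purely local operation.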
Correspondingly, all participants jointly invoke the first secure multi-party computation method to judge whether Σ_i u_i > 0, where Σ_i u_i sums the fragments u_i over all participants i, that is, compares the real distance information of the target clustering sample with respect to the first and second collaborative clustering centers. The distance information comparison result is set as c = DReLU(Σ_i u_i), where DReLU denotes the first secure multi-party computation method that computes the comparison result. If c = 0, the target clustering sample is closer to the first collaborative clustering center C_1; if c = 1, the target clustering sample is closer to the second collaborative clustering center C_2. After the comparison result is obtained, the first secure multi-party computation method secret-shares c in fragments to the participants; the i-th participant P_i obtains the fragment [c]_i, with Σ_i [c]_i = c, where [c]_i denotes the distance information comparison result fragment held by the i-th participant. Meanwhile, the first secure multi-party computation method also secret-shares, in the form of target reference distance information fragments, the squared distance between the target clustering sample and the nearest collaborative clustering center to each participant; this fragment can later be compared against the squared distance between the target clustering sample and a third collaborative clustering center. Furthermore, the first secure multi-party computation method also secret-shares the target random number corresponding to the nearest collaborative clustering center to each participant in the form of target random number fragments.
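A plaintext stand-in for the DReLU comparison step can be sketched as follows. Note this simulation recovers the shared difference in the clear, which a real secure multi-party DReLU protocol would never do; it only illustrates the arithmetic and the re-sharing of the comparison bit c:

```python
import random

Q = 2**61 - 1  # assumed sharing modulus

def share(x, n):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

def centered(x):
    # map a mod-Q value back to a signed integer (assumes |value| << Q)
    return x - Q if x > Q // 2 else x

def drelu(diff_shards):
    """Insecure stand-in for the DReLU MPC: computes the comparison bit c
    from the shards of (d1 - d2) and re-shares c among the participants."""
    diff = centered(sum(diff_shards) % Q)
    c = 1 if diff > 0 else 0   # c = 0: closer to the first center
    return c, share(c, len(diff_shards))

u = share(90 - 40, 3)          # shards of d1 - d2 = 50 > 0
c, c_shards = drelu(u)
assert c == 1                  # closer to the second center
assert sum(c_shards) % Q == c  # the bit itself is handed back only in shards
```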
After obtaining the distance information comparison result fragment and the squared distance fragments, each participant P_i inputs [c]_i, the reference distance fragments [d_{n,1}]_i and [d_{n,2}]_i, and the corresponding random number fragments into the second secure multi-party computation method. Each participant P_i then obtains [d_n]_i and [k_n]_i, where [d_n]_i denotes participant P_i's fragment of the squared distance of the n-th clustering sample, that is, the fragment of the squared distance between the target clustering sample and the nearest collaborative clustering center, and [k_n]_i denotes the corresponding target random number fragment. Each participant P_i uses the second secure multi-party computation method (for example, the SelectShare secure multi-party computation method) to determine, according to the distance information comparison result fragments, which of the two currently compared collaborative clustering centers is closest to the target clustering sample. Specifically, if c = 0, then d_n = d_{n,1} and k_n = r_1, indicating that the target clustering sample is closer to the first collaborative clustering center; if c = 1, then d_n = d_{n,2} and k_n = r_2, indicating that the target clustering sample is closer to the second collaborative clustering center. The above only shows a specific implementation of comparing the distances from the target clustering sample to the first and second collaborative clustering centers. After this comparison is completed, the pair of collaborative clustering centers under comparison is updated, and the comparison of the distances from the target clustering sample to the updated pair continues until all collaborative clustering centers have been compared.
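The SelectShare-style selection can be sketched as follows. This is again an insecure simulation: the comparison bit c is used in the clear here, whereas the real second secure multi-party computation method keeps c hidden in fragments throughout:

```python
import random

Q = 2**31 - 1  # assumed sharing modulus

def share(x, n):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

def select_share(c, a_shards, b_shards):
    """Insecure stand-in for SelectShare: returns shards of a when c == 0,
    shards of b when c == 1 (the real protocol never reveals c)."""
    return list(b_shards) if c else list(a_shards)

d1_shards, d2_shards = share(90, 3), share(40, 3)
r1_shards, r2_shards = share(11, 3), share(22, 3)  # assumed center IDs

c = 1  # the comparison said the sample is closer to the second center
dn = select_share(c, d1_shards, d2_shards)  # target reference distance shards
kn = select_share(c, r1_shards, r2_shards)  # target random number shards
assert sum(dn) % Q == 40 and sum(kn) % Q == 22
```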
Accordingly, the participants can recover the target random number from the local target random number fragment and the target random number fragments held by the other participants as k_n = Σ_i [k_n]_i, where k_n denotes the target random number and [k_n]_i denotes the target random number fragment held by participant P_i. If k_n = r_k, the nearest collaborative clustering center of the target clustering sample obtained by each participant is center k.
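The recovery of the target random number and its mapping back to a center index can be sketched as follows (illustrative values; the random number set is an assumption of this sketch):

```python
import random

Q = 2**31 - 1  # assumed sharing modulus

def share(x, n):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

# hypothetical random number set: random_set[k] identifies center k
random_set = [17, 5, 42]
kn_shards = share(random_set[2], 3)  # shards of the winning center's ID

kn = sum(kn_shards) % Q              # all participants pool their shards
nearest = random_set.index(kn)       # map the recovered ID to a center index
assert nearest == 2
```

Because the centers are referred to only by random identifiers during the protocol, pooling the fragments at the end reveals which center won without revealing any intermediate distances.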
And S260, updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample.
In an optional embodiment of the present invention, updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample may include: obtaining all local cross-feature samples included in the corresponding clusters of the nearest collaborative clustering centers; calculating a sample average value of all local cross-feature samples for each cluster; and updating the value of each local clustering center according to each sample average value.
Where the sample average may be the average of all local cross-feature samples in a cluster.
Correspondingly, after the nearest collaborative clustering centers are obtained, each participant can collect, for each cluster, all local cross-feature samples included in the cluster corresponding to that nearest collaborative clustering center, calculate the sample average value of those local cross-feature samples, and update the value of each local clustering center with the sample average value of its cluster.
To further illustrate with the above example, for the k-th cluster, the r-th participant calculates the updated local clustering center of its corresponding cluster according to the nearest collaborative clustering center as c'_{r,k} = (Σ_{n∈cluster k} p_{r,n}) / N_k, where c'_{r,k} denotes the updated local clustering center, Σ_{n∈cluster k} p_{r,n} sums all samples p_{r,n} belonging to the k-th cluster, and N_k denotes the number of samples p_{r,n} belonging to the k-th cluster.
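The local cluster center update described above is an ordinary per-dimension mean over the assigned samples, which can be sketched as:

```python
# hypothetical local cross-feature samples p_{r,n} assigned to cluster k
cluster_k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# the updated local cluster center is the per-dimension mean of the samples
n_k = len(cluster_k)
new_center = [sum(x[d] for x in cluster_k) / n_k
              for d in range(len(cluster_k[0]))]
assert new_center == [3.0, 4.0]
```

Each participant performs this update purely locally, on its own feature slice of the samples, which is why no cryptography is needed for this step.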
It should be noted that after the current target clustering sample has been processed, the participant may select another unprocessed clustering sample as the new target clustering sample, and repeat the above operation of determining the nearest collaborative clustering center for each clustering sample until all clustering samples have been processed, which completes one update round of the collaborative clustering centers and the local clustering centers. After this update round is completed, the update operation can be executed iteratively in a loop until the clustering end condition is met.
S270, judging whether a clustering end condition is met, if so, executing S280; otherwise, the process returns to the step S210.
In an optional embodiment of the present invention, the returning to perform the operation of calculating the local clustering information of each clustering sample based on the updated local clustering center until the clustering end condition is satisfied includes determining whether the clustering end condition is satisfied by: calculating cluster center updating reference value fragments according to the participant threshold value fragments generated and shared secretly by the second auxiliary party and the updated local cluster center; the second assistant is any one of a third party or a plurality of participants; performing safe multiparty calculation with other participants by using the cluster center update reference value fragments to determine a cluster center update reference value; and determining whether the clustering ending condition is met or not according to the value taking result of the updated reference value of the clustering center.
The participant threshold fragment may be a fragment generated by the second assisting party from a pre-generated participant threshold. The participant threshold may be a value set according to the data processing requirements and may be used to decide the condition for terminating the federated data clustering process. It should be noted that the second assisting party may generate the threshold before or after the first assisting party sends the random number set and the local random number set fragments to each participant, and secret-shares the participant threshold fragments to each of the other participants. The second assisting party may be the same as or different from the first assisting party. In view of the complexity of the federated data clustering process, typically the first and second assisting parties are the same party and one of the multiple participants. The cluster center update reference value fragment may be a fragment generated when a participant secret-shares a reference value computed from the updated local clustering centers and the participant threshold fragment.
Optionally, in order to ensure the reasonability of the clustering process, a second auxiliary party may be selected from each participant or a third party to generate a participant threshold in advance, and the participant threshold is shared with other participants in secret, and each participant may hold a participant threshold fragment. Furthermore, the participator can calculate the cluster center updating reference value according to the participator threshold value fragment and the updated local cluster center, and carry out secret sharing on the cluster center updating reference value fragment corresponding to the cluster center updating reference value. Accordingly, each participant can perform secure multiparty computation with other participants using the cluster center update reference value shard to recover the cluster center update reference value. After the cluster center update reference value is obtained, each participant can determine whether the cluster end condition is met according to the value taking result of the cluster center update reference value.
To further illustrate with the above example, assume P_1 acts as the second assisting party. P_1 generates the participant threshold thres and shares it with the other participants by secret sharing; participant P_i obtains the fragment thres_i, with Σ_i thres_i = thres, where thres_i denotes a participant threshold fragment. Each participant P_r can then calculate e_r = Σ_k ||c'_{r,k} − c_{r,k}||² − thres_r, where e_r denotes the cluster center update reference value fragment and Σ_k ||c'_{r,k} − c_{r,k}||² denotes the fragment of the distance difference between the updated local clustering centers and the local clustering centers before the update. After obtaining the cluster center update reference value fragments, the participants P_r use the fragments held by each participant to perform recovery by a secure multi-party computation method, obtaining the cluster center update reference value e = Σ_r e_r, and judge whether e > 0. If e ≤ 0, the distance between the updated local clustering centers and the original local clustering centers is very small, the clustering centers tend to be stable, and the update of the original local clustering centers according to the updated local clustering centers can be stopped, that is, the original local clustering centers remain unchanged. If e > 0, the distance between the updated local clustering centers and the original local clustering centers is still large, and the original local clustering centers can be updated according to the updated local clustering centers.
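The termination test on the recovered reference value e = Σ_k ||c'_k − c_k||² − thres can be sketched in the clear as follows (a plaintext illustration of the recovered value only; in the protocol this quantity is assembled from fragments):

```python
def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def should_stop(old_centers, new_centers, thres):
    """e = total squared center movement minus the participant threshold;
    e <= 0 means the centers have stabilised and iteration can stop."""
    e = sum(squared_dist(o, n)
            for o, n in zip(old_centers, new_centers)) - thres
    return e <= 0

old = [[0.0, 0.0], [10.0, 10.0]]
new = [[0.1, 0.0], [10.0, 10.1]]
assert should_stop(old, new, thres=0.5)                        # tiny movement
assert not should_stop(old, [[5.0, 5.0], [10.0, 10.0]], thres=0.5)
```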
In an optional embodiment of the present invention, the returning to perform the operation of calculating the local clustering information of each clustering sample based on the updated local clustering center until the clustering end condition is satisfied includes determining whether the clustering end condition is satisfied by: acquiring the current data clustering processing times; and under the condition that the current data clustering processing times reach a preset clustering processing time threshold, determining that a clustering ending condition is met.
The current data clustering processing times may be times of updating the local clustering center currently. The preset threshold of the clustering process frequency may be a preset value, such as 100 or 200, and the embodiment of the present invention does not limit the specific value of the preset threshold of the clustering process frequency.
In the embodiment of the present invention, optionally, a preset threshold of clustering processing times may be set to determine the clustering end condition. If the current data clustering processing times reach a preset clustering processing time threshold value at each participant, the federal data clustering processing process can be terminated; otherwise, the federated data clustering process can continue to be executed in a loop iteration mode until the optimal local clustering center is determined.
And S280, terminating the federated data clustering processing flow.
If a certain bank needs to perform unsupervised analysis on its credit customers and a better analysis result is required, cluster analysis can be performed on the credit customers by combining the features held by multiple banks or institutions in a cross-feature federated manner. The clustering samples are the credit feature data of the credit customers, and the cross-feature local samples are the credit feature data of those customers distributed across different banks and institutions. Taking this application scenario as a specific example, the flow of the federated-learning-based data clustering method is further explained by the following steps:
step 1: each participant PrIndependent random generation of K local clustering centers
Figure DEST_PATH_IMAGE069
And the concatenation of the kth local clustering center
Figure 213650DEST_PATH_IMAGE008
Is the kth cooperative cluster center.
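The cross-feature concatenation of step 1 can be sketched as follows (the participant count, feature dimensions, and values are assumptions of this sketch):

```python
import random

K, R = 2, 3        # K clusters, R participants
dims = [2, 3, 1]   # feature dimensions held by each participant

# each participant independently draws K random local cluster centers
local_centers = [[[random.random() for _ in range(dims[r])]
                  for _ in range(K)] for r in range(R)]

# the k-th collaborative center is the concatenation of every participant's
# k-th local center along the feature axis
collab_centers = [sum((local_centers[r][k] for r in range(R)), [])
                  for k in range(K)]
assert len(collab_centers) == K
assert all(len(c) == sum(dims) for c in collab_centers)
```

This concatenation is never materialised at any single party during the protocol; it exists only implicitly, which is what makes the scheme cross-feature (vertically partitioned).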
Step 2: the second assisting party P_1 generates the participant threshold thres and shares it with the other participants by secret sharing, that is, participant P_i obtains thres_i, with Σ_i thres_i = thres.
Step 3: the third party generates a set of K random numbers {r_1, …, r_K} respectively identifying the K collaborative clustering centers, secret-shares the random numbers to the participants, that is, participant P_r obtains the fragments [r_k]_r, and sends the random number set {r_1, …, r_K} itself to each participant.
Step 4: participant P_r computes the squared distance d_{r,n,k} = ||p_{r,n} − c_{r,k}||² between each local cross-feature sample p_{r,n} and each local clustering center c_{r,k}. Specifically, the r-th participant P_r computes K squared distance values for each local cross-feature sample it owns.
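Step 4 can be sketched as follows (sample and center values are assumed for illustration):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# hypothetical local cross-feature samples and K = 2 local centers
samples = [[0.0, 0.0], [4.0, 0.0]]
centers = [[0.0, 0.0], [4.0, 3.0]]

# for each sample, K squared distances to the local centers
d = [[sq_dist(p, c) for c in centers] for p in samples]
assert d == [[0.0, 25.0], [16.0, 9.0]]
```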
Step 5: each participant secret-shares the K squared distances d_{r,n,k} corresponding to each of its local cross-feature samples to the other participants P_i; each participant obtains the secret-shared fragment information [d_{r,n,k}]_i of the local clustering information, with Σ_i [d_{r,n,k}]_i = d_{r,n,k}.
Step 6: participant P_i sums the received secret-shared fragment information to obtain [d_{n,k}]_i = Σ_r [d_{r,n,k}]_i, a fragment of the distance from clustering sample n to collaborative clustering center k.
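Steps 5 and 6 together can be sketched as follows (an insecure simulation of the sharding arithmetic only; the per-participant distance contributions are assumed values):

```python
import random

Q = 2**31 - 1  # assumed sharing modulus

def share(x, n):
    s = [random.randrange(Q) for _ in range(n - 1)]
    s.append((x - sum(s)) % Q)
    return s

R = 3
# hypothetical per-participant squared-distance contributions for one (n, k)
d_parts = [9, 16, 25]

# step 5: each participant r shards its contribution to everyone
shards = [share(d, R) for d in d_parts]   # shards[r][i] is held by party i

# step 6: participant i sums the shards it received from every participant
agg = [sum(shards[r][i] for r in range(R)) % Q for i in range(R)]

# the aggregated shards reconstruct the full cross-feature distance 9+16+25
assert sum(agg) % Q == sum(d_parts)
```

The true distance from the sample to the collaborative center is the sum of the per-participant contributions, and the additive sharing preserves exactly that sum.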
The following steps 7 to 11 are executed in a loop, and are used for comparing the distance from each cluster sample to different collaborative cluster centers to obtain the nearest collaborative cluster center from the cluster sample, and updating the local cluster center according to the nearest collaborative cluster center. For K collaborative cluster centers, K-1 comparisons are needed, and the 1 st and 2 nd collaborative cluster centers are taken as an example for explanation.
Step 7: each participant compares the distances from the target clustering sample to the collaborative clustering centers C_1 and C_2. Participant P_i already holds the fragments [d_{n,1}]_i and [d_{n,2}]_i.
Step 8: each participant P_i computes u_i = [d_{n,1}]_i − [d_{n,2}]_i.
Step 9: all participants jointly invoke the first secure multi-party computation method to judge whether Σ_i u_i > 0. The distance information comparison result is set as c = DReLU(Σ_i u_i). If c = 0, the target clustering sample is closer to the first collaborative clustering center C_1; if c = 1, the target clustering sample is closer to the second collaborative clustering center C_2. After the comparison result is obtained, c is secret-shared in fragments to the participants; the i-th participant P_i obtains [c]_i, with Σ_i [c]_i = c.
Step 10: participant P_i inputs [c]_i, the reference distance fragments [d_{n,1}]_i and [d_{n,2}]_i, and the corresponding random number fragments [r_1]_i and [r_2]_i into the second secure multi-party computation method.
step 11: each participant
Figure 987123DEST_PATH_IMAGE020
And determining which collaborative clustering center of the two collaborative clustering centers which are compared at present is closest to the target clustering sample according to the distance information comparison result fragment by adopting a second safe multi-party calculation method. In particular, if
Figure 442375DEST_PATH_IMAGE049
Then, then
Figure DEST_PATH_IMAGE084
And is and
Figure DEST_PATH_IMAGE085
indicating that the target cluster sample is closer to the first collaborative cluster center. If it is
Figure 404515DEST_PATH_IMAGE052
Then, then
Figure DEST_PATH_IMAGE086
And is and
Figure DEST_PATH_IMAGE087
indicating that the target cluster sample is closer to the second collaborative cluster center.
Steps 7 to 11 complete the comparison of the target clustering sample against the collaborative clustering centers C_1 and C_2, and yield the secret-shared value of the minimum distance between the target clustering sample and the nearest collaborative clustering center.
Step 12: for each remaining collaborative clustering center C_k (k = 3, …, K), loop through steps 6 to 11, comparing the distances from the clustering sample to all collaborative clustering centers and updating the secret-shared value of the minimum distance from the target clustering sample to the nearest collaborative clustering center.
Step 13: all participants recover the target random number from the target random number fragments they hold as k_n = Σ_i [k_n]_i, where k_n denotes the target random number and [k_n]_i denotes the fragments held by the participants. If k_n = r_k, the nearest collaborative clustering center of the target clustering sample obtained by each participant is center k.
Step 14: for the k-th cluster, the r-th participant calculates the new local clustering center c'_{r,k} = (Σ_{n∈cluster k} p_{r,n}) / N_k.
Step 15: each participant P_r also calculates e_r = Σ_k ||c'_{r,k} − c_{r,k}||² − thres_r, where Σ_k ||c'_{r,k} − c_{r,k}||² represents the difference between the updated clustering centers and the original clustering centers.
Step 16: the participants P_r use a secure multi-party computation method to recover e = Σ_r e_r and judge whether e > 0. If e ≤ 0, the update of the original local clustering centers according to the updated local clustering centers can be stopped, that is, the original local clustering centers remain unchanged. If e > 0, the original local clustering centers can be updated according to the updated local clustering centers.
Step 17: loop through steps 4 to 16 until the termination condition is met or the number of local clustering center updates exceeds the preset threshold.
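With all secret sharing stripped away, the overall loop of steps 1 to 17 reduces to a cross-feature (vertically partitioned) k-means, which can be sketched end to end as follows. This is a plaintext toy simulation only; the initialisation strategy, data, and parameter names are assumptions of this sketch, not the protocol itself:

```python
def sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def federated_kmeans(parties, K, iters=20, thres=1e-6):
    """Plaintext simulation: parties[r][n] is the r-th participant's
    feature slice of the n-th (shared) sample."""
    R, N = len(parties), len(parties[0])
    # step 1 stand-in: deterministic, evenly spaced initial local centers
    idx = [n * N // K for n in range(K)]
    centers = [[list(parties[r][i]) for i in idx] for r in range(R)]
    assign = [0] * N
    for _ in range(iters):
        # steps 4-13: nearest collaborative center = argmin of the summed
        # per-participant squared distances (computed in shards in the MPC)
        assign = [min(range(K),
                      key=lambda k: sum(sq(parties[r][n], centers[r][k])
                                        for r in range(R)))
                  for n in range(N)]
        # step 14: every participant recomputes its local centers as means
        moved = 0.0
        for r in range(R):
            for k in range(K):
                members = [parties[r][n] for n in range(N) if assign[n] == k]
                if members:
                    new = [sum(m[d] for m in members) / len(members)
                           for d in range(len(members[0]))]
                    moved += sq(centers[r][k], new)
                    centers[r][k] = new
        # steps 15-16: stop once the total center movement is below thres
        if moved <= thres:
            break
    return centers, assign

# two participants each hold one feature column of the same six samples
p0 = [[0.0], [0.2], [0.1], [5.0], [5.1], [4.9]]
p1 = [[0.0], [0.1], [0.2], [5.0], [4.8], [5.2]]
centers, assign = federated_kmeans([p0, p1], K=2)
assert assign[0] == assign[1] == assign[2]
assert assign[3] == assign[4] == assign[5]
assert assign[0] != assign[3]
```

The secure protocol computes exactly these quantities, except that the summed distances, comparison bits, and minimum distances only ever exist as additive fragments spread across the participants.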
By applying the participant threshold, the random numbers, and the secure multi-party computation methods, the above technical solution assists each participant in carrying out the federated data clustering process and can ensure the efficiency, security, and reliability of that process.
It should be noted that any permutation and combination between the technical features in the above embodiments also belong to the scope of the present invention.
EXAMPLE III
Fig. 3 is a schematic diagram of a data clustering apparatus based on federal learning according to a third embodiment of the present invention, where the apparatus is configured in one of a plurality of participant devices, and each participant device clusters each clustered sample according to different local cross-feature samples, as shown in fig. 3, the apparatus includes: a local clustering information calculating module 310, a local clustering information sharing module 320, a reference distance information fragment calculating module 330, a local clustering center updating module 340, and a loop executing module 350, wherein:
the local clustering information calculating module 310 is configured to initialize each local clustering center, and calculate distance information between each local cross-feature sample and each local clustering center as local clustering information;
the local clustering information sharing module 320 is configured to share the local clustering information with other parties in a secret manner, and receive local clustering information from other parties in the secret sharing;
a reference distance information fragment calculation module 330, configured to calculate a reference distance information fragment between each cluster sample and each collaborative cluster center based on a secret sharing result;
the local clustering center updating module 340 is configured to perform secure multi-party calculation with other participants by using the reference distance information fragments, determine the nearest collaborative clustering center of each clustering sample, and update the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample;
and the loop execution module 350 is configured to return to execute the operation of calculating the local clustering information of each clustering sample based on the updated local clustering center until a clustering end condition is met.
The embodiment of the invention calculates the local clustering information of each clustering sample by each participant in a plurality of participants, and the local clustering information is the distance information between each local cross-feature sample and each local clustering center. After the local clustering information is obtained, each participant carries out secret sharing on the local clustering information and the local clustering information of other participants, reference distance information fragments between each clustering sample and each collaborative clustering center are calculated based on the secret sharing result, then the reference distance information fragments and other participants are used for carrying out safe multi-party calculation, the nearest collaborative clustering center of each clustering sample is determined, and each local clustering center is updated according to the nearest collaborative clustering center of each clustering sample. Each participant can circularly execute the data clustering operation based on the updated local clustering center until a clustering end condition is met, clustering processing of each clustering sample is realized according to different local cross-feature samples, the safety of reference distance information can be ensured without using a homomorphic encryption algorithm to encrypt the reference distance information in the whole clustering processing process, the number of the participants is not too high, the problems of low processing efficiency, reliability, practicability and the like of the existing federal clustering processing method are solved, and the processing efficiency, reliability and practicability of the federal data clustering processing are improved.
Optionally, the reference distance information fragment calculation module 330 is specifically configured to: sum the secret-shared fragment information of each piece of local clustering information from the same clustering sample to the same collaborative clustering center, to obtain the reference distance information fragment between each clustering sample and each collaborative clustering center.
Optionally, the apparatus further comprises: a random number set receiving module, configured to receive a random number set sent by a first auxiliary party and a local random number set fragment sent by the first auxiliary party in a secret sharing manner, where each random number in the random number set is used to identify each of the collaborative clustering centers; the first auxiliary party is a third party or any one of the plurality of participants; the local cluster center updating module 340 is specifically configured to: and performing safe multi-party calculation by using the reference distance information fragment, the random number set and the local random number set fragment and other participants to determine the nearest collaborative clustering center of each clustering sample.
Optionally, the local clustering center updating module 340 is specifically configured to: calculating distance information difference fragments from the target clustering sample to the two coordination clustering centers according to the reference distance information fragments from the target clustering sample to the two coordination clustering centers; performing first safe multiparty calculation by using the distance information difference fragment and data correspondingly held by other participants to obtain a corresponding distance information comparison result fragment; according to the distance information comparison result fragment, the reference distance information fragment and the random number fragments correspondingly marking the two collaborative clustering centers in the local random number set fragment, performing second safe multiparty calculation with data correspondingly held by other participants to obtain a target reference distance information fragment and a target random number fragment corresponding to the nearest collaborative clustering center; the second secure multiparty computed computation target is the closest collaborative cluster center of the two collaborative cluster centers that selected the target cluster sample.
Optionally, the local cluster center updating module 340 is further configured to: performing multi-party safe calculation with other participants to recover the target random number fragments to obtain a target random number; and determining the collaborative clustering center identified by the target random number based on the recovered target random number and the random number set, and using the collaborative clustering center as the nearest collaborative clustering center of the target clustering sample.
Optionally, the local clustering center updating module 340 is further configured to: obtain, for the cluster corresponding to each nearest collaborative clustering center, all local cross-feature samples included in that cluster; calculate, for each cluster, the sample average value of all its local cross-feature samples; and update the value of each local clustering center according to each sample average value.
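On each participant's own feature slice, this update step is ordinary k-means averaging, assuming the cluster assignments have already been determined jointly. A sketch with illustrative values:

```python
import numpy as np

# Each row is one local cross-feature sample (this party's slice of features).
local_samples = np.array([[1.0, 2.0],
                          [3.0, 4.0],
                          [9.0, 8.0],
                          [11.0, 10.0]])

# Nearest collaborative cluster index per sample, as produced by the
# secure multi-party comparison step (values are illustrative).
assignments = np.array([0, 0, 1, 1])

def update_local_centers(samples, assignments, n_clusters):
    """New local center = mean of this party's samples in each cluster."""
    return np.array([samples[assignments == k].mean(axis=0)
                     for k in range(n_clusters)])

centers = update_local_centers(local_samples, assignments, n_clusters=2)
print(centers)  # [[ 2.  3.] [10.  9.]]
```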
Optionally, the loop execution module 350 is specifically configured to: calculate a cluster center update reference value fragment according to the threshold value fragment generated by a second auxiliary party and secret-shared with the participants, and the updated local clustering centers; the second auxiliary party is a third party or any one of the plurality of participants; perform secure multi-party computation with the other participants using the cluster center update reference value fragments to determine a cluster center update reference value; and determine whether the clustering end condition is satisfied according to the value of the cluster center update reference value.
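One common end condition of this kind, sketched here in the clear for illustration (the patent performs the comparison under secret sharing against a shared threshold), is to stop when the total movement of the cluster centers between rounds falls below a threshold:

```python
def centers_converged(old_centers, new_centers, threshold=1e-4):
    """Stop when the summed squared movement of all centers is below threshold."""
    shift = sum((a - b) ** 2
                for old, new in zip(old_centers, new_centers)
                for a, b in zip(old, new))
    return shift < threshold

old = [[2.0, 3.0], [10.0, 9.0]]
new = [[2.0, 3.0], [10.0, 9.00001]]
print(centers_converged(old, new))  # True: centers have essentially stopped moving
```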
The data clustering processing device based on federated learning described above can execute the data clustering processing method based on federated learning provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing that method. For technical details not described in detail in this embodiment, reference may be made to the data clustering processing method based on federated learning provided by any embodiment of the present invention.
Since the device described above can execute the data clustering processing method based on federated learning in the embodiments of the present invention, based on the method described in those embodiments, a person skilled in the art can understand the specific implementation manners and variations of the device; how the device implements the method is therefore not described in detail here. Any device used by a person skilled in the art to implement the data clustering processing method based on federated learning in the embodiments of the present invention falls within the scope of protection claimed by the present application.
Example Four
Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 4 illustrates a block diagram of a computer device 412 suitable for use in implementing embodiments of the present invention. The computer device 412 shown in FIG. 4 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention. The computer device may be a participant device of any party, and may be a terminal or a server, which is not limited in this embodiment of the present invention.
As shown in FIG. 4, computer device 412 is in the form of a general purpose computing device. Components of computer device 412 may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 that couples the various system components including the storage device 428 and the processors 416.
Bus 418 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer device 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 412 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage 428 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 430 and/or cache memory 432. The computer device 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 434 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 418 by one or more data media interfaces. Storage 428 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program 436, having a set (at least one) of program modules 426, may be stored, for example, in storage 428. Such program modules 426 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. Program modules 426 generally perform the functions and/or methodologies of the embodiments of the invention described herein.
The computer device 412 may also communicate with one or more external devices 414 (e.g., keyboard, pointing device, camera, display 424, etc.), with one or more devices that enable a user to interact with the computer device 412, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 412 to communicate with one or more other computing devices. Such communication may occur through an Input/Output (I/O) interface 422. Also, the computer device 412 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) through network adapter 420. As shown, network adapter 420 communicates with the other modules of the computer device 412 over bus 418. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Array of Independent Disks (RAID) systems, tape drives, and data backup storage systems, to name a few.
The processor 416 executes programs stored in the storage 428 to perform various functional applications and data processing, such as implementing the data clustering method based on federal learning provided in the foregoing embodiments of the present invention.
That is, when executing the program, the processing unit implements: initializing each local clustering center with an initial value, and calculating the distance information between each local cross-feature sample and each local clustering center as local clustering information; performing secret sharing of the local clustering information with the other participants, and receiving the local clustering information secret-shared by the other participants; calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the result of the secret sharing; performing secure multi-party computation with the other participants using the reference distance information fragments to determine the nearest collaborative clustering center of each clustering sample, and updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample; and returning to the operation of calculating the local clustering information of each clustering sample based on the updated local clustering centers, until the clustering end condition is satisfied.
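For intuition only, the overall loop can be simulated in the clear. In this vertical (cross-feature) setting, the squared Euclidean distance to a collaborative center decomposes into a sum of per-party partial distances over each party's feature slice, which is exactly what lets each party contribute a local "distance information" term that the protocol then combines under secret sharing. All names are illustrative, and the real protocol never exchanges these values in plaintext.

```python
import numpy as np

rng = np.random.default_rng(42)

# Full samples split vertically: party A holds columns 0-1, party B holds column 2.
samples = rng.normal(size=(6, 3))
parts = [samples[:, :2], samples[:, 2:]]  # each party's cross-feature slice

# Each party initializes the matching slice of K collaborative centers.
K = 2
centers = [p[:K].copy() for p in parts]

for _ in range(10):
    # Each party computes its local partial squared distances (its "local
    # clustering information"); summing them gives the true global distances.
    partial = [((p[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
               for p, c in zip(parts, centers)]
    dist = sum(partial)               # combined under secret sharing in reality
    assign = dist.argmin(axis=1)      # nearest collaborative center per sample
    # Each party updates only its own slice of every center.
    centers = [np.array([p[assign == k].mean(axis=0) if (assign == k).any()
                         else c[k] for k in range(K)])
               for p, c in zip(parts, centers)]

# Decomposition check: per-party partial distances sum to the full distances.
full_centers = np.hstack(centers)
full_dist = ((samples[:, None, :] - full_centers[None, :, :]) ** 2).sum(axis=2)
assert np.allclose(full_dist, sum(
    ((p[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    for p, c in zip(parts, centers)))
print(assign)
```

The decomposition checked at the end is what makes the secret-shared summation of per-party fragments equivalent to computing distances over the full feature vectors.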
Example Five
An embodiment of the present invention further provides a computer storage medium storing a computer program which, when executed by a computer processor, performs the data clustering processing method based on federated learning according to any of the foregoing embodiments of the present invention: initializing each local clustering center with an initial value, and calculating the distance information between each local cross-feature sample and each local clustering center as local clustering information; performing secret sharing of the local clustering information with the other participants, and receiving the local clustering information secret-shared by the other participants; calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the result of the secret sharing; performing secure multi-party computation with the other participants using the reference distance information fragments to determine the nearest collaborative clustering center of each clustering sample, and updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample; and returning to the operation of calculating the local clustering information of each clustering sample based on the updated local clustering centers, until the clustering end condition is satisfied.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash Memory), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A data clustering processing method based on federated learning, executed by one of a plurality of participants, wherein the local cross-feature samples held by each participant are partial features of corresponding clustering samples to be clustered, the method being characterized by comprising:
initializing each local clustering center with an initial value, and calculating the distance information between each local cross-feature sample and each local clustering center as local clustering information;
performing secret sharing of the local clustering information with other participants, and receiving the local clustering information secret-shared by the other participants;
calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the result of the secret sharing;
performing secure multi-party computation with the other participants using the reference distance information fragments to determine the nearest collaborative clustering center of each clustering sample, and updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample;
and returning to the operation of calculating the local clustering information of each clustering sample based on the updated local clustering centers, until the clustering end condition is satisfied.
2. The method of claim 1, wherein calculating reference distance information fragments between each clustering sample and each collaborative clustering center based on the result of the secret sharing comprises:
summing the secret-shared fragment information of the local clustering information from the same clustering sample to the same collaborative clustering center, to obtain the reference distance information fragment between each clustering sample and each collaborative clustering center.
3. The method of claim 2, further comprising, before performing secure multi-party computation with the other participants using the reference distance information fragments:
receiving a random number set sent by a first auxiliary party and a local random number set fragment sent by the first auxiliary party in a secret sharing manner, wherein each random number in the random number set identifies one of the collaborative clustering centers; the first auxiliary party is a third party or any one of the plurality of participants;
wherein performing secure multi-party computation with the other participants using the reference distance information fragments to determine the nearest collaborative clustering center of each clustering sample comprises:
performing secure multi-party computation with the other participants using the reference distance information fragments, the random number set, and the local random number set fragment, to determine the nearest collaborative clustering center of each clustering sample.
4. The method of claim 3, wherein performing secure multi-party computation with the other participants using the reference distance information fragments, the random number set, and the local random number set fragment to determine the nearest collaborative clustering center of each clustering sample comprises determining, between any two collaborative clustering centers, the one nearest to a currently processed target clustering sample by:
calculating a distance information difference fragment from the target clustering sample to the two collaborative clustering centers according to the reference distance information fragments from the target clustering sample to the two collaborative clustering centers;
performing a first secure multi-party computation using the distance information difference fragment together with the data correspondingly held by the other participants, to obtain a corresponding distance information comparison result fragment;
performing a second secure multi-party computation, according to the distance information comparison result fragment, the reference distance information fragments, and the random number fragments in the local random number set fragment that identify the two collaborative clustering centers, together with the data correspondingly held by the other participants, to obtain a target reference distance information fragment and a target random number fragment corresponding to the nearest collaborative clustering center;
wherein the computation target of the second secure multi-party computation is to select, from the two collaborative clustering centers, the one nearest to the target clustering sample.
5. The method of claim 4, wherein performing secure multi-party computation with the other participants using the reference distance information fragments, the random number set, and the local random number set fragment to determine the nearest collaborative clustering center of each clustering sample further comprises:
performing secure multi-party computation with the other participants to recover the target random number from the target random number fragments;
and determining, based on the recovered target random number and the random number set, the collaborative clustering center identified by the target random number, as the nearest collaborative clustering center of the target clustering sample.
6. The method of claim 1, wherein updating the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample comprises:
obtaining, for the cluster corresponding to each nearest collaborative clustering center, all local cross-feature samples included in that cluster;
calculating, for each cluster, the sample average value of all its local cross-feature samples;
and updating the value of each local clustering center according to each sample average value.
7. The method of claim 1, wherein returning to the operation of calculating the local clustering information of each clustering sample based on the updated local clustering centers until the clustering end condition is satisfied comprises determining whether the clustering end condition is satisfied by:
calculating a cluster center update reference value fragment according to the threshold value fragment generated by a second auxiliary party and secret-shared with the participants, and the updated local clustering centers; the second auxiliary party is a third party or any one of the plurality of participants;
performing secure multi-party computation with the other participants using the cluster center update reference value fragments to determine a cluster center update reference value;
and determining whether the clustering end condition is satisfied according to the value of the cluster center update reference value.
8. A data clustering processing device based on federated learning, configured on one of a plurality of participant devices, wherein the local cross-feature samples held by each participant device are partial features of corresponding clustering samples to be clustered, the device being characterized by comprising:
a local clustering information calculation module, configured to initialize each local clustering center with an initial value and calculate the distance information between each local cross-feature sample and each local clustering center as local clustering information;
a local clustering information sharing module, configured to perform secret sharing of the local clustering information with other participants and receive the local clustering information secret-shared by the other participants;
a reference distance information fragment calculation module, configured to calculate reference distance information fragments between each clustering sample and each collaborative clustering center based on the result of the secret sharing;
a local clustering center updating module, configured to perform secure multi-party computation with the other participants using the reference distance information fragments, determine the nearest collaborative clustering center of each clustering sample, and update the value of each local clustering center according to the nearest collaborative clustering center of each clustering sample;
and a loop execution module, configured to return to the operation of calculating the local clustering information of each clustering sample based on the updated local clustering centers, until the clustering end condition is satisfied.
9. A computer device, characterized in that the computer device comprises:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the federated learning-based data clustering method of any one of claims 1-7.
10. A computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the federated learning-based data clustering method of any one of claims 1-7.
CN202110170456.2A 2021-02-08 2021-02-08 Data clustering processing method, device, equipment and medium based on federal learning Active CN112508203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110170456.2A CN112508203B (en) 2021-02-08 2021-02-08 Data clustering processing method, device, equipment and medium based on federal learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110170456.2A CN112508203B (en) 2021-02-08 2021-02-08 Data clustering processing method, device, equipment and medium based on federal learning

Publications (2)

Publication Number Publication Date
CN112508203A (en) 2021-03-16
CN112508203B (en) 2021-06-15

Family

ID=74952842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110170456.2A Active CN112508203B (en) 2021-02-08 2021-02-08 Data clustering processing method, device, equipment and medium based on federal learning

Country Status (1)

Country Link
CN (1) CN112508203B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949764A (en) * 2021-04-02 2021-06-11 深圳前海微众银行股份有限公司 Data clustering method, device, equipment and storage medium
CN113032833B (en) * 2021-04-14 2023-02-17 同盾控股有限公司 User query method and device, storage medium and electronic equipment
CN113536667B (en) * 2021-06-22 2024-03-01 同盾科技有限公司 Federal model training method, federal model training device, readable storage medium and federal model training device
CN113469370B (en) * 2021-06-22 2022-08-30 河北工业大学 Industrial Internet of things data sharing method based on federal incremental learning
CN113487351A (en) * 2021-07-05 2021-10-08 哈尔滨工业大学(深圳) Privacy protection advertisement click rate prediction method, device, server and storage medium
CN113537361B (en) * 2021-07-20 2024-04-02 同盾科技有限公司 Cross-sample feature selection method in federal learning system and federal learning system
CN113657525B (en) * 2021-08-23 2024-04-26 同盾科技有限公司 KMeans-based cross-feature federal clustering method and related equipment
CN117153312A (en) * 2023-10-30 2023-12-01 神州医疗科技股份有限公司 Multi-center clinical test method and system based on model average algorithm

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107145791A (en) * 2017-04-07 2017-09-08 哈尔滨工业大学深圳研究生院 A kind of K means clustering methods and system with secret protection
CN111222164A (en) * 2020-01-10 2020-06-02 广西师范大学 Privacy protection method for issuing alliance chain data

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
GB2583718A (en) * 2019-05-01 2020-11-11 Samsung Electronics Co Ltd Method, apparatus and computer program for updating a cluster probability model
CN110233730B (en) * 2019-05-22 2022-05-03 暨南大学 Privacy information protection method based on K-means clustering
CN112101579B (en) * 2020-11-18 2021-02-09 杭州趣链科技有限公司 Federal learning-based machine learning method, electronic device, and storage medium
CN112231760A (en) * 2020-11-20 2021-01-15 天翼电子商务有限公司 Privacy-protecting distributed longitudinal K-means clustering


Also Published As

Publication number Publication date
CN112508203A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112508203B (en) Data clustering processing method, device, equipment and medium based on federal learning
Ozfatura et al. Time-correlated sparsification for communication-efficient federated learning
US11409789B2 (en) Determining identity in an image that has multiple people
CN112204557A (en) System and method for automated decentralized multilateral transaction processing
US11341411B2 (en) Method, apparatus, and system for training neural network model
CN113537633B (en) Prediction method, device, equipment, medium and system based on longitudinal federal learning
CN111797999A (en) Longitudinal federal modeling optimization method, device, equipment and readable storage medium
CN112380495B (en) Secure multiparty multiplication method and system
CN111680322A (en) Data processing method and device based on secure multi-party computing and electronic equipment
US11843587B2 (en) Systems and methods for tree-based model inference using multi-party computation
CN113435365A (en) Face image migration method and device
CN110443690A (en) A kind of method, apparatus, server and the storage medium of variance data reconciliation
CN113591097A (en) Service data processing method and device, electronic equipment and storage medium
CN116432040B (en) Model training method, device and medium based on federal learning and electronic equipment
CN111737315B (en) Address fuzzy matching method and device
CN116629379A (en) Federal learning aggregation method and device, storage medium and electronic equipment
WO2023038930A1 (en) Systems and methods for averaging of models for federated learning and blind learning using secure multi-party computation
CN114492837A (en) Federal model training method and device
CN112434064A (en) Data processing method, device, medium and electronic equipment
CN113537361B (en) Cross-sample feature selection method in federal learning system and federal learning system
CN115828312B (en) Privacy protection method and system for social network of power user
CN115796305B (en) Tree model training method and device for longitudinal federal learning
CN113495982B (en) Transaction node management method and device, computer equipment and storage medium
US20240048519A1 (en) Anonymous message board server verification
CN112836868A (en) Joint training method and device for link prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant