WO2022105183A1

WO2022105183A1 - User clustering method, apparatus and device

Info

Publication number: WO2022105183A1
Application number: PCT/CN2021/097306
Authority: WO
Inventors: 王健宗; 李泽远
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-11-20
Filing date: 2021-05-31
Publication date: 2022-05-27
Also published as: CN112381163A; CN112381163B

Abstract

A user clustering method, apparatus and device, wherein the method comprises: a first data source determines a first cluster center, the first cluster center being any one of k pre-clustered cluster centers; the first data source calculates, according to feature data of a first user owned by the first data source and feature data of a user corresponding to the first cluster center owned by the first data source, a first distance estimation value between the first user and the first cluster center; the first data source generates, according to the first distance estimation value, at least one first feature number and sends same to a second data source among m data sources, and acquires the actual distance between the first user and the first cluster center according to the at least one first feature number and a second distance estimation value; the first data source clusters the first user according to the actual distance. The present invention allows the users of multiple data sources to be clustered without any data source sending its data out of its local environment, while keeping the clustering result accurate and reliable.

Description

A user clustering method, apparatus and device

This application claims priority to the Chinese patent application filed on November 20, 2020, with application number 202011307323.7 and titled "A User Clustering Method, Apparatus and Equipment", the entire contents of which are incorporated by reference in this application.

Technical field

The present application relates to the field of data mining, and in particular, to a user clustering method, apparatus and device.

Background

Clustering algorithms are now applied more and more widely. A typical use is to cluster a set of identical users held by several data sources, for example clustering the credit card users of several banks so that high-risk users can be identified, which helps commercial banks prevent and resolve credit card risk and improve credit card default risk management. When clustering users, calculating the distance between each user and each cluster center is a very important step.

The inventors realized that, at present, data sources do not share data with one another in order to protect the user feature data each of them holds. When a data source calculates a distance, it therefore refers only to its own user feature data and not to the user feature data held by other data sources, so the clustering result is inaccurate. Alternatively, each data source sends its own user feature data to a third-party server, which calculates the distance between the user and the cluster center from the user feature data of all data sources and thereby completes the clustering. In this approach each data source has to send its user feature data out of its local environment, so there is a risk of leaking the locally held user feature data.

SUMMARY OF THE INVENTION

The embodiments of the present application provide a user clustering method, apparatus and device. When users are clustered, the user feature data does not leave the local data source, so the data is not leaked and data security is ensured; at the same time, the user feature data held by each data source is combined in the distance calculation, which ensures the accuracy of the clustering result.

In a first aspect, a user clustering method is provided. The method is applicable to a communication system, where the communication system includes m data sources, where m is an integer greater than or equal to 2, and the method includes:

The first data source determines a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among n users to be classified, the first data source is any one of the m data sources, n is an integer, and n>1;

The first data source calculates a first distance estimation value between a first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;

The first data source generates at least one first feature number according to the first distance estimate value, and sends the at least one first feature number to a second data source among the m data sources;

The first data source obtains an actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

The first data source clusters the first user according to the actual distance.

In a second aspect, a user clustering device is provided, the device is applied to a first data source in a communication system, the communication system includes m data sources, m is an integer greater than or equal to 2, the user clustering The device includes:

A determination module, configured to determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among n users to be classified, the first data source is any one of the m data sources, n is an integer, and n>1;

A calculation module, configured to calculate a first distance estimation value between a first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;

a sending module, configured to generate at least one first feature number according to the first distance estimation value, and send the at least one first feature number to a second data source among the m data sources;

an acquisition module, configured to acquire the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

A clustering module, configured to cluster the first user according to the actual distance.

In a third aspect, a user clustering device is provided, including a processor, a memory, and an input/output interface, the processor, the memory, and the input/output interface are connected to each other, and the user clustering device is a first data source in a communication system, The communication system includes m data sources, where m is an integer greater than or equal to 2; wherein, the input and output interface is used for inputting or outputting data, and the memory is used for storing the application code for the user clustering device to execute the above method , the processor is configured to perform the following methods:

Determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among n users to be classified, the first data source is any one of the m data sources, n is an integer, and n>1;

Calculate a first distance estimation value between a first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;

generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;

Obtain the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

Cluster the first user according to the actual distance.

In a fourth aspect, a computer storage medium is provided, the computer storage medium is applied to a first data source in a communication system, the communication system includes m data sources, m is an integer greater than or equal to 2; the computer storage medium The medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the following methods:

The first user is clustered according to the actual distance.

In the embodiments of the present application, when the first data source calculates the distance from a user to a cluster center, it calculates a first distance estimation value according to the feature data of the user that it holds locally and the feature data of the user corresponding to the cluster center, generates at least one first feature number according to the first distance estimation value, and sends the feature number to the other, second data sources. The user feature data therefore does not leave the local data source, so it is not leaked and data security is ensured. In addition, the actual distance that the first data source uses for clustering is generated according to the feature number and a second distance estimation value, where the second distance estimation value is calculated from the feature data of the user held by a second data source and the feature data of the user corresponding to the cluster center. The present application can thus combine the user feature data held by every data source in the distance calculation, which ensures the accuracy of the clustering result.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

FIG. 1 is a schematic flowchart of a user clustering method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of obtaining the actual distance between a first user and a first cluster center provided by an embodiment of the present application;

FIG. 3 is another schematic flowchart of obtaining the actual distance between the first user and the first cluster center provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of another user clustering method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of the composition and structure of a user clustering apparatus provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a user clustering device provided by an embodiment of the present application.

Detailed description

In order to make those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

The terms "first", "second" and the like in the description and claims of the present application and the above drawings are used to distinguish different objects, rather than to describe a specific order. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or circuits is not limited to the listed steps or circuits, but optionally also includes unlisted steps or circuits, or optionally also includes other steps or circuits inherent to these processes, methods, products or devices.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

The technical solution of the present application relates to the technical field of artificial intelligence and/or big data, and is used for cluster analysis. For example, it can be applied to financial technology scenarios, such as clustering the credit card users of financial institutions, to improve the accuracy of clustering results. Optionally, the data involved in this application, such as user feature data and/or clustering results, may be stored in a database or in a blockchain, which is not limited in this application.

The solutions of the embodiments of the present application are applicable to clustering users when there are multiple data sources. In a communication system including m data sources, where m is an integer greater than or equal to 2, the first data source determines a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among n users to be classified, the first data source is any one of the m data sources, n is an integer, and n>1;

The first data source calculates a first distance estimation value between the first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;

The first data source obtains the actual distance between the first user and the first cluster center, where the actual distance is generated according to at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

It can be seen from the above that the embodiment of the present application clusters across multiple data sources, which is more accurate and reliable than clustering on a single data source. At the same time, what each participating data source contributes is not the data itself: it calculates the distance between the user and the cluster center from its local data and shares only feature numbers generated from that distance, where the feature numbers include the first feature number and the second feature number. The security of the data is thereby ensured.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a user clustering method provided by an embodiment of the present application. For ease of understanding, clustering the credit card users of financial institutions is used as an example here. Financial institutions place high demands on data confidentiality, which matches the property of the embodiments of the present application that data does not leave the local data source. The specific process includes:

S101: The first data source determines a first cluster center, where the first cluster center is any one of the pre-clustered k cluster centers.

The m financial institutions participating in the clustering are the m data sources, and the m data sources share n identical users, where m and n are integers, m≥2, and n>1. Each data source extracts behavior features of each of the n users to obtain user feature data, where the feature data includes at least one feature dimension and each feature dimension corresponds to one type of behavior feature of the user. When participating in the calculation, each data source provides different types of user features, and the total number of feature dimensions provided by all data sources is L.

One data source is selected from the m data sources as the main data source; the main data source can be any of the m data sources participating in the clustering. To ensure data security, in a possible implementation the main data source can also be a trusted data source. The main data source randomly selects k user IDs from the n users to be classified as the cluster centers, where each of the k cluster centers has dimension L and k is the number of clusters.

Since the user data provided by different data sources has different feature dimensions, the main data source partitions the cluster centers according to the feature dimensions of the user data provided by each data source and sends the user IDs, carrying the corresponding feature dimension identifiers, to the other data sources. The feature dimension identifier carried with a user ID matches the feature dimensions held by the receiving data source. After a data source receives the message from the main data source, it matches the feature dimensions of that user ID's data in its local data against the received feature dimension identifier to form a local cluster center.

Here, the main data source sends user IDs with different feature dimension identifiers to the other data sources so that the m data sources cluster around the same cluster centers. The message carries only the user ID and the dimension identifiers of the feature dimensions of the respective data source, not local data, which ensures that data does not leave the local data source and that the data remains secure.

The first data source is any data source among the m data sources; that is, it may be the above-mentioned main data source or any other of the m data sources. The first data source arbitrarily selects one of the k cluster centers as the first cluster center.

It should be understood that, since the information data in the m data sources needs to be combined for calculation, the first cluster center in each of the m data sources is the same cluster center.
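The selection of cluster centers described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `pick_center_ids` and `local_centers`, the in-memory dictionaries, and the sample user IDs and feature values are all hypothetical. The point it shows is that only user IDs cross between data sources, while each data source builds its local cluster centers from its own feature columns.

```python
import random

def pick_center_ids(user_ids, k, seed=None):
    """Main data source: randomly pick k user IDs to act as cluster centers."""
    rng = random.Random(seed)
    return rng.sample(sorted(user_ids), k)

def local_centers(local_features, center_ids):
    """Any data source: look up its own feature values for the chosen IDs."""
    return {uid: local_features[uid] for uid in center_ids}

# Hypothetical local tables; each data source holds different feature dimensions
# for the same users (2 dimensions at source A, 1 dimension at source B).
source_a = {"u1": [0.2, 1.0], "u2": [0.9, 0.1], "u3": [0.5, 0.5], "u4": [0.3, 0.8]}
source_b = {"u1": [3.0], "u2": [1.0], "u3": [2.0], "u4": [4.0]}

center_ids = pick_center_ids(source_a.keys(), k=2, seed=0)  # only IDs are shared
centers_a = local_centers(source_a, center_ids)  # stays at source A
centers_b = local_centers(source_b, center_ids)  # stays at source B
```

Both data sources end up with the same set of center IDs but with different local views of each center.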

S102: The first data source calculates a first distance estimation value between the first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source.

The first data source calculates the first distance estimation value between the first user and the first cluster center according to the feature data of the first user in its local data and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users.

In the same way as the first data source, each of the m-1 second data sources among the m data sources calculates a second distance estimation value between the first user and the first cluster center according to the feature data of the first user in its local data and the feature data of the user corresponding to the first cluster center, where the second data sources are the data sources other than the first data source among the m data sources.

That is, each data source participating in the clustering needs to calculate the distance between the first user and the first cluster center based on the local data.

Since multiple data sources jointly participate in the calculation in this solution, each data source provides different types of features of the first user, that is, different feature dimensions. The first distance estimation value is the distance between the first user and the first cluster center within the feature dimensions provided by the first data source, and the second distance estimation value is the distance between the first user and the first cluster center within the feature dimensions provided by the second data source.

It should be understood that the above-mentioned first distance estimation value and second distance estimation value are calculated based on the user characteristic data in the data source, and refer to the distance between the first user and the first cluster center in the data source, The distance is calculated by the distance function Dist, such as Euclidean distance, Manhattan distance, etc. Dist can be defined according to different application scenarios.
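As a rough illustration of the local distance estimate, the sketch below computes Dist over only the feature dimensions a single data source holds. The function names and sample vectors are hypothetical. Manhattan distance is used as the default purely as an assumption of this sketch: partial Manhattan distances over disjoint feature dimensions add up to the distance over all dimensions, which fits the later summation. For Euclidean distance the partial values would not add up directly, so squared partial distances would have to be combined instead.

```python
import math

def manhattan(x, c):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, c))

def euclidean(x, c):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def local_distance_estimate(user_vec, center_vec, dist=manhattan):
    """Distance between a user and a cluster center over local dimensions only."""
    if len(user_vec) != len(center_vec):
        raise ValueError("user and center must have the same local dimensions")
    return dist(user_vec, center_vec)

# Hypothetical local feature values held by one data source
user_local = [1.0, 4.0, 2.0]
center_local = [0.0, 1.0, 3.0]

d_manhattan = local_distance_estimate(user_local, center_local)
d_euclidean = local_distance_estimate(user_local, center_local, dist=euclidean)
```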

S103: The first data source generates at least one first feature number according to the first distance estimation value and sends it to the second data source among the m data sources, and obtains the actual distance between the first user and the first cluster center according to the at least one first feature number and the second distance estimation value.

The embodiments of the present application provide two methods for obtaining the actual distance between the first user and the first cluster center.

Method 1: the split-sum method, which includes the following steps:

1. Split the first distance estimate to obtain m-1 first feature numbers.

For convenience of description, the above-mentioned first distance estimation value is denoted as $d^{(1)}$, which represents the distance between the first user and the first cluster center in the first data source. $d^{(1)}$ is randomly split into m-1 random numbers $r^{(1)}_{1}, r^{(1)}_{2}, \ldots, r^{(1)}_{m-1}$ such that

$$\sum_{i=1}^{m-1} r^{(1)}_{i} = d^{(1)}$$

That is, the sum of the m-1 random numbers is the first distance estimation value, and the m-1 random numbers $r^{(1)}_{1}, \ldots, r^{(1)}_{m-1}$ are the above-mentioned m-1 first feature numbers.

Similarly to the first data source, the j-th second data source denotes its second distance estimation value as $d^{(2)}_{j}$, which represents the distance between the first user and the first cluster center in the j-th second data source, where j is an integer and 1≤j≤m-1. $d^{(2)}_{j}$ is randomly split into m-1 random numbers $r^{(2)}_{j,1}, r^{(2)}_{j,2}, \ldots, r^{(2)}_{j,m-1}$ such that

$$\sum_{i=1}^{m-1} r^{(2)}_{j,i} = d^{(2)}_{j}$$

The m-1 random numbers $r^{(2)}_{j,1}, \ldots, r^{(2)}_{j,m-1}$ are the m-1 second feature numbers.

It should be understood that, since the above-mentioned first data source can be any of the m data sources, the first data source and the m-1 second data sources are named here only to distinguish the data sources; there is no essential difference in the processing they perform.

That is, each of the m data sources generates m-1 feature numbers from the distance between the first user and the first cluster center calculated from its local data. The feature numbers are first feature numbers if the sender is the first data source and second feature numbers if the sender is a second data source.

For ease of understanding, the steps are described with reference to FIG. 2. FIG. 2 includes three data sources, data source 1, data source 2 and data source 3, so m is 3, and step 1 corresponds to the "random split" part of FIG. 2. The values 5, 7 and 9 in the figure are the distances between the first user and the first cluster center calculated from local data in the respective data sources. As shown in the figure, 5 in data source 1 is split into 2 and 3, 7 in data source 2 is split into 3 and 4, and 9 in data source 3 is split into -1 and 10. In step 1, if data source 1 is the first data source, then data source 2 and data source 3 are both second data sources; if data source 2 is the first data source, then data source 1 and data source 3 are both second data sources; if data source 3 is the first data source, then data source 1 and data source 2 are both second data sources. Clearly, there is no essential difference in the processing performed by each of the m data sources in step 1.

It should be understood that the splitting situation shown in the figure is only one of many situations, and in fact, the splitting process in step 1 is random.
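The random split of step 1 can be sketched as below. This is one possible way to produce m-1 numbers that sum to the local distance estimate, not necessarily the procedure intended by the applicant: the function name `random_split`, the `spread` parameter, and the use of a uniform distribution are assumptions of the sketch. As in FIG. 2, individual feature numbers may be negative.

```python
import random

def random_split(value, parts, spread=10.0, rng=None):
    """Split `value` into `parts` random numbers whose sum equals `value`.

    The first parts-1 numbers are drawn uniformly from [-spread, spread];
    the last number is whatever is needed to make the sum come out right.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    rng = rng or random.Random()
    shares = [rng.uniform(-spread, spread) for _ in range(parts - 1)]
    shares.append(value - sum(shares))
    return shares

# Data source 3 in FIG. 2 holds the local distance 9 and, with m = 3,
# splits it into m-1 = 2 feature numbers (the figure shows -1 and 10).
m = 3
feature_numbers = random_split(9.0, parts=m - 1, rng=random.Random(42))
```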

2. Send the generated feature numbers.

The first data source sends its m-1 first feature numbers respectively to the m-1 second data sources other than the first data source among the m data sources, where one first feature number is sent to each second data source.

Similarly to the processing in the first data source, each second data source sends its m-1 second feature numbers to the data sources other than itself among the m data sources, where one second feature number is sent to each data source.

That is, each of the m data sources sends the m-1 feature numbers generated above respectively to the other data sources among the m data sources. The feature numbers are first feature numbers if the sender is the first data source and second feature numbers if the sender is a second data source.

Referring to FIG. 2, data source 1 sends the random numbers 2 and 3 obtained by splitting to data source 2 and data source 3 respectively; data source 2 sends the random numbers 3 and 4 obtained by splitting to data source 1 and data source 3 respectively; data source 3 sends the random numbers -1 and 10 obtained by splitting to data source 1 and data source 2 respectively. If data source 1 is the first data source, 2 and 3 in the figure are the first feature numbers, 3 and 4 are the second feature numbers generated by data source 2 as a second data source, and -1 and 10 are the second feature numbers generated by data source 3 as a second data source; if data source 2 is the first data source, then 3 and 4 are the first feature numbers; if data source 3 is the first data source, then -1 and 10 are the first feature numbers.

Obviously, there is no essential difference in the processing performed by each of the m data sources in the second step.
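The exchange in step 2 can be simulated in a single process as follows. The function name `route_shares` and the inbox dictionary are hypothetical, and real data sources would exchange these values over a network rather than through a shared data structure. Indices 0, 1 and 2 stand for data sources 1, 2 and 3 of FIG. 2.

```python
def route_shares(shares_by_source):
    """Deliver each sender's m-1 feature numbers to the m-1 other data sources.

    shares_by_source[i] lists the feature numbers of data source i, ordered by
    receiver index with the sender itself skipped.
    """
    m = len(shares_by_source)
    inbox = {i: [] for i in range(m)}
    for sender, shares in enumerate(shares_by_source):
        receivers = [r for r in range(m) if r != sender]
        if len(shares) != len(receivers):
            raise ValueError("each data source must send exactly m-1 numbers")
        for receiver, share in zip(receivers, shares):
            inbox[receiver].append(share)
    return inbox

# Feature numbers from FIG. 2: 5 -> (2, 3), 7 -> (3, 4), 9 -> (-1, 10)
fig2_shares = [[2, 3], [3, 4], [-1, 10]]
inbox = route_shares(fig2_shares)
```

With the FIG. 2 values, data source 1 ends up holding 3 and -1, data source 2 holds 2 and 10, and data source 3 holds 3 and 4; no data source receives enough feature numbers from any single sender to recover that sender's local distance.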

3. The first data source receives m-1 second feature numbers from m-1 second data sources, wherein one second feature number comes from one second data source.

The first data source receives m-1 second feature numbers from the m-1 second data sources and calculates the sum of the m-1 second feature numbers to obtain a first accumulated value. The m data sources include a main data source and slave data sources. If the first data source is the main data source among the m data sources, it stores the first accumulated value; if the first data source is a slave data source among the m data sources, it sends the first accumulated value to the main data source.

Similarly to the processing in the first data source, each second data source also receives m-1 feature numbers, which are first feature numbers if the sender is the first data source and second feature numbers if the sender is a second data source, and sums the received feature numbers to obtain a second accumulated value.

That is, each of the m data sources receives m-1 feature numbers, which are first feature numbers if the sender is the first data source and second feature numbers if the sender is a second data source. The data source sums the received feature numbers; if it is the main data source, it stores the summation result, and if it is a slave data source, it sends the summation result to the main data source.

Referring to FIG. 2, data source 1 receives 3 and -1 and obtains the sum 2; data source 2 receives 2 and 10 and obtains the sum 12; data source 3 receives 3 and 4 and obtains the sum 7. If data source 1 is the first data source, the m-1 second feature numbers it receives are 3 and -1, and their sum 2 is the first accumulated value; the sum 12 obtained by data source 2 is the second accumulated value of data source 2 as a second data source, and the sum 7 obtained by data source 3 is the second accumulated value of data source 3 as a second data source. Similarly, if data source 2 or data source 3 is the first data source, the first accumulated value is 12 or 7 respectively. In every case, each of the m data sources receives m-1 feature numbers and sums them; the result is the first accumulated value if the summing data source is the first data source and a second accumulated value if it is a second data source.

As shown in Figure 2, at this time, data source 1 is the main data source, data source 1 stores the data obtained by the summation, data source 2 and data source 3 are slave data sources, and the summed 12 and 7 are sent to the main data source, i.e. data source 1.

It should be understood that, since users have different feature dimensions in each data source, the specific summation operation in the embodiments of the present application takes the summation of data of different dimensions into account and is not an ordinary addition of numerical values; the summation shown in FIG. 2 is simplified only for convenience of explanation.
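Step 3 can be sketched as below, using plain scalar addition as in the simplified FIG. 2 example. The function name `collect_at_main` and the choice of index 0 for the main data source are assumptions of this sketch; a dimension-aware summation, as mentioned above, is not modeled.

```python
def collect_at_main(inboxes, main=0):
    """Sum the feature numbers received by every data source.

    Returns the accumulated value that the main data source stores locally and
    a dict of the accumulated values that the slave data sources forward to it.
    """
    sums = {source: sum(received) for source, received in inboxes.items()}
    stored = sums[main]
    forwarded = {source: s for source, s in sums.items() if source != main}
    return stored, forwarded

# Feature numbers received in FIG. 2 (indices 0, 1, 2 are data sources 1, 2, 3)
inboxes = {0: [3, -1], 1: [2, 10], 2: [3, 4]}
stored, forwarded = collect_at_main(inboxes, main=0)
```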

Fourth, obtain the actual distance between the first user and the first cluster center.

If the first data source is the main data source among the m data sources, the first data source receives m-1 second accumulated values from the m-1 second data sources, and calculates the actual distance between the first user and the first cluster center according to the locally stored first accumulated value and the m-1 received second accumulated values. After the actual distance between the first user and the first cluster center is obtained, the main data source sends the actual distance to the slave data sources among the m data sources.

With reference to Figure 2, data source 1 is the main data source. Data source 1 receives the second accumulated value 12 sent by data source 2 and the second accumulated value 7 sent by data source 3, both of which are slave data sources; 12 and 7 are the two second accumulated values received by data source 1 as the main data source. Data source 1 adds these two second accumulated values to its locally stored first accumulated value 2 to obtain 21, which is equivalent to the actual distance in the embodiment of the present application.
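
The splitting, summing and aggregation steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `split_estimate` and `split_and_sum`, the use of scalar distance estimates, and the way the shares are drawn at random are all hypothetical. The last line simply re-adds the values shown in Figure 2.

```python
import random


def split_estimate(estimate, parts, rng):
    """Split a local distance estimate into `parts` feature numbers that sum to it."""
    # Assumption: shares are drawn uniformly at random; the source does not say how.
    shares = [rng.uniform(-10.0, 10.0) for _ in range(parts - 1)]
    shares.append(estimate - sum(shares))  # the last share makes the total exact
    return shares


def split_and_sum(estimates, seed=0):
    """Simulate the exchange among m data sources (index 0 plays the main data source)."""
    rng = random.Random(seed)
    m = len(estimates)
    received = [[] for _ in range(m)]
    for sender, estimate in enumerate(estimates):
        others = [i for i in range(m) if i != sender]
        # One feature number goes to each of the other m-1 data sources.
        for receiver, share in zip(others, split_estimate(estimate, m - 1, rng)):
            received[receiver].append(share)
    # Each data source sums the m-1 feature numbers it received.
    accumulated = [sum(shares) for shares in received]
    # The slaves send their accumulated values to the main data source, which adds them up.
    return accumulated, sum(accumulated)


# Values from Figure 2: data source 1 holds 2, data sources 2 and 3 send 12 and 7.
figure2_total = sum([3, -1]) + sum([2, 10]) + sum([3, 4])
```

No data source ever sees another data source's full estimate in this sketch; it only sees one randomly chosen share from each peer.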

It should be understood that, since users in each data source have different feature dimensions, when performing a specific summation operation, data summation methods of different dimensions are considered, rather than the ordinary summation of values.

Obviously, the only one of the m data sources that differs from the other data sources is the primary data source.

It can be seen from the above summation method that, first of all, each of the m data sources participates in the calculation and the amount of calculation in each data source is roughly the same, so even the main data source does not need high computing power. The embodiment of the present application is therefore scalable: when necessary, more data sources can be added to the clustering to make the results more reliable and accurate without placing too great a burden on the main data source.

Secondly, each data source participating in the clustering can receive the result of the summation, while its data never leaves the local site. Resources are thus shared on the premise that the security of the data and information is ensured.

Finally, since every data source participating in the clustering takes part in the calculation, the whole system has the computing power of multiple computers and therefore a faster processing speed.

Method 2: Random number summation method, the random number summation method includes the following steps:

1. The main data source generates feature numbers and sends them.

The m data sources include a master data source and slave data sources. The m data sources are arranged in the order of data transmission, with the master data source arranged first, and each of the m data sources has a random number. If the first data source is the main data source among the m data sources, the first data source generates a first feature number according to the first random number it possesses and the first distance estimate, and sends the first feature number to the second data source that is arranged after the first data source and closest to it.

With reference to Figure 3, data source 1, as the first data source, is the main data source and is arranged first. The first distance estimate calculated by data source 1 from its local data is D1, and V1 is the first random number possessed by data source 1. Data source 1 adds D1 and V1 to obtain the first feature number C1, and sends C1 to data source 2, which is arranged after data source 1. Here, data source 2 is the second data source arranged after the first data source and closest to it.

2. The slave data source generates a feature number and sends it.

If the first data source is a slave data source among the m data sources, the first data source receives a second feature number from the second data source that is arranged before the first data source and closest to it. The second feature number is generated according to at least one second distance estimate and at least one second random number, where each second distance estimate comes from one of the second data sources arranged before the first data source, and each second random number comes from one of the second data sources arranged before the first data source. The first data source generates a first feature number according to the second feature number, the first random number possessed by the first data source, and the first distance estimate, and sends the first feature number to the second data source arranged after the first data source.

With reference to Figure 3, data source 2 is a slave data source. If data source 2 is the first data source, the other data sources in Figure 3 are second data sources, and data source 1 is a second data source arranged before the first data source. Data source 2 receives the second feature number C1 from data source 1; C1 is generated according to the second distance estimate of data source 1 as a second data source and the second random number V1 possessed by data source 1. Data source 2 generates the first feature number C2 according to the received second feature number C1, the first random number V2 it possesses as the first data source, and its own first distance estimate, and sends C2 to data source 3. In the figure, data source 3 is the second data source arranged after data source 2 as the first data source. Since data source 2, data source 3, data source 4 and data source 5 are all slave data sources, the cases in which data source 3, data source 4 or data source 5 is the first data source are similar to the case in which data source 2 is the first data source; they can be understood with reference to FIG. 3 and the above description, and are not repeated here.

3. The main data source obtains the actual distance between the first user and the first cluster center.

If the first data source is the main data source among the m data sources, the first data source receives a second feature number from the last-arranged second data source, and determines the actual distance between the first user and the first cluster center according to this second feature number. This second feature number is generated according to the first feature number, m-1 second distance estimates and m-1 second random numbers, where each second distance estimate comes from one second data source among the m data sources, and each second random number comes from one second data source among the m data sources.

As shown in FIG. 3, data source 1 is the main data source. If data source 1, as the main data source, is the first data source, data source 1 receives the second feature number C5 sent by data source 5 as a second data source. The main data source, i.e. data source 1, processes the received second feature number C5 to determine the actual distance between the first user and the first cluster center. Specifically, the sum of the random numbers V1, V2, V3, V4 and V5 is subtracted from C5. After the actual distance is determined, data source 1 sends the actual distance to the slave data sources among the m data sources. Here, data source 5 is the second data source arranged last.

Each of the m data sources has a random number, and the random number can be obtained in the following two ways:

1. The random numbers are sent by the main data source. The main data source randomly generates m random numbers, records the sum of the m random numbers, and randomly distributes the m random numbers to the m data sources including itself, with one random number sent to each data source.

2. Each data source generates its random number locally. In this case, after a data source generates its random number, it needs to send the random number to the main data source so that the main data source can obtain the sum of the m random numbers.

4. The slave data source obtains the actual distance between the first user and the first cluster center.

If the first data source is a slave data source among the m data sources, the first data source receives the actual distance between the first user and the first cluster center from the master data source among the m data sources.

As shown in Figure 3, data source 1 is the master data source, and data source 2, data source 3, data source 4 and data source 5 are slave data sources. If any one of data source 2, data source 3, data source 4 and data source 5 is the first data source, the first data source obtains the actual distance between the first user and the first cluster center by receiving it from data source 1.

In the above random number summation method, each data source calculates the distance estimate between the first user and the first cluster center from its local data and adds a random number to generate a feature number. What is passed between the data sources is the feature number rather than the data itself, which further guarantees the security of the local data; here the feature number refers to the first feature number or the second feature number. In the above method, each of the m data sources adds its own random number to the sum after receiving a feature number, so that even if two adjacent data sources collude, they cannot obtain the raw data passed between the data sources.
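
The random number summation method can be sketched roughly as below. This is a hedged illustration only: `ring_sum` and `make_masks` are hypothetical names, the data sources are simulated as list positions (index 0 being the main data source), and the random number range is an assumption. `make_masks` corresponds to the first way of obtaining random numbers, in which the main data source draws all m of them.

```python
import random


def make_masks(m, seed=0):
    """The main data source draws m random numbers, one per data source."""
    rng = random.Random(seed)
    # Assumption: integers in an arbitrary range; the source does not fix a range.
    return [rng.randint(-1000, 1000) for _ in range(m)]


def ring_sum(estimates, masks):
    """Pass a masked running total from data source to data source."""
    # Main data source: C1 = D1 + V1.
    feature_number = estimates[0] + masks[0]
    # Each slave adds its own distance estimate and random number to what it received.
    for estimate, mask in zip(estimates[1:], masks[1:]):
        feature_number = feature_number + estimate + mask
    # The main data source knows the sum of all random numbers and subtracts it.
    return feature_number - sum(masks)


# Hypothetical local distance estimates for five data sources, as in Figure 3.
actual_distance = ring_sum([4, 6, 1, 7, 3], make_masks(5))
```

Each intermediate feature number in the loop equals the true partial sum shifted by the random numbers added so far, which is why a single receiver cannot read off its predecessor's estimate.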

When only two data sources participate in the clustering and the actual distance is to be obtained with the above two methods, consider sending the distance estimate of the slave data source directly to the main data source; this reduces the computational complexity while still providing a certain guarantee for the security of the local data in the data source.

S104: The first data source clusters the first users according to the actual distance.

In steps S101 to S103, each of the m data sources obtains the actual distance between the first user and the first cluster center. Since the first cluster center is any one of the above k cluster centers, a different first cluster center can be selected each time to obtain the actual distances between the first user and each of the k cluster centers. The minimum of the k actual distances is found by comparison, and the first user is assigned to the cluster corresponding to that minimum.
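
As a small sketch of this assignment step, assuming the k actual distances have already been obtained and are held in a list (the name `assign_cluster` and the sample values are hypothetical):

```python
def assign_cluster(actual_distances):
    """Return the index of the cluster center with the smallest actual distance."""
    # Ties go to the lowest index; the source does not specify a tie-breaking rule.
    return min(range(len(actual_distances)), key=lambda j: actual_distances[j])


# Hypothetical actual distances from the first user to k = 3 cluster centers.
cluster_index = assign_cluster([21.0, 8.5, 13.0])
```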

Further, referring to FIG. 4 , FIG. 4 is a schematic flow chart of completing the final classification through multiple rounds of classification according to an embodiment of the present application. After the above-mentioned first users are clustered, the clustering of all users can be completed through multiple rounds of iterations. The specific process includes:

S201, the main data source selects k cluster centers and distributes them to each data source. For details, refer to the process by which the main data source selects the cluster centers in step S101.

S202, each data source calculates an estimated value of the distance from the user to the cluster center in the local data.

With reference to the step of obtaining the distance estimate between the first user and the first cluster center in step S102, by varying the choice of the first user and the first cluster center, each data source can obtain the distance estimate from any user to any cluster center in its local data.
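
The local calculation can be illustrated with the sketch below. It assumes Manhattan distance as the distance function and assumes that the data sources hold disjoint feature dimensions of the same user; under those assumptions the local estimates add up exactly to the distance over all dimensions (with Euclidean distance the same additivity holds for squared distances). The function name and all feature values are hypothetical.

```python
def local_distance_estimate(user_features, center_features):
    """Manhattan distance over the feature dimensions held by one data source."""
    return sum(abs(u - c) for u, c in zip(user_features, center_features))


# Hypothetical data: each data source holds different feature dimensions of the
# same user and of the user that acts as the cluster center.
user_by_source = [[1.0, 4.0], [2.0], [0.5, 3.0, 7.0]]
center_by_source = [[2.0, 1.0], [5.0], [0.5, 1.0, 4.0]]

estimates = [
    local_distance_estimate(u, c) for u, c in zip(user_by_source, center_by_source)
]
```

In this example the three estimates are 4.0, 3.0 and 5.0, and only these values (or feature numbers derived from them) would need to leave the data sources.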

S203, obtain the actual distances from all users to each cluster center. With reference to the method of obtaining the actual distance between the first user and the first cluster center in step S103, the actual distance between any user and any cluster center can be obtained.

S204, classify the user into a cluster class with the smallest actual distance from the cluster center.

With reference to the method for clustering the first user in step S104, by selecting different users, the first-round classification of each of the n users can be obtained.

S205, each data source locally averages the points classified into a certain cluster as a new cluster center.

After the classification of the n users is obtained, for the users classified into a certain cluster, the first data source first locally obtains the number of users classified into this cluster and the sums of the data of these users in each dimension. Then, according to the summation results in the different dimensions and the above number of users, the average value of the data of the users in this cluster in each dimension is calculated and used as the new cluster center; doing so for every cluster yields the k new cluster centers.

For ease of understanding, the above method of calculating new cluster centers is illustrated with an example. Suppose the coordinates of the points classified into a certain cluster in a certain data source are (2,3), (6,2) and (1,1). The number of points in the cluster is 3, and summing the three points dimension by dimension gives 9 and 6. Dividing by the number 3, the averages in the two dimensions are 3 and 2, respectively, so the cluster center for the next round is (3,2).
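
The same calculation can be written as a short sketch; the function name `new_cluster_center` is hypothetical, and the input is the example given above.

```python
def new_cluster_center(points):
    """Average the points of one cluster dimension by dimension."""
    count = len(points)
    # zip(*points) groups the coordinates by dimension before summing.
    sums = [sum(values) for values in zip(*points)]
    return tuple(total / count for total in sums)


# Example from the text: sums are 9 and 6, the count is 3.
center = new_cluster_center([(2, 3), (6, 2), (1, 1)])  # (3.0, 2.0)
```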

S206, each data source calculates the sum of the moving distances of the original center point and the current center point.

The first data source calculates the distance between each of the k new cluster centers and the corresponding cluster center in the previous round of clustering, and records each such distance as a center distance. This yields k center distances, which are the distances moved by the cluster centers. As with the calculation of distance estimates from local data, these distances are calculated by the distance function Dist, such as Euclidean distance or Manhattan distance; Dist can be defined according to the application scenario. The first data source sums the k center distances to obtain the first center distance, that is, the sum of the distances moved by the k cluster centers.

Similarly to the first data source, the second data source calculates the distance between each of the k new cluster centers and the corresponding cluster center in the previous round of clustering, and records each such distance as a center distance. This yields k center distances, which the second data source sums to obtain the second center distance.

That is, each of the m data sources will obtain the sum of k center distances based on the k new cluster centers and the original k cluster centers.
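
A minimal sketch of this local step, assuming Euclidean distance as one possible choice of Dist; the function names and the sample centers in the usage note are hypothetical.

```python
import math


def euclidean(a, b):
    """One possible choice for the distance function Dist."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_distance_sum(old_centers, new_centers, dist=euclidean):
    """Sum the k center distances between corresponding old and new cluster centers."""
    return sum(dist(old, new) for old, new in zip(old_centers, new_centers))
```

For instance, `center_distance_sum([(0, 0), (1, 1)], [(3, 4), (1, 1)])` gives 5.0, since the first center moves by 5 and the second does not move. A different Dist, such as Manhattan distance, can be passed through the `dist` parameter.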

S207, obtain the actual moving distance sum of the original cluster center and the new cluster center.

Since the cluster centers change in different dimensions in the m data sources, obtaining the actual distance moved by the k cluster centers requires summing the moving distances in the different dimensions across the m data sources. The summation can use the above-mentioned splitting and summing method or the random number summation method; for the specific steps, refer to the two methods for obtaining the actual distance between the first user and the first cluster center, together with Figure 2 and Figure 3. When these two methods are applied, the first center distance plays the role of the first distance estimate, and the second center distance plays the role of the second distance estimate.

S208, compare the actual distance moved by the k cluster centers and the size of the termination threshold.

When the actual distance moved by the above k cluster centers is greater than the termination threshold, the data source performs the next round of clustering on the n users according to the new cluster centers. The specific steps of the next round of clustering are the same as those of the first round, and a new clustering result is obtained.

When the actual distance moved by the above k cluster centers is less than the termination threshold, the clustering is stopped, and the clustering result at this time is taken as the final clustering result.

The value of the above termination threshold is set as needed, generally between 0 and 10^-5. If the actual distance moved by the above k cluster centers is less than the termination threshold, the k cluster centers are considered not to have moved, and the clustering result will no longer change.
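
The termination check in S208 can be sketched as a simple loop. The names `should_stop` and `cluster_until_stable`, the callback `run_round` (assumed to perform one round of clustering and return the actual distance moved by the k cluster centers), the default threshold of 1e-5 and the `max_rounds` safety limit are all assumptions made for illustration.

```python
def should_stop(moved_distance, termination_threshold=1e-5):
    """Stop once the k cluster centers have moved less than the termination threshold."""
    # The source does not say what happens when the two values are equal;
    # this sketch continues clustering in that case.
    return moved_distance < termination_threshold


def cluster_until_stable(run_round, termination_threshold=1e-5, max_rounds=100):
    """Repeat clustering rounds until the centers stop moving; return the round count."""
    rounds = 0
    while rounds < max_rounds:
        moved = run_round()  # one full round: S202 to S207
        rounds += 1
        if should_stop(moved, termination_threshold):
            break
    return rounds
```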

The methods of the embodiments of the present application are described above, and the devices of the embodiments of the present application are described below.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a user clustering apparatus provided by an embodiment of the present application. The apparatus 50 includes:

A determination module 501, configured to determine a first cluster center, where the first cluster center is any one of the pre-clustered k cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among the n users to be classified, the first data source is any one of the m data sources, n is an integer, and n>1; in the main data source, the determination module 501 is further configured to determine the pre-clustered k cluster centers;

A calculation module 502, configured to calculate the first distance estimation value between the first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any user of the n users except the k users;

A sending module 503, configured to generate at least one first feature number according to the first distance estimation value, and send the at least one first feature number to a second data source in the m data sources;

An obtaining module 504, configured to obtain the actual distance between the first user and the first cluster center, wherein the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

A clustering module 505, configured to cluster the first user according to the actual distance.

In a possible design, the sending module 503 is configured to: split the first distance estimation value to obtain m-1 first feature numbers, and send the m-1 first feature numbers respectively to the m-1 second data sources other than the first data source among the m data sources, wherein one first feature number is sent to one second data source.

In a possible design, the m data sources include a master data source and a slave data source; the obtaining module 504 is configured to receive m-1 second feature numbers from the m-1 second data sources, wherein one second feature number comes from one second data source and the second feature number is generated according to the second distance estimation value; the obtaining module 504 is further configured to calculate the sum of the m-1 second feature numbers to obtain a first accumulated value; if the first data source is the main data source among the m data sources, the obtaining module 504 is configured to receive m-1 second accumulated values from the m-1 second data sources, and calculate the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein one second accumulated value comes from one second data source and the second accumulated value is calculated according to the first feature number.

In a possible design, the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission, the master data source is arranged first, and each of the m data sources has a random number; if the first data source is the main data source among the m data sources, the sending module 503 is configured to generate a first feature number according to the first random number possessed by the first data source and the first distance estimation value, and send the first feature number to the second data source that is arranged after the first data source and closest to the first data source.

In a possible design, the obtaining module 504 is further configured to receive a second feature number from the last-arranged second data source, and determine the actual distance between the first user and the first cluster center according to the second feature number, where the second feature number is generated according to the first feature number, m-1 second distance estimation values and m-1 second random numbers, one second distance estimation value comes from one second data source among the m data sources, and one second random number comes from one second data source among the m data sources.

In a possible design, the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission, the master data source is arranged first, and each of the m data sources has a random number; if the first data source is a slave data source among the m data sources, the sending module 503 is configured to receive a second feature number from the second data source that is arranged before the first data source and closest to the first data source, where the second feature number is generated according to at least one second distance estimation value and at least one second random number, one second distance estimation value comes from one of the second data sources arranged before the first data source, and one second random number comes from one of the second data sources arranged before the first data source; the sending module 503 is further configured to generate a first feature number according to the second feature number, the first random number possessed by the first data source, and the first distance estimation value, and send the first feature number to the second data source arranged after the first data source.

In a possible design, if the first data source is a slave data source among the m data sources, the obtaining module 504 is further configured to receive, from the master data source among the m data sources, the actual distance between the first user and the first cluster center.

Further, in a possible design, the device further includes: a cluster center moving distance obtaining module 506, configured to obtain the actual distance moved by the k cluster centers with reference to the method for obtaining the actual distance between the first user and the first cluster center.

The device also includes: a termination module 507, configured to compare the actual distance moved by the k cluster centers with the termination threshold. When the actual distance moved is greater than the termination threshold, the next round of clustering is performed on the n users according to the new cluster centers; the specific steps of the next round of clustering are the same as those of the first round, and a new clustering result is obtained.

It should be noted that, for the content not mentioned in the embodiment corresponding to FIG. 5 , reference may be made to the description of the method embodiment, which will not be repeated here.

In the embodiment of the present application, after the cluster centers are determined, each data source locally calculates the distance between the user and the cluster center in its own dimensions, that is, the distance estimation value, and the feature numbers generated from the distance estimation values are transmitted between the data sources to determine the actual distance between the user and the cluster center, so that the users are clustered according to the actual distance. The embodiment of the present application ensures that the data does not leave the local site and that the detailed information of the data is not leaked, and securely combines the data of multiple parties to cluster the users. At the same time, all participants can perform calculations simultaneously, so the entire system has the computing power of multiple computers and a faster processing speed.

Referring to FIG. 6 , FIG. 6 is a schematic structural diagram of a user clustering device provided by an embodiment of the present application. The device 60 includes a processor 601 , a memory 602 , and an input and output interface 603 . The processor 601 is connected to the memory 602 and the input-output interface 603, for example, the processor 601 can be connected to the memory 602 and the input-output interface 603 through a bus.

The processor 601 is configured to support the user clustering device in performing the corresponding functions in the user clustering method described in FIG. 1 to FIG. 2 and FIG. 4. The processor 601 may be a central processing unit (CPU), a network processor (NP), a hardware chip or any combination thereof. The above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL) or any combination thereof.

The memory 602 is used to store program code and the like. The memory 602 may include volatile memory, such as random access memory (RAM); the memory 602 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 602 may also include a combination of the above types of memory.

The input/output interface 603 is used for inputting or outputting data.

Processor 601 may invoke the program code to perform the following operations:

determining a first cluster center, where the first cluster center is any one of the pre-clustered k cluster centers;

calculating the first distance estimation value between the first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source;

generating at least one first feature number according to the first distance estimation value, sending it to the second data source among the m data sources, and obtaining the actual distance between the first user and the first cluster center according to the at least one first feature number and the second distance estimation value;

clustering the first user according to the actual distance.

It should be noted that, the implementation of each operation may also refer to the corresponding description of the foregoing method embodiments; the processor 601 may also cooperate with the input/output interface 603 to perform other operations in the foregoing method embodiments.

Embodiments of the present application further provide a computer storage medium, where the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a computer, cause the computer to execute the method described in the foregoing embodiments. The computer may be part of the above-mentioned user clustering device, for example, the above-mentioned processor 601.

Optionally, the storage medium involved in this application may be a readable storage medium. Further optionally, the storage medium involved in this application may be non-volatile or volatile.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above-mentioned methods. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

The above disclosures are only the preferred embodiments of the present application, and of course, the scope of the rights of the present application cannot be limited by this. Therefore, equivalent changes made according to the claims of the present application are still within the scope of the present application.

Claims

A user clustering method, wherein the method is applicable to a communication system, the communication system includes m data sources, m is an integer greater than or equal to 2, and the method includes:

The first data source determines a first cluster center, the first cluster center is any one of the pre-clustered k cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among the n users to be classified, the first data source is any one of the m data sources, n is an integer, and n>1;

The first data source calculates a first distance estimate value between the first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any user of the n users except the k users;

The first data source generates at least one first feature number according to the first distance estimate value, and sends the at least one first feature number to a second data source among the m data sources;

The first data source obtains an actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate value, and the second distance estimate value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

The first data source clusters the first user according to the actual distance.
The method of claim 1, wherein the first data source generating at least one first feature number according to the first distance estimate value and sending the at least one first feature number to a second data source among the m data sources includes:

The first data source splits the first distance estimation value to obtain m-1 first feature numbers;

The first data source sends the m-1 first feature numbers respectively to the m-1 second data sources other than the first data source among the m data sources, wherein one first feature number is sent to one second data source.
The method of claim 2, wherein the m data sources include a master data source and a slave data source, and the first data source obtaining the actual distance between the first user and the first cluster center comprises:

The first data source receives m-1 second feature numbers from the m-1 second data sources, wherein each second feature number comes from one second data source and is generated according to the second distance estimation value;

The first data source calculates the sum of the m-1 second feature numbers to obtain a first accumulated value;

If the first data source is the master data source among the m data sources, the first data source receives m-1 second accumulated values from the m-1 second data sources, and calculates the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein each second accumulated value comes from one second data source and is calculated according to the first feature number;

If the first data source is a slave data source among the m data sources, the first data source sends the first accumulated value to the master data source among the m data sources, and receives the actual distance between the first user and the first cluster center from the master data source, wherein the actual distance is calculated by the master data source according to the first accumulated value and m-1 second accumulated values, and each second accumulated value is calculated according to the first feature number.
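The exchange and accumulation described above can be simulated in a single process as a rough sketch. It assumes additive splitting of integer estimates and assumes that the master obtains the actual distance by summing all accumulated values; the claims do not fix either choice. The names `split` and `accumulate_actual_distance` are hypothetical, and data source 0 is arbitrarily treated as the master.

```python
import random
from typing import List


def split(value: int, parts: int, rng: random.Random) -> List[int]:
    """Additively split `value` into `parts` integers (assumed splitting rule)."""
    shares = [rng.randrange(-10**6, 10**6) for _ in range(parts - 1)]
    return shares + [value - sum(shares)]


def accumulate_actual_distance(estimates: List[int], rng: random.Random) -> int:
    """Simulate m data sources exchanging feature numbers and accumulating them."""
    m = len(estimates)
    # inbox[j] collects the feature numbers that data source j receives.
    inbox: List[List[int]] = [[] for _ in range(m)]
    for i, estimate in enumerate(estimates):
        others = [j for j in range(m) if j != i]
        for j, feature_number in zip(others, split(estimate, m - 1, rng)):
            inbox[j].append(feature_number)
    # Each data source sums the m-1 feature numbers it received.
    accumulated = [sum(received) for received in inbox]
    # The master adds its own accumulated value to the m-1 values sent by the slaves.
    return accumulated[0] + sum(accumulated[1:])


# Hypothetical estimates held by three data sources for one user and one center.
actual_distance = accumulate_actual_distance([9, 4, 12], random.Random(1))
```

Because every feature number is counted exactly once, the total equals the sum of the individual estimates, while no data source sees another source's estimate directly when m is greater than 2.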
The method of claim 1, wherein the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission with the master data source arranged first, each of the m data sources has a random number, and, if the first data source is the master data source among the m data sources, the first data source generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources other than the first data source comprises:

The first data source generates a first feature number according to a first random number possessed by the first data source and the first distance estimation value;

The first data source sends the first feature number to a second data source arranged after the first data source and closest to the first data source.
The method according to claim 4, wherein obtaining the actual distance between the first user and the first cluster center by the first data source comprises:

The first data source receives a second feature number from the last second data source, and determines the actual distance between the first user and the first cluster center according to the second feature number, wherein the second feature number is generated according to the first feature number, m-1 second distance estimation values and m-1 second random numbers, each second distance estimation value comes from one second data source among the m data sources, and each second random number comes from one second data source among the m data sources.
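The chained variant can be sketched as a masked running sum passed along the ordered data sources. The claims say only that each feature number is generated from the previous feature number, a random number and a distance estimation value; the additive combination below is an assumption, and the second pass in which each data source removes its own random number is a guess at how the master could recover the actual distance. The name `chained_actual_distance` is hypothetical.

```python
import random
from typing import List


def chained_actual_distance(estimates: List[int], rng: random.Random) -> int:
    """Simulate the ordered chain; index 0 is the master data source."""
    m = len(estimates)
    # Each data source holds its own random number, known to no other source.
    randoms = [rng.randrange(1, 10**6) for _ in range(m)]

    # Pass 1: the master masks its estimate with its random number and sends
    # the result on; each slave adds its own estimate and random number.
    feature_number = estimates[0] + randoms[0]
    for i in range(1, m):
        feature_number += estimates[i] + randoms[i]

    # The last data source returns the masked total to the master.
    # Pass 2 (assumed, not stated in the claims): the value travels along the
    # chain once more and each data source subtracts only its own random number.
    for i in range(m):
        feature_number -= randoms[i]
    return feature_number


# Hypothetical estimates held by three data sources for one user and one center.
actual_distance = chained_actual_distance([9, 4, 12], random.Random(2))
```

In this sketch a slave only ever sees a running total masked by at least the master's random number, which is the point of giving each data source its own random value.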
The method of claim 1, wherein the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission with the master data source arranged first, each of the m data sources has a random number, and, if the first data source is a slave data source among the m data sources, the first data source generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources other than the first data source comprises:

The first data source receives a second feature number from a second data source arranged before the first data source and closest to the first data source, wherein the second feature number is generated according to at least one second distance estimation value and at least one second random number, each second distance estimation value comes from one of the at least one second data source arranged before the first data source, and each second random number comes from one of the at least one second data source arranged before the first data source;

The first data source generates a first feature number according to the second feature number, a first random number possessed by the first data source, and the first distance estimation value;

The first data source sends the first feature number to a second data source arranged after the first data source.
The method according to claim 6, wherein obtaining the actual distance between the first user and the first cluster center by the first data source comprises:

The first data source receives the actual distance between the first user and the first cluster center from the master data source among the m data sources.
A user clustering apparatus, wherein the apparatus is applied to a first data source in a communication system, the communication system includes m data sources, m is an integer greater than or equal to 2, and the user clustering apparatus includes:

A determination module, configured to determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer greater than 1;

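A calculation module, configured to calculate a first distance estimation value between a first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;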

a sending module, configured to generate at least one first feature number according to the first distance estimation value, and send the at least one first feature number to a second data source among the m data sources;

An acquisition module, configured to acquire the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

A clustering module, configured to cluster the first user according to the actual distance.
A user clustering device, comprising a processor, a memory and an input/output interface that are connected to each other, wherein the user clustering device is a first data source in a communication system, the communication system includes m data sources, and m is an integer greater than or equal to 2; the input/output interface is used for inputting or outputting data, the memory is used for storing program code, and the processor is used for calling the program code to execute the following method:

Determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer greater than 1;

Calculate a first distance estimation value between a first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;

generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;

Obtain the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

Cluster the first user according to the actual distance.
The user clustering device of claim 9, wherein generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources comprises:

splitting the first distance estimation value to obtain m-1 first feature numbers;

The m-1 first feature numbers are respectively sent to the m-1 second data sources other than the first data source among the m data sources, wherein each first feature number is sent to one second data source.
The user clustering device according to claim 10, wherein the m data sources include a master data source and a slave data source, and acquiring the actual distance between the first user and the first cluster center comprises:

Receive m-1 second feature numbers from the m-1 second data sources, wherein each second feature number comes from one second data source and is generated according to the second distance estimation value;

Calculate the sum of the m-1 second characteristic numbers to obtain a first accumulated value;

If the first data source is the master data source among the m data sources, receive m-1 second accumulated values from the m-1 second data sources, and calculate the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein each second accumulated value comes from one second data source and is calculated according to the first feature number;

If the first data source is a slave data source among the m data sources, send the first accumulated value to the master data source among the m data sources, and receive the actual distance between the first user and the first cluster center from the master data source, wherein the actual distance is calculated by the master data source according to the first accumulated value and m-1 second accumulated values, and each second accumulated value is calculated according to the first feature number.
The user clustering device according to claim 9, wherein the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission with the master data source arranged first, each of the m data sources has a random number, and, if the first data source is the master data source among the m data sources, generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources other than the first data source comprises:

generating a first feature number according to the first random number possessed by the first data source and the first estimated distance;

The first feature number is sent to a second data source arranged after the first data source and closest to the first data source.
The user clustering device according to claim 12, wherein the obtaining the actual distance between the first user and the first cluster center comprises:

A second feature number is received from the last second data source, and the actual distance between the first user and the first cluster center is determined according to the second feature number, wherein the second feature number is generated according to the first feature number, m-1 second distance estimation values and m-1 second random numbers, each second distance estimation value comes from one second data source among the m data sources, and each second random number comes from one second data source among the m data sources.
The user clustering device according to claim 9, wherein the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission with the master data source arranged first, each of the m data sources has a random number, and, if the first data source is a slave data source among the m data sources, generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources other than the first data source comprises:

A second feature number is received from a second data source arranged before the first data source and closest to the first data source, wherein the second feature number is generated according to at least one second distance estimation value and at least one second random number, each second distance estimation value comes from one of the at least one second data source arranged before the first data source, and each second random number comes from one of the at least one second data source arranged before the first data source;

generating a first feature number according to the second feature number, the first random number possessed by the first data source, and the first distance estimate;

The first feature number is sent to a second data source arranged after the first data source.
A computer storage medium, wherein the computer storage medium is applied to a first data source in a communication system, and the communication system includes m data sources, where m is an integer greater than or equal to 2; the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the following method:

Determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer greater than 1;

Calculate a first distance estimation value between a first user and the first cluster center according to the feature data of the first user owned by the first data source and the feature data of the user corresponding to the first cluster center owned by the first data source, where the first user is any one of the n users other than the k users;

generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;

Obtain the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimation value, and the second distance estimation value is calculated according to the feature data of the first user owned by the second data source and the feature data of the user corresponding to the first cluster center;

Cluster the first user according to the actual distance.
The computer storage medium of claim 15, wherein generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources comprises:

splitting the first distance estimation value to obtain m-1 first feature numbers;

The m-1 first feature numbers are respectively sent to the m-1 second data sources other than the first data source among the m data sources, wherein each first feature number is sent to one second data source.
The computer storage medium according to claim 16, wherein the m data sources include a master data source and a slave data source, and obtaining the actual distance between the first user and the first cluster center comprises:

Receive m-1 second feature numbers from the m-1 second data sources, wherein each second feature number comes from one second data source and is generated according to the second distance estimation value;

Calculate the sum of the m-1 second characteristic numbers to obtain a first accumulated value;

If the first data source is the master data source among the m data sources, receive m-1 second accumulated values from the m-1 second data sources, and calculate the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein each second accumulated value comes from one second data source and is calculated according to the first feature number;

If the first data source is a slave data source among the m data sources, send the first accumulated value to the master data source among the m data sources, and receive the actual distance between the first user and the first cluster center from the master data source, wherein the actual distance is calculated by the master data source according to the first accumulated value and m-1 second accumulated values, and each second accumulated value is calculated according to the first feature number.
The computer storage medium according to claim 15, wherein the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission with the master data source arranged first, each of the m data sources has a random number, and, if the first data source is the master data source among the m data sources, generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources other than the first data source comprises:

generating a first feature number according to the first random number possessed by the first data source and the first distance estimation value;

The first feature number is sent to a second data source arranged after the first data source and closest to the first data source.
The computer storage medium of claim 18, wherein performing the obtaining the actual distance between the first user and the first cluster center comprises:

A second feature number is received from the last second data source, and the actual distance between the first user and the first cluster center is determined according to the second feature number, wherein the second feature number is generated according to the first feature number, m-1 second distance estimation values and m-1 second random numbers, each second distance estimation value comes from one second data source among the m data sources, and each second random number comes from one second data source among the m data sources.
The computer storage medium according to claim 15, wherein the m data sources include a master data source and a slave data source, the m data sources are arranged in the order of data transmission with the master data source arranged first, each of the m data sources has a random number, and, if the first data source is a slave data source among the m data sources, generating at least one first feature number according to the first distance estimation value and sending the at least one first feature number to a second data source among the m data sources other than the first data source comprises:

A second feature number is received from a second data source arranged before the first data source and closest to the first data source, wherein the second feature number is generated according to at least one second distance estimation value and at least one second random number, each second distance estimation value comes from one of the at least one second data source arranged before the first data source, and each second random number comes from one of the at least one second data source arranged before the first data source;

generating a first feature number according to the second feature number, the first random number possessed by the first data source, and the first distance estimate;

The first feature number is sent to a second data source arranged after the first data source.