WO2022105183A1 - User clustering method, apparatus and device - Google Patents

Publication number
WO2022105183A1
Authority
WO
WIPO (PCT)
Prior art keywords
data source
data
user
distance
feature
Application number
PCT/CN2021/097306
Inventor
王健宗
李泽远
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022105183A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

A user clustering method, apparatus and device, wherein the method comprises: a first data source determines a first cluster center, the first cluster center being any one of k pre-clustered cluster centers; the first data source calculates a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center; the first data source generates at least one first feature number according to the first distance estimate, sends the at least one first feature number to a second data source among m data sources, and obtains the actual distance between the first user and the first cluster center according to the at least one first feature number and a second distance estimate; the first data source clusters the first user according to the actual distance. The present invention allows users held by multiple data sources to be clustered without any data leaving its local data source, so that the clustering result is accurate and reliable.

Description

A User Clustering Method, Apparatus and Device
This application claims priority to Chinese patent application No. 202011307323.7, filed with the China Patent Office on November 20, 2020 and entitled "A User Clustering Method, Apparatus and Device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data mining, and in particular to a user clustering method, apparatus and device.
Background
Clustering algorithms are now applied more and more widely. Typically, the same users held by multiple data sources are clustered. For example, the credit card users held by various banks are clustered so that high-risk users can be identified, which helps commercial banks prevent and resolve credit card risk and improve their management of credit card default risk. When users are clustered, calculating the distance between a user and each cluster center is a very important step.
The inventors realized that, at present, data sources do not share data with one another, in order to protect the user feature data they hold. In the prior art, therefore, a data source that calculates the distance from a user to a cluster center does not refer to user feature data held by other data sources and relies only on its own, so the clustering result is inaccurate. Alternatively, each data source sends its user feature data to a third-party server, which calculates the distance between the user and the cluster center from the data of all data sources and completes the clustering. In this approach each data source must send its user feature data out of its local environment, so there is a risk of leaking that data.
Summary of the Invention
The embodiments of the present application provide a user clustering method, apparatus and device. When users are clustered, the user feature data does not leave the local data source, which ensures that it is not leaked and keeps the data secure. In addition, the distance can be calculated jointly from the user feature data held by all data sources, which ensures the accuracy of the clustering result.
In a first aspect, a user clustering method is provided. The method is applicable to a communication system that includes m data sources, where m is an integer greater than or equal to 2, and the method includes:
a first data source determines a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n > 1;
the first data source calculates a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users;
the first data source generates at least one first feature number according to the first distance estimate and sends the at least one first feature number to a second data source among the m data sources;
the first data source obtains the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
the first data source clusters the first user according to the actual distance.
In a second aspect, a user clustering apparatus is provided. The apparatus is applied to a first data source in a communication system that includes m data sources, where m is an integer greater than or equal to 2, and the user clustering apparatus includes:
a determination module, configured to determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n > 1;
a calculation module, configured to calculate a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users;
a sending module, configured to generate at least one first feature number according to the first distance estimate and send the at least one first feature number to a second data source among the m data sources;
an acquisition module, configured to obtain the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
a clustering module, configured to cluster the first user according to the actual distance.
In a third aspect, a user clustering device is provided, including a processor, a memory and an input/output interface that are connected to one another. The user clustering device is a first data source in a communication system that includes m data sources, where m is an integer greater than or equal to 2. The input/output interface is used to input or output data, the memory is used to store application program code with which the user clustering device executes the above method, and the processor is configured to perform the following method:
determining a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n > 1;
calculating a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users;
generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;
obtaining the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
clustering the first user according to the actual distance.
In a fourth aspect, a computer storage medium is provided. The computer storage medium is applied to a first data source in a communication system that includes m data sources, where m is an integer greater than or equal to 2. The computer storage medium stores a computer program that includes program instructions which, when executed by a processor, cause the processor to perform the following method:
determining a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n > 1;
calculating a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users;
generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;
obtaining the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
clustering the first user according to the actual distance.
In the embodiments of the present application, when the first data source calculates the distance from a user to a cluster center, it can calculate a first distance estimate from the feature data of that user and of the user corresponding to the cluster center that it holds locally, generate at least one first feature number from the first distance estimate, and send it to the other, second data sources. The user feature data therefore does not leave the local data source, is not leaked, and remains secure. Moreover, the actual distance that the first data source uses for clustering is generated from the feature number and a second distance estimate, and the second distance estimate is calculated from the feature data of that user and of the user corresponding to the cluster center held by the other, second data sources. In other words, the present application can also calculate the distance jointly from the user feature data held by all data sources, which ensures the accuracy of the clustering result.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a user clustering method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of obtaining the actual distance between a first user and a first cluster center provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another way of obtaining the actual distance between the first user and the first cluster center provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of another user clustering method provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a user clustering apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a user clustering device provided by an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or circuits is not limited to the listed steps or circuits, but optionally also includes steps or circuits that are not listed, or other steps or circuits inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The technical solution of the present application relates to the fields of artificial intelligence and/or big data and is used for cluster analysis. For example, it can be applied in fintech scenarios such as clustering the credit card users of financial institutions, to improve the accuracy of the clustering results. Optionally, the data involved in this application, such as user feature data and/or clustering results, may be stored in a database or in a blockchain, which is not limited by this application.
The solutions of the embodiments of the present application apply to scenarios in which users are clustered across multiple data sources. In a communication system that includes m data sources, where m is an integer greater than or equal to 2, a first data source determines a first cluster center. The first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center corresponding to one user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n > 1;
the first data source calculates a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users;
the first data source generates at least one first feature number according to the first distance estimate and sends the at least one first feature number to a second data source among the m data sources;
the first data source obtains the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
the first data source clusters the first user according to the actual distance.
As can be seen from the above, the embodiments of the present application provide a clustering method for multiple data sources whose results are more accurate and reliable than clustering on a single data source. Each participating data source calculates the distance between a user and a cluster center from its local data. What is transmitted between data sources is not the data itself but feature numbers generated from the calculated distances, including the first feature numbers and second feature numbers described above, which ensures the security of the data.
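The final clustering step can be illustrated with a very small sketch. The text only says that the first user is clustered "according to the actual distance"; assigning the user to the cluster center with the smallest actual distance, as in k-means-style clustering, is an assumption made here for illustration, and the function name and sample values are hypothetical.

```python
def assign_to_nearest_center(actual_distances):
    """Return the ID of the cluster center with the smallest actual distance.

    actual_distances maps a cluster-center user ID to the actual distance
    between the first user and that center (hypothetical structure).
    Nearest-center assignment is an assumed rule, not quoted from the source.
    """
    if not actual_distances:
        raise ValueError("at least one cluster center is required")
    return min(actual_distances, key=actual_distances.get)


# Hypothetical actual distances from one user to k = 3 cluster centers.
distances = {"center_u7": 4.2, "center_u2": 1.5, "center_u9": 3.0}
assigned = assign_to_nearest_center(distances)
```

Only the jointly obtained distances enter this step, so no feature values are needed at this point.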
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a user clustering method provided by an embodiment of the present application. For ease of understanding, clustering the credit card users of financial institutions is taken as an example here. Financial institutions have higher confidentiality requirements for user information in their local data, which fits well with the property of the embodiments that data does not leave the local data source. The specific process includes:
S101: The first data source determines a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers.
The m financial institutions participating in the clustering are the m data sources, and the m data sources contain the same n users, where m and n are integers, m ≥ 2 and n > 1. Each data source extracts the behavior features of each of the n users to obtain the user's feature data. The feature data includes at least one feature dimension, and one feature dimension corresponds to one type of behavior feature of the user. When participating in the calculation, each data source provides different types of user features, and the feature dimensions provided by all data sources sum to L.
One of the m data sources is selected as the main data source. The main data source may be any of the m data sources participating in the clustering; to ensure absolute security of the data, in a possible implementation the main data source may also be a trusted data source. The main data source randomly selects k user IDs from the n users to be classified as the cluster centers, where each of the k cluster centers has dimension L and k is the number of clusters.
Because the user data provided by different data sources has different feature dimensions, the main data source partitions the user data by the feature dimensions that each data source provides and then sends the user IDs, carrying the corresponding feature dimension identifiers, to the other data sources. The feature dimension identifiers carried with a user ID match the feature dimensions held by the receiving data source. After receiving the message from the main data source, each data source matches the feature dimensions of its local data for that user ID against the feature dimension identifiers received with the user ID, and forms its local cluster centers.
The main data source sends the user IDs to the other data sources so that all m data sources cluster against the same cluster centers. The user IDs that are sent carry only the dimension identifiers of the feature dimensions held by the different data sources and no local data, so the data does not leave the local data source and the security of the data is ensured.
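The center-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the message layout, the dimension labels and the sample feature tables are all hypothetical. The point it shows is that the message contains only user IDs and dimension identifiers, and that each data source builds its local cluster centers from its own feature values.

```python
import random


def select_center_ids(user_ids, k, seed=None):
    """Main data source: randomly pick k distinct user IDs as cluster centers."""
    rng = random.Random(seed)
    return rng.sample(sorted(user_ids), k)


def build_center_message(center_ids, dimension_labels):
    """Message for one receiving data source: IDs plus dimension labels only.

    No feature values are included, so no local data leaves the sender.
    """
    return {"center_ids": list(center_ids), "dimensions": list(dimension_labels)}


def form_local_centers(message, local_features):
    """Receiving data source: look up its own feature vectors for the center IDs."""
    return {uid: local_features[uid] for uid in message["center_ids"]}


# Hypothetical local tables: the same four users, different feature dimensions.
source_a = {"u1": [0.2, 1.0], "u2": [0.9, 0.1], "u3": [0.4, 0.4], "u4": [0.7, 0.3]}
source_b = {"u1": [5.0], "u2": [1.0], "u3": [3.0], "u4": [2.0]}

center_ids = select_center_ids(source_a.keys(), k=2, seed=0)
message_for_a = build_center_message(center_ids, ["credit_limit", "monthly_spend"])
message_for_b = build_center_message(center_ids, ["overdue_count"])
centers_a = form_local_centers(message_for_a, source_a)
centers_b = form_local_centers(message_for_b, source_b)
```

Both sources end up with centers for the same user IDs, each expressed only in the feature dimensions that source holds.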
The first data source is any one of the m data sources; that is, it may be the main data source mentioned above or any other of the m data sources. This data source arbitrarily selects one of the k cluster centers as the first cluster center.
It should be understood that, because the data in the m data sources must be combined for the calculation, the first cluster center is the same cluster center in every one of the m data sources.
S102: The first data source calculates a first distance estimate between the first user and the first cluster center according to the feature data of the first user held by the first data source and the feature data, held by the first data source, of the user corresponding to the first cluster center.
The first data source calculates the first distance estimate between the first user and the first cluster center according to the feature data of the first user in its local data and the feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users.
In the same way as the first data source, each of the m-1 second data sources among the m data sources calculates a second distance estimate between the first user and the first cluster center according to the feature data of the first user in its local data and the feature data of the user corresponding to the first cluster center. Here, the second data sources are the data sources among the m data sources other than the first data source.
In other words, every data source participating in the clustering calculates the distance between the first user and the first cluster center from its local data.
Because multiple data sources jointly participate in the calculation in this embodiment, each data source provides different types of features of the first user, that is, different feature dimensions. The first distance estimate is the distance between the first user and the first cluster center in the feature dimensions provided by the first data source, and the second distance estimate is the distance between the first user and the first cluster center in the feature dimensions provided by a second data source.
It should be understood that the first distance estimate and the second distance estimate are calculated from the user feature data in the respective data source and refer to the distance between the first user and the first cluster center within that data source. The distance is calculated by a distance function Dist, such as the Euclidean distance or the Manhattan distance, and Dist can be defined according to the application scenario.
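A minimal sketch of how one data source might compute its local distance estimate with a pluggable Dist function is given below. The function names and sample vectors are hypothetical; only the choice between Euclidean and Manhattan distance comes from the text.

```python
import math


def manhattan(x, c):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, c))


def euclidean(x, c):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def local_distance_estimate(user_features, center_features, dist=manhattan):
    """Distance between a user and a cluster center over one source's dimensions."""
    if len(user_features) != len(center_features):
        raise ValueError("user and center must share the same feature dimensions")
    return dist(user_features, center_features)


# Hypothetical local vectors held by a single data source.
first_user = [1.0, 2.0]
first_center = [4.0, 6.0]
d_manhattan = local_distance_estimate(first_user, first_center)
d_euclidean = local_distance_estimate(first_user, first_center, dist=euclidean)
```

One arithmetic observation, which is not a statement from the source: if per-source estimates are later added together, Manhattan distance (or squared Euclidean distance) over disjoint feature dimensions adds up exactly to the distance over all L dimensions, whereas plain Euclidean distance does not.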
S103: The first data source generates at least one first feature number according to the first distance estimate and sends it to a second data source among the m data sources, and the actual distance between the first user and the first cluster center is obtained according to the at least one first feature number and the second distance estimate.
The embodiments of the present application provide two methods for obtaining the actual distance between the first user and the first cluster center.
Method 1: the split-sum method, which includes the following steps:
Step 1: Split the first distance estimate to obtain m-1 first feature numbers.
For ease of description, the first distance estimate is denoted as
Figure PCTCN2021097306-appb-000001
which represents the distance between the first user and the first cluster center in the first data source. The value
Figure PCTCN2021097306-appb-000002
is randomly split into m-1 random numbers
Figure PCTCN2021097306-appb-000003
such that
Figure PCTCN2021097306-appb-000004
That is, the m-1 random numbers sum to the first distance estimate. These m-1 random numbers,
Figure PCTCN2021097306-appb-000005
are the m-1 first feature numbers described above.
Similarly to the processing in the first data source, a second data source denotes its second distance estimate as
Figure PCTCN2021097306-appb-000006
which represents the distance between the first user and the first cluster center in the j-th second data source, where j is an integer and 1≤j≤m-1. The value
Figure PCTCN2021097306-appb-000007
is randomly split into m-1 random numbers
Figure PCTCN2021097306-appb-000008
such that
Figure PCTCN2021097306-appb-000009
These m-1 random numbers,
Figure PCTCN2021097306-appb-000010
are the m-1 second feature numbers.
It should be understood that, because the first data source can be any one of the m data sources, the terms "first data source" and "m-1 second data sources" serve only to tell the data sources apart. In step 1, the processing performed by the first data source and by the second data sources does not differ in substance.
That is, each of the m data sources generates m-1 feature numbers from the distance between the first user and the first cluster center calculated on its local data. A feature number is a first feature number if its sender is the first data source, and a second feature number if its sender is a second data source.
For ease of understanding, a concrete example is described with reference to FIG. 2. FIG. 2 shows three data sources, data source 1, data source 2 and data source 3, so m is 3. Step 1 corresponds to the "random split" part of FIG. 2. The values 5, 7 and 9 in the figure are the distances between the first user and the first cluster center calculated by the respective data sources on their local data. As shown, 5 in data source 1 is split into 2 and 3, 7 in data source 2 is split into 3 and 4, and 9 in data source 3 is split into -1 and 10. In step 1, if data source 1 is the first data source, data source 2 and data source 3 are both second data sources; if data source 2 is the first data source, data source 1 and data source 3 are both second data sources; if data source 3 is the first data source, data source 1 and data source 2 are both second data sources. Clearly, the processing performed in step 1 by each of the m data sources does not differ in substance.
It should be understood that the split shown in the figure is only one of many possibilities; the splitting in step 1 is in fact random.
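The random split of step 1 can be sketched as follows. This is a minimal illustration rather than the method prescribed by the application: the function name `random_split`, the `spread` parameter and the use of a uniform distribution are assumptions.

```python
import random

def random_split(value, parts, spread=10.0, rng=random):
    """Split `value` into `parts` random numbers that sum to `value`.

    The first parts-1 shares are drawn uniformly from [-spread, spread];
    the last share is whatever remains, so shares may be negative.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    shares = [rng.uniform(-spread, spread) for _ in range(parts - 1)]
    shares.append(value - sum(shares))
    return shares

# With m = 3 data sources, a local estimate of 5 is split into m-1 = 2 shares.
shares = random_split(5.0, parts=2)
```

Allowing negative shares matches FIG. 2, where 9 is split into -1 and 10.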
Step 2: Send the generated feature numbers.
The first data source sends the m-1 first feature numbers
Figure PCTCN2021097306-appb-000011
to the m-1 second data sources, that is, to the data sources other than the first data source among the m data sources, with one first feature number sent to each second data source.
Similarly to the processing in the first data source, each second data source sends its m-1 second feature numbers to the data sources other than itself among the m data sources, again one second feature number per data source.
That is, each of the m data sources sends the m-1 feature numbers it generated to the other data sources among the m data sources. A feature number is a first feature number if its sender is the first data source, and a second feature number if its sender is a second data source.
It should be understood that, because the first data source can be any one of the m data sources, the terms "first data source" and "m-1 second data sources" serve only to tell the data sources apart. In step 2, the processing performed by the first data source and by the second data sources does not differ in substance.
In FIG. 2, data source 1 sends the random numbers 2 and 3 obtained by splitting to data source 2 and data source 3 respectively; data source 2 sends the random numbers 3 and 4 to data source 1 and data source 3 respectively; and data source 3 sends the random numbers -1 and 10 to data source 1 and data source 2 respectively. If data source 1 is the first data source, 2 and 3 are the first feature numbers, 3 and 4 are the second feature numbers generated by data source 2 acting as a second data source, and -1 and 10 are the second feature numbers generated by data source 3 acting as a second data source. If data source 2 is the first data source, 3 and 4 are the first feature numbers; if data source 3 is the first data source, -1 and 10 are the first feature numbers.
Clearly, the processing performed in step 2 by each of the m data sources does not differ in substance.
Step 3: The first data source receives m-1 second feature numbers from the m-1 second data sources, one second feature number from each second data source.
The first data source receives m-1 second feature numbers from the m-1 second data sources and calculates their sum to obtain a first accumulated value. The m data sources comprise a master data source and slave data sources. If the first data source is the master data source among the m data sources, it stores the first accumulated value; if the first data source is a slave data source, it sends the first accumulated value to the master data source.
Similarly to the processing in the first data source, each second data source also receives m-1 feature numbers (a first feature number if the sender is the first data source, a second feature number if the sender is a second data source) and sums them to obtain a second accumulated value.
That is, each of the m data sources receives m-1 feature numbers and sums them. If the data source is the master data source, it stores the sum; if it is a slave data source, it sends the sum to the master data source.
In FIG. 2, data source 1 receives 3 and -1, which sum to 2; data source 2 receives 2 and 10, which sum to 12; data source 3 receives 3 and 4, which sum to 7. If data source 1 is the first data source, the m-1 second feature numbers it receives are 3 and -1, and their sum 2 is the first accumulated value; the sum 12 obtained by data source 2 is the second accumulated value (also called the second feature-number sum) of data source 2 acting as a second data source, and the sum 7 obtained by data source 3 is the second accumulated value of data source 3 acting as a second data source. Likewise, if data source 2 or data source 3 is the first data source, the first accumulated value is 12 or 7 respectively. In short, the sum is a first accumulated value if it is computed by the first data source and a second accumulated value if it is computed by a second data source.
In FIG. 2, data source 1 is the master data source and stores the sum it obtained; data source 2 and data source 3 are slave data sources and send their sums, 12 and 7, to the master data source, that is, to data source 1.
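Steps 2 and 3 can be traced on the numbers of FIG. 2 with the short sketch below. The dictionary layout and the names `outgoing`, `accumulate` and `forwarded` are illustrative assumptions, and ordinary numeric addition is used, as in the figure.

```python
# Shares from FIG. 2, keyed by sender, then by receiver.
# Each sender addresses exactly one share to every other data source.
outgoing = {
    1: {2: 2, 3: 3},    # data source 1 split 5 into 2 and 3
    2: {1: 3, 3: 4},    # data source 2 split 7 into 3 and 4
    3: {1: -1, 2: 10},  # data source 3 split 9 into -1 and 10
}

def accumulate(outgoing):
    """Return, for each receiver, the sum of the feature numbers it received."""
    totals = {source: 0 for source in outgoing}
    for sender, messages in outgoing.items():
        for receiver, share in messages.items():
            totals[receiver] += share
    return totals

accumulated = accumulate(outgoing)

master = 1
stored_by_master = accumulated[master]  # kept locally by the master
# Accumulated values that the slave data sources forward to the master.
forwarded = {s: v for s, v in accumulated.items() if s != master}
```

No data source sees another data source's original estimate, only one random share of it.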
It should be understood that, because the users in each data source have different feature dimensions, the actual summation in the embodiments of the present application uses a summation method that takes the different dimensions into account and is not a plain addition of numeric values. The addition shown in FIG. 2 is used only for ease of description.
Step 4: Obtain the actual distance between the first user and the first cluster center.
If the first data source is the master data source among the m data sources, it receives m-1 second feature-number sums (the second accumulated values) from the m-1 second data sources and calculates the actual distance between the first user and the first cluster center from the locally stored first accumulated value and the m-1 received sums. After obtaining the actual distance, the master data source also sends it to the slave data sources among the m data sources.
In FIG. 2, data source 1 is the master data source. It receives the second feature-number sum 12 from data source 2 and the second feature-number sum 7 from data source 3, both acting as slave data sources; 12 and 7 are the two second feature-number sums received by the master data source. Data source 1 adds its locally stored first accumulated value 2 to the two received sums to obtain 21, which corresponds to the actual distance in the embodiments of the present application.
It should be understood that, because the users in each data source have different feature dimensions, the actual summation uses a summation method that takes the different dimensions into account and is not a plain addition of numeric values.
Clearly, the master data source is the only one of the m data sources whose processing differs from that of the others.
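Why the master data source recovers the total can be seen from a short derivation. The notation is an assumption introduced here, since the original formulas are figures that are not reproduced: d_i is the local distance estimate of data source i, r_{i→j} is the share that data source i sends to data source j, and s_j is the accumulated value computed by data source j. Plain addition is assumed, as in FIG. 2.

```latex
\begin{aligned}
d_i &= \sum_{j \ne i} r_{i \to j}, \qquad
s_j = \sum_{i \ne j} r_{i \to j}, \\
\sum_{j=1}^{m} s_j
  &= \sum_{j=1}^{m} \sum_{i \ne j} r_{i \to j}
   = \sum_{i=1}^{m} \sum_{j \ne i} r_{i \to j}
   = \sum_{i=1}^{m} d_i .
\end{aligned}
```

Every share is counted exactly once on each side, so only the order of summation changes. For FIG. 2 this gives 2 + 12 + 7 = 21 = 5 + 7 + 9.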
The summation method above has the following properties. First, each of the m data sources takes part in the computation, and the amount of computation in each data source is roughly the same, so even the master data source does not need high computing power. The embodiments of the present application are therefore scalable: when needed, more data sources can be added to the clustering to make the result more reliable and accurate without placing much additional load on the master data source.
Second, every data source participating in the clustering receives the result of the summation, while the data of each participating data source never leaves its local site, so resources are shared on the premise that data security is preserved.
Finally, because every data source participating in the clustering takes part in the computation, the system as a whole has the computing power of multiple computers and therefore a higher processing speed.
Method 2: the random-number summation method, which includes the following steps:
Step 1: The master data source generates a feature number and sends it.
The m data sources comprise a master data source and slave data sources and are arranged in the order of data transmission, with the master data source in first place. Each of the m data sources holds one random number. If the first data source is the master data source among the m data sources, it generates a first feature number from the first random number it holds and the first distance estimate, and sends the first feature number to the second data source that is arranged after the first data source and closest to it.
In FIG. 3, data source 1, acting as the first data source, is the master data source and is arranged in first place. The first distance estimate calculated by data source 1 on its local data is D1, and V1 is the first random number held by data source 1. Data source 1 adds D1 and V1 to obtain the first feature number C1 and sends it to data source 2, which is arranged after data source 1. Data source 2 is thus the second data source that is arranged after the first data source and closest to it.
Step 2: A slave data source generates a feature number and sends it.
If the first data source is a slave data source among the m data sources, it receives a second feature number from the second data source that is arranged before it and closest to it. The second feature number is generated from at least one second distance estimate and at least one second random number, where each second distance estimate and each second random number originates from one of the second data sources arranged before the first data source. The first data source generates a first feature number from the second feature number, the first random number it holds and the first distance estimate, and sends the first feature number to the second data source arranged after the first data source.
In FIG. 3, data source 2 is a slave data source. If data source 2 is the first data source, all other data sources in FIG. 3 are second data sources, and data source 1 is one of the second data sources arranged before the first data source. Data source 2 receives the second feature number C1 from data source 1; C1 is generated from the second distance estimate of data source 1 and the second random number V1 held by data source 1, both in its role as a second data source. Data source 2 generates the first feature number C2 from the received second feature number C1, the first random number V2 it holds and its own first distance estimate D2, and sends C2 to data source 3, which is the second data source arranged after data source 2. Because data sources 2, 3, 4 and 5 are all slave data sources, the cases in which data source 3, 4 or 5 is the first data source are similar to the case of data source 2 and can be understood with reference to FIG. 3 and the description above; they are not repeated here.
Step 3: The master data source obtains the actual distance between the first user and the first cluster center.
If the first data source is the master data source among the m data sources, it receives a second feature number from the second data source arranged last and determines the actual distance between the first user and the first cluster center from that second feature number. The second feature number is generated from the first feature number, m-1 second distance estimates and m-1 second random numbers, where each second distance estimate and each second random number originates from one second data source among the m data sources.
In FIG. 3, data source 1 is the master data source. If data source 1 is the first data source, it receives the second feature number C5 sent by data source 5, which acts as a second data source and is the second data source arranged last. The master data source processes the received C5 to determine the actual distance between the first user and the first cluster center; specifically, it subtracts the sum of the random numbers V1, V2, V3, V4 and V5. After determining the actual distance, data source 1 sends it to the slave data sources among the m data sources.
Each of the m data sources holds one random number, which can be obtained in either of the following two ways:
1. Distribution by the master data source: the master data source randomly generates m random numbers, records their sum, and randomly distributes the m random numbers among the m data sources including itself, one random number per data source.
2. Local generation: each data source randomly generates one random number locally. In this case, after generating its random number, each data source needs to send it to the master data source so that the master data source can obtain the sum of the m random numbers.
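The random-number summation method can be simulated in a few lines. The sketch below is a single-process illustration under stated assumptions: index 0 plays the master data source, the random numbers are drawn by the master (the first option above), and the function name `ring_sum` and the range of the random numbers are hypothetical.

```python
import random

def ring_sum(local_estimates, rng=random):
    """Simulate the ring pass and return (actual_distance, transmitted values)."""
    m = len(local_estimates)
    # The master draws m random numbers and records their sum.
    masks = [rng.uniform(-100.0, 100.0) for _ in range(m)]
    mask_total = sum(masks)

    running = 0.0
    transmitted = []
    for estimate, mask in zip(local_estimates, masks):
        # C_i = C_(i-1) + D_i + V_i, following the naming of FIG. 3.
        running = running + estimate + mask
        transmitted.append(running)  # value forwarded to the next data source
    # The last data source returns C_m to the master, which removes the masks.
    actual_distance = running - mask_total
    return actual_distance, transmitted

# Five data sources as in FIG. 3, with hypothetical local estimates D1..D5.
distance, sent = ring_sum([5.0, 7.0, 9.0, 1.0, 3.0])
```

Each transmitted value hides the partial sum of distance estimates behind the accumulated random numbers, and only the master, which knows the sum of all m random numbers, can remove them.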
Step 4: A slave data source obtains the actual distance between the first user and the first cluster center.
If the first data source is a slave data source among the m data sources, it receives the actual distance between the first user and the first cluster center from the master data source among the m data sources.
In FIG. 3, data source 1 is the master data source, and data sources 2, 3, 4 and 5 are slave data sources. If any one of data sources 2, 3, 4 and 5 is the first data source, it obtains the actual distance between the first user and the first cluster center by receiving it from data source 1.
In the random-number summation method above, the distance between the first user and the first cluster center is calculated on local data, and a random number is added to it to generate a feature number (a first or second feature number as defined above). Passing this feature number between data sources, instead of transmitting the data directly, further protects the security of the local data. In addition, each of the m data sources adds a further random number during summation after receiving a feature number, so that even if two adjacent data sources collude, they cannot obtain the original data passed between data sources.
When only two data sources take part in the clustering, the actual distance need not be obtained with either of the two methods above: the distance estimate of the slave data source can be sent directly to the master data source, which reduces computational complexity while still providing a degree of protection for the local data in the data sources.
S104: The first data source clusters the first user according to the actual distance.
Through steps S101 to S103, each of the m data sources obtains the actual distance between the first user and the first cluster center. Because the first cluster center is any one of the k cluster centers, choosing each of the k cluster centers in turn as the first cluster center yields the actual distances between the first user and all k cluster centers. These k actual distances are compared to find the minimum, and the first user is assigned to the cluster whose center yields that minimum.
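The assignment in S104 amounts to picking the index of the smallest of the k actual distances. A minimal sketch, in which the function name and the sample distances are assumptions:

```python
def assign_cluster(actual_distances):
    """Return the index of the cluster center with the smallest actual distance."""
    if not actual_distances:
        raise ValueError("need at least one cluster center")
    # On ties, the lowest index wins (an arbitrary choice for this sketch).
    return min(range(len(actual_distances)), key=actual_distances.__getitem__)

# Hypothetical actual distances from the first user to k = 3 cluster centers.
cluster_index = assign_cluster([21.0, 8.5, 13.0])
```

Because every data source receives the same actual distances, every data source arrives at the same assignment.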
Further, refer to FIG. 4, which is a schematic flowchart of completing the final classification through multiple rounds of classification according to an embodiment of the present application. After the first user has been clustered, the clustering of all users can be completed through multiple iterations. The process includes:
S201: The master data source selects k cluster centers and distributes them to the data sources; see the process by which the master data source selects cluster centers in step S101.
S202: Each data source calculates, on its local data, the distance estimates from the users to the cluster centers.
See the step of obtaining the distance estimate between the first user and the first cluster center in step S102. By varying the choice of first user and first cluster center, each data source can obtain the distance estimate from any user to any cluster center on its local data.
S203: Obtain the actual distances from all users to each cluster center. Following the method of step S103 for obtaining the actual distance between the first user and the first cluster center, the actual distance between any user and any cluster center can be obtained.
S204: Assign each user to the cluster whose center is at the smallest actual distance.
See the method of clustering the first user in step S104. By choosing different users, the first-round classification of each of the n users is obtained.
S205: Each data source locally averages the points assigned to each cluster to obtain the new cluster center.
After the classification of the n users is obtained, for the users assigned to a given cluster, the first data source first determines locally the number of users in that cluster and the sum of their data in each dimension, and then calculates the per-dimension mean from these sums and the user count. Doing this for every cluster yields the k new cluster centers.
For ease of understanding, the calculation of a new cluster center is illustrated as follows. Suppose the points assigned to a certain cluster in a data source have coordinates (2,3), (6,2) and (1,1). The number of points in the cluster is 3. Summing the three points in each dimension gives 9 and 6, and dividing by 3 gives means of 3 and 2 in the two dimensions, so the cluster center for the next round is (3,2).
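The worked example above can be reproduced with a short sketch; the function name `local_center` is an assumption.

```python
def local_center(points):
    """Per-dimension mean of the points assigned to one cluster."""
    count = len(points)
    if count == 0:
        raise ValueError("cluster has no points")
    # zip(*points) groups the coordinates by dimension.
    sums = [sum(dimension) for dimension in zip(*points)]
    return [s / count for s in sums]

# The three points from the example in the text.
new_center = local_center([(2, 3), (6, 2), (1, 1)])
```

Each data source applies this only to the feature dimensions it holds, so the full cluster center is never assembled in one place.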
S206: Each data source calculates the sum of the distances moved from the original center points to the current center points.
The first data source calculates the distance between each of the k new cluster centers and the corresponding cluster center of the previous round. Each such distance is called a center distance, so k center distances are obtained; they are the distances moved by the cluster centers. As with the distance estimates calculated on local data, these distances are computed by the distance function Dist, for example the Euclidean distance or the Manhattan distance, and Dist can be defined according to the application scenario. The first data source sums the k center distances to obtain a first center distance, that is, the sum of the distances moved by the k cluster centers.
Similarly to the processing in the first data source, each second data source calculates the k center distances between the k new cluster centers and the corresponding cluster centers of the previous round, and sums them to obtain a second center distance.
That is, each of the m data sources obtains the sum of k center distances from the k new cluster centers and the k original cluster centers.
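A minimal sketch of the local center-distance sum in S206, assuming the Euclidean distance as Dist; the function name and the sample centers are hypothetical.

```python
import math

def center_shift(old_centers, new_centers):
    """Sum of Euclidean distances between corresponding old and new centers."""
    return sum(math.dist(old, new) for old, new in zip(old_centers, new_centers))

# Hypothetical centers for k = 2 over the two dimensions held by one data source.
old = [(0.0, 0.0), (1.0, 1.0)]
new = [(3.0, 4.0), (1.0, 1.0)]
local_shift = center_shift(old, new)
```

The value `local_shift` plays the role of the first (or second) center distance that enters the secure summation of S207.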
S207,获取原聚类中心和新的聚类中心的实际移动距离和。S207, obtain the actual moving distance sum of the original cluster center and the new cluster center.
由于在m个数据源中,聚类中心在不同维度上发生改变,这里,为了获得k个聚类中心移动的实际距离,需要对m个数据源中不同维度上的移动距离求和。求和的方式可以采用上述拆分求和法以及随机数求合法,具体步骤可以参考上述提供的两种获取第一用户与第一聚类中心之间的实际距离的方法以及图2、图3,在采用上述两种获取实际距离的方法时,第一中心距离相当于第一距离估计值,第二中心距离相当于第二距离估计值。Since in m data sources, the cluster centers change in different dimensions, here, in order to obtain the actual distance moved by the k cluster centers, it is necessary to sum up the moving distances in different dimensions in the m data sources. The method of summation can adopt the above-mentioned splitting and summing method and random number calculation. For specific steps, refer to the two methods for obtaining the actual distance between the first user and the first cluster center and Figure 2 and Figure 3 for specific steps. , when the above two methods for obtaining the actual distance are adopted, the first center distance is equivalent to the first estimated distance value, and the second center distance is equivalent to the second estimated distance value.
S208: Compare the actual distance moved by the k cluster centers with the termination threshold.
If the actual distance moved by the k cluster centers is greater than the termination threshold (the first threshold), the data sources perform the next round of clustering on the n users based on the new cluster centers. The steps of the next round are the same as those of the first round and yield a new clustering result.
If the actual distance moved by the k cluster centers is less than the termination threshold, clustering stops and the current clustering result is taken as the final clustering result.
The termination threshold is set as needed, typically between 0 and 10⁻⁵. When the actual distance moved by the k cluster centers is less than the termination threshold, the k cluster centers are considered not to have moved, and the clustering result will no longer change.
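The stopping rule of S208 can be sketched as a small driver loop. The names `cluster_until_stable` and `run_round` are hypothetical; `run_round` stands in for one full round of clustering and is assumed to return the new centers together with the actual total distance they moved. The `max_rounds` cap is an added safeguard that the text does not mention.

```python
def cluster_until_stable(run_round, centers, threshold=1e-5, max_rounds=100):
    """Repeat clustering rounds until the centers move less than `threshold`.

    run_round(centers) -> (new_centers, moved) is a caller-supplied function
    (hypothetical); `moved` is the actual total distance moved by the k centers.
    Returns the final centers and the number of rounds executed.
    """
    for round_no in range(1, max_rounds + 1):
        centers, moved = run_round(centers)
        if moved < threshold:
            # Centers are treated as not having moved: stop clustering.
            return centers, round_no
    # Safety cap reached (an assumption, not part of the described method).
    return centers, max_rounds

# Toy round: every center moves halfway toward 0 on each call.
def halve(centers):
    new = [c / 2 for c in centers]
    moved = sum(abs(c - n) for c, n in zip(centers, new))
    return new, moved

final_centers, rounds = cluster_until_stable(halve, [1.0])
```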
The method of the embodiments of the present application has been described above; the apparatus of the embodiments is described below.
Referring to FIG. 5, which is a schematic structural diagram of a user clustering apparatus provided by an embodiment of the present application, the apparatus 50 includes:
a determination module 501, configured to determine a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center per user, the k users are among the n users to be classified, the first data source is any one of the m data sources, and n is an integer greater than 1; in the master data source, the determination module 501 is further configured to determine the k pre-clustered cluster centers;
a calculation module 502, configured to calculate a first distance estimate between a first user and the first cluster center according to the feature data of the first user held by the first data source and the feature data, held by the first data source, of the user corresponding to the first cluster center, where the first user is any one of the n users other than the k users;
a sending module 503, configured to generate at least one first feature number according to the first distance estimate and to send the at least one first feature number to a second data source among the m data sources;
an obtaining module 504, configured to obtain the actual distance between the first user and the first cluster center, where the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to the feature data of the first user held by the second data source and the feature data of the user corresponding to the first cluster center;
a clustering module 505, configured to cluster the first user according to the actual distance.
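As an illustration of what the clustering module 505 might do with the actual distances, the sketch below assigns a user to the nearest of the k cluster centers. The nearest-center rule is an assumption borrowed from standard k-means, since this passage only says the user is clustered "according to the actual distance"; the function name `assign_cluster` is hypothetical.

```python
def assign_cluster(actual_distances):
    """Return the index of the nearest cluster center.

    actual_distances[j] is the actual distance between the user and cluster
    center j. Picking the minimum is an assumed rule (standard k-means).
    Ties resolve to the lowest index.
    """
    if not actual_distances:
        raise ValueError("at least one cluster center is required")
    return min(range(len(actual_distances)), key=actual_distances.__getitem__)

# A user whose actual distances to three centers are 3.0, 1.5 and 2.0
cluster_index = assign_cluster([3.0, 1.5, 2.0])
```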
In a possible design, the sending module 503 is configured to split the first distance estimate into m-1 first feature numbers and to send them respectively to the m-1 second data sources among the m data sources other than the first data source, one first feature number per second data source.
In a possible design, the m data sources include a master data source and slave data sources. The obtaining module 504 is configured to receive m-1 second feature numbers from the m-1 second data sources, one second feature number from each second data source, where each second feature number is generated according to the second distance estimate. The obtaining module 504 is further configured to sum the m-1 second feature numbers to obtain a first accumulated value. If the first data source is the master data source among the m data sources, the obtaining module 504 is configured to receive m-1 second accumulated values from the m-1 second data sources and to calculate the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, where each second accumulated value comes from one second data source and is calculated according to the first feature number.
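The split-sum design in the two paragraphs above can be simulated in a single process as follows. This is a sketch under stated assumptions rather than the claimed implementation: the function names are hypothetical, the random range of the shares is arbitrary, and the actual distance is assumed to be the plain sum of the per-source distance estimates (which holds, for example, for a Manhattan distance over disjoint feature columns).

```python
import random

def split_estimate(estimate, parts, rng):
    """Split `estimate` into `parts` feature numbers that sum back to it."""
    shares = [rng.uniform(-1e3, 1e3) for _ in range(parts - 1)]
    shares.append(estimate - sum(shares))  # last share fixes the total
    return shares

def split_sum(estimates, seed=None):
    """Simulate m data sources; index 0 plays the master data source."""
    rng = random.Random(seed)
    m = len(estimates)
    inbox = [[] for _ in range(m)]
    for sender, estimate in enumerate(estimates):
        receivers = [i for i in range(m) if i != sender]
        shares = split_estimate(estimate, m - 1, rng)
        # One feature number goes to each of the other m-1 data sources.
        for receiver, share in zip(receivers, shares):
            inbox[receiver].append(share)
    # Each source sums the m-1 feature numbers it received.
    accumulated = [sum(received) for received in inbox]
    # The master adds its own accumulated value to the m-1 it receives.
    return accumulated[0] + sum(accumulated[1:])

actual_distance = split_sum([1.5, 2.0, 0.25], seed=0)  # close to 3.75
```

No source ever sends its raw estimate when m is at least 3. With m = 2 the single share in this sketch equals the estimate itself, so the sketch hides nothing in that case.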
In a possible design, the m data sources include a master data source and slave data sources, the m data sources are arranged in the order of data transmission with the master data source first, and each of the m data sources holds a random number. If the first data source is the master data source among the m data sources, the sending module 503 is configured to generate a first feature number according to the first random number held by the first data source and the first distance estimate, and to send the first feature number to the second data source that is arranged after the first data source and closest to it.
In a possible design, the obtaining module 504 is further configured to receive a second feature number from the second data source arranged last and to determine the actual distance between the first user and the first cluster center according to the second feature number, where the second feature number is generated according to the first feature number, m-1 second distance estimates, and m-1 second random numbers, each second distance estimate coming from one second data source among the m data sources and each second random number coming from one second data source among the m data sources.
In a possible design, the m data sources include a master data source and slave data sources, the m data sources are arranged in the order of data transmission with the master data source first, and each of the m data sources holds a random number. If the first data source is a slave data source among the m data sources, the sending module 503 is configured to receive a second feature number from the second data source that is arranged before the first data source and closest to it, where the second feature number is generated according to at least one second distance estimate and at least one second random number, each second distance estimate coming from one of the at least one second data source arranged before the first data source, and each second random number coming from one of the at least one second data source arranged before the first data source. The sending module 503 is further configured to generate a first feature number according to the second feature number, the first random number held by the first data source, and the first distance estimate, and to send it to the second data source arranged after the first data source.
In a possible design, if the first data source is a slave data source among the m data sources, the obtaining module 504 is further configured to receive the actual distance between the first user and the first cluster center from the master data source among the m data sources.
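The ordered, random-number design above can be simulated as a ring pass. The sketch makes two assumptions that this passage does not state: a feature number is formed by adding the local distance estimate and the local random number to the value received, and the random numbers are removed in a second pass around the ring in which each source subtracts its own. The function name `ring_sum` and the random range are hypothetical.

```python
import random

def ring_sum(estimates, seed=None):
    """Simulate the ordered pass; estimates[0] belongs to the master.

    Pass 1: the master starts with estimate + random number, and each slave
            adds its own estimate and random number before forwarding.
    Pass 2 (assumed, not described in this passage): each source subtracts
            its own random number, leaving the sum of the estimates.
    """
    rng = random.Random(seed)
    masks = [rng.uniform(-1e3, 1e3) for _ in estimates]  # one per data source

    running = 0.0
    for estimate, mask in zip(estimates, masks):
        running += estimate + mask  # feature number forwarded to the next source

    for mask in masks:
        running -= mask  # assumed unmasking pass

    return running  # the master would then send this to the slaves

actual_distance = ring_sum([1.5, 2.0, 0.25], seed=0)  # close to 3.75
```

In pass 1 every forwarded value is masked by at least one random number unknown to the receiver, which is the property the design relies on.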
Further, in a possible design, the apparatus also includes a cluster center moving distance obtaining module 506, configured to obtain the actual distance moved by the k cluster centers by following the method for obtaining the actual distance between the first user and the first cluster center.
The apparatus further includes a termination module 507, configured to compare the actual distance moved by the k cluster centers with the termination threshold. If the actual distance moved by the k cluster centers is greater than the termination threshold (the first threshold), the data sources perform the next round of clustering on the n users based on the new cluster centers; the steps of the next round are the same as those of the first round and yield a new clustering result.
If the actual distance moved by the k cluster centers is less than the termination threshold, clustering stops and the current clustering result is taken as the final clustering result.
It should be noted that, for content not mentioned in the embodiment corresponding to FIG. 5, refer to the description of the method embodiments; details are not repeated here.
In the embodiments of the present application, once the cluster centers are determined, each data source locally calculates the distance between a user and a cluster center over the dimensions it holds, that is, a distance estimate. Feature numbers generated from the distance estimates are transmitted between the data sources to determine the actual distance between the user and the cluster center, and the user is clustered according to that actual distance. Because the data never leaves its local data source and its details are not disclosed, the embodiments securely combine the data of multiple parties to cluster users. In addition, the participants can compute in parallel, so the overall system has the computing power of multiple computers and a higher processing speed.
Referring to FIG. 6, which is a schematic structural diagram of a user clustering device provided by an embodiment of the present application, the device 60 includes a processor 601, a memory 602, and an input/output interface 603. The processor 601 is connected to the memory 602 and the input/output interface 603, for example through a bus.
The processor 601 is configured to support the user clustering device in performing the corresponding functions of the user clustering method described in FIG. 1, FIG. 2, and FIG. 4. The processor 601 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 602 is used to store program code and the like. The memory 602 may include volatile memory (VM), such as random access memory (RAM); it may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also include a combination of the above types of memory.
The input/output interface 603 is used to input or output data.
The processor 601 may invoke the program code to perform the following operations:
determining a first cluster center, where the first cluster center is any one of k pre-clustered cluster centers;
calculating a first distance estimate between a first user and the first cluster center according to the feature data of the first user held by the first data source and the feature data, held by the first data source, of the user corresponding to the first cluster center;
generating at least one first feature number according to the first distance estimate, sending it to a second data source among the m data sources, and obtaining the actual distance between the first user and the first cluster center according to the at least one first feature number and a second distance estimate;
clustering the first user according to the actual distance.
It should be noted that, for the implementation of each operation, reference may also be made to the corresponding description of the foregoing method embodiments; the processor 601 may also cooperate with the input/output interface 603 to perform other operations of the foregoing method embodiments.
An embodiment of the present application further provides a computer storage medium storing a computer program. The computer program includes program instructions that, when executed by a computer, cause the computer to perform the method described in the foregoing embodiments. The computer may be a part of the user clustering device mentioned above, for example the processor 601.
Optionally, the storage medium involved in the present application may be a readable storage medium. Further optionally, the storage medium involved in the present application may be non-volatile or volatile.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the foregoing embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above discloses only preferred embodiments of the present application, which of course cannot be used to limit the scope of rights of the present application. Equivalent changes made according to the claims of the present application therefore still fall within the scope covered by the present application.

Claims (20)

  1. A user clustering method, wherein the method is applicable to a communication system, the communication system comprises m data sources, m is an integer greater than or equal to 2, and the method comprises:
    determining, by a first data source, a first cluster center, wherein the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center per user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer greater than 1;
    calculating, by the first data source, a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, wherein the first user is any one of the n users other than the k users;
    generating, by the first data source, at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;
    obtaining, by the first data source, an actual distance between the first user and the first cluster center, wherein the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
    clustering, by the first data source, the first user according to the actual distance.
  2. The method according to claim 1, wherein the generating, by the first data source, of at least one first feature number according to the first distance estimate and the sending of the at least one first feature number to a second data source among the m data sources comprise:
    splitting, by the first data source, the first distance estimate to obtain m-1 first feature numbers;
    sending, by the first data source, the m-1 first feature numbers respectively to the m-1 second data sources among the m data sources other than the first data source, wherein one first feature number is sent to each second data source.
  3. The method according to claim 2, wherein the m data sources include a master data source and slave data sources, and the obtaining, by the first data source, of the actual distance between the first user and the first cluster center comprises:
    receiving, by the first data source, m-1 second feature numbers from the m-1 second data sources, wherein each second feature number comes from one second data source and is generated according to the second distance estimate;
    calculating, by the first data source, the sum of the m-1 second feature numbers to obtain a first accumulated value;
    if the first data source is the master data source among the m data sources, receiving, by the first data source, m-1 second accumulated values from the m-1 second data sources, and calculating the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein each second accumulated value comes from one second data source and is calculated according to the first feature number;
    if the first data source is a slave data source among the m data sources, sending, by the first data source, the first accumulated value to the master data source among the m data sources, and receiving from the master data source the actual distance between the first user and the first cluster center, wherein the actual distance is calculated by the master data source according to the first accumulated value and m-1 second accumulated values, and each second accumulated value is calculated according to the first feature number.
  4. The method according to claim 1, wherein the m data sources include a master data source and slave data sources, the m data sources are arranged in the order of data transmission with the master data source first, each of the m data sources holds a random number, and, if the first data source is the master data source among the m data sources,
    the generating, by the first data source, of at least one first feature number according to the first distance estimate and the sending of the at least one first feature number to a second data source among the m data sources other than the first data source comprise:
    generating, by the first data source, a first feature number according to a first random number held by the first data source and the first distance estimate;
    sending, by the first data source, the first feature number to the second data source that is arranged after the first data source and closest to the first data source.
  5. The method according to claim 4, wherein the obtaining, by the first data source, of the actual distance between the first user and the first cluster center comprises:
    receiving, by the first data source, a second feature number from the second data source arranged last, and determining the actual distance between the first user and the first cluster center according to the second feature number, wherein the second feature number is generated according to the first feature number, m-1 second distance estimates, and m-1 second random numbers, each second distance estimate comes from one second data source among the m data sources, and each second random number comes from one second data source among the m data sources.
  6. The method according to claim 1, wherein the m data sources include a master data source and slave data sources, the m data sources are arranged in the order of data transmission with the master data source first, each of the m data sources holds a random number, and, if the first data source is a slave data source among the m data sources,
    the generating, by the first data source, of at least one first feature number according to the first distance estimate and the sending of the at least one first feature number to a second data source among the m data sources other than the first data source comprise:
    receiving, by the first data source, a second feature number from the second data source that is arranged before the first data source and closest to the first data source, wherein the second feature number is generated according to at least one second distance estimate and at least one second random number, each second distance estimate comes from one of the at least one second data source arranged before the first data source, and each second random number comes from one of the at least one second data source arranged before the first data source;
    generating, by the first data source, a first feature number according to the second feature number, a first random number held by the first data source, and the first distance estimate;
    sending, by the first data source, the first feature number to the second data source arranged after the first data source.
  7. The method according to claim 6, wherein the obtaining, by the first data source, of the actual distance between the first user and the first cluster center comprises:
    receiving, by the first data source, the actual distance between the first user and the first cluster center from the master data source among the m data sources.
  8. A user clustering apparatus, wherein the apparatus is applied to a first data source in a communication system, the communication system comprises m data sources, m is an integer greater than or equal to 2, and the user clustering apparatus comprises:
    a determination module, configured to determine a first cluster center, wherein the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users with one cluster center per user, the k users are among n users to be classified, the first data source is any one of the m data sources, and n is an integer greater than 1;
    a calculation module, configured to calculate a first distance estimate between a first user and the first cluster center according to feature data of the first user held by the first data source and feature data, held by the first data source, of the user corresponding to the first cluster center, wherein the first user is any one of the n users other than the k users;
    a sending module, configured to generate at least one first feature number according to the first distance estimate and to send the at least one first feature number to a second data source among the m data sources;
    an obtaining module, configured to obtain an actual distance between the first user and the first cluster center, wherein the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to feature data of the first user held by the second data source and feature data of the user corresponding to the first cluster center;
    a clustering module, configured to cluster the first user according to the actual distance.
  9. A user clustering device, comprising a processor, a memory, and an input/output interface that are connected to one another, wherein the user clustering device is a first data source in a communication system, the communication system comprises m data sources, and m is an integer greater than or equal to 2; wherein the input/output interface is configured to input or output data, the memory is configured to store program code, and the processor is configured to call the program code to perform the following method:
    determining a first cluster center, wherein the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n>1;
    calculating a first distance estimate between a first user and the first cluster center according to feature data of the first user owned by the first data source and feature data, owned by the first data source, of the user corresponding to the first cluster center, wherein the first user is any one of the n users other than the k users;
    generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;
    obtaining an actual distance between the first user and the first cluster center, wherein the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to the feature data of the first user and the feature data of the user corresponding to the first cluster center that are owned by the second data source;
    clustering the first user according to the actual distance.
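As an illustration of the calculating and clustering steps of claim 9, the following Python sketch assumes that each data source holds a disjoint subset of feature columns, that the local distance estimate is a partial sum of squared differences, and that the actual distance is the square root of the summed estimates. The claims do not fix a distance metric, so these choices and the names `local_distance_estimate` and `nearest_center` are hypothetical.

```python
import math
from typing import Dict, Sequence


def local_distance_estimate(user_features: Sequence[float],
                            center_features: Sequence[float]) -> float:
    """Partial squared distance over the feature columns one data source owns.

    Assumption: squared Euclidean terms; the claims do not name a metric.
    """
    if len(user_features) != len(center_features):
        raise ValueError("feature vectors must have the same length")
    return sum((u - c) ** 2 for u, c in zip(user_features, center_features))


def nearest_center(actual_distances: Dict[str, float]) -> str:
    """Cluster the user with the center whose actual distance is smallest."""
    return min(actual_distances, key=actual_distances.get)


if __name__ == "__main__":
    # Hypothetical data: source A owns two features, source B owns one.
    estimate_a = local_distance_estimate([1.0, 2.0], [0.0, 0.0])
    estimate_b = local_distance_estimate([3.0], [1.0])
    # In the claimed method the estimates are combined without being
    # revealed directly; here they are simply added to show the arithmetic.
    actual = math.sqrt(estimate_a + estimate_b)
    print(actual, nearest_center({"center_1": actual, "center_2": 5.0}))
```

The sketch adds the two estimates in the clear only to show how an actual distance could be derived from them; the dependent claims describe how the sum is obtained without a data source disclosing its own estimate.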
  10. The user clustering device according to claim 9, wherein generating at least one first feature number according to the first distance estimate and sending the at least one first feature number to a second data source among the m data sources comprises:
    splitting the first distance estimate to obtain m-1 first feature numbers;
    sending the m-1 first feature numbers respectively to the m-1 second data sources, among the m data sources, other than the first data source, wherein one first feature number is sent to one second data source.
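The splitting step of claim 10 can be sketched as additive sharing. The Python example below is a minimal sketch under two assumptions that the claim does not state: the distance estimate is encoded as an integer (for example, fixed-point), and the split is additive, so the m-1 feature numbers sum back to the estimate. The function name `split_estimate` and the `bound` parameter are hypothetical.

```python
import random
from typing import List


def split_estimate(estimate: int, m: int, rng: random.Random,
                   bound: int = 10**6) -> List[int]:
    """Split an integer distance estimate into m-1 additive feature numbers.

    Assumption: additive splitting; the first m-2 numbers are random and the
    last one is chosen so that all of them sum to the estimate.
    """
    parts = m - 1
    if parts < 1:
        raise ValueError("m must be at least 2")
    shares = [rng.randint(-bound, bound) for _ in range(parts - 1)]
    shares.append(estimate - sum(shares))
    return shares


if __name__ == "__main__":
    # One feature number would be sent to each of the m-1 other data sources.
    for receiver, share in enumerate(split_estimate(42, 4, random.Random(0)), 1):
        print(f"send to second data source {receiver}: {share}")
```

With m = 2 there is only one feature number, which under this additive scheme equals the estimate itself. A production system would also draw randomness from a cryptographically secure source rather than `random.Random`, which is used here only to keep the sketch reproducible.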
  11. The user clustering device according to claim 10, wherein the m data sources comprise a master data source and slave data sources, and obtaining the actual distance between the first user and the first cluster center comprises:
    receiving m-1 second feature numbers from the m-1 second data sources, wherein one second feature number comes from one second data source, and each second feature number is generated according to the second distance estimate;
    calculating the sum of the m-1 second feature numbers to obtain a first accumulated value;
    if the first data source is the master data source among the m data sources, receiving m-1 second accumulated values from the m-1 second data sources, and calculating the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein one second accumulated value comes from one second data source, and the second accumulated value is calculated according to the first feature number;
    if the first data source is a slave data source among the m data sources, sending the first accumulated value to the master data source among the m data sources, and receiving the actual distance between the first user and the first cluster center from the master data source, wherein the actual distance is calculated by the master data source according to the first accumulated value and m-1 second accumulated values, and the second accumulated value is calculated according to the first feature number.
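The exchange described in claims 10 and 11 can be simulated in a single process. The following Python sketch assumes integer-encoded estimates, additive splitting, and that the master data source (index 0 here) obtains the combined value by adding all accumulated values; how the master turns that sum into the actual distance is not specified in the claim. The names `split` and `aggregate_via_shares` are hypothetical.

```python
import random
from typing import List


def split(value: int, parts: int, rng: random.Random) -> List[int]:
    """Additively split value into `parts` integers (assumed scheme)."""
    shares = [rng.randint(-10**6, 10**6) for _ in range(parts - 1)]
    return shares + [value - sum(shares)]


def aggregate_via_shares(estimates: List[int], seed: int = 0) -> int:
    """Simulate m data sources exchanging feature numbers and accumulating them."""
    m = len(estimates)
    if m < 2:
        raise ValueError("at least two data sources are required")
    rng = random.Random(seed)
    inbox: List[List[int]] = [[] for _ in range(m)]

    # Each data source sends one feature number to every other data source.
    for sender, estimate in enumerate(estimates):
        receivers = [i for i in range(m) if i != sender]
        for receiver, share in zip(receivers, split(estimate, m - 1, rng)):
            inbox[receiver].append(share)

    # Each data source sums the m-1 feature numbers it received.
    accumulated = [sum(received) for received in inbox]

    # The master (index 0) combines its own accumulated value with the
    # m-1 accumulated values reported by the slave data sources.
    return accumulated[0] + sum(accumulated[1:])


if __name__ == "__main__":
    print(aggregate_via_shares([5, 4, 7]))
```

Because every feature number is counted exactly once across the inboxes, the master's total equals the sum of all local estimates even though no slave reports its estimate directly.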
  12. The user clustering device according to claim 9, wherein the m data sources comprise a master data source and slave data sources, the m data sources are arranged in order of data transmission, the master data source is arranged first, each of the m data sources has one random number, and, if the first data source is the master data source among the m data sources,
    generating at least one first feature number according to the first distance estimate and sending the at least one first feature number to a second data source, among the m data sources, other than the first data source comprises:
    generating a first feature number according to a first random number owned by the first data source and the first distance estimate;
    sending the first feature number to the second data source that is arranged after the first data source and is closest to the first data source.
  13. The user clustering device according to claim 12, wherein obtaining the actual distance between the first user and the first cluster center comprises:
    receiving a second feature number from the second data source arranged last, and determining the actual distance between the first user and the first cluster center according to the second feature number, wherein the second feature number is generated according to the first feature number, m-1 second distance estimates, and m-1 second random numbers, one second distance estimate comes from one second data source among the m data sources, and one second random number comes from one second data source among the m data sources.
  14. The user clustering device according to claim 9, wherein the m data sources comprise a master data source and slave data sources, the m data sources are arranged in order of data transmission, the master data source is arranged first, each of the m data sources has one random number, and, if the first data source is a slave data source among the m data sources,
    generating at least one first feature number according to the first distance estimate and sending the at least one first feature number to a second data source, among the m data sources, other than the first data source comprises:
    receiving a second feature number from the second data source that is arranged before the first data source and is closest to the first data source, wherein the second feature number is generated according to at least one second distance estimate and at least one second random number, one second distance estimate comes from one of the at least one second data source arranged before the first data source, and one second random number comes from one of the at least one second data source arranged before the first data source;
    generating a first feature number according to the second feature number, the first random number owned by the first data source, and the first distance estimate;
    sending the first feature number to the second data source arranged after the first data source.
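The sequential scheme of claims 12 to 14 can be sketched as a masked running sum passed along the ordered data sources. The Python example below assumes integer-encoded estimates and that each feature number is formed by adding the sender's random number and distance estimate to the value it received. The claims do not say how the random numbers are removed before the master determines the actual distance; the second pass, in which every data source subtracts its own random number, is purely an assumption of this sketch, as is the name `ring_masked_sum`.

```python
import random
from typing import List


def ring_masked_sum(estimates: List[int], seed: int = 0) -> int:
    """Simulate the master-first chain of data sources for one user and center.

    estimates[0] belongs to the master; the rest belong to the slaves in
    transmission order.
    """
    if len(estimates) < 2:
        raise ValueError("at least two data sources are required")
    rng = random.Random(seed)
    # Each data source owns one random number that it never discloses.
    masks = [rng.randint(1, 10**6) for _ in estimates]

    # Master: first feature number from its random number and estimate.
    message = estimates[0] + masks[0]

    # Each slave: add its own estimate and random number, then forward.
    for estimate, mask in zip(estimates[1:], masks[1:]):
        message = message + estimate + mask

    # Assumed unmasking pass (not described in the claims): the value
    # travels the chain again and each source removes its own random number.
    for mask in masks:
        message -= mask

    return message


if __name__ == "__main__":
    print(ring_masked_sum([5, 4, 7]))
```

In the first pass no forwarded value equals a single data source's estimate, because each one is offset by at least one random number that only its owner knows.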
  15. A computer storage medium, wherein the computer storage medium is applied to a first data source in a communication system, the communication system comprises m data sources, and m is an integer greater than or equal to 2; the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the following method:
    determining a first cluster center, wherein the first cluster center is any one of k pre-clustered cluster centers, the k cluster centers correspond to k users, one cluster center corresponds to one user, the k users are users among n users to be classified, the first data source is any one of the m data sources, and n is an integer with n>1;
    calculating a first distance estimate between a first user and the first cluster center according to feature data of the first user owned by the first data source and feature data, owned by the first data source, of the user corresponding to the first cluster center, wherein the first user is any one of the n users other than the k users;
    generating at least one first feature number according to the first distance estimate, and sending the at least one first feature number to a second data source among the m data sources;
    obtaining an actual distance between the first user and the first cluster center, wherein the actual distance is generated according to the at least one first feature number and a second distance estimate, and the second distance estimate is calculated according to the feature data of the first user and the feature data of the user corresponding to the first cluster center that are owned by the second data source;
    clustering the first user according to the actual distance.
  16. The computer storage medium according to claim 15, wherein generating at least one first feature number according to the first distance estimate and sending the at least one first feature number to a second data source among the m data sources comprises:
    splitting the first distance estimate to obtain m-1 first feature numbers;
    sending the m-1 first feature numbers respectively to the m-1 second data sources, among the m data sources, other than the first data source, wherein one first feature number is sent to one second data source.
  17. The computer storage medium according to claim 16, wherein the m data sources comprise a master data source and slave data sources, and obtaining the actual distance between the first user and the first cluster center comprises:
    receiving m-1 second feature numbers from the m-1 second data sources, wherein one second feature number comes from one second data source, and each second feature number is generated according to the second distance estimate;
    calculating the sum of the m-1 second feature numbers to obtain a first accumulated value;
    if the first data source is the master data source among the m data sources, receiving m-1 second accumulated values from the m-1 second data sources, and calculating the actual distance between the first user and the first cluster center according to the first accumulated value and the m-1 second accumulated values, wherein one second accumulated value comes from one second data source, and the second accumulated value is calculated according to the first feature number;
    if the first data source is a slave data source among the m data sources, sending the first accumulated value to the master data source among the m data sources, and receiving the actual distance between the first user and the first cluster center from the master data source, wherein the actual distance is calculated by the master data source according to the first accumulated value and m-1 second accumulated values, and the second accumulated value is calculated according to the first feature number.
  18. The computer storage medium according to claim 15, wherein the m data sources comprise a master data source and slave data sources, the m data sources are arranged in order of data transmission, the master data source is arranged first, each of the m data sources has one random number, and, if the first data source is the master data source among the m data sources,
    generating at least one first feature number according to the first distance estimate and sending the at least one first feature number to a second data source, among the m data sources, other than the first data source comprises:
    generating a first feature number according to a first random number owned by the first data source and the first distance estimate;
    sending the first feature number to the second data source that is arranged after the first data source and is closest to the first data source.
  19. The computer storage medium according to claim 18, wherein obtaining the actual distance between the first user and the first cluster center comprises:
    receiving a second feature number from the second data source arranged last, and determining the actual distance between the first user and the first cluster center according to the second feature number, wherein the second feature number is generated according to the first feature number, m-1 second distance estimates, and m-1 second random numbers, one second distance estimate comes from one second data source among the m data sources, and one second random number comes from one second data source among the m data sources.
  20. The computer storage medium according to claim 15, wherein the m data sources comprise a master data source and slave data sources, the m data sources are arranged in order of data transmission, the master data source is arranged first, each of the m data sources has one random number, and, if the first data source is a slave data source among the m data sources,
    generating at least one first feature number according to the first distance estimate and sending the at least one first feature number to a second data source, among the m data sources, other than the first data source comprises:
    receiving a second feature number from the second data source that is arranged before the first data source and is closest to the first data source, wherein the second feature number is generated according to at least one second distance estimate and at least one second random number, one second distance estimate comes from one of the at least one second data source arranged before the first data source, and one second random number comes from one of the at least one second data source arranged before the first data source;
    generating a first feature number according to the second feature number, the first random number owned by the first data source, and the first distance estimate;
    sending the first feature number to the second data source arranged after the first data source.
PCT/CN2021/097306 2020-11-20 2021-05-31 User clustering method, apparatus and device WO2022105183A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011307323.7 2020-11-20
CN202011307323.7A CN112381163B (en) 2020-11-20 2020-11-20 User clustering method, device and equipment

Publications (1)

Publication Number Publication Date
WO2022105183A1 true WO2022105183A1 (en) 2022-05-27

Family

ID=74585264

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097306 WO2022105183A1 (en) 2020-11-20 2021-05-31 User clustering method, apparatus and device

Country Status (2)

Country Link
CN (1) CN112381163B (en)
WO (1) WO2022105183A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381163B (en) * 2020-11-20 2023-07-25 平安科技(深圳)有限公司 User clustering method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170024186A1 (en) * 2015-03-02 2017-01-26 Slyce Holdings Inc. System and method for clustering data
CN110135856A (en) * 2019-05-16 2019-08-16 中国银联股份有限公司 A kind of repeat business risk monitoring method, device and computer readable storage medium
CN111783875A (en) * 2020-06-29 2020-10-16 中国平安财产保险股份有限公司 Abnormal user detection method, device, equipment and medium based on cluster analysis
CN112381163A (en) * 2020-11-20 2021-02-19 平安科技(深圳)有限公司 User clustering method, device and equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012089902A1 (en) * 2010-12-30 2012-07-05 Nokia Corporation Method, apparatus, and computer program product for image clustering
CN108734188B (en) * 2017-04-25 2023-04-07 中兴通讯股份有限公司 Clustering method, device and storage medium
CN107067045A (en) * 2017-05-31 2017-08-18 北京京东尚科信息技术有限公司 Data clustering method, device, computer-readable medium and electronic equipment
CN110895706B (en) * 2019-11-07 2022-12-27 苏宁云计算有限公司 Method and device for acquiring target cluster number and computer system

Also Published As

Publication number Publication date
CN112381163A (en) 2021-02-19
CN112381163B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN109299336B (en) Data backup method and device, storage medium and computing equipment
TWI727467B (en) Trustworthiness verification method, system, device and equipment of alliance chain
US9104676B2 (en) Hash algorithm-based data storage method and system
US20180367293A1 (en) Private set intersection encryption techniques
TWI712972B (en) Trustworthiness verification method, system, device and equipment of alliance chain
CN110427969B (en) Data processing method and device and electronic equipment
CN110852882B (en) Packet consensus method, apparatus, device, and medium for blockchain networks
WO2020220536A1 (en) Data backup method and device, and computer readable storage medium
CN113010896B (en) Method, apparatus, device, medium and program product for determining abnormal object
EP3817333B1 (en) Method and system for processing requests in a consortium blockchain
EP4033440A1 (en) Consensus method, apparatus and device of block chain
WO2021000575A1 (en) Data interaction method and apparatus, and electronic device
CN109145053B (en) Data processing method and device, client and server
WO2022105183A1 (en) User clustering method, apparatus and device
CN114528916A (en) Sample clustering processing method, device, equipment and storage medium
US11546171B2 (en) Systems and methods for synchronizing anonymized linked data across multiple queues for secure multiparty computation
CN112073395B (en) File distribution method and device
CN111382233A (en) Similar text detection method and device, electronic equipment and storage medium
CN107203578B (en) Method and device for establishing association of user identifiers
CN113392138B (en) Statistical analysis method, device, server and storage medium for private data
CN113312521B (en) Content retrieval method, device, electronic equipment and medium
US11921787B2 (en) Identity-aware data management
US20240022604A1 (en) Security configuration evaluation
US20230061914A1 (en) Rule based machine learning for precise fraud detection
CN114519884A (en) Face recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21893338

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21893338

Country of ref document: EP

Kind code of ref document: A1