CN105282011A - Social group finding method based on cluster fusion algorithm

Info

Publication number: CN105282011A
Application number: CN201510646011.1A
Authority: CN
Inventors: 刘波; 余刚; 肖燕珊; 郝志峰; 梁荣德
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2015-09-30
Filing date: 2015-09-30
Publication date: 2016-01-27

Abstract

The invention discloses a social group discovery method based on a clustering fusion algorithm. First, benchmark classes are generated from social network data, so that social users in the same benchmark class have similar group attributes. A sampler is applied to the base clusterings to generate multiple base clustering sets. For each base clustering set, the clustering fusion algorithms are run and their results are fused again by a clustering fusion algorithm to generate a candidate benchmark. A filter then selects the benchmark from the candidate benchmarks according to the configured screening condition. Finally, the benchmark is used to evaluate clustering quality with an extrinsic evaluation method. By fusing the decisions of the base clusterings, the invention obtains more accurate and more robust decisions. It improves the accuracy of group discovery and individual discovery in social network data and lets service providers obtain fuller user information, thereby improving service quality, and it is of considerable practical value.

Description

A Social Group Discovery Method Based on Clustering Fusion Algorithm

Technical field

The invention belongs to the technical field of social network group mining. It relates to a judgment method that uses a clustering fusion algorithm, and in particular to a social group discovery method based on a clustering fusion algorithm.

Background

"Internet+" is a further practical outcome of Internet thinking. It represents an advanced productive force that drives the continuous evolution of economic forms, thereby strengthening the vitality of social and economic entities and providing a broad network platform for reform, development, and innovation.

Today the traditional Internet is moving into a new era, that of the social networking service (SNS), from the era of "human and machine" to the era of "human and human". Individual social circles keep expanding and overlapping and eventually form a large social network. A notable feature of social networks is that they support huge numbers of users. Facebook, for example, supports more than 300 million users, and its data centers run more than 10,000 servers to provide information and communication services to users around the world. Moreover, any two social network users may interact, so data association operations between any two users in the database must be supported. This poses a great challenge to server-side database management.

A cloud server (Elastic Compute Service, ECS) is a computing service with elastically scalable processing capacity, and it is simpler and more efficient to manage than a physical server. Cloud servers help build more stable and secure applications quickly, reduce the difficulty of development and operations as well as overall IT cost, and let users concentrate on innovating in their core business. At present it is a relatively mature ecosystem.

The core idea of a clustering fusion algorithm is to fuse multiple clustering algorithms to reach a more accurate and more robust decision. On the one hand, the base clusterings come from different base clustering algorithms whose initialization conditions, parameter settings, and even underlying ideas differ, so each base clustering captures part of the characteristics of the data set. Fusing these differing base clusterings reflects the true characteristics of the data set more comprehensively and more accurately. On the other hand, even if some base clusterings carry wrong information about the data set, the correct information in a large number of other base clusterings corrects it, which yields a more robust clustering decision. Because of these advantages, clustering fusion is currently a fast-growing topic in clustering research.

Summary of the invention

The purpose of the invention is to provide a social group discovery method based on a clustering fusion algorithm. For complex social network data, the clustering fusion algorithm serves as the judgment criterion; previously unseen social network data are then classified into the corresponding groups, so that marketing staff can provide services tailored to each group.

The technical solution adopted by the invention is a social group discovery method based on a clustering fusion algorithm, implemented according to the following steps:

Step 1: For the data in the social network, obtain the corresponding sampled base clusterings using base clustering algorithms;

Step 2: Fuse each sampled base clustering set obtained in step 1 to obtain candidate benchmarks;

Step 3: Screen the candidate benchmarks obtained in step 2 and take the candidate benchmark with the highest score as the optimal benchmark;

Step 4: Use the optimal benchmark obtained in step 3 to evaluate clustering quality.

The invention is further characterized as follows.

Step 1 is implemented as follows:

Suppose there is a data set X containing M objects, X = {x₁, x₂, …, x_M}. After running N base clustering algorithms, N base clusterings are obtained, π = {π₁, π₂, …, π_N}. A fusion clustering algorithm is then applied to π to obtain the fused clustering π^*, defined as π^* = φ(π), where φ is the clustering fusion function;

First, social network user information is sampled: a social platform account is used to obtain access to the platform, and the target information is acquired in a directed manner by setting an initial task set;

Second, k-means is adopted as the candidate benchmark algorithm: the number of clusters is set first, and the initial cluster centers are then set at random, which yields multiple base clusterings. To obtain base clustering sets with high diversity, a sampler samples the set of base clusterings, and sub-sets of base clusterings are combined to give multiple sampled base clustering sets that differ strongly from one another.
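As an illustration of how repeated k-means runs with random initial centers produce the base clusterings π = {π₁, …, π_N}, the following is a minimal pure-Python sketch. The function names `kmeans` and `base_clusterings`, the fixed iteration count, and the toy two-dimensional points are assumptions made for this example and do not come from the patent.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # random initial cluster centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def base_clusterings(points, k, n_runs, seed=0):
    """Run k-means n_runs times, each with fresh random initial centers."""
    rng = random.Random(seed)
    return [kmeans(points, k, rng) for _ in range(n_runs)]

# Hypothetical toy data: six objects in two dimensions.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
pi = base_clusterings(points, k=2, n_runs=5)  # N = 5 base clusterings
```

Each element of `pi` is one base clustering, stored as a list of cluster labels in the same order as `points`.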

The sampler uses roulette-wheel random sampling.
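Roulette-wheel sampling can be sketched as below: each item is drawn with probability proportional to its weight. The helper name `roulette_sample`, the choice to draw without replacement, and the example weights are assumptions for illustration; the patent does not specify these details.

```python
import random

def roulette_sample(items, weights, size, rng):
    """Draw up to `size` distinct items; each draw selects item i with
    probability weights[i] / (sum of the weights still in the pool)."""
    pool = list(range(len(items)))
    chosen = []
    for _ in range(min(size, len(pool))):
        total = sum(weights[i] for i in pool)
        r = rng.random() * total      # spin the wheel: a point in [0, total)
        acc = 0.0
        pick = pool[-1]               # fallback guards against round-off
        for i in pool:
            acc += weights[i]
            if r < acc:               # the wheel stopped inside item i's slice
                pick = i
                break
        pool.remove(pick)             # without replacement (an assumption)
        chosen.append(items[pick])
    return chosen

# Hypothetical weights: a larger weight means a more diverse base clustering.
sampled = roulette_sample(["pi_1", "pi_2", "pi_3", "pi_4"],
                          [0.1, 0.0, 0.5, 0.4], size=3, rng=random.Random(1))
```

Calling the function several times with different random generators gives several sampled base clustering sets.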

Step 2 is implemented as follows:

The SLC algorithm is used to fuse the set of fused clusterings to obtain a candidate benchmark.

The score of a candidate benchmark is defined as follows:

$Eval(\pi_B^{*}) = \begin{cases} 0, & \text{if } |Accu(\pi_u^{*}) - Accu(\pi_v^{*})| > \alpha \\ \lambda \cdot NMI(\pi_u^{*}, \pi_v^{*}) + (1 - \lambda)\,|Accu(\pi_u^{*}) - Accu(\pi_v^{*})|, & \text{otherwise} \end{cases}$

where π_B^* is the candidate benchmark, π_u^* and π_v^* are fused clusterings, and α is a threshold.

When the degree of similarity between the fused clusterings is greater than α, the score is 0, which prevents the fused clusterings from being too similar. When the degree of similarity between the fused clusterings is less than α, the score is the sum of two parts: the first part is the degree of similarity between the fused clusterings and the candidate benchmark, and the second part is the degree of similarity between the fused clusterings. λ is the weight between the two parts: when λ > 0.5, the first part weighs more than the second; when λ < 0.5, the second part weighs more than the first; when λ = 0.5, the two parts weigh equally. In general λ = 0.5 is chosen. On this basis the score of every candidate benchmark is computed, and the candidate benchmark with the highest score becomes the final benchmark. The screened benchmark is used as the optimal benchmark in the next step to evaluate clustering quality.
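The piecewise score and the "highest score wins" screening rule can be sketched as follows. The function name `eval_candidate`, the candidate names, and the numeric NMI and Accu values are hypothetical; how NMI and Accu are computed for a real candidate is outside this sketch.

```python
def eval_candidate(nmi_uv, accu_u, accu_v, alpha, lam=0.5):
    """Score a candidate benchmark from NMI(pi_u*, pi_v*) and the two
    accuracies Accu(pi_u*) and Accu(pi_v*), following the piecewise rule."""
    gap = abs(accu_u - accu_v)
    if gap > alpha:          # first branch: score is zero above the threshold
        return 0.0
    return lam * nmi_uv + (1 - lam) * gap

# Hypothetical inputs per candidate: (NMI, Accu of pi_u*, Accu of pi_v*).
candidates = {
    "B1": (0.80, 0.90, 0.85),
    "B2": (0.60, 0.95, 0.60),
    "B3": (0.70, 0.88, 0.80),
}
scores = {name: eval_candidate(*vals, alpha=0.2) for name, vals in candidates.items()}
best = max(scores, key=scores.get)   # candidate with the highest score
```

With these made-up numbers, B2 falls into the zero branch because its accuracy gap of 0.35 exceeds α = 0.2, and B1 is selected.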

Step 4 is implemented as follows:

Using the optimal benchmark generated in the previous step, clustering quality is evaluated with the extrinsic method BCubed: given the benchmark π_b and K fused clusterings π = {π₁, π₂, …, π_K} produced by different clustering fusion algorithms, a quality score Q_i(π_i, π_b) can be computed for each fused clustering π_i. The higher the score, the better the fusion result produced by that clustering fusion algorithm;

Suppose there is a set of objects X = {x₁, x₂, …, x_n}, C is a clustering of X, and B is the benchmark of X. C(x_i) (1 ≤ i ≤ n) denotes the category of x_i in C, and B(x_i) (1 ≤ i ≤ n) denotes the category of x_i in B. For two objects x_i and x_j (1 ≤ i, j ≤ n, i ≠ j), the correctness of x_i and x_j in clustering C is defined as follows:

$Correctness(x_i, x_j) = \begin{cases} 1, & \text{if } B(x_i) = B(x_j) \Leftrightarrow C(x_i) = C(x_j) \\ 0, & \text{otherwise} \end{cases}$

The BCubed precision is defined as follows:

$P = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{x_j:\, i \neq j,\, C(x_i) = C(x_j)} Correctness(x_i, x_j)}{|\{x_j \mid i \neq j,\, C(x_i) = C(x_j)\}|}$

The BCubed recall is defined as follows:

$R = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{x_j:\, i \neq j,\, B(x_i) = B(x_j)} Correctness(x_i, x_j)}{|\{x_j \mid i \neq j,\, B(x_i) = B(x_j)\}|}$

Both precision and recall can be used to evaluate a clustering. The F-measure combines precision and recall and is defined as follows:

$F = \frac{(1 + \beta^{2})\, P \cdot R}{\beta^{2} \cdot P + R}$

The F-measure ranges from 0 to 1. When the F-measure equals 0, the clustering quality is poor; when it equals 1, the clustering quality is ideal and the clustering agrees completely with the benchmark. The closer the F-measure is to 1, the better the clustering quality.
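The BCubed precision, recall, and F-measure above can be computed directly from two label lists. This is a minimal sketch: the function names and the toy labels are assumptions. The handling of objects that have no peer (an empty denominator in the formulas) is also an assumption; here such objects are counted as fully correct.

```python
def correctness(C, B, i, j):
    """1 if x_i and x_j share a category in B exactly when they do in C."""
    return 1 if (B[i] == B[j]) == (C[i] == C[j]) else 0

def bcubed(C, B, beta=1.0):
    """Return (precision, recall, F) of clustering C against benchmark B."""
    n = len(C)

    def averaged(labels):
        total = 0.0
        for i in range(n):
            peers = [j for j in range(n) if j != i and labels[i] == labels[j]]
            if peers:
                total += sum(correctness(C, B, i, j) for j in peers) / len(peers)
            else:
                total += 1.0  # assumption: an object without peers counts as correct
        return total / n

    p = averaged(C)   # peers taken from the clustering: precision
    r = averaged(B)   # peers taken from the benchmark: recall
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f

# Hypothetical toy labels for five objects.
C = [0, 0, 0, 1, 1]   # clustering under evaluation
B = [0, 0, 1, 1, 1]   # benchmark
p, r, f = bcubed(C, B)
```

For these toy labels the third object is placed with the wrong peers, and precision, recall, and F all come out at 0.6.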

The beneficial effect of the invention is that it proposes a group discovery and identification method whose criterion is an extrinsic evaluation method that does not depend on expert-provided evaluation benchmarks. First, benchmark classes are generated from social network data, so that social users in the same benchmark class have similar group attributes. A sampler is applied to the base clusterings to generate multiple base clustering sets. For each base clustering set, the clustering fusion algorithms are run and their results are fused again by a clustering fusion algorithm to generate a candidate benchmark. A filter then selects the benchmark from the candidate benchmarks according to the configured screening condition. Finally, the benchmark is used to evaluate clustering quality with an extrinsic evaluation method. By fusing the decisions of the base clusterings, the invention obtains more accurate and more robust decisions. It improves the accuracy of group discovery and individual discovery in social network data and lets service providers obtain fuller user information, thereby improving service quality, and it is of considerable practical value.

Description of the drawings

Figure 1 is a framework diagram of the base clustering sampling stage;

Figure 2 is a framework diagram of the candidate benchmark generation stage;

Figure 3 is a framework diagram of the candidate benchmark screening stage.

Detailed description

The invention is described in detail below with reference to the drawings and specific embodiments.

The social group discovery method based on a clustering fusion algorithm is implemented according to the following steps:

Step 1: For the data in the social network, each base clustering algorithm yields its corresponding base clustering (base clustering algorithm 1 yields base clustering 1, and so on), which partitions the social network data in different ways. The set of base clusterings is then sampled by roulette-wheel random sampling in order to generate sampled base clustering sets with high diversity. Highly diverse sampled base clustering sets help produce diverse candidate fused clusterings later on, which benefits the screening of the final fused clustering.

Step 2: Fuse each sampled base clustering set to obtain candidate benchmarks. Specifically, the clustering fusion algorithms under evaluation are run on each sampled base clustering set of the network data, and the resulting set of fused clusterings is fused again by a clustering fusion algorithm to produce a candidate benchmark. Repeating this for every sampled set yields the candidate benchmark set.

Step 3: Screen the candidate benchmarks; the candidate benchmark with the highest score is the benchmark.

Step 4: Use the benchmark to evaluate clustering quality.

The invention is further illustrated below with a specific embodiment. It should be understood that the embodiment only illustrates the invention and does not limit its scope. After reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims.

Example

Figure 1 is a framework diagram of the base clustering sampling stage in this embodiment. The process is as follows:

Expressed formally, suppose there is a data set X containing M objects, X = {x₁, x₂, …, x_M}. After running N base clustering algorithms, N base clusterings are obtained, π = {π₁, π₂, …, π_N}. A fusion clustering algorithm is then applied to π to obtain the fused clustering π^*, defined as π^* = φ(π), where φ is the clustering fusion function.

First, social network user information is sampled: a social platform account is used to obtain access to the platform, and the target information is acquired in a directed manner by setting an initial task set. Occupation, interests, gender, location, and graduation school are extracted from the acquired data.
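Before a clustering algorithm such as k-means can run, the extracted categorical attributes need a numeric form. The patent does not say how this is done; one-hot encoding, shown below, is one common option and is an assumption here, as are the field names, the function name `one_hot_encode`, and the sample profiles.

```python
def one_hot_encode(profiles, fields):
    """Turn categorical profile attributes into numeric vectors.
    Returns (vectors, vocab), where vocab[field] lists the values seen."""
    vocab = {f: sorted({p[f] for p in profiles}) for f in fields}
    vectors = []
    for p in profiles:
        vec = []
        for f in fields:
            # One slot per known value of the field; exactly one slot is 1.0.
            vec.extend(1.0 if p[f] == v else 0.0 for v in vocab[f])
        vectors.append(tuple(vec))
    return vectors, vocab

# Hypothetical field names mirroring the five extracted attributes.
fields = ["occupation", "interest", "gender", "location", "school"]
profiles = [
    {"occupation": "engineer", "interest": "music", "gender": "F",
     "location": "Guangzhou", "school": "GDUT"},
    {"occupation": "teacher", "interest": "sports", "gender": "M",
     "location": "Shenzhen", "school": "GDUT"},
    {"occupation": "engineer", "interest": "sports", "gender": "M",
     "location": "Guangzhou", "school": "SYSU"},
]
vectors, vocab = one_hot_encode(profiles, fields)
```

Each user becomes a fixed-length vector, so distances between users can be computed by the base clustering algorithm.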

Second, k-means is adopted as the candidate benchmark algorithm. The number of clusters is set first, and the initial cluster centers are then set at random, which yields multiple base clusterings. To obtain base clustering sets with high diversity, a sampler samples the set of base clusterings, and sub-sets of base clusterings are combined to give multiple sampled base clustering sets that differ strongly from one another. The sampler uses roulette-wheel random sampling: the more a base clustering differs from the other base clusterings, the more likely it is to be sampled. The goal is to generate highly diverse base clustering sets, so that the following two steps can produce diverse fused clusterings and candidate benchmarks.

The experimental evaluation method for clustering fusion algorithms uses normalized mutual information (NMI) to measure the diversity and correctness of clusterings. By the definition of the sampler, a base clustering with higher diversity has a higher probability of being selected into a sampled base clustering set. This selection probability for a base clustering π_p is defined in terms of Div(π_p), the degree of difference between π_p and the other base clusterings.

The random k-means algorithm is run many times on the data set to obtain multiple base clusterings. The set of base clusterings is then passed to the sampler, which applies the roulette-wheel random sampling strategy, so that base clusterings with higher diversity are more likely to be selected, and multiple sampled base clustering sets are generated.
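The text states only that NMI measures diversity and that a larger Div(π_p) means a larger selection probability. The sketch below assumes two concrete forms that are not taken from the patent: Div(π_p) is one minus the mean NMI between π_p and the other base clusterings, and the selection probability is Div(π_p) divided by the sum of all Div values. NMI is normalized here by the geometric mean of the two entropies, which is one of several common conventions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized mutual information of two label lists (geometric-mean form)."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
             for (x, y), c in joint.items())
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:            # at least one clustering is a single cluster
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)

def diversity(clusterings):
    """Assumed form: Div(pi_p) = 1 - mean NMI to the other clusterings.
    Requires at least two clusterings."""
    divs = []
    for p, cp in enumerate(clusterings):
        others = [nmi(cp, cq) for q, cq in enumerate(clusterings) if q != p]
        divs.append(1.0 - sum(others) / len(others))
    return divs

def selection_probabilities(clusterings):
    """Assumed form: probability proportional to Div (roulette-wheel weights)."""
    divs = diversity(clusterings)
    total = sum(divs)
    if total == 0:                    # all clusterings identical: uniform choice
        return [1.0 / len(divs)] * len(divs)
    return [d / total for d in divs]

# Toy base clusterings of four objects; the first two are the same partition.
pis = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
probs = selection_probabilities(pis)
```

In this toy case the third clustering shares no information with the other two, so it receives the largest probability (0.5 against 0.25 each).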

Figure 2 is a framework diagram of how candidate benchmarks are generated from the sampler output. The process is as follows:

The sampler in the previous step generates multiple sampled base clustering sets. For each sampled base clustering set, the fusion clustering algorithms under evaluation are run to obtain a set of fused clusterings. The invention uses the SLC algorithm (Single-Level Cell) to fuse the set of fused clusterings and obtain a candidate benchmark.
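The patent does not spell out how the SLC fusion step works, so the following sketch substitutes a standard consensus technique rather than reproducing the patented method: a co-association matrix records how often two objects are clustered together, and objects linked by a co-association above a threshold are merged (connected components, which corresponds to a single-linkage cut). The function names, the threshold of 0.5, and the toy clusterings are assumptions.

```python
def co_association(clusterings):
    """co[i][j] = fraction of clusterings that put objects i and j together."""
    n, m = len(clusterings[0]), len(clusterings)
    return [[sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
            for i in range(n)]

def consensus(clusterings, threshold=0.5):
    """Merge objects whose co-association exceeds `threshold`; the connected
    components of the resulting graph are the consensus clusters."""
    co = co_association(clusterings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:                  # depth-first walk over strong links
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Three toy clusterings of five objects that disagree about the third object.
fused_set = [[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [1, 1, 0, 0, 0]]
candidate = consensus(fused_set)
```

Two of the three clusterings place the third object with the last two, so the consensus does the same.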

The score of a candidate benchmark is defined as follows:

$Eval(\pi_B^{*}) = \begin{cases} 0, & \text{if } |Accu(\pi_u^{*}) - Accu(\pi_v^{*})| > \alpha \\ \lambda \cdot NMI(\pi_u^{*}, \pi_v^{*}) + (1 - \lambda)\,|Accu(\pi_u^{*}) - Accu(\pi_v^{*})|, & \text{otherwise} \end{cases}$

where π_B^* is the candidate benchmark, π_u^* and π_v^* are fused clusterings, and α is a threshold. When the degree of similarity between the fused clusterings is greater than α, the score is 0, which prevents the fused clusterings from being too similar. When the degree of similarity between the fused clusterings is less than α, the score is the sum of two parts. The first part is the degree of similarity between the fused clusterings and the candidate benchmark, and the second part is the degree of similarity between the fused clusterings. λ is the weight between the two parts. When λ > 0.5, the first part weighs more than the second; when λ < 0.5, the second part weighs more than the first; when λ = 0.5, the two parts weigh equally. In general λ = 0.5 is chosen. On this basis the score of every candidate benchmark is computed, and the candidate benchmark with the highest score becomes the final benchmark. The screened benchmark is used as the benchmark in the next step to evaluate clustering quality.

Figure 3 is a framework diagram of the candidate benchmark screening stage. The process is as follows:

All candidate benchmarks are screened; the filter computes the score of each candidate benchmark and returns the optimal one.

Using the benchmark generated in the previous step, clustering quality can be evaluated with the extrinsic method BCubed. Given the benchmark π_b and K fused clusterings π = {π₁, π₂, …, π_K} produced by different clustering fusion algorithms, a quality score Q_i(π_i, π_b) can be computed for each fused clustering π_i. The higher the score, the better the fusion result produced by that clustering fusion algorithm.

BCubed is an extrinsic evaluation method that computes, against a benchmark, the precision and recall of every object in a clustering of a given data set. Suppose there is a set of objects X = {x₁, x₂, …, x_n}, C is a clustering of X, and B is the benchmark of X. C(x_i) (1 ≤ i ≤ n) denotes the category of x_i in C, and B(x_i) (1 ≤ i ≤ n) denotes the category of x_i in B. For two objects x_i and x_j (1 ≤ i, j ≤ n, i ≠ j), the correctness of x_i and x_j in clustering C is defined as follows:

$Correctness(x_i, x_j) = \begin{cases} 1, & \text{if } B(x_i) = B(x_j) \Leftrightarrow C(x_i) = C(x_j) \\ 0, & \text{otherwise} \end{cases}$

The BCubed precision is defined as follows:

$P = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{x_j:\, i \neq j,\, C(x_i) = C(x_j)} Correctness(x_i, x_j)}{|\{x_j \mid i \neq j,\, C(x_i) = C(x_j)\}|}$

The BCubed recall is defined as follows:

$R = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{x_j:\, i \neq j,\, B(x_i) = B(x_j)} Correctness(x_i, x_j)}{|\{x_j \mid i \neq j,\, B(x_i) = B(x_j)\}|} \qquad (5)$

Both precision and recall can be used to evaluate a clustering. The F-measure combines precision and recall and is defined as follows:

$F = \frac{(1 + \beta^{2})\, P \cdot R}{\beta^{2} \cdot P + R} \qquad (6)$

The F-measure ranges from 0 to 1. When the F-measure equals 0, the clustering quality is poor; when it equals 1, the clustering quality is ideal and the clustering agrees completely with the benchmark. The closer the F-measure is to 1, the better the clustering quality.
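As a worked illustration of the F-measure, take the hypothetical values P = 0.8 and R = 0.5 (chosen for this example only). With β = 1, precision and recall are weighted equally:

$F = \frac{(1 + 1^{2}) \cdot 0.8 \cdot 0.5}{1^{2} \cdot 0.8 + 0.5} = \frac{0.8}{1.3} \approx 0.615$

With β = 2, recall carries more weight, and because recall is the lower of the two values here, the F-measure drops:

$F = \frac{(1 + 2^{2}) \cdot 0.8 \cdot 0.5}{2^{2} \cdot 0.8 + 0.5} = \frac{2.0}{3.7} \approx 0.541$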

Claims

1. A social group discovery method based on a clustering fusion algorithm, characterized in that it is implemented according to the following steps:

Step 1: For the data in the social network, obtain the corresponding sampled base clusterings using base clustering algorithms;

Step 2: Fuse each sampled base clustering set obtained in step 1 to obtain candidate benchmarks;

Step 3: Screen the candidate benchmarks obtained in step 2 and take the candidate benchmark with the highest score as the optimal benchmark;

Step 4: Use the optimal benchmark obtained in step 3 to evaluate clustering quality.

2. The social group discovery method based on a clustering fusion algorithm according to claim 1, characterized in that step 1 is implemented as follows:

Suppose there is a data set X containing M objects, X = {x₁, x₂, …, x_M}. After running N base clustering algorithms, N base clusterings are obtained, π = {π₁, π₂, …, π_N}. A fusion clustering algorithm is then applied to π to obtain the fused clustering π^*, defined as π^* = φ(π), where φ is the clustering fusion function;

First, social network user information is sampled: a social platform account is used to obtain access to the platform, and the target information is acquired in a directed manner by setting an initial task set;

Second, k-means is adopted as the candidate benchmark algorithm: the number of clusters is set first, and the initial cluster centers are then set at random, which yields multiple base clusterings. To obtain base clustering sets with high diversity, a sampler samples the set of base clusterings, and sub-sets of base clusterings are combined to give multiple sampled base clustering sets that differ strongly from one another.

3. The social group discovery method based on a clustering fusion algorithm according to claim 2, characterized in that the sampler uses roulette-wheel random sampling.

4. The social group discovery method based on a clustering fusion algorithm according to claim 2, characterized in that step 2 is implemented as follows:

The SLC algorithm is used to fuse the set of fused clusterings to obtain a candidate benchmark.

The score of a candidate benchmark is defined as follows:

$Eval(\pi_B^{*}) = \begin{cases} 0, & \text{if } |Accu(\pi_u^{*}) - Accu(\pi_v^{*})| > \alpha \\ \lambda \cdot NMI(\pi_u^{*}, \pi_v^{*}) + (1 - \lambda)\,|Accu(\pi_u^{*}) - Accu(\pi_v^{*})|, & \text{otherwise} \end{cases}$

where π_B^* is the candidate benchmark, π_u^* and π_v^* are fused clusterings, and α is a threshold.

5. The social group discovery method based on a clustering fusion algorithm according to claim 4, characterized in that step 3 is implemented as follows: when the degree of similarity between the fused clusterings is greater than α, the score is 0, which prevents the fused clusterings from being too similar; when the degree of similarity between the fused clusterings is less than α, the score is the sum of two parts; the first part is the degree of similarity between the fused clusterings and the candidate benchmark, and the second part is the degree of similarity between the fused clusterings; λ is the weight between the two parts; when λ > 0.5, the first part weighs more than the second; when λ < 0.5, the second part weighs more than the first; when λ = 0.5, the two parts weigh equally; in general λ = 0.5 is chosen; on this basis the score of every candidate benchmark is computed, and the candidate benchmark with the highest score becomes the final benchmark; the screened benchmark is used as the optimal benchmark in the next step to evaluate clustering quality.

6. The social group discovery method based on a clustering fusion algorithm according to claim 5, characterized in that step 4 is implemented as follows:

Using the optimal benchmark generated in the previous step, clustering quality is evaluated with the extrinsic method BCubed: given the benchmark π_b and K fused clusterings π = {π₁, π₂, …, π_K} produced by different clustering fusion algorithms, a quality score Q_i(π_i, π_b) can be computed for each fused clustering π_i; the higher the score, the better the fusion result produced by that clustering fusion algorithm;

Suppose there is a set of objects X = {x₁, x₂, …, x_n}, C is a clustering of X, and B is the benchmark of X; C(x_i) (1 ≤ i ≤ n) denotes the category of x_i in C, and B(x_i) (1 ≤ i ≤ n) denotes the category of x_i in B; for two objects x_i and x_j (1 ≤ i, j ≤ n, i ≠ j), the correctness of x_i and x_j in clustering C is defined as follows:

$Correctness(x_i, x_j) = \begin{cases} 1, & \text{if } B(x_i) = B(x_j) \Leftrightarrow C(x_i) = C(x_j) \\ 0, & \text{otherwise} \end{cases}$

The BCubed precision is defined as follows:

$P = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{x_j:\, i \neq j,\, C(x_i) = C(x_j)} Correctness(x_i, x_j)}{|\{x_j \mid i \neq j,\, C(x_i) = C(x_j)\}|}$

The BCubed recall is defined as follows:

$R = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{x_j:\, i \neq j,\, B(x_i) = B(x_j)} Correctness(x_i, x_j)}{|\{x_j \mid i \neq j,\, B(x_i) = B(x_j)\}|}$

Both precision and recall can be used to evaluate a clustering, and the F-measure combines precision and recall; it is defined as follows:

$F = \frac{(1 + \beta^{2})\, P \cdot R}{\beta^{2} \cdot P + R}$

The F-measure ranges from 0 to 1. When the F-measure equals 0, the clustering quality is poor; when it equals 1, the clustering quality is ideal and the clustering agrees completely with the benchmark; the closer the F-measure is to 1, the better the clustering quality.