CN111125469B

CN111125469B - User clustering method and device of social network and computer equipment

Info

Publication number: CN111125469B
Application number: CN201911247467.5A
Authority: CN
Inventors: 陈子忠; 彭道万; 夏书银; 李曹枭
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2022-06-10
Anticipated expiration: 2039-12-09
Also published as: CN111125469A

Abstract

The invention belongs to the field of machine learning and data mining, and particularly relates to a user clustering method and device of a social network and computer equipment. The method comprises: obtaining account information of each user to be clustered in a social network, wherein the account information comprises a user ID, a user region, a user gender, the user's number of followers (fans), the number of accounts the user follows, a content tag and a release time; vectorizing each piece of account information in the social network to form a vector data set; selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering results, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of clustered classifications, each of which comprises at least one user to be clustered. The improved initial clustering center selection algorithm and the improved clustering framework can greatly reduce the amount of computation, thereby improving the efficiency of the whole method and device.

Description

User clustering method and device of social network and computer equipment

Technical Field

The invention belongs to the field of machine learning and data mining, and relates to a fast and efficient k-means clustering algorithm in a clustering problem and application thereof in a social network, in particular to a user clustering method and device of the social network and computer equipment.

Background

Social network analysis is a product of combining social science and natural science. Research on social networks covers networks such as e-mail, WeChat, QQ, Sina Weibo, Twitter and Facebook; various objects exist in these networks, and these objects need to be classified. The k-means clustering algorithm is one of the most common, simple and effective clustering algorithms. The standard k-means clustering algorithm was proposed independently, in different scientific fields, by Steinhaus in 1956, Lloyd in 1957, Ball & Hall in 1965, and MacQueen in 1967. Cluster analysis is a technique for statistical data analysis and is widely used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

The traditional k-means algorithm performs well when processing small batches of data, both in efficiency and in clustering quality. In a social network, however, massive user data must be processed, so the efficiency of the clustering algorithm is very important. In the past, traditional clustering algorithms were mainly used for this purpose, but they become inefficient in big data scenarios: their convergence is extremely slow, their time complexity is high, they are sensitive to noise and outliers, and their clustering results depend on the initial clustering centers.

To address these problems of the traditional k-means algorithm, the main improvements fall into three categories: first, selection of the initial clustering centers; second, approximate k-means; and third, accelerated k-means. David Arthur et al. proposed k-means++, an initial cluster center selection method based on D²-sampling. The core of the method is that the initial centroids are spaced as far apart as possible. This is the most widely used method of initializing the cluster centers. Although it remedies the standard k-means algorithm's random selection of initial center points, its inherently sequential nature limits its scalability, so the algorithm cannot be parallelized for application to very large data sets. When clustering massive data, approximate k-means is a very effective approach. In recent years, researchers have proposed a number of approximate k-means methods from different perspectives. One stores the data points in a k-d tree and maintains a subset of candidate centers for each node of the tree, which reduces computation time by avoiding comparing each point with all center points. Another is based on sub-sampling the data points: it runs k-means on the sub-sampled data points, and its extension adds the remaining points incrementally and reruns k-means to obtain a finer clustering. However, such approximate approaches are not suitable for many applications; in social networking applications, for example, their clustering is less accurate and performs worse. There are also many exact accelerated k-means clustering methods, but problems such as extra time and space consumption and a lack of adaptivity remain in the context of massive social network data.

Disclosure of Invention

The invention aims to solve the efficiency problem that arises when clustering massive data in a social network scenario. To that end it provides a fast and efficient partition-based k-means clustering algorithm, and in particular a user clustering method and device of a social network and computer equipment.

A method of clustering users of a social network, the method comprising:

step 1: acquiring account information of each user to be clustered in the social network, wherein the account information comprises a user ID, a user region, a user gender, the user's number of followers (fans), the number of accounts the user follows, a content tag and a release time;

step 2: vectorizing each piece of account information in the social network to form a vector data set;

step 3: selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering results, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered.

In another aspect, the present invention further provides a user clustering device for a social network, including:

The data interface module is used for accessing account information of each user to be clustered in the social network;

the data shaping module is used for shaping the account information accessed by the data interface module to form a vector data set;

and the clustering result module is used for processing the vectors in the vector data set to obtain a plurality of clustered classifications, and each classification at least comprises one user to be clustered.

The invention has the beneficial effects that:

In the invention, in the scenario of processing massive social network data, the user account information of the social network is vectorized through data preprocessing, and the improved initial clustering center selection method provided by the invention makes the final clustering result more accurate. In the improved k-means framework, because the concept of a 'core domain' is defined and the neighbor clusters of each cluster ball are found, all computation is confined to a small range, which saves computation that would otherwise be required and improves the efficiency of clustering massive social network data. The invention can theoretically reduce the time complexity of the algorithm from O(nk) per iteration to O(k² + n). For ultra-large-scale social network user clustering, the method and device can greatly reduce the amount of computation and thus improve the efficiency of the whole method and device.
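As a rough worked illustration of this complexity reduction, take hypothetical sizes of n = 10^6 users and k = 10^3 clusters (these values are assumptions chosen for illustration and do not come from the embodiments), and ignore constant factors:

```latex
\begin{aligned}
nk &= 10^{6} \cdot 10^{3} = 10^{9}, \\
k^{2} + n &= 10^{6} + 10^{6} = 2 \times 10^{6}, \\
\frac{nk}{k^{2} + n} &= 500 .
\end{aligned}
```

Under these assumed sizes the per-iteration operation count would drop by a factor of about 500; the gain actually achieved depends on how many neighbor clusters each cluster ball has.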

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only a simple schematic diagram of the present invention.

FIG. 1 is an overall flow diagram of one embodiment of the present invention;

FIG. 2 is an overall flow diagram of another embodiment of the present invention;

FIG. 3 is a flow diagram of an extraction and processing of social networking data;

FIG. 4 is a detailed flow diagram of an improved k-means clustering framework in an embodiment of the present invention;

FIG. 5 is a detailed flow chart of an improved k-means clustering framework in another embodiment of the present invention;

FIG. 6 is a schematic diagram of a user clustering device of a social network according to the present invention;

FIG. 7 is a schematic diagram of a data shaping module of the present invention;

FIG. 8 is a schematic diagram of a clustering results module of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly and completely apparent, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

As shown in fig. 1, a method for clustering users in a social network includes:

step 2: vectorizing each account information in the social network respectively to form a vector data set;

In another embodiment, as shown in fig. 2, a method for clustering users of a social network may further include:

acquiring user account information in a social network (such as Sina Weibo), wherein the user account information comprises a user ID, a user region, a user gender, the user's number of followers (fans), the number of accounts the user follows, a microblog content tag, a release time and the like;

preprocessing user information and then vectorizing the preprocessed user information;

Selecting an initial clustering center by using an improved method;

inputting all processed data sets into an improved k-means algorithm frame, and continuously iterating until the algorithm converges;

and outputting an accurate clustering result.

In one embodiment, as shown in FIG. 3, there is a flow chart for extracting and processing social networking data, the method comprises the following steps:

carrying out data cleaning on social network data, and preprocessing the characteristic data of a user;

converting the format of the data, wherein one part of the data is represented by a digital vector and the other part by a semantic vector. For example, the user ID is normalized to the range between 0 and 1, the user gender is encoded numerically (for example, male is represented by 1 and female by 0), and the user region is represented by numbers (for example, Beijing 001, Shanghai 002, Chongqing 003 and the like). The semantic information is converted into semantic vectors by the sentence-to-vector (sen2vec) method: the words in the semantic information are converted using pre-trained word vectors, all the word vectors are weighted, and the whole semantic information set is then processed using principal component analysis to obtain the vector representation of each piece of semantic information. The user data information is vectorized on the basis of these steps.
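The vectorization described above can be sketched roughly as follows. This is only an illustrative sketch under several assumptions: the dictionary keys (`id`, `gender`, `region`, `tags`), the function names, the per-word weights and the number of retained principal components are all hypothetical, and the PCA step is taken here to mean projection onto the leading principal directions, which the text does not specify.

```python
import numpy as np

# Region codes follow the example in the text (Beijing 001, Shanghai 002, Chongqing 003).
REGION_CODES = {"Beijing": 1, "Shanghai": 2, "Chongqing": 3}


def numeric_vector(user, max_id):
    """Encode ID, gender and region as numbers (the 'digital vector')."""
    return np.array([
        user["id"] / max_id,                       # ID normalized into [0, 1]
        1.0 if user["gender"] == "male" else 0.0,  # male = 1, female = 0
        float(REGION_CODES[user["region"]]),       # region as a numeric code
    ])


def semantic_vectors(token_lists, word_vecs, word_weights, n_components):
    """Weighted average of word vectors per text, then PCA over the whole set."""
    dim = len(next(iter(word_vecs.values())))
    sent = np.zeros((len(token_lists), dim))
    for row, tokens in enumerate(token_lists):
        known = [t for t in tokens if t in word_vecs]
        if known:
            w = np.array([word_weights.get(t, 1.0) for t in known])
            v = np.array([word_vecs[t] for t in known])
            sent[row] = (w[:, None] * v).sum(axis=0) / w.sum()
    centered = sent - sent.mean(axis=0)
    # PCA via SVD: project onto the leading principal directions (assumed reading).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def vectorize(users, word_vecs, word_weights, n_components=2):
    """Splice the digital vector and the semantic vector of every user."""
    max_id = max(u["id"] for u in users)
    numeric = np.stack([numeric_vector(u, max_id) for u in users])
    semantic = semantic_vectors(
        [u["tags"] for u in users], word_vecs, word_weights, n_components
    )
    return np.hstack([numeric, semantic])
```

Each row of the returned array is one user vector, with the digital part first and the semantic part appended.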

In one embodiment, as shown in FIG. 4, it is a detailed flow chart of the improved k-means clustering framework, which includes the detailed process of the whole clustering. The method comprises the following steps:

step 301: selecting k vectors from the vector data set as an initial clustering center by using an improved initial clustering center method;

step 302: dividing all vectors into cluster balls represented by cluster centers closest to the vectors according to a nearest principle;

step 303: calculating the mean value of all vectors in each cluster ball to serve as a new clustering center, and calculating the radius of the cluster ball;

step 304: finding out the neighbor cluster of each cluster according to the distance relationship between the cluster sphere radius and the cluster center;

step 305: calculating the distance between each vector and the center of the adjacent cluster where the vector is located, and dividing the vector into cluster balls with the nearest distance according to the principle of proximity;

step 306: and repeating the steps 303 to 305 until the clustering center is not changed any more, and outputting a clustering result.
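Steps 302 to 306 can be sketched as the loop below. It is a minimal sketch rather than the patented implementation: it assumes Euclidean distance, takes the initial centers of step 301 as an argument, and uses hypothetical names (`neighbor_kmeans`, `max_iter`). The neighbor test follows the definition given elsewhere in the description (half the center-to-center distance is smaller than the radius of the current cluster ball).

```python
import numpy as np


def neighbor_kmeans(X, centers, max_iter=100):
    """k-means in which each point is compared only with neighbor clusters."""
    centers = np.asarray(centers, dtype=float).copy()
    k = len(centers)
    # Step 302: assign every vector to its nearest initial center.
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    for _ in range(max_iter):
        # Step 303: new centers (cluster means) and cluster-ball radii.
        radii = np.zeros(k)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
                radii[j] = np.linalg.norm(members - new_centers[j], axis=1).max()
        # Step 306: stop once the centers no longer change.
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
        # Step 304: cluster j neighbors cluster i when dist(c_i, c_j) / 2 < radius_i.
        center_dist = np.linalg.norm(centers[:, None] - centers[None], axis=2)
        # Step 305: reassign points using only their own and neighbor centers.
        new_labels = labels.copy()
        for i in range(k):
            idx = np.flatnonzero(labels == i)
            if len(idx) == 0:
                continue
            is_candidate = (center_dist[i] / 2 < radii[i]) | (np.arange(k) == i)
            cand = np.flatnonzero(is_candidate)
            d = np.linalg.norm(X[idx, None] - centers[cand][None], axis=2)
            new_labels[idx] = cand[d.argmin(axis=1)]
        labels = new_labels
    return labels, centers
```

By the triangle inequality, a point inside cluster ball i cannot be closer to a center that fails the neighbor test than to its own center, so restricting the comparison in step 305 does not change the assignment that a full nearest-center search would produce.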

In one embodiment, the step 301 comprises:

step 3011: randomly selecting one vector data from the vector data set as a first initial clustering center;

step 3012: sampling a Markov chain of length 3k from the vector data set by using the Markov chain Monte Carlo (MCMC) method, and taking the 3k data points on the Markov chain as candidate initial clustering centers;

Step 3013: for the 3k candidate initial clustering centers, repeatedly merging the two closest initial clustering centers into a new initial clustering center by using Prim's minimum spanning tree method, until only k data points are left as the initial clustering centers.

Of course, the number of vectors sampled by the Markov chain Monte Carlo method may also be 4k, or 4.5k, and so on.
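A possible shape for steps 3011 to 3013 is sketched below. The text does not specify the proposal or acceptance rule of the Markov chain, so this sketch assumes a Metropolis-Hastings chain with uniform proposals whose target favors points far from the first center. It also replaces Prim's minimum spanning tree procedure with a simpler greedy closest-pair merge and uses the midpoint as the merged center. All of these choices, and all function names, are assumptions made for illustration.

```python
import numpy as np


def mcmc_candidates(X, m, rng):
    """Sample m candidate centers with a Metropolis-Hastings chain (assumed scheme)."""
    first = X[rng.integers(len(X))]          # step 3011: random first center
    cur = X[rng.integers(len(X))]
    cur_d2 = np.sum((cur - first) ** 2)
    chain = []
    for _ in range(m):                       # step 3012: chain of length m (e.g. 3k)
        prop = X[rng.integers(len(X))]
        prop_d2 = np.sum((prop - first) ** 2)
        # Accept with probability min(1, prop_d2 / cur_d2): far points are favored.
        if cur_d2 == 0 or rng.random() < prop_d2 / cur_d2:
            cur, cur_d2 = prop, prop_d2
        chain.append(cur)
    return np.array(chain)


def merge_closest(cands, k):
    """Step 3013 (simplified): merge the two closest candidates until k remain."""
    cands = [np.asarray(c, dtype=float) for c in cands]
    while len(cands) > k:
        pts = np.array(cands)
        d = np.linalg.norm(pts[:, None] - pts[None], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(d.argmin(), d.shape)
        merged = (cands[i] + cands[j]) / 2   # midpoint as merged center (assumption)
        cands = [c for t, c in enumerate(cands) if t not in (i, j)] + [merged]
    return np.array(cands)


def init_centers(X, k, seed=0):
    """Draw 3k candidates and reduce them to k initial centers."""
    rng = np.random.default_rng(seed)
    return merge_closest(mcmc_candidates(X, 3 * k, rng), k)
```

The naive merge recomputes all pairwise distances on every pass, which is acceptable for the 3k candidates involved but would not be for the full data set.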

In one embodiment, the step 301 may further include:

selecting vector data from the vector data set by reservoir sampling: one vector is randomly selected as the first initial clustering center; the first k vectors in the vector data set are placed into the reservoir; each subsequent m-th element replaces one vector in the reservoir with probability k/m; and the k vectors finally selected are taken as the initial clustering centers.
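The reservoir sampling rule described above (fill the reservoir with the first k vectors, then let the m-th element replace a reservoir entry with probability k/m) can be sketched as follows. The function name and the `seed` parameter are hypothetical, and the evicted slot is assumed to be chosen uniformly at random.

```python
import random


def reservoir_sample(vectors, k, seed=None):
    """Select k items from a stream of unknown length in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for m, vec in enumerate(vectors, start=1):
        if m <= k:
            reservoir.append(vec)               # the first k vectors fill the reservoir
        elif rng.random() < k / m:              # the m-th vector enters with probability k/m
            reservoir[rng.randrange(k)] = vec   # and evicts a uniformly chosen slot
    return reservoir
```

Because the stream is read only once and only k items are held in memory, this form of selection suits data sets that are too large to index directly.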

In another embodiment, as shown in fig. 5, the step 3 further includes:

step 311: first, a first iteration is carried out on a data set by using a standard k-means flow, and all data are distributed to a cluster where a central point closest to the data is located:

$$b(x_i) = \arg\min_{j = 1, \dots, k} \operatorname{dis}(x_i, c_j)$$

where $c_j$ represents the center of the $j$-th cluster and $x_i$ is any sample in the sample space.

Step 312: updating all cluster centers according to the above allocation step:

$$c_j = \frac{1}{|N|} \sum_{x_i \in C_j} x_i$$

where $|N|$ denotes the number of data samples assigned to cluster $C_j$.

Step 313: calculating the radius of each cluster sphere (radius is defined as the distance between the data point farthest from the center point and the center point in the cluster sphere) defined as:

$$R_j = \max_{x_i \in C_j} \operatorname{dis}(x_i, c_j)$$

step 314: finding out the neighbor clusters of each cluster according to the radius of each cluster ball and the distance between any two cluster balls (another cluster is defined as a neighbor cluster of the current cluster if half of the distance between the center points of the two clusters is smaller than the radius of the current cluster);

step 315: sequencing the adjacent clusters of each cluster from near to far;

step 316: filtering out data in a "stable domain" in each cluster sphere (a "stable domain" is defined as a set of data points that are close to the cluster center and still belong to the current cluster sphere in the current iteration);

step 317: assigning the data outside the "stable domain" in each cluster ball according to the nearest principle, wherein the candidate clusters for assignment are restricted to the sorted neighbor clusters of that cluster ball obtained in step 315;

step 318: steps 312 through 317 are repeated until all center points no longer change.
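The per-iteration bookkeeping of steps 313 to 316 can be sketched as below. The text defines the "stable domain" only qualitatively, so this sketch assumes one concrete criterion: a point is stable when its distance to its own center is less than half the distance from that center to the nearest other center. By the triangle inequality such a point cannot be closer to any other center. The function name and return layout are hypothetical.

```python
import numpy as np


def ball_structure(X, labels, centers):
    """Radii, sorted neighbor lists and a stable-domain mask for one iteration."""
    k = len(centers)
    dist_to_own = np.linalg.norm(X - centers[labels], axis=1)
    # Step 313: radius = distance from the center to its farthest member.
    radii = np.zeros(k)
    for j in range(k):
        if np.any(labels == j):
            radii[j] = dist_to_own[labels == j].max()
    center_dist = np.linalg.norm(centers[:, None] - centers[None], axis=2)
    neighbors = []
    for i in range(k):
        # Step 314: j is a neighbor of i when dist(c_i, c_j) / 2 < radius_i.
        near = [j for j in range(k) if j != i and center_dist[i, j] / 2 < radii[i]]
        # Step 315: sort the neighbors from near to far.
        neighbors.append(sorted(near, key=lambda j: center_dist[i, j]))
    # Step 316 (assumed criterion): stable when closer to the own center than
    # half the distance to that center's nearest other center.
    off_diag = center_dist + np.diag(np.full(k, np.inf))
    half_gap = off_diag.min(axis=1) / 2
    stable = dist_to_own < half_gap[labels]
    return radii, neighbors, stable
```

Only the points where `stable` is false need to be compared against the sorted neighbor centers in step 317, which is where the saving in distance computations comes from.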

In addition, in a specific embodiment, the vector data sets of the present invention consist of user data crawled from Twitter and user data from Sina Weibo, comprising 3.76 million and 6.20 million pieces of user data respectively. After all social network user data are preprocessed, the two data sets are clustered under the framework of the method provided by the invention. The results show that the clustering is on average 51 times faster than the traditional k-means algorithm, and that the clustering result, evaluated by the within-cluster sum of squared distances (WCSSD; the smaller, the better), is 3.1% lower than that of the traditional social network user clustering algorithm.
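For reference, the evaluation index mentioned above can be computed as in the following sketch, assuming Euclidean distance and that `labels` holds the index of the cluster each vector belongs to (the function and argument names are hypothetical):

```python
import numpy as np


def wcssd(X, labels, centers):
    """Within-cluster sum of squared distances; smaller values are better."""
    diffs = X - centers[labels]        # offset of each vector from its own center
    return float(np.sum(diffs ** 2))
```

Comparing this value for two clusterings of the same data set with the same k gives the kind of relative quality figure quoted above.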

Based on the same concept of the present invention, the present invention further provides a user clustering device for a social network, as shown in fig. 6, including:

As shown in fig. 7, the data shaping module includes a digital vector generating unit and a semantic vector generating unit; the digital vector generating unit is configured to convert one part of the account information into a digital vector, and the semantic vector generating unit is configured to convert another part of the account information into a semantic vector.

As shown in fig. 8, the clustering result module includes an initial clustering center selecting unit, a calculating unit, and a result calculating unit; the cluster center selection unit is used for selecting an initial cluster center in the vector data set; the calculating unit is used for calculating the radius distance of the cluster balls and the distance relation between any two cluster balls; and the result operation unit is used for dividing a clustering result according to the clustering center.

The initial clustering center selecting unit comprises a random selecting subunit, a fixed selecting subunit and a merging unit; the random selecting subunit is used for randomly selecting one initial clustering center, the fixed selecting subunit is used for sampling a plurality of initial clustering centers from the vector data set according to the Markov chain Monte Carlo (MCMC) method, and the merging unit is used for merging the two closest initial clustering centers into a new initial clustering center according to Prim's minimum spanning tree method.

In addition, the invention also provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor executes the program to implement the user clustering method provided by the invention.

In one embodiment of the invention, the invention employs a Python programming language and can operate on mainstream computer platforms. The operating system used in the implementation is CentOS 6.5, the CPU is required to be Intel i5, the memory is more than 8GB, and the hard disk space is required to be more than 32 GB.

It is understood that features of the method, device and computer equipment of the present invention may be cross-referenced with one another, and they are not described again in detail here.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.

The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and should not be construed as limiting it; any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention shall fall within its protection scope.

Claims

1. A method for clustering users of a social network, the method comprising:

step 3: selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering results, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered;

step 303: calculating the mean value of all vectors in each cluster ball to serve as a new clustering center, and calculating the radius of each cluster ball, wherein the radius is defined as the distance between the data point farthest from the central point in each cluster ball and the central point;

Step 304: finding out the neighbor cluster of each cluster according to the distance relationship between the cluster sphere radius and the cluster center, wherein the neighbor cluster is defined as the neighbor cluster if half of the distance between the center points of the two clusters is smaller than the radius of the current cluster;

step 305: calculating the distance between each vector and the center of the neighboring cluster in which the vector is positioned, filtering out data in the stable domain in each cluster ball, and dividing the data outside the stable domain in each cluster ball into the cluster balls with the nearest distance according to the principle of proximity; the stable domain is defined as a set formed by data points which are close to the clustering center and still belong to the current ball cluster in the iteration;

2. The user clustering method of the social network as claimed in claim 1, wherein the vectorizing of each piece of account information in the social network includes digitizing one part of the data of each piece of account information into a digital vector, and converting another part of the data into a semantic vector by the sentence-to-vector (sen2vec) method, which includes converting the words in the semantic information using pre-trained word vectors, weighting all the words, and then processing the whole semantic information set using principal component analysis to obtain a vector representation of each piece of semantic information; and splicing the digital vector and the semantic vector to obtain the vector corresponding to the account information of the user to be clustered.

3. The method for clustering users in a social network as claimed in claim 1, wherein said step 301 comprises:

step 3013: for the 3k candidate initial clustering centers, repeatedly merging the two closest initial clustering centers into a new initial clustering center by using Prim's minimum spanning tree method, until only k data points are left as the initial clustering centers.

4. A user clustering apparatus for a social network, comprising:

the data interface module is used for accessing account information of each user to be clustered in the social network, wherein the account information comprises a user ID, a user region, a user gender, the user's number of followers (fans), the number of accounts the user follows, a content tag and a release time;

The clustering result module is used for processing the vectors in the vector data set: selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering results, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered;

the clustering result module specifically executes the following steps:

Step 305: calculating the distance between each vector and the center of the adjacent cluster, filtering data in a stable domain in each cluster ball, and dividing the data outside the stable domain in each cluster ball into cluster balls with the closest distance according to the principle of proximity; the stable domain is defined as a set formed by data points which are close to the clustering center and still belong to the current cluster ball in the iteration of the current round;

5. The social network user clustering device according to claim 4, wherein the data shaping module comprises a digital vector generation unit and a semantic vector generation unit, the digital vector generation unit is configured to convert a part of data of the account information into a digital vector, and the semantic vector generation module is configured to convert another part of data of the account information into a semantic vector.

6. The user clustering device of the social network according to claim 4, wherein the clustering result module comprises an initial clustering center selecting unit, a calculating unit, and a result calculating unit; the cluster center selection unit is used for selecting an initial cluster center in the vector data set; the calculating unit is used for calculating the radius distance of the cluster balls and the distance relation between any two cluster balls; and the result operation unit is used for dividing a clustering result according to the clustering center.

7. The social network user clustering device according to claim 4, wherein the initial clustering center selecting unit comprises a random selecting subunit, a fixed selecting subunit and a merging unit; the random selecting subunit is used for randomly selecting one initial clustering center, the fixed selecting subunit is used for sampling a plurality of initial clustering centers from the vector data set according to the Markov chain Monte Carlo (MCMC) method, and the merging unit is used for merging the two closest initial clustering centers into a new initial clustering center according to Prim's minimum spanning tree method.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor when executing the program implements the method of any of claims 1 to 3.