CN111125469A

CN111125469A - User clustering method and device for social network and computer equipment

Info

Publication number: CN111125469A
Application number: CN201911247467.5A
Authority: CN
Inventors: 陈子忠; 彭道万; 夏书银; 李曹枭
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2020-05-08
Anticipated expiration: 2039-12-09
Also published as: CN111125469B

Abstract

The invention belongs to the field of machine learning and data mining, and in particular relates to a user clustering method, device and computer equipment for a social network. The method includes: acquiring account information of each user to be clustered in the social network, including user ID, user region, user gender, follower count, following count, content tags and posting time; vectorizing each piece of account information in the social network to form a vector data set; selecting multiple vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering result, and continuing to cluster each user according to the updated clustering centers until the clustering centers no longer change, thereby obtaining multiple classifications, each of which includes at least one user to be clustered. Through the improved initial clustering center selection algorithm and the improved clustering framework of the present invention, the amount of calculation can be greatly reduced, thereby improving the efficiency of the overall method and device.

Description

User clustering method and device for social network and computer equipment

Technical Field

The invention belongs to the field of machine learning and data mining, and relates to a fast and efficient k-means clustering algorithm for clustering problems and its application in social networks, in particular to a user clustering method and device for a social network and computer equipment.

Background

Social network analysis is a product of combining social science and natural science. Research on social networks covers networks such as e-mail, WeChat, QQ, Sina Weibo, Twitter and Facebook; these networks contain many kinds of objects, and the objects need to be classified. The k-means clustering algorithm is one of the most common, simple and effective clustering algorithms. The standard k-means clustering algorithm was proposed independently, in different scientific fields, by Steinhaus in 1956, Lloyd in 1957, Ball & Hall in 1965, and MacQueen in 1967. Cluster analysis is a technique for statistical data analysis and is widely used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

The traditional k-means algorithm performs well when processing small batches of data, both in efficiency and in clustering quality. In a social network, however, massive amounts of user data must be processed, so the efficiency of the clustering algorithm is very important. In the past, traditional clustering algorithms were mainly relied on for technical support, but they become inefficient in big data scenarios: the convergence speed is extremely slow, the time complexity is high, the algorithms are sensitive to noise and outliers, and the clustering results depend on the initial clustering centers.

To address the problems of the traditional k-means algorithm, the main improvements fall into three categories: first, selection of the initial clustering centers; second, approximate k-means; and third, accelerated k-means. David Arthur et al. proposed k-means++, an initial cluster center selection method based on D²-sampling. The core of the method is that the initial centroids are spaced as far apart as possible. Although the method remedies the drawback that the standard k-means algorithm selects the initial center points at random, its inherently sequential nature limits its scalability, so the algorithm cannot be parallelized and applied to very large data sets. When clustering massive data, approximate k-means is a very effective approach. In recent years, researchers have proposed a number of approximate k-means methods from different perspectives. One approach stores the data points in a k-d tree and maintains a subset of candidate centers for each node of the tree, which reduces computation time by avoiding comparing each point with all center points. Another approach is based on sub-sampling the data points: it runs k-means over the sub-sampled data points, and its extension adds the remaining points incrementally and reruns k-means to obtain a finer clustering. These approximate approaches are unsuitable for many applications; for example, in social network applications their clustering is less accurate and performs worse. There are also many exact accelerated k-means clustering methods, but problems such as extra time and space consumption and lack of adaptivity still exist in the context of the massive data of a social network.

Disclosure of Invention

The invention aims to solve the efficiency problem caused by clustering massive data in a social network scenario, and accordingly provides a fast and efficient partition-based k-means clustering algorithm, in particular a user clustering method and device for a social network and computer equipment.

A method of clustering users of a social network, the method comprising:

step 1: acquiring account information of each user to be clustered in the social network, wherein the account information comprises a user ID, a user region, a user gender, a follower count, a following count, a content tag and a posting time;

step 2: vectorizing each piece of account information in the social network to form a vector data set;

step 3: selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering result, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of classifications, wherein each classification comprises at least one user to be clustered.

In another aspect, the present invention further provides a user clustering device for a social network, including:

the data interface module is used for accessing account information of each user to be clustered in the social network;

the data shaping module is used for shaping the account information accessed by the data interface module to form a vector data set;

and the clustering result module is used for processing the vectors in the vector data set to obtain a plurality of clustered classifications, and each classification at least comprises one user to be clustered.

The invention has the beneficial effects that:

In the invention, when processing the massive data of a social network, the user account information of the social network is vectorized through data preprocessing, and the improved initial clustering center selection method provided by the invention makes the final clustering result more accurate. In the improved k-means framework, because the concept of a "core domain" is defined and the neighbor clusters of each cluster ball are found, all calculations are confined to a small range, which saves calculations that would otherwise have to be carried out and improves the efficiency of clustering massive social network data. In theory, the invention can reduce the time complexity of each iteration of the algorithm from the original O(nk) to O(k² + n). For the problem of clustering the users of a very large social network, the method and the device can greatly reduce the amount of calculation and thereby improve the efficiency of the whole method and device.
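As a rough illustration of the complexity claim above, the following arithmetic compares the two per-iteration operation counts. The value n = 3.76 × 10⁶ is the size of the Twitter data set reported in the embodiment; k = 100 is an assumed cluster count chosen only for illustration, and the constant factors hidden by the O-notation are ignored.

```latex
\begin{align*}
nk &= 3.76 \times 10^{6} \times 100 = 3.76 \times 10^{8}, \\
k^{2} + n &= 10^{4} + 3.76 \times 10^{6} = 3.77 \times 10^{6}, \\
\frac{nk}{k^{2} + n} &\approx 99.7 .
\end{align*}
```

This ratio is an idealized operation count under the stated assumptions, not a measured speedup.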

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only simple schematic diagrams of the present invention.

FIG. 1 is an overall flow diagram of one embodiment of the present invention;

FIG. 2 is an overall flow diagram of another embodiment of the present invention;

FIG. 3 is a flow diagram of the extraction and processing of social network data;

FIG. 4 is a detailed flow diagram of an improved k-means clustering framework in an embodiment of the present invention;

FIG. 5 is a detailed flow chart of an improved k-means clustering framework in another embodiment of the present invention;

FIG. 6 is a schematic diagram of a user clustering device of a social network according to the present invention;

FIG. 7 is a schematic diagram of a data shaping module of the present invention;

FIG. 8 is a schematic diagram of a clustering results module of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly and completely apparent, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

As shown in fig. 1, a method for clustering users in a social network includes:

In another embodiment, as shown in fig. 2, a method for clustering users of a social network may further include:

acquiring user account information in a social network (such as Sina Weibo), wherein the user account information comprises information such as a user ID, a user region, a user gender, a follower count, a following count, a microblog content tag and a posting time;

preprocessing user information and then vectorizing the preprocessed user information;

selecting an initial clustering center by using an improved method;

inputting the whole processed data set into the improved k-means algorithm framework, and iterating until the algorithm converges;

and outputting an accurate clustering result.

In one embodiment, FIG. 3 shows a flow chart for extracting and processing social network data. The method comprises the following steps:

carrying out data cleaning on the social network data, and preprocessing the feature data of each user;

converting the format of the data, wherein one part of the data is represented as a digital vector and the other part is represented as a semantic vector. For example, the ID of the user is normalized to the range between 0 and 1, the gender of the user is encoded numerically (for example, male is represented by 1 and female by 0), and the region of the user is represented by a number (for example, Beijing 001, Shanghai 002, Chongqing 003 and the like). The semantic information is converted into a semantic vector by the sentence-to-vector (sen2vec) method: the words in the semantic information are converted using pre-trained word vectors, all the words are weighted, and the whole set of semantic information is then processed by principal component analysis to obtain the vector representation of each piece of semantic information. The user data information is vectorized on the basis of these steps.
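The conversion described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the lookup tables, the dictionary keys (`id`, `gender`, `region`, `followers`, `following`, `tags`), the helper names and the per-word weights are all hypothetical, the posting time is left out, and the principal component analysis step applied to the whole semantic set is omitted for brevity.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lookup tables; the codes mirror the examples given in the text.
REGION_CODES: Dict[str, int] = {"Beijing": 1, "Shanghai": 2, "Chongqing": 3}
GENDER_CODES: Dict[str, int] = {"male": 1, "female": 0}


def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1]; a degenerate range maps to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def sentence_vector(words: Sequence[str],
                    word_vectors: Dict[str, List[float]],
                    word_weights: Dict[str, float]) -> List[float]:
    """Weighted average of pre-trained word vectors (PCA step not shown)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in word_vectors:
            continue  # skip out-of-vocabulary words
        weight = word_weights.get(word, 1.0)
        total = [t + weight * x for t, x in zip(total, word_vectors[word])]
        weight_sum += weight
    return total if weight_sum == 0 else [t / weight_sum for t in total]


def vectorize_user(user: dict,
                   id_range: Tuple[float, float],
                   word_vectors: Dict[str, List[float]],
                   word_weights: Dict[str, float]) -> List[float]:
    """Concatenate the digital part and the semantic part into one vector."""
    numeric = [
        min_max(user["id"], *id_range),
        float(GENDER_CODES[user["gender"]]),
        float(REGION_CODES[user["region"]]),
        float(user["followers"]),
        float(user["following"]),
    ]
    return numeric + sentence_vector(user["tags"], word_vectors, word_weights)
```

In practice the follower and following counts would probably also need scaling so that they do not dominate the distance computation; the text does not specify this, so the sketch leaves them unscaled.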

In one embodiment, FIG. 4 shows a detailed flow chart of the improved k-means clustering framework, covering the whole clustering process. The method comprises the following steps:

step 301: selecting k vectors from the vector data set as the initial clustering centers by using the improved initial clustering center method;

step 302: assigning each vector to the cluster ball represented by the clustering center closest to it, according to the nearest-center principle;

step 303: calculating the mean value of all vectors in each cluster ball to serve as the new clustering center, and calculating the radius of the cluster ball;

step 304: finding the neighbor clusters of each cluster according to the relationship between the cluster ball radius and the distances between clustering centers;

step 305: calculating the distance between each vector and the centers of the neighbor clusters of the cluster in which the vector is located, and assigning the vector to the nearest cluster ball according to the nearest-center principle;

step 306: repeating steps 303 to 305 until the clustering centers no longer change, and outputting the clustering result.
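The bookkeeping behind steps 302 and 303 can be sketched as follows. This is a minimal illustration under the assumption that the distance is Euclidean; the function names `assign_nearest` and `update_balls` are hypothetical, and the neighbor search of steps 304 and 305 is not included here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def assign_nearest(vectors: Sequence[Vector],
                   centers: Sequence[Vector]) -> List[int]:
    """Step 302: index of the closest center for every vector."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
        for v in vectors
    ]


def update_balls(vectors: Sequence[Vector],
                 labels: Sequence[int],
                 centers: Sequence[Vector]) -> Tuple[List[List[float]], List[float]]:
    """Step 303: new center (mean) and radius (farthest member) per ball."""
    new_centers: List[List[float]] = []
    radii: List[float] = []
    for j, old_center in enumerate(centers):
        members = [v for v, label in zip(vectors, labels) if label == j]
        if not members:
            # Assumption: an empty ball keeps its old center with radius 0.
            new_centers.append(list(old_center))
            radii.append(0.0)
            continue
        center = [sum(coord) / len(members) for coord in zip(*members)]
        new_centers.append(center)
        radii.append(max(math.dist(v, center) for v in members))
    return new_centers, radii
```

For example, with vectors `[[0, 0], [0, 2], [10, 0], [10, 4]]` and initial centers `[[0, 0], [10, 0]]`, the first two vectors form one ball with center `[0, 1]` and radius 1, and the last two form a ball with center `[10, 2]` and radius 2.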

In one embodiment, the step 301 comprises:

step 3011: randomly selecting one vector from the vector data set as the first initial clustering center;

step 3012: sampling a Markov chain of length 3k from the vector data set by using the Markov chain Monte Carlo (MCMC) method, and taking the 3k data points on the Markov chain as candidate initial clustering centers;

step 3013: for the 3k candidate initial clustering centers, repeatedly merging the two closest initial clustering centers into a new initial clustering center by using Prim's minimum spanning tree method, until only k data points are left as the initial clustering centers.

Of course, the number of vectors sampled by the Markov chain Monte Carlo method may also be 4k, or 4.5k, and so on.
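Steps 3011 to 3013 can be sketched as follows. The text does not specify the target distribution of the Markov chain or how two centers are merged, so this sketch makes its own assumptions: the chain is a Metropolis-Hastings chain whose acceptance ratio uses the squared distance to the first center (in the spirit of D²-sampling), merging replaces two centers by their weighted mean, and the closest pair is found by brute force rather than by an explicit run of Prim's algorithm. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def mcmc_candidates(vectors: Sequence[Vector], first_center: Vector,
                    chain_length: int, rng: random.Random) -> List[List[float]]:
    """Step 3012 (assumed form): Metropolis-Hastings chain over the data."""
    def score(v: Vector) -> float:
        return math.dist(v, first_center) ** 2

    current = rng.choice(vectors)
    chain: List[List[float]] = []
    for _ in range(chain_length):
        proposal = rng.choice(vectors)
        current_score = score(current)
        # Accept with probability min(1, score(proposal) / score(current)).
        if current_score == 0 or rng.random() < score(proposal) / current_score:
            current = proposal
        chain.append(list(current))
    return chain


def merge_to_k(candidates: Sequence[Vector], k: int) -> List[List[float]]:
    """Step 3013 (assumed form): merge the closest pair until k remain."""
    centers = [(list(c), 1) for c in candidates]  # (vector, merged count)
    while len(centers) > k:
        pairs = ((i, j) for i in range(len(centers))
                 for j in range(i + 1, len(centers)))
        a, b = min(pairs, key=lambda p: math.dist(centers[p[0]][0],
                                                  centers[p[1]][0]))
        (va, wa), (vb, wb) = centers[a], centers[b]
        merged = [(x * wa + y * wb) / (wa + wb) for x, y in zip(va, vb)]
        centers[a] = (merged, wa + wb)
        del centers[b]
    return [vector for vector, _ in centers]


def select_initial_centers(vectors: Sequence[Vector], k: int,
                           rng: random.Random) -> List[List[float]]:
    first = rng.choice(vectors)                           # step 3011
    chain = mcmc_candidates(vectors, first, 3 * k, rng)   # step 3012
    return merge_to_k(chain, k)                           # step 3013
```

The brute-force pair search is cubic in the number of candidates, which is acceptable here only because the candidate set has about 3k elements and is independent of the data set size.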

In one embodiment, the step 301 may further include:

randomly selecting vectors from the vector data set by reservoir sampling: the first k vectors in the vector data set are placed into the reservoir; for the m-th element (m > k), one vector in the reservoir is replaced by that element with probability k/m; and the k vectors finally remaining in the reservoir are taken as the initial clustering centers.
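The reservoir sampling step can be sketched as follows. This is the textbook single-pass form of the technique; the function name `reservoir_sample` and the optional `rng` parameter are illustrative choices, not taken from the source.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick k items uniformly at random from a stream in a single pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for m, item in enumerate(stream, start=1):
        if m <= k:
            # The first k elements fill the reservoir.
            reservoir.append(item)
        elif rng.random() < k / m:
            # The m-th element replaces a random slot with probability k/m.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

Because the data are read once and only k vectors are held in memory, this variant suits data sets that are too large to load or shuffle in full.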

In another embodiment, as shown in fig. 5, the step 3 further includes:

step 311: first, performing a first iteration on the data set using the standard k-means procedure, in which every data point is assigned to the cluster whose center point is closest to it:

$$b(x_i) = \arg\min_{j = 1, \dots, k} \mathrm{dis}(x_i, c_j)$$

where $c_j$ denotes the center of cluster $C_j$ and $x_i$ is any sample in the sample space.

step 312: updating all cluster centers according to the above assignment step:

$$c_j = \frac{1}{|N|} \sum_{x_i \in C_j} x_i$$

where $|N|$ denotes the number of data samples assigned to cluster $C_j$.

step 313: calculating the radius of each cluster ball (the radius is defined as the distance between the center point and the data point in the cluster ball that is farthest from the center point), defined as:

$$R_j = \max_{x_i \in C_j} \mathrm{dis}(x_i, c_j)$$

step 314: finding the neighbor clusters of each cluster according to the radius of each cluster ball and the distance between any two cluster balls (another cluster is defined as a neighbor cluster of the current cluster if half of the distance between the center points of the two clusters is smaller than the radius of the current cluster);

step 315: sorting the neighbor clusters of each cluster from nearest to farthest;

step 316: filtering out the data in the "stable domain" of each cluster ball (the "stable domain" is defined as the set of data points that are close to the cluster center and still belong to the current cluster ball in the current iteration);

step 317: assigning the data outside the "stable domain" of each cluster ball according to the nearest-center principle, wherein the candidate clusters for assignment are the neighbor clusters of that cluster ball, in the order sorted in step 315;

step 318: repeating steps 312 to 317 until all center points no longer change.
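One pass of steps 313 to 317 can be sketched as follows, given the current centers and the labels from the previous assignment. The text does not give a numeric threshold for the "stable domain"; this sketch assumes that a point is stable when its distance to its own center is less than half the distance from that center to the nearest neighbor center, which by the triangle inequality guarantees that no other center is closer. The function name `refine_assignments` and the use of Euclidean distance are likewise assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def refine_assignments(vectors: Sequence[Vector],
                       centers: Sequence[Vector],
                       labels: Sequence[int]) -> List[int]:
    """Reassign only the points that could change cluster (sketch)."""
    k = len(centers)
    new_labels = list(labels)
    for i in range(k):
        members = [n for n, label in enumerate(labels) if label == i]
        if not members:
            continue
        own = {n: math.dist(vectors[n], centers[i]) for n in members}
        radius = max(own.values())                           # step 313
        # Step 314: j is a neighbor if half the center distance < radius.
        half = {j: math.dist(centers[i], centers[j]) / 2
                for j in range(k) if j != i}
        neighbors = sorted((j for j in half if half[j] < radius),
                           key=half.get)                     # step 315
        if not neighbors:
            continue  # no neighbor clusters: the whole ball is stable
        stable_radius = half[neighbors[0]]                   # step 316 (assumed)
        for n in members:
            if own[n] < stable_radius:
                continue  # inside the stable domain, keeps its label
            best, best_dist = i, own[n]
            for j in neighbors:                              # step 317
                d = math.dist(vectors[n], centers[j])
                if d < best_dist:
                    best, best_dist = j, d
            new_labels[n] = best
    return new_labels
```

Under these assumptions the pass yields the same labels as comparing every point with every center, barring exact distance ties, while skipping the comparisons that the radius and half-distance bounds rule out.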

In addition, in a specific embodiment, the vector data sets of the present invention use user data crawled from Twitter and user data from Sina Weibo, which contain 3.76 million and 6.20 million user records respectively. After all social network user data are preprocessed, the two data sets are clustered under the framework of the method provided by the invention. The results show that the method is on average 51 times faster than the traditional k-means algorithm, and that the clustering result is 3.1% lower than that of the traditional social network user clustering algorithm (the evaluation index is WCSSD, namely the within-cluster sum of squared distances, for which a smaller value is better).
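The evaluation index mentioned above can be computed as in the following sketch. It assumes Euclidean distance and the usual definition of the within-cluster sum of squares; the function names `wcss` and `relative_reduction` are hypothetical, and no measured values from the embodiment are reproduced here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def wcss(vectors: Sequence[Vector], labels: Sequence[int],
         centers: Sequence[Vector]) -> float:
    """Sum of squared distances from each vector to its assigned center."""
    return sum(math.dist(v, centers[label]) ** 2
               for v, label in zip(vectors, labels))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the improved score is lower than the baseline."""
    return (baseline - improved) / baseline
```

A result of 0.031 from `relative_reduction` would correspond to the 3.1% figure quoted in the text.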

Based on the same concept of the present invention, the present invention further provides a user clustering device for a social network, as shown in fig. 6, including:

As shown in fig. 7, the data shaping module includes a digital vector generation unit for converting a part of data of the account information into a digital vector and a semantic vector generation unit for converting another part of data of the account information into a semantic vector.

As shown in fig. 8, the clustering result module includes an initial clustering center selection unit, a calculation unit, and a result operation unit; the initial clustering center selection unit is used for selecting the initial clustering centers from the vector data set; the calculation unit is used for calculating the radius of each cluster ball and the distance between any two cluster balls; and the result operation unit is used for dividing the clustering result according to the clustering centers.

The initial clustering center selection unit comprises a random selection subunit, a fixed selection subunit and a merging unit; the random selection subunit is used for randomly selecting one initial clustering center, the fixed selection subunit is used for sampling a plurality of initial clustering centers from the vector data set according to the Markov chain Monte Carlo (MCMC) method, and the merging unit is used for merging the two closest initial clustering centers into a new initial clustering center according to Prim's minimum spanning tree method.

In addition, the invention also provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the user clustering method provided by the invention.

In one embodiment of the invention, the invention is implemented in the Python programming language and can run on mainstream computer platforms. The operating system used in the implementation is CentOS 6.5; an Intel i5 CPU, more than 8 GB of memory, and more than 32 GB of hard disk space are required.

It is understood that some features of the method, device and computer equipment of the present invention may be cross-referenced, and they are not described again in detail here.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.

The above-mentioned embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention and should not be construed as limiting the present invention; any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for clustering users of a social network, the method comprising:

2. The user clustering method of the social network as claimed in claim 1, wherein the vectorizing of each piece of account information in the social network includes digitizing a part of the data of each piece of account information into a digital vector, and converting another part of the data into a semantic vector by the sentence-to-vector (sen2vec) method, including converting the words in the semantic information using pre-trained word vectors, weighting all the words, and then processing the whole semantic information set using principal component analysis to obtain a vector representation of each piece of semantic information; and splicing the digital vector and the semantic vector to obtain the vector corresponding to the account information of the user to be clustered.

3. The method for clustering users in a social network according to claim 1, wherein the step 3 comprises the steps of:

4. The method for clustering users in a social network according to claim 3, wherein the step 301 comprises:

5. The method of claim 3, wherein finding the neighbor clusters of each cluster comprises defining another cluster as a neighbor cluster of the current cluster if half of the distance between the center points of the two clusters is smaller than the radius of the current cluster.

6. A user clustering apparatus for a social network, comprising:

7. The social network user clustering device according to claim 6, wherein the data shaping module comprises a digital vector generation unit and a semantic vector generation unit, the digital vector generation unit is configured to convert a part of the data of the account information into a digital vector, and the semantic vector generation unit is configured to convert another part of the data of the account information into a semantic vector.

8. The user clustering device of the social network according to claim 6, wherein the clustering result module comprises an initial clustering center selection unit, a calculation unit, and a result operation unit; the initial clustering center selection unit is used for selecting the initial clustering centers from the vector data set; the calculation unit is used for calculating the radius of each cluster ball and the distance between any two cluster balls; and the result operation unit is used for dividing the clustering result according to the clustering centers.

9. The social network user clustering device according to claim 6, wherein the initial clustering center selection unit comprises a random selection subunit, a fixed selection subunit and a merging unit; the random selection subunit is used for randomly selecting one initial clustering center, the fixed selection subunit is used for sampling a plurality of initial clustering centers from the vector data set according to the Markov chain Monte Carlo (MCMC) method, and the merging unit is used for merging the two closest initial clustering centers into a new initial clustering center according to Prim's minimum spanning tree method.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor when executing the program implements the method of any of claims 1 to 5.