Disclosure of Invention
The invention aims to solve the efficiency problem that arises when clustering massive data in social network scenarios. To that end, it provides a fast and efficient partition-based k-means clustering algorithm, and in particular a user clustering method and device for a social network, and computer equipment.
A method of clustering users of a social network, the method comprising:
step 1: acquiring account information of each user to be clustered in the social network, wherein the account information comprises a user ID, a user region, a user gender, a follower count, a following count, a content tag and a posting time;
step 2: vectorizing each piece of account information in the social network to form a vector data set;
and step 3: selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering results, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered.
In another aspect, the present invention further provides a user clustering device for a social network, including:
a data interface module, used for accessing account information of each user to be clustered in the social network;
a data shaping module, used for shaping the account information accessed by the data interface module to form a vector data set;
and a clustering result module, used for processing the vectors in the vector data set to obtain a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered.
The invention has the beneficial effects that:
In the scenario of processing the massive data of a social network, the invention vectorizes the user account information of the social network through data preprocessing, and the improved initial clustering center selection method provided by the invention makes the final clustering result more accurate. In the improved k-means framework, because the concept of a "core domain" is defined and the neighbor clusters of each cluster sphere are searched, all calculations are confined to a small range, which saves computation that would otherwise be required and improves the efficiency of clustering massive social network data. In theory, the invention reduces the time complexity of each iteration of the algorithm from the original $O(nk)$ to $O(k^2 + n)$. For ultra-large-scale social network user clustering, the method and device can therefore greatly reduce the amount of calculation and improve overall efficiency.
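As a rough illustration of the stated per-iteration complexities, the following arithmetic assumes hypothetical values of n = 10^6 users and k = 10^3 clusters (these values are not taken from the embodiments) and ignores constant factors:

```latex
nk = 10^{6} \cdot 10^{3} = 10^{9},
\qquad
k^{2} + n = 10^{6} + 10^{6} = 2 \times 10^{6},
\qquad
\frac{nk}{k^{2} + n} = 500
```

Under these assumed values the operation count per iteration would fall by a factor of about 500; the actual gain depends on the data and on the constants hidden by the O-notation.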
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, and not all, of the embodiments of the present invention.
As shown in fig. 1, a method for clustering users in a social network includes:
step 1: acquiring account information of each user to be clustered in the social network, wherein the account information comprises a user ID, a user region, a user gender, a follower count, a following count, a content tag and a posting time;
step 2: vectorizing each piece of account information in the social network to form a vector data set;
and step 3: selecting a plurality of vectors from the vector data set as initial clustering centers, clustering the users according to the initial clustering centers, updating the clustering centers according to the clustering results, and continuing to cluster the users according to the updated clustering centers until the clustering centers no longer change, thereby obtaining a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered.
In another embodiment, as shown in fig. 2, a method for clustering users of a social network may further include:
acquiring user account information in a social network (such as the Sina microblog), wherein the user account information comprises a user ID, a user region, a user gender, a follower count, a following count, a microblog content tag, a posting time and the like;
preprocessing user information and then vectorizing the preprocessed user information;
selecting an initial clustering center by using an improved method;
inputting the whole processed data set into the improved k-means algorithm framework, and iterating until the algorithm converges;
and outputting an accurate clustering result.
In one embodiment, FIG. 3 shows a flow chart for extracting and processing social network data, which comprises the following steps:
carrying out data cleaning on the social network data, and preprocessing the feature data of each user;
converting the format of the data, wherein one part of the data is represented as digital vectors and the other part as semantic vectors. For example, the user ID is normalized to the range 0 to 1; the user gender is encoded numerically (for example, 1 for male and 0 for female); and the user region is represented by a numeric code (for example, Beijing 001, Shanghai 002, Chongqing 003, and so on). Semantic information is converted into a semantic vector by a sentence-to-vector (sen2vec) method: the words in the semantic information are converted into vectors using pre-trained word vectors, all the word vectors are weighted, and the whole set of semantic information is then processed by principal component analysis to obtain a vector representation of each piece of semantic information. The user data information is vectorized on the basis of these steps.
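The format conversion described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the field names (`id`, `gender`, `region`), the `REGION_CODES` table, and the use of a weighted average to combine word vectors are all hypothetical, and the principal component analysis step is omitted because it would normally be delegated to a numerical library.

```python
# Hypothetical region codes, following the numbering style in the text.
REGION_CODES = {"Beijing": 1, "Shanghai": 2, "Chongqing": 3}


def min_max(values):
    """Scale numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def numeric_vectors(users):
    """Return one [id, gender, region] digital vector per user."""
    ids = min_max([u["id"] for u in users])
    return [
        [uid, 1.0 if u["gender"] == "male" else 0.0, float(REGION_CODES[u["region"]])]
        for uid, u in zip(ids, users)
    ]


def weighted_sentence_vector(tokens, word_vectors, weights):
    """Weighted average of the word vectors of the known tokens.

    Assumption: "weighting all the words" is read here as a weighted
    average; words without an explicit weight get weight 1.0.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in word_vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(token, 1.0)
        weight_sum += w
        for d, value in enumerate(word_vectors[token]):
            total[d] += w * value
    if weight_sum == 0:
        return total  # no known words: zero vector
    return [value / weight_sum for value in total]
```

In a full pipeline, the digital vector and the (dimension-reduced) semantic vector of each user would be concatenated into one entry of the vector data set.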
In one embodiment, FIG. 4 shows a detailed flow chart of the improved k-means clustering framework, covering the detailed process of the whole clustering. The method comprises the following steps:
step 301: selecting k vectors from the vector data set as initial clustering centers by using the improved initial clustering center selection method;
step 302: assigning every vector to the cluster sphere represented by the clustering center nearest to it, according to the nearest-neighbor principle;
step 303: calculating the mean of all vectors in each cluster sphere as the new clustering center, and calculating the radius of the cluster sphere;
step 304: finding the neighbor clusters of each cluster according to the relationship between the cluster sphere radius and the distances between clustering centers;
step 305: calculating the distance between each vector and the centers of the neighbor clusters of the cluster in which the vector is located, and assigning the vector to the nearest cluster sphere according to the nearest-neighbor principle;
step 306: repeating steps 303 to 305 until the clustering centers no longer change, and outputting the clustering result.
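Steps 302 to 306 can be sketched as below. This is a minimal pure-Python illustration under assumptions, not the implementation of the invention: the function names are hypothetical, the distance is assumed to be Euclidean, the neighbor test uses the "half the center distance is smaller than the radius" criterion given in a later embodiment, and an empty cluster is assumed to keep its previous center.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(point, centers, candidates):
    """Index of the candidate center closest to point."""
    return min(candidates, key=lambda j: math.dist(point, centers[j]))


def neighbor_kmeans(points, centers, max_iter=100):
    """Steps 302-306: k-means with reassignment limited to neighbor clusters."""
    k = len(centers)
    # Step 302: full nearest-center assignment.
    labels = [nearest(p, centers, range(k)) for p in points]
    for _ in range(max_iter):
        # Step 303: new centers (assumption: an empty cluster keeps its old
        # center) and the radius of each cluster sphere.
        members = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
        new_centers = [mean(m) if m else centers[j] for j, m in enumerate(members)]
        radii = [max((math.dist(p, new_centers[j]) for p in m), default=0.0)
                 for j, m in enumerate(members)]
        # Step 304: j is a neighbor of i if half the center distance is below R_i.
        neighbors = [[j for j in range(k) if j != i
                      and math.dist(new_centers[i], new_centers[j]) / 2 < radii[i]]
                     for i in range(k)]
        # Step 305: reassign each vector among its own cluster and its neighbors.
        labels = [nearest(p, new_centers, [lab] + neighbors[lab])
                  for p, lab in zip(points, labels)]
        # Step 306: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

By the triangle inequality, a vector in cluster i lies within R_i of its center and therefore cannot be strictly closer to the center of a non-neighbor cluster, so restricting step 305 to neighbor clusters should give the same assignment as a full search while computing fewer distances.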
In one embodiment, the step 301 comprises:
step 3011: randomly selecting one vector from the vector data set as a first initial clustering center;
step 3012: sampling a Markov chain of length 3k from the vector data set by using the Markov chain Monte Carlo (MCMC) method, and taking the 3k data points on the Markov chain as candidate initial clustering centers;
step 3013: for the 3k candidate initial clustering centers, repeatedly merging the two closest initial clustering centers into a new initial clustering center by using Prim's minimum spanning tree method, until only k remain as the initial clustering centers.
Of course, the number of vectors sampled by the Markov chain Monte Carlo method may also be 4k, 4.5k, and so on.
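Steps 3011 to 3013 might be sketched as follows. The text does not specify the MCMC proposal or target distribution, nor how two centers are merged, so this sketch makes loud assumptions: a Metropolis-Hastings chain with a uniform proposal and a target proportional to the squared distance from the first center, merging by taking the midpoint, and a brute-force closest-pair search standing in for the Prim-based search. All function names are hypothetical.

```python
import math
import random


def mcmc_candidates(points, first, length, rng):
    """Metropolis-Hastings chain over points (assumed target: squared
    distance to the first center; assumed proposal: uniform)."""
    def weight(p):
        return math.dist(p, first) ** 2

    current = rng.choice(points)
    chain = []
    for _ in range(length):
        proposal = rng.choice(points)
        w_cur, w_new = weight(current), weight(proposal)
        # Accept with probability min(1, w_new / w_cur).
        if w_cur == 0 or rng.random() < w_new / w_cur:
            current = proposal
        chain.append(list(current))
    return chain


def merge_closest(candidates, k):
    """Merge the two closest candidates into their midpoint until k remain."""
    centers = [list(c) for c in candidates]
    while len(centers) > k:
        pairs = ((i, j) for i in range(len(centers))
                 for j in range(i + 1, len(centers)))
        i, j = min(pairs, key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]))
        merged = [(a + b) / 2 for a, b in zip(centers[i], centers[j])]
        centers.pop(j)  # j > i, so index i stays valid
        centers[i] = merged
    return centers


def initial_centers(points, k, seed=0):
    """Hypothetical driver for steps 3011-3013."""
    rng = random.Random(seed)
    first = rng.choice(points)                               # step 3011
    candidates = mcmc_candidates(points, first, 3 * k, rng)  # step 3012
    return merge_closest(candidates, k)                      # step 3013
```

In this sketch the first center only shapes the sampling distribution; whether it should also be kept among the candidates is not stated in the text and is left open here.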
In one embodiment, the step 301 may further include:
randomly selecting one vector from the vector data set by reservoir sampling as a first initial clustering center; and placing the first k vectors of the vector data set into the reservoir, then, for the m-th element (m > k), replacing one vector in the reservoir with probability k/m, and taking the k vectors finally remaining in the reservoir as the initial clustering centers.
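The reservoir procedure described above corresponds to the classic single-pass reservoir sampling scheme (often called Algorithm R). A minimal sketch follows; the function name and the optional `rng` parameter are illustrative choices, not part of the source.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniformly sample k items from a stream of unknown length in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for m, item in enumerate(stream, start=1):
        if m <= k:
            reservoir.append(item)      # the first k items fill the reservoir
        else:
            slot = rng.randrange(m)     # uniform integer in 0..m-1
            if slot < k:                # happens with probability k/m
                reservoir[slot] = item  # replace a uniformly chosen entry
    return reservoir
```

Because only k items are held in memory at any time, this kind of sampling suits data sets that are too large to load at once.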
In another embodiment, as shown in fig. 5, the step 3 further includes:
step 311: first, performing a first iteration on the data set using the standard k-means procedure, and assigning every data point to the cluster whose center point is closest to it:
$$b(x_i) = \arg\min_{j = 1, \dots, k} \{ \operatorname{dis}(x_i, c_j) \}$$
where $c_j$ denotes the center of the $j$-th cluster, and $x_i$ is any sample in the sample space.
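step 312: updating all cluster centers according to the above assignment step, each new center being the mean of the samples assigned to its cluster:
$$c_j = \frac{1}{|N|} \sum_{x_i \in C_j} x_i$$
where $|N|$ denotes the number of data samples assigned to $C_j$.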
step 313: calculating the radius of each cluster sphere (the radius is defined as the distance between the center point and the data point in the cluster sphere farthest from it), defined as:
$$R_j = \max_{x_i \in C_j} \operatorname{dis}(x_i, c_j)$$
step 314: finding the neighbor clusters of each cluster according to the radius of each cluster sphere and the distance between any two cluster spheres (another cluster is defined as a neighbor cluster of the current cluster if half of the distance between the center points of the two clusters is smaller than the radius of the current cluster);
step 315: sorting the neighbor clusters of each cluster from near to far;
step 316: filtering out the data in the "stable domain" of each cluster sphere (the "stable domain" is defined as the set of data points that are close to the cluster center and still belong to the current cluster sphere in the current iteration);
step 317: assigning the data outside the "stable domain" of each cluster sphere according to the nearest-neighbor principle, wherein the candidate clusters for assignment are limited to the neighbor clusters of that cluster sphere, in the order sorted in step 315;
step 318: repeating steps 312 to 317 until no center point changes any more.
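Steps 316 and 317 can be sketched as follows. The text defines the "stable domain" only qualitatively, so this sketch assumes one concrete criterion: a point is treated as stable if its distance to its own center is at most half the distance from that center to the nearest neighbor center. The function names and the Euclidean distance are likewise assumptions.

```python
import math


def split_stable(points, labels, centers, neighbors):
    """Return (stable, unstable) lists of point indices.

    neighbors[i] lists the neighbor clusters of cluster i, sorted from
    near to far as in step 315.
    """
    stable, unstable = [], []
    for idx, (p, own) in enumerate(zip(points, labels)):
        if not neighbors[own]:
            stable.append(idx)  # no neighbor cluster can claim this point
            continue
        nearest_nb = neighbors[own][0]
        half_gap = math.dist(centers[own], centers[nearest_nb]) / 2
        # Triangle inequality: within half_gap of its own center, the point
        # cannot be strictly closer to any neighbor center.
        if math.dist(p, centers[own]) <= half_gap:
            stable.append(idx)
        else:
            unstable.append(idx)
    return stable, unstable


def reassign_unstable(points, labels, centers, neighbors, unstable):
    """Step 317: nearest-center assignment limited to own + neighbor clusters."""
    new_labels = list(labels)
    for idx in unstable:
        own = labels[idx]
        candidates = [own] + neighbors[own]
        new_labels[idx] = min(
            candidates, key=lambda j: math.dist(points[idx], centers[j])
        )
    return new_labels
```

For example, with centers at (0, 0) and (4, 0) that are neighbors of each other, a point at (2.5, 0) currently in the first cluster falls outside the assumed stable domain (2.5 > 2) and is reassigned to the second cluster, while a point at (1.9, 0) is skipped without any further distance computation.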
In addition, in a specific embodiment, the vector data set of the present invention uses user data crawled from Twitter and user data from the Sina microblog, which contain 3.76 million and 6.2 million user records, respectively. After all the social network user data are preprocessed, the two data sets are clustered under the framework of the method provided by the invention. The results show that the method is on average 51 times faster than the traditional k-means algorithm, and that the clustering result, evaluated by the within-cluster sum of squared distances (WCSSD; the smaller, the better), is 3.1% lower than that of the traditional social network user clustering algorithm.
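The WCSSD evaluation index mentioned above can be computed as in the following sketch, which assumes Euclidean distance and a hypothetical function name; it is not the evaluation code used in the embodiment.

```python
import math


def wcssd(points, labels, centers):
    """Within-cluster sum of squared distances; smaller is better."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
```

For instance, three one-dimensional-like points (0, 0), (2, 0) and (10, 0) with labels 0, 0, 1 and centers (1, 0) and (10, 0) contribute 1 + 1 + 0, giving a WCSSD of 2.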
Based on the same concept of the present invention, the present invention further provides a user clustering device for a social network, as shown in fig. 6, including:
a data interface module, used for accessing account information of each user to be clustered in the social network;
a data shaping module, used for shaping the account information accessed by the data interface module to form a vector data set;
and a clustering result module, used for processing the vectors in the vector data set to obtain a plurality of clustered classifications, wherein each classification comprises at least one user to be clustered.
As shown in fig. 7, the data shaping module includes a digital vector generation unit for converting a part of data of the account information into a digital vector and a semantic vector generation unit for converting another part of data of the account information into a semantic vector.
As shown in fig. 8, the clustering result module includes an initial clustering center selection unit, a calculation unit, and a result calculation unit; the initial clustering center selection unit is used for selecting initial clustering centers in the vector data set; the calculation unit is used for calculating the radius of each cluster sphere and the distance between any two cluster spheres; and the result calculation unit is used for dividing the clustering result according to the clustering centers.
The initial clustering center selection unit comprises a random selection subunit, a fixed selection subunit and a merging unit; the random selection subunit is used for randomly selecting one initial clustering center, the fixed selection subunit is used for sampling a plurality of initial clustering centers from the vector data set according to the Markov chain Monte Carlo method, and the merging unit is used for merging the two closest initial clustering centers into a new initial clustering center according to Prim's minimum spanning tree method.
In addition, the invention also provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the user clustering method provided by the invention when executing the program.
In one embodiment, the invention is implemented in the Python programming language and can run on mainstream computer platforms. The operating system used in the implementation is CentOS 6.5; the CPU is required to be an Intel i5, the memory is required to be more than 8 GB, and the hard disk space is required to be more than 32 GB.
It is understood that some features of the method, device and computer equipment of the present invention may be cross-referenced with one another, and they are not described again in detail here.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and should not be construed as limiting it; any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.