CN112287247B

CN112287247B - Social network user position feature extraction method and device based on Meanshift and K-means clustering

Info

Publication number: CN112287247B
Application number: CN201910628876.3A
Authority: CN
Inventors: 史英吉; 王海艳; 吕朝萍; 何旭
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-07-12
Filing date: 2019-07-12
Publication date: 2022-11-11
Anticipated expiration: 2039-07-12
Also published as: CN112287247A

Abstract

The invention discloses a social network user position feature extraction method and device based on the Meanshift and K-means algorithms. The method finds hotspot areas with a high user check-in frequency, that is, the locations users are genuinely interested in, within massive user check-in data. The implementation flow comprises the following steps: first, user check-in data collected from the Flickr platform is analyzed and preprocessed, and a region with dense and representative check-in points is selected as the research region; then the check-in data within a given city range is preliminarily clustered based on the Meanshift method; next, the clusters screened out as overly large or overly dense are clustered a second time based on the K-means method; finally, the check-in points are assigned to corresponding points of interest (POIs) according to the clustering results, which completes the user position feature extraction. The method extracts position features from LBSN data more effectively.

Description

Social network user position feature extraction method and device based on Meanshift and K-means clustering

Technical Field

The invention belongs to the field of intelligent information processing and data mining, and relates to the application and mining of massive user check-in data in location-based mobile social networks (LBSNs), in particular to a social network user position feature extraction method and device based on a Meanshift and K-means integrated clustering algorithm.

Background

The rapid development of location-based mobile social networks (LBSNs) has been driven by advances in mobile Internet and Global Positioning System (GPS) technologies, and a large amount of check-in data has accumulated as a result. LBSNs provide rich information, greatly increase the availability of human mobility data, and bring value in several ways. On the one hand, compared with traditional social network data, LBSN data contains the location information of the user in addition to social relationship data and comment data. This connects social networking, previously confined to communication in a purely virtual world, to the spatio-temporal attributes of the real world. On the other hand, compared with conventional GPS data, LBSN data contains social relationship and comment data in addition to location data. Analysis from the geographic perspective is therefore no longer limited to single spatio-temporal location analysis, and more practical behavior patterns can be obtained by taking into account the regularity and purposefulness of user activity. A large number of user activity characteristics and behavior patterns are hidden in LBSN data, so feature extraction from LBSN data has become a popular research problem. Such work reveals value for user travel and urban development, and is of great significance for further improving the quality of location-based services.

In the general approach, points of interest (POIs) that serve as access hotspots (areas with a high user check-in frequency) are found by clustering the check-in points in an LBSN, and the position features of users are thereby extracted. However, the suitability of clustering algorithms for LBSN data is poorly understood. Although many researchers apply a clustering algorithm directly, or improve one in a targeted manner, when extracting POIs, the question of which clustering algorithm best suits LBSN data remains open: a single algorithm is usually expected to meet all the characteristic requirements of LBSN data, which is difficult to achieve. An algorithm therefore needs to be designed around the characteristics of LBSN data, such as large data volume and uneven density, to realize more effective position feature extraction.

Disclosure of Invention

The invention aims to solve the problem that a single algorithm cannot adapt to the large data volume and uneven density of LBSN data, and provides a social network user position feature extraction method based on a Meanshift and K-means integrated clustering algorithm that realizes more effective LBSN user position feature extraction.

According to the characteristics of LBSN data, the designed method meets the following three criteria:

A. multi-density clustering can be identified;

B. can handle clusters of arbitrary shape;

C. the time and space complexity is as low as possible.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

the technical scheme adopted by the social network user position feature extraction method based on Meanshift and K-means integrated clustering algorithm specifically comprises the following steps:

selecting an object area according to pre-collected user check-in data, and acquiring the user check-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;

performing preliminary clustering on the user geographical location information data in the selected range based on a Meanshift method;

screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;

and dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction.

Further, the method for selecting the object area according to the pre-collected user check-in data comprises the following steps:

with the help of ArcGIS, the distribution situation of the check-in records in the data set is described by drawing a scatter diagram, and a New York Manhattan area with dense check-in records is selected as the object area of the invention.

Further, the data preprocessing comprises: cleaning the data by removing records with missing fields and erroneous records that do not meet the requirements.

Further, the preliminary clustering of check-in data in a selected range based on the Meanshift method comprises the following steps:

(4-1) letting any two check-in points $r_i$ and $r_j$ have coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location coordinates of the i-th check-in record, and $p_j = (lat_j, lon_j)$ represents the latitude and longitude of the geographic location coordinates of the j-th check-in record;

calculating the distance $d_{ij}$ between any two check-in points, the expression being as follows:

$$d_{ij} = 2r \arcsin\left(\sqrt{\mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)}\right)$$

where r represents the earth's radius, and hav() is the haversine function, whose expanded form is:

$$\mathrm{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$

where θ represents the angle subtended at the center of the sphere by two points on its surface;

forming a distance matrix D from the distances $d_{ij}$ between all pairs of check-in points;

(4-2) initially and randomly selecting a cluster center, and setting a key parameter bandwidth and a stop threshold stopthresh;

(4-3) updating the cluster center and the structure by superimposing an offset vector on the current cluster center coordinate vector, wherein the expression is as follows:

$$\mathrm{Center}^{(t+1)} = \mathrm{Center}^{(t)} + \mathrm{shift}^{(t)}$$

wherein $\mathrm{Center}^{(t)}$ represents the current cluster center, i.e. the cluster center after the t-th offset vector has been superimposed, $\mathrm{Center}^{(t+1)}$ represents the cluster center after the (t+1)-th offset vector has been superimposed, and $\mathrm{shift}^{(t)}$ represents the offset vector of the t-th superposition;

(4-4) iterating step (4-3) until the offset vector $\mathrm{shift}^{(t)}$ of the t-th superposition is smaller than the stop threshold stopthresh, so that every sample point finds its most appropriate cluster center, and merging the clusters that meet the requirements, thereby completing the clustering based on the Meanshift algorithm.

Further, the offset vector $\mathrm{shift}^{(t)}$ of the t-th superposition is the mean of the offsets from all samples in the current cluster to the current cluster center, and has the following basic form:

$$\mathrm{shift}^{(t)} = \frac{1}{K}\sum_{x_i \in S^{(t)}} \left(x_i - \mathrm{Center}^{(t)}\right)$$

where K denotes the number of samples in the current cluster and $S^{(t)}$ denotes the set of samples in the current cluster; for every $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth, as expressed below:

$$S^{(t)} = \left\{x_i : \mathrm{dist}\left(x_i, \mathrm{Center}^{(t)}\right) < \mathrm{bandwidth}\right\}$$

wherein $\mathrm{dist}(x_i, \mathrm{Center}^{(t)})$ represents the distance from the sample point $x_i$ to the current cluster center $\mathrm{Center}^{(t)}$, and bandwidth represents the key parameter bandwidth.
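By way of non-limiting illustration, the center update and stopping rule described above may be sketched in Python as follows. The sketch makes several assumptions that are not stated in the source: every sample seeds one candidate center (rather than a random subset), the distance function defaults to the Euclidean `math.dist` (a great-circle distance could be passed as `dist` instead), converged centers closer than one bandwidth are merged, and `max_iter` is a hypothetical safety limit.

```python
import math

def mean_shift(points, bandwidth, stopthresh=1e-6, max_iter=300, dist=math.dist):
    """Flat-kernel mean shift; returns (centers, labels). Illustrative sketch only."""
    modes = []
    for start in points:                       # assumption: each sample seeds a center
        center = tuple(start)
        for _ in range(max_iter):
            # S^(t): samples whose distance to the current center is below bandwidth
            window = [p for p in points if dist(p, center) < bandwidth]
            if not window:                     # guard for non-Euclidean distances
                break
            k = len(window)
            # shift^(t) = (1/K) * sum over S^(t) of (x_i - Center^(t))
            shift = tuple(sum(p[d] for p in window) / k - center[d]
                          for d in range(len(center)))
            # Center^(t+1) = Center^(t) + shift^(t)
            center = tuple(c + s for c, s in zip(center, shift))
            if math.hypot(*shift) < stopthresh:
                break
        modes.append(center)

    # Merge converged centers lying within one bandwidth of an existing center.
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if dist(m, c) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels
```

Because the number of clusters emerges from the merge step, no cluster count has to be supplied; only `bandwidth` and `stopthresh` control the result.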

Further, screening out a specific cluster according to a preset condition and carrying out secondary clustering based on a K-means method comprises the following steps:

(6-1) screening out clusters with the scale larger than a preset threshold value, and determining a parameter K of a K-means algorithm according to the scale;

(6-2) randomly selecting center points of the k clusters, and calculating a distance between each sample and each center point;

(6-3) clustering according to the principle of minimum distance, and classifying each sample into the cluster with the closest distance;

(6-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;

(6-5) repeating the above steps for a preset number of iterations, or until the central point of each cluster no longer changes significantly between two successive iterations, whereupon the secondary division is complete.
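As a minimal sketch of steps (6-2) to (6-5), the secondary clustering of one oversized cluster might look as follows in Python. The function name `k_means`, the parameters `max_iter`, `tol` and `seed`, and the use of Euclidean distance are assumptions made for illustration; the source does not specify them, and the rule for deriving K from the cluster scale in step (6-1) is left to the caller.

```python
import math
import random

def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain K-means on a list of coordinate tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # (6-2) random initial center points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # (6-3) assign every sample to the nearest center
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # (6-4) recompute each center as the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members)
                                         for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # (6-5) stop once the centers barely move between two iterations
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers, labels
```

For example, `k_means(cluster_points, 3)` would split one large Meanshift cluster into three sub-clusters, where `cluster_points` is a hypothetical list of the coordinates in that cluster.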

Further, dividing the user geographical position information data into corresponding interest points comprises:

labeling all points according to the clustering result, and assigning each point to the corresponding POI, i.e. the hotspot area with a high user check-in frequency.

In another aspect, the present invention provides a user location feature extraction apparatus based on a mean shift and K-means integrated clustering algorithm in a social network, including:

the data preprocessing module is used for selecting an object area according to pre-collected user sign-in data and acquiring the user sign-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;

the preliminary clustering module is used for carrying out preliminary clustering on the user geographic position information data in the selected range based on a Meanshift method;

the secondary clustering module is used for screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;

and the data dividing module is used for dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction.

Compared with the prior art, the invention has the beneficial effects that:

1. In terms of computational complexity, assume that the Meanshift algorithm needs T iterations to converge and that the input data set has size |R|; the time complexity of Meanshift is then $O(T|R|^2)$. The time complexity of K-means is $O(K|l|T)$, where |l| represents the size of one cluster. Assuming the number of oversized clusters is m, the computational complexity of Meanshift + K-means is $O(T|R|^2 + mK|l|T)$. Since |l| < |R|, and m, K and T are all constants much smaller than |R|, the time complexity of Meanshift + K-means reduces to $O(|R|^2)$.

2. For urban environments, the distribution of POIs tends to be locally clustered. For example, the POI distribution in the downtown area is dense, the traffic is large, and the number of POIs in the suburban area is small. For dense areas, if the POIs are not subdivided, it may result in the POIs being indistinguishable. The secondary division of Meanshift + k-means can better solve the problem and avoid a large number of check-in points from being concentrated into one cluster.

Drawings

FIG. 1 is a flow chart of a method of an embodiment of the present invention;

FIG. 2 is a distribution of global check-in data in a Flickr dataset;

FIG. 3 shows the clustering result of Flickr check-in data in Manhattan area by the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

As shown in FIG. 1, the social network user position feature extraction method based on the Meanshift and K-means integrated clustering algorithm first analyzes and preprocesses user check-in data collected from the Flickr platform, then uses the Meanshift method to perform preliminary clustering on the preprocessed check-in data, then screens out the clusters that are overly large or overly dense and performs secondary clustering on them based on the K-means method, and finally assigns the points to corresponding POIs according to the clustering results, which completes the user position feature extraction.

In a particular embodiment the method comprises the steps of:

step 1: analyzing and preprocessing pre-collected user check-in data; preferably, user check-in data is collected from the Flickr platform;

(1-1) with the help of ArcGIS, the distribution of the check-in records in the data set is depicted by drawing a scatter diagram, and the Manhattan area of New York, where check-in records are very dense, is selected as the object area of the invention. Let L be the data set of geographic location information of user check-ins, expressed as $L = (p_1, p_2, \ldots, p_m)$, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location coordinates of the i-th check-in record;

(1-2) data cleaning, namely removing records with missing fields and records that are obviously wrong.
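As a non-limiting sketch of step 1, the selection of the object area and the cleaning of the records might be implemented as below. The record layout (dictionaries with `lat` and `lon` keys), the function name `clean_checkins`, and the approximate bounding box for Manhattan are all hypothetical and chosen only for illustration; the source selects the area visually with ArcGIS and does not give numeric bounds.

```python
# Approximate bounding box around Manhattan (illustrative values, not from the source):
# (lat_min, lat_max, lon_min, lon_max)
MANHATTAN_BBOX = (40.70, 40.88, -74.02, -73.91)

def clean_checkins(records, bbox=MANHATTAN_BBOX):
    """Return (lat, lon) tuples for valid records that fall inside the bounding box."""
    lat_min, lat_max, lon_min, lon_max = bbox
    cleaned = []
    for rec in records:
        try:
            lat = float(rec["lat"])
            lon = float(rec["lon"])
        except (KeyError, TypeError, ValueError):
            continue                      # missing or malformed field
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue                      # obviously wrong coordinates (also drops NaN)
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            cleaned.append((lat, lon))    # keep only points in the object area
    return cleaned
```

The output corresponds to the data set L of coordinate pairs that the subsequent clustering steps operate on.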

Step 2: performing preliminary clustering on check-in data in a city range based on a Meanshift method;

(2-1) let any two check-in points $r_i$ and $r_j$ have coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, and calculate the distance $d_{ij}$ between them:

$$d_{ij} = 2r \arcsin\left(\sqrt{\mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)}\right)$$

where r represents the earth's radius, generally taken as 6371 km (the mean earth radius), and hav() is the haversine function, whose expanded form is:

$$\mathrm{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$

where θ represents the angle subtended at the center of the sphere by two points on its surface, and can be expressed as a difference of longitudes or latitudes.

(2-2) initially and randomly selecting cluster centers, and setting a critical parameter bandwidth (bandwidth) and a stopping threshold (stopthresh), wherein the quantity of the randomly selected cluster centers does not need to be specified specifically because the Meanshift clustering algorithm can realize the combination of similar clusters;

(2-3) updating the cluster center and the structure by superimposing an offset vector on the current cluster center coordinate vector, i.e.

$$\mathrm{Center}^{(t+1)} = \mathrm{Center}^{(t)} + \mathrm{shift}^{(t)}$$

where $\mathrm{Center}^{(t)}$ represents the current cluster center, i.e. the cluster center after the t-th offset vector has been superimposed, $\mathrm{Center}^{(t+1)}$ represents the cluster center after the (t+1)-th offset vector has been superimposed, and $\mathrm{shift}^{(t)}$ represents the offset vector of the t-th superposition, which is the mean of the offsets from all samples in the current cluster to the current cluster center and has the basic form:

$$\mathrm{shift}^{(t)} = \frac{1}{K}\sum_{x_i \in S^{(t)}} \left(x_i - \mathrm{Center}^{(t)}\right)$$

where K denotes the number of samples in the current cluster and $S^{(t)}$ denotes the set of samples in the current cluster; for every $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth:

$$S^{(t)} = \left\{x_i : \mathrm{dist}\left(x_i, \mathrm{Center}^{(t)}\right) < \mathrm{bandwidth}\right\}$$

where $\mathrm{dist}(x_i, \mathrm{Center}^{(t)})$ represents the distance from the sample point $x_i$ to the current cluster center $\mathrm{Center}^{(t)}$.

(2-4) iterating step (2-3) until the offset vector $\mathrm{shift}^{(t)}$ is smaller than the stop threshold stopthresh, so that every sample point finds its most suitable cluster center, while merging clusters that are relatively close to one another, which completes the first clustering based on the Meanshift algorithm.
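For illustration only, the great-circle distance of step (2-1) and the resulting pairwise distance matrix can be sketched in Python as follows, assuming the standard haversine formula with the mean earth radius of 6371 km. The names `hav`, `haversine_km` and `distance_matrix` are hypothetical, and coordinates are taken to be (latitude, longitude) pairs in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean earth radius, as stated in step (2-1)

def hav(theta):
    """Haversine function: hav(theta) = sin^2(theta / 2)."""
    return math.sin(theta / 2.0) ** 2

def haversine_km(p_i, p_j, r=EARTH_RADIUS_KM):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat_i, lon_i = map(math.radians, p_i)
    lat_j, lon_j = map(math.radians, p_j)
    h = hav(lat_j - lat_i) + math.cos(lat_i) * math.cos(lat_j) * hav(lon_j - lon_i)
    # min() guards against rounding pushing h slightly above 1
    return 2.0 * r * math.asin(math.sqrt(min(1.0, h)))

def distance_matrix(points):
    """Symmetric matrix D with D[i][j] = d_ij for every pair of check-in points."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = haversine_km(points[i], points[j])
    return D
```

Building the full matrix costs time and memory quadratic in the number of check-in points, so for large data sets the distances may instead be computed on demand.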

Step 3: performing secondary clustering on the large-scale clusters by using K-means;

(3-1) screening out clusters with the scale larger than a certain threshold value, and determining a parameter K of a K-means algorithm according to the scale;

(3-2) randomly selecting center points of the k clusters, and calculating a distance between each sample and each center point;

(3-3) clustering according to the principle of minimum distance, and classifying each sample into a cluster with the shortest distance;

(3-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;

(3-5) repeating the above steps for a preset number of iterations, or until the central point of each cluster no longer changes significantly between two successive iterations, whereupon the secondary division is complete.

Step 4: labeling all the points according to the clustering result, and assigning them to the corresponding POIs.
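The way steps 3 and 4 fit together can be sketched as follows. This is only one possible arrangement under stated assumptions: the rule in `choose_k` for deriving K from the cluster scale is hypothetical (the source says only that K is determined according to the scale), `split_fn` stands for any secondary clustering routine that takes the point indices of one cluster and a value of K and returns one sub-label per index, and the POI identifiers are simply consecutive integers.

```python
import math
from collections import Counter

def choose_k(cluster_size, size_threshold):
    """Hypothetical rule: roughly one sub-cluster per `size_threshold` points."""
    return max(2, math.ceil(cluster_size / size_threshold))

def assign_pois(primary_labels, size_threshold, split_fn):
    """Map every point to a POI id, re-splitting clusters larger than the threshold."""
    sizes = Counter(primary_labels)
    poi_of = {}                               # (primary label, sub label) -> POI id
    result = [None] * len(primary_labels)
    for label in sorted(sizes):
        idx = [i for i, lab in enumerate(primary_labels) if lab == label]
        if sizes[label] > size_threshold:     # oversized cluster: secondary clustering
            sub = split_fn(idx, choose_k(sizes[label], size_threshold))
        else:                                 # small cluster: kept as a single POI
            sub = [0] * len(idx)
        for i, s in zip(idx, sub):
            result[i] = poi_of.setdefault((label, s), len(poi_of))
    return result
```

In practice `split_fn` would run K-means on the coordinates of the listed points; it is passed in as a parameter here so that the labeling logic stays independent of the clustering implementation.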

Performance Evaluation

Experiments were carried out according to the above flow, and the performance of the invention was evaluated on a real LBSN data set, taking check-in data from the Flickr platform in the Manhattan area of New York City as the research object. The data was first analyzed and preprocessed. FIG. 2 shows the distribution of global check-in data in the Flickr data set, drawn with the help of ArcGIS. Hotspot analysis by kernel density estimation indicated that regions such as North America and Europe have a higher check-in density, and the Manhattan area of New York City, where check-in records are very dense, was finally selected as the object area of the invention.

The method takes the silhouette coefficient as the evaluation index for the effectiveness of a clustering algorithm, and measures the suitability of a clustering algorithm for the Flickr data set using the maximum cluster point ratio and the noise ratio.

The silhouette coefficient is calculated as follows:

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where S(i) represents the silhouette coefficient of sample i, and the mean of S(i) over all samples is the silhouette coefficient of the cluster analysis; a(i) represents the average distance from sample i to the other samples in the same cluster (intra-cluster dissimilarity); $b(i) = \min\{b(i,1), b(i,2), \ldots, b(i,k)\}$, where b(i, j) represents the average distance from sample i to all samples in another cluster j (inter-cluster dissimilarity).
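A small Python sketch of this index is given below for illustration. It assumes the usual definition S(i) = (b(i) - a(i)) / max{a(i), b(i)}, uses Euclidean distance by default, and follows the common convention of scoring a sample in a singleton cluster as 0; the function name `silhouette` is hypothetical.

```python
import math

def silhouette(points, labels, dist=math.dist):
    """Mean silhouette coefficient S over all samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)            # convention for singletons / a single cluster
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the sum includes p itself at distance 0, hence the divisor len(own) - 1)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate that samples sit closer to another cluster than to their own.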

The maximum cluster point ratio $C_{largest}$ and the noise ratio $Ratio_{noise}$ are, respectively, the proportion of all points in the data set that fall in the largest cluster after clustering, and the proportion of all points in the data set that the clustering algorithm identifies as noise. They are expressed as follows:

$$C_{largest} = \frac{l_{largest}}{|R|}, \qquad Ratio_{noise} = \frac{p_{noise}}{|R|}$$

where $l_{largest}$ represents the number of records in the largest cluster after clustering, |R| represents the total number of points in the data set, and $p_{noise}$ represents the number of noise points found by the clustering algorithm.
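Both ratios can be computed directly from a list of cluster labels, as in the following sketch. The convention that noise points carry the label -1 is an assumption borrowed from common DBSCAN implementations and is not stated in the source; the function name `cluster_ratios` is likewise hypothetical.

```python
from collections import Counter

NOISE = -1   # assumed label for noise points

def cluster_ratios(labels, noise_label=NOISE):
    """Return (C_largest, Ratio_noise) for a non-empty list of cluster labels."""
    total = len(labels)                                   # |R|
    sizes = Counter(lab for lab in labels if lab != noise_label)
    l_largest = max(sizes.values(), default=0)            # size of the largest cluster
    p_noise = total - sum(sizes.values())                 # number of noise points
    return l_largest / total, p_noise / total
```

An algorithm that assigns every point to some cluster, such as K-means, yields a noise ratio of 0 under this convention.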

To extract POIs more effectively and to avoid most check-in points gathering in a small number of POIs, a clustering method suitable for LBSN data sets needs to reduce the maximum cluster point ratio $C_{largest}$ and the noise ratio $Ratio_{noise}$ as much as possible.

POI extraction work is carried out on check-in data on a Flickr platform in a Manhattan area in New York City by using three clustering algorithms of Meanshift, DBSCAN and Meanshift + K-means, clustering effects are compared, and experimental index results shown in table 1 are obtained.

TABLE 1 Experimental index results of various clustering methods based on Flickr data set

In terms of the silhouette coefficient, the Meanshift + K-means clustering method scores highest and performs best; in terms of the maximum cluster point ratio and the noise ratio, the Meanshift + K-means values are both the smallest and again the best. Taking all indexes together, the Meanshift + K-means clustering method is more effective than the other clustering algorithms for POI extraction.

FIG. 3 shows the clustering result of Flickr sign-in data in Manhattan area, and compared with the clustering results of other clustering methods, the clustering method of Meanshift + K-means has obvious advantages in the index of maximum cluster point ratio due to the secondary division of large-scale clusters, and is a clustering method more suitable for extracting user position characteristics in social network.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the present invention has been described with reference to the particular illustrative embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications, equivalent arrangements, and equivalents thereof, which may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A social network user position feature extraction method based on Meanshift and K-means integrated clustering algorithm is characterized by comprising the following steps:

dividing the user geographical position information data to corresponding interest points according to the clustering result to complete the user position feature extraction, comprising: marking all points according to the clustering result, and dividing the points into the POI (point of interest) of the hot spot area with higher sign-in frequency of the corresponding user;

screening out a specific cluster according to a preset condition and carrying out secondary clustering based on a K-means method, wherein the method comprises the following steps:

step (1-1) screening out clusters whose scale is larger than a preset threshold, and determining the parameter K of the K-means algorithm according to the scale of the clusters;

step (1-2) randomly selecting central points of k clusters, and calculating the distance between each sample and each central point;

step (1-3) clustering according to the principle of minimum distance, and classifying each sample into the cluster with the closest distance;

step (1-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;

step (1-5) repeating the above steps for a preset number of iterations, or until the central point of each cluster no longer changes significantly between two successive iterations, and finishing the secondary division.

2. The social network user location feature extraction method based on Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the method for selecting the object area according to the pre-collected user check-in data is as follows:

with the help of ArcGIS, the distribution situation of the check-in records in the pre-collected user check-in data is described by drawing a scatter diagram, and a New York Manhattan area with very dense check-in records is selected as the object area of the invention.

3. The social network user location feature extraction method based on Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the data preprocessing comprises: and (4) cleaning the data, and removing the data with missing fields and the error data which does not meet the requirements in the data.

4. The method for extracting the location features of the users in the social network based on the Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the preliminary clustering of the check-in data in the selected range based on the Meanshift method comprises the following steps:

step (4-1) letting any two check-in points $r_i$ and $r_j$ have coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location coordinates of the i-th check-in record, and $p_j = (lat_j, lon_j)$ represents the latitude and longitude of the geographic location coordinates of the j-th check-in record;

step (4-2) cluster centers are selected at random initially, and a bandwidth and a stop threshold stopthresh are set as key parameters;

step (4-3) updating the cluster center and the structure by superimposing an offset vector on the current cluster center coordinate vector, the expression being as follows:

$$\mathrm{Center}^{(t+1)} = \mathrm{Center}^{(t)} + \mathrm{shift}^{(t)}$$

wherein $\mathrm{Center}^{(t)}$ represents the current cluster center, i.e. the cluster center after the t-th offset vector has been superimposed, $\mathrm{Center}^{(t+1)}$ represents the cluster center after the (t+1)-th offset vector has been superimposed, and $\mathrm{shift}^{(t)}$ represents the offset vector of the t-th superposition;

step (4-4) iterating step (4-3) until the offset vector $\mathrm{shift}^{(t)}$ of the t-th superposition is smaller than the stop threshold stopthresh, so that every sample point finds its most appropriate cluster center, and merging the clusters that meet the requirements, thereby completing the clustering based on the Meanshift algorithm.

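5. The social network user position feature extraction method based on Meanshift and K-means integrated clustering algorithm as claimed in claim 4, wherein the offset vector $\mathrm{shift}^{(t)}$ of the t-th superposition is the mean of the offsets from all samples in the current cluster to the current cluster center, and has the following basic form:

$$\mathrm{shift}^{(t)} = \frac{1}{K}\sum_{x_i \in S^{(t)}} \left(x_i - \mathrm{Center}^{(t)}\right)$$

wherein K denotes the number of samples in the current cluster, and $S^{(t)}$ denotes the set of samples in the current cluster whose distance to the current cluster center is smaller than the key parameter bandwidth.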

6. A social network user position feature extraction device based on Meanshift and K-means integrated clustering algorithm, characterized by comprising:

the preliminary clustering module is used for carrying out preliminary clustering on the user geographic position information data in the selected range based on the Meanshift method;

the data dividing module is used for dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction and comprises the following steps: marking all points according to the clustering result, and dividing the points into the POI (point of interest) of the hot spot area with higher sign-in frequency of the corresponding user;

screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method, comprising: step (6-1) screening out clusters whose scale is larger than a preset threshold, and determining the parameter K of the K-means algorithm according to the scale;

step (6-2) randomly selecting the central points of the k clusters, and calculating the distance between each sample and each central point;

step (6-3) clustering according to the principle of minimum distance, and classifying each sample into the cluster with the closest distance;

step (6-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;

step (6-5) repeating the above steps for a preset number of iterations, or until the central point of each cluster no longer changes significantly between two successive iterations, and finishing the secondary division.