CN112287247B - Social network user position feature extraction method and device based on Meanshift and K-means clustering - Google Patents
- Publication number
- CN112287247B CN201910628876.3A
- Authority
- CN
- China
- Prior art keywords
- data
- clustering
- user
- points
- meanshift
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a social network user position feature extraction method and device based on the Meanshift and K-means algorithms. The method finds, within massive user check-in data, the hotspot areas where users check in most frequently, that is, the locations users are actually interested in. The implementation flow comprises the following steps: first, user check-in data collected from the Flickr platform is analyzed and preprocessed, and a region with dense and representative check-in points is selected as the study area; next, the check-in data within a given city range undergoes preliminary clustering based on the Meanshift method; the clusters screened out as overly large or overly dense then undergo secondary clustering based on the K-means method; finally, the check-in points are assigned to the corresponding points of interest (POIs) according to the clustering results, which completes the user position feature extraction. The method achieves more effective position feature extraction from location-based social network (LBSN) data.
Description
Technical Field
The invention belongs to the field of intelligent information processing and data mining, and relates to the application and mining of massive user check-in data in location-based mobile social networks (LBSNs), in particular to a social network user position feature extraction method and device based on an integrated Meanshift and K-means clustering algorithm.
Background
The rapid development of location-based mobile social networks (LBSNs) has been driven by progress in mobile Internet and Global Positioning System (GPS) technologies, and a large amount of check-in data has accumulated as a result. LBSNs provide rich information and greatly increase the availability of human mobility data, which brings value in two respects. On the one hand, compared with traditional social network data, LBSN data contains the user's position information in addition to social relationship data and comment data. This connects social networking, previously confined to communication in a purely virtual world, to the spatio-temporal attributes of the real world. On the other hand, compared with conventional GPS data, LBSN data contains social relationship and comment data in addition to position data. Analysis from the geographic perspective is therefore no longer limited to isolated spatio-temporal positions, and more realistic behavior patterns can be obtained by taking the regularity and purposefulness of user activity into account. A large number of user activity characteristics and behavior patterns are hidden in LBSN data, which has made feature extraction from LBSN data a popular research problem. Such extraction reveals value for user travel and urban development and is important for further improving the quality of location-based services.
In the general approach, points of interest (POIs) that serve as visiting hotspots (areas where users check in frequently) are found by clustering the check-in points in an LBSN, and the users' position features are extracted from them. However, the suitability of clustering algorithms for LBSN data is poorly understood. Although many researchers apply a clustering algorithm directly or adapt one in a targeted way when extracting POIs, it remains unclear which clustering algorithm best suits LBSN data, and a single algorithm is usually expected to meet all of the data's characteristic requirements, which is difficult. An algorithm therefore needs to be designed around the characteristics of LBSN data, such as large data volume and uneven density, to achieve more effective position feature extraction.
Disclosure of Invention
The invention aims to solve the problem that a single algorithm cannot adapt to the large data volume and uneven density of LBSN data. It provides a social network user position feature extraction method based on an integrated Meanshift and K-means clustering algorithm, and achieves more effective LBSN user position feature extraction.
According to the characteristics of LBSN data, the designed method meets the following three criteria:
A. it can identify clusters of multiple densities;
B. it can handle clusters of arbitrary shape;
C. its time and space complexity is as low as possible.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
The social network user position feature extraction method based on the Meanshift and K-means integrated clustering algorithm specifically comprises the following steps:
selecting an object area according to pre-collected user check-in data, and acquiring the user check-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;
performing preliminary clustering on the user geographical location information data in the selected range based on a Meanshift method;
screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;
and dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction.
Further, the method for selecting the object area according to the pre-collected user check-in data comprises the following steps:
with the help of ArcGIS, the distribution situation of the check-in records in the data set is described by drawing a scatter diagram, and a New York Manhattan area with dense check-in records is selected as the object area of the invention.
Further, the data preprocessing comprises: cleaning the data by removing records with missing fields and erroneous records that do not meet the requirements.
Further, the preliminary clustering of check-in data in a selected range based on the Meanshift method comprises the following steps:
(4-1) denoting any two check-in points by $r_i$ and $r_j$, with coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location of the i-th check-in record, and $p_j = (lat_j, lon_j)$ represents the latitude and longitude of the geographic location of the j-th check-in record;
calculating the distance $d_{ij}$ between any two check-in points, where the expression is as follows:
$$\mathrm{hav}\!\left(\frac{d_{ij}}{r}\right) = \mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)$$
that is, $d_{ij} = 2r \arcsin\sqrt{\mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)}$,
where r represents the earth's radius, and hav() is an abbreviation of the haversine function, whose expanded form is:
$$\mathrm{hav}(\theta) = \sin^2\!\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$
θ represents the angle formed at the center of the sphere by the lines connecting it to two points on the spherical surface;
forming a distance matrix D based on the distance $d_{ij}$ between any two check-in points;
(4-2) initially and randomly selecting a cluster center, and setting a key parameter bandwidth and a stop threshold stopthresh;
(4-3) updating the cluster center and the cluster structure by adding an offset vector to the current cluster center coordinate vector, where the expression is as follows:
$$\mathrm{Center}^{(t+1)} = \mathrm{Center}^{(t)} + \mathrm{shift}^{(t)}$$
where $\mathrm{Center}^{(t)}$ represents the current cluster center, that is, the cluster center after the t-th offset vector has been added, $\mathrm{Center}^{(t+1)}$ represents the cluster center after the (t+1)-th offset vector has been added, and $\mathrm{shift}^{(t)}$ represents the t-th offset vector;
(4-4) iterating step (4-3) until the t-th offset vector $\mathrm{shift}^{(t)}$ is smaller than the stop threshold stopthresh, so that every sample point finds its most appropriate cluster center, and merging the clusters that meet the requirements, which completes the clustering based on the Meanshift algorithm.
Further, the t-th offset vector $\mathrm{shift}^{(t)}$ represents the mean offset of all samples in the current cluster from the current cluster center, and has the following basic form:
$$\mathrm{shift}^{(t)} = \frac{1}{K} \sum_{x_i \in S^{(t)}} \left( x_i - \mathrm{Center}^{(t)} \right)$$
where K denotes the number of samples in the current cluster and $S^{(t)}$ represents the set of samples in the current cluster; for any $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth, as expressed below:
$$S^{(t)} = \left\{ x_i : d\!\left(x_i, \mathrm{Center}^{(t)}\right) < \mathrm{bandwidth} \right\}$$
where $d(x_i, \mathrm{Center}^{(t)})$ represents the distance from sample point $x_i$ to the current cluster center $\mathrm{Center}^{(t)}$, and bandwidth represents the key parameter bandwidth.
Further, screening out a specific cluster according to a preset condition and carrying out secondary clustering based on a K-means method comprises the following steps:
(6-1) screening out clusters with the scale larger than a preset threshold value, and determining a parameter K of a K-means algorithm according to the scale;
(6-2) randomly selecting center points of the k clusters, and calculating a distance between each sample and each center point;
(6-3) clustering according to the principle of minimum distance, and classifying each sample into the cluster with the closest distance;
(6-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;
(6-5) repeating the distance calculation, assignment, and center update of steps (6-2) to (6-4) for a set number of iterations, or until the center point of each cluster no longer changes significantly between two successive iterations, which completes the secondary division.
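Further, dividing the user geographical position information data into corresponding interest points according to the clustering result comprises:
marking all points according to the clustering result, and assigning each point to the corresponding POI, that is, a hotspot area where users check in frequently.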
In another aspect, the present invention provides a social network user position feature extraction device based on the Meanshift and K-means integrated clustering algorithm, including:
the data preprocessing module is used for selecting an object area according to pre-collected user sign-in data and acquiring the user sign-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;
the preliminary clustering module is used for carrying out preliminary clustering on the user geographic position information data in the selected range based on a Meanshift method;
the secondary clustering module is used for screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;
and the data dividing module is used for dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction.
Compared with the prior art, the invention has the beneficial effects that:
1. In terms of computational complexity, assume that the Meanshift algorithm needs T iterations to converge and that the size of the input data set is |R|; the time complexity of Meanshift is then $O(T|R|^2)$. The time complexity of K-means is $O(K|l|T)$, where |l| represents the size of one cluster. Assuming that the number of oversized clusters is m, the computational complexity of Meanshift + K-means is $O(T|R|^2 + mK|l|T)$. Since $|l| < |R|$ and m, K, and T are all constants much smaller than |R|, the time complexity of Meanshift + K-means reduces to $O(|R|^2)$.
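As a worked check of this simplification, the bound can be written out step by step. This is a sketch under the stated assumptions only: |l| < |R|, and m, K, and T are treated as constants with mK < |R|.

```latex
\begin{aligned}
T|R|^2 + mK|l|T &< T|R|^2 + mKT|R|            && \text{since } |l| < |R| \\
                &= T|R|\,\bigl(|R| + mK\bigr) \\
                &< 2T|R|^2                     && \text{since } mK < |R| \\
                &= O\!\left(|R|^2\right)       && \text{since } T \text{ is a constant.}
\end{aligned}
```

The pairwise Meanshift term dominates, so the secondary K-means pass does not change the overall order of growth.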
2. In urban environments, the distribution of POIs tends to be locally clustered. For example, POIs are densely distributed and heavily visited downtown, while suburban areas contain few POIs. In dense areas, POIs that are not subdivided may become indistinguishable. The secondary division of Meanshift + K-means handles this problem better and prevents a large number of check-in points from being concentrated in a single cluster.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a distribution of global check-in data in a Flickr dataset;
FIG. 3 shows the clustering result of Flickr check-in data in Manhattan area by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the social network user position feature extraction method based on the Meanshift and K-means integrated clustering algorithm first analyzes and preprocesses user check-in data collected from the Flickr platform, then performs preliminary clustering on the preprocessed check-in data using the Meanshift method, then screens out the clusters that are overly large or overly dense and performs secondary clustering on them based on the K-means method, and finally assigns the points to the corresponding POIs according to the clustering results, which completes the user position feature extraction.
In a particular embodiment the method comprises the steps of:
Step 1: analyzing and preprocessing pre-collected user check-in data; preferably, the user check-in data is collected from the Flickr platform;
(1-1) with the help of ArcGIS, the distribution of the check-in records in the data set is described by drawing a scatter diagram, and the Manhattan area of New York, where check-in records are very dense, is selected as the object area of the invention. Let L be the data set of geographic locations at which users checked in, which can be expressed as $L = (p_1, p_2, \ldots, p_m)$, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location of the i-th check-in record;
(1-2) data cleaning, namely removing records with missing fields and records that are obviously erroneous.
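The cleaning in step (1-2) can be sketched as a simple filter. This is a minimal illustration rather than the patent's implementation: the record keys `user_id`, `lat`, and `lon`, the function name `clean_checkins`, and the bounding box values are all assumptions introduced here.

```python
# Rough, illustrative bounding box around Manhattan (an assumption, not from the source):
# (lat_min, lat_max, lon_min, lon_max) in degrees.
MANHATTAN_BBOX = (40.68, 40.88, -74.03, -73.90)

def clean_checkins(records, bbox=MANHATTAN_BBOX):
    """Keep records with complete, valid coordinates that fall inside bbox.

    records: iterable of dicts; the keys "user_id", "lat", "lon" are hypothetical.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    cleaned = []
    for rec in records:
        # Drop records with missing fields.
        if any(rec.get(key) in (None, "") for key in ("user_id", "lat", "lon")):
            continue
        # Drop records whose coordinates cannot be parsed as numbers.
        try:
            lat, lon = float(rec["lat"]), float(rec["lon"])
        except (TypeError, ValueError):
            continue
        # Drop coordinates outside the valid range (this also rejects NaN).
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue
        # Drop records outside the selected object area.
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            continue
        cleaned.append({"user_id": rec["user_id"], "lat": lat, "lon": lon})
    return cleaned
```

The surviving records give the coordinate list L = (p_1, ..., p_m) that the later steps cluster.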
Step 2: performing preliminary clustering on check-in data in a city range based on a Meanshift method;
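(2-1) for any two check-in points $r_i$ and $r_j$, with coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, calculating the distance $d_{ij}$ between them:
$$\mathrm{hav}\!\left(\frac{d_{ij}}{r}\right) = \mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)$$
that is, $d_{ij} = 2r \arcsin\sqrt{\mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)}$,
where r represents the earth's radius, generally taken as 6371 km (the mean earth radius), and hav() is an abbreviation of the haversine function, whose expanded form is:
$$\mathrm{hav}(\theta) = \sin^2\!\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$
where θ represents the angle formed at the center of the sphere by the lines connecting it to two points on the spherical surface, and can be represented by a difference of longitude or latitude.
A distance matrix D is formed based on the distance $d_{ij}$ between any two check-in points;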
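The distance calculation of step (2-1) can be sketched in a few lines using the standard haversine formula. The function names below are hypothetical, and coordinates are assumed to be given in degrees as (lat, lon) pairs.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius, as given in the description

def hav(theta):
    """Haversine of an angle in radians: sin^2(theta / 2)."""
    return math.sin(theta / 2.0) ** 2

def haversine_km(p_i, p_j, r=EARTH_RADIUS_KM):
    """Great-circle distance d_ij between two (lat, lon) points in degrees."""
    lat_i, lon_i = map(math.radians, p_i)
    lat_j, lon_j = map(math.radians, p_j)
    h = hav(lat_j - lat_i) + math.cos(lat_i) * math.cos(lat_j) * hav(lon_j - lon_i)
    # Clamp to guard against rounding pushing h slightly above 1.
    return 2.0 * r * math.asin(math.sqrt(min(1.0, h)))

def distance_matrix(points):
    """Symmetric matrix D with D[i][j] = d_ij and zeros on the diagonal."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = haversine_km(points[i], points[j])
    return D
```

Building the full matrix costs O(|R|^2) time and memory, which matches the pairwise term in the complexity discussion above.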
(2-2) initially and randomly selecting cluster centers, and setting a critical parameter bandwidth (bandwidth) and a stopping threshold (stopthresh), wherein the quantity of the randomly selected cluster centers does not need to be specified specifically because the Meanshift clustering algorithm can realize the combination of similar clusters;
(2-3) updating the cluster center and the cluster structure by adding an offset vector to the current cluster center coordinate vector, i.e.
$$\mathrm{Center}^{(t+1)} = \mathrm{Center}^{(t)} + \mathrm{shift}^{(t)}$$
where $\mathrm{Center}^{(t)}$ represents the current cluster center, that is, the cluster center after the t-th offset vector has been added, $\mathrm{Center}^{(t+1)}$ represents the cluster center after the (t+1)-th offset vector has been added, and $\mathrm{shift}^{(t)}$ represents the t-th offset vector, which is the mean offset of all samples in the current cluster from the current cluster center and has the basic form:
$$\mathrm{shift}^{(t)} = \frac{1}{K} \sum_{x_i \in S^{(t)}} \left( x_i - \mathrm{Center}^{(t)} \right)$$
where K denotes the number of samples in the current cluster and $S^{(t)}$ represents the set of samples in the current cluster; for any $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth:
$$S^{(t)} = \left\{ x_i : d\!\left(x_i, \mathrm{Center}^{(t)}\right) < \mathrm{bandwidth} \right\}$$
where $d(x_i, \mathrm{Center}^{(t)})$ represents the distance from sample point $x_i$ to the current cluster center $\mathrm{Center}^{(t)}$.
(2-4) iterating step (2-3) until the offset vector $\mathrm{shift}^{(t)}$ is smaller than the stop threshold stopthresh, so that every sample point finds its most suitable cluster center, while merging clusters that are relatively close to each other, which completes one round of clustering based on the Meanshift algorithm.
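Steps (2-2) to (2-4) can be illustrated with a small flat-kernel mean shift. This is a sketch under several assumptions that differ from or go beyond the text: points are treated as planar 2-D coordinates with Euclidean distance rather than haversine distance, every point is used as a seed instead of random seeds (for reproducibility), and converged centers closer than one bandwidth are merged, a rule the description does not specify.

```python
import math

def mean_shift(points, bandwidth, stopthresh=1e-6, max_iter=300):
    """Flat-kernel mean shift on 2-D points; returns (centers, labels)."""
    modes = []
    for seed in points:  # assumption: every point is a seed
        cx, cy = seed
        for _ in range(max_iter):
            # S(t): samples whose distance to the current center is < bandwidth.
            window = [(x, y) for x, y in points
                      if math.hypot(x - cx, y - cy) < bandwidth]
            # shift(t): mean offset of the window samples from the center.
            sx = sum(x - cx for x, _ in window) / len(window)
            sy = sum(y - cy for _, y in window) / len(window)
            # Center(t+1) = Center(t) + shift(t)
            cx, cy = cx + sx, cy + sy
            if math.hypot(sx, sy) < stopthresh:
                break
        modes.append((cx, cy))
    # Merge converged centers that lie within one bandwidth of each other.
    centers = []
    for m in modes:
        if all(math.hypot(m[0] - c[0], m[1] - c[1]) >= bandwidth for c in centers):
            centers.append(m)
    # Assign every sample to its nearest surviving center.
    labels = [min(range(len(centers)),
                  key=lambda k: math.hypot(p[0] - centers[k][0], p[1] - centers[k][1]))
              for p in points]
    return centers, labels
```

The number of clusters is not passed in; it falls out of the bandwidth and the merge step, which is why the description notes that the number of initial centers need not be specified.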
Step 3: performing secondary clustering on the large-scale clusters by using K-means;
(3-1) screening out clusters with the scale larger than a certain threshold value, and determining a parameter K of a K-means algorithm according to the scale;
(3-2) randomly selecting center points of the k clusters, and calculating a distance between each sample and each center point;
(3-3) clustering according to the principle of minimum distance, and classifying each sample into a cluster with the shortest distance;
(3-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;
(3-5) repeating the distance calculation, assignment, and center update of steps (3-2) to (3-4) for a set number of iterations, or until the center point of each cluster no longer changes significantly between two successive iterations, which completes the secondary division.
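The secondary division of steps (3-1) to (3-5) can be sketched as follows. The description does not say how K is derived from the cluster scale, so the rule used here, K = ceil(size / size_threshold), is purely a hypothetical placeholder; the function names and the use of planar Euclidean distance are also assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Lloyd's K-means on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # (3-2) random initial center points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # (3-3) assign each sample to the nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # (3-4) recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # (3-5) stop once no center moves by more than tol.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers, labels

def split_large_clusters(points, labels, size_threshold):
    """(3-1) re-cluster every cluster larger than size_threshold with K-means."""
    final = list(labels)
    next_label = max(labels) + 1
    for lab in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == lab]
        if len(idx) <= size_threshold:
            continue
        k = math.ceil(len(idx) / size_threshold)  # hypothetical rule for K
        _, sub = kmeans([points[i] for i in idx], k)
        for i, s in zip(idx, sub):
            if s > 0:  # sub-cluster 0 keeps the original label
                final[i] = next_label + s - 1
        next_label += k - 1
    return final
```

Only the oversized clusters are touched, so small clusters found by Meanshift keep their original labels.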
Step 4: marking all the points according to the clustering result, and dividing the points into corresponding POIs.
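One way to picture step 4 is to group the labelled points and summarize each group as a POI. The sketch below is illustrative only: the function name `build_pois` and the output fields are hypothetical, and the centroid is a plain arithmetic mean of latitude and longitude, which is an approximation that is reasonable at city scale.

```python
from collections import defaultdict

def build_pois(points, labels):
    """Group labelled (lat, lon) check-in points into POIs.

    Returns a dict mapping each cluster label to a centroid and a check-in count.
    """
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    pois = {}
    for lab, members in groups.items():
        n = len(members)
        pois[lab] = {
            "centroid": (sum(lat for lat, _ in members) / n,
                         sum(lon for _, lon in members) / n),
            "checkins": n,
        }
    return pois
```

The check-in count per POI gives a direct measure of how frequently users visit each hotspot.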
Performance Evaluation
Experiments were carried out following the above flow, and the performance of the invention was evaluated on a real LBSN data set, taking check-in data from the Flickr platform in the Manhattan area of New York City as the research object. The data was first analyzed and preprocessed. FIG. 2 shows the distribution of global check-in data in the Flickr data set, drawn with the help of ArcGIS. After hotspot areas were analyzed by kernel density estimation, regions such as North America and Europe were found to have higher check-in density, and the Manhattan area of New York City, where check-in records are very dense, was finally selected as the object area of the invention.
The method takes the silhouette coefficient as the evaluation index for measuring the effectiveness of a clustering algorithm, and measures the suitability of a clustering algorithm for the Flickr data set by the maximum cluster point ratio and the noise ratio.
The silhouette coefficient is calculated as follows:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where S(i) represents the silhouette coefficient of sample i, and the mean of S(i) over all samples is the silhouette coefficient of the cluster analysis; a(i) represents the average distance from sample i to the other samples in the same cluster (intra-cluster dissimilarity); $b(i) = \min\{b(i,1), b(i,2), \ldots, b(i,k)\}$, where b(i, j) represents the average distance from sample i to all samples in another cluster j (inter-cluster dissimilarity).
The maximum cluster point ratio $C_{largest}$ and the noise ratio $Ratio_{noise}$ are, respectively, the proportion of all points in the data set that fall in the largest cluster after clustering, and the proportion of all points in the data set that the clustering algorithm identifies as noise points. The expressions are as follows:
$$C_{largest} = \frac{l_{largest}}{|R|}, \qquad Ratio_{noise} = \frac{p_{noise}}{|R|}$$
where $l_{largest}$ represents the number of records in the largest cluster after clustering, |R| represents the total number of points in the data set, and $p_{noise}$ represents the number of noise points found by the clustering algorithm.
To extract POIs more effectively and to avoid most check-in points gathering into a small number of POIs, a clustering method suited to an LBSN data set needs to keep the maximum cluster point ratio $C_{largest}$ and the noise ratio $Ratio_{noise}$ as low as possible.
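The three indexes can be computed with a short sketch like the one below. Several choices here are assumptions rather than details from the text: noise points carry the hypothetical label -1 and are excluded from the silhouette mean, singleton clusters score 0, distances are planar Euclidean, and at least two clusters are assumed to exist.

```python
import math
from collections import Counter

NOISE = -1  # hypothetical label for noise points

def silhouette_coefficient(points, labels):
    """Mean of S(i) = (b(i) - a(i)) / max(a(i), b(i)) over clustered samples.

    Assumes at least two non-noise clusters.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != NOISE:
            clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        if lab == NOISE:
            continue
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): mean distance to the other samples of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the samples of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def largest_cluster_ratio(labels):
    """C_largest: share of all points that fall in the largest cluster."""
    counts = Counter(lab for lab in labels if lab != NOISE)
    return max(counts.values()) / len(labels)

def noise_ratio(labels):
    """Ratio_noise: share of all points labelled as noise."""
    return sum(1 for lab in labels if lab == NOISE) / len(labels)
```

The silhouette loop is quadratic in the number of points, so on a large check-in data set it would typically be run on a sample.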
POI extraction work is carried out on check-in data on a Flickr platform in a Manhattan area in New York City by using three clustering algorithms of Meanshift, DBSCAN and Meanshift + K-means, clustering effects are compared, and experimental index results shown in table 1 are obtained.
TABLE 1 Experimental index results of various clustering methods based on Flickr data set
In terms of the silhouette coefficient, the Meanshift + K-means clustering method scores highest and performs best; in terms of the maximum cluster point ratio and the noise ratio, Meanshift + K-means has the smallest values and again performs best. Taking all indexes together, the Meanshift + K-means clustering method is more effective than the other clustering algorithms at extracting POIs.
FIG. 3 shows the clustering result of Flickr check-in data in the Manhattan area. Compared with the results of the other clustering methods, Meanshift + K-means has a clear advantage in the maximum cluster point ratio because it subdivides large-scale clusters, which makes it a clustering method better suited to extracting user position features in social networks.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the particular illustrative embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications, equivalent arrangements, and equivalents thereof, which may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (6)
1. A social network user position feature extraction method based on an integrated Meanshift and K-means clustering algorithm, characterized by comprising the following steps:
selecting an object area according to pre-collected user check-in data, and acquiring the user check-in data of the object area; extracting user geographical location information data from the check-in data, and preprocessing the extracted data;
performing preliminary clustering on the user geographical location information data in the selected range based on the Meanshift method;
screening out specific clusters according to preset conditions and performing secondary clustering on them based on the K-means method;
assigning the user geographical location information data to the corresponding points of interest according to the clustering result to complete the user position feature extraction, comprising: labeling all points according to the clustering result, and assigning the points to the points of interest (POIs) of the hot-spot areas where the corresponding user checks in more frequently;
wherein screening out specific clusters according to preset conditions and performing secondary clustering based on the K-means method comprises:
step (1-1) screening out clusters whose scale is larger than a preset threshold, and determining the parameter k of the K-means algorithm according to the scale of each such cluster;
step (1-2) randomly selecting the center points of k clusters, and calculating the distance between each sample and each center point;
step (1-3) clustering according to the minimum-distance principle, assigning each sample to the cluster whose center is closest;
step (1-4) based on the current clustering result, recalculating the mean of the sample coordinates in each cluster to determine a new center point;
step (1-5) repeating the above steps for a number of iterations, or stopping the iteration once the center point of each cluster no longer changes significantly between two iterations, thereby completing the secondary division.
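The secondary clustering in claim 1 can be illustrated with a minimal sketch. The claim does not state how k is derived from the cluster scale or which distance is used, so the rule in `choose_k` (one sub-cluster per `points_per_subcluster` points), the helper names, and the use of plain Euclidean distance on coordinate tuples are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def screen_large_clusters(labels, threshold):
    """Return {cluster_id: size} for clusters whose size exceeds `threshold`."""
    sizes = Counter(labels)
    return {cid: n for cid, n in sizes.items() if n > threshold}


def choose_k(cluster_size, points_per_subcluster=50):
    """Hypothetical rule: one sub-cluster per `points_per_subcluster` points, at least 2."""
    return max(2, math.ceil(cluster_size / points_per_subcluster))


def _assign(points, centers):
    """Assign each point to the index of its nearest center (minimum-distance principle)."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means on a list of coordinate tuples (Euclidean distance is an assumption)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial center points
    for _ in range(max_iter):
        labels = _assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New center is the mean of the member coordinates.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center unchanged
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:                      # centers no longer change significantly
            break
    return centers, _assign(points, centers)
```

In this sketch, `screen_large_clusters` would be applied to the labels produced by the preliminary clustering, and `kmeans` would then be run separately on the points of each oversized cluster with `k = choose_k(size)`.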
2. The social network user location feature extraction method based on the Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the object area is selected according to the pre-collected user check-in data as follows:
using ArcGIS, the distribution of the check-in records in the pre-collected user check-in data is depicted by drawing a scatter plot, and the Manhattan area of New York, where check-in records are very dense, is selected as the object area of the invention.
3. The social network user location feature extraction method based on the Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the data preprocessing comprises: cleaning the data by removing records with missing fields and erroneous records that do not meet the requirements.
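A minimal sketch of the cleaning step in claim 3 follows. The record layout is not given in the claims, so the field names (`user_id`, `lat`, `lon`, `timestamp`) and the validity rule (coordinates must parse as numbers within the valid latitude and longitude ranges) are hypothetical choices for illustration.

```python
REQUIRED_FIELDS = ("user_id", "lat", "lon", "timestamp")  # hypothetical field names


def clean_checkins(records):
    """Drop records with missing fields or with coordinates that are not valid."""
    cleaned = []
    for rec in records:
        # Remove records in which any required field is absent or empty.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        # Remove records whose coordinates cannot be parsed as numbers.
        try:
            lat, lon = float(rec["lat"]), float(rec["lon"])
        except (TypeError, ValueError):
            continue
        # Remove records whose coordinates fall outside the valid ranges
        # (this comparison also rejects NaN values).
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue
        cleaned.append({**rec, "lat": lat, "lon": lon})
    return cleaned
```

What counts as data that does not meet the requirements is left open by the claim; a real pipeline might also restrict records to the bounding box of the selected object area.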
4. The social network user location feature extraction method based on the Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the preliminary clustering of the check-in data in the selected range based on the Meanshift method comprises the following steps:
step (4-1) denoting any two check-in points as $r_i$ and $r_j$, with coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, wherein $p_i$ gives the latitude and longitude of the geographic location of the i-th check-in record, and $p_j$ gives the latitude and longitude of the geographic location of the j-th check-in record;
calculating the distance $d_{ij}$ between any two check-in points, expressed as:
$$\operatorname{hav}\left(\frac{d_{ij}}{r}\right) = \operatorname{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\operatorname{hav}(lon_j - lon_i)$$
where $r$ represents the earth's radius, and hav() is an abbreviation of the haversine function, whose expanded form is:
$$\operatorname{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$
where $\theta$ represents the angle formed by the lines connecting the two points on the spherical surface to the center of the sphere;
forming a distance matrix $D$ from the distances $d_{ij}$ between all pairs of check-in points;
step (4-2) initially selecting cluster centers at random, and setting a bandwidth and a stop threshold stopthresh as key parameters;
step (4-3) updating the cluster center and the cluster structure by adding an offset vector to the current cluster center coordinate vector, expressed as:
$$\mathrm{Center}^{(t+1)} = \mathrm{Center}^{(t)} + \mathrm{shift}^{(t)}$$
wherein $\mathrm{Center}^{(t)}$ represents the current cluster center, that is, the cluster center after the t-th offset vector has been added, $\mathrm{Center}^{(t+1)}$ represents the cluster center after the (t+1)-th offset vector has been added, and $\mathrm{shift}^{(t)}$ represents the offset vector of the t-th superposition;
step (4-4) taking an offset vector smaller than the stop threshold stopthresh as the target, that is, requiring the offset vector of the t-th superposition to satisfy $\lVert \mathrm{shift}^{(t)} \rVert < \mathrm{stopthresh}$, iterating step (4-3) until every sample point has found its most appropriate cluster center, and merging the clusters that meet the requirements, thereby completing the clustering based on the Meanshift algorithm.
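The distance matrix of step (4-1) can be sketched as follows, using the standard haversine formula solved for the distance. The Earth radius value (a mean radius of 6371 km) and the function names are assumptions; the claim only states that r is the earth's radius.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the claim does not fix a value


def hav(theta):
    """Haversine function: hav(theta) = sin^2(theta / 2)."""
    return math.sin(theta / 2.0) ** 2


def haversine_km(p_i, p_j, r=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat_i, lon_i = map(math.radians, p_i)
    lat_j, lon_j = map(math.radians, p_j)
    h = hav(lat_j - lat_i) + math.cos(lat_i) * math.cos(lat_j) * hav(lon_j - lon_i)
    # Clamp to 1.0 to guard against rounding error before taking the arcsine.
    return 2.0 * r * math.asin(math.sqrt(min(1.0, h)))


def distance_matrix(points):
    """Symmetric matrix D with D[i][j] = haversine distance between points i and j."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = haversine_km(points[i], points[j])
    return D
```

Filling only the upper triangle and mirroring it halves the work, since the distance is symmetric and the diagonal is zero. For large check-in sets the full matrix grows quadratically, so a spatial index may be preferable in practice.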
5. The social network user location feature extraction method based on the Meanshift and K-means integrated clustering algorithm as claimed in claim 4, wherein the offset vector $\mathrm{shift}^{(t)}$ of the t-th superposition represents the mean of the offsets from the current cluster center to all samples in the current cluster, and has the following basic form:
$$\mathrm{shift}^{(t)} = \frac{1}{K}\sum_{x_i \in S^{(t)}} \left(x_i - \mathrm{Center}^{(t)}\right)$$
where $K$ denotes the number of samples in the current cluster and $S^{(t)}$ represents the set of samples in the current cluster; for any $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth, expressed as:
$$S^{(t)} = \left\{\, x_i : \lVert x_i - \mathrm{Center}^{(t)} \rVert < \mathrm{bandwidth} \,\right\}$$
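The iteration in claims 4 and 5 can be sketched with a flat-kernel mean shift. This is an illustrative simplification, not the patented implementation: it uses Euclidean distance on planar coordinate tuples instead of the great-circle distance, it seeds one search from every sample rather than from randomly chosen centers, and it merges converged centers that lie within one bandwidth of each other. The function name and default values are assumptions.

```python
import math


def mean_shift(points, bandwidth, stopthresh=1e-3, max_iter=300):
    """Flat-kernel mean shift; returns (cluster centers, label of each point)."""
    modes = []
    for start in points:                  # assumption: one seed per sample
        center = start
        for _ in range(max_iter):
            # S(t): samples whose distance to the current center is below the bandwidth.
            window = [p for p in points if math.dist(p, center) < bandwidth]
            # shift(t): mean offset from the current center to the samples in S(t).
            shift = tuple(
                sum(p[d] - center[d] for p in window) / len(window)
                for d in range(len(center))
            )
            # Center(t+1) = Center(t) + shift(t)
            center = tuple(c + s for c, s in zip(center, shift))
            if math.hypot(*shift) < stopthresh:   # stop once the shift is small enough
                break
        modes.append(center)

    # Merge converged centers that lie within one bandwidth of an existing center.
    centers, labels = [], []
    for mode in modes:
        for idx, existing in enumerate(centers):
            if math.dist(mode, existing) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels
```

Because each seed is itself a sample, the window always contains at least one point when the bandwidth is positive, so the mean is well defined. A geographic variant would replace `math.dist` with a great-circle distance and express the bandwidth in the same unit.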
6. A social network user position feature extraction device based on an integrated Meanshift and K-means clustering algorithm, characterized by comprising:
a data preprocessing module, configured to select an object area according to pre-collected user check-in data and acquire the user check-in data of the object area, and to extract user geographical location information data from the check-in data and preprocess the extracted data;
a preliminary clustering module, configured to perform preliminary clustering on the user geographical location information data in the selected range based on the Meanshift method;
a secondary clustering module, configured to screen out specific clusters according to preset conditions and perform secondary clustering on them based on the K-means method;
a data dividing module, configured to assign the user geographical location information data to the corresponding points of interest according to the clustering result to complete the user position feature extraction, comprising: labeling all points according to the clustering result, and assigning the points to the points of interest (POIs) of the hot-spot areas where the corresponding user checks in more frequently;
wherein screening out specific clusters according to preset conditions and performing secondary clustering based on the K-means method comprises:
step (6-1) screening out clusters whose scale is larger than a preset threshold, and determining the parameter k of the K-means algorithm according to the scale of each such cluster;
step (6-2) randomly selecting the center points of k clusters, and calculating the distance between each sample and each center point;
step (6-3) clustering according to the minimum-distance principle, assigning each sample to the cluster whose center is closest;
step (6-4) based on the current clustering result, recalculating the mean of the sample coordinates in each cluster to determine a new center point;
step (6-5) repeating the above steps for a number of iterations, or stopping the iteration once the center point of each cluster no longer changes significantly between two iterations, thereby completing the secondary division.
Priority Applications (1)
Application Number | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201910628876.3A | CN112287247B (en) | 2019-07-12 | 2019-07-12 | Social network user position feature extraction method and device based on Meanshift and K-means clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112287247A (en) | 2021-01-29 |
CN112287247B (en) | 2022-11-11 |
Family
ID=74418576
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201910628876.3A | Active | CN112287247B (en) | 2019-07-12 | 2019-07-12 | Social network user position feature extraction method and device based on Meanshift and K-means clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287247B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN112287247A (en) | 2021-01-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||