CN112287247B - Social network user position feature extraction method and device based on Meanshift and K-means clustering - Google Patents

Social network user position feature extraction method and device based on Meanshift and K-means clustering

Info

Publication number
CN112287247B
Authority
CN
China
Prior art keywords
data
clustering
user
points
meanshift
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910628876.3A
Other languages
Chinese (zh)
Other versions
CN112287247A (en)
Inventor
史英吉
王海艳
吕朝萍
何旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910628876.3A
Publication of CN112287247A
Application granted
Publication of CN112287247B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for extracting social network user location features based on the Meanshift and K-means algorithms. The method finds, in massive user check-in data, hot-spot areas with a high user check-in frequency, i.e. the locations users are really interested in. The implementation flow comprises the following steps: first, user check-in data collected from the Flickr platform is analyzed and preprocessed, and a region with dense and representative check-in points is selected as the research region; then the check-in data within a given city range is preliminarily clustered using the Meanshift method; next, the large-scale clusters and overly dense clusters that are screened out are clustered a second time using the K-means method; finally, the check-in points are assigned to the corresponding points of interest (POIs) according to the clustering results, which completes the user location feature extraction. The method achieves more effective location feature extraction from LBSN data.

Description

Social network user position feature extraction method and device based on Meanshift and K-means clustering
Technical Field
The invention belongs to the field of intelligent information processing and data mining, and relates to the application and mining of massive user check-in data in location-based mobile social networks (LBSNs), and in particular to a social network user location feature extraction method and device based on a Meanshift and K-means integrated clustering algorithm.
Background
Driven by advances in mobile Internet and Global Positioning System (GPS) technologies, location-based mobile social networks (LBSNs) have developed rapidly and accumulated a large amount of check-in data. This rapid development provides rich information and greatly improves the availability of human mobility data, which brings value in two respects. On the one hand, compared with traditional social network data, LBSN data contains the user's location information in addition to social relationship data and comment data, which connects social networking, previously confined to communication in a purely virtual world, to the spatio-temporal attributes of the real world. On the other hand, compared with conventional GPS data, LBSN data contains social relationship and comment data in addition to location data. Analysis from the geographic perspective is therefore no longer limited to a single spatio-temporal position, and more practical behavior patterns can be obtained by taking into account the regularity and purposefulness of user activity. A large number of user activity characteristics and behavior patterns are hidden in LBSN data, so feature extraction from LBSN data has become a popular research problem. Such work reveals value for user travel and urban development, and is of great significance for further improving the quality of location-based services.
In the general approach, points of interest (POIs) that serve as access hotspots (hot-spot areas with a high user check-in frequency) are found by clustering the check-in points in the LBSN, thereby extracting the user's location features. However, the adaptability of clustering algorithms to LBSN data is poorly understood. Although many researchers have applied clustering algorithms directly, or improved them in a targeted manner, when extracting POIs, the question of which clustering algorithm best suits LBSN data remains open, and a single algorithm is usually expected to satisfy all the characteristics of LBSN data, which is difficult to achieve. An algorithm therefore needs to be designed around the characteristics of LBSN data, such as large data volume and uneven density, to achieve more effective location feature extraction.
Disclosure of Invention
The invention aims to solve the problem that a single algorithm cannot adapt to the large data volume and uneven density of LBSN data, and provides a social network user location feature extraction method based on a Meanshift and K-means integrated clustering algorithm to achieve more effective extraction of LBSN user location features.
According to the characteristics of LBSN data, the designed method meets the following three criteria:
A. it can identify clusters of multiple densities;
B. it can handle clusters of arbitrary shape;
C. its time and space complexity is as low as possible.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
the technical scheme adopted by the social network user position feature extraction method based on Meanshift and K-means integrated clustering algorithm specifically comprises the following steps:
selecting an object area according to pre-collected user check-in data, and acquiring the user check-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;
performing preliminary clustering on the user geographical location information data in the selected range based on a Meanshift method;
screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;
and dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction.
Further, the method for selecting the object area according to the pre-collected user check-in data comprises the following steps:
with the help of ArcGIS, the distribution situation of the check-in records in the data set is described by drawing a scatter diagram, and a New York Manhattan area with dense check-in records is selected as the object area of the invention.
Further, the data preprocessing comprises: cleaning the data by removing records with missing fields and erroneous records that do not meet the requirements.
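The cleaning step can be illustrated with a minimal sketch. The record layout (dicts with `lat` and `lon` keys), the function name `clean_checkins` and the bounding-box filter are assumptions made for illustration; the source does not specify the data schema or the exact validity rules.

```python
def clean_checkins(records, bbox):
    """Drop records with missing or invalid coordinates, or outside the object area.

    records: iterable of dicts with hypothetical keys 'lat' and 'lon'.
    bbox:    (min_lat, max_lat, min_lon, max_lon) of the object area (assumed).
    Returns a list of (lat, lon) tuples.
    """
    min_lat, max_lat, min_lon, max_lon = bbox
    cleaned = []
    for rec in records:
        try:
            lat = float(rec["lat"])
            lon = float(rec["lon"])
        except (KeyError, TypeError, ValueError):
            continue  # field missing, None, or not numeric
        # NaN values fail both comparisons and are dropped here as well
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue
        cleaned.append((lat, lon))
    return cleaned


# Rough, illustrative bounding box around Manhattan (not taken from the source)
MANHATTAN_BBOX = (40.68, 40.88, -74.03, -73.90)
```

A record such as `{"lat": "40.75", "lon": "-73.98"}` is kept, while records with a missing field or with coordinates outside the box are discarded.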
Further, the preliminary clustering of check-in data in a selected range based on the Meanshift method comprises the following steps:
(4-1) recording any two check-in points $r_i$ and $r_j$ with coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location coordinates of the $i$-th check-in data, and $p_j = (lat_j, lon_j)$ represents the latitude and longitude of the geographic location coordinates of the $j$-th check-in data;
calculating the distance $d_{ij}$ between any two check-in points, with the expression:
$$d_{ij} = 2r \arcsin\left(\sqrt{\mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)}\right)$$
where $r$ represents the Earth's radius, and hav() is an abbreviation of the haversine function, whose expanded form is:
$$\mathrm{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$
where $\theta$ represents the angle subtended at the center of the sphere by two points on its surface;
forming a distance matrix $D$ from the distances $d_{ij}$ between any two check-in points;
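The distance computation in step (4-1) can be sketched in pure Python as follows. This uses the standard haversine great-circle formula; the function names and the dense list-of-lists matrix are illustrative choices, and the radius of 6371 km is the mean Earth radius given in the detailed description.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def hav(theta):
    """Haversine function: hav(theta) = sin^2(theta / 2), theta in radians."""
    return math.sin(theta / 2.0) ** 2


def haversine_km(p_i, p_j, r=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat_i, lon_i = map(math.radians, p_i)
    lat_j, lon_j = map(math.radians, p_j)
    h = hav(lat_j - lat_i) + math.cos(lat_i) * math.cos(lat_j) * hav(lon_j - lon_i)
    # min() guards against rounding pushing h slightly above 1
    return 2.0 * r * math.asin(math.sqrt(min(1.0, h)))


def distance_matrix(points):
    """Symmetric matrix D with D[i][j] = d_ij; O(n^2) time and memory."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = haversine_km(points[i], points[j])
    return D
```

Because the matrix grows quadratically with the number of check-ins, a practical implementation for a large data set would likely compute neighborhoods on demand or use a spatial index rather than materializing the full matrix.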
(4-2) randomly selecting an initial cluster center, and setting the key parameter bandwidth and the stop threshold stopthresh;
(4-3) updating the cluster center by superimposing an offset vector on the coordinate vector of the current cluster center, with the expression:
$$Center^{(t+1)} = Center^{(t)} + shift^{(t)}$$
where $Center^{(t)}$ represents the current cluster center, i.e. the cluster center after the $t$-th superposition of the offset vector, $Center^{(t+1)}$ represents the cluster center after the $(t+1)$-th superposition of the offset vector, and $shift^{(t)}$ represents the offset vector of the $t$-th superposition;
(4-4) iterating step (4-3) until the offset vector $shift^{(t)}$ of the $t$-th superposition is smaller than the stop threshold stopthresh, so that every sample point finds its most appropriate cluster center, and merging the clusters that meet the requirements to complete the clustering based on the Meanshift algorithm.
Further, the offset vector $shift^{(t)}$ of the $t$-th superposition represents the mean offset of all samples in the current cluster from the current cluster center, and has the basic form:
$$shift^{(t)} = \frac{1}{K} \sum_{x_i \in S^{(t)}} \left(x_i - Center^{(t)}\right)$$
where $K$ denotes the number of samples in the current cluster and $S^{(t)}$ denotes the set of samples in the current cluster; for every $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth, as expressed by:
$$S^{(t)} = \left\{ x_i : \lVert x_i - Center^{(t)} \rVert < bandwidth \right\}$$
where $\lVert x_i - Center^{(t)} \rVert$ represents the distance from sample point $x_i$ to the current cluster center $Center^{(t)}$, and bandwidth is the key parameter.
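A minimal flat-kernel mean shift sketch of the update rule is shown below. Several points are assumptions rather than details from the source: the coordinates are treated as planar and compared with Euclidean distance (the source uses the haversine distance), every sample is used as a starting center instead of a random selection, and converged centers closer than one bandwidth are merged.

```python
import math


def mean_shift(points, bandwidth, stopthresh=1e-6, max_iter=300):
    """Flat-kernel mean shift on 2-D points; returns (centers, labels)."""
    modes = []
    for start in points:  # assumption: seed one trajectory from every sample
        center = start
        for _ in range(max_iter):
            # S(t): samples closer than `bandwidth` to the current center
            window = [p for p in points if math.dist(p, center) < bandwidth]
            mean = tuple(sum(c) / len(window) for c in zip(*window))
            shift = math.dist(mean, center)  # magnitude of shift(t)
            center = mean                    # Center(t+1) = Center(t) + shift(t)
            if shift < stopthresh:
                break
        modes.append(center)

    # Merge converged centers that lie within one bandwidth of each other
    centers, labels = [], []
    for mode in modes:
        for idx, c in enumerate(centers):
            if math.dist(mode, c) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels
```

Moving the center to the window mean is the same as adding the mean offset vector, since the mean of `x_i - Center` equals the mean of `x_i` minus `Center`. Each sweep scans all points for every center, which is where the quadratic cost of the preliminary clustering comes from.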
Further, screening out a specific cluster according to a preset condition and carrying out secondary clustering based on a K-means method comprises the following steps:
(6-1) screening out clusters with the scale larger than a preset threshold value, and determining a parameter K of a K-means algorithm according to the scale;
(6-2) randomly selecting center points of the k clusters, and calculating a distance between each sample and each center point;
(6-3) clustering according to the principle of minimum distance, and classifying each sample into the cluster with the closest distance;
(6-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;
(6-5) repeating the above steps for a preset number of iterations, or until the center points of the clusters no longer change appreciably between two successive iterations, to complete the secondary division.
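The secondary clustering steps can be sketched as follows. The source states only that K is determined from the cluster scale, so the rule `k = ceil(size / target_size)`, the parameter names and the Euclidean distance on planar coordinates are hypothetical choices made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means (Lloyd's algorithm) on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly chosen initial center points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # assign each sample to its nearest center
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:  # centers no longer change appreciably
            break
    return centers, labels


def split_large_clusters(points, labels, size_threshold, target_size):
    """Re-cluster every cluster with more than `size_threshold` points."""
    new_labels = list(labels)
    next_id = max(labels) + 1
    for cid in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cid]
        if len(idx) <= size_threshold:
            continue
        k = math.ceil(len(idx) / target_size)  # hypothetical rule for choosing K
        _, sub = kmeans([points[i] for i in idx], k)
        for i, s in zip(idx, sub):
            if s > 0:  # sub-cluster 0 keeps the original cluster id
                new_labels[i] = next_id + s - 1
        next_id += k - 1
    return new_labels
```

Only the oversized clusters are passed to K-means, so small clusters from the preliminary stage keep their labels unchanged.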
Further, dividing the user geographical position information data into corresponding points of interest comprises:
marking all points according to the clustering result, and assigning them to the corresponding POIs, i.e. the hot-spot areas where users check in most frequently.
In another aspect, the present invention provides a user location feature extraction apparatus based on a mean shift and K-means integrated clustering algorithm in a social network, including:
the data preprocessing module is used for selecting an object area according to pre-collected user sign-in data and acquiring the user sign-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;
the preliminary clustering module is used for carrying out preliminary clustering on the user geographic position information data in the selected range based on a Meanshift method;
the secondary clustering module is used for screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;
and the data dividing module is used for dividing the user geographical position information data into corresponding interest points according to the clustering result to finish the user position feature extraction.
Compared with the prior art, the invention has the beneficial effects that:
1. In terms of computational complexity, assuming that the Meanshift algorithm needs $T$ iterations to converge and that the size of the input data set is $|R|$, the time complexity of Meanshift is $O(T|R|^2)$. The time complexity of K-means is $O(K|l|T)$, where $|l|$ represents the size of one cluster. Assuming that the number of oversized clusters is $m$, the computational complexity of Meanshift + K-means is $O(T|R|^2 + mK|l|T)$. Since $|l| < |R|$, and $m$, $K$ and $T$ are all constants much smaller than $|R|$, the time complexity of Meanshift + K-means reduces to $O(|R|^2)$.
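The reduction to quadratic complexity can be written out step by step. The derivation below only restates the argument above, using the bound $|l| \le |R|$ and treating $m$, $K$ and $T$ as constants:

```latex
\begin{aligned}
T|R|^2 + mK|l|T &\le T|R|^2 + mKT|R| && \text{since } |l| \le |R| \\
                &= T|R|\,\bigl(|R| + mK\bigr) \\
                &= O\bigl(|R|^2\bigr) && \text{for constant } m, K, T \ll |R|
\end{aligned}
```

The secondary K-means stage therefore adds only a term linear in $|R|$ and does not change the overall order of growth.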
2. For urban environments, the distribution of POIs tends to be locally clustered. For example, the POI distribution in the downtown area is dense, the traffic is large, and the number of POIs in the suburban area is small. For dense areas, if the POIs are not subdivided, it may result in the POIs being indistinguishable. The secondary division of Meanshift + k-means can better solve the problem and avoid a large number of check-in points from being concentrated into one cluster.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a distribution of global check-in data in a Flickr dataset;
FIG. 3 shows the clustering result of Flickr check-in data in Manhattan area by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the social network user location feature extraction method based on the Meanshift and K-means integrated clustering algorithm first analyzes and preprocesses user check-in data collected from the Flickr platform, uses the Meanshift method to perform preliminary clustering on the preprocessed check-in data, then screens out the large-scale clusters and the overly dense clusters and performs secondary clustering on them based on the K-means method, and finally assigns the points to the corresponding POIs according to the clustering results, which completes the user location feature extraction.
In a particular embodiment the method comprises the steps of:
step 1: analyzing and preprocessing pre-collected user check-in data; preferably, user check-in data is collected from the Flickr platform;
(1-1) with the help of ArcGIS, the distribution of the check-in records in the data set is described by drawing a scatter diagram, and the Manhattan area of New York, where check-in records are very dense, is selected as the object area of the invention. Let $L$ be the data set of geographic location information of user check-ins, which can be expressed as $L = (p_1, p_2, \ldots, p_m)$, where $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location coordinates of the $i$-th check-in data;
(1-2) data cleaning, namely removing records with missing fields and obviously erroneous records from the data.
Step 2: performing preliminary clustering on check-in data in a city range based on a Meanshift method;
(2-1) for any two check-in points $r_i$ and $r_j$ with coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$ respectively, calculating the distance $d_{ij}$ between them:
$$d_{ij} = 2r \arcsin\left(\sqrt{\mathrm{hav}(lat_j - lat_i) + \cos(lat_i)\cos(lat_j)\,\mathrm{hav}(lon_j - lon_i)}\right)$$
where $r$ represents the Earth's radius, generally taken as 6371 km (the mean Earth radius), and hav() is an abbreviation of the haversine function, whose expanded form is:
$$\mathrm{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$
where $\theta$ represents the angle subtended at the center of the sphere by two points on its surface, given here by the difference in latitude or longitude.
A distance matrix $D$ is formed from the distances $d_{ij}$ between any two check-in points;
(2-2) initially and randomly selecting cluster centers, and setting a critical parameter bandwidth (bandwidth) and a stopping threshold (stopthresh), wherein the quantity of the randomly selected cluster centers does not need to be specified specifically because the Meanshift clustering algorithm can realize the combination of similar clusters;
(2-3) updating the cluster center by superimposing an offset vector on the coordinate vector of the current cluster center, i.e.
$$Center^{(t+1)} = Center^{(t)} + shift^{(t)}$$
where $Center^{(t)}$ represents the current cluster center, i.e. the cluster center after the $t$-th superposition of the offset vector, $Center^{(t+1)}$ represents the cluster center after the $(t+1)$-th superposition of the offset vector, and $shift^{(t)}$ represents the offset vector of the $t$-th superposition, which is the mean offset of all samples in the current cluster from the current cluster center and has the basic form:
$$shift^{(t)} = \frac{1}{K} \sum_{x_i \in S^{(t)}} \left(x_i - Center^{(t)}\right)$$
where $K$ denotes the number of samples in the current cluster and $S^{(t)}$ denotes the set of samples in the current cluster; for every $x_i \in S^{(t)}$, the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth:
$$S^{(t)} = \left\{ x_i : \lVert x_i - Center^{(t)} \rVert < bandwidth \right\}$$
where $\lVert x_i - Center^{(t)} \rVert$ represents the distance from sample point $x_i$ to the current cluster center $Center^{(t)}$.
(2-4) iterating step (2-3) until the offset vector $shift^{(t)}$ is smaller than the stop threshold stopthresh, so that every sample point finds its most suitable cluster center, while merging clusters that are relatively close to each other; this completes the first clustering pass based on the Meanshift algorithm.
Step 3: performing secondary clustering on the large-scale clusters by using K-means;
(3-1) screening out clusters with the scale larger than a certain threshold value, and determining a parameter K of a K-means algorithm according to the scale;
(3-2) randomly selecting center points of the k clusters, and calculating a distance between each sample and each center point;
(3-3) clustering according to the principle of minimum distance, and classifying each sample into a cluster with the shortest distance;
(3-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;
(3-5) repeating the above steps for a preset number of iterations, or until the center points of the clusters no longer change appreciably between two successive iterations, to complete the secondary division.
Step 4: marking all the points according to the clustering result, and assigning them to the corresponding POIs.
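As an illustration of this final labeling step, the sketch below groups check-in points by their final cluster label and summarizes each group as one POI. Representing a POI by the centroid of its points and the number of check-ins is an assumption made for illustration; the source only states that points are marked and assigned to POIs.

```python
def extract_pois(points, labels):
    """Group points by cluster label and summarize each group as a POI.

    Returns {label: {"center": (lat, lon), "count": n}}.
    The centroid/count summary is a hypothetical representation of a POI.
    """
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    pois = {}
    for label, members in groups.items():
        n = len(members)
        lats, lons = zip(*members)
        pois[label] = {"center": (sum(lats) / n, sum(lons) / n), "count": n}
    return pois
```

Sorting the resulting POIs by `count` would then rank the hot-spot areas by check-in frequency.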
Performance Evaluation
Experiments were carried out following the above flow, and the performance of the invention was evaluated on a real LBSN data set, taking check-in data from the Flickr platform for the Manhattan area of New York City as the research object. The data was first analyzed and preprocessed. FIG. 2 shows the distribution of global check-in data in the Flickr data set, drawn with the help of ArcGIS. After hot-spot areas were analyzed by kernel density estimation, regions such as North America and Europe were found to have a higher check-in density, and the Manhattan area of New York City, where check-in records are very dense, was finally selected as the object area of the invention.
The silhouette coefficient is used as the evaluation index for the effectiveness of a clustering algorithm, and the maximum cluster point ratio and the noise ratio are used to measure the adaptability of a clustering algorithm to the Flickr data set.
The silhouette coefficient is calculated as follows:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where $S(i)$ represents the silhouette coefficient of sample $i$, and the mean of $S(i)$ over all samples is the silhouette coefficient of the cluster analysis. Here $a(i)$ represents the average distance from sample $i$ to the other samples in the same cluster (intra-cluster dissimilarity); $b(i) = \min\{b(i,1), b(i,2), \ldots, b(i,k)\}$, where $b(i,j)$ represents the average distance from sample $i$ to all samples in another cluster $j$ (inter-cluster dissimilarity).
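The silhouette coefficient defined above can be computed with a short pure-Python sketch. The Euclidean distance on planar coordinates and the convention of scoring singleton clusters as 0 are assumptions made here, not details from the source.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient S(i) over all samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate samples that sit closer to another cluster than to their own.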
The maximum cluster point ratio $C_{largest}$ and the noise ratio $Ratio_{noise}$ are, respectively, the proportion of the records in the largest cluster after clustering to all points in the data set, and the proportion of the noise points found by the clustering algorithm to all points in the data set:
$$C_{largest} = \frac{l_{largest}}{|R|}, \qquad Ratio_{noise} = \frac{p_{noise}}{|R|}$$
where $l_{largest}$ represents the number of records in the largest cluster after clustering, $|R|$ represents the total number of points in the data set, and $p_{noise}$ represents the number of noise points found by the clustering algorithm.
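Both ratios follow directly from the cluster labels, as the sketch below shows. The use of `-1` as the noise label is an assumption borrowed from common DBSCAN implementations; the source does not say how noise points are marked.

```python
from collections import Counter

NOISE = -1  # hypothetical noise label (assumption, not from the source)


def largest_cluster_ratio(labels):
    """C_largest: size of the largest non-noise cluster over all points."""
    counts = Counter(lab for lab in labels if lab != NOISE)
    return max(counts.values(), default=0) / len(labels)


def noise_ratio(labels):
    """Ratio_noise: number of noise points over all points."""
    return sum(1 for lab in labels if lab == NOISE) / len(labels)
```

For a clustering that never produces noise points, such as Meanshift or K-means in their basic form, `noise_ratio` is simply 0.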
To extract POIs more effectively and avoid most check-in points gathering into a small number of POIs, a clustering method suited to LBSN data sets needs to reduce the maximum cluster point ratio $C_{largest}$ and the noise ratio $Ratio_{noise}$ as much as possible.
POI extraction work is carried out on check-in data on a Flickr platform in a Manhattan area in New York City by using three clustering algorithms of Meanshift, DBSCAN and Meanshift + K-means, clustering effects are compared, and experimental index results shown in table 1 are obtained.
TABLE 1 Experimental index results of various clustering methods based on Flickr data set
Figure BDA0002128074440000082
In terms of the silhouette coefficient, the Meanshift + K-means clustering method scores highest and therefore performs best; in terms of the maximum cluster point ratio and the noise ratio, the Meanshift + K-means values are both the smallest, which is again the best result. Taking all indexes together, the Meanshift + K-means clustering method is more effective than the other clustering algorithms at extracting POIs.
FIG. 3 shows the clustering result for the Flickr check-in data in the Manhattan area. Compared with the results of the other clustering methods, the Meanshift + K-means method has an obvious advantage in the maximum cluster point ratio because of its secondary division of large-scale clusters, and is a clustering method better suited to extracting user location features in social networks.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the particular illustrative embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications, equivalent arrangements, and equivalents thereof, which may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (6)

1. A social network user position feature extraction method based on Meanshift and K-means integrated clustering algorithm is characterized by comprising the following steps:
selecting an object area according to pre-collected user check-in data, and acquiring the user check-in data of the object area; extracting user geographical position information data from the data, and performing data preprocessing on the data;
performing preliminary clustering on the user geographical location information data in the selected range based on a Meanshift method;
screening out specific clusters according to preset conditions and carrying out secondary clustering based on a K-means method;
dividing the user geographical position information data to corresponding interest points according to the clustering result to complete the user position feature extraction, comprising: marking all points according to the clustering result, and dividing the points into the POI (point of interest) of the hot spot area with higher sign-in frequency of the corresponding user;
screening out a specific cluster according to a preset condition and carrying out secondary clustering based on a K-means method, wherein the method comprises the following steps:
step (1-1) screening out clusters with the scale larger than a preset threshold value and determining a parameter K of a K-means algorithm according to the scale of the clusters;
step (1-2) randomly selecting central points of k clusters, and calculating the distance between each sample and each central point;
step (1-3) clustering according to the principle of minimum distance, and classifying each sample into the cluster with the closest distance;
step (1-4) based on the current clustering result, recalculating the mean value of the sample coordinates in the cluster, and determining a new central point;
and (5) repeating and iterating the steps for a plurality of times, or stopping iteration until the central point of each group is not greatly changed between two iterations, and finishing secondary division.
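Steps (1-1) to (1-5) can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the claim says that K is determined from the cluster scale but does not give the rule, so `choose_k` and its `points_per_subcluster` parameter are hypothetical, as are the names `kmeans` and `secondary_clustering`. Euclidean distance on coordinate tuples is also an assumption.

```python
import math
import random


def choose_k(cluster_size, points_per_subcluster=50):
    """Hypothetical rule for step (1-1): one sub-cluster per N samples, at least 2."""
    return max(2, math.ceil(cluster_size / points_per_subcluster))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means on coordinate tuples (Euclidean distance is an assumption)."""
    k = min(k, len(points))
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step (1-2): random initial center points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # step (1-3): assign each sample to the closest center point
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # step (1-4): recompute each center as the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:  # step (1-5): centers no longer change significantly
            break
    return centers, labels


def secondary_clustering(clusters, size_threshold, points_per_subcluster=50):
    """Step (1-1): split only the clusters whose scale exceeds the threshold."""
    result = []
    for pts in clusters:
        if len(pts) <= size_threshold:
            result.append(pts)
            continue
        k = choose_k(len(pts), points_per_subcluster)
        _, labels = kmeans(pts, k)
        for c in range(k):
            sub = [p for p, lab in zip(pts, labels) if lab == c]
            if sub:
                result.append(sub)
    return result
```

Clusters at or below `size_threshold` pass through unchanged, so only the oversized clusters produced by the preliminary clustering are subdivided.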
2. The social network user location feature extraction method based on Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the method for selecting the object area according to the pre-collected user check-in data is as follows:
with the help of ArcGIS, the distribution of the check-in records in the pre-collected user check-in data is depicted by drawing a scatter plot, and the Manhattan area of New York, where check-in records are very dense, is selected as the object area of the invention.
3. The social network user location feature extraction method based on Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the data preprocessing comprises: cleaning the data by removing records with missing fields and erroneous records that do not meet the requirements.
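A minimal sketch of such a cleaning step is shown below. The record layout (dictionaries with `user_id`, `lat`, and `lon` keys) and the validity rule (numeric coordinates within the usual latitude and longitude ranges) are assumptions made for illustration; the claim does not specify which fields are required or what counts as erroneous data.

```python
def clean_checkins(records, required=("user_id", "lat", "lon")):
    """Drop records with missing fields or invalid coordinates.

    The field names and the validity rule are illustrative assumptions.
    """
    cleaned = []
    for rec in records:
        # Remove records in which a required field is absent or empty.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        # Remove records whose coordinates cannot be parsed as numbers.
        try:
            lat, lon = float(rec["lat"]), float(rec["lon"])
        except (TypeError, ValueError):
            continue
        # Remove records whose coordinates fall outside the valid ranges.
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue
        cleaned.append({**rec, "lat": lat, "lon": lon})
    return cleaned
```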
4. The method for extracting the location features of the users in the social network based on the Meanshift and K-means integrated clustering algorithm as claimed in claim 1, wherein the preliminary clustering of the check-in data in the selected range based on the Meanshift method comprises the following steps:
step (4-1) denoting any two check-in points as $r_i$ and $r_j$, with respective coordinates $p_i = (lat_i, lon_i)$ and $p_j = (lat_j, lon_j)$, wherein $p_i = (lat_i, lon_i)$ represents the latitude and longitude of the geographic location coordinates of the i-th check-in record, and $p_j = (lat_j, lon_j)$ represents the latitude and longitude of the geographic location coordinates of the j-th check-in record;
calculating the distance $d_{ij}$ between any two check-in points, expressed as:
Figure FDA0003811385810000021
where r denotes the Earth's radius, and hav() denotes the haversine function, whose expanded form is:
Figure FDA0003811385810000022
where θ denotes the central angle formed by connecting the two points on the sphere to the center of the sphere;
forming a distance matrix D from the distances $d_{ij}$ between all pairs of check-in points;
step (4-2) initially selecting cluster centers at random, and setting a bandwidth and a stop threshold stopthresh as key parameters;
step (4-3) updating the cluster center and the cluster structure by adding an offset vector to the current cluster center coordinate vector, expressed as:
$Center^{(t+1)} = Center^{(t)} + shift^{(t)}$
where $Center^{(t)}$ denotes the current cluster center, that is, the cluster center after the t-th offset vector has been added, $Center^{(t+1)}$ denotes the cluster center after the (t+1)-th offset vector has been added, and $shift^{(t)}$ denotes the offset vector of the t-th addition;
step (4-4) taking an offset vector smaller than the stop threshold stopthresh as the target, that is, requiring the magnitude of the offset vector $shift^{(t)}$ of the t-th addition to be smaller than stopthresh, iterating step (4-3) until every sample point has found its most suitable cluster center, and merging the clusters that meet the requirements to complete the clustering based on the Meanshift algorithm.
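The distance computation of step (4-1) can be illustrated with the sketch below. The figures containing the patent's formulas are not reproduced in this text, so the code uses the standard haversine formula, $hav(\theta) = \sin^2(\theta/2)$, and assumes it corresponds to the one claimed. The mean Earth radius of 6371 km and the function names are likewise assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not specified in the claim


def hav(theta):
    """Haversine function: hav(theta) = sin^2(theta / 2), theta in radians."""
    return math.sin(theta / 2.0) ** 2


def haversine_distance(p_i, p_j, r=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat_i, lon_i = map(math.radians, p_i)
    lat_j, lon_j = map(math.radians, p_j)
    h = hav(lat_j - lat_i) + math.cos(lat_i) * math.cos(lat_j) * hav(lon_j - lon_i)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2.0 * r * math.asin(math.sqrt(min(1.0, h)))


def distance_matrix(points):
    """Symmetric matrix D with D[i][j] = d_ij for every pair of check-in points."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = haversine_distance(points[i], points[j])
    return D
```

For two points on the equator that are 90 degrees of longitude apart, the function returns a quarter of the Earth's circumference, which is a convenient sanity check.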
5. The method for extracting features of social network user positions based on Meanshift and K-means integrated clustering algorithm as claimed in claim 4, wherein the offset vector $shift^{(t)}$ of the t-th addition represents the mean of the distances from all samples in the current cluster to the current cluster center, and has the following basic form:
Figure FDA0003811385810000031
where K denotes the number of samples in the current cluster, and $S^{(t)}$ denotes the set of samples in the current cluster, such that for every $x_i \in S^{(t)}$ the distance from the sample point to the current cluster center is smaller than the key parameter bandwidth, as expressed below:
Figure FDA0003811385810000032
where
Figure FDA0003811385810000033
denotes the distance from sample point $x_i$ to the current cluster center $Center^{(t)}$, and bandwidth denotes the key parameter bandwidth.
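The iteration of claims 4 and 5 can be sketched as a flat-kernel mean shift. Several choices here are assumptions rather than the claimed method: the shift is taken as the mean of the coordinate differences between the in-window samples and the center, distances are Euclidean on coordinate tuples instead of haversine, every sample is used as a starting center (instead of a random selection) to keep the sketch deterministic, and converged centers closer than one bandwidth are merged.

```python
import math


def shift_vector(center, points, bandwidth):
    """Mean offset from the center to the samples lying within the bandwidth."""
    window = [p for p in points if math.dist(p, center) < bandwidth]
    if not window:
        return tuple(0.0 for _ in center)
    k = len(window)  # number of samples in the current cluster
    return tuple(sum(p[d] - center[d] for p in window) / k
                 for d in range(len(center)))


def mean_shift(points, bandwidth, stopthresh=1e-4, max_iter=300):
    """Flat-kernel mean shift; returns the merged centers and a label per point."""
    modes = []
    for start in points:  # assumption: seed from every sample, not at random
        center = tuple(float(c) for c in start)
        for _ in range(max_iter):
            shift = shift_vector(center, points, bandwidth)
            # Center(t+1) = Center(t) + shift(t)
            center = tuple(c + s for c, s in zip(center, shift))
            if math.hypot(*shift) < stopthresh:  # stop once the shift is small
                break
        # Hypothetical merge rule: centers within one bandwidth are the same cluster.
        if all(math.dist(m, center) >= bandwidth for m in modes):
            modes.append(center)
    labels = [min(range(len(modes)), key=lambda j: math.dist(p, modes[j]))
              for p in points]
    return modes, labels
```

With a bandwidth of 3, two compact groups of points about 14 units apart converge to two separate centers, one per group.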
6. A device for extracting user location features in a social network based on a Meanshift and K-means integrated clustering algorithm, characterized by comprising:
a data preprocessing module, used for selecting an object area according to pre-collected user check-in data and acquiring the user check-in data of the object area, and for extracting user geographical location information data from the data and performing data preprocessing on the data;
a preliminary clustering module, used for performing preliminary clustering on the user geographical location information data in the selected range based on the Meanshift method;
a secondary clustering module, used for screening out specific clusters according to preset conditions and carrying out secondary clustering based on the K-means method;
a data dividing module, used for assigning the user geographical location information data to corresponding points of interest according to the clustering result to complete the user location feature extraction, comprising: labeling all points according to the clustering result, and assigning the points to the points of interest (POIs) of the hot-spot areas where the corresponding user checks in more frequently;
wherein screening out specific clusters according to preset conditions and carrying out secondary clustering based on the K-means method comprises the following steps:
step (6-1) screening out clusters whose scale is larger than a preset threshold, and determining the parameter K of the K-means algorithm according to the scale of each such cluster;
step (6-2) randomly selecting the center points of the K clusters, and calculating the distance between each sample and each center point;
step (6-3) clustering according to the minimum-distance principle, assigning each sample to the cluster whose center point is closest;
step (6-4) based on the current clustering result, recalculating the mean of the sample coordinates in each cluster to determine a new center point;
step (6-5) repeating the above steps for a set number of iterations, or stopping the iteration once the center point of each cluster no longer changes significantly between two successive iterations, thereby completing the secondary division.
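The labeling performed by the data dividing module can be illustrated as follows. This is a sketch under assumptions: each check-in is taken to carry a user identifier and the cluster label produced by the two clustering stages, and a user's hot-spot areas are taken to be the clusters in which that user checks in most often. The function name `user_hotspots` and the `top_n` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def user_hotspots(user_ids, labels, top_n=1):
    """Return, per user, the cluster labels with the most check-ins.

    user_ids[i] and labels[i] describe the i-th check-in: who made it and
    which cluster (candidate POI area) it was assigned to.
    """
    counts = defaultdict(Counter)
    for uid, label in zip(user_ids, labels):
        counts[uid][label] += 1  # check-in frequency per user and cluster
    return {uid: [label for label, _ in freq.most_common(top_n)]
            for uid, freq in counts.items()}
```

Raising `top_n` returns several hot-spot areas per user, ordered from the most to the least frequently visited.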
CN201910628876.3A 2019-07-12 2019-07-12 Social network user position feature extraction method and device based on Meanshift and K-means clustering Active CN112287247B (en)

Publications (2)

Publication Number Publication Date
CN112287247A (en) 2021-01-29
CN112287247B (en) 2022-11-11

Family

ID=74418576

Country Status (1)

Country Link
CN (1) CN112287247B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant