CN113392652A

CN113392652A - Sign-in hotspot functional feature identification method based on semantic clustering

Info

Publication number: CN113392652A
Application number: CN202110343078.3A
Authority: CN
Inventors: 杨剑; 王鹏启; 贾奋励; 王光霞
Original assignee: Unit 32023 Of Chinese Pla; Information Engineering University of PLA Strategic Support Force
Current assignee: Unit 32023 Of Chinese Pla; Information Engineering University of PLA Strategic Support Force
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-09-14
Anticipated expiration: 2041-03-30
Also published as: CN113392652B

Abstract

The invention relates to a sign-in hotspot functional feature identification method based on semantic clustering, belonging to the technical field of data processing. The method comprises: obtaining check-in data of users on a social networking site over a period of time, and determining several hotspot areas from the check-in data; and classifying the POIs of the check-in data in each hotspot area using a POI clustering algorithm based on semantic similarity. In that algorithm, a Word2Vec similarity calculation function is used to calculate the semantic similarity of the sample points and output a similarity matrix W; the Laplacian matrix is then calculated, followed by the eigenvectors corresponding to its first k eigenvalues; these eigenvectors form a matrix U, each row of which becomes a newly generated sample point, and the newly generated sample points are clustered.

Description

Sign-in hotspot functional feature identification method based on semantic clustering

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a sign-in hotspot functional feature identification method based on semantic clustering.

Background

In recent years, social media has developed rapidly along with the internet. Social media refers to information content created with highly accessible and scalable publishing technologies, and it has changed the way people read, share and comment on news and information.

In China, the Sina microblog (Sina Weibo) is a social media platform with nationwide reach that has developed rapidly and become highly influential in recent years. Microblog check-in data records what a user sees, hears, feels and says, together with the user's state, at a specific time and place. By posting a check-in containing text, pictures, videos and the like, users record the events happening around them, and each check-in carries rich attribute information such as location and time. The location may be any of a variety of points of interest (POIs), such as coffee shops, shopping malls, movie theaters and train stations. By analyzing and mining microblog check-in data, the characteristics of a user group, such as age, sex, spatial distribution and interests, can be learned, and personalized services can be provided to users on that basis.

With the rapid development of internet and positioning technology, the popularization of mobile phones, tablet computers, smart watches and similar products has provided favorable conditions for obtaining massive check-in data. People have become accustomed to checking in, commenting and sharing through the location services of various apps, so large volumes of check-in data can reflect the range and trajectories of people's daily lives. Check-in data can be analyzed with the spatial analysis methods of geographic information systems to obtain hotspot areas of urban crowd activity, providing suggestions and support for the rational allocation of urban public resources. For example, with the rapid growth of shared bicycles, the number and locations of bicycles to deploy can be determined through urban crowd activity hotspot detection; and by analyzing scenic-spot check-in data, tourists can plan their visits sensibly and avoid peak periods.

In recent years, many scholars have used location check-in data from social media as a data source for research on urban hotspots and crowd activity. Social media commonly used abroad to extract POI data include Foursquare, Twitter and Facebook. For example, Comito et al. mined user travel routes from geotagged tweets to analyze travel hotspots and human behavioral activity; Li et al., taking tweets and photographs from Twitter and Flickr in California as examples, studied the spatio-temporal patterns of geographic data in neighborhoods of the United States and discussed the characteristics of urban hot regions and local residents.

The Sina microblog, as a mainstream domestic social network platform, has also attracted wide attention from domestic scholars. For example, Wang Bo et al. analyzed the check-in behavior of Nanjing residents from the two perspectives of time and space based on check-in data, and divided the city into functional areas; Zhang Ziang et al. explored how tourist activity evolves in the Zhongshan scenic area of Nanjing along the two dimensions of time and space based on microblog check-in data; Chen Hongfei et al. studied the temporal and spatial evolution of night-time user check-in behavior in Xi'an; Teng et al., taking microblog check-in data as an example, analyzed spatial patterns and detected the approximate locations of urban hotspot areas.

Thus, in the prior art, research based on social media check-in data mainly analyzes urban hotspot areas and user check-in behavior from the spatial or temporal dimension, and the methods are often limited to conventional classical statistics and geostatistics. As a result, the semantic characteristics of the data are not mined deeply or comprehensively, and the behavioral activity inferred from the analysis is inaccurate.

In addition, POI data is currently classified mainly by hand: mapping relationships between the categories of different classification systems are established manually, so that the systems can be converted and compared. Manually building such mappings requires a great deal of manpower, resources and time, and cannot be applied on a large scale.

With the development and maturity of related technologies such as Chinese word segmentation, semantic computation and text clustering, content-based POI text classification methods have appeared. For example, Chamomile et al. analyzed the semantic inconsistency of standard geographic information classification systems at the semantic level and proposed a semantics-based geographic information classification system; Roman et al. proposed a semantic classification method for Chinese POI names based on role labeling; and Wang Yong et al. realized the mapping and conversion of multi-source heterogeneous POI classification systems using techniques such as classification feature word extraction, de-duplication and optimization. However, these classification methods are computationally complex and not widely applicable, and they may ignore useful features of words and produce semantic ambiguity, which in turn reduces POI classification accuracy.

Disclosure of Invention

The invention aims to provide a sign-in hotspot functional feature identification method based on semantic clustering, in order to solve the problems that existing POI data classification algorithms are complex and their accuracy is low.

Based on the purpose, the technical scheme of the sign-in hotspot function feature identification method based on semantic clustering is as follows:

1) obtaining check-in data of users on a social networking site over a period of time, and determining several hotspot areas from the check-in data;

2) classifying the POIs of the check-in data in each hotspot area using a POI clustering algorithm based on semantic similarity, the algorithm comprising the following sub-steps:

acquiring the sample points of the check-in data, calling a Word2Vec similarity calculation function to calculate the semantic similarity of the sample points, and outputting a similarity matrix $W$; calculating the sum $d_i$ of the elements of each row of $W$, the values $d_i$ forming an $n \times n$ diagonal matrix, namely the degree matrix $D$;

calculating the Laplacian matrix $L = D - W$; calculating the eigenvalues of $L$, sorting them from small to large, taking the first $k$ eigenvalues, and calculating their eigenvectors $u_1, u_2, \dots, u_k$; forming the matrix $U = \{u_1, u_2, \dots, u_k\}$, $U \in \mathbb{R}^{n \times k}$, from these $k$ eigenvectors;

letting $y_i \in \mathbb{R}^k$ be the vector of the $i$-th row of $U$, where $i = 1, 2, \dots, n$; clustering the new sample points $Y = \{y_1, y_2, \dots, y_n\}$ into clusters $\{C_1, C_2, \dots, C_k\}$; and outputting the clusters $A_1, A_2, \dots, A_k$, where $A_i = \{\, j \mid y_j \in C_i \,\}$.

The beneficial effects of the above technical scheme are:

The sign-in hotspot functional feature identification method of the invention classifies POIs with a POI clustering algorithm based on semantic similarity: a Word2Vec similarity calculation function calculates the semantic similarity of the sample points and outputs a similarity matrix W; the Laplacian matrix is then calculated, followed by the eigenvectors corresponding to its first k eigenvalues; these eigenvectors form a matrix U, each row of which becomes a newly generated sample point, and the newly generated sample points are clustered.

Further, in order to meet the requirement of a user for analyzing check-in data of different types of activities, determining a plurality of hotspot areas according to the check-in data comprises:

dividing the behavior activities reflected by the check-in data into three types, namely high-frequency revisiting activities, low-frequency revisiting activities and photographing activities;

using kernel density estimation to detect and select hotspots in the check-in data for each type of activity:

using a kernel density analysis tool with the total number of check-ins as the input field to obtain a result map of check-in activity hotspot areas;

for high-frequency revisit activity: using a kernel density analysis tool with total check-ins / number of check-in users as the input field to obtain a result map of high-frequency revisit activity areas, overlaying the POIs screened from that map on the result map of check-in activity hotspot areas, and selecting the overlapping, matching areas as the final hotspot areas for this category;

for low-frequency revisit activity: using a kernel density analysis tool with number of check-in users / number of check-ins as the input field to obtain a result map of low-frequency revisit activity areas, overlaying the POIs screened from that map on the result map of check-in activity hotspot areas, and selecting the overlapping, matching areas as the final hotspot areas for this category;

for photo-taking activity: using the number of photos as the input field of the kernel density analysis tool to obtain a kernel density result map of users' photo check-ins, and selecting the final hotspot areas for this category.

As other embodiments, other methods may also be employed to determine the hot spot region. That is, determining a number of hotspot regions according to the check-in data includes:

using a kernel density analysis tool with the total number of check-ins in the check-in data as the input field to obtain a result map of check-in activity hotspot areas, and selecting several hotspot areas from the map.

Further, in order to facilitate the user to clearly identify the check-in data classification condition, the method further comprises the following steps:

after the check-in data of users on the social networking site over the period of time has been classified into N categories, generating and displaying a radar chart for each hotspot area from the number and proportion of POIs in each category.
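The per-category counts and proportions that feed such a radar chart can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `category_profile` and the category labels are hypothetical, and the plotting step itself is left out.

```python
from collections import Counter

def category_profile(poi_categories):
    """Return {category: (count, ratio)} for the POIs of one hotspot area."""
    counts = Counter(poi_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: (n, n / total) for cat, n in counts.items()}

# Hypothetical category labels of the POIs in one hotspot area.
hotspot_pois = ["catering", "catering", "shopping", "transport", "catering", "shopping"]
profile = category_profile(hotspot_pois)
for cat, (n, ratio) in sorted(profile.items()):
    print(f"{cat}: {n} POIs ({ratio:.0%})")
```

Each (count, ratio) pair would become one axis value of the radar chart for that hotspot area.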

Further, in order to improve the classification effect, in step 2) a k-means clustering method is used to cluster the new sample points $Y = \{y_1, y_2, \dots, y_n\}$ into clusters $\{C_1, C_2, \dots, C_k\}$.

Further, in the step 1), before the hot spot area is determined, a step of performing data cleaning on the check-in data is further included, so that wrong check-in data are cleaned, the accuracy of the data is ensured, and the classification precision of subsequent processing is improved.

Drawings

FIG. 1 is a flowchart of a sign-in hotspot function feature identification method in an embodiment of the present invention;

FIG. 2 is an example of kernel density analysis in an embodiment of the present invention;

FIG. 3 is a line diagram of field values in an embodiment of the present invention;

FIG. 4 is a diagram of the Nanluoguxiang microblog check-in point interface in an embodiment of the present invention;

FIG. 5 is a diagram illustrating the results of a check-in activity hotspot zone in an embodiment of the present invention;

FIG. 6-1 is a result diagram of a high frequency revisit activity area in an embodiment of the invention;

FIG. 6-2 is an overlay of POI and check-in hotspot area results of high frequency revisitation activity screening in an embodiment of the present invention;

FIG. 6-3 is a result diagram of a low-frequency revisit activity area in an embodiment of the present invention;

FIG. 6-4 is an overlay of the POIs screened by low-frequency revisit activity and the check-in activity hotspot area results in an embodiment of the present invention;

FIG. 6-5 is a result diagram of the kernel density analysis of users' photo check-ins in an embodiment of the present invention;

FIG. 7-1 is a radar chart of the number of POIs of each category in the Beijing South Railway Station hotspot area;

FIG. 7-2 is a pie chart of the number of check-ins for each POI category in the Beijing South Railway Station hotspot area;

FIG. 7-3 is a radar chart of the number of POIs of each category in the Beijing Jiaotong University hotspot area;

FIG. 7-4 is a pie chart of the number of check-ins for each POI category in the Beijing Jiaotong University hotspot area;

FIG. 7-5 is a radar chart of the number of POIs of each category in the Wangfujing hotspot area;

FIG. 7-6 is a pie chart of the number of check-ins for each POI category in the Wangfujing hotspot area;

FIG. 7-7 is a radar chart of the number of POIs of each category in the Xidan Joy City hotspot area;

FIG. 7-8 is a pie chart of the number of check-ins for each POI category in the Xidan Joy City hotspot area;

FIG. 7-9 is a radar chart of the number of POIs of each category in the A square hotspot area;

FIG. 7-10 is a pie chart of the number of check-ins for each POI category in the A square hotspot area;

FIG. 7-11 is a radar chart of the number of POIs of each category in the Nanluoguxiang hotspot area;

FIG. 7-12 is a pie chart of the number of check-ins for each POI category in the Nanluoguxiang hotspot area;

FIG. 8-1 is a radar map of the number of POIs for each of six hotspot regions;

FIG. 8-2 is a radar chart of various POI ratios of six hotspot areas;

FIG. 8-3 is a histogram of the proportion of each POI category in the six hotspot areas;

FIG. 8-4 is a histogram of the number of check-ins for each POI category in the six hotspot areas.

Detailed Description

The following further describes embodiments of the present invention with reference to the drawings.

This embodiment provides a sign-in hotspot functional feature identification method based on semantic clustering, whose overall approach is shown in FIG. 1. Check-in data from an app such as the microblog is first obtained for a certain period of time and preprocessed (for example, by data cleaning). Check-in activity hotspot detection is then carried out, and several hotspot areas are determined from the number of check-in users, the number of check-ins and the number of photos in the check-in data. Finally, the POIs of the check-in data in each hotspot area are classified using a POI clustering algorithm based on semantic similarity.

For the determination of the hotspot region, in the detection process of the sign-in hotspot activity, determining the hotspot region of three activity types, wherein the three activity types are respectively high-frequency revisiting activity, low-frequency revisiting activity and photographing activity. The specific determination steps are as follows:

1) microblog check-in data (also called check-in POI data) is obtained.

A POI mainly refers to all geographic physical objects that can be abstracted as points, such as restaurants, schools, stations, companies, etc. Generally, a POI has attribute information such as a name, longitude and latitude, category, detailed address, and the like, and the richer the POI data category is, the larger the amount of information contained.

In addition to the basic attributes above, microblog check-in data records the total number of check-ins, the number of check-in users and the number of photos for each POI, which makes the POI attribute information richer. Such POI data has a simple structure, a large volume and strong situational context, and is therefore of great value for studying hotspots of urban crowd activity. For example, Table 1 lists some of the attributes of the microblog check-in data.

Table 1 microblog attendance data attribute table (part)

2) Data preprocessing.

The check-in data contains a large amount of erroneous data, part of which is shown in Table 2. It mainly comprises non-existent points (e.g., records 1, 2 and 3), points with repeated locations (e.g., records 4 and 5) and points with wrong locations (e.g., records 6, 7 and 8).

TABLE 2 partial error data

To solve the above problem, the 51185 records were cleaned: 20442 erroneous records were removed, leaving 30743 cleaned records.
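A cleaning pass over the three error types named above might look like the following sketch. The source does not state the actual cleaning rules, so every rule here is an assumption: a record counts as non-existent when its name or coordinates are missing, as a location error when it falls outside a hypothetical study-area bounding box, and as a repeated location when its rounded coordinates were already seen. The field names (`name`, `lon`, `lat`) are also hypothetical.

```python
def clean_checkins(records, bbox):
    """Drop records that are empty, outside the study area, or at a repeated location."""
    min_lon, min_lat, max_lon, max_lat = bbox
    seen = set()
    kept = []
    for rec in records:
        lon, lat = rec.get("lon"), rec.get("lat")
        if not rec.get("name") or lon is None or lat is None:
            continue  # assumed rule for a non-existent point
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue  # assumed rule for a location error
        key = (round(lon, 6), round(lat, 6))
        if key in seen:
            continue  # assumed rule for a repeated location
        seen.add(key)
        kept.append(rec)
    return kept

# Illustrative records only; coordinates and names are made up.
records = [
    {"name": "station", "lon": 116.38, "lat": 39.87},
    {"name": "", "lon": 116.40, "lat": 39.90},
    {"name": "station copy", "lon": 116.38, "lat": 39.87},
    {"name": "far away", "lon": 121.47, "lat": 31.23},
]
study_area = (116.0, 39.6, 116.8, 40.2)  # hypothetical (min_lon, min_lat, max_lon, max_lat)
cleaned = clean_checkins(records, study_area)
print(len(records) - len(cleaned), "records removed,", len(cleaned), "kept")
```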

3) Dividing the behavioral activities reflected by the check-in data into three types: high-frequency revisit activity, low-frequency revisit activity and photo-taking activity.

The microblog check-in data contains three attributes: the total number of check-ins, the total number of check-in users and the total number of photos. These three attributes reflect different types of activity. At strong landmarks such as stations, many users visit, so both the number of check-ins and the number of check-in users are large; at places of everyday activity such as residential areas, few users visit, but each user checks in many times; and at places such as tourist attractions, check-ins tend to be accompanied by more photo-taking. Mining the patterns of crowd activity from these three attributes is therefore of considerable value.

In this step, therefore, check-in activities are divided into three types: high-frequency revisit activity, low-frequency revisit activity and photo-taking activity. Here "frequency" means the average rate at which a single user revisits and checks in at a single POI.

For high frequency revisit activities:

High-frequency revisit activity refers to a microblog user checking in at the same POI multiple times within the data acquisition period. The check-in data is processed by keeping only POIs with more than 2 check-in users, which guards against malicious check-ins by individual users; a user/num field is then added, holding the ratio of the number of check-in users (user) to the total number of check-ins (num). With the threshold set to one standard deviation, the resulting line graph of field values is shown in FIG. 3. For points with the same number of check-ins, the fewer the check-in users, the smaller the field value and the higher the users' revisit frequency at that POI. Points whose field value falls below the lower standard-deviation line, i.e. below 0.3632 (the boxed part in the lower left corner of FIG. 3), are defined as high-frequency revisit points.

A total of 276 points have field values below that line; the field values are sorted, and part of the high-frequency revisit data screened out is shown in Table 3.

TABLE 3 field value sorting partial data

For low frequency revisit activities:

Low-frequency revisit activity can be understood as a microblog user visiting and checking in at the same POI only rarely within the data acquisition period. For points with the same number of check-in users, the larger the field value, the lower the users' revisit frequency at that POI. Points whose field value lies above the upper standard-deviation line, i.e. above 0.8168 (the boxed part in the upper right corner of FIG. 3), are defined as low-frequency revisit points.

A total of 226 points have field values above that line; the field values are sorted, and part of the low-frequency revisit data screened out is shown in Table 4.

TABLE 4 field value sorting partial data
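The screening of high- and low-frequency revisit points by the user/num field can be sketched as below. The source gives the cut-off values (0.3632 and 0.8168) but not how they were derived; this sketch assumes they are the mean of the field minus and plus one standard deviation, which is an interpretation rather than a confirmed detail. The function name, field names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def classify_revisit(pois):
    """Split POIs into high- and low-frequency revisit points by the user/num ratio."""
    # Keep only POIs with more than 2 check-in users, as the text describes.
    eligible = [p for p in pois if p["users"] > 2 and p["num"] > 0]
    ratios = {p["name"]: p["users"] / p["num"] for p in eligible}
    values = list(ratios.values())
    mu, sigma = mean(values), pstdev(values)
    lower, upper = mu - sigma, mu + sigma  # assumed one-standard-deviation cut-offs
    high = sorted(name for name, r in ratios.items() if r < lower)
    low = sorted(name for name, r in ratios.items() if r > upper)
    return {"high": high, "low": low, "lower": lower, "upper": upper}

# Made-up POIs: 'users' is the number of check-in users, 'num' the total check-ins.
pois = [
    {"name": "POI A", "users": 3, "num": 30},
    {"name": "POI B", "users": 10, "num": 20},
    {"name": "POI C", "users": 12, "num": 20},
    {"name": "POI D", "users": 11, "num": 20},
    {"name": "POI E", "users": 20, "num": 20},
    {"name": "POI F", "users": 2, "num": 40},  # dropped: not more than 2 users
]
result = classify_revisit(pois)
print("high-frequency revisit points:", result["high"])
print("low-frequency revisit points:", result["low"])
```

A small ratio means few users account for many check-ins (frequent revisits); a ratio near 1 means most users checked in only once.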

For the photographing activity:

Microblog users often take photos when checking in. FIG. 4 shows the microblog profile page of the Nanluoguxiang pedestrian street in Dongcheng District, Beijing, which includes the POI name, a profile, microblogs posted at the place, trending news, a heat map, people who have been there, comments and so on.

TABLE 5 screening of hot spots for photographing

In summary, the check-in activities of the user can be roughly classified into three types, namely, high-frequency revisiting activities, low-frequency revisiting activities and photographing activities. From the screening results of different check-in activity types, the POI compositions of the three types of activity types are different.

4) Using kernel density estimation to perform hotspot detection on the check-in data of each type (high-frequency revisit activity, low-frequency revisit activity and photo-taking activity).

Kernel Density Estimation (KDE) is a non-parametric method for estimating probability density functions, a basic means of data smoothing, also known as the "Parzen-Rosenblatt window". The kernel density estimate is defined as follows:

Let $x_1, x_2, \dots, x_n$ be $n$ independent, identically distributed sample points whose density function is $f$; the kernel density estimate of $f$ is then:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $K$ is a non-negative kernel function. Many kernel functions are in common use, such as uniform, triangular, biweight, triweight, Epanechnikov and normal. $h > 0$ is a smoothing parameter, also called the "bandwidth", and the kernel with subscript $h$ is called the "scaled kernel", expressed as:

$$K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$$

In this step, the kernel density analysis tool in ArcGIS can be used. It calculates the density of point or polyline features per unit area, using a kernel function to fit a smoothly tapered surface to each point or polyline. It is a visualization algorithm for density analysis. The algorithm for its default search radius (bandwidth) is as follows:

Step 1: Calculate the mean center of the input points. If a Population field other than "none" is selected, this calculation and all the following calculations are weighted by the values in that field.

Step 2: Calculate the distance from the (weighted) mean center for all points.

Step 3: Calculate the (weighted) median of these distances, $D_m$.

Step 4: Calculate the (weighted) standard distance, $SD$.

Step 5: Calculate the bandwidth using the following formula:

$$\text{SearchRadius} = 0.9 \cdot \min\!\left(SD,\ \sqrt{\frac{1}{\ln 2}}\, D_m\right) \cdot n^{-0.2}$$

where $SD$ is the standard distance and $D_m$ is the median distance; if a Population field is used, $n$ is the sum of the Population field values of all points, otherwise $n$ is the number of points.
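The five steps can be sketched in plain Python for the unweighted case (no Population field). The sketch assumes the default search-radius formula described in the ArcGIS documentation, 0.9 · min(SD, sqrt(1/ln 2) · D_m) · n^(-0.2); the function name `default_search_radius` is hypothetical and the code is not the ArcGIS implementation.

```python
import math
from statistics import median

def default_search_radius(points):
    """Unweighted default bandwidth for a list of (x, y) points."""
    n = len(points)
    # Step 1: mean center of the input points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Step 2: distance of every point from the mean center.
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    # Step 3: median of these distances, D_m.
    d_m = median(dists)
    # Step 4: standard distance, SD (root mean squared distance from the center).
    sd = math.sqrt(sum(d * d for d in dists) / n)
    # Step 5: bandwidth from the assumed default formula.
    return 0.9 * min(sd, math.sqrt(1 / math.log(2)) * d_m) * n ** -0.2

# Four made-up points on the corners of a 2 x 2 square (projected coordinates).
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
radius = default_search_radius(square)
print(f"search radius: {radius:.4f}")
```

With a Population field, each point's contribution to the center, the median and the standard distance would be weighted by its field value, and n would be the sum of those values.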

In kernel density analysis, the search radius is set first, which determines the search area. Points falling within the search area are given different weights: points close to the search center weigh more, and the weight decreases smoothly with distance from the center. In the ArcGIS software tool, the kernel density analysis in effect applies a "third-dimension attribute" to the plotted points, which influences the final visualization. An example of kernel density analysis is shown in FIG. 2, where OutRas denotes the kernel density output and InPts denotes the input data.
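The distance-based weighting can be illustrated with a small density function. The sketch assumes the quartic kernel that the ArcGIS documentation describes for point features, where each point within the search radius contributes (3/π) · w · (1 − (d/r)²)² and the sum is divided by r²; the function name and the optional `weights` argument (standing in for an input field such as total check-ins) are hypothetical.

```python
import math

def kernel_density(x, y, points, radius, weights=None):
    """Density at (x, y) from weighted points, using an assumed quartic kernel."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for (px, py), w in zip(points, weights):
        u2 = ((x - px) ** 2 + (y - py) ** 2) / radius ** 2
        if u2 < 1.0:  # only points inside the search radius contribute
            total += w * (3.0 / math.pi) * (1.0 - u2) ** 2
    return total / radius ** 2

# One made-up check-in point at the origin with a search radius of 1.
pts = [(0.0, 0.0)]
at_center = kernel_density(0.0, 0.0, pts, radius=1.0)
half_way = kernel_density(0.5, 0.0, pts, radius=1.0)
outside = kernel_density(1.5, 0.0, pts, radius=1.0)
print(at_center, half_way, outside)
```

Evaluating this function on a regular grid of cell centers would give a raster comparable in spirit to the hotspot maps, with the input field supplied through `weights`.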

Therefore, in this step, the kernel density analysis tool is run with the total number of check-ins as the input field, and the resulting check-in activity hotspot areas are shown in FIG. 5. The result map shows that the check-in activity hotspots in the study area are mainly strong urban landmarks and busy commercial streets, such as Beijing South Railway Station, Beijing Jiaotong University, Wangfujing, Xidan, A square and Nanluoguxiang.

Three types of user activity types are compared with the sign-in activity hotspot as follows:

for high frequency revisit activities:

The high-frequency revisit activity areas obtained with the ArcGIS Pro kernel density analysis tool, using total check-ins / number of check-in users as the input field, are shown in FIG. 6-1; the POIs screened by high-frequency revisit activity are then overlaid on the result map of check-in hotspot areas, as shown in FIG. 6-2.

Of the 7 POIs with the highest high-frequency revisit activity, 6 are residential areas and 1 is an economy chain hotel. The two result maps show that high-frequency revisit points are distributed not in check-in hotspot areas but in check-in cold areas such as residential areas and hotels. High-frequency revisit activity therefore tends to occur in places where people stay for long periods, the main locations of daily life.

For low frequency revisit activities:

The low-frequency revisit activity areas obtained with number of check-in users / number of check-ins as the input field of the kernel density analysis are shown in FIG. 6-3. The categories of the top 20 low-frequency revisit POIs are listed in Table 6, and the POIs screened by low-frequency revisit activity are overlaid on the result map of check-in activity hotspot areas, as shown in FIG. 6-4.

TABLE 6 Top 20 Low frequency revisit Point POI Categories statistics

As can be seen from the above table, of the 20 POIs with the highest low-frequency revisit activity, 9 are catering services, 8 are life and entertainment, 2 are transportation facilities and 1 is a shopping service. Comparison with the check-in hotspot area map shows that low-frequency revisit activity is mostly distributed in check-in hotspot areas, such as Xidan Joy City, the Shichahai scenic area, the Palace Museum scenic area and Beijing Railway Station. These areas are typically associated with dining, shopping, sightseeing and travel: the total number of visits by the population is high, but the visit frequency of any particular user is low, which is consistent with people's activities, habits and behavior.

For the photographing activity:

Using the number of photos as the input field of the kernel density analysis, the kernel density result for users' photo check-ins is obtained, as shown in FIG. 6-5. Comparing this map with the check-in activity heat map of FIG. 5 shows that the hotspot areas of photo check-ins are very similar to those of ordinary check-ins: both lie in areas such as Beijing South Railway Station, Xidan Joy City, Wangfujing Department Store and A square. The pattern of photo check-ins thus resembles that of ordinary check-ins, and users often take photos while checking in, which is consistent with user behavior and personal habits. To look for further regularities, the POIs were ranked by number of check-ins and by number of photos, as shown in Table 7.

TABLE 7 check-in times and photo quantity ranking

As the table shows, tourist attractions such as A square and Nanluoguxiang rank noticeably higher by number of photos than by number of check-ins, indicating that users are more likely to take photos when checking in during sightseeing.
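The rank comparison behind this observation can be sketched as follows. The POI names and counts are invented for illustration and are not the values of Table 7; the helper `rank_by` is hypothetical.

```python
def rank_by(pois, field):
    """Return {name: rank}, where rank 1 is the largest value of the given field."""
    ordered = sorted(pois, key=lambda p: p[field], reverse=True)
    return {p["name"]: i + 1 for i, p in enumerate(ordered)}

# Made-up check-in and photo counts for three POIs.
pois = [
    {"name": "station", "checkins": 900, "photos": 300},
    {"name": "mall", "checkins": 700, "photos": 350},
    {"name": "scenic lane", "checkins": 500, "photos": 450},
]
checkin_rank = rank_by(pois, "checkins")
photo_rank = rank_by(pois, "photos")
# A positive gain means the POI ranks higher by photos than by check-ins.
rank_gain = {name: checkin_rank[name] - photo_rank[name] for name in checkin_rank}
print(rank_gain)
```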

In summary, the different check-in activity types have different meanings and characteristics. High-frequency revisit activity is mainly distributed in residential areas, hotels and other places where people live or stay for long periods; low-frequency revisit activity is mostly distributed in urban hot areas for dining, entertainment and travel; and photo-taking activity coincides with the check-in hotspot areas, mainly stations, commercial streets and scenic spots.

Therefore, the hot spot areas of the three activity categories can be determined by classifying the check-in behaviors of the user.

5) Classifying the POIs of the check-in data in each hotspot area using a POI clustering algorithm based on semantic similarity.

In this step, the POI clustering algorithm based on semantic similarity can be divided into three sub-steps: first, calculating the POI semantic similarity matrix; second, calculating the eigenvalues and corresponding eigenvectors; third, clustering the eigenvectors with the k-means algorithm and outputting the clusters. The specific classification steps are as follows:

Input: $n$ sample points $X = \{x_1, x_2, \dots, x_n\}$ and the number of clusters $k$;

Output: clusters $A_1, A_2, \dots, A_k$.

Step 1: Call a Word2Vec similarity calculation function, calculate the semantic similarity of the sample points, and output a similarity matrix $W$;

In this step, the Word2Vec similarity calculation function wv_from_text.similarity is called to calculate the semantic similarity of two words, for example:

Input: 'music hall', 'bowling bowl'

Output: 0.6246

Input: 'Gong hall', 'Anhui dish'

Output: 0.2714

Input: 'railway station', 'bus station'

Output: 0.8172

Input: 'train station', 'school'

Output: 0.4747

From the similarity calculation results, a semantic similarity distance matrix is built, as shown in Table 8 below.

TABLE 8 semantic similarity distance matrix (part)

In the above table, the semantic similarity between words ranges from low to high. For example, a word is most similar to itself, so its semantic similarity with itself is 1; Anhui cuisine and Beijing cuisine both belong to cuisine-type catering and have high semantic similarity, whereas Beijing cuisine and ATM differ greatly and have low semantic similarity.
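Building the pairwise similarity matrix from a two-word similarity function can be sketched as below. To stay self-contained, the sketch replaces the Word2Vec model with a tiny dictionary of made-up three-dimensional vectors and cosine similarity; the vectors, the words chosen and the helper names are all assumptions. With a real model loaded through gensim, the `similarity` argument could instead be the model's own two-word similarity method, such as the wv_from_text.similarity call mentioned in Step 1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(words, similarity):
    """Symmetric n x n matrix W with W[i][j] = similarity(words[i], words[j])."""
    n = len(words)
    W = [[1.0] * n for _ in range(n)]  # a word is most similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = similarity(words[i], words[j])
    return W

# Made-up toy vectors standing in for Word2Vec embeddings.
toy_vectors = {
    "Anhui cuisine": [0.9, 0.1, 0.0],
    "Beijing cuisine": [0.8, 0.2, 0.1],
    "ATM": [0.0, 0.1, 0.9],
}
words = list(toy_vectors)
W = similarity_matrix(words, lambda a, b: cosine(toy_vectors[a], toy_vectors[b]))
for word, row in zip(words, W):
    print(word, [round(v, 3) for v in row])
```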

Step 2: calculate the degree matrix D, where d_i = Σ_j w_ij is the sum of the elements of the i-th row of the similarity matrix W, and the d_i form an n × n diagonal matrix D;

Step 3: calculate the Laplacian matrix L = D - W;

Step 4: calculate the eigenvalues of L, sort them from small to large, take the first k eigenvalues, and calculate the corresponding eigenvectors u₁, u₂, …, u_k;

Step 5: arrange the k column vectors into a matrix U = {u₁, u₂, …, u_k}, U ∈ R^(n×k);

Step 6: let y_i ∈ R^k be the vector of the i-th row of U, where i = 1, 2, …, n;

Step 7: using the k-means algorithm, cluster the new sample points Y = {y₁, y₂, …, y_n} into clusters {C₁, C₂, …, C_k};

Step 8: output clusters A₁, A₂, …, A_k, where A_i = {j | y_j ∈ C_i}.
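Steps 2 to 8 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the embodiment: the 5 × 5 matrix `W_toy` is made-up data, the function names `spectral_embedding`, `kmeans` and `spectral_clusters` are hypothetical, and the small k-means routine uses a deterministic farthest-point initialization chosen here for reproducibility (the text does not specify one).

```python
import numpy as np


def spectral_embedding(W, k):
    """Steps 2-6: rows of the returned n x k matrix U are the new points y_i."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))            # Step 2: degree matrix, d_i = sum_j w_ij
    L = D - W                             # Step 3: unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # Step 4: eigenvalues in ascending order
    return eigvecs[:, :k]                 # Steps 5-6: first k eigenvectors as columns


def kmeans(Y, k, n_iter=100):
    """Step 7: plain k-means with farthest-point initialization (an assumption)."""
    centers = [Y[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dist_to_centers = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist_to_centers, axis=1)
        # Recompute centers; keep the old center if a cluster became empty.
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clusters(W, k):
    """Step 8: return A_1..A_k as lists of sample indices, A_i = {j | y_j in C_i}."""
    labels = kmeans(spectral_embedding(W, k), k)
    return [np.flatnonzero(labels == c).tolist() for c in range(k)]


# Made-up similarity matrix: samples 0-2 are mutually similar, as are samples 3-4.
W_toy = [
    [1.00, 0.90, 0.80, 0.10, 0.10],
    [0.90, 1.00, 0.85, 0.10, 0.05],
    [0.80, 0.85, 1.00, 0.05, 0.10],
    [0.10, 0.10, 0.05, 1.00, 0.90],
    [0.10, 0.05, 0.10, 0.90, 1.00],
]
clusters = spectral_clusters(W_toy, k=2)
print(clusters)
```

`np.linalg.eigh` is used because L is symmetric, and it returns eigenvalues in ascending order, so the first k columns correspond to the k smallest eigenvalues required by Step 4.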

In short, the POI clustering algorithm based on semantic similarity calculates a similarity matrix from the sample points, then calculates the Laplacian matrix, and then calculates the eigenvectors corresponding to the first k eigenvalues of the Laplacian matrix. The calculated eigenvectors form a matrix U, each row of which becomes a newly generated sample point; k-means clustering is applied to the newly generated sample points to form k classes, and the clustering result is output. In other embodiments, instead of k-means, other clustering algorithms such as k-modes, CLARA or PAM can be used to cluster the newly generated sample points.

Therefore, in this step, the semantic similarity calculation results between words are used as the similarity matrix of the clustering samples, and, on the basis of the improved algorithm, the spectral clustering function in the Python scikit-learn library is used to perform clustering on the POI semantic similarity matrix. Category labels (such as catering services, travel and the like) are then added to the clustering result with reference to the high-resolution map POI classification standard. The result is shown in Table 9.
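A call of this kind might look like the sketch below, which passes a precomputed similarity matrix to scikit-learn's `SpectralClustering`. The POI names and similarity values are made up for illustration, and the parameter choices (`n_clusters=2`, `random_state=0`) are assumptions, since the embodiment does not list them. Note also that scikit-learn's internal embedding is understood to use a normalized Laplacian, so it is not guaranteed to match the unnormalized L = D - W steps exactly.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical POI type names and made-up pairwise semantic similarities.
poi_names = ["Anhui dish", "Beijing dish", "hot pot", "bus station", "railway station", "subway station"]
W = np.array([
    [1.00, 0.85, 0.80, 0.10, 0.12, 0.08],
    [0.85, 1.00, 0.82, 0.09, 0.10, 0.11],
    [0.80, 0.82, 1.00, 0.12, 0.08, 0.10],
    [0.10, 0.09, 0.12, 1.00, 0.82, 0.85],
    [0.12, 0.10, 0.08, 0.82, 1.00, 0.80],
    [0.08, 0.11, 0.10, 0.85, 0.80, 1.00],
])

# affinity="precomputed" tells scikit-learn that W already holds similarities.
model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(W)

# Group POI names by cluster id; category labels such as "catering services"
# would then be assigned to each group by hand against a classification standard.
groups = {}
for name, label in zip(poi_names, labels):
    groups.setdefault(int(label), []).append(name)
print(groups)
```

The cluster ids themselves are arbitrary; only the grouping is meaningful, which is why a separate labelling pass against a POI classification standard is still needed.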

TABLE 9 POI type clustering results

To verify the effectiveness of the method, the six check-in activity hotspot regions obtained by hotspot detection (see Fig. 6-5 and Table 7) are taken as an example and their POI composition is analyzed further. First, the buffer analysis tool in ArcGIS Pro is used with the buffer radius set to 1 km, and the number of POIs in each of the six check-in hotspot regions is obtained, as shown in Table 10 below.

TABLE 10 results of POI extraction for six hotspot regions
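The embodiment performs the 1 km buffer extraction in ArcGIS Pro. For readers without that tool, the same idea can be approximated in plain Python by keeping every POI whose great-circle distance to the hotspot center is at most 1 km. This is a stand-in sketch, not the ArcGIS procedure: the haversine formula on a spherical Earth, the center coordinates, the POI records and the function names are all assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pois_within_buffer(center, pois, radius_km=1.0):
    """Return the POIs lying inside a circular buffer around the hotspot center."""
    lat0, lon0 = center
    return [
        poi for poi in pois
        if haversine_km(lat0, lon0, poi["lat"], poi["lon"]) <= radius_km
    ]


# Hypothetical hotspot center and POIs (coordinates are illustrative only).
hotspot_center = (39.900, 116.400)
pois = [
    {"name": "a", "lat": 39.905, "lon": 116.400},  # roughly 0.56 km north
    {"name": "b", "lat": 39.920, "lon": 116.400},  # roughly 2.2 km north
    {"name": "c", "lat": 39.900, "lon": 116.405},  # roughly 0.43 km east
]
inside = pois_within_buffer(hotspot_center, pois, radius_km=1.0)
print([poi["name"] for poi in inside])
```

Counting `len(inside)` per hotspot gives the kind of per-region POI totals summarized in Table 10.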

Then, the number of POI check-ins in each of the six hotspot regions is counted as shown in table 11 below.

TABLE 11 number of various POI check-ins for six hotspot regions
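The tallying behind a table of this kind can be sketched as follows: each check-in record carries the hotspot it falls in and the POI category assigned by the clustering step, and the counts and proportions are accumulated per hotspot. The records, hotspot names and category names below are invented for illustration, and `count_checkins` and `category_shares` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Hypothetical check-in records: (hotspot region, POI category of the check-in).
checkins = [
    ("hotspot_1", "shopping"),
    ("hotspot_1", "shopping"),
    ("hotspot_1", "catering"),
    ("hotspot_2", "tourism"),
    ("hotspot_2", "catering"),
    ("hotspot_2", "tourism"),
    ("hotspot_2", "tourism"),
]


def count_checkins(records):
    """Count check-ins per POI category inside each hotspot region."""
    counts = defaultdict(Counter)
    for hotspot, category in records:
        counts[hotspot][category] += 1
    return counts


def category_shares(counter):
    """Convert the category counts of one hotspot into proportions."""
    total = sum(counter.values())
    return {category: n / total for category, n in counter.items()}


counts = count_checkins(checkins)
shares = {hotspot: category_shares(c) for hotspot, c in counts.items()}

for hotspot in sorted(counts):
    print(hotspot, dict(counts[hotspot]), shares[hotspot])
```

The per-hotspot proportions are the quantities that the radar charts and pie charts described next would visualize.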

The Excel chart tool is used to make radar charts and pie charts of the number of POIs and the number of check-ins of each type in the six hotspot areas, as shown in Figs. 7-1 to 7-12. To strengthen the comparison, radar charts and histograms comparing the number, proportion and check-in counts across the six hotspot areas are also created, as shown in Figs. 8-1 to 8-4, where Figs. 8-1 and 8-2 are radar charts of the number and proportion of each kind of POI in the six hotspot areas, Fig. 8-3 is a histogram of the number and proportion of each kind of POI in the six hotspot areas, and Fig. 8-4 is a histogram of the number of check-ins of each kind of POI in the six hotspot areas.

From the charts above, the following can be observed. The Wangfujing and Xidan hotspot areas contain the most POIs, and the categories and numbers of POIs in the two areas are very similar; because both are shopping districts, these two hotspot areas are the largest of the six. The Nanluoguxiang and A Square hotspot areas have the largest share of tourism POIs, and their POI types and quantities are very similar, because both are Beijing tourist attractions; Nanluoguxiang has more catering service POIs than A Square, in both number and proportion, because it is a scenic spot known for its food. Public facilities account for the largest proportion in the Beijing South Railway Station hotspot area, because railway stations are closely associated with travel. Beijing Jiaotong University has the highest proportion of education and training POIs among the six hotspot areas, which is determined by the nature of a university.

In summary, the sign-in hotspot functional feature identification method performs POI classification using a POI clustering algorithm based on semantic similarity and divides the POIs into 8 classes. The classification algorithm takes less computation time: classification with the standard spectral clustering algorithm takes 2.6 s, whereas classification with the present method takes 2.2 s. In addition, the POI category label classification result is more accurate and the identification rate is higher.

Finally, it should be noted that in the embodiment the sign-in data of a social network site, namely the microblog, is used to describe the sign-in hotspot functional feature identification method of the present invention, but the application scenario of the method is not limited thereto; the method can also be applied to other social network sites with sign-in functions, such as Twitter, WeChat and the like.

The above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. The sign-in hotspot functional feature identification method based on semantic clustering is characterized by comprising the following steps of:

calculating a Laplacian matrix L = D - W; calculating the eigenvalues of L, sorting the eigenvalues from small to large, taking the first k eigenvalues, and calculating the corresponding eigenvectors u₁, u₂, …, u_k; combining the k eigenvectors into a matrix U = {u₁, u₂, …, u_k}, U ∈ R^(n×k);

2. The method of claim 1, wherein determining a number of hotspot regions from the check-in data comprises:

3. The method of claim 1, wherein determining a number of hotspot regions from the check-in data comprises:

4. The sign-in hotspot functional feature identification method based on semantic clustering as claimed in any one of claims 1 to 3, characterized in that the method further comprises the following steps:

5. The sign-in hotspot functional feature identification method based on semantic clustering as claimed in claim 1, wherein in the step 2), the new sample points Y = {y₁, y₂, …, y_n} are clustered into clusters {C₁, C₂, …, C_k} by using a k-means clustering method.

6. The sign-in hotspot functional feature recognition method based on semantic clustering as claimed in claim 1, wherein in the step 1), before determining the hotspot region, a step of performing data cleaning on the sign-in data is further included, so as to clean out wrong sign-in data.