CN108021944A

CN108021944A - Public bicycle station clustering method based on association relationships

Info

Publication number: CN108021944A
Application number: CN201711282891.4A
Authority: CN
Inventors: 刘良旭
Original assignee: Ningbo University of Technology
Current assignee: Ningbo University of Technology
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2018-05-11

Abstract

The invention discloses a public bicycle station clustering method based on association relationships, comprising the following steps: defining the SimRank algorithm; defining a SimRank-based public bicycle station similarity value according to the characteristics of the urban public bicycle system, wherein Definition 1 is the public bicycle station topology network, Definition 2 is the station similarity value, Definition 3 is p-station similarity, and Definition 4 is station clustering; calculating the similarity value matrix M between stations according to Definition 2 and Definition 3; and classifying the stations into different clusters according to the similarity value matrix M and a user-specified similarity value probability threshold p, in combination with Definition 4. The station clustering proposed by the invention has good practical significance: this association-based station clustering can readily be used in research on station zone division, public bicycle scheduling strategies, and similar algorithms.

Description

Public bicycle station clustering method based on association relationships

Technical Field

The invention belongs to the technical field of urban public bicycle management, and in particular relates to a public bicycle station clustering method based on association relationships.

Background

The concept of the urban public bicycle (Public Bike) originated in Europe. Since the late 1990s, emerging information technology has driven the development of urban public bicycle systems at home and abroad. Compared with other modes of public transportation, the public bicycle has gradually become popular because it is convenient and healthy, and in response to the call for green living, urban public bicycle systems in China have been continuously developed and improved in recent years. This rapid development has brought problems: public bicycles are unevenly distributed, and the small capacity of stations often causes the "tide problem" (bicycles cannot be borrowed or returned during commuting peak hours). To address this, many scholars have studied station clustering, dynamic scheduling, station layout and related topics, seeking to reduce the tide problem as far as possible. Station clustering is clearly one of the hotspots and foundations of research in this field. Although it has been studied by many scholars, most results are based on static characteristics of the stations (such as station capacity and borrowing statistics) and neglect the connections between stations. The present method studies bicycle station clustering on the basis of the connections between stations, and proposes SCSR (Station Cluster based SimRank), a station clustering algorithm based on the SimRank idea.

The biggest problem currently faced in the field of public bicycles is the tide problem, namely that a user cannot borrow a bicycle because the station has none, or cannot return one because the station has no free dock. Several documents have studied this with the aim of optimizing the public bicycle redistribution strategy, and these approaches are all based, to a greater or lesser extent, on the idea of station clustering. Station clustering is therefore one of the important topics and foundations of current research on public bicycle systems, and many scholars have studied it in search of more effective clustering. For example, Froehlich et al. ((1) J. Froehlich, J. Neumann, and N. Oliver. Measuring the pulse of the city through shared bicycle programs. In UrbanSense08, pages 16-20, 2008; (2) J. Froehlich, J. Neumann, and N. Oliver. Sensing and predicting the pulse of the city through shared bicycling. In 21st International Joint Conference on Artificial Intelligence, IJCAI 09, pages 1420-1426. AAAI Press, 2009) used a Gaussian mixture EM algorithm to mine attributes of Barcelona public bicycle stations sampled every 5 minutes, such as the number of bicycles and the number of empty docks. Lathia et al. (N. Lathia, S. Ahmed, and L. Capra. Measuring the impact of opening the London shared bicycle scheme to casual users. Transportation Research Part C: Emerging Technologies, 22, June 2012.) derived the impact of a change in user-access policy on the rental patterns of Barclays bicycles in London through spatial and temporal analysis of station occupancy data. Borgnat et al. ((1) P. Borgnat, E. Fleury, C. Robardet, and A. Scherrer. Spatial analysis of dynamic movements of Vélo'v, Lyon's shared bicycle program. In François Képès, editor, European Conference on Complex Systems, ECCS 09. Complex Systems Society, September 2009; (2) P. Borgnat et al. Shared bicycles in a city: a signal processing and data analysis perspective. Advances in Complex Systems, 14(3): 1-24, June 2011; (3) P. Borgnat et al. [...]) analyzed the dynamics of Lyon's Vélo'v shared bicycle program.

Vogel et al. ((1) P. Vogel and D. C. Mattfeld. Strategic and operational planning of bike-sharing systems by data mining - a case study. In ICCL, pages 127-141. Springer Berlin Heidelberg, 2011; (2) P. Vogel, T. Greiser, and D. C. Mattfeld. Understanding bike-sharing systems using data mining: exploring activity patterns. Procedia - Social and Behavioral Sciences, 20(0): 514-523, 2011.) extracted station feature vectors from count sequences (bicycles arriving and departing every hour) using K-Means, an EM-based Gaussian mixture model, and sIB, respectively. Etienne Côme et al. (Etienne Côme, Latifa Oukhellou. Model-based count series clustering for Bike Sharing System usage mining, a case study with the Vélib' system of Paris. ACM Transactions on Intelligent Systems and Technology, 2014, 27p.) processed count sequences (the number of bicycles borrowed and returned per station per hour) using a Poisson mixture model with a station scale factor that balances the differences between stations, generating movement patterns that account for both weekdays and weekends. Chardon et al. (Chardon CMD, Caruso G. Estimating bike-share trips using station level data [J]. Transportation Research Part B: Methodological, 2015, 78.) built models from trip statistics, such as the Day Aggregation Model, the Interval Aggregation Model and the Station Aggregation Model, and thereby realized station classification and station-level redistribution. These studies use static characteristics of the stations (e.g., location, capacity) or simple dynamic factors (e.g., count sequences) to calculate similarity values between stations, and perform station clustering with classical clustering algorithms (e.g., K-Means).

However, viewed as a whole, the urban public bicycle system is a relationship network in which stations are the nodes and each borrow-and-return record is an association between them. What this network embodies is the association between stations rather than the characteristics of the stations themselves. Some researchers have also analyzed the similarity between stations from the perspective of network topology. For example, L. Chen et al. (L. Chen et al. Dynamic cluster-based over-demand prediction in bike sharing systems [J]. ACM International Joint Conference, 2016: 841-852.) use a weighted correlation network to model the relationships between bicycle stations: two stations are linked if they are geographically close, and the weight of each link is calculated to form a similarity value. Similar neighboring stations are dynamically combined into clusters, which can be regarded as communities that are densely connected internally and loosely connected to one another. Based on the label propagation algorithm and the Girvan-Newman algorithm, that work proposes a Geographically-Constrained Label Propagation (GCLP) method to address the problem of a single cluster covering too large a range. The clustering algorithm calculates similarity values from the number of bicycles in each station's state, and displays the trend and range of bicycle movement more intuitively. Kadri et al. (A. Kadri, K. Labadi, I. Kacem. An integrated Petri net and GA based approach for performance optimization of bicycle sharing systems [J]. European J. of Industrial Engineering, 2015, 9(5)) and Labadi et al. (K. Labadi et al. Stochastic Petri Net Modeling, Simulation and Analysis of Public Bicycle Sharing Systems. IEEE Transactions on Automation Science & Engineering, 2015, 12(4): 1380-1395.) regard the urban public bicycle system as a Petri net and train it using Petri net concepts. Petri nets with time, inhibitor arcs and variable arc weights are used to model the bicycle sharing system as a discrete-event system for performance evaluation. However, the Petri net is essentially a black box, and the behavior of the structure is difficult to guarantee from a training data set.

SimRank (Jeh G, Widom J. SimRank: a measure of structural-context similarity [C]. Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002, 538-543.) is a typical algorithm for similarity calculation on connected networks, and a general model that measures the degree of similarity between objects from the topology information of the network. SimRank can compute the similarity value between any two nodes, and its recursive definition also allows the similarity value to capture the global structure of the graph.

Disclosure of Invention

In view of the above, the present invention provides a public bicycle station clustering method based on association relationships.

In order to solve the above technical problem, the invention discloses a public bicycle station clustering method based on association relationships, which comprises the following steps:

step 1, defining the SimRank algorithm;

step 2, defining a SimRank-based public bicycle station similarity value according to the characteristics of the urban public bicycle system, wherein Definition 1 is the public bicycle station topology network, Definition 2 is the site similarity value, Definition 3 is p-site similarity, and Definition 4 is site clustering;

step 3, calculating the similarity value matrix M between the sites according to Definition 2 and Definition 3;

step 4, classifying the sites into different clusters according to the similarity value matrix M and a user-specified similarity value probability threshold p, in combination with Definition 4.

Further, the SimRank algorithm defined in step 1 specifically includes:

assuming that two nodes α and β exist in a network topology, the SimRank similarity value s_0(α, β) based on the characteristics of adjacent nodes is defined, for α ≠ β, as follows:

$$s_0(\alpha,\beta)=\frac{C}{|O(\alpha)|\,|O(\beta)|}\sum_{i=1}^{|O(\alpha)|}\sum_{j=1}^{|O(\beta)|}s_0\bigl(O_i(\alpha),O_j(\beta)\bigr)\qquad(1)$$

where C is a constant between 0 and 1, called the damping coefficient or attenuation factor; O(α) and O(β) denote the sets of nodes connected to α and β respectively; |O(α)| and |O(β)| denote the numbers of elements of O(α) and O(β); O_i(α) denotes the i-th node connected to node α, and O_j(β) denotes the j-th node connected to node β. Equation (1) states that the SimRank similarity value between two nodes α and β is the average of the similarity values between each associated node of α and each associated node of β, scaled by C;

let n be the number of nodes in the site network graph, let α and β be any two nodes in the graph, let R_0(α, β) denote the initial similarity value between the two nodes, and let R_i(α, β) denote the similarity value between α and β at the i-th iteration, which is calculated as follows:

$$R_0(\alpha,\beta)=\begin{cases}1,&\alpha=\beta\\0,&\alpha\neq\beta\end{cases}\qquad(2)$$

R_0(α, β) is a lower bound of s_0(α, β), and each iteration computes R_i(α, β) from R_{i-1}(α, β) according to the following formula, with R_i(α, α) = 1:

$$R_i(\alpha,\beta)=\frac{C}{|O(\alpha)|\,|O(\beta)|}\sum_{u=1}^{|O(\alpha)|}\sum_{v=1}^{|O(\beta)|}R_{i-1}\bigl(O_u(\alpha),O_v(\beta)\bigr),\quad\alpha\neq\beta\qquad(3)$$

Equation (3) is expressed in matrix notation as:

$$S=C\,W^{\mathsf T}SW+I_n-\mathrm{Diag}\bigl(\mathrm{diag}(C\,W^{\mathsf T}SW)\bigr)\qquad(4)$$

wherein S is the similarity value matrix, W is the column-normalized adjacency matrix of the graph, and I_n is the n × n identity matrix. diag(X) denotes the vector formed by the elements on the diagonal of matrix X, and Diag(v) the diagonal matrix built from vector v; the terms I_n and Diag(diag(·)) together set the main diagonal elements of the matrix to 1, that is, s_0(α, α) = 1.

Further, according to the characteristics of the urban public bicycle system, the public bicycle station similarity value based on the SimRank algorithm in the step 2 is defined as follows:

step 2.1, Definition 1, public bicycle station topology network: the public bicycle station topology network G = (V, E) is referred to as the site network for short; nodes α and β denote two sites in V, and the triple (α, β, r) denotes a directed connection from site α to site β, where r is the number of bicycles travelling from site α to site β within a specified time; O(α) denotes the set of sites a_i for which a triple (α, a_i, r) with r > 0 exists, and O_i(α) denotes any element of O(α);

step 2.2, Definition 2, site similarity value: given two sites α, β, the site similarity value based on Definition 1 is defined as follows:

wherein max(O_i(α), O(β)) = max_{j=1...|O(β)|} s(O_i(α), O_j(β)), that is, the maximum similarity value between the associated site O_i(α) of site α and any associated site of site β; given n sites, the matrix M formed by the similarity values between every two sites is called the n × n similarity value matrix;

formula (5) is used to calculate the similarity value between any two sites α, β in the site network, and an n × n matrix needs to be calculated for a site network containing n sites; to calculate this matrix, formula (5) is written as an iterative formula as follows:

Inference 1: s(α, β) has the following characteristics:

Asymmetry: s(α, β) is not necessarily equal to s(β, α);

Monotonicity: 0 ≤ s^k(α, β) ≤ s^{k-1}(α, β) ≤ 1;

Existence and uniqueness: it can be seen from the iterative formula that s(α, β) necessarily converges to a fixed value;

step 2.3, Definition 3, p-site similarity: given a similarity value threshold p, if the similarity values of any two sites α, β of the site network (according to Definition 2) satisfy s(α, β) > p and s(β, α) > p, then sites α and β are said to be p-site similar;

step 2.4, Definition 4, site clustering: given an n × n similarity value matrix M, if sites α and β are p-site similar according to M, and there exists no site γ satisfying both of the following conditions: (a) sites α and γ are p-site similar; (b) s(α, β) < s(α, γ); then site α belongs to the cluster in which site β resides.

Further, calculating the similarity value matrix M between the sites according to Definition 2 and Definition 3 in step 3 specifically includes: each element of the site similarity value matrix M is obtained by iterating formula (6) and formula (7). Let n denote the number of sites and k the number of iterations. R(α, β) is an element of the similarity value matrix M and stores the similarity value between each pair of nodes (α, β) during the iteration; R*(α, β) is an element of the temporary similarity value matrix M* and stores the intermediate value of R_i(α, β) calculated in the current iteration, which is copied to the similarity value matrix M at the end of each iteration. First, the similarity value matrix M is initialized according to formula (6), which is implemented by the function initialization. Then, for every two sites α, β, the following procedure is executed k times: first, for each element O_i(α) of the connected site set O(α) of site α, find the site O_j(β) in the connected site set O(β) of site β that is most similar to O_i(α), and take the similarity value between these two sites as the similarity value between O_i(α) and O(β); second, calculate the average of these values over all connected sites of α as the similarity value between the two sites; third, copy the intermediate matrix M* to the matrix M for the next iteration. When the iterations end, the function returns the similarity value matrix M obtained by the iterative computation.

Further, the classifying the sites into different clusters according to the similarity value matrix M and the similarity value probability threshold p specified by the user in step 4, in combination with the definition 4, specifically includes:

the Chameleon algorithm is adopted as the clustering algorithm; the site clustering process determines, from the similarity values between sites, all sets of sites that are p-site similar, and classifies them into different clusters according to the largest-first principle, wherein the array element c[i] denotes the cluster to which site i belongs (c[i] == j means that site i belongs to the cluster of site j), and max[i] denotes the largest similarity value with which site i belongs to the cluster of site j; the clustering algorithm flow is as follows:

first, for each row i of the similarity value matrix M, the following operations are performed:

(i) judging whether the current row i of the similarity value matrix contains sites satisfying the two conditions of Definition 3; if not, jumping directly to the next cycle; otherwise, proceeding to step (ii);

(ii) if there are several such sites, taking the site j with the largest similarity value as the control site of site i;

(iii) judging whether the matrix element m_ij > max[i], that is, whether the similarity value between site i and site j is greater than the current maximum similarity value; if so, updating the control site of site i to site j; otherwise, skipping this cycle;

next, the clusters are merged according to the array c[]; for each c[i], the following is performed: if c[i] == j (j != 0), it is determined whether site i and site j already belong to a cluster:

(i-1) if neither site i nor site j belongs to any cluster, creating a new cluster with site i and site j as elements, and setting the control site of the cluster to site j;

(ii-1) if site j already belongs to a cluster, adding site i to the cluster in which site j is located.

Compared with the prior art, the invention can obtain the following technical effects:

the site clustering provided by the invention has good practical significance. The connection-based station clustering can be conveniently used for algorithm researches such as station area division and public bicycle scheduling strategies.

Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not limit the invention. In the drawings:

FIG. 1 shows the clustering result of the SCSR algorithm of the present invention;

FIG. 2 is a partial clustering result of the SCSR algorithm of the present invention (Tianyi-drum building business district);

FIG. 3 is a site distribution associated with all member sites of cluster A in accordance with the present invention;

FIG. 4 is a site distribution associated with all member sites of cluster B of the present invention;

FIG. 5 shows the change in borrowing volume for cluster A of the present invention over different time periods and regions;

FIG. 6 shows the change in borrowing volume for cluster B of the present invention over different time periods and regions;

FIG. 7 shows the distribution of cluster 47 of the present invention at threshold p = 0.68;

FIG. 8 shows the distribution of cluster 47 of the present invention at threshold p = 0.69;

FIG. 9 shows the distribution of cluster 47 of the present invention at threshold p = 0.70;

FIG. 10 shows the distribution of cluster 47 of the present invention at threshold p = 0.71.

Detailed Description

The following embodiments of the present invention will be described in detail with reference to the accompanying drawings, so that the implementation process of the present invention, which adopts technical means to solve the technical problems and achieve the technical effects, can be fully understood and implemented.

The invention provides a public bicycle station clustering method based on association relationships, which comprises the following steps:

step 1, defining a SimRank algorithm:

SimRank is one of the most popular algorithms for similarity calculation based on multi-hop neighbors, and it computes node similarity, as a basis for node classification, from the associations between nodes. The distance between two nodes is defined as the number of hops the two nodes need to meet at a common node, and the method is applied in many fields. SimRank is often used to measure the similarity of nodes within a topology, and it can be calculated from the connecting edges of adjacent nodes. Assuming that two nodes α and β exist in a network topology, the SimRank similarity value s_0(α, β) based on the characteristics of adjacent nodes is defined, for α ≠ β, as follows:

$$s_0(\alpha,\beta)=\frac{C}{|O(\alpha)|\,|O(\beta)|}\sum_{i=1}^{|O(\alpha)|}\sum_{j=1}^{|O(\beta)|}s_0\bigl(O_i(\alpha),O_j(\beta)\bigr)\qquad(1)$$

where C is a constant between 0 and 1 (the damping coefficient or attenuation factor). O(α) (or O(β)) denotes the set of nodes connected to node α (or β), |O(α)| (or |O(β)|) denotes the number of elements of the set O(α) (or O(β)), and O_i(α) (or O_j(β)) denotes the i-th (or j-th) node connected to node α (or β). Equation (1) states that the SimRank similarity value between two nodes α and β is the average of the similarity values between each associated node of α and each associated node of β, scaled by C.

According to the SimRank idea, the value of formula (1) can be obtained through several iterations of formula (2) and formula (3); since formula (1) and formula (2) are realized in the same way, the following description takes formula (1) as an example. Let n be the number of nodes in the site network graph, let α and β be any two nodes in the graph, let R_0(α, β) denote the initial similarity value between the two nodes, and let R_i(α, β) denote the similarity value between α and β at the i-th iteration. Following the iterative idea, the calculation proceeds as follows:

$$R_0(\alpha,\beta)=\begin{cases}1,&\alpha=\beta\\0,&\alpha\neq\beta\end{cases}\qquad(2)$$

R_0(α, β) is a lower bound of s_0(α, β), and each iteration computes R_i(α, β) from R_{i-1}(α, β), with R_i(α, α) = 1:

$$R_i(\alpha,\beta)=\frac{C}{|O(\alpha)|\,|O(\beta)|}\sum_{u=1}^{|O(\alpha)|}\sum_{v=1}^{|O(\beta)|}R_{i-1}\bigl(O_u(\alpha),O_v(\beta)\bigr),\quad\alpha\neq\beta\qquad(3)$$

Equation (3) can be expressed in matrix notation as:

$$S=C\,W^{\mathsf T}SW+I_n-\mathrm{Diag}\bigl(\mathrm{diag}(C\,W^{\mathsf T}SW)\bigr)\qquad(4)$$

wherein S is the similarity value matrix, W is the column-normalized adjacency matrix of the graph, and I_n is the n × n identity matrix. diag(X) denotes the vector formed by the elements on the diagonal of matrix X, and Diag(v) the diagonal matrix built from vector v; the terms I_n and Diag(diag(·)) together set the main diagonal elements of the matrix to 1, that is, s_0(α, α) = 1.
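The iterative calculation of SimRank can be sketched in a few lines. The following is a minimal illustration rather than the implementation of the invention: it starts from 1 on the diagonal and 0 elsewhere, then repeatedly averages the similarity values of all pairs of connected nodes and scales the result by C. The function name `simrank`, the dictionary-based graph format, the toy graph, and the convention that a node without connected nodes has similarity 0 to every other node are assumptions made for this sketch.

```python
def simrank(neighbors, C=0.8, iterations=5):
    """Iterate the SimRank update on a graph given as {node: [connected nodes]}."""
    nodes = list(neighbors)
    # Initial value: 1 on the diagonal, 0 elsewhere.
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        R_next = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    R_next[(a, b)] = 1.0
                    continue
                Oa, Ob = neighbors[a], neighbors[b]
                if not Oa or not Ob:
                    # Assumed convention: no connected nodes means similarity 0.
                    R_next[(a, b)] = 0.0
                    continue
                # Average over all pairs of connected nodes, scaled by C.
                total = sum(R[(x, y)] for x in Oa for y in Ob)
                R_next[(a, b)] = C * total / (len(Oa) * len(Ob))
        R = R_next
    return R


# Hypothetical toy graph: a and b both connect to c; c connects to nothing.
graph = {"a": ["c"], "b": ["c"], "c": []}
scores = simrank(graph, C=0.8, iterations=3)
print(scores[("a", "b")])
```

In this toy graph, a and b share their only connected node, so their similarity equals C after the first iteration and stays there.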

Step 2, according to the characteristics of the urban public bicycle system, defining the similarity value of public bicycle stations based on the SimRank algorithm:

Because public bicycle stations occupy little space, they are easy to deploy in cities, but for the same reason their capacity is small, so a single urban functional area is often served by several stations. The stations are not independent of one another: borrowing and returning at one station affects the surrounding stations; for example, a lack of bicycles at one station directly affects the number of bicycles at nearby stations. For these reasons, the SimRank algorithm cannot be used directly to calculate similarity values for public bicycle stations. To simplify and better describe the algorithm, several concepts related to urban public bicycle systems are first presented.

Definition 1 (public bicycle station topology network): the public bicycle station topology network G = (V, E) is referred to herein as the site network for short. Nodes α and β denote two sites in V, and the triple (α, β, r) denotes a directed connection from site α to site β, where r is the number of bicycles travelling from site α to site β within a given time. O(α) denotes the set of sites a_i for which a triple (α, a_i, r) with r > 0 exists, and O_i(α) denotes any element of O(α).
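As an illustration of Definition 1, the triples (α, β, r) and the sets O(α) can be derived from raw borrow-and-return records. The sketch below assumes each record is reduced to a pair (borrow station, return station) within one time window; the function name `build_station_network` and the sample records are hypothetical.

```python
from collections import defaultdict


def build_station_network(trips):
    """Aggregate (borrow_station, return_station) records into edge counts and O(alpha)."""
    counts = defaultdict(int)
    for borrow_station, return_station in trips:
        # r for the triple (alpha, beta, r): bicycles travelling from alpha to beta.
        counts[(borrow_station, return_station)] += 1
    O = defaultdict(set)
    for (alpha, beta), r in counts.items():
        if r > 0:
            O[alpha].add(beta)
    return dict(counts), {station: sorted(dests) for station, dests in O.items()}


# Hypothetical trip records within one time window.
trips = [(1, 2), (1, 2), (1, 3), (2, 1), (3, 1)]
edges, O = build_station_network(trips)
print(edges[(1, 2)])  # number of bicycles from station 1 to station 2
print(O[1])           # stations associated with station 1
```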

In the SimRank algorithm, the similarity value is the average of the similarity values over the connecting edges of two nodes. However, the borrowing data of the site network is not a simple reference relationship but an association relationship carrying a scalar value, and the relationships between sites are not of the same type as those found in document citations or internet advertisements. Therefore, for each associated site O_i(α) of the target site α we take the closest associated site O_j(β) of the comparison site β, and use the average of the similarity values between them as the similarity value of the two sites. The site similarity value is defined below.

Definition 2 (site similarity value): given two sites α, β, the site similarity value based on definition 1 is defined as follows:

wherein max(O_i(α), O(β)) = max_{j=1...|O(β)|} s(O_i(α), O_j(β)), that is, the maximum similarity value between the associated site O_i(α) of site α and any associated site of site β. Given n sites, the matrix M formed by the similarity values between every two sites is called the n × n similarity value matrix.

Equation (5) is used to calculate the similarity value between any two sites α, β in the site network, and an n × n matrix needs to be calculated for a site network containing n sites. To calculate this matrix, we write equation (5) as an iterative equation as follows:
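The formula images for equations (5) to (7) are not reproduced in this text. One reconstruction consistent with the surrounding description is sketched below. It assumes that the damping factor C scales the average as in SimRank, that s(α, α) = 1, that a site without associated sites has similarity 0 to every other site, and that the iteration starts from 1 on the diagonal and 0 elsewhere; none of these details is confirmed by the source.

$$s(\alpha,\beta)=\frac{C}{|O(\alpha)|}\sum_{i=1}^{|O(\alpha)|}\max\bigl(O_i(\alpha),O(\beta)\bigr),\quad\alpha\neq\beta\qquad(5)$$

$$s^{0}(\alpha,\beta)=\begin{cases}1,&\alpha=\beta\\0,&\alpha\neq\beta\end{cases}\qquad(6)$$

$$s^{k}(\alpha,\beta)=\frac{C}{|O(\alpha)|}\sum_{i=1}^{|O(\alpha)|}\max_{j=1\ldots|O(\beta)|}s^{k-1}\bigl(O_i(\alpha),O_j(\beta)\bigr),\quad\alpha\neq\beta\qquad(7)$$

Because C < 1 makes the update in (7) a contraction, the limit does not depend on the starting values, but the direction of approach does: starting from 0 off the diagonal, the iterates rise towards the limit, whereas starting from 1 for every pair they fall towards it, which is the ordering stated under Monotonicity in Inference 1.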

Inference 1: s(α, β) has the following characteristics:

Asymmetry: s(α, β) is not necessarily equal to s(β, α).

Monotonicity: 0 ≤ s^k(α, β) ≤ s^{k-1}(α, β) ≤ 1.

Existence and uniqueness: it can be seen from the iterative formula that s(α, β) necessarily converges to a fixed value.

Proof: omitted.

Asymmetry is an inherent characteristic of the sites. Site α may be very similar to site β, yet site α may be much larger than site β, so that site α merely happens to have the relevant characteristics of site β; this does not mean that site β is similar to site α.
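A small worked case may make the asymmetry concrete. It assumes that the site similarity value takes the best-match average form s(α, β) = C/|O(α)| · Σ_i max_j s(O_i(α), O_j(β)) with s(x, x) = 1; the sites x and y, and their zero mutual similarity, are hypothetical. Let O(α) = {x, y}, O(β) = {x}, and s(x, y) = s(y, x) = 0. Then

$$s(\beta,\alpha)=\frac{C}{1}\max\{s(x,x),\,s(x,y)\}=C$$

$$s(\alpha,\beta)=\frac{C}{2}\bigl[s(x,x)+s(y,x)\bigr]=\frac{C}{2}$$

Every associated site of β finds a perfect match among the associated sites of α, but only half of the associated sites of α find a match in β, so the two directions give different values.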

Definition 3 (p-site similarity): given a similarity value threshold p, if the similarity values of any two sites α, β of the site network (according to Definition 2) satisfy s(α, β) > p and s(β, α) > p, then sites α and β are said to be p-site similar.
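Definition 3 requires the threshold to be exceeded in both directions, which matters because the similarity value matrix is not symmetric. The check can be sketched as follows; the function name `p_similar_pairs` and the sample matrix are hypothetical, and sites are identified by their 0-based row index.

```python
def p_similar_pairs(M, p):
    """Return index pairs (a, b), a < b, with M[a][b] > p and M[b][a] > p."""
    n = len(M)
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if M[a][b] > p and M[b][a] > p]


# Hypothetical 3 x 3 similarity value matrix (not symmetric).
M = [
    [1.0, 0.8, 0.9],
    [0.75, 1.0, 0.2],
    [0.5, 0.3, 1.0],
]
print(p_similar_pairs(M, p=0.7))
```

Sites 0 and 2 are not p-site similar here even though M[0][2] is the largest off-diagonal value, because the reverse direction M[2][0] falls below the threshold.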

Definition 4 (site clustering): given an n × n similarity value matrix M, if sites α and β are p-site similar according to M, and there exists no site γ satisfying both of the following conditions: (1) sites α and γ are p-site similar; (2) s(α, β) < s(α, γ); then site α belongs to the cluster in which site β resides.

Step 3, calculating the similarity value matrix M between the sites according to Definitions 2 and 3, which is implemented by the function computeSimilarMatrix; then, following the idea of Definition 4, the algorithm classifies the sites into different clusters according to the similarity value matrix M and the user-specified similarity value probability threshold p, which is implemented by the function Cluster. The corresponding pseudocode is shown in Algorithm 1.

Algorithm 1: StationMiningByLinkRelations(G, k, c, p)

Input: G denotes the site network, k the number of iterations, c the damping factor, and p the similarity value probability threshold

Return: the set of site clusters

M = computeSimilarMatrix(G, k, c);  // M denotes the similarity value matrix

return Cluster(M, p);

step 3.1, calculating a similarity value matrix M between the sites according to definition 2 and definition 3:

Each element of the site similarity value matrix M can be obtained by iterating formula (6) and formula (7). Let n denote the number of sites and k the number of iterations. R(α, β) is an element of the similarity value matrix M and stores the similarity value between each pair of nodes (α, β) during the iteration. R*(α, β) is an element of the temporary similarity value matrix M* and stores the intermediate value of R_i(α, β) calculated in the current iteration, which is copied to the similarity value matrix M at the end of each iteration. First, the similarity value matrix M is initialized according to formula (6), which is implemented by the function initialization. Then, for every two sites α, β, the following procedure is executed k times: (1) for each element O_i(α) of the connected site set O(α) of site α, find the site O_j(β) in the connected site set O(β) of site β that is most similar to O_i(α), and take the similarity value between these two sites as the similarity value between O_i(α) and O(β); (2) calculate the average of these values over all connected sites of α as the similarity value between the two sites; (3) copy the intermediate matrix M* to the matrix M for the next iteration. When the iterations end, the function returns the similarity value matrix M obtained by the iterative computation.
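The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the initial value (1 on the diagonal, 0 elsewhere), the scaling of the average by C, and the value 0 for sites without connected sites are assumptions, since the corresponding formulas are not reproduced in this text. The name `compute_similar_matrix` and the sample network are hypothetical.

```python
def compute_similar_matrix(O, k, C):
    """Iterate the best-match average k times.

    O maps each site to the list of its connected sites (Definition 1);
    every site that appears in a list must also be a key of O.
    Returns a dict {(alpha, beta): similarity value}.
    """
    sites = list(O)
    # Assumed initialization: 1 on the diagonal, 0 elsewhere.
    M = {(a, b): 1.0 if a == b else 0.0 for a in sites for b in sites}
    for _ in range(k):
        M_tmp = {}  # temporary matrix holding this iteration's values
        for a in sites:
            for b in sites:
                if a == b:
                    M_tmp[(a, b)] = 1.0
                elif not O[a] or not O[b]:
                    M_tmp[(a, b)] = 0.0  # assumed: no connected sites, no similarity
                else:
                    # (1) best match in O(b) for every connected site of a
                    best = [max(M[(x, y)] for y in O[b]) for x in O[a]]
                    # (2) average of the best-match values, scaled by C (assumed)
                    M_tmp[(a, b)] = C * sum(best) / len(best)
        M = M_tmp  # (3) the temporary matrix becomes M for the next iteration
    return M


# Hypothetical network: sites 1 and 2 send bicycles to overlapping destinations.
O = {1: [3, 4], 2: [3], 3: [], 4: []}
M = compute_similar_matrix(O, k=3, C=0.8)
print(M[(1, 2)], M[(2, 1)])
```

The two printed values differ, which illustrates the asymmetry of the site similarity value on this toy network.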

Step 3.2, classifying the sites into different clusters according to the similarity value matrix M and the similarity value probability threshold p specified by the user by combining with the definition 4, and realizing by a function Cluster:

Although there are many typical clustering algorithms, such as K-Means, Chameleon and DBSCAN, the clustering problem here has the following features: (1) the number of clusters is not fixed; (2) the features are not spatial; (3) the similarity value matrix between every two sites is obtained in advance. Therefore, the Chameleon algorithm is adopted as the clustering algorithm here, with corresponding changes. The site clustering process determines, from the similarity values between sites, all sets of sites that are p-site similar, and classifies them into different clusters according to the largest-first principle. For convenience of description, some key data structures are introduced first: the array element c[i] denotes the cluster to which site i belongs (c[i] == j means that site i belongs to the cluster of site j), and max[i] denotes the largest similarity value with which site i belongs to the cluster of site j. The clustering algorithm flow is as follows:

(i) determine whether the current row i of the similarity value matrix contains a site that satisfies the two conditions of Definition 3; if not, skip directly to the next loop iteration; otherwise, go to step (ii);

(ii) if there are several such sites, set the site j with the largest similarity value as the control site of site i;

(iii) determine whether the matrix element m_ij > max[i] (that is, whether the similarity value of site i and site j is greater than the largest value found so far); if so, update the control site of site i to site j; otherwise, skip directly to the next loop iteration;

Next, the clusters are merged according to the array c[]. For each c[i], the following is performed: if c[i] == j (j != 0), it is determined whether site i and site j already belong to a cluster:

(i-1) if site i and site j do not belong to any cluster, creating a new cluster by taking site i and site j as elements, and setting the control site of the cluster as site j;
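The clustering flow above can be sketched as follows. This is a hedged illustration only: the two conditions of Definition 3 are not restated in this passage, so the sketch assumes they amount to "j is a different site and m[i][j] is at least p"; the handling of the cases in which site i is already clustered is also an assumption, and `None` is used in place of 0 to mean "no control site" because the sketch numbers sites from 0. The names `cluster_sites`, `best` and `cluster_of` are hypothetical.

```python
def cluster_sites(m, p):
    """Group sites by their most similar partner above threshold p."""
    n = len(m)
    c = [None] * n     # c[i]: control site of site i (None if there is none)
    best = [0.0] * n   # plays the role of max[i] in the text
    for i in range(n):
        for j in range(n):
            # Assumed reading of Definition 3 (see lead-in).
            if j != i and m[i][j] >= p and m[i][j] > best[i]:
                best[i] = m[i][j]
                c[i] = j
    cluster_of = {}    # site -> cluster id
    clusters = {}      # cluster id -> control site and member set
    for i in range(n):
        j = c[i]
        if j is None:
            continue   # site i stays an outlier
        if i not in cluster_of and j not in cluster_of:
            # Neither site is clustered: create a cluster controlled by j.
            cid = len(clusters)
            clusters[cid] = {"control": j, "members": {i, j}}
            cluster_of[i] = cluster_of[j] = cid
        elif i not in cluster_of:
            # Site j is already clustered: site i joins that cluster.
            cid = cluster_of[j]
            clusters[cid]["members"].add(i)
            cluster_of[i] = cid
        elif j not in cluster_of:
            # Assumption: site j joins the cluster of site i.
            cid = cluster_of[i]
            clusters[cid]["members"].add(j)
            cluster_of[j] = cid
        # Both already clustered: left unchanged in this sketch.
    return c, clusters


# Made-up similarity values for four sites, threshold p = 0.7.
m = [
    [1.00, 0.80, 0.10, 0.00],
    [0.75, 1.00, 0.20, 0.00],
    [0.10, 0.72, 1.00, 0.00],
    [0.00, 0.10, 0.20, 1.00],
]
control, clusters = cluster_sites(m, 0.7)
print(control)   # [1, 0, 1, None]
print(clusters)  # one cluster with members {0, 1, 2}; site 3 is an outlier
```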

The technical effects of the invention are illustrated below with reference to specific experimental data:

1. Practical significance of the algorithm results

The experimental data are 5 days of borrowing and returning records from the Ningbo public bicycle system, from which the fields (borrowing station number, borrowing time, returning station number and returning time) are extracted, combined with station information data (station number, station name, station longitude, station latitude, etc.). The experimental development environment is the Windows 7 operating system, and the system is developed in Visual C++.
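As an illustration of how such records could feed the station network, the sketch below aggregates trip records into directed triples (α, β, r), where r is the number of bicycles ridden from station α to station β, and derives the connected-station set O(α). The record layout follows the four fields named above; the function name `build_station_network` and the sample records are hypothetical, and the experiments themselves were written in Visual C++, not Python.

```python
from collections import Counter, defaultdict


def build_station_network(trips):
    """Aggregate trip records into (alpha, beta, r) triples and O(alpha).

    trips: iterable of
        (borrow_station, borrow_time, return_station, return_time).
    """
    counts = Counter()
    for borrow_station, _borrow_time, return_station, _return_time in trips:
        counts[(borrow_station, return_station)] += 1
    triples = [(a, b, r) for (a, b), r in counts.items()]
    out_neighbors = defaultdict(set)
    for a, b, r in triples:
        if r > 0:  # O(alpha) only contains stations actually reached
            out_neighbors[a].add(b)
    return triples, dict(out_neighbors)


# Made-up records; the station numbers are for illustration only.
sample = [
    (50, "08:01", 201, "08:15"),
    (50, "08:05", 201, "08:20"),
    (201, "09:00", 209, "09:10"),
]
triples, out_neighbors = build_station_network(sample)
print(sorted(triples))  # [(50, 201, 2), (201, 209, 1)]
print(out_neighbors)    # {50: {201}, 201: {209}}
```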

Existing methods calculate site similarity values from features of the individual site (e.g. location, borrowing statistics, or a combination of both), whereas SCSR calculates them from the connections between sites, so SCSR is not directly comparable with previous algorithms. The invention therefore focuses on explaining why the clusters form and what practical significance they have. Fig. 1 shows the clustering result calculated by the SCSR algorithm (parameter p = 0.7). The water-drop markers indicate sites that do not belong to any cluster (outliers), and sites with the same shape and color constitute one cluster. The figure shows that the SCSR algorithm has the following characteristics:

(1) Clusters within crowd-gathering regions typically cross and overlap, as in region 1 of Fig. 1. This is because the similarity value calculation of the SCSR algorithm considers the association relationship between sites rather than the site location. Crowd flow in a crowd-gathering region is complex, so the clusters that SCSR forms from site associations no longer have positional features. For example, the clusters within region 1 of Fig. 1 are not only numerous but also cross and overlap, with no positional pattern among them.

(2) Clusters in regions without crowd gathering have certain regional characteristics, such as the clusters in region 2 (cluster 22) and region 3 (cluster 1) in Fig. 1. Although these clusters are formed from connection relationships, they show regional characteristics because the crowd flow itself is regional to some extent.

(3) Most sites do not belong to any cluster, i.e. they are outliers. The reason is that the capacity of a public bicycle station is limited, so abnormal situations in which no bicycle can be borrowed or no dock is free occur frequently, which distorts the connection-based data.

Fig. 2 shows two adjacent and relatively independent commercial tourist areas of the Tianyi Square-Drum Tower business district: Tianyi Square (the circled area), which is purely commercial and tourist in nature, and the Drum Tower pedestrian street (the boxed area), which combines residential use (periphery) with commercial and tourist use (center). Cluster A (circular sites 50#, 201#, 209#) and cluster B (square sites 160#, 478#, 260#) are two clusters in region 1. The two clusters overlap in area to some extent. However, judged both by the urban functional area in which it is located (the 160# site of cluster B lies on the north side of the Drum Tower and should be more similar to the 50# site) and by map location (the 160# site is closer to the 50# and 209# sites), the 160# site appears closer to cluster A than to cluster B. Next, the actual association data are analyzed in detail to illustrate the association characteristics of the sites in cluster A and cluster B, the reasons the clusters form and the practical meaning they represent, to show the difference between the two clusters, and to verify the difference between the 160# site and the member sites of cluster A.

First, the different practical meanings of the two clusters are analyzed from the perspective of urban functional features. Fig. 3 shows the location distribution of all sites associated with the member sites of cluster A. The sites associated with all member sites of cluster A fall into three distinct, independent areas: area A1 surrounded by Zhongshan West Road, Cuibai Road and Tanjin Road, area A2 of the Drum Tower pedestrian street, and area A3 of the Tianyi Square business district. Similarly, Fig. 4 shows the distribution of all sites associated with all member sites of cluster B. These associated sites fall into four areas: the Haishu area B1 (including area A1) around Liberation Road, Zhongshan West Road and Cuibai Road, the Tianyi Square business district B2 (A2 and A3), the old Jiangdong area B3 near Jiangxia Bridge, and the Jiangbei area B4 near Liberation Bridge. Comparing the two figures shows that:

(1) Cluster A is the source site cluster for areas A1, A2 and A3 in Fig. 3. The target sites associated with all member sites of cluster B include, in addition to the areas corresponding to cluster A (A1, A2 and A3 in Fig. 3), the Jiangdong area B3 near Jiangxia Bridge and the Jiangbei area B4 near Liberation Bridge. In this respect the association relationships of cluster A and cluster B already differ greatly.

(2) Comparing the associated target sites of cluster A and cluster B, it can be seen that the 50# member site of cluster A is located west of the Drum Tower pedestrian street (area A2 of Fig. 3) and Zhongshan Park, while the other two member sites (209# and 201#) are located between the Drum Tower pedestrian street, Tianyi Square and Lake Park. The three member sites of cluster B are all located on the northern edge of the Tianyi Square-Drum Tower business district and are connected by roads that are not main roads and are suitable for cycling. In addition, cluster A is associated with a smaller area around Tianyi Square than cluster B, but with a larger area around the Drum Tower. The above analysis shows that cluster A exhibits the site features of a region combining a business district and a tourist district, whereas cluster B exhibits the site features of the northern edge of the Tianyi Square-Drum Tower business district.

Next, the reasons for the formation of cluster A and cluster B are analyzed using the actual association data. Table 1 and Table 2 show the associated sites of cluster A and cluster B, respectively, together with the corresponding association data. The columns of each table are the member sites of the cluster, the rows are the sites associated with the member sites, and each entry is the number of bicycles ridden from the corresponding member site to the corresponding associated site. Comparing the two tables shows that: (1) the common associated site sets of cluster A and cluster B are very different; (2) although the 160# site is similar to the member sites of cluster A in location and urban function, its site characteristics (associated sites and corresponding counts) are very different from those of the member sites of cluster A.

Table 1. Common associated sites of cluster A member sites and association counts

Table 2. Common associated sites of cluster B member sites and association counts

Table 3. Number of bicycles from cluster A by time interval and region

Table 4. Number of bicycles from cluster B by time interval and region

Finally, in order to better analyze the association relationships of the sites, the association data of one day are divided, following the document [tide analysis based on position and direction], into borrowing data for four intervals: the morning commuting peak (6:00 to 9:00), the working interval (9:00 to 15:00), the evening commuting peak (15:00 to 19:00) and the evening leisure interval (19:00 to 22:00), and the change between each cluster and its associated sites is analyzed. Table 3 gives the association matrix between cluster A and each associated area in the different time intervals. Table 4 gives the association matrix between cluster B and each associated area in the different time intervals. Based on the data of Tables 3 and 4, Fig. 5 and Fig. 6 plot the hourly averaged association data for cluster A and cluster B, respectively. During the daytime, the association strength of cluster A decreases in the order: the Drum Tower pedestrian street A2, Tianyi Square A3, and the residential area A1 bounded by Cuibai Road, Zhongshan West Road and Xiaowen Street. That is, in the daytime the Drum Tower is the area most strongly associated with cluster A and receives the most bicycles, followed by Tianyi Square and finally the residential area A1. After 15:00, however, the number of bicycles going from cluster A to the Drum Tower and Tianyi Square decreases sharply, while the number going to the residential area A1 begins to increase, because people begin to flow from the business district to the residential district once the stores close in the evening.
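The interval-based aggregation described above can be sketched as follows. The four intervals come from the text; treating each interval as half-open (start hour included, end hour excluded), the record layout, and the names `interval_of`, `hourly_flow` and `region_of` are assumptions made for illustration.

```python
from collections import Counter

# (name, start hour, end hour); start inclusive, end exclusive (assumed).
INTERVALS = [
    ("morning peak", 6, 9),
    ("working", 9, 15),
    ("evening peak", 15, 19),
    ("evening leisure", 19, 22),
]


def interval_of(hour):
    """Return the name of the interval containing hour, or None."""
    for name, start, end in INTERVALS:
        if start <= hour < end:
            return name
    return None


def hourly_flow(trips, members, region_of):
    """Average hourly number of bicycles from a cluster to each region.

    trips: iterable of (borrow_station, borrow_hour, return_station).
    members: set of member stations of the cluster.
    region_of: mapping from return station to associated region label.
    """
    lengths = {name: end - start for name, start, end in INTERVALS}
    counts = Counter()
    for src, hour, dst in trips:
        name = interval_of(hour)
        region = region_of.get(dst)
        if src in members and name is not None and region is not None:
            counts[(name, region)] += 1
    return {key: cnt / lengths[key[0]] for key, cnt in counts.items()}


# Made-up records: stations 300 and 301 stand in for associated sites.
flow = hourly_flow(
    [(50, 7, 301), (201, 8, 301), (209, 10, 300), (160, 7, 301), (50, 23, 301)],
    members={50, 201, 209},
    region_of={300: "A1", 301: "A2"},
)
print(flow)
```

In this made-up example the trip from station 160 is ignored because 160 is not a member of the cluster, and the trip at hour 23 is ignored because it falls outside all four intervals.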

Similarly, during the daytime the association strength of cluster B decreases in the order: region a (the Tianyi Square business district B2 (A2 and A3)), region b (the Haishu area B1 (including A1) around Liberation Road, Zhongshan West Road and Cuibai Road, and the Jiangbei area B4 near Liberation Bridge), and region c (the old Jiangdong area B3 near Jiangxia Bridge). As with cluster A, the Tianyi Square-Drum Tower business district is where most bicycles gather during the day and is the most strongly associated area, followed by the Haishu residential area, and finally the old Jiangdong area near Jiangxia Bridge, with which cluster A has no association. After 15:00, the number of bicycles ridden to the Drum Tower and Tianyi Square decreases and the bicycle flow concentrates on the residential areas, again because the stores close in the evening and people leave work and head home.

2. Influence of parameters

The algorithm has two parameters: the damping coefficient C and the similarity value probability threshold p. The damping coefficient C is set to a fixed value (0.65), and the influence of the similarity value probability threshold p on the algorithm result is analyzed. Fig. 7, Fig. 8, Fig. 9 and Fig. 10 show the results of the algorithm for p = 0.68, 0.69, 0.70 and 0.71, respectively. The figures show that the larger the threshold p, the fewer the clusters, because the number of sites that satisfy the p condition after the association data of each site are calculated decreases as p increases. For example, clusters 9 and 29 in Fig. 7 gradually disappear. The clusters also gradually become smaller. For example, the 47# cluster (the circled site cluster) becomes smaller as p increases. In Fig. 7, the 47# cluster covers essentially the whole of the Jiangbei district south of "New Road", whereas in Fig. 8 the member sites near the exit (near the bridge connecting the Haishu district and the Jiangdong district, marked by boxes) no longer belong to the 47# cluster. In Fig. 9 and Fig. 10, the 47# cluster contains only three and two sites, respectively, and its members are located in the most densely populated part of the Jiangbei district south of "New Road". This is especially clear for the two sites in Fig. 10: the Shengbaolu bicycle rental point is located near the Ningbo Bund, and the Jiangbei administrative center bicycle rental point is located near the Jiangbei administrative center; both are densely populated areas of the Jiangbei district of Ningbo.
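The effect of the threshold can be illustrated with a small sketch that counts, for each p, the sites that still have at least one sufficiently similar partner. The similarity values below are made up, the helper name `sites_with_partner` is hypothetical, and "sufficiently similar" is assumed to mean a similarity value of at least p with a different site.

```python
def sites_with_partner(m, p):
    """Sites that have at least one other site with similarity >= p."""
    n = len(m)
    return [
        i for i in range(n)
        if any(j != i and m[i][j] >= p for j in range(n))
    ]


# Made-up similarity value matrix for four sites.
m = [
    [1.000, 0.705, 0.200, 0.100],
    [0.695, 1.000, 0.300, 0.100],
    [0.685, 0.200, 1.000, 0.100],
    [0.500, 0.400, 0.300, 1.000],
]
for p in (0.68, 0.69, 0.70, 0.71):
    print(p, sites_with_partner(m, p))
```

As p rises from 0.68 to 0.71 the list of qualifying sites shrinks from three sites to none, which mirrors the shrinking and disappearing clusters reported for Fig. 7 to Fig. 10.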

While the foregoing description shows and describes several preferred embodiments of the invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein, is not to be construed as excluding other embodiments, and is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept expressed herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection of the appended claims.

Claims

1. A public bicycle site clustering method based on incidence relation is characterized by comprising the following steps:

step 1, defining a SimRank algorithm;

step 2, defining a public bicycle station similarity value based on the SimRank algorithm according to the characteristics of the urban public bicycle system; wherein Definition 1 is the public bicycle station topological network, Definition 2 is the station similarity value, Definition 3 is p-station similarity, and Definition 4 is station clustering;

step 3, calculating a similarity value matrix M between the sites according to the definitions 2 and 3;

step 4, classifying the sites into different clusters according to the similarity value matrix M and a similarity value probability threshold p specified by a user, in combination with Definition 4.

2. The public bicycle site clustering method based on incidence relation as claimed in claim 1, wherein the defined SimRank algorithm in step 1 is specifically:

where C is a constant between 0 and 1, called the damping coefficient or attenuation factor; O(α) and O(β) denote the sets of nodes connected to nodes α and β, respectively; |O(α)| and |O(β)| denote the numbers of elements of the sets O(α) and O(β), respectively; O_i(α) denotes the i-th node connected to node α; and O_j(β) denotes the j-th node connected to node β. Equation (1) states that the SimRank similarity value between two nodes α and β, based on the node characteristics, is the average of the similarity values between each associated node of α and each associated node of β;

R_0(α, β) is the lower bound of S(α, β), and

in each iteration, R_i(α, β) is calculated from R_{i-1}(α, β) according to the following formula:

Equation (4) is expressed in matrix notation as:

3. The public bike station clustering method based on incidence relation as claimed in claim 1, wherein the public bike station similarity values based on the SimRank algorithm in step 2 are defined according to the characteristics of the urban public bike system as follows:

step 2.1, Definition 1, public bicycle station topological network: the public bicycle station topological network G = (V, E) is referred to simply as the station network; nodes α, β represent two stations of V, and a triplet (α, β, r) represents a directed connection between stations α and β, where r is the number of bicycles ridden from station α to station β within a specified time; O(α) denotes the set of stations α_i for which a triplet (α, α_i, r) with r > 0 exists, where O_i(α) denotes any element of O(α);

wherein max(O_i(α), O(β)) = max_{j=1...|O(β)|} s(O_i(α), O_j(β)), that is, the maximum similarity value between the associated site O_i(α) of site α and any associated site of site β; given n sites, the matrix M of similarity values between every two sites is called the n × n similarity value matrix;

Inference 1: s(α, β) has the following properties:

Asymmetry: s(α, β) is not necessarily equal to s(β, α)

Monotonicity: 0 ≤ s^k(α, β) ≤ s^{k-1}(α, β) ≤ 1

step 2.4, Definition 4, station clustering: given a similarity value matrix M (n × n), if sites α and β satisfy p-site similarity, and there exists no site γ that satisfies both of the following conditions: (a) sites α and γ are p-site similar; (b) s(α, β) < s(α, γ); then site α belongs to the cluster in which site β resides.

4. The public bike station clustering method based on incidence relation as claimed in claim 1, wherein the calculating of the similarity value matrix M between stations according to Definition 2 and Definition 3 in step 3 is specifically: the site similarity value matrix M is obtained by iterating formula (6) and formula (7), where n denotes the number of sites, k denotes the number of iterations, and R(α, β) is an element of the similarity value matrix M that stores the similarity value of each node pair (α, β) during each iteration; R*(α, β) is an element of the temporary similarity value matrix M*, which stores the intermediate value of R_i(α, β) calculated in the current iteration and is copied to the similarity value matrix M at the end of each iteration; first, the similarity value matrix M is initialized according to formula (6), which is implemented by the function initialization; then, for every two sites α, β, the following procedure is performed k times: for each element O_i(α) of the connected-site set O(α) of site α, the site O_j(β) most similar to O_i(α) is found in the connected-site set O(β) of site β, and the similarity value between these two sites is taken as the similarity value of O_i(α) and O(β); the average of these values over all connected sites of α is calculated as the similarity value between the two sites; third, the intermediate matrix M* is copied to the matrix M for the next iteration, and the iteration ends; the function returns the similarity value matrix M obtained by the iterative computation.

5. The public bicycle station clustering method based on incidence relation according to claim 1, wherein the classifying of the stations into different clusters according to the similarity value matrix M and the user-specified similarity value probability threshold p in the step 4, in combination with the definition 4, is specifically:

(ii) if there are a plurality of sites, setting a site j having the largest similarity value as a control site of the site i;

next, the clusters are merged according to the array c[]; for each c[i], the following is performed: if c[i] == j (j != 0), it is determined whether site i and site j already belong to a cluster:

(ii-1) if site j already belongs to a cluster, then add site i to the cluster that site j belongs to.