CN107609107B

CN107609107B - Travel co-occurrence phenomenon visual analysis method based on multi-source city data

Info

Publication number: CN107609107B
Application number: CN201710820085.1A
Authority: CN
Inventors: 孔祥杰; 李梦琳; 夏锋; 赵高兴; 刘程程
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2017-09-13
Filing date: 2017-09-13
Publication date: 2020-07-14
Anticipated expiration: 2037-09-13
Also published as: CN107609107A

Abstract

The invention belongs to the technical field of urban mobile data analysis and discloses a visual analysis method for travel co-occurrence phenomena based on multi-source urban data. The method first divides the city into regions using basic road network data and a simulation tool, then builds a co-occurrence model over the regions and mines association rules between regions from taxi trajectory data using the model and user-set parameters, then mines regional functions by combining city point-of-interest (POI) data, and finally displays the co-occurrence mining results and the regional functions visually. By drawing on multi-source urban data (taxi trajectory data, urban road network data and POI data), the invention supports comprehensive, multi-angle visual analysis and exploration of regional co-occurrence phenomena and urban regional functions, provides effective information for urban traffic planning, makes the intrinsic relationships in the data easy to analyze, and is highly operable.

Description

Travel co-occurrence phenomenon visual analysis method based on multi-source city data

Technical Field

The invention belongs to the technical field of urban mobile data analysis, and particularly relates to a travel co-occurrence visualization analysis method based on multi-source urban data.

Background

With the rapid development of urban traffic, large amounts of mobility data are generated. These data carry rich temporal and spatial attributes that faithfully reflect how people move around a city. Taxis are an important component of urban transport and offer great convenience to residents. Regular travel patterns in a city can be discovered from taxi trajectory data, and such findings are of great significance for understanding the city's structure. We define co-occurrence as follows: if people from region A and region B visit region C within the same time interval, we say that "region A and region B co-occur in region C", and that region A and region B each participate in a co-occurrence event. The regularities exhibited by all co-occurrence events in a city are the subject of our analysis: the co-occurrence phenomenon. Analyzing the co-occurrence phenomenon yields valuable information for city planning, business strategy formulation, the study of contact-transmitted infectious diseases, and so on. Road network data is the most commonly used geographic data in urban research and is usually represented as a graph. Nodes in the graph represent intersections and have unique geographic coordinates; edges represent road segments, each connecting two nodes; other attributes, such as length, speed limit, road type and number of lanes, are associated with the edges. Point of interest (POI) data (e.g., restaurants, shopping malls) generally consist of a name, address, category and geographic coordinates and describe the basic attributes of geographic units. Such data are obtained mainly through manual identification by map data providers or through free editing by users of open-source online map websites.

However, taxi trajectory data are voluminous, complex and abstract, so mining information from them is not easy. Visualization technology combines visual charts with human-computer interaction: it simplifies the analysis process, and the user can modify the parameters of the analysis model interactively to generate new visualization results. Visual analysis therefore allows more valuable information to be mined from taxi trajectory data.

The method uses urban taxi trajectory data, road network data and POI data, with the aim of exploring travel co-occurrence phenomena from multiple aspects using multi-source urban data and mining the value hidden in these phenomena.

Disclosure of Invention

The invention addresses the above difficulties in data analysis and provides a travel co-occurrence visualization analysis method based on multi-source urban data. Regions are divided based on road network data, and co-occurrence data reflecting the connections between regions are extracted by processing taxi trajectory data; regional functions are mined by combining urban POI data; finally, the co-occurrence results and the mined regional functions are displayed visually, providing effective information for understanding the city structure.

The technical scheme of the invention is as follows:

A visual analysis method for travel co-occurrence phenomena based on multi-source urban data comprises the following steps:

S1: Preprocess the raw data

S1.1: Clean the taxi operation trajectory data and standardize the basic taxi data;

S1.2: Clean the raw POI data and normalize it;

S2: Divide the data preprocessed in step S1 by time and region

S2.1: Time division: divide one day into T time intervals according to the characteristics of the driving patterns;

S2.2: Region division: divide the urban space into R regions according to the urban road network;

S3: Mine regional functions from the data divided in step S2

S3.1: Regional function classification: classify urban regional functions into F classes;

S3.2: Calculate the frequency of POIs in each region. The symbol $TF_{i,j}$ denotes the frequency with which POIs of the j-th category occur in region $r_i$, calculated as:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k=1}^{F} n_{i,k}}$$

where $n_{i,j}$ is the number of POIs of the j-th category in region $r_i$, and F is the number of POI categories;

S3.3: Calculate the inverse document frequency of the j-th POI category, denoted $IDF_j$, where R is the total number of regions:

$$IDF_j = \log \frac{R}{\left|\{\, r_i : n_{i,j} > 0 \,\}\right|}$$

S3.4: The product of $TF_{i,j}$ and $IDF_j$ is the TF-IDF value of region $r_i$ for the j-th POI category, which represents the static function distribution of the region:

$$\text{TF-IDF}_{i,j} = TF_{i,j} \times IDF_j$$

S3.5: Perform topic mining on the OD data from step S2 using the LDA topic model algorithm. The final result for region $r_i$ is the vector $z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,F})$, which represents the dynamic function distribution of region $r_i$, where $z_{i,k}$ is the proportion of the k-th class of regional function in region $r_i$;

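S3.6: Calculate the dynamic function similarity between region $r_i$ and region $r_m$, denoted $\lambda_{i,m}$, where cos denotes the cosine of the angle between two vectors and $z_i = (z_{i,1}, \dots, z_{i,F})$ is the dynamic function distribution of region $r_i$:

$$\lambda_{i,m} = \cos(z_i, z_m) = \frac{z_i \cdot z_m}{\lVert z_i \rVert \, \lVert z_m \rVert}$$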

S3.7: Define a cost function J (the objective function) that represents the deviation between the function a region actually performs and what it exhibits in its static and dynamic aspects, and find the minimum of the cost function. The cost function involves R, the total number of regions; the true function distribution of region $r_i$, which is the quantity finally sought; and the POI distribution of region $r_j$;

S4: Mine co-occurrence events from the data divided in step S2 using an association rule mining algorithm

S4.1: Extract co-occurrence transactions from the data of step S2;

S4.2: Mine frequent item sets from the transactions extracted in S4.1 using the Apriori association rule algorithm;

S4.3: Calculate correlation statistics for the frequent item sets obtained in S4.2: the support, the confidence, the all-confidence (all_confidence), the max-confidence (max_confidence), the lift, the Kulczynski measure (Kulc), the imbalance ratio (IR) and the cosine between region A and region B, where P denotes probability:

support＝P(A∪B)

S5: Visually display the co-occurrence results

S5.1: Mark the different regional functions on the map in different colors according to the regional function map calculated in S3.7;

S5.2: Draw a global co-occurrence map from the frequent item sets mined in S4.2; the map is drawn from two aspects, co-occurrence relations and co-occurrence participation;

S5.3: Draw a regional co-occurrence ring heat map from the frequent item sets mined in S4.2; the ring heat map focuses on analyzing the co-occurrence patterns between regions;

S5.4: Draw a parallel coordinates plot from the statistics calculated in S4.3; the plot measures the correlation between two regions across the indicators.

The beneficial effects of the invention are as follows: by drawing on multi-source urban data (taxi trajectory data, urban road network data and POI data), the invention supports comprehensive, multi-angle visual analysis and exploration of regional co-occurrence phenomena and urban regional functions, provides effective information for urban traffic planning, makes the intrinsic relationships in the data easy to analyze, and is highly operable.

Drawings

FIG. 1 is a block diagram of the process;

FIG. 2 is a data processing flow chart of a travel co-occurrence visualization analysis method based on multi-source city data;

FIG. 3 is a region division diagram of a travel co-occurrence phenomenon visualization analysis method based on multi-source city data;

FIG. 4 shows the global visualization of co-occurrence events after co-occurrence mining on Shanghai taxi data from April 2015 according to an embodiment of the present invention;

FIG. 5 shows the global co-occurrence heat visualization after co-occurrence mining on Shanghai taxi data from April 2015 according to an embodiment of the present invention;

FIG. 6 shows the local visualization of regional co-occurrence heat after co-occurrence mining on Shanghai taxi data from April 2015 according to an embodiment of the present invention;

FIG. 7 shows the visualization of regional function mining using Shanghai POI data and taxi data from April 2015 according to an embodiment of the present invention;

FIG. 8 shows the visualization of the regional similarity statistics after co-occurrence mining on Shanghai taxi data from April 2015 according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.

The embodiment of the invention provides a travel co-occurrence visualization analysis method based on multi-source city data. The system flow is shown in FIG. 1 and the data processing flow is shown in FIG. 2. The method comprises the following steps:

S1: Extract useful data from the original data set, as follows:

S1.1: Cleaning of the taxi operation trajectory data. The data are Shanghai taxi trajectory data covering the 30 days from April 1, 2015 to April 30, 2015. Studying the co-occurrence phenomenon requires the OD (origin-destination) data of passenger-carrying taxis, so the pick-up and drop-off times, the longitude and latitude of the pick-up and drop-off locations, and the related OD attributes are extracted from the original data set, as shown in Table 1:

TABLE 1

The distance recorded in the original data set is a straight-line distance, whereas city roads are laid out in a largely regular pattern. After further analysis and comparison of distances within the city, the original distance is discarded and the Manhattan distance between the two points is calculated from the original longitude and latitude. When a taxi carries no passengers it slows down to look for customers, and such trips have little bearing on urban mobility patterns, so the unloaded trajectories are filtered out. This significantly reduces the data volume and eases subsequent analysis and calculation. The extracted data attributes are as follows:

TABLE 2

After data extraction, some trajectories are found to take a long time but cover a short distance. These are judged to be abnormal, and they are cleaned by calculating the average speed and deleting records whose speed is too low. The average speed is the Manhattan distance divided by the travel time, where the travel time is computed from the passenger's pick-up and drop-off times and the Manhattan distance from the longitudes and latitudes. The following attributes are added to the data set:

TABLE 3

No.	Name	Note
9	interval	Travel time
10	speed	Average speed
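The distance and speed cleaning described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the field names (`on_time`, `off_time`, `on_lat`, `on_lon`, `off_lat`, `off_lon`), the timestamp format, the units (hours and km/h) and the 5 km/h threshold are all assumptions, since the source only says that speeds that are "too low" are deleted. The Manhattan distance is approximated as a north-south leg plus an east-west leg on a spherical Earth.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def manhattan_km(lat1, lon1, lat2, lon2):
    """Approximate Manhattan distance: north-south leg plus east-west leg."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    north_south = math.radians(abs(lat2 - lat1)) * EARTH_RADIUS_KM
    east_west = math.radians(abs(lon2 - lon1)) * EARTH_RADIUS_KM * math.cos(mean_lat)
    return north_south + east_west


def add_interval_and_speed(trip, time_format="%Y-%m-%d %H:%M:%S"):
    """Return a copy of the trip with `interval` (hours) and `speed` (km/h)."""
    start = datetime.strptime(trip["on_time"], time_format)
    end = datetime.strptime(trip["off_time"], time_format)
    interval_h = (end - start).total_seconds() / 3600.0
    dist = manhattan_km(trip["on_lat"], trip["on_lon"], trip["off_lat"], trip["off_lon"])
    # Trips with a non-positive duration get speed 0 and are dropped by the filter.
    speed = dist / interval_h if interval_h > 0 else 0.0
    return {**trip, "interval": interval_h, "speed": speed}


def clean_trips(trips, min_speed_kmh=5.0):
    """Keep only trips whose average speed reaches the (hypothetical) threshold."""
    enriched = [add_interval_and_speed(t) for t in trips]
    return [t for t in enriched if t["speed"] >= min_speed_kmh]
```

In practice the threshold would be chosen by inspecting the speed distribution of the data rather than fixed in advance.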

S1.2: The cleaning of the raw POI data mainly consists of extracting useful information from the raw data, correcting misclassified records, and ensuring the integrity of each record. The extracted information is shown in Table 4.

TABLE 4

No.	Name	Note
1	Number	Ranges from 0 to 110769 and uniquely identifies one POI record
2	Name	Name of the POI
3	Latitude	GPS latitude of the POI
4	Longitude	GPS longitude of the POI
5	Three-level directory	Three-level directory of POI categories

S2: The data obtained in S1 are divided by time and by region, as follows:

S2.1: The time division proceeds as follows:

The length of a time slice is determined from the statistical distribution of OD travel times. The travel times for the week from April 4 to April 10 were counted, and about 85% of ODs were found to have a travel time within 30 minutes. Therefore 30 minutes is chosen as the length of a time slice and one day is divided into 48 time slices: 0:00 to 0:30 is numbered 0, and so on, up to 23:30 to 0:00, which is numbered 47. Each OD's time slice is determined from the time the passenger boards. The attribute label_time (0 to 47) is then added to the data set to indicate the time slice to which the OD belongs.

TABLE 5

No.	Name	Note
11	label_time	Number of the time slice to which the OD belongs
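The time-slice labelling can be illustrated with a few lines of Python. This is a sketch under assumptions: the timestamp format is hypothetical, and only the 30-minute slice length and the 0 to 47 numbering come from the text.

```python
from datetime import datetime

SLICE_MINUTES = 30  # slice length chosen in S2.1


def label_time(pickup, time_format="%Y-%m-%d %H:%M:%S"):
    """Return the index (0 to 47) of the half-hour slice containing the pick-up time."""
    t = datetime.strptime(pickup, time_format)
    return (t.hour * 60 + t.minute) // SLICE_MINUTES
```

For example, a pick-up at 00:29:59 falls in slice 0, while one at 23:30:00 falls in slice 47.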

S2.2: The region division proceeds as follows:

Region division splits the whole study area into regions so that each taxi OD can be mapped to an OD between urban regions, allowing the spatial distribution of the co-occurrence phenomenon to be presented visually. To achieve this, the algorithm must provide two functions: 1) divide the urban space in the plane into regions and number each region; 2) map any given longitude and latitude to the region that contains it.

Urban roads are built according to city planning, which divides a city into regular blocks, and these blocks often show a functional bias, i.e., a block gathers points with similar functions. It is therefore reasonable to divide the city into regions along its roads.

Urban roads of level two and above within the range N31.15-N31.37, E121.31-E121.84 of Shanghai are selected to divide this range into regions. The specific steps are as follows:

1) Dilate the road image to remove small gaps at road intersections;

2) Thin the dilated image so that each road is one pixel wide;

3) Label the thinned image: an algorithm assigns a number to every pixel, and pixels in the same region receive the same number;

4) Remove the pixels that represent roads from the labeled image by assigning each of them to an adjacent region;

This processing yields the region division of Shanghai, which divides the city into 541 regions in total. The result is shown in FIG. 3. Two attributes, the start region number and the end region number, are then added to the OD data from S1, and a region number attribute is added to each POI.

TABLE 6

No.	Name	Note
12	label_start	Number of the OD start region
13	label_end	Number of the OD end region
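The labelling step and the coordinate-to-region lookup can be sketched in plain Python. This is a simplified illustration rather than the patent's algorithm: it assumes the road image has already been dilated and thinned, encodes it as a small grid in which 1 marks a road pixel (an assumed encoding), uses 4-connected flood fill for the labelling, and leaves road pixels labelled -1 instead of assigning them to a neighbouring region. Row 0 is assumed to be the northern edge of the bounding box.

```python
from collections import deque

ROAD = 1  # pixel value for a road; 0 means open land (assumed encoding)


def label_regions(grid):
    """4-connected component labelling of non-road pixels; road pixels stay -1."""
    rows, cols = len(grid), len(grid[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == ROAD or labels[r][c] != -1:
                continue
            # Flood-fill a new region starting from this unlabelled pixel.
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] != ROAD and labels[ny][nx] == -1):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label


def region_of(lat, lon, labels, bounds):
    """Map a coordinate to the label of the pixel containing it (None if outside)."""
    lat_min, lat_max, lon_min, lon_max = bounds
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        return None
    rows, cols = len(labels), len(labels[0])
    row = min(int((lat_max - lat) / (lat_max - lat_min) * rows), rows - 1)
    col = min(int((lon - lon_min) / (lon_max - lon_min) * cols), cols - 1)
    return labels[row][col]
```

Applying `region_of` to the pick-up and drop-off coordinates of each OD would give the `label_start` and `label_end` attributes of Table 6.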

S3: Mine regional functions from the data obtained in step S2, as follows:

S3.1: Regional function classification. Regional functions are classified into 6 categories (residential, work, education, business, public service and scenic spots), and each POI record is assigned to one category according to its three-level directory.

S3.2: Calculate the frequency of POIs in each region. The symbol $TF_{i,j}$ denotes the frequency with which POIs of the j-th category occur in region $r_i$, where $n_{i,j}$ is the number of POIs of the j-th category in region $r_i$ and F is the number of POI categories:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k=1}^{F} n_{i,k}}$$

TF-IDF_i,j＝TF_i,j×IDF_j
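The static function distribution can be computed along the following lines. This is a sketch assuming the conventional TF-IDF definitions (natural logarithm, no smoothing), with regions playing the role of documents and POI categories the role of terms; the function name `tf_idf` and the nested-list input layout are illustrative choices, not taken from the source.

```python
import math


def tf_idf(poi_counts):
    """poi_counts[i][j] is the number of POIs of category j in region i."""
    num_regions = len(poi_counts)
    num_categories = len(poi_counts[0])
    # Number of regions that contain at least one POI of each category.
    regions_with = [sum(1 for row in poi_counts if row[j] > 0)
                    for j in range(num_categories)]
    idf = [math.log(num_regions / n) if n > 0 else 0.0 for n in regions_with]
    result = []
    for row in poi_counts:
        total = sum(row)
        tf = [n / total if total > 0 else 0.0 for n in row]
        result.append([tf[j] * idf[j] for j in range(num_categories)])
    return result
```

A category that appears in every region gets an IDF of zero, so it contributes nothing to distinguishing one region's function from another's.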

S3.5: Compared with fixed half-hour time slices, aggregating each region's OD data over variable-length periods that follow the driving patterns agrees better with the real situation. Table 7 shows the time division for working days and Table 8 shows the time division for non-working days.

TABLE 7

Period	Start time	End time
1	02:30:00	04:29:59
2	04:30:00	07:29:59
3	07:30:00	10:29:59
4	10:30:00	14:59:59
5	15:00:00	16:59:59
6	19:30:00	02:29:59

TABLE 8

Taking one region as the reference, with the time periods as columns (18 columns in total, since OD inflow and outflow are distinguished) and the 541 regions as rows, a 541 × 18 matrix is obtained; there are 541 such matrices, one per reference region. Combining the 541 matrices yields a 541 × 9738 (9738 = 541 × 18) matrix, denoted matrix D;
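One possible layout of matrix D is sketched below. The exact column ordering and the meaning of inflow versus outflow are assumptions of this sketch: each reference region contributes a block of `2 * num_periods` columns, with outflow counts first and inflow counts second. With 541 regions and 9 periods (an inferred split of the 18 columns) this gives the 541 × 9738 shape stated in the text.

```python
def build_matrix_d(od_records, num_regions, num_periods):
    """od_records: iterable of (start_region, end_region, period) tuples."""
    block = 2 * num_periods  # outflow columns followed by inflow columns
    d = [[0] * (num_regions * block) for _ in range(num_regions)]
    for start, end, period in od_records:
        # Row `start`, reference `end`: a trip flowing out of `start` towards `end`.
        d[start][end * block + period] += 1
        # Row `end`, reference `start`: a trip flowing into `end` from `start`.
        d[end][start * block + num_periods + period] += 1
    return d
```

Each row of D can then be treated as the "document" of one region when it is passed to a topic model.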

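S3.6: Perform topic mining on the matrix D obtained in S3.5 using the LDA topic model algorithm. The final result gives the dynamic function distribution of each region $r_i$, where $z_{i,k}$ is the proportion of the k-th class of regional function in region $r_i$;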

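S3.7: Calculate the dynamic function similarity between region $r_i$ and region $r_m$, denoted $\lambda_{i,m}$, where cos denotes the cosine of the angle between two vectors and $z_i = (z_{i,1}, z_{i,2}, \dots)$ is the dynamic function distribution of region $r_i$:

$$\lambda_{i,m} = \cos(z_i, z_m) = \frac{z_i \cdot z_m}{\lVert z_i \rVert \, \lVert z_m \rVert}$$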
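The similarity computation is the ordinary cosine between two vectors. A minimal sketch follows; returning 0 for an all-zero vector is a choice made here for robustness, not something the source specifies.

```python
import math


def cosine_similarity(z_i, z_m):
    """Cosine of the angle between two function-distribution vectors."""
    dot = sum(a * b for a, b in zip(z_i, z_m))
    norm_i = math.sqrt(sum(a * a for a in z_i))
    norm_m = math.sqrt(sum(b * b for b in z_m))
    if norm_i == 0.0 or norm_m == 0.0:
        return 0.0  # similarity is undefined for a zero vector; 0 is a convention
    return dot / (norm_i * norm_m)
```

Because the topic proportions are non-negative, the similarity always lies between 0 and 1.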

S3.8: Define a cost function J (the objective function) that represents the deviation between the function a region actually performs and what it exhibits in its static and dynamic aspects, and find the minimum of the cost function. The cost function involves R, the total number of regions, and the POI distribution of region $r_j$.

S4: Mine co-occurrence events from the data obtained in step S2 using an association rule mining algorithm, as follows:

S4.1: Extract transactions. Transactions are extracted according to label_start, label_end and label_time in the data set. A transaction here is the set of numbers of the regions from which trips reach the same region within the same time slice, for example:

select label_start where label_time = 0 and label_end = 1

The statement above extracts one transaction, so there are 541 transactions per time slice;
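The same grouping can be expressed in Python. This sketch assumes each OD record is a dictionary carrying the `label_time`, `label_start` and `label_end` attributes from Tables 5 and 6; treating a transaction as a set of distinct origin regions is an assumption, and only destinations that actually receive trips in a slice produce a transaction here.

```python
from collections import defaultdict


def extract_transactions(od_records):
    """Group origin regions by (time slice, destination region)."""
    transactions = defaultdict(set)
    for rec in od_records:
        key = (rec["label_time"], rec["label_end"])
        transactions[key].add(rec["label_start"])
    return dict(transactions)
```

Each value in the returned dictionary is one transaction: the regions whose passengers arrived at the same destination in the same time slice.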

S4.2: Mine frequent item sets from the transactions extracted in S4.1 using the Apriori association rule algorithm, as follows:

1) Specify a support threshold q. The support threshold tells the algorithm which item sets count as frequent: every item set whose support count is not less than the threshold is a frequent item set. The support count of an item set is the number of transactions in which all of its items occur together. The frequent item sets are the mined co-occurrence events.

2) Mine the frequent 1-item sets, i.e., the frequent item sets containing a single item. Traverse all transactions, count the number of transactions in which each item appears, and mark every 1-item set with count >= q as a frequent 1-item set.

3) Mine the frequent n-item sets. Combine the frequent (n-1)-item sets pairwise to obtain candidate item sets, scan the transactions to check whether the support count of each candidate is at least q, and if so mark it as a frequent n-item set.

4) Repeat the mining of frequent n-item sets for increasing n until no frequent n-item set is found, then end the loop.

5) Store the mined co-occurrence events in files by date; within each file, the frequent item sets and their support counts are stored under their time labels.
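Steps 1) to 4) can be sketched as a compact Apriori loop. This is an illustrative version, not the patent's code: the function name and return type are assumptions, and the subset-pruning step is the standard Apriori optimisation, which the text does not mention explicitly. File storage (step 5) is omitted.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(items): support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    # Step 2: count single items and keep the frequent 1-item sets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    size = 2
    # Steps 3 and 4: grow item sets until no frequent set of the next size exists.
    while current:
        # Join: unions of two frequent (size-1)-item sets with exactly `size` items.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Prune: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, size - 1))}
        current = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count >= min_count:
                current[c] = count
        frequent.update(current)
        size += 1
    return frequent
```

Running it on the transactions of one time slice returns every group of origin regions that co-occur at least `min_count` times, together with the support count of each group.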

S4.3: Calculate the correlation statistics. Their meanings and formulas are as follows, where P denotes probability:

1) Support: the proportion of all transactions in which region A and region B co-occur, i.e.

support＝P(A∪B)

2) Confidence: the ratio of the co-occurrence events in which both region A and region B take part to the co-occurrence events in which region A takes part, called the A→B confidence, i.e.

$$confidence(A \to B) = P(B \mid A) = \frac{P(A \cup B)}{P(A)}$$

3) All-confidence (all_confidence): the smaller of the A→B confidence and the B→A confidence, i.e.

$$all\_confidence(A, B) = \min\{P(A \mid B),\ P(B \mid A)\}$$

4) Max-confidence (max_confidence): the larger of the A→B confidence and the B→A confidence, i.e.

$$max\_confidence(A, B) = \max\{P(A \mid B),\ P(B \mid A)\}$$

5) Lift: the ratio of the probability of containing B given that A is contained to the probability of containing B given that A is not contained. A lift greater than 1 indicates that the two regions are positively correlated, a lift less than 1 indicates that they are negatively correlated, and a lift equal to 1 indicates that they are uncorrelated and independent.

6) Kulc: the average of the A→B confidence and the B→A confidence, i.e.

$$Kulc(A, B) = \frac{1}{2}\bigl(P(A \mid B) + P(B \mid A)\bigr)$$

7) IR: the ratio between the A→B confidence and the B→A confidence.

8) Cosine: the probability that region A and region B co-occur divided by the geometric mean of the probabilities of A and of B, i.e.

$$cosine(A, B) = \frac{P(A \cup B)}{\sqrt{P(A)\,P(B)}}$$
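The statistics can be computed from support counts as in the sketch below. It follows the common textbook definitions of these pattern-evaluation measures, which is an assumption: in particular, lift is taken as P(A∪B)/(P(A)P(B)) and IR as |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A∪B)), and these may differ from the formulas in the original patent figures. The function and argument names are illustrative.

```python
import math


def correlation_stats(n_a, n_b, n_ab, n_total):
    """n_a, n_b: transactions containing A or B; n_ab: transactions containing both.

    Assumes n_a > 0, n_b > 0 and n_total > 0.
    """
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    conf_ab = p_ab / p_a  # A -> B confidence, P(B|A)
    conf_ba = p_ab / p_b  # B -> A confidence, P(A|B)
    return {
        "support": p_ab,
        "confidence": conf_ab,
        "all_confidence": min(conf_ab, conf_ba),
        "max_confidence": max(conf_ab, conf_ba),
        "lift": p_ab / (p_a * p_b),                  # textbook definition (assumed)
        "kulc": 0.5 * (conf_ab + conf_ba),
        "ir": abs(n_a - n_b) / (n_a + n_b - n_ab),   # textbook definition (assumed)
        "cosine": p_ab / math.sqrt(p_a * p_b),
    }
```

Each returned dictionary corresponds to one line in a parallel coordinates plot, with one axis per measure.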

S5: Visually display the data mined in S3 and S4, as follows:

S5.1: Mark the different regional functions on the map in different colors according to the regional function map calculated in S3.7, as shown in FIG. 4 and FIG. 7;

S5.2: Draw a global co-occurrence map from the frequent item sets mined in S4.2; the map is drawn from two aspects, co-occurrence relations (FIG. 4) and co-occurrence participation (FIG. 5);

S5.3: Draw a regional co-occurrence ring heat map from the frequent item sets mined in S4.2; the ring heat map focuses on analyzing the co-occurrence patterns between regions, as shown in FIG. 6;

S5.4: Draw a parallel coordinates plot from the statistics calculated in S4.3; the plot measures the correlation between two regions across the indicators, as shown in FIG. 8.

While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A travel co-occurrence phenomenon visualization analysis method based on multi-source city data, characterized by comprising the following steps:

S1: Preprocess the raw data

S2: Divide the data preprocessed in step S1 by time and region

S3: Mine regional functions from the data divided in step S2

S3.4: The product of $TF_{i,j}$ and $IDF_j$ is the TF-IDF value of region $r_i$ for the j-th POI category, which represents the static function distribution of the region, calculated as follows:

TF-IDF_i,j＝TF_i,j×IDF_j

where R is the total number of regions, and the POI distribution of region $r_j$ is one of the terms of the cost function;

support＝P(A∪B)

S5: Visually display the co-occurrence results