CN110298500B

CN110298500B - Urban traffic track data set generation method based on taxi data and urban road network

Info

Publication number: CN110298500B
Application number: CN201910532080.8A
Authority: CN
Inventors: 孔祥杰; 马凯; 商迪; 侯明良; 郝欣宇; 冯嘉伟; 夏锋
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-06-19
Filing date: 2019-06-19
Publication date: 2022-11-08
Anticipated expiration: 2039-06-19
Also published as: CN110298500A

Abstract

A method for generating an urban traffic trajectory data set based on taxi data and an urban road network, belonging to the field of transportation. The model consists of a preparation layer, a generation layer and a verification layer. It generates private car trajectory data from taxi data and urban road network information for the first time, and provides a complete verification method to check the accuracy of the generated data set. The invention provides an urban functional area division method based on adjacent road segmentation, which divides a city into different functional areas. A regional population weight opportunity model is also provided, so that population movement patterns can be depicted more accurately. In the verification layer, models from the macroscopic and microscopic perspectives are provided to verify the authenticity and accuracy of the generated data set. Verification shows that the private car trajectory data generated from taxi GPS data and urban road network information has high authenticity and accuracy.

Description

Urban traffic track data set generation method based on taxi data and urban road network

Technical Field

The invention belongs to the field of transportation, and relates to a method for generating a social vehicle trajectory data set from floating vehicle trajectory data in the field of traffic big data.

Background

The Internet of Things (IoT) is an important component of the new generation of information technology and an important stage in the development of the information era. As more and more vehicles connect to the IoT, Internet of Vehicles (IoV) technology is becoming increasingly popular and widely applied. Human movement trajectory data is the basis of vehicle communication in the IoV and can reflect individual movement patterns and social migration rules in spatio-temporal environments. From the trajectory data of social vehicles (such as private cars) and floating vehicles (such as taxis and buses), complex and diverse Vehicular Social Networks (VSNs) can be constructed. Trajectory data sets of floating vehicles can easily be obtained from the Internet. However, because of personal privacy protection and relevant government policies, it is almost impossible for ordinary researchers to obtain trajectory data of social vehicles, which greatly hinders in-depth research and development in related fields.

Disclosure of Invention

The invention aims to overcome the shortcomings of existing research on data set generation, and provides a three-layer vehicle trajectory data set generation model, RDMP (Region Division and Mobility Pattern based vehicle trajectory generation model), which comprises a preparation layer, a generation layer and a verification layer.

The technical scheme of the invention is as follows:

A method for generating urban traffic trajectory data sets based on taxi data and urban road networks is carried out on the three-layer RDMP vehicle trajectory data set generation model and comprises the following steps:

S1: In the preparation layer, the original taxi trajectory data is preprocessed and useless and abnormal data is deleted; meanwhile, the urban functional area division method based on adjacent road segmentation is used to divide the urban area, and the urban road network is constructed, as follows:

S1.1: Preprocessing: delete the useless trajectory data recorded while a taxi is empty, and clear inconsistent data caused by equipment precision and statistical errors, to obtain usable taxi trajectory data;

S1.2: Based on the adjacent road segmentation method, the studied city is divided into different regions using human movement trajectories and points of interest (POIs), and the travel of all vehicles is converted into flows among the different regions. The specific process is as follows:

S1.2.1: An unsupervised DMR-based probabilistic topic model is used, in which the whole city is taken as a document, the different functions of each region as topics, the movement patterns among regions as words, and the POI feature vectors as the metadata of the document. The frequency density $v_{i,r}$ of the i-th POI category in region r is calculated as follows:

$$v_{i,r} = \frac{\mathrm{num}_i}{S_r}$$

where $\mathrm{num}_i$ is the number of POIs of the i-th category in region r and $S_r$ is the area of region r. The POI feature vector of region r is denoted $x_r = (v_{1,r}, v_{2,r}, \ldots, v_{i,r}, \ldots, v_{F,r}, 1)$ and represents the metadata of region r, where F is the number of POI categories in r and the last element, 1, is the default feature. The topic distribution of region r is a K-dimensional vector $\theta_r = (\theta_{r,1}, \theta_{r,2}, \ldots, \theta_{r,e}, \ldots, \theta_{r,K})$, where $\theta_{r,e}$ is the proportion of topic e in region r;

S1.2.2: The topic distributions obtained in S1.2.1 are clustered using the k-means clustering algorithm, and the coordinates of the origin and destination of each trip are input into a kernel density estimation (KDE) model to quantify the functional strength within each functional area. Given n regions, the functional strength of region r is calculated with the kernel density estimator of the KDE model:

where $d_{i,r}$ is the distance from region i to region r, R is the bandwidth, and KF(·) is the Gaussian kernel function. After the functional strength has been estimated, the divided regions are annotated to reflect the actual functions of the city, and the functional area number of region r is defined as $K_a$;

S1.2.3: Region clustering

Using the adjacent road segmentation method, each important road is regarded as a line segment and each grid cell as a node. The Euclidean distance from each grid cell to each road is calculated, and the number $K_l$ of the road nearest to each grid cell is recorded. After the calculation, the grid cells in the rasterized map are clustered, and nodes with the same values of $K_a$ and $K_l$ are taken as one cluster;

Important roads are determined as follows: the city map is rasterized into grid cells of 0.001° × 0.001° in longitude and latitude, and roads whose average daily traffic volume exceeds one hundred thousand are extracted as important roads.

S1.3: According to the actual study area, the original files downloaded from the OpenStreetMap website are cropped and layered, road travel restrictions are manually corrected to match the real world, the latest traffic conditions are updated, and the road network of the city is constructed;

S2: In the generation layer:

S2.1: The number of social vehicles in each region is calculated from the static ratio of floating vehicles to social vehicles in the different functional areas, using the following formulas:

$$SA_i = \alpha_i \sum_{j=1}^{N_i} SG_j, \qquad SG_j = \sum_{w=1}^{n_j} SR_w$$

where $SA_i$ is the total number of private car trips in region i; $SG_j$ is the number of taxis in grid j contained in region i; $N_i$ is the number of grids into which region i is divided; $\alpha_i$ is the ratio of private cars to taxis in region i; $SR_w$ is the number of taxis on road w in grid j within region i; and $n_j$ is the number of roads in grid j;

S2.2: A Regional Population Weight Opportunity (RPWO) model is provided, and the origin-destination (OD) matrix of social vehicles is calculated as follows:

S2.2.1: The attraction of destination region j for origin region i is set to be inversely proportional to the total population $Q_{ji}$ within the circle centered at the center of gravity of region j whose radius $R_{ij}$ is the distance between the centers of gravity of origin region i and destination region j;

S2.2.2: The center of gravity of a region is calculated as follows:

where L is the number of grids in the region, and $x_l$ and $y_l$ are the relative longitude and relative latitude of grid l within the region;

S2.2.3: From the centers of gravity of the origin and destination regions, the attraction $A_{ji}$ of destination region j for origin region i is calculated as follows:

where $Q_{ji}$ is the total number of private cars in the circle of radius $R_{ij}$ centered at the center of gravity of destination region j; $N_s$ is the number of regions contained in the circle; $\beta_r$ is the proportion of region r lying within the circle, that is, the area of region r inside the circle relative to its total area $S_r$; $P_r$ is the number of private cars in region r; $A_{ji}$ is the attraction of destination region j for origin region i; $o_j$ is the number of private cars in destination region j; and M is the total number of private cars in the whole city;

S2.2.4: Based on the attraction of each region, the traffic volume from region i to region j is calculated:

where $SA_i$ is the number of private car trips departing from region i, calculated in S2.1, and n is the number of regions into which the city is divided;

S2.3: Simulated trajectory data is generated with the SUMO simulation tool in combination with the OD matrix, as follows:

S2.3.1: The urban roads are classified according to the divided regions. The road network file contains the longitude and latitude of each road junction; these are used to determine which region each road belongs to, and the IDs of the roads contained in each region are written into the road network file;

S2.3.2: The OD2TRIPS tool in SUMO is used to import the O/D matrix and decompose it into individual vehicle trips. The O/D matrix generated in S2.2, the road network file and the list of roads contained in each region are input according to the specific situation and data of the city; the generation time period, the share of trips in each time period and the vehicle type parameters are set; and an XML file of vehicle trip information is generated, in which each trip contains a vehicle ID, a departure time, an origin ID and a destination ID;

S2.3.3: To generate the route information of private cars, the DUAROUTER tool in SUMO is used. The road network file and the trip information generated by OD2TRIPS are input, and the simulation time period and the shortest-path algorithm are set. Simulated vehicle routes containing the vehicle ID, travel time and the roads traversed are generated, and finally the vehicle travel trajectory information is produced;

S2.3.4: To generate per-second information about each vehicle within a specified time interval, including longitude and latitude, heading angle, instantaneous speed and road number, the Trace File Generation tool in SUMO is used. The corresponding DUAROUTER output is input, the corresponding configuration files are written, the time interval is set to Δt minutes, and vehicle trace files for the different time periods are generated;

S3: In the verification layer, verification models are designed from the macroscopic and microscopic perspectives to verify the accuracy and authenticity of the generated data, as follows:

S3.1: In the macroscopic model, the generated trajectory data is analyzed and compared with actual traffic conditions, including comparisons of traffic flow, travel range and traffic conditions;

S3.2: In the microscopic verification model, the authenticity of the generated data is analyzed and evaluated in terms of acceleration and relative distance, and a method for quantitatively checking the trajectory data is designed, as follows:

S3.2.1: Acceleration analysis: the acceleration of the generated trajectory data and its rate of change (jerk) $J_e$ [m/s³] are checked. When the results lie within a reasonable range, the generated data set is shown to have high internal plausibility. The jerk analysis uses three indicators: the percentage of trajectory data whose $J_e$ exceeds the threshold of ±3 m/s³, the maximum $J_e$ in the data set, and the minimum $J_e$ in the data set;

S3.2.2: Consistency analysis: the authenticity of the generated data set is checked using the distance between vehicles. If abnormally small inter-vehicle distances exist in the data set, its accuracy is questionable;

S3.2.3: The distance between a pair of vehicles v1 and v2 at time T is calculated from the distances along the road from its starting point to the projections of the two actual vehicle positions onto the road alignment;

S3.2.4: Assuming that vehicles always travel along a straight line on a road, the inter-vehicle distance is calculated directly from the projected coordinates corresponding to the geographic coordinates of the vehicles. If the distance between two vehicles falls below 5 m at any moment, the two vehicles are considered to have collided and caused a traffic accident. Therefore, the proportion of moments in the whole data set at which the distance is below 5 m is measured to verify the accuracy of the data set.
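The microscopic checks of S3.2 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes speeds sampled at a fixed step `dt`, approximates jerk by finite differences, and takes the inter-vehicle gap as the absolute difference of along-road positions. The function names and the `dt`, `threshold` and `min_gap` parameters are hypothetical; only the ±3 m/s³ and 5 m values come from the text.

```python
from typing import Dict, List, Sequence


def jerk_series(speeds: Sequence[float], dt: float = 1.0) -> List[float]:
    """Finite-difference jerk (m/s^3) from speeds (m/s) sampled every dt seconds."""
    acc = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    return [(b - a) / dt for a, b in zip(acc, acc[1:])]


def jerk_indicators(speeds: Sequence[float], dt: float = 1.0,
                    threshold: float = 3.0) -> Dict[str, float]:
    """Three indicators of S3.2.1: share of |jerk| above threshold, max, min."""
    jerks = jerk_series(speeds, dt)
    if not jerks:
        raise ValueError("need at least three speed samples")
    share = sum(1 for j in jerks if abs(j) > threshold) / len(jerks)
    return {"share_over_threshold": share, "max": max(jerks), "min": min(jerks)}


def share_of_small_gaps(pos_a: Sequence[float], pos_b: Sequence[float],
                        min_gap: float = 5.0) -> float:
    """Share of instants at which two vehicles on one road are closer than min_gap.

    pos_a and pos_b are along-road positions (m) at the same instants
    (an assumption standing in for the projected coordinates of S3.2.4).
    """
    gaps = [abs(a - b) for a, b in zip(pos_a, pos_b)]
    if not gaps:
        raise ValueError("need at least one pair of positions")
    return sum(1 for g in gaps if g < min_gap) / len(gaps)
```

A low share in both indicators would suggest, under these assumptions, that the generated trajectories are internally plausible; a real check would run over every vehicle pair on the same road.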

The beneficial effects of the invention are as follows. By constructing a three-layer vehicle trajectory data set generation model, the method realizes the generation and verification of private car movement trajectory data. The invention provides a novel method for dividing urban functional areas that combines the main road network with travel trajectory data. The flow patterns of urban vehicles are also analyzed in depth, and a regional population weight opportunity model is provided that takes the relations among different urban functional areas into account and describes the patterns and laws of urban vehicle migration well. In addition, the invention constructs verification models from the macroscopic and microscopic perspectives, and verifies that the private car trajectory data produced by the proposed generation method is accurate and useful. The invention provides a new method for generating private car trajectory data sets, which can provide strong data support for Internet of Vehicles and traffic research.

Drawings

Fig. 1 is an RDMP model proposed by the present invention, which is a three-layer vehicle trajectory data set generation model, and is composed of a preparation layer, a generation layer, and a verification layer.

Fig. 2 shows the ARS region division method proposed by the invention. Taking Beijing as an example, the city can be divided into 153 regions.

Fig. 3 compares the traffic distribution on the main roads between the private car data within the Fifth Ring Road of Beijing generated by the invention and the real data.

Fig. 4 compares the travel times of the private car data within the Fifth Ring Road of Beijing generated by the invention with the real data, where (a) shows the proportion of travel time from 1:00 to 12:00 and (b) shows the proportion of travel time from 13:00 to 24:00.

Fig. 5 compares the travel distances of the private car data within the Fifth Ring Road of Beijing generated by the invention with the real data.

FIG. 6 is a comparison between simulated paths generated using the Baidu Map API navigation service and the data set generated by the invention. The result shows that the distribution of travel routes is basically the same both within 5 km and between 5 km and 10 km, where (a) covers travel routes within 5 km and (b) covers travel routes of 5-10 km.

FIG. 7 is a comparison of the real data set and the generated data set in terms of acceleration, where (a) is a vehicle acceleration profile of the real data set in the range of 7; (b) is a vehicle acceleration profile of the real data set at 7; (c) is a vehicle acceleration profile of the generated data set at 7; and (d) is a vehicle acceleration profile of the generated data set at 7.

FIG. 8 is a plot of the relative distance of the real data set from the generated data set.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.

This example uses a taxi trajectory data set from Beijing for November 2012 (about 27,000 taxis and more than 100 billion GPS records), and provides a method for generating a social vehicle trajectory data set from floating vehicle trajectory data. The core of the invention is a three-layer vehicle trajectory data set generation model (RDMP), whose structure is shown in Fig. 1. The model consists of a preparation layer, a generation layer and a verification layer, and can generate private car trajectory data from taxi GPS data and urban road network information. The method comprises the following steps:

S1 is the implementation of the preparation layer of the RDMP model. In this layer, the data is preprocessed to obtain usable taxi trajectory data, which lays the foundation for the subsequent work.

S1.1: Preprocessing the data set. In this example, we use taxi trajectory data from Beijing, China, for November 2012, which contains more than 100 billion GPS records from about 27,000 taxis. Each taxi's position is updated by its GPS device once every 11 seconds. The original data files are stored as text documents named by storage time. A taxi has two travel modes: passenger-carrying and empty. Only in passenger-carrying mode is the taxi's trajectory regular and similar to the travel pattern of a private car. In empty mode, by contrast, the taxi cruises without a destination, which is of no use to our study. Therefore, we delete the trajectory data recorded while a taxi has no passenger. Furthermore, some data is clearly inconsistent because of device accuracy and statistical errors. For example, some trips are too long or too short to yield a meaningful analysis, and this erroneous data also needs to be cleared. We then process the reduced data to obtain each vehicle trip: records with the same vehicle ID are extracted into one file and sorted by time, which gives the trajectory of each individual taxi. It is worth mentioning that our research focuses on vehicle movement patterns within the Fifth Ring Road of Beijing, which is delimited by a longitude range and a latitude range. Therefore, vehicle trajectory data whose longitude and latitude fall outside the Fifth Ring range is eliminated.
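The preprocessing of S1.1 can be sketched as a simple filter. This is an illustrative sketch only: the `GpsRecord` fields, the `preprocess` function and its `min_points` cut-off (a stand-in for the "too short" trip rule) are hypothetical, and the longitude and latitude bounds must be supplied by the caller because the text does not give their values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class GpsRecord:
    vehicle_id: str
    timestamp: int    # seconds, any consistent epoch
    lon: float
    lat: float
    occupied: bool    # True while the taxi carries a passenger


def preprocess(records: Iterable[GpsRecord],
               lon_range: Tuple[float, float],
               lat_range: Tuple[float, float],
               min_points: int = 2) -> Dict[str, List[GpsRecord]]:
    """Keep occupied, in-bounds records; group by vehicle and sort by time."""
    tracks: Dict[str, List[GpsRecord]] = {}
    for rec in records:
        if not rec.occupied:
            continue  # empty-taxi cruising is discarded
        if not (lon_range[0] <= rec.lon <= lon_range[1]):
            continue  # outside the study area
        if not (lat_range[0] <= rec.lat <= lat_range[1]):
            continue
        tracks.setdefault(rec.vehicle_id, []).append(rec)
    cleaned: Dict[str, List[GpsRecord]] = {}
    for vehicle_id, recs in tracks.items():
        recs.sort(key=lambda r: r.timestamp)
        if len(recs) >= min_points:  # drop tracks too short to analyze
            cleaned[vehicle_id] = recs
    return cleaned
```

A fuller version would also split each vehicle's records into individual trips and drop trips that are implausibly long, which this sketch leaves out.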

TABLE 1 Proportion of taxis to private cars on the main traffic arteries of Beijing

S1.2: Region division. Beijing is divided into regions, and vehicle travel is simplified into travel among the different regions. The invention provides an adjacent road segmentation (ARS) method, which performs region division using human travel trajectories and points of interest (POIs) in combination with the main traffic arteries of the road network. The specific steps are as follows:

S1.2.1: A topic-model-based approach is used to identify the functions of each region; a region with multiple functions is like a document containing multiple topics. The studied city as a whole is taken as a document, and the different functions of each region are taken as topics. The movement patterns between regions are taken as words, and the POI feature vectors as the metadata of the document. The invention uses an unsupervised DMR-based topic model, built on Latent Dirichlet Allocation (LDA) and Dirichlet Multinomial Regression (DMR). The POI feature vectors and the movement patterns are combined so that the function of each region is studied comprehensively from both aspects. For each region r, the number of POIs of each category in the region can be counted. The frequency density $v_{i,r}$ of the i-th POI category in region r is calculated as follows:

$$v_{i,r} = \frac{\mathrm{num}_i}{S_r}$$

where $\mathrm{num}_i$ is the number of POIs of the i-th category in region r and $S_r$ is the area of region r. In addition, the POI feature vector of region r can be denoted $x_r = (v_{1,r}, v_{2,r}, \ldots, v_{i,r}, \ldots, v_{F,r}, 1)$ and represents the metadata of region r, where F is the number of POI categories in r and the last element, 1, is the default feature. After parameter estimation with the DMR model, the topic distribution of region r is a K-dimensional vector $\theta_r = (\theta_{r,1}, \theta_{r,2}, \ldots, \theta_{r,e}, \ldots, \theta_{r,K})$, where $\theta_{r,e}$ is the proportion of topic e in region r.
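As a small illustration of the POI feature vector, the sketch below assumes that the frequency density of a category is its POI count divided by the region's area, which is the natural reading of the definitions above. The function name `poi_feature_vector` and its arguments are hypothetical.

```python
from typing import List, Sequence


def poi_feature_vector(poi_counts: Sequence[int], area: float) -> List[float]:
    """Build x_r = (v_1r, ..., v_Fr, 1) for one region.

    poi_counts[i] is the number of POIs of category i in the region;
    area is the region's area S_r in any consistent unit (assumed > 0).
    """
    if area <= 0:
        raise ValueError("area must be positive")
    densities = [count / area for count in poi_counts]
    return densities + [1.0]  # trailing 1 is the default feature
```

For example, a region of area 2.0 with 4, 0 and 2 POIs in three categories gets the vector `[2.0, 0.0, 1.0, 1.0]`.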

S1.2.2: The functional regions are clustered using k-means, a classic clustering algorithm in unsupervised machine learning. Experiments confirm that, for Beijing, the clustering result is best when the number of clusters is set to 9. To quantify the popularity and extent of the area within a functional zone, the functional strength of each functional zone is estimated. The popularity of a functional zone is potentially linked to its traffic volume, which reflects human movement patterns and therefore indicates the functional strength of the zone. The coordinates of the origin and destination of each trip are input into a kernel density estimation (KDE) model to quantify the functional strength within the functional zone. Given n regions, the functional strength of region r is calculated with the kernel density estimator of the KDE model:

where $d_{i,r}$ is the distance from region i to region r, R is the bandwidth, and KF(·) is the Gaussian kernel function. After the functional strength has been estimated, the divided regions are annotated to reflect the actual functions of the city, and the functional area number of each region is defined as $K_a$. We then annotate the regions from four perspectives. First, we rank the POI categories within each functional region by the average frequency density of its POI feature vectors, and rank the functional regions containing each POI category by that frequency. Second, we count the most frequent movement patterns within each functional zone. Third, we use the functional strength to find the most representative POIs in each functional kernel and annotate the regions accordingly. Fourth, we label regions manually according to the actual situation, for example historical sites.
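The functional strength estimate can be sketched as follows. The exact estimator is not reproduced in this text, so the sketch assumes the common two-dimensional form, a sum of KF(d/R) over the n regions divided by n·R², with a standard bivariate Gaussian kernel KF(u) = exp(−u²/2) / (2π). Both the normalization and the function names are assumptions, not the patent's stated formula.

```python
import math
from typing import Sequence


def gaussian_kernel(u: float) -> float:
    """Standard bivariate Gaussian kernel evaluated at scaled distance u."""
    return math.exp(-0.5 * u * u) / (2.0 * math.pi)


def functional_strength(distances: Sequence[float], bandwidth: float) -> float:
    """Kernel density estimate at one region.

    distances: d_{i,r} from each of the n regions to region r.
    bandwidth: R, in the same unit as the distances (assumed > 0).
    """
    n = len(distances)
    if n == 0 or bandwidth <= 0:
        raise ValueError("need at least one distance and a positive bandwidth")
    total = sum(gaussian_kernel(d / bandwidth) for d in distances)
    return total / (n * bandwidth ** 2)
```

Under this form, nearby trip endpoints contribute more than distant ones, and a larger bandwidth R smooths the estimate over a wider neighborhood.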

S1.2.3: First, the functional areas are rasterized: the map projection is divided into a large number of grid cells of 0.001° in latitude and longitude. The relative length and width of each grid cell are defined as 1, and each cell has a fixed ID. If a grid cell contains several different regions, it is assigned to the functional area that occupies the largest share of the cell. For Beijing, a deeper division of the functional areas based on the adjacent road segmentation method is provided. Based on road traffic flow statistics published by the government, 18 main urban roads are selected and numbered 1 to 18. Each important road is marked on the rasterized map and treated as a line segment, and each grid cell is treated approximately as a node. The Euclidean distance from each node (grid cell) to each line segment (road) is then calculated, and the number $K_l$ of the road nearest to each grid cell is recorded. After the distance calculation, each grid node has two attribute values: the functional area number $K_a$ and the nearest road number $K_l$. The grid cells in the rasterized map are clustered, and nodes with the same $K_a$ and $K_l$ values are taken as one cluster. In the end, 153 subdivided functional areas are obtained. Because the ratio of taxis to private cars differs from road to road, under this division method each region has not only the same urban function but also the same ratio of taxis to private cars.
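The grouping step of the adjacent road segmentation method can be sketched as below. The sketch assumes planar grid coordinates, represents each important road by a single straight segment, and breaks ties by the order in which roads are listed; the function names and data layout are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the line and clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def cluster_grids(grids: Dict[Hashable, Tuple[Point, int]],
                  roads: Dict[int, Tuple[Point, Point]]
                  ) -> Dict[Tuple[int, int], List[Hashable]]:
    """Group grid cells by (K_a, K_l).

    grids maps grid ID -> (cell center, functional area number K_a);
    roads maps road number -> (segment start, segment end).
    """
    clusters: Dict[Tuple[int, int], List[Hashable]] = defaultdict(list)
    for grid_id, (center, k_a) in grids.items():
        k_l = min(roads, key=lambda no: point_segment_distance(center, *roads[no]))
        clusters[(k_a, k_l)].append(grid_id)
    return dict(clusters)
```

Each key of the result corresponds to one subdivided functional area: cells that share both a function label and a nearest important road.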

S1.3: Road network description. Urban road network map data can be downloaded free of charge from OpenStreetMap (OSM) or other open-source websites. OSM data can be uploaded by any user, so everyone can maintain and modify the map data. In our study, we downloaded the OSM file for Beijing, which includes information on roads, subways and various buildings and facilities and reflects the geography of the city. However, because the data is open source, there may be discrepancies between the downloaded data and the actual situation. To build an accurate simulated road network, the road topology is modified to match the real world. We modify it with the Java OpenStreetMap Editor (JOSM), a free tool for editing OpenStreetMap geographic data. Furthermore, since the aim of the invention is to simulate and generate private cars rather than all vehicles, we remove redundant information such as railways and sidewalks. After the railway and similar data has been deleted, the map data is compared with the real world and corrected.

S2 is the implementation of the generation layer of the RDMP model, in which a private car trajectory data set is generated from the trajectory data preprocessed in S1. The invention provides a regional population weight opportunity (RPWO) model, which predicts the private car origin-destination matrix (OD matrix) between regions from the traffic volume of each divided region. We then use the SUMO simulation tool to convert the region-to-region matrix into a microscopic trajectory for each private car.

S2.1: Calculating the traffic volume between regions. The traffic volume refers to the number of vehicles passing a given road section within a given time. It reflects the total traffic flow of the road and has important research value. In the invention, the traffic volume within each region is calculated, and the regional population weight opportunity model provided by the invention is used to predict the traffic volume of vehicle movements between the divided regions. A simulated social vehicle data set is then generated from the predicted inter-region traffic volume. Beijing has already been divided into different regions in S1, so the daily traffic volume in each region is now calculated. The traffic volume in a region is the sum of the traffic volumes of all roads in the region, so the traffic volume of each single road is calculated first. The traffic volume on a road is made up of social vehicles and floating vehicles, but the only data available is the trajectory data of floating vehicles (taxis). Therefore, the traffic volume of social vehicles (private cars) has to be derived from that of floating vehicles. The ratio of social vehicles to floating vehicles differs from road to road, and given this ratio the private car traffic volume can easily be calculated from the known taxi traffic volume. From the information in the "Annual Report of Traffic Development in Beijing in 2012", we obtained the ratios of floating vehicles to social vehicles on the main traffic roads (shown in Table 1). Because the regions are divided on the basis of these main traffic roads, we can assume that all roads within a region have the same ratio of floating vehicles to social vehicles as its main traffic road. We calculate the number of social vehicles in each functional area with the following formulas:

$$SA_i = \alpha_i \sum_{j=1}^{N_i} SG_j, \qquad SG_j = \sum_{w=1}^{n_j} SR_w$$

where $SA_i$ is the total number of private car trips in region i; $SG_j$ is the number of taxis in grid j contained in region i; $N_i$ is the number of grids into which region i is divided; $\alpha_i$ is the ratio of private cars to taxis in region i; $SR_w$ is the number of taxis on road w in grid j within region i; and $n_j$ is the number of roads in grid j. The number of private cars in each functional area is obtained from these two formulas.
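A short sketch of this calculation follows, assuming that taxi counts are first summed over the roads of each grid and then over the grids of the region before being scaled by the private-car-to-taxi ratio. The function name and the nested-list layout are hypothetical.

```python
from typing import Sequence


def private_car_trips(alpha: float,
                      taxi_counts: Sequence[Sequence[float]]) -> float:
    """Estimate SA_i for one region.

    alpha: ratio of private cars to taxis in the region (alpha_i).
    taxi_counts[j][w]: number of taxis on road w of grid j (SR_w).
    """
    # SG_j: taxis per grid, summed over the roads in that grid.
    taxis_per_grid = [sum(roads) for roads in taxi_counts]
    # SA_i: scale the regional taxi total by the private-car ratio.
    return alpha * sum(taxis_per_grid)
```

For example, a region with two grids carrying 10 + 20 and 30 taxis and a ratio of 2.5 yields 150 estimated private car trips.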

S2.2: Spatio-temporal interaction model. After the traffic volume of each region of Beijing has been obtained, the invention goes on to model human travel patterns at the city scale. Despite the long history of human mobility modeling, researchers still lack highly accurate methods for predicting urban movement patterns, especially when the available data types are not sufficiently diverse. The invention proposes a regional population weight opportunity (RPWO) model to capture the underlying drivers of human movement patterns at the city scale; the model does not depend on any adjustable parameters. It is worth mentioning that studies of the model show that it fits population flows between cities well but not flows between countries, which indicates that human mobility differs across spatial scales. Unlike the traditional population weight opportunity model, the invention takes regional factors into account when modeling human migration and studies flows from cluster to cluster. That is, the invention studies human migration patterns between different regions by means of the region division method. The specific steps are as follows:

S2.2.1: In the RPWO model, the attraction of destination region j for origin region i is set to be inversely proportional to the total population $Q_{ji}$ within the circle centered at the center of gravity of region j whose radius $R_{ij}$ is the distance between the centers of gravity of origin region i and destination region j;

S2.2.2: The center of gravity of a region is calculated as follows:

where L is the number of grids in the region (divided at 0.001° of latitude and longitude), and $x_l$ and $y_l$ are the relative longitude and relative latitude of grid node l within the region.
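The center of gravity can be sketched as the arithmetic mean of the grid coordinates, which is the reading suggested by the definitions of L, $x_l$ and $y_l$; the formula itself is not reproduced in this text, so the equal weighting of grids is an assumption. The helper `centroid_distance`, which gives the radius $R_{ij}$ as a planar distance, and both function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def center_of_gravity(cells: Sequence[Point]) -> Point:
    """Mean relative longitude and latitude of the L grids of a region."""
    if not cells:
        raise ValueError("a region needs at least one grid")
    count = len(cells)
    mean_x = sum(x for x, _ in cells) / count
    mean_y = sum(y for _, y in cells) / count
    return (mean_x, mean_y)


def centroid_distance(cells_i: Sequence[Point], cells_j: Sequence[Point]) -> float:
    """Planar distance between the centers of gravity of two regions (R_ij)."""
    xi, yi = center_of_gravity(cells_i)
    xj, yj = center_of_gravity(cells_j)
    return math.hypot(xi - xj, yi - yj)
```

Because the grids are small and of equal size, a planar distance in grid units is a reasonable approximation at city scale; a geodesic distance could be substituted if needed.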

S2.2.3: The attraction of the destination for the origin (which determines the traffic volume from the origin region to the destination region) can then be calculated from the centers of gravity of the origin and destination regions:

where $Q_{ji}$ is the total number of private cars in the circle of radius $R_{ij}$ centered at the center of gravity of destination region j; $N_s$ is the number of regions contained in the circle; and $\beta_r$ is the proportion of region r lying within the circle, that is, the area of region r inside the circle relative to its total area $S_r$.
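The attraction of S2.2.3 and the inter-region traffic volume of S2.2.4 can be sketched together. The formulas are not reproduced in this text, so the sketch borrows the form of the conventional population-weighted opportunities model as an assumption: $Q_{ji} = \sum_r \beta_r P_r$, $A_{ji} = o_j (1/Q_{ji} - 1/M)$, and $T_{ij} = SA_i \cdot A_{ji} / \sum_{k \neq i} A_{ki}$. All function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def circle_population(regions: Iterable[Tuple[float, float]]) -> float:
    """Q_ji: sum of beta_r * P_r over regions overlapping the circle."""
    return sum(beta * cars for beta, cars in regions)


def attraction(o_j: float, q_ji: float, total_cars: float) -> float:
    """A_ji under the assumed PWO-style form o_j * (1/Q_ji - 1/M)."""
    return o_j * (1.0 / q_ji - 1.0 / total_cars)


def od_matrix(trips: Sequence[float],
              attr: Sequence[Sequence[float]]) -> List[List[float]]:
    """Distribute the SA_i trips of each origin i over destinations j.

    trips[i] is SA_i; attr[j][i] is A_ji. Returns T with T[i][j] = T_ij.
    """
    n = len(trips)
    flows = [[0.0] * n for _ in range(n)]
    for i in range(n):
        denom = sum(attr[k][i] for k in range(n) if k != i)
        if denom <= 0:
            continue  # no attractive destination for this origin
        for j in range(n):
            if j != i:
                flows[i][j] = trips[i] * attr[j][i] / denom
    return flows
```

With this normalization each row of the matrix sums to $SA_i$, so all trips departing from a region are assigned to some other region.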

S2.3: Trajectory simulation and data generation are implemented as follows:

S2.3.1: The invention first classifies the urban roads according to the divided regions. The road network file contains the longitude and latitude of each road junction; these are used to determine which region each road belongs to, and the IDs of the roads contained in each region are written into the road network file.

S2.3.2: The urban traffic volume O/D matrix obtained in S2.2 is combined with the modified road network file to simulate and generate the data set. To achieve this goal, we use the SUMO tool. First, we use the OD2TRIPS plug-in of the SUMO tool to import the O/D matrix and decompose it into individual vehicle trips. The traffic volume O/D matrix reflects the overall travel patterns of people within the city. From a microscopic perspective, the individual vehicle trip information generated from the matrix reflects the movement patterns of individuals. The inputs to the OD2TRIPS plug-in are the O/D matrix information, the road network file and the list of roads contained in each region; the generation time period, the share of trips in each time period and the generated vehicle type parameter are also set. The output is an xml file containing a series of vehicle trip records, each including a vehicle ID, a departure time, an origin ID and a destination ID.
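The decomposition of an O/D matrix into individual trips can be sketched as follows. This is an illustration of the idea only, not the OD2TRIPS implementation: the sampling scheme, the XML element and attribute names, and all function names are assumptions made for the example.

```python
import random
import xml.etree.ElementTree as ET
from typing import Dict, List, Sequence, Tuple


def od_to_trips(od: Dict[Tuple[str, str], int],
                period_shares: Sequence[float],
                period_seconds: int = 3600,
                seed: int = 0) -> List[dict]:
    """Turn O/D counts into single trips with departure times.

    Each trip is assigned to a time period according to period_shares and
    then given a uniformly random departure time inside that period.
    """
    rng = random.Random(seed)
    periods = range(len(period_shares))
    trips = []
    for (origin, destination), count in sorted(od.items()):
        for _ in range(count):
            period = rng.choices(periods, weights=period_shares)[0]
            depart = period * period_seconds + rng.uniform(0, period_seconds)
            trips.append({"depart": depart, "from": origin, "to": destination})
    trips.sort(key=lambda t: t["depart"])
    for k, trip in enumerate(trips):
        trip["id"] = f"veh{k}"  # hypothetical ID scheme
    return trips


def trips_to_xml(trips: List[dict]) -> str:
    """Serialize trips to a simple XML string (illustrative schema)."""
    root = ET.Element("trips")
    for t in trips:
        ET.SubElement(root, "trip", {
            "id": t["id"],
            "depart": f"{t['depart']:.2f}",
            "origin": t["from"],
            "destination": t["to"],
        })
    return ET.tostring(root, encoding="unicode")


demo = od_to_trips({("A", "B"): 3, ("B", "A"): 2}, [0.5, 0.5])
print(trips_to_xml(demo))
```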

S2.3.3: OD2TRIPS yields only the origin and destination of each vehicle trip, which is not sufficient. Therefore, we use the DUAROUTER plug-in of the SUMO tool, which uses shortest-path computation to calculate the routes that vehicles may take in SUMO, generating route trajectory information for each vehicle. The inputs are the road network file and the trip information generated by OD2TRIPS; the simulation time period and the shortest-path calculation method are set, and the output is vehicle travel trajectory information comprising vehicle ID, travel time and the roads traversed. This trajectory information helps us to analyze urban road traffic and regional travel patterns. The trajectory data set generated for one day is about 3 GB.
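As an illustration of shortest-path routing from an origin to a destination, the sketch below runs Dijkstra's algorithm on a small road graph. It is not the DUAROUTER implementation; the graph layout, edge IDs and function name are assumptions chosen for the example.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, str, float]]]  # node -> [(next node, edge id, length in m)]


def shortest_route(graph: Graph, origin: str, destination: str) -> Optional[Tuple[List[str], float]]:
    """Return (edge IDs along the shortest route, total length), or None if unreachable."""
    best = {origin: 0.0}
    previous: Dict[str, Tuple[str, str]] = {}
    heap = [(0.0, origin)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == destination:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, edge_id, length in graph.get(node, []):
            candidate = dist + length
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                previous[nxt] = (node, edge_id)
                heapq.heappush(heap, (candidate, nxt))
    if destination not in best:
        return None
    edges: List[str] = []
    node = destination
    while node != origin:  # walk back from destination to origin
        node, edge_id = previous[node]
        edges.append(edge_id)
    edges.reverse()
    return edges, best[destination]


roads: Graph = {
    "a": [("b", "e1", 100.0), ("c", "e2", 50.0)],
    "c": [("b", "e3", 30.0)],
    "b": [],
}
print(shortest_route(roads, "a", "b"))
```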

S2.3.4: In addition to vehicle trajectory information, microscopic information on the instantaneous speed, relative position, and latitude and longitude of each vehicle per unit time (every second) is also important for data verification in the present invention. We use the Trace File Generation plug-in of the SUMO tool to generate per-second vehicle information within a specified time interval, including latitude and longitude position, driving angle, instantaneous speed and road number. We input the corresponding information in the DUAROUTER function while writing the corresponding configuration file, set the time interval to 15 minutes, and generate vehicle tracking files for different time periods. One day of this type of data is about 200 GB.

S3 is an embodiment of the verification layer of the RDMP model of the present invention. After the social vehicle trajectory data are generated in step 2), a verification model is designed to verify the accuracy and authenticity of the generated data. In the verification layer, the invention verifies the data from both the macroscopic and the microscopic perspective.

S3.1: In the macroscopic model, the generated trajectory data are analyzed and compared with the actual traffic conditions described in the 2012 Beijing traffic development annual report.

S3.1.1: Traffic flow comparison. Fig. 3 compares the traffic flow on the major roads of Beijing. The actual data used are the main-road traffic flow data published by the Beijing traffic development research center in 2012. The results show that, in both the actual and the generated data, the all-day traffic of the West Fourth Ring Road and the East Fourth Ring Road ranks highest. The traffic flow of the West Fifth Ring Road and the South Second Ring Road is low, which indicates a light traffic load. Overall, the data generated by the method match the actual data well, except for the South Fifth Ring Road and the East Fifth Ring Road.

S3.1.2: Travel range. For human mobility studies, the travel time distribution and the travel distance distribution are two key parameters. By studying the travel volume in different periods, researchers can provide better travel optimization schemes, relieve road congestion and improve travel efficiency. In addition, the travel distance distribution can play an important role in road planning and travel prediction. Thus, in the present invention, we use the generated trajectory data set to analyze the travel time and distance distributions, and evaluate the accuracy of the simulated data by comparison with actual data. Fig. 4 shows the distribution of resident travel volume over travel time. The data compared comprise the generated data and the official statistics for the first half and the second half of 2012. The result shows that the generated trajectory data have the same travel characteristics as the actual data. Furthermore, two peak periods stand out, and the traffic volume in these two periods is about 50% of the total daily traffic volume. Fig. 5 shows our analysis of the resident travel distance distribution, in which we compare the generated data with the official travel distance distribution data for the first half and the second half of 2012. The distribution shows that the number of trips decreases as the travel distance increases. When driving, people prefer short trips: the shortest band, 0-5 kilometers, accounts for more than 40% of trips. In practice, when the travel distance is too long, people consider factors such as fuel consumption and time and choose travel modes such as trains and subways instead of driving.
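The travel distance distribution used in this comparison can be computed with a short sketch such as the following. The 5 km bin width follows the 0-5 km band mentioned in the text; the number of bins, the handling of the last bin and the function name are assumptions.

```python
from typing import List, Sequence


def distance_shares(distances_km: Sequence[float],
                    bin_width_km: float = 5.0,
                    n_bins: int = 6) -> List[float]:
    """Share of trips per distance band; the last band collects all longer trips."""
    counts = [0] * n_bins
    for d in distances_km:
        if d < 0:
            raise ValueError("distances must be non-negative")
        index = min(int(d // bin_width_km), n_bins - 1)
        counts[index] += 1
    total = len(distances_km)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Bands: 0-5 km, 5-10 km, 10 km and above
print(distance_shares([1.0, 4.0, 6.0, 12.0, 40.0], bin_width_km=5.0, n_bins=3))
```

Applying the same function to the generated trips and to the official statistics gives two share vectors that can be compared band by band.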

S3.1.3: Traffic conditions. From a macroscopic traffic-flow perspective, the overall travel distance and travel time of vehicles follow regular patterns. In the present invention, we validate the generated data set using the navigation service of the Baidu Map APIs. As mobile devices become ever more widely used, most people have experience with map service applications. In such software, the user enters an origin and a destination and selects a travel mode, such as walking, public transport or private car, to obtain the estimated route length and travel time. These results are based on real-world historical data, including data provided by the department of transportation, GPS-generated vehicle trajectory data, and data from various applications and electronic devices that can transmit position fixes. It is worth mentioning that mainstream map service providers such as Google Map and Baidu Map currently provide API interfaces for software developers, which makes it easy to invoke the services directly. Navigation services can be used to verify two important indicators of traffic flow:

· Route length: this indicator represents the length of the trip. Using the Baidu Map APIs, we input the coordinates of the start and end points of each trip in the generated data set into the routing function. The navigation service then returns a corresponding route and its length;

· Travel time: this indicator represents the duration of the trip. As with the route length, we use the navigation service to estimate the travel duration.

As can be seen from Fig. 5, more than 70% of vehicle trips are shorter than 10 km. Therefore, we focus on vehicle trips within 10 km. Fig. 6 is a scatter plot of travel time against route length for our generated data and for the predictions of the Baidu Map APIs. Fig. 6 (a) shows vehicle trips within 5 km, and Fig. 6 (b) shows vehicle trips of 5-10 km. We note that, in both route length ranges, the travel times generated by the present invention overlap to a large extent with the estimates provided by the navigation service. This also indicates that the generated vehicle speed distribution is close to that of real vehicles. In addition, the generated travel time data are closer to the real data for short trips within 5 km than for trips of 5-10 km, which indicates that the model is better suited to generating shorter travel trajectories.
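One way to quantify the agreement between generated travel times and navigation-service estimates is a per-band mean absolute error, sketched below. The reference values are plain inputs here; no map API call is shown, and the error metric, band limits and function name are assumptions rather than the patent's own procedure.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Trip = Tuple[float, float, float]  # (route length in km, generated time in s, reference time in s)


def mean_abs_error_by_band(trips: List[Trip],
                           bands: Sequence[Tuple[float, float]] = ((0.0, 5.0), (5.0, 10.0))
                           ) -> Dict[Tuple[float, float], Optional[float]]:
    """Mean absolute travel-time error per route-length band (None for an empty band)."""
    result: Dict[Tuple[float, float], Optional[float]] = {}
    for low, high in bands:
        errors = [abs(generated - reference)
                  for km, generated, reference in trips
                  if low <= km < high]
        result[(low, high)] = sum(errors) / len(errors) if errors else None
    return result


sample = [(2.0, 300.0, 330.0), (4.0, 600.0, 560.0), (7.0, 900.0, 1000.0)]
print(mean_abs_error_by_band(sample))
```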

S3.2: The microscopic model examines whether the data set itself contradicts reality. An acceleration analysis is performed first, starting with the calculation of the vehicle acceleration and its gradient, to analyze whether the data set is reliable. Then a consistency analysis is carried out, in which vehicle pairs are randomly extracted and the relative distance between the two vehicles is analyzed. The invention thus provides a method for quantitatively checking trajectory data. In this section, we selected the weekday morning peak (from 7:00) as the study period. In addition, we used four sets of observations in total for comparison, with consecutive sets separated by 15 minutes. All of these trajectory data fall within this morning peak period.

S3.2.1: Acceleration analysis. Vehicle acceleration is an important component in the study of vehicle dynamics and traffic flow. Therefore, it is necessary to verify the accuracy of the acceleration in the generated trajectory data. If the accuracy of the result is within a reasonable range, the generated data set can be regarded as internally plausible. An obvious way to validate the acceleration data is to check its distribution over the whole data set. Two types of problems introduced during data acquisition and estimation can then be clearly identified, namely infeasible extreme values and irregularly shaped distributions. Fig. 7 shows the acceleration frequency distributions of the raw data set and the generated data set. As can be seen from Fig. 7, the frequency distribution of vehicle acceleration is approximately normal for both the real data and the generated data, which supports the feasibility of the method provided by the present invention.

In addition to the distribution of the acceleration values, the acceleration gradient can also be an important indicator of data quality. The acceleration gradient J_e [m/s³] represents the change of acceleration over time, i.e. it is the derivative of the acceleration. In the present invention, actual acceleration gradient values concentrated within about ±3 m/s³ are considered acceptable for the study. Therefore, we propose three indices for acceleration gradient analysis:

· the percentage of trajectory data whose J_e exceeds the ±3 m/s³ threshold;

· the maximum J_e in the data set;

· the minimum J_e in the data set.
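The three indices can be computed from a per-second speed series with a sketch like the one below. It assumes acceleration and acceleration gradient are obtained by finite differences at a fixed sampling step; the function name and the returned dictionary keys are hypothetical.

```python
from typing import Dict, Sequence


def jerk_statistics(speeds: Sequence[float], dt: float = 1.0,
                    threshold: float = 3.0) -> Dict[str, float]:
    """Share of |J_e| above the threshold, plus maximum and minimum J_e.

    speeds are in m/s sampled every dt seconds; J_e is in m/s^3.
    """
    if len(speeds) < 3:
        raise ValueError("at least three speed samples are needed")
    # First difference: acceleration; second difference: acceleration gradient
    acceleration = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    jerk = [(b - a) / dt for a, b in zip(acceleration, acceleration[1:])]
    exceeding = sum(1 for j in jerk if abs(j) > threshold)
    return {
        "share_exceeding": exceeding / len(jerk),
        "max_jerk": max(jerk),
        "min_jerk": min(jerk),
    }


print(jerk_statistics([0.0, 1.0, 2.0, 3.0, 8.0, 8.0]))
```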

The statistical results of the acceleration gradient errors obtained by the analysis are shown in Table 2. As can be seen, the percentages of J_e values outside ±3 m/s³ were 2.99% and 3.28% for the two raw data periods. This means that the error statistics for J_e are fairly good. In the cross comparison, the error of the data set is below 10%, which shows that the data set is relatively accurate. In addition, the maximum and minimum J_e values of the selected data both reach relatively unreasonable values.

TABLE 2 acceleration gradient error statistics of true and generated trajectories

S3.2.2: Consistency analysis. A vehicle must keep a reasonable distance from other vehicles while traveling; otherwise, traffic accidents are likely to occur. This is also reflected in the vehicle trajectory data set. The invention verifies the authenticity of the generated data set from the distance between vehicles. The accuracy of a data set is questionable if it contains many unusually small spacings.

When we are concerned with two vehicles in close proximity to each other, the distance between the pair of vehicles can be used to quantify the error in estimating the trajectory data. In fact, at a certain moment, the distance between a pair of vehicles can be measured directly from the positions of two vehicles at that moment.

In our study, the spacing is taken to be the length of the road segment between the two vehicles. Formally, the vehicle spacing of vehicles v1 and v2 at time T is defined in terms of the distances from the starting point of the road to two points, namely the projections of the actual vehicle positions onto the road alignment.

To simplify the calculation, we assume that vehicles always travel in a straight line on the road. The distance between the vehicles is then calculated directly from the projection coordinates corresponding to the geographic coordinates of the vehicles.

In general, when the vehicle spacing drops below 5 m at any moment, the two vehicles are considered to collide, resulting in a traffic accident. Thus, the accuracy of the data set can be verified by measuring the proportion of such instances in the total data set. If there are a large number of outliers in the data set, the data set is most likely problematic. For the consistency analysis, the following statistics are meaningful:

· the total number of vehicle pairs in each selected data set;

· the average vehicle spacing of the vehicle pairs in the data set;

· the number and proportion of vehicle pairs with a spacing below 5 m;

· the maximum spacing of the vehicle pairs.
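These statistics can be sketched for the vehicles on one road at one instant as follows. The sketch assumes each vehicle is described by the along-road distance of its projection from the start of the road, and that the spacing is the absolute difference of two such distances; the 50 m pairing radius is the one used in the results discussion, and the function name and dictionary keys are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence


def spacing_statistics(offsets_m: Sequence[float],
                       pair_radius_m: float = 50.0,
                       collision_m: float = 5.0) -> Dict[str, Optional[float]]:
    """Pair count, mean spacing, too-close count and share, and maximum spacing."""
    gaps = [abs(a - b) for a, b in combinations(offsets_m, 2)]
    pairs = [g for g in gaps if g <= pair_radius_m]  # only nearby vehicles form a pair
    if not pairs:
        return {"pairs": 0, "mean_gap": None, "too_close": 0,
                "too_close_share": None, "max_gap": None}
    too_close = sum(1 for g in pairs if g < collision_m)
    return {
        "pairs": len(pairs),
        "mean_gap": sum(pairs) / len(pairs),
        "too_close": too_close,
        "too_close_share": too_close / len(pairs),
        "max_gap": max(pairs),
    }


print(spacing_statistics([0.0, 3.0, 30.0, 200.0]))
```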

The statistical results of the consistency analysis are shown in Table 3. Two vehicles within 50 m of each other are treated as a vehicle pair. The calculation results show that the total number of vehicle pairs is about 4000 to 5000. The selected data sets comprise the original data sets and the generated data sets for the time periods of the morning peak. It can be seen that the abnormal data in the generated data set are kept within a small range, which indicates that the generated data are highly realistic. Further, in the generated data set the average vehicle spacing is about 27 m, which is not much different from the actual average value (about 22 m). Fig. 8 shows the frequency distribution of the vehicle-pair spacing. The four groups of observation data have very similar distributions, which means that the generated data set has a vehicle movement pattern similar to that of the real data and indicates that the model provided by the invention performs well.

TABLE 3 consistency (Interval indicator) error statistics for real and generated trajectories

While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for generating an urban traffic track data set based on taxi data and an urban road network is characterized in that the method for generating the urban traffic track data set is completed on a three-layer RDMP vehicle track data set generation model, and the three-layer RDMP vehicle track data set generation model comprises the following steps:

S1.1: preprocessing: deleting useless trajectory data recorded in the empty state, and removing implausible data caused by equipment precision and statistical errors, to obtain usable taxi trajectory data;

S1.2: based on an adjacent road segmentation method, the studied city is divided into different regions using human movement trajectories and POIs (points of interest), and the travel of all vehicles is converted into flows between the different regions; the specific process is as follows:

S1.2.1: using an unsupervised DMR-based probabilistic topic model, the city is treated as a document, the different functions of each region as topics, the movement patterns between regions as words, and the POI feature vectors as document metadata; the frequency density v_{i,r} of the i-th POI category in region r is calculated as follows:

wherein Num_i represents the number of POIs of the i-th category in region r, and S_r represents the area of region r; the POI feature vector of region r is denoted x_r = (v_{1,r}, v_{2,r}, ..., v_{i,r}, ..., v_{F,r}, 1) and represents the metadata of region r, where F is the number of POI categories in r and the last component 1 is the default feature; the topic distribution of region r is a K-dimensional vector θ_r = (θ_{r,1}, θ_{r,2}, ..., θ_{r,e}, ..., θ_{r,K}), where θ_{r,e} represents the proportion of topic e in region r;

S1.2.2: clustering the topic distributions obtained in S1.2.1 with the k-means clustering algorithm, inputting the origin and destination coordinates of each trip into a KDE model, and quantifying the functional intensity within each functional region; given n regions, the functional intensity of region r is calculated with a kernel density estimation (KDE) model:

wherein d_{i,r} represents the distance from region i to region r, R represents the bandwidth, and KF(·) represents a Gaussian kernel function; after the functional intensity has been estimated, the divided regions are annotated to reflect the actual functions of the city, and the number attribute of region r is defined as K_a;

S1.2.3: region clustering

using the adjacent road segmentation method, important roads are regarded as line segments and grid cells as nodes; the Euclidean distance from each grid cell to each road is calculated, and the number K_l of the road nearest to each grid cell is recorded; after the calculation is finished, the grid cells of the rasterized map are clustered, and nodes with the same K_a and K_l values are taken as one cluster;

S2: in the generation layer:

S2.2.1: the attraction of destination region j for origin region i is set to be inversely proportional to the total population Q_ji within the circular area that is centered on the center of gravity of region j and whose radius R_ij is the distance between the centers of gravity of origin region i and destination region j;

S2.2.2: the center of gravity of each region is calculated from its grid cells, wherein L represents the number of grid cells of the region, and x_l and y_l respectively represent the relative longitude and relative latitude of the grid cells within the region;

S2.2.3: calculating the attraction A_ji of destination region j for origin region i based on the centers of gravity of the origin region and the destination region; the calculation method is as follows:

S2.3.4: to generate per-second vehicle information within a specified time interval, including longitude and latitude position, driving angle, instantaneous speed and road number, the Trace File Generation plug-in of the SUMO tool is used; the corresponding information is input in the DUAROUTER function while the corresponding configuration file is written, the time interval is set to Δt minutes, and vehicle tracking files for different time periods are generated;

S3.2.2: consistency analysis, namely verifying the authenticity of the generated data set from the distance between vehicles; when abnormally small spacings exist in the data set, the accuracy of the data set is doubtful;

S3.2.3: the distance between a pair of vehicles, i.e. the vehicle spacing of vehicles v1 and v2 at time T, is calculated from the distances from the starting point of the road to two points, these points being the projections of the actual vehicle positions onto the road alignment;

when the vehicle spacing falls below 5 m at any moment, the two vehicles collide and a traffic accident results; therefore, the proportion of such instances in the total data set is detected and used to verify the accuracy of the data set.

2. The method for generating an urban traffic trajectory data set based on taxi data and an urban road network according to claim 1, wherein the principle for judging important roads is as follows: the city map is rasterized into grid cells of 0.001 × 0.001 in longitude and latitude, and roads with an average all-day traffic volume of more than one hundred thousand are extracted and taken as important roads.