CN117037465A

CN117037465A - Traffic jam propagation mode sensing and visual analysis method

Info

Publication number: CN117037465A
Application number: CN202310593139.0A
Authority: CN
Inventors: 张慧杰; 吕程; 谢文强; 董家鹭
Original assignee: Northeast Normal University
Current assignee: Northeast Normal University
Priority date: 2023-05-24
Filing date: 2023-05-24
Publication date: 2023-11-10
Anticipated expiration: 2043-05-24

Abstract

The invention relates to a traffic jam propagation mode sensing and visual analysis method and belongs to the technical field of electric digital data processing. It provides a graph-mining-based visual analysis method for traffic congestion propagation that supports interactive exploration of how congestion propagates on urban roads, helping domain experts mine propagation patterns from large-scale traffic data and analyze and explain the spatio-temporal factors behind them. For mining propagation patterns, the method combines road network topology information to identify congestion propagation relationships in massive traffic data, and introduces graph neural networks to characterize and cluster congestion events and their propagation, so that traffic congestion propagation patterns are found effectively. For sensing and interpreting propagation patterns, the method provides a flexible visual analysis workflow and integrates views showing the spatio-temporal factors of congestion propagation patterns into an interactive visual analysis system, allowing domain experts to analyze congestion propagation in depth at multiple levels.

Description

Traffic jam propagation mode sensing and visual analysis method

Technical Field

The invention belongs to the technical field of electric digital data processing, and particularly relates to a traffic jam propagation mode sensing and visual analysis method.

Background

Traffic congestion is increasingly serious in large cities and has become a common problem hindering urban development, causing considerable inconvenience to travelers. Effectively relieving it is therefore of research significance for improving urban traffic planning and people's travel experience. The collection and analysis of urban traffic data provide a solid foundation for addressing this problem: by exploring massive travel data, the generation, development, and dissipation of urban road congestion can be observed, and the propagation patterns hidden behind the congestion can be revealed. This raises a series of questions, however. How to identify the propagation relationships of traffic congestion from massive data, and how to mine congestion propagation patterns in a complex real-world road network and display them intuitively, remain open problems and pain points at the present stage.

Therefore, a traffic congestion propagation mode sensing and visual analysis method is needed to solve these problems.

Disclosure of Invention

The invention aims to provide a traffic jam propagation mode sensing and visual analysis method that solves the above technical problems in the prior art. The method finds traffic congestion propagation patterns based on graph mining, senses the complex propagation dependencies of congestion among roads, brings human judgment into the loop through visualization, realizes an interactive analysis process, and reveals the spatio-temporal situation of congestion propagation patterns, thereby helping domain experts accurately and specifically locate bottlenecks in urban road traffic and formulate measures to relieve traffic congestion.

In order to achieve the above purpose, the technical scheme of the invention is as follows:

The traffic jam propagation mode sensing and visual analysis method comprises the following steps:

S1, sensing the occurrence and dissipation of road congestion through GPS track data of cruising taxis;

data cleaning is performed on the GPS track data; the GPS track points are mapped to the corresponding roads by a map matching algorithm and divided into time slices of fixed length, and the average speed of the GPS track points within a time slice is calculated as the travel speed of the road in that period;

S2, quantitatively calculating the congestion degree of each road from its travel speed on each time slice, where a run of consecutive congested time slices on a road is a traffic congestion event; quantifying the likelihood of congestion propagation between congestion events by a spatio-temporal neighbor relation, determining the congestion propagation relationships among the congestion events, and constructing a traffic congestion event propagation graph;

S3, characterizing the roads with a Node2Vec model and using the road embedding vectors as the features of the nodes in the traffic congestion event propagation graph; taking the traffic congestion event propagation graph as the input graph, characterizing each congestion event node in the propagation graph with a VGAE model, so that the congestion propagation relationships among the nodes and the spatio-temporal characteristics of the congestion events are encoded into embedding vectors; clustering the traffic congestion events with the HDBSCAN algorithm to mine traffic congestion propagation patterns; after clustering, performing spatio-temporal analysis and constructing a road congestion propagation overview graph for the congestion events of each cluster, and summarizing the spatio-temporal regularities of the congestion propagation patterns;

S4, introducing visual analysis to enhance traffic congestion event characterization and the mining of spatio-temporal congestion propagation regularities.

Further, the step S1 specifically includes:

the GPS track data are collected by cruising taxis in the city of the study area; the track points of the data set are concentrated in the urban area and reflect the operating condition of the city's road traffic;

each data sample in the data set represents one track point, which records the position of a taxi in space and time and is denoted pt; each track point comprises a vehicle ID, a timestamp, longitude and latitude coordinates, a travel speed, and some additional identification information;

performing data cleaning on the GPS track data;

organizing the discrete track points pt by vehicle ID and sorting them by recording time into a track point sequence $[pt_1, pt_2, pt_3, \ldots, pt_n]$, which corresponds to the driving of the vehicle over that time range;

dividing the sequence into subsequences by limiting the time interval and the geographic distance between adjacent track points: for two adjacent track points $pt_i$ and $pt_{i+1}$ in a sequence, the time interval must be less than a time threshold and the geographic distance must be less than a threshold $\delta$; a subsequence that satisfies both the temporal and the spatial constraint is called a track, denoted tj; the tracks tj are then filtered;

the ST-Matching map matching algorithm, denoted map, maps each track point pt to a road rd, that is, rd = map(pt);

calculating the average speed of the track points within a period as the travel speed of the road: time is divided into slices of fixed length TL; given a road $rd_i$ and a time slice $ts_j$, let $PT_{i,j}$ be the set of all track points matched to $rd_i$ within $ts_j$, each track point $pt \in PT_{i,j}$ having a speed record $v_{pt}$; a support parameter $\theta$ is added when calculating the average road speed: the number of track points $|PT_{i,j}|$ taking part in the calculation must be greater than the given threshold $\theta$, otherwise the calculation is invalid; the travel speed $v_{i,j}$ of road $rd_i$ on time slice $ts_j$ is calculated as

$$v_{i,j} = \frac{1}{|PT_{i,j}|} \sum_{pt \in PT_{i,j}} v_{pt}, \qquad |PT_{i,j}| > \theta$$

The track points of the m roads and n time slices under analysis are used to calculate the travel speeds $v_{i,j}$; with roads as rows and time slices as columns, this gives the road average speed matrix $V_{m \times n}$.
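The travel-speed calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `travel_speed_matrix`, the tuple layout of the input points, and the use of `None` for invalid cells are all assumptions made for the example.

```python
from collections import defaultdict

def travel_speed_matrix(points, roads, t0, slice_len, n_slices, theta):
    """Build the m x n road travel-speed matrix.

    points: iterable of (road_id, timestamp_s, speed) after map matching
            (hypothetical layout).
    roads: list of the m road ids, in row order.
    t0, slice_len: start time and fixed slice length TL, in seconds.
    theta: support threshold; a cell is valid only with more than theta points.
    """
    row = {rd: i for i, rd in enumerate(roads)}
    buckets = defaultdict(list)
    for road_id, ts, speed in points:
        j = int((ts - t0) // slice_len)          # index of the time slice
        if road_id in row and 0 <= j < n_slices:
            buckets[(row[road_id], j)].append(speed)

    # None marks cells whose calculation is invalid (too few track points).
    V = [[None] * n_slices for _ in roads]
    for (i, j), speeds in buckets.items():
        if len(speeds) > theta:
            V[i][j] = sum(speeds) / len(speeds)  # mean of the speed records
    return V
```

Cells left as `None` can then be skipped or interpolated by whatever later step consumes the matrix; that choice is outside what the text specifies.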

Further, the step S2 specifically includes:

constructing a road congestion back propagation network based on the dual representation of the road network and the backward propagation of traffic congestion: the network takes roads as nodes and the connections between roads as edges, and each edge points against the traffic direction, indicating the direction in which congestion propagates between adjacent roads;
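As an illustration of this dual, reversed-edge representation, the sketch below builds the back propagation network from a table of road successors. The input format (a dict mapping each road to the roads traffic can enter next) and the function name are assumptions for the example.

```python
def build_back_propagation_network(successors):
    """Return a dict road -> set of roads that congestion can spread to.

    successors: dict road -> iterable of roads reachable next in the
                traffic direction (hypothetical input format).
    Each traffic connection rd -> nxt becomes an edge nxt -> rd, because
    congestion spreads upstream, against the direction of travel.
    """
    network = {rd: set() for rd in successors}
    for rd, nexts in successors.items():
        for nxt in nexts:
            network.setdefault(nxt, set()).add(rd)
    return network
```

For a one-way chain A, B, C, congestion on C can spread to B and from B to A, while A has no outgoing edge.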

The construction process of the traffic jam event propagation diagram comprises the following steps:

creating an empty traffic congestion event propagation graph G; then, starting from the first time slice, selecting one congestion event in that time slice as the source event, calculating the road sequence R that satisfies the spatial neighbor relation with the source event, and extracting all congestion events occurring on the roads in R as the candidate target event set E; next, checking the temporal neighbor relation between the source event and each event in E, and, if it is satisfied, adding the two events to the propagation graph and connecting them with an edge; then taking each target event as a new source event and repeating the process until no new target event is added to the graph, after which a new source event is selected from the first time slice and the algorithm is repeated; if the first time slice has no remaining candidate source event, one is selected from the next time slice, and the process is repeated until all congestion events in all time slices have been processed.
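The construction loop can be sketched as a breadth-first expansion from each unvisited source event. The spatial and temporal neighbor relations are not defined at this point in the text, so they are passed in as hypothetical callables `spatial_neighbors` and `is_temporal_neighbor`; the event dictionary layout is likewise an assumption.

```python
from collections import deque

def build_propagation_graph(events, spatial_neighbors, is_temporal_neighbor):
    """Sketch of the propagation-graph construction.

    events: list of dicts with keys 'id', 'road', 'start' (first slice index).
    spatial_neighbors(road) -> iterable of roads in the road sequence R.
    is_temporal_neighbor(src, dst) -> bool.
    Returns (nodes, edges) as sets of event ids and (src_id, dst_id) pairs.
    """
    by_road = {}
    for ev in events:
        by_road.setdefault(ev["road"], []).append(ev)

    nodes, edges, visited = set(), set(), set()
    # Seeds are taken slice by slice, earliest first.
    for seed in sorted(events, key=lambda e: e["start"]):
        if seed["id"] in visited:
            continue
        visited.add(seed["id"])
        queue = deque([seed])
        while queue:
            src = queue.popleft()
            for road in spatial_neighbors(src["road"]):
                for dst in by_road.get(road, []):   # candidate target set E
                    if dst["id"] == src["id"]:
                        continue
                    if is_temporal_neighbor(src, dst):
                        nodes.update((src["id"], dst["id"]))
                        edges.add((src["id"], dst["id"]))
                        if dst["id"] not in visited:
                            visited.add(dst["id"])
                            queue.append(dst)       # target becomes a source
    return nodes, edges
```

In this sketch, events that never satisfy both relations with another event are visited but not added to the graph, matching the rule that two events are added only when the temporal neighbor relation holds.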

Further, the data cleaning of the GPS track data comprises:

(1) Geographic area restriction;

(2) Recording time restriction;

(3) Duplicate track point restriction;

(4) Travel speed restriction;

(5) Road range restriction.

Further, the track tj is filtered as follows:

(1) Number of track points; the track must contain five or more track points;

(2) Travel time; the time interval from the first track point to the last track point of the track must be greater than or equal to 25 seconds;

(3) Distance travelled; the sum of the distances between all pairs of adjacent track points in the track must be greater than or equal to 400 meters.

Further, the ST-Matching map Matching algorithm is as follows:

(1) Determining the candidate point set; taking an actually observed track point $pt_i \in tj$ of track tj as the center, a radius r is specified to determine a circle; normals are drawn from $pt_i$ to all roads within the circle, and the intersection of the j-th normal with its road is a candidate point $c_i^j$; $pt_i$ may have one or more candidate points, all of which constitute the candidate point set of $pt_i$;

(2) Candidate point spatial analysis; the selection weight between two adjacent candidate points is calculated by spatio-temporal analysis of the candidate points, which considers both the spatial weight and the temporal weight of the adjacent candidate points; the spatial weight is measured jointly by the observation probability, based on the distance between a GPS track point and its candidate point, and the transition probability;

the observation probability uses a normal distribution $N(\mu, \sigma^2)$ with parameters $\mu$ and $\sigma$ to measure how close the GPS track point $pt_i$ is to a candidate point $c_i^j$, based on the Euclidean distance between $pt_i$ and its candidate point $c_i^j$; the candidate point observation probability is defined by a formula;

given two adjacent track points $pt_{i-1}$ and $pt_i$ and their respective candidate points $c_{i-1}^t$ and $c_i^s$, $d_{i-1 \to i}$ denotes the Euclidean distance from track point $pt_{i-1}$ to $pt_i$, and $w_{(i-1,t) \to (i,s)}$ denotes the length of the shortest path from candidate point $c_{i-1}^t$ to candidate point $c_i^s$; the transition probability V is defined by a formula;

given two adjacent candidate points $c_{i-1}^t$ and $c_i^s$, the spatial analysis function $F_s$ is defined by a formula, where n denotes the number of track points;
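The three formulas referred to above are not reproduced in this text. As a hedged sketch, assuming they follow the published ST-Matching formulation, with $c_i^j$ denoting the j-th candidate point of $pt_i$ and with the distance notation $\operatorname{dist}(\cdot,\cdot)$ and the function name $N(c_i^j)$ introduced here only for illustration, they would read:

$$N(c_i^j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(\operatorname{dist}(pt_i, c_i^j) - \mu\right)^2}{2\sigma^2}\right)$$

$$V(c_{i-1}^t \to c_i^s) = \frac{d_{i-1 \to i}}{w_{(i-1,t) \to (i,s)}}$$

$$F_s(c_{i-1}^t \to c_i^s) = N(c_i^s) \cdot V(c_{i-1}^t \to c_i^s), \qquad 2 \le i \le n$$

Under this reading, a candidate scores highly when it lies close to the observed point and when the road-network path between consecutive candidates is not much longer than the straight-line distance between the track points.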

(3) Candidate point temporal analysis; the temporal analysis function $F_t$ takes into account the travel speed along the track and the speed limits of the roads, so that adjacent roads running in the same direction, which the spatial analysis function cannot tell apart, can be distinguished, improving the quality of map matching; for candidate points $c_{i-1}^t$ and $c_i^s$ of adjacent track points $pt_{i-1}$ and $pt_i$, their shortest path comprises the road segments $[rd_1, rd_2, \ldots, rd_u, \ldots, rd_k]$, where the length of road segment $rd_u$ is denoted $len_u$ and its speed limit $lim_u$, and the travel time from $pt_{i-1}$ to $pt_i$ is denoted $\Delta t_{i-1 \to i}$; the average speed from $pt_{i-1}$ to $pt_i$ along this path is calculated by a first formula, and the temporal analysis function $F_t$ for candidate points $c_{i-1}^t$ and $c_i^s$ is given by a second formula;
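The two formulas of this step are likewise not reproduced in this text. Assuming the published ST-Matching definitions, with $c_{i-1}^t$ and $c_i^s$ denoting the candidate points and the symbol $\bar{v}_{(i-1,t) \to (i,s)}$ for the average speed introduced here as an assumption, a sketch would be:

$$\bar{v}_{(i-1,t) \to (i,s)} = \frac{\sum_{u=1}^{k} len_u}{\Delta t_{i-1 \to i}}$$

$$F_t(c_{i-1}^t \to c_i^s) = \frac{\sum_{u=1}^{k} lim_u \cdot \bar{v}_{(i-1,t) \to (i,s)}}{\sqrt{\sum_{u=1}^{k} lim_u^2}\;\sqrt{\sum_{u=1}^{k} \bar{v}_{(i-1,t) \to (i,s)}^2}}$$

The second expression is the cosine similarity between the vector of segment speed limits and a vector whose k entries all equal the average speed.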

(4) Path matching; before path matching, a candidate graph is constructed by taking each track point as a stage and the candidate points of that track point as the states of the stage; each pair of candidate points $c_{i-1}^t$ and $c_i^s$ of adjacent track points forms an edge of the candidate graph, and the transition probability of the edge is expressed by the spatio-temporal analysis function F;

for a candidate path sequence $P_{cand}$ of track tj, the sequence score is given by a first formula, and the highest-scoring sequence is the best matching path MP, given by a second formula.
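The score and best-path formulas are not reproduced in this text. Assuming, as in the published ST-Matching algorithm, that the score of a candidate sequence is the sum of F over its consecutive candidate pairs, the best matching path can be found by dynamic programming over the stages of the candidate graph. In the sketch below, `edge_score` is a hypothetical stand-in for F, and the first-stage scores are initialised to zero for simplicity.

```python
def best_matching_path(candidates, edge_score):
    """Find the highest-scoring path through a staged candidate graph.

    candidates: list of stages; each stage is a non-empty list of hashable
                candidate points (one stage per track point).
    edge_score(prev, cur) -> float, a stand-in for the function F.
    Returns (score, path).
    """
    # best[c] = (best score of any path ending at c, that path)
    best = {c: (0.0, [c]) for c in candidates[0]}
    for stage in candidates[1:]:
        nxt = {}
        for cur in stage:
            score, path = max(
                ((s + edge_score(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda sp: sp[0],
            )
            nxt[cur] = (score, path + [cur])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])
```

Because each stage only needs the best score per candidate of the previous stage, the cost grows with the number of candidate pairs between adjacent track points rather than with the number of complete paths.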

Compared with the prior art, the invention has the following beneficial effects:

The method finds traffic congestion propagation patterns based on graph mining, senses the complex propagation dependencies of congestion among roads, brings human judgment into the loop through visualization, realizes an interactive analysis process, and reveals the spatio-temporal situation of congestion propagation patterns, thereby helping domain experts accurately and specifically locate bottlenecks in urban road traffic and formulate measures to relieve traffic congestion.

Drawings

Fig. 1 is a flow chart of a traffic jam propagation mode sensing and visual analysis method provided by the invention.

FIG. 2 is a view of a study area provided by the present invention, (a) a geographic area; (b) a road network.

Fig. 3 is a schematic view of road range limitation provided by the present invention.

Fig. 4 is a schematic diagram of a map matching example provided by the present invention.

Fig. 5 is a schematic diagram of a traffic jam event acquisition flow provided by the present invention.

Fig. 6 is a schematic diagram of a time neighbor relation provided by the present invention.

Fig. 7 is a schematic diagram of spatial neighbor relation provided by the present invention.

FIG. 8 is a schematic diagram of the transition probability of an edge according to the present invention.

Fig. 9 is a schematic diagram of a VGAE model framework provided by the present invention.

FIG. 10 is a simplified schematic diagram of the present invention.

Fig. 11 is a schematic diagram of the dimensionality-reduced scatter plot of congestion event embedding vectors provided by the present invention.

Fig. 12 is a schematic diagram of a road congestion propagation overview construction algorithm provided by the invention.

Detailed Description

For the purpose of making the technical solution and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.

The method framework is shown in Fig. 1.

(1) Data set and data preprocessing

This scheme senses the occurrence and dissipation of road congestion through the GPS track data of cruising taxis. First, the GPS track data are cleaned to improve data quality. Because the GPS track data contain only geographic positions, a map matching algorithm is used to map the GPS track points to the corresponding roads; the points are then divided into time slices of fixed length, and the average speed of the GPS track points within a time slice is taken as the travel speed of the road in that period.

(2) Congestion event extraction and propagation map construction

The travel speed of a road reflects its operating condition; a travel speed below the road's normal level indicates that traffic congestion has occurred. This scheme quantitatively calculates the congestion degree of each road from its travel speed on each time slice, and a run of consecutive congested time slices on a road constitutes a traffic congestion event.
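Grouping consecutive congested time slices into events can be sketched as follows. The congestion-degree formula is not given at this point, so the example assumes a simple rule, a slice being congested when its speed falls below a fraction `ratio` of the road's normal speed; that rule, the default value 0.5, and the function name are illustrative assumptions only.

```python
def extract_congestion_events(speeds, normal_speed, ratio=0.5):
    """Return congestion events of one road as (first_slice, last_slice) pairs.

    speeds: travel speed per time slice (None where the value is invalid).
    normal_speed: the road's normal travel speed.
    ratio: assumed congestion threshold as a fraction of normal_speed.
    """
    events, start = [], None
    for j, v in enumerate(speeds):
        congested = v is not None and v < ratio * normal_speed
        if congested and start is None:
            start = j                        # a new run of congestion begins
        elif not congested and start is not None:
            events.append((start, j - 1))    # the run ended at the previous slice
            start = None
    if start is not None:                    # run still open at the end
        events.append((start, len(speeds) - 1))
    return events
```

In this sketch an invalid slice (`None`) ends the current event; whether such gaps should instead be bridged is a design choice the text does not settle.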

In a road network, traffic congestion may propagate to surrounding roads. This scheme uses a spatio-temporal neighbor relation to quantify the likelihood of congestion propagation between congestion events, determines the propagation relationships among the events, and constructs a traffic congestion event propagation graph. Each event node in the propagation graph takes a road embedding vector as its feature vector, describing the road on which the congestion occurs. The road embedding vectors are obtained by node characterization of the road congestion back propagation network established by this scheme.

(3) Congestion event characterization and congestion propagation rule mining

This scheme takes the traffic congestion event propagation graph as the input graph and characterizes each congestion event node in it with a VGAE (Variational Graph Auto-Encoder) model. In this process, the congestion propagation relationships between the nodes of the propagation graph and the spatio-temporal characteristics of the congestion events are encoded into embedding vectors.

Traffic congestion events with similar propagation characteristics have similar embedding vectors, so this scheme clusters the congestion events with the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to mine traffic congestion propagation patterns. After the congestion events are clustered, the scheme further performs spatio-temporal analysis and constructs a road congestion propagation overview graph for the congestion events of each cluster, summarizes the spatio-temporal regularities of the congestion propagation patterns, and deepens the perception and understanding of traffic congestion propagation.

(4) Visual analysis and exploration

This scheme introduces visual analysis to enhance traffic congestion event characterization and the mining of spatio-temporal propagation regularities, so that the method can be applied more widely. It provides a visual analysis system, JPViz, which integrates multiple interaction modes and rich views, and supports users in perceiving urban traffic congestion at multiple levels and exploring how congestion propagates.

Traffic congestion is the phenomenon in which travel speed on a road drops because the number of vehicles actually passing exceeds the road's capacity; it has become a serious problem that troubles daily life and the rapid development of cities. As traffic congestion receives growing attention and technical means keep improving, various methods for monitoring road operating conditions have been studied and applied, with the aim of observing road congestion promptly and accurately and providing effective countermeasures.

Vehicle GPS track data are widely used to obtain road travel speeds and monitor road conditions. This scheme takes the same approach: it obtains the operating state of urban roads from a GPS track data set of cruising taxis and identifies traffic congestion events. The ST-Matching method is used to map-match the taxi GPS track data, converting the geographic position of each track point into the corresponding road. The travel speed of each road is then calculated from the track points on it, the degree of road congestion is quantified, and traffic congestion events are identified.

In a road network, traffic congestion propagates, and this scheme establishes the congestion propagation relationships among congestion events by calculating the spatio-temporal neighbor relations between them. From these relationships a traffic congestion event propagation graph is constructed; the graph takes traffic congestion events as nodes and the congestion propagation relationships between them as edges, and describes the continuous propagation of traffic congestion events.

The study area of this scheme is the rectangular area shown in Fig. 2a, and the city studied lies in a certain province of China. The construction of its traffic system is important both for the city itself and for surrounding cities. However, as the number of motor vehicles keeps growing, urban traffic in the city faces serious problems, and travel conflicts such as road congestion are increasingly prominent. Studying the propagation patterns of road traffic congestion with this city as an example therefore has theoretical significance, meets an urgent practical need, and has considerable research value.

Road network data for the study area are provided by OpenStreetMap and were downloaded and processed with OSMnx. OSMnx provides a set of programmable interfaces for downloading geographic data from OpenStreetMap and for modeling, projecting, and analyzing the downloaded data. The road network obtained for the study area comprises 298703 nodes, which represent intersections between roads, and 23690 edges, which represent roads. The basic information of each road is described by its edge and mainly includes the road number, road name, road length, and road geometry. Fig. 2b is a schematic view of the road network: the nodes are hidden and only the edges representing roads are shown, drawn with the actual geometry of the roads.

The GPS track data used in this scheme were collected by cruising taxis in the city of the study area; the track points of the data set are concentrated in the urban area and reflect the operating condition of the city's road traffic.

Each data sample in the data set represents one track point, which records the position of a cruising taxi in space and time and is denoted pt. Each track point mainly comprises a vehicle ID, a timestamp, longitude and latitude coordinates, a travel speed, and some additional identification information such as license plate color. The data set is raw; no data cleaning had been performed on it.

To improve data quality and the accuracy of the analysis results, this scheme cleans the GPS track data, mainly as follows:

(1) Geographic area restriction. Although the track points of the GPS track data set are concentrated in the urban area, some lie in surrounding cities because taxis operate across city boundaries. The scheme therefore restricts the spatial range of track points; the restricted area is consistent with the road network data and is a rectangular area containing the urban area, shown as the dashed box in Fig. 2a;

(2) Recording time restriction. The GPS track data set was collected from October 4, 2021 to October 17, 2021; this is defined as the time range of the scheme, and track points outside it are removed;

(3) Duplicate track point restriction. A taxi can have only one valid track point at a given moment; track point records with the same vehicle ID and recording timestamp are defined as duplicate track points. Analysis showed that the data set contains some duplicates; only one track point of each duplicate group is kept;

(4) Travel speed restriction. The travel speed of some track points far exceeds what a vehicle can reach in practice, and track points with such unreasonable speed information are removed. With reference to current road speed limit regulations, the reasonable travel speed interval is set to 0-130 km/h;

(5) Road range restriction. The scheme defines the space within 55 m on either side of a road as the road range and removes track points outside it as noise. Two practical observations make this restriction necessary. First, GPS recording devices cannot record the exact position of a vehicle every time; recorded positions are scattered over a spatial range, and in rare cases the offset is so severe that the track point cannot be analyzed as a normal record. Second, because map data may be out of date and vehicles can also travel on some non-standard roads, some track points in the data set do not belong to any road in the existing road network, and such points would also affect the analysis of surrounding roads. As shown in Fig. 3, the black lines are roads, the gray shading is the road restriction area, triangles represent track points within the restricted range, and dots represent noise track points outside the road range.
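The five cleaning rules can be combined into a single filtering pass, sketched below. The record layout, the function name `clean_points`, and the helper `dist_to_road` (which would return the distance in metres from a position to the nearest road) are hypothetical; the bounding box is left as a parameter because the coordinates of the study rectangle are not given in the text.

```python
from datetime import datetime

# Collection period stated in the text.
T_START = datetime(2021, 10, 4)
T_END = datetime(2021, 10, 17, 23, 59, 59)

def clean_points(points, bbox, dist_to_road,
                 t_start=T_START, t_end=T_END,
                 max_speed=130.0, road_range=55.0):
    """Apply the five cleaning rules to raw track points.

    points: iterable of dicts with 'vid', 'time' (datetime), 'lon', 'lat',
            'speed' in km/h (hypothetical layout).
    bbox: (min_lon, min_lat, max_lon, max_lat) of the study rectangle.
    dist_to_road(lon, lat) -> distance in metres to the nearest road.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    seen, kept = set(), []
    for p in points:
        if not (min_lon <= p["lon"] <= max_lon and min_lat <= p["lat"] <= max_lat):
            continue                                  # (1) outside study area
        if not (t_start <= p["time"] <= t_end):
            continue                                  # (2) outside time range
        key = (p["vid"], p["time"])
        if key in seen:
            continue                                  # (3) duplicate record
        if not (0.0 <= p["speed"] <= max_speed):
            continue                                  # (4) unreasonable speed
        if dist_to_road(p["lon"], p["lat"]) > road_range:
            continue                                  # (5) beyond road range
        seen.add(key)
        kept.append(p)
    return kept
```

Rule (5) is the most expensive, since it needs a nearest-road query; in this sketch it runs last so that cheaper checks discard points first.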

To better describe the driving state of cruising taxis, this scheme groups the discrete track points pt by vehicle ID and sorts them by recording time into a track point sequence $[pt_1, pt_2, pt_3, \ldots, pt_n]$, which corresponds to the driving of the vehicle over that time range. In real life a cruising taxi is not driving all the time, so adjacent track points in the sequence may be far apart in time or space, and two such points are insufficient to infer how the vehicle actually travelled between them. The scheme therefore divides the sequence into subsequences by limiting the time interval and the geographic distance between adjacent track points: for two adjacent track points $pt_i$ and $pt_{i+1}$, the time interval must be less than a time threshold and the geographic distance must be less than a threshold $\delta$. A subsequence satisfying both constraints better reflects the driving of the vehicle over a period of time; the scheme calls it a track, denoted tj.
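The splitting rule can be sketched as a single pass over one vehicle's time-sorted points. The threshold values are not stated in the text, so they are left as parameters; the point layout and the use of the haversine great-circle distance are assumptions for the example.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat positions."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def split_tracks(seq, max_gap_s, delta_m):
    """Split one vehicle's point sequence into tracks.

    seq: time-sorted list of (t_seconds, lon, lat) (hypothetical layout).
    A new track starts whenever the time gap reaches max_gap_s or the
    distance to the previous point reaches delta_m.
    """
    tracks, cur = [], []
    for pt in seq:
        if cur:
            t0, lon0, lat0 = cur[-1]
            too_late = pt[0] - t0 >= max_gap_s
            too_far = haversine_m(lon0, lat0, pt[1], pt[2]) >= delta_m
            if too_late or too_far:
                tracks.append(cur)
                cur = []
        cur.append(pt)
    if cur:
        tracks.append(cur)
    return tracks
```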

In practice, not every track obtained this way meets the needs of subsequent analysis. To keep only the tracks that better reflect the driving of the vehicle, this scheme filters tracks by the following three conditions:

(1) Number of track points. A track must contain five or more track points; a track that is too short cannot accurately reflect the actual driving of the vehicle on the road;

(2) Travel time. The time interval from the first track point to the last track point of the track must be greater than or equal to 25 seconds;

(3) Distance travelled. The sum of the distances between all pairs of adjacent track points in the track must be greater than or equal to 400 meters.
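The three filtering conditions translate directly into a predicate. The sketch below assumes, for simplicity, that track points have already been projected to a metric coordinate system so that plain Euclidean distance can be used; the function name and point layout are illustrative.

```python
import math

def keep_track(track, min_points=5, min_duration_s=25, min_length_m=400):
    """Return True if a track passes all three filtering conditions.

    track: time-sorted list of (t_seconds, x_m, y_m), with coordinates in
           a projected metric system (an assumption of this sketch).
    """
    if len(track) < min_points:                       # (1) number of points
        return False
    if track[-1][0] - track[0][0] < min_duration_s:   # (2) travel time
        return False
    # (3) summed distance between all pairs of adjacent points
    length = sum(math.dist(a[1:], b[1:]) for a, b in zip(track, track[1:]))
    return length >= min_length_m
```

A list of tracks can then be filtered with `[tj for tj in tracks if keep_track(tj)]`.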

GPS track data describe, as points given by longitude and latitude, where taxis are distributed in geographic space, while urban road data express, as line segments given by longitude and latitude, the geometry of roads in geographic space. Although track points and urban roads lie in the same geographic space, no direct link exists between them, so it cannot be determined directly whether a given track point was collected on a given road. When a GPS track point is very close to a road, the road on which it was collected can be judged intuitively, but for a complex urban road network and a massive GPS track data set, such a simple judgment cannot be applied in practice.

In the subsequent work, the congestion condition of a road is calculated from the track points on it, so a mapping must be established between the GPS track data and the road network. Through this mapping, a track point with a given geographic position is assigned to a specific road: for a track point pt, a mapping method map must be determined such that pt corresponds to road rd with rd = map(pt).

Some studies map each track point directly to the geographically nearest road, which is simple and fast. However, this approach has limitations for complex urban roads and real-world track data. Most urban roads have multiple lanes close to each other, and GPS track data collected by taxis tend to be offset, so the road nearest to a track point is often not the road actually travelled. An example of such an offset is shown in the black dashed box in Fig. 4: the squares are the recorded original track points, all of which were collected while driving in the right-hand lane along the path from a to e. Within the black dashed box, the square original track points are closer to the opposite road.

This scheme uses the ST-Matching map matching algorithm to map track points to roads. ST-Matching is a map matching algorithm suited to low-sampling-rate tracks; it combines geometric information on the distance between track points and roads, the topology of the road network, and information on the track actually travelled to match a globally optimal path. The algorithm is fast, stable, and accurate. It is briefly described below.

(1) A candidate set of points is determined. The track point pt actually observed as track tj _i E, tj as the center of a circle, designating a radius r to determine a circle, passing pt _i Making normals of all roads in the circle, wherein the intersection point of the jth normals and the road is a candidate pointpt _i There may be one orA plurality of candidate points, all of which constitute a candidate point set->

(2) Spatial analysis of candidate points. Spatio-temporal analysis of the candidate points computes the selection weight between the candidate points of two adjacent track points, taking both a spatial weight and a temporal weight into account. The spatial weight is measured jointly by the observation probability, which depends on the distance between a GPS track point and its candidate point, and the transition probability.

The observation probability uses a normal distribution $N(\mu, \sigma^2)$ with parameters μ and σ to measure how closely the GPS track point $pt_i$ is approximated by one of its candidate points $c_i^j$. With $x_i^j$ denoting the Euclidean distance between $pt_i$ and its candidate point $c_i^j$, the observation probability of the candidate point is given by formula 3-1.

Given two adjacent track points $pt_{i-1}$ and $pt_i$ with respective candidate points $c_{i-1}^t$ and $c_i^s$, let $d_{i-1 \to i}$ denote the Euclidean distance from track point $pt_{i-1}$ to $pt_i$, and let $w_{(i-1,t) \to (i,s)}$ denote the length of the shortest path from candidate point $c_{i-1}^t$ to candidate point $c_i^s$. The transition probability V is given by formula 3-2.

Combining the observation probability of formula 3-1 with the transition probability of formula 3-2, the spatial analysis function $F_s$ for two adjacent candidate points $c_{i-1}^t$ and $c_i^s$ is defined by formula 3-3.
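The spatial analysis can be illustrated with a minimal sketch. It assumes the form commonly published for ST-Matching: a normal density evaluated at the point-to-candidate distance, multiplied by the ratio of the straight-line distance between track points to the shortest-path length between candidates. The function names and the default σ = 20 are illustrative assumptions, not values taken from this scheme.

```python
import math

def observation_probability(dist, mu=0.0, sigma=20.0):
    """Normal density N(mu, sigma^2) evaluated at the point-to-candidate distance.
    mu and sigma are assumed defaults for illustration only."""
    return math.exp(-((dist - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def transition_probability(euclid_dist, shortest_path_len):
    """Ratio d / w: straight-line distance between two adjacent track points
    over the shortest-path length between their candidate points."""
    if shortest_path_len <= 0:
        raise ValueError("shortest path length must be positive")
    return euclid_dist / shortest_path_len

def spatial_score(dist_to_candidate, euclid_dist, shortest_path_len, mu=0.0, sigma=20.0):
    """Spatial analysis value F_s: observation probability times transition probability."""
    return (observation_probability(dist_to_candidate, mu, sigma)
            * transition_probability(euclid_dist, shortest_path_len))
```

A candidate whose connecting path is a long detour relative to the straight-line distance receives a small ratio and therefore a low spatial score.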

(3) Temporal analysis of candidate points. The temporal analysis function $F_t$ takes into account the travel speed of the trajectory and the speed limits of the roads. It covers the case in which the spatial analysis function cannot distinguish adjacent roads running in the same direction, and thereby improves the quality of trajectory map matching. For the candidate points $c_{i-1}^t$ and $c_i^s$ of adjacent track points $pt_{i-1}$ and $pt_i$, the shortest path between them comprises the road segments $[rd_1, rd_2, \ldots, rd_u, \ldots, rd_k]$, where road segment $rd_u$ has length $len_u$ and speed limit $lim_u$. With the travel time from track point $pt_{i-1}$ to $pt_i$ denoted $\Delta t_{i-1 \to i}$, the average speed from $pt_{i-1}$ to $pt_i$ is given by formula 3-4, and the temporal analysis function for the candidate points $c_{i-1}^t$ and $c_i^s$ is given by formula 3-5.
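As a hedged illustration of the temporal analysis, the sketch below assumes the form used in the published ST-Matching algorithm: the average speed is the total shortest-path length divided by the travel time, and the temporal score is the cosine similarity between the vector of segment speed limits and a vector holding the average speed for every segment. Function and parameter names are hypothetical.

```python
import math

def average_speed(segment_lengths, travel_time):
    """Average speed over the shortest path: total length divided by travel time."""
    return sum(segment_lengths) / travel_time

def temporal_score(segment_lengths, speed_limits, travel_time):
    """Cosine similarity between the speed-limit vector and the constant
    average-speed vector (assumed form of the temporal analysis function)."""
    v = average_speed(segment_lengths, travel_time)
    numerator = sum(lim * v for lim in speed_limits)
    denominator = (math.sqrt(sum(lim ** 2 for lim in speed_limits))
                   * math.sqrt(len(speed_limits) * v ** 2))
    return numerator / denominator if denominator else 0.0
```

For example, `temporal_score([100, 100], [10, 10], 20)` is approximately 1, because the speed limits are uniform along the path, while a path mixing limits of 10 and 30 scores lower.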

(4) Path matching. Before path matching, a candidate graph is constructed in which each track point is a stage and the candidate points of that track point are the states of the stage. Each pair of candidate points $c_{i-1}^t$ and $c_i^s$ of adjacent track points forms an edge of the candidate graph. The transition probability of the edge is the spatio-temporal analysis function F, which combines the spatial analysis function of formula 3-3 with the temporal analysis function of formula 3-5, as given by formula 3-6.

For a trajectory tj with n track points, a candidate path sequence is $P_{cand}: c_1^{s_1} \to c_2^{s_2} \to \cdots \to c_n^{s_n}$. The score of the sequence is given by formula 3-7, and the sequence with the highest score is the best matching path MP, as given by formula 3-8.
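The search for the highest-scoring candidate sequence can be sketched as a dynamic program over the stages of the candidate graph. This is a minimal sketch assuming the sequence score is the sum of the edge scores F along the path; `candidates` and `score` are hypothetical inputs standing in for the candidate sets and the spatio-temporal analysis function.

```python
def best_matching_path(candidates, score):
    """candidates: one list of candidate ids per track point (stage).
    score(i, prev_c, cur_c): value of F for the edge from candidate prev_c
    at stage i-1 to candidate cur_c at stage i.
    Returns (best candidate sequence, its total score)."""
    best = {c: 0.0 for c in candidates[0]}   # best total score ending at each state
    back = [{}]                              # back-pointers per stage
    for i in range(1, len(candidates)):
        new_best, pointers = {}, {}
        for cur in candidates[i]:
            prev = max(best, key=lambda p: best[p] + score(i, p, cur))
            new_best[cur] = best[prev] + score(i, prev, cur)
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(back[1:]):      # walk the back-pointers to stage 0
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

Because each stage only needs the best score of the previous stage, the cost grows linearly with the number of track points rather than exponentially with the number of candidate combinations.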

Fig. 4 shows, as an example, a trajectory matching result obtained with this map-matching method. The dots in the figure are the track points after map matching, which shows that the ST-Matching method combines road network information and trajectory information well when matching roads. In particular, in the area of the black dashed box, the ST-Matching algorithm correctly accounts for the travel direction of the trajectory and the topology of the road network, and matches the drifted track points, which lie beyond the opposite-direction road, to the correct road.

In the scheme, each track point of the GPS trajectory data set records the driving speed of a taxi. The speed of the track points is taken as the indicator of road congestion, and the average speed of the track points within a period of time is computed as the travel speed of the road. Time slices are divided with a fixed length TL, which is set to 10 min. For a given road $rd_i$ and time slice $ts_j$, let $PT_{i,j}$ be the set of all track points on the road within the time slice, where each track point $pt \in PT_{i,j}$ has a speed record $v_{pt}$. To bring the computed average road speed closer to the actual situation and to prevent abnormal records of individual vehicles from distorting it, the scheme adds a support parameter θ: the number of track points $|PT_{i,j}|$ in the time slice that take part in computing the average speed must be greater than the given threshold θ, otherwise the calculation is invalid. The travel speed $\bar{v}_{i,j}$ of road $rd_i$ in time slice $ts_j$ is computed by formula 3-9.

Using the track points of the m roads and n time slices to be analyzed, the scheme computes the average road speeds $\bar{v}_{i,j}$. Taking the roads as the rows of a matrix and the time slices as its columns yields the road average speed matrix $V_{m \times n}$.
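The construction of the average speed matrix with the support threshold can be sketched as follows. This is a minimal sketch: the input layout (road id, timestamp in seconds from the start of the analysis window, speed), the use of `None` for invalid cells, and the default θ = 3 are assumptions for illustration.

```python
from collections import defaultdict

def road_speed_matrix(points, roads, n_slices, slice_len=600, theta=3):
    """points: iterable of (road_id, timestamp_seconds, speed).
    Returns an m x n list of lists; a cell is None when the number of
    track points in that time slice is not greater than theta."""
    buckets = defaultdict(list)
    for road, ts, speed in points:
        buckets[(road, int(ts // slice_len))].append(speed)
    matrix = []
    for road in roads:
        row = []
        for j in range(n_slices):
            speeds = buckets.get((road, j), [])
            row.append(sum(speeds) / len(speeds) if len(speeds) > theta else None)
        matrix.append(row)
    return matrix
```

With `slice_len=600` each column corresponds to one 10-minute time slice, matching TL in the text.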

Roads across the road space differ in design traffic capacity, daily maintenance level, and actual traffic demand, so a single fixed speed threshold is not suitable for measuring the congestion state of the whole road space. The scheme therefore computes a speed reduction index (SRI) from the estimated free-flow speed of each road to determine the congestion state of the road in each time slice. The free-flow speed is the average travel speed of the road during off-peak hours and can be used to estimate the everyday traffic capacity of the road. It can be inferred from a large amount of data observed on the road. Taking road $rd_i$ as an example, all valid average speeds of the road are sorted in ascending order, and the speed at the F% position of the sorted values is taken as the free-flow speed $v_{ff_i}$ of road $rd_i$; F = 85 is a commonly used value in practice. The travel speed $\bar{v}_{i,j}$ of road $rd_i$ computed for time slice $ts_j$ is taken as the observed road speed and, together with the free-flow speed of the road, defines the speed reduction index that measures the congestion state, as shown in formula 3-10.

The SRI takes values in [0, 10]. The larger the value, the worse the road traffic condition is considered to be, and SRI ≥ 4 indicates that the road is congested. Let $cg_{i,j}$ indicate whether road $rd_i$ is congested in time slice $ts_j$, where 0 means no congestion and 1 means congestion, as given by formula 3-11.
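A minimal sketch of the congestion test follows. It assumes the usual definition of the speed reduction index, SRI = (1 − v / v_ff) × 10, clipped so that the value stays in [0, 10], and a nearest-rank percentile for the free-flow speed; both are assumptions, since only the range and the threshold are stated in the text.

```python
import math

def free_flow_speed(valid_speeds, f=85):
    """Speed at the F-th percentile (nearest-rank) of the valid average speeds."""
    ordered = sorted(valid_speeds)
    idx = max(0, math.ceil(len(ordered) * f / 100) - 1)
    return ordered[idx]

def sri(speed, v_ff):
    """Assumed form: (1 - speed / v_ff) * 10, with the ratio clipped to [0, 1]."""
    ratio = min(max(speed / v_ff, 0.0), 1.0)
    return (1.0 - ratio) * 10.0

def is_congested(speed, v_ff, threshold=4.0):
    """Congestion indicator: 1 when SRI >= threshold, otherwise 0."""
    return 1 if sri(speed, v_ff) >= threshold else 0
```

Under this form, a road with a free-flow speed of 60 is flagged as congested once its travel speed drops to 36 or below.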

Formula 3-11 is applied to every time slice of road $rd_i$ to obtain its congestion vector $CG_i = [cg_{i,1}, cg_{i,2}, \ldots, cg_{i,n}]$. If the vector contains a subsequence $[cg_{i,t}, cg_{i,t+1}, \ldots, cg_{i,t+Q}]$ whose values are all 1, a traffic congestion event occurred on road $rd_i$ from time slice $ts_t$ to time slice $ts_{t+Q}$, denoted $ev_i^{t \to t+Q}$.

Fig. 5 shows the flow for obtaining the traffic congestion events of road $rd_i$: the SRI vector and the road congestion indication vector are calculated from the road average speed vector, and the traffic congestion events are finally obtained. Each rectangular block of the road average speed vector, the SRI vector, and the road congestion indication vector in the figure represents the observed value for one time slice. The color of the road average speed vector ranges from light grey to grey to indicate travel speed from high to low, and the SRI vector uses shades of grey of different depth to indicate values from small to large.

The traffic congestion events obtained in this way differ in duration, and events that are too short or too long can affect the subsequent analysis. Congestion events of too short a duration, such as those lasting only one time slice, may be due to abnormal driving behavior of individual vehicles, resulting in insufficient acquisition of track point data on the road. Events of too long a duration can distort the estimation of the propagation relationships of other events on the surrounding roads.
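Extracting congestion events from a congestion indication vector amounts to finding maximal runs of 1s. The sketch below also allows optional duration bounds; `min_len` and `max_len` are hypothetical parameters, since the text does not state the thresholds used to discard events that are too short or too long.

```python
def congestion_events(indicator, min_len=1, max_len=None):
    """indicator: list of 0/1 values, one per time slice.
    Returns (start_slice, end_slice) pairs for each maximal run of 1s
    whose length lies within [min_len, max_len]."""
    events, start = [], None
    for j, flag in enumerate(list(indicator) + [0]):   # sentinel closes a trailing run
        if flag == 1 and start is None:
            start = j
        elif flag != 1 and start is not None:
            length = j - start
            if length >= min_len and (max_len is None or length <= max_len):
                events.append((start, j - 1))
            start = None
    return events
```

For the vector `[0, 1, 1, 0, 1, 0, 1, 1, 1]`, setting `min_len=2` drops the single-slice event and keeps the events spanning slices 1 to 2 and 6 to 8.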

Traffic networks are complex networks that exist in the real world and have long been of interest and research. There are various traffic networks in the real world, such as railway networks, subway networks, public traffic networks, etc. Among them, a road network is a particularly basic one, describing the geospatial distribution of roads and the connection relationship between roads, and is used in aspects of daily life and research work. Road networks are a typical spatial network, the construction and connection of which is geospatially constrained. According to different requirements of research, reasonable representation of road network is very important, and has profound effects on research work.

Road networks are usually constructed from the traffic infrastructure, with intersections represented as nodes and the roads between intersections represented as edges. Depending on the focus of the research problem, the nodes and edges of a road network can also be represented dually, that is, roads are represented as nodes and the connections between roads are represented as edges. Compared with the direct representation, the dual representation has advantages in some research work: it ignores the specific geospatial constraints of the roads, simplifies the topology of the road network, and highlights the connections between roads, which is convenient for studying traffic congestion propagation from the perspective of road connectivity.

Consistent with everyday observation, vehicles travel along the traffic direction of a road. If congestion at some point of the road blocks traffic, more and more vehicles travelling toward the congested position accumulate behind it over time. In other words, traffic congestion propagates in the direction opposite to the traffic direction of the road, which is called the back propagation of traffic congestion. In real life, some busy two-way roads also exhibit congestion caused by vehicles making U-turns, but this U-turn congestion accounts for a small share of overall congestion and there is no good method for identifying it, so the scheme ignores it.

Based on the dual representation of the road network and the back-propagation characteristic of traffic congestion, the scheme constructs a road congestion back-propagation network. The network takes roads as nodes and the connections between roads as edges. The direction of each edge is opposite to the traffic direction of the roads and indicates the direction in which congestion propagates between adjacent roads.
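The construction of the back-propagation network can be sketched in a few lines. The sketch assumes the road connectivity is available as (upstream road, downstream road) pairs in the traffic direction; this input format and the function name are illustrative assumptions.

```python
def build_back_propagation_network(road_links):
    """road_links: iterable of (upstream_road, downstream_road) pairs,
    where traffic flows from the upstream road into the downstream road.
    Returns an adjacency dict whose edges point downstream -> upstream,
    i.e. in the direction in which congestion propagates."""
    network = {}
    for up, down in road_links:
        network.setdefault(up, set())
        network.setdefault(down, set()).add(up)
    return network
```

For the links A→B, B→C, and D→B, congestion on C can spread to B, and congestion on B can spread to both A and D.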

The scheme uses a time neighbor relation and a spatial neighbor relation to decide whether any two traffic congestion events have a congestion propagation relationship. If two events have both the time neighbor relation and the spatial neighbor relation, they are considered spatio-temporal neighbors: congestion has very likely been transmitted between them, and they are regarded as having a congestion propagation relationship.

The time neighbor relation means that the occurrence times of two congestion events satisfy the condition for congestion propagation. Suppose traffic congestion event $ev_1$ lasts from time slice $ts_m$ to $ts_n$ and congestion event $ev_2$ lasts from time slice $ts_p$ to $ts_q$. If $ts_m \le ts_p \le ts_n$, congestion propagation from $ev_1$ to $ev_2$ is possible in time. Fig. 6 is a schematic diagram of the time neighbor relation, in which congestion event $ev_1$ is the source event and congestion events $ev_2$, $ev_3$, and $ev_4$ are potential target events. $ev_2$ and $ev_3$ start during $ev_1$, so $ev_1$ has a time neighbor relation with each of them. $ev_4$ starts only after $ev_1$ has ended, so these two events have no time neighbor relation.

The spatial neighbor relation means that the spatial distance between two congestion events is smaller than a given threshold, so that the source event may be transmitted to the target event. Traffic congestion propagates along roads, so the scheme takes the constraints and effects of the road network into account when computing the spatial neighbor relation. Instead of using the geographic distance between two roads as the criterion, it uses as thresholds the number of hops connecting road nodes in the road congestion propagation network and the accumulated length of the roads along the propagation path. For ordinary road segments, congestion is not allowed to skip over nodes, so it must propagate node by node with a hop threshold of 1. At intersections, however, road segments are very short: such a segment may well be entirely congested without any taxi track point being recorded on it, so congestion must be allowed to pass across these short segments. In summary, the scheme judges the spatial neighbor relation with two parameters, the accumulated propagation road length LEN and the hop count SKP. In the experiments, the parameter condition is accumulated road length LEN ≤ 90 and hop count SKP ≤ 3, and the resulting spatial neighbor relation is shown in fig. 7.
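One possible reading of the two thresholds can be sketched as a bounded breadth-first search over the back-propagation network. The interpretation is an assumption: a road counts as a spatial neighbor of the source if it can be reached within SKP hops and the accumulated length of the intermediate roads skipped over on the way does not exceed LEN. The names `network`, `lengths`, and `spatial_neighbors` are hypothetical.

```python
from collections import deque

def spatial_neighbors(network, lengths, source, max_len=90.0, max_hops=3):
    """network: road -> set of roads that congestion can propagate to.
    lengths: road -> road length. Returns the set of roads reachable from
    source within max_hops hops whose skipped intermediate roads have an
    accumulated length of at most max_len (assumed semantics)."""
    best = {}                                  # road -> smallest skipped length seen
    queue = deque([(source, 0, 0.0)])
    while queue:
        road, hops, skipped = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in network.get(road, ()):
            if nxt == source:
                continue
            if nxt not in best or skipped < best[nxt]:
                best[nxt] = skipped
                new_skipped = skipped + lengths[nxt]
                if new_skipped <= max_len:     # nxt is short enough to be skipped over
                    queue.append((nxt, hops + 1, new_skipped))
    return set(best)
```

With `max_hops=1` the search reduces to the node-by-node case in which only directly connected roads are neighbors.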

For any two congestion events, the scheme computes their time neighbor relation and their spatial neighbor relation separately. If the two events have both relations, congestion is likely to have been transmitted between them, and a congestion propagation relationship is established between the events.

Traffic congestion may also link several congestion events together through successive propagation from one event to the next. To better represent the direct and indirect congestion propagation relationships among multiple traffic congestion events, the scheme constructs a traffic congestion event propagation graph. The propagation graph is a directed graph whose nodes are traffic congestion events and whose edges are the congestion propagation relationships between events. The direction of an edge is the direction in which congestion is transmitted from the source event to the target event.

Constructing the traffic congestion event propagation graph mainly consists of adding pairs of congestion events that have a congestion propagation relationship to the graph and connecting them with an edge. The specific process is as follows. First, an empty traffic congestion event propagation graph G is created. Then, starting from the first time slice, one congestion event in the time slice is selected as the source event, the road sequence R that satisfies the spatial neighbor relation with the source event is computed, and all congestion events occurring on the roads of R are extracted as the candidate target event set E. Finally, the time neighbor relation between the source event and each event in E is computed, and if it is satisfied, the two events are added to the propagation graph and connected with an edge. Each target event is then taken as a source event and the process is repeated until no new target event is added to the graph, after which a new source event is selected from the first time slice and the algorithm is repeated. If the first time slice contains no further candidate source event, a source event is selected from the next time slice, and the process is repeated until all congestion events in all time slices have been processed.
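The edge set of the propagation graph can be sketched with a simplified pairwise variant of the procedure above. Instead of expanding from source events slice by slice, the sketch checks every event as a source against the events on its spatially neighboring roads, which is assumed to yield the same edges once all events have been processed. The event layout and the `neighbor_roads` mapping are hypothetical.

```python
def time_neighbor(src, dst):
    """True when the target event starts while the source event is ongoing."""
    return src["start"] <= dst["start"] <= src["end"]

def build_propagation_graph(events, neighbor_roads):
    """events: list of dicts with keys 'road', 'start', 'end' (time-slice indices).
    neighbor_roads: road -> set of roads satisfying the spatial neighbor relation.
    Returns the set of directed edges (source_index, target_index)."""
    by_road = {}
    for idx, ev in enumerate(events):
        by_road.setdefault(ev["road"], []).append(idx)
    edges = set()
    for s, src in enumerate(events):
        for road in neighbor_roads.get(src["road"], ()):
            for t in by_road.get(road, ()):
                if time_neighbor(src, events[t]):
                    edges.add((s, t))
    return edges
```

Chains of edges in the result, such as event 0 → event 1 → event 2, represent indirect propagation across several roads.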

Node2Vec accomplishes the representation task by extending the Skip-gram model to graphs. Skip-gram is a simple neural network with one hidden layer that predicts, for a given center word, the conditional probability of the corresponding context words. Given a graph G = (V, E), where V is the node set and E is the edge set of the graph, let $f: V \to \mathbb{R}^d$ map the nodes to a latent space for downstream prediction tasks, where $\mathbb{R}$ denotes the feature space, d is the embedding dimension, and f is the mapping function from nodes to the feature space, that is, a matrix of size |V| × d. For any source node $o \in V$, define $N_{STRTG}(o) \subset V$ as the set of neighbor nodes of o obtained through the sampling strategy STRTG. The optimization goal of the Node2Vec model is to maximize the probability of observing the neighbor nodes $N_{STRTG}(o)$ given the source node o, as shown in formula 4-1.

Here f denotes the mapping function, $N_{STRTG}(o)$ denotes the neighbor node set of the source node o, and Pr(·) denotes the probability of occurrence.

The Node2Vec model introduces two assumptions to simplify the optimization objective, as follows:

(1) Conditional independence assumption. Given a source node, the probability of observing one of its neighbor nodes is independent of the other nodes in the neighbor node set $N_{STRTG}(o)$, as shown in formula 4-2, where $n_i$ denotes any one of the neighbor nodes.

(2) Feature space symmetry assumption. In the graph, a vertex has the same feature space whether it acts as a source node or as a neighbor node, so a node is represented by a single feature vector in both roles. Given a source node o, the probability of observing its neighbor node $n_i \in N_{STRTG}(o)$ is given by formula 4-3, where exp(·) denotes the exponential function.

Substituting the two assumptions above into the optimization objective yields a new objective function, formula 4-4, where $Z_o = \sum_{v \in V} \exp(f(o) \cdot f(v))$ is a normalization factor.
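For reference, the four formulas can be written out as follows. This is a reconstruction that assumes the standard Node2Vec formulation, which matches the definitions of f, $N_{STRTG}(o)$, and $Z_o$ given in the text; the exact typesetting of the original formulas may differ.

```latex
\begin{align}
&\max_{f} \sum_{o \in V} \log \Pr\bigl(N_{STRTG}(o) \mid f(o)\bigr) \tag{4-1} \\
&\Pr\bigl(N_{STRTG}(o) \mid f(o)\bigr) = \prod_{n_i \in N_{STRTG}(o)} \Pr\bigl(n_i \mid f(o)\bigr) \tag{4-2} \\
&\Pr\bigl(n_i \mid f(o)\bigr) = \frac{\exp\bigl(f(n_i) \cdot f(o)\bigr)}{\sum_{v \in V} \exp\bigl(f(v) \cdot f(o)\bigr)} \tag{4-3} \\
&\max_{f} \sum_{o \in V} \Bigl[ -\log Z_o + \sum_{n_i \in N_{STRTG}(o)} f(n_i) \cdot f(o) \Bigr] \tag{4-4}
\end{align}
```

Formula 4-4 follows by taking the logarithm of 4-2 with 4-3 substituted in, under the usual simplification that the normalization term $\log Z_o$ is counted once per source node.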

The Node2Vec model uses neighbor node sets to express the structural information of the graph, so choosing a suitable node sampling strategy that yields neighbor node sets expressing the graph structure well is important. Depth-first search (DFS) and breadth-first search (BFS) are two classical traversal strategies. Depth-first traversal starts from a node in the graph and keeps moving to the next node of the currently visited node; the nodes it samples have stronger structural equivalence and describe the global structure of the graph. Breadth-first traversal visits all adjacent nodes of the currently visited node before moving on to the next node; the nodes it samples have higher homogeneity and describe the local structure of the graph.

Considering the efficiency of exploring large graphs, the Node2Vec algorithm does not use depth-first or breadth-first traversal directly, but balances the homogeneity and structural equivalence of the nodes sampled from the graph through a biased random walk. A random walk is a sampling strategy that randomly selects the node to visit at each step. In general, a random walk tends to sample nodes far from the source node and thus expresses more structural equivalence. If a node $o \in V$ of graph G(V, E) is taken as the source node to obtain a sampling sequence with walk length l, the i-th node $c_i = nxt$ of the random walk sequence is generated by the probability distribution of formula 4-5.

Here $c_{i-1} = cur$ is the node currently sampled by the walk, $c_i = nxt$ is the node to move to, Z is a normalization constant, and $\sigma_{(cur,nxt)}$ is the unnormalized transition probability from node cur to node nxt. If there is no edge between node cur and node nxt, the probability is 0. If there is an edge between them, the random walk visits a node through any edge with the same probability: when node $c_{i-1} = cur$ has M adjacent nodes, $\sigma_{(cur,nxt)} = 1/M$.

The random selection of the next node can be weighted so that some nodes are visited with higher probability than others during the walk. This controls the random walk to a certain extent and yields a biased random walk. With suitably defined weighting parameters, the homogeneity and structural equivalence expressed by the sampled nodes can be balanced during the walk. The unnormalized transition probability between nodes is therefore defined as $\sigma_{(cur,nxt)} = \alpha_{pq}(prev, nxt) \cdot w_{(cur,nxt)}$, where $w_{(cur,nxt)}$ is the weight of the edge $(cur, nxt) \in E$, with $w_{(cur,nxt)} = 1$ for an unweighted graph. prev denotes the node sampled before node cur, and $\alpha_{pq}(prev, nxt)$ is defined by formula 4-6, where $d_{(prev,nxt)}$ denotes the shortest path distance between the previous node prev and the next node nxt of the current node cur. The parameters p and q regulate how much the biased walk strategy emphasizes network homogeneity versus structural equivalence.

Specifically, the parameter p controls the probability of returning to the previously visited node during sampling. When p > max(q, 1), the probability of returning to the visited node is small, so the sampling process is more likely to explore outward toward the global structure of the graph. Otherwise, node sampling stays around the source node o and describes the local structure of the node and its neighborhood in the graph. The parameter q controls whether the walk tends to sample "inward" or "outward". When q > 1, the random walk returns with higher probability to nodes that have already been sampled and to the neighbors of those nodes, which helps capture the local structure of the graph. When q < 1, the walk is more likely to explore outward. Combining the parameters p and q makes it possible to steer node sampling in the desired direction. The calculation of the transition probabilities is illustrated in fig. 8.
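The biased walk can be illustrated with a small sketch. It assumes the standard Node2Vec bias, in which $\alpha_{pq}$ equals 1/p when the shortest path distance between prev and nxt is 0, 1 when it is 1, and 1/q when it is 2, and it assumes an unweighted graph stored as a dict of neighbor sets. All names are hypothetical.

```python
import random

def alpha(p, q, prev, nxt, graph):
    """Search bias alpha_pq(prev, nxt) for an unweighted graph (assumed standard form)."""
    if nxt == prev:            # distance 0: step back to the previous node
        return 1.0 / p
    if nxt in graph[prev]:     # distance 1: nxt is also adjacent to prev
        return 1.0
    return 1.0 / q             # distance 2: move away from prev

def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Samples a walk of up to `length` nodes starting at `start`."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(graph[cur])
        if not nbrs:
            break
        if len(walk) == 1:     # first step has no previous node: uniform choice
            weights = [1.0] * len(nbrs)
        else:
            weights = [alpha(p, q, walk[-2], n, graph) for n in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

Setting p large and q small makes the walk unlikely to step back and likely to move away from the previous node, which corresponds to outward exploration.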

The road embedding model has two core parts: node sequences are sampled from the road congestion back-propagation network by random walk, and the embedded representation of the roads is learned from the node sequences with the Skip-gram model.

Traffic congestion event representation aims to embed each congestion event into a latent vector for the subsequent analysis of the spatio-temporal propagation rules of congestion. The scheme uses the VGAE model to perform the congestion event embedding task, taking the traffic congestion event propagation graph as the input graph. The VGAE model can effectively map the information of the graph nodes and the relationships between nodes to the feature space, so it retains more of the road information of the congestion events and of the congestion propagation relationships between events.

The VGAE model is an extension of the variational autoencoder (VAE) to graph data. An autoencoder consists of two parts: an encoder that obtains a representation of the input data in a low-dimensional space, and a decoder that reconstructs the input data from the latent vectors.

Fig. 9 is a schematic structural diagram of the VGAE model, where G = (V, E) denotes the graph, V is the node set, E is the edge set, and the graph G has N = |V| nodes. A denotes the adjacency matrix of graph G, X denotes the feature matrix of the nodes in graph G, σ and μ are the normal distribution parameters of the low-dimensional representation, Z denotes the low-dimensional matrix of all nodes obtained by sampling, which is composed of the sampled low-dimensional vectors $z_i$ of the individual nodes, and $\hat{A}$ denotes the reconstructed adjacency matrix.

The encoder of the VGAE model consists of a two-layer GCN neural network. Its input is the adjacency matrix A of graph G and the node feature matrix X, and its output is the embedding matrix Z representing the nodes of the graph. The first GCN layer of the encoder obtains a low-dimensional feature matrix $\bar{X}$ from the adjacency matrix A and the node feature matrix X, as in formula 4-7, where $W_0$ denotes the weight matrix of the first GCN layer and $\tilde{A}$ is the symmetrically normalized adjacency matrix. The relationship between $\tilde{A}$ and the node degree matrix D is shown in formula 4-8, and ReLU(·) denotes the activation function.

The second GCN layer learns the normal distribution parameters μ and $\log \sigma^2$. These two parameters constrain the distribution space of the low-dimensional node representation, as shown in formulas 4-9 and 4-10 respectively, where $W_1$ denotes the weight matrix of the second GCN layer. Taking the first and second GCN layers together, the encoder of the VGAE model is expressed as formula 4-11.

The VGAE model must sample the latent vectors of the nodes from the learned distribution space. However, the sampling operation carries no gradient information during training, so back propagation cannot pass through it. The VGAE model solves this problem with the reparameterization trick. Specifically, the operation of sampling the node latent vectors Z from the normal distribution space $N(\mu, \sigma^2)$ is converted into sampling a random number ε from the standard normal distribution N(0, 1) and then computing the sampled latent vector Z by formula 4-12. The process by which the VGAE encoder encodes the latent vector $z_i$ of the i-th node of the input graph is thus expressed as formula 4-13, where diag(·) denotes a diagonal matrix and N(·) denotes a normal distribution, and the process of encoding all nodes of the graph is given by formula 4-14.

$$Z = \mu + \sigma * \epsilon \tag{4-12}$$

The decoder of the VGAE model is not structurally symmetric to the encoder. Instead, it computes the probability that an edge exists between two nodes from the inner product of their latent vectors, and thereby reconstructs the adjacency matrix of the input graph. Specifically, the probability that an edge exists between any two nodes i and j of the input graph is computed with formula 4-15. Reconstructing the adjacency matrix of the input graph G requires computing the edge probability for every pair of nodes in the graph, as shown in formula 4-16, where $A_{ij}$ denotes the element at the corresponding position of the adjacency matrix A of graph G and σ(·) denotes the logistic sigmoid function.
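The sampling step and the inner-product decoder can be sketched in plain Python. This is a minimal sketch assuming element-wise reparameterization with the log-variance as input and the decoder form sigmoid($z_i \cdot z_j$); the function names are hypothetical, and a practical implementation would use a tensor library with automatic differentiation.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), applied element-wise,
    where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def sigmoid(x):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))

def decode(Z):
    """Reconstructed adjacency: entry (i, j) is sigmoid of the inner product z_i . z_j."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z] for zi in Z]
```

Nodes whose latent vectors point in similar directions obtain a high edge probability, while orthogonal latent vectors give a probability of 0.5.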

The learning objective of the VGAE model has two parts: the reconstructed adjacency matrix should be as similar as possible to the adjacency matrix of the input data, and the divergence between the normal distribution q(·) produced by the encoder and the distribution p(·) should be as small as possible. The optimization objective of the VGAE model is defined as L, as shown in formula 4-17. In formula 4-17, $KL[q(\cdot) \,\|\, p(\cdot)]$ denotes the KL divergence between q(·) and p(·), $E_{q(Z \mid X, A)}[\log p(A \mid Z)]$ is the cross-entropy function, and p(Z) is given by formula 4-18.

$$L = E_{q(Z \mid X, A)}[\log p(A \mid Z)] - KL[q(Z \mid X, A) \,\|\, p(Z)] \tag{4-17}$$

A clustering algorithm assigns the data records of a data set to different groups or clusters so that records in the same cluster are as similar as possible and the differences between clusters are as large as possible. The scheme applies a clustering algorithm to the embedded vectors of the traffic congestion events, dividing the whole set of congestion events into different clusters. Clustering algorithms play an important role in data analysis. Different types of clustering algorithms suit data sets with different distributions, so choosing an applicable algorithm is the key to mining data patterns. The scheme uses the HDBSCAN algorithm to cluster the congestion events.

When high-dimensional data are clustered and the data are sparsely distributed in the high-dimensional space, the clustering algorithm may overfit and produce poor results; a density-based clustering algorithm, in particular, will identify a large number of samples as noise. In practice the data are therefore reduced in dimension, which changes the space in which the data are distributed and alleviates the sparsity problem. The scheme reduces the dimension of the latent vectors of the traffic congestion events with the UMAP (Uniform Manifold Approximation and Projection) algorithm as a preliminary step to clustering, using the nonlinear dimension reduction capability of UMAP to improve the clustering result of the HDBSCAN algorithm.

The congestion events in a cluster are concrete manifestations of a congestion propagation rule, and the rule can be described specifically by summarizing the congestion events in the cluster. The scheme provides a road congestion propagation overview construction algorithm and uses the resulting overview to display the spatial distribution of the congestion propagation rules.

Dimensionality reduction (DR) is widely used in the preprocessing of machine learning tasks, and applying a dimensionality reduction algorithm before clustering a data set is an effective workflow. Preprocessing with a dimensionality reduction algorithm before clustering has two notable advantages: (1) clustering lower-dimensional data significantly reduces the time complexity of the clustering algorithm and speeds up its execution; (2) dimensionality reduction can help the clustering algorithm obtain a better result, especially when the data set to be clustered shows clearer structure in the low-dimensional space.

The scheme reduces the dimension of the embedded vectors of the traffic congestion events with a dimensionality reduction algorithm as a preprocessing step for congestion event clustering. The dimensionality reduction alleviates the sparse spatial distribution of the congestion event embeddings while preserving the correlations between the embedded vectors, and thus improves the performance of congestion event clustering. Among the different types of dimensionality reduction algorithms, the scheme selects UMAP. Experiments on multiple data sets show that UMAP outperforms other common algorithms in dimensionality reduction quality and performs on a par with the t-SNE (t-Distributed Stochastic Neighbor Embedding) algorithm, while being far superior to t-SNE in time complexity.

The UMAP algorithm uses the concept of simplicial complexes to construct a topological space by combining simple components, which effectively reduces the complexity of handling the topological space of the data. Fig. 10 is a schematic diagram of the four basic simplices: the 0-simplex is a single point, the 1-simplex is a line segment formed by two independent points, the 2-simplex is a triangle formed by three points, and the 3-simplex is a tetrahedron bounded by four 2-simplices. These basic components generalize to spaces of any dimension: a k-simplex is the k-dimensional object formed by the convex hull of k+1 points.

When constructing the high-dimensional weighted graph, the UMAP algorithm treats each data sample of the input data as a point in the high-dimensional space, that is, a 0-simplex. By connecting points whose neighborhoods of a given radius overlap, 1-, 2-, and higher-dimensional simplices can be constructed, and these are combined in a specific manner into complexes that give an approximate topological representation of the data set. In general, combining only the 0-simplices and 1-simplices accomplishes most of the topological representation work and is also computationally more efficient, which is a significant advantage on large data sets.

However, choosing an appropriate connection radius for constructing the complex is a challenge, because the radius affects the approximate representation of the topological space of the data: if the radius is too small, the algorithm tends strongly toward local clustering; if the radius is too large, the algorithm cannot effectively capture the structure. Instead of a fixed radius parameter, the UMAP algorithm determines the radius dynamically from the distance of each sample point to its k-th nearest neighbor. A point is connected to every point inside this local radius, which also means that each point has at least one connection and there are no isolated nodes. Taking the connection distance between sample points as the basis of the weight, the data samples form a directed weighted graph $\bar{G} = \{V, E, w\}$, where V represents the nodes of the graph, E represents the edges of the graph, and the weight w of an edge represents the likelihood of a connection between two nodes. Denote a data sample by x, with $x_i \in V$. If $x_i$ has k neighbor nodes $\{x_{i_1}, \dots, x_{i_k}\}$, then $(x_i, x_{i_j}) \in E$ for $1 \le j \le k$. To calculate the weight of the edge between $x_i$ and one of its neighbor nodes $x_{i_j}$, first define $\rho_i$ and $\sigma_i$ as shown in formula 5-1 and formula 5-2, where $d(x_i, x_{i_j})$ represents the Euclidean distance between the two nodes. $\sigma_i$ satisfies the constraint of formula 5-2, and combining $\rho_i$ and $\sigma_i$, the weight function w of the edge $(x_i, x_{i_j})$ is given by formula 5-3.

The graph $\bar{G}$ is a directed graph: any two nodes $x_i$ and $x_j$ may be joined by up to two edges, and according to the weight function w the two edge weights may differ. Converting the directed graph $\bar{G}$ into an undirected weighted graph $G = \{V, E, w\}$ requires resolving the inconsistent weights of the two edges. The UMAP algorithm uses a merging strategy that combines the weights of the edges between two points to obtain an undirected weighted graph with merged weights. If A is the adjacency matrix of the directed graph $\bar{G}$, $A^T$ is the transpose of A, and B is the matrix of edge weights of the undirected graph G after merging, then B is given by formula 5-4.
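The formulas referenced above as 5-1 to 5-4 correspond, in the published UMAP formulation, to $\rho_i = \min\{d(x_i, x_{i_j}) \mid 1 \le j \le k,\ d(x_i, x_{i_j}) > 0\}$, the constraint $\sum_{j=1}^{k} \exp(-\max(0, d(x_i, x_{i_j}) - \rho_i)/\sigma_i) = \log_2 k$, the weight $w((x_i, x_{i_j})) = \exp(-\max(0, d(x_i, x_{i_j}) - \rho_i)/\sigma_i)$, and the merge $B = A + A^{T} - A \circ A^{T}$, with $\circ$ the element-wise product. The following minimal sketch assumes these forms apply here; the function names, the brute-force neighbor search, and the bisection bounds are illustrative choices and are not taken from the scheme.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_distances(points, k):
    """For each point, its k nearest neighbours as (index, distance), nearest first."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((euclidean(p, q), j) for j, q in enumerate(points) if j != i)
        result.append([(j, d) for d, j in dists[:k]])
    return result

def solve_sigma(dists, rho, k, iters=64):
    """Bisection for sigma with sum_j exp(-max(0, d_j - rho) / sigma) = log2(k)."""
    target = math.log2(k)
    lo, hi = 1e-12, 1e6  # assumed search bounds
    for _ in range(iters):
        mid = (lo + hi) / 2
        total = sum(math.exp(-max(0.0, d - rho) / mid) for d in dists)
        if total > target:
            hi = mid  # the sum grows with sigma, so sigma is too large
        else:
            lo = mid
    return (lo + hi) / 2

def directed_weights(points, k):
    """Directed edge weights w[(i, j)] from each point to its k nearest neighbours."""
    weights = {}
    for i, neigh in enumerate(knn_distances(points, k)):
        dists = [d for _, d in neigh]
        positive = [d for d in dists if d > 0]
        rho = min(positive) if positive else 0.0
        sigma = solve_sigma(dists, rho, k)
        for j, d in neigh:
            weights[(i, j)] = math.exp(-max(0.0, d - rho) / sigma)
    return weights

def symmetrize(weights):
    """Merge the two directed weights of a pair: b = a + a^T - a * a^T."""
    merged = {}
    for (i, j), a in weights.items():
        at = weights.get((j, i), 0.0)
        merged[(min(i, j), max(i, j))] = a + at - a * at
    return merged
```

Under these assumptions the nearest neighbor of every point always receives weight 1, which is what guarantees that no node is isolated.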

Through the above steps, the UMAP algorithm represents the input data as a high-dimensional weighted graph, which is an undirected graph that captures the topology of the data manifold.

The UMAP algorithm then uses a force-directed graph layout algorithm to determine a topology in a low-dimensional space that is as similar as possible to that of the high-dimensional space; this is the low-dimensional representation of the input data. To make the projection in the low-dimensional space as similar as possible to the high-dimensional weighted graph, the projection is optimized using a cross-entropy function, as in formula 5-5.

For each edge $e \in E$ in graph G, $w_l(e)$ represents the distance weight in the low-dimensional space and $w_h(e)$ represents the distance weight in the high-dimensional space; both distance weights are calculated as shown in formula 5-3.
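For orientation, the cross-entropy objective in the published UMAP formulation has the form below. It is given here as an assumption about what formula 5-5 expresses, using the symbols $w_h(e)$ and $w_l(e)$ defined above; the symbol C for the objective is introduced only for this illustration.

```latex
C = \sum_{e \in E} \left[ w_h(e) \log \frac{w_h(e)}{w_l(e)}
    + \bigl(1 - w_h(e)\bigr) \log \frac{1 - w_h(e)}{1 - w_l(e)} \right]
```

The first term pulls together points that are strongly connected in the high-dimensional graph, and the second term pushes apart points that are weakly connected there.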

Different clustering algorithms use different measures of similarity between data samples, and no single clustering algorithm suits all data sets, so choosing a clustering algorithm suited to the task type and the distribution of the data samples is important in a data mining task. This scheme examines several common clustering algorithms and compares their advantages, disadvantages, and range of application.

The K-Means algorithm is a widely used clustering algorithm that uses distance as the measure of similarity between data instances: the closer two instances are, the more likely they are assigned to the same cluster. In practice, considerable prior knowledge of the data set is required to select a suitable cluster number parameter k, and although some methods have been proposed to help choose k, they apply only to data sets with particular distributions. The DBSCAN algorithm uses density to measure similarity between samples and looks for high-density clusters of samples separated by low-density regions. DBSCAN can therefore obtain good clustering results on data sets of arbitrary shape without the number of clusters being specified; the choice of initial values has little influence on the clustering result, and clustering stability is good. DBSCAN controls the density reachability between samples through the minimum cluster sample number parameter MPts and the radius Eps, which are used to tune the clustering on a specific data set. For data sets whose sample density varies widely, however, it is hard to choose a suitable Eps for DBSCAN, and an ideal clustering result is difficult to obtain.

The traffic congestion event embedding vectors reduced by the UMAP algorithm are visualized to observe their distribution in the latent space; the result is shown in fig. 11. Careful comparison and verification show that the embedding vectors aggregate visibly in the latent space, and that the congestion events corresponding to sample points in a high-density area are also close to each other in road space. On further observation, however, the densities of these "clusters" differ greatly and their distribution in the latent space is very uneven, so it is difficult to select a single radius parameter that separates the high-density regions. Given this spatial distribution of the embedding vectors, the scheme needs a clustering algorithm that can identify the best number of clusters automatically and adapt the neighborhood radius.

Several key improvements enable the HDBSCAN algorithm to cluster well on data sets with large density differences. First, HDBSCAN replaces the distance between data samples with the mutual reachability distance, which transforms the space and prevents different clusters from being merged in error; second, HDBSCAN calculates the stability of clusters and merges or splits clusters according to that stability, so as to obtain the best clustering for clusters of different densities. Given the spatial distribution of the traffic congestion event embedding vectors, this scheme uses the HDBSCAN algorithm to cluster traffic congestion events and mine traffic congestion propagation modes.

The core idea of the HDBSCAN clustering algorithm is single-linkage clustering, which measures the distance between two clusters by the shortest distance between their samples and repeatedly merges the two clusters with the smallest inter-cluster distance. This strategy makes single-linkage clustering very sensitive to noise points, which can join two different clusters at the "wrong" position and so change the cluster structure of the whole data set. To make the clustering more robust and reduce the influence of noise data on the result, HDBSCAN redefines the distance between samples as the mutual reachability distance before clustering, transforming the space so that dense samples and sparse noise samples lie farther apart.

First, the distance between a data sample spl and its k-th nearest neighbor $T^k(spl)$ is used to estimate the density around spl. This is an efficient density estimate, called the core distance of the data sample spl under the parameter k and denoted $\mathrm{core}_k(spl)$; it is calculated as in formula 5-6, where d represents a distance function.

$$\mathrm{core}_k(spl) = d(spl, T^k(spl)) \tag{5-6}$$

Next, a new distance measure between data samples, the mutual reachability distance, is defined by means of the core distance so that sparse points are moved away from dense points, as in formula 5-7.

$$d_{\mathrm{mreach}\text{-}k}(spl_a, spl_b) = \max\{\mathrm{core}_k(spl_a),\ \mathrm{core}_k(spl_b),\ d(spl_a, spl_b)\} \tag{5-7}$$

As in formula 5-6, d represents a distance function, and $d(spl_a, spl_b)$ is the distance between $spl_a$ and $spl_b$; $\mathrm{core}_k(spl_a)$ is the core distance of $spl_a$ and $\mathrm{core}_k(spl_b)$ is the core distance of $spl_b$. Formula 5-7 shows that the mutual reachability distance between $spl_a$ and $spl_b$ is determined by their direct distance and their core distances: the distance between dense data samples keeps the actual distance $d(spl_a, spl_b)$, while the distance between sparse data samples is represented by a core distance, so that sparse data instances are "pushed" away from other points. The choice of the core distance parameter k affects which points are judged to be noise; a larger k causes more data instances to be judged as noise points.
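Formulas 5-6 and 5-7 can be illustrated with a short sketch. It assumes Euclidean distance for d and a brute-force neighbor search; the function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance, used here as the distance function d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def core_distance(samples, i, k):
    """Formula 5-6: distance from sample i to its k-th nearest neighbour."""
    others = sorted(dist(samples[i], s) for j, s in enumerate(samples) if j != i)
    return others[k - 1]

def mutual_reachability(samples, a, b, k):
    """Formula 5-7: max of the two core distances and the direct distance."""
    return max(core_distance(samples, a, k),
               core_distance(samples, b, k),
               dist(samples[a], samples[b]))

# Four dense points and one isolated point.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
dense = mutual_reachability(samples, 0, 1, k=2)
sparse = mutual_reachability(samples, 3, 4, k=2)
```

In this example the pair of dense points keeps its actual distance, while the isolated point is pushed away from `(1, 1)` because its own core distance exceeds their direct distance.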

Studies have shown that transforming the space with the mutual reachability distance allows single-linkage clustering to produce, on data of any density distribution, a hierarchy that more closely approximates the hierarchy of density level sets, which helps improve the clustering result.

The HDBSCAN algorithm needs a way to separate the dense groups of data instances in the overall data space as clusters. "Dense" here is relative: different clusters in the same data space may have different densities. To accomplish this, HDBSCAN first constructs a weighted graph whose nodes are the data instances, with an edge between every pair of data instances weighted by their mutual reachability distance. HDBSCAN then finds the edge set with the smallest total weight that keeps the graph connected, which is the minimum spanning tree of the fully connected weighted graph; removing any edge of this set disconnects the graph.

The Prim algorithm is a greedy minimum spanning tree construction algorithm. At each step it adds to the selected edge subset the least-weight edge that connects the tree formed so far to a vertex not yet in the tree, and it continues until all vertices have been added.
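A minimal sketch of the Prim algorithm on a fully connected graph follows. The `weight(u, v)` callback stands in for the mutual reachability distance; this interface and the function name are assumptions made for the illustration.

```python
import heapq

def prim_mst(n, weight):
    """Return the MST edges (w, u, v) of a complete graph on vertices 0..n-1."""
    if n == 0:
        return []
    in_tree = [False] * n
    in_tree[0] = True  # start the tree from vertex 0
    heap = [(weight(0, v), 0, v) for v in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        w, u, v = heapq.heappop(heap)  # least-weight edge leaving the tree
        if in_tree[v]:
            continue  # both ends already in the tree, skip
        in_tree[v] = True
        edges.append((w, u, v))
        for x in range(n):
            if not in_tree[x]:
                heapq.heappush(heap, (weight(v, x), v, x))
    return edges

# Symmetric weight matrix for four vertices.
W = [[0, 1, 4, 3],
     [1, 0, 2, 5],
     [4, 2, 0, 6],
     [3, 5, 6, 0]]
mst = prim_mst(4, lambda u, v: W[u][v])
```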

With this construction method, HDBSCAN builds a minimum spanning tree whose nodes are the data samples and whose edges are the connections between samples. HDBSCAN then builds a hierarchy on the minimum spanning tree to obtain clusters of data samples. The process has two main steps and is repeated until every edge of the minimum spanning tree has been processed:

(1) First step: sort the edges of the minimum spanning tree in ascending order of weight;

(2) Second step: take each edge in sorted order and merge the subgraphs joined by the selected edge into one cluster.
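The two steps above can be sketched with a union-find structure. The record layout of the returned merges and the numbering of merged clusters are assumptions made for this illustration.

```python
def single_linkage(n, mst_edges):
    """Merge components edge by edge in ascending order of weight.

    Returns a list of (distance, cluster_a, cluster_b, new_cluster_id, size).
    Ids 0..n-1 are single samples; merged clusters get ids n, n+1, ...
    """
    parent = list(range(n))  # union-find forest over the samples
    label = list(range(n))   # current cluster id stored at each root
    size = [1] * n
    merges = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    next_id = n
    for w, u, v in sorted(mst_edges):      # step 1: ascending order
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        parent[rv] = ru                    # step 2: merge the two subgraphs
        size[ru] += size[rv]
        merges.append((w, label[ru], label[rv], next_id, size[ru]))
        label[ru] = next_id
        next_id += 1
    return merges

merges = single_linkage(4, [(3, 0, 3), (1, 0, 1), (2, 1, 2)])
```

The last record corresponds to the root of the hierarchy, the cluster that contains the whole data set.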

The cluster hierarchy constructed by this algorithm is a binary tree: the root node represents the whole data set, each node represents a cluster formed by a subset of the data samples, and the subtrees of a node represent the splitting of the current cluster. Read from top to bottom, each split removes one edge of the minimum spanning tree and divides the connected cluster into two unconnected subgraphs. Each split has a corresponding distance, which is the weight of the removed edge in the minimum spanning tree. The binary tree representing the cluster hierarchy is called the cluster tree.

Given the hierarchy of the cluster tree, a fixed threshold can be chosen that divides the cluster tree into an upper part and a lower part. Under the given threshold, the cluster tree nodes immediately below the threshold are the resulting clusters, and data samples that separate from the tree above the threshold are noise points. However, a fixed-threshold strategy relies on prior knowledge of the data set, and it is difficult to obtain a good clustering result for data sets whose inter-cluster distances are unbalanced. This limits the range of application of the clustering algorithm and reduces its performance on real data sets.

The HDBSCAN algorithm solves the above problem by compressing the cluster tree and improving the node distance measure used when extracting clusters. Compression aims to reduce a large and complex cluster tree to a smaller tree, so that each cluster tree node contains more data samples and noise samples are removed. The minimum cluster size defines the minimum number of samples in a cluster. The HDBSCAN algorithm traverses each node of the cluster tree and checks whether the number of data samples contained in each child of the node meets the minimum cluster size:

(1) If both child nodes contain fewer data instances than the minimum cluster size, both child nodes are deleted and the current node stops splitting;

(2) If one child node contains fewer data instances than the minimum cluster size and the other contains more, the smaller child node is deleted, the larger one is retained, and it keeps the label of the parent node;

(3) If both child nodes contain more data instances than the minimum cluster size, both child nodes of the current node are retained and the current node splits normally in the cluster tree.
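The three cases above amount to a small decision rule, sketched below. The text does not say how a child whose size equals the minimum cluster size is handled; this sketch assumes such a child meets the requirement. The function name and the returned labels are hypothetical.

```python
def compress_split(left_size, right_size, min_cluster_size):
    """Classify one split of a cluster-tree node into cases (1) to (3)."""
    # Assumption: a child exactly at the minimum size counts as large enough.
    left_ok = left_size >= min_cluster_size
    right_ok = right_size >= min_cluster_size
    if left_ok and right_ok:
        return "split"        # case (3): keep both children
    if left_ok or right_ok:
        return "keep_parent"  # case (2): drop the small child, keep the parent label
    return "stop"             # case (1): drop both children, stop splitting
```

For example, with a minimum cluster size of 5, a split into children of sizes 40 and 2 falls under case (2): the two samples are removed as noise and the larger child continues under the parent's label.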

After the cluster tree is compressed, a smaller cluster tree with the noise samples removed is obtained. Clustering is completed by selecting and labeling nodes of the cluster tree that are not contained in any other labeled node. To achieve the best clustering result, the selected cluster tree nodes should be stable.

A cluster tree can be understood intuitively as a process of splitting a data set into different clusters, in which the distance between a split node and its child nodes is the weight of the corresponding minimum spanning tree edge. The HDBSCAN algorithm measures the stability of a node using the reciprocal of the distance, i.e., λ = 1/distance.

Viewed from the nodes involved, each split comprises the node being split and the nodes produced by the split. Every node of the cluster tree except the root is therefore produced by splitting another node, and every node except the leaves is itself split into other nodes. Each node thus has two stability measures: $\lambda_{birth}$, the value at which the node is produced by a split (as a child node), and $\lambda_{death}$, the value at which the node is itself split (as a parent node). Because the edges of the minimum spanning tree are sorted in ascending order when the cluster hierarchy is constructed, the deeper a level of the resulting cluster tree, the shorter the broken edges, which gives the relation $\lambda_{birth} < \lambda_{death}$.

For each data sample spl in a cluster tree node, a stability measure $\lambda_{spl}$ is also defined; its value is the reciprocal of the distance of the minimum spanning tree edge that is broken when the sample spl leaves the node during compression of the cluster tree. A sample can leave the current cluster tree node in two ways:

(1) The number of samples in the child node produced by the split is smaller than the minimum cluster size, and the data instance is removed from the cluster tree as a noise point; in this case $\lambda_{birth} < \lambda_{spl} < \lambda_{death}$;

(2) The numbers of samples in both child nodes produced by the split are larger than the minimum cluster size; in this case the node splits normally and the sample spl of the current node enters a child node, so that $\lambda_{spl} = \lambda_{death}$.

The stability of a cluster tree node aggregates the contributions of every sample within the node (cluster) and is calculated as shown in formula 5-8.

Analysis of formula 5-8 shows that $s_{cluster} \ge 0$, and that the fewer scattered noise points a node contains, the higher the stability of the cluster and the larger the node stability $s_{cluster}$.

After the stability of the cluster tree nodes has been calculated, the cluster tree is traversed from bottom to top to obtain the marked nodes. If the stability of the current node is smaller than the sum of the stabilities of its child nodes, the stability of the current node is set to that sum; if the stability of the current node is greater than the sum of the stabilities of its child nodes, the current node is marked and the marks on its child nodes are cancelled. When the traversal reaches the root node of the cluster tree, all marked nodes are returned as the clustering result.
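In the standard HDBSCAN formulation, node stability is $s_{cluster} = \sum_{spl \in cluster} (\lambda_{spl} - \lambda_{birth})$; the sketch below assumes formula 5-8 has this form. It also assumes that leaf nodes start out marked, that a tie between a node and the sum of its children keeps the node, and that cancelling marks applies to every marked node beneath the current one. The dictionary-based tree representation and function names are hypothetical.

```python
def cluster_stability(lambda_birth, sample_lambdas):
    """Assumed form of formula 5-8: sum of (lambda_spl - lambda_birth)."""
    return sum(lam - lambda_birth for lam in sample_lambdas)

def select_clusters(children, stability, root):
    """Bottom-up selection of marked nodes.

    children: node -> list of child nodes (leaves may be absent from the dict)
    stability: node -> stability value s_cluster
    """
    selected = set()
    best = dict(stability)  # working copy, updated while climbing the tree

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            selected.add(node)  # assumption: leaves start out marked
            return
        for c in kids:
            visit(c)
        child_sum = sum(best[c] for c in kids)
        if best[node] < child_sum:
            best[node] = child_sum  # children win, propagate their total
        else:
            stack = list(kids)      # node wins, unmark everything beneath it
            while stack:
                d = stack.pop()
                selected.discard(d)
                stack.extend(children.get(d, []))
            selected.add(node)

    visit(root)
    return selected

children = {"root": ["A", "B"], "A": ["A1", "A2"]}
stability = {"root": 1.0, "A": 5.0, "A1": 1.0, "A2": 2.0, "B": 3.0}
chosen = select_clusters(children, stability, "root")
```

Here node A is more stable than its two children combined, so it replaces them, while the root is less stable than A and B together, so A and B are returned as the clusters.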

According to the scheme, the traffic congestion propagation events are clustered by a clustering algorithm; traffic congestion events in the same cluster have similar congestion propagation rules, and by summarizing the congestion events in each cluster, their common rules in time, space, and congestion propagation relationships are obtained. From these rules, the scheme identifies the road space of the congestion events, the severity of road congestion, the congestion propagation relationships and association strengths between roads, the time periods of the congestion events, and the congestion intensity in different time periods, and uses them to describe a specific traffic congestion propagation rule. This description helps people understand the spatio-temporal characteristics of different types of traffic congestion.

The relationship and strength of congestion propagation between roads cannot be described by numerical values alone: the relationship involves a series of roads, the congestion propagation relationships between them, and the propagation strengths, and the focus is on how roads relate to one another in congestion propagation. Therefore, the scheme provides a road congestion propagation overview construction algorithm to summarize the road congestion propagation rule. The overview is a weighted directed graph; the construction algorithm extracts the traffic congestion events contained in a traffic congestion propagation mode and summarizes the road congestion propagation relationships.

In the scheme, the congestion events in a cluster are taken as the key events for describing a traffic congestion propagation rule and are called anchor events; the nodes filled with oblique lines in fig. 12 illustrate the anchor events of an event cluster. Forward and reverse depth-first traversals of the traffic congestion event propagation graph are performed with each anchor event as the starting node, giving a congestion event set E that is highly correlated with the traffic congestion event cluster.

The construction algorithm then initializes an empty directed topology graph G and traverses each directed edge in the congestion event set E. For each edge, the shortest path between the roads on which the two congestion events occur is obtained in the road traffic congestion propagation network by the depth-first traversal algorithm (DFS) of the graph and is taken as the optimal propagation path of that propagation edge. The edges of the propagation path are added to the topology graph G: if an edge to be added does not yet exist in G, it is added with weight 1; if it already exists, only its weight is increased. Road nodes to be added to G are handled in the same way: if a node does not exist, it is added with an initial weight of 1; if it exists, only its weight is increased. The construction algorithm repeats this process until all relevant edges in the congestion event set E have been processed, which yields the road congestion propagation overview graph. The procedure is illustrated in fig. 12.
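The construction of the overview graph can be sketched as follows. The text obtains the shortest path with a depth-first traversal; this sketch swaps in a breadth-first search, which finds a path with the fewest road hops in an unweighted graph. The adjacency-list input format, the function names, and the example road identifiers are assumptions.

```python
from collections import defaultdict, deque

def shortest_path(road_graph, src, dst):
    """BFS path with the fewest hops in a directed road graph, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in road_graph.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

def build_overview(road_graph, event_edges):
    """event_edges: (road_from, road_to) pairs, one per propagation edge in E.

    Returns (node_weights, edge_weights) of the overview graph G.
    """
    node_w = defaultdict(int)
    edge_w = defaultdict(int)
    for src, dst in event_edges:
        path = shortest_path(road_graph, src, dst)
        if path is None:
            continue  # no route between the two roads, skip this edge
        for road in path:
            node_w[road] += 1          # new node starts at 1, else incremented
        for a, b in zip(path, path[1:]):
            edge_w[(a, b)] += 1        # new edge starts at 1, else incremented
    return dict(node_w), dict(edge_w)

road_graph = {"r1": ["r2"], "r2": ["r3"], "r3": []}
nodes, edges = build_overview(road_graph, [("r1", "r3"), ("r2", "r3")])
```

Roads and road links that lie on many propagation paths accumulate larger weights, which is what makes them stand out in the overview.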

The visual analysis system for traffic congestion propagation, JPViz, is built on specific design requirements and combines rich visualization views with efficient system interaction to help domain experts explore urban traffic congestion propagation modes in depth.

The above is a preferred embodiment of the present invention. All changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided the resulting functional effects do not exceed the scope of the technical solution of the present invention.

Claims

1. The traffic jam propagation mode sensing and visual analysis method is characterized by comprising the following steps of:

2. The traffic congestion propagation mode sensing and visual analysis method according to claim 1, wherein step S1 is specifically as follows:

Performing data cleaning on the GPS track data;

organizing the discrete track points pt by vehicle ID and sorting them by record time into a track point sequence $[pt_1, pt_2, pt_3, \dots, pt_n]$, the track point sequence corresponding to the running condition of the vehicle within the time range;

calculating the average speed of the track points over a period of time as the travel speed of the road; time slices are divided by a fixed time length TL; given a road $rd_i$ and a corresponding time slice $ts_j$, all track points that fall on the road within the time slice are collected, each track point having a speed record; a support parameter θ is added when calculating the average speed of the road: the number of track points in the time slice that participate in the calculation must be greater than the given threshold θ, otherwise the calculation is invalid; the travel speed of road $rd_i$ in time slice $ts_j$ is calculated as the average of the speed records of these track points.
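The travel speed calculation with the support parameter θ can be sketched as below. The tuple layout of a track point, the use of seconds for timestamps, and the function name are assumptions for this illustration.

```python
from collections import defaultdict

def road_travel_speeds(points, slice_len, theta):
    """Mean speed per (road, time slice), subject to the support threshold.

    points: iterable of (road_id, timestamp_seconds, speed)
    slice_len: fixed time slice length TL, in seconds
    theta: a slice is valid only if it holds more than theta track points
    """
    buckets = defaultdict(list)
    for road_id, ts, speed in points:
        buckets[(road_id, int(ts // slice_len))].append(speed)
    return {key: sum(v) / len(v) for key, v in buckets.items() if len(v) > theta}

points = [
    ("rd1", 0, 30.0), ("rd1", 100, 50.0), ("rd1", 200, 40.0),  # slice 0
    ("rd1", 400, 20.0),                                         # slice 1
    ("rd2", 10, 60.0),                                          # slice 0
]
speeds = road_travel_speeds(points, slice_len=300, theta=2)
```

Slices with too few track points are simply absent from the result, which corresponds to the calculation being invalid for that road and time slice.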

3. The traffic congestion propagation mode sensing and visual analysis method according to claim 1, wherein step S2 is specifically as follows:

4. The traffic congestion propagation mode sensing and visual analysis method according to claim 2, wherein the data cleansing of the GPS trajectory data is as follows:

(1) Geographic area restriction;

(2) Record time restriction;

(3) Duplicate track point restriction;

(4) Travel speed restriction;

(5) Road range restriction.

5. Traffic congestion propagation mode perception and visual analysis method according to claim 2, characterized in that the trajectory tj is filtered as follows:

(1) The number of track points; the track comprises five or more track points;

6. The traffic congestion propagation mode sensing and visual analysis method according to claim 2, wherein the ST-Matching map Matching algorithm is as follows:

(1) Determining a candidate point set; taking an actually observed track point $pt_i \in tj$ of the track tj as the center and a specified radius r, a circle is determined; perpendiculars are drawn from $pt_i$ to all roads within the circle, and the intersection of the j-th perpendicular with its road is a candidate point $c_i^j$; $pt_i$ may have one or more candidate points, and all of them constitute the candidate point set of $pt_i$;

the observation probability uses a normal distribution $N(\mu, \sigma^2)$ with μ and σ as parameters to measure how closely the GPS track point $pt_i$ approximates a candidate point $c_i^j$; the observation probability of the candidate point is evaluated from $pt_i$ and its candidate point $c_i^j$ under this distribution;

(3) Candidate point time analysis; the time analysis function $F_t$ takes into account the travel speed along the track and the speed limits of the roads, which avoids the situation where the spatial analysis function cannot distinguish adjacent roads running in the same direction and improves the quality of track map matching; for candidate points $c_{i-1}^t$ and $c_i^s$ of the adjacent track points $pt_{i-1}$ and $pt_i$, their shortest path comprises the road sections $[rd_1, rd_2, \dots, rd_u, \dots, rd_k]$, where the length of road section $rd_u$ is denoted $len_u$ and its speed limit $lim_u$; the travel time from track point $pt_{i-1}$ to $pt_i$ is denoted $\Delta t_{i-1 \to i}$, and the average speed from $pt_{i-1}$ to $pt_i$ is the total length of the road sections on the shortest path divided by this travel time; the time analysis function $F_t$ of the candidate points $c_{i-1}^t$ and $c_i^s$ is computed from this average speed and the speed limits $lim_u$;

(4) Path matching; before path matching, a candidate graph is first constructed with each track point as a stage and the candidate points of each track point as the states of the corresponding stage; each pair of candidate points $c_{i-1}^t$ and $c_i^s$ of adjacent track points forms an edge of the candidate graph, and the transition probability of the edge is expressed by the space-time analysis function F;

for a candidate path sequence $P_{cand}$ of the track tj, the sequence score $F(P_{cand})$ is computed from the transition probabilities of its edges, and the sequence with the highest score is the best matching path MP,

$$MP = \arg\max_{P_{cand}} F(P_{cand}) \tag{3-8}$$
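The search for the best matching path over the candidate graph can be sketched as a stage-by-stage dynamic program. The sketch assumes that the sequence score is the sum of the edge values F along the path, as in the published ST-Matching algorithm; the `score` callback, the candidate identifiers, and the example values are hypothetical.

```python
def best_matching_path(candidates, score):
    """Highest-scoring candidate sequence in the candidate graph.

    candidates: non-empty list of stages, each a list of candidate ids
    score(prev, cur): value F of the edge between candidates of adjacent stages
    Returns (total score, candidate sequence).
    """
    # best[c] = (best score of any path ending at c, that path)
    best = {c: (0.0, [c]) for c in candidates[0]}
    for stage in candidates[1:]:
        new_best = {}
        for cur in stage:
            total, path = max(
                ((s + score(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[cur] = (total, path + [cur])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

# Three track points with two, two, and one candidate points respectively.
F = {("a1", "b1"): 0.9, ("a1", "b2"): 0.1,
     ("a2", "b1"): 0.2, ("a2", "b2"): 0.8,
     ("b1", "c1"): 0.3, ("b2", "c1"): 0.9}
total, path = best_matching_path(
    [["a1", "a2"], ["b1", "b2"], ["c1"]],
    lambda prev, cur: F[(prev, cur)],
)
```

In this example the locally best first edge (a1 to b1) is not on the best overall path, which illustrates why the whole sequence is scored rather than each track point being matched in isolation.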