CN115563522A - Traffic data clustering method, device, equipment and medium - Google Patents

Publication number
CN115563522A
Authority
CN
China
Prior art keywords
data
clustering
density
sample
sample point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211538308.2A
Other languages
Chinese (zh)
Other versions
CN115563522B (en)
Inventor
周新民
熊智谋
胡怀钰
贾啸宇
朱雁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202211538308.2A priority Critical patent/CN115563522B/en
Publication of CN115563522A publication Critical patent/CN115563522A/en
Application granted granted Critical
Publication of CN115563522B publication Critical patent/CN115563522B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method, a device, equipment and a medium for clustering traffic data. The method comprises the following steps: acquiring initial traffic data; performing data cleaning and standardization on the traffic data to obtain standardized data, taking each piece of standardized data as sample data, and projecting each piece of sample data into a coordinate system as a sample point; calculating the local density of each sample point, and calculating the relative distance of each sample point according to the position information of its K nearest neighbors, which greatly reduces the number of interference points in the decision graph; determining the cluster centers according to the local density and the relative distance; and, based on the cluster centers, clustering each sample point according to density reachability and density gradient descent. The method can improve the accuracy of traffic data clustering and mine traffic congestion characteristics in depth, which helps traffic management departments plan their management resources reasonably.

Description

Traffic data clustering method, device, equipment and medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for data clustering of traffic data.
Background
Mining the characteristics of urban traffic congestion helps traffic management departments plan their management resources reasonably. Congestion, however, is not constant: it shows different characteristics at different times and places, and how to mine and analyze the valuable information in massive data has become a hot topic in the field of intelligent transportation. Clustering is an important unsupervised machine learning method in urban traffic data mining. It divides a given set of data into several clusters according to characteristics such as the distribution and similarity of the sample points, so that sample points in the same cluster are highly similar and sample points in different clusters are less similar.
Currently, mainstream clustering algorithms can be divided into five major categories according to their procedures and clustering conditions: partition-based, density-based, hierarchy-based, grid-based and model-based methods. Density-based clustering algorithms apply well to many kinds of data sets, are insensitive to noisy data and do not require the number of clusters to be preset, and they are very widely used in the field of data mining.
In 2014, Rodriguez et al. published a novel Density Peak Clustering (DPC) algorithm in Science. It can quickly find cluster centers and identify the number of clusters in a data set, and it is also suitable for cluster analysis of large-scale data. Thanks to its simple steps and wide applicability, the DPC algorithm has in recent years been widely applied in frontier fields such as image segmentation, semantic clustering and algorithm optimization.
In the course of implementing the present invention, the inventors realized that the existing approach computes the local density and relative distance of all sample points using fixed definitions of local density and relative distance, constructs a decision graph, selects the cluster centers, and finally allocates the remaining sample points. This approach has significant limitations: 1) the distribution structure around a sample is not considered when its local density is calculated, so some clusters are lost; 2) the decision graph contains too many interference points, which makes it difficult to select cluster centers accurately; 3) the clustering effect is poor when two clusters of different densities lie close together; 4) for a data set with a complex structure, simply assigning each data point to the cluster of its nearest cluster center point easily causes a "domino effect", so the clustering effect on traffic data is poor.
Disclosure of Invention
The embodiment of the invention provides a data clustering method and device of traffic data, computer equipment and a storage medium, and aims to improve the data clustering accuracy of the traffic data.
In order to solve the above technical problem, an embodiment of the present application provides a data clustering method for traffic data, where the data clustering method for traffic data includes:
acquiring initial traffic data;
carrying out data cleaning on the traffic data, carrying out standardization processing to obtain standardized data, taking each piece of standardized data as sample data, and projecting each piece of sample data into a coordinate system to serve as a sample point;
calculating the local density and the relative distance of each sample point;
determining cluster centers according to the local density and the relative distance;
based on the cluster centers, clustering each sample point according to density reachability and density gradient descent.
Optionally, the clustering of each sample point according to density reachability and density gradient descent, based on the cluster centers, includes:
taking each cluster center as a current sample point;
traversing in sequence and judging whether the other points within the cutoff distance of the current sample point are density-reachable or satisfy density gradient descent;
and if the other points within the cutoff distance of every current sample point are all density-reachable or satisfy density gradient descent, determining the cluster centers as target centers, and clustering according to the target centers.
Optionally, after the traversing in sequence and judging whether the other points within the cutoff distance of the current sample point are density-reachable or satisfy density gradient descent, the method further includes:
and if a current sample point exists whose other points within the cutoff distance are density-reachable or satisfy density gradient descent, returning to the step of determining the cluster centers according to the local density and the relative distance and continuing execution.
Optionally, sample points whose distance from a cluster center is smaller than a preset distance threshold are taken as allocated sample points, sample points whose distance from every cluster center is larger than the preset distance threshold are taken as sample points to be allocated, and density reachability is determined as follows:
determining that the distance d_ij between the sample point to be allocated j and the allocated sample point i is less than the cutoff distance d_c;
if the local density of sample point j is greater than a preset density threshold α, sample point j is said to be density-reachable from sample point i:
[formula provided as an image in the original publication]
where ρ denotes the local densities of all sample points sorted in ascending order, β takes a value in (0, 1), ceil() rounds its input up to the nearest integer, n is the total number of samples, ρ_ceil(β×n) is the value at position ceil(β×n) in the sorted set ρ, N is the number of clusters into which the whole data set is partitioned, and label_i is the cluster label of sample point i; a label value of 0 indicates that the sample point has not yet been assigned to a cluster.
Optionally, density gradient descent is determined by the following formula:
[formula provided as an image in the original publication]
where d_c is the cutoff distance and d_ij is the distance between the sample point to be allocated j and the allocated sample point i.
Optionally, the performing data cleaning on the traffic data and performing standardization processing to obtain standardized data includes:
carrying out data segmentation and formatting processing on the traffic data according to a preset mode to obtain formatted data;
and filling missing values in the formatted data containing the traffic attributes using a mean-value method, and applying a standardizing transformation to the filled data to obtain the standardized data.
In order to solve the above technical problem, an embodiment of the present application further provides a data clustering device for traffic data, including:
the data acquisition module is used for acquiring initial traffic data;
the data preprocessing module is used for cleaning the traffic data, performing standardization processing to obtain standardized data, taking each piece of standardized data as sample data, and projecting each piece of sample data into a coordinate system to be used as a sample point;
the distance calculation module is used for calculating the local density and the relative distance of each sample point;
the cluster center determining module is used for determining cluster centers according to the local density and the relative distance;
and the sample clustering module is used for clustering each sample point according to density reachability and density gradient descent based on the cluster centers.
Optionally, the sample clustering module comprises:
the sample point selecting unit is used for taking each cluster center as a current sample point;
the traversing unit is used for traversing in sequence and judging whether the other points within the cutoff distance of the current sample point are density-reachable or satisfy density gradient descent;
and the clustering unit is used for determining the cluster centers as target centers if the other points within the cutoff distance of every current sample point are all density-reachable or satisfy density gradient descent, and clustering according to the target centers.
Optionally, the apparatus further comprises:
and the circulating module is used for returning to the step of determining the cluster centers according to the local density and the relative distance and continuing execution if a current sample point exists whose other points within the cutoff distance are density-reachable or satisfy density gradient descent.
Optionally, the data preprocessing module includes:
the data formatting unit is used for carrying out data segmentation and formatting processing on the traffic data according to a preset mode to obtain formatted data;
and the data standardization unit is used for carrying out data filling on the formatted data containing the traffic attributes and carrying out standardization conversion on the filled data to obtain the standardized data.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the data clustering method for traffic data when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the data clustering method for traffic data.
The embodiments of the invention provide a data clustering method, a device, computer equipment and a storage medium for traffic data. Initial traffic data are acquired; the traffic data are cleaned and standardized to obtain standardized data, each piece of standardized data is taken as sample data, and each piece of sample data is projected into a coordinate system as a sample point; the local density of each sample point is calculated, and the relative distance of each sample point is calculated by introducing the position information of its K nearest neighbors, which greatly reduces the number of interference points in the decision graph; the cluster centers are determined according to the local density and the relative distance; and, based on the cluster centers, each sample point is clustered according to density reachability and density gradient descent. Determining the cluster centers from the relative distance and the local density before clustering reduces the number of interference points and improves clustering accuracy.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram to which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of data clustering of traffic data of the present application;
FIG. 3 is a schematic diagram of an embodiment of a data clustering device for traffic data according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the data clustering method for traffic data provided in the embodiments of the present application is executed by a server, and accordingly, a data clustering device for traffic data is disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a data clustering method for traffic data according to an embodiment of the present invention, which is described by taking the application of the method to the server in fig. 1 as an example, and is detailed as follows:
S201: Initial traffic data is acquired.
Specifically, the initial traffic data is acquired in a big data crawling or data interface transmission mode.
In a specific optional implementation, the data come from taxi driving data on a government open data platform. Compared with private car driving data, taxi driving data do not involve citizens' privacy and are easy to obtain and highly usable. Bus driving data, by contrast, usually follow fixed routes, are strongly influenced by the bus company's route planning, lack flexibility, and rarely reflect the real state of urban traffic flow.
The data selected in this embodiment are the driving data of taxis in city XX on day XX of month XX of year XX. The original data set contains 2134124 records; the recording interval is not fixed and is measured in seconds; and each record comprises 17 attributes, as shown in Table 1:
Table 1 Data set field description
[Table 1 is provided as an image in the original publication]
It should be noted that, in this embodiment, six fields of the original data set plus an automatically generated sample serial number are selected as the initial traffic data: display_num (license plate number), gps_time (GPS time), gps_speed (GPS speed), direction (heading), gps_longitude (GPS longitude), gps_latitude (GPS latitude) and ID (record number). The fields can be chosen flexibly according to application requirements and are not specifically limited here. The gps_time field is a character string whose first half represents the date and whose second half represents the hour, minute and second; gps_speed is in km/h; direction ranges from 0 to 360, with due north as 0 and the taxi's heading angle measured clockwise; and the ID values run from 0 to 2134123.
S202: Clean and standardize the traffic data to obtain standardized data, take each piece of standardized data as sample data, and project each piece of sample data into a coordinate system as a sample point.
Because the data come from diverse sources and contain many non-standard format values, noise values and abnormal values, the acquired initial traffic data cannot be used directly. Data mining places high demands on the format, dimensions and other properties of the original data set, and a low-quality data set often leads to poor mining results. Data preprocessing is therefore required before computation to increase the usability of the data set and reduce the complexity of subsequent calculations. There are generally three data preprocessing methods: data integration, data cleaning and redundancy elimination. In this embodiment, the field formats of the source data set need to be unified to ensure that no large amount of data redundancy exists; the data set does, however, contain many blank and abnormal values. The preprocessing therefore mainly standardizes the unstructured data that are difficult to compute with, and uses data cleaning to eliminate most redundant data, abnormal values and blank items.
Firstly, the data are downloaded with GET requests by calling the interface provided by the government open data platform. Because the web page limits the number of rows per page, each page is set to hold 5000 rows; the data are downloaded as 427 pages and stored in a txt file. All 426 header rows other than the first row of the txt file are then deleted to obtain the original data set.
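The paging and header clean-up described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `merge_pages` and the in-memory page strings are hypothetical, and the actual download from the platform interface is omitted.

```python
import math

TOTAL_RECORDS = 2_134_124  # record count stated in the text
ROWS_PER_PAGE = 5_000      # per-page row limit stated in the text

# 2,134,124 rows at 5,000 rows per page require 427 pages.
PAGE_COUNT = math.ceil(TOTAL_RECORDS / ROWS_PER_PAGE)


def merge_pages(pages):
    """Concatenate downloaded pages, keeping only the first header row.

    `pages` is a hypothetical list of strings, one per downloaded page,
    each starting with the same header line.
    """
    merged = []
    for index, page in enumerate(pages):
        lines = page.splitlines()
        # Keep the header of the first page; drop it on every later page.
        merged.extend(lines if index == 0 else lines[1:])
    return merged
```

With 427 pages this removes 426 repeated header rows, which matches the count given in the text.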
Secondly, the column data of the 6 fields required by this embodiment are read field by field and stored in a csv file to facilitate subsequent operations on the data set such as segmentation, reading and calculation. An "ID" field is added, and each row is given its corresponding number as it is written to the csv file.
Because the gps_time field is stored as a character string rather than in a standardized format, which makes subsequent calculation difficult, the gps_time string of every record in the data set needs to be cut. The first 9 characters of the field (the date part, of the form "20181001") are the same in every record and need no processing. The characters at positions [10,11], [12,13] and [14,15] are cut out and converted to integers representing the hour, minute and second respectively, and the time is then computed from them to obtain a uniform standardized format in units of seconds.
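The string cutting described above can be sketched as follows. This is a minimal illustration that assumes the layout "YYYYMMDD/HHMMSS" implied by the value range given later in the text, and assumes the final value is seconds elapsed since midnight; the function name `gps_time_to_seconds` is hypothetical.

```python
def gps_time_to_seconds(gps_time: str) -> int:
    """Convert a gps_time string such as '20181001/235959' to seconds.

    The 1-based positions [10,11], [12,13] and [14,15] from the text
    correspond to the 0-based slices [9:11], [11:13] and [13:15].
    """
    hour = int(gps_time[9:11])
    minute = int(gps_time[11:13])
    second = int(gps_time[13:15])
    # Assumption: the standardized value is seconds since midnight.
    return hour * 3600 + minute * 60 + second
```

For example, `gps_time_to_seconds("20181001/000130")` gives 90, since the string encodes one minute and thirty seconds after midnight.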
Regarding the coordinate system, the table above shows that the gps_longitude and gps_latitude attribute values do not match the longitude and latitude ranges used in daily life. These values are taken as the longitude and latitude of the vehicle position and divided by 600,000 to obtain the longitude and latitude in the WGS 84 coordinate system, and the computed values are used as the longitude and latitude attribute values respectively.
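As a minimal sketch of this conversion, assuming the raw fields are stored as plain numbers, the division can be written as below. The function name `raw_to_degrees` and the sample raw value are hypothetical and chosen only for illustration.

```python
RAW_UNITS_PER_DEGREE = 600_000  # divisor stated in the text


def raw_to_degrees(raw_value: float) -> float:
    """Convert a raw gps_longitude or gps_latitude value to WGS 84 degrees."""
    return raw_value / RAW_UNITS_PER_DEGREE
```

A raw value of 68,076,000, for instance, corresponds to about 113.46 degrees, which is the lower longitude bound quoted later in the text.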
Thirdly, vacant and abnormal data are cleared. The five fields gps_time, gps_speed, direction, gps_longitude and gps_latitude are all uploaded by the vehicle-mounted GPS, and some unrealistic values can occur under the influence of the vehicle's environment. The valid range of gps_time is [20181001/000000, 20181001/235959]; the maximum speed of motor vehicles in the whole area is 120 km/h, so the valid range of gps_speed is [0, 120]; after conversion, the valid range of gps_longitude is [113.46, 114.37] and the valid range of gps_latitude is [22.27, 22.52]. Records whose coordinate data are abnormal or vacant are filled by mean-value interpolation. Because taxi trajectories are irregular, the interpolation uses the times and coordinates of the two neighboring track points, and the track from point i-1 to point i+1 is approximated as uniform linear motion. The mathematical expression is as follows:
[formula provided as an image in the original publication]
where longitude_i and latitude_i denote the longitude and latitude of the i-th point, and time_i denotes the value of the time attribute of the i-th point.
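The range check and the interpolation can be sketched as follows. The interpolation formula itself is not reproduced in the text, so the linear-in-time form below is an assumption derived from the stated uniform linear motion between points i-1 and i+1; the names `in_range` and `interpolate_position` are hypothetical.

```python
# Valid ranges quoted in the text.
SPEED_RANGE = (0.0, 120.0)    # km/h
LON_RANGE = (113.46, 114.37)  # degrees, after conversion
LAT_RANGE = (22.27, 22.52)    # degrees, after conversion


def in_range(value, bounds):
    """Return True when value is present and lies inside the closed range."""
    low, high = bounds
    return value is not None and low <= value <= high


def interpolate_position(prev_point, next_point, time_i):
    """Estimate (longitude_i, latitude_i) for a missing or abnormal point.

    prev_point and next_point are (time, longitude, latitude) tuples for
    points i-1 and i+1. Assumption: the vehicle moves at constant velocity
    between them, so position is linear in time.
    """
    t_prev, lon_prev, lat_prev = prev_point
    t_next, lon_next, lat_next = next_point
    if t_next == t_prev:
        raise ValueError("neighboring track points must differ in time")
    weight = (time_i - t_prev) / (t_next - t_prev)
    longitude_i = lon_prev + weight * (lon_next - lon_prev)
    latitude_i = lat_prev + weight * (lat_next - lat_prev)
    return longitude_i, latitude_i
```

When time_i lies exactly halfway between the two neighbors, this reduces to the mean of their coordinates, which is consistent with the mean-value interpolation named in the text.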
Continuing the example from S201, each sample in the resulting standardized data includes 7 attributes, as shown in Table 2:
Table 2 Fields of the standardized data
[Table 2 is provided as an image in the original publication]
In a specific optional embodiment, performing data cleaning and standardization on the traffic data to obtain the standardized data includes:
carrying out data segmentation and formatting processing on the traffic data according to a preset mode to obtain formatted data;
and filling missing values in the formatted data containing the traffic attributes, and standardizing the filled data to obtain the standardized data.
S203: the local density and relative distance of each sample point are calculated.
Currently, there are four types of spatio-temporal data mining: spatio-temporal clustering, spatio-temporal classification, spatio-temporal pattern mining and anomaly detection. Spatio-temporal clustering is similar to classical clustering: objects in a data set that are similar in the time and space dimensions are divided into clusters, so that the similarity between samples of different clusters is small enough and the similarity between samples of the same cluster is large enough.
The clustering algorithm used in this embodiment to process taxi spatio-temporal data belongs to the category of spatio-temporal data mining, and aims to find the pattern of traffic congestion in time and space. The obvious characteristic of traffic congestion is that a large number of vehicles gather in the same area, so this embodiment takes the density peak clustering (DPC) algorithm as its basis. The DPC algorithm, however, suffers from low precision in selecting cluster centers and from a single allocation strategy. An improved DPC algorithm is therefore proposed for cluster analysis of congested areas, which addresses problems of the DPC algorithm and other clustering algorithms such as a narrow range of application and results that depend on the input parameters.
The way the local density ρ and the relative distance δ are calculated in the DPC algorithm directly determines how the sample points are distributed on the decision graph, and accurate selection of the cluster centers is of great significance for the clustering result. In the decision graph of the standard DPC algorithm, however, the cluster centers often do not stand out sufficiently or there are too many interference points, so the cluster centers are difficult to select accurately.
In this embodiment, the relative distance is calculated using the position information of the K nearest neighbors of each sample point: the shortest distance from the nearest higher-density point of a sample point to the set of K nearest neighbors of that sample point is taken as the relative distance δ_i of the sample point, calculated as follows:
[formula (3-3) provided as an image in the original publication]
where sample point j is the point of higher local density nearest to sample point i, and sample point x is a sample point in the set of K nearest neighbors of sample point i.
With formula (3-3), the relative distance of most sample points in a cluster that are not local density maxima is 0, the relative distance of cluster edge points is greatly reduced, and the relative distance of cluster center points remains essentially unchanged. In the decision graph, this means that most interference points have an ordinate of 0, which makes the cluster centers stand out.
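The behavior described for formula (3-3) can be sketched as follows. Because the formula itself is not reproduced in the text, this is an interpretation of the verbal definition: δ_i is the smallest distance from any of the K nearest neighbors x of point i to point j, the nearest point of higher density. Excluding point i from its own neighbor set and giving the global density peak its largest distance, as standard DPC does, are both assumptions; the function name is hypothetical.

```python
import math


def knn_relative_distance(points, density, k):
    """Relative distance per point using K-nearest-neighbor information."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    delta = [0.0] * n
    for i in range(n):
        higher = [m for m in range(n) if density[m] > density[i]]
        if not higher:
            # Assumption: the global density peak keeps its largest distance.
            delta[i] = max(dist[i])
            continue
        # j: the nearest point whose local density exceeds that of point i.
        j = min(higher, key=lambda m: dist[i][m])
        # K nearest neighbors of i (assumption: i itself is excluded).
        others = (m for m in range(n) if m != i)
        neighbors = sorted(others, key=lambda m: dist[i][m])[:k]
        # If j is among the neighbors, the distance from j to itself is 0.
        delta[i] = min(dist[x][j] for x in neighbors)
    return delta
```

Whenever the nearest higher-density point already belongs to the K nearest neighbors, the result is 0, which reproduces the effect stated above for non-peak points inside a cluster.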
S204: and determining the cluster center according to the local density and the relative distance.
Specifically, the local density and the relative distance are used as coordinate axes to construct a decision graph, and an effective cluster-like center is determined based on the distribution of points on the decision graph.
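The text does not state how centers are read off the decision graph. One simple possibility, shown here purely as an assumption, is to keep the points whose local density and relative distance both exceed chosen thresholds; the names `select_centers`, `rho_min` and `delta_min` are hypothetical.

```python
def select_centers(density, delta, rho_min, delta_min):
    """Return indices of points lying in the upper-right of the decision graph.

    Assumption: a point counts as a cluster center when its local density is
    at least rho_min and its relative distance is at least delta_min.
    """
    return [
        index
        for index, (rho_i, delta_i) in enumerate(zip(density, delta))
        if rho_i >= rho_min and delta_i >= delta_min
    ]
```

Because most interference points have a relative distance of 0 under the K-nearest-neighbor definition, even a modest `delta_min` separates them from the candidate centers.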
The sample point allocation strategy of the DPC algorithm directly determines the clustering result. Simply assigning each sample point to the cluster of its nearest higher-density point may cluster strip-shaped data sets poorly, whereas the DBSCAN algorithm handles strip-shaped data sets well, and the key to this is its idea of density reachability. The density reachability idea of DBSCAN is therefore introduced to improve the DPC algorithm's handling of strip-shaped data sets. Convex data sets are characterized by high density at the cluster center and decreasing density toward the cluster edge, so the idea of density gradient descent is also added to the allocation strategy so that convex data can be processed more effectively.
S205: each sample point is clustered based on the cluster center.
In a specific optional embodiment, clustering each sample point according to density reachability and density gradient descent, based on the cluster centers, includes:
taking each cluster center as a current sample point;
traversing in sequence and judging whether the other points within the cutoff distance of the current sample point are density-reachable or satisfy density gradient descent;
and if the other points within the cutoff distance of every current sample point are all density-reachable or satisfy density gradient descent, determining the cluster centers as target centers, and clustering according to the target centers.
Optionally, after traversing in sequence and judging whether the other points within the cutoff distance of the current sample point are density-reachable or satisfy density gradient descent, the method further includes:
and if a current sample point exists whose other points within the cutoff distance are density-reachable or satisfy density gradient descent, returning to the step of determining the cluster centers according to the local density and the relative distance and continuing execution.
In this embodiment, sample points whose distance from a cluster center is smaller than a preset distance threshold are taken as allocated sample points, and sample points whose distance from any cluster center is greater than the preset distance threshold are taken as sample points to be allocated. Density reachability is determined as follows:
When the distance $d_{ij}$ between a sample point to be allocated $j$ and an allocated sample point $i$ is less than the cutoff distance $d_c$, and the local density $\rho_j$ of sample point $j$ is greater than a preset density threshold $\alpha$, sample point $j$ is said to be density reachable from sample point $i$:
$$\rho_j > \alpha,\quad d_{ij} < d_c \quad (label_i = 1, 2, \dots, N;\ label_j = 0)$$
$$\alpha = \rho_{\mathrm{ceil}(\beta \times n)}$$
where $\rho$ denotes the local densities of all sample points sorted in ascending order, $\beta$ takes a value in $(0, 1)$, the function $\mathrm{ceil}()$ rounds its input up to the nearest integer, $n$ is the total number of samples, $\rho_{\mathrm{ceil}(\beta \times n)}$ is the value at position $\mathrm{ceil}(\beta \times n)$ of the sorted set, $N$ is the number of clusters into which the whole data set is partitioned, and $label_i$ is the cluster label of sample point $i$; a label value of 0 indicates that the sample point has not yet been assigned to a cluster.
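The density-reachability test can be sketched in a few lines of Python. This is an illustration only: the function names `density_threshold` and `density_reachable`, the list-of-lists distance matrix `dist`, and the convention that label 0 means "unassigned" are assumptions made for the example, and the threshold is read as the value at 1-based position ceil(β × n) of the ascending-sorted densities.

```python
import math

def density_threshold(rho, beta):
    """Return alpha: the value at 1-based position ceil(beta * n) of the
    local densities sorted in ascending order (beta must lie in (0, 1))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be in the open interval (0, 1)")
    ordered = sorted(rho)
    position = math.ceil(beta * len(ordered))  # always in 1..n for beta in (0, 1)
    return ordered[position - 1]

def density_reachable(i, j, rho, dist, labels, d_c, alpha):
    """True if unassigned point j (label 0) is density reachable from
    assigned point i: d_ij < d_c and rho_j > alpha."""
    return (labels[i] != 0 and labels[j] == 0
            and dist[i][j] < d_c and rho[j] > alpha)
```

A larger β raises the threshold α, so fewer low-density points qualify as density reachable and more of them are left for the gradient-descent rule or for a later pass.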
Optionally, density gradient descent is determined by the following formula:
$$\rho_j < \rho_i,\quad d_{ij} < d_c \quad (label_i = 1, 2, \dots, N;\ label_j = 0)$$
where $d_c$ is the cutoff distance and $d_{ij}$ is the distance between the sample point to be allocated $j$ and the allocated sample point $i$.
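The following Python sketch shows how the two allocation rules might be combined to spread cluster labels outward from the cluster centers. It rests on stated assumptions: "density gradient descent" is taken to mean that the unassigned point has a lower local density than the assigned point it is compared with, labels are spread breadth-first, and the names `gradient_descending` and `propagate_labels` are hypothetical. It is not presented as the patent's exact procedure.

```python
from collections import deque

def gradient_descending(i, j, rho, dist, labels, d_c):
    """Assumed rule: unassigned j lies within d_c of assigned i and has
    lower local density than i."""
    return (labels[i] != 0 and labels[j] == 0
            and dist[i][j] < d_c and rho[j] < rho[i])

def propagate_labels(centers, rho, dist, d_c, alpha):
    """Spread labels 1..N from the given center indices. A point j joins the
    cluster of an assigned neighbor i when it is density reachable
    (rho_j > alpha) or density-gradient descending, both within d_c."""
    n = len(rho)
    labels = [0] * n                      # 0 means "not yet assigned"
    queue = deque()
    for k, c in enumerate(centers, start=1):
        labels[c] = k
        queue.append(c)
    while queue:
        i = queue.popleft()
        for j in range(n):
            if labels[j] != 0 or dist[i][j] >= d_c:
                continue
            reachable = rho[j] > alpha
            if reachable or gradient_descending(i, j, rho, dist, labels, d_c):
                labels[j] = labels[i]
                queue.append(j)
    return labels
```

Points that satisfy neither rule keep label 0, which leaves room for a later step (for example, re-selecting cluster centers) to handle them.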
In this embodiment, initial traffic data is acquired; the traffic data is cleaned and standardized to obtain standardized data, each piece of standardized data is taken as one sample, and each sample is projected into a coordinate system as a sample point; the local density and the relative distance of each sample point are calculated; the cluster centers are determined according to the local density and the relative distance; and, based on the cluster centers, each sample point is clustered according to density reachability and density gradient descent. Determining the cluster centers from the relative distance and the local density reduces the number of interference points and improves clustering accuracy.
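The local density, relative distance, and center selection steps summarized above can be pictured with the classic density peak clustering (DPC) formulation. The sketch below is a stand-in under explicit assumptions: it uses the cutoff-kernel density (number of neighbors closer than d_c), the classic relative distance (distance to the nearest higher-density point), and a ranking by the product of the two. The method described in this document computes the relative distance from K-nearest-neighbor position information, which is not reproduced here, and all function names are hypothetical.

```python
import math

def pairwise(points):
    """Euclidean distance matrix as a list of lists."""
    return [[math.dist(p, q) for q in points] for p in points]

def local_density(dist, d_c):
    """Cutoff-kernel density: how many other points lie closer than d_c."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
            for i in range(n)]

def relative_distance(dist, rho):
    """Classic DPC delta: distance to the nearest point of higher density;
    a point with no denser point gets its largest distance instead."""
    n = len(dist)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return delta

def pick_centers(rho, delta, k):
    """Take the k points with the largest rho * delta as cluster centers."""
    gamma = [r * d for r, d in zip(rho, delta)]
    return sorted(range(len(rho)), key=lambda i: gamma[i], reverse=True)[:k]
```

Ranking by the product favors points that are both dense and far from any denser point, which is what a decision graph shows visually. Ties in density are not handled specially in this sketch.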
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 is a schematic block diagram of a data clustering device for traffic data, which corresponds to the data clustering method for traffic data according to the above-described embodiment. As shown in fig. 3, the data clustering device for traffic data includes a data obtaining module 31, a data preprocessing module 32, a distance calculating module 33, a cluster center determining module 34, and a sample clustering module 35. The detailed description of each functional module is as follows:
a data acquisition module 31, configured to acquire initial traffic data;
the data preprocessing module 32 is configured to perform data cleaning on the traffic data, perform standardization processing to obtain standardized data, use each piece of standardized data as one sample data, and project each sample data into a coordinate system as one sample point;
a distance calculation module 33 for calculating the local density and relative distance of each sample point;
a cluster center determining module 34, configured to determine the cluster centers according to the local density and the relative distance;
and the sample clustering module 35 is configured to cluster each sample point based on the cluster centers according to density reachability and density gradient descent.
Optionally, the sample clustering module 35 includes:
the sample point selecting unit is used for taking the center of each cluster as a current sample point;
the traversing unit is used for sequentially traversing the other points within the cutoff distance of the current sample point and judging whether each is density reachable or density-gradient descending;
and the clustering unit is used for determining the cluster center as the target center if the other points within the cutoff distance of each current sample point are all density reachable or density-gradient descending, and clustering according to the target center.
Optionally, the apparatus further comprises:
and the loop module is used for returning to the step of determining the cluster centers according to the local density and the relative distance and continuing execution if there is a current sample point for which other points within the cutoff distance are density reachable or density-gradient descending.
Optionally, the data preprocessing module 32 includes:
the data formatting unit is used for carrying out data segmentation and formatting processing on the traffic data according to a preset mode to obtain formatted data;
and the data standardization unit is used for carrying out data filling on the formatted data containing the traffic attributes and carrying out standardization conversion on the filled data to obtain standardized data.
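The filling and standardization performed by the data standardization unit can be sketched as follows. Mean filling follows the averaging method named in claim 6; the use of a z-score (subtract the column mean, divide by the population standard deviation) is an assumption, since the text does not say which standardization is applied. The function names and the use of `None` for missing values are also illustrative choices.

```python
import math

def fill_with_column_mean(rows):
    """Replace None entries with the mean of the observed values in the same
    column. Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]

def standardize(rows):
    """Z-score each column; a constant column is mapped to zeros."""
    n = len(rows)
    out = [list(r) for r in rows]
    for c in range(len(rows[0])):
        col = [r[c] for r in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for r in out:
            r[c] = (r[c] - mean) / std if std > 0 else 0.0
    return out
```

Standardizing after filling keeps attributes with different units (for example speed and vehicle count) on a comparable scale before distances between sample points are computed.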
For the specific definition of the data clustering device of the traffic data, reference may be made to the above definition of the data clustering method of the traffic data, which is not described herein again. All or part of each module in the data clustering device of the traffic data can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor of the computer device, and can also be stored in a memory of the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In order to solve the technical problem, the embodiment of the application further provides computer equipment. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only the computer device 4 having the components memory 41, processor 42, and network interface 43 is shown, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device thereof. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application provides another embodiment, which is to provide a computer-readable storage medium storing an interface display program, which is executable by at least one processor to cause the at least one processor to perform the steps of the data clustering method for traffic data as described above.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present application or portions thereof contributing to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, and an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method described in the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. The application can be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A data clustering method of traffic data is characterized by comprising the following steps:
acquiring initial traffic data;
carrying out data cleaning and standardization processing on the traffic data to obtain standardized data, taking each piece of standardized data as sample data, and projecting each piece of sample data into a coordinate system to serve as a sample point;
calculating the local density and the relative distance of each sample point;
determining cluster centers according to the local density and the relative distance;
and clustering each sample point based on the cluster center.
2. The method for data clustering of traffic data according to claim 1, wherein the clustering of each of the sample points based on the cluster center comprises:
taking the center of each cluster as a current sample point;
sequentially traversing the other points within the cutoff distance of the current sample point and judging whether each is density reachable or density-gradient descending;
and if the other points within the cutoff distance of each current sample point are all density reachable or density-gradient descending, determining the cluster center as a target center, and clustering according to the target center.
3. The method for data clustering of traffic data according to claim 2, wherein after the sequentially traversing determines whether the density of other points within the cutoff distance of the current sample point is reachable or the density gradient is decreased, the method further comprises:
and if there is a current sample point for which other points within the cutoff distance are density reachable or density-gradient descending, returning to the step of determining the cluster centers according to the local density and the relative distance and continuing execution.
4. The method for data clustering of traffic data according to claim 2, wherein sample points whose distance from a cluster center is smaller than a preset distance threshold are taken as allocated sample points, sample points whose distance from any cluster center is greater than the preset distance threshold are taken as sample points to be allocated, and density reachability is determined as follows:
when the distance $d_{ij}$ between a sample point to be allocated $j$ and an allocated sample point $i$ is less than the cutoff distance $d_c$, and the local density $\rho_j$ of sample point $j$ is greater than a preset density threshold $\alpha$, sample point $j$ is said to be density reachable from sample point $i$:
$$\rho_j > \alpha,\quad d_{ij} < d_c \quad (label_i = 1, 2, \dots, N;\ label_j = 0)$$
$$\alpha = \rho_{\mathrm{ceil}(\beta \times n)}$$
where $\rho$ denotes the local densities of all sample points sorted in ascending order, $\beta$ takes a value in $(0, 1)$, the function $\mathrm{ceil}()$ rounds its input up to the nearest integer, $n$ is the total number of samples, $\rho_{\mathrm{ceil}(\beta \times n)}$ is the value at position $\mathrm{ceil}(\beta \times n)$ of the sorted set, $N$ is the number of clusters into which the whole data set is partitioned, and $label_i$ is the cluster label of sample point $i$; a label value of 0 indicates that the sample point has not yet been assigned to a cluster.
5. The method for data clustering of traffic data according to claim 4, wherein the density gradient descent is determined by the following equation:
$$\rho_j < \rho_i,\quad d_{ij} < d_c \quad (label_i = 1, 2, \dots, N;\ label_j = 0)$$
where $d_c$ is the cutoff distance and $d_{ij}$ is the distance between the sample point to be allocated $j$ and the allocated sample point $i$.
6. The method for data clustering of traffic data according to any one of claims 1 to 5, wherein the data cleaning and normalization of the traffic data comprises:
carrying out data segmentation and formatting processing on the traffic data to obtain formatted data;
and filling data in the formatted data containing the traffic attributes by adopting an averaging method, and performing standardized conversion on the filled data to obtain the standardized data.
7. A data clustering device for traffic data, characterized in that the data clustering device for traffic data comprises:
the data acquisition module is used for acquiring initial traffic data;
the data preprocessing module is used for cleaning the traffic data, standardizing the traffic data to obtain standardized data, taking each piece of standardized data as sample data, and projecting each piece of sample data into a coordinate system to serve as a sample point;
the distance calculation module is used for calculating the local density and the relative distance of each sample point;
the cluster center determining module is used for determining cluster centers according to the local density and the relative distance;
and the sample clustering module is used for clustering each sample point based on the cluster centers according to density reachability and density gradient descent.
8. The apparatus for data clustering of traffic data according to claim 7, wherein the sample clustering module comprises:
the sample point selecting unit is used for taking the center of each cluster as a current sample point;
the traversing unit is used for sequentially traversing the other points within the cutoff distance of the current sample point and judging whether each is density reachable or density-gradient descending;
and the clustering unit is used for determining the cluster center as a target center if the other points within the cutoff distance of each current sample point are all density reachable or density-gradient descending, and clustering according to the target center.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data clustering method of traffic data according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method for data clustering of traffic data according to any one of claims 1 to 6.
CN202211538308.2A 2022-12-02 2022-12-02 Traffic data clustering method, device, equipment and medium Active CN115563522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211538308.2A CN115563522B (en) 2022-12-02 2022-12-02 Traffic data clustering method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115563522A true CN115563522A (en) 2023-01-03
CN115563522B CN115563522B (en) 2023-04-07

Family

ID=84770022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211538308.2A Active CN115563522B (en) 2022-12-02 2022-12-02 Traffic data clustering method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115563522B (en)

Also Published As

Publication number Publication date
CN115563522B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant