CN108509594A - A traffic big data cleaning system based on a cloud computing framework - Google Patents
- Publication number
- CN108509594A (application CN201810275731.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- reference point
- similar
- cloud computing
- metadata
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Traffic Control Systems (AREA)
Abstract
The present invention provides a traffic big data cleaning system based on a cloud computing framework, comprising a data preprocessing module, a similar data connection module, and a data storage module. The data preprocessing module scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment. The similar data connection module performs a similarity join on the data processed by the data preprocessing module, finds pairs of data whose similarity value exceeds a set threshold, and sends the resulting similar data pairs to the data storage module for storage.
Description
Technical field
The present invention relates to the field of intelligent transportation, and in particular to a traffic big data cleaning system based on a cloud computing framework.
Background
Intelligent transportation systems help improve traffic flow, and for cities troubled by traffic congestion, the congestion problem is drawing more and more attention. Traffic sensors are an important data source for intelligent transportation systems. However, owing to factors such as device accuracy, device faults, and the acquisition environment, abnormal or erroneous data are often collected. This reduces the accuracy of most intelligent transportation applications (such as traffic state estimation and traffic state prediction). Traffic data therefore need to be cleaned: missing data filled in, erroneous data rejected, and abnormal data corrected. Because many types of traffic sensors are in use and their sampling frequencies are high, the system faces massive volumes of traffic data and must take processing efficiency into account during cleaning.
Summary of the invention
In view of the above problems, the present invention provides a traffic big data cleaning system based on a cloud computing framework.
The object of the present invention is achieved by the following technical solution:
A traffic big data cleaning system based on a cloud computing framework is provided, comprising a data preprocessing module, a similar data connection module, and a data storage module. The data preprocessing module scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment. The similar data connection module performs a similarity join on the data processed by the data preprocessing module, finds pairs of data whose similarity value exceeds a set threshold, and sends the resulting similar data pairs to the data storage module for storage.
Preferably, scanning the entire data source and, where data are missing, filling them with the mean of the corresponding dimension of the data from the same road segment specifically includes:
(1) In the Map function, read in the historical data and obtain the values of the data elements; then parse the location information of where the data were generated to obtain the road segment label;
(2) Construct a data object with the road segment label as the key and distribute it; the data object attributes include the key, the sensor data set, the missing information, the data elements, and the location information;
(3) In the Reduce function, calculate the mean μ and the standard deviation of each dimension of the sub-block, select the sub-data falling within the range [formula not reproduced in the source text], compute their mean, and fill the missing data with it.
Further, the data preprocessing module also performs anomaly detection on the completed data and rejects data that fall outside the set threshold range.
The beneficial effects of the present invention are as follows. The parallel computing capability of a cloud computing platform is used to solve the computing speed problem in cleaning traffic big data. With data cleaning as the goal, completion, anomaly detection, and similarity join processing are applied to the data, which helps ensure the accuracy and robustness of subsequent applications that rely on the data.
Brief description of the drawings
The invention is further described with reference to the accompanying drawings. The embodiments in the drawings do not limit the invention in any way, and a person of ordinary skill in the art can derive other drawings from the following drawings without creative effort.
Fig. 1 is the system structure schematic block diagram of an illustrative embodiment of the invention;
Fig. 2 is the structural schematic block diagram of the data preprocessing module of an illustrative embodiment of the invention.
Reference numerals:
Data preprocessing module 1, similar data connection module 2, data storage module 3, data completion unit 10, data anomaly detection unit 20.
Detailed description
The invention is further described with reference to the following embodiments.
Referring to Fig. 1, this embodiment provides a traffic big data cleaning system based on a cloud computing framework, comprising a data preprocessing module 1, a similar data connection module 2, and a data storage module 3. The data preprocessing module 1 scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment. The similar data connection module 2 performs a similarity join on the data processed by the data preprocessing module 1, finds pairs of data whose similarity value exceeds a set threshold, and sends the resulting similar data pairs to the data storage module 3 for storage.
In one embodiment, scanning the entire data source and, where data are missing, filling them with the mean of the corresponding dimension of the data from the same road segment specifically includes:
(1) In the Map function, read in the historical data and obtain the values of the data elements; then parse the location information of where the data were generated to obtain the road segment label;
(2) Construct a data object with the road segment label as the key and distribute it; the data object attributes include the key, the sensor data set, the missing information, the data elements, and the location information;
(3) In the Reduce function, calculate the mean μ and the standard deviation of each dimension of the sub-block, select the sub-data falling within the range [formula not reproduced in the source text], compute their mean, and fill the missing data with it.
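The three steps above can be sketched in plain Python without a MapReduce runtime. This is a minimal illustration, not the patented implementation: the record layout (`segment`, `values`), the function names, and in particular the selection range μ ± k·σ with the parameter `k` are assumptions, because the range formula is not reproduced in the source text.

```python
from collections import defaultdict
from statistics import mean, pstdev

def map_records(records):
    """Map step: emit (road segment label, sensor values) pairs."""
    for rec in records:
        yield rec["segment"], rec["values"]

def reduce_fill(rows, k=1.0):
    """Reduce step for one road segment: fill None entries per dimension."""
    dims = len(rows[0])
    fills = []
    for d in range(dims):
        present = [r[d] for r in rows if r[d] is not None]
        if not present:
            fills.append(None)  # nothing observed, leave the gap
            continue
        mu, sigma = mean(present), pstdev(present)
        # ASSUMPTION: keep values within mu +/- k*sigma; the source range is missing.
        kept = [v for v in present if abs(v - mu) <= k * sigma]
        fills.append(mean(kept) if kept else mu)
    return [[fills[d] if r[d] is None else r[d] for d in range(dims)]
            for r in rows]

def clean(records, k=1.0):
    """Group by key (the shuffle), then reduce each road segment."""
    groups = defaultdict(list)
    for key, values in map_records(records):
        groups[key].append(values)
    return {key: reduce_fill(rows, k) for key, rows in groups.items()}
```

Restricting the fill value to observations near the mean keeps a single extreme reading from distorting the value written into the gaps.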
Further, the data preprocessing module 1 also performs anomaly detection on the completed data and rejects data that fall outside the set threshold range.
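The threshold-range rejection can be illustrated with a short filter. The source does not say how the thresholds are chosen, so the per-dimension `(low, high)` bounds and the function name below are hypothetical.

```python
def reject_out_of_range(rows, bounds):
    """Keep only rows whose every dimension lies inside its [low, high] bound.

    bounds: one (low, high) tuple per dimension (hypothetical thresholds).
    """
    kept = []
    for row in rows:
        if all(low <= v <= high for v, (low, high) in zip(row, bounds)):
            kept.append(row)
    return kept
```

For example, with bounds of 0 to 150 for a speed dimension and 0 to 500 for a flow dimension, a row with speed 250 or flow -3 would be rejected.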
As shown in Fig. 2, the data preprocessing module 1 includes a data completion unit 10 and a data anomaly detection unit 20 that are connected to each other.
The above embodiment of the invention uses the parallel computing capability of a cloud computing platform to solve the computing speed problem in cleaning traffic big data. With data cleaning as the goal, it applies completion, anomaly detection, and similarity join processing to the data, which helps ensure the accuracy and robustness of subsequent applications that rely on the data.
In one embodiment, the similar data connection module 2 performs the similarity join on the data processed by the data preprocessing module 1 as follows:
(1) Randomly extract a segment of data and build a time series in order of data acquisition time;
(2) Select multiple reference points from the time series, build a distance-tree-based data index structure for the data in the time series from the selected reference points, and use the data index structure to generate the MapReduce data partition scheme;
(3) Using the reference point set, the data index structure, and the MapReduce data partition scheme information as global variables, compute exactly which data are similar using MapReduce tasks, obtaining all data pairs in the time series whose similarity value exceeds the set threshold.
In one embodiment, selecting multiple reference points from the time series specifically includes:
(1) Randomly select a data item from the time series, find the data item in the time series farthest from it, denote it v1, and set v1 as the first reference point;
(2) Find the data item farthest from v1, denote it v2, and set v2 as the second reference point;
(3) Using the reference points already selected, calculate the distance difference weight for each data item xi not yet selected as a reference point, and select the data item with the minimum distance difference weight as the next reference point;
(4) Repeat (3) until the set number of reference points has been selected, and place them in the reference point sequence set.
The distance difference weight is calculated by the formula:
[formula not reproduced in the source text]
where the weight is defined for each data item xi not yet selected as a reference point, S(v1, v2) denotes the distance between reference points v1 and v2, S(vk, xi) denotes the distance between data item xi and an already selected reference point vk, and Ω denotes the set of already selected reference points.
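The selection procedure can be sketched as follows. Steps (1) and (2) follow the text directly. The weight used in step (3) is an assumption, since the formula is not reproduced in the source: the default below sums |S(v1, v2) - S(vk, xi)| over the selected reference points vk, which only uses the quantities the text names. The function name and the `weight`, `dist`, and `seed` parameters are hypothetical.

```python
import random

def select_reference_points(series, count, dist, weight=None, seed=0):
    """Pick `count` reference points from `series` (assumes distinct items)."""
    rng = random.Random(seed)
    start = rng.choice(series)
    v1 = max(series, key=lambda x: dist(start, x))  # farthest from a random item
    v2 = max(series, key=lambda x: dist(v1, x))     # farthest from v1
    refs = [v1, v2]
    base = dist(v1, v2)
    if weight is None:
        # ASSUMPTION: stand-in for the missing distance difference weight.
        def weight(x, chosen):
            return sum(abs(base - dist(v, x)) for v in chosen)
    while len(refs) < count:
        candidates = [x for x in series if x not in refs]
        if not candidates:
            break
        # Step (3): the candidate with the minimum weight becomes a reference point.
        refs.append(min(candidates, key=lambda x: weight(x, refs)))
    return refs[:count]
```

Passing a different `weight` callable lets the sketch be aligned with the actual formula once it is known.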
The choice of reference points affects the performance of the data similarity analysis, and a good reference point set can segment the time series more appropriately. This embodiment defines a selection mechanism for reference points. With this mechanism, outliers of the time series have a higher probability of becoming reference points, and the selected reference points lie far apart from one another, so that the selected reference point set segments the time series better and improves the performance of the data similarity analysis.
In one embodiment, building the distance-tree-based data index structure for the data in the time series specifically includes:
(1) Initialize the distance tree and build its root node: take the first reference point v1 in the reference point sequence set as the reference point of the root node, and set the root node's level to 0, position Ω = 0, and count m = 0;
(2) Partition the root node using the first reference point, generating multiple leaf nodes as children of the root node. Each leaf node has three attributes: its level, the number of data items it holds, and its position information, where the position information expresses the distance from the leaf node to the reference point of its parent node as a multiple of the set distance threshold ε;
Insert the data. The distance tree is built by inserting data items one at a time; inserting a data item means distributing it to the corresponding leaf node. A data item distributed to leaf node α must satisfy that its distance to the reference point of α's parent node lies in the interval [(Ωα-1)×ε, Ωα×ε), where Ωα is the position information of leaf node α, and the amount of data stored in α must be less than the set maximum capacity;
If the data set to be distributed is {x1, x2, ..., xn}, the number of intervals is set by:
[formula not reproduced in the source text]
where S(xa, vr) is the distance between data item xa and the reference point vr corresponding to the parent node;
(3) If the amount of data stored in a leaf node reaches the set maximum capacity, choose a new reference point from the reference point sequence set to partition that leaf node, generate the corresponding child nodes, and distribute the data held in the leaf node to its child nodes. Repeat this process until every leaf node or child node holds fewer data items than the set maximum capacity.
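The insertion and splitting steps above can be sketched as below. The position rule follows from the stated interval: a distance d in [(Ω-1)×ε, Ω×ε) gives Ω = floor(d/ε) + 1. Because the formula for the number of intervals is not reproduced in the source, this sketch creates child nodes lazily instead of allocating them up front. The class and function names are hypothetical, and using the k-th reference point to split every node at level k is an assumption.

```python
import math

class Node:
    def __init__(self, level, position=0):
        self.level = level        # depth in the tree, root = 0
        self.position = position  # Omega: interval index under the parent's reference point
        self.data = []            # items held while the node is a leaf
        self.children = {}        # position -> child Node once the node is split
        self.ref = None           # reference point used to partition this node

def position_of(x, ref, eps, dist):
    # distance in [(Omega - 1) * eps, Omega * eps)  =>  Omega = floor(d / eps) + 1
    return math.floor(dist(x, ref) / eps) + 1

def insert(node, x, refs, eps, capacity, dist):
    while node.ref is not None:  # descend through already partitioned nodes
        pos = position_of(x, node.ref, eps, dist)
        if pos not in node.children:
            node.children[pos] = Node(node.level + 1, pos)
        node = node.children[pos]
    node.data.append(x)
    # Split a full leaf with the next reference point, if one is left.
    if len(node.data) >= capacity and node.level < len(refs):
        node.ref = refs[node.level]
        pending, node.data = node.data, []
        for y in pending:
            insert(node, y, refs, eps, capacity, dist)

def build(data, refs, eps, capacity, dist):
    root = Node(0)
    root.ref = refs[0]  # the root is partitioned by the first reference point
    for x in data:
        insert(root, x, refs, eps, capacity, dist)
    return root
```

When the reference points run out, a leaf in this sketch may exceed `capacity`; how the patented system handles that case is not stated in the source.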
In the related art, similarity joins are performed using disk-based similarity join algorithms or brute-force algorithms. Disk-based similarity join algorithms lack effectiveness and scalability for in-memory join computation. A brute-force algorithm compares every pair of records in the data set, so its computational cost grows quadratically with the number of records, which makes it infeasible for real data.
Research over the past twenty years has shown experimentally that applying pruning strategies during a similarity join is a feasible approach. This embodiment uses the distance-tree-based data index structure to perform the similarity join, which prunes unnecessary data comparisons, reduces redundant similarity computation, and lowers the computational cost of the traffic big data cleaning system. Setting the number of intervals by a formula based on the maximum distance to the reference point helps build a reasonable distance tree, laying the foundation for the subsequent data partitioning.
In one embodiment, using the data index structure to generate the MapReduce data partition scheme specifically includes:
(1) Create a graph H(G, Z) from the distance-tree-based data index structure. The vertex set G consists of all leaf nodes of the distance tree, and the edge set Z consists of the node pairs that cannot be pruned by the pruning principle; each vertex also has an edge connecting it to itself. The weight q(d) of a vertex d is set to the data volume of the corresponding leaf node, and the weight q(v) of an edge v equals the product of the weights of its two vertices;
The pruning principle is:
Given two leaf nodes α1 and α2 at levels L1 and L2, assume L1 ≥ L2. Let the position sequences of the nodes passed from the root node to α1 and α2 be {w1, w2, ..., wL1} and {ζ1, ζ2, ..., ζL2}, respectively. If, for any t ≤ L2, wt + 2 < ζt or wt > ζt + 2, then the distance between any data item in α1 and any data item in α2 is greater than ε;
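The pruning test can be written as a one-line check over the two position sequences. This sketch reads "for any t" as "for at least one t", which is an interpretation of the translated text: by the triangle inequality a single such level is enough, assuming nodes at the same level share a reference point. The function name is hypothetical.

```python
def can_prune(w, zeta):
    """Return True if the two leaves cannot contain a pair within distance eps.

    w, zeta: position sequences from the root to each leaf. Only the first
    min(len(w), len(zeta)) levels are compared, matching t <= L2.
    """
    return any(wt + 2 < zt or wt > zt + 2 for wt, zt in zip(w, zeta))
```

Leaf pairs for which `can_prune` returns False are the ones that would become edges in the graph H(G, Z).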
(2) Divide H(G, Z) into two subgraphs H(G, Z)1 and H(G, Z)2 that satisfy the following balance condition:
[formula not reproduced in the source text]
where θ is the set balance threshold;
(3) Add the subgraphs H(G, Z)1 and H(G, Z)2 to a priority queue, in which subgraphs are arranged in descending order of cost;
The cost of subgraph H(G, Z)i is calculated by:
[formula not reproduced in the source text]
(4) In the next iteration, select the subgraph at the front of the priority queue, randomly divide it into two subgraphs with the same number of vertices, and add the resulting subgraphs to the priority queue. Repeat this process until the cost of the subgraph at the front of the priority queue is below the set cost threshold, then output the final partition scheme as the MapReduce data partition scheme.
Through graph-based partitioning, this embodiment places similar data in the same partition as far as possible and reduces the data exchange and the number of copies between partitions as far as possible. Requiring the partitioning to satisfy the balance condition and the cost condition helps minimize the data transmission cost and redundancy of the traffic big data cleaning system in the Reduce tasks while ensuring load balance.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit its scope of protection. Although the invention has been explained in detail with reference to preferred embodiments, a person skilled in the art should understand that the technical solution of the invention can be modified or equivalently replaced without departing from its essence and scope.
Claims (6)
1. A traffic big data cleaning system based on a cloud computing framework, characterized by comprising a data preprocessing module, a similar data connection module, and a data storage module, wherein the data preprocessing module scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment; and the similar data connection module performs a similarity join on the data processed by the data preprocessing module, finds pairs of data whose similarity value exceeds a set threshold, and sends the resulting similar data pairs to the data storage module for storage.
2. The traffic big data cleaning system based on a cloud computing framework according to claim 1, characterized in that scanning the entire data source and, where data are missing, filling them with the mean of the corresponding dimension of the data from the same road segment specifically includes:
(1) In the Map function, read in the historical data and obtain the values of the data elements; then parse the location information of where the data were generated to obtain the road segment label;
(2) Construct a data object with the road segment label as the key and distribute it; the data object attributes include the key, the sensor data set, the missing information, the data elements, and the location information;
(3) In the Reduce function, calculate the mean μ and the standard deviation of each dimension of the sub-block, select the sub-data falling within the range [formula not reproduced in the source text], compute their mean, and fill the missing data with it.
3. The traffic big data cleaning system based on a cloud computing framework according to claim 1, characterized in that the data preprocessing module also performs anomaly detection on the completed data and rejects data that fall outside the set threshold range.
4. The traffic big data cleaning system based on a cloud computing framework according to any one of claims 1 to 3, characterized in that the similar data connection module performs the similarity join on the data processed by the data preprocessing module as follows:
(1) Randomly extract a segment of data and build a time series in order of data acquisition time;
(2) Select multiple reference points from the time series, build a distance-tree-based data index structure for the data in the time series from the selected reference points, and use the data index structure to generate the MapReduce data partition scheme;
(3) Using the reference point set, the data index structure, and the MapReduce data partition scheme information as global variables, compute exactly which data are similar using MapReduce tasks, obtaining all data pairs in the time series whose similarity value exceeds the set threshold.
5. The traffic big data cleaning system based on a cloud computing framework according to claim 4, characterized in that selecting multiple reference points from the time series specifically includes:
(1) Randomly select a data item from the time series, find the data item in the time series farthest from it, denote it v1, and set v1 as the first reference point;
(2) Find the data item farthest from v1, denote it v2, and set v2 as the second reference point;
(3) Using the reference points already selected, calculate the distance difference weight for each data item xi not yet selected as a reference point, and select the data item with the minimum distance difference weight as the next reference point;
(4) Repeat (3) until the set number of reference points has been selected, and place them in the reference point sequence set.
6. The traffic big data cleaning system based on a cloud computing framework according to claim 5, characterized in that the distance difference weight is calculated by the formula:
[formula not reproduced in the source text]
where the weight is defined for each data item xi not yet selected as a reference point, S(v1, v2) denotes the distance between reference points v1 and v2, S(vk, xi) denotes the distance between data item xi and an already selected reference point vk, and Ω denotes the set of already selected reference points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810275731.5A CN108509594A (en) | 2018-03-30 | 2018-03-30 | A kind of traffic big data cleaning system based on cloud computing framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810275731.5A CN108509594A (en) | 2018-03-30 | 2018-03-30 | A kind of traffic big data cleaning system based on cloud computing framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108509594A true CN108509594A (en) | 2018-09-07 |
Family
ID=63379261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810275731.5A Withdrawn CN108509594A (en) | 2018-03-30 | 2018-03-30 | A kind of traffic big data cleaning system based on cloud computing framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108509594A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614392A (en) * | 2018-10-25 | 2019-04-12 | 珠海派诺科技股份有限公司 | Interrupt historical data self-repairing method, device, electronic equipment and medium |
CN110096497A (en) * | 2019-03-28 | 2019-08-06 | 中国农业科学院农业信息研究所 | A kind of agricultural output data intelligence cleaning method and system |
CN112487495A (en) * | 2020-12-01 | 2021-03-12 | 李孔雀 | Data processing method based on big data and cloud computing and big data server |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614392A (en) * | 2018-10-25 | 2019-04-12 | 珠海派诺科技股份有限公司 | Interrupt historical data self-repairing method, device, electronic equipment and medium |
CN109614392B (en) * | 2018-10-25 | 2023-08-08 | 珠海派诺科技股份有限公司 | Automatic interrupt history data restoration method and device, electronic equipment and medium |
CN110096497A (en) * | 2019-03-28 | 2019-08-06 | 中国农业科学院农业信息研究所 | A kind of agricultural output data intelligence cleaning method and system |
CN112487495A (en) * | 2020-12-01 | 2021-03-12 | 李孔雀 | Data processing method based on big data and cloud computing and big data server |
CN112487495B (en) * | 2020-12-01 | 2021-07-02 | 厦门立马耀网络科技有限公司 | Data processing method based on big data and cloud computing and big data server |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112862874B (en) | Point cloud data matching method and device, electronic equipment and computer storage medium | |
CN102915347B (en) | A kind of distributed traffic clustering method and system | |
CN113033712B (en) | Multi-user cooperative training people flow statistical method and system based on federal learning | |
CN108509594A (en) | A kind of traffic big data cleaning system based on cloud computing framework | |
CN105635963B (en) | Multiple agent co-located method | |
Feng et al. | Allocation using a heterogeneous space Voronoi diagram | |
CN104077438B (en) | Power network massive topologies structure construction method and system | |
US10217225B2 (en) | Distributed processing for producing three-dimensional reconstructions | |
Barequet et al. | λ> 4: An improved lower bound on the growth constant of polyominoes | |
CN107742169A (en) | A kind of Urban Transit Network system constituting method and performance estimating method based on complex network | |
Hastings et al. | Self-correcting quantum memories beyond the percolation threshold | |
Tong et al. | Multi-UAV collaborative absolute vision positioning and navigation: a survey and discussion | |
CN112445132A (en) | Multi-agent system optimal state consistency control method | |
CN106323272B (en) | A kind of method and electronic equipment obtaining track initiation track | |
CN109741209A (en) | Power distribution network multi-source data fusion method, system and storage medium under typhoon disaster | |
Jiang et al. | Advanced network representation learning for container shipping network analysis | |
Liu et al. | Computer, intelligent computing and education technology | |
Yu et al. | A Hybrid Model Based on NeuralProphet and Long Short-Term Memory for Time Series Forecasting | |
CN110211227A (en) | A kind of method for processing three-dimensional scene data, device and terminal device | |
CN109993338A (en) | A kind of link prediction method and device | |
Zhang et al. | Robustness optimization of cloud manufacturing process under various resource substitution strategies | |
CN108898527A (en) | A kind of traffic data fill method based on the generation model for having loss measurement | |
CN115001978A (en) | Cloud tenant virtual network intelligent mapping method based on reinforcement learning model | |
CN108414018A (en) | A kind of power transformer environmental monitoring system based on big data | |
Song et al. | Novel graph processor architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20180907 |