CN108509594A - Traffic big data cleaning system based on a cloud computing framework - Google Patents

Publication number
CN108509594A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810275731.5A
Other languages
Chinese (zh)
Inventor
钟建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huitong Intelligent Technology Co Ltd
Original Assignee
Shenzhen Huitong Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huitong Intelligent Technology Co Ltd
Priority to CN201810275731.5A
Publication of CN108509594A
Current legal status: Withdrawn

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The present invention provides a traffic big data cleaning system based on a cloud computing framework, comprising a data preprocessing module, a similar-data connection module, and a data storage module. The data preprocessing module scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment. The similar-data connection module performs a similarity join on the data processed by the data preprocessing module, identifies every pair of data whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module for storage.

Description

A traffic big data cleaning system based on a cloud computing framework
Technical field
The present invention relates to the field of intelligent transportation, and in particular to a traffic big data cleaning system based on a cloud computing framework.
Background
Intelligent transportation systems help to improve traffic flow, and in cities troubled by traffic congestion the problem is attracting growing attention. Traffic sensors are an important data source for intelligent transportation systems. However, owing to factors such as device accuracy, equipment failure, and the acquisition environment, abnormal or erroneous data are often collected. This reduces the accuracy of most intelligent transportation applications, such as traffic state estimation and traffic state prediction. Traffic data therefore need to be cleaned: missing data filled in, erroneous data rejected, and abnormal data corrected. Because many types of traffic sensor are in use and sampling frequencies are high, the system faces massive volumes of traffic data, so processing efficiency must be considered in the cleaning process.
Summary of the invention
In view of the above problems, the present invention provides a traffic big data cleaning system based on a cloud computing framework.
The object of the present invention is achieved by the following technical solution:
A traffic big data cleaning system based on a cloud computing framework is provided, comprising a data preprocessing module, a similar-data connection module, and a data storage module. The data preprocessing module scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment. The similar-data connection module performs a similarity join on the data processed by the data preprocessing module, identifies every pair of data whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module for storage.
Preferably, scanning the entire data source and, where data are missing, filling them with the mean of the corresponding dimension of the data from the same road segment specifically comprises:
(1) in the Map function, reading in the historical data and obtaining the values of the data elements; then parsing the location information where the data were generated to obtain the road segment label;
(2) constructing data objects keyed by road segment label and distributing them, the attributes of a data object comprising the key, the sensor data set, the missing-data information, the data elements, and the location information;
(3) in the Reduce function, calculating the mean μ and standard deviation σ of each dimension of the sub-block, selecting the sub-data that fall within the range [range expression not preserved in the source text], calculating their mean, and filling the missing data with it.
Further, the data preprocessing module also performs anomaly detection on the completed data and rejects data that fall outside a set threshold range.
The beneficial effects of the present invention are as follows: the parallel computing capability of the cloud computing platform addresses the computation speed problem in cleaning traffic big data; with data cleaning as the goal, the data undergo completion, anomaly detection, and similarity join processing, which helps to ensure the accuracy and robustness of the subsequent applications that use the data.
Brief description of the drawings
The invention is further described with reference to the accompanying drawings. The embodiments shown in the drawings do not limit the invention in any way, and a person of ordinary skill in the art can obtain other drawings from the following drawings without creative effort.
Fig. 1 is a schematic block diagram of the system structure of an exemplary embodiment of the present invention;
Fig. 2 is a schematic block diagram of the structure of the data preprocessing module of an exemplary embodiment of the present invention.
Reference numerals:
Data preprocessing module 1, similar-data connection module 2, data storage module 3, data completion unit 10, data anomaly detection unit 20.
Detailed description
The invention is further described with reference to the following embodiments.
Referring to Fig. 1, this embodiment provides a traffic big data cleaning system based on a cloud computing framework, comprising a data preprocessing module 1, a similar-data connection module 2, and a data storage module 3. The data preprocessing module 1 scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment. The similar-data connection module 2 performs a similarity join on the data processed by the data preprocessing module 1, identifies every pair of data whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module 3 for storage.
In one embodiment, scanning the entire data source and, where data are missing, filling them with the mean of the corresponding dimension of the data from the same road segment specifically comprises:
(1) in the Map function, reading in the historical data and obtaining the values of the data elements; then parsing the location information where the data were generated to obtain the road segment label;
(2) constructing data objects keyed by road segment label and distributing them, the attributes of a data object comprising the key, the sensor data set, the missing-data information, the data elements, and the location information;
(3) in the Reduce function, calculating the mean μ and standard deviation σ of each dimension of the sub-block, selecting the sub-data that fall within the range [range expression not preserved in the source text], calculating their mean, and filling the missing data with it.
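The three steps above can be sketched in plain Python as a single-process stand-in for the Map and Reduce functions. This is only an illustration under stated assumptions: the record layout (`segment`, `values`), the function names, and in particular the accepted band of μ ± k·σ with `K_SIGMA = 2.0` are hypothetical, because the source text does not preserve the range expression.

```python
from collections import defaultdict
from statistics import mean, pstdev

K_SIGMA = 2.0  # ASSUMPTION: band half-width in standard deviations; not given in the source


def map_phase(records):
    """Map: emit (road-segment label, value vector) for each record."""
    for rec in records:
        yield rec["segment"], rec["values"]


def reduce_phase(rows, k=K_SIGMA):
    """Reduce: per dimension, fill None with the mean of the in-band values."""
    dims = len(rows[0])
    fill = []
    for d in range(dims):
        present = [r[d] for r in rows if r[d] is not None]
        if not present:
            fill.append(None)  # nothing observed for this dimension
            continue
        mu, sigma = mean(present), pstdev(present)
        in_band = [v for v in present if abs(v - mu) <= k * sigma]
        fill.append(mean(in_band) if in_band else mu)
    return [[fill[d] if r[d] is None else r[d] for d in range(dims)] for r in rows]


def fill_missing(records):
    """Group by segment (the shuffle step), then reduce each group."""
    groups = defaultdict(list)
    for key, values in map_phase(records):
        groups[key].append(values)
    return {key: reduce_phase(rows) for key, rows in groups.items()}
```

Restricting the fill value to in-band observations keeps a single extreme reading from skewing the value written into the gaps; on a real cluster the grouping would be done by the framework's shuffle rather than by a dictionary.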
Further, the data preprocessing module 1 also performs anomaly detection on the completed data and rejects data that fall outside a set threshold range.
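A minimal sketch of the rejection step follows. The source does not say how the threshold range is defined, so the per-dimension `(low, high)` limits and the function name are assumptions made for illustration.

```python
def reject_out_of_range(rows, limits):
    """Keep only rows in which every dimension lies inside its (low, high) limits."""
    return [
        row
        for row in rows
        if all(low <= value <= high for value, (low, high) in zip(row, limits))
    ]
```

For example, with limits `[(0, 200), (0, 100)]` a row such as `[250.0, 10.0]` would be dropped because its first dimension exceeds 200.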
As shown in Fig. 2, the data preprocessing module 1 comprises a data completion unit 10 and a data anomaly detection unit 20 connected to each other.
The above embodiment of the present invention uses the parallel computing capability of the cloud computing platform to address the computation speed problem in cleaning traffic big data; with data cleaning as the goal, the data undergo completion, anomaly detection, and similarity join processing, which helps to ensure the accuracy and robustness of the subsequent applications that use the data.
In one embodiment, the similar-data connection module 2 performs the similarity join on the data processed by the data preprocessing module 1 by executing the following steps:
(1) randomly extracting a segment of data and constructing a time series from it in order of data acquisition time;
(2) selecting multiple reference points from the time series, building a distance-tree-based data index structure for the data in the time series from the selected reference points, and using the data index structure to generate a MapReduce data partition scheme;
(3) taking the reference point set, the data index structure, and the MapReduce data partition scheme information as global variables, and using MapReduce tasks to compute exactly which data are similar, obtaining all data pairs in the time series whose similarity value exceeds the set threshold.
In one embodiment, selecting multiple reference points from the time series specifically comprises:
(1) randomly selecting one datum from the time series, finding the datum in the time series farthest from it, denoting it v1, and setting v1 as the first reference point;
(2) finding the datum farthest from v1, denoting it v2, and setting v2 as the second reference point;
(3) using the reference points already selected, calculating a distance-difference weight for each datum xi not yet selected as a reference point, and selecting the datum with the smallest distance-difference weight as the next reference point;
(4) repeating step (3) until the set number of reference points has been selected, and adding them to the reference point sequence set.
The distance-difference weight is calculated as:
[formula not preserved in the source text]
where the left-hand side of the formula denotes the distance-difference weight of a datum xi that has not been selected as a reference point; S(v1, v2) denotes the distance between reference points v1 and v2; S(vk, xi) denotes the distance between datum xi and an already selected reference point vk; and Ω denotes the set of reference points already selected.
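The selection procedure can be sketched as follows. The weight formula itself is not preserved in the source, so the form used here, the sum over already selected reference points vk of |S(v1, v2) - S(vk, xi)|, is an assumption chosen to be consistent with the symbol definitions given; the function name, the `start` parameter, and the use of Euclidean distance are likewise illustrative.

```python
import math
import random


def select_reference_points(series, count, dist=math.dist, start=None):
    """Return the indices of `count` reference points chosen from `series`."""
    if start is None:
        start = random.randrange(len(series))  # step (1): random starting datum
    indices = range(len(series))

    def farthest_from(i):
        return max(indices, key=lambda j: dist(series[i], series[j]))

    v1 = farthest_from(start)  # first reference point
    v2 = farthest_from(v1)     # second reference point
    chosen = [v1, v2]
    edge = dist(series[v1], series[v2])  # S(v1, v2)

    def weight(i):
        # ASSUMED form: sum over chosen v_k of |S(v1, v2) - S(v_k, x_i)|
        return sum(abs(edge - dist(series[k], series[i])) for k in chosen)

    while len(chosen) < count:
        rest = [i for i in indices if i not in chosen]
        if not rest:
            break
        chosen.append(min(rest, key=weight))  # step (3): smallest weight wins
    return chosen[:count]
```

Under this assumed weight, a candidate scores low when its distance to every chosen reference point is close to S(v1, v2), so the next reference point tends to lie about as far from the existing ones as they lie from each other.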
The choice of reference points affects the performance of the similarity analysis, and a good reference point set divides the time series more appropriately. This embodiment defines a selection mechanism for reference points under which outliers of the time series have a greater probability of becoming reference points and the selected reference points lie far apart from one another. The resulting reference point set divides the time series better, which helps to optimize the performance of the data similarity analysis.
In one embodiment, building the distance-tree-based data index structure for the data in the time series specifically comprises:
(1) initializing the distance tree and building its root node, taking the first reference point v1 in the reference point sequence set as the reference point corresponding to the root node, and setting the root node's level to 0, its position Ω = 0, and its count m = 0;
(2) dividing the root node using the first reference point to generate multiple leaf nodes as children of the root node, where each leaf node has three attributes: its level, the number of data it holds, and its location information. The location information expresses the distance between the leaf node and the reference point corresponding to its parent node as a multiple of the set distance threshold ε;
Data are then inserted one at a time to build the distance tree; inserting a datum means distributing it to the corresponding leaf node. A datum distributed to leaf node α satisfies the condition that its distance from the reference point corresponding to the parent node of α lies in the interval [(Ωα − 1) × ε, Ωα × ε), where Ωα is the location information of leaf node α, and the amount of data stored in leaf node α is less than the set maximum capacity;
If the data set to be distributed is {x1, x2, ..., xn}, the number of intervals is set as:
[formula not preserved in the source text]
where S(xa, vr) is the distance between datum xa and the reference point vr corresponding to the parent node;
(3) if the amount of data stored in a leaf node reaches the set maximum capacity, choosing a new reference point from the reference point sequence set to divide that leaf node, generating the corresponding child nodes, and distributing the data in the leaf node to its child nodes; this process is repeated until the number of data in every leaf node or child node is less than the set maximum capacity.
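The construction above can be sketched as a small in-memory tree. Several details are assumptions rather than statements from the source: a node at level L is split with the L-th reference point of the sequence, a leaf is allowed to overflow once the reference points run out, and the names `Node`, `position_of`, `insert`, `build_tree`, and `leaves` are hypothetical.

```python
import math


class Node:
    def __init__(self, level, position):
        self.level = level        # depth in the tree; the root is level 0
        self.position = position  # interval number relative to the parent's reference point
        self.data = []            # records held while the node is a leaf
        self.children = {}        # position -> child node, once the node is split


def position_of(x, ref, eps, dist):
    # A distance in [(p - 1) * eps, p * eps) maps to position p.
    return math.floor(dist(x, ref) / eps) + 1


def _child(node, x, refs, eps, dist):
    p = position_of(x, refs[node.level], eps, dist)
    if p not in node.children:
        node.children[p] = Node(node.level + 1, p)
    return node.children[p]


def insert(node, x, refs, eps, capacity, dist):
    """Route x down the tree, splitting any full leaf that still has a reference point."""
    while (node.level == 0 or node.children
           or (len(node.data) >= capacity and node.level < len(refs))):
        if not node.children:  # split: push stored records one level down
            pending, node.data = node.data, []
            for y in pending:
                _child(node, y, refs, eps, dist).data.append(y)
        node = _child(node, x, refs, eps, dist)
    node.data.append(x)


def build_tree(data, refs, eps, capacity, dist=math.dist):
    root = Node(0, 0)
    for x in data:
        insert(root, x, refs, eps, capacity, dist)
    return root


def leaves(node, path=()):
    """List (position sequence from the root, records) for every leaf."""
    if not node.children:
        return [(path, node.data)]
    out = []
    for p in sorted(node.children):
        out.extend(leaves(node.children[p], path + (p,)))
    return out
```

The position sequence returned for each leaf is the information that a pruning test between two leaves would compare level by level.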
In the related art, a disk-based similarity join algorithm or a brute-force algorithm is used to perform the similarity join. Disk-based similarity join algorithms lack effectiveness and scalability for in-memory join computation. The brute-force algorithm compares every pair of records in the data set, so its computation cost grows quadratically with the number of records, which makes it infeasible for real data.
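For reference, the brute-force join described above can be written in a few lines. The sketch treats two records as similar when their distance is at most `eps`, which is an assumed reading of the similarity threshold; the function name is illustrative.

```python
from itertools import combinations


def brute_force_join(records, eps, dist):
    """Return index pairs (i, j), i < j, whose distance is at most eps."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if dist(a, b) <= eps
    ]
```

For n records this evaluates n(n - 1)/2 distances, which is why it is useful mainly as a correctness baseline on small samples.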
Research over the past two decades has shown experimentally that applying pruning strategies during a similarity join is a feasible approach. This embodiment uses the distance-tree-based data index structure to perform the similarity join, which prunes unnecessary data comparisons, reduces the redundant cost of similarity computation, and lowers the data computation cost of the traffic big data cleaning system. Setting the number of intervals from the maximum distance to the reference point helps to build a reasonable distance tree and lays a foundation for the subsequent data partitioning.
In one embodiment, using the data index structure to generate the MapReduce data partition scheme specifically comprises:
(1) creating a graph H(G, Z) from the distance-tree-based data index structure, where the vertex set G consists of all leaf nodes of the distance tree and the edge set Z consists of the node pairs that cannot be pruned by the pruning principle; each vertex also has an edge connecting it to itself. The weight q(d) of a vertex d is the amount of data in the corresponding leaf node, and the weight q(v) of an edge v equals the product of the weights of its two vertices;
The pruning principle is as follows:
Given two leaf nodes α1 and α2 at levels L1 and L2, assume L1 ≥ L2, and let the position sequences of the nodes passed from the root node to α1 and α2 be {w1, w2, ..., wL1} and {ζ1, ζ2, ..., ζL2} respectively. If, for some t ≤ L2, wt + 2 < ζt or wt > ζt + 2, then the distance between any datum in α1 and any datum in α2 is greater than ε;
(2) dividing H(G, Z) into two subgraphs H(G, Z)1 and H(G, Z)2 that satisfy the following balance condition:
[formula not preserved in the source text]
where θ is the set balance threshold;
(3) adding the subgraphs H(G, Z)1 and H(G, Z)2 to a priority queue in which the subgraphs are arranged in descending order of cost;
The cost of a subgraph H(G, Z)i is calculated as:
[formula not preserved in the source text]
(4) in the next iteration, selecting the subgraph at the front of the priority queue, randomly dividing it into two subgraphs with the same number of vertices, and adding the resulting subgraphs to the priority queue; this process is repeated until the cost of the subgraph at the front of the priority queue is less than the set cost threshold, at which point the final partition scheme is output as the MapReduce data partition scheme.
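The pruning test and the construction of the weighted graph in step (1) can be sketched as below. The sketch reads the pruning principle as "positions differ by more than 2 at some shared level"; the leaf identifiers, the `(position sequence, record count)` layout, and the function names are assumptions for illustration.

```python
from itertools import combinations_with_replacement


def can_prune(path_a, path_b):
    """True if, at some level both paths share, the positions differ by more than 2."""
    return any(w + 2 < z or w > z + 2 for w, z in zip(path_a, path_b))


def build_graph(leaves):
    """leaves: {leaf_id: (position_sequence, record_count)}.

    Returns (vertex weights, edge weights). Every vertex keeps a self-edge, and
    an edge weight is the product of the weights of its two vertices.
    """
    q = {leaf: count for leaf, (_, count) in leaves.items()}
    edges = {}
    for a, b in combinations_with_replacement(sorted(leaves), 2):
        if a == b or not can_prune(leaves[a][0], leaves[b][0]):
            edges[(a, b)] = q[a] * q[b]
    return q, edges
```

An edge weight therefore approximates the number of record comparisons that the pair of leaves still requires after pruning.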
Using graph partitioning, this embodiment places similar data in the same partition as far as possible and minimizes data exchange and the number of copies between partitions. Requiring the balance condition and the cost condition to be met during partitioning helps to minimize the data transmission cost and redundancy of the traffic big data cleaning system in the Reduce tasks while ensuring load balancing.
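The priority-queue loop of the partitioning procedure can be sketched as follows. Neither the cost formula nor the balance condition is preserved in the source, so this sketch makes two loud assumptions: the cost of a subgraph is taken to be the sum of the weights of the edges lying entirely inside it, and balance is approximated only by splitting into halves of equal vertex count. Edges that cross partitions, and the record replication they imply, are not handled here. All names are hypothetical.

```python
import heapq
import random


def cost(vertices, edges):
    # ASSUMED cost: total weight of edges whose endpoints both lie in `vertices`.
    vs = set(vertices)
    return sum(w for (a, b), w in edges.items() if a in vs and b in vs)


def partition(vertices, edges, max_cost, seed=0):
    """Split the most expensive subgraph repeatedly until the top cost is below max_cost."""
    rng = random.Random(seed)
    heap = [(-cost(vertices, edges), sorted(vertices))]  # max-heap via negated cost
    done = []  # single-vertex parts that cannot be split further
    while heap:
        neg, part = heapq.heappop(heap)
        if -neg < max_cost:
            # The most expensive remaining part is cheap enough, so all are.
            return sorted(done + [part] + [p for _, p in heap])
        if len(part) == 1:
            done.append(part)
            continue
        rng.shuffle(part)  # random split into two halves
        half = len(part) // 2
        for piece in (sorted(part[:half]), sorted(part[half:])):
            heapq.heappush(heap, (-cost(piece, edges), piece))
    return sorted(done)
```

A part that consists of a single heavy leaf can still exceed the threshold; the sketch simply keeps it as its own partition.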
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit its scope of protection. Although the present invention has been described in detail with reference to preferred embodiments, a person of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its substance and scope.

Claims (6)

1. A traffic big data cleaning system based on a cloud computing framework, characterized by comprising a data preprocessing module, a similar-data connection module, and a data storage module, wherein the data preprocessing module scans the entire data source and, where data are missing, fills them with the mean of the corresponding dimension of the data from the same road segment; and the similar-data connection module performs a similarity join on the data processed by the data preprocessing module, identifies every pair of data whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module for storage.
2. The traffic big data cleaning system based on a cloud computing framework according to claim 1, characterized in that scanning the entire data source and, where data are missing, filling them with the mean of the corresponding dimension of the data from the same road segment specifically comprises:
(1) in the Map function, reading in the historical data and obtaining the values of the data elements; then parsing the location information where the data were generated to obtain the road segment label;
(2) constructing data objects keyed by road segment label and distributing them, the attributes of a data object comprising the key, the sensor data set, the missing-data information, the data elements, and the location information;
(3) in the Reduce function, calculating the mean μ and standard deviation σ of each dimension of the sub-block, selecting the sub-data that fall within the range [range expression not preserved in the source text], calculating their mean, and filling the missing data with it.
3. The traffic big data cleaning system based on a cloud computing framework according to claim 1, characterized in that the data preprocessing module also performs anomaly detection on the completed data and rejects data that fall outside a set threshold range.
4. The traffic big data cleaning system based on a cloud computing framework according to any one of claims 1 to 3, characterized in that the similar-data connection module performs the similarity join on the data processed by the data preprocessing module by executing the following steps:
(1) randomly extracting a segment of data and constructing a time series from it in order of data acquisition time;
(2) selecting multiple reference points from the time series, building a distance-tree-based data index structure for the data in the time series from the selected reference points, and using the data index structure to generate a MapReduce data partition scheme;
(3) taking the reference point set, the data index structure, and the MapReduce data partition scheme information as global variables, and using MapReduce tasks to compute exactly which data are similar, obtaining all data pairs in the time series whose similarity value exceeds the set threshold.
5. The traffic big data cleaning system based on a cloud computing framework according to claim 4, characterized in that selecting multiple reference points from the time series specifically comprises:
(1) randomly selecting one datum from the time series, finding the datum in the time series farthest from it, denoting it v1, and setting v1 as the first reference point;
(2) finding the datum farthest from v1, denoting it v2, and setting v2 as the second reference point;
(3) using the reference points already selected, calculating a distance-difference weight for each datum xi not yet selected as a reference point, and selecting the datum with the smallest distance-difference weight as the next reference point;
(4) repeating step (3) until the set number of reference points has been selected, and adding them to the reference point sequence set.
6. The traffic big data cleaning system based on a cloud computing framework according to claim 5, characterized in that the distance-difference weight is calculated as:
[formula not preserved in the source text]
where the left-hand side of the formula denotes the distance-difference weight of a datum xi that has not been selected as a reference point; S(v1, v2) denotes the distance between reference points v1 and v2; S(vk, xi) denotes the distance between datum xi and an already selected reference point vk; and Ω denotes the set of reference points already selected.
CN201810275731.5A 2018-03-30 2018-03-30 A kind of traffic big data cleaning system based on cloud computing framework Withdrawn CN108509594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810275731.5A CN108509594A (en) 2018-03-30 2018-03-30 A kind of traffic big data cleaning system based on cloud computing framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810275731.5A CN108509594A (en) 2018-03-30 2018-03-30 A kind of traffic big data cleaning system based on cloud computing framework

Publications (1)

Publication Number Publication Date
CN108509594A true CN108509594A (en) 2018-09-07

Family

ID=63379261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810275731.5A Withdrawn CN108509594A (en) 2018-03-30 2018-03-30 A kind of traffic big data cleaning system based on cloud computing framework

Country Status (1)

Country Link
CN (1) CN108509594A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614392A (en) * 2018-10-25 2019-04-12 珠海派诺科技股份有限公司 Interrupt historical data self-repairing method, device, electronic equipment and medium
CN109614392B (en) * 2018-10-25 2023-08-08 珠海派诺科技股份有限公司 Automatic interrupt history data restoration method and device, electronic equipment and medium
CN110096497A (en) * 2019-03-28 2019-08-06 中国农业科学院农业信息研究所 A kind of agricultural output data intelligence cleaning method and system
CN112487495A (en) * 2020-12-01 2021-03-12 李孔雀 Data processing method based on big data and cloud computing and big data server
CN112487495B (en) * 2020-12-01 2021-07-02 厦门立马耀网络科技有限公司 Data processing method based on big data and cloud computing and big data server


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180907