CN108509594A

CN108509594A - A traffic big data cleaning system based on a cloud computing framework

Info

Publication number: CN108509594A
Application number: CN201810275731.5A
Authority: CN
Inventors: 钟建明
Original assignee: Shenzhen Huitong Intelligent Technology Co Ltd
Current assignee: Shenzhen Huitong Intelligent Technology Co Ltd
Priority date: 2018-03-30
Filing date: 2018-03-30
Publication date: 2018-09-07

Abstract

The present invention provides a traffic big data cleaning system based on a cloud computing framework, comprising a data preprocessing module, a similar data connection module, and a data storage module. The data preprocessing module scans the entire data source and, if missing data exist, fills them with the mean of the corresponding dimension of the data for the same road section. The similar data connection module performs a similarity join on the data processed by the data preprocessing module, identifies every pair of data items whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module for storage.

Description

A traffic big data cleaning system based on a cloud computing framework

Technical field

The present invention relates to the field of intelligent transportation, and in particular to a traffic big data cleaning system based on a cloud computing framework.

Background art

Intelligent transportation systems help improve traffic flow, and in cities troubled by congestion the problem is attracting increasing attention. Traffic sensors are an important source of data for intelligent transportation systems. However, owing to factors such as equipment accuracy, equipment failure, and the acquisition environment, abnormal or erroneous data are often collected. This reduces the accuracy of most applications of an intelligent transportation system (such as traffic state estimation and traffic state prediction). Traffic data therefore need to be cleaned: missing data filled in, erroneous data rejected, and abnormal data corrected. Traffic sensors in current use are of many types and sample at high frequency, so a system faced with massive traffic data must take processing efficiency into account during cleaning.

Summary of the invention

In view of the above problems, the present invention provides a traffic big data cleaning system based on a cloud computing framework.

The object of the present invention is achieved by the following technical solution:

A traffic big data cleaning system based on a cloud computing framework is provided, comprising a data preprocessing module, a similar data connection module, and a data storage module. The data preprocessing module scans the entire data source and, if missing data exist, fills them with the mean of the corresponding dimension of the data for the same road section. The similar data connection module performs a similarity join on the data processed by the data preprocessing module, identifies every pair of data items whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module for storage.

Preferably, scanning the entire data source and, if missing data exist, filling them with the mean of the corresponding dimension of the data for the same road section specifically comprises:

(1) In the Map function, read in the historical data and obtain the values of the data elements; then parse the location information where the data were generated to obtain the road section label;

(2) Construct and distribute data objects keyed by road section label, the attributes of a data object including the key, the sensor data set, missing-data information, data elements, and location information;

(3) In the Reduce function, calculate the mean μ and standard deviation σ of each dimension of the sub-block, select the sub-data that fall within the range determined by μ and σ, compute their mean, and fill the missing data with it.
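
The Map and Reduce steps above can be sketched in plain Python as follows. This is a minimal single-process illustration, not the patented implementation: the field names `section` and `values`, the use of `None` to mark a missing value, and the selection range [μ − kσ, μ + kσ] with k = 1 are all assumptions, since the source text does not state the exact range.

```python
from collections import defaultdict
from statistics import mean, pstdev


def map_record(record):
    """Map step: emit (road-section label, sensor values) for one record.

    `section` and `values` are hypothetical field names; None marks a gap.
    """
    return record["section"], record["values"]


def reduce_section(rows, k=1.0):
    """Reduce step: fill the gaps of one road section, dimension by dimension."""
    filled = [list(row) for row in rows]
    for d in range(len(rows[0])):
        present = [row[d] for row in rows if row[d] is not None]
        if not present:
            continue  # nothing to estimate from, so the gaps stay
        mu, sigma = mean(present), pstdev(present)
        # ASSUMPTION: keep only values within [mu - k*sigma, mu + k*sigma].
        core = [v for v in present if abs(v - mu) <= k * sigma]
        fill = mean(core) if core else mu
        for row in filled:
            if row[d] is None:
                row[d] = fill
    return filled


def fill_missing(records, k=1.0):
    """Group records by road section (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        key, values = map_record(record)
        groups[key].append(values)
    return {key: reduce_section(rows, k) for key, rows in groups.items()}
```

On a real cluster the grouping done by `fill_missing` would be performed by the MapReduce shuffle phase; restricting the mean to values near μ keeps a single extreme reading from distorting the fill value.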

Further, the data preprocessing module is also configured to perform anomaly detection on the completed data and to reject data that fall outside the set threshold range.

The beneficial effects of the present invention are as follows. The parallel computing capability of a cloud computing platform is used to address the computation speed problem in cleaning traffic big data. With data cleaning as the goal, the data undergo completion, anomaly detection, and similarity join processing, which helps ensure the accuracy and robustness of the subsequent applications that use the data.

Description of the drawings

The invention is further described with reference to the accompanying drawings. The embodiments in the drawings do not limit the present invention in any way, and those of ordinary skill in the art can obtain other drawings from the following drawings without creative effort.

Fig. 1 is a schematic block diagram of the system structure of an illustrative embodiment of the present invention;

Fig. 2 is a schematic block diagram of the structure of the data preprocessing module of an illustrative embodiment of the present invention.

Reference numerals:

Data preprocessing module 1, similar data connection module 2, data storage module 3, data completion unit 10, data anomaly detection unit 20.

Detailed description

The invention is further described with reference to the following embodiments.

Referring to Fig. 1, the traffic big data cleaning system based on a cloud computing framework provided in this embodiment comprises a data preprocessing module 1, a similar data connection module 2, and a data storage module 3. The data preprocessing module 1 scans the entire data source and, if missing data exist, fills them with the mean of the corresponding dimension of the data for the same road section. The similar data connection module 2 performs a similarity join on the data processed by the data preprocessing module 1, identifies every pair of data items whose similarity value exceeds a set threshold as a similar data pair, and sends the similar data pairs found to the data storage module 3 for storage.

In one embodiment, scanning the entire data source and, if missing data exist, filling them with the mean of the corresponding dimension of the data for the same road section follows steps (1) to (3) of the Map and Reduce procedure described above.

Further, the data preprocessing module 1 is also configured to perform anomaly detection on the completed data and to reject data that fall outside the set threshold range.
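
The rejection step can be illustrated with a short sketch. The source does not specify how the threshold range is represented, so the per-dimension `(low, high)` pairs and the function name below are assumptions made for illustration only.

```python
def reject_out_of_range(rows, limits):
    """Keep only the rows whose every dimension lies inside its [low, high] range.

    `limits` holds one hypothetical (low, high) pair per dimension.
    """
    kept = []
    for row in rows:
        if all(low <= value <= high for value, (low, high) in zip(row, limits)):
            kept.append(row)
    return kept
```

For example, with limits of 0 to 120 for a speed dimension and 0 to 1 for an occupancy dimension, a row reporting a speed of 250 would be dropped while the remaining rows pass through unchanged.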

As shown in Fig. 2, the data preprocessing module 1 comprises a data completion unit 10 and a data anomaly detection unit 20 connected to each other.

The above embodiment of the present invention uses the parallel computing capability of a cloud computing platform to address the computation speed problem in cleaning traffic big data. With data cleaning as the goal, the data undergo completion, anomaly detection, and similarity join processing, which helps ensure the accuracy and robustness of the subsequent applications that use the data.

In one embodiment, the similar data connection module 2 performs the similarity join on the data processed by the data preprocessing module 1 by specifically executing:

(1) Randomly extract a segment of data and construct a time series according to the acquisition time order of the data;

(2) Select multiple reference points from the time series, build a distance-tree-based data index structure for the data in the time series based on the selected reference points, and use the data index structure to generate a MapReduce data partitioning scheme;

(3) Taking the reference point set, the data index structure, and the MapReduce data partitioning scheme information as global variables, use MapReduce tasks to compute exactly the data among which similarity exists, obtaining all data pairs in the time series whose similarity value exceeds the set threshold.

In one embodiment, selecting multiple reference points from the time series specifically comprises:

(1) Randomly select one data item from the time series, find the data item in the time series farthest from it, denote it v₁, and set v₁ as the first reference point;

(2) Find the data item farthest from v₁, denote it v₂, and set v₂ as the second reference point;

(3) Using the reference points already selected, calculate a distance-difference weight for each data item x_i not yet selected as a reference point, and select the data item with the minimum distance-difference weight as the next reference point;

(4) Repeat (3) until the set number of reference points has been selected, and add them to the reference point sequence set.

Wherein the distance-difference weight is calculated by the following formula:

In the formula, the left-hand side denotes the distance-difference weight of a data item x_i not yet selected as a reference point, S(v₁, v₂) denotes the distance between reference points v₁ and v₂, S(v_k, x_i) denotes the distance between data item x_i and an already selected reference point v_k, and Ω denotes the set of reference points already selected.
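
Steps (1) to (4) can be sketched as below. The weight formula itself is missing from this text, so the sketch assumes one plausible form built only from the quantities defined above: the sum over the selected reference points v_k in Ω of |S(v₁, v₂) − S(v_k, x_i)|. That form, the function name, and the use of Euclidean distance via `math.dist` are assumptions and may differ from the patented formula.

```python
import math
import random


def select_reference_points(data, count, dist=math.dist, rng=random):
    """Pick `count` reference points following steps (1) to (4)."""
    seed = rng.choice(data)
    v1 = max(data, key=lambda x: dist(seed, x))  # step (1): farthest from a random item
    v2 = max(data, key=lambda x: dist(v1, x))    # step (2): farthest from v1
    refs = [v1, v2]
    edge = dist(v1, v2)                          # S(v1, v2)
    while len(refs) < count:                     # steps (3) and (4)
        candidates = [x for x in data if x not in refs]
        if not candidates:
            break
        # ASSUMED weight: sum over selected v_k of |S(v1, v2) - S(v_k, x_i)|.
        refs.append(min(
            candidates,
            key=lambda x: sum(abs(edge - dist(v, x)) for v in refs),
        ))
    return refs[:count]


points = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0), (5.0, 4.0)]
reference_points = select_reference_points(points, 3)
```

Under this assumed weight, a candidate scores low when its distance to every selected reference point is close to S(v₁, v₂), which favors items near the boundary of the data and far from the points already chosen.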

The choice of reference points affects the performance of data similarity analysis: a good reference point set divides the time series more appropriately. The selection mechanism set in this embodiment gives outliers of the time series a greater probability of becoming reference points and keeps the selected reference points far apart from one another, so that the selected reference point set divides the time series better, which helps optimize the performance of data similarity analysis.

In one embodiment, building the distance-tree-based data index structure for the data in the time series specifically comprises:

(1) Initialize the distance tree and build its root node: take the first reference point v₁ in the reference point sequence set as the reference point corresponding to the root node, and set the root node's level to 0, its position Ω = 0, and its number m = 0;

(2) Divide the root node using the first reference point, generating multiple leaf nodes as children of the root node. Each leaf node has three attributes: the level it belongs to, the number of data items it contains, and position information, where the position information expresses the distance from the leaf node to the reference point of its parent node as a multiple of the set distance threshold ε;

Data are then inserted: the distance tree is built by inserting data one at a time, and inserting a data item amounts to distributing it to the corresponding leaf node. A data item distributed to leaf node α must satisfy that its distance to the reference point of α's parent node lies in the interval [(Ω_α − 1) × ε, Ω_α × ε), where Ω_α is the position information of leaf node α, and that the amount of data stored in α is below the set maximum capacity;

Wherein, if the data set to be distributed is {x₁, x₂, ..., x_n}, the number of intervals is set as:

In the formula, S(x_a, v_r) is the distance between data item x_a and the reference point v_r corresponding to the parent node;

(3) If the amount of data stored in a leaf node reaches the set maximum capacity, select a new reference point from the reference point sequence set to divide that leaf node, generate the corresponding child nodes, and distribute the excess data in the leaf node to its child nodes. Repeat this process until every leaf node or child node contains fewer data items than the set maximum capacity.
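
The interval rule in step (2) can be illustrated with a small sketch. The position of a data item follows directly from the interval [(Ω_α − 1) × ε, Ω_α × ε). The interval-count formula is missing from this text, so `interval_count` assumes the count is just large enough to cover the largest distance to the reference point. All function names are hypothetical, and capacity-driven splitting (step (3)) is left out.

```python
import math


def leaf_position(distance, eps):
    """Return the position P such that distance lies in [(P - 1) * eps, P * eps)."""
    return math.floor(distance / eps) + 1


def distribute(data, ref, eps, dist):
    """Group data items by leaf position relative to the reference point `ref`."""
    leaves = {}
    for x in data:
        leaves.setdefault(leaf_position(dist(x, ref), eps), []).append(x)
    return leaves


def interval_count(data, ref, eps, dist):
    """ASSUMPTION: use just enough intervals to cover the largest distance."""
    return leaf_position(max(dist(x, ref) for x in data), eps)
```

With ε = 2.0 and a reference point at 10.0, the one-dimensional values 10.0 and 11.5 fall into position 1, 12.0 into position 2, and 15.0 into position 3. Distances that sit exactly on an interval boundary are sensitive to floating-point rounding, so a production version would need a tolerance or integer arithmetic.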

In the related art, data are joined by similarity using either a disk-based similarity join algorithm or a brute-force algorithm. Disk-based similarity join algorithms lack effectiveness and scalability for in-memory join computation. A brute-force algorithm compares every pair of data records in the data set, so its computational cost grows quadratically with the number of records, which makes it infeasible for real data.

Research over the past twenty years or so has shown experimentally that applying pruning strategies during a similarity join is a feasible approach. This embodiment performs the similarity join using the distance-tree-based data index structure, which prunes unnecessary data comparisons, reduces the redundant cost of similarity computation, and saves the data computation cost of the traffic big data cleaning system. Setting the number of intervals by a formula based on the maximum distance to the reference point helps build a reasonable distance tree and thus lays the foundation for the subsequent data partitioning.

In one embodiment, generating the MapReduce data partitioning scheme from the data index structure specifically comprises:

(1) Create a graph H(G, Z) from the distance-tree-based data index structure. The vertex set G consists of all leaf nodes of the distance tree, and the edge set Z consists of the node pairs that cannot be pruned by the pruning principle; each vertex also has an edge connecting it to itself. The weight q(d) of a vertex d is the amount of data in the corresponding leaf node, and the weight q(v) of an edge v equals the product of the weights of its two vertices;

Wherein the pruning principle is:

Given two leaf nodes α₁ and α₂ at levels L1 and L2, assume L1 ≥ L2, and let the position sequences of the nodes passed on the way from the root node to α₁ and α₂ be {w₁, w₂, ..., w_L1} and {ζ₁, ζ₂, ..., ζ_L2} respectively. If, for any t ≤ L2, w_t + 2 < ζ_t or w_t > ζ_t + 2, then the distance between any data item in α₁ and any data item in α₂ is greater than ε;

(2) Divide H(G, Z) into two subgraphs H(G, Z)₁ and H(G, Z)₂ that satisfy the following balance condition:

In the formula, θ is the set balance threshold;

(3) Add the subgraphs H(G, Z)₁ and H(G, Z)₂ to a priority queue, in which subgraphs are arranged in descending order of cost;

Wherein the cost of subgraph H(G, Z)_i is calculated as:

(4) In the next iteration, take the subgraph at the front of the priority queue, randomly divide it into two subgraphs with the same number of vertices, and add the resulting subgraphs to the priority queue. Repeat this process until the cost of the subgraph at the front of the priority queue is below the set cost threshold, then output the final partitioning scheme as the MapReduce data partitioning scheme.
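
Step (1) and the pruning principle can be sketched as follows. The sketch reads the pruning condition as "at some level t ≤ L2 the two positions differ by more than 2", which is an interpretation of the wording above. The input layout, a dict mapping a leaf id to its position sequence and data count, is a hypothetical choice made for illustration.

```python
from itertools import combinations_with_replacement


def can_prune(path_a, path_b):
    """True when some shared level has positions more than 2 apart.

    zip() stops at the shallower leaf, which covers the levels t <= L2.
    ASSUMPTION: one such level is taken as sufficient for pruning.
    """
    return any(w + 2 < z or w > z + 2 for w, z in zip(path_a, path_b))


def build_graph(leaves):
    """Build vertex and edge weights from {leaf_id: (position_path, data_count)}."""
    q = {leaf: count for leaf, (_, count) in leaves.items()}
    edges = {}
    for a, b in combinations_with_replacement(sorted(leaves), 2):
        # Every vertex keeps an edge to itself; other pairs survive only
        # when the pruning principle cannot rule them out.
        if a == b or not can_prune(leaves[a][0], leaves[b][0]):
            edges[(a, b)] = q[a] * q[b]
    return q, edges
```

Each edge weight q[a] * q[b] is the number of item-to-item comparisons that the pair of leaves could require, so the graph records where the remaining join work lies after pruning.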

By partitioning the graph, this embodiment places similar data in the same partition as far as possible and minimizes data exchange and replication between partitions. Requiring the partitioning to satisfy the balance condition and the cost condition helps minimize the data transmission cost and redundancy of the traffic big data cleaning system in the Reduce tasks while ensuring load balancing.
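
The priority-queue loop of steps (3) and (4) can be sketched as below. The cost and balance formulas are missing from this text, so the sketch assumes that the cost of a subgraph is the sum of the weights of the edges lying entirely inside it, and it omits the balance check of step (2). The function names and the edge layout (a dict keyed by vertex pairs) are hypothetical.

```python
import heapq
import itertools
import random


def subgraph_cost(vertices, edges):
    """ASSUMED cost: total weight of the edges with both endpoints in the subgraph."""
    inside = set(vertices)
    return sum(w for (a, b), w in edges.items() if a in inside and b in inside)


def partition(vertices, edges, cost_limit, rng=random):
    """Split the costliest subgraph at random until the front one is under the limit."""
    tie = itertools.count()  # tie-breaker so the heap never compares lists
    heap = [(-subgraph_cost(vertices, edges), next(tie), list(vertices))]
    final = []
    while heap:
        neg_cost, _, top = heapq.heappop(heap)
        if -neg_cost < cost_limit:
            # The front subgraph is cheap enough, so every other one is too.
            final.append(top)
            final.extend(part for _, _, part in heap)
            break
        if len(top) < 2:
            final.append(top)  # a single vertex cannot be split further
            continue
        rng.shuffle(top)
        half = len(top) // 2  # halves differ by one vertex when the count is odd
        for part in (top[:half], top[half:]):
            heapq.heappush(heap, (-subgraph_cost(part, edges), next(tie), part))
    return final
```

`heapq` is a min-heap, so costs are stored negated to keep the most expensive subgraph at the front, matching the descending order described in step (3).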

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit its scope of protection. Although the present invention has been explained in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from their substance and scope.

Claims

1. A traffic big data cleaning system based on a cloud computing framework, characterized by comprising a data preprocessing module, a similar data connection module, and a data storage module, wherein the data preprocessing module is configured to scan the entire data source and, if missing data exist, fill them with the mean of the corresponding dimension of the data for the same road section; and the similar data connection module is configured to perform a similarity join on the data processed by the data preprocessing module, identify every pair of data items whose similarity value exceeds a set threshold as a similar data pair, and send the similar data pairs found to the data storage module for storage.

2. The traffic big data cleaning system based on a cloud computing framework according to claim 1, characterized in that scanning the entire data source and, if missing data exist, filling them with the mean of the corresponding dimension of the data for the same road section specifically comprises:

(1) in the Map function, reading in the historical data and obtaining the values of the data elements, then parsing the location information where the data were generated to obtain the road section label;

(2) constructing and distributing data objects keyed by road section label, the attributes of a data object including the key, the sensor data set, missing-data information, data elements, and location information;

(3) in the Reduce function, calculating the mean μ and standard deviation σ of each dimension of the sub-block, selecting the sub-data that fall within the range determined by μ and σ, computing their mean, and filling the missing data with it.

3. The traffic big data cleaning system based on a cloud computing framework according to claim 1, characterized in that the data preprocessing module is further configured to perform anomaly detection on the completed data and to reject data that fall outside the set threshold range.

4. The traffic big data cleaning system based on a cloud computing framework according to any one of claims 1 to 3, characterized in that the similar data connection module performs the similarity join on the data processed by the data preprocessing module by specifically executing:

(1) randomly extracting a segment of data and constructing a time series according to the acquisition time order of the data;

(2) selecting multiple reference points from the time series, building a distance-tree-based data index structure for the data in the time series based on the selected reference points, and using the data index structure to generate a MapReduce data partitioning scheme;

(3) taking the reference point set, the data index structure, and the MapReduce data partitioning scheme information as global variables, using MapReduce tasks to compute exactly the data among which similarity exists, and obtaining all data pairs in the time series whose similarity value exceeds the set threshold.

5. The traffic big data cleaning system based on a cloud computing framework according to claim 4, characterized in that selecting multiple reference points from the time series specifically comprises:

(1) randomly selecting one data item from the time series, finding the data item in the time series farthest from it, denoting it v₁, and setting v₁ as the first reference point;

(2) finding the data item farthest from v₁, denoting it v₂, and setting v₂ as the second reference point;

(3) using the reference points already selected, calculating a distance-difference weight for each data item x_i not yet selected as a reference point, and selecting the data item with the minimum distance-difference weight as the next reference point;

(4) repeating (3) until the set number of reference points has been selected, and adding them to the reference point sequence set.

6. The traffic big data cleaning system based on a cloud computing framework according to claim 5, characterized in that the distance-difference weight is calculated by the following formula:

In the formula, the left-hand side denotes the distance-difference weight of a data item x_i not yet selected as a reference point, S(v₁, v₂) denotes the distance between reference points v₁ and v₂, S(v_k, x_i) denotes the distance between data item x_i and an already selected reference point v_k, and Ω denotes the set of reference points already selected.