CN106528865A - Quick and accurate cleaning method of traffic big data - Google Patents

Info

Publication number
CN106528865A
Authority
CN
China
Prior art keywords
data
rfid
time
vehicle
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611094160.2A
Other languages
Chinese (zh)
Inventor
张鹏飞
赵凯
梁婷婷
陶斯琴
侯俊巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Casic Wisdom Industrial Development Co Ltd
Original Assignee
Casic Wisdom Industrial Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Casic Wisdom Industrial Development Co Ltd
Priority to CN201611094160.2A
Publication of CN106528865A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Traffic Control Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a quick and accurate cleaning method for traffic big data and relates to the technical field of traffic data processing. For real-time RFID and snapshot data, Spark Streaming stream processing is adopted: Kafka provides data caching, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data is read from HDFS and is compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves program performance and the accuracy of the cleaning results. The method achieves quick and accurate processing of the RFID, snapshot and other data generated during urban traffic monitoring and management, thereby realizing the processing of traffic data resources and safeguarding the storage and use of traffic big data resources.

Description

A fast and accurate cleaning method for traffic big data
Technical field
The present invention relates to the technical field of traffic data processing, and in particular to a fast and accurate cleaning method for traffic big data.
Background
With the development of urban construction and the rise in people's consumption levels, the automobile has become an indispensable tool in daily life, and processing the huge volume of traffic data it generates has become a problem in urgent need of a solution. To achieve fast, real-time traffic monitoring and predictive analysis, and to support analysis and querying of historical traffic data, traffic data from different sources must be cleaned and filtered, abnormal data must be extracted for manual handling, the processing results must each be stored in an appropriate storage mode, and data access interfaces must be provided so that traffic data can be analysed and queried in real time.
At present, real-time data is cleaned as follows: the RFID vehicle-passing data and snapshot data streams are handed directly, as received, to Spark Streaming, which performs vehicle track cleaning, passing-vehicle flow statistics and anomaly extraction according to the cleaning rules. Offline data is cleaned with the Spark programming model: the RFID vehicle-passing data and snapshot data are joined according to the cleaning rules and the valid fields are extracted, so that vehicle tracks are extracted, the passing-vehicle flow of each collection point is counted, and abnormal data is separated out for manual handling.
This method has the following problems. For real-time cleaning, because the data collected by the RFID devices and snapshot devices is transmitted to Spark Streaming in real time, a Spark Streaming task, once submitted, has to keep waiting until it has received all the data collected in the time period before it can proceed to the next step, which seriously reduces the operating efficiency of the big data platform. For offline processing, because the data volume is huge, joining and matching on keys often causes high memory pressure and slow processing, which degrades program performance.
Summary of the invention
The object of the present invention is to provide a fast and accurate cleaning method for traffic big data, so as to solve the aforementioned problems in the prior art.
To achieve this object, the present invention adopts the following technical solution:
A fast and accurate cleaning method for traffic big data, comprising a processing method for real-time data and a processing method for historical data;
The processing method for real-time data is: for real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules;
The processing method for historical data is: Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and exception-handled according to the data cleaning rules.
Preferably, continuously extracting data from Kafka according to a time window specifically means: RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each acquisition accumulates the data within a set time period.
Preferably, in the processing method for real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically comprises cleaning vehicle tracks, counting passing-vehicle flow and extracting abnormal data;
The vehicle tracks are cleaned according to the following steps:
A1, the two kinds of data records are joined on the four fields common to the RFID vehicle-passing data and the snapshot data: license plate number, time, collection point name and collection direction;
A2, using the comparison function provided by Spark Streaming, the license plate number and time strings are reversed, and the joined RFID vehicle-passing data and snapshot data are filtered according to the comparison rules, yielding the track records of vehicles as they pass the collection points, i.e. the vehicle track cleaning result;
A3, the vehicle track cleaning result is stored in HBase, which is divided into multiple different regions, with the reversed string of the license plate number and time string as the key.
Preferably, the passing-vehicle flow is counted according to the following steps:
B1, the RFID vehicle-passing data received in each time period is converted into key-value pairs keyed by the collection point field;
B2, following the distributed processing principle of Spark Streaming, the data records with the same key are counted, and the counts for each collection point are then summed over the set time interval, yielding the passing-vehicle flow record of each collection point for the corresponding time period;
B3, the passing-vehicle flow of each collection point is stored in an in-memory database.
Preferably, the abnormal data is extracted according to the following steps:
C1, the two kinds of data records are joined on the four fields common to the RFID vehicle-passing data and the snapshot data: license plate number, time, collection point name and collection direction;
C2, the RFID vehicle-passing data and the snapshot data are each filtered according to the decision rules for abnormal data, and the abnormal data is extracted;
C3, the abnormal data is stored in a relational database.
Preferably, in the processing method for historical data, comparing, counting and exception-handling the data according to the data cleaning rules specifically comprises cleaning vehicle tracks, counting passing-vehicle flow and extracting abnormal data;
The vehicle tracks are cleaned according to the following steps:
D1, the RFID vehicle-passing data and the video snapshot data are joined on four fields: license plate number, time, collection point name and direction;
D2, the license plate number and time strings are reversed, and the data is filtered using the plate colour and passing-time fields to obtain the vehicle track data;
D3, the vehicle track data is stored in HBase, with the reversed string of the license plate number and time string as the key.
Preferably, in the vehicle track cleaning process, the RFID data, the snapshot data and the device information table are first each wrapped as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field; the two data RDDs are then each transformed into key-value RDDs, in which the key is the string formed from the fields to be compared, so that the compare-and-join operation is easy to carry out; finally, the RDDs of the two kinds of data are compared and joined by key, and the data is filtered according to rules such as time completeness, plate colour consistency and field completeness to obtain the correct track data.
Preferably, the passing-vehicle flow is counted according to the following steps:
E1, the RFID vehicle-passing data is converted into key-value pairs keyed by the collection point field and the time string accurate to the hour;
E2, following the distributed processing principle of Spark, the data records with the same key are counted, yielding the passing-vehicle flow record of each collection point for the corresponding time period;
E3, the passing-vehicle flow results of each collection point are stored in a relational database.
Preferably, the types of abnormal data include: incomplete data fields, missing data, and inconsistent data information.
Preferably, the abnormal data is extracted according to the following steps:
F1, the RFID vehicle-passing data and the snapshot data are joined on four fields: license plate number, collection point name, collection direction and time;
F2, according to the anomaly types, first determine whether the RFID data is missing; if the RFID data exists, then check whether the colour field in the RFID data and the snapshot are present; if the fields are complete, determine whether the plate colour in the RFID data is consistent with that in the snapshot data; finally, store the extracted abnormal data in a MySQL database and mark its anomaly type.
The beneficial effects of the present invention are as follows. In the fast and accurate cleaning method for traffic big data provided by the embodiments of the present invention, real-time RFID and snapshot data is handled with Spark Streaming stream processing: Kafka provides data caching, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. Offline, batch-accumulated data is handled with Spark in-memory processing: data is read from HDFS and is compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves both program performance and the accuracy of the cleaning results. Vehicle track cleaning, abnormal data handling and passing-vehicle flow statistics are thereby carried out quickly and accurately on the RFID, snapshot and other data produced during urban traffic monitoring and management, which in turn realizes the processing of traffic data resources and safeguards the storage and use of traffic big data resources.
Brief description of the drawings
Fig. 1 is a schematic diagram of the real-time data cleaning process;
Fig. 2 is a schematic diagram of the offline historical data cleaning process;
Fig. 3 is a schematic diagram of the RDD dependencies in the vehicle track cleaning module.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
An embodiment of the present invention provides a fast and accurate cleaning method for traffic big data, comprising a processing method for real-time data and a processing method for historical data;
The processing method for real-time data is: for real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules;
The processing method for historical data is: Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and exception-handled according to the data cleaning rules.
In the above method, real-time cleaning targets the data obtained in real time by the RFID devices and camera devices, whereas historical cleaning targets the accumulated historical RFID vehicle-passing data and snapshot data. The two kinds of data have different characteristics: the former is relatively small in volume but has stricter real-time requirements; the latter is huge in volume and has no real-time requirement, but the cleaning of the massive data must be completed efficiently and accurately.
The method provided by the embodiment of the present invention uses a big data platform customized on the basis of Hadoop to provide distributed data processing and storage. For real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and exception-handled according to the data cleaning rules.
The processing of real-time data is shown in Fig. 1.
Offline data cleaning relies on the Spark programming model, uses Spark on Yarn as the program execution platform, and realizes the fast and accurate traffic big data cleaning process through distributed programming.
The specific process of cleaning offline data with Spark is shown in Fig. 2.
In the above method, real-time RFID and snapshot data is handled with Spark Streaming stream processing: Kafka provides data caching, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. Offline, batch-accumulated data is handled with Spark in-memory processing: data is read from HDFS and is compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves both program performance and the accuracy of the cleaning results. Vehicle track cleaning, abnormal data handling and passing-vehicle flow statistics are thereby carried out quickly and accurately on the RFID, snapshot and other data produced during urban traffic monitoring and management, which in turn realizes the processing of traffic data resources and safeguards the storage and use of traffic big data resources.
In the embodiment of the present invention, continuously extracting data from Kafka according to a time window specifically means: RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each acquisition accumulates the data within a set time period.
In the embodiment of the present invention, the time interval may be 5 minutes and the time window may be 10 minutes.
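The relationship between the 5-minute interval and the 10-minute window can be illustrated with a minimal sketch. It uses plain Python lists as a stand-in for a Kafka topic and a Spark Streaming DStream (in Spark Streaming the comparable operation would be a windowed DStream with a window duration and a slide duration); the function name `windows` and the `(epoch_seconds, payload)` record shape are assumptions made for illustration, not part of the patent.

```python
INTERVAL_S = 5 * 60   # extraction interval from the text: 5 minutes
WINDOW_S = 10 * 60    # time window from the text: 10 minutes

def windows(records, interval_s=INTERVAL_S, window_s=WINDOW_S):
    """Yield (window_end, batch) once per interval.

    records: iterable of (epoch_seconds, payload) tuples sorted by time
    (a hypothetical stand-in for messages buffered in Kafka).
    Each batch holds every record from the last window_s seconds, so with
    a 5-minute interval and a 10-minute window consecutive batches overlap.
    """
    records = list(records)
    if not records:
        return
    first, last = records[0][0], records[-1][0]
    t = first + interval_s
    while t - interval_s <= last:
        batch = [r for r in records if t - window_s <= r[0] < t]
        yield t, batch
        t += interval_s
```

Because the window is twice the interval, each record is seen by up to two consecutive batches, which is what lets a late-arriving RFID or snapshot record still find its counterpart.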
In a preferred embodiment of the present invention, in the processing method for real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically comprises cleaning vehicle tracks, counting passing-vehicle flow and extracting abnormal data;
The vehicle tracks are cleaned according to the following steps:
A1, the two kinds of data records are joined on the four fields common to the RFID vehicle-passing data and the snapshot data: license plate number, time, collection point name and collection direction;
A2, using the comparison function provided by Spark Streaming, the license plate number and time strings are reversed, and the joined RFID vehicle-passing data and snapshot data are filtered according to the comparison rules, yielding the track records of vehicles as they pass the collection points, i.e. the vehicle track cleaning result;
A3, the vehicle track cleaning result is stored in HBase, which is divided into multiple different regions, with the reversed string of the license plate number and time string as the key.
In the above method, reversing the license plate number and time strings lowers the probability that the compared fields share a common prefix, which greatly reduces the number of comparisons needed between pairs of strings and so improves comparison efficiency.
Because the volume of vehicle track data is very large and the massive data must support efficient queries, the embodiment of the present invention stores the vehicle track cleaning result in HBase. To improve query efficiency on the cleaning result, HBase is divided into 1000 different regions, and records are stored with the reversed string of the license plate number and time string as the key.
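The effect of reversing the key can be shown with a small sketch. Two details are assumptions, since the text does not specify them: the key is built by concatenating plate and time and reversing the result, and a key is mapped to one of the 1000 regions by a CRC32 hash. The helper names `row_key`, `region_of` and `common_prefix_len` are hypothetical.

```python
import zlib

NUM_REGIONS = 1000  # region count given in the text

def row_key(plate, time_str):
    """Reversed plate+time string (assumed construction of the key)."""
    return (plate + time_str)[::-1]

def region_of(key, num_regions=NUM_REGIONS):
    """Assumed mapping of a key to a region: CRC32 hash modulo region count."""
    return zlib.crc32(key.encode("utf-8")) % num_regions

def common_prefix_len(a, b):
    """Number of leading characters two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Same vehicle, two passages five seconds apart.
forward_a = "A12345" + "20161201083000"
forward_b = "A12345" + "20161201083005"
shared_forward = common_prefix_len(forward_a, forward_b)
shared_reversed = common_prefix_len(row_key("A12345", "20161201083000"),
                                    row_key("A12345", "20161201083005"))
```

In the forward direction the two keys agree on every character except the last, so a character-by-character comparison has to scan almost the whole string; reversed, they differ at the first character, which is the behaviour the paragraph above relies on.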
In a preferred embodiment of the present invention, the passing-vehicle flow may be counted according to the following steps:
B1, the RFID vehicle-passing data received in each time period is converted into key-value pairs keyed by the collection point field;
B2, following the distributed processing principle of Spark Streaming, the data records with the same key are counted, and the counts for each collection point are then summed over the set time interval, yielding the passing-vehicle flow record of each collection point for the corresponding time period;
B3, the passing-vehicle flow of each collection point is stored in an in-memory database.
Passing-vehicle flow statistics count the number of vehicles passing each collection point within a time window. To allow fast reads, writes and queries, the passing-vehicle flow of each collection point is stored in an in-memory database; a passing-vehicle flow query then needs only one accumulation in memory, which improves the real-time performance of the passing-vehicle flow statistics.
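Steps B1 to B3 amount to a count-by-key followed by a sum over batches. The sketch below uses `collections.Counter` in place of Spark Streaming's per-key reduction and a plain dictionary-like object in place of the in-memory database; the record field name `point` and the function names are assumptions for illustration.

```python
from collections import Counter

def count_batch(rfid_records):
    """B1 + B2: key each record by collection point and count per key."""
    return Counter(r["point"] for r in rfid_records)

def sum_batches(batch_counts):
    """B2: sum the per-batch counts of each collection point over a period."""
    total = Counter()
    for counts in batch_counts:
        total.update(counts)  # Counter.update adds counts key by key
    return total

# Two consecutive batches of RFID vehicle-passing records (made-up data).
batch_1 = [{"point": "P1"}, {"point": "P1"}, {"point": "P2"}]
batch_2 = [{"point": "P1"}]

# B3: the summed result kept in memory stands in for the in-memory database.
flow_store = sum_batches([count_batch(batch_1), count_batch(batch_2)])
```

Keeping only the small per-batch counters, rather than the raw records, is what makes the later query a single in-memory accumulation.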
In the present invention, the abnormal data may be extracted according to the following steps:
C1, the two kinds of data records are joined on the four fields common to the RFID vehicle-passing data and the snapshot data: license plate number, time, collection point name and collection direction;
C2, the RFID vehicle-passing data and the snapshot data are each filtered according to the decision rules for abnormal data, and the abnormal data is extracted;
C3, the abnormal data is stored in a relational database.
Because data may be missing or the information in different types of data may be inconsistent, abnormal data needs to be extracted for manual review. To extract abnormal data, the RFID vehicle-passing data and the video snapshot data are first joined on several common fields; the RFID vehicle-passing data and the snapshot data are then each filtered according to the decision rules for abnormal data, and the abnormal data is extracted. Because the volume of abnormal data is limited, it can be stored in a relational database.
In a preferred embodiment of the present invention, in the processing method for historical data, comparing, counting and exception-handling the data according to the data cleaning rules specifically comprises cleaning vehicle tracks, counting passing-vehicle flow and extracting abnormal data;
The vehicle tracks are cleaned according to the following steps:
D1, the RFID vehicle-passing data and the video snapshot data are joined on four fields: license plate number, time, collection point name and direction;
D2, the license plate number and time strings are reversed, and the data is filtered using the plate colour and passing-time fields to obtain the vehicle track data;
D3, the vehicle track data is stored in HBase, with the reversed string of the license plate number and time string as the key.
In the above method, Spark is first used to join the RFID vehicle-passing data and snapshot data with the basic information in the database, such as people, vehicles and collection points, obtaining the information useful for track cleaning, flow statistics and anomaly extraction, so as to facilitate the offline data cleaning process.
The present invention first reverses the license plate number and time strings to improve the efficiency of comparing the two types of data. Because the volume of vehicle track data is very large and the massive data must support efficient queries, the present invention stores the cleaned vehicle tracks in HBase with the reversed string of the license plate number and time string as the key. To speed up storage, the data import tool Loader can be used to import the track data into HBase.
In the embodiment of the present invention, in the vehicle track cleaning process, the RFID data, the snapshot data and the device information table are first each wrapped as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field; the two data RDDs are then each transformed into key-value RDDs, in which the key is the string formed from the fields to be compared, so that the compare-and-join operation is easy to carry out; finally, the RDDs of the two kinds of data are compared and joined by key, and the data is filtered according to rules such as time completeness, plate colour consistency and field completeness to obtain the correct track data.
The RDD dependencies in the vehicle track cleaning module are shown in Fig. 3.
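The three stages described above (attach the direction via the device table, re-key both data sets, then join and filter) can be sketched in plain Python, with lists and dictionaries standing in for RDDs. All field names (`ip`, `plate`, `time`, `point`, `direction`, `plate_color`) and function names are assumptions for illustration, and the filter shown covers only field completeness and plate colour consistency.

```python
REQUIRED = ("plate", "time", "point", "direction", "plate_color")

def attach_direction(records, devices):
    """Stage 1: join each record with the device table on the device IP."""
    out = []
    for r in records:
        dev = devices.get(r.get("ip"))
        if dev is not None:
            out.append({**r, "direction": dev["direction"]})
    return out

def key_of(r):
    """Stage 2: key is the string formed from the fields to be compared."""
    return "|".join(r.get(f, "") for f in ("plate", "time", "point", "direction"))

def clean_tracks(rfid, snaps, devices):
    """Stage 3: join the two keyed data sets and filter by the rules."""
    rfid_kv = {key_of(r): r for r in attach_direction(rfid, devices)}
    snap_kv = {key_of(s): s for s in attach_direction(snaps, devices)}
    tracks = []
    for k in sorted(rfid_kv.keys() & snap_kv.keys()):
        r, s = rfid_kv[k], snap_kv[k]
        complete = all(r.get(f) for f in REQUIRED)
        if complete and r["plate_color"] == s.get("plate_color"):
            tracks.append(r)
    return tracks
```

A dictionary keeps one record per key, which is enough for a sketch; a real key-value join would keep every matching pair.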
In the embodiment of the present invention, the passing-vehicle flow is counted according to the following steps:
E1, the RFID vehicle-passing data is converted into key-value pairs keyed by the collection point field and the time string accurate to the hour;
E2, following the distributed processing principle of Spark, the data records with the same key are counted, yielding the passing-vehicle flow record of each collection point for the corresponding time period;
E3, the passing-vehicle flow results of each collection point are stored in a relational database.
Passing-vehicle flow statistics are divided into two kinds by statistic type: full data statistics and per-vehicle-type flow statistics. Full data statistics count the total flow through each collection point; per-vehicle-type flow statistics subdivide vehicles by type and count the flow of each vehicle type through a collection point. Because both the number of collection points and the number of vehicle types are limited, the volume of statistical results is small and they can be stored in a relational database.
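Both kinds of offline statistics reduce to counting records under a composite key. In the sketch below the time string format `YYYYMMDDHHMMSS`, the field names and the function names are assumptions for illustration, and `collections.Counter` stands in for Spark's per-key counting.

```python
from collections import Counter

def hour_of(time_str):
    """Truncate an assumed YYYYMMDDHHMMSS string to the hour (YYYYMMDDHH)."""
    return time_str[:10]

def flow_by_point_hour(records):
    """Full data statistics: total flow per (collection point, hour)."""
    return Counter((r["point"], hour_of(r["time"])) for r in records)

def flow_by_vehicle_type(records):
    """Per-type statistics: flow per (collection point, hour, vehicle type)."""
    return Counter(
        (r["point"], hour_of(r["time"]), r["vehicle_type"]) for r in records
    )

# Made-up RFID vehicle-passing records.
sample = [
    {"point": "P1", "time": "20161201083000", "vehicle_type": "car"},
    {"point": "P1", "time": "20161201085900", "vehicle_type": "truck"},
    {"point": "P1", "time": "20161201090000", "vehicle_type": "car"},
]
totals = flow_by_point_hour(sample)
by_type = flow_by_vehicle_type(sample)
```

The result has at most one row per collection point, hour and vehicle type, which is why the text can store it in a relational database.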
In the present invention, the types of abnormal data include: incomplete data fields, missing data, and inconsistent data information.
Because data may be missing, data fields may be incomplete, or the information in different types of data may be inconsistent, abnormal RFID vehicle-passing data and snapshot data must be filtered out for these cases during processing and marked with their anomaly type for manual review and correction, so as to guarantee data integrity and the correctness of vehicle track cleaning. Abnormal data mainly covers the following three cases:
(1) incomplete data fields
(2) missing data
(3) inconsistent data information.
In the embodiment of the present invention, the abnormal data may be extracted according to the following steps:
F1, the RFID vehicle-passing data and the snapshot data are joined on four fields: license plate number, collection point name, collection direction and time;
F2, according to the anomaly types, first determine whether the RFID data is missing; if the RFID data exists, then check whether the colour field in the RFID data and the snapshot are present; if the fields are complete, determine whether the plate colour in the RFID data is consistent with that in the snapshot data; finally, store the extracted abnormal data in a MySQL database and mark its anomaly type.
Given the characteristics of RFID and snapshot data, incomplete data fields mainly cover two types: inconsistent plate colour and no snapshot picture. To improve exception handling efficiency, the three types of abnormal data can be handled together according to the above method.
After the abnormal data has been extracted, it is distinguished by anomaly type and data type and presented to reviewers through an interface; the reviewers supplement and repair the data according to the type of anomaly, and the data is then sent back to the data cleaning system for processing, which improves the accuracy of vehicle track cleaning.
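The check order in step F2 can be read as a single cascade that assigns at most one anomaly label to each joined pair. The sketch below is one possible reading under stated assumptions: a missing snapshot or snapshot picture is treated as an incomplete field, the labels and field names (`plate_color`, `picture`) are hypothetical, and writing the result to MySQL is left out.

```python
def classify(rfid, snapshot):
    """Return an anomaly label for one joined pair, or None if it is clean.

    rfid / snapshot are dicts, or None when that side of the join is absent.
    """
    # Check 1: is the RFID data missing?
    if rfid is None:
        return "data_missing"
    # Check 2 (assumed reading): colour field and snapshot picture present?
    if not rfid.get("plate_color") or snapshot is None or not snapshot.get("picture"):
        return "field_incomplete"
    # Check 3: do the two sources agree on plate colour?
    if rfid["plate_color"] != snapshot.get("plate_color"):
        return "info_inconsistent"
    return None

def extract_anomalies(pairs):
    """Keep only abnormal pairs, each tagged with its anomaly type."""
    tagged = ((classify(r, s), r, s) for r, s in pairs)
    return [(label, r, s) for label, r, s in tagged if label is not None]
```

Running the three checks in one pass is what allows the three anomaly types to be handled together rather than in three separate scans of the joined data.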
The technical solution disclosed above yields the following beneficial effects. In the fast and accurate cleaning method for traffic big data provided by the embodiment of the present invention, real-time RFID and snapshot data is handled with Spark Streaming stream processing: Kafka provides data caching, data is continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. Offline, batch-accumulated data is handled with Spark in-memory processing: data is read from HDFS and is compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves both program performance and the accuracy of the cleaning results. Vehicle track cleaning, abnormal data handling and passing-vehicle flow statistics are thereby carried out quickly and accurately on the RFID, snapshot and other data produced during urban traffic monitoring and management, which in turn realizes the processing of traffic data resources and safeguards the storage and use of traffic big data resources.
The embodiments in this specification are described progressively; each embodiment emphasizes its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another.
Those skilled in the art should understand that the order of the method steps provided in the above embodiments may be adapted to the actual situation, and the steps may also be performed concurrently where appropriate.
All or some of the steps in the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a storage medium readable by a computer device and is used to perform all or some of the steps of the methods of the above embodiments. Examples of the computer device are: a personal computer, server, network device, smart mobile terminal, smart home device, wearable smart device or in-vehicle smart device. Examples of the storage medium are: RAM, ROM, magnetic disk, magnetic tape, optical disc, flash memory, USB flash drive, portable hard drive, memory card, memory stick, network server storage or network cloud storage.
Finally, it should also be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "including", "comprising" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes it.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A quick and accurate cleaning method of traffic big data, characterized by comprising a method for processing real-time data and a method for processing historical data;
The method for processing real-time data operates on real-time RFID data and capture data: using the Spark Streaming stream-processing technology, data is continuously extracted from Kafka according to a time window, and the comparison, statistics, and exception handling of the data are completed according to data cleaning rules;
The method for processing historical data uses the Spark in-memory processing technology to read data from HDFS and, according to the data cleaning rules, performs comparison, statistics, and exception handling on the data.
2. The quick and accurate cleaning method of traffic big data according to claim 1, characterized in that continuously extracting data from Kafka according to a time window specifically means: at a set time interval, RFID vehicle-passing data and capture data are obtained from the persistent Kafka distributed message queue, each acquisition accumulating the data within the set time period.
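For illustration only (not part of the claims), the windowed extraction of claim 2 can be sketched in plain Python without Kafka or Spark Streaming. The record shape `(timestamp, payload)` and the function name `batch_by_window` are assumptions made for this sketch; a real deployment would rely on the batch interval of the streaming framework instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (epoch seconds, payload).
Record = Tuple[int, str]


def batch_by_window(records: Iterable[Record], window_seconds: int) -> Dict[int, List[str]]:
    """Group records into consecutive fixed-length time windows.

    Each key is the start of a window in epoch seconds; each value holds
    the payloads whose timestamps fall inside that window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in records:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        batches[window_start].append(payload)
    return dict(batches)
```

With a 5-second window, records stamped 0 and 4 land in the same batch, while a record stamped 5 opens the next one.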
3. The quick and accurate cleaning method of traffic big data according to claim 2, characterized in that, in the method for processing real-time data, completing the comparison, statistics, and exception handling of the data according to the data cleaning rules specifically comprises cleaning vehicle tracks, counting vehicle-passing flow, and extracting abnormal data;
The cleaning of vehicle tracks is carried out in the following steps:
A1, joining the two kinds of data records on the fields common to the RFID vehicle-passing data and the capture data, namely the four fields of license plate number, time, collection point name, and collection direction;
A2, using the comparison function provided by Spark Streaming, reversing the license plate number and time strings, and filtering the joined RFID vehicle-passing data and capture data according to the comparison rules, to obtain the record of the vehicle's track at the collection point, i.e., the vehicle track cleaning result;
A3, storing the vehicle track cleaning result in HBase, with HBase divided into multiple different regions and the reversed string of the license plate number and time string used as the key.
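For illustration only (not part of the claims), steps A1 to A3 can be sketched in plain Python. The field names, the exact-match comparison rule, and the choice to reverse the concatenated plate-and-time string are assumptions of this sketch; the claim does not fix them, and the dictionary returned here merely stands in for an HBase table.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical names for the four common fields.
JOIN_FIELDS = ("plate", "time", "point", "direction")


def join_key(record: Dict[str, str]) -> Tuple[str, ...]:
    """Build the join key from the four common fields."""
    return tuple(record[field] for field in JOIN_FIELDS)


def reversed_row_key(plate: str, time_str: str) -> str:
    """Reverse the concatenated plate number and time string."""
    return (plate + time_str)[::-1]


def clean_tracks(rfid: Iterable[Dict[str, str]],
                 captures: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Keep RFID records that have a matching capture record.

    Returns a mapping from reversed row key to the merged record.
    """
    captures_by_key = {join_key(c): c for c in captures}
    result: Dict[str, Dict[str, str]] = {}
    for record in rfid:
        match = captures_by_key.get(join_key(record))
        if match is None:
            continue  # no capture record confirms this RFID read
        row_key = reversed_row_key(record["plate"], record["time"])
        result[row_key] = {**match, **record}
    return result
```

Reversing the key puts the fast-changing trailing characters of the time string first, which is a common way to spread consecutive writes across storage regions.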
4. The quick and accurate cleaning method of traffic big data according to claim 3, characterized in that the counting of vehicle-passing flow is carried out in the following steps:
B1, converting the RFID vehicle-passing data received in each time period into key-value pairs keyed by the collection point field;
B2, following the principle of Spark Streaming distributed big data processing, counting the data records that share the same key, and then summing the results for each collection point at the set time interval, to obtain the vehicle-passing flow record of each collection point in the corresponding time period;
B3, storing the vehicle-passing flow of each collection point in an in-memory database.
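For illustration only (not part of the claims), steps B1 and B2 amount to a count per key within each batch followed by a sum across the batches of one interval. The sketch below uses Python's `collections.Counter` in place of distributed operators; the field name `point` and both function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def count_batch(records: Iterable[Dict[str, str]]) -> Counter:
    """Count RFID records per collection point within one batch."""
    return Counter(record["point"] for record in records)


def sum_over_interval(batch_counts: Iterable[Counter]) -> Dict[str, int]:
    """Add up the per-batch counts that belong to one reporting interval."""
    total: Counter = Counter()
    for counts in batch_counts:
        total.update(counts)
    return dict(total)
```

Two batches with three reads and one read at the same collection point therefore yield a flow of four for that interval.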
5. The quick and accurate cleaning method of traffic big data according to claim 3, characterized in that the extraction of abnormal data is carried out in the following steps:
C1, joining the two kinds of data records on the fields common to the RFID vehicle-passing data and the capture data, namely the four fields of license plate number, time, collection point name, and collection direction;
C2, filtering the RFID vehicle-passing data and the capture data separately according to the decision rules for abnormal data, to extract the abnormal data;
C3, storing the abnormal data in a relational database.
6. The quick and accurate cleaning method of traffic big data according to claim 1, characterized in that, in the method for processing historical data, performing comparison, statistics, and exception handling on the data according to the data cleaning rules specifically comprises cleaning vehicle tracks, counting vehicle-passing flow, and extracting abnormal data;
The cleaning of vehicle tracks is carried out in the following steps:
D1, joining the RFID vehicle-passing data and the video capture data on the four fields of license plate number, time, collection point name, and direction;
D2, reversing the license plate number and time strings, and filtering the data using the plate color and transit time fields, to obtain vehicle track data;
D3, storing the vehicle track data in HBase, with the reversed string of the license plate number and time string as the key.
7. The quick and accurate cleaning method of traffic big data according to claim 6, characterized in that, in the cleaning of vehicle tracks, the RFID data, the capture data, and the device information table are first each encapsulated as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a capture data RDD with a direction field; the two kinds of data RDD are then each transformed into an RDD of key-value pairs to facilitate the comparison and join operations, where the key is the string composed of the fields to be compared; finally, the RDDs of the two kinds of data are compared and joined by key, and the data is filtered using rules such as time integrity, plate color consistency, and field integrity, to obtain the correct track data.
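For illustration only (not part of the claims), the three stages of claim 7 can be sketched with plain Python dictionaries standing in for RDDs: attach the direction from the device table by IP address, key each record by the fields to compare, then join and filter. All field names, the `devices` lookup shape, and the concrete reading of the three filter rules are assumptions of this sketch.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[str]]
KEY_FIELDS = ("plate", "time", "point", "direction")  # hypothetical names
REQUIRED = KEY_FIELDS + ("color",)


def attach_direction(rows: Iterable[Row], devices: Dict[str, Row]) -> List[Row]:
    """Copy the direction of each row's device (looked up by IP) onto the row."""
    out: List[Row] = []
    for row in rows:
        device = devices.get(row.get("ip") or "")
        if device is not None:
            out.append({**row, "direction": device["direction"]})
    return out


def key_by(rows: Iterable[Row]) -> Dict[Tuple[Optional[str], ...], Row]:
    """Turn rows into key -> row pairs; rows lacking a key field are skipped."""
    pairs: Dict[Tuple[Optional[str], ...], Row] = {}
    for row in rows:
        if all(row.get(field) for field in KEY_FIELDS):
            pairs[tuple(row[field] for field in KEY_FIELDS)] = row
    return pairs


def join_and_filter(rfid: Iterable[Row], captures: Iterable[Row],
                    devices: Dict[str, Row]) -> List[Row]:
    """Join the two sources by key and keep complete, color-consistent pairs."""
    rfid_pairs = key_by(attach_direction(rfid, devices))
    capture_pairs = key_by(attach_direction(captures, devices))
    tracks: List[Row] = []
    for key in sorted(rfid_pairs.keys() & capture_pairs.keys()):
        r, c = rfid_pairs[key], capture_pairs[key]
        complete = all(r.get(f) for f in REQUIRED) and all(c.get(f) for f in REQUIRED)
        if complete and r["color"] == c["color"]:
            tracks.append(r)
    return tracks
```

In Spark itself the same shape would typically be expressed with pair-RDD joins and filters; the sketch only shows the data flow.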
8. The quick and accurate cleaning method of traffic big data according to claim 6, characterized in that the counting of vehicle-passing flow is carried out in the following steps:
E1, converting the RFID vehicle-passing data into key-value pairs keyed by the collection point field and the time string accurate to the hour;
E2, following the principle of Spark distributed big data processing, counting the data records that share the same key, to obtain the vehicle-passing flow record of each collection point in the corresponding time period;
E3, storing the vehicle-passing flow result of each collection point in a relational database.
9. The quick and accurate cleaning method of traffic big data according to claim 6, characterized in that the types of abnormal data include: incomplete data fields, missing data, and inconsistent data information.
10. The quick and accurate cleaning method of traffic big data according to claim 9, characterized in that the extraction of abnormal data is carried out in the following steps:
F1, joining the RFID vehicle-passing data and the capture data on the four fields of license plate number, collection point name, collection direction, and time;
F2, according to the types of data anomaly, first determining whether the RFID data is missing; if RFID data exists, determining whether the color field is present in the RFID data and in the capture data; if the fields are complete, determining whether the plate color in the RFID data is consistent with that in the capture data; finally, storing the extracted abnormal data in a MySQL database and labelling its anomaly type.
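For illustration only (not part of the claims), the ordered checks of step F2 can be sketched as a small classifier. The label strings, the field name `color`, and the function name `classify` are assumptions; writing the labelled records to MySQL is omitted.

```python
from typing import Dict, Optional

# Hypothetical labels for the three anomaly types named in claim 9.
MISSING = "data missing"
INCOMPLETE = "field incomplete"
INCONSISTENT = "information inconsistent"


def classify(rfid: Optional[Dict[str, str]], capture: Dict[str, str]) -> Optional[str]:
    """Return the anomaly type of a joined pair, or None if the pair is clean.

    The checks run in the order given in step F2: missing RFID data first,
    then missing color fields, then a plate color mismatch.
    """
    if rfid is None:
        return MISSING
    if not rfid.get("color") or not capture.get("color"):
        return INCOMPLETE
    if rfid["color"] != capture["color"]:
        return INCONSISTENT
    return None
```

Because the checks short-circuit, a pair is labelled with the first anomaly found, so each abnormal record carries exactly one type.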
CN201611094160.2A 2016-12-02 2016-12-02 Quick and accurate cleaning method of traffic big data Pending CN106528865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611094160.2A CN106528865A (en) 2016-12-02 2016-12-02 Quick and accurate cleaning method of traffic big data

Publications (1)

Publication Number Publication Date
CN106528865A true CN106528865A (en) 2017-03-22

Family

ID=58354223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611094160.2A Pending CN106528865A (en) 2016-12-02 2016-12-02 Quick and accurate cleaning method of traffic big data

Country Status (1)

Country Link
CN (1) CN106528865A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778245A (en) * 2015-04-09 2015-07-15 北方工业大学 Similar trajectory mining method and device on basis of massive license plate identification data
CN105426478A (en) * 2015-11-18 2016-03-23 四川长虹电器股份有限公司 Method for user behavior analysis
CN105786864A (en) * 2014-12-24 2016-07-20 国家电网公司 Offline analysis method for massive data
CN105893628A (en) * 2016-05-17 2016-08-24 中国农业银行股份有限公司 Real-time data collection system and method

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106878092A (en) * 2017-03-28 2017-06-20 上海以弈信息技术有限公司 A kind of network O&M monitor in real time of multi-source heterogeneous data fusion is presented platform with analysis
CN109118806A (en) * 2017-06-26 2019-01-01 杭州海康威视系统技术有限公司 A kind of unit exception detection method, apparatus and system
CN107391719A (en) * 2017-07-31 2017-11-24 南京邮电大学 Distributed stream data processing method and system in a kind of cloud environment
CN107688646A (en) * 2017-08-30 2018-02-13 武汉烽火众智数字技术有限责任公司 A kind of method of the bayonet socket data area crash analysis based on ES
CN108090191A (en) * 2017-12-14 2018-05-29 苏州泥娃软件科技有限公司 The method and system that a kind of traffic big data cleaning arranges
CN108171971A (en) * 2017-12-18 2018-06-15 武汉烽火众智数字技术有限责任公司 Vehicular real time monitoring method and system based on Spark Streaming
CN108319538A (en) * 2018-02-02 2018-07-24 世纪龙信息网络有限责任公司 The monitoring method and system of big data platform operating status
CN109753496A (en) * 2018-11-27 2019-05-14 天聚地合(苏州)数据股份有限公司 A kind of data cleaning method for big data
CN109785595A (en) * 2019-02-26 2019-05-21 成都古河云科技有限公司 A kind of vehicle abnormality track real-time identification method based on machine learning
CN110287010A (en) * 2019-06-12 2019-09-27 北京工业大学 A kind of data cached forecasting method towards the analysis of Spark time window data
CN110287010B (en) * 2019-06-12 2021-09-14 北京工业大学 Cache data prefetching method oriented to Spark time window data analysis
CN110334081A (en) * 2019-06-28 2019-10-15 北京天眼查科技有限公司 The cleaning method and device of mass data
CN111368134B (en) * 2019-07-04 2023-10-27 杭州海康威视系统技术有限公司 Traffic data processing method and device, electronic equipment and storage medium
CN111368134A (en) * 2019-07-04 2020-07-03 杭州海康威视系统技术有限公司 Traffic data processing method and device, electronic equipment and storage medium
CN110502509A (en) * 2019-08-27 2019-11-26 广东工业大学 A kind of traffic big data cleaning method and relevant apparatus based on Hadoop Yu Spark frame
CN110502509B (en) * 2019-08-27 2023-04-18 广东工业大学 Traffic big data cleaning method based on Hadoop and Spark framework and related device
CN110704206A (en) * 2019-09-09 2020-01-17 上海凯京信达科技集团有限公司 Real-time computing method, computer storage medium and electronic equipment
CN110704206B (en) * 2019-09-09 2022-09-27 上海斑马来拉物流科技有限公司 Real-time computing method, computer storage medium and electronic equipment
CN110569237A (en) * 2019-09-12 2019-12-13 上海富数科技有限公司 System and method for realizing real-time data cleaning processing
CN110569238A (en) * 2019-09-12 2019-12-13 成都中科大旗软件股份有限公司 data management method, system, storage medium and server based on big data
CN110569238B (en) * 2019-09-12 2023-03-24 成都中科大旗软件股份有限公司 Data management method, system, storage medium and server based on big data
CN110888972A (en) * 2019-10-27 2020-03-17 北京明朝万达科技股份有限公司 Sensitive content identification method and device based on Spark Streaming
CN111127949A (en) * 2019-12-18 2020-05-08 北京中交兴路车联网科技有限公司 Vehicle high-risk road section early warning method and device and storage medium
CN111127949B (en) * 2019-12-18 2021-12-03 北京中交兴路车联网科技有限公司 Vehicle high-risk road section early warning method and device and storage medium
CN111143415A (en) * 2019-12-26 2020-05-12 政采云有限公司 Data processing method and device and computer readable storage medium
CN111143415B (en) * 2019-12-26 2023-12-29 政采云有限公司 Data processing method, device and computer readable storage medium
CN112347093A (en) * 2020-11-05 2021-02-09 哈尔滨航天恒星数据系统科技有限公司 Method for facilitating cleaning, integrating and storing of mass multi-source heterogeneous data
CN113177049A (en) * 2021-05-13 2021-07-27 中移智行网络科技有限公司 Data processing method, device and system
CN113505119A (en) * 2021-07-29 2021-10-15 青岛以萨数据技术有限公司 ETL method and device based on multiple data sources
CN113505119B (en) * 2021-07-29 2023-08-29 青岛以萨数据技术有限公司 ETL method and device based on multiple data sources
CN115391315A (en) * 2022-07-15 2022-11-25 生命奇点(北京)科技有限公司 Data cleaning method and device
CN114996260A (en) * 2022-08-05 2022-09-02 深圳市深蓝信息科技开发有限公司 Method and device for cleaning AIS data, terminal equipment and storage medium
CN114996260B (en) * 2022-08-05 2022-11-11 深圳市深蓝信息科技开发有限公司 Method and device for cleaning AIS data, terminal equipment and storage medium
CN115359666A (en) * 2022-08-19 2022-11-18 重庆首讯科技股份有限公司 Abnormal traffic behavior detection method based on multi-source data cross validation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322
