CN108647360A

CN108647360A - A multithreaded method for accessing and processing taxi big data

Info

Publication number: CN108647360A
Application number: CN201810480023.5A
Authority: CN
Inventors: 孙玲; 张琨; 施佺; 陆俊天; 吕心钰
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2018-10-12
Anticipated expiration: 2038-05-18
Also published as: CN108647360B

Abstract

The multithreaded method for accessing and processing taxi big data of the present invention comprises the following steps. Step 1) Coordinate conversion of longitude and latitude: coordinates in the international WGS-84 latitude and longitude standard are converted twice, first to GCJ-02 and then to BD-09, to obtain coordinates that display accurately on Baidu Maps; the Spark parallel computing framework running on Hadoop applies the conversion to every longitude and latitude record through the map operation of a resilient distributed dataset (RDD). Step 2) The data undergo cleaning, indicator calculation, region division, and data conversion; within each of these operations, Spark processes the records in parallel using multiple threads. Step 3) The data are stored in the distributed file system HDFS. Beneficial effects: set against the background of taxi big data, the method not only uses the existing data processing tools of Spark combined with HDFS but, to further improve processing efficiency, also adds a multithreaded parallel processing mechanism that uses Python's advanced decorator feature and its multithreading module to improve the synchronized operation of multiple processing procedures.

Description

A multithreaded method for accessing and processing taxi big data

Technical field

The present invention relates to the field of big data, and in particular to a Spark-based multithreaded method for accessing and processing taxi big data.

Background art

The traditional way of accessing data through a Hadoop cluster uses MapReduce, Hadoop's parallel computing framework. Its step-by-step processing mechanism limits its performance and incurs heavy disk I/O overhead: each pass reads data from the cluster, performs part of the processing, and writes the result back to the cluster, and this cycle repeats until processing is complete. Spark, by contrast, can complete all data analysis in memory in close to real time: it reads the data from the cluster, performs all of the analysis, and only then writes the result back to the cluster. As a result, Spark's batch processing is nearly 10 times faster than MapReduce, and its in-memory data analysis is nearly 100 times faster. The traditional Hadoop-cluster approach to data access therefore struggles to meet current requirements for accessing, processing, and analyzing massive data, and is poorly suited to real-time data operations.

Summary of the invention

The object of the present invention is to address the poor access efficiency of existing massive-data systems, the difficulty of coordinating multiple tasks, and the limitations of single-task data processing. Historical and real-time traffic data obtained through a data interface are processed in parallel using Spark's efficient in-memory computation and resilient distributed datasets (RDDs); multithreaded operation is set up at the same time so that the results of several kinds of data processing are generated synchronously and stored in the distributed file system HDFS. The invention thus provides a Spark-based multithreaded method for accessing and processing taxi big data, realized by the following technical scheme:

The multithreaded method for accessing and processing taxi big data comprises the following steps:

Step 1) Coordinate conversion of longitude and latitude: coordinates in the international WGS-84 latitude and longitude standard are converted twice, first to GCJ-02 and then to BD-09, to obtain coordinates that display accurately on Baidu Maps; the Spark parallel computing framework running on Hadoop applies the conversion to every longitude and latitude record through the map operation of a resilient distributed dataset (RDD);

Step 2) The data undergo cleaning, indicator calculation, region division, and data conversion; within each of these operations, Spark processes the records in parallel using multiple threads;

Step 3) The data are stored in the distributed file system HDFS.

A further design of the multithreaded method for accessing and processing taxi big data is that Spark includes SparkContext, a Cluster Manager, and Executors, and that the working process of Spark comprises the following steps:

Step a) After the application is submitted with spark-submit, SparkContext is initialized at the corresponding location according to the parameters set at submission, and the DAG Scheduler and Task Scheduler are created. The Driver executes the application code and divides the whole program into multiple jobs according to its action operators. A DAG is built inside each job; the DAG Scheduler divides the DAG into multiple stages, and each stage is in turn divided into multiple tasks that form a taskSet. The DAG Scheduler passes the taskSet to the Task Scheduler, which is responsible for scheduling tasks on the cluster;

Step b) The Driver requests resources from the Cluster Manager according to the resource requirements in SparkContext; the resources include the number of Executors and memory;

Step c) After receiving the request, the resource manager creates Executor processes on worker nodes that meet the requirements;

Step d) Once created, each Executor process registers back with the Driver so that it can receive the tasks the Driver assigns;

Step e) After the program has finished executing, the Driver releases the requested resources back to the ResourceManager.

A further design of the multithreaded method for accessing and processing taxi big data is that the data cleaning in Step 2) comprises the following steps:

The first step removes data that fall outside the actual coordinate range;

The second step checks feature values for outliers. If the data follow a normal distribution, the 3σ rule is applied, where σ denotes the standard deviation: data that do not lie within 3 standard deviations of the mean are judged to be outliers. If the data do not follow a normal distribution, the box-plot rule is applied: the first, second, and third quartiles are found and the interquartile range is calculated; data within the set range are retained, and data outside it are judged to be outliers and rejected;

The third step removes redundant data.

A further design of the multithreaded method for accessing and processing taxi big data is that, in the second step of data cleaning, the first quartile is denoted Q1, the second quartile is the median, and the third quartile is denoted Q3; the interquartile range is IQR = Q3 - Q1, and the set range is (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

A further design of the multithreaded method for accessing and processing taxi big data is that the region division in Step 2) comprises cutting the map into regular cells by coordinates, adjusting the longitude difference and latitude difference in combination with the transport capacity indicator data, and finally setting the longitude difference of each region to 0.03° and the latitude difference to 0.02°.

A further design of the multithreaded method for accessing and processing taxi big data is that the data conversion in Step 2) is: converting the data format of the data with respect to the horizontal time dimension, the vertical dimension, feature correlation, and spatial distribution.

A further design of the multithreaded method for accessing and processing taxi big data is that the indicator calculation in Step 2) is: calculating transport capacity indicators, including inflow, outflow, retention, and empty-vehicle capacity; and calculating operational indicators, including empty-running rate, passenger load, pick-up count, and drop-off count.

A further design of the multithreaded method for accessing and processing taxi big data is that the multithreaded parallel computation in Step 2) is: first, a metaclass is initialized and its constructor method is defined; then, subclasses are defined for the data cleaning, region division, indicator calculation, and data conversion operations, each inheriting the methods of the metaclass, and a decorator, an advanced feature of Python, is added to the subclasses so that calling the metaclass interface invokes the methods of all subclasses; finally, each subclass runner is executed in parallel.

The advantages of the present invention are as follows:

The present invention processes historical and real-time traffic data obtained through a data interface in parallel, using Spark's efficient in-memory computation and resilient distributed datasets (RDDs); multithreaded operation is set up at the same time so that the results of several kinds of data processing are generated synchronously and stored in the distributed file system HDFS. Set against the background of taxi big data, the Spark-based multithreaded method of the present invention not only uses the existing data processing tools of Spark combined with HDFS but, to further improve processing efficiency, also adds a multithreaded parallel processing mechanism that uses Python's advanced decorator feature and its multithreading module to improve the synchronized operation of multiple processing procedures.

Description of the drawings

Fig. 1 is a flow chart of the Spark working principle.

Fig. 2 is a flow chart of the HDFS write process.

Fig. 3 is a flow chart of data processing in the multithreading example.

Fig. 4 is an example of regular region division.

Fig. 5 is a flow chart of the multithreading implementation.

Detailed description of the embodiments

The technical scheme of the present invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 3, the multithreaded method for accessing and processing taxi big data of this embodiment comprises the following steps:

Step 1) Coordinate conversion of longitude and latitude: coordinates in the international WGS-84 latitude and longitude standard are converted twice, first to GCJ-02 and then to BD-09, to obtain coordinates that display accurately on Baidu Maps; the Spark parallel computing framework running on Hadoop applies the conversion to every longitude and latitude record through the map operation of a resilient distributed dataset (RDD).

Step 2) The data undergo cleaning, indicator calculation, region division, and data conversion; within each of these operations, Spark processes the records in parallel using multiple threads.

Step 3) The data are stored in the distributed file system HDFS.
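The two-stage conversion in Step 1) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the patent does not give its formulas, so the sketch uses the approximation that circulates widely in open-source coordinate-conversion code, and the constants `A` and `EE` and all function names are assumptions introduced here. With PySpark, a function of this kind could be applied per record through `rdd.map`, as the closing comment indicates.

```python
import math

# Assumed constants from widely circulated open-source implementations
# (not taken from the patent and not an official specification).
A = 6378245.0                  # semi-major axis used by the approximation
EE = 0.00669342162296594323    # eccentricity squared used by the approximation
X_PI = math.pi * 3000.0 / 180.0


def _transform_lat(x, y):
    ret = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * math.sqrt(abs(x))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return ret


def _transform_lon(x, y):
    ret = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * math.sqrt(abs(x))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return ret


def wgs84_to_gcj02(lon, lat):
    """First conversion: WGS-84 to GCJ-02 (approximate)."""
    dlat = _transform_lat(lon - 105.0, lat - 35.0)
    dlon = _transform_lon(lon - 105.0, lat - 35.0)
    rad_lat = lat / 180.0 * math.pi
    magic = 1 - EE * math.sin(rad_lat) ** 2
    sqrt_magic = math.sqrt(magic)
    dlat = (dlat * 180.0) / ((A * (1 - EE)) / (magic * sqrt_magic) * math.pi)
    dlon = (dlon * 180.0) / (A / sqrt_magic * math.cos(rad_lat) * math.pi)
    return lon + dlon, lat + dlat


def gcj02_to_bd09(lon, lat):
    """Second conversion: GCJ-02 to BD-09 (approximate)."""
    z = math.sqrt(lon * lon + lat * lat) + 0.00002 * math.sin(lat * X_PI)
    theta = math.atan2(lat, lon) + 0.000003 * math.cos(lon * X_PI)
    return z * math.cos(theta) + 0.0065, z * math.sin(theta) + 0.006


def wgs84_to_bd09(lon, lat):
    """Chain both conversions, as Step 1) describes."""
    return gcj02_to_bd09(*wgs84_to_gcj02(lon, lat))


# With PySpark, an RDD of (lon, lat) tuples could be converted like this:
#   converted = points_rdd.map(lambda p: wgs84_to_bd09(p[0], p[1]))
```

Because each record is converted independently of every other record, the function fits the per-element map operation that the step relies on.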

As shown in Fig. 1, Spark includes SparkContext, a Cluster Manager, and Executors. The Cluster Manager is the external service used to obtain resources on the cluster. There are currently three types: Standalone, Apache Mesos, and Hadoop YARN. Standalone is Spark's native resource manager, in which the Master is responsible for allocating resources. Apache Mesos is a resource scheduling framework that is well compatible with Hadoop MapReduce. Hadoop YARN mainly refers to the ResourceManager in YARN. This embodiment uses YARN as the resource manager; the essence of YARN's layered structure is that the ResourceManager entity controls the entire cluster and manages the allocation of basic computing resources to applications. An Executor is a process that a Spark application runs on a worker node. SparkContext is the running environment of Spark. The working process of Spark comprises the following steps:

Step a) After the application is submitted with spark-submit, SparkContext is initialized at the corresponding location according to the parameters set at submission, and the DAG Scheduler and Task Scheduler are created. The Driver executes the application code and divides the whole program into multiple jobs according to its action operators. A DAG is built inside each job; the DAG Scheduler divides the DAG into multiple stages, and each stage is in turn divided into multiple tasks that form a taskSet. The DAG Scheduler passes the taskSet to the Task Scheduler, which is responsible for scheduling tasks on the cluster.

Step b) The Driver requests resources from YARN's ResourceManager according to the resource requirements in SparkContext; the resources include the number of Executors and memory. The Driver in Spark runs the main function of the Spark application and creates SparkContext. SparkContext is created to prepare the running environment of the Spark application; in Spark, SparkContext is responsible for communicating with the Cluster Manager to request resources and to assign and monitor tasks. After the Executors have finished running, the Driver is also responsible for closing SparkContext. SparkContext is commonly used to represent the Driver.

Step c) After receiving the request, the resource manager creates Executor processes on worker nodes that meet the requirements.

Step d) Once created, each Executor process registers back with the Driver so that it can receive the tasks the Driver assigns.

The DAG Scheduler is a high-level scheduling layer that implements stage-based scheduling. It divides each job into stages, divides each stage into multiple tasks, and submits each stage as a taskSet to the underlying Task Scheduler, which executes it. The Task Scheduler is the other important scheduler in SparkContext besides the DAG Scheduler; it is responsible for dispatching the tasks generated by the DAG Scheduler to Executors for execution.

As shown in Fig. 2, the workflow of the distributed file system (hereinafter HDFS) is as follows. The HDFS of this embodiment is deployed on 2 Namenodes and 5 Datanodes in the Hadoop cluster, and data are stored in the form of files. The Namenodes store the metadata files, which include the timestamp, path, and number of the stored data; the Datanodes store the actual file data and back up each file in at least two copies, stored on different Datanodes. Reading HDFS data files uses Spark's caching mechanism: metadata files that are needed over a long period are cached, so that it is no longer necessary to traverse the entire file information; querying the metadata held by the Namenode is enough to locate the actual data. Accessing massive data through a Hadoop cluster, and processing the data in parallel with efficient Spark in-memory computation and a multithreading setup, offers a large performance improvement over a single node and over a traditional relational database. The approach is also highly reliable and highly scalable: nodes can be added as needed, according to actual data demand, to meet data storage requirements.

The HDFS write process is taken as an example and described in detail. The Namenode is responsible for the metadata of all files stored on HDFS. It first confirms the client's request, records the name of the file and the set of Datanodes that store it, and keeps this information in a file allocation table in memory. For example, a client sends a request to the Namenode saying that it wants to write the file "gcwk.csv" to HDFS; this file holds data such as taxi longitude and latitude and passenger-carrying status. The execution flow is then as follows (see Fig. 2):

Step 1: The client sends a request to the Namenode to write the file "gcwk.csv". (①)

Step 2: The Namenode returns information to the client, telling it to write the file to Datanodes A, B, and D and to contact Datanode B directly. (②)

Step 3: The client sends a request to Datanode B, asking it to save a copy of "gcwk.csv" and to send two replicas, one to Datanode A and one to Datanode D. (③)

Step 4: Datanode B sends a request to Datanode A, asking it to save a copy of "gcwk.csv" and to send a replica to Datanode D. (④)

Step 5: Datanode A sends a request to Datanode D, asking it to save a copy of "gcwk.csv". (⑤)

Step 6: Datanode D receives the data and returns an acknowledgment to Datanode A. (⑤)

Step 7: Datanode A receives the data and returns an acknowledgment to Datanode B. (④)

Step 8: Datanode B receives the data and returns an acknowledgment to the client, which marks the end of the entire write process. (⑥)
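The eight steps above amount to a replication pipeline in which data flows forward through the Datanodes and acknowledgments flow back. The following toy simulation illustrates only that ordering. It does not use the Hadoop API; the classes `Namenode` and `Datanode`, their methods, and the simple "first three nodes" placement rule are hypothetical stand-ins chosen for this sketch (real HDFS placement is more involved).

```python
class Datanode:
    """Toy Datanode: stores a file, forwards it downstream, then acknowledges."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def write(self, filename, data, downstream, log):
        self.files[filename] = data
        log.append(f"{self.name} stored")
        if downstream:
            # Forward the replica to the next node and wait for its acknowledgment.
            if not downstream[0].write(filename, data, downstream[1:], log):
                return False
        log.append(f"{self.name} ack")
        return True


class Namenode:
    """Toy Namenode: keeps only metadata (file name -> Datanode names)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.file_table = {}

    def allocate(self, filename):
        # Assumed placement rule for the sketch: take the first N nodes in order.
        pipeline = self.datanodes[: self.replication]
        self.file_table[filename] = [node.name for node in pipeline]
        return pipeline


def client_write(namenode, filename, data):
    """Steps 1-8: ask the Namenode, then write through the first Datanode."""
    log = []
    pipeline = namenode.allocate(filename)                            # steps 1-2
    ok = pipeline[0].write(filename, data, pipeline[1:], log)         # steps 3-8
    return ok, log


# Order the nodes so that the pipeline is B -> A -> D, as in the example above.
nodes = {name: Datanode(name) for name in "BADCE"}
namenode = Namenode(list(nodes.values()))
ok, log = client_write(namenode, "gcwk.csv", "lon,lat,status\n")
```

In this run, `log` records the three stores in the order B, A, D, followed by the acknowledgments in the reverse order D, A, B, which mirrors steps 3 to 8.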

The data cleaning in Step 2) comprises the following steps:

The first step removes data that fall outside the actual coordinate range.

The second step checks feature values for outliers. If the data follow a normal distribution, the 3σ rule is applied, where σ denotes the standard deviation: data that do not lie within 3 standard deviations of the mean are judged to be outliers. If the data do not follow a normal distribution, the box-plot rule is applied: the first, second, and third quartiles are found and the interquartile range is calculated; data within the set range are retained, and data outside it are judged to be outliers and rejected.

The third step removes redundant data.

Further, in the second step of data cleaning, the first quartile is denoted Q1, the second quartile is the median, and the third quartile is denoted Q3; the interquartile range is IQR = Q3 - Q1, and the set range is (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
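The two outlier rules of the second cleaning step can be sketched with Python's standard `statistics` module. This is an illustrative sketch only: the function names are hypothetical, the choice of the sample standard deviation and of the default quartile method of `statistics.quantiles` are assumptions, and the patent does not say how normality is tested, so the caller decides which rule to apply.

```python
import statistics


def three_sigma_filter(values):
    """Keep values within 3 standard deviations of the mean (3-sigma rule)."""
    mean = statistics.fmean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - mean) <= 3 * sigma]


def iqr_filter(values):
    """Keep values inside the open range (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low < v < high]


def remove_outliers(values, assume_normal):
    """Apply the 3-sigma rule for normal data, otherwise the box-plot rule."""
    return three_sigma_filter(values) if assume_normal else iqr_filter(values)
```

For example, `remove_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100], assume_normal=False)` drops the value 100 and keeps the rest. Note that the 3σ rule needs a reasonably large sample to flag anything, since a single extreme value also inflates the standard deviation.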

As shown in Fig. 4, the region division in Step 2) comprises cutting the map into regular cells by coordinates, adjusting the longitude difference and latitude difference in combination with the transport capacity indicator data, and finally setting the longitude difference of each region to 0.03° and the latitude difference to 0.02°.
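A regular grid with these cell sizes can be sketched as a function that maps a coordinate to a (row, column) cell index. The grid origin and the function name are assumptions made for the illustration; the patent fixes only the 0.03° and 0.02° cell dimensions.

```python
import math

LON_STEP = 0.03  # longitude difference per cell, from the text
LAT_STEP = 0.02  # latitude difference per cell, from the text


def grid_cell(lon, lat, origin_lon, origin_lat):
    """Return the (row, col) index of the cell containing (lon, lat).

    origin_lon and origin_lat mark the south-west corner of cell (0, 0);
    the choice of origin is hypothetical and left to the caller.
    """
    col = math.floor((lon - origin_lon) / LON_STEP)
    row = math.floor((lat - origin_lat) / LAT_STEP)
    return row, col
```

With an origin of (120.0, 31.0), the point (120.86, 32.01) falls in row 50, column 28. Points that lie exactly on a cell boundary may land on either side because of floating-point rounding.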

The data conversion in Step 2) is: converting the data format of the data with respect to the horizontal time dimension, the vertical dimension, feature correlation, and spatial distribution.

The indicator calculation in Step 2) is: calculating transport capacity indicators, including inflow, outflow, retention, and empty-vehicle capacity; and calculating operational indicators, including empty-running rate, passenger load, pick-up count, and drop-off count.
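The patent names these indicators without giving formulas. As one possible reading, the sketch below derives three of the operational indicators from a single taxi's time-ordered passenger status flags. Every definition in it is an assumption made for illustration: a pick-up is counted as a change from empty to occupied, a drop-off as the reverse change, and the empty-running rate as the share of samples in which the taxi is empty (a distance-based or time-based definition would be equally plausible).

```python
def operation_indicators(statuses):
    """Compute illustrative indicators from 0/1 status flags.

    statuses: time-ordered flags for one taxi, 1 = carrying a passenger,
    0 = empty. The formulas are assumptions, not taken from the patent.
    """
    pairs = list(zip(statuses, statuses[1:]))
    pickups = sum(1 for prev, cur in pairs if prev == 0 and cur == 1)
    dropoffs = sum(1 for prev, cur in pairs if prev == 1 and cur == 0)
    empty_rate = statuses.count(0) / len(statuses) if statuses else 0.0
    return {"pickups": pickups, "dropoffs": dropoffs, "empty_rate": empty_rate}
```

For the sequence `[0, 0, 1, 1, 0, 1]` this gives 2 pick-ups, 1 drop-off, and an empty-running rate of 0.5.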

As shown in Fig. 5, the multithreaded parallel computation in Step 2) is: first, a metaclass is initialized and its constructor method is defined; then, subclasses are defined for the data cleaning, region division, indicator calculation, and data conversion operations, each inheriting the methods of the metaclass, and a decorator, an advanced feature of Python, is added to the subclasses so that calling the metaclass interface invokes the methods of all subclasses; finally, each subclass runner is executed in parallel.
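One way to read this structure is sketched below with Python's `threading` module. The sketch interprets the "metaclass" of the text as an ordinary shared base class, which is an assumption; the class names, the `threaded` decorator, and the placeholder `process` bodies are all hypothetical. The decorator is applied once to the base-class `run` method, which the subclasses inherit.

```python
import functools
import threading


def threaded(method):
    """Decorator: run the wrapped method in its own thread and return the Thread."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        thread = threading.Thread(target=method, args=args, kwargs=kwargs)
        thread.start()
        return thread
    return wrapper


class Operation:
    """Shared base class: the constructor stores inputs, subclasses override process()."""

    def __init__(self, records, results, lock):
        self.records = records
        self.results = results
        self.lock = lock

    def process(self):
        raise NotImplementedError

    @threaded
    def run(self):
        value = self.process()
        with self.lock:  # guard the shared results dictionary
            self.results[type(self).__name__] = value


class DataCleaning(Operation):
    def process(self):
        return [r for r in self.records if r is not None]  # placeholder logic


class RegionDivision(Operation):
    def process(self):
        return len(self.records)  # placeholder logic


class IndicatorCalculation(Operation):
    def process(self):
        return sum(1 for r in self.records if r is not None)  # placeholder logic


class DataConversion(Operation):
    def process(self):
        return [str(r) for r in self.records]  # placeholder logic


def run_all(records):
    """Start every operation in its own thread and wait for all of them."""
    results, lock = {}, threading.Lock()
    classes = (DataCleaning, RegionDivision, IndicatorCalculation, DataConversion)
    threads = [cls(records, results, lock).run() for cls in classes]
    for thread in threads:
        thread.join()
    return results
```

Calling `run_all([1, None, 2])` returns one entry per operation, keyed by class name. In CPython, threads mainly help when the operations spend time waiting on I/O or on an external engine; for CPU-bound pure-Python work the gain may be limited.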

Set against the background of taxi big data, this embodiment not only uses the existing data processing tools of Spark combined with HDFS but, to further improve processing efficiency, also adds a multithreaded parallel processing mechanism that uses Python's advanced decorator feature and its multithreading module to improve the synchronized operation of multiple processing procedures.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.

Claims

1. A multithreaded method for accessing and processing taxi big data, characterized by comprising the following steps:

Step 1) Coordinate conversion of longitude and latitude: coordinates in the international WGS-84 latitude and longitude standard are converted twice, first to GCJ-02 and then to BD-09, to obtain coordinates that display accurately on Baidu Maps; the Spark parallel computing framework running on Hadoop applies the conversion to every longitude and latitude record through the map operation of a resilient distributed dataset;

Step 3) The data are stored in the distributed file system HDFS.

2. The multithreaded method for accessing and processing taxi big data according to claim 1, characterized in that Spark includes SparkContext, a Cluster Manager, and Executors, and the working process of Spark comprises the following steps:

Step a) After the application is submitted with spark-submit, SparkContext is initialized at the corresponding location according to the parameters set at submission, and the DAG Scheduler and Task Scheduler are created; the Driver executes the application code and divides the whole program into multiple jobs according to its action operators; a DAG is built inside each job; the DAG Scheduler divides the DAG into multiple stages, and each stage is in turn divided into multiple tasks that form a taskSet; the DAG Scheduler passes the taskSet to the Task Scheduler, which is responsible for scheduling tasks on the cluster;

3. The multithreaded method for accessing and processing taxi big data according to claim 1, characterized in that the data cleaning in Step 2) comprises the following steps:

The third step removes redundant data.

4. The multithreaded method for accessing and processing taxi big data according to claim 3, characterized in that, in the second step of data cleaning, the first quartile is denoted Q1, the second quartile is the median, and the third quartile is denoted Q3; the interquartile range is IQR = Q3 - Q1, and the set range is (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

5. The multithreaded method for accessing and processing taxi big data according to claim 1, characterized in that the region division in Step 2) comprises cutting the map into regular cells by coordinates, adjusting the longitude difference and latitude difference in combination with the transport capacity indicator data, and finally setting the longitude difference of each region to 0.03° and the latitude difference to 0.02°.

6. The multithreaded method for accessing and processing taxi big data according to claim 1, characterized in that the data conversion in Step 2) is: converting the data format of the data with respect to the horizontal time dimension, the vertical dimension, feature correlation, and spatial distribution.

7. The multithreaded method for accessing and processing taxi big data according to claim 1, characterized in that the indicator calculation in Step 2) is: calculating transport capacity indicators, including inflow, outflow, retention, and empty-vehicle capacity; and calculating operational indicators, including empty-running rate, passenger load, pick-up count, and drop-off count.

8. The multithreaded method for accessing and processing taxi big data according to claim 1, characterized in that the multithreaded parallel computation in Step 2) is: first, a metaclass is initialized and its constructor method is defined; then, subclasses are defined for the data cleaning, region division, indicator calculation, and data conversion operations, each inheriting the methods of the metaclass, and a decorator, an advanced feature of Python, is added to the subclasses so that calling the metaclass interface invokes the methods of all subclasses; finally, each subclass runner is executed in parallel.