CN111680075A

CN111680075A - Hadoop + Spark traffic prediction system and method based on combination of offline analysis and online prediction

Info

Publication number: CN111680075A
Application number: CN202010298397.2A
Authority: CN
Inventors: 张红; 王文婷
Original assignee: Lanzhou University of Technology
Current assignee: Lanzhou University of Technology
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2020-09-18

Abstract

The invention discloses a Hadoop + Spark traffic prediction system based on a combination of offline analysis and online prediction. The system comprises storage and management of traffic big data on the Hadoop platform, real-time traffic data analysis and traffic flow prediction on the Spark system, and traffic flow prediction applications. The working mechanism of the MapReduce distributed architecture of the Hadoop cloud platform, the principle of the Hadoop Distributed File System (HDFS), and the working process of Spark based on in-memory computing are studied, and a comprehensive traffic big data processing platform based on the Hadoop + Spark architecture is established to meet the strong real-time requirement of traffic flow prediction. The platform is optimized for the characteristics of traffic big data so that the data can be analyzed quickly, and a traffic data preprocessing method based on the big data platform is studied to provide high-quality data support for traffic characteristic analysis and traffic flow prediction.

Description

Hadoop + Spark traffic prediction system and method based on combination of offline analysis and online prediction

Technical Field

The invention relates to the field of traffic big data analysis and traffic prediction, and in particular to the construction of a Hadoop-based traffic big data platform and a traffic flow prediction system architecture built on that platform. It provides a research platform and a basis for technical application development for the analysis and preprocessing of traffic big data and the short-term prediction of urban traffic flow.

Background

Traffic data are varied in type, large in scale, dynamically changing, wide in spatio-temporal span, and highly random and heterogeneous, which makes them a representative example of big data. Analyzing such data efficiently and quickly, mining useful information, and providing a data basis for traffic state analysis and traffic flow prediction are necessary conditions for improving the accuracy and timeliness of urban traffic flow prediction in the big data context. Building a traffic big data processing platform on data management and analysis technologies such as big data, distributed parallel computing, and data mining, and building a traffic flow prediction system on that platform, is a prerequisite for studying urban traffic flow prediction methods in this context. Traffic big data provide rich data sources for traffic flow prediction, and a big data analysis platform based on distributed parallel computing provides strong technical support for deep mining and efficient analysis of those data.

Regarding the current state of traffic big data research: with the development of advanced information and communication technology and intelligent information acquisition and sensing technology, a large amount of traffic-related data has accumulated in the traffic field. These data bring great innovation to the further development of intelligent transportation and avoid the shortcomings that traditional single-mode monitoring imposes on traffic analysis: for example, a loop coil can detect only a single lane, microwave cannot detect low-speed vehicles, video is strongly affected by the environment, and mobile detection is constrained by communication technology. Traffic big data are a basic guarantee for intelligent transportation, and traffic analysis based on multi-source information fusion can better combine the advantages of the various detection modes and improve the accuracy and robustness of traffic information and state detection.

Disclosure of Invention

The technical problem is as follows: aiming at the problems in traditional traffic management and decision making, the invention provides a system and method for comprehensively analyzing and predicting the traffic state in multiple dimensions. It moves from a linear paradigm centered on the management process to a flat paradigm centered on data, and promotes the fusion analysis of traffic big data. The innovation lies mainly in the following aspects:

1. High real-time efficiency. Traditional data analysis technologies and algorithms are not suited to big data processing and cannot meet the real-time requirements of traffic information services. Big data technology can quickly analyze and process traffic big data through distributed parallel processing, which greatly improves the efficiency of data query and analysis and provides second-level response. Association rules hidden in massive traffic data can be mined quickly, so that traffic anomalies are found in time, their root causes are located, and traffic is guided to operate reasonably, improving the operating efficiency and capacity of the road network.

2. Distributed and comprehensive. Most traditional traffic applications rely on single-table mining of single-source data; once cross-table association over multi-source data is involved, the efficiency problem cannot be overcome. Distributed parallel processing of big data is well suited to complex association analysis across partitioned tables. It can fuse multi-source data, analyze problems from multiple angles, and promote serial and parallel association of data, which improves data processing capacity and multi-dimensional in-depth analysis and allows the evolution rules of traffic flow to be analyzed in depth.

3. Accurate and predictive. Short-term traffic flow prediction based on big data can reduce the probability of false and missed reports of traffic congestion. By establishing a monitoring and prediction model of the regional traffic state, traffic operation data and road condition and environment data are shared, traffic dynamics are monitored in real time from multiple directions, and changes in the traffic state are predicted accurately. This helps drivers and travelers learn of congestion in advance and avoid congested road sections, improving road capacity.

The technical scheme is as follows: to achieve the above purpose, the invention provides a Hadoop + Spark traffic prediction system and method based on a combination of offline analysis and online prediction. Hadoop/MapReduce, which is currently popular and adopted by many large IT companies, serves as the analysis platform for historical traffic data, and Spark, which offers efficient computation and strong fault tolerance, serves as the analysis and prediction modeling tool for real-time traffic flow data. The overall structure is shown in FIG. 1 and mainly comprises the data source, traffic big data storage, traffic data analysis, and prediction application.

1) Hadoop traffic big data platform

Hadoop is an open-source distributed computing framework that integrates distributed computing, storage, and management. Through a cluster of a large number of ordinary computers, it provides stable and reliable interfaces for applications and forms a highly reliable, fault-tolerant, scalable, and extensible distributed storage and computing system for big data. Its core components are the distributed file system HDFS and the distributed parallel computing architecture MapReduce. A series of big data tools built on it, such as Hadoop YARN, Chukwa, HBase, Hive, Mahout, Pig, Spark, and ZooKeeper, are collectively called the Hadoop ecosystem; see FIG. 2.

A Hadoop cluster generally consists of three parts, namely a client (JobClient), a master node (Master), and slave nodes (Slave), and as a whole presents a master-slave architecture (Master/Slave); the way the three parts cooperate is shown in FIG. 3. The JobClient submits jobs such as traffic data preprocessing and analysis and copies the resources related to those jobs. The Master manages and maintains the distributed storage of all traffic data and monitors the MapReduce tasks related to job analysis. The Slave performs the actual storage of traffic data and the data processing tasks. The JobTracker receives requests for new jobs such as traffic data analysis or predictive modeling, creates job objects, encapsulates the tasks, states, and progress generated while a job runs, and assigns specific execution tasks to the TaskTrackers. The TaskTracker monitors and manages the running of jobs on each node, copies and localizes the JAR files related to a job (including third-party JAR packages), and creates new instances to execute tasks.

HDFS is an open-source implementation of the Google File System (GFS). It enables high-throughput parallel access and distributed storage of traffic big data and supports high-performance, fault-tolerant, and highly reliable rapid analysis and modeling of those data; its internal execution flow is shown in FIG. 4. HDFS works in a master-slave mode: the NameNode manages the file metadata, the DataNodes store the actual traffic data, and the NameNode and DataNodes communicate with each other through Hadoop's remote procedure call (RPC) mechanism.
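The division of labor between the NameNode (metadata) and the DataNodes (actual data) can be illustrated with a toy sketch in plain Python. This is not HDFS code: the function name `plan_blocks`, the 128 MB block size, and the round-robin replica placement are assumptions made only for illustration, and real HDFS placement also takes racks and node load into account. The replication factor of 3 and the host names Slave1 to Slave3 follow the embodiment described later in this document.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (illustrative)


def plan_blocks(file_size: int, datanodes: List[str], replication: int = 3,
                block_size: int = BLOCK_SIZE) -> Dict[int, List[str]]:
    """Return a toy NameNode-style mapping: block index -> DataNodes holding it.

    Replicas are placed round-robin; real HDFS placement also considers racks
    and node load, which this sketch ignores.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    n_blocks = max(1, -(-file_size // block_size))  # ceiling division
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


if __name__ == "__main__":
    # A hypothetical 300 MB traffic file spread over the three slave hosts.
    for block, nodes in plan_blocks(300 * 1024 * 1024,
                                    ["Slave1", "Slave2", "Slave3"]).items():
        print(block, nodes)
```

The returned dictionary plays the role of the metadata kept by the NameNode, while the listed host names stand in for the DataNodes that would hold each block replica.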

MapReduce is a parallel programming model capable of processing large-scale data sets. It can execute parallel computing tasks on a Hadoop cluster consisting of hundreds of ordinary PCs, and its job execution flow is shown in FIG. 5. MapReduce distributes data analysis or modeling tasks to the data nodes, which carry out subtasks such as analysis, mining, and computation of traffic big data. It abstracts a parallel computation running on a large-scale cluster into two stages, Map (mapping) and Reduce (reduction). The Map stage decomposes the whole computing task into several sub-tasks; in essence it maps a set of key-value pairs <key₁, value₁> to a new set of intermediate key-value pairs <key₂, value₂>. The Reduce stage receives the output of the Map function, aggregates the values that share the same key across the output results, and emits the result as key-value pairs <key₃, value₃>. The Map stage and the Reduce stage may be repeated.
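As a minimal illustration of the Map and Reduce stages, the following plain-Python sketch (not Hadoop code) sums vehicle counts per road segment in a single process. The record layout, the segment identifiers, and the function names `map_record`, `reduce_counts`, and `run_job` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input records: (record_id, "segment_id,vehicle_count").
Record = Tuple[int, str]


def map_record(key1: int, value1: str) -> Iterator[Tuple[str, int]]:
    """Map: turn <key1, value1> into intermediate <key2, value2> pairs."""
    segment_id, count = value1.split(",")
    yield segment_id, int(count)


def reduce_counts(key2: str, values2: Iterable[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values sharing one key into <key3, value3>."""
    return key2, sum(values2)


def run_job(records: Iterable[Record]) -> Dict[str, int]:
    """Run map, group by key (the shuffle step), then reduce, in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in map_record(key1, value1):
            groups[key2].append(value2)
    return dict(reduce_counts(k, v) for k, v in groups.items())


if __name__ == "__main__":
    sample = [(0, "S1,12"), (1, "S2,7"), (2, "S1,5")]
    print(run_job(sample))
```

On a real cluster the map calls would run in parallel on the data nodes and the grouping step would be carried out by the framework; here both are collapsed into one loop to keep the data flow visible.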

2) Spark real-time computing platform

Hadoop/MapReduce is a batch processing framework that is good at offline analysis of historical traffic big data but is not suitable for analysis and prediction of real-time traffic data. Spark is a distributed big data computing framework based on in-memory computing; it can deliver data analysis and prediction results faster, at the cost of higher memory consumption. The invention therefore establishes a Hadoop + Spark system architecture that combines offline analysis with online prediction. Through Spark it provides a one-stop traffic prediction solution based on resilient distributed datasets (RDDs). The Spark architecture supports fast computation on real-time traffic data, interactive ad-hoc queries on historical traffic patterns, stream computing, and so on. Through a consistent application programming interface (API) and a common deployment scheme, all processing parts are integrated seamlessly in memory and cooperate to complete the overall task of the system, which avoids excessive network and disk I/O overhead during computation and improves the real-time performance of traffic flow prediction in the big data context.

The Spark real-time traffic flow data analysis and prediction system mainly comprises streaming analysis of real-time traffic flow data, job and task scheduling, memory management, Spark SQL, Spark MLlib, and so on, as shown in FIG. 6. The client submits analysis and prediction jobs on real-time traffic data through the Spark driver. The resource management layer provides Spark with efficient traffic data management and data sharing through YARN, improving the overall resource utilization of the system. Real-time traffic data are received and analyzed through Spark stream computing: the stream is processed in micro-batches that form resilient distributed datasets (RDDs), and in-memory computing improves the real-time performance of traffic data analysis. Spark SQL and MLlib let users query historical traffic operation patterns, build comprehensive query fields, and establish rapid prediction models.
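The micro-batch idea can be sketched in plain Python, without the Spark API. In this sketch, timestamped readings are grouped into fixed-length batches, each batch is reduced to a total flow, and the next value is estimated from recent batches. The reading layout, the function names, and in particular the moving-average predictor are assumptions: this document does not specify the prediction model, so the average is only a placeholder.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical reading: (timestamp in seconds, vehicles counted).
Reading = Tuple[float, int]


def micro_batches(readings: Iterable[Reading], interval: float) -> Dict[int, List[int]]:
    """Group readings into consecutive fixed-length batches keyed by batch index."""
    batches: Dict[int, List[int]] = defaultdict(list)
    for ts, count in readings:
        batches[int(ts // interval)].append(count)
    return dict(batches)


def batch_totals(batches: Dict[int, List[int]]) -> List[int]:
    """Total flow per batch, ordered by batch index."""
    return [sum(batches[i]) for i in sorted(batches)]


def predict_next(totals: List[int], window: int = 3) -> float:
    """Placeholder predictor: mean of the last `window` batch totals."""
    if not totals:
        raise ValueError("at least one batch total is required")
    recent = totals[-window:]
    return sum(recent) / len(recent)


if __name__ == "__main__":
    stream = [(0.5, 3), (4.0, 2), (5.1, 4), (9.9, 1), (10.0, 6)]
    totals = batch_totals(micro_batches(stream, interval=5.0))
    print(totals, predict_next(totals))
```

In a Spark deployment each batch would become an RDD processed in memory across the cluster, and the placeholder predictor would be replaced by a model trained on historical data.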

Drawings

FIG. 1 is the overall Hadoop + Spark analysis and prediction system employed in the present invention;

FIG. 2 is an illustration of the Hadoop + Spark ecosystem employed in the present invention;

FIG. 3 is a Hadoop + Spark cluster architecture employed by the present invention;

FIG. 4 is a flow chart of HDFS traffic data read-write adopted by the present invention;

FIG. 5 is a MapReduce job execution flow adopted by the present invention;

FIG. 6 is a Spark real-time traffic flow data analysis and prediction system employed in the present invention;

Detailed Description

In order to make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are further described below with reference to the accompanying drawings. The following examples serve only to illustrate the technical solutions of the invention and do not limit its scope of protection.

The platform established by the invention runs in fully distributed mode, and VMware virtualization technology is used to enhance its distributed storage and parallel computing capability. The platform construction process is explained in three key stages: an early preparation stage, a Hadoop installation and configuration stage, and a Hadoop + Spark installation and startup stage.

The early preparation stage mainly completes the setup of the running environment, the setup of the cluster nodes, and the preparation of the related software. Because Hadoop's supported operating system is Unix, whereas most applications currently run on Windows, and so as not to affect existing applications, this embodiment builds Unix virtual machines on top of the Windows environment. A small cluster is built from 4 networked ordinary personal computers (Intel(R) Core(TM) i5-3210M CPU @ 2.50 GHz, 4 GB memory, 64-bit Windows 7 Ultimate). The host names and corresponding IP addresses are set as in Table 1, the hosts file of each host is modified, and clock synchronization, the network environment, password-free login, and firewall shutdown are configured. The relevant software and versions are shown in Table 2.

TABLE 1 Cluster node setup

TABLE 2 Software versions and main functions

In the Hadoop software installation and configuration stage, correct installation and configuration of each software version is the key to building the platform. Because the platform involves distributed coordination among several hosts, the technical difficulty and the requirements are high, so the key steps of the construction process are emphasized here.

1) Virtual machine and JDK installation.

The VMware Workstation 10 virtual machine software, the CentOS 7 Linux operating system, and the platform development kit JDK 1.7 are installed in turn on the Master, Slave1, Slave2, and Slave3 hosts, and the Java environment variables are configured with gedit.

2) Hadoop installation and configuration.

Install Hadoop 2.6.4 and configure the environment variables in hadoop-env.sh and yarn-env.sh.

Configure core-site.xml to set the storage location and port of HDFS and the file cache.

Configure hdfs-site.xml to set the addresses and ports of the NameNode and DataNode for distributed storage of traffic data files, and set the number of file replicas to 3.

Configure yarn-site.xml so that Hadoop resources are managed uniformly.

Configure mapred-site.xml to set the management node of the MapReduce distributed computing architecture.
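The configuration steps above can be sketched as a small Python script that renders Hadoop-style configuration documents. The sketch assumes that the four files are the standard Hadoop 2.x site files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml. The property names are standard Hadoop properties, the host name Master and the replica count 3 come from this embodiment, and the port 9000 and the temporary directory are placeholders that do not appear in this document. The helper name `build_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_site_xml(properties: Dict[str, str]) -> str:
    """Render a Hadoop-style <configuration> document from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed values: the host name "Master" comes from the embodiment; the port
# and the temporary directory are placeholders, not taken from the patent.
SITE_FILES = {
    "core-site.xml": {
        "fs.defaultFS": "hdfs://Master:9000",
        "hadoop.tmp.dir": "/tmp/hadoop",
    },
    "hdfs-site.xml": {"dfs.replication": "3"},
    "yarn-site.xml": {"yarn.resourcemanager.hostname": "Master"},
    "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
}

if __name__ == "__main__":
    for filename, props in SITE_FILES.items():
        print(filename)
        print(build_site_xml(props))
```

Generating the files from one dictionary keeps the settings of all nodes identical, which matters because every host in the cluster must see the same NameNode address and replica count.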

Hadoop is started with start-dfs.sh and start-yarn.sh, and the jps command is run on the Master and on the three Slaves to check the system. When the Master shows the four processes SecondaryNameNode, Jps, NameNode, and ResourceManager, and each Slave shows the three processes Jps, DataNode, and NodeManager, the Hadoop cluster is correctly installed and can start normally.

3) Hadoop + Spark installation and start-up

Installation at this stage must be carried out after Hadoop has been installed successfully, and the Hadoop platform must start normally. The Hadoop ecosystem comprises a series of big data storage, analysis, and transfer tools; in this embodiment Spark 1.6.2, Scala 2.11.8, HBase 1.2.2, MySQL 5.7.14, Mahout 0.10.0, and Sqoop 1.99.7 are installed and deployed in turn.

First, Scala 2.11.8, the development language for Spark, is installed, and the Scala environment variables are configured.

Second, Spark 1.6.2 is installed, and the Spark environment variables and the spark-env.sh file are configured.

Finally, the other Hadoop ecosystem software is installed and configured. The environment variables and configuration files must be modified after HBase 1.2.2 and Sqoop 1.99.7 are installed; the implementation details and related configuration files are not described further here. Every node in the Hadoop cluster needs to perform these operations, so the installation is copied to each Slave node and source /etc/profile is run to make the configuration take effect, which completes the construction and deployment of the whole traffic big data analysis platform. Hadoop and Spark are started (namely with ./start-all.sh) on the master node (Master) and the slave nodes (Slave1, Slave2, and Slave3), and jps is run on the master and slave nodes. When the Master shows the five processes SecondaryNameNode, Jps, NameNode, ResourceManager, and Master, and each Slave shows the four processes Jps, DataNode, NodeManager, and Worker, the Hadoop + Spark cluster is successfully installed and can start normally as the traffic big data analysis and prediction platform.
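The process check described above can be automated with a short Python sketch that compares jps output against the expected daemons for each role. The expected process names are those listed in this embodiment (Jps itself is excluded); the function names `parse_jps` and `missing_daemons` and the process IDs in the sample output are hypothetical.

```python
from typing import Dict, Set

# Expected daemons per role, as listed in the embodiment (Jps itself excluded).
EXPECTED: Dict[str, Set[str]] = {
    "master": {"NameNode", "SecondaryNameNode", "ResourceManager", "Master"},
    "slave": {"DataNode", "NodeManager", "Worker"},
}


def parse_jps(output: str) -> Set[str]:
    """Extract process names from jps output lines of the form '<pid> <name>'."""
    names = set()
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(role: str, jps_output: str) -> Set[str]:
    """Return the expected daemons for `role` that are absent from the output."""
    return EXPECTED[role] - parse_jps(jps_output)


if __name__ == "__main__":
    # Made-up sample: a slave where the Spark Worker has not started yet.
    sample = "2101 DataNode\n2230 NodeManager\n2412 Jps"
    print(missing_daemons("slave", sample))
```

An empty result means that every expected daemon is running on that node; any names returned point to the service that failed to start.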

The invention discloses a traffic prediction platform based on big data analysis: a Hadoop + Spark traffic prediction system and method based on a combination of offline analysis and online prediction. The working mechanism of the MapReduce distributed architecture of the Hadoop cloud platform, the principle of the Hadoop Distributed File System (HDFS), and the working process of Spark based on in-memory computing are studied, and a comprehensive traffic big data processing platform based on the Hadoop + Spark architecture is established to meet the strong real-time requirement of traffic flow prediction. The platform is optimized for the characteristics of traffic big data so that the data can be analyzed quickly, and a traffic data preprocessing method based on the big data platform is studied to provide high-quality data support for traffic characteristic analysis and traffic flow prediction.

Finally, it should be noted that the above embodiments serve only to illustrate the technical solutions of the invention and not to limit them. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications to the specific embodiments or equivalent substitutions of some technical features may be made without departing from the spirit of the technical solutions of the invention, and all such modifications and substitutions shall be covered by the technical solutions of the invention.

Claims

1. A Hadoop + Spark traffic prediction system based on a combination of offline analysis and online prediction, characterized by comprising storage and management of Hadoop platform traffic big data (1), Spark system real-time traffic data analysis and traffic flow prediction (2), and traffic flow prediction application (3), wherein the whole system extends from low-level traffic data acquisition to high-level traffic flow prediction application and comprises four parts, namely a traffic big data source, the storage and management of Hadoop platform traffic big data (1), the Spark system real-time traffic big data analysis and traffic flow prediction (2), and the traffic flow prediction application (3).

2. The Hadoop + Spark traffic prediction system based on a combination of offline analysis and online prediction as claimed in claim 1, characterized in that a traffic flow prediction architecture combining Hadoop + Spark offline analysis with online stream processing is constructed; the storage and management of Hadoop platform traffic big data (1) is adopted, and the Hadoop/MapReduce distributed computing framework is used to analyze and process historical traffic data, mine the deep knowledge contained in the data, and find rules hidden in the data, such as residents' daily travel behaviors, travel modes, and urban dynamic features; the Spark system is then used for real-time analysis of traffic big data and traffic flow prediction (2); and the results are finally applied to traffic prediction applications (3) such as traffic guidance, traffic signal control, and traffic information services.

3. A Hadoop + Spark traffic prediction method based on a combination of offline analysis and online prediction, used to realize the traffic prediction and application of the platform of claim 1, characterized by comprising the following steps:

1) a large amount of traffic data is accumulated using a vehicle tracking and positioning system based on Radio Frequency Identification (RFID), the Global Positioning System (GPS), traffic monitoring videos, social media, mobile phone applications, induction coils, buckles, microwaves, radar monitoring, and the like;

2) the storage and management of Hadoop platform traffic big data (1) is adopted: unstructured traffic file data are first classified by directory, file attributes are then managed by a metadata management method, and the files are managed uniformly through the HDFS distributed file system; real-time, large-volume, continuous traffic information, such as real-time trajectory data and monitoring video data, is organized and managed with Tachyon in the Spark system; part of the regular traffic pattern information obtained by processing and mining is stored in the relational database MySQL, which facilitates seamless access by most application development; most unstructured traffic data that have undergone compilation, reanalysis, classification, correlation calculation, and related conversion are stored in the non-relational database HBase; by organizing and managing the different forms of traffic big data in this way, convenient expansion, deletion, migration, and classified storage of traffic data are realized, traffic data access with different requirements is satisfied, and optimized storage and fast query of data are achieved;

3) historical traffic data are analyzed and processed with the Hadoop/MapReduce distributed computing framework to mine the deep knowledge contained in the data and find the rules hidden in it, such as residents' daily travel behaviors, travel modes, and urban dynamic characteristics, and traffic big data are analyzed and computed in real time with the Spark system to realize short-term prediction of traffic flow;

4) the short-term traffic flow prediction information is used to realize applications such as a traffic guidance system, a traffic signal control system, a real-time road condition forecasting system, real-time road network planning and road network map updating, traffic supply and demand analysis, traffic anomaly detection, intelligent electronic parking, short-term traffic congestion prediction, and travel information services.