CN109271371A

CN109271371A - A Spark-based distributed layered big data analysis and processing model

Info

Publication number: CN109271371A
Application number: CN201810956427.7A
Authority: CN
Inventors: 宋泊东; 张立臣
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2018-08-21
Filing date: 2018-08-21
Publication date: 2019-01-25
Anticipated expiration: 2038-08-21
Also published as: CN109271371B

Abstract

The invention discloses a Spark-based distributed layered big data analysis and processing model comprising a presentation layer (PT), a front-end switching layer (FST), a back-end switching layer (BST), a real-time business logic layer (RBLT), a non-real-time business logic layer (NRBLT), and a data access layer (DAT). The proposed model can effectively reduce the analysis time for massive data and supports heterogeneous information exchange and data storage among the subsystems of the system. It is sufficient to meet the demand for short-term trend forecasting in high-frequency trading markets, and it has high application value in high-frequency big data processing systems.

Description

A Spark-based distributed layered big data analysis and processing model

Technical field

The present invention relates to the field of big data analysis and processing, and more particularly to a Spark-based distributed layered big data analysis and processing model.

Background technique

Big data can help users gain insight and improve decision-making at a higher level, from a broader perspective, and over a larger scope. However, the value in big data is often hidden: its value density is extremely low, its distribution is highly irregular, the information is deeply concealed, and useful value is extremely difficult to find. In high-frequency trading (HFT) on the stock market, for example, short-lived market trends and rapid quotations make it hard to decide in time when to buy or sell, which places very high demands on the accuracy and speed of big data analysis.

Summary of the invention

The purpose of the present invention is to address one or more of the above defects by proposing a Spark-based distributed layered big data analysis and processing model.

To achieve the above purpose, the following technical solution is adopted:

A Spark-based distributed layered big data analysis and processing model comprises a presentation layer (PT), a front-end switching layer (FST), a back-end switching layer (BST), a real-time business logic layer (RBLT), a non-real-time business logic layer (NRBLT), and a data access layer (DAT). The presentation layer (PT) exchanges data with the front-end switching layer (FST); the output of the FST is connected to the input of a medium; the medium exchanges data with the BST; the output of the BST is connected to the input of the RBLT and to the input of the NRBLT; and the outputs of the RBLT and the NRBLT are both connected to the input of the DAT.

Preferably, the presentation layer (PT) obtains data from the BLT and uses a Facade service to handle all requests from users to the back-end cluster.

Preferably, the front-end switching layer (FST) further includes front-end servers deployed on nodes; the FST is responsible for receiving web requests and passing them to the Facade through the Kafka messaging system.

Preferably, the front-end server is a front-end server on which MongoDB is deployed; the MongoDB data is sent to Kafka through the front-end switching layer (FST) so that it does not have to enter the back-end cluster.

Preferably, the back-end switching layer (BST) obtains messages from Kafka, and information is transmitted between the front-end servers and the back-end switching layer through the BST ingress interface.

Preferably, the real-time business logic layer (RBLT) further includes a presentation node and a docking center; the presentation node exchanges data with the medium through a spout, and the docking center exchanges data with the medium through a bolt.

Preferably, the non-real-time business logic layer (NRBLT) is used to store decision strategies; the decision strategies are stored in MongoDB, and an interface for fast access to large datasets is obtained using R programs and Spark RDDs.

Preferably, the data access layer (DAT) includes a real-time data repository, an exchange center, a baseline, and a data warehouse; the real-time data repository provides real-time data access to the exchange center.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The invention proposes a Spark-based distributed layered big data analysis and processing model that can effectively reduce the analysis time for massive data and supports heterogeneous information exchange and data storage among the subsystems of the system. It is sufficient to meet the demand for short-term trend forecasting in high-frequency trading markets, and it has high application value in high-frequency big data processing systems.

Brief description of the drawings

Fig. 1 is the distributed architecture diagram of the system;

Fig. 2 is the structure diagram of the real-time business logic layer;

Fig. 3 is the state center topology diagram;

Fig. 4 is the original design diagram;

Fig. 5 is the HFT topology diagram;

Fig. 6 shows the average computation time of the market state calculation;

Fig. 7 shows the number of market states computed per second.

Specific embodiment

The drawings are for illustrative purposes only and shall not be construed as limiting this patent.

The present invention is further described below with reference to the drawings and embodiments.

Embodiment 1

A Spark-based distributed layered big data analysis and processing model, shown in Fig. 1, comprises a presentation layer (PT), a front-end switching layer (FST), a back-end switching layer (BST), a real-time business logic layer (RBLT), a non-real-time business logic layer (NRBLT), and a data access layer (DAT). The presentation layer (PT) exchanges data with the front-end switching layer (FST); the output of the FST is connected to the input of a medium; the medium exchanges data with the BST; the output of the BST is connected to the input of the RBLT and to the input of the NRBLT; and the outputs of the RBLT and the NRBLT are both connected to the input of the DAT. This architecture evolved from a three-tier distributed architecture. We split the business logic layer into a real-time business logic layer and a non-real-time business logic layer. In addition, we use two levels of message-passing middleware to meet the high-frequency requirements of the whole system.
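The wiring between the layers can be illustrated with a minimal, single-process sketch. It is not the patented implementation: the `Medium` class is an in-memory queue standing in for the message medium, the `realtime` flag used to choose between the RBLT and the NRBLT is an assumption made only for illustration (the text does not say how the BST routes messages), and all names are hypothetical.

```python
from collections import deque


class Medium:
    """In-memory FIFO queue standing in for the medium between FST and BST."""

    def __init__(self):
        self._queue = deque()

    def publish(self, message):
        self._queue.append(message)

    def poll(self):
        # Return the oldest message, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None


class DataAccessLayer:
    """Stand-in for the DAT: records what each logic layer hands over."""

    def __init__(self):
        self.records = []

    def write(self, source, payload):
        self.records.append((source, payload))


def front_end_switch(request, medium):
    """FST: accept a request from the presentation layer and forward it."""
    medium.publish(request)


def back_end_switch(medium, dat):
    """BST: drain the medium and route each message to RBLT or NRBLT.

    Routing on a 'realtime' flag is an assumption of this sketch.
    """
    while (message := medium.poll()) is not None:
        layer = "RBLT" if message["realtime"] else "NRBLT"
        dat.write(layer, message["body"])


medium, dat = Medium(), DataAccessLayer()
front_end_switch({"realtime": True, "body": "quote X 101.5"}, medium)
front_end_switch({"realtime": False, "body": "train strategy"}, medium)
back_end_switch(medium, dat)
print(dat.records)
```

Both logic layers write to the same `DataAccessLayer` object, mirroring the statement that the outputs of the RBLT and the NRBLT are both connected to the input of the DAT.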

Presentation layer (PT): this layer obtains data from the business logic layer (BLT) and prepares web pages for presentation to users browsing online. To speed up loading and reduce access latency, this embodiment uses a Facade service to handle all requests from users to the back-end cluster, which also makes the architecture more loosely coupled.

Front-end switching layer (FST): this layer is responsible for receiving web requests and passing them to the Facade through the Kafka messaging system. It includes the front-end servers deployed on the nodes. For operating efficiency, this embodiment deploys MongoDB on the front-end servers; its data passes through the front-end switching layer and is sent to Kafka without having to enter the back-end cluster for processing.

In this embodiment, the front-end server is a front-end server on which MongoDB is deployed; the MongoDB data is sent to Kafka through the front-end switching layer (FST) so that it does not have to enter the back-end cluster.
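One possible reading of this arrangement is that a front-end server answers what it can from its local store and only forwards the remaining requests to the message queue. The sketch below illustrates that reading under stated assumptions: a plain `dict` stands in for MongoDB, a `deque` stands in for a Kafka topic, and the request fields (`type`, `key`) are hypothetical.

```python
from collections import deque


class FrontEndServer:
    """Front-end server with a local document store (dict standing in for MongoDB)."""

    def __init__(self, local_store, outbound_queue):
        self.local_store = local_store        # stand-in for MongoDB
        self.outbound_queue = outbound_queue  # stand-in for a Kafka topic

    def handle(self, request):
        """Answer reads from the local store; forward everything else."""
        if request["type"] == "read" and request["key"] in self.local_store:
            # Served locally, so the back-end cluster is never involved.
            return self.local_store[request["key"]]
        self.outbound_queue.append(request)
        return None


queue = deque()
server = FrontEndServer({"strategy:42": {"action": "hold"}}, queue)
hit = server.handle({"type": "read", "key": "strategy:42"})
miss = server.handle({"type": "order", "key": "order:7"})
```

Here `hit` is returned directly from the local store, while the order request ends up on the queue for the back-end switching layer to pick up.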

In this embodiment, the back-end switching layer (BST) obtains messages from Kafka, and information is transmitted between the front-end servers and the back-end switching layer through the BST ingress interface.

In this embodiment, the real-time business logic layer (RBLT) further includes a presentation node and a docking center; the presentation node exchanges data with the medium through a spout, and the docking center exchanges data with the medium through a bolt. The RBLT is the key component of a high-frequency system and is mainly responsible for processing and computing real-time data. It includes two important services: data analysis and decision-making. For example, a stock trading price prediction system needs a Storm topology to process the real-time quotation stream quickly and store it in HBase. The market state, that is, the signal to buy or sell, is then calculated on HDFS. As shown in Fig. 2, the user terminal and the trading platform were originally divided into two topologies, and the market state was computed and passed to the user through the Kafka messaging system. To improve efficiency and speed, we merged the two topologies and replaced the Kafka messaging middleware with Netty, achieving high-frequency processing and transmission of information. In a Storm topology, Netty is about 10 times as fast as Kafka.
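The text does not specify how the buy or sell signal is derived. As a purely illustrative example, the sketch below uses a moving-average crossover, a common rule in which a short-window average above a long-window average suggests buying. The rule itself, the window sizes, and the function names are all assumptions rather than the method of this embodiment.

```python
def moving_average(prices, window):
    """Mean of the last `window` prices, or None if there are too few."""
    if len(prices) < window:
        return None
    return sum(prices[-window:]) / window


def trade_signal(prices, short_window=3, long_window=5):
    """Return 'buy', 'sell', or 'hold' from a moving-average crossover."""
    short_ma = moving_average(prices, short_window)
    long_ma = moving_average(prices, long_window)
    if short_ma is None or long_ma is None:
        return "hold"  # not enough history yet
    if short_ma > long_ma:
        return "buy"
    if short_ma < long_ma:
        return "sell"
    return "hold"


rising = [10.0, 10.2, 10.4, 10.9, 11.5]
falling = [11.5, 10.9, 10.4, 10.2, 10.0]
print(trade_signal(rising), trade_signal(falling))
```

In a streaming setting the same function would be called each time a new quotation arrives, with `prices` holding the most recent quotations for one instrument.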

In this embodiment, the function of the non-real-time business logic layer (NRBLT) is to derive decision strategies from the results computed on users' big data. The decision strategies are stored in MongoDB so that users can access them quickly from the front-end nodes. Using R programs and Spark RDDs provides an interface for fast access to large datasets.

In this embodiment, the data access layer (DAT) includes a real-time data repository, an exchange center, a baseline, and a data warehouse; the real-time data repository provides real-time data access to the exchange center. The DAT is used to access all data from databases or external data sources. For example, the DAT provides an order interface and can combine big data information into a K-Bar, and it feeds external data back to the user immediately through the unified Kafka message-passing middleware.
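Assuming that a K-Bar is a candlestick bar (open, high, low, close, and volume over a fixed interval), combining raw quotations into K-Bars can be sketched as follows. The tick layout `(timestamp_seconds, price, volume)`, the 60-second interval, and the names `KBar` and `build_kbars` are hypothetical choices for this illustration.

```python
from dataclasses import dataclass


@dataclass
class KBar:
    start: int    # start of the interval, in seconds
    open: float
    high: float
    low: float
    close: float
    volume: int


def build_kbars(ticks, interval_seconds=60):
    """Aggregate time-ordered (timestamp, price, volume) ticks into K-Bars."""
    bars = []
    for timestamp, price, volume in ticks:
        bucket = timestamp - timestamp % interval_seconds
        if bars and bars[-1].start == bucket:
            # Tick belongs to the bar currently being built.
            bar = bars[-1]
            bar.high = max(bar.high, price)
            bar.low = min(bar.low, price)
            bar.close = price
            bar.volume += volume
        else:
            # First tick of a new interval opens a new bar.
            bars.append(KBar(bucket, price, price, price, price, volume))
    return bars


ticks = [(0, 10.0, 100), (20, 10.5, 50), (59, 9.8, 70), (60, 10.1, 30)]
bars = build_kbars(ticks)
```

The first three ticks fall into the interval starting at 0 and the fourth opens the interval starting at 60, so `bars` holds two K-Bars.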

Using the above model framework, we developed a simulated stock-trading big data analysis and decision-making example to analyze the algorithm flow. The online trading platform provides real-time stock quotations and market states. To meet the requirements of machine learning and fast computation, we first treat the online trading center as a single topology to achieve low algorithm latency. Fig. 3 shows the entire state center topology.

In the topology, KafkaSpout receives real-time quotations from the external RealtimeDataPublisher service, which continuously sends real-time market trading prices through the distributed messaging system Kafka. KafkaSpout then passes the prices to the 18 ComputeStateBolt instances that follow. Each ComputeStateBolt has its own computation logic and uses it to calculate the market state defined by a specific TA logic. The 18 ComputeStateBolt instances then send the resulting market states to the WriteDataBolt for the corresponding TA, and each WriteDataBolt writes its TA data to HBase. For example, a ComputeStateBolt sends its market state to MAWriteDataBolt, which is dedicated to storing the MA market state. Outside the topology, all black lines denote transmission through the Kafka and Netty distributed messaging systems.
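The fan-out described above (one spout, many compute bolts, one writer per TA) can be sketched in a single process without Storm. This is only an illustration of the data flow: the two TA rules are invented for the example, two bolts stand in for the 18 in the text, a `dict` stands in for HBase, and the class names merely echo those in the description rather than reproduce any real Storm API.

```python
def ma_state(prices):
    """Hypothetical TA rule: is the latest price above the running mean?"""
    return "above" if prices[-1] > sum(prices) / len(prices) else "below"


def momentum_state(prices):
    """Hypothetical TA rule: is the latest price above the first one seen?"""
    return "up" if prices[-1] > prices[0] else "down"


class ComputeStateBolt:
    """Keeps the price history and applies one TA rule to it."""

    def __init__(self, name, logic):
        self.name, self.logic, self.history = name, logic, []

    def execute(self, price):
        self.history.append(price)
        return self.name, self.logic(self.history)


class WriteDataBolt:
    """Appends market states to a table (dict standing in for HBase)."""

    def __init__(self, table):
        self.table = table

    def execute(self, name, state):
        self.table.setdefault(name, []).append(state)


def run_topology(quotes, compute_bolts, writers):
    for price in quotes:            # the spout emits one quotation at a time
        for bolt in compute_bolts:  # every compute bolt sees every quotation
            name, state = bolt.execute(price)
            writers[name].execute(name, state)


store = {}
bolts = [ComputeStateBolt("MA", ma_state), ComputeStateBolt("MOM", momentum_state)]
writers = {bolt.name: WriteDataBolt(store) for bolt in bolts}
run_topology([10.0, 10.4, 10.1], bolts, writers)
print(store)
```

In a real Storm deployment the bolts would run in parallel on different workers; the nested loop here only preserves the logical routing from spout to compute bolt to writer.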

The purpose of analyzing high-frequency trading market data is to obtain market and price states for users. We therefore need machine learning algorithms that learn from historical market data and then help construct investment strategies according to the patterns of historical trends. To address fast learning on large datasets, a Plan Center running on the Apache Spark framework executes the machine learning algorithms, loading large historical datasets from HBase and learning from and analyzing them in a short time. The Plan Center supports support vector machines (SVM), logistic regression (LR), and classification. Through Spark RDDs, the Plan Center can load hundreds of gigabytes of market state data into memory and compute on multiple nodes of the cluster, helping users obtain trading strategies. Once the trading strategies are generated, users can select on the web page the investment decision to use. The large-dataset processing model framework is shown in Fig. 4.
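To make the learning step concrete, the following sketch trains a logistic regression model with plain stochastic gradient descent on a toy history. It deliberately avoids Spark so that it runs anywhere; it is not the Plan Center implementation, and the single feature (a recent return), the labels, and the hyperparameters are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - y  # gradient of the log loss w.r.t. the logit
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (price expected to rise) or 0 (expected to fall)."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Toy history: one feature (recent return); label 1 means the price rose next.
history = [[-0.8], [-0.5], [-0.2], [0.2], [0.5], [0.8]]
rose = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(history, rose)
print(predict(weights, bias, [0.6]), predict(weights, bias, [-0.6]))
```

On a cluster, the per-sample loop would be replaced by a distributed computation over partitions of the historical data, which is the role the text assigns to Spark RDDs.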

To reduce the overhead of big data analysis, processing, and transmission, we merged the state center topology and the trading topology described above into one large topology, forming a large HFT system as shown in Fig. 5.

The integrated HFT system pulls data from the Kafka queue and writes it to HBase and MongoDB. Fig. 5 shows the whole HFT topology. By integrating the online trading center with the user terminal, the time spent on information transmission can be shortened to a few milliseconds. However, because of the complexity of a large trading system architecture, efficient cluster resource management is needed to effectively improve the algorithm's computing speed. We therefore use YARN to start most services and manage all resources on the cluster, and we monitor the status of each node and Hadoop service on Cloudera through customized Hadoop service configuration, thereby providing cluster services for large datasets.

Because of the high-frequency and real-time data processing requirements, the trading center needs to calculate millions of market states per second. We therefore built a simulated experimental environment to compare algorithm performance for different numbers of traded futures. We prepared 8 computers as a cluster, 6 of which run the Storm topology as managers. The experimental environment is shown in Table 1.

Table 1. Details of the cluster

To test the peak efficiency of the architecture and find the most suitable cluster configuration, we compared the average computation time of all market states across the experiments.

To check algorithm performance, we added a new bolt named ExpStateReceiverBolt to the original topology. It quickly collects all computation metrics for the market states so that algorithm performance can be tested by computing their average. Fig. 6 shows the performance results, and Fig. 7 shows the number of market states for N stocks.
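The measurement idea can be sketched as a small collector that records how long each market state computation took and reports the mean. Only the name `ExpStateReceiverBolt` comes from the text; its interface here, the `timed_compute` helper, and the use of `max` as a stand-in computation are assumptions made for illustration.

```python
import time


class ExpStateReceiverBolt:
    """Collects per-computation durations and reports their average."""

    def __init__(self):
        self.durations = []

    def execute(self, started_at, finished_at):
        self.durations.append(finished_at - started_at)

    def average(self):
        # Average duration in seconds; 0.0 when nothing has been recorded.
        if not self.durations:
            return 0.0
        return sum(self.durations) / len(self.durations)


def timed_compute(receiver, compute, *args):
    """Run one computation and report its wall-clock duration to the receiver."""
    started = time.perf_counter()
    result = compute(*args)
    receiver.execute(started, time.perf_counter())
    return result


receiver = ExpStateReceiverBolt()
for prices in ([1.0, 2.0], [3.0, 1.0], [2.0, 2.5]):
    timed_compute(receiver, max, prices)  # max() stands in for a state computation
print(len(receiver.durations), receiver.average())
```

The printed average depends on the machine, so no expected value is shown; what matters is that one duration is recorded per computed market state.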

Obviously, the above embodiment is merely an example given to illustrate the present invention clearly and does not limit the embodiments of the present invention. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A Spark-based distributed layered big data analysis and processing model, characterized by comprising a presentation layer (PT), a front-end switching layer (FST), a back-end switching layer (BST), a real-time business logic layer (RBLT), a non-real-time business logic layer (NRBLT), and a data access layer (DAT); wherein the presentation layer (PT) exchanges data with the front-end switching layer (FST); the output of the front-end switching layer (FST) is connected to the input of a medium; the medium exchanges data with the back-end switching layer (BST); the output of the back-end switching layer (BST) is connected to the input of the real-time business logic layer (RBLT) and to the input of the non-real-time business logic layer (NRBLT); and the outputs of the real-time business logic layer (RBLT) and the non-real-time business logic layer (NRBLT) are both connected to the input of the data access layer (DAT).

2. The Spark-based distributed layered big data analysis and processing model according to claim 1, characterized in that the presentation layer (PT) obtains data from the BLT and uses a Facade service to handle all requests from users to the back-end cluster.

3. The Spark-based distributed layered big data analysis and processing model according to claim 1, characterized in that the front-end switching layer (FST) further includes front-end servers deployed on nodes, and the front-end switching layer (FST) is responsible for receiving web requests and passing them to the Facade through the Kafka messaging system.

4. The Spark-based distributed layered big data analysis and processing model according to claim 3, characterized in that the front-end server is a front-end server on which MongoDB is deployed, and the MongoDB data is sent to Kafka through the front-end switching layer (FST) so that it does not have to enter the back-end cluster.

5. The Spark-based distributed layered big data analysis and processing model according to claim 1, characterized in that the back-end switching layer (BST) obtains messages from Kafka, and information is transmitted between the front-end servers and the back-end switching layer through the BST ingress interface.

6. The Spark-based distributed layered big data analysis and processing model according to claim 1, characterized in that the real-time business logic layer (RBLT) further includes a presentation node and a docking center; the presentation node exchanges data with the medium through a spout; and the docking center exchanges data with the medium through a bolt.

7. The Spark-based distributed layered big data analysis and processing model according to claim 1, characterized in that the non-real-time business logic layer (NRBLT) is used to store decision strategies; wherein the decision strategies are stored in MongoDB, and an interface for fast access to large datasets is obtained using R programs and Spark RDDs.

8. The Spark-based distributed layered big data analysis and processing model according to claim 1, characterized in that the data access layer (DAT) includes a real-time data repository, an exchange center, a baseline, and a data warehouse; wherein the real-time data repository provides real-time data access to the exchange center.