CN105677836A

CN105677836A - Big data processing and solving system simultaneously supporting offline data and real-time online data

Info

Publication number: CN105677836A
Application number: CN201610005212.8A
Authority: CN
Inventors: 许丹霞; 刘寅; 汪伟; 郑宇�
Original assignee: Beijing Huishang Rongtong Information Technology Co Ltd
Current assignee: Beijing Huishang Rongtong Information Technology Co Ltd
Priority date: 2016-01-05
Filing date: 2016-01-05
Publication date: 2016-06-15

Abstract

The invention discloses a big data processing and solving system simultaneously supporting offline data and real-time online data. The system comprises a data collecting module, a preprocessing module, a distributed storage module, a distributed real-time flow calculating module, an offline data processing module, a database, a data comprehensive analysis and query module, a comprehensive showing module and a uniform configuration center. The big data processing and solving system can process the real-time data and the offline data, and is timely in processing and high in processing efficiency.

Description

A big data processing system simultaneously supporting offline data and real-time online data

Technical field

The present invention relates to a big data processing scheme, and in particular to a complete big data processing system that simultaneously supports offline data and real-time online data.

Background art

As technology develops, there is a growing need to build complex, low-latency processing systems. Neither of the two available tools fully solves the problem on its own: scalable but high-latency batch processing systems handle historical data, while low-latency stream processing systems cannot reprocess results. Connecting the two tools, however, makes a workable solution possible.

The Hadoop framework brought batch data processing, but real-time processing of web-scale big data remains a challenge. Many technologies can be used to build such a complete data processing system, but selecting suitable tools and orchestrating them is complex and laborious.

Summary of the invention

In view of the above, the present invention proposes a complete big data processing scheme that simultaneously supports offline data and real-time online data, comprising:

1. A configurable data acquisition module that can collect data from multiple data sources and introduces a distributed fault detection mechanism to improve the stability and reliability of data acquisition.

2. A configurable data preprocessing module that reads configuration information from the unified configuration center and loads the corresponding processing program.

3. A distributed file storage module with an improved algorithm: a method for assessing node performance is proposed and the HDFS storage algorithm is improved, so that mass data can be stored more quickly, efficiently, and accurately.

4. A high-performance real-time data processing module that adopts the Storm distributed stream processing framework to process massive real-time data and stores the results in the database in real time.

5. A high-performance offline data processing module that adopts the Hadoop MapReduce programming model and proposes a task allocation algorithm based on inference of dynamic node performance, improving the performance and stability of the offline data processing module.

6. A highly customizable comprehensive display module that provides query services based on a web container and visualizes analysis results with ECharts. Users can define their own layout by dragging, customize personalized display pages, and use linkage and drill-down between charts. The module also provides an interface for maintaining the unified configuration center.

To achieve the purpose of the present invention, the following technical solutions are adopted:

A big data processing system simultaneously supporting offline data and real-time online data, comprising:

a data acquisition module, a data preprocessing module, a distributed storage module, a distributed real-time stream computing module, an offline data processing module, a database, a comprehensive data analysis and query module, a comprehensive display module, and a unified configuration center;

Wherein:

The data acquisition module reads configuration information from the unified configuration center, reads data from a relational database according to this configuration information, and imports the data into the distributed file storage module; it receives the processing requests sent by the application cluster and supplies the received request data directly to the distributed real-time stream computing module; and it sends the application cluster log files to a local disk for storage and backup;

The data preprocessing module reads configuration information from the unified configuration center, reads the application log files stored on the local disk, preprocesses the data and stores the result on the local disk, and uploads the files to the distributed file storage module;

The distributed storage module stores mass data;

The distributed real-time stream computing module reads data from the data acquisition module, reads the configuration information of the unified configuration center, performs real-time computation, and stores the results in the database; the offline data processing module processes the data stored in the distributed file storage module and writes each indicator to the database once it has been computed;

The database stores data;

The comprehensive data analysis and query module accesses the database and provides query interfaces for the various indicators;

The comprehensive display module provides query services based on a web container and visualizes the analysis results;

The unified configuration center is used to configure the application cluster.

In the big data processing system, preferably: the data acquisition module includes a message middleware module, which receives the processing requests sent by the application cluster and supplies the received data directly to the distributed real-time stream computing module; the message middleware module also sends the application cluster log files to the local disk for storage and backup.

In the big data processing system, preferably: the preprocessing performed by the data preprocessing module includes cleaning and reducing the data and compressing data of the same category.

In the big data processing system, preferably, the distributed file storage module includes: storage nodes and a node performance evaluation module;

Wherein:

(1) The node performance evaluation module evaluates the performance of each server in the application cluster and generates a dynamic node performance reference file, which is updated periodically as required; the evaluation of server node performance in the cluster covers the server's CPU processing capability, memory performance, and disk I/O performance;

(2) When a user uploads a file, the node performance evaluation module first calculates the ratio of a storage node's performance value to the sum of the performance values of all nodes, and then uses this ratio to determine the proportion of the cluster's total stored data that this node can store.

In the big data processing system, preferably:

The performance value P_i of server node i is described by the following function, where C_i denotes the CPU performance value, M_i the memory performance value, D_i the disk I/O performance value, and W_i the network I/O performance value:

P_i = αC_i + βM_i + γD_i + δW_i

α + β + γ + δ = 1

In the above formulas, the four parameters α, β, γ, and δ are the weights that express how strongly each indicator influences server performance.

A big data processing method simultaneously supporting offline data and real-time online data, comprising:

reading configuration information from the unified configuration center, reading data from a relational database according to this configuration information, and importing the data into the distributed file storage module; receiving the processing requests sent by the application cluster and supplying the received request data directly to the distributed real-time stream computing module; and sending the application cluster log files to a local disk for storage and backup;

reading configuration information from the unified configuration center, reading the application log files stored on the local disk, preprocessing the data and storing the result on the local disk, and uploading the files to the distributed file storage module;

reading data from the data acquisition module, reading the configuration information of the unified configuration center, performing real-time computation, and storing the results in the database; the offline data processing module processes the data stored in the distributed file storage module and writes each indicator to the database once it has been computed.

In the big data processing method, preferably: preprocessing the data includes cleaning and reducing the data and compressing data of the same category.

Brief description of the drawings

Fig. 1 is a schematic diagram of the big data processing system simultaneously supporting offline data and real-time online data provided by the present invention;

Fig. 2 is a schematic diagram of the improved scheduling algorithm of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, the big data processing system simultaneously supporting offline data and real-time online data includes: a data acquisition module, a data preprocessing module, a unified configuration center, a distributed file storage module, a distributed real-time stream computing module, an offline data processing module, a database, a comprehensive data analysis and query module, and a comprehensive display module.

Data acquisition module:

(1) It reads configuration information from the unified configuration center and, by means of periodically scheduled jobs, incrementally imports data from a relational database (e.g., MySQL or Oracle) into the distributed file storage module, such as HDFS. For example, the user information table, production schedule, and so on stored in an Oracle database are imported; these serve as base data that is combined with the log data for analysis, computation, and other processing during subsequent log processing. To export the above data, the data acquisition module reads from the configuration center the database connection information, the tables to export from, the export mode (full/incremental), the start time of the export, the data types, and so on.

(2) It includes a message middleware module, which may be the WebSphere MQ message middleware. This middleware module receives the processing requests sent by the application cluster and supplies the received data directly to the distributed real-time stream computing module for real-time computation. Through this message middleware module, the log file of each application (the application cluster log files) is sent to the local disk for storage and backup. This part does not process or modify the data in any way, ensuring that the data is stored intact. The log files are supplied directly to the data preprocessing module. The acquisition module may also include a scheduling module and a synchronization task management module: the synchronization task management module synchronizes the data collected from the database to HDFS, and the scheduling module runs the above data acquisition on a timer.

(3) The data acquisition module is a distributed system with a multi-node structure, and data must be transmitted between different nodes, so node failures, system process failures, excessive node load, and similar situations can occur, all of which cause data loss. To ensure safe data transmission, the present invention also proposes a distributed fault detection framework for the data acquisition module. In this framework, the data sources collected by the data acquisition module comprise two classes of nodes: a master node and controlled nodes. A monitoring management server acts as the master node, and each application node acts as a controlled node. Monitoring and control of each application is performed on its application node, and the monitoring master node exchanges data with each controlled node by heartbeat communication. Each controlled node must send heartbeat data periodically, reporting the status of the application on that node, which may include the application name, storage location, IP address, and node status information such as the current CPU load. When a controlled node does not send heartbeat data within a heartbeat period, the node is judged to have temporarily failed and an alarm is raised, which helps the responsible personnel clear the fault as soon as possible and improves the fault tolerance and reliability of data transmission. In addition, an administrator can modify a controlled node's configuration through a web interface, and the monitoring node notifies the controlled node to update its configuration via heartbeat communication.
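The heartbeat rule in item (3) can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the node names, the 5-second period, and the status fields are all assumptions made for illustration; the patent does not specify a data structure or a period.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Master-side record of the last heartbeat seen from each controlled node."""
    period_s: float                                  # heartbeat period (assumed value)
    last_seen: dict = field(default_factory=dict)    # node name -> time of last heartbeat
    status: dict = field(default_factory=dict)       # node name -> last reported status

    def on_heartbeat(self, node: str, now: float, info: dict) -> None:
        # A controlled node reports its status (application name, CPU load, ...).
        self.last_seen[node] = now
        self.status[node] = info

    def failed_nodes(self, now: float) -> list:
        # A node with no heartbeat within one period is judged temporarily failed.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.period_s)


monitor = HeartbeatMonitor(period_s=5.0)
monitor.on_heartbeat("app-1", now=0.0, info={"app": "orders", "cpu_load": 0.4})
monitor.on_heartbeat("app-2", now=0.0, info={"app": "search", "cpu_load": 0.7})
monitor.on_heartbeat("app-1", now=4.0, info={"app": "orders", "cpu_load": 0.5})
print(monitor.failed_nodes(now=6.0))  # app-2 has been silent for longer than one period
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic; a real monitor would presumably raise an alarm for each node returned by `failed_nodes`.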

Data preprocessing module: it reads configuration information from the unified configuration center, reads the log files of each application that were sent to the local disk by the message middleware module, and starts the processing program according to the configuration information. The data is de-duplicated, cleaned, and reduced, abnormal records are handled, data of the same category is compressed, and the results are sorted by business and stored on the local disk. The files are then uploaded to the distributed file storage module HDFS. The data preprocessing module may include a log collection module for uploading log files; the log collection module may be a Flume system.
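The preprocessing steps named above (de-duplication, cleaning, handling of abnormal records, and per-category compression) might look as follows in a minimal sketch. The `category|payload` line format, the function name `preprocess`, and the use of gzip are assumptions; the patent does not describe the log format or the compression method.

```python
import gzip
from collections import defaultdict


def preprocess(lines):
    """Deduplicate, clean, and group log lines by category, then gzip each group."""
    seen = set()
    groups = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line in seen:        # cleaning (blank lines) and de-duplication
            continue
        seen.add(line)
        category, sep, payload = line.partition("|")
        if not sep:                         # abnormal record: no category field
            category, payload = "abnormal", line
        groups[category].append(payload)
    # Compress the records of each category together.
    return {c: gzip.compress("\n".join(p).encode("utf-8")) for c, p in groups.items()}


logs = ["order|id=1", "order|id=1", "  ", "pay|id=9", "garbage", "order|id=2"]
packed = preprocess(logs)
print(sorted(packed))
```

Each value in `packed` is one compressed blob per category, which could then be written to the local disk and uploaded.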

Distributed file storage module: Hadoop's HDFS distributed file storage system may be adopted. The present invention proposes a method for assessing node performance and uses it to improve the storage algorithm of the distributed file storage module HDFS. It is implemented as follows. The distributed file storage module includes storage nodes and a node performance evaluation module (NameNode):

(1) The node performance evaluation module evaluates the performance of each server in the cluster and generates a dynamic node performance reference file, which is updated periodically as required. When evaluating the performance of server nodes in the Hadoop cluster, the present invention focuses mainly on the server's CPU processing capability, memory performance, disk I/O performance, and network I/O performance. The performance value P_i of a server node i can be described by the following function, where C_i denotes the CPU performance value, M_i the memory performance value, D_i the disk performance value, and W_i the network performance value:

P_i = αC_i + βM_i + γD_i + δW_i

α + β + γ + δ = 1

In the above formulas, the four parameters α, β, γ, and δ are the weights that express how strongly each indicator influences server performance, and the weights differ between application scenarios. For example, when the Hadoop cluster is used for data mining, CPU performance dominates. In practice, therefore, the weight of each parameter must be adjusted to the specific application scenario. Once the values of the four parameters α, β, γ, and δ have been fixed, the performance value P_i of each node is obtained by testing with performance benchmarking tools.

(2) When a user uploads a file, the NameNode must select, according to some algorithm, the nodes that will store the file's data blocks. In its performance evaluation, the node performance evaluation module assigns each node a performance value. It first calculates the ratio of a node's performance value to the sum of the performance values of all nodes, and then uses this ratio to determine the proportion of the cluster's total stored data that this node can store. When a file is stored, each node thus stores a share of the data blocks that corresponds to its performance.
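Steps (1) and (2) can be sketched together: compute P_i for each node from the weighted formula, then derive each node's storage share as P_i divided by the sum of all performance values. The benchmark scores, node names, and weights below are made-up illustrative values, and the largest-remainder rounding used to obtain whole block counts is an assumption not described in the patent.

```python
def performance_value(cpu, mem, disk, net, weights):
    """P_i = alpha*C_i + beta*M_i + gamma*D_i + delta*W_i, with weights summing to 1."""
    alpha, beta, gamma, delta = weights
    if abs(alpha + beta + gamma + delta - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma + delta must equal 1")
    return alpha * cpu + beta * mem + gamma * disk + delta * net


def storage_shares(perf):
    """Fraction of the cluster's stored data each node takes: P_i / sum of all P_j."""
    total = sum(perf.values())
    return {node: p / total for node, p in perf.items()}


def blocks_per_node(perf, n_blocks):
    """Turn shares into whole block counts (largest-remainder rounding, an assumption)."""
    exact = {n: s * n_blocks for n, s in storage_shares(perf).items()}
    counts = {n: int(x) for n, x in exact.items()}
    leftover = n_blocks - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


# Hypothetical benchmark scores (CPU, memory, disk I/O, network I/O) per storage node.
benchmarks = {
    "dn1": (90, 80, 70, 60),
    "dn2": (50, 40, 30, 40),
    "dn3": (20, 10, 10, 20),
}
weights = (0.4, 0.2, 0.2, 0.2)  # CPU-heavy weighting, e.g. for a data mining workload
perf = {n: performance_value(*b, weights) for n, b in benchmarks.items()}
print(storage_shares(perf))
print(blocks_per_node(perf, n_blocks=10))
```

With these values the performance scores are 78, 42, and 16, so the strongest node receives roughly 57% of the data and the weakest roughly 12%.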

Distributed real-time stream computing module: to meet timeliness requirements, it is implemented with Storm, an open-source big data processing cluster system. It reads data from the message middleware module, reads the configuration information of the unified configuration center, performs real-time computation according to the configuration information, and stores the results in a database such as HBase or an Oracle database.

Offline data processing module: it processes the mass data in the distributed file system, writes each indicator (such as the day's ranking of goods by order volume or the ranking of sales by product category) to the database once computed, and stores the processed mass data in the distributed file system. By studying the task scheduling mechanism of MapReduce, the present invention proposes a task allocation algorithm based on inference of dynamic node performance that is suitable for heterogeneous environments, in order to improve the performance and stability of offline data processing. The offline processing module includes a task allocation node and task processing nodes.

The present invention uses a node's data processing rate to represent its performance. The data processing rate of a node is the amount of data the node processes per unit time. In a heterogeneous Hadoop cluster, the data processing rate accurately reflects the performance differences among nodes.

When processing data, the offline data processing module allocates tasks as follows:

(1) When the task allocation node is to allocate a task, it performs a comprehensive calculation that combines the performance of each task processing node, the amount of data to be transferred, and the network performance of each node, and through this analysis selects the most suitable node to run the task. In a default Hadoop cluster, the node on which a task runs is random, whereas in the improved algorithm of the present invention the node that runs a task is selected according to node performance and load, which improves Hadoop's performance and stability to some extent.

The specific algorithm flow is shown in Fig. 2. The task allocation node first obtains the performance values of the nodes in the cluster through the node dynamic performance inference module and, through heartbeat communication with the task processing nodes, obtains the other node information and builds a node state list. When the task allocation node starts to allocate Reduce tasks, it allocates them actively. In descending order of node performance, the task allocation node queries the node information in the node state list, then queries each node's load, and assigns one Reduce task to the best-performing node that has idle Reduce task capacity. It then selects further nodes in the same order to run tasks, according to the number of tasks that need to be run.
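The allocation flow above can be sketched as follows, using the data processing rate (data volume per unit time) as the node performance value. The `NodeState` fields, node names, and numbers are hypothetical, and the behavior when more tasks remain than nodes (starting another pass over the ranked list) is an assumption; the patent only states that nodes are chosen in turn by performance and idle Reduce capacity.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """One entry of the node state list built from heartbeat reports."""
    name: str
    bytes_processed: int     # data volume handled in the observation window
    seconds: float           # length of the observation window
    free_reduce_slots: int   # idle Reduce task capacity

    @property
    def rate(self) -> float:
        # Data processing rate: amount of data processed per unit time.
        return self.bytes_processed / self.seconds


def assign_reduce_tasks(nodes, n_tasks):
    """Assign Reduce tasks one per node per pass, best-performing nodes first."""
    ranked = sorted(nodes, key=lambda n: n.rate, reverse=True)
    free = {n.name: n.free_reduce_slots for n in ranked}
    plan = []
    while len(plan) < n_tasks:
        # Nodes that still have idle Reduce capacity, best performer first.
        available = [n for n in ranked if free[n.name] > 0]
        if not available:
            break                      # cluster has no idle capacity left
        for n in available:
            if len(plan) == n_tasks:
                break
            free[n.name] -= 1
            plan.append(n.name)
    return plan


nodes = [
    NodeState("a", bytes_processed=1000, seconds=10.0, free_reduce_slots=1),
    NodeState("b", bytes_processed=4000, seconds=20.0, free_reduce_slots=0),
    NodeState("c", bytes_processed=3000, seconds=20.0, free_reduce_slots=2),
]
plan = assign_reduce_tasks(nodes, n_tasks=3)
print(plan)
```

Node `b` is the fastest but fully loaded, so it is skipped; `c` and then `a` each receive one task, and the third task goes back to `c`, which still has an idle slot.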

Comprehensive data analysis and query module: it accesses databases such as HBase and Oracle and provides query interfaces for the various indicators. This module also provides a maintenance interface for the unified configuration center.

Comprehensive display module: it provides query services based on a web container and visualizes analysis results, offering intuitive, vivid, interactive, and personalizable data visualization charts. Innovative features such as drag-and-drop recalculation, data view, and value-range roaming greatly enhance the user experience and give users the ability to mine and integrate data. The comprehensive display module also provides an interface for maintaining the unified configuration center, mainly to configure information such as data source types, acquisition server addresses, acquisition strategies, and preprocessing strategies.

According to the present invention, mass data can be processed both offline and in real time, meeting users' needs for batch processing and timeliness at once. Concrete schemes for improving system performance are also proposed, making the storage and processing of mass data more efficient. Configuration is done at the system level, and much of the work can be completed through page configuration. Display pages are personalized, providing a better user experience.

Claims

1. A big data processing system simultaneously supporting offline data and real-time online data, characterized in that it includes a data acquisition module, a data preprocessing module, a distributed storage module, a distributed real-time stream computing module, an offline data processing module, a database, a comprehensive data analysis and query module, a comprehensive display module, and a unified configuration center;

Wherein:

The data acquisition module reads configuration information from the unified configuration center, reads data from a relational database according to this configuration information, and imports the data into the distributed file storage module; it receives the processing requests sent by the application cluster and supplies the received request data directly to the distributed real-time stream computing module; and it sends the application cluster log files to a local disk for storage and backup;

The data preprocessing module reads configuration information from the unified configuration center, reads the application log files stored on the local disk, preprocesses the log file data and stores the result on the local disk, and uploads the preprocessed log file data to the distributed storage module;

The distributed storage module stores mass data;

The distributed real-time stream computing module reads data from the data acquisition module, reads the configuration information of the unified configuration center, performs real-time computation on the data read from the data acquisition module according to this configuration information, and stores the results in the database;

The offline data processing module processes the data stored in the distributed storage module and writes each indicator to the database once it has been computed;

The database stores data;

The unified configuration center is used to configure the application cluster.

2. The big data processing system according to claim 1, characterized in that: the data acquisition module includes a message middleware module, which receives the processing requests sent by the application cluster and supplies the received request data directly to the distributed real-time stream computing module; the message middleware module also sends the application cluster log files to the local disk for storage and backup.

3. The big data processing system according to claim 1, characterized in that: the preprocessing performed by the data preprocessing module includes cleaning and reducing the data and compressing data of the same category.

4. The big data processing system according to claim 1, characterized in that the distributed storage module includes: storage nodes and a node performance evaluation module;

Wherein:

(1) The node performance evaluation module evaluates the performance of each server in the application cluster and generates a dynamic node performance reference file, which is updated periodically as required; the evaluation of server node performance in the cluster covers the server's CPU processing capability, memory performance, disk I/O performance, and network I/O performance;

5. A big data processing method simultaneously supporting offline data and real-time online data, characterized in that it includes:

reading configuration information from the unified configuration center, reading data from a relational database according to this configuration information, and importing the data into the distributed file storage module; receiving the processing requests sent by the application cluster and supplying the received request data directly to the distributed real-time stream computing module; and sending the application cluster log files to a local disk for storage and backup;

reading configuration information from the unified configuration center, reading the application log files stored on the local disk, preprocessing the log file data and storing the result on the local disk, and uploading the preprocessed log file data to the distributed file storage module;

reading data from the data acquisition module, reading the configuration information of the unified configuration center, performing real-time computation on the data read from the data acquisition module according to this configuration information, and storing the results in the database; the offline data processing module processes the data stored in the distributed file storage module and writes each indicator to the database once it has been computed.

6. The big data processing method according to claim 5, characterized in that: preprocessing the data includes cleaning and reducing the data and compressing data of the same category.