CN110209646A

CN110209646A - A data platform system based on real-time streaming computation

Info

Publication number: CN110209646A
Application number: CN201910397951.XA
Authority: CN
Inventors: 钟证业; 毕军; 胡雨成; 胥小燕
Original assignee: Huitongda Network Co Ltd
Current assignee: Huitongda Network Co Ltd
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2019-09-06

Abstract

The invention discloses a data platform system based on real-time streaming computation. The system obtains and stores the behavioral data information of predetermined data and converts the predetermined data into a series of short data streams, enabling high-throughput, fault-tolerant processing of real-time streaming data. By converting the behavioral data information into a series of short batch jobs and parsing it to extract key information, the system turns large volumes of data into micro-batches and computes them quickly in a distributed manner. This achieves high throughput and low latency, meets the high-timeliness requirements of data delivery systems, and suits latency-sensitive business.

Description

A data platform system based on real-time streaming computation

Technical field

The present invention relates to data warehouse design technology in the field of computers and the Internet, and in particular to a data platform system based on real-time streaming computation.

Background art

Real-time streaming computation segments a data stream into a series of short batch jobs and passes the segmented stream to a batch-processing engine, where the data is cleaned, extracted, and transformed, and the processing results are kept in memory.

A traditional, standardized data warehouse relies mainly on Oracle mechanisms such as scheduled tasks, triggers, and partitioned tables to store data and perform near-real-time statistical analysis. In both performance and functionality, it can no longer meet users' demands for timely and accurate data processing and analysis.

With the rapid development of the Internet, the era of big data has arrived and data volumes are growing explosively. Traditional data processing is no longer suited to analyzing massive data, and a new data processing method is needed to handle complex business logic in real time.

Summary of the invention

The technical problem to be solved by the present invention is to establish a mature real-time streaming computing system that provides data services quickly, effectively, and accurately. To solve this technical problem, the present invention provides a data platform system based on real-time streaming computation, comprising a data source module, a real-time data computing module, a data dashboard display module, and a job scheduling management platform;

Wherein the data source module is responsible for configuring data source connection information and extracting the data that the business needs computed and synchronized in real time;

The real-time data computing module is responsible for real-time data computation and storage, and for the resource management and scheduling of each node;

The data dashboard display module is used for interactive analysis;

The job scheduling management platform is responsible for configuring workflows, keeping real-time computation scheduling tasks running normally, and scheduling workflows on a timer, thereby ensuring the accuracy of the data.

The data source module comprises a data volume module and a data origin module;

The data volume module determines whether the data must be initialized. When the data volume is below the 100,000-row (10W) level, the data only needs to be synchronized by loading the full table, and no initialization is required;

The data origin module covers structured data, for example Oracle databases, MySQL databases, structured files, and Hive tables, together with the data that the business needs computed and synchronized in real time, for example sales amount, sales orders, member count, and voucher count;
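
The decision made by the data volume module can be sketched as follows. This is only an illustration: the function name and the returned labels are hypothetical, and the 100,000-row threshold is the one figure taken from the text.

```python
# Threshold taken from the text ("10W", i.e. 100,000 rows); all names are illustrative.
FULL_LOAD_THRESHOLD_ROWS = 100_000


def choose_sync_mode(row_count: int, threshold: int = FULL_LOAD_THRESHOLD_ROWS) -> str:
    """Return the synchronization strategy for a source table of the given size."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count < threshold:
        # Small table: reload the whole table on every synchronization run.
        return "full_load"
    # Large table: one initial full extraction, then incremental synchronization.
    return "initialize_then_incremental"
```

Under these assumptions, a table of 1.2 million rows would take the initialization path, while a table of 50,000 rows would simply be reloaded in full on each run.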

The real-time data computing module comprises a real-time data synchronization module, a YARN distributed management system, a data storage module, and a data computation module;

The real-time data synchronization module comprises a data volume module and a data origin module;

The functions of this data volume module and data origin module are identical to those of the corresponding modules in the data source module;

The YARN distributed management system schedules the resources of each node in the cluster to achieve efficient resource management. For example, when data needs to be stored, the master node ResourceManager of the YARN distributed management system allocates resources appropriately according to the resource requests of the slave nodes (NodeManager);

The data storage module uses Spark SQL micro-batch processing to fetch data and synchronize it to HDFS (Hadoop Distributed File System), holding it in memory; after processing, the final data resides on the nodes of the cluster;

The data computation module registers the data held in cluster memory as temporary tables and performs logical computation with SQL statements. For example, after data has been extracted from the data source into Hive in real-time micro-batches, SQL statements filter the data and join it with other tables for computation, yielding the required indicator data;
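
The "register a temporary table, then compute with SQL" pattern can be illustrated with a small, self-contained sketch. Python's built-in sqlite3 stands in for Spark SQL here so that the example runs anywhere; the table names, the organization lookup table, and the 'PAID' status value are assumptions made for illustration and do not come from the patent.

```python
import sqlite3


def compute_paid_sales_by_org(orders, orgs):
    """Register rows as temporary tables and compute an indicator with SQL.

    sqlite3 is only a stand-in for Spark SQL; table and column names are illustrative.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TEMP TABLE orders_tmp "
            "(orderno TEXT, orgid TEXT, totalprice REAL, orderstatus TEXT)"
        )
        conn.execute("CREATE TEMP TABLE orgs_tmp (orgid TEXT, orgname TEXT)")
        conn.executemany("INSERT INTO orders_tmp VALUES (?, ?, ?, ?)", orders)
        conn.executemany("INSERT INTO orgs_tmp VALUES (?, ?)", orgs)
        # Filter the data, join it with another table, and aggregate the indicator.
        return conn.execute(
            """
            SELECT g.orgname, SUM(o.totalprice) AS sales
            FROM orders_tmp AS o
            JOIN orgs_tmp AS g ON g.orgid = o.orgid
            WHERE o.orderstatus = 'PAID'
            GROUP BY g.orgname
            ORDER BY g.orgname
            """
        ).fetchall()
    finally:
        conn.close()
```

In an actual Spark deployment the same idea would be expressed by registering a DataFrame as a temporary table and issuing the query through the SQL context; the sketch only shows the shape of the filter, join, and aggregate step.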

The data dashboard display module comprises an interactive analysis module, a report display module, and a permission control module;

The interactive analysis module performs logical computation on data sets and analyzes and processes them along different dimensions and themes. For example, the data of one table can be aggregated and its indicators computed daily along the time dimension, or divided by province along the regional dimension, producing different data sets that feed different reports;

The report display module presents data through different charts, for example bar charts, line charts, and text boxes.

The permission control module controls the permission to view and modify each report, granting each business party the corresponding permissions.
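
Analysis of one data set along different dimensions, as described for the interactive analysis module, amounts to grouping the same rows by different keys. The following minimal sketch assumes rows are plain dictionaries; the field names `day`, `province`, and `sales` are hypothetical.

```python
from collections import defaultdict


def aggregate_by_dimension(rows, dimension, measure):
    """Sum `measure` for each distinct value of `dimension` in a list of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)
```

Calling the function once with `dimension="day"` and once with `dimension="province"` on the same rows yields two different data sets, each of which could back a separate report.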

The job scheduling management platform module is used for near-real-time scheduling of Spark tasks;

The near-real-time scheduling of Spark tasks comprises Workflow configuration, Coordinator configuration, and workflow monitoring management. Workflows can be scheduled in parallel or serially, and a failed workflow can be re-run: workflow scheduling can be restarted and executed again;

The Workflow configuration is used to schedule and run Spark tasks. Each Spark task requires one job to be configured, and jobs of the same category are configured in one workflow. When multiple workflows are scheduled in parallel, the failure of one Spark task does not affect the other tasks running in the same workflow;

The Coordinator configuration manages the timed scheduling of workflows; it requires specifying the corresponding workflow, the scheduling time, and the scheduling frequency;

The workflow monitoring management monitors the state and run time of each Spark task.

The data storage module uses Spark SQL micro-batch processing to fetch data and synchronize it to HDFS. The data synchronized to HDFS is a DataFrame (a structured data set), which can be registered as a temporary table and then operated on with SQL.

The YARN distributed management system is responsible for cluster resource management and scheduling: it stores the data of the data source module in a distributed manner on the HDFS nodes and schedules the resources of each node.

The data stored in a distributed manner on HDFS is mapped to GreenPlum external tables.

The data in the GreenPlum external tables undergoes logical computation in stored procedures and is inserted into GreenPlum internal tables.

Performing logical computation on the data of the GreenPlum external tables through stored procedures and inserting it into GreenPlum internal tables comprises the following steps:

Step 1, extract the data of the data source module in micro-batches using Spark SQL;

Step 2, register the extracted data as a temporary table;

Step 3, insert the data of the temporary table into a Hive external table;

Step 4, map the data from the Hive external table into a GreenPlum external table;

Step 5, a stored procedure performs logical computation (carried out with SQL statements) on the data of the GreenPlum external table and inserts the result into a GreenPlum internal table;

Step 6, write the data of the internal table into a data set and display it in reports.
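
The six steps can be walked through end to end in a toy form. In this sketch, sqlite3 tables and a view stand in for the Hive external table, the GreenPlum external table, and the GreenPlum internal table, and an ordinary SQL statement stands in for the stored procedure. All table names and the aggregation itself are assumptions chosen for illustration; a real deployment would use the actual Hive and GreenPlum mechanisms.

```python
import sqlite3


def run_pipeline(source_rows):
    """Walk the six steps with sqlite3 objects standing in for Hive and GreenPlum."""
    conn = sqlite3.connect(":memory:")
    try:
        # Step 1: a micro-batch already extracted from the data source (passed in).
        # Step 2: register the extracted rows as a temporary table.
        conn.execute("CREATE TEMP TABLE batch_tmp (orderno TEXT, orgid TEXT, totalprice REAL)")
        conn.executemany("INSERT INTO batch_tmp VALUES (?, ?, ?)", source_rows)
        # Step 3: insert the temporary table's rows into the stand-in Hive external table.
        conn.execute("CREATE TABLE hive_ext (orderno TEXT, orgid TEXT, totalprice REAL)")
        conn.execute("INSERT INTO hive_ext SELECT * FROM batch_tmp")
        # Step 4: expose the same rows as the stand-in GreenPlum external table (a view).
        conn.execute("CREATE VIEW gp_ext AS SELECT * FROM hive_ext")
        # Step 5: the "stored procedure": SQL logic that fills the internal table.
        conn.execute(
            "CREATE TABLE gp_internal (orgid TEXT PRIMARY KEY, sales REAL, orders INTEGER)"
        )
        conn.execute(
            "INSERT INTO gp_internal "
            "SELECT orgid, SUM(totalprice), COUNT(*) FROM gp_ext GROUP BY orgid"
        )
        # Step 6: read the internal table as the data set behind a report.
        return conn.execute(
            "SELECT orgid, sales, orders FROM gp_internal ORDER BY orgid"
        ).fetchall()
    finally:
        conn.close()
```

The view in step 4 is used deliberately: like an external-table mapping, it exposes the underlying rows without copying them, so only step 5 materializes new data.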

The present invention has the following advantages:

Spark SQL is used to obtain and store the behavioral data information of predetermined data, and the predetermined data is converted into a series of short data streams, enabling high-throughput, fault-tolerant processing of real-time streaming data. Large volumes of data are converted into micro-batches and computed quickly in a distributed manner, achieving high throughput and low latency;

The YARN distributed management system schedules the resources of each node of the cluster, maximizing resource utilization;

The distributed data storage system uses the disk space of every server over the network, combining the scattered storage resources into one virtual storage device. Data is stored across the network in a distributed manner, giving high resource efficiency and high safety;

Real-time data computation converts the data into RDD data sets and caches them in memory. Because the data sets are used frequently, this reduces the I/O operations, network transfers, and recomputation time for intermediate results, greatly speeds up the applications, and achieves near-real-time data synchronization and computation;

The interactive analysis module can specify different dimensions and themes according to different requirements in order to produce the corresponding reports;

The permission control module controls permissions for different users and different reports. Permissions can be set on users and reports as required, so that each user sees only the reports relevant to them;

Job monitoring: the running status of a job can be viewed in real time by job, job name, and state (ready, succeeded, running, failed, alarm), and a job can be re-run with the reset button.
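
The job states listed for monitoring, together with the reset button, suggest a small state machine. The sketch below is one possible reading: the five state names come from the text, while the class name and the transition rules (for example, that a running job cannot be reset) are assumptions.

```python
from enum import Enum


class JobState(Enum):
    READY = "ready"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ALARM = "alarm"


class MonitoredJob:
    """Tracks the state of one job; the transition rules are assumptions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = JobState.READY

    def start(self) -> None:
        if self.state is not JobState.READY:
            raise RuntimeError(f"{self.name} cannot start from {self.state.value}")
        self.state = JobState.RUNNING

    def finish(self, ok: bool) -> None:
        if self.state is not JobState.RUNNING:
            raise RuntimeError(f"{self.name} is not running")
        self.state = JobState.SUCCEEDED if ok else JobState.FAILED

    def reset(self) -> None:
        """Mimic the reset button: put a finished job back into the ready state."""
        if self.state is JobState.RUNNING:
            raise RuntimeError(f"{self.name} is still running")
        self.state = JobState.READY
```

A failed job would thus go through `reset()` and `start()` again, which corresponds to re-running the job from the monitoring view.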

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and specific embodiments, from which the above and other advantages of the invention will become more apparent.

Fig. 1 is a structural diagram of the system of the present invention.

Specific embodiment

The present invention will be further described with reference to the accompanying drawings and embodiments.

As shown in Fig. 1, the invention discloses a data platform system based on real-time streaming computation, a system for near-real-time data synchronization and computation, comprising a data source module, a real-time data computing module, a data dashboard display module, and a job scheduling management platform;

Data source module: data volume and data origin;

Real-time data computing module: real-time data synchronization module, YARN distributed management system, data storage, data computation;

Data dashboard display module: interactive analysis module, report display module, permission control module;

Job scheduling management platform: Workflow configuration, Coordinator configuration, workflow monitoring management;

1. Data source module

Responsible for configuring data source connection information and extracting the data that the business needs computed and synchronized in real time;

1.1 Data volume

When the data volume of the data source is small, all of the data can be extracted on every run for synchronization and computation. When the data volume is large, all of the data is first extracted, synchronized, and computed once as an initialization; in subsequent synchronization and computation, only a recent period of time is re-extracted (a look-back window), achieving real-time data synchronization;
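
The look-back behavior for large tables can be expressed as the choice of a time window for each run. This is a minimal sketch under stated assumptions: the function name, the `initialized` flag, and the convention that a start of `None` means "extract everything" are all hypothetical.

```python
from datetime import datetime, timedelta


def extraction_window(now: datetime, lookback: timedelta, initialized: bool):
    """Return (start, end) of the time range to extract; start is None for a full extraction."""
    if lookback <= timedelta(0):
        raise ValueError("lookback must be positive")
    if not initialized:
        # First run on a large table: extract everything up to now.
        return (None, now)
    # Later runs: re-extract only the most recent period.
    return (now - lookback, now)
```

Re-extracting a whole recent period, rather than only rows newer than the last run, lets late updates to recent records be picked up at the cost of reprocessing some rows.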

1.2 Data origin

Data origins comprise structured data, for example Oracle databases, MySQL databases, structured files, and Hive tables;

2. Real-time data computing module

Responsible for real-time data computation and storage, and for the resource management and scheduling of each node;

3. Real-time data synchronization module

The real-time data synchronization module uses Spark SQL to process, in micro-batches, the subset of fields in the data source that the indicators require, and synchronizes them to HDFS in real time.

4. YARN distributed management system

The YARN distributed management system consists of the ResourceManager and the ApplicationMaster. The ResourceManager is responsible for the resource management and scheduling of the entire cluster, while the ApplicationMaster handles application-level matters such as task scheduling, task monitoring, and fault tolerance;

5. Data storage

Data storage consists of data caching and data persistence. Data is extracted to HDFS and cached in server memory, and the final data is stored on the nodes of the servers.

6. Data computation

The data cached in server memory is registered as temporary tables through DataFrames, and complex logical computation is performed with SQL statements. Hive external tables are combined with GreenPlum external tables to solve both the slow loading of massive data and the inconsistency of data types between systems: data is mapped from Hive external tables into GreenPlum external tables, and stored procedures are called to insert the data into the GreenPlum internal tables of the computation-result layer. For example, a stored procedure called on a GreenPlum external table performs complex logical computation with SQL;

7. Data dashboard display module

Responsible for the interactive analysis of data, report display, and report permission control;

8. Interactive analysis module

The interactive analysis module classifies the data in a data set along different dimensions and themes, thereby completing the analysis;

9. Report display module

The report display module consists of two parts, report creation and report display. When creating a report, the indicator data in a data set can be dragged and dropped directly and displayed as bar charts, line charts, text boxes, and similar forms, and the page refresh interval can be set according to different requirements;

10. Permission control module

The permission control module sets each user's permission to view reports. A user can view only the reports related to that user; other reports do not appear in the interface available to the user;
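
Report-level permission control reduces to filtering the report list by the current user before anything is displayed. The sketch below assumes a simple in-memory mapping from report name to per-action user sets; this data structure and the function name are illustrative only.

```python
def reports_for(user: str, permissions: dict, action: str = "view") -> list:
    """Return the sorted names of the reports on which `user` holds `action`.

    `permissions` maps a report name to a dict such as
    {"view": {"alice", "bob"}, "edit": {"alice"}}.
    """
    return sorted(
        name
        for name, grants in permissions.items()
        if user in grants.get(action, set())
    )
```

Because reports without a matching grant are dropped before rendering, they never appear in the user's interface, which matches the behavior described above.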

11. Job scheduling management platform

Responsible for configuring workflows, keeping real-time computation scheduling tasks running normally, and scheduling workflows on a timer, thereby ensuring the accuracy of the data;

12. Workflow configuration

Workflows are configured in Oozie. Each job is configured with one Spark program, and one workflow can contain multiple jobs that are processed concurrently. The data source connection package and the project's jar package must be imported during configuration. When multiple jobs are processed concurrently, their states do not affect one another;
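
The property that concurrently processed jobs do not affect one another's state can be illustrated independently of Oozie. The following sketch uses Python's standard thread pool as a stand-in for the scheduler; the function name and the status strings are assumptions, and an actual Oozie workflow would express this through its own workflow definition.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(jobs):
    """Run every job concurrently and record each job's own outcome.

    `jobs` maps a job name to a zero-argument callable.
    """
    def run_one(name, fn):
        try:
            fn()
            return name, "succeeded"
        except Exception as exc:  # a failure stays local to this job
            return name, f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(run_one, name, fn) for name, fn in jobs.items()]
        return dict(future.result() for future in futures)
```

Each job's exception is caught inside its own wrapper, so one failing job is reported as failed while the remaining jobs still run to completion.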

13. Coordinator configuration

Coordinators are configured in Oozie: first select the workflow to be scheduled on a timer, then set the start time, end time, and frequency of the timed scheduling;
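
The three scheduling parameters (start time, end time, frequency) determine a sequence of trigger times. The sketch below computes that sequence in plain Python; treating the end time as exclusive is an assumption of the sketch, and the function is not part of Oozie.

```python
from datetime import datetime, timedelta


def scheduled_runs(start: datetime, end: datetime, frequency: timedelta):
    """List the nominal trigger times in [start, end) at the given frequency."""
    if frequency <= timedelta(0):
        raise ValueError("frequency must be positive")
    runs = []
    current = start
    while current < end:
        runs.append(current)
        current += frequency
    return runs
```

For instance, a start of 00:00, an end of 01:00, and a 15-minute frequency produce four trigger times under this convention.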

14. Workflow monitoring management

Workflow monitoring management covers the running state of workflows and the running state of Spark tasks. After a workflow is started by timed scheduling, its Spark tasks begin to run. The running state and time of a Spark task can be looked up by workflow and task name, and the log files of the Spark task run can also be viewed;

In this embodiment, the development and deployment environments are as follows:

Development environment:

Scala version: 2.10.5

Spark: reuses the existing CDH component

IDE: IDEA

Deployment environment:

Scala version: 2.10.5

Spark version: 1.6.0

Zookeeper version: 3.4.5

MySQL version: 5.1.4

Oracle version: 10g

GreenPlum version: 5.7.0

The server deployment locations of the streaming computing system are shown in Table 1 below:

Table 1

Embodiment

In this embodiment, a table in an Oracle database is assumed to hold 260M of data (as shown in Table 2, the sales data of a company).

Table 2

In Table 2, the first column ORDERNO is the order number, the second column MEMBERNO is the member number, the third column TOTALPRICE is the total price, the fourth column ORDERSTATUS is the order status, the fifth column ORGID is the number of the owning organization, and the sixth column OPERATORID is the operator number;

Through the data source module, the Oracle database connection information is configured, including the user name, password, and Oracle driver. Because the data volume is large, reaching 1.2 million rows (120W), data initialization must be performed first, followed by data synchronization;
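
The connection information for the Oracle source can be gathered in one small configuration object. This is a hedged sketch: the class and its field values are hypothetical, the URL follows the usual Oracle thin-driver form, and the option keys (`url`, `dbtable`, `user`, `password`, `driver`) are the names used by Spark's JDBC data source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleSourceConfig:
    """Connection details for one Oracle source; all example values are placeholders."""

    host: str
    port: int
    service_name: str
    user: str
    password: str
    driver: str = "oracle.jdbc.OracleDriver"

    def jdbc_url(self) -> str:
        # Oracle thin-driver URL addressed by service name.
        return f"jdbc:oracle:thin:@//{self.host}:{self.port}/{self.service_name}"

    def reader_options(self, table: str) -> dict:
        # Option keys as expected by a JDBC-based reader such as Spark's.
        return {
            "url": self.jdbc_url(),
            "dbtable": table,
            "user": self.user,
            "password": self.password,
            "driver": self.driver,
        }
```

Keeping the options in one immutable object makes it straightforward to pass the same source definition to both the initialization run and the later synchronization runs.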

Data initialization first uses Spark SQL to extract all of the data in Oracle in micro-batches. Through the data computation module, the extracted data is held in memory, registered as a temporary table to form a data set, and inserted into Hive. Through the data storage module, the data is stored on the nodes of the servers, and the YARN distributed management system module allocates resources appropriately across the nodes of the cluster. Using the mapping feature of Hive external tables, the data is mapped and synchronized into GreenPlum external tables. A stored procedure is then called to apply the required logic to the external-table data with SQL, including joins with other tables, conditional filtering, and logical computation, and the result is inserted into GreenPlum internal tables;

Data synchronization likewise first uses Spark SQL to extract data from Oracle in micro-batches, but only the data from the current time back one day. The subsequent process is the same as initialization, except that the Hive external table and the GreenPlum external table must be emptied each time before the data is inserted. Each data synchronization run is one scheduling run and is dispatched by the job scheduling management platform; the job schedule must be configured with the jar package of the code to execute, the jar package of the connection library, the allocation of scheduling resources, the timed-scheduling time, and so on. The processed result data passes through the interactive analysis module of the data dashboard display module, which uses data sets to process the data along different dimensions, for example the time dimension (classifying data by time period) or the regional dimension (classifying data by region). The permission control module controls the permissions on each report for different users, and the report display module presents the data visually.
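
The "empty the external tables, then insert the last day's data" step of each synchronization run can be sketched with a single staging table. As before, sqlite3 is only a stand-in for the Hive and GreenPlum external tables, and the table name, the `order_time` column, and the helper functions are assumptions made for illustration.

```python
import sqlite3


def make_staging_db():
    """Create an in-memory database with one stand-in staging (external) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staging_orders (orderno TEXT, order_time TEXT, totalprice REAL)"
    )
    return conn


def sync_recent(conn, source_rows, cutoff):
    """Empty the staging table, then reload rows whose order_time is at or after cutoff.

    Rows are (orderno, order_time, totalprice) tuples; order_time is an ISO-8601
    string, so plain string comparison orders timestamps correctly.
    """
    recent = [row for row in source_rows if row[1] >= cutoff]
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM staging_orders")
        conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", recent)
    return len(recent)
```

Because the table is emptied before every load, repeating a run with the same window leaves the same rows in place rather than accumulating duplicates.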

The present invention provides a data platform system based on real-time streaming computation. There are many methods and approaches for implementing this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention. Any component not specified in this embodiment can be implemented with the prior art.

Claims

1. A data platform system based on real-time streaming computation, characterized in that it comprises a data source module, a real-time data computing module, a data dashboard display module, and a job scheduling management platform;

Wherein the data source module is responsible for configuring data source connection information and extracting the data that the business needs computed and synchronized in real time;

The real-time data computing module is responsible for real-time data computation and storage, and for the resource management and scheduling of each node;

The data dashboard display module is used for interactive analysis;

The job scheduling management platform is responsible for configuring workflows, keeping real-time computation scheduling tasks running normally, and scheduling workflows on a timer, thereby ensuring the accuracy of the data.

2. The system according to claim 1, characterized in that the data source module comprises a data volume module and a data origin module;

The data volume module determines whether the data must be initialized; when the data volume is below the 100,000-row (10W) level, the data only needs to be synchronized by loading the full table, and no initialization is required;

The data origin module covers structured data.

3. The system according to claim 2, characterized in that the real-time data computing module comprises a real-time data synchronization module, a YARN distributed management system, a data storage module, and a data computation module;

The YARN distributed management system schedules the resources of each node in the cluster to achieve efficient resource management;

The data storage module uses Spark SQL micro-batch processing to fetch data and synchronize it to HDFS, holding it in memory; after processing, the final data resides on the nodes of the cluster;

The data computation module registers the data held in cluster memory as temporary tables and performs logical computation with SQL statements.

4. The system according to claim 3, characterized in that the data dashboard display module comprises an interactive analysis module, a report display module, and a permission control module;

The interactive analysis module performs logical computation on data sets and analyzes and processes them along different dimensions and themes;

The report display module presents data through different charts;

The permission control module controls the permission to view and modify each report, granting each business party the corresponding permissions.

5. The system according to claim 4, characterized in that the job scheduling management platform module is used for near-real-time scheduling of Spark tasks;

The near-real-time scheduling of Spark tasks comprises Workflow configuration, Coordinator configuration, and workflow monitoring management; workflows can be scheduled in parallel or serially, and a failed workflow can be re-run: workflow scheduling can be restarted and executed again;

The Workflow configuration is used to schedule and run Spark tasks; each Spark task requires one job to be configured, and jobs of the same category are configured in one workflow; when multiple workflows are scheduled in parallel, the failure of one Spark task does not affect the other tasks running in the same workflow;

The Coordinator configuration manages the timed scheduling of workflows; it requires specifying the corresponding workflow, the scheduling time, and the scheduling frequency;

6. The system according to claim 5, characterized in that the data storage module uses Spark SQL micro-batch processing to fetch data and synchronize it to HDFS; the data synchronized to HDFS is a DataFrame (a structured data set), which can be registered as a temporary table and then operated on with SQL.

7. The system according to claim 6, characterized in that the YARN distributed management system is responsible for cluster resource management and scheduling: it stores the data of the data source module in a distributed manner on the HDFS nodes and schedules the resources of each node.

8. The system according to claim 7, characterized in that the data stored in a distributed manner on HDFS is mapped to GreenPlum external tables.

9. The system according to claim 8, characterized in that the data in the GreenPlum external tables undergoes logical computation in stored procedures and is inserted into GreenPlum internal tables.

10. The system according to claim 9, characterized in that performing logical computation on the data of the GreenPlum external tables through stored procedures and inserting it into GreenPlum internal tables comprises the following steps:

Step 1, extract the data of the data source module in micro-batches using Spark SQL;

Step 2, register the extracted data as a temporary table;

Step 3, insert the data of the temporary table into a Hive external table;

Step 4, map the data from the Hive external table into a GreenPlum external table;

Step 5, a stored procedure performs logical computation on the data of the GreenPlum external table and inserts the result into a GreenPlum internal table;

Step 6, write the data of the internal table into a data set and display it in reports.