CN107766147A

CN107766147A - Distributed data analysis task scheduling system

Info

Publication number: CN107766147A
Application number: CN201610712300.1A
Authority: CN
Inventors: 孙冬雪; 万英杰; 李娟�; 史宁; 鲍远松; 黄明; 李亚贝
Original assignee: Shanghai Baosight Software Co Ltd
Current assignee: Shanghai Baosight Software Co Ltd
Priority date: 2016-08-23
Filing date: 2016-08-23
Publication date: 2018-03-06

Abstract

The invention provides a distributed data analysis task scheduling system comprising: a distributed data storage service module, a resource-based distributed task scheduling engine module, a distributed message queue module, a distributed application coordination service module, and an automatic execution engine module. The invention proposes a distributed task scheduling framework for data analysis in the R language. A resource management platform provides distributed resource management, which in turn enables distributed scheduling and execution of R analyses; an automatic scheduling engine automatically invokes tasks over data shards, meeting the automated analysis needs of industrial process data tracking.

Description

Distributed data analysis task scheduling system

Technical field

The present invention relates to data analysis task scheduling, and in particular to a distributed data analysis task scheduling system that can be widely used for the scheduled execution of data analysis on industrial process data.

Background technology

With the continued advance of Industry 4.0 and the development of technologies such as the Internet of Things, object-tracking devices such as mobile devices and RFID are used ever more widely in industrial production, and explosive data growth will become the trend. At the same time, the push toward refined enterprise management requires analysis of larger data volumes and wider data dimensions to support business decisions. After many years of using enterprise information systems, many enterprises have accumulated large amounts of historical data, but this data is not sufficiently analyzed or used. How can existing and newly added business data be fully mined and used? A comparative study of traditional industrial BI tools shows the following main deficiencies:

(1) Large numbers of data analysis tasks cannot be scheduled across nodes automatically, so tasks often accumulate on a single node, causing high resource consumption on that node and low execution efficiency.

(2) Data analysis tasks have a single point of failure: when an analysis task running on a single node fails, there is no mechanism to restart it.

(3) They lack an effective timed task scheduling function and cannot meet the scheduling needs of analysis tasks that run at a fixed period or fixed interval.

(4) Source data and analysis results are generally not stored in a distributed architecture, so results are vulnerable to a single-machine failure of the storage node, which may cause data loss.

R is an open-source language developed specifically for statistics and data analysis. It is compatible with different operating systems and concise to program, and it is the programming platform preferred by statistical analysts. Its mature data mining packages are abundant and still growing, and it has powerful modules for visualizing analysis results, such as the multi-layer graphics of ggplot. Its shortcomings are that it handles large text poorly and that, although its data analysis components are strong, its data management is weak, so data often has to be split in an external environment before being returned to the R platform for analysis.

Among published papers, "Application of the R Language in Big Data Processing" by Yang Xia and Wu Dongwei mainly introduces the features and usage of the RHadoop extension packages from Revolution Analytics, which allow Map-Reduce programs to be written in R. "Integration Technology Based on the R Language and Hadoop and Its Implementation" by Liu Wenfei mainly describes integrating and executing R programs through Hadoop Streaming. Patent application No. CN201610074884.4, titled "Task sending system for a distributed computing framework", discloses a task sending system comprising an application server, a task queue service platform, and a Redis service platform. The application server deploys multiple business processing services. The task queue service platform consists of multiple task servers networked into a cluster; a ZooKeeper service is deployed in it, and task scheduling jobs handle new client tasks in the message queue of the ZooKeeper service. The Redis service platform consists of multiple Redis servers connected over a network and is connected to the task queue service platform over the network. The Redis service platform invokes a processing process according to new client tasks added to the message queue; the processing process cleans the client task, outputs a first business result, and stores it in the Redis cache. A real-time computing module of the Redis service platform detects that a new first business result exists in the Redis cache, performs computation on it, and outputs a second business result.

Comparison of technical points: compared with that patent document, the technical structure of the present invention is clearly different. Both are based on ZooKeeper, but for different purposes: that invention uses it for message-queue client tasks, whereas the present invention uses it for single-node fault tolerance when executing timed tasks. The present invention does not use Redis, and that patent document has no timed scheduling strategy.

Summary of the invention

In view of the defects in the prior art, the object of the invention is to provide a distributed data analysis task scheduling system. The technical problem to be solved, and the focus of the invention, is how to make full use of the strong and flexible data analysis of the R platform, together with the distributed resource management services of a big data environment, to build a resource-based distributed data analysis task scheduling system that is convenient for data analysts working on industrial analysis.

A distributed data analysis task scheduling system provided according to the invention comprises:

Distributed data storage service module: stores data in a non-relational database, retrieves data through a distributed search engine, and provides a distributed data storage service;

Resource-based distributed task scheduling engine module: performs resource management, resource control, and task scheduling and tracking, and provides a task scheduling service;

Distributed message queue module: implements data publish and subscribe functions through a distributed message queue;

Distributed application coordination service module: provides fault tolerance for the subsequent execution of automatic execution engine tasks on a single node;

Automatic execution engine module: performs the analysis of data analysis tasks.

Preferably, the non-relational database is HBase, and the distributed search engine is the search application server Solr.

Preferably, the distributed data storage service module is the data source for analysis tasks.

Preferably, the distributed data storage service module is the storage carrier for task files and data analysis results.

Preferably, the resource-based distributed task scheduling engine module performs resource management, resource control, and task scheduling and tracking based on the resource manager YARN.

Preferably, the automatic execution engine module performs continuous analysis of data analysis tasks in two modes, periodic access and timed triggering, and the distributed application coordination service provides fault tolerance for the automatic scheduling engine on each node.
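The two trigger modes can be illustrated with a minimal Python sketch. It assumes one plausible reading of the text: periodic access means re-running a task at a fixed interval, and timed triggering means running it at a fixed wall-clock time each day. The function names and the example times are hypothetical and are not taken from the invention or from the Quartz API.

```python
from datetime import datetime, timedelta


def next_interval_fire(last_fire: datetime, interval: timedelta) -> datetime:
    """Periodic mode: fire again one fixed interval after the previous run."""
    return last_fire + interval


def next_daily_fire(now: datetime, hour: int, minute: int) -> datetime:
    """Timed mode: fire at the next occurrence of a fixed time of day."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


# Hypothetical example: a task last run at 10:30 on 2016-08-23.
now = datetime(2016, 8, 23, 10, 30)
periodic = next_interval_fire(now, timedelta(minutes=15))
timed = next_daily_fire(now, hour=2, minute=0)
print(periodic, timed)
```

A real scheduler would store such a rule with each running-state task and recompute the next fire time after every run.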

Preferably, in the system initialization phase, the application on each node loads the running-state data analysis tasks of the timed scheduling engine from the distributed data storage service module, creates a task-change topic (TOPIC) in the distributed message queue, and loads the distributed application coordination service ZooKeeper;

an SDK interface is provided for calling the distributed data storage service module, obtaining data sources converted into Data.Frame, the core data structure of the R language, and saving data analysis results to the distributed storage carrier;

changes to running-state tasks and to automatic scheduling engine policies are broadcast within the cluster through the distributed message queue, and each node subscribes to the task-change topic (TOPIC) and updates the automatic scheduling engine policy on its own node.

Preferably, running-state data tasks are loaded at initialization so that task execution can resume after a previous cluster-wide disaster.

Preferably, the resource manager ensures that the memory and CPU usage of an individual task does not exceed the values requested for that task, and provides execution logs and status queries.

Preferably, analysis script files are obtained from the distributed storage service and broadcast within the cluster through HDFS.
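The per-task resource limit described above can be sketched as a simple check of observed usage against the requested values. This is only an illustration in Python: the class and function names are hypothetical, it does not use the YARN API, and marking a task as killed on a violation is an assumption, since the text says only that usage does not exceed the requested value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Resources a task asks for when it is submitted (hypothetical shape)."""
    memory_mb: int
    vcores: int


def check_usage(request: ResourceRequest, usage_samples) -> str:
    """Return 'KILLED' at the first sample that exceeds the request, else 'RUNNING'."""
    for memory_mb, vcores in usage_samples:
        if memory_mb > request.memory_mb or vcores > request.vcores:
            return "KILLED"
    return "RUNNING"


request = ResourceRequest(memory_mb=2048, vcores=2)
ok_status = check_usage(request, [(512, 1), (2048, 2)])
bad_status = check_usage(request, [(512, 1), (4096, 2)])
print(ok_status, bad_status)
```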

A distributed data analysis task scheduling method provided according to the invention comprises:

Distributed data storage service step: storing data in a non-relational database, retrieving data through a distributed search engine, and providing a distributed data storage service;

Resource-based distributed task scheduling engine step: performing resource management, resource control, and task scheduling and tracking, and providing a task scheduling service;

Distributed message queue step: implementing data publish and subscribe functions through a distributed message queue;

Distributed application coordination service step: providing fault tolerance for the subsequent execution of automatic execution engine tasks on a single node;

Automatic execution engine step: performing the analysis of data analysis tasks.

Preferably, the distributed data storage service step provides the data source for analysis tasks.

Preferably, the distributed data storage service step provides the storage carrier for task files and data analysis results.

Preferably, the resource-based distributed task scheduling engine step performs resource management, resource control, and task scheduling and tracking based on the resource manager YARN.

Preferably, the automatic execution engine step performs continuous analysis of data analysis tasks in two modes, periodic access and timed triggering, and the distributed application coordination service provides fault tolerance for the automatic scheduling engine on each node.

Preferably, in the method initialization phase, the application on each node loads the running-state data analysis tasks of the timed scheduling engine from the distributed data storage service, creates a task-change topic (TOPIC) in the distributed message queue, and loads the distributed application coordination service ZooKeeper;

an SDK interface is provided for calling the distributed data storage service, obtaining data sources converted into Data.Frame, the core data structure of the R language, and saving data analysis results to the distributed storage carrier.

Compared with the prior art, the present invention has the following beneficial effects:

The invention proposes a distributed task scheduling framework for data analysis in the R language. A resource management platform provides distributed resource management, which in turn enables distributed scheduling and execution of R analyses; an automatic scheduling engine automatically invokes tasks over data shards, meeting the automated analysis needs of industrial process data tracking.

By using a distributed scheduling architecture, the invention is fault tolerant and scalable. It provides a reliable basis for industrial process data analysis and can meet the growing resource demands of data analysis tasks through horizontal scaling. Ultimately it brings out more of the value of process data and historical data, supports enterprise decision-making, provides a data foundation for intelligent manufacturing, and supports enterprise transformation.

One main focus of the invention is data analysis processes that work on slices of data, with different parameters but the same computation, and that require continuous, frequent tracking. Without making the R analysis programs written by users any harder to develop, the invention provides a task scheduling service architecture with strong availability, fault tolerance, and scalability that can execute script tasks in parallel.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawing:

Fig. 1 is the structural representation of system provided by the invention.

Detailed description of embodiments

The invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make several changes and improvements without departing from the concept of the invention. These all fall within the scope of protection of the invention.

The architecture of the distributed data analysis task scheduling system is shown in Fig. 1 and consists mainly of the following modules:

Distributed data storage service module: the distributed storage service stores data in the NoSQL database HBase and retrieves data through the distributed search engine Solr, offering large capacity, high reliability, high performance, safe data replicas, and strong dynamic scalability. This module can serve as the data source for analysis tasks and also as the storage carrier for task files and data analysis results.

Resource-based distributed task scheduling engine module: the task scheduling engine performs resource management, resource control, and task scheduling and tracking based on YARN.

Distributed message queue module: the distributed message queue uses the open-source Kafka message queue to implement data publish and subscribe functions. The message queue must provide high throughput, high reliability, and persistence, so that data is delivered reliably.

Distributed application coordination service module: based on the open-source technology ZooKeeper. In this system it is mainly used to provide fault tolerance for the subsequent execution of automatic execution engine tasks on a single node.

Automatic execution engine module: the automatic execution engine is based on the open-source Quartz framework. Because part of the industrial process data needs to be tracked continuously, the engine provides two modes, periodic access and timed triggering, for continuous analysis of data analysis tasks. The distributed application coordination service provides fault tolerance for the automatic scheduling engine on each node.
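The fault tolerance provided by the coordination service can be sketched as leader-only execution with failover. The Python example below is an in-memory stand-in, not ZooKeeper: it assumes that each node receives a sequence number when it joins and that the live node with the lowest number is the leader. The `Coordinator` class, the `run_tick` function, and the task name are hypothetical.

```python
class Coordinator:
    """In-memory stand-in for a coordination service (assumption, not ZooKeeper)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # node name -> sequence number assigned at join time

    def join(self, node: str) -> None:
        self._members[node] = self._next_seq
        self._next_seq += 1

    def fail(self, node: str) -> None:
        # A failed node loses its membership, as if its session had expired.
        self._members.pop(node, None)

    def leader(self):
        """The live node with the lowest sequence number, or None if no node is live."""
        if not self._members:
            return None
        return min(self._members, key=self._members.get)


def run_tick(coordinator: Coordinator, live_nodes, task: str):
    """Every live node's scheduler fires, but only the leader executes the task."""
    return [(node, task) for node in live_nodes if coordinator.leader() == node]


coordinator = Coordinator()
for name in ["A", "B", "C"]:
    coordinator.join(name)

first_run = run_tick(coordinator, ["A", "B", "C"], "daily_report")
coordinator.fail("A")  # single-node failure of the current leader
second_run = run_tick(coordinator, ["B", "C"], "daily_report")
print(first_run, second_run)
```

In this sketch the task runs exactly once per tick both before and after the failure, which is the property the leader check is meant to preserve.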

The flow of distributed data analysis task scheduling is as follows:

(1) System initialization: the application on each node loads the running-state data analysis tasks of the timed scheduling engine from the distributed data storage service, creates the task-change TOPIC in the distributed message queue, and loads the ZooKeeper service. The purpose of loading running-state data tasks at initialization is to resume task execution after a previous cluster-wide disaster (such as a power failure of the whole cluster).

(2) An SDK interface is provided for calling the distributed data storage service and obtaining data sources converted into Data.Frame, the core data structure of R; data analysis results (analysis pictures, analysis documents, and so on) can also be saved to the distributed storage carrier.

(3) Changes to running-state tasks and to automatic scheduling engine policies are broadcast within the cluster through the distributed message queue; each node subscribes to the task-change TOPIC and updates the automatic scheduling engine policy on its own node.

(4) The YARN-based resource manager ensures that the memory and CPU usage of an individual task does not exceed the values it requested, and provides execution logs and status queries. Analysis script files are obtained from the distributed storage service and broadcast within the cluster through HDFS.
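Step (3) of the flow above can be sketched as a publish and subscribe exchange. The Python example below uses a small in-memory queue in place of Kafka; the `MessageQueue` and `Node` classes, the topic name, and the message fields are all hypothetical and serve only to show each node updating its local scheduling policy when a task change is broadcast.

```python
from collections import defaultdict

TASK_CHANGE_TOPIC = "task-change"  # hypothetical topic name


class MessageQueue:
    """In-memory stand-in for a distributed message queue (assumption, not Kafka)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(message)


class Node:
    """A cluster node that keeps a local copy of the scheduling policies."""

    def __init__(self, name, queue):
        self.name = name
        self.policies = {}  # task id -> schedule description
        queue.subscribe(TASK_CHANGE_TOPIC, self.on_task_change)

    def on_task_change(self, message):
        if message["state"] == "running":
            self.policies[message["task_id"]] = message["schedule"]
        else:
            self.policies.pop(message["task_id"], None)


queue = MessageQueue()
nodes = [Node(name, queue) for name in ["A", "B", "C", "D"]]

queue.publish(TASK_CHANGE_TOPIC,
              {"task_id": "t1", "state": "running", "schedule": "every 15 min"})
after_start = [dict(node.policies) for node in nodes]

queue.publish(TASK_CHANGE_TOPIC, {"task_id": "t1", "state": "stopped"})
after_stop = [dict(node.policies) for node in nodes]
print(after_start, after_stop)
```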

Characteristics of the distributed task scheduling flow:

(1) High reliability: the system uses a distributed architecture with no single point of failure. The open-source big data technologies underlying the distributed storage service, the distributed message queue, and the distributed application coordination service (HBase, Kafka, Solr, ZooKeeper) each have their own fault-tolerance mechanisms. For the automatic scheduling engine on each node, the system relies on the uniqueness of the ZooKeeper leader: only the node where the leader resides executes data analysis tasks. After a single-point failure, ZooKeeper elects a new leader, so ZooKeeper provides fault tolerance for task execution driven by the automatic scheduling engine policy. Task files are likewise stored in the distributed storage service and so have no single point of failure.

(2) Easy horizontal scaling: because the system architecture is distributed, the cluster can be scaled horizontally simply by adding nodes, to meet the greater resource demands of data analysis tasks.

(3) Task resource isolation: the resource isolation of the YARN resource management system resolves contention for resources between data analysis tasks and prevents one task from monopolizing system resources and leaving other tasks waiting indefinitely to execute.

(4) Data sharding and parallel task execution: using the HashKey rule defined for data storage, an R script passes in parameters and the distributed data storage service queries by HashKey, which shards the data. A task over the original large data set is split into subtasks that are scheduled and executed in parallel in the distributed task scheduling system. This retains the strength of R in data analysis while compensating for its weakness in data management.
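The sharding idea in item (4) can be sketched in Python as grouping records by a key and running the same analysis on each shard in parallel. This is an illustration under assumptions: the `hash_key` field, the sample records, and the `analyze` function (a mean, standing in for the user's R script) are hypothetical, and a local thread pool stands in for the distributed task scheduling system.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def shard_by_key(records, key):
    """Group records into shards that share the same key value."""
    shards = defaultdict(list)
    for record in records:
        shards[record[key]].append(record)
    return dict(shards)


def analyze(shard):
    """Stand-in for the analysis script: the mean of the 'value' field."""
    return mean(record["value"] for record in shard)


def run_parallel(records, key):
    """Split one large task into per-shard subtasks and run them concurrently."""
    shards = shard_by_key(records, key)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {k: pool.submit(analyze, rows) for k, rows in shards.items()}
        return {k: future.result() for k, future in futures.items()}


# Hypothetical process data keyed by production line.
records = [
    {"hash_key": "line1", "value": 1.0},
    {"hash_key": "line2", "value": 10.0},
    {"hash_key": "line1", "value": 3.0},
]
results = run_parallel(records, "hash_key")
print(results)
```

Each subtask sees only its own shard, so the analysis code stays the same as it would be for a single small data set.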

In a preferred embodiment, the following configuration is used:

(1) Four x86 servers (named A, B, C, and D) are provided, each with at least 64 GB of memory; the recommended minimum CPU is an E2650.

(2) Deploy the distributed message queue service: Kafka is deployed on all four machines A, B, C, and D, and the cluster configuration is completed.

(3) Deploy the distributed application coordination service: ZooKeeper is deployed on all four machines A, B, C, and D, and the cluster configuration is completed.

(4) Deploy the distributed storage service: the HBase master is deployed on node A, and a RegionServer is deployed on each of nodes B, C, and D. The Hadoop environment is configured at the same time: the Hadoop NameNode is deployed on node A, a DataNode is deployed on each of nodes B, C, and D, and the cluster configuration is completed.

(5) Deploy the distributed task scheduling service: YARN is deployed on all four machines A, B, C, and D, and the cluster configuration is completed.
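The deployment above can be written down as a small layout table, which makes it easy to check which services run on which machine. The Python sketch below only restates the configuration steps; the dictionary keys and the `services_on` helper are hypothetical names, and the text does not say how YARN roles are split across nodes, so YARN is listed as a single entry.

```python
NODES = ["A", "B", "C", "D"]

# Service -> nodes it is deployed on, following configuration steps (2) to (5).
LAYOUT = {
    "kafka": NODES,
    "zookeeper": NODES,
    "hbase_master": ["A"],
    "hbase_regionserver": ["B", "C", "D"],
    "hdfs_namenode": ["A"],
    "hdfs_datanode": ["B", "C", "D"],
    "yarn": NODES,
}


def services_on(node):
    """List the services deployed on one node, in alphabetical order."""
    return sorted(service for service, hosts in LAYOUT.items() if node in hosts)


for node in NODES:
    print(node, services_on(node))
```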

Those skilled in the art know that, in addition to implementing the system provided by the invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system provided by the invention and its devices, modules, and units can therefore be regarded as a hardware component, and the devices, modules, and units included in it for realizing various functions can also be regarded as structures within the hardware component; the devices, modules, and units for realizing various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.

Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Where no conflict arises, the embodiments of this application and the features in the embodiments can be combined with one another in any way.

Claims

1. A distributed data analysis task scheduling system, characterized by comprising:

Distributed data storage service module: stores data in a non-relational database, retrieves data through a distributed search engine, and provides a distributed data storage service;

Resource-based distributed task scheduling engine module: performs resource management, resource control, and task scheduling and tracking, and provides a task scheduling service;

Distributed message queue module: implements data publish and subscribe functions through a distributed message queue;

Distributed application coordination service module: provides fault tolerance for the subsequent execution of automatic execution engine tasks on a single node;

Automatic execution engine module: performs the analysis of data analysis tasks.
2. The distributed data analysis task scheduling system according to claim 1, characterized in that the non-relational database is HBase and the distributed search engine is the search application server Solr.
3. The distributed data analysis task scheduling system according to claim 1, characterized in that the distributed data storage service module is the data source for analysis tasks.
4. The distributed data analysis task scheduling system according to claim 1, characterized in that the distributed data storage service module is the storage carrier for task files and data analysis results.
5. The distributed data analysis task scheduling system according to claim 1, characterized in that the resource-based distributed task scheduling engine module performs resource management, resource control, and task scheduling and tracking based on the resource manager YARN.
6. The distributed data analysis task scheduling system according to claim 1, characterized in that the automatic execution engine module performs continuous analysis of data analysis tasks in two modes, periodic access and timed triggering, and the distributed application coordination service provides fault tolerance for the automatic scheduling engine on each node.
7. The distributed data analysis task scheduling system according to claim 1, characterized in that, in the system initialization phase, the application on each node loads the running-state data analysis tasks of the timed scheduling engine from the distributed data storage service module, creates a task-change topic (TOPIC) in the distributed message queue, and loads the distributed application coordination service ZooKeeper;

an SDK interface is provided for calling the distributed data storage service module, obtaining data sources converted into Data.Frame, the core data structure of the R language, and saving data analysis results to the distributed storage carrier;

changes to running-state tasks and to automatic scheduling engine policies are broadcast within the cluster through the distributed message queue, and each node subscribes to the task-change topic (TOPIC) and updates the automatic scheduling engine policy on its own node.
8. The distributed data analysis task scheduling system according to claim 7, characterized in that running-state data tasks are loaded at initialization so that task execution can resume after a previous cluster-wide disaster.
9. The distributed data analysis task scheduling system according to claim 7, characterized in that the resource manager ensures that the memory and CPU usage of an individual task does not exceed the values requested for that task, and provides execution logs and status queries.
10. The distributed data analysis task scheduling system according to claim 9, characterized in that analysis script files are obtained from the distributed storage service and broadcast within the cluster through HDFS.