CN108874524A

CN108874524A - Big data distributed task scheduling system

Info

Publication number: CN108874524A
Application number: CN201810643612.0A
Authority: CN
Inventors: 李平福; 程林; 杨培强
Original assignee: Shandong Inspur Business System Co Ltd
Current assignee: Shandong Inspur Business System Co Ltd
Priority date: 2018-06-21
Filing date: 2018-06-21
Publication date: 2018-11-23

Abstract

The invention discloses a big data distributed task scheduling system in which distributed task scheduling is performed through a proxy server, scheduling logs are collected and aggregated through a collection cluster, a stream computing cluster, and a distributed message queue, and query results are sent to a web front end for analysis and display. Compared with the prior art, the invention meets simple big data scheduling requirements, speeds up deployment and improves usage efficiency, and, while suiting industry big data application scenarios, reduces usage cost and increases the generality of the scheduling system.

Description

Big data distributed task scheduling system

Technical field

The present invention relates to the technical field of distributed task scheduling, and specifically to a big data distributed task scheduling system.

Background technique

We are currently in an era of rapid development of big data technology. While adopting big data, every industry also faces many technical problems: how to schedule big data cluster tasks correctly and efficiently, how to aggregate and analyze scheduling history records, and how to reduce the difficulty of cluster management and improve maintainability. To date, no standardized scheduling system has emerged in this field, partly because business varies so widely across industries and partly because task scheduling systems are heavily customized and tied to complex business logic. Apart from timed scheduling programs or libraries oriented toward a single machine, such as Crontab and Quartz, there are many open-source distributed task scheduling systems, the better known being oozie and azkaban. Others, such as Alibaba's SchedulerX and Tencent's Lhotse, are typically developed in-house or built by wrapping and modifying open-source projects, and many companies take the approach of wrapping oozie. These approaches, however, have the following drawbacks:

1. Usage scenarios are complex and varied, and development cost is too high;

2. They are biased toward single-node scheduling and do not suit scenarios in which the big data cluster and the application cluster are independent of each other;

3. Tasks are scheduled repeatedly under cluster scheduling;

4. For simple scheduling needs, the above systems are somewhat heavyweight and cannot be deployed and put to use quickly, and the complexity of their architecture also increases the likelihood of exceptions or errors;

5. The scheduling system and the application system compete for resources within the cluster, which reduces the stability of both;

6. Collecting, analyzing, and displaying the cluster's scheduling information and logs is difficult.

Summary of the invention

The technical task of the invention is to address the above problems. To meet simple big data scheduling requirements, speed up deployment and improve usage efficiency, and, while suiting industry big data application scenarios, reduce usage cost and increase the generality of the scheduling system, the invention proposes a big data distributed task scheduling system that is easy to implement and easy to use.

The technical solution adopted by the invention to solve its technical problem is a big data distributed task scheduling method, which comprises performing distributed task scheduling through a proxy server, collecting and aggregating scheduling logs through a collection cluster, a stream computing cluster, and a distributed message queue, and sending query results to a web front end for analysis and display.

Further, the method comprises the following steps:

S1. The user configures scheduling rules through the application server;

S2. The application server configures the proxy server according to the scheduling rules;

S3. The proxy server submits scheduled tasks to the cluster;

S4. The cluster sends task logs to the log server;

S5. The collection cluster collects the task logs;

S6. The stream computing cluster receives the task logs by push or pull;

S7. The application server remotely submits the task log processing program to the stream computing cluster;

S8. The stream computing cluster aggregates the task log results and stores them in the database server;

S9. The application server returns the task log results and supports creating, deleting, updating, and querying them;

S10. The application server sends query results to the web front end for analysis and display.

The big data distributed task scheduling system comprises a task scheduling system and a scheduling log collection and aggregation system;

The task scheduling system is based on the Insight HD big data platform and implements distributed task scheduling using the Crontab of Unix-like systems;

The scheduling log collection and aggregation system collects and aggregates scheduling logs and sends the collected and aggregated results to the web front end for analysis and display;

The task scheduling system comprises a Hadoop cluster module, an application server module, a relational database module, a scheduling proxy server module, and a log collection server module.
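To illustrate how a Crontab-based scheduler of this kind might turn a scheduling rule into something the Crond service can execute, the following Python sketch renders a rule as a single crontab entry. The `ScheduleRule` fields, the `render_crontab_entry` helper, and the log directory are hypothetical names chosen for this example; they are assumptions and do not appear in the disclosure.

```python
from dataclasses import dataclass

# Shell command substitution that expands to the start time of the run.
# '%' is special in crontab command fields, so each one is escaped with '\'.
TIMESTAMP = r"$(date +\%Y\%m\%d\%H\%M\%S)"


@dataclass(frozen=True)
class ScheduleRule:
    """Hypothetical scheduling rule as configured on the application server."""
    task_id: str
    cron_expr: str                        # five-field crontab time specification
    command: str                          # command that submits the task
    log_dir: str = "/var/log/scheduler"   # assumed log location


def render_crontab_entry(rule: ScheduleRule) -> str:
    """Render one crontab line that runs the task and redirects its output."""
    if len(rule.cron_expr.split()) != 5:
        raise ValueError("cron_expr must have five fields: m h dom mon dow")
    log_file = f"{rule.log_dir}/{rule.task_id}_{TIMESTAMP}.log"
    # Append stdout and stderr to a log file named after the task and start time.
    return f"{rule.cron_expr} {rule.command} >> {log_file} 2>&1"
```

A rule with `cron_expr="30 2 * * *"`, for instance, yields an entry that runs the command every day at 02:30 and appends its output to a log file whose name carries the task identifier and the start time.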

Further, in a preferred structure, the scheduling log collection and aggregation system comprises a collection cluster, a distributed message queue, and a stream computing cluster;

The collection cluster deploys the Flume component on the log server and initializes a monitoring service; it collects task execution log files and sends them to the distributed message queue as the data source for stream computing;

The distributed message queue deploys the Kafka component on the log server and uses the publish-subscribe model; it serves as a buffer layer for the logs extracted by the collection cluster and sends the log data to the stream computing cluster;

The stream computing cluster deploys the Storm component on the log server and submits log parsing code, forming a log-processing topology according to actual needs that parses the state and execution time of scheduled tasks; the parsing results are then written back to the relational database module and associated with the task scheduling metadata to establish mapping relationships, and are made available to users through a web page.
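The parsing step can be sketched in plain Python, independent of any Storm API. The log line format below (`<timestamp> task=<id> event=<START|SUCCESS|FAILED>`) and the function name `parse_task_logs` are assumptions made for illustration; the disclosure does not specify a log format, and a real topology would apply equivalent logic to messages arriving from the message queue.

```python
import re
from datetime import datetime

# Assumed line format: "2018-06-21 02:30:00 task=etl event=START"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"task=(?P<task>\S+) event=(?P<event>START|SUCCESS|FAILED)$"
)


def parse_task_logs(lines):
    """Return {task_id: {"state": ..., "seconds": ...}} from raw log lines."""
    started = {}   # task_id -> start timestamp
    results = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue   # ignore lines that are not scheduler events
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        task, event = match["task"], match["event"]
        if event == "START":
            started[task] = ts
            results[task] = {"state": "RUNNING", "seconds": None}
        elif task in started:
            # Terminal event: record final state and elapsed execution time.
            elapsed = (ts - started[task]).total_seconds()
            results[task] = {"state": event, "seconds": elapsed}
    return results
```

Each result carries the two attributes named above, the task state and the execution time, so it can be stored alongside the scheduling metadata.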

Further, in a preferred structure, the task scheduling system further comprises a third-party system monitoring module;

The third-party system monitoring module monitors the Crond service; when the service fails, it is restarted automatically, the system service failure log is retained, and the administrator is notified of the log.

Further, in a preferred structure, the application server module is used to deploy the task scheduling system management program, which provides functions for configuring, managing, and monitoring tasks; after a task has been configured, the configuration, including the timed-trigger task information and the dependency information between tasks, is submitted to the scheduling proxy server module;

The relational database module stores timed-task metadata; it receives the timed-trigger task information and the inter-task dependency information from the application server module, and provides interfaces for insertion, deletion, query, and modification.
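A minimal sketch of such a metadata store is given below, using SQLite from the Python standard library as a stand-in for the relational database. The table names, column names, and helper functions are hypothetical; they only illustrate how timed-trigger information and inter-task dependencies could be stored and exposed through insert, delete, query, and update operations.

```python
import sqlite3

SCHEMA = """
CREATE TABLE task (
    task_id   TEXT PRIMARY KEY,
    cron_expr TEXT NOT NULL,      -- timed-trigger information
    command   TEXT NOT NULL
);
CREATE TABLE task_dependency (
    task_id    TEXT NOT NULL REFERENCES task(task_id) ON DELETE CASCADE,
    depends_on TEXT NOT NULL REFERENCES task(task_id) ON DELETE CASCADE,
    PRIMARY KEY (task_id, depends_on)
);
"""


def open_store(path=":memory:"):
    """Open the metadata store and create the assumed schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")   # enforce dependency references
    conn.executescript(SCHEMA)
    return conn


def add_task(conn, task_id, cron_expr, command, depends_on=()):
    """Insert a task together with the tasks it depends on."""
    with conn:
        conn.execute("INSERT INTO task VALUES (?, ?, ?)",
                     (task_id, cron_expr, command))
        conn.executemany("INSERT INTO task_dependency VALUES (?, ?)",
                         [(task_id, dep) for dep in depends_on])


def update_schedule(conn, task_id, cron_expr):
    """Change the timed trigger of an existing task."""
    with conn:
        conn.execute("UPDATE task SET cron_expr = ? WHERE task_id = ?",
                     (cron_expr, task_id))


def delete_task(conn, task_id):
    """Delete a task; dependency rows that mention it are removed as well."""
    with conn:
        conn.execute("DELETE FROM task WHERE task_id = ?", (task_id,))


def dependencies_of(conn, task_id):
    """Return the identifiers of the tasks that task_id depends on."""
    rows = conn.execute(
        "SELECT depends_on FROM task_dependency "
        "WHERE task_id = ? ORDER BY depends_on", (task_id,))
    return [row[0] for row in rows]
```

Keeping dependencies in their own table lets one task depend on any number of others without changing the task record itself.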

Further, in a preferred structure, the scheduling proxy server module is used to submit distributed tasks to the Hadoop cluster module;

The log collection server module stores the scheduling logs and scheduling records generated when the scheduling proxy server module executes tasks.

Compared with the prior art, the big data distributed task scheduling system of the invention has the following beneficial effects:

1. The system builds a distributed big data task scheduling system on the Crontab that ships with Unix-like systems, and provides extension functions such as scheduling log collection, aggregation, analysis, and display through third-party components;

2. The job scheduling module is isolated from the job management application, and this decoupling effectively reduces their influence on each other;

3. Ease of use: designed with ease of use in mind, the system greatly reduces development cost and improves development efficiency, and is better suited to tax big data usage scenarios; third-party system monitoring components can also be used to monitor server performance indicators;

4. Deduplication: the problem of repeated task invocation is solved;

5. Monitoring: task scheduling states and history records are effectively analyzed and displayed.

Brief description of the drawings

The invention is further described below with reference to the drawing.

Figure 1 is a schematic diagram of the big data distributed task scheduling system.

Specific embodiment

The present invention will be further explained below with reference to the attached drawings and specific examples.

Insight HD is a big data development kit produced by Inspur; it simplifies deployment of the Hadoop ecosystem and provides monitoring functions. Crontab is the timed scheduling module of Unix-like systems and depends on the Crond service. Apache Storm, hereinafter Storm, is a free, open-source real-time stream computing engine under Apache. Apache Flume, hereinafter Flume, is a highly available, highly reliable distributed log collection and aggregation system under Apache. Apache Kafka, hereinafter Kafka, is a distributed streaming publish-subscribe messaging system under Apache. A task scheduling system is an application system that periodically starts and executes tasks of various kinds (programs, scripts, and so on) according to given times and dependency rules.

The present invention is a big data distributed task scheduling system.

Embodiment 1:

The embodiment of the big data distributed task scheduling system comprises the following steps:

S1. The user configures scheduling rules through the application server;

S3. The proxy server submits scheduled tasks to the cluster;

S4. The cluster sends task logs to the log server;

S5. The collection cluster collects the task logs;

S6. The stream computing cluster receives the task logs by push or pull;

The task scheduling system comprises a Hadoop cluster module, an application server module, a relational database module, a scheduling proxy server module, and a log collection server module.

Hadoop cluster module: the Hadoop ecosystem environment is built with the Insight HD big data development kit, and the relevant components are configured.

Application server module: the task scheduling system management program is deployed on this server or cluster. The program provides functions for configuring, managing, and monitoring tasks. Triggers are divided into timed triggers and dependency triggers, and the dependencies between tasks can be customized. Once configuration is complete, it can be submitted to the scheduling system server, which dynamically generates the corresponding script.

Relational database: stores timed-task metadata, including each task's timed-trigger information and the dependencies between tasks, and provides interfaces for insertion, deletion, query, and modification.

Scheduling proxy server module: distributed tasks are submitted to the big data cluster through this server. Separating the scheduling system from the application system reduces resource competition and mutual interference between the two. The crond service is set to start at boot and is used to scan timed-task information. The timed-task configuration and dependencies are written to the database and, at the same time, into the template script on this server. Before a task executes, the module checks whether the services it depends on have finished executing; if they have, the timed task is started; otherwise an alarm is sent and the corresponding scheduled task is terminated.

Log collection server (cluster) module: this server stores the scheduling logs and scheduling records generated when scheduling system tasks execute. Log redirection for executed tasks is set up on the scheduling proxy server, and log files are named by execution time so that they are easy to tell apart.
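The dependency check performed before a task executes can be sketched as follows. The function name `run_if_dependencies_done` and its callback parameters are hypothetical; how the set of finished tasks is obtained (for example, from the database or from completion marker files) is left open here because the disclosure does not specify it.

```python
def run_if_dependencies_done(task_id, depends_on, finished, start, alarm):
    """Start task_id only if every task it depends on has finished.

    depends_on -- identifiers of the tasks that must complete first
    finished   -- set of identifiers of tasks known to have completed
    start      -- callable that launches the task (assumed to be provided)
    alarm      -- callable that delivers an alarm message (assumed)
    Returns True if the task was started, False if it was terminated.
    """
    missing = [dep for dep in depends_on if dep not in finished]
    if missing:
        # A dependency has not completed: raise an alarm and skip this run.
        alarm(f"task {task_id} terminated: unfinished dependencies {missing}")
        return False
    start(task_id)
    return True
```

Passing `start` and `alarm` in as callables keeps the check itself free of any assumption about how tasks are submitted to the cluster or how alarms are delivered.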

The scheduling log collection and aggregation system comprises a collection cluster, a distributed message queue, and a stream computing cluster;

Collection cluster, i.e. Flume component deployment: this server acts as the log collector. Flume is deployed on the log server (cluster) and a monitoring service is initialized to collect task execution log files and send them to the distributed message queue (Kafka) as the data source for stream computing. Distributed message queue, i.e. Kafka component deployment: a Kafka cluster is deployed using Insight HD. Under the publish-subscribe model, Kafka, with its high throughput and horizontal scalability, serves as the buffer layer for the logs extracted by Flume, and sends the log data to the Storm cluster for processing.
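The role of the message queue as a buffer layer between the collector and the stream processor can be illustrated with a small in-memory stand-in. This is not Kafka and uses no Kafka API; the class `InMemoryTopicBuffer` and its methods are hypothetical and only mimic two publish-subscribe properties: producers append to a topic without knowing the consumers, and each consumer group reads at its own pace from its own position.

```python
from collections import defaultdict


class InMemoryTopicBuffer:
    """Toy publish-subscribe buffer: append-only topics, per-group offsets."""

    def __init__(self):
        self._topics = defaultdict(list)    # topic -> ordered messages
        self._offsets = defaultdict(int)    # (topic, group) -> next index

    def publish(self, topic, message):
        """Append a message; the producer never waits for consumers."""
        self._topics[topic].append(message)

    def poll(self, topic, group, max_messages=10):
        """Return the next batch for this consumer group and advance it."""
        start = self._offsets[(topic, group)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(topic, group)] = start + len(batch)
        return batch
```

Because messages stay in the topic after being read, a slow or restarted consumer can catch up later, which is the buffering behavior the collection and stream computing stages rely on.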

Stream computing cluster, i.e. Storm cluster deployment: a Storm cluster is deployed using the Insight HD big data development kit, and log parsing code is submitted to form a log-processing topology (which can be custom-developed according to actual needs) that parses the state and execution time of scheduled tasks. The parsing results are written back to a relational database (such as Oracle), associated with the task scheduling metadata to establish mapping relationships, and made available to users through a web page.
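The write-back and association step might look like the following sketch, which uses SQLite from the Python standard library in place of the relational database. The `task` and `task_run` tables, their columns, and the helper names are assumptions for illustration; the mapping is established here simply by joining on a shared task identifier.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with assumed metadata and result tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE task (task_id TEXT PRIMARY KEY, cron_expr TEXT NOT NULL);
        CREATE TABLE task_run (
            task_id    TEXT NOT NULL,
            started_at TEXT NOT NULL,
            state      TEXT NOT NULL,
            seconds    REAL
        );
        """
    )
    return conn


def write_back(conn, parsed_runs):
    """Store parsed results; each item is (task_id, started_at, state, seconds)."""
    with conn:
        conn.executemany("INSERT INTO task_run VALUES (?, ?, ?, ?)", parsed_runs)


def runs_with_metadata(conn):
    """Join every configured task with its parsed runs (None if it never ran)."""
    return conn.execute(
        """
        SELECT t.task_id, t.cron_expr, r.started_at, r.state, r.seconds
        FROM task AS t
        LEFT JOIN task_run AS r ON r.task_id = t.task_id
        ORDER BY t.task_id, r.started_at
        """
    ).fetchall()
```

The left join keeps tasks that have no recorded run, so a web page built on this query can also show tasks that were configured but have not yet executed.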

The task scheduling system further comprises a third-party system monitoring module;

The third-party system monitoring module monitors the Crond service; when the service fails, it is restarted automatically, the system service failure log is retained, and the administrator is notified of the log by email or SMS. For example, Zabbix or a shell script can be used to monitor the Crond service.

The invention effectively meets the requirements of a tax big data task scheduling system: distributed task scheduling is implemented simply and quickly through Crontab, scheduling log collection and aggregation are implemented with Flume + Kafka + Storm, and the results are sent to the web front end for analysis and display.

Those skilled in the art can readily implement the invention from the above specific embodiments. It should be understood, however, that the invention is not limited to these specific embodiments. On the basis of the disclosed embodiments, those skilled in the art can combine different technical features in any way to realize different technical solutions.

Claims

1. A big data distributed task scheduling method, characterized in that the method comprises performing distributed task scheduling through a proxy server, collecting and aggregating scheduling logs through a collection cluster, a stream computing cluster, and a distributed message queue, and sending query results to a web front end for analysis and display.

2. The big data distributed task scheduling method according to claim 1, characterized in that the method is as follows:

S1. The user configures scheduling rules through the application server;

S3. The proxy server submits scheduled tasks to the cluster;

S4. The cluster sends task logs to the log server;

S5. The collection cluster collects the task logs;

S6. The stream computing cluster receives the task logs by push or pull;

3. A big data distributed task scheduling system, characterized in that it comprises a task scheduling system and a scheduling log collection and aggregation system;

4. The big data distributed task scheduling system according to claim 3, characterized in that the scheduling log collection and aggregation system comprises a collection cluster, a distributed message queue, and a stream computing cluster;

5. The big data distributed task scheduling system according to claim 3, characterized in that the task scheduling system further comprises a third-party system monitoring module;

6. The big data distributed task scheduling system according to claim 3, characterized in that the application server module is used to deploy the task scheduling system management program, which provides functions for configuring, managing, and monitoring tasks; after a task has been configured, the configuration, including the timed-trigger task information and the dependency information between tasks, is submitted to the scheduling proxy server module;

7. The big data distributed task scheduling system according to claim 3, characterized in that the scheduling proxy server module is used to submit distributed tasks to the Hadoop cluster module;