CN109857558A - Data stream processing method and system

Info

Publication number: CN109857558A
Application number: CN201910048043.XA
Authority: CN
Inventors: 郭业俊; 李�浩; 王志强; 孙迁
Original assignee: Suningcom Group Co Ltd
Current assignee: Suningcom Group Co Ltd
Priority date: 2019-01-18
Filing date: 2019-01-18
Publication date: 2019-06-07
Also published as: CA3168286A1; WO2020147330A1

Abstract

The invention discloses a data stream processing method and system, belonging to the field of big data processing. The method includes: determining, through a ZooKeeper cluster, one of several Master nodes as the primary node; providing, by the primary node, an external interface to receive an online request for a business, and assigning a task to the business; generating, by the primary node, configuration information for the task according to the current state information reported by each of multiple Worker nodes, and writing it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task; and, if a Worker node detects that a task scheduled to itself exists in the ZooKeeper cluster, starting a Flume service to execute it. Embodiments of the invention achieve high availability of the Master nodes and Worker nodes, improve the availability of the Flume service, and avoid uneven and wasted resource usage. They also greatly simplify bringing businesses online and offline and reduce interference between businesses.

Description

Data stream processing method and system

Technical field

The present invention relates to the field of big data processing, and in particular to a data stream processing method and system.

Background art

In the prior art, a Flume service is usually started on each node of a cluster to dump data from the data source (source) to the sink.

In the course of implementing the present invention, the inventors found the following. When a node device suffers a common fault (such as a deadlock or a consumption exception), the system is unaware of it and manual handling is required, which delays troubleshooting. In addition, all nodes in the same cluster use the same configuration while the business data volume on each node varies, so a large proportion of Flume collection threads tend to sit idle. Furthermore, new businesses are brought online frequently, and each time the business configuration file must be modified by hand and the entire cluster restarted, which disrupts the normal execution of other businesses in the same cluster.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art or the related art. To this end, the present invention provides a data stream processing method and system.

The specific technical solutions provided by the embodiments of the present invention are as follows:

In a first aspect, a data stream processing method is provided, the method comprising:

determining, through a ZooKeeper cluster, one of several Master nodes as the primary node;

providing, by the primary node, an external interface to receive an online request for a business, and assigning a task to the business; and

generating configuration information for the task according to the current state information reported by each of multiple Worker nodes, and writing it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task;

if the Worker node detects that a task scheduled to itself exists in the ZooKeeper cluster, starting a Flume service to execute the task.

Further, determining one of several Master nodes as the primary node through the ZooKeeper cluster includes:

the ZooKeeper cluster receives a primary node election request initiated by the Master node based on a preset trigger event, and makes the Master node the primary node after the election succeeds, where the preset trigger event is one of the following events:

the Master node is started;

the current Master node acting as the primary node fails.

Further, generating the configuration information of the task according to the current state information reported by each of the multiple Worker nodes includes:

determining, according to the host running-state information reported by each of the multiple Worker nodes, the target Worker node with the best host running state among the multiple Worker nodes;

generating configuration information indicating that the task is scheduled to the target Worker node.

Further, the method also includes:

adjusting, by the primary node, the configuration information of the task according to the host running-state information and task execution-state information reported by each of the multiple Worker nodes;

where the adjusted configuration information of the task indicates that tasks in an idle state are to be scaled in and tasks in a backlogged state are to be scaled out; and

that tasks on Worker nodes whose host load is overloaded are to be migrated to Worker nodes whose host load is idle for execution.

Further, the method also includes:

the primary node receives an offline request for the business through the external interface; and

writes the offline information of the business and the offline information of the task assigned to the business to the ZooKeeper cluster, so that the Worker node executing the task stops the Flume service.

In a second aspect, a data stream processing system is provided, the system comprising a ZooKeeper cluster, several Master nodes, and multiple Worker nodes, where:

the ZooKeeper cluster is configured to determine one of the several Master nodes as the primary node;

the primary node is configured to provide an external interface to receive an online request for a business, and to assign a task to the business;

the primary node is further configured to generate configuration information for the task according to the current state information reported by each of the multiple Worker nodes and to write it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task;

the Worker node is configured to start a Flume service to execute the task if it detects that a task scheduled to itself exists in the ZooKeeper cluster.

Further, the ZooKeeper cluster is specifically configured to:

receive a primary node election request initiated by the Master node based on a preset trigger event, and make the Master node the primary node after the election succeeds, where the preset trigger event is one of the following events:

the Master node is started;

the current Master node acting as the primary node fails.

Further, the primary node is specifically configured to:

Further, the primary node is also specifically configured to:

receive an offline request for the business through the external interface; and

The technical solutions provided by the embodiments of the present invention have the following beneficial effects:

1. One of several Master nodes is determined as the primary node through the ZooKeeper cluster, so the Master nodes obtain a high-availability mechanism from the ZooKeeper cluster. If one Master node fails, another Master node can quickly take over and provide external service within a short time, which improves the availability of the Flume service and also solves the prior-art problem that common faults on node devices are not handled promptly.

2. Because the primary node provides an external interface, users can call that interface directly to bring tasks online and offline, which can shorten the time needed to bring a business online or offline to within 1 minute and greatly simplifies these operations. Moreover, updating a configuration requires neither manual modification of the business configuration file nor a cluster restart; only the business itself needs to be restarted, which reduces interference between businesses.

3. The primary node generates the task configuration information according to the host running-state information reported by each Worker node and writes it to the ZooKeeper cluster, and a Worker node that detects a task scheduled to itself in the ZooKeeper cluster starts a Flume service to execute it. Configuration is thus managed uniformly through the ZooKeeper cluster, which avoids an excessive proportion of idle Flume collection threads, solves the problem of uneven and wasted resource usage, and makes operation and maintenance more convenient.

Detailed description of the invention

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a data stream processing method provided by Embodiment 1 of the present invention;

Fig. 2 is a block diagram of a data stream processing system provided by Embodiment 2 of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

In the description of the present application, the terms "first", "second", and the like are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. In addition, unless otherwise stated, "multiple" means two or more.

Before the embodiments of the present invention are introduced, several technical terms are briefly explained:

ZooKeeper: a sub-project of Hadoop. It is a reliable coordination system for large-scale distributed systems, providing functions such as configuration maintenance, naming service, distributed synchronization, and group services.

Flume: a highly available, highly reliable, distributed technology for collecting, aggregating, and transmitting massive volumes of logs.

An embodiment of the present invention provides a data stream processing method that can be executed by a data stream processing system. The system uses a distributed client/server architecture and includes a ZooKeeper cluster, several Master nodes, and multiple Worker nodes. Each Master node and each Worker node must first register a node in the ZooKeeper cluster so that the ZooKeeper cluster can manage all Master nodes and Worker nodes uniformly. Each Master node and each Worker node can be deployed on its own hardware server, and the specific numbers of Master nodes and Worker nodes can be chosen by the user according to the business scenario; they are not limited here. In the embodiments of the present invention, each Worker node starts its own Flume service, which acts as a data dumping tool responsible for moving data from the data source (source) to the sink.

Fig. 1 is a flowchart of a data stream processing method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may include the following steps:

101. Determine, through the ZooKeeper cluster, one of several Master nodes as the primary node.

The number of Master nodes may be two: when one Master node is determined as the primary node, the other is determined as the standby node. A Master node that is determined as the primary node provides external service, exposes an external interface through which businesses can be brought online, taken offline, viewed, and modified, and is responsible for task scheduling.

Specifically, the ZooKeeper cluster receives a primary node election request initiated by a Master node based on a preset trigger event, and makes that Master node the primary node after the election succeeds.

In one illustrative implementation, the preset trigger event may be that the Master node is started.

Specifically, after a start command is executed, the Master node sends the ZooKeeper cluster a request to take part in the election of the primary node (i.e., the Leader node). If the ZooKeeper cluster determines that an Active Master node already exists as the primary node, the election fails for this Master node and the ZooKeeper cluster records its state as Standby. If the ZooKeeper cluster determines that no Active Master node exists as the primary node, the election succeeds and the ZooKeeper cluster records the state of this Master node as Active. The Master node in the Active state provides external service.
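The start-up election described above can be sketched as follows. This is a minimal illustration only: `FakeCoordinator` is a hypothetical in-memory stand-in for the ZooKeeper cluster, and its method and field names are assumptions rather than anything specified in this document or in the ZooKeeper API.

```python
class FakeCoordinator:
    """Hypothetical in-memory stand-in for the ZooKeeper cluster."""

    def __init__(self):
        self.leader = None  # id of the Active Master node, if any
        self.states = {}    # Master id -> "Active" or "Standby"

    def request_election(self, master_id):
        # If no Active Master exists, the requester wins the election;
        # otherwise it is recorded as Standby.
        if self.leader is None or self.leader == master_id:
            self.leader = master_id
            self.states[master_id] = "Active"
        else:
            self.states[master_id] = "Standby"
        return self.states[master_id]


zk = FakeCoordinator()
first = zk.request_election("master-21")   # started first
second = zk.request_election("master-22")  # started while master-21 is Active
print(first, second)
```

In a real deployment the "first requester wins" rule would typically rest on an ephemeral node in ZooKeeper, so that the entry disappears when the Active Master's session ends.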

In another illustrative implementation, the preset trigger event may be that the current Master node acting as the primary node fails.

Specifically, if the current Master node acting as the primary node fails, the ZooKeeper cluster obtains the fault information of that Master node, receives the election requests issued by the other Master nodes, and determines one of those other Master nodes as the primary node within a preset time. The election may, for example, select the Master node with the best host performance among the other Master nodes; the embodiments of the present invention do not limit this.
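The failover case can be sketched in the same spirit. The function below assumes a hypothetical numeric "host performance" score per Master node (higher is better); the document does not define how host performance is measured, so the score and all names here are illustrative assumptions.

```python
def fail_over(states, failed_id, performance):
    """Choose a new primary node among the Masters other than failed_id.

    states:      dict of Master id -> "Active" or "Standby"
    performance: dict of Master id -> hypothetical host performance score
    """
    candidates = [m for m in states if m != failed_id]
    if not candidates:
        raise RuntimeError("no other Master node is available")
    # Assumed policy: the candidate with the best host performance wins.
    new_primary = max(candidates, key=lambda m: performance[m])
    new_states = {
        m: ("Active" if m == new_primary else "Standby") for m in candidates
    }
    return new_primary, new_states


states = {"master-1": "Active", "master-2": "Standby", "master-3": "Standby"}
scores = {"master-2": 0.4, "master-3": 0.9}
new_primary, new_states = fail_over(states, "master-1", scores)
print(new_primary, new_states)
```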

It should be noted that then the Master node will start after a Master node is elected as host node Multiple functional modules, multiple functional modules can specifically include RestServer, RPCServer and Scheduler, wherein The additions and deletions that RestServer is responsible for business, which change, looks into, and RPCServer is responsible for receiving the current state information that Worker node reports, Scheduler is responsible for dynamic dispatching distribution Task.

In addition, working directories are stored in the ZooKeeper cluster, including but not limited to:

Leader: an ephemeral node used for Master node leader election;

Master: stores the service URL of the Active Master;

Jobs: a directory node that stores Job data;

Workers: ephemeral nodes used for Worker node discovery;

Assign/<worker-id>/: the list of assigned tasks, with one directory node per Worker node.
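As an illustration of the directory layout listed above, the following sketch builds the corresponding paths. The root prefix `/dataflow` and the lower-case node names are assumptions made for the example; the document gives only the directory names.

```python
ROOT = "/dataflow"  # hypothetical root path, not given in the document

PATHS = {
    "leader": f"{ROOT}/leader",    # ephemeral node used for Master election
    "master": f"{ROOT}/master",    # service URL of the Active Master
    "jobs": f"{ROOT}/jobs",        # directory node holding Job data
    "workers": f"{ROOT}/workers",  # ephemeral nodes for Worker discovery
    "assign": f"{ROOT}/assign",    # one child directory per Worker node
}


def assign_path(worker_id: str) -> str:
    """Return the directory that lists the tasks assigned to one Worker."""
    if not worker_id or "/" in worker_id:
        raise ValueError("worker_id must be a non-empty single path segment")
    return f"{PATHS['assign']}/{worker_id}"


print(assign_path("worker-31"))
```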

In the embodiments of the present invention, one of several Master nodes is determined as the primary node through the ZooKeeper cluster, so the Master nodes obtain a high-availability mechanism from the ZooKeeper cluster. If one Master node fails, another Master node can quickly take over and provide external service within a short time, which also solves the prior-art problem that common faults on node devices are not handled promptly.

102. Provide, by the primary node, an external interface to receive an online request for a business, and assign a task to the business.

Specifically, the primary node determined from the several Master nodes provides an external interface that can receive online requests for a business (i.e., a Job). A user can call the external interface through a business application interface to bring the business online, and the online request carries the configuration information of the business in JSON format. The primary node initializes the business parameters according to this configuration information and assigns tasks to the business. A task (i.e., a Task) is the execution unit of a Job on a Worker node; it reads data from the source and dumps it to the sink.
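A minimal sketch of handling such an online request follows. The JSON field names (`job_id`, `source`, `sink`, `parallelism`) and the rule of creating one Task per unit of parallelism are assumptions for illustration; the document states only that the configuration is passed in JSON format and that tasks are assigned to the business.

```python
import json


def handle_online_request(body: str) -> dict:
    """Parse a JSON online request and split the Job into Tasks."""
    config = json.loads(body)
    job_id = config["job_id"]
    parallelism = int(config.get("parallelism", 1))
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # Each Task reads from the Job's source and dumps to the Job's sink.
    tasks = [
        {
            "task_id": f"{job_id}-task-{i}",
            "job_id": job_id,
            "source": config["source"],
            "sink": config["sink"],
        }
        for i in range(parallelism)
    ]
    return {"job_id": job_id, "status": "online", "tasks": tasks}


request_body = json.dumps(
    {"job_id": "orders", "source": "orders-source",
     "sink": "orders-sink", "parallelism": 2}
)
job = handle_online_request(request_body)
print([t["task_id"] for t in job["tasks"]])
```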

In the embodiments of the present invention, because the primary node provides an external interface, users can call that interface directly to bring tasks online and offline, which can shorten the time needed to bring a business online or offline to within 1 minute and simplifies these operations. Moreover, updating a configuration requires neither manual modification of the business configuration file nor a cluster restart; only the business itself needs to be restarted, which reduces interference between businesses.

103. Generate, by the primary node, configuration information for the task according to the host running-state information reported by each of the multiple Worker nodes, and write it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task.

The host running-state information includes one or more of CPU usage, memory usage, disk read/write activity, and network uplink/downlink traffic.

Specifically, this process may include:

determining, according to the host running-state information reported by each of the multiple Worker nodes, the target Worker node with the best host running state among them; and generating configuration information indicating that the task is scheduled to the target Worker node.
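One way to picture the selection of the target Worker node is sketched below. The document does not say how "best host running state" is computed; the equal-weight average of four utilisation ratios used here, and all field names, are assumptions made for the example.

```python
METRICS = ("cpu", "memory", "disk_io", "network")  # assumed ratios in [0, 1]


def load_score(report: dict) -> float:
    """Lower is better: equal-weight mean of the reported ratios."""
    return sum(report[m] for m in METRICS) / len(METRICS)


def pick_target_worker(reports: dict) -> str:
    """Return the id of the Worker with the lowest load score."""
    if not reports:
        raise ValueError("no Worker reports available")
    return min(reports, key=lambda w: load_score(reports[w]))


def make_assignment(task_id: str, reports: dict) -> dict:
    """Build configuration information naming the scheduled Worker."""
    return {"task_id": task_id, "worker_id": pick_target_worker(reports)}


reports = {
    "worker-31": {"cpu": 0.9, "memory": 0.8, "disk_io": 0.5, "network": 0.4},
    "worker-32": {"cpu": 0.2, "memory": 0.3, "disk_io": 0.1, "network": 0.2},
    "worker-33": {"cpu": 0.5, "memory": 0.5, "disk_io": 0.5, "network": 0.5},
}
assignment = make_assignment("orders-task-0", reports)
print(assignment)
```

A production scheduler would likely weight the metrics differently or apply thresholds; the point of the sketch is only the "report, score, pick the minimum" shape of the step.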

In this embodiment, each Worker node starts a Report thread after it is started. The thread monitors the running state of the node's own host, generates host running-state information, and reports it to the Master node acting as the primary node.

104. If a Worker node detects that a task scheduled to itself exists in the ZooKeeper cluster, it starts a Flume service to execute the task.

Specifically, each Worker node watches the state in the ZooKeeper cluster. If it detects that a new task has been added to the ZooKeeper cluster and that the task is scheduled to itself, it obtains the configuration information of the task from the ZooKeeper cluster and starts a Flume service to execute it. Each Worker node has its own Flume service deployed.
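The Worker-side reaction can be sketched as a callback that runs whenever the watched assignment data changes. The `WorkerAgent` class, the shape of the assignment dictionary, and the `start_flume` callable are all hypothetical; no real ZooKeeper or Flume API is invoked here.

```python
class WorkerAgent:
    """Hypothetical Worker-side handler for assignment changes."""

    def __init__(self, worker_id, start_flume):
        self.worker_id = worker_id
        self.start_flume = start_flume  # assumed callable that launches Flume
        self.running = {}               # task id -> configuration

    def on_assignments_changed(self, assignments):
        """assignments: dict of task id -> config containing 'worker_id'."""
        started = []
        for task_id, config in sorted(assignments.items()):
            if config["worker_id"] != self.worker_id:
                continue  # scheduled to another Worker node
            if task_id in self.running:
                continue  # already executing on this Worker node
            self.start_flume(task_id, config)
            self.running[task_id] = config
            started.append(task_id)
        return started


launched = []
agent = WorkerAgent("worker-32", lambda task_id, cfg: launched.append(task_id))
snapshot = {
    "orders-task-0": {"worker_id": "worker-32"},
    "orders-task-1": {"worker_id": "worker-31"},
}
first_pass = agent.on_assignments_changed(snapshot)
second_pass = agent.on_assignments_changed(snapshot)  # nothing new to start
print(first_pass, second_pass)
```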

In the embodiments of the present invention, the primary node generates the configuration information of the task and writes it to the ZooKeeper cluster, and a Worker node that detects a task scheduled to itself in the ZooKeeper cluster starts a Flume service to execute it. Configuration is thus managed uniformly through the ZooKeeper cluster, which avoids an excessive proportion of idle Flume collection threads, solves the problem of uneven and wasted resource usage, and makes operation and maintenance more convenient.

As a further preferred implementation, the method provided by the embodiments of the present invention may also include:

adjusting, by the primary node, the configuration information of the task according to the host running-state information and task execution-state information reported by each of the multiple Worker nodes.

The adjusted configuration information of the task indicates that tasks in an idle state are to be scaled in and tasks in a backlogged state are to be scaled out, and that tasks on Worker nodes whose host load is overloaded are to be migrated to Worker nodes whose host load is idle.

The task execution-state information includes the task execution speed and the task backlog.

Specifically, each Worker node can send the execution-state information of its local tasks and its host state information to the primary node;

the primary node determines, from the task execution-state information, which tasks are idle and which are backlogged, adds the idle tasks (Idle Task) and the backlogged tasks (Busy Task) to an IdleTask queue and a BusyTask queue respectively, and automatically scales in the tasks in the IdleTask queue and scales out the tasks in the BusyTask queue;

the primary node determines, from the host running-state information of each Worker node, the Worker nodes whose host load is overloaded and those whose host load is idle, and migrates tasks from the overloaded Worker nodes to the idle Worker nodes for execution.
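The adjustment logic above might look roughly like the following. The thresholds, the definition of "idle" and "backlogged", and the rule of moving one task per overloaded Worker are assumptions for the sketch; the document names the IdleTask and BusyTask queues but does not give concrete criteria.

```python
def classify_tasks(task_stats, idle_speed=10.0, backlog_limit=1000):
    """Split tasks into IdleTask and BusyTask lists (assumed thresholds)."""
    idle_tasks, busy_tasks = [], []
    for task_id, stats in sorted(task_stats.items()):
        if stats["backlog"] > backlog_limit:
            busy_tasks.append(task_id)   # candidate for scale-out
        elif stats["speed"] < idle_speed and stats["backlog"] == 0:
            idle_tasks.append(task_id)   # candidate for scale-in
    return idle_tasks, busy_tasks


def plan_migrations(host_load, placement, overload_above=0.8, idle_below=0.3):
    """Plan one move per overloaded Worker onto a distinct idle Worker."""
    idle_workers = sorted(w for w, load in host_load.items() if load < idle_below)
    moves = []
    for worker in sorted(w for w, load in host_load.items() if load > overload_above):
        tasks = sorted(t for t, w in placement.items() if w == worker)
        if tasks and idle_workers:
            moves.append((tasks[0], worker, idle_workers.pop(0)))
    return moves


task_stats = {
    "t1": {"speed": 0.0, "backlog": 0},
    "t2": {"speed": 500.0, "backlog": 5000},
    "t3": {"speed": 200.0, "backlog": 10},
}
idle_tasks, busy_tasks = classify_tasks(task_stats)
moves = plan_migrations(
    {"w1": 0.95, "w2": 0.10, "w3": 0.50},
    {"t1": "w1", "t2": "w1", "t3": "w3"},
)
print(idle_tasks, busy_tasks, moves)
```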

In the embodiments of the present invention, the primary node adjusts the configuration information of the task according to the host running-state information and task execution-state information reported by each Worker node. When the host load of a Worker node is high and the load across cluster machines is unbalanced, the Tasks on the heavily loaded Worker node are migrated to other Worker nodes for execution. This balances load across Worker nodes and Tasks in the cluster, achieves high availability of the Worker nodes, improves the availability of the Flume service, and also solves the problem of uneven and wasted resource usage.

Further, the method may also include: the primary node receives an offline request for the business through the external interface; and

writes the offline information of the business and the offline information of the tasks assigned to the business to the ZooKeeper cluster, so that the Worker nodes executing the tasks stop the Flume service.

Specifically, the primary node provides an external interface to receive offline requests for a business (i.e., a Job). The primary node marks the state of the business as offline and then saves the offline information of the business and of the tasks assigned to it in the ZooKeeper cluster. When a Worker node executing one of these tasks detects the task's offline information in the ZooKeeper cluster, it executes the Task's stop command and shuts down the Flume service. Once all tasks related to the business have stopped, the business is fully offline.
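The offline flow can be sketched with a plain dictionary standing in for the data kept in the ZooKeeper cluster. The store layout, the status strings, and the `stop_flume` callable are hypothetical and serve only to show the order of events.

```python
def take_job_offline(store, job_id):
    """Primary node side: mark the Job and its Tasks as offline."""
    store["jobs"][job_id]["status"] = "offline"
    for task in store["tasks"].values():
        if task["job_id"] == job_id:
            task["status"] = "offline"


def worker_sync(store, worker_id, stop_flume):
    """Worker side: stop Flume for each local task marked offline."""
    for task_id, task in sorted(store["tasks"].items()):
        mine = task["worker_id"] == worker_id
        if mine and task["status"] == "offline" and task["running"]:
            stop_flume(task_id)  # assumed callable that stops the Flume agent
            task["running"] = False


def job_is_fully_offline(store, job_id):
    """The Job is offline once none of its Tasks is still running."""
    return all(
        not t["running"] for t in store["tasks"].values() if t["job_id"] == job_id
    )


store = {
    "jobs": {"orders": {"status": "online"}},
    "tasks": {
        "orders-task-0": {"job_id": "orders", "worker_id": "w1",
                          "status": "online", "running": True},
        "orders-task-1": {"job_id": "orders", "worker_id": "w2",
                          "status": "online", "running": True},
    },
}
stopped = []
take_job_offline(store, "orders")
worker_sync(store, "w1", stopped.append)
partly_done = job_is_fully_offline(store, "orders")
worker_sync(store, "w2", stopped.append)
all_done = job_is_fully_offline(store, "orders")
print(stopped, partly_done, all_done)
```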

Fig. 2 is a block diagram of a data stream processing system provided by Embodiment 2 of the present invention. The system includes a ZooKeeper cluster 10, several Master nodes 20, and multiple Worker nodes 30. As shown in Fig. 2, the number of Master nodes 20 may be configured as two, namely Master node 21 and Master node 22; when one of them acts as the primary node, the other acts as the standby node. The number of Worker nodes 30 may be configured as three, namely Worker node 31, Worker node 32, and Worker node 33, where:

the ZooKeeper cluster is configured to determine one of the several Master nodes as the primary node;

the primary node is configured to provide an external interface to receive an online request for a business, and to assign a task to the business;

the primary node is further configured to generate configuration information for the task according to the current state information reported by each of the multiple Worker nodes and to write it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task;

the Worker node is configured to start a Flume service to execute the task if it detects that a task scheduled to itself exists in the ZooKeeper cluster.

Further, the ZooKeeper cluster is specifically configured to:

receive a primary node election request initiated by a Master node based on a preset trigger event, and make the Master node the primary node after the election succeeds, where the preset trigger event is one of the following events:

the Master node is started;

the current Master node acting as the primary node fails.

Further, the primary node is specifically configured to:

determine, according to the host running-state information reported by each of the multiple Worker nodes, the target Worker node with the best host running state among them;

generate configuration information indicating that the task is scheduled to the target Worker node.

Further, the primary node is also specifically configured to:

adjust the configuration information of the task according to the host running-state information and task execution-state information reported by each of the multiple Worker nodes;

Further, the primary node is also specifically configured to:

receive an offline request for the business through the external interface; and

It should be understood that the data stream processing system provided by the above embodiment is described only in terms of the above division into functional modules. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the system may be divided into different functional modules to complete all or part of the functions described above. In addition, the data stream processing system and the data stream processing method embodiments belong to the same concept; for the specific implementation process and beneficial effects, see the method embodiment, which is not repeated here.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A data stream processing method, characterized in that the method comprises:

providing, by the primary node, an external interface to receive an online request for a business, and assigning a task to the business; and

generating configuration information for the task according to the host running-state information reported by each of multiple Worker nodes, and writing it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task;

2. The method according to claim 1, characterized in that determining one of several Master nodes as the primary node through the ZooKeeper cluster comprises:

the ZooKeeper cluster receives a primary node election request initiated by the Master node based on a preset trigger event, and makes the Master node the primary node after the election succeeds, where the preset trigger event is one of the following events:

the Master node is started;

the current Master node acting as the primary node fails.

3. The method according to claim 1, characterized in that generating the configuration information of the task according to the host running-state information reported by each of the multiple Worker nodes comprises:

determining, according to the host running-state information reported by each of the multiple Worker nodes, the target Worker node with the best host running state among the multiple Worker nodes;

4. The method according to claim 1, characterized in that the method further comprises:

adjusting, by the primary node, the configuration information of the task according to the host running-state information and task execution-state information reported by each of the multiple Worker nodes;

where the adjusted configuration information of the task indicates that tasks in an idle state are to be scaled in and tasks in a backlogged state are to be scaled out; and

that tasks on Worker nodes whose host load is overloaded are to be migrated to Worker nodes whose host load is idle for execution.

5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:

writing the offline information of the business and the offline information of the task assigned to the business to the ZooKeeper cluster, so that the Worker node executing the task stops the Flume service.

6. A data stream processing system, characterized in that the system comprises a ZooKeeper cluster, several Master nodes, and multiple Worker nodes, where:

the primary node is further configured to generate configuration information for the task according to the current state information reported by each of the multiple Worker nodes and to write it to the ZooKeeper cluster, the configuration information including information indicating the Worker node scheduled to execute the task;

the Worker node is configured to start a Flume service to execute the task if it detects that a task scheduled to itself exists in the ZooKeeper cluster.

7. The system according to claim 6, characterized in that the ZooKeeper cluster is specifically configured to:

receive a primary node election request initiated by the Master node based on a preset trigger event, and make the Master node the primary node after the election succeeds, where the preset trigger event is one of the following events:

the Master node is started;

the current Master node acting as the primary node fails.

8. The system according to claim 6, characterized in that the primary node is specifically configured to:

9. The system according to claim 6, characterized in that the primary node is also specifically configured to:

10. The system according to any one of claims 6 to 9, characterized in that the primary node is also specifically configured to:

receive an offline request for the business through the external interface; and