CN111124630A - System and method for running Spark Streaming program

Info

Publication number: CN111124630A
Application number: CN201911197734.2A
Authority: CN
Inventors: 周朝卫
Original assignee: Zhongying Youchuang Information Technology Co Ltd
Current assignee: Zhongying Youchuang Information Technology Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-05-08
Anticipated expiration: 2039-11-29
Also published as: CN111124630B

Abstract

The invention discloses a system and a method for running a Spark Streaming program. The system comprises a plurality of nodes in a candidate node queue, each node having a Spark Streaming program installed, and a coordinating node determined among the plurality of nodes. The coordinating node selects a first node from the candidate node queue to run the Spark Streaming program; receives state information sent by each node in the candidate node queue and determines a fault node; deletes the fault node from the candidate node queue to obtain a target candidate node queue; and, when the node running the Spark Streaming program fails, selects a second node from the target candidate node queue to run the Spark Streaming program.

Description

System and method for running Spark Streaming program

Technical Field

The invention relates to the field of computers, in particular to a system and a method for operating a Spark Streaming program.

Background

Spark is a big data parallel computing framework based on in-memory computing, and can greatly improve the real-time performance of data processing in a big data environment. Spark Streaming is an extension of the Spark core API that enables high-throughput, fault-tolerant processing of real-time streaming data.

Spark Streaming must process real-time data streams continuously, so the host running a Spark Streaming program must keep the program running stably as a resident process. However, the Spark Streaming program exits when the host goes down, when memory overflows because of GC (garbage collection), when memory is exhausted by a transient peak in the data source, when the Driver process of the Spark Streaming program becomes abnormal, or when the upstream data source suffers a transient fault, so stable running of the Spark Streaming program is difficult to guarantee.

Disclosure of Invention

An embodiment of the present invention provides a system for running a Spark Streaming program, which is used to ensure stable operation of the Spark Streaming program, and the system includes:

a plurality of nodes, wherein the plurality of nodes are in a candidate node queue and a Spark Streaming program is installed on each node; a coordinating node is determined among the plurality of nodes;

wherein the coordinating node is configured to: selecting a first node from the candidate node queue to run a Spark Streaming program, and deleting the first node from the candidate node queue;

receiving state information sent by each node in the candidate node queue, and determining a fault node according to the state information; deleting the fault node from the candidate node queue to obtain a target candidate node queue;

when the node running the Spark Streaming program fails, selecting a second node from the target candidate node queue to run the Spark Streaming program, and deleting the second node from the target candidate node queue.

The embodiment of the present invention further provides an operation method of a Spark Streaming program, which is used to ensure stable operation of the Spark Streaming program, and the method includes:

respectively installing Spark Streaming programs on a plurality of nodes, wherein the nodes are in a candidate node queue; determining a coordinating node among the plurality of nodes;

selecting a first node from the candidate node queue to run a Spark Streaming program, and deleting the first node from the candidate node queue;

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the processor implements the above operation method of the Spark Streaming program.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the operation method of the Spark Streaming program is stored in the computer-readable storage medium.

In the embodiment of the invention, Spark Streaming programs are installed on a plurality of nodes that are in a candidate node queue, and a coordinating node is determined among the plurality of nodes. A first node is selected from the candidate node queue to run the Spark Streaming program and is deleted from the candidate node queue. State information sent by each node in the candidate node queue is received, a fault node is determined according to the state information, and the fault node is deleted from the candidate node queue to obtain a target candidate node queue; monitoring the state of each node in this way ensures that the nodes in the target candidate node queue are normal nodes. When the node running the Spark Streaming program fails, a second node is selected from the target candidate node queue to run the Spark Streaming program and is deleted from the target candidate node queue; because a new node is selected promptly when the running node fails, stable running of the Spark Streaming program is achieved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:

fig. 1 is a schematic diagram of the structure of a system for running a Spark Streaming program according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a flow of an operation method of a Spark Streaming program in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

In order to ensure stable operation of a Spark Streaming program, an embodiment of the present invention provides a system for running the Spark Streaming program. Fig. 1 is a schematic diagram of the structure of the system in the embodiment of the present invention. As shown in fig. 1, the system includes:

a plurality of nodes 01, wherein the plurality of nodes 01 are in a candidate node queue 03 and a Spark Streaming program is installed on each node; a coordinating node 02 is determined among the plurality of nodes 01;

wherein, the coordinating node 02 is configured to: selecting a first node from the candidate node queue 03 to run a Spark Streaming program, and deleting the first node from the candidate node queue;

receiving state information sent by each node in the candidate node queue 03, and determining a fault node according to the state information; deleting the fault node from the candidate node queue 03 to obtain a target candidate node queue 04;

when the node running the Spark Streaming program fails, selecting a second node from the target candidate node queue 04 to run the Spark Streaming program, and deleting the second node from the target candidate node queue 04.
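The queue handling performed by the coordinating node can be illustrated with a minimal Python sketch. The class name `Coordinator`, its method names, and the third host `host_spark_003` are hypothetical and chosen only for illustration; the sketch models the queue bookkeeping alone and does not start or monitor any real Spark Streaming program.

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinating node that manages the candidate node queue."""

    def __init__(self, nodes):
        self.candidates = deque(nodes)  # candidate node queue
        self.running = None             # node currently running the program

    def start_next(self):
        """Select the first candidate, delete it from the queue, and run on it."""
        if not self.candidates:
            raise RuntimeError("no healthy candidate node is available")
        self.running = self.candidates.popleft()
        return self.running

    def remove_faulty(self, faulty):
        """Delete fault nodes; what remains is the target candidate node queue."""
        self.candidates = deque(n for n in self.candidates if n not in faulty)

    def on_running_node_failed(self):
        """Fail over to the first node of the target candidate node queue."""
        return self.start_next()


coord = Coordinator(["host_spark_001", "host_spark_002", "host_spark_003"])
first = coord.start_next()               # first node runs the program
coord.remove_faulty({"host_spark_002"})  # a fault node is reported and removed
second = coord.on_running_node_failed()  # the running node fails, so fail over
```

In this sketch the second node is taken from the head of the queue, which matches the embodiment in which the coordinating node selects the node ranked first.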

As shown in fig. 1, in the embodiment of the present invention, Spark Streaming programs are installed on a plurality of nodes that are in a candidate node queue, and a coordinating node is determined among the plurality of nodes. A first node is selected from the candidate node queue to run the Spark Streaming program and is deleted from the candidate node queue. State information sent by each node in the candidate node queue is received, a fault node is determined according to the state information, and the fault node is deleted from the candidate node queue to obtain a target candidate node queue; monitoring the state of each node in this way ensures that the nodes in the target candidate node queue are normal nodes. When the node running the Spark Streaming program fails, a second node is selected from the target candidate node queue to run the Spark Streaming program and is deleted from the target candidate node queue; because a new node is selected promptly when the running node fails, stable running of the Spark Streaming program is achieved.

In a specific implementation, the plurality of nodes 01 may be a plurality of hosts, a Spark Streaming program with a high-availability function is deployed on each host, the plurality of nodes 01 are in the candidate node queue 03, and a coordination node 02 is generated among the plurality of nodes 01.

In one embodiment, upon failure of the coordinating node, the other nodes than the coordinating node are further configured to:

obtaining a fault coordination node version value in a database table; the database table comprises a global table, and the global table is used for storing the version value of the running coordination node;

writing a new coordination node version value into a database table according to the fault coordination node version value; the database table comprises a coordination node table, and the coordination node table is used for storing a new coordination node version value;

and taking the node which is successfully written as a new coordination node, and updating a coordination node version value in the global table.

In specific implementation, as shown in Table 1, a coordination node table may be added to the database to store information such as the version value, host name, and host IP of the coordination node, and a new coordination node version value is stored whenever a new coordination node is generated. On first use, the coordination node table is initialized so that it contains one record whose version value is 0.

For example:

insert into coordinator(node, version) values("host_spark_001", 0);

The above statement initializes one record with hostname host_spark_001 and version 0.

Table 1 Coordination node table

Field value	app_kafka_id
Coordinating node	Identity of the coordinating node, e.g. hostname, host IP, etc.
Version value	0 (incremented by 1 each time a new coordinating node is generated)

As shown in Table 2, a global table may be added to the database table to hold the running coordination node version values.

Table 2 Global table

Field value	app_kafka_id
Version value	0 (incremented by 1 each time a new coordinating node is generated)

When the method is used for the first time, the global table is initialized, and the consistency of the submission marks of the coordination node table and the global table is ensured.

For example:

insert into global_version(id, flag) values("app_kafka_id", 0);

Each node in the candidate node queue 03 may periodically communicate with the coordinating node 02. When communication between the coordinating node 02 and the nodes fails, the coordinating node is considered to have failed, and the nodes other than the coordinating node may then determine a new coordinating node according to the following steps:

firstly, obtaining a version value of a fault coordination node in a global table;

then, according to the version value of the fault coordination node in the global table, writing a new version value of the coordination node into the coordination node table, setting the value of the node field as the host name of the current node, and increasing the version value by 1;

For example:

the version value of the fault coordination node obtained from the global table is 5;

update coordinator set node='host_spark_002', version=version+1 where version=5;

When one node successfully updates the version value, the version value changes to 6, and the other nodes, whose updates are conditional on the version being 5, can no longer update the record. Whether the update succeeded can be judged from the return value of the execution result: when the return value is 1, the update succeeded, and that node is taken as the coordinating node.

Finally, the coordination node version value in the global table is updated, ensuring that the version values of the global table and the coordination node table are consistent.
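The election steps above amount to a compare-and-set on the version column. The following Python sketch reproduces them against an in-memory SQLite database so that it can be run without a database server. The table and column names follow the SQL statements in the text; the function names, the use of SQLite, and the host `host_spark_003` are assumptions made for illustration, and the affected-row count (`rowcount`) stands in for the "return value" mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table coordinator (node text, version integer)")
conn.execute("create table global_version (id text, flag integer)")
conn.execute("insert into coordinator (node, version) values ('host_spark_001', 0)")
conn.execute("insert into global_version (id, flag) values ('app_kafka_id', 0)")
conn.commit()


def read_failed_version(conn):
    """Step 1: read the version value of the failed coordinator from the global table."""
    row = conn.execute(
        "select flag from global_version where id = 'app_kafka_id'"
    ).fetchone()
    return row[0]


def try_become_coordinator(conn, hostname, failed_version):
    """Steps 2 and 3: conditional update; exactly one contender changes a row."""
    cur = conn.execute(
        "update coordinator set node = ?, version = version + 1 where version = ?",
        (hostname, failed_version),
    )
    if cur.rowcount != 1:  # another node already won the election
        conn.rollback()
        return False
    conn.execute(
        "update global_version set flag = ? where id = 'app_kafka_id'",
        (failed_version + 1,),
    )
    conn.commit()
    return True


# Both surviving nodes observe the same failed version before either one writes.
seen_by_002 = read_failed_version(conn)
seen_by_003 = read_failed_version(conn)
won_002 = try_become_coordinator(conn, "host_spark_002", seen_by_002)
won_003 = try_become_coordinator(conn, "host_spark_003", seen_by_003)
```

Here `host_spark_002` writes first and wins; the update from `host_spark_003` matches no row because the version has already advanced, so it returns `False` and the node remains an ordinary candidate.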

In a specific implementation, the coordinating node 02 may select the node ranked first in the candidate node queue 03 to run a Spark Streaming program, and delete the node from the candidate node queue 03.

In one embodiment, the coordinating node 02 is specifically configured to:

and comparing the state information of each node with preset state information, and determining a fault node according to a comparison result.

In one embodiment, the coordinating node 02 is specifically configured to:

when there is a node whose state information has not been received within a preset time length, determining that node as a fault node.

In one embodiment, the coordinating node 02 is specifically used to:

receiving state information sent by a fault node, and comparing the state information of the fault node with preset state information;

when it is determined according to the comparison result that the fault of the fault node has been cleared, adding the node whose fault has been cleared to the target candidate node queue 04.

In specific implementation, the state information of a node may include a disk state, a network card state, and the like. Each node on which the high-availability Spark Streaming program is deployed writes its latest information into the database table in real time; the information may include the node name, state information, and timestamp, as shown in Table 3:

Table 3 Node database table

Node name	State	Timestamp
spark_host_001	Normal	2019-10-23 19:47:22
spark_host_002	Fault	2019-10-23 19:56:47

In specific implementation, the coordinating node 02 may periodically traverse the database table, compare the state information of each node in the candidate node queue 03 with preset state information, determine a failed node according to the comparison result, and delete the failed node from the candidate node queue 03 to obtain the target candidate node queue 04. The coordinating node 02 may also set a preset time length; when, judging from the timestamp in Table 3, the state information of a node has not been received within the preset time length, the node is considered a failed node that cannot run the Spark Streaming program, and it is deleted from the candidate node queue to obtain the target candidate node queue 04. The coordinating node 02 may further receive state information sent by the failed node and compare it with the preset state information; when the fault of the failed node has been cleared, the node is added again to the tail of the target candidate node queue 04.
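The two fault criteria (a state that differs from the preset state, and a stale timestamp) and the re-admission of recovered nodes can be sketched in Python as follows. The 60-second threshold, the state strings, the function names, and the extra host `spark_host_003` are assumptions for illustration; the rows mirror the columns of Table 3.

```python
from datetime import datetime, timedelta

EXPECTED_STATE = "normal"            # assumed preset state information
MAX_SILENCE = timedelta(seconds=60)  # assumed preset time length


def find_faulty(rows, now):
    """Return nodes whose state is abnormal or whose last report is too old."""
    faulty = set()
    for node, state, ts in rows:
        if state != EXPECTED_STATE or now - ts > MAX_SILENCE:
            faulty.add(node)
    return faulty


def recovered_nodes(previously_faulty, rows, now):
    """Return previously faulty nodes that now report a fresh, normal state."""
    still_faulty = find_faulty(rows, now)
    reported = {node for node, _, _ in rows}
    return sorted(
        n for n in previously_faulty if n in reported and n not in still_faulty
    )


rows = [
    ("spark_host_001", "normal", datetime(2019, 10, 23, 19, 47, 22)),
    ("spark_host_002", "fault", datetime(2019, 10, 23, 19, 56, 47)),
    ("spark_host_003", "normal", datetime(2019, 10, 23, 19, 57, 0)),
]
now = datetime(2019, 10, 23, 19, 57, 30)

candidates = ["spark_host_001", "spark_host_002", "spark_host_003"]
faulty = find_faulty(rows, now)
target = [n for n in candidates if n not in faulty]  # target candidate queue

# Later, spark_host_002 reports a normal state again and rejoins at the tail.
later = now + timedelta(seconds=30)
rows_later = [("spark_host_002", "normal", later)]
target.extend(recovered_nodes(faulty, rows_later, later))
```

In this example `spark_host_001` is treated as faulty because its last report is about ten minutes old, while `spark_host_002` is faulty because of its reported state.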

In a specific implementation, the coordinating node 02 may use the API of YARN to periodically poll the status of the Spark Streaming program. When the Spark Streaming program is abnormal, the coordinating node 02 may select the node ranked first in the target candidate node queue 04 to run the Spark Streaming program, and delete the node from the target candidate node queue 04.
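One way to poll the program state is the application state endpoint of the YARN ResourceManager REST API (`/ws/v1/cluster/apps/{appid}/state`). The Python sketch below assumes that endpoint is reachable at a hypothetical address `http://rm:8088`; the application id and the choice of which states count as abnormal are also assumptions, since the text does not specify them. A canned response replaces the network call so that the example runs offline.

```python
import io
import json
from urllib.request import urlopen

# States treated as abnormal for a resident streaming job (an assumption;
# even FINISHED is unexpected for a program that should never stop).
ABNORMAL_STATES = {"FAILED", "KILLED", "FINISHED"}


def fetch_app_state(rm_url, app_id, opener=urlopen):
    """Ask the ResourceManager for the state of one YARN application."""
    with opener(f"{rm_url}/ws/v1/cluster/apps/{app_id}/state") as resp:
        return json.load(resp)["state"]


def needs_failover(state):
    """Decide whether the coordinating node should pick a new node."""
    return state in ABNORMAL_STATES


def fake_opener(url):
    """Stand-in for urlopen that returns a canned ResourceManager response."""
    return io.BytesIO(b'{"state": "FAILED"}')


state = fetch_app_state("http://rm:8088", "application_0000_0001", opener=fake_opener)
failover = needs_failover(state)
```

In a deployment the default `opener` (`urlopen`) would be used, and the coordinating node would call `fetch_app_state` on a timer before consulting the target candidate node queue.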

In one embodiment, before the second node runs the Spark Streaming program, the second node is configured to kill the Spark Streaming program run by the first node.

In specific implementation, in order to avoid split brain, before the new node runs the Spark Streaming program, the new node can kill the Spark Streaming program process run by the original node over ssh, using the kill -9 command.
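A minimal Python sketch of this fencing step is shown below. It assumes that the new node already knows the process id of the program on the original node and that key-based ssh login is configured between the hosts; neither detail is given in the text, and the function names are hypothetical.

```python
import subprocess


def build_kill_command(host, pid):
    """Build the argv that sends SIGKILL to a remote process over ssh."""
    return ["ssh", host, "kill", "-9", str(int(pid))]


def kill_remote(host, pid, runner=subprocess.run):
    """Run the remote kill; return True if the command exited with status 0."""
    result = runner(build_kill_command(host, pid), check=False)
    return result.returncode == 0
```

Passing the command as an argument list rather than a shell string avoids quoting problems on the local side, and `int(pid)` rejects anything that is not a plain process id. The new node would call `kill_remote` before launching its own copy of the program.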

In the following a specific example is given to facilitate an understanding of how the invention may be carried into effect.

The first step: a Spark Streaming program with a high-availability function is deployed on a plurality of nodes 01, the plurality of nodes form a candidate node queue 03, the plurality of nodes 01 write version values into a coordination node table, and the node whose write succeeds first is taken as the coordination node 02.

The second step: the coordination node 02 selects the node ranked first in the candidate node queue 03 to run the Spark Streaming program, and deletes the node from the candidate node queue 03.

The third step: the coordination node 02 periodically traverses the state information of each node in the candidate node queue 03 in the database table, compares the state information of each node in the candidate node queue 03 with preset state information, determines a fault node according to the comparison result, and deletes the fault node from the candidate node queue 03 to obtain a target candidate node queue 04.

The fourth step: the coordinating node 02 periodically polls the state of the Spark Streaming program using the API of YARN, and when the Spark Streaming program is abnormal, selects the node ranked first in the target candidate node queue 04 to run the Spark Streaming program, and deletes the node from the target candidate node queue 04.

The fifth step: before the new node runs the Spark Streaming program, the new node kills the Spark Streaming program process run by the original node over ssh, using the kill -9 command.

Based on the same inventive concept, the embodiment of the present invention further provides an operation method of a Spark Streaming program, as in the following embodiments. Because the principle by which the operation method of the Spark Streaming program solves the problem is similar to that of the system for running the Spark Streaming program, the implementation of the method can refer to the implementation of the system, and repeated parts are not described again.

An embodiment of the present invention provides a method for running a Spark Streaming program, and fig. 2 is a schematic diagram of a flow of the method for running the Spark Streaming program in the embodiment of the present invention, as shown in fig. 2, the method includes:

step 101: respectively installing Spark Streaming programs on a plurality of nodes, wherein the nodes are in a candidate node queue; determining a coordinating node among the plurality of nodes;

step 102: selecting a first node from the candidate node queue to run a Spark Streaming program, and deleting the first node from the candidate node queue;

step 103: receiving state information sent by each node in the candidate node queue, and determining a fault node according to the state information; deleting the fault node from the candidate node queue to obtain a target candidate node queue;

step 104: when the node running the Spark Streaming program fails, selecting a second node from the target candidate node queue to run the Spark Streaming program, and deleting the second node from the target candidate node queue.

In one embodiment, step 103 may comprise:

and when the fault elimination of the fault node is determined according to the comparison result, adding the fault eliminated node into the target candidate node queue.

In one embodiment, when the coordinating node fails, the method may further include:

the other nodes than the coordinating node perform the following steps:

In one embodiment, before the second node runs the Spark Streaming program in step 104, the method further comprises:

killing the Spark Streaming program run by the first node.

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the processor implements the running method of the Spark Streaming program.

In summary, in the embodiment of the present invention, Spark Streaming programs are installed on a plurality of nodes that are in a candidate node queue, and a coordinating node is determined among the plurality of nodes. A first node is selected from the candidate node queue to run the Spark Streaming program and is deleted from the candidate node queue. State information sent by each node in the candidate node queue is received, a fault node is determined according to the state information, and the fault node is deleted from the candidate node queue to obtain a target candidate node queue; monitoring the state of each node in this way ensures that the nodes in the target candidate node queue are normal nodes. When the node running the Spark Streaming program fails, a second node is selected from the target candidate node queue to run the Spark Streaming program and is deleted from the target candidate node queue; because a new node is selected promptly when the running node fails, stable running of the Spark Streaming program is achieved.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and variations of the embodiment of the present invention may occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A system for running a Spark Streaming program, comprising: a plurality of nodes, wherein the plurality of nodes are in a candidate node queue and a Spark Streaming program is installed on each node; a coordinating node is determined among the plurality of nodes;

2. The system of claim 1, wherein determining a failed node based on the status information comprises:

3. The system of claim 1, wherein determining a failed node based on the status information comprises:

4. The system of claim 1, further comprising:

receiving state information sent by the fault node, and comparing the state information of the fault node with preset state information;

5. The system of claim 1, wherein upon failure of the coordinating node, the other nodes than the coordinating node are further configured to:

obtaining a fault coordination node version value in a database table; the database table comprises a global table, and the global table is used for storing a running coordination node version value;

and taking the node which is successfully written as a new coordination node, and updating the version value of the coordination node in the global table.

6. The system of claim 1, wherein before the second node runs a Spark Streaming program, the second node is configured to: killing a Spark Streaming program run by the first node.

7. A method for operating a Spark Streaming program, comprising:

respectively installing a Spark Streaming program on a plurality of nodes, wherein the nodes are in a candidate node queue; determining a coordinating node among the plurality of nodes;

8. The method of claim 7, wherein determining a failed node based on the status information comprises:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 7 to 8 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 7 to 8.