CN108063782A - Node downtime takeover method and device, and node cluster system - Google Patents

Node downtime takeover method and device, and node cluster system

Info

Publication number
CN108063782A
Authority
CN
China
Prior art keywords
node
task
distributed cache
downtime
lock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610979682.4A
Other languages
Chinese (zh)
Inventor
刘绍华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201610979682.4A priority Critical patent/CN108063782A/en
Publication of CN108063782A publication Critical patent/CN108063782A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/568Storing data temporarily at an intermediate stage, e.g. caching

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The embodiment of the invention discloses a node downtime takeover method and device, and a node cluster system, which are used for improving the speed and stability of recovering the tasks of a down node. The method of the embodiment of the invention includes: the second node acquires the downtime information of the first node; the second node judges whether a lock corresponding to a first task exists in the distributed cache, wherein the first task is stored in the distributed cache when the first node works normally; if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache; the second node acquires the first task of the first node from the distributed cache and takes over the first task. Because the distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, the speed of recovering the tasks of the down node is improved; the distributed cache cooperates with the node cluster, so the stability of task recovery can be improved and the confusion generated when the node cluster recovers tasks is avoided.

Description

Node downtime takeover method and device, and node cluster system
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for taking over a node downtime, and a node cluster system.
Background
Distributed applications are often deployed on clusters consisting of multiple nodes. When one of the nodes is down, the other nodes need to take over the tasks executed on the down node, for example, clearing the records of the down node, restarting the tasks on the nodes which normally work, and the like.
In the existing approach, the tasks of all nodes are stored in a database for persistence. When a node goes down, in order to take over its tasks, the other nodes generally recover the data of the down node through a root node dedicated to handling downtime.
However, in such an approach, the nodes send multiple query requests to the database storing the tasks, and the task recovery process is heavily constrained by the database.
Disclosure of Invention
The embodiment of the invention provides a method and a device for taking over a down node and a node cluster system, which are used for improving the speed and the stability of a task of recovering the down node.
In order to solve the above technical problem, an embodiment of the present invention provides the following technical solutions:
a method of node downtime takeover, the method comprising:
the second node acquires the downtime information of the first node;
the second node judges whether a lock corresponding to a first task exists in the distributed cache or not, wherein the first task is stored in the distributed cache when the first node works normally;
if the distributed cache does not have the lock corresponding to the first task, the second node forms the lock by using the distributed cache;
the second node acquires the first task of the first node from the distributed cache;
the second node takes over the first task.
In order to solve the above technical problem, an embodiment of the present invention further provides the following technical solutions:
a node downtime takeover apparatus, the apparatus comprising:
the downtime information acquisition unit is used for acquiring downtime information of a first node;
the judging unit is used for judging whether a lock corresponding to a first task exists in the distributed cache, wherein the first task is stored in the distributed cache when the first node works normally;
a lock forming unit, configured to form a lock by using the distributed cache by the second node if the lock corresponding to the first task does not exist in the distributed cache;
a task obtaining unit, configured to obtain the first task of the first node from the distributed cache;
a takeover unit for taking over the first task.
In order to solve the above technical problem, an embodiment of the present invention further provides the following technical solutions:
a node cluster system, wherein the node cluster system comprises at least three nodes, and the nodes comprise a node downtime takeover apparatus,
wherein the node downtime takeover apparatus is the above-mentioned node downtime takeover apparatus.
According to the technical scheme, the embodiment of the invention has the following advantages:
after the second node acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion generated when the node cluster recovers tasks.
Drawings
Fig. 1 is a usage scenario diagram related to a method for taking over a node downtime according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for taking over a node downtime according to another embodiment of the present invention;
fig. 3 is a flowchart of a method for taking over a node downtime according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a node downtime takeover apparatus according to another embodiment of the present invention;
fig. 5 is a schematic structural diagram of a node cluster system according to another embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a method and a device for taking over a down node and a node cluster system, which are used for improving the speed and the stability of a task of recovering the down node.
Fig. 1 is a usage scenario diagram related to the node downtime takeover method provided in an embodiment of the present invention. As shown in fig. 1, the usage scenario includes a plurality of nodes combined into an Akka cluster; fig. 1 shows three of the nodes: a first node 101, a second node 102, and a third node 103. The usage scenario also includes a distributed cache 104, into which each node stores its tasks in real time as it executes them. After a node goes down, the other nodes read its tasks from the distributed cache and take them over.
The Akka cluster environment means that, in the distributed environment, all nodes in the cluster can send messages to each other and obtain information about node creation and shutdown.
Distributed here means that the cluster environment is composed of a plurality of nodes.
After the second node 102 acquires the downtime information of the first node 101 in the Akka message subscription mode, if the distributed cache 104 does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that the third node 103 which acquires the downtime information of the first node 101 does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The first node, the second node and the third node belong to the Akka cluster, and the Akka message subscription mode is as follows: when the Akka cluster detects the downtime information of the first node, it sends the downtime information of the first node to the second node. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the Akka cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one node recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the Akka cluster recovers tasks.
Fig. 2 is a flowchart of a node downtime takeover method according to an embodiment of the present invention, where the method is applied to an Akka cluster environment composed of a plurality of nodes. Referring to fig. 2 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the flow of the node downtime takeover method includes:
step 201: the second node acquires the downtime information of the first node;
step 202: the second node judges whether the distributed cache has a lock corresponding to the first task, wherein the first task is stored in the distributed cache when the first node works normally;
step 203: if the distributed cache does not have the lock corresponding to the first task, the second node forms the lock by using the distributed cache;
step 204: the second node acquires a first task of the first node from the distributed cache;
step 205: the second node takes over the first task.
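Purely as an illustrative sketch of steps 201 to 205 (the DistributedCache trait, its method names, and the key layout below are assumptions introduced for clarity, not part of the disclosed embodiment), the flow could be written in Scala roughly as follows:

```scala
// Hypothetical cache interface; the embodiment does not name a concrete cache product or API.
trait DistributedCache {
  def putIfAbsent(key: String, value: String): Boolean // true => the key was newly created (lock formed)
  def get(key: String): Option[String]
  def delete(key: String): Unit
}

final case class Task(id: String, payload: String)

// Sketch of the second node handling the first node's downtime (steps 201-205).
class SecondNode(cache: DistributedCache, selfId: String) {
  private def lockKey(taskId: String) = s"lock:$taskId" // assumed key layout
  private def taskKey(taskId: String) = s"task:$taskId" // assumed key layout

  // Step 201: called once the downtime information of the first node arrives.
  def onFirstNodeDown(firstTaskId: String): Unit = {
    // Steps 202-203: proceed only if no other node has already formed the lock.
    if (cache.putIfAbsent(lockKey(firstTaskId), selfId)) {
      // Step 204: read the task the first node stored while it worked normally.
      cache.get(taskKey(firstTaskId)).foreach(payload => takeOver(Task(firstTaskId, payload))) // step 205
    }
    // else: another node formed the lock first, so this node does not recover the task.
  }

  private def takeOver(task: Task): Unit =
    println(s"node $selfId takes over task ${task.id}") // placeholder for recreating the task locally
}
```

The atomic putIfAbsent call is what lets a single node win the race to recover the task; the later optional steps below refine this basic flow.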
Alternatively,
the second node acquires the downtime information of the first node, and the method comprises the following steps:
the second node acquires the downtime information of the first node in an Akka message subscription mode;
the message subscription mode of Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
Alternatively,
the second node takes over the first task, including:
the second node locally creates a second task according to the first task;
the second node stores the second task to the distributed cache;
or,
and the second node distributes the first task to the third node, so that the third node creates a third task according to the first task and stores the third task to the distributed cache.
Alternatively,
after the second node takes over the first task, the method further comprises:
the second node deletes the first task from the distributed cache.
Alternatively,
after the second node determines whether the distributed cache has the lock corresponding to the first task, the method further includes:
if the distributed cache has a lock, the second node does not recover the first task, wherein the lock is formed by the third node by utilizing the distributed cache;
when the third node recovers the first task and the second node acquires the downtime information of the third node, the second node executes a step of judging whether the distributed cache has a lock corresponding to the first task.
Alternatively,
after the second node forms the lock using the distributed cache, the method further comprises:
the third node acquiring the downtime information of the first node detects whether a lock corresponding to the first task exists in the distributed cache;
if the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
Alternatively,
when the second node recovers the first task, the third node acquires the downtime information of the second node in an Akka message subscription mode, wherein a lock formed by the node is released when the node crashes;
the third node judges whether the distributed cache has a lock corresponding to the first task;
if the distributed cache does not have the lock, the third node forms the lock by utilizing the distributed cache;
the third node acquires the first task of the first node from the distributed cache;
the third node takes over the first task.
In summary, after the second node acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion generated when the node cluster recovers tasks.
Fig. 3 is a flowchart of a node downtime takeover method according to an embodiment of the present invention, where the method is applied to an Akka cluster environment composed of a plurality of nodes. Referring to fig. 3 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the flow of the node downtime takeover method includes:
step 301: the first node operates normally and stores the first task to the distributed cache.
In the embodiment of the present invention, the Akka cluster includes a first node, a second node, and a third node, that is, the first node, the second node, and the third node belong to the Akka cluster.
Of course in some embodiments the Akka cluster may include more nodes.
When the first node works normally, it executes the first task and stores the first task to the distributed cache in real time.
For example, the first node executes the tasks WorkerActor1 and WorkerActor2, where WorkerActor1 clears the data of database A and WorkerActor2 queries information B from the database. At this moment the first node works normally, and it stores the information of WorkerActor1 and WorkerActor2 into the distributed cache in real time and persists it.
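As a hedged illustration of step 301 (the cache stand-in and key naming below are assumptions; the embodiment does not specify a cache client), a node might persist its running tasks in real time like this:

```scala
import scala.collection.concurrent.TrieMap

// In-memory stand-in for the distributed cache; only the "write task state as you run" idea matters here.
object CacheStub {
  private val store = TrieMap.empty[String, String]
  def put(key: String, value: String): Unit = store.put(key, value)
  def get(key: String): Option[String]      = store.get(key)
}

object FirstNodePersistence {
  // Step 301: while the first node works normally it stores each running task to the cache.
  def persistRunningTasks(nodeId: String): Unit = {
    val tasks = Map(
      "WorkerActor1" -> "clear the data of database A",
      "WorkerActor2" -> "query information B from the database"
    )
    tasks.foreach { case (name, description) =>
      CacheStub.put(s"task:$nodeId:$name", description) // assumed key layout
    }
  }
}
```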
Step 302: the first node goes down, and work on the first task stops.
When the first node is down, it stops working on the first task.
Downtime, also called a crash, means that the node cannot work normally.
Step 303: and when the Akka cluster detects the downtime information of the first node, the Akka cluster sends the downtime information of the first node to the second node.
Because the first node, the second node and the third node form an Akka cluster environment, in the Akka distribution environment, all nodes in the cluster can send messages to each other and obtain the information of node creation and closing.
Therefore, when the Akka cluster detects the downtime information of the first node, the downtime information of the first node is sent to the second node and the third node. This is the Akka message subscription mode, that is, when the Akka cluster detects the downtime information of the first node, it sends the downtime information of the first node to the second node. The second node and the third node acquire the downtime information of the first node through this Akka message subscription mode.
Specifically, the principle of the Akka message subscription mode is as follows: the nodes of the Akka cluster form an abstract Akka cluster system, and the Akka system obtains the running state of each node by periodically sending heartbeats, that is, probe signals sent between the nodes. The second node subscribes to the running state of the whole system from the Akka cluster system, so it is notified whenever the state of any node of the Akka system changes, for example when a node starts or shuts down. When the first node shuts down, the Akka cluster system obtains this message and sends it to the second node.
Of course, in some embodiments, the Akka cluster system may send the downtime information of the first node to each node of the Akka cluster.
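A minimal sketch of this subscription mechanism using Akka's classic cluster API (the event classes are standard Akka ones; the listener actor itself is a hypothetical example, not code from the embodiment):

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberRemoved, UnreachableMember}

// Runs on the second (and third) node; the cluster extension delivers member-state changes as messages.
class DowntimeListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsEvents, classOf[MemberRemoved], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  override def receive: Receive = {
    case UnreachableMember(member) =>
      log.info("member {} is unreachable (possible downtime)", member.address)
    case MemberRemoved(member, previousStatus) =>
      log.info("member {} removed (was {}); start the lock check and takeover flow", member.address, previousStatus)
  }
}
```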
Step 304: the second node determines whether the distributed cache has a lock corresponding to the first task, if the distributed cache does not have a lock corresponding to the first task, step 305 is executed, and if the distributed cache has a lock corresponding to the first task, step 306 is executed.
After the second node acquires the downtime information of the first node through an Akka message subscription mode, the second node detects whether a lock corresponding to the first task exists in the distributed cache.
Every node which acquires the downtime information has the opportunity to execute the recovery operation of the first task. In order to prevent a plurality of nodes from executing the task recovery operation for the down node, which would cause confusion of the tasks and functional redundancy in the cluster, in the method of the embodiment of the invention the node which first acquires the downtime information executes the task recovery operation; the specific implementation is to determine, through the lock, which node acquired the downtime information first.
Each node which acquires the downtime information detects whether a lock corresponding to the tasks of the down node exists in the distributed cache; if a node does not detect the lock, this indicates that the node acquired the downtime information first, and it can execute the recovery operation of the tasks of the down node. So that the nodes which acquire the downtime information later do not execute the recovery operation, the node which acquires the downtime information first forms a distributed lock in the distributed cache; the later nodes then detect the lock corresponding to the tasks of the down node in the distributed cache and do not execute the recovery operation.
For example, after the second node acquires the downtime information of the first node, the second node determines whether a lock corresponding to the first task exists in the distributed cache; if the distributed cache does not have a lock, go to step 305; if the distributed cache has a lock, step 306 is performed.
Step 305: the second node forms a lock using the distributed cache.
If the second node does not detect a lock corresponding to the first task in the distributed cache, this indicates that the second node is the node which first acquired the downtime information, and the second node executes the recovery operation for the first task. So that other nodes do not also execute the recovery operation, the second node forms a lock by using the distributed cache; a third node which acquires the downtime information of the first node then detects that the lock formed by the second node exists in the distributed cache and does not recover the first task. In this way the distributed lock ensures the integrity of the downtime recovery process.
That is, in the method according to the embodiment of the present invention, after the second node forms a lock using the distributed cache, the third node that acquires the downtime information of the first node detects whether the distributed cache has a lock corresponding to the first task; if the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
The lock of the embodiment of the invention is a distributed lock, which is a mechanism for controlling synchronized access to shared resources among distributed systems. In a distributed system, the actions of its components often need to be coordinated. If one resource or a group of resources is shared between different systems, or between different hosts of the same system, access to these resources usually requires mutual exclusion to prevent interference and ensure consistency; in such cases a distributed lock is used.
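As a minimal sketch of how such a distributed lock could be formed on top of the cache, assuming the cache exposes an atomic create-if-absent operation (a SETNX-style call); the interface below is a placeholder, not an API named by the embodiment:

```scala
// Placeholder for an atomic "create the key only if it does not already exist" cache operation.
trait LockingCache {
  def setIfAbsent(key: String, value: String): Boolean // true => the caller formed the lock
  def exists(key: String): Boolean
}

object DowntimeLock {
  // Steps 304-305: returns true if this node formed the lock and should recover the first task.
  def tryFormLock(cache: LockingCache, taskId: String, nodeId: String): Boolean =
    cache.setIfAbsent(s"lock:$taskId", nodeId) // storing the node id records who owns the lock

  // Step 306 on the other nodes: an existing lock means another node is already recovering.
  def anotherNodeRecovering(cache: LockingCache, taskId: String): Boolean =
    cache.exists(s"lock:$taskId")
}
```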
Step 306: the second node does not resume the first task.
When the second node judges that the distributed cache has the lock corresponding to the first task, this indicates that another node acquired the downtime information first; in order to avoid the confusion and functional redundancy caused by a plurality of nodes recovering the first task, the second node does not recover the first task.
For example, when the third node acquires the downtime information of the first node first, the third node does not detect a lock corresponding to the first task in the distributed cache, so the third node forms the lock by using the distributed cache; after the second node detects this lock, the second node does not recover the first task.
In some embodiments, the third node may itself go down while recovering the first task. For the stability of the task recovery process, if the third node goes down while recovering the first task, the other nodes in the Akka cluster acquire the downtime information of the third node. When the second node acquires the downtime information of the third node, the third node has gone down and the lock formed by the third node in the distributed cache is released, so the second node executes the step of judging whether the distributed cache has a lock corresponding to the first task. If the lock does not exist in the distributed cache, the second node is the node which first acquired the downtime information of the third node, so the second node executes step 305 and recovers the first task in place of the third node; otherwise the second node executes step 306.
Step 307: the second node acquires a first task of the first node from the distributed cache;
the second node first acquires the downtime information of the first node, so that the second node executes the recovery operation of the first task after the second node forms a lock by using the distributed cache. Wherein the recovery operation includes retrieving a first task of the first node from the distributed cache and taking over the first task.
Because the first task is stored in the distributed cache when the first node works normally, and the distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, the speed of recovering the tasks of the down node is improved.
For example, as shown in fig. 1, after the first node goes down, the second node obtains information of the tasks WorkerActor1 and WorkerActor2 from the distributed cache.
Step 308: the second node takes over the first task;
and after the second node acquires the first task, executing the takeover of the first task.
The method for taking over the first task comprises the following steps:
1) and after the second node locally creates a second task according to the first task, the second node stores the second task to the distributed cache.
Or,
2) and the second node distributes the first task to the third node, so that the third node creates a third task according to the first task and stores the third task to the distributed cache.
For example, as shown in fig. 1, after the first node goes down, the second node takes over the task of the first node, the second node takes over the task WorkerActor1 locally, creates a task WorkerActor3, and stores the task WorkerActor3 in the distributed cache in real time. And the second node distributes the task WorkerActor2 to the third node, the third node creates a task WorkerActor4 according to the task, and stores the task WorkerActor4 in the distributed cache.
Step 309: the second node deletes the first task from the distributed cache.
After the second node completes taking over the first task of the down first node, the second node deletes the first task from the distributed cache to remove data that is no longer needed. Of course, in some embodiments, step 309 may not be performed.
For example, after the second node successfully completes the takeover, the information of the tasks WorkerActor1 and WorkerActor2 in the distributed cache is deleted. In this way, the tasks WorkerActor3 and WorkerActor4 completely replace the original tasks WorkerActor1 and WorkerActor2.
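To illustrate steps 308 and 309 under the same kind of assumptions as the earlier sketches (the cache interface, key layout and task representation are hypothetical), the two take-over options and the final cleanup might look like:

```scala
final case class TaskInfo(name: String, description: String)

// Hypothetical cache operations needed for takeover and cleanup.
trait TaskCache {
  def put(key: String, value: String): Unit
  def delete(key: String): Unit
}

class TakeoverHandler(cache: TaskCache, selfId: String) {
  // Option 1 (step 308): recreate the task locally, e.g. WorkerActor1 -> WorkerActor3, and re-persist it.
  def takeOverLocally(failed: TaskInfo, newName: String): Unit = {
    val recreated = failed.copy(name = newName)
    cache.put(s"task:$selfId:${recreated.name}", recreated.description)
    // ... start the recreated worker here ...
  }

  // Option 2 (step 308): hand the task to another healthy node, e.g. WorkerActor2 -> WorkerActor4
  // on the third node, which recreates and persists it itself.
  def distributeTo(sendToThirdNode: TaskInfo => Unit, failed: TaskInfo): Unit =
    sendToThirdNode(failed)

  // Step 309: once the takeover succeeds, the original entries of the down node can be deleted.
  def cleanUp(downNodeId: String, failed: TaskInfo): Unit =
    cache.delete(s"task:$downNodeId:${failed.name}")
}
```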
In some embodiments, in order to improve the stability of task recovery executed by the method according to the embodiments of the present invention, in the AKKA cluster, there is no root node dedicated to processing downtime, but all nodes may process downtime recovery.
For example, while the second node is recovering the first task, that is, while the second node is acquiring the first task from the distributed cache or taking over the first task, the second node goes down, and the Akka cluster detects that the second node is down. After the second node goes down, the lock formed by the second node in the distributed cache is released.
The third node acquires the downtime information of the second node through an Akka message subscription mode, and judges whether a lock corresponding to the first task exists in the distributed cache or not;
if the lock does not exist in the distributed cache, the third node forms the lock by using the distributed cache, which means the third node is the node which first acquired the downtime information of the second node, so the third node can recover the first task in place of the second node. The third node forms the lock in the distributed cache to prevent other nodes from recovering the first task, then acquires the first task of the first node from the distributed cache and takes it over. That is, the third node either reconstructs the first task locally or sends the first task to another healthy node, which reconstructs it, and the reconstructed task is cached in the distributed cache. After the third node takes over the first task successfully, it may delete the first task from the distributed cache. If the third node detects that the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
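The embodiment states that a lock formed by a node is released when that node goes down, without fixing the mechanism. One hedged way to sketch this (an assumption for illustration, not the disclosed method) is to record the owner in the lock value and let survivors clear locks whose owner has been removed from the cluster before re-running the check-and-acquire flow:

```scala
// Placeholder cache operations used by the failover sketch.
trait FailoverCache {
  def setIfAbsent(key: String, value: String): Boolean
  def get(key: String): Option[String]
  def delete(key: String): Unit
}

object LockFailover {
  /** Called on a surviving node (e.g. the third node) when the downtime information of the
    * lock owner (e.g. the second node) arrives. Returns true if this node now holds the lock
    * and should recover the first task in place of the down owner. */
  def onLockOwnerDown(cache: FailoverCache, taskId: String, downNodeId: String, selfId: String): Boolean = {
    val key = s"lock:$taskId"
    if (cache.get(key).contains(downNodeId)) cache.delete(key) // the owner is gone: treat its lock as released
    cache.setIfAbsent(key, selfId) // re-run the check: the first survivor to create the key takes over
  }
}
```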
In summary, after the second node acquires the downtime information of the first node in the Akka message subscription mode, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the distributed cache has the lock formed by the second node; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The first node, the second node and the third node belong to the Akka cluster, and the Akka message subscription mode is as follows: when the Akka cluster detects the downtime information of the first node, it sends the downtime information of the first node to the second node. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of dynamic data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the Akka cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one node recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the Akka cluster recovers tasks.
Fig. 4 is a schematic structural diagram of a node downtime takeover apparatus according to an embodiment of the present invention. The apparatus is integrated on a node, the node belongs to an Akka cluster, and the Akka cluster includes at least three nodes. Referring to fig. 4 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the node downtime takeover apparatus includes:
a downtime information obtaining unit 401, configured to obtain downtime information of a first node;
a determining unit 402, configured to determine whether a lock corresponding to a first task exists in the distributed cache, where the first task is stored in the distributed cache when the first node works normally;
a locking unit 403, configured to form a lock by using the distributed cache if the distributed cache does not have the lock corresponding to the first task;
a task obtaining unit 404, configured to obtain a first task of a first node from a distributed cache;
a takeover unit 405 for taking over a first task;
alternatively,
the downtime information acquiring unit 401 is further configured to acquire downtime information of the first node in an Akka message subscription manner;
the message subscription mode of Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
Alternatively,
a take-over unit 405 comprising:
a creation module 406 for creating a second task locally from the first task;
a storage module 407, configured to store the second task in the distributed cache;
or,
the takeover unit 405 is further configured to distribute the first task to the third node, so that the third node creates the third task according to the first task and then stores the third task in the distributed cache.
Alternatively,
the device still includes:
a deleting unit 408 configured to delete the first task from the distributed cache.
Alternatively,
the task obtaining unit 404 is further configured to not restore the first task if the distributed cache has a lock, where the lock is formed by the third node using the distributed cache;
when the third node is recovering the first task and the downtime information acquisition unit 401 acquires the downtime information of the third node, the determination unit 402 performs the step of determining whether a lock corresponding to the first task exists in the distributed cache.
To sum up, after the downtime information acquisition unit 401 acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the locking unit 403 forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the distributed cache has the lock formed by the second node; the first task is stored in the distributed cache when the first node works normally, the task obtaining unit 404 then obtains the first task of the first node from the distributed cache, and the takeover unit 405 takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of dynamic data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the node cluster recovers tasks.
Fig. 5 is a schematic structural diagram of a node cluster system according to an embodiment of the present invention, where the node cluster system includes at least three nodes. Referring to fig. 5 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the nodes of the node cluster system 500 include a node downtime takeover apparatus,
the node downtime takeover apparatus is, for example, the node downtime takeover apparatus shown in the embodiment shown in fig. 4. For details, reference is made to the above exemplary embodiments, which are not described in detail here.
In summary, after the second node acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of dynamic data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the node cluster recovers tasks.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for taking over a node downtime, the method comprising:
the second node acquires the downtime information of the first node;
the second node judges whether a lock corresponding to a first task exists in the distributed cache or not, wherein the first task is stored in the distributed cache when the first node works normally;
if the distributed cache does not have the lock corresponding to the first task, the second node forms the lock by using the distributed cache;
the second node acquires the first task of the first node from the distributed cache;
the second node takes over the first task.
2. The method of claim 1,
the second node acquiring the downtime information of the first node comprises the following steps:
the second node acquires the downtime information of the first node in an Akka message subscription mode;
the message subscription mode of the Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
3. The method of claim 1,
the second node takes over the first task, including:
the second node locally creates a second task according to the first task;
the second node stores the second task to the distributed cache;
or,
and the second node distributes the first task to the third node so that the third node stores the third task to the distributed cache after creating the third task according to the first task.
4. The method according to any one of claims 1 to 3,
after the second node takes over the first task, the method further comprises:
the second node deletes the first task from the distributed cache.
5. The method of claim 1,
after the second node determines whether the distributed cache has the lock corresponding to the first task, the method further includes:
if the lock exists in the distributed cache, the second node does not recover the first task, wherein the lock is formed by the third node by utilizing the distributed cache;
when the third node recovers the first task, and when the second node acquires the downtime information of the third node, the second node executes a step of judging whether the distributed cache has a lock corresponding to the first task.
6. The method of claim 1,
after the second node forms a lock using the distributed cache, the method further comprises:
the third node which acquires the downtime information of the first node detects whether the distributed cache has a lock corresponding to the first task;
and if the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
7. The method of claim 2,
when the second node recovers the first task, the third node acquires the downtime information of the second node in an Akka message subscription mode, wherein a lock formed by the node is released when the node is down;
the third node judges whether the distributed cache has a lock corresponding to the first task;
if the lock does not exist in the distributed cache, the third node forms the lock by using the distributed cache;
the third node acquires the first task of the first node from the distributed cache;
the third node takes over the first task.
8. A node downtime takeover apparatus, the apparatus comprising:
the downtime information acquisition unit is used for acquiring downtime information of a first node;
the judging unit is used for judging whether a lock corresponding to a first task exists in the distributed cache, wherein the first task is stored in the distributed cache when the first node works normally;
a lock forming unit, configured to form a lock by using the distributed cache by the second node if the lock corresponding to the first task does not exist in the distributed cache;
a task obtaining unit, configured to obtain the first task of the first node from the distributed cache;
a takeover unit for taking over the first task.
9. The apparatus of claim 8,
the downtime information acquisition unit is also used for acquiring the downtime information of the first node in an Akka message subscription mode;
the message subscription mode of the Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
10. A node cluster system, characterized in that the node cluster system comprises at least three nodes, the nodes comprise a node downtime takeover apparatus,
wherein the node downtime takeover apparatus is the node downtime takeover apparatus of claim 8 or 9.
CN201610979682.4A 2016-11-08 2016-11-08 Node downtime takeover method and device, and node cluster system Pending CN108063782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610979682.4A CN108063782A (en) Node downtime takeover method and device, and node cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610979682.4A CN108063782A (en) Node downtime takeover method and device, and node cluster system

Publications (1)

Publication Number Publication Date
CN108063782A true CN108063782A (en) 2018-05-22

Family

ID=62136814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610979682.4A Pending CN108063782A (en) Node downtime takeover method and device, and node cluster system

Country Status (1)

Country Link
CN (1) CN108063782A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459963A (en) * 2020-04-07 2020-07-28 中国建设银行股份有限公司 Core accounting transaction concurrent processing method and device
CN112162698A (en) * 2020-09-17 2021-01-01 北京浪潮数据技术有限公司 Cache partition reconstruction method, device, equipment and readable storage medium
CN113835930A (en) * 2021-09-26 2021-12-24 杭州谐云科技有限公司 Cache service recovery method, system and device based on cloud platform

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103458036A (en) * 2013-09-03 2013-12-18 杭州华三通信技术有限公司 Access device and method of cluster file system
CN103501338A (en) * 2013-09-30 2014-01-08 华为技术有限公司 Lock recovery method, equipment and network file system
CN103577546A (en) * 2013-10-12 2014-02-12 北京奇虎科技有限公司 Method and equipment for data backup, and distributed cluster file system
CN103684941A (en) * 2013-11-23 2014-03-26 广东新支点技术服务有限公司 Arbitration server based cluster split-brain prevent method and device
CN104115469A (en) * 2011-09-23 2014-10-22 混合电路逻辑有限公司 System for live -migration and automated recovery of applications in a distributed system
US20140379775A1 (en) * 2013-08-12 2014-12-25 Fred Korangy Actor system and method for analytics and processing of big data
CN105426271A (en) * 2015-12-22 2016-03-23 华为技术有限公司 Lock management method and device for distributed storage system
CN105912402A (en) * 2016-04-11 2016-08-31 深圳益邦阳光有限公司 Scheduling method and apparatus based on Actor model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104115469A (en) * 2011-09-23 2014-10-22 混合电路逻辑有限公司 System for live -migration and automated recovery of applications in a distributed system
US20140379775A1 (en) * 2013-08-12 2014-12-25 Fred Korangy Actor system and method for analytics and processing of big data
CN103458036A (en) * 2013-09-03 2013-12-18 杭州华三通信技术有限公司 Access device and method of cluster file system
CN103501338A (en) * 2013-09-30 2014-01-08 华为技术有限公司 Lock recovery method, equipment and network file system
CN103577546A (en) * 2013-10-12 2014-02-12 北京奇虎科技有限公司 Method and equipment for data backup, and distributed cluster file system
CN103684941A (en) * 2013-11-23 2014-03-26 广东新支点技术服务有限公司 Arbitration server based cluster split-brain prevent method and device
CN105426271A (en) * 2015-12-22 2016-03-23 华为技术有限公司 Lock management method and device for distributed storage system
CN105912402A (en) * 2016-04-11 2016-08-31 深圳益邦阳光有限公司 Scheduling method and apparatus based on Actor model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459963A (en) * 2020-04-07 2020-07-28 中国建设银行股份有限公司 Core accounting transaction concurrent processing method and device
CN111459963B (en) * 2020-04-07 2024-03-15 中国建设银行股份有限公司 Concurrent processing method and device for core accounting transaction
CN112162698A (en) * 2020-09-17 2021-01-01 北京浪潮数据技术有限公司 Cache partition reconstruction method, device, equipment and readable storage medium
CN112162698B (en) * 2020-09-17 2024-02-13 北京浪潮数据技术有限公司 Cache partition reconstruction method, device, equipment and readable storage medium
CN113835930A (en) * 2021-09-26 2021-12-24 杭州谐云科技有限公司 Cache service recovery method, system and device based on cloud platform
CN113835930B (en) * 2021-09-26 2024-02-06 杭州谐云科技有限公司 Cache service recovery method, system and device based on cloud platform

Similar Documents

Publication Publication Date Title
CN105389230B (en) A kind of continuous data protection system and method for combination snapping technique
US10565071B2 (en) Smart data replication recoverer
CN103294675B (en) Data-updating method and device in a kind of distributed memory system
US8818954B1 (en) Change tracking
CN110807064A (en) Data recovery device in RAC distributed database cluster system
CN103516736A (en) Data recovery method of distributed cache system and a data recovery device of distributed cache system
CN104077380B (en) A kind of data de-duplication method, apparatus and system
CN109643310B (en) System and method for redistribution of data in a database
CN107919977B (en) Online capacity expansion and online capacity reduction method and device based on Paxos protocol
CN110196818B (en) Data caching method, caching device and storage system
CN110351313B (en) Data caching method, device, equipment and storage medium
US10838825B2 (en) Implementing snapshot sets for consistency groups of storage volumes
WO2013163864A1 (en) Data persistence processing method and device and database system
CN110825562B (en) Data backup method, device, system and storage medium
CN103500130A (en) Method for backing up dual-computer hot standby data in real time
CN104899071A (en) Recovery method and recovery system of virtual machine in cluster
WO2016061956A1 (en) Data processing method for distributed file system and distributed file system
US11748215B2 (en) Log management method, server, and database system
US12045137B2 (en) Data backup method, apparatus, and system
US10346610B1 (en) Data protection object store
CN108063782A (en) Node downtime takeover method and device, and node cluster system
CN104461773A (en) Backup deduplication method of virtual machine
US20090063486A1 (en) Data replication using a shared resource
CN105956032A (en) Cache data synchronization method, system and apparatus
CN101937378B (en) Method for carrying out back-up protection on data of storage equipment and computer system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

CB02 Change of applicant information
RJ01 Rejection of invention patent application after publication

Application publication date: 20180522

RJ01 Rejection of invention patent application after publication