CN111722988A - Fault switching method and device for data space nodes

Info

Publication number: CN111722988A
Application number: CN202010528159.6A
Authority: CN
Inventors: 李恒
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2020-09-29

Abstract

The invention discloses a failover method and device for data space nodes. The method comprises the following steps: synchronizing the service data of a first data space node by using a second data space node serving as a slave node; continuously monitoring the working states of the master node and the slave node by using a monitoring component, and raising an alarm when the working state of the master node shows that the first data space node has failed; in response to receiving the alarm, setting the first data space node as the slave node and the second data space node as the master node to process client services; synchronizing the service data of the second data space node by using a memory server; and synchronizing in turn the service data of the memory server and of the second data space node serving as the master node. The invention can improve the availability of the data space, ensure zero data loss during the fault period, and further improve service quality and customer experience.

Description

Fault switching method and device for data space nodes

Technical Field

The present invention relates to the field of data clustering, and in particular, to a method and an apparatus for failover of data space nodes.

Background

Big data platforms in the prior art provide unified management and processing of massive heterogeneous data, collecting data from different sources and with different owners on a single platform. The data space (DataSpace) management function of such a platform ensures that data owners in different departments or organizations hold exclusive rights to their data, and supports sharing that data with other departments or organizations. The existing DataSpace has no high-availability failover mechanism and runs on a single-node DataSpace server. When the master DataSpace node fails, resources cannot be managed, tenants and users cannot log in, and uninterrupted service cannot be provided to clients during the outage. The service interruption causes losses to clients, who can manage their resources again only after the DataSpace node has recovered, which can be disastrous for them.

No effective solution is currently available for the problems of poor availability and lack of early fault warning for data space nodes in the prior art.

Disclosure of Invention

In view of this, an object of the embodiments of the present invention is to provide a failover method and apparatus for data space nodes, which can improve the availability of the data space, ensure zero data loss during a failure, and further improve service quality and customer experience.

Based on the above object, a first aspect of the embodiments of the present invention provides a method for failover of data space nodes, including the following steps:

in response to a first data space node, serving as the master node, processing client services normally, synchronizing the service data of the first data space node by using a second data space node serving as the slave node;

continuously monitoring the working states of the master node and the slave node by using monitoring components respectively attached to the first data space node and the second data space node, and raising an alarm when the working state of the master node shows that the first data space node has failed;

in response to receiving the alarm, setting the first data space node as the slave node and the second data space node as the master node, and processing client services by using the second data space node as the master node;

in response to the second data space node, serving as the master node, processing client services normally, synchronizing the service data of the second data space node by using a memory server;

and, in response to the first data space node, serving as the slave node, returning to normal, synchronizing in turn the service data of the memory server and of the second data space node serving as the master node.

In some embodiments, the client services include requests by different visitors for data space storage and/or data resources; the service data includes authentication information and logs for the shared and/or independent storage and/or data resources allocated to and/or reclaimed from different visitors.

In some embodiments, continuously monitoring the operational status of the master node and the slave nodes comprises: continuously monitoring, using a monitoring component, information of the first data space node and the second data space node for at least one of: memory, processor, disk;

raising an alarm when the working state of the master node shows that the first data space node has failed comprises: generating a log file according to the working state of the master node, determining from the log file whether a fault has occurred, and raising the alarm accordingly.

In some embodiments, placing the first data space node as a slave node and the second data space node as a master node comprises:

manually forcing the first data space node to be a slave node and the second data space node to be a master node through a command line; or

notifying the distributed coordination system by the monitoring component of the first data space node, whereupon the distributed coordination system forcibly sets the first data space node as the slave node and the second data space node as the master node.

In some embodiments, the method further comprises: after the service data of the memory server and the second data space node has been synchronized, setting the first data space node as the master node and the second data space node as the slave node.

A second aspect of the embodiments of the present invention provides a failover apparatus for data space nodes, including:

a processor; and

a memory storing program code executable by the processor, the program code when executed performing the steps of:

In some embodiments, the steps further comprise: after the service data of the memory server and the second data space node are synchronized, the first data space node is set as a master node and the second data space node is set as a slave node.

The invention has the following beneficial technical effects. In the failover method and apparatus for data space nodes provided by the embodiments of the present invention, while the first data space node, serving as the master node, processes client services normally, the second data space node, serving as the slave node, synchronizes the service data of the first data space node. Monitoring components respectively attached to the first data space node and the second data space node continuously monitor the working states of the master node and the slave node, and an alarm is raised when the working state of the master node shows that the first data space node has failed. In response to the alarm, the first data space node is set as the slave node and the second data space node as the master node, and client services are processed by the second data space node as the master node. While the second data space node processes client services normally as the master node, its service data is synchronized by the memory server. When the first data space node, serving as the slave node, returns to normal, the service data of the memory server and of the second data space node serving as the master node is synchronized in turn. This technical scheme improves the availability of the data space, ensures zero data loss during the fault period, and improves service quality and customer experience.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a failover method for data space nodes according to the present invention;

Fig. 2 is a schematic diagram of a framework of a failover method for data space nodes according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two distinct entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

Based on the above objectives, a first aspect of the embodiments of the present invention provides an embodiment of a failover method, which can improve availability of data space and ensure zero loss of data during a failure. Fig. 1 is a schematic flow chart illustrating a failover method of data space nodes provided in the present invention.

The failover method for data space nodes, as shown in Fig. 1, includes the following steps:

Step S101: in response to a first data space node, serving as the master node, processing client services normally, synchronizing the service data of the first data space node by using a second data space node serving as the slave node;

Step S103: continuously monitoring the working states of the master node and the slave node by using monitoring components respectively attached to the first data space node and the second data space node, and raising an alarm when the working state of the master node shows that the first data space node has failed;

Step S105: in response to receiving the alarm, setting the first data space node as the slave node and the second data space node as the master node, and processing client services by using the second data space node as the master node;

Step S107: in response to the second data space node, serving as the master node, processing client services normally, synchronizing the service data of the second data space node by using a memory server;

Step S109: in response to the first data space node, serving as the slave node, returning to normal, synchronizing in turn the service data of the memory server and of the second data space node serving as the master node.
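The five steps can be walked through with a small in-memory simulation. This is only an illustrative sketch: the `Node` class, the dictionary standing in for the memory server, and the sample tenant records are hypothetical, and the direction of the final synchronization (onto the recovered first node) is an assumption based on the worked example later in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A data space node; service data is modelled as a plain dict (assumption)."""
    name: str
    role: str                     # "master" or "slave"
    healthy: bool = True
    data: dict = field(default_factory=dict)


def failover_cycle():
    first = Node("first", "master")
    second = Node("second", "slave")
    memory_server = {}            # hypothetical stand-in for the memory server

    # S101: the master serves clients; the slave synchronizes the master's data.
    first.data["tenant-a"] = "quota=10G"
    second.data.update(first.data)

    # S103: monitoring sees the master fail and raises an alarm.
    first.healthy = False
    alarm = not first.healthy

    # S105: on alarm, the roles are swapped and the second node serves clients.
    if alarm:
        first.role, second.role = "slave", "master"
    second.data["tenant-b"] = "quota=5G"

    # S107: the memory server synchronizes the new master's service data.
    memory_server.update(second.data)

    # S109: the first node recovers and is synchronized, in turn,
    # from the memory server and then from the new master.
    first.healthy = True
    first.data.update(memory_server)
    first.data.update(second.data)
    return first, second, memory_server


first, second, memory_server = failover_cycle()
print(first.role, second.role)
```

After the cycle all three copies of the service data agree, which is the property the "zero data loss" claim relies on.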

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program to instruct relevant hardware to perform the processes, and the processes can be stored in a computer readable storage medium, and when executed, the processes can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.

In some embodiments, continuously monitoring the working states of the master node and the slave node comprises: continuously monitoring, using the monitoring components, information on at least one of the following for the first data space node and the second data space node: memory, processor, disk. Raising an alarm when the working state of the master node shows that the first data space node has failed comprises: generating a log file according to the working state of the master node, determining from the log file whether a fault has occurred, and raising the alarm accordingly.
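A minimal sketch of this log-based judgement is given below. The description does not specify a log format or any thresholds, so the JSON record layout, the `THRESHOLDS` values, and the function names are all assumptions made for illustration.

```python
import json

# Hypothetical utilization limits (fractions of capacity); not from the source.
THRESHOLDS = {"memory": 0.95, "processor": 0.98, "disk": 0.90}


def log_line(node, sample):
    """Serialize one monitoring sample as a JSON log record."""
    return json.dumps({"node": node, **sample}, sort_keys=True)


def judge_fault(log_lines, node):
    """Return the metrics over their limit in the node's latest log record."""
    latest = None
    for line in log_lines:
        record = json.loads(line)
        if record["node"] == node:
            latest = record
    if latest is None:
        return ["no-report"]      # a node that never reported is treated as faulty
    return [m for m, limit in THRESHOLDS.items() if latest.get(m, 0.0) > limit]


log = [
    log_line("first", {"memory": 0.40, "processor": 0.30, "disk": 0.50}),
    log_line("first", {"memory": 0.97, "processor": 0.35, "disk": 0.50}),
]
reasons = judge_fault(log, "first")
if reasons:
    print("ALARM first:", ", ".join(reasons))   # prints "ALARM first: memory"
```

Because the decision is derived from the log rather than from live probes, the same records can later be inspected to determine why the alarm was raised.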

In some embodiments, the method further comprises: after the service data of the memory server and the second data space node are synchronized, the first data space node is set as a master node and the second data space node is set as a slave node.
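The optional switch back to the original roles can be sketched as a guarded operation. The dictionary layout and the `fail_back` name are hypothetical; the guard simply encodes the condition stated above, namely that the roles are restored only after the data of the memory server and the second node has been synchronized.

```python
def fail_back(first, second, memory_server):
    """Restore the first node as master only when it is healthy and fully in sync."""
    in_sync = first["data"] == memory_server == second["data"]
    if not (first["healthy"] and in_sync):
        return False
    first["role"], second["role"] = "master", "slave"
    return True


first = {"role": "slave", "healthy": True, "data": {"k": 1}}
second = {"role": "master", "healthy": True, "data": {"k": 1, "j": 2}}
memory_server = {"k": 1, "j": 2}

print(fail_back(first, second, memory_server))   # False: first node is still behind
first["data"].update(memory_server)              # synchronize from the memory server
first["data"].update(second["data"])             # then from the current master
print(fail_back(first, second, memory_server))   # True: roles are restored
```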

The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.

The following further illustrates embodiments of the invention in terms of specific examples as shown in fig. 2.

DataSpace is generally deployed on two servers: one in the Master (master node) role and the other in the Standby (slave node) role. The Master, which is in the Active state, processes service requests from clients. The Standby does not serve external requests and only synchronizes the Master's state, so that a switchover can be completed quickly.

In addition, a monitoring component is deployed on each of the nodes hosting the DataSpace Master and the DataSpace Standby. Each component monitors the DataSpace service on its own node, together with information such as the node's resource utilization. This information is used to evaluate the node's health, is recorded in a log file, and is made available for the user to inspect. In this way, fault warning for the DataSpace Master is achieved, and the cause of a fault can be determined from the log file.

When the DataSpace Master service on the master node fails, the DataSpace Standby service on the slave node is switched to Master. Data is synchronized to the memory server to prevent loss during the switching period, and once the standby node has become the master node it can synchronize the data from the memory server. This guarantees zero data loss, achieves automatic failover, and makes DataSpace highly available.

There are three kinds of data synchronization. First, DataSpace loads all data directories in the cluster and adds policies to the corresponding resources to ensure data isolation and access permissions. Second, Master and Standby data is kept synchronized. Third, master node data is synchronized to the memory server in real time.
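The first kind of synchronization, attaching policies to loaded data directories, can be illustrated as follows. The policy shape (an owner plus a set of tenants the directory is shared with), the directory paths, and all function names are assumptions; the description does not define the policy format.

```python
def load_policies(directories):
    """Attach a policy to each data directory: its owner and who it is shared with."""
    return {
        path: {"owner": owner, "shared_with": set()}
        for path, owner in directories.items()
    }


def share(policies, path, owner, tenant):
    """Let the owner of a directory share it with another tenant."""
    if policies[path]["owner"] != owner:
        raise PermissionError(f"{owner} does not own {path}")
    policies[path]["shared_with"].add(tenant)


def can_access(policies, path, tenant):
    """Isolation check: only the owner or an explicitly shared tenant may access."""
    policy = policies.get(path)
    return policy is not None and (
        tenant == policy["owner"] or tenant in policy["shared_with"]
    )


policies = load_policies({"/data/finance": "finance", "/data/hr": "hr"})
print(can_access(policies, "/data/finance", "hr"))    # False: isolated by default
share(policies, "/data/finance", "finance", "hr")
print(can_access(policies, "/data/finance", "hr"))    # True: shared by the owner
```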

DataSpace controls user authentication by managing tickets. DataSpace issues a ticket to the user and makes it available for download, and the user can then use the ticket to complete identity authentication when requesting resources.
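The description does not say how the authentication credential is constructed. One common approach, shown here purely as an assumed illustration, is to sign the user name with a secret held by the server using HMAC from the Python standard library. If both DataSpace nodes hold the same secret, a credential issued before a switchover can still be verified afterwards.

```python
import hashlib
import hmac


def issue_ticket(secret: bytes, user: str) -> str:
    """Issue a credential of the form "user:signature" (hypothetical format)."""
    signature = hmac.new(secret, user.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{user}:{signature}"


def verify_ticket(secret: bytes, ticket: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    user, _, signature = ticket.rpartition(":")
    expected = hmac.new(secret, user.encode("utf-8"), hashlib.sha256).hexdigest()
    return bool(user) and hmac.compare_digest(signature, expected)


shared_secret = b"example-secret"          # assumed to be replicated to both nodes
ticket = issue_ticket(shared_secret, "alice")
print(verify_ticket(shared_secret, ticket))        # True
print(verify_ticket(b"another-secret", ticket))    # False
```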

The monitoring component monitors information such as the node state, memory, CPU, and resource allocation of DataSpace through Zookeeper (a distributed coordination system), records the information in a log file, and synchronizes it to the Standby, which keeps the master and standby information consistent. The log also helps the user locate faults, which serves the purpose of early fault warning. The component additionally monitors the synchronization of authentication information between master and standby, so that user authentication information remains consistent after a switchover.

When the Master node fails, its monitoring component writes the Master node's information into Zookeeper, and the Standby's monitoring component is notified through the watched information. The Master's Active state is forcibly closed, logs are reported, and an alarm indication is sent to notify administrators. Failover then completes automatically: the Standby node's state changes to Active and it becomes the Master. Alternatively, the master/standby state can be forcibly switched by hand through the command line, so that the standby DataSpace becomes the master node, with the data of the two nodes kept synchronized so that the authentication information stays consistent.
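The notification path can be sketched with an in-memory stand-in for the coordination service. The `Coordinator` class below is not ZooKeeper's API; it only imitates the idea of writing a value under a key and notifying watchers. The key name, the state strings, and the alert text are hypothetical.

```python
class Coordinator:
    """In-memory stand-in for a distributed coordination service (assumption)."""

    def __init__(self):
        self._values = {}
        self._watchers = {}

    def watch(self, key, callback):
        self._watchers.setdefault(key, []).append(callback)

    def set(self, key, value):
        self._values[key] = value
        for callback in self._watchers.get(key, []):
            callback(value)


class StandbyMonitor:
    """Monitoring component on the standby node; promotes it when the master fails."""

    def __init__(self, coordinator, roles, alerts):
        self.roles = roles
        self.alerts = alerts
        coordinator.watch("/dataspace/master/state", self.on_master_state)

    def on_master_state(self, state):
        if state != "failed":
            return
        self.roles["first"] = "standby"     # force the old master out of Active
        self.roles["second"] = "active"     # promote the standby
        self.alerts.append("DataSpace Master failed; Standby promoted to Active")


def force_switch(roles):
    """Manual path: swap the two roles, as an operator would from a command line."""
    roles["first"], roles["second"] = roles["second"], roles["first"]


roles = {"first": "active", "second": "standby"}
alerts = []
coordinator = Coordinator()
monitor = StandbyMonitor(coordinator, roles, alerts)

# The master's monitoring component reports the failure to the coordinator.
coordinator.set("/dataspace/master/state", "failed")
print(roles, alerts)
```

Both paths end in the same role assignment; the automatic path additionally records an alert for administrators.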

When the master node fails, the switchover takes some time. During this period the master server's data cannot be synchronized to the standby server, so it is synchronized to the memory server instead. After the switchover, the data held by the memory server can be synchronized back to the master server, which ensures zero data loss and enhances security for enterprises.
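The role of the memory server during the switching delay can be sketched as an ordered write buffer that is replayed once a master is available again. The `MemoryServer` class and the sample keys are hypothetical; the source does not describe how the memory server stores data.

```python
class MemoryServer:
    """Ordered in-memory buffer of writes (hypothetical model of the memory server)."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))

    def replay_into(self, store):
        """Apply buffered writes in arrival order so later writes win."""
        for key, value in self.entries:
            store[key] = value


memory_server = MemoryServer()
old_master = {"u1": "v1"}
new_master = dict(old_master)     # the standby was in sync before the failure

# Writes arriving during the switching window cannot reach the standby,
# so they are captured by the memory server.
memory_server.append("u2", "v2")
memory_server.append("u1", "v1-updated")

# After the switchover, the buffered writes are synchronized back.
memory_server.replay_into(new_master)
print(new_master)                 # {'u1': 'v1-updated', 'u2': 'v2'}
```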

As can be seen from the foregoing embodiments, in the failover method for data space nodes according to the embodiments of the present invention, while the first data space node, serving as the master node, processes client services normally, the second data space node, serving as the slave node, synchronizes the service data of the first data space node. Monitoring components respectively attached to the first data space node and the second data space node continuously monitor the working states of the master node and the slave node, and an alarm is raised when the working state of the master node shows that the first data space node has failed. In response to the alarm, the first data space node is set as the slave node and the second data space node as the master node, and client services are processed by the second data space node as the master node. While the second data space node processes client services normally as the master node, its service data is synchronized by the memory server. When the first data space node, serving as the slave node, returns to normal, the service data of the memory server and of the second data space node serving as the master node is synchronized in turn. This technical scheme improves the availability of the data space, ensures zero data loss during the fault period, and improves service quality and customer experience.

It should be particularly noted that, the steps in the embodiments of the failover method for data space nodes described above may be mutually intersected, replaced, added, and deleted, so that the failover method for data space nodes transformed by these reasonable permutations and combinations shall also fall within the scope of the present invention, and shall not limit the scope of the present invention to the described embodiments.

In view of the foregoing, a second aspect of the embodiments of the present invention provides an embodiment of a failover apparatus, which can improve availability of data space and ensure zero loss of data during a failure. The failover apparatus for data space nodes comprises:

a processor; and

As can be seen from the foregoing embodiments, in the failover apparatus for data space nodes according to the embodiments of the present invention, while the first data space node, serving as the master node, processes client services normally, the second data space node, serving as the slave node, synchronizes the service data of the first data space node. Monitoring components respectively attached to the first data space node and the second data space node continuously monitor the working states of the master node and the slave node, and an alarm is raised when the working state of the master node shows that the first data space node has failed. In response to the alarm, the first data space node is set as the slave node and the second data space node as the master node, and client services are processed by the second data space node as the master node. While the second data space node processes client services normally as the master node, its service data is synchronized by the memory server. When the first data space node, serving as the slave node, returns to normal, the service data of the memory server and of the second data space node serving as the master node is synchronized in turn. This technical scheme improves the availability of the data space, ensures zero data loss during the fault period, and improves service quality and customer experience.

It should be particularly noted that, the above-mentioned embodiment of the failover apparatus for a data space node uses the embodiment of the failover method for a data space node to specifically describe the working process of each module, and those skilled in the art can easily think that these modules are applied to other embodiments of the failover method for a data space node. Of course, since the steps in the embodiment of the method for switching a failure of a data space node may be intersected, replaced, added, or deleted, these failure switching apparatuses that are transformed by reasonable permutation and combination of the data space node also belong to the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiment.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbering of the embodiments disclosed in the embodiments of the present invention is merely for description and does not represent the relative merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of an embodiment of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method for failover of data space nodes, comprising the steps of:

continuously monitoring the working states of a master node and a slave node by using monitoring components respectively attached to the first data space node and the second data space node, and raising an alarm when the working state of the master node shows that the first data space node has failed;

in response to receiving the alarm, setting the first data space node as the slave node and the second data space node as the master node, and processing client services by using the second data space node as the master node;

in response to the second data space node, serving as the master node, processing client services normally, synchronizing the service data of the second data space node by using a memory server;

and, in response to the first data space node, serving as the slave node, returning to normal, synchronizing in turn the service data of the memory server and of the second data space node serving as the master node.

2. The method of claim 1, wherein the client services comprise requests by different visitors for data space storage and/or data resources; the service data includes authentication information and logs for the shared and/or independent storage and/or data resources allocated to and/or reclaimed from different visitors.

3. The method of claim 1, wherein continuously monitoring the operational status of the master node and the slave nodes comprises: continuously monitoring, using the monitoring component, information of the first data space node and the second data space node for at least one of: memory, processor, disk;

4. The method of claim 1, wherein placing the first data space node as a slave node and the second data space node as a master node comprises:

notifying a distributed coordination system by the monitoring component of the first data space node, whereupon the distributed coordination system forcibly sets the first data space node as the slave node and the second data space node as the master node.

5. The method of claim 1, further comprising: after the service data of the memory server and the second data space node are synchronized, the first data space node is set as a master node and the second data space node is set as a slave node.

6. A failover apparatus for a data space node, comprising:

a processor; and

7. The apparatus of claim 6, wherein the client services comprise requests by different visitors for data space storage and/or data resources; the service data includes authentication information and logs for the shared and/or independent storage and/or data resources allocated to and/or reclaimed from different visitors.

8. The apparatus of claim 6, wherein continuously monitoring the operational status of the master node and the slave nodes comprises: continuously monitoring, using the monitoring component, information of the first data space node and the second data space node for at least one of: memory, processor, disk;

9. The apparatus of claim 6, wherein placing the first data space node as a slave node and the second data space node as a master node comprises:

10. The apparatus of claim 6, wherein the steps further comprise: after the service data of the memory server and the second data space node are synchronized, the first data space node is set as a master node and the second data space node is set as a slave node.