CN114629783B - State monitoring method, system, equipment and computer readable storage medium - Google Patents

State monitoring method, system, equipment and computer readable storage medium

Info

Publication number
CN114629783B
Authority
CN
China
Prior art keywords
node
information center
backup cluster
cluster
health
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210245318.0A
Other languages
Chinese (zh)
Other versions
CN114629783A (en
Inventor
时培植
李向东
胡军擎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Information2 Software Inc
Original Assignee
Shanghai Information2 Software Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Information2 Software Inc
Priority to CN202210245318.0A
Publication of CN114629783A
Application granted
Publication of CN114629783B
Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668: Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

Abstract

The invention discloses a state monitoring method, a state monitoring system, state monitoring equipment and a computer readable storage medium. The method comprises: when a current backup cluster node is started, sending a query request to an information center with which a communication connection has been established, wherein the query request carries node information of the current backup cluster node expressed as a key-value pair; receiving and caching a healthy node list, fed back by the information center, that corresponds to the node information; and, when a node registration event and/or a node downtime event sent by the information center is received, updating the corresponding locally cached healthy node list, so as to autonomously monitor the health states of all healthy nodes in the healthy node list. The technical scheme of the invention solves the technical problem in the prior art that a data backup task can be executed only while the central master control arbitration module works normally, and achieves the technical effects of reducing the coupling between the central master control arbitration module and the backup cluster nodes and improving the availability of the whole system.

Description

State monitoring method, system, equipment and computer readable storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technologies, and in particular, to a method, a system, an apparatus, and a computer readable storage medium for monitoring a state.
Background
With the widespread use of computer data backup, to avoid a single point of failure, backup cluster nodes must be monitored to find healthy backup cluster nodes when a backup task is performed. In the existing monitoring method of the backup cluster nodes in the data backup cluster, a central master control arbitration module is used for monitoring all the backup cluster nodes, and when a data backup task needs to be distributed to the backup cluster nodes for execution, a query is sent to the central master control arbitration module to know which backup cluster nodes are healthy.
However, in the prior art all backup cluster nodes depend on the central master control arbitration module, and the dependency is a strong one. Once the central master control arbitration module cannot work normally, the health information of the backup cluster nodes cannot be obtained, and data backup tasks cannot be executed normally.
Disclosure of Invention
In view of this, the present invention provides a state monitoring method, system, device and computer readable storage medium, so as to solve the technical problem that, in the prior art, the central master control arbitration module is a strong dependency of all cluster nodes in a data backup cluster scenario, so that data backup tasks can be executed only while the arbitration module works normally.
In a first aspect, an embodiment of the present invention provides a status monitoring method, including:
when a current backup cluster node is started, sending a query request to an information center for establishing communication connection; the query request carries node information of the current backup cluster node, and the node information is characterized in a key value pair mode;
receiving and caching a healthy node list corresponding to the node information fed back by the information center;
and when a node registration event and/or a node downtime event sent by the information center are received, updating a corresponding locally cached health node list so as to autonomously monitor the health states of all the health nodes in the health node list.
In a second aspect, an embodiment of the present invention further provides a status monitoring system, including: an information center and at least one backup cluster node; the information center is in communication connection with each backup cluster node;
when the backup cluster node is started, a query request is sent to an information center for establishing communication connection so as to acquire and cache a health node list of a cluster where the backup cluster node is located, and when a node registration event and/or a node downtime event sent by the information center are received, the locally cached health node list is updated so as to autonomously monitor the health states of all the health nodes in the health node list.
In a third aspect, an embodiment of the present invention further provides a state monitoring device, where the device includes: a memory, and one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the state monitoring method as described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the state monitoring method as described in the first aspect.
According to the embodiment of the invention, when the current backup cluster node is started, it sends a query request to the information center with which it has established a communication connection and caches locally the healthy node list that the information center feeds back for the node information. Healthy backup cluster nodes are then selected from the local healthy node list, which reduces the coupling between the central master control arbitration module and the backup cluster nodes: selecting a healthy backup cluster node no longer depends strongly on the central master control arbitration module, the probability of task failure is greatly reduced, and the availability of the whole system is improved. In addition, when a node registration event and/or a node downtime event sent by the information center is received, the corresponding locally cached healthy node list is updated, so that each node perceives the health states of the other nodes autonomously and thereby monitors the health states of the other backup cluster nodes on its own.
Drawings
FIG. 1 is a flow chart of a method for monitoring status according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a status monitoring method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a status monitoring method according to a third embodiment of the present invention;
FIG. 4 is a schematic diagram of registering and querying a healthy node list at node startup according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of updating a local healthy node list at a node downtime event according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a state monitoring system according to a fourth embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer-readable storage medium provided in a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a state monitoring method provided in an embodiment of the present invention, where the embodiment is applicable to a scenario without a central master arbitration module in a data backup cluster system, the method may be performed by a backup cluster node, and the backup cluster node may be implemented in a software and/or hardware manner. The backup cluster node may be a server or a terminal with a data processing function, such as a computer, and the method specifically includes:
S110, when the current backup cluster node is started, a query request is sent to the information center with which a communication connection has been established.
The query request may carry node information of the current backup cluster node, where the node information is characterized in a key value pair form.
The backup cluster node refers to a node for executing data backup tasks in the data backup system.
It should be noted that the information center is an independent process in the cluster system and runs in a container. An agent runs on every backup cluster node in the backup cluster as a new thread of the backup daemon. The agent has two functions: first, it sends keep-alive messages to the information center at regular intervals; second, it subscribes to node downtime events and node start events from the information center and, on receiving such an event, updates the map data structure that holds the healthy node list in memory.
In an embodiment, the information center may be a key-value pair service: as a background service it accepts the registration of each backup cluster node and records and updates the backup cluster node list in the form of key-value pairs. The information center does not act as an arbiter; it only provides the key-value pair service. Each backup cluster node obtains the health status of the other nodes from the information center and then arbitrates and decides on its own. The information center is a distributed key-value pair service. Each backup cluster node writes information such as its own health state into the information center as a key-value pair so that other backup cluster nodes can read it; in other words, the information of all backup cluster nodes in the data backup system is stored in the information center. In addition, because the information center is a distributed service deployed as a cluster, it has no single point of failure: even if one of its nodes goes down, the remaining nodes continue to serve, so the information center as a whole is unlikely to go down.
It should be noted that the key may sit at the end of the registry structure chain, much like a file in a file system, and holds the actual configuration information and data used when running the current computer and applications. The value may take several data types to suit the requirements of different environments. In an embodiment, each backup cluster node is configured with a node identifier and a cluster identifier: the node identifier uniquely identifies the backup cluster node, and the cluster identifier indicates the cluster to which the backup cluster node belongs. Each backup cluster node may belong to one or more clusters, that is, it may correspond to one or more cluster identifiers, and the number of cluster identifiers of a backup cluster node equals the number of clusters to which it belongs. The cluster identifier of each cluster is unique.
In one embodiment, the node information is generated as follows: a character string is generated from the node identifier of the current backup cluster node and the cluster identifier of the cluster in which it is located, and the character string is used as the node addressing information of the current backup cluster node. In the embodiment, when the current backup cluster node is started, the character string formed from the cluster identifier and the node identifier is used as the key, and the key is then combined with a value to obtain the node information of the current backup cluster node. In the key, the cluster identifier comes first and the node identifier second.
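The key construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source fixes only the order (cluster identifier first, node identifier second), so the "/" separator and the function names `make_key`, `cluster_prefix`, `parse_key` and `keys_for_node` are assumptions introduced here.

```python
from typing import List, NamedTuple

SEPARATOR = "/"  # assumed delimiter; the source only specifies the ordering


class NodeKey(NamedTuple):
    cluster_id: str
    node_id: str


def make_key(cluster_id: str, node_id: str) -> str:
    """Build the key string: cluster identifier first, node identifier second."""
    if SEPARATOR in cluster_id or SEPARATOR in node_id:
        raise ValueError("identifiers must not contain the separator")
    return f"{cluster_id}{SEPARATOR}{node_id}"


def cluster_prefix(cluster_id: str) -> str:
    """Prefix shared by the keys of every node in one cluster."""
    return f"{cluster_id}{SEPARATOR}"


def parse_key(key: str) -> NodeKey:
    """Recover the cluster identifier and node identifier from a key."""
    cluster_id, node_id = key.split(SEPARATOR, 1)
    return NodeKey(cluster_id, node_id)


def keys_for_node(node_id: str, cluster_ids: List[str]) -> List[str]:
    """A node that belongs to several clusters gets one key per cluster."""
    return [make_key(cluster_id, node_id) for cluster_id in cluster_ids]
```

Putting the cluster identifier first means that all keys of one cluster share a common prefix, which is what makes a per-cluster prefix query possible.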
In an embodiment, when the current backup cluster node is started, it performs its self-starting operation and sends a query request carrying its key value to the associated information center. Because the query request carries the key value of the current backup cluster node, the information center can use the node identifier and cluster identifier in the key value to find the corresponding healthy node list in memory quickly and accurately.
S120, receiving and caching a healthy node list corresponding to the node information fed back by the information center.
The healthy node list corresponding to the node information can be understood as a table of all healthy backup cluster nodes in the cluster where the current backup cluster node is located. In an embodiment, the information center stores all backup cluster nodes of the data backup system that are in a healthy state. After receiving the query request sent by the current backup cluster node, the information center parses the node information in the query request to obtain the cluster identifier of the current backup cluster node, looks up in its memory all healthy nodes of the cluster with that cluster identifier, and feeds back a healthy node list containing all of them, which the current backup cluster node receives and caches locally. If the current backup cluster node has several cluster identifiers, that is, if it belongs to several clusters, the information center feeds back several healthy node lists. Each cluster corresponds to one healthy node list, so the number of healthy node lists fed back to the current backup cluster node equals the number of clusters to which it belongs.
In an embodiment, the healthy node list may be represented as a map data structure. In an embodiment, the information center may provide a mechanism for querying all keys under a given prefix. Because the key of every node in a cluster is the cluster identifier followed by the node identifier, each node can build a prefix from the cluster identifier at startup, and the information center returns all keys under that prefix, which yields the full list of healthy nodes, namely the healthy node list.
The information center may also provide an event service, and a node may specify a prefix when subscribing to events. Whenever a key under that prefix is added, deleted or modified, all subscribers receive an event. Every node in the invention can build a prefix from the cluster identifier and subscribe to all events under it. When a new node joins the cluster or starts, a key-added event is generated; when the information center deletes an existing node because it has failed to keep alive, a key-deleted event is generated. An event carries an event type and the specific key concerned, from which the cluster identifier and node identifier can be parsed. On receiving the event, a healthy node learns that a node has joined or left and modifies the cache in its local memory accordingly.
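The prefix query and prefix-scoped event subscription described above can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions: the class name `InfoCenter`, its method names, the event names "put" and "delete", and the "/" key separator are all hypothetical, and a real information center would be a distributed service rather than a single Python object.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]  # (event_type, key); event_type is "put" or "delete"


class InfoCenter:
    """Hypothetical in-memory stand-in for the key-value information center."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self._subscribers: List[Tuple[str, Callable[[Event], None]]] = []

    def put(self, key: str, value: str) -> None:
        """Add or modify a key and notify matching subscribers."""
        self._store[key] = value
        self._notify(("put", key))

    def delete(self, key: str) -> None:
        """Delete a key (for example after keep-alive failure) and notify."""
        if key in self._store:
            del self._store[key]
            self._notify(("delete", key))

    def get_prefix(self, prefix: str) -> Dict[str, str]:
        """Return every key-value pair whose key starts with the prefix."""
        return {k: v for k, v in self._store.items() if k.startswith(prefix)}

    def subscribe(self, prefix: str, callback: Callable[[Event], None]) -> None:
        """Register a callback for all events on keys under the prefix."""
        self._subscribers.append((prefix, callback))

    def _notify(self, event: Event) -> None:
        for prefix, callback in self._subscribers:
            if event[1].startswith(prefix):
                callback(event)
```

A subscriber that passes the prefix of its own cluster receives only the events of that cluster, so nodes of other clusters do not disturb its cache.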
S130, when a node registration event and/or a node downtime event sent by the information center is received, updating the corresponding locally cached healthy node list so as to autonomously monitor the health states of all healthy nodes in the healthy node list.
A node registration event is an event in which a backup cluster node registers itself with the information center; a node downtime event is an event in which a backup cluster node fails and cannot perform backup tasks. In an embodiment, to ensure that each backup cluster node in the data backup system can accurately obtain the information of all healthy nodes from the information center, each backup cluster node, when started, actively sends a registration request to its associated information center so that its own information is recorded there as a key-value pair. The information center then forwards the node registration event to the other healthy nodes, which add the information of the newly registered backup cluster node to their locally cached healthy node lists.
Similarly, when a backup cluster node in the data backup system goes down, the information center sends (that is, broadcasts) a node downtime event to the other healthy nodes, so that those nodes delete the information of the backup cluster node corresponding to the node downtime event from their locally cached healthy node lists.
It should be noted that as long as a node's key is present in the information center, whatever its value, the backup cluster node is considered healthy. When the backup cluster node goes down, that is, when it no longer sends keep-alive messages to the information center at the regular interval, the information center automatically deletes the key.
It should be noted that a backup cluster node may belong to several clusters. When a backup cluster node for which a node registration event or node downtime event occurs belongs to several clusters, the information center notifies the other backup cluster nodes in each of those clusters so that they update the corresponding locally cached healthy node list. For example, assume that the node downtime event occurs on node 3 and that node 3 belongs to cluster 1 and cluster 2. The information center then sends the node downtime event to the other healthy nodes in cluster 1 and to the other healthy nodes in cluster 2, so that the other healthy nodes in cluster 1 update the locally cached healthy node list 1 and the other healthy nodes in cluster 2 update the locally cached healthy node list 2. In this way each healthy node autonomously monitors the health states of the other backup cluster nodes.
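The cache update on registration and downtime events, including the multi-cluster case of node 3 above, can be sketched as follows. The class `HealthyNodeCache`, the event names "register" and "down", and the "cluster/node" key format are assumptions made for illustration; the source states only that the healthy node list is held in a map data structure and that events carry a type and a key.

```python
from typing import Dict, Iterable, List, Set


class HealthyNodeCache:
    """Hypothetical local cache: one healthy node list per cluster identifier."""

    def __init__(self) -> None:
        self._lists: Dict[str, Set[str]] = {}

    def replace(self, cluster_id: str, node_ids: Iterable[str]) -> None:
        """Install a full healthy node list, as received at startup."""
        self._lists[cluster_id] = set(node_ids)

    def apply_event(self, event_type: str, key: str) -> None:
        """Apply one event; the key is assumed to be 'cluster_id/node_id'."""
        cluster_id, node_id = key.split("/", 1)
        nodes = self._lists.setdefault(cluster_id, set())
        if event_type == "register":
            nodes.add(node_id)        # node joined or restarted
        elif event_type == "down":
            nodes.discard(node_id)    # node failed to keep alive
        else:
            raise ValueError(f"unknown event type: {event_type}")

    def healthy(self, cluster_id: str) -> List[str]:
        """Return the healthy nodes of one cluster in sorted order."""
        return sorted(self._lists.get(cluster_id, set()))
```

Because each event names exactly one key, a node that belongs to two clusters produces two events, and each affects only the list of the cluster named in its key.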
According to the technical scheme of this embodiment, when the current backup cluster node is started, it sends a query request to the information center with which it has established a communication connection and caches locally the healthy node list that the information center feeds back for the key value. Healthy backup cluster nodes are then selected from the local healthy node list, which reduces the coupling between the central master control arbitration module and the backup cluster nodes: selecting a healthy backup cluster node no longer depends strongly on the central master control arbitration module, the probability of task failure is greatly reduced, and the availability of the whole system is improved. In addition, when a node registration event and/or a node downtime event sent by the information center is received, the corresponding locally cached healthy node list is updated, so that each node perceives the health states of the other nodes autonomously and thereby monitors the health states of the other backup cluster nodes on its own.
Example 2
Fig. 2 is a flowchart of a status monitoring method according to a second embodiment of the present invention, where the status monitoring method is further optimized based on the foregoing embodiments, and features of "sending keep-alive messages to an information center that establishes a communication connection at regular time" are added. As shown in fig. 2, the state monitoring method in this embodiment specifically includes the following steps:
S210, when the current backup cluster node is started, a registration request and a subscription request are each sent to the information center with which a communication connection has been established, so that the node information of the current backup cluster node is stored in the information center and the node registration events and node downtime events of the information center are subscribed to.
The node information may be expressed in the form of key-value pairs. In an embodiment, the registration request is a request by which the backup cluster node stores its own node information in the information center; the subscription request is a request by which the backup cluster node subscribes to the node registration events and node downtime events of the information center.
In the embodiment, when the current backup cluster node is started, it performs its self-starting operation and sends a registration request to the information center so that its node information is stored there. It also sends a subscription request, so that when the information center detects a registration request from another backup cluster node or detects that a node has gone down, it sends the corresponding node registration event or node downtime event to the current backup cluster node, which then updates its locally cached healthy node list.
S220, when the current backup cluster node is started, a query request is sent to an information center for establishing communication connection.
The query request carries node information of the current backup cluster node, and the node information is characterized in a key value pair mode.
S230, receiving and caching a healthy node list corresponding to the node information fed back by the information center.
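Steps S210 to S230 can be sketched as a startup sequence: register, subscribe, then query and cache. The code below is a minimal illustration under stated assumptions. `StubInfoCenter` is a hypothetical in-memory stand-in for the information center, `BackupNode` and its method names are invented here, keys are assumed to have the form "cluster/node", and the stored value "healthy" is arbitrary because the source treats the mere presence of a key as the health signal.

```python
from typing import Callable, Dict, List, Set, Tuple

SEP = "/"  # assumed separator between cluster identifier and node identifier


class StubInfoCenter:
    """Hypothetical in-memory stand-in for the information center."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.subscribers: List[Tuple[str, Callable[[str, str], None]]] = []

    def register(self, key: str, value: str) -> None:
        """Store the key-value pair and forward a registration event."""
        self.store[key] = value
        for prefix, callback in self.subscribers:
            if key.startswith(prefix):
                callback("register", key)

    def subscribe(self, prefix: str, callback: Callable[[str, str], None]) -> None:
        self.subscribers.append((prefix, callback))

    def query(self, prefix: str) -> List[str]:
        """Full query: every registered key under the cluster prefix."""
        return sorted(k for k in self.store if k.startswith(prefix))


class BackupNode:
    """Hypothetical backup cluster node that belongs to a single cluster."""

    def __init__(self, center: StubInfoCenter, cluster_id: str, node_id: str) -> None:
        self.center = center
        self.key = f"{cluster_id}{SEP}{node_id}"
        self.prefix = f"{cluster_id}{SEP}"
        self.healthy: Set[str] = set()

    def start(self) -> None:
        self.center.register(self.key, "healthy")            # S210: register
        self.center.subscribe(self.prefix, self.on_event)    # S210: subscribe
        self.healthy = set(self.center.query(self.prefix))   # S220/S230: query and cache

    def on_event(self, event_type: str, key: str) -> None:
        """Keep the cached list in step with later events."""
        if event_type == "register":
            self.healthy.add(key)
        elif event_type == "down":
            self.healthy.discard(key)
```

In this sketch the node subscribes before it issues the full query, so a node that registers between the two calls is still seen through its event.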
S240, in the running process of the current backup cluster node, a query request is sent to the information center at regular time so as to update the corresponding locally cached healthy node list.
In an embodiment, while the current backup cluster node is running, it may send a query request carrying its cluster identifier to the information center at regular intervals, so that the information center looks up all keys under that cluster identifier and returns the healthy node list of all healthy nodes in the cluster. The query request sent at regular intervals during operation is a full query, identical to the query request sent at startup. Therefore, even if a node downtime event or node start event is lost, which happens only with very low probability, the current backup cluster node obtains a correct healthy node list again as soon as possible, which ensures stable execution of backup tasks.
In an embodiment, while the current backup cluster node is running, the interval at which the query request is sent to the information center may be a pre-configured fixed interval. The fixed interval may be measured in hours, for example 6 hours, but is not limited thereto.
In an embodiment, to keep the cached healthy node list from becoming stale when a node downtime event or node start event is lost, each backup cluster node may, every several hours, read the full healthy node list of its cluster from the information center, just as at startup, and update its cache. In subsequent data backup tasks, the backup cluster node can then distribute data backup tasks to other healthy nodes for execution. In addition, if a backup cluster node goes down while executing a task, a healthy node can be selected from the latest healthy node list to execute the data backup task again.
It follows that the backup cluster node of the invention periodically sends the information center a full query for the healthy node list of its cluster, as it does at startup. Thus, even if a downtime or start event is accidentally lost, which happens only with very small probability, the node eventually holds an up-to-date healthy node list. In addition, when a backup task needs to be executed, the executing node can be chosen from the healthy nodes, which avoids backup task failures caused by node downtime as far as possible.
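The periodic full query can be sketched as a resynchronization step that replaces the local cache whenever the configured interval has elapsed. This is an illustrative sketch: the class `PeriodicResync`, the injected `fetch_full_list` callable and the explicit `now` argument (used so that the logic can be exercised without waiting) are assumptions; the 6-hour default follows the example interval given in the text.

```python
from typing import Callable, Optional, Set


class PeriodicResync:
    """Hypothetical helper that refreshes a cached healthy node list on a timer."""

    def __init__(self, fetch_full_list: Callable[[], Set[str]],
                 interval_seconds: float = 6 * 3600) -> None:
        self._fetch = fetch_full_list        # performs the full query
        self.interval = interval_seconds     # pre-configured fixed interval
        self.cache: Set[str] = set()
        self._last_sync: Optional[float] = None

    def maybe_resync(self, now: float) -> bool:
        """Run the full query if none has run yet or the interval has elapsed."""
        due = self._last_sync is None or now - self._last_sync >= self.interval
        if due:
            # Replace rather than merge, so that entries whose removal
            # events were lost disappear from the cache as well.
            self.cache = set(self._fetch())
            self._last_sync = now
        return due
```

A caller would invoke `maybe_resync(time.monotonic())` from its own timer or main loop; between resynchronizations the cache is still maintained incrementally by events.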
S250, when a node registration event and/or a node downtime event sent by the information center is received, updating the corresponding locally cached healthy node list so as to autonomously monitor the health states of all healthy nodes in the healthy node list.
Specifically, each backup cluster node in the invention can subscribe to all node update information in its cluster and update the map data structure of the healthy node list in memory accordingly. That is, the map data structure is built at startup from the full healthy node list and is then updated incrementally: entries are added on node registration events and deleted on node downtime events.
According to the technical scheme of this embodiment, on the basis of the foregoing embodiment, a full query request is sent to the information center at regular intervals while the current backup cluster node is running in order to update the locally cached healthy node list. This prevents the cached healthy node list from becoming stale when a node downtime event or node start event is lost with very low probability, and ensures stable execution of backup tasks.
In an embodiment, the state monitoring method further includes: and sending keep-alive messages to the information center for establishing the communication connection at regular time.
The keep-alive message is used for representing that the current backup cluster node is in a health state.
On the basis of the above scheme, optionally, when the interval between two adjacent keep-alive messages of the current backup cluster node reaches a preset time interval, the current backup cluster node is considered to be in a downtime state. The interval between two adjacent transmissions is the difference between the time at which the current backup cluster node sends the current keep-alive message to the information center and the time at which it sent the previous one. The preset time interval is generally a multiple of the regular sending period; for example, when the backup cluster node sends a keep-alive message to the information center every 10 seconds, the preset time interval may be 20 seconds.
In an embodiment, the current backup cluster node may send keep-alive messages to the information center at regular intervals. If the information center receives no keep-alive message from a node for longer than a certain time interval (typically several times the sending interval of the keep-alive message), the backup cluster node is considered down, and the information center may broadcast the node downtime event to all subscribers within the cluster.
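The keep-alive timeout on the information center side can be sketched as follows, using the figures from the text (a keep-alive every 10 seconds and a preset interval of 20 seconds). The class `KeepAliveTracker` and its method names are hypothetical, and timestamps are passed in explicitly so that the behaviour can be checked deterministically.

```python
from typing import Dict, List


class KeepAliveTracker:
    """Hypothetical tracker of the last keep-alive time of each node key."""

    def __init__(self, timeout_seconds: float = 20.0) -> None:
        self.timeout = timeout_seconds          # preset time interval
        self._last_seen: Dict[str, float] = {}

    def keep_alive(self, key: str, now: float) -> None:
        """Record a keep-alive message received from the node at time 'now'."""
        self._last_seen[key] = now

    def expire(self, now: float) -> List[str]:
        """Delete keys whose silence has reached the timeout and return them.

        Each returned key corresponds to one node downtime event that would
        be broadcast to the subscribers of that cluster.
        """
        expired = sorted(
            key for key, seen in self._last_seen.items()
            if now - seen >= self.timeout
        )
        for key in expired:
            del self._last_seen[key]
        return expired

    def is_healthy(self, key: str) -> bool:
        """A node counts as healthy exactly while its key is present."""
        return key in self._last_seen
```

Choosing a timeout that is a multiple of the sending period tolerates a single delayed or dropped keep-alive message without declaring the node down.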
Example 3
Fig. 3 is a flowchart of a state monitoring method according to a third embodiment of the present invention, where the present embodiment is further optimized based on the foregoing embodiments. This embodiment describes a process of state monitoring as a preferred embodiment. The state monitoring method in this embodiment specifically includes the following steps:
S310, when a backup cluster node is started, it automatically registers its key-value pair with the information center and subscribes to the node registration events and node downtime events of all backup cluster nodes from the information center.
In the embodiment, the information center is a key-value pair service. When each backup cluster node is started, it sends a registration request to the information center, and the information center records the node's information as a key-value pair and updates the healthy node list corresponding to that backup cluster node. Each backup cluster node is configured with its own key-value pair, whose key is a character string generated from the node identifier (also referred to as the node id) and the cluster identifier (also referred to as the cluster id).
S320, when the backup cluster node is started, it reads the healthy node list of its cluster from the information center and caches the list locally.
All backup cluster nodes are stored in different healthy node lists according to their cluster identifiers. Backup cluster nodes with the same cluster identifier are stored in the same healthy node list, which makes it quick to query the health state of each node. When a new backup cluster node is started, it requests from the information center the healthy node list of the cluster in which it is currently located, which facilitates the distribution and transfer of cluster backup tasks and keeps the backup cluster nodes running stably while they execute tasks.
S330, the backup cluster node continuously sends keep-alive messages to the information center.
In this embodiment, each backup cluster node sends keep-alive messages to the information center continuously, which may be understood as periodically. Once a backup cluster node goes down or recovers, all other healthy nodes in the cluster receive a node downtime event or a node recovery event from the information center and update their locally cached healthy node lists accordingly.
Because the backup cluster nodes send keep-alive messages to the information center, the information center can obtain the health status of each backup cluster node in real time, that is, whether the node can execute backup tasks normally. When a node goes down or a node recovers, the information center promptly notifies all other backup cluster nodes in the cluster. Every node thus autonomously perceives the health status of all other backup cluster nodes in the cluster, which achieves the purpose of state monitoring.
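The event-driven update of the local cache might look like the sketch below. The class name `LocalHealthyList` and the event kinds `"registered"` and `"down"` are hypothetical; a node recovery is modeled here as a new registration, which is an assumption rather than something the text specifies.

```python
class LocalHealthyList:
    """Locally cached healthy node list, updated from information center events."""

    def __init__(self, initial):
        self.healthy = set(initial)

    def on_event(self, kind: str, node_id: str) -> None:
        if kind == "registered":      # a node started or recovered
            self.healthy.add(node_id)
        elif kind == "down":          # a node missed its keep-alive deadline
            self.healthy.discard(node_id)
        else:
            raise ValueError(f"unknown event kind: {kind}")


cache = LocalHealthyList(["node-1", "node-2", "node-3"])
cache.on_event("down", "node-2")
cache.on_event("registered", "node-4")
```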
S340, while running, each backup cluster node reads from the information center, at intervals of a preset duration, the healthy node list containing all other healthy nodes in the cluster, so as to update its local cache.
It should be noted that, to guard against lost events, each backup cluster node periodically refreshes its healthy node list while running, so that the most recent changes are reflected in the local record and the data loss caused by downtime or other faults is reduced.
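The periodic refresh can be viewed as a reconciliation between the cached list and the authoritative list. The sketch below assumes a caller invokes `refresh_cache` once per preset interval; the function name and the `fetch_list` callback are hypothetical.

```python
def refresh_cache(local: set, fetch_list) -> tuple:
    """Replace the cached list with the latest one and report what changed.

    `fetch_list` is a hypothetical callable that queries the information center.
    """
    latest = set(fetch_list())
    added = latest - local      # registrations whose events were missed
    removed = local - latest    # downtime events that were missed
    local.clear()
    local.update(latest)
    return added, removed


local = {"node-1", "node-2"}
added, removed = refresh_cache(local, lambda: ["node-1", "node-3"])
```

Returning the differences is optional, but it makes missed events visible, for example for logging.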
S350, in subsequent data backup tasks, the backup cluster node distributes the data backup task to a healthy node for execution.
In this embodiment, if a backup cluster node goes down while executing a task, a healthy node can be selected from the latest healthy node list to execute the data backup task again.
It should be noted that, because each node reads the information of all healthy nodes from the information center at fixed intervals, a task interrupted by a node going down can be reassigned promptly and effectively to a node in a normal working state (that is, a healthy state).
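Re-executing a task on another healthy node could be sketched as below. The names `run_backup` and `flaky_execute` are hypothetical, and treating `ConnectionError` as the signal that the selected node is down is an assumption for illustration.

```python
def run_backup(task: str, healthy: list, execute):
    """Try healthy nodes in order until one of them completes the task."""
    while healthy:
        node = healthy[0]
        try:
            return node, execute(node, task)
        except ConnectionError:
            # The selected node is treated as down: drop it and try the next one.
            healthy.remove(node)
    raise RuntimeError("no healthy node available for the backup task")


def flaky_execute(node: str, task: str) -> str:
    # Simulated executor in which node-1 has gone down.
    if node == "node-1":
        raise ConnectionError("node-1 is down")
    return f"{task} done on {node}"


healthy = ["node-1", "node-2", "node-3"]
result = run_backup("backup-42", healthy, flaky_execute)
```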
Compared with the prior art, this scheme discloses a method by which nodes in a data backup cluster system monitor one another autonomously, without a central master control arbitration module. In this method, the information center takes on a role similar to that of an arbitration module, but each backup cluster node autonomously maintains a healthy node list in local memory. Because node selection is based on the local cache, the coupling between the backup cluster nodes and the arbitration module is reduced, and the selection of healthy nodes no longer depends strongly on the arbitration module. When the arbitration module is down, a backup cluster node can still select nodes to execute backup tasks by relying on the healthy node list in local memory; a backup task fails only if the selected node itself goes down while the arbitration module is down. The probability of task failure is thereby greatly reduced, and the availability of the whole system is improved.
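The tolerance to an information center outage can be illustrated with a small sketch in which node selection reads only the local cache. The class `LocalView`, the round-robin selection policy, and the use of `ConnectionError` to represent an unreachable information center are all assumptions; the text does not prescribe a selection policy.

```python
class LocalView:
    """Healthy node list held in local memory; selection never calls the center."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def try_refresh(self, fetch) -> bool:
        """Refresh from the information center, keeping the cache on failure."""
        try:
            self.nodes = list(fetch())
        except ConnectionError:
            return False  # center unreachable: the cached list stays in use
        return True

    def pick(self) -> str:
        """Round-robin selection over the cached list (assumed policy)."""
        if not self.nodes:
            raise LookupError("local healthy node list is empty")
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node


def unreachable_center():
    raise ConnectionError("information center is down")


view = LocalView(["node-1", "node-2"])
refreshed = view.try_refresh(unreachable_center)
first, second = view.pick(), view.pick()
```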
Fig. 4 is a schematic diagram of registration and healthy node list query at node startup according to an embodiment of the present invention. As shown in fig. 4, the information center stores n backup cluster nodes in a healthy state, namely node 1, node 2, …, node n. In this embodiment, when a backup cluster node (for example, any one of node 1, node 2, …, node n) sends a registration request to the information center, it stores its own node information in the information center; when the information center receives a query request from a backup cluster node, it returns to that node the healthy node list of all healthy nodes in the cluster where the node is located.
Fig. 5 is a schematic diagram of updating the local healthy node list upon a node downtime event according to an embodiment of the present invention. As shown in fig. 5, while running, a backup cluster node (e.g., any of node 1, node 2, …, node n) periodically sends keep-alive messages to the information center (routine keep-alive), so that the information center can keep track of all backup cluster nodes in a healthy state. When the time since a backup cluster node last sent a keep-alive message reaches a preset time interval, that is, when a keep-alive timeout occurs, the node is considered to be down, and the information center sends a node downtime event to the other backup cluster nodes in a healthy state so that they update their locally cached healthy node lists. For example, as shown in fig. 5, when node 2 fails to send its keep-alive message to the information center on schedule, the information center considers that node 2 is down and sends the node downtime event for node 2 to the other nodes, so that the other backup cluster nodes (namely node 1, node 3, node 4, node 5, …, node n) update their locally cached healthy node lists.
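The timeout check on the information center side can be sketched as follows. The class `KeepAliveTracker`, the explicit `now` timestamps (passed in so that the example does not depend on a real clock), and the subscriber callback signature are assumptions made for illustration.

```python
class KeepAliveTracker:
    """Tracks the last keep-alive per node and reports nodes that timed out."""

    def __init__(self, timeout: float):
        self.timeout = timeout      # the preset time interval, in seconds
        self.last_seen = {}         # node id -> time of the last keep-alive
        self.subscribers = []       # callables taking (event kind, node id)

    def keep_alive(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now

    def sweep(self, now: float) -> list:
        """Mark nodes whose keep-alive gap reached the timeout as down."""
        down = [n for n, t in self.last_seen.items() if now - t >= self.timeout]
        for node_id in down:
            del self.last_seen[node_id]
            for notify in self.subscribers:
                notify("down", node_id)
        return down


events = []
tracker = KeepAliveTracker(timeout=10)
tracker.subscribers.append(lambda kind, node: events.append((kind, node)))
tracker.keep_alive("node-1", now=0)
tracker.keep_alive("node-2", now=0)
tracker.keep_alive("node-1", now=8)   # node-2 stops sending keep-alives
down = tracker.sweep(now=12)
```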
Example four
Fig. 6 is a schematic structural diagram of a state monitoring system according to a fourth embodiment of the present invention. The system can execute the state monitoring method of any embodiment of the present invention and has the functional modules and beneficial effects corresponding to that method. This embodiment is applicable to a data backup cluster system that has no central master control arbitration module.
As shown in fig. 6, the system includes: an information center 610 and at least one backup cluster node 620; the information center 610 is communicatively coupled to each of the backup cluster nodes 620.
When started, the backup cluster node 620 sends a query request to the information center 610 with which it has established a communication connection, in order to obtain and cache the healthy node list of the cluster where the backup cluster node 620 is located. When it receives a node registration event and/or a node downtime event sent by the information center 610, it updates the locally cached healthy node list, so as to autonomously monitor the health status of all healthy nodes in the list.
According to the technical scheme of this embodiment, when the current backup cluster node starts, it sends a query request to the information center with which it has established a communication connection, the query request carrying the key value of the current backup cluster node; it then receives and caches the healthy node list that the information center returns for that node information. When it receives a node registration event and/or a node downtime event sent by the information center, it updates the corresponding locally cached healthy node list, so as to autonomously monitor the health status of all healthy nodes in the list. This technical scheme solves the prior-art problem that data backup tasks can be executed only while the central master control arbitration module is working normally, and achieves the technical effects of reducing the coupling between the central master control arbitration module and the backup cluster nodes and improving the availability of the whole system.
On the basis of the above embodiments, before sending the query request to the information center with which the communication connection is established, the method further includes: sending a registration request and a subscription request, respectively, to the information center, so as to store the node information of the current backup cluster node in the information center and to subscribe to the node registration events and node downtime events of the information center, wherein the node information is represented as a key-value pair.
Further, the method includes sending keep-alive messages at regular intervals to the information center with which the communication connection is established; a keep-alive message indicates that the current backup cluster node is in a healthy state.
Further, after receiving and caching the healthy node list that the information center returns for the query request, the method further includes: while the current backup cluster node is running, periodically sending a query request to the information center to update the corresponding locally cached healthy node list.
Further, the node information is generated as follows: a character string is generated from the node identifier of the current backup cluster node and the cluster identifier of the cluster where that node is located, and the character string is used as the node information of the current backup cluster node.
Further, when the interval between two adjacent keep-alive messages of the current backup cluster node reaches a preset time interval, the current backup cluster node is in a downtime state.
Further, each backup cluster node corresponds to a unique key value pair.
Example five
Fig. 7 is a schematic structural diagram of a state monitoring device according to a fifth embodiment of the present invention. Fig. 7 shows a block diagram of a terminal 712 suitable for implementing embodiments of the invention. The terminal 712 shown in fig. 7 is merely an example and should not be construed as limiting the functionality or scope of use of the embodiments of the present invention.
As shown in fig. 7, the terminal 712 is in the form of a general-purpose computing device and has a function of saving pictures by photographing, screenshot, etc., and translating. The components of terminal 712 may include, but are not limited to: one or more processors 716, a storage device 728, and a bus 718 that connects the different system components (including the storage device 728 and the processor 716).
Bus 718 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Terminal 712 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by terminal 712 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage 728 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 730 and/or cache memory 732. The terminal 712 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, the storage system 734 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard disk drive"). Although not shown in fig. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 718 through one or more data media interfaces. Storage 728 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program 740 having a set (at least one) of program modules 742 may be stored, for example, in the storage device 728. Such program modules 742 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these, or some combination of them, may include an implementation of a network environment. The program modules 742 generally perform the functions and/or methods of the described embodiments of the invention.
The terminal 712 can also communicate with one or more external devices 714 (e.g., keyboard, pointing device, camera, display 724, etc.), one or more devices that enable a user to interact with the terminal 712, and/or any devices (e.g., network card, modem, etc.) that enable the terminal 712 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 722. Also, terminal 712 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, via network adapter 720. As shown, the network adapter 720 communicates with other modules of the terminal 712 via the bus 718. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with terminal 712, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 716 executes various functional applications and performs data processing by running programs stored in the storage device 728, for example implementing the state monitoring method provided by the above-described embodiments of the present invention.
Example six
A sixth embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a computer processor, perform a state monitoring method, the method comprising:
when a current backup cluster node is started, sending a query request to an information center for establishing communication connection; the query request carries node information of the current backup cluster node, and the node information is characterized in a key value pair mode;
receiving and caching a healthy node list corresponding to the node information fed back by the information center;
and when a node registration event and/or a node downtime event sent by the information center are received, updating a corresponding locally cached health node list so as to autonomously monitor the health states of all the health nodes in the health node list.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform the related operations in the state monitoring method provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the above embodiments of the state monitoring system, the units and modules are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A method for monitoring a condition, comprising:
when a current backup cluster node is started, sending a query request to an information center for establishing communication connection; the query request carries node information of the current backup cluster node, and the node information is characterized in a key value pair mode;
receiving and caching a healthy node list corresponding to the node information fed back by the information center;
when a node registration event and/or a node downtime event sent by the information center are received, updating a corresponding locally cached health node list so as to autonomously monitor the health states of all health nodes in the health node list;
wherein the generation process of the node information comprises the following steps:
generating a corresponding character string according to the node identification of the current backup cluster node and the cluster identification of the cluster where the current backup cluster node is located;
and taking the character string as node information of the current backup cluster node.
2. The method of claim 1, further comprising, prior to said sending a query request to an information center that establishes a communication connection:
and respectively sending a registration request and a subscription request to an information center for establishing communication connection so as to store the node information of the current backup cluster node to the information center, and subscribing a node registration event and a node downtime event of the information center.
3. The method according to claim 1, characterized in that the method further comprises:
sending keep-alive messages to an information center for establishing communication connection at regular time; the keep-alive message is used for representing that the current backup cluster node is in a health state.
4. The method of claim 1, further comprising, after said receiving and caching the list of healthy nodes corresponding to the query request fed back by the information center:
while the current backup cluster node is running, periodically sending a query request to the information center to update the corresponding locally cached healthy node list.
5. The method of claim 3, wherein the current backup cluster node is in a down state when a transmission interval of two adjacent keep-alive messages of the current backup cluster node reaches a preset time interval.
6. The method of claim 1, wherein each backup cluster node corresponds to a unique key-value pair.
7. A condition monitoring system, comprising: an information center and at least one backup cluster node; the information center is in communication connection with each backup cluster node;
when the backup cluster node is started, a query request is sent to an information center for establishing communication connection so as to acquire and cache a health node list of a cluster where the backup cluster node is located, and when a node registration event and/or a node downtime event sent by the information center are received, the locally cached health node list is updated so as to autonomously monitor the health states of all health nodes in the health node list;
wherein the query request carries a key value of the current backup cluster node;
the obtaining and caching the health node list of the cluster where the backup cluster node is located includes:
receiving and caching a healthy node list corresponding to the node information fed back by the information center; wherein the node information is characterized in the form of key value pairs;
the generation process of the node information comprises the following steps:
generating a corresponding character string according to the node identification of the current backup cluster node and the cluster identification of the cluster where the current backup cluster node is located;
and taking the character string as node information of the current backup cluster node.
8. A condition monitoring device, the device comprising: a memory, and one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the condition monitoring method of any one of claims 1-6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a condition monitoring method as claimed in any one of claims 1-6.
CN202210245318.0A 2022-03-14 2022-03-14 State monitoring method, system, equipment and computer readable storage medium Active CN114629783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210245318.0A CN114629783B (en) 2022-03-14 2022-03-14 State monitoring method, system, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210245318.0A CN114629783B (en) 2022-03-14 2022-03-14 State monitoring method, system, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114629783A CN114629783A (en) 2022-06-14
CN114629783B true CN114629783B (en) 2024-03-26

Family

ID=81902453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210245318.0A Active CN114629783B (en) 2022-03-14 2022-03-14 State monitoring method, system, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114629783B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103560922A (en) * 2013-11-18 2014-02-05 北京特立信电子技术股份有限公司 Disaster recovery method and system
CN106375342A (en) * 2016-10-21 2017-02-01 用友网络科技股份有限公司 Zookeeper-technology-based system cluster method and system
CN110048896A (en) * 2019-04-29 2019-07-23 广州华多网络科技有限公司 A kind of company-data acquisition methods, device and equipment
CN111404759A (en) * 2020-04-17 2020-07-10 腾讯科技(深圳)有限公司 Service detection method, rule configuration method, related device and medium
CN113282604A (en) * 2021-07-14 2021-08-20 北京远舢智能科技有限公司 High-availability time sequence database cluster system realized based on message queue
CN113742416A (en) * 2020-05-29 2021-12-03 浙江正泰电器股份有限公司 Data processing method, device, system and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9747183B2 (en) * 2013-12-31 2017-08-29 Ciena Corporation Method and system for intelligent distributed health monitoring in switching system equipment
WO2016118979A2 (en) * 2015-01-23 2016-07-28 C3, Inc. Systems, methods, and devices for an enterprise internet-of-things application development platform

Also Published As

Publication number Publication date
CN114629783A (en) 2022-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant