CN110545197B

CN110545197B - Node state monitoring method and device

Info

Publication number: CN110545197B
Application number: CN201810532541.7A
Authority: CN
Inventors: 胡君怡
Original assignee: Hangzhou Hikvision System Technology Co Ltd
Current assignee: Hangzhou Hikvision System Technology Co Ltd
Priority date: 2018-05-29
Filing date: 2018-05-29
Publication date: 2022-09-09
Anticipated expiration: 2038-05-29
Also published as: CN110545197A

Abstract

The invention discloses a node state monitoring method and device, and belongs to the field of computer application. The method comprises the following steps: when the first monitoring node does not receive heartbeat information sent by the second monitoring node in a first period, determining that the second monitoring node fails, wherein the second monitoring node is the node, among the at least two monitoring nodes, that monitors the state of at least one service node; the first monitoring node acquires the state of the at least one service node; and the first monitoring node updates the state of the at least one service node in the state record table. According to the invention, at least two monitoring nodes are configured, so that if the monitoring node currently providing the state monitoring service fails, another of the at least two monitoring nodes takes over the monitoring service in time and monitors the state of the at least one service node. This avoids the problem that the monitoring service cannot be provided because of a single point of failure of a single monitoring node, and improves the stability of state monitoring.

Description

Node state monitoring method and device

Technical Field

The invention relates to the field of computer application, in particular to a node state monitoring method and device.

Background

The cloud storage system is a system which integrates a large number of different storage devices in a network to cooperatively work and provides data storage and service access for the outside. The cloud storage system may include a plurality of service nodes, the states of the service nodes have a decisive influence on the overall service quality, and it is important to monitor the states of the service nodes.

Currently, a system monitors the states of a plurality of service nodes through a monitoring node, the plurality of service nodes send state information to the monitoring node at a predetermined period, and the monitoring node updates the states of the service nodes according to the received state information.

In the process of implementing the invention, the inventor finds that the related art has at least the following problems:

In the above method, the states of the plurality of service nodes are monitored by a single monitoring node. Once the server hosting the monitoring node goes down, a single point of failure occurs, the states of the plurality of service nodes can no longer be monitored, and the stability of state monitoring is poor.

Disclosure of Invention

The embodiment of the invention provides a node state monitoring method and device, which can solve the problem of poor stability of state monitoring in the related art. The technical scheme is as follows:

in a first aspect, a node status monitoring method is provided, and is applied to a first monitoring node of at least two monitoring nodes, where the method includes:

when the first monitoring node does not receive heartbeat information sent by a second monitoring node in a first period, determining that the second monitoring node fails, wherein the second monitoring node is a node for monitoring the state of at least one service node in the at least two monitoring nodes;

the first monitoring node acquires the state of the at least one service node;

the first monitoring node updates the state of the at least one service node in a state record table, and the state record table is recorded in a shared database corresponding to the at least two monitoring nodes;

the at least two monitoring nodes provide the same virtual IP address externally, so that a node in the at least two monitoring nodes, which is monitoring the state, acquires the state information sent by the at least one service node from the virtual IP address.

In one possible implementation manner, the obtaining, by the first monitoring node, the state of the at least one service node includes:

the first monitoring node acquires heartbeat information sent by a first service node in a second period, wherein the heartbeat information is used for indicating that the service state of the first service node is an online state, the first service node is any one of the at least one service node, and the online state represents that service can be provided;

the first monitoring node obtains reporting information sent by the first service node in a third period, wherein the reporting information is used for indicating each running state of the first service node, and the second period is smaller than the third period.

In one possible implementation, the method further includes:

and when the first monitoring node does not acquire the heartbeat information sent by the first service node in the next second period, modifying the service state of the first service node from an online state to an offline state, wherein the offline state represents that the service cannot be provided.

In one possible implementation, after the modifying the service state of the first service node from the online state to the offline state, the method further includes:

when the first monitoring node acquires the report information sent by the first service node, modifying the service state of the first service node from the offline state to the online state; or,

when the first monitoring node acquires the heartbeat information sent by the first service node, modifying the service state of the first service node from the offline state to the online state; or,

when the first monitoring node acquires the login request sent by the first service node, modifying the service state of the first service node from the offline state to the online state.
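The online/offline transitions described above can be sketched as a small state tracker. This is only an illustration under stated assumptions: the names `ServiceStateTracker`, `on_message` and `check_timeout` are hypothetical, time is supplied by the caller, and treating any received message as refreshing the last-seen time is a simplification not stated in the text.

```python
from dataclasses import dataclass, field

ONLINE = "online"
OFFLINE = "offline"


@dataclass
class ServiceStateTracker:
    """Tracks the service state of service nodes (hypothetical helper)."""
    second_period: float  # heartbeat period in seconds (assumed value)
    last_seen: dict = field(default_factory=dict)
    state: dict = field(default_factory=dict)

    def on_message(self, node_id: str, kind: str, now: float) -> None:
        # A heartbeat, a report, or a login request each marks the node online.
        if kind in ("heartbeat", "report", "login"):
            self.last_seen[node_id] = now
            self.state[node_id] = ONLINE

    def check_timeout(self, now: float) -> None:
        # Nothing received within a second period: mark the node offline.
        for node_id, seen in self.last_seen.items():
            if now - seen > self.second_period:
                self.state[node_id] = OFFLINE


tracker = ServiceStateTracker(second_period=3.0)
tracker.on_message("node-1", "login", now=0.0)
tracker.check_timeout(now=2.0)
state_before = tracker.state["node-1"]
tracker.check_timeout(now=4.0)
state_after_timeout = tracker.state["node-1"]
tracker.on_message("node-1", "report", now=5.0)
state_after_report = tracker.state["node-1"]
```

In this sketch the node stays online at 2 s, is marked offline once more than one second period has elapsed, and returns to the online state when a report arrives.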

In one possible implementation, the method further includes:

and when the first monitoring node determines that the state of the at least one service node meets a preset condition, sending alarm information to an operation and maintenance node, wherein the operation and maintenance node is used for processing the alarm information.

In a possible implementation manner, the sending, by the first monitoring node, alarm information to the operation and maintenance node when the first monitoring node determines that the state of the at least one service node satisfies a preset condition includes:

when the first monitoring node determines that the service state of a second service node is switched from the online state to the offline state, sending offline alarm information of the second service node to the operation and maintenance node, wherein the second service node is any one of the at least one service node; or,

when the first monitoring node determines that the service state of the second service node is switched from the offline state to the online state, sending online alarm information of the second service node to the operation and maintenance node; or,

when the first monitoring node determines that the state value of any one running state of the second service node meets a state alarm condition, sending state alarm information of the second service node to the operation and maintenance node; or,

when the first monitoring node determines that the percentage of the overall remaining storage capacity of the at least one service node in the overall total storage capacity meets a capacity alarm condition, sending cluster capacity alarm information to the operation and maintenance node.
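The four alarm cases above can be illustrated with a single evaluation function. This is a hedged sketch: the function name, the alarm labels, and the concrete conditions ("value exceeds a threshold", "free percentage falls below a minimum") are assumptions chosen for illustration, since the text leaves the state alarm condition and the capacity alarm condition open.

```python
def evaluate_alarms(prev_state, new_state, running, thresholds,
                    cluster_remaining, cluster_total, min_free_percent):
    """Return the alarm kinds raised by one update (hypothetical helper)."""
    alarms = []
    # Offline / online alarms on a service-state switch.
    if prev_state == "online" and new_state == "offline":
        alarms.append("offline")
    elif prev_state == "offline" and new_state == "online":
        alarms.append("online")
    # State alarm: assumed condition is "value above its threshold".
    for name, value in running.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alarms.append(f"state:{name}")
    # Cluster capacity alarm: assumed condition is "free share too small".
    free_percent = 100.0 * cluster_remaining / cluster_total if cluster_total else 0.0
    if free_percent < min_free_percent:
        alarms.append("cluster_capacity")
    return alarms


alarms = evaluate_alarms(
    prev_state="online", new_state="offline",
    running={"cpu_percent": 95.0, "memory_percent": 40.0},
    thresholds={"cpu_percent": 90.0, "memory_percent": 90.0},
    cluster_remaining=50, cluster_total=1000, min_free_percent=10.0,
)
```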

In one possible implementation, the method further comprises:

when the first monitoring node receives a login request, performing state monitoring on a service node corresponding to the login request, and adding the service node corresponding to the login request into a cluster corresponding to the at least one service node;

when the first monitoring node receives a logout request, the monitoring of the state of the service node corresponding to the logout request is stopped, and the service node corresponding to the logout request is deleted from the cluster corresponding to the at least one service node.

In a possible implementation manner, the stopping monitoring the state of the service node corresponding to the logout request includes:

and the first monitoring node deletes the service state and the running state of the service node corresponding to the logout request from a state record table.
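Login and logout handling can be sketched as follows. The class `ClusterRegistry`, its method names, and the idea that a login carries the node's capacity figures are assumptions made for this example; an in-memory dictionary stands in for the state record table.

```python
class ClusterRegistry:
    """Hypothetical registry of monitored service nodes."""

    def __init__(self):
        self.state_table = {}  # node_id -> recorded states

    def handle_login(self, node_id, total_capacity, remaining_capacity):
        # Start monitoring the node and add it to the cluster.
        self.state_table[node_id] = {
            "service_state": "online",
            "total_capacity": total_capacity,
            "remaining_capacity": remaining_capacity,
        }

    def handle_logout(self, node_id):
        # Stop monitoring: delete the node's service and running states.
        self.state_table.pop(node_id, None)

    def cluster_total_capacity(self):
        return sum(r["total_capacity"] for r in self.state_table.values())


registry = ClusterRegistry()
registry.handle_login("node-1", 100, 60)
registry.handle_login("node-2", 200, 150)
capacity_before = registry.cluster_total_capacity()
registry.handle_logout("node-1")
capacity_after = registry.cluster_total_capacity()
```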

In a second aspect, an apparatus for monitoring node status is provided, which is applied to a first monitoring node of at least two monitoring nodes, and the apparatus includes:

a determining module, configured to determine that a failure occurs in a second monitoring node when the first monitoring node does not receive heartbeat information sent by the second monitoring node in a first period, where the second monitoring node is a node that performs state monitoring on at least one service node in the at least two monitoring nodes;

an obtaining module, configured to obtain, by the first monitoring node, a state of the at least one service node;

an updating module, configured to update, by the first monitoring node, a state of the at least one service node in a state record table, where the state record table is recorded in a shared database corresponding to the at least two monitoring nodes;

the at least two monitoring nodes provide the same virtual IP address to the outside, so that the node performing state monitoring in the at least two monitoring nodes acquires the state information sent by the at least one service node from the virtual IP address.

In one possible implementation, the obtaining module is configured to:

acquire, by the first monitoring node, heartbeat information sent by a first service node in a second period, wherein the heartbeat information is used for indicating that the service state of the first service node is an online state, the first service node is any one service node in the at least one service node, and the online state represents that service can be provided.

In a possible implementation manner, the updating module is configured to modify, when the first monitoring node does not acquire the heartbeat information sent by the first service node in a next second period, a service state of the first service node from an online state to an offline state, where the offline state indicates that a service cannot be provided.

In a possible implementation manner, the updating module is configured to modify the service state of the first service node from the offline state to the online state when the first monitoring node acquires the reporting information sent by the first service node; or,

the updating module is configured to modify the service state of the first service node from the offline state to the online state when the first monitoring node acquires the heartbeat information sent by the first service node; or,

the updating module is configured to modify the service state of the first service node from the offline state to the online state when the first monitoring node acquires the login request sent by the first service node.

In one possible implementation, the apparatus further includes:

and the sending module is used for sending alarm information to an operation and maintenance node when the first monitoring node determines that the state of the at least one service node meets a preset condition, and the operation and maintenance node is used for processing the alarm information.

In a possible implementation manner, the sending module is configured to send offline alarm information of a second service node to the operation and maintenance node when the first monitoring node determines that the service state of the second service node is switched from the online state to the offline state, where the second service node is any one of the at least one service node; or,

the sending module is configured to send online alarm information of the second service node to the operation and maintenance node when the first monitoring node determines that the service state of the second service node is switched from the offline state to the online state; or,

the sending module is configured to send state alarm information of the second service node to the operation and maintenance node when the first monitoring node determines that a state value of any one running state of the second service node meets a state alarm condition; or,

the sending module is configured to send cluster capacity alarm information to the operation and maintenance node when the first monitoring node determines that the percentage of the overall remaining storage capacity of the at least one service node in the overall total storage capacity satisfies a capacity alarm condition.

In one possible implementation, the apparatus further includes:

the adding module is used for monitoring the state of the service node corresponding to the login request when the first monitoring node receives the login request, and adding the service node corresponding to the login request into a cluster corresponding to the at least one service node;

and the deleting module is used for stopping monitoring the state of the service node corresponding to the logout request when the first monitoring node receives the logout request, and deleting the service node corresponding to the logout request from the cluster corresponding to the at least one service node.

In a possible implementation manner, the deleting module is configured to delete, by the first monitoring node, the service state and the running state of the service node corresponding to the logout request from a state record table.

In a third aspect, a node status monitoring system is provided, wherein the system includes at least two monitoring nodes and at least one service node, the at least two monitoring nodes include a first monitoring node and a second monitoring node,

the first monitoring node is used for determining that the second monitoring node fails when the heartbeat information sent by the second monitoring node is not received in a first period, and the second monitoring node is a node for monitoring the state of at least one service node in the at least two monitoring nodes;

the first monitoring node is further configured to obtain a status of the at least one service node;

the first monitoring node is further configured to update the state of the at least one service node in a state record table, where the state record table is recorded in a shared database corresponding to the at least two monitoring nodes.

In a fourth aspect, a computer device is provided, comprising a processor and a memory; the memory is used for storing at least one instruction; the processor is configured to execute at least one instruction stored in the memory to implement the method steps of any implementation manner of the first aspect.

In a fifth aspect, a computer-readable storage medium is provided, having at least one instruction stored therein, which when executed by a processor, implements the method steps of any of the implementations of the first aspect.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

By configuring at least two monitoring nodes, if the monitoring node currently providing the state monitoring service fails, another of the at least two monitoring nodes takes over the monitoring service in time and monitors the state of the at least one service node, thereby avoiding the problem that the monitoring service cannot be provided because of a single point of failure of a single monitoring node, and improving the stability of state monitoring.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic diagram of a node status monitoring system according to an embodiment of the present invention;

Fig. 2 is a flowchart of a node status monitoring method according to an embodiment of the present invention;

Fig. 3 is a flowchart of a node status monitoring method according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a node status monitoring apparatus according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a node status monitoring apparatus according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a node status monitoring apparatus according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of a node status monitoring system according to an embodiment of the present invention, where the node status monitoring system may include at least two monitoring nodes (e.g., a first monitoring node and a second monitoring node) and at least one service node (e.g., service node 1, service node 2, ..., service node N), and may further include a client node and an operation and maintenance node.

The at least two monitoring nodes are used for monitoring the state of the at least one service node and for providing external services to the client node, such as a service for querying the state of one or more service nodes. If there are only two monitoring nodes, namely the first monitoring node and the second monitoring node, the two monitoring nodes may form a monitoring node group; if there are more than two monitoring nodes, the monitoring nodes may form a monitoring node cluster. The at least two monitoring nodes synchronize information through a shared database and provide one virtual IP (Internet Protocol) address to the outside. Only one monitoring node provides service to the outside at any given time; if that monitoring node goes down, another monitoring node takes over the service immediately, that is, a monitoring node that is not down replaces the monitoring node that is currently down in providing service to the outside.

The at least one service node is used for providing data storage and retrieval services and may also provide data distribution services. The client node is used for obtaining the state of one or more service nodes by sending a query request to the monitoring node. The operation and maintenance node is used for receiving the alarm information sent by the monitoring node and providing the alarm information to operation and maintenance personnel, so that the personnel can handle it.

It should be noted that the monitoring node, the service node, the operation and maintenance node, and the client node may each correspond to a separate computer device, or may correspond to the same computer device, in which case each node may be a virtual machine running on that computer device. The embodiment of the present invention does not limit the physical implementation of each node, as long as the functions of the nodes can be implemented.

Fig. 2 is a flowchart of a node status monitoring method according to an embodiment of the present invention. The method is applied to a first monitoring node of at least two monitoring nodes, see fig. 2, and comprises:

201. When the first monitoring node does not receive heartbeat information sent by a second monitoring node in a first period, the first monitoring node determines that the second monitoring node fails, wherein the second monitoring node is the node, among the at least two monitoring nodes, that monitors the state of at least one service node.

202. The first monitoring node obtains the state of the at least one service node.

203. And the first monitoring node updates the state of the at least one service node in a state record table, wherein the state record table is recorded in a shared database corresponding to the at least two monitoring nodes.

In the method provided by the embodiment of the invention, at least two monitoring nodes are configured, so that if the monitoring node currently providing the state monitoring service fails, another of the at least two monitoring nodes takes over the monitoring service in time and monitors the state of the at least one service node. This avoids the problem that the monitoring service cannot be provided because of a single point of failure of a single monitoring node, and improves the stability of state monitoring.
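Steps 201 to 203 can be sketched as a standby monitor that takes over after a missed heartbeat. This is a minimal sketch under stated assumptions: `StandbyMonitor` and its methods are hypothetical names, a plain dictionary stands in for the state record table in the shared database, and timestamps are passed in explicitly rather than read from a clock.

```python
class StandbyMonitor:
    """Hypothetical first monitoring node watching the active (second) one."""

    def __init__(self, first_period, shared_table, now=0.0):
        self.first_period = first_period   # heartbeat period between monitors
        self.shared_table = shared_table   # stand-in for the shared database
        self.last_peer_heartbeat = now
        self.active = False

    def on_peer_heartbeat(self, now):
        self.last_peer_heartbeat = now

    def tick(self, now):
        # Step 201: no heartbeat within the first period means the peer failed.
        if not self.active and now - self.last_peer_heartbeat > self.first_period:
            self.active = True
        return self.active

    def on_service_state(self, node_id, state):
        # Steps 202 and 203: once active, acquire states and update the table.
        if self.active:
            self.shared_table[node_id] = state


shared_table = {"node-1": "online"}
monitor = StandbyMonitor(first_period=1.0, shared_table=shared_table)
monitor.on_peer_heartbeat(now=0.5)
took_over_early = monitor.tick(now=1.0)   # 0.5 s since the last heartbeat
took_over_late = monitor.tick(now=2.0)    # 1.5 s since the last heartbeat
monitor.on_service_state("node-1", "offline")
```

Because the table is shared, the states written by the previously active monitor remain available to the node that takes over.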

In one possible implementation, the obtaining, by the first monitoring node, the status of the at least one service node includes:

the first monitoring node obtains reporting information sent by the first service node in a third period, wherein the reporting information is used for indicating each running state of the first service node, and the third period is larger than the second period in which the first service node sends heartbeat information.

In one possible implementation, the method further comprises:

and when the first monitoring node does not acquire the heartbeat information sent by the first service node in the next second period, modifying the service state of the first service node from the online state to the offline state, wherein the offline state represents that the service cannot be provided.

when the first monitoring node acquires the report information sent by the first service node, the service state of the first service node is modified from the offline state to the online state; or,

when the first monitoring node acquires the heartbeat information sent by the first service node, the service state of the first service node is modified from the offline state to the online state.

In one possible implementation, the method further comprises:

and when the first monitoring node determines that the state of the at least one service node meets the preset condition, sending alarm information to an operation and maintenance node, wherein the operation and maintenance node is used for processing the alarm information.

In a possible implementation manner, when the first monitoring node determines that the state of the at least one service node satisfies a preset condition, sending an alarm message to an operation and maintenance node includes:

when the first monitoring node determines that the service state of a second service node is switched from the online state to the offline state, sending offline alarm information of the second service node to the operation and maintenance node, wherein the second service node is any one of the at least one service node; or,

when the first monitoring node determines that the state value of any one running state of the second service node meets the state alarm condition, sending the state alarm information of the second service node to the operation and maintenance node.

In one possible implementation, the method further comprises:

and when the first monitoring node receives the logout request, stopping monitoring the state of the service node corresponding to the logout request, and deleting the service node corresponding to the logout request from the cluster corresponding to the at least one service node.

In one possible implementation manner, the stopping monitoring the state of the service node corresponding to the logout request includes:

and the first monitoring node deletes the service state and the running state of the service node corresponding to the logout request from the state record table.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present invention, and are not described in detail herein.

Fig. 3 is a flowchart of a node status monitoring method according to an embodiment of the present invention. Referring to fig. 3, the method includes:

301. When the first monitoring node does not receive heartbeat information sent by a second monitoring node in a first period, the first monitoring node determines that the second monitoring node fails, wherein the second monitoring node is the node, among the at least two monitoring nodes, that monitors the state of at least one service node.

In this embodiment of the present invention, for any service node of the at least one service node, for example, a first service node, when the first service node is started, the first service node may read a configuration item that records the virtual IP address and port information (e.g., a port number) of the at least two monitoring nodes. The first service node may then automatically log in to the monitoring node currently providing services to the outside according to the virtual IP address and the port information, where the services provided to the outside by the monitoring node include performing state monitoring on at least one service node.

Taking the case where the monitoring node currently in the service state is the second monitoring node as an example, the first service node may send a login request to the second monitoring node according to the virtual IP address, where the login request may carry an ID (identifier) of the service node, the virtual IP address of the monitoring node, and port information. The second monitoring node may record the ID of the first service node when receiving the login request. The ID of the first service node uniquely identifies the service node and serves as the identification information by which a monitoring node distinguishes different service nodes; it may be a hardware ID of the server where the first service node is located, or may be the IP address and port information of the first service node. For example, the second monitoring node may establish a state record table for recording the state of at least one service node, where the primary key of the state record table is the ID of the service node, that is, the ID of each service node uniquely identifies the state of that service node in the state record table. The state record table may be recorded in a shared database corresponding to the at least two monitoring nodes, and the at least two monitoring nodes may share the state record table through the shared database.

In one possible implementation, the state of the at least one service node includes a service state indicating whether the at least one service node can provide a service and a running state indicating the running condition of the at least one service node. The service state may include an online state and an offline state; the online state represents that the service is available, and the offline state represents that the service is unavailable. The running state may include at least one of a CPU occupancy percentage, a memory occupancy percentage, a network IO (Input/Output) percentage, a total storage capacity of the node, and a remaining storage capacity of the node. The CPU occupancy percentage is the percentage of CPU in use relative to total CPU, the memory occupancy percentage is the percentage of memory in use relative to total memory, and the network IO percentage is the percentage of outbound traffic relative to inbound traffic. By monitoring the service state of the service node, the online condition of the service node can be known, so that whether the service node can provide storage service to the outside is determined; by monitoring the running state of the service node, its running condition can be known, which helps ensure that the service node runs well.
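The running-state percentages defined above amount to simple ratios. The following sketch, with a hypothetical `RunningState` class and illustrative field names and values, shows one way they could be computed.

```python
from dataclasses import dataclass


def percent(part, whole):
    """Return part as a percentage of whole (0.0 when whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


@dataclass
class RunningState:
    """Hypothetical container for one service node's running state."""
    cpu_used: float
    cpu_total: float
    mem_used: float
    mem_total: float
    net_out: float           # outbound traffic
    net_in: float            # inbound traffic
    total_capacity: int
    remaining_capacity: int

    @property
    def cpu_percent(self):
        return percent(self.cpu_used, self.cpu_total)

    @property
    def memory_percent(self):
        return percent(self.mem_used, self.mem_total)

    @property
    def network_io_percent(self):
        # Outbound traffic relative to inbound traffic, as defined in the text.
        return percent(self.net_out, self.net_in)


rs = RunningState(cpu_used=2, cpu_total=8, mem_used=4, mem_total=16,
                  net_out=30, net_in=60,
                  total_capacity=1000, remaining_capacity=250)
```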

Correspondingly, in addition to recording the ID of the first service node in the state record table, the second monitoring node may record the service state of the first service node as the online state at the position corresponding to that ID in the state record table.
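A state record table keyed by the service node ID might look like the sketch below. An in-memory SQLite database stands in for the shared database, and the table name `state_record`, its columns, and the helper `record_login` are assumptions made for illustration only.

```python
import sqlite3

# In-memory database used here as a stand-in for the shared database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE state_record (
        node_id        TEXT PRIMARY KEY,   -- ID of the service node
        service_state  TEXT NOT NULL,      -- 'online' or 'offline'
        cpu_percent    REAL,
        memory_percent REAL
    )
    """
)


def record_login(conn, node_id):
    """Record the node ID and set its service state to online."""
    conn.execute(
        "INSERT OR IGNORE INTO state_record (node_id, service_state) "
        "VALUES (?, 'online')",
        (node_id,),
    )
    conn.execute(
        "UPDATE state_record SET service_state = 'online' WHERE node_id = ?",
        (node_id,),
    )
    conn.commit()


record_login(conn, "node-1")
record_login(conn, "node-1")  # a repeated login keeps a single row
row_count = conn.execute("SELECT COUNT(*) FROM state_record").fetchone()[0]
state = conn.execute(
    "SELECT service_state FROM state_record WHERE node_id = ?", ("node-1",)
).fetchone()[0]
```

Using the node ID as the primary key ensures that each service node has exactly one row, whichever monitoring node writes it.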

It should be noted that, after the first service node logs in to the second monitoring node, that is, after the second monitoring node receives the login request of the first service node, the second monitoring node may perform a cluster joining operation on the first service node and add the first service node to the cluster corresponding to the second monitoring node. In other words, the storage capacity of the first service node is counted into the total storage capacity of the nodes of the cluster, so that the storage capacity of the first service node can be used by the cluster to provide a storage service to the outside.

In the embodiment of the present invention, while the second monitoring node is monitoring the state of the at least one service node, the second monitoring node sends heartbeat information to the first monitoring node, so that the first monitoring node knows from the heartbeat information that the second monitoring node is in an online state, that is, the second monitoring node is providing a service to the outside, for example, monitoring the state of the at least one service node. The period at which the second monitoring node sends the heartbeat information to the first monitoring node is the first period. If the second monitoring node fails, for example goes down, it will no longer send heartbeat information to the first monitoring node, so the first monitoring node will not receive the heartbeat information of the second monitoring node. Therefore, if the first monitoring node does not receive the heartbeat information sent by the second monitoring node within a first period, the first monitoring node can determine that the second monitoring node has failed. The heartbeat information may be a set of binary information with a preset reading rule.
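As one illustration of "binary information with a preset reading rule", a heartbeat could be packed into a fixed binary layout. The layout below (a 4-byte marker, a node number, a sequence number and a millisecond timestamp, in network byte order) is purely an assumption for this sketch; the text does not specify any format.

```python
import struct

# Hypothetical layout: 4-byte marker, 4-byte node number,
# 8-byte sequence number, 8-byte timestamp in ms, network byte order.
HEARTBEAT_FORMAT = "!4sIQQ"
MAGIC = b"HBT1"


def encode_heartbeat(node_no, seq, timestamp_ms):
    """Pack one heartbeat message into bytes."""
    return struct.pack(HEARTBEAT_FORMAT, MAGIC, node_no, seq, timestamp_ms)


def decode_heartbeat(payload):
    """Unpack a heartbeat message, rejecting payloads with a wrong marker."""
    magic, node_no, seq, timestamp_ms = struct.unpack(HEARTBEAT_FORMAT, payload)
    if magic != MAGIC:
        raise ValueError("not a heartbeat message")
    return node_no, seq, timestamp_ms


packet = encode_heartbeat(node_no=2, seq=7, timestamp_ms=1000)
decoded = decode_heartbeat(packet)
```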

It should be noted that the first monitoring node may be any monitoring node of the at least two monitoring nodes except the second monitoring node. If only two monitoring nodes are configured in the system, the first monitoring node is the other monitoring node. If more than two monitoring nodes are configured in the system, the first monitoring node may be a monitoring node selected by the system according to a preset takeover policy. For example, the preset takeover policy may be random selection, performance-first selection, or selection according to a preset order, which is not limited in the embodiment of the present invention.
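The three example takeover policies can be sketched as a selection function over the surviving monitoring nodes. The function name, the policy labels, and the candidate fields `id`, `order` and `score` are hypothetical; in particular, how "performance" is measured is not defined in the text.

```python
import random


def select_takeover_node(candidates, policy, rng=None):
    """Pick the monitoring node that takes over (hypothetical helper).

    candidates: list of dicts with 'id', 'order' and 'score' keys.
    """
    if not candidates:
        raise ValueError("no surviving monitoring node")
    if policy == "random":
        return (rng or random).choice(candidates)["id"]
    if policy == "performance_first":
        # Assumed: a higher score means better performance.
        return max(candidates, key=lambda c: c["score"])["id"]
    if policy == "preset_order":
        # Assumed: a lower order value means earlier in the preset order.
        return min(candidates, key=lambda c: c["order"])["id"]
    raise ValueError(f"unknown policy: {policy}")


survivors = [
    {"id": "m1", "order": 1, "score": 0.4},
    {"id": "m3", "order": 3, "score": 0.9},
]
by_performance = select_takeover_node(survivors, "performance_first")
by_order = select_takeover_node(survivors, "preset_order")
by_random = select_takeover_node(survivors, "random", rng=random.Random(0))
```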

302. The first monitoring node obtains a status of the at least one service node.

In the embodiment of the present invention, at least two monitoring nodes including the first monitoring node and the second monitoring node externally provide the same virtual IP address, so that a node that is monitoring the state of at least one service node in the at least two monitoring nodes obtains the state information sent by the at least one service node from the virtual IP address. For example, when the second monitoring node monitors the state of the at least one service node, the second monitoring node may acquire the state of the at least one service node, and when the second monitoring node is down, the first monitoring node monitors the state of the at least one service node, and accordingly, the first monitoring node may acquire the state of the at least one service node from the virtual IP address.

In one possible implementation, the obtaining, by the first monitoring node, the state of the at least one service node includes: the first monitoring node acquires heartbeat information sent by a first service node in a second period, wherein the heartbeat information is used for indicating that the service state of the first service node is an online state, and the first service node is any one of the at least one service node; the first monitoring node acquires reporting information sent by the first service node in a third period, wherein the reporting information is used for indicating each operating state of the first service node. The operating state of a service node generally does not change much within a short time, so reporting it too frequently only increases the load on the monitoring node. A change in the service state of a service node, however, greatly affects the capability of the whole cluster to provide storage service to the outside. The service state and the operating state are therefore reported in separate periods.

Each of the at least one service node is configured with two periods: one for sending heartbeat information, namely the second period, and one for sending reporting information, namely the third period. The second period may be set according to how promptly the service state of the service node needs to be updated: the greater the need for timely updates, the shorter the second period. The second period may generally be set to 2-5 s. The second periods of different service nodes may be the same, and both the service node side and the monitoring node side may store the second period. The third period may be set according to how frequently the operating state of the service node needs to be updated and according to the load capacity of the monitoring node: the more frequently the operating state needs to be updated and the greater the load capacity of the monitoring node, the shorter the third period. The third period is generally set to between 10 s and 1 min. The third periods of different service nodes may be different, and each service node may send the reporting information to the monitoring node using its own third period, thereby reporting its operating state.
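To illustrate how the two periods interact on the service node side, the following sketch decides which messages are due at each whole-second tick. The function name `due_messages` and the tick-based scheduling are assumptions; the 5 s and 30 s values are merely picked from the ranges given above.

```python
from typing import List


def due_messages(elapsed_s: int, second_period_s: int, third_period_s: int) -> List[str]:
    """Return the messages a service node should send at this whole-second tick."""
    due = []
    if elapsed_s % second_period_s == 0:
        due.append("heartbeat")   # lightweight liveness signal
    if elapsed_s % third_period_s == 0:
        due.append("report")      # heavier operating-state report
    return due


# One minute with a 5 s heartbeat period and a 30 s report period.
sent = [m for t in range(1, 61) for m in due_messages(t, 5, 30)]
print(sent.count("heartbeat"), sent.count("report"))  # 12 2
```

With these values the monitoring node handles six times as many heartbeats as reports, which reflects the intent of keeping liveness detection prompt while limiting report traffic.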

Correspondingly, each service node that has successfully logged in, such as the first service node, may adopt the second period as the period for sending heartbeat information, and send heartbeat information to the first monitoring node each time a second period elapses. For example, the first service node may send the heartbeat information to the virtual IP address externally provided by the monitoring nodes, so that the first monitoring node obtains the heartbeat information from the virtual IP address and thereby knows that the service state of the first service node is the online state. In a possible implementation manner, when the first monitoring node does not acquire the heartbeat information sent by the first service node in the next second period, the service state of the first service node is modified from the online state to the offline state. Because the first service node sends heartbeat information to the first monitoring node once every second period, if the first service node fails, for example goes down, it stops sending heartbeat information and the first monitoring node no longer receives it. Therefore, if the first monitoring node does not receive the heartbeat information sent by the first service node within a second period, the first monitoring node may determine that the first service node has failed and promptly modify the service state of the first service node in the state record table to the offline state. When the first monitoring node later learns that the first service node has recovered, it modifies the service state of the first service node in the state record table back to the online state.

For example, when the first monitoring node acquires the report information sent by the first service node, the service state of the first service node is modified from the non-online state to the online state; or, when the first monitoring node acquires the heartbeat information sent by the first service node, modifying the service state of the first service node from the non-online state to the online state; or, when the first monitoring node acquires the login request sent by the first service node, the service state of the first service node is modified from the non-online state to the online state. By reporting the heartbeat information, the monitoring node can know and update the service state of the service node, and the timeliness of updating the service state is ensured.
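The online/offline transitions described above can be sketched as a small tracker. This is an illustrative sketch only: the class `ServiceStateTracker`, the `sweep` method, and the 3-second period are assumptions, and the two-state model ignores any other non-online states a real system might have.

```python
ONLINE, OFFLINE = "online", "offline"


class ServiceStateTracker:
    """Keeps the service state of each logged-in service node."""

    def __init__(self, second_period: float):
        self.second_period = second_period  # heartbeat period in seconds
        self.last_seen = {}                 # node_id -> time of last message
        self.state = {}                     # node_id -> ONLINE or OFFLINE

    def on_message(self, node_id: str, now: float) -> None:
        # A heartbeat, a report, or a login request each shows the node is alive.
        self.last_seen[node_id] = now
        self.state[node_id] = ONLINE

    def sweep(self, now: float) -> None:
        # Nodes silent for longer than one second period are marked offline.
        for node_id, seen_at in self.last_seen.items():
            if now - seen_at > self.second_period:
                self.state[node_id] = OFFLINE


tracker = ServiceStateTracker(second_period=3.0)
tracker.on_message("s1", now=0.0)
tracker.sweep(now=5.0)
print(tracker.state["s1"])  # offline
tracker.on_message("s1", now=6.0)
print(tracker.state["s1"])  # online
```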

In addition to reporting heartbeat information, the first service node may also use a third period as the period for sending the reporting information, and send the reporting information to the first monitoring node each time a third period elapses. For example, the first service node may send the reporting information to the virtual IP address externally provided by the monitoring nodes, so that the first monitoring node may acquire the reporting information from the virtual IP address, thereby obtaining the operating state of the first service node, and promptly update the operating state of the first service node in the state record table according to the acquired reporting information. The reporting information may be generated by the first service node according to its current operating state, and the specific manner of generating the reporting information is not limited in the embodiment of the present invention.

By reporting the operating state, the service node lets the monitoring node know its current operating condition, and the added heartbeat link lets the first monitoring node promptly sense whether the service node is online. When the service node fails and goes down, the first monitoring node can quickly determine the service state of the service node, while the reporting period of the operating state is not affected.

303. And the first monitoring node updates the state of the at least one service node in a state record table, wherein the state record table is recorded in a shared database corresponding to the at least two monitoring nodes.

In the embodiment of the present invention, when the first monitoring node acquires the state of any service node in the at least one service node, the state record table may be updated according to the acquired state of the service node. For example, when the first monitoring node acquires the heartbeat information of the first service node, the service state of the first service node in the state record table may be confirmed, and if the service state is an online state, the service state may not be modified, and if the service state is an offline state, the service state of the first service node may be modified from the offline state to the online state. When the first monitoring node acquires the reported information of the first service node, the running states of the first service node in the state record table can be updated according to the acquired reported information.
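A minimal sketch of the two update paths follows, using an in-memory SQLite database as a stand-in for the shared database. The table name `state_record`, its columns, and the helper names `on_heartbeat` and `on_report` are assumptions; the text does not specify a schema or a database product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the shared database
conn.execute(
    "CREATE TABLE state_record (node_id TEXT PRIMARY KEY, "
    "service_state TEXT NOT NULL, cpu_pct REAL, mem_pct REAL)"
)
# Assume s1 logged in earlier and was later marked offline.
conn.execute("INSERT INTO state_record VALUES ('s1', 'offline', NULL, NULL)")


def on_heartbeat(node_id: str) -> None:
    # Only rows that are not already online are touched.
    conn.execute(
        "UPDATE state_record SET service_state = 'online' "
        "WHERE node_id = ? AND service_state != 'online'",
        (node_id,),
    )


def on_report(node_id: str, cpu_pct: float, mem_pct: float) -> None:
    # Reporting information overwrites the recorded operating states.
    conn.execute(
        "UPDATE state_record SET cpu_pct = ?, mem_pct = ? WHERE node_id = ?",
        (cpu_pct, mem_pct, node_id),
    )


on_heartbeat("s1")
on_report("s1", 41.5, 63.0)
print(conn.execute("SELECT * FROM state_record").fetchall())
# [('s1', 'online', 41.5, 63.0)]
```

Because the table lives in a database shared by the monitoring nodes, whichever node takes over sees the states written by its predecessor.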

The above steps 301 to 303 are processes in which the first monitoring node takes over the monitoring service to the at least one service node in time when the second monitoring node fails, and this high availability mode of the monitoring node avoids a problem that the system cannot monitor the state of the at least one service node when the monitoring node fails at a single point, thereby improving the stability of state monitoring.

It should be noted that the first monitoring node may also provide an external query service. For example, a client node may send a query request to the first monitoring node, where the query request is used to query the state of one or more service nodes. When the first monitoring node receives the query request, it may retrieve the state of the one or more service nodes from the state record table and feed it back to the client node in a query response.
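As a rough sketch of this query service, the state record table can be modeled as a dictionary keyed by node ID. The function name `handle_query` and the decision to silently omit unknown node IDs from the response are assumptions.

```python
from typing import Dict, Iterable


def handle_query(state_record: Dict[str, dict], node_ids: Iterable[str]) -> Dict[str, dict]:
    """Build a query response holding the recorded state of each requested node."""
    # Unknown node IDs are left out of the response (an assumed behavior).
    return {nid: state_record[nid] for nid in node_ids if nid in state_record}


table = {
    "s1": {"service_state": "online", "cpu_pct": 41.5},
    "s2": {"service_state": "offline", "cpu_pct": None},
}
print(handle_query(table, ["s2", "s9"]))
# {'s2': {'service_state': 'offline', 'cpu_pct': None}}
```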

304. And when the first monitoring node determines that the state of the at least one service node meets the preset condition, sending alarm information to an operation and maintenance node, wherein the operation and maintenance node is used for processing the alarm information.

In the embodiment of the invention, the cluster where at least one service node and at least two monitoring nodes are located can be also configured with an operation and maintenance node, and the operation and maintenance node acquires the alarm information of the service node from the monitoring node and provides the alarm information to operation and maintenance personnel, so that the operation and maintenance personnel can master the state of the service node in time, and whether the service node needs to be overhauled or not is determined. Accordingly, the first monitoring node may send an alarm to the operation and maintenance node when the state of any service node in the at least one service node satisfies a preset condition.

In a possible implementation manner, taking the second service node as an example, the second service node is any one of the at least one service node, and the sending, by the first monitoring node, the alarm information to the operation and maintenance node may include the following several conditions:

In the first situation, when the first monitoring node determines that the service state of the second service node is switched from the online state to the offline state, the offline alarm information of the second service node is sent to the operation and maintenance node. For example, the offline alarm information may include the ID of the second service node and the current service state (offline state) of the second service node.

In the second situation, when the first monitoring node determines that the service state of the second service node is switched from the non-online state to the online state, the online alarm information of the second service node is sent to the operation and maintenance node. For example, the online alarm information may include the ID of the second service node and the current service state (online state) of the second service node.

The above two situations are alarms on the service state of the second service node. By alarming the operation and maintenance node whenever the service state of the second service node changes, the operation and maintenance personnel can know whether the second service node is currently in an online state or an offline state.
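The first two situations can be sketched as a function that emits an alarm payload only when the service state changes. The payload keys and the function name `state_change_alarm` are hypothetical; the text only says the alarm may carry the node ID and the current service state.

```python
from typing import Optional


def state_change_alarm(node_id: str, previous: str, current: str) -> Optional[dict]:
    """Return an alarm payload when the service state flips, else None."""
    if previous == current:
        return None  # no change, so nothing to report
    kind = "offline_alarm" if current == "offline" else "online_alarm"
    return {"type": kind, "node_id": node_id, "service_state": current}


print(state_change_alarm("s2", "online", "offline"))
# {'type': 'offline_alarm', 'node_id': 's2', 'service_state': 'offline'}
print(state_change_alarm("s2", "online", "online"))
# None
```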

In the third situation, when the first monitoring node determines that the state value of any one operating state of the second service node meets the state alarm condition, the state alarm information of the second service node is sent to the operation and maintenance node.

In one possible implementation manner, the first monitoring node may compare, in each detection period, the state value of each operating state of the second service node with a threshold list, where the threshold list includes a threshold corresponding to each operating state. The threshold list includes at least one of a CPU (Central Processing Unit) occupation percentage threshold, a memory occupation percentage threshold, a network IO percentage threshold, and a threshold for the percentage of the node remaining storage capacity in the node total storage capacity. The CPU occupation percentage threshold, the memory occupation percentage threshold, and the network IO percentage threshold are maximum allowed values, that is, the CPU occupation percentage, the memory occupation percentage, and the network IO percentage of the second service node must not be greater than their corresponding thresholds. When the CPU occupation percentage of the second service node is greater than the CPU occupation percentage threshold, or the memory occupation percentage is greater than the memory occupation percentage threshold, or the network IO percentage is greater than the network IO percentage threshold, the state of the second service node is considered to meet the state alarm condition.

The threshold for the percentage of the node remaining storage capacity in the node total storage capacity is a minimum allowed value, that is, the percentage of the node remaining storage capacity of the second service node in the node total storage capacity must not be smaller than the corresponding threshold. When the percentage of the node remaining storage capacity of the second service node in the node total storage capacity is smaller than the corresponding threshold, the first monitoring node may determine that the state of the second service node meets the state alarm condition.

For example, in a detection period, the node remaining storage capacity of the second service node is 19 TB and the node total storage capacity of the second service node is 100 TB, so the percentage of the node remaining storage capacity in the node total storage capacity is 19%. If the threshold for this percentage in the threshold list is 20%, the state of the second service node meets the preset condition, and the first monitoring node may send alarm information to the operation and maintenance node indicating that the storage capacity of the second service node is insufficient.

By setting a detection period, various running states of the service node are regularly compared with corresponding thresholds in the threshold list, so that an alarm can be given in time when the running states are abnormal.
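The threshold comparison can be sketched as below, reusing the 19 TB of 100 TB example. Only the 20% capacity threshold comes from the text; the CPU, memory, and network IO limits, the key names, and the function `violated_states` are assumed values for illustration.

```python
from typing import Dict, List

# Maximum allowed values (assumed numbers).
MAX_THRESHOLDS = {"cpu_pct": 90.0, "mem_pct": 90.0, "net_io_pct": 80.0}
# Minimum allowed values (the 20% figure is the one used in the example above).
MIN_THRESHOLDS = {"remaining_capacity_pct": 20.0}


def violated_states(values: Dict[str, float]) -> List[str]:
    """Return the names of operating states that meet the state alarm condition."""
    alarms = []
    for name, limit in MAX_THRESHOLDS.items():
        if name in values and values[name] > limit:
            alarms.append(name)
    for name, limit in MIN_THRESHOLDS.items():
        if name in values and values[name] < limit:
            alarms.append(name)
    return alarms


remaining_tb, total_tb = 19, 100
values = {
    "cpu_pct": 35.0,
    "mem_pct": 50.0,
    "net_io_pct": 10.0,
    "remaining_capacity_pct": 100.0 * remaining_tb / total_tb,
}
print(violated_states(values))  # ['remaining_capacity_pct']
```

Keeping maximum-type and minimum-type thresholds in separate tables makes the direction of each comparison explicit.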

In the fourth situation, when the first monitoring node determines that the percentage of the overall remaining storage capacity of the at least one service node in the overall total storage capacity meets the capacity alarm condition, cluster capacity alarm information is sent to the operation and maintenance node. The overall remaining storage capacity refers to the sum of the remaining storage capacities of the at least one service node, and the overall total storage capacity refers to the sum of the total storage capacities of the at least one service node.

The threshold list may also include a threshold for the percentage of the overall remaining storage capacity in the overall total storage capacity, which is a minimum allowed value. Correspondingly, the capacity alarm condition means that this percentage is smaller than the threshold. The first monitoring node may calculate the overall remaining storage capacity and the overall total storage capacity of the whole cluster from the node remaining storage capacity and the node total storage capacity of each service node that has joined the cluster, and then calculate the percentage of the overall remaining storage capacity in the overall total storage capacity. When the percentage is smaller than the threshold, the first monitoring node may determine that the capacity alarm condition is met and send the cluster capacity alarm information. By alarming when the remaining storage capacity of the whole cluster is insufficient, the operation and maintenance personnel learn in time that the storage capacity the cluster can provide to the outside is insufficient, and can take corresponding measures to solve the problem.
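The cluster-level check can be sketched as follows. The node capacities, the 20% threshold, and the function name `cluster_capacity_alarm` are assumed for illustration; the text does not give numeric values for this situation.

```python
from typing import Dict, List


def cluster_capacity_alarm(nodes: List[Dict[str, float]], min_remaining_pct: float) -> bool:
    """True when overall remaining capacity falls below the allowed minimum percentage."""
    total = sum(n["total_tb"] for n in nodes)
    remaining = sum(n["remaining_tb"] for n in nodes)
    if total == 0:
        return False  # an empty cluster has no capacity to evaluate
    return 100.0 * remaining / total < min_remaining_pct


nodes = [
    {"total_tb": 100, "remaining_tb": 19},
    {"total_tb": 100, "remaining_tb": 30},
    {"total_tb": 200, "remaining_tb": 20},
]
# Overall: 69 TB remaining of 400 TB, i.e. 17.25%.
print(cluster_capacity_alarm(nodes, min_remaining_pct=20.0))  # True
```

Note that the second node alone is above the threshold; the alarm reflects the cluster as a whole rather than any single node.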

When the state of the service node meets any one of the conditions, the alarm is given to the operation and maintenance node, so that operation and maintenance personnel can know the state information of each service node in time, and can take corresponding measures to process the information if necessary.

It should be noted that step 304 is an optional step, that is, an embodiment of the present invention may include only steps 301 to 303. In steps 301 to 304, when the second monitoring node fails, the first monitoring node monitors the state of the at least one service node and alarms at an appropriate time. The scheme uses a preset threshold list to monitor the various state information of the service nodes of the cluster, alarms when any state value crosses its threshold, and also alarms when the service state of a service node changes. This finer alarm granularity makes the monitoring mechanism of the monitoring node more reliable, so that the state of the service nodes is better monitored.

In the foregoing process, the cluster only includes the at least one service node, and actually, in order to facilitate the expansion and contraction of the cluster, a new service node may be accessed into the cluster, or an existing service node may be deleted. For example, when a new service node wants to access the cluster, the service node may perform a login operation, for example, the service node may send a login request to the first monitoring node, which is the same as the first service node sending the login request in step 301.

In a possible implementation manner, when the first monitoring node receives a login request, the first monitoring node performs state monitoring on a service node corresponding to the login request, and adds the service node corresponding to the login request to a cluster corresponding to the at least one service node. After logging in to the first monitoring node, the service node may send heartbeat information, report information, and the like to the first monitoring node, so that the first monitoring node may obtain the state of the service node and update the state record table, and the specific process is the same as the above-described steps 302 to 303, which is not described herein again.

It should be noted that, after receiving the login request of the service node, the first monitoring node may perform the operation of joining the cluster to the service node, and the process is the same as that performed by the second monitoring node to join the cluster to the first service node in step 301, and details are not described here.

Of course, when a service node in the cluster wants to exit the cluster, the service node may perform a logout operation. For example, the service node may send a logout request to the first monitoring node, and when the first monitoring node receives the logout request, it stops monitoring the state of the service node corresponding to the logout request; for example, the first monitoring node may delete the service state and the operating state of that service node from the state record table. When the first monitoring node receives the logout request, it may also perform a cluster logout operation on the service node corresponding to the logout request, that is, delete that service node from the cluster corresponding to the at least one service node. From then on, the storage capacity of that service node is no longer counted in the total storage capacity of the nodes of the cluster.

Of course, the first monitoring node may also actively perform a cluster logout operation on any service node in the at least one service node, that is, remove the service node from the cluster, and delete the state of the service node from the state record table after the cluster logout succeeds. A service node that has exited the cluster may subsequently log in to the first monitoring node again, or log in to a monitoring node of another cluster to provide storage service. The embodiment of the present invention is not limited thereto.
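The login and logout handling described above, including its effect on the cluster's total storage capacity, can be sketched as follows. The `Cluster` class, its attribute names, and the capacities are hypothetical, and the login request is assumed to carry the node's total storage capacity.

```python
class Cluster:
    """Tracks cluster membership and the state record table on a monitoring node."""

    def __init__(self) -> None:
        self.members = {}       # node_id -> node total storage capacity (TB)
        self.state_record = {}  # node_id -> recorded state

    def login(self, node_id: str, total_tb: float) -> None:
        # Join the cluster and start monitoring: capacity now counts toward the total.
        self.members[node_id] = total_tb
        self.state_record[node_id] = {"service_state": "online"}

    def logout(self, node_id: str) -> None:
        # Leave the cluster and stop monitoring: drop capacity and recorded state.
        self.members.pop(node_id, None)
        self.state_record.pop(node_id, None)

    @property
    def total_capacity_tb(self) -> float:
        return sum(self.members.values())


cluster = Cluster()
cluster.login("s1", 100)
cluster.login("s2", 200)
print(cluster.total_capacity_tb)  # 300
cluster.logout("s1")
print(cluster.total_capacity_tb)  # 200
print("s1" in cluster.state_record)  # False
```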

The scheme provided by the embodiment of the invention can be applied to a scene of accelerating cluster access in a cloud storage service, the service node in the scheme can perform login operation and logout operation, and the monitoring node can perform cluster access operation and cluster logout operation on the service node. The method facilitates the capacity expansion and capacity reduction management of the cluster, and provides capacity scalability for the management of the cloud storage cluster. The embodiment of the invention provides a scheme for monitoring and managing service nodes in cloud storage service, which is actually suitable for monitoring and managing the service nodes (storage nodes) in any distributed cluster, and enables all the service nodes in the distributed cluster to acquire the states of other nodes in the current cluster, thereby facilitating the subsequent realization of services such as load balancing, data distribution, data reading and writing and the like.

Fig. 4 is a schematic structural diagram of a node status monitoring apparatus according to an embodiment of the present invention. Referring to fig. 4, the apparatus includes:

a determining module 401, configured to determine that the second monitoring node fails when the first monitoring node does not receive heartbeat information sent by the second monitoring node in the first period, where the second monitoring node is a node that performs state monitoring on at least one service node in the at least two monitoring nodes;

an obtaining module 402, configured to obtain, by the first monitoring node, a state of the at least one service node;

an updating module 403, configured to update, by the first monitoring node, a state of the at least one service node in a state record table, where the state record table is recorded in a shared database corresponding to the at least two monitoring nodes;

the at least two monitoring nodes provide the same virtual IP address to the outside, so that a node in the at least two monitoring nodes, which is monitoring the state, acquires the state information sent by the at least one service node from the virtual IP address.

In one possible implementation, the obtaining module 402 is configured to:

In one possible implementation manner, the updating module 403 is configured to modify the service state of the first service node from an online state to an offline state when the first monitoring node does not acquire the heartbeat information sent by the first service node in a next second period, where the offline state indicates that no service is available.

In a possible implementation manner, the updating module 403 is configured to modify the service state of the first service node from an offline state to an online state when the first monitoring node acquires the reporting information sent by the first service node; or,

the updating module 403 is configured to modify the service state of the first service node from an offline state to an online state when the first monitoring node acquires the heartbeat information sent by the first service node; or,

the updating module 403 is configured to modify the service state of the first service node from an offline state to an online state when the first monitoring node acquires the login request sent by the first service node.

In one possible implementation, referring to fig. 5, the apparatus further includes:

a sending module 404, configured to send alarm information to an operation and maintenance node when the first monitoring node determines that the state of the at least one service node meets a preset condition, where the operation and maintenance node is configured to process the alarm information.

In a possible implementation manner, the sending module 404 is configured to send offline alarm information of a second service node to the operation and maintenance node when the first monitoring node determines that a service state of the second service node is switched from an online state to an offline state, where the second service node is any service node in the at least one service node; or,

the sending module 404 is configured to send online alarm information of the second service node to the operation and maintenance node when the first monitoring node determines that the service state of the second service node is switched from the offline state to the online state; or,

the sending module 404 is configured to send state alarm information of the second service node to the operation and maintenance node when the first monitoring node determines that a state value of any one operating state of the second service node meets a state alarm condition; or,

the sending module 404 is configured to send cluster capacity alarm information to the operation and maintenance node when the first monitoring node determines that the percentage of the overall remaining storage capacity of the at least one service node in the overall total storage capacity meets the capacity alarm condition.

In one possible implementation, referring to fig. 6, the apparatus further includes:

an adding module 405, configured to perform state monitoring on a service node corresponding to the login request when the first monitoring node receives the login request, and add the service node corresponding to the login request to a cluster corresponding to the at least one service node;

a deleting module 406, configured to stop monitoring the state of the service node corresponding to the logout request when the first monitoring node receives the logout request, and delete the service node corresponding to the logout request from the cluster corresponding to the at least one service node.

In one possible implementation manner, the deleting module 406 is configured to delete, by the first monitoring node, the service status and the running status of the service node corresponding to the logout request from the status record table.

According to the device provided by the embodiment of the invention, at least two monitoring nodes are configured, so that if the monitoring node currently providing the state monitoring service fails, another monitoring node of the at least two monitoring nodes takes over the monitoring service in time and monitors the state of the at least one service node. This avoids the problem that the monitoring service cannot be provided because of a single-point failure of a single monitoring node, and improves the stability of state monitoring.

It should be noted that, when the node state monitoring apparatus provided in the foregoing embodiment monitors node states, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the node state monitoring apparatus and the node state monitoring method provided in the above embodiments belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiments and is not described herein again.

Fig. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of the present invention. The computer device 700 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 701 to implement the methods provided by the foregoing method embodiments. The computer device may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing the functions of the device, which are not described herein again.

In an exemplary embodiment, a computer-readable storage medium, such as a memory, storing at least one instruction, which when executed by a processor, implements the node status monitoring method in the above embodiments, is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, or an optical disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A node status monitoring method applied to a first monitoring node of at least two monitoring nodes, the method comprising:

when the first monitoring node does not receive heartbeat information sent by a second monitoring node in a first period, determining that the second monitoring node fails, wherein the second monitoring node is a node which monitors the state of at least one service node in the at least two monitoring nodes, and only one monitoring node monitors the state of the at least one service node at the same time;

the first monitoring node acquires the state of the at least one service node;

when the first monitoring node receives a query request of a client node, acquiring the state of a target service node corresponding to the query request from the state record table, and feeding back the state of the target service node to the client node in a query response mode;

the at least two monitoring nodes externally provide the same virtual Internet Protocol (IP) address, so that the node among the at least two monitoring nodes that is monitoring the state obtains the state information sent by the at least one service node from the virtual IP address.

2. The method of claim 1, wherein the obtaining, by the first monitoring node, the status of the at least one serving node comprises:

3. The method of claim 2, further comprising:

4. The method of claim 3, wherein after modifying the service state of the first service node from an online state to an offline state, the method further comprises:

5. The method of claim 1, further comprising:

6. The method according to claim 5, wherein when the first monitoring node determines that the state of the at least one service node satisfies a preset condition, sending an alarm message to the operation and maintenance node comprises:

7. The method of claim 1, further comprising:

8. A node status monitoring apparatus, applied to a first monitoring node of at least two monitoring nodes, the apparatus comprising:

a determining module, configured to determine that a failure occurs in a second monitoring node when the first monitoring node does not receive heartbeat information sent by the second monitoring node in a first period, where the second monitoring node is a node that performs state monitoring on at least one service node in the at least two monitoring nodes, and only one monitoring node performs state monitoring on the at least one service node at the same time;

an updating module, configured to update, by the first monitoring node, a state of the at least one service node in a state record table, where the state record table is recorded in a shared database corresponding to the at least two monitoring nodes; when the first monitoring node receives a query request of a client node, acquiring the state of a target service node corresponding to the query request from the state record table, and feeding back the state of the target service node to the client node in a query response mode;

9. The apparatus of claim 8, wherein the obtaining module is configured to:

10. The apparatus of claim 9, wherein the updating module is configured to modify a service status of the first service node from an online status to an offline status when the first monitoring node does not acquire heartbeat information sent by the first service node in a next second period, and the offline status indicates that no service is available.
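The online-to-offline transition of claim 10 can be sketched as follows. This is an illustrative assumption rather than the claimed apparatus: the `ServiceNodeRecord` class, its field names, and the reading of "second period" as a simple timeout in seconds since the last heartbeat are all hypothetical.

```python
class ServiceNodeRecord:
    """Hypothetical record for one service node tracked by the monitoring node."""

    def __init__(self, second_period: float, now: float) -> None:
        self.second_period = second_period  # assumed timeout, in seconds
        self.last_heartbeat = now
        self.state = "online"

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the service node."""
        self.last_heartbeat = now

    def evaluate(self, now: float) -> str:
        """Mark the node offline if no heartbeat arrived within the second period."""
        if self.state == "online" and now - self.last_heartbeat > self.second_period:
            self.state = "offline"  # offline: no service is available
        return self.state
```

For example, with `second_period=5.0` and a last heartbeat at time 4.0, `evaluate(8.0)` leaves the record online, while `evaluate(9.5)` switches it to offline.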

11. The apparatus of claim 10,

the updating module is used for modifying the service state of the first service node from an offline state to an online state when the first monitoring node acquires the report information sent by the first service node; or

12. The apparatus of claim 8, further comprising:

13. The apparatus of claim 12,

the sending module is configured to send offline alarm information of a second service node to the operation and maintenance node when the first monitoring node determines that a service state of the second service node is switched from an online state to an offline state, where the second service node is any service node in the at least one service node; or

the sending module is used for sending online alarm information of the second service node to the operation and maintenance node when the first monitoring node determines that the service state of the second service node is switched from an offline state to an online state; or

the sending module is used for sending cluster capacity alarm information to the operation and maintenance node when the first monitoring node determines that the percentage of the whole residual storage capacity of the at least one service node in the whole total storage capacity meets a capacity alarm condition.
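The three alarm triggers of claim 13 can be sketched in Python as below. The function names, the alarm labels, and in particular the capacity alarm condition (remaining capacity percentage falling below a threshold) are assumptions for illustration; the claim itself does not specify the condition.

```python
from typing import Iterable, Optional, Tuple


def transition_alarm(old_state: str, new_state: str) -> Optional[str]:
    """Return the kind of alarm raised by a service state switch, if any."""
    if old_state == "online" and new_state == "offline":
        return "offline_alarm"
    if old_state == "offline" and new_state == "online":
        return "online_alarm"
    return None


def remaining_percent(nodes: Iterable[Tuple[float, float]]) -> float:
    """Overall remaining capacity as a percentage of overall total capacity.

    Each item is (remaining_capacity, total_capacity) for one service node.
    """
    pairs = list(nodes)
    total = sum(t for _, t in pairs)
    if total == 0:
        return 0.0
    return 100.0 * sum(r for r, _ in pairs) / total


def capacity_alarm(nodes: Iterable[Tuple[float, float]],
                   threshold_percent: float) -> bool:
    """Assumed condition: alarm when the remaining percentage is below the threshold."""
    return remaining_percent(nodes) < threshold_percent
```

Under these assumptions, two nodes with 10 and 5 units remaining out of 100 each give a remaining percentage of 7.5, which triggers the cluster capacity alarm at a 10 percent threshold.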

14. The apparatus of claim 8, further comprising:

the adding module is used for monitoring the state of the service node corresponding to the login request when the first monitoring node receives the login request, and adding the service node corresponding to the login request into the cluster corresponding to the at least one service node;

15. A node status monitoring system, characterized in that the system comprises at least two monitoring nodes and at least one service node, the at least two monitoring nodes comprise a first monitoring node and a second monitoring node,

the first monitoring node is used for determining that the second monitoring node fails when heartbeat information sent by the second monitoring node is not received in a first period, the second monitoring node is a node which monitors the state of at least one service node in the at least two monitoring nodes, and only one monitoring node monitors the state of the at least one service node at the same time;

the first monitoring node is further configured to update the state of the at least one service node in a state record table, where the state record table is recorded in a shared database corresponding to the at least two monitoring nodes;

the first monitoring node is further used for acquiring the state of a target service node corresponding to a query request from the state record table when the query request of the client node is received, and feeding back the state of the target service node to the client node in a query response mode;

16. A computer device comprising a processor and a memory; the memory is used for storing at least one instruction; the processor, configured to execute at least one instruction stored on the memory to implement the method steps of any of claims 1-7.

17. A computer-readable storage medium having stored therein at least one instruction which, when executed by a processor, implements the method steps of any of claims 1-7.