CN111049881B - Cloud platform node resource monitoring method and system and computer readable medium

Info

Publication number: CN111049881B
Application number: CN201911057183.XA
Authority: CN
Inventors: 李涛
Original assignee: Fiberhome Telecommunication Technologies Co Ltd
Current assignee: Fiberhome Telecommunication Technologies Co Ltd
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2022-07-22
Anticipated expiration: 2039-10-30
Also published as: CN111049881A

Abstract

The invention discloses a cloud platform node resource monitoring method, system, and computer-readable medium. The RMON of a node to be monitored establishes a main thread and a plurality of resource monitoring sub-threads; the main thread starts the sub-threads, which monitor the running state of the node's resources in parallel, and alarm messages reported by the sub-threads are cached in an alarm queue. The RMON is connected to the management plane interface of the alarm server through a management plane uplink interface and to the storage plane interface of the alarm server through a storage plane uplink interface. Alarm messages in the alarm queue are uploaded to the alarm server through the management plane uplink interface; if no management plane heartbeat request message issued by the server is received within a preset time threshold, the alarm messages are uploaded through the storage plane uplink interface instead. This optimizes the architecture of the current monitoring module and improves the efficiency and reliability of resource alarms.

Description

Cloud platform node resource monitoring method and system and computer readable medium

Technical Field

The invention belongs to the field of cloud computing platforms, and particularly relates to a method and a system for monitoring cloud platform node resources and a computer readable medium.

Background

With the development of cloud computing, edge clouds have become the preferred cloud computing solution for vendors in the telecommunication cloud field. At present, telecom operators such as China Mobile, China Unicom, and China Telecom actively use edge clouds to deploy their own telecommunication cloud platforms. An edge cloud offloads most of the computing load to cloud platforms in various locations for processing, and only the useful data is then aggregated to a central cloud platform. This reduces the pressure on the central cloud and improves service efficiency. A telecommunication cloud platform must offer high reliability and low latency. Compared with a private cloud, it requires more timely feedback on the health of the system in order to achieve that reliability. An alarm module is therefore needed to report the monitored state of the physical hardware to the user in time; a common approach is to report alarms of different severity levels to an administrator according to the health state, and the administrator then handles them according to the level.

Take StarlingX as an example. A resource monitoring module such as RMON (resource monitor) can obtain, for example, the network card status, file system usage, network rate, storage usage, and virtual switch resource usage. The RMON is deployed on every node of the cloud platform and is responsible for monitoring resource usage on its own node; if usage reaches a threshold, an alarm is triggered. The alarm module is generally deployed on a central node, such as a control node, and the RMON on the other nodes reports alarms to it over the network. The alarm module aggregates the alarm information of all nodes and presents it to the user in some form. In the conventional design, all processing modules execute in a single thread: alarm content can be processed and reported to the alarm module only after the state of every resource has been acquired. This wastes time, because alarm reporting depends on the time needed to acquire the state of all resources, and adding alarms for further resources later may increase the reporting time even more.

For example, most resources in the current RMON are alarmed in a single thread. If alarm processing for one resource takes too long, alarms for other resources cannot be handled in time, so the timeliness of resource alarms cannot be guaranteed. A telecommunication cloud requires a link interruption alarm to be reported within 5 s, but with the current design the alarm may be reported only after 2 to 3 minutes, so the reporting delay is very large and problems cannot be found in time. In addition, on current cloud platforms the RMON reports alarms (generally by sending TCP/UDP messages) through a single management network plane. If the management plane is completely interrupted, alarm messages, including the alarm for the network plane fault itself, cannot be reported to the alarm module on the server.

Disclosure of Invention

In view of the above defects or improvement requirements of the prior art, the invention provides a cloud platform node resource monitoring method, system, and computer-readable medium. The RMON of a node to be monitored establishes a main thread and a plurality of resource monitoring sub-threads, and the sub-threads monitor the running state of the node's resources in parallel. Alarm messages in the alarm queue are uploaded to the alarm server through the management plane uplink interface; if no management plane heartbeat request message issued by the server is received within a preset time threshold, the alarm messages are uploaded through the storage plane uplink interface instead. This optimizes the architecture of the current monitoring module and improves the efficiency and reliability of resource alarms.

In order to achieve the above object, according to an aspect of the present invention, a method for monitoring cloud platform node resources is provided, which includes the following steps:

a main thread and a plurality of resource monitoring sub-threads are established in the RMON of a node to be monitored; the main thread starts the plurality of parallel resource monitoring sub-threads, the sub-threads monitor the running state of the resources of the node to be monitored in parallel, and alarm messages reported by the sub-threads are cached in an alarm queue;

the RMON of the node to be monitored establishes a management plane uplink interface and a storage plane uplink interface, the management plane uplink interface being connected to the management plane interface of the alarm server and the storage plane uplink interface being connected to the storage plane interface of the alarm server; through the management plane uplink interface, the RMON receives management plane heartbeat request messages issued by the alarm server at a preset time interval, uploads management plane heartbeat feedback messages to the alarm server, and at the same time uploads the alarm messages in the alarm queue to the alarm server; and if no management plane heartbeat request message issued by the server is received within a preset time threshold, the RMON receives, through the storage plane uplink interface, a message issued by the alarm server to enable the storage plane uplink interface, receives storage plane heartbeat request messages issued by the server at the preset time interval, uploads storage plane heartbeat feedback messages to the alarm server, and at the same time uploads the alarm messages in the alarm queue to the alarm server.

As a further improvement of the invention, the plurality of resource monitoring sub-threads comprise a storage utilization monitoring sub-thread, a virtual switch resource monitoring sub-thread, a file system resource monitoring sub-thread, a network rate monitoring sub-thread, and a network card link state monitoring sub-thread.

As a further improvement of the invention, the main thread monitors the plurality of resource monitoring sub-threads; when a resource monitoring sub-thread is abnormal, an alarm message is reported and cached in the alarm queue, and when the main thread detects that a resource monitoring sub-thread has terminated, it restarts that sub-thread.

As a further improvement of the invention, the alarm function of each resource monitoring sub-thread is locked, so that the sub-thread's next alarm can be reported only after its previous alarm report has completed.

As a further improvement of the invention, after the storage plane uplink interface has been enabled, when the management plane uplink interface again receives a management plane heartbeat request message issued by the alarm server, the RMON uploads a management plane heartbeat feedback message to the alarm server and receives a message issued by the alarm server to enable the management plane uplink interface and close the storage plane uplink interface, so that the alarm messages in the alarm queue are once again uploaded through the management plane uplink interface.

As a further improvement of the invention, if the alarm server receives neither a management plane heartbeat feedback message nor a storage plane heartbeat feedback message from the node to be monitored within a preset time threshold, the alarm server generates a fault alarm message for the management plane and the storage plane of that node.

As a further improvement of the invention, the RMON of the node to be monitored interacts with the host by calling Linux system commands.

To achieve the above object, according to another aspect of the present invention, there is provided a cloud platform node resource monitoring system, which includes at least one processing unit, and at least one storage unit, where the storage unit stores a computer program, and when the program is executed by the processing unit, the processing unit executes the steps of the method.

To achieve the above object, according to another aspect of the present invention, there is provided a computer-readable medium storing a computer program executable by a terminal device, the program, when executed on the terminal device, causing the terminal device to perform the steps of the above method.

Generally, compared with the prior art, the technical scheme conceived by the invention has the following beneficial effects:

The invention discloses a cloud platform node resource monitoring method, system, and computer-readable medium. The RMON of the node to be monitored establishes a main thread and a plurality of resource monitoring sub-threads, and the sub-threads monitor the running state of the node's resources in parallel. Alarm messages in the alarm queue are uploaded to the alarm server through the management plane uplink interface; if no management plane heartbeat request message issued by the server is received within a preset time threshold, they are uploaded through the storage plane uplink interface instead. The multi-threaded architecture improves alarm efficiency, and the additional reporting channel improves alarm reliability: the alarm module no longer receives RMON alarms from the management plane alone but can also receive them from the storage plane, so alarms can still be reported when reporting over the management plane fails. The architecture of the current monitoring module is thereby optimized, and the efficiency and reliability of resource alarms are improved.

In the cloud platform node resource monitoring method, system, and computer-readable medium of the invention, the main thread monitors the plurality of resource monitoring sub-threads; when a sub-thread is abnormal, an alarm message is reported and cached in the alarm queue. The alarm reporting functions are also locked, so that the next alarm is reported only after the previous one has completed. Alarms are therefore not duplicated, which further improves the efficiency and reliability of resource alarms.

Drawings

Fig. 1 is a schematic diagram of a cloud platform node resource monitoring method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The present invention will be described in further detail with reference to specific embodiments.

The technical terms involved in the invention are explained as follows:

RMON: Resource Monitor. The monitored resources include all resources that the node can monitor, such as CPU utilization and memory utilization.

Fig. 1 is a schematic diagram of a cloud platform node resource monitoring method according to an embodiment of the present invention. As shown in fig. 1, a method for monitoring cloud platform node resources includes the following steps:

A main thread and a plurality of resource monitoring sub-threads are established in the RMON of a node to be monitored. The main thread starts the plurality of parallel resource monitoring sub-threads, the sub-threads monitor the running state of the resources of the node to be monitored in parallel, and alarm messages reported by the sub-threads are cached in an alarm queue.

As a preferred embodiment, the plurality of resource monitoring sub-threads include a storage utilization monitoring sub-thread, a virtual switch resource monitoring sub-thread, a file system resource monitoring sub-thread, a network rate monitoring sub-thread, and a network card link state monitoring sub-thread.

As an example, the RMON creates a main thread and a plurality of resource monitoring sub-threads. The main thread is mainly responsible for system initialization, for collecting the health state of the resource threads, and for reporting alarms; once started, it checks the state of the process in a loop. For each resource whose state is to be acquired, the main thread starts one resource monitoring sub-thread. The sub-thread tasks execute independently of and concurrently with one another, and each reports an alarm to the main thread as soon as it detects an abnormality. In some scenarios, for example when there are too many alarms or an alarm report fails, the alarm message can be placed temporarily in the alarm queue so that it is not lost, and the main thread then reports from the alarm queue.
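The thread layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the probe functions, their names, and the alarm text are hypothetical, and each sub-thread performs a single pass instead of looping.

```python
import queue
import threading

# Hypothetical resource probes: each returns an alarm string or None.
def check_storage():
    return None

def check_filesystem():
    return "filesystem /var/log usage above threshold"

def monitor_once(name, probe, alarm_queue):
    """Body of one resource monitoring sub-thread (single pass for brevity)."""
    alarm = probe()
    if alarm is not None:
        alarm_queue.put((name, alarm))  # cache the alarm for the main thread

def run_monitors(probes):
    """Main thread: start one sub-thread per resource, then drain the queue."""
    alarm_queue = queue.Queue()
    threads = [
        threading.Thread(target=monitor_once, args=(name, probe, alarm_queue))
        for name, probe in probes.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    alarms = []
    while not alarm_queue.empty():
        alarms.append(alarm_queue.get())
    return alarms

alarms = run_monitors({"storage": check_storage, "filesystem": check_filesystem})
print(alarms)
```

Because every probe runs in its own thread, a slow probe delays only its own alarm; the other sub-threads still place their alarms in the queue as soon as they finish.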

As a preferred embodiment, the main thread monitors the plurality of resource monitoring sub-threads, and when a sub-thread is abnormal, an alarm message is reported and cached in the alarm queue. The main thread needs a mechanism for detecting and managing the health state of the threads: if a resource thread terminates, a new thread is started to continue the monitoring.
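One way to picture the health check is a supervision pass that the main thread runs periodically. The sketch below is an assumption-laden illustration: supervise, flaky_monitor, and the thread name are hypothetical, and the first run of the sub-thread exits immediately to simulate an unexpected termination.

```python
import queue
import threading

stop = threading.Event()
calls = {"count": 0}

def flaky_monitor():
    """Hypothetical sub-thread body: exits at once on its first run only."""
    calls["count"] += 1
    if calls["count"] == 1:
        return       # simulates an unexpected termination
    stop.wait()      # stands in for a long-running monitoring loop

def supervise(workers, targets, alarm_queue):
    """One pass of the main thread's health check over its sub-threads."""
    for name, thread in list(workers.items()):
        if not thread.is_alive():
            alarm_queue.put(f"{name} sub-thread terminated; restarting")
            replacement = threading.Thread(target=targets[name], name=name)
            workers[name] = replacement
            replacement.start()

alarm_queue = queue.Queue()
targets = {"file_system": flaky_monitor}
workers = {"file_system": threading.Thread(target=flaky_monitor, name="file_system")}
workers["file_system"].start()
workers["file_system"].join()             # the first run has terminated

supervise(workers, targets, alarm_queue)  # detects the dead thread, restarts it
restarted_alive = workers["file_system"].is_alive()
supervise(workers, targets, alarm_queue)  # second pass finds nothing to do

stop.set()                                # let the restarted thread finish
workers["file_system"].join()
```

In a real monitor the supervision pass would run inside the main thread's loop at a fixed interval; here it is called twice by hand to show that only the terminated thread triggers an alarm and a restart.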

As a preferred embodiment, because the main thread and the resource threads all report alarms to the same alarm queue, a mechanism is needed to ensure that no duplicate alarm is reported.
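A minimal sketch of such a mechanism, assuming a lock around the reporting function plus a set of alarm identifiers that are already queued. The class AlarmReporter, the identifier "fs-var-log", and the clear method are hypothetical names introduced for illustration; the patent only states that the alarm function is locked.

```python
import queue
import threading

class AlarmReporter:
    """Serializes alarm reporting and drops alarms that are already queued."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()            # assumption: alarms are keyed by an id
        self.alarm_queue = queue.Queue()

    def report(self, alarm_id, text):
        # Only one caller at a time may report; the next report starts
        # after the previous one has finished.
        with self._lock:
            if alarm_id in self._active:
                return False            # duplicate, not queued again
            self._active.add(alarm_id)
            self.alarm_queue.put((alarm_id, text))
            return True

    def clear(self, alarm_id):
        """Allow the alarm to be raised again once the condition has cleared."""
        with self._lock:
            self._active.discard(alarm_id)

reporter = AlarmReporter()
results = []

def raise_same_alarm():
    results.append(reporter.report("fs-var-log", "usage above 70%"))

threads = [threading.Thread(target=raise_same_alarm) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight threads raise the same alarm concurrently, yet only the first one to acquire the lock places it in the queue.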

As an example, the RMON of the node to be monitored can interact with the host by calling Linux system commands. Taking file system monitoring as an illustration, to monitor the usage of the /var/log directory, the monitoring thread may invoke the system command df -T -P --local /var/log and read the value of the Capacity field in the returned result, which is the current usage of /var/log. If that usage is higher than a preset threshold (for example, 70%), an alarm message may be sent to the RMON to notify the user that the usage of /var/log has exceeded 70% and to prompt the user to take recovery action.
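The file system check can be sketched as follows. The function names, the sample df output, and the alarm wording are assumptions made for illustration; the parser also assumes that the file system name contains no spaces. Only the sample text is parsed here, so the sketch runs without calling df.

```python
import subprocess

# Illustrative output in the POSIX (-P) format with the type column (-T).
SAMPLE = (
    "Filesystem     Type 1024-blocks    Used Available Capacity Mounted on\n"
    "/dev/sda3      ext4    10190100 7643000   2006000      80% /var/log\n"
)

def parse_capacity(df_output):
    """Return the Capacity percentage from single-path df output."""
    lines = df_output.strip().splitlines()
    header, row = lines[0].split(), lines[1].split()
    return int(row[header.index("Capacity")].rstrip("%"))

def usage_alarm(path, used_percent, threshold=70):
    """Return an alarm string if usage exceeds the threshold, else None."""
    if used_percent > threshold:
        return f"{path} usage {used_percent}% exceeds {threshold}%"
    return None

def check_path(path, threshold=70):
    """Run df on the host and evaluate the result (not called in this sketch)."""
    out = subprocess.run(
        ["df", "-T", "-P", "--local", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return usage_alarm(path, parse_capacity(out), threshold)

alarm = usage_alarm("/var/log", parse_capacity(SAMPLE))
print(alarm)
```

Looking up the Capacity column by header name instead of by fixed position keeps the parser working whether or not the type column is present.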

The RMON of the node to be monitored establishes a management plane uplink interface and a storage plane uplink interface; the management plane uplink interface is connected to the management plane interface of the alarm server, and the storage plane uplink interface is connected to the storage plane interface of the alarm server. Through the management plane uplink interface, the RMON receives management plane heartbeat request messages issued by the alarm server at a preset time interval, uploads management plane heartbeat feedback messages to the alarm server, and at the same time uploads the alarm messages in the alarm queue to the alarm server. If no management plane heartbeat request message issued by the server is received within a preset time threshold, the RMON receives, through the storage plane uplink interface, a message issued by the alarm server to enable the storage plane uplink interface, receives storage plane heartbeat request messages issued by the server at the preset time interval, uploads storage plane heartbeat feedback messages to the alarm server, and at the same time uploads the alarm messages in the alarm queue to the alarm server.

As a preferred embodiment, after the storage plane uplink interface has been enabled, when the management plane uplink interface again receives a management plane heartbeat request message issued by the alarm server, the RMON uploads a management plane heartbeat feedback message to the alarm server and receives a message issued by the alarm server to enable the management plane uplink interface and close the storage plane uplink interface, so that the alarm messages in the alarm queue are once again uploaded through the management plane uplink interface.

As a preferred embodiment, if the alarm server receives neither a management plane heartbeat feedback message nor a storage plane heartbeat feedback message from the node to be monitored within a preset time threshold, the alarm server generates a fault alarm message for the management plane and the storage plane of that node.

As an example, when the RMON of the node to be monitored detects that resource usage exceeds a threshold, it sends an alarm message to the alarm module on the alarm server through the management plane. If the management plane has failed, the alarm message cannot reach the alarm module that way; the RMON therefore first tries to send the alarm message through the management plane and, if the management plane proves unreachable, tries to send it through the storage plane. Specifically, the heartbeat module Server periodically sends heartbeat requests to the heartbeat module Client. If the heartbeat module Client receives no heartbeat message from the heartbeat module Server on the management plane within a certain period, management plane communication is considered interrupted; the heartbeat module Client notifies the RMON of the interruption, and on receiving this notification the RMON automatically switches to the storage network plane for sending alarm messages. Alarm reporting can thus switch to the storage plane promptly when the management plane is interrupted. After the management plane returns to normal, the heartbeat module Client notifies the RMON that management plane communication has recovered, and the RMON switches back to the management plane for sending alarm messages.

Heartbeat-based detection is more accurate because it confirms that the network plane can actually carry traffic, whereas the traditional method of checking the link state of the network card is unreliable. In addition, the RMON does not need to check the communication state of the network plane each time it sends an alarm message, because the heartbeat module Client actively notifies it of changes, which keeps alarm reporting efficient.
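The node-side switching logic can be sketched as a small state holder. This is a simplified illustration under stated assumptions: the class PlaneSelector and its method names are hypothetical, timestamps are passed in explicitly so the example is deterministic, and the enabling messages exchanged with the alarm server are left out.

```python
class PlaneSelector:
    """Chooses the network plane for alarm upload from heartbeat timing."""

    def __init__(self, timeout, now):
        self.timeout = timeout            # preset time threshold, in seconds
        self.last_mgmt_heartbeat = now    # time of the last management heartbeat

    def on_management_heartbeat(self, now):
        """Called by the heartbeat client when a management plane request arrives."""
        self.last_mgmt_heartbeat = now

    def active_plane(self, now):
        """Return the plane on which alarm messages should be sent right now."""
        if now - self.last_mgmt_heartbeat > self.timeout:
            return "storage"              # management plane considered interrupted
        return "management"

selector = PlaneSelector(timeout=5, now=0)
before_timeout = selector.active_plane(now=3)
after_timeout = selector.active_plane(now=6)
selector.on_management_heartbeat(now=7)   # management plane has recovered
after_recovery = selector.active_plane(now=8)
print(before_timeout, after_timeout, after_recovery)
```

The sender consults active_plane when it uploads queued alarms, so it never has to probe the network itself; the decision follows entirely from the heartbeat arrival times.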

If all network planes (management plane, storage plane, and so on) between the alarm server and the node to be monitored fail, the node cannot send a network plane fault alarm message to the alarm server. To cover this case, a heartbeat module Server is deployed on the alarm server and a heartbeat module Client on the node to be monitored. The heartbeat module Server periodically sends heartbeat messages to the heartbeat module Client over the management plane, the storage plane, and any other planes, and the heartbeat module Client replies to each heartbeat message it receives with a feedback message. If the management plane, the storage plane, and the other planes have all failed, the heartbeat module Server receives no reply from the heartbeat module Client. Therefore, if no feedback message is received within a certain time range, the heartbeat module Server concludes that the management plane, storage plane, and other planes of that compute node have failed and reports a network plane (management plane and storage plane) alarm to the alarm module, which ensures that the corresponding alarm is still reported when all network planes of the compute node have failed.
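The server-side check can be sketched as a scan over the time of the last feedback message per node and plane. The function name, node names, timestamps, and alarm wording below are hypothetical; the sketch assumes the heartbeat Server records when each feedback message arrived.

```python
def silent_nodes(last_reply, now, timeout):
    """Return alarms for nodes whose every plane has been silent too long.

    last_reply maps node name -> {plane name -> time of last feedback message}.
    """
    alarms = []
    for node in sorted(last_reply):
        planes = last_reply[node]
        if all(now - seen > timeout for seen in planes.values()):
            alarms.append(
                f"{node}: no heartbeat feedback on {', '.join(sorted(planes))}"
            )
    return alarms

last_reply = {
    "compute-0": {"management": 95.0, "storage": 96.0},  # both planes healthy
    "compute-1": {"management": 40.0, "storage": 97.0},  # management down only
    "compute-2": {"management": 30.0, "storage": 35.0},  # all planes silent
}
alarms = silent_nodes(last_reply, now=100.0, timeout=15.0)
print(alarms)
```

A node with only one silent plane is not reported here, because it can still raise its own alarm over the remaining plane; only a node that is silent on every plane needs the server to raise the alarm on its behalf.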

A cloud platform node resource monitoring system comprises at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program, and when the program is executed by the processing unit, the processing unit executes the steps of the method.

A computer-readable medium, in which a computer program executable by a terminal device is stored, which program, when run on the terminal device, causes the terminal device to carry out the steps of the method.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A cloud platform node resource monitoring method is characterized by comprising the following steps:

the plurality of resource monitoring sub-threads comprise a storage utilization monitoring sub-thread, a virtual switch resource monitoring sub-thread, a file system resource monitoring sub-thread, a network rate monitoring sub-thread, and a network card link state monitoring sub-thread;

the main thread monitors the plurality of resource monitoring sub-threads, and when a resource monitoring sub-thread is abnormal, an alarm message is reported and cached in an alarm queue; when the main thread detects that a resource monitoring sub-thread has terminated, the terminated resource monitoring sub-thread is restarted;

the RMON of the node to be monitored establishes a management plane uplink interface and a storage plane uplink interface, the management plane uplink interface being connected to the management plane interface of the alarm server and the storage plane uplink interface being connected to the storage plane interface of the alarm server; through the management plane uplink interface, the RMON receives management plane heartbeat request messages issued by the alarm server at a preset time interval, uploads management plane heartbeat feedback messages to the alarm server, and at the same time uploads the alarm messages in the alarm queue to the alarm server; and if no management plane heartbeat request message issued by the server is received within a preset time threshold, the RMON receives, through the storage plane uplink interface, a message issued by the alarm server to enable the storage plane uplink interface, receives storage plane heartbeat request messages issued by the server at the preset time interval, uploads storage plane heartbeat feedback messages to the alarm server, and at the same time uploads the alarm messages in the alarm queue to the alarm server.

2. The cloud platform node resource monitoring method according to claim 1, wherein the alarm function of each resource monitoring sub-thread is locked, so that the sub-thread's next alarm can be reported only after its previous alarm report has completed.

3. The cloud platform node resource monitoring method according to claim 1, wherein after the storage plane uplink interface has been enabled, when the management plane uplink interface again receives a management plane heartbeat request message issued by the alarm server, the RMON uploads a management plane heartbeat feedback message to the alarm server and receives a message issued by the alarm server to enable the management plane uplink interface and close the storage plane uplink interface, so that the alarm messages in the alarm queue are once again uploaded through the management plane uplink interface.

4. The cloud platform node resource monitoring method according to claim 1, wherein if the alarm server receives neither a management plane heartbeat feedback message nor a storage plane heartbeat feedback message from the node to be monitored within a preset time threshold, the alarm server generates a fault alarm message for the management plane and the storage plane of that node.

5. The cloud platform node resource monitoring method according to claim 1, wherein the RMON of the node to be monitored interacts with the host by calling Linux system commands.

6. A cloud platform node resource monitoring system comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of the method of any one of claims 1 to 5.

7. A computer-readable medium, in which a computer program executable by a terminal device is stored, which program, when run on the terminal device, causes the terminal device to carry out the steps of the method as claimed in any one of claims 1 to 5.