CN111880997A - Distributed monitoring system, monitoring method and device - Google Patents


Publication number
CN111880997A
Authority
CN
China
Prior art keywords
sub
node
monitoring data
management
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010747493.0A
Other languages
Chinese (zh)
Inventor
杨璐
杜夏威
张晋锋
吕灼恒
李斌
袁伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi City Cloud Computing Center Co ltd
Zhongke Sugon Information Industry Chengdu Co ltd
Dawning Information Industry Beijing Co Ltd
Original Assignee
Wuxi City Cloud Computing Center Co ltd
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi City Cloud Computing Center Co ltd, Dawning Information Industry Beijing Co Ltd filed Critical Wuxi City Cloud Computing Center Co ltd
Priority to CN202010747493.0A priority Critical patent/CN111880997A/en
Publication of CN111880997A publication Critical patent/CN111880997A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/547 Messaging middleware

Abstract

The application provides a distributed monitoring system, a monitoring method and a device. The system comprises a general management node, a plurality of sub-management nodes, message middleware and a plurality of computing nodes. Each sub-management node collects monitoring data of its corresponding computing nodes and sends the monitoring data to the message middleware; the message middleware caches the monitoring data; and the general management node acquires the monitoring data from the message middleware to monitor the plurality of computing nodes. In the embodiments of the application, the monitoring data of the computing nodes in the cluster system is acquired using a layered approach, and the plurality of sub-management nodes and the message middleware share the load of the general management node, thereby avoiding overloading the general management node.

Description

Distributed monitoring system, monitoring method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a distributed monitoring system, a monitoring method, and a monitoring device.
Background
A cluster system is a system in which two or more computing nodes are interconnected through a network and cooperate to complete the tasks of a parallel application program. In most existing cluster deployments, a management node collects information from the computing nodes, forming a physical mapping between the computing nodes and the management node. Collecting the information of the computing nodes directly at the management node, or pushing it directly to the management node, easily overloads the management node.
Disclosure of Invention
An object of the embodiments of the present application is to provide a distributed monitoring system, a monitoring method, and a monitoring device, so as to solve the problem of overload of a management node in the prior art.
In a first aspect, an embodiment of the present application provides a distributed monitoring system, including: a general management node, a plurality of sub-management nodes, message middleware and a plurality of computing nodes; each sub-management node is used for collecting monitoring data of its corresponding computing nodes and sending the monitoring data to the message middleware; the message middleware is used for caching the monitoring data; and the general management node acquires the monitoring data from the message middleware to monitor the plurality of computing nodes.
According to the embodiment of the application, the monitoring data of the computing nodes in the cluster system is acquired by adopting a layering technology, and the load of the total management node is shared by utilizing a plurality of sub management nodes and message middleware, so that the problem of overlarge load of the management node is avoided.
Further, the plurality of sub-management nodes form a plurality of supervision layers. The sub-management nodes in the lowest supervision layer are communicatively connected to the computing nodes, collect the monitoring data of their corresponding computing nodes, and transmit it to the sub-management nodes of the intermediate supervision layer directly above; each intermediate supervision layer sends the monitoring data to the supervision layer above it; and the sub-management nodes in the uppermost supervision layer are communicatively connected to the message middleware and send the monitoring data to the message middleware. Because the embodiment of the application divides the sub-management nodes into a plurality of supervision layers, the system can be applied to larger-scale cluster systems.
Furthermore, the general management node is further configured to monitor the working state of each sub-management node, allocate corresponding computing nodes to each sub-management node according to that working state, and generate a corresponding mapping relation table; the mapping relation table records which computing nodes each sub-management node is responsible for monitoring. According to the embodiment of the application, the general management node allocates computing nodes to the sub-management nodes according to their working states and records the allocation in the mapping relation table, which on the one hand keeps the load of the sub-management nodes balanced and on the other hand lets each sub-management node learn which computing nodes it is responsible for.
Further, the sub-management node is specifically configured to: periodically acquire the mapping relation table from the general management node, determine the computing nodes to be monitored according to the mapping relation table, and collect the monitoring data of those computing nodes. In the embodiment of the application, because the general management node can dynamically adjust the correspondence between the sub-management nodes and the computing nodes, each sub-management node periodically determines from the mapping relation table which computing nodes it must supervise, so that no computing node is monitored more than once or overlooked.
Further, each sub-management node stores a collection scheduling list, which represents the range of computing nodes managed by that sub-management node. Determining the computing nodes to be monitored according to the mapping relation table and collecting their monitoring data includes: if the correspondence between the sub-management node and the computing nodes in the mapping relation table has changed, obtaining the computing nodes to be monitored from the mapping relation table; and resetting the collection scheduling list and collecting the monitoring data of the corresponding computing nodes according to the reset collection scheduling list. In the embodiment of the application, when a sub-management node learns from the mapping relation table that its corresponding computing nodes have changed, it resets the collection scheduling list to monitor the new computing nodes, so that each sub-management node supervises the computing nodes assigned to it.
Further, the general management node is further configured to store monitoring data of each computing node, determine whether the working state of each computing node is abnormal according to the monitoring data, and trigger an alarm if the working state of each computing node is abnormal. According to the embodiment of the application, whether the working state of the computing node is abnormal or not is judged by the general management node according to the monitoring data, so that the monitoring of the computing node is realized, and the alarm is triggered under the abnormal condition, so that the working personnel can find the abnormal computing node in time.
In a second aspect, an embodiment of the present application provides a data monitoring method, which is applied to a sub-management node in a distributed monitoring system, where the system includes a main management node, a plurality of sub-management nodes, a message middleware, and a plurality of compute nodes; the sub-management node is used for collecting monitoring data of the corresponding computing node and sending the monitoring data to the message middleware; the message middleware is used for caching the monitoring data; the general management node acquires the monitoring data from the message middleware to realize the monitoring of the plurality of computing nodes, and the method comprises the following steps:
collecting monitoring data of corresponding computing nodes;
and sending the monitoring data to the message middleware so that the general management node acquires the monitoring data from the message middleware.
Further, the collecting monitoring data of the corresponding computing node includes:
acquiring a mapping relation table from the general management node; if it is determined from the mapping relation table that the correspondence between the sub-management node and the computing nodes has changed, obtaining the computing nodes to be monitored from the mapping relation table; and resetting the collection scheduling list and collecting the monitoring data of the corresponding computing nodes according to the reset collection scheduling list.
In the embodiment of the application, because the total management node can dynamically adjust the corresponding relationship between the sub-management nodes and the computing nodes, the sub-management nodes can ensure that the computing nodes actually monitored by the sub-management nodes are consistent with the computing nodes distributed by the total management node by periodically determining the computing nodes to be monitored from the mapping relationship table.
Further, after collecting the monitoring data of the corresponding computing nodes according to the reset collection scheduling list, the method further includes: sending the monitoring data to the message middleware so that the general management node reads the monitoring data from the message middleware. According to the embodiment of the application, the monitoring data is cached by the message middleware; when the general management node needs the monitoring data of a computing node, it can obtain the data directly from the message middleware, which solves the problem of the general management node being overloaded.
In a third aspect, an embodiment of the present application provides a data monitoring method, which is applied to a master management node in a distributed monitoring system, where the system includes the master management node, a plurality of sub-management nodes, a message middleware, and a plurality of compute nodes; the sub-management node is used for collecting monitoring data of the corresponding computing node and sending the monitoring data to the message middleware; the message middleware is used for caching the monitoring data; the general management node acquires the monitoring data from the message middleware to realize the monitoring of the plurality of computing nodes, and the method comprises the following steps:
configuring a computing node responsible for monitoring for each sub-management node so that the sub-management node collects monitoring data of the corresponding computing node and sends the monitoring data to the message middleware;
and reading the monitoring data from the message middleware.
Further, the method further comprises:
acquiring the working states of each sub-management node and each computing node;
determining whether a mapping relation table needs to be updated according to the working states of the sub-management nodes and the computing nodes; the mapping relation table comprises the corresponding relation of the computing nodes which are monitored by the sub-management nodes;
and updating the mapping relation table under the condition that the updating is needed.
Further, the method further comprises:
receiving a query request of a sub-management node for querying a mapping relation table;
and sending the mapping relation table to the sub-management nodes.
In a fourth aspect, an embodiment of the present application provides a data monitoring apparatus, including:
the data acquisition module is used for acquiring monitoring data of the corresponding computing node;
and the data sending module is used for sending the monitoring data to the message middleware so that the general management node acquires the monitoring data from the message middleware.
In a fifth aspect, an embodiment of the present application provides a data monitoring apparatus, including:
the configuration module is used for configuring the computing nodes responsible for monitoring for each sub-management node so as to enable the sub-management nodes to acquire the monitoring data of the corresponding computing nodes and send the monitoring data to the message middleware;
and the data reading module is used for reading the monitoring data from the message middleware.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, wherein the processor and the memory communicate with each other through the bus; the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method of the second or third aspect.
In a seventh aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium, including: the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method of the second or third aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of a distributed monitoring system according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of another distributed monitoring system provided in an embodiment of the present application;
Fig. 3 is a schematic flow chart of a data monitoring method according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of another data monitoring method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a data monitoring apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of another data monitoring apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In order to solve the problem of an overloaded management node, the application provides a distributed monitoring system that adopts a layered approach; the system mainly comprises a general management node, sub-management nodes, message middleware and computing nodes. The sub-management nodes share the load of monitoring the computing nodes and upload the collected monitoring data to the message middleware, which caches it. When the general management node needs the monitoring data, it can acquire the data directly from the message middleware, so the load of the general management node is reduced.
Fig. 1 is a schematic structural diagram of a distributed monitoring system provided in an embodiment of the present application, and as shown in fig. 1, the system includes a total management node 101, a plurality of sub-management nodes 102, a message middleware 103, and a plurality of computing nodes 104, where:
each sub-management node 102 has corresponding computing nodes 104 that it monitors; the sub-management nodes 102 are therefore communicatively connected to both the computing nodes 104 and the message middleware 103, and each sub-management node 102 acquires monitoring data from the computing nodes 104 it monitors and uploads the acquired monitoring data to the message middleware 103. It should be noted that, since the computing nodes 104 monitored by a sub-management node 102 may change, each sub-management node 102 may be communicatively connected to every computing node 104 in order to adapt to such changes. It is understood that a sub-management node 102 may, like a computing node 104, be a server; the two differ in the role they play in the cluster. The main function of the sub-management node 102 is to monitor the computing nodes 104, while the main function of the computing nodes 104 is to handle tasks in the cluster, such as query tasks.
The main function of the message middleware 103 is to cache monitoring data, the sub-management node 102 sends the monitoring data of the computing node 104 to the message middleware 103, and the message middleware 103 caches the monitoring data in a data queue.
When the general management node 101 needs to read the monitoring data, it can start reading from the head of the data queue. To prevent too much monitoring data from accumulating in the data queue, the monitoring data may be deleted from the data queue after it has been read by the general management node 101.
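The queue behavior described above can be illustrated with a minimal Python sketch. The class `MonitoringQueue`, its method names, and the record fields are hypothetical; a real deployment would use actual message middleware, whose API the application does not specify.

```python
from collections import deque


class MonitoringQueue:
    """In-memory stand-in for the message middleware's data queue (hypothetical)."""

    def __init__(self):
        self._queue = deque()

    def publish(self, record):
        # A sub-management node appends monitoring data at the tail.
        self._queue.append(record)

    def read_batch(self, max_items):
        # The general management node reads from the head; each record is
        # removed as it is read so the queue does not grow without bound.
        batch = []
        while self._queue and len(batch) < max_items:
            batch.append(self._queue.popleft())
        return batch

    def __len__(self):
        return len(self._queue)


queue = MonitoringQueue()
queue.publish({"node": "cn-1", "cpu_utilization": 35.0})
queue.publish({"node": "cn-2", "cpu_utilization": 97.5})
first = queue.read_batch(max_items=1)
```

After the call, `first` holds the oldest record and one record remains in the queue.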
It is understood that the general management node 101 may read the monitoring data from the message middleware 103 at a preset period. The general management node 101 may, like the computing nodes 104, be a server, but its function differs from that of the computing nodes 104 and the sub-management nodes 102: it mainly acquires the monitoring data and determines from it whether the corresponding computing node 104 is abnormal. Specifically, after reading the monitoring data of the computing nodes from the message middleware, the general management node stores the monitoring data in a database to record the operating condition of the computing nodes, and analyzes the monitoring data of each computing node against a monitoring threshold; if the monitoring data exceeds the monitoring threshold, the working state of the corresponding computing node is abnormal and an alarm is triggered. For example: the monitoring data may be the CPU utilization, with a typical threshold of 90%; the monitoring data may also be the memory utilization, with a typical threshold of 80%; the monitoring data may further be the CPU temperature, for which, with liquid cooling, the threshold may be set to 93%, depending on the actual boiling point of the liquid.
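A minimal Python sketch of the store-then-check step follows. The metric names, the dictionary used in place of a database, and the list used in place of an alarm channel are all assumptions made for illustration; only the 90% and 80% thresholds come from the text.

```python
# Thresholds taken from the examples in the text; metric names are hypothetical.
THRESHOLDS = {"cpu_utilization": 90.0, "memory_utilization": 80.0}


def find_abnormal_metrics(sample, thresholds=THRESHOLDS):
    """Return the metrics in `sample` that exceed their monitoring threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    }


def process_sample(node_id, sample, database, alarms):
    # Store every sample so the node's operating history is recorded.
    database.setdefault(node_id, []).append(sample)
    abnormal = find_abnormal_metrics(sample)
    if abnormal:
        # Stand-in for triggering an alarm.
        alarms.append((node_id, abnormal))
    return abnormal


database, alarms = {}, []
process_sample("cn-1", {"cpu_utilization": 42.0, "memory_utilization": 61.0}, database, alarms)
process_sample("cn-2", {"cpu_utilization": 95.0, "memory_utilization": 70.0}, database, alarms)
```

Here only `cn-2` produces an alarm, because its CPU utilization exceeds the 90% threshold; both samples are stored.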
In addition, the general management node can provide a uniform interface entry for a user through a browser, so that operation and maintenance personnel can clearly browse the states of the general management node, the sub-management nodes and the computing nodes in the cluster system and the monitoring data of each computing node.
It should be noted that fig. 1 is only an example provided by the embodiment of the present application, and in a specific implementation process, the system architecture may be modified according to actual situations, for example: adjusting the number of computing nodes, adjusting the number of sub-management nodes, the number of layers and the like.
According to the embodiment of the application, the monitoring data of the computing nodes in the cluster system is acquired by adopting a layering technology, and the load of the total management node is shared by utilizing a plurality of sub management nodes and message middleware, so that the problem of overlarge load of the management node is avoided.
On the basis of the above embodiment, the plurality of sub-management nodes form a plurality of supervision layers; the sub-management nodes in the lowest layer of the plurality of monitoring layers are in communication connection with the computing nodes and are used for acquiring monitoring data of the corresponding computing nodes and transmitting the monitoring data to the sub-management nodes of the upper monitoring layer; and the sub-management node in the uppermost monitoring layer is in communication connection with the message middleware and is used for sending the monitoring data to the message middleware.
Fig. 2 is a schematic structural diagram of another distributed monitoring system provided in the embodiment of the present application. As shown in fig. 2, and unlike fig. 1, the sub-management nodes in this embodiment are divided into three supervision layers, each of which includes a plurality of sub-management nodes 102. It is understood that in practice the sub-management nodes 102 may be divided into more or fewer layers as needed. As can be seen from fig. 2, the sub-management nodes 102 in the lowest layer are directly communicatively connected to the computing nodes 104; they obtain the monitoring data of the corresponding computing nodes 104 and send it to the sub-management nodes 102 of the supervision layer directly above (i.e., the middle layer). After receiving the monitoring data, the middle layer in turn sends it to the sub-management nodes 102 of the supervision layer above it, and so on up to the uppermost layer. The sub-management nodes 102 in the uppermost supervision layer are communicatively connected to the message middleware 103 and send the monitoring data received from the layer below to the message middleware 103, so that the general management node 101 can read the monitoring data from the message middleware. It should be noted that, for convenience of drawing, the uppermost supervision layer in fig. 2 includes only one sub-management node 102; in a specific implementation, the uppermost supervision layer may include a plurality of sub-management nodes 102. If the sub-management nodes form only two supervision layers, the layer directly above the lowest layer is itself the uppermost supervision layer.
Because the embodiment of the application divides the sub-management nodes into a plurality of supervision layers, the system can be applied to larger-scale cluster systems.
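The layer-by-layer relay can be sketched in Python as follows. The class `SubManagementNode` and the plain list standing in for the message middleware are hypothetical, and the sketch assumes each node has exactly one upstream target, which the application does not require.

```python
class SubManagementNode:
    """A sub-management node that relays monitoring data upward (illustrative only)."""

    def __init__(self, name, parent=None, middleware=None):
        self.name = name
        self.parent = parent          # node in the next upper layer, if any
        self.middleware = middleware  # set only for the uppermost layer

    def forward(self, record):
        if self.parent is not None:
            # Lower and middle layers pass the data to the layer above.
            self.parent.forward(record)
        elif self.middleware is not None:
            # The uppermost layer hands the data to the message middleware.
            self.middleware.append(record)
        else:
            raise RuntimeError(f"{self.name} has no upstream target")


middleware = []  # stand-in for the message middleware's queue
top = SubManagementNode("top", middleware=middleware)
middle = SubManagementNode("middle", parent=top)
bottom = SubManagementNode("bottom", parent=middle)

bottom.forward({"node": "cn-7", "cpu_utilization": 12.5})
```

With two layers instead of three, `bottom` would simply take `top` as its parent.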
On the basis of the above embodiment, the master management node is further configured to monitor the working state of each sub-management node, allocate a corresponding computing node to each sub-management node according to the working state of each sub-management node, and generate a corresponding mapping relationship table; the mapping relation table comprises the corresponding relation of the computing nodes which are monitored by the sub-management nodes.
In a specific implementation process, the total management node is further configured to allocate a computing node to be monitored to each sub-management node, and in order to ensure load balance of each sub-management node, the computing node monitored by each sub-management node may be dynamically adjusted, where a condition for triggering dynamic adjustment is given below:
(1) If a sub-management node fails, then in order to ensure that all the computing nodes remain monitored, the computing nodes that were monitored by the failed sub-management node need to be monitored by other sub-management nodes. The general management node can therefore monitor the working state of each sub-management node, and if a sub-management node fails, reconfigure the computing nodes monitored by each sub-management node so that all the computing nodes are monitored. When allocating computing nodes to the sub-management nodes, the allocation can take into account the number of sub-management nodes, the number of computing nodes and the current load of each sub-management node, so that the load of the sub-management nodes is balanced and no sub-management node carries a load far greater than that of the others.
(2) If several of the computing nodes monitored by a sub-management node are shut down due to faults, that sub-management node no longer needs to acquire monitoring data from the faulty computing nodes, so its load becomes smaller than that of the other sub-management nodes. The general management node may then start a dynamic adjustment operation: it may reallocate the monitored computing nodes among all the sub-management nodes, or it may arbitrarily select several computing nodes from the sub-management node with the highest load and allocate them to the underloaded sub-management node. The number of computing nodes selected may be determined according to the difference between the load of that sub-management node and the load of the most heavily loaded sub-management node, or by another strategy, which is not specifically limited in this embodiment of the present application.
(3) If a new or failure-recovered computing node is added in the cluster system, the newly added computing node needs to be monitored, the total management node may re-allocate a corresponding computing node to each sub-management node, or may select a sub-management node with the smallest load and allocate the computing node to the sub-management node with the smallest load.
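The three trigger conditions above can be sketched in Python as follows. This is only one possible strategy: the sketch assumes that load is measured by the number of monitored computing nodes, that ties are broken by name, and that half of the load difference is moved in case (2). None of these choices, nor the function names, come from the application.

```python
def least_loaded(mapping):
    """Pick the sub-management node monitoring the fewest computing nodes."""
    return min(sorted(mapping), key=lambda sub: len(mapping[sub]))


def handle_sub_node_failure(mapping, failed):
    """Condition (1): hand the failed node's computing nodes to the others."""
    orphans = mapping.pop(failed)
    if not mapping:
        raise RuntimeError("no sub-management node left to take over")
    for node in orphans:
        mapping[least_loaded(mapping)].append(node)


def drop_compute_nodes(mapping, failed_nodes):
    """Condition (2), first step: stop monitoring computing nodes that shut down."""
    for sub in mapping:
        mapping[sub] = [n for n in mapping[sub] if n not in failed_nodes]


def top_up_underloaded(mapping, underloaded):
    """Condition (2), second step: move nodes over from the busiest sub-node."""
    busiest = max(sorted(mapping), key=lambda sub: len(mapping[sub]))
    # Assumed policy: move half of the difference in node counts.
    count = (len(mapping[busiest]) - len(mapping[underloaded])) // 2
    for _ in range(count):
        mapping[underloaded].append(mapping[busiest].pop())


def handle_new_compute_node(mapping, node):
    """Condition (3): give a new or recovered node to the least-loaded sub-node."""
    mapping[least_loaded(mapping)].append(node)


# Mapping relation table: sub-management node -> monitored computing nodes.
mapping = {
    "sub-a": ["cn-1", "cn-2"],
    "sub-b": ["cn-3", "cn-4"],
    "sub-c": ["cn-5", "cn-6"],
}
handle_sub_node_failure(mapping, "sub-c")
after_failure = {sub: list(nodes) for sub, nodes in mapping.items()}
handle_new_compute_node(mapping, "cn-7")
drop_compute_nodes(mapping, {"cn-3", "cn-4"})
top_up_underloaded(mapping, "sub-b")
```

Reallocating every computing node from scratch, which the text also permits, would replace these incremental moves with a single full redistribution.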
It can be understood that each time the general management node dynamically adjusts the computing nodes monitored by the sub-management nodes, the correspondence between the affected sub-management nodes and computing nodes in the mapping relation table must be updated at the same time. A sub-management node can query the general management node through a web service interface to find out whether its correspondence with the computing nodes has changed.
According to the embodiment of the application, the general management node allocates computing nodes to the sub-management nodes according to their working states and records the allocation in the mapping relation table, which on the one hand keeps the load of the sub-management nodes balanced and on the other hand lets each sub-management node learn which computing nodes it is responsible for.
On the basis of the above embodiment, the sub-management node is specifically configured to:
periodically acquire the mapping relation table from the general management node, determine the computing nodes to be monitored according to the mapping relation table, and collect the monitoring data of those computing nodes.
In a specific implementation process, the computing nodes monitored by the sub-management nodes must be consistent with the computing nodes allocated by the general management node, so that no computing node is monitored more than once or overlooked. To ensure this, each sub-management node can periodically acquire the mapping relation table from the general management node, look up the computing nodes it is required to monitor in the acquired table, and collect the monitoring data of those computing nodes.
In another embodiment, after the total management node updates the mapping relationship table, in order to enable the sub-management nodes to receive the new mapping relationship table in time, the total management node may actively send the updated mapping relationship table to the sub-management nodes. The updated mapping relation table may be specifically sent to all the sub-management nodes, or may be sent only to the sub-management node whose correspondence relationship has changed.
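For the push variant that notifies only the affected sub-management nodes, one way to find them is to compare the old and new mapping relation tables. The Python sketch below assumes the table is a dictionary from sub-management node name to a list of computing node names; the function name and that representation are hypothetical.

```python
def changed_sub_nodes(old_table, new_table):
    """Return the sub-management nodes whose set of computing nodes differs."""
    names = set(old_table) | set(new_table)
    return sorted(
        name
        for name in names
        if set(old_table.get(name, [])) != set(new_table.get(name, []))
    )


old_table = {"sub-a": ["cn-1", "cn-2"], "sub-b": ["cn-3"], "sub-c": ["cn-4"]}
new_table = {"sub-a": ["cn-1"], "sub-b": ["cn-3", "cn-2"], "sub-c": ["cn-4"]}
to_notify = changed_sub_nodes(old_table, new_table)
```

In this example `cn-2` moves from `sub-a` to `sub-b`, so only those two sub-management nodes would receive the updated table.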
After the sub-management node acquires the mapping relation table stored in the total management node, if it judges that the computing nodes it monitors have changed, it stops monitoring the original computing nodes, obtains the computing nodes to be monitored from the mapping relation table, resets the collection scheduling list, and collects the monitoring data of the corresponding computing nodes according to the reset collection scheduling list. It can be understood that the collection scheduling list is pre-stored in the corresponding sub-management node, and its main function is to define the range of computing nodes managed by that sub-management node; if this range changes, the collection scheduling list needs to be updated.
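The reset of the collection scheduling list can be sketched as follows. This is a hedged illustration only: the class `SubManagementNode`, its methods, and the placeholder metric are hypothetical names that do not appear in the application, and the scheduling list is modelled as a plain set.

```python
from typing import Dict, List, Set


class SubManagementNode:
    """Hypothetical model of a sub-management node and its scheduling list."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.schedule: Set[str] = set()  # collection scheduling list

    def refresh(self, mapping_table: Dict[str, List[str]]) -> bool:
        """Reset the scheduling list if the assignment changed.

        Returns True when the set of monitored compute nodes was replaced.
        """
        assigned = set(mapping_table.get(self.name, []))
        if assigned == self.schedule:
            return False
        # Dropping the old set stops monitoring nodes that were reassigned.
        self.schedule = assigned
        return True

    def collect(self) -> Dict[str, dict]:
        """Return one placeholder sample per compute node in the list."""
        return {node: {"status": "ok"} for node in sorted(self.schedule)}
```

Calling `refresh` with each newly acquired table keeps the set of monitored computing nodes aligned with the allocation, while an unchanged table leaves the scheduling list untouched.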
It can be understood that a sub-management node may send the collected monitoring data to the message middleware in real time; it may send the monitoring data accumulated in one period to the message middleware at a preset period; or, after receiving an upload instruction from the total management node, it may upload the monitoring data collected between the last transmission and the receipt of the upload instruction. The time at which the monitoring data is sent may be set according to the actual situation, which is not specifically limited in the embodiment of the present application.
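The three sending options can be sketched with a small buffer class. This is an assumption-laden illustration: `MonitoringSender` and its methods are hypothetical, and Python's in-process `queue.Queue` merely stands in for the message middleware, which the application does not tie to a specific product.

```python
import queue
from typing import List


class MonitoringSender:
    """Hypothetical sender that is either real-time or buffered."""

    def __init__(self, middleware: queue.Queue, realtime: bool = False) -> None:
        self.middleware = middleware
        self.realtime = realtime
        self.pending: List[dict] = []

    def record(self, sample: dict) -> None:
        """Forward the sample at once in real-time mode, otherwise buffer it."""
        if self.realtime:
            self.middleware.put([sample])
        else:
            self.pending.append(sample)

    def flush(self) -> int:
        """Send everything buffered since the last transmission.

        Intended to be called by a periodic timer or when an upload
        instruction arrives. Returns the number of samples sent.
        """
        count = len(self.pending)
        if count:
            self.middleware.put(self.pending)
            self.pending = []
        return count
```

The same `flush` call serves both the preset-period mode and the upload-instruction mode; only the trigger differs.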
In the embodiment of the application, because the total management node can dynamically adjust the correspondence between the sub-management nodes and the computing nodes, the sub-management nodes periodically determine the computing nodes to be monitored from the mapping relation table, which prevents computing nodes from being monitored repeatedly or overlooked.
Fig. 3 is a schematic flow chart of a data monitoring method according to an embodiment of the present application. As shown in Fig. 3, the method is applied to a sub-management node in the distributed monitoring system of any one of the above embodiments; for the specific structure of the distributed monitoring system, refer to the above embodiments, which is not repeated here. The method includes:
step 301: collecting monitoring data of corresponding computing nodes;
step 302: and sending the monitoring data to the message middleware so that the general management node acquires the monitoring data from the message middleware.
In a specific implementation process, the main function of the sub-management nodes is to collect monitoring data of the corresponding computing nodes and send the monitoring data to the message middleware, which caches it. When the total management node needs the monitoring data, it can read the data from the message middleware. It should be noted that the computing nodes monitored by each sub-management node are assigned by the total management node.
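Steps 301 and 302, together with the read performed by the total management node, can be sketched end to end as follows. The function names, the message fields, and the use of `queue.Queue` as a stand-in for the message middleware are all assumptions made for illustration.

```python
import queue
from typing import List


def collect(compute_nodes: List[str]) -> List[dict]:
    """Step 301: gather one placeholder sample per monitored compute node."""
    return [{"node": name, "cpu_percent": 0.0} for name in compute_nodes]


def send(middleware: queue.Queue, sub_node: str, samples: List[dict]) -> None:
    """Step 302: publish each sample, tagged with the sending sub-node."""
    for sample in samples:
        middleware.put({"source": sub_node, **sample})


def read_all(middleware: queue.Queue) -> List[dict]:
    """Total management node side: drain whatever the middleware has cached."""
    messages = []
    while True:
        try:
            messages.append(middleware.get_nowait())
        except queue.Empty:
            return messages
```

Because the sub-management nodes only write to the middleware and the total management node only reads from it, the two sides never need to be online at the same moment.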
According to the embodiment of the application, the sub-management nodes respectively monitor a part of the computing nodes and send the monitoring data to the message middleware for caching, so that the load of the total management node is reduced.
On the basis of the above embodiment, the acquiring monitoring data of the corresponding computing node includes:
acquiring a mapping relation table from a total management node, judging that the corresponding relation between the management node and the computing node in the mapping relation table changes according to the mapping relation table, and acquiring the computing node to be monitored from the mapping relation table;
and resetting the acquisition scheduling list, and acquiring the monitoring data of the corresponding computing node according to the reset acquisition scheduling list.
In a specific implementation process, in order to ensure load balance among the sub-management nodes, the total management node dynamically adjusts the computing nodes monitored by each sub-management node. Therefore, to ensure that the computing nodes monitored by each sub-management node are consistent with those allocated by the total management node, the sub-management nodes may periodically obtain the mapping relation table from the total management node. It can be understood that the mapping relation table stores the correspondence between each sub-management node and the computing nodes it monitors. After a sub-management node acquires the mapping relation table, it can look up which computing nodes it needs to monitor and judge whether these computing nodes have changed compared with before. If so, it stops monitoring the previous computing nodes, resets the acquisition scheduling list according to the computing nodes obtained from the mapping relation table, and then starts collecting the monitoring data of those computing nodes.
It can be understood that, after the total management node updates the mapping relationship table, in order to enable the sub-management nodes to receive the new mapping relationship table in time, the total management node may actively send the updated mapping relationship table to the sub-management nodes. The updated mapping relation table may be specifically sent to all the sub-management nodes, or may be sent only to the sub-management node whose correspondence relationship has changed.
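Sending the updated table only to the affected sub-management nodes requires knowing whose correspondence changed. A minimal sketch, assuming the mapping relation table is a dictionary from sub-management node name to a list of computing node names (the function name `changed_sub_nodes` is hypothetical):

```python
from typing import Dict, List


def changed_sub_nodes(
    old: Dict[str, List[str]], new: Dict[str, List[str]]
) -> List[str]:
    """Return the sub-management nodes whose assigned compute nodes differ.

    Order within an assignment is ignored; a sub-node that appears in only
    one table is reported unless its assignment is empty.
    """
    names = set(old) | set(new)
    return sorted(
        name for name in names if set(old.get(name, [])) != set(new.get(name, []))
    )
```

The total management node could push the new table to the returned names only, or ignore the result and broadcast to every sub-management node.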
After the sub-management node resets the acquisition scheduling list according to the updated mapping relation table, it collects the monitoring data according to the acquisition scheduling list and sends the monitoring data to the message middleware, and the message middleware can store the monitoring data in a corresponding message queue. When the total management node needs the monitoring data, it reads the monitoring data from the message queue in the message middleware.
In the embodiment of the application, because the total management node can dynamically adjust the corresponding relationship between the sub-management nodes and the computing nodes, the sub-management nodes can ensure that the computing nodes actually monitored by the sub-management nodes are consistent with the computing nodes distributed by the total management node by periodically determining the computing nodes to be monitored from the mapping relationship table.
Fig. 4 shows another data monitoring method provided in an embodiment of the present application, which is applied to the total management node in the distributed monitoring system of the foregoing embodiments. For the specific architecture of the distributed monitoring system, refer to the foregoing embodiments; details are not repeated here. The method comprises the following steps:
step 401: configuring a computing node responsible for monitoring for each sub-management node so that the sub-management node collects monitoring data of the corresponding computing node and sends the monitoring data to the message middleware;
step 402: and reading the monitoring data from the message middleware.
In a specific implementation process, after the system is initialized, each sub-management node needs to be configured with the computing nodes that it is responsible for monitoring. The specific configuration may be based on the total number of computing nodes in the system, the number of sub-management nodes, and the current load of each sub-management node.
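One way to take the current load into account is a greedy least-loaded assignment, sketched below. This is an assumed policy, not one stated in the application; the function name `configure` and the idea of measuring load as a simple integer that grows by one per assigned computing node are hypothetical.

```python
import heapq
from typing import Dict, List


def configure(
    compute_nodes: List[str], current_load: Dict[str, int]
) -> Dict[str, List[str]]:
    """Give each compute node to the currently least-loaded sub-node.

    current_load maps a sub-management node name to its load before the
    assignment; each assigned compute node is assumed to add one unit.
    """
    if not current_load:
        raise ValueError("at least one sub-management node is required")
    heap = [(load, name) for name, load in current_load.items()]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {name: [] for name in current_load}
    for node in compute_nodes:
        load, name = heapq.heappop(heap)
        assignment[name].append(node)
        heapq.heappush(heap, (load + 1, name))
    return assignment


# Hypothetical example: sub-1 already carries two units of load.
result = configure(["cn-01", "cn-02", "cn-03", "cn-04"], {"sub-1": 2, "sub-2": 0})
```

In the example, the idle sub-management node receives computing nodes first until its load catches up with the busy one, after which the remaining nodes alternate between them.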
Because the monitoring data of the computing nodes uploaded by each sub-management node is cached in the message middleware, when the master management node needs to acquire the monitoring data of each computing node, the monitoring data can be read from the message middleware.
According to the embodiment of the application, the sub-management nodes respectively monitor a part of the computing nodes and send the monitoring data to the message middleware for caching, so that the load of the total management node is reduced.
On the basis of the above embodiment, the method further includes:
acquiring the working states of each sub-management node and each computing node;
determining whether a mapping relation table needs to be updated according to the working states of the sub-management nodes and the computing nodes; the mapping relation table comprises the corresponding relation of the computing nodes which are monitored by the sub-management nodes;
and updating the mapping relation table under the condition that the updating is needed.
In a specific implementation process, the working states of the sub-management nodes and the computing nodes in the distributed monitoring system may change. For example, if a sub-management node fails, the computing nodes it originally monitored are no longer monitored; if multiple computing nodes monitored by a sub-management node fail, the load of that sub-management node becomes smaller than that of the other sub-management nodes; and if a new computing node is added, a sub-management node needs to be allocated to it. All of these cases require the total management node to adjust the correspondence between the sub-management nodes and the computing nodes. After the correspondence is adjusted, the mapping relation table is updated synchronously. In this way, on the one hand, load balance among the sub-management nodes is ensured and, on the other hand, no computing node is left unmonitored.
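The three situations listed above can be turned into a single check, sketched here under explicit assumptions: working states are modelled as booleans, load is counted as the number of working computing nodes a sub-management node monitors, and `max_gap` is a hypothetical imbalance threshold that the application does not define.

```python
from typing import Dict, List


def needs_update(
    table: Dict[str, List[str]],
    sub_state: Dict[str, bool],
    compute_state: Dict[str, bool],
    max_gap: int = 1,
) -> bool:
    """Decide whether the mapping relation table should be rebuilt."""
    # Case 1: a failed sub-management node still owns compute nodes.
    if any(nodes and not sub_state.get(sub, False) for sub, nodes in table.items()):
        return True
    # Case 2: a working compute node (e.g. newly added) is not assigned.
    assigned = {node for nodes in table.values() for node in nodes}
    if any(ok and node not in assigned for node, ok in compute_state.items()):
        return True
    # Case 3: working sub-nodes differ too much in working compute nodes.
    loads = [
        sum(1 for node in table.get(sub, []) if compute_state.get(node, False))
        for sub, ok in sub_state.items()
        if ok
    ]
    return bool(loads) and max(loads) - min(loads) > max_gap
```

When the function returns True, the total management node would recompute the allocation and update the mapping relation table in the same step.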
On the basis of the above embodiment, the method further includes:
receiving a query request of a sub-management node for querying a mapping relation table;
and sending the mapping relation table to the sub-management nodes.
In a specific implementation process, the sub-management nodes may periodically send a query request for querying the mapping relationship table to the master management node, and after receiving the query request, the master management node may send the mapping relationship table to the corresponding sub-management node, so that the sub-management nodes may find the computing nodes monitored by the sub-management nodes from the mapping relationship table.
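The query exchange can be sketched as a pair of functions, one for each side. The JSON encoding and the names `handle_query` and `nodes_for` are assumptions for illustration; the application only states that the table is returned in response to a query request.

```python
import json
from typing import Dict, List


def handle_query(mapping_table: Dict[str, List[str]]) -> str:
    """Total management node side: serialize the whole table as the reply."""
    return json.dumps(mapping_table)


def nodes_for(reply: str, sub_node: str) -> List[str]:
    """Sub-management node side: look up its own entry in the reply."""
    return json.loads(reply).get(sub_node, [])
```

A sub-management node that is absent from the table simply gets an empty list, which it can treat as having nothing to monitor.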
According to the embodiment of the application, the mapping relation table is maintained in the total management node, so that the consistency of the computing nodes distributed by the total management node for the sub-management nodes and the computing nodes actually monitored by the sub-management nodes is ensured.
Fig. 5 is a schematic structural diagram of a data monitoring apparatus according to an embodiment of the present application, where the apparatus may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 3, and can perform various steps related to the embodiment of the method of fig. 3, and the specific functions of the apparatus can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy. The device includes: a data acquisition module 501 and a data transmission module 502, wherein:
the data acquisition module 501 is configured to collect monitoring data of a corresponding computing node; the data sending module 502 is configured to send the monitoring data to the message middleware, so that the total management node obtains the monitoring data from the message middleware.
On the basis of the above embodiment, the data acquisition module 501 is specifically configured to:
acquiring a mapping relation table from a total management node, judging that the corresponding relation between the management node and the computing node in the mapping relation table changes according to the mapping relation table, and acquiring the computing node to be monitored from the mapping relation table; and resetting the acquisition scheduling list, and acquiring the monitoring data of the corresponding computing node according to the reset acquisition scheduling list.
On the basis of the above embodiment, the apparatus further includes a data sending module, configured to:
and sending the monitoring data to message middleware so that the overall management node reads the monitoring data from the message middleware.
Fig. 6 is a schematic structural diagram of another data monitoring apparatus provided in this embodiment of the present application, where the apparatus may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 4, and can perform various steps related to the embodiment of the method of fig. 4, and the specific functions of the apparatus can be referred to the description above, and the detailed description is appropriately omitted here to avoid redundancy. The device includes: a configuration module 601 and a data reading module 602, wherein:
the configuration module 601 is configured to configure a computing node responsible for monitoring for each sub-management node, so that the sub-management node collects monitoring data of the corresponding computing node and sends the monitoring data to the message middleware; the data reading module 602 is configured to read the monitoring data from the message middleware.
Fig. 7 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of the present application. As shown in Fig. 7, the electronic device includes: a processor 701, a memory 702, and a bus 703; wherein:
the processor 701 and the memory 702 complete communication with each other through the bus 703;
the processor 701 is configured to call the program instructions in the memory 702 to execute the methods provided by the above-mentioned method embodiments, for example, including: collecting monitoring data of corresponding computing nodes; and sending the monitoring data to the message middleware so that the total management node acquires the monitoring data from the message middleware; or: configuring, for each sub-management node, the computing nodes it is responsible for monitoring, so that the sub-management node collects monitoring data of the corresponding computing nodes and sends the monitoring data to the message middleware; and reading the monitoring data from the message middleware.
The processor 701 may be an integrated circuit chip having signal processing capabilities. The processor 701 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 702 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments, for example, including: collecting monitoring data of corresponding computing nodes; and sending the monitoring data to the message middleware so that the total management node acquires the monitoring data from the message middleware; or: configuring, for each sub-management node, the computing nodes it is responsible for monitoring, so that the sub-management node collects monitoring data of the corresponding computing nodes and sends the monitoring data to the message middleware; and reading the monitoring data from the message middleware.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: collecting monitoring data of corresponding computing nodes; and sending the monitoring data to the message middleware so that the total management node acquires the monitoring data from the message middleware; or: configuring, for each sub-management node, the computing nodes it is responsible for monitoring, so that the sub-management node collects monitoring data of the corresponding computing nodes and sends the monitoring data to the message middleware; and reading the monitoring data from the message middleware.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A distributed monitoring system, comprising: the system comprises a general management node, a plurality of sub management nodes, message middleware and a plurality of computing nodes;
the sub-management node is used for collecting monitoring data of the corresponding computing node and sending the monitoring data to the message middleware;
the message middleware is used for caching the monitoring data;
and the general management node acquires the monitoring data from the message middleware to realize the monitoring of the plurality of computing nodes.
2. The system of claim 1, wherein the plurality of sub-management nodes form a plurality of monitoring layers; the sub-management nodes in the lowest of the monitoring layers are in communication connection with the computing nodes and are used for acquiring monitoring data of the corresponding computing nodes and transmitting the monitoring data to the sub-management nodes of the intermediate monitoring layer above; each intermediate monitoring layer sends the monitoring data to the monitoring layer above it; and the sub-management nodes in the uppermost monitoring layer are in communication connection with the message middleware and are used for sending the monitoring data to the message middleware.
3. The system according to claim 1, wherein the master management node is further configured to monitor a working state of each sub-management node, allocate a corresponding computing node to each sub-management node according to the working state of each sub-management node, and generate a corresponding mapping relationship table; the mapping relation table comprises the corresponding relation of the computing nodes which are monitored by the sub-management nodes.
4. The system of claim 3, wherein the sub-management node is specifically configured to:
and acquiring the mapping relation table from the total management node regularly, determining the computing node to be monitored according to the mapping relation table, and collecting monitoring data of the computing node to be monitored.
5. The system according to claim 4, wherein the sub-management node stores therein an acquisition schedule list; the collection scheduling list is used for representing the range of the computing nodes managed by the corresponding sub-management nodes; the sub-management node determines the computing node to be monitored according to the mapping relation table and collects monitoring data of the computing node to be monitored, and the sub-management node comprises the following steps:
if the corresponding relation between the management node and the computing node in the mapping relation table changes, the computing node to be monitored is obtained from the mapping relation table;
and resetting the acquisition scheduling list, and acquiring the monitoring data of the corresponding computing node according to the reset acquisition scheduling list.
6. A data monitoring method is characterized in that the method is applied to sub management nodes in a distributed monitoring system, and the system comprises a main management node, a plurality of sub management nodes, message middleware and a plurality of computing nodes; the sub-management node is used for collecting monitoring data of the corresponding computing node and sending the monitoring data to the message middleware; the message middleware is used for caching the monitoring data; the general management node acquires the monitoring data from the message middleware to realize the monitoring of the plurality of computing nodes, and the method comprises the following steps:
collecting monitoring data of corresponding computing nodes;
and sending the monitoring data to the message middleware so that the general management node acquires the monitoring data from the message middleware.
7. The method of claim 6, wherein collecting monitoring data for a corresponding compute node comprises:
acquiring a mapping relation table from a total management node, judging that the corresponding relation between the management node and the computing node in the mapping relation table changes according to the mapping relation table, and acquiring the computing node to be monitored from the mapping relation table;
and resetting the acquisition scheduling list, and acquiring the monitoring data of the corresponding computing node according to the reset acquisition scheduling list.
8. A data monitoring method is characterized in that the method is applied to a main management node in a distributed monitoring system, and the system comprises the main management node, a plurality of sub-management nodes, message middleware and a plurality of computing nodes; the sub-management node is used for collecting monitoring data of the corresponding computing node and sending the monitoring data to the message middleware; the message middleware is used for caching the monitoring data; the general management node acquires the monitoring data from the message middleware to realize the monitoring of the plurality of computing nodes, and the method comprises the following steps:
configuring a computing node responsible for monitoring for each sub-management node so that the sub-management node collects monitoring data of the corresponding computing node and sends the monitoring data to the message middleware;
and reading the monitoring data from the message middleware.
9. The method of claim 8, further comprising:
acquiring the working states of each sub-management node and each computing node;
determining whether a mapping relation table needs to be updated according to the working states of the sub-management nodes and the computing nodes; the mapping relation table comprises the corresponding relation of the computing nodes which are monitored by the sub-management nodes;
and updating the mapping relation table under the condition that the updating is needed.
10. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any one of claims 6-9.
11. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 6-9.
CN202010747493.0A 2020-07-29 2020-07-29 Distributed monitoring system, monitoring method and device Pending CN111880997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010747493.0A CN111880997A (en) 2020-07-29 2020-07-29 Distributed monitoring system, monitoring method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010747493.0A CN111880997A (en) 2020-07-29 2020-07-29 Distributed monitoring system, monitoring method and device

Publications (1)

Publication Number Publication Date
CN111880997A true CN111880997A (en) 2020-11-03

Family

ID=73201151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010747493.0A Pending CN111880997A (en) 2020-07-29 2020-07-29 Distributed monitoring system, monitoring method and device

Country Status (1)

Country Link
CN (1) CN111880997A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104184819A (en) * 2014-08-29 2014-12-03 城云科技(杭州)有限公司 Multi-hierarchy load balancing cloud resource monitoring method
CN104935482A (en) * 2015-06-26 2015-09-23 曙光信息产业(北京)有限公司 Distributed monitoring system and method
US20200210261A1 (en) * 2016-11-29 2020-07-02 Intel Corporation Technologies for monitoring node cluster health

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463330A (en) * 2020-11-30 2021-03-09 江苏金鑫信息技术有限公司 Application bus monitoring system based on middleware technology
CN112804337A (en) * 2021-01-22 2021-05-14 苏州浪潮智能科技有限公司 Main node pressure allocation method and device, electronic equipment and storage medium
CN112882901A (en) * 2021-03-04 2021-06-01 中国航空工业集团公司西安航空计算技术研究所 Intelligent health state monitor of distributed processing system
CN114726862A (en) * 2022-05-17 2022-07-08 中诚华隆计算机技术有限公司 Method and system for determining operation state of computing node based on state monitoring chip
CN114726862B (en) * 2022-05-17 2022-08-23 中诚华隆计算机技术有限公司 Method and system for determining operation state of computing node based on state monitoring chip

Similar Documents

Publication Publication Date Title
CN111880997A (en) Distributed monitoring system, monitoring method and device
CN109412870B (en) Alarm monitoring method and platform, server and storage medium
CN107925612B (en) Network monitoring system, network monitoring method, and computer-readable medium
CN111049705B (en) Method and device for monitoring distributed storage system
CN108683720B (en) Container cluster service configuration method and device
US20150286507A1 (en) Method, node and computer program for enabling automatic adaptation of resource units
CN111818159B (en) Management method, device, equipment and storage medium of data processing node
JP2019036313A (en) High performance control server system
CN110971480B (en) Computer network condition monitoring method and device, computer equipment and storage medium
US20040083246A1 (en) Method and system for performance management in a computer system
CN115115030A (en) System monitoring method and device, electronic equipment and storage medium
CN115248826A (en) Method and system for large-scale distributed graph database cluster operation and maintenance management
KR20170084445A (en) Method and apparatus for detecting abnormality using time-series data
CN113672345A (en) IO prediction-based cloud virtualization engine distributed resource scheduling method
CN113190524A (en) Industrial big data acquisition method and system
CN111339466A (en) Interface management method and device, electronic equipment and readable storage medium
CN114629883A (en) Service request processing method and device, electronic equipment and storage medium
US10282245B1 (en) Root cause detection and monitoring for storage systems
CN104488227A (en) Method for isolated anomaly detection in large-scale data processing systems
CN109510730A (en) Distributed system and its monitoring method, device, electronic equipment and storage medium
CN112751722B (en) Data transmission quality monitoring method and system
US10223189B1 (en) Root cause detection and monitoring for storage systems
CN115469966A (en) Elastic expansion method and device for container cloud service
KR101725192B1 (en) Resource information management and data storage system through the management of the resource template
CN115168042A (en) Management method and device of monitoring cluster, computer storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211011

Address after: Building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing 100089

Applicant after: Dawning Information Industry (Beijing) Co.,Ltd.

Applicant after: WUXI CITY CLOUD COMPUTING CENTER CO.,LTD.

Applicant after: ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.

Address before: 100000 building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing

Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.

Applicant before: WUXI CITY CLOUD COMPUTING CENTER CO.,LTD.

RJ01 Rejection of invention patent application after publication

Application publication date: 20201103
