CN111049705B

CN111049705B - Method and device for monitoring distributed storage system

Info

Publication number: CN111049705B
Application number: CN201911336662.5A
Authority: CN
Inventors: 龚治文; 饶俊明; 卢道和; 郑晓腾; 龚洵峰; 刘生庆; 吴立; 吴传民
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2023-09-12
Anticipated expiration: 2039-12-23
Also published as: CN111049705A; WO2021129367A1

Abstract

The invention provides a method and a device for monitoring a distributed storage system. A monitoring server sends acquisition instructions to each cluster in the distributed storage system; the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster itself and state data of clients connected with the cluster; for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform. According to the scheme, because the monitoring server issues the acquisition instruction to each cluster in the distributed storage system, the monitoring server can monitor a plurality of clusters simultaneously; in addition, because the monitoring data fed back by each cluster comprises the state data of the clients connected with the cluster, the monitoring server can determine the alarm information by analyzing that state data, and thereby monitors the clients connected with the cluster.

Description

Method and device for monitoring distributed storage system

Technical Field

The present invention relates to the field of financial technology (Fintech), and in particular, to a method and apparatus for monitoring a distributed storage system.

Background

With the development of computer technology, more and more technologies (such as blockchain, cloud computing or big data) are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech); big data technology is no exception. However, because of the security and real-time requirements of the finance and payment industries, higher requirements are also placed on big data technology.

To handle massive data with scalability and high availability, the banking industry generally selects a distributed storage system such as CephFS (Ceph File System) as its shared storage solution, with Ceph Fuse clients connected to CephFS; meanwhile, those skilled in the art typically use a monitoring system such as the open-source Prometheus to monitor CephFS. Prometheus mainly comprises Exporters, the Prometheus Server and other parts; CephFS mainly comprises components such as the Monitor (MON), the Object Storage Device (OSD) and the MetaData Server (MDS), and placement groups (PGs) are distributed across the CephFS OSD components.

Aiming at the technical scheme of monitoring CephFS by Prometheus in the prior art, the following two problems exist:

First, Prometheus's monitoring of CephFS mainly consists of collecting data on the status of the CephFS OSD components and the status of the CephFS PGs; Prometheus does not monitor the Ceph Fuse clients.

Second, the architecture that Prometheus uses to monitor CephFS is very bulky: a separate Prometheus deployment is needed for each CephFS; furthermore, because CephFS versions differ, a different Exporter needs to be deployed for each CephFS version. FIG. 1 shows a prior-art architecture in which Prometheus monitors CephFS. Referring to FIG. 1, Exporter_M collects monitoring data of CephFS_M and, if the collected monitoring data meets a rule for generating alarm information, reports the generated alarm information to Prometheus Server_M; similarly, Exporter_N collects monitoring data of CephFS_N and, if the collected monitoring data meets the rule for generating alarm information, reports the generated alarm information to Prometheus Server_N. However, because Exporter_M does not match the version of CephFS_N, Exporter_M cannot be used to collect the monitoring data of CephFS_N and report alarm information for CephFS_N. That is, high availability is not achieved among the Prometheus Server, the Exporter and CephFS, so monitoring information cannot be reported in time under abnormal conditions.

To sum up, the prior art has two problems: Prometheus cannot monitor Ceph Fuse clients, and Prometheus monitors CephFS inefficiently.

Disclosure of Invention

The invention provides a method and a device for monitoring a distributed storage system, which are used for solving the problems that Prometheus cannot monitor Ceph Fuse clients and that Prometheus monitors CephFS inefficiently.

In a first aspect, an embodiment of the present invention provides a method for monitoring a distributed storage system, the method including: the monitoring server sends acquisition instructions to each cluster in the distributed storage system; the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the clusters and state data of clients connected with the clusters; for at least one cluster, the monitoring server determines alarm information from monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform.

Based on the scheme, the monitoring server can monitor a plurality of clusters simultaneously by sending the acquisition instruction to each cluster in the distributed storage system, so that the situation that the monitoring server cannot effectively monitor each cluster due to mismatching of the clusters and the Exporter version is avoided; in addition, the monitoring data fed back to the monitoring server by each cluster also comprises state data of the clients connected with the cluster, which is beneficial to the monitoring server to determine the alarm information through analyzing the state data of the clients connected with the cluster, thereby realizing the purpose of monitoring the clients connected with the cluster by the monitoring server.

As a possible implementation method, there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and all node servers that are connected to clients are connected to the same clients; the monitoring server sending acquisition instructions to each cluster in the distributed storage system comprises: for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.

Based on the scheme, a plurality of monitoring servers are arranged for the distributed storage system. On the one hand, monitoring data can be obtained frequently from each cluster in the distributed storage system, so that comprehensive and even real-time monitoring of the distributed storage system can be achieved; on the other hand, when one or more monitoring servers are down, the distributed storage system can still be monitored by the other available monitoring servers. Each of the plurality of monitoring servers issues an acquisition instruction to at least two node servers in each cluster, so that when one node server is down, the monitoring server can acquire the monitoring data of that node server's cluster from the other available node servers, and each cluster is still monitored effectively.
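
The node-level fallback described above can be illustrated with a minimal sketch. It assumes a hypothetical `query` callable that returns a cluster's monitoring data from one node server and raises a hypothetical `NodeDown` exception when that node is unreachable; neither name comes from the patent, and the patent does not prescribe this exact control flow.

```python
from typing import Callable, Dict, Optional, Sequence


class NodeDown(Exception):
    """Hypothetical error raised when a node server is unreachable."""


def collect_with_failover(
    nodes: Sequence[str],
    query: Callable[[str], Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Return monitoring data from the first reachable node, or None if all are down."""
    if len(nodes) < 2:
        raise ValueError("at least two node servers are required for failover")
    for node in nodes:
        try:
            return query(node)
        except NodeDown:
            continue  # fall back to the next node server of the same cluster
    return None


def fake_query(node: str) -> Dict[str, str]:
    # Simulated cluster: A1 is down, the other node servers answer normally.
    if node == "A1":
        raise NodeDown(node)
    return {"source": node, "mds": "active"}


result = collect_with_failover(["A1", "A2", "A4"], fake_query)
```

Here the data comes from A2 because A1 is simulated as down; with a single target node the same outage would leave the cluster unmonitored.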

As one possible implementation, the alarm rule includes an alarm generation rule; the monitoring server determining alarm information from the monitoring data according to a preset alarm rule comprises: the monitoring server determines, from the monitoring data, first clients whose connection state with the cluster has changed; the monitoring server determines, according to the service change of the cluster, second clients whose connection state with the cluster has changed; and the monitoring server generates the alarm information of the clients according to the alarm generation rule and the clients that belong to the first clients but not to the second clients.

Based on the scheme, the first clients whose connection state with the cluster has changed are determined by analyzing the monitoring data, the second clients whose connection state with the cluster has changed are determined by analyzing the known service change, and by comparing the first clients with the second clients, alarm information can be generated for exactly those clients whose change was caused by an abnormality.

As one possible implementation method, the alarm rule further includes an alarm suppression rule; the monitoring server determines the change duration of the service change of the cluster; the monitoring server sets an alarm suppression rule of the alarm information of the client, and the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.

Based on the scheme, after determining how long the cluster's service change will take, the monitoring server does not report the alarm information of the affected clients to the alarm platform during that time, so that known and useless alarms are effectively avoided.

As a possible implementation method, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the monitoring server reporting the alarm information to an alarm platform according to a preset alarm rule comprises: if the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, the monitoring server reports the alarm information of the MDS component to the alarm platform.

Based on the scheme, when the monitoring server obtains alarm information of the MDS component of a cluster and alarm information of a client connected with the cluster at the same time, and determines that the alarm level of the former is higher than that of the latter, it reports the alarm information of the MDS component to the alarm platform. Because an abnormality of the MDS component of the cluster may itself cause an abnormal event at a client connected with the cluster, the lower-level alarm information of the client is automatically shielded.

As a possible implementation method, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the method further includes: and the monitoring server sets cluster identifiers corresponding to the monitoring data.

Based on the scheme, the monitoring server marks each piece of acquired monitoring data with its corresponding cluster, so that it can quickly take the corresponding alarm action when it later receives the same monitoring data from the same cluster.

As a possible implementation method, the alarm rule further includes an alarm convergence rule; the monitoring server reporting the alarm information to an alarm platform according to a preset alarm rule comprises: when the monitoring server determines that the alarm information repeats alarm information that has already appeared for the cluster (that is, it is not appearing for the first time), the monitoring server reports the alarm information to the alarm platform after a set delay, according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.

Based on the scheme, after the monitoring server determines that the alarm information repeats alarm information that has already appeared for a certain cluster, it reports the repeated alarm to the alarm platform only after the set delay given by the alarm convergence rule, which effectively prevents the waste of resources caused by the same alarm being sent continuously and repeatedly for the cluster.

In a second aspect, an embodiment of the present invention provides an apparatus for monitoring a distributed storage system, the apparatus including: the sending unit is used for sending acquisition instructions to each cluster in the distributed storage system; the acquisition unit is used for acquiring monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the clusters and state data of clients connected with the clusters; the determining unit is used for determining alarm information from monitoring data of at least one cluster according to a preset alarm rule and reporting the alarm information to an alarm platform.

As a possible implementation method, there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and all node servers that are connected to clients are connected to the same clients; the sending unit is specifically configured to, for any monitoring server, send an acquisition instruction to at least two node servers in any cluster.

As one possible implementation, the alarm rule includes an alarm generation rule; the determining unit is specifically configured to determine, from the monitoring data, first clients whose connection state with the cluster has changed; determine, according to the service change of the cluster, second clients whose connection state with the cluster has changed; and generate the alarm information of the clients according to the alarm generation rule and the clients that belong to the first clients but not to the second clients.

As one possible implementation method, the alarm rule further includes an alarm suppression rule; the determining unit is specifically configured to determine a change duration of a service change of the cluster; setting an alarm suppression rule of the alarm information of the client, wherein the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.

As a possible implementation method, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the determining unit is specifically configured to determine that the alarm level of the alarm information of the MDS component is higher than the alarm information of the client, and report the alarm information of the MDS component to an alarm platform.

As a possible implementation method, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the determining unit is further configured to set a cluster identifier corresponding to each monitoring data.

As a possible implementation method, the alarm rule further includes an alarm convergence rule; the determining unit is specifically configured to determine that the alarm information repeats alarm information that has already appeared for the cluster, and to report the alarm information to the alarm platform after a set delay according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.

In a third aspect, embodiments of the present invention provide a computing device comprising:

a memory for storing program instructions;

and a processor for invoking program instructions stored in said memory and executing the method according to any of the first aspects in accordance with the obtained program.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of the first aspects.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a diagram of a prior-art architecture in which Prometheus monitors CephFS;

FIG. 2 is a diagram of a method for monitoring a distributed storage system according to the present invention;

FIG. 3 is a schematic diagram of an architecture in which Prometheus monitors CephFS in accordance with the present invention;

FIG. 4 is a schematic diagram of an apparatus for monitoring a distributed storage system according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in FIG. 2, a method for monitoring a distributed storage system according to an embodiment of the present invention includes:

Step 201, a monitoring server sends acquisition instructions to each cluster in the distributed storage system.

Step 202, the monitoring server obtains monitoring data fed back by each cluster based on the collection instruction, where the monitoring data includes health data of the cluster and status data of clients connected to the cluster.

Step 203, for at least one cluster, the monitoring server determines alarm information from monitoring data of the cluster according to a preset alarm rule, and reports the alarm information to an alarm platform.
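
Steps 201 to 203 can be sketched as one collection pass. The following is only an illustration under stated assumptions: the `MonitoringData` fields, the `offline_client_rule` rule and the in-memory `reported` list standing in for the alarm platform are all hypothetical, and each cluster is simulated by a plain function rather than a real acquisition instruction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MonitoringData:
    """Data fed back by one cluster (field names are illustrative)."""
    cluster_id: str
    health: Dict[str, str]             # health data of the cluster itself
    client_connected: Dict[str, bool]  # client name -> still connected?


# A preset alarm rule maps monitoring data to zero or more alarm messages.
AlarmRule = Callable[[MonitoringData], List[str]]


def offline_client_rule(data: MonitoringData) -> List[str]:
    """Hypothetical rule: one alarm per client that is no longer connected."""
    return [
        f"{data.cluster_id}: client {name} offline"
        for name, connected in sorted(data.client_connected.items())
        if not connected
    ]


def monitor_once(
    fetchers: Dict[str, Callable[[], MonitoringData]],
    rules: List[AlarmRule],
    report: Callable[[str], None],
) -> None:
    # Steps 201-202: query every cluster and take the data it feeds back.
    for fetch in fetchers.values():
        data = fetch()
        # Step 203: derive alarm information with the preset rules and report it.
        for rule in rules:
            for alarm in rule(data):
                report(alarm)


def fake_cluster_a() -> MonitoringData:
    return MonitoringData("CephFS_A", {"osd": "up"}, {"W1": False, "W2": True})


def fake_cluster_b() -> MonitoringData:
    return MonitoringData("CephFS_B", {"osd": "up"}, {"V1": True})


reported: List[str] = []  # stands in for the alarm platform
monitor_once(
    {"CephFS_A": fake_cluster_a, "CephFS_B": fake_cluster_b},
    [offline_client_rule],
    reported.append,
)
```

A single monitoring server loops over all clusters, which is the point of the scheme: no per-cluster monitoring deployment is needed.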

In the step 201, the monitoring server sends an acquisition instruction to each cluster in the distributed storage system.

Suppose a plurality of clusters, for example 3 clusters, are set in the distributed storage system: the CephFS_A cluster, the CephFS_B cluster and the CephFS_C cluster. Prometheus serves as the monitoring server for CephFS, and the Prometheus Server in the monitoring server sends acquisition instructions to CephFS; specifically, the Prometheus Server sends acquisition instruction I to the CephFS_A cluster, to the CephFS_B cluster and to the CephFS_C cluster.

In the step 202, the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, where the monitoring data includes health data of the cluster itself and status data of clients connected to the cluster.

After the Prometheus Server issues acquisition instruction I to the CephFS_A cluster, the CephFS_A cluster responds to acquisition instruction I by producing monitoring data about the CephFS_A cluster, which the Prometheus Server thereby obtains; similarly, the Prometheus Server can obtain monitoring data about the CephFS_B cluster and about the CephFS_C cluster.

The monitoring data of the CephFS_A cluster may specifically comprise health data of the CephFS_A cluster itself (e.g., the running state of the OSD components and the state data of the PGs) and state data of the Ceph Fuse_A clients connected to the CephFS_A cluster (e.g., whether each Ceph Fuse_A client is connected to the CephFS_A cluster). For example, if 100 Ceph Fuse_A clients are connected to the CephFS_A cluster, the monitoring data about the CephFS_A cluster includes not only the health data of the CephFS_A cluster itself but also the state data of the 100 Ceph Fuse_A clients connected to it; the monitoring data about the CephFS_B cluster and the CephFS_C cluster are analogous and are not described again here.

In the step 203, for at least one cluster, the monitoring server determines the alarm information from the monitoring data of the cluster according to the preset alarm rule, and reports the alarm information to the alarm platform.

Suppose a preset alarm rule is set for the CephFS_A cluster. Prometheus determines alarm information about the CephFS_A cluster by analyzing the monitoring data acquired from the CephFS_A cluster; further, Prometheus reports the obtained alarm information about the CephFS_A cluster to the alarm platform, again on the basis of the preset alarm rule. The alarm platform may be an IMS system or another alarm platform, which is not limited here. Similarly, for the alarm process of Prometheus for the CephFS_B cluster and the CephFS_C cluster, reference may be made to the alarm process for the CephFS_A cluster, which is not described again here.

FIG. 3 shows a schematic diagram of Prometheus monitoring CephFS according to an embodiment of the present invention. Referring to FIG. 3, two monitoring servers, Prometheus Server_X and Prometheus Server_Y, are deployed to monitor a distributed storage system in which a CephFS_A cluster, a CephFS_B cluster and a CephFS_C cluster are deployed. The CephFS_A cluster comprises a plurality of node servers; for convenience of description, it is assumed to have 4 node servers, A1, A2, A3 and A4. Similarly, the CephFS_B cluster comprises a plurality of node servers and is assumed to have 4 node servers, B1, B2, B3 and B4; and the CephFS_C cluster comprises a plurality of node servers and is assumed to have 4 node servers, C1, C2, C3 and C4.

For the CephFS_A cluster, 100 Ceph Fuse_A clients are connected to the node servers configured with MDS components in the cluster; if 3 node servers in the CephFS_A cluster are configured with MDS components, all 100 Ceph Fuse_A clients are connected to those 3 node servers (not shown in the figure). Similarly, for the CephFS_B cluster, 200 Ceph Fuse_B clients are connected to the node servers configured with MDS components in the cluster; if 3 node servers in the CephFS_B cluster are configured with MDS components, all 200 Ceph Fuse_B clients are connected to those 3 node servers (not shown in the figure). Similarly, for the CephFS_C cluster, 300 Ceph Fuse_C clients are connected to the node servers configured with MDS components in the cluster; if 3 node servers in the CephFS_C cluster are configured with MDS components, all 300 Ceph Fuse_C clients are connected to those 3 node servers (not shown in the figure).

Prometheus Server_X, as a monitoring server, issues acquisition instructions to at least two node servers in each of the CephFS_A cluster, the CephFS_B cluster and the CephFS_C cluster, specifically as follows:

Suppose that at 8:00 am, Prometheus Server_X issues acquisition instruction I to the 3 node servers A1, A2 and A4 in the CephFS_A cluster; at the same time, Prometheus Server_X issues acquisition instruction I to the 3 node servers B1, B3 and B4 in the CephFS_B cluster and to the 3 node servers C1, C2 and C4 in the CephFS_C cluster.

When Prometheus Server_X issues the acquisition instruction to at least two node servers in the CephFS_A cluster, the target node servers are chosen at random from the CephFS_A cluster. For example, Prometheus Server_X may issue acquisition instruction I to the 3 node servers A1, A2 and A4 in the CephFS_A cluster as above, or to the 3 node servers A2, A3 and A4, or to the 3 node servers A1, A2 and A3; the present invention is not limited in this respect. Similarly, when Prometheus Server_X issues the acquisition instruction to at least two node servers in the CephFS_B cluster or in the CephFS_C cluster, the target node servers are chosen at random from that cluster.
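
The random choice of target node servers can be sketched with the standard library. The function name `pick_targets` and the fixed seed are assumptions made only for illustration; the patent states that the choice is random but does not say how it is drawn.

```python
import random
from typing import List, Sequence


def pick_targets(nodes: Sequence[str], count: int, rng: random.Random) -> List[str]:
    """Pick `count` distinct node servers of one cluster at random."""
    if count < 2:
        raise ValueError("the scheme requires at least two node servers")
    if count > len(nodes):
        raise ValueError("cannot pick more node servers than the cluster has")
    return rng.sample(list(nodes), count)


rng = random.Random(0)  # seeded only so that the sketch is repeatable
targets = pick_targets(["A1", "A2", "A3", "A4"], 3, rng)
```

`random.Random.sample` draws without replacement, so the same node server is never selected twice in one round.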

For example, suppose for convenience of description that 10 Ceph Fuse_A clients, W1, W2, W3, W4, W5, W6, W7, W8, W9 and W10, are connected to the node servers configured with MDS components in the CephFS_A cluster. Prometheus Server_X issues acquisition instruction I to the 3 node servers A1, A2 and A4 in the CephFS_A cluster. Prometheus Server_X first acquires the monitoring data on the A1 node server and, by analyzing it, determines that all 10 Ceph Fuse_A clients W1 to W10 are connected to the CephFS_A cluster; Prometheus Server_X then acquires the monitoring data on the A2 node server and, by analyzing it, determines that only the 3 Ceph Fuse_A clients W8, W9 and W10 are still connected to the CephFS_A cluster, while the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7 have gone offline from the CephFS_A cluster. That is, the first clients whose connection state with the cluster has changed are the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7.

For such an abnormal event at the Ceph Fuse_A clients, it is further necessary to determine why the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7 went offline from the CephFS_A cluster, i.e. whether each Ceph Fuse_A client was unmounted normally from the CephFS_A cluster or was unmounted passively because of a problem in the CephFS_A cluster itself.

For business needs, the services running on the CephFS_A cluster routinely unmount some of the clients connected to the CephFS_A cluster. For example, a business operator may unmount the 3 Ceph Fuse_A clients W5, W6 and W7 from the CephFS_A cluster. That is, the second clients whose connection state with the cluster has changed are the 3 Ceph Fuse_A clients W5, W6 and W7.

By comparing the first clients (the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7) with the second clients (the 3 Ceph Fuse_A clients W5, W6 and W7), it can be found that the unmounting of W5, W6 and W7 is a normal unmount event, so the offline state of these 3 Ceph Fuse_A clients in the monitoring data does not need to be reported to the IMS system; the unmounting of the 4 Ceph Fuse_A clients W1, W2, W3 and W4 is an abnormal unmount event, and the alarm information of these clients is generated according to the alarm generation rule.
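
The comparison in this example amounts to a set difference. The sketch below replays it with the client names from the text; the variable and function names are illustrative assumptions, and real monitoring data would of course not arrive as ready-made Python sets.

```python
from typing import Set


def abnormal_clients(first: Set[str], second: Set[str]) -> Set[str]:
    """Clients whose state changed (first) but not through a known service change (second)."""
    return first - second


# Clients seen as connected in the monitoring data on A1, then on A2.
seen_on_a1 = {f"W{i}" for i in range(1, 11)}
seen_on_a2 = {"W8", "W9", "W10"}

# First clients: connection state changed between the two observations.
first_clients = seen_on_a1 - seen_on_a2

# Second clients: unmounted on purpose as part of a known service change.
second_clients = {"W5", "W6", "W7"}

to_alarm = sorted(abnormal_clients(first_clients, second_clients))
```

Only W1, W2, W3 and W4 remain in `to_alarm`, matching the four abnormal unmount events identified in the text.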

As a possible implementation manner, the alarm rule further includes an alarm suppression rule; the monitoring server determines the change duration of the service change of the cluster; the monitoring server sets an alarm suppression rule of the alarm information of the client, and the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.

As in the previous example, if the 3 Ceph Fuse_A clients W5, W6 and W7 connected to the CephFS_A cluster are unmounted normally for business needs, and unmounting them takes 3 h, then for the whole 3 h following the acquisition of the monitoring data on the A2 node server, Prometheus Server_X does not report the offline events of W5, W6 and W7 to the IMS system. That is, Prometheus Server_X writes the events of the 3 Ceph Fuse_A clients W5, W6 and W7 going offline from the CephFS_A cluster into the alarm suppression rule.
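
A suppression rule of this kind can be modelled as a set of clients plus a time window. The sketch below is a minimal illustration; the `SuppressionRule` class, its fields and the calendar date are assumptions, and only the 3 h duration and the client names come from the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class SuppressionRule:
    """Do not report alarms for `clients` during [start, start + duration)."""
    clients: FrozenSet[str]
    start: datetime
    duration: timedelta

    def suppresses(self, client: str, at: datetime) -> bool:
        in_window = self.start <= at < self.start + self.duration
        return client in self.clients and in_window


# Arbitrary date; the 3 h change duration is taken from the example.
start = datetime(2020, 1, 1, 8, 0)
rule = SuppressionRule(frozenset({"W5", "W6", "W7"}), start, timedelta(hours=3))

suppressed_during_change = rule.suppresses("W5", start + timedelta(hours=1))
suppressed_after_change = rule.suppresses("W5", start + timedelta(hours=3))
suppressed_other_client = rule.suppresses("W1", start + timedelta(hours=1))
```

The window is treated as half-open here, so an offline event observed exactly 3 h after the start is reported again; whether the boundary is inclusive is a design choice the text leaves open.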

As in the previous example, the monitoring data that Prometheus Server_X obtains for the CephFS_A cluster includes health data of the CephFS_A cluster itself (e.g., the running state of the OSD components and the state data of the PGs) and state data of the Ceph Fuse_A clients connected to the CephFS_A cluster (e.g., whether each Ceph Fuse_A client is connected to the CephFS_A cluster). Suppose that at time T, Prometheus Server_X acquires monitoring data about the CephFS_A cluster showing that an MDS component in the CephFS_A cluster is running abnormally and, at the same time, that an abnormal unmount event has occurred at the Ceph Fuse_A client W1 connected to the CephFS_A cluster. Prometheus Server_X defines the alarm level of the abnormal event of the MDS component as high and the alarm level of the abnormal unmount event at W1 as low; Prometheus Server_X then reports the high-level alarm event to the IMS system, that is, it reports the abnormal event of the MDS component in the CephFS_A cluster to the IMS system but does not report the abnormal unmount event of the Ceph Fuse_A client W1.

It should be noted that the monitoring server may set the alarm level of the alarm information of the MDS component in a cluster higher than that of the alarm information of a client because an abnormality of the MDS component in the cluster may cause an abnormal event at a client connected to the cluster. Therefore, once the alarm information of the MDS component is reported to the IMS system and operation and maintenance personnel investigate, not only can the MDS component be restored to normal operation, but the clients connected to the cluster can be restored to the normal state as well.
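
The shielding of lower-level client alarms by an MDS alarm can be sketched as a filter over the alarms obtained for one cluster at one time. The `Alarm` type, the numeric encoding of the levels and the `select_for_report` function are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import List

HIGH, LOW = 2, 1  # assumed numeric encoding of the alarm levels


@dataclass(frozen=True)
class Alarm:
    source: str   # "mds" or "client"
    level: int
    message: str


def select_for_report(alarms: List[Alarm]) -> List[Alarm]:
    """Drop client alarms whose level is below that of a concurrent MDS alarm."""
    mds_levels = [a.level for a in alarms if a.source == "mds"]
    if not mds_levels:
        return list(alarms)  # no MDS alarm, so nothing is shielded
    top_mds = max(mds_levels)
    return [a for a in alarms if a.source != "client" or a.level >= top_mds]


mds_alarm = Alarm("mds", HIGH, "CephFS_A: MDS component abnormal")
client_alarm = Alarm("client", LOW, "CephFS_A: W1 abnormal unmount")

with_mds = select_for_report([mds_alarm, client_alarm])
without_mds = select_for_report([client_alarm])
```

When the MDS alarm is absent, the client alarm passes through unchanged, so shielding applies only while the probable root cause is itself being reported.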

As an example, referring to FIG. 3, Prometheus Server_X issues acquisition instruction I to the three node servers A1, A2 and A4 in the CephFS_A cluster, and at the same time to the three node servers B1, B3 and B4 in the CephFS_B cluster and to the three node servers C1, C2 and C4 in the CephFS_C cluster; when the CephFS_A cluster, the CephFS_B cluster and the CephFS_C cluster respond to acquisition instruction I, Prometheus Server_X obtains the monitoring data of each cluster. Each piece of monitoring data may be marked with the identifier of its cluster; for example, the first piece acquired by Prometheus Server_X is the monitoring data on the A1 node server of the CephFS_A cluster, the second piece is the monitoring data on the B3 node server of the CephFS_B cluster, the third piece is the monitoring data on the C4 node server of the CephFS_C cluster, and so on.
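
Marking each piece of monitoring data with its cluster identifier can be sketched as a lookup from node server to cluster. The `NODE_TO_CLUSTER` table, the record layout and the payload fields are assumptions for illustration; the node and cluster names follow FIG. 3 as described in the text.

```python
from typing import Dict, List, Tuple

# Node server -> cluster identifier, for the 3 clusters of 4 node servers each.
NODE_TO_CLUSTER: Dict[str, str] = {
    f"{letter}{i}": f"CephFS_{letter}" for letter in "ABC" for i in range(1, 5)
}


def tag_with_cluster(records: List[Tuple[str, Dict[str, str]]]) -> List[Dict[str, str]]:
    """Attach the cluster identifier to each (node server, payload) record."""
    return [
        {"cluster": NODE_TO_CLUSTER[node], "node": node, **payload}
        for node, payload in records
    ]


# The first three pieces of monitoring data in the order given in the example.
records = [
    ("A1", {"mds": "active"}),
    ("B3", {"mds": "active"}),
    ("C4", {"mds": "active"}),
]
tagged = tag_with_cluster(records)
clusters_in_order = [record["cluster"] for record in tagged]
```

With the identifier attached, later pieces of data from the same cluster can be matched against earlier ones without re-deriving where they came from.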

Continuing the previous example, suppose the first piece of monitoring data acquired by Prometheus Server_X comes from the CephFS_A cluster. After analyzing the first piece according to the preset alarm rule, Prometheus Server_X determines that it can be reported to the IMS system as alarm information; let the alarm information generated from the first piece be Info_1, with alarm level 1. If the sixth piece of monitoring data acquired by Prometheus Server_X also relates to the CephFS_A cluster, and analysis according to the preset alarm rule shows that the alarm information generated from the sixth piece matches Info_1, Prometheus Server_X must further determine, according to the alarm level of Info_1, when to report the sixth piece to the IMS system. If the alarm delay for alarm information of level 1 is set to 1 h, Prometheus Server_X will not report the Info_1 corresponding to the sixth piece of monitoring data to the IMS system during the next 1 h.

Suppose the second piece of monitoring data acquired by Prometheus Server_X comes from the CephFS_B cluster. After analyzing the second piece according to the preset alarm rule, Prometheus Server_X determines that it can be reported to the IMS system as alarm information; let the alarm information generated from the second piece be Info_2, with alarm level 2. If the ninth piece of monitoring data acquired by Prometheus Server_X also relates to the CephFS_B cluster, and analysis according to the preset alarm rule shows that the alarm information generated from the ninth piece matches Info_2, Prometheus Server_X must further determine, according to the alarm level of Info_2, when to report the ninth piece to the IMS system. If the alarm delay for alarm information of level 2 is set to 2 h, Prometheus Server_X will not report the Info_2 corresponding to the ninth piece of monitoring data to the IMS system during the next 2 h.

Suppose the third piece of monitoring data acquired by Prometheus Server_X comes from the CephFS_C cluster. After analyzing the third piece according to the preset alarm rule, Prometheus Server_X determines that it can be reported to the IMS system as alarm information; let the alarm information generated from the third piece be Info_3, with alarm level 3. If the tenth piece of monitoring data acquired by Prometheus Server_X also relates to the CephFS_C cluster, and analysis according to the preset alarm rule shows that the alarm information generated from the tenth piece matches Info_3, Prometheus Server_X must further determine, according to the alarm level of Info_3, when to report the tenth piece to the IMS system. If the alarm delay for alarm information of level 3 is set to 3 h, Prometheus Server_X will not report the Info_3 corresponding to the tenth piece of monitoring data to the IMS system during the next 3 h.

In the above example, the alarm level decreases from level 1 through level 2 to level 3, and the corresponding alarm delay grows accordingly: 1 h, 2 h, and 3 h, respectively.
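As a non-limiting illustration, the relationship between alarm level and alarm delay in the three examples above can be sketched as follows. The class `AlarmConverger` and its method are hypothetical names, and the sketch assumes that a repeated alarm is simply withheld until the delay for its level has elapsed since the same alarm was last reported; the embodiment does not fix these details.

```python
from datetime import datetime, timedelta

# Alarm delay per alarm level, as in the example: the lower the level
# (level 3 is lowest), the longer the delay.
ALARM_DELAY = {
    1: timedelta(hours=1),
    2: timedelta(hours=2),
    3: timedelta(hours=3),
}


class AlarmConverger:
    """Withhold repeats of the same alarm for the delay tied to its level."""

    def __init__(self, delays=ALARM_DELAY):
        self._delays = delays
        self._last_reported = {}  # (cluster, alarm_id) -> time of last report

    def should_report(self, cluster, alarm_id, level, now):
        key = (cluster, alarm_id)
        last = self._last_reported.get(key)
        if last is not None and now - last < self._delays[level]:
            return False  # same alarm, still inside its delay window
        self._last_reported[key] = now
        return True


converger = AlarmConverger()
t0 = datetime(2019, 12, 23, 9, 0)
first = converger.should_report("CephFS_A", "Info_1", 1, t0)
# Sixth piece of monitoring data, arriving 30 minutes later with the same alarm.
sixth = converger.should_report("CephFS_A", "Info_1", 1, t0 + timedelta(minutes=30))
```

Here `first` is reported immediately, while `sixth` is withheld because it repeats Info_1 within the 1 h delay of a level-1 alarm.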

Based on the same concept, the embodiment of the present invention further provides an apparatus for monitoring a distributed storage system, as shown in fig. 4, where the apparatus includes:

a sending unit 401, configured to send an acquisition instruction to each cluster in the distributed storage system;

an obtaining unit 402, configured to obtain monitoring data fed back by each cluster based on the acquisition instruction, where the monitoring data includes health data of the cluster itself and status data of a client connected to the cluster;

a determining unit 403, configured to determine, for at least one cluster, alarm information from the monitoring data of the cluster according to a preset alarm rule, and to report the alarm information to an alarm platform.

Further, for the device, there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and the node servers that are connected to clients are all connected to the same clients; for any monitoring server, the sending unit 401 is specifically configured to send an acquisition instruction to at least two node servers in any cluster.

Further, for the device, the alarm rule includes an alarm generation rule; the determining unit 403 is specifically configured to determine, from the monitoring data, a first client whose connection state with the cluster has changed; determine, according to a service change of the cluster, a second client whose connection state with the cluster changes; and generate alarm information of the client, according to the alarm generation rule, for each client that is included in the first client but not in the second client.
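As a non-limiting illustration, the alarm generation rule above amounts to a set difference: clients whose connection state changed, minus clients whose change is explained by a service change of the cluster. The function names and the alarm record layout below are hypothetical.

```python
def unexplained_clients(first_clients, second_clients):
    """Clients in the first set (observed changes) but not in the second
    set (changes expected from a service change of the cluster)."""
    return set(first_clients) - set(second_clients)


def generate_client_alarms(cluster, first_clients, second_clients):
    """Build one alarm record per client whose change is unexplained."""
    return [
        {"cluster": cluster, "client": client,
         "event": "unexpected connection-state change"}
        for client in sorted(unexplained_clients(first_clients, second_clients))
    ]


# W1, W2 and W3 changed connection state; only W2 was part of a service change.
alarms = generate_client_alarms("CephFS_A", {"W1", "W2", "W3"}, {"W2"})
```

With these inputs, alarm records are generated for W1 and W3 only.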

Further, for the device, the alarm rule further includes an alarm suppression rule; the determining unit 403 is specifically configured to determine a change duration of a service change of the cluster, and to set an alarm suppression rule for the alarm information of the client, where the alarm suppression rule of the client is used so that alarm information of the client generated within the change duration is not reported.
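As a non-limiting illustration, suppressing client alarms during the change duration can be sketched as a time-window filter. The names `ChangeWindow` and `drop_suppressed`, the alarm record layout, and the half-open window boundary are assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeWindow:
    cluster: str          # cluster undergoing the service change
    start: datetime       # start of the service change
    duration: timedelta   # change duration of the service change

    def covers(self, cluster, when):
        """True if `when` falls inside this cluster's change duration."""
        return (cluster == self.cluster
                and self.start <= when < self.start + self.duration)


def drop_suppressed(client_alarms, windows):
    """Return only the client alarms generated outside every change window."""
    return [
        alarm for alarm in client_alarms
        if not any(w.covers(alarm["cluster"], alarm["time"]) for w in windows)
    ]


window = ChangeWindow("CephFS_A", datetime(2019, 12, 23, 10, 0), timedelta(minutes=30))
client_alarms = [
    {"cluster": "CephFS_A", "client": "W1", "time": datetime(2019, 12, 23, 10, 10)},
    {"cluster": "CephFS_A", "client": "W1", "time": datetime(2019, 12, 23, 10, 45)},
    {"cluster": "CephFS_B", "client": "W7", "time": datetime(2019, 12, 23, 10, 10)},
]
to_report = drop_suppressed(client_alarms, [window])
```

Only the first alarm is suppressed: it belongs to the cluster under change and falls inside the change duration.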

Further, for the device, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the determining unit 403 is specifically configured to determine that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, and to report the alarm information of the MDS component to the alarm platform.

Further, for the device, after the monitoring server obtains the monitoring data fed back by each cluster based on the acquisition instruction, the determining unit 403 is further configured to set a cluster identifier corresponding to each piece of monitoring data.

Further, for the device, the alarm rule further includes an alarm convergence rule; the determining unit 403 is specifically configured to determine that the alarm information repeats alarm information that has already occurred in the cluster, and to report the alarm information to the alarm platform after the delay set according to the correspondence between alarm level and alarm delay in the alarm convergence rule; where the lower the alarm level, the longer the corresponding alarm delay.

Embodiments of the present invention provide a computing device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a personal digital assistant (PDA), or the like. The computing device may include a central processing unit (CPU), memory, input/output devices, and so on; the input devices may include a keyboard, a mouse, a touch screen, and the like, and the output devices may include a display device such as a liquid crystal display (LCD) or a cathode ray tube (CRT).

Memory, which may include Read Only Memory (ROM) and Random Access Memory (RAM), provides program instructions and data stored in the memory to the processor. In an embodiment of the present invention, the memory may be used to store program instructions of a method of monitoring a distributed storage system;

and the processor is used for calling the program instructions stored in the memory and executing the method for monitoring the distributed storage system according to the obtained program.

Embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of monitoring a distributed storage system.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, or as a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method of monitoring a distributed storage system, comprising:

the monitoring server sends acquisition instructions to each cluster in the distributed storage system;

the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the clusters and state data of clients connected with the clusters;

for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform;

the alarm rule comprises an alarm generation rule;

wherein the monitoring server determining alarm information from the monitoring data according to the preset alarm rule comprises:

the monitoring server determines a first client with a changed connection state with the cluster from the monitoring data;

the monitoring server determines a second client which changes the connection state with the cluster according to the service change of the cluster;

and generating the alarm information of the client according to the client which is contained in the first client but not the second client and the alarm generation rule.

2. The method of claim 1, wherein there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and the node servers that are connected to clients are all connected to the same clients;

wherein the monitoring server sending acquisition instructions to each cluster in the distributed storage system comprises:

for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.

3. The method of claim 1, wherein the alarm rule further comprises an alarm suppression rule;

the monitoring server determines the change duration of the service change of the cluster;

the monitoring server sets an alarm suppression rule for the alarm information of the client, wherein the alarm suppression rule of the client is used so that alarm information of the client generated within the change duration is not reported.

4. The method of claim 1, wherein the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself;

wherein the monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises:

if the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, reporting the alarm information of the MDS component to the alarm platform.

5. The method of claim 1, wherein after the monitoring server acquires the monitoring data fed back by each cluster based on the acquisition instruction, the method further comprises:

the monitoring server sets a cluster identifier corresponding to each piece of monitoring data.

6. The method of any of claims 1-5, wherein the alarm rule further comprises an alarm convergence rule;

the monitoring server determines that the alarm information repeats alarm information that has already occurred in the cluster, and reports the alarm information to the alarm platform after the delay set according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.

7. An apparatus for monitoring a distributed storage system, comprising:

the sending unit is used for sending acquisition instructions to each cluster in the distributed storage system;

the acquisition unit is used for acquiring monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the clusters and state data of clients connected with the clusters;

the determining unit is used for determining, for at least one cluster, alarm information from the monitoring data of the cluster according to a preset alarm rule and reporting the alarm information to an alarm platform;

the alarm rule comprises an alarm generation rule;

the determining unit is specifically configured to determine, from the monitoring data, a first client whose connection state with the cluster has changed; determine, according to the service change of the cluster, a second client whose connection state with the cluster changes; and generate the alarm information of the client, according to the alarm generation rule, for the client which is contained in the first client but not in the second client.

8. A computing device, comprising:

a memory for storing program instructions;

a processor for invoking program instructions stored in said memory to perform the method according to any of claims 1-6 in accordance with the obtained program.

9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1-6.