CN111049705A

CN111049705A - Method and device for monitoring distributed storage system

Info

Publication number: CN111049705A
Application number: CN201911336662.5A
Authority: CN
Inventors: 龚治文; 饶俊明; 卢道和; 郑晓腾; 龚洵峰; 刘生庆; 吴立; 吴传民
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-04-21
Anticipated expiration: 2039-12-23
Also published as: WO2021129367A1; CN111049705B

Abstract

The invention provides a method and a device for monitoring a distributed storage system. A monitoring server sends an acquisition instruction to each cluster in the distributed storage system; the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster itself and state data of the clients connected to the cluster; and, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform. Because the monitoring server issues the acquisition instruction to each cluster in the distributed storage system, it can monitor a plurality of clusters simultaneously; in addition, because the monitoring data fed back by each cluster comprises the state data of the clients connected to the cluster, the monitoring server can determine alarm information by analyzing that state data, and thereby also monitors the clients connected to the cluster.

Description

Method and device for monitoring distributed storage system

Technical Field

The invention relates to the field of financial technology (Fintech), in particular to a method and a device for monitoring a distributed storage system.

Background

With the development of computer technology, more and more technologies (such as blockchain, cloud computing and big data) are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech); big data technology is no exception. However, the security and real-time requirements of the financial and payment industries also place higher demands on big data technology.

In consideration of factors such as the scalability and high availability required for mass data, the banking industry generally selects a distributed storage system, such as the Ceph File System (CephFS), as its technical scheme for shared storage, with Ceph Fuse clients connected to the CephFS; meanwhile, those skilled in the art generally use an open-source monitoring system such as Prometheus to monitor the CephFS. Prometheus mainly comprises parts such as Exporters and the Prometheus Server; the CephFS mainly includes components such as a Monitor (MON), an Object Storage Device (OSD) and a MetaData Server (MDS), and Placement Groups (PGs) are distributed on the CephFS OSD components.

The prior-art technical scheme in which Prometheus monitors the CephFS has the following two problems:

First, the monitoring of the CephFS by Prometheus consists mainly of collecting state data of the CephFS OSD components and the CephFS PGs; Prometheus does not monitor the Ceph Fuse clients.

Second, the architecture by which Prometheus monitors the CephFS is very bloated: a separate set of Prometheus needs to be deployed for each CephFS and, because CephFS versions differ, different Exporters need to be deployed for different versions of the CephFS. FIG. 1 shows a diagram of a prior-art monitoring architecture of Prometheus for a CephFS. Referring to FIG. 1, Exporter_M collects monitoring data of CephFS_M and, if the collected monitoring data meets a rule for generating alarm information, reports the generated alarm information to Prometheus Server_M; similarly, Exporter_N collects monitoring data of CephFS_N and, if the collected monitoring data meets the rule for generating alarm information, reports the generated alarm information to Prometheus Server_N. However, because the version of Exporter_M does not match that of CephFS_N, Exporter_M cannot be used to collect monitoring data of CephFS_N and thereby report alarm information for CephFS_N. That is, high availability is not achieved among the Prometheus Server, the Exporter and the CephFS, so monitoring information cannot be reported in time under abnormal conditions.

In summary, the prior art has the problems that Prometheus cannot monitor the Ceph Fuse clients and that Prometheus monitors the CephFS inefficiently.

Disclosure of Invention

The invention provides a method and a device for monitoring a distributed storage system, which are used for solving the problems that Prometheus cannot monitor the Ceph Fuse clients and that Prometheus monitors the CephFS inefficiently.

In a first aspect, an embodiment of the present invention provides a method for monitoring a distributed storage system, where the method includes: the monitoring server sends an acquisition instruction to each cluster in the distributed storage system; the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster itself and state data of the clients connected to the cluster; and, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform.

Based on the scheme, the monitoring server can monitor a plurality of clusters simultaneously by issuing the acquisition instruction to each cluster in the distributed storage system, so that the problem that the monitoring server cannot effectively monitor each cluster when the cluster is not matched with the Exporter version is avoided; in addition, the monitoring data fed back to the monitoring server by each cluster also comprises the state data of the clients connected with the clusters, so that the monitoring server can determine the alarm information by analyzing the state data of the clients connected with the clusters, and the purpose that the monitoring server monitors the clients connected with the clusters is achieved.

As a possible implementation method, there are multiple monitoring servers; any cluster comprises a plurality of node servers, and the node servers to which clients are connected are all connected to the same clients; the monitoring server sending an acquisition instruction to each cluster in the distributed storage system comprises: for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.

Based on the scheme, a plurality of monitoring servers are arranged for the distributed storage system, so that on one hand, the monitoring data of each cluster are frequently acquired from each cluster in the distributed storage system, and the aim of all-around and even real-time monitoring of the distributed storage system can be achieved; on the other hand, by setting a plurality of monitoring servers, it can be ensured that other available monitoring servers monitor the distributed storage system under the condition that one or more monitoring servers are down. For any one monitoring server in the plurality of monitoring servers, the monitoring server issues the acquisition instruction to at least two node servers in each cluster, so that the monitoring server is favorable for acquiring the monitoring data of the cluster where the node server is located from other available node servers under the condition that one node server is down, and the monitoring server can effectively monitor each cluster.

As a possible implementation method, the alarm rule includes an alarm generation rule; the monitoring server determines alarm information from the monitoring data according to a preset alarm rule, and the method comprises the following steps: the monitoring server determines a first client terminal with a changed connection state with the cluster from the monitoring data; the monitoring server determines a second client terminal with a changed connection state with the cluster according to the service change of the cluster; and generating the alarm information of the client according to the client which is contained in the first client but not contained in the second client and the alarm generation rule.

Based on the scheme, a first client with the changed connection state with the cluster is determined through analysis of monitoring data, a second client with the changed connection state with the cluster is determined through analysis of known service changes, and alarm information generated due to abnormity of the clients can be generated through comparison of the first client and the second client.

As a possible implementation method, the alarm rule further includes an alarm suppression rule; the monitoring server determines the change duration of the service change of the cluster; and the monitoring server sets an alarm suppression rule of the alarm information of the client, wherein the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.

Based on the scheme, once the duration that the cluster needs for a service change has been determined, the monitoring server does not report the alarm information of the affected clients to the alarm platform during that duration, so that known and useless alarms can be effectively avoided.

As a possible implementation method, the monitoring server generates alarm information of MDS components of the cluster according to health data of the cluster itself; the monitoring server reports the alarm information to an alarm platform according to a preset alarm rule, and the method comprises the following steps: and the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the client, and reports the alarm information of the MDS component to an alarm platform.

Based on the scheme, when the monitoring server simultaneously acquires the alarm information of the MDS component of the cluster and the alarm information of a client connected to the cluster, the monitoring server, considering that an abnormal condition of the MDS component of the cluster may cause an abnormal event at the clients connected to the cluster, determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, reports the alarm information of the MDS component to the alarm platform, and automatically masks the lower-level alarm information of the client.

As a possible implementation method, after the monitoring server obtains the monitoring data fed back by each cluster based on the acquisition instruction, the method further includes: and the monitoring server sets cluster identifications corresponding to the monitoring data.

Based on the scheme, the monitoring server marks each piece of acquired monitoring data with its corresponding cluster, which helps the monitoring server to quickly perform the corresponding alarm operation when it later receives the same monitoring data from the same cluster.

As a possible implementation method, the alarm rule further includes an alarm convergence rule; the monitoring server reporting the alarm information to an alarm platform according to a preset alarm rule comprises: the monitoring server determines that the alarm information is a repeat of alarm information that has already appeared in the cluster (that is, it is not appearing for the first time), and reports the alarm information to the alarm platform after a set delay according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.

Based on the scheme, after the monitoring server determines that the alarm information is a repeat of alarm information that has already appeared in a certain cluster, it reports the repeated alarm to the alarm platform only after the delay set by the alarm convergence rule, which effectively prevents the waste of resources caused by a cluster continuously and repeatedly sending the same alarm.

In a second aspect, an embodiment of the present invention provides an apparatus for monitoring a distributed storage system, where the apparatus includes: a sending unit, used for sending an acquisition instruction to each cluster in the distributed storage system; an acquisition unit, used for acquiring monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster itself and state data of the clients connected to the cluster; and a determining unit, used for, for at least one cluster, determining alarm information from the monitoring data of the cluster according to a preset alarm rule and reporting the alarm information to an alarm platform.

As a possible implementation method, there are multiple monitoring servers; any cluster comprises a plurality of node servers, and the node servers to which clients are connected are all connected to the same clients; the sending unit is specifically configured to, for any monitoring server, issue an acquisition instruction to at least two node servers in any cluster.

As a possible implementation method, the alarm rule includes an alarm generation rule; the determining unit is specifically configured to determine, from the monitoring data, a first client whose connection state with the cluster changes; determining a second client terminal with a changed connection state with the cluster according to the service change of the cluster; and generating the alarm information of the client according to the client which is contained in the first client but not contained in the second client and the alarm generation rule.

As a possible implementation method, the alarm rule further includes an alarm suppression rule; the determining unit is specifically configured to determine a change duration of a service change of the cluster; and setting an alarm suppression rule of the alarm information of the client, wherein the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.

As a possible implementation method, the monitoring server generates alarm information of MDS components of the cluster according to health data of the cluster itself; the determining unit is specifically configured to report the alarm information of the MDS component to an alarm platform if it is determined that the alarm level of the alarm information of the MDS component is higher than the alarm information of the client.

As a possible implementation method, after the monitoring server obtains the monitoring data fed back by each cluster based on the acquisition instruction, the determining unit is further configured to set a cluster identifier corresponding to each monitoring data.

As a possible implementation method, the alarm rule further includes an alarm convergence rule; the determining unit is specifically configured to determine that the alarm information is a repeat of alarm information that has already appeared in the cluster (that is, it is not appearing for the first time), and to report the alarm information to the alarm platform after a set delay according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.

In a third aspect, an embodiment of the present invention provides a computing device, including:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory to perform a method according to any of the first aspects in accordance with the obtained program.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method according to any one of the first aspect.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a diagram of a monitoring architecture of Prometheus for a CephFS of the prior art;

FIG. 2 illustrates a method for monitoring a distributed storage system according to the present invention;

FIG. 3 is a diagram of a monitoring architecture of CephFS by Prometheus according to the present invention;

FIG. 4 illustrates a device for monitoring a distributed storage system according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 2, a method for monitoring a distributed storage system according to an embodiment of the present invention includes:

Step 201, the monitoring server sends an acquisition instruction to each cluster in the distributed storage system.

Step 202, the monitoring server obtains monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster and state data of a client connected with the cluster.

Step 203, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform.

In step 201, the monitoring server sends an acquisition instruction to each cluster in the distributed storage system.

A distributed storage system such as a CephFS is provided with a plurality of clusters, for example 3 clusters: a CephFS_A cluster, a CephFS_B cluster and a CephFS_C cluster. Prometheus, serving as the monitoring server for the CephFS, issues acquisition instructions to the CephFS through its Prometheus Server; specifically, the Prometheus Server issues acquisition instruction I to the CephFS_A cluster, to the CephFS_B cluster and to the CephFS_C cluster.

In step 202, the monitoring server obtains monitoring data fed back by each cluster based on the acquisition instruction, where the monitoring data includes health data of the cluster itself and status data of a client connected to the cluster.

After the Prometheus Server issues acquisition instruction I to the CephFS_A cluster, the CephFS_A cluster responds to acquisition instruction I by producing monitoring data about the CephFS_A cluster, which the Prometheus Server then acquires; similarly, the Prometheus Server can acquire monitoring data about the CephFS_B cluster and about the CephFS_C cluster.

The monitoring data about the CephFS_A cluster may specifically comprise health data of the CephFS_A cluster itself (e.g., the running state of the OSD components and the state data of the PGs) and state data of the Ceph Fuse_A clients connected to the CephFS_A cluster (e.g., whether a Ceph Fuse_A client is connected to the CephFS_A cluster). For example, if 100 Ceph Fuse_A clients are connected to the CephFS_A cluster, the monitoring data for the CephFS_A cluster includes the health data of the CephFS_A cluster itself and also the state data of the 100 Ceph Fuse_A clients connected to it. The monitoring data about the CephFS_B cluster and the CephFS_C cluster is analogous to that of the CephFS_A cluster and is not described again here.

In step 203, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform.

For the CephFS_A cluster, Prometheus analyzes the monitoring data acquired from the CephFS_A cluster according to a preset alarm rule and thereby determines alarm information about the CephFS_A cluster; Prometheus then reports the alarm information about the CephFS_A cluster to the alarm platform, again on the basis of the preset alarm rule. The alarm platform may be an IMS system or another alarm platform, which is not limited in the present invention. The alarm process for the CephFS_B cluster and the CephFS_C cluster is analogous to that for the CephFS_A cluster and is not described again here.
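Steps 201 to 203 can be sketched as a single collection pass over all clusters. The following Python sketch is illustrative only: the function names `monitor_once` and `example_rule`, the dictionary layout of the monitoring data and the state strings "ok" and "connected" are assumptions made for the example and are not taken from the disclosure or from the Prometheus API.

```python
from typing import Callable, Dict, List

# Hypothetical layout: {"health": {component: state}, "clients": {client: state}}
MonitoringData = Dict[str, Dict[str, str]]


def monitor_once(
    clusters: List[str],
    collect: Callable[[str], MonitoringData],
    rule: Callable[[str, MonitoringData], List[str]],
    report: Callable[[str], None],
) -> List[str]:
    """Run one pass of steps 201-203 over every cluster."""
    reported: List[str] = []
    for cluster in clusters:
        # Steps 201/202: issue the acquisition instruction and read the reply.
        data = collect(cluster)
        # Step 203: apply the preset alarm rule, then report each alarm.
        for alarm in rule(cluster, data):
            report(alarm)
            reported.append(alarm)
    return reported


def example_rule(cluster: str, data: MonitoringData) -> List[str]:
    """Assumed rule: alarm on any component not 'ok' or client not 'connected'."""
    alarms: List[str] = []
    for component, state in sorted(data["health"].items()):
        if state != "ok":
            alarms.append(f"{cluster}: component {component} is {state}")
    for client, state in sorted(data["clients"].items()):
        if state != "connected":
            alarms.append(f"{cluster}: client {client} is {state}")
    return alarms
```

Passing the collector, the rule and the reporter as callables keeps the loop independent of how a given cluster version exposes its data, which mirrors the goal of one monitoring server covering several clusters.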

FIG. 3 is a diagram of the architecture by which Prometheus monitors the CephFS according to an embodiment of the present invention. Referring to FIG. 3, two monitoring servers, Prometheus Server_X and Prometheus Server_Y, are deployed, and both are used for monitoring a distributed storage system in which a CephFS_A cluster, a CephFS_B cluster and a CephFS_C cluster are deployed. Each cluster includes a plurality of node servers. For convenience of description, the CephFS_A cluster includes 4 node servers, denoted A1, A2, A3 and A4; the CephFS_B cluster includes 4 node servers, denoted B1, B2, B3 and B4; and the CephFS_C cluster includes 4 node servers, denoted C1, C2, C3 and C4.

For the CephFS_A cluster, 100 Ceph Fuse_A clients are connected to the node servers configured with MDS components in the cluster; if 3 node servers in the CephFS_A cluster are configured with MDS components, the 100 Ceph Fuse_A clients are all connected to those 3 node servers (not shown in the figure). Similarly, for the CephFS_B cluster, 200 Ceph Fuse_B clients are connected to the node servers configured with MDS components in the cluster; if 3 node servers in the CephFS_B cluster are configured with MDS components, the 200 Ceph Fuse_B clients are all connected to those 3 node servers (not shown in the figure). Similarly, for the CephFS_C cluster, 300 Ceph Fuse_C clients are connected to the node servers configured with MDS components in the cluster; if 3 node servers in the CephFS_C cluster are configured with MDS components, the 300 Ceph Fuse_C clients are all connected to those 3 node servers (not shown in the figure).

Taking Prometheus Server_X as an example, the monitoring server issues an acquisition instruction to at least two node servers in each of the CephFS_A cluster, the CephFS_B cluster and the CephFS_C cluster, specifically as follows:

Suppose that at 8:00 am Prometheus Server_X issues acquisition instruction I to the 3 node servers A1, A2 and A4 in the CephFS_A cluster; at the same time, Prometheus Server_X issues acquisition instruction I to the 3 node servers B1, B3 and B4 in the CephFS_B cluster and to the 3 node servers C1, C2 and C4 in the CephFS_C cluster.

It should be noted that, when Prometheus Server_X issues the acquisition instruction to at least two node servers in the CephFS_A cluster, the node servers are chosen at random from the CephFS_A cluster. For example, Prometheus Server_X may issue acquisition instruction I to the 3 node servers A1, A2 and A4, to the 3 node servers A2, A3 and A4, or to the 3 node servers A1, A2 and A3 in the CephFS_A cluster, which is not limited in the present invention. Similarly, when Prometheus Server_X issues the acquisition instruction to at least two node servers in the CephFS_B cluster or in the CephFS_C cluster, the node servers are chosen at random from the respective cluster.
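The random choice of at least two node servers, together with the fallback to another node when one is down, might look like the following Python sketch. The helper names `pick_nodes` and `collect_from_cluster`, the default of 3 nodes, the use of `ConnectionError` to signal a node that is down, and the policy of returning the first reply are all assumptions for illustration; the disclosure only requires that at least two node servers per cluster receive the instruction.

```python
import random
from typing import Callable, List, Optional, Sequence


def pick_nodes(
    nodes: Sequence[str],
    count: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly choose at least two distinct node servers of one cluster."""
    if len(nodes) < 2:
        raise ValueError("a cluster needs at least two node servers")
    chooser = rng or random
    k = max(2, min(count, len(nodes)))  # never fewer than two targets
    return chooser.sample(list(nodes), k)


def collect_from_cluster(
    nodes: Sequence[str],
    fetch: Callable[[str], dict],
    count: int = 3,
    rng: Optional[random.Random] = None,
) -> Optional[dict]:
    """Return data from the first chosen node that answers, else None."""
    for node in pick_nodes(nodes, count, rng):
        try:
            return fetch(node)
        except ConnectionError:
            continue  # this node server is down; try the next chosen one
    return None
```

Because every chosen node can answer for the whole cluster, one unreachable node does not leave the cluster unmonitored.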

As a possible implementation manner, the alarm rule includes an alarm generation rule; the monitoring server determines alarm information from the monitoring data according to a preset alarm rule, and the method comprises the following steps: the monitoring server determines a first client terminal with a changed connection state with the cluster from the monitoring data; the monitoring server determines a second client terminal with a changed connection state with the cluster according to the service change of the cluster; and generating the alarm information of the client according to the client which is contained in the first client but not contained in the second client and the alarm generation rule.

For example, for the CephFS_A cluster, suppose for convenience of description that 10 Ceph Fuse_A clients, namely W1, W2, W3, W4, W5, W6, W7, W8, W9 and W10, are connected to the node servers configured with MDS components in the cluster. Prometheus Server_X issues acquisition instruction I to the 3 node servers A1, A2 and A4 in the CephFS_A cluster. Suppose that Prometheus Server_X first acquires the monitoring data on the A1 node server and, by analyzing it, determines that the 10 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6, W7, W8, W9 and W10 are connected to the CephFS_A cluster; Prometheus Server_X subsequently acquires the monitoring data on the A2 node server and, by analyzing it, determines that only the 3 Ceph Fuse_A clients W8, W9 and W10 are still connected to the CephFS_A cluster, while the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7 have gone offline from the CephFS_A cluster. That is, the first clients, whose connection state with the cluster has changed, are the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7.

For such an event occurring at the Ceph Fuse_A clients, it needs to be further determined why the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7 went offline from the CephFS_A cluster, that is, whether each Ceph Fuse_A client was normally unloaded from the CephFS_A cluster or was passively unloaded because of a problem in the CephFS_A cluster itself.

For service reasons, the service running on the CephFS_A cluster routinely unloads some of the clients connected to the CephFS_A cluster. For example, for service reasons, service personnel unload the 3 Ceph Fuse_A clients W5, W6 and W7 from the CephFS_A cluster. That is, the second clients, whose connection state with the cluster has changed because of the service change, are the 3 Ceph Fuse_A clients W5, W6 and W7.

By comparing the first clients (the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7) with the second clients (the 3 Ceph Fuse_A clients W5, W6 and W7), it can be found that the unloading of the 3 Ceph Fuse_A clients W5, W6 and W7 is a normal unloading event, so the offline state of these 3 clients in the monitoring data does not need to be reported to the IMS system; for the abnormal unloading events of the 4 Ceph Fuse_A clients W1, W2, W3 and W4, alarm information of the clients is generated according to the alarm generation rule.
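The comparison between the first clients and the second clients amounts to a set difference. A minimal Python sketch follows, under the simplifying assumption that a "changed connection state" means a client that was connected in the earlier sample and is absent from the later one; the function names and the alarm text are hypothetical.

```python
from typing import Iterable, List, Set


def abnormal_offline_clients(
    before: Iterable[str],
    after: Iterable[str],
    planned: Iterable[str],
) -> Set[str]:
    """Clients that went offline without a known service change."""
    # First clients: connected in the earlier sample, gone in the later one.
    first = set(before) - set(after)
    # Second clients: expected to change because of a service change.
    second = set(planned)
    # Contained in the first clients but not in the second clients.
    return first - second


def client_alarms(
    cluster: str,
    before: Iterable[str],
    after: Iterable[str],
    planned: Iterable[str],
) -> List[str]:
    """Build one alarm message per abnormally offline client."""
    offline = abnormal_offline_clients(before, after, planned)
    return [
        f"{cluster}: client {client} went offline unexpectedly"
        for client in sorted(offline)
    ]
```

With the clients of the example above (W1 to W10 seen first, W8 to W10 seen later, W5 to W7 planned), the difference is W1, W2, W3 and W4.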

As a possible implementation manner, the alarm rule further includes an alarm suppression rule; the monitoring server determines the change duration of the service change of the cluster; and the monitoring server sets an alarm suppression rule of the alarm information of the client, wherein the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.

As in the foregoing example, for service reasons a normal unloading operation is performed on 3 Ceph Fuse_A clients connected to the CephFS_A cluster, namely W5, W6 and W7, and the time required for unloading these 3 clients is 3 h. After acquiring the monitoring data on the A2 node server, Prometheus Server_X therefore does not report the offline events of the 3 Ceph Fuse_A clients W5, W6 and W7 during the following 3 h. That is, Prometheus Server_X writes the events of the 3 Ceph Fuse_A clients W5, W6 and W7 going offline from the CephFS_A cluster into the alarm suppression rule.
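An alarm suppression rule of this kind can be modelled as a time window attached to a cluster and a set of clients. The sketch below is one possible representation; the class `SuppressionRule`, its fields and the inclusive window boundaries are assumptions, not details given in the disclosure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class SuppressionRule:
    """Hypothetical record of one planned service change."""
    cluster: str
    clients: FrozenSet[str]
    start: datetime
    duration: timedelta  # the change duration, e.g. 3 h

    def suppresses(self, cluster: str, client: str, when: datetime) -> bool:
        """True if the client alarm falls inside the change window."""
        return (
            cluster == self.cluster
            and client in self.clients
            and self.start <= when <= self.start + self.duration
        )


def should_report(
    rules: Iterable[SuppressionRule],
    cluster: str,
    client: str,
    when: datetime,
) -> bool:
    """Report a client alarm only if no suppression rule covers it."""
    return not any(rule.suppresses(cluster, client, when) for rule in rules)
```

A rule for W5, W6 and W7 with a 3 h duration would hold back their offline alarms inside the window while leaving alarms for other clients, or for the same clients after the window, unaffected.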

As in the foregoing example, the monitoring data that Prometheus Server_X acquires for the CephFS_A cluster includes health data of the CephFS_A cluster itself (e.g., the running state of the OSD components and the state data of the PGs) and state data of the Ceph Fuse_A clients connected to the CephFS_A cluster (e.g., whether a Ceph Fuse_A client is connected to the CephFS_A cluster). Suppose that at time T Prometheus Server_X obtains monitoring data about the CephFS_A cluster which shows that an MDS component in the CephFS_A cluster is running abnormally and, at the same time, that an abnormal unloading event has occurred at the Ceph Fuse_A client W1 connected to the CephFS_A cluster. Prometheus Server_X then defines the alarm level of the abnormal event of the MDS component in the CephFS_A cluster as high, and the alarm level of the abnormal unloading event of the Ceph Fuse_A client W1 as low. Prometheus Server_X then reports the high-level alarm event to the IMS system; that is, it reports the abnormal event of the MDS component in the CephFS_A cluster to the IMS system and does not report the abnormal unloading event of the Ceph Fuse_A client W1.

It should be noted that the monitoring server can set the alarm level of the alarm information of the MDS component higher than that of the client because an abnormal MDS component in a cluster causes abnormal events at the clients connected to that cluster. Consequently, once the alarm information of the MDS component has been reported to the IMS system and operation and maintenance personnel have investigated it, the MDS component can be restored to normal operation, and the clients connected to the cluster can return to a normal state at the same time.
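The level-based selection described above can be illustrated with a short Python sketch. The `Level`, `Alarm` and `select_for_report` names are hypothetical, and the sketch assumes that, for each cluster, only alarms at the highest level present at a given time are forwarded.

```python
from enum import IntEnum
from typing import NamedTuple


class Level(IntEnum):
    LOW = 1
    HIGH = 2


class Alarm(NamedTuple):
    cluster: str
    source: str  # "MDS" or a client name such as "W1"
    event: str
    level: Level


def select_for_report(alarms):
    """Keep, per cluster, only the alarms at the highest level present."""
    top = {}
    for alarm in alarms:
        top[alarm.cluster] = max(top.get(alarm.cluster, alarm.level), alarm.level)
    return [alarm for alarm in alarms if alarm.level == top[alarm.cluster]]


# Alarms derived from the monitoring data obtained at time T.
alarms_at_t = [
    Alarm("CephFS_A", "MDS", "abnormal running", Level.HIGH),
    Alarm("CephFS_A", "W1", "abnormal unmount", Level.LOW),
]
reported = select_for_report(alarms_at_t)
```

Here `reported` contains only the MDS alarm, which mirrors the example: the client alarm is treated as a consequence of the MDS fault and is not forwarded.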

Continuing the foregoing example and referring to fig. 3, Prometheus server_X sends collection instruction I to the three node servers A1, A2 and A4 in the CephFS_A cluster and, at the same time, to the three node servers B1, B3 and B4 in the CephFS_B cluster and to the three node servers C1, C2 and C4 in the CephFS_C cluster. Once the three clusters, i.e., the CephFS_A cluster, the CephFS_B cluster and the CephFS_C cluster, respond to collection instruction I, Prometheus server_X acquires the monitoring data of each cluster. Each piece of monitoring data may be marked with the identifier of its cluster. For example, the first piece of monitoring data acquired by Prometheus server_X is the monitoring data on the A1 node server of the CephFS_A cluster, the second piece is the monitoring data on the B3 node server of the CephFS_B cluster, the third piece is the monitoring data on the C4 node server of the CephFS_C cluster, and so on.
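A minimal Python sketch of marking each piece of monitoring data with its cluster identifier is given below. The function names and the `mds_ok` payload field are hypothetical and serve only to illustrate the idea; the actual data format is not specified here.

```python
def tag_samples(responses):
    """Attach sequence, cluster and node identifiers to each piece of data.

    `responses` is an iterable of (cluster, node, payload) tuples in the
    order in which the monitoring server received them.
    """
    tagged = []
    for index, (cluster, node, payload) in enumerate(responses, start=1):
        tagged.append({"seq": index, "cluster": cluster, "node": node, **payload})
    return tagged


def group_by_cluster(tagged):
    """Group tagged monitoring data by cluster identifier."""
    groups = {}
    for sample in tagged:
        groups.setdefault(sample["cluster"], []).append(sample)
    return groups


# Hypothetical responses to collection instruction I, in arrival order.
responses = [
    ("CephFS_A", "A1", {"mds_ok": True}),
    ("CephFS_B", "B3", {"mds_ok": True}),
    ("CephFS_C", "C4", {"mds_ok": False}),
]
tagged = tag_samples(responses)
by_cluster = group_by_cluster(tagged)
```

With the identifier attached, later alarm rules can be evaluated separately for each cluster even though a single monitoring server collects from all of them.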

Continuing the foregoing example, suppose that the first piece of monitoring data acquired by Prometheus server_X comes from the CephFS_A cluster and that, after analyzing it according to the preset alarm rule, Prometheus server_X determines that it can be reported to the IMS system as alarm information. The alarm information generated from the first piece of monitoring data is denoted Info_1, and the alarm level of Info_1 is level 1. Suppose further that the sixth piece of monitoring data acquired by Prometheus server_X also relates to the CephFS_A cluster and that, after analysis according to the preset alarm rule, the alarm information generated from it is found to match Info_1. Prometheus server_X then needs to determine, according to the alarm level of Info_1, when to report the sixth piece of monitoring data to the IMS system. If the alarm delay for alarm information of level 1 is set to 1 h, Prometheus server_X will not report the Info_1 corresponding to the sixth piece of monitoring data to the IMS system within the next 1 h.

Suppose that the second piece of monitoring data acquired by Prometheus server_X comes from the CephFS_B cluster and that, after analyzing it according to the preset alarm rule, Prometheus server_X determines that it can be reported to the IMS system as alarm information. The alarm information generated from the second piece of monitoring data is denoted Info_2, and the alarm level of Info_2 is level 2. Suppose further that the ninth piece of monitoring data acquired by Prometheus server_X also relates to the CephFS_B cluster and that, after analysis according to the preset alarm rule, the alarm information generated from it is found to match Info_2. Prometheus server_X then needs to determine, according to the alarm level of Info_2, when to report the ninth piece of monitoring data to the IMS system. If the alarm delay for alarm information of level 2 is set to 2 h, Prometheus server_X will not report the Info_2 corresponding to the ninth piece of monitoring data to the IMS system within the next 2 h.

Suppose that the third piece of monitoring data acquired by Prometheus server_X comes from the CephFS_C cluster and that, after analyzing it according to the preset alarm rule, Prometheus server_X determines that it can be reported to the IMS system as alarm information. The alarm information generated from the third piece of monitoring data is denoted Info_3, and the alarm level of Info_3 is level 3. Suppose further that the tenth piece of monitoring data acquired by Prometheus server_X also relates to the CephFS_C cluster and that, after analysis according to the preset alarm rule, the alarm information generated from it is found to match Info_3. Prometheus server_X then needs to determine, according to the alarm level of Info_3, when to report the tenth piece of monitoring data to the IMS system. If the alarm delay for alarm information of level 3 is set to 3 h, Prometheus server_X will not report the Info_3 corresponding to the tenth piece of monitoring data to the IMS system within the next 3 h.

It should be noted that, in the above example, the alarm level decreases from level 1 through level 2 to level 3, and the corresponding alarm delays become longer: 1 h, 2 h and 3 h, respectively.
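The mapping between alarm level and alarm delay in the three examples above can be sketched in Python as follows. The `ConvergenceFilter` class is hypothetical, and the sketch assumes that the delay window for a repeated alarm is measured from the time the same alarm was last reported; the embodiment does not fix this detail.

```python
from datetime import datetime, timedelta

# Alarm delay per alarm level, as in the example: the lower the level, the longer the delay.
ALARM_DELAY = {
    1: timedelta(hours=1),
    2: timedelta(hours=2),
    3: timedelta(hours=3),
}


class ConvergenceFilter:
    """Hypothetical filter that holds back repeated alarms of one cluster."""

    def __init__(self, delays):
        self._delays = delays
        self._last_report = {}  # (cluster, alarm_id) -> time of last report

    def should_report(self, cluster, alarm_id, level, now):
        key = (cluster, alarm_id)
        last = self._last_report.get(key)
        if last is not None and now - last < self._delays[level]:
            # Same alarm seen again inside its delay window: do not report yet.
            return False
        self._last_report[key] = now
        return True
```

For instance, if Info_1 (level 1) of the CephFS_A cluster is reported at some time t0, a matching alarm arriving 30 minutes later is held back, whereas one arriving more than 1 h after t0 is reported again.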

Based on the same concept, an embodiment of the present invention further provides an apparatus for monitoring a distributed storage system. As shown in fig. 4, the apparatus includes:

a sending unit 401, configured to send an acquisition instruction to each cluster in the distributed storage system;

an obtaining unit 402, configured to obtain monitoring data fed back by each cluster based on the acquisition instruction, where the monitoring data includes health data of the cluster and status data of a client connected to the cluster;

a determining unit 403, configured to, for at least one cluster, determine alarm information from the monitoring data of the cluster according to a preset alarm rule, and report the alarm information to an alarm platform.

Further, for the apparatus, there are a plurality of monitoring servers; any cluster includes a plurality of node servers, and the node servers in a cluster are connected to the same clients; for any monitoring server, the sending unit 401 is specifically configured to issue an acquisition instruction to at least two node servers in any cluster.

Further, for the apparatus, the alarm rules include alarm generation rules; the determining unit 403 is specifically configured to determine, from the monitoring data, a first client whose connection state with the cluster has changed; determine, according to a service change of the cluster, a second client whose connection state with the cluster has changed; and generate alarm information of the client, according to the alarm generation rule, for any client that is included in the first client but not included in the second client.

Further, for the apparatus, the alarm rules further include alarm suppression rules; the determining unit 403 is specifically configured to determine a change duration of a service change of the cluster, and to set an alarm suppression rule for the alarm information of the client, wherein the alarm suppression rule of the client is used to withhold reporting of alarm information of the client generated within the change duration.

Further, for the apparatus, the monitoring server generates alarm information of the MDS components of the cluster according to the health data of the cluster itself; the determining unit 403 is specifically configured to determine that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, and to report the alarm information of the MDS component to an alarm platform.

Further, for the apparatus, after the monitoring server obtains the monitoring data fed back by each cluster based on the acquisition instruction, the determining unit 403 is further configured to set a cluster identifier corresponding to each piece of monitoring data.

Further, for the apparatus, the alarm rules further include an alarm convergence rule; the determining unit 403 is specifically configured to determine that the alarm information is not the first occurrence of the same alarm information in the cluster, and to report the alarm information to the alarm platform after a time delay set according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.
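The alarm generation rule handled by the determining unit 403, as described above, amounts to a set difference between the clients whose connection state was observed to change and the clients whose change is explained by a service change. A minimal Python sketch follows; the function name and the combination of client names in one data set are hypothetical and chosen only for illustration.

```python
def clients_to_alarm(first_clients, second_clients):
    """Clients whose connection change is not explained by a service change.

    first_clients: clients whose connection state changed per the monitoring data.
    second_clients: clients expected to change because of a service change.
    """
    return set(first_clients) - set(second_clients)


# Hypothetical data: W5, W6 and W7 are unmounted as planned, W1 is not planned.
observed_changes = {"W1", "W5", "W6", "W7"}
planned_changes = {"W5", "W6", "W7"}
unexpected = clients_to_alarm(observed_changes, planned_changes)
```

In this sketch only W1 remains in `unexpected`, so alarm information would be generated for W1 alone.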

Embodiments of the present invention provide a computing device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), or the like. The computing device may include a Central Processing Unit (CPU), a memory, input/output devices, and the like. The input devices may include a keyboard, a mouse, a touch screen, and the like, and the output devices may include a display device, such as a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT).

The memory may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with the program instructions and data stored in it. In an embodiment of the invention, the memory may be used to store program instructions for a method of monitoring a distributed storage system.

The processor is configured to call the program instructions stored in the memory and to execute the method for monitoring the distributed storage system according to the obtained program.

Embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of monitoring a distributed storage system.

It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of monitoring a distributed storage system, comprising:

a monitoring server sends an acquisition instruction to each cluster in the distributed storage system;

the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster and state data of a client connected to the cluster;

and, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform.

2. The method of claim 1, wherein there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and the node servers in a cluster are connected to the same clients;

the monitoring server sending an acquisition instruction to each cluster in the distributed storage system comprises:

for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.

3. The method of claim 1, wherein the alarm rules include alarm generation rules;

the monitoring server determining alarm information from the monitoring data according to a preset alarm rule comprises:

the monitoring server determines, from the monitoring data, a first client whose connection state with the cluster has changed;

the monitoring server determines, according to a service change of the cluster, a second client whose connection state with the cluster has changed;

and the monitoring server generates alarm information of the client, according to the alarm generation rule, for any client that is included in the first client but not included in the second client.

4. The method of claim 3, wherein the alarm rules further include alarm suppression rules;

the monitoring server determines the change duration of the service change of the cluster;

and the monitoring server sets an alarm suppression rule for the alarm information of the client, wherein the alarm suppression rule of the client is used to withhold reporting of alarm information of the client generated within the change duration.

5. The method of claim 3, wherein the monitoring server generates alarm information for MDS components of the cluster based on health data of the cluster itself;

the monitoring server reporting the alarm information to an alarm platform according to a preset alarm rule comprises:

the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, and reports the alarm information of the MDS component to the alarm platform.

6. The method of claim 1, wherein after the monitoring server obtains the monitoring data fed back by each cluster based on the acquisition instruction, the method further comprises:

the monitoring server sets a cluster identifier corresponding to each piece of monitoring data.

7. The method of any of claims 1-6, wherein the alarm rules further include alarm convergence rules;

the monitoring server determines that the alarm information is not the first occurrence of the same alarm information in the cluster, and reports the alarm information to the alarm platform after a time delay set according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.

8. An apparatus for monitoring a distributed storage system, comprising:

a sending unit, configured to send an acquisition instruction to each cluster in the distributed storage system;

an obtaining unit, configured to obtain monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster and state data of a client connected to the cluster;

and a determining unit, configured to, for at least one cluster, determine alarm information from the monitoring data of the cluster according to a preset alarm rule and report the alarm information to an alarm platform.

9. A computing device, comprising:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 7 in accordance with the obtained program.

10. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1-7.