WO2021129367A1 - Method and apparatus for monitoring distributed storage system - Google Patents

Method and apparatus for monitoring distributed storage system

Publication number
WO2021129367A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
alarm
monitoring
client
alarm information
Prior art date
Application number
PCT/CN2020/134339
Other languages
French (fr)
Chinese (zh)
Inventor
龚治文
饶俊明
卢道和
郑晓腾
龚洵峰
刘生庆
吴立
吴传民
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2021129367A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H04L41/0609 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on severity or priority
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • the present invention relates to the field of financial technology (Fintech), in particular to a method and device for monitoring a distributed storage system.
  • CephFS: the Ceph File System.
  • Ceph Fuse client: the user-space file system client of the Ceph file system.
  • in the prior art, CephFS is monitored with the open-source tool Prometheus.
  • Prometheus is mainly composed of the Exporter (the client that collects Prometheus monitoring data) and the Prometheus Server (the server side of Prometheus monitoring); CephFS is mainly composed of the monitor (Monitor, abbreviated as MON), the object storage device (Object Storage Device, abbreviated as OSD), the metadata server (MetaData Server, abbreviated as MDS), and other components.
  • the CephFS OSD component also contains placement groups (Placement Groups, abbreviated as PG).
  • Prometheus's monitoring of CephFS is mainly manifested in Prometheus's data collection of CephFS OSD component status and CephFS PG status, but Prometheus does not implement the monitoring of Ceph Fuse client.
  • Prometheus's monitoring architecture for CephFS is very bloated, which is manifested in the need to deploy a set of Prometheus for each CephFS; in addition, due to the different versions of CephFS, different Exporters need to be deployed for different versions of CephFS.
  • as shown in Figure 1, which is a diagram of the prior-art monitoring architecture of Prometheus for CephFS,
  • the Prometheus monitoring data collection client numbered M collects the monitoring data of the Ceph file system numbered M.
  • if the collected monitoring data satisfies the rules for generating alarm information, the client reports the generated alarm information to the Prometheus server numbered M. In the same way, the Prometheus monitoring data collection client numbered N collects the monitoring data of the Ceph file system numbered N and, if the collected monitoring data satisfies the rules for generating alarm information, reports the generated alarm information to the Prometheus server numbered N. However, because the collection client numbered M does not match the version of the Ceph file system numbered N, the client numbered M cannot be used to collect the monitoring data of the Ceph file system numbered N or to report its alarm information. That is, the Prometheus Server, the Exporter, and CephFS are not highly available with respect to one another, so monitoring information may not be reported in time under abnormal conditions.
  • the existing technology has problems that Prometheus cannot monitor the Ceph Fuse client and Prometheus has low monitoring efficiency for CephFS.
  • the present invention provides a method and device for monitoring a distributed storage system, which are used to solve the problems that Prometheus cannot monitor Ceph Fuse clients and Prometheus has low monitoring efficiency for CephFS.
  • an embodiment of the present invention provides a method for monitoring a distributed storage system.
  • the method includes: a monitoring server sends collection instructions to each cluster in the distributed storage system; the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction,
  • where the monitoring data includes the health data of the cluster itself and the status data of the clients connected to the cluster; and, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to preset alarm rules and reports the alarm information to the alarm platform.
  • the monitoring server can monitor multiple clusters at the same time by issuing collection instructions to each cluster in the distributed storage system, thus avoiding the ineffectiveness of the monitoring server when the cluster and the Exporter version do not match.
  • the purpose of the monitoring server to monitor the clients connected to the cluster is realized.
  • any cluster includes multiple node servers, and every node server that has client connections is connected to the same clients; the monitoring server sending collection instructions to each cluster in the distributed storage system includes: for any monitoring server, the monitoring server issues collection instructions to at least two node servers in any cluster.
  • because the monitoring server sends collection instructions to at least two node servers in each cluster, when one of those node servers is down,
  • the monitoring data of the cluster where that node server is located can still be obtained from the other available node servers, so that the monitoring server monitors each cluster effectively.
  • the alarm rule includes an alarm generation rule; the monitoring server determining the alarm information from the monitoring data according to the preset alarm rule includes: the monitoring server determines, from the monitoring data, the first clients whose connection status with the cluster has changed; the monitoring server determines, according to the service changes of the cluster, the second clients whose connection status with the cluster has changed; and the monitoring server generates the alarm information of the client according to the alarm generation rule and the clients that are included in the first clients but not in the second clients.
  • analyzing the monitoring data identifies the first clients whose connection status with the cluster has changed, and analyzing the known service changes identifies
  • the second clients whose connection status has changed for a known reason; comparing the first clients with the second clients isolates the alarm information that is caused by client abnormality.
  • the alarm rule further includes an alarm suppression rule; the monitoring server determines the change duration of the service change of the cluster; the monitoring server sets the alarm suppression rule for the alarm information of the client, and the alarm suppression rule is used to withhold the alarm information of the client generated within the change duration.
  • the monitoring server does not report the alarm information of the client to the alarm platform during the time that the service change necessarily takes, which effectively avoids generating known but useless alarms.
  • the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the monitoring server reporting the alarm information to the alarm platform according to a preset alarm rule
  • includes: if the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, it reports the alarm information of the MDS component to the alarm platform.
  • when the monitoring server obtains the alarm information of the MDS component of the cluster and the alarm information of a client connected to the cluster at the same time, the abnormal event of the client is considered likely to be caused by the abnormality of the MDS component of the cluster.
  • the monitoring server therefore determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, reports the alarm information of the MDS component to the alarm platform, and automatically masks the lower-level alarm information of the client.
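The masking of client alarms by a higher-level MDS alarm can be illustrated with a minimal Python sketch. The `Alarm` record, the integer `level` field (larger meaning more severe), and the function name `mask_client_alarms` are assumptions made for illustration; the source does not specify how alarm levels are encoded.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    cluster_id: str
    source: str    # "mds" or "client" (hypothetical labels)
    level: int     # assumption: a larger number means a more severe alarm
    message: str


def mask_client_alarms(alarms: List[Alarm]) -> List[Alarm]:
    """Drop client alarms that are outranked by an MDS alarm of the same cluster."""
    # Highest MDS alarm level seen per cluster.
    mds_level: Dict[str, int] = {}
    for alarm in alarms:
        if alarm.source == "mds":
            current = mds_level.get(alarm.cluster_id, alarm.level)
            mds_level[alarm.cluster_id] = max(current, alarm.level)

    kept = []
    for alarm in alarms:
        outranked = (
            alarm.source == "client"
            and alarm.cluster_id in mds_level
            and mds_level[alarm.cluster_id] > alarm.level
        )
        if not outranked:
            kept.append(alarm)
    return kept
```

In this sketch a client alarm survives when its cluster has no MDS alarm, so unrelated client failures in other clusters are still reported.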
  • the method further includes: the monitoring server sets a cluster identifier corresponding to each monitoring data.
  • the monitoring server marks each acquired monitoring data with the corresponding cluster identification, which helps the monitoring server to quickly make corresponding alarm operations when receiving the same monitoring data of the same cluster in the future.
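One way to picture the cluster identifier is as a label attached to every collected sample, so that later alarm handling can group data by cluster. The sketch below is illustrative only; the dictionary layout, the `cluster` key, and both function names are assumptions rather than details taken from the source.

```python
from typing import Dict, List


def tag_with_cluster(cluster_id: str, samples: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return copies of the samples with a cluster identifier attached."""
    return [{**sample, "cluster": cluster_id} for sample in samples]


def group_by_cluster(samples: List[Dict[str, object]]) -> Dict[str, List[Dict[str, object]]]:
    """Group tagged samples by their cluster identifier."""
    grouped: Dict[str, List[Dict[str, object]]] = {}
    for sample in samples:
        grouped.setdefault(str(sample["cluster"]), []).append(sample)
    return grouped
```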
  • the alarm rules further include alarm convergence rules; the monitoring server reporting the alarm information to the alarm platform according to preset alarm rules includes: if the monitoring server determines that the alarm information repeats alarm information that has already appeared in the cluster, it reports the alarm information to the alarm platform after the set delay, according to the correspondence between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
  • when the monitoring server determines that the alarm information repeats alarm information that has already appeared in a certain cluster, it reports the repeated alarm to the alarm platform only after the delay set by the alarm convergence rules, which effectively prevents the cluster from sending the same alarm over and over and wasting resources.
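The convergence rule amounts to a lookup from alarm level to reporting delay that applies only to repeated alarms. The following Python sketch assumes three levels and concrete delays in seconds purely for illustration; the source states only that a lower level maps to a longer delay. Reporting a first occurrence with zero delay is likewise an assumption.

```python
from typing import Dict, Set, Tuple

# Hypothetical table: level 3 is the most severe, so it has the shortest delay.
DELAY_BY_LEVEL: Dict[int, int] = {3: 60, 2: 300, 1: 900}


class AlarmConverger:
    """Decide how long to delay an alarm before reporting it."""

    def __init__(self, delay_by_level: Dict[int, int]) -> None:
        self._delay = dict(delay_by_level)
        self._seen: Set[Tuple[str, str]] = set()

    def delay_for(self, cluster_id: str, alarm_key: str, level: int) -> int:
        key = (cluster_id, alarm_key)
        if key not in self._seen:
            # First occurrence in this cluster: report without delay (assumed).
            self._seen.add(key)
            return 0
        # Repeated alarm: delay according to its level.
        return self._delay[level]
```

A caller would schedule the report `delay_for(...)` seconds in the future; the scheduling mechanism itself is outside the scope of this sketch.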
  • an embodiment of the present invention provides a device for monitoring a distributed storage system, and the device includes: a sending unit, configured to send collection instructions to each cluster in the distributed storage system; the monitoring data fed back by each cluster based on the collection instruction is obtained, where the monitoring data includes the health data of the cluster itself and the status data of the clients connected to the cluster; and a determining unit, configured to, for at least one cluster, determine alarm information from the monitoring data of the cluster according to preset alarm rules and report the alarm information to the alarm platform.
  • the monitoring server can monitor multiple clusters at the same time by issuing collection instructions to each cluster in the distributed storage system, thus avoiding the ineffectiveness of the monitoring server when the cluster and the Exporter version do not match.
  • the purpose of the monitoring server to monitor the clients connected to the cluster is realized.
  • any cluster includes multiple node servers, and every node server that has client connections is connected to the same clients; for any monitoring server, the sending unit is specifically configured to issue collection instructions to at least two node servers in any cluster.
  • because the monitoring server sends collection instructions to at least two node servers in each cluster, when one of those node servers is down,
  • the monitoring data of the cluster where that node server is located can still be obtained from the other available node servers, so that the monitoring server monitors each cluster effectively.
  • the alarm rule includes an alarm generation rule; the determining unit is specifically configured to determine, from the monitoring data, the first clients whose connection status with the cluster has changed; determine, according to the service changes of the cluster, the second clients whose connection status with the cluster has changed; and generate the alarm information of the client according to the alarm generation rule and the clients that are included in the first clients but not in the second clients.
  • analyzing the monitoring data identifies the first clients whose connection status with the cluster has changed, and analyzing the known service changes identifies
  • the second clients whose connection status has changed for a known reason; comparing the first clients with the second clients isolates the alarm information that is caused by client abnormality.
  • the alarm rule further includes an alarm suppression rule; the determining unit is specifically configured to determine the change duration of the service change of the cluster and to set the alarm suppression rule for the alarm information of the client, where the alarm suppression rule is used to withhold the alarm information of the client generated within the change duration.
  • the monitoring server does not report the alarm information of the client to the alarm platform during the time that the service change necessarily takes, which effectively avoids generating known but useless alarms.
  • the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the determining unit is specifically configured to report the alarm information of the MDS component to the alarm platform if it determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client.
  • the monitoring server when the monitoring server simultaneously obtains the alarm information of the MDS component of the cluster and the alarm information of the client connected to the cluster, it is considered that the abnormal event of the client connected to the cluster may be caused by the abnormality of the MDS component of the cluster.
  • the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than the alarm information of the client, and reports the alarm information of the MDS component to the alarm platform, automatically shielding the alarm information of the low-level client.
  • the determining unit is further configured to set a cluster identification corresponding to each monitoring data.
  • the monitoring server marks each acquired monitoring data with its corresponding cluster identification, which helps the monitoring server to quickly make corresponding alarm operations when receiving the same monitoring data of the same cluster in the future.
  • the alarm rule further includes an alarm convergence rule; the determining unit is specifically configured to, if it determines that the alarm information repeats alarm information that has already appeared in the cluster, report the alarm information to the alarm platform after the set delay, according to the correspondence between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
  • when the monitoring server determines that the alarm information repeats alarm information that has already appeared in a certain cluster, it reports the repeated alarm to the alarm platform only after the delay set by the alarm convergence rules, which effectively prevents the cluster from sending the same alarm over and over and wasting resources.
  • an embodiment of the present invention provides a computing device, including:
  • a memory, configured to store program instructions;
  • a processor, configured to call the program instructions stored in the memory and to execute, according to the obtained program, the method according to any one of the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute the method according to any one of the first aspect.
  • Figure 1 is a monitoring architecture diagram of CephFS by Prometheus in the prior art
  • Figure 2 is a method for monitoring a distributed storage system provided by the present invention
  • Figure 3 is a diagram of the monitoring architecture of Prometheus for CephFS provided by the present invention.
  • Figure 4 is a device for monitoring a distributed storage system provided by the present invention.
  • Fig. 5 is a schematic diagram of a computing device provided by the present invention.
  • as shown in FIG. 2, a method for monitoring a distributed storage system provided by an embodiment of the present invention includes:
  • Step 201: The monitoring server sends a collection instruction to each cluster in the distributed storage system.
  • Step 202: The monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, where the monitoring data includes the health data of the cluster itself and the status data of the clients connected to the cluster.
  • Step 203: For at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule, and reports the alarm information to an alarm platform.
  • the monitoring server can monitor multiple clusters at the same time by issuing collection instructions to each cluster in the distributed storage system, thus avoiding the ineffectiveness of the monitoring server when the cluster and the Exporter version do not match.
  • the purpose of the monitoring server to monitor the clients connected to the cluster is realized.
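Steps 201 to 203 can be read as a single monitoring pass over all clusters. The Python sketch below is a simplified illustration under stated assumptions: the `MonitoringData` layout, the `collect` and `report` callables, and the example rule are all hypothetical stand-ins, and no real Prometheus or Ceph API is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class MonitoringData:
    """What one cluster feeds back for a collection instruction (assumed layout)."""
    cluster_health: Dict[str, str]   # e.g. OSD or PG status strings
    client_status: Dict[str, bool]   # client name -> currently connected?


# An alarm rule maps (cluster id, data) to zero or more alarm messages.
AlarmRule = Callable[[str, MonitoringData], List[str]]


def disconnected_clients_rule(cluster_id: str, data: MonitoringData) -> List[str]:
    """Example rule: one alarm per client that is no longer connected."""
    return [
        f"{cluster_id}: client {name} disconnected"
        for name, connected in sorted(data.client_status.items())
        if not connected
    ]


def monitor_once(
    collect: Callable[[str], MonitoringData],
    cluster_ids: Iterable[str],
    rules: Iterable[AlarmRule],
    report: Callable[[str], None],
) -> None:
    rules = list(rules)
    for cluster_id in cluster_ids:
        data = collect(cluster_id)                 # steps 201 and 202
        for rule in rules:                         # step 203
            for alarm in rule(cluster_id, data):
                report(alarm)
```

Because one server loops over every cluster, no per-cluster Exporter deployment appears in the sketch, which mirrors the benefit the text describes.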
  • the monitoring server sends collection instructions to each cluster in the distributed storage system.
  • for example, the distributed storage system is CephFS (Ceph File System, the Ceph file system).
  • the Ceph file system has multiple clusters, for example three: the Ceph file system cluster numbered A, the Ceph file system cluster numbered B, and the Ceph file system cluster numbered C.
  • Prometheus, as the monitoring server for CephFS, issues collection instructions to CephFS through its internal Prometheus Server.
  • specifically, the Prometheus Server issues collection instruction I to the Ceph file system cluster numbered A,
  • the Prometheus Server issues collection instruction I to the Ceph file system cluster numbered B, and
  • the Prometheus Server issues collection instruction I to the Ceph file system cluster numbered C.
  • the monitoring server obtains the monitoring data fed back by the clusters based on the collection instruction, and the monitoring data includes the health data of the cluster itself and the status data of the client connected to the cluster.
  • after the Prometheus Server issues collection instruction I to the Ceph file system cluster numbered A,
  • the Ceph file system cluster numbered A responds to collection instruction I and produces the monitoring data of the Ceph file system cluster numbered A.
  • the Prometheus Server thus obtains the monitoring data of the Ceph file system cluster numbered A; in the same way, the Prometheus Server obtains the monitoring data of the Ceph file system cluster numbered B and of the Ceph file system cluster numbered C.
  • the monitoring data of the Ceph file system cluster numbered A consists of the health data of the cluster itself (such as the operating status of the OSD components and the status data of the PGs) and the status data of the user-space file system clients connected to the cluster (such as whether each user-space file system client is connected to the Ceph file system numbered A).
  • for example, if 100 user-space file system clients are connected to the Ceph file system cluster numbered A, the monitoring data of that cluster includes the health data of the cluster itself
  • and also the status data of these 100 user-space file system clients; the monitoring data of the Ceph file system cluster numbered B and of the Ceph file system cluster numbered C
  • follows the same pattern as the monitoring data of the Ceph file system cluster numbered A and is not repeated here.
  • the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule, and reports the alarm information to the alarm platform.
  • Prometheus analyzes the monitoring data obtained from the Ceph file system cluster numbered A according to the preset alarm rules, and thereby determines
  • the alarm information of the Ceph file system cluster numbered A; Prometheus then reports this alarm information to the alarm platform, again according to the preset alarm rules.
  • the alarm platform may be an IMS (Information Management System) or another alarm platform, which is not limited in the present invention.
  • the alarm process of Prometheus for the Ceph file system clusters numbered B and C follows the alarm process for the Ceph file system cluster numbered A and is not repeated here.
  • any cluster includes multiple node servers, and every node server that has client connections is connected to the same clients; the monitoring server sending collection instructions to each cluster in the distributed storage system includes: for any monitoring server, the monitoring server issues collection instructions to at least two node servers in any cluster.
  • as shown in FIG. 3, which is a diagram of a Prometheus monitoring architecture for CephFS provided by an embodiment of the present invention,
  • two monitoring servers are deployed, namely the Prometheus server numbered X and the Prometheus server numbered Y, both of which are used to monitor the distributed storage system.
  • the distributed storage system has deployed the Ceph file system clusters numbered A, B, and C. The Ceph file system cluster numbered A includes multiple node servers; for convenience of description, assume it includes 4 node servers: the node servers numbered A1, A2, A3, and A4.
  • similarly, the Ceph file system cluster numbered B includes multiple node servers; assume it includes 4 node servers: the node servers numbered B1, B2, B3, and B4.
  • similarly, the Ceph file system cluster numbered C includes multiple node servers; assume it includes 4 node servers: the node servers numbered C1, C2, C3, and C4.
  • for the Ceph file system cluster numbered A, 100 user-space file system clients of the Ceph file system numbered A are connected to the node servers configured with MDS components in the cluster; assuming the cluster has 3 node servers configured with MDS components, these 100 clients are all connected to these 3 node servers (not shown in the figure). Similarly, for the Ceph file system cluster numbered B, 200 user-space file system clients of the Ceph file system numbered B are connected to the node servers configured with MDS components in the cluster; assuming the CephFS_B cluster has 3
  • node servers configured with MDS components,
  • these 200 clients are all connected to these 3 node servers configured with MDS components (not shown in the figure). Similarly,
  • for the Ceph file system cluster numbered C, 300 user-space file system clients of the Ceph file system numbered C are connected to the node servers configured with MDS components in the cluster; assuming the cluster
  • has 3 node servers configured with MDS components, these 300 clients are all connected to these 3 node servers configured with MDS components (not shown in the figure).
  • the monitoring server issues collection instructions to at least two node servers in each of the Ceph file system clusters numbered A, B, and C,
  • specifically as follows:
  • the Prometheus server numbered X issues collection instruction I to three node servers in the Ceph file system cluster numbered A: the node servers numbered A1, A2, and A4. At the same time, it issues collection instruction I to three node servers in the Ceph file system cluster numbered B: the node servers numbered B1, B3, and B4. At the same time, it issues collection instruction I to three node servers in the Ceph file system cluster numbered C: the node servers numbered C1, C2, and C4.
  • when the Prometheus server numbered X issues collection instructions to at least two node servers in the Ceph file system cluster numbered A, it selects any at least two node servers in that cluster at random.
  • for example, the Prometheus server numbered X can issue collection instruction I to the node servers numbered A1, A2, and A4 in the Ceph file system cluster numbered A,
  • or to the node servers numbered A2, A3, and A4 in that cluster,
  • or to the node servers numbered A1, A2, and A3 in that cluster, which is not limited by the present invention.
  • similarly, when the Prometheus server numbered X issues collection instructions to at least two node servers in the Ceph file system cluster numbered B, it selects any at least two node servers in that cluster at random;
  • likewise, when the Prometheus server numbered X issues collection instructions to at least two node servers in the Ceph file system cluster numbered C, it selects any at least two node servers in that cluster at random.
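The random choice of at least two node servers, and the fallback to another target when one is down, might be sketched in Python as follows. The function names, the default of three targets, and the use of `ConnectionError` to signal a down node are assumptions for illustration only.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def pick_targets(
    nodes: Sequence[str],
    count: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly choose `count` distinct node servers, never fewer than two."""
    if count < 2:
        raise ValueError("at least two node servers must be targeted")
    if count > len(nodes):
        raise ValueError("cluster has fewer node servers than requested")
    rng = rng or random.Random()
    return rng.sample(list(nodes), count)


def collect_from_cluster(
    targets: Sequence[str],
    fetch: Callable[[str], Dict[str, object]],
) -> Optional[Dict[str, object]]:
    """Return the data from the first target that answers, or None if all are down.

    `fetch` is a hypothetical callable that raises ConnectionError for a down node.
    """
    for node in targets:
        try:
            return fetch(node)
        except ConnectionError:
            continue
    return None
```

With the four node servers of the cluster numbered A, `pick_targets(["A1", "A2", "A3", "A4"])` yields three of them in random order, so a single failed node server still leaves two candidates to answer the collection instruction.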
  • the alarm rule includes an alarm generation rule; the monitoring server determining the alarm information from the monitoring data according to a preset alarm rule includes: the monitoring server determines, from the monitoring data, the first clients whose connection status with the cluster has changed; the monitoring server determines, according to the service changes of the cluster, the second clients whose connection status with the cluster has changed; and the monitoring server generates the alarm information of the client according to the alarm generation rule and the clients that are included in the first clients but not in the second clients.
  • for the CephFS_A cluster, for convenience of description, assume that 10 Ceph Fuse_A clients are connected to the cluster: W1, W2, W3, W4, W5, W6, W7, W8, W9, and W10.
  • these clients are connected to the node servers configured with the MDS component. Prometheus Server_X issues collection instruction I to the three node servers A1, A2, and A4 in the CephFS_A cluster.
  • Prometheus Server_X first obtains the monitoring data on the A1 node server.
  • for the business running on the CephFS_A cluster, some clients connected to the cluster are uninstalled routinely to meet business needs. For example, business personnel uninstall the three Ceph Fuse_A clients W5, W6, and W7 in the CephFS_A cluster. That is, the second clients whose connection status with the cluster has changed are the three Ceph Fuse_A clients W5, W6, and W7.
  • therefore, the offline events of the three Ceph Fuse_A clients W5, W6, and W7 that appear in the monitoring data do not need to be reported to the IMS system, whereas the uninstallation of the four Ceph Fuse_A clients W1, W2, W3, and W4 is an abnormal uninstallation event of the Ceph Fuse_A client, for which the alarm information of the client is generated according to the alarm generation rules.
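The alarm generation rule reduces to a set difference: clients whose status changed in the monitoring data, minus clients whose change is explained by a known service change. The Python sketch below replays the W1 to W10 example; the first-client set W1 to W7 is inferred from the outcome described in the text rather than stated there explicitly, and the function name and alarm message format are hypothetical.

```python
from typing import Iterable, List, Set


def clients_to_alarm(first_clients: Iterable[str], second_clients: Iterable[str]) -> Set[str]:
    """Clients whose status changed without a matching service change."""
    return set(first_clients) - set(second_clients)


# Inferred from the example: the monitoring data shows W1..W7 as changed.
first_clients = {"W1", "W2", "W3", "W4", "W5", "W6", "W7"}
# Known service change: business personnel uninstalled W5, W6 and W7.
second_clients = {"W5", "W6", "W7"}

abnormal = clients_to_alarm(first_clients, second_clients)
alarms: List[str] = [
    f"CephFS_A: Ceph Fuse_A client {name} went offline unexpectedly"
    for name in sorted(abnormal)
]
```

Only W1, W2, W3, and W4 remain after the difference, which matches the four abnormal uninstallation events in the example.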
  • the alarm rules also include alarm suppression rules; the monitoring server determines the change duration of the cluster's business changes; the monitoring server sets the alarm suppression rules for the alarm information of the client, so The alarm suppression rule of the client is used to not report the alarm information of the client generated within the change duration.
  • the three Ceph Fuse_A clients W5, W6, and W7 connected to the CephFS_A cluster are uninstalled normally. If uninstalling these three clients takes 3h, then for the entire 3h after Prometheus Server_X obtains the monitoring data on the A2 node server, Prometheus Server_X does not report the offline events of W5, W6, and W7 on the CephFS_A cluster to the IMS system. That is, Prometheus Server_X writes the offline events of the three Ceph Fuse_A clients W5, W6, and W7 from the CephFS_A cluster into the alarm suppression rules.
  • the monitoring server generates alarm information for the MDS component of the cluster according to the health data of the cluster itself; the monitoring server reporting the alarm information to the alarm platform according to a preset alarm rule includes: if the monitoring server determines that the alarm level of the MDS component's alarm information is higher than that of the client alarm information, it reports the MDS component's alarm information to the alarm platform.
  • Prometheus Server_X's monitoring data for the CephFS_A cluster includes the health data of the CephFS_A cluster itself (such as the operating status of the OSD component and the status data of the PGs) and the status data of the Ceph Fuse_A clients connected to the CephFS_A cluster (such as whether each Ceph Fuse_A client is connected to the CephFS_A cluster).
  • Prometheus Server_X defines the alarm level of an abnormal event that occurs while the MDS component in the CephFS_A cluster is running as high, and defines the alarm level of the abnormal uninstall event on the Ceph Fuse_A client W1 as low.
  • Prometheus Server_X reports the high-level alarm event to the IMS system; that is, it reports the abnormal event of the MDS component in the CephFS_A cluster to the IMS system and does not report the abnormal uninstall event on the Ceph Fuse_A client W1.
  • the monitoring server can set the alarm level of the MDS component's alarm information higher than that of the client alarm information because an abnormality of the MDS component in the cluster causes abnormal events on the clients connected to the cluster. Therefore, once the MDS component's alarm information has been reported to the IMS system and operation and maintenance personnel have investigated, not only is the MDS component restored to normal operation, but the clients connected to the cluster also return to their normal state.
  • the method further includes: the monitoring server sets a cluster identifier for each piece of monitoring data.
  • Prometheus Server_X sends collection instruction I at the same time to the three node servers A1, A2, and A4 in the CephFS_A cluster, to the three node servers B1, B3, and B4 in the CephFS_B cluster, and to the three node servers C1, C2, and C4 in the CephFS_C cluster; once collection instruction I has been answered in all three clusters, Prometheus Server_X obtains the monitoring data of each cluster.
  • each piece of monitoring data can be labeled with its cluster identifier: for example, the first piece obtained by Prometheus Server_X is the monitoring data on the A1 node server of the CephFS_A cluster, the second is the monitoring data on the B3 node server of the CephFS_B cluster, the third is the monitoring data on the C4 node server of the CephFS_C cluster, and so on.
  • the alarm rule further includes an alarm convergence rule; the monitoring server reporting the alarm information to the alarm platform according to a preset alarm rule includes: if the monitoring server determines that the alarm information is a repeat of alarm information that has already appeared in the cluster, it reports the alarm information to the alarm platform after a set delay, according to the mapping between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
  • Suppose the first piece of monitoring data obtained by Prometheus Server_X comes from the CephFS_A cluster, and analysis under the preset alarm rules determines that it can be reported as alarm information. Let the alarm information generated from the first piece of monitoring data be Info_1, with alarm level 1. Suppose the sixth piece of monitoring data obtained by Prometheus Server_X also relates to the CephFS_A cluster, and the alarm information generated from it matches Info_1. Prometheus Server_X then uses the alarm level of Info_1 to decide when to report the sixth piece of monitoring data to the IMS system: if the alarm delay for level 1 alarm information is 1h, Prometheus Server_X does not report the Info_1 corresponding to the sixth piece of monitoring data to the IMS system within the next 1h.
  • Suppose the second piece of monitoring data obtained by Prometheus Server_X comes from the CephFS_B cluster, and analysis under the preset alarm rules determines that it can be reported to the IMS system as alarm information. Let the alarm information generated from the second piece of monitoring data be Info_2, with alarm level 2. Suppose the ninth piece of monitoring data obtained by Prometheus Server_X also relates to the CephFS_B cluster, and the alarm information generated from it matches Info_2. Prometheus Server_X then uses the alarm level of Info_2 to decide when to report the ninth piece of monitoring data to the IMS system: if the alarm delay for level 2 alarm information is 2h, Prometheus Server_X does not report the Info_2 corresponding to the ninth piece of monitoring data to the IMS system within the next 2h.
  • Suppose the third piece of monitoring data obtained by Prometheus Server_X comes from the CephFS_C cluster, and analysis under the preset alarm rules determines that it can be reported to the IMS system as alarm information. Let the alarm information generated from the third piece of monitoring data be Info_3, with alarm level 3. Suppose the tenth piece of monitoring data obtained by Prometheus Server_X also relates to the CephFS_C cluster, and the alarm information generated from it matches Info_3. Prometheus Server_X then uses the alarm level of Info_3 to decide when to report the tenth piece of monitoring data to the IMS system: if the alarm delay for level 3 alarm information is 3h, Prometheus Server_X does not report the Info_3 corresponding to the tenth piece of monitoring data to the IMS system within the next 3h.
  • once the monitoring server determines that alarm information is a repeat of alarm information that has already appeared in a cluster, it reports the repeated alarm to the alarm platform only after the set delay specified by the alarm convergence rule, which effectively prevents the waste of resources caused by the cluster repeatedly issuing the same alarm.
  • an embodiment of the present invention also provides a device for monitoring a distributed storage system. As shown in FIG. 4, the device includes:
  • the sending unit 401 is configured to send collection instructions to each cluster in the distributed storage system
  • the obtaining unit 402 is configured to obtain monitoring data fed back by each cluster based on the collection instruction, the monitoring data including the health data of the cluster itself and the status data of the client connected to the cluster;
  • the determining unit 403 is configured to, for at least one cluster, determine alarm information from the monitoring data of the cluster according to preset alarm rules and report the alarm information to the alarm platform.
  • there are multiple monitoring servers; any cluster includes multiple node servers, and all node servers that have clients connected are connected to the same clients; for any monitoring server, the sending unit 401 is specifically configured to issue collection instructions to at least two node servers in any cluster.
  • the alarm rule includes an alarm generation rule; the determining unit 403 is specifically configured to: determine, from the monitoring data, the first clients whose connection status with the cluster has changed; determine, according to the service changes of the cluster, the second clients whose connection status with the cluster has changed; and generate client alarm information according to the alarm generation rule and the clients that are included in the first clients but not in the second clients.
  • the alarm rule also includes an alarm suppression rule; the determining unit 403 is specifically configured to: determine the change duration of the cluster's service change; and set an alarm suppression rule for the client alarm information, the rule specifying that client alarm information generated within the change duration is not reported.
  • the monitoring server generates alarm information for the MDS component of the cluster according to the health data of the cluster itself; the determining unit 403 is specifically configured to report the MDS component's alarm information to the alarm platform if it determines that the alarm level of the MDS component's alarm information is higher than that of the client alarm information.
  • the determining unit 403 is further configured to set a cluster identifier corresponding to each monitoring data.
  • the alarm rule also includes an alarm convergence rule; the determining unit 403 is specifically configured to: if it determines that the alarm information is a repeat of alarm information that has already appeared in the cluster, report the alarm information to the alarm platform after a set delay, according to the mapping between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
  • the embodiment of the present invention provides a computing device, and the computing device may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a personal digital assistant (PDA), etc.
  • the computing device may include a central processing unit (CPU), a memory, an input/output device, etc.
  • the input device may include a keyboard, a mouse, a touch screen, etc.
  • an output device may include a display device, such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
  • the memory may include read-only memory (ROM) and random access memory (RAM), and provides the processor with program instructions and data stored in the memory.
  • the memory may be used to store the program instructions of the method for monitoring the distributed storage system;
  • the processor is configured to call the program instructions stored in the memory, and execute the method of monitoring the distributed storage system according to the obtained program.
  • as shown in FIG. 5, which is a schematic diagram of a computing device provided by an embodiment of this application, the computing device includes:
  • the processor 501 is configured to read a program in the memory 502, and execute the foregoing method for monitoring a distributed storage system;
  • the processor 501 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It may also be a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the memory 502 is configured to store one or more executable programs, and can store data used by the processor 501 when performing operations.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 502 may include volatile memory, such as random-access memory (RAM); the memory 502 may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 502 may also include a combination of the foregoing types of memory.
  • the memory 502 stores the following elements, executable modules or data structures, or their subsets, or their extended sets:
  • Operating instructions: including various operating instructions, used to implement various operations.
  • Operating system: including various system programs, used to implement various basic services and to process hardware-based tasks.
  • the bus 505 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, the bus is represented by a single thick line in FIG. 5, but this does not mean that there is only one bus or one type of bus.
  • the bus interface 504 may be a wired bus interface, a wireless bus interface, or a combination thereof, where the wired bus interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless bus interface may be a WLAN interface.
  • the embodiment of the present invention provides a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute a method for monitoring a distributed storage system.
  • the embodiments of the present invention can be provided as a method or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

The present invention provides a method and an apparatus for monitoring a distributed storage system. The method comprises: a monitoring server sending acquisition instructions to clusters in a distributed storage system; the monitoring server acquiring monitoring data fed back by the clusters on the basis of the acquisition instructions, the monitoring data comprising health data of the clusters and state data of clients connected to the clusters; and for at least one cluster, the monitoring server determining alarm information from the monitoring data of the cluster according to a preset alarm rule, and reporting the alarm information to an alarm platform. In the solution, the monitoring server issues the acquisition instructions to the clusters in the distributed storage system, so that the monitoring server can monitor a plurality of clusters at the same time; in addition, the monitoring data fed back by the clusters comprises state data of the clients connected to the clusters, which allows the monitoring server to determine alarm information by analyzing the state data of those clients and thereby to monitor the clients connected to the clusters.

Description

Method and device for monitoring a distributed storage system
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201911336662.5, entitled "Method and device for monitoring a distributed storage system", filed with the Chinese Patent Office on December 23, 2019, the entire content of which is incorporated herein by reference.
Technical field
The present invention relates to the field of financial technology (Fintech), and in particular to a method and device for monitoring a distributed storage system.
Background
With the development of computer technology, more and more technologies (such as cloud computing and big data) are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology; big data technology is no exception. However, the security and real-time requirements of the financial and payment industries also place higher demands on big data technology.
Considering the scalability and high availability required for massive data, the banking industry generally chooses a distributed storage system such as CephFS (Ceph File System) as its shared-storage solution. Ceph Fuse clients (the user-space file system clients of the Ceph file system) connect to CephFS. Those skilled in the art usually monitor CephFS with the open-source monitoring system Prometheus. Prometheus mainly consists of the Exporter (the client that collects Prometheus monitoring data) and the Prometheus Server (the server side of Prometheus monitoring). CephFS mainly consists of components such as the monitor (MON), the object storage device (OSD), and the metadata server (MDS); in addition, placement groups (PGs) are distributed across the CephFS OSD components.
The prior-art solution in which Prometheus monitors CephFS has the following two problems:
First, Prometheus's monitoring of CephFS consists mainly of collecting data on the status of the CephFS OSD components and the CephFS PGs; Prometheus does not monitor the Ceph Fuse clients.
Second, the architecture with which Prometheus monitors CephFS is bloated: a separate Prometheus deployment is needed for each CephFS, and, because CephFS versions differ, a different Exporter must be deployed for each CephFS version. FIG. 1 shows the prior-art architecture in which Prometheus monitors CephFS. Referring to FIG. 1, Exporter M collects the monitoring data of Ceph file system M and, if the collected data satisfies the rules for generating alarm information, reports the generated alarm information to Prometheus server M. Likewise, Exporter N collects the monitoring data of Ceph file system N and, if the collected data satisfies the rules for generating alarm information, reports the generated alarm information to Prometheus server N. However, because Exporter M does not match the version of Ceph file system N, Exporter M cannot be used to collect the monitoring data of Ceph file system N and thereby report alarm information for Ceph file system N. In other words, there is no high availability among the Prometheus Server, the Exporter, and CephFS, so monitoring information cannot be reported in time under abnormal conditions.
In summary, the prior art has two problems: Prometheus cannot monitor Ceph Fuse clients, and Prometheus monitors CephFS inefficiently.
Summary of the invention
The present invention provides a method and device for monitoring a distributed storage system, to solve the problems that Prometheus cannot monitor Ceph Fuse clients and that Prometheus monitors CephFS inefficiently.
In a first aspect, an embodiment of the present invention provides a method for monitoring a distributed storage system. The method includes: a monitoring server sends collection instructions to each cluster in the distributed storage system; the monitoring server obtains the monitoring data fed back by each cluster in response to the collection instructions, the monitoring data including health data of the cluster itself and status data of the clients connected to the cluster; and, for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to preset alarm rules and reports the alarm information to an alarm platform.
Based on this solution, the monitoring server issues collection instructions to each cluster in the distributed storage system and can therefore monitor multiple clusters at the same time, which avoids the situation in which the monitoring server cannot effectively monitor a cluster because the cluster and Exporter versions do not match. In addition, the monitoring data that each cluster feeds back to the monitoring server includes the status data of the clients connected to the cluster, so the monitoring server can determine alarm information by analyzing that status data and thereby monitor the clients connected to the cluster.
In a possible implementation, there are multiple monitoring servers; each cluster includes multiple node servers, and all node servers that have clients connected are connected to the same clients. The monitoring server sending collection instructions to each cluster in the distributed storage system includes: for any monitoring server, the monitoring server issues collection instructions to at least two node servers in any cluster.
Based on this solution, setting up multiple monitoring servers for the distributed storage system has two benefits. First, by frequently obtaining monitoring data from each cluster, the distributed storage system can be monitored comprehensively and even in real time. Second, if one or several monitoring servers go down, other monitoring servers remain available to monitor the system. Each monitoring server issues collection instructions to at least two node servers in each cluster, so that if one node server goes down, the monitoring server can still obtain the cluster's monitoring data from another available node server and thus monitor each cluster effectively.
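The node-server redundancy described above can be illustrated with a minimal Python sketch. It assumes a hypothetical `fetch(node)` callable that returns a node server's monitoring data or raises `ConnectionError` when the node is down; the function names and the failover order are illustrative assumptions, not part of the described method.

```python
import random


def pick_targets(nodes, k=2, rng=random):
    """Pick k distinct node servers at random (the text requires at least two)."""
    nodes = list(nodes)
    if k < 2 or k > len(nodes):
        raise ValueError("need 2 <= k <= number of node servers")
    return rng.sample(nodes, k)


def collect(targets, fetch):
    """Return (node, data) from the first target that answers.

    `fetch` is a hypothetical callable standing in for the collection
    instruction; it is assumed to raise ConnectionError for a node that is down.
    """
    for node in targets:
        try:
            return node, fetch(node)
        except ConnectionError:
            continue  # this node server is down; try the next one
    raise RuntimeError("no node server responded")
```

Because every node server with clients attached sees the same clients, data from any one responding node server is enough to describe the cluster, which is why the sketch stops at the first answer.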
In a possible implementation, the alarm rules include an alarm generation rule. The monitoring server determining alarm information from the monitoring data according to the preset alarm rules includes: the monitoring server determines, from the monitoring data, the first clients whose connection status with the cluster has changed; the monitoring server determines, according to the service changes of the cluster, the second clients whose connection status with the cluster has changed; and the monitoring server generates client alarm information according to the alarm generation rule and the clients that are included in the first clients but not in the second clients.
Based on this solution, analyzing the monitoring data yields the first clients, whose connection status with the cluster has changed, and analyzing the known service changes yields the second clients, whose connection status changed because of those service changes. Comparing the first clients with the second clients produces the alarm information caused by client anomalies.
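The comparison of first and second clients amounts to a set difference. The following Python sketch shows one way to express it, using the W1 to W10 clients from the worked example; the function name and the representation of client state as lists of names are assumptions made for illustration.

```python
def abnormal_clients(connected_before, connected_now, planned_changes):
    """Return clients whose disconnection is not explained by a planned change.

    first  = clients whose connection status changed according to monitoring data
    second = clients whose connection status changed because of known service changes
    """
    first = set(connected_before) - set(connected_now)
    second = set(planned_changes)
    return sorted(first - second)


# Worked example: W1..W7 went offline, but only W5..W7 were planned.
before = [f"W{i}" for i in range(1, 11)]
now = ["W8", "W9", "W10"]
planned = ["W5", "W6", "W7"]
alarming = abnormal_clients(before, now, planned)
```

Here `alarming` holds W1, W2, W3, and W4, the clients for which alarm information would be generated.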
In a possible implementation, the alarm rules further include an alarm suppression rule; the monitoring server determines the change duration of the cluster's service change; the monitoring server sets an alarm suppression rule for the client alarm information, and the rule specifies that client alarm information generated within the change duration is not reported.
Based on this solution, once the time that the cluster needs for a service change has been determined, the monitoring server does not report client alarm information to the alarm platform during that time, which effectively avoids generating known but useless alarms.
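A suppression rule of this kind can be sketched as a time window per client. The Python below is an illustrative assumption about how such a window might be stored and checked; the class and method names are hypothetical and the 3h duration comes from the worked example.

```python
from datetime import datetime, timedelta


class SuppressionRules:
    """Keeps, per (cluster, client), the time until which alarms are withheld."""

    def __init__(self):
        self._until = {}  # (cluster, client) -> end of the suppression window

    def suppress(self, cluster, clients, start, duration):
        for client in clients:
            self._until[(cluster, client)] = start + duration

    def should_report(self, cluster, client, when):
        end = self._until.get((cluster, client))
        # Report if no rule exists or the change duration has already elapsed.
        return end is None or when >= end


rules = SuppressionRules()
start = datetime(2020, 1, 1, 9, 0)
rules.suppress("CephFS_A", ["W5", "W6", "W7"], start, timedelta(hours=3))
```

With these rules, an offline event for W5 one hour after `start` is withheld, while an event for W1 at the same moment is still reported.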
In a possible implementation, the monitoring server generates alarm information for the MDS component of the cluster according to the health data of the cluster itself. The monitoring server reporting the alarm information to the alarm platform according to the preset alarm rules includes: if the monitoring server determines that the alarm level of the MDS component's alarm information is higher than that of the client alarm information, it reports the MDS component's alarm information to the alarm platform.
Based on this solution, when the monitoring server obtains alarm information for the cluster's MDS component and alarm information for clients connected to the cluster at the same time, the abnormality of the MDS component may have caused the abnormal events on the clients. The monitoring server therefore determines that the alarm level of the MDS component's alarm information is higher than that of the client alarm information, reports the MDS component's alarm information to the alarm platform, and automatically masks the lower-level client alarm information.
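One way to picture the level-based masking is to keep, for each cluster, only the alarms at the highest level present. The Python sketch below does this under stated assumptions: alarms are plain dictionaries, the two levels are named "high" and "low" as in the worked example, and a client alarm that arrives without a competing MDS alarm is still reported.

```python
LEVEL_RANK = {"high": 2, "low": 1}  # illustrative two-level scheme


def select_for_report(alarms):
    """Keep only the alarms at the highest level present in each cluster."""
    top = {}
    for alarm in alarms:
        rank = LEVEL_RANK[alarm["level"]]
        top[alarm["cluster"]] = max(top.get(alarm["cluster"], 0), rank)
    return [a for a in alarms if LEVEL_RANK[a["level"]] == top[a["cluster"]]]


alarms = [
    {"cluster": "CephFS_A", "source": "MDS", "level": "high"},
    {"cluster": "CephFS_A", "source": "W1", "level": "low"},
    {"cluster": "CephFS_B", "source": "W2", "level": "low"},
]
reported = select_for_report(alarms)
```

In this example the W1 client alarm in CephFS_A is masked by the MDS alarm, while the W2 alarm in CephFS_B passes through because nothing of a higher level competes with it.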
In one possible implementation, after the monitoring server obtains the monitoring data fed back by each cluster in response to the collection instruction, the method further includes: the monitoring server sets, for each item of monitoring data, the identifier of the cluster it came from.
Based on this solution, the monitoring server tags each item of monitoring data with the identifier of its cluster, so that when it later receives the same monitoring data from the same cluster again, it can quickly take the corresponding alarm action.
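Tagging can be as simple as adding a cluster field to every sample before it is stored. The sample layout and the `cluster` key below are illustrative assumptions, not a format given in the source.

```python
def tag_with_cluster(samples, cluster_id):
    """Return copies of the samples, each labelled with its cluster identifier."""
    return [{**sample, "cluster": cluster_id} for sample in samples]


# Two clusters can report the same metric name; the tag keeps them distinct.
raw_a = [{"metric": "client_connected", "client": "W1", "value": 1}]
raw_b = [{"metric": "client_connected", "client": "W1", "value": 0}]
tagged = tag_with_cluster(raw_a, "A") + tag_with_cluster(raw_b, "B")

# Index by (cluster, metric, client) so repeated data from one cluster is easy to spot.
index = {(s["cluster"], s["metric"], s["client"]): s["value"] for s in tagged}
```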
In one possible implementation, the alarm rules further include an alarm convergence rule. The monitoring server reporting the alarm information to the alarm platform according to the preset alarm rules includes: if the monitoring server determines that the alarm information repeats alarm information that has already appeared in the cluster, it reports the alarm information to the alarm platform after a set delay, according to the mapping between alarm level and alarm delay in the alarm convergence rule. The lower the alarm level, the longer the corresponding alarm delay.
Based on this solution, once the monitoring server determines that alarm information repeats earlier alarm information from a given cluster, it reports the repeated alarm to the alarm platform only after the set delay, according to the alarm convergence rule. This effectively prevents the cluster from sending the same alarm over and over and wasting resources.
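The convergence rule amounts to a lookup from alarm level to delay, applied only to repeats. In the sketch below the level names, the delay values, and the choice to report a first occurrence immediately are all assumptions made for illustration.

```python
# Hypothetical mapping: the lower the alarm level, the longer the delay (seconds).
DELAY_SECONDS = {"critical": 60, "major": 300, "minor": 900}


class AlarmConverger:
    """Tracks which alarms each cluster has already raised."""

    def __init__(self, delays=DELAY_SECONDS):
        self._delays = delays
        self._seen = set()

    def delay_for(self, cluster, alarm_id, level):
        """Return how long to wait before reporting this alarm."""
        key = (cluster, alarm_id)
        if key not in self._seen:
            # First occurrence in this cluster: assumed to be reported at once.
            self._seen.add(key)
            return 0
        # Repeat of the same alarm: delay according to its level.
        return self._delays[level]


converger = AlarmConverger()
first = converger.delay_for("A", "osd_down", "minor")
repeat = converger.delay_for("A", "osd_down", "minor")
```

Because the key includes the cluster, the same alarm raised for the first time by a different cluster is not delayed.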
In a second aspect, an embodiment of the present invention provides an apparatus for monitoring a distributed storage system. The apparatus includes: a sending unit, configured to send a collection instruction to each cluster in the distributed storage system; an obtaining unit, configured to obtain the monitoring data fed back by each cluster in response to the collection instruction, where the monitoring data includes the cluster's own health data and the status data of the clients connected to the cluster; and a determining unit, configured to, for at least one cluster, determine alarm information from the cluster's monitoring data according to preset alarm rules and report the alarm information to an alarm platform.
Based on this solution, the monitoring server issues collection instructions to each cluster in the distributed storage system, so that it can monitor multiple clusters at the same time. This avoids the situation in which the monitoring server cannot monitor the clusters effectively because a cluster's version does not match the Exporter's version. In addition, the monitoring data that each cluster feeds back to the monitoring server includes the status data of the clients connected to the cluster, so the monitoring server can determine alarm information by analyzing that status data and thereby monitor the clients connected to the cluster.
In one possible implementation, there are multiple monitoring servers; each cluster includes multiple node servers, and every node server that has clients connected to it is connected to the same set of clients. For any one monitoring server, the sending unit is specifically configured to issue collection instructions to at least two node servers in any cluster.
Based on this solution, multiple monitoring servers are set up for the distributed storage system. On the one hand, by frequently obtaining monitoring data from each cluster in the distributed storage system, they can monitor the system comprehensively and even in real time. On the other hand, if one or several monitoring servers go down, other monitoring servers remain available to monitor the system. Each monitoring server issues collection instructions to at least two node servers in each cluster, so that if one node server goes down, the monitoring server can still obtain the monitoring data of that cluster from another available node server, and thus monitors every cluster effectively.
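The node-level redundancy described above can be sketched as trying the selected node servers in turn until one answers. The `fetch` callable and the use of `ConnectionError` to signal a node that is down are assumptions for the example; the source does not specify a transport or an error model.

```python
def fetch_from_cluster(node_servers, fetch):
    """Return monitoring data from the first node server that responds."""
    last_error = None
    for node in node_servers:
        try:
            return fetch(node)
        except ConnectionError as exc:
            # This node server is unavailable; fall through to the next one.
            last_error = exc
    raise ConnectionError("no node server in the cluster responded") from last_error


def fake_fetch(node):
    """Stand-in for a real collection call: node A1 is down in this example."""
    if node == "A1":
        raise ConnectionError("A1 is down")
    return {"node": node, "health": "ok"}


result = fetch_from_cluster(["A1", "A2", "A4"], fake_fetch)
```

Because every node server with clients attached serves the same set of clients, data obtained from A2 describes the same cluster state that A1 would have reported.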
In one possible implementation, the alarm rules include an alarm generation rule. The determining unit is specifically configured to: determine, from the monitoring data, the first clients, whose connection status with the cluster has changed; determine, from the service changes of the cluster, the second clients, whose connection status with the cluster has changed because of those service changes; and generate client alarm information according to the alarm generation rule and the clients that are among the first clients but not among the second clients.
Based on this solution, analyzing the monitoring data identifies the first clients, whose connection status with the cluster has changed, and analyzing the known service changes identifies the second clients, whose connection status has changed for a known reason. Comparing the first clients with the second clients then yields the alarm information caused by client anomalies.
In one possible implementation, the alarm rules further include an alarm suppression rule. The determining unit is specifically configured to determine the duration of a service change of the cluster and to set an alarm suppression rule for the client alarm information; under this rule, client alarm information generated within the change duration is not reported.
Based on this solution, once the duration that the cluster requires for a business purpose has been determined, the monitoring server does not report client alarm information to the alarm platform during that duration, which effectively avoids alarms that are already known and therefore useless.
In one possible implementation, the monitoring server generates alarm information for the MDS component of the cluster from the cluster's own health data. The determining unit is specifically configured to determine that the alarm level of the MDS component's alarm information is higher than that of the client alarm information, and to report the MDS component's alarm information to the alarm platform.
Based on this solution, when the monitoring server obtains alarm information for the cluster's MDS component and alarm information for the clients connected to the cluster at the same time, the client anomalies may have been caused by the MDS anomaly. The monitoring server therefore treats the MDS alarm information as having a higher alarm level than the client alarm information, reports the MDS alarm information to the alarm platform, and automatically masks the lower-level client alarm information.
In one possible implementation, after the monitoring server obtains the monitoring data fed back by each cluster in response to the collection instruction, the determining unit is further configured to set, for each item of monitoring data, the identifier of the cluster it came from.
Based on this solution, the monitoring server tags each item of monitoring data with the identifier of its cluster, so that when it later receives the same monitoring data from the same cluster again, it can quickly take the corresponding alarm action.
In one possible implementation, the alarm rules further include an alarm convergence rule. The determining unit is specifically configured to: if the alarm information repeats alarm information that has already appeared in the cluster, report the alarm information to the alarm platform after a set delay, according to the mapping between alarm level and alarm delay in the alarm convergence rule. The lower the alarm level, the longer the corresponding alarm delay.
Based on this solution, once the monitoring server determines that alarm information repeats earlier alarm information from a given cluster, it reports the repeated alarm to the alarm platform only after the set delay, according to the alarm convergence rule. This effectively prevents the cluster from sending the same alarm over and over and wasting resources.
In a third aspect, an embodiment of the present invention provides a computing device, including:
a memory, configured to store program instructions; and
a processor, configured to call the program instructions stored in the memory and, according to the obtained program, execute the method of any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores computer-executable instructions, where the computer-executable instructions cause a computer to execute the method of any implementation of the first aspect.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a diagram of the prior-art architecture in which Prometheus monitors CephFS;
Figure 2 shows a method for monitoring a distributed storage system provided by the present invention;
Figure 3 is a diagram of an architecture, provided by the present invention, in which Prometheus monitors CephFS;
Figure 4 shows an apparatus for monitoring a distributed storage system provided by the present invention;
Figure 5 is a schematic diagram of a computing device provided by the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Figure 2 shows a method for monitoring a distributed storage system provided by an embodiment of the present invention. The method includes:
Step 201: The monitoring server sends a collection instruction to each cluster in the distributed storage system.
Step 202: The monitoring server obtains the monitoring data fed back by each cluster in response to the collection instruction. The monitoring data includes the cluster's own health data and the status data of the clients connected to the cluster.
Step 203: For at least one cluster, the monitoring server determines alarm information from the cluster's monitoring data according to preset alarm rules and reports the alarm information to an alarm platform.
Based on this solution, the monitoring server issues collection instructions to each cluster in the distributed storage system, so that it can monitor multiple clusters at the same time. This avoids the situation in which the monitoring server cannot monitor the clusters effectively because a cluster's version does not match the Exporter's version. In addition, the monitoring data that each cluster feeds back to the monitoring server includes the status data of the clients connected to the cluster, so the monitoring server can determine alarm information by analyzing that status data and thereby monitor the clients connected to the cluster.
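Steps 201 to 203 form a simple collect, evaluate, and report loop. The sketch below shows that shape only; `collect`, `evaluate`, and `report` are placeholder callables, and the health strings are sample values chosen for the example rather than a format specified in the source.

```python
def monitor_once(clusters, collect, evaluate, report):
    """Run one monitoring pass over every cluster."""
    for cluster_id in clusters:
        # Steps 201 and 202: send the collection instruction, get the data back.
        data = collect(cluster_id)
        # Step 203: apply the alarm rules and report each resulting alarm.
        for alarm in evaluate(cluster_id, data):
            report(alarm)


def fake_collect(cluster_id):
    """Stand-in for real collection: cluster B is unhealthy in this example."""
    health = "HEALTH_ERR" if cluster_id == "B" else "HEALTH_OK"
    return {"health": health, "clients": {}}


def fake_evaluate(cluster_id, data):
    """Stand-in alarm rule: alarm on any health value other than HEALTH_OK."""
    if data["health"] != "HEALTH_OK":
        yield f"cluster {cluster_id}: {data['health']}"


reported = []
monitor_once(["A", "B", "C"], fake_collect, fake_evaluate, reported.append)
```

A single monitoring server drives all three clusters here, which mirrors the point that one server can monitor several clusters at once.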
In step 201 above, the monitoring server sends a collection instruction to each cluster in the distributed storage system.
Suppose that a distributed storage system such as CephFS (Ceph File System) contains multiple clusters, for example three, denoted Ceph file system cluster A, Ceph file system cluster B, and Ceph file system cluster C. Prometheus, a monitoring server used to monitor CephFS, issues collection instructions to CephFS through its internal Prometheus Server. Specifically, the Prometheus Server issues collection instruction I to cluster A, to cluster B, and to cluster C.
In step 202 above, the monitoring server obtains the monitoring data fed back by each cluster in response to the collection instruction. The monitoring data includes the cluster's own health data and the status data of the clients connected to the cluster.
After the Prometheus Server issues collection instruction I to Ceph file system cluster A, cluster A responds to the instruction with its monitoring data, and the Prometheus Server thereby obtains the monitoring data of cluster A. In the same way, the Prometheus Server obtains the monitoring data of cluster B and of cluster C.
The monitoring data of cluster A consists of the health data of cluster A itself (such as the running status of the OSD components and the status data of the PGs) and the status data of the user-space file system (Ceph FUSE) clients connected to cluster A (such as whether each client is connected to Ceph file system A). For example, if 100 such clients are connected to cluster A, the monitoring data of cluster A includes not only the health data of cluster A itself but also the status data of those 100 clients. The monitoring data of clusters B and C is analogous to that of cluster A and is not described again here.
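The two parts of the monitoring data can be pictured as one record per cluster. The field names and sample values below are illustrative assumptions; the source names the kinds of data (OSD status, PG status, client connection status) but not a concrete schema.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterMonitoringData:
    """Monitoring data fed back by one cluster (hypothetical layout)."""
    cluster: str
    health: dict = field(default_factory=dict)            # cluster's own health data
    client_connected: dict = field(default_factory=dict)  # client name -> connected?

    def offline_clients(self):
        """Names of clients that are currently not connected to the cluster."""
        return sorted(c for c, ok in self.client_connected.items() if not ok)


# Cluster A with 100 connected clients, one of which then drops off.
data_a = ClusterMonitoringData(
    cluster="A",
    health={"osd_up": 12, "osd_total": 12, "pg_state": "active+clean"},
    client_connected={f"client-{i}": True for i in range(1, 101)},
)
data_a.client_connected["client-7"] = False
```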
In step 203 above, for at least one cluster, the monitoring server determines alarm information from the cluster's monitoring data according to preset alarm rules and reports the alarm information to the alarm platform.
Take Ceph file system cluster A as an example. Prometheus analyzes the monitoring data obtained from cluster A according to the preset alarm rules and thereby determines the alarm information for cluster A. Prometheus then reports that alarm information to the alarm platform, again according to the preset alarm rules. The alarm platform may be an IMS (Information Management System) or another alarm platform; the present invention does not limit this. The alarm process for clusters B and C is the same as for cluster A and is not described again here.
In one possible implementation, there are multiple monitoring servers; each cluster includes multiple node servers, and every node server that has clients connected to it is connected to the same set of clients. The monitoring server sending a collection instruction to each cluster in the distributed storage system includes: each monitoring server issues collection instructions to at least two node servers in any cluster.
Figure 3 is a diagram of an architecture, provided by an embodiment of the present invention, in which Prometheus monitors CephFS. Referring to Figure 3, two monitoring servers are deployed, denoted Prometheus server X and Prometheus server Y, and both monitor the distributed storage system. The system contains Ceph file system cluster A, Ceph file system cluster B, and Ceph file system cluster C, and each cluster includes multiple node servers. For ease of description, assume that cluster A includes four node servers, denoted A1, A2, A3, and A4; that cluster B likewise includes four node servers, denoted B1, B2, B3, and B4; and that cluster C includes four node servers, denoted C1, C2, C3, and C4.
For Ceph file system cluster A, suppose that 100 user-space file system (Ceph FUSE) clients of cluster A are connected to the node servers in the cluster that are configured with the MDS component. If three node servers in cluster A are configured with the MDS component, then all 100 clients are connected to each of these three node servers (not shown in the figure). Similarly, suppose that 200 such clients are connected to cluster B and that three node servers in cluster B are configured with the MDS component; then all 200 clients are connected to each of these three node servers (not shown in the figure). Likewise, suppose that 300 such clients are connected to cluster C and that three node servers in cluster C are configured with the MDS component; then all 300 clients are connected to each of these three node servers (not shown in the figure).
Take Prometheus server X as an example. This monitoring server issues collection instructions to at least two node servers in each of clusters A, B, and C, as follows:
Suppose that at 8:00 a.m., Prometheus server X issues collection instruction I to three node servers in cluster A: A1, A2, and A4. At the same time, it issues collection instruction I to three node servers in cluster B (B1, B3, and B4) and to three node servers in cluster C (C1, C2, and C4).
Note that when Prometheus server X issues collection instructions to at least two node servers in cluster A, it chooses those node servers at random from the cluster. For example, it may issue collection instruction I to node servers A1, A2, and A4, or to A2, A3, and A4, or to A1, A2, and A3; the present invention does not limit this. In the same way, Prometheus server X randomly chooses at least two node servers in cluster B, and at least two node servers in cluster C, to receive the collection instructions.
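Random selection of at least two node servers can be sketched with the standard library's `random.sample`, which picks distinct elements. The function name, the default count of three, and the seeded generator used for the example are assumptions; the source requires only that the choice be random and cover at least two node servers.

```python
import random


def pick_node_servers(node_servers, count=3, rng=random):
    """Randomly choose `count` distinct node servers (at least two)."""
    if count < 2:
        raise ValueError("at least two node servers must be selected")
    if count > len(node_servers):
        raise ValueError("cluster has fewer node servers than requested")
    return rng.sample(node_servers, count)


# A seeded generator makes the example repeatable; production code would not seed.
targets = pick_node_servers(["A1", "A2", "A3", "A4"], count=3, rng=random.Random(0))
```

Each monitoring server would call this once per cluster per collection round, so different rounds may reach different node servers.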
In one possible implementation, the alarm rules include an alarm generation rule. The monitoring server determining alarm information from the monitoring data according to the preset alarm rules includes: the monitoring server determines, from the monitoring data, the first clients, whose connection status with the cluster has changed; the monitoring server determines, from the service changes of the cluster, the second clients, whose connection status with the cluster has changed because of those service changes; and the monitoring server generates client alarm information according to the alarm generation rule and the clients that are among the first clients but not among the second clients.
For example, consider the CephFS_A cluster (Ceph file system cluster A). For ease of description, suppose that ten Ceph Fuse_A clients, W1 through W10, are connected to the node servers in the cluster that are configured with the MDS component. Prometheus Server_X issues collection instruction I to three node servers in the CephFS_A cluster: A1, A2, and A4. Suppose that Prometheus Server_X first obtains the monitoring data on node server A1 and, by analyzing it, determines that all ten Ceph Fuse_A clients W1 through W10 are connected to the CephFS_A cluster. Prometheus Server_X then obtains the monitoring data on node server A2 and, by analyzing it, determines that only the three clients W8, W9, and W10 are still connected to the CephFS_A cluster, while the seven clients W1 through W7 have gone offline. That is, the first clients, whose connection status with the cluster has changed, are the seven Ceph Fuse_A clients W1, W2, W3, W4, W5, W6, and W7.
For this abnormal event, it is necessary to determine further why the seven Ceph Fuse_A clients W1 through W7 went offline from the CephFS_A cluster: whether they were unmounted from the cluster normally, or were unmounted passively because of a problem in the CephFS_A cluster itself.
运行于CephFS_A集群上的业务,出于业务需要的目的,会对连接于CephFS_A集群上的部分客户端进行日常的卸载工作。比如,出于业务需要的目的,业务人员会对CephFS_A集群中的W5、W6和W7这3台Ceph Fuse_A客户端进行卸载。也即与所述集群的连接状态发生变化的第二客户端分别为W5、W6和W7这3台Ceph Fuse_A客户端。For the business running on the CephFS_A cluster, for the purpose of business needs, some clients connected to the CephFS_A cluster will be uninstalled daily. For example, for the purpose of business needs, business personnel will uninstall the three Ceph Fuse_A clients, W5, W6, and W7 in the CephFS_A cluster. That is, the second clients whose connection status with the cluster have changed are three Ceph Fuse_A clients, W5, W6, and W7.
通过对第一客户端(分别有W1、W2、W3、W4、W5、W6和W7这7台Ceph Fuse_A客户端)和第二客户端(W5、W6和W7这3台Ceph Fuse_A客户端)的比较,可以发现W5、W6和W7这3台Ceph Fuse_A客户端的卸载是属于Ceph Fuse_A客户端的正常卸载事件,从而对于监控数据中的W5、W6和W7这3台Ceph Fuse_A客户端的离线不需要上报至IMS系统;而对于W1、W2、W3和W4这4台Ceph Fuse_A客户端的卸载属于Ceph Fuse_A客户端的异常卸载事件,则根据告警生成规则,生成客户端的告警信息。Through the first client (there are 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6, and W7) and the second client (3 Ceph Fuse_A clients W5, W6, and W7) By comparison, it can be found that the uninstallation of the three Ceph Fuse_A clients, W5, W6, and W7, is a normal uninstall event of the Ceph Fuse_A client. Therefore, the offline of the three Ceph Fuse_A clients of W5, W6 and W7 in the monitoring data does not need to be reported to IMS system; and the uninstallation of the four Ceph Fuse_A clients, W1, W2, W3, and W4, belongs to the abnormal uninstallation event of the Ceph Fuse_A client, and the alarm information of the client is generated according to the alarm generation rules.
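The comparison above amounts to a set difference. The following is a minimal sketch of it, assuming the client lists are already extracted from the monitoring data; the function name `abnormal_offline_clients` and its parameters are hypothetical and do not come from the patent.

```python
# Hypothetical sketch: all names here are illustrative, not from the patent.
def abnormal_offline_clients(before, after, planned_unmounts):
    """Return clients that went offline without a matching planned unmount."""
    # First clients: connection status changed between two collections.
    first_clients = set(before) - set(after)
    # Second clients: connection status changed because of a service change.
    second_clients = set(planned_unmounts)
    # Clients in the first set but not in the second are abnormal.
    return sorted(first_clients - second_clients)


seen_on_a1 = [f"W{i}" for i in range(1, 11)]  # W1..W10 connected
seen_on_a2 = ["W8", "W9", "W10"]              # only these remain connected
planned = ["W5", "W6", "W7"]                  # routine unmounts by the business

alerts = abnormal_offline_clients(seen_on_a1, seen_on_a2, planned)
print(alerts)
```

With the values from the example, the result contains W1, W2, W3, and W4, which are the clients for which alarm information would be generated.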
In one possible implementation, the alarm rules further include an alarm suppression rule. The monitoring server determines the change duration of the cluster's service change, and sets an alarm suppression rule for the client alarm information; under this rule, client alarm information generated within the change duration is not reported.
Continuing the foregoing example, suppose that the three Ceph Fuse_A clients W5, W6, and W7 connected to the CephFS_A cluster are unmounted normally for business reasons, and that unmounting them takes 3 h. Then, throughout the 3 h following the moment it obtains the monitoring data on node server A2, Prometheus Sever_X does not report the offline events of W5, W6, and W7 to the IMS system. That is, Prometheus Sever_X writes the events of the three Ceph Fuse_A clients W5, W6, and W7 going offline from the CephFS_A cluster into the alarm suppression rule.
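A suppression rule of this kind can be sketched as a time window attached to a set of clients. The class `SuppressionRule` and its method are assumptions made for illustration; the patent does not specify how the rule is stored or evaluated.

```python
from datetime import datetime, timedelta


# Hypothetical sketch: the class and field names are illustrative only.
class SuppressionRule:
    """Suppress client alarms raised inside a planned change window."""

    def __init__(self, clients, start, duration):
        self.clients = set(clients)
        self.start = start
        self.end = start + duration  # end of the change duration

    def suppresses(self, client, event_time):
        # An alarm is suppressed only for listed clients inside the window.
        return client in self.clients and self.start <= event_time <= self.end


start = datetime(2020, 1, 1, 12, 0)  # assumed time the A2 data was obtained
rule = SuppressionRule(["W5", "W6", "W7"], start, timedelta(hours=3))

inside = rule.suppresses("W5", start + timedelta(hours=1))
other_client = rule.suppresses("W1", start + timedelta(hours=1))
after_window = rule.suppresses("W5", start + timedelta(hours=4))
```

Here an offline event for W5 one hour into the window is suppressed, while an event for W1, or an event for W5 after the 3 h window has passed, is not.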
In one possible implementation, the monitoring server generates alarm information for the MDS component of the cluster according to the cluster's own health data. The monitoring server reporting the alarm information to the alarm platform according to the preset alarm rules includes: if the monitoring server determines that the alarm level of the MDS component's alarm information is higher than that of the client alarm information, it reports the MDS component's alarm information to the alarm platform.
As in the foregoing example, the monitoring data that Prometheus Sever_X collects for the CephFS_A cluster includes the health data of the CephFS_A cluster itself (such as the running status of the OSD components and PG status data) and the status data of the Ceph Fuse_A clients connected to the CephFS_A cluster (such as whether a Ceph Fuse_A client is connected to the CephFS_A cluster). Suppose that at time T Prometheus Sever_X obtains monitoring data on the CephFS_A cluster showing that the MDS component in the cluster is running abnormally and that, at the same time, the Ceph Fuse_A client W1 connected to the cluster has experienced an abnormal unmount event. Prometheus Sever_X then defines the alarm level of the MDS component's abnormal event as high and the alarm level of W1's abnormal unmount event as low. It subsequently reports the high-level alarm event to the IMS system; that is, Prometheus Sever_X reports the abnormal event of the MDS component in the CephFS_A cluster to the IMS system and does not report the abnormal unmount event of the Ceph Fuse_A client W1.
Note that the monitoring server can set the alarm level of the MDS component's alarm information higher than that of the client alarm information because an abnormality in the cluster's MDS component causes abnormal events on the clients connected to the cluster. Therefore, once the MDS component's alarm information has been reported to the IMS system and operations personnel have investigated, not only is the MDS component restored to normal operation, but the clients connected to the cluster also return to normal.
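The level-based selection can be sketched as follows. This is an illustration under stated assumptions: the patent only says that the higher-level MDS alarm is reported instead of the lower-level client alarm, so generalizing this to "report only the highest level present in a batch" is an assumption, as are the names `LEVEL_RANK` and `select_for_report`.

```python
# Hypothetical sketch: the ranking table and function are illustrative only.
LEVEL_RANK = {"high": 2, "low": 1}


def select_for_report(alarms):
    """Keep only the alarms at the highest level present in this batch."""
    if not alarms:
        return []
    top = max(LEVEL_RANK[a["level"]] for a in alarms)
    return [a for a in alarms if LEVEL_RANK[a["level"]] == top]


# At time T: the MDS component is abnormal and client W1 is abnormally unmounted.
batch = [
    {"source": "MDS", "level": "high"},
    {"source": "W1", "level": "low"},
]
to_report = select_for_report(batch)
```

For this batch only the MDS alarm is selected, matching the example in which the W1 unmount event is not reported.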
In one possible implementation, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the method further includes: the monitoring server sets a cluster identifier corresponding to each item of monitoring data.
As in the foregoing example, referring to FIG. 3, Prometheus Sever_X sends collection instruction I to the three node servers A1, A2, and A4 in the CephFS_A cluster and, at the same time, to the three node servers B1, B3, and B4 in the CephFS_B cluster and the three node servers C1, C2, and C4 in the CephFS_C cluster. After collection instruction I has been responded to in the CephFS_A, CephFS_B, and CephFS_C clusters, Prometheus Sever_X obtains the monitoring data of each of these clusters. The monitoring data can be presented together with the identifier of its cluster: for example, the first item obtained by Prometheus Sever_X is monitoring data from node server A1 of the CephFS_A cluster, the second is monitoring data from node server B3 of the CephFS_B cluster, the third is monitoring data from node server C4 of the CephFS_C cluster, and so on.
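Setting a cluster identifier on each item of monitoring data can be sketched as a simple tagging step. The node-to-cluster mapping, the sample fields, and the function `tag_with_cluster` are hypothetical and used only for illustration.

```python
# Hypothetical sketch: field names and the mapping are illustrative only.
NODE_TO_CLUSTER = {
    "A1": "CephFS_A", "A2": "CephFS_A", "A4": "CephFS_A",
    "B1": "CephFS_B", "B3": "CephFS_B", "B4": "CephFS_B",
    "C1": "CephFS_C", "C2": "CephFS_C", "C4": "CephFS_C",
}


def tag_with_cluster(samples, node_to_cluster):
    """Return copies of the samples with a 'cluster' identifier added."""
    return [dict(s, cluster=node_to_cluster[s["node"]]) for s in samples]


# Items arrive in the order described in the example: A1, then B3, then C4.
received = [{"node": "A1"}, {"node": "B3"}, {"node": "C4"}]
tagged = tag_with_cluster(received, NODE_TO_CLUSTER)
clusters = [item["cluster"] for item in tagged]
```

Tagging each item this way lets later rules, such as alarm convergence, tell which cluster a repeated alarm belongs to.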
In one possible implementation, the alarm rules further include an alarm convergence rule. The monitoring server reporting the alarm information to the alarm platform according to the preset alarm rules includes: if the monitoring server determines that the alarm information is the same alarm information appearing in the cluster not for the first time, it reports the alarm information to the alarm platform after a set delay, according to the correspondence between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
As in the foregoing example, suppose the first item of monitoring data obtained by Prometheus Sever_X comes from the CephFS_A cluster. After analyzing it according to the preset alarm rules, Prometheus Sever_X determines that it can be reported to the IMS system as alarm information; denote the alarm information generated from the first item as Info_1, with alarm level 1. Suppose the sixth item of monitoring data obtained by Prometheus Sever_X also concerns the CephFS_A cluster and, after analysis according to the preset alarm rules, the alarm information generated from it is found to match Info_1. Prometheus Sever_X then needs to determine, from the alarm level of Info_1, when to report this sixth item to the IMS system. If the alarm delay corresponding to alarm level 1 is 1 h, Prometheus Sever_X does not report the Info_1 corresponding to the sixth item to the IMS system during the next 1 h.
Suppose the second item of monitoring data obtained by Prometheus Sever_X comes from the CephFS_B cluster. After analyzing it according to the preset alarm rules, Prometheus Sever_X determines that it can be reported to the IMS system as alarm information; denote the alarm information generated from the second item as Info_2, with alarm level 2. Suppose the ninth item of monitoring data obtained by Prometheus Sever_X also concerns the CephFS_B cluster and, after analysis according to the preset alarm rules, the alarm information generated from it is found to match Info_2. Prometheus Sever_X then needs to determine, from the alarm level of Info_2, when to report this ninth item to the IMS system. If the alarm delay corresponding to alarm level 2 is 2 h, Prometheus Sever_X does not report the Info_2 corresponding to the ninth item to the IMS system during the next 2 h.
Suppose the third item of monitoring data obtained by Prometheus Sever_X comes from the CephFS_C cluster. After analyzing it according to the preset alarm rules, Prometheus Sever_X determines that it can be reported to the IMS system as alarm information; denote the alarm information generated from the third item as Info_3, with alarm level 3. Suppose the tenth item of monitoring data obtained by Prometheus Sever_X also concerns the CephFS_C cluster and, after analysis according to the preset alarm rules, the alarm information generated from it is found to match Info_3. Prometheus Sever_X then needs to determine, from the alarm level of Info_3, when to report this tenth item to the IMS system. If the alarm delay corresponding to alarm level 3 is 3 h, Prometheus Sever_X does not report the Info_3 corresponding to the tenth item to the IMS system during the next 3 h.
Note that in the above example, as the alarm level decreases from level 1 to level 2 to level 3, the corresponding alarm delay becomes longer: 1 h, 2 h, and 3 h respectively.
With this solution, once the monitoring server determines that alarm information is the same alarm information appearing in a cluster not for the first time, it reports that repeated alarm to the alarm platform only after the set delay given by the alarm convergence rule. This effectively prevents the waste of resources caused by the cluster continually re-sending the same alarm.
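The convergence rule can be sketched as a lookup from alarm level to delay, applied only to repeated alarms. The delays mirror the example (1 h, 2 h, 3 h); the class `Converger`, its method, and the choice to report a first occurrence immediately are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Level-to-delay table taken from the example; lower level, longer delay.
DELAY_BY_LEVEL = {
    1: timedelta(hours=1),
    2: timedelta(hours=2),
    3: timedelta(hours=3),
}


# Hypothetical sketch: the class and method names are illustrative only.
class Converger:
    def __init__(self):
        self._seen = set()  # (cluster, alarm name) pairs already raised

    def report_time(self, cluster, name, level, now):
        """Return the earliest time at which this alarm may be reported."""
        key = (cluster, name)
        if key not in self._seen:
            self._seen.add(key)
            return now  # assumed: a first occurrence is reported at once
        return now + DELAY_BY_LEVEL[level]  # repeats wait for the set delay


conv = Converger()
t0 = datetime(2020, 1, 1, 9, 0)
first = conv.report_time("CephFS_A", "Info_1", 1, t0)   # first item
repeat = conv.report_time("CephFS_A", "Info_1", 1, t0)  # sixth item
```

In this sketch the first Info_1 alarm is due at `t0`, while the repeated Info_1 alarm is held back until one hour later.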
Based on the same concept, an embodiment of the present invention further provides an apparatus for monitoring a distributed storage system. As shown in FIG. 4, the apparatus includes:
a sending unit 401, configured to send collection instructions to each cluster in the distributed storage system;
an obtaining unit 402, configured to obtain the monitoring data fed back by each cluster based on the collection instruction, the monitoring data including the health data of the cluster itself and the status data of the clients connected to the cluster;
a determining unit 403, configured to, for at least one cluster, determine alarm information from the monitoring data of the cluster according to preset alarm rules and report the alarm information to the alarm platform.
Further, in the apparatus, there are multiple monitoring servers; each cluster includes multiple node servers, and the node servers that have clients connected are all connected to the same clients; for any one monitoring server, the sending unit 401 is specifically configured to issue collection instructions to at least two node servers in any cluster.
Further, in the apparatus, the alarm rules include an alarm generation rule; the determining unit 403 is specifically configured to: determine from the monitoring data the first clients whose connection status with the cluster has changed; determine, according to the service change of the cluster, the second clients whose connection status with the cluster has changed; and generate client alarm information according to the alarm generation rule and the clients that are included in the first clients but not in the second clients.
Further, in the apparatus, the alarm rules further include an alarm suppression rule; the determining unit 403 is specifically configured to: determine the change duration of the service change of the cluster; and set an alarm suppression rule for the client alarm information, under which client alarm information generated within the change duration is not reported.
Further, in the apparatus, the monitoring server generates alarm information for the MDS component of the cluster according to the health data of the cluster itself; the determining unit 403 is specifically configured to, upon determining that the alarm level of the MDS component's alarm information is higher than that of the client alarm information, report the MDS component's alarm information to the alarm platform.
Further, in the apparatus, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the determining unit 403 is further configured to set a cluster identifier corresponding to each item of monitoring data.
Further, in the apparatus, the alarm rules further include an alarm convergence rule; the determining unit 403 is specifically configured to, upon determining that the alarm information is the same alarm information appearing in the cluster not for the first time, report the alarm information to the alarm platform after a set delay, according to the correspondence between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
An embodiment of the present invention provides a computing device, which may specifically be a desktop computer, a portable computer, a smartphone, a tablet computer, a personal digital assistant (PDA), or the like. The computing device may include a central processing unit (CPU), a memory, input/output devices, and so on. The input devices may include a keyboard, a mouse, a touch screen, and the like, and the output devices may include a display device such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
The memory may include read-only memory (ROM) and random access memory (RAM), and provides the processor with the program instructions and data stored in the memory. In the embodiment of the present invention, the memory may be used to store the program instructions of the method for monitoring a distributed storage system;
the processor is configured to call the program instructions stored in the memory and execute the method for monitoring a distributed storage system according to the obtained program.
FIG. 5 is a schematic diagram of a computing device provided by an embodiment of this application. The computing device includes:
a processor 501, a memory 502, a transceiver 503, and a bus interface 504, where the processor 501, the memory 502, and the transceiver 503 are connected by a bus 505;
the processor 501 is configured to read the program in the memory 502 and execute the above method for monitoring a distributed storage system;
the processor 501 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It may also be a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 502 is configured to store one or more executable programs, and may store data used by the processor 501 when performing operations.
Specifically, a program may include program code, and the program code includes computer operation instructions. The memory 502 may include volatile memory, such as random-access memory (RAM); the memory 502 may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 502 may also include a combination of the above types of memory.
The memory 502 stores the following elements, executable modules or data structures, or a subset or an extended set of them:
Operation instructions: including various operation instructions, used to implement various operations.
Operating system: including various system programs, used to implement various basic services and to process hardware-based tasks.
The bus 505 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in FIG. 5, but this does not mean that there is only one bus or one type of bus.
The bus interface 504 may be a wired communication interface, a wireless bus interface, or a combination thereof. The wired bus interface may be, for example, an Ethernet interface. The Ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless bus interface may be a WLAN interface.
An embodiment of the present invention provides a computer-readable storage medium that stores computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the method for monitoring a distributed storage system.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.

Claims (10)

  1. 一种监控分布式存储系统的方法,其特征在于,包括:A method for monitoring a distributed storage system is characterized in that it includes:
    监控服务器向所述分布式存储系统中的各集群发送采集指令;The monitoring server sends collection instructions to each cluster in the distributed storage system;
    所述监控服务器获取所述各集群基于所述采集指令反馈的监控数据,所述监控数据包括集群自身的健康数据以及与集群相连的客户端的状态数据;The monitoring server obtains monitoring data fed back by each cluster based on the collection instruction, the monitoring data includes the health data of the cluster itself and the status data of the client connected to the cluster;
    针对至少一个集群,所述监控服务器根据预设的告警规则,从所述集群的监控数据中确定告警信息并将所述告警信息上报至告警平台。For at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to preset alarm rules and reports the alarm information to the alarm platform.
  2. 如权利要求1所述的方法,其特征在于,所述监控服务器为多台;任一集群中包括多台节点服务器,且连接有客户端的各节点服务器所连接的客户端均相同;The method according to claim 1, wherein there are multiple monitoring servers; any cluster includes multiple node servers, and each node server connected to the client is connected to the same client;
    所述监控服务器向所述分布式存储系统中的各集群发送采集指令,包括:The monitoring server sending collection instructions to each cluster in the distributed storage system includes:
    针对任一台监控服务器,所述监控服务器向任一集群中的至少两台节点服务器下发采集指令。For any monitoring server, the monitoring server issues collection instructions to at least two node servers in any cluster.
  3. 如权利要求1所述的方法,其特征在于,所述告警规则包括告警生成规则;The method according to claim 1, wherein the alarm rule comprises an alarm generation rule;
    所述监控服务器根据预设的告警规则,从所述监控数据中确定告警信息,包括:The monitoring server determines alarm information from the monitoring data according to preset alarm rules, including:
    所述监控服务器从所述监控数据中确定出与所述集群的连接状态发生变化的第一客户端;The monitoring server determines from the monitoring data the first client whose connection status with the cluster has changed;
    所述监控服务器根据所述集群的业务变化确定与所述集群的连接状态发生变化的第二客户端;Determining, by the monitoring server, a second client whose connection status with the cluster has changed according to the business change of the cluster;
    根据包含在所述第一客户端中却不包含在所述第二客户端中的客户端及所述告警生成规则,生成客户端的告警信息。The alarm information of the client is generated according to the client included in the first client but not included in the second client and the alarm generation rule.
  4. 如权利要求3所述的方法,其特征在于,所述告警规则还包括告警抑制规则;The method of claim 3, wherein the alarm rule further comprises an alarm suppression rule;
    所述监控服务器确定所述集群的业务变化的变化时长;The monitoring server determines the change duration of the service change of the cluster;
    所述监控服务器设置所述客户端的告警信息的告警抑制规则,所述客户端的告警抑制规则用于将在所述变化时长内产生的所述客户端的告警信息不进行上报。The monitoring server sets an alarm suppression rule for the alarm information of the client, and the alarm suppression rule of the client is used to not report the alarm information of the client generated within the change period.
  5. The method according to claim 3, wherein the monitoring server generates alarm information for the MDS component of the cluster according to the health data of the cluster itself;
    the monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises:
    upon determining that the alarm level of the alarm information of the MDS component is higher than that of the client alarm information, the monitoring server reports the alarm information of the MDS component to the alarm platform.
  6. The method according to claim 1, wherein, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the method further comprises:
    the monitoring server setting a cluster identifier corresponding to each item of monitoring data.
  7. The method according to any one of claims 1-6, wherein the alarm rule further comprises an alarm convergence rule;
    the monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises:
    upon determining that the alarm information is a repeat (not the first occurrence) of the same alarm information in the cluster, the monitoring server reports the alarm information to the alarm platform after a set delay, according to the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.
  8. An apparatus for monitoring a distributed storage system, comprising:
    a sending unit, configured to send collection instructions to each cluster in the distributed storage system;
    an obtaining unit, configured to obtain the monitoring data fed back by each cluster based on the collection instruction, the monitoring data including health data of the cluster itself and status data of the clients connected to the cluster;
    a determining unit, configured to, for at least one cluster, determine alarm information from the monitoring data of the cluster according to a preset alarm rule and report the alarm information to an alarm platform.
  9. A computing device, comprising:
    a memory, configured to store program instructions;
    a processor, configured to call the program instructions stored in the memory and to execute, according to the obtained program, the method according to any one of claims 1-7.
  10. A computer-readable storage medium, wherein the storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute the method according to any one of claims 1-7.
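
The alarm rules recited in claims 3, 4, 5 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `Alarm`, `client_alarms`, `is_suppressed`, `should_report_mds` and `report_delay`, the level names, and the delay values are all hypothetical, and reporting a first occurrence with zero delay is an assumption, since claim 7 only specifies the handling of repeated alarms.

```python
from dataclasses import dataclass

# Hypothetical severity ranking and level-to-delay table (seconds).
# Per claim 7, a lower alarm level maps to a longer delay.
SEVERITY_RANK = {"critical": 3, "major": 2, "minor": 1}
CONVERGENCE_DELAY_S = {"critical": 60, "major": 300, "minor": 1800}


@dataclass(frozen=True)
class Alarm:
    cluster_id: str  # cluster identifier attached to the data (claim 6)
    source: str      # e.g. "mds" or "client:10.0.0.5"
    level: str       # key of SEVERITY_RANK


def client_alarms(cluster_id, first_clients, second_clients, level="minor"):
    """Claim 3: alarm on clients whose connection change was observed
    (first clients) but is not explained by a service change (second clients)."""
    unexpected = set(first_clients) - set(second_clients)
    return [Alarm(cluster_id, f"client:{c}", level) for c in sorted(unexpected)]


def is_suppressed(alarm_time, change_start, change_duration):
    """Claim 4: client alarms generated within the change duration are not reported."""
    return change_start <= alarm_time <= change_start + change_duration


def should_report_mds(mds_alarm, client_alarm):
    """Claim 5: report the MDS alarm when its level is higher than the client alarm's."""
    return SEVERITY_RANK[mds_alarm.level] > SEVERITY_RANK[client_alarm.level]


def report_delay(alarm, seen):
    """Claim 7: a repeated alarm is reported after a level-dependent delay.
    Assumption: a first occurrence is reported immediately (delay 0)."""
    if alarm not in seen:
        seen.add(alarm)
        return 0
    return CONVERGENCE_DELAY_S[alarm.level]


if __name__ == "__main__":
    alarms = client_alarms("cluster-1", {"a", "b", "c"}, {"b"})
    seen = set()
    for alarm in alarms + alarms:  # second pass simulates repeated alarms
        print(alarm.source, "delay:", report_delay(alarm, seen))
```

The set difference in `client_alarms` is the core of claim 3: a client whose connection change coincides with a known service change is treated as expected and produces no alarm.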
PCT/CN2020/134339 2019-12-23 2020-12-07 Method and apparatus for monitoring distributed storage system WO2021129367A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911336662.5A CN111049705B (en) 2019-12-23 2019-12-23 Method and device for monitoring distributed storage system
CN201911336662.5 2019-12-23

Publications (1)

Publication Number Publication Date
WO2021129367A1 (en) 2021-07-01

Family

ID=70238567

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134339 WO2021129367A1 (en) 2019-12-23 2020-12-07 Method and apparatus for monitoring distributed storage system

Country Status (2)

Country Link
CN (1) CN111049705B (en)
WO (1) WO2021129367A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111049705B (en) * 2019-12-23 2023-09-12 深圳前海微众银行股份有限公司 Method and device for monitoring distributed storage system
CN111597091A (en) * 2020-05-20 2020-08-28 北京金山云网络技术有限公司 Data monitoring method and system, electronic equipment and computer storage medium
CN111625421B (en) * 2020-05-26 2021-07-16 云和恩墨(北京)信息技术有限公司 Method and device for monitoring distributed storage system, storage medium and processor
CN111988165B (en) * 2020-07-09 2023-01-24 云知声智能科技股份有限公司 Method and system for monitoring use condition of distributed storage system
CN112084098A (en) * 2020-10-21 2020-12-15 中国银行股份有限公司 Resource monitoring system and working method
CN112650642A (en) * 2020-12-07 2021-04-13 深圳前海微众银行股份有限公司 Alarm processing method and device, equipment and storage medium
CN112751726B (en) * 2020-12-17 2022-09-09 北京达佳互联信息技术有限公司 Data processing method and device, electronic equipment and storage medium
CN112783745A (en) * 2021-02-02 2021-05-11 无锡车联天下信息技术有限公司 Cluster data monitoring method, device, system and storage medium
CN113688149A (en) * 2021-07-20 2021-11-23 青岛海尔科技有限公司 Monitoring method and device
CN113641558A (en) * 2021-08-31 2021-11-12 合众人寿保险股份有限公司 Health examination method and device and electronic equipment
CN114090644B (en) * 2022-01-20 2022-04-26 飞狐信息技术(天津)有限公司 Data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291594A (en) * 2017-06-30 2017-10-24 上海白虹软件科技股份有限公司 The device and method that openstack platforms are monitored and managed to ceph
US20180341682A1 (en) * 2017-05-26 2018-11-29 Nutanix, Inc. System and method for generating rules from search queries
CN109298945A (en) * 2018-10-17 2019-02-01 北京京航计算通讯研究所 The monitoring of Ceph distributed storage and tuning management method towards big data platform
CN109522287A (en) * 2018-09-18 2019-03-26 平安科技(深圳)有限公司 Monitoring method, system, equipment and the medium of distributed document storage cluster
CN111049705A (en) * 2019-12-23 2020-04-21 深圳前海微众银行股份有限公司 Method and device for monitoring distributed storage system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104202212A (en) * 2014-08-28 2014-12-10 浪潮(北京)电子信息产业有限公司 System and method for obtaining distributed cluster system alarm
CN107864063B (en) * 2017-12-12 2021-09-17 北京奇艺世纪科技有限公司 Abnormity monitoring method and device and electronic equipment
US11102174B2 (en) * 2017-12-26 2021-08-24 Palo Alto Networks, Inc. Autonomous alerting based on defined categorizations for network space and network boundary changes

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114115718A (en) * 2021-08-31 2022-03-01 济南浪潮数据技术有限公司 Distributed block storage system service quality control method, device, equipment and medium
CN114115718B (en) * 2021-08-31 2024-03-29 济南浪潮数据技术有限公司 Distributed block storage system service quality control method, device, equipment and medium
US20230108213A1 (en) * 2021-10-05 2023-04-06 Softiron Limited Ceph Failure and Verification
CN114760221A (en) * 2022-03-31 2022-07-15 深信服科技股份有限公司 Service monitoring method, system and storage medium
CN114760221B (en) * 2022-03-31 2024-02-23 深信服科技股份有限公司 Service monitoring method, system and storage medium
CN115567526A (en) * 2022-09-21 2023-01-03 中国平安人寿保险股份有限公司 Data monitoring method, device, equipment and medium
CN115567526B (en) * 2022-09-21 2024-05-14 中国平安人寿保险股份有限公司 Data monitoring method, device, equipment and medium

Also Published As

Publication number Publication date
CN111049705A (en) 2020-04-21
CN111049705B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
WO2021129367A1 (en) Method and apparatus for monitoring distributed storage system
US10365915B2 (en) Systems and methods of monitoring a network topology
CN107241211B (en) Method and system for improving relevance between data center overlay network and underlying network
US8375200B2 (en) Embedded device and file change notification method of the embedded device
US11329869B2 (en) Self-monitoring
WO2020093637A1 (en) Device state prediction method and system, computer apparatus and storage medium
CN110532322B (en) Operation and maintenance interaction method, system, computer readable storage medium and equipment
WO2017080161A1 (en) Alarm information processing method and device in cloud computing
CN111078695B (en) Method and device for calculating association relation of metadata in enterprise
WO2019169765A1 (en) Electronic device, method for acquiring state information in cluster environment, system, and storage medium
CN111339466A (en) Interface management method and device, electronic equipment and readable storage medium
US11556120B2 (en) Systems and methods for monitoring performance of a building management system via log streams
US8380729B2 (en) Systems and methods for first data capture through generic message monitoring
CN111274032A (en) Task processing system and method, and storage medium
CN111917812B (en) Data transmission control method, device, equipment and storage medium
CN109766238B (en) Session number-based operation and maintenance platform performance monitoring method and device and related equipment
CN113626869A (en) Data processing method, system, electronic device and storage medium
US10949232B2 (en) Managing virtualized computing resources in a cloud computing environment
CN112131077A (en) Fault node positioning method and device and database cluster system
CN109388546B (en) Method, device and system for processing faults of application program
CN113590424B (en) Fault monitoring method, device, equipment and storage medium
US8661296B2 (en) Dynamic data store for failed jobs in a systems complex
CN110768855A (en) Method and device for testing linkmzation performance
WO2024045621A1 (en) Data processing method, apparatus and system
CN110519393B (en) Self-service equipment supervision method, device, equipment, server and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20905266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02-11-22)

122 Ep: pct application non-entry in european phase

Ref document number: 20905266

Country of ref document: EP

Kind code of ref document: A1