CN111049705B - Method and device for monitoring distributed storage system - Google Patents

Info

Publication number
CN111049705B
Authority
CN
China
Prior art keywords
cluster
alarm
monitoring
client
monitoring server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911336662.5A
Other languages
Chinese (zh)
Other versions
CN111049705A (en)
Inventor
龚治文
饶俊明
卢道和
郑晓腾
龚洵峰
刘生庆
吴立
吴传民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN201911336662.5A
Publication of CN111049705A
Priority to PCT/CN2020/134339 (WO2021129367A1)
Application granted
Publication of CN111049705B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H04L43/06 Generation of reports
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H04L41/0609 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on severity or priority
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

The invention provides a method and a device for monitoring a distributed storage system. A monitoring server sends acquisition instructions to each cluster in the distributed storage system; the monitoring server acquires the monitoring data fed back by each cluster in response to the acquisition instruction, the monitoring data comprising health data of the cluster itself and state data of the clients connected to the cluster; for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform. Because the monitoring server issues the acquisition instruction to each cluster in the distributed storage system, it can monitor a plurality of clusters simultaneously; in addition, because the monitoring data fed back by each cluster includes the state data of the clients connected to the cluster, the monitoring server can determine alarm information by analyzing that state data, and can thereby monitor the clients connected to the cluster.

Description

Method and device for monitoring distributed storage system
Technical Field
The present invention relates to the field of financial technology (Fintech), and in particular, to a method and apparatus for monitoring a distributed storage system.
Background
With the development of computer technology, more and more technologies (such as blockchain, cloud computing and big data) are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech); big data technology is no exception. However, the security and real-time requirements of the finance and payment industries also place higher demands on big data technology.
To store massive data with scalability and high availability, the banking industry generally selects a distributed storage system such as CephFS (Ceph File System) as its shared storage solution, with Ceph Fuse clients connected to CephFS; meanwhile, those skilled in the art typically use a monitoring system such as the open-source Prometheus to monitor CephFS. Prometheus mainly comprises Exporters, the Prometheus Server and other parts; CephFS mainly comprises components such as the Monitor (MON), the Object Storage Device (OSD) and the MetaData Server (MDS), and placement groups (PGs) are distributed across the CephFS OSD components.
The prior-art scheme of monitoring CephFS with Prometheus has the following two problems:
First, Prometheus monitors CephFS mainly by collecting data on the status of the CephFS OSD components and the CephFS PGs, but Prometheus does not monitor the Ceph Fuse clients.
Second, the architecture with which Prometheus monitors CephFS is cumbersome, because a separate set of Prometheus must be deployed for each CephFS; furthermore, because CephFS versions differ, a different Exporter must be deployed for each CephFS version. FIG. 1 shows a prior-art architecture in which Prometheus monitors CephFS. Referring to FIG. 1, Exporter_M collects the monitoring data of CephFS_M and, if the collected data satisfies a rule for generating alarm information, reports the generated alarm information to Prometheus Server_M; similarly, Exporter_N collects the monitoring data of CephFS_N and, if the collected data satisfies the rule for generating alarm information, reports the generated alarm information to Prometheus Server_N. However, because Exporter_M does not match the version of CephFS_N, Exporter_M cannot be used to collect the monitoring data of CephFS_N and report the alarm information of CephFS_N. That is, high availability is not achieved among the Prometheus Server, the Exporter and CephFS, so monitoring information cannot be reported in time under abnormal conditions.
In summary, the prior art has two problems: Prometheus cannot monitor the Ceph Fuse client, and Prometheus monitors CephFS inefficiently.
Disclosure of Invention
The invention provides a method and a device for monitoring a distributed storage system, which are used to solve the problems that Prometheus cannot monitor the Ceph Fuse client and that Prometheus monitors CephFS inefficiently.
In a first aspect, an embodiment of the present invention provides a method for monitoring a distributed storage system, the method including: the monitoring server sends acquisition instructions to each cluster in the distributed storage system; the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the clusters and state data of clients connected with the clusters; for at least one cluster, the monitoring server determines alarm information from monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform.
Based on the scheme, the monitoring server can monitor a plurality of clusters simultaneously by sending the acquisition instruction to each cluster in the distributed storage system, so that the situation that the monitoring server cannot effectively monitor each cluster due to mismatching of the clusters and the Exporter version is avoided; in addition, the monitoring data fed back to the monitoring server by each cluster also comprises state data of the clients connected with the cluster, which is beneficial to the monitoring server to determine the alarm information through analyzing the state data of the clients connected with the cluster, thereby realizing the purpose of monitoring the clients connected with the cluster by the monitoring server.
As a possible implementation method, there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and every node server that is connected to clients is connected to the same clients; the monitoring server sending acquisition instructions to each cluster in the distributed storage system comprises: for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.
Based on the scheme, a plurality of monitoring servers are arranged for the distributed storage system, on one hand, the monitoring data of each cluster are frequently obtained from each cluster in the distributed storage system, and the aim of omnibearing and even real-time monitoring of the distributed storage system can be realized; on the other hand, by arranging a plurality of monitoring servers, the distributed storage system can be monitored by other available monitoring servers under the condition that one or more monitoring servers are down. For any one monitoring server of the plurality of monitoring servers, the monitoring server issues an acquisition instruction to at least two node servers in each cluster, so that the monitoring server can acquire monitoring data of the cluster where the node server is located from other available node servers under the condition that one node server is down, and effective monitoring of each cluster by the monitoring server is realized.
As one possible implementation method, the alarm rule includes an alarm generation rule; the monitoring server determining alarm information from the monitoring data according to a preset alarm rule comprises: the monitoring server determines, from the monitoring data, the first clients whose connection state with the cluster has changed; the monitoring server determines, according to the service change of the cluster, the second clients whose connection state with the cluster changes; and the monitoring server generates, according to the alarm generation rule, alarm information for each client that is among the first clients but not among the second clients.
Based on the scheme, a first client with a changed connection state with the cluster is determined through analysis of monitoring data, a second client with a changed connection state with the cluster is determined through analysis of known service change, and alarm information generated due to abnormality of the clients can be generated through comparison of the first client and the second client.
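The comparison of first and second clients described above amounts to a set difference. The following is a minimal sketch under stated assumptions: the client names, the dictionary shape of the connection-state samples and both function names are hypothetical and do not come from the patent.

```python
from typing import Dict, Set


def changed_clients(previous: Dict[str, bool], current: Dict[str, bool]) -> Set[str]:
    """First clients: those whose connection state differs between two samples."""
    names = previous.keys() | current.keys()
    return {name for name in names if previous.get(name) != current.get(name)}


def clients_to_alarm(first_clients: Set[str], second_clients: Set[str]) -> Set[str]:
    """Clients that changed state but are not explained by a known service change."""
    return first_clients - second_clients


# Hypothetical samples: True means "connected to the cluster".
previous = {"fuse_a_1": True, "fuse_a_2": True, "fuse_a_3": True}
current = {"fuse_a_1": True, "fuse_a_2": False, "fuse_a_3": False}
planned_change = {"fuse_a_3"}  # second clients, known from the service change

first = changed_clients(previous, current)
alarmed = clients_to_alarm(first, planned_change)
```

Here `fuse_a_3` is disconnected because of the planned change, so only `fuse_a_2` remains as a candidate for alarm information.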
As one possible implementation method, the alarm rule further includes an alarm suppression rule; the monitoring server determines the change duration of the service change of the cluster; the monitoring server sets an alarm suppression rule of the alarm information of the client, and the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.
Based on the scheme, once the monitoring server has determined how long the service change of the cluster will take, it does not report the alarm information of the client to the alarm platform during that period, so that the generation of known and useless alarms can be effectively avoided.
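One way to picture the alarm suppression rule is as a time window that starts with the service change and lasts for the change duration. The sketch below assumes that interpretation; `SuppressionWindow`, `should_report` and the example timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class SuppressionWindow:
    """Period during which client alarm information is not reported."""
    start: datetime
    duration: timedelta  # change duration of the service change

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.start + self.duration


def should_report(alarm_time: datetime, windows: Iterable[SuppressionWindow]) -> bool:
    """Report a client alarm only if no suppression window covers its time."""
    return not any(window.covers(alarm_time) for window in windows)


# Hypothetical service change: starts at 8:00 and takes 30 minutes.
window = SuppressionWindow(datetime(2019, 12, 23, 8, 0), timedelta(minutes=30))
during_change = should_report(datetime(2019, 12, 23, 8, 10), [window])
after_change = should_report(datetime(2019, 12, 23, 8, 45), [window])
```

A client alarm raised at 8:10 falls inside the window and is held back, while one raised at 8:45 is reported as usual.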
As a possible implementation method, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the monitoring server reporting the alarm information to an alarm platform according to a preset alarm rule comprises: if the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, reporting the alarm information of the MDS component to the alarm platform.
Based on the scheme, when the monitoring server obtains the alarm information of the MDS component of the cluster and the alarm information of a client connected to the cluster at the same time, and determines that the alarm level of the MDS component's alarm information is higher than that of the client's, it reports the alarm information of the MDS component to the alarm platform. Considering that an abnormality of the cluster's MDS component may itself cause abnormal events on the clients connected to the cluster, the lower-level client alarm information is automatically masked.
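The masking of lower-level client alarms by a higher-level MDS alarm can be sketched as a simple filter. The `Alarm` type, the numeric level convention (a larger number means more severe) and the function name are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alarm:
    source: str   # "mds" or "client"
    level: int    # assumed convention: larger number = higher alarm level
    message: str


def alarms_to_report(mds_alarm: Optional[Alarm], client_alarms: List[Alarm]) -> List[Alarm]:
    """Report the MDS alarm and mask client alarms of a lower level."""
    if mds_alarm is None:
        return list(client_alarms)
    kept = [alarm for alarm in client_alarms if alarm.level >= mds_alarm.level]
    return [mds_alarm] + kept


mds = Alarm("mds", 3, "MDS component abnormal")
clients = [
    Alarm("client", 1, "fuse_a_1 disconnected"),
    Alarm("client", 1, "fuse_a_2 disconnected"),
]
reported = alarms_to_report(mds, clients)
```

With an MDS alarm at level 3 and client alarms at level 1, only the MDS alarm reaches the alarm platform; without an MDS alarm, the client alarms pass through unchanged.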
As a possible implementation method, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the method further includes: and the monitoring server sets cluster identifiers corresponding to the monitoring data.
Based on the scheme, the monitoring server marks the acquired monitoring data with the corresponding cluster, so that the monitoring server can quickly make corresponding alarm operation when receiving the same monitoring data of the same cluster in the later period.
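Setting a cluster identifier on monitoring data can be as simple as attaching a label to every sample. This is a minimal sketch; the sample fields and the `cluster` key are hypothetical, and the patent does not specify how the identifier is stored.

```python
from typing import Dict, List


def tag_with_cluster(cluster_id: str, samples: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return copies of the samples with a cluster identifier attached."""
    return [{**sample, "cluster": cluster_id} for sample in samples]


raw = [
    {"metric": "client_connected", "client": "fuse_a_1", "value": 1},
    {"metric": "osd_up", "osd": "osd.0", "value": 1},
]
tagged = tag_with_cluster("CephFS_A", raw)
```

Because each sample carries its cluster identifier, later alarm processing can tell apart identical monitoring data coming from different clusters.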
As a possible implementation method, the alarm rule further includes an alarm convergence rule; the monitoring server reporting the alarm information to an alarm platform according to a preset alarm rule comprises: when the monitoring server determines that the alarm information repeats alarm information that has already appeared for the cluster, it reports the alarm information to the alarm platform after a set delay, the delay being taken from the mapping between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.
Based on the scheme, after the monitoring server determines that the alarm information repeats alarm information that has already appeared for a certain cluster, it reports the repeated alarm to the alarm platform only after the delay set by the alarm convergence rule, which effectively prevents the waste of resources caused by a cluster continuously and repeatedly sending the same alarm.
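The alarm convergence rule can be read as rate limiting per cluster and alarm, with the delay chosen by alarm level. The sketch below follows that reading, which is an assumption; the class name, the level names and the delay values in seconds are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class AlarmConverger:
    """Holds back repeated alarms until the delay for their level has elapsed."""

    def __init__(self, delay_by_level: Dict[str, float]) -> None:
        self._delay_by_level = delay_by_level
        self._last_reported: Dict[Tuple[str, str], float] = {}

    def should_report(self, cluster: str, alarm_key: str, level: str, now: float) -> bool:
        key = (cluster, alarm_key)
        last: Optional[float] = self._last_reported.get(key)
        if last is not None and now - last < self._delay_by_level[level]:
            return False  # repeated alarm inside the delay: do not report yet
        self._last_reported[key] = now
        return True


# Lower alarm level -> longer delay (values are illustrative only).
delays = {"critical": 60.0, "minor": 1800.0}
converger = AlarmConverger(delays)
decisions = [
    converger.should_report("CephFS_A", "client_down", "minor", now=0.0),     # first occurrence
    converger.should_report("CephFS_A", "client_down", "minor", now=600.0),   # repeat, too early
    converger.should_report("CephFS_A", "client_down", "minor", now=1800.0),  # delay elapsed
]
```

The first occurrence is reported immediately; a repeat of the same alarm for the same cluster is reported again only once the delay for its level has passed.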
In a second aspect, an embodiment of the present invention provides an apparatus for monitoring a distributed storage system, the apparatus including: the sending unit is used for sending acquisition instructions to each cluster in the distributed storage system; the acquisition unit is used for acquiring monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the clusters and state data of clients connected with the clusters; the determining unit is used for determining alarm information from monitoring data of at least one cluster according to a preset alarm rule and reporting the alarm information to an alarm platform.
Based on the scheme, the monitoring server can monitor a plurality of clusters simultaneously by sending the acquisition instruction to each cluster in the distributed storage system, so that the situation that the monitoring server cannot effectively monitor each cluster due to mismatching of the clusters and the Exporter version is avoided; in addition, the monitoring data fed back to the monitoring server by each cluster also comprises state data of the clients connected with the cluster, which is beneficial to the monitoring server to determine the alarm information through analyzing the state data of the clients connected with the cluster, thereby realizing the purpose of monitoring the clients connected with the cluster by the monitoring server.
As a possible implementation method, there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and every node server that is connected to clients is connected to the same clients; the sending unit is specifically configured to, for any monitoring server, send acquisition instructions to at least two node servers in any cluster.
Based on the scheme, a plurality of monitoring servers are arranged for the distributed storage system, on one hand, the monitoring data of each cluster are frequently obtained from each cluster in the distributed storage system, and the aim of omnibearing and even real-time monitoring of the distributed storage system can be realized; on the other hand, by arranging a plurality of monitoring servers, the distributed storage system can be monitored by other available monitoring servers under the condition that one or more monitoring servers are down. For any one monitoring server of the plurality of monitoring servers, the monitoring server issues an acquisition instruction to at least two node servers in each cluster, so that the monitoring server can acquire monitoring data of the cluster where the node server is located from other available node servers under the condition that one node server is down, and effective monitoring of each cluster by the monitoring server is realized.
As one possible implementation method, the alarm rule includes an alarm generation rule; the determining unit is specifically configured to determine, from the monitoring data, the first clients whose connection state with the cluster has changed; to determine, according to the service change of the cluster, the second clients whose connection state with the cluster changes; and to generate, according to the alarm generation rule, alarm information for each client that is among the first clients but not among the second clients.
Based on the scheme, a first client with a changed connection state with the cluster is determined through analysis of monitoring data, a second client with a changed connection state with the cluster is determined through analysis of known service change, and alarm information generated due to abnormality of the clients can be generated through comparison of the first client and the second client.
As one possible implementation method, the alarm rule further includes an alarm suppression rule; the determining unit is specifically configured to determine a change duration of a service change of the cluster; setting an alarm suppression rule of the alarm information of the client, wherein the alarm suppression rule of the client is used for not reporting the alarm information of the client generated in the change duration.
Based on the scheme, once the monitoring server has determined how long the service change of the cluster will take, it does not report the alarm information of the client to the alarm platform during that period, so that the generation of known and useless alarms can be effectively avoided.
As a possible implementation method, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the determining unit is specifically configured to report the alarm information of the MDS component to an alarm platform upon determining that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client.
Based on the scheme, when the monitoring server obtains the alarm information of the MDS component of the cluster and the alarm information of a client connected to the cluster at the same time, and determines that the alarm level of the MDS component's alarm information is higher than that of the client's, it reports the alarm information of the MDS component to the alarm platform. Considering that an abnormality of the cluster's MDS component may itself cause abnormal events on the clients connected to the cluster, the lower-level client alarm information is automatically masked.
As a possible implementation method, after the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, the determining unit is further configured to set a cluster identifier corresponding to each monitoring data.
Based on the scheme, the monitoring server marks the acquired monitoring data with the corresponding cluster, so that the monitoring server can quickly make corresponding alarm operation when receiving the same monitoring data of the same cluster in the later period.
As a possible implementation method, the alarm rule further includes an alarm convergence rule; the determining unit is specifically configured to, upon determining that the alarm information repeats alarm information that has already appeared for the cluster, report the alarm information to the alarm platform after a set delay, the delay being taken from the mapping between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.
Based on the scheme, after the monitoring server determines that the alarm information repeats alarm information that has already appeared for a certain cluster, it reports the repeated alarm to the alarm platform only after the delay set by the alarm convergence rule, which effectively prevents the waste of resources caused by a cluster continuously and repeatedly sending the same alarm.
In a third aspect, embodiments of the present invention provide a computing device comprising:
a memory for storing program instructions;
and a processor for invoking program instructions stored in said memory and executing the method according to any of the first aspects in accordance with the obtained program.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of the first aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a prior-art architecture in which Prometheus monitors CephFS;
FIG. 2 is a diagram of a method for monitoring a distributed storage system according to the present invention;
FIG. 3 is a schematic diagram of an architecture in which Prometheus monitors CephFS in accordance with the present invention;
FIG. 4 is a schematic diagram of an apparatus for monitoring a distributed storage system according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 2, a method for monitoring a distributed storage system according to an embodiment of the present invention includes:
Step 201, a monitoring server sends acquisition instructions to each cluster in the distributed storage system.
Step 202, the monitoring server obtains monitoring data fed back by each cluster based on the collection instruction, where the monitoring data includes health data of the cluster and status data of clients connected to the cluster.
Step 203, for at least one cluster, the monitoring server determines alarm information from monitoring data of the cluster according to a preset alarm rule, and reports the alarm information to an alarm platform.
Based on the scheme, the monitoring server can monitor a plurality of clusters simultaneously by sending the acquisition instruction to each cluster in the distributed storage system, so that the situation that the monitoring server cannot effectively monitor each cluster due to mismatching of the clusters and the Exporter version is avoided; in addition, the monitoring data fed back to the monitoring server by each cluster also comprises state data of the clients connected with the cluster, which is beneficial to the monitoring server to determine the alarm information through analyzing the state data of the clients connected with the cluster, thereby realizing the purpose of monitoring the clients connected with the cluster by the monitoring server.
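Steps 201 to 203 can be sketched as one monitoring round over all clusters. This is a minimal illustration, not the patented implementation: the `MonitoringData` shape, the toy alarm rule in `evaluate`, the collector callables that stand in for the acquisition instruction, and the client names are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MonitoringData:
    """Monitoring data fed back by one cluster (hypothetical shape)."""
    cluster_healthy: bool               # health data of the cluster itself
    client_connected: Dict[str, bool]   # state data of connected clients


def evaluate(cluster_id: str, data: MonitoringData) -> List[str]:
    """Apply a toy preset alarm rule and return alarm messages."""
    alarms = []
    if not data.cluster_healthy:
        alarms.append(f"{cluster_id}: cluster unhealthy")
    for client in sorted(data.client_connected):
        if not data.client_connected[client]:
            alarms.append(f"{cluster_id}: client {client} disconnected")
    return alarms


def monitor_once(
    collectors: Dict[str, Callable[[], MonitoringData]],
    report: Callable[[str], None],
) -> None:
    """Run one monitoring round over every cluster."""
    for cluster_id, collect in collectors.items():
        data = collect()                          # steps 201 and 202
        for alarm in evaluate(cluster_id, data):  # step 203
            report(alarm)


# Stand-ins for clusters answering the acquisition instruction.
collectors = {
    "CephFS_A": lambda: MonitoringData(True, {"fuse_a_1": True, "fuse_a_2": False}),
    "CephFS_B": lambda: MonitoringData(False, {"fuse_b_1": True}),
}
reported: List[str] = []
monitor_once(collectors, reported.append)
```

A single monitoring server handles both clusters in the same round, and client state is evaluated alongside cluster health.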
In the step 201, the monitoring server sends an acquisition instruction to each cluster in the distributed storage system.
Suppose the distributed storage system contains a plurality of clusters, for example three: a CephFS_A cluster, a CephFS_B cluster and a CephFS_C cluster. Prometheus serves as the monitoring server for CephFS, and the Prometheus Server in the monitoring server sends acquisition instructions to CephFS; specifically, the Prometheus Server sends acquisition instruction I to the CephFS_A cluster, to the CephFS_B cluster and to the CephFS_C cluster.
In the step 202, the monitoring server obtains the monitoring data fed back by each cluster based on the collection instruction, where the monitoring data includes health data of the cluster itself and status data of clients connected to the cluster.
After the Prometheus Server issues acquisition instruction I to the CephFS_A cluster, the CephFS_A cluster responds to acquisition instruction I with monitoring data about the CephFS_A cluster, so that the Prometheus Server obtains the monitoring data about the CephFS_A cluster; similarly, the Prometheus Server can obtain monitoring data about the CephFS_B cluster and about the CephFS_C cluster.
The monitoring data of the CephFS_A cluster may specifically comprise health data of the CephFS_A cluster itself (e.g., the running state of the OSD components and the state data of the PGs) and state data of the Ceph Fuse_A clients connected to the CephFS_A cluster (e.g., whether a Ceph Fuse_A client is connected to the CephFS_A cluster). For example, if 100 Ceph Fuse_A clients are connected to the CephFS_A cluster, the monitoring data about the CephFS_A cluster includes the health data of the CephFS_A cluster itself and also the state data of the 100 Ceph Fuse_A clients connected to the CephFS_A cluster. The monitoring data about the CephFS_B cluster and the CephFS_C cluster is analogous to that of the CephFS_A cluster and is not described again here.
In the step 203, for at least one cluster, the monitoring server determines the alarm information from the monitoring data of the cluster according to the preset alarm rule, and reports the alarm information to the alarm platform.
Suppose a preset alarm rule is set for the CephFS_A cluster. Prometheus determines the alarm information about the CephFS_A cluster by analyzing the monitoring data acquired from the CephFS_A cluster; further, Prometheus reports the obtained alarm information about the CephFS_A cluster to the alarm platform, again on the basis of the preset alarm rule. The alarm platform may be an IMS system or another alarm platform, which is not limited here. The alarm process of Prometheus for the CephFS_B cluster and the CephFS_C cluster is analogous to that for the CephFS_A cluster and is not described again here.
As a possible implementation method, there are a plurality of monitoring servers; any cluster comprises a plurality of node servers, and every node server that is connected to clients is connected to the same clients; the monitoring server sending acquisition instructions to each cluster in the distributed storage system comprises: for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.
FIG. 3 shows a schematic diagram of Prometheus monitoring CephFS according to an embodiment of the present invention. Referring to FIG. 3, two monitoring servers, namely Prometheus Server_X and Prometheus Server_Y, are deployed to monitor a distributed storage system in which a CephFS_A cluster, a CephFS_B cluster and a CephFS_C cluster are deployed. The CephFS_A cluster comprises a plurality of node servers; for convenience of description, it is assumed to have 4 node servers, namely A1, A2, A3 and A4. Similarly, the CephFS_B cluster is assumed to have 4 node servers, namely B1, B2, B3 and B4, and the CephFS_C cluster is assumed to have 4 node servers, namely C1, C2, C3 and C4.
For the cephfs_a cluster, there are 100 Ceph fuse_a clients connected to node servers configured with MDS components in the cluster, and if 3 node servers configured with MDS components in the cephfs_a cluster are provided, all the 100 Ceph fuse_a clients are connected to the 3 node servers configured with MDS components (not shown in the figure); similarly, for the cephfs_b cluster, there are 200 Ceph fuse_b clients connected to node servers configured with MDS components in the cluster, and if 3 node servers configured with MDS components in the cephfs_b cluster are provided, then all the 200 Ceph fuse_b clients are connected to the 3 node servers configured with MDS components (not shown in the figure); similarly, for the cephfs_c cluster, there are 300 Ceph fuse_c clients connected to node servers configured with MDS components in the cluster, and if 3 node servers configured with MDS components in the cephfs_c cluster are provided, then all the 300 Ceph fuse_c clients are connected to the 3 node servers configured with MDS components (not shown in the figure).
For Prometheus Sever_X, the monitoring server issues acquisition instructions to at least two node servers in any one of the CephFS_A cluster, cephFS_B cluster and CephFS_C cluster, which is specifically expressed as follows:
set at the moment of 8:00am, prometheus Sever_X transmits acquisition instructions I to 3 node servers A1, A2 and A4 in CephFS_A cluster; meanwhile, prometaus Sever_X issues acquisition instructions I to 3 node servers, namely B1, B3 and B4 in the CephFS_B cluster; meanwhile, prometheus Sever_X issues acquisition instructions I to the 3 node servers C1, C2 and C4 in the CephFS_C cluster.
When the Prometheus Sever_X issues the acquisition instructions to at least two node servers in the CephFS_A cluster, the acquisition instructions are issued to any at least two node servers in the CephFS_A cluster in a random mode. For example, the aforementioned Prometaus Sever_X may issue the acquisition instruction I to 3 node servers A1, A2 and A4 in the CephFS_A cluster, may issue the acquisition instruction I to 3 node servers A2, A3 and A4 in the CephFS_A cluster, and may issue the acquisition instruction I to 3 node servers A1, A2 and A3 in the CephFS_A cluster, which is not limited to the present invention. Similarly, when the Prometheus Sever_X issues the acquisition instruction to at least two node servers in the CephFS_B cluster, the acquisition instruction is issued to any at least two node servers in the CephFS_B cluster in a random manner; similarly, when the Prometheus Sever_X issues the acquisition instruction to at least two node servers in the CephFS_C cluster, the acquisition instruction is issued to any at least two node servers in the CephFS_C cluster in a random manner.
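The random choice of at least two node servers per cluster can be illustrated with a minimal sketch. The function name `pick_scrape_targets`, the default of 3 targets and the `clusters` dictionary are assumptions made for illustration only; the source does not say how the random choice is implemented.

```python
import random

def pick_scrape_targets(nodes, k=3, rng=random):
    """Pick k distinct node servers at random (hypothetical helper).

    The description requires at least two node servers per cluster,
    so k < 2 is rejected.
    """
    if k < 2:
        raise ValueError("at least two node servers must be selected")
    if k > len(nodes):
        raise ValueError("cluster has fewer node servers than requested")
    return sorted(rng.sample(list(nodes), k))

# Node servers per cluster, as in the Fig. 3 example.
clusters = {
    "CephFS_A": ["A1", "A2", "A3", "A4"],
    "CephFS_B": ["B1", "B2", "B3", "B4"],
    "CephFS_C": ["C1", "C2", "C3", "C4"],
}

# One round of target selection for a single monitoring server.
targets = {name: pick_scrape_targets(nodes) for name, nodes in clusters.items()}
```

Because every node server connected to clients sees the same set of clients, any two or more node servers chosen this way should give a comparable view of the cluster.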
In one possible implementation, the alarm rule includes an alarm generation rule. The monitoring server determining alarm information from the monitoring data according to the preset alarm rule comprises: the monitoring server determines, from the monitoring data, the first clients whose connection state with the cluster has changed; the monitoring server determines, according to the service change of the cluster, the second clients whose connection state with the cluster has changed; and alarm information of the client is generated, according to the alarm generation rule, for the clients that are among the first clients but not among the second clients.
For example, assume for convenience of description that 10 Ceph Fuse_A clients, W1, W2, W3, W4, W5, W6, W7, W8, W9 and W10, are connected to the node servers configured with MDS components in the CephFS_A cluster. Prometheus Server_X issues acquisition instruction I to the 3 node servers A1, A2 and A4 in the CephFS_A cluster. Prometheus Server_X first acquires the monitoring data on the A1 node server and, by analyzing it, determines that all 10 Ceph Fuse_A clients W1 to W10 are connected to the CephFS_A cluster. Prometheus Server_X then acquires the monitoring data on the A2 node server and, by analyzing it, determines that only the 3 Ceph Fuse_A clients W8, W9 and W10 are still connected to the CephFS_A cluster, while the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7 have gone offline from the CephFS_A cluster. That is, the first clients, whose connection state with the cluster has changed, are the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7.
For such an abnormal event at the Ceph Fuse_A clients, it is further necessary to determine why the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7 went offline from the CephFS_A cluster, that is, whether each client was unmounted from the CephFS_A cluster normally or was passively unmounted because of a problem in the CephFS_A cluster itself.
The services running on the CephFS_A cluster routinely unmount some of the clients connected to the cluster as business needs require. For example, business personnel may unmount the 3 Ceph Fuse_A clients W5, W6 and W7 from the CephFS_A cluster. That is, the second clients, whose connection state with the cluster has changed because of the service change, are the 3 Ceph Fuse_A clients W5, W6 and W7.
Comparing the first clients (the 7 Ceph Fuse_A clients W1, W2, W3, W4, W5, W6 and W7) with the second clients (the 3 Ceph Fuse_A clients W5, W6 and W7) shows that the unmounting of W5, W6 and W7 is a normal unmount event, so the offline state of these 3 clients in the monitoring data does not need to be reported to the IMS system. The unmounting of the 4 Ceph Fuse_A clients W1, W2, W3 and W4 is an abnormal unmount event, and alarm information of the client is generated for them according to the alarm generation rule.
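The comparison described above amounts to two set differences. The following is a minimal sketch of that reasoning; the function names `offline_clients` and `abnormal_clients` and the in-memory sets are hypothetical and stand in for whatever the monitoring server actually derives from its monitoring data.

```python
def offline_clients(earlier_view, later_view):
    """Clients present in the earlier view but missing from the later one."""
    return set(earlier_view) - set(later_view)

def abnormal_clients(first_clients, second_clients):
    """Clients whose state changed but not because of a planned service change."""
    return set(first_clients) - set(second_clients)

# Views taken from the A1 and A2 node servers in the example.
a1_view = {f"W{i}" for i in range(1, 11)}   # W1..W10 all connected
a2_view = {"W8", "W9", "W10"}               # only W8..W10 remain

first = offline_clients(a1_view, a2_view)   # W1..W7 changed state
second = {"W5", "W6", "W7"}                 # unmounted by business personnel

to_alarm = abnormal_clients(first, second)
print(sorted(to_alarm))  # ['W1', 'W2', 'W3', 'W4']
```

Only the clients left in `to_alarm` would be passed to the alarm generation rule; W5, W6 and W7 are filtered out as normal unmount events.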
In one possible implementation, the alarm rule further includes an alarm suppression rule. The monitoring server determines the change duration of the service change of the cluster and sets an alarm suppression rule for the alarm information of the client; the alarm suppression rule is used to withhold the reporting of client alarm information generated within the change duration.
Continuing the previous example, suppose the 3 Ceph Fuse_A clients W5, W6 and W7 connected to the CephFS_A cluster are unmounted normally for business reasons, and the unmounting takes 3 h. Then, for the 3 h following the acquisition of the monitoring data on the A2 node server, Prometheus Server_X does not report the offline events of the 3 Ceph Fuse_A clients W5, W6 and W7 to the IMS system. That is, Prometheus Server_X writes the events of the 3 Ceph Fuse_A clients W5, W6 and W7 going offline from the CephFS_A cluster into the alarm suppression rule.
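A time-windowed suppression rule of this kind can be sketched as follows. The class name `SuppressionRule`, its fields and the example timestamps are assumptions for illustration; the source only states that client alarms generated within the change duration are not reported.

```python
from datetime import datetime, timedelta

class SuppressionRule:
    """Withhold client alarms for a set of clients during a change window
    (hypothetical structure, not taken from the source)."""

    def __init__(self, cluster, clients, start, duration):
        self.cluster = cluster
        self.clients = set(clients)
        self.start = start
        self.end = start + duration

    def suppresses(self, cluster, client, at):
        """True if an alarm for this client at time `at` should be withheld."""
        return (
            cluster == self.cluster
            and client in self.clients
            and self.start <= at < self.end
        )

# Planned unmount of W5, W6 and W7 that takes 3 h, as in the example.
start = datetime(2019, 12, 23, 8, 0)
rule = SuppressionRule("CephFS_A", {"W5", "W6", "W7"}, start, timedelta(hours=3))

print(rule.suppresses("CephFS_A", "W5", start + timedelta(hours=1)))  # True
print(rule.suppresses("CephFS_A", "W1", start + timedelta(hours=1)))  # False
print(rule.suppresses("CephFS_A", "W5", start + timedelta(hours=4)))  # False
```

An offline event for W5 that is still observed after the 3 h window has closed is no longer covered by the rule and would be handled by the normal alarm generation path.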
In one possible implementation, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself. The monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises: if the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, the monitoring server reports the alarm information of the MDS component to the alarm platform.
Continuing the previous example, the monitoring data that Prometheus Server_X acquires for the CephFS_A cluster includes health data of the CephFS_A cluster itself (for example, the running status of OSD components and the status data of PGs) and state data of the Ceph Fuse_A clients connected to the CephFS_A cluster (for example, whether a Ceph Fuse_A client is accessing the CephFS_A cluster). Suppose that at time T Prometheus Server_X acquires monitoring data about the CephFS_A cluster, and the monitoring data shows that an MDS component in the CephFS_A cluster is running abnormally and, at the same time, that an abnormal unmount event has occurred at the Ceph Fuse_A client W1 connected to the CephFS_A cluster. Prometheus Server_X defines the alarm level of the abnormal event of the MDS component as high and the alarm level of the abnormal unmount event of the Ceph Fuse_A client W1 as low. Prometheus Server_X then reports the high-level alarm event to the IMS system; that is, it reports the abnormal running of the MDS component in the CephFS_A cluster to the IMS system, but does not report the abnormal unmount event of the Ceph Fuse_A client W1.
It should be noted that the monitoring server may set the alarm level of the MDS component's alarm information higher than that of the client's alarm information because an abnormality of the MDS component in a cluster may itself cause abnormal events at the clients connected to that cluster. Therefore, once the MDS component's alarm information has been reported to the IMS system and operation and maintenance personnel have investigated, not only can the MDS component be restored to normal operation, but the clients connected to the cluster can be restored to a normal state as well.
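The level comparison between MDS alarms and client alarms can be sketched as a simple filter. The `Alarm` dataclass, the `SEVERITY` ranking and the function `alarms_to_report` are hypothetical names introduced here; the source only says that the MDS alarm is reported when its level is higher than that of the client alarm.

```python
from dataclasses import dataclass

# Assumed ranking for illustration: a larger number means a more severe alarm.
SEVERITY = {"low": 1, "high": 2}

@dataclass(frozen=True)
class Alarm:
    cluster: str
    source: str    # "mds" or "client"
    level: str     # key into SEVERITY
    message: str

def alarms_to_report(alarms):
    """Keep MDS alarms and drop client alarms that rank below the
    most severe MDS alarm observed at the same time."""
    mds_ranks = [SEVERITY[a.level] for a in alarms if a.source == "mds"]
    if not mds_ranks:
        return list(alarms)          # no MDS alarm: nothing to hide behind
    top = max(mds_ranks)
    return [a for a in alarms if a.source == "mds" or SEVERITY[a.level] >= top]

# Situation at time T in the example.
observed = [
    Alarm("CephFS_A", "mds", "high", "MDS component running abnormally"),
    Alarm("CephFS_A", "client", "low", "W1 abnormal unmount"),
]
reported = alarms_to_report(observed)
print([a.message for a in reported])  # ['MDS component running abnormally']
```

The design follows the rationale given in the text: the client anomaly is likely a consequence of the MDS anomaly, so reporting the root-cause alarm alone is sufficient.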
In one possible implementation, after the monitoring server acquires the monitoring data fed back by each cluster based on the acquisition instruction, the method further comprises: the monitoring server sets, for each piece of monitoring data, the corresponding cluster identifier.
As an example, referring to Fig. 3, Prometheus Server_X issues acquisition instruction I to the three node servers A1, A2 and A4 in the CephFS_A cluster and, at the same time, to the three node servers B1, B3 and B4 in the CephFS_B cluster and to the three node servers C1, C2 and C4 in the CephFS_C cluster. Once the CephFS_A cluster, the CephFS_B cluster and the CephFS_C cluster have responded to acquisition instruction I, Prometheus Server_X obtains the monitoring data of each cluster. Each piece of monitoring data may be marked with the identifier of its cluster: for example, the first piece acquired by Prometheus Server_X is the monitoring data on the A1 node server of the CephFS_A cluster, the second piece is the monitoring data on the B3 node server of the CephFS_B cluster, the third piece is the monitoring data on the C4 node server of the CephFS_C cluster, and so on.
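Marking each piece of monitoring data with its cluster identifier can be sketched as below. The record layout, the field names `cluster`, `node` and `data`, and the example payloads are assumptions for illustration; the source does not specify how the identifier is stored.

```python
from collections import defaultdict

def tag_with_cluster(cluster_id, node, payload):
    """Wrap one piece of monitoring data with its cluster identifier
    (hypothetical record layout)."""
    return {"cluster": cluster_id, "node": node, "data": payload}

def group_by_cluster(records):
    """Group tagged records so alarm rules can be applied per cluster."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["cluster"]].append(record)
    return dict(grouped)

# Pieces of monitoring data in the order of the example; payloads are made up.
records = [
    tag_with_cluster("CephFS_A", "A1", {"clients_connected": 100}),
    tag_with_cluster("CephFS_B", "B3", {"clients_connected": 200}),
    tag_with_cluster("CephFS_C", "C4", {"clients_connected": 300}),
]

by_cluster = group_by_cluster(records)
print(sorted(by_cluster))  # ['CephFS_A', 'CephFS_B', 'CephFS_C']
```

With the identifier attached, data arriving interleaved from several clusters can still be matched against the alarm rules of the correct cluster.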
In one possible implementation, the alarm rule further includes an alarm convergence rule. The monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises: the monitoring server determines that the alarm information repeats alarm information that has already appeared in the cluster (that is, it is not the first occurrence), and reports the alarm information to the alarm platform after a set delay determined from the correspondence between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
Continuing the previous example, suppose the first piece of monitoring data acquired by Prometheus Server_X comes from the CephFS_A cluster. After analyzing it according to the preset alarm rule, Prometheus Server_X determines that it can be reported to the IMS system as alarm information; let the alarm information generated from the first piece of monitoring data be Info_1, with alarm level 1. If the sixth piece of monitoring data acquired by Prometheus Server_X also relates to the CephFS_A cluster and, after analysis according to the preset alarm rule, the alarm information generated from it is found to match Info_1, Prometheus Server_X must further determine, according to the alarm level of Info_1, when to report the sixth piece of monitoring data to the IMS system. If the alarm delay corresponding to alarm level 1 is set to 1 h, Prometheus Server_X does not report the Info_1 corresponding to the sixth piece of monitoring data to the IMS system within the next 1 h.
Suppose the second piece of monitoring data acquired by Prometheus Server_X comes from the CephFS_B cluster. After analyzing it according to the preset alarm rule, Prometheus Server_X determines that it can be reported to the IMS system as alarm information; let the alarm information generated from the second piece of monitoring data be Info_2, with alarm level 2. If the ninth piece of monitoring data acquired by Prometheus Server_X also relates to the CephFS_B cluster and, after analysis according to the preset alarm rule, the alarm information generated from it is found to match Info_2, Prometheus Server_X must further determine, according to the alarm level of Info_2, when to report the ninth piece of monitoring data to the IMS system. If the alarm delay corresponding to alarm level 2 is set to 2 h, Prometheus Server_X does not report the Info_2 corresponding to the ninth piece of monitoring data to the IMS system within the next 2 h.
Suppose the third piece of monitoring data acquired by Prometheus Server_X comes from the CephFS_C cluster. After analyzing it according to the preset alarm rule, Prometheus Server_X determines that it can be reported to the IMS system as alarm information; let the alarm information generated from the third piece of monitoring data be Info_3, with alarm level 3. If the tenth piece of monitoring data acquired by Prometheus Server_X also relates to the CephFS_C cluster and, after analysis according to the preset alarm rule, the alarm information generated from it is found to match Info_3, Prometheus Server_X must further determine, according to the alarm level of Info_3, when to report the tenth piece of monitoring data to the IMS system. If the alarm delay corresponding to alarm level 3 is set to 3 h, Prometheus Server_X does not report the Info_3 corresponding to the tenth piece of monitoring data to the IMS system within the next 3 h.
In the above example, the alarm level decreases from level 1 through level 2 to level 3, and the corresponding alarm delay lengthens accordingly: 1 h, 2 h and 3 h respectively.
Based on this scheme, after the monitoring server determines that a piece of alarm information repeats alarm information that a cluster has already produced, the repeated alarm is reported to the alarm platform only after the set delay given by the alarm convergence rule. This effectively prevents the waste of resources that would result from a cluster sending the same alarm continuously and repeatedly.
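The convergence behaviour in the example can be sketched as a small filter that remembers when each alarm was last reported. This is one possible reading of the rule (a repeated alarm is held back until the level-dependent delay has elapsed since the last report); the class name `ConvergenceFilter`, the `DELAY_BY_LEVEL` table and the timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Delay per alarm level, as in the example: level 1 is the highest level.
DELAY_BY_LEVEL = {
    1: timedelta(hours=1),
    2: timedelta(hours=2),
    3: timedelta(hours=3),
}

class ConvergenceFilter:
    """Report the first occurrence at once; hold back repeats until the
    delay for the alarm's level has elapsed (hypothetical sketch)."""

    def __init__(self, delay_by_level=DELAY_BY_LEVEL):
        self.delay_by_level = delay_by_level
        self.last_reported = {}   # (cluster, alarm_id) -> time of last report

    def should_report(self, cluster, alarm_id, level, now):
        key = (cluster, alarm_id)
        last = self.last_reported.get(key)
        if last is not None and now - last < self.delay_by_level[level]:
            return False          # same alarm seen again inside the delay window
        self.last_reported[key] = now
        return True

f = ConvergenceFilter()
t0 = datetime(2019, 12, 23, 8, 0)
print(f.should_report("CephFS_A", "Info_1", 1, t0))                          # True
print(f.should_report("CephFS_A", "Info_1", 1, t0 + timedelta(minutes=30)))  # False
print(f.should_report("CephFS_A", "Info_1", 1, t0 + timedelta(minutes=61)))  # True
```

Keying the state on both the cluster and the alarm identity keeps identical alarms from different clusters from suppressing each other.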
Based on the same concept, an embodiment of the present invention further provides an apparatus for monitoring a distributed storage system. As shown in Fig. 4, the apparatus includes:
a sending unit 401, configured to send an acquisition instruction to each cluster in the distributed storage system;
an obtaining unit 402, configured to obtain monitoring data fed back by each cluster based on the acquisition instruction, where the monitoring data includes health data of the cluster itself and state data of clients connected to the cluster;
a determining unit 403, configured to determine, for at least one cluster, alarm information from the monitoring data of the cluster according to a preset alarm rule, and to report the alarm information to an alarm platform.
Further, for the apparatus, there are multiple monitoring servers; each cluster comprises a plurality of node servers, and every node server that is connected to clients is connected to the same set of clients; for any monitoring server, the sending unit 401 is specifically configured to send an acquisition instruction to at least two node servers in any cluster.
Further, for the apparatus, the alarm rule includes an alarm generation rule; the determining unit 403 is specifically configured to determine, from the monitoring data, the first clients whose connection state with the cluster has changed; to determine, according to the service change of the cluster, the second clients whose connection state with the cluster has changed; and to generate alarm information of the client, according to the alarm generation rule, for the clients that are among the first clients but not among the second clients.
Further, for the apparatus, the alarm rule further includes an alarm suppression rule; the determining unit 403 is specifically configured to determine the change duration of the service change of the cluster, and to set an alarm suppression rule for the alarm information of the client, where the alarm suppression rule is used to withhold the reporting of client alarm information generated within the change duration.
Further, for the apparatus, the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself; the determining unit 403 is specifically configured to report the alarm information of the MDS component to the alarm platform upon determining that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client.
Further, for the apparatus, after the monitoring server obtains the monitoring data fed back by each cluster based on the acquisition instruction, the determining unit 403 is further configured to set the cluster identifier corresponding to each piece of monitoring data.
Further, for the apparatus, the alarm rule further includes an alarm convergence rule; the determining unit 403 is specifically configured to determine that the alarm information repeats alarm information that has already appeared in the cluster, and to report the alarm information to the alarm platform after a set delay determined from the correspondence between alarm level and alarm delay in the alarm convergence rule; the lower the alarm level, the longer the corresponding alarm delay.
Embodiments of the present invention provide a computing device, which may specifically be a desktop computer, a portable computer, a smartphone, a tablet computer, a personal digital assistant (PDA), or the like. The computing device may include a central processing unit (CPU), a memory, input/output devices and so on. The input devices may include a keyboard, a mouse, a touch screen and the like, and the output devices may include a display device such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
The memory may include read-only memory (ROM) and random access memory (RAM), and provides the processor with the program instructions and data stored in it. In an embodiment of the present invention, the memory may be used to store the program instructions of the method of monitoring a distributed storage system.
The processor is configured to call the program instructions stored in the memory and to execute the method of monitoring a distributed storage system according to the obtained program.
Embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of monitoring a distributed storage system.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, or as a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A method of monitoring a distributed storage system, comprising:
a monitoring server sends acquisition instructions to each cluster in the distributed storage system;
the monitoring server acquires monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster itself and state data of clients connected to the cluster;
for at least one cluster, the monitoring server determines alarm information from the monitoring data of the cluster according to a preset alarm rule and reports the alarm information to an alarm platform;
wherein the alarm rule comprises an alarm generation rule;
the monitoring server determining alarm information from the monitoring data according to the preset alarm rule comprises:
the monitoring server determines, from the monitoring data, first clients whose connection state with the cluster has changed;
the monitoring server determines, according to the service change of the cluster, second clients whose connection state with the cluster has changed;
and alarm information of the client is generated, according to the alarm generation rule, for the clients that are among the first clients but not among the second clients.
2. The method of claim 1, wherein there are a plurality of monitoring servers; each cluster comprises a plurality of node servers, and every node server that is connected to clients is connected to the same set of clients;
the monitoring server sending acquisition instructions to each cluster in the distributed storage system comprises:
for any monitoring server, the monitoring server issues acquisition instructions to at least two node servers in any cluster.
3. The method of claim 1, wherein the alarm rule further comprises an alarm suppression rule;
the monitoring server determines the change duration of the service change of the cluster;
the monitoring server sets an alarm suppression rule for the alarm information of the client, the alarm suppression rule being used to withhold the reporting of client alarm information generated within the change duration.
4. The method of claim 1, wherein the monitoring server generates alarm information of the MDS component of the cluster according to the health data of the cluster itself;
the monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises:
if the monitoring server determines that the alarm level of the alarm information of the MDS component is higher than that of the alarm information of the client, the monitoring server reports the alarm information of the MDS component to the alarm platform.
5. The method of claim 1, wherein after the monitoring server acquires the monitoring data fed back by each cluster based on the acquisition instruction, the method further comprises:
the monitoring server sets the cluster identifier corresponding to each piece of monitoring data.
6. The method of any of claims 1-5, wherein the alarm rule further comprises an alarm convergence rule;
the monitoring server reporting the alarm information to the alarm platform according to the preset alarm rule comprises:
the monitoring server determines that the alarm information repeats alarm information that has already appeared in the cluster, and reports the alarm information to the alarm platform after a set delay determined from the correspondence between alarm level and alarm delay in the alarm convergence rule; wherein the lower the alarm level, the longer the corresponding alarm delay.
7. An apparatus for monitoring a distributed storage system, comprising:
a sending unit, configured to send acquisition instructions to each cluster in the distributed storage system;
an obtaining unit, configured to obtain monitoring data fed back by each cluster based on the acquisition instruction, wherein the monitoring data comprises health data of the cluster itself and state data of clients connected to the cluster;
a determining unit, configured to determine, for at least one cluster, alarm information from the monitoring data of the cluster according to a preset alarm rule, and to report the alarm information to an alarm platform;
wherein the alarm rule comprises an alarm generation rule;
the determining unit is specifically configured to determine, from the monitoring data, first clients whose connection state with the cluster has changed; to determine, according to the service change of the cluster, second clients whose connection state with the cluster has changed; and to generate alarm information of the client, according to the alarm generation rule, for the clients that are among the first clients but not among the second clients.
8. A computing device, comprising:
a memory for storing program instructions;
a processor for invoking program instructions stored in said memory to perform the method according to any of claims 1-6 in accordance with the obtained program.
9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1-6.
CN201911336662.5A 2019-12-23 2019-12-23 Method and device for monitoring distributed storage system Active CN111049705B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911336662.5A CN111049705B (en) 2019-12-23 2019-12-23 Method and device for monitoring distributed storage system
PCT/CN2020/134339 WO2021129367A1 (en) 2019-12-23 2020-12-07 Method and apparatus for monitoring distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911336662.5A CN111049705B (en) 2019-12-23 2019-12-23 Method and device for monitoring distributed storage system

Publications (2)

Publication Number Publication Date
CN111049705A CN111049705A (en) 2020-04-21
CN111049705B true CN111049705B (en) 2023-09-12

Family

ID=70238567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911336662.5A Active CN111049705B (en) 2019-12-23 2019-12-23 Method and device for monitoring distributed storage system

Country Status (2)

Country Link
CN (1) CN111049705B (en)
WO (1) WO2021129367A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111049705B (en) * 2019-12-23 2023-09-12 深圳前海微众银行股份有限公司 Method and device for monitoring distributed storage system
CN111597091A (en) * 2020-05-20 2020-08-28 北京金山云网络技术有限公司 Data monitoring method and system, electronic equipment and computer storage medium
CN111625421B (en) * 2020-05-26 2021-07-16 云和恩墨(北京)信息技术有限公司 Method and device for monitoring distributed storage system, storage medium and processor
CN111988165B (en) * 2020-07-09 2023-01-24 云知声智能科技股份有限公司 Method and system for monitoring use condition of distributed storage system
CN112084098A (en) * 2020-10-21 2020-12-15 中国银行股份有限公司 Resource monitoring system and working method
CN112650642A (en) * 2020-12-07 2021-04-13 深圳前海微众银行股份有限公司 Alarm processing method and device, equipment and storage medium
CN112751726B (en) * 2020-12-17 2022-09-09 北京达佳互联信息技术有限公司 Data processing method and device, electronic equipment and storage medium
CN112783745A (en) * 2021-02-02 2021-05-11 无锡车联天下信息技术有限公司 Cluster data monitoring method, device, system and storage medium
CN113688149A (en) * 2021-07-20 2021-11-23 青岛海尔科技有限公司 Monitoring method and device
CN114115718B (en) * 2021-08-31 2024-03-29 济南浪潮数据技术有限公司 Distributed block storage system service quality control method, device, equipment and medium
CN113641558A (en) * 2021-08-31 2021-11-12 合众人寿保险股份有限公司 Health examination method and device and electronic equipment
US20230108213A1 (en) * 2021-10-05 2023-04-06 Softiron Limited Ceph Failure and Verification
CN114090644B (en) * 2022-01-20 2022-04-26 飞狐信息技术(天津)有限公司 Data processing method and device
CN114760221B (en) * 2022-03-31 2024-02-23 深信服科技股份有限公司 Service monitoring method, system and storage medium
CN115567526A (en) * 2022-09-21 2023-01-03 中国平安人寿保险股份有限公司 Data monitoring method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104202212A (en) * 2014-08-28 2014-12-10 浪潮(北京)电子信息产业有限公司 System and method for obtaining distributed cluster system alarm
CN107864063A (en) * 2017-12-12 2018-03-30 北京奇艺世纪科技有限公司 Abnormality monitoring method, device and electronic equipment
CN109522287A (en) * 2018-09-18 2019-03-26 平安科技(深圳)有限公司 Monitoring method, system, equipment and medium for distributed file storage cluster

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341682A1 (en) * 2017-05-26 2018-11-29 Nutanix, Inc. System and method for generating rules from search queries
CN107291594A (en) * 2017-06-30 2017-10-24 上海白虹软件科技股份有限公司 Device and method for monitoring and managing Ceph on an OpenStack platform
US11102174B2 (en) * 2017-12-26 2021-08-24 Palo Alto Networks, Inc. Autonomous alerting based on defined categorizations for network space and network boundary changes
CN109298945A (en) * 2018-10-17 2019-02-01 北京京航计算通讯研究所 Ceph distributed storage monitoring and tuning management method for big data platforms
CN111049705B (en) * 2019-12-23 2023-09-12 深圳前海微众银行股份有限公司 Method and device for monitoring distributed storage system

Also Published As

Publication number Publication date
CN111049705A (en) 2020-04-21
WO2021129367A1 (en) 2021-07-01

Similar Documents

Publication Publication Date Title
CN111049705B (en) Method and device for monitoring distributed storage system
CN107092522B (en) Real-time data calculation method and device
US10095598B2 (en) Transaction server performance monitoring using component performance data
CN112527848B (en) Report data query method, device and system based on multiple data sources and storage medium
CN105871581A (en) Method and device for processing of alarm information in cloud calculation
CN110096683A (en) Report form generation method, system, computer installation and computer readable storage medium
CN111740860A (en) Log data transmission link monitoring method and device
CN111078695B (en) Method and device for calculating association relation of metadata in enterprise
CN112702184A (en) Fault early warning method and device and computer-readable storage medium
CN111625418A (en) Process monitoring method and device
CN110046070B (en) Monitoring method and device of server cluster system, electronic equipment and storage medium
CN112910733A (en) Full link monitoring system and method based on big data
CN111240936A (en) Data integrity checking method and equipment
CN111274032A (en) Task processing system and method, and storage medium
CN108255710B (en) Script abnormity detection method and terminal thereof
CN109766238B (en) Session number-based operation and maintenance platform performance monitoring method and device and related equipment
CN114490272A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN109829016B (en) Data synchronization method and device
US11294704B2 (en) Monitoring and reporting performance of online services using a monitoring service native to the online service
JP2021093115A (en) Method and apparatus for processing local hot spot, electronic device and storage medium
CN111917812A (en) Data transmission control method, device, equipment and storage medium
JP5586322B2 (en) Plant monitoring system and plant monitoring method
CN115858309B (en) Data monitoring method and device for distributed system and electronic equipment
US20160154684A1 (en) Data processing system and data processing method
CN110677271A (en) Big data alarm method, device, equipment and storage medium based on ELK

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant