CN109522287B - Monitoring method, system, equipment and medium for distributed file storage cluster

Info

Publication number: CN109522287B
Application number: CN201811087179.3A
Authority: CN
Inventors: 王涛
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-09-18
Filing date: 2018-09-18
Publication date: 2023-08-18
Anticipated expiration: 2038-09-18
Also published as: CN109522287A

Abstract

The invention discloses a monitoring method, a system, equipment and a medium of a distributed file storage cluster, wherein the method comprises the following steps: the monitoring server receives monitoring configuration information of the distributed file storage cluster sent by the monitoring platform, receives the internal state of the distributed file storage cluster sent by the monitoring client at fixed time, performs statistical analysis on the internal state of the cluster to obtain real-time monitoring data of the monitoring item, generates an abnormal problem if the real-time monitoring data of the monitoring item accords with an abnormal condition, generates an abnormal repairing instruction according to the abnormal problem and sends the abnormal repairing instruction to a central server of the distributed file storage cluster, so that the central server calls a corresponding abnormal repairing scheme to repair the abnormal problem. By monitoring the distributed file storage cluster in real time, the invention can timely find out the abnormal problem and repair the abnormal problem, timely maintain the health state of the cluster and improve the operation and maintenance efficiency of the distributed file storage cluster.

Description

Monitoring method, system, equipment and medium for distributed file storage cluster

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, a system, an apparatus, and a medium for monitoring a distributed file storage cluster.

Background

CEPH is an open-source distributed file storage system that provides object, block, and file storage and is widely used in the data management systems of many companies. It improves data fault tolerance and storage efficiency, can manage and analyze massive amounts of data, can serve very large data sets to thousands of users, and greatly reduces labor and management costs.

However, a CEPH distributed storage deployment generally comprises many node servers, which makes monitoring, operation, and maintenance complex. If a latent fault arises in the server cluster, it cannot easily be located in time. At present, when the server cluster has a problem, the cause must be investigated manually, so locating the problem takes a long time and the operation and maintenance efficiency of the CEPH cluster is reduced.

Disclosure of Invention

The embodiment of the invention provides a monitoring method, a monitoring system, monitoring equipment and a monitoring medium for a distributed file storage cluster, which are used for solving the problems that faults in a CEPH cluster are not located in time and that operation and maintenance efficiency is low.

A monitoring method of a distributed file storage cluster comprises the following steps:

the monitoring server receives monitoring configuration information of the distributed file storage cluster sent by the monitoring platform, wherein the monitoring configuration information comprises monitoring items and abnormal conditions;

the monitoring server receives the internal state of the distributed file storage cluster, which is sent by the monitoring client at regular time, wherein the monitoring client is pre-deployed on a node server corresponding to a monitoring node of the distributed file storage cluster, and the internal state of the distributed file storage cluster is obtained from the node server corresponding to the monitoring node at regular time by the monitoring client;

the monitoring server performs statistical analysis on the internal state of the distributed file storage cluster according to the monitoring configuration information to obtain real-time monitoring data of the monitoring item;

if the real-time monitoring data of the monitoring item accords with the abnormal condition, the monitoring server determines the monitoring item as an abnormal object, takes the real-time monitoring data as abnormal data, and generates an abnormal problem according to the abnormal object and the abnormal data;

the monitoring server generates an abnormal repair instruction according to the abnormal problem and sends the abnormal repair instruction to a central server of the distributed file storage cluster;

if the central server receives the abnormal repair instruction, the central server analyzes the abnormal repair instruction and calls a corresponding abnormal repair scheme according to the analysis result to repair the abnormal problem.

A monitoring system for a distributed file storage cluster, comprising: the system comprises a monitoring server and a central server, wherein the monitoring server and the central server are connected through a network;

the monitoring server includes:

the monitoring configuration module is used for receiving monitoring configuration information of the distributed file storage cluster sent by the monitoring platform, wherein the monitoring configuration information comprises monitoring items and abnormal conditions;

the data receiving module is used for receiving the internal state of the distributed file storage cluster, which is sent by the monitoring client at regular time, wherein the monitoring client is pre-deployed on a node server corresponding to a monitoring node of the distributed file storage cluster, and the internal state of the distributed file storage cluster is obtained from the node server corresponding to the monitoring node at regular time by the monitoring client;

the data analysis module is used for carrying out statistical analysis on the internal state of the distributed file storage cluster according to the monitoring configuration information to obtain real-time monitoring data of the monitoring item;

the abnormality confirmation module is used for determining the monitoring item as an abnormal object if the real-time monitoring data of the monitoring item accords with the abnormal condition, taking the real-time monitoring data as abnormal data and generating an abnormal problem according to the abnormal object and the abnormal data;

the abnormality notification module is used for generating an abnormality repair instruction according to the abnormality problem and sending the abnormality repair instruction to a central server of the distributed file storage cluster;

the center server includes:

and the exception repair module is used for analyzing the exception repair instruction if the exception repair instruction is received, and calling a corresponding exception repair scheme to repair the exception problem according to the analysis result.

A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of monitoring a distributed file storage cluster as described above when the computer program is executed.

A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for monitoring a distributed file storage cluster described above.

According to the monitoring method, the system, the equipment and the medium of the distributed file storage cluster, the monitoring server receives the monitoring configuration information configured by a user for the distributed file storage cluster. The pre-deployed monitoring client acquires the internal state of the distributed file storage cluster at regular intervals and uploads it to the monitoring server, and the monitoring server performs statistical analysis on the internal state to obtain real-time monitoring data of the monitoring item. The monitoring server can therefore monitor the distributed file storage cluster in real time, and the monitoring items can be customized. Meanwhile, if the real-time monitoring data of the monitoring item accords with the abnormal condition, the monitoring server generates a corresponding abnormal problem, generates an abnormal repair instruction according to the abnormal problem, and sends the abnormal repair instruction to the central server of the distributed file storage cluster. The central server analyzes the abnormal repair instruction and calls a corresponding abnormal repair scheme according to the analysis result to repair the abnormal problem. The health state of the distributed file storage cluster is thus maintained in time so that the cluster can operate normally, which improves the operation and maintenance efficiency and the intelligent management level of the distributed file storage cluster.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic view of an application environment of a method for monitoring a distributed file storage cluster according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for monitoring a distributed file storage cluster according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for monitoring a distributed file storage cluster according to an embodiment of the present invention, wherein a monitoring server outputs monitoring data;

FIG. 4 is a flowchart showing a method for monitoring a distributed file storage cluster according to an embodiment of the present invention, in which a monitoring server sends an alarm message;

FIG. 5 is a flowchart showing a step S60 in a method for monitoring a distributed file storage cluster according to an embodiment of the present invention;

FIG. 6 is a flowchart showing a method for monitoring a distributed file storage cluster according to an embodiment of the present invention, in which a central server transmits a repair result;

FIG. 7 is a schematic block diagram of a monitoring system for a distributed file storage cluster in accordance with one embodiment of the present application;

FIG. 8 is a schematic diagram of a computer device in accordance with an embodiment of the application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The monitoring method of the distributed file storage cluster provided by the application can be applied to an application environment as shown in fig. 1, wherein the distributed file storage cluster comprises a central server and a plurality of node servers, the monitoring server receives the internal states of the distributed file storage cluster obtained from the node servers in real time by a monitoring client through a network to obtain monitoring data, the monitoring server analyzes the monitoring data, outputs the real-time monitoring data to a monitoring platform, and when an abnormal problem occurs, sends an abnormal repair instruction to the central server of the distributed file storage cluster through the network, and the central server manages and maintains the node servers. The monitoring client and the monitoring platform can be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices and the like. The monitoring method of the distributed file storage cluster provided by the embodiment of the application is completed by the cooperation of the monitoring server and the central server.

In an embodiment, fig. 2 shows a flowchart of a method for monitoring a distributed file storage cluster in the present embodiment, and as shown in fig. 2, the method for monitoring a distributed file storage cluster includes steps S10 to S60, which are described in detail below:

S10: the monitoring server receives monitoring configuration information of the distributed file storage cluster sent by the monitoring platform, wherein the monitoring configuration information comprises monitoring items and abnormal conditions.

In the embodiment of the invention, the distributed file storage cluster is a distributed file storage system for providing functions of object, block and file storage, and the distributed file storage system is realized by a server cluster consisting of a plurality of servers, wherein the distributed file storage cluster comprises a central server and a node server, the central server is used for managing the node server, and the node server is used for storing management files.

The monitoring server is a server for monitoring the internal state of the distributed file storage cluster, and may be, but is not limited to, a NAGIOS (network monitoring) server, a ZABBIX (system monitoring) server, or a Ganglia (cluster monitoring) server. The monitoring platform is an interactive tool provided by the monitoring server for monitoring management; for example, the monitoring platform may be a virtual terminal such as a browser, so that a user can configure and view monitoring information on the monitoring platform.

Specifically, the user configures the monitoring configuration information of the distributed file storage cluster on the monitoring platform in advance, and the monitoring server receives this monitoring configuration information from the monitoring platform through the network. The monitoring configuration information comprises monitoring items and abnormal conditions. A monitoring item comprises a monitoring object and the IP address of the monitoring object. An abnormal condition is a judging condition set for a monitoring object in the monitoring configuration information and is used for judging whether the monitoring object is in a normal state. The monitoring items may be conventional items that the monitoring server monitors by default for the internal state of the distributed file storage cluster, for example resource utilization, disk capacity, and network traffic, or may be user-defined items; the specific monitoring items can be customized according to actual needs and are not limited here.

Preferably, the distributed file storage cluster may be a CEPH cluster, where the CEPH cluster is an open-source distributed file storage system, and file storage of the CEPH cluster is high in security and file storage efficiency.

For example, when the distributed file storage cluster is specifically a CEPH cluster, the monitoring items may monitor the activity state of the CEPH cluster, the number of OSDs (Object Storage Devices) in the CEPH cluster, or the number of port 80 connections of a node server in the CEPH cluster. The main functions of an OSD are to store, replicate, balance, and recover data, providing the storage service for the CEPH cluster. For the monitoring item that monitors the number of port 80 connections of a node server in the CEPH cluster, the abnormal condition may be set as follows: the number of port 80 connections of the node server is less than 5. If the number of port 80 connections of a node server is found to be less than 5, port 80 of that node server is abnormal and the abnormal condition in the preset monitoring configuration information is met.
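As an illustration only, the monitoring configuration information described above could be represented as follows. This is a minimal sketch: the class name `MonitoringItem`, its field names, and the use of a predicate function for the abnormal condition are assumptions made for the example, and the IP address is illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MonitoringItem:
    """One monitoring item: the monitored object, its IP address, and its abnormal condition."""
    monitored_object: str                  # hypothetical name, e.g. "port80_connections"
    ip_address: str                        # address of the monitored node server
    is_abnormal: Callable[[float], bool]   # the abnormal condition, expressed as a predicate


# Hypothetical item mirroring the port 80 example: fewer than 5 connections is abnormal.
port80_item = MonitoringItem(
    monitored_object="port80_connections",
    ip_address="192.168.1.15",
    is_abnormal=lambda connections: connections < 5,
)

# The monitoring configuration information is then a collection of such items.
monitoring_config = [port80_item]
```

Expressing the abnormal condition as a predicate keeps default and user-defined monitoring items in the same structure.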

S20: the monitoring server receives the internal state of the distributed file storage cluster sent by the monitoring client at regular time, wherein the monitoring client is pre-deployed on a node server corresponding to a monitoring node of the distributed file storage cluster, and the internal state of the distributed file storage cluster is obtained from the node server corresponding to the monitoring node at regular time by the monitoring client.

In the embodiment of the invention, the monitoring node of the distributed file storage cluster refers to a node server of the distributed file storage cluster, which is used for collecting the internal state of the distributed file storage cluster.

Preferably, when the distributed file storage cluster is specifically a CEPH cluster, the monitoring node is a MON node of the CEPH cluster. The MON node stores a cluster view of the CEPH cluster state, which includes real-time map information for all servers of the CEPH cluster. Before reading or writing data, the CEPH cluster sends a request to the MON node to obtain the latest map, and the storage position of the data is calculated according to the map so that the corresponding read or write operation can be performed.

Specifically, a monitoring client is deployed in advance in a node server corresponding to a monitoring node of the distributed file storage cluster, the monitoring client uses a preset communication script to actively acquire an internal state of the distributed file storage cluster, wherein the communication script is a shell script edited in advance, and the preset communication script is used for acquiring the internal state of the distributed file storage cluster and transmitting the acquired internal state to the monitoring server.

Preferably, when the distributed file storage cluster is specifically a CEPH cluster, the monitoring client uses internal CEPH commands in the preset communication script to obtain the internal state of the CEPH cluster from the node server corresponding to the monitoring node, for example the commands "ceph -s", "ceph pg stat", or "ceph osd dump". The command "ceph -s" views the state of the cluster, the command "ceph pg stat" views the state of the PGs, and the command "ceph osd dump" views the state of the OSDs. A PG (placement group) is a set of stored data in the CEPH cluster that is used to group data logically.
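The collection step can be sketched as follows. The text describes a shell script; this Python version is only a stand-in that runs the same three commands. The names `STATE_COMMANDS`, `run_command`, and `collect_internal_state` are assumptions for the example, and the command runner is injectable so the sketch can be exercised on a machine without CEPH installed.

```python
import subprocess
from typing import Callable, Dict, List

# The three commands named in the text, keyed by the aspect of the state they report.
STATE_COMMANDS: Dict[str, List[str]] = {
    "cluster": ["ceph", "-s"],
    "pg": ["ceph", "pg", "stat"],
    "osd": ["ceph", "osd", "dump"],
}


def run_command(argv: List[str]) -> str:
    """Run one command on the monitoring node and return its standard output."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect_internal_state(run: Callable[[List[str]], str] = run_command) -> Dict[str, str]:
    """Collect the raw output of every state command, keyed by aspect name."""
    return {name: run(argv) for name, argv in STATE_COMMANDS.items()}
```

Calling `collect_internal_state()` with the default runner requires the `ceph` command-line tool and cluster credentials on the node; passing a fake runner allows the logic to be checked in isolation.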

Specifically, the internal state of the distributed file storage cluster acquired by the monitoring client is the monitoring data, which the monitoring client sends to the monitoring server through the network in the form of a message. The monitoring client forms the message by arranging the monitoring data according to a preset format. A message is a data unit exchanged and transmitted in the network; it contains the complete data information to be sent and is not limited in length, so the data information can be transmitted in a single message. The preset format can be set according to actual needs and is not limited here.
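Because the preset format is left open, the following sketch assumes JSON purely for illustration. The field names `node_ip`, `timestamp`, and `state` and the function names are hypothetical.

```python
import json
import time
from typing import Any, Dict, Optional


def build_message(node_ip: str, state: Dict[str, Any],
                  timestamp: Optional[float] = None) -> bytes:
    """Arrange the monitoring data in an assumed JSON format and encode it for sending."""
    payload = {
        "node_ip": node_ip,
        "timestamp": time.time() if timestamp is None else timestamp,
        "state": state,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def parse_message(message: bytes) -> Dict[str, Any]:
    """Decode a message produced by build_message, as the monitoring server would."""
    return json.loads(message.decode("utf-8"))
```

A single message carries the whole collected state, which matches the description of sending the data information at one time.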

Preferably, when the distributed file storage cluster is specifically a CEPH cluster, the monitoring client sets a timing task that periodically uses the preset communication script to obtain the internal state of the CEPH cluster and upload it to the monitoring server. The timing task may be set according to application requirements. For example, the Crontab command, which is a command for setting instructions to be executed periodically, may be used to configure a corresponding configuration file, and the entry "/3 x/etc/zabbix/descriptions/CEPH-status.192.168.1.15 CEPH mon >> etc/zabbix/descriptions/CEPH-status.log" is written into the configuration file. The monitoring client executes the configuration file, so that the preset communication script is run at regular intervals to obtain the data of the internal state of the CEPH cluster and send it to the monitoring server.

S30: the monitoring server performs statistical analysis on the internal state of the distributed file storage cluster according to the monitoring configuration information to obtain real-time monitoring data of the monitoring item.

Specifically, the monitoring server receives a message sent by the monitoring client, analyzes the message, reads monitoring data in the message, and accordingly obtains monitoring data of the internal state of the distributed file storage cluster, performs statistical analysis on the internal state of the distributed file storage cluster according to preset monitoring configuration information, and obtains monitoring data corresponding to monitoring objects of monitoring items to obtain real-time monitoring data of each monitoring item.
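The statistical analysis is not specified in detail. One simple reading is that the monitoring server keeps the most recent reading for each configured monitoring object, as sketched below. The sketch assumes that each parsed message is a dictionary with hypothetical fields `node_ip`, `timestamp`, and `state`, and that `state` already holds numeric readings keyed by monitoring object name.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (node IP address, monitoring object name)


def latest_readings(messages: Iterable[dict], monitored: List[Key]) -> Dict[Key, float]:
    """Return the most recent reading for each configured (IP, object) pair."""
    wanted = set(monitored)
    latest: Dict[Key, Tuple[float, float]] = {}  # key -> (timestamp, value)
    for message in messages:
        for obj, value in message["state"].items():
            key = (message["node_ip"], obj)
            if key not in wanted:
                continue  # reading does not belong to a configured monitoring item
            if key not in latest or message["timestamp"] >= latest[key][0]:
                latest[key] = (message["timestamp"], float(value))
    return {key: value for key, (_, value) in latest.items()}
```

Readings for objects that are not configured as monitoring items are ignored, so the result contains only the real-time monitoring data of the configured items.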

S40: if the real-time monitoring data of the monitoring item accords with the abnormal condition, the monitoring server determines the monitoring item as an abnormal object, takes the real-time monitoring data as the abnormal data, and generates an abnormal problem according to the abnormal object and the abnormal data.

In the embodiment of the invention, after the monitoring server obtains the real-time monitoring data of each monitoring item, it compares the real-time monitoring data of the monitoring item with the abnormal condition preset for that monitoring item to judge whether the data meets the condition.

Specifically, if the real-time monitoring data of the monitoring item meets the abnormal condition, the real-time monitoring data is in an abnormal state and the abnormal object needs maintenance. The monitoring server determines the monitoring object of the monitoring item as the abnormal object, the IP address of the monitoring object as the abnormal address, and the real-time monitoring data as the abnormal data. The monitoring server then generates an abnormal problem according to the abnormal object and the abnormal data. The abnormal problem describes the specific object in the distributed file storage cluster that has the problem and the specific abnormal data, so that operation and maintenance personnel can rapidly locate the problem of the distributed file storage cluster.

It can be understood that if the real-time monitoring data of the monitoring item does not meet the abnormal condition, the real-time monitoring data of the monitoring item is in a normal state, and the monitoring object of the monitoring item can normally operate without maintenance.
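The comparison in step S40 can be sketched as a single check that returns an abnormal problem when the abnormal condition is met and nothing otherwise. The names `AbnormalProblem` and `check_item` are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AbnormalProblem:
    """Describes which object is abnormal, where it is, and the data that triggered it."""
    abnormal_object: str
    abnormal_address: str
    abnormal_data: float


def check_item(monitored_object: str, ip_address: str, value: float,
               is_abnormal: Callable[[float], bool]) -> Optional[AbnormalProblem]:
    """Return an AbnormalProblem if the reading meets the abnormal condition, else None."""
    if is_abnormal(value):
        return AbnormalProblem(monitored_object, ip_address, value)
    return None  # normal state: no maintenance needed
```

For the port 80 example, a reading of 3 with the condition "less than 5" yields a problem, while a reading of 7 yields `None`.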

S50: the monitoring server generates an abnormal repair instruction according to the abnormal problem and sends the abnormal repair instruction to a central server of the distributed file storage cluster.

In the embodiment of the invention, the distributed file storage cluster comprises a central server and a node server, wherein the central server is a central management server that performs management operations such as resource management, performance maintenance, and monitoring configuration on the node server, and the node server is a server that performs operations such as data processing and data storage on objects, blocks, or files.

Specifically, for a monitoring item that meets an abnormal condition, the monitoring server generates a corresponding abnormal repair instruction according to the abnormal problem, wherein the abnormal repair instruction comprises a command requesting maintenance, the abnormal address, the abnormal object, and the abnormal data. The monitoring server sends the abnormal repair instruction to the central server of the distributed file storage cluster to request that the distributed file storage cluster perform abnormality maintenance.
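A minimal sketch of an abnormal repair instruction carrying the four parts listed above. The JSON encoding, the field names, and the command string `REQUEST_MAINTENANCE` are assumptions for the example; the text does not fix a wire format.

```python
import json


def build_repair_instruction(abnormal_address: str, abnormal_object: str,
                             abnormal_data: float) -> str:
    """Serialize a repair instruction: maintenance command plus address, object, and data."""
    instruction = {
        "command": "REQUEST_MAINTENANCE",  # hypothetical command name
        "abnormal_address": abnormal_address,
        "abnormal_object": abnormal_object,
        "abnormal_data": abnormal_data,
    }
    return json.dumps(instruction)
```

The central server can recover all four parts by decoding the string with `json.loads`.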

S60: if the central server receives the abnormal repair instruction, the abnormal repair instruction is analyzed, and a corresponding abnormal repair scheme is called according to the analysis result to repair the abnormal problem.

Specifically, if the central server receives the abnormal repair instruction, the central server analyzes the abnormal repair instruction, obtains an abnormal address, an abnormal object and abnormal data carried by the abnormal repair instruction, and determines an abnormal problem according to the abnormal object and the abnormal data.

The central server calls a corresponding abnormality repair scheme according to the abnormality problem, and maintains the abnormality problem of the server corresponding to the abnormality address in the distributed file storage cluster, wherein the abnormality repair scheme is a preset repair scheme according to some common abnormality conditions in the distributed file storage cluster, so that the central server can use the preset abnormality repair scheme to intelligently repair the abnormality problem in the distributed file storage cluster in time.

For example, a monitoring item monitors the disk usage of the server A1 in the distributed file storage cluster, and the abnormal condition corresponding to the monitoring item is that the disk capacity usage of the server A1 exceeds 95%. If, in the monitoring data acquired by the monitoring server, the disk capacity usage of the server A1 exceeds 95%, the monitoring server determines the address of the server A1 as the abnormal address, the disk capacity of the server A1 as the abnormal object, and the measured usage value as the abnormal data. The monitoring server generates an abnormal problem according to the abnormal object and the abnormal data, generates an abnormal repair instruction according to the abnormal problem, and sends the abnormal repair instruction to the central server of the distributed file storage cluster to request that the central server maintain the disk of the server A1. The central server acquires a corresponding abnormal repair scheme according to the abnormal repair instruction, for example a scheme that cleans cached log files or compresses history files, and uses this scheme to maintain the disk of the server A1 and repair the abnormal problem of the distributed file storage cluster.

In this embodiment, the monitoring server receives the monitoring configuration information configured by a user on the monitoring platform for the distributed file storage cluster. The pre-deployed monitoring client periodically obtains the internal state of the distributed file storage cluster and uploads it to the monitoring server, and the monitoring server performs statistical analysis on the internal state to obtain real-time monitoring data of the monitoring item. The monitoring server can therefore monitor the distributed file storage cluster in real time, and the monitoring items can be customized. Meanwhile, if the real-time monitoring data of the monitoring item accords with the abnormal condition, the monitoring server generates a corresponding abnormal problem, generates an abnormal repair instruction according to the abnormal problem, and sends the abnormal repair instruction to the central server of the distributed file storage cluster. After receiving the abnormal repair instruction, the central server analyzes it and invokes a corresponding abnormal repair scheme according to the analysis result to repair the abnormal problem, so that the distributed file storage cluster can operate normally. This improves the operation and maintenance efficiency and the intelligent management level of the distributed file storage cluster.
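The dispatch performed by the central server in step S60 can be sketched as a lookup from abnormal object to a preset repair scheme. Everything here is illustrative: the JSON instruction format, the registry `REPAIR_SCHEMES`, and the placeholder function `clean_disk`, which only returns a description instead of touching any disk.

```python
import json
from typing import Callable, Dict

RepairScheme = Callable[[str, float], str]


def clean_disk(address: str, abnormal_data: float) -> str:
    """Placeholder scheme; a real one would clean cached logs or compress history files."""
    return f"cleaned cached log files on {address}"


# Preset repair schemes, keyed by abnormal object (hypothetical key name).
REPAIR_SCHEMES: Dict[str, RepairScheme] = {"disk_capacity": clean_disk}


def handle_repair_instruction(raw: str,
                              schemes: Dict[str, RepairScheme] = REPAIR_SCHEMES) -> str:
    """Parse the instruction and call the repair scheme registered for its abnormal object."""
    instruction = json.loads(raw)
    abnormal_object = instruction["abnormal_object"]
    scheme = schemes.get(abnormal_object)
    if scheme is None:
        return f"no repair scheme registered for {abnormal_object}"
    return scheme(instruction["abnormal_address"], instruction["abnormal_data"])
```

Keeping the schemes in a registry means a new common abnormal condition can be covered by adding one entry, without changing the dispatch logic.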

In an embodiment, after step S30, that is, after the monitoring server performs statistical analysis on the internal state of the distributed file storage cluster according to the monitoring configuration information to obtain the real-time monitoring data of the monitoring item, the monitoring server in the monitoring method of the distributed file storage cluster may further output the real-time monitoring data according to a preset output template, which is described in detail as follows:

As shown in fig. 3, after step S30, the method for monitoring a distributed file storage cluster further includes the following steps:

S31: the monitoring server fills the preset output template with the real-time monitoring data to obtain target data.

Specifically, after the monitoring configuration information is configured on the monitoring platform, the user allocates a corresponding output template to each monitoring item, wherein the output template is a preset template used for outputting the monitoring data. After the monitoring server performs statistical analysis on the internal state of the distributed file storage cluster and obtains the real-time monitoring data of a monitoring item, it fills the preset output template with the real-time monitoring data to obtain the target data displayed in the output template. The preset output template may be a sample template provided by the monitoring platform or a custom template added by the user, for example a template that displays data as a graph, text, or a report; the specific display form can be set according to actual needs and is not limited here.

S32: the monitoring server outputs the target data to the monitoring platform so that a user can check the real-time state of the distributed file storage cluster through the monitoring platform.

Specifically, the monitoring server outputs the target data to the monitoring platform, which displays the real-time state of the distributed file storage cluster to the user. Target data that meets an abnormal condition is displayed on the monitoring platform marked in red or enlarged, so that it stands out from monitoring items in the normal state and the user can quickly identify the abnormal monitoring items in the output target data.
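For a text-form template, the filling step can be sketched with the Python standard library class `string.Template`. The template wording and the placeholder names `ip`, `item`, `value`, and `status` are assumptions for the example.

```python
from string import Template

# Hypothetical text-form output template for one monitoring item.
OUTPUT_TEMPLATE = Template("$ip $item: $value ($status)")


def fill_template(ip: str, item: str, value: float, abnormal: bool) -> str:
    """Fill the output template with real-time monitoring data to produce target data."""
    return OUTPUT_TEMPLATE.substitute(
        ip=ip,
        item=item,
        value=value,
        status="ABNORMAL" if abnormal else "normal",
    )
```

The `status` field gives the monitoring platform a hook for highlighting abnormal entries when it renders the target data.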

S33: the monitoring server stores the target data into a preset historical database.

Specifically, the preset history database is a database used for storing target data in the monitoring server, the monitoring server stores the target data in the preset history database so that a user can check the historical state data of the distributed file storage cluster, wherein the preset history database can be an Oracle database or a MongoDB database, and the like, the specific database type can be selected according to actual needs, and the method is not limited.
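The storage step can be sketched as follows. The text names Oracle and MongoDB as possible databases; this sketch substitutes the SQLite module from the Python standard library only so that the example is self-contained. The table name `history` and its columns are assumptions.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the history database and create the table for target data if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history "
        "(ts REAL, ip TEXT, item TEXT, value REAL, abnormal INTEGER)"
    )
    return conn


def store_target_data(conn: sqlite3.Connection, ts: float, ip: str,
                      item: str, value: float, abnormal: bool) -> None:
    """Append one record of target data to the history table."""
    conn.execute(
        "INSERT INTO history VALUES (?, ?, ?, ?, ?)",
        (ts, ip, item, value, int(abnormal)),
    )
    conn.commit()
```

Storing the timestamp and the abnormal flag with each record is what makes later analysis over a time window possible.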

S34: the monitoring server analyzes the running state of the distributed file storage cluster according to the target data in the historical database to obtain an analysis result, so that a user maintains the distributed file storage cluster according to the analysis result.

Specifically, the monitoring server analyzes the running state of the distributed file storage cluster by analyzing the target data within 1 day, 1 week, and 1 month. The analysis result includes the abnormal monitoring items, the time periods in which each abnormal monitoring item was abnormal, and the total duration of the abnormality. The user can optimize and maintain the distributed file storage cluster according to the analysis result. For example, if the disk capacity of the node server A2 has had 6 abnormal problems within one week, the user can expand the disk capacity of the node server A2 according to the analysis result, so as to increase the storage capacity of the node server A2 and improve the performance of the distributed file storage cluster.
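One part of this analysis, counting how often each monitoring item was abnormal within a time window, can be sketched as follows. The record layout, the names `WINDOWS` and `abnormal_counts`, and the approximation of one month as 30 days are assumptions for the example.

```python
from collections import Counter
from typing import Iterable, Tuple

DAY = 86400.0  # seconds
# Analysis windows from the text; one month is approximated as 30 days.
WINDOWS = {"1 day": DAY, "1 week": 7 * DAY, "1 month": 30 * DAY}

Record = Tuple[float, str, str, bool]  # (timestamp, server, item, abnormal)


def abnormal_counts(records: Iterable[Record], now: float,
                    window_seconds: float) -> Counter:
    """Count abnormal records per (server, item) within the window ending at `now`."""
    counts: Counter = Counter()
    for ts, server, item, abnormal in records:
        if abnormal and now - window_seconds <= ts <= now:
            counts[(server, item)] += 1
    return counts
```

With six abnormal disk capacity records for the node server A2 in the past week, `abnormal_counts(records, now, WINDOWS["1 week"])` reports 6 for `("A2", "disk_capacity")`, which corresponds to the capacity expansion example.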

In this embodiment, the monitoring server fills the real-time monitoring data according to the preset output template to obtain the target data, and outputs the target data to the monitoring platform, so that the user can check the real-time state of the distributed file storage cluster through the monitoring platform, quickly acquire abnormal monitoring items in the target data, discover abnormal problems of the distributed file storage cluster in time, store the target data in a preset historical database, analyze the running state of the distributed file storage cluster according to the target data in the historical database, and obtain an analysis result, so that the user optimizes and maintains the distributed file storage cluster according to the analysis result, thereby improving the performance of the distributed file storage cluster.

In an embodiment, after step S40, that is, after the monitoring server determines that the real-time monitoring data of the monitoring item meets the abnormal condition, determines the monitoring item as an abnormal object, takes the real-time monitoring data as the abnormal data, and generates an abnormal problem according to the abnormal object and the abnormal data, the monitoring server may further generate alarm information and send it to a preset alarm address, as described in detail below:

As shown in fig. 4, after step S40, the method for monitoring a distributed file storage cluster further includes the following steps:

S41: The monitoring server determines the severity of the abnormal problem according to the preset service attribute.

In this embodiment, the severity of the abnormal problem includes four levels of "warning," "general severity," "severity," and "disaster," and the preset service attribute is preset according to the service function of the monitoring object of the monitoring item in the distributed file storage cluster, and the monitoring server determines the severity of the abnormal problem according to the preset service attribute after monitoring the abnormal problem.

For example, suppose the monitoring item monitors the number of OSD service states in the distributed file storage cluster. If the real-time monitoring data of this item meets the abnormal condition, the severity of the abnormal problem is the "disaster" level, which indicates that the problem must be solved immediately; otherwise, the distributed file storage cluster will crash.

As another example, suppose the monitoring item monitors the number of connections on port 80 of the server A3. When the real-time monitoring data of this item meets the abnormal condition, the monitoring server can determine, according to the preset service attribute, that the severity of the abnormal problem is the "warning" level. The specific severity of an abnormal problem can be determined according to the service function of the monitoring object in the distributed file storage cluster.
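One way to picture the preset service attribute is as a lookup table from monitoring item to severity level. The following Python sketch assumes this. The item keys, the `SERVICE_ATTRIBUTES` table, and the fallback level for unlisted items are illustrative assumptions, not part of this description.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four severity levels named in this embodiment."""
    WARNING = 1
    GENERAL = 2
    SEVERE = 3
    DISASTER = 4


# Hypothetical preset service attributes: monitoring item -> severity.
SERVICE_ATTRIBUTES = {
    "osd_service_state_count": Severity.DISASTER,
    "A3.port80_connections": Severity.WARNING,
}


def severity_of(item, default=Severity.GENERAL):
    """Look up the severity of an abnormal problem on `item`.

    The fallback for unlisted items is an assumption of this sketch.
    """
    return SERVICE_ATTRIBUTES.get(item, default)
```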

S42: the monitoring server generates alarm information according to the abnormal problems and a preset format, and selects an alarm sending mode corresponding to the severity of the abnormal problems.

Specifically, the preset format may be, for example, a monitoring report or an alarm letter, but is not limited to these, and can be set according to the needs of the practical application. The monitoring server fills the content of the abnormal problem into the preset format to generate the alarm information and selects the alarm sending mode corresponding to the severity of the abnormal problem. The alarm sending mode is a way of sending alarm information that is preset for each severity, and the specific modes can likewise be set according to the needs of the practical application.

For example, if the severity of the abnormal problem is the "disaster" level, the corresponding alarm sending mode is: the monitoring server keeps sending the alarm information at the monitoring frequency of the preset monitoring item until the monitoring data of that item returns to a normal state, so as to prompt the relevant personnel to maintain the distributed file storage cluster.

If the severity of the abnormal problem is the "warning" level, the corresponding alarm sending mode is: the monitoring server sends the same alarm information only once, and sends it again only if the alarm reappears after the monitoring data of the preset monitoring item has returned to a normal state.
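The two sending modes described above can be sketched in Python as follows. The `AlarmSender` class and its method names are hypothetical. The description specifies modes only for the "disaster" and "warning" levels, so this sketch assumes that every other level is treated like "warning".

```python
class AlarmSender:
    """Chooses between the two alarm sending modes (illustrative only)."""

    def __init__(self, send):
        self._send = send     # callable(message), e.g. a mail or SMS client
        self._active = set()  # items already alerted in the current episode

    def on_sample(self, item, severity, abnormal, message):
        """Handle one monitoring cycle; return True if an alarm was sent."""
        if not abnormal:
            # Data is back to normal: a later alarm counts as a new episode.
            self._active.discard(item)
            return False
        if severity == "disaster":
            # Disaster: resend on every monitoring cycle until normal.
            self._send(message)
            return True
        if item in self._active:
            # Already sent once in this episode: stay quiet.
            return False
        self._active.add(item)
        self._send(message)
        return True
```

Calling `on_sample` once per monitoring cycle makes the resend interval of a "disaster" alarm equal to the monitoring frequency of the item, as described above.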

S43: the monitoring server sends the alarm information to a preset alarm address according to an alarm sending mode.

Specifically, the monitoring server acquires a preset alarm address and sends the alarm information to it according to the corresponding alarm sending mode. The preset alarm address is the receiving address of the alarm information and includes, but is not limited to, a mailbox address, a Jabber address, and a short message address, where Jabber (the original name of the XMPP protocol) is an instant messaging service commonly run on Linux systems.

In this embodiment, the severity of the abnormal problem is determined by the monitoring server according to the preset service attribute, and meanwhile, alarm information is generated according to the preset format according to the abnormal problem, an alarm transmission mode corresponding to the severity of the abnormal problem is selected, the alarm information is transmitted to a preset alarm address according to the alarm transmission mode, and different transmission modes are adopted to inform operation staff aiming at different abnormal problems, so that the operation staff can adopt a corresponding maintenance mode according to the alarm information, and the maintenance efficiency of the distributed file storage cluster is improved.

In an embodiment, a specific implementation of step S60 is described in detail, in which the central server, on receiving the abnormal repair instruction, parses it and calls a corresponding abnormal repair scheme according to the parsing result to repair the abnormal problem.

Referring to fig. 5, fig. 5 shows a specific flowchart of step S60, which is described in detail below:

S601: The central server receives the abnormality repair instruction and, according to it, determines the abnormality problem and the node server having the abnormality problem.

Specifically, the central server receives the abnormal repair instruction sent by the monitoring server, analyzes the abnormal repair instruction to obtain an abnormal address, an abnormal object and abnormal data, and accordingly determines an abnormal problem and a node server with the abnormal problem.
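The parsing in step S601 might look like the following Python sketch. The description does not specify a wire format for the abnormal repair instruction, so the JSON encoding, the field names, and the `RepairInstruction` type are assumptions made purely for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairInstruction:
    """Parsed abnormal repair instruction (hypothetical layout)."""
    address: str  # abnormal address, assumed to locate the faulty node server
    obj: str      # abnormal object, i.e. the monitoring item
    data: float   # abnormal data, i.e. the offending real-time value


def parse_instruction(raw):
    """Decode a JSON-encoded instruction; the encoding is an assumption."""
    fields = json.loads(raw)
    return RepairInstruction(
        address=fields["abnormal_address"],
        obj=fields["abnormal_object"],
        data=float(fields["abnormal_data"]),
    )
```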

S602: the central server searches an abnormal repair scheme corresponding to the abnormal problem and the priority level of each abnormal repair scheme from a preset abnormal repair scheme library according to the abnormal problem.

Specifically, the command requesting maintenance carried by the exception repair instruction triggers the central server's management and maintenance operation on the node server. The central server then searches the preset exception repair scheme library for the exception repair schemes corresponding to the exception problem. The exception repair schemes are preset for common exception conditions in the distributed file storage cluster. Each scheme is assigned a priority level according to its repair effect, and the priority level is stored in the exception repair scheme library, which is a database for storing the exception repair schemes.

S603: The central server sequentially acquires each abnormal repair scheme according to the order of the priority levels of the abnormal repair schemes from high to low to repair the abnormal problem of the node server, until the real-time monitoring data of the monitoring item does not meet the abnormal condition or every abnormal repair scheme has been called.

Specifically, the central server sequentially acquires each abnormal repair scheme to repair the abnormal problem of the node server according to the order of the priority levels of the abnormal repair schemes from high to low, for example, a first repair scheme, a second repair scheme, a third repair scheme … and the like, until the real-time monitoring data of the monitoring item does not meet the abnormal condition or each abnormal repair scheme is called.

For example, suppose a monitoring item monitors the disk usage of the node server A4, and the corresponding abnormal condition is that the disk usage of the server A4 exceeds 95%. If the monitoring data obtained by the monitoring server shows that the disk usage of the server A4 exceeds 95%, the monitoring server sends an abnormal repair instruction to the central server so that the central server of the distributed file storage cluster maintains the disk of the server A4. According to the abnormal repair instruction, the central server obtains all the corresponding abnormal repair schemes from the abnormal repair scheme library and repairs the server A4 in order of priority level. It first calls the first repair scheme to clean the cached log files on the server A4. If the disk usage of the server A4 still exceeds 95% after that, it calls the second repair scheme to compress the history files on the server A4, and so on, until the disk usage of the server A4 returns to a normal state or every abnormal repair scheme has been called once.
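The priority-ordered repair loop of step S603 can be sketched in Python as follows. The function name, the `(rank, name, action)` tuple layout, and the convention that rank 1 is the highest priority are assumptions of this sketch. The actual schemes would be loaded from the abnormal repair scheme library.

```python
def run_repairs(schemes, is_abnormal):
    """Apply repair schemes from highest to lowest priority.

    schemes: iterable of (rank, name, action); rank 1 is the highest priority.
    is_abnormal: callable that returns True while the monitoring item
        still meets its abnormal condition.
    Returns (repaired, names_of_schemes_called).
    """
    called = []
    for _rank, name, action in sorted(schemes, key=lambda s: s[0]):
        if not is_abnormal():
            break  # monitoring data is back to normal, stop early
        action()
        called.append(name)
    return (not is_abnormal(), called)
```

In the disk example above, `is_abnormal` would check whether the disk usage of the server A4 still exceeds 95%, and the first two entries would be the log-cleaning and history-compression schemes.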

In this embodiment, an exception repair instruction is received through a central server, and according to an exception problem, an exception repair scheme corresponding to the exception problem and a priority level of each exception repair scheme are searched from a preset exception repair scheme library, and according to a sequence from high priority to low priority of the exception repair scheme, each exception repair scheme is sequentially obtained to repair the exception problem of a node server, and for the same exception problem, a plurality of exception repair schemes can repair the problem, so that the repair rate of the exception problem is improved, and by confirming the priority level of the exception repair scheme, the problem repair is performed by using the exception repair scheme with a better effect, so that the maintenance efficiency of the distributed file storage cluster can be improved.

In an embodiment, after step S60, that is, after the central server receives the exception repair instruction, parses it, and calls a corresponding exception repair scheme according to the parsing result to repair the exception problem, the central server may further send the repair result to a preset instant messaging address, as described in detail below:

As shown in fig. 6, after step S60, the method for monitoring a distributed file storage cluster further includes the following steps:

S61: The central server detects the monitoring item of the repaired node server to obtain a repair result.

Specifically, after repairing the problem with the exception repair schemes, the central server checks the monitoring item of the node server on which the exception occurred. If the real-time monitoring data of the monitoring item after the repair does not meet the corresponding exception condition, the repair result is success; if it still meets the exception condition, the repair result is failure, which means that the monitoring item is still abnormal.

Further, if the repair result is successful, the central server sends the abnormal problem and the repair result to a preset working communication address, so that the operation and maintenance personnel can know the repair record of the distributed file storage cluster, and further optimize and maintain the distributed file storage cluster, wherein the preset working communication address is an information receiving address for the operation and maintenance personnel to process the working event, and the preset working communication address comprises but is not limited to a public mailbox address, a personal mailbox address, a short message address and other communication addresses.

S62: if the repair result is failure, the central server sends the abnormal problem and the repair result to a preset instant messaging address, so that operation and maintenance personnel can manually maintain the distributed file storage cluster in time according to the abnormal problem.

Specifically, a repair result of failure indicates that the abnormal problem cannot be solved by the abnormal repair schemes in the abnormal repair scheme library. In that case, or if no abnormal repair scheme corresponding to the abnormal repair instruction exists in the central server, the central server sends the abnormal problem and the repair result to a preset instant messaging address. The preset instant messaging address is an information receiving address used by operation and maintenance personnel to handle emergencies, and includes, but is not limited to, an IM (instant messaging) address, a WeChat address, an ICQ address, and the like. Operation and maintenance personnel can thus learn the internal state of the distributed file storage cluster in time and manually fix the abnormal problem, preventing problems in the distributed file storage cluster from causing data loss.
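The routing of the repair result in steps S61 and S62 can be summarized in a short Python sketch. The channel labels `"work"` and `"instant"`, the message layout, and the `notify` callback are hypothetical stand-ins for the preset working communication address and the preset instant messaging address.

```python
def route_result(problem, repaired, notify):
    """Send the repair result to the address that matches its outcome.

    notify: callable(channel, message) supplied by the caller.
    Returns the channel that was used.
    """
    # Success goes to the routine working address, failure to the
    # instant messaging address used for emergencies.
    channel = "work" if repaired else "instant"
    result = "success" if repaired else "failure"
    notify(channel, {"problem": problem, "result": result})
    return channel
```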

In this embodiment, the monitoring items after repairing the node server are detected by the central server to obtain a repairing result, if the repairing result is failure, the central server sends the abnormal problem and the repairing result to a preset instant messaging address, and timely informs the operation and maintenance personnel of the abnormal problem, so that the operation and maintenance personnel can manually maintain the distributed file storage cluster in time according to the abnormal problem, maintain the health state of the distributed file storage cluster, and avoid data loss.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.

In an embodiment, a monitoring system of a distributed file storage cluster is provided, which corresponds one-to-one to the monitoring method of the distributed file storage cluster in the foregoing embodiments. As shown in fig. 7, the monitoring system of the distributed file storage cluster includes a monitoring server and a central server, wherein the monitoring server includes a monitoring configuration module 71, a data receiving module 72, a data analysis module 73, an abnormality confirmation module 74 and an abnormality notification module 75, and the central server includes an abnormality repair module 76. Each functional module is described in detail as follows:

the monitoring server includes:

the monitoring configuration module 71 is configured to receive monitoring configuration information of the distributed file storage cluster sent by the monitoring platform, where the monitoring configuration information includes a monitoring item and an abnormal condition;

the data receiving module 72 is configured to receive an internal state of the distributed file storage cluster sent by the monitoring client at regular time, where the monitoring client is pre-deployed on a node server corresponding to a monitoring node of the distributed file storage cluster, and the internal state of the distributed file storage cluster is obtained from the node server corresponding to the monitoring node at regular time by the monitoring client;

The data analysis module 73 is configured to perform statistical analysis on the internal state of the distributed file storage cluster according to the monitoring configuration information, so as to obtain real-time monitoring data of the monitoring item;

the anomaly confirmation module 74 is configured to determine the monitored item as an anomaly object if the real-time monitored data of the monitored item meets an anomaly condition, take the real-time monitored data as anomaly data, and generate an anomaly problem according to the anomaly object and the anomaly data;

an anomaly notification module 75 for generating an anomaly repair instruction according to the anomaly problem and transmitting the anomaly repair instruction to a central server of the distributed file storage cluster;

the center server includes:

the exception repair module 76 is configured to parse the exception repair instruction if the exception repair instruction is received, and invoke a corresponding exception repair scheme to repair the exception problem according to the parsing result.

Further, the monitoring server further includes:

the data filling module is used for filling the real-time monitoring data into a preset output template to obtain target data;

the data output module is used for outputting target data to the monitoring platform so that a user can check the real-time state of the distributed file storage cluster through the monitoring platform;

The data storage module is used for storing the target data into a preset historical database;

and the data statistics module is used for analyzing the running state of the distributed file storage cluster according to the target data in the historical database to obtain an analysis result so that a user maintains the distributed file storage cluster according to the analysis result.

Further, the monitoring server further includes:

the abnormal grade confirming module is used for determining the severity of the abnormal problem according to the preset service attribute;

the alarm information generation module is used for generating alarm information according to the abnormal problems and a preset format, and selecting an alarm sending mode corresponding to the severity of the abnormal problems;

and the alarm information sending module is used for sending the alarm information to a preset alarm address according to an alarm sending mode.

Further, the anomaly repair module 76 of the center server includes:

the abnormality analysis sub-module is used for receiving the abnormality repair instruction and determining, according to the abnormality repair instruction, the abnormality problem and the node server having the abnormality problem;

the scheme acquisition sub-module is used for searching an abnormal repair scheme corresponding to the abnormal problem and the priority level of each abnormal repair scheme from a preset abnormal repair scheme library according to the abnormal problem;

The abnormal repair sub-module is used for sequentially acquiring each abnormal repair scheme to repair the abnormal problem of the node server according to the order of the priority levels of the abnormal repair schemes from high to low until the real-time monitoring data of the monitoring item does not accord with the abnormal condition or each abnormal repair scheme is called.

Further, the center server further includes:

the project detection module is used for detecting the monitored project after the node server is repaired to obtain a repairing result;

and the information sending module is used for sending the abnormal problems and the repair results to a preset instant messaging address if the repair results are failure, so that operation and maintenance personnel can manually maintain the distributed file storage clusters in time according to the abnormal problems.

For specific limitations on the monitoring system of the distributed file storage cluster, reference may be made to the above limitations on the monitoring method of the distributed file storage cluster, which are not repeated here. The modules in the monitoring system of the distributed file storage cluster may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for monitoring a distributed file storage cluster.

In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement steps in the method for monitoring a distributed file storage cluster according to the foregoing embodiment, for example, steps S10 to S60 shown in fig. 2.

In an embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the method for monitoring a distributed file storage cluster in the above embodiment, or which when executed by a processor implements the functions of the modules of the monitoring system for a distributed file storage cluster in the above embodiment. In order to avoid repetition, a description thereof is omitted.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous-link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the system is divided into different functional units or modules to perform all or part of the above-described functions.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims

1. The monitoring method of the distributed file storage cluster is characterized by comprising the following steps of:

2. The method for monitoring a distributed file storage cluster according to claim 1, wherein after the monitoring server performs statistical analysis on the internal state of the distributed file storage cluster according to the monitoring configuration information to obtain real-time monitoring data of the monitoring item, the method for monitoring the distributed file storage cluster further comprises:

the monitoring server fills the real-time monitoring data into a preset output template to obtain target data;

the monitoring server outputs the target data to the monitoring platform so that a user can check the real-time state of the distributed file storage cluster through the monitoring platform;

the monitoring server stores the target data into a preset historical database;

and the monitoring server analyzes the running state of the distributed file storage cluster according to the target data in the historical database to obtain an analysis result, so that the user maintains the distributed file storage cluster according to the analysis result.

3. The method according to claim 1, wherein after the monitoring server determines the monitoring item as an abnormal object if the real-time monitoring data of the monitoring item meets the abnormal condition, takes the real-time monitoring data as abnormal data, and generates an abnormal problem according to the abnormal object and the abnormal data, the method further comprises:

The monitoring server determines the severity of the abnormal problem according to a preset service attribute;

the monitoring server generates alarm information according to the abnormal problems and a preset format, and selects an alarm sending mode corresponding to the severity of the abnormal problems;

and the monitoring server sends the alarm information to a preset alarm address according to the alarm sending mode.

4. The method for monitoring the distributed file storage cluster according to claim 1, wherein if the central server receives the exception repair instruction, resolving the exception repair instruction, and invoking a corresponding exception repair scheme to repair the exception problem according to a resolving result comprises:

the central server receives the abnormal repair instruction and determines, according to the abnormal repair instruction, the abnormal problem and the node server having the abnormal problem;

the central server searches an abnormal repair scheme corresponding to the abnormal problem and a priority level of each abnormal repair scheme from a preset abnormal repair scheme library according to the abnormal problem;

and the central server sequentially acquires each abnormal repair scheme to repair the abnormal problem of the node server according to the order of the priority levels of the abnormal repair schemes from high to low until the real-time monitoring data of the monitoring item does not accord with the abnormal condition or each abnormal repair scheme is called.

5. The method for monitoring a distributed file storage cluster according to claim 4, wherein after the central server analyzes the exception repair instruction if receiving the exception repair instruction and invokes a corresponding exception repair scheme to repair the exception problem according to the analysis result, the method for monitoring a distributed file storage cluster further comprises:

the central server detects the monitored items after the node server is repaired to obtain a repairing result;

if the repair result is failure, the central server sends the abnormal problem and the repair result to a preset instant messaging address, so that operation and maintenance personnel can manually maintain the distributed file storage cluster in time according to the abnormal problem.

6. The monitoring system of the distributed file storage cluster is characterized by comprising a monitoring server and a central server, wherein the monitoring server and the central server are connected through a network;

the monitoring server includes:

the center server includes:

7. The monitoring system of a distributed file storage cluster of claim 6, wherein the monitoring server further comprises:

the data output module is used for outputting the target data to the monitoring platform so that a user can check the real-time state of the distributed file storage cluster through the monitoring platform;

and the data statistics module is used for analyzing the running state of the distributed file storage cluster according to the target data in the historical database to obtain an analysis result, so that the user maintains the distributed file storage cluster according to the analysis result.

8. The monitoring system of a distributed file storage cluster of claim 6, wherein in the central server, the anomaly repair module comprises:

the abnormality analysis sub-module is used for receiving the abnormality repair instruction and determining, according to the abnormality repair instruction, the abnormality problem and the node server having the abnormality problem;

The scheme obtaining sub-module is used for searching an abnormal repair scheme corresponding to the abnormal problem and the priority level of each abnormal repair scheme from a preset abnormal repair scheme library according to the abnormal problem;

and the abnormality repair sub-module is used for sequentially acquiring each abnormality repair scheme to repair the abnormal problem of the node server according to the order of the priority level of the abnormality repair scheme from high to low until the real-time monitoring data of the monitoring item does not accord with the abnormal condition or each abnormality repair scheme is called.

9. Computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method for monitoring a distributed file storage cluster according to any of claims 1 to 5 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method for monitoring a distributed file storage cluster according to any of claims 1 to 5.