
CN115442223A - Automatic operation and maintenance method for distributed cluster

Info

Publication number: CN115442223A
Application number: CN202210846889.XA
Authority: CN
Inventors: 程俊; 李文飞
Original assignee: Write Easy Network Technology Shanghai Co ltd
Current assignee: Write Easy Network Technology Shanghai Co ltd
Priority date: 2022-07-19
Filing date: 2022-07-19
Publication date: 2022-12-06

Abstract

An automated operation and maintenance method for distributed clusters, comprising the following steps: host management, in which all hosts of the distributed cluster are identified, their resources are monitored, and early warnings are issued for abnormal hosts; process management, in which processes are located and identified through several identification procedures so that process state, disk reads and writes, and CPU and memory usage can be monitored; port management, in which the open state of host ports in the distributed cluster is monitored and early warnings are issued when a port changes; service management, in which the API interfaces and services of the distributed cluster are monitored; log monitoring, in which program runtime exceptions are searched for from inside the server; and message early warning, in which early-warning messages are delivered through multiple channels so that administrators are notified promptly. The invention overcomes the shortcomings of the prior art and performs well in practice for the management and automated operation and maintenance of large numbers of hosts in a distributed cluster.

Description

Automatic operation and maintenance method for distributed cluster

Technical Field

The invention relates to the technical field of big data operation and maintenance, in particular to an automatic operation and maintenance method for a distributed cluster.

Background

As data volumes keep growing, big data technology and distributed clusters are applied ever more widely. More internet users and more connected devices generate more data, so parallel computing on distributed clusters is increasingly common. However, managing and operating a distributed cluster raises many problems: large numbers of hosts cannot be managed in a unified way, the running state and resource usage of hosts cannot be monitored in real time, and there is no unified way to manage and maintain the massive number of processes, ports, services and logs running in the cluster. Left unsolved, these problems can cause great economic loss and serious consequences.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides an automated operation and maintenance method for distributed clusters that performs well in practice for the management and automated operation and maintenance of large numbers of hosts in a distributed cluster.

To achieve this purpose, the invention adopts the following technical solution:

An automated operation and maintenance method for distributed clusters, comprising the following steps:

Step S1: host management; identifying all hosts of the distributed cluster, monitoring their resources, and issuing early warnings for abnormal hosts;

Step S2: process management; locating and identifying processes through several identification procedures so as to monitor process state, disk reads and writes, and CPU and memory usage;

Step S3: port management; monitoring the open state of host ports in the distributed cluster and issuing early warnings when a port changes;

Step S4: service management; monitoring the API interfaces and services of the distributed cluster;

Step S5: log monitoring; searching for program runtime exceptions from inside the server;

Step S6: message early warning; delivering early-warning messages through multiple channels so that administrators are notified promptly.

Preferably, the step S1 specifically includes the following steps:

Step S11: host identification; every host of the distributed cluster collects host information and reports heartbeat information through an agent service, and the server side identifies all host information;

Step S12: CPU resource monitoring; identifying and monitoring CPU usage on the host, including the CPU usage of each program and service, ranking the programs and services with the highest CPU usage, and computing the real-time used and idle percentages of CPU resources;

Step S13: memory resource monitoring; identifying and monitoring memory usage on the host, including the memory usage of each program and service, ranking the programs and services with the highest memory usage, and computing the real-time used and idle percentages of memory resources;

Step S14: disk resource monitoring; monitoring the disk space usage of the host in real time, and detecting and warning of insufficient space by setting a threshold on the remaining disk space;

Step S15: network resource monitoring; monitoring the network bandwidth usage of the host in real time, including traffic statistics in both directions (data sent and data received), and determining the bandwidth utilization and congestion;

Step S16: host early warning; issuing an early warning and notifying administrators when a host in the host list of the distributed cluster shows sustained excessive CPU usage, sustained excessive memory usage, insufficient remaining disk space, sustained network bandwidth congestion, or loss of network connectivity.
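By way of illustration only, the threshold checks of steps S14 and S16 can be sketched as follows. This is a minimal sketch and not the implementation of the invention: the function names `disk_space_low` and `sustained_breach`, the 10% free-space threshold and the three-sample window are assumptions chosen for the example.

```python
import shutil
from typing import Sequence


def disk_space_low(path: str, min_free_ratio: float = 0.10) -> bool:
    """Return True when free space on the disk holding `path` is below the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / usage.total < min_free_ratio


def sustained_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Return True when the last `window` samples all exceed `threshold`.

    Requiring several consecutive samples models "sustained" over-occupation
    and avoids warning on a single short spike.
    """
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


if __name__ == "__main__":
    cpu_percent_history = [35.0, 92.5, 95.1, 97.3]  # hypothetical samples
    if sustained_breach(cpu_percent_history, threshold=90.0, window=3):
        print("early warning: CPU usage continuously too high")
    if disk_space_low("/", min_free_ratio=0.10):
        print("early warning: insufficient remaining disk space")
```

The same `sustained_breach` check could be applied to memory usage or bandwidth utilization samples, with thresholds chosen per host.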

Preferably, the step S2 specifically includes the following steps:

Step S21: host selection; first selecting a target host in the distributed cluster and setting different process management strategies according to the service planning of the host;

Step S22: process ID identification; each process is assigned a unique ID when it starts, and the ID generally does not change unless the process is terminated and restarted; process ID identification finds the target process by this unique ID;

Step S23: process PID path identification; the target process can also be found by its PID path;

Step S24: process keyword identification; each process is started with a start command, and the target process can be found quickly by keywords in that command;

Step S25: process monitoring; monitoring the state and running condition of the process in real time;

Step S26: process early warning; if a monitored process suddenly becomes abnormal, promptly issuing an early warning and notifying administrators.

Preferably, the step S3 specifically includes the following steps:

Step S31: host selection; first selecting a target host in the distributed cluster and setting different port management strategies according to the service planning of the host;

Step S32: port identification; whether a specified service is running normally can be judged by whether its port is open; port identification looks up a port and queries its state from the port information that is entered;

Step S33: port monitoring; checking the state of the designated ports of the server;

Step S34: port early warning; if a monitored port suddenly becomes abnormal, promptly issuing an early warning and notifying administrators.
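As a hedged illustration of step S32, one common way to query whether a port is open is to attempt a TCP connection to it. The sketch below assumes TCP ports and uses a hypothetical helper name, `port_is_open`; the 2-second timeout is likewise an assumption.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    try:
        # create_connection raises OSError on refusal, timeout or unreachable host
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A closed result for a port that a service is expected to listen on would then be treated as a sign that the service is not running normally.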

Preferably, the step S4 specifically includes the following steps:

Step S41: host selection; first selecting a target host in the distributed cluster and setting different service management strategies according to the service planning of the host;

Step S42: service interface; probing the service by configuring its access address;

Step S43: service response; the service response comprises two parts, the return value of the response and the returned content of the response; rules for judging the normal and abnormal states of the service interface are configured on the service response;

Step S44: service monitoring; checking the state of the services running on the server;

Step S45: service early warning; if a monitored service suddenly becomes abnormal, promptly issuing an early warning and notifying administrators.

Preferably, the step S5 specifically includes the following steps:

Step S51: host selection; first selecting a target host in the distributed cluster and setting different log monitoring strategies according to the service planning of the host;

Step S52: log configuration; specifying the absolute path of a log file, or specifying the folder that contains the log files and automatically selecting the latest log for reading;

Step S53: early-warning keywords; setting different keywords according to the behavior of each program, and performing exception search and fault judgment based on those keywords;

Step S54: log monitoring; checking the content of the list of log files specified on the server, for example for the appearance of early-warning keywords; log monitoring reads in real time, so new content is detected as soon as it is written to the log file;

Step S55: log early warning; if a monitored log suddenly shows an abnormality, promptly issuing an early warning and notifying administrators.
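The log configuration of step S52 accepts either a file path or a folder. A minimal sketch of how such an entry might be resolved is given below; it assumes that "latest" means most recently modified, and the function name `resolve_log_file` and the `*.log` pattern are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_log_file(configured: str, pattern: str = "*.log") -> Optional[Path]:
    """Resolve a log configuration entry to the file that should be read.

    If `configured` is a file, it is returned as-is. If it is a folder, the
    most recently modified file matching `pattern` is returned. If nothing
    matches, None is returned.
    """
    path = Path(configured)
    if path.is_file():
        return path
    if path.is_dir():
        candidates = [p for p in path.glob(pattern) if p.is_file()]
        if candidates:
            return max(candidates, key=lambda p: p.stat().st_mtime)
    return None
```

Selecting by modification time rather than by file name avoids having to parse each program's own naming scheme for time-stamped logs.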

Preferably, the step S6 specifically includes the following steps:

Step S61: early-warning configuration; configuring the various early-warning trigger conditions in advance to complete the setup of the trigger conditions;

Step S62: email early warning; configuring the sending mailbox, the target mailbox, the sending server and the authentication information for email alerts, and sending the early-warning email promptly when an early-warning condition is triggered;

Step S63: SMS early warning; configuring the mobile phone number that receives the message, the SMS sending server and the authentication information, and sending the early-warning message promptly through the SMS sending interface when an early-warning condition is triggered;

Step S64: telephone early warning; configuring the mobile phone number to call, the call-dialing server and the authentication information, and placing the early-warning call promptly through the call-dialing interface when an early-warning condition is triggered;

Step S65: early-warning log; a detailed early-warning log is kept for every early-warning operation.
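For the email channel of step S62, a sketch using Python's standard `smtplib` and `email` modules might look as follows. The function names, the subject format and the use of STARTTLS with a login are assumptions for illustration; the method itself does not prescribe a mail protocol or library.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender: str, recipient: str, trigger: str, detail: str) -> EmailMessage:
    """Compose an early-warning email for a triggered condition."""
    message = EmailMessage()
    message["From"] = sender        # sending mailbox
    message["To"] = recipient       # target mailbox
    message["Subject"] = f"[early warning] {trigger}"
    message.set_content(detail)
    return message


def send_alert_email(message: EmailMessage, server: str, port: int,
                     username: str, password: str) -> None:
    """Send the message through an SMTP server (assumed to support STARTTLS)."""
    with smtplib.SMTP(server, port, timeout=10) as smtp:
        smtp.starttls()                  # upgrade the connection to TLS
        smtp.login(username, password)   # authentication information
        smtp.send_message(message)
```

The SMS and telephone channels of steps S63 and S64 would follow the same shape, with the provider's sending or dialing interface taking the place of the SMTP call.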

The invention provides an automated operation and maintenance method for distributed clusters, with the following beneficial effects. It supports the management of massive numbers of hosts, closely monitoring the running condition and resource usage of every host in the cluster. It supports process monitoring, port monitoring, service monitoring and log monitoring, which covers most monitoring requirements of a distributed cluster. It also has a flexible early-warning mechanism: first, the trigger conditions are flexible, so early-warning triggers can be configured for hosts, processes, ports, services, logs and so on; second, the channels are flexible, so email, SMS, telephone and other channels can be selected to deliver early warnings promptly.

Drawings

In order to illustrate the present invention or the prior-art solutions more clearly, the drawings used in the description are briefly introduced below.

FIG. 1 is a block flow diagram of the steps of the present invention;

FIG. 2 is a block flow diagram of step S1 of the present invention;

FIG. 3 is a block flow diagram of step S2 of the present invention;

FIG. 4 is a block flow diagram of step S3 of the present invention;

FIG. 5 is a block flow diagram of step S4 of the present invention;

FIG. 6 is a block flow diagram of step S5 of the present invention;

FIG. 7 is a block diagram illustrating the flow of step S6 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.

As shown in FIGS. 1-7, an automated operation and maintenance method for distributed clusters includes the following steps:

Step S2: process management; processes are located and identified through several identification procedures in order to monitor process state, disk reads and writes, and CPU and memory usage;

Step S4: service management; the API interfaces and services of the distributed cluster are monitored to ensure the stable operation of the cluster's services as a whole;

Step S5: log monitoring; log monitoring is an important part of automated operation and maintenance for distributed clusters: it searches for program runtime exceptions from inside the server, so that operation and maintenance faults are discovered as early as possible and prevented in advance;

Step S6: message early warning; early-warning messages are delivered through multiple channels so that administrators can be notified promptly.

Specifically, step S1 (host management) includes the following steps:

Step S11: host identification; every host of the distributed cluster collects host information and reports heartbeat information through an agent service, and the server side identifies all host information; this includes recording the number of CPU cores, the total memory size and the total disk space of the host.

Step S15: network resource monitoring; monitoring the network bandwidth usage of the host in real time, including traffic statistics in both directions (data sent and data received), and determining the bandwidth utilization and congestion;
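The bandwidth utilization of step S15 can be derived from two readings of a traffic byte counter taken a known interval apart. The sketch below is illustrative only: the function names, the assumption that the counter does not reset between readings, and the 80% congestion threshold are not taken from the method itself.

```python
def bandwidth_utilization(bytes_before: int, bytes_after: int,
                          interval_seconds: float, link_bits_per_second: float) -> float:
    """Return the fraction of link capacity used in one direction over the interval."""
    if interval_seconds <= 0 or link_bits_per_second <= 0:
        raise ValueError("interval and link capacity must be positive")
    transferred_bits = (bytes_after - bytes_before) * 8  # counters are in bytes
    return transferred_bits / (interval_seconds * link_bits_per_second)


def is_congested(sent_ratio: float, received_ratio: float, threshold: float = 0.8) -> bool:
    """Treat the link as congested when either direction exceeds the threshold."""
    return max(sent_ratio, received_ratio) > threshold
```

For example, 125,000,000 bytes sent in 10 seconds on a 1 Gbit/s link is 10^9 bits out of a possible 10^10, a utilization of 0.1.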

Specifically, step S2 (process management) includes the following steps:

Step S22: process ID identification; each process is assigned a unique ID when it starts, and the ID does not change unless the process is terminated and restarted; process ID identification can find the target process by this unique ID;

Step S23: process PID path identification; when a process starts, the system creates a folder named after its PID under /proc, and this folder holds the information of the process. The target process can therefore also be found by its PID path;

Step S24: process keyword identification; each process is started with a start command that contains the program name, the start-up parameters, the configuration file path and so on. The target process can be found quickly by keywords in this command;

Step S25: process monitoring; process monitoring tracks the state and running condition of the process in real time, including the process's total disk reads and writes, CPU usage percentage, memory usage percentage, and network data sent and received;

Step S26: process early warning; if a monitored process suddenly becomes abnormal, for example its CPU or memory usage keeps rising and exceeds a threshold, or the target process suddenly stops running, an early warning can be issued promptly and administrators notified.
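The PID-path and keyword identification of steps S23 and S24 can be illustrated on Linux, where the start command of a process is exposed in the `cmdline` file of its PID folder under /proc. The following is a sketch under that assumption; the function names and the `proc_root` parameter (which makes the code testable against a fake directory tree) are hypothetical.

```python
from pathlib import Path
from typing import Dict


def read_cmdline(pid: int, proc_root: str = "/proc") -> str:
    """Return the start command of a process from <proc_root>/<pid>/cmdline."""
    raw = (Path(proc_root) / str(pid) / "cmdline").read_bytes()
    # Arguments are separated by NUL bytes in the cmdline file.
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def find_by_keyword(keyword: str, proc_root: str = "/proc") -> Dict[int, str]:
    """Return {pid: start command} for every process whose command contains `keyword`."""
    matches: Dict[int, str] = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():  # only PID-named folders describe processes
            continue
        try:
            command = read_cmdline(int(entry.name), proc_root)
        except OSError:  # the process may exit between listing and reading
            continue
        if keyword in command:
            matches[int(entry.name)] = command
    return matches
```

A keyword such as a jar name or a configuration file path tends to be more stable across restarts than the PID, which is reassigned each time the process is started.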

Specifically, step S3 (port management) includes the following steps:

Step S32: port identification; a port is the bridge for data transmission and data exchange between the server and the outside, and whether a specified service is running normally can be judged by whether its port is open. Port identification looks up a port and queries its state from the port information that is entered (a number in the range 0-65535);

Step S33: port monitoring; port monitoring checks the state of the designated ports of the server, such as whether a port is open or closed and whether access to it is stable. Port monitoring runs at a preset frequency, for example once every 60 seconds or every 30 seconds;

Step S34: port early warning; if a monitored port suddenly becomes abnormal, for example the port closes or access to it becomes unstable, an early warning can be issued promptly and administrators notified.
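Detecting a port change, as required for the early warning of step S34, amounts to comparing the port states observed in two consecutive polls. A minimal sketch follows; the function name `port_changes` and the `{port: is_open}` dictionary representation are assumptions for the example.

```python
from typing import Dict, List, Tuple


def port_changes(previous: Dict[int, bool], current: Dict[int, bool]) -> List[Tuple[int, str]]:
    """Compare two polls of {port: is_open} and list the ports whose state changed."""
    changes: List[Tuple[int, str]] = []
    for port in sorted(current):
        was_open = previous.get(port)
        if was_open is None or was_open == current[port]:
            continue  # first observation or unchanged state: nothing to report
        changes.append((port, "opened" if current[port] else "closed"))
    return changes
```

A monitoring loop would collect a fresh `current` dictionary at the preset frequency (every 30 or 60 seconds), pass each reported change to the early-warning step, and then keep `current` as the next `previous`.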

Specifically, step S4 (service management) includes the following steps:

Step S42: service interface; the service interface probes a service through its configured access address, which may consist of the IP address and port of the target service or may be a detailed access link to a specific service;

Step S43: service response; the service response comprises two parts. One is the return value of the response, such as HTTP 200, 302, 404 or 500; the other is the returned content of the response, such as keywords expected in the title or body of a successful response. Rules for judging the normal and abnormal states of the service interface can be configured on the service response;

Step S44: service monitoring; service monitoring checks the state of the services running on the server, such as the return value and the returned content of the service response. Service monitoring runs at a preset frequency, for example once every 60 seconds or every 30 seconds;

Step S45: service early warning; if a monitored service suddenly becomes abnormal, for example the return value of the service response is 404 (not found) or 500 (internal error), or the response does not contain the expected keywords, an early warning can be issued promptly and administrators notified.
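The judgment rule of step S43 combines a check on the return value with a check on the returned content. The sketch below shows one possible shape for such a rule; the class and function names are hypothetical, and treating 200 and 302 as normal by default is an assumption rather than something the method specifies.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResponseRule:
    """Judgment rule for one service interface (illustrative names)."""
    ok_statuses: Tuple[int, ...] = (200, 302)    # assumed "normal" return values
    required_keywords: Tuple[str, ...] = ()      # keywords expected in the content


def judge_response(status: int, body: str, rule: ResponseRule) -> Tuple[bool, str]:
    """Return (is_normal, reason) for a service response under `rule`."""
    if status not in rule.ok_statuses:
        return False, f"unexpected return value: HTTP {status}"
    for keyword in rule.required_keywords:
        if keyword not in body:
            return False, f"expected keyword missing: {keyword!r}"
    return True, "ok"
```

Checking the content as well as the return value catches the case where a service answers with a normal status but serves an error page.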

Specifically, step S5 (log monitoring) includes the following steps:

Step S52: log configuration; each program typically has a log save path set when it starts. Some programs save logs under a fixed path and file name, while others save them in a folder with file names that include the running time. The log configuration can specify the absolute path of a log file, or the folder that contains the log files, in which case the latest log is selected automatically for reading;

Step S53: early-warning keywords; different early-warning keywords can be set according to the behavior of each program. For example, when a program is in an abnormal state it writes messages such as Warning, error or nullPointer to its log file, and exception search and fault judgment can be performed based on these keywords;

Step S54: log monitoring; log monitoring checks the content of the list of log files specified on the server, for example for the appearance of early-warning keywords. Log monitoring reads in real time, so new content is detected as soon as it is written to the log file;

Step S55: log early warning; if a monitored log suddenly shows an abnormality, for example the log file records a serious error or a critical alarm, an early warning can be issued promptly and administrators notified.
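Real-time reading as described in step S54 can be approximated by remembering the byte offset reached in the log file and reading only what has been appended since. The following sketch assumes UTF-8 logs with `\n` line endings and case-insensitive keyword matching; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Read lines appended to `path` since byte `offset`; return (lines, new_offset)."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)                 # jump to the end to learn the file size
        if handle.tell() < offset:        # file was truncated or rotated: start over
            offset = 0
        handle.seek(offset)
        data = handle.read()
    complete, newline, _ = data.rpartition(b"\n")  # leave a half-written last line
    if not newline:                       # no complete line has been written yet
        return [], offset
    lines = complete.decode("utf-8", errors="replace").split("\n")
    return lines, offset + len(complete) + 1


def matching_lines(lines: Sequence[str], keywords: Sequence[str]) -> List[str]:
    """Return the lines containing at least one early-warning keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [line for line in lines if any(k in line.lower() for k in lowered)]
```

Calling `read_new_lines` repeatedly with the offset it last returned yields each complete line exactly once, and any line returned by `matching_lines` would feed the log early warning of step S55.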

Specifically, step S6 (message early warning) includes the following steps:

Step S61: early-warning configuration; in the early-warning configuration stage, the various early-warning trigger conditions, such as host offline, process termination, port closure, abnormal service response and log error, are configured in advance to complete the setup of the trigger conditions;

Step S65: early-warning log; a detailed early-warning log is kept for every early-warning operation, including the condition that triggered the warning, the time of the trigger, the early-warning channel and the contact information of the notified administrator, for later review.
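The early-warning log of step S65 records four items per event: the trigger condition, the trigger time, the channel and the notified contact. One simple way to keep such a log, shown here only as a sketch, is one JSON object per line; the field names and function names are assumptions.

```python
import json
from datetime import datetime


def alert_log_record(trigger: str, channel: str, contact: str, triggered_at: datetime) -> str:
    """Serialize one early-warning event as a single JSON line."""
    record = {
        "trigger": trigger,                        # condition that fired
        "triggered_at": triggered_at.isoformat(),  # time of the trigger
        "channel": channel,                        # e.g. "email", "sms", "phone"
        "contact": contact,                        # notified administrator
    }
    return json.dumps(record, ensure_ascii=False)


def append_alert_log(path: str, line: str) -> None:
    """Append one record to the early-warning log file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

Because each line is an independent JSON object, the log can be filtered by channel or trigger during later review without loading the whole file.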

With this method, the invention supports the management of massive numbers of hosts, closely monitoring the running condition and resource usage of every host in the cluster. It supports process monitoring, port monitoring, service monitoring and log monitoring, which covers most monitoring requirements of a distributed cluster. It also has a flexible early-warning mechanism: first, the trigger conditions are flexible, so early-warning triggers can be configured for hosts, processes, ports, services, logs and so on; second, the channels are flexible, so email, SMS, telephone and other channels can be selected to deliver early warnings promptly.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An automated operation and maintenance method for distributed clusters, characterized in that the method comprises the following steps:

2. The automated operation and maintenance method for distributed clusters according to claim 1, wherein step S1 specifically includes the following steps:

Step S11: host identification; every host of the distributed cluster collects host information and reports heartbeat information through an agent service, and the server side identifies all host information;

3. The automated operation and maintenance method for distributed clusters according to claim 1, wherein step S2 specifically includes the following steps:

4. The automated operation and maintenance method for distributed clusters according to claim 1, wherein step S3 specifically includes the following steps:

Step S32: port identification; whether a specified service is running normally can be judged by whether its port is open; port identification looks up a port and queries its state from the port information that is entered;

5. The automated operation and maintenance method for distributed clusters according to claim 1, wherein step S4 specifically includes the following steps:

6. The automated operation and maintenance method for distributed clusters according to claim 1, wherein step S5 specifically includes the following steps:

Step S53: early-warning keywords; setting different keywords according to the behavior of each program, and performing exception search and fault judgment based on those keywords;

Step S54: log monitoring; checking the content of the list of log files specified on the server, for example for the appearance of early-warning keywords; log monitoring reads in real time, so new content is detected as soon as it is written to the log file;

7. The automated operation and maintenance method for distributed clusters according to claim 1, wherein step S6 specifically includes the following steps:

Step S65: early-warning log; a detailed early-warning log is kept for every early-warning operation.