CN115442223A - Automatic operation and maintenance method for distributed cluster - Google Patents

Publication number
CN115442223A
Authority
CN
China
Prior art keywords
early warning
monitoring
host
service
port
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210846889.XA
Other languages
Chinese (zh)
Inventor
程俊
李文飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Write Easy Network Technology Shanghai Co ltd
Original Assignee
Write Easy Network Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Write Easy Network Technology Shanghai Co ltd
Priority to CN202210846889.XA
Publication of CN115442223A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0681 Configuration of triggering conditions
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B25/00 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
    • G08B25/01 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B25/00 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
    • G08B25/01 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium
    • G08B25/08 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium using communication transmission lines
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B25/00 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
    • G08B25/01 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium
    • G08B25/10 Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium using wireless transmission systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer And Data Communications (AREA)

Abstract

An automated operation and maintenance method for distributed clusters, comprising the following steps: host management, which identifies all hosts of the distributed cluster, monitors their resources, and issues early warnings for abnormal hosts; process management, which finds and identifies processes through several identification procedures in order to monitor process state, disk reads and writes, and CPU and memory usage; port management, which monitors which ports are open on the hosts of the distributed cluster and issues early warnings when a port changes; service management, which monitors the API interfaces and services of the distributed cluster; log monitoring, which searches for program runtime exceptions from inside the server; and message early warning, which supports early-warning messages over multiple channels and notifies managers promptly. The invention overcomes the defects of the prior art and is effective in practice for managing and automatically operating and maintaining large numbers of hosts in a distributed cluster.

Description

Automatic operation and maintenance method for distributed cluster
Technical Field
The invention relates to the technical field of big data operation and maintenance, in particular to an automatic operation and maintenance method for a distributed cluster.
Background
As data volumes keep growing, big data technology and distributed clusters are used ever more widely. A growing number of internet users and internet devices generate ever more data, so parallel computing on distributed clusters is increasingly common. However, managing and operating a distributed cluster poses many problems: for example, large numbers of hosts cannot be managed in a unified way, the running state and resource usage of the hosts cannot be monitored in real time, and there is no unified way to manage and maintain the massive numbers of processes, ports, services and logs running in the cluster. Left unsolved, these problems can cause great economic loss and serious consequences.
Disclosure of Invention
To address the defects of the prior art, the invention provides an automated operation and maintenance method for a distributed cluster that is effective in practice for managing and automatically operating and maintaining a large number of hosts in the distributed cluster.
To achieve this purpose, the invention adopts the following technical solution:
an automated operation and maintenance method for distributed clusters, comprising the steps of:
step S1: host management; identifying all hosts of the distributed cluster, monitoring their resources, and issuing early warnings for abnormal hosts;
step S2: process management; finding and identifying processes through several identification procedures, so as to monitor process state, disk reads and writes, and CPU and memory usage;
step S3: port management; monitoring which ports are open on the hosts of the distributed cluster, and issuing early warnings when a port changes;
step S4: service management; monitoring the API interfaces and services of the distributed cluster;
step S5: log monitoring; searching for program runtime exceptions from inside the server;
step S6: message early warning; supporting early-warning messages over multiple channels and notifying managers promptly.
Preferably, step S1 specifically includes the following steps:
step S11: host identification; all hosts of the distributed cluster collect host information and report heartbeats through an agent service, and a server side identifies all host information;
step S12: CPU resource monitoring; identifying and monitoring CPU usage on the host, including the CPU usage of each program and service, ranking the programs and services with the highest CPU usage, and reporting the real-time used and idle percentages of the CPU;
step S13: memory resource monitoring; identifying and monitoring memory usage on the host, including the memory usage of each program and service, ranking the programs and services with the highest memory usage, and reporting the real-time used and idle percentages of memory;
step S14: disk resource monitoring; monitoring the disk space usage of the host in real time, and detecting and warning of insufficient space by means of a configured threshold on remaining disk space;
step S15: network resource monitoring; monitoring the network bandwidth usage of the host in real time, including traffic statistics in both directions (data sent and data received), and determining bandwidth utilization and congestion;
step S16: host early warning; issuing an early warning and notifying managers when any host in the host list of the distributed cluster shows sustained high CPU usage, sustained high memory usage, insufficient remaining disk space, sustained network congestion, or an unreachable network.
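The heartbeat reporting in step S11 can be illustrated with a minimal Python sketch. The patent does not specify a data model or a timeout, so the `HostInfo` fields, the `HostRegistry` class and the 90-second timeout below are all hypothetical choices made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HostInfo:
    """Host information an agent might report (field names are assumptions)."""
    hostname: str
    cpu_cores: int
    memory_bytes: int
    disk_bytes: int
    last_heartbeat: float = 0.0


class HostRegistry:
    """Server-side view of the cluster, built from agent heartbeats."""

    def __init__(self, timeout_seconds: float = 90.0) -> None:
        # Hypothetical default: a host is considered offline after 90 s of silence.
        self.timeout_seconds = timeout_seconds
        self.hosts: Dict[str, HostInfo] = {}

    def report(self, info: HostInfo, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected to make tests deterministic."""
        info.last_heartbeat = time.time() if now is None else now
        self.hosts[info.hostname] = info

    def offline_hosts(self, now: Optional[float] = None) -> List[str]:
        """Return the names of hosts whose last heartbeat is older than the timeout."""
        now = time.time() if now is None else now
        return sorted(
            h.hostname
            for h in self.hosts.values()
            if now - h.last_heartbeat > self.timeout_seconds
        )
```

In such a design the server never polls the hosts; it only compares the time of the last report against the timeout, which keeps the check cheap even for a large host list.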
Preferably, step S2 specifically includes the following steps:
step S21: host selection; first selecting a target host in the distributed cluster, and setting a process management strategy according to the service plan of that host;
step S22: process ID identification; each process is given a unique ID when it starts, and the ID generally does not change unless the process is terminated and restarted; process ID identification finds the target process by this unique ID;
step S23: process PID path identification; the target process can also be found by its PID path;
step S24: process keyword identification; each process is started with a start command, and the target process can be found quickly by keywords in that command;
step S25: process monitoring; monitoring the state and running condition of the process in real time;
step S26: process early warning; if a monitored process suddenly becomes abnormal, issuing an early warning promptly and notifying managers.
Preferably, step S3 specifically includes the following steps:
step S31: host selection; first selecting a target host in the distributed cluster, and setting a port management strategy according to the service plan of that host;
step S32: port identification; whether a port is open indicates whether the specified service is running normally; port identification looks up a port and queries its state from the specified port information;
step S33: port monitoring; checking the state of a designated port of the server;
step S34: port early warning; if a monitored port suddenly becomes abnormal, issuing an early warning promptly and notifying managers.
Preferably, step S4 specifically includes the following steps:
step S41: host selection; first selecting a target host in the distributed cluster, and setting a service management strategy according to the service plan of that host;
step S42: service interface; probing the service by configuring the access address of the service;
step S43: service response; a service response has two parts, the return value of the response and the returned content of the response; rules for judging the normal and abnormal states of the service interface are configured on these;
step S44: service monitoring; checking the state of the services running on the server;
step S45: service early warning; if a monitored service suddenly becomes abnormal, issuing an early warning promptly and notifying managers.
Preferably, step S5 specifically includes the following steps:
step S51: host selection; first selecting a target host in the distributed cluster, and setting a log monitoring strategy according to the service plan of that host;
step S52: log configuration; specifying the absolute path of a log file, or the folder in which the log files are located, and automatically picking the latest log for reading;
step S53: early warning keywords; setting keywords according to the way each program runs, and searching for exceptions and judging faults according to those keywords;
step S54: log monitoring; checking the content of the list of log files designated on the server, for example for the appearance of early warning keywords; log monitoring reads in real time, so new content is checked as soon as it is written to the log file;
step S55: log early warning; if a monitored log suddenly shows an abnormality, issuing an early warning promptly and notifying managers.
Preferably, step S6 specifically includes the following steps:
step S61: early warning configuration; configuring the various early warning trigger conditions in advance;
step S62: email early warning; configuring the sending mailbox, the target mailbox, the sending server and the authentication information for email alarms, so that an early warning email is sent promptly when an early warning condition is triggered;
step S63: short message early warning; configuring the mobile phone number that receives short messages, the short message sending server and the authentication information, so that an early warning short message is sent promptly through the short message sending interface when an early warning condition is triggered;
step S64: telephone early warning; configuring the mobile phone number for telephone early warning, the call dialing server and the authentication information, so that an early warning call is placed promptly through the call dialing interface when an early warning condition is triggered;
step S65: early warning logs; keeping a detailed early warning log of every early warning operation.
The invention provides an automated operation and maintenance method for a distributed cluster, with the following beneficial effects. It supports the management of massive numbers of hosts and closely monitors the running condition and resource usage of every host in the cluster. It supports process monitoring, port monitoring, service monitoring and log monitoring, which covers most monitoring needs of a distributed cluster. It also has a flexible early warning mechanism: first, the trigger conditions are flexible, so early warning trigger conditions can be configured for hosts, processes, ports, services, logs and so on; second, the channels are flexible, so email, short message, telephone and other channels can be chosen to deliver early warnings promptly.
Drawings
In order to more clearly illustrate the present invention or the prior art solutions, the drawings used in the description of the prior art will be briefly described below.
FIG. 1 is a block flow diagram of the steps of the present invention;
FIG. 2 is a block flow diagram of step S1 of the present invention;
FIG. 3 is a block flow diagram of step S2 of the present invention;
FIG. 4 is a block flow diagram of step S3 of the present invention;
FIG. 5 is a block flow diagram of step S4 of the present invention;
FIG. 6 is a block flow diagram of step S5 of the present invention;
FIG. 7 is a block diagram illustrating the flow of step S6 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in FIGS. 1-7, an automated operation and maintenance method for a distributed cluster includes the following steps:
step S1: host management; identifying all hosts of the distributed cluster, monitoring their resources, and issuing early warnings for abnormal hosts;
step S2: process management; finding and identifying processes through several identification procedures, so as to monitor process state, disk reads and writes, and CPU and memory usage;
step S3: port management; monitoring which ports are open on the hosts of the distributed cluster, and issuing early warnings when a port changes;
step S4: service management; monitoring the API interfaces and services of the distributed cluster to ensure stable operation of the services of the whole distributed cluster;
step S5: log monitoring; log monitoring is an important part of automated operation and maintenance for a distributed cluster: it searches for program runtime exceptions from inside the server, so that operation and maintenance faults are found as early as possible and prevented in advance;
step S6: message early warning; the automated operation and maintenance for the distributed cluster supports early-warning messages over multiple channels and notifies managers promptly.
Specifically, step S1 (host management) includes the following steps:
step S11: host identification; all hosts of the distributed cluster collect host information and report heartbeats through an agent service, and a server side identifies all host information, including the number of CPU cores, the total memory size and the total disk space of each host;
step S12: CPU resource monitoring; identifying and monitoring CPU usage on the host, including the CPU usage of each program and service, ranking the programs and services with the highest CPU usage, and reporting the real-time used and idle percentages of the CPU;
step S13: memory resource monitoring; identifying and monitoring memory usage on the host, including the memory usage of each program and service, ranking the programs and services with the highest memory usage, and reporting the real-time used and idle percentages of memory;
step S14: disk resource monitoring; monitoring the disk space usage of the host in real time, and detecting and warning of insufficient space by means of a configured threshold on remaining disk space;
step S15: network resource monitoring; monitoring the network bandwidth usage of the host in real time, including traffic statistics in both directions (data sent and data received), and determining bandwidth utilization and congestion;
step S16: host early warning; issuing an early warning and notifying managers when any host in the host list of the distributed cluster shows sustained high CPU usage, sustained high memory usage, insufficient remaining disk space, sustained network congestion, or an unreachable network.
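Steps S14 and S16 turn on two kinds of checks: a metric that stays above a limit over several consecutive samples, and remaining disk space that falls below a threshold. The Python sketch below shows one possible way to express both. The class name `SustainedThreshold`, the window size and the percentage limits are assumptions for illustration; the patent does not state how "sustained" is measured.

```python
import shutil
from collections import deque


class SustainedThreshold:
    """Fires only when a metric exceeds `limit` for `window` consecutive samples."""

    def __init__(self, limit: float, window: int) -> None:
        self.limit = limit
        # A bounded deque keeps only the most recent `window` comparison results.
        self._recent = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Feed one sample; return True when an early warning should be raised."""
        self._recent.append(value > self.limit)
        return len(self._recent) == self._recent.maxlen and all(self._recent)


def disk_free_percent(path: str = "/") -> float:
    """Percentage of free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_space_low(path: str, min_free_percent: float) -> bool:
    """True when the remaining space is below the configured threshold."""
    return disk_free_percent(path) < min_free_percent
```

For example, `SustainedThreshold(limit=90.0, window=3)` fed with CPU usage percentages stays silent on a single spike and fires only after three consecutive samples above 90.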
Specifically, step S2 (process management) includes the following steps:
step S21: host selection; first selecting a target host in the distributed cluster, and setting a process management strategy according to the service plan of that host;
step S22: process ID identification; each process is given a unique ID when it starts, and the ID does not change unless the process is terminated and restarted; process ID identification finds the target process by this unique ID;
step S23: process PID path identification; when a process starts, the system creates a folder named after its PID under /proc, and that folder holds the information about the process; the target process can therefore also be found by its PID path;
step S24: process keyword identification; each process is started with a start command, which contains the program name, the start parameters, the configuration file path and so on; the target process can be found quickly by keywords in that command;
step S25: process monitoring; process monitoring tracks the state and running condition of the process in real time, including the total amount of disk reads and writes by the process, its CPU usage percentage, its memory usage percentage, and the network data it sends and receives;
step S26: process early warning; if a monitored process suddenly becomes abnormal, for example its CPU or memory usage keeps rising and exceeds a threshold, or the target process suddenly stops running, an early warning is issued promptly and managers are notified.
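The PID-path and keyword identification in steps S23 and S24 can be sketched in Python as follows, assuming a Linux host where each process has a `/proc/<pid>/cmdline` file holding its start command. The function names and the `proc_root` parameter (which lets the lookup be pointed at another directory) are hypothetical and not taken from the patent.

```python
import os
from typing import Dict


def pid_exists(pid: int, proc_root: str = "/proc") -> bool:
    """PID-path identification: a live process has a folder named after its PID."""
    return os.path.isdir(os.path.join(proc_root, str(pid)))


def read_cmdline(pid: int, proc_root: str = "/proc") -> str:
    """Return the start command of `pid`, or "" if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            raw = f.read()
    except OSError:
        return ""
    # Arguments in the cmdline file are separated by NUL bytes.
    return raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()


def find_by_keyword(keyword: str, proc_root: str = "/proc") -> Dict[int, str]:
    """Keyword identification: map PID to start command for every match."""
    matches: Dict[int, str] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # skip non-process entries such as "self"
            continue
        command = read_cmdline(int(entry), proc_root)
        if keyword in command:
            matches[int(entry)] = command
    return matches
```

A process that disappears from the result of `find_by_keyword` between two checks, or whose PID folder no longer exists, corresponds to the "target process suddenly stops running" condition in step S26.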
Specifically, step S3 (port management) includes the following steps:
step S31: host selection; first selecting a target host in the distributed cluster, and setting a port management strategy according to the service plan of that host;
step S32: port identification; a port is the bridge for data transmission and exchange between the server and the outside world, and whether a port is open indicates whether the specified service is running normally. Port identification looks up a port and queries its state from the specified port information (value range 0-65535);
step S33: port monitoring; port monitoring checks the state of a designated port of the server, such as whether the port is open or closed and whether access to it is stable. Port monitoring runs at a preset frequency, for example once every 60 seconds or every 30 seconds;
step S34: port early warning; if a monitored port suddenly becomes abnormal, for example the port closes or access to it becomes unstable, an early warning is issued promptly and managers are notified.
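A port check of the kind described in steps S32 to S34 can be sketched in Python with a plain TCP connection attempt. This is only one possible probe (it covers TCP, not UDP), and the function names and the 2-second timeout are assumptions for illustration.

```python
import socket
from typing import Dict


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    if not 0 <= port <= 65535:
        raise ValueError("port must be in the range 0-65535")
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out or unreachable: treat the port as closed.
        return False


def port_changes(previous: Dict[int, bool], current: Dict[int, bool]) -> Dict[int, bool]:
    """Return the ports whose open/closed state differs from the previous check."""
    return {
        port: state
        for port, state in current.items()
        if previous.get(port) != state
    }
```

Run on a schedule (for example every 30 or 60 seconds, as in step S33), the result of `port_changes` is what would feed the port early warning: any entry with the value `False` is a port that has just closed.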
Specifically, step S4 (service management) includes the following steps:
step S41: host selection; first selecting a target host in the distributed cluster, and setting a service management strategy according to the service plan of that host;
step S42: service interface; the service interface is probed by configuring the access address of the service, which may consist of the IP address and port of the target service, or may be a detailed access link to a specific service;
step S43: service response; a service response has two parts: one is the return value of the response, such as the HTTP status codes 200, 302, 404 and 500; the other is the returned content of the response, such as keywords in the title or body of a successful response. Rules for judging the normal and abnormal states of the service interface can be configured on the service response;
step S44: service monitoring; service monitoring checks the state of the services running on the server, such as the return value and the returned content of the service response. Service monitoring runs at a preset frequency, for example once every 60 seconds or every 30 seconds;
step S45: service early warning; if a monitored service suddenly becomes abnormal, for example the return value is 404 (not found) or 500 (internal server error), or the response does not contain the expected keyword, an early warning is issued promptly and managers are notified.
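The judgment rule of step S43 combines a status code check with a keyword check on the returned content. The Python sketch below shows one way such a rule might look; `ResponseRule`, `judge_response` and `fetch` are hypothetical names, and the default of accepting only status 200 is an assumption.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResponseRule:
    """What counts as a normal response for one service interface."""
    ok_statuses: Tuple[int, ...] = (200,)
    required_keyword: str = ""  # empty string disables the content check


def judge_response(status: int, body: str, rule: ResponseRule) -> Tuple[bool, str]:
    """Apply the rule; return (is_normal, reason)."""
    if status not in rule.ok_statuses:
        return False, f"unexpected status {status}"
    if rule.required_keyword and rule.required_keyword not in body:
        return False, f"keyword {rule.required_keyword!r} missing"
    return True, "ok"


def fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a URL and return (status, body).

    urlopen follows redirects and raises HTTPError for 4xx/5xx responses, so
    those are converted back into a status code here. Connection failures
    (URLError, timeouts) are left for the caller to treat as abnormal.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")
```

Keeping `judge_response` separate from `fetch` means the rule can be checked against recorded responses without any network access.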
Specifically, step S5 (log monitoring) includes the following steps:
step S51: host selection; first selecting a target host in the distributed cluster, and setting a log monitoring strategy according to the service plan of that host;
step S52: log configuration; each program usually has a log path when it starts. Some programs write to a fixed path and file name; others write to a folder, using file names that incorporate the run time. The log configuration can specify the absolute path of a log file, or the folder in which the log files are located, and automatically pick the latest log for reading;
step S53: early warning keywords; the early warning keywords can be set according to the way each program runs. For example, when a program is in an abnormal state it writes messages such as Warning, Error or NullPointer to the log file, and exceptions can be searched for and faults judged according to these keywords;
step S54: log monitoring; log monitoring checks the content of the list of log files designated on the server, for example for the appearance of early warning keywords. Log monitoring reads in real time, so new content is checked as soon as it is written to the log file;
step S55: log early warning; if a monitored log suddenly shows an abnormality, for example the log file records a serious error or a critical alarm, an early warning is issued promptly and managers are notified.
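Steps S52 to S54 can be sketched in Python as two small functions: one picks the most recently modified file in a log folder, the other reads only the content written since the last check and returns the lines that contain an early warning keyword. The function names, the case-insensitive matching and the byte-offset bookkeeping are assumptions for illustration, not details given in the patent.

```python
import os
from typing import Iterable, List, Optional, Tuple


def latest_log(folder: str) -> Optional[str]:
    """Return the most recently modified file in `folder`, or None if it is empty."""
    paths = [os.path.join(folder, name) for name in os.listdir(folder)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime) if files else None


def scan_new_lines(path: str, offset: int, keywords: Iterable[str]) -> Tuple[int, List[str]]:
    """Scan content appended after byte `offset`; return (new_offset, matching_lines)."""
    lowered = [k.lower() for k in keywords]
    if os.path.getsize(path) < offset:
        offset = 0  # file was truncated or rotated: start again from the top
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume complete lines; a partial last line is picked up next time.
    end = data.rfind(b"\n") + 1
    hits: List[str] = []
    for line in data[:end].decode("utf-8", "replace").splitlines():
        if any(k in line.lower() for k in lowered):
            hits.append(line)
    return offset + end, hits
```

Calling `scan_new_lines` in a loop with the offset it returned last time approximates the real-time reading described in step S54 without re-reading the whole file.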
Specifically, step S6 (message early warning) includes the following steps:
step S61: early warning configuration; in the early warning configuration step, the various early warning trigger conditions are configured in advance, such as host offline, process termination, port closing, abnormal service response and log error;
step S62: email early warning; configuring the sending mailbox, the target mailbox, the sending server and the authentication information for email alarms, so that an early warning email is sent promptly when an early warning condition is triggered;
step S63: short message early warning; configuring the mobile phone number that receives short messages, the short message sending server and the authentication information, so that an early warning short message is sent promptly through the short message sending interface when an early warning condition is triggered;
step S64: telephone early warning; configuring the mobile phone number for telephone early warning, the call dialing server and the authentication information, so that an early warning call is placed promptly through the call dialing interface when an early warning condition is triggered;
step S65: early warning logs; a detailed early warning log is kept for every early warning operation, including the condition that triggered the early warning, the time it was triggered, the early warning channel and the contact information of the notified manager, for later review.
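The multi-channel notification and the early warning log of steps S62 to S65 can be sketched in Python as a small dispatcher. The channels are modelled as plain callables so that an email, short message or telephone gateway can be plugged in; `AlertDispatcher`, `AlertRecord` and the channel names are hypothetical, and no real gateway API is assumed.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AlertRecord:
    """One early warning log entry, with the fields listed in step S65."""
    condition: str
    triggered_at: float
    channel: str
    contact: str
    delivered: bool


@dataclass
class AlertDispatcher:
    # Maps a channel name to a function send(contact, message).
    channels: Dict[str, Callable[[str, str], None]] = field(default_factory=dict)
    log: List[AlertRecord] = field(default_factory=list)

    def register(self, name: str, send: Callable[[str, str], None]) -> None:
        self.channels[name] = send

    def notify(self, condition: str, message: str, contacts: Dict[str, str]) -> None:
        """Send `message` over each channel in `contacts` and log every attempt."""
        for channel, contact in contacts.items():
            send = self.channels.get(channel)
            delivered = False
            if send is not None:
                try:
                    send(contact, message)
                    delivered = True
                except Exception:
                    # A failing channel must not block the remaining channels.
                    delivered = False
            self.log.append(
                AlertRecord(condition, time.time(), channel, contact, delivered)
            )
```

Logging failed and unconfigured channels as well as successful ones keeps the early warning log usable for later review of who was, and was not, actually reached.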
With this method, the invention supports the management of massive numbers of hosts and closely monitors the running condition and resource usage of every host in the cluster. It supports process monitoring, port monitoring, service monitoring and log monitoring, which covers most monitoring needs of a distributed cluster. It also has a flexible early warning mechanism: first, the trigger conditions are flexible, so early warning trigger conditions can be configured for hosts, processes, ports, services, logs and so on; second, the channels are flexible, so email, short message, telephone and other channels can be chosen to deliver early warnings promptly.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. An automated operation and maintenance method for distributed clusters, characterized in that the method comprises the following steps:
step S1: host management; identifying all hosts of the distributed cluster, monitoring their resources, and issuing early warnings for abnormal hosts;
step S2: process management; finding and identifying processes through several identification procedures, so as to monitor process state, disk reads and writes, and CPU and memory usage;
step S3: port management; monitoring which ports are open on the hosts of the distributed cluster, and issuing early warnings when a port changes;
step S4: service management; monitoring the API interfaces and services of the distributed cluster;
step S5: log monitoring; searching for program runtime exceptions from inside the server;
step S6: message early warning; supporting early-warning messages over multiple channels and notifying managers promptly.
2. The automated operation and maintenance method for a distributed cluster according to claim 1, wherein step S1 specifically comprises the following steps:
step S11: host identification; every host of the distributed cluster collects host information and reports heartbeat information through an agent service, and the server side identifies all host information;
step S12: CPU resource monitoring; identifying and monitoring CPU usage on the host, including the CPU usage details of each program and service, ranking the programs and services with the highest CPU usage, and computing the real-time used and idle percentages of the CPU;
step S13: memory resource monitoring; identifying and monitoring memory usage on the host, including the memory usage details of each program and service, ranking the programs and services with the highest memory usage, and computing the real-time used and idle percentages of memory;
step S14: disk resource monitoring; monitoring the host's disk space usage in real time, and detecting and warning of insufficient space by setting a threshold on remaining disk space;
step S15: network resource monitoring; monitoring the host's network bandwidth usage in real time, including traffic statistics in both directions (data sent and data received), and determining the bandwidth utilization and congestion status;
step S16: host early warning; issuing an early warning and notifying administrators when a host in the host list of the distributed cluster shows sustained excessive CPU usage, sustained excessive memory usage, insufficient remaining disk space, sustained network bandwidth congestion, or network unreachability.
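The threshold logic of steps S14 and S16 can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `HostSample`, `Thresholds`, and `host_warnings`, the default threshold values, and the reading of "sustained" as "the last N consecutive samples" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSample:
    """One resource sample reported by a host agent (hypothetical shape)."""
    cpu_pct: float        # CPU in use, percent
    mem_pct: float        # memory in use, percent
    disk_free_pct: float  # remaining disk space, percent
    net_util_pct: float   # network bandwidth utilization, percent


@dataclass
class Thresholds:
    """Illustrative defaults; the patent does not specify any values."""
    cpu_pct: float = 90.0
    mem_pct: float = 90.0
    disk_free_pct: float = 10.0
    net_util_pct: float = 95.0
    sustained: int = 3    # consecutive samples that count as "sustained"


def host_warnings(samples: List[HostSample], t: Thresholds) -> List[str]:
    """Return the warning kinds raised by the most recent samples."""
    if not samples:
        return []
    recent = samples[-t.sustained:]
    full_window = len(recent) == t.sustained
    warnings = []
    # CPU, memory, and network warn only when every sample in the window is high.
    if full_window and all(s.cpu_pct >= t.cpu_pct for s in recent):
        warnings.append("cpu")
    if full_window and all(s.mem_pct >= t.mem_pct for s in recent):
        warnings.append("memory")
    # Disk warns as soon as the latest sample falls below the free-space threshold.
    if samples[-1].disk_free_pct <= t.disk_free_pct:
        warnings.append("disk")
    if full_window and all(s.net_util_pct >= t.net_util_pct for s in recent):
        warnings.append("network")
    return warnings
```

Requiring a full window of high samples is one way to avoid warning on a momentary spike; a time-based window would serve the same purpose.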
3. The automated operation and maintenance method for a distributed cluster according to claim 1, wherein step S2 specifically comprises the following steps:
step S21: host selection; first selecting a target host in the distributed cluster, and setting different process management strategies according to the service planning of the host;
step S22: process ID identification; each process is given a unique ID when it starts, and the ID generally does not change unless the process is terminated and restarted; process ID identification locates the target process by this unique ID;
step S23: process PID path identification; the target process can also be located by its PID path;
step S24: process keyword identification; each process has a start command when it is started, and the target process can be located quickly by process keywords;
step S25: process monitoring; monitoring the state and running condition of the process in real time;
step S26: process early warning; if a monitored process suddenly becomes abnormal, promptly issuing an early warning and notifying administrators.
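The three identification routes of steps S22 to S24 can be sketched in Python as follows. The sketch works on a snapshot of the process table (PID mapped to command line) rather than querying the operating system, and it interprets "PID path" as the path of a file containing the PID; both choices, and all function names, are assumptions rather than details from the patent.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical snapshot of running processes: PID -> full start command.
ProcessTable = Dict[int, str]


def find_by_pid(table: ProcessTable, pid: int) -> Optional[int]:
    """Step S22: locate the process by its unique ID."""
    return pid if pid in table else None


def find_by_pid_file(table: ProcessTable, pid_file: Path) -> Optional[int]:
    """Step S23 (as interpreted here): read the PID from a file, then look it up."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed PID file: treat as "not found".
        return None
    return find_by_pid(table, pid)


def find_by_keyword(table: ProcessTable, keyword: str) -> List[int]:
    """Step S24: locate processes whose start command contains the keyword."""
    return sorted(pid for pid, cmd in table.items() if keyword in cmd)
```

A monitored process that none of the configured routes can find in the latest snapshot would then be a candidate for the early warning of step S26.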
4. The automated operation and maintenance method for a distributed cluster according to claim 1, wherein step S3 specifically comprises the following steps:
step S31: host selection; first selecting a target host in the distributed cluster, and setting different port management strategies according to the service planning of the host;
step S32: port identification; whether a specified service is running normally can be judged by whether its port is open; port identification queries the state of a port by searching on the specified port information;
step S33: port monitoring; checking the state of the designated ports of the server;
step S34: port early warning; if a monitored port suddenly becomes abnormal, promptly issuing an early warning and notifying administrators.
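One common way to realize the port check of steps S32 and S33 is a TCP connection attempt, and the change detection of step S34 can be done by comparing two snapshots. The following Python sketch assumes that approach; the patent does not say how port state is obtained, and the names `is_port_open` and `port_changes` are illustrative.

```python
import socket
from typing import Dict


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def port_changes(previous: Dict[int, bool], current: Dict[int, bool]) -> Dict[int, bool]:
    """Return the ports whose open/closed state differs from the previous snapshot.

    Ports that appear only in the current snapshot are reported as changes too.
    """
    return {p: state for p, state in current.items() if previous.get(p) != state}
```

A monitoring loop would call `is_port_open` for each designated port, build the current snapshot, and raise a warning for every entry that `port_changes` returns.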
5. The automated operation and maintenance method for a distributed cluster according to claim 1, wherein step S4 specifically comprises the following steps:
step S41: host selection; first selecting a target host in the distributed cluster, and setting different port management strategies according to the service planning of the host;
step S42: service interface; detecting the service by configuring the access address of the service;
step S43: service response; the service response comprises two parts, the return value of the response and the returned content of the response; judgment rules for the normal and abnormal states of the service interface are configured;
step S44: service monitoring; checking the state of the services running on the server;
step S45: service early warning; if a monitored service suddenly becomes abnormal, promptly issuing an early warning and notifying administrators.
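The two-part judgment of step S43 (return value plus returned content) can be sketched in Python as below. The sketch assumes the service is reached over HTTP, that the "return value" is the HTTP status code, and that the content rule is a substring match; `ServiceRule`, `is_healthy`, and `probe` are hypothetical names, not part of the patent.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceRule:
    """Configured access address and judgment rule for one service (illustrative)."""
    url: str
    expected_status: int = 200
    expected_substring: Optional[str] = None  # None means "do not check content"


def is_healthy(status: int, body: str, rule: ServiceRule) -> bool:
    """Apply the rule: the status must match and, if configured, the body must contain the substring."""
    if status != rule.expected_status:
        return False
    if rule.expected_substring is not None and rule.expected_substring not in body:
        return False
    return True


def probe(rule: ServiceRule, timeout: float = 3.0) -> bool:
    """Request the configured address once and judge the response."""
    try:
        with urllib.request.urlopen(rule.url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, rule)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; judge them by status alone.
        return is_healthy(err.code, "", rule)
    except OSError:
        # Connection failure or timeout: the service is treated as abnormal.
        return False
```

Keeping `is_healthy` separate from the network call lets the judgment rules be checked without a live service.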
6. The automated operation and maintenance method for a distributed cluster according to claim 1, wherein step S5 specifically comprises the following steps:
step S51: host selection; first selecting a target host in the distributed cluster, and setting different port management strategies according to the service planning of the host;
step S52: log configuration; specifying the absolute path of a log file, or specifying the folder containing the log files, and automatically selecting the latest log for reading;
step S53: early-warning keywords; setting different keywords according to the operating rules of each program, and performing exception search and fault judgment according to the keywords;
step S54: log monitoring; checking the content of the log files specified on the server, for example whether an early-warning keyword appears in the log content; log monitoring reads in real time, so new data is monitored as soon as it is written to the log file;
step S55: log early warning; if a monitored log suddenly shows an abnormality, promptly issuing an early warning and notifying administrators.
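The incremental reading and keyword matching of steps S53 and S54 can be sketched in Python as follows. Tracking a byte offset between polls is an assumption (the patent says only that reading is real-time), as are the function names and the handling of truncated files.

```python
from pathlib import Path
from typing import List, Tuple


def read_new_lines(path: Path, offset: int) -> Tuple[List[str], int]:
    """Read complete lines appended since `offset`; return them with the new offset."""
    with path.open("rb") as fh:
        fh.seek(0, 2)              # jump to the end to learn the current size
        size = fh.tell()
        if size < offset:          # file was truncated or rotated: start over
            offset = 0
        fh.seek(offset)
        data = fh.read()
    # Consume only up to the last newline so a half-written line is read next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def match_keywords(lines: List[str], keywords: List[str]) -> List[str]:
    """Return the lines that contain at least one early-warning keyword."""
    return [line for line in lines if any(k in line for k in keywords)]
```

A poller would keep the returned offset per log file, call `read_new_lines` on each tick, and pass any lines returned by `match_keywords` to the early-warning step.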
7. The automated operation and maintenance method for a distributed cluster according to claim 1, wherein step S6 specifically comprises the following steps:
step S61: early-warning configuration; setting the early-warning trigger conditions by configuring the various trigger conditions in advance;
step S62: email early warning; configuring the sending mailbox, target mailbox, sending server, and authentication information for email alerts, and sending the early-warning email promptly when an early-warning condition is triggered;
step S63: short message early warning; configuring the mobile phone number that receives the short message, the server that sends the short message, and the authentication information, and sending the early-warning short message promptly through the short message sending interface when an early-warning condition is triggered;
step S64: telephone early warning; configuring the mobile phone number for telephone early warning, the server that places the call, and the authentication information, and placing the early-warning call promptly through the call dialing interface when an early-warning condition is triggered;
step S65: early-warning logs; a detailed early-warning log is retained for every early-warning operation.
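The multi-channel dispatch of steps S62 to S65 can be sketched in Python as below. Each channel is modeled as a plain callable so that email, short message, and telephone senders can be plugged in; this interface, the names `Channel`, `build_mail`, and `dispatch`, and the list-based warning log are assumptions for illustration, since the patent does not describe the sending interfaces.

```python
from email.message import EmailMessage
from typing import Callable, Dict, List

# A channel takes (subject, body) and delivers it; failures raise an exception.
Channel = Callable[[str, str], None]


def build_mail(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Step S62: assemble the early-warning email from the configured mailboxes."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def dispatch(subject: str, body: str, channels: Dict[str, Channel], log: List[str]) -> None:
    """Send one warning through every configured channel and record each outcome (step S65)."""
    for name, send in channels.items():
        try:
            send(subject, body)
            log.append(f"{name}: sent '{subject}'")
        except Exception as exc:
            # One failing channel must not stop the others from being tried.
            log.append(f"{name}: failed '{subject}' ({exc})")
```

An email channel could wrap `build_mail` and hand the result to `smtplib.SMTP.send_message` using the configured server and authentication information; short message and telephone channels would wrap whatever gateway interface the deployment provides.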
CN202210846889.XA 2022-07-19 2022-07-19 Automatic operation and maintenance method for distributed cluster Pending CN115442223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210846889.XA CN115442223A (en) 2022-07-19 2022-07-19 Automatic operation and maintenance method for distributed cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210846889.XA CN115442223A (en) 2022-07-19 2022-07-19 Automatic operation and maintenance method for distributed cluster

Publications (1)

Publication Number Publication Date
CN115442223A true CN115442223A (en) 2022-12-06

Family

ID=84240868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210846889.XA Pending CN115442223A (en) 2022-07-19 2022-07-19 Automatic operation and maintenance method for distributed cluster

Country Status (1)

Country Link
CN (1) CN115442223A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104202212A (en) * 2014-08-28 2014-12-10 浪潮(北京)电子信息产业有限公司 System and method for obtaining distributed cluster system alarm
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
WO2017071134A1 (en) * 2015-10-28 2017-05-04 北京汇商融通信息技术有限公司 Distributed tracking system
CN113872795A (en) * 2021-08-20 2021-12-31 苏州浪潮智能科技有限公司 Intelligent monitoring analysis and fault processing system and method for distributed server
CN114389937A (en) * 2022-01-17 2022-04-22 徐皓原 Operation and maintenance monitoring and management system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
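刘殊: "基于Hadoop的分布式云监控平台系统的研究与设计" [Research and Design of a Hadoop-Based Distributed Cloud Monitoring Platform System], 电子设计工程 [Electronic Design Engineering], no. 15, 5 August 2016 (2016-08-05) *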

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination