CN111682976B

CN111682976B - Method for ensuring distributed multi-machine communication monitoring

Info

Publication number: CN111682976B
Application number: CN202010339696.6A
Authority: CN
Inventors: 朱之凯; 刘海峰
Original assignee: Hefei Zhongke Leinao Intelligent Technology Co ltd
Current assignee: Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority date: 2020-04-26
Filing date: 2020-04-26
Publication date: 2022-03-01
Anticipated expiration: 2040-04-26
Also published as: CN111682976A

Abstract

The invention discloses a method for ensuring distributed multi-machine communication monitoring, which comprises the following steps: first, communication detection code is deployed on each server in a distributed task system; second, a Prometheus Exporters software package is embedded in the communication detection code, and the communication variables calculated by the communication detection code on each server are read through the Exporters; then, the Exporters send the acquired communication variables to a Prometheus Server; finally, based on the communication variables, the Prometheus Server judges whether communication between the servers is normal. The method monitors the communication status of each server in the multitask distribution system more efficiently.

Description

Method for ensuring distributed multi-machine communication monitoring

Technical Field

The invention belongs to the field of communication, and particularly relates to a method for ensuring distributed multi-machine communication monitoring.

Background

When a user submits a distributed task, the task is scheduled onto different servers. A distributed task places high demands on communication between the servers; at a minimum, TCP (Transmission Control Protocol) communication between the servers must be ensured. However, communication between servers can fail, and when it does, the distributed task errors out, which degrades the user's service experience. The existing solution is to run a communication test on each server in the distributed task system and record the results in a related file, but this scheme is costly to maintain and inconvenient to manage.

In addition, Prometheus is an open-source monitoring system developed in the Go language. It mainly comprises a Prometheus Server (monitoring server), Client Libraries, Exporters (data acquisition programs), a Push Gateway, an Alertmanager (alarm management), a graphical interface and the like. The workflow of Prometheus is roughly as follows:

(1) The Prometheus Server periodically pulls metrics (indexes) from the configured Exporters or Client Libraries, collects metrics pushed to the Push Gateway, or pulls metrics by other means.

(2) After storing the collected metrics locally, the Prometheus Server evaluates the configured alert rules and pushes the resulting alerts to the Alertmanager.

(3) The Alertmanager processes the received alerts according to its configuration file and sends alert notifications such as emails and short messages.

However, Prometheus monitors parameters such as the CPU occupancy rate, the GPU occupancy rate and the docker container information of each server, and does not cover the communication status between servers.

Therefore, how to design a system capable of monitoring the communication condition of the server in the platform in real time becomes an urgent technical problem to be solved.

Disclosure of Invention

In view of the foregoing problems, an object of the present invention is to provide a method for ensuring distributed multi-machine communication monitoring, which is more efficient in monitoring communication status of each server in a multitask distribution system.

The invention aims to provide a method for ensuring distributed multi-machine communication monitoring, which comprises the following steps,

deploying communication detection codes on each server in the distributed task system;

embedding a software package of Prometheus Exporters in the communication detection codes, and reading communication variables calculated by the communication detection codes in each server through the Exporters;

the Exporters sends the acquired communication variables to a Prometheus Server;

and based on the communication variable, the Prometheus Server judges whether the communication between the servers is normal.

Further, the method may further comprise,

each server communicates through TCP, and in the communication process, any one of the servers can send or feed back txt files with own IP address names to other servers and receive txt files with own IP address names fed back or sent by other servers.

Further, the method includes setting a first communication variable to determine whether communication between the two servers is abnormal, wherein,

if the current server receives a txt file with an IP address name of the current server fed back by another server, TCP communication between the two servers is normal, and a first communication variable value is 0;

if the current server does not receive the txt file with the IP address name fed back by the other server, TCP communication between the two servers is abnormal, and the value of the first communication variable is 1.
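The rule for the first communication variable can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the parameter names and the sample IP addresses are hypothetical, and the set `peers_with_feedback` stands in for whatever mechanism records which servers have fed back a txt file.

```python
from __future__ import annotations


def first_communication_variable(peer_ip: str, peers_with_feedback: set[str]) -> int:
    """Return 0 if the peer fed back a txt file (TCP normal), otherwise 1."""
    return 0 if peer_ip in peers_with_feedback else 1


# Hypothetical peers of the current server and the subset that answered.
peers = ["10.0.0.2", "10.0.0.3"]
feedback = {"10.0.0.2"}

values = {ip: first_communication_variable(ip, feedback) for ip in peers}
print(values)
```

One value is kept per pair of servers, so a single silent peer marks only that link as abnormal rather than the whole current server.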

Further, the method further comprises setting a metric variable to determine whether the current server is abnormal when the TCP communication between the two servers is abnormal, wherein,

if the current server does not receive the txt file with the own IP address name fed back by all servers in other servers, the current server is abnormal, the value of the metric variable is 1, and otherwise, the value of the metric variable is 0.

Further, before the communication variables calculated by the communication detection code in each server are read through the Exporters, the method further comprises monitoring, in real time by the communication detection code in each server, the TCP communication with each of the other servers, wherein,

the current server sends txt files with own IP address names to other servers;

if the current server can receive the txt file with the own IP address name fed back by one of the other servers, the TCP communication verification of the current server and the server feeding back the txt file with the own IP address name is successful,

calculating a first communication variable between the current server and the server which feeds back the txt file with the IP address name of the current server by using the communication detection code in the current server to be 0;

if the current server does not receive the txt file with the own IP address name fed back by one of the other servers, the TCP communication check between the current server and the server which does not feed back the txt file with the own IP address name fails,

the communication detection code in the current server calculates that the first communication variable between the current server and the server which does not feed back the txt file with the own IP address name is 1.

Further, the real-time monitoring of the TCP communication between the communication detection code in each server and each other server further comprises,

checking the calculated first communication variable by a communication detection code in the current server;

if the first communication variable is 1, the communication detection code in the current server continuously traverses whether a server which does not feed back the txt file with the own IP address name exists or not, wherein,

if all servers in other servers do not feed back txt files with own IP address names, calculating a metric variable in the current server to be 1 by the communication detection codes in the current server;

if one or more servers feed back txt files with own IP address names in other servers, calculating a metric variable in the current server to be 0 by the communication detection code in the current server;

if the first communication variables are all 0, the communication detection code in the current server calculates that the metric variable in the current server is 0.

Further, the Prometheus Server judging, based on the communication variables, whether the communication between the servers is normal includes,

the Prometheus Server obtains the metric variable calculated by the communication detection code in all servers, wherein,

if the value of the metric variable of the server is 1, judging that the server is an abnormal server;

and the Prometheus Server pushes the alarm of the abnormal Server and/or stores the alarm into an abnormal database so as to isolate the abnormal Server.

Further, the method may further comprise,

each server acquires IP addresses of all other servers in the distributed task system;

each server checks the received txt file at a predetermined time for real-time monitoring of TCP communication with other respective servers by the communication detection code.

Further, the variable name of the first communication variable includes names of two servers.
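As an illustration of such a variable name, the sketch below joins the names of the two servers into one identifier. The `tcp_comm` prefix and the helper name are assumptions made here, not part of the method. The sanitizing step reflects the classic Prometheus metric-name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`, which does not allow the dots found in IP addresses.

```python
import re

# Classic Prometheus metric-name pattern.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def first_variable_name(server_a: str, server_b: str, prefix: str = "tcp_comm") -> str:
    """Build a variable name containing both server names (hypothetical scheme)."""
    raw = f"{prefix}_{server_a}_{server_b}"
    # Replace every character that is not allowed in a metric name.
    return re.sub(r"[^a-zA-Z0-9_:]", "_", raw)


name = first_variable_name("192.168.1.10", "192.168.1.11")
print(name)
```

The prefix also keeps the name from starting with a digit when the server names are IP addresses.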

The invention has the following technical effects: the method for ensuring distributed multi-machine communication monitoring deploys communication detection code on each server in the distributed task system and calculates the first communication variable and the metric variable during the TCP communication of each server, so that the communication status between the current server and the other servers can be obtained quickly, the detection efficiency for communication faults is improved, and the method is both real-time and fast. In addition, a Prometheus Exporters software package is introduced into each server in the multitask distribution system, and the Exporters and the Prometheus Server connect and interact, so that the communication status of each server in the multitask distribution system is monitored more efficiently.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below to the drawings required for the embodiments or the technical solutions in the prior art, and for those skilled in the art, other drawings can be obtained according to the drawings of the present invention without any creative effort.

Fig. 1 shows a flow diagram of a method of ensuring distributed multi-machine communication monitoring according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the embodiment of the present invention discloses a method for ensuring distributed multi-machine communication monitoring. The method includes: first, deploying communication detection code on each server in a distributed task system; second, embedding a Prometheus Exporters software package in the communication detection code, and reading the communication variables calculated by the communication detection code in each server through the Exporters; then, the Exporters sending the acquired communication variables to a Prometheus Server; and finally, based on the communication variables, the Prometheus Server judging whether the communication between the servers is normal. A Prometheus Exporters software package is introduced into each server in the multitask distribution system, and the Exporters and the Prometheus Server connect and interact, so that the communication status of each server in the multitask distribution system is monitored more efficiently.

Communication detection code is deployed on every server in the distributed task system, and the servers then communicate through TCP. During TCP communication, any server can send or feed back txt files with own IP address names to the other servers, and receive txt files with own IP address names fed back or sent by the other servers. Specifically, a server periodically sends a txt file (named by its IP address) to each of the other servers through TCP communication; after another server obtains the txt file, it returns a txt file (named by the current server IP address) to the sending server through TCP communication. The sending period may be 8 hours. In this way, the servers can check the communication between them through the txt files transmitted to each other.
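The exchange described above can be sketched with plain TCP sockets on the loopback interface. This is a simplified illustration, not the implementation of the embodiment: only the file name is transmitted as a stand-in for the txt file, the helper names `serve_once` and `probe` and the IP addresses are hypothetical, and the sketch assumes that the feedback file carries the responding server's own IP address, which the text leaves open.

```python
import socket
import threading
from typing import Optional


def serve_once(listener: socket.socket, own_ip: str) -> None:
    """Accept one connection, read the sender's file name, feed back our own."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)  # file name sent by the probing server
        conn.sendall(f"{own_ip}.txt".encode())


def probe(host: str, port: int, own_ip: str, timeout: float = 2.0) -> Optional[str]:
    """Send our file name to a peer; return the fed-back name, or None on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(f"{own_ip}.txt".encode())
            reply = sock.recv(1024)
            return reply.decode() or None
    except OSError:  # refused, unreachable or timed out
        return None


# A peer listening on an ephemeral loopback port (stands in for another server).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
worker = threading.Thread(target=serve_once, args=(listener, "10.0.0.2"), daemon=True)
worker.start()

reply = probe("127.0.0.1", port, "10.0.0.1")
worker.join(timeout=2.0)
listener.close()

# With the peer gone, the same probe fails, which corresponds to "no feedback".
unreachable = probe("127.0.0.1", port, "10.0.0.1", timeout=0.5)
print(reply, unreachable)
```

A returned file name corresponds to a successful TCP communication check, and `None` corresponds to a failed one.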

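In this embodiment, the method further includes setting a first communication variable to determine whether communication between two servers is abnormal. The variable name of the first communication variable includes the names of the two communicating servers. If the current server receives the txt file fed back by the other server, TCP communication between the two servers is normal and the value of the first communication variable is 0; if it does not, TCP communication between the two servers is abnormal and the value of the first communication variable is 1.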

If a first communication variable in the current server has the value 1, the Prometheus Server needs to further judge whether the current server itself is abnormal. A metric variable is therefore set to make this judgment when TCP communication between the current server and other servers is abnormal: if the current server receives no txt file with an own IP address name fed back by any of the other servers, the current server is abnormal and the value of the metric variable is 1; otherwise, the value of the metric variable is 0.

Illustratively, take one of 5 servers in the distributed task system performing TCP communication with the other servers as an example. The 5 servers are server A, server B, server C, server D and server E. Server A first downloads the IP addresses of the four servers B-E, and the communication detection code in server A then monitors the TCP communication with the four servers B-E in real time. Specifically, server A sends a txt file with its own IP address name to each of the four servers B-E. If server A receives the txt file with an own IP address name fed back by one of the four servers B-E, the TCP communication check between server A and that server succeeds, and the communication detection code in server A calculates the first communication variable between server A and that server as 0. For example, if server A receives the txt file with an own IP address name fed back by server B, TCP communication between server A and server B is normal, and server A calculates the first communication variable with server B as 0 through the communication detection code. Server A performs the same operations with servers C-E, which are not described in detail here.

If server A does not receive the txt file with an own IP address name fed back by one of the four servers B-E, the TCP communication check between server A and that server fails, and the communication detection code in server A calculates the first communication variable between server A and that server as 1. For example, if server A does not receive the txt file with an own IP address name fed back by server C, TCP communication between the two is abnormal, and server A calculates the first communication variable with server C as 1 through the communication detection code.

Further, the communication detection code in server A checks the calculated first communication variables. If a first communication variable is 1, for example the first communication variable between server A and server C, the communication detection code in server A goes on to traverse the remaining servers to check whether any of them has also failed to feed back a txt file with an own IP address name, wherein,

if none of the three servers B, D and E has fed back a txt file with an own IP address name, the communication detection code in server A calculates the metric variable in server A as 1;

if one or more of the servers B, D and E have fed back txt files with own IP address names, the communication detection code in server A calculates the metric variable in server A as 0;

if the first communication variables stored in server A are all 0, the communication detection code in server A calculates the metric variable in server A as 0. Further preferably, the communication detection code in server A performs the calculation of the metric variable as soon as a first communication variable is calculated to be 1. If the first communication variables acquired during the TCP communication process are all 0, then after the traversal is finished, the communication detection code in server A calculates the metric variable in server A as 0.
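The three cases for server A can be worked through in a short Python sketch. The function name and the key scheme `A_B`, `A_C` and so on are assumptions chosen for illustration; the values follow the rules stated above.

```python
from __future__ import annotations


def metric_variable(first_vars: dict[str, int]) -> int:
    """Return 1 only if every peer failed to feed back, otherwise 0."""
    if not first_vars:
        return 0  # no peers to judge against; treated as normal in this sketch
    for value in first_vars.values():
        if value == 0:
            return 0  # at least one peer answered, so the server itself is fine
    return 1


# First communication variables of server A toward B-E in three scenarios.
scenarios = {
    "all peers answer": {"A_B": 0, "A_C": 0, "A_D": 0, "A_E": 0},
    "only C is silent": {"A_B": 0, "A_C": 1, "A_D": 0, "A_E": 0},
    "all peers silent": {"A_B": 1, "A_C": 1, "A_D": 1, "A_E": 1},
}

results = {label: metric_variable(values) for label, values in scenarios.items()}
for label, value in results.items():
    print(f"{label}: metric variable = {value}")
```

Only the last scenario marks server A itself as abnormal; a single failed link, as with server C, leaves the metric variable at 0.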

The communication detection codes are deployed on each server in the distributed task system, and the first communication variable and the metric variable are calculated in the TCP communication process of each server, so that the communication condition between the current server and other servers can be quickly acquired, the detection efficiency of communication faults is improved, and the method has real-time performance and rapidity.

In this embodiment, the method further includes each server downloading the IP addresses of all other servers in the distributed task system, and each server checking the received txt files at a predetermined time interval and comparing them with the IP addresses of all other servers. The predetermined time may be 30 s (seconds), but is not limited to 30 s; other values such as 20 s or 1 min (minute) are also suitable for the present invention.
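A minimal Python sketch of this periodic check follows. It assumes that each received file is named `<sender IP>.txt` and that a callable returns the names of the files received so far; these names, the helper functions and the sample addresses are hypothetical. The demonstration passes a very short interval instead of the 30 s default so that it finishes immediately.

```python
import time

CHECK_INTERVAL_SECONDS = 30  # the predetermined time given in the embodiment


def check_received(received_files, peer_ips):
    """Compare received txt file names with the peer IP list (0 = received, 1 = missing)."""
    senders = {name[:-4] for name in received_files if name.endswith(".txt")}
    return {ip: 0 if ip in senders else 1 for ip in peer_ips}


def monitor(list_received, peer_ips, rounds, interval=CHECK_INTERVAL_SECONDS):
    """Run the check `rounds` times, sleeping `interval` seconds between rounds."""
    history = []
    for i in range(rounds):
        history.append(check_received(list_received(), peer_ips))
        if i < rounds - 1:
            time.sleep(interval)
    return history


# Two snapshots of the receive directory: the second peer answers only later.
snapshots = iter([["10.0.0.2.txt"], ["10.0.0.2.txt", "10.0.0.3.txt"]])
history = monitor(lambda: next(snapshots), ["10.0.0.2", "10.0.0.3"],
                  rounds=2, interval=0.01)
print(history)
```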

Specifically, the metric variable can be transmitted using the Prometheus Client Library in the Exporter software package, here the Prometheus Client Library for Python (a cross-platform programming language). A Label is used to package the information of the server, and HTTP port 8000 is then opened to wait for scraping by Prometheus.
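What such an endpoint exposes can be illustrated with the Python standard library alone. The sketch below is a stand-in for the client library, not the embodiment's code: in practice the `Gauge` and `start_http_server` helpers of the `prometheus_client` package would normally do this work. The metric name `server_comm_abnormal` and the label `server` are assumptions, and the demonstration binds an ephemeral loopback port rather than port 8000 so that it cannot collide with a running service.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(server_name: str, metric_value: int) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return (
        "# HELP server_comm_abnormal 1 if the server got no feedback from any peer.\n"
        "# TYPE server_comm_abnormal gauge\n"
        f'server_comm_abnormal{{server="{server_name}"}} {metric_value}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical state: server A currently reports metric variable 1.
        payload = render_metrics("A", 1).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


httpd = HTTPServer(("127.0.0.1", 0), MetricsHandler)
port = httpd.server_address[1]
threading.Thread(target=httpd.serve_forever, daemon=True).start()

# Play the role of the Prometheus Server and scrape the endpoint once.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
conn.request("GET", "/metrics")
body = conn.getresponse().read().decode()
conn.close()
httpd.shutdown()
httpd.server_close()
print(body)
```

The label carries the server information, so one metric name can serve every server in the system.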

Further, the Prometheus Server acquires the metric variables of all servers; if the value of the metric variable of a server is 1, that server is judged to be an abnormal server. The Prometheus Server pushes an alarm for the abnormal server and/or stores the abnormal server in an abnormal database so as to isolate it. Specifically, the Prometheus Server adds the abnormal server to the alarm rules of Alertmanager, which then notifies the operation and maintenance administrator by email; at the same time, the abnormal server is added to a database of the distributed task system that is dedicated to storing abnormal servers. While a server's information is stored in this database, user tasks are not dispatched to that server, which isolates the abnormal server. Further, after receiving the alarm information, the operation and maintenance administrator confirms that the abnormal server has been added to the database, then analyzes and repairs it, restarting or reinstalling it if necessary until communication succeeds, and finally removes it from the database and puts it back into the distributed task system.
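The isolation logic can be sketched in Python with an in-memory set standing in for the abnormal-server database. The class name, the method names and the alert text are hypothetical, and the email notification is reduced to a returned list of messages.

```python
from __future__ import annotations


class AbnormalServerRegistry:
    """In-memory stand-in for the database that stores abnormal servers."""

    def __init__(self) -> None:
        self._abnormal: set[str] = set()

    def isolate(self, server: str) -> None:
        self._abnormal.add(server)

    def restore(self, server: str) -> None:
        self._abnormal.discard(server)

    def schedulable(self, servers: list[str]) -> list[str]:
        """Servers that user tasks may still be dispatched to."""
        return [s for s in servers if s not in self._abnormal]


def handle_metrics(metrics: dict[str, int], registry: AbnormalServerRegistry) -> list[str]:
    """Isolate every server whose metric variable is 1 and return alert messages."""
    alerts = []
    for server, value in sorted(metrics.items()):
        if value == 1:
            registry.isolate(server)
            alerts.append(f"server {server} is abnormal")
    return alerts


registry = AbnormalServerRegistry()
alerts = handle_metrics({"A": 0, "B": 1, "C": 0}, registry)
during_isolation = registry.schedulable(["A", "B", "C"])

registry.restore("B")  # after the administrator has repaired server B
after_repair = registry.schedulable(["A", "B", "C"])
print(alerts, during_isolation, after_repair)
```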

In this embodiment, a Prometheus Exporters software package is introduced into each server in the multitask distribution system, the Exporters and the Prometheus Server connect and interact, and the alarm rules and other facilities of the Prometheus framework are used in combination, so that the communication status of each server in the multitask distribution system is monitored more efficiently, and the efficiency with which users handle communication abnormalities is further improved.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for ensuring distributed multi-machine communication monitoring, the method comprising,

based on the communication variables, the Prometheus Server judges whether the communication between the servers is normal, wherein a first communication variable is set to determine whether the communication between two servers is abnormal, wherein if the current server receives a txt file with an own IP address name fed back by the other server, the TCP communication between the two servers is normal, and the value of the first communication variable is 0; if the current server does not receive the txt file with the own IP address name fed back by the other server, the TCP communication between the two servers is abnormal, and the value of the first communication variable is 1, wherein a metric variable is set to judge whether the current server is abnormal when the TCP communication between the two servers is abnormal, and if the current server does not receive the txt file with the own IP address name fed back by all servers in the other servers, the value of the metric variable is 1, and otherwise, the value of the metric variable is 0.

2. The method for ensuring distributed multi-machine communication monitoring of claim 1, further comprising,

3. The method for ensuring distributed multi-machine communication monitoring according to claim 1, wherein before the communication variables calculated by the communication detection codes in each server are read through the Exporters, the method further comprises monitoring, in real time by the communication detection code in each server, the TCP communication with each other server; wherein,

the current server sends txt files with own IP address names to other servers;

4. The method for guaranteeing distributed multi-machine communication monitoring as recited in claim 3, wherein the real-time monitoring of TCP communication with other servers by the communication detection code in each server further comprises,

5. The method of claim 4, wherein the Prometheus Server determining, based on the communication variables, whether communication between the servers is normal comprises,

6. The method for ensuring distributed multi-machine communication monitoring according to any of claims 1-5, further comprising,

7. The method for ensuring distributed multi-machine communication monitoring of claim 1, wherein the variable name of the first communication variable comprises the names of two servers.