CN104270268B - Network performance analysis and fault diagnosis method for distributed systems

Info

Publication number: CN104270268B
Application number: CN201410508685.0A
Authority: CN
Inventors: 张攀勇; 彭成; 季旻; 苗艳超
Original assignee: Dawning Information Industry Co Ltd
Current assignee: CHINESE CORPORATION DAWNING INFORMATION INDUSTRY CHENGDU CO., LTD.; Dawning Information Industry Co Ltd
Priority date: 2014-09-28
Filing date: 2014-09-28
Publication date: 2017-12-05
Anticipated expiration: 2034-09-28
Also published as: CN104270268A

Abstract

The present invention provides a network performance analysis and fault diagnosis method for distributed systems, comprising the following steps: deploying a monitoring service in the monitored distributed system; running a management service according to the characteristics of the distributed system; performing network topology discovery; determining the monitor node set from the monitored nodes; having the management service collect and analyze node status information; performing network performance detection; and analyzing the network state to determine possible faults. The invention takes into account the state of all network devices and links on the communication path as well as the communication performance between nodes and, using the network topology information, can analyze and determine the specific location of a fault point, which improves the precision of fault detection and reduces its overhead. For performance analysis of a distributed system, the method provides the actual performance between the nodes of the distributed system rather than the theoretical performance of the network, which can improve the precision of performance prediction.

Description

Network performance analysis and fault diagnosis method for distributed systems

Technical field

The present invention relates to a diagnostic method, and in particular to a network performance analysis and fault diagnosis method for distributed systems.

Background art

A distributed system is built on top of a network: different nodes cooperate through message communication between nodes to complete one or more services. Because a distributed system distributes its services across different nodes, it offers good scalability, fault isolation, and application transparency. Distributed systems are widely used in real IT systems; typical services include distributed file systems, distributed databases, and website services.

Because a distributed system relies on network equipment to interconnect its service nodes, the performance and stability of the network equipment are decisive for the performance and stability of the distributed system. As distributed systems grow, the scale of the network, the variety of device types, and the ways devices are connected become extremely complex, and a fault in a single device can directly affect the quality of the upper-layer services. Carrying out efficient fault diagnosis and performance analysis of the network with tools is therefore of great importance.

Current fault diagnosis mechanisms fall into hardware fault diagnosis mechanisms and software testing tools.

Hardware fault diagnosis mechanisms consist of the performance and fault counters provided on network devices, which count sent and received messages, dropped messages, hardware error messages, and so on. These counts make it possible to detect whether a hardware device is abnormal.

Software testing tools actively send and receive messages point to point, calculate the point-to-point network latency and bandwidth, and from these judge whether the network is faulty. Typical testing tools include Iperf and netperf.

Existing network performance analysis and fault diagnosis for distributed systems has the following problems:

● Limited sources for fault judgment: hardware counters can only detect faults in the hardware itself and cannot judge faults such as network link state or software protocol layer errors; point-to-point software testing tools can only test the network performance between two points, and their data alone does not allow a network fault to be located quickly.

● Manual administrator involvement: the administrator has to test the various possible situations by hand and work out from the results which fault may be present. Because the growth of distributed systems makes networks very large, a fault diagnosis tool is needed that quickly and simply reports the possible fault points of the whole network, so that the administrator can judge and eliminate faults.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a network performance analysis and fault diagnosis method for distributed systems. It takes into account the state of all network devices and links on the communication path and the communication performance between nodes and, using the network topology information, can analyze and determine the specific location of a fault point, improving the precision of fault detection and reducing its overhead.

For performance analysis of a distributed system, the method provides the actual performance between the nodes of the distributed system rather than the theoretical performance of the network, which can improve the precision of performance prediction.

To achieve the above object, the present invention adopts the following technical solution:

The present invention provides a network performance analysis and fault diagnosis method for distributed systems, the method comprising the following steps:

Step 1: Deploy the monitoring service in the monitored distributed system;

Step 2: Run the management service according to the characteristics of the distributed system;

Step 3: Perform network topology discovery;

Step 4: Determine the monitor node set from the monitored nodes;

Step 5: The management service collects and analyzes node status information;

Step 6: Perform network performance detection;

Step 7: Analyze the network state and determine possible faults.

In step 1, the monitored nodes are determined according to the scale of the monitored distributed system, and the monitoring service is deployed on them. A monitored node is a node in the distributed system that hosts a service needing to be monitored, including servers, network devices, and the like.

The monitoring service is responsible for monitoring the network state of the node on which it runs, including the hardware state of the network card and the performance counters provided by the operating system;

The monitoring service receives and executes commands from the management service; the commands include network detection commands and network performance test commands;

On a network detection command from the management service, the monitoring service performs network detection; on a network performance test command from the management service, it performs a network performance test between nodes.

In step 2, the management service runs on a management node. According to the characteristics of the distributed system, the management service selects the monitored nodes, starts the monitoring service, and connects to the monitoring service on each monitored node.

How the management service connects to the monitoring services depends on the scale of the distributed system:

For a small-scale distributed system, the management service connects directly to all monitoring services;

For a large-scale distributed system, the management services are connected in a tree hierarchy: a top-level management service manages the management services of the different partitions, and each partition's management service manages a configured number of nodes and networks.
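The two connection modes can be sketched as follows. This is only an illustrative sketch: the function name, the service names, and the `small_scale_limit` threshold are hypothetical, since the text states only that the partition size is configurable and does not say where the boundary between small and large scale lies.

```python
from typing import Dict, List


def plan_connections(nodes: List[str], partition_size: int,
                     small_scale_limit: int) -> Dict[str, List[str]]:
    """Map each management service to the services it connects to.

    partition_size and small_scale_limit are hypothetical configuration
    values introduced for this sketch.
    """
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    if len(nodes) <= small_scale_limit:
        # Small-scale system: one management service, direct connections.
        return {"manager": list(nodes)}
    # Large-scale system: one partition manager per chunk of nodes.
    plan: Dict[str, List[str]] = {}
    for start in range(0, len(nodes), partition_size):
        name = f"partition-manager-{start // partition_size}"
        plan[name] = nodes[start:start + partition_size]
    # The top-level manager connects only to the partition managers.
    partition_managers = list(plan)
    plan["top-manager"] = partition_managers
    return plan
```

With five nodes, a partition size of 2, and a limit of 3, the sketch yields three partition managers under one top-level manager.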

In step 3, the management service initiates network topology discovery against all network devices of the distributed system to determine the network topology information, and stores that information in the management service. If the network devices of the distributed system do not support topology discovery, the network topology information is built from the topology configuration provided by the administrator.
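A minimal sketch of the fallback described in step 3, assuming the topology is held as an undirected adjacency map. The `discover` callback is a hypothetical stand-in for whatever discovery mechanism the network devices offer; here it returns `None` when discovery is unsupported.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

Link = Tuple[str, str]
Topology = Dict[str, Set[str]]


def build_topology(links: Iterable[Link]) -> Topology:
    """Build an undirected adjacency map from a list of links."""
    topo: Topology = {}
    for a, b in links:
        topo.setdefault(a, set()).add(b)
        topo.setdefault(b, set()).add(a)
    return topo


def obtain_topology(discover: Callable[[], Optional[Iterable[Link]]],
                    admin_links: Iterable[Link]) -> Topology:
    """Prefer discovered links; otherwise use the administrator's config."""
    discovered = discover()
    if discovered is None:
        # Hypothetical convention: None means discovery is not supported.
        return build_topology(admin_links)
    return build_topology(discovered)
```

For example, `obtain_topology(lambda: None, [("node1", "switch1"), ("switch1", "switch2")])` builds the topology from the administrator-provided links only.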

In step 4, the following three monitoring modes are supported for the monitored nodes:

(1) Full-system scan mode: all nodes and network devices of the distributed system are scanned, so the monitor node set consists of all nodes and network devices in the system;

(2) Administrator-specified mode: the administrator specifies the monitor node set through configuration;

(3) Application-specified mode with the monitored set scanned on failure: the application specifies the monitor node set through an API, and the system scans the specific nodes after a suspected fault is found. This mode proceeds as follows:

3-1): The application specifies the nodes that need to be monitored;

3-2): The monitoring service periodically monitors the state of its node and, if it finds the network state abnormal, proactively notifies the management service of the node's abnormal communication state;

3-3): After receiving the notification, the management service calculates the communication path from the network topology and adds all network devices and nodes on that path to the monitor node list.
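Step 3-3) can be sketched as below. The text does not say how the communication path is calculated, so this sketch assumes a breadth-first shortest path over the stored topology, and it assumes the abnormality report names the peer node; both are illustrative choices, as are the function names.

```python
from collections import deque
from typing import Dict, List, Optional, Set

Topology = Dict[str, Set[str]]


def shortest_path(topo: Topology, src: str, dst: str) -> List[str]:
    """Breadth-first search; returns [] when dst is unreachable."""
    prev: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path: List[str] = []
            step: Optional[str] = cur
            while step is not None:
                path.append(step)
                step = prev[step]
            return path[::-1]
        for nxt in sorted(topo.get(cur, ())):
            if nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return []


def on_abnormal_report(topo: Topology, monitor_set: Set[str],
                       reporter: str, peer: str) -> None:
    """Add every node and device on the reporter-to-peer path."""
    monitor_set.update(shortest_path(topo, reporter, peer))
```

If the peer cannot be reached in the stored topology, the monitor set is left unchanged.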

Step 5 comprises the following steps:

Step 5-1: The management service sends a node status information collection command to the monitoring services in the monitor node set;

Step 5-2: On receiving the command, each monitoring service collects the state of the network devices of its own node and returns the result to the management service;

Step 5-3: The management service analyzes the status information collected from all nodes, identifies the network devices that are faulty, and marks them in its network topology information;

Step 5-4: The management service reports the list of faulty network devices to the administrator and notifies the administrator to carry out maintenance.
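Steps 5-3 and 5-4 might look like the following sketch. The counter names and limits in `ERROR_LIMITS` are hypothetical; the text does not specify which counters are checked or what thresholds apply, and real devices expose different counters.

```python
from typing import Dict, List

# Hypothetical counter names and limits, chosen only for illustration.
ERROR_LIMITS: Dict[str, int] = {"rx_errors": 0, "tx_errors": 0, "dropped": 100}


def find_faulty_devices(status: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the devices with any counter above its limit, sorted by name."""
    faulty: List[str] = []
    for device, counters in sorted(status.items()):
        if any(counters.get(name, 0) > limit
               for name, limit in ERROR_LIMITS.items()):
            faulty.append(device)
    return faulty


def mark_faulty(topology_state: Dict[str, str], faulty: List[str]) -> None:
    """Record the fault marks in the management service's topology data."""
    for device in faulty:
        topology_state[device] = "faulty"
```

The list returned by `find_faulty_devices` is what would be reported to the administrator in step 5-4.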

Step 6 comprises the following steps:

Step 6-1: The management service initiates active network performance probing between pairs of monitor nodes in the monitor node set. The performance metrics include two-way network latency, network bandwidth, and network performance stability, and the counters of all network devices on the path between the nodes are collected;

Step 6-2: On receiving the network performance probe request, the monitoring service on a node actively sends probe messages to the corresponding node and returns the result to the management service;

Step 6-3: The algorithms by which the management service selects the pairs of monitor nodes include permutation-and-combination algorithms, greedy algorithms, and the like.
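The two families of pair-selection algorithms named in step 6-3 can be illustrated as follows. The text gives no detail on either algorithm, so this is one plausible reading: the combination approach probes every node pair, while the greedy approach (an assumption here, modeled as greedy set cover) keeps picking the pair whose path covers the most links not yet probed.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]
Link = FrozenSet[str]


def all_pairs(nodes: List[str]) -> List[Pair]:
    """Exhaustive selection: every unordered pair of nodes."""
    return list(combinations(sorted(nodes), 2))


def greedy_pairs(pair_links: Dict[Pair, Set[Link]]) -> List[Pair]:
    """Pick pairs until every link on some pair's path is probed once.

    pair_links maps each candidate pair to the links on its path.
    """
    uncovered: Set[Link] = set().union(*pair_links.values())
    chosen: List[Pair] = []
    while uncovered:
        # Ties are broken by sorted pair order so the result is stable.
        best = max(sorted(pair_links),
                   key=lambda p: len(pair_links[p] & uncovered))
        chosen.append(best)
        uncovered -= pair_links[best]
    return chosen
```

Exhaustive pairing grows quadratically with the number of nodes, which is why a cheaper selection is attractive on large systems.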

In step 7, after receiving the results of step 5 and step 6, the management service analyzes the network state using the network topology information obtained in step 3. It combines the counters of all network devices with the communication test performance between nodes to determine which network devices or links are faulty. Possible faults include network card hardware faults, incorrect network card operating mode, mismatch between the network card interface and the node interface, disconnected cables, unstable cables, and switch faults.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The network performance analysis and fault diagnosis method provided by the invention takes into account the state of all network devices of the distributed system, actively probes the paths between nodes, and from the probe results identifies the problem or fault point down to a specific network device, link, or node. This greatly reduces the overhead of network performance analysis and fault diagnosis and lessens the administrator's manual intervention. The method supports fault detection both for the whole system and for application-specified paths.

Brief description of the drawings

Fig. 1 is a schematic diagram of the connections between the management service and the monitoring services in an embodiment of the present invention;

Fig. 2 is a schematic diagram of how the management service judges a network device fault from the results in an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below in conjunction with the accompanying drawings.

In step 2, the management service runs on a management node. According to the characteristics of the distributed system, the management service selects the monitored nodes, starts the monitoring service, and connects to the monitoring service on each monitored node (see Fig. 1).

In step 7, the criteria for judgment may include, but are not limited to, the following:

● If all links leading out of a node show abnormal performance, the node's network card or its external cable is judged to be faulty;

● If communication performance is abnormal on the links passing through a particular switching device, that switching device is judged to be operating abnormally;

● If communication performance is abnormal between the node pairs that use a particular link, that link is judged to be abnormal.
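The first criterion above can be sketched as a simple check over pairwise probe results. The representation of results as a map from node pair to a normal/abnormal flag is an assumption made for this sketch; the text does not define a result format.

```python
from typing import Dict, List, Tuple


def nodes_with_all_tests_abnormal(
        results: Dict[Tuple[str, str], bool]) -> List[str]:
    """Return nodes for which every probe they took part in was abnormal.

    results maps a probed node pair to True when the probe was normal.
    """
    seen: Dict[str, List[bool]] = {}
    for (a, b), ok in results.items():
        seen.setdefault(a, []).append(ok)
        seen.setdefault(b, []).append(ok)
    # A node is suspect only if none of its probes came back normal.
    return sorted(node for node, oks in seen.items() if not any(oks))
```

A node flagged this way points to its own network card or external cable, in line with the first criterion; the other two criteria need the path information from the topology.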

Fig. 2 shows an example of the management service judging a device fault from the results in step 7:

■ According to the device counters, node 1, node 2, node 3, switch 1, and switch 2 are all judged to be working normally;

■ At the same time, network performance between node 1 and node 2 is normal, but the state between node 1 and node 3 and between node 2 and node 3 is abnormal;

■ Analysis based on the management service's network topology information shows that the network path shared by node 1 to node 3 and node 2 to node 3 is switch 1 to switch 2. Since the switches' device counters are normal, the fault is judged likely to be a failure of the link between switch 1 and switch 2, and the administrator is notified of the corresponding fault point.
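The reasoning in the Fig. 2 example can be approximated by intersecting the links of the abnormal paths and discarding links that also lie on a normal path. This is a sketch under assumptions: the figure is not reproduced here, so the topology (node 1 and node 2 attached to switch 1, node 3 attached to switch 2) is inferred from the text, and the function names are illustrative.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]
Link = FrozenSet[str]


def path_links(path: List[str]) -> Set[Link]:
    """Turn a hop-by-hop path into its set of undirected links."""
    return {frozenset(hop) for hop in zip(path, path[1:])}


def suspect_links(paths: Dict[Pair, List[str]],
                  normal: Dict[Pair, bool]) -> Set[Link]:
    """Links shared by all abnormal paths and absent from normal paths."""
    bad = [path_links(p) for pair, p in paths.items() if not normal[pair]]
    if not bad:
        return set()
    suspects = set.intersection(*bad)
    for pair, p in paths.items():
        if normal[pair]:
            suspects -= path_links(p)
    return suspects
```

On the assumed topology this sketch returns two candidates, the switch 1 to switch 2 link and the switch 2 to node 3 link. The text names the link between the switches; the sketch by itself does not exclude node 3's access link, so it narrows the search rather than reproducing the full judgment.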

If the effect of the distributed system's network performance on application performance needs to be determined, the expected performance value is calculated from the application pattern provided. The management service reports the analysis results to the administrator, who makes the final judgment and handles the fault accordingly.

Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent substitutions with reference to the above embodiments, and any such modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the present invention.

Claims

1. A network performance analysis and fault diagnosis method for distributed systems, characterized in that the method comprises the following steps:

Step 1: Deploy the monitoring service in the monitored distributed system;

Step 2: Run the management service according to the characteristics of the distributed system;

Step 3: Perform network topology discovery;

Step 4: Determine the monitor node set from the monitored nodes;

Step 5: The management service collects and analyzes node status information;

Step 6: Perform network performance detection;

Step 7: Analyze the network state and determine possible faults;

In step 1, the monitored nodes are determined according to the scale of the monitored distributed system, and the monitoring service is deployed on them; a monitored node is a node in the distributed system that hosts a service needing to be monitored, including servers and network devices;

The monitoring service is responsible for monitoring the network state of the node on which it runs, including the hardware state of the network card and the performance counters provided by the operating system;

The monitoring service receives and executes commands from the management service, the commands including network detection commands and network performance test commands;

On a network detection command from the management service, the monitoring service performs network detection; on a network performance test command from the management service, it performs a network performance test between nodes;

In step 2, the management service runs on a management node; according to the characteristics of the distributed system, the management service selects the monitored nodes, starts the monitoring service, and connects to the monitoring service on each monitored node;

How the management service connects to the monitoring services depends on the scale of the distributed system:

For a small-scale distributed system, the management service connects directly to all monitoring services;

For a large-scale distributed system, the management services are connected in a tree hierarchy: a top-level management service manages the management services of the different partitions, and each partition's management service manages a configured number of nodes and networks;

In step 3, the management service initiates network topology discovery against all network devices of the distributed system to determine the network topology information, and stores that information in the management service; if the network devices of the distributed system do not support topology discovery, the network topology information is built from the topology configuration provided by the administrator;

In step 4, the following three monitoring modes are supported for the monitored nodes:

(1) Full-system scan mode: all nodes and network devices of the distributed system are scanned, so the monitor node set consists of all nodes and network devices in the system;

(2) Administrator-specified mode: the administrator specifies the monitor node set through configuration;

(3) Application-specified mode with the monitored set scanned on failure: the application specifies the monitor node set through an API, and the system scans the specific nodes after a suspected fault is found; this mode proceeds as follows:

3-1): The application specifies the nodes that need to be monitored;

3-2): The monitoring service periodically monitors the state of its node and, if it finds the network state abnormal, proactively notifies the management service of the node's abnormal communication state;

3-3): After receiving the notification, the management service calculates the communication path from the network topology and adds all network devices and nodes on that path to the monitor node list;

Step 5 comprises the following steps:

Step 5-1: The management service sends a node status information collection command to the monitoring services in the monitor node set;

Step 5-2: On receiving the command, each monitoring service collects the state of the network devices of its own node and returns the result to the management service;

Step 5-3: The management service analyzes the status information collected from all nodes, identifies the network devices that are faulty, and marks them in its network topology information;

Step 5-4: The management service reports the list of faulty network devices to the administrator and notifies the administrator to carry out maintenance;

Step 6 comprises the following steps:

Step 6-1: The management service initiates active network performance probing between pairs of monitor nodes in the monitor node set; the performance metrics include two-way network latency, network bandwidth, and network performance stability, and the counters of all network devices on the path between the nodes are collected;

Step 6-2: On receiving the network performance probe request, the monitoring service on a node actively sends probe messages to the corresponding node and returns the result to the management service;

Step 6-3: The algorithms by which the management service selects the pairs of monitor nodes include permutation-and-combination algorithms and greedy algorithms;

In step 7, after receiving the results of step 5 and step 6, the management service analyzes the network state using the network topology information obtained in step 3, combining the counters of all network devices with the communication test performance between nodes to determine which network devices or links are faulty; possible faults include network card hardware faults, incorrect network card operating mode, mismatch between the network card interface and the node interface, disconnected cables, unstable cables, and switch faults.