CN102148707A

CN102148707A - Troubleshooting method and system of monitoring agents

Info

Publication number: CN102148707A
Application number: CN2011100309810A
Authority: CN
Inventors: 王理想; 刘成平
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2011-01-28
Filing date: 2011-01-28
Publication date: 2011-08-10

Abstract

The invention provides a troubleshooting method and system of monitoring agents. In the method, a monitored resource node comprises a first monitoring agent and a second monitoring agent, wherein the second monitoring agent monitors the operation state of the first monitoring agent; and the second monitoring agent triggers the start flow of the first monitoring agent when monitoring that the first monitoring agent stops due to failures.

Description

Fault handling method and system for a monitoring agent

Technical field

The present invention relates to the field of computer applications, and in particular to a fault handling method and system for a monitoring agent.

Background technology

At present, computers are increasingly widespread and their range of applications keeps growing. The spread of personal computers has in turn promoted the widespread use of servers.

Today, large enterprises and institutions keep tens of thousands of servers running at all times. With the widespread use of servers, how to manage large numbers of servers, storage devices, and network devices effectively has become a growing concern for these organizations.

For this reason, the major server manufacturers and software companies have each released their own device management software. Such software falls into two types, according to whether an agent must be installed on the monitored resource node to implement the management functions. In the first type, a monitoring agent program is installed on the monitored node so that the management software can manage the monitored resource node. In the second type, no agent is installed, and management is implemented through the Simple Network Management Protocol (SNMP), the Intelligent Platform Management Interface (IPMI), or similar protocols.

Of these two types, not installing a monitoring agent service on the monitored node is the simplest and safest approach to management. However, SNMP and other protocols such as IPMI can only provide simple management of a device. As users' requirements for device management grow, this approach is increasingly unable to meet them. Installing a monitoring agent service on the monitored resource node is therefore the more common approach at present.

In the course of realizing the present invention, the inventors found the following problem in the prior art:

When the monitoring agent stops running for any of a variety of reasons, the Surveillance center can no longer communicate normally with the monitoring agent and therefore cannot continue to manage the monitored resource node. This is a weakness of the agent-based monitoring approach.

To solve the problem that a monitoring agent that has stopped running cannot communicate normally with the Surveillance center, we propose a device monitoring and management scheme that is flexible and fault tolerant.

Summary of the invention

The invention provides a fault handling method and system for a monitoring agent, to solve the prior-art problem that an interruption in the operation of a monitoring agent cannot be handled in time.

To solve the above technical problem, the invention provides the following technical solutions:

A fault handling method for a monitoring agent, in which a monitored resource node comprises a first monitoring agent and a second monitoring agent, wherein:

The second monitoring agent monitors the running status of the first monitoring agent;

On detecting that the first monitoring agent has stopped running because of a fault, the second monitoring agent triggers the start flow of the first monitoring agent.

Further, the method also has the following feature: the second monitoring agent triggering the start flow of the first monitoring agent comprises:

The second monitoring agent judges whether the fault that occurred on the first monitoring agent needs to be handled;

If the fault that occurred on the first monitoring agent needs to be handled, the second monitoring agent starts the first monitoring agent after the fault has been handled; otherwise, the second monitoring agent starts the first monitoring agent directly.

Further, the method also has the following feature: handling the fault that occurred on the first monitoring agent comprises:

The second monitoring agent looks up, among the fault processing policies stored locally in advance, the processing policy corresponding to the fault that occurred on the first monitoring agent;

If the corresponding processing policy is found, the second monitoring agent uses it to handle the fault that occurred on the first monitoring agent;

If the corresponding processing policy is not found, the second monitoring agent obtains the processing policy for the fault from the Surveillance center and then uses it to handle the fault that occurred on the first monitoring agent; alternatively, the second monitoring agent requests the Surveillance center to handle the fault that occurred on the first monitoring agent.

Further, the method also has the following feature: the second monitoring agent triggering the start flow of the first monitoring agent comprises:

The second monitoring agent reports to the Surveillance center the information that the first monitoring agent has stopped running;

The Surveillance center starts the first monitoring agent according to the information that the first monitoring agent has stopped running.

Further, the method also has the following feature: the second monitoring agent triggering the start flow of the first monitoring agent comprises:

If the fault that occurred on the first monitoring agent needs to be handled, the second monitoring agent notifies the Surveillance center to handle the fault; the Surveillance center handles the fault and, after fault handling is complete, starts the first monitoring agent;

If the fault that occurred on the first monitoring agent does not need to be handled, the second monitoring agent starts the first monitoring agent directly.

A fault processing system for a monitoring agent, in which a monitored resource node comprises a first monitoring agent and a second monitoring agent, wherein the second monitoring agent comprises:

A supervising device, configured to monitor the running status of the first monitoring agent;

A processing unit, configured to trigger the start flow of the first monitoring agent on detecting that the first monitoring agent has stopped running because of a fault.

Further, the system also has the following feature: the processing unit comprises:

A judging module, configured to judge whether the fault that occurred on the first monitoring agent needs to be handled;

A starting module, configured to start the first monitoring agent after the fault has been handled when the fault that occurred on the first monitoring agent needs to be handled, and to start the first monitoring agent directly when the fault does not need to be handled.

Further, the system also has the following feature: the processing unit further comprises:

A lookup module, configured to look up, among the fault processing policies stored locally in advance, the processing policy corresponding to the fault that occurred on the first monitoring agent;

A first processing module, configured to use the processing policy to handle the fault that occurred on the first monitoring agent when the corresponding processing policy is found;

A second processing module, configured, when the corresponding processing policy is not found, to obtain the processing policy for the fault from the Surveillance center and then use it to handle the fault that occurred on the first monitoring agent; or to request the Surveillance center to handle the fault that occurred on the first monitoring agent.

Further, the system also has the following feature:

The processing unit comprises:

A reporting module, configured to report to the Surveillance center the information that the first monitoring agent has stopped running;

The system further comprises:

A Surveillance center, configured to start the first monitoring agent according to the information that the first monitoring agent has stopped running.

Further, the system also has the following feature:

The processing unit comprises:

A notification module, configured to notify the Surveillance center to handle the fault that occurred on the first monitoring agent when that fault needs to be handled;

A starting module, configured to start the first monitoring agent directly when the fault that occurred on the first monitoring agent does not need to be handled;

The system further comprises:

A Surveillance center, configured to handle the fault that occurred on the first monitoring agent and, after fault handling is complete, to start the first monitoring agent.

In the embodiments provided by the invention, the second monitoring agent monitors the running status of the first monitoring agent and, when the first monitoring agent stops running, triggers the flow that starts it. This shortens both the time needed to discover a fault of the first monitoring agent and the time needed to start it again, and so safeguards normal communication of information.

Description of drawings

Fig. 1 is a schematic flowchart of the fault handling method for a monitoring agent provided by the invention;

Fig. 2 is a schematic structural diagram of an embodiment of the fault processing system for a monitoring agent provided by the invention;

Fig. 3 is a schematic structural diagram of the processing unit in the system embodiment shown in Fig. 2;

Fig. 4 is another schematic structural diagram of the processing unit shown in Fig. 3;

Fig. 5 is another schematic structural diagram of the system embodiment shown in Fig. 2;

Fig. 6 is yet another schematic structural diagram of the system embodiment shown in Fig. 2.

Embodiment

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Note that, provided they do not conflict, the embodiments in this application and the features in those embodiments may be combined with one another in any manner.

Fig. 1 is a schematic flowchart of the fault handling method for a monitoring agent provided by the invention. The method embodiment shown in Fig. 1 comprises:

In embodiment one, the resource monitored by the Surveillance center comprises a first monitoring agent and a second monitoring agent.

The monitored resource may be a physical device in a network system (such as a cloud computing operating system), and may be at least one of a server, a storage device (such as a database), and a transmission device (such as a switch or a router).

Step 101: the second monitoring agent receives information that the first monitoring agent is not running because of a fault;

The running status of the first monitoring agent may be monitored by the second monitoring agent itself, but this is not the only option. The monitoring function may also be implemented by a communication unit in the monitored resource node that communicates with the first monitoring agent. For example, if the communication unit sends information to the first monitoring agent and receives no corresponding response within a preset time, the communication unit determines that the first monitoring agent is not running and sends the second monitoring agent the information that the first monitoring agent is not running because of a fault.
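
The timeout check performed by the communication unit can be sketched as follows. This is a minimal illustration only: the class name `CommunicationUnit`, its method names, and the use of caller-supplied timestamps are assumptions made for the example and are not specified by the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommunicationUnit:
    """Hypothetical communication unit that probes the first monitoring agent."""
    timeout_s: float                       # the "preset time" from the text
    pending_since: Optional[float] = None  # when the oldest unanswered message was sent

    def on_send(self, now: float) -> None:
        # Remember the send time of the oldest message still awaiting a response.
        if self.pending_since is None:
            self.pending_since = now

    def on_response(self) -> None:
        # Any response clears the pending state.
        self.pending_since = None

    def first_agent_stopped(self, now: float) -> bool:
        # The agent is considered stopped once a message has gone unanswered
        # for longer than the preset time.
        return (self.pending_since is not None
                and now - self.pending_since > self.timeout_s)
```

When `first_agent_stopped` returns true, the communication unit would pass the "not running" information on to the second monitoring agent, which then proceeds with step 102.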

Step 102: the second monitoring agent triggers the flow that starts the first monitoring agent.

Depending on which entity performs the operation of starting the first monitoring agent, the flow that starts the first monitoring agent can be triggered in the following two modes:

In the first mode, 102A, the first monitoring agent is started by the second monitoring agent, as follows:

Step 201: the second monitoring agent judges whether the fault that occurred on the first monitoring agent needs to be handled;

For example, the second monitoring agent may store a list in advance that records which of the faults that can occur on the first monitoring agent need to be handled, and compare the fault against this list.

Step 202: if the fault that occurred on the first monitoring agent needs to be handled, the first monitoring agent is started after the fault has been handled; otherwise, the first monitoring agent is started directly.

Note that the purpose of step 201 is to ensure that, when the first monitoring agent is subsequently started, a single start is enough for it to run normally, which shortens the fault handling time. Some faults disappear once the first monitoring agent is restarted, so a direct start is sufficient. For example, if the Surveillance center sent a wrong command that caused the first monitoring agent to stop, simply starting the first monitoring agent again is enough. Other faults persist after a start. For example, if the first monitoring agent does not have enough disk space to store log information, the fault still exists after the agent is started; the agent has to be shut down again and can only be started once the fault has been handled. Starting the first monitoring agent directly in such cases may therefore lead to repeated restarts, which increases the fault handling time.
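
Steps 201 and 202 can be sketched as a single decision function. The function name, the fault identifiers such as `"disk_full"`, and the callback parameters are hypothetical names chosen for this illustration; the source only states that a pre-stored list of faults is consulted.

```python
from typing import Callable, Collection, List


def trigger_start_flow(
    fault: str,
    faults_needing_handling: Collection[str],  # the pre-stored list from step 201
    handle_fault: Callable[[str], None],       # assumed fault handler
    start_first_agent: Callable[[], None],     # assumed start routine
) -> List[str]:
    """Return the actions taken, in order, for the given fault."""
    actions: List[str] = []
    # Step 201: compare the fault against the pre-stored list.
    if fault in faults_needing_handling:
        # Step 202, first branch: handle the fault before starting.
        handle_fault(fault)
        actions.append("handled " + fault)
    # Step 202: start the first monitoring agent (directly, or after handling).
    start_first_agent()
    actions.append("started")
    return actions
```

With a list containing only `"disk_full"`, a wrong-command fault leads to a direct start, while a disk-space fault is handled first, so that one start is enough in both cases.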

The fault that occurred on the first monitoring agent may be handled in step 202 in either of the following two modes:

The first mode, 202A, specifically comprises:

Step A1: the second monitoring agent looks up, among the locally pre-stored processing policies for faults that need handling, the processing policy corresponding to the fault that occurred on the first monitoring agent;

If the policy is found, step A5 is executed; otherwise, steps A2 to A5 are executed;

Step A2: the second monitoring agent queries the Surveillance center for the processing policy corresponding to the fault that occurred on the first monitoring agent;

Step A3: the Surveillance center generates the processing policy corresponding to the fault that occurred on the first monitoring agent;

Step A4: the Surveillance center sends the processing policy corresponding to the fault that occurred on the first monitoring agent to the second monitoring agent;

After step A4 is complete, step A5 is executed.

Step A5: the second monitoring agent uses the obtained processing policy to handle the fault that occurred on the first monitoring agent;

Step A6: after detecting that fault handling is complete, the second monitoring agent starts the first monitoring agent.

In the first mode 202A, if step A1 does not find the processing policy corresponding to the fault that occurred on the first monitoring agent, steps A2 to A5 may instead be replaced by the following: the second monitoring agent reports to the Surveillance center the information that the first monitoring agent has stopped running, and the Surveillance center performs fault handling on the first monitoring agent according to the information reported by the second monitoring agent.
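
The lookup order in steps A1 to A6 (local policies first, then the Surveillance center) can be sketched as follows. Modelling a processing policy as a plain callable, and the names `handle_fault_then_start` and `query_center`, are assumptions made for this example; the source does not define the form of a policy or the query interface.

```python
from typing import Callable, Dict

Policy = Callable[[], None]  # assumption: a policy is a callable that repairs the fault


def handle_fault_then_start(
    fault: str,
    local_policies: Dict[str, Policy],      # policies stored locally in advance
    query_center: Callable[[str], Policy],  # stands in for steps A2 to A4
    start_first_agent: Callable[[], None],
) -> str:
    """Run steps A1 to A6 and report where the policy came from."""
    policy = local_policies.get(fault)      # step A1: local lookup
    source = "local"
    if policy is None:
        policy = query_center(fault)        # steps A2 to A4: ask the center
        source = "center"
    policy()                                # step A5: apply the policy
    start_first_agent()                     # step A6: start after handling
    return source
```

The alternative described above, in which the Surveillance center handles the fault itself, would replace the `query_center` call and the policy execution with a report to the center.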

In the second mode, 102B, the start of the first monitoring agent is managed by the Surveillance center alone or jointly by the Surveillance center and the second monitoring agent, as follows:

The first way is for the first monitoring agent to be started only by the Surveillance center, which specifically comprises:

The second monitoring agent reports to the Surveillance center the information that the first monitoring agent has stopped running; the Surveillance center starts the first monitoring agent according to the information that the first monitoring agent has stopped running.

In this way, whenever the first monitoring agent stops running, the second monitoring agent sends information to the Surveillance center to trigger the flow in which the Surveillance center starts the first monitoring agent. In the prior art, the Surveillance center could only conclude that the first monitoring agent had stopped after failing to receive information from it for a period of time. By comparison, here the Surveillance center learns promptly that the first monitoring agent has stopped and can handle the fault quickly, which shortens the time for which the first monitoring agent is stopped; the operations required of the second monitoring agent are also simple.
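
The report-and-start exchange can be sketched as below. The JSON message layout, the field names, and the `SurveillanceCenter` class are hypothetical; the source does not specify a message format or transport.

```python
import json
from typing import Callable


def build_stop_report(node_id: str, agent_name: str) -> str:
    """Message the second monitoring agent sends when the first agent stops."""
    return json.dumps({"node": node_id, "agent": agent_name, "event": "stopped"})


class SurveillanceCenter:
    """Hypothetical center that starts an agent when a stop report arrives."""

    def __init__(self, start_agent: Callable[[str, str], None]) -> None:
        self._start_agent = start_agent  # assumed remote-start routine

    def on_report(self, raw: str) -> bool:
        report = json.loads(raw)
        if report.get("event") != "stopped":
            return False                 # not a stop report, nothing to start
        self._start_agent(report["node"], report["agent"])
        return True
```

Because the report is pushed as soon as the stop is detected, the center does not have to wait for a silence period before acting.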

The second way is for the Surveillance center and the second monitoring agent to jointly manage the start of the first monitoring agent, which specifically comprises:

The second monitoring agent checks whether the pre-stored information on faults that need handling includes the fault that occurred on the first monitoring agent;

If the fault that occurred on the first monitoring agent is not found in the information on faults that need handling, the second monitoring agent starts the first monitoring agent directly;

If the fault that occurred on the first monitoring agent is found in the information on faults that need handling, the second monitoring agent notifies the Surveillance center to handle the fault; the Surveillance center handles the fault and, after fault handling is complete, starts the first monitoring agent.

Because some faults disappear once the first monitoring agent is restarted, the second monitoring agent preferably first judges whether the fault that occurred on the first monitoring agent needs to be handled, in order to shorten the handling time for the first monitoring agent. If it does, the second monitoring agent reports to the Surveillance center the information that the first monitoring agent has stopped running; otherwise, the second monitoring agent starts the first monitoring agent directly. As a result, faults that disappear after a restart are dealt with directly by the second monitoring agent, which reduces the number of reports sent by the second monitoring agent and the processing load on the Surveillance center.
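
The division of work in the joint variant can be sketched as a planning function that returns who does what. The function name and the step labels are illustrative assumptions only.

```python
from typing import Collection, List


def plan_recovery(fault: str, faults_needing_handling: Collection[str]) -> List[str]:
    """Return the ordered recovery steps and the party performing each one."""
    if fault not in faults_needing_handling:
        # The fault disappears after a restart: no report to the center is needed.
        return ["second agent: start first agent"]
    # The fault persists across restarts: hand it over to the center.
    return [
        "second agent: notify Surveillance center",
        "Surveillance center: handle fault",
        "Surveillance center: start first agent",
    ]
```

Only faults on the pre-stored list generate traffic to the Surveillance center, which is the reduction in reports and processing load described above.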

Note that in practice, when the monitored resource node comprises multiple monitoring agents, it is sufficient for a monitoring agent to be monitored by at least one of the remaining monitoring agents. For example, when there are two monitoring agents on the monitored resource node, the two can monitor each other's running status.

Of course, the multiple monitoring agents on a monitored resource node may perform the same agent function or different ones. For example, one monitoring agent may be responsible for providing interfaces for obtaining various information, while another is responsible for monitoring how the system is running.
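
Mutual monitoring between two agents on one node can be sketched as follows. The `Agent` class and its in-memory `running` flag are simplifications assumed for the example; a real agent would check and start an operating system process or service.

```python
from typing import Optional


class Agent:
    """Hypothetical monitoring agent that can watch one peer agent."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True
        self.peer: Optional["Agent"] = None

    def watch(self, peer: "Agent") -> None:
        self.peer = peer

    def check_peer(self) -> bool:
        """Start the peer if it has stopped; return True if a start was triggered."""
        if not self.running or self.peer is None:
            return False          # a stopped agent cannot monitor anything
        if not self.peer.running:
            self.peer.running = True
            return True
        return False
```

With `a.watch(b)` and `b.watch(a)`, either agent can bring the other back, as long as the two do not stop at the same time.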

Fig. 2 is a schematic structural diagram of an embodiment of the fault processing system for a monitoring agent provided by the invention. With reference to the method embodiment shown in Fig. 1, the system shown in Fig. 2 comprises a monitored resource node that comprises a first monitoring agent and a second monitoring agent, wherein the agent functions of the first monitoring agent and the second monitoring agent are different, and the second monitoring agent comprises:

Fig. 3 is a schematic structural diagram of the processing unit in the system embodiment shown in Fig. 2. The processing unit shown in Fig. 3 comprises:

Fig. 4 is another schematic structural diagram of the processing unit shown in Fig. 3. The processing unit shown in Fig. 4 further comprises:

Fig. 5 is another schematic structural diagram of the system embodiment shown in Fig. 2. The system shown in Fig. 5 is as follows:

The processing unit comprises:

The system further comprises:

Fig. 6 is yet another schematic structural diagram of the system embodiment shown in Fig. 2. The system shown in Fig. 6 is as follows:

The processing unit comprises:

The system further comprises:

In the system embodiments provided by the invention, the second monitoring agent monitors the running status of the first monitoring agent and, when the first monitoring agent stops running, triggers the flow that starts it. This shortens both the time needed to discover a fault of the first monitoring agent and the time needed to start it again, and so safeguards normal communication of information.

A person of ordinary skill in the art will appreciate that all or some of the steps of the above embodiments can be implemented as a computer program flow. The computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, unit, or device); when executed, it performs one of the steps of the method embodiment or a combination of them.

Alternatively, all or some of the steps of the above embodiments can be implemented with integrated circuits: the steps can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.

Each device, functional module, and functional unit in the above embodiments can be implemented on general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices.

When the devices, functional modules, and functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above is only a specific embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A fault handling method for a monitoring agent, characterized in that a monitored resource node comprises a first monitoring agent and a second monitoring agent, wherein:

2. The method according to claim 1, characterized in that the second monitoring agent triggering the start flow of the first monitoring agent comprises:

3. The method according to claim 2, characterized in that handling the fault that occurred on the first monitoring agent comprises:

4. The method according to claim 1, characterized in that the second monitoring agent triggering the start flow of the first monitoring agent comprises:

5. The method according to claim 1, characterized in that the second monitoring agent triggering the start flow of the first monitoring agent comprises:

6. A fault processing system for a monitoring agent, characterized in that a monitored resource node comprises a first monitoring agent and a second monitoring agent, wherein the second monitoring agent comprises:

7. The system according to claim 6, characterized in that the processing unit comprises:

8. The system according to claim 7, characterized in that the processing unit further comprises:

9. The system according to claim 6, characterized in that:

The processing unit comprises:

The system further comprises:

10. The system according to claim 6, characterized in that:

The processing unit comprises:

The system further comprises: