CN109062723A

CN109062723A - Method and apparatus for handling server faults

Info

Publication number: CN109062723A
Application number: CN201810960589.8A
Authority: CN
Inventors: 赵阳阳
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2018-08-22
Filing date: 2018-08-22
Publication date: 2018-12-21

Abstract

The invention discloses a method and apparatus for handling server faults. The method includes: when a fault is detected in a hard disk of a server, determining the fault type of the disk; judging whether the fault type is in a locally pre-stored list of recoverable faults; and, if the fault type is in the list of recoverable faults, obtaining the resolution strategy corresponding to the fault type and executing that resolution strategy.

Description

Method and apparatus for handling server faults

Technical field

The present invention relates to the field of information processing, and in particular to a method and apparatus for handling server faults.

Background art

A server is a computer that manages resources and provides services to users. Because a server must respond to and process service requests, it generally needs the ability to carry and guarantee those services. A server consists of a processor, hard disks, memory, a system bus, and so on, much like a general-purpose computer architecture, but because it must provide highly reliable service, the requirements on processing capacity, stability, reliability, security, scalability, and related aspects are higher. In a network environment, servers are classified by the type of service they provide into file servers, database servers, application servers, web servers, and so on.

With the rapid development of data centers, large-scale server storage has become one of their main features. Monitoring server hard disks is therefore particularly important: adaptive recovery from disk faults, together with fast fault location and reporting, helps server administrators respond in time and reduces losses.

In current server hard disk monitoring, the specific cause of a disk fault is not analyzed and repaired in time; instead, the fault is analyzed manually, and the disk is then adjusted or replaced. If, for example, a parameter problem puts the disks of every server into a faulty state, then parameter analysis, troubleshooting, or disk replacement must be carried out on each server, which costs considerable time and resources.

Summary of the invention

To solve the above technical problem, the present invention provides a method and apparatus for handling server faults that can improve the efficiency of handling server hard disk faults.

To achieve the object of the invention, the present invention provides a method for handling server faults, characterized in that it includes:

when a fault is detected in a hard disk of a server, determining the fault type of the disk;

judging whether the fault type is in a locally pre-stored list of recoverable faults;

if the fault type is in the list of recoverable faults, obtaining the resolution strategy corresponding to the fault type, and executing the resolution strategy of the fault type.

The method further has the following characteristic: obtaining the resolution strategy corresponding to the fault type and executing the resolution strategy of the fault type includes:

searching a preset fault recovery knowledge base for the processing strategy corresponding to the fault type;

obtaining the right to use the fault recovery tool recorded in the processing strategy;

running the fault recovery tool according to the processing strategy.

The method further has the following characteristic: after the resolution strategy corresponding to the fault type has been obtained and executed, the method further includes:

detecting whether the fault of the disk has been cleared;

if the fault of the disk has been cleared, giving notice that the fault has been cleared; otherwise, raising an alarm for the fault.

The method further has the following characteristic: the method further includes:

if the fault type is not in the list of recoverable faults, raising an alarm for the fault type of the disk.

The method further has the following characteristic: after the alarm has been raised for the fault type of the disk, the method further includes:

when it is detected that the fault of the disk is being handled, recording the resolution strategy applied to the fault;

identifying the processing strategy and the disk recovery tool used in the resolution strategy;

adding the processing strategy in the resolution strategy of the fault type to the fault recovery knowledge base, and configuring the fault recovery tool in the resolution strategy.

To achieve the object of the invention, the present invention provides an apparatus for handling server faults, including:

a determining module, configured to determine the fault type of the disk when a fault is detected in a hard disk of a server;

a judgment module, configured to judge whether the fault type is in a locally pre-stored list of recoverable faults;

a processing module, configured to obtain the resolution strategy corresponding to the fault type and execute the resolution strategy of the fault type if the fault type is in the list of recoverable faults.

The apparatus further has the following characteristic: the processing module includes:

a searching unit, configured to search a preset fault recovery knowledge base for the processing strategy corresponding to the fault type;

an obtaining unit, configured to obtain the right to use the fault recovery tool recorded in the processing strategy;

a running unit, configured to run the fault recovery tool according to the processing strategy.

The apparatus further has the following characteristic: the apparatus further includes:

a detection module, configured to detect, after the resolution strategy of the fault type has been executed, whether the fault of the disk has been cleared;

a reporting module, configured to give notice that the fault has been cleared if the fault of the disk has been cleared, and otherwise to raise an alarm for the fault;

an alarm module, configured to raise an alarm for the fault type of the disk if the fault type is not in the list of recoverable faults;

a recording module, configured to record the resolution strategy applied to the fault when, after the alarm has been raised for the fault type of the disk, it is detected that the fault of the disk is being handled;

an identification module, configured to identify the processing strategy and the disk recovery tool used in the resolution strategy;

a management module, configured to add the processing strategy in the resolution strategy of the fault type to the fault recovery knowledge base, and to configure the fault recovery tool in the resolution strategy.

In the embodiment provided by the invention, when a fault is detected in a hard disk of a server, the fault type of the disk is determined, and it is judged whether the fault type is in a locally pre-stored list of recoverable faults, which establishes whether the device can handle the fault itself. If the fault type is in the list, the fault can be repaired by the device, so the locally associated resolution strategy is invoked and run to handle the fault. Faults are thereby handled in time, server maintenance costs are reduced, and the efficiency of handling server hard disk faults is improved.

Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by implementing the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the drawings.

Brief description of the drawings

The accompanying drawings are provided for further understanding of the technical solution of the present invention and constitute a part of the description. Together with the embodiments of the present application, they serve to explain the technical solution of the invention and do not limit it.

Fig. 1 is a flow chart of the method for handling server faults provided by the invention;

Fig. 2 is a schematic diagram of the server hard disk fault diagnosis and adaptive recovery alarm method provided by the invention;

Fig. 3 is a structural diagram of the apparatus for handling server faults provided by the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features within them may be combined with one another arbitrarily.

The steps shown in the flow charts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flow charts, in some cases the steps shown or described can be executed in a different order.

Fig. 1 is a flow chart of the method for handling server faults provided by the invention. The method shown in Fig. 1 includes:

Step 101: when a fault is detected in a hard disk of a server, determine the fault type of the disk;

Specifically, the disk state information of the server is monitored in-band by means of an agent: the basic information and state information of each disk are collected periodically, and it is judged whether either has become abnormal;

Step 102: judge whether the fault type is in a locally pre-stored list of recoverable faults;

Specifically, faults of server hard disks are diagnosed and classified in advance. After a hard disk fault occurs, the fault type and cause are analyzed using disk tools or disk logs; different fault types are handled differently, and recoverable faults are processed further;

Step 103: if the fault type is in the list of recoverable faults, obtain the resolution strategy corresponding to the fault type, and execute the resolution strategy of the fault type.

Specifically, fault handling is implemented by a fault recovery knowledge base and a fault handling executor. The fault recovery knowledge base is expanded dynamically and continuously: maintenance personnel's records of operations on faults are added to it. When a hard disk fault matches an entry in the fault recovery knowledge base, the fault handling executor executes the corresponding operations according to the record in the knowledge base.
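As an illustration only, steps 102 and 103 can be sketched in Python as follows. The sketch assumes that the fault type has already been determined in step 101 and that the list of recoverable faults is held as a mapping from fault type to a callable strategy. The function name `handle_disk_fault`, the fault names, and the recorded actions are hypothetical and do not come from the invention.

```python
from typing import Callable, Dict, List, Tuple

Strategy = Callable[[str], None]          # takes a disk name, performs recovery
Alarm = Callable[[str, str], None]        # takes (disk name, fault type)


def handle_disk_fault(disk: str, fault_type: str,
                      recoverable: Dict[str, Strategy], alarm: Alarm) -> str:
    """Steps 102 and 103 for a fault type already determined in step 101."""
    strategy = recoverable.get(fault_type)    # step 102: is it in the local list?
    if strategy is None:
        alarm(disk, fault_type)               # not recoverable locally: raise an alarm
        return "alarmed"
    strategy(disk)                            # step 103: execute the resolution strategy
    return "recovery_attempted"


# Illustrative data only; the fault names and actions are made up.
actions: List[Tuple[str, str]] = []
alarms: List[Tuple[str, str]] = []
recoverable = {"FS_READ_ONLY": lambda disk: actions.append(("remount", disk))}


def record_alarm(disk: str, fault_type: str) -> None:
    alarms.append((disk, fault_type))


first = handle_disk_fault("sda", "FS_READ_ONLY", recoverable, record_alarm)
second = handle_disk_fault("sdb", "HEAD_CRASH", recoverable, record_alarm)
```

Keeping the list as a plain mapping means that adding a new recoverable fault is a data change rather than a code change, which matches the dynamically expanded knowledge base described above.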

In the method embodiment provided by the invention, when a fault is detected in a hard disk of a server, the fault type of the disk is determined, and it is judged whether the fault type is in a locally pre-stored list of recoverable faults, which establishes whether the device can handle the fault itself. If the fault type is in the list, the fault can be repaired by the device, so the locally associated resolution strategy is invoked and run to handle the fault. Faults are thereby handled in time, server maintenance costs are reduced, and the efficiency of handling server hard disk faults is improved.

The method provided by the invention is described further below:

After it has been determined that the device can recover from the fault by itself, the operation of obtaining the resolution strategy corresponding to the fault type and executing the resolution strategy of the fault type can be carried out as follows:

searching a preset fault recovery knowledge base for the processing strategy corresponding to the fault type;

obtaining the right to use the fault recovery tool recorded in the processing strategy;

running the fault recovery tool according to the processing strategy.

Specifically, the processing strategy for each fault is stored in advance in the fault recovery knowledge base and is labeled with a fault code. When a hard disk fault is detected, the diagnostic tool, combined with the detection data of the hard disk, can determine the cause of the fault and the corresponding fault code. Once the fault code has been obtained, it can be used to look up the corresponding processing strategy in the fault recovery knowledge base. The processing strategy contains the fault handling steps and the fault recovery tool that each step requires. For each fault recovery tool to be used, the right to use the tool is requested from the system; once the right has been granted, the corresponding fault recovery tools are run according to the implementation steps to handle the fault.
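The lookup, authorization, and execution sequence described above can be sketched as follows. This is a minimal sketch under stated assumptions: the knowledge base is modeled as a dictionary keyed by fault code, and the system's grant of tool usage rights is modeled by a caller-supplied `request_access` function. The names `Step`, `execute_strategy`, `request_access`, `run_tool`, the fault code `E101`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str            # fault recovery tool this step needs
    args: List[str]      # arguments for that tool


def execute_strategy(fault_code: str,
                     knowledge_base: Dict[str, List[Step]],
                     request_access: Callable[[str], bool],
                     run_tool: Callable[[str, List[str]], None]) -> bool:
    """Look up a strategy by fault code, obtain tool rights, then run each step."""
    steps = knowledge_base.get(fault_code)
    if steps is None:
        return False                      # no strategy stored for this code
    # Request the right to use every tool before any step runs.
    for tool in sorted({step.tool for step in steps}):
        if not request_access(tool):
            return False
    for step in steps:                    # run the steps in their recorded order
        run_tool(step.tool, step.args)
    return True


# Illustrative knowledge base; codes, tools, and arguments are hypothetical.
kb = {"E101": [Step("fs_check", ["/dev/sdb1"]), Step("remount", ["/data"])]}
log: List[str] = []
ok = execute_strategy("E101", kb, lambda tool: True,
                      lambda tool, args: log.append(f"{tool} {' '.join(args)}"))
denied = execute_strategy("E101", kb, lambda tool: tool != "remount",
                          lambda tool, args: log.append("should not run"))
```

In this sketch all usage rights are requested before the first step runs, so a refused tool cannot leave the disk half repaired. The source text does not state when the rights are requested, so this ordering is a design choice of the sketch.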

In this way, faults can be handled automatically, which improves the efficiency of fault handling.

In troubleshooting, a fault usually has one or more possible processing strategies, and the applicable strategy differs with the actual situation. Therefore, after the resolution strategy corresponding to the fault type has been obtained and executed, the method further includes:

detecting whether the fault of the disk has been cleared;

Specifically, after the local attempt to clear the fault automatically, the processing result is obtained, and the fault is handled according to that result. If the fault has been cleared, the resolution strategy is marked as suitable for this fault cause and is used preferentially the next time the fault occurs. Conversely, if the fault has not been cleared, the resolution strategy is marked as unsuitable for this fault cause and, to ensure that the fault is resolved, an alarm is issued prompting the user to handle it in time. If other resolution strategies exist for the fault cause, one of them may of course be selected instead.
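The marking and preference behavior described above can be sketched as follows. The sketch assumes that strategies are identified by name and that suitability marks are kept in a dictionary keyed by fault cause and strategy. The function `recover_with_fallback` and the strategy names are hypothetical. A return value of `None` stands for the case in which no strategy cleared the fault and an alarm must be raised.

```python
from typing import Callable, Dict, List, Optional, Tuple

Marks = Dict[Tuple[str, str], bool]   # (fault cause, strategy) -> suitable?


def recover_with_fallback(cause: str, strategies: List[str],
                          apply: Callable[[str], None],
                          is_cleared: Callable[[], bool],
                          marks: Marks) -> Optional[str]:
    """Try strategies until the fault clears; return the one that worked, else None."""
    def rank(strategy: str) -> int:
        mark = marks.get((cause, strategy))
        if mark is True:
            return 0                      # previously suitable: try first
        return 1 if mark is None else 2   # untried next, known unsuitable last

    for strategy in sorted(strategies, key=rank):
        apply(strategy)
        if is_cleared():                  # detect whether the fault has been cleared
            marks[(cause, strategy)] = True
            return strategy
        marks[(cause, strategy)] = False
    return None                           # caller should raise an alarm


# Illustrative run; only the hypothetical "remount" strategy clears the fault.
tried: List[str] = []
state = {"cleared": False}


def apply(strategy: str) -> None:
    tried.append(strategy)
    state["cleared"] = (strategy == "remount")


marks: Marks = {}
first = recover_with_fallback("FS_READ_ONLY", ["rescan", "remount"],
                              apply, lambda: state["cleared"], marks)
state["cleared"] = False                  # the same fault occurs again later
second = recover_with_fallback("FS_READ_ONLY", ["rescan", "remount"],
                               apply, lambda: state["cleared"], marks)
```

On the second occurrence the strategy marked as suitable is tried first, so the unsuitable one is not run again.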

If the fault type is not in the list of recoverable faults, an alarm is raised for the fault type of the disk.

Specifically, if a hard disk fault occurs and, once its cause has been determined, it is found that the fault cannot be recovered automatically, an alarm must be raised in time, and the collected description of the fault and the log data of the hard disk are sent to the user for reference.
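One possible shape for such an alarm message is sketched below. The source text does not specify a message format, so the JSON encoding, the field names, and the function `build_alarm` are assumptions made for illustration.

```python
import json
from typing import Dict, List


def build_alarm(disk: str, fault_code: str, description: str,
                log_lines: List[str], max_log_lines: int = 3) -> str:
    """Bundle the fault description with the most recent disk log lines."""
    tail = log_lines[-max_log_lines:] if max_log_lines > 0 else []
    payload: Dict[str, object] = {
        "disk": disk,
        "fault_code": fault_code,
        "description": description,
        "recoverable": False,             # alarms here are for unrecoverable faults
        "log_tail": tail,
    }
    return json.dumps(payload, sort_keys=True)


# Illustrative values only.
message = build_alarm("sdb", "E900", "disk no longer responds",
                      ["line 1", "line 2", "line 3", "line 4"])
```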

To improve the system's ability to handle faults automatically, after the alarm has been raised for the fault type of the disk, the method further includes:

when it is detected that the fault of the disk is being handled, recording the resolution strategy applied to the fault;

identifying the processing strategy and the disk recovery tool used in the resolution strategy;

adding the processing strategy in the resolution strategy of the fault type to the fault recovery knowledge base, and configuring the fault recovery tool in the resolution strategy.

Specifically, after an unrecoverable fault is detected, the user is notified to start the fault-clearing operation through a preset access page, so that the resolution strategy for the fault can be obtained more easily. The user's operations are recorded, including the commands entered and the recovery tools used. Once these operations have been obtained, the information is recorded so that similar faults occurring later can be handled automatically.

Because the resolution strategy for the fault has been added to the fault recovery knowledge base, the fault is now a recoverable fault, so its fault code is added to the list of recoverable faults.
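The learning step described above can be sketched as follows, assuming that the recorded operator behavior is available as a list of commands together with the tools they used. The names `OperatorAction` and `learn_from_operator`, the fault code `E900`, and the commands shown are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class OperatorAction:
    command: str     # command the maintainer entered
    tool: str        # recovery tool that command used


def learn_from_operator(fault_code: str, actions: List[OperatorAction],
                        knowledge_base: Dict[str, List[str]],
                        configured_tools: Set[str],
                        recoverable_codes: Set[str]) -> bool:
    """Store the operator's fix and promote the fault code to recoverable."""
    if not actions:
        return False                                  # nothing was recorded
    knowledge_base[fault_code] = [a.command for a in actions]
    configured_tools.update(a.tool for a in actions)  # make the tools available
    recoverable_codes.add(fault_code)                 # extend the recoverable list
    return True


# Illustrative data only.
kb: Dict[str, List[str]] = {}
tools: Set[str] = set()
codes: Set[str] = {"E101"}
learned = learn_from_operator(
    "E900",
    [OperatorAction("fs_check /dev/sdb1", "fs_check"),
     OperatorAction("remount /data", "remount")],
    kb, tools, codes)
```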

Fig. 2 is a schematic diagram of the server hard disk fault diagnosis and adaptive recovery alarm method provided by the invention. The method shown in Fig. 2 is implemented as follows:

First, the disk state is monitored. When the performance or state of a server hard disk becomes abnormal, the corresponding state and information of the hard disk are sent to the fault diagnosis module.

Then, the detailed fault information of the hard disk is analyzed, and the fault is classified as recoverable or unrecoverable. The fault is also sent to the fault reporting module.

Next, the hard disk fault recovery knowledge base is searched using the fault information. If a match is found, recovery is carried out with the corresponding operations. If no match is found, the fault is sent to the fault reporting module and handled by maintenance personnel, whose operations are recorded and used to update the hard disk fault recovery knowledge base.

Finally, the fault reporting module sends the fault to maintenance personnel, and their handling operations are synchronized to the hard disk fault recovery knowledge base. This achieves automated monitoring, alarm management, automatic fault diagnosis, and recovery for all server hard disks.

In summary, the method provided by the invention analyzes hard disk faults in detail and, when a hard disk fails, can notify maintenance personnel of the detailed fault information. It also performs adaptive fault recovery: the fault is matched against the fault recovery knowledge base to carry out further operations on the hard disk, and the disk is either reported as unrecoverable or restored to normal. This is of high technical value in monitoring and managing hard disks across large numbers of servers.

Fig. 3 is a structural diagram of the apparatus for handling server faults provided by the invention. The apparatus shown in Fig. 3 includes:

a determining module 301, configured to determine the fault type of the disk when a fault is detected in a hard disk of a server;

a judgment module 302, configured to judge whether the fault type is in a locally pre-stored list of recoverable faults;

a processing module 303, configured to obtain the resolution strategy corresponding to the fault type and execute the resolution strategy of the fault type if the fault type is in the list of recoverable faults.

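In an apparatus embodiment provided by the invention, the processing module 303 includes:

a searching unit, configured to search a preset fault recovery knowledge base for the processing strategy corresponding to the fault type;

an obtaining unit, configured to obtain the right to use the fault recovery tool recorded in the processing strategy;

a running unit, configured to run the fault recovery tool according to the processing strategy.

In an apparatus embodiment provided by the invention, the apparatus further includes:

a detection module, configured to detect, after the resolution strategy of the fault type has been executed, whether the fault of the disk has been cleared;

a reporting module, configured to give notice that the fault has been cleared if the fault of the disk has been cleared, and otherwise to raise an alarm for the fault;

an alarm module, configured to raise an alarm for the fault type of the disk if the fault type is not in the list of recoverable faults;

a recording module, configured to record the resolution strategy applied to the fault when, after the alarm has been raised for the fault type of the disk, it is detected that the fault of the disk is being handled;

an identification module, configured to identify the processing strategy and the disk recovery tool used in the resolution strategy;

a management module, configured to add the processing strategy in the resolution strategy of the fault type to the fault recovery knowledge base, and to configure the fault recovery tool in the resolution strategy.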

In the apparatus embodiment provided by the invention, when a fault is detected in a hard disk of a server, the fault type of the disk is determined, and it is judged whether the fault type is in a locally pre-stored list of recoverable faults, which establishes whether the device can handle the fault itself. If the fault type is in the list, the fault can be repaired by the device, so the locally associated resolution strategy is invoked and run to handle the fault. Faults are thereby handled in time, server maintenance costs are reduced, and the efficiency of handling server hard disk faults is improved.

An application example of the apparatus provided by the invention is described below:

A server hard disk fault diagnosis and adaptive recovery alarm apparatus is proposed, which can effectively solve the above problem.

The overall framework of the invention is divided into four modules: a hard disk performance monitoring module, a hard disk fault diagnosis module, a hard disk adaptive recovery module, and a hard disk fault reporting module.

The hard disk performance monitoring module monitors the disk state information of the server in-band by means of an agent. Hard disk monitoring periodically collects the basic information and state information of each disk and judges whether either has become abnormal.
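The periodic collection and abnormality check performed by the monitoring module can be sketched as follows. The sketch assumes that each sample is a dictionary of string fields, that the field `state` carries the disk state, and that all other fields are basic information expected to stay equal to a known baseline. The field names and the functions `find_abnormal` and `monitor` are hypothetical. How the agent actually reads the disk is outside the sketch and is supplied by the caller as `collect`.

```python
import time
from typing import Callable, Dict, List

Sample = Dict[str, str]   # hypothetical fields, e.g. "serial" and "state"


def find_abnormal(baseline: Dict[str, Sample],
                  current: Dict[str, Sample]) -> List[str]:
    """Return disks whose basic information changed or whose state is not OK."""
    abnormal = []
    for disk, expected in baseline.items():
        sample = current.get(disk)
        if sample is None:
            abnormal.append(disk)            # disk is no longer visible
            continue
        basic_changed = any(sample.get(key) != value
                            for key, value in expected.items() if key != "state")
        if basic_changed or sample.get("state") != "OK":
            abnormal.append(disk)
    return abnormal


def monitor(collect: Callable[[], Dict[str, Sample]],
            baseline: Dict[str, Sample],
            on_abnormal: Callable[[str], None],
            interval_s: float, rounds: int,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Collect samples at a fixed interval and report abnormal disks."""
    for _ in range(rounds):
        for disk in find_abnormal(baseline, collect()):
            on_abnormal(disk)                # hand over to fault diagnosis
        sleep(interval_s)


# Illustrative run with two canned samples and no real waiting.
baseline = {"sda": {"serial": "A1", "state": "OK"},
            "sdb": {"serial": "B2", "state": "OK"}}
samples = iter([
    baseline,
    {"sda": {"serial": "A1", "state": "OK"},
     "sdb": {"serial": "B2", "state": "DEGRADED"}},
])
reported: List[str] = []
monitor(lambda: next(samples), baseline, reported.append,
        interval_s=60.0, rounds=2, sleep=lambda seconds: None)
```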

The hard disk fault diagnosis module diagnoses and classifies the faults of server hard disks. It analyzes the fault type and cause using disk tools or disk logs, and different fault types are handled differently: unrecoverable faults are reported directly to the hard disk fault reporting module, while recoverable faults are sent to the hard disk adaptive recovery module for further processing.
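Log-based classification and routing of this kind can be sketched as follows. The rule table, the log wording, the fault type names, and the functions `diagnose` and `route` are hypothetical; the log strings are invented for illustration and are not claimed to match the output of any real disk tool. The sketch also assumes that a fault matching no rule is treated as unrecoverable and reported.

```python
import re
from typing import List, Tuple

# Hypothetical rule table: (pattern over a log line, fault type, recoverable?)
RULES = [
    (re.compile(r"filesystem .* read-only", re.IGNORECASE), "FS_READ_ONLY", True),
    (re.compile(r"unrecovered read error", re.IGNORECASE), "MEDIA_ERROR", False),
]


def diagnose(log_lines: List[str]) -> Tuple[str, bool]:
    """Return (fault type, recoverable) for the first log line matching a rule."""
    for line in log_lines:
        for pattern, fault_type, recoverable in RULES:
            if pattern.search(line):
                return fault_type, recoverable
    return "UNKNOWN", False               # unmatched faults are reported, not repaired


def route(log_lines: List[str]) -> str:
    """Name the module that should receive the diagnosed fault."""
    _, recoverable = diagnose(log_lines)
    return "adaptive_recovery" if recoverable else "fault_reporting"


# Illustrative calls with made-up log lines.
target_a = route(["Filesystem sdb1 remounted read-only"])
target_b = route(["sdc: Unrecovered read error at sector 12"])
```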

The hard disk adaptive recovery module is divided into a fault recovery knowledge base and a fault handling executor. The fault recovery knowledge base is expanded dynamically and continuously: maintenance personnel's records of operations on faults are added to it. When a hard disk fault matches an entry in the fault recovery knowledge base, the fault handling executor executes the corresponding operations according to the record in the knowledge base.

The hard disk fault reporting module reports alarm information after receiving messages from the hard disk fault diagnosis module and the hard disk adaptive recovery module.

With the apparatus of the invention, faults of server hard disks can be diagnosed, analyzed, and recovered automatically, which lowers the technical threshold for server administration, improves automated diagnosis and recovery capability, reduces operation and maintenance costs, and improves management efficiency.

Those of ordinary skill in the art will appreciate that all or some of the steps of the above embodiments can be implemented as a computer program flow. The computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, unit, or device); when executed, it performs one of the steps of the method embodiment or a combination thereof.

Optionally, all or some of the steps of the above embodiments can also be implemented with integrated circuits: the steps can be made into individual integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.

Each device, functional module, or functional unit in the above embodiments can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices.

When a device, functional module, or functional unit in the above embodiments is implemented as a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.

The above is merely a specific embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A method for handling server faults, characterized by comprising:

judging whether the fault type is in a locally pre-stored list of recoverable faults;

if the fault type is in the list of recoverable faults, obtaining the resolution strategy corresponding to the fault type, and executing the resolution strategy of the fault type.

2. The method according to claim 1, characterized in that obtaining the resolution strategy corresponding to the fault type and executing the resolution strategy of the fault type comprises:

running the fault recovery tool according to the processing strategy.

3. The method according to claim 1, characterized in that, after the resolution strategy corresponding to the fault type has been obtained and executed, the method further comprises:

detecting whether the fault of the disk has been cleared;

if the fault of the disk has been cleared, giving notice that the fault has been cleared; otherwise, raising an alarm for the fault.

4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:

if the fault type is not in the list of recoverable faults, raising an alarm for the fault type of the disk.

5. The method according to claim 4, characterized in that, after the alarm has been raised for the fault type of the disk, the method further comprises:

adding the processing strategy in the resolution strategy of the fault type to the fault recovery knowledge base, and configuring the fault recovery tool in the resolution strategy.

6. An apparatus for handling server faults, characterized by comprising:

a processing module, configured to obtain the resolution strategy corresponding to the fault type and execute the resolution strategy of the fault type if the fault type is in the list of recoverable faults.

7. The apparatus according to claim 6, characterized in that the processing module comprises:

a searching unit, configured to search a preset fault recovery knowledge base for the processing strategy corresponding to the fault type;

an obtaining unit, configured to obtain the right to use the fault recovery tool recorded in the processing strategy;

8. The apparatus according to claim 6, characterized in that the apparatus further comprises:

a detection module, configured to detect, after the resolution strategy of the fault type has been executed, whether the fault of the disk has been cleared;

a reporting module, configured to give notice that the fault has been cleared if the fault of the disk has been cleared, and otherwise to raise an alarm for the fault.

9. The apparatus according to any one of claims 6 to 8, characterized in that the apparatus further comprises:

an alarm module, configured to raise an alarm for the fault type of the disk if the fault type is not in the list of recoverable faults.

10. The apparatus according to claim 9, characterized in that the apparatus further comprises:

a recording module, configured to record the resolution strategy applied to the fault when, after the alarm has been raised for the fault type of the disk, it is detected that the fault of the disk is being handled;

a management module, configured to add the processing strategy in the resolution strategy of the fault type to the fault recovery knowledge base, and to configure the fault recovery tool in the resolution strategy.