CN111444032A

CN111444032A - Computer system fault repairing method, system and equipment

Info

Publication number: CN111444032A
Application number: CN202010142058.5A
Authority: CN
Inventors: 林达志
Original assignee: Wuxi Huayun Data Technology Service Co Ltd
Current assignee: Wuxi Huayun Data Technology Service Co Ltd
Priority date: 2020-03-04
Filing date: 2020-03-04
Publication date: 2020-07-24

Abstract

The invention provides a method, a system and equipment for repairing faults of a computer system, wherein the method comprises the following steps: acquiring, by a monitoring system, fault information of a monitored object; acquiring identity information of the monitored object contained in the fault information; calling an asset list in an asset management system and formatting the fault information; calling, by a repair center, the service type matching the formatted fault information from a predefined execution tool library to form a fault repair scheme corresponding to the fault information; calling, by a decision center, the fault repair scheme and triggering, through configured judgment logic, an event notifying manual repair; and calling a manual repair scheme to repair the fault of the monitored object. The invention effectively differentiates fault types, reduces unnecessary manual intervention, automatically repairs common faults and various special faults, and improves the working efficiency of operation and maintenance personnel.

Description

Computer system fault repairing method, system and equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a computer system fault repairing method, a computer system fault repairing system, and an apparatus.

Background

To ensure the normal operation of a server, its system performance and hardware state need to be monitored in real time. If the server fails, it needs to be debugged so that it can continue to provide good service to the upper layer. At present, operation and maintenance personnel are usually notified through fault alerts such as emails and short messages, and then log in to the back end to clear the fault. In practice, however, the operation and maintenance personnel cannot accurately identify the fault of the server, which delays the repair and affects normal operation of the server, and there is also a risk of repair errors when the operation and maintenance personnel repair the fault manually. In the present application, the term "server" is understood as a subordinate concept of the term "computer" or "computer system".

Meanwhile, the prior art contains no reports on how to self-repair a fault or how to feed the repair result back to operation and maintenance personnel. As services are upgraded, server fault types become more diverse, for example network faults, data loss caused by power failure, data inconsistency, and process loss. Faced with such varied faults, operation and maintenance personnel cannot always adopt the most appropriate troubleshooting strategy and configuration means, so troubleshooting is less effective than desired and takes a long time.

Chinese patent publication No. CN110347525A discloses a fault handling method and apparatus. That prior art claims that, after receiving fault alarm information, it can determine the repair execution mode corresponding to the fault identification information according to the fault repair strategy corresponding to that fault identification information. However, the applicant considers that this fault handling method is only claimed functionally: it does not describe how to generate a fault processing list for the fault identification information in the fault alarm information, or how to determine a fault repair strategy and a repair execution mode according to the fault type.

In view of the above, there is a need for an improved computer system fault recovery method in the prior art to solve the above problems.

Disclosure of Invention

The invention aims to disclose a computer system fault repairing method, a repairing system and equipment thereof, which are used for solving a plurality of defects in an operation and maintenance processing means corresponding to computer faults in the prior art, in particular to realize intelligent automatic processing when faults occur and reduce manual intervention of operation and maintenance personnel, so that the operation and maintenance personnel can quickly locate fault generation reasons and automatically issue fault removal strategies, automatic repairing of common faults and various special faults is realized, the operation and maintenance efficiency is greatly improved, and the user experience is improved.

To achieve the first object, the present invention first provides a computer system fault repairing method, which includes the following steps:

acquiring fault information of a monitored object by a monitoring system;

acquiring identity information of the monitored object contained in the fault information;

calling an asset list in the asset management system and formatting the fault information;

the repair center calls the service type matched with the formatted fault information from a predefined execution tool library to form a fault repair scheme corresponding to the fault information;

the decision center calls the fault repair scheme and, through configured judgment logic, triggers an event notifying manual repair;

and calling a manual repair scheme to repair the fault of the monitored object.

As a further improvement of the invention, the predefined execution tool library is pre-configured with repair scripts and automated operation and maintenance tools for repairing the faults indicated by the fault information.

As a further improvement of the invention, the automated operation and maintenance tool comprises Puppet, Ansible or SaltStack.

As a further improvement of the present invention, the repair center is configured with judgment logic to discard the fault information when no service type matching the formatted fault information can be called from the predefined execution tool library.

As a further improvement of the present invention, after invoking the manual repair scheme to perform fault repair on the monitored object, the method further includes:

outputting a fault repair event, wherein

the fault repair event comprises fault repair operation content, a fault repair log or a state log of the monitored object;

and notifying the administrator of the fault repair event.

As a further improvement of the present invention, the decision center is configured with judgment logic for judging the risk contained in the fault information, and determines to trigger an event notifying manual repair when the risk of the fault information corresponding to the fault repair scheme exceeds a risk threshold configured by the decision center;

and the manual repair notification is sent to the administrator through an instant messaging tool, an email or a short message.

As a further improvement of the invention, the asset management system embeds a visual interface that displays the identity information of the monitored object;

the asset management system is configured with an interface for an administrator to customize the fault information, so that the fault information can be modified in or deleted from the asset management system through the interface.

Based on the same inventive concept, the present application also discloses a computer system fault recovery system, comprising:

the fault acquisition system acquires the fault information of the monitored object through the monitoring system;

the asset management system acquires the identity information of the monitored object contained in the fault information, calls an asset list in the asset management system and formats the fault information;

the repair center calls the service type matched with the formatted fault information from the predefined execution tool library to form a fault repair scheme corresponding to the fault information;

the decision center calls the fault repair scheme and, through the configured judgment logic, triggers an event notifying manual repair;

and the notification system receives an event which is issued by the decision center and used for notifying manual repair so as to notify an administrator to issue a manual repair instruction.

As a further improvement of the invention, the system further comprises a dispatch center; the decision center is configured with judgment logic for judging the risk contained in the fault information, and determines to trigger an event notifying manual repair when the risk of the fault information corresponding to the fault repair scheme exceeds a risk threshold configured by the decision center; and when that risk does not exceed the risk threshold configured by the decision center, the dispatch center issues the fault repair scheme to the monitored object.

As a further improvement of the invention, the system further comprises an event center, which receives the fault repair event output by the dispatch center and notifies the administrator of the fault repair event,

wherein,

the fault repair event comprises fault repair operation content, a fault repair log or a state log of the monitored object.

Finally, the present application also discloses an apparatus comprising:

one or more processors;

a memory for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the computer system fault repairing method disclosed in any one of the foregoing aspects of the invention.

Compared with the prior art, the invention has the beneficial effects that:

the computer system fault repairing method disclosed by the invention can automatically issue the fault repairing method to common faults or faults with smaller risk, realizes effective differentiation of fault types, reduces unnecessary manual intervention, realizes automatic repair of the common faults and various special faults, greatly improves operation and maintenance efficiency, obviously improves the maintenance effect of an administrator on monitored objects such as a computer system, improves the fine granularity of fault repairing operation, and finally improves user experience.

Drawings

FIG. 1 is a general flow chart of a computer system fault recovery method according to the present invention;

FIG. 2 is a detailed flowchart of a computer system fault recovery system according to the present invention;

FIG. 3 is a topology diagram of a computer system fault repair system of the present invention;

FIG. 4 is a topology diagram of the computer system fault repair system shown in FIG. 3 configured with an interface for administrator 51 to customize fault information;

FIG. 5 is a schematic diagram of the visual interface embedded in the asset management system, displaying identity information of the monitored object;

FIG. 6 is an example diagram illustrating a computer system fault repair system according to the present invention repairing an OSD down alarm occurring in a Ceph storage cluster;

FIG. 7 is a topology diagram of an apparatus of the present invention.

Detailed Description

The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.

The term "logic" includes any physical and tangible functions for performing a task. For example, each operation illustrated in the flowcharts corresponds to a logical component for performing the operation. Operations may be performed using, for example, software running on a computer device, hardware (e.g., chip-implemented logic functions), etc., and/or any combination thereof. When implemented by a computing device, the logical components represent electrical components that are physical parts of the computer system, regardless of the manner in which they are implemented.

The phrase "is configured as" or the phrase "is configured to" includes any manner in which any kind of physical and tangible functionality may be constructed to perform the identified operations. The functionality may be configured to use software running on a computer device, for example.

In the present application, the term "operation and maintenance person" has the technical meaning equivalent to the term "administrator".

The first embodiment is as follows:

referring to fig. 1, fig. 2, and fig. 6, the present embodiment first discloses a computer system fault repairing method (hereinafter, referred to as "method") for automatically repairing and/or manually repairing various types of faults generated during the operation of a computer system.

The method includes the following steps S1 to S6.

Step S1, acquiring, by the monitoring system, the fault information of the monitored object.

The monitoring system 90 monitors the monitored object 200 periodically or in real time; the monitoring system 90 may be a Prometheus component or a Zabbix component. The monitored object 200 may be a cloud host, a physical machine, a computer, a cluster server, messaging middleware, or an application (or program). When the monitored object 200 cannot respond to a user's access request to the monitored object 200 because of a network failure, a hardware failure, a system error, or the like, the monitoring system 90 detects the abnormality and acquires the alarm information.

Step S2, obtaining the identity information of the monitored object contained in the fault information.

The acquired identity information of the monitored object can be verified with the attribute of the monitored object pre-configured in the asset management system 20 in the subsequent processing process, so that a corresponding fault repairing scheme is matched for specific alarm information, and the accuracy of automatically repairing the fault is improved.

Step S3, calling the asset list in the asset management system 20 and formatting the failure information. Referring to fig. 5, the asset list in this embodiment may be understood as a collection of various attributes of the monitored object 200 in a specific application scenario, and such a collection may be presented in various forms such as a visualization form, a dynamic web page, a database, a file system, and the like. The asset list at least comprises three attribute definitions of a fault object, fault content and a fault label; wherein, the fault object is used to describe the type of the monitored object 200, such as client, operation content, cluster name; the fault content is used to describe specific fault information when the monitored object 200 fails, such as an OSD down alarm, a synchronization failure of data copies among HDFS cluster nodes, a VNC open failure, and the like. The fault label defines abbreviations or an index relation corresponding to different types of faults, so that the simplicity and the accuracy of calling the fault repairing scheme are improved when the fault repairing scheme is issued at the later stage.

For example, in an OpenStack-based cloud platform, a hostname that is written without a corresponding DNS record cannot be resolved (the cause), which results in a VNC open failure (the fault). Therefore, a fault repair scheme for solving "hostname written without DNS cannot be resolved" is packaged in advance in the predefined execution tool library 40. The fault repair scheme may then be given a unique identifier or tag in the fault label column of the asset list shown in FIG. 5. The "fault object", "fault content" and "fault label" of each row may be in one-to-one correspondence. Thus, the identity information of the monitored object is obtained, and formatted alarm information is formed and sent to the repair center 30, so that the repair center 30 can accurately match the service type corresponding to the alarm information from the predefined execution tool library 40 and judge whether the service type for repairing the specific alarm information is contained in the asset list of the asset management system 20, thereby quickly determining the fault repair scheme corresponding to the alarm information.
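The asset list lookup and formatting described for step S3 can be sketched as follows. This is a minimal illustration only: the patent does not specify a data format, so the `AssetEntry` fields, the label strings, the sample rows and the `format_fault` function are all hypothetical names chosen to mirror the three attributes (fault object, fault content, fault label).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssetEntry:
    """One row of the asset list (hypothetical layout)."""
    fault_object: str   # what is monitored, e.g. a cluster name
    fault_content: str  # the specific fault, e.g. "OSD down"
    fault_label: str    # short tag used later to look up a repair scheme


# Hypothetical asset list; each row pairs object, content and label one-to-one.
ASSET_LIST = [
    AssetEntry("ceph-cluster-01", "OSD down", "CEPH_OSD_DOWN"),
    AssetEntry("openstack-cloud", "VNC open failure", "VNC_OPEN_FAIL"),
]


def format_fault(raw_alert: dict) -> Optional[dict]:
    """Check a raw alert against the asset list and return a formatted record.

    Returns None when the alert does not correspond to any row.
    """
    for entry in ASSET_LIST:
        if (entry.fault_object == raw_alert.get("object")
                and entry.fault_content == raw_alert.get("content")):
            return {
                "fault_object": entry.fault_object,
                "fault_content": entry.fault_content,
                "fault_label": entry.fault_label,
                "host": raw_alert.get("host"),
            }
    return None
```

Calling `format_fault({"object": "ceph-cluster-01", "content": "OSD down", "host": "ceph01"})` yields a record carrying the label `CEPH_OSD_DOWN`, which is the kind of formatted message that would be passed on to the repair center 30.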

Referring to FIG. 4 and FIG. 5, the asset management system 20 embeds a visual interface 202 that displays the identity information of the monitored object; the asset management system 20 is configured with an interface 201 for the administrator 51 to customize the fault information, so that the fault information can be modified in or deleted from the asset management system 20 through the interface 201, and the attribute definitions of the fault object, the fault content and the fault label in the asset list can be adjusted, so as to meet the requirements of effectively judging different types of faults and targeting the repair operation.

Step S4, the repair center 30 calls the service type matching the formatted fault information from the predefined execution tool library 40 to form a fault repair scheme corresponding to the fault information. The predefined execution tool library 40 is pre-configured with repair scripts and automated operation and maintenance tools for repairing the faults indicated by the fault information. The automated operation and maintenance tool comprises Puppet, Ansible or SaltStack, and is most preferably Ansible. Ansible is an automated operation and maintenance tool developed in Python that combines the advantages of a number of operation and maintenance tools (Puppet, CFEngine, Chef, Func and Fabric) and implements functions such as batch system configuration, batch program deployment and batch command execution. At this point it is not yet determined whether the repair scheme requires the administrator 51 to intervene manually in troubleshooting; this is determined in the subsequent step S5. The repair center 30 is configured with judgment logic to discard the fault information when no service type matching the formatted fault information can be called from the predefined execution tool library 40. Some alarm information in a computer system such as a cloud platform (i.e., the monitored object 200) does not constitute a substantial fault, and therefore requires neither automatic repair nor manual repair by the administrator 51. For example, in a scenario in which a user shuts down a virtual machine on his or her own and the cloud platform applies specific lifecycle management to the virtual machine, treating the alarm triggered by the shutdown as a fault would significantly reduce the reliability of the computer system fault repairing system 100 based on the method during operation and cause unnecessary resource consumption.
Therefore, this technical solution improves the granularity of fault repair to a certain extent and prevents unnecessary early intervention and unnecessary manual repair operations by the administrator 51.
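The matching and discard behaviour of step S4 can be sketched as a simple lookup. The structure of the tool library is not given in the patent, so the `TOOL_LIBRARY` mapping, the service type names and the playbook file names below are assumptions made purely for illustration.

```python
from typing import Optional

# Hypothetical predefined execution tool library: fault label -> repair entry.
# The service types and playbook names are illustrative, not from the patent.
TOOL_LIBRARY = {
    "CEPH_OSD_DOWN": {"service_type": "storage", "playbook": "repair_osd.yml"},
    "VNC_OPEN_FAIL": {"service_type": "compute", "playbook": "fix_hostname_dns.yml"},
}


def build_repair_scheme(formatted_fault: dict) -> Optional[dict]:
    """Match a formatted fault to a service type and form a repair scheme.

    Returns None when nothing in the library matches, which models the
    repair center discarding alarms that are not real faults
    (for example, a virtual machine the user shut down deliberately).
    """
    entry = TOOL_LIBRARY.get(formatted_fault.get("fault_label"))
    if entry is None:
        return None  # no matching service type: discard
    return {
        "fault_label": formatted_fault["fault_label"],
        "service_type": entry["service_type"],
        "playbook": entry["playbook"],
        "target": formatted_fault.get("host"),
    }
```

Returning `None` rather than raising an error keeps alarms without a matching service type out of both the automatic and the manual repair paths, which is the effect the paragraph above describes.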

Step S5, the decision center 50 invokes a fault repair scheme, and triggers an event notifying manual repair through configured judgment logic.

The decision center 50 is configured with a judgment logic 401 for judging risks contained in the fault information, and determines to trigger an event for notifying manual repair when the risk corresponding to the fault information corresponding to the fault repair scheme exceeds a risk threshold configured by the decision center 50; and informing the administrator of the manual repair in the form of an instant messaging tool, an email or a short message. The notification behavior may be performed by notification system 60 shown in FIG. 3. Meanwhile, when the risk corresponding to the fault information corresponding to the fault repairing scheme does not exceed the risk threshold configured by the decision center 50, the decision center 50 notifies the dispatching center 70, and the dispatching center 70 issues the fault repairing scheme to the monitored object 200. The risk threshold values may be respectively defined according to the service types corresponding to the alarm information.
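The threshold comparison of step S5 can be sketched as below. The patent does not say how risk is quantified, so a numeric score between 0 and 1, the per-service-type threshold values and the returned action names are all assumptions.

```python
# Hypothetical risk thresholds, defined per service type as the text suggests.
RISK_THRESHOLDS = {"storage": 0.5, "compute": 0.7}
DEFAULT_THRESHOLD = 0.5  # assumed fallback for unlisted service types


def decide(scheme: dict, risk: float) -> str:
    """Return the action the decision center takes for a repair scheme.

    'notify_manual_repair'  -> the notification system alerts the administrator
    'dispatch_auto_repair'  -> the dispatch center issues the scheme to the target
    """
    threshold = RISK_THRESHOLDS.get(scheme["service_type"], DEFAULT_THRESHOLD)
    if risk > threshold:
        return "notify_manual_repair"
    return "dispatch_auto_repair"
```

With these assumed values, a storage fault scored 0.8 is routed to manual repair, while a compute fault scored 0.6 is dispatched automatically because it does not exceed the compute threshold of 0.7.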

Finally, step S6 is executed: a manual repair scheme is invoked to repair the fault of the monitored object. Preferably, after the manual repair scheme is invoked to repair the fault of the monitored object, the method further includes: outputting a fault repair event, wherein the fault repair event comprises the fault repair operation content, a fault repair log or a state log of the monitored object; and notifying the administrator 51 of the fault repair event. The fault repair event, containing the fault repair operation content, the fault repair log or the state log of the monitored object, is written into the event center 80, and the administrator 51 may log in to the event center 80 to view the operation history.
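A fault repair event and its storage in the event center might look like the following sketch. The field names, the in-memory list and the `EventCenter` class are hypothetical; the patent only states which kinds of information an event may contain and that the administrator can view the history.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RepairEvent:
    """A fault repair event (hypothetical field layout)."""
    monitored_object: str
    operation_content: str
    repair_log: str = ""
    status_log: str = ""
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class EventCenter:
    """In-memory stand-in for the event center that stores repair events."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def record(self, event: RepairEvent) -> None:
        # Events are serialized so they could later be written to a log or database.
        self._events.append(json.dumps(asdict(event)))

    def history(self, monitored_object: str) -> list[dict]:
        """Return all recorded events for one monitored object, oldest first."""
        events = [json.loads(e) for e in self._events]
        return [e for e in events if e["monitored_object"] == monitored_object]
```

An administrator-facing view would call `history("ceph01")` to list what was done to that host; a real deployment would replace the in-memory list with persistent storage.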

In this embodiment, the fault repairing scheme issued by the administrator 51 or the fault repairing scheme pre-configured by the predefined execution tool library 40 may be one or more of the following.

(1) The server detects whether the swap partition needs to be used, and automatically clears and disables the swap partition according to the needs of the server;

(2) the server detects the clock deviation and automatically corrects the clock deviation;

(3) the service process is hung up and automatically restarted;

(4) when a bad disk is detected on an OSD, the bad disk is kicked out of the cluster, and the OSD is automatically recovered after the bad disk is replaced;

(5) when the network card raises an abnormal traffic alarm, the bandwidth occupied by each service is analyzed and ranked, and whether traffic limiting is needed is determined.

Of course, the above illustrated fault repair scheme is merely exemplary and does not constitute a limitation on the fault repair scheme.

Referring to fig. 6, this embodiment shows a specific example of repairing an OSD down alarm occurring in the Ceph storage cluster.

An OSD down alarm is acquired from the monitoring center and checked against the server in the asset management system to form a formatted message. After confirmation, the alarm is pushed to the repair center 30. The repair center 30 calls an execution tool, verifies the authority, and executes four checking actions, 701 to 704, on the Ceph storage cluster.

701. Checking service status and progress

1. Confirming service status

2. Confirming the process state: if the disk is normal the following process exists, and if the disk is abnormal it does not exist. The specific output is as follows:

ceph 23156 1 51 10:09 ? 00:00:06 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

702. Performing a read-write test on all mounted partitions; the specific output is as follows:

/dev/sdb1 5.5T 637G 4.9T 12% /var/lib/ceph/osd/ceph-0

Enter the mount point of osd.0 and create a file.

If reading or writing fails, an error similar to the following appears: Input/output error

703. Checking whether the system log is related to the alarm information of the disk

A system log error of the following kind can be found in /var/log/messages:

Sep 3 13:43:22 ceph01 kernel: [15608593.558973] sd 2:0:01:0: [sdc] Sense Key : Medium Error [current]

Sep 3 20:39:56 ceph01 kernel: [61952397.553033] Buffer I/O error on device sdb1, logical block 796203307

704. Executing a disk bad-block detection command; the specific command is as follows:

badblocks -s -v /dev/sdb >> /tmp/scan_sdb.log

Taking the four checking actions as determination conditions, the repair center 30 determines that the hard disk is damaged and needs to be replaced, considers that the alarm information meets the repair conditions, and executes the preset repair scheme. When the repair scheme is completed, the alarm information is cleared, and the time at which the alarm information was received is saved to the event center 80.
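Three of the four checking actions (process state, read-write test, system log scan) can be sketched in Python as below, together with one possible way of combining the results. The function names, the probe file name and, in particular, the combination rule in `disk_looks_damaged` are assumptions: the patent says the four checks serve as determination conditions but does not state how they are weighed.

```python
import os
import re

# Substrings that indicate disk trouble, taken from the sample outputs above.
DISK_ERROR_PATTERNS = ("Medium Error", "Buffer I/O error", "Input/output error")


def osd_process_running(ps_output: str, osd_id: int) -> bool:
    """Check 701: is there a ceph-osd process for this OSD id in `ps -ef` output?"""
    for line in ps_output.splitlines():
        match = re.search(r"ceph-osd\b.*--id[ =]?(\d+)\b", line)
        if match and int(match.group(1)) == osd_id:
            return True
    return False


def rw_test(mount_dir: str) -> bool:
    """Check 702: create, read back and remove a small file in the mount point."""
    probe = os.path.join(mount_dir, ".rw_probe")  # hypothetical probe file name
    try:
        with open(probe, "w") as handle:
            handle.write("ok")
        with open(probe) as handle:
            readable = handle.read() == "ok"
        os.remove(probe)
        return readable
    except OSError:
        return False


def scan_disk_errors(log_lines: list[str]) -> list[str]:
    """Check 703: return the log lines that mention a known disk error."""
    return [ln for ln in log_lines if any(p in ln for p in DISK_ERROR_PATTERNS)]


def disk_looks_damaged(process_ok: bool, rw_ok: bool,
                       log_hits: int, bad_blocks: int) -> bool:
    """Assumed rule: the OSD process is gone and at least one disk-level symptom exists."""
    return (not process_ok) and (not rw_ok or log_hits > 0 or bad_blocks > 0)
```

The bad-block count for check 704 would come from running `badblocks` separately and counting the reported blocks; it is passed in as a plain integer here so the sketch stays free of external commands.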

The decision center 50 determines, based on the risk, whether manual intervention is required. In the above example, if no manual intervention is needed, the repair job/scheme is executed and the damaged disk is kicked out of the Ceph cluster; when the state of the Ceph cluster returns to normal, the operations of the entire fault repair event are saved to the event center 80 and the administrator 51 is notified.

The second embodiment is as follows:

referring to fig. 3 to fig. 5, based on the technical solution of a computer system fault recovery method disclosed in the first embodiment, the present embodiment discloses a computer system fault recovery system 100 (hereinafter, referred to as "system 100") including:

The fault acquisition system 10 acquires fault information of the monitored object through the monitoring system 90.

The asset management system 20 acquires the identity information of the monitored object contained in the fault information, calls an asset list in the asset management system, and formats the fault information.

The repair center 30 calls the service type matching the formatted fault information from the predefined execution tool library 40 to form a fault repair scheme corresponding to the fault information.

The decision center 50 calls the fault repair scheme and, through configured judgment logic, triggers an event notifying manual repair.

The notification system 60 receives the event notifying manual repair issued by the decision center 50, so as to notify the administrator to issue a manual repair instruction.

The system 100 further comprises: a dispatch center 70 and an event center 80.

The decision center 50 is configured with judgment logic for judging the risk contained in the fault information, and determines to trigger an event notifying manual repair when the risk of the fault information corresponding to the fault repair scheme exceeds a risk threshold configured by the decision center 50; when that risk does not exceed the risk threshold configured by the decision center 50, the dispatch center 70 issues the fault repair scheme to the monitored object 200.

The event center 80 receives the fault repair event output by the dispatch center 70 and notifies the administrator 51 of the fault repair event, wherein the fault repair event contains the fault repair operation content, the fault repair log, or the state log of the monitored object.

For the technical solutions that the system 100 disclosed in this embodiment shares with the computer system fault repairing method disclosed in the first embodiment, please refer to the first embodiment; detailed descriptions are omitted here.

The third embodiment is as follows:

referring to FIG. 7, an apparatus 400 is further disclosed in this embodiment based on the computer system fault recovery method disclosed in the first embodiment. The device 400 may be understood as a computer, a cluster of computers, a server, a physical machine containing a program, an electronic device based on an integrated circuit of an embedded program, a data center or a cloud platform.

The apparatus 400 includes one or more processors 41 and one or more memories 42 for storing one or more programs that, when executed by the one or more processors 41, cause the one or more processors 41 to perform the computer system fault repairing method of the first embodiment. The memory 42 may be selected from an HDD, non-volatile memory, RAID, a distributed storage system, a cluster server, and the like.

For the technical solutions that the apparatus 400 shares with the first embodiment, please refer to the description of the first embodiment, which is not repeated here.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims

1. A computer system fault recovery method, comprising the steps of:

acquiring fault information of a monitored object by a monitoring system;

and calling a manual repair scheme to repair the fault of the monitored object.

2. The method for repairing a computer system fault according to claim 1, wherein the predefined execution tool library pre-configures a repair script and an automation operation and maintenance tool corresponding to the repair fault information.

3. The computer system fault recovery method of claim 2, wherein the automated operation and maintenance tool comprises Puppet, Ansible, or SaltStack.

4. The computer system fault recovery method of claim 1, wherein the repair center is configured with judgment logic to discard the fault information when no service type matching the formatted fault information can be called from the predefined execution tool library.

5. The computer system fault recovery method of any one of claims 1 to 4, further comprising, after invoking a manual repair scheme to perform fault recovery on the monitored object:

an operation of outputting a fail-over event, wherein,

and notifies the administrator of the failover event.

6. The computer system fault repairing method according to claim 5, wherein the decision center is configured with a judgment logic for judging risks contained in the fault information, and determines to trigger an event for notifying manual repair when the risk corresponding to the fault information corresponding to the fault repairing scheme exceeds a risk threshold configured by the decision center;

7. The computer system fault recovery method of claim 5, wherein the asset management system embeds a visual interface that displays identity information of the monitored object;

8. A computer system failover system, comprising:

9. The computer system failover system of claim 8, further comprising a dispatch center, wherein the decision center is configured with judgment logic for judging the risk contained in the fault information, and determines to trigger an event notifying manual repair when the risk of the fault information corresponding to the fault repair scheme exceeds a risk threshold configured by the decision center; and when that risk does not exceed the risk threshold configured by the decision center, the dispatch center issues the fault repair scheme to the monitored object.

10. The computer system failover system of claim 9 further comprising: the event center receives the operation of the fault repairing event output by the dispatching center and informs the fault repairing event to the administrator,

wherein,

11. An apparatus, comprising:

one or more processors;

a memory for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the computer system fault recovery method of any of claims 1-7.