CN106201805B - Method and device for detecting server failure - Google Patents

Info

Publication number
CN106201805B
Authority
CN
China
Prior art keywords
fault
server
information
repair
operating system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610607930.2A
Other languages
Chinese (zh)
Other versions
CN106201805A (en
Inventor
凌婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610607930.2A
Publication of CN106201805A
Application granted
Publication of CN106201805B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/20 Administration of product repair or maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Debugging And Monitoring (AREA)
  • Telephonic Communication Services (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Methods and apparatus for detecting server failures are disclosed. One embodiment of the method comprises: acquiring a fault list, wherein the fault list comprises fault accessory information of a server; pushing information to be maintained based on the fault accessory information; receiving maintenance feedback information for the information to be maintained; determining, according to the fault list, whether to reinstall the operating system of the server or restart the server; triggering the reinstallation of the operating system or the restart of the server based on the result of the determination; in response to detecting that the operating system has been reinstalled or the server has been restarted, performing fault detection on the server; and presenting the result of the fault detection. This embodiment shortens the repair period and improves the efficiency of operating and maintaining the server.

Description

Method and device for detecting server failure
Technical Field
The present application relates to the field of computer technologies, specifically to the field of internet technologies, and in particular to a method and an apparatus for detecting a server failure.
Background
At present, when a server fails, there are two methods for detecting the failure, both based on issuing a repair ticket: operation and maintenance personnel either issue the ticket from the fault pool of the operation and maintenance management platform, or issue it manually in the resource management system. A ticket issued from the fault pool contains hardware faults that the hardware detection tool has already found; a manually issued fault repair ticket is sent to the server's operation and maintenance personnel for manual investigation.
However, when a ticket is issued from the fault pool, fault repair cannot be initiated in time if the hardware detection tool itself fails, so the hardware fault information of the server may be inaccurate by the time repair is actually initiated. When a ticket is issued manually, the same fault information may be submitted repeatedly if the personnel do not notice faults already found by the hardware detection tool. Both methods therefore rely on manual operation and maintenance to issue server fault tickets, which is inefficient and error-prone.
Disclosure of Invention
The present application aims to propose an improved method and apparatus for detecting server failure to solve the technical problems mentioned in the background section above.
In a first aspect, the present application provides a method for detecting a server failure, the method comprising: acquiring a fault list, wherein the fault list comprises fault accessory information of a server; pushing information to be maintained based on the fault accessory information; receiving maintenance feedback information of the information to be maintained; determining whether to reinstall an operating system of the server or restart the server according to the fault list; triggering the reinstallation of the operating system of the server or the restarting of the server based on the determined result; in response to detecting that the operating system of the server has been reinstalled or that the server has been rebooted, performing failure detection on the server; and presenting the result of fault detection of the server.
In some embodiments, said obtaining a trouble ticket comprises: sending an atomic operation command to acquire a fault list from a fault checking interface at the bottom layer; and/or acquiring a fault list based on fault repair information manually input.
In some embodiments, sending the atomic operation command to acquire the fault list from the underlying fault check interface includes: if the fault accessory information is empty, presenting prompt information for manual operation and maintenance follow-up.
In some embodiments, acquiring the fault list based on manually entered fault repair information includes one or more of the following: in response to the manually entered fault repair information being associated with a fault list entered from the underlying fault check interface, only recording an operation log; and in response to the manually entered fault repair information not being associated with a fault list entered from the underlying fault check interface, acquiring the fault list from the fault repair information and recording an operation log.
In some embodiments, said pushing information to be repaired based on said faulty accessory information comprises one or more of: in response to the faulty accessory information being an internal maintenance accessory, pushing internal maintenance information to an internal platform; in response to the faulty accessory information not being an internal repair accessory, pushing outsourced repair information to the external platform.
In some embodiments, determining whether to reinstall the operating system of the server according to the fault list comprises one or more of: determining to reinstall the operating system of the server in response to the fault list indicating that the operating system is to be reinstalled; and determining to reinstall the operating system of the server in response to the repair option of the fault list meeting a preset reinstallation trigger condition.
In some embodiments, determining whether to restart the server according to the fault list comprises one or more of: determining to restart the server in response to the fault list indicating that the server is to be restarted; and determining to restart the server in response to the repair option of the fault list meeting a preset restart trigger condition.
In a second aspect, the present application provides an apparatus for detecting a server failure, the apparatus comprising: an acquisition unit for acquiring a fault list that comprises fault accessory information of a server; a pushing unit for pushing information to be maintained based on the fault accessory information; a receiving unit for receiving maintenance feedback information for the information to be maintained; a determining unit for determining, according to the fault list, whether to reinstall the operating system of the server or restart the server; a triggering unit for triggering the reinstallation of the operating system of the server or the restart of the server based on the result of the determination; a detection unit for performing fault detection on the server in response to detecting that the operating system of the server has been reinstalled or the server has been restarted; and a presentation unit for presenting the result of the fault detection on the server.
In some embodiments, the obtaining unit is further configured to: sending an atomic operation command to acquire a fault list from a fault checking interface at the bottom layer; and/or acquiring a fault list based on fault repair information manually input.
In some embodiments, the obtaining unit is further configured to: if the fault accessory information of the fault list acquired from the underlying fault check interface is empty, present prompt information for manual operation and maintenance follow-up.
In some embodiments, the obtaining unit is further configured for one or more of: in response to the manually entered fault repair information being associated with a fault list entered from the underlying fault check interface, only recording an operation log; and in response to the manually entered fault repair information not being associated with a fault list entered from the underlying fault check interface, acquiring the fault list from the fault repair information and recording an operation log.
In some embodiments, the push unit comprises one or more of: the internal pushing subunit is used for responding to the fact that the fault accessory information is an internal maintenance accessory and pushing the internal maintenance information to the internal platform; and the external pushing subunit is used for responding to the fault accessory information which is not an internal maintenance accessory and pushing outsourced maintenance information to the external platform.
In some embodiments, the determining unit comprises one or more of: an instruction reinstallation determining subunit, configured to determine to reinstall the operating system of the server in response to the trouble ticket instructing the reinstallation of the operating system of the server; and the condition reinstallation determining subunit is used for determining to reinstall the operating system of the server in response to the fact that the repair option of the fault list meets the preset reinstallation triggering condition.
In some embodiments, the determining unit comprises one or more of: the instruction restarting determination subunit is used for responding to the fault list instruction to restart the server and determining to restart the server; and the conditional restart determining subunit is used for determining to restart the server in response to that the repair reporting option of the fault list meets a preset restart triggering condition.
The method and apparatus for detecting a server failure provided by the present application acquire a fault list that includes fault accessory information of the server, push information to be maintained based on the fault accessory information, receive maintenance feedback information for the information to be maintained, determine according to the fault list whether to reinstall the operating system of the server or restart the server, trigger the reinstallation or restart based on the result of the determination, perform fault detection on the server in response to detecting that the operating system has been reinstalled or the server has been restarted, and finally present the result of the fault detection. In this way the fault detection process is carried out automatically to obtain the fault list, repair is reported automatically according to the fault list, and ticket-closing detection is performed according to the returned maintenance feedback information. The repair period is shortened and the efficiency of operating and maintaining the server is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of one embodiment of a method for detecting server failure according to the present application;
FIG. 2 is an exemplary flow chart of a method of obtaining a trouble ticket in a method for detecting server failure according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for detecting server failure according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an apparatus for detecting a server failure according to the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 shows a flow 100 of one embodiment of a method for detecting server failure according to the present application. The method for detecting the server failure comprises the following steps:
Step 101, obtaining a fault list.
In this embodiment, the fault list includes fault accessory information of the server. The system or platform on which the method for detecting server failure runs may obtain fault lists from any of several platforms or systems capable of generating them.
For example, the system or platform on which the method runs may send an atomic operation command to obtain a fault list from the underlying fault check interface of another platform or system. Alternatively or additionally, it may obtain a fault list based on manually entered fault repair information. An atomic operation command is one whose execution is not interrupted by the thread scheduling mechanism: once started, the operation runs to completion without switching to any other thread.
The sending of the atomic operation command may be sending the atomic operation command at a predetermined time interval actively, or sending the atomic operation command triggered passively. For example, a system or platform on which the method for detecting server failure operates may trigger sending atomic operation commands upon receiving repair information for an external platform.
When the atomic operation command is sent to obtain the fault list from the underlying fault check interface, the fault accessory information in the obtained fault list is either non-empty or empty. If it is non-empty, the subsequent fault processing flow can continue; if it is empty, prompt information for manual operation and maintenance follow-up can be presented.
Manually entered fault repair information may or may not be associated with a fault list entered from the underlying fault check interface. If it is associated, only an operation log is recorded; if it is not, the fault list is obtained from the fault repair information and an operation log is recorded.
The trouble ticket acquired here may be set such that only the trouble component information supporting repair is allowed to generate the trouble ticket, and the trouble component information not supporting repair is not allowed to generate the trouble ticket.
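The two acquisition paths described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `FaultList` type, the function names, and in particular the rule used to decide whether a manual report is "associated" with an existing fault list (same server and at least one accessory in common) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultList:
    server_id: str
    accessories: List[str]          # fault accessory information; may be empty
    source: str                     # "auto" (check interface) or "manual"
    manual_follow_up: bool = False  # True when operations staff must follow up


def from_check_interface(server_id: str, detected: List[str]) -> FaultList:
    """Automatic path: wrap the result of the underlying fault check interface."""
    # Empty accessory information means a prompt for manual follow-up is needed.
    return FaultList(server_id, list(detected), "auto",
                     manual_follow_up=not detected)


def from_manual_report(server_id: str, reported: List[str],
                       existing: List[FaultList],
                       log: List[str]) -> Optional[FaultList]:
    """Manual path: log only if already covered, otherwise create a fault list."""
    for fl in existing:
        # Assumed association rule: same server and an overlapping accessory.
        if (fl.source == "auto" and fl.server_id == server_id
                and set(reported) & set(fl.accessories)):
            log.append(f"{server_id}: manual report matches existing fault list")
            return None
    log.append(f"{server_id}: fault list created from manual report")
    return FaultList(server_id, list(reported), "manual")
```

Returning `None` for an associated report is what prevents the same fault from being submitted twice; the operation log is written in both branches, as the text requires.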
Step 102, pushing information to be maintained based on the fault accessory information.
In this embodiment, the information to be maintained may be generated based on the fault component information of the server included in the fault list acquired in step 101, and then the information to be maintained is pushed to the maintenance party, so as to perform subsequent processing of the fault.
When the information to be maintained is pushed, a maintenance party can be determined according to the source of the fault accessory information or a maintenance manufacturer adapted to the fault accessory information, and then the information to be maintained is pushed to the maintenance party. For example, the internal repair information may be pushed to the internal platform in response to the faulty accessory information being an internal repair accessory; or in response to the faulty accessory information not being an internal repair accessory, pushing outsourced repair information to the external platform.
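The routing rule in the example above can be sketched as a small function. The platform names, the message fields, and the contents of `INTERNAL_REPAIR_ACCESSORIES` are hypothetical placeholders; the source only states that internal maintenance accessories go to the internal platform and everything else is outsourced.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical set of accessory types that are repaired internally.
INTERNAL_REPAIR_ACCESSORIES: FrozenSet[str] = frozenset({"hard_disk"})


def route_repair(
    accessory: str,
    internal: FrozenSet[str] = INTERNAL_REPAIR_ACCESSORIES,
) -> Tuple[str, Dict[str, str]]:
    """Return (target platform, information to be maintained) for one accessory."""
    if accessory in internal:
        return "internal_platform", {"kind": "internal_repair",
                                     "accessory": accessory}
    # Anything that is not an internal maintenance accessory is outsourced.
    return "external_platform", {"kind": "outsourced_repair",
                                 "accessory": accessory}
```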
Step 103, receiving maintenance feedback information for the information to be maintained.
In this embodiment, maintenance feedback information for the information to be maintained may be received from the repair party. The maintenance feedback information is feedback on the repair of the faulty accessory and may include one or more of the following items about the faulty accessory: name, location, repair time, repair content, repair party, and repair personnel.
For example, for fault B of server A, operation and maintenance person C may, after receiving the information to be maintained and completing the repair, enter the maintenance feedback information through the operation and maintenance input interface. The system or platform on which the method for detecting server failure runs then receives the maintenance feedback information for the information to be maintained.
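One possible shape for the maintenance feedback record is sketched below. The field names are assumptions derived from the items listed in the text (name, location, repair time, repair content, repair party, repair personnel); the source does not define a concrete data format.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class MaintenanceFeedback:
    accessory_name: str                     # name of the faulty accessory
    location: Optional[str] = None
    repair_time: Optional[str] = None
    repair_content: Optional[str] = None
    repair_party: Optional[str] = None
    repair_personnel: Optional[str] = None


def receive_feedback(raw: Dict[str, Any]) -> MaintenanceFeedback:
    """Build a feedback record from raw input, ignoring unknown keys."""
    known = {f.name for f in fields(MaintenanceFeedback)}
    return MaintenanceFeedback(**{k: v for k, v in raw.items() if k in known})
```

All items except the accessory name are optional here, which matches the "one or more of" wording in the text.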
Step 104, determining whether to reinstall the operating system of the server or restart the server according to the fault list.
In this embodiment, after the maintenance feedback information is received, ticket-closing detection must be performed after the server is restarted or its operating system is reinstalled, in order to recheck the repair status of the server.
Here, whether to reinstall the operating system or restart the server may be determined according to a parameter in the fault list, for example the instruction carried by the fault list or its repair option. If the fault list indicates that the operating system of the server is to be reinstalled, it is determined to reinstall the operating system; alternatively or additionally, if the repair option of the fault list meets a preset reinstallation trigger condition, it is determined to reinstall the operating system. Likewise, if the fault list indicates that the server is to be restarted, it is determined to restart the server; alternatively or additionally, if the repair option of the fault list meets a preset restart trigger condition, it is determined to restart the server.
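The decision rule can be sketched as below. The dictionary keys `instruction` and `repair_option`, the `Action` enum, and the choice to check reinstallation before restart when both would match are all assumptions for illustration; the source leaves the trigger conditions as "preset" and does not specify a precedence.

```python
from enum import Enum
from typing import Any, Callable, Dict


class Action(Enum):
    REINSTALL = "reinstall"
    RESTART = "restart"
    NONE = "none"


# A trigger condition is modelled as a predicate over the repair option.
Condition = Callable[[str], bool]


def decide_action(fault_list: Dict[str, Any],
                  reinstall_trigger: Condition,
                  restart_trigger: Condition) -> Action:
    """Choose reinstall or restart from the instruction or the repair option."""
    instruction = fault_list.get("instruction")
    option = fault_list.get("repair_option", "")
    # Assumed precedence: reinstallation is checked first.
    if instruction == "reinstall" or reinstall_trigger(option):
        return Action.REINSTALL
    if instruction == "restart" or restart_trigger(option):
        return Action.RESTART
    return Action.NONE
```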
Step 105, triggering the reinstallation of the operating system of the server or the restart of the server based on the determined result.
In this embodiment, if the determined result is to reinstall the server, the system or platform on which the method for detecting the server failure operates may send a reinstallation instruction to the system or platform that controls the server, so as to trigger the reinstallation of the operating system of the server; if the determined result is to restart the server, the system or platform on which the method for detecting the server failure operates may send a restart instruction to the system or platform that controls the server, thereby triggering the restart of the server.
Step 106, in response to detecting that the operating system of the server has been reinstalled or the server has been restarted, performing fault detection on the server.
In this embodiment, a detection instruction may be sent to the system or platform that controls the server to query whether the operating system of the server has been reinstalled or the server has been restarted. If the query shows that the reinstallation or restart has completed, fault detection is performed on the server to obtain the result of the ticket-closing detection. The result of the ticket-closing detection may additionally be reviewed to confirm the result of the fault detection.
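A minimal sketch of this query-then-detect step follows. The polling loop, the retry count, and the callable names are assumptions; the source only says that completion of the reinstallation or restart is queried and that fault detection runs once it is confirmed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def detect_after_recovery(is_recovered: Callable[[], bool],
                          run_detection: Callable[[], T],
                          attempts: int = 5,
                          interval_s: float = 1.0,
                          sleep: Callable[[float], None] = time.sleep,
                          ) -> Optional[T]:
    """Poll until the reinstall or restart has finished, then run detection."""
    for attempt in range(attempts):
        if is_recovered():
            return run_detection()
        if attempt < attempts - 1:
            sleep(interval_s)  # wait before querying the controlling system again
    return None  # recovery was never confirmed; no detection result
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.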
Step 107, presenting the result of the fault detection of the server.
In this embodiment, after the result of the fault detection is obtained, the result of the fault detection performed on the server may be presented to the user interaction interface or the third-party platform, so as to improve the efficiency of performing operation maintenance on the server.
According to the method for detecting a server failure in this embodiment of the application, the fault list is obtained by carrying out the fault detection process automatically, repair is then reported automatically according to the fault list, and ticket-closing detection is performed according to the returned maintenance feedback information, so that the result of fault detection on the server is presented. This shortens the repair reporting period and improves the efficiency of operating and maintaining the server.
In particular, in some implementations, the method combines multiple ticket-issuing flows and avoids repeated submission by checking whether a manually issued fault list is associated with other fault accessory information, which further improves the operation and maintenance efficiency of the server.
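Steps 101 to 107 can be read as a fixed pipeline. The sketch below wires the seven steps together as injected callables; the parameter names and the dictionary keys `server` and `accessories` are hypothetical, and each callable stands in for whatever system actually performs that step.

```python
from typing import Any, Callable, Dict, List


def run_workflow(acquire: Callable[[], Dict[str, Any]],
                 push: Callable[[List[str]], None],
                 receive: Callable[[], Dict[str, Any]],
                 decide: Callable[[Dict[str, Any]], str],
                 trigger: Callable[[str], None],
                 detect: Callable[[str], Dict[str, Any]],
                 present: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Run the seven steps in order and return the detection result."""
    fault_list = acquire()               # step 101: obtain the fault list
    push(fault_list["accessories"])      # step 102: push information to be maintained
    receive()                            # step 103: wait for maintenance feedback
    action = decide(fault_list)          # step 104: reinstall or restart?
    trigger(action)                      # step 105: trigger the chosen action
    result = detect(fault_list["server"])  # step 106: fault detection
    present(result)                      # step 107: present the result
    return result
```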
With further reference to fig. 2, fig. 2 shows an exemplary flowchart 200 of a method for obtaining a trouble ticket in the method for detecting a server failure according to the embodiment of the present application.
As shown in fig. 2, the method 200 of obtaining a trouble ticket includes the following steps:
in step 201, a fault checking process is started, followed by execution of step 202;
in step 202, it is determined whether the repair was reported automatically; if yes, step 203 is executed, and if no, step 207 is executed;
in step 203, when an atomic operation command is sent to the underlying service, machine information of fault repair is sent, and then fault accessory information of the automatically captured server is obtained from the underlying service;
in step 204, determining whether there is faulty accessory information, if yes, executing step 205, and if not, executing step 206;
in step 205, an attention type instance is generated, followed by performing step 210;
in step 206, an instance is generated and manually followed, followed by step 210;
in step 207, determining whether an instance associated with the faulty accessory information is present, if yes, performing step 209, and if no, performing step 208;
in step 208, the atomic operation is suspended, followed by execution of step 209;
in step 209, add instance exception information, followed by execution of step 210;
in step 210, the faulty accessory information is returned, followed by step 211;
in step 211, the failure check flow is completed.
According to the method for acquiring the fault list, provided by the embodiment of the application, on the basis of distinguishing the source of the fault list, manual follow-up can be carried out on the fault of the hardware detection tool, and repeated acquisition of the fault list can be avoided, so that the efficiency of operation and maintenance of the server is improved.
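The branching of steps 201 to 211 can be traced with a short function that records which steps are visited. This is a sketch under assumptions: the callables are placeholders for the underlying service and the instance lookup, and returning the manually entered accessory information on the manual branch is an interpretation, since the flowchart only says that fault accessory information is returned in step 210.

```python
from typing import Callable, List, Tuple


def fault_check_flow(auto_reported: bool,
                     fetch_accessories: Callable[[], List[str]],
                     has_associated_instance: Callable[[], bool],
                     manual_accessories: List[str],
                     ) -> Tuple[List[str], List[int]]:
    """Return (fault accessory information, list of visited step numbers)."""
    trace = [201, 202]
    if auto_reported:
        trace.append(203)                  # query the underlying service
        accessories = fetch_accessories()
        trace.append(204)
        # 205: attention-type instance; 206: instance with manual follow-up
        trace.append(205 if accessories else 206)
    else:
        accessories = list(manual_accessories)  # assumed source on this branch
        trace.append(207)
        if not has_associated_instance():
            trace.append(208)              # suspend the atomic operation
        trace.append(209)                  # add instance exception information
    trace += [210, 211]                    # return the information and finish
    return accessories, trace
```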
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for detecting a server failure according to the present embodiment.
In the application scenario of FIG. 3, the method for detecting server failure runs on the resource management system 310. The resource management system 310 includes an open source framework (Iplat) platform 320 and an operation support system (NOC) 330, and may provide a fault repair interface to third-party systems 340 as well as to the order management system 350.
In a specific application process, the third-party system 340 first calls the fault repair interface to send repair information to the resource management system 310. After receiving the repair information, the resource management system 310 obtains the fault list through the Iplat platform 320.
After acquiring the fault list, the Iplat platform 320 determines whether it is a fault list reported for repair by the third-party system 340. If so, the platform acquires the fault accessory information of the server from the fault list. If the accessory information is acquired, the atomic operation queue continues with the subsequent steps, such as fault processing and fault detection, pushes the outsourcing task, and waits for the outsourced operation result. If the accessory information is not acquired, the atomic operation queue is suspended at the current fault checking step, and the case enters the operation support system 330 in the form of iplat_exception for manual operation and maintenance.
If the fault list was not reported by the third-party system 340 but was initiated from the operation support system 330 after the fault was located manually, the fault checking step is skipped and the manually entered fault information is taken as authoritative for the repair report. In other words, the fault information checking process runs only when the repair is initiated automatically and the fault information of the server does not need to be located manually. If the fault information associated with the current server has been located manually and then entered into the system, the fault checking step is skipped and no request to capture fault data is sent to the bottom layer.
Next, the Iplat platform 320 performs fault processing on the fault accessory information acquired from the fault list, that is, it pushes the information to be maintained, obtained from the fault accessory information, to the order management system 350. Based on the fault accessory information, the order management system 350 presents a spare part warehousing entry to the user for internally maintained accessories, such as hard disks or accessories whose warranty has expired, and sends a repair email to the manufacturer for externally maintained accessories, such as accessories still within the warranty period.
The Iplat platform 320 then receives the fault processing result returned by the order management system 350 and judges, based on the fault list, whether the operating system of the server needs to be reinstalled or the server needs to be restarted. If ticket-closing detection has to be completed after the operating system is reinstalled, the operating system is reinstalled; if it has to be completed after the server is restarted, the server is restarted. Ticket-closing detection is then performed (optionally followed by a review of the detection), and its result is presented to the user or delivered to a third party.
According to the method provided by this embodiment of the application, the fault list is obtained by carrying out the fault detection process automatically, repair is then reported automatically according to the fault list, and ticket-closing detection is performed according to the returned maintenance feedback information, so that the result of fault detection on the server is presented. This shortens the repair reporting period and improves the efficiency of operating and maintaining the server.
With further reference to FIG. 4, an exemplary block diagram 400 of an apparatus for detecting server failure is shown.
As shown in FIG. 4, the apparatus 400 for detecting a server failure includes: an obtaining unit 410, a pushing unit 420, a receiving unit 430, a determining unit 440, a triggering unit 450, a detecting unit 460, and a presenting unit 470.
The obtaining unit 410 is configured to send an atomic operation command to obtain a fault list, where the fault list includes fault accessory information of the server. The pushing unit 420 is configured to push information to be maintained based on the fault accessory information. The receiving unit 430 is configured to receive maintenance feedback information for the information to be maintained. The determining unit 440 is configured to determine, according to the fault list, whether to reinstall the operating system of the server or restart the server. The triggering unit 450 is configured to trigger the reinstallation of the operating system of the server or the restart of the server based on the result of the determination. The detecting unit 460 is configured to perform fault detection on the server in response to detecting that the operating system of the server has been reinstalled or the server has been restarted. The presenting unit 470 is configured to present the result of the fault detection on the server.
In some optional implementations of this embodiment, the obtaining unit is further configured to: obtain the fault list from an underlying fault checking interface; and/or obtain the fault list based on manually input fault repair information.
In some optional implementations of this embodiment, the obtaining unit is further configured to: if the fault accessory information of the fault list obtained from the underlying fault checking interface is empty, present reminder information for manual operation and maintenance follow-up.
The obtaining unit is further configured for one or more of: in response to the manually input fault repair information being associated with a fault list input from the underlying fault checking interface, only recording an operation log; and in response to the manually input fault repair information not being associated with a fault list input from the underlying fault checking interface, obtaining the fault list from the fault repair information and recording an operation log.
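The two branches for manually input fault repair information, together with the empty-information reminder, might look like the following. This is a minimal sketch under stated assumptions: the dictionary keys (`linked_fault_list_id`, `server_id`, `fault_accessories`), both function names, and the log message wording are hypothetical and do not come from the application.

```python
from typing import Dict, List, Optional


def obtain_from_manual_report(
    report: Dict[str, object],
    interface_fault_lists: Dict[str, Dict[str, object]],
    operation_log: List[str],
) -> Optional[Dict[str, object]]:
    """Return a new fault list, or None when the report maps onto an existing one."""
    linked_id = report.get("linked_fault_list_id")  # hypothetical field
    if linked_id is not None and linked_id in interface_fault_lists:
        # Associated with a fault list from the underlying interface: only log.
        operation_log.append(f"manual report linked to fault list {linked_id}")
        return None
    # Not associated: build the fault list from the report itself, and log.
    fault_list = {
        "server_id": report["server_id"],
        "fault_accessories": list(report.get("fault_accessories", [])),
    }
    operation_log.append(
        f"fault list created from manual report for {report['server_id']}"
    )
    return fault_list


def needs_manual_follow_up(fault_list: Dict[str, object]) -> bool:
    """True when the fault accessory information is empty, so a reminder is shown."""
    return not fault_list.get("fault_accessories")
```

In both branches an operation log entry is written; the difference is only whether a new fault list is produced.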
In some optional implementations of this embodiment, the pushing unit includes one or more of the following: an internal pushing subunit (not shown in the figure) configured to push internal maintenance information to the internal platform in response to the fault accessory information indicating an internal maintenance accessory; and an external pushing subunit (not shown in the figure) configured to push outsourced maintenance information to the external platform in response to the fault accessory information not indicating an internal maintenance accessory.
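The internal/external split performed by the two pushing subunits amounts to a simple routing step. The sketch below assumes each accessory record carries a hypothetical `internal_repair` flag and a `name`; how that flag is actually derived (for example from accessory type or warranty status) is not specified here.

```python
from typing import Dict, List, Tuple


def route_repairs(accessories: List[Dict[str, object]]) -> Tuple[List[str], List[str]]:
    """Split accessories into (internal platform, external platform) pushes."""
    internal: List[str] = []
    external: List[str] = []
    for accessory in accessories:
        # "internal_repair" is a hypothetical flag set by whoever builds the fault list.
        if accessory.get("internal_repair", False):
            internal.append(str(accessory["name"]))
        else:
            external.append(str(accessory["name"]))
    return internal, external
```

For example, `route_repairs([{"name": "disk", "internal_repair": True}, {"name": "mainboard"}])` places `disk` in the internal list and `mainboard` in the external list.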
In some optional implementations of this embodiment, the determining unit includes one or more of: an instruction reinstallation determining subunit (not shown in the figure) configured to determine to reinstall the operating system of the server in response to the fault list indicating that the operating system of the server is to be reinstalled; and a conditional reinstallation determining subunit (not shown in the figure) configured to determine to reinstall the operating system of the server in response to the repair option of the fault list meeting a preset reinstallation trigger condition.
In some optional implementations of this embodiment, the determining unit includes one or more of: an instruction restart determining subunit (not shown in the figure) configured to determine to restart the server in response to the fault list indicating that the server is to be restarted; and a conditional restart determining subunit (not shown in the figure) configured to determine to restart the server in response to the repair option of the fault list meeting a preset restart trigger condition.
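The four determining subunits can be read as two checks per action: an explicit indication in the fault list, or a repair option that matches a preset trigger condition. The sketch below is one possible reading. The option names in `REINSTALL_OPTIONS` and `RESTART_OPTIONS`, the `instruction` and `repair_option` keys, and the choice to let reinstallation take precedence over restart are all assumptions, since the text does not fix them.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical preset trigger conditions; real deployments would configure these.
REINSTALL_OPTIONS: FrozenSet[str] = frozenset({"system_disk_replaced"})
RESTART_OPTIONS: FrozenSet[str] = frozenset({"memory_replaced"})


def determine_action(fault_list: Dict[str, str]) -> Optional[str]:
    """Return "reinstall", "restart", or None for the given fault list."""
    instruction = fault_list.get("instruction")   # explicit indication, if any
    option = fault_list.get("repair_option")      # repair option, if any
    # Assumption: reinstallation is checked first and wins over restart.
    if instruction == "reinstall" or option in REINSTALL_OPTIONS:
        return "reinstall"
    if instruction == "restart" or option in RESTART_OPTIONS:
        return "restart"
    return None
```

Keeping the trigger conditions in named sets makes them easy to adjust without touching the decision logic.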
It should be understood that the units recited in the apparatus 400 correspond to the various steps in the method described with reference to FIG. 1. Thus, the operations and features described above for the method for detecting server failure apply equally to the apparatus 400 and the units contained therein, and are not described in detail here. Corresponding units in the apparatus 400 may cooperate with units in the terminal device and/or the server to implement aspects of embodiments of the present application.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as follows: a processor includes an obtaining unit, a pushing unit, a receiving unit, a determining unit, a triggering unit, a detecting unit, and a presenting unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the obtaining unit may also be described as "a unit for obtaining a fault list".
In another aspect, the present application further provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the apparatus of the above embodiments, or a non-volatile computer storage medium that exists separately and is not incorporated into a terminal. The non-volatile computer storage medium stores one or more programs that, when executed by a device, cause the device to: acquire a fault list, wherein the fault list includes fault accessory information of a server; push information to be maintained based on the fault accessory information; receive maintenance feedback information for the information to be maintained; determine, according to the fault list, whether to reinstall an operating system of the server or restart the server; trigger the reinstallation of the operating system of the server or the restart of the server based on the result of the determination; perform fault detection on the server in response to detecting that the operating system of the server has been reinstalled or the server has been restarted; and present the result of the fault detection on the server.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A method for detecting server failure, the method comprising:
acquiring a fault list, wherein the fault list comprises fault accessory information supporting repair in a server;
pushing information to be maintained based on the fault accessory information supporting repair;
receiving maintenance feedback information of the information to be maintained;
determining whether to reinstall an operating system of the server or restart the server according to the fault list;
triggering the reinstallation of the operating system of the server or the restarting of the server based on the determined result;
in response to detecting that the operating system of the server has been reinstalled or that the server has been rebooted, performing failure detection on the server;
presenting a result of fault detection on the server;
wherein acquiring the fault list comprises: sending an atomic operation command to acquire the fault list from an underlying fault checking interface, and/or acquiring the fault list based on manually input fault repair information.
2. The method of claim 1, wherein sending the atomic operation command to acquire the fault list from the underlying fault checking interface comprises:
if the fault accessory information supporting repair is empty, presenting reminder information for manual operation and maintenance follow-up.
3. The method of claim 1, wherein acquiring the fault list based on the manually input fault repair information comprises one or more of:
in response to the manually input fault repair information being associated with a fault list input from the underlying fault checking interface, only recording an operation log;
in response to the manually input fault repair information not being associated with a fault list input from the underlying fault checking interface, acquiring the fault list from the fault repair information and recording an operation log.
4. The method according to any one of claims 1-3, wherein pushing the information to be maintained based on the fault accessory information supporting repair comprises one or more of the following:
in response to the fault accessory information supporting repair indicating an internal repair accessory, pushing internal repair information to an internal platform;
in response to the fault accessory information supporting repair not indicating an internal repair accessory, pushing outsourced repair information to an external platform.
5. The method of claim 1, wherein determining whether to reinstall the operating system of the server according to the fault list comprises one or more of:
determining to reinstall the operating system of the server in response to the fault list indicating that the operating system of the server is to be reinstalled;
determining to reinstall the operating system of the server in response to a repair option of the fault list meeting a preset reinstallation trigger condition.
6. The method of claim 1 or 5, wherein determining whether to restart the server according to the fault list comprises one or more of:
determining to restart the server in response to the fault list indicating that the server is to be restarted;
determining to restart the server in response to a repair option of the fault list meeting a preset restart trigger condition.
7. An apparatus for detecting server failure, the apparatus comprising:
an obtaining unit, configured to obtain a fault list, wherein the fault list comprises fault accessory information supporting repair in a server;
a pushing unit, configured to push information to be maintained based on the fault accessory information supporting repair;
a receiving unit, configured to receive maintenance feedback information of the information to be maintained;
a determining unit, configured to determine whether to reinstall an operating system of the server or restart the server according to the fault list;
a triggering unit, configured to trigger the reinstallation of the operating system of the server or the restart of the server based on a result of the determination;
a detecting unit, configured to perform fault detection on the server in response to detecting that the operating system of the server has been reinstalled or the server has been restarted;
a presenting unit, configured to present a result of the fault detection on the server;
wherein the obtaining unit is further configured to: send an atomic operation command to obtain the fault list from an underlying fault checking interface, and/or obtain the fault list based on manually input fault repair information.
8. The apparatus of claim 7, wherein the obtaining unit is further configured to:
if the fault accessory information supporting repair in the fault list obtained from the underlying fault checking interface is empty, present reminder information for manual operation and maintenance follow-up.
9. The apparatus of claim 8, wherein the obtaining unit is further configured for one or more of:
in response to the manually input fault repair information being associated with a fault list input from the underlying fault checking interface, only recording an operation log;
in response to the manually input fault repair information not being associated with a fault list input from the underlying fault checking interface, obtaining the fault list from the fault repair information and recording an operation log.
10. The apparatus according to any one of claims 7-9, wherein the pushing unit comprises one or more of:
an internal pushing subunit, configured to push internal repair information to an internal platform in response to the fault accessory information supporting repair indicating an internal repair accessory;
an external pushing subunit, configured to push outsourced repair information to an external platform in response to the fault accessory information supporting repair not indicating an internal repair accessory.
11. The apparatus of claim 7, wherein the determining unit comprises one or more of:
an instruction reinstallation determining subunit, configured to determine to reinstall the operating system of the server in response to the fault list indicating that the operating system of the server is to be reinstalled;
a conditional reinstallation determining subunit, configured to determine to reinstall the operating system of the server in response to a repair option of the fault list meeting a preset reinstallation trigger condition.
12. The apparatus according to claim 7 or 11, wherein the determining unit comprises one or more of:
an instruction restart determining subunit, configured to determine to restart the server in response to the fault list indicating that the server is to be restarted;
a conditional restart determining subunit, configured to determine to restart the server in response to a repair option of the fault list meeting a preset restart trigger condition.
CN201610607930.2A 2016-07-28 2016-07-28 Method and device for detecting server failure Active CN106201805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610607930.2A CN106201805B (en) 2016-07-28 2016-07-28 Method and device for detecting server failure

Publications (2)

Publication Number Publication Date
CN106201805A CN106201805A (en) 2016-12-07
CN106201805B true CN106201805B (en) 2020-02-14

Family

ID=57495932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610607930.2A Active CN106201805B (en) 2016-07-28 2016-07-28 Method and device for detecting server failure

Country Status (1)

Country Link
CN (1) CN106201805B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875018B (en) * 2017-01-04 2021-03-30 北京百度网讯科技有限公司 Method and device for automatic maintenance of super-large-scale machine
CN107819832A (en) * 2017-10-23 2018-03-20 深圳市赛亿科技开发有限公司 A kind of life of product feedback method and system
CN110430073B (en) * 2019-07-30 2022-06-21 中国工程物理研究院计算机应用研究所 Heterogeneous system automatic operation and maintenance method based on abstract service atomic operation
CN112651514B (en) * 2019-10-11 2024-02-27 中移动信息技术有限公司 Operation and maintenance task execution method and device, operation and maintenance server and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719836A (en) * 2008-10-09 2010-06-02 联想(北京)有限公司 Method and device for fault detection
KR20130123007A (en) * 2012-05-02 2013-11-12 (주)네오위즈게임즈 Method for controlling trouble and server thereof
CN104182819A (en) * 2014-07-31 2014-12-03 心触动(武汉)文化传媒有限公司 Intelligent repair reporting method and system
CN104270274A (en) * 2014-10-08 2015-01-07 广东电网公司汕头供电局 Fault processing method, device and system based on form
CN105373901A (en) * 2015-12-15 2016-03-02 国网北京市电力公司 Grid fault worksheet handling method, apparatus, and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024181B (en) * 2009-09-10 2014-04-09 上海宝信软件股份有限公司 Artificial intelligence maintenance system and method
CN105260841B (en) * 2015-10-16 2019-07-09 国网甘肃省电力公司天水供电公司 A kind of distribution network failure repairing receipt auditing system

Also Published As

Publication number Publication date
CN106201805A (en) 2016-12-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant