CN112835739A

CN112835739A - Downtime processing method and device

Info

Publication number: CN112835739A
Application number: CN201911155242.7A
Authority: CN
Inventors: 胡晓伟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2021-05-25

Abstract

An embodiment of the application discloses a downtime processing method and device. One embodiment of the method comprises: in response to receiving downtime alarm information for a target host, performing downtime confirmation on the target host; in response to determining that the target host is down, acquiring relevant information of the target host, wherein the relevant information comprises information of the target host and information of the virtual machines hosted on the target host; performing downtime recovery on the target host and the virtual machines hosted on the target host based on the relevant information; and in response to completion of the downtime recovery of the target host and the virtual machines, automatically diagnosing the downtime of the target host. The embodiment improves the quality and speed of host fault recovery.

Description

Downtime processing method and device

Technical Field

The embodiments of the application relate to the technical field of computers, and in particular to a downtime processing method and device.

Background

In the related art, if a host is down, operation and maintenance personnel can collect information of the down host, and then maintain and restore functions of the down host.

However, this process usually depends on manual monitoring and handling. Because it involves many steps, manual handling is error-prone. Manual handling also introduces many uncontrollable factors and intervenes slowly, which tends to slow the recovery of functions, so the quality and speed of fault recovery for the product are difficult to guarantee.

Disclosure of Invention

The embodiment of the application provides a downtime processing method and device.

In a first aspect, an embodiment of the present application provides a downtime processing method, including: in response to receiving downtime alarm information for a target host, performing downtime confirmation on the target host; in response to determining that the target host is down, acquiring relevant information of the target host, wherein the relevant information comprises information of the target host and information of the virtual machines hosted on the target host; performing downtime recovery on the target host and the virtual machines hosted on the target host based on the relevant information; and in response to completion of the downtime recovery of the target host and the virtual machines, automatically diagnosing the downtime of the target host.

In some embodiments, performing the downtime confirmation on the target host includes: sending downtime confirmation information for the target host to a target electronic device, wherein the downtime confirmation information is used to confirm whether the target host is down; and confirming that the target host is down in response to receiving feedback information, sent by the target electronic device, indicating that the target host is down.

In some embodiments, the information of the virtual machines hosted on the target host includes resource information used by the virtual machines; and performing downtime recovery on the target host and the virtual machines hosted on the target host based on the relevant information includes: determining, according to the resource information used by the virtual machines, whether a non-host-dependent virtual machine exists, wherein a non-host-dependent virtual machine is a virtual machine whose required runtime data is not stored locally on the target host; and in response to determining that a non-host-dependent virtual machine exists, migrating the non-host-dependent virtual machine.

In some embodiments, the information of the target host comprises resource information of the target host; and migrating the non-host-dependent virtual machine, comprising: selecting a new host machine, wherein the resource information of the new host machine is the same as the resource information of the target host machine; and migrating the non-host-dependent virtual machine to a new host machine.

In some embodiments, performing downtime recovery on the target host and the virtual machines hosted on the target host based on the relevant information includes: determining, according to the resource information used by the virtual machines, whether a host-dependent virtual machine exists, wherein a host-dependent virtual machine is a virtual machine whose required runtime data is stored locally on the target host; in response to determining that a host-dependent virtual machine exists and detecting that the downtime recovery of the target host is completed, detecting whether the downtime recovery of the host-dependent virtual machine is completed, wherein the downtime recovery of the target host is performed by a first automatic recovery program started after the downtime, and the downtime recovery of the host-dependent virtual machine is performed by a second automatic recovery program started after the downtime recovery of the target host is completed; and in response to determining that the downtime recovery of the host-dependent virtual machine is completed, migrating the host-dependent virtual machine.

In some embodiments, automatically diagnosing the downtime of the target host includes: acquiring fault information of the target host and the execution output of the diagnostic commands run when downtime confirmation was performed on the target host; and determining a downtime reason, and suggested recovery steps for that reason, according to the information of the target host, the fault information, the execution output, and a preset operation and maintenance knowledge base.

In some embodiments, the above method further comprises: in response to receiving confirmation information for the downtime reason and the suggested recovery steps, adding the downtime reason and the suggested recovery steps into the operation and maintenance knowledge base; and in response to receiving the modification information aiming at the downtime reason and the suggested recovery steps, adding the modified downtime reason and the modified suggested recovery steps into the operation and maintenance knowledge base.

In some embodiments, the above method further comprises: in response to determining that the target host is down, shielding fault alarm information for the target host.

In some embodiments, the above method further comprises: in response to completion of the downtime recovery of the target host and the virtual machines hosted on the target host, removing the shielding of the fault alarm information for the target host.

In some embodiments, a preset terminal is configured with a downtime manual processing interface; and the above method further comprises: in response to detecting that the target host has not recovered within a preset duration, sending a manual processing notification message to the preset terminal, so that manual intervention can be performed through the downtime manual processing interface.

In a second aspect, an embodiment of the present application provides a downtime processing apparatus, including: a downtime confirmation unit configured to perform downtime confirmation on a target host in response to receiving downtime alarm information for the target host; an information obtaining unit configured to obtain relevant information of the target host in response to determining that the target host is down, wherein the relevant information comprises information of the target host and information of the virtual machines hosted on the target host; a downtime recovery unit configured to perform downtime recovery on the target host and the virtual machines hosted on the target host based on the relevant information; and an automatic diagnosis unit configured to automatically diagnose the downtime of the target host in response to completion of the downtime recovery of the target host and the virtual machines.

In some embodiments, the downtime confirmation unit is further configured to: send downtime confirmation information for the target host to a target electronic device, wherein the downtime confirmation information is used to confirm whether the target host is down; and confirm that the target host is down in response to receiving feedback information, sent by the target electronic device, indicating that the target host is down.

In some embodiments, the information of the virtual machines hosted on the target host includes resource information used by the virtual machines; and the downtime recovery unit is further configured to: determine, according to the resource information used by the virtual machines, whether a non-host-dependent virtual machine exists, wherein a non-host-dependent virtual machine is a virtual machine whose required runtime data is not stored locally on the target host; and in response to determining that a non-host-dependent virtual machine exists, migrate the non-host-dependent virtual machine.

In some embodiments, the information of the target host comprises resource information of the target host; and the downtime recovery unit is further configured to: select a new host whose resource information is the same as the resource information of the target host; and migrate the non-host-dependent virtual machine to the new host.

In some embodiments, the downtime recovery unit is further configured to: determine, according to the resource information used by the virtual machines, whether a host-dependent virtual machine exists, wherein a host-dependent virtual machine is a virtual machine whose required runtime data is stored locally on the target host; in response to determining that a host-dependent virtual machine exists and detecting that the downtime recovery of the target host is completed, detect whether the downtime recovery of the host-dependent virtual machine is completed, wherein the downtime recovery of the target host is performed by a first automatic recovery program started after the downtime, and the downtime recovery of the host-dependent virtual machine is performed by a second automatic recovery program started after the downtime recovery of the target host is completed; and in response to determining that the downtime recovery of the host-dependent virtual machine is completed, migrate the host-dependent virtual machine.

In some embodiments, the automatic diagnosis unit is further configured to: acquire fault information of the target host and the execution output of the diagnostic commands run when downtime confirmation was performed on the target host; and determine a downtime reason, and suggested recovery steps for that reason, according to the information of the target host, the fault information, the execution output, and a preset operation and maintenance knowledge base.

In some embodiments, the apparatus further comprises a knowledge base updating unit configured to: in response to receiving confirmation information for the downtime reason and the suggested recovery steps, add the downtime reason and the suggested recovery steps to the operation and maintenance knowledge base; and in response to receiving modification information for the downtime reason and the suggested recovery steps, add the modified downtime reason and the modified suggested recovery steps to the operation and maintenance knowledge base.

In some embodiments, the above apparatus further comprises: an information shielding unit configured to shield fault alarm information for the target host in response to determining that the target host is down.

In some embodiments, the above apparatus further comprises: a shielding removing unit configured to remove the shielding of the fault alarm information for the target host in response to completion of the downtime recovery of the target host and the virtual machines hosted on the target host.

In some embodiments, a preset terminal is configured with a downtime manual processing interface; and the above apparatus further comprises: a manual intervention unit configured to, in response to detecting that the target host has not recovered within a preset duration, send a manual processing notification message to the preset terminal, so that manual intervention can be performed through the downtime manual processing interface.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments of the first aspect.

In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method as described in any one of the embodiments of the first aspect.

According to the downtime processing method and device provided by the embodiments of the application, after downtime alarm information for a target host is received, downtime confirmation can be performed on the target host. After the target host is determined to be down, relevant information of the target host can be acquired. The relevant information may include information of the target host and information of the virtual machines hosted on the target host. Downtime recovery is then performed on the target host and the virtual machines hosted on the target host based on the relevant information. After the downtime recovery of the target host and the virtual machines hosted on it is determined to be complete, the downtime of the target host can be automatically diagnosed. The method of this embodiment performs automatic recovery and automatic diagnosis when a host goes down, and improves the quality and speed of host fault recovery.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a downtime processing method according to the present application;

FIG. 3 is a schematic diagram of an application scenario of the downtime processing method according to the application;

FIG. 4 is a flow diagram of another embodiment of a downtime processing method according to the present application;

FIG. 5 is a schematic block diagram of one embodiment of a downtime processing apparatus according to the present application;

FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the downtime processing method or downtime processing apparatus of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, a host 105, and virtual machines 106, 107, 108. Network 104 is used to provide a medium for communication links between terminal device 101 and virtual machine 106, between terminal device 102 and virtual machine 107, and between terminal device 103 and virtual machine 108. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

A user may use the terminal devices 101, 102, 103 to interact with the virtual machines 106, 107, 108 over the network 104 to receive or send messages or the like. Various communication client applications, such as video applications, live applications, instant messaging tools, mailbox clients, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.

Here, the terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited here.

The host 105 may be a server that hosts the virtual machines 106, 107, 108. The virtual machines 106, 107, 108 may be servers providing various services, such as background servers providing support for the terminal devices 101, 102, 103. The background server may analyze and otherwise process data such as fault information, and feed back a processing result (e.g., fault notification information) to the terminal device.

It should be noted that the downtime processing method provided by the embodiments of the present application may be executed by a server or a terminal device other than the host 105 and the virtual machines 106, 107, and 108. For example, the execution subject may be a server in the same server cluster as the host 105 and the virtual machines 106, 107, and 108, or a control server, and the like. Accordingly, the downtime processing apparatus may be disposed in that server or terminal device.

It should be understood that the number of terminal devices, networks, servers, and virtual machines in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, servers, and virtual machines, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a downtime processing method according to the present application is shown. The downtime processing method of this embodiment comprises the following steps:

step 201, in response to receiving the downtime alarm information for the target host, performing downtime confirmation on the target host.

In this embodiment, the execution subject of the downtime processing method may receive the downtime alarm information for the target host over a wired or wireless connection. The execution subject may perform downtime confirmation on the target host when or after it receives the downtime alarm information for the target host. Here, the downtime confirmation is used to confirm whether the target host is actually down.

Specifically, the execution subject may determine whether the target host is down in various ways. For example, it may send a message to the target host and, if the target host does not respond within a preset time period, consider the target host to be down. Alternatively, the execution subject may attempt to log in to the target host remotely and, if the login does not succeed, consider the target host to be down.
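The first check described above (send a message and treat a missing response within a preset period as downtime) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the name `confirm_down`, the injected `probe` callable, and the idea of repeating the probe several times before declaring downtime are all hypothetical.

```python
from typing import Callable


def confirm_down(probe: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True only if every probe attempt fails.

    `probe` is a hypothetical callable that sends one message to the target
    host and returns True if a response arrives within the preset period.
    Repeating the probe is an assumption added here to reduce false alarms.
    """
    for _ in range(attempts):
        try:
            if probe():
                return False  # the host answered, so it is not down
        except OSError:
            pass  # treat a network error like a missing response
    return True


if __name__ == "__main__":
    # A probe that never gets an answer leads to a "down" verdict.
    print(confirm_down(lambda: False))
```

Passing the probe in as a callable keeps the sketch independent of any particular transport, so the same function could wrap a message exchange or a remote login attempt.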

In some optional implementations of the present embodiment, the executing entity may perform downtime confirmation on the target host through the following steps not shown in fig. 2: sending downtime confirmation information of a target host to target electronic equipment, wherein the downtime confirmation information is used for confirming whether the target host is down or not; and confirming that the target host machine is down in response to receiving feedback information which is sent by the target electronic equipment and used for indicating that the target host machine is down.

In this implementation, the execution subject may send downtime confirmation information for the target host to the target electronic device. The target electronic device may be the target host itself, or another electronic device capable of determining whether the target host is down. In practice, the execution subject or the other electronic device may send detection information to the target host; if the target host returns no information, the target host is down. Alternatively, if the execution subject or the other electronic device receives downtime event information output by the target host, it may determine that the target host is down.

Step 202, in response to determining that the target host is down, obtaining relevant information of the target host.

The execution subject may further obtain relevant information of the target host when or after determining that the target host is down. Here, the relevant information of the target host may include information of the target host and information of the virtual machines hosted on the target host. The information of the target host may include the hardware configuration and resource configuration of the host, and information about which virtual machines it hosts. The information of a virtual machine may include information indicating the host on which the virtual machine is hosted, information indicating the resource configuration of the virtual machine, and the like. The execution subject may obtain the relevant information of the target host from other electronic devices or locally, to support subsequent operations such as initiating data migration.
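One possible shape for the relevant information described above is sketched below. The class and field names (`HostInfo`, `VmInfo`, `RelevantInfo`, `collect_relevant_info`) and the use of an in-memory inventory list are assumptions made for illustration; the application does not prescribe a data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VmInfo:
    vm_id: str
    host_id: str                      # host on which the VM is hosted
    resource_config: Dict[str, str]   # e.g. disk kind, CPU, memory


@dataclass
class HostInfo:
    host_id: str
    hardware_config: Dict[str, str]
    resource_config: Dict[str, str]
    vm_ids: List[str] = field(default_factory=list)


@dataclass
class RelevantInfo:
    host: HostInfo
    vms: List[VmInfo]


def collect_relevant_info(host: HostInfo, inventory: List[VmInfo]) -> RelevantInfo:
    """Gather the host record and the records of the VMs hosted on it."""
    hosted = [vm for vm in inventory if vm.host_id == host.host_id]
    return RelevantInfo(host=host, vms=hosted)
```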

In some optional implementations of this embodiment, the method may further include the following step not shown in fig. 2: in response to determining that the target host is down, shielding fault alarm information for the target host.

In this implementation, when or after the execution subject determines that the target host is down, it may shield the fault alarm information for the target host. It will be appreciated that the virtual machines hosted on the target host are unavailable after the downtime, which may trigger a variety of fault alarm information. In general, each piece of fault alarm information triggers a certain recovery step, which may include sending information to a terminal held by a technician, triggering a recovery program, and so on. Once the target host has recovered from the downtime, these alarms may no longer be raised. Therefore, the execution subject can perform downtime recovery on the target host directly while shielding the other fault alarm information for the target host, so that the various recovery steps are not executed unnecessarily.
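A minimal sketch of alarm shielding is given below, assuming alarms are dictionaries carrying a `host_id` key and that shielding is tracked per host. The class name `AlarmShield` and its methods are hypothetical. The `unshield` method corresponds to the removal of shielding that the application describes for the point at which downtime recovery completes.

```python
class AlarmShield:
    """Tracks hosts whose fault alarms are currently shielded (suppressed)."""

    def __init__(self) -> None:
        self._shielded = set()

    def shield(self, host_id: str) -> None:
        # Called once the host is confirmed to be down.
        self._shielded.add(host_id)

    def unshield(self, host_id: str) -> None:
        # Called once downtime recovery of the host and its VMs completes.
        self._shielded.discard(host_id)

    def should_dispatch(self, alarm: dict) -> bool:
        """Return False for alarms that belong to a shielded host."""
        return alarm.get("host_id") not in self._shielded
```

An alarm pipeline would consult `should_dispatch` before triggering the recovery step tied to an alarm, so that secondary alarms caused by the downtime do not start redundant recovery actions.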

Step 203, based on the above related information, performing downtime recovery on the target host and the virtual machines hosted on the target host.

After obtaining the relevant information of the target host, the execution subject may perform downtime recovery on the target host and the virtual machines hosted on the target host. The downtime recovery may include data migration, such as migrating out the data stored on the target host. The downtime recovery can be carried out by sending an instruction to the target host, or by a recovery program that the target host starts automatically. For example, the execution subject or another electronic device sends a data migration instruction to the target host; or a downtime recovery program is installed on the target host in advance, and when the target host goes down, the downtime event triggers the recovery program to perform the downtime recovery.

In some optional implementations of this embodiment, the information of the virtual machines hosted on the target host includes resource information used by the virtual machines. The execution subject may further implement the downtime recovery through the following steps not shown in fig. 2: determining, according to the resource information used by the virtual machines, whether a non-host-dependent virtual machine exists; and in response to determining that a non-host-dependent virtual machine exists, migrating the non-host-dependent virtual machine.

In this implementation, the resource information used by a virtual machine may include information about where the virtual machine's data is stored. For example, if the data of the virtual machine is stored in the cloud, the resource information used by the virtual machine may include a cloud disk. If the data of the virtual machine is stored locally on the target host, the resource information used by the virtual machine may include a local disk. The execution subject may determine, based on the resource information used by the virtual machines, whether a non-host-dependent virtual machine exists. Here, a non-host-dependent virtual machine is a virtual machine that can run without data local to the host. Because the data that a non-host-dependent virtual machine depends on is not stored locally on the target host, its fault can be resolved by migrating it to another host.
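The distinction drawn above can be sketched as a simple check on disk kinds. The `VirtualMachine` type, the `"cloud"` and `"local"` labels, and the rule that a single local disk makes a virtual machine host-dependent are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VirtualMachine:
    name: str
    disks: Tuple[str, ...]  # hypothetical disk kinds: "cloud" or "local"


def is_host_dependent(vm: VirtualMachine) -> bool:
    """Assume a VM is host-dependent if any of its data is on a local disk."""
    return "local" in vm.disks


def split_by_dependency(
    vms: List[VirtualMachine],
) -> Tuple[List[VirtualMachine], List[VirtualMachine]]:
    """Return (non_host_dependent, host_dependent) lists, order preserved."""
    non_dependent = [vm for vm in vms if not is_host_dependent(vm)]
    dependent = [vm for vm in vms if is_host_dependent(vm)]
    return non_dependent, dependent
```

Under this sketch, the first list can be migrated immediately, while the second must wait for the target host to recover because its data is still on that host.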

In some optional implementations of this embodiment, the information of the target host includes resource information of the target host. In migrating the non-hosted dependent virtual machine, the execution agent may be implemented by the following steps not shown in FIG. 2: selecting a new host machine; and migrating the non-host-dependent virtual machine to a new host machine.

In this implementation, the execution subject may select a new host for the non-host-dependent virtual machine. To ensure that the virtual machine recovers from the fault quickly, the execution subject may select, as the new host, a host whose resource information is the same as that of the target host. In this way, fault recovery can be achieved without reconfiguring the non-host-dependent virtual machine.
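Selecting a new host with matching resource information might look like the sketch below. The fields of `HostResources`, the `healthy` flag, and the first-match policy are assumptions; the application only states that the resource information of the new host is the same as that of the target host.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HostResources:
    cpu_cores: int
    memory_gb: int
    disk_type: str


@dataclass(frozen=True)
class Host:
    name: str
    resources: HostResources
    healthy: bool = True  # hypothetical flag: skip hosts that are down


def select_new_host(target: Host, candidates: Iterable[Host]) -> Optional[Host]:
    """Return the first healthy candidate whose resources equal the target's."""
    for host in candidates:
        if host.name == target.name or not host.healthy:
            continue
        if host.resources == target.resources:
            return host
    return None  # no matching host; the caller must decide how to proceed
```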

In some optional implementations of the present embodiment, the executing agent may further implement the downtime restoration by the following steps not shown in fig. 2: determining whether a host-dependent virtual machine exists according to resource information used by the virtual machine; in response to determining that the host-dependent virtual machine exists and detecting completion of downtime recovery of the target host machine, detecting whether the host-dependent virtual machine completes the downtime recovery; and in response to determining that the downtime recovery of the host-dependent virtual machine is completed, selecting a new host and migrating the host-dependent virtual machine to the new host.

In this implementation, the execution subject may further determine, according to the resource information used by the virtual machines, whether a host-dependent virtual machine exists. Here, a host-dependent virtual machine is a virtual machine whose required runtime data is stored locally on the target host. If the execution subject determines that a host-dependent virtual machine exists, it may check the progress of the downtime recovery of the target host. In this implementation, the downtime recovery of the target host is performed by a first automatic recovery program that is started after the target host goes down. Once the downtime recovery of the target host is complete, the execution subject may detect whether the downtime recovery of the host-dependent virtual machine is complete. The downtime recovery of the host-dependent virtual machine is performed by a second automatic recovery program that is started after the downtime recovery of the target host is completed. After the execution subject determines that the downtime recovery of the host-dependent virtual machine is complete, it may migrate the host-dependent virtual machine. Specifically, the execution subject may select a new host and then migrate the host-dependent virtual machine to the new host. It can be understood that the new host may be selected in the same way as in the migration of the non-host-dependent virtual machine, or the host-dependent virtual machine may be migrated to the host to which the non-host-dependent virtual machine was migrated.
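The ordering described above (host recovery first, then recovery of the host-dependent virtual machine, then migration) can be sketched with polling. The helper names, the injected status callables, and the timeout and interval values are assumptions; how recovery status is actually detected is not specified here.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def recover_host_dependent_vm(
    host_recovered: Callable[[], bool],
    vm_recovered: Callable[[], bool],
    migrate: Callable[[], None],
    timeout_s: float = 600.0,
    interval_s: float = 1.0,
) -> bool:
    """Migrate the VM only after the host, and then the VM, report recovery."""
    if not wait_until(host_recovered, timeout_s, interval_s):
        return False  # host did not recover in time; do not touch the VM
    if not wait_until(vm_recovered, timeout_s, interval_s):
        return False  # VM did not recover in time; do not migrate
    migrate()
    return True
```

A `False` result is a natural point at which to hand the case over to manual handling.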

Step 204, in response to completion of the downtime recovery of the target host and the virtual machines, automatically diagnosing the downtime of the target host.

When or after determining that the downtime recovery of the target host and the virtual machines is complete, the execution subject may automatically diagnose the downtime of the target host. In this embodiment, the automatic diagnosis may determine the reason for the downtime of the target host, a means of resolving the downtime, suggested recovery steps for the downtime, and the like.
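The disclosure above mentions determining a downtime reason and suggested recovery steps from fault information, diagnostic command output, and a preset operation and maintenance knowledge base. The sketch below shows one way such a lookup could work. Keyword matching is an assumption chosen for simplicity, not the matching method of the application, and the `KnowledgeEntry` type and the sample entry in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class KnowledgeEntry:
    keywords: Tuple[str, ...]        # all must appear in the collected text
    reason: str
    suggested_steps: Tuple[str, ...]


def diagnose(
    fault_info: str,
    command_output: str,
    knowledge_base: List[KnowledgeEntry],
) -> Optional[KnowledgeEntry]:
    """Return the first entry whose keywords all occur in the evidence."""
    text = (fault_info + "\n" + command_output).lower()
    for entry in knowledge_base:
        if all(keyword.lower() in text for keyword in entry.keywords):
            return entry
    return None  # unknown cause; leave the case for operators to fill in


if __name__ == "__main__":
    # Hypothetical knowledge base content, for illustration only.
    kb = [KnowledgeEntry(("kernel panic",), "kernel panic", ("collect crash dump", "reboot host"))]
    result = diagnose("host unreachable", "Kernel panic - not syncing", kb)
    print(result.reason if result else "no match")
```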

In some optional implementations of this embodiment, the method may further include the following step not shown in fig. 2: in response to completion of the downtime recovery of the target host and the virtual machines hosted on the target host, removing the shielding of the fault alarm information for the target host.

In this implementation, when or after determining that the downtime recovery of the target host and the virtual machine hosted by the target host is completed, the execution subject may remove the shielding of the fault alarm information for the target host. In this way, fault monitoring of the target host and the virtual machines hosted on the target host may be restored.

In some optional implementation manners of this embodiment, it is preset that the terminal device is configured with the downtime manual processing interface. Here, the preset terminal may be a terminal device used by an operation and maintenance person. The above method may further comprise the following steps not shown in fig. 2: and in response to the fact that the target host is not recovered within the preset time, sending a manual processing notification message to a preset terminal so as to perform manual intervention processing through the downtime manual processing interface.

In this implementation, if the execution subject detects that the target host is not recovered within the preset duration, the execution subject may send a manual processing notification message to the preset terminal. Operation and maintenance personnel can then intervene manually through the downtime manual processing interface so as to restore the down target host as soon as possible.
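The timeout-based escalation can be sketched as a polling loop with a deadline. This is a hedged illustration rather than the described implementation: the function name, the polling interval, and the notification text are assumptions, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_or_escalate(
    is_recovered: Callable[[], bool],
    notify_terminal: Callable[[str], None],
    preset_duration_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the host recovered in time; otherwise notify and return False."""
    deadline = clock() + preset_duration_s
    while clock() < deadline:
        if is_recovered():
            return True
        sleep(poll_interval_s)
    # Automatic recovery did not finish within the preset duration:
    # hand over to operation and maintenance personnel.
    notify_terminal("target host not recovered within preset duration; manual processing required")
    return False
```

A monotonic clock is used for the deadline so that wall-clock adjustments do not shorten or extend the preset duration.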

In some optional implementations of this embodiment, the method further includes: sending fault notification information to a target terminal device associated with the service of the virtual machine, wherein the fault notification information corresponds to the category of the virtual machine. Correspondingly, the method further includes: in response to receiving downtime release information for the target host, sending fault release notification information to the target terminal device.

In these optional implementation manners, the executing entity may send the fault notification information to the target terminal device, so as to notify the user of the virtual machine affected by the downtime of the target host. The fault notification information indicates that the virtual machine is affected by a downtime fault of the target host. The target terminal device associated with the virtual machine service may be not only a terminal device requesting the virtual machine service, but also a terminal device used by a customer service person corresponding to the virtual machine service, or a terminal device of an operation and maintenance person of the service.

In practice, the execution subject may classify the virtual machines, and send different fault notification information to target terminal devices associated with services of different classes of virtual machines. The target terminal devices associated with the services of different classes of virtual machines may be different.

Specifically, the virtual machines may be classified according to whether they depend on the host, yielding host-dependent virtual machines and non-host-dependent virtual machines, where host dependence means that the virtual machine needs data local to the host in order to run. The virtual machines may also be classified according to whether they provide services to customers (i.e., users), yielding user service virtual machines and non-user service virtual machines. A user service virtual machine is one that provides a service to certain software on the customer's terminal device, so that the user's use of the software is directly affected when the host on which the virtual machine is hosted goes down. It should be noted that a virtual machine may be classified on a single basis, giving it one class label, or on two or more bases, giving it at least two class labels. For example, a virtual machine may be both a host-dependent virtual machine and a user service virtual machine.

For example, if virtual machine A of the target host is a user service virtual machine and virtual machine B of the target host is a non-user service virtual machine, the execution subject sends fault notification information A, i.e., "system fault", to mobile phone number 1 associated with the service of virtual machine A, and sends fault notification information B, i.e., "system fault", to mobile phone number 2 associated with the service of virtual machine B.

For example, if virtual machine A of the target host is a host-dependent virtual machine and virtual machine B of the target host is a non-host-dependent virtual machine, the execution subject sends fault notification information A, i.e., "system fault, please wait for personnel to repair", to mobile phone number 1 associated with the service of virtual machine A, and sends fault notification information B, i.e., "system fault, will recover shortly", to mobile phone number 2 associated with the service of virtual machine B.

The downtime release information indicates that the downtime of the target host has been resolved and that the function of the target host can be, or has been, recovered. On receiving the downtime release information, the execution subject may send the target terminal device fault release notification information indicating that the downtime fault has been resolved.

These implementations notify the terminal devices by category, so that users of different classes of virtual machines receive targeted notifications and the fault notification information is more readable.
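Category-based notification can be sketched as a lookup from class label to message text. This is a minimal illustration: the `VmInfo` type, the `contact` field, and the function name are hypothetical, only the host-dependence classification is shown, and the message texts paraphrase the examples given above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VmInfo:
    name: str
    host_dependent: bool
    contact: str  # hypothetical identifier of the terminal associated with the VM's service


# One message per class label (True = host-dependent, False = non-host-dependent).
MESSAGES: Dict[bool, str] = {
    True: "system fault, please wait for personnel to repair",
    False: "system fault, will recover shortly",
}


def build_fault_notifications(vms: List[VmInfo]) -> List[Tuple[str, str]]:
    """Return (contact, message) pairs, choosing the message by VM class."""
    return [(vm.contact, MESSAGES[vm.host_dependent]) for vm in vms]
```

A second classification basis, such as user service versus non-user service, could be handled the same way by keying the message table on a tuple of labels.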

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the downtime processing method according to the embodiment. In the application scenario of fig. 3, the control server 301 may perform downtime confirmation on the target host 302 in response to downtime alarm information for the target host 302. After determining that the target host 302 is down, the control server 301 obtains the relevant information of the target host 302 and, based on the relevant information, performs downtime recovery on the target host 302. After the downtime recovery is completed, the control server 301 automatically diagnoses the downtime of the target host 302.

According to the downtime processing method provided by the embodiment of the application, after the downtime alarm information aiming at the target host is received, the downtime of the target host can be confirmed. After determining that the target host is down, the relevant information of the target host can be acquired. The related information may include information of the target host and information of the virtual machine hosted on the target host. And then performing downtime recovery on the target host machine and the virtual machine hosted on the target host machine based on the related information. After the target host and the virtual machines hosted on the target host are determined to be complete in downtime recovery, the downtime of the target host can be automatically diagnosed. The method of the embodiment can perform automatic recovery and automatic diagnosis when the host machine is down, and improves the fault recovery quality and speed of the host machine.
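The four-step flow summarized above can be sketched as a small orchestration function. This is only an outline under stated assumptions: the function and parameter names are hypothetical, and each step is passed in as a callable because the description leaves the concrete confirmation, recovery, and diagnosis mechanisms open.

```python
from typing import Callable, Optional


def process_downtime(
    host: str,
    confirm_down: Callable[[str], bool],
    get_related_info: Callable[[str], dict],
    recover: Callable[[str, dict], bool],
    diagnose: Callable[[str, dict], str],
) -> Optional[str]:
    """Run confirm -> collect info -> recover -> diagnose; return the diagnosis or None."""
    # Step 1: confirm the alarm before acting on it.
    if not confirm_down(host):
        return None
    # Step 2: gather information about the host and its hosted virtual machines.
    info = get_related_info(host)
    # Step 3: recover the host and its virtual machines based on that information.
    if not recover(host, info):
        return None
    # Step 4: diagnose only after recovery has completed.
    return diagnose(host, info)
```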

With continued reference to fig. 4, a flow 400 of another embodiment of the downtime processing method according to the present application is illustrated. As shown in fig. 4, the downtime processing method of this embodiment may include the following steps:

Step 401, in response to receiving the downtime alarm information for the target host, performing downtime confirmation on the target host.

Step 402, in response to determining that the target host is down, obtaining relevant information of the target host.

Step 403, based on the above related information, performing downtime recovery on the target host and the virtual machines hosted on the target host.

The principle of steps 401 to 403 is similar to that of steps 201 to 203, and is not described herein again.

Step 404, in response to completion of downtime recovery of the target host and the virtual machine, obtaining fault information of the target host and execution output information of the diagnosis command when performing downtime confirmation on the target host.

In this embodiment, the related information of the target host may include log information of the target host and environment information of the target host. The log information may be data generated by the target host during the operation process. The environmental information may include information of the environment in which the target host is located, and may include temperature, humidity, location, and the like.

The execution main body can acquire the fault information of the target host machine when or after the completion of the downtime recovery of the target host machine and the virtual machine is determined. The failure information may be information indicating a type of failure of the target host. In particular, the fault types may include hardware faults and software faults. The failure information may be obtained from other electronic devices, from a hardware failure platform that collects hardware failure information, or from a software failure platform that collects software failure information. The electronic equipment and the fault platform can collect fault event information output by the host machine so as to determine fault information.

The execution subject can also acquire the execution output information of the diagnosis commands issued when other electronic devices performed downtime confirmation on the target host. It is understood that when the execution subject or another electronic device performs downtime confirmation or fault confirmation on the target host, it may send diagnosis commands to the target host, and the target host may output information in response to those commands. This information reflects the type of failure of the target host.

Step 405, determining a downtime reason and suggested recovery steps for the downtime reason according to the information of the target host, the fault information, the execution output information, and a preset operation and maintenance knowledge base.

The execution subject can determine the downtime reason of the target host and the suggested recovery steps for the downtime reason according to the obtained information and a preset operation and maintenance knowledge base. Here, the operation and maintenance knowledge base may include fault types, fault causes, and the steps taken to recover from each fault. The execution subject can use the obtained information as search terms to search the operation and maintenance knowledge base, take the fault cause that matches the information as the downtime reason, and take the recovery steps corresponding to that fault cause as the suggested recovery steps.

The execution subject can output the determined downtime reason and the suggested recovery steps for reference by operation and maintenance personnel.
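The knowledge base search can be sketched as follows. The description does not specify how matching is done, so this sketch substitutes a simple keyword-overlap score as an assumption. The `KbEntry` type and its fields are hypothetical, and the sample entries in the usage below are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class KbEntry:
    fault_type: str             # e.g. "hardware" or "software"
    cause: str                  # fault cause recorded in the knowledge base
    recovery_steps: List[str]   # steps taken to recover from this fault
    keywords: Set[str]          # hypothetical index terms used for matching


def diagnose(observations: List[str], kb: List[KbEntry]) -> Optional[KbEntry]:
    """Return the entry whose keywords overlap most with the observed text, if any."""
    tokens: Set[str] = set()
    for text in observations:
        tokens.update(text.lower().split())
    best: Optional[KbEntry] = None
    best_score = 0
    for entry in kb:
        score = len(entry.keywords & tokens)
        if score > best_score:
            best, best_score = entry, score
    return best
```

For instance, with made-up entries:

```python
kb = [
    KbEntry("hardware", "memory error", ["replace memory module"], {"memory", "ecc"}),
    KbEntry("software", "kernel panic", ["upgrade kernel"], {"kernel", "panic"}),
]
match = diagnose(["Kernel panic detected", "temperature normal"], kb)
# match.cause is the downtime reason; match.recovery_steps are the suggested steps.
```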

In some optional implementations of this embodiment, the method may further include the following steps not shown in fig. 4: in response to receiving confirmation information aiming at the downtime reason and the suggested recovery steps, adding the downtime reason and the suggested recovery steps into an operation and maintenance knowledge base; and in response to receiving modification information aiming at the downtime reason and the suggested recovery steps, adding the modified downtime reason and the modified suggested recovery steps into the operation and maintenance knowledge base.

In this implementation, the execution subject may output the determined downtime reason and suggested recovery steps for the operation and maintenance personnel to review. If the operation and maintenance personnel confirm that the downtime reason and the suggested recovery steps are correct, the downtime reason and the suggested recovery steps can be added to the operation and maintenance knowledge base. Similarly, if the operation and maintenance personnel determine that the downtime reason or the suggested recovery steps are wrong, they can modify them according to the actual downtime reason of the target host. The execution subject may then add the modified downtime reason and the modified suggested recovery steps to the operation and maintenance knowledge base. In this way, the operation and maintenance knowledge base is updated.
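The confirm-or-modify update can be sketched in a few lines. This is an assumption-laden illustration: the `Diagnosis` alias and the function name are hypothetical, and the knowledge base is modeled as a plain list.

```python
from typing import List, Optional, Tuple

# (downtime reason, suggested recovery steps)
Diagnosis = Tuple[str, List[str]]


def apply_review(
    kb: List[Diagnosis],
    proposed: Diagnosis,
    modified: Optional[Diagnosis] = None,
) -> None:
    """Add the confirmed diagnosis, or the reviewer's modified version if one is given."""
    # Confirmation keeps the proposed entry; modification replaces it before storing.
    kb.append(modified if modified is not None else proposed)
```

Either way, only a reviewed entry reaches the knowledge base, so later diagnoses draw on results that personnel have checked.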

In some optional implementation manners of this embodiment, the executing entity may further update, according to the updated operation and maintenance knowledge base, the first automatic recovery program that is run by the target host when the target host is down. Or, the executing body may further update, according to the updated operation and maintenance knowledge base, a second automatic recovery program that is run by the virtual machine hosted by the target host machine when the virtual machine is down.

The downtime processing method provided by the embodiment of the application can automatically diagnose the downtime reason and update the operation and maintenance knowledge base.

With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a downtime processing apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.

As shown in fig. 5, the downtime processing apparatus 500 of the present embodiment includes: a downtime confirmation unit 501, an information acquisition unit 502, a downtime recovery unit 503, and an automatic diagnosis unit 504.

The downtime confirmation unit 501 is configured to perform downtime confirmation on the target host in response to receiving downtime alarm information for the target host.

The information acquisition unit 502 is configured to obtain relevant information of the target host in response to determining that the target host is down. The relevant information includes information of the target host machine and information of the virtual machine hosted on the target host machine.

The downtime recovery unit 503 is configured to perform downtime recovery on the target host and the virtual machines hosted by the target host based on the relevant information.

The automatic diagnosis unit 504 is configured to automatically diagnose the downtime of the target host in response to completion of the downtime recovery of the target host and the virtual machine.

In some optional implementations of the present embodiment, the downtime confirmation unit 501 may be further configured to: send downtime confirmation information of the target host to a target electronic device, wherein the downtime confirmation information is used for confirming whether the target host is down; and confirm that the target host is down in response to receiving feedback information that is sent by the target electronic device and indicates that the target host is down.

In some optional implementations of this embodiment, the information of the virtual machine hosted on the target host machine includes resource information used by the virtual machine. The downtime recovery unit 503 is further configured to: determine, according to the resource information used by the virtual machine, whether a non-host-dependent virtual machine exists, wherein a non-host-dependent virtual machine is one whose required runtime data is not located locally on the target host machine; and in response to determining that a non-host-dependent virtual machine exists, migrate the non-host-dependent virtual machine.

In some optional implementations of this embodiment, the information of the target host includes resource information of the target host. The downtime recovery unit 503 is further configured to: selecting a new host machine, wherein the resource information of the new host machine is the same as the resource information of the target host machine; and migrating the non-host-dependent virtual machine to a new host machine.

In some optional implementations of the present embodiment, the downtime recovering unit 503 is further configured to: determining whether a host-dependent virtual machine exists according to resource information used by the virtual machine, wherein the host-dependent virtual machine indicates that data required by the running of the virtual machine is located locally at the target host machine; in response to determining that the host-dependent virtual machine exists and detecting that the downtime recovery of the target host is completed, detecting whether the downtime recovery of the host-dependent virtual machine is completed or not, wherein the downtime recovery of the target host is realized by starting a first automatic recovery program after the downtime, and the downtime recovery of the host-dependent virtual machine is realized by starting a second automatic recovery program after the downtime recovery of the target host is completed; and in response to determining that the downtime recovery of the host-dependent virtual machine is completed, migrating the host-dependent virtual machine.

In some optional implementations of the present embodiment, the automatic diagnosis unit 504 is further configured to: acquire fault information of the target host and execution output information of a diagnosis command issued when downtime confirmation was performed on the target host; and determine a downtime reason and suggested recovery steps for the downtime reason according to the information of the target host, the fault information, the execution output information, and a preset operation and maintenance knowledge base.

In some optional implementations of this embodiment, the apparatus 500 may further include a knowledge base updating unit, not shown in fig. 5, configured to: in response to receiving confirmation information aiming at the downtime reason and the suggested recovery steps, adding the downtime reason and the suggested recovery steps into an operation and maintenance knowledge base; and in response to receiving modification information aiming at the downtime reason and the suggested recovery steps, adding the modified downtime reason and the modified suggested recovery steps into the operation and maintenance knowledge base.

In some optional implementations of the present embodiment, the apparatus 500 may further include an information shielding unit, not shown in fig. 5, configured to shield fault alarm information for the target host in response to determining that the target host is down.

In some optional implementations of the present embodiment, the apparatus 500 may further include a masking release unit, not shown in fig. 5, configured to release the masking of the fault alarm information for the target host in response to completion of the downtime recovery of the target host and the virtual machine hosted by the target host.

In some optional implementations of this embodiment, the preset terminal is configured with a downtime manual processing interface. The apparatus 500 may further include a manual intervention unit, not shown in fig. 5, configured to, in response to detecting that the target host is not recovered within a preset duration, send a manual processing notification message to the preset terminal so that manual intervention can be performed through the downtime manual processing interface.

It should be understood that the units 501 to 504 recorded in the downtime processing apparatus 500 respectively correspond to each step in the method described with reference to fig. 2. Thus, the operations and features described above for the downtime processing method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.

Referring now to FIG. 6, shown is a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to receiving downtime alarm information aiming at the target host, performing downtime confirmation on the target host; in response to determining that the target host is down, acquiring relevant information of the target host, wherein the relevant information comprises information of the target host and information of a virtual machine hosted on the target host; performing downtime recovery on the target host machine and the virtual machine hosted on the target host machine based on the relevant information; and responding to the completion of the downtime recovery of the target host and the virtual machine, and automatically diagnosing the downtime of the target host.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a downtime confirmation unit, an information acquisition unit, a downtime recovery unit, and an automatic diagnosis unit. The names of these units do not, in some cases, limit the units themselves; for example, the downtime confirmation unit may also be described as "a unit that performs downtime confirmation on a target host".

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, a technical solution may be formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure.

Claims

1. A downtime processing method comprises the following steps:

in response to receiving downtime alarm information aiming at a target host, performing downtime confirmation on the target host;

in response to determining that the target host is down, obtaining relevant information of the target host, wherein the relevant information comprises information of the target host and information of a virtual machine hosted on the target host;

performing downtime recovery on the target host machine and the virtual machine hosted on the target host machine based on the relevant information;

and in response to completion of the downtime recovery of the target host machine and the virtual machine, automatically diagnosing the downtime of the target host machine.

2. The method of claim 1, wherein said performing downtime confirmation on the target host comprises:

sending downtime confirmation information of the target host to target electronic equipment, wherein the downtime confirmation information is used for confirming whether the target host is down or not;

and in response to receiving feedback information sent by the target electronic equipment and used for indicating that the target host machine is down, confirming that the target host machine is down.

3. The method of claim 1, wherein the information of the virtual machine hosted on the target host machine includes resource information used by the virtual machine; and

performing downtime recovery on the target host machine and the virtual machine hosted on the target host machine based on the relevant information, including:

determining whether a non-host-dependent virtual machine exists according to the resource information used by the virtual machine, wherein the non-host-dependent virtual machine indicates that data required by the running of the virtual machine is not located locally on the target host machine;

in response to determining that a non-host-dependent virtual machine exists, migrating the non-host-dependent virtual machine.

4. The method of claim 3, wherein the information of the target host comprises resource information of the target host; and

the migrating the non-host-dependent virtual machine includes:

selecting a new host machine, wherein the resource information of the new host machine is the same as the resource information of the target host machine;

and migrating the non-host-dependent virtual machine to a new host machine.

5. The method of claim 3, wherein said performing downtime recovery on the target host machine and the virtual machine hosted on the target host machine based on the relevant information comprises:

determining whether a host-dependent virtual machine exists according to the resource information used by the virtual machine, wherein the host-dependent virtual machine indicates that data required by the running of the virtual machine is located locally on the target host machine;

in response to determining that a host-dependent virtual machine exists and detecting completion of downtime recovery of the target host, detecting whether downtime recovery of the host-dependent virtual machine is completed, wherein the downtime recovery of the target host is realized by starting a first automatic recovery program after downtime, and the downtime recovery of the host-dependent virtual machine is realized by starting a second automatic recovery program after the downtime recovery of the target host is completed;

migrating the host-dependent virtual machine in response to determining that the downtime recovery of the host-dependent virtual machine is complete.

6. The method of claim 1, wherein said automatically diagnosing said downtime of said target host comprises:

acquiring fault information of the target host and execution output information of a diagnosis command executed during the downtime confirmation of the target host;

and determining a downtime reason and suggested recovery steps for the downtime reason according to the information of the target host, the fault information, the execution output information and a preset operation and maintenance knowledge base.

7. The method of claim 6, wherein the method further comprises:

in response to receiving acknowledgement information for the reason for the downtime and the suggested recovery steps, adding the reason for the downtime and the suggested recovery steps to the operation and maintenance knowledge base;

and in response to receiving modification information for the downtime reason and the suggested recovery steps, adding the modified downtime reason and the modified suggested recovery steps to the operation and maintenance knowledge base.
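
As an illustration only, and not part of the claims, the diagnosis and knowledge-base feedback loop recited above can be sketched as follows. The claims do not say how the knowledge base is matched against the collected information; this sketch assumes a simple case-insensitive substring match on a hypothetical "signature" string, and all class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Diagnosis:
    reason: str
    recovery_steps: List[str]


class KnowledgeBase:
    """Hypothetical operation and maintenance knowledge base keyed by signature."""

    def __init__(self) -> None:
        self._entries: Dict[str, Diagnosis] = {}

    def add(self, signature: str, diagnosis: Diagnosis) -> None:
        self._entries[signature] = diagnosis

    def diagnose(self, fault_info: str, command_output: str) -> Optional[Diagnosis]:
        # Assumed matching rule: a signature matches if it appears anywhere in
        # the fault information or the diagnosis command output.
        text = f"{fault_info}\n{command_output}".lower()
        for signature, diagnosis in self._entries.items():
            if signature.lower() in text:
                return diagnosis
        return None


def record_feedback(
    kb: KnowledgeBase,
    signature: str,
    proposed: Diagnosis,
    modified: Optional[Diagnosis] = None,
) -> Diagnosis:
    """Store the confirmed diagnosis, or the operator's modified version."""
    final = modified if modified is not None else proposed
    kb.add(signature, final)
    return final
```

Passing `modified=None` models the confirmation case, while passing a corrected `Diagnosis` models the modification case; in both cases the stored entry becomes available to later diagnoses.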

8. The method of claim 1, wherein the method further comprises:

in response to determining that the target host is down, shielding fault alarm information for the target host.

9. The method of claim 8, wherein the method further comprises:

in response to completion of the downtime recovery of the target host and the virtual machines hosted on the target host, canceling the shielding of the fault alarm information for the target host.

10. The method according to claim 1, wherein a preset terminal is configured with a downtime manual processing interface; and

the method further comprises:

in response to detecting that the target host has not recovered within a preset time length, sending a manual processing notification message to the preset terminal, so that manual intervention is performed through the downtime manual processing interface.
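
As an illustration only, and not part of the claims, the timeout-based escalation to manual handling can be sketched as follows. The polling approach, the function name, and the injected `clock` and `sleep` parameters are assumptions made so the sketch can be exercised without real waiting; the claims only require that a notification is sent when recovery does not complete within the preset time length.

```python
import time
from typing import Callable


def wait_for_recovery_or_escalate(
    is_recovered: Callable[[], bool],
    notify: Callable[[str], None],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the host recovers; notify the preset terminal on timeout.

    Returns True if recovery was observed, False if the notification was sent.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_recovered():
            return True
        sleep(poll_interval_s)
    # One final check so a recovery that lands exactly at the deadline counts.
    if is_recovered():
        return True
    notify("target host not recovered within the preset time; manual handling required")
    return False
```

Here `notify` stands in for whatever channel delivers the manual processing notification message to the preset terminal.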

11. A downtime processing apparatus, comprising:

a downtime confirmation unit configured to perform downtime confirmation on a target host in response to receiving downtime alarm information for the target host;

an information obtaining unit configured to obtain relevant information of the target host machine in response to determining that the target host machine is down, the relevant information including information of the target host machine and information of a virtual machine hosted on the target host machine;

a downtime recovery unit configured to perform downtime recovery on the target host and the virtual machine hosted on the target host based on the relevant information; and

an automatic diagnosis unit configured to automatically diagnose downtime of the target host in response to completion of the downtime recovery of the target host and the virtual machine.

12. The apparatus of claim 11, wherein the downtime confirmation unit is further configured to:

13. The apparatus of claim 11, wherein the information of the virtual machine hosted on the target host machine includes resource information used by the virtual machine; and

the downtime recovery unit is further configured to:

14. The apparatus of claim 13, wherein the information of the target host comprises resource information of the target host; and

the downtime recovery unit is further configured to:

and migrating the non-host-dependent virtual machine to a new host machine.

15. The apparatus of claim 13, wherein the downtime recovery unit is further configured to:

16. The apparatus of claim 11, wherein the automated diagnostic unit is further configured to:

17. The apparatus of claim 16, wherein the apparatus further comprises a knowledge base updating unit configured to:

18. The apparatus of claim 11, wherein the apparatus further comprises:

an information shielding unit configured to shield fault alarm information for the target host in response to determining that the target host is down.

19. The apparatus of claim 18, wherein the apparatus further comprises:

a shielding cancellation unit configured to cancel shielding of the fault alarm information for the target host in response to completion of downtime recovery of the target host and the virtual machines hosted on the target host.

20. The apparatus according to claim 11, wherein a preset terminal is configured with a downtime manual processing interface; and

the apparatus further comprises:

a manual intervention unit configured to, in response to detecting that the target host has not recovered within a preset time length, send a manual processing notification message to the preset terminal, so that manual intervention is performed through the downtime manual processing interface.

21. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.

22. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-10.