CN111897697A - Server hardware fault repairing method and device - Google Patents

Server hardware fault repairing method and device

Info

Publication number
CN111897697A
CN111897697A (application CN202010801889.9A)
Authority
CN
China
Prior art keywords
server
replacement
hardware
servers
standby
Prior art date
Legal status
Pending
Application number
CN202010801889.9A
Other languages
Chinese (zh)
Inventor
李斯达
赵亮
刘晨科
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010801889.9A priority Critical patent/CN111897697A/en
Publication of CN111897697A publication Critical patent/CN111897697A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/30: Monitoring
    • G06F 11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3034: Monitoring arrangements where the computing system component is a storage system, e.g. DASD based or network based
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1402: Saving, restoring, recovering or retrying
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 11/1448: Management of the data involved in backup or backup restore
    • G06F 11/1456: Hardware arrangements for backup
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/08: Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q 10/087: Inventory or stock management, e.g. order filling, procurement or balancing against orders

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Mathematical Physics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Hardware Redundancy (AREA)

Abstract

Disclosed are a server hardware fault repair method and apparatus for a server group, in which the servers store data for executing tasks and serve as hosts carrying virtual machine traffic. The method includes: receiving server fault information for the server group, the fault information indicating that a specific server in the group has a hardware fault; determining, among a plurality of standby servers and based on their operating states, candidate replacement servers for replacing the specific server; determining, among the candidate replacement servers, a replacement server that matches the specific server based on the configuration parameters of the specific server and of the candidates; outputting a complete machine replacement message for transferring the task-execution data stored on the specific server to the replacement server; and, once the specific server has been replaced by the replacement server, updating the server configuration parameters of the group.

Description

Server hardware fault repairing method and device
Technical Field
The present application relates to cloud technologies, and in particular, to a method, an apparatus, a device, and a storage medium for repairing a server hardware failure in a server group.
Background
During a server's operating life cycle, when a server executing a service fails or is about to fail, a complete machine replacement of the failed server may be required: the replacement server must carry the same service data and service attributes as the original. In conventional solutions, a complete replacement requires contacting the manufacturer for on-site maintenance, which takes hours or even days. When the failed server still has tasks to execute, repair times of this length seriously degrade the user experience. A faster hardware fault repair method is therefore desirable.
Disclosure of Invention
According to an aspect of the present application, a server hardware fault repairing method for a server group is provided, wherein a server in the server group stores data for executing a task, and the method includes: receiving server fault information in the server group, wherein the server fault information indicates that a specific server in the server group has a hardware fault; responding to the server fault information, acquiring the configuration parameters of the specific server, and acquiring the configuration parameters and the operation states of a plurality of standby servers; determining a candidate replacement server for replacing the specific server among a plurality of standby servers based on the operating states of the plurality of standby servers; determining, among the candidate replacement servers, a replacement server that matches the particular server based on the configuration parameters of the particular server and the configuration parameters of the candidate replacement servers; outputting a complete machine replacement message for transferring data for performing a task stored in the specific server to the replacement server; updating server configuration parameters in the group of servers in the event the particular server is replaced by the replacement server.
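For illustration only, the method steps summarized above can be sketched in Python. Everything here (the `Server` class, the `repair_hardware_fault` function, and the dictionary-based configuration registry) is a hypothetical rendering, not the patent's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Server:
    server_id: str
    config: dict               # configuration parameters (model, room unit, ...)
    hardware_ok: bool = True   # operating state: hardware status
    network_ok: bool = True    # operating state: network connection status

def repair_hardware_fault(faulty, standby_pool, registry):
    """Sketch of the claimed steps: select candidates by operating state,
    match by configuration parameters, emit a complete machine replacement
    message, and update the server group's configuration registry."""
    # 1. Determine candidate replacement servers from operating states.
    candidates = [s for s in standby_pool if s.hardware_ok and s.network_ok]
    # 2. Determine the replacement server by matching configuration parameters.
    matches = [s for s in candidates if s.config == faulty.config]
    if not matches:
        return None
    replacement = matches[0]
    # 3. Output a complete machine replacement message (here: a dict).
    message = {"transfer_data_from": faulty.server_id,
               "transfer_data_to": replacement.server_id}
    # 4. Update the server configuration parameters in the group.
    registry[replacement.server_id] = registry.pop(faulty.server_id)
    return message
```

The returned message stands in for whatever guidance the real system outputs to drive the data transfer; only after the replacement succeeds is the registry updated.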
In some embodiments, the operating state includes a hardware state and a network connection state, and determining, among the plurality of standby servers, a candidate replacement server for replacing the specific server based on their operating states comprises: selecting a standby server from the plurality of standby servers, and determining the standby server as a candidate replacement server if its hardware state is normal and its network connection state is normal.
In some embodiments, the standby server is an idle server networked with the group of servers.
In some embodiments, the number of the plurality of standby servers is determined based on a selling weight, an availability of a server in the group of servers, and a model remaining amount of the server, wherein the selling weight is a weight parameter determined according to a market plan of the server, the availability is a parameter indicating a failure rate of the server in the group of servers, and the model remaining amount is a parameter indicating the number of servers of the same model as the specific server in the group of servers.
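The embodiment above names three factors (selling weight, availability, and model remaining amount) but gives no formula for combining them. The combination below is purely a hypothetical illustration of how such a standby count might be derived:

```python
import math

def standby_count(selling_weight, failure_rate, model_remaining):
    """Reserve more standby machines for models that are heavily marketed,
    fail often, or are widely deployed; always keep at least one.
    The multiplicative combination is an assumption, not the patent's rule."""
    expected_failures = failure_rate * model_remaining
    return max(1, math.ceil(selling_weight * expected_failures))
```

For example, a model with 100 units in service and a 2% failure rate at full selling weight would reserve two standby servers under this sketch.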
In some embodiments, the configuration parameters include at least one of: machine room unit information, product model information, equipment type information, hardware version information and size information.
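The five configuration parameters listed in this embodiment could be modeled as a simple record for matching purposes. The field names below are assumptions chosen for readability:

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class ConfigParams:
    room_unit: str         # machine room unit information
    product_model: str     # product model information
    device_type: str       # equipment type information
    hardware_version: str  # hardware version information
    size: str              # size (rack-space) information

def is_match(a, b):
    # A replacement server matches when every compared parameter agrees.
    return astuple(a) == astuple(b)
```

A real system might relax this to compare only a subset of parameters, since the embodiment says the configuration includes "at least one of" these fields.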
In some embodiments, before the specific server is replaced by the replacement server, the hardware fault repair method further includes: performing a hardware status check and a network connection status check on the replacement server to determine whether its hardware status and network connection status are normal.
In some embodiments, in a case that the hardware state and the network connection state of the replacement server are both normal, replacing the specific server with the replacement server, and performing data transfer; reselecting a replacement server from the plurality of standby servers in the event that the hardware status or the network connection status of the replacement server is not normal.
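The check-then-reselect behavior described above amounts to scanning the standby candidates until one passes both status checks. A minimal sketch, with the check functions supplied by the caller (the function name and signature are assumptions):

```python
def choose_checked_replacement(candidates, check_hardware, check_network):
    """Return the first candidate that passes both the hardware status check
    and the network connection status check, or None if every candidate
    fails and no replacement can be selected."""
    for server in candidates:
        if check_hardware(server) and check_network(server):
            return server
    return None
```

Only after a candidate passes both checks would the replacement and data transfer proceed.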
In some embodiments, after updating the server configuration parameters in the server group, the hardware fault repair method further includes: performing a hardware status check and a network connection status check on the replacement server to determine whether its hardware status and network connection status are normal.
In some embodiments, the server configuration parameters in the server group are updated if the hardware status and the network connection status of the replacement server are both normal, and a replacement server is reselected from the plurality of standby servers if the hardware status or the network connection status of the replacement server is not normal.
In some embodiments, the hardware fault repair method further comprises: disabling alarm information for the specific server during the hardware fault repair process.
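Disabling alarms for the specific server only for the duration of the repair can be expressed naturally as a scoped operation. The set-of-alarmed-servers representation below is an assumption for illustration:

```python
from contextlib import contextmanager

@contextmanager
def alarms_disabled(alarm_registry, server_id):
    """Suppress alarms for one server while its planned replacement runs,
    then re-enable them, even if the repair raises an exception."""
    alarm_registry.discard(server_id)   # stop alerting on this server
    try:
        yield
    finally:
        alarm_registry.add(server_id)   # re-enable after replacement
```

This keeps the intentional outage of the faulty server from paging operators during the replacement window.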
According to another aspect of the present application, there is also provided a server hardware failure recovery apparatus for a server group, wherein a server in the server group stores data for performing a task, including: a fault information receiving unit configured to receive server fault information in the server group, the server fault information indicating that a hardware fault occurs in a specific server in the server group; a parameter obtaining unit configured to obtain configuration parameters of the specific server in response to the server failure information, and obtain configuration parameters and operation states of a plurality of standby servers; a standby selection unit configured to determine a candidate replacement server for replacing the specific server among a plurality of standby servers based on operation states of the plurality of standby servers; a parameter comparison unit configured to determine, among the candidate replacement servers, a replacement server that matches the specific server based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers; a guidance output unit configured to output a whole replacement message for transferring data for performing a task stored in the specific server to the replacement server; an information synchronization unit configured to update the server configuration parameters in the server group in case the specific server is replaced by the replacement server.
In some embodiments, the operating state includes a hardware state and a network connection state, and determining, among the plurality of standby servers, a candidate replacement server for replacing the specific server based on their operating states comprises: selecting a standby server from the plurality of standby servers, and determining the standby server as a candidate replacement server if its hardware state is normal and its network connection state is normal.
In some embodiments, the standby server is an idle server networked with the group of servers.
In some embodiments, the number of the plurality of standby servers is determined based on a selling weight, an availability of a server in the group of servers, and a model remaining amount of the server, wherein the selling weight is a weight parameter determined according to a market plan of the server, the availability is a parameter indicating a failure rate of the server in the group of servers, and the model remaining amount is a parameter indicating the number of servers of the same model as the specific server in the group of servers.
In some embodiments, the configuration parameters include at least one of: machine room unit information, product model information, equipment type information, hardware version information and size information.
In some embodiments, the hardware fault repair apparatus further includes a status checking unit configured to perform a hardware status check and a network connection status check on the replacement server, before the specific server is replaced by the replacement server, to determine whether its hardware status and network connection status are normal.
In some embodiments, in a case that the hardware state and the network connection state of the replacement server are both normal, replacing the specific server with the replacement server, and performing data transfer; reselecting a replacement server from the plurality of standby servers in the event that the hardware status or the network connection status of the replacement server is not normal.
In some embodiments, the hardware fault repairing apparatus further includes a status checking unit configured to perform a hardware status check and a network connection status check on the replacement server after updating the server configuration parameters in the server group to determine whether the hardware status and the network connection status of the replacement server are normal.
In some embodiments, the server configuration parameters in the server group are updated if the hardware status and the network connection status of the replacement server are both normal, and a replacement server is reselected from the plurality of standby servers if the hardware status or the network connection status of the replacement server is not normal.
In some embodiments, the hardware fault repair apparatus disables alarm information for the specific server during the hardware fault repair process.
According to another aspect of the present application, there is also provided a hardware fault repairing apparatus, including a memory and a processor, wherein the memory stores instructions that, when executed by the processor, cause the processor to execute the hardware fault repairing method as described above.
According to another aspect of the present application, there is also provided a computer-readable storage medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the hardware fault repair method as described above.
With the server hardware fault repair method, apparatus, device, and storage medium for a server group described above, a complete machine replacement of a faulty server can be realized quickly when a server serving as a host fails, greatly shortening the fault repair time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort. The drawings are not intended to be drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.
FIG. 1 illustrates an exemplary scenario for hardware failover of a server according to the present application;
FIG. 2A shows an exemplary process for hardware failover of a server;
FIG. 2B shows an illustrative process for hardware failover for a server in accordance with an embodiment of the application;
FIG. 3 shows an illustrative process for a server hardware fault repair method for a server group in accordance with an embodiment of the application;
FIG. 4 shows an illustrative process for determining candidate replacement servers in accordance with an embodiment of the present application;
FIG. 5 illustrates an exemplary process of comparing configuration parameters of two servers according to an embodiment of the application;
FIG. 6A shows an illustrative process for replacing a failed server with a replacement server in accordance with embodiments of the present application;
FIG. 6B shows an exemplary process of changing the operational state of a server according to an embodiment of the application;
FIG. 7 shows an illustrative process for secondary checking of replacement servers in accordance with an embodiment of the application;
FIG. 8 shows a schematic block diagram of a server hardware fault repair apparatus for a server group according to an embodiment of the present application;
FIG. 9 shows a schematic diagram of an implementation of a hardware failover scheme according to an embodiment of the application; and
FIG. 10 illustrates an architecture of a computing device according to an embodiment of the application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort also fall within the protection scope of the present application.
As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.
Although various references are made herein to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative and different aspects of the systems and methods may use different modules.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.
Cloud technology refers to a hosting technology for unifying serial resources such as hardware, software, network and the like in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model. It can form a resource pool that is used on demand, flexibly and conveniently, with cloud computing technology as an important support. Background services of technical network systems, such as video websites, picture websites, and other web portals, require a large amount of computing and storage resources. With the rapid development of the internet industry, each article may carry its own identification mark, which needs to be transmitted to a background system for logic processing; data at different levels are processed separately, and all kinds of industrial data need strong system background support, which can only be realized through cloud computing.
Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud." To the user, the resources in the "cloud" appear infinitely expandable, can be obtained on demand at any time, and are paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform) is established, in which multiple types of virtual resources are deployed for selective use by external clients.
In the cloud computing service, when a server (i.e., a host) for carrying virtual machine traffic fails, an operation of fault repair needs to be performed. In an application scenario where all data required for executing a service are stored in a cloud storage manner, a host only undertakes a function of executing an operation. In this case, if a hardware failure to be repaired occurs in a host, other idle servers can be selected in the network formed by the server group to take over the function of the failed host. However, such a solution requires support by cloud storage services. That is, data required for executing the task needs to be stored in a cloud storage manner. In this way, after selecting other idle servers to take over the function of the failed host, the server for replacement can read the data stored in the cloud storage system and continue to execute tasks.
However, in some application scenarios it is not desirable to store the data required for executing a task by means of cloud storage. For example, when the amount of task data is large and the data must be accessed repeatedly during execution, storing the data locally is advantageous for executing the task. The data may be stored in a storage device of the host itself. For example, in applications such as a supercomputing center, the various data used to carry out a user-delegated computing service can be stored locally at the center (e.g., in a host machine). In this case, when a host suffers a hardware failure, the failure cannot be repaired in the aforementioned manner supported by a cloud storage service.
In view of the above scenario, the present application provides a method for repairing a server hardware failure of a server group.
FIG. 1 illustrates an exemplary scenario for hardware failover of a server according to the application. As shown in fig. 1, the system 100 may include a user terminal 110, a network 120, and a server (group) 130.
The user terminal 110 may be, for example, the computer 110-1 or the mobile phone 110-2 shown in FIG. 1. It is to be appreciated that the user terminal may be virtually any type of electronic device capable of performing data processing, including but not limited to a desktop computer, a laptop computer, a tablet computer, a smartphone, a smart home device, a wearable device, and the like.
A user may access network 120 using user terminal 110. For example, user terminal 110 may access network 120 via any wired or wireless means. The user may create commands on the user terminal 110 and send control instructions to the server 130 via the network 120 to perform tasks. The tasks referred to herein may be any computing and/or storage tasks that the server 130 is capable of performing. Such as image processing, file storage, instant messaging, batch computing, load balancing, etc.
The network 120 may be a single network, or a combination of at least two different networks. For example, network 120 may include, but is not limited to, one or a combination of local area networks, wide area networks, public networks, private networks, and the like.
The server 130 may be a server cluster, and the servers in the cluster are connected via a wired or wireless network. A group of servers may be centralized, such as a data center, or distributed. The server 130 may be local or remote.
In some embodiments, the system 100 may also include a database (not shown). A database may generally refer to a device having a storage function. The database is mainly used to store the various data utilized, generated, and output in the operation of the server 130. The database may include various memories, such as random access memory (RAM) and read-only memory (ROM). The storage devices mentioned above are only examples; the storage devices usable by the system are not limited to these.
In some embodiments, the database may be integrated in the server 130. It will be appreciated that other storage devices for storing data may be included in the system, by connection to the server and/or user terminal.
FIG. 2A shows an exemplary process for hardware failover of a server.
As shown in fig. 2A, in block 201, a host failure is detected.
In block 202, conventional repair means are applied in an attempt to repair the host. Here, conventional repair means refers to any available means that can restore host functionality without a complete machine replacement. For example, repair of the host may be attempted according to a predefined repair policy: the hardware components of the host (such as the power supply, hard disk, motherboard, and memory) may be checked in sequence. In some examples, component replacement may be used to attempt to repair the hardware fault; for example, the hardware components of the host may be replaced in a predetermined order. If the fault can be repaired simply by replacing a hardware component, the host does not need a complete machine replacement; in that case, the conventional repair means is effective.
In block 203, it may be determined whether the conventional repair means was effective. If it was, the conventional repair of the host has succeeded, and the host can continue to be used after the repair.
In block 204, when the conventional repair means is not effective, it may be concluded that repair of the host cannot be completed by a simple operation. In this case, the complete machine needs to be comprehensively examined to repair the fault. To reduce the impact of the host fault on users, the repair can be realized by a complete machine replacement scheme; that is, the failed host is replaced with a new physical machine that takes over the functions of the original host.
In block 205, the repair of the failed host is complete. As described above, the fault repair of a failed host can be realized through conventional repair or complete machine replacement.
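The conventional repair pass of blocks 202-203, checking or swapping components in a predetermined order and stopping once the fault clears, can be sketched as follows. The component order and the set-based fault model are assumptions, not details given in the application:

```python
# Hypothetical fixed inspection order for the host's hardware components.
REPAIR_ORDER = ["power_supply", "hard_disk", "motherboard", "memory"]

def conventional_repair(faulty_components):
    """Replace components one by one in REPAIR_ORDER; return True if the
    fault clears without a complete machine replacement, False otherwise."""
    for component in REPAIR_ORDER:
        faulty_components.discard(component)  # swap out this component
        if not faulty_components:
            return True     # conventional repair was effective (block 203: yes)
    return not faulty_components  # fault persists: complete replacement needed
```

A fault outside the replaceable component list survives the loop, which corresponds to block 204's conclusion that a complete machine replacement is required.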
FIG. 2B shows an illustrative process for hardware failover for a server in accordance with an embodiment of the application.
As shown in FIG. 2B, in the hardware failover process provided herein, a determination is made in block 206 that a particular server in the group of servers has failed. In some embodiments, a particular server may refer to a single server in a group of servers. In other embodiments, a particular server may also refer to two or more servers in a server group.
In block 207, it is determined whether the hardware fault that occurred in the specific server needs to be resolved by way of a complete machine replacement. For example, the operations of blocks 202-203 in FIG. 2A may be used to determine whether the fault can be repaired through conventional repair means. If the fault cannot be repaired by conventional means, it is determined that the repair must be performed by way of a complete machine replacement.
If it is determined that repair by way of a complete machine replacement is required (yes), the process may proceed to block 208, in which a server may be selected from the one or more standby servers for the complete replacement of the specific server. The process may then proceed to block 209, in which the complete machine replacement is performed for the specific server. The processes of blocks 208 and 209 will be described in detail below with reference to FIG. 3 and are not repeated here.
If it is determined in block 207 that the fault of the specific server need not be repaired by way of a complete machine replacement (no), a repeat check may be performed in block 210 to determine whether the fault needs to be repaired in this way. If the repeat check in block 210 determines that it does (yes), the process may proceed to block 208 and continue through the steps of blocks 208 and 209.
If the repeated checking in block 210 determines that the failure of the specific server does not need to be repaired by the complete machine replacement (no), the process may proceed to block 211, and the hardware failure repair process may be ended.
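The decision flow of blocks 206-211 reduces to a first check, a repeat check, and a final disposition. A compact sketch, with the corresponding block numbers noted in comments (the function and its string results are illustrative assumptions):

```python
def failover_decision(first_check, repeat_check):
    """Map the two yes/no checks of FIG. 2B to the resulting action."""
    if first_check:                             # block 207: yes
        return "complete_machine_replacement"   # blocks 208-209
    if repeat_check:                            # block 210: yes on recheck
        return "complete_machine_replacement"   # blocks 208-209
    return "end"                                # block 211: end the process
```

The repeat check exists so that a fault initially judged repairable by conventional means still gets a second chance to be escalated to a complete replacement.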
FIG. 3 shows an exemplary process of a server hardware fault repair method for a server group according to an embodiment of the application. The servers in the server group may be hosts carrying virtual machine traffic, and the various data required for executing that traffic is stored in them.
In step S302, server failure information in the server group may be received, the server failure information indicating that a hardware failure occurred in a specific server in the server group. In some embodiments, the server failure information indicates that the hardware failure occurred in the specific server cannot be solved by a conventional failure recovery means, and a complete machine replacement is required.
In step S304, configuration parameters of a specific server may be acquired in response to server failure information, and configuration parameters and operating states of a plurality of standby servers may be acquired.
In some embodiments, the standby server may be an idle server networked with the group of servers.
In some embodiments, the configuration parameters of the server may be obtained by accessing system information of the server. The configuration parameters of the server may be at least one of: machine room unit information, product model information, equipment type information, hardware version information and size information. The machine room unit information indicates the position of the machine room unit where the server is located. For example, the room unit information may be the number of the room unit. The product model information may be identification information of a machine model defined by a hardware manufacturer. For example, the product model information may be a character string indicating a machine model. The device type information refers to a vendor-defined device standard type that provides a cloud service. For example, the device type information may include entry level devices, workgroup level devices, department level devices, enterprise level devices, and the like. The hardware version information may be version information of a hardware component of the server, for example, version information of a hardware component such as a Central Processing Unit (CPU), a memory, a disk array (Raid), and the like. The size information may be information of a physical size of the server. For example, the size information may be represented as size information of a rack occupied by the server.
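The five configuration parameters enumerated above can be modeled as a simple record. The following is a minimal Python sketch; all field names and the example values are illustrative assumptions, since the document only names the parameter categories:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerConfig:
    """Configuration parameters used to match a failed server to a standby.

    Field names are illustrative; the document only names the categories.
    """
    room_unit: str         # number of the machine room unit where the server sits
    product_model: str     # manufacturer-defined machine model string
    device_type: str       # vendor-defined standard type, e.g. "enterprise"
    hardware_version: str  # CPU / memory / RAID component version string
    rack_size_u: int       # physical size as occupied rack units

# Hypothetical example record for a failed host
cfg = ServerConfig("DC1-U07", "X86-2288H", "enterprise", "CPUv4/RAIDv2", 2)
```

Representing the parameters as an immutable record makes the later field-by-field comparison (fig. 5) straightforward.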
The operational state may include at least one of a hardware state and a network connection state of the server. Wherein the hardware status may indicate whether the server is capable of performing operations normally. In some embodiments, the machine room environment where the server group is located may be monitored through an Intelligent Platform Management Interface (IPMI) to obtain information of the hardware state of the server. The network connection status may indicate whether the server is capable of network communication in the server group. In some embodiments, the network connection status of the server may be obtained through a heartbeat mechanism.
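The two operational-state checks can be sketched as follows. This is an illustration under stated assumptions: the IPMI output is passed in as text (a real deployment would invoke a tool such as ipmitool over the management network), and the heartbeat check is reduced to an age-of-last-packet test with a hypothetical 30-second window:

```python
def hardware_state_ok(chassis_status: str) -> bool:
    """Parse `ipmitool chassis status`-style output, passed in as text so
    the sketch is testable without real IPMI hardware."""
    for line in chassis_status.splitlines():
        if line.strip().startswith("System Power"):
            return line.split(":")[1].strip() == "on"
    return False

def network_state_ok(last_heartbeat_age_s: float, timeout_s: float = 30.0) -> bool:
    """Heartbeat check: the server counts as reachable if its most recent
    heartbeat packet arrived within the timeout window (window is assumed)."""
    return last_heartbeat_age_s <= timeout_s
```

A standby server would be considered operational only if both checks pass.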
In step S306, a candidate replacement server for replacing a specific server may be determined among the plurality of standby servers based on the operation states of the plurality of standby servers.
In some embodiments, for each backup server of the plurality of backup servers, the backup server may be determined to be a candidate replacement server if the hardware status indicates that the hardware status of the backup server is normal and the network connection status indicates that the network connection status of the backup server is normal.
By performing screening among a plurality of backup servers using the operation state, a backup server that can normally operate and normally communicate in the server group can be determined as a candidate replacement server. Hereinafter, an exemplary process of determining the candidate replacement server in step S306 will be described with specific reference to fig. 4.
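The screening of step S306 amounts to a simple filter over the standby pool. A minimal sketch, assuming the pool is given as a mapping from server id to a (hardware ok, network ok) pair:

```python
def candidate_replacements(standby: dict) -> list:
    """Keep only standby servers whose hardware state and network connection
    state are both normal, per step S306.
    `standby` maps server id -> (hw_ok, net_ok); the shape is illustrative."""
    return [sid for sid, (hw_ok, net_ok) in standby.items() if hw_ok and net_ok]

pool = {"sb-1": (True, True), "sb-2": (False, True), "sb-3": (True, False)}
# only sb-1 passes both checks
```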
In step S308, a replacement server matching the specific server may be determined among the candidate replacement servers based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers.
In some embodiments, the configuration parameters of the particular server that failed may be compared to the configuration parameters of the candidate replacement server. If the two match, the candidate replacement server may be considered a replacement server that matches the particular server.
In some implementations, if there are multiple replacement servers of the candidate replacement servers that match the particular server, a replacement server for replacing the failed particular server may be determined from the multiple matching replacement servers based on predefined rules. For example, the first candidate replacement server that matches the particular server may be determined as the replacement server for replacing the failed particular server. For another example, the candidate replacement server that matches the particular server to the highest degree may be determined as the replacement server for replacing the failed particular server.
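The predefined tie-breaking rules described above might be sketched as follows; the matching-degree scores are a hypothetical input, as the document does not specify how the degree of match is computed:

```python
def pick_replacement(matches: list, degrees: dict = None):
    """Apply a predefined rule when several candidates match the failed
    server: the highest matching degree wins if degrees are known,
    otherwise the first match is taken."""
    if not matches:
        return None
    if degrees:
        return max(matches, key=lambda sid: degrees.get(sid, 0))
    return matches[0]
```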
In step S310, a complete machine replacement message may be output for transferring the data for performing a task stored in the specific server to the replacement server. In some embodiments, the complete machine replacement message may include identification information of the replacement server and information indicating a specific operation. For example, the complete machine replacement message may be sent to the site where the server group is located, and the operation of transferring the data for performing the task to the replacement server may be performed by an operator on site. In some embodiments, the operator may be instructed to remove the storage device storing the data for performing the task from the specific server and install it on the replacement server indicated in the complete machine replacement message.
In some embodiments, after the operation of the operator transferring the data for performing the task to the replacement server is completed, feedback information may be received and hardware information, such as a MAC address, of the replacement server may be updated in response to the feedback information.
In step S312, in the case where the specific server is replaced by the replacement server, the server configuration parameters in the server group may be updated.
In some embodiments, the server configuration parameters may include hardware configuration parameters and software configuration parameters. In some implementations, the physical rack information of the failed specific server and the rack information of the replacement server may be exchanged in a Configuration Management Database (CMDB) to avoid errors in the CMDB information. In other implementations, the corresponding fixed resource relationship of the virtual machine on a specific server (i.e., the original host) may be mapped to the replacement server in the selling information base of the cloud service, so that the user can normally use the function of the virtual machine after the hardware fault repairing process is completed.
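The CMDB rack-information exchange of step S312 can be illustrated with a toy mapping; the schema (server id to rack location string) is an assumption made for the sketch:

```python
def swap_rack_info(cmdb: dict, failed_id: str, replacement_id: str) -> None:
    """Exchange the physical rack records of the failed server and its
    replacement in a CMDB-like mapping, so the CMDB keeps reflecting the
    actual physical placement after the complete machine replacement."""
    cmdb[failed_id], cmdb[replacement_id] = cmdb[replacement_id], cmdb[failed_id]

cmdb = {"s1": "R1-U3", "sb": "R9-U1"}
swap_rack_info(cmdb, "s1", "sb")
```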
Next, the exemplary process of determining the candidate replacement server in step S306 will be described with reference to fig. 4, which shows an exemplary process of determining candidate replacement servers according to an embodiment of the present application.
In block 402, the inventory condition of the standby library 401 may be determined. If it is determined in block 402 that the stock quantity of the standby library 401 is not 0, the process may proceed to block 404.
If it is determined in block 402 that the stock quantity of the standby library is less than the reference quantity C, the process may also proceed to block 403 to replenish the stock of the standby library. In block 403, a shipment order may be issued, so that the stock quantity in the standby library 401 is not less than the reference quantity C.
In some embodiments, the reference number of standby servers contained in the standby library may be determined based on a preset standby scheduling algorithm. In some examples, the reference number C of standby servers in the standby repository may be determined based on the selling weight a, the availability B of the servers in the group of servers, and the model remaining amount D of the servers. The selling weight A is a parameter determined according to a market release plan of the server, the availability B is a parameter indicating the failure rate of the server in the server group, and the model margin D is a parameter indicating the number of servers of the same model as a specific failed server in the current server group.
The reference number C of standby servers in the standby library may be calculated based on equation (1):

C = D * ((1 + A) * B)    (1)
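A worked example of equation (1), with a rounding-up policy added as an assumption (the document does not state how a fractional C is handled):

```python
import math

def reference_count(selling_weight_a: float, availability_b: float,
                    model_remaining_d: int) -> int:
    """Reference number of standby servers per equation (1):
    C = D * ((1 + A) * B).
    The result is rounded up so the pool is never under-provisioned;
    the rounding policy is an assumption."""
    return math.ceil(model_remaining_d * ((1 + selling_weight_a) * availability_b))

# e.g. D = 40 same-model servers, selling weight A = 0.25, availability B = 0.05:
# C = ceil(40 * 1.25 * 0.05) = ceil(2.5) = 3
```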
in block 404, a backup server may be selected in the backup repository 401.
In block 405, the hardware state of the selected standby server may be checked. If the hardware state is abnormal, return to block 401 to perform the screening again. If the hardware state is normal, proceed to block 406.
In block 406, the network connection status of the standby server may be checked. For example, a heartbeat packet may be sent to the standby server to check whether its connection is normal. For another example, it may be checked whether the login status of the server in the server group is normal. If the check in block 406 indicates that the network connection status of the standby server is abnormal, the process returns to block 401 to re-screen. If the network connection status is normal, the process proceeds to block 407.
In block 407, the backup server that passes the check may be determined as a candidate replacement server.
If the operating states of all the standby servers in the standby library are abnormal, it can be considered that no appropriate standby server exists in the standby library, and the fast hardware repair program is terminated.
Next, an exemplary procedure of determining a replacement server matching the specific server in step S308 will be described with reference to fig. 5. Fig. 5 illustrates an exemplary process of comparing the configuration parameters of two servers according to an embodiment of the present application.
As shown in fig. 5, the exemplary process of comparing configuration parameters of two servers begins in block 501.
In block 502, it may be compared whether the machine room unit information of the two servers matches. When the machine room units of the two servers are consistent, the machine room unit information of the two servers may be considered to match, and the process may proceed to block 503. If the machine room units of the two servers are not consistent, the process proceeds to block 507 to determine whether to compare again. In some embodiments, if the server group allows other machine room units that match the machine room unit where the particular server is located, the process may return to block 502 and re-compare whether the machine room unit information of the two servers matches. That is, it may be compared whether the machine room unit information of the candidate replacement server is that of another machine room unit matching the one where the particular server is located.
In block 503, the product model information of the two servers may be compared. If the product model information of the two servers matches, the process may proceed to block 504. If the comparison of the product model information of the two servers fails, the process proceeds to block 507 to determine whether to compare again. In some embodiments, when the product model information of the two servers is the same, the product model information of the two servers may be considered to match. In other embodiments, there may be at least two different pieces of product model information that match the product model information of the particular server. Therefore, if it is determined in block 503 that the product model information of the two servers does not match, the process may also proceed to block 507 to attempt a re-comparison.
In block 504, the device type information of the two servers may be compared. If the device types of the two servers match, the process may proceed to block 505. If the device type comparison of the two servers fails, the process proceeds to block 507 to determine whether to compare again. In some embodiments, when the device type information of the two servers is the same, the device type information of the two servers may be considered to match. In other embodiments, there may be at least two different pieces of device type information that match the device type information of the particular server. Therefore, if it is determined in block 504 that the device type information of the two servers does not match, the process may also proceed to block 507 to attempt a re-comparison.
In block 505, the hardware version information of the two servers may be compared. If the hardware versions of the two servers match, the process may proceed to block 506. If the hardware version comparison of the two servers fails, the process proceeds to block 507 to determine whether to compare again. In some embodiments, when the hardware version information of the two servers is the same, the hardware version information of the two servers may be considered to match. In other embodiments, there may be at least two different pieces of hardware version information that match the hardware version information of the particular server. Therefore, if it is determined in block 505 that the hardware version information of the two servers does not match, the process may also proceed to block 507 to attempt a re-comparison.
In block 506, the size information of the two servers may be compared. If the sizes of the two servers match, the process may proceed to block 508. If the size comparison of the two servers fails, the process proceeds to block 507 to determine whether to compare again. In some embodiments, when the size information of the two servers is the same, the size information of the two servers may be considered to match. In other embodiments, there may be at least two different pieces of size information that match the size information of the particular server. Therefore, if it is determined in block 506 that the size information of the two servers does not match, the process may also proceed to block 507 to attempt a re-comparison.
In block 508, when all of the configuration parameters of the particular server and the candidate replacement server are determined to match, the candidate replacement server may be determined to be the replacement server that matches the particular server.
In block 509, the exemplary process of comparing the configuration parameters of the two servers ends.
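The comparison cascade of blocks 502 to 506, including the block 507 path that allows certain alternative values (for example, other permitted machine room units), might be condensed into a single function. The field names and the shape of the equivalence table are illustrative assumptions:

```python
FIELDS = ("room_unit", "product_model", "device_type",
          "hardware_version", "rack_size")

def configs_match(failed: dict, candidate: dict, equivalents: dict = None) -> bool:
    """Compare the five configuration parameters in the order of blocks
    502-506.  `equivalents` optionally maps a value of the failed server
    to the set of other values allowed for it (the re-comparison path
    of block 507)."""
    equivalents = equivalents or {}
    for field in FIELDS:
        want, got = failed[field], candidate[field]
        if got == want:
            continue
        if got in equivalents.get(want, set()):
            continue  # a permitted alternative, e.g. another allowed room unit
        return False  # any unmatched field rules the candidate out
    return True
```

A candidate passing all five comparisons corresponds to reaching block 508.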
Returning to fig. 3, in the process shown in fig. 3, the particular server being replaced may be rebooted multiple times to complete the checking and replacing operations described above. To avoid alarms triggered by such abnormal conditions, the alarm information of the particular server and/or the replacement server may be disabled in the process shown in fig. 3. For example, the operation state of the particular server may be set to "in replacement" in the hardware fault repairing process provided in the present application, and a server in this state is set not to trigger an alarm, so as to ensure that the hardware fault repairing process is completed in full. When the hardware fault repairing process is completed, the alarm information of the replacement server may be enabled again. For example, the operational state of the replacement server may be restored to a normal state, such as "in operation".
Fig. 6A shows an illustrative process for replacing a failed server with a replacement server in accordance with embodiments of the present application.
At block 601, the replacement process begins.
In block 602, a complete machine replacement message may be output to a field operator, instructing that the data for performing a task be transferred to the replacement server. The data transfer operation may be performed manually by the field operator or by a robot.
In block 603, it may be determined whether the data transfer is complete. If the data transfer is not complete, then block 604 may be reached where the operation and maintenance personnel intervene and assist in completing the data transfer process.
If the data transfer is complete, then block 605 may be proceeded to. At block 605, server configuration parameters in a group of servers may be updated to achieve information synchronization in the event that a particular server is replaced by a replacement server.
After the information synchronization is complete, the process may proceed to block 606 to perform a status check on the replacement server that performs the task in place of the original failed server, to determine whether the hardware status and the network connection status of the replacement server are normal.
After the status check passes, the process may proceed to block 607 and the replacement process ends.
Fig. 6B shows an exemplary process of changing the operational state of a server according to an embodiment of the present application.
As shown in fig. 6B, in block 611, the flow begins. For example, the process shown in fig. 6B may be started after the complete machine replacement message for transferring the data for performing the task stored in the specific server to the replacement server is output in step S310 of fig. 3.
In block 612, the operational status of the particular server and of the replacement server for replacing it is adjusted to "in replacement", thereby disabling the alarm information of the particular server and the replacement server. For example, after the complete machine replacement message is output in step S310 of fig. 3, the operational states of the particular server and the replacement server may be adjusted to "in replacement" automatically.
After the operational status of the replacement server is adjusted to "in replacement", the field operator may perform the data transfer by performing blocks 613 and 614, shown as dashed lines on the right side of the flow of fig. 6B. In block 613, the data transfer operation for the particular server may be completed in response to the complete machine replacement message output in step S310 of fig. 3. The field operator may be a human or a robot.
In block 614, the replacement server may be checked for acceptance by the operator in the field. In the event of a pass through acceptance, block 615 may be advanced. Under the condition that the acceptance check is not passed, the replacement server can be debugged and configured according to the actual condition, so that the replacement server can work normally.
In block 615, the operational status of the replacement server may be adjusted to "in-operation" to enable alert information for the replacement server.
In block 616, the flow of changing the operational state of the server ends.
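The state transitions of fig. 6B and their effect on alarm gating can be sketched as a small state holder; the state strings follow the document, while the class itself is an illustrative assumption:

```python
class ServerState:
    """Track the operational state used to gate alarms during a complete
    machine replacement (fig. 6B)."""
    def __init__(self):
        self.state = "in operation"

    def begin_replacement(self):
        self.state = "in replacement"   # block 612: alarms disabled

    def finish_replacement(self):
        self.state = "in operation"     # block 615: alarms re-enabled

    def alarms_enabled(self) -> bool:
        # a server "in replacement" is set not to trigger alarms
        return self.state != "in replacement"
```

Gating alarms on the state string avoids spurious warnings from the reboots that the replacement process involves.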
Referring back to fig. 3, in some embodiments, the operation state of the replacement server may be checked a second time before step S310 and again before step S312, that is, before the complete machine replacement message instructing the replacement operation is output, and after the replacement operation has been completed. In other words, although the operational status of the candidate replacement servers has already been checked in step S306, it may be checked again whether the hardware status and the network connection status of the replacement server are normal, in order to ensure that the replacement server can operate normally after the hardware fault repair is completed.
Fig. 7 shows an exemplary process of secondary checking of the hardware status and the network connection status of the replacement server according to an embodiment of the present application.
In block 701, the secondary inspection flow begins.
As shown in fig. 7, in block 702, a hardware status check and a network connection status check may be performed on the replacement server. If the check result indicates that the hardware status and the network connection status of the replacement server are normal, the process may proceed to block 703 and the secondary check process ends.
If the results of the check in block 702 indicate that the hardware status or the network connection status of the replacement server is abnormal, the flow may proceed to block 704.
In block 704, a determination may be made as to the current operation stage of the replacement server. If no data transfer operation has yet been performed on the replacement server, the flow may proceed to block 705 to re-screen the standby library for a replacement server matching the particular server. If there is no replacement server that matches the particular server, the fault repairing process ends.
If a data transfer operation has been performed on the replacement server, then flow may proceed to block 706, restarting the failover flow illustrated in FIG. 3. If the re-execution of the failover process fails, the process may end.
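The decision logic of the secondary check in fig. 7 reduces to three outcomes; a minimal sketch:

```python
def secondary_check(hw_ok: bool, net_ok: bool, data_transferred: bool) -> str:
    """Decision logic of fig. 7: a passed check ends the flow; a failed
    check either re-screens the standby pool (when no data has been moved
    yet) or restarts the whole repair flow of fig. 3."""
    if hw_ok and net_ok:
        return "end"          # block 703
    if not data_transferred:
        return "re-screen"    # block 705
    return "restart-repair"   # block 706
```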
By using the server hardware fault repairing method for a server group described above, a complete machine replacement of a failed server can be realized by selecting, from the standby library, a replacement server that matches the failed server. By transferring the data for performing tasks stored in the failed server to the replacement server, the tasks of the virtual machines can be performed by the replacement server instead of the failed server. With this method, when a server serving as a host of virtual machines fails, a standby machine capable of replacing the failed server can be determined quickly and normal service of the server group can be restored, without spending a large amount of time analyzing the specific fault of the server. In many cases, the root cause of a hardware fault can be analyzed by simulating the service scenario with the host under no load. With this method, the failed server is swapped out of the server group and can be emptied quickly, which provides the test conditions for analyzing the cause of the fault and improves the efficiency and precision of fault analysis.
Fig. 8 shows a schematic block diagram of a server hardware failover apparatus for a server farm according to an embodiment of the application.
As shown in fig. 8, the apparatus 800 may include a failure information receiving unit 810, a parameter obtaining unit 820, a standby selecting unit 830, a parameter comparing unit 840, a guidance output unit 850, an information synchronizing unit 860, and a status checking unit 870.
The failure information receiving unit 810 may be configured to receive server failure information in a server group, the server failure information indicating that a hardware failure occurs in a specific server in the server group. In some embodiments, the server failure information indicates that the hardware failure occurred in the specific server cannot be solved by a conventional failure recovery means, and a complete machine replacement is required. In some embodiments, a particular server may refer to a single server in a group of servers. In other embodiments, a particular server may also refer to two or more servers in a server group.
The parameter obtaining unit 820 may be configured to obtain configuration parameters of a specific server in response to server failure information, and obtain configuration parameters and operation states of a plurality of standby servers.
In some embodiments, the standby server may be an idle server networked with the group of servers.
In some embodiments, the configuration parameters of the server may be obtained by accessing system information of the server. The configuration parameters of the server may be at least one of: machine room unit information, product model information, equipment type information, hardware version information and size information.
The operational state may include at least one of a hardware state and a network connection state of the server. Wherein the hardware status may indicate whether the server is capable of performing operations normally. In some embodiments, the machine room environment where the server group is located may be monitored through an Intelligent Platform Management Interface (IPMI) to obtain information of the hardware state of the server. The network connection status may indicate whether the server is capable of network communication in the server group. In some embodiments, the network connection status of the server may be obtained through a heartbeat mechanism.
The standby selecting unit 830 may be configured to determine a candidate replacement server for replacing a specific server among the plurality of standby servers based on the operation states of the plurality of standby servers.
In some embodiments, for each backup server of the plurality of backup servers, the backup server may be determined to be a candidate replacement server if the hardware status indicates that the hardware status of the backup server is normal and the network connection status indicates that the network connection status of the backup server is normal.
By performing screening among a plurality of backup servers using the operation state, a backup server that can normally operate and normally communicate in the server group can be determined as a candidate replacement server.
In some embodiments, the standby selection unit 830 may be configured to execute the flow illustrated in fig. 4 to implement the standby selection.
The parameter comparison unit 840 may be configured to determine, among the candidate replacement servers, a replacement server matching the specific server based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers.
In some embodiments, the configuration parameters of the particular server that failed may be compared to the configuration parameters of the candidate replacement server. If the two match, the candidate replacement server may be considered a replacement server that matches the particular server.
In some implementations, if there are multiple replacement servers of the candidate replacement servers that match the particular server, a replacement server for replacing the failed particular server may be determined from the multiple matching replacement servers based on predefined rules. For example, the first candidate replacement server that matches the particular server may be determined as the replacement server for replacing the failed particular server. For another example, the candidate replacement server that matches the particular server to the highest degree may be determined as the replacement server for replacing the failed particular server.
The parameter alignment unit 840 may be configured to perform the process illustrated in fig. 5.
The guidance output unit 850 may be configured to output a complete machine replacement message for transferring the data for performing a task stored in the specific server to the replacement server. In some embodiments, the complete machine replacement message may include identification information of the replacement server and information indicating a specific operation. For example, the complete machine replacement message may be sent to the site where the server group is located, and the operation of transferring the data for performing the task to the replacement server may be performed by an operator on site. In some embodiments, the operator may be instructed to remove the storage device storing the data for performing the task from the specific server and install it on the replacement server.
In some embodiments, after the operation of the operator transferring the data for performing the task to the replacement server is completed, feedback information may be received and hardware information, such as a MAC address, of the replacement server may be updated in response to the feedback information.
The information synchronization unit 860 may be configured to update the server configuration parameters in the server group in case the specific server is replaced by the replacement server.
In some embodiments, in the process illustrated in fig. 3, the particular server being replaced may be rebooted multiple times to complete the checking and replacing operations described above. To avoid alarms triggered by such abnormal conditions, the alarm information of the particular server and/or the replacement server may be disabled in the process shown in fig. 3. For example, the operation state of the particular server may be set to "in replacement" in the hardware fault repairing process provided in the present application, and a server in this state is set not to trigger an alarm, so as to ensure that the hardware fault repairing process is completed in full. When the hardware fault repairing process is completed, the alarm information of the replacement server may be enabled again. For example, the operational state of the replacement server may be restored to a normal state, such as "in operation".
In some embodiments, the server configuration parameters may include hardware configuration parameters and software configuration parameters. In some implementations, the physical rack information of the failed specific server and the rack information of the replacement server may be exchanged in a Configuration Management Database (CMDB) to avoid errors in the CMDB information. In other implementations, the corresponding fixed resource relationship of the virtual machine on a specific server (i.e., the original host) may be mapped to the replacement server in the selling information base of the cloud service, so that the user can normally use the function of the virtual machine after the hardware fault repairing process is completed.
The status checking unit 870 may be configured to perform secondary checking on the operation status of the replacement server before outputting the complete machine replacement message instructing the operator to perform the replacement operation and after the operator performs the replacement operation is completed (e.g., the information synchronization unit updates the server configuration parameters in the server group). That is, although the standby selecting unit has checked the operation state of the standby server in determining the candidate replacement servers, in order to ensure that the replacement server can normally operate after the completion of the hardware failover, it may check again whether the hardware state and the network connection state of the replacement server are normal.
Status checking unit 870 may be configured to perform the process shown in fig. 7.
By utilizing the server hardware fault repairing device for a server group described above, a complete machine replacement of a failed server can be realized by selecting, from the standby library, a replacement server that matches the failed server. By transferring the data for performing tasks stored in the failed server to the replacement server, the tasks of the virtual machines can be performed by the replacement server instead of the failed server. With this device, when a server serving as a host of virtual machines fails, a standby machine capable of replacing the failed server can be determined quickly and normal service of the server group can be restored, without spending a large amount of time analyzing the specific fault of the server. In many cases, the root cause of a hardware fault can be analyzed by simulating the service scenario with the host under no load. With this device, the failed server is swapped out of the server group and can be emptied quickly, which provides the test conditions for analyzing the cause of the fault and improves the efficiency and precision of fault analysis.
By using the server hardware fault repairing method and device for the server group, when the server in the server group fails, the average total replacement time is 1.677 hours. If the fault is repaired without adopting a complete machine replacement mode, at least about 8 hours are needed from fault discovery to reason analysis. The speed of the fault repairing mode provided by the application is increased by 76% compared with that of the conventional hardware fault repairing method.
FIG. 9 shows a schematic diagram of an implementation of a hardware failover scheme according to an embodiment of the application.
As shown in fig. 9, in block 901, a complete machine replacement request may be initiated based on a detected failure of a particular server.
In block 902, an available backup server may be found.
In block 903, a candidate replacement server for replacing the failed particular server may be determined in the standby library using the standby selecting unit shown in fig. 8.
In block 904, the parameter comparison unit shown in FIG. 8 may be used to compare the configuration parameters of the candidate replacement server with those of the failed particular server, so as to determine a replacement server that matches the particular server.
In block 905, the status checking unit shown in FIG. 8 may be used to verify whether the operating state of the replacement server is normal. If the operating state of the replacement server is verified to be normal, the process may advance to block 906. If the operating state is verified to be abnormal, the process may return to block 903 to attempt to find another replacement server that matches the particular server.
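The selection-and-verification loop of blocks 903-905 can be sketched as follows. This is a minimal, hedged Python illustration, not code from the application: the `Server` class, the `MATCH_KEYS` tuple, and the function names are assumptions introduced here; the matched parameters follow the configuration parameters listed in the application (machine room unit, product model, equipment type, hardware version, size).

```python
from dataclasses import dataclass
from typing import Optional

# Configuration parameters named in the application (illustrative keys).
MATCH_KEYS = ("machine_room_unit", "product_model",
              "equipment_type", "hardware_version", "size")

@dataclass
class Server:
    name: str
    config: dict                # configuration parameters, keyed by MATCH_KEYS
    hardware_ok: bool = True    # hardware state
    network_ok: bool = True     # network connection state

def config_matches(failed: Server, candidate: Server) -> bool:
    # Block 904: every listed configuration parameter must agree.
    return all(failed.config.get(k) == candidate.config.get(k)
               for k in MATCH_KEYS)

def state_ok(server: Server) -> bool:
    # Block 905: hardware state and network connection state both normal.
    return server.hardware_ok and server.network_ok

def find_replacement(failed: Server, standby_pool: list) -> Optional[Server]:
    # Blocks 903-905: try candidates until one matches and checks out;
    # returning to block 903 corresponds to moving on to the next candidate.
    for candidate in standby_pool:
        if config_matches(failed, candidate) and state_ok(candidate):
            return candidate
    return None                 # no usable standby machine found
```

Under these assumptions, a standby server whose hardware or network state is abnormal is skipped even when its configuration parameters match.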
In block 906, a complete machine replacement message may be created and output.
In block 907, an on-site operator may connect to the server hardware fault repair system for the server group provided herein via authentication (e.g., a token) and obtain the complete machine replacement message. Based on the acquired message, the operator can perform the operations needed to transfer the task data stored in the particular server to the replacement server.
In block 908, the server configuration parameters in the server group may be updated using the information synchronization unit shown in fig. 8.
In block 909, it may be determined whether the information synchronization status is normal. If the information synchronization status is normal, the process proceeds to block 911, where a status check is performed on the replacement server using the status checking unit. If block 911 indicates that the operating state of the replacement server is normal, the process advances to block 912. If block 911 indicates that the operating state of the replacement server is not normal, the process proceeds to block 910 for a secondary check of the replacement server's operating state, for example performed manually or by a robot.
If the information synchronization status is abnormal, a secondary check of the replacement server's information synchronization status may be performed at block 910, manually or by robot, to determine whether the synchronization is complete. The process may then proceed to block 912.
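The branching among blocks 909-912 can be summarized as a small decision function. This is an illustrative sketch only, under assumed names: `finish_replacement` and its parameters are not from the application, and `secondary_check` stands in for the manual or robotic check of block 910.

```python
from typing import Callable

def finish_replacement(sync_ok: bool, server_ok: bool,
                       secondary_check: Callable[[str], None] = lambda r: None) -> str:
    """Trace blocks 909-912; returns the visited block numbers joined by '->'."""
    trace = ["909"]                                     # block 909: check sync status
    if sync_ok:
        trace.append("911")                             # block 911: status check on replacement
        if not server_ok:
            secondary_check("operational status")       # block 910: manual/robot re-check
            trace.append("910")
    else:
        secondary_check("information synchronization")  # block 910: manual/robot re-check
        trace.append("910")
    trace.append("912")                                 # block 912: process ends
    return "->".join(trace)
```

For example, `finish_replacement(True, True)` visits blocks 909, 911, and 912, while a failed synchronization routes through the secondary check at block 910 before ending.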
In block 912, the hardware failover process ends.
Furthermore, the method or apparatus according to the embodiments of the present application may also be implemented by means of the computing device architecture shown in FIG. 10. As shown in FIG. 10, the computing device 1000 may include a bus 1010, one or more CPUs 1020, a read-only memory (ROM) 1030, a random access memory (RAM) 1040, a communication port 1050 connecting to a network, an input/output component 1060, a hard disk 1070, and the like. A storage device in the computing device 1000, such as the ROM 1030 or the hard disk 1070, may store data or files used in the processing and/or communication of the methods provided herein, as well as program instructions executed by the CPU. The computing device 1000 may also include a user interface 1080. Of course, the architecture shown in FIG. 10 is merely exemplary, and one or more of the components of the computing device shown in FIG. 10 may be omitted as needed when implementing different devices.
According to another aspect of the present application, there is also provided a non-transitory computer readable storage medium having stored thereon computer readable instructions which, when executed by a computer, can perform the method as described above.
According to yet another aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the server hardware fault repairing method for the server group.
Portions of the technology may be considered "articles" or "articles of manufacture" in the form of executable code and/or associated data, embodied in or carried on a computer-readable medium. Tangible, non-transitory storage media include memory or storage usable by any computer, processor, or similar device, or by associated modules: for example, various semiconductor memories, tape drives, disk drives, or any similar device capable of providing storage for software.
All or part of the software may at times communicate over a network, such as the Internet or another communication network. Such communication can load software from one computing device or processor into another: for example, from a server or host computer of the fault repair apparatus onto the hardware platform of a computing environment, or onto another computing environment implementing the system or similar functionality. Accordingly, another kind of medium capable of transferring software elements, such as optical, electrical, or electromagnetic waves propagating through cables, optical fibers, or the air, may also serve as a physical connection between local devices. The physical media carrying such waves, such as electrical, wireless, or optical cables, may likewise be regarded as media carrying the software. As used herein, unless limited to a tangible "storage" medium, terms referring to a computer- or machine-"readable medium" denote media that participate in the execution of instructions by a processor.
This application uses specific words to describe embodiments of the application. Reference to "a first/second embodiment," "an embodiment," and/or "some embodiments" means a feature, structure, or characteristic described in connection with at least one embodiment of the application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereon. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present application may be represented as a computer product, including computer-readable program code, embodied in one or more computer-readable media.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting it. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. All such modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of this invention as defined in the appended claims. The invention is defined by the claims and their equivalents.

Claims (15)

1. A server hardware fault repair method for a server group, wherein a server in the server group stores data for performing a task, the method comprising:
receiving server fault information in the server group, wherein the server fault information indicates that a specific server in the server group has a hardware fault;
responding to the server fault information, acquiring the configuration parameters of the specific server, and acquiring the configuration parameters and the operation states of a plurality of standby servers;
determining a candidate replacement server for replacing the specific server among the plurality of standby servers based on the operation states of the plurality of standby servers;
determining, among the candidate replacement servers, a replacement server that matches the particular server based on the configuration parameters of the particular server and the configuration parameters of the candidate replacement servers;
outputting a complete machine replacement message for transferring data for performing a task stored in the specific server to the replacement server;
updating server configuration parameters in the group of servers in the event the particular server is replaced by the replacement server.
2. The hardware fault repair method of claim 1, wherein the operation state comprises a hardware state and a network connection state; and
determining, among the plurality of standby servers, a candidate replacement server for replacing the particular server based on the operational states of the plurality of standby servers comprises:
selecting a standby server among the plurality of standby servers,
and determining the standby server as a candidate replacement server under the condition that the hardware state indicates that the hardware state of the standby server is normal and the network connection state indicates that the network connection state of the standby server is normal.
3. The hardware fault repair method of claim 1, wherein the standby server is an idle server networked with the server group.
4. The hardware fault repair method of any one of claims 1 to 3, wherein the number of the plurality of standby servers is determined based on a selling weight, an availability of the servers in the server group, and a model remaining capacity, wherein the selling weight is a weight parameter determined according to a market-release plan for the servers, the availability is a parameter indicating a failure rate of the servers in the server group, and the model remaining capacity is a parameter indicating the number of servers in the server group of the same model as the specific server.
5. The hardware fault remediation method of claim 1, wherein the configuration parameters include at least one of: machine room unit information, product model information, equipment type information, hardware version information and size information.
6. A hardware failover method according to claim 1 wherein, before the particular server is replaced by the replacement server, the hardware failover method further comprises:
performing a hardware status check and a network connection status check on the replacement server to determine whether the hardware status and the network connection status of the replacement server are normal.
7. The hardware fault repair method of claim 6, wherein:
in the event that both the hardware state and the network connection state of the replacement server are normal, the specific server is replaced with the replacement server and the data transfer is performed; and
in the event that the hardware state or the network connection state of the replacement server is not normal, a replacement server is reselected from the plurality of standby servers.
8. A hardware failover method according to claim 1 wherein, after updating the server configuration parameters in the group of servers, the hardware failover method further comprises:
performing a hardware status check and a network connection status check on the replacement server to determine whether the hardware status and the network connection status of the replacement server are normal.
9. The hardware fault repair method of claim 8, wherein:
in the event that both the hardware state and the network connection state of the replacement server are normal, the server configuration parameters in the server group are updated; and
in the event that the hardware state or the network connection state of the replacement server is not normal, a replacement server is reselected from the plurality of standby servers.
10. The hardware fault recovery method of claim 1, further comprising:
disabling the alarm information for the particular server during a hardware failover process.
11. A server hardware fault repair apparatus for a server group, wherein a server in the server group stores data for performing a task, the apparatus comprising:
a fault information receiving unit configured to receive server fault information in the server group, the server fault information indicating that a hardware fault occurs in a specific server in the server group;
a parameter obtaining unit configured to obtain configuration parameters of the specific server in response to the server failure information, and obtain configuration parameters and operation states of a plurality of standby servers;
a standby selection unit configured to determine a candidate replacement server for replacing the specific server among a plurality of standby servers based on operation states of the plurality of standby servers;
a parameter comparison unit configured to determine, among the candidate replacement servers, a replacement server that matches the specific server based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers;
a guidance output unit configured to output a complete machine replacement message for transferring the data for performing a task stored in the specific server to the replacement server; and
an information synchronization unit configured to update the server configuration parameters in the server group in case the specific server is replaced by the replacement server.
12. The hardware fault remediation device of claim 11, wherein the operational state comprises a hardware state and a network connection state;
the standby selection unit is configured to:
selecting a standby server among the plurality of standby servers,
and determining the standby server as a candidate replacement server under the condition that the hardware state indicates that the hardware state of the standby server is normal and the network connection state indicates that the network connection state of the standby server is normal.
13. The hardware fault repair apparatus of claim 11, wherein the standby server is an idle server networked with the server group.
14. A hardware fault remediation device, the device comprising a memory and a processor, wherein the memory has instructions stored therein which, when executed by the processor, cause the processor to carry out the hardware fault remediation method of any one of claims 1 to 7.
15. A computer readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the hardware fault remediation method of any one of claims 1-7.
CN202010801889.9A 2020-08-11 2020-08-11 Server hardware fault repairing method and device Pending CN111897697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010801889.9A CN111897697A (en) 2020-08-11 2020-08-11 Server hardware fault repairing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010801889.9A CN111897697A (en) 2020-08-11 2020-08-11 Server hardware fault repairing method and device

Publications (1)

Publication Number Publication Date
CN111897697A true CN111897697A (en) 2020-11-06

Family

ID=73228794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010801889.9A Pending CN111897697A (en) 2020-08-11 2020-08-11 Server hardware fault repairing method and device

Country Status (1)

Country Link
CN (1) CN111897697A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256498A (en) * 2020-11-17 2021-01-22 珠海大横琴科技发展有限公司 Fault processing method and device
CN112416398A (en) * 2020-11-23 2021-02-26 中国工商银行股份有限公司 Method and device for updating configuration information of main server
CN114401181A (en) * 2021-12-06 2022-04-26 深圳市亚略特科技股份有限公司 Off-line method, device and equipment for servers in cluster and storage medium
CN115529321A (en) * 2022-09-28 2022-12-27 成都魔光数码科技有限公司 Disaster-tolerant backup and fault recovery method for home subscriber server



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination