CN110908824A - Fault identification method, device and equipment - Google Patents

Fault identification method, device and equipment

Info

Publication number
CN110908824A
Authority
CN
China
Prior art keywords
target server
state data
server
running state
average
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911227646.2A
Other languages
Chinese (zh)
Inventor
窦方钰
沈涛
叶建娣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911227646.2A priority Critical patent/CN110908824A/en
Publication of CN110908824A publication Critical patent/CN110908824A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits

Abstract

The embodiments of this specification provide a fault identification method, device, and equipment. The scheme comprises the following steps: acquiring running state information of a target server in a preset time period; judging whether the running state data of the target server in the preset time period is within an average running state data range, where the average running state data range is obtained from the running state data of a plurality of servers in the target server cluster in which the target server is located; and when the judgment result shows that the running state data of the target server is not within the average running state data range, determining that the target server has failed.

Description

Fault identification method, device and equipment
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, and a device for identifying a fault.
Background
A downtime fault refers to the phenomenon in which a server stops responding for a long period of time because its operating system cannot recover from a serious system error or because of a hardware-level problem. A downtime fault of a server can be identified by a simple and convenient heartbeat detection method. However, for a fault in which the server can respond to only part of its service requests, for example because of resource exhaustion or the Java Garbage Collection mechanism, the server still has the capability of returning heartbeat packets, so such a fault cannot be identified by heartbeat detection and has to be identified manually. Manual identification not only consumes a large amount of human resources but also reduces the efficiency of identifying server faults.
In summary, providing a more efficient fault identification method for servers has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a fault identification method, apparatus, and device, which are used to improve the efficiency of identifying a fault occurring in a server.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
an embodiment of the present specification provides a fault identification method, including:
acquiring running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster;
judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
An embodiment of this specification provides a fault identification device, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring running state information of a target server in a preset time period, and the target server is one server in a target server cluster;
the first judgment module is used for judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and the fault determining module is used for determining that the target server has a fault when the first judgment result shows that the running state data of the target server is not in the average running state data range.
An embodiment of this specification provides a fault identification device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster;
judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
Embodiments of the present specification provide a computer readable medium, on which computer readable instructions are stored, the computer readable instructions being executable by a processor to implement the fault identification method described above.
One embodiment of the present description achieves the following advantageous effects:
and judging whether the running state data of the target server in a preset time period is in an average running state data range, wherein the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster where the target server is located. And when the judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails. The running state of the target server is not required to be monitored and identified manually, the workload of workers can be reduced to a greater extent, and the identification efficiency of the server with the fault is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of one or more embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the embodiments of the disclosure and not to limit the embodiments of the disclosure. In the drawings:
fig. 1 is a schematic flow chart of a fault identification method provided in an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a first application scenario of a fault identification method provided in an embodiment of the present specification;
fig. 3 is a schematic diagram of a second application scenario of a fault identification method provided in an embodiment of the present specification;
fig. 4 is a schematic structural diagram of a fault identification device corresponding to fig. 1 provided in an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a fault identification device corresponding to fig. 1 provided in an embodiment of the present specification.
Detailed Description
To make the objects, technical solutions and advantages of one or more embodiments of the present disclosure more apparent, the technical solutions of one or more embodiments of the present disclosure will be described in detail and completely with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present specification, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort fall within the scope of protection of one or more embodiments of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
A distributed server is a server implementation in which data and programs are not located on a single server but are distributed across a plurality of servers, so that multiple servers each process a different part of one computing task and the results produced by the individual servers are finally combined into the overall result. A server cluster is a server implementation that aggregates multiple servers to provide the same service. Because distributed servers overcome the resource shortages and response bottlenecks of a traditional centralized system built around a central host, and because server clusters offer high reliability and high resource availability, enterprises usually combine distributed server technology and server cluster technology when building the server-side system of an application.
Currently, the operation data of each server in a server-side system is usually monitored manually; if the operation data is abnormal, the cause of the abnormality is also analyzed manually, for example whether it is caused by a sudden increase in normal traffic, by network jitter, or by a server fault. If the analysis determines that a server has a single-machine fault, the staff must take that server offline to minimize the adverse effect of the faulty server on the normal operation of the server-side system. Because a distributed server-side system usually contains multiple server clusters, that is, a large number of servers, identifying server faults manually introduces a large delay; manual fault handling is also inefficient, so the failure rate of service request responses in the server-side system can rise rapidly, which in turn harms the user experience.
Based on this, a more efficient fault identification and processing method for the server is urgently needed.
Fig. 1 is a diagram illustrating a fault identification method according to an embodiment of the present disclosure. From the viewpoint of a program, the main body of execution of the method may be a program installed in a terminal device or a server.
As shown in fig. 1, the process may include the following steps:
step 102: the method comprises the steps of obtaining running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster.
In this embodiment of the present specification, a target server to be subjected to failure determination is one server in a target server cluster, and each server in the target server cluster has the capability of providing the same service, that is, the types of service requests that can be processed by each server in the target server cluster are the same. In practical application, the fault identification method provided by the embodiment of the present specification may be simultaneously adopted to perform fault identification on each server in a target server cluster, so as to improve the real-time performance of identifying a server having a fault in the target server cluster.
In the embodiment of the present specification, the running state information of the target server may include various information related to the operation of the target server, for example, the time consumed by Remote Procedure Call (RPC) responses of the target server, the error log of the target server, the Garbage Collection log of the target server, the time consumed by the RPC responses of each server that sends service requests to the target server, the error logs of the servers that send service requests to the target server, and the like. The embodiment of the present specification does not particularly limit the type of the running state information of the target server.
In this specification, the target server may be identified periodically based on the method in fig. 1, and the duration of the preset time period may be the same as the duration of one identification period. For example, when fault identification is performed on the target server once every 10 seconds, the duration of the preset time period may be 10 seconds. To improve the timeliness of the generated fault identification result, the preset time period may be the period from 10 seconds before the current time up to the current time.
Step 104: judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average operating state data range is obtained according to the operating state data of the plurality of servers in the target server cluster.
In practical application, when the servers in the target server cluster normally operate, the variation trend of the operation state data of the single server in the target server cluster is the same as the variation trend of the average operation state data of the target server cluster, and the deviation between the operation state data of the single server and the average operation state data of the target server cluster is within a certain value range. Therefore, whether the target server fails can be determined by judging whether the running state data of the target server is within the average running state data range of the target server cluster.
In the embodiment of the present specification, the operation state data of the target server can be obtained by processing the operation state information of the target server acquired in step 102. The operation status data of the target server may include a plurality of operation parameters, such as service response time, service response success rate, number of server errors, and server garbage collection frequency. Wherein the service response time comprises: the average response time of the target server to the remote procedure call request and the average response time of each server sending the service request to the target server when the service request response sent to the target server is successful. In the embodiment of the present specification, the type of the operation parameter in the operation state data of the target server is not particularly limited.
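A minimal sketch of the per-server running-state record described above. The field names (avg_rpc_response_time, error_count, gc_frequency, and so on) are illustrative assumptions rather than names taken from this specification.

```python
from dataclasses import dataclass

@dataclass
class OperatingStateData:
    server_id: str
    avg_rpc_response_time: float   # seconds, averaged over the preset time period
    response_success_rate: float   # fraction of service requests answered successfully
    error_count: int               # number of server error-log entries in the period
    gc_frequency: float            # garbage collections per minute in the period

# one hypothetical record produced from the raw running state information
sample = OperatingStateData("inventory-01", 0.42, 0.998, 12, 1.5)
print(sample)
```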
In the embodiment of the present specification, since the operation state data of the target server may include a plurality of operation parameters, the average operation state data range may be configured by a plurality of average operation parameter ranges. Wherein a plurality of average operating parameter ranges in the average operating state data range respectively correspond to a plurality of operating parameters in the operating state data of the target server.
The determining, in step 104, whether the running state data of the target server is within the average running state data range may specifically include: judging whether the number of abnormal operation parameters among the plurality of operation parameters in the running state data of the target server is smaller than a preset number, where an abnormal operation parameter is an operation parameter whose value falls outside the average operation parameter range corresponding to its type, and where each average operation parameter range is obtained from the operation parameters of the same type of the plurality of servers in the target server cluster.
For example, assume that the operational state data of the target server includes 2 operational parameters, respectively: the average response time of the target server to the remote procedure call request and the number of server errors. The preset number is 1. Within a preset time period, the average response time of the target server to the remote procedure call request is 1 second, and the number of server errors is 100. Assuming that the average operating parameter range corresponding to the average response time of the target server to the remote procedure call request is 0.1 to 1.5 seconds; the average operating parameter corresponding to the number of server errors is in the range of 10 to 80. It can be known that the number of server error reports in the running state data of the target server is an abnormal running parameter, and since the number of the abnormal running parameters in the running state data of the target server is not less than the preset number 1, the first judgment result indicates that the running state data of the target server is not within the average running state data range.
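A sketch of the step-104 check, under the assumption that each operation parameter has a corresponding average range (low, high) derived from the cluster; a parameter is "abnormal" when its value falls outside its range, and the server's data is treated as outside the average running state data range once the abnormal-parameter count reaches the preset number. Function and parameter names are illustrative assumptions.

```python
def is_outside_average_range(params: dict, avg_ranges: dict, preset_count: int) -> bool:
    """Return True when at least preset_count parameters fall outside their average range."""
    abnormal = 0
    for name, value in params.items():
        low, high = avg_ranges[name]
        if not (low <= value <= high):
            abnormal += 1
    return abnormal >= preset_count

# The worked example above: response time 1 s within [0.1, 1.5], error count 100
# outside [10, 80], preset number 1 -> the target server's data is outside the range.
params = {"avg_rpc_response_time": 1.0, "error_count": 100}
ranges = {"avg_rpc_response_time": (0.1, 1.5), "error_count": (10, 80)}
print(is_outside_average_range(params, ranges, preset_count=1))  # True
```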
Step 106: and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
Step 108: and when the first judgment result shows that the running state data of the target server is in the average running state data range, determining that the target server does not have a fault.
It should be understood that the order of some steps in the method described in one or more embodiments of the present disclosure may be interchanged according to actual needs, or some steps may be omitted or deleted.
In the embodiment of the present specification, the method in fig. 1 determines that the target server has failed in the preset time period by determining that its running state data in that period is not within the average running state data range, where the average running state data range is obtained from the running state data of a plurality of servers in the target server cluster in which the target server is located. The running state of the target server therefore does not need to be monitored and identified manually, which greatly reduces the workload of operations staff and improves the efficiency of identifying failed servers.
Since the fault identification result generated in the embodiment of the present specification is obtained by comparing the operation state data of the target server with the average operation state data of the cluster where the target server is located, the problem that the server is erroneously determined to be faulty when the operation state data of the server is rapidly changed due to a sudden increase in normal traffic can be avoided, so that the accuracy of the fault identification result is high.
Based on the fault identification method in fig. 1, the embodiments of the present specification also provide some specific implementations of the method, which are described below.
The abnormal running state of a target server can have many causes. For a transient abnormality caused by factors such as delay jitter (packet delay variation), directly judging that the target server has failed and performing fault handling would reduce the normal working time of the target server and affect the normal, stable operation of the cluster in which it is located. Therefore, when the target server is merely jittering, it can continue to be monitored rather than being immediately determined to have failed; only if the target server remains in an abnormal running state is it determined to have failed. This improves the utilization of server resources while keeping fault identification accurate.
Therefore, before determining that the target server fails in step 106, the method may further include:
when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server is in an abnormal running state in a specified time period; the end time of the designated time period is the same as the starting time of the preset time period.
And when the target server is in an abnormal operation state in the specified time period and the preset time period, determining that the target server fails.
In the embodiment of the present specification, the operation state of the target server in the specified time period may be determined by determining whether the operation state data of the target server in the specified time period is within the average operation state data range corresponding to the specified time period. The average operating state data range is also obtained based on the operating state data of the plurality of servers in the target server cluster, and is not described in detail herein. Specifically, when the determination result indicates that the operation state data of the target server in the specified time period is not within the average operation state data range corresponding to the specified time period, it may be determined that the target server is in the abnormal operation state in the specified time period. When the judgment result indicates that the operation state data of the target server in the specified time period is within the average operation state data range corresponding to the specified time period, it may be determined that the target server is in the normal operation state in the specified time period.
Alternatively, an operation state identifier generated by manual setting or automatic monitoring can be used to determine the operation state of the target server within the specified time period. For example, when an operation state identifier indicating that the target server operated abnormally within the specified time period is found to be set, the target server is determined to have been in an abnormal operation state within the specified time period. When an operation state identifier indicating that the target server operated normally within the specified time period is found to be set, or when no identifier indicating abnormal operation within the specified time period is found, the target server is determined to have been in a normal operation state within the specified time period.
In the embodiment of the present specification, when the first judgment result shows that the running state data of the target server is not within the average running state data range, the target server can be considered to be in an abnormal running state within the preset time period. Therefore, the target server is determined to have failed only when it is in an abnormal running state both in the adjacent time period of specified duration before the preset time period (that is, the specified time period) and in the preset time period itself. When the target server is in a normal running state during the specified time period before the preset time period but in an abnormal running state during the preset time period, it is determined that the server is only jittering and has not failed. In this way, jittering servers and failed servers are accurately distinguished.
In this embodiment of the present specification, the specified time period may be determined according to an actual requirement, and when the duration of the specified time period is equal to the duration of the preset time period, it may be considered that, if the target server is determined to be in an abnormal operation state in two consecutive preset time periods, it may be determined that the target server fails.
In this embodiment of the present specification, the specified time period further includes a preset number (greater than or equal to 2) of consecutive preset time periods, and an ending time of the preset number of consecutive preset time periods is the same as a starting time of the preset time period;
determining that the target server is in an abnormal operation state within a specified time period, which may specifically include: and determining that the target server is in an abnormal operation state within the preset number of continuous preset time periods.
In this embodiment of the present specification, for any one preset time period within a preset number of consecutive preset time periods, when the operation state data of the target server within the any one preset time period is not within the average operation state data range corresponding to the any one preset time period, it may be determined that the target server is in an abnormal operation state within the any one preset time period. Or, the operating state of the target server in any one of the preset time periods may be determined according to the operating state identifier generated by manual setting or automatic monitoring. This will not be described in detail.
In this embodiment, a target server is determined to have failed only when it is in an abnormal running state in a plurality of consecutive preset time periods (that is, the preset number plus 1). This allows users to conveniently set and adjust, according to the actual situation, the condition under which a target server is determined to have failed, which improves the usability of the fault identification method and the user experience.
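A minimal sketch of this "consecutive abnormal periods" rule: the target server is declared faulty only after it has been abnormal in the current preset time period and in the preset number of periods immediately before it. The deque-based history and the class name are assumptions for illustration.

```python
from collections import deque

class FaultDetector:
    def __init__(self, required_consecutive: int):
        # required_consecutive = preset number of prior periods + the current one
        self.required = required_consecutive
        self.history = deque(maxlen=required_consecutive)

    def observe(self, abnormal_this_period: bool) -> bool:
        """Record one period's result; return True once a fault is confirmed."""
        self.history.append(abnormal_this_period)
        return len(self.history) == self.required and all(self.history)

detector = FaultDetector(required_consecutive=3)
for abnormal in [False, True, True, True]:     # jitter first, then sustained abnormality
    print(detector.observe(abnormal))          # False, False, False, True
```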
Both single-machine faults and cluster faults can occur in the target server cluster. A single-machine fault is the phenomenon in which one or a small number of servers in the target server cluster fail, whereas a cluster fault is the phenomenon in which the proportion of failed servers in the target server cluster exceeds a preset proportion. When the running states of the servers in the target server cluster are monitored manually, it is difficult to systematically find all the failed servers in the cluster, and therefore difficult to judge whether the fault of the target server is a single-machine fault or a cluster fault.
Based on this, in the embodiment of the present specification, step 106: after determining that the target server fails, the method may further include:
and determining the occupation ratio of the failed server in the target server cluster in the preset time period.
And judging whether the occupation ratio is smaller than a preset occupation ratio or not to obtain a second judgment result.
And when the second judgment result shows that the occupation ratio is smaller than the preset occupation ratio, determining that the fault of the target server is a single-machine fault. And when the second judgment result shows that the occupation ratio is greater than or equal to the preset occupation ratio, determining that the fault of the target server is a cluster fault.
In this embodiment of the present disclosure, the fault identification method in fig. 1 may be used to perform real-time fault identification on each server in the target server cluster, so as to determine whether each server in the target server cluster fails within the preset time period, and further obtain a proportion of the failed server within the preset time period in the target server cluster, so as to efficiently obtain an accurate identification result of whether the failure of the target server is a single-machine failure or a cluster failure.
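A sketch of the single-machine versus cluster fault decision: once the target server is known to be faulty, the fraction of faulty servers in the cluster is compared against a preset proportion. The function name and the 0.3 threshold are assumptions, not values from this specification.

```python
def classify_fault(faulty_servers: int, cluster_size: int, preset_ratio: float = 0.3) -> str:
    """Second judgment: compare the faulty-server proportion with the preset proportion."""
    ratio = faulty_servers / cluster_size
    return "single-machine fault" if ratio < preset_ratio else "cluster fault"

print(classify_fault(faulty_servers=1, cluster_size=20))   # single-machine fault
print(classify_fault(faulty_servers=9, cluster_size=20))   # cluster fault
```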
In practical application, when the fault of the target server is a single-machine fault, the fault needs to be manually processed, the efficiency is low, the fault processing delay is large, and a large number of service request response failures are easily caused, so that the normal operation of the service is influenced. Based on this, in the embodiment of the present specification, an implementation manner that can be used for automatically processing a single machine fault is also provided.
Specifically, after determining that the failure of the target server is a single-machine failure, the method may further include:
and generating a first control instruction, wherein the first control instruction is used for forbidding the target server to provide the service.
Sending the first control instruction to a configuration center server so that the configuration center server generates configuration information based on the first control instruction; the configuration information is used for recording the identification information of a server which processes a service request generated by a specified server in the target server cluster; the configuration information does not include identification information of the target server. The identification information of the server may be implemented by using an internet protocol address (i.e., an IP address) of the server, or may be implemented by using a unique identifier of the server, which is not limited in this respect.
In the embodiments of the present specification, for ease of understanding, an implementation for automatically handling a single machine fault is illustrated. Assume that a distributed server-side system for shopping applications includes: an order server cluster, an inventory server cluster and the like. The inventory server cluster may respond to an inventory quantity change request generated by the order server cluster based on the user order. In the following, an example will be described in which the stock server cluster is taken as a target server cluster.
Fig. 2 is a schematic diagram of a first application scenario of a fault identification method according to an embodiment of the present disclosure, and as shown in fig. 2, an order server cluster is composed of a first server 203, a second server 204, and a third server 205. The stock server cluster is composed of a target server 206, a fourth server 207, and a fifth server 208. The configuration center server 202 may determine identification information of a server currently having a capability of providing a service for the order server cluster in the inventory server cluster, and therefore, the configuration center server 202 generates configuration information based on the determined identification information of the server currently having the capability of providing the service, and the configuration information may be used to determine a corresponding relationship between the server in the order server cluster and the server in the inventory server cluster, so that each server in the order server cluster may send a respective generated service request to a corresponding server in the inventory server cluster according to the configuration information.
As shown in fig. 2, three servers in the inventory server cluster each have the capability of providing a service, and the configuration information corresponding to fig. 2 may be used to record: the target server 206 and the fourth server 207 may be configured to process the service request of the first server 203, the target server 206 and the fifth server 208 may be configured to process the service request of the second server 204, and the fourth server 207 and the fifth server 208 may be configured to process the service request of the third server 205.
The fault identification server 201 may employ the method of fig. 1 to identify a fault with a target server 206 in the inventory server cluster. When the fault identification server 201 determines that the fault of the target server 206 is a single-machine fault, a first control instruction may be generated and sent to the configuration center server 202, so that the configuration center server generates configuration information based on the first control instruction; the configuration information is used for recording identification information of a server in the target server cluster for processing a service request generated by a server in the order server cluster; since it is determined that the target server 206 has a stand-alone failure at this time, the configuration information does not include identification information of the target server.
The configuration center server 202 sends the generated configuration information to each server in the order server cluster, and each server in the order server cluster may interact with the servers in the inventory server cluster according to the configuration information. Fig. 3 is a schematic view of a second application scenario of the fault identification method provided in the embodiment of the present disclosure. As shown in fig. 3, because the configuration information does not include the identification information of the target server 206, the servers in the order server cluster no longer send service requests to the target server 206 for processing, which reduces the adverse effect of the target server's fault on subsequent service request responses.
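A sketch of how a configuration center could rebuild the routing configuration after receiving the first control instruction: the faulty target server is simply excluded from the set of servers eligible to handle requests from the order server cluster. The data layout is an illustrative assumption (each order server is mapped to every currently eligible inventory server here, whereas the scenario above maps each order server to a subset), not the patent's actual configuration format.

```python
def build_config(order_servers: list, inventory_servers: list, disabled: set) -> dict:
    """Map each order server to the inventory servers still allowed to serve it."""
    eligible = [s for s in inventory_servers if s not in disabled]
    return {order: list(eligible) for order in order_servers}

config = build_config(
    order_servers=["first", "second", "third"],
    inventory_servers=["target", "fourth", "fifth"],
    disabled={"target"},                # first control instruction: disable the target server
)
print(config)                           # the target server no longer appears in any routing entry
```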
In practical application, partial failure of the server can be solved by restarting the server. Therefore, after sending the first control instruction to the configuration center server, the method may further include:
and generating a restart control instruction for the target server.
And sending the restart control instruction to the target server to control the target server to restart.
And when the restart completion information sent by the target server is received, generating a second control instruction, wherein the second control instruction is used for indicating that the target server is enabled to provide the service.
Sending the second control instruction to the configuration center server so that the configuration center server updates the configuration information based on the second control instruction to obtain updated configuration information; the updated configuration information includes the identification information of the target server.
In this embodiment, since some servers in the target server cluster do not usually have the automatic restart function, for example, the application server providing access to the business logic for the client application program does not usually have the automatic restart function, the execution subject of the method in fig. 1 may also send a restart control instruction to the target server where the stand-alone failure occurs, so as to solve the failure of the target server, so that the target server has the capability of providing the service again.
In the embodiment of the present specification, after it is determined that the target server is restarted, the target server needs to be re-registered to continue providing the service. Therefore, when the execution subject of the method in fig. 1 receives the restart completion information sent by the target server, a second control instruction for instructing to enable the target server to provide the service may be sent to the configuration center server, and the configuration center server may update the configuration information based on the second control instruction, where at this time, the generated updated configuration information includes the identification information of the target server. Therefore, after the configuration center server sends the updated configuration information to each designated server, the designated server can send a service request to the target server based on the updated configuration information, and therefore the utilization rate of resources of the target server is improved.
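A sketch of the restart-and-re-enable sequence described above: disable the faulty server at the configuration center, send it the restart control instruction, and once the restart-completion information arrives, issue the second control instruction so the configuration is updated to include the server again. The ConfigCenter interface and the restart callback are illustrative assumptions.

```python
class ConfigCenter:
    def __init__(self, servers):
        self.enabled = set(servers)

    def apply(self, instruction: str, server: str) -> None:
        if instruction == "disable":        # first control instruction
            self.enabled.discard(server)
        elif instruction == "enable":       # second control instruction
            self.enabled.add(server)

def handle_single_machine_fault(center: ConfigCenter, server: str, restart) -> None:
    center.apply("disable", server)         # stop routing traffic to the faulty server
    restarted = restart(server)             # send the restart control instruction
    if restarted:                           # restart-completion information received
        center.apply("enable", server)      # re-register the server so it provides service again

center = ConfigCenter({"target", "fourth", "fifth"})
handle_single_machine_fault(center, "target", restart=lambda s: True)
print(sorted(center.enabled))               # the target server is back in rotation
```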
In the embodiment of the present specification, the method of fig. 1 further determines the average running state data range before step 104. There may be various implementations of determining the average running state data range. One implementation is to calculate the average running state data range from the running state data of a plurality of servers in the target server cluster within the preset time period. Another implementation is to determine the average running state data range corresponding to the preset time period from preset information.
In the first implementation manner, before the step 104, the method may further include:
and acquiring the running state data of a plurality of servers in the target server cluster within the preset time period. Determining an average of the operational state data of the plurality of servers. And determining the average running state data range according to the preset deviation and the average value.
In practical applications, since the running state data of the target server may include a plurality of operation parameters, the average running state data range may be composed of a plurality of average operation parameter ranges, each of which can be determined in this way, as illustrated here. For example, assume that there are three servers in the target server cluster, where the average response times of the first server, the second server, and the third server to remote procedure call requests within the preset time period are 1 second, 0.1 second, and 0.3 second, respectively, and the first server is the target server. When the average running state data range is determined from the running state data of the second server and the third server in the target server cluster, the average of their running state data is 0.2 second. Assuming that the upper limit of the preset deviation is half of the average value, the average operation parameter range corresponding to the average response time of the server to remote procedure call requests, within the determined average running state data range, is 0 to 0.3 seconds.
Or, assuming that the server error reporting numbers of the first server, the second server and the third server in a preset time period are 100, 10 and 30, respectively, when the average operating state data range is determined according to the operating state data of the second server and the third server in the target server cluster, the average value of the calculated operating state data of the plurality of servers is 20. Assuming that the upper limit of the preset deviation is twice of the average value, it can be known that the average operating parameter range corresponding to the operating parameter of the number of error reports of the server in the determined average operating state data range is 0 to 60.
In practical applications, since the number of servers in the target server cluster is usually tens to hundreds, even if a single server fails, the operating state data of the target server has a small influence on the average value of the operating state data of the target server cluster, and therefore, the average operating state data range may be determined according to the operating state data of a plurality of servers in the target server cluster including the target server. Therefore, when fault identification is performed on each server in the target server cluster within the preset time period, the same average running state data range can be adopted, so that the calculation amount of the fault identification server can be reduced, and the calculation resources are saved.
In practical applications, when the failure of the target server is a cluster failure, a difference between the operation state data of the target server and an average of the operation state data of the plurality of servers in the target server cluster may be small. At this time, the average operating state data range may also be determined based on a preset maximum value of the operating state data. For example, assuming that the average value obtained based on the operation state data of the plurality of servers in the target server cluster is 100, and the preset deviation is half of the average value, the predetermined average operation state data range is [0, 150], and since the preset maximum value of the operation state data is 80, at this time, the determined average operation state data range may be [0, 80], so as to avoid the erroneous identification of the operation state of the target server due to the cluster failure of the target server cluster.
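A sketch of this first way of deriving an average operation parameter range: take the mean of the parameter across the cluster's servers, widen it by the preset deviation, and optionally cap the upper bound with the preset maximum so a cluster fault cannot stretch the range. The numbers reproduce the examples above; the function name and the deviation expressed as a factor of the mean are assumptions.

```python
def average_range(values: list, deviation_factor: float, preset_max: float = None) -> tuple:
    """Return (0, mean + deviation), optionally capped by a preset maximum value."""
    mean = sum(values) / len(values)
    upper = mean * (1 + deviation_factor)
    if preset_max is not None:
        upper = min(upper, preset_max)
    return (0.0, round(upper, 6))

# RPC response times of the second and third servers, deviation = half the mean
print(average_range([0.1, 0.3], deviation_factor=0.5))               # (0.0, 0.3)
# error counts of the second and third servers, deviation = twice the mean
print(average_range([10, 30], deviation_factor=2.0))                 # (0.0, 60.0)
# cluster mean 100, deviation half the mean, capped by the preset maximum 80
print(average_range([100], deviation_factor=0.5, preset_max=80.0))   # (0.0, 80.0)
```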
In the second implementation manner, before the step 104, the method may further include:
and acquiring a preset average running state data range, wherein the average running state data range is obtained by calculation according to the historical running state data of the target server cluster.
In practical application, due to different usage habits of users on the application, the number of service requests processed by the target server cluster in the same time period on different dates has consistency. For example, a user typically uses applications frequently between 10 and 12 am, and between 1 and 4 am, the frequency of use of applications decreases because the user is typically asleep. Based on the above phenomenon, calculation may be performed according to the operation state data of the plurality of servers in the target server cluster in the same time period on different dates, so as to obtain an average value of the operation state data of the target server cluster corresponding to the time period. And determining the average running state data range corresponding to the time period in each day according to the preset deviation and the average value.
By adopting the implementation mode, the average running state data range corresponding to each time period in each day is predetermined and stored, when the average running state data range corresponding to the preset time period needs to be determined, only the pre-stored information needs to be searched, the average running state data range does not need to be generated through real-time calculation, the calculation amount of the execution main body of the method in the figure 1 can be reduced, and therefore calculation resources are saved.
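A sketch of the second implementation: an average running state data range is precomputed for each time slot of the day from historical cluster data and stored, so that fault identification only needs to look up the range for the preset time period instead of recomputing it in real time. The one-hour slot granularity, the sample values, and the function name are assumptions.

```python
from datetime import datetime

# hypothetical precomputed ranges keyed by hour of day: {hour: (low, high)}
historical_ranges = {
    3:  (0.0, 0.2),    # 1-4 am: low traffic, tight range
    11: (0.0, 0.6),    # 10-12 am: peak usage, wider range
}

def lookup_range(timestamp: datetime, ranges: dict, default=(0.0, 1.0)) -> tuple:
    """Return the precomputed average range for the slot containing the timestamp."""
    return ranges.get(timestamp.hour, default)

print(lookup_range(datetime(2019, 12, 4, 11, 30), historical_ranges))  # (0.0, 0.6)
```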
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method. Fig. 4 is a schematic structural diagram of a fault identification device corresponding to fig. 1 provided in an embodiment of the present disclosure. As shown in fig. 4, the apparatus may include:
the acquisition module is used for acquiring the running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster.
The first judgment module is used for judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average operating state data range is obtained according to the operating state data of the plurality of servers in the target server cluster.
And the fault determining module is used for determining that the target server has a fault when the first judgment result shows that the running state data of the target server is not in the average running state data range.
The embodiments of this specification also provide some specific implementations of the apparatus based on the apparatus in fig. 4, which are described below.
In this embodiment, the fault identification apparatus in fig. 4 may further include:
the abnormal operation state determining module is used for determining that the target server is in an abnormal operation state in a specified time period; the end time of the designated time period is the same as the starting time of the preset time period.
In an embodiment of this specification, the specified time period includes a preset number of consecutive preset time periods, and an ending time of the preset number of consecutive preset time periods is the same as a starting time of the preset time period. Correspondingly, the abnormal operation state determination module may be specifically configured to: and determining that the target server is in an abnormal operation state within the preset number of continuous preset time periods.
In this embodiment, the fault identification apparatus in fig. 4 may further include:
and the occupation ratio determining module is used for determining the occupation ratio of the server which fails in the preset time period in the target server cluster.
And the second judgment module is used for judging whether the occupation ratio is smaller than a preset occupation ratio or not to obtain a second judgment result.
And the single-machine fault determining module is used for determining that the fault of the target server is a single-machine fault when the second judgment result shows that the occupation ratio is smaller than the preset occupation ratio.
In this embodiment, the fault identification apparatus in fig. 4 may further include:
and the first control instruction generation module is used for generating a first control instruction, and the first control instruction is used for forbidding the target server to provide the service.
The first sending module is used for sending the first control instruction to a configuration center server so that the configuration center server can generate configuration information based on the first control instruction; the configuration information is used for recording the identification information of a server which processes a service request generated by a specified server in the target server cluster; the configuration information does not include identification information of the target server.
In this embodiment, the fault identification apparatus in fig. 4 may further include:
and the restarting control instruction generating module is used for generating a restarting control instruction aiming at the target server.
And the second sending module is used for sending the restart control instruction to the target server so as to control the target server to restart.
And the second control instruction generating module is used for generating a second control instruction when the restart completion information sent by the target server is received, wherein the second control instruction is used for indicating that the target server is enabled to provide services.
A third sending module, configured to send the second control instruction to the configuration center server, so that the configuration center server updates the configuration information based on the second control instruction to obtain updated configuration information; the updated configuration information includes the identification information of the target server.
In this embodiment of the present specification, the running state data of the target server includes a plurality of operation parameters; the operation parameters include at least two of service response time, service response success rate, number of server errors, and server garbage collection frequency. Correspondingly, the first judgment module 404 may be specifically configured to:
and judging whether the number of abnormal operation parameters in the plurality of operation parameters is smaller than a preset number, wherein the numerical value of the abnormal operation parameters is out of an average operation parameter range corresponding to the type of the abnormal operation parameters, and the average operation parameter range is obtained according to the operation parameters of the plurality of servers in the target server cluster, which are the same as the type of the abnormal operation parameters.
In this embodiment, the fault identification apparatus in fig. 4 may further include:
and the running state data acquisition module is used for acquiring the running state data of the plurality of servers in the target server cluster within the preset time period.
And the mean value determining module is used for determining the mean value of the running state data of the plurality of servers.
And the average running state data range determining module is used for determining the average running state data range according to the preset deviation and the average value.
In this embodiment, the fault identification apparatus in fig. 4 may further include:
and the average running state data range acquisition module is used for acquiring a preset average running state data range, and the average running state data range is obtained by calculation according to the historical running state data of the target server cluster.
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method. Fig. 5 is a schematic structural diagram of a fault identification device corresponding to fig. 1 provided in an embodiment of the present specification. As shown in fig. 5, the apparatus 500 may include:
at least one processor 510; and
a memory 530 communicatively coupled to the at least one processor; wherein
the memory 530 stores instructions 520 executable by the at least one processor 510 to enable the at least one processor 510 to:
acquiring running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster;
judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
Based on the same idea, the embodiment of the present specification further provides a computer-readable medium corresponding to the above method. The computer readable medium has computer readable instructions stored thereon that are executable by a processor to implement the method of:
acquiring running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster;
judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In the 1990s, an improvement to a technology could clearly be distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development, and the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by lightly programming the method flow into an integrated circuit using one of the hardware description languages described above.
The controller may be implemented in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by that (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functions can be implemented by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for implementing various functions may also be regarded as structures within the hardware component. Indeed, means for implementing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described as being divided into various units by function, each described separately. Of course, when implementing one or more embodiments of the present specification, the functions of the units may be implemented in one and the same piece, or in multiple pieces, of software and/or hardware.
One skilled in the art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
One or more embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to one or more embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively simple, and for relevant points reference may be made to the corresponding parts of the description of the method embodiment.
The above description is merely exemplary of the present disclosure and is not intended to limit one or more embodiments of the present disclosure. Various modifications and alterations to one or more embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of claims of one or more embodiments of the present specification.

Claims (20)

1. A fault identification method, comprising:
acquiring running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster;
judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
2. The method of claim 1, before determining that the target server fails, further comprising:
when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server is in an abnormal operation state within a specified time period; the end time of the specified time period is the same as the starting time of the preset time period.
3. The method of claim 2, wherein the specified time period comprises a preset number of consecutive preset time periods, and the ending time of the preset number of consecutive preset time periods is the same as the starting time of the preset time period;
the determining that the target server is in an abnormal operation state within a specified time period specifically includes:
and determining that the target server is in an abnormal operation state within the preset number of continuous preset time periods.
4. The method of claim 2, after determining that the target server has failed, further comprising:
determining the proportion of servers in the target server cluster that have failed within the preset time period;
judging whether the proportion is smaller than a preset proportion to obtain a second judgment result;
and when the second judgment result shows that the proportion is smaller than the preset proportion, determining that the fault of the target server is a single-machine fault.
5. The method of claim 4, after determining that the failure of the target server is a standalone failure, further comprising:
generating a first control instruction, wherein the first control instruction is used for prohibiting the target server from providing services;
sending the first control instruction to a configuration center server so that the configuration center server generates configuration information based on the first control instruction; the configuration information is used for recording the identification information of a server which processes a service request generated by a specified server in the target server cluster; the configuration information does not include identification information of the target server.
6. The method of claim 5, after sending the first control instruction to a configuration center server, further comprising:
generating a restart control instruction for the target server;
sending the restart control instruction to the target server to control the target server to restart;
when restart completion information sent by the target server is received, generating a second control instruction, wherein the second control instruction is used for indicating that the target server is to be enabled to provide services;
sending the second control instruction to the configuration center server so that the configuration center server updates the configuration information based on the second control instruction to obtain updated configuration information; the updated configuration information includes the identification information of the target server.
7. The method of claim 1, wherein the running state data of the target server includes a plurality of operation parameters; the judging whether the running state data of the target server is in the average running state data range specifically includes:
judging whether the number of abnormal operation parameters in the plurality of operation parameters is smaller than a preset number, wherein the numerical value of an abnormal operation parameter is outside the average operation parameter range corresponding to the type of that abnormal operation parameter, and the average operation parameter range is obtained according to the operation parameters, of the plurality of servers in the target server cluster, that are of the same type as the abnormal operation parameter.
8. The method of claim 7, wherein the operation parameters include at least two of service response time, service response success rate, number of server errors, and server garbage collection frequency.
9. The method of claim 1, wherein before judging whether the running state data of the target server is in the average running state data range, further comprising:
acquiring running state data of a plurality of servers in the target server cluster within the preset time period;
determining an average value of the running state data of the plurality of servers;
and determining the average running state data range according to the preset deviation and the average value.
10. The method of claim 1, wherein before judging whether the running state data of the target server is in the average running state data range, further comprising:
and acquiring a preset average running state data range, wherein the average running state data range is obtained by calculation according to the historical running state data of the target server cluster.
11. A fault identification device comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring running state information of a target server in a preset time period, and the target server is one server in a target server cluster;
the first judgment module is used for judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and the fault determining module is used for determining that the target server has a fault when the first judgment result shows that the running state data of the target server is not in the average running state data range.
12. The apparatus of claim 11, further comprising:
the abnormal operation state determining module is used for determining that the target server is in an abnormal operation state within a specified time period; the end time of the specified time period is the same as the starting time of the preset time period.
13. The apparatus of claim 12, wherein the specified time period comprises a preset number of consecutive preset time periods, and the end time of the preset number of consecutive preset time periods is the same as the start time of the preset time period; the abnormal operation state determination module is specifically configured to:
and determining that the target server is in an abnormal operation state within the preset number of continuous preset time periods.
14. The apparatus of claim 12, further comprising:
the proportion determining module is used for determining the proportion of servers in the target server cluster that have failed within the preset time period;
the second judgment module is used for judging whether the proportion is smaller than a preset proportion to obtain a second judgment result;
and the single-machine fault determining module is used for determining that the fault of the target server is a single-machine fault when the second judgment result shows that the proportion is smaller than the preset proportion.
15. The apparatus of claim 14, further comprising:
a first control instruction generation module, configured to generate a first control instruction, where the first control instruction is used to prohibit the target server from providing a service;
the first sending module is used for sending the first control instruction to a configuration center server so that the configuration center server can generate configuration information based on the first control instruction; the configuration information is used for recording the identification information of a server which processes a service request generated by a specified server in the target server cluster; the configuration information does not include identification information of the target server.
16. The apparatus of claim 15, further comprising:
the restarting control instruction generating module is used for generating a restarting control instruction aiming at the target server;
the second sending module is used for sending the restarting control instruction to the target server so as to control the target server to restart;
a second control instruction generation module, configured to generate a second control instruction when restart completion information sent by the target server is received, where the second control instruction is used to instruct to enable the target server to provide a service;
a third sending module, configured to send the second control instruction to the configuration center server, so that the configuration center server updates the configuration information based on the second control instruction to obtain updated configuration information; the updated configuration information includes the identification information of the target server.
17. The apparatus of claim 11, wherein the running state data of the target server comprises a plurality of operation parameters; the first judging module is specifically configured to:
judge whether the number of abnormal operation parameters in the plurality of operation parameters is smaller than a preset number, wherein the numerical value of an abnormal operation parameter is outside the average operation parameter range corresponding to the type of that abnormal operation parameter, and the average operation parameter range is obtained according to the operation parameters, of the plurality of servers in the target server cluster, that are of the same type as the abnormal operation parameter.
18. The apparatus of claim 11, further comprising:
the running state data acquisition module is used for acquiring the running state data of a plurality of servers in the target server cluster within the preset time period;
the mean value determining module is used for determining the mean value of the running state data of the plurality of servers;
and the average running state data range determining module is used for determining the average running state data range according to the preset deviation and the average value.
19. A fault identification device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring running state information of a target server in a preset time period, wherein the target server is one server in a target server cluster;
judging whether the running state data of the target server is in an average running state data range or not based on the running state information to obtain a first judgment result; the average running state data range is obtained according to the running state data of a plurality of servers in the target server cluster;
and when the first judgment result shows that the running state data of the target server is not in the average running state data range, determining that the target server fails.
20. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the fault identification method of any one of claims 1 to 10.
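By way of illustration only, the following Python sketch shows one way the range construction of claim 9 and the parameter-count check of claims 1, 7 and 8 could be realized. The identifiers (average_ranges, within_average_range, PRESET_DEVIATION, PRESET_ABNORMAL_COUNT) and all numeric values are assumptions introduced for this sketch; the claims do not prescribe concrete values or an implementation language.

```python
from statistics import mean
from typing import Dict, List, Tuple

PRESET_DEVIATION = 0.2       # assumed relative deviation around the cluster mean (claim 9)
PRESET_ABNORMAL_COUNT = 2    # assumed "preset number" of out-of-range parameters (claim 7)

def average_ranges(cluster_data: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Build a per-parameter range [mean*(1-d), mean*(1+d)] from the running
    state data of the servers in the target server cluster."""
    ranges = {}
    for param in cluster_data[0]:
        avg = mean(server[param] for server in cluster_data)
        ranges[param] = (avg * (1 - PRESET_DEVIATION), avg * (1 + PRESET_DEVIATION))
    return ranges

def within_average_range(target_data: Dict[str, float],
                         ranges: Dict[str, Tuple[float, float]]) -> bool:
    """The target server counts as inside the average running state data range
    while fewer than PRESET_ABNORMAL_COUNT parameters fall outside the range
    for their parameter type."""
    abnormal = sum(1 for param, value in target_data.items()
                   if not (ranges[param][0] <= value <= ranges[param][1]))
    return abnormal < PRESET_ABNORMAL_COUNT

# Example with the parameter types listed in claim 8 (all values made up).
cluster = [
    {"response_time_ms": 110, "success_rate": 0.99, "error_count": 3, "gc_per_min": 1},
    {"response_time_ms": 120, "success_rate": 0.98, "error_count": 4, "gc_per_min": 1},
    {"response_time_ms": 900, "success_rate": 0.60, "error_count": 40, "gc_per_min": 9},
]
ranges = average_ranges(cluster)
print(within_average_range(cluster[2], ranges))  # False -> candidate failure per claim 1
```

Note that a mean computed over the whole cluster is pulled toward an outlying server, so the choice of the preset deviation trades sensitivity against false positives.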
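In the same spirit, a minimal sketch of the persistence check of claims 2-3 and the single-machine decision of claim 4 follows; preset_consecutive and preset_ratio are illustrative defaults, not values taken from the specification.

```python
from typing import Sequence

def confirmed_failure(previous_periods_abnormal: Sequence[bool],
                      current_period_abnormal: bool,
                      preset_consecutive: int = 3) -> bool:
    """Only confirm the failure when the server was already in an abnormal
    operation state for the preset number of consecutive periods that end
    where the current preset time period starts (claims 2-3)."""
    recent = list(previous_periods_abnormal)[-preset_consecutive:]
    return (current_period_abnormal
            and len(recent) == preset_consecutive
            and all(recent))

def is_standalone_fault(failed_count: int, cluster_size: int,
                        preset_ratio: float = 0.3) -> bool:
    """Treat the fault as a single-machine fault only while the share of failed
    servers in the cluster stays below the preset ratio (claim 4); a larger
    share suggests a cluster-wide problem rather than one bad machine."""
    return cluster_size > 0 and (failed_count / cluster_size) < preset_ratio

print(confirmed_failure([True, True, True], True))            # True
print(is_standalone_fault(failed_count=1, cluster_size=20))   # True (1/20 < 0.3)
```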
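Finally, a sketch of the isolation, restart and re-enable flow of claims 5-6. The ConfigCenterClient class and the target_server methods (server_id, restart, restart_completed) are hypothetical stand-ins; the claims describe control instructions and configuration information but do not name any concrete API.

```python
import time

class ConfigCenterClient:
    """Hypothetical stand-in for the configuration center server."""
    def __init__(self):
        self.enabled_servers = set()   # identification information of servers allowed to serve

    def apply(self, instruction: str, server_id: str) -> None:
        if instruction == "disable":       # first control instruction (claim 5)
            self.enabled_servers.discard(server_id)
        elif instruction == "enable":      # second control instruction (claim 6)
            self.enabled_servers.add(server_id)

def handle_standalone_fault(target_server, config_center: ConfigCenterClient,
                            poll_seconds: float = 1.0) -> None:
    """Take the faulty server out of the routing configuration, restart it,
    wait for its restart-completion report, then put it back (claims 5-6)."""
    config_center.apply("disable", target_server.server_id)
    target_server.restart()                         # restart control instruction
    while not target_server.restart_completed():    # restart completion information
        time.sleep(poll_seconds)
    config_center.apply("enable", target_server.server_id)
```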
CN201911227646.2A 2019-12-04 2019-12-04 Fault identification method, device and equipment Pending CN110908824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911227646.2A CN110908824A (en) 2019-12-04 2019-12-04 Fault identification method, device and equipment

Publications (1)

Publication Number Publication Date
CN110908824A true CN110908824A (en) 2020-03-24

Family

ID=69822042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911227646.2A Pending CN110908824A (en) 2019-12-04 2019-12-04 Fault identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN110908824A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679623A (en) * 2013-11-29 2015-06-03 中国移动通信集团公司 Server hard disk maintaining method, system and server monitoring equipment
CN105135602A (en) * 2015-08-10 2015-12-09 珠海格力电器股份有限公司 Processing method for working state data of equipment, and server
CN105306252A (en) * 2015-09-19 2016-02-03 北京暴风科技股份有限公司 Method for automatically judging server failures
CN109976971A (en) * 2017-12-28 2019-07-05 北京京东尚科信息技术有限公司 Rigid disc state monitoring method and device
CN109144835A (en) * 2018-08-02 2019-01-04 广东浪潮大数据研究有限公司 A kind of automatic prediction method, device, equipment and the medium of application service failure
CN109144820A (en) * 2018-08-31 2019-01-04 新华三信息安全技术有限公司 A kind of detection method and device of abnormal host
CN109150626A (en) * 2018-09-26 2019-01-04 郑州云海信息技术有限公司 FTP service monitoring method, device, terminal and computer readable storage medium
CN110032480A (en) * 2019-01-17 2019-07-19 阿里巴巴集团控股有限公司 A kind of server exception detection method, device and equipment
CN110413434A (en) * 2019-07-08 2019-11-05 合肥移瑞通信技术有限公司 The abnormality recognition method and device of server
CN110413488A (en) * 2019-07-31 2019-11-05 中国工商银行股份有限公司 Server utilization rate method for early warning and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112383421A (en) * 2020-11-03 2021-02-19 中国联合网络通信集团有限公司 Fault positioning method and device
CN112383421B (en) * 2020-11-03 2023-03-24 中国联合网络通信集团有限公司 Fault positioning method and device
CN113055246A (en) * 2021-03-11 2021-06-29 中国工商银行股份有限公司 Abnormal service node identification method, device, equipment and storage medium
CN113055246B (en) * 2021-03-11 2022-11-22 中国工商银行股份有限公司 Abnormal service node identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6756924B2 (en) Blockchain-based consensus methods and devices
CN106856489B (en) Service node switching method and device of distributed storage system
CN110618869B (en) Resource management method, device and equipment
US20170315886A1 (en) Locality based quorum eligibility
US8984108B2 (en) Dynamic CLI mapping for clustered software entities
CN110888889A (en) Data information updating method, device and equipment
US11438249B2 (en) Cluster management method, apparatus and system
CN111880906A (en) Virtual machine high-availability management method, system and storage medium
CN110109741B (en) Method and device for managing circular tasks, electronic equipment and storage medium
CN108259526B (en) Data transmission method and device
CN110908824A (en) Fault identification method, device and equipment
US11663094B2 (en) Reducing recovery time of an application
CN109002348B (en) Load balancing method and device in virtualization system
CN111538585A (en) Js-based server process scheduling method, system and device
US20140164851A1 (en) Fault Processing in a System
US11544091B2 (en) Determining and implementing recovery actions for containers to recover the containers from failures
CN110908792B (en) Data processing method and device
CN114791900A (en) Operator-based Redis operation and maintenance method, device, system and storage medium
CN112559565A (en) Abnormity detection method, system and device
CN110650059B (en) Fault cluster detection method, device, computer equipment and storage medium
CN110673793B (en) Storage device node event management method and system, electronic device and storage medium
CN110019023B (en) Method, device and equipment for pushing mechanism information message
CN111581033B (en) Load balancing method, system and device
CN108255667B (en) Service monitoring method and device and electronic equipment
CN112084171A (en) Operation log writing method, device, equipment and medium based on Cassandra database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200324