CN114116128A - Method, device, equipment and storage medium for fault diagnosis of container instance

Info

Publication number: CN114116128A
Application number: CN202111394919.XA
Authority: CN
Inventors: 项金鑫
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2021-11-23
Filing date: 2021-11-23
Publication date: 2022-03-01
Anticipated expiration: 2041-11-23
Also published as: CN114116128B

Abstract

The present disclosure relates to a method, apparatus, device and storage medium for fault diagnosis of container instances. The method comprises: acquiring a fault diagnosis request for a target container instance; determining, based on the fault diagnosis request, a target service to which the target container instance belongs, querying a call failure data set of the target service within a set time period, and determining a call failure data subset of the target container instance within the set time period; if a first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets a preset condition, determining a first proportion value of the call failure data subset in the call failure data set, wherein the set time period is divided into a first number of unit time windows; and if the first proportion value exceeds a first proportion threshold, determining that the target container instance is faulty. Embodiments of the disclosure can improve the efficiency and accuracy of fault diagnosis for a single container instance.

Description

Method, device, equipment and storage medium for fault diagnosis of container instance

Technical Field

The present disclosure relates to the field of cloud computing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for diagnosing a fault of a container instance.

Background

Container technology is a virtualization technology that packages an application together with its runtime environment (the dependencies the application needs to run) into a container. In a concrete implementation, service functions are provided by instantiated container instances. If a container instance fails for any reason, problems such as failed remote procedure calls to the container instance and jitter in service availability may result. Therefore, fault diagnosis of container instances is necessary.

At present, fault diagnosis of container instances relies mainly on model prediction, which requires training a model on a large amount of historical data and then using the trained model to diagnose faults of the container instance.

However, this approach not only depends on a large amount of historical data but also suffers from low model recall and accuracy, so fault diagnosis of the container instance cannot be performed quickly and accurately.

Disclosure of Invention

To solve the technical problems described above or at least partially solve the technical problems described above, the present disclosure provides a method, an apparatus, a device, and a storage medium for fault diagnosis of a container instance.

In a first aspect, the present disclosure provides a fault diagnosis method for a container instance, the method comprising:

acquiring a fault diagnosis request of a target container instance;

determining a target service to which the target container instance belongs based on the fault diagnosis request, querying a call failure data set of the target service in a set time period, and determining a call failure data subset of the target container instance in the set time period;

if a first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets a preset condition, determining a first proportion value of the call failure data subset in the call failure data set; wherein the set time period is divided into a first number of the unit time windows;

determining that the target container instance is faulty if the first proportion value is determined to exceed a first proportion threshold.

In a second aspect, the present disclosure provides a fault diagnosis apparatus for a container instance, the apparatus comprising:

a fault diagnosis request acquisition module, configured to acquire a fault diagnosis request of a target container instance;

a call failure data acquisition module, configured to determine, based on the fault diagnosis request, a target service to which the target container instance belongs, query a call failure data set of the target service within a set time period, and determine a call failure data subset of the target container instance within the set time period;

a first proportion value determining module, configured to determine a first proportion value of the call failure data subset in the call failure data set if it is determined that a first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets a preset condition; wherein the set time period is divided into a first number of the unit time windows;

and a fault diagnosis module, configured to determine that the target container instance is faulty if the first proportion value is determined to exceed a first proportion threshold.

In a third aspect, the present disclosure provides a fault diagnosis device for a container instance, the device comprising:

a processor;

a memory for storing executable instructions;

wherein the processor is configured to read the executable instructions from the memory and execute them to implement the fault diagnosis method for a container instance described in any embodiment of the present disclosure.

In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the method for fault diagnosis of a container instance described in any of the embodiments of the present disclosure.

Compared with the prior art, the fault diagnosis method, apparatus, device and storage medium for a container instance of the embodiments of the present disclosure have the following advantages:

1. In the process of fault diagnosis of the container instance, load information of the container instance, such as CPU, memory and bandwidth metrics, is not used. Instead, data on failed call requests to the target service is obtained, namely the call failure data set and the call failure data subset belonging to the target container instance. Fault diagnosis is thus based on call failure data, a metric that reflects the operating condition of the container instance more directly, which improves the efficiency and accuracy of fault diagnosis of the container instance.

2. In the process of fault diagnosis of the container instance, a fixed threshold is not simply applied. Instead, when the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets a preset condition, a first proportion value of the call failure data subset in the call failure data set is determined, and the target container instance is determined to be faulty only when the first proportion value exceeds the first proportion threshold. This reduces the probability that a container instance is misdiagnosed as faulty and further improves the accuracy of fault diagnosis of the container instance.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

FIG. 1 is a schematic flow chart of a fault diagnosis method for a container instance provided by an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart of another fault diagnosis method for a container instance provided by an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a fault diagnosis device for a container instance provided in an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a fault diagnosis device for a container instance provided in an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

When a service is implemented with container technology, a plurality of container instances that implement the service function can be instantiated from a container image. The container instances may be deployed in a distributed manner across a plurality of physical machines, with at least one container instance deployed in each physical machine for remote invocation.

Currently, most single-instance fault diagnosis for container instances relies on machine learning models. Machine learning model diagnosis depends on a large number of training samples and a complex training process, and suffers from low recall and accuracy. The related art also uses manual diagnosis, but manual diagnosis is time-consuming and labor-intensive, and its efficiency is low. In addition, although there are related technical solutions for migrating or destroying container instances, they do not detect container instance faults; they trigger the destruction of some container instances from the perspective of container resource utilization optimization, cost optimization, or the like, so as to reduce the number of container instances.

Based on the above situation, embodiments of the present disclosure provide a method, an apparatus, a device, and a storage medium for fault diagnosis of a container instance, which comprehensively diagnose whether the container instance is faulty by using data on failed call requests to a target service together with a preset condition and a first proportion value corresponding to that data, thereby improving the speed and accuracy of fault diagnosis of the container instance.

The following first describes the fault diagnosis method for a container instance provided in an embodiment of the present disclosure with reference to FIG. 1 and FIG. 2.

In the embodiment of the present disclosure, the fault diagnosis method for the container instance may be performed by a fault diagnosis apparatus of the container instance, which is integrated in a fault diagnosis device of the container instance having strong computing capability. The fault diagnosis device of the container instance may include, but is not limited to, devices such as a laptop computer, a desktop computer, a server, and the like.

FIG. 1 shows a schematic flow chart of a fault diagnosis method for a container instance provided in an embodiment of the present disclosure. As shown in FIG. 1, the fault diagnosis method for the container instance may include the following steps:

S110, acquiring a fault diagnosis request of the target container instance.

Wherein the target container instance is a container instance of a certain service that needs to be fault diagnosed.

Specifically, the fault diagnosis device of the container instance starts the fault diagnosis flow when it receives a request for fault diagnosis of the target container instance (i.e., a fault diagnosis request). The fault diagnosis request may be initiated by an operator of the service (i.e., the target service), that is, from upstream of the target container instance. The fault diagnosis request may also be generated by an automatic fault diagnosis trigger of the target service corresponding to the target container instance. For example, if a service is configured for polling-based or periodic fault diagnosis, a fault diagnosis request for the target container instance may be generated when the target container instance is polled or when the timing period elapses. The fault diagnosis request includes at least a container instance identifier of the target container instance and a service identifier of the service to which the target container instance belongs.

It should be noted that, the data formats of the fault diagnosis requests generated by different triggers may be different, so in order to improve the subsequent processing efficiency, the present disclosure formats the fault diagnosis requests to convert all the fault diagnosis requests into the set normalized data format.
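The normalization step can be illustrated with a minimal sketch. The field names, the alias table, and the `DiagnosisRequest` schema below are hypothetical and chosen only for illustration; the disclosure requires only that every request carry a service identifier and a container instance identifier and that all requests be converted into one set format.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class DiagnosisRequest:
    """Normalized fault diagnosis request (hypothetical schema)."""
    service_id: str
    instance_id: str
    period_seconds: Optional[int] = None  # optional "set time period"


# Hypothetical aliases that different triggers might use for the same field.
_FIELD_ALIASES = {
    "service_id": ("service_id", "serviceId", "service"),
    "instance_id": ("instance_id", "instanceId", "container"),
    "period_seconds": ("period_seconds", "periodSeconds", "window"),
}


def _pick(raw: Mapping[str, Any], field: str) -> Any:
    """Return the value of the first alias of `field` present in `raw`, else None."""
    for key in _FIELD_ALIASES[field]:
        if key in raw:
            return raw[key]
    return None


def normalize_request(raw: Mapping[str, Any]) -> DiagnosisRequest:
    """Convert a trigger-specific payload into the normalized format."""
    service_id = _pick(raw, "service_id")
    instance_id = _pick(raw, "instance_id")
    if service_id is None or instance_id is None:
        raise ValueError(
            "request must carry a service identifier and a container instance identifier"
        )
    period = _pick(raw, "period_seconds")
    return DiagnosisRequest(
        str(service_id),
        str(instance_id),
        int(period) if period is not None else None,
    )
```

With this sketch, a manually initiated payload and a timer-generated payload that use different key names would both yield the same `DiagnosisRequest` shape for the later steps.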

S120, determining the target service to which the target container instance belongs based on the fault diagnosis request, querying a call failure data set of the target service in a set time period, and determining a call failure data subset of the target container instance in the set time period.

The set time period is a predetermined time period over which data related to remote invocation requests (also called remote procedure call requests) handled by the target container instance is collected. The set time period may be a preset length of time. For example, when fault diagnosis of the container instance is triggered automatically, the set time period may be configured in advance. The set time period may also be a parameter carried in the fault diagnosis request. For example, when an upstream service actively initiates the fault diagnosis request, the set time period may be specified in the fault diagnosis request.

The call failure data refers to traffic metric data on remote call requests that the target service failed to process. The call failure data set is the data set composed of the call failure data of the target service in the set time period, and is a time series of call failure data. For example, instrumentation points (tracking points) may be embedded at the corresponding program locations of the target service to collect the statistics from which the call failure data set is derived. The call failure data subset is the data set composed of the call failure data of the target container instance, and is a part of the call failure data set.

Specifically, for the target service or target container instance, the processing condition of the remote call request is data capable of directly reflecting the business processing capability thereof, and the data of the request processing failure in the processing condition of the remote call request is direct data capable of reflecting whether the failure exists or not. Therefore, in the embodiment of the present disclosure, load data (such as CPU, memory, bandwidth, and the like) corresponding to a service or a container instance is not counted, but call failure data is directly obtained.

In a specific implementation, the fault diagnosis device of the container instance determines the target service according to the service identifier carried in the fault diagnosis request. It then queries the collected instrumentation statistics to obtain the call failure data set of the target service in the set time period, and obtains the call failure data subset of the target container instance in the set time period. The call failure data subset can be obtained either by querying the instrumentation statistics with the service identifier and the container instance identifier carried in the fault diagnosis request, or by extracting it from the call failure data set according to the container instance identifier.
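The second option, extracting the subset from the already queried set, can be sketched as follows. The `CallFailure` record layout and the in-memory list standing in for the instrumentation statistics store are assumptions made for illustration; the disclosure does not specify a storage backend or a record schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CallFailure:
    """One call failure data point (hypothetical record layout)."""
    timestamp: float      # seconds since an arbitrary epoch
    service_id: str
    instance_id: str
    count: int            # number of failed call requests at this point


def query_failure_set(records: Iterable[CallFailure], service_id: str,
                      start: float, end: float) -> List[CallFailure]:
    """Call failure data set: all failures of the service in [start, end)."""
    return [r for r in records
            if r.service_id == service_id and start <= r.timestamp < end]


def extract_subset(failure_set: Iterable[CallFailure],
                   instance_id: str) -> List[CallFailure]:
    """Call failure data subset: the part of the set produced by one instance."""
    return [r for r in failure_set if r.instance_id == instance_id]
```

Extracting the subset from the set, rather than issuing a second query, guarantees that the subset is always a part of the set, which the later proportion calculation relies on.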

In some embodiments, the call failure data set includes all instrumentation statistics corresponding to the target service, some of which are noise that is not useful for fault diagnosis. Therefore, after the call failure data set of the target service is acquired, noise filtering is performed on the call failure data set based on the call failure type and the caller information to update the call failure data set.

The call failure type is the type of remote call request processing failure, and may also be referred to as the error type of the failed call request. Examples are a permission type, where the call request fails because the caller has no access permission, and a timeout type, where the request fails because it timed out. The caller information is information about the caller that initiates a remote invocation request to the target service, and may be, for example, tester information, user information, and the like.

Specifically, the call failure data is acquired in order to diagnose whether the container instance is faulty, so only call failure data related to the actual operating condition of the container instance is of interest. Accordingly, after the call failure data set is obtained, irrelevant data, such as data whose call failure type is the permission type or whose caller information is tester information, is filtered out of the call failure data set. This reduces interference from noise data, reduces the amount of subsequent computation, improves fault diagnosis efficiency, and further improves fault diagnosis accuracy.
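A minimal sketch of this noise filter is given below. The dictionary keys `failure_type` and `caller_type` and the contents of the two exclusion sets are assumptions for illustration; the text names the permission type and tester callers only as examples of irrelevant data.

```python
from typing import Any, Dict, Iterable, List

# Assumed exclusion lists, based on the examples given in the text.
NOISE_FAILURE_TYPES = {"permission"}
NOISE_CALLER_TYPES = {"tester"}


def filter_noise(failure_set: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop records whose failure type or caller marks them as noise."""
    return [
        record for record in failure_set
        if record.get("failure_type") not in NOISE_FAILURE_TYPES
        and record.get("caller_type") not in NOISE_CALLER_TYPES
    ]
```

Timeout failures from ordinary users pass through unchanged, since they reflect the actual operating condition of the container instance.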

S130, if the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets a preset condition, determining a first proportion value of the call failure data subset in the call failure data set.

The unit time window is a preset time unit for data processing. In the embodiment of the present disclosure, the set time period is divided into the first number of unit time windows. The first number is a preset number value. The first statistical value is a value obtained by performing statistics on the call failure data in the unit time window, and may be a value of a statistical index representing the overall distribution of the plurality of call failure data, such as a mean value and a median. The preset condition is a preset condition for preliminarily judging that the container instance has a fault with a high probability, and may be, for example, a preset fixed threshold of a statistical value, or a variation trend of the statistical value, etc.

Specifically, because the data volume of the call failure data subset is large, a unit time window is set in the embodiment of the present disclosure to improve data processing efficiency, so that the call failure data subset is divided into the first number of data segments for processing. In addition, the call failure data subset may contain individual outlier points (sudden jumps) that are inconsistent with the overall data distribution. To avoid the influence of such points, the embodiment of the disclosure calculates a first statistical value for the data segment in each unit time window. For example, the mean of the call failure data in each unit time window is calculated, yielding a first number of mean values. These mean values are far fewer than the data points in the call failure data subset, yet they still reflect its data distribution.

Then, whether each first statistical value meets the preset condition is judged. If the preset condition is met, the target container instance was probably faulty during the set time period. In that case, the proportion of the call failure data subset in the call failure data set (namely, the first proportion value) is further calculated to determine what share of the call request failures of the entire target service was caused by the target container instance.
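The windowing and per-window statistic can be sketched as follows, using the mean as the first statistical value. The `(timestamp, value)` sample layout, the function name, and the choice to report an empty window as 0.0 are assumptions for illustration only.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def window_means(samples: Sequence[Tuple[float, float]], start: float,
                 window_seconds: float, num_windows: int) -> List[float]:
    """Split the set time period into `num_windows` unit time windows and
    return the mean call failure value of each window.

    `samples` holds (timestamp, failure_value) pairs of the call failure
    data subset. Windows without samples are reported as 0.0 (an assumption).
    """
    buckets: List[List[float]] = [[] for _ in range(num_windows)]
    for timestamp, value in samples:
        index = int((timestamp - start) // window_seconds)
        if 0 <= index < num_windows:  # ignore samples outside the period
            buckets[index].append(value)
    return [mean(bucket) if bucket else 0.0 for bucket in buckets]
```

Because each window is reduced to one value, a single spike inside a window is averaged with its neighbors instead of being judged on its own.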

In some embodiments, the preset condition includes a first threshold and a monotonically non-decreasing trend. In this case, determining that the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset satisfies the preset condition may be: determining that each first statistical value exceeds the first threshold and that the data change trend of the first statistical values conforms to a monotonically non-decreasing trend.

The first threshold is a preset critical value of the statistical value, and is used to determine whether the volume of failed call requests has reached a level at which the container instance may be faulty. The first threshold may be a value set empirically or a parameter set by the fault diagnosis requester in the fault diagnosis request.

Specifically, when determining whether each first statistical value satisfies the preset condition, it may first be determined whether each first statistical value exceeds the first threshold. If at least one of the first statistical values does not exceed the first threshold, the target container instance is not necessarily fault-free within the set time period; the subsequent diagnosis process for this case is described in the following embodiments. If all the first statistical values exceed the first threshold, the target container instance may have been faulty within the set time period.

Then, it is further analyzed whether the trend (e.g., the fitted curve) of the data formed by the first statistical values conforms to the monotonically non-decreasing trend. This is because the target container instance is more likely to be faulty if its amount of call failure data keeps increasing or at least stays at a high level. If the trend of the data formed by the first statistical values conforms to the monotonically non-decreasing trend, the probability that the target container instance is faulty is relatively high. Compared with the related art, which diagnoses directly against a fixed threshold, this adds a further condition to the fault diagnosis of the container instance and further improves fault diagnosis accuracy.
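The two-part preset condition can be sketched as below. The pairwise comparison of consecutive window values is a simplifying assumption: the text also allows the trend to be judged on a fitted curve, which this sketch does not implement.

```python
from typing import Sequence


def all_exceed_threshold(stats: Sequence[float], first_threshold: float) -> bool:
    """True if every first statistical value is strictly above the threshold."""
    return all(value > first_threshold for value in stats)


def is_monotone_non_decreasing(stats: Sequence[float]) -> bool:
    """True if no value is smaller than the value of the preceding window."""
    return all(prev <= curr for prev, curr in zip(stats, stats[1:]))


def meets_preset_condition(stats: Sequence[float], first_threshold: float) -> bool:
    """Preset condition: threshold check AND monotonically non-decreasing trend."""
    if not stats:
        return False  # no windows means nothing to judge (an assumption)
    return (all_exceed_threshold(stats, first_threshold)
            and is_monotone_non_decreasing(stats))
```

A sequence that stays flat at a high level still satisfies the condition, which matches the "continuously increasing or at least remains numerically large" reasoning in the text.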

S140, if the first proportion value is determined to exceed the first proportion threshold, determining that the target container instance is faulty.

The first proportion threshold is a predetermined critical value of the proportion, which may be an empirically set value or a parameter set by the fault diagnosis requester in the fault diagnosis request.

Specifically, if the first proportion value calculated above exceeds the first proportion threshold, call request failures caused by the target container instance account for a large share of the call request failures of the entire target service, and the target container instance may be considered faulty. Fault handling may then be performed on the target container instance. For example, alarm information may be issued so that administrators can handle it; the target container instance may be destroyed to reduce the number of call request failures of the target service; or the target container instance may be migrated, which reduces the number of call request failures while keeping the number of container instances of the target service unchanged, thereby preserving the operating efficiency of the target service.

In some embodiments, migration of the target container instance may be implemented as: sending a migration request for the target container instance to a container management system corresponding to the target service, so that the container management system destroys the target container instance and generates a new container instance based on the migration request, thereby completing the migration of the target container instance. That is, the fault diagnosis device of the container instance sends a migration request for the target container instance to the container management system that manages the container instances of the target service. After receiving the migration request, the container management system destroys the target container instance and generates a new container instance from the container image corresponding to the target service to replace it, completing the migration of the target container instance.

In some embodiments, the target container instance is determined to be fault-free if it is determined that the first proportion value does not exceed the first proportion threshold.

Specifically, if the first proportion value calculated above does not exceed the first proportion threshold, the distribution of the call failure data of the target container instance shows some problems, but the target container instance is not the main cause of the large number of call request failures of the target service, and the target container instance is considered not to be faulty.
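The proportion check of S140 and its fault-free counterpart can be sketched as follows. Measuring the subset and the set by their total failure counts, and returning 0.0 when the service has no failures at all, are assumptions for illustration; the function names are hypothetical.

```python
def first_proportion(subset_failures: int, total_failures: int) -> float:
    """Share of the service's call failures attributable to the target instance."""
    if total_failures <= 0:
        return 0.0  # no failures in the service: nothing to attribute (assumption)
    return subset_failures / total_failures


def is_instance_faulty(subset_failures: int, total_failures: int,
                       proportion_threshold: float) -> bool:
    """Faulty only if the first proportion value strictly exceeds the threshold."""
    return first_proportion(subset_failures, total_failures) > proportion_threshold
```

For instance, if a service with 20 container instances records 100 failures and 25 of them come from one instance, that instance carries a quarter of all failures, far above the 5% an even spread would suggest.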

In the technical solutions for fault diagnosis of the container instance provided in the above embodiments, data on failed call requests to the target service, namely the call failure data set and the call failure data subset belonging to the target container instance, is obtained in the process of fault diagnosis. Fault diagnosis is thus based on call failure data, a metric that reflects the operating condition of the container instance more directly, which improves the efficiency and accuracy of fault diagnosis of the container instance. In addition, when the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset is judged to meet the preset condition, the first proportion value of the call failure data subset in the call failure data set is determined, and the target container instance is determined to be faulty only when the first proportion value exceeds the first proportion threshold. This reduces the probability that a container instance is misdiagnosed as faulty and further improves the accuracy of fault diagnosis of the container instance.

FIG. 2 is a schematic flow chart of another fault diagnosis method for a container instance provided in the embodiment of the present disclosure. As shown in FIG. 2, the fault diagnosis method for the container instance may include the following steps:

S201, acquiring a fault diagnosis request of the target container instance.

S202, detecting whether the target container instance is diagnosed as a fault in a historical time period.

Specifically, if a target container instance was diagnosed as faulty within a certain past period of time (i.e., a historical time period) but is still running, it is necessary to determine whether the physical machine running the target container instance is faulty. Therefore, after obtaining the fault diagnosis request, the fault diagnosis device of the container instance queries the fault diagnosis records according to the container instance identifier of the target container instance in the request to determine whether the target container instance has been diagnosed as faulty. If yes, go to S203; if not, go to S206.

S203, counting a third number of faulty container instances in the physical machine corresponding to the target container instance in the historical time period.

Wherein a faulty container instance is a container instance that is deployed in the physical machine and has been diagnosed as faulty.

Specifically, if the target container instance was diagnosed as faulty within the historical time period, the container instances previously deployed and currently deployed in the physical machine running the target container instance are found according to the container instance deployment history. Then, the number of container instances among them that have been diagnosed as faulty (i.e., the third number) is counted based on the fault diagnosis records.

S204, judging whether the third number exceeds a second threshold.

The second threshold refers to a preset numerical value, which may be an empirically set numerical value or a parameter set by a fault diagnosis requester in a fault diagnosis request.

Specifically, the third number is compared with the second threshold value to judge whether the failure number of the container instances running in the physical machine reaches a critical value for judging the failure of the physical machine. If yes, go to S205; if not, go to S206.

S205, determining that the physical machine is faulty and that the target container instance is faulty.

Specifically, if the third number of faulty container instances in the physical machine is greater than the second threshold, which indicates that most container instances in the physical machine have a large number of call request failures, the physical machine may be considered faulty. In this case, the target container instance, which runs on the physical machine and has call request failures, may be considered faulty.
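Steps S203 to S205 can be sketched as below. Representing the deployment history and the fault diagnosis records as plain collections of container instance identifiers is an assumption for illustration, as are the function names.

```python
from typing import Iterable, Set


def count_faulty_instances(instances_on_machine: Iterable[str],
                           diagnosed_faulty: Set[str]) -> int:
    """Third number: distinct instances that ran (or run) on the machine and
    were diagnosed as faulty within the historical time period."""
    return sum(1 for instance_id in set(instances_on_machine)
               if instance_id in diagnosed_faulty)


def is_machine_faulty(instances_on_machine: Iterable[str],
                      diagnosed_faulty: Set[str],
                      second_threshold: int) -> bool:
    """The physical machine is considered faulty if the third number
    strictly exceeds the second threshold."""
    return count_faulty_instances(instances_on_machine,
                                  diagnosed_faulty) > second_threshold
```

When this check returns true, the flow ends at S205 without querying call failure data; otherwise it continues at S206.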

S206, based on the fault diagnosis request, obtaining a call failure data set of the target service to which the target container instance belongs in a set time period, and obtaining a call failure data subset of the target container instance in the set time period.

S207, judging whether the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets a preset condition.

If yes, go to step S208; if not, go to step S212.

In some embodiments, in the case that the preset condition comprises exceeding the first threshold and following a monotonically non-decreasing trend, the above-mentioned determining that the first statistical values do not satisfy the preset condition may be implemented as: determining that any one of the first statistical values does not exceed the first threshold; or determining that the data variation trend of the first statistical values does not conform to the monotonically non-decreasing trend.

Specifically, as described above, the first statistical values are considered to satisfy the preset condition only when every first statistical value exceeds the first threshold and the variation trend of the first statistical values conforms to the monotonically non-decreasing trend. Accordingly, if either of these two requirements is not met, the first statistical values are determined not to satisfy the preset condition. That is, this determination is made in any of the following cases: at least one of the first statistical values is less than or equal to the first threshold; the variation trend of the first statistical values does not conform to the monotonically non-decreasing trend; or both.
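As a minimal sketch of this preset condition, the following Python function combines the two requirements. The function name is hypothetical, and treating an empty series as not satisfying the condition is an assumption that the disclosure does not address.

```python
def meets_preset_condition(first_stats, first_threshold):
    """True when every value exceeds the threshold and the series never decreases."""
    if not first_stats:
        # Assumption: with no unit time windows there is nothing to diagnose.
        return False
    all_exceed = all(value > first_threshold for value in first_stats)
    # Monotonically non-decreasing: each value is >= its predecessor.
    non_decreasing = all(a <= b for a, b in zip(first_stats, first_stats[1:]))
    return all_exceed and non_decreasing
```

A value equal to the first threshold does not count as exceeding it, which matches the "less than or equal to" case described above.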

S208, determining a first ratio value (also referred to as the first specific gravity value) of the call failure data subset in the call failure data set.

S209, judging whether the first specific gravity value exceeds a first specific gravity threshold value.

If yes, go to S210; if not, S211 is executed.

S210, determining that the target container instance has a fault.

S211, determining that the target container instance has no fault.
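Steps S208 to S211 can be sketched as follows. This is only an illustration under stated assumptions: the call failure data is assumed to be a list of per-window failure counts, the proportion is assumed to be a ratio of total counts, and returning 0.0 when the service has no failures is a convention chosen here. The function names are hypothetical.

```python
def first_specific_gravity(subset_failures, service_failures):
    """Proportion of the instance's call failures among the whole service's failures."""
    service_total = sum(service_failures)
    if service_total == 0:
        # Assumption: no failures in the service means a proportion of zero.
        return 0.0
    return sum(subset_failures) / service_total


def is_faulty_by_first_ratio(subset_failures, service_failures, first_ratio_threshold):
    """S209 to S211: faulty only when the proportion exceeds the threshold."""
    ratio = first_specific_gravity(subset_failures, service_failures)
    return ratio > first_ratio_threshold
```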

S212, determining second statistical values of the call failure data corresponding to a second number of consecutive unit time windows in the call failure data subset.

Wherein the second number is a preset number value, and the second number is smaller than the first number. The second statistical value is also a value obtained by performing statistics on the call failure data in the unit time window, and may be a value of a statistical index representing the overall distribution of the plurality of call failure data, such as a mean value and a median. The statistical index of the second statistical value may be the same as or different from the statistical index of the first statistical value.

Specifically, as described above, if any one of the first statistical values does not exceed the first threshold, and/or the data variation trend of the first statistical values does not conform to the monotonically non-decreasing trend, this indicates that whether the target container instance is faulty cannot be judged from the whole set time period. In this case, the time range of data processing is narrowed to further diagnose whether the target container instance is faulty. That is, the call failure data corresponding to a second number of consecutive unit time windows is extracted from the call failure data subset. For example, if the call failure data subset includes the call failure data of a first number N of unit time windows, the call failure data of a second number K (K < N) of consecutive unit time windows, located at any position in the time series of the call failure data subset, is acquired.

Then, a second statistical value of the call failure data is calculated for each of these unit time windows. If the statistical index of the second statistical value is the same as that of the first statistical value, for example, both are mean values, the first statistical values at the corresponding sequence positions can be directly extracted and used as the second statistical values.

In some embodiments, the call failure data corresponding to the second number of consecutive unit time windows is the call failure data of the second number of consecutive unit time windows that are ranked earliest in the call failure data subset. For example, for the time series of the call failure data of the above N unit time windows, the call failure data of the first K unit time windows is extracted. This is for two reasons. First, the call failure data of the earlier-ranked unit time windows was sampled farther from the current time and is therefore less likely to contain errors, which improves the accuracy of the second statistical values and of the subsequent judgment based on them, and thus further improves the accuracy of fault diagnosis. Second, compared with extracting the call failure data of the later-ranked unit time windows, extracting that of the earlier-ranked unit time windows makes the subsequent fault diagnosis based on the second statistical values stricter, which avoids excessive container instance migration operations while improving fault diagnosis accuracy.
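The extraction of K consecutive unit time windows and the calculation of the second statistical values can be sketched as below. The sketch assumes that each unit time window is a list of numeric call failure samples; the function name and the `start` parameter are hypothetical, with `start=0` corresponding to the embodiment that takes the first K windows.

```python
from statistics import mean, median


def second_statistics(windows, second_number, start=0, statistic=mean):
    """Return one statistic per window for K consecutive unit time windows.

    windows: list of N unit time windows, each a list of failure samples.
    second_number: K, which must be positive and smaller than N.
    start: index of the first selected window (0 selects the first K windows).
    statistic: the statistical index, e.g. statistics.mean or statistics.median.
    """
    if not 0 < second_number < len(windows):
        raise ValueError("second number must be positive and smaller than the first number")
    if start < 0 or start + second_number > len(windows):
        raise ValueError("selected windows fall outside the set time period")
    selected = windows[start:start + second_number]
    return [statistic(window) for window in selected]
```

Passing `statistic=median` illustrates the case in which the second statistical value uses a different statistical index from the first statistical value.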

S213, judging whether the variation trend of the second statistical values conforms to the tail dip trend.

The tail dip trend is a variation trend that decreases with a slope smaller than a slope threshold (a preset slope value, which is negative). It indicates that the number of call request failures decreases rapidly at the end of the set time period.

Specifically, the distribution of the second statistical values forms another data variation trend, and whether this data variation trend conforms to the tail dip trend is analyzed. If yes, go to S214; if not, go to S215.

S214, determining that the target container instance has no fault.

Specifically, if the variation trend of each second statistical value conforms to the tail dip trend, which indicates that the number of times of call request failures of the target container instance at the end stage of the set time period is sharply reduced, and normal call request processing is being resumed for some reason, the target container instance is considered to have a self-healing trend, and therefore it is determined that the target container instance is not faulty.
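The tail dip judgment can be sketched as follows. The disclosure does not state how the slope of the second statistical values is obtained, so the least-squares fit over window indices used here is an assumption, and both function names are hypothetical.

```python
def least_squares_slope(values):
    """Slope of the best-fit line through (0, v0), (1, v1), ... (assumed method)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def is_tail_dip(second_stats, slope_threshold):
    """S213: the trend is a tail dip when the slope is below the negative threshold."""
    if slope_threshold >= 0:
        raise ValueError("the slope threshold is expected to be negative")
    return least_squares_slope(second_stats) < slope_threshold
```

For the values 10, 6, 2 the fitted slope is -4, so a slope threshold of -2 would classify the series as a tail dip, whereas the values 10, 9, 8 (slope -1) would not.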

S215, determining a second specific gravity value of the call failure data corresponding to the second number of consecutive unit time windows in the call failure data set.

Specifically, if the variation trend of the second statistical values does not conform to the tail dip trend, the running state of the target container instance is not self-healing. In this case, the proportion, in the call failure data set, of the call failure data of the second number of consecutive unit time windows in the call failure data subset (i.e., the second specific gravity value) is calculated, so as to determine the proportion of the number of call request failures caused by the target container instance in the later stage of its running stage to the number of call request failures of the entire target service.

S216, judging whether the second specific gravity value exceeds a second specific gravity threshold value.

The second specific gravity threshold is another preset critical value of specific gravity, which may be an empirically set value or a parameter set by the fault diagnosis requester in the fault diagnosis request.

Specifically, the second specific gravity value is compared with the second specific gravity threshold value. If the second specific gravity value is greater than the second specific gravity threshold, performing S217; otherwise, S218 is executed.

S217, marking the target container instance as a suspected faulty instance.

Specifically, if the second specific gravity value is greater than the second specific gravity threshold, this indicates that the proportion of the number of call request failures caused by the target container instance in the later stage of its running stage to the number of call request failures of the entire target service exceeds the threshold. The target container instance may therefore have failed, and it is marked so that it can be further diagnosed manually at a later stage.

S218, determining that the target container instance has no fault.

Specifically, if the second specific gravity value is less than or equal to the second specific gravity threshold, this indicates that, although the distribution of the call failure data of the target container instance shows some problems, the target container instance is not the main cause of the large number of call request failures of the target service. The target container instance may therefore be considered not faulty.
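The branch formed by S213 to S218 can be summarized in a short sketch. The `Verdict` enumeration and the function name are hypothetical, the call failure data is assumed to be given as per-window failure counts, and treating a service with no failures as a proportion of zero is a convention chosen here.

```python
from enum import Enum


class Verdict(Enum):
    NO_FAULT = "no_fault"
    SUSPECTED_FAULT = "suspected_fault"


def second_stage_verdict(tail_dip, window_failures, service_failures,
                         second_ratio_threshold):
    """Decide the outcome once the first statistical values fail the preset condition."""
    if tail_dip:
        return Verdict.NO_FAULT  # S214: the instance appears to be self-healing
    service_total = sum(service_failures)
    # S215: proportion of the K windows' failures among the whole service's failures.
    ratio = sum(window_failures) / service_total if service_total else 0.0
    if ratio > second_ratio_threshold:
        return Verdict.SUSPECTED_FAULT  # S217: mark for further manual diagnosis
    return Verdict.NO_FAULT  # S218
```

Note that this branch never yields a confirmed fault: the strongest outcome is a suspected faulty instance, consistent with S217.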

The container instance fault diagnosis method provided by this embodiment can, before performing single-instance fault diagnosis, detect whether the target container instance was diagnosed as faulty in the historical time period. If so, the third number of faulty container instances in the physical machine corresponding to the target container instance in the historical time period is counted; when the third number exceeds the second threshold, the physical machine is determined to be faulty and, in turn, the target container instance is determined to be faulty. This skips the subsequent fault diagnosis process and further improves the efficiency of container instance fault diagnosis.

In addition, when any one of the first statistical values does not exceed the first threshold, or the data variation trend of the first statistical values does not conform to the monotonically non-decreasing trend, the second statistical values of the call failure data corresponding to a second number of consecutive unit time windows in the call failure data subset are determined, and the target container instance is determined to be fault-free when the variation trend of the second statistical values conforms to the tail dip trend. In this way, a container instance whose call requests are quickly returning to normal is determined to be fault-free, which refines the container instance fault diagnosis process and further improves its accuracy.

Furthermore, when the variation trend of the second statistical values is determined not to conform to the tail dip trend, the second specific gravity value of the call failure data corresponding to the second number of consecutive unit time windows in the call failure data set can be determined. The target container instance is marked as a suspected faulty instance when the second specific gravity value exceeds the second specific gravity threshold, and is determined to be fault-free when it does not, which further refines the container instance fault diagnosis process and further improves its accuracy.

Fig. 3 shows a schematic structural diagram of a fault diagnosis apparatus of a container instance provided in an embodiment of the present disclosure. As shown in fig. 3, the fault diagnosis apparatus 300 of the container instance may include:

a fault diagnosis request obtaining module 310, configured to obtain a fault diagnosis request of a target container instance;

a call failure data obtaining module 320, configured to determine, based on the fault diagnosis request, a target service to which the target container instance belongs, query a call failure data set within a set time period, and determine a call failure data subset of the target container instance within the set time period;

a first ratio value determining module 330, configured to determine a first ratio value of the call failure data subset in the call failure data set if it is determined that the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset meets the preset condition; wherein the set time period is divided into a first number of unit time windows;

and a fault diagnosis module 340, configured to determine that the target container instance is faulty if it is determined that the first ratio value (i.e., the first specific gravity value) exceeds the first specific gravity threshold.

The fault diagnosis device for the container instance can acquire relevant data of call request failure of the target service, namely a call failure data set and a call failure data subset belonging to the target container instance, in the process of fault diagnosis of the container instance, so that the fault diagnosis of the container instance is performed by using the index of the call failure data which more directly reflects the operation condition of the container instance, and the efficiency and the accuracy of the fault diagnosis of the container instance are improved. In addition, in the process of fault diagnosis of the container instance, when the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset is judged to meet the preset condition, the first proportion value of the call failure data subset in the call failure data set is determined, and then the target container instance is determined to be faulty when the first proportion value exceeds the first proportion threshold, so that the probability that the container instance is misdiagnosed as the fault is reduced, and the accuracy of fault diagnosis of the container instance is further improved.

In some embodiments, the fault diagnosis apparatus 300 of the container example further includes:

a second statistical value determining module, configured to, after the target service to which the target container instance belongs is determined based on the fault diagnosis request, the call failure data set of the target service within the set time period is queried, and the call failure data subset of the target container instance within the set time period is determined, determine second statistical values of the call failure data corresponding to a second number of consecutive unit time windows in the call failure data subset if it is determined that the first statistical values do not meet the preset condition; wherein the second number is less than the first number;

accordingly, the fault diagnosis module 340 is further configured to:

if it is determined that the variation trend of the second statistical values conforms to the tail dip trend, determine that the target container instance has no fault; wherein the tail dip trend is a variation trend that decreases with a slope smaller than the slope threshold.

In some embodiments, the first ratio value determining module 330 is specifically configured to:

if it is determined that each first statistical value exceeds a first threshold value and the data change trend of each first statistical value conforms to a monotone non-decreasing trend, determining a first ratio value of the calling failure data subset in the calling failure data set;

correspondingly, the second statistical value determining module is specifically configured to:

and if it is determined that any one of the first statistical values does not exceed the first threshold and/or the data change trend of each first statistical value does not conform to the monotone non-decreasing trend, determining second statistical values of the call failure data corresponding to the continuous second number of unit time windows in the call failure data subset.

In some embodiments, the fault diagnosis apparatus 300 of the container instance further includes:

a second specific gravity value determining module, configured to, after the second statistical values of the call failure data corresponding to the second number of consecutive unit time windows in the call failure data subset are determined upon determining that the first statistical values do not meet the preset condition, determine a second specific gravity value of the call failure data corresponding to the second number of consecutive unit time windows in the call failure data set if it is determined that the variation trend of the second statistical values does not conform to the tail dip trend;

accordingly, the fault diagnosis module 340 is further configured to:

if it is determined that the second specific gravity value exceeds the second specific gravity threshold, the target container instance is marked as a suspected faulty instance.

Further, the fault diagnosis module 340 is further configured to:

if it is determined that the second specific gravity value does not exceed the second specific gravity threshold, the target container instance is determined to be non-faulty.

In some embodiments, the fault diagnosis module 340 is further configured to:

after determining the first ratio value of the call failure data subset in the call failure data set, if it is determined that the first ratio value does not exceed the first ratio threshold, determining that the target container instance is fault-free.

In some embodiments, the fault diagnosis apparatus 300 of the container instance further comprises a data filtering module for:

after determining the target service to which the target container instance belongs based on the fault diagnosis request and inquiring a call failure data set of the target service in a set time period, noise filtering is carried out on the call failure data set based on a call failure type and caller information so as to update the call failure data set.
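The noise filtering performed by the data filtering module can be sketched as follows. The disclosure states only that filtering is based on the call failure type and caller information, so the `CallFailure` record layout, the function name, and the idea of passing explicit ignore lists are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallFailure:
    """One call failure record (hypothetical layout)."""
    instance_id: str
    failure_type: str
    caller: str


def filter_noise(call_failures, ignored_failure_types, ignored_callers):
    """Return the updated call failure data set with noisy records removed."""
    return [
        record
        for record in call_failures
        if record.failure_type not in ignored_failure_types
        and record.caller not in ignored_callers
    ]
```

Which failure types and callers count as noise would be decided by the operator of the target service; the sketch leaves that choice to the arguments.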

In some embodiments, the fault diagnosis apparatus 300 of the container instance further comprises a physical machine fault diagnosis module for:

after the fault diagnosis request of the target container instance is obtained, if the target container instance is detected to be diagnosed as a fault in the historical time period, counting a third number of the fault container instances in the physical machine corresponding to the target container instance in the historical time period; wherein the failed container instance is a container instance deployed in a physical machine and diagnosed as failed;

if the third number exceeds a second threshold, a physical machine failure is determined, and a target container instance failure is determined.

In some embodiments, the apparatus 300 further comprises a container instance migration module for:

after the fault of the target container instance is determined, a migration request of the target container instance is sent to a container management system corresponding to the target service, so that the container management system destroys the target container instance and regenerates a new container instance based on the migration request, and migration of the target container instance is completed.

It should be noted that the fault diagnosis apparatus 300 of the container instance shown in fig. 3 may perform each step in the method embodiments shown in fig. 1 to fig. 2, and implement each process and effect in the method embodiments shown in fig. 1 to fig. 2, which are not described herein again.

Embodiments of the present disclosure also provide a fault diagnosis device of a container instance, which may include a processor and a memory, where the memory may be used to store executable instructions. The processor may be configured to read the executable instructions from the memory and execute them to implement the steps of the fault diagnosis method of the container instance in any of the above embodiments.

The fault diagnosis device of the container instance in the embodiment of the present disclosure may include, but is not limited to, devices such as a notebook computer, a desktop computer, a server, and the like.

Fig. 4 shows a schematic structural diagram of a fault diagnosis device of a container instance provided by an embodiment of the present disclosure. The fault diagnosis device 400 of the container instance shown in fig. 4 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

Referring now specifically to fig. 4, a schematic structural diagram of a fault diagnosis device 400 of a container instance suitable for implementing embodiments of the present disclosure is shown.

As shown in fig. 4, the fault diagnosis device 400 of the container instance may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. The RAM 403 also stores various programs and data necessary for the operation of the fault diagnosis device 400 of the container instance. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output interface (I/O interface) 405 is also connected to the bus 404.

Generally, the following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 408 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409. The communication device 409 may allow the fault diagnosis device 400 of the container instance to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 illustrates the fault diagnosis device 400 of the container instance as having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

The embodiments of the present disclosure also provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps of the fault diagnosis method for the container instance in any of the embodiments.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program performs the above-described functions defined in the fault diagnosis method of the container instance of the embodiment of the present disclosure when executed by the processing apparatus 401.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP, and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be included in the fault diagnosis device of the container instance, or may exist separately without being incorporated into the fault diagnosis device of the container instance.

The above-mentioned computer-readable medium carries one or more programs which, when executed by the fault diagnosis device of the container instance, cause the fault diagnosis device of the container instance to perform the steps of the fault diagnosis method of the container instance described in any embodiment of the present disclosure.

In embodiments of the present disclosure, computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with features having similar functions disclosed in (but not limited to) the present disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A fault diagnosis method for a container instance, comprising:

acquiring a fault diagnosis request of a target container instance;

2. The method of claim 1, wherein after determining a target service to which the target container instance belongs based on the fault diagnosis request, querying a call failure data set of the target service within a set time period, and determining a call failure data subset of the target container instance within the set time period, the method further comprises:

if it is determined that the first statistical values do not satisfy the preset condition, determining second statistical values of the call failure data corresponding to a second number of consecutive unit time windows in the call failure data subset; wherein the second number is less than the first number;

if it is determined that the change trend of the second statistical values conforms to a tail dip trend, determining that the target container instance is not faulty; wherein the tail dip trend is a change trend that decreases with a slope smaller than a slope threshold.

3. The method of claim 2, wherein the determining that the first statistical value of the call failure data corresponding to each unit time window in the call failure data subset satisfies a preset condition comprises:

determining that each first statistical value exceeds a first threshold, and that the data change trend of the first statistical values conforms to a monotonically non-decreasing trend;

the determining that each of the first statistical values does not satisfy the preset condition comprises:

determining that any of the first statistical values does not exceed the first threshold;

and/or determining that the data change trend of the first statistical values does not conform to the monotonically non-decreasing trend.
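By way of illustration only, the preset condition of claim 3 can be sketched as a single check over the per-window statistics. The function and parameter names below are hypothetical and do not appear in the disclosure, and treating an empty series as not satisfying the condition is an assumption of this sketch.

```python
from typing import Sequence


def meets_preset_condition(first_stats: Sequence[float], first_threshold: float) -> bool:
    """Check the two-part condition: every value exceeds the threshold
    and the series is monotonically non-decreasing."""
    # Assumption: with no unit time windows there is nothing to evaluate.
    if not first_stats:
        return False
    # Part 1: each first statistical value exceeds the first threshold.
    all_exceed = all(value > first_threshold for value in first_stats)
    # Part 2: no window has a smaller value than the window before it.
    non_decreasing = all(a <= b for a, b in zip(first_stats, first_stats[1:]))
    return all_exceed and non_decreasing
```

A single window at or below the threshold, or a single decrease between adjacent windows, is enough for the condition to fail, which matches the "and/or" negation recited above.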

4. The method of claim 2, wherein after determining the second statistical values of the call failure data corresponding to the second number of consecutive unit time windows in the call failure data subset if it is determined that each of the first statistical values does not satisfy the preset condition, the method further comprises:

if it is determined that the change trend of the second statistical values does not conform to the tail dip trend, determining a second ratio value, in the call failure data set, of the call failure data corresponding to the second number of consecutive unit time windows;

if it is determined that the second ratio value exceeds a second ratio threshold, marking the target container instance as a suspected faulty instance.

5. The method according to claim 4, wherein after determining the second ratio value, in the call failure data set, of the call failure data corresponding to the second number of consecutive unit time windows if it is determined that the change trend of the second statistical values does not conform to the tail dip trend, the method further comprises:

determining that the target container instance is not faulty if it is determined that the second ratio value does not exceed the second ratio threshold.
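By way of illustration only, the fallback branch recited in claims 2, 4 and 5 can be sketched as follows. All names are hypothetical. The sketch assumes that the second statistical values are per-window failure counts, that the slope is the difference between adjacent windows, that the slope threshold is a negative number, and that the caller supplies the denominator used for the proportion; the disclosure does not fix any of these details.

```python
from typing import Sequence


def is_tail_dip(second_stats: Sequence[float], slope_threshold: float) -> bool:
    """Assumed reading of the tail dip trend: every window-to-window
    difference is smaller than the (negative) slope threshold."""
    if len(second_stats) < 2:
        return False  # Assumption: a trend needs at least two windows.
    slopes = [b - a for a, b in zip(second_stats, second_stats[1:])]
    return all(slope < slope_threshold for slope in slopes)


def diagnose_fallback(
    second_stats: Sequence[float],
    service_failures: float,
    slope_threshold: float,
    second_ratio_threshold: float,
) -> str:
    """Return 'no_fault' or 'suspected_fault' for the fallback branch."""
    # Claim 2: failures are already falling off sharply, so no fault is reported.
    if is_tail_dip(second_stats, slope_threshold):
        return "no_fault"
    # Assumption: with no service-level failures there is nothing to attribute.
    if service_failures <= 0:
        return "no_fault"
    # Claims 4 and 5: share of the service's failures owed to the recent windows.
    second_ratio = sum(second_stats) / service_failures
    if second_ratio > second_ratio_threshold:
        return "suspected_fault"
    return "no_fault"
```

For example, under these assumptions `diagnose_fallback([20, 25, 30], 100, -5, 0.5)` yields `"suspected_fault"`, because the counts are not dropping and the instance accounts for 75 of the 100 failures.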

6. The method of claim 1, wherein after the determining the first ratio value of the call failure data subset in the call failure data set, the method further comprises:

determining that the target container instance is not faulty if it is determined that the first ratio value does not exceed the first ratio threshold.

7. The method according to any one of claims 1 to 6, wherein after the determining, based on the fault diagnosis request, a target service to which the target container instance belongs and querying a call failure data set of the target service within a set time period, the method further comprises:

performing noise filtering on the call failure data set based on call failure types and caller information, so as to update the call failure data set.

8. The method of claim 1, wherein after the obtaining the request for fault diagnosis of the target container instance, the method further comprises:

if it is detected that the target container instance was diagnosed as faulty within a historical time period, counting a third number of faulty container instances on the physical machine corresponding to the target container instance within the historical time period; wherein a faulty container instance is a container instance that is deployed on the physical machine and has been diagnosed as faulty;

if the third number exceeds a second threshold, determining that the physical machine is faulty, and determining that the target container instance is faulty.
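By way of illustration only, the physical machine check of claim 8 can be sketched as a count over previously diagnosed instances. The names and data shapes below (a collection of faulty instance identifiers and a mapping from instance to physical machine) are hypothetical and are not specified by the disclosure.

```python
from typing import Iterable, Mapping


def physical_machine_faulty(
    faulty_instances: Iterable[str],
    instance_to_machine: Mapping[str, str],
    target_instance: str,
    second_threshold: int,
) -> bool:
    """Return True when the machine hosting the target instance holds more
    previously diagnosed faulty instances than the second threshold."""
    faulty = set(faulty_instances)
    # The check applies only if the target itself was diagnosed as faulty
    # within the historical time period.
    if target_instance not in faulty:
        return False
    machine = instance_to_machine[target_instance]
    # Third number: faulty instances deployed on the same physical machine.
    third_number = sum(
        1 for instance in faulty if instance_to_machine.get(instance) == machine
    )
    return third_number > second_threshold
```

When the function returns True, the target container instance would be treated as faulty without running the call failure analysis again, which is the shortcut this claim describes.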

9. The method of claim 1 or 8, wherein after said determining that the target container instance is faulty, the method further comprises:

sending a migration request for the target container instance to a container management system corresponding to the target service, so that the container management system, based on the migration request, destroys the target container instance and generates a new container instance, thereby completing the migration of the target container instance.

10. A fault diagnosis apparatus for a container instance, comprising:

a call failure data acquisition module, configured to determine, based on the fault diagnosis request, a target service to which the target container instance belongs, query a call failure data set of the target service within a set time period, and determine a call failure data subset of the target container instance within the set time period;

11. A fault diagnosis device for a container instance, comprising:

a processor;

a memory for storing executable instructions;

wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the method for fault diagnosis of a container instance as claimed in any one of claims 1 to 9.

12. A computer-readable storage medium, wherein the storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method for fault diagnosis of a container instance according to any one of claims 1 to 9.