CN115576732B

CN115576732B - Root cause positioning method and system

Info

Publication number: CN115576732B
Application number: CN202211427061.7A
Authority: CN
Inventors: 杨家海; 赵鋆峰; 张世泽; 王之梁; 董恩焕; 卢建元; 王绍哲; 杨帅; 吕彪; 祝顺民
Original assignee: Tsinghua University; Alibaba Cloud Computing Ltd
Current assignee: Tsinghua University; Alibaba Cloud Computing Ltd
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2023-03-10
Anticipated expiration: 2042-11-15
Also published as: CN115576732A

Abstract

Embodiments of this specification provide a root cause positioning method and system. The root cause positioning method comprises: acquiring flow change information of virtual machines in a virtual machine cluster at a fault time point; screening candidate virtual machines from the virtual machine cluster according to the flow change information, and loading historical data of the candidate virtual machines associated with the fault time point; determining abnormal information of the candidate virtual machines in a preset root cause positioning dimension according to the historical data; and determining a target virtual machine among the candidate virtual machines based on the abnormal information.

Description

Root cause positioning method and system

Technical Field

The embodiment of the specification relates to the technical field of data analysis, in particular to a root cause positioning method and system.

Background

With the development of internet technology, a large number of enterprises and individual users have chosen to deploy services in cloud networks. As cloud networks continue to grow in scale, their operation, maintenance and management have become a new technical challenge. Shared resource services are common in cloud networks, for example NAT (Network Address Translation) services purchased by users, shared bandwidth, shared traffic packets, and dedicated line services. While shared resource services give users low-cost and efficient management, they also bring new challenges to network operation and maintenance. When a shared resource becomes abnormal, prior-art methods such as flow statistics and data aggregation analysis can locate the abnormal virtual machine host, but their accuracy is low, and in some scenarios the host can be located only with the participation of operation and maintenance personnel. The optimal troubleshooting window is therefore easily missed, and the shared resource is affected. An effective solution to these problems is therefore needed.

Disclosure of Invention

In view of this, the embodiments of the present disclosure provide a root cause positioning method. One or more embodiments of the present specification also relate to a root cause positioning apparatus, a root cause positioning system, a server, a computer readable storage medium and a computer program, so as to solve the technical drawbacks in the prior art.

According to a first aspect of embodiments herein, there is provided a root cause positioning method, including:

acquiring flow change information of virtual machines in a virtual machine cluster at a fault time point;

screening candidate virtual machines from the virtual machine cluster according to the flow change information, and loading historical data of the candidate virtual machines associated with the fault time point;

determining abnormal information of the candidate virtual machines in a preset root cause positioning dimension according to the historical data;

determining a target virtual machine among the candidate virtual machines based on the abnormal information.

According to a second aspect of embodiments herein, there is provided a root cause localization apparatus comprising:

an information acquisition module configured to acquire traffic change information of virtual machines in a virtual machine cluster at a failure time point;

a data loading module configured to screen candidate virtual machines from the virtual machine cluster according to the flow change information, and load historical data of the candidate virtual machines associated with the fault time point;

an information determination module configured to determine abnormal information of the candidate virtual machines in a preset root cause positioning dimension according to the historical data;

a virtual machine determination module configured to determine a target virtual machine among the candidate virtual machines based on the abnormal information.

According to a third aspect of embodiments herein, there is provided a root cause positioning system, comprising:

a storage node, configured to acquire flow change information of the virtual machines in a virtual machine cluster at a fault time point, screen candidate virtual machines from the virtual machine cluster according to the flow change information, read historical data of the candidate virtual machines associated with the fault time point, and send the historical data to a computing node;

the computing node, configured to receive the historical data, determine abnormal information of the candidate virtual machines in a preset root cause positioning dimension according to the historical data, and determine a target virtual machine among the candidate virtual machines based on the abnormal information.

According to a fourth aspect of embodiments herein, there is provided a server comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of any of the above-described root cause positioning methods.

According to a fifth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the root cause positioning method described above.

According to a sixth aspect of embodiments herein, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-mentioned root cause positioning method.

To locate a failed virtual machine efficiently and accurately, the root cause positioning method provided by this specification may first obtain flow change information of the virtual machines in a virtual machine cluster at a failure time point and preliminarily exclude non-failed virtual machines according to that information, thereby selecting candidate virtual machines from the cluster. To ensure positioning accuracy, historical data of the candidate virtual machines at the failure time point may then be loaded, which avoids the computational resources that would be consumed by checking every virtual machine in the cluster one by one. After the historical data is loaded, abnormal information of the candidate virtual machines in a preset root cause positioning dimension can be determined according to the historical data, so that whether a candidate virtual machine has failed can be analyzed from the abnormal information. The target virtual machine can then be determined among the candidate virtual machines according to the abnormal information, and the host of the target virtual machine can be located for troubleshooting. High accuracy is achieved together with high efficiency, the time operation and maintenance personnel spend on troubleshooting is greatly reduced, and troubleshooting efficiency is effectively improved.

Drawings

FIG. 1 is a diagram illustrating a method for root cause location according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method for root cause location provided by an embodiment of the present disclosure;

FIG. 3 is a flowchart of determining flow change information in a root cause positioning method according to an embodiment of the present disclosure;

FIG. 4 is a block diagram of a root cause location method according to an embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating a processing procedure of a root cause location method according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of an exemplary root cause location device, according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of a root cause location system according to an embodiment of the present disclosure;

FIG. 8 is a block diagram of a server according to an embodiment of the present disclosure.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be implemented in many ways other than those specifically set forth herein, and those skilled in the art will appreciate that the present description is susceptible to similar generalizations without departing from the scope of the description, and thus is not limited to the specific implementations disclosed below.

The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at ..." or "when ..." or "in response to a determination", depending on the context.

In the present specification, a root cause location method is provided, and the present specification relates to a root cause location device, a root cause location system, a server, a computer readable storage medium, and a computer program, which are described in detail in the following embodiments one by one.

In practical applications, when a shared resource is abnormal, the abnormal virtual machine host needs to be located for troubleshooting. The prior art mostly adopts statistics based on traffic Top N, aggregation analysis based on Session data, analysis methods based on machine learning, aggregation methods based on statistical analysis, and the like.

(1) Statistical method based on traffic Top N: when the egress traffic is abnormal, the N virtual machines with the highest traffic are counted as possible root causes of the anomaly. This method, however, applies only when the traffic is steady and the anomaly is a sharp spike. Real network traffic time series tend to be more complex and uncertain, and anomaly types also include steep drops, high-frequency jitter and others, so the applicable range of the method is limited.

(2) Aggregation analysis based on Session data: the NAT gateway at the network traffic egress holds Session statistics of finer granularity than plain traffic statistics. By counting the flow information of each virtual machine, the traffic change of the virtual machines can be aggregated more accurately, and the virtual machines that may be the root cause of the anomaly can be analyzed. However, Session data is huge in scale; compared with a method based on statistical analysis of virtual machine egress traffic, root cause analysis based on Session data requires more storage space and incurs higher cost, and it is difficult to scale as the cloud network keeps growing.

Other methods can also perform root cause localization, but their efficiency and timeliness are low. An effective solution is therefore needed to solve the above problems.

Referring to the schematic diagram shown in FIG. 1, to locate a failed virtual machine efficiently and accurately, the root cause positioning method provided in this specification may first obtain flow change information (a resource change curve) of the virtual machines in a virtual machine cluster at a failure time point and preliminarily exclude non-failed virtual machines according to that information, thereby selecting candidate virtual machines from the cluster. To ensure positioning accuracy, historical data of the candidate virtual machines at the failure time point may then be loaded, which avoids the computational resources that would be consumed by checking every virtual machine in the cluster one by one. After the historical data is loaded, abnormal information of the candidate virtual machines in the preset root cause positioning dimension can be determined according to the historical data, so that whether a candidate virtual machine has failed can be analyzed from the abnormal information. The target virtual machine can then be determined among the candidate virtual machines according to the abnormal information, and the host of the target virtual machine can be located for troubleshooting. High accuracy is achieved together with high efficiency, the time operation and maintenance personnel spend on troubleshooting is greatly reduced, and troubleshooting efficiency is effectively improved.

Fig. 2 shows a flowchart of a root cause location method according to an embodiment of the present disclosure, which specifically includes the following steps.

Step S202, acquiring the flow change information of the virtual machines in the virtual machine cluster at the failure time point.

The root cause positioning method provided by this embodiment is applied to a server that provides shared resources, including but not limited to shared bandwidth, shared traffic packets, and private line services. When a shared resource is abnormal, good service cannot be provided to users or enterprises; the root cause is that a virtual machine host providing the shared resource has failed. To provide sufficient shared resources, the server usually relies on a large number of hosts to run virtual machines, so when a host fails it is difficult to locate the failed host accurately within a short time for troubleshooting, and the business services of users or enterprises are greatly affected. To solve these problems, the root cause positioning method provided by this embodiment can quickly and accurately locate the target virtual machine host, so as to assist operation and maintenance personnel in troubleshooting, reduce the impact of the fault, and improve the experience of users or enterprises.

Specifically, the failure time point refers to a time point when the host fails, and the shared resource traffic will fluctuate at the time point. Correspondingly, a virtual machine cluster specifically refers to a set formed by all virtual machines provided by a server, and is used for providing shared resources for users or enterprises. Correspondingly, the traffic change information specifically refers to traffic change difference information of each virtual machine in the virtual machine cluster before and after the failure time point, and is used for preliminarily judging whether the host to which the virtual machine belongs is the failed host. That is to say, the traffic change information can reflect the change situation of each virtual machine before and after the failure time point, and can preliminarily reflect whether the host machine fails or is abnormal.

Based on this, the total shared resource traffic of the entire virtual machine cluster and the shared resource traffic of each virtual machine both change over time. Assume that the time series of shared resource traffic for the virtual machine cluster is $y(t)$ and that the traffic of the $i$-th virtual machine within the shared resource is $x_i(t)$. The set of traffic series at the gateway for all virtual machines in the cluster is then $X = \{x_1(t), x_2(t), \ldots, x_n(t)\}$, where $n$ is the number of virtual machines. Since the shared resource traffic is composed of the traffic of all virtual machines, $y(t) = x_1(t) + x_2(t) + \cdots + x_n(t)$.
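The additive relation between the cluster series and the per-virtual-machine series can be checked numerically. The following is a minimal sketch, assuming each virtual machine's traffic is available as an equal-length list of samples; the function name `cluster_traffic`, the virtual machine identifiers and the sample values are illustrative assumptions and do not come from the specification.

```python
from typing import Dict, List


def cluster_traffic(vm_traffic: Dict[str, List[float]]) -> List[float]:
    """Return y(t) = x_1(t) + ... + x_n(t) for equally sampled per-VM series."""
    series = list(vm_traffic.values())
    if not series:
        return []
    length = len(series[0])
    if any(len(s) != length for s in series):
        raise ValueError("all per-VM series must have the same number of samples")
    # Sum the traffic of every virtual machine at each sample index t.
    return [sum(s[t] for s in series) for t in range(length)]


# Hypothetical traffic samples for three virtual machines.
samples = {
    "vm-1": [10.0, 12.0, 11.0],
    "vm-2": [5.0, 5.0, 40.0],
    "vm-3": [1.0, 2.0, 1.0],
}
y = cluster_traffic(samples)
```

In this made-up data the jump in the last sample of `y` is caused entirely by `vm-2`, which is the kind of situation the later per-virtual-machine analysis is meant to expose.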

On this basis, in practical scenarios a shared resource may be shared by a large number of virtual machines. If the traffic of every virtual machine had to be calculated and analyzed when a fault occurs, a large amount of resources would be consumed, transmitting the data over the network would take a long time, and the running time of the algorithm would grow linearly with the number of virtual machines. This is very unfavorable for fault location, and the optimal troubleshooting window may be missed. For the server, the shared resource traffic is stable over time except on special occasions (such as shopping promotion days); even when differences exist, they remain within a controllable range. When the shared resource traffic at a certain moment differs greatly from the traffic immediately before it, the occurrence of a fault can be preliminarily determined, and the root cause positioning method can be triggered to locate the failed host. That is, when the change in y(t) before and after a certain moment exceeds a set threshold, it can be preliminarily determined that a failed host exists, and subsequent root cause positioning can be performed.

Further, the failure time point is the basis of root cause positioning; if it is not determined accurately enough, a large amount of computing resources may be wasted. The failure time point may therefore be determined by reading the shared resource traffic information of the virtual machine cluster. In this embodiment, as shown in FIG. 3, the specific implementation is given in steps S2022 to S2026.

Step S2022, determine the virtual machine cluster corresponding to the shared resource dimension, and read the shared resource traffic information of the virtual machine cluster.

Step S2024, calculating a cluster traffic variation value corresponding to at least one set time point of the virtual machine cluster according to the shared resource traffic information.

Step S2026, determining the failure time point in the at least one set time point according to the cluster traffic change value, and acquiring traffic change information of the virtual machines in the virtual machine cluster at the failure time point.

Specifically, the shared resource dimension refers to the dimension corresponding to a particular shared resource. When determining the failure time point, the virtual machine cluster associated with the shared resource dimension may be determined first, and the determination is then completed by reading the traffic information of that cluster. Correspondingly, the shared resource traffic information refers to the shared resource traffic information of the virtual machine cluster in a set time interval, that is, the sum of the traffic information of each virtual machine in that interval. For example, if the shared resource is a bandwidth resource, the shared resource traffic information is the total traffic of the data transmitted over the bandwidth resource. Correspondingly, a set time point refers to a time point at which the existence of a failed host is checked; it may be an absolute time point set according to actual requirements or a relative time point, which is not limited in this embodiment. Correspondingly, the cluster traffic change value refers to the difference between the shared resource traffic information before and after each set time point.

Based on this, in order to accurately determine the failure time point, a virtual machine cluster corresponding to the shared resource dimension may be determined first, and then the shared resource traffic information corresponding to the virtual machine cluster in a preset time interval is read, that is, the traffic information sum of all virtual machines in the virtual machine cluster is obtained. Secondly, at least one set time point can be determined, and then a cluster flow change value corresponding to each set time point is calculated according to the shared resource flow information, so that a flow change difference before and after each set time point can be analyzed. On the basis, the set time point with a large difference value can be selected as a fault time point, so that the flow change information of each virtual machine at the fault time point can be acquired subsequently, and the flow change information is used for preliminary determination of the fault virtual machine subsequently.

In a specific implementation, when the cluster traffic change value corresponding to each set time point is calculated according to the shared resource traffic information, the portions of the shared resource traffic information before and after each set time point may be determined, and the two portions are then subtracted to obtain the cluster traffic change value for that set time point. The cluster traffic change value is then compared with a preset threshold: if the value is greater than or equal to the threshold, the set time point is considered a failure time point; if the value is less than the threshold, the set time point is considered a normal time point. In this way, the time point at which a host failed can be identified.

That is, when the shared resource corresponding to the virtual machine cluster is abnormal, the traffic differs measurably before and after the failure time point. The cluster flow change value corresponding to each set time point may therefore be calculated by the following formula (1):

$$\Delta y(T) = y_{[T,\,T+L]} - y_{[T-L,\,T]} \tag{1}$$

where $\Delta y(T)$ denotes the cluster flow change value, $T$ is a set time point, $L$ is a time interval, and $y_{[T-L,\,T]}$ and $y_{[T,\,T+L]}$ denote the shared resource traffic information before and after the set time point, respectively. After $\Delta y(T)$ is calculated for any time point, whether a host failure exists at that time point may be determined by comparing the value with a threshold.

This embodiment illustrates the above by taking shared bandwidth resources as an example. Suppose the virtual machine cluster corresponding to the shared bandwidth resource comprises n virtual machines: virtual machine 1, virtual machine 2, ..., and virtual machine n. First, the shared resource traffic of the virtual machine cluster in the time interval T1-T2 is obtained. Then, according to formula (1) and the shared resource traffic, the traffic difference over the interval L before and after time T1 is calculated as C1, the traffic difference over the interval L before and after time T2 is calculated as C2, and the traffic difference over the interval L before and after time T3 is calculated as C3. The absolute values of C1, C2 and C3 are then compared with a preset fault threshold C. If the absolute value of C2 is determined to be greater than C, the failure time point can be determined to be T2, and the preliminary investigation can be completed by collecting the traffic of each virtual machine at time T2, so that the failed host can subsequently be located accurately.
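The threshold comparison in this example can be sketched in a few lines. The sketch below is illustrative only: it assumes the cluster traffic is a list of evenly spaced samples, that the traffic in a window is the sum of its samples (the specification does not state how traffic within the interval L is aggregated), and that windows are half-open so the sample at the set time point belongs to the "after" window. The names `cluster_change` and `find_failure_points`, the sample data and the threshold are hypothetical.

```python
from typing import List, Sequence


def cluster_change(y: Sequence[float], t: int, window: int) -> float:
    """Traffic in the window after index t minus traffic in the window before it."""
    if window <= 0 or t - window < 0 or t + window > len(y):
        raise ValueError("window must be positive and fit on both sides of t")
    before = sum(y[t - window:t])  # assumption: window traffic is the sum of its samples
    after = sum(y[t:t + window])
    return after - before


def find_failure_points(
    y: Sequence[float], set_points: Sequence[int], window: int, threshold: float
) -> List[int]:
    """Return the set time points whose absolute change reaches the threshold."""
    return [t for t in set_points if abs(cluster_change(y, t, window)) >= threshold]


# Hypothetical cluster traffic with a jump starting at sample index 5.
y = [100, 101, 99, 100, 100, 160, 162, 158, 159, 161, 160, 160]
failure_points = find_failure_points(y, set_points=[2, 5, 10], window=2, threshold=50)
```

Comparing absolute values, as the example with C1, C2 and C3 does, lets the same check flag both sudden rises and sudden drops in traffic.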

In conclusion, the accuracy of determining the fault time point can be ensured by combining the shared resource flow information to determine the fault time point, so that root cause positioning is supported on the basis, the root cause positioning accuracy is effectively improved, and influence on users or enterprises is avoided.

Furthermore, once the failure time point is determined, the virtual machine cluster typically still contains a large number of virtual machines, and performing fault location over all of them would consume a large amount of computing resources. To reduce the consumption of computing resources and improve positioning accuracy, the flow change information of each virtual machine may therefore be determined first and used for preliminary screening. In this embodiment, the specific implementation is as follows:

determining a normal time interval and an abnormal time interval according to the fault time point; acquiring normal flow change information of the virtual machine corresponding to the normal time interval and abnormal flow change information corresponding to the abnormal time interval; and calculating a flow change value according to the normal flow change information and the abnormal flow change information, and using the flow change value as the flow change information corresponding to the virtual machine.

Specifically, the normal time interval specifically refers to a set length time interval before the failure time point, and correspondingly, the abnormal time interval specifically refers to a set length time interval after the failure time point, and the length of the normal time interval is the same as that of the abnormal time interval; correspondingly, the normal flow change information specifically refers to flow information of the virtual machine corresponding to the normal time interval; correspondingly, the abnormal traffic change information specifically refers to traffic information of the virtual machine corresponding to the abnormal time interval. Correspondingly, the flow rate change value specifically refers to a difference value between the normal flow rate change information and the abnormal flow rate change information.

Based on this, the determination of the flow change information of any virtual machine in the virtual machine cluster is as follows: after the failure time point is determined, a normal time interval and an abnormal time interval with equal duration can be determined according to the failure time point, and then normal flow change information of the virtual machine corresponding to the normal time interval and abnormal flow change information of the virtual machine corresponding to the abnormal time interval can be obtained and used for reflecting flow conditions of the virtual machine before and after the failure time point. And then, calculating a flow change value according to the normal flow change information and the abnormal flow change information, namely determining the flow change values of the virtual machine before and after the fault time point, and taking the flow change values as the flow change information of the virtual machine.

In specific implementation, the flow rate variation value can be calculated by the following formula (2):

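$$\Delta x_i(T) = x_{i,[T,\,T+L]} - x_{i,[T-L,\,T]} \tag{2}$$

where $\Delta x_i(T)$ denotes the flow change value of the $i$-th virtual machine, $T$ denotes the failure time point, $L$ denotes the time interval, and $x_{i,[T-L,\,T]}$ and $x_{i,[T,\,T+L]}$ denote the traffic information of the $i$-th virtual machine before and after the failure time point, respectively.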

In summary, by calculating the flow change value in combination with the normal time interval and the abnormal time interval, the flow change information of each virtual machine before and after the failure time point can be accurately determined, so that whether the host of the virtual machine has a failure or not can be preliminarily determined in the following process, and the purpose of reducing the number of the troubleshooting objects can be achieved.
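The per-virtual-machine calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes evenly spaced samples, takes the traffic of an interval to be the sum of its samples, and treats the normal interval as the `window` samples before the failure index and the abnormal interval as the `window` samples from the failure index onward. The function name `vm_change_values` and the data are hypothetical.

```python
from typing import Dict, Sequence


def vm_change_values(
    vm_traffic: Dict[str, Sequence[float]], failure_idx: int, window: int
) -> Dict[str, float]:
    """Return, per VM, abnormal-interval traffic minus normal-interval traffic."""
    changes: Dict[str, float] = {}
    for vm_id, x in vm_traffic.items():
        if window <= 0 or failure_idx - window < 0 or failure_idx + window > len(x):
            raise ValueError(f"intervals do not fit in the series of {vm_id}")
        normal = sum(x[failure_idx - window:failure_idx])    # interval before the failure
        abnormal = sum(x[failure_idx:failure_idx + window])  # interval after the failure
        changes[vm_id] = abnormal - normal
    return changes


# Hypothetical per-VM samples; the failure time point is sample index 3.
traffic = {
    "vm-1": [10, 10, 10, 10, 11, 10],
    "vm-2": [5, 5, 5, 40, 42, 41],
    "vm-3": [20, 20, 20, 2, 1, 0],
}
changes = vm_change_values(traffic, failure_idx=3, window=3)
```

With these values `vm-2` shows a large increase and `vm-3` a large decrease, while `vm-1` is essentially unchanged and would be a natural candidate for exclusion.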

In addition, when acquiring the normal traffic change information and the abnormal traffic change information corresponding to the virtual machine, in order to facilitate resource management by the server and reduce the computational pressure of the server, the log server may record the traffic information separately, and read and use the information when it is needed, in this embodiment, the specific implementation manner is as follows:

creating an information reading request according to the normal time interval and the abnormal time interval, and sending the information reading request to a log server corresponding to the virtual machine cluster; receiving initial flow change information fed back by the log server aiming at the information reading request; and determining the normal flow change information and the abnormal flow change information in the initial flow change information.

Specifically, the information reading request refers to a request for reading the traffic change information of the virtual machine in the normal time interval and the abnormal time interval; the request carries the virtual machine identifier together with the normal time interval and the abnormal time interval. Correspondingly, the log server refers to a server that records the traffic change information of the virtual machines and is responsible for recording and managing the traffic logs. Accordingly, the initial traffic change information refers to the normal traffic change information and the abnormal traffic change information taken together.

Based on this, after the normal time interval and the abnormal time interval are determined according to the failure time point, an information reading request may be created from the identification information of the virtual machine, the normal time interval and the abnormal time interval, and sent to the log server that records the traffic logs of the virtual machine cluster. After receiving the information reading request, the log server parses it to obtain this information, reads the initial traffic change information from the log database accordingly, and feeds it back to the server. The server then parses the initial traffic change information to obtain the normal traffic change information and the abnormal traffic change information, which are used for the subsequent calculation of the flow change value.

Continuing the above example, after time T2 is determined as the failure time point, the traffic of each of the n virtual machines can be read from the log server. The traffic information covers the time intervals [T2-L, T2] and [T2, T2+L], that is, the intervals of length L before and after the failure time point. After the traffic of each virtual machine is obtained, the flow change value of each of the n virtual machines can be calculated according to formula (2), the virtual machines with a high probability of failure among the n virtual machines can then be identified from the flow change values, and root cause positioning can be performed.
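The request and response handling described above can be sketched as a small data structure plus a splitting step. Everything here is a hypothetical illustration: the specification does not define a request format or a log record layout, so `ReadRequest`, `build_request`, `split_initial_info` and the `(timestamp, value)` record shape are assumptions. The sketch also uses half-open intervals so that the sample at the failure time point is not counted in both intervals, whereas the text writes both intervals as closed.

```python
from dataclasses import dataclass
from typing import List, Tuple

Record = Tuple[int, float]  # (timestamp, traffic value); hypothetical record shape


@dataclass(frozen=True)
class ReadRequest:
    vm_id: str
    normal_interval: Tuple[int, int]    # [T - L, T)
    abnormal_interval: Tuple[int, int]  # [T, T + L)


def build_request(vm_id: str, failure_time: int, length: int) -> ReadRequest:
    """Create a read request covering the intervals of `length` before and after the failure."""
    return ReadRequest(
        vm_id,
        (failure_time - length, failure_time),
        (failure_time, failure_time + length),
    )


def split_initial_info(
    request: ReadRequest, records: List[Record]
) -> Tuple[List[Record], List[Record]]:
    """Split the records returned by the log server into normal and abnormal parts."""
    n_lo, n_hi = request.normal_interval
    a_lo, a_hi = request.abnormal_interval
    normal = [r for r in records if n_lo <= r[0] < n_hi]
    abnormal = [r for r in records if a_lo <= r[0] < a_hi]
    return normal, abnormal


# Hypothetical response for one virtual machine with failure time 100 and L = 3.
request = build_request("vm-2", failure_time=100, length=3)
records = [(97, 5.0), (98, 5.0), (99, 5.0), (100, 40.0), (101, 42.0), (102, 41.0)]
normal, abnormal = split_initial_info(request, records)
```

Keeping the request limited to an identifier and two intervals means the server only transfers the samples it needs for the flow change calculation.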

In summary, determining the traffic change information with the help of the log server means that the server maintains less information and can devote more computing resources to root cause positioning, which further improves the accuracy and efficiency of root cause positioning.

Step S204, screening candidate virtual machines in the virtual machine cluster according to the flow change information, and loading historical data of the candidate virtual machines connected with the fault time point.

Specifically, after the traffic change information of each virtual machine in the virtual machine cluster is determined, it can be used for preliminary screening: because the traffic change information reflects how a virtual machine's traffic fluctuates before and after the fault time point, it indicates whether the virtual machine may have a fault. Candidate virtual machines are thereby determined in the virtual machine cluster, and whether each candidate is a faulty virtual machine is further analyzed by loading the candidate's historical data associated with the fault time point.

The candidate virtual machines are a subset of virtual machines screened from the virtual machine cluster. Screening reduces the number of virtual machines to be examined, which improves the efficiency of the subsequent, more precise positioning of the faulty virtual machine host. Correspondingly, the historical data is the data processed or transmitted by a candidate virtual machine in the time interval corresponding to the fault time point. For example, if the shared resource is a bandwidth resource, the historical data is the data transmitted by the candidate virtual machine in the time interval corresponding to the fault time point; if the shared resource is a CPU resource, the historical data is the data processed by the candidate virtual machine in that time interval.

Further, in order to use the traffic change information to accurately screen out the virtual machines with a higher probability of having a fault, the screening may be implemented by calculating a ratio. In this embodiment, the specific implementation is as follows:

calculating the ratio of the flow change value corresponding to each virtual machine to the cluster flow change value; sorting the virtual machines in the virtual machine cluster according to the calculation result to obtain a virtual machine queue; and selecting a set number of virtual machines in the virtual machine queue as the candidate virtual machines.

Specifically, the ratio expresses the share of the cluster traffic change value that is contributed by the traffic change value of a virtual machine; the higher the ratio, the more likely the virtual machine is a root cause. Correspondingly, the virtual machine queue is the queue obtained by sorting the virtual machines in the virtual machine cluster by this ratio, either from high to low or from low to high. Correspondingly, the set number is the number of candidate virtual machines selected from the queue; it may be set according to actual requirements and is not limited here.

Based on this, after the traffic change value of each virtual machine and the cluster traffic change value of the virtual machine cluster are obtained through the above calculation, the ratio of each virtual machine's traffic change value to the cluster traffic change value can be calculated. The ratio gives the traffic abnormality degree of each virtual machine; the virtual machines are sorted from high to low by traffic abnormality degree, and a set number of virtual machines are then selected in order as candidate virtual machines for the subsequent second-stage positioning.
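The ratio-based screening described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment; the function name `select_candidates`, the VM identifiers, and the sample values are hypothetical, and the cluster traffic change value is assumed here to be the sum of the per-VM traffic change values.

```python
from typing import Dict, List


def select_candidates(vm_change: Dict[str, float],
                      cluster_change: float,
                      k: int) -> List[str]:
    """Rank VMs by their share of the cluster traffic change and keep the top k."""
    if cluster_change == 0:
        raise ValueError("cluster traffic change value must be non-zero")
    # Ratio of each VM's traffic change value to the cluster traffic change value.
    ratios = {vm: change / cluster_change for vm, change in vm_change.items()}
    # Sort from high to low; ties are broken by VM identifier for determinism.
    queue = sorted(ratios, key=lambda vm: (-ratios[vm], vm))
    return queue[:k]


# Hypothetical traffic change values for four VMs sharing one resource.
changes = {"vm-1": 5.0, "vm-2": 120.0, "vm-3": -3.0, "vm-4": 40.0}
candidates = select_candidates(changes, cluster_change=sum(changes.values()), k=2)
print(candidates)
```

The set number `k` is a tuning parameter; a larger value passes more virtual machines to the second-stage positioning at the cost of more data loading.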

In practical application, if the cluster traffic change value is greater than 0, then the larger a virtual machine's traffic change value, the more likely that virtual machine is a root cause, and the smaller the value, the less likely. The ratio of the virtual machine's traffic change value to the cluster traffic change value can therefore be used to determine the probability that the virtual machine has failed, and thus to determine the candidate virtual machines. Because selection based on ratio sorting alone has limited accuracy and cannot pinpoint the faulty virtual machine, the K virtual machines with the largest ratios can be selected for subsequent processing. Alternatively, C times that number, that is, the top K×C virtual machines, can be selected as candidate virtual machines to meet the needs of subsequent processing.

It should be noted that screening the candidate virtual machines only requires numerical calculation over the traffic information, so the calculation can be performed on the nodes that store the traffic information by means of preset SQL statements. The server directly obtains the result fed back by the nodes and determines the candidate virtual machines in the virtual machine cluster from it, so the calculation is completed without data migration.
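As an illustration of pushing the screening calculation down to the storage side, the sketch below uses an in-memory SQLite database as a stand-in for a traffic storage node. The table `vm_traffic`, its columns, and the sample rows are hypothetical, and the traffic change value is assumed to be the traffic summed over the interval after the fault time point minus the traffic summed over the interval before it; formula (2) itself is not reproduced here.

```python
import sqlite3

# In-memory database standing in for a traffic storage node (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vm_traffic (vm_id TEXT, ts INTEGER, bytes REAL)")
rows = [
    ("vm-1", 8, 10.0), ("vm-1", 9, 10.0), ("vm-1", 10, 11.0), ("vm-1", 11, 12.0),
    ("vm-2", 8, 10.0), ("vm-2", 9, 10.0), ("vm-2", 10, 60.0), ("vm-2", 11, 70.0),
    ("vm-3", 8, 20.0), ("vm-3", 9, 20.0), ("vm-3", 10, 25.0), ("vm-3", 11, 30.0),
]
conn.executemany("INSERT INTO vm_traffic VALUES (?, ?, ?)", rows)

# Per-VM change = traffic in [t, t + span) minus traffic in [t - span, t).
# Aggregation, sorting, and truncation all run inside the storage engine,
# so only the top-k rows travel back to the caller.
QUERY = """
SELECT vm_id,
       SUM(CASE WHEN ts >= :t THEN bytes ELSE -bytes END) AS delta
FROM vm_traffic
WHERE ts >= :t - :span AND ts < :t + :span
GROUP BY vm_id
ORDER BY delta DESC
LIMIT :k
"""
top = conn.execute(QUERY, {"t": 10, "span": 2, "k": 2}).fetchall()
print(top)
```

When the cluster traffic change value is positive, ordering by the per-VM change gives the same ranking as ordering by the ratio, so the division can be left out of the query.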

Continuing the above example, after the traffic change value of each of the n virtual machines is determined, the ratio of each traffic change value to the sum of the traffic change values of the n virtual machines can be calculated. According to the calculation result, the ratio of virtual machine 1 is L1, the ratio of virtual machine 2 is L2, …, and the ratio of virtual machine n is Ln. L1 to Ln are then sorted from high to low, and the top K×C virtual machines are selected as candidate virtual machines according to the sorting result. The number of virtual machines screened out is m (m < n), so that the faulty virtual machine host can subsequently be located among the m virtual machines.

In summary, screening the candidate virtual machines by calculating the ratio means that the cost of subsequent data acquisition and calculation does not grow with the number of virtual machines sharing the resource; the method can therefore be extended to larger virtual machine clusters and has higher scalability.

Furthermore, after the candidate virtual machines are determined, in order to accurately locate the faulty virtual machine among them and thereby determine the faulty host, analysis may be performed on the data of the candidate virtual machines. Data associated with a virtual machine is usually kept in a preset storage medium, so to improve data acquisition efficiency and speed up root cause positioning, a multi-threaded copy approach may be adopted. In this embodiment, the specific implementation is as follows:

creating a data loading request according to the fault time point, and sending the data loading request to a database corresponding to the candidate virtual machine; starting at least two threads in response to the data loading request, and receiving historical normal data and historical abnormal data fed back by the database aiming at the data loading request through the at least two threads as the historical data.

Specifically, the data loading request is a request, sent to the database storing the data, to read the data of the candidate virtual machine associated with the fault time point. Correspondingly, the database is the database that stores the data processed or transmitted by the virtual machine. The historical normal data is the data transmitted or processed by the candidate virtual machine within a set time interval before the fault time point, and the historical abnormal data is the data transmitted or processed by the candidate virtual machine within a set time interval after the fault time point; the two set time intervals have the same length.

Based on this, after the candidate virtual machines and the fault time point are determined, in order to accurately analyze whether each candidate virtual machine is associated with the faulty host, a data loading request can be created according to the fault time point and sent to the database corresponding to the candidate virtual machine. At least two threads are then started in response to the data loading request, and the historical normal data and historical abnormal data fed back by the database are received through these threads as the historical data. This improves data acquisition efficiency, so that root cause positioning can be completed in a short time.

In practical application, after a candidate virtual machine is determined, its historical data associated with the fault time point needs to be pulled to the local side for subsequent analysis, and the pull can be implemented in a multi-threaded, multi-batch manner. That is, in the historical data acquisition phase, multiple threads are started simultaneously and pull data continuously, which improves network throughput. In each network I/O operation, a thread fetches a batch of records rather than a single record, so that connections are reused and the time spent establishing and closing connections is reduced.
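The multi-threaded, multi-batch pull can be sketched as below. This is only an illustration under stated assumptions: `FAKE_DB` is a hypothetical in-memory stand-in for the database, `fetch_batch` stands in for one network round trip that returns a whole batch of records, and connection reuse is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Record = Tuple[int, float]  # (timestamp, value)

# Hypothetical in-memory stand-in for the database: vm_id -> list of records.
FAKE_DB: Dict[str, List[Record]] = {
    f"vm-{i}": [(t, float(i * 100 + t)) for t in range(20)] for i in range(4)
}


def fetch_batch(vm_id: str, start: int, end: int) -> List[Record]:
    """Return all records of one VM with start <= timestamp < end in one call."""
    return [(t, v) for t, v in FAKE_DB[vm_id] if start <= t < end]


def load_history(vm_ids: List[str], t_fault: int, length: int,
                 batch: int, workers: int = 4) -> Dict[str, List[Record]]:
    """Pull [t_fault - length, t_fault + length) for every VM in parallel batches."""
    # One task per (VM, time batch), so each call returns many records at once.
    tasks = [(vm, s, min(s + batch, t_fault + length))
             for vm in vm_ids
             for s in range(t_fault - length, t_fault + length, batch)]
    history: Dict[str, List[Record]] = {vm: [] for vm in vm_ids}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in task order, so each VM's records stay sorted.
        results = pool.map(lambda task: fetch_batch(*task), tasks)
        for (vm, _, _), records in zip(tasks, results):
            history[vm].extend(records)
    return history


history = load_history(["vm-1", "vm-2"], t_fault=10, length=6, batch=4)
print(len(history["vm-1"]), len(history["vm-2"]))
```

The batch size and the number of worker threads are assumed tuning parameters; larger batches mean fewer round trips per virtual machine.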

Step S206, determining abnormal information of the candidate virtual machine in a preset root cause positioning dimension according to the historical data.

Specifically, after the candidate virtual machines and their historical data associated with the fault time point are determined, and considering that the historical data is the data processed or transmitted by a candidate virtual machine before and after the fault time point, the abnormal information of the candidate virtual machine in the preset root cause positioning dimension can be analyzed from the change in the historical data before and after the fault time point. The target virtual machine is then screened out according to the abnormal information, which completes the positioning of the faulty host.

The preset root cause positioning dimension is the set of one or more dimensions from which the abnormal fluctuation of a candidate virtual machine is analyzed. Correspondingly, the abnormal information is information that expresses the abnormal degree of a candidate virtual machine. With it, the abnormal degree of each candidate virtual machine can be determined, the candidate virtual machine with the highest abnormal degree can be selected as the target virtual machine, and the host to which that virtual machine belongs is the faulty host.

Further, when the abnormal information of a candidate virtual machine is determined from the historical data, the data before the fault time point is treated as normal data and the data after the fault time point as abnormal data, and the abnormal information can be determined in each sub-root-cause positioning dimension on this basis. In this embodiment, the specific implementation is as follows:

dividing the historical data according to the fault time point to obtain historical normal data and historical abnormal data; determining at least one sub-root-cause positioning dimension in the preset root cause positioning dimensions, and calculating abnormal information of each sub-root-cause positioning dimension according to the historical normal data and the historical abnormal data.

Specifically, the historical normal data is the data processed or transmitted by the virtual machine before the fault time point, and the historical abnormal data is the data processed or transmitted by the virtual machine after the fault time point. Correspondingly, a sub-root-cause positioning dimension is a dimension used to detect the abnormal degree of a candidate virtual machine, and the sub-root-cause positioning dimensions include, but are not limited to, a predicted root cause positioning dimension, a comparative root cause positioning dimension, and a similar root cause positioning dimension. The predicted root cause positioning dimension determines the abnormal degree of the candidate virtual machine by comparing the expected value predicted for the abnormal time interval with the actual value. The comparative root cause positioning dimension determines the abnormal degree of the candidate virtual machine by comparison against a threshold. The similar root cause positioning dimension determines the abnormal degree of the candidate virtual machine by calculating curve similarity. The abnormal degree information of each dimension is the abnormal information.

Based on this, after the historical data of a candidate virtual machine associated with the fault time point is obtained, the historical data can be divided at the fault time point to obtain the historical normal data and the historical abnormal data. At least one sub-root-cause positioning dimension is then determined in the preset root cause positioning dimensions, and the abnormal information of each sub-root-cause positioning dimension is calculated from the historical normal data and the historical abnormal data. The target virtual machine can then be located by combining the abnormal information of the dimensions.
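The division of the historical data at the fault time point can be sketched as follows. The sample representation as (timestamp, value) pairs and the function name `split_history` are hypothetical; the sketch also assumes that the sample taken exactly at the fault time point is assigned to the abnormal data, which the text does not specify.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp, value)


def split_history(history: List[Sample], t_fault: int,
                  length: int) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into equal-length windows before and after the fault time point."""
    # Historical normal data: [t_fault - length, t_fault).
    normal = [(t, v) for t, v in history if t_fault - length <= t < t_fault]
    # Historical abnormal data: [t_fault, t_fault + length).
    abnormal = [(t, v) for t, v in history if t_fault <= t < t_fault + length]
    return normal, abnormal


samples = [(t, 10.0 if t < 5 else 30.0) for t in range(10)]
normal, abnormal = split_history(samples, t_fault=5, length=3)
print(normal, abnormal)
```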

In summary, determining the abnormal information of each sub-root-cause positioning dimension from the historical normal data and the historical abnormal data expresses the abnormal degree of a candidate virtual machine from multiple dimensions, so that the target virtual machine can subsequently be determined accurately among the candidate virtual machines.

Furthermore, during root cause analysis the abnormal information is determined in at least one sub-root-cause positioning dimension, and the analysis can be divided into absolute deviation analysis and relative deviation analysis, so that the abnormal degree of a candidate virtual machine is determined from different angles to meet the needs of the subsequent screening of the target virtual machine. The predicted root cause positioning dimension is absolute deviation analysis; the comparative root cause positioning dimension and the similar root cause positioning dimension are relative deviation analysis. The abnormal information of the different dimensions is determined as follows:

(1) Under the condition that the sub-root cause positioning dimension is the prediction root cause positioning dimension, determining the abnormal information of the prediction root cause positioning dimension comprises the following steps: inputting the historical normal data into a pre-trained flow prediction model to obtain flow expectation information, and determining first flow real information according to the historical abnormal data; and calculating deviation information between the flow expectation information and the first flow real information to serve as abnormal information of the candidate virtual machine in the prediction root cause positioning dimension.

Specifically, the flow prediction model is a model capable of predicting an expected flow in a set time interval; correspondingly, the flow expectation information specifically refers to a flow information representation expected to be processed by the candidate virtual machine in a set time interval, and the set time interval is an abnormal time interval; correspondingly, the first flow real information specifically refers to a flow information representation of the candidate virtual machine when actually processing data in an abnormal time interval. Correspondingly, the deviation information specifically refers to an absolute deviation obtained by comparing the real information with the expected information, wherein the larger the deviation is, the larger the abnormal degree is, and conversely, the smaller the deviation is, the smaller the abnormal degree is.

Based on this, in the case that the sub-root cause location dimension is the predicted root cause location dimension, the determination of the abnormal information corresponding to each candidate virtual machine specifically includes: firstly, inputting historical normal data into a pre-trained flow prediction model, and processing through the flow prediction model to obtain flow expectation information corresponding to an abnormal time interval; meanwhile, first flow real information corresponding to the abnormal time interval is determined according to historical abnormal data; and then, calculating the difference value of the two values to obtain the absolute deviation corresponding to the candidate virtual machine, so that the abnormal degree of the candidate virtual machine can be known according to the absolute deviation.

Continuing the above example, after the m candidate virtual machines are determined, the expected value for the time interval [t2, t2+L] can be predicted from the historical normal data of each candidate virtual machine in the time interval [t2-L, t2]. The absolute deviation of the candidate virtual machine is then obtained as the difference between its true value and its expected value in the time interval [t2, t2+L], and this absolute deviation represents the abnormal degree of the candidate virtual machine.

That is to say, the expected value of a candidate virtual machine in the abnormal time interval (T, T + L) can be obtained from the historical normal data by the moving average method, and the true value of the candidate virtual machine in the abnormal time interval (T, T + L) can be obtained from the historical abnormal data. Taking the difference between the expected value and the true value gives the prediction deviation, that is, the absolute deviation d_i, where d_i is the prediction deviation of virtual machine x_i; this makes it convenient to determine the faulty host subsequently.
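A minimal sketch of the prediction deviation follows. It assumes a simple moving average over the last `window` normal samples, held constant as the expected value for the whole abnormal interval, and compares it with the mean of the abnormal samples; the window size, the use of the mean, and the function names are assumptions rather than details given in the text.

```python
from typing import Sequence


def moving_average_forecast(normal: Sequence[float], window: int) -> float:
    """Expected value for the abnormal interval: mean of the last `window` normal samples."""
    if not 0 < window <= len(normal):
        raise ValueError("window must be between 1 and len(normal)")
    recent = normal[-window:]
    return sum(recent) / window


def prediction_deviation(normal: Sequence[float], abnormal: Sequence[float],
                         window: int = 3) -> float:
    """Absolute deviation between the observed and the expected traffic."""
    expected = moving_average_forecast(normal, window)
    actual = sum(abnormal) / len(abnormal)
    return abs(actual - expected)


# Hypothetical traffic samples before and after the fault time point.
d_i = prediction_deviation([10, 12, 11, 13, 12], [30, 34, 32])
print(d_i)
```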

(2) When the sub-root cause location dimension is the comparative root cause location dimension, determining abnormal information of the comparative root cause location dimension includes: generating an abnormal condition according to the historical normal data, and determining second flow real information according to the historical abnormal data; and detecting the second flow real information according to the abnormal condition, and determining the abnormal information of the candidate virtual machine in the comparative root cause positioning dimension according to the detection result.

Specifically, the abnormal condition specifically refers to a threshold value for analyzing the abnormal degree of the candidate virtual machine, and correspondingly, the second traffic real information specifically refers to traffic information representation when the candidate virtual machine actually processes data in the abnormal time interval.

Based on this, in the case that the sub-root-cause positioning dimension is the comparative root cause positioning dimension, the abnormal information of each candidate virtual machine is determined as follows: first, an abnormal condition is generated from the historical normal data, and the second traffic real information is determined from the historical abnormal data; second, the second traffic real information is checked against the abnormal condition to analyze the abnormal degree of the candidate virtual machine, which gives the abnormal information of the candidate virtual machine in the comparative root cause positioning dimension.

According to the above example, after m candidate virtual machines are determined, fitting calculation can be performed according to historical normal data of the candidate virtual machines in the time interval [ t2-L, t2], an abnormal threshold value is obtained according to the fitting calculation result, a true value is determined by combining historical abnormal data of the candidate virtual machines in the time interval [ t2, t2+ L ], and abnormal information of the candidate virtual machines can be determined by comparing the true value with the abnormal threshold value. In specific implementation, the larger the difference between the true value and the abnormal threshold value is, the higher the abnormal degree is.

That is to say, when the relative deviation analysis is performed, an anomaly amplitude may be introduced to measure the abnormal degree of a candidate virtual machine between the abnormal time interval and the normal time interval; the larger the anomaly amplitude, the more likely the virtual machine is the root cause. On this basis, an abnormal threshold can be obtained from the historical normal data by extreme value theory, and the true value is compared with the abnormal threshold. The comparison yields a result between 0 and 1 for the candidate virtual machine, denoted a_i, which represents the anomaly amplitude of virtual machine x_i.
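The threshold comparison can be illustrated with the sketch below. It does not implement an extreme value theory fit: as a plainly simpler stand-in, the threshold is taken as the mean of the normal data plus `k` population standard deviations. The mapping of the comparison result into the range 0 to 1 (the relative excess of the abnormal peak over the threshold) is likewise an assumption, since the text does not give it.

```python
import statistics
from typing import Sequence


def anomaly_threshold(normal: Sequence[float], k: float = 3.0) -> float:
    """Stand-in threshold: mean + k * population standard deviation (not an EVT fit)."""
    return statistics.fmean(normal) + k * statistics.pstdev(normal)


def anomaly_amplitude(normal: Sequence[float], abnormal: Sequence[float],
                      k: float = 3.0) -> float:
    """Value in [0, 1]: relative excess of the abnormal peak over the threshold."""
    threshold = anomaly_threshold(normal, k)
    peak = max(abnormal)
    if peak <= threshold or peak <= 0:
        return 0.0  # the peak never crosses the threshold
    # Clamp to 1.0 in case the threshold is negative.
    return min(1.0, (peak - threshold) / peak)


a_i = anomaly_amplitude([10, 10, 10, 10], [20, 40])
print(a_i)
```

The amplitude grows with the gap between the true value and the threshold, consistent with the statement that a larger gap means a higher abnormal degree.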

(3) Under the condition that the sub-root cause positioning dimension is the similar root cause positioning dimension, determining the abnormal information of the similar root cause positioning dimension comprises the following steps: acquiring a first resource change curve corresponding to the virtual machine cluster, and generating a second resource change curve corresponding to the candidate virtual machine according to the historical normal data and the historical abnormal data; and calculating the curve similarity between the first resource change curve and the second resource change curve to serve as the abnormal information of the candidate virtual machine in the similar root cause positioning dimension.

Specifically, the first resource variation curve specifically refers to a shared resource variation curve of the virtual machine cluster. Correspondingly, the second resource change curve specifically refers to a resource change curve of the candidate virtual machine; correspondingly, the curve similarity specifically refers to the similarity between the shared resource change curve and the resource change curve, and a higher similarity indicates a higher abnormal degree of the candidate virtual machine, whereas a lower similarity indicates a lower abnormal degree of the candidate virtual machine.

Based on this, in the case that the sub-root cause location dimension is the similar root cause location dimension, the determination of the abnormal information corresponding to each candidate virtual machine specifically includes: a first resource change curve corresponding to the virtual machine cluster can be obtained, and a second resource change curve corresponding to the candidate virtual machine can be generated according to the historical normal data and the historical abnormal data; and then, by calculating the curve similarity between the first resource change curve and the second resource change curve, the abnormal degree of the candidate virtual machine can be determined and can be used as the abnormal information of the candidate virtual machine in the similar root cause positioning dimension.

Continuing the above example, after the m candidate virtual machines are determined, the shared resource curve corresponding to the n virtual machines of the cluster can be determined first, and a virtual machine curve is then constructed for each candidate virtual machine from its historical abnormal data and historical normal data. The similarity between the virtual machine curve of each candidate virtual machine and the shared resource curve is then calculated to determine the abnormal information of the candidate virtual machines.

That is, when the relative deviation analysis is performed, the abnormal information may also be determined by introducing a discrete set metric: the more similar a virtual machine curve is to the shared resource curve of the virtual machine cluster, the more likely the corresponding virtual machine is the root cause. The similarity of the two curves can therefore be calculated with a set similarity algorithm based on a discrete set metric to obtain s_i, which represents the curve similarity, that is, the similarity between the shape of the curve of virtual machine x_i and the shape of the shared resource curve.
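One possible reading of a set similarity over discretized curves is sketched below. The text does not name the algorithm, so the choices here are assumptions: each curve is discretized into a set of (step index, direction of change) pairs, and the two sets are compared with the Jaccard similarity.

```python
from typing import Sequence, Set, Tuple


def trend_set(curve: Sequence[float], tol: float = 0.0) -> Set[Tuple[int, int]]:
    """Discretize a curve into (step index, direction) pairs: +1 up, -1 down, 0 flat."""
    items = set()
    for i in range(1, len(curve)):
        step = curve[i] - curve[i - 1]
        direction = 1 if step > tol else -1 if step < -tol else 0
        items.add((i, direction))
    return items


def curve_similarity(cluster_curve: Sequence[float], vm_curve: Sequence[float],
                     tol: float = 0.0) -> float:
    """Jaccard similarity of the two discretized curves, in [0, 1]."""
    a, b = trend_set(cluster_curve, tol), trend_set(vm_curve, tol)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical curves: vm_a follows the shared resource curve, vm_b oscillates.
cluster = [100, 100, 180, 260, 250]
s_a = curve_similarity(cluster, [10, 10, 90, 170, 160])
s_b = curve_similarity(cluster, [20, 21, 20, 21, 20])
print(s_a, s_b)
```

Because only the direction of change is compared, the measure reflects curve shape and is insensitive to the absolute traffic level of each virtual machine.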

In summary, determining the abnormal information from different root cause positioning dimensions allows the results of multiple dimensions to be aggregated subsequently, so that the faulty host is located accurately, which effectively improves positioning accuracy and efficiency.

Step S208, determining a target virtual machine in the candidate virtual machines based on the abnormal information.

Specifically, after the abnormal information of the candidate virtual machines is determined, the target virtual machine can be determined among the candidate virtual machines by comparing their abnormal information, and the host to which the target virtual machine belongs is the faulty host. The device identifier of the faulty host is then sent to the operation and maintenance personnel. The faulty host is thus located accurately and efficiently, which makes troubleshooting easier for the operation and maintenance personnel.

Furthermore, considering that the abnormal information corresponds to each sub-root cause positioning dimension, the positioning may be completed in an integration manner, that is, the abnormal information of each sub-root cause positioning dimension may be integrated, and the target virtual machine may be determined in the candidate virtual machine according to an integration result.

Furthermore, when the abnormal information of multiple sub-root-cause positioning dimensions is integrated, the abnormal information of each dimension is represented differently, so it may first be converted into a numerical value and then combined by calculation, and the target virtual machine is screened according to the calculation result. In this embodiment, the specific implementation is as follows:

calculating an abnormal score corresponding to each sub-root-cause positioning dimension according to the abnormal information of each sub-root-cause positioning dimension; and integrating the abnormal score corresponding to each sub-root-cause positioning dimension, and determining a target virtual machine in the candidate virtual machines according to an integration result.

Specifically, the abnormal score is a score obtained by converting the abnormal information and represents the abnormal degree of a candidate virtual machine. Based on this, after the abnormal information of the candidate virtual machine in each sub-root-cause positioning dimension is obtained, the abnormal score corresponding to each sub-root-cause positioning dimension can be calculated from that abnormal information. The abnormal scores of the sub-root-cause positioning dimensions are then integrated, and the target virtual machine is determined among the candidate virtual machines according to the integration result, so that the host to which the target virtual machine belongs is located as the faulty host.

That is, after the abnormal information of each candidate virtual machine is obtained in the predicted root cause positioning dimension, the comparative root cause positioning dimension, and the similar root cause positioning dimension, the abnormal score of each candidate virtual machine can be calculated according to formula (3), in which λ controls the relative proportion of the anomaly amplitude and the shape similarity. The candidate virtual machines are then sorted by abnormal score from high to low, and H candidate virtual machines are selected as target virtual machines, so that the faulty host is determined from the target virtual machines.

Continuing the above example, after the prediction deviation, the anomaly amplitude, and the curve similarity are obtained, the abnormal score of each candidate virtual machine can be calculated by combining its prediction deviation, anomaly amplitude, and curve similarity. The candidate virtual machines are then sorted by abnormal score, and o virtual machines are selected from the m candidate virtual machines as target virtual machines according to the sorting result. The hosts to which the o virtual machines belong are taken as faulty hosts, and the host information is sent to the operation and maintenance personnel so that they can inspect and remove the fault.
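The integration step can be illustrated as follows. The scoring formula of the embodiment is not reproduced in this text, so the combination below is an assumed form, not the patent's formula: the prediction deviation is normalized by the largest deviation among the candidates and multiplied by a λ-weighted mix of the anomaly amplitude and the curve similarity. The function name `rank_candidates` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_candidates(metrics: Dict[str, Tuple[float, float, float]],
                    lam: float, h: int) -> List[Tuple[str, float]]:
    """Score each candidate from (d_i, a_i, s_i) and return the top h.

    Assumed score: (d_i / max d) * (lam * a_i + (1 - lam) * s_i).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Normalize the prediction deviation; fall back to 1.0 if all deviations are 0.
    max_d = max(d for d, _, _ in metrics.values()) or 1.0
    scores = {vm: (d / max_d) * (lam * a + (1.0 - lam) * s)
              for vm, (d, a, s) in metrics.items()}
    # Sort from high to low score; ties are broken by VM identifier.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:h]


# Hypothetical (prediction deviation, anomaly amplitude, curve similarity) per candidate.
metrics = {
    "vm-2": (100.0, 0.8, 0.9),
    "vm-4": (50.0, 0.4, 0.6),
    "vm-1": (10.0, 0.0, 0.2),
}
targets = rank_candidates(metrics, lam=0.5, h=1)
print(targets)
```

With `lam` close to 1 the ranking is driven by the anomaly amplitude, and with `lam` close to 0 by the shape similarity.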

Referring to fig. 4, the root cause positioning method provided in this specification mainly includes three stages: curve filtering, data acquisition, and root cause positioning. In the curve filtering stage, the abnormal amount of the shared resource is calculated and the curve metrics are aggregated to complete coarse screening; this stage is carried out on the storage nodes. In the data acquisition stage, data is pulled from the storage nodes to the computing nodes in a multi-threaded, multi-batch manner. In the root cause positioning stage, the computing nodes complete root cause positioning by calculating the absolute deviation and the relative deviation. A storage node is a node that stores traffic information and data, and a computing node is a node that performs troubleshooting.

To locate a faulty virtual machine efficiently and accurately, the root cause positioning method provided by this specification first acquires the traffic change information of the virtual machines in a virtual machine cluster at the fault time point and uses it to preliminarily exclude non-faulty virtual machines, thereby selecting candidate virtual machines in the cluster; this reduces the computing resources that would be consumed by checking every virtual machine in the cluster one by one. To ensure positioning accuracy, the historical data of the candidate virtual machines associated with the fault time point is then loaded. After the historical data is loaded, the abnormal information of each candidate virtual machine in the preset root cause positioning dimension is determined from the historical data, so that whether the candidate has a fault can be analyzed, the target virtual machine can be determined among the candidates according to the abnormal information, and the host to which the target virtual machine belongs can be located for troubleshooting. High accuracy is thus achieved together with high efficiency, which greatly reduces the time operation and maintenance personnel spend on troubleshooting and effectively improves troubleshooting efficiency.

The following describes the root cause positioning method further by taking an application of the root cause positioning method provided in this specification in a shared traffic resource scenario as an example, with reference to fig. 5. Fig. 5 is a flowchart illustrating a processing procedure of a root cause location method according to an embodiment of the present disclosure, and specifically includes the following steps.

Step S502, determining a virtual machine cluster corresponding to the shared resource dimension, and reading the shared resource flow information of the virtual machine cluster.

Step S504, according to the shared resource flow information, a cluster flow change value corresponding to at least one set time point of the virtual machine cluster is calculated.

Step S506, determining a failure time point in at least one set time point according to the cluster flow change value, and acquiring flow change information of the virtual machines in the virtual machine cluster at the failure time point.

Specifically, a normal time interval and an abnormal time interval are determined according to a fault time point; acquiring normal flow change information of the virtual machine corresponding to a normal time interval and abnormal flow change information of the virtual machine corresponding to an abnormal time interval; and calculating a flow change value according to the normal flow change information and the abnormal flow change information to be used as the flow change information corresponding to the virtual machine.

Further, an information reading request is created according to the normal time interval and the abnormal time interval, and the information reading request is sent to a log server corresponding to the virtual machine cluster; receiving initial flow change information fed back by a log server according to an information reading request; and determining normal flow change information and abnormal flow change information in the initial flow change information.
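As a non-limiting illustration of how a flow change value may be derived from the normal and abnormal time intervals, the sketch below assumes that both intervals have the same length, that they lie immediately before and after the failure time point, and that the change value is the difference of the interval means. None of these choices is prescribed by this specification, and the names `split_intervals`, `flow_change_value`, and `window` are hypothetical.

```python
from statistics import mean


def split_intervals(samples, fault_time, window):
    """Split (timestamp, traffic) samples into normal and abnormal values.

    Assumption: the normal interval is [fault_time - window, fault_time) and
    the abnormal interval is [fault_time, fault_time + window).
    """
    normal = [v for t, v in samples if fault_time - window <= t < fault_time]
    abnormal = [v for t, v in samples if fault_time <= t < fault_time + window]
    return normal, abnormal


def flow_change_value(samples, fault_time, window):
    """Difference between mean abnormal traffic and mean normal traffic."""
    normal, abnormal = split_intervals(samples, fault_time, window)
    if not normal or not abnormal:
        return 0.0  # not enough data to measure a change
    return mean(abnormal) - mean(normal)
```

In this sketch, `samples` stands for the initial flow change information fed back by the log server for one virtual machine.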

Step S508, calculating a ratio of the traffic variation value corresponding to each virtual machine to the cluster traffic variation value.

Step S510, sorting the virtual machines in the virtual machine cluster according to the calculation result, and obtaining a virtual machine queue.

In step S512, a set number of virtual machines are selected from the virtual machine queue as candidate virtual machines.
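Steps S508 to S512 can be sketched as follows, as a non-limiting illustration. The sketch assumes that the queue is sorted in descending order of the ratio and that the set number is passed in as `top_k`; the sorting direction and all names are assumptions that this specification does not fix.

```python
from typing import Dict, List


def select_candidates(vm_changes: Dict[str, float],
                      cluster_change: float,
                      top_k: int) -> List[str]:
    """Rank virtual machines by their share of the cluster flow change.

    vm_changes maps a virtual machine identifier to its flow change value.
    """
    if cluster_change == 0:
        return []  # no cluster-level change, so nothing to attribute
    ratios = {vm: change / cluster_change for vm, change in vm_changes.items()}
    # Virtual machine queue, largest contribution first (assumed order).
    queue = sorted(ratios, key=ratios.get, reverse=True)
    return queue[:top_k]
```

Because the ratio divides by the cluster flow change value, a virtual machine whose traffic moves in the same direction as the cluster obtains a positive ratio whether the fault is a surge or a drop.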

Step S514, a data loading request is created according to the failure time point, and the data loading request is sent to a database corresponding to the candidate virtual machine.

Step S516, starting at least two threads in response to the data loading request, and receiving, by the at least two threads, historical normal data and historical abnormal data fed back by the database in response to the data loading request as historical data.
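As a non-limiting illustration of loading historical data with at least two threads, the sketch below uses the thread pool of the Python standard library. The callable `fetch` is a hypothetical stand-in for the database client; this specification does not name a database or a client interface.

```python
from concurrent.futures import ThreadPoolExecutor


def load_history(vm_ids, fetch, fault_time, workers=4):
    """Load historical data for each candidate virtual machine in parallel.

    fetch(vm_id, fault_time) is a hypothetical callable that returns the
    historical records of one virtual machine around the failure time point.
    """
    vm_ids = list(vm_ids)
    workers = max(2, workers)  # the method starts at least two threads
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as vm_ids.
        results = list(pool.map(lambda vm: fetch(vm, fault_time), vm_ids))
    return dict(zip(vm_ids, results))
```

Loading is typically dominated by waiting on the database, so threads are a reasonable fit here even in an interpreter with a global lock.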

Step S518, dividing the historical data according to the fault time point to obtain historical normal data and historical abnormal data.

Step S520, at least one sub-root cause positioning dimension is determined in the preset root cause positioning dimensions, and abnormal information of each sub-root cause positioning dimension is calculated according to historical normal data and historical abnormal data.

Specifically, when the sub-root cause location dimension is the prediction root cause location dimension, determining the abnormal information of the prediction root cause location dimension includes: inputting historical normal data into a pre-trained flow prediction model to obtain flow expectation information, and determining first flow real information according to historical abnormal data; and calculating deviation information between the flow expected information and the first flow real information to serve as abnormal information of the candidate virtual machine in the prediction root cause positioning dimension.
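As a non-limiting illustration of the prediction root cause positioning dimension, the sketch below compares expected traffic with real traffic. The function `naive_forecast` is only a stand-in that repeats the historical mean; it is not the pre-trained flow prediction model of this specification, whose structure is not described here. Taking the mean absolute difference as the deviation information is likewise an assumption.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Stand-in predictor (assumption): repeat the mean of the normal data."""
    level = mean(history)
    return [level] * horizon


def prediction_deviation(normal, abnormal, predict=naive_forecast):
    """Mean absolute gap between expected traffic and real traffic.

    normal   -- historical normal data, fed to the predictor
    abnormal -- historical abnormal data, used as the first flow real information
    predict  -- any callable (history, horizon) -> list of expected values
    """
    expected = predict(normal, len(abnormal))
    gaps = [abs(real - exp) for real, exp in zip(abnormal, expected)]
    return mean(gaps)
```

A trained model can be supplied through the `predict` parameter without changing the deviation calculation.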

Under the condition that the sub-root cause positioning dimension is the comparative root cause positioning dimension, determining abnormal information of the comparative root cause positioning dimension, wherein the determining comprises the following steps: generating an abnormal condition according to the historical normal data, and determining second flow real information according to the historical abnormal data; and detecting the second flow real information according to the abnormal condition, and determining the abnormal information of the candidate virtual machines in the comparative root cause positioning dimension according to the detection result.
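As a non-limiting illustration of the comparative root cause positioning dimension, the sketch below assumes that the abnormal condition is a band of `k` standard deviations around the mean of the historical normal data, and that the abnormal information is the fraction of real traffic points falling outside that band. Both the form of the condition and the names are assumptions.

```python
from statistics import mean, pstdev


def build_condition(normal, k=3.0):
    """Generate an abnormal condition from the historical normal data.

    Assumption: a point is abnormal when it lies outside mean +/- k * stdev.
    """
    mu, sigma = mean(normal), pstdev(normal)
    lower, upper = mu - k * sigma, mu + k * sigma
    return lambda x: x < lower or x > upper


def comparison_anomaly(normal, abnormal, k=3.0):
    """Fraction of the second flow real information that meets the condition."""
    if not abnormal:
        return 0.0
    is_abnormal = build_condition(normal, k)
    hits = sum(1 for x in abnormal if is_abnormal(x))
    return hits / len(abnormal)
```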

Under the condition that the sub-root cause positioning dimension is the similar root cause positioning dimension, determining the abnormal information of the similar root cause positioning dimension comprises the following steps: acquiring a first resource change curve corresponding to the virtual machine cluster, and generating a second resource change curve corresponding to the candidate virtual machine according to the historical normal data and the historical abnormal data; and calculating the curve similarity between the first resource change curve and the second resource change curve to serve as the abnormal information of the candidate virtual machine in the similar root cause positioning dimension.
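As a non-limiting illustration of the similar root cause positioning dimension, the sketch below measures curve similarity with the Pearson correlation coefficient. This specification does not state which similarity measure is used, so the choice of Pearson correlation is an assumption, as are the function and parameter names.

```python
from math import sqrt


def pearson_similarity(cluster_curve, vm_curve):
    """Pearson correlation between two equally sampled resource change curves.

    cluster_curve -- first resource change curve (virtual machine cluster)
    vm_curve      -- second resource change curve (candidate virtual machine),
                     e.g. historical normal data followed by abnormal data
    """
    n = len(cluster_curve)
    if n != len(vm_curve) or n < 2:
        raise ValueError("curves must have the same length of at least 2")
    mean_c = sum(cluster_curve) / n
    mean_v = sum(vm_curve) / n
    cov = sum((c - mean_c) * (v - mean_v) for c, v in zip(cluster_curve, vm_curve))
    var_c = sum((c - mean_c) ** 2 for c in cluster_curve)
    var_v = sum((v - mean_v) ** 2 for v in vm_curve)
    if var_c == 0 or var_v == 0:
        return 0.0  # a flat curve carries no shape information
    return cov / sqrt(var_c * var_v)
```

A value close to 1 indicates that the candidate virtual machine changes in step with the shared resource, while a value close to 0 or below indicates that its traffic does not follow the cluster-level change.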

Step S522, calculating, according to the abnormal information of each sub-root cause positioning dimension, an abnormal score corresponding to each sub-root cause positioning dimension.

Step S524, integrating the abnormal scores corresponding to the sub-root cause positioning dimensions, and determining a target virtual machine from the candidate virtual machines according to an integration result.
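Steps S522 and S524 can be sketched as follows, as a non-limiting illustration. The sketch assumes that the abnormal score of each dimension is obtained by min-max normalization across the candidate virtual machines and that the integration is a weighted sum; neither the scoring rule nor the integration rule is fixed by this specification, and all names are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pick_target(anomaly_info, weights=None):
    """Integrate per-dimension abnormal scores and pick the target machine.

    anomaly_info -- {dimension name: {vm id: raw abnormal information}}
    weights      -- optional {dimension name: weight}, default weight 1.0
    """
    vms = sorted(next(iter(anomaly_info.values())))
    totals = {vm: 0.0 for vm in vms}
    for dimension, per_vm in anomaly_info.items():
        weight = 1.0 if weights is None else weights.get(dimension, 1.0)
        scores = min_max([per_vm[vm] for vm in vms])  # abnormal scores
        for vm, score in zip(vms, scores):
            totals[vm] += weight * score
    target = max(totals, key=totals.get)
    return target, totals
```

Normalizing each dimension before summation keeps a dimension with a large numeric range, such as a deviation in bytes, from dominating a dimension bounded by 1, such as a similarity.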

In summary, in order to locate a failed virtual machine efficiently and accurately, flow change information of the virtual machines in a virtual machine cluster at a failure time point may be obtained first, and non-failed virtual machines are preliminarily excluded according to the flow change information, so that candidate virtual machines are selected from the virtual machine cluster. To ensure positioning accuracy, historical data of the candidate virtual machines associated with the failure time point may then be loaded; because only the candidate virtual machines are examined, the computing resources that would be consumed by checking every virtual machine in the cluster one by one are saved. After the historical data is loaded, abnormal information of the candidate virtual machines in the preset root cause positioning dimension can be determined according to the historical data, whether each candidate virtual machine has a fault can be analyzed according to the abnormal information, and the target virtual machine can be determined among the candidate virtual machines, so that the host of the target virtual machine can be located for fault elimination. High precision is thus achieved together with high efficiency, which greatly reduces the time operation and maintenance personnel spend on troubleshooting and effectively improves troubleshooting efficiency.

Corresponding to the above method embodiment, the present specification further provides an embodiment of a root cause positioning device, and fig. 6 shows a schematic structural diagram of a root cause positioning device provided in an embodiment of the present specification. As shown in fig. 6, the apparatus includes:

an information obtaining module 602, configured to obtain traffic change information of virtual machines in a virtual machine cluster at a failure time point;

a data loading module 604, configured to screen out candidate virtual machines from the virtual machine cluster according to the flow change information, and load historical data of the candidate virtual machines associated with the failure time point;

an information determining module 606 configured to determine, according to the historical data, abnormal information of the candidate virtual machine in a preset root cause positioning dimension;

a virtual machine determining module 608, configured to determine a target virtual machine among the candidate virtual machines based on the abnormal information.

In an optional embodiment, the information obtaining module 602 is further configured to:

determining the virtual machine cluster corresponding to the shared resource dimension, and reading the shared resource flow information of the virtual machine cluster; calculating a cluster flow change value corresponding to at least one set time point of the virtual machine cluster according to the shared resource flow information; and determining the fault time point in the at least one set time point according to the cluster flow change value, and acquiring flow change information of the virtual machines in the virtual machine cluster at the fault time point.
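As a non-limiting illustration of determining the fault time point from the cluster flow change values, the sketch below assumes that the change value at a set time point is the difference from the previous sample, and that the fault time point is the set time point with the largest absolute change, provided that it reaches a threshold. The difference rule, the `threshold` parameter, and the function names are assumptions that this specification does not fix.

```python
def cluster_changes(series):
    """Cluster flow change value at each set time point.

    series is a time-ordered list of (timestamp, cluster traffic) pairs taken
    from the shared resource flow information.
    """
    return [(t1, v1 - v0) for (_, v0), (t1, v1) in zip(series, series[1:])]


def find_fault_time(series, threshold):
    """Return the time point with the largest change, or None if below threshold."""
    changes = cluster_changes(series)
    if not changes:
        return None
    time_point, delta = max(changes, key=lambda item: abs(item[1]))
    return time_point if abs(delta) >= threshold else None
```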

In an optional embodiment, the data loading module 604 is further configured to:

creating a data loading request according to the fault time point, and sending the data loading request to a database corresponding to the candidate virtual machine; starting at least two threads in response to the data loading request, and receiving historical normal data and historical abnormal data fed back by the database for the data loading request through the at least two threads as the historical data.

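In an optional embodiment, the information determining module 606 is further configured to:

dividing the historical data according to the fault time point to obtain historical normal data and historical abnormal data; determining at least one sub-root cause positioning dimension in the preset root cause positioning dimensions, and calculating abnormal information of each sub-root cause positioning dimension according to the historical normal data and the historical abnormal data;

accordingly, the virtual machine determining module 608 is further configured to: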

and integrating the abnormal information of each sub-root cause positioning dimension, and determining a target virtual machine in the candidate virtual machines according to an integration result.

In an optional embodiment, in a case that the sub-root cause positioning dimension is the prediction root cause positioning dimension, determining abnormal information of the prediction root cause positioning dimension includes:

inputting the historical normal data into a pre-trained flow prediction model to obtain flow expectation information, and determining first flow real information according to the historical abnormal data; and calculating deviation information between the flow expectation information and the first flow real information to serve as abnormal information of the candidate virtual machine in the prediction root cause positioning dimension.

In an optional embodiment, in a case that the sub-root cause positioning dimension is the comparative root cause positioning dimension, determining abnormal information of the comparative root cause positioning dimension includes:

generating an abnormal condition according to the historical normal data, and determining second flow real information according to the historical abnormal data; and detecting the second flow real information according to the abnormal condition, and determining the abnormal information of the candidate virtual machine in the comparative root cause positioning dimension according to the detection result.

In an optional embodiment, in a case that the sub-root cause positioning dimension is the similar root cause positioning dimension, determining the abnormal information of the similar root cause positioning dimension includes:

acquiring a first resource change curve corresponding to the virtual machine cluster, and generating a second resource change curve corresponding to the candidate virtual machine according to the historical normal data and the historical abnormal data; and calculating the curve similarity between the first resource change curve and the second resource change curve to serve as the abnormal information of the candidate virtual machine in the similar root cause positioning dimension.

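In an optional embodiment, the virtual machine determining module 608 is further configured to:

calculating an abnormal score corresponding to each sub-root cause positioning dimension according to the abnormal information of each sub-root cause positioning dimension; and integrating the abnormal scores corresponding to the sub-root cause positioning dimensions, and determining a target virtual machine among the candidate virtual machines according to an integration result.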

In order to locate a failed virtual machine efficiently and accurately, the root cause positioning device provided by this specification may first obtain flow change information of the virtual machines in a virtual machine cluster at a failure time point and preliminarily exclude non-failed virtual machines according to the flow change information, thereby selecting candidate virtual machines from the virtual machine cluster. To ensure positioning accuracy, historical data of the candidate virtual machines associated with the failure time point may then be loaded; because only the candidate virtual machines are examined, the computing resources that would be consumed by checking every virtual machine in the cluster one by one are saved. After the historical data is loaded, abnormal information of the candidate virtual machines in the preset root cause positioning dimension can be determined according to the historical data, whether each candidate virtual machine has a fault can be analyzed according to the abnormal information, and the target virtual machine can be determined among the candidate virtual machines, so that the host of the target virtual machine can be located for fault elimination. High precision is thus achieved together with high efficiency, which greatly reduces the time operation and maintenance personnel spend on troubleshooting and effectively improves troubleshooting efficiency.

The above is a schematic solution of the root cause positioning device according to the present embodiment. It should be noted that the technical solution of the root cause positioning device and the technical solution of the root cause positioning method belong to the same concept, and for details that are not described in detail in the technical solution of the root cause positioning device, reference may be made to the description of the technical solution of the root cause positioning method.

Corresponding to the above method embodiment, the present specification further provides an embodiment of a root cause positioning system, and fig. 7 shows a schematic structural diagram of a root cause positioning system provided by an embodiment of the present specification. As shown in fig. 7, the root cause location system 700 includes a storage node 710 and a compute node 720;

the storage node 710 is configured to obtain traffic change information of the virtual machines in the virtual machine cluster at a failure time point; screen candidate virtual machines from the virtual machine cluster according to the flow change information; read historical data of the candidate virtual machines associated with the failure time point; and send the historical data to the computing node 720;

the computing node 720 is configured to receive the historical data; determine, according to the historical data, abnormal information of the candidate virtual machines in a preset root cause positioning dimension; and determine a target virtual machine among the candidate virtual machines based on the abnormal information.

The above is a schematic solution of the root cause positioning system of the present embodiment. It should be noted that the technical solution of the root cause positioning system and the technical solution of the root cause positioning method belong to the same concept, and for details that are not described in detail in the technical solution of the root cause positioning system, reference may be made to the description of the technical solution of the root cause positioning method.

Fig. 8 is a block diagram illustrating a structure of a server 800 according to an embodiment of the present disclosure. The components of the server 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.

Server 800 also includes an access device 840 that enables server 800 to communicate via one or more networks 860. Examples of such networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 840 may include one or more of any type of network interface (e.g., a network interface controller), whether wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present application, the above components of the server 800 and other components not shown in fig. 8 may also be connected to each other, for example, through a bus. It should be understood that the block diagram of the server architecture shown in fig. 8 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.

The server 800 may be any type of stationary or mobile server, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smart glasses, etc.), or another type of mobile device, or a stationary server such as a desktop computer or personal computer (PC). The server 800 may also be a mobile or stationary server.

Wherein the processor 820 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the root cause positioning method described above.

The above is an illustrative scheme of a server of the present embodiment. It should be noted that the technical solution of the server and the technical solution of the root cause positioning method belong to the same concept, and details that are not described in detail in the technical solution of the server can be referred to the description of the technical solution of the root cause positioning method.

An embodiment of the present specification also provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the above-mentioned root cause positioning method.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the root cause positioning method, and for details that are not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the root cause positioning method.

An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the root cause positioning method.

The above is a schematic scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program belongs to the same concept as the technical solution of the root cause positioning method, and for details that are not described in detail in the technical solution of the computer program, reference may be made to the description of the technical solution of the root cause positioning method.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of combinations of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the embodiments. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. The optional embodiments do not describe every detail exhaustively, nor do they limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, and thereby to enable others skilled in the art to best understand and utilize them. The specification is limited only by the claims and their full scope and equivalents.

Claims

1. A method of root cause location, comprising:

2. The method of claim 1, wherein the obtaining of the traffic change information of the virtual machines in the virtual machine cluster at the failure time point comprises:

determining the virtual machine cluster corresponding to the shared resource dimension, and reading the shared resource flow information of the virtual machine cluster;

calculating a cluster flow change value corresponding to at least one set time point of the virtual machine cluster according to the shared resource flow information;

and determining the fault time point in the at least one set time point according to the cluster flow change value, and acquiring flow change information of the virtual machines in the virtual machine cluster at the fault time point.

3. The method of claim 2, wherein the determining of the traffic change information of any virtual machine in the virtual machine cluster comprises:

determining a normal time interval and an abnormal time interval according to the fault time point;

acquiring normal flow change information of the virtual machine corresponding to the normal time interval and abnormal flow change information corresponding to the abnormal time interval;

and calculating a flow change value according to the normal flow change information and the abnormal flow change information, and using the flow change value as the flow change information corresponding to the virtual machine.

4. The method of claim 3, wherein the screening of candidate virtual machines in the virtual machine cluster according to the traffic change information comprises:

calculating the ratio of the flow change value corresponding to each virtual machine to the cluster flow change value;

sorting the virtual machines in the virtual machine cluster according to the calculation result to obtain a virtual machine queue;

and selecting a set number of virtual machines in the virtual machine queue as the candidate virtual machines.

5. The method according to claim 3, wherein the acquiring of the normal traffic change information of the virtual machine corresponding to the normal time interval and the abnormal traffic change information corresponding to the abnormal time interval includes:

creating an information reading request according to the normal time interval and the abnormal time interval, and sending the information reading request to a log server corresponding to the virtual machine cluster;

receiving initial flow change information fed back by the log server aiming at the information reading request;

and determining the normal flow change information and the abnormal flow change information in the initial flow change information.

6. The method of claim 1, wherein the loading of historical data of the candidate virtual machine associated with the failure time point comprises:

creating a data loading request according to the fault time point, and sending the data loading request to a database corresponding to the candidate virtual machine;

starting at least two threads in response to the data loading request, and receiving historical normal data and historical abnormal data fed back by the database aiming at the data loading request through the at least two threads as the historical data.

7. The method of claim 1, wherein the determining the abnormal information of the candidate virtual machine in a preset root cause positioning dimension according to the historical data comprises:

dividing the historical data according to the fault time point to obtain historical normal data and historical abnormal data;

determining at least one sub-root cause positioning dimension in preset root cause positioning dimensions, and calculating abnormal information of each sub-root cause positioning dimension according to the historical normal data and the historical abnormal data;

correspondingly, the determining a target virtual machine in the candidate virtual machines based on the exception information includes:

8. The method according to claim 7, wherein, in the case that the sub-root cause positioning dimension is a prediction root cause positioning dimension, determining abnormal information of the prediction root cause positioning dimension comprises:

inputting the historical normal data into a pre-trained flow prediction model to obtain flow expectation information, and determining first flow real information according to the historical abnormal data;

and calculating deviation information between the flow expected information and the first flow real information to serve as abnormal information of the candidate virtual machine in the prediction root cause positioning dimension.

9. The method of claim 7, wherein, in the case that the sub-root cause positioning dimension is a comparative root cause positioning dimension, determining the abnormal information of the comparative root cause positioning dimension comprises:

generating an abnormal condition according to the historical normal data, and determining second flow real information according to the historical abnormal data;

and detecting the second flow real information according to the abnormal condition, and determining the abnormal information of the candidate virtual machine in the comparative root cause positioning dimension according to the detection result.

10. The method according to claim 7, wherein, in the case that the sub-root cause positioning dimension is a similar root cause positioning dimension, determining the abnormal information of the similar root cause positioning dimension comprises:

acquiring a first resource change curve corresponding to the virtual machine cluster, and generating a second resource change curve corresponding to the candidate virtual machine according to the historical normal data and the historical abnormal data;

and calculating the curve similarity between the first resource change curve and the second resource change curve to serve as the abnormal information of the candidate virtual machine in the similar root cause positioning dimension.

11. The method according to any one of claims 7 to 10, wherein the integrating of the abnormal information of each sub-root cause positioning dimension, and the determining of the target virtual machine among the candidate virtual machines according to the integration result, comprises:

calculating an abnormal score corresponding to each sub-root cause positioning dimension according to the abnormal information of each sub-root cause positioning dimension;

and integrating the abnormal scores corresponding to the sub-root cause positioning dimensions, and determining a target virtual machine among the candidate virtual machines according to an integration result.

12. A root cause location system, comprising:

13. A server, comprising:

a memory and a processor;

the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions, which when executed by the processor, implement the steps of the method of any one of claims 1 to 11.

14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 11.