WO2020119627A1

WO2020119627A1 - Abnormality detection and positioning method and apparatus applied to distributed container cloud platform

Info

Publication number: WO2020119627A1
Application number: PCT/CN2019/123989
Authority: WO
Inventors: 叶可江; 卢澄志; 须成忠
Original assignee: 深圳先进技术研究院
Priority date: 2018-12-15
Filing date: 2019-12-09
Publication date: 2020-06-18
Also published as: CN109800052B; CN109800052A

Abstract

An abnormality detection and positioning method and apparatus applied to a distributed container cloud platform. The method first acquires TCP delay information of each container component (S101); analyzes the TCP delay information of each container component by means of a sliding-window cumulative-sum anomaly detection algorithm, obtains state information of each component, and generates component state information key-value pairs (S102); constructs a component abnormality subgraph from the component state information key-value pairs (S103); and locates the container component node where the abnormality occurs according to the component abnormality subgraph (S104). The method and the apparatus determine an abnormal state from TCP delay information, which reduces the overhead of data collection and improves the accuracy and timeliness of abnormal state determination. Furthermore, in consideration of interference between components and between a physical machine and its components, a component abnormality subgraph is provided to express the propagation of an abnormal state, thereby improving the accuracy of abnormality positioning.

Description

Anomaly detection and positioning method and device applied to distributed container cloud platform

Technical field

The invention relates to the field of container cloud platforms, and in particular to an abnormality detection and positioning method and device applied to a distributed container cloud platform.

Background technique

As a new service delivery model, cloud computing has won the favor of both industry and academia. Its key technology is virtualization. By virtualizing all kinds of resources, cloud computing service providers can easily customize and deliver those resources to users, and many applications have gradually begun to migrate to cloud computing clusters. Traditional virtualization technologies include KVM and Xen. However, because traditional virtualization is heavyweight, creating, modifying, or migrating a component in an application cluster is very complicated, so cloud computing service providers need a more lightweight virtualization technology. Container technology is a lightweight, operating-system-level virtualization technology. Whereas traditional virtualization virtualizes the hardware layer, container virtualization stays at the operating system layer, which makes containers very convenient to create, modify, or migrate, and container technology has been adopted quickly by cloud computing service providers. Because of these characteristics, users often run each component in an independent container when deploying their applications so that the applications can be maintained conveniently and quickly, which results in a complicated internal structure of the container cloud. At the same time, the weak isolation of containers leads to serious interference between them: once an abnormality occurs in one container, it spreads quickly and affects different application components. Cloud service providers therefore need a method that can locate abnormalities in application clusters with complex structures built from containers.

Generally speaking, an application deployed on a container cloud is composed of hundreds or thousands of components, and the components depend on each other to form a complex graph with components as nodes. Graph theory can be used to locate the root cause of anomalies in this complex graph. A cloud computing platform based on container technology is usually composed of thousands of physical machines, each of which usually runs dozens of containers, so such a platform is more complicated than a traditional cloud computing platform. Compared with traditional virtual machines, containers are less well isolated, interference between them is more serious, and they are therefore more likely to affect each other. In addition, since a container is deployed on a running operating system, an abnormality of the physical machine also causes abnormalities in the containers deployed on it. Existing anomaly detection and positioning solutions lack analysis of the correlation between components and between components and physical machines. They also use performance index data for anomaly detection and positioning, which brings great storage and transmission overhead, so they cannot adapt well to a distributed container cloud platform environment with severe interference.

Nguyen et al., in Chapter 3 of "Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures", proposed an online black-box anomaly localization system to locate abnormal components. The system uses virtual machine performance indices to construct a normal fluctuation model of each index, identifies data points that change abnormally, and locates abnormal components by combining the timing of the changed data points with the dependencies between components. Although the system can detect and locate anomalies, it relies on performance indicators for anomaly detection and judgment, so for complex distributed container cloud platforms the overhead of monitoring performance indicators is huge.

Summary of the invention

The embodiments of the present invention provide an abnormality detection and positioning method and device applied to a distributed container cloud platform, to at least solve the technical problem that the traditional single-component-based abnormality detection method cannot be applied to a distributed container cloud.

According to an embodiment of the present invention, there is provided an anomaly detection and location method applied to a distributed container cloud platform, including the following steps:

Obtain the TCP delay information of each container component;

Analyze the TCP delay information of each container component through a sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs;

Construct a component abnormal subgraph from the component status information key-value pairs;

Locate the container component node where the abnormality occurs according to the component abnormal subgraph.

Further, analyzing the TCP delay information of each container component through the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs includes:

Initialize the sliding window $[L_0, L_k]$ of the component and input TCP delay information until the number of TCP delay samples in the sliding window reaches k; initialize the average value of the window and the cumulative sum $S_k = 0$; where $[L_0, L_k]$ is a queue for storing TCP delay information from 0 to k, and k is an integer with $0 < k < 60$;

Input the next TCP delay information $L_t$, insert $L_t$ into the sliding window, delete the oldest TCP delay information $L_{t-k}$ from the sliding window, calculate the average value in the window, and calculate the cumulative sum $S_t$; where $L_t$ is the TCP delay information at time t, and t is an integer with $t > k$;

Calculate the warning value $S_{diff} = S_{max} - S_{min}$, where $S_{max}, S_{min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum corresponding to the earliest TCP delay information;

Determine whether $S_{diff}$ lies within the normal threshold $[-h, h]$; if it does, determine that the status of the component is normal; otherwise, determine that the status of the component is abnormal;

Generate the component status information key-value pairs <CID:MID:Status> according to the status information of each component, where CID represents the number of the component, MID represents the number of the physical machine where the component is located, and Status represents the status of the component: the Status value is 1 when the component status is abnormal and 0 when it is normal.
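The detection step above can be summarized in a few recurrences. This is a sketch under stated assumptions rather than the formulas of the original filing: the symbol $\bar{L}_t$ for the window average is introduced here for illustration, and the update rule for $S_t$ is assumed to be the standard cumulative-sum form, since the text only states that $S_t$ is computed iteratively starting from $S_k = 0$.

```latex
\begin{align*}
\bar{L}_t &= \frac{1}{k}\sum_{i=t-k+1}^{t} L_i
  && \text{(average of the $k$ samples in the window)} \\
S_t &= S_{t-1} + \bigl(L_t - \bar{L}_t\bigr), \qquad S_k = 0
  && \text{(assumed update rule)} \\
S_{diff} &= \max_{t-k \le u \le t} S_u \;-\; \min_{t-k \le u \le t} S_u \\
\text{Status} &=
  \begin{cases}
    0 & \text{if } S_{diff} \in [-h, h] \\
    1 & \text{otherwise}
  \end{cases}
\end{align*}
```

Because $S_{diff}$ is a maximum minus a minimum over the same range, it is never negative, so under these definitions the threshold test reduces to checking $S_{diff} \le h$.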

Further, constructing the component abnormal subgraph from the component status information key-value pairs includes:

Input the component dependency graph G, whose matrix is expressed as $G = (E_{ij})$, where i and j denote components in the application cluster and $E_{ij}$ denotes the dependency between component i and component j: if component i depends on component j, $E_{ij}$ is 1; otherwise $E_{ij}$ is 0;

Traverse the component status information key-value pairs; when the Status value is 0, delete the rows and columns with i = CID or j = CID from the component dependency graph G; after the traversal, the component dependency subgraph G1 is obtained;

Determine whether there is an independent component node in the component dependency subgraph G1, an independent component node being one that does not depend on any other component node and that no other component node depends on; delete such component nodes to construct the component abnormal subgraph G'.
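The construction of G' described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_abnormal_subgraph` is hypothetical, components are assumed to be numbered 0..n-1 so that a CID can index the dependency matrix directly, and G' is returned as a node list plus an edge set rather than as a reduced matrix.

```python
def build_abnormal_subgraph(dep, status):
    """Return (nodes, edges) of G' from dependency matrix `dep` and a status map.

    dep[i][j] == 1 means component i depends on component j.
    status[i] == 1 means component i is abnormal, 0 means normal.
    """
    n = len(dep)
    # Step 1: drop rows and columns of normal components (Status == 0) -> G1.
    abnormal = [i for i in range(n) if status[i] == 1]
    edges = {(i, j) for i in abnormal for j in abnormal if dep[i][j] == 1}
    # Step 2: drop independent nodes (no incoming and no outgoing edge) -> G'.
    connected = {node for edge in edges for node in edge}
    nodes = [i for i in abnormal if i in connected]
    return nodes, edges


# Component 0 depends on 1, and 1 depends on 2; component 3 has no dependencies.
dep = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
status = {0: 1, 1: 1, 2: 0, 3: 1}
nodes, edges = build_abnormal_subgraph(dep, status)
```

In this example component 2 is removed because it is normal, and component 3 is removed because, although abnormal, it has no dependency edge to or from another abnormal component.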

Further, locating the container component node where the abnormality occurs according to the component abnormality subgraph includes:

Traverse the component abnormal subgraph G' and calculate $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node i is a root node of the abnormality.
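The quantity δ_i is the number of abnormal components that component i depends on within G'. A node with δ_i = 0 depends on no other abnormal component, so its abnormality cannot be explained by propagation along a dependency. A minimal sketch, assuming G' is given as a node list and a set of (i, j) edges whose endpoints are all in that list (the function name `find_root_nodes` is hypothetical):

```python
def find_root_nodes(nodes, edges):
    """Return the nodes of G' whose out-degree delta_i is zero."""
    delta = {i: 0 for i in nodes}
    for i, _j in edges:
        delta[i] += 1  # each E_ij = 1 adds one to delta_i
    return [i for i in nodes if delta[i] == 0]


# Chain 0 -> 1 -> 2 (0 depends on 1, 1 depends on 2): only node 2 has delta = 0.
roots = find_root_nodes([0, 1, 2], {(0, 1), (1, 2)})
```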

Further, after locating the container component node where the abnormality occurs according to the component abnormality subgraph, the method further includes:

Determine whether the MIDs of the abnormal root nodes are the same. If they are the same, determine that the physical machine with that MID is abnormal.
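The machine-level check can be sketched as a comparison of the MIDs of all root nodes. The function name `abnormal_machine` and the `machine_of` mapping from CID to MID are illustrative assumptions.

```python
def abnormal_machine(root_nodes, machine_of):
    """Return the shared MID if all abnormal root nodes share one, else None."""
    mids = {machine_of[cid] for cid in root_nodes}
    return mids.pop() if len(mids) == 1 else None


same = abnormal_machine([2, 5], {2: "m1", 5: "m1"})
mixed = abnormal_machine([2, 5], {2: "m1", 5: "m2"})
```

Note that with a single root node the MIDs are trivially the same, so this sketch reports that node's machine; a stricter variant might require at least two root nodes before blaming the physical machine.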

Further, obtaining TCP delay information of each container component includes:

Use the tcprstat software to collect the TCP delay information of each component.
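tcprstat prints one row of response-time statistics per sampling interval. A minimal sketch of turning such rows into the delay series used by the detection step might look as follows. The function name `parse_tcprstat_avg`, the whitespace-separated layout, and the position of the average in the fifth column are assumptions for illustration; they should be checked against the output format actually configured.

```python
def parse_tcprstat_avg(lines, avg_column=4):
    """Extract the average response time from tcprstat-style output lines."""
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) <= avg_column:
            continue  # blank or truncated line
        try:
            samples.append(float(fields[avg_column]))
        except ValueError:
            continue  # header row or other non-numeric field
    return samples


# Illustrative rows (assumed layout: timestamp count max min avg med stddev).
rows = [
    "timestamp\tcount\tmax\tmin\tavg\tmed\tstddev",
    "1283261499\t1870\t559009\t39\t883\t153\t13306",
    "",
]
delays = parse_tcprstat_avg(rows)
```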

According to another embodiment of the present invention, there is provided an anomaly detection and positioning device applied to a distributed container cloud platform, including:

The delay information obtaining unit is used to obtain TCP delay information of each container component;

The state information acquisition unit is used to analyze the TCP delay information of each container component through the sliding-window cumulative-sum anomaly detection algorithm, obtain the state information of each component, and generate component state information key-value pairs;

The component abnormal subgraph construction unit is used to construct the component abnormal subgraph from the component state information key-value pairs;

The abnormal location unit is used to locate the container component node where the abnormality occurs according to the component abnormal subgraph.

Further, the device further includes:

The abnormality determination unit is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine whose number is MID has an abnormality.

A storage medium stores a program file capable of implementing any of the above methods for anomaly detection and positioning applied to a distributed container cloud platform.

A processor is used to run a program, wherein, when the program is running, any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform is executed.

The abnormality detection and positioning method and device applied to the distributed container cloud platform in the embodiments of the present invention use TCP delay information for abnormal state judgment, reduce the overhead of data collection, and improve the accuracy and real-time nature of abnormal state judgment. At the same time, considering the interference between each component and between the physical machine and the component, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.

Brief description of the drawings

The drawings described herein are used to provide a further understanding of the present invention and form a part of the present application. The schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an undue limitation on the present invention. In the drawings:

FIG. 1 is a flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;

FIG. 2 is a preferred flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;

FIG. 3 is a block diagram of an anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention;

FIG. 4 is a preferred block diagram of an anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention.

Detailed description

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terms "first" and "second" in the description and claims of the present invention and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way are interchangeable under appropriate circumstances so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions. For example, processes, methods, systems, products, or devices that contain a series of steps or units need not be limited to the steps or units clearly listed, but may include other steps or units that are not explicitly listed or that are inherent to these processes, methods, products, or devices.

With the development and maturity of container technology, cloud computing systems based on container technology, that is, container clouds, have gradually begun to replace traditional cloud computing systems based on virtual machines. Because containers are lightweight, they are more convenient to deploy, so the internal composition of a container cloud is more complicated than that of a traditional cloud computing platform. In addition, a container isolates system resources more weakly than a virtual machine does, so when multiple containers run on the same physical host, the interference between them is relatively strong. Therefore, once a container in the container cloud becomes abnormal, the abnormality spreads quickly and affects the entire cluster. Because of the complex internal environment of the container cloud, the traditional single-component-based anomaly detection method is no longer suitable for distributed container cloud environments. Existing technologies use performance indicators to analyze anomalies, which increases the cost of data collection, and they also require a normal fluctuation model to be constructed. For container cloud platforms that change frequently and are complex, their accuracy is low and they lack real-time capability.

The invention provides an abnormality detection and positioning method and device applied to a distributed container cloud platform. The method and the device can detect and locate abnormalities on a more complicated distributed container cloud platform, and improve the accuracy of abnormality location through the component abnormal subgraph.

Example 1

According to an embodiment of the present invention, there is provided an anomaly detection and positioning method applied to a distributed container cloud platform. Referring to FIG. 1, the method includes the following steps:

S101: Obtain TCP delay information of each container component;

S102: Analyze the TCP delay information of each container component through a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component, and generate component status information key-value pairs;

S103: Construct a component abnormal subgraph from the component status information key-value pairs;

S104: Locate the container component node where the abnormality occurs according to the component abnormality subgraph.

The method uses TCP delay information for abnormal state judgment, reduces the cost of data collection, and improves the accuracy and real-time nature of abnormal state judgment. At the same time, considering the interference between each component and between the physical machine and the component, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.

As a preferred technical solution, analyzing the TCP delay information of each container component through the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs includes:

Initialize the sliding window $[L_0, L_k]$ of the component and input TCP (Transmission Control Protocol) delay information until the number of TCP delay samples in the sliding window reaches k; initialize the average value of the window and the cumulative sum $S_k = 0$. These are initialization values: $S_k = S_{k-1} = \dots = S_0 = 0$ and $L_k = L_{k-1} = \dots = L_0 = 0$. Here $[L_0, L_k]$ is a queue for storing TCP delay information from 0 to k; the size of the queue is k, the value of k is an input parameter, k is an integer with $0 < k < 60$, and usually k is 10;

Input the next TCP delay information $L_t$, insert it into the sliding window, delete the oldest TCP delay information $L_{t-k}$ from the sliding window, calculate the average value in the window, and calculate the cumulative sum $S_t$. This is an iterative calculation: when t is k+1, $S_{t-1} = S_k = 0$. Here $L_t$ is the TCP delay information at time t, and t is an integer with $t > k$;

Calculate the warning value $S_{diff} = S_{max} - S_{min}$, where $S_{max}, S_{min} \in [S_{t-k}, S_t]$;

Determine whether $S_{diff}$ lies within the normal threshold $[-h, h]$; if it does, determine that the status of the component is normal; otherwise, determine that the status of the component is abnormal. Here h represents the acceptable range of $S_{diff}$ and is one of the input parameters.
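A per-component detector following these steps might be sketched as below. This is an illustration under stated assumptions, not the implementation of the filing: the class name `SlidingWindowCusum` and the default `h=50.0` are hypothetical, and the update rule for the cumulative sum (previous sum plus the deviation of the new sample from the window average) is assumed, since the text only says the sum is computed iteratively starting from zero.

```python
from collections import deque


class SlidingWindowCusum:
    """Sliding-window cumulative-sum detector for one component (illustrative)."""

    def __init__(self, k=10, h=50.0):
        self.k = k                                # window size, 0 < k < 60
        self.h = h                                # acceptable range of S_diff
        self.window = deque(maxlen=k)             # latest k TCP delay samples
        self.sums = deque([0.0], maxlen=k + 1)    # S_{t-k} .. S_t, starting at 0

    def update(self, latency):
        """Feed one TCP delay sample; return 1 (abnormal) or 0 (normal)."""
        warming_up = len(self.window) < self.k
        self.window.append(latency)               # the oldest sample drops out
        if warming_up:
            return 0                              # window not yet full; S stays 0
        mean = sum(self.window) / self.k
        # Assumed update rule: S_t = S_{t-1} + (L_t - mean of current window).
        self.sums.append(self.sums[-1] + (latency - mean))
        s_diff = max(self.sums) - min(self.sums)
        return 0 if -self.h <= s_diff <= self.h else 1


detector = SlidingWindowCusum(k=10, h=50.0)
steady = [detector.update(10.0) for _ in range(20)]   # constant delay
spike = detector.update(1000.0)                       # sudden delay increase
```

With a constant delay every deviation from the window average is zero, so the cumulative sums stay at zero and the status stays normal; a sudden jump in delay produces a large deviation and pushes the warning value past h.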

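As a preferred technical solution, constructing the component abnormal subgraph from the component status information key-value pairs includes:

Input the component dependency graph G, whose matrix is expressed as $G = (E_{ij})$, where i and j denote components in the application cluster and $E_{ij}$ denotes the dependency between component i and component j: if component i depends on component j, $E_{ij}$ is 1; otherwise $E_{ij}$ is 0;

Traverse the component status information key-value pairs; when the Status value is 0, delete the rows and columns with i = CID or j = CID from the component dependency graph G; after the traversal, the component dependency subgraph G1 is obtained;

Determine whether there is an independent component node in the component dependency subgraph G1, an independent component node being one that does not depend on any other component node and that no other component node depends on; delete such component nodes to construct the component abnormal subgraph G'.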

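As a preferred technical solution, locating the container component node where the abnormality occurs according to the component abnormality subgraph includes:

Traverse the component abnormal subgraph G' and calculate $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node i is a root node of the abnormality.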

As a preferred technical solution, referring to FIG. 2, after locating the container component node where the abnormality occurs according to the component abnormality subgraph, the method further includes:

S105: Determine whether the MIDs of the abnormal root nodes are the same. If they are the same, determine that the physical machine with the MID number is abnormal.

As a preferred technical solution, obtaining TCP delay information of each container component includes:

Use software tcprstat to collect TCP delay information of each component.

The following describes the method in detail with specific embodiments. An abnormality detection and positioning method applied to a distributed container cloud platform according to the present invention includes the following steps:

The service management program submits an abnormal location request to the service agent program;

After receiving the abnormal location request, the service agent uses the tcprstat software to collect the TCP delay information of the component. tcprstat is a free, open-source TCP-layer analysis tool that gathers statistics on request response times; it can be used for ad hoc analysis or run as a scheduled task for regular information collection;

The TCP delay information of the component collected by the service agent is analyzed through the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of the component and generate the component status information key-value pair <CID:MID:Status>;

The service agent submits the component status information key-value pair <CID:MID:Status> to the service management program;

After the service management program has collected all the component status information key-value pairs, it constructs the component abnormal subgraph G';

The service management program traverses the component abnormal subgraph G' and calculates $\delta_i = \sum_{j \in G'} E_{ij}$. If $\delta_i = 0$, component node i is an abnormal root node;

Determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it indicates that the physical machine with the MID number is abnormal.
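The exchange between the service management program and the service agents in the steps above can be sketched as follows. The class names `ServiceAgent` and `ServiceManager`, the `detect` callback, and the helper functions are hypothetical; the sketch also assumes that CIDs and MIDs contain none of the characters `<`, `>`, or `:`, so that the <CID:MID:Status> string can be split unambiguously.

```python
def encode_status(cid, mid, status):
    """Serialize one component state as '<CID:MID:Status>'."""
    return f"<{cid}:{mid}:{status}>"


def decode_status(pair):
    """Parse '<CID:MID:Status>' back into (cid, mid, status)."""
    cid, mid, status = pair.strip("<>").split(":")
    return cid, mid, int(status)


class ServiceAgent:
    """Serves the components of one physical machine (hypothetical class)."""

    def __init__(self, mid, detect):
        self.mid = mid
        self.detect = detect  # callable: cid -> 0 (normal) or 1 (abnormal)

    def handle_request(self, cids):
        """Answer an abnormal location request with one pair per component."""
        return [encode_status(cid, self.mid, self.detect(cid)) for cid in cids]


class ServiceManager:
    """Collects key-value pairs from all agents (hypothetical class)."""

    def __init__(self, agents):
        self.agents = agents  # list of (agent, component ids served by it)

    def collect(self):
        """Request states from every agent and index them by CID."""
        states = {}
        for agent, cids in self.agents:
            for pair in agent.handle_request(cids):
                cid, mid, status = decode_status(pair)
                states[cid] = (mid, status)
        return states


agent1 = ServiceAgent("m1", lambda cid: 1 if cid == "c2" else 0)
agent2 = ServiceAgent("m2", lambda cid: 0)
manager = ServiceManager([(agent1, ["c1", "c2"]), (agent2, ["c3"])])
states = manager.collect()
```

The resulting mapping from CID to (MID, Status) is the input needed for the subgraph construction, root location, and physical machine check.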

Example 2

According to another embodiment of the present invention, there is provided an anomaly detection and positioning device applied to a distributed container cloud platform, referring to FIG. 3, including:

The delay information obtaining unit 201 is used to obtain TCP delay information of each container component;

The state information obtaining unit 202 is configured to analyze the TCP delay information of each container component through a sliding-window cumulative-sum anomaly detection algorithm, obtain the state information of each component, and generate component state information key-value pairs;

The component abnormal subgraph construction unit 203 is configured to construct a component abnormal subgraph through key value pairs of component state information;

The abnormal location unit 204 is configured to locate the container component node where the abnormality occurs according to the component abnormal subgraph.

The abnormality detection and positioning device of the distributed container cloud platform adopts TCP delay information for abnormal state judgment, reduces the cost of data collection, and improves the accuracy and real-time nature of abnormal state judgment. At the same time, considering the interference between each component and between the physical machine and the component, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.

As a preferred technical solution, referring to FIG. 4, the device further includes:

The abnormality determination unit 205 is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine with the MID number is abnormal.

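Example 3

A storage medium stores a program file capable of implementing any of the above anomaly detection and positioning methods applied to a distributed container cloud platform.

Example 4

A processor is used to run a program, wherein, when the program runs, any of the above anomaly detection and positioning methods applied to a distributed container cloud platform is executed.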

The sequence numbers of the above embodiments of the present invention are for description only, and do not represent the advantages and disadvantages of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The system embodiments described above are only schematic. For example, the division of units may be a division of logical functions, and in actual implementation there may be another division manner: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection displayed or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage media include: USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk, optical disk, and other media that can store program code.

The above is only the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. An anomaly detection and positioning method applied to a distributed container cloud platform, characterized by comprising the following steps:

Obtain the TCP delay information of each container component;

Analyze the TCP delay information of each container component through a sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs;

Construct a component abnormal subgraph from the component status information key-value pairs;

Locate the container component node where the abnormality occurs according to the component abnormal subgraph.

2. The method according to claim 1, wherein analyzing the TCP delay information of each container component through the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs includes:

Initialize the sliding window $[L_0, L_k]$ of the component and input TCP delay information until the number of TCP delay samples in the sliding window reaches k; initialize the average value of the window and the cumulative sum $S_k = 0$; where $[L_0, L_k]$ is a queue for storing TCP delay information from 0 to k, and k is an integer with $0 < k < 60$;

Input the next TCP delay information $L_t$, insert $L_t$ into the sliding window, delete the oldest TCP delay information $L_{t-k}$ from the sliding window, calculate the average value in the window, and calculate the cumulative sum $S_t$; where $L_t$ is the TCP delay information at time t, and t is an integer with $t > k$;

Calculate the early warning value $S_{diff} = S_{max} - S_{min}$, where $S_{max}, S_{min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum corresponding to the earliest TCP delay information;

Determine whether $S_{diff}$ lies within the normal threshold $[-h, h]$; if it does, determine that the status of the component is normal; otherwise, determine that the status of the component is abnormal;

Generate the component status information key-value pairs <CID:MID:Status> according to the status information of each component, where CID represents the number of the component, MID represents the number of the physical machine where the component is located, and Status represents the status of the component: the Status value is 1 when the component status is abnormal and 0 when it is normal.

3. The method according to claim 2, wherein constructing the component abnormal subgraph from the component status information key-value pairs comprises:

Input the component dependency graph G, whose matrix is expressed as $G = (E_{ij})$, where i and j denote components in the application cluster and $E_{ij}$ denotes the dependency between component i and component j: if component i depends on component j, $E_{ij}$ is 1; otherwise $E_{ij}$ is 0;

Traverse the component status information key-value pairs; when the Status value is 0, delete the rows and columns with i = CID or j = CID from the component dependency graph G; after the traversal, the component dependency subgraph G1 is obtained;

Determine whether there is an independent component node in the component dependency subgraph G1, an independent component node being one that does not depend on any other component node and that no other component node depends on; delete such component nodes to construct the component abnormal subgraph G'.

4. The method according to claim 3, wherein locating the container component node where the abnormality occurs according to the component abnormal subgraph includes:

Traverse the component abnormal subgraph G' and calculate $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node i is a root node of the abnormality.

5. The method according to claim 4, wherein after locating the container component node where the abnormality occurs according to the component abnormal subgraph, the method further comprises:

Determine whether the MIDs of the abnormal root nodes are the same. If they are the same, determine that the physical machine with that MID is abnormal.

6. The method according to claim 1, wherein obtaining the TCP delay information of each container component includes:

Use the tcprstat software to collect the TCP delay information of each component.

7. An anomaly detection and positioning device applied to a distributed container cloud platform, characterized by comprising:

The delay information obtaining unit, which is used to obtain the TCP delay information of each container component;

The state information acquisition unit, which is used to analyze the TCP delay information of each container component through the sliding-window cumulative-sum anomaly detection algorithm, obtain the state information of each component, and generate component state information key-value pairs;

The component abnormal subgraph construction unit, which is used to construct the component abnormal subgraph from the component state information key-value pairs;

The abnormal location unit, which is used to locate the container component node where the abnormality occurs according to the component abnormal subgraph.

8. The device according to claim 7, wherein the device further comprises:

The abnormality determination unit, which is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine with that MID has an abnormality.

9. A storage medium, characterized in that the storage medium stores a program file capable of implementing the abnormality detection and positioning method applied to a distributed container cloud platform according to any one of claims 1 to 6.

10. A processor, characterized in that the processor is used to run a program, wherein, when the program runs, the abnormality detection and positioning method applied to a distributed container cloud platform according to any one of claims 1 to 6 is executed.