WO2020119627A1 - Abnormality detection and positioning method and apparatus applied to distributed container cloud platform - Google Patents

Abnormality detection and positioning method and apparatus applied to distributed container cloud platform

Info

Publication number
WO2020119627A1
Authority
WO
WIPO (PCT)
Prior art keywords
component
abnormal
container
delay information
status
Prior art date
Application number
PCT/CN2019/123989
Other languages
French (fr)
Chinese (zh)
Inventor
叶可江
卢澄志
须成忠
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020119627A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/22: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines

Definitions

  • the invention relates to the field of container cloud platforms, and in particular to an abnormality detection and positioning method and device applied to a distributed container cloud platform.
  • Cloud computing as a new type of service delivery method has won the favor of industry and academia.
  • the key technology of cloud computing is virtualization technology.
  • By virtualizing all kinds of resources, cloud computing service providers can easily customize and deliver them to users, and many applications have gradually begun to migrate to cloud computing clusters.
  • Traditional virtualization technologies include KVM, Xen, etc.
  • Container technology is a lightweight operating system-level virtualization technology. Compared with the traditional virtualization technology for the virtualization of the hardware layer, container virtualization stays at the operating system layer, making it very convenient to create, modify, or migrate.
  • Container technology was quickly adopted by cloud computing service providers. Because of these characteristics of containers, users often run each component in an independent container when deploying their applications, so that the applications can be maintained quickly and conveniently, which gives the container cloud a complicated internal structure. At the same time, the weak isolation of containers leads to serious interference between them. Once an abnormality occurs in a container, it spreads quickly and affects other application components. Cloud service providers therefore need a method that can locate anomalies in the structurally complex application clusters built from containers.
  • an application deployed on a container cloud is often composed of hundreds or thousands of components, and components depend on each other to form a complex graph with components as nodes. Utilizing the relevant knowledge of graph theory can locate the root cause of anomalies from this complex graph. That is, a cloud computing platform based on container technology is usually composed of thousands of physical machines, and each physical machine usually runs dozens of containers. Therefore, a cloud computing platform based on container technology is more complicated than a traditional cloud computing platform. Compared with traditional virtual machines, container isolation is worse, and the interference between containers is more serious. Therefore, compared to traditional virtual machines, containers are also more likely to affect each other.
  • Nguyen et al., in Chapter 3 of "Insight: in-situ online service failure path inference in production computing infrastructures", proposed an online black-box anomaly localization system to locate abnormal components.
  • the system uses the virtual machine performance index to construct a normal fluctuation model of the performance index, to determine abnormally changing data points, and to locate abnormal components by combining the time information of the changed data points and the dependencies between the components.
  • although the system can detect and locate anomalies, it relies on performance indicators for anomaly detection and judgment, so for complex distributed container cloud platforms the overhead of monitoring performance indicators will be very large.
  • the embodiments of the present invention provide an abnormality detection and positioning method and device applied to a distributed container cloud platform, to at least solve the technical problem that the traditional single-component-based abnormality detection method cannot be applied to a distributed container cloud.
  • an anomaly detection and location method applied to a distributed container cloud platform including the following steps:
  • the TCP delay information of each container component is analyzed to obtain the status information of each component and generate component status information key-value pairs;
  • TCP delay information of each container component is analyzed through sliding window accumulation and anomaly detection algorithms to obtain status information of each component and generate component status information key-value pairs including:
  • the component abnormal subgraph constructed by the component state information key value pair includes:
  • the independent component node is a component node that neither depends on any other component node nor is depended on by any other component node. Deleting such component nodes yields the component abnormal subgraph G'.
  • locating the container component node where the abnormality occurs according to the component abnormality subgraph includes:
  • the method further includes:
  • obtaining TCP delay information of each container component includes:
  • an anomaly detection and positioning device applied to a distributed container cloud platform including:
  • the delay information obtaining unit is used to obtain TCP delay information of each container component
  • the state information acquisition unit is used to analyze the TCP delay information of each container component through the sliding window accumulation and anomaly detection algorithm, obtain the state information of each component and generate component state information key value pairs;
  • Component abnormal subgraph construction unit which is used to construct component abnormal subgraph through key value pairs of component state information
  • the abnormal location unit is used to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
  • the device further includes:
  • the abnormality determination unit is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine whose number is MID has an abnormality.
  • a storage medium stores a program file capable of implementing any of the above methods for anomaly detection and positioning applied to a distributed container cloud platform.
  • a processor is used to run a program, wherein, when the program is running, any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform is executed.
  • the abnormality detection and positioning method and device applied to the distributed container cloud platform in the embodiments of the present invention use TCP delay information for abnormal state judgment, reduce the overhead of data collection, and improve the accuracy and real-time nature of abnormal state judgment.
  • a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
  • FIG. 1 is a flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention
  • FIG. 2 is a preferred flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention
  • FIG. 3 is a block diagram of an anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention
  • FIG. 4 is a preferred module diagram of an anomaly detection and location method applied to a distributed container cloud platform of the present invention.
  • container cloud the cloud computing system based on container technology
  • Because containers are lightweight, they are more convenient to deploy. Therefore, the internal composition of the container cloud is more complicated than that of traditional cloud computing platforms.
  • the isolation of system resources provided by a container is weaker than that provided by a virtual machine.
  • the interference between containers is relatively strong. Therefore, once a container in the container cloud becomes abnormal, the abnormality spreads quickly and affects the entire cluster.
  • the traditional single-component-based anomaly detection method is no longer suitable for distributed container cloud environments.
  • Existing technologies use performance indicators to analyze anomalies, which increases the cost of data collection. They also require a normal fluctuation model to be constructed. For frequently changing and complex container cloud platforms, their accuracy is low and they lack real-time performance.
  • the invention provides an abnormality detection and positioning method and device applied to a distributed container cloud platform for a container cloud platform.
  • the method and the device can perform abnormal location and detection on a more complicated distributed container cloud platform, and at the same time improve the accuracy rate of abnormal location through its component abnormal sub-graph.
  • an anomaly detection and positioning method applied to a distributed container cloud platform includes the following steps:
  • S102 Analyze the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, obtain status information of each component, and generate component status information key-value pairs;
  • S104 Locate the container component node where the abnormality occurs according to the component abnormality subgraph.
  • the method uses TCP delay information for abnormal state judgment, reduces the cost of data collection, and improves the accuracy and real-time nature of abnormal state judgment.
  • a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
  • the TCP delay information of each container component is analyzed by a sliding window accumulation and anomaly detection algorithm to obtain status information of each component and generate component status information key-value pairs including:
  • the component abnormal subgraph constructed by the key value pair of component state information includes:
  • the independent component node is a component node that neither depends on any other component node nor is depended on by any other component node; deleting such component nodes yields the component abnormal subgraph G'.
  • locating the container component node where the abnormality occurs according to the component abnormality subgraph includes:
  • the method further includes:
  • S105 Determine whether the MIDs of the abnormal root nodes are the same. If they are the same, determine that the physical machine with the MID number is abnormal.
  • obtaining TCP delay information of each container component includes:
  • An abnormality detection and positioning method applied to a distributed container cloud platform includes the following steps:
  • the service management program submits an abnormal location request to the service agent program
  • After receiving the abnormal location request, the service agent uses the software tcprstat to collect TCP delay information of the component.
  • the software tcprstat is a free and open source tcp layer analysis tool.
  • tcprstat reports statistics on request response times; it can be used for ad hoc analysis and also for information collection in scheduled tasks;
  • the TCP delay information of the component collected by the service agent is analyzed to obtain the status information of the component and generate the component status information key-value pair <CID:MID:Status>;
  • the service agent submits the component status information key-value pair <CID:MID:Status> to the service management program;
  • an anomaly detection and positioning device applied to a distributed container cloud platform including:
  • the delay information obtaining unit 201 is used to obtain TCP delay information of each container component
  • the state information obtaining unit 202 is configured to analyze the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, obtain the state information of each component and generate component state information key-value pairs;
  • the component abnormal subgraph construction unit 203 is configured to construct a component abnormal subgraph through key value pairs of component state information
  • the abnormal location unit 204 is configured to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
  • the abnormality detection and positioning device of the distributed container cloud platform adopts TCP delay information for abnormal state judgment, reduces the cost of data collection, and improves the accuracy and real-time nature of abnormal state judgment.
  • a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
  • the device further includes:
  • the abnormality determination unit 205 is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine with the MID number is abnormal.
  • a storage medium stores a program file capable of implementing any of the above methods for anomaly detection and positioning applied to a distributed container cloud platform.
  • a processor is used to run a program, wherein, when the program is running, any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform is executed.
  • the disclosed technical content may be implemented in other ways.
  • the system embodiments described above are only schematic.
  • the division of units may be a division of logical functions.
  • in actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present invention.
  • the aforementioned storage media include: USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An abnormality detection and positioning method and apparatus applied to a distributed container cloud platform. According to the method and the apparatus, TCP delay information of each container component is first acquired (S101); the TCP delay information of each container component is analyzed by means of a sliding window cumulative sum abnormality detection algorithm, state information of each component is acquired, and component state information key-value pairs are generated (S102); a component abnormality subgraph is constructed from the component state information key-value pairs (S103); and the container component node where the abnormality occurs is located according to the component abnormality subgraph (S104). The method and the apparatus determine an abnormal state from TCP delay information, which reduces the overhead of data collection and improves the accuracy and timeliness of determining an abnormal state. Furthermore, in consideration of interference between components and between a physical machine and the components, a component abnormality subgraph is provided to express the propagation of an abnormal state, thereby improving the accuracy of abnormality positioning.

Description

Anomaly detection and positioning method and device applied to a distributed container cloud platform
Technical field
The invention relates to the field of container cloud platforms, and in particular to an anomaly detection and positioning method and device applied to a distributed container cloud platform.
Background
Cloud computing, as a new type of service delivery, has won the favor of industry and academia. Its key technology is virtualization: by virtualizing all kinds of resources, cloud computing service providers can easily customize and deliver them to users, and many applications have gradually begun to migrate to cloud computing clusters. Traditional virtualization technologies include KVM and Xen. However, because traditional virtualization is heavyweight, creating, modifying, or migrating a component in an application cluster is very complicated, so cloud computing service providers need a more lightweight virtualization technology. Container technology is a lightweight, operating-system-level virtualization technology. Whereas traditional virtualization virtualizes the hardware layer, container virtualization stays at the operating system layer, which makes containers very convenient to create, modify, and migrate. Container technology was quickly adopted by cloud computing service providers. Because of these characteristics, users often run each component of an application in its own container so that the application can be maintained quickly and conveniently, which gives the container cloud a complicated internal structure. At the same time, the weak isolation of containers leads to serious interference between them. Once an abnormality occurs in one container, it spreads quickly and affects other application components. Cloud service providers therefore need a method that can locate anomalies in the structurally complex application clusters built from containers.
Generally speaking, an application deployed on a container cloud is composed of hundreds or thousands of components that depend on each other, forming a complex graph with components as nodes. Graph theory can be used to locate the root cause of an anomaly in this graph. A cloud computing platform based on container technology usually consists of thousands of physical machines, each of which usually runs dozens of containers, so it is more complicated than a traditional cloud computing platform. Compared with traditional virtual machines, containers are less well isolated and interfere with each other more seriously, so they are more likely to affect one another. In addition, because containers are deployed on a running operating system, an abnormality of the physical machine also causes abnormalities in the containers deployed on it. Existing anomaly detection and positioning solutions lack analysis of the correlation between components and between components and physical machines. They also rely on performance index data for detection and positioning, which brings large storage and transmission overhead, so they do not adapt well to a distributed container cloud platform with severe interference.
Nguyen et al., in Chapter 3 of "Insight: in-situ online service failure path inference in production computing infrastructures", proposed an online black-box anomaly localization system to locate abnormal components. The system uses virtual machine performance indicators to construct a model of their normal fluctuation, identifies abnormally changing data points, and locates abnormal components by combining the time information of the changed data points with the dependencies between components. Although the system can detect and locate anomalies, it relies on performance indicators for detection and judgment, so for a complex distributed container cloud platform the overhead of monitoring performance indicators would be very large.
Summary of the invention
The embodiments of the present invention provide an anomaly detection and positioning method and device applied to a distributed container cloud platform, to at least solve the technical problem that traditional single-component-based anomaly detection methods cannot be applied to a distributed container cloud.
According to an embodiment of the present invention, there is provided an anomaly detection and positioning method applied to a distributed container cloud platform, including the following steps:
obtaining TCP delay information of each container component;
analyzing the TCP delay information of each container component through a sliding window cumulative sum anomaly detection algorithm, obtaining status information of each component, and generating component status information key-value pairs;
constructing a component abnormal subgraph from the component status information key-value pairs;
locating the container component node where the abnormality occurs according to the component abnormal subgraph.
Further, analyzing the TCP delay information of each container component through the sliding window cumulative sum anomaly detection algorithm, obtaining status information of each component, and generating component status information key-value pairs includes:
initializing the sliding window $[L_0, L_k]$ of the component and inputting TCP delay information until the number of TCP delay samples in the sliding window reaches k; initializing the average [Figure PCTCN2019123989-appb-000001] and the cumulative sum $S_k = 0$; where $[L_0, L_k]$ is the queue storing TCP delay information from 0 to k, and k is an integer with $0 < k < 60$;
inputting new TCP delay information $L_t$, inserting $L_t$ into the sliding window, deleting the earliest TCP delay information $L_{t-k}$ from the sliding window, calculating the average within the window [Figure PCTCN2019123989-appb-000002], and calculating the cumulative sum [Figure PCTCN2019123989-appb-000003]; where $L_t$ is the TCP delay information at time t, and t is an integer with $t > k$;
calculating the warning value $S_{diff} = S_{max} - S_{min}$, where $S_{max}, S_{min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum at the time of the earliest TCP delay information;
determining whether $S_{diff}$ lies within the normal threshold range $[-h, h]$; if so, the Status of the component is determined to be normal, otherwise the Status of the component is determined to be abnormal;
generating component status information key-value pairs <CID:MID:Status> from the status information of each component, where CID is the number of the component, MID is the number of the physical machine on which the component is located, and Status is the status of the component: the Status value is 1 when the component is abnormal and 0 when it is normal.
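The status-detection steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the patent's exact procedure: the formulas for the window average and the cumulative sum exist only as figure references in this text, so the update rule used here (mean of the k samples in the window, and S_t = S_{t-1} + (L_t - mean)) is an assumed, conventional cumulative-sum form. The names `detect_status` and `status_pair` are hypothetical.

```python
from collections import deque


def detect_status(latencies, k, h):
    """Return 1 (abnormal) or 0 (normal) after consuming all delay samples."""
    if not 0 < k < 60:
        raise ValueError("the source restricts k to 0 < k < 60")
    window = deque(maxlen=k)      # the k most recent TCP delay samples
    cusums = deque(maxlen=k + 1)  # cumulative sums S_{t-k} .. S_t
    s = 0.0
    status = 0
    for value in latencies:
        window.append(value)      # a full deque drops its oldest sample
        if len(window) < k:
            continue              # still filling the initial window
        if not cusums:
            cusums.append(0.0)    # S_k = 0 once the window first fills
            continue
        mean = sum(window) / k    # assumed form of the window average
        s += value - mean         # assumed form of the cumulative sum
        cusums.append(s)
        s_diff = max(cusums) - min(cusums)
        status = 0 if -h <= s_diff <= h else 1
    return status


def status_pair(cid, mid, status):
    """Format the <CID:MID:Status> key-value pair described in the text."""
    return f"<{cid}:{mid}:{status}>"
```

For example, `status_pair("c7", "m2", detect_status(samples, k=5, h=3.0))` would produce the pair for one component, where `samples`, the window size, and the threshold are illustrative values chosen by the caller.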
Further, constructing the component abnormal subgraph from the component status information key-value pairs includes:
inputting the component dependency graph G, whose matrix representation is $G = (E_{ij})$, where i and j denote components in the application cluster and $E_{ij}$ denotes the dependency between component i and component j: $E_{ij} = 1$ if component i depends on component j, and $E_{ij} = 0$ otherwise;
traversing the component status information key-value pairs and, whenever the Status value is 0, deleting from the component dependency graph G the row and column with i = CID or j = CID; after the traversal, the component dependency subgraph G1 is obtained;
determining whether the component dependency subgraph G1 contains independent component nodes, an independent component node being one that neither depends on any other component node nor is depended on by any other component node; deleting such component nodes yields the component abnormal subgraph G'.
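The subgraph construction can be illustrated with a short Python sketch. It assumes the dependency graph is given as a dictionary mapping each component to the set of components it depends on (equivalent to the rows of the matrix with value 1), and that every component appears as a key. The function name `build_abnormal_subgraph` and the component names are hypothetical.

```python
def build_abnormal_subgraph(deps, status):
    """Build G' from a dependency graph and per-component Status values.

    deps:   {component: set of components it depends on}
    status: {component: 1 if abnormal, 0 if normal}
    """
    # Step 1: drop every normal component (Status == 0) to obtain G1.
    abnormal = {c for c in deps if status.get(c, 0) == 1}
    g1 = {c: deps[c] & abnormal for c in abnormal}
    # Step 2: drop independent nodes, i.e. nodes with no outgoing
    # dependency and no incoming dependency inside G1.
    depended_on = set().union(*g1.values())
    return {c: d for c, d in g1.items() if d or c in depended_on}
```

With `deps = {"web": {"api"}, "api": {"db"}, "db": set(), "cache": set()}` and all four components abnormal, the isolated `cache` node is removed and the chain web, api, db remains.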
Further, locating the container component node where the abnormality occurs according to the component abnormal subgraph includes:
traversing the component abnormal subgraph G' and calculating $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node i is a root node of the abnormality.
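A small Python sketch of the root-node rule, assuming G' is supplied as a list of node identifiers plus the matching 0/1 matrix E, where `E[i][j] == 1` means node i depends on node j. A node whose row sums to zero depends on no other abnormal node, so the abnormality cannot have propagated to it from elsewhere in G'. The function name `find_root_nodes` is hypothetical.

```python
def find_root_nodes(nodes, E):
    """Return the nodes i of G' with delta_i = sum_j E[i][j] == 0."""
    return [nodes[i] for i, row in enumerate(E) if sum(row) == 0]


# Example: web depends on api, api depends on db, db depends on nothing.
nodes = ["web", "api", "db"]
E = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
roots = find_root_nodes(nodes, E)
```

In this example only the `db` row sums to zero, so `db` is reported as the root of the abnormality.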
Further, after locating the container component node where the abnormality occurs according to the component abnormal subgraph, the method further includes:
determining whether the MIDs of all the abnormal root nodes are the same; if they are the same, determining that the physical machine numbered MID is abnormal.
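The physical-machine check can be sketched as follows, assuming a mapping from component number (CID) to physical machine number (MID) is available from the key-value pairs. The function name `locate_faulty_machine` and the identifiers are hypothetical.

```python
def locate_faulty_machine(root_cids, cid_to_mid):
    """Return the shared MID if all abnormal root nodes sit on one machine.

    Returns None when there are no root nodes or when they are spread
    over several physical machines.
    """
    mids = {cid_to_mid[cid] for cid in root_cids}
    return mids.pop() if len(mids) == 1 else None
```

Read literally, the rule also flags the machine when there is only a single root node; whether that case should be treated differently is not stated in the text.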
Further, obtaining the TCP delay information of each container component includes:
using the software tcprstat to collect the TCP delay information of each component.
According to another embodiment of the present invention, there is provided an anomaly detection and positioning device applied to a distributed container cloud platform, including:
a delay information obtaining unit, used to obtain TCP delay information of each container component;
a state information obtaining unit, used to analyze the TCP delay information of each container component through the sliding window cumulative sum anomaly detection algorithm, obtain the state information of each component, and generate component state information key-value pairs;
a component abnormal subgraph construction unit, used to construct a component abnormal subgraph from the component state information key-value pairs;
an abnormal location unit, used to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
Further, the device further includes:
an abnormality determination unit, used to determine whether the MIDs of all the abnormal root nodes are the same and, if they are the same, to determine that the physical machine numbered MID is abnormal.
A storage medium stores a program file capable of implementing any one of the above anomaly detection and positioning methods applied to a distributed container cloud platform.
A processor is used to run a program, wherein, when the program runs, any one of the above anomaly detection and positioning methods applied to a distributed container cloud platform is executed.
本发明实施例中的应用于分布式容器云平台的异常检测与定位方法及装置,采用TCP延迟信息进行异常状态判断,降低了数据采集的开销,提高了异常状态判断的准确性与实时性。同时考虑到各组件之间,物理机与组件之间 的干扰,提出了组件异常子图用以表示异常状态的传播,提高了异常定位的准确性。The abnormality detection and positioning method and device applied to the distributed container cloud platform in the embodiments of the present invention use TCP delay information for abnormal state judgment, reduce the overhead of data collection, and improve the accuracy and real-time nature of abnormal state judgment. At the same time, considering the interference between each component and between the physical machine and the component, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of the anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;
FIG. 2 is a preferred flowchart of the anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;
FIG. 3 is a block diagram of the anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention;
FIG. 4 is a preferred block diagram for the anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention.
Detailed description
To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
As container technology has matured, container-based cloud computing systems, known as container clouds, have begun to replace traditional cloud computing systems based on virtual machines. Because containers are lightweight, they are easier to deploy, so the internal composition of a container cloud is more complex than that of a traditional cloud computing platform. Containers also isolate system resources less strongly than virtual machines do, and when multiple containers run on the same physical host the interference between them is relatively strong. As a result, once one container in the container cloud becomes abnormal, the anomaly spreads quickly and can affect the entire cluster. Given this complex internal environment, traditional anomaly detection methods based on a single component are no longer suitable for distributed container clouds. The prior art analyzes anomalies using performance metrics, which increases data collection overhead and requires a model of normal fluctuation; for container cloud platforms whose fluctuations are frequent and complex, the accuracy is low and real-time performance is lacking.
For container cloud platforms, the present invention provides an anomaly detection and positioning method and device applied to a distributed container cloud platform. The method and device can detect and locate anomalies in more complex distributed container cloud platforms, and the component anomaly subgraph improves the accuracy of anomaly localization.
Embodiment 1
According to an embodiment of the present invention, an anomaly detection and positioning method applied to a distributed container cloud platform is provided. Referring to FIG. 1, the method includes the following steps:
S101: Obtain the TCP delay information of each container component;
S102: Analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component, and generate component status key-value pairs;
S103: Construct a component anomaly subgraph from the component status key-value pairs;
S104: Locate the container component node where the anomaly occurred according to the component anomaly subgraph.
The method uses TCP delay information to determine abnormal states, which reduces data collection overhead and improves the accuracy and timeliness of abnormal-state determination. In addition, taking into account the interference among components and between physical machines and components, a component anomaly subgraph is proposed to represent how abnormal states propagate, which improves the accuracy of anomaly localization.
As a preferred technical solution, analyzing the TCP delay information of each container component with the sliding-window cumulative-sum anomaly detection algorithm, obtaining the status information of each component, and generating the component status key-value pairs includes:
Initialize the component's sliding window $[L_0, L_k]$ and feed in TCP (Transmission Control Protocol) delay information until the window holds $k$ delay values. Initialize the window mean (formula shown in image PCTCN2019123989-appb-000004) and the cumulative sum $S_k = 0$. This is the initial value, so $S_k = S_{k-1} = \dots = S_0 = 0$ and $L_k = L_{k-1} = \dots = L_0 = 0$. Here $[L_0, L_k]$ is the queue that stores the TCP delay values from 0 to $k$; the queue size is $k$, where $k$ is an input parameter and an integer with $0 < k < 60$, usually 10;
Feed in the next TCP delay value $L_t$: insert $L_t$ into the sliding window, delete the oldest TCP delay value $L_{t-k}$ from the window, compute the mean within the window (formula shown in image PCTCN2019123989-appb-000005), and compute the cumulative sum (formula shown in image PCTCN2019123989-appb-000006). The computation is iterative: when $t = k+1$, $S_{t-1} = S_k = 0$. Here $L_t$ is the TCP delay value at time $t$, and $t$ is an integer with $t > k$;
Compute the early-warning value $S_{diff} = S_{max} - S_{min}$, where $S_{max}, S_{min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum at the time of the oldest TCP delay value;
Determine whether $S_{diff}$ lies within the normal threshold range $[-h, h]$. If it does, the component's Status is normal; otherwise the component's Status is abnormal. $h$ defines the acceptable range of $S_{diff}$ and is one of the input parameters.
Generate the component status key-value pair <CID:MID:Status> from the status information of each component, where CID is the number of the component, MID is the number of the physical machine hosting the component, and Status is the state of the component: 1 when the component is abnormal and 0 when it is normal.
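The sliding-window steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described by the patent: the update formulas are available only as images in this text, so the sketch assumes that the window mean is the arithmetic mean of the k values in the window and that the cumulative sum is updated as S_t = S_{t-1} + (L_t - mean), the usual cumulative-sum form. The names `SlidingWindowCusum`, `update`, and `status_record`, and the default value of `h`, are hypothetical.

```python
from collections import deque


class SlidingWindowCusum:
    """Illustrative sliding-window cumulative-sum detector for one component."""

    def __init__(self, k=10, h=50.0):
        if not 0 < k < 60:
            raise ValueError("k must be an integer with 0 < k < 60")
        self.k = k
        self.h = h  # hypothetical default; the source only says h is an input
        self.window = deque(maxlen=k)           # the k most recent delay values
        self.sums = deque([0.0], maxlen=k + 1)  # S_{t-k} .. S_t; starts at 0

    def update(self, delay):
        """Feed one delay value; return None while filling, else 0 or 1."""
        filling = len(self.window) < self.k
        self.window.append(delay)  # a full deque drops its oldest value
        if filling:
            return None  # initialization phase: the cumulative sum stays 0
        mean = sum(self.window) / self.k
        # ASSUMPTION: standard cumulative-sum update; the source formula is an image.
        self.sums.append(self.sums[-1] + (delay - mean))
        s_diff = max(self.sums) - min(self.sums)
        return 0 if -self.h <= s_diff <= self.h else 1


def status_record(cid, mid, status):
    """Format a <CID:MID:Status> key-value pair as a plain string."""
    return f"{cid}:{mid}:{status}"


detector = SlidingWindowCusum(k=3, h=5.0)
statuses = [detector.update(x) for x in [10, 10, 10, 10, 40]]
print(statuses)
print(status_record("c1", "m1", statuses[-1]))
```

Keeping only the last k+1 cumulative sums in a bounded deque mirrors the range $[S_{t-k}, S_t]$ used for the early-warning value, so memory per component stays constant.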
As a preferred technical solution, constructing the component anomaly subgraph from the component status key-value pairs includes:
Input the component dependency graph G, whose matrix representation is $G = (E_{ij})$, where $i$ and $j$ denote components in the application cluster and $E_{ij}$ denotes the dependency between component $i$ and component $j$: $E_{ij} = 1$ if component $i$ depends on component $j$, and $E_{ij} = 0$ otherwise;
Traverse the component status key-value pairs; whenever Status is 0, delete from the component dependency graph G the rows and columns with $i$ = CID or $j$ = CID. When the traversal is complete, the result is the component dependency subgraph G1;
Determine whether the component dependency subgraph G1 contains independent component nodes, that is, component nodes that neither depend on other component nodes nor are depended on by any other component node; delete such nodes to construct the component anomaly subgraph G'.
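The construction of G1 and G' described above can be sketched as follows. This is an illustrative sketch only: it stores the dependency matrix as a dictionary that maps each component to the set of components it depends on (so $E_{ij} = 1$ exactly when `j in deps[i]`), and it assumes that every component has a status record. The function name `build_anomaly_subgraph` and the sample component names are hypothetical.

```python
def build_anomaly_subgraph(deps, records):
    """Return G' as {component: set of abnormal components it depends on}.

    deps:    {component: set of components it depends on}  (E_ij = 1)
    records: iterable of (cid, mid, status) tuples, status 1 = abnormal
    """
    abnormal = {cid for cid, _mid, status in records if status == 1}

    # G1: remove the rows and columns of every component whose Status is 0.
    g1 = {i: {j for j in deps.get(i, set()) if j in abnormal} for i in abnormal}

    # Components that some other node in G1 depends on.
    depended_on = set().union(*g1.values())

    # G': drop independent nodes (no outgoing and no incoming dependency).
    return {i: js for i, js in g1.items() if js or i in depended_on}


deps = {"web": {"api"}, "api": {"db"}, "db": set(), "cache": set()}
records = [("web", "m1", 1), ("api", "m1", 1), ("db", "m2", 0), ("cache", "m2", 1)]
print(build_anomaly_subgraph(deps, records))
```

In this made-up example `db` is removed because it is normal, and `cache` is removed because, although abnormal, it has no dependency edge to any other abnormal component.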
As a preferred technical solution, locating the container component node where the anomaly occurred according to the component anomaly subgraph includes:
Traverse the component anomaly subgraph G' and compute $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is a root node of the anomaly.
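As a small illustration of the root-node test, the sketch below computes $\delta_i$ as the number of abnormal components that node $i$ depends on and reports the nodes for which it is zero. The dictionary representation of G' and the name `find_root_nodes` are assumptions made for this example.

```python
def find_root_nodes(anomaly_subgraph):
    """Return the nodes i of G' with delta_i = sum_j E_ij = 0.

    anomaly_subgraph: {component: set of abnormal components it depends on}
    """
    roots = []
    for node, dependencies in anomaly_subgraph.items():
        delta = len(dependencies)  # row sum of the 0/1 dependency matrix
        if delta == 0:
            roots.append(node)
    return sorted(roots)


# Hypothetical G': web depends on api, and api depends on no abnormal component.
print(find_root_nodes({"web": {"api"}, "api": set()}))
```

A node with $\delta_i = 0$ is abnormal yet depends on no other abnormal component, so its anomaly cannot have propagated from a dependency; this is why it is treated as a root.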
As a preferred technical solution, referring to FIG. 2, after locating the container component node where the anomaly occurred according to the component anomaly subgraph, the method further includes:
S105: Determine whether the MIDs of all anomaly root nodes are the same; if they are, determine that the physical machine numbered MID is abnormal.
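Step S105 can be sketched as a check on the set of machine numbers of the root nodes. The function name `locate_faulty_machine` and the sample identifiers are hypothetical. The sketch follows the text literally, so a single root node trivially satisfies the condition; the source does not say how that case should be handled.

```python
def locate_faulty_machine(root_records):
    """Return the shared MID if all anomaly root nodes share one, else None.

    root_records: iterable of (cid, mid) pairs for the anomaly root nodes
    """
    mids = {mid for _cid, mid in root_records}
    return next(iter(mids)) if len(mids) == 1 else None


print(locate_faulty_machine([("api", "m1"), ("db", "m1")]))
print(locate_faulty_machine([("api", "m1"), ("db", "m2")]))
```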
As a preferred technical solution, obtaining the TCP delay information of each container component includes:
Use the tcprstat tool to collect the TCP delay information of each component.
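The sketch below shows one way collected delay statistics could be turned into a list of delay values for the detector. It does not invoke tcprstat. It assumes the tool's output has been captured as whitespace-separated text whose first line is a header row naming the columns; the sample text and the column name `avg` are illustrative assumptions and should be checked against the actual output of the installed version.

```python
def parse_delay_samples(lines, column="avg"):
    """Extract one numeric column from header-plus-rows text output.

    lines:  iterable of text lines; the first non-empty line names the columns
    column: header name of the delay column to extract (assumed name)
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    index = header.index(column)  # raises ValueError if the column is absent
    return [float(row[index]) for row in data]


# Made-up sample text, not real tool output.
sample = [
    "timestamp count max min avg",
    "1 5 300 100 180",
    "2 4 900 120 410",
]
print(parse_delay_samples(sample))
```

Looking the column up by its header name keeps the parser independent of the exact column order.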
The method is described in detail below using a specific embodiment. An anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention includes the following steps:
The service manager submits an anomaly locating request to the service agent;
After receiving the anomaly locating request, the service agent uses the tcprstat tool to collect the TCP delay information of the component. tcprstat is a free, open-source TCP-layer analysis tool that collects statistics on request response times; it can be used for ad hoc analysis or run as a scheduled task to collect information;
The TCP delay information collected by the service agent is analyzed with the sliding-window cumulative-sum anomaly detection algorithm to obtain the component's status information Status and generate the component status key-value pair <CID:MID:Status>;
The service agent submits the component status key-value pair <CID:MID:Status> to the service manager;
After collecting all component status key-value pairs, the service manager constructs the component anomaly subgraph G';
The service manager traverses the component anomaly subgraph G' and computes $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is a root node of the anomaly;
Determine whether the MIDs of all anomaly root nodes are the same; if they are, the physical machine numbered MID is abnormal.
Embodiment 2
According to another embodiment of the present invention, an anomaly detection and positioning device applied to a distributed container cloud platform is provided. Referring to FIG. 3, the device includes:
a delay information acquisition unit 201, configured to obtain the TCP delay information of each container component;
a status information acquisition unit 202, configured to analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component, and generate component status key-value pairs;
a component anomaly subgraph construction unit 203, configured to construct a component anomaly subgraph from the component status key-value pairs;
an anomaly locating unit 204, configured to locate the container component node where the anomaly occurred according to the component anomaly subgraph.
The anomaly detection and positioning device for distributed container cloud platforms according to the present invention uses TCP delay information to determine abnormal states, which reduces data collection overhead and improves the accuracy and timeliness of abnormal-state determination. In addition, taking into account the interference among components and between physical machines and components, a component anomaly subgraph is proposed to represent how abnormal states propagate, which improves the accuracy of anomaly localization.
As a preferred technical solution, referring to FIG. 4, the device further includes:
an anomaly determination unit 205, configured to determine whether the MIDs of all anomaly root nodes are the same and, if they are, to determine that the physical machine numbered MID is abnormal.
Embodiment 3
A storage medium storing a program file capable of implementing any one of the above anomaly detection and positioning methods applied to a distributed container cloud platform.
Embodiment 4
A processor configured to run a program, wherein the program, when running, executes any one of the above anomaly detection and positioning methods applied to a distributed container cloud platform.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The system embodiments described above are only illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. An anomaly detection and positioning method applied to a distributed container cloud platform, characterized by comprising the following steps:
    obtaining the TCP delay information of each container component;
    analyzing the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtaining the status information of each component, and generating component status key-value pairs;
    constructing a component anomaly subgraph from the component status key-value pairs;
    locating the container component node where the anomaly occurred according to the component anomaly subgraph.
  2. The method according to claim 1, characterized in that analyzing the TCP delay information of each container component with the sliding-window cumulative-sum anomaly detection algorithm, obtaining the status information of each component, and generating the component status key-value pairs comprises:
    initializing the component's sliding window $[L_0, L_k]$, feeding in TCP delay information until the window holds $k$ delay values, and initializing the window mean (formula shown in image PCTCN2019123989-appb-100001) and the cumulative sum $S_k = 0$; where $[L_0, L_k]$ is the queue that stores the TCP delay values from 0 to $k$, and $k$ is an integer with $0 < k < 60$;
    feeding in the next TCP delay value $L_t$, inserting $L_t$ into the sliding window, deleting the oldest TCP delay value $L_{t-k}$ from the window, computing the mean within the window (formula shown in image PCTCN2019123989-appb-100002), and computing the cumulative sum (formula shown in image PCTCN2019123989-appb-100003); where $L_t$ is the TCP delay value at time $t$, and $t$ is an integer with $t > k$;
    computing the early-warning value $S_{diff} = S_{max} - S_{min}$, where $S_{max}, S_{min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum at the time of the oldest TCP delay value;
    determining whether $S_{diff}$ lies within the normal threshold range $[-h, h]$; if it does, determining that the component's Status is normal, and otherwise determining that the component's Status is abnormal;
    generating the component status key-value pair <CID:MID:Status> from the status information of each component, where CID is the number of the component, MID is the number of the physical machine hosting the component, and Status is the state of the component: 1 when the component is abnormal and 0 when it is normal.
  3. The method according to claim 2, characterized in that constructing the component anomaly subgraph from the component status key-value pairs comprises:
    inputting the component dependency graph G, whose matrix representation is $G = (E_{ij})$, where $i$ and $j$ denote components in the application cluster and $E_{ij}$ denotes the dependency between component $i$ and component $j$: $E_{ij} = 1$ if component $i$ depends on component $j$, and $E_{ij} = 0$ otherwise;
    traversing the component status key-value pairs and, whenever Status is 0, deleting from the component dependency graph G the rows and columns with $i$ = CID or $j$ = CID, the result after the traversal being the component dependency subgraph G1;
    determining whether the component dependency subgraph G1 contains independent component nodes, an independent component node being a component node that neither depends on other component nodes nor is depended on by any other component node, and deleting such nodes to construct the component anomaly subgraph G'.
  4. The method according to claim 3, characterized in that locating the container component node where the anomaly occurred according to the component anomaly subgraph comprises:
    traversing the component anomaly subgraph G' and computing $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is a root node of the anomaly.
  5. The method according to claim 4, characterized in that, after locating the container component node where the anomaly occurred according to the component anomaly subgraph, the method further comprises:
    determining whether the MIDs of all anomaly root nodes are the same and, if they are, determining that the physical machine numbered MID is abnormal.
  6. The method according to claim 1, characterized in that obtaining the TCP delay information of each container component comprises:
    using the tcprstat tool to collect the TCP delay information of each component.
  7. An anomaly detection and positioning device applied to a distributed container cloud platform, characterized by comprising:
    a delay information acquisition unit, configured to obtain the TCP delay information of each container component;
    a status information acquisition unit, configured to analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component, and generate component status key-value pairs;
    a component anomaly subgraph construction unit, configured to construct a component anomaly subgraph from the component status key-value pairs;
    an anomaly locating unit, configured to locate the container component node where the anomaly occurred according to the component anomaly subgraph.
  8. The device according to claim 7, characterized in that the device further comprises:
    an anomaly determination unit, configured to determine whether the MIDs of all anomaly root nodes are the same and, if they are, to determine that the physical machine numbered MID is abnormal.
  9. A storage medium, characterized in that the storage medium stores a program file capable of implementing the anomaly detection and positioning method applied to a distributed container cloud platform according to any one of claims 1 to 6.
  10. A processor, characterized in that the processor is configured to run a program, wherein the program, when running, executes the anomaly detection and positioning method applied to a distributed container cloud platform according to any one of claims 1 to 6.
PCT/CN2019/123989 2018-12-15 2019-12-09 Abnormality detection and positioning method and apparatus applied to distributed container cloud platform WO2020119627A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811537333.2 2018-12-15
CN201811537333.2A CN109800052B (en) 2018-12-15 2018-12-15 Anomaly detection and positioning method and device applied to distributed container cloud platform

Publications (1)

Publication Number Publication Date
WO2020119627A1 true WO2020119627A1 (en) 2020-06-18

Family

ID=66556890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/123989 WO2020119627A1 (en) 2018-12-15 2019-12-09 Abnormality detection and positioning method and apparatus applied to distributed container cloud platform

Country Status (2)

Country Link
CN (1) CN109800052B (en)
WO (1) WO2020119627A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800052B (en) * 2018-12-15 2020-11-24 深圳先进技术研究院 Anomaly detection and positioning method and device applied to distributed container cloud platform
WO2021109048A1 (en) * 2019-12-05 2021-06-10 深圳先进技术研究院 Container cloud platform abnormality detection method and system, and electronic device
CN111061586B (en) * 2019-12-05 2023-09-19 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5796937A (en) * 1994-09-29 1998-08-18 Fujitsu Limited Method of and apparatus for dealing with processor abnormality in multiprocessor system
CN101505243A (en) * 2009-03-10 2009-08-12 中国科学院软件研究所 Performance exception detecting method for Web application
CN105242971A (en) * 2015-10-20 2016-01-13 北京航空航天大学 Streaming processing system oriented memory object management method and system
CN108306879A (en) * 2018-01-30 2018-07-20 福建师范大学 The real-time abnormal localization method of distribution based on Web session streams
CN109800052A (en) * 2018-12-15 2019-05-24 深圳先进技术研究院 Abnormality detection and localization method and device applied to distributed container cloud platform

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832150B2 (en) * 2016-07-28 2020-11-10 International Business Machines Corporation Optimized re-training for analytic models
CN106487633B (en) * 2016-10-11 2019-12-06 中国银联股份有限公司 method and device for monitoring abnormity of virtual machine
US20180124080A1 (en) * 2016-11-02 2018-05-03 Qualcomm Incorporated Methods and Systems for Anomaly Detection Using Functional Specifications Derived from Server Input/Output (I/O) Behavior
CN106776005B (en) * 2016-11-23 2019-12-13 华中科技大学 Resource management system and method for containerized application
CN108306747B (en) * 2017-01-11 2021-07-23 阿里巴巴集团控股有限公司 Cloud security detection method and device and electronic equipment
CN107612787B (en) * 2017-11-06 2021-01-12 南京易捷思达软件科技有限公司 Cloud host fault detection method based on Openstack open source cloud platform
CN108337108A (en) * 2017-12-28 2018-07-27 天津麒麟信息技术有限公司 A kind of cloud platform failure automation localization method based on association analysis
CN108259241A (en) * 2018-01-11 2018-07-06 上海有云信息技术有限公司 A kind of abnormal localization method and device of cloud platform monitoring system
CN108491306A (en) * 2018-03-19 2018-09-04 广东电网有限责任公司珠海供电局 One kind being based on enterprise's private clound credibility monitoring method and system

Also Published As

Publication number Publication date
CN109800052B (en) 2020-11-24
CN109800052A (en) 2019-05-24

Similar Documents

Publication Publication Date Title
US11936663B2 (en) System for monitoring and managing datacenters
US10560309B1 (en) Identifying a root cause of alerts within virtualized computing environment monitoring system
US9471455B2 (en) System, method, and computer program product for managing software updates
US10484301B1 (en) Dynamic resource distribution using periodicity-aware predictive modeling
US9495152B2 (en) Automatic baselining of business application service groups comprised of virtual machines
US11537940B2 (en) Systems and methods for unsupervised anomaly detection using non-parametric tolerance intervals over a sliding window of t-digests
WO2020119627A1 (en) Abnormality detection and positioning method and apparatus applied to distributed container cloud platform
WO2020135806A1 (en) Operation maintenance method and equipment applied to data center
US10616078B1 (en) Detecting deviating resources in a virtual environment
US20150019722A1 (en) Determining, managing and deploying an application topology in a virtual environment
US20200220796A1 (en) System monitoring with metrics correlation for data center
CN114208126A (en) Method and device for configuring cloud storage software equipment
US9400731B1 (en) Forecasting server behavior
US9367418B2 (en) Application monitoring
CN111865899B (en) Threat-driven cooperative acquisition method and device
CN113504996A (en) Load balance detection method, device, equipment and storage medium
US9929921B2 (en) Techniques for workload toxic mapping
US20230336447A1 (en) Machine learning for metric collection
US20230195495A1 (en) Realtime property based application discovery and clustering within computing environments
US20230161612A1 (en) Realtime inductive application discovery based on delta flow changes within computing environments
US20230089305A1 (en) Automated naming of an application/tier in a virtual computing environment
Zhao et al. Scheduling Parallel Migration of Virtualized Services Under Time Constraints in Mobile Edge Clouds
US20230289202A1 (en) Realtime application reconciliation within computing environments
CN111061586B (en) Container cloud platform anomaly detection method and system and electronic equipment
US20230039875A1 (en) Adaptive idle detection in a software-defined data center in a hyper-converged infrastructure

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19895582

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19895582

Country of ref document: EP

Kind code of ref document: A1