WO2020119627A1 - Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform - Google Patents
Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform
- Publication number
- WO2020119627A1 (PCT/CN2019/123989)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- component
- abnormal
- container
- delay information
- status
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
Definitions
- The invention relates to the field of container cloud platforms, and in particular to an abnormality detection and positioning method and device applied to a distributed container cloud platform.
- Cloud computing, as a new type of service delivery model, has won the favor of industry and academia.
- The key technology of cloud computing is virtualization.
- By virtualizing all kinds of resources, cloud computing service providers can easily customize and deliver those resources to users, and many applications have gradually begun to migrate to cloud computing clusters.
- Traditional virtualization technologies include KVM, Xen, and others.
- Container technology is a lightweight, operating-system-level virtualization technology. Whereas traditional virtualization virtualizes the hardware layer, container virtualization stays at the operating system layer, which makes containers very convenient to create, modify, or migrate.
- Container technology has been adopted quickly by cloud computing service providers. Because of these characteristics, users often run each component of an application in its own container so that the application can be maintained conveniently and quickly, which makes the internal structure of the container cloud complicated. At the same time, the weak isolation of containers leads to serious interference between them: once an abnormality occurs in one container, it spreads quickly and affects other application components. Cloud service providers therefore need a method that can locate abnormalities in the structurally complex application clusters built from containers.
- An application deployed on a container cloud is often composed of hundreds or thousands of components, and the components depend on each other to form a complex graph with components as nodes. Graph theory can be used to locate the root cause of an anomaly in this graph. A cloud computing platform based on container technology is usually composed of thousands of physical machines, each typically running dozens of containers, so it is more complicated than a traditional cloud computing platform. Compared with virtual machines, containers are less well isolated and interfere with each other more, so they are also more likely to affect each other.
- Nguyen et al., in Chapter 3 of "Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures", proposed an online black-box anomaly localization system to locate abnormal components.
- The system uses virtual machine performance indicators to build a model of their normal fluctuation, identifies data points that change abnormally, and locates abnormal components by combining the timing of those changed data points with the dependencies between components.
- Although the system can detect and locate anomalies, it relies on performance indicators for detection and judgment, so on a complex distributed container cloud platform the overhead of monitoring those indicators would be huge.
- The embodiments of the present invention provide an abnormality detection and positioning method and device applied to a distributed container cloud platform, to at least solve the technical problem that traditional single-component-based abnormality detection methods cannot be applied to a distributed container cloud.
- An anomaly detection and positioning method applied to a distributed container cloud platform includes the following steps:
- The TCP delay information of each container component is analyzed to obtain the status information of each component and generate component status information key-value pairs;
- Analyzing the TCP delay information of each container component through the sliding window accumulation and anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs includes:
- Constructing the component abnormal subgraph from the component status information key-value pairs includes:
- An independent component node is a component node that neither depends on any other component node nor is depended on by any other component node. Such component nodes are deleted to construct the component abnormal subgraph G'.
- Locating the container component node where the abnormality occurs according to the component abnormal subgraph includes:
- The method further includes:
- Obtaining the TCP delay information of each container component includes:
- An anomaly detection and positioning device applied to a distributed container cloud platform includes:
- The delay information obtaining unit is used to obtain the TCP delay information of each container component;
- The state information obtaining unit is used to analyze the TCP delay information of each container component through the sliding window accumulation and anomaly detection algorithm, obtain the state information of each component, and generate component state information key-value pairs;
- The component abnormal subgraph construction unit is used to construct the component abnormal subgraph from the component state information key-value pairs;
- The abnormal location unit is used to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
- The device further includes:
- The abnormality determination unit is used to determine whether the MIDs of the abnormal root nodes are the same; if they are, the physical machine with that MID is determined to be abnormal.
- A storage medium stores a program file capable of implementing any of the above anomaly detection and positioning methods applied to a distributed container cloud platform.
- A processor is used to run a program which, when running, executes any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform.
- The abnormality detection and positioning method and device applied to the distributed container cloud platform in the embodiments of the present invention use TCP delay information to judge abnormal states, which reduces the overhead of data collection and improves the accuracy and timeliness of the judgment.
- A component abnormal subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormality location.
- FIG. 1 is a flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;
- FIG. 2 is a preferred flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;
- FIG. 3 is a block diagram of an anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention;
- FIG. 4 is a preferred module diagram of an anomaly detection and location method applied to a distributed container cloud platform of the present invention.
- A container cloud is a cloud computing system based on container technology.
- Because containers are lightweight, they are more convenient to deploy, so the internal composition of a container cloud is more complicated than that of a traditional cloud computing platform.
- Containers isolate system resources more weakly than virtual machines do.
- Interference between containers is therefore relatively strong: once a container in the container cloud becomes abnormal, the abnormality spreads quickly and affects the entire cluster.
- Traditional single-component-based anomaly detection methods are no longer suitable for distributed container cloud environments.
- Existing technologies analyze anomalies using performance indicators, which increases the cost of data collection, and they also require a normal fluctuation model to be constructed. On container cloud platforms with frequent changes and complex structure, their accuracy is low and they are not timely.
- The invention provides an abnormality detection and positioning method and device applied to a distributed container cloud platform.
- The method and device can detect and locate abnormalities on a more complicated distributed container cloud platform, and the component abnormal subgraph improves the accuracy of abnormality location.
- An anomaly detection and positioning method applied to a distributed container cloud platform includes the following steps:
- S102: Analyze the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, obtain the status information of each component, and generate component status information key-value pairs;
- S104: Locate the container component node where the abnormality occurs according to the component abnormal subgraph.
- The method uses TCP delay information to judge abnormal states, which reduces the cost of data collection and improves the accuracy and timeliness of the judgment.
- A component abnormal subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormality location.
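The text names a sliding window accumulation and anomaly detection algorithm for step S102 but does not spell out its details here. The following Python sketch is therefore only one plausible reading, under stated assumptions: a baseline mean is taken from the first samples, positive deviations from it are accumulated over a fixed-size window, and a component is flagged once the accumulated sum exceeds a threshold. The function name `detect_status` and the parameters `window`, `baseline_n`, and `threshold` are hypothetical.

```python
from collections import deque
from statistics import mean


def detect_status(delays_ms, window=5, baseline_n=10, threshold=50.0):
    """Return "abnormal" or "normal" for one component's TCP delay series.

    Assumed rule (not taken from the patent text): the first `baseline_n`
    samples define a baseline mean; afterwards the positive deviations from
    that baseline are accumulated over a sliding window of `window` samples,
    and the component is abnormal once that sum exceeds `threshold` ms.
    """
    if len(delays_ms) <= baseline_n:
        return "normal"  # not enough samples to judge
    baseline = mean(delays_ms[:baseline_n])
    recent = deque(maxlen=window)  # keeps only the last `window` deviations
    for delay in delays_ms[baseline_n:]:
        recent.append(max(0.0, delay - baseline))
        if sum(recent) > threshold:
            return "abnormal"
    return "normal"


steady = [10.0] * 20
spiking = [10.0] * 12 + [40.0, 45.0, 50.0]
print(detect_status(steady), detect_status(spiking))
```

Accumulating over a window rather than testing single samples is meant to make the judgment less sensitive to one-off latency spikes; the window size and threshold would need tuning per deployment.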
- Analyzing the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs includes:
- Constructing the component abnormal subgraph from the component state information key-value pairs includes:
- An independent component node is a component node that neither depends on any other component node nor is depended on by any other component node; such component nodes are deleted to construct the component abnormal subgraph G'.
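As an illustration of how G' might be built, the sketch below starts from a dependency graph and the per-component status, keeps the abnormal components, and then deletes independent nodes. Two points are assumptions rather than statements from the text: that the subgraph is restricted to components whose status is abnormal, and that "independent" means a node with no dependency edge in either direction. The function name and the dictionary layout are hypothetical.

```python
def build_abnormal_subgraph(dependencies, status):
    """Build a component abnormal subgraph G'.

    dependencies: dict mapping a component ID to the set of component IDs
                  it depends on (one directed edge per dependency).
    status:       dict mapping a component ID to "abnormal" or "normal".
    Returns a dict of the same shape as `dependencies`.
    """
    abnormal = {cid for cid, state in status.items() if state == "abnormal"}
    # Assumption: keep only edges whose two endpoints are both abnormal.
    sub = {
        cid: {dep for dep in dependencies.get(cid, set()) if dep in abnormal}
        for cid in abnormal
    }
    # Nodes that some other abnormal node depends on.
    depended_on = {dep for deps in sub.values() for dep in deps}
    # Delete independent nodes: no outgoing and no incoming dependency edge.
    return {cid: deps for cid, deps in sub.items() if deps or cid in depended_on}


deps = {"web": {"api"}, "api": {"db"}, "db": set(), "cache": set()}
state = {"web": "abnormal", "api": "abnormal", "db": "abnormal", "cache": "abnormal"}
print(build_abnormal_subgraph(deps, state))
```

In this example `cache` is abnormal but has no dependency relationship with the other abnormal components, so it is dropped from G'.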
- Locating the container component node where the abnormality occurs according to the component abnormal subgraph includes:
- The method further includes:
- S105: Determine whether the MIDs of the abnormal root nodes are the same; if they are, determine that the physical machine with that MID is abnormal.
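Step S105 can be sketched as follows. The text does not define "abnormal root node" at this point, so the sketch assumes a root is a node of G' that depends on no other node of G', that is, a point from which the abnormal state could have started to propagate. The names `abnormal_roots`, `faulty_machine`, and `mid_of` are hypothetical.

```python
def abnormal_roots(subgraph):
    """Return the assumed root nodes of G': nodes with no dependencies in G'."""
    return sorted(cid for cid, deps in subgraph.items() if not deps)


def faulty_machine(roots, mid_of):
    """Return the shared MID if every abnormal root runs on one machine.

    mid_of maps a component ID (CID) to the ID of its physical machine (MID).
    Returns None when there are no roots or the roots span several machines.
    """
    mids = {mid_of[cid] for cid in roots}
    return mids.pop() if len(mids) == 1 else None


subgraph = {
    "web": {"api"}, "api": {"db"}, "db": set(),
    "worker": {"queue"}, "queue": set(),
}
mid_of = {"web": "m3", "api": "m2", "db": "m1", "worker": "m4", "queue": "m1"}
roots = abnormal_roots(subgraph)
print(roots, faulty_machine(roots, mid_of))
```

Note that with a single root the "same MID" condition holds trivially; a deployment might choose to require at least two roots before blaming the physical machine, which the text neither requires nor rules out.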
- Obtaining the TCP delay information of each container component includes:
- An abnormality detection and positioning method applied to a distributed container cloud platform includes the following steps:
- The service management program submits an abnormality location request to the service agent program;
- After receiving the abnormality location request, the service agent uses the tcprstat software to collect the TCP delay information of the component.
- tcprstat is a free and open-source TCP-layer analysis tool.
- It collects statistics on request response times and can be used for ad hoc analysis as well as for scheduled, periodic information collection;
- The TCP delay information of the component collected by the service agent is analyzed to obtain the status information of the component and generate the component status information key-value pair <CID:MID:Status>;
- The service agent submits the component status information key-value pair <CID:MID:Status> to the service management program;
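The key-value pair <CID:MID:Status> exchanged between the service agent and the service management program could be represented as below. This is a sketch only: the text gives the field order but no wire format, so the angle-bracket, colon-separated string encoding, the class name `ComponentStatus`, and the assumption that IDs contain no colon are all hypothetical.

```python
from typing import NamedTuple


class ComponentStatus(NamedTuple):
    cid: str     # container component ID
    mid: str     # ID of the physical machine hosting the component
    status: str  # e.g. "normal" or "abnormal"

    def encode(self) -> str:
        """Serialize to the assumed textual form <CID:MID:Status>."""
        return f"<{self.cid}:{self.mid}:{self.status}>"

    @classmethod
    def decode(cls, text: str) -> "ComponentStatus":
        """Parse the assumed textual form; IDs must not contain ':'."""
        if not (text.startswith("<") and text.endswith(">")):
            raise ValueError(f"not a status pair: {text!r}")
        cid, mid, status = text[1:-1].split(":")
        return cls(cid, mid, status)


pair = ComponentStatus("c42", "m7", "abnormal")
print(pair.encode(), ComponentStatus.decode(pair.encode()))
```

Carrying the MID alongside the CID is what lets the management side later compare the machines of the abnormal root nodes without a separate lookup.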
- An anomaly detection and positioning device applied to a distributed container cloud platform includes:
- The delay information obtaining unit 201 is used to obtain the TCP delay information of each container component;
- The state information obtaining unit 202 is configured to analyze the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, obtain the state information of each component, and generate component state information key-value pairs;
- The component abnormal subgraph construction unit 203 is configured to construct a component abnormal subgraph from the component state information key-value pairs;
- The abnormal location unit 204 is configured to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
- The abnormality detection and positioning device for the distributed container cloud platform uses TCP delay information to judge abnormal states, which reduces the cost of data collection and improves the accuracy and timeliness of the judgment.
- A component abnormal subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormality location.
- The device further includes:
- The abnormality determination unit 205 is used to determine whether the MIDs of the abnormal root nodes are the same; if they are, the physical machine with that MID is determined to be abnormal.
- A storage medium stores a program file capable of implementing any of the above anomaly detection and positioning methods applied to a distributed container cloud platform.
- A processor is used to run a program which, when running, executes any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform.
- The disclosed technical content may be implemented in other ways.
- The system embodiments described above are only schematic.
- For example, the division of units may be a division of logical functions.
- In actual implementation there may be another division manner: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
- The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- Each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
- If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- The technical solution of the present invention, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present invention.
- The aforementioned storage media include USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention relates to an anomaly detection and positioning method and apparatus applied to a distributed container cloud platform. According to the method and apparatus, TCP delay information of each container component is first acquired (S101); the TCP delay information of each container component is analyzed by means of a sliding window accumulation and anomaly detection algorithm, state information of each component is acquired, and component state information key-value pairs are generated (S102); a component abnormal subgraph is constructed from the component state information key-value pairs (S103); and the container component node where the abnormality occurs is located according to the component abnormal subgraph (S104). The method and apparatus determine an abnormal state using TCP delay information, which reduces the cost of data collection and improves the accuracy and timeliness of abnormal-state determination. In addition, taking into account the interference between components and between a physical machine and components, a component abnormal subgraph is provided to express the propagation of an abnormal state, which improves the accuracy of anomaly positioning.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811537333.2 | 2018-12-15 | ||
CN201811537333.2A CN109800052B (zh) | 2018-12-15 | 2018-12-15 | Anomaly detection and positioning method and device applied to a distributed container cloud platform |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020119627A1 (fr) | 2020-06-18 |
Family
ID=66556890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/123989 WO2020119627A1 (fr) | 2018-12-15 | 2019-12-09 | Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109800052B (fr) |
WO (1) | WO2020119627A1 (fr) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800052B (zh) * | 2018-12-15 | 2020-11-24 | 深圳先进技术研究院 | Anomaly detection and positioning method and device applied to a distributed container cloud platform |
WO2021109048A1 (fr) * | 2019-12-05 | 2021-06-10 | 深圳先进技术研究院 | Container cloud platform anomaly detection method and system, and electronic device |
CN111061586B (zh) * | 2019-12-05 | 2023-09-19 | 深圳先进技术研究院 | Container cloud platform anomaly detection method, system, and electronic device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5796937A (en) * | 1994-09-29 | 1998-08-18 | Fujitsu Limited | Method of and apparatus for dealing with processor abnormality in multiprocessor system |
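CN101505243A (zh) * | 2009-03-10 | 2009-08-12 | 中国科学院软件研究所 | Web application performance anomaly detection method |
CN105242971A (zh) * | 2015-10-20 | 2016-01-13 | 北京航空航天大学 | Memory object management method and system for stream processing systems |
CN108306879A (zh) * | 2018-01-30 | 2018-07-20 | 福建师范大学 | Distributed real-time anomaly localization method based on Web session flows |
CN109800052A (zh) * | 2018-12-15 | 2019-05-24 | 深圳先进技术研究院 | Anomaly detection and positioning method and device applied to a distributed container cloud platform |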
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10832150B2 (en) * | 2016-07-28 | 2020-11-10 | International Business Machines Corporation | Optimized re-training for analytic models |
CN106487633B (zh) * | 2016-10-11 | 2019-12-06 | 中国银联股份有限公司 | Method and device for monitoring virtual machine anomalies |
US20180124080A1 (en) * | 2016-11-02 | 2018-05-03 | Qualcomm Incorporated | Methods and Systems for Anomaly Detection Using Functional Specifications Derived from Server Input/Output (I/O) Behavior |
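CN106776005B (zh) * | 2016-11-23 | 2019-12-13 | 华中科技大学 | Resource management system and method for containerized applications |
CN108306747B (zh) * | 2017-01-11 | 2021-07-23 | 阿里巴巴集团控股有限公司 | Cloud security detection method, apparatus, and electronic device |
CN107612787B (zh) * | 2017-11-06 | 2021-01-12 | 南京易捷思达软件科技有限公司 | Cloud host fault detection method based on the OpenStack open-source cloud platform |
CN108337108A (zh) * | 2017-12-28 | 2018-07-27 | 天津麒麟信息技术有限公司 | Automated cloud platform fault localization method based on correlation analysis |
CN108259241A (zh) * | 2018-01-11 | 2018-07-06 | 上海有云信息技术有限公司 | Anomaly localization method and device for a cloud platform monitoring system |
CN108491306A (zh) * | 2018-03-19 | 2018-09-04 | 广东电网有限责任公司珠海供电局 | Trustworthiness monitoring method and system based on an enterprise private cloud |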
-
2018
- 2018-12-15 CN CN201811537333.2A patent/CN109800052B/zh active Active
-
2019
- 2019-12-09 WO PCT/CN2019/123989 patent/WO2020119627A1/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN109800052B (zh) | 2020-11-24 |
CN109800052A (zh) | 2019-05-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11936663B2 (en) | System for monitoring and managing datacenters | |
US10560309B1 (en) | Identifying a root cause of alerts within virtualized computing environment monitoring system | |
US9471455B2 (en) | System, method, and computer program product for managing software updates | |
US11537940B2 (en) | Systems and methods for unsupervised anomaly detection using non-parametric tolerance intervals over a sliding window of t-digests | |
US8903995B1 (en) | Performance impact analysis of network change | |
WO2020119627A1 (fr) | Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform | |
US20120166625A1 (en) | Automatic baselining of business application service groups comprised of virtual machines | |
US20130067077A1 (en) | Promotion of performance parameters in distributed data processing environment | |
WO2020135806A1 (fr) | Operation and maintenance method and equipment applied to a data center | |
US10616078B1 (en) | Detecting deviating resources in a virtual environment | |
US20150019722A1 (en) | Determining, managing and deploying an application topology in a virtual environment | |
US20200220796A1 (en) | System monitoring with metrics correlation for data center | |
CN114208126A (zh) | Method and apparatus for configuring a cloud storage software appliance | |
US9400731B1 (en) | Forecasting server behavior | |
US9367418B2 (en) | Application monitoring | |
CN111865899B (zh) | Threat-driven collaborative collection method and device | |
CN113504996A (zh) | Load balancing detection method, apparatus, device, and storage medium | |
US9929921B2 (en) | Techniques for workload toxic mapping | |
US20230336447A1 (en) | Machine learning for metric collection | |
US20230195495A1 (en) | Realtime property based application discovery and clustering within computing environments | |
US20230161612A1 (en) | Realtime inductive application discovery based on delta flow changes within computing environments | |
US20230089305A1 (en) | Automated naming of an application/tier in a virtual computing environment | |
Zhao et al. | Scheduling Parallel Migration of Virtualized Services Under Time Constraints in Mobile Edge Clouds | |
US20230289202A1 (en) | Realtime application reconciliation within computing environments | |
CN111061586B (zh) | Container cloud platform anomaly detection method, system, and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19895582 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19895582 Country of ref document: EP Kind code of ref document: A1 |