WO2020119627A1 - Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform

Info

Publication number
WO2020119627A1
Authority
WO
WIPO (PCT)
Prior art keywords
component
abnormal
container
delay information
status
Prior art date
Application number
PCT/CN2019/123989
Other languages
English (en)
Chinese (zh)
Inventor
叶可江
卢澄志
须成忠
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-12-15
Filing date
2019-12-09
Publication date
2020-06-18
Application filed by 深圳先进技术研究院
Publication of WO2020119627A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines

Definitions

  • The invention relates to the field of container cloud platforms, and in particular to an abnormality detection and positioning method and device applied to a distributed container cloud platform.
  • Cloud computing, as a new type of service delivery model, has won the favor of industry and academia.
  • The key technology of cloud computing is virtualization.
  • By virtualizing all kinds of resources, cloud computing service providers can easily customize and deliver those resources to users, and many applications have gradually begun to migrate to cloud computing clusters.
  • Traditional virtualization technologies include KVM, Xen, etc.
  • Container technology is a lightweight operating-system-level virtualization technology. Whereas traditional virtualization technology virtualizes the hardware layer, container virtualization stays at the operating system layer, which makes containers very convenient to create, modify, or migrate.
  • Container technology was quickly adopted by cloud computing service providers. Because of these characteristics, users often run each component of an application in an independent container so that the application can be maintained conveniently and quickly, which results in a complicated internal structure of the container cloud. At the same time, the weak isolation of containers leads to serious interference between them. Once an abnormality occurs in one container, it spreads quickly and in turn affects different application components. Cloud service providers therefore need a method that can locate anomalies in the structurally complex application clusters built from containers.
  • An application deployed on a container cloud is often composed of hundreds or thousands of components, and the components depend on each other to form a complex graph with components as nodes. Graph theory can be used to locate the root cause of an anomaly in this complex graph. A cloud computing platform based on container technology is usually composed of thousands of physical machines, and each physical machine usually runs dozens of containers, so such a platform is more complicated than a traditional cloud computing platform. Compared with traditional virtual machines, containers are less well isolated and interfere with each other more seriously, so they are also more likely to affect each other.
  • Nguyen et al., in Chapter 3 of "Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures", proposed an online black-box anomaly location system to locate abnormal components.
  • The system uses virtual machine performance indicators to build a model of their normal fluctuation, identifies abnormally changing data points, and locates abnormal components by combining the timing of those changed data points with the dependencies between the components.
  • Although the system can detect and locate anomalies, it relies on performance indicators for anomaly detection and judgment, so for a complex distributed container cloud platform the overhead of monitoring those indicators would be huge.
  • The embodiments of the present invention provide an abnormality detection and positioning method and device applied to a distributed container cloud platform, to at least solve the technical problem that the traditional single-component-based abnormality detection method cannot be applied to a distributed container cloud.
  • An anomaly detection and positioning method applied to a distributed container cloud platform includes the following steps:
  • The TCP delay information of each container component is analyzed to obtain the status information of each component and generate component status information key-value pairs;
  • Analyzing the TCP delay information of each container component through sliding window accumulation and anomaly detection algorithms, to obtain the status information of each component and generate component status information key-value pairs, includes:
  • Constructing the component abnormal subgraph from the component status information key-value pairs includes:
  • An independent component node is a component node that neither depends on other component nodes nor is depended on by any other component node. Such component nodes are deleted to construct the component abnormal subgraph G'.
  • Locating the container component node where the abnormality occurs according to the component abnormal subgraph includes:
  • The method further includes:
  • Obtaining TCP delay information of each container component includes:
  • An anomaly detection and positioning device applied to a distributed container cloud platform includes:
  • The delay information obtaining unit is used to obtain TCP delay information of each container component;
  • The status information obtaining unit is used to analyze the TCP delay information of each container component through the sliding window accumulation and anomaly detection algorithm, obtain the status information of each component, and generate component status information key-value pairs;
  • The component abnormal subgraph construction unit is used to construct the component abnormal subgraph from the component status information key-value pairs;
  • The abnormal location unit is used to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
  • The device further includes:
  • The abnormality determination unit is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine numbered with that MID has an abnormality.
  • A storage medium stores a program file capable of implementing any of the above methods for anomaly detection and positioning applied to a distributed container cloud platform.
  • A processor is used to run a program, wherein, when the program is running, any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform is executed.
  • The abnormality detection and positioning method and device applied to the distributed container cloud platform in the embodiments of the present invention use TCP delay information to judge abnormal states, which reduces the overhead of data collection and improves the accuracy and real-time performance of abnormal state judgment.
  • A component abnormal subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
  • FIG. 1 is a flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;
  • FIG. 2 is a preferred flowchart of an anomaly detection and positioning method applied to a distributed container cloud platform according to the present invention;
  • FIG. 3 is a block diagram of an anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention;
  • FIG. 4 is a preferred module diagram of an anomaly detection and positioning device applied to a distributed container cloud platform according to the present invention.
  • A cloud computing system based on container technology is called a container cloud.
  • Because containers are lightweight, they are more convenient to deploy. Therefore, the internal composition of a container cloud is more complicated than that of traditional cloud computing platforms.
  • A container isolates system resources more weakly than a virtual machine does.
  • The interference between containers is relatively strong. Therefore, once a container in the container cloud becomes abnormal, the abnormality will spread quickly and affect the entire cluster.
  • The traditional single-component-based anomaly detection method is no longer suitable for distributed container cloud environments.
  • Existing technologies use performance indicators to analyze anomalies, which increases the cost of data collection. They also require a normal fluctuation model to be constructed. For complex container cloud platforms that change frequently, their accuracy is low and they lack real-time capability.
  • The invention provides an abnormality detection and positioning method and device applied to a distributed container cloud platform.
  • The method and the device can perform anomaly detection and location on a more complicated distributed container cloud platform, and at the same time improve the accuracy of anomaly location through the component abnormal subgraph.
  • An anomaly detection and positioning method applied to a distributed container cloud platform includes the following steps:
  • S102: Analyze the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, obtain the status information of each component, and generate component status information key-value pairs;
  • S104: Locate the container component node where the abnormality occurs according to the component abnormal subgraph.
  • The method uses TCP delay information to judge abnormal states, which reduces the cost of data collection and improves the accuracy and real-time performance of abnormal state judgment.
  • A component abnormal subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
  • Analyzing the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, to obtain the status information of each component and generate component status information key-value pairs, includes:
  • Constructing the component abnormal subgraph from the component status information key-value pairs includes:
  • An independent component node is a component node that neither depends on other component nodes nor is depended on by any other component node; such component nodes are deleted to construct the component abnormal subgraph G'.
  • Locating the container component node where the abnormality occurs according to the component abnormal subgraph includes:
  • The method further includes:
  • S105: Determine whether the MIDs of the abnormal root nodes are the same. If they are the same, determine that the physical machine numbered with that MID is abnormal.
  • Obtaining TCP delay information of each container component includes:
  • An abnormality detection and positioning method applied to a distributed container cloud platform includes the following steps:
  • The service management program submits an anomaly location request to the service agent program;
  • After receiving the anomaly location request, the service agent uses the tcprstat software to collect TCP delay information of the component.
  • tcprstat is a free and open-source TCP-layer analysis tool.
  • It collects statistics on request response times and can be used for ad hoc analysis as well as for information collection in scheduled tasks;
  • The TCP delay information of the component collected by the service agent is analyzed to obtain the status information of the component and generate the component status information key-value pair <CID:MID:Status>;
  • The service agent submits the component status information key-value pair <CID:MID:Status> to the service management program;
  • An anomaly detection and positioning device applied to a distributed container cloud platform includes:
  • The delay information obtaining unit 201 is used to obtain TCP delay information of each container component;
  • The status information obtaining unit 202 is configured to analyze the TCP delay information of each container component through a sliding window accumulation and anomaly detection algorithm, obtain the status information of each component, and generate component status information key-value pairs;
  • The component abnormal subgraph construction unit 203 is configured to construct a component abnormal subgraph from the component status information key-value pairs;
  • The abnormal location unit 204 is configured to locate the container component node where the abnormality occurs according to the component abnormal subgraph.
  • The abnormality detection and positioning device for the distributed container cloud platform uses TCP delay information to judge abnormal states, which reduces the cost of data collection and improves the accuracy and real-time performance of abnormal state judgment.
  • A component abnormal subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of abnormal location.
  • The device further includes:
  • The abnormality determination unit 205 is used to determine whether the MIDs of the abnormal root nodes are the same. If they are the same, it is determined that the physical machine numbered with that MID is abnormal.
  • A storage medium stores a program file capable of implementing any of the above methods for anomaly detection and positioning applied to a distributed container cloud platform.
  • A processor is used to run a program, wherein, when the program is running, any one of the foregoing abnormality detection and positioning methods applied to a distributed container cloud platform is executed.
  • The disclosed technical content may be implemented in other ways.
  • The system embodiments described above are only schematic.
  • The division of units may be a division of logical functions.
  • There may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The displayed or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • Each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit may be implemented in the form of hardware or of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present invention, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present invention.
  • The aforementioned storage media include: USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, optical disk, and other media that can store program code.
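
The status-judgment step described above (analyzing each component's TCP delay over a sliding window and emitting a `<CID:MID:Status>` pair) might be sketched as follows. The text does not spell out the accumulation or detection rule, so this sketch swaps in a simple mean-plus-k-sigma test over a fixed-size window; the function names, the `window`, `k`, and `floor_ms` parameters, and the `"normal"`/`"abnormal"` status strings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_status(delays_ms, window=5, k=3.0, floor_ms=1.0):
    """Return "abnormal" if a sample exceeds mean + k*std of the preceding window.

    Assumed rule, not taken from the patent text: the threshold is derived
    from the last `window` samples, and `floor_ms` keeps a perfectly flat
    window from flagging tiny jitter.
    """
    recent = deque(maxlen=window)  # sliding window of the most recent delays
    for d in delays_ms:
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if d > mu + k * max(sigma, floor_ms):
                return "abnormal"
        recent.append(d)  # the oldest sample drops out automatically
    return "normal"


def status_pair(cid, mid, delays_ms):
    """Format the result in the <CID:MID:Status> shape used in the description."""
    return f"<{cid}:{mid}:{detect_status(delays_ms)}>"


print(status_pair("c1", "m1", [10, 11, 10, 12, 11, 95]))  # <c1:m1:abnormal>
```

A per-component window keeps the collected state small, which fits the stated goal of lower data-collection overhead than monitoring full performance indicators.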

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to an anomaly detection and positioning method and apparatus applied to a distributed container cloud platform. According to the method and apparatus, TCP delay information of each container component is first acquired (S101); the TCP delay information of each container component is analyzed by means of a sliding window accumulation and anomaly detection algorithm, status information of each component is acquired, and component status information key-value pairs are generated (S102); a component abnormal subgraph is constructed from the component status information key-value pairs (S103); and the container component node where the abnormality occurs is located according to the component abnormal subgraph (S104). The method and apparatus determine abnormal states using TCP delay information, which reduces the overhead of data collection and improves the accuracy and timeliness of abnormal state determination. In addition, taking into account the interference between components and between a physical machine and its components, a component abnormal subgraph is provided to express the propagation of an abnormal state, which improves the accuracy of anomaly positioning.
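
The subgraph construction, root location, and the description's further MID check (S103 to S105) might be sketched as below. The source does not give the detailed sub-steps, so the rules here are assumptions: edges are kept only between abnormal components, independent nodes (no dependency edge in either direction) are deleted to form G', a root is an abnormal node that depends on no other abnormal node, and a physical machine is reported only when all roots share one MID. The function names and the `deps`/`status` data shapes are hypothetical.

```python
def abnormal_subgraph(deps, status):
    """Build G' from dependencies and component status.

    deps:   {cid: set of cids it depends on}
    status: {cid: (mid, "normal" | "abnormal")}
    """
    abnormal = {c for c, (_, s) in status.items() if s == "abnormal"}
    # Keep only dependency edges whose two endpoints are both abnormal.
    sub = {c: {d for d in deps.get(c, ()) if d in abnormal} for c in abnormal}
    depended_on = {d for ds in sub.values() for d in ds}
    # Delete independent nodes: no outgoing and no incoming edge.
    return {c: ds for c, ds in sub.items() if ds or c in depended_on}


def abnormal_roots(sub):
    """Assumed rule: a root is an abnormal node with no abnormal dependency."""
    return sorted(c for c, ds in sub.items() if not ds)


def faulty_machine(roots, status):
    """Return the shared MID if every root runs on one machine, else None."""
    mids = {status[c][0] for c in roots}
    return mids.pop() if len(mids) == 1 else None


deps = {"web": {"api", "auth"}, "api": {"db"}, "db": set(), "cache": set()}
status = {
    "web": ("m1", "abnormal"), "api": ("m2", "abnormal"),
    "db": ("m2", "abnormal"), "cache": ("m3", "abnormal"),
    "auth": ("m4", "normal"),
}
sub = abnormal_subgraph(deps, status)
roots = abnormal_roots(sub)
print(roots, faulty_machine(roots, status))  # ['db'] m2
```

In this example `cache` is abnormal but has no dependency edge to any other abnormal component, so it is dropped from G', and the anomaly on `web` and `api` is traced back to `db`.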
PCT/CN2019/123989 2018-12-15 2019-12-09 Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform WO2020119627A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811537333.2 2018-12-15
CN201811537333.2A CN109800052B (zh) 2018-12-15 2018-12-15 应用于分布式容器云平台的异常检测与定位方法及装置

Publications (1)

Publication Number Publication Date
WO2020119627A1 2020-06-18

Family

ID=66556890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/123989 WO2020119627A1 (fr) 2018-12-15 2019-12-09 Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform

Country Status (2)

Country Link
CN (1) CN109800052B (fr)
WO (1) WO2020119627A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800052B (zh) * 2018-12-15 2020-11-24 深圳先进技术研究院 应用于分布式容器云平台的异常检测与定位方法及装置
WO2021109048A1 (fr) * 2019-12-05 2021-06-10 深圳先进技术研究院 Procédé et système de détection d'anomalie de plateforme en nuage de conteneur, et dispositif électronique
CN111061586B (zh) * 2019-12-05 2023-09-19 深圳先进技术研究院 一种容器云平台异常检测方法、系统及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5796937A (en) * 1994-09-29 1998-08-18 Fujitsu Limited Method of and apparatus for dealing with processor abnormality in multiprocessor system
CN101505243A (zh) * 2009-03-10 2009-08-12 中国科学院软件研究所 一种Web应用性能异常侦测方法
CN105242971A (zh) * 2015-10-20 2016-01-13 北京航空航天大学 面向流式处理系统的内存对象管理方法及系统
CN108306879A (zh) * 2018-01-30 2018-07-20 福建师范大学 基于Web会话流的分布式实时异常定位方法
CN109800052A (zh) * 2018-12-15 2019-05-24 深圳先进技术研究院 应用于分布式容器云平台的异常检测与定位方法及装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832150B2 (en) * 2016-07-28 2020-11-10 International Business Machines Corporation Optimized re-training for analytic models
CN106487633B (zh) * 2016-10-11 2019-12-06 中国银联股份有限公司 一种虚拟机异常的监测方法和装置
US20180124080A1 (en) * 2016-11-02 2018-05-03 Qualcomm Incorporated Methods and Systems for Anomaly Detection Using Functional Specifications Derived from Server Input/Output (I/O) Behavior
CN106776005B (zh) * 2016-11-23 2019-12-13 华中科技大学 一种面向容器化应用的资源管理系统及方法
CN108306747B (zh) * 2017-01-11 2021-07-23 阿里巴巴集团控股有限公司 一种云安全检测方法、装置和电子设备
CN107612787B (zh) * 2017-11-06 2021-01-12 南京易捷思达软件科技有限公司 一种基于Openstack开源云平台的云主机故障检测方法
CN108337108A (zh) * 2017-12-28 2018-07-27 天津麒麟信息技术有限公司 一种基于关联分析的云平台故障自动化定位方法
CN108259241A (zh) * 2018-01-11 2018-07-06 上海有云信息技术有限公司 一种云平台监控系统的异常定位方法和装置
CN108491306A (zh) * 2018-03-19 2018-09-04 广东电网有限责任公司珠海供电局 一种基于企业私有云可信性监测方法及系统

Also Published As

Publication number Publication date
CN109800052B (zh) 2020-11-24
CN109800052A (zh) 2019-05-24

Similar Documents

Publication Publication Date Title
US11936663B2 (en) System for monitoring and managing datacenters
US10560309B1 (en) Identifying a root cause of alerts within virtualized computing environment monitoring system
US9471455B2 (en) System, method, and computer program product for managing software updates
US11537940B2 (en) Systems and methods for unsupervised anomaly detection using non-parametric tolerance intervals over a sliding window of t-digests
US8903995B1 (en) Performance impact analysis of network change
WO2020119627A1 (fr) Anomaly detection and positioning method and apparatus applied to a distributed container cloud platform
US20120166625A1 (en) Automatic baselining of business application service groups comprised of virtual machines
US20130067077A1 (en) Promotion of performance parameters in distributed data processing environment
WO2020135806A1 (fr) Procédé et équipement de maintenance d'opération appliqués à un centre de données
US10616078B1 (en) Detecting deviating resources in a virtual environment
US20150019722A1 (en) Determining, managing and deploying an application topology in a virtual environment
US20200220796A1 (en) System monitoring with metrics correlation for data center
CN114208126A (zh) 用于配置云存储软件设备的方法和装置
US9400731B1 (en) Forecasting server behavior
US9367418B2 (en) Application monitoring
CN111865899B (zh) 威胁驱动的协同采集方法及装置
CN113504996A (zh) 一种负载均衡检测方法、装置、设备及存储介质
US9929921B2 (en) Techniques for workload toxic mapping
US20230336447A1 (en) Machine learning for metric collection
US20230195495A1 (en) Realtime property based application discovery and clustering within computing environments
US20230161612A1 (en) Realtime inductive application discovery based on delta flow changes within computing environments
US20230089305A1 (en) Automated naming of an application/tier in a virtual computing environment
Zhao et al. Scheduling Parallel Migration of Virtualized Services Under Time Constraints in Mobile Edge Clouds
US20230289202A1 (en) Realtime application reconciliation within computing environments
CN111061586B (zh) 一种容器云平台异常检测方法、系统及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 19895582; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 19895582; Country of ref document: EP; Kind code of ref document: A1