CN109800052A

CN109800052A - Abnormality detection and localization method and device applied to distributed container cloud platform

Info

Publication number: CN109800052A
Application number: CN201811537333.2A
Authority: CN
Inventors: 叶可江; 卢澄志; 须成忠
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2018-12-15
Filing date: 2018-12-15
Publication date: 2019-05-24
Anticipated expiration: 2038-12-15
Also published as: WO2020119627A1; CN109800052B

Abstract

The invention relates to the field of container cloud platforms, and in particular to an abnormality detection and localization method and device applied to a distributed container cloud platform. The method and device first obtain the TCP delay information of each container component; analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs; construct a component anomaly subgraph from the component status information key-value pairs; and locate the anomalous container component node according to the component anomaly subgraph. The method and device use TCP delay information to judge the abnormal state, which reduces the overhead of data collection and improves the accuracy and real-time performance of abnormal-state judgment. In addition, taking into account the interference among components and between physical machines and components, a component anomaly subgraph is proposed to represent the propagation of anomalies, which improves the accuracy of anomaly localization.

Description

Abnormality detection and localization method and device applied to distributed container cloud platform

Technical field

The present invention relates to the field of container cloud platforms, and in particular to an abnormality detection and localization method and device applied to a distributed container cloud platform.

Background art

Cloud computing, as a new mode of service delivery, has won the favor of both industry and academia. The key technology of cloud computing is virtualization: by virtualizing all kinds of resources, cloud service providers can easily customize resources and deliver them to users, and many applications have gradually begun to migrate to cloud computing clusters. Traditional virtualization technologies include KVM, Xen and the like. However, traditional virtualization is heavyweight, so creating, modifying or migrating an individual component of an application cluster is complicated, and cloud service providers therefore need a more lightweight virtualization technology. Container technology is a lightweight, operating-system-level virtualization technology. Whereas traditional virtualization virtualizes the hardware layer, container virtualization stays at the operating system layer, so containers are very convenient to create, modify and migrate. Container technology has quickly been adopted by all kinds of cloud service providers. Because of these features, users deploying an application often run each component in a separate container to make the application easier to maintain, which makes the internal structure of a container cloud complex. At the same time, the weaker isolation of containers makes interference between containers more serious. Once one container becomes abnormal, the anomaly propagates rapidly and affects different application components. Cloud service providers therefore need a method that can localize anomalies in complex application clusters built from containers.

Typically, an application deployed on a container cloud consists of hundreds of components that depend on one another, forming a complex graph whose nodes are the components. Using graph theory, the root cause of an anomaly can be located in this complex graph. A cloud computing platform based on container technology usually consists of thousands of physical machines, each of which typically runs dozens of containers, so such a platform is more complex than a traditional cloud computing platform. Compared with traditional virtual machines, containers are less well isolated and interfere with one another more severely, so they affect one another more easily. Moreover, because containers are deployed on the operating system of the physical machine, an anomaly in a physical machine can also cause the containers deployed on it to become abnormal. Existing anomaly detection and localization schemes lack an analysis of the correlations among components and between components and physical machines. They also detect and localize anomalies using performance metric data, which incurs large storage and transmission overhead, so they cannot adapt well to a distributed container cloud platform environment with severe interference.

Nguyen et al., in chapter 3 of "Insight: in-situ online service failure path inference in production computing infrastructures", propose an online black-box anomaly localization system for locating abnormal components. The system uses virtual machine performance metrics to build a model of the normal fluctuation of those metrics, identifies data points that change abnormally, and locates the abnormal component by combining the timing of the changed data points with the dependencies between components. Although the system can detect and localize anomalies, it relies on performance metrics for anomaly detection and judgment, so for a complex distributed container cloud platform the overhead of monitoring performance metrics would be very large.

Summary of the invention

The embodiments of the present invention provide an abnormality detection and localization method and device applied to a distributed container cloud platform, so as to at least solve the technical problem that traditional single-component-based anomaly detection methods are not applicable to distributed container clouds.

According to an embodiment of the present invention, an abnormality detection and localization method applied to a distributed container cloud platform is provided, comprising the following steps:

obtaining the TCP delay information of each container component;

analyzing the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs;

constructing a component anomaly subgraph from the component status information key-value pairs;

locating the anomalous container component node according to the component anomaly subgraph.

Further, analyzing the TCP delay information of each container component with the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs comprises:

initializing the sliding window $[L_0, L_k]$ of the component, and inputting TCP delay information until the number of TCP delay samples in the sliding window reaches $k$; initializing the average value and the cumulative sum $S_k = 0$; where $[L_0, L_k]$ is a queue storing the TCP delay information from 0 to $k$, and $k$ is an integer with $0 < k < 60$;

inputting further TCP delay information $L_t$, inserting $L_t$ into the sliding window and deleting the earliest TCP delay information $L_{t-k}$ from the sliding window; computing the average value within the window and computing the cumulative sum $S_t$; where $L_t$ is the TCP delay information at time $t$, and $t$ is an integer with $t > k$;

computing the early-warning value $S_{\mathrm{diff}} = S_{\max} - S_{\min}$, where $S_{\max}, S_{\min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum at the time of the earliest TCP delay information;

judging whether $S_{\mathrm{diff}}$ lies within the normal threshold range $[-h, h]$; if so, the state Status of the component is judged to be normal, otherwise the state Status of the component is judged to be abnormal;

generating the component status information key-value pair <CID:MID:Status> from the status information of each component, where CID denotes the number of the component, MID denotes the number of the physical machine on which the component is located, and Status denotes the state of the component; Status is 1 when the component is abnormal and 0 when it is normal.
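The detection steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `SlidingCusumDetector` and the default threshold `h=5.0` are hypothetical, and the update rule `S_t = S_{t-1} + (L_t - window mean)` is an assumed common cumulative-sum form, since the source text does not preserve the formulas for the average and the cumulative sum.

```python
from collections import deque
from statistics import fmean


class SlidingCusumDetector:
    """Sliding-window cumulative-sum detector for one component (illustrative)."""

    def __init__(self, k: int = 10, h: float = 5.0):
        if not 0 < k < 60:
            raise ValueError("k must satisfy 0 < k < 60")
        self.k = k
        self.h = h
        self.window = deque(maxlen=k)            # the latest k TCP delay samples
        self.sums = deque([0.0], maxlen=k + 1)   # S_{t-k} .. S_t, starting from S_k = 0

    def update(self, delay: float) -> int:
        """Feed one TCP delay sample; return 1 (abnormal) or 0 (normal)."""
        warming_up = len(self.window) < self.k
        self.window.append(delay)                # a full deque drops its oldest sample
        if warming_up:
            return 0                             # window still filling, S stays 0
        mean = fmean(self.window)
        # ASSUMED recurrence: S_t = S_{t-1} + (L_t - mean of the current window)
        self.sums.append(self.sums[-1] + (delay - mean))
        s_diff = max(self.sums) - min(self.sums)  # early-warning value S_diff
        return 0 if -self.h <= s_diff <= self.h else 1
```

With `k=3` and `h=2.0`, feeding four samples of 1.0 keeps the status at 0, and a following sample of 10.0 pushes the early-warning value outside `[-h, h]`, so the status becomes 1. Because `S_diff` is a maximum minus a minimum, it is never negative, so only the upper bound `h` is ever exceeded in this sketch.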

Further, constructing the component anomaly subgraph from the component status information key-value pairs comprises:

inputting the component dependency graph $G$, whose matrix representation is $G = (E_{ij})$, where $i$ and $j$ denote components in the application cluster and $E_{ij}$ denotes the dependency between component $i$ and component $j$: $E_{ij}$ is 1 if component $i$ depends on component $j$, and 0 otherwise;

traversing the component status information key-value pairs and, whenever Status is 0, deleting from the component dependency graph $G$ the row and column with $i$ = CID or $j$ = CID; when the traversal is finished, the component dependency subgraph $G_1$ is obtained;

judging whether there are independent component nodes in the component dependency subgraph $G_1$, an independent component node being a component node that neither depends on any other component node nor is depended on by any other component node; deleting such component nodes yields the component anomaly subgraph $G'$.
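The construction above can be illustrated with a short Python sketch. It is an assumption-laden example rather than the source's implementation: the function name `build_anomaly_subgraph` is hypothetical, and the dependency matrix is represented as a dictionary that maps each component ID to the set of components it depends on (every component is assumed to appear as a key).

```python
from typing import Dict, Set


def build_anomaly_subgraph(deps: Dict[str, Set[str]],
                           status: Dict[str, int]) -> Dict[str, Set[str]]:
    """deps[i] is the set of components that i depends on (E_ij == 1).
    status[cid] is 1 for abnormal and 0 for normal."""
    # Step 1: delete every normal component (Status == 0), giving subgraph G1.
    abnormal = {cid for cid in deps if status.get(cid, 0) == 1}
    g1 = {cid: deps[cid] & abnormal for cid in abnormal}
    # Step 2: delete independent nodes, i.e. nodes with no outgoing edge
    # and no incoming edge in G1, giving the anomaly subgraph G'.
    depended_on = set().union(*g1.values())
    return {cid: targets for cid, targets in g1.items()
            if targets or cid in depended_on}
```

For a chain `web -> api -> db` plus an unrelated `cache`, with `web`, `api` and `cache` abnormal and `db` normal, the sketch keeps the edge `web -> api` and drops `cache`, because `cache` is an independent node in the dependency subgraph.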

Further, locating the anomalous container component node according to the component anomaly subgraph comprises:

traversing the component anomaly subgraph $G'$ and computing $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is an anomalous root node.
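The root-node test above amounts to finding the rows of the subgraph matrix whose entries sum to zero. A minimal Python sketch follows; the function name `find_anomaly_roots` and the convention that row `i` of the matrix corresponds to `nodes[i]` are assumptions made for illustration.

```python
from typing import List


def find_anomaly_roots(nodes: List[str], E: List[List[int]]) -> List[str]:
    """nodes lists the component IDs in G'; E[i][j] == 1 when nodes[i]
    depends on nodes[j]. A node whose row sums to 0 (delta_i == 0) depends
    on no other abnormal component and is reported as an anomalous root."""
    return [nodes[i] for i, row in enumerate(E) if sum(row) == 0]
```

For the two-node subgraph in which `web` depends on `api`, the matrix is `[[0, 1], [0, 0]]`, and only `api` has a zero row sum, so `api` is reported as the root.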

Further, after locating the anomalous container component node according to the component anomaly subgraph, the method further comprises:

judging whether the MIDs of all anomalous root nodes are identical; if they are, the physical machine numbered MID is judged to be abnormal.
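The physical-machine check can be sketched as below. The function name `faulty_machine` and the dictionary input are hypothetical, and the sketch applies a literal reading of the rule, under which a single anomalous root node also counts as "all MIDs identical".

```python
from typing import Dict, Optional


def faulty_machine(root_mids: Dict[str, str]) -> Optional[str]:
    """root_mids maps each anomalous root CID to the MID of its physical machine.
    Return the shared MID when every root runs on the same machine, else None."""
    mids = set(root_mids.values())
    return mids.pop() if len(mids) == 1 else None
```

If the roots `api` and `db` both run on machine `m1`, the function returns `m1`; if they run on different machines, it returns `None` and the anomaly is attributed to the components themselves.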

Further, obtaining the TCP delay information of each container component comprises:

collecting the TCP delay information of each component using the software tcprstat.

According to another embodiment of the present invention, an abnormality detection and localization device applied to a distributed container cloud platform is provided, comprising:

a delay information acquisition unit, configured to obtain the TCP delay information of each container component;

a status information acquisition unit, configured to analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component and generate component status information key-value pairs;

a component anomaly subgraph construction unit, configured to construct a component anomaly subgraph from the component status information key-value pairs;

an anomaly localization unit, configured to locate the anomalous container component node according to the component anomaly subgraph.

Further, the device further comprises:

an anomaly judging unit, configured to judge whether the MIDs of all anomalous root nodes are identical and, if they are, to judge that the physical machine numbered MID is abnormal.

A storage medium storing a program file capable of implementing any one of the above abnormality detection and localization methods applied to a distributed container cloud platform.

A processor configured to run a program, wherein the program, when run, executes any one of the above abnormality detection and localization methods applied to a distributed container cloud platform.

The abnormality detection and localization method and device applied to a distributed container cloud platform in the embodiments of the present invention use TCP delay information to judge the abnormal state, which reduces the overhead of data collection and improves the accuracy and real-time performance of abnormal-state judgment. In addition, taking into account the interference among components and between physical machines and components, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of anomaly localization.

Brief description of the drawings

The drawings described herein are provided for a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flow chart of the abnormality detection and localization method of the present invention applied to a distributed container cloud platform;

Fig. 2 is a preferred flow chart of the abnormality detection and localization method of the present invention applied to a distributed container cloud platform;

Fig. 3 is a module diagram of the abnormality detection and localization device of the present invention applied to a distributed container cloud platform;

Fig. 4 is a preferred module diagram of the abnormality detection and localization device of the present invention applied to a distributed container cloud platform.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product or device.

As container technology matures, cloud computing systems based on container technology, i.e. container clouds, have gradually begun to replace traditional virtual-machine-based cloud computing systems. Because containers are lightweight, they are more convenient to deploy, and the internal composition of a container cloud is therefore more complex than that of a traditional cloud computing platform. In addition, containers isolate system resources less strongly than virtual machines, and multiple containers run on the same physical host, so interference between containers is comparatively strong. Consequently, once a container inside the container cloud becomes abnormal, the anomaly propagates rapidly and affects the entire cluster. Because of the complex internal environment of the container cloud, traditional single-component-based anomaly detection methods are no longer suitable for a distributed container cloud environment. The prior art analyzes anomalies using performance metrics, which increases the overhead of data collection, and it also requires building a normal fluctuation model, which has low accuracy and lacks real-time capability on container cloud platforms whose fluctuations are frequent and complex.

For container cloud platforms, the present invention provides an abnormality detection and localization method and device applied to a distributed container cloud platform. The method and device can detect and localize anomalies in a more complex distributed container cloud platform, and the component anomaly subgraph improves the accuracy of anomaly localization.

Embodiment 1

According to an embodiment of the present invention, an abnormality detection and localization method applied to a distributed container cloud platform is provided. Referring to Fig. 1, the method comprises the following steps:

S101: obtaining the TCP delay information of each container component;

S102: analyzing the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs;

S103: constructing a component anomaly subgraph from the component status information key-value pairs;

S104: locating the anomalous container component node according to the component anomaly subgraph.

This method uses TCP delay information to judge the abnormal state, which reduces the overhead of data collection and improves the accuracy and real-time performance of abnormal-state judgment. In addition, taking into account the interference among components and between physical machines and components, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of anomaly localization.

As a preferred technical solution, analyzing the TCP delay information of each container component with the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs comprises:

initializing the sliding window $[L_0, L_k]$ of the component, and inputting TCP (Transmission Control Protocol) delay information until the number of TCP delay samples in the sliding window reaches $k$; initializing the average value and the cumulative sum $S_k = 0$, which is the initialization value, with $S_k = S_{k-1} = \dots = S_0 = 0$ and $L_k = L_{k-1} = \dots = L_0 = 0$; where $[L_0, L_k]$ is a queue storing the TCP delay information from 0 to $k$, the size of the queue is $k$, and $k$ is an input parameter that is an integer with $0 < k < 60$, usually taken as 10;

inputting further TCP delay information $L_t$, inserting $L_t$ into the sliding window and deleting the earliest TCP delay information $L_{t-k}$ from the sliding window; computing the average value within the window and computing the cumulative sum $S_t$; this is an iterative calculation, and when $t = k + 1$, $S_{t-1} = S_k = 0$; where $L_t$ is the TCP delay information at time $t$, and $t$ is an integer with $t > k$;

judging whether $S_{\mathrm{diff}}$ lies within the normal threshold range $[-h, h]$; if so, the state Status of the component is judged to be normal, otherwise the state Status of the component is judged to be abnormal; $h$ denotes the acceptable range of $S_{\mathrm{diff}}$ and is one of the input parameters.
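A small worked calculation may help make the iteration concrete. The source text does not preserve the formulas for the window average and the cumulative sum, so the recurrence below, with the window average written as $\bar{L}_t$, is an assumed common cumulative-sum form and not necessarily the formula of the original filing. The values $k = 3$, $h = 2$ and the delay samples are likewise invented for illustration.

```latex
\begin{aligned}
&\text{Assumed recurrence: } \bar{L}_t = \frac{1}{k}\sum_{i=t-k+1}^{t} L_i,
  \qquad S_t = S_{t-1} + \left(L_t - \bar{L}_t\right), \qquad S_k = 0. \\[4pt]
&k = 3,\; h = 2,\; (L_1, L_2, L_3) = (1, 1, 1) \;\Rightarrow\; S_3 = 0. \\
&L_4 = 1:\quad \bar{L}_4 = \tfrac{1 + 1 + 1}{3} = 1, \qquad S_4 = 0 + (1 - 1) = 0. \\
&L_5 = 10:\quad \bar{L}_5 = \tfrac{1 + 1 + 10}{3} = 4, \qquad S_5 = 0 + (10 - 4) = 6. \\
&S_{\mathrm{diff}} = \max\{S_2, \dots, S_5\} - \min\{S_2, \dots, S_5\} = 6 - 0 = 6 > h
  \;\Rightarrow\; \mathrm{Status} = 1.
\end{aligned}
```

In this example a single delay spike moves the cumulative sum far enough from its recent minimum that the early-warning value leaves $[-h, h]$, so the component is marked abnormal at $t = 5$.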

As a preferred technical solution, constructing the component anomaly subgraph from the component status information key-value pairs comprises:

judging whether there are independent component nodes in the component dependency subgraph $G_1$, an independent component node being a component node that neither depends on any other component node nor is depended on by any other component node; deleting such component nodes yields the component anomaly subgraph $G'$.

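As a preferred technical solution, locating the anomalous container component node according to the component anomaly subgraph comprises: traversing the component anomaly subgraph $G'$ and computing $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is an anomalous root node.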

As a preferred technical solution, referring to Fig. 2, after locating the anomalous container component node according to the component anomaly subgraph, the method further comprises:

S105: judging whether the MIDs of all anomalous root nodes are identical; if they are, the physical machine numbered MID is judged to be abnormal.

As a preferred technical solution, obtaining the TCP delay information of each container component comprises collecting the TCP delay information of each component using the software tcprstat.

The method is described in detail below with a specific embodiment. The abnormality detection and localization method of the present invention applied to a distributed container cloud platform comprises the following steps:

the service manager submits an anomaly localization request to the service agents;

after receiving the anomaly localization request, a service agent collects the TCP delay information of the components using the software tcprstat. tcprstat is a free, open-source TCP-layer analysis tool that computes statistics on the response time of requests; it can be used for ad hoc analysis or run as a scheduled task for information collection;

the service agent analyzes the collected TCP delay information of the components with the sliding-window cumulative-sum anomaly detection algorithm, obtains the status information Status of each component and generates the component status information key-value pair <CID:MID:Status>;

the service agent submits the component status information key-value pairs <CID:MID:Status> to the service manager;

after collecting all component status information key-value pairs, the service manager constructs the component anomaly subgraph $G'$;

the service manager traverses the component anomaly subgraph $G'$ and computes $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is an anomalous root node;

the service manager judges whether the MIDs of all anomalous root nodes are identical; if they are, the physical machine numbered MID is abnormal.
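The exchange of key-value pairs between service agents and the service manager can be sketched as follows. The text only gives the notation <CID:MID:Status>, so the literal string encoding, the names `ComponentStatus`, `encode`, `decode` and `collect`, and the assumption that IDs contain no colon are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class ComponentStatus(NamedTuple):
    cid: str     # component number
    mid: str     # number of the physical machine hosting the component
    status: int  # 1 = abnormal, 0 = normal


def encode(s: ComponentStatus) -> str:
    """Agent side: serialize one report as '<CID:MID:Status>' (assumed wire format)."""
    return f"<{s.cid}:{s.mid}:{s.status}>"


def decode(text: str) -> ComponentStatus:
    """Manager side: parse '<CID:MID:Status>'; assumes IDs contain no ':'."""
    cid, mid, status = text.strip("<>").split(":")
    return ComponentStatus(cid, mid, int(status))


def collect(reports: Iterable[str]) -> Dict[str, ComponentStatus]:
    """Manager side: index all received reports by component number."""
    return {s.cid: s for s in map(decode, reports)}
```

The manager could then pass the collected statuses to the subgraph construction step and use the `mid` field of the anomalous root nodes for the physical-machine check.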

Embodiment 2

According to another embodiment of the present invention, an abnormality detection and localization device applied to a distributed container cloud platform is provided. Referring to Fig. 3, the device comprises:

a delay information acquisition unit 201, configured to obtain the TCP delay information of each container component;

a status information acquisition unit 202, configured to analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component and generate component status information key-value pairs;

a component anomaly subgraph construction unit 203, configured to construct a component anomaly subgraph from the component status information key-value pairs;

an anomaly localization unit 204, configured to locate the anomalous container component node according to the component anomaly subgraph.
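One way to picture how units 201 to 204 cooperate is as a pipeline of four pluggable stages. The sketch below is only an illustration of that data flow: the class name `AnomalyLocatingDevice`, the field names and the type aliases are hypothetical, and each unit is reduced to a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Delays = Dict[str, List[float]]  # CID -> TCP delay samples
Status = Dict[str, int]          # CID -> 1 (abnormal) or 0 (normal)
Graph = Dict[str, Set[str]]      # CID -> CIDs it depends on


@dataclass
class AnomalyLocatingDevice:
    acquire_delays: Callable[[], Delays]        # role of unit 201
    detect_status: Callable[[Delays], Status]   # role of unit 202
    build_subgraph: Callable[[Status], Graph]   # role of unit 203
    locate_roots: Callable[[Graph], List[str]]  # role of unit 204

    def run(self) -> List[str]:
        """Pass the output of each unit to the next and return the root CIDs."""
        delays = self.acquire_delays()
        status = self.detect_status(delays)
        subgraph = self.build_subgraph(status)
        return self.locate_roots(subgraph)
```

Keeping the units as separate callables mirrors the unit division in the text and lets each stage be replaced or tested on its own.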

The abnormality detection and localization device of the present invention for a distributed container cloud platform uses TCP delay information to judge the abnormal state, which reduces the overhead of data collection and improves the accuracy and real-time performance of abnormal-state judgment. In addition, taking into account the interference among components and between physical machines and components, a component anomaly subgraph is proposed to represent the propagation of the abnormal state, which improves the accuracy of anomaly localization.

As a preferred technical solution, referring to Fig. 4, the device further comprises:

an anomaly judging unit 205, configured to judge whether the MIDs of all anomalous root nodes are identical and, if they are, to judge that the physical machine numbered MID is abnormal.

Embodiment 3

Embodiment 4

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. An abnormality detection and localization method applied to a distributed container cloud platform, characterized by comprising the following steps:

obtaining the TCP delay information of each container component;

analyzing the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs;

2. The method according to claim 1, characterized in that analyzing the TCP delay information of each container component with the sliding-window cumulative-sum anomaly detection algorithm to obtain the status information of each component and generate component status information key-value pairs comprises:

initializing the sliding window $[L_0, L_k]$ of the component, and inputting TCP delay information until the number of TCP delay samples in the sliding window reaches $k$; initializing the average value and the cumulative sum $S_k = 0$; where $[L_0, L_k]$ is a queue storing the TCP delay information from 0 to $k$, and $k$ is an integer with $0 < k < 60$;

inputting further TCP delay information $L_t$, inserting $L_t$ into the sliding window and deleting the earliest TCP delay information $L_{t-k}$ from the sliding window; computing the average value within the window and computing the cumulative sum $S_t$; where $L_t$ is the TCP delay information at time $t$, and $t$ is an integer with $t > k$;

computing the early-warning value $S_{\mathrm{diff}} = S_{\max} - S_{\min}$, where $S_{\max}, S_{\min} \in [S_{t-k}, S_t]$ and $S_{t-k}$ is the cumulative sum at the time of the earliest TCP delay information;

judging whether $S_{\mathrm{diff}}$ lies within the normal threshold range $[-h, h]$; if so, the state Status of the component is judged to be normal, otherwise the state Status of the component is judged to be abnormal;

generating the component status information key-value pair <CID:MID:Status> from the status information of each component, where CID denotes the number of the component, MID denotes the number of the physical machine on which the component is located, and Status denotes the state of the component; Status is 1 when the component is abnormal and 0 when it is normal.

3. The method according to claim 2, characterized in that constructing the component anomaly subgraph from the component status information key-value pairs comprises:

inputting the component dependency graph $G$, whose matrix representation is $G = (E_{ij})$, where $i$ and $j$ denote components in the application cluster and $E_{ij}$ denotes the dependency between component $i$ and component $j$: $E_{ij}$ is 1 if component $i$ depends on component $j$, and 0 otherwise;

traversing the component status information key-value pairs and, whenever Status is 0, deleting from the component dependency graph $G$ the row and column with $i$ = CID or $j$ = CID; when the traversal is finished, the component dependency subgraph $G_1$ is obtained;

judging whether there are independent component nodes in the component dependency subgraph $G_1$, an independent component node being a component node that neither depends on any other component node nor is depended on by any other component node; deleting such component nodes yields the component anomaly subgraph $G'$.

4. The method according to claim 3, characterized in that locating the anomalous container component node according to the component anomaly subgraph comprises:

traversing the component anomaly subgraph $G'$ and computing $\delta_i = \sum_{j \in G'} E_{ij}$; if $\delta_i = 0$, component node $i$ is an anomalous root node.

5. The method according to claim 4, characterized in that, after locating the anomalous container component node according to the component anomaly subgraph, the method further comprises:

judging whether the MIDs of all anomalous root nodes are identical; if they are, the physical machine numbered MID is judged to be abnormal.

6. The method according to claim 1, characterized in that obtaining the TCP delay information of each container component comprises collecting the TCP delay information of each component using the software tcprstat.

7. An abnormality detection and localization device applied to a distributed container cloud platform, characterized by comprising:

a status information acquisition unit, configured to analyze the TCP delay information of each container component with a sliding-window cumulative-sum anomaly detection algorithm, obtain the status information of each component and generate component status information key-value pairs;

8. The device according to claim 7, characterized in that the device further comprises:

an anomaly judging unit, configured to judge whether the MIDs of all anomalous root nodes are identical and, if they are, to judge that the physical machine numbered MID is abnormal.

9. A storage medium, characterized in that the storage medium stores a program file capable of implementing the abnormality detection and localization method applied to a distributed container cloud platform according to any one of claims 1 to 6.

10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the abnormality detection and localization method applied to a distributed container cloud platform according to any one of claims 1 to 6.