CN103428026B - Method and system for problem determination and diagnosis in shared dynamic clouds - Google Patents


Info

Publication number
CN103428026B
Authority
CN
China
Prior art keywords
metric
event
virtual machine
deviation
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310174315.3A
Other languages
Chinese (zh)
Other versions
CN103428026A (en)
Inventor
P. Jayachandran
B. Sharma
A. Verma
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Priority claimed from US13/470,589 (external priority, patent US8862727B2)
Application filed by International Business Machines Corp
Publication of CN103428026A
Application granted
Publication of CN103428026B

Abstract

The present invention relates to a method and system for problem determination and diagnosis in a shared dynamic cloud. The method includes: monitoring each virtual machine and physical server in the shared dynamic cloud environment for at least one metric; identifying a symptom of a problem and generating an event based on the monitoring; analyzing the event to determine a deviation from normal behavior; and classifying the event as a cloud-based anomaly or an application fault based on existing knowledge.

Description

Method and system for problem determination and diagnosis in shared dynamic clouds
Technical field
Embodiments of the invention generally relate to information technology and, more particularly, to virtualization technology.
Background
Large-scale data centers experience frequent faults, which are a major contributor to data center management costs. Data centers generate large amounts of monitoring data, but detecting faults and their root causes from this data is difficult. Typically, this task is performed by manually observing the data against predefined thresholds.
Additionally, virtualized cloud environments introduce new challenges, including shared resources, a dynamic operating environment in which virtual machines (VMs) are migrated and resized, varying workloads, and so on. VMs that share resources contend for cache, disk and network resources, and it is challenging to distinguish performance issues caused by contention from application faults that span several resources. The dynamism also makes it challenging to distinguish performance anomalies from workload changes.
Virtualization is increasingly used in emerging cloud systems. The various anomalies or faults that occur in such large-scale setups significantly increase overall management cost and degrade the performance of the underlying applications. Accordingly, a need exists for efficient fault management techniques for dynamic shared clouds that can distinguish cloud-related anomalies from application faults.
Summary of the invention
In one aspect of the invention, techniques for problem determination and diagnosis in a shared dynamic cloud environment are provided. An exemplary computer-implemented method for problem determination and diagnosis in a shared dynamic cloud can include the steps of: monitoring each virtual machine and physical server in the shared dynamic cloud environment for at least one metric; identifying a symptom of a problem and generating an event based on the monitoring; analyzing the event to determine a deviation from normal behavior; and classifying the event as a cloud-based anomaly or an application fault based on existing knowledge.
In another aspect of the invention, a system for problem determination and diagnosis in a shared dynamic cloud environment is provided. The system includes a memory, at least one processor coupled to the memory, and at least one distinct software module, each distinct software module being embodied on a tangible computer-readable medium. The software modules include: a monitoring engine module, executing on the processor, for monitoring each virtual machine and physical server in the shared dynamic cloud environment for at least one metric and outputting a monitoring data time series corresponding to each metric; an event generation engine module, executing on the processor, for identifying a symptom of a problem and generating an event based on the monitoring; a problem determination engine module, executing on the processor, for analyzing the event to determine and locate a deviation from normal behavior; and a diagnosis engine module, executing on the processor, for classifying the event as a cloud-based anomaly or an application fault based on existing knowledge.
In another aspect of the invention, a method for determining virtual machine behavior under multiple operating environments in a system includes: monitoring at least one resource at the level of each virtual machine in the system; monitoring the aggregate usage of each resource at the level of each physical host in the system; capturing multiple metrics governing the cumulative usage of each resource across all virtual machines on each physical host in the system; and analyzing the metrics to determine virtual machine behavior under the multiple operating environments in the system.
Another aspect of the invention or elements thereof can be implemented in the form of an article of manufacture tangibly embodying computer-readable instructions which, when executed, cause a computer to carry out a plurality of method steps as described herein. Furthermore, another aspect of the invention or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform the noted method steps. Yet further, another aspect of the invention or elements thereof can be implemented in the form of means for carrying out the method steps described herein, or elements thereof; the means can include (i) hardware module(s), (ii) software module(s), or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a tangible computer-readable storage medium (or multiple such media).
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
Brief description of the drawings
Fig. 1 is a schematic diagram illustrating a system architecture according to an embodiment of the invention;
Fig. 2 is a table illustrating example fault signature parameters according to an embodiment of the invention;
Fig. 3 is a schematic diagram illustrating fault signature examples according to an embodiment of the invention;
Fig. 4 is a table illustrating example system and application metrics according to an embodiment of the invention;
Fig. 5 is a table illustrating detected fault scenarios according to an embodiment of the invention;
Fig. 6 is a flow chart illustrating techniques for problem determination and diagnosis in a shared dynamic cloud environment according to an embodiment of the invention; and
Fig. 7 is a system diagram of an exemplary computer system on which at least one embodiment of the invention can be implemented.
Detailed description
As described herein, an aspect of the present invention includes problem determination and diagnosis in shared dynamic clouds. At least one embodiment of the invention includes a three-stage approach that identifies faults as early as possible with limited training or application knowledge. After initial detection and localization of the fault to the affected virtual machines and resources, a diagnosis engine uses expert-defined fault signatures to classify the fault into one of several types (cloud-related or application-related). As detailed herein, the architecture of at least one embodiment of the invention allows the system to be extended to new applications, monitored metrics and analysis techniques.
Accordingly, embodiments of the invention include distinguishing cloud-related anomalies from application faults. For example, cloud-related anomalies can include incorrect virtual machine (VM) sizing, impact due to sharing, and reconfiguration and/or migration. Non-cloud (application) faults can include, for example, a misconfigured application, software bugs, hardware failure, workload change, and application or operating system (OS) updates.
Additionally, as described herein, at least one embodiment of the invention facilitates adaptation to rapid changes in the environment by using operating context information together with monitored VM, resource and application metrics. Moreover, in an example embodiment of the invention, results from different analysis techniques can be combined to improve accuracy.
As further detailed herein, monitoring data can include monitoring various system and application metrics, such as central processing unit (CPU) and memory utilization, cache hit/miss rate, disk read/write, network utilization, page faults, context switches of each VM, end-to-end latency, and so on. Each data metric can be represented as a moving-average time series after preprocessing, and this standard format allows the problem detection techniques to extend to new metrics. Monitoring can also cover the operating environment, including the context in which an application executes, so as to capture co-located applications. The physical host's CPU, memory, cache, network usage and so on can also be used for characterization.
As noted, at least one embodiment of the invention includes a three-stage approach for problem determination and robust diagnosis. In such an example embodiment, the first stage identifies potential symptoms of anomalous behavior and generates events, with each event localized to the particular VM and resource involved. The second stage further analyzes each event by computing correlations across the other resources on the VM involved and, for the resource involved, across the other VMs running the same application. The third stage, diagnosis, identifies any significant deviation from normal application behavior in these correlation values in order to classify the fault as one of multiple cloud-related and/or application-related anomalies. Thus, the comprehensive diagnosis is triggered only when the lightweight event generation stage detects a deviation from normal behavior.
At least one embodiment of the invention can treat the application as a black box and requires limited or no training to handle new applications. Unlike many existing problem determination approaches that monitor only one or two resources (such as CPU and memory), at least one embodiment of the invention monitors a wide range of resources, including CPU, memory, cache, disk and network, which enables accurate detection and localization of multiple classes of faults.
Additionally, the system of at least one embodiment of the invention is automated, and manual intervention is needed only when an anomaly is flagged. The system indicates the potential source of the problem to aid recovery.
The diagnosis stage of at least one embodiment of the invention relies on expert knowledge to classify faults accurately. For example, expert knowledge can be provided through standardized fault signatures, which capture how each fault manifests as deviations from normal behavior in the various metric data. When anomalous behavior is detected, the nature of the deviation from normal behavior is matched against the available fault signatures to classify the fault. The fault signatures can be expressed in a standardized Extensible Markup Language (XML) format, and an expert can add or edit knowledge about fault scenarios for future classification.
Another concept described herein is that of the operating environment. To detect faults in the system accurately, at least one embodiment of the invention monitors various resources at the level of each VM and also monitors the aggregate usage of each resource at the physical host level. The operating environment captures metrics governing the cumulative usage of resources across all VMs on each physical host. The behavior of a particular VM may be normal under one operating environment but anomalous under another.
For example, consider a VM running an application whose latency exceeds the normal and/or acceptable latency. Memory utilization on this VM is very high, and the physical server hosting the VM shows very high page faults and cache misses. This information alone may not reveal the nature of the problem, because it could be caused by high workload, an application-related anomaly, or a live VM migration from one server to another. Analyzing the operating context helps identify the problem. A live VM migration manifests as very high page faults and cache misses on both the source and destination servers, while other VMs and servers hosting the same application are unaffected. The effect is also temporary: it lasts until the migration completes, after which application performance should return to normal. An application anomaly affects only the server hosting this VM, and other VMs running the same application are unaffected. An increased workload affects all VMs running the same application and is revealed by the correlations across VMs.
Fig. 1 is a schematic diagram illustrating a system architecture according to an embodiment of the invention. By way of illustration, Fig. 1 shows a cloud infrastructure 102 that includes multiple clusters. For example, one cluster is shown as server host 104, which includes VMs 106, 108 and 110, and another cluster is shown as server host 112, which includes VMs 114, 116 and 118.
Fig. 1 also shows the system components, including monitoring engine 120, event generation engine 132, problem determination engine 138 and diagnosis engine 146. As described herein, monitoring engine 120 monitors each virtual machine and physical server for various metrics (including CPU, memory, cache, network and disk resources). Event generation engine 132 identifies potential symptoms of a fault and generates events. Problem determination engine 138 analyzes the events to determine deviations from normal behavior, and diagnosis engine 146 classifies the anomaly based on expert knowledge. Each stage is described further below in connection with Fig. 1.
As noted, monitoring engine 120 collects and processes, for each virtual machine and physical server, various system and application metrics relating to CPU, memory, cache, network and disk resources. Specifically, system metric profiler module 122 collects metrics and passes them to data preprocessor module 126, which performs filtering and smoothing operations via submodule 128. Via submodule 130, preprocessing takes the sample mean over a moving window, removes erroneous outliers and smooths out any kinks in the data. Monitoring engine 120 also includes application metric profiler module 124 to collect application metrics.
Because anomalous behavior can only be identified as a trend and not from any single sample value, the algorithms in at least one embodiment of the invention operate on time series data. A data point (defined herein as a sequence of moving averages over a fixed time interval) forms the basic unit of input to the anomaly detection algorithms in at least one embodiment of the invention. Over time, multiple data points are generated for each monitored metric. The output of monitoring engine 120 is therefore a stream of data points, each derived from the monitoring data time series corresponding to a system or application metric. The system can also be adapted to monitor new metrics, provided they are represented as similar time series data.
Consider the following example of constructing data points from time series data. Suppose we start with the following 12 CPU utilization samples: {34.2, 26.8, 32, 27.7, 38.2, 29, 30.1, 28.3, 33.5, 31.1, 27.8, 39}. Next, moving averages are taken over a window (for example, w=3): {31, 33, 28.8, 32.6, 32.4, 29.1, 30.6, 30.9, 30.8, 32.6, 30.3, 29.6}. Data points are then extracted from the moving-average time series above. For example, if the interval length of each data point is set to k=4, the following three data points are obtained: [31, 33, 28.8, 32.6], [32.4, 29.1, 30.6, 30.9] and [30.8, 32.6, 30.3, 29.6].
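The construction of data points can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `moving_average` and `to_data_points` are hypothetical, and the window is assumed to be a plain sliding window over full windows only. Under that assumption, 12 samples with w=3 yield 10 averages, so the sketch does not attempt to reproduce the 12 averages listed above (the text does not say how window edges are handled); the chunking step is therefore shown on the listed averages themselves.

```python
def moving_average(samples, w):
    """Mean of every full window of w consecutive samples (assumed edge handling)."""
    return [sum(samples[i:i + w]) / w for i in range(len(samples) - w + 1)]


def to_data_points(series, k):
    """Split a series into consecutive, non-overlapping data points of length k."""
    return [series[i:i + k] for i in range(0, len(series) - k + 1, k)]


# Raw CPU utilization samples from the example in the text.
cpu_samples = [34.2, 26.8, 32, 27.7, 38.2, 29, 30.1, 28.3, 33.5, 31.1, 27.8, 39]
smoothed = moving_average(cpu_samples, w=3)

# Moving-average series exactly as listed in the text.
listed_averages = [31, 33, 28.8, 32.6, 32.4, 29.1, 30.6, 30.9, 30.8, 32.6, 30.3, 29.6]
data_points = to_data_points(listed_averages, k=4)
```

Keeping data points as fixed-length lists means any detector that accepts a list of numbers can be applied to any metric without metric-specific code.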
Note that system metrics are collected separately at the virtual machine level and at the physical host level. This is because some resources (such as CPU and memory) are typically partitioned among the VMs on a physical host, while other resources (such as cache) are shared. To capture the context or state of the physical host on which a particular VM runs, at least one embodiment of the invention defines the notion of an operating environment. As noted above, the operating environment captures metrics governing the cumulative usage of resources across all or multiple VMs on each physical host. For example, this can include the total CPU and memory utilization and the cache hit rate of each physical host (across all the VMs it hosts). For instance, when a live VM migration causes anomalous application behavior (such as larger application latency), the cache hits/misses and page faults on the physical server hosting the faulty VM can be observed to be anomalous (extreme) compared with those on the other physical servers hosting other VMs that run the same application.
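One way to picture the operating environment is as a per-host roll-up of per-VM readings. The sketch below is an assumption-laden illustration: the tuple layout `(host, vm, metric, value)` and the function name `operating_environment` are hypothetical, and it only sums additive metrics (a rate such as cache hit rate would need to be measured at the host or averaged instead).

```python
from collections import defaultdict


def operating_environment(vm_samples):
    """Sum per-VM readings into per-host totals.

    vm_samples: iterable of (host, vm, metric, value) tuples (assumed layout).
    Returns {host: {metric: total}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for host, _vm, metric, value in vm_samples:
        totals[host][metric] += value
    return {host: dict(metrics) for host, metrics in totals.items()}


samples = [
    ("host104", "vm106", "cpu_percent", 20.0),
    ("host104", "vm108", "cpu_percent", 35.5),
    ("host104", "vm106", "page_faults", 120),
    ("host112", "vm114", "cpu_percent", 10.0),
]
env = operating_environment(samples)
```

Comparing `env["host104"]` with `env["host112"]` is the kind of host-to-host comparison the migration example above relies on.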
As noted, event generation engine 132 identifies potential symptoms of anomalous behavior and triggers events carrying information about the specific VM(s) and resource(s) that deviate from normal behavior. The generation of an event does not directly indicate the presence of a fault; it only suggests the possibility of one. Further analysis may be needed to confirm whether a fault exists and, if so, to classify it. For example, a heavy workload of previously unseen intensity (which is not a fault) may trigger an event because the VM's memory utilization is very high. Further analysis (performed by problem determination engine 138) may reveal that this correlates with high utilization of other resources (such as CPU and disk) or of other VMs running the same application, and the event is therefore classified as a high-workload scenario rather than a fault.
To reiterate, the role of event generation engine 132 is to pinpoint the location of potential anomalous behavior so that the particular VM(s) and resource(s) can be analyzed further. This helps at least one embodiment of the invention scale to large systems that generate large amounts of monitoring data. In addition, event generation engine 132 uses one of several machine learning techniques to build a model of normal application behavior via model builder module 134. The model is used to detect deviations from normal behavior in the observed data and to trigger events, which are output to event analyzer module 136.
Model builder module 134 can implement modeling techniques (such as hidden Markov models (HMMs), nearest neighbor and k-means) as well as statistical techniques (such as regression). These modeling techniques attempt to measure, qualitatively and/or quantitatively, the degree of deviation between the data point being analyzed and past observations. The architecture of at least one embodiment of the invention allows any modeling technique to be plugged in as part of event generation engine 132.
For example, HMM-based modeling involves training an HMM on a large number (for example, hundreds) of data points so that the model captures the notion of normal behavior. Because application behavior may change with load or workload mix, at least one embodiment of the invention includes training multiple HMMs for different scenarios (for example, where knowledge about typical application workload scenarios is available). If this step is to be fast with low overhead, the number of HMMs created can be limited. A new test data point can be fed to the HMM to determine how well it conforms to the model. An advantage of this approach is that the HMM captures the sequence of data samples and not merely the set of samples that make up the data point.
Additionally, the nearest-neighbor technique involves computing a distance metric between the data point under consideration and a given set of model data points. The set of model data points can be similar to the training set used for HMM modeling, or it can be a certain number of data points selected from the recent past. The latter approach can compare the application's footprint over time and can detect changes at various time scales. A larger distance indicates a larger deviation from normal or expected behavior. Note that this technique does not require any form of training.
The distance metric can be chosen from several alternatives: a simple vector difference between corresponding samples in the two data points, the difference between aggregate statistics (mean and standard deviation) computed from the individual samples, the distance in Euclidean space, any combination of these measures, and so on. Unlike the HMM, the distance metric provides a quantitative measure of the degree of deviation observed in the system.
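A minimal sketch of the nearest-neighbor idea, under stated assumptions: data points are equal-length lists of numbers, and the function names (`euclidean`, `mean_std_difference`, `nearest_neighbor_distance`) are hypothetical. Summing the mean and standard-deviation gaps in `mean_std_difference` is one arbitrary way to combine aggregate statistics, chosen here only for illustration.

```python
import math
import statistics


def euclidean(p, q):
    """Distance between two equal-length data points in Euclidean space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_std_difference(p, q):
    """Gap between aggregate statistics (mean and standard deviation)."""
    return (abs(statistics.fmean(p) - statistics.fmean(q))
            + abs(statistics.pstdev(p) - statistics.pstdev(q)))


def nearest_neighbor_distance(point, model_points, metric=euclidean):
    """Distance from a new data point to its closest model data point."""
    return min(metric(point, m) for m in model_points)


# Model set: data points taken from the recent past (no training step needed).
model = [[31, 33, 28.8, 32.6], [32.4, 29.1, 30.6, 30.9]]
score = nearest_neighbor_distance([30.8, 32.6, 30.3, 29.6], model)
```

Because `metric` is a parameter, swapping in a different distance does not change the detector, which mirrors the plug-in style the text describes.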
Additionally, several statistical techniques have been developed for analyzing continuous data. For example, known regression techniques can be used to develop a model of normal behavior, and any new data point can be compared against this model to determine how well it fits. This indicates whether the data point conforms to the model or, if it does not, the degree to which it deviates from the model. Other statistical tests can include, for example, goodness-of-fit tests or threshold-based tests. Statistical techniques can also be applied to histograms derived from the data points rather than to the data points themselves.
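As one hedged illustration of a threshold-based test, the sketch below models normal behavior as the mean and spread of data-point means and raises an event when a new data point lies more than a chosen number of standard deviations away. The model form, the default threshold of 3.0, the event dictionary layout and all function names are assumptions made for this example; the source does not prescribe them.

```python
import math
import statistics


def build_model(training_points):
    """Summarize normal behavior as the mean and spread of data-point means."""
    means = [statistics.fmean(p) for p in training_points]
    return statistics.fmean(means), statistics.pstdev(means)


def deviation_score(model, point):
    """Distance of a data point's mean from the model, in standard deviations."""
    mu, sigma = model
    gap = abs(statistics.fmean(point) - mu)
    if sigma == 0:
        return 0.0 if gap == 0 else math.inf
    return gap / sigma


def maybe_generate_event(model, point, vm, metric, threshold=3.0):
    """Return an event dict when the score exceeds the threshold, else None."""
    score = deviation_score(model, point)
    if score > threshold:
        return {"vm": vm, "metric": metric, "score": score}
    return None


normal_model = build_model([[30, 30], [32, 32], [31, 31]])
event = maybe_generate_event(normal_model, [60, 60], vm="vm106", metric="memory")
```

An event produced this way only marks where to look; whether it is a fault is left to the later correlation and diagnosis stages.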
As noted above, problem determination engine 138 (which includes statistical analyzer module 140 and fault classifier module 142) uses statistical correlations across VMs and resources to identify anomalies and localize them to the affected resource(s) and VM(s). For each event generated by event generation engine 132, this stage analyzes the data further to identify anomalies. Accordingly, in at least one embodiment of the invention, problem determination engine 138 is not invoked unless an event is generated. As an example, assume that an event is generated for metric M_j on virtual machine VM_i. This stage then computes the correlations between the data for M_j and the data for every other metric on VM_i, as well as the correlations between the data for M_j on VM_i and the data for M_j on every other VM running the same application (this information is assumed to be available). Based on knowledge of typical correlation values under normal behavior, any significant deviations from the normal correlation values are noted. These deviations are passed to diagnosis engine 146, which classifies them into different classes of faults and non-fault behavior based on expert knowledge.
As an example, let T_training denote the time series containing the training data points over some time window W_tr, and let T_test denote the time series of test data points within some window W_te, such that the number of data points in T_test is less than or equal to the number of data points in T_training. The training and test time series are combined to produce the correlation sample time series, denoted T_correlation. Note that each system metric on each VM has a corresponding time series. Let vm and r denote a virtual machine and a system metric, respectively, and let T_X-vm-r denote time series X of system metric r on virtual machine vm (where X can be the training, test or correlation sample time series). Next, the correlation (c1) between T_correlation-vm1-r and T_correlation-vm2-r is computed, and the correlation (c2) between the T_training-vm1-r and T_training-vm2-r time series is computed similarly. The resulting correlation values (c1 and c2) are compared with each other to identify the event. A threshold (T) is used to classify the event as a workload intensity change (when abs(c1-c2) ≤ T) or an anomaly (when abs(c1-c2) > T).
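The comparison of c1 and c2 can be sketched as below. Two assumptions are made that the text does not spell out: the correlation is taken to be Pearson's coefficient, and the training and test series are combined by simple concatenation. The function names, the sample values and the threshold 0.2 are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def classify_event(train_vm1, train_vm2, test_vm1, test_vm2, threshold):
    """Compare correlation with and without the test window for one metric r."""
    c1 = pearson(train_vm1 + test_vm1, train_vm2 + test_vm2)  # T_correlation
    c2 = pearson(train_vm1, train_vm2)                        # T_training
    if abs(c1 - c2) <= threshold:
        return "workload intensity change"
    return "anomaly"


train_vm1, train_vm2 = [10, 20, 30, 40], [12, 22, 31, 41]
# Both VMs rise together: correlation is preserved.
rising = classify_event(train_vm1, train_vm2, [50, 60], [52, 61], threshold=0.2)
# Only vm1 misbehaves: correlation with vm2 breaks down.
broken = classify_event(train_vm1, train_vm2, [90, 5], [41, 42], threshold=0.2)
```

The design point is that a workload change moves every VM of the application in the same direction, so the correlation survives, while a VM-local anomaly does not.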
The above process is robust to workload changes, because any change in workload (for example, as illustrated by component 144 in Fig. 1) does not significantly affect the correlations across VMs running the same application; that is, the CPU and memory utilization of all VMs increase as the workload increases. By contrast, if an anomaly affects a particular VM and resource, it is reflected in the metrics governing that resource, which are no longer correlated with those of the other VMs running the same application.
The total number of correlations computed in this process is the number of metrics plus the number of VMs running the application. Note that at least one embodiment of the invention analyzes the neighborhood of the location where the event was generated. This helps in scaling to larger system sizes and many monitored metrics.
Diagnosis engine 146 (which includes fault localization module 148) uses predefined expert knowledge to classify the potential anomaly scenarios detected by problem determination engine 138 into one of several fault and non-fault classes. The expert knowledge can be supplied to the system in the form of standardized fault signatures. A fault signature captures a set of deviations from normal behavior, both in the correlation values computed by problem determination engine 138 and in the operating environment, that are characteristic of the fault it describes. When anomalous behavior is detected, an attempt is made to match the deviations from normal behavior against one or more known fault signatures. If a match is found, diagnosis engine 146 successfully classifies the fault. If no match is found, the diagnosis engine flags the behavior as anomalous behavior of which it has no knowledge. Fig. 2 describes the fault type characteristics that help at least one embodiment of the invention distinguish the various fault scenarios.
Fig. 2 is a table 202 illustrating example fault signature parameters according to an embodiment of the invention. At least one embodiment of the invention describes fault signatures in an XML format, so that new fault signatures can be added (for example, by a system expert) when the system detects a fault. Accordingly, with the help of an expert, the system of at least one embodiment of the invention can learn to classify new types of faults. An assumption made in connection with at least one embodiment of the invention is that each fault has a distinctive signature and that the signature can be expressed in terms of the monitored metrics.
Fig. 3 is a schematic diagram illustrating fault signature examples 302 according to an embodiment of the invention. By way of illustration, Fig. 3 shows examples of expert-created signatures for two types of faults and one workload change scenario. The signatures are described using different contexts expressed as tags. Only the metrics that deviate from normal behavior are captured in a signature, and they are expressed as thresholds denoting the minimum deviation in the correlation values required to match the signature.
For example, pairwise CPU correlation is computed across all VMs hosting the application. To match this signature, the deviation of the correlation value from normal behavior must be at least as large as the threshold defined as CPU-corr-thr. The context tags used include: (a) the VM environment context, which captures the fault manifestation at the virtualized resource level; (b) the operating environment context, which captures metrics obtained at the physical host that represent aggregate usage across all VMs residing on the host; (c) the hypervisor context, which captures any special log messages obtained from the hypervisor; and (d) the application context, which captures application-level performance metrics.
A fault signature can include one or more of these context tags that characterize the fault it defines, which allows at least one embodiment of the invention to identify it uniquely. Fig. 3 shows example signatures for faults and workload changes. A live migration fault is distinguished from other faults by source and destination host page faults (operating context), while wrong VM sizing is identified by significant deviations in CPU, memory and context switch counts. Fig. 2 summarizes the fault signature identifiers that help at least one embodiment of the invention identify and classify each category and sub-category of fault.
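Signature matching of this kind can be sketched as follows. The XML schema here is entirely hypothetical: the element names (`signature`, `vm-environment`, `operating-environment`, `metric`), the metric names and the threshold values are invented for illustration and are not the format shown in Fig. 3. The rule implemented is the one stated above: a signature matches when every metric it lists deviates by at least its threshold.

```python
import xml.etree.ElementTree as ET

# Hypothetical signature file; tags, names and thresholds are illustrative only.
SIGNATURE_XML = """
<signatures>
  <signature name="wrong-vm-sizing">
    <vm-environment>
      <metric name="cpu-corr" threshold="0.3"/>
      <metric name="memory-corr" threshold="0.3"/>
      <metric name="context-switch-corr" threshold="0.3"/>
    </vm-environment>
  </signature>
  <signature name="faulty-live-migration">
    <operating-environment>
      <metric name="page-faults" threshold="0.4"/>
    </operating-environment>
  </signature>
</signatures>
"""


def load_signatures(xml_text):
    """Parse signatures into {name: {(context, metric): minimum deviation}}."""
    root = ET.fromstring(xml_text.strip())
    signatures = {}
    for sig in root.findall("signature"):
        signatures[sig.get("name")] = {
            (ctx.tag, m.get("name")): float(m.get("threshold"))
            for ctx in sig
            for m in ctx.findall("metric")
        }
    return signatures


def match_signatures(signatures, deviations):
    """Names of signatures whose every threshold is met by the observed deviations."""
    return [
        name
        for name, thresholds in signatures.items()
        if all(deviations.get(key, 0.0) >= t for key, t in thresholds.items())
    ]


signatures = load_signatures(SIGNATURE_XML)
observed = {("operating-environment", "page-faults"): 0.6}
matches = match_signatures(signatures, observed)
```

An empty result from `match_signatures` corresponds to the case where the behavior is flagged as anomalous but unknown, which is the point at which an expert could add a new signature.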
As detailed herein, at least one embodiment of the present invention includes monitoring various system and application Measure and detect various fault.Therefore, Fig. 4 is to illustrate example according to an embodiment of the invention System and the table 402 of application tolerance.As described herein, at least one embodiment bag of the present invention Including supervision engine, it is collected from physical host and virtual machine thereon and measures for fault detect and examine Disconnected.Measurement data can cross over multiple system and application tolerance.The present invention listed by table 402 in Fig. 4 At least one embodiment monitored one group system tolerance.3rd row appointment of table 402 is virtual Machine rank still collects tolerance in physical server rank.
Fig. 5 is the table 502 illustrating detected failure scenario according to an embodiment of the invention. For example, Fig. 5 describes the various faults using at least one embodiment of the present invention to be detected. The details of Instance failure are described below.
In the "wrong VM resource sizing" fault, the VM resource allocation (CPU and memory) is misconfigured. Consider the following misconfiguration scenarios: (a) both the CPU and memory reservations of the target VM are assigned very low values; (b) the CPU reservation is configured with a very low value while the memory reservation has a valid value; and (c) the memory reservation is set to a very low value while the CPU reservation has a valid value.
The "faulty VM live migration" fault reflects problems that can arise from live migration of virtual machines. Consider the following two scenarios. In the first scenario, a VM is migrated to a heavily congested physical host whose capacity is only just enough to accommodate the migrated VM. In the second scenario, a VM is migrated from a heavily congested source host that does not have enough resources available to perform the migration.
The "workload mix change" fault corresponds to a change in the workload mix or in the nature of the workload, which changes the extent to which the application uses different resources. The "workload intensity change" fault indicates that the intensity of the workload increases or decreases while the nature of the workload itself remains unchanged.
The "application misconfiguration" fault denotes the case where one of the application parameters is set to an incorrect or invalid value. In addition, "VM configuration error" captures the case where a virtual machine is configured in a way that conflicts with some configuration parameter of the physical host. For example, if the CPUs on the source and destination hosts do not provide an identical set of features to the virtual machine, activities such as VM live migration will fail.
The "VM reconfiguration" fault can arise during VM reconfiguration, which can be accomplished by dynamic VM resizing, by VM live migration, or by creating new VM instances in the shared cloud. The "impact due to resource sharing" fault refers to the situation where two or more VMs hosted on the same physical machine contend for a resource and the performance of one or more of the VMs is affected. For example, consider a cache-hungry application on one VM that is co-hosted with a cache-sensitive application on another VM. The cache-hungry VM uses the cache aggressively, thereby significantly affecting the cache hit rate and the performance of the other application. A CPU hog represents a scenario where the application enters an infinite loop of computation that hogs (that is, uses most of) the CPU. This fault can be produced by introducing a block of C code that performs floating-point computation in an infinite loop. This consumes most of the VM's available processor cycles, leading to poor application performance.
A memory hog represents a scenario where there is a memory leak in the application. This is implemented by running a C program that consumes a large amount of memory on the target VM (it continually reserves memory from the heap via malloc without releasing the allocated blocks). Very little memory is left for the application, which causes a significant drop in application throughput.
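The CPU hog and memory hog injections can be sketched as below. The source describes C programs (an infinite floating-point loop, and repeated malloc without free); this is only a bounded Python stand-in for illustration, and the function names `cpu_hog` and `memory_hog` and their parameters are hypothetical. Both loops are deliberately limited so the sketch terminates.

```python
import time


def cpu_hog(duration_s):
    """Busy-loop on floating-point math for duration_s seconds.

    The source describes an infinite loop; a deadline is used here so the
    sketch terminates. Returns the number of iterations performed.
    """
    deadline = time.monotonic() + duration_s
    x, iterations = 1.0001, 0
    while time.monotonic() < deadline:
        x = (x * x) % 1e6 + 1.0001  # floating-point work that stays bounded
        iterations += 1
    return iterations


def memory_hog(chunk_bytes, max_chunks):
    """Allocate chunks and keep every reference so nothing is released.

    Analogous to calling malloc repeatedly without free, but capped at
    max_chunks allocations. Returns the list of retained blocks.
    """
    leaked = []
    for _ in range(max_chunks):
        leaked.append(bytearray(chunk_bytes))
    return leaked


spins = cpu_hog(0.02)          # hog one core for roughly 20 ms
blocks = memory_hog(1024, 4)   # retain 4 KiB in total
```

In a real injection experiment the duration and allocation caps would be raised until the target VM's CPU or memory is visibly starved.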
Additionally, the disk hog case is similar to the CPU and memory hog scenarios, and can be implemented using multiple Hadoop sort instances running in parallel to create a high disk utilization scenario.
In the fault referred to as a network hog, the utilization of a network link is very high. Additionally, in the fault referred to as a cache hog, a cache-intensive benchmark is co-located with the target VM to simulate the cache hog.
Fig. 6 is a flow diagram illustrating a technique for problem determination and diagnosis in a shared dynamic cloud environment according to an embodiment of the invention. Step 602 includes monitoring each virtual machine and physical server in the shared dynamic cloud environment for at least one metric. Metrics can include metrics pertaining to the central processing unit, memory, cache, network resources, disk resources, and so on. Monitoring can additionally include outputting a stream of data points, each data point being derived from a monitoring data time series corresponding to each system and application metric in the shared dynamic cloud environment. Additionally, in at least one embodiment of the invention, monitoring includes monitoring separately at the virtual machine level and at the physical host level.
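As a rough sketch of how a monitoring step might turn a raw time series into a stream of data points: the text does not say how each data point is derived, so averaging over a fixed-size sliding window is an assumption made here, and `DataPoint`, `data_point_stream`, and the `window` parameter are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class DataPoint:
    entity: str     # VM or physical host name (hypothetical identifier)
    metric: str     # for example "cpu" or "memory"
    timestamp: int  # time of the last raw sample in the window
    value: float    # window average of the raw samples (assumed derivation)


def data_point_stream(
    entity: str,
    metric: str,
    samples: Iterable[Tuple[int, float]],
    window: int = 3,
) -> Iterator[DataPoint]:
    """Yield one data point per full sliding window of raw (timestamp, value) samples."""
    buf = deque(maxlen=window)
    for ts, value in samples:
        buf.append(value)
        if len(buf) == window:
            yield DataPoint(entity, metric, ts, sum(buf) / window)


raw_cpu = [(0, 10.0), (1, 20.0), (2, 30.0), (3, 40.0)]
points = list(data_point_stream("vm1", "cpu", raw_cpu, window=3))
```

One such stream would be produced per metric, separately for each virtual machine and each physical host.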
Step 604 includes identifying a symptom of a problem based on the monitoring and generating an event. Identifying a symptom can include identifying a trend from time series data. Additionally, at least one embodiment of the invention includes using machine learning techniques to build a model of normal application behavior, and using the model to detect deviations from normal behavior in the monitored data.
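The idea of learning normal behavior and flagging deviations can be illustrated with a deliberately simple baseline. The text does not specify which machine learning technique is used; a mean and standard deviation band is swapped in here purely as a stand-in, and `NormalBehaviorModel`, `generate_events`, and the multiplier `k` are hypothetical.

```python
from statistics import mean, pstdev


class NormalBehaviorModel:
    """Stand-in model: normal behavior is a band of k standard deviations around the mean."""

    def __init__(self, k=3.0):
        self.k = k
        self.mu = None
        self.sigma = None

    def fit(self, normal_values):
        """Learn the baseline from values observed during fault-free operation."""
        self.mu = mean(normal_values)
        self.sigma = pstdev(normal_values)
        return self

    def is_deviation(self, value):
        """True when the value falls outside the learned band."""
        if self.sigma == 0:
            return value != self.mu
        return abs(value - self.mu) > self.k * self.sigma


def generate_events(model, stream):
    """Return the (timestamp, value) samples that deviate from normal behavior."""
    return [(ts, v) for ts, v in stream if model.is_deviation(v)]


model = NormalBehaviorModel(k=3.0).fit([10, 11, 9, 10, 10, 11, 9, 10])
events = generate_events(model, [(0, 10), (1, 11), (2, 25), (3, 9)])
```

Only the sample at timestamp 2 lies far enough from the learned baseline to raise an event; a production system would replace the band with whatever learned model it actually uses.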
Step 606 includes analyzing the event to determine a deviation from normal behavior. Analyzing can include using statistical correlation across virtual machines and resources to locate the deviation with respect to the affected resources and virtual machines. Additionally, at least one embodiment of the invention includes analyzing the region surrounding the location at which the event was generated.
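One way to picture the localization step is to compare, for every (virtual machine, resource) pair, a correlation value recorded under normal behavior with the value observed around the event, and to keep the pairs that moved the most. This is a sketch under assumptions: the function name `locate_deviations`, the `min_deviation` cutoff, and the sample correlation values are all hypothetical.

```python
def locate_deviations(normal_corr, observed_corr, min_deviation=0.3):
    """Rank (vm, resource) pairs by how far their correlation moved from normal.

    Both arguments map (vm, resource) keys to correlation values. Pairs whose
    absolute deviation is below min_deviation are dropped.
    """
    deviations = {
        key: abs(observed_corr[key] - normal_corr[key])
        for key in normal_corr
        if key in observed_corr
    }
    hits = [(key, dev) for key, dev in deviations.items() if dev >= min_deviation]
    return sorted(hits, key=lambda item: item[1], reverse=True)


normal = {("vm1", "cpu"): 0.9, ("vm1", "memory"): 0.8, ("vm2", "cpu"): 0.85}
observed = {("vm1", "cpu"): 0.2, ("vm1", "memory"): 0.75, ("vm2", "cpu"): 0.4}
suspects = locate_deviations(normal, observed)
```

The ranked list points at the CPU of `vm1` first and the CPU of `vm2` second, while the memory correlation of `vm1` stays close to normal and is filtered out.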
Step 608 includes classifying the event as a cloud-based anomaly or an application fault based on existing knowledge. The existing (or expert) knowledge can include fault signatures, where a fault signature captures a set of deviations from normal behavior that are characteristic of an event. Additionally, at least one embodiment of the invention includes, when an event is generated, matching the deviations from normal behavior against the fault signatures, classifying the event if a match is found, and labeling the event as anomalous behavior if no match is found.
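The match-or-flag logic of this step can be sketched as follows. The `FaultSignature` structure, the tagged metric names such as `operating:src-host-page-faults`, the threshold values, and the category assigned to each signature are illustrative assumptions and are not taken from the figures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FaultSignature:
    name: str
    category: str                 # "cloud-based anomaly" or "application fault"
    thresholds: Dict[str, float]  # context-tagged metric -> minimum deviation


def classify(
    deviations: Dict[str, float], signatures: List[FaultSignature]
) -> Tuple[str, Optional[str]]:
    """Return (category, signature name) for the first signature whose every
    threshold is met; otherwise label the event as anomalous behavior."""
    for sig in signatures:
        if all(
            abs(deviations.get(metric, 0.0)) >= thr
            for metric, thr in sig.thresholds.items()
        ):
            return sig.category, sig.name
    return "anomalous behavior", None


# Hypothetical signatures; tags follow the "<context>:<metric>" convention assumed here.
signatures = [
    FaultSignature(
        "faulty VM live migration",
        "cloud-based anomaly",
        {"operating:src-host-page-faults": 0.4, "operating:dst-host-page-faults": 0.4},
    ),
    FaultSignature(
        "application misconfiguration",
        "application fault",
        {"application:latency-corr": 0.5},
    ),
]

migration_event = {
    "operating:src-host-page-faults": 0.6,
    "operating:dst-host-page-faults": 0.5,
}
unknown_event = {"vm-env:cpu-corr": 0.1}
```

Calling `classify(migration_event, signatures)` matches the first signature, while `classify(unknown_event, signatures)` finds no match and falls through to the anomalous-behavior label.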
Also as described herein, at least one embodiment of the invention includes determining virtual machine behavior under multiple operating environments in a system. This can include, for example: monitoring at least one resource at the level of each virtual machine in the system; monitoring the aggregate usage of each resource at the physical host level in the system; capturing multiple metrics that govern the cumulative usage of each resource over all virtual machines on each physical host in the system; and analyzing the metrics to determine virtual machine behavior under the multiple operating environments in the system. Additionally, at least one embodiment of the invention can include detecting a problem in the system based on the virtual machine behavior under the multiple operating environments.
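The cumulative per-host usage mentioned above can be illustrated with a small aggregation sketch. The function name `aggregate_by_host`, the VM-to-host placement map, and the sample usage figures are hypothetical.

```python
from collections import defaultdict


def aggregate_by_host(vm_usage, vm_to_host):
    """Sum each resource's usage over all VMs placed on the same physical host.

    vm_usage maps VM name -> {resource: usage}; vm_to_host maps VM name -> host.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for vm, resources in vm_usage.items():
        host = vm_to_host[vm]
        for resource, used in resources.items():
            totals[host][resource] += used
    return {host: dict(res) for host, res in totals.items()}


vm_usage = {
    "vm1": {"cpu": 30.0, "memory": 2.0},
    "vm2": {"cpu": 50.0, "memory": 4.0},
    "vm3": {"cpu": 10.0},
}
placement = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB"}
host_usage = aggregate_by_host(vm_usage, placement)
```

Comparing a VM's own usage with the aggregate on its host gives one view of the operating environment in which that VM is running.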
As described herein, the technique shown in Fig. 6 can also include providing a system, wherein the system includes distinct software modules, each of the distinct software modules being embodied on a tangible computer-readable recordable storage medium. For example, all of the modules (or any subset thereof) can be on the same medium, or each can be on a different medium. The modules can include any or all of the components detailed herein. In one aspect of the invention, the modules can run, for example, on a hardware processor. The method steps can then be carried out using the distinct software modules of the system, executing on a hardware processor as described above. Additionally, a computer program product can include a tangible computer-readable recordable storage medium with code adapted to be executed to carry out at least one method step described herein, including the provision of the system with the distinct software modules.
Additionally, the technique shown in Fig. 6 can be implemented via a computer program product that includes computer-usable program code stored in a computer-readable storage medium in a data processing system, wherein the computer-usable program code was downloaded over a network from a remote data processing system. Additionally, in one aspect of the invention, the computer program product can include computer-usable program code stored in a computer-readable storage medium in a server data processing system, wherein the computer-usable program code is downloaded over a network to a remote data processing system for use in a computer-readable storage medium of the remote system.
As will be appreciated by one skilled in the art, aspects of the invention may be embodied as a system, method, or computer program product. Accordingly, aspects of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.), or an embodiment combining software and hardware aspects, all of which may generally be referred to herein as a "circuit," "module," or "system." Furthermore, aspects of the invention may take the form of a computer program product embodied in a computer-readable medium having computer-readable program code embodied thereon.
An aspect of the invention or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
Additionally, an aspect of the invention can make use of software running on a general-purpose computer or workstation. With reference to Fig. 7, such an implementation might employ, for example, a processor 702, a memory 704, and an input/output interface formed, for example, by a display 706 and a keyboard 708. The term "processor" as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other forms of processing circuitry. Further, the term "processor" may refer to more than one individual processor. The term "memory" is intended to include memory associated with a processor or CPU, such as, for example, RAM (random access memory), ROM (read-only memory), a fixed memory device (for example, a hard drive), a removable memory device (for example, a diskette), a flash memory, and the like. In addition, the phrase "input/output interface" as used herein is intended to include, for example, a mechanism for inputting data to the processing unit (for example, a mouse) and a mechanism for providing results associated with the processing unit (for example, a printer). The processor 702, memory 704, and input/output interface such as display 706 and keyboard 708 can be interconnected, for example, via bus 710 as part of a data processing unit 712. Suitable interconnections, for example via bus 710, can also be provided to a network interface 714, such as a network card, which can be provided to interface with a computer network, and to a media interface 716, such as a diskette or CD-ROM drive, which can be provided to interface with media 718.
Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in an associated memory device (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented by a CPU. Such software could include, but is not limited to, firmware, resident software, microcode, and the like.
A data processing system suitable for storing and/or executing program code will include at least one processor 702 coupled directly or indirectly to memory elements 704 through a system bus 710. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards 708, displays 706, pointing devices, and the like) can be coupled to the system either directly (such as via bus 710) or through intervening I/O controllers (omitted for clarity).
Network adapters such as network interface 714 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or to remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
As used herein, including the claims, a "server" includes a physical data processing system (for example, system 712 as shown in Fig. 7) running a server program. It will be understood that such a physical server may or may not include a display and keyboard.
As noted, aspects of the invention may take the form of a computer program product embodied in a computer-readable medium having computer-readable program code embodied thereon. Also, any combination of computer-readable media may be utilized. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the invention may be written in any combination of at least one programming language, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. Accordingly, an aspect of the invention includes an article of manufacture tangibly embodying computer-readable instructions which, when implemented, cause a computer to carry out a plurality of method steps as described herein.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, component, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
It should be noted that any of the methods described herein can include an additional step of providing a system comprising distinct software modules embodied on a computer-readable storage medium; the modules can include, for example, any or all of the components shown in Fig. 1. The method steps can then be carried out using the distinct software modules and/or sub-modules of the system, as described above, executing on a hardware processor 702. Further, a computer program product can include a computer-readable storage medium with code adapted to be implemented to carry out at least one method step described herein, including the provision of the system with the distinct software modules.
In any case, it should be understood that the components illustrated herein may be implemented in various forms of hardware, software, or combinations thereof; for example, application-specific integrated circuit(s) (ASICs), functional circuitry, an appropriately programmed general-purpose digital computer with associated memory, and the like. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the components of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.
At least one aspect of the invention may provide a beneficial effect such as, for example, creating signatures of cloud reconfiguration activity in order to handle virtualization-driven reconfiguration.
The descriptions of the various embodiments of the invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. A method for problem determination and diagnosis in a shared dynamic cloud environment, the method comprising:
monitoring each virtual machine and each physical server in the shared dynamic cloud environment for at least one metric;
identifying a symptom of a problem in the shared dynamic cloud environment based on the monitoring, and generating an event corresponding to the symptom;
analyzing the event to determine a deviation from normal behavior; and
classifying the event as a cloud-based anomaly or an application fault based on a comparison of the event with at least one fault signature, wherein the at least one fault signature captures a set of deviations from normal behavior;
wherein at least one of the above steps is performed by a computer device.
2. The method of claim 1, wherein the at least one metric includes a metric pertaining to at least one of a central processing unit, memory, cache, network resources, and disk resources.
3. The method of claim 1, wherein the monitoring includes outputting a stream of data points, each data point being derived from a monitoring data time series corresponding to each system and application metric in the shared dynamic cloud environment.
4. The method of claim 1, wherein the monitoring includes monitoring separately at a virtual machine level and at a physical host level.
5. The method of claim 1, wherein the identifying includes identifying a trend from time series data.
6. The method of claim 1, comprising using a machine learning technique to build a model of normal application behavior.
7. The method of claim 6, comprising using the model to detect a deviation from normal behavior in the monitored data.
8. The method of claim 1, wherein the analyzing includes using statistical correlation across virtual machines and resources to locate the deviation with respect to the affected resources and virtual machines.
9. The method of claim 1, comprising analyzing a region surrounding the location at which the event was generated.
10. A system for problem determination and diagnosis in a shared dynamic cloud environment, the system comprising:
a module adapted to monitor each virtual machine and each physical server in the shared dynamic cloud environment for at least one metric;
a module adapted to identify a symptom of a problem in the shared dynamic cloud environment based on the monitoring and to generate an event corresponding to the symptom;
a module adapted to analyze the event to determine a deviation from normal behavior; and
a module adapted to classify the event as a cloud-based anomaly or an application fault based on a comparison of the event with at least one fault signature, wherein the at least one fault signature captures a set of deviations from normal behavior.
11. The system of claim 10, wherein the module adapted to monitor each virtual machine and physical server in the shared dynamic cloud environment for at least one metric includes a sub-module adapted to output a stream of data points, each data point being derived from a monitoring data time series corresponding to each system and application metric in the shared dynamic cloud environment.
12. The system of claim 10, wherein the system includes:
a module adapted to use a machine learning technique to build a model of normal application behavior; and
a module adapted to use the model to detect a deviation from normal behavior in the monitored data.
13. A system for problem determination and diagnosis in a shared dynamic cloud environment, the system comprising:
a memory;
at least one processor coupled to the memory; and
at least one distinct software module, each distinct software module being embodied on a tangible computer-readable medium, the at least one distinct software module including:
a monitoring engine module, executing on the processor, for monitoring each virtual machine and each physical server in the shared dynamic cloud environment for at least one metric, and for outputting a monitoring data time series corresponding to each metric;
an event generation engine module, executing on the processor, for identifying a symptom of a problem in the shared dynamic cloud environment based on the monitoring and generating an event corresponding to the symptom;
a problem determination engine module, executing on the processor, for analyzing the event to determine and locate a deviation from normal behavior; and
a diagnosis engine module, executing on the processor, for classifying the event as a cloud-based anomaly or an application fault based on a comparison of the event with at least one fault signature, wherein the at least one fault signature captures a set of deviations from normal behavior.
14. A method for determining virtual machine behavior under multiple operating environments in a system, the method comprising:
monitoring at least one resource at the level of each virtual machine in the system;
monitoring aggregate usage of each resource at a physical host level in the system;
capturing multiple metrics that govern the cumulative usage of each resource over all virtual machines on each physical host in the system; and
analyzing the metrics to determine virtual machine behavior under the multiple operating environments in the system, wherein the analyzing includes comparing the metrics with at least one fault signature, and wherein the at least one fault signature captures a set of deviations from normal behavior;
wherein at least one of the above steps is performed by a computer device.
15. The method of claim 14, comprising detecting a problem in the system based on the virtual machine behavior under the multiple operating environments.
CN201310174315.3A 2012-05-14 2013-05-13 Method and system for problem determination and diagnosis in shared dynamic clouds Active CN103428026B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13/470,589 2012-05-14
US13/470,589 US8862727B2 (en) 2012-05-14 2012-05-14 Problem determination and diagnosis in shared dynamic clouds

Publications (2)

Publication Number Publication Date
CN103428026A CN103428026A (en) 2013-12-04
CN103428026B true CN103428026B (en) 2016-11-30

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101088072A (en) * 2004-12-24 2007-12-12 国际商业机器公司 A method and system for monitoring transaction based systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Statistical Techniques for Online Anomaly Detection in Data Centers;Chengwei Wang etc.;《Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on》;20110527;第385-391页,图1 *

Similar Documents

Publication Publication Date Title
US9864676B2 (en) Bottleneck detector application programming interface
US8862728B2 (en) Problem determination and diagnosis in shared dynamic clouds
EP2956858B1 (en) Periodicity optimization in an automated tracing system
US9658936B2 (en) Optimization analysis using similar frequencies
US9767006B2 (en) Deploying trace objectives using cost analyses
US8843901B2 (en) Cost analysis for selecting trace objectives
US9021447B2 (en) Application tracing by distributed objectives
US8677191B2 (en) Early detection of failing computers
KR102522005B1 (en) Apparatus for VNF Anomaly Detection based on Machine Learning for Virtual Network Management and a method thereof
WO2014126639A1 (en) Deployment of profile models with a monitoring agent
CN107967485A (en) Electro-metering equipment fault analysis method and device
WO2014200836A1 (en) Systems and methods for monitoring system performance and availability
CN111949429A (en) Server fault monitoring method and system based on density clustering algorithm
Fu et al. Performance issue diagnosis for online service systems
CN117170303B (en) PLC fault intelligent diagnosis maintenance system based on multivariate time sequence prediction
CN111367786A (en) Symbol execution method, electronic equipment and storage medium
Sirshar et al. Comparative Analysis of Software Defect Prediction Techniques
CN103428026B (en) Method and system for problem determination and diagnosis in shared dynamic clouds
Tang et al. Fine-Grained Diagnosis Method for Microservice Faults Based on Hierarchical Correlation Analysis
CN117873856A (en) Software testing method, storage medium and computer equipment
Attarha et al. Identifying unique power scenarios with data mining techniques at full SoC level with real workloads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170505

Address after: Room 3, Pacific Plaza, 1 Queen's Road East, Wan Chai, Hong Kong, China

Patentee after: Oriental Concept Limited

Address before: New York, USA

Patentee before: International Business Machines Corp.