CN109976986B - Abnormal equipment detection method and device - Google Patents


Info

Publication number
CN109976986B
CN109976986B (application CN201711455271.6A)
Authority
CN
China
Prior art keywords
actual data
algorithm
period
data
evaluation
Prior art date
Legal status
Active
Application number
CN201711455271.6A
Other languages
Chinese (zh)
Other versions
CN109976986A (en)
Inventor
朱婉怡
豆龙超
黄杰龙
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201711455271.6A
Publication of CN109976986A
Application granted
Publication of CN109976986B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; error correction; monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3055 Monitoring arrangements for monitoring the status of the computing system or of a computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operations; recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3466 Performance evaluation by tracing or monitoring

Abstract

One or more embodiments of the present disclosure provide a method and an apparatus for detecting an abnormal device. The method may include: acquiring actual data of performance indexes of a monitored system; determining, in the actual data, first actual data belonging to a stable interval of the time sequence and second actual data belonging to an unstable interval of the time sequence; evaluating the first actual data to obtain a first evaluation result, and evaluating the second actual data to obtain a second evaluation result; and determining abnormal equipment in the monitored system according to the first evaluation result and the second evaluation result.

Description

Abnormal equipment detection method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of anomaly detection technologies, and in particular, to a method and an apparatus for detecting an anomaly device.
Background
By configuring an alarm mechanism in systems such as a data center, the running state of each monitored system can be monitored, so that possible abnormal conditions of the monitored system can be found and resolved in time.
In the related art, data of the performance indexes (i.e., performance data) of the monitored systems is collected and compared with a predefined performance threshold; if the performance data does not meet the performance threshold, it can be determined that an abnormality may exist in the corresponding monitored system.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method and an apparatus for detecting an abnormal device.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
according to a first aspect of one or more embodiments of the present disclosure, a method for detecting an abnormal device is provided, including:
acquiring actual data of performance indexes of a monitored system;
determining first actual data of a stable interval in a time sequence and second actual data of an unstable interval in the time sequence in the actual data;
evaluating the first actual data to obtain a first evaluation result, and evaluating the second actual data to obtain a second evaluation result;
and determining abnormal equipment in the monitored system according to the first evaluation result and the second evaluation result.
According to a second aspect of one or more embodiments of the present specification, there is provided a detection apparatus of an abnormal device, including:
an acquisition unit, configured to acquire actual data of performance indexes of a monitored system;
a determining unit, configured to determine, in the actual data, first actual data of a stable section in the time series and second actual data of an unstable section in the time series;
an evaluation unit, configured to evaluate the first actual data to obtain a first evaluation result and evaluate the second actual data to obtain a second evaluation result;
and an identification unit, configured to determine abnormal equipment in the monitored system according to the first evaluation result and the second evaluation result.
Drawings
Fig. 1 is a schematic diagram of a data center according to an exemplary embodiment.
Fig. 2 is a flowchart of a method for detecting an abnormal device according to an exemplary embodiment.
FIG. 3 is a flow chart of an alert system implementing anomaly detection for a data center, in accordance with an exemplary embodiment.
FIG. 4 is a graphical representation of historical data for a performance index X over three operating cycles, as provided by an exemplary embodiment.
Fig. 5 is a graphical representation of actual data for a 1 minute load average, a 5 minute load average, and a 15 minute load average, as provided by an exemplary embodiment.
Fig. 6 is a schematic diagram of a normal distribution of actual data of a 1 minute load average after processing by a Box-Cox transformation algorithm according to an exemplary embodiment.
Fig. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
Fig. 8 is a block diagram of a detection apparatus of an abnormality device provided in an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, an individual step described in this specification may, in other embodiments, be split into multiple steps for description, while multiple steps described in this specification may be combined into a single step in other embodiments.
Fig. 1 is a schematic diagram of a data center according to an exemplary embodiment. As shown in fig. 1, hardware devices such as a device 11, a device 12, and a device 13, and an alarm system 14 are disposed in a data center. The hardware devices 11-13 and the like independently or cooperatively run application programs to realize specific functions of the data center.
The alarm system 14 monitors performance metrics of the data center to determine the operating conditions in which the data center is located. The performance index may be derived from a hardware state of a hardware device such as the devices 11-13, a software state of an application running on the hardware device, or other sources, which are not limited in this specification. By monitoring the performance indicators, the alarm system 14 may timely discover abnormal devices that may occur in the data center, so that the alarm system 14 may send a prompt or alarm to a preset object, so as to timely diagnose, analyze or process the abnormal situation, where the preset object may include a worker or an automated processing system of the data center, and the disclosure is not limited thereto.
In the embodiment of the present disclosure, by optimizing and improving the abnormality detection scheme of the alarm system 14, the monitoring operation can be more accurate and sensitive, avoiding the waste of human resources or other resources of the staff caused by the false alarm of the abnormality, and ensuring the normal operation of the data center. The data center is only one application object of the abnormality detection scheme provided in the specification; in fact, the anomaly detection scheme of the present specification can be applied to any electronic device, structure or system other than a data center, and the present specification is not limited thereto.
Fig. 2 is a flowchart of a method for detecting an abnormal device according to an exemplary embodiment. As shown in fig. 2, the method may include the steps of:
step 202, obtaining actual data of performance indexes of a monitored system.
In one embodiment, the performance index may include any parameter used to reflect the operating condition of the monitored system, such as throughput, event count, duration of operation, memory or storage size, and the like, which is not limited in this specification.
In an embodiment, a correlation analysis may be performed on the performance indexes of the monitored system; when actual data of correlated performance indexes is acquired, the actual data of all but one (or a few) of those correlated indexes may be screened out, so that only the retained actual data needs to be processed subsequently, reducing the amount of data to be handled. For example, assuming that the performance indexes corresponding to the obtained actual data include the load average within 1 minute, the load average within 5 minutes, and the load average within 15 minutes, and that these three indexes are correlated (e.g., the degree of association is greater than a preset degree), the actual data of the 5-minute and 15-minute load averages may be screened out so that only the actual data of the 1-minute load average needs to be processed, reducing the data processing amount in the subsequent steps without adversely affecting the final evaluation result.
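The correlation screening above can be sketched as follows; the function name and the 0.9 correlation threshold are illustrative assumptions rather than part of the embodiment, and Pearson correlation stands in for whatever degree-of-association measure is chosen:

```python
import numpy as np

def filter_correlated_metrics(metrics, threshold=0.9):
    """Keep one representative from each group of highly correlated metrics.

    metrics: dict mapping metric name -> 1-D numpy array of samples.
    Returns the list of metric names retained for subsequent processing.
    """
    kept = []
    for name, series in metrics.items():
        # Drop this metric if it correlates strongly with one already kept.
        redundant = any(
            abs(np.corrcoef(series, metrics[k])[0, 1]) > threshold
            for k in kept
        )
        if not redundant:
            kept.append(name)
    return kept
```

With the three load averages moving in lockstep, only the first is retained while an uncorrelated metric such as CPU utilization survives the screen.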
In an embodiment, it may be determined whether the actual data of the performance index meets a predefined standard data structure, and if not, the actual data may be adapted to meet the standard data structure. For example, assuming that the standard data structure may include a normal distribution structure, the actual data may be adjusted by an algorithm such as gaussian distribution transformation, box-Cox transformation, or the like so as to satisfy the normal distribution standard data structure.
Step 204, determining first actual data of the stable interval in time sequence and second actual data of the unstable interval in time sequence in the actual data.
In one embodiment, the run time of the monitored system is divided into time periods over the time sequence. When the historical data of the performance index is in a stable state during the same time period in each of several selected operation cycles, that time period is determined to belong to the stable interval; when the historical data of that time period is in an unstable state in at least one of the selected operation cycles, the time period is determined to belong to the unstable interval. For example, assuming that the operation cycle of the monitored system is one day, the selected operation cycles may be the last three days, that is, the three most recent consecutive operation cycles. If the period 9:00-12:00 is in a stable state in all three operation cycles, it may be determined to belong to a stable interval; if the period 3:20-4:10 is in an unstable state in at least one of the three operation cycles, it may be determined to belong to an unstable interval.
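A minimal sketch of this classification, assuming each operation cycle has already been reduced to one representative value per time period; the function name and the spread threshold are hypothetical stand-ins for whatever stability test the embodiment adopts:

```python
import numpy as np

def classify_periods(history, max_spread=1.0):
    """history: array of shape (n_cycles, n_periods), one value per period per cycle.

    A period is 'stable' when its values differ little across all selected
    cycles, and 'unstable' when any cycle deviates beyond max_spread.
    Returns (stable period indices, unstable period indices).
    """
    spread = history.max(axis=0) - history.min(axis=0)
    stable = np.where(spread < max_spread)[0]
    unstable = np.where(spread >= max_spread)[0]
    return stable.tolist(), unstable.tolist()
```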
In an embodiment, the stability interval may include one or more time periods.
In an embodiment, the unstable interval may include one or more periods of time.
In an embodiment, the operation cycle may be divided into time periods such as time, minutes, seconds, etc. according to a specific fine granularity, which is not limited in this specification.
In one embodiment, a period in the steady state indicates that the performance indicator of the monitored system maintains little or no fluctuation over the relevant period across multiple operating cycles; a period in the unstable state indicates that the performance indicator exhibits at least one large fluctuation over the relevant period across multiple operating cycles.
In an embodiment, the stability of each period may be identified by numerical characteristics of the performance index. For example, a period in the unstable state may include at least one of: a peak period (i.e., a period in which a peak occurs), a noise period (i.e., a period in which noise is present), etc., whereas a period in the steady state includes no peak period, noise period, or the like.
In an embodiment, the steady-state periods may comprise all periods of the operating cycle other than the unstable-state periods; that is, the steady-state and unstable-state periods may be complementary within an operating cycle, together forming a complete operating cycle. In other embodiments, the steady-state and unstable-state periods are not necessarily complementary within the operating cycle; the cycle may also contain, for example, periods of uncertainty or periods of no interest, which is not limited in this description.
In one embodiment, the historical data of the performance index of the monitored system in a plurality of operation cycles may be analyzed by a time sequence pattern discovery algorithm to determine the stable section and the unstable section.
Step 206, evaluating the first actual data to obtain a first evaluation result, and evaluating the second actual data to obtain a second evaluation result.
In an embodiment, the first actual data and the second actual data are evaluated respectively, so as to obtain a corresponding first evaluation result and a corresponding second evaluation result respectively.
In an embodiment, a first algorithm is used to evaluate the first actual data and a second algorithm is used to evaluate the second actual data, where the first algorithm and the second algorithm may be unsupervised algorithms, so that abnormal devices in the monitored system can be identified without labeled data participating in the processing, greatly reducing the burden on staff.
In an embodiment, the first algorithm may comprise at least one of the following optional algorithms: time sequence mining algorithms, clustering algorithms, statistical learning algorithms, regression analysis algorithms, etc.
In an embodiment, the second algorithm may comprise at least one of the following optional algorithms: clustering algorithm, statistical learning algorithm and regression analysis algorithm.
And step 208, determining abnormal equipment in the monitored system according to the first evaluation result and the second evaluation result.
In an embodiment, when the first algorithm includes a plurality of selectable algorithms, each selectable algorithm has a corresponding weight value, and the evaluation result of the first actual data includes a first weighted evaluation value obtained by performing weighted calculation on evaluation values obtained by the plurality of selectable algorithms, so that the advantages of the plurality of selectable algorithms can be integrated to perform joint detection on the abnormal device, and the detection capability of the abnormal device is improved.
In an embodiment, when the second algorithm includes a plurality of selectable algorithms, each selectable algorithm has a corresponding weight value, and the evaluation result of the second actual data includes a second weighted evaluation value obtained by performing weighted calculation on evaluation values obtained by the plurality of selectable algorithms, so that the advantages of the plurality of selectable algorithms can be integrated to perform joint detection on the abnormal device, and the detection capability of the abnormal device is improved.
In an embodiment, the evaluation result may be compared with label data to analyze the evaluation accuracy of each alternative algorithm used as the first algorithm or the second algorithm, and the corresponding alternative algorithm may then be improved according to its evaluation accuracy. For example, an alternative algorithm may evaluate the actual data of the performance index against a predefined evaluation threshold to generate a corresponding evaluation value, and the evaluation threshold may be adjusted when the algorithm is improved; for another example, when the first algorithm or the second algorithm includes multiple alternative algorithms, the weight value corresponding to each alternative algorithm may be adjusted.
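The weighted evaluation described above, in which several optional algorithms each contribute a score, can be sketched as a normalised weighted sum; the function name and the example weights are illustrative:

```python
def weighted_score(scores, weights):
    """Combine per-algorithm evaluation scores into one weighted value.

    scores:  dict mapping algorithm name -> evaluation score for a device.
    weights: dict mapping algorithm name -> weight value.
    Weights are normalised so the result stays on the scale of the scores.
    """
    total = sum(weights.values())
    return sum(scores[algo] * weights[algo] / total for algo in scores)
```

Adjusting a weight value, as the embodiment suggests when accuracy analysis shows one algorithm outperforming another, simply shifts the contribution of that algorithm's score.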
In order to facilitate understanding, the technical scheme of the present specification will be described in detail below by taking an example of abnormality detection implemented by the alarm system on the data center.
FIG. 3 is a flow chart of an alert system implementing anomaly detection for a data center, in accordance with an exemplary embodiment. As shown in fig. 3, the monitoring process may include the following steps:
step 302, historical data of performance indicators is obtained.
In one embodiment, the alert system may be configured with a data collection function such that the alert system may obtain historical data of performance metrics of the data center. For example, the alarm system may be configured with an ETL (Extract-Transform-Load) module, and collect, by the ETL module, historical data of performance indicators of the data center, where the historical data may include a full amount of historical data, and may include historical data within a historical time period, which is not limited in this specification.
In an embodiment, by increasing the type of the performance index, the capability of detecting the abnormality of the data center can be improved at least to a certain extent, so that the performance index to which the acquired historical data belongs can be as comprehensive as possible, and the comprehensive detection of the data center can be realized. For example, the performance metrics may include a 1 minute load average, a 5 minute load average, a 15 minute load average, CPU utilization, throughput (query volume per second), delay (response duration), thread count, memory or storage size, and the like, which is not limited in this specification.
Step 304, a stable interval and an unstable interval are determined.
In one embodiment, the data center has a certain operation period, such as 1 day, 3 days, 1 week, 1 month, etc., which is not limited in this specification.
In one embodiment, some operational laws of the data center may be discovered based on its operation state over a plurality of operation cycles. Assume that the operation cycle of the data center is 00:00-24:00 of each natural day, and that historical data of a certain performance index X in the last three operation cycles (such as day d-1 to day d-3; in other embodiments, the operation cycles may be selected in other manners, which is not limited in this specification) is obtained and expressed as the three curves shown in fig. 4. The operation laws can then be found by the following principle:
On the time series, the operation cycle may comprise several time periods, such as 24 time periods at "hour" (or minute, second, etc.) granularity; then, for each period within the operation cycle, the corresponding historical data of the performance index X in the three operation cycles may be compared.
For example, in the period 41 shown in fig. 4, the performance values (i.e., the values of the historical data of the performance index X) of the data center are in a stable state in all three operation cycles; that is, the data center has no peak period (a period containing a peak) or noise period (a period containing noise) corresponding to the period 41 in any of the three operation cycles, so the values corresponding to the period 41 in the three curves are the same or similar (e.g., their differences are smaller than a preset value). The period 41 can therefore be considered to belong to a stable interval within the operation cycle, and similarly the periods 42, 43, and the like shown in fig. 4 may be regarded as stable intervals.
For the period 44 shown in fig. 4, the performance values of the data center in the three operation cycles are in an unstable state; that is, the data center has a peak period or a noise period corresponding to the period 44 in at least one operation cycle, so the values corresponding to the period 44 in the three curves differ greatly due to random peaks or noise. The period 44 can therefore be considered to belong to an unstable interval within the operation cycle, and similarly the period 45 shown in fig. 4 may be regarded as an unstable interval.
Therefore, an algorithm based on this or a similar principle can be selected to analyze the historical data of each performance index of the data center over a specific plurality of operation cycles, so as to divide the operation cycle into corresponding stable intervals and unstable intervals for each performance index; for example, the algorithm may include a Time Series Motif Discovery algorithm for discovering patterns such as "stable interval" and "unstable interval" in the operation cycle on the time series.
In one embodiment, a stable interval and an unstable interval may be identified simultaneously; in another embodiment, only the stable section may be identified with all of the remaining time periods as unstable sections, or only the unstable section may be identified with all of the remaining time periods as stable sections.
In one embodiment, steps 302-304 described above may be performed based on historical data of performance metrics of the data center, and thus the historical data may be processed offline by the alert system to determine stable and unstable intervals within the operating cycle. In steps 306-314 described below, on-line analysis may be performed on the actual data of the performance metrics of the data center to identify abnormal devices within the data center.
Step 306, collecting actual data of the performance index.
In one embodiment, the set of performance indicators involved in step 302 should cover at least the performance indicators involved in step 306, to ensure that the online analysis in subsequent steps can be completed based on the offline processing results obtained in steps 302-304.
In one embodiment, the actual data of the performance index may include a 1 minute load average, a 5 minute load average, a 15 minute load average, a CPU utilization, a throughput (query per second), a delay (response time), a thread count, a memory or storage size, etc. of each device in the data center, which is not limited in this specification. The alarm system can aggregate the acquired actual data according to the position of each device in the data center, the device group, the running application program and the like.
In one embodiment, the alarm system may perform data cleansing on the acquired actual data. For example, the alarm system may determine a missing value of the actual data on the time sequence, and supplement the missing value (the missing value may be a default value, or may be a value of an adjacent time point, or may be a data average value in a preset time period nearby, etc., which is not limited in this specification); for another example, the alert system may clear actual data that significantly violates the business rules, etc.
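A minimal sketch of this cleansing step, assuming pandas is used; the function name, the upper-limit business rule, and the choice of filling missing values from adjacent time points (one of the several options the embodiment names) are illustrative:

```python
import pandas as pd

def clean_series(samples, upper_limit=None):
    """samples: pandas Series of actual data indexed over the time sequence,
    possibly containing missing values (NaN).

    Clears values that obviously violate a business rule, then supplements
    missing values from adjacent time points by linear interpolation.
    """
    s = samples.copy()
    if upper_limit is not None:
        # Clear actual data that significantly violates the business rule.
        s = s.mask(s > upper_limit)
    # Fill gaps from neighbouring points in both directions.
    return s.interpolate(limit_direction="both")
```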
In an embodiment, the alarm system may perform correlation analysis for each performance index according to the collected actual data of each performance index. Taking a 1-minute load average value, a 5-minute load average value and a 15-minute load average value as examples, if there is a significant correlation (i.e., the three are associated with each other and the degree of association is greater than a preset degree) between the actual data of the three performance indexes, only the actual data of a part of the performance indexes can be selected for the subsequent anomaly detection process, without processing all the actual data, so as to reduce the resource occupation; for example, only the actual data of the 1-minute load average value may be selected, and no subsequent processing is required for the actual data of the 5-minute load average value and the 15-minute load average value.
In one embodiment, the alarm system may normalize the collected actual data of the performance indicators such that the actual data all conform to a predefined standardized data structure. For example, the standardized data structure may include a normal distribution structure, and assuming that the alarm system selects the 1-minute load average value, the alarm system may further determine whether the actual data of the 1-minute load average value conforms to the normal distribution structure, and if not, transform the corresponding actual data by an algorithm such as gaussian distribution transform, box-Cox transform, etc. to adjust the corresponding actual data to conform to the normal distribution structure; for example, fig. 6 shows the actual data of the 1 minute load average value processed by the Box-Cox transformation algorithm, and the actual data of the 1 minute load average value conforms to the normal distribution structure by adopting the power of 0.4 (namely, taking the power of 0.4 for the actual data of the original structure) in the transformation process.
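The Box-Cox adjustment can be sketched with SciPy; the synthetic log-normal sample below merely stands in for the actual data of the 1-minute load average, and the fixed power 0.4 mirrors the example above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed data standing in for the raw 1-minute load average.
raw = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Fit the Box-Cox power (lambda) by maximum likelihood ...
transformed, lam = stats.boxcox(raw)
# ... or force a chosen power, such as the 0.4 used in the example above.
fixed = stats.boxcox(raw, lmbda=0.4)
```

After the transform the data is far closer to the normal distribution structure, which can be checked by comparing the skewness before and after.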
In one embodiment, the alert system may provide a standard open API (Application Programming Interface) for personnel or other data consumers to programmatically access the actual data of the performance metrics.
In step 308, the actual data is divided into intervals.
In an embodiment, according to the stable interval and the unstable interval determined in step 304, for the collected actual data of the performance index, the actual data may be divided into the first actual data in the stable interval and the second actual data in the unstable interval, so as to analyze and process the first actual data and the second actual data respectively.
At step 310, an optional algorithm is determined.
In an embodiment, a first algorithm (which may include one or more optional algorithms) may be employed for first actual data in the stable interval, and a second algorithm (which may include one or more optional algorithms) may be employed for second actual data in the unstable interval.
In step 312A, the first actual data is evaluated.
In one embodiment, the alert system may evaluate the first actual data by an evaluation model based on the first algorithm described above. For example, the first algorithm may include at least one of the following alternative algorithms: time series mining algorithms, clustering algorithms, statistical learning algorithms, regression analysis algorithms, etc., which are not limited in this specification.
When the first algorithm comprises a time series mining algorithm, the first actual data for any performance indicator may be processed as follows:
1) Select an appropriate time window and smooth, within the first actual data, both the data corresponding to each device and the data corresponding to the device group to which the device belongs. The smoothing may be implemented with a Savitzky-Golay filter; alternatively, in other embodiments, it may be implemented with a Kalman filter, a moving-window averaging filter, a Butterworth filter, or another low-pass filter. All devices within a data center may be considered to belong to one device group; alternatively, devices may be grouped by location, function, or other dimensions to form multiple device groups.
2) The following processing is performed for each device group separately: in each time window, determining a device group smoothing value corresponding to data corresponding to each device group in the first actual data, taking the device group smoothing value as an expected value of each device in the device group, and simultaneously respectively determining a device smoothing value corresponding to data corresponding to each device in the first actual data; then, the distances between the device smoothed values and the corresponding expected values of each device are calculated separately.
3) The distance corresponding to each device is compared with a predefined threshold to determine the probability that the device is an abnormal device, i.e., the evaluation score provided by the time series mining algorithm for each device. The threshold may be a value determined in advance through statistical learning; for example, the statistical learning algorithm may include the "three-sigma" rule, the median absolute deviation method (median absolute deviation approach), and the like, which is not limited in this specification.
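Steps 1) and 2) above might be sketched as follows, assuming a single device group and a Savitzky-Golay filter; the function name, window length, and polynomial order are illustrative choices:

```python
import numpy as np
from scipy.signal import savgol_filter

def deviation_scores(device_data, window=11, poly=3):
    """device_data: dict mapping device name -> 1-D array of first actual data,
    with all devices assumed to belong to one device group.

    Smooths each device's data and the group's data with a Savitzky-Golay
    filter, treats the smoothed group curve as each device's expected value,
    and returns the distance between each device's smoothed curve and it.
    """
    group = np.mean(np.stack(list(device_data.values())), axis=0)
    expected = savgol_filter(group, window, poly)  # device-group smoothed value
    scores = {}
    for dev, series in device_data.items():
        smoothed = savgol_filter(series, window, poly)  # device smoothed value
        scores[dev] = float(np.linalg.norm(smoothed - expected))
    return scores
```

Step 3) then compares each returned distance against the statistically learned threshold.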
When the first algorithm includes a clustering algorithm, the clustering algorithm may include a density-based algorithm, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) or the LOF (Local Outlier Factor) algorithm, and may further include a prediction-based algorithm, such as a one-class support vector machine (One-Class SVM), which is not limited in this specification.
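As an illustration of the density-based option, LOF as implemented in scikit-learn flags points whose local density is much lower than that of their neighbours; the per-device feature vectors below (e.g. mean and variance of a metric) are invented for the example:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# One feature vector per device; four devices cluster together and one
# (the last row) sits far away from the group.
X = np.array([
    [1.0, 0.10],
    [1.1, 0.12],
    [0.9, 0.11],
    [1.05, 0.09],
    [8.0, 2.00],
])
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)  # -1 marks a density outlier (candidate abnormal device)
```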
When the first algorithm includes a statistical learning algorithm, the statistical learning algorithm may include the "three-sigma" rule, the median absolute deviation method, Tukey's test, etc., which is not limited in this specification; based on the statistical learning algorithm, devices that deviate significantly from the mean or median of all devices in the device group can be selected.
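The first two statistical-learning options can be sketched as follows; the function names and cut-off constants (3 standard deviations, the conventional 1.4826 MAD scale factor and a 3.5 cut-off) are illustrative:

```python
import numpy as np

def outliers_three_sigma(values):
    """Flag indices of devices deviating from the group mean by > 3 std devs."""
    mean, std = np.mean(values), np.std(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > 3 * std]

def outliers_mad(values, cutoff=3.5):
    """Median-absolute-deviation variant, more robust to the outliers themselves."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        return []
    # 1.4826 rescales MAD to be comparable with a standard deviation.
    return [i for i, v in enumerate(values) if abs(v - med) / (1.4826 * mad) > cutoff]
```

The MAD variant is often preferred in small device groups, where a single extreme device can inflate the mean and standard deviation enough to mask itself from the three-sigma test.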
In an embodiment, when the first algorithm includes multiple optional algorithms at the same time, the alarm system may analyze the first actual data with the evaluation model of each optional algorithm to obtain the corresponding evaluation scores, and then perform a weighted calculation (such as a weighted sum) on those scores according to the weight value of each optional algorithm, so as to obtain a comprehensive score, i.e., the first weighted evaluation score, corresponding to the first actual data.
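The weighted calculation can be sketched as follows, assuming each optional algorithm has already produced a per-device score and the per-algorithm weights are given (the dict layout and weight values are assumptions):

```python
def combine_algorithm_scores(per_algo_scores, weights):
    """Weighted sum of per-algorithm evaluation scores.

    per_algo_scores: {algorithm_name: {device_id: score}}
    weights:         {algorithm_name: weight}, assumed to sum to 1.
    Returns {device_id: weighted score} - the comprehensive score.
    """
    devices = next(iter(per_algo_scores.values())).keys()
    return {
        dev: sum(weights[a] * per_algo_scores[a][dev] for a in per_algo_scores)
        for dev in devices
    }
```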
In step 312B, the second actual data is evaluated.
In an embodiment, the alarm system may evaluate the second actual data through an evaluation model based on the second algorithm. For example, the second algorithm may include at least one of the following optional algorithms: a clustering algorithm, a statistical learning algorithm, a regression analysis algorithm, etc., which is not limited in this specification.
In an embodiment, when the second algorithm includes multiple optional algorithms at the same time, the alarm system may analyze the second actual data with the evaluation model of each optional algorithm to obtain the corresponding evaluation scores, and then perform a weighted calculation (such as a weighted sum) on those scores according to the weight value of each optional algorithm, so as to obtain a comprehensive score, i.e., the second weighted evaluation score, corresponding to the second actual data.
At step 314, a risk score for each device is obtained.
In an embodiment, since the first actual data corresponds to a stable interval in the operation period and the second actual data corresponds to an unstable interval in the operation period, the risk score for each device in the data center may be obtained by combining the evaluation result of the first actual data and the evaluation result of the second actual data.
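One possible way to combine the two evaluation results into a per-device risk score is a convex mix of the stable-interval and unstable-interval scores. The patent does not specify the combination, so the weighting below is purely an assumption:

```python
def risk_scores(stable_scores, unstable_scores, stable_weight=0.5):
    """Combine stable-interval and unstable-interval evaluation results
    into one risk score per device (assumed convex combination).

    stable_scores / unstable_scores: {device_id: score in [0, 1]}.
    """
    return {
        dev: stable_weight * stable_scores.get(dev, 0.0)
             + (1 - stable_weight) * unstable_scores.get(dev, 0.0)
        for dev in set(stable_scores) | set(unstable_scores)
    }
```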
In step 316, the evaluation models based on the optional algorithms are analyzed according to the evaluation results obtained in step 314 and the marked data set.
In an embodiment, the marked data set may include data obtained after a worker manually marks the actual data of the performance indexes of the data center, with each record marked as abnormal or normal according to the actual situation.
In an embodiment, when the alarm system evaluates the actual data of the performance indexes based on the optional algorithms in the first algorithm and the second algorithm, several situations may occur: abnormal data identified as abnormal, abnormal data identified as normal, normal data identified as normal, and normal data identified as abnormal. By comparing the evaluation results with the marked data set, the identification effect of each optional algorithm on the actual data of the performance indexes can be determined; for example, the identification effect may be represented by the precision (Precision), recall (Recall), accuracy (Accuracy), F1 value, etc. of the evaluation model corresponding to each optional algorithm, which is not limited in this specification.
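The four situations above are exactly the true/false positives and negatives of a confusion matrix, from which precision, recall, accuracy, and the F1 value follow by their standard definitions:

```python
def evaluation_metrics(predicted_abnormal, labeled_abnormal, all_items):
    """Compare an algorithm's flags with the marked data set.

    predicted_abnormal / labeled_abnormal: sets of item ids flagged abnormal
    by the evaluation model and by the manual marking, respectively.
    all_items: the full set of evaluated items.
    """
    tp = len(predicted_abnormal & labeled_abnormal)  # abnormal identified as abnormal
    fp = len(predicted_abnormal - labeled_abnormal)  # normal identified as abnormal
    fn = len(labeled_abnormal - predicted_abnormal)  # abnormal identified as normal
    tn = len(all_items) - tp - fp - fn               # normal identified as normal
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_items)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```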
At step 318, an evaluation model based on the optional algorithm is improved.
In an embodiment, the evaluation model based on each optional algorithm may be improved according to its evaluation effect. For example, the thresholds used in the evaluation model to evaluate the anomaly probability of actual data (e.g., the threshold mentioned above for the time series mining algorithm) may be adjusted. For another example, the weight values of the evaluation models corresponding to the respective optional algorithms may be adjusted.
Based on steps 316-318 described above, the alarm system may automatically feed back the risk score of each device, thereby enabling continuous improvement of the evaluation models based on the optional algorithms using the marked data set. The automatic feedback process may be implemented online or offline, which is not limited in this specification. For example, the automatic feedback and corresponding improvement may be implemented with reference to the following feedback formula:
wherein s represents the total anomaly score, s_i represents the anomaly score of the i-th evaluation model, w_i represents the weight value of the i-th evaluation model, and f_i represents the integrated assessment score (similar to the F1 value) of the accuracy of the i-th evaluation model.
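Reading the symbol definitions above, one plausible form of the feedback formula is a product-weighted sum over the models; the exact formula is not reproduced in this excerpt, so the expression below is an assumption:

```python
def total_anomaly_score(model_scores, weights, accuracy_scores):
    """Assumed reading of the feedback formula: s = sum_i w_i * f_i * s_i,
    where each model's anomaly score s_i is scaled by its weight w_i and
    its accuracy score f_i (similar to the F1 value)."""
    return sum(w * f * s for w, f, s in
               zip(weights, accuracy_scores, model_scores))
```

With this form, a model whose accuracy score f_i drops automatically contributes less to the total anomaly score on the next round.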
In an embodiment, the alarm system may be connected to an event management platform of the data center, so that the alarm system can send alarm information to the event management platform. The alarm information may include:
1) The effect of the evaluation model based on each optional algorithm, analyzed against the marked data set. As shown in Table 1 below, assuming the optional algorithms adopted by the alarm system include a one-class support vector machine, a statistical algorithm, a time series mining algorithm, the DBSCAN algorithm, a regression analysis algorithm, and the like, the evaluation model based on each optional algorithm can be assessed in terms of precision, recall, F1 value, and so on.
Selectable algorithm | Precision | Recall | F1 value
Support vector machine | 80% | 80% | 0.400
Statistical algorithm | 90% | 50% | 0.321
Time sequence mining algorithm | 80% | 85% | 0.412
DBSCAN algorithm | 75% | 75% | 0.375
Regression analysis algorithm | 85% | 70% | 0.384

TABLE 1
2) Details of each abnormal device. For example, the detail information may include a device ID and a risk score, as shown in Table 2 below; in addition, the details may further include other types of information such as the device IP, the timestamp at which the anomaly occurred, the performance index exhibiting the anomaly, and the like, which is not limited in this specification.
TABLE 2
In summary, in the present disclosure, by dividing the operation period of the data center into stable intervals and unstable intervals and performing anomaly evaluation on the actual data of the performance indexes separately in the different intervals, a risk score can be given for the anomaly condition of each device in the data center, which can significantly improve the anomaly detection efficiency of the data center compared with the technical solutions in the related art. Meanwhile, by adopting a combination of multiple algorithms in the anomaly evaluation process and continuously improving the evaluation models based on those algorithms using the marked data, the performance of the technical solution adopted in this specification can be further improved.
Fig. 7 is a schematic block diagram of an electronic device according to an exemplary embodiment. Referring to fig. 7, at the hardware level, the electronic device includes a processor 702, an internal bus 704, a network interface 706, a memory 708, and a nonvolatile memory 710, and may of course include other hardware as needed for other services. The processor 702 reads the corresponding computer program from the nonvolatile memory 710 into the memory 708 and then runs it, forming a detection apparatus for abnormal devices at the logical level. Of course, in addition to a software implementation, one or more embodiments of the present disclosure do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, but may also be hardware or a logic device.
Referring to fig. 8, in a software implementation, the detecting device of the abnormal device may include:
an acquisition unit 81 that acquires actual data of performance indexes of the monitored system;
a determination unit 82 that determines first actual data of a stable section in a time series and second actual data of an unstable section in a time series among the actual data;
an evaluation unit 83, configured to evaluate the first actual data to obtain a first evaluation result, and to evaluate the second actual data to obtain a second evaluation result;
and an identification unit 84 for determining an abnormal device in the monitored system according to the first evaluation result and the second evaluation result.
Optionally, the running period of the monitored system is divided into a plurality of time periods in time sequence; when the performance index is in a stable state in the history data of the same time period in the selected operation periods, the same time period is determined to belong to the stable interval, and when the history data of the same time period is in an unstable state in at least one selected operation period, the same time period is determined to belong to the unstable interval.
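The interval-determination rule (a period belongs to the stable interval only if it is stable in every selected operation period) can be sketched as follows; the stability predicate, here a low-variance check with an assumed limit, is not specified by the patent:

```python
from statistics import pvariance

def classify_periods(history, stable_check):
    """history: {period_index: [series for cycle 1, series for cycle 2, ...]}.
    A period belongs to the stable interval only if its data is stable in
    every selected operation cycle; one unstable cycle makes it unstable.
    """
    return {
        period: "stable" if all(stable_check(s) for s in cycles) else "unstable"
        for period, cycles in history.items()
    }

def low_variance(series, limit=1.0):
    # assumed stability predicate: low variance within the period
    return pvariance(series) < limit
```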
Optionally, the period of unstable state includes at least one of: peak period, noise period; the steady state period includes other periods of the operating cycle than the unstable state period.
Optionally, the stable interval and the unstable interval are determined from within the operation period by a time series pattern discovery algorithm.
Optionally, an unsupervised algorithm is used to evaluate the first actual data and the second actual data.
Optionally,
the first actual data is evaluated by at least one of the following optional algorithms: a time sequence mining algorithm, a clustering algorithm, a statistical learning algorithm and a regression analysis algorithm;
the second actual data is evaluated by at least one of the following optional algorithms: clustering algorithm, statistical learning algorithm and regression analysis algorithm.
Optionally,
when multiple selectable algorithms are adopted to evaluate the first actual data, each selectable algorithm has a corresponding weight value, and the evaluation result of the first actual data comprises a first weighted evaluation value obtained by performing weighted calculation on evaluation values respectively obtained by the multiple selectable algorithms;
when the second actual data is evaluated by adopting a plurality of optional algorithms, each optional algorithm has a corresponding weight value, and the evaluation result of the second actual data comprises a second weighted evaluation value obtained by weighting and calculating the evaluation values respectively obtained by the plurality of optional algorithms.
Optionally, the method further comprises:
a comparison unit 85 that compares the evaluation result with the flag data to analyze the evaluation accuracy of each of the optional algorithms for evaluating the first actual data or the second actual data;
and an improvement unit 86 for improving the corresponding optional algorithm according to the evaluation accuracy.
Optionally, the method further comprises:
an analysis unit 87 that performs correlation analysis with respect to performance indexes of the monitored system;
and a screening unit 88 for screening out the actual data of at least one of the associated plurality of performance indicators when the actual data of the associated plurality of performance indicators are all acquired.
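The correlation analysis and screening performed by the analysis unit 87 and the screening unit 88 can be sketched with a Pearson correlation and a greedy keep/drop pass; the 0.9 threshold and the greedy strategy are assumptions:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def screen_correlated(indicators, threshold=0.9):
    """indicators: {name: series}. Keep an indicator only if it is not
    strongly correlated with one already kept, screening out redundant
    actual data among associated performance indexes."""
    kept = []
    for name, series in indicators.items():
        if all(abs(pearson(series, indicators[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```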
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile memory in a computer-readable medium, a random access memory (RAM) and/or a nonvolatile memory, such as a read-only memory (ROM) or flash memory (flash RAM). The memory is an example of the computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."
The foregoing description of the preferred embodiments is merely intended to illustrate embodiments of the present invention, and is not intended to limit the present invention to the particular embodiments described.

Claims (14)

1. A method of detecting an abnormal device, comprising:
acquiring actual data of performance indexes of a monitored system;
determining first actual data of a stable interval in a time sequence and second actual data of an unstable interval in the time sequence in the actual data, wherein the running period of the monitored system is divided into a plurality of time periods in the time sequence; when the performance index is in a stable state in the history data of the same time period in the selected multiple operation periods, the same time period is determined to belong to the stable interval, and when the history data of the same time period is in an unstable state in at least one selected operation period, the same time period is determined to belong to the unstable interval;
evaluating the first actual data to obtain a first evaluation result, and evaluating the second actual data to obtain a second evaluation result; when multiple selectable algorithms are adopted to evaluate the first actual data, each selectable algorithm has a corresponding weight value, and the evaluation result of the first actual data comprises a first weighted evaluation value obtained by respectively carrying out weighted calculation on evaluation values obtained by the multiple selectable algorithms; when multiple optional algorithms are adopted to evaluate the second actual data, each optional algorithm has a corresponding weight value, and the evaluation result of the second actual data comprises a second weighted evaluation value obtained by performing weighted calculation on evaluation values respectively obtained by the multiple optional algorithms;
And determining abnormal equipment in the monitored system according to the first evaluation result and the second evaluation result.
2. The method of claim 1, wherein the period of unstable state comprises at least one of: peak period, noise period; the steady state period includes other periods of the operating cycle than the unstable state period.
3. The method of claim 1, wherein the stability interval and the instability interval are determined from within the run period by a time-series pattern discovery algorithm.
4. The method of claim 1, wherein the first actual data and the second actual data are evaluated using an unsupervised algorithm.
5. The method of claim 1, wherein
the first actual data is evaluated by at least one of the following optional algorithms: a time sequence mining algorithm, a clustering algorithm, a statistical learning algorithm and a regression analysis algorithm;
the second actual data is evaluated by at least one of the following optional algorithms: clustering algorithm, statistical learning algorithm and regression analysis algorithm.
6. The method as recited in claim 1, further comprising:
comparing the evaluation result with the tag data to analyze the evaluation accuracy of each of the optional algorithms for evaluating the first actual data or the second actual data;
and improving the corresponding optional algorithm according to the evaluation accuracy.
7. The method as recited in claim 1, further comprising:
performing correlation analysis on the performance index of the monitored system;
when the actual data of the associated plurality of performance indexes are all acquired, the actual data of at least one performance index of the associated plurality of performance indexes is screened out.
8. A detection apparatus for an abnormal device, comprising:
the acquisition unit acquires actual data of performance indexes of the monitored system;
the system comprises a determining unit, a judging unit and a judging unit, wherein the determining unit is used for determining first actual data of a stable interval in a time sequence and second actual data of an unstable interval in the time sequence in the actual data, and the running period of the monitored system is divided into a plurality of time periods in the time sequence; when the performance index is in a stable state in the history data of the same time period in the selected multiple operation periods, the same time period is determined to belong to the stable interval, and when the history data of the same time period is in an unstable state in at least one selected operation period, the same time period is determined to belong to the unstable interval;
The evaluation unit is used for evaluating the first actual data to obtain a first evaluation result and evaluating the second actual data to obtain a second evaluation result; when multiple selectable algorithms are adopted to evaluate the first actual data, each selectable algorithm has a corresponding weight value, and the evaluation result of the first actual data comprises a first weighted evaluation value obtained by respectively carrying out weighted calculation on evaluation values obtained by the multiple selectable algorithms; when multiple optional algorithms are adopted to evaluate the second actual data, each optional algorithm has a corresponding weight value, and the evaluation result of the second actual data comprises a second weighted evaluation value obtained by performing weighted calculation on evaluation values respectively obtained by the multiple optional algorithms;
and the identification unit is used for determining abnormal equipment in the monitored system according to the first evaluation result and the second evaluation result.
9. The apparatus of claim 8, wherein the period of unstable state comprises at least one of: peak period, noise period; the steady state period includes other periods of the operating cycle than the unstable state period.
10. The apparatus of claim 8, wherein the stability interval and the instability interval are determined from within the run period by a timing pattern discovery algorithm.
11. The apparatus of claim 8, wherein the first actual data and the second actual data are evaluated using an unsupervised algorithm.
12. The apparatus of claim 8, wherein
the first actual data is evaluated by at least one of the following optional algorithms: a time sequence mining algorithm, a clustering algorithm, a statistical learning algorithm and a regression analysis algorithm;
the second actual data is evaluated by at least one of the following optional algorithms: clustering algorithm, statistical learning algorithm and regression analysis algorithm.
13. The apparatus as recited in claim 8, further comprising:
a comparison unit that compares the evaluation result with the tag data to analyze an evaluation accuracy of each of the optional algorithms for evaluating the first actual data or the second actual data;
and the improvement unit is used for improving the corresponding optional algorithm according to the evaluation accuracy.
14. The apparatus as recited in claim 8, further comprising:
the analysis unit is used for carrying out correlation analysis on the performance index of the monitored system;
and the screening unit screens out the actual data of at least one performance index of the associated multiple performance indexes when the actual data of the associated multiple performance indexes are all acquired.
CN201711455271.6A 2017-12-28 2017-12-28 Abnormal equipment detection method and device Active CN109976986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711455271.6A CN109976986B (en) 2017-12-28 2017-12-28 Abnormal equipment detection method and device


Publications (2)

Publication Number Publication Date
CN109976986A CN109976986A (en) 2019-07-05
CN109976986B true CN109976986B (en) 2023-12-19

Family

ID=67074171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711455271.6A Active CN109976986B (en) 2017-12-28 2017-12-28 Abnormal equipment detection method and device

Country Status (1)

Country Link
CN (1) CN109976986B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569389A (en) * 2019-07-25 2019-12-13 深圳壹账通智能科技有限公司 Environment monitoring method and device, computer equipment and storage medium
CN112527604A (en) * 2020-12-16 2021-03-19 广东昭阳信息技术有限公司 Deep learning-based operation and maintenance detection method and system, electronic equipment and medium
CN113110981B (en) * 2021-03-26 2024-04-09 北京中大科慧科技发展有限公司 Air conditioner room health energy efficiency detection method for data center

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716180A (en) * 2013-12-04 2014-04-09 国网上海市电力公司 Network flow actual forecasting-based network abnormality pre-warning method
CN104899405A (en) * 2014-03-04 2015-09-09 携程计算机技术(上海)有限公司 Data prediction method and system and alarming method and system
CN106485526A (en) * 2015-08-31 2017-03-08 阿里巴巴集团控股有限公司 A kind of diagnostic method of data mining model and device
CN107066365A (en) * 2017-02-20 2017-08-18 阿里巴巴集团控股有限公司 The monitoring method and device of a kind of system exception

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7346471B2 (en) * 2005-09-02 2008-03-18 Microsoft Corporation Web data outlier detection and mitigation
US8699357B2 (en) * 2006-11-30 2014-04-15 Alcatel Lucent Methods and apparatus for instability detection in inter-domain routing
US9652354B2 (en) * 2014-03-18 2017-05-16 Microsoft Technology Licensing, Llc. Unsupervised anomaly detection for arbitrary time series
US10261851B2 (en) * 2015-01-23 2019-04-16 Lightbend, Inc. Anomaly detection using circumstance-specific detectors
JP6652699B2 (en) * 2015-10-05 2020-02-26 富士通株式会社 Anomaly evaluation program, anomaly evaluation method, and information processing device


Also Published As

Publication number Publication date
CN109976986A (en) 2019-07-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant