CN115454763A - Index abnormity judgment method and device - Google Patents

Publication number
CN115454763A
CN115454763A CN202210956510.0A CN202210956510A CN115454763A CN 115454763 A CN115454763 A CN 115454763A CN 202210956510 A CN202210956510 A CN 202210956510A CN 115454763 A CN115454763 A CN 115454763A
Authority
CN
China
Prior art keywords
threshold
value
data
trend
window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210956510.0A
Other languages
Chinese (zh)
Inventor
李光水
毛文安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210956510.0A
Publication of CN115454763A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3041 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is an input/output interface
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time

Abstract

The application discloses a method and a device for judging index abnormality. The method comprises: collecting data for an index to be monitored to obtain data corresponding to the index; acquiring a first threshold of each of a plurality of sliding windows before the current moment, wherein the first threshold is obtained according to the average value of the data in the sliding window, and the plurality of sliding windows are obtained by sequentially moving a sliding window over the collected data according to a preset step size; obtaining a second threshold according to the first threshold of each of the plurality of sliding windows; and judging, according to the second threshold, whether the data collected at the current moment is abnormal. This addresses the prior-art problem of judging index abnormality with a manually set static threshold: the threshold is generated dynamically from the collected data, which reduces manual intervention and improves the accuracy of index abnormality judgment to a certain extent.

Description

Index abnormity judgment method and device
Technical Field
The present application relates to the field of abnormality determination, and in particular, to a method and an apparatus for determining an index abnormality.
Background
Data storage is the foundation of current networks and computer systems, and as the amount of stored data grows, it is necessary to monitor whether storage is behaving abnormally. In the prior art, data is collected for indexes related to storage input/output (IO), and whether an abnormality has occurred is determined from the collected data. Specifically, storage IO abnormality recognition relies on a static threshold preset manually according to experience: when a collected IO index is higher than the threshold, the IO index is determined to be abnormal.
Setting a static threshold based on human experience has two problems. First, different service scenarios or different machines call for different thresholds, so a single threshold cannot cover all scenarios. Second, a static threshold cannot adapt to sudden changes in service demand: a threshold set several days ago may no longer be applicable today, which can cause false alarms or missed alarms.
Disclosure of Invention
The embodiments of the application provide an index abnormality judgment method and device, to at least solve the prior-art problem that index abnormality is judged by means of a manually set static threshold.
According to an aspect of the present application, there is provided an index abnormality determination method including: acquiring data of an index to be monitored to obtain data corresponding to the index to be monitored; acquiring a first threshold value of each sliding window in a plurality of sliding windows before the current moment, wherein the first threshold value is obtained according to an average value of data in the sliding window; sequentially moving sliding windows on the acquired data corresponding to the indexes according to a preset step length to obtain a plurality of sliding windows; obtaining a second threshold according to the first threshold of each sliding window in the plurality of sliding windows, wherein the second threshold is calculated according to the first threshold; and judging whether the data acquired at the current moment is abnormal or not according to the second threshold.
According to another aspect of the present application, there is also provided an electronic device comprising a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the above-described method steps.
According to another aspect of the present application, there is also provided a readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the above-described method steps.
According to another aspect of the present application, there is also provided an index abnormality determination apparatus including: a collection module, configured to collect data for the index to be monitored to obtain data corresponding to the index; an acquisition module, configured to acquire a first threshold of each of a plurality of sliding windows before the current moment, wherein the first threshold is obtained according to an average value of the data in the sliding window, and the plurality of sliding windows are obtained by sequentially moving a sliding window over the collected data according to a preset step size; a calculating module, configured to obtain a second threshold according to the first threshold of each of the plurality of sliding windows, wherein the second threshold is calculated according to the first threshold; and a judging module, configured to judge whether the data collected at the current moment is abnormal according to the second threshold.
In the embodiment of the application, data acquisition is carried out on the index to be monitored to obtain data corresponding to the index to be monitored; acquiring a first threshold value of each sliding window in a plurality of sliding windows before the current moment, wherein the first threshold value is obtained according to an average value of data in the sliding window; sequentially moving sliding windows on the acquired data corresponding to the indexes according to a preset step length to obtain a plurality of sliding windows; obtaining a second threshold according to the first threshold of each sliding window in the plurality of sliding windows, wherein the second threshold is calculated according to the first threshold; and judging whether the data acquired at the current moment is abnormal or not according to the second threshold. The method and the device solve the problem that the index abnormity is judged by manually setting the static threshold value in the prior art, further can dynamically generate the threshold value according to the collected data, reduce the manual intervention and improve the accuracy of the index abnormity judgment to a certain extent.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments of the application are intended to be illustrative of the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flowchart of an index abnormality determination method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a possible data false positive condition according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an IO index monitoring system according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an await index monitoring abnormal point according to an embodiment of the application;
FIG. 5 is a schematic diagram of a false alarm when a value of monitoring index data is small according to an embodiment of the application;
FIG. 6 is a diagram of a second threshold and a compensation value according to an embodiment of the application;
FIG. 7 is a diagram illustrating the use of minimum thresholds according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The following embodiments describe the monitoring of an index. Monitoring an index requires collecting the data corresponding to it; accordingly, any quantity for which data can be collected and on which an abnormality determination can be made from that data may be referred to as an index.
In the static threshold scheme, a threshold is first preset manually according to experience, a program for collecting the IO index is then deployed on a routine basis to collect data, and an abnormality is determined when the value of the collected data is higher than the threshold. As mentioned in the background, the static threshold scheme has the following disadvantages: (1) Different service scenarios or different computing devices call for different thresholds, and a single threshold cannot cover all scenarios. A threshold therefore has to be configured for each service scenario or each computing device, which consumes considerable manpower and material resources, and it is still difficult to ensure that every configured threshold is appropriate. (2) A static threshold cannot adapt to sudden changes in service demand. For example, a threshold set several days ago may no longer be suitable today, which may cause false alarms or missed alarms.
The following embodiments provide a method for intelligently identifying abnormalities. It may be applied to the identification of storage IO abnormalities and also to other abnormalities, for example abnormalities in indexes related to network devices. The abnormality identification (also called abnormality judgment) scheme below uses a dynamic threshold, so that index abnormalities are identified automatically without human intervention, which addresses the problems of judging abnormalities with a static threshold.
This embodiment provides an index abnormality determination method. FIG. 1 is a flowchart of the index abnormality determination method according to an embodiment of the present application; the steps shown in FIG. 1 are explained below.
Step S102, data acquisition is carried out on an index to be monitored, and data corresponding to the index to be monitored is obtained;
Step S104, acquiring a first threshold value of each of a plurality of sliding windows before the current moment, wherein the first threshold value is obtained according to an average value of the data in the sliding window, and the plurality of sliding windows are obtained by sequentially moving a sliding window over the collected data according to a preset step length;
This step introduces the concept of a sliding window, illustrated by the following example. Assume that 120 data points, data 1 to data 120, have been collected up to the current time, that the size of the sliding window is 100 (each window contains 100 data points), and that the sliding step is 1. The window initially covers data 1 to data 100; this is called the first sliding window, and its first threshold is calculated. The window then moves forward by one data point to cover data 2 to data 101; this is the second sliding window, and its first threshold is calculated. Continuing in this way yields the last sliding window, covering data 21 to data 120, for which the first threshold is also calculated. In total this gives 21 sliding windows, and through this step the first threshold of each of them is obtained.
The number of data included in each sliding window (or the length of the sliding window) may be configured in advance, and the length of the sliding window may be kept unchanged during each sliding. Or as another alternative, the length of the sliding window may be increased or decreased at each sliding, so that the sliding window may include more or less data according to actual needs, thereby adapting to different situations. The step size of the sliding window may be configured in advance, and in the above example, 1 is used as the step size, or other values may be used as the step size. For example, 5 may be used as the step size, and if 5 is used as the step size, the first sliding window includes data 1 through data 100, the second sliding window includes data 6 through data 105, and so on. In practical application, different step sizes can be selected and configured according to practical situations, in principle, if the data volume is small, the step size can be configured to be smaller, and if the data volume is larger, the step size can be configured to be larger.
Step S106, obtaining a second threshold value according to the first threshold value of each sliding window in the plurality of sliding windows, wherein the second threshold value is obtained by calculation according to the first threshold value;
Step S108, judging whether the data acquired at the current moment is abnormal or not according to the second threshold value.
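The sliding-window example given for step S104 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `sliding_windows` and its parameters are hypothetical, and a fixed window length is assumed.

```python
from typing import List, Sequence


def sliding_windows(data: Sequence[float], size: int, step: int = 1) -> List[Sequence[float]]:
    """Return every full window of `size` points, advancing `step` points each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # The last valid start index is len(data) - size; shorter data yields no window.
    return [data[start:start + size] for start in range(0, len(data) - size + 1, step)]


data = list(range(1, 121))                      # data 1 .. data 120
windows = sliding_windows(data, size=100, step=1)
print(len(windows))                             # 21
print(windows[0][0], windows[0][-1])            # 1 100
print(windows[-1][0], windows[-1][-1])          # 21 120
```

With `step=5` the second window covers data 6 to data 105, matching the step-size example in the description.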
In the above steps, the threshold is no longer set manually; instead, the threshold used to judge whether data is abnormal is derived from previously collected data. A plurality of sliding windows is used, each containing a segment of data, so a first threshold related to the average value of the data in the window can be obtained for each sliding window. A second threshold is then obtained from the first thresholds of the sliding windows; it reflects the trend of the data corresponding to the index and therefore helps in finding abnormal data. Because the threshold is derived from collected data by means of sliding windows, it is updated dynamically as data continues to be collected. This addresses the prior-art problem of judging index abnormality with a manually set static threshold, reduces manual intervention, and improves the accuracy of index abnormality judgment to a certain extent.
In the above steps, the first threshold is related to the average value of the data in the sliding window. As a relatively simple approach, the average value of the data in the sliding window can be used directly as the first threshold. A threshold calculated this way is purely an average and cannot reflect the extreme values (maximum and minimum) occurring in the sliding window, which may affect the abnormality judgment. For example, if the data in a window all fluctuate between 10 and 20 except for a single maximum of 50, using only the average as the first threshold does not reflect the influence of that maximum. To solve this problem, in another alternative embodiment, obtaining the first threshold of each sliding window may include the following steps: obtaining the average value of the data in the sliding window; obtaining the maximum value and the minimum value of the data in the sliding window; calculating a first difference between the maximum value and the average value, and a second difference between the average value and the minimum value; and taking the larger of the first difference and the second difference as the first threshold. The first threshold calculated in this way is still related to the average value, but because it is the larger of the two deviations from the average, it takes the extreme values in each sliding window into account, so the calculated first threshold can reduce false alarms compared with using the average directly.
After the first thresholds are calculated as above, the second threshold is obtained from them and is used as the basis for judging whether data is abnormal. Since one first threshold is calculated from the data in each sliding window, the simplest approach is to average the first thresholds of all sliding windows and use that average directly as the second threshold. In practice, the data corresponding to an index is generally time-correlated, that is, data collected closer to the current time is considered more relevant; to take this into account, a weighted average can be used instead. Obtaining the second threshold from the first threshold of each of the plurality of sliding windows may therefore be done in two ways. In the first way, the first thresholds of the sliding windows are averaged and the resulting average is used as the second threshold. In the second way, a weighted average of the first thresholds of the sliding windows is calculated and used as the second threshold. When weighting, a sliding window closer to the current time may be given a larger weight, so that the second threshold calculated in this way reflects recent changes in the index data.
After the second threshold is calculated, whether the collected data is abnormal can be judged by comparing the value of the collected data with the second threshold: an abnormality is considered to have occurred only if the value of the collected data is greater than the second threshold. This determination reduces the missed alarm rate, but in some cases it increases the false alarm rate. A possible case is described below with reference to the drawings. FIG. 2 is a schematic diagram of a possible false alarm situation according to an embodiment of the present application. FIG. 2 shows the curve of the calculated second threshold; the black points above the curve are data judged abnormal, because their values are greater than the value of the second threshold curve at the corresponding time. Note that when collection of the index data has only just started, the second threshold calculated as described above is relatively small. If data point A occurs at this time, its value is not large and it should not count as an abnormal data point; however, because the second threshold at the start of calculation is relatively small, the value of data point A exceeds it and data point A is judged abnormal, which is in fact a false alarm.
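The alternative first-threshold calculation (the larger of the two deviations from the window average) can be sketched as below. The function name `first_threshold` is hypothetical and the sample window is invented purely for illustration.

```python
from statistics import mean
from typing import Sequence


def first_threshold(window: Sequence[float]) -> float:
    """Larger of (max - average) and (average - min) for one sliding window."""
    avg = mean(window)
    first_difference = max(window) - avg    # how far the maximum lies above the average
    second_difference = avg - min(window)   # how far the minimum lies below the average
    return max(first_difference, second_difference)


# Values mostly between 10 and 20, with a single extreme value of 50.
window = [12, 15, 18, 11, 50, 14]
# The average is 20, so the maximum lies 30 above it and the minimum 9 below it.
print(first_threshold(window))
```

Using the plain average here would give 20 and hide the single spike, whereas the deviation-based value of 30 is driven by it.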
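Both ways of deriving the second threshold can be sketched as follows. The description does not specify the weights, so the linearly increasing default weights (1, 2, ..., n, oldest window first) are an assumption made only for this illustration, as are the function names.

```python
from typing import Optional, Sequence


def second_threshold_mean(first_thresholds: Sequence[float]) -> float:
    """First way: plain average of the per-window first thresholds."""
    return sum(first_thresholds) / len(first_thresholds)


def second_threshold_weighted(
    first_thresholds: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Second way: weighted average; `first_thresholds` is ordered oldest to newest."""
    if weights is None:
        # Assumed weighting: windows closer to the current time get larger weights.
        weights = range(1, len(first_thresholds) + 1)
    if len(weights) != len(first_thresholds):
        raise ValueError("weights and first_thresholds must have the same length")
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, first_thresholds)) / total


firsts = [10.0, 20.0, 30.0]                 # oldest window first
print(second_threshold_mean(firsts))        # 20.0
print(second_threshold_weighted(firsts))    # (10*1 + 20*2 + 30*3) / 6
```

When the first thresholds are rising, the weighted variant ends up above the plain average, so it follows recent changes more closely.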
To solve the false alarm problem, in one embodiment a preconfigured fixed value may be introduced. The fixed value serves as a minimum threshold, and a value smaller than the minimum threshold is not regarded as abnormal. For data point A above, if its value is compared with the minimum threshold and found to be smaller, data point A is not considered an abnormal data point. Introducing a fixed value reduces the false alarm rate to a certain extent, but a certain probability of false alarms remains for data larger than the minimum threshold. In view of this, in another alternative embodiment the second threshold may be dynamically compensated according to actual conditions: a compensation value is added, and the judgment is made according to the second threshold adjusted by the compensation value. Adding the compensation value keeps the second threshold within a certain fluctuation range, which reduces the false alarm rate. In this alternative embodiment, judging whether the data collected at the current moment is abnormal according to the second threshold may include the following steps: acquiring a compensation value used to compensate the second threshold; adjusting the second threshold using the compensation value; and judging whether the data collected at the current moment is abnormal according to the adjusted second threshold.
In this alternative embodiment, so that the compensation value reflects changes in the index data, an initial compensation value may first be chosen, and whether to adjust it is then decided according to the change trend of the second threshold. To measure the change trend of the second threshold, a window is again introduced. Its length may be the same as or different from that of the sliding window in the above embodiment; if it is the same, the window can be regarded as a window measured in units of the sliding window. In this alternative, obtaining the compensation value may include the following steps: setting a window over the obtained second thresholds, the window covering a plurality of second thresholds; taking the first of the second thresholds in the window as an initial compensation value; and determining, according to the change trend of the second thresholds in the window, whether to update the initial compensation value, so as to obtain the compensation value.
The change trend can be determined in various ways. For example, it may be classified as an upward trend or a downward trend. In an upward trend, the values of the collected index data are expected to grow, and abnormalities are likely to occur while they grow; to prevent missed alarms, the initial compensation value is not updated when the change trend is upward, and the initial compensation value is used as the compensation value. In a downward trend, the values are falling from a large value, so the initial compensation value taken from the window is large; the second threshold compensated by it is then also large, and missed alarms may occur. Therefore, when the change trend is downward, the initial compensation value is updated, and the updated value is used as the compensation value.
When the initial compensation value is updated, the other second thresholds in the window can be obtained and the maximum of them selected as the compensation value; using the maximum reduces the false alarm rate but raises the missed alarm rate. Alternatively, the minimum of the other second thresholds may be used as the compensation value, which raises the false alarm rate but reduces the missed alarm rate. The average of all the second thresholds in the window may also be used as the compensation value, and so on; the choice can be made flexibly according to actual needs. In a preferred embodiment, the maximum of the other second thresholds is selected as the compensation value, so as to reduce the false alarm rate as far as possible. When the maximum of the other second thresholds is selected as the updated compensation value, it is also checked whether this maximum is smaller than the initial compensation value; if it is, the maximum is used as the compensation value, so that the compensation value is reduced and the missed alarm rate is lowered.
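The preferred update rule for the compensation value can be sketched as follows. The function name `compensation_value` and the boolean `rising` argument are hypothetical; how the trend itself is classified is left to the caller in this sketch.

```python
from typing import Sequence


def compensation_value(second_thresholds: Sequence[float], rising: bool) -> float:
    """Pick the compensation value from a window of second thresholds (oldest first)."""
    initial = second_thresholds[0]          # the first second threshold in the window
    others = second_thresholds[1:]
    if rising or not others:
        # Upward trend: keep the initial compensation value unchanged.
        return initial
    # Downward trend: consider the maximum of the other second thresholds,
    # and use it only if it is smaller than the initial compensation value.
    candidate = max(others)
    return candidate if candidate < initial else initial


print(compensation_value([10.0, 8.0, 6.0, 7.0], rising=False))  # 8.0
print(compensation_value([10.0, 8.0, 6.0, 7.0], rising=True))   # 10.0
```

Replacing `max(others)` with `min(others)` or with the window average would give the other variants mentioned above, trading false alarms against missed alarms.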
There are many ways to determine whether there is an upward trend or a downward trend within a window.
For example, a first number of the other second threshold values in the window that is greater than the initial compensation value may be obtained, a first percentage of the first number to the total number of the second threshold values in the window is calculated, and in a case that the first percentage is greater than a preset first predetermined percentage, it is determined that the trend of change is an upward trend, and if not, the trend of change is a downward trend.
As another example of the present invention,
the method also can obtain a second number of the next second threshold value in the window, which is larger than the previous threshold value, calculate a second percentage of the second number in the total number, and determine that the variation trend is an ascending trend when the second percentage is larger than a preset second preset percentage, otherwise, determine that the variation trend is a descending trend.
It should be noted that the preset first predetermined percentage and the preset second predetermined percentage in the above two examples may be the same or different. In any way, the variation trend of the second threshold in the window is determined, and it can be determined whether to adjust the compensation value accordingly according to the variation as long as the variation trend is obtained.
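The two example trend tests can be sketched as follows. The function names are hypothetical, and the default of 0.5 for both predetermined percentages is an assumed value, since the description leaves them configurable.

```python
from typing import Sequence


def rising_by_initial(second_thresholds: Sequence[float], min_fraction: float = 0.5) -> bool:
    """First example: share of later thresholds that exceed the initial compensation value."""
    if not second_thresholds:
        raise ValueError("window must contain at least one second threshold")
    initial = second_thresholds[0]
    first_number = sum(1 for v in second_thresholds[1:] if v > initial)
    return first_number / len(second_thresholds) > min_fraction


def rising_by_steps(second_thresholds: Sequence[float], min_fraction: float = 0.5) -> bool:
    """Second example: share of thresholds that exceed their immediate predecessor."""
    if not second_thresholds:
        raise ValueError("window must contain at least one second threshold")
    pairs = zip(second_thresholds, second_thresholds[1:])
    second_number = sum(1 for prev, nxt in pairs if nxt > prev)
    return second_number / len(second_thresholds) > min_fraction


print(rising_by_initial([1, 2, 3, 4]), rising_by_steps([1, 2, 3, 4]))   # True True
print(rising_by_initial([4, 3, 2, 1]), rising_by_steps([4, 3, 2, 1]))   # False False
```

Anything that is not classified as upward is treated as downward, following the two examples in the text.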
There is also a more specific way in the downward trend, that is, the trend of the second threshold value is monotonically decreased, in this case, it is indicated that the index is returning to the normal state immediately after a round of pressure, and then an average value can be calculated according to the first threshold values corresponding to a plurality of sliding windows, and then the average value is used as a fixed compensation value. That is, in an alternative, determining whether the initial compensation value is updated according to the trend of change of the second threshold value in the window may include the steps of: acquiring a plurality of continuous first threshold values under the condition that the change trend of the second threshold values in the window is monotonously reduced; and using the obtained average value of the continuous multiple first threshold values as the fixed compensation value.
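The monotonically decreasing special case can be sketched as follows. The names are hypothetical, strict decrease is assumed as the meaning of "monotonically decreasing", and returning `None` when the case does not apply is a choice made only for this illustration.

```python
from statistics import mean
from typing import Optional, Sequence


def is_monotonically_decreasing(values: Sequence[float]) -> bool:
    """True when every value is strictly smaller than its predecessor (assumed definition)."""
    return len(values) >= 2 and all(b < a for a, b in zip(values, values[1:]))


def fixed_compensation(
    second_thresholds: Sequence[float],
    consecutive_first_thresholds: Sequence[float],
) -> Optional[float]:
    """Average of consecutive first thresholds if the second thresholds fall monotonically."""
    if is_monotonically_decreasing(second_thresholds):
        return mean(consecutive_first_thresholds)
    return None  # not the special case; the general update rule applies instead


print(fixed_compensation([9.0, 7.0, 5.0], [4.0, 6.0, 8.0]))  # 6.0
print(fixed_compensation([9.0, 7.0, 8.0], [4.0, 6.0, 8.0]))  # None
```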
As mentioned in the above embodiment, in order to solve the problem of false alarms, a preconfigured fixed value may be introduced. The fixed value may be regarded as a minimum threshold (also referred to as a minimum value), and a value smaller than the minimum threshold is not regarded as an abnormal value. The minimum value may be retained when the compensation value is used: if the second threshold adjusted by the compensation value is still smaller than the minimum value, reporting abnormalities according to the adjusted second threshold alone may still produce false alarms. In this case, in order to reduce false alarms, the preconfigured minimum value may be acquired, and the larger of the preconfigured minimum value and the second threshold adjusted by the compensation value is selected as the basis for judging whether the data acquired at the current moment is abnormal.
The above embodiments can be applied to abnormality discovery of various indexes. Especially, the method can be applied to identification of storage IO abnormity. The following description will take the discovery of an exception in a storage IO as an example.
Fig. 3 is a schematic diagram of an IO index monitoring system according to an embodiment of the present application. As shown in fig. 3, an IO index may be monitored by a preset monitoring program. During monitoring, a second threshold is obtained by threshold calculation over a sliding window, the second threshold is then compensated to obtain a compensated threshold, and finally a minimum threshold intervenes (that is, the larger of the compensated threshold and the minimum threshold is selected as the basis for abnormality determination). Once the basis is obtained, abnormality determination is performed and an abnormality report is generated. The parts shown in fig. 3 are explained separately below.
The monitoring program for monitoring the IO indexes can run on different operating systems. It runs continuously and periodically collects data corresponding to the indexes. For example, if the monitoring program runs on a Linux system, the await, ioutil, tps, bps, and qu-size indexes can be obtained by periodically reading the diskstat information already provided by the Linux system. If the monitoring program runs on another operating system, data corresponding to the monitored indexes can be acquired in the manner appropriate to that operating system, which is not described herein again. The indexes await, ioutil, tps, bps, and qu-size are described separately below.
await: the average latency (in milliseconds) of each device I/O operation. It can be understood as the IO response time. Generally, the system IO response time should be less than 5 ms; a value greater than 10 ms is considered large. The await minimum threshold may therefore be set to 5 milliseconds.
ioutil: represents how busy the disk is. ioutil is a percentage, where 100% indicates that the disk is fully busy and 0% indicates that the disk is idle. ioutil can be calculated as the total time spent processing IO within the statistical interval divided by the length of the statistical interval. For example, if the statistical interval is 1 second and the device spends 0.8 seconds processing IO and 0.2 seconds idle, then ioutil = 0.8/1 = 80%. Generally, a value of 100% indicates that the device is running at or near full load and that too many I/O requests are being generated. The minimum threshold of ioutil may be set to 20%.
tps: the number of I/O transfers issued to the physical disk per second. One transfer is one I/O request to the physical disk. The minimum threshold of tps may be set to 150 times/s.
bps: the amount of data transferred (read or written) to disk per second. The unit of transfer is expressed using different suffixes, the default unit being bytes/sec. The bps minimum threshold may be set to 31457280 bytes/sec (that is, 30 MiB/s).
qu-size: the IO queue length. The minimum threshold of qu-size may be set to 1.
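The five indexes above can be derived from two successive readings of the Linux disk statistics, as in the following sketch. The description does not specify how the counters are read; the field positions used here follow the commonly documented layout of a `/proc/diskstats` line, the 512-byte sector size is the unit used by those counters, and the names `DiskSample`, `parse_diskstats_line`, and `io_indexes` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DiskSample:
    """Cumulative counters for one device (hypothetical helper type)."""
    ios: int          # reads completed + writes completed
    sectors: int      # sectors read + sectors written
    wait_ms: int      # milliseconds spent reading + writing
    busy_ms: int      # milliseconds the device spent doing I/O
    weighted_ms: int  # weighted milliseconds spent doing I/O


def parse_diskstats_line(line: str) -> Tuple[str, DiskSample]:
    """Parse one line in the assumed /proc/diskstats layout."""
    f = line.split()
    sample = DiskSample(
        ios=int(f[3]) + int(f[7]),
        sectors=int(f[5]) + int(f[9]),
        wait_ms=int(f[6]) + int(f[10]),
        busy_ms=int(f[12]),
        weighted_ms=int(f[13]),
    )
    return f[2], sample


def io_indexes(prev: DiskSample, cur: DiskSample, interval_s: float) -> Dict[str, float]:
    """Turn the difference of two cumulative samples into the five indexes."""
    d_ios = cur.ios - prev.ios
    interval_ms = interval_s * 1000.0
    return {
        "await": (cur.wait_ms - prev.wait_ms) / d_ios if d_ios else 0.0,
        "ioutil": 100.0 * (cur.busy_ms - prev.busy_ms) / interval_ms,
        "tps": d_ios / interval_s,
        "bps": (cur.sectors - prev.sectors) * 512 / interval_s,
        "qu-size": (cur.weighted_ms - prev.weighted_ms) / interval_ms,
    }
```

With two samples taken one second apart in which the device was busy for 800 ms, `io_indexes` gives an ioutil of 80%, which matches the worked example in the ioutil description.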
For the above storage IO indexes, the data tends to be stable over a time interval under normal conditions, and when an abnormality occurs, a spike deviating from the stable trend is generated. Generally, the service only cares about such abnormal spikes. Therefore, in the following optional embodiment, the threshold is dynamically updated while the service is running, so that it adapts more reasonably to different service scenarios. Abnormality is determined in the same manner for all of the above IO indexes, so await is used as an example in this optional embodiment. This optional embodiment combines the three calculation manners of the above embodiments: sliding window threshold calculation (i.e., calculating the second threshold), compensation threshold calculation (i.e., compensating the second threshold using the compensation value), and minimum threshold intervention (i.e., selecting the larger of the minimum threshold and the compensated second threshold as the basis for abnormality determination). These three calculation manners are described below.
Sliding window threshold calculation
When a sliding window is used for threshold calculation, the length and step size of the sliding window are selected first. For example, a sliding window of length 100 and step size 1 may be selected, i.e., the calculation is performed on groups of 100 data points, and the sliding window moves forward by one data point each time: the first calculation uses data 1 to 100, the second uses data 2 to 101, the third uses data 3 to 102, and so on. For each window, the average value of the 100 data points in the current window is calculated first; in this optional embodiment it is denoted Mavg. The maximum value max and the minimum value min of the 100 data points in the window are obtained at the same time. A first difference, max minus Mavg, and a second difference, Mavg minus min, are then calculated, and the larger of the two is taken as the temporary threshold Tthresh (i.e., the first threshold) of the window. The temporary threshold Tthresh of every calculation is recorded, and the average value threshAVG (i.e., the second threshold) of the temporary thresholds Tthresh of all windows is then calculated and used as the basis for judging whether the currently collected data is abnormal.
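The calculation formulas for Mavg, Tthresh, and threshAVG are as follows:
$$\mathrm{Mavg}_n = \frac{1}{M}\sum_{i=n}^{m} x_i$$
[n denotes the nth calculation, M denotes the sliding window size, m = M + n - 1, and x_i denotes the ith collected data value];
$$\mathrm{Tthresh}_n = \mathrm{MAX}(\mathrm{max}_n - \mathrm{Mavg}_n,\ \mathrm{Mavg}_n - \mathrm{min}_n);$$
$$\mathrm{threshAVG}_n = \frac{1}{n}\sum_{k=1}^{n} \mathrm{Tthresh}_k$$
[n denotes the nth calculation].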
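The sliding window threshold calculation described above can be sketched as follows, assuming the data is held in memory as a list and threshAVG is the plain (unweighted) average of all Tthresh values calculated so far. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def window_threshold(window: Sequence[float]) -> float:
    """Tthresh of one window: the larger of (max - Mavg) and (Mavg - min)."""
    mavg = sum(window) / len(window)
    return max(max(window) - mavg, mavg - min(window))


def dynamic_thresholds(
    data: Sequence[float], size: int = 100, step: int = 1
) -> List[Tuple[float, float]]:
    """Return (Tthresh, threshAVG) for every window position, in order."""
    results = []
    total = 0.0
    starts = range(0, len(data) - size + 1, step)
    for n, start in enumerate(starts, start=1):
        tthresh = window_threshold(data[start:start + size])
        total += tthresh
        # threshAVG after the nth calculation is the mean of the first n Tthresh values.
        results.append((tthresh, total / n))
    return results
```

For instance, `dynamic_thresholds([1, 2, 3, 10, 3, 2], size=3)` evaluates four windows; the first window `[1, 2, 3]` yields a Tthresh of 1 and the second window `[2, 3, 10]` yields a Tthresh of 5, so threshAVG after the second calculation is 3.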
The above formulas are explained below by taking the data collected for the await index as an example. Fig. 4 is a schematic diagram of abnormal points in await index monitoring according to an embodiment of the present application, taking the IO index await as an example. In fig. 4, the line labeled 1 (curve 1) represents the original sampled data, the curve labeled 2 (curve 2) is the calculated Tthresh, and the curve labeled 3 (curve 3) is threshAVG, i.e., the second threshold in the above embodiment. The asterisks labeled 4 and the red dots labeled 5 are the identified outliers. The difference between them is that, when abnormalities occur densely within a window, only a few abnormal events are allowed to be recorded; otherwise a large number of identical abnormality reports in a short time would be a nuisance. That is, the asterisks 4 mark all detected abnormalities, while the dots 5 mark the abnormal events that are reported.
As can be seen from fig. 4, if the second threshold is used directly to identify outliers, most of the outliers can be identified. It should be noted, however, that the second threshold flags every sharp spike as an outlier, so even relatively small spikes are reported as abnormal. This is similar to the curve shown in fig. 2. In fig. 4, because the collection of index data has only just started, the second threshold obtained by the above calculation is relatively small when data point A occurs. The value of data point A is not large and, judged by its value alone, it should not count as an abnormal data point; but because the freshly calculated second threshold is relatively small, the value of data point A exceeds it, so data point A is considered abnormal, which is actually a false alarm.
Using the second threshold solves the problems caused by manually configuring a static threshold, but false alarms may still occur, especially when the collected IO data values are consistently small. Fig. 5 is a schematic diagram of false alarms when the monitored index data values are small according to an embodiment of the present application. As shown in fig. 5, if the collected index data values are consistently small, the IO index is working normally; but because the values are consistently small, the calculated second threshold threshAVG is also small, and any collected data larger than the second threshold is determined to be abnormal. In other words, relatively small spikes are also reported as abnormal in this case, and when the algorithm runs for a long time, more false alarms may occur; these are marked with rectangular boxes in fig. 5.
To solve the problem of false alarms, in one embodiment, a pre-configured fixed value may be introduced, which may be considered as a minimum threshold, and a value smaller than the minimum threshold is not considered as an abnormal value. For the data point a in fig. 4, if the data point a is compared with the minimum threshold and the value of the data point a is found to be less than the minimum threshold, the data point a is not considered as an abnormal data point.
In addition to this embodiment, as shown in fig. 3, in order to eliminate the unnecessary false alarms, a compensation value (also referred to as a compensation threshold in this embodiment) may be added above the second threshold (also referred to as a dynamic threshold since the second threshold is dynamically calculated).
The principle of calculating the compensation threshold is to take a fluctuation range of the IO indicator in a steady state into consideration, and compensate the second threshold to make the second threshold conform to the fluctuation range. That is, in the present embodiment, the compensation value is acquired, and then the acquired compensation value is superimposed on the dynamic threshold value as a new abnormality determination threshold value.
Since the degree of stability of the second threshold itself reflects the fluctuation of the real IO index data, the data trend of threshAVG is analyzed in this embodiment, and the compensation value is determined according to that trend. The trend line of threshAVG is the curve shown in FIG. 2. The trend can still be analyzed in units of sliding windows, that is, the trend of the threshAVG data within one window is counted.
The specific statistical method is as follows: the threshAVG obtained by the first window calculation is used as the initial compensation value, and then the compensation value is updated in the subsequent calculation.
In this embodiment, the compensation value update is divided into two different scenarios:
in the first scenario, it is not known whether the IO load state is a steady state at this time, and the compensation value needs to be dynamically updated according to the fluctuation trend of the IO index.
The updating process of the compensation value in the first scenario may be as follows. The first threshAVG value entering the window is recorded as winFSThreshAvg. If more than 60% of the threshAVG values after the first one are greater than winFSThreshAvg, or more than 60% of the threshAVG values are greater than the preceding threshAVG value, the current IO index data shows a rising trend. In this case an abnormality is likely to be imminent, and the compensation value does not need to be updated. It should be noted that when an abnormality occurs in an IO index, the index value usually increases suddenly by multiples and far exceeds the threshold obtained by superimposing the compensation value on the dynamic threshold, so no update is necessary. Conversely, if the IO index data in the window shows a descending trend, the compensation value needs to be updated to the maximum threshAVG value of the window. In this case the compensation value should be updated whenever possible, to avoid missed reports caused by a final superimposed threshold that is too large. Using the maximum threshAVG value may also reduce false alarms to some extent.
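The update rule of the first scenario can be sketched as follows. This is an illustration under stated assumptions: the function name `update_compensation` is hypothetical, the percentages are taken over the total number of threshAVG values in the window, and the 60% share is the value given above.

```python
from typing import Sequence


def update_compensation(
    current: float, window: Sequence[float], min_share: float = 0.6
) -> float:
    """Return the compensation value to use after inspecting one threshAVG window."""
    win_fs_thresh_avg = window[0]  # first threshAVG value entering the window
    total = len(window)
    above_first = sum(1 for t in window[1:] if t > win_fs_thresh_avg)
    above_prev = sum(1 for prev, cur in zip(window, window[1:]) if cur > prev)
    rising = above_first / total > min_share or above_prev / total > min_share
    if rising:
        # Rising trend: an abnormality may be imminent, keep the current value.
        return current
    # Descending trend: replace with the largest threshAVG in the window.
    return max(window)
```

The first call would typically pass the threshAVG of the first window as `current`, since that value serves as the initial compensation value.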
In the second scenario, if threshAVG decreases monotonically and continuously, the IO index is returning to the steady state right after a round of IO pressure. The Tthresh values of a plurality of consecutive windows are therefore recorded and averaged, and the average is used as a fixed compensation value that is no longer updated. Fig. 6 is a schematic diagram of the second threshold and the compensation value according to an embodiment of the present application. As shown in fig. 6, curve 3 is the threshAVG curve, which decreases monotonically from a certain time; from then on a fixed compensation value (curve 6) is used, and curve 6 shows that the compensation value remains a straight line (i.e., a fixed value) after curve 3 starts to decrease monotonically.
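The second scenario can be sketched as follows, assuming that "monotonically decreasing" means strictly decreasing and that the caller decides how many consecutive Tthresh values to average; neither detail is fixed by the description, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def is_monotonically_decreasing(values: Sequence[float]) -> bool:
    """True when every value is strictly smaller than its predecessor."""
    return len(values) > 1 and all(cur < prev for prev, cur in zip(values, values[1:]))


def fixed_compensation(
    thresh_avg_window: Sequence[float], recent_tthresh: Sequence[float]
) -> Optional[float]:
    """Average the recorded Tthresh values when threshAVG falls monotonically.

    Returns None when the condition does not hold, so the caller can fall
    back to the dynamic update of the first scenario.
    """
    if recent_tthresh and is_monotonically_decreasing(thresh_avg_window):
        return sum(recent_tthresh) / len(recent_tthresh)
    return None
```

Once a value is returned, it would be kept as the compensation value without further updates, which corresponds to the straight line of curve 6.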
Fig. 7 is a schematic diagram of minimum threshold usage according to an embodiment of the present application. As shown in fig. 7, curve 3 is the calculated second threshold curve, and curve 7 is the final threshold obtained by superimposing the compensation value on the second threshold. In the steady state on the left and right sides of the peak, no dot or asterisk appears, which indicates that no abnormality is identified, because superimposing the compensation value on the second threshold helps eliminate false alarms. A point B also appears in fig. 7; judged against the second threshold with the compensation value superimposed, it is an abnormal point. However, a reported abnormality sometimes does not yet affect the service, and abnormalities that the service does not care about need to be masked. To solve this problem, a minimum threshold may be added after the compensation value is superimposed on the second threshold. Adding this minimum threshold is the minimum threshold intervention in fig. 3.
The default minimum thresholds of the IO indexes are respectively: await 5 ms, ioutil 20%, tps 150 times/sec, bps 31457280 bytes/sec, qu-size 1. In the actual abnormality determination, the final determination threshold is: MAX (minimum threshold, (dynamic threshold + compensation threshold)). In another embodiment, a threshold for reporting abnormalities may also be configured. After certain data is identified as abnormal data, its value is compared with the reporting threshold; if the value exceeds the reporting threshold, the abnormal data is reported, otherwise it is only recorded as abnormal data and is not reported.
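The final determination and the optional reporting threshold can be sketched as follows. The default minimum thresholds are the values listed above; the function names and the choice to treat a value strictly greater than the threshold as abnormal are assumptions made for illustration.

```python
from typing import Optional, Tuple

DEFAULT_MIN_THRESHOLDS = {
    "await": 5.0,        # milliseconds
    "ioutil": 20.0,      # percent
    "tps": 150.0,        # transfers per second
    "bps": 31457280.0,   # bytes per second
    "qu-size": 1.0,      # queue length
}


def final_threshold(index: str, dynamic: float, compensation: float) -> float:
    """MAX(minimum threshold, dynamic threshold + compensation threshold)."""
    return max(DEFAULT_MIN_THRESHOLDS[index], dynamic + compensation)


def judge(
    index: str,
    value: float,
    dynamic: float,
    compensation: float,
    report_threshold: Optional[float] = None,
) -> Tuple[bool, bool]:
    """Return (is_abnormal, should_report) for one collected value."""
    abnormal = value > final_threshold(index, dynamic, compensation)
    report = abnormal and (report_threshold is None or value > report_threshold)
    return abnormal, report
```

For example, an await value of 4 ms with a dynamic threshold of 1 and a compensation value of 1.5 is not abnormal, because the minimum threshold of 5 ms takes precedence over the superimposed threshold of 2.5.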
According to this embodiment, storage IO index abnormalities can be identified automatically without human intervention. In this embodiment, the threshold obtained by the sliding window threshold calculation method correctly marks all spikes of the IO index; meanwhile, to solve the problem of false alarms, the compensation threshold calculation method derives a threshold that evaluates the steady state of the IO index, thereby eliminating a large number of false IO abnormality alarms. According to measurement, when the method of this embodiment runs continuously as a Python program for more than 12 hours, its CPU consumption is less than 1% of a single core, so it can be deployed in a cluster on a routine basis.
In this embodiment, an electronic device is provided, comprising a memory in which a computer program is stored and a processor arranged to run the computer program to perform the method in the above embodiments.
The programs described above may be run on a processor or may also be stored in memory (also referred to as computer-readable media), which includes both permanent and non-permanent, removable and non-removable media that implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.
These computer programs may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks, and corresponding steps may be implemented by different modules.
This embodiment provides an apparatus or a system, referred to as an index abnormality determination apparatus, including: a collection module, configured to collect data of the index to be monitored to obtain data corresponding to the index to be monitored; an obtaining module, configured to obtain a first threshold of each sliding window in a plurality of sliding windows before the current moment, where the first threshold is obtained according to an average value of the data in the sliding window, and the plurality of sliding windows are obtained by sequentially moving a sliding window over the collected data corresponding to the index according to a preset step length; a calculating module, configured to obtain a second threshold according to the first threshold of each of the plurality of sliding windows, where the second threshold is calculated according to the first threshold; and a judging module, configured to judge whether the data collected at the current moment is abnormal according to the second threshold.
The system or the apparatus is used for implementing the functions of the method in the foregoing embodiments, and each module in the system or the apparatus corresponds to each step in the method, which has been described in the method and is not described herein again.
Optionally, the obtaining module is configured to: obtaining an average value of the data in each sliding window; acquiring the maximum value and the minimum value of data in each sliding window; calculating a first difference value of the maximum value and the average value and a second difference value of the average value and the minimum value; and taking the larger one of the first difference value and the second difference value as the first threshold value.
Optionally, the computing module is configured to: averaging the first threshold or weighted averaging the first threshold of each of the plurality of sliding windows to obtain the second threshold.
Optionally, the determining module includes: an obtaining unit, configured to obtain a compensation value, where the compensation value is used to compensate for the second threshold; an adjusting unit, configured to adjust the second threshold value using a compensation value; and the judging unit is used for judging whether the data acquired at the current moment is abnormal or not according to the second threshold value adjusted by using the compensation value.
Optionally, the obtaining unit is configured to: setting a window on the obtained second threshold, wherein a plurality of second thresholds are covered in the window; taking a first second threshold value in the window as an initial compensation value; and determining whether the initial compensation value is updated according to the change trend of the second threshold value in the window so as to obtain the compensation value.
Optionally, the obtaining unit is configured to: judging whether the change trend of the second threshold value is an ascending trend or a descending trend; determining not to update the initial compensation value when the change trend is an ascending trend, and using the initial compensation value as the compensation value; and updating the initial compensation value when the change trend is a descending trend, and using the updated initial compensation value as the compensation value.
Optionally, the obtaining unit is configured to: acquire a first number of the other second thresholds in the window that are greater than the initial compensation value, calculate a first percentage of the first number relative to the total number of second thresholds in the window, and determine that the change trend is an ascending trend if the first percentage is greater than a preset first percentage, and otherwise determine that the change trend is a descending trend; or, acquire a second number of second thresholds in the window that are each greater than the immediately preceding second threshold, calculate a second percentage of the second number relative to the total number, and determine that the change trend is an ascending trend if the second percentage is greater than a preset second percentage, and otherwise determine that the change trend is a descending trend.
Optionally, the obtaining unit is configured to: obtaining a maximum value of a second threshold value within the window; replacing the initial compensation value with a maximum value of a second threshold value within the window for use as the compensation value.
Optionally, the obtaining unit is configured to: under the condition that the change trend of the second threshold value in the window is monotonically decreased, acquiring a plurality of continuous first threshold values; and using the obtained average value of the continuous multiple first threshold values as the fixed compensation value.
Optionally, the determining unit is configured to: acquiring a preset minimum value; and selecting the maximum value of the preset minimum value and the second threshold value adjusted by using the compensation value as a basis for judging whether the data acquired at the current moment are abnormal or not.
The method and the device solve the problems caused in the prior art by judging index abnormality with a manually set static threshold. The threshold can be generated dynamically according to the collected data, which reduces manual intervention and improves the accuracy of index abnormality judgment to a certain extent.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (13)

1. An index abnormality determination method includes:
acquiring data of an index to be monitored to obtain data corresponding to the index to be monitored;
acquiring a first threshold value of each sliding window in a plurality of sliding windows before the current moment, wherein the first threshold value is obtained according to an average value of data in the sliding window; sequentially moving sliding windows on the acquired data corresponding to the indexes according to a preset step length to obtain a plurality of sliding windows;
obtaining a second threshold according to the first threshold of each sliding window in the plurality of sliding windows, wherein the second threshold is calculated according to the first threshold;
and judging whether the data acquired at the current moment is abnormal or not according to the second threshold.
2. The method of claim 1, wherein obtaining the first threshold for each sliding window comprises:
obtaining an average value of the data in each sliding window;
acquiring the maximum value and the minimum value of data in each sliding window;
calculating a first difference value between the maximum value and the average value and a second difference value between the average value and the minimum value;
and taking the larger one of the first difference value and the second difference value as the first threshold value.
3. The method of claim 1, wherein deriving a second threshold from the first threshold for each of the plurality of sliding windows comprises:
averaging the first threshold or weighted averaging the first threshold of each of the plurality of sliding windows to obtain the second threshold.
4. The method of claim 1, wherein determining whether the data collected at the current time is abnormal according to the second threshold comprises:
acquiring a compensation value, wherein the compensation value is used for compensating the second threshold value;
adjusting the second threshold value using a compensation value;
and judging whether the data acquired at the current moment is abnormal or not according to the second threshold value adjusted by using the compensation value.
5. The method of claim 4, wherein obtaining the compensation value comprises:
setting a window on the obtained second threshold, wherein a plurality of second thresholds are covered in the window;
taking a first second threshold value in the window as an initial compensation value;
and determining whether the initial compensation value is updated according to the change trend of the second threshold value in the window so as to obtain the compensation value.
6. The method of claim 5, wherein determining whether the initial compensation value is updated based on a trend of change of the second threshold value within the window comprises:
judging whether the change trend of the second threshold value is an ascending trend or a descending trend;
determining not to update the initial compensation value when the change trend is an ascending trend, and using the initial compensation value as the compensation value;
and updating the initial compensation value when the change trend is a descending trend, and using the updated initial compensation value as the compensation value.
7. The method of claim 6, wherein determining whether the trend of change of the second threshold is an upward trend or a downward trend comprises:
acquiring a first number of the other second thresholds in the window that are greater than the initial compensation value, calculating a first percentage of the first number relative to the total number of second thresholds in the window, and determining that the change trend is an ascending trend if the first percentage is greater than a preset first percentage, and otherwise determining that the change trend is a descending trend; or,
and acquiring a second number of the next second threshold value in the window, which is larger than the previous threshold value, calculating a second percentage of the second number in the total number, and determining that the change trend is an ascending trend under the condition that the second percentage is larger than a preset second preset percentage, otherwise, determining that the change trend is a descending trend.
8. The method of claim 6, wherein updating the initial compensation value when the trend of change is a downward trend comprises:
obtaining a maximum value of a second threshold value within the window;
replacing the initial compensation value with a maximum value of a second threshold value within the window for use as the compensation value.
9. The method of claim 5, wherein determining whether the initial compensation value is updated based on a trend of change of the second threshold value within the window comprises:
under the condition that the change trend of the second threshold value in the window is monotonically decreased, acquiring a plurality of continuous first threshold values; and using the obtained average value of the continuous multiple first threshold values as the fixed compensation value.
10. The method according to any one of claims 4 to 9, wherein determining whether the data collected at the current time is abnormal according to the second threshold adjusted by using the compensation value comprises:
acquiring a preset minimum value;
and selecting the maximum value of the preset minimum value and the second threshold value adjusted by using the compensation value as a basis for judging whether the data acquired at the current moment are abnormal or not.
11. An electronic device comprising a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of any of claims 1 to 10.
12. A readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the method steps of any one of claims 1 to 10.
13. An index abnormality determination device includes:
a collection module, configured to collect data of the index to be monitored to obtain data corresponding to the index to be monitored;
an obtaining module, configured to obtain a first threshold of each sliding window in a plurality of sliding windows before the current moment, where the first threshold is obtained according to an average value of the data in the sliding window, and the plurality of sliding windows are obtained by sequentially moving a sliding window over the collected data corresponding to the index according to a preset step length;
a calculating module, configured to obtain a second threshold according to a first threshold of each of the plurality of sliding windows, where the second threshold is calculated according to the first threshold;
and the judging module is used for judging whether the data acquired at the current moment is abnormal or not according to the second threshold.
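The data flow through the modules of claim 13 could be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not the applicant's implementation: all names are hypothetical, the second threshold is taken to be the mean of the first thresholds (the claim only says it is calculated from them), and a data point above the second threshold is treated as abnormal.

```python
from statistics import fmean
from typing import List, Sequence


def sliding_windows(data: Sequence[float], window_size: int, step: int) -> List[List[float]]:
    """Move a window of fixed size over the data by a preset step length."""
    return [
        list(data[i:i + window_size])
        for i in range(0, len(data) - window_size + 1, step)
    ]


def first_thresholds(windows: Sequence[Sequence[float]]) -> List[float]:
    """First threshold of each window: the average of the data in that window."""
    return [fmean(w) for w in windows]


def second_threshold(firsts: Sequence[float]) -> float:
    """Assumption: combine the first thresholds by averaging them."""
    return fmean(firsts)


def judge(history: Sequence[float], current: float, window_size: int, step: int) -> bool:
    """Return True when the current data point is judged abnormal.

    Assumption: a value above the second threshold is abnormal.
    """
    windows = sliding_windows(history, window_size, step)
    if not windows:
        raise ValueError("history is shorter than one window")
    return current > second_threshold(first_thresholds(windows))
```

With history `[1, 2, 3, 4, 5]`, a window size of 3 and a step of 1, the windows are `[1, 2, 3]`, `[2, 3, 4]` and `[3, 4, 5]`, so the first thresholds are 2.0, 3.0 and 4.0 and the second threshold under the averaging assumption is 3.0.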
CN202210956510.0A 2022-08-10 2022-08-10 Index abnormity judgment method and device Pending CN115454763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210956510.0A CN115454763A (en) 2022-08-10 2022-08-10 Index abnormity judgment method and device

Publications (1)

Publication Number Publication Date
CN115454763A true CN115454763A (en) 2022-12-09

Family

ID=84298309

Country Status (1)

Country Link
CN (1) CN115454763A (en)

Similar Documents

Publication Publication Date Title
CN111212038B (en) Open data API gateway system based on big data artificial intelligence
CN108509323A (en) Method for processing business, device based on log analysis and computer equipment
CN116795655B (en) Storage device performance monitoring system and method based on artificial intelligence
EP3932025B1 (en) Computing resource scheduling method, scheduler, internet of things system, and computer readable medium
JP5699715B2 (en) Data storage device and data storage method
CN111858704A (en) Data monitoring method and device, electronic equipment and storage medium
CN107040566B (en) Service processing method and device
CN115454763A (en) Index abnormity judgment method and device
CN117318297A (en) Alarm threshold setting method, system, equipment and medium based on state monitoring
CN113032239A (en) Risk prompting method and device, electronic equipment and storage medium
CN113123955B (en) Plunger pump abnormity detection method and device, storage medium and electronic equipment
CN111258854A (en) Model training method, alarm method based on prediction model and related device
CN112667479A (en) Information monitoring method and device
CN114083987B (en) Correction method and device for battery monitoring parameters and computer equipment
CN115774646A (en) Process early warning method and device, electronic equipment and storage medium
CN106686082B (en) Storage resource adjusting method and management node
CN114328078A (en) Threshold dynamic calculation method and device and computer readable storage medium
CN113808727A (en) Equipment monitoring method and device, computer equipment and readable storage medium
CN113765821A (en) Multi-dimensional access flow control system
CN108920310B (en) Abnormal value detection method and system of interface data
WO2009090944A1 (en) Rule base management system, rule base management method, and rule base management program
CN110716826A (en) Cloud disk upgrading and scheduling method, cloud host, scheduling device and system
WO2021223515A1 (en) Method and apparatus for monitoring data modality change, and device and storage medium
CN117409495B (en) Optimal maintenance time acquisition method and system based on equipment maintenance data
CN113254209B (en) Capacity management method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination