WO2021179574A1 - Root cause location method, device, computer equipment and storage medium - Google Patents

Root cause location method, device, computer equipment and storage medium

Info

Publication number
WO2021179574A1
Authority
WO
WIPO (PCT)
Prior art keywords
alarm
value
similarity
values
indicators
Prior art date
Application number
PCT/CN2020/118332
Other languages
English (en)
French (fr)
Inventor
陈桢博
徐亮
金戈
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021179574A1 publication Critical patent/WO2021179574A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a root cause location method, device, computer equipment and storage medium.
  • Anomaly detection means raising an alarm when a monitored indicator of a device changes abnormally, so that staff can notice and handle the problem in time.
  • Root cause identification recommends the root-cause fault or device to staff based on the alarms, which saves the time of manual one-by-one troubleshooting. Accurate anomaly detection and root cause identification help operation and maintenance staff find faults in time and repair them quickly.
  • A traditional root cause analysis system relies only on the hierarchical call chain of the operation and maintenance equipment and on anomaly detection alarms.
  • This approach is a static rule: when multiple alarms occur across device levels, the system preferentially identifies the indicators downstream in the call chain as the root cause.
  • The inventors realized that the static rule approach is rigid. When the root cause indicator does not raise an alarm, or the root cause does not follow the call chain logic, the identification result is wrong and does not reflect the true cause of the equipment failure. The existing technology therefore needs improvement.
  • A root cause location method, used by a root cause analysis system to locate the root cause of a fault in operation and maintenance work, includes the following steps:
  • the alarm indicators ranked highest by similarity value are output as the root cause alarm indicators.
  • A root cause locating device, used by a root cause analysis system to locate the root cause of a fault in operation and maintenance work, includes an abnormality detection unit, an alarm indicator value calculation unit, an alarm indicator similarity calculation unit, and a root cause alarm indicator output unit;
  • the abnormality detection unit is used to receive abnormal information and send out alarm information;
  • the alarm indicator value calculation unit is used to find all alarm indicators associated with the alarm information according to the call chain, and to collect the values of those alarm indicators;
  • the alarm indicator similarity calculation unit is used to smooth the values of all the alarm indicators, and to calculate the similarity of every alarm indicator under the preset lag values, so as to obtain the similarity values of the alarm indicators with higher lag values;
  • the root cause alarm indicator output unit is used to summarize the similarity values of the alarm indicators with higher lag values, to sort the alarm indicators with higher similarity values in combination with the call chain hierarchical relationship, and to output the alarm indicators ranked highest by similarity value as the root cause alarm indicators.
  • A computer device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the following root cause location method:
  • the alarm indicators ranked highest by similarity value are output as the root cause alarm indicators.
  • A storage medium stores computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the following root cause location method:
  • the alarm indicators ranked highest by similarity value are output as the root cause alarm indicators.
  • With the above root cause location method, after the operation and maintenance system receives abnormal alarm information, the values of the corresponding alarm indicators are smoothed with the LOESS algorithm when their similarity values are calculated.
  • After the alarm indicators with higher similarity values are obtained, their similarity values are weighted and sorted using the call chain level information, and the call chain device corresponding to the alarm indicators with higher similarity values is output as the root cause.
  • The output is the device carrying the suspected root cause rather than the alarm indicator, which ensures the diversity of the output root causes. The root cause is determined from multiple dimensions, including alarm indicator similarity, abnormal information, alarm time and call relationship, which ensures the completeness and accuracy of root cause identification.
  • The method of the present application can uncover more complex root cause relationships. Using the root cause identification results, operation and maintenance staff can troubleshoot quickly based on the alarm indicators and repair faults quickly.
  • Fig. 1 is an implementation environment diagram of a root cause location method provided in an embodiment
  • Figure 2 is a block diagram of the internal structure of a computer device in an embodiment
  • Figure 3 is a flowchart of a root cause location method in an embodiment
  • FIG. 4 is a flowchart of calculating an alarm indicator with a higher similarity value based on the lag value in an embodiment
  • FIG. 5 is a flowchart of obtaining an alarm indicator with a higher similarity value according to the lag value combined with the residual value of the historical STL periodic component in an embodiment
  • Fig. 6 is a structural block diagram of a root cause locating device in an embodiment.
  • FIG. 1 is an implementation environment diagram of a root cause location method provided in an embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a terminal 120.
  • The computer device 110 is a test device, for example a computer used by a tester, on which an automated test tool such as Appium is installed.
  • The terminal 120 has installed on it the application under test that requires root cause location.
  • When a test is needed, the tester can issue a root cause location request on the computer device 110.
  • The root cause location request carries a location request identifier, and the computer device 110 receives the root cause location request.
  • The root cause location script corresponding to the location request identifier is then obtained from the computer device 110 according to that identifier.
  • The automated test tool is used to execute the root cause location script, test the application under test on the terminal 120, and obtain the root cause location result corresponding to the script.
  • The terminal 120 and the computer device 110 may be smart phones, tablet computers, notebook computers, desktop computers, etc., but are not limited thereto.
  • The computer device 110 and the terminal 120 may be connected via Bluetooth, USB (Universal Serial Bus) or other communication connection methods, which is not limited in this application.
  • Figure 2 is a schematic diagram of the internal structure of a computer device in an embodiment.
  • As shown in Figure 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • The database may store control information sequences.
  • When the computer-readable instructions are executed by the processor, the processor can implement a gesture test method.
  • The processor of the computer device provides calculation and control capabilities and supports the operation of the entire computer device.
  • Computer-readable instructions may be stored in the memory of the computer device; when they are executed by the processor, the processor may execute a root cause location method.
  • The network interface of the computer device is used to connect and communicate with the terminal.
  • Those skilled in the art can understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution is applied.
  • A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a root cause location method is proposed.
  • the root cause location method can be applied to the above-mentioned computer device 110, and specifically can include the following steps 302 to 310:
  • Step 302 Receive abnormal information and send out alarm information
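  • In this embodiment, abnormal information is detected with the prior-art STL algorithm: the time series is decomposed to obtain the periodic component, which is stored, and alarm information is sent out when a collected value, compared with its corresponding periodic component, exceeds a threshold.
  • The STL (Seasonal-Trend decomposition procedure based on Loess) algorithm is a time series decomposition algorithm. Based on LOESS, it decomposes the data at each time point into a trend component, a seasonal (periodic) component and a remainder component: $Y_v = T_v + S_v + R_v$, $v = 1, \dots, N$.
  • STL consists of an inner loop and an outer loop; the inner loop mainly performs trend fitting and computes the periodic component. Let $T_v^{(k)}$ and $S_v^{(k)}$ be the trend component and periodic component at the end of pass $k-1$ of the inner loop, with $T_v^{(k)} = 0$ initially. The parameters are:
  • $n_{(i)}$ is the number of inner loop passes;
  • $n_{(o)}$ is the number of outer loop passes;
  • $n_{(p)}$ is the number of samples in one period;
  • $n_{(s)}$ is the LOESS smoothing parameter in Step 2;
  • $n_{(l)}$ is the LOESS smoothing parameter in Step 3;
  • $n_{(t)}$ is the LOESS smoothing parameter in Step 6.
  • The sample points at the same position in each period form a subseries; there are $n_{(p)}$ such subseries, which are called cycle-subseries.
  • The inner loop consists mainly of the following 6 steps:
  • Step 1: Detrending. Subtract the trend component from the previous pass: $Y_v - T_v^{(k)}$;
  • Step 2: Cycle-subseries smoothing. Apply LOESS ($q = n_{(s)}$, $d = 1$) to each cycle-subseries, extending it by one period forward and backward; the smoothed results form the temporary seasonal series $C_v^{(k+1)}$, $v = -n_{(p)} + 1, \dots, N + n_{(p)}$;
  • Step 3: Low-pass filtering of the smoothed cycle-subseries;
  • Step 5: Deseasonalizing. Subtract the periodic component: $Y_v - S_v^{(k+1)}$;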
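The additive model $Y_v = T_v + S_v + R_v$ described above can be illustrated with a deliberately simplified sketch. It is not STL itself: as an assumption made for brevity, plain moving averages and cycle-subseries means stand in for the LOESS smoothers, and the function name `decompose` is hypothetical.

```python
from statistics import mean


def decompose(y, period):
    """Crude additive decomposition y = trend + seasonal + remainder.

    Assumption: moving averages replace the LOESS smoothers of real STL,
    so this only illustrates the structure of the decomposition.
    """
    n = len(y)
    half = period // 2
    # Trend: centered moving average whose window shrinks at the edges.
    trend = [mean(y[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]
    detrended = [y[i] - trend[i] for i in range(n)]
    # Seasonal: mean of each cycle-subseries (same position in every period).
    sub_means = [mean(detrended[p::period]) for p in range(period)]
    seasonal = [sub_means[i % period] for i in range(n)]
    # Remainder: whatever trend and seasonal components do not explain.
    remainder = [y[i] - trend[i] - seasonal[i] for i in range(n)]
    return trend, seasonal, remainder
```

By construction the three returned lists sum back to the input series point by point; a production system would more likely rely on an existing STL implementation.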
  • Step 304 Search for all alarm indicators associated with alarm information according to the call chain, and collect the values of the alarm indicators;
  • The values collected are those of the alarm indicators from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  • The associated alarm indicators are those of different alarm objects (each object has multiple monitoring indicators) that are linked by call relationships and can therefore influence each other.
  • When multiple alarm indicators are found, their average values are aggregated in time order to obtain a composite entry indicator for the subsequent similarity calculation.
  • When the alarm indicator is a single indicator, that indicator is used directly as the entry indicator for the subsequent similarity calculation.
  • The interval from 1 to 2 hours before the alarm to 10 minutes after the alarm is the optimized time interval, which ensures that the root cause can be identified quickly after the alarm is triggered.
  • The time interval may also be set from 1 to 6 hours before the alarm to 10 minutes after the alarm.
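The collection window and the entry indicator of Step 304 can be sketched as follows. This is a minimal illustration that assumes minute-resolution, time-aligned series of equal length; the function names and the 60-minute default are hypothetical choices within the ranges given above.

```python
from statistics import mean


def collection_window(alarm_minute, before=60, after=10):
    """Return the (start, end) minutes of the window around an alarm."""
    return alarm_minute - before, alarm_minute + after


def entry_indicator(indicator_series):
    """Build the entry indicator from one or more aligned indicator series.

    A single series is used as is; several series are averaged point by
    point in time order.
    """
    if len(indicator_series) == 1:
        return list(indicator_series[0])
    return [mean(values) for values in zip(*indicator_series)]
```

For example, `entry_indicator([[1, 2, 3], [3, 4, 5]])` averages the two series into `[2, 3, 4]`.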
  • Step 306 Perform smoothing processing on the values of all the alarm indicators, and perform similarity calculations on all alarm indicators combined with preset lag values to obtain similarity values of alarm indicators with higher lag values;
  • Fig. 4 is a flowchart of calculating an alarm indicator with a higher similarity value according to the lag value in an embodiment, which specifically includes the following steps 402 to 408:
  • Step 402: Perform locally weighted regression with the LOESS algorithm and take the regression values as the smoothed sequence; that is, the LOESS algorithm smooths the values of the associated alarm indicators to remove noise.
  • The preset lag value is 0 to 90 minutes, which is the preferred range in this embodiment.
  • The lag value can also be preset to 0 to 120 minutes.
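The LOESS smoothing of Step 402 can be sketched as a local linear regression with tricube weights. This is a minimal sketch, assuming samples at equally spaced times 0, 1, 2, ...; the function name `loess_smooth` and the neighbourhood size `q` are illustrative and not taken from the application.

```python
def loess_smooth(y, q):
    """Smooth y (sampled at x = 0, 1, 2, ...) by local linear regression.

    q is the number of nearest neighbours used for each local fit; tricube
    weights give nearby points more influence than distant ones.
    """
    n = len(y)
    q = min(q, n)
    smoothed = []
    for i in range(n):
        lo = max(0, min(i - q // 2, n - q))   # first index of the q-point window
        xs = range(lo, lo + q)
        h = max(i - lo, lo + q - 1 - i) or 1  # distance to the farthest neighbour
        w = [(1 - (abs(x - i) / h) ** 3) ** 3 for x in xs]
        sw = sum(w)
        # Weighted means and weighted least-squares slope of the local line.
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y[x] for wi, x in zip(w, xs)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y[x] - my) for wi, x in zip(w, xs))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (i - mx))
    return smoothed
```

A local linear fit reproduces a straight line exactly and damps isolated spikes, which is the noise-removal effect the step relies on.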
  • Step 406: Calculate the similarity between each alarm indicator and the entry indicator under each preset lag value, to obtain the lag value alarm indicators of all the alarm indicators under each lag value. Taking a preset lag value of 60 minutes as an example, the entry indicator covers the interval from 1 hour before the start of the alarm to 10 minutes after the alarm. Over the range of lag values, the interval of the alarm indicator is moved forward in 1-minute steps, and the similarity with the original interval of the entry indicator is calculated at each step, giving a similarity value under each lag value.
  • The similarity value is the Pearson correlation coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$, where $x_i$ and $y_i$ are the values of the two intervals being compared. The calculation of the Pearson correlation coefficient is prior art and is not repeated here.
  • The lag value alarm indicators with a similarity value greater than 0.65 are merged by similarity value to obtain the alarm indicators with higher similarity values.
  • For example, with a preset lag value of 60 minutes, the interval from 60 minutes before the alarm to 10 minutes after the alarm spans 70 one-minute lag intervals, and the similarity with the original interval of the entry indicator is calculated for each of them with the Pearson correlation coefficient.
  • Among the similarity values under the individual lag values, several may exceed the 0.65 similarity threshold. Lags whose similarity exceeds the threshold may cluster at adjacent positions; these are merged by maximum value, keeping the lag values at the similarity maxima, which yields the alarm indicators with higher similarity values.
  • The time window refers to the period from a certain moment before the alarm to 10 minutes after the alarm.
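The lagged similarity calculation and the merging of adjacent lags described above can be sketched as follows. The sketch assumes minute-resolution lists in which the alarm indicator series carries `max_lag` extra samples before the entry window; treating a constant window as uncorrelated is also an assumption, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant window counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def lagged_similarity(entry, indicator, max_lag):
    """Similarity of the entry window with the indicator shifted 0..max_lag steps earlier.

    indicator must hold len(entry) + max_lag samples, the first max_lag of
    which precede the entry window.
    """
    n = len(entry)
    return [pearson(entry, indicator[max_lag - lag:max_lag - lag + n])
            for lag in range(max_lag + 1)]


def merge_adjacent_peaks(sims, threshold=0.65):
    """Group consecutive lags above the threshold and keep the best lag of each group."""
    peaks, group = [], []
    for lag, s in enumerate(sims):
        if s > threshold:
            group.append((lag, s))
        elif group:
            peaks.append(max(group, key=lambda p: p[1]))
            group = []
    if group:
        peaks.append(max(group, key=lambda p: p[1]))
    return peaks
```

For instance, `merge_adjacent_peaks([0.1, 0.7, 0.9, 0.2, 0.8])` keeps one representative per cluster and returns `[(2, 0.9), (4, 0.8)]`.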
  • In addition, the alarm indicators may be inversely related to each other, or the root cause indicator may vary only slightly, so that the similarity cannot be calculated directly.
  • Two indicators may change in the same way but with very different magnitudes, resulting in a lower correlation coefficient.
  • Some alarm indicators may change in the same way as the entry indicator, but the change is normal for those indicators and cannot be taken as the root cause of the failure.
  • FIG. 5 is a flowchart of obtaining alarm indicators with higher similarity values according to the lag value combined with the residual value of the historical STL periodic component, which specifically includes steps 502 to 506:
  • Step 502: Collect the residual between the smoothed sequence obtained for the alarm indicator by the LOESS algorithm and the historical STL periodic component;
  • Step 504: Perform similarity calculations on the STL periodic component residual values of the alarm indicators, to obtain the similarity values of the STL residual value alarm indicators;
  • Step 506: If the similarity value of the STL residual value alarm indicator and the similarity value of the corresponding lag value alarm indicator are both greater than 0.65, merge the similarity values of the alarm indicators to obtain an alarm indicator with a higher similarity value.
  • The similarity is calculated separately for the smoothed values and the residual values of an alarm indicator; if it exceeds the 0.65 threshold in both cases, the alarm indicator is included among the potential root causes. Compared with history, residuals better reflect abnormal changes and reduce the impact of normal changes.
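The double threshold check of Step 506 can be sketched as below. The text does not specify how the two similarity values are merged, so averaging them is an assumption made here for illustration; the function and parameter names are hypothetical.

```python
def potential_root_causes(smooth_sims, residual_sims, threshold=0.65):
    """Keep indicators whose smoothed-value AND residual similarities exceed the threshold.

    Both arguments map an indicator name to its similarity with the entry
    indicator. The returned value per indicator is the mean of the two
    similarities (an assumed merge rule).
    """
    kept = {}
    for name, smooth in smooth_sims.items():
        residual = residual_sims.get(name)
        if residual is not None and smooth > threshold and residual > threshold:
            kept[name] = (smooth + residual) / 2
    return kept
```

An indicator that tracks the entry indicator only in its smoothed values, but whose residuals against history look normal, is filtered out by this check.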
  • Step 308 Summarize the similarity values of the alarm indicators with the higher lag value, and sort the alarm indicators with the higher similarity value in combination with the hierarchical relationship of the call chain;
  • One alarm indicator may correspond to multiple sets of results; all associated alarm indicators are sorted in combination with the hierarchical relationship of the call chain to ensure the diversity of the output root causes.
  • From the similarity values of an alarm indicator obtained in this way, the average similarity value and the delay (lag) with the highest similarity value can be obtained.
  • Objects further downstream affect the objects upstream of them, so the further downstream an object is, the more likely it is to be the root cause.
  • The number of upstream call chains in the call chain is correspondingly reduced. If an object has downstream objects that are potential root causes, the object is probably an affected object and can be excluded directly. Finally, the alarm indicators ranked highest are output as the root cause indicators.
  • Step 310 Output the call chain device corresponding to the alarm indicator with the highest similarity value as the root cause.
  • Locating the root cause requires determining the device at which the failure originates. Therefore, after the alarm indicator is confirmed as the root cause indicator according to the above steps, the corresponding call chain device is found in the call chain hierarchical relationship and is determined to be the root cause of the failure.
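The ranking of Steps 308 and 310 can be sketched as follows. This is one possible reading of the exclusion rule, assuming the call chain is given as a mapping from each device to the devices it calls directly, so that a caller is upstream of its callees; the names `rank_root_causes`, `similarity` and `calls` are hypothetical.

```python
def rank_root_causes(similarity, calls):
    """Rank candidate devices, dropping those with a candidate further downstream.

    similarity: {device: similarity value} for the potential root causes.
    calls: {device: [devices it calls directly]}.
    """
    def downstream(device):
        # Collect every device reachable through the call chain (cycle-safe).
        seen, stack = set(), list(calls.get(device, []))
        while stack:
            d = stack.pop()
            if d not in seen:
                seen.add(d)
                stack.extend(calls.get(d, []))
        return seen

    survivors = [d for d in similarity
                 if not (downstream(d) - {d}).intersection(similarity)]
    # Highest similarity first; the first entry is output as the root cause.
    return sorted(survivors, key=lambda d: similarity[d], reverse=True)
```

With a chain `gateway -> order -> db` and candidates `gateway` and `db`, the gateway is treated as an affected object and only `db` remains.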
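  • As shown in FIG. 6, in one embodiment, a structural block diagram of a root cause locating device is provided.
  • The root cause locating device can be integrated into the above-mentioned computer device 110, and can specifically include an abnormality detection unit 602, an alarm indicator value calculation unit 604, an alarm indicator similarity calculation unit 606, and a root cause alarm indicator output unit 608.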
  • the abnormality detection unit 602 is configured to receive abnormal information and send out alarm information
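  • In this embodiment, abnormal information is detected with the prior-art STL algorithm: the time series is decomposed to obtain the periodic component, which is stored, and alarm information is sent out when a collected value, compared with its corresponding periodic component, exceeds a threshold.
  • The STL (Seasonal-Trend decomposition procedure based on Loess) algorithm is a time series decomposition algorithm. Based on LOESS, it decomposes the data at each time point into a trend component, a seasonal (periodic) component and a remainder component: $Y_v = T_v + S_v + R_v$, $v = 1, \dots, N$.
  • STL consists of an inner loop and an outer loop; the inner loop mainly performs trend fitting and computes the periodic component. Let $T_v^{(k)}$ and $S_v^{(k)}$ be the trend component and periodic component at the end of pass $k-1$ of the inner loop, with $T_v^{(k)} = 0$ initially. The parameters are:
  • $n_{(i)}$ is the number of inner loop passes;
  • $n_{(o)}$ is the number of outer loop passes;
  • $n_{(p)}$ is the number of samples in one period;
  • $n_{(s)}$ is the LOESS smoothing parameter in Step 2;
  • $n_{(l)}$ is the LOESS smoothing parameter in Step 3;
  • $n_{(t)}$ is the LOESS smoothing parameter in Step 6.
  • The sample points at the same position in each period form a subseries; there are $n_{(p)}$ such subseries, which are called cycle-subseries.
  • The inner loop consists mainly of the following 6 steps:
  • Step 1: Detrending. Subtract the trend component from the previous pass: $Y_v - T_v^{(k)}$;
  • Step 2: Cycle-subseries smoothing. Apply LOESS ($q = n_{(s)}$, $d = 1$) to each cycle-subseries, extending it by one period forward and backward; the smoothed results form the temporary seasonal series $C_v^{(k+1)}$, $v = -n_{(p)} + 1, \dots, N + n_{(p)}$;
  • Step 3: Low-pass filtering of the smoothed cycle-subseries;
  • Step 5: Deseasonalizing. Subtract the periodic component: $Y_v - S_v^{(k+1)}$;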
  • the alarm index value calculation unit 604 is configured to search for all alarm indicators associated with the alarm information according to the call chain, and collect the value of the alarm index;
  • When the alarm indicator value calculation unit 604 collects the alarm indicator values, it collects them from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  • The associated alarm indicators are those of different alarm objects (each object has multiple monitoring indicators) that are linked by call relationships and can therefore influence each other.
  • When multiple alarm indicators are found, their average values are aggregated in time order to obtain a composite entry indicator for the subsequent similarity calculation.
  • When the alarm indicator is a single indicator, that indicator is used directly as the entry indicator for the subsequent similarity calculation.
  • The interval from 1 to 2 hours before the alarm to 10 minutes after the alarm is the optimized time interval, which ensures that the root cause can be identified quickly after the alarm is triggered.
  • The time interval may also be set from 1 to 6 hours before the alarm to 10 minutes after the alarm.
  • The alarm indicator similarity calculation unit 606 is used to smooth the values of all the alarm indicators, and to calculate the similarity of every alarm indicator under the preset lag values, so as to obtain the similarity values of the alarm indicators with higher lag values;
  • The alarm indicator similarity calculation unit 606 calculates the similarity values of the alarm indicators as follows: first, locally weighted regression is performed with the LOESS algorithm, and the regression values are taken as the smoothed sequence; the preset lag value is 0 to 90 minutes; then the similarity between each alarm indicator and the entry indicator is calculated under each preset lag value, to obtain the lag value alarm indicators of all alarm indicators under each lag value; finally, the lag value alarm indicators with a similarity value greater than 0.65 are merged by similarity value to obtain the alarm indicators with higher similarity values.
  • The specific calculation process is the same as step 306 in the foregoing method embodiment, and is not repeated here.
  • In addition, the alarm indicators may be inversely related to each other, or the root cause indicator may vary only slightly, so that the similarity cannot be calculated directly.
  • Two indicators may change in the same way but with very different magnitudes, resulting in a lower correlation coefficient.
  • Some alarm indicators may change in the same way as the entry indicator, but the change is normal for those indicators and cannot be taken as the root cause of the failure.
  • Therefore, the input to the similarity calculation also needs to include the residual values of the historical STL periodic component; calculating the similarity of the residuals of each alarm indicator reflects the degree of change comprehensively.
  • The alarm indicator similarity calculation unit 606 additionally calculates the similarity values of the alarm indicators as follows: first, the residual between the smoothed sequence obtained for the alarm indicator by the LOESS algorithm and the historical STL periodic component is collected; then similarity calculations are performed on the STL periodic component residual values of the alarm indicators, to obtain the similarity values of the STL residual value alarm indicators; if the similarity value of the STL residual value alarm indicator and that of the corresponding lag value alarm indicator are both greater than 0.65, the similarity values of the alarm indicators are merged to obtain an alarm indicator with a higher similarity value.
  • The root cause alarm indicator output unit 608 is configured to summarize the similarity values of the alarm indicators with higher lag values, to sort the alarm indicators with higher similarity values in combination with the call chain hierarchical relationship, and to output the call chain device corresponding to the alarm indicators ranked highest by similarity value as the root cause.
  • One alarm indicator may correspond to multiple sets of results; all associated alarm indicators are sorted in combination with the hierarchical relationship of the call chain to ensure the diversity of the output root causes.
  • From the similarity values of an alarm indicator obtained in this way, the average similarity value and the delay (lag) with the highest similarity value can be obtained.
  • Objects further downstream affect the objects upstream of them, so the further downstream an object is, the more likely it is to be the root cause.
  • The number of upstream call chains in the call chain is correspondingly reduced. If an object has downstream objects that are potential root causes, the object is probably an affected object and can be excluded directly.
  • Finally, the alarm indicators ranked highest are output as the root cause indicators. Locating the root cause requires determining the device at which the failure originates. Therefore, after the alarm indicator is confirmed as the root cause indicator, the corresponding call chain device is found in the call chain hierarchical relationship and is determined to be the root cause of the failure.
  • In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program that is stored on the memory and can run on the processor, and the processor implements the following steps when executing the computer program:
  • the call chain device corresponding to the alarm indicators ranked highest by similarity value is output as the root cause.
  • The processor further executes the following step when executing the computer program: the values collected are those of the alarm indicators from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  • the processor further executes the following steps when executing the computer program:
  • the preset lag value is 0 to 90 minutes;
  • the lag value alarm indicators with similarity values greater than 0.65 are merged by similarity value to obtain alarm indicators with higher similarity values.
  • the processor further executes the following steps when executing the computer program:
  • if the similarity value of the STL residual value alarm indicator and the similarity value of the corresponding lag value alarm indicator are both greater than 0.65, the similarity values of the alarm indicators are merged to obtain an alarm indicator with a higher similarity value.
  • a storage medium storing computer-readable instructions.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the one or more processors perform the following steps:
  • the call chain device corresponding to the alarm indicator with the highest ranking of the similarity value is output as the root cause.
  • the processor further executes the following steps when executing the computer-readable instructions:
  • the values collected are those of the alarm indicators from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  • the processor further executes the following steps when executing the computer-readable instructions:
  • the preset lag value is 0 to 90 minutes;
  • the lag value alarm indicators with similarity values greater than 0.65 are merged by similarity value to obtain alarm indicators with higher similarity values.
  • the processor further executes the following steps when executing the computer-readable instructions:
  • if the similarity value of the STL residual value alarm indicator and the similarity value of the corresponding lag value alarm indicator are both greater than 0.65, the similarity values of the alarm indicators are merged to obtain an alarm indicator with a higher similarity value.
  • The computer program can be stored in a computer-readable storage medium; when executed, it may carry out the procedures of the above method embodiments.
  • The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), etc.

Abstract

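A root cause location method, device, computer equipment and storage medium, used to locate the root cause of faults in the operation of an operation and maintenance system. Abnormal information is received and alarm information is sent out (302); all alarm indicators associated with the alarm information are found according to the call chain, and the values of the alarm indicators are collected (304); the values of all alarm indicators are smoothed, and similarity is calculated for every alarm indicator under the preset lag values, to obtain the similarity values of the alarm indicators with higher lag values (306); the similarity values of the alarm indicators with higher lag values are summarized and, in combination with the call chain hierarchical relationship, the alarm indicators with higher similarity values are sorted (308); and the call chain device corresponding to the alarm indicators ranked highest by similarity value is output as the root cause (310). The root cause is determined from multiple dimensions, including alarm indicator similarity, abnormal information, alarm time and call relationship, which ensures the completeness and accuracy of root cause identification and allows more complex root cause relationships to be uncovered; operation and maintenance staff can troubleshoot quickly based on the alarm indicators and repair faults.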

Description

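Root cause location method, device, computer equipment and storage medium
This application claims priority to Chinese patent application No. 202010170390.2, entitled "Root cause location method, device, computer equipment and storage medium", filed with the Chinese Patent Office on March 12, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a root cause location method, device, computer equipment and storage medium.
Background
In the operation and maintenance of computer information systems, rapid repair of equipment failures is the primary goal. This work has two main parts: anomaly detection and root cause identification. Anomaly detection means raising an alarm when a monitored indicator of a device changes abnormally, so that staff can notice and handle the problem in time. Root cause identification recommends the root-cause fault or device to staff based on the alarms, which saves the time of manual one-by-one troubleshooting. Accurate anomaly detection and root cause identification help operation and maintenance staff find faults in time and repair them quickly.
A traditional root cause analysis system relies only on the hierarchical call chain of the operation and maintenance equipment and on anomaly detection alarms. This approach is a static rule: when multiple alarms occur across device levels, the system preferentially identifies the indicators downstream in the call chain as the root cause. The inventors realized that the static rule approach is rigid. When the root cause indicator does not raise an alarm, or the root cause does not follow the call chain logic, the identification result is wrong and does not reflect the true cause of the equipment failure. The existing technology therefore needs improvement.
Summary
On this basis, in view of the defect that traditional root cause analysis systems apply only static analysis rules, it is necessary to provide a dynamic root cause location method, device, computer equipment and storage medium.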
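A root cause location method, used by a root cause analysis system to locate the root cause of faults in operation and maintenance work, includes the following steps:
receiving abnormal information and sending out alarm information;
finding all alarm indicators associated with the alarm information according to the call chain, and collecting the values of the alarm indicators;
smoothing the values of all the alarm indicators, and calculating the similarity of every alarm indicator under the preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
summarizing the similarity values of the alarm indicators with higher lag values, and sorting the alarm indicators with higher similarity values in combination with the call chain hierarchical relationship;
outputting the alarm indicators ranked highest by similarity value as the root cause alarm indicators.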
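A root cause locating device, used by a root cause analysis system to locate the root cause of faults in operation and maintenance work, includes an abnormality detection unit, an alarm indicator value calculation unit, an alarm indicator similarity calculation unit and a root cause alarm indicator output unit;
the abnormality detection unit is used to receive abnormal information and send out alarm information;
the alarm indicator value calculation unit is used to find all alarm indicators associated with the alarm information according to the call chain, and to collect the values of the alarm indicators;
the alarm indicator similarity calculation unit is used to smooth the values of all the alarm indicators, and to calculate the similarity of every alarm indicator under the preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
the root cause alarm indicator output unit is used to summarize the similarity values of the alarm indicators with higher lag values, to sort the alarm indicators with higher similarity values in combination with the call chain hierarchical relationship, and to output the alarm indicators ranked highest by similarity value as the root cause alarm indicators.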
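A computer device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the following root cause location method:
receiving abnormal information and sending out alarm information;
finding all alarm indicators associated with the alarm information according to the call chain, and collecting the values of the alarm indicators;
smoothing the values of all the alarm indicators, and calculating the similarity of every alarm indicator under the preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
summarizing the similarity values of the alarm indicators with higher lag values, and sorting the alarm indicators with higher similarity values in combination with the call chain hierarchical relationship;
outputting the alarm indicators ranked highest by similarity value as the root cause alarm indicators.
A storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the following root cause location method:
receiving abnormal information and sending out alarm information;
finding all alarm indicators associated with the alarm information according to the call chain, and collecting the values of the alarm indicators;
smoothing the values of all the alarm indicators, and calculating the similarity of every alarm indicator under the preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
summarizing the similarity values of the alarm indicators with higher lag values, and sorting the alarm indicators with higher similarity values in combination with the call chain hierarchical relationship;
outputting the alarm indicators ranked highest by similarity value as the root cause alarm indicators.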
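Compared with the prior-art static rule approach, in which the operation and maintenance system determines the root cause only from the anomaly detection alarms of the equipment and the hierarchical call chain, the above root cause location method, after the operation and maintenance system receives abnormal alarm information, smooths the values of the alarm indicators with the LOESS algorithm when calculating their similarity values. Further, the similarity values of the alarm indicators can also be calculated from the residuals of the historical STL periodic component, which accurately reflects both the degree of change of the alarm indicators and the degree of influence of different indicators. After the alarm indicators with higher similarity values are obtained, their similarity values are weighted and sorted using the call chain level information, and the call chain device corresponding to the alarm indicators with higher similarity values is output as the root cause. The output is the device carrying the suspected root cause rather than the alarm indicator, which ensures the diversity of the output root causes. The root cause is determined from multiple dimensions, including alarm indicator similarity, abnormal information, alarm time and call relationship, which ensures the completeness and accuracy of root cause identification. Compared with the static rules of the prior art, the method of this application can uncover more complex root cause relationships; using the root cause identification results, operation and maintenance staff can troubleshoot quickly based on the alarm indicators and repair faults quickly.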
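Brief Description of the Drawings
FIG. 1 is an implementation environment diagram of a root cause location method provided in an embodiment;
FIG. 2 is a block diagram of the internal structure of a computer device in an embodiment;
FIG. 3 is a flowchart of a root cause location method in an embodiment;
FIG. 4 is a flowchart of calculating, according to the lag value, the alarm indicators with higher similarity values in an embodiment;
FIG. 5 is a flowchart of obtaining the alarm indicators with higher similarity values according to the lag value combined with the residual value of the historical STL periodic component in an embodiment;
FIG. 6 is a structural block diagram of a root cause locating device in an embodiment.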
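Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain this application and do not limit it.
FIG. 1 is a diagram of the implementation environment of the root cause localization method provided in one embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a terminal 120.
The computer device 110 is a test device, for example a computer used by a tester, on which an automated testing tool such as Appium is installed. The terminal 120 has installed on it the application under test for which root cause localization is required. When testing is needed, the tester can issue a root cause localization request on the computer device 110; the request carries a localization request identifier. The computer device 110 receives the request and, according to the identifier, obtains the corresponding root cause localization script stored on the computer device 110. It then executes the script with the automated testing tool to test the application under test on the terminal 120 and obtains the root cause localization result corresponding to the script.
It should be noted that the terminal 120 and the computer device 110 may each be, without limitation, a smartphone, a tablet computer, a notebook computer, a desktop computer, or the like. The computer device 110 and the terminal 120 may be connected through Bluetooth, USB (Universal Serial Bus), or another communication connection, which this application does not restrict.
FIG. 2 is a schematic diagram of the internal structure of the computer device in one embodiment. As shown in FIG. 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium stores an operating system, a database, and computer-readable instructions; the database may store a control information sequence, and the computer-readable instructions, when executed by the processor, cause the processor to implement a root cause localization method. The processor provides computing and control capabilities and supports the operation of the entire computer device. The memory may store computer-readable instructions which, when executed by the processor, cause the processor to perform a root cause localization method. The network interface is used to connect to and communicate with the terminal. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.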
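As shown in FIG. 3, one embodiment provides a root cause localization method that can be applied to the computer device 110 described above and may specifically include the following steps 302 to 310:
Step 302: receive anomaly information and issue alarm information.
In this embodiment, anomaly detection is based on the prior-art STL algorithm: the time series is decomposed to obtain the periodic component, which is stored, and alarm information is issued when a collected value and its corresponding periodic component are above the threshold. STL (Seasonal-Trend decomposition procedure based on Loess) is a time series decomposition algorithm that uses LOESS to decompose the data at a given time point into a trend component, a periodic (seasonal) component, and a remainder component:
$Y_v = T_v + S_v + R_v, \quad v = 1, \dots, N$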
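As an illustration of the alarm condition only, the sketch below flags points whose collected value departs from the stored periodic component by more than a threshold. The source does not spell out the exact comparison, so the absolute-deviation test, the function name `detect_alarms`, and the sample numbers are all assumptions rather than the method of the application.

```python
def detect_alarms(values, seasonal, threshold):
    """Return the indices where |value - periodic component| exceeds threshold.

    Assumption: the alarm test is an absolute deviation from the stored
    periodic (seasonal) component; the source does not fix this detail.
    """
    if len(values) != len(seasonal):
        raise ValueError("values and seasonal must have the same length")
    return [
        i
        for i, (y, s) in enumerate(zip(values, seasonal))
        if abs(y - s) > threshold
    ]


# Hypothetical minute-level samples and their stored periodic component.
observed = [10.0, 11.0, 10.5, 25.0, 10.2]
stored_seasonal = [10.0, 10.5, 10.5, 10.0, 10.0]
print(detect_alarms(observed, stored_seasonal, threshold=5.0))
```

Only the fourth sample deviates by more than the threshold, so it is the only index reported.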
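STL consists of an inner loop and an outer loop; the inner loop mainly performs trend fitting and computation of the periodic component. Let $T_v^{(k)}$ and $S_v^{(k)}$ denote the trend and periodic components at the end of the $k$-th pass of the inner loop, with $T_v^{(0)} = 0$ initially, and let the parameters be:
· $n_{(i)}$: the number of inner-loop passes,
· $n_{(o)}$: the number of outer-loop passes,
· $n_{(p)}$: the number of samples in one period,
· $n_{(s)}$: the LOESS smoothing parameter in Step 2,
· $n_{(l)}$: the LOESS smoothing parameter in Step 3,
· $n_{(t)}$: the LOESS smoothing parameter in Step 6.
The sample points at the same position in each period form a subseries; there are clearly $n_{(p)}$ such subseries, called cycle-subseries. The inner loop consists of the following six steps:
· Step 1: Detrending. Subtract the trend component from the previous pass: $Y_v - T_v^{(k)}$;
· Step 2: Cycle-subseries smoothing. Regress each subseries with LOESS ($q = n_{(s)}$, $d = 1$) and extend it by one period forward and backward; the smoothed results form the temporary seasonal series $C_v^{(k+1)}$, $v = -n_{(p)}+1, \dots, N+n_{(p)}$;
· Step 3: Low-pass filtering of the smoothed cycle-subseries. Apply moving averages of lengths $n_{(p)}$, $n_{(p)}$, and 3 in turn to the series $C_v^{(k+1)}$ from the previous step, then apply a LOESS ($q = n_{(l)}$, $d = 1$) regression, giving the series $L_v^{(k+1)}$, $v = 1, \dots, N$; this extracts the low-frequency part of the cycle-subseries;
· Step 4: Detrending of the smoothed cycle-subseries: $S_v^{(k+1)} = C_v^{(k+1)} - L_v^{(k+1)}$;
· Step 5: Deseasonalizing. Subtract the periodic component: $Y_v - S_v^{(k+1)}$;
· Step 6: Trend smoothing. Regress the deseasonalized series with LOESS ($q = n_{(t)}$, $d = 1$) to obtain the trend component $T_v^{(k+1)}$.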
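Two of the mechanical pieces of the inner loop can be sketched in a few lines: splitting a series into cycle-subseries, and the three moving averages of Step 3. This is a minimal sketch rather than a full STL implementation; the function names are hypothetical and the LOESS passes are omitted. It also shows why the series extended by one period on each side ($N + 2n_{(p)}$ points) comes back to exactly $N$ points after the averages of lengths $n_{(p)}$, $n_{(p)}$, and 3.

```python
def cycle_subseries(y, n_p):
    """Group the samples that share the same position within each period."""
    return [y[i::n_p] for i in range(n_p)]


def moving_average(series, window):
    """Simple moving average; returns len(series) - window + 1 points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    out = [total / window]
    for i in range(window, len(series)):
        total += series[i] - series[i - window]  # slide the window by one
        out.append(total / window)
    return out


def low_pass_moving_averages(c, n_p):
    """Apply moving averages of length n_p, n_p and 3 in sequence (Step 3)."""
    return moving_average(moving_average(moving_average(c, n_p), n_p), 3)


# Hypothetical sizes: N = 10 samples, period n_p = 4, so C has N + 2*n_p points.
N, n_p = 10, 4
c = [1.0] * (N + 2 * n_p)
print(len(low_pass_moving_averages(c, n_p)))
print(cycle_subseries(list(range(6)), 3))
```

The first average leaves $N + n_{(p)} + 1$ points, the second $N + 2$, and the third $N$, which matches the index range $v = 1, \dots, N$ of $L_v^{(k+1)}$.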
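Step 304: find, according to the call chain, all alarm indicators associated with the alarm information, and collect the values of the alarm indicators.
In this embodiment of the application, collecting the values of the alarm indicators means collecting the values from 1 to 2 hours before the alarm to 10 minutes after it. Associated alarm indicators here are those of different alarm objects (each object has several monitored indicators) that have call relationships with one another and can therefore affect each other. After an application in the system raises an alarm, the alarm may involve several indicators; the indicators found are mean-aggregated in time order into a composite entry indicator for the subsequent similarity calculation. When there is only a single alarm indicator, that indicator is used as the entry indicator. The window from 1 to 2 hours before the alarm to 10 minutes after it is a time interval chosen so that root cause identification can run quickly once an alarm is triggered.
In some embodiments, if the search for alarm indicators needs to be widened, the interval can instead be set to run from 1 to 6 hours before the alarm to 10 minutes after it.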
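The construction of the entry indicator can be sketched as a point-by-point mean over the alarming indicators. The sketch assumes that the series are already aligned on the same timestamps; the function name `build_entry_indicator` and the indicator names are hypothetical.

```python
from statistics import fmean


def build_entry_indicator(alarm_series):
    """Mean-aggregate several aligned alarm indicators into one entry indicator.

    alarm_series maps an indicator name to its list of values; all lists are
    assumed to cover the same timestamps in the same order.
    """
    if not alarm_series:
        raise ValueError("at least one alarm indicator is required")
    series = list(alarm_series.values())
    if len(series) == 1:
        # A single alarming indicator is used directly as the entry indicator.
        return list(series[0])
    length = len(series[0])
    if any(len(s) != length for s in series):
        raise ValueError("all indicators must have the same number of samples")
    return [fmean(point) for point in zip(*series)]


print(build_entry_indicator({"cpu": [1.0, 3.0], "latency": [3.0, 5.0]}))
```

Averaging raw values mixes indicators with different units; whether the values should be normalized first is not stated in the source and is left open here.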
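Step 306: smooth the values of all the alarm indicators, and perform a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values.
In this embodiment, similarity processing is performed on the values of the associated alarm indicators found above. FIG. 4 is a flowchart, in one embodiment, of computing over the alarm indicators according to the lag values to obtain the alarm indicators with higher similarity values, specifically including the following steps 402 to 408:
Step 402: perform locally weighted regression with the LOESS algorithm and take the regression values as the smoothed series; that is, the LOESS algorithm smooths the values of the associated alarm indicators to remove noise.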
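A minimal LOESS sketch for an evenly sampled series is shown below: at each point it fits a weighted straight line (degree 1) to the `q` nearest samples using tricube weights and takes the fitted value. The function names and the choice of `q` are assumptions; production code would normally rely on a tested library implementation instead.

```python
def tricube(u):
    """Tricube weight: (1 - u^3)^3 for 0 <= u < 1, else 0."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_smooth(y, q):
    """Locally weighted linear regression over the q nearest samples."""
    n = len(y)
    if not 2 <= q <= n:
        raise ValueError("q must satisfy 2 <= q <= len(y)")
    smoothed = []
    for i in range(n):
        # Bandwidth: distance from sample i to its q-th nearest sample.
        h = sorted(abs(i - j) for j in range(n))[q - 1]
        w = [tricube(abs(i - j) / h) for j in range(n)]
        sw = sum(w)
        mx = sum(wj * j for j, wj in enumerate(w)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (j - mx) ** 2 for j, wj in enumerate(w))
        sxy = sum(wj * (j - mx) * (yj - my) for j, (wj, yj) in enumerate(zip(w, y)))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (i - mx))
    return smoothed


# A single noisy spike is pulled toward its neighbours.
spiky = [0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
print(loess_smooth(spiky, q=5)[3] < 10.0)
```

Because each local fit is linear, a perfectly linear series passes through unchanged, while isolated spikes are damped.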
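Step 404: preset the lag values to 0 to 90 minutes. In this embodiment the lag values are preferably set to 0 to 90 minutes; when more associated alarm indicator values need to be included in the calculation, the lag values can instead be preset to 0 to 120 minutes.
Step 406: compute the similarity between each alarm indicator and the entry indicator at each preset lag value, obtaining the lag-value alarm indicators of all alarm indicators at each lag value. Taking a preset lag of 60 minutes as an example, the entry indicator covers the interval from 1 hour before the start of the alarm to 10 minutes after it; the alarm indicator interval is shifted earlier in 1-minute steps over the lag range, and at each shift its similarity to the original entry indicator interval is computed, giving the similarity value at each lag. The similarity is the Pearson correlation coefficient, whose formula is
Figure PCTCN2020118332-appb-000001
The calculation of the Pearson correlation coefficient is prior art and is not described further here.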
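The lag scan of step 406 can be sketched as follows, using the standard Pearson correlation coefficient (covariance divided by the product of the standard deviations). The function names, the convention of returning 0 for a constant window, and the toy series are assumptions made for illustration; both series are assumed to be minute-level and to end at the same timestamp.

```python
from math import sqrt


def pearson(a, b):
    """Standard Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # assumption: a constant window carries no similarity
    return cov / sqrt(va * vb)


def lagged_similarities(entry, candidate, window, max_lag):
    """Similarity of the entry window to the candidate window at each lag.

    The entry window is the last `window` samples; for lag L the candidate
    window is the same-length slice ending L samples earlier.
    """
    n = len(entry)
    if len(candidate) != n or window + max_lag > n:
        raise ValueError("series too short for the requested window and lag")
    entry_win = entry[n - window:]
    result = {}
    for lag in range(max_lag + 1):
        start = n - window - lag
        result[lag] = pearson(entry_win, candidate[start:start + window])
    return result


# Toy data: the candidate spikes 3 minutes before the entry indicator does.
entry = [0.0] * 20
entry[15] = 1.0
candidate = [0.0] * 20
candidate[12] = 1.0
sims = lagged_similarities(entry, candidate, window=10, max_lag=5)
print(max(sims, key=sims.get))  # prints 3
```

The similarity peaks at the lag that re-aligns the two spikes, which is the behaviour the lag scan is meant to exploit.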
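Step 408: merge the similarity values of the lag-value alarm indicators whose similarity exceeds 0.65, to obtain the alarm indicators with higher similarity values. In the example above, with a preset lag of 60 minutes, the range from 60 minutes before the alarm to 10 minutes after it, stepped by 1 minute, yields lag values for 70 intervals, and the Pearson correlation coefficient against the original entry indicator interval gives a similarity value at each lag. Several similarities may exceed the 0.65 threshold, and some of the lags above the threshold may cluster at neighboring positions; these are merged by taking the maximum, and the several lag values with the highest similarity are retained, giving the alarm indicators with higher similarity values. The reason for this procedure is that the abnormal change in a root-cause indicator may not occur at the same time as that in the entry indicator, but it generally occurs earlier; if the change sits at a different position in the time window, the similarity is reduced. Shifting the time window earlier and comparing with the entry indicator repeatedly therefore recovers the higher similarity. Here, the time window is the period from some moment before the alarm to 10 minutes after the alarm.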
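The merging of neighboring lags can be sketched as grouping consecutive lags whose similarity exceeds the threshold and keeping the maximum of each group. Treating "neighboring" as strictly consecutive lag values is an assumption, as are the function name and the sample scores.

```python
def merge_lags(similarities, threshold=0.65):
    """Keep one (lag, similarity) pair per run of consecutive lags above threshold."""
    peaks, run = [], []
    for lag in sorted(similarities):
        above = similarities[lag] > threshold
        # Close the current run when it is interrupted by a gap or a low score.
        if run and (not above or lag != run[-1] + 1):
            peaks.append(max(run, key=similarities.get))
            run = []
        if above:
            run.append(lag)
    if run:
        peaks.append(max(run, key=similarities.get))
    return [(lag, similarities[lag]) for lag in peaks]


# Hypothetical similarity values per lag (minutes).
scores = {0: 0.2, 1: 0.7, 2: 0.9, 3: 0.8, 4: 0.3, 5: 0.66, 6: 0.1}
print(merge_lags(scores))  # prints [(2, 0.9), (5, 0.66)]
```

Lags 1 to 3 form one cluster represented by its best lag, while lag 5 stands alone as a second result for the same indicator.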
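In some embodiments, alarm indicators may be inversely related, or the abnormal change in the root-cause indicator may be small, so that similarity cannot be computed directly. For example, in real O&M scenarios two indicators may change in the same way but with very different amplitudes, which lowers the correlation coefficient. In addition, some alarm indicators may change in the same way as the entry indicator even though that change is normal behavior for them, and such a change must not be treated as the root cause of the fault. Therefore, besides the LOESS-based calculation above, the input to the similarity calculation also needs to include the residuals against the historical STL periodic component: similarity is computed on the residuals of each alarm indicator so that the degree of change is reflected comprehensively. FIG. 5 is a flowchart, in one embodiment, of obtaining the alarm indicators with higher similarity values according to the lag values combined with the residuals of the historical STL periodic component, specifically including steps 502 to 506:
Step 502: obtain the residuals between the smoothed series of each alarm indicator, obtained by the LOESS algorithm, and the historical STL periodic component;
Step 504: for the alarm indicators that have STL periodic-component residuals, compute the similarity separately to obtain the similarity values of the STL-residual alarm indicators;
Step 506: if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merge the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.
By adding the STL-based calculation, similarity is computed separately for the smoothed values and for the residuals of each alarm indicator; if both similarities exceed the 0.65 threshold, the alarm indicator is included as a potential root cause. The residual better captures abnormal change relative to history and reduces the influence of normal variation.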
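The double check of steps 502 to 506 can be sketched as two small helpers: one that forms the residuals between a smoothed series and the historical periodic component, and one that keeps only the indicators whose smoothed-series similarity and residual similarity both exceed the threshold. The function names and sample scores are hypothetical, and the similarity values are assumed to have been computed beforehand.

```python
def residuals(smoothed, seasonal):
    """Residual of the smoothed series against the historical periodic component."""
    if len(smoothed) != len(seasonal):
        raise ValueError("series must have the same length")
    return [y - s for y, s in zip(smoothed, seasonal)]


def select_candidates(scores, threshold=0.65):
    """Return the indicators whose two similarity values both exceed threshold.

    scores maps an indicator name to (smoothed_similarity, residual_similarity).
    """
    return [
        name
        for name, (smoothed_sim, residual_sim) in scores.items()
        if smoothed_sim > threshold and residual_sim > threshold
    ]


print(residuals([5.0, 7.0], [4.0, 4.5]))
print(select_candidates({"db.qps": (0.9, 0.8), "web.rt": (0.9, 0.3)}))
```

In this sketch `web.rt` is dropped even though its smoothed series tracks the entry indicator closely, because its residual similarity suggests the movement is part of its normal pattern.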
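Step 308: aggregate the similarity values of the alarm indicators with higher lag values and, in combination with the hierarchical relationship of the call chain, rank the alarm indicators with higher similarity values.
In this embodiment of the application, one alarm indicator may correspond to several groups of results; combining these with the call-chain hierarchy, all associated alarm indicators are ranked, which ensures the diversity of the output root causes. From the similarity values obtained in this way, the highest mean similarity and its delay (lag) can be derived. As for the call-chain hierarchy information, the further downstream an object is, the more it affects upstream objects and the more likely it is to be the root cause. After ranking by similarity, the number of upstream call chains in the call chain is reduced accordingly: if an object has a downstream object that is a potential root cause, the object is probably itself affected and can be excluded directly. Finally, the top-ranked alarm indicators are output as the root-cause indicators.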
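The exclusion rule and the final ranking can be sketched on a small call graph. The data layout (a mapping from each object to the objects it calls directly), the function name `rank_root_causes`, and the `top_k` cut-off are assumptions; the source describes the rule only in prose.

```python
def rank_root_causes(candidates, downstream, top_k=3):
    """Drop candidates that have a downstream candidate, then rank by similarity.

    candidates maps an object to its best similarity value; downstream maps an
    object to the set of objects it calls directly.
    """
    def reachable(obj):
        seen, stack = set(), list(downstream.get(obj, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(downstream.get(node, ()))
        seen.discard(obj)  # ignore cycles that lead back to the object itself
        return seen

    survivors = {
        obj: sim
        for obj, sim in candidates.items()
        if not any(d in candidates for d in reachable(obj))
    }
    return sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical call chain: gateway -> order -> db; cache is standalone.
calls = {"gateway": {"order"}, "order": {"db"}, "db": set()}
found = {"gateway": 0.95, "order": 0.8, "db": 0.7, "cache": 0.9}
print(rank_root_causes(found, calls))  # prints [('cache', 0.9), ('db', 0.7)]
```

Although `gateway` has the highest similarity, it is excluded because candidates exist downstream of it, which mirrors the reasoning that an upstream object is more likely affected than causal.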
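Step 310: output the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.
In this embodiment of the application, locating the root cause requires identifying the device in which the fault occurred. Therefore, for the alarm indicators confirmed as root-cause indicators in the steps above, the corresponding call-chain device is looked up in the call-chain hierarchy, and that device is determined to be the root cause of the fault.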
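As shown in FIG. 6, one embodiment provides a structural block diagram of a root cause localization apparatus. The apparatus can be integrated into the computer device 110 described above and may specifically include an anomaly detection unit 602, an alarm indicator value calculation unit 604, an alarm indicator similarity calculation unit 606, and a root-cause alarm indicator output unit 608.
The anomaly detection unit 602 is configured to receive anomaly information and issue alarm information.
In this embodiment, anomaly detection is based on the prior-art STL algorithm: the time series is decomposed to obtain the periodic component, which is stored, and alarm information is issued when a collected value and its corresponding periodic component are above the threshold. STL (Seasonal-Trend decomposition procedure based on Loess) is a time series decomposition algorithm that uses LOESS to decompose the data at a given time point into a trend component, a periodic (seasonal) component, and a remainder component:
$Y_v = T_v + S_v + R_v, \quad v = 1, \dots, N$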
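STL consists of an inner loop and an outer loop; the inner loop mainly performs trend fitting and computation of the periodic component. Let $T_v^{(k)}$ and $S_v^{(k)}$ denote the trend and periodic components at the end of the $k$-th pass of the inner loop, with $T_v^{(0)} = 0$ initially, and let the parameters be:
· $n_{(i)}$: the number of inner-loop passes,
· $n_{(o)}$: the number of outer-loop passes,
· $n_{(p)}$: the number of samples in one period,
· $n_{(s)}$: the LOESS smoothing parameter in Step 2,
· $n_{(l)}$: the LOESS smoothing parameter in Step 3,
· $n_{(t)}$: the LOESS smoothing parameter in Step 6.
The sample points at the same position in each period form a subseries; there are clearly $n_{(p)}$ such subseries, called cycle-subseries. The inner loop consists of the following six steps:
· Step 1: Detrending. Subtract the trend component from the previous pass: $Y_v - T_v^{(k)}$;
· Step 2: Cycle-subseries smoothing. Regress each subseries with LOESS ($q = n_{(s)}$, $d = 1$) and extend it by one period forward and backward; the smoothed results form the temporary seasonal series $C_v^{(k+1)}$, $v = -n_{(p)}+1, \dots, N+n_{(p)}$;
· Step 3: Low-pass filtering of the smoothed cycle-subseries. Apply moving averages of lengths $n_{(p)}$, $n_{(p)}$, and 3 in turn to the series $C_v^{(k+1)}$ from the previous step, then apply a LOESS ($q = n_{(l)}$, $d = 1$) regression, giving the series $L_v^{(k+1)}$, $v = 1, \dots, N$; this extracts the low-frequency part of the cycle-subseries;
· Step 4: Detrending of the smoothed cycle-subseries: $S_v^{(k+1)} = C_v^{(k+1)} - L_v^{(k+1)}$;
· Step 5: Deseasonalizing. Subtract the periodic component: $Y_v - S_v^{(k+1)}$;
· Step 6: Trend smoothing. Regress the deseasonalized series with LOESS ($q = n_{(t)}$, $d = 1$) to obtain the trend component $T_v^{(k+1)}$.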
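The alarm indicator value calculation unit 604 is configured to find, according to the call chain, all alarm indicators associated with the alarm information, and to collect the values of the alarm indicators.
In this embodiment, when collecting the values of the alarm indicators, the alarm indicator value calculation unit 604 collects the values from 1 to 2 hours before the alarm to 10 minutes after it. Associated alarm indicators here are those of different alarm objects (each object has several monitored indicators) that have call relationships with one another and can therefore affect each other. After an application in the system raises an alarm, the alarm may involve several indicators; the indicators found are mean-aggregated in time order into a composite entry indicator for the subsequent similarity calculation. When there is only a single alarm indicator, that indicator is used as the entry indicator. The window from 1 to 2 hours before the alarm to 10 minutes after it is a time interval chosen so that root cause identification can run quickly once an alarm is triggered.
In some embodiments, if the search for alarm indicators needs to be widened, the interval can instead be set to run from 1 to 6 hours before the alarm to 10 minutes after it.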
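The alarm indicator similarity calculation unit 606 is configured to smooth the values of all the alarm indicators and to perform a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values.
In this embodiment, the alarm indicator similarity calculation unit 606 computes the similarity values of the alarm indicators as follows: first, it performs locally weighted regression with the LOESS algorithm and takes the regression values as the smoothed series; the lag values are preset to 0 to 90 minutes; it then computes the similarity between each alarm indicator and the entry indicator at each preset lag value, obtaining the lag-value alarm indicators of all alarm indicators at each lag value; finally, it merges the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values. The specific calculation is the same as step 306 in the method embodiment above and is not repeated here.
In some embodiments, alarm indicators may be inversely related, or the abnormal change in the root-cause indicator may be small, so that similarity cannot be computed directly. For example, in real O&M scenarios two indicators may change in the same way but with very different amplitudes, which lowers the correlation coefficient. In addition, some alarm indicators may change in the same way as the entry indicator even though that change is normal behavior for them, and such a change must not be treated as the root cause of the fault. Therefore, besides the LOESS-based calculation above, the input to the similarity calculation also needs to include the residuals against the historical STL periodic component: similarity is computed on the residuals of each alarm indicator so that the degree of change is reflected comprehensively. The alarm indicator similarity calculation unit 606 therefore additionally computes the similarity values as follows: it first obtains the residuals between the smoothed series of each alarm indicator, obtained by the LOESS algorithm, and the historical STL periodic component; for the alarm indicators that have STL periodic-component residuals, it computes the similarity separately to obtain the similarity values of the STL-residual alarm indicators; and if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, it merges the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.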
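The root-cause alarm indicator output unit 608 is configured to aggregate the similarity values of the alarm indicators with higher lag values, to rank the alarm indicators with higher similarity values in combination with the hierarchical relationship of the call chain, and to output the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.
In this embodiment, one alarm indicator may correspond to several groups of results; combining these with the call-chain hierarchy, all associated alarm indicators are ranked, which ensures the diversity of the output root causes. From the similarity values obtained in this way, the highest mean similarity and its delay (lag) can be derived. As for the call-chain hierarchy information, the further downstream an object is, the more it affects upstream objects and the more likely it is to be the root cause. After ranking by similarity, the number of upstream call chains in the call chain is reduced accordingly: if an object has a downstream object that is a potential root cause, the object is probably itself affected and can be excluded directly. Finally, the top-ranked alarm indicators are output as the root-cause indicators. Locating the root cause requires identifying the device in which the fault occurred; therefore, once the alarm indicators serving as root-cause indicators are confirmed, the corresponding call-chain device is looked up in the call-chain hierarchy, and that device is determined to be the root cause of the fault.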
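In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the following steps:
receiving anomaly information and issuing alarm information;
finding, according to the call chain, all alarm indicators associated with the alarm information, and collecting the values of the alarm indicators;
smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
aggregating the similarity values of the alarm indicators with higher lag values and, in combination with the hierarchical relationship of the call chain, ranking the alarm indicators with higher similarity values;
outputting the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.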
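In one embodiment, when executing the computer program, the processor further performs the following step: the values of the alarm indicators are collected over the period from 1 to 2 hours before the alarm to 10 minutes after the alarm.
In one embodiment, when executing the computer program, the processor further performs the following steps:
performing locally weighted regression with the LOESS algorithm and taking the regression values as the smoothed series;
presetting the lag values to 0 to 90 minutes;
computing the similarity between each alarm indicator and the entry indicator at each preset lag value, to obtain the lag-value alarm indicators of all alarm indicators at each lag value;
merging the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values.
In one embodiment, when executing the computer program, the processor further performs the following steps:
obtaining the residuals between the smoothed series of the alarm indicators, obtained by the LOESS algorithm, and the historical STL periodic component;
for the alarm indicators that have STL periodic-component residuals, computing the similarity separately to obtain the similarity values of the STL-residual alarm indicators;
if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merging the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.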
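In one embodiment, a storage medium storing computer-readable instructions is provided; the computer-readable storage medium may be non-volatile or volatile. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
receiving anomaly information and issuing alarm information;
finding, according to the call chain, all alarm indicators associated with the alarm information, and collecting the values of the alarm indicators;
smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
aggregating the similarity values of the alarm indicators with higher lag values and, in combination with the hierarchical relationship of the call chain, ranking the alarm indicators with higher similarity values;
outputting the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.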
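In one embodiment, when executing the computer-readable instructions, the processor further performs the following step:
the values of the alarm indicators are collected over the period from 1 to 2 hours before the alarm to 10 minutes after the alarm.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following steps:
performing locally weighted regression with the LOESS algorithm and taking the regression values as the smoothed series;
presetting the lag values to 0 to 90 minutes;
computing the similarity between each alarm indicator and the entry indicator at each preset lag value, to obtain the lag-value alarm indicators of all alarm indicators at each lag value;
merging the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following steps:
obtaining the residuals between the smoothed series of the alarm indicators, obtained by the LOESS algorithm, and the historical STL periodic component;
for the alarm indicators that have STL periodic-component residuals, computing the similarity separately to obtain the similarity values of the STL-residual alarm indicators;
if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merging the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.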
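Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), and the like.
The above are only specific embodiments of this application, but the scope of protection of this application is not limited to them. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the scope of protection of this application. The scope of protection of this application shall therefore be determined by the scope of protection of the claims.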

Claims (20)

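  1. A root cause localization method, wherein the root cause localization method is used by a root cause analysis system to locate the root cause of a fault during operation and maintenance, and comprises the following steps:
    receiving anomaly information and issuing alarm information;
    finding, according to the call chain, all alarm indicators associated with the alarm information, and collecting the values of the alarm indicators;
    smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
    aggregating the similarity values of the alarm indicators with higher lag values and, in combination with the hierarchical relationship of the call chain, ranking the alarm indicators with higher similarity values;
    outputting the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.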
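  2. The root cause localization method according to claim 1, wherein the values of the alarm indicators are collected over the period from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  3. The root cause localization method according to claim 1 or 2, wherein smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values, specifically comprises the following steps:
    performing locally weighted regression with the LOESS algorithm and taking the regression values as the smoothed series;
    presetting the lag values to 0 to 90 minutes;
    computing the similarity between each alarm indicator and the entry indicator at each preset lag value, to obtain the lag-value alarm indicators of all alarm indicators at each lag value;
    merging the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values.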
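  4. The root cause localization method according to claim 3, wherein the calculation of the similarity values of the alarm indicators further comprises the following steps:
    obtaining the residuals between the smoothed series of the alarm indicators, obtained by the LOESS algorithm, and the historical STL periodic component;
    for the alarm indicators that have STL periodic-component residuals, computing the similarity separately to obtain the similarity values of the STL-residual alarm indicators;
    if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merging the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.
  5. The root cause localization method according to claim 1, wherein receiving anomaly information and issuing alarm information comprises:
    obtaining a collected value and the periodic component corresponding to the collected value;
    issuing alarm information when the collected value and the periodic component both exceed their corresponding thresholds.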
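  6. A root cause localization apparatus, wherein the root cause localization apparatus is used by a root cause analysis system to locate the root cause of a fault during operation and maintenance, and comprises an anomaly detection unit, an alarm indicator value calculation unit, an alarm indicator similarity calculation unit, and a root-cause alarm indicator output unit;
    the anomaly detection unit is configured to receive anomaly information and issue alarm information;
    the alarm indicator value calculation unit is configured to find, according to the call chain, all alarm indicators associated with the alarm information, and to collect the values of the alarm indicators;
    the alarm indicator similarity calculation unit is configured to smooth the values of all the alarm indicators and to perform a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
    the root-cause alarm indicator output unit is configured to aggregate the similarity values of the alarm indicators with higher lag values, to rank the alarm indicators with higher similarity values in combination with the hierarchical relationship of the call chain, and to output the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.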
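  7. The root cause localization apparatus according to claim 6, wherein, when collecting the values of the alarm indicators, the alarm indicator value calculation unit collects the values over the period from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  8. The root cause localization apparatus according to claim 6 or 7, wherein the alarm indicator similarity calculation unit computes the similarity values of the alarm indicators by: first performing locally weighted regression with the LOESS algorithm and taking the regression values as the smoothed series; presetting the lag values to 0 to 90 minutes; then computing the similarity between each alarm indicator and the entry indicator at each preset lag value, to obtain the lag-value alarm indicators of all alarm indicators at each lag value; and finally merging the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values.
  9. The root cause localization apparatus according to claim 8, wherein the alarm indicator similarity calculation unit further computes the similarity values of the alarm indicators by: first obtaining the residuals between the smoothed series of the alarm indicators, obtained by the LOESS algorithm, and the historical STL periodic component; for the alarm indicators that have STL periodic-component residuals, computing the similarity separately to obtain the similarity values of the STL-residual alarm indicators; and, if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merging the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.
  10. The root cause localization apparatus according to claim 6, wherein the anomaly detection unit is specifically configured to:
    obtain a collected value and the periodic component corresponding to the collected value;
    issue alarm information when the collected value and the periodic component both exceed their corresponding thresholds.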
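  11. A computer device, comprising a memory and a processor, the memory and the processor being connected to each other, the memory being configured to store a computer program, the computer program being configured to be executed by the processor, and the computer program being configured to perform a root cause localization method,
    wherein the method comprises:
    receiving anomaly information and issuing alarm information;
    finding, according to the call chain, all alarm indicators associated with the alarm information, and collecting the values of the alarm indicators;
    smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
    aggregating the similarity values of the alarm indicators with higher lag values and, in combination with the hierarchical relationship of the call chain, ranking the alarm indicators with higher similarity values;
    outputting the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.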
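  12. The computer device according to claim 11, wherein the values of the alarm indicators are collected over the period from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  13. The computer device according to claim 11 or 12, wherein smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values, specifically comprises the following steps:
    performing locally weighted regression with the LOESS algorithm and taking the regression values as the smoothed series;
    presetting the lag values to 0 to 90 minutes;
    computing the similarity between each alarm indicator and the entry indicator at each preset lag value, to obtain the lag-value alarm indicators of all alarm indicators at each lag value;
    merging the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values.
  14. The computer device according to claim 13, wherein the calculation of the similarity values of the alarm indicators further comprises the following steps:
    obtaining the residuals between the smoothed series of the alarm indicators, obtained by the LOESS algorithm, and the historical STL periodic component;
    for the alarm indicators that have STL periodic-component residuals, computing the similarity separately to obtain the similarity values of the STL-residual alarm indicators;
    if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merging the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.
  15. The computer device according to claim 11, wherein receiving anomaly information and issuing alarm information comprises:
    obtaining a collected value and the periodic component corresponding to the collected value;
    issuing alarm information when the collected value and the periodic component both exceed their corresponding thresholds.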
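  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements a root cause localization method, the method comprising the following steps:
    receiving anomaly information and issuing alarm information;
    finding, according to the call chain, all alarm indicators associated with the alarm information, and collecting the values of the alarm indicators;
    smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values;
    aggregating the similarity values of the alarm indicators with higher lag values and, in combination with the hierarchical relationship of the call chain, ranking the alarm indicators with higher similarity values;
    outputting the call-chain devices corresponding to the alarm indicators whose similarity values rank highest as the root cause.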
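  17. The computer-readable storage medium according to claim 16, wherein the values of the alarm indicators are collected over the period from 1 to 2 hours before the alarm to 10 minutes after the alarm.
  18. The computer-readable storage medium according to claim 16 or 17, wherein smoothing the values of all the alarm indicators, and performing a similarity calculation for each alarm indicator in combination with preset lag values, to obtain the similarity values of the alarm indicators with higher lag values, specifically comprises the following steps:
    performing locally weighted regression with the LOESS algorithm and taking the regression values as the smoothed series;
    presetting the lag values to 0 to 90 minutes;
    computing the similarity between each alarm indicator and the entry indicator at each preset lag value, to obtain the lag-value alarm indicators of all alarm indicators at each lag value;
    merging the similarity values of the lag-value alarm indicators whose similarity value is greater than 0.65, to obtain the alarm indicators with higher similarity values.
  19. The computer-readable storage medium according to claim 18, wherein the calculation of the similarity values of the alarm indicators further comprises the following steps:
    obtaining the residuals between the smoothed series of the alarm indicators, obtained by the LOESS algorithm, and the historical STL periodic component;
    for the alarm indicators that have STL periodic-component residuals, computing the similarity separately to obtain the similarity values of the STL-residual alarm indicators;
    if the similarity value of an STL-residual alarm indicator and the similarity value of the corresponding lag-value alarm indicator are both greater than 0.65, merging the similarity values of that alarm indicator to obtain the alarm indicators with higher similarity values.
  20. The computer-readable storage medium according to claim 16, wherein receiving anomaly information and issuing alarm information comprises:
    obtaining a collected value and the periodic component corresponding to the collected value;
    issuing alarm information when the collected value and the periodic component both exceed their corresponding thresholds.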
PCT/CN2020/118332 2020-03-12 2020-09-28 Root cause localization method and apparatus, computer device, and storage medium WO2021179574A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010170390.2 2020-03-12
CN202010170390.2A CN111459695A (zh) 2020-03-12 2020-03-12 Root cause localization method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021179574A1 true WO2021179574A1 (zh) 2021-09-16

Family

ID=71680757

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118332 WO2021179574A1 (zh) 2020-03-12 2020-09-28 Root cause localization method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111459695A (zh)
WO (1) WO2021179574A1 (zh)

Also Published As

Publication number Publication date
CN111459695A (zh) 2020-07-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20924108

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20924108

Country of ref document: EP

Kind code of ref document: A1