CN114077510A

CN114077510A - Method and device for fault root cause positioning and fault root cause display

Info

Publication number: CN114077510A
Application number: CN202010802751.0A
Authority: CN
Inventors: 黄荣庚; 董善东; 黄小龙; 姚华宁; 李雄政
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2022-02-22

Abstract

The application belongs to the technical field of data processing and discloses a method and a device for fault root cause positioning and fault root cause display. After receiving an alarm notification, the server acquires the normal index values and abnormal index values corresponding to each attribute, determines an abnormal score value for each attribute according to that attribute's normal and abnormal index values, and determines the fault root cause among the attributes according to the abnormal score values. When the monitoring client receives the fault root cause and the corresponding abnormal score value of each target dimension returned by the server, it displays them in the fault positioning page. This improves the accuracy of fault root cause positioning.

Description

Method and device for fault root cause positioning and fault root cause display

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for fault root cause location and fault root cause display.

Background

In the field of intelligent operation and maintenance, index data of a system is usually monitored. When an index is determined to be abnormal, the system is judged to have a fault, and the attributes most likely to have caused the fault, namely the fault root causes, are identified from all attributes of the system so that the fault can be repaired and further loss stopped.

In the prior art, when a fault root cause is located, whether an attribute is the fault root cause is usually determined only according to the abnormal index data of the attribute.

However, a fault root cause determined from abnormal index data alone is located with low accuracy. Therefore, a technical scheme for fault root cause positioning that improves positioning accuracy is needed.

Disclosure of Invention

The embodiment of the application provides a method and a device for fault root cause positioning and fault root cause display, which are used for improving the accuracy of fault root cause positioning, reducing the cost and improving the efficiency when fault root cause positioning is carried out.

In one aspect, a method for fault root cause location is provided, including:

when an alarm notification is received, acquiring, for each attribute corresponding to a target dimension, the index time sequence corresponding to that attribute, wherein the index time sequence comprises a plurality of index values arranged in time order, and the target dimension corresponds to at least two attributes;

respectively screening out a normal index value and an abnormal index value of each attribute from the index time sequence corresponding to each attribute;

respectively determining an abnormal score value of each attribute according to the normal index value and the abnormal index value of each attribute;

and determining a fault root cause among the attributes according to the abnormal score value corresponding to each attribute.

In one aspect, a method for displaying a fault root cause is provided, where the fault root cause is obtained by any one of the above fault root cause positioning methods, the method including:

displaying a service monitoring page based on the acquired index time sequence corresponding to the system;

according to the index time sequence corresponding to the system, when the index is determined to be abnormal, displaying an abnormal index value and corresponding abnormal time in an alarm analysis page, and sending an alarm notification to a server;

and when the fault root cause and the corresponding abnormal score value of each target dimension returned by the server based on the alarm notification are received, displaying the fault root cause and the corresponding abnormal score value of each target dimension in a fault root cause display page.

In one aspect, an apparatus for fault root cause localization is provided, including:

an acquisition unit, configured to acquire, when an alarm notification is received, the index time sequence corresponding to each attribute of a target dimension, wherein the index time sequence comprises a plurality of index values arranged in time order, and the target dimension corresponds to at least two attributes;

the screening unit is used for screening out a normal index value and an abnormal index value of each attribute from the index time sequence corresponding to each attribute;

the first determining unit is used for respectively determining the abnormal score value of each attribute according to the normal index value and the abnormal index value of each attribute;

and the second determining unit is used for determining the fault root cause among the attributes according to the abnormal score value corresponding to each attribute.

Preferably, the first determination unit is configured to:

respectively determining a first average value of a plurality of abnormal index values corresponding to each attribute;

respectively determining a second average value of a plurality of normal index values corresponding to each attribute;

determining a first total average value of the abnormal index values corresponding to all the attributes;

determining a second total average value of the normal index values corresponding to all the attributes;

determining an abnormal score value of each attribute according to the first average value and the second average value corresponding to the attribute, and the first total average value and the second total average value;

wherein the abnormality score is positively correlated with both the first average and the second total average and negatively correlated with both the second average and the first total average.
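The correlation constraints above can be illustrated with a minimal sketch. The text here states only the direction of each correlation, so the ratio form below is an assumption chosen because it satisfies those directions for positive index values; it is not presented as the formula of the application. The function name `anomaly_scores` and the pooled-mean reading of the "total average" are likewise illustrative assumptions.

```python
from statistics import mean
from typing import Dict, Sequence


def anomaly_scores(
    abnormal: Dict[str, Sequence[float]],
    normal: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score each attribute from its abnormal and normal index values.

    Assumed form (not given in the text):
        score = (first_avg / second_avg) * (second_total_avg / first_total_avg)
    It rises with the first average and the second total average, and falls
    with the second average and the first total average. All averages are
    assumed to be positive and non-zero.
    """
    # Total averages are pooled over the values of every attribute (assumption).
    first_total_avg = mean(v for values in abnormal.values() for v in values)
    second_total_avg = mean(v for values in normal.values() for v in values)

    scores: Dict[str, float] = {}
    for attribute in abnormal:
        first_avg = mean(abnormal[attribute])   # mean of abnormal index values
        second_avg = mean(normal[attribute])    # mean of normal index values
        scores[attribute] = (first_avg / second_avg) * (
            second_total_avg / first_total_avg
        )
    return scores
```

With this form, an attribute whose abnormal-period average departs from its own normal-period average more strongly than the overall trend receives a higher score than attributes that merely follow the overall trend.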

Preferably, the second determination unit is configured to:

screening the attributes according to a first preset screening condition or a second preset screening condition, and determining the screened-out attributes as fault root causes; or,

screening the attributes according to a first preset screening condition and a second preset screening condition respectively, and determining the attributes screened out under both the first preset screening condition and the second preset screening condition as fault root causes.

Preferably, the second determination unit is configured to:

sorting the abnormal score values of the attributes in a high-to-low order;

and according to the sequence of the different abnormal score values, sequentially taking out one abnormal score value from the different abnormal score values, and testing the rest abnormal score values according to a preset normal distribution test algorithm until the test result represents that the rest abnormal score values accord with normal distribution.

Preferably, the second determination unit is configured to:

sorting the abnormal score values of the attributes in a high-to-low order;

and sequentially taking out one abnormal score value from the different abnormal score values according to the sequence of the different abnormal score values, and determining the range or the variance of the rest abnormal score values until the determined range is lower than a preset range threshold value or the determined variance is lower than a preset variance threshold value.
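The take-out procedure based on range or variance can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `take_out_until_settled` is hypothetical, population variance is used, duplicate score values are not collapsed, and the attributes whose scores are taken out are returned as root cause candidates, which is one reading of the steps above.

```python
from statistics import pvariance
from typing import Dict, List, Optional


def take_out_until_settled(
    scores: Dict[str, float],
    range_threshold: Optional[float] = None,
    variance_threshold: Optional[float] = None,
) -> List[str]:
    """Take out scores from high to low until the rest are close together."""
    if range_threshold is None and variance_threshold is None:
        raise ValueError("provide a range threshold or a variance threshold")

    def settled(rest: List[float]) -> bool:
        # Fewer than two values have no meaningful range or variance.
        if len(rest) < 2:
            return True
        if range_threshold is not None and max(rest) - min(rest) < range_threshold:
            return True
        if variance_threshold is not None and pvariance(rest) < variance_threshold:
            return True
        return False

    # Sort the attributes by abnormal score value, highest first.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    taken: List[str] = []
    for position, (attribute, _) in enumerate(ordered):
        taken.append(attribute)
        rest = [score for _, score in ordered[position + 1:]]
        if settled(rest):
            break
    return taken
```

The normal-distribution variant described earlier has the same loop structure; only the `settled` check would be replaced by a normality test on the remaining scores.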

Preferably, the second determination unit is further configured to:

and determining the attributes other than the fault root cause among the attributes as affected attributes.

Preferably, the second determination unit is further configured to:

and removing the attribute corresponding to the abnormal score value lower than the preset score threshold value from the attributes.

In one aspect, a fault root cause display device is provided, where the fault root cause is obtained by any one of the above fault root cause positioning methods, the device including:

the monitoring unit is used for displaying a service monitoring page based on the acquired index time sequence corresponding to the system;

the alarm unit is used for displaying an abnormal index value and corresponding abnormal time in an alarm analysis page when the index is determined to be abnormal according to the index time sequence corresponding to the system, and sending an alarm notification to the server;

and the positioning unit is used for displaying the fault root cause and the corresponding abnormal score value of each target dimension in a fault root cause display page when the fault root cause and the corresponding abnormal score value of each target dimension returned by the server based on the alarm notification are received.

In one aspect, a control device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the steps of any one of the above-mentioned methods for fault root cause localization or fault root cause display.

In one aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of any of the above-mentioned methods for fault root cause localization or fault root cause display.

In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in any of the various alternative implementations of fault root cause localization described above.

In the method and the device for fault root cause positioning and fault root cause display provided by the embodiment of the application, when the monitoring client detects that an index is abnormal, it displays each abnormal index value through the alarm notification page and sends an alarm notification to the server. When the server receives the alarm notification, it acquires the normal index values and abnormal index values corresponding to each attribute, determines an abnormal score value for each attribute according to that attribute's normal and abnormal index values, and determines the fault root cause among the attributes according to the abnormal score values. When the monitoring client receives the fault root cause and the corresponding abnormal score value of each target dimension returned by the server, it displays them through the fault positioning page. This improves the accuracy and efficiency of fault root cause positioning and reduces its cost.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a schematic diagram of a system architecture for fault root cause location according to an embodiment of the present disclosure;

fig. 2 is a schematic flow chart of fault root cause location according to an embodiment of the present disclosure;

fig. 3a is an exemplary diagram of a service monitoring page in an embodiment of the present application;

FIG. 3b is a diagram of an index log record according to an embodiment of the present disclosure;

FIG. 3c is a diagram of an alarm analysis page according to an embodiment of the present application;

FIG. 3d is a diagram illustrating an example of index division according to an embodiment of the present disclosure;

fig. 3e is a flowchart of an implementation of a fault root cause determination method in the embodiment of the present application;

FIG. 3f is an exemplary diagram of an attribute partitioning presentation page according to an embodiment of the present application;

FIG. 3g is a diagram illustrating a failure root cause summary according to an embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating an implementation of a method for displaying a root cause of a fault according to an embodiment of the present disclosure;

fig. 5a is an exemplary diagram of an alarm time distribution page in an embodiment of the present application;

FIG. 5b is a diagram illustrating an exemplary failure root cause display page according to an embodiment of the present disclosure;

FIG. 6a is a schematic structural diagram of a fault root cause locating apparatus according to an embodiment of the present disclosure;

FIG. 6b is a schematic structural diagram of a device for displaying a fault root cause according to an embodiment of the present disclosure;

fig. 7 is a schematic structural diagram of a control device in an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solution and beneficial effects of the present application more clear and more obvious, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

First, some terms referred to in the embodiments of the present application will be described to facilitate understanding by those skilled in the art.

The terminal equipment: may be a mobile terminal, a fixed terminal, or a portable terminal such as a mobile handset, station, unit, device, multimedia computer, multimedia tablet, internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system device, personal navigation device, personal digital assistant, audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination thereof, including the accessories and peripherals of these devices, or any combination thereof. It is also contemplated that the terminal device can support any type of interface to the user (e.g., wearable device), and the like.

Server: may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms.

Monitoring client: a program that corresponds to a server and provides local services to users. Apart from applications that run only locally, clients are generally installed on ordinary client machines and need to operate together with a server. Since the development of the Internet, common clients include web browsers used on the World Wide Web, e-mail clients for sending and receiving e-mail, and instant messaging client software. Such applications require a corresponding server and service program in the network to provide services such as database services and e-mail services, so a specific communication connection must be established between the client and the server to ensure normal operation of the application.

Time series: is a group of data point sequences arranged according to the chronological order. Typically, the time intervals of a time series are constant (e.g., 1 second, 1 minute, 5 minutes). The time series here mainly refers to the time series of the monitoring class.

Index time series: a time series consisting of a plurality of index values arranged in the order in which they occurred.

Shapiro two-sided test: a method for testing normality in frequentist statistics.

Range: the difference between the largest value and the smallest value in a set of data.

Variance: describes the degree of dispersion of the data, i.e., how far a variable deviates from its expected value.

Fault root cause: the root cause of an index abnormality, that is, one or more attributes having a high probability of causing the index abnormality.

Client: a program that corresponds to a server and provides local services to users. Apart from applications that run only locally, a client is generally installed on an ordinary client machine and needs to operate together with a server; a web browser is one example.

Cloud storage: the distributed cloud storage system (hereinafter referred to as a storage system) refers to a storage system which integrates a large number of storage devices (storage devices are also referred to as storage nodes) of different types in a network through application software or application interfaces to cooperatively work through functions of cluster application, grid technology, distributed storage file systems and the like, and provides data storage and service access functions to the outside.

At present, a storage method of a storage system is as follows: logical volumes are created, and when created, each logical volume is allocated physical storage space, which may be the disk composition of a certain storage device or of several storage devices. The client stores data on a certain logical volume, namely, the data is stored on a file system, the file system divides the data into a plurality of parts, each part is an object, the object not only contains the data but also contains additional information such as data identification, the file system writes each object into a physical storage space of the logical volume respectively, and the file system records storage location information of each object, so that when the client requests to access the data, the file system can enable the client to access the data according to the storage location information of each object.

The process of allocating physical storage space for the logical volume by the storage system specifically includes: physical storage space is divided in advance into stripes according to a set of capacity measures of objects stored in a logical volume (the measures usually have a large margin with respect to the capacity of the actual objects to be stored) and Redundant Array of Independent Disks (RAID), and one logical volume can be understood as one stripe, thereby allocating physical storage space to the logical volume.

Database (Database): in short, it can be regarded as an electronic file cabinet, i.e. a place for storing electronic files, and a user can add, query, update, delete, etc. to the data in the files. A "database" is a collection of data that is stored together in a manner that can be shared by multiple users, has as little redundancy as possible, and is independent of the application.

Database management system: a computer software system designed for managing databases, generally providing basic functions such as storage, retrieval, security, and backup. Database management systems may be classified by the database model they support, such as relational or Extensible Markup Language (XML); by the type of computer they support, such as server clusters or mobile phones; by the query language used, such as Structured Query Language (SQL) or XQuery; by performance emphasis, such as maximum scale or maximum operating speed; or by other criteria.

The design concept of the embodiment of the present application is described below.

In the field of intelligent operation and maintenance, index data of a system is usually monitored. When an index is determined to be abnormal, the system is judged to have a fault, and the attributes most likely to have caused the fault, namely the fault root causes, are identified from all attributes of the system so that the fault can be repaired and further loss stopped. The index data may be a Key Performance Indicator (KPI) at each time point. The KPI may be a device index, a service index, or the like.

In the prior art, when a fault root cause is located, whether an attribute is the fault root cause is usually determined manually according to the abnormal index data of the attribute, or the fault root cause is determined by a model-based positioning mode.

However, a fault root cause determined only from abnormal index data is located with low accuracy, manual determination incurs high labor and time costs, and the model-based positioning mode consumes large amounts of storage and computing resources and has low positioning efficiency.

The conventional technology thus provides no technical scheme for fault root cause positioning that improves positioning accuracy and efficiency while reducing cost. Such a scheme is therefore needed.

The applicant considers that abnormal index data and normal index data can be combined to locate the fault root cause, and that the likelihood that an attribute is abnormal can be expressed by an abnormal score value, so that the fault root cause can be determined and displayed. Accordingly, in the embodiment of the application, after receiving the alarm notification, the server acquires the normal index values and abnormal index values corresponding to each attribute, determines an abnormal score value for each attribute according to that attribute's normal and abnormal index values, determines the fault root cause among the attributes according to the abnormal score values, and displays the fault root cause and the abnormal score value of each target dimension through a fault positioning page of the monitoring client.

To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide method steps as shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application. The method can be executed in sequence or in parallel according to the method shown in the embodiment or the figure when the method is executed in an actual processing procedure or a device.

Fig. 1 is a schematic diagram of a system for fault root cause location. The system comprises a terminal device 100 and a server 101.

The server 101: configured to, upon receiving an alarm notification, determine the abnormal score value of each attribute according to the normal index data and abnormal index data of each attribute of the system, determine the fault root cause according to the abnormal score values, and send the fault root cause to the monitoring client.

The alarm notification may be sent by the monitoring client, may be triggered when the server 101 detects that the index is abnormal, or may be sent by other devices.

The index data in the server 101, that is, the normal index data and the abnormal index data, may be uploaded by other devices directly, uploaded by other devices through a log server, or uploaded by a monitoring client. Optionally, the index data may be uploaded in the form of an index time sequence or transmitted in other formats. The index data is usually calculated from service data. Whether the monitored system is abnormal can be judged from the index data.

For example, the index may be: the service internal success rate, the number of service internal errors, the request success rate, the number of successful requests, the number of failed requests, the total number of requests, the overall average request latency, and the like.

It should be noted that the index data generally relates to at least one dimension, and one dimension corresponds to at least one attribute. The index data can be stored in a cloud storage or database mode, and can be managed through a database management system.

For example, the dimensions may be: interface, primary account uin, request source, request client, request region, error code, request password ID, sub-account uin, and the like.

An abnormality of the index data can be reflected in the attributes corresponding to the dimensions involved. That is, the root cause of the index abnormality can be determined from the attributes.

For example, the index is the web page access volume, and the dimensions involved are: user location, network operator, and data center. The attributes corresponding to the user location are: Beijing, Shanghai, and Guangzhou. The attributes corresponding to the network operator are: China Mobile, China Telecom, and China Unicom. The attributes corresponding to the data center are: a first data center and a second data center. When the web page access volume is abnormal, the abnormality is reflected in the attributes corresponding to the user location, the network operator, and the data center, so the dimension and attribute that caused the abnormal web page access volume can be further determined.

Monitoring client: configured to display received index data in a service monitoring page so that the index data can be monitored; further configured to, when the index data is determined to be abnormal, display the abnormal index data and the corresponding abnormal time in an alarm analysis page and send an alarm notification to the server 101; and further configured to receive the fault root cause and the corresponding abnormal score value sent by the server 101 and display them in a fault root cause display page.

The monitoring client may be disposed in the terminal device 100, or may be disposed in the server 101. For example, the monitoring client is a cloud monitoring helper applet. In the embodiment of the present application, only the case where the terminal device 100 is provided with the monitoring client is described as an example.

The page may be divided into different page modules, so that different contents are displayed through different page modules, and therefore, the service monitoring page, the alarm analysis page, and the fault root cause display page may be the same page or different pages, which is not limited herein.

Referring to fig. 2, a flowchart of an implementation of a method for fault root cause location according to the present invention is shown. The method comprises the following specific processes:

Step 200: The monitoring client displays a service monitoring page based on the acquired index time sequence corresponding to the system.

Specifically, the monitoring client receives index data corresponding to the system sent by other equipment in real time or periodically, determines an index time sequence corresponding to the system according to the received index data, and updates the displayed index value and corresponding time in a service monitoring page in real time according to the index time sequence.

In the service monitoring page, the corresponding relation between the index and the time can be displayed in a curve or table mode.

For example, referring to FIG. 3a, an exemplary diagram of a business monitor page is shown. The abscissa is time, the ordinate is success rate (index), and the monitoring client displays a success rate curve according to the success rate index sequence. The success rate curve represents the correspondence between success rate and time.

The index time sequence corresponding to the system is composed of index values corresponding to the system, and the index values in the index time sequence are arranged according to the sequence of generated time.

It should be noted that, from the viewpoint of statistical analysis, indexes can be divided into quantity-type KPIs and rate-type KPIs. A quantity-type KPI is additive, such as the number of successes or the total number of visits; a rate-type KPI is derived from other values, such as the click rate or the success rate. The index value corresponding to the system can be determined from the index values corresponding to each attribute of each dimension.

For example, assume that the index is the web page access volume, the dimension is the user location, and the attributes are regions such as Shanghai and Beijing. The web page access volume of the system is the sum of the web page access volumes of all regions.
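The difference between the two KPI types can be sketched in a few lines. The function names and sample figures are illustrative assumptions, and the text does not give a derivation formula for rate-type KPIs; the sketch assumes a success rate derived as total successes divided by total requests.

```python
from typing import Dict


def system_quantity(per_attribute: Dict[str, float]) -> float:
    """A quantity-type KPI is additive: sum the per-attribute values."""
    return sum(per_attribute.values())


def system_rate(successes: Dict[str, float], totals: Dict[str, float]) -> float:
    """A rate-type KPI is derived from additive quantities (assumed formula)."""
    return system_quantity(successes) / system_quantity(totals)


# Hypothetical per-region figures for one time point.
visits = {"Shanghai": 1200.0, "Beijing": 800.0}
successes = {"Shanghai": 1140.0, "Beijing": 720.0}

total_visits = system_quantity(visits)         # 2000.0
success_rate = system_rate(successes, visits)  # 1860 / 2000
```

Under this assumption the system-level rate is not the plain average of the per-region rates (0.95 and 0.90 here), which is why rate-type KPIs are derived rather than summed.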

The index data may be transmitted in a time-series manner, or may be transmitted in other formats, which is not limited herein.

In one embodiment, the log server extracts the index time series from the index log record and sends the index time series to the monitoring client in a specified format.

For example, referring to FIG. 3b, an index log record is shown. The log records include timestamps, dimensions, attributes, and indexes, where D represents a dimension, E represents an attribute, and n represents the attribute sequence number. The index time series corresponding to the system can be extracted from the log records.
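As a rough sketch of this extraction, log records can be grouped by dimension and attribute and ordered by timestamp. The record layout below (a tuple of timestamp, dimension, attribute, and index value) is a hypothetical simplification, since the actual log format of FIG. 3b is not reproduced in the text; `extract_series` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record shape: (timestamp, dimension, attribute, index value).
LogRecord = Tuple[int, str, str, float]
SeriesKey = Tuple[str, str]


def extract_series(
    records: List[LogRecord],
) -> Dict[SeriesKey, List[Tuple[int, float]]]:
    """Build one index time series per (dimension, attribute) pair."""
    series: Dict[SeriesKey, List[Tuple[int, float]]] = defaultdict(list)
    # Sorting the tuples orders the records by timestamp first.
    for timestamp, dimension, attribute, value in sorted(records):
        series[(dimension, attribute)].append((timestamp, value))
    return dict(series)
```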

Therefore, the user can monitor the index data in real time through the monitoring client.

Step 201: When the monitoring client determines, according to the index time sequence corresponding to the system, that the index is abnormal, it displays the abnormal index values and the corresponding abnormal time in an alarm analysis page and sends an alarm notification to the server.

Specifically, the monitoring client may determine that the index is abnormal in any one of the following modes:

The first mode: for the index values in the index time sequence, when the number of index values located in the preset abnormal interval within the preset abnormal duration is higher than a preset abnormal number threshold, the index is determined to be abnormal.

The second mode: determine the index time sequence corresponding to the current monitoring period, and obtain a difference time sequence by taking its difference from the index time sequence corresponding to the previous monitoring period. For the differences in the difference time sequence, when the number of differences located in the preset abnormal interval within the preset abnormal duration is higher than a preset abnormal number threshold, the index is determined to be abnormal.

Further, when the first mode or the second mode is adopted, the abnormal time period and the abnormal index value may be determined according to the index value located in the preset abnormal interval within the preset abnormal time period.

In one embodiment, the time duration of the abnormal time period is a preset abnormal time duration and includes the index value in the preset abnormal interval, and the index value in the abnormal time period is determined as an abnormal index value.

In one embodiment, the abnormal index value is an index value located in a preset abnormal interval within the preset abnormal duration, and a time point corresponding to the abnormal index value constitutes an abnormal time period.

For example, assuming that the preset abnormal duration is 10 minutes, the preset abnormal interval is [0.6,0.8], the preset abnormal number threshold is 5, and the monitoring client determines that 7 index values exist within 3 to 13 minutes and are located at [0.6,0.8], the index is determined to be abnormal, and the index value within 3 to 13 minutes is determined to be an abnormal index value.

The third mode: the index values located in the preset abnormal interval are screened out from the index values, and if the screened index values are consecutive and their duration is higher than a preset duration threshold, the index is determined to be abnormal.

The fourth mode: the index time sequence corresponding to the current monitoring period is determined, and a difference time sequence is obtained from the differences between it and the index time sequence corresponding to the previous monitoring period. The differences located in the preset abnormal interval are screened out from the differences in the difference time sequence, and if the screened differences are consecutive and their duration is higher than the preset duration threshold, the index is determined to be abnormal.

Further, when the third or fourth mode is adopted, the time period that contains the screened consecutive values and whose duration is higher than the preset duration threshold may be determined as the abnormal time period, and the index values in the abnormal time period may be determined as abnormal index values.

In practical application, the preset abnormal duration, the preset abnormal interval, the preset abnormal number threshold and the preset duration threshold may all be set according to a practical application scenario, for example, the preset abnormal interval is [0.6,0.8], and the preset duration threshold is 1 minute, which is not limited herein.
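The four detection modes can be sketched as two checks plus a differencing helper. This is only an illustration under stated assumptions: the `(time in minutes, value)` point shape, the function names, the inclusive window bounds and the sample data are all hypothetical. The thresholds reuse the example values given above (10 minutes, [0.6, 0.8], a count of 5, and 1 minute).

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (time in minutes, index value); hypothetical shape
Window = Tuple[float, float]


def detect_by_count(series: List[Point], interval: Tuple[float, float],
                    duration: float, count_threshold: int) -> Optional[Window]:
    """Modes 1 and 2: first window of `duration` holding more than
    `count_threshold` values inside `interval` (bounds inclusive)."""
    low, high = interval
    for start, _ in series:
        end = start + duration
        hits = sum(1 for t, v in series if start <= t <= end and low <= v <= high)
        if hits > count_threshold:
            return (start, end)
    return None


def detect_by_run(series: List[Point], interval: Tuple[float, float],
                  duration_threshold: float) -> Optional[Window]:
    """Modes 3 and 4: first run of consecutive in-interval values whose
    duration exceeds `duration_threshold`."""
    low, high = interval
    runs: List[Window] = []
    current: List[float] = []
    for t, v in series:
        if low <= v <= high:
            current.append(t)
        elif current:
            runs.append((current[0], current[-1]))
            current = []
    if current:
        runs.append((current[0], current[-1]))
    for start, end in runs:
        if end - start > duration_threshold:
            return (start, end)
    return None


def difference_series(current: List[Point], previous: List[Point]) -> List[Point]:
    """Modes 2 and 4: point-wise difference against the previous monitoring period."""
    return [(t, v - pv) for (t, v), (_, pv) in zip(current, previous)]


abnormal_minutes = {3, 4, 5, 7, 9, 11, 13}
series = [(float(m), 0.7 if m in abnormal_minutes else 0.95) for m in range(16)]
count_window = detect_by_count(series, (0.6, 0.8), duration=10, count_threshold=5)
run_window = detect_by_run(series, (0.6, 0.8), duration_threshold=1)
```

For the second and fourth modes, the same two checks would be applied to the output of `difference_series` instead of the raw series.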

The alarm analysis page includes at least the abnormal index values and the corresponding abnormal time period. When the abnormal index values and the corresponding abnormal time period are displayed in the alarm analysis page, any one or any combination of the following modes may be adopted:

The first mode: only the abnormal index values and the corresponding times are displayed in the alarm analysis page.

The second mode: each index value and the corresponding time are displayed in the alarm analysis page in real time, and the abnormal index values and the corresponding abnormal time period are identified by highlighting, color or the like.

The third mode: each index value and the corresponding time in the current monitoring period, together with each historical index value and time of the previous monitoring period, are displayed in the alarm analysis page in real time, and the abnormal index values and the corresponding abnormal time period are identified by highlighting, color or the like.

In practical application, the monitoring period may be set according to a practical application scenario, for example, the monitoring period is 1 day, one week, or one year.

For example, assuming that the monitoring period is 1 day and the abnormal time period is from 3 o'clock to 5 o'clock, the alarm page displays the internal success rates of the services for today and yesterday, and identifies the internal success rates between 3 o'clock and 5 o'clock as abnormal data.

Optionally, the alarm page may show each index value and corresponding time in the form of a text, a graph, a curve, and the like, without limitation.

For example, referring to FIG. 3c, an alarm analysis page is shown. It is assumed that the success rate curve represents the correspondence between success rate and time, and that the success rate data between 6 o'clock and 9 o'clock of 5.9 is abnormal. The success rate curve between points 10 and 11 of No. 5.9 and the success rate curve between points 10 and 11 of No. 5.8 are displayed in the alarm analysis page. The success rate curve between points 6 and 9 of No. 5.9 is highlighted by means of highlighting and thickening.

Therefore, when the system runs abnormally, the abnormal index value and the corresponding time can be triggered and displayed, so that a user can check the abnormal index data.

The alert notification may include a target dimension, and may also include a system identifier. In the subsequent steps, the server can perform fault root cause positioning according to the target dimension, and can also acquire the target dimension correspondingly set by the system identifier according to the system identifier, so as to perform fault root cause positioning according to the target dimension. The number of target dimensions may be one or more. The target dimension can be set in real time according to the instruction of the user or can be set in advance by default.

Step 202: when the alarm notification is received, the server respectively obtains an index time sequence corresponding to each attribute in the attributes corresponding to the target dimension.

Specifically, after receiving the alarm notification, the server determines the target dimensions according to the alarm notification, and respectively obtains an index time sequence corresponding to each attribute of each target dimension.

When one dimension corresponds to only one attribute, the index abnormality caused by the attribute can be known without troubleshooting, and when one dimension corresponds to at least two attributes, which attribute causes the index abnormality cannot be uniquely determined, so that the target dimension can be a dimension corresponding to at least two attributes.

When a dimension only corresponds to one attribute, both the monitoring client and the server can directly determine the attribute corresponding to the dimension as the basic information.

Step 203: The server screens out the normal index values and the abnormal index values of each attribute from the index time sequence corresponding to that attribute.

Specifically, the server may determine the abnormal index values of an attribute in the manner described in step 201 above.

When determining the normal index value, any one of the following modes can be adopted:

The first mode: the index values that are not abnormal are determined as normal index values.

The second mode: the abnormal time period corresponding to the abnormal index values is acquired, a normal time period is determined according to the total time period corresponding to the index time sequence, the abnormal time period and a preset transition duration, and the index values in the normal time period are determined as normal index values.

In one embodiment, the time periods of the preset transition duration immediately before and after the abnormal time period are used as transition time periods, and the abnormal time period and the transition time periods are removed from the total time period to obtain the normal time period.

In practical applications, the preset transition duration may be set according to a practical application scenario, for example, 10 minutes, which is not limited herein.

FIG. 3d is a diagram illustrating an example of index division. In fig. 3d, an index curve is shown, with time on the abscissa and index value on the ordinate. The total period is divided into a normal period, a transition period, and an abnormal period. And then the abnormal index value and the normal index value can be determined according to the divided abnormal time period and the divided normal time period.

By adopting the mode, the transition time period is set between the normal time period and the abnormal time period, and the transition time period is used as the transition buffer from the normal index value to the abnormal index value, so that the accuracy of the subsequent fault root cause positioning is improved.
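The division into normal, transition and abnormal time periods can be sketched as follows. The function name, the `(time, value)` point shape and the inclusive period bounds are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time in minutes, index value); hypothetical shape


def split_index_values(series: List[Point], abnormal_period: Tuple[float, float],
                       transition: float) -> Tuple[List[float], List[float]]:
    """Return (normal values, abnormal values).

    Values within `transition` minutes before or after the abnormal period
    are treated as a buffer and assigned to neither group.
    """
    a_start, a_end = abnormal_period
    normal: List[float] = []
    abnormal: List[float] = []
    for t, v in series:
        if a_start <= t <= a_end:
            abnormal.append(v)
        elif a_start - transition <= t <= a_end + transition:
            continue  # transition time period: discarded
        else:
            normal.append(v)
    return normal, abnormal


# One sample every 10 minutes; the value equals the timestamp for readability.
series = [(float(t), float(t)) for t in range(0, 100, 10)]
normal, abnormal = split_index_values(series, abnormal_period=(40, 50), transition=10)
```

Here the samples at minutes 30 and 60 fall into the transition time periods and are excluded from both groups.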

Further, when the abnormal index value and the normal index value are divided, other modes can be adopted, and the method is not limited herein.

Thus, the abnormal index value and the normal index value corresponding to each attribute can be screened out.

Step 204: The server determines the abnormal score value of each attribute according to the normal index values and the abnormal index values of that attribute.

Specifically, when step 204 is executed, the server may adopt the following steps:

S2041: The server determines, for each attribute, a first average value of the abnormal index values corresponding to the attribute and a second average value of the normal index values corresponding to the attribute.

Thus, for each attribute, a first average value and a second average value can be determined, so that both the abnormal index values and the normal index values of a single attribute are taken into account.

S2042: The server determines a first total average value of the abnormal index values corresponding to all the attributes, and a second total average value of the normal index values corresponding to all the attributes.

Thus, the first total average value and the second total average value are determined based on the index data of all the attributes, so that both the abnormal index values and the normal index values of the attributes as a whole are taken into account.

S2043: The server determines the abnormal score value of each attribute according to the first average value and the second average value corresponding to the attribute, the first total average value and the second total average value.

Specifically, the abnormality score value is positively correlated to both the first average value and the second total average value, and negatively correlated to both the second average value and the first total average value.

In one embodiment, the server performs the following steps for each attribute:

and determining a first difference value between the first average value and the second average value corresponding to the attribute, determining a second difference value between the first total average value and the second total average value, and determining a ratio of the first difference value to the second difference value as an abnormal score value corresponding to the attribute.

Optionally, when determining the anomaly score value, the following formula may be adopted:

f = (p1 - p2) / (s1 - s2);

wherein f is the abnormality score value, p1 is the first average value, p2 is the second average value, s1 is the first total average value, and s2 is the second total average value.
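The formula can be illustrated with a short sketch. It assumes that the total average values are taken over the pooled index values of all attributes (the application does not spell out the pooling), and the function name and sample data are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def anomaly_scores(abnormal: Dict[str, List[float]],
                   normal: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute f = (p1 - p2) / (s1 - s2) for every attribute."""
    # Assumption: s1 and s2 are averages over the pooled values of all attributes.
    s1 = mean(v for values in abnormal.values() for v in values)
    s2 = mean(v for values in normal.values() for v in values)
    if s1 == s2:
        raise ValueError("total averages are equal; the score is undefined")
    scores = {}
    for attr in abnormal:
        p1 = mean(abnormal[attr])  # first average value
        p2 = mean(normal[attr])    # second average value
        scores[attr] = (p1 - p2) / (s1 - s2)
    return scores


# Hypothetical failure counts per attribute.
abnormal = {"api": [50, 70], "mc": [12, 8], "q": [11, 9]}
normal = {"api": [10, 10], "mc": [10, 10], "q": [10, 10]}
scores = anomaly_scores(abnormal, normal)
```

In this data only `api` changes between the normal and abnormal periods, so it receives a score of about 3 while the other attributes score 0.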

Further, other methods may also be adopted to determine the abnormal score value, such as the proportion of the number of abnormal index values, entropy-based calculation, calculation using prior and posterior probabilities, or the Gini coefficient and information gain used in decision trees.

Further, the server may also remove, from the attributes, an attribute corresponding to an abnormal score value lower than a preset score threshold.

In practical applications, the preset score threshold may be set according to practical application scenarios, for example, the preset score threshold is 0 or 10, which is not limited herein.

When the abnormal score value of an attribute is larger than the preset score threshold, the attribute is abnormal. The larger the abnormal score value, the more serious the abnormality of the attribute and the more likely the attribute is a fault root cause; conversely, the smaller the abnormal score value, the less serious the abnormality and the less likely the attribute is a fault root cause. Therefore, attributes with a low possibility of abnormality can be removed, which reduces the consumption of processing resources and improves the efficiency of subsequent fault root cause positioning.

Step 205: The server determines the fault root cause among the attributes according to the abnormal score value corresponding to each attribute, and determines the attributes other than the fault root cause as affected attributes.

Specifically, when step 205 is executed, the server may adopt the following two ways:

the first mode is as follows: and the server screens the attributes according to the first preset screening condition or the second preset screening condition, and determines the screened attributes as fault root causes.

The second way is: the server screens the attributes according to a first preset screening condition and a second preset screening condition respectively, and determines common attributes screened according to the first preset screening condition and the second preset screening condition as fault root causes.

When the attributes are screened according to the first preset screening condition, the following steps can be adopted:

The server sorts the abnormal score values of the attributes from high to low, takes out one abnormal score value at a time in that order, and tests the remaining abnormal score values with a preset normal distribution test algorithm, until the test result indicates that the remaining abnormal score values conform to a normal distribution.

Wherein the normal distribution test algorithm is used for determining whether the plurality of data conform to the normal distribution.

Alternatively, the normal distribution test algorithm may be a Shapiro bilateral test.

In one embodiment, after the first attribute is taken out according to the sorting, the Shapiro bilateral test is used to judge whether the abnormal score values corresponding to the remaining attributes conform to a normal distribution. If so, the screening stops; otherwise, the second attribute is taken out and the Shapiro bilateral test is applied again to the abnormal score values of the remaining attributes. This is repeated until the abnormal score values corresponding to the remaining attributes conform to a normal distribution.

In this way, if the abnormal score values corresponding to the remaining attributes conform to a normal distribution, this indicates that, when the index is abnormal, the remaining attributes are affected to a similar or close degree. The remaining attributes are therefore determined as affected attributes, and the screened-out attributes are determined as fault root causes.
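The iterative screening under the first preset screening condition can be sketched as below. Note the substitution: so that the sketch runs with the standard library only, a Jarque-Bera style statistic is used as a stand-in for the Shapiro test named above, and it is known to be weak on small samples. The function names, the significance level `alpha` and the minimum remainder size `min_rest` are assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def jarque_bera_p(values: Sequence[float]) -> float:
    """Stand-in normality test (NOT the Shapiro test): Jarque-Bera p-value."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    if m2 == 0:
        return 1.0  # assumption: identical scores are not rejected
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return math.exp(-jb / 2.0)  # chi-square survival function, 2 degrees of freedom


def screen_by_normality(scores: Dict[str, float],
                        p_value: Callable[[Sequence[float]], float] = jarque_bera_p,
                        alpha: float = 0.05,
                        min_rest: int = 3) -> Tuple[List[str], List[str]]:
    """Take out the highest-scoring attributes one by one until the remaining
    scores pass the normality test. Returns (screened out, remaining)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    screened: List[str] = []
    while len(ranked) > min_rest:
        screened.append(ranked.pop(0)[0])
        rest = [s for _, s in ranked]
        if p_value(rest) >= alpha:  # normality not rejected: stop screening
            break
    return screened, [attr for attr, _ in ranked]


scores = {"api": 100.0, "mc": 90.0, "a": 2.0, "b": 1.5, "c": 1.0,
          "d": 0.5, "e": 0.0, "f": -0.5, "g": -1.0}
root_candidates, affected = screen_by_normality(scores)
```

If SciPy is available, a wrapper that returns the p-value of `scipy.stats.shapiro` could be passed as `p_value` to follow the text more closely.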

When the attributes are screened according to the second preset screening condition, the following steps can be adopted:

The abnormal score values of the attributes are sorted from high to low, one abnormal score value is taken out at a time in that order, and the range or the variance of the remaining abnormal score values is determined, until the determined range is lower than a preset range threshold or the determined variance is lower than a preset variance threshold.

That is, whether the screened-out attributes are fault root causes is determined by the range or the variance of the remaining abnormal score values.

Here, the range is the difference between the maximum and the minimum of the abnormal score values.

In practical application, both the preset range threshold and the preset variance threshold may be set according to a practical application scenario, for example, the preset range threshold is 10, and for example, the preset variance threshold is 5, which is not limited herein.

Thus, if the determined range is lower than the preset range threshold or the determined variance is lower than the preset variance threshold, this indicates that, when the index is abnormal, the remaining attributes are affected to a similar or close degree. The remaining attributes are therefore determined as affected attributes, and the screened-out attributes are determined as fault root causes.
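A minimal sketch of the second preset screening condition follows. The function name is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not say which is meant. The thresholds reuse the example values of 10 and 5.

```python
from statistics import pvariance
from typing import Dict, List, Optional, Tuple


def screen_by_spread(scores: Dict[str, float],
                     range_threshold: Optional[float] = None,
                     variance_threshold: Optional[float] = None) -> Tuple[List[str], List[str]]:
    """Take out the highest-scoring attributes one by one until the range or
    variance of the remaining scores drops below its threshold.
    Returns (screened out, remaining)."""
    if range_threshold is None and variance_threshold is None:
        raise ValueError("at least one threshold is required")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    screened: List[str] = []
    while len(ranked) > 1:
        screened.append(ranked.pop(0)[0])
        rest = [s for _, s in ranked]
        if range_threshold is not None and max(rest) - min(rest) < range_threshold:
            break
        # Assumption: population variance of the remaining scores.
        if variance_threshold is not None and pvariance(rest) < variance_threshold:
            break
    return screened, [attr for attr, _ in ranked]


scores = {"api": 100.0, "mc": 90.0, "a": 2.0, "b": 1.5, "c": 1.0,
          "d": 0.5, "e": 0.0, "f": -0.5, "g": -1.0}
by_range, rest_by_range = screen_by_spread(scores, range_threshold=10)
by_variance, rest_by_variance = screen_by_spread(scores, variance_threshold=5)
```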

In the embodiment of the application, the fault root cause can be determined by adopting any one of a normal distribution test algorithm, a range and a variance, and the fault root cause can be determined by combining the normal distribution test algorithm with the range or the variance, so that the accuracy of determining the fault root cause is improved.

Further, the server determines the target dimension corresponding to only one attribute as the basic information.

In one embodiment, referring to fig. 3e, a flowchart of a method for determining a root cause of a fault is shown. When the server determines the fault root cause, the following steps may be adopted for each target dimension:

S2050: Acquire index data including normal index values and abnormal index values.

S2051: Judge whether the target dimension corresponds to only one attribute; if so, execute S2052, otherwise execute S2053.

S2052: Take the attribute corresponding to the target dimension as basic information.

S2053: Screen the attributes of the target dimension with the first preset screening condition according to the index data to obtain the screened attributes; execute S2054 for the attributes that are not screened out, and execute S2055 for the screened attributes.

S2054: Determine the attributes that are not screened out as affected attributes.

S2055: Obtain the attributes screened out with the first preset screening condition.

S2056: Screen the attributes of the target dimension with the second preset screening condition according to the index data to obtain the screened attributes; execute S2057 for the attributes that are not screened out, and execute S2058 for the screened attributes.

S2057: Determine the attributes that are not screened out as affected attributes.

S2058: Determine the intersection of the attributes screened out with the second preset screening condition and the attributes screened out with the first preset screening condition as the fault root cause.
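The classification performed by this flow can be sketched as follows. The sketch takes the outputs of the two screening conditions as inputs rather than computing them, and the function name, dictionary keys and sample attributes are hypothetical.

```python
from typing import Dict, Iterable, List


def classify_dimension(attributes: List[str],
                       first_screened: Iterable[str],
                       second_screened: Iterable[str]) -> Dict[str, List[str]]:
    """Split the attributes of one target dimension into basic information,
    fault root causes and affected attributes."""
    if len(attributes) == 1:
        # A dimension with a single attribute is reported as basic information.
        return {"basic_info": list(attributes), "root_causes": [], "affected": []}
    first, second = set(first_screened), set(second_screened)
    # Root causes are the attributes screened out by both conditions.
    root_causes = [a for a in attributes if a in first and a in second]
    # Every other attribute of the dimension is an affected attribute.
    affected = [a for a in attributes if a not in root_causes]
    return {"basic_info": [], "root_causes": root_causes, "affected": affected}


result = classify_dimension(
    ["api", "mc", "q", "web"],
    first_screened=["api", "mc", "q"],
    second_screened=["api", "mc"],
)
single = classify_dimension(["cloud server"], first_screened=[], second_screened=[])
```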

It should be noted that, in the embodiment of the present application, only the case of determining the fault root of the system for one target dimension is described, and similarly, the fault root of the system may be determined for other target dimensions, and then the system may be repaired according to the determined fault root.

For example, referring to FIG. 3f, an exemplary diagram of an attribute partitioning presentation page is shown. In fig. 3f, the Module (Module) dimension corresponds to only one attribute, and thus, the attribute is determined to be basic information. The identity verification code (uin) and the cluster region (ClusterRegion) correspond to a plurality of attributes, and the api, mc, chord and q are determined to be fault root causes and the attributes corresponding to the uin are influenced attributes through fault root cause positioning.

For example, referring to FIG. 3g, an exemplary diagram of a fault root cause summary is shown. The fault root cause for each target dimension is shown in fig. 3g. One target dimension corresponds to one or more fault root causes. For example, the module corresponds to one fault root cause, namely the cloud server, and the cluster area corresponds to multiple fault root causes, namely api.

Step 206: The monitoring client receives the fault root causes and the corresponding abnormal score values returned by the server, and displays each fault root cause and the corresponding abnormal score value in the alarm analysis page.

Specifically, the server sends the fault root causes and the corresponding abnormal score values of each target dimension to the monitoring client based on the alarm notification. The monitoring client displays the received fault root causes and the corresponding abnormal score values in the fault root cause display page.

Further, the server may also send basic information and/or affected attributes to the monitoring client, and may also send corresponding anomaly score values. The monitoring client can also display the received basic information and/or the affected attributes and corresponding abnormal score values in the fault root cause display page.

In the embodiment of the application, the index data is monitored through the monitoring client, and the fault root is displayed when the index is abnormal. The monitoring client can run in the server and can also run in the terminal equipment.

Referring to fig. 4, it is a flowchart of an implementation of a method for displaying a fault root cause, and the specific flow of the method is as follows:

step 400: and displaying a service monitoring page based on the acquired index time sequence corresponding to the system.

Step 401: and when the index is determined to be abnormal according to the index time sequence corresponding to the system, displaying an abnormal index value and corresponding abnormal time in an alarm analysis page, and sending an alarm notification to the server.

Furthermore, the server or the monitoring client can summarize abnormal time periods of a plurality of monitoring periods, and display each summarized abnormal time period in the alarm time distribution page, so that abnormal time distribution can be checked on a time scale.

Fig. 5a is a diagram of an exemplary alarm time distribution page. In fig. 5a, the ordinate is the date and the abscissa is the time (hour). Each curve in the alarm time distribution page indicates the date and time (hour) at which an alarm occurred, and the longer the curve, the longer the alarm lasted. For example, an alarm occurred at 1:46 on 02.28 and lasted 4 minutes; as another example, from 02.28 to 04.06, 6 alarms occurred in the 1:00-2:00 time period. In this way, the alarm time distribution information can be fed back to the operator, who can further analyze whether alarms occur regularly, so that the occurrence of alarms can be fundamentally reduced.
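The summary behind such a page can be sketched as a count of alarms per hour of day across several dates. The function name and the sample timestamps (including the year) are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def alarms_per_hour(alarm_starts: Iterable[datetime]) -> Counter:
    """Count how many alarms started in each hour of the day."""
    return Counter(ts.hour for ts in alarm_starts)


# Hypothetical alarm start times spread over several days.
alarm_starts = [
    datetime(2020, 2, 28, 1, 46),
    datetime(2020, 3, 5, 1, 10),
    datetime(2020, 3, 9, 14, 30),
]
counts = alarms_per_hour(alarm_starts)
busiest_hour, busiest_count = counts.most_common(1)[0]
```

An hour that accumulates many alarms across dates, such as the 1:00-2:00 period in the example above, is a candidate for the regularity analysis mentioned in the text.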

Step 402: and when the fault root cause and the corresponding abnormal score value of each target dimension returned by the server based on the alarm notification are received, displaying the fault root cause and the corresponding abnormal score value of each target dimension in a fault root cause display page.

Referring to FIG. 5b, an exemplary view of a fault root cause display page is shown. The page includes alarm analysis information and fault root cause information. The alarm analysis information identifies the abnormal part of the success rate curve by thickening the curve, and the fault root cause information includes the fault root causes and the corresponding abnormal score values.

In the traditional approach, the fault root cause is mainly determined by manual positioning or model positioning, such as time series model positioning and machine learning model positioning. However, manual positioning is inefficient and usually takes more than 10 minutes, whereas in the embodiment of the application the fault root cause can be positioned within seconds. Furthermore, model positioning usually requires a large amount of sample data for model training and is inefficient, whereas the embodiment of the application positions the fault root cause by means of data statistics, which is simple and efficient; because the fault analysis combines abnormal index data with normal index data, both positioning efficiency and accuracy are high. Further, in the embodiment of the application, the suspicious attributes of each dimension can be divided into three types, namely basic information, affected attributes and fault root causes, and displayed to the user, so that the details of the abnormality are shown more specifically. Finally, the alarm information and the fault root causes over a period of time can be summarized to display the time-axis distribution of alarms and the set of fault root causes, which makes it convenient for the user to check for hidden risks and reduce abnormal alarms.

Based on the same inventive concept, the embodiment of the present application further provides a device for fault root cause location, and because the principle of the device and the equipment for solving the problem is similar to that of a method for fault root cause location, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.

Fig. 6a is a schematic structural diagram of a fault root cause locating apparatus according to an embodiment of the present application. An apparatus for fault root cause localization comprising:

an obtaining unit 611, configured to, when an alarm notification is received, respectively obtain an index time sequence corresponding to each attribute in each attribute corresponding to a target dimension, where the index time sequence includes a plurality of index values arranged according to a time sequence, and the target dimension corresponds to at least two attributes;

a screening unit 612, configured to screen out a normal index value and an abnormal index value of each attribute from the index time sequence corresponding to each attribute;

a first determining unit 613, configured to determine an abnormal score value of each attribute according to a normal index value and an abnormal index value of each attribute;

the second determining unit 614 is configured to determine a failure root in each attribute according to the abnormal score value corresponding to each attribute.

Preferably, the first determining unit 613 is configured to:

Preferably, the second determining unit 614 is configured to:

sorting the abnormal score values of the attributes in a high-to-low order;

Preferably, the second determining unit 614 is configured to:

sorting the abnormal score values of the attributes in a high-to-low order;

Preferably, the second determining unit 614 is further configured to:

Fig. 6b is a schematic structural diagram of a device for displaying a fault root cause according to an embodiment of the present application. An apparatus for fault root cause display comprising:

the monitoring unit 621 is configured to display a service monitoring page based on the obtained index time sequence corresponding to the system;

an alarm unit 622, configured to display an abnormal index value and corresponding abnormal time in an alarm analysis page when determining that the index is abnormal according to the index time sequence corresponding to the system, and send an alarm notification to the server;

and the positioning unit 623 is configured to, when receiving the fault root cause and the corresponding abnormal score value of each target dimension returned by the server based on the alarm notification, display the fault root cause and the corresponding abnormal score value of each target dimension in a fault root cause display page.

Fig. 7 shows a schematic configuration of a control device 7000. Referring to fig. 7, the control apparatus 7000 includes: a processor 7010, a memory 7020, a power supply 7030, a display unit 7040, and an input unit 7050.

The processor 7010 is a control center of the control apparatus 7000, connects the respective components by various interfaces and lines, and executes various functions of the control apparatus 7000 by running or executing software programs and/or data stored in the memory 7020, thereby monitoring the control apparatus 7000 as a whole.

In the embodiment of the present application, the processor 7010, when calling the computer program stored in the memory 7020, executes the method for locating the fault root cause provided by the embodiment shown in fig. 2 or the method for displaying the fault root cause provided by the embodiment shown in fig. 4.

Optionally, the processor 7010 may include one or more processing units; preferably, the processor 7010 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 7010. In some embodiments, the processor and the memory may be implemented on a single chip, or in some embodiments, they may be implemented separately on separate chips.

The memory 7020 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, various applications, and the like; the data storage area may store data created from the use of the control device 7000 and the like. In addition, the memory 7020 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The control device 7000 also includes a power supply 7030 (e.g., a battery) for powering the various components, which may be logically coupled to the processor 7010 via a power management system that may be used to manage charging, discharging, and power consumption.

Display unit 7040 may be configured to display information input by a user or information provided to the user, and various menus of control apparatus 7000, and the like, and in the embodiment of the present invention, is mainly configured to display a display interface of each application in control apparatus 7000, and objects such as texts and pictures displayed in the display interface. The display unit 7040 may include a display panel 7041. The Display panel 7041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The input unit 7050 may be used to receive information such as numbers or characters input by a user. The input unit 7050 may include a touch panel 7051 and other input devices 7052. Among other things, the touch panel 7051, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 7051 (e.g., operations by a user on or near the touch panel 7051 using any suitable object or attachment such as a finger, a stylus, etc.).

Specifically, the touch panel 7051 may detect a touch operation of a user, detect the signals generated by the touch operation, convert the signals into touch point coordinates, transmit the coordinates to the processor 7010, and receive and execute commands sent by the processor 7010. In addition, the touch panel 7051 may be implemented as any of various types, such as resistive, capacitive, infrared, or surface acoustic wave. The other input devices 7052 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys and power on/off keys), a trackball, a mouse, a joystick, and the like.

Of course, the touch panel 7051 may cover the display panel 7041. When the touch panel 7051 detects a touch operation on or near it, the touch operation is transmitted to the processor 7010 to determine the type of the touch event, and the processor 7010 then provides a corresponding visual output on the display panel 7041 according to that type. Although in fig. 7 the touch panel 7051 and the display panel 7041 are shown as two separate components that implement the input and output functions of the control device 7000, in some embodiments they may be integrated to implement those functions.

The control device 7000 may also comprise one or more sensors, such as pressure sensors, gravitational acceleration sensors, proximity light sensors, etc. Of course, the control device 7000 may also comprise other components such as a camera, which are not shown in fig. 7 and will not be described in detail, since they are not components used in the embodiments of the present application.

Those skilled in the art will appreciate that fig. 7 is merely an example of a control device and is not limiting; the control device may include more or fewer components than those shown, combine some components, or use different components.

The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for fault root cause location or fault root cause display in any of the above method embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method for fault root cause location or fault root cause display in any of the above method embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the related art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a control device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the various embodiments or of some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method for fault root cause localization, comprising:

2. The method of claim 1, wherein determining the abnormal score value of each attribute according to the normal index value and the abnormal index value of each attribute comprises:

determining an abnormal score value of each attribute according to a first average value and a second average value corresponding to each attribute, and the first total average value and the second total average value;

wherein the abnormal score value is positively correlated with both the first average value and the second total average value, and negatively correlated with both the second average value and the first total average value.
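
As a non-limiting illustration of the correlation recited above, one possible scoring function is sketched below in Python. The ratio form, the function name `anomaly_score`, the parameter names, and the requirement that all four inputs be strictly positive are assumptions made for illustration only; the claim requires only the stated positive and negative correlations and does not prescribe this formula.

```python
def anomaly_score(first_avg: float, second_avg: float,
                  first_total_avg: float, second_total_avg: float) -> float:
    """Hypothetical abnormal score (illustrative ratio form, not the claimed formula).

    The score rises with first_avg and second_total_avg, and falls with
    second_avg and first_total_avg, provided all inputs are positive.
    """
    values = (first_avg, second_avg, first_total_avg, second_total_avg)
    if any(v <= 0 for v in values):
        # Assumption of this sketch: the monotonic behaviour of the ratio
        # only holds for strictly positive averages.
        raise ValueError("this sketch assumes strictly positive averages")
    return (first_avg * second_total_avg) / (second_avg * first_total_avg)


if __name__ == "__main__":
    baseline = anomaly_score(4.0, 2.0, 2.0, 1.0)
    higher_first_avg = anomaly_score(8.0, 2.0, 2.0, 1.0)
    higher_second_avg = anomaly_score(4.0, 4.0, 2.0, 1.0)
    print(baseline, higher_first_avg, higher_second_avg)
```

In this sketch, doubling the first average value doubles the score and doubling the second average value halves it, which is one simple way of satisfying the recited correlations.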

3. The method of claim 1, wherein determining the fault root cause among the attributes according to the abnormal score value corresponding to each attribute comprises:

4. The method of claim 3, wherein the screening of the attributes according to the first preset screening condition comprises:

sorting the abnormal score values of the attributes in a high-to-low order;

5. The method of claim 3, wherein the screening of the attributes according to the second preset screening condition comprises:

sorting the abnormal score values of the attributes in a high-to-low order;
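
As a non-limiting illustration, the sorting step recited in claims 4 and 5 may be sketched in Python as follows. The function name `rank_attributes`, the attribute labels, and the `top_k` cut-off are hypothetical; in particular, `top_k` merely stands in for a screening condition, since the preset screening conditions themselves are not reproduced here.

```python
from typing import Dict, List, Tuple


def rank_attributes(scores: Dict[str, float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Sort attributes by abnormal score value, high to low, and keep the first top_k.

    top_k is a placeholder screening rule assumed for this sketch only.
    """
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    # sorted() is stable, so attributes with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical attribute labels and scores, for illustration only.
    example_scores = {"isp=A": 0.2, "region=south": 1.7, "version=2.1": 0.9}
    for attribute, score in rank_attributes(example_scores, top_k=2):
        print(attribute, score)
```

Sorting first and screening second keeps the two steps independent, so a different screening condition can be substituted without changing the ordering logic.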

6. The method of any one of claims 3-5, further comprising:

7. The method of claim 4 or 5, wherein, before the sorting of the abnormal score values of the attributes in high-to-low order, the method further comprises:

8. A method for fault root cause display, wherein the fault root cause is obtained by the method according to any one of claims 1 to 7, the method for fault root cause display comprising:

and when receiving the fault root cause and the corresponding abnormal score value of each target dimension returned by the server based on the alarm notification, displaying the fault root cause and the corresponding abnormal score value of each target dimension in a fault root cause display page.

9. A fault root cause locating apparatus, comprising:

10. An apparatus for fault root cause display, wherein the fault root cause is obtained by the method according to any one of claims 1 to 7, the apparatus comprising:

an alarm unit, configured to display an abnormal index value and the corresponding abnormal time in an alarm analysis page, and to send an alarm notification to the server, when the index is determined to be abnormal according to the index time sequence corresponding to the system;