CN118118327A - Method, device, equipment, storage medium and program product for locating abnormal root cause - Google Patents

Info

Publication number
CN118118327A
Authority
CN
China
Prior art keywords
service
unit
determining
index
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211524301.5A
Other languages
Chinese (zh)
Inventor
黄涛
陈鹏飞
李瑞鹏
何子龙
杨莽原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211524301.5A
Publication of CN118118327A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a method, a device, equipment, a storage medium and a program product for locating an abnormal root cause; the method comprises the following steps: acquiring a first service parameter value of a target service currently in a service layer; when the abnormality of the target service is determined based on the first service parameter value, acquiring a service index of the target service in a service layer, a dimension index of each service dimension and a unit index of each service unit; determining the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index, and determining the difference degree of each service unit; and combining the influence degree and the difference degree, determining an abnormal service unit from the service units, and determining the abnormal service unit as an abnormal root cause of the target service. The application can effectively improve the positioning accuracy of the abnormal root cause.

Description

Method, device, equipment, storage medium and program product for locating abnormal root cause
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for locating an abnormal root cause, an electronic device, a storage medium, and a program product.
Background
In the field of operation and maintenance of internet services, such as the operation and maintenance of games, locating the root cause of a fault in multidimensional indexes has always been a challenging intelligent operation and maintenance problem. When an aggregate key performance index in an internet service operation and maintenance system becomes abnormal, it is generally desirable to locate the root cause of the fault quickly and accurately, so that repair and loss-mitigation work can be carried out on that root cause in time.
In the related art, when the fault root cause is positioned, unified support for multiple types of indexes is often lacking, so that the positioning of the abnormal root cause by the related art is inaccurate.
Disclosure of Invention
The embodiment of the application provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for locating an abnormal root cause, which can effectively improve the locating accuracy of the abnormal root cause.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a method for positioning an abnormal root cause, which comprises the following steps:
acquiring a first service parameter value of a target service currently in a service layer, wherein the service layer comprises a plurality of service dimensions, and each service dimension comprises at least one service unit;
when it is determined, based on the first service parameter value, that the target service is abnormal, acquiring a service index of the target service in the service layer, a dimension index of each service dimension and a unit index of each service unit;
determining the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index, and determining the difference degree of each service unit;
wherein the difference degree of a service unit characterizes the difference between the second service parameter value of the service unit and the unit index;
and combining the influence degree and the difference degree, determining an abnormal service unit from the service units, and determining the abnormal service unit as an abnormal root cause of the target service.
The embodiment of the application provides a positioning device for an abnormal root cause, which comprises the following components:
the parameter acquisition module is used for acquiring a first service parameter value of a target service currently in a service layer, wherein the service layer comprises a plurality of service dimensions, and each service dimension comprises at least one service unit;
the index acquisition module is used for acquiring a service index of the target service in the service layer, a dimension index of each service dimension and a unit index of each service unit when the target service is determined to be abnormal based on the first service parameter value;
the determining module is used for determining the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index, and determining the difference degree of each service unit; the difference degree of a service unit characterizes the difference between the second service parameter value of the service unit and the unit index;
and the root cause module is used for combining the influence degree and the difference degree, determining an abnormal service unit from the service units, and determining the abnormal service unit as the abnormal root cause of the target service.
In some embodiments, the determining module is further configured to perform the following processing for each service unit: subtracting the service index from the first service parameter value to obtain a first difference value; determining a reference index of the service unit based on the dimension index and the unit index; subtracting the service index from the reference index to obtain a second difference value; and determining the ratio of the second difference value to the first difference value as the influence degree of the service unit on the abnormality of the target service.
In some embodiments, the determining module is further configured to determine, from the plurality of service dimensions, a target service dimension to which the service unit belongs; acquiring a second service parameter value of the target service in the service unit currently and a third service parameter value of the target service in the target service dimension currently; subtracting the second service parameter value and the third service parameter value from the first service parameter value to obtain a reference parameter value; and adding the reference parameter value, the dimension index and the unit index to obtain the reference index of the service unit.
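The two calculations above can be illustrated with a minimal Python sketch. It follows the arithmetic exactly as worded in the text; the function and parameter names are hypothetical and chosen only for illustration, and the guard against a zero first difference is an added assumption, since the text does not say how that case is handled.

```python
def reference_index(first_value, second_value, third_value,
                    dimension_index, unit_index):
    """Reference index of one service unit, as worded in the text."""
    # Reference parameter value: first value minus the unit's current value
    # (second) and the dimension's current value (third).
    reference_value = first_value - second_value - third_value
    # Add the dimension index and the unit index.
    return reference_value + dimension_index + unit_index


def influence_degree(first_value, service_index, ref_index):
    """Ratio of the second difference to the first difference."""
    first_diff = first_value - service_index
    if first_diff == 0:
        # Assumption: without a deviation at the service layer there is
        # nothing to attribute, so the ratio is treated as undefined.
        raise ValueError("first service parameter value equals the service index")
    second_diff = ref_index - service_index
    return second_diff / first_diff
```

For instance, with a first service parameter value of 120, a service index of 100 and a reference index of 110, the sketch yields an influence degree of 0.5.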
In some embodiments, the determining module is further configured to perform the following processing for each service unit: acquiring a second service parameter value of the target service in the service unit currently; determining a ratio of the second service parameter value to the first service parameter value as a posterior probability of the service unit; determining a ratio of the unit index of the service unit to the service index, and determining the ratio as the prior probability of the service unit; and determining the difference degree of the service units based on the prior probability and the posterior probability.
In some embodiments, the determining module is further configured to determine the sum of the prior probability and the posterior probability as a sum probability; determining the ratio of the prior probability to the sum probability as a first reference probability, and determining the ratio of the posterior probability to the sum probability as a second reference probability; determining a first reference difference degree based on the first reference probability and the prior probability, and determining a second reference difference degree based on the second reference probability and the posterior probability; and adding the first reference difference degree and the second reference difference degree to obtain the difference degree of the service unit.
In some embodiments, the determining module is further configured to obtain a first reference coefficient, and determine a product of the first reference probability and the first reference coefficient as a third reference probability; and determining the logarithmic value of the third reference probability, and determining the product of the logarithmic value of the third reference probability and the prior probability as the first reference difference degree.
In some embodiments, the determining module is further configured to obtain a second reference coefficient, and determine a product of the second reference probability and the second reference coefficient as a fourth reference probability; and determining the logarithmic value of the fourth reference probability, and determining the product of the logarithmic value of the fourth reference probability and the posterior probability as the second reference difference degree.
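The difference degree described in the preceding embodiments can be sketched in Python as follows. The names are hypothetical. The text does not give values for the two reference coefficients; the default of 2 used here is an assumption, chosen because it makes each term take the same form as the terms of a Jensen-Shannon-style divergence between the prior and the posterior. The sketch also assumes non-negative inputs and treats a term whose weight is zero as contributing zero.

```python
import math


def difference_degree(first_value, second_value, service_index, unit_index,
                      first_coeff=2.0, second_coeff=2.0):
    """Difference degree of one service unit from its prior and posterior."""
    prior = unit_index / service_index        # unit index / service index
    posterior = second_value / first_value    # current unit value / current total
    total = prior + posterior                 # the "sum probability"
    if total == 0:
        return 0.0                            # assumption: nothing to compare

    first_ref = prior / total                 # first reference probability
    second_ref = posterior / total            # second reference probability

    def term(weight, ref, coeff):
        # weight * log(ref * coeff); a zero weight contributes nothing.
        return 0.0 if weight == 0 else weight * math.log(ref * coeff)

    return (term(prior, first_ref, first_coeff)
            + term(posterior, second_ref, second_coeff))
```

Under these assumptions, a unit whose current share equals its expected share has a difference degree of zero, and the value grows as the two shares move apart.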
In some embodiments, the root cause module is further configured to perform the following processing for each service dimension: when the service dimension includes a plurality of service units, sorting the service units included in the service dimension in descending order of difference degree to obtain a service unit queue of the service dimension; and determining the abnormal service units in the service dimension from the service unit queue based on the influence degree of each service unit in the service unit queue.
In some embodiments, the root cause module is further configured to take the service units in the service unit queue in turn, starting from the head of the queue, as a current service unit, and perform the following processing for the current service unit: comparing the influence degree of the current service unit with a first influence degree threshold, and determining the current service unit as a candidate abnormal service unit when the influence degree of the current service unit is greater than the first influence degree threshold; determining the product of the influence degrees of the current candidate abnormal service units as a reference influence degree; and when the reference influence degree is greater than a second influence degree threshold, stopping the comparison of the next service unit in the service unit queue with the first influence degree threshold, and determining each current candidate abnormal service unit as an abnormal service unit in the service dimension.
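The queue-based selection can be sketched as below. This is an illustrative reading, not a definitive implementation: the `Unit` record, the function name and the threshold values are hypothetical. The reference influence degree is combined with a product, as the text states; the `combine` parameter is an added convenience so that another aggregation, such as `sum`, can be tried. The text does not say what happens if the queue is exhausted before the second threshold is exceeded, so the sketch simply returns the candidates collected so far.

```python
import math
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    difference: float   # difference degree of the service unit
    influence: float    # influence degree of the service unit


def select_abnormal_units(units, first_threshold, second_threshold,
                          combine=math.prod):
    """Pick abnormal service units within one service dimension."""
    # Queue ordered from the largest difference degree to the smallest.
    queue = sorted(units, key=lambda u: u.difference, reverse=True)
    candidates = []
    for unit in queue:
        if unit.influence <= first_threshold:
            continue                      # not a candidate abnormal unit
        candidates.append(unit)
        # Reference influence degree over the current candidates.
        reference = combine(u.influence for u in candidates)
        if reference > second_threshold:
            break                         # stop scanning the rest of the queue
    return [u.name for u in candidates]
```

A short usage example with made-up values:

```python
units = [Unit("a", 0.5, 0.9), Unit("b", 0.3, 0.05), Unit("c", 0.4, 0.8)]
print(select_abnormal_units(units, first_threshold=0.1, second_threshold=0.7))
```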
In some embodiments, the device for locating an abnormal root cause further includes: a root cause relation module, configured to construct an abnormal root cause graph structure based on the abnormal service units, where the abnormal root cause graph structure characterizes the association relation among the abnormal service units; and determine the relation characterized by the abnormal root cause graph structure as the abnormal root cause relation of the target service.
In some embodiments, the root cause relation module is further configured to construct an undirected graph structure among the abnormal service units, with each abnormal service unit as a node; determine the direction of each undirected edge in the undirected graph structure based on the influence degree of each abnormal service unit; and construct the abnormal root cause graph structure based on the undirected graph structure and the directions of the undirected edges in it.
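A minimal sketch of orienting the undirected edges is given below. The text only states that the direction of each edge is determined from the influence degrees; the specific rule used here (each edge points from the unit with the larger influence degree to the unit with the smaller one, with ties keeping the given order) is an assumption made for illustration, as are the function name and the way the undirected edges are supplied.

```python
def orient_edges(influence, undirected_edges):
    """Turn undirected edges between abnormal units into directed edges.

    influence: dict mapping unit name -> influence degree.
    undirected_edges: iterable of (u, v) pairs of unit names.
    """
    directed = []
    for u, v in undirected_edges:
        # Assumed rule: the higher-influence unit is the source of the edge.
        if influence[u] >= influence[v]:
            directed.append((u, v))
        else:
            directed.append((v, u))
    return directed
```

The resulting list of directed edges, together with the nodes, forms one possible representation of the abnormal root cause graph structure.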
An embodiment of the present application provides an electronic device, including:
a memory for storing computer-executable instructions or computer programs;
and a processor for implementing the method for locating an abnormal root cause provided by the embodiment of the application when executing the computer-executable instructions or computer programs stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores computer executable instructions for realizing the method for locating the abnormal root cause provided by the embodiment of the application when being executed by a processor.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the method for locating the abnormal root cause according to the embodiment of the application.
The embodiment of the application has the following beneficial effects:
When the target service is determined to be abnormal based on the first service parameter value, the influence degree of each service unit on the abnormality of the target service is determined based on the service index, the dimension index and the unit index, and the difference degree of each service unit is determined; the influence degree and the difference degree are then combined to determine an abnormal service unit, which is taken as the abnormal root cause of the target service. Because the influence degree and the difference degree together fully reflect the abnormal condition of each service unit, the abnormal root cause among the service units can be determined effectively, which improves the accuracy of locating the abnormal root cause.
Drawings
FIG. 1 is a schematic diagram of an architecture of a system for locating an abnormal root cause provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an electronic device for locating an abnormal root cause according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for locating an abnormal root cause according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a method for locating an abnormal root cause according to an embodiment of the present application;
FIG. 5 to FIG. 8 are schematic flow diagrams of a method for locating an abnormal root cause according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments are explained; the following explanations apply to these terms.
1) Dimension: also called an attribute, that is, a factor that can affect nodes in a service operation and maintenance system. Taking a network service operation and maintenance system as an example, the province and the network service provider are dimensions.
2) Abnormal root cause: the root cause of an index abnormality, i.e., the attribute or attributes with a higher probability of having caused the index abnormality. The abnormal root cause of the target service refers to the root cause of the abnormality of the target service that is determined based on the first service parameter value.
3) Cloud storage: the distributed cloud storage system (hereinafter referred to as storage system) refers to a storage system which integrates a large number of storage devices (storage devices are also called storage nodes) of different types in a network through application software or application interfaces to cooperatively work and provides data storage and service access functions together through functions of cluster application, grid technology, distributed storage file systems and the like.
4) Bayesian network (Bayesian Network): also known as a belief network (Belief Network) or directed acyclic graphical model (Directed Acyclic Graphical Model), is a probabilistic graphical model.
5) Prior probability (Prior Probability): the probability obtained from past experience and analysis, for example by the law of total probability; it typically appears as the probability of the "cause" in problems that reason from cause to result. In Bayesian statistical inference, the prior probability distribution of an uncertain quantity is the probability distribution that expresses the degree of belief in that quantity before certain evidence is taken into account. For example, the prior probability distribution may represent the relative proportion of voters who will vote for a particular candidate in a future election. The unknown quantity may be a parameter of the model or a latent variable.
6) Posterior probability: one of the basic concepts of information theory. In a communication system, the probability, as known to the receiving end after a message has been received, that the message was sent is called the posterior probability. The posterior probability is calculated on the basis of the prior probability: it can be obtained from the prior probability and the likelihood function by Bayes' formula.
In the implementation of the embodiments of the present application, the applicant found that the related art has the following problems:
The related art lacks unified support for multiple types of indexes, which limits its scope of application. It mainly targets additive indexes (such as the number of calls) and derived indexes obtained from additive indexes by addition, subtraction, multiplication and division (such as number of successes / number of calls), and it provides insufficient support for non-additive indexes that must be summarized in other ways. Online systems contain many non-additive indexes, such as CPU usage and access latency. Depending on the specific service requirements, such indexes are summarized by averaging, by taking quantiles, and so on. Because summary indexes cover a large number of non-additive indexes, summation is far from the only summarization mode. Since the related art cannot provide unified support for non-additive indexes and the various summarization modes, its scope of application is greatly limited and its positioning of abnormal root causes is inaccurate.
The embodiment of the application provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for locating an abnormal root cause, which can effectively improve the locating accuracy of the abnormal root cause.
Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a system 100 for locating an abnormal root cause provided by an embodiment of the present application. A terminal (a terminal 400 is shown as an example) is connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two.
The terminal 400 is used by a user through the client 410 and displays the abnormal root cause on a graphical interface 410-1 (the graphical interface 410-1 is shown as an example). The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart television, a smart watch, a car terminal, etc. The electronic device provided by the embodiment of the application can be implemented as a terminal or a server. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In some embodiments, the server 200 obtains a first service parameter value of the target service at the service layer, when determining that there is an abnormality in the target service based on the first service parameter value, obtains a service index, a dimension index, and a unit index, determines a degree of influence and a degree of difference based on the service index, the dimension index, and the unit index, determines an abnormal service unit in combination with the degree of influence and the degree of difference, and sends the abnormal service unit to the terminal 400, and the terminal 400 determines the abnormal service unit as an abnormality root cause.
In other embodiments, the terminal 400 obtains a first service parameter value of the target service at the service layer, when determining that there is an abnormality in the target service based on the first service parameter value, obtains a service index, a dimension index, and a unit index, determines a degree of influence and a degree of difference based on the service index, the dimension index, and the unit index, determines an abnormal service unit in combination with the degree of influence and the degree of difference, and sends the abnormal service unit to the server 200, and the server 200 determines the abnormal service unit as an abnormality root cause.
In other embodiments, the embodiments of the present application may be implemented by means of cloud technology (Cloud Technology), which refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network, so as to implement calculation, storage, processing and sharing of data.
The cloud technology is a generic term of network technology, information technology, integration technology, management platform technology, application technology and the like based on cloud computing business model application, can form a resource pool, and is flexible and convenient as required. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for locating an abnormal root cause according to an embodiment of the present application, where the electronic device 500 shown in fig. 2 may be the server 200 or the terminal 400 in fig. 1, and the electronic device 500 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420. The various components in electronic device 500 are coupled together by bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a read only memory (ROM, Read Only Memory) and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 450 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
an operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer and a driver layer;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: Bluetooth, wireless fidelity (WiFi, Wireless Fidelity), universal serial bus (USB, Universal Serial Bus), etc.
In some embodiments, the device for locating an abnormal root cause provided by the embodiments of the present application may be implemented in software. FIG. 2 shows the device 455 for locating an abnormal root cause stored in the memory 450, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: the parameter acquisition module 4551, the index acquisition module 4552, the determination module 4553 and the root cause module 4554. These modules are logical, so they may be combined arbitrarily or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the device for locating an abnormal root cause provided by the embodiments of the present application may be implemented in hardware. By way of example, the device may be a processor in the form of a hardware decoding processor that is programmed to perform the method for locating an abnormal root cause provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASICs, Application Specific Integrated Circuits), DSPs, programmable logic devices (PLDs, Programmable Logic Devices), complex programmable logic devices (CPLDs, Complex Programmable Logic Devices), field-programmable gate arrays (FPGAs, Field-Programmable Gate Arrays), or other electronic components.
In some embodiments, the terminal or the server may implement the method for locating the abnormal root cause provided by the embodiment of the present application by running a computer program or computer executable instructions. For example, the computer program may be a native program (e.g., a dedicated localization program of an abnormal root cause) or a software module in an operating system, e.g., a localization module of an abnormal root cause that may be embedded in any program (e.g., an instant messaging client, an album program, an electronic map client, a navigation client); for example, a Native Application (APP) may be used, i.e. a program that needs to be installed in an operating system to be run. In general, the computer programs described above may be any form of application, module or plug-in.
The method for locating the abnormal root causes provided by the embodiment of the application will be described in conjunction with the exemplary application and implementation of the server or the terminal provided by the embodiment of the application.
Referring to FIG. 3, FIG. 3 is a schematic flow chart of a method for locating an abnormal root cause according to an embodiment of the present application, which will be described with reference to steps 101 to 105 shown in FIG. 3. The method may be implemented by a server alone, by a terminal alone, or by a server and a terminal in cooperation; the following description takes implementation by a server alone as an example.
In step 101, a first service parameter value of a target service currently in a service layer is obtained.
In some embodiments, the service layer includes a plurality of service dimensions, and each service dimension includes at least one service unit.
In some embodiments, the target service may be any of various operation and maintenance services in the field of internet service operation and maintenance, for example, a data platform service, a capacity platform service, an operation and maintenance service of a cloud game, or an operation and maintenance service of an instant messaging service. The target service may also be any of various application services in an industrial production environment, such as industrial production data statistics.
In some embodiments, the service layer refers to the highest operation level that supports operation of the target service. When the target service is an operation and maintenance service in the field of internet service operation and maintenance, the service layer may be a data center that supports operation of the operation and maintenance service. Taking the operation and maintenance service of a cloud game as an example of the target service, the data center of the cloud game provides service support for the operation and maintenance service of the cloud game, so the data center of the cloud game is the service layer of that operation and maintenance service.
In some embodiments, the first service parameter value of the target service currently in the service layer refers to an operation parameter of the highest operation level that supports operation of the target service, and the first service parameter value is used to indicate whether operation of the target service in the service layer is abnormal.
In some embodiments, whether operation of the target service in the service layer is abnormal may be determined as follows: a service parameter threshold is acquired, and the first service parameter value is compared with the service parameter threshold to obtain a comparison result, where the comparison result represents whether the first service parameter value is greater than the service parameter threshold; when the first service parameter value is greater than the service parameter threshold, it is determined that operation of the target service in the service layer is abnormal.
As an example, when the target service is the operation and maintenance service of a cloud game, the data center of the cloud game provides service support for it and is therefore its service layer. The first service parameter value of the target service currently in the service layer may be an overall operation and maintenance parameter of the operation and maintenance service of the cloud game currently in the data center of the cloud game, for example, the overall utilization rate of the data center of the cloud game.
In some embodiments, when the target service is an operation and maintenance service in the field of internet service operation and maintenance, the service layer may be a data center that supports operation of the operation and maintenance service. In this case, step 101 may be implemented as follows: the root cause locating server acquires the operation and maintenance parameters of the target service on each operation and maintenance server in the data center, and summarizes the operation and maintenance parameters to obtain the first service parameter value of the target service.
In some embodiments, the operation and maintenance parameters may be operation parameters of the operation and maintenance servers, for example, the utilization rate of each operation and maintenance server. The operation parameters of the operation and maintenance servers are summarized to obtain an overall operation parameter of the data center, and the overall operation parameter of the data center is used as the first service parameter value of the target service.
In some embodiments, the operation parameters of the operation and maintenance servers may be summarized as follows: the operation parameters of the operation and maintenance servers are weighted and summed to obtain the overall operation parameter of the data center; or the operation parameters of the operation and maintenance servers are summed to obtain the overall operation parameter of the data center.
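As an illustration only, the summarizing of per-server operation parameters and the threshold comparison described for step 101 might be sketched as follows. The function names, the sample utilization values, the weights, and the threshold are all hypothetical and are not taken from the application.

```python
from typing import Optional, Sequence


def aggregate_parameters(values: Sequence[float],
                         weights: Optional[Sequence[float]] = None) -> float:
    """Summarize per-server operation parameters into one overall value.

    Without weights this is a plain sum; with weights it is a weighted sum.
    """
    if weights is None:
        return float(sum(values))
    if len(weights) != len(values):
        raise ValueError("values and weights must have the same length")
    return float(sum(w * x for w, x in zip(weights, values)))


def is_abnormal(first_parameter_value: float, threshold: float) -> bool:
    """The target service is treated as abnormal when the value exceeds the threshold."""
    return first_parameter_value > threshold


# Hypothetical utilization rates of three operation and maintenance servers.
utilization = [0.5, 0.75, 0.25]
weights = [0.5, 0.25, 0.25]  # assumed weights, chosen only for the example

overall = aggregate_parameters(utilization, weights)
print(overall, is_abnormal(overall, threshold=0.4))
```

How the weights are chosen is not specified in the text; in this sketch they are simply passed in by the caller.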
In some embodiments, the service layer includes a plurality of service dimensions, and each service dimension includes at least one service unit. For example, when the service layer is a data center that supports operation of an operation and maintenance service, the data center includes operation and maintenance server clusters of a plurality of dimensions, and each operation and maintenance server cluster includes at least one service unit (i.e., an operation and maintenance server).
In this way, acquiring the first service parameter value of the target service currently in the service layer makes it possible to judge in time whether the overall operation of the target service is abnormal. When the target service is abnormal, the abnormal root cause of the target service is determined by the method for locating an abnormal root cause provided by the embodiments of the present application, which can effectively improve both the accuracy and the timeliness of locating the abnormal root cause.
In step 102, when it is determined that the target service is abnormal based on the first service parameter value, a service index of the target service in the service layer, a dimension index of each service dimension, and a unit index of each service unit are obtained.
In some embodiments, the service index of the target service in the service layer, the dimension index of each service dimension, and the unit index of each service unit may be acquired as follows: historical service indexes of the target service at a plurality of time points in the service layer are acquired, and the historical service indexes at the time points are weighted and summed to obtain the service index of the target service in the service layer; historical dimension indexes of the target service at a plurality of time points in each service dimension are acquired, and the historical dimension indexes at the time points are weighted and summed to obtain the dimension index of the service dimension; historical unit indexes of the target service at a plurality of time points in each service unit are acquired, and the historical unit indexes at the time points are weighted and summed to obtain the unit index of the service unit.
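The weighted summation of historical indexes can be sketched as below. The text does not say how the weights over time points are chosen, so the normalized, recency-favoring weights produced by `recency_weights` are purely an assumption for illustration, as are the function names and sample values.

```python
from typing import List, Sequence


def index_from_history(history: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the historical indexes observed at several time points."""
    if len(history) != len(weights):
        raise ValueError("history and weights must have the same length")
    return sum(w * h for w, h in zip(weights, history))


def recency_weights(n: int, decay: float = 0.5) -> List[float]:
    """Assumed weighting scheme: newer points weigh more; weights sum to 1."""
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical history of one service unit, oldest value first.
unit_history = [10.0, 20.0]
unit_index = index_from_history(unit_history, recency_weights(len(unit_history)))
print(unit_index)
```

The same helper could be applied unchanged at the service layer, at each service dimension, and at each service unit, since the three indexes are described as being obtained in the same way.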
In some embodiments, the service layer includes a plurality of service dimensions, and each service dimension includes at least one service unit. For example, when the service layer is a data center that supports operation of an operation and maintenance service, the data center includes operation and maintenance server clusters of a plurality of dimensions, and each operation and maintenance server cluster includes at least one service unit (i.e., an operation and maintenance server).
In some embodiments, the service index of the target service in the service layer indicates the critical value at which the first service parameter value of the target service in the service layer becomes abnormal; the dimension index of a service dimension indicates the critical value at which the third service parameter value of the target service in the service dimension becomes abnormal; and the unit index of a service unit indicates the critical value at which the second service parameter value of the target service in the service unit becomes abnormal.
As an example, when the service layer is a data center that supports operation of an operation and maintenance service, the data center includes operation and maintenance server clusters of a plurality of dimensions, and each operation and maintenance server cluster includes at least one service unit (i.e., an operation and maintenance server). The service index of the target service in the service layer may be the service index of the data center, the dimension index of a service dimension may be the dimension index of the operation and maintenance server cluster of one dimension, and the unit index of a service unit may be the unit index of one operation and maintenance server.
In this way, acquiring the service index of the target service in the service layer, the dimension index of each service dimension, and the unit index of each service unit makes it possible to subsequently determine the influence degree and the difference degree of each service unit based on these indexes, so that the abnormal service unit is accurately determined from the service units and the accuracy of locating the abnormal root cause is effectively improved.
In step 103, the influence degree of each service unit on the abnormality of the target service is determined based on the service index, the dimension index, and the unit index.
In some embodiments, because the service layer includes a plurality of service dimensions and each service dimension includes at least one service unit, the service unit is the smallest constituent unit of the service layer. An abnormality in a service unit may cause the target service to be abnormal in the service layer; that is, an abnormality of the target service in the service layer is caused by an abnormality in at least one service unit of the service layer. Because the service units differ in importance within the service layer, they also differ in their influence degree on the abnormality of the target service, and the influence degree of a service unit on the abnormality of the target service is directly proportional to the importance of the service unit to the target service.
In some embodiments, referring to fig. 5, fig. 5 is a schematic flowchart of a method for locating an abnormal root cause according to an embodiment of the present application. Step 103 shown in fig. 5 may be implemented by performing the following steps 1031 to 1034 for each service unit.
In step 1031, the service index is subtracted from the first service parameter value to obtain a first difference.
In some embodiments, the expression of the first difference may be:
Q_1 = v - f (1)
where Q_1 represents the first difference, v represents the first service parameter value, and f represents the service index.
In step 1032, a reference index of the service unit is determined based on the dimension index and the unit index.
In some embodiments, the reference index of the service unit may be obtained by replacing, within the first service parameter value, the second service parameter value of the target service currently in the service unit and the third service parameter value of the target service currently in the target service dimension with the corresponding unit index and dimension index, respectively.
In some embodiments, the expression of the reference index of the service unit may be:
f′ = F(v_1, f_1) (2)
where f′ represents the reference index of the service unit, F() represents the replacement function, v_1 represents the second service parameter value of the target service in the service unit together with the third service parameter value of the target service in the target service dimension, and f_1 represents the corresponding dimension index and unit index.
In some embodiments, step 1032 may be implemented as follows: the target service dimension to which the service unit belongs is determined from the plurality of service dimensions; the second service parameter value of the target service currently in the service unit and the third service parameter value of the target service currently in the target service dimension are acquired; the second service parameter value and the third service parameter value are subtracted from the first service parameter value to obtain a reference parameter value; and the reference parameter value, the dimension index, and the unit index are added to obtain the reference index of the service unit.
In some embodiments, the expression of the reference parameter value may be:
v_3 = v - v_1 - v_2 (3)
where v represents the first service parameter value, v_1 represents the second service parameter value of the target service currently in the service unit, v_2 represents the third service parameter value, and v_3 represents the reference parameter value.
In some embodiments, the expression of the reference index of the service unit may be:
f = v_3 + f_2 + f_3 (4)
where f here represents the reference index of the service unit (denoted f′ in formula (2)), v_3 represents the reference parameter value, f_2 represents the dimension index, and f_3 represents the unit index.
In step 1033, the reference index is subtracted from the service index to obtain a second difference.
In some embodiments, the expression of the second difference may be:
Q_2 = v - f (5)
where Q_2 represents the second difference, v here represents the service index, and f here represents the reference index of the service unit; note that v and f are used with different meanings than in formula (1).
In step 1034, the ratio of the second difference to the first difference is determined as the influence degree of the service unit on the abnormality of the target service.
In some embodiments, the expression of the influence degree of the service unit on the abnormality of the target service may be:
Q = Q_2 / Q_1 (6)
where Q represents the influence degree of the service unit on the abnormality of the target service, Q_2 represents the second difference (the service index minus the reference index of the service unit), and Q_1 represents the first difference (the first service parameter value minus the service index).
In this way, the influence degree of each service unit on the abnormality of the target service is determined based on the service index, the dimension index, and the unit index, which makes it possible to subsequently determine the abnormal root cause of the target service based on the influence degree and can effectively improve the accuracy of locating the abnormal root cause.
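Steps 1031 to 1034 can be traced with a small numeric sketch. It follows the subtraction order stated in the text (first difference: first service parameter value minus service index; second difference: service index minus reference index). The function name, argument names, and sample numbers are hypothetical, and raising an error on a zero first difference is an assumption of this sketch.

```python
def influence_degree(first_value: float, service_index: float,
                     unit_value: float, dimension_value: float,
                     dimension_index: float, unit_index: float) -> float:
    """Influence degree of one service unit, following steps 1031 to 1034."""
    # Step 1031: first difference.
    first_difference = first_value - service_index
    if first_difference == 0:
        raise ValueError("first difference is zero; influence degree is undefined")

    # Step 1032: reference parameter value, then reference index of the unit.
    reference_value = first_value - unit_value - dimension_value
    reference_index = reference_value + dimension_index + unit_index

    # Step 1033: second difference.
    second_difference = service_index - reference_index

    # Step 1034: ratio of the second difference to the first difference.
    return second_difference / first_difference


# Hypothetical numbers for one service unit.
print(influence_degree(first_value=100, service_index=80,
                       unit_value=40, dimension_value=50,
                       dimension_index=45, unit_index=20))
```

With these sample numbers the first difference is 20, the reference index is 75, the second difference is 5, and the resulting influence degree is 0.25.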
In step 104, the difference degree of each service unit is determined.
In some embodiments, the difference degree of a service unit is used to characterize the difference between the second service parameter value of the service unit and the unit index.
In some embodiments, referring to fig. 6, fig. 6 is a schematic flowchart of a method for locating an abnormal root cause according to an embodiment of the present application. Step 104 shown in fig. 6 may be implemented by performing the following steps 1041 to 1044 for each service unit.
In step 1041, a second service parameter value of the target service currently in the service unit is obtained.
As an example, when the service layer is a data center that supports operation of an operation and maintenance service, the data center includes operation and maintenance server clusters of a plurality of dimensions, and each operation and maintenance server cluster includes at least one service unit (i.e., an operation and maintenance server). The first service parameter value of the target service in the service layer may be a service parameter value of the data center, the third service parameter value of a service dimension may be a service parameter value of the operation and maintenance server cluster of one dimension, and the second service parameter value of a service unit may be a service parameter value of one operation and maintenance server.
In step 1042, the ratio of the second service parameter value to the first service parameter value is determined as the posterior probability of the service unit.
In some embodiments, the expression of the posterior probability of the service unit may be:
p = v_1 / v (7)
where p represents the posterior probability of the service unit, v_1 represents the second service parameter value, and v represents the first service parameter value.
In step 1043, the ratio of the unit index of the service unit to the service index is determined as the prior probability of the service unit.
In some embodiments, the expression of the prior probability of the service unit may be:
q = f_3 / f (8)
where f represents the service index, f_3 represents the unit index, and q represents the prior probability of the service unit.
In step 1044, the difference degree of the service unit is determined based on the prior probability and the posterior probability.
In some embodiments, step 1044 may be implemented as follows: the sum of the prior probability and the posterior probability is determined as an addition probability; the ratio of the prior probability to the addition probability is determined as a first reference probability, and the ratio of the posterior probability to the addition probability is determined as a second reference probability; a first reference difference degree is determined based on the first reference probability and the prior probability, and a second reference difference degree is determined based on the second reference probability and the posterior probability; and the first reference difference degree and the second reference difference degree are added to obtain the difference degree of the service unit.
In some embodiments, the expression of the addition probability may be:
W = p + q (9)
where W represents the addition probability, p represents the posterior probability of the service unit, and q represents the prior probability of the service unit.
In some embodiments, the expression of the first reference probability may be:
W_1 = q / W (10)
where W_1 represents the first reference probability, W represents the addition probability, and q represents the prior probability of the service unit.
In some embodiments, the expression of the second reference probability may be:
W_2 = p / W (11)
where W_2 represents the second reference probability, W represents the addition probability, and p represents the posterior probability of the service unit.
In some embodiments, the first reference difference degree may be determined based on the first reference probability and the prior probability as follows: a first reference coefficient is acquired, and the product of the first reference probability and the first reference coefficient is determined as a third reference probability; the logarithm of the third reference probability is determined, and the product of the logarithm of the third reference probability and the prior probability is determined as the first reference difference degree.
In some embodiments, the expression of the third reference probability may be:
W_3 = a W_1 (12)
where W_3 represents the third reference probability, W_1 represents the first reference probability, and a represents the first reference coefficient.
In some embodiments, the expression of the first reference difference degree may be:
S_1 = q log_a(W_3) (13)
where S_1 represents the first reference difference degree, q represents the prior probability of the service unit, W_3 represents the third reference probability, and a represents the base of the logarithm, with a > 0 and a ≠ 1.
In some embodiments, the second reference difference degree may be determined based on the second reference probability and the posterior probability as follows: a second reference coefficient is acquired, and the product of the second reference probability and the second reference coefficient is determined as a fourth reference probability; the logarithm of the fourth reference probability is determined, and the product of the logarithm of the fourth reference probability and the posterior probability is determined as the second reference difference degree.
In some embodiments, the expression of the second reference difference degree may be:
S_2 = p log_a(W_4) (14)
where S_2 represents the second reference difference degree, p represents the posterior probability of the service unit, W_4 represents the fourth reference probability (the product of the second reference probability and the second reference coefficient), and a represents the base of the logarithm, with a > 0 and a ≠ 1.
In some embodiments, the expression of the difference degree of the service unit may be:
S = S_1 + S_2 (15)
where S represents the difference degree of the service unit, S_2 represents the second reference difference degree, and S_1 represents the first reference difference degree.
In this way, the difference degree of each service unit is determined, which makes it possible to subsequently determine the abnormal root cause of the target service based on the difference degree and can effectively improve the accuracy of locating the abnormal root cause.
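Steps 1041 to 1044 can be sketched numerically as below. The text does not give values for the two reference coefficients or the logarithm base; this sketch assumes both coefficients are 2 and the base is e, in which case the difference degree is zero when the posterior and prior probabilities coincide. The function and parameter names are hypothetical, and the sketch assumes non-zero `first_value` and `service_index`.

```python
import math


def difference_degree(first_value: float, unit_value: float,
                      service_index: float, unit_index: float,
                      coeff_1: float = 2.0, coeff_2: float = 2.0,
                      base: float = math.e) -> float:
    """Difference degree of one service unit, following steps 1041 to 1044."""
    p = unit_value / first_value      # posterior probability
    q = unit_index / service_index    # prior probability
    w = p + q                         # addition probability
    if w == 0:
        return 0.0                    # assumption: no mass at all means no difference

    w1 = q / w                        # first reference probability
    w2 = p / w                        # second reference probability
    w3 = coeff_1 * w1                 # third reference probability
    w4 = coeff_2 * w2                 # fourth reference probability

    # A zero probability contributes nothing (assumed convention: 0 * log(0) = 0).
    s1 = q * math.log(w3, base) if q > 0 else 0.0
    s2 = p * math.log(w4, base) if p > 0 else 0.0
    return s1 + s2


# Hypothetical values: the unit takes 40% of the actual total but 20% of the expected total.
print(difference_degree(first_value=100, unit_value=40,
                        service_index=100, unit_index=20))
```

Under the assumed coefficients, a unit whose share of the actual value matches its share of the index scores zero, and the score grows as the two shares diverge.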
In step 105, the influence degree and the difference degree are combined to determine an abnormal service unit from the service units, and the abnormal service unit is determined as the abnormal root cause of the target service.
In some embodiments, the abnormal root cause of the target service refers to the root cause of the abnormality that the target service is determined to have based on the first service parameter value.
In some embodiments, referring to fig. 7, fig. 7 is a schematic flowchart of a method for locating an abnormal root cause according to an embodiment of the present application. Step 105 shown in fig. 7 may be implemented by performing the following steps 1051 to 1052 for each service dimension.
In step 1051, when the service dimension includes a plurality of service units, the service units included in the service dimension are sorted in descending order of difference degree to obtain a service unit queue of the service dimension.
In some embodiments, the service unit queue of the service dimension includes a plurality of service units arranged in descending order of difference degree.
As an example, the expression of the service unit queue of the service dimension may be:
L = {D_1, D_2, D_3, …, D_n} (16)
where L represents the service unit queue of the service dimension, and D_1 to D_n represent service units; in the service unit queue L, the difference degree of service unit D_1 is the largest and the difference degree of service unit D_n is the smallest.
As an example, when the target service is an operation and maintenance service in the field of internet service operation and maintenance, the service layer may be a data center that supports operation of the operation and maintenance service. The i-th row of operation and maintenance servers in the data center may be a service dimension L, and the service unit queue of the service dimension L may be L = {D_1, D_2, D_3, …, D_n}; then D_1 is the operation and maintenance server in row i, column 1 of the data center, and D_2 is the operation and maintenance server in row i, column 2 of the data center.
In step 1052, the abnormal service units in the service dimension are determined from the service unit queue based on the influence degree of each service unit in the service unit queue.
In some embodiments, step 1052 may be implemented as follows: starting from the head of the service unit queue, the service units in the service unit queue are taken in turn as the current service unit, and the following processing is performed for the current service unit: the influence degree of the current service unit is compared with a first influence degree threshold, and when the influence degree of the current service unit is greater than the first influence degree threshold, the current service unit is determined as a candidate abnormal service unit; the product of the influence degrees of the candidate abnormal service units is determined as a reference influence degree; and when the reference influence degree is greater than a second influence degree threshold, comparison of the next service unit in the service unit queue with the first influence degree threshold is stopped, and each candidate abnormal service unit is determined as an abnormal service unit in the service dimension.
As an example, take the service unit queue L = {D_1, D_2, D_3, …, D_n} of the service dimension L. Starting from the head of the service unit queue, service unit D_1 is taken as the current service unit, and the following processing is performed for it: the influence degree of D_1 is compared with the first influence degree threshold, and when the influence degree of D_1 is greater than the first influence degree threshold, D_1 is determined as a candidate abnormal service unit; the product of the influence degrees of the candidate abnormal service units (here, the influence degree of D_1 alone) is determined as the reference influence degree; when the reference influence degree is greater than the second influence degree threshold, comparison of the next service unit D_2 with the first influence degree threshold is stopped, and the current candidate abnormal service unit (D_1) is determined as the abnormal service unit in the service dimension.
As an example, when the reference influence degree (the influence degree of D_1) is less than or equal to the second influence degree threshold, service unit D_2 in the service unit queue is taken as the current service unit, and the following processing is performed for it: the influence degree of D_2 is compared with the first influence degree threshold, and when the influence degree of D_2 is greater than the first influence degree threshold, D_2 is determined as a candidate abnormal service unit; the product of the influence degrees of the candidate abnormal service units (the product of the influence degrees of D_1 and D_2) is determined as the reference influence degree; when the reference influence degree is greater than the second influence degree threshold, comparison of the next service unit D_3 with the first influence degree threshold is stopped, and the current candidate abnormal service units (D_1 and D_2) are determined as the abnormal service units in the service dimension.
In this way, the service units in the service unit queue are taken in turn as the current service unit, starting from the head of the queue, and the abnormal service units in the service dimension are determined one by one; once the reference influence degree is greater than the second influence degree threshold, comparison of the next service unit in the service unit queue with the first influence degree threshold is stopped. This effectively avoids the combinatorial explosion that could otherwise arise when determining abnormal service units, shortens the time needed to determine them, and improves the efficiency of determining them.
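A minimal sketch of steps 1051 and 1052 for a single service dimension follows. The `UnitScore` type, the thresholds, and the sample scores are hypothetical. The text does not say what happens if the queue is exhausted before the reference influence degree exceeds the second threshold; returning the candidates collected so far is an assumption of this sketch.

```python
from typing import List, NamedTuple, Sequence


class UnitScore(NamedTuple):
    name: str
    influence: float   # influence degree from step 103
    difference: float  # difference degree from step 104


def find_abnormal_units(units: Sequence[UnitScore],
                        first_threshold: float,
                        second_threshold: float) -> List[str]:
    """Scan one service dimension for abnormal service units."""
    # Step 1051: queue ordered by difference degree, largest first.
    queue = sorted(units, key=lambda u: u.difference, reverse=True)

    candidates: List[UnitScore] = []
    reference_influence = 1.0  # running product over the candidates
    # Step 1052: walk the queue from its head.
    for unit in queue:
        if unit.influence > first_threshold:
            candidates.append(unit)
            reference_influence *= unit.influence
            if reference_influence > second_threshold:
                break  # stop early; later units are never examined
    return [u.name for u in candidates]


# Hypothetical scores, deliberately given out of order.
scores = [
    UnitScore("D3", influence=0.05, difference=0.1),
    UnitScore("D2", influence=2.0, difference=0.3),
    UnitScore("D1", influence=1.5, difference=0.5),
]
print(find_abnormal_units(scores, first_threshold=0.1, second_threshold=2.5))
```

Because the scan is a single pass over a sorted queue, its cost is dominated by the sort rather than by enumerating subsets of service units, which is the point made above about avoiding combinatorial explosion.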
In some embodiments, referring to fig. 8, fig. 8 is a schematic flowchart of a method for locating an abnormal root cause according to an embodiment of the present application. After step 105 shown in fig. 8, the following steps 106 to 107 may be performed to determine the abnormal root cause relationship of the target service.
In step 106, an abnormal root cause graph structure is constructed based on the abnormal service units, and the abnormal root cause graph structure characterizes the association relationship between the abnormal service units.
In some embodiments, the abnormal root cause graph structure is a directed graph structure, and each abnormal service unit is a node in the abnormal root cause graph structure.
In some embodiments, step 106 may be implemented as follows: an undirected graph structure among the abnormal service units is constructed with the abnormal service units as nodes; the direction of each undirected edge in the undirected graph structure is determined based on the influence degree of each abnormal service unit; and the abnormal root cause graph structure is constructed based on the undirected graph structure and the directions of its undirected edges.
In some embodiments, the direction of each undirected edge in the undirected graph structure may be determined based on the influence degree of each abnormal service unit by performing the following processing for each undirected edge: the two constituent vertices of the undirected edge are acquired, the joint probability distribution between the constituent vertices is determined based on the influence degrees of the abnormal service units corresponding to the constituent vertices, the causal relationship between the constituent vertices is determined based on the joint probability distribution, and the direction of the undirected edge is determined based on the causal relationship.
In some embodiments, the causal relationship between the constituent vertices may be determined based on the joint probability distribution as follows: when the probability value corresponding to the joint probability distribution is greater than a joint probability threshold, the causal relationship between the constituent vertices is determined as a first causal relationship; when the probability value corresponding to the joint probability distribution is less than or equal to the joint probability threshold, the causal relationship between the constituent vertices is determined as a second causal relationship. The first causal relationship and the second causal relationship correspond to opposite directions between the constituent vertices.
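The edge-orientation logic of step 106 might be sketched as follows. The text does not specify how the joint probability is computed from the influence degrees, nor which pairs of abnormal service units are connected, so both are left to the caller here; the `toy_probability` function in the usage lines is a stand-in chosen only to make the example run and is not the application's method. All names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def orient_edges(edges: Iterable[Edge],
                 influence: Dict[str, float],
                 joint_probability: Callable[[float, float], float],
                 threshold: float) -> List[Edge]:
    """Turn each undirected edge (u, v) into a directed edge."""
    directed: List[Edge] = []
    for u, v in edges:
        prob = joint_probability(influence[u], influence[v])
        if prob > threshold:
            directed.append((u, v))  # first causal relationship: u -> v
        else:
            directed.append((v, u))  # second causal relationship: v -> u
    return directed


def toy_probability(a: float, b: float) -> float:
    """Placeholder scoring function; an assumption, not from the application."""
    return a / (a + b)


# Hypothetical abnormal service units and their influence degrees.
influence = {"D1": 0.6, "D2": 0.2, "D3": 0.2}
undirected = [("D1", "D2"), ("D3", "D1")]
print(orient_edges(undirected, influence, toy_probability, threshold=0.5))
```

The resulting list of directed edges, together with the abnormal service units as nodes, forms the directed graph that the text refers to as the abnormal root cause graph structure.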
In step 107, the abnormal root cause graph structure is determined as the abnormal root cause relationship of the target service.
In some embodiments, the abnormal root cause relationship of the target service characterizes the association relationship between the abnormal service units.
In this way, determining the abnormal root cause relationship of the target service accurately identifies the possible causal relationships among the abnormal dimensions, which can effectively help system administrators diagnose and handle faults.
In this way, the first service parameter value of the target service in the service layer is acquired; when the target service is determined to be abnormal based on the first service parameter value, the influence degree of each service unit on the abnormality of the target service is determined based on the service index, the dimension index, and the unit index, and the difference degree of each service unit is determined; the influence degree and the difference degree are then combined to determine the abnormal service unit, which is determined as the abnormal root cause of the target service. Because the influence degree and the difference degree together fully reflect the abnormal condition of each service unit, the abnormal root cause among the service units is determined effectively, and the accuracy of locating the abnormal root cause is effectively improved.
In the following, an exemplary application of the embodiment of the present application in an application scenario of actual abnormal root cause localization will be described.
In the field of operation and maintenance of internet services, such as operation and maintenance services of games, fault root cause positioning of multidimensional indexes is always a challenging intelligent operation and maintenance problem. When an abnormality occurs in a certain total key performance index in the internet service operation and maintenance system, it is generally desirable to quickly and accurately locate the root cause of the fault, so as to perform repair and damage-stopping work on the root cause of the fault in time.
In some embodiments, referring to FIG. 4, FIG. 4 is a schematic diagram of a method for locating an abnormal root cause according to an embodiment of the present application. Index source data is acquired and passed through summary logic analysis along two paths: dimension aggregation yields the multi-dimensional data, and global aggregation yields the SLI index. A dimension mapping is constructed over the multi-dimensional data, and anomaly detection is triggered on the SLI index. Abnormal dimension drill-down is then performed using the data after dimension mapping construction, followed by root cause analysis of the abnormal dimensions.
In some embodiments, the dimension mapping construction specifies the calculation formula between each finest-granularity index and the SLI (summary index value) formed by aggregation. This calculation formula can be configured by the user, who can customize any SLI index calculation mode based on the service itself. The SLI index calculation modes required by users include, but are not limited to, a count type, a summation type, a mean type, a maximum value type, a quantile type and a proportion type, which are described below.
For the count type index, the finest-granularity indexes are counted during index aggregation. Taking the operation and maintenance of a data center as an example, data center A contains $n_{sh}$ machines in total, and the data center can acquire the real-time running state of each machine at the finest granularity. The count type aggregation counts the finest-granularity machines, i.e., the aggregated number of machines (count type index) is $n_{sh}$.
For the summation type index, the finest-granularity indexes are summed during index aggregation. Again taking the data center as an example, the data center collects the micro-service request amounts $\{q_1, q_2, \ldots, q_n\}$ of each machine in machine room A. The summation type aggregation sums the request amounts of the machines, i.e., the aggregated micro-service request amount (summation type index) is $\sum_{i=1}^{n} q_i$.
For the mean type index, the mean of the finest-granularity indexes is calculated during index aggregation. Taking the data center as an example, the data center collects the micro-service request latencies $\{rt_1, rt_2, \ldots, rt_n\}$ of each machine in machine room A. The mean type aggregation averages the micro-service request latency over the machines, i.e., the aggregated average micro-service latency (mean type index) is $\frac{1}{n}\sum_{i=1}^{n} rt_i$.
For the maximum value type index, the extreme value of the finest-granularity indexes is calculated during index aggregation. Taking the data center as an example, the data center collects the numbers of working threads $\{wt_1, wt_2, \ldots, wt_m\}$ of the CPU cores in each machine in machine room A. The maximum value type aggregation calculates the extreme value (such as the maximum) of the working threads of the CPU cores in each machine, and the aggregated maximum number of working threads is $\max(\{wt_i \mid i = 1, 2, \ldots, m\})$.
For the quantile type index, a quantile of the finest-granularity indexes is calculated during index aggregation. Taking the data center as an example, the data center collects the CPU utilizations $\{c_1, c_2, \ldots, c_m\}$ of the CPU cores in each machine in machine room A. The quantile type aggregation calculates a quantile (such as the p95 quantile) of the CPU utilization of the CPU cores in each machine, and the aggregated p95 CPU utilization is the 95% quantile of $\{c_1, c_2, \ldots, c_m\}$ sorted from small to large.
For the proportion type index, a ratio is calculated over the finest-granularity indexes during index aggregation. The data center collects the total micro-service request amounts $\{q_1, q_2, \ldots, q_n\}$ of each machine in machine room A, and the numbers of micro-service requests that failed to execute on each machine, $\{e_1, e_2, \ldots, e_n\}$. The proportion type index calculates the micro-service execution failure rate of the whole machine room A, and the aggregated execution failure rate is $\frac{\sum_{i=1}^{n} e_i}{\sum_{i=1}^{n} q_i}$.
By establishing the dimension mapping and supporting user-defined configuration, the overall expansibility of the dimension drill-down can be improved, and the popularization of the dimension drill-down scheme is facilitated.
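The aggregation types described above can be sketched as a small registry of user-selectable functions. This is only an illustrative sketch: the names `AGGREGATORS`, `quantile` and `failure_ratio`, the nearest-rank quantile rule, and the sample numbers are assumptions made here, not part of the application.

```python
import math
from statistics import mean


def quantile(values, q):
    """Nearest-rank quantile of the values sorted from small to large (0 < q <= 1)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical registry: one aggregation function per index type.
AGGREGATORS = {
    "count": lambda xs: len(xs),
    "sum": lambda xs: sum(xs),
    "mean": lambda xs: mean(xs),
    "max": lambda xs: max(xs),
    "p95": lambda xs: quantile(xs, 0.95),
}


def failure_ratio(errors, requests):
    """Proportion type: failed requests divided by total requests."""
    total = sum(requests)
    return sum(errors) / total if total else 0.0


# Made-up finest-granularity values for three machines.
requests = [120, 80, 200]
errors = [6, 4, 10]

sli = {name: fn(requests) for name, fn in AGGREGATORS.items()}
sli["failure_ratio"] = failure_ratio(errors, requests)
```

Keeping each aggregation behind a common call signature is one way to let later drill-down steps stay independent of the index type the user configured.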
In some embodiments, abnormal dimension drill-down takes the value of each finest-granularity index as input and finds the dimension and corresponding element combination that actually cause the abnormality of the summarized SLI index. Because of the combinatorial explosion problem, the features of abnormal dimensions must be fully mined in order to design a heuristic search method. An abnormal dimension and its corresponding element combination generally have two characteristics: (1) the abnormal change caused by this dimension and element combination accounts for a large proportion of the abnormal change in the SLI (characterized by the abnormality degree); (2) the distribution of the value aggregated over this dimension and element combination deviates from the distribution that would normally be observed (characterized by the surprise degree). Embodiments of the present application use the abnormality degree (i.e., the influence degree described above) to characterize the first feature and the surprise degree (i.e., the difference degree described above) to characterize the second feature.
In some embodiments, assume that the finest-granularity indexes are $v_1, v_2, \ldots$ and that the SLI index summarized from them is $v$, where the constructed aggregation manner is denoted by $F$, i.e., $v = F(v_1, v_2, \ldots)$. To calculate the two features above, the values these indexes take when normal are needed first; they are denoted $f_1, f_2, \ldots$ (corresponding to $v_1, v_2, \ldots$) and $f$ (corresponding to $v$). $f_1, f_2, \ldots$ and $f$ can be predicted from historical data by a time-series prediction model. The embodiment of the application uses a moving average model to calculate them, i.e., for an index with historical values $v_{-T}, v_{-T+1}, \ldots, v_{-1}$ of length $T$, the corresponding predicted value $f$ is calculated as follows:
$$f = \frac{1}{T}\sum_{t=-T}^{-1} v_t$$
The above-described predictive model may be replaced by any time-series predictive model.
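As a minimal sketch of the moving average prediction, the function below averages the most recent `window` historical values to obtain the expected (normal) value of an index. The function name, the parameter name `window` (playing the role of T) and the sample history are assumptions for illustration.

```python
def moving_average_forecast(history, window):
    """Predict the normal value of an index as the mean of its last `window` values."""
    if window <= 0 or len(history) < window:
        raise ValueError("window must be positive and no longer than the history")
    recent = history[-window:]
    return sum(recent) / window


# Made-up history of one index, oldest value first.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
f = moving_average_forecast(history, window=4)
```

The same function can be applied to each finest-granularity index and to the summarized SLI index, or swapped for any other time-series model with the same inputs and output.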
Assume that the aggregate index at a certain granularity (a certain dimension $dim_i$ and certain elements $\{d_1, d_2, \ldots\}$ in that dimension) is given, the SLI index is $v$, and the predicted value of the SLI index is $f$. The abnormality degree can then be calculated as follows:
Here, the real values of the dimension under investigation are replaced by the corresponding predicted values at summarization time. The abnormality degree shows how much the abnormality of the overall SLI is relieved when the values on dimension $dim_i$ and on the elements $\{d_1, d_2, \ldots\}$ under that dimension take their normal values. When the abnormality degree is large, $dim_i \in \{d_1, d_2, \ldots\}$ is likely to be an abnormal dimension with its corresponding element values.
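One plausible reading of this description can be sketched as follows: replace the real values of the selected elements with their predicted values, re-aggregate, and measure what fraction of the overall deviation disappears. The exact formula is an assumption here (the relieved deviation divided by the total deviation), as are the function and variable names and the sample data.

```python
def abnormality_degree(real, predicted, selected, aggregate=sum):
    """Fraction of the SLI deviation removed when `selected` elements take normal values.

    real, predicted: dicts mapping element name -> value within one dimension.
    selected: set of element names under investigation.
    aggregate: aggregation function F (assumed to accept an iterable of values).
    """
    v = aggregate(real.values())        # real SLI value
    f = aggregate(predicted.values())   # predicted (normal) SLI value
    if v == f:
        return 0.0                      # no deviation to explain
    # Replace the real values of the selected elements with predicted values.
    repaired = [predicted[k] if k in selected else real[k] for k in real]
    v_repaired = aggregate(repaired)
    return (v - v_repaired) / (v - f)


# Made-up example: element "B" carries the whole deviation.
real = {"A": 50, "B": 100, "C": 30}
predicted = {"A": 50, "B": 40, "C": 30}
degree_b = abnormality_degree(real, predicted, {"B"})
degree_a = abnormality_degree(real, predicted, {"A"})
```

Under this reading, a value close to 1 means the selected elements account for nearly all of the SLI abnormality, and a value close to 0 means they account for almost none of it.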
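In some embodiments, to calculate the surprise degree, the prior probability of the index summarized at that granularity (a certain dimension $dim_i$ and certain elements $\{d_1, d_2, \ldots\}$ in that dimension) is calculated first as the ratio of the predicted value at that granularity to the predicted value of the SLI index, i.e. $p_{dim_i \in \{d_1, d_2, \ldots\}} = f_{dim_i \in \{d_1, d_2, \ldots\}} / f$.
The posterior probability is then calculated from the real values in the same way, i.e. $q_{dim_i \in \{d_1, d_2, \ldots\}} = v_{dim_i \in \{d_1, d_2, \ldots\}} / v$.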
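With $p = p_{dim_i \in \{d_1, d_2, \ldots\}}$ and $q = q_{dim_i \in \{d_1, d_2, \ldots\}}$, the JS divergence of the prior and posterior distributions can be used to calculate the surprise degree at that granularity, i.e. $p \log\frac{c_1 p}{p + q} + q \log\frac{c_2 q}{p + q}$, where $c_1$ and $c_2$ are the first and second reference coefficients.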
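The surprise degree calculation can be sketched as below, following the prior and posterior ratios and the two logarithmic terms described for the determining module. The coefficient value 2 for both reference coefficients is an assumption taken from the usual JS divergence form; the function name and sample numbers are also illustrative only.

```python
import math


def surprise_degree(unit_real, total_real, unit_pred, total_pred, c1=2.0, c2=2.0):
    """JS-divergence-style surprise of one granularity.

    p (prior) is the predicted share, q (posterior) is the real share.
    c1 and c2 are reference coefficients (assumed to be 2 here).
    """
    p = unit_pred / total_pred
    q = unit_real / total_real
    s = p + q
    if s == 0:
        return 0.0

    def term(weight, coeff):
        # A zero probability contributes nothing to the divergence.
        return weight * math.log(coeff * weight / s) if weight > 0 else 0.0

    return term(p, c1) + term(q, c2)


# Made-up example: the share rises from 40/120 (predicted) to 100/180 (real).
changed = surprise_degree(unit_real=100, total_real=180, unit_pred=40, total_pred=120)
unchanged = surprise_degree(unit_real=40, total_real=120, unit_pred=40, total_pred=120)
```

When the real share equals the predicted share, both logarithms are zero and the surprise degree vanishes; the more the share shifts, the larger the value becomes.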
To further avoid the combinatorial explosion problem, a greedy algorithm combined with a pruning strategy based on the abnormality degree is used to search for the dimension with the highest surprise degree and its corresponding elements, which form the output of the abnormal dimension drill-down.
Specifically, the abnormality degree and surprise degree of each element in each dimension are calculated first. Then, for each dimension, the elements are sorted by their surprise degree and analyzed starting from the highest. If the abnormality degree of an element is greater than a set threshold, the element is added to the candidate set. If the abnormality degree of the combination of elements in the candidate set is greater than another set threshold, the search in that dimension terminates and the elements in the candidate set constitute the abnormal elements of that dimension. After every dimension has been analyzed, the candidate sets of the dimensions are ranked by the surprise degree of their dimensions, and the top-ranked candidate sets form the output of the abnormal dimension drill-down.
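The per-dimension search above can be sketched as a short greedy loop. The function name, the tuple layout, the thresholds and the sample values are assumptions; in particular, the way the abnormality degrees of several candidates are combined is left as a pluggable `combine` argument, with plain summation used as an assumed default.

```python
def search_dimension(elements, single_threshold, combined_threshold, combine=sum):
    """Greedy search for abnormal elements within one dimension.

    elements: list of (name, abnormality_degree, surprise_degree) tuples.
    Returns the names of the candidate abnormal elements, in the order examined.
    """
    # Examine elements from the highest surprise degree to the lowest.
    ranked = sorted(elements, key=lambda item: item[2], reverse=True)
    candidates = []
    for name, degree, _surprise in ranked:
        if degree <= single_threshold:
            continue  # prune: this element explains too little of the anomaly
        candidates.append((name, degree))
        # Stop early once the candidates together explain enough of the anomaly.
        if combine(d for _, d in candidates) > combined_threshold:
            break
    return [name for name, _ in candidates]


# Made-up elements of one dimension: (name, abnormality degree, surprise degree).
elements = [("B", 0.6, 0.30), ("C", 0.3, 0.20), ("A", 0.05, 0.25), ("D", 0.2, 0.01)]
abnormal = search_dimension(elements, single_threshold=0.1, combined_threshold=0.8)
```

In this example the search stops after "B" and "C" are collected, so "D" is never examined, which is how the early termination limits the number of combinations considered.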
In some cases, it is necessary to further clarify the causal relationship between specific anomaly dimensions after those anomaly dimensions are drilled down.
In the embodiment of the application, in order to identify real causal relationships, a PC algorithm is used for causal analysis of the abnormal indexes. In general, a fully connected undirected graph over the drilled-down abnormal indexes is constructed first. The D-separation rule is then used to determine the direction of each connection. The causal Markov condition (Causal Markov Condition) is then used to generate a series of independence relationships and construct a causal graph. In embodiments of the present application, a $G^2$ test based on conditional cross entropy is used to verify whether X is independent of Y given a set Z, where X, Y and Z are disjoint subsets of the variable set V, X and Y are single variables, and Z is a set of variables. One advantage of $G^2$ is that no assumption needs to be made about the distribution of each variable. $G^2$ is defined as:
$$G^2 = 2m \cdot CE(X, Y \mid Z)$$
Here m is the sample size and $CE(X, Y \mid Z)$ is the conditional cross entropy of X and Y given the set Z. Under the independence assumption, $G^2$ follows a $\chi^2$ distribution whose number of degrees of freedom is $(N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'}$.
Here, $N_X$, $N_Y$ and $N_{Z'}$ represent the number of values taken by the variables X, Y and Z', respectively. A $\chi^2$ test can therefore determine whether the independence assumption is acceptable. If the p-value exceeds a significance threshold ζ, denoted p > ζ, the independence assumption is accepted; otherwise it is rejected. If X is independent of Y given the set Z, then I(X, Y | Z) = 1. The marginal probabilities of the variables used in computing $G^2$ are calculated as follows:
Here, $dim_i$ is a drilled-down abnormal dimension, $d_1$ is an element of that dimension, and $\{d_1, d_2, \ldots\}$ is the set of all elements of that dimension.
The joint probability distribution may also be calculated in a similar manner:
In general, the procedure starts from a fully connected undirected graph, calculates the probabilities using the formulas above, captures all independence relationships among the variables pairwise using $G^2$, and finally uses the D-separation rule to determine the causal direction between abnormal dimensions. The abnormal dimensions and their causal relationships identified in this way can be provided to users to help them respond to and recover from faults in the system quickly.
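The conditional independence check used in this procedure can be sketched with a count-based G² statistic. The sketch estimates all probabilities from sample frequencies, which is an assumption (the application computes them from the drilled-down dimension values), and it counts only the observed configurations of Z for the degrees of freedom. The function name and sample data are illustrative.

```python
import math
from collections import Counter


def g2_statistic(samples):
    """Return (G^2, degrees of freedom) for testing X independent of Y given Z.

    samples: list of (x, y, z) observations, where z is any hashable value
    (for example a tuple holding the values of the conditioning variables).
    """
    n_xyz = Counter(samples)
    n_xz = Counter((x, z) for x, _, z in samples)
    n_yz = Counter((y, z) for _, y, z in samples)
    n_z = Counter(z for _, _, z in samples)

    # G^2 = 2 * m * CE(X, Y | Z), written directly in terms of counts.
    g2 = 0.0
    for (x, y, z), n in n_xyz.items():
        g2 += 2.0 * n * math.log(n * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))

    n_x = len({x for x, _, _ in samples})
    n_y = len({y for _, y, _ in samples})
    dof = (n_x - 1) * (n_y - 1) * len(n_z)  # observed Z configurations only
    return g2, dof


# Made-up samples: X and Y independent, then X and Y perfectly dependent.
independent = [(x, y, 0) for x in (0, 1) for y in (0, 1)] * 5
dependent = [(0, 0, 0)] * 10 + [(1, 1, 0)] * 10
g2_ind, dof_ind = g2_statistic(independent)
g2_dep, dof_dep = g2_statistic(dependent)
```

The resulting statistic would then be compared against a chi-square distribution with `dof` degrees of freedom, using a statistics library of choice, to obtain the p-value that is checked against the threshold ζ.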
The beneficial effects brought by the embodiment of the application can be summarized as follows: the embodiment of the application can realize the support of the abnormal dimension drill-down of various types of indexes, thereby expanding the application range of the abnormal dimension drill-down method as much as possible. After users in different scenes customize different index aggregation methods, the method can be compatible with the aggregation methods, and the dimension anomaly score is calculated based on the aggregation methods. The method is beneficial to popularization of the model in a production environment, and can support the demands of users on different index aggregation methods. The embodiment of the application has the capability of further analyzing the causal relationship among the abnormal dimensions. The embodiment of the application creatively provides a root cause analysis method of abnormal dimensions based on a Bayesian network, and the possible causal relationship among the abnormal dimensions can effectively help system management staff diagnose faults and process faults.
The embodiment of the application locates the abnormal dimensions of multidimensional indexes with an extended multidimensional drill-down method, and further uses a Bayesian network to confirm finer root cause relationships among the abnormal dimensions so as to accelerate fault diagnosis. However, because the dimension combination space explodes, the embodiment of the application drills down into abnormal dimensions step by step with a greedy algorithm and reduces the search space by pruning, so 100% accuracy of the drill-down results cannot be guaranteed. Broader deployment of the method in industrial environments is planned in order to analyze and demonstrate this issue in more detail and to further explore the potential and limitations of the embodiments of the present application.
It can be appreciated that, in the embodiment of the present application, related data such as the first service parameter value, the service index, the unit index, etc. is referred to, when the embodiment of the present application is applied to a specific product or technology, user permission or consent needs to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
The following continues with an exemplary structure of the abnormal root cause locating device 455 provided by embodiments of the present application when implemented as software modules. In some embodiments, as shown in FIG. 3, the software modules of the abnormal root cause locating device 455 stored in the memory 450 may include: the parameter obtaining module 4551, configured to obtain a first service parameter value of a target service currently in a service layer, where the service layer includes a plurality of service dimensions, and each service dimension includes at least one service unit; the index obtaining module 4552, configured to obtain, when it is determined that the target service is abnormal based on the first service parameter value, a service index of the target service in the service layer, a dimension index of each service dimension, and a unit index of each service unit; the determining module 4553, configured to determine, based on the service index, the dimension index, and the unit index, the influence degree of each service unit on the abnormality of the target service, and determine the difference degree of each service unit, where the difference degree of a service unit represents the difference between the second service parameter value of the service unit and the unit index; and the root cause module 4554, configured to combine the influence degree and the difference degree, determine an abnormal service unit from the service units, and determine the abnormal service unit as the abnormal root cause of the target service.
In some embodiments, the determining module 4553 is further configured to perform the following processing for each service unit: subtracting the business index from the first business parameter value to obtain a first difference value; determining a reference index of the business unit based on the dimension index and the unit index; subtracting the business index from the reference index to obtain a second difference value; and determining the ratio of the second difference value to the first difference value as the influence degree of the service unit on the abnormality of the target service.
In some embodiments, the determining module 4553 is further configured to determine, from a plurality of service dimensions, a target service dimension to which the service unit belongs; acquiring a second service parameter value of the target service in a service unit currently and a third service parameter value of the target service in a target service dimension currently; subtracting the second service parameter value and the third service parameter value from the first service parameter value to obtain a reference parameter value; and adding the reference parameter value, the dimension index and the unit index to obtain the reference index of the business unit.
In some embodiments, the determining module 4553 is further configured to perform the following processing for each service unit: acquiring a second service parameter value of a target service in a current service unit; determining the ratio of the second service parameter value to the first service parameter value as the posterior probability of the service unit; determining the ratio of the unit index of the service unit to the service index, and determining the ratio as the prior probability of the service unit; and determining the difference degree of the service units based on the prior probability and the posterior probability.
In some embodiments, the determining module 4553 is further configured to determine a sum of the prior probability and the posterior probability as a sum probability; determining the ratio of the prior probability and the addition probability as a first reference probability, and determining the ratio of the posterior probability and the addition probability as a second reference probability; determining a first reference difference based on the first reference probability and the prior probability, and determining a second reference difference based on the second reference probability and the posterior probability; and adding the first reference difference degree and the second reference difference degree to obtain the difference degree of the business unit.
In some embodiments, the determining module 4553 is further configured to obtain a first reference coefficient, and determine a product of the first reference probability and the first reference coefficient as a third reference probability; and determining the logarithmic value of the third reference probability, and determining the product of the logarithmic value of the third reference probability and the prior probability as the first reference difference degree.
In some embodiments, the determining module 4553 is further configured to obtain a second reference coefficient, and determine a product of the second reference probability and the second reference coefficient as a fourth reference probability; and determining the logarithmic value of the fourth reference probability, and determining the product of the logarithmic value of the fourth reference probability and the posterior probability as the second reference difference degree.
In some embodiments, the root cause module 4554 is further configured to perform the following processing for each service dimension: when the number of the service units included in the service dimension is a plurality of, sequencing the service units included in the service dimension according to the sequence from big to small of the difference degree to obtain a service unit queue of the service dimension; and determining abnormal business units in the business dimension from the business unit queue based on the influence degree of each business unit in the business unit queue.
In some embodiments, the root cause module 4554 is further configured to sequentially use, starting from the head of the traffic unit queue, the traffic units in the traffic unit queue as the current traffic unit, and perform the following processing for the current traffic unit: comparing the influence degree of the current business unit with a first influence degree threshold, and determining the current business unit as a candidate abnormal business unit when the influence degree of the current business unit is greater than the first influence degree threshold; determining the product of the influence degree of each candidate abnormal business unit as a reference influence degree; and when the reference influence degree is larger than the second influence degree threshold, stopping comparing the next business unit in the business unit queue with the first influence degree threshold, and determining each candidate abnormal business unit as an abnormal business unit in the business dimension.
In some embodiments, the abnormal root cause locating device 455 further includes a root cause relationship module, configured to construct an abnormal root cause graph structure based on each abnormal business unit, where the abnormal root cause graph structure represents the association relationship among the abnormal business units, and to determine the abnormal root cause graph structure as the abnormal root cause relationship of the target service.
In some embodiments, the root cause relationship module is further configured to construct an undirected graph structure between the abnormal business units with each abnormal business unit as a node; determining the direction of each undirected edge in the undirected graph structure based on the influence degree of each abnormal business unit; and constructing an abnormal root graph structure based on the directions of the undirected edges in the undirected graph structure and the undirected graph structure.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the method for locating the abnormal root cause according to the embodiment of the application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to perform the method for locating an abnormal root cause provided by the embodiments of the present application, for example, the method for locating an abnormal root cause shown in FIG. 3.
In some embodiments, the computer-readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; it may also be any of various electronic devices that include one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system. They may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects:
(1) When the abnormality of the target service is determined based on the first service parameter value, determining the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index, and determining the difference degree of each service unit; and combining the influence degree and the difference degree, determining an abnormal business unit, and determining the abnormal business unit as an abnormal root cause of the target business. When the target business is determined to be abnormal based on the first business parameter value, the abnormal root cause in each business unit is determined based on the determined influence degree of each business unit on the abnormality of the target business and the difference degree of each business unit, and the abnormal condition of each business unit can be fully reflected due to the influence degree and the difference degree, so that the abnormal root cause in each business unit is effectively determined, and the positioning accuracy of the abnormal root cause is effectively improved.
(2) By acquiring the first service parameter value of the target service in the service layer at present, whether the overall operation of the target service is abnormal or not is conveniently judged in time, and when the target service is abnormal, the abnormal root cause of the target service is accurately determined by the positioning method of the abnormal root cause provided by the embodiment of the application, so that the positioning accuracy and the positioning timeliness of the abnormal root cause can be effectively improved.
(3) The business index of the target business in the business layer, the dimension index of each business dimension and the unit index of each business unit are obtained, so that the influence degree and the difference degree of each business unit are conveniently and subsequently determined based on the business index, the dimension index and the unit index, and the abnormal business unit is accurately determined from the business units, thereby effectively improving the positioning accuracy of the abnormal root cause.
(4) The influence of each business unit on the abnormality of the target business is accurately determined based on the business index, the dimension index and the unit index, so that the subsequent determination of the abnormality root cause of the target business is facilitated based on the influence, and the positioning accuracy of the abnormality root cause can be effectively improved.
(5) By determining the difference degree of each service unit, the subsequent determination of the abnormal root cause of the target service based on the difference degree is facilitated, and the positioning accuracy of the abnormal root cause can be effectively improved.
(6) Starting from the head of the service unit queue, the service units in the queue are taken in turn as the current service unit, and the abnormal service units in the service dimension are determined in sequence. When the reference influence degree is greater than the second influence degree threshold, the comparison of the next service unit (for example, D3) in the queue with the first influence degree threshold stops. This effectively avoids the combinatorial explosion problem in determining abnormal service units, shortens the time needed to determine them, and improves the efficiency of determining them.
(7) By determining the abnormal root cause relationship of the target service, the possible causal relationship among the abnormal dimensions is accurately determined, and the system manager can be effectively helped to diagnose and process faults.
(8) The beneficial effects brought by the embodiment of the application can be summarized as follows: the embodiment of the application can realize the support of the abnormal dimension drill-down of various types of indexes, thereby expanding the application range of the abnormal dimension drill-down method as much as possible. After users in different scenes customize different index aggregation methods, the method can be compatible with the aggregation methods, and the dimension anomaly score is calculated based on the aggregation methods. The method is beneficial to popularization of the model in a production environment, and can support the demands of users on different index aggregation methods. The embodiment of the application has the capability of further analyzing the causal relationship among the abnormal dimensions. The embodiment of the application creatively provides a root cause analysis method of abnormal dimensions based on a Bayesian network, and the possible causal relationship among the abnormal dimensions can effectively help system management staff diagnose faults and process faults.
(9) The embodiment of the application locates the abnormal dimensions of multidimensional indexes with an extended multidimensional drill-down method, and further uses a Bayesian network to confirm finer root cause relationships among the abnormal dimensions so as to accelerate fault diagnosis. However, because the dimension combination space explodes, the embodiment of the application drills down into abnormal dimensions step by step with a greedy algorithm and reduces the search space by pruning, so 100% accuracy of the drill-down results cannot be guaranteed. Broader deployment of the method in industrial environments is planned in order to analyze and demonstrate this issue in more detail and to further explore the potential and limitations of the embodiments of the present application.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for locating an abnormal root cause, the method comprising:
acquiring a first service parameter value of a target service currently in a service layer, wherein the service layer comprises a plurality of service dimensions, and each service dimension comprises at least one service unit;
when it is determined, based on the first service parameter value, that the target service is abnormal, acquiring a service index of the target service in the service layer, a dimension index of each service dimension and a unit index of each service unit;
determining the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index, and determining the difference degree of each service unit;
wherein the difference degree of a service unit represents the difference between the second service parameter value of the service unit and the unit index;
and combining the influence degree and the difference degree, determining an abnormal service unit from the service units, and determining the abnormal service unit as an abnormal root cause of the target service.
2. The method of claim 1, wherein the determining the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index comprises:
performing the following processing for each service unit:
subtracting the service index from the first service parameter value to obtain a first difference value;
determining a reference index of the service unit based on the dimension index and the unit index;
subtracting the service index from the reference index to obtain a second difference value; and
determining the ratio of the second difference value to the first difference value as the influence degree of the service unit on the abnormality of the target service.
3. The method of claim 2, wherein the determining the reference index of the service unit based on the dimension index and the unit index comprises:
determining, from the service dimensions, a target service dimension to which the service unit belongs;
acquiring a second service parameter value of the target service currently in the service unit and a third service parameter value of the target service currently in the target service dimension;
subtracting the second service parameter value and the third service parameter value from the first service parameter value to obtain a reference parameter value; and
adding the reference parameter value, the dimension index and the unit index to obtain the reference index of the service unit.
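The arithmetic of claims 2 and 3 can be sketched as follows. This is only an illustrative reading of the claim wording, not the applicant's implementation: the function name and parameter names are hypothetical, and the sketch assumes that the "indices" are expected (baseline) values while the "parameter values" are current observations. Rejecting a zero first difference value is also an assumption, since the claims do not address that case.

```python
def influence_degree(first_value, service_index,
                     second_value, third_value,
                     dimension_index, unit_index):
    """Influence degree of one service unit, following claims 2 and 3 literally.

    first_value     -- current value of the target service in the service layer
    service_index   -- service index of the target service in the service layer
    second_value    -- current value of the target service in this service unit
    third_value     -- current value in the service dimension the unit belongs to
    dimension_index -- dimension index of that service dimension
    unit_index      -- unit index of this service unit
    """
    # Claim 2: first difference value = first service parameter value - service index.
    first_difference = first_value - service_index
    if first_difference == 0:
        # Assumption: with no deviation at the service layer there is nothing to attribute.
        raise ValueError("first difference value is zero; influence degree is undefined")

    # Claim 3: reference parameter value = first value - second value - third value.
    reference_value = first_value - second_value - third_value
    # Claim 3: reference index = reference parameter value + dimension index + unit index.
    reference_index = reference_value + dimension_index + unit_index

    # Claim 2: second difference value = reference index - service index.
    second_difference = reference_index - service_index
    # Claim 2: influence degree = second difference value / first difference value.
    return second_difference / first_difference


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the arithmetic.
    print(influence_degree(100, 200, 10, 30, 60, 50))
```

Under this reading, the influence degree is the share of the service-layer deviation (the first difference value) that is reproduced by the unit's reference index.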
4. The method of claim 1, wherein the determining the difference degree of each service unit comprises:
performing the following processing for each service unit:
acquiring a second service parameter value of the target service currently in the service unit;
determining the ratio of the second service parameter value to the first service parameter value as a posterior probability of the service unit;
determining the ratio of the unit index of the service unit to the service index as a prior probability of the service unit; and
determining the difference degree of the service unit based on the prior probability and the posterior probability.
5. The method of claim 4, wherein the determining the difference degree of the service unit based on the prior probability and the posterior probability comprises:
determining the sum of the prior probability and the posterior probability as a sum probability;
determining the ratio of the prior probability to the sum probability as a first reference probability, and determining the ratio of the posterior probability to the sum probability as a second reference probability;
determining a first reference difference degree based on the first reference probability and the prior probability, and determining a second reference difference degree based on the second reference probability and the posterior probability; and
adding the first reference difference degree and the second reference difference degree to obtain the difference degree of the service unit.
6. The method of claim 5, wherein the determining a first reference difference degree based on the first reference probability and the prior probability comprises:
acquiring a first reference coefficient, and determining the product of the first reference probability and the first reference coefficient as a third reference probability; and
determining the logarithm of the third reference probability, and determining the product of that logarithm and the prior probability as the first reference difference degree.
7. The method of claim 5, wherein the determining a second reference difference degree based on the second reference probability and the posterior probability comprises:
acquiring a second reference coefficient, and determining the product of the second reference probability and the second reference coefficient as a fourth reference probability; and
determining the logarithm of the fourth reference probability, and determining the product of that logarithm and the posterior probability as the second reference difference degree.
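Claims 4 to 7 together describe a divergence-style score between a unit's prior and posterior share, which can be sketched as below. The names are hypothetical, and several points are assumptions because the claims leave them open: both reference coefficients default to 2 (which makes the score vanish when prior and posterior are equal), the logarithm is natural, the probabilities are non-negative, and a zero probability contributes a zero term.

```python
import math


def difference_degree(second_value, first_value, unit_index, service_index,
                      first_coefficient=2.0, second_coefficient=2.0):
    """Difference degree of one service unit, following claims 4 to 7.

    The coefficient defaults, the natural logarithm and the zero handling
    are assumptions; the claims do not fix them.
    """
    # Claim 4: posterior = unit's current value / service-layer current value.
    posterior = second_value / first_value
    # Claim 4: prior = unit index / service index.
    prior = unit_index / service_index

    # Claim 5: sum probability and the two reference probabilities.
    sum_probability = prior + posterior
    if sum_probability == 0:
        return 0.0  # Assumption: a unit with no prior and no posterior share does not differ.

    def reference_difference(probability, coefficient):
        # Claims 6 and 7: probability * log(coefficient * probability / sum probability).
        if probability == 0:
            return 0.0  # Assumption: treat 0 * log(0) as 0.
        return probability * math.log(coefficient * probability / sum_probability)

    # Claim 5: add the first and second reference difference degrees.
    return (reference_difference(prior, first_coefficient)
            + reference_difference(posterior, second_coefficient))


if __name__ == "__main__":
    # Made-up numbers: prior share 0.2, posterior share 0.6.
    print(difference_degree(60, 100, 40, 200))
```

With both coefficients equal to 2, the score is zero when the unit's share is unchanged and grows as the posterior share moves away from the prior share.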
8. The method of claim 1, wherein the combining the influence degree and the difference degree to determine an abnormal service unit from the service units comprises:
performing the following processing for each service dimension:
when the service dimension comprises a plurality of service units, sorting the service units of the service dimension in descending order of difference degree to obtain a service unit queue of the service dimension; and
determining the abnormal service units in the service dimension from the service unit queue based on the influence degree of each service unit in the service unit queue.
9. The method of claim 8, wherein the determining the abnormal service units in the service dimension from the service unit queue based on the influence degree of each service unit in the service unit queue comprises:
starting from the head of the service unit queue, taking the service units in the service unit queue in turn as a current service unit, and performing the following processing for the current service unit:
comparing the influence degree of the current service unit with a first influence degree threshold, and determining the current service unit as a candidate abnormal service unit when the influence degree of the current service unit is greater than the first influence degree threshold;
determining the product of the influence degrees of the current candidate abnormal service units as a reference influence degree; and
when the reference influence degree is greater than a second influence degree threshold, ceasing to compare the next service unit in the service unit queue with the first influence degree threshold, and determining the current candidate abnormal service units as the abnormal service units in the service dimension.
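The per-dimension selection of claims 8 and 9 can be sketched as a sort followed by a greedy scan. The class and function names are hypothetical. The sketch follows the claim wording in combining candidate influence degrees by product; returning the candidates collected so far when the queue ends without the stop condition being met is an assumption, since the claims do not cover that case.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class ServiceUnit:
    name: str
    influence: float   # influence degree of the unit
    difference: float  # difference degree of the unit


def select_abnormal_units(units, first_threshold, second_threshold):
    """Pick the abnormal service units of one service dimension (claims 8 and 9)."""
    # Claim 8: queue sorted in descending order of difference degree.
    queue = sorted(units, key=lambda u: u.difference, reverse=True)

    candidates = []
    for unit in queue:
        # Claim 9: a unit becomes a candidate when its influence degree
        # exceeds the first influence degree threshold.
        if unit.influence > first_threshold:
            candidates.append(unit)
        # Claim 9: reference influence degree = product of the candidates'
        # influence degrees; stop scanning once it exceeds the second threshold.
        if candidates and prod(c.influence for c in candidates) > second_threshold:
            break
    # Assumption: if the queue is exhausted first, return what was collected.
    return candidates


if __name__ == "__main__":
    units = [
        ServiceUnit("unit-a", influence=0.7, difference=0.9),
        ServiceUnit("unit-b", influence=0.2, difference=0.5),
        ServiceUnit("unit-c", influence=0.05, difference=0.3),
    ]
    print([u.name for u in select_abnormal_units(units, 0.1, 0.5)])
```

Scanning in descending order of difference degree means the units whose share changed most are examined first, and the scan can end before the whole queue is visited.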
10. The method of claim 1, wherein after the determining the abnormal service unit as the abnormal root cause of the target service, the method further comprises:
constructing an abnormal root cause graph structure based on the abnormal service units, wherein the abnormal root cause graph structure represents the association relationships among the abnormal service units; and
determining the association relationships represented by the abnormal root cause graph structure as the abnormal root cause relationships of the target service.
11. The method of claim 10, wherein the constructing an abnormal root cause graph structure based on the abnormal service units comprises:
taking each abnormal service unit as a node, and constructing an undirected graph structure among the abnormal service units;
determining the direction of each undirected edge in the undirected graph structure based on the influence degree of each abnormal service unit; and
constructing the abnormal root cause graph structure based on the undirected graph structure and the directions of the undirected edges in the undirected graph structure.
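The edge-orientation step of claim 11 can be sketched as below. The claim states only that directions are determined "based on the influence degree", so the rule used here (each edge points from the unit with the higher influence degree to the unit with the lower one, ties keeping the given order) is purely an assumption for illustration. The undirected edges are taken as given input, because the claim does not say how the undirected graph is built; all names are hypothetical.

```python
def orient_edges(undirected_edges, influence):
    """Turn undirected edges between abnormal service units into directed ones.

    undirected_edges -- iterable of (unit_a, unit_b) pairs
    influence        -- mapping from unit name to its influence degree

    Assumed rule: the unit with the higher influence degree is the source.
    """
    directed = []
    for unit_a, unit_b in undirected_edges:
        if influence[unit_a] >= influence[unit_b]:
            directed.append((unit_a, unit_b))
        else:
            directed.append((unit_b, unit_a))
    return directed


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c")]
    influence = {"a": 0.2, "b": 0.7, "c": 0.1}
    print(orient_edges(edges, influence))
```

The resulting list of directed edges, together with the node set, is one possible in-memory form of the abnormal root cause graph structure.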
12. A device for locating an abnormal root cause, the device comprising:
a parameter acquisition module, configured to acquire a first service parameter value of a target service currently in a service layer, wherein the service layer comprises a plurality of service dimensions, and each service dimension comprises at least one service unit;
an index acquisition module, configured to acquire a service index of the target service in the service layer, a dimension index of each service dimension and a unit index of each service unit when it is determined based on the first service parameter value that the target service is abnormal;
a determining module, configured to determine the influence degree of each service unit on the abnormality of the target service based on the service index, the dimension index and the unit index, and to determine the difference degree of each service unit, wherein the difference degree of a service unit represents the difference between a second service parameter value of the service unit and the unit index of the service unit; and
a root cause module, configured to combine the influence degree and the difference degree to determine an abnormal service unit from the service units, and to determine the abnormal service unit as the abnormal root cause of the target service.
13. An electronic device, the electronic device comprising:
a memory for storing computer-executable instructions or computer programs; and
a processor for implementing the method for locating an abnormal root cause according to any one of claims 1 to 11 when executing the computer-executable instructions or computer programs stored in the memory.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method for locating an abnormal root cause according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or computer-executable instructions which, when executed by a processor, implement the method for locating an abnormal root cause according to any one of claims 1 to 11.
CN202211524301.5A 2022-11-30 2022-11-30 Method, device, equipment, storage medium and program product for locating abnormal root cause Pending CN118118327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211524301.5A CN118118327A (en) 2022-11-30 2022-11-30 Method, device, equipment, storage medium and program product for locating abnormal root cause

Publications (1)

Publication Number Publication Date
CN118118327A true CN118118327A (en) 2024-05-31

Family

ID=91214420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211524301.5A Pending CN118118327A (en) 2022-11-30 2022-11-30 Method, device, equipment, storage medium and program product for locating abnormal root cause

Country Status (1)

Country Link
CN (1) CN118118327A (en)

Legal Events

Date Code Title Description
PB01 Publication