WO2024027384A1 - Fault detection method and apparatus, electronic device, and storage medium - Google Patents

Fault detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2024027384A1
Authority
WO
WIPO (PCT)
Prior art keywords: operating system, event, data, fault detection, application
Application number
PCT/CN2023/103248
Other languages
English (en)
French (fr)
Inventor
汪峰来
陆志浩
郑弦
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2024027384A1


Classifications

    (All classifications fall under G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING.)
    • G06F11/07 — Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/079 — Root cause analysis, i.e. error or fault diagnosis
    • G06F11/30 — Monitoring
    • G06F11/3006 — Monitoring arrangements specially adapted to the computing system or computing system component being monitored, where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F11/3024 — Monitoring arrangements specially adapted to the computing system or computing system component being monitored, where the computing system component is a central processing unit [CPU]
    • G06F11/34 — Recording or statistical evaluation of computer activity, e.g. of down time or of input/output operation; recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 — Performance evaluation by modeling
    • G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 — Classification techniques

Definitions

  • the present application relates to the field of computer technology, and in particular to a fault detection method, device, electronic equipment and storage medium.
  • Cloud servers in large-scale cloud services incorporate a large number of reliability, availability and other mechanisms into their system architecture. Under these mechanisms, a cloud server can tolerate some system failures and continue to run.
  • As a result, from the application performance monitoring (APM) perspective, the cloud service application detected by the cloud server is fault-free. However, from the Internet technology infrastructure monitoring (ITIM) perspective, that is, the operating system (OS) layer, the cloud server may detect that the operating system of the cloud service has hidden faults. This kind of detection difference caused by different detection perspectives, such as the APM layer and the ITIM layer, is called a grayscale fault.
  • When detecting grayscale faults, cloud servers usually perform fault diagnosis by "drilling down layer by layer": the cloud server sequentially determines whether an application (APP) at the APM layer is faulty, whether a network function at the network performance monitoring and diagnostics (NPMD) layer is faulty, and whether the operating system at the ITIM layer is faulty.
  • Because the cloud server can only obtain basic information such as the central processing unit (CPU) occupancy rate of the operating system and the current disk space size, the general fault detection method must also rely on manual experience to further determine, based on this basic information, whether the operating system is faulty, which is time-consuming, labor-intensive and inefficient.
  • This application provides a fault detection method, device, electronic equipment and storage medium, which relates to the field of computer technology and is used to quickly and accurately detect faults in an operating system.
  • this application provides a fault detection method applied to an electronic device, including:
  • after obtaining the current operating system operating data of the operating system to be detected, the electronic device can determine the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules.
  • the current operating system event data includes: data generated during the execution of process events or thread events of the current operating system in the operating system to be detected, and context information associated with process events or thread events of the current operating system. Subsequently, the electronic device can determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data.
  • the above-mentioned current operating system running data is data generated during the running of process events or thread events of the current operating system in the operating system to be detected.
  • this kind of data generally only represents an indicator of a process event or thread event (such as CPU utilization, etc.). Therefore, the current operating system operating data is indicator type data.
  • the above-mentioned current operating system event data not only includes data generated during the execution of process events or thread events of the current operating system, but also includes context information associated with the process events or thread events of the current operating system.
  • this kind of data can not only represent an indicator of a process event or thread event, but also represent the event content of this process event or thread event (for example, the thread event is a main thread event) and/or the event status (for example, the thread event is an abnormal status event), etc. Therefore, the current operating system event data is event type data.
  • Therefore, when the electronic device detects a fault at the operating system level, it can not only obtain the indicator-type current operating system operating data, but also determine the event-type current operating system event data based on the current operating system operating data. Subsequently, compared with the method in the general technology that only uses basic information (that is, indicator-type current operating system operating data) combined with manual experience to detect faults, the fault detection method provided by the embodiments of the present application can quickly determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data, without the intervention of manual experience, which improves the efficiency of fault detection.
  • In addition, because the current operating system event data includes context information associated with the current operating system's process events or thread events, this method can mine the context information in the current operating system event data, obtain the relevant content of the events contextually related to the process event or thread event based on that context information, and then combine these contents with the current operating system operating data to accurately determine the system fault detection result, improving the accuracy of fault detection.
  • the fault detection method can also detect more fine-grained fault detection results (such as fault detection results of specific thread events or process events), further improving the accuracy of fault detection.
  • Furthermore, the fault detection method provided by the embodiments of the present application does not require adding various fault detection tools at the operating system level of the electronic device. It only needs the data collection function and data processing function of the operating system itself to quickly detect faults, realizing lightweight fault detection of the operating system.
  • the electronic device includes: a system layer processing module, an application layer processing module and a network layer processing module; the above method of determining the system fault detection result of the operating system to be detected specifically includes: the system layer processing module determines the system fault detection result of the operating system to be detected.
  • the fault detection method provided by the embodiment of the present application also includes: the system layer processing module can send the system fault detection result to the application layer processing module and the network layer processing module.
  • the application layer processing module determines the application fault detection result of the electronic device based on the system fault detection result.
  • the network layer processing module determines the network fault detection result of the electronic device based on the system fault detection result.
  • In this way, the system layer processing module in the electronic device can send the system fault detection result to the application layer processing module and the network layer processing module, so that the application layer processing module determines the application fault detection result of the electronic device based on the system fault detection result and the network layer processing module determines the network fault detection result of the electronic device based on the system fault detection result. This realizes loose coupling between the system layer processing module and the application layer and network layer respectively, quickly and accurately determines the fault detection results of grayscale faults in the operating system to be detected, and improves fault detection efficiency.
  • the above-mentioned method of determining the system fault detection result of the operating system to be detected based on the current operating system running data and the current operating system event data specifically includes: adding a status identifier to the current operating system event data based on preset state rules, and inputting the first data into the pre-trained fault detection model to obtain the system fault detection result.
  • the status identification includes: normal identification and abnormal identification;
  • the first data includes the current operating system running data and the current operating system event data after adding the status identification;
  • the fault detection model is obtained by training based on the historical operating system data of the operating system to be detected.
  • Historical operating system data includes: historical operating system operating data and historical operating system event data.
  • Since the fault detection model is trained based on the historical operating system data of the operating system to be detected, and the historical operating system data includes historical operating system operating data and historical operating system event data, when the electronic device determines the system fault detection result of the operating system to be detected, it can use the fault detection model to quickly and accurately determine the result, improving the fault detection efficiency (a minimal sketch of this inference step is given below).
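  • The following is a minimal sketch of this idea, not the patented implementation: event data is labeled with a status identifier under assumed preset state rules, joined with running data into the "first data", and handed to some pre-trained classifier. All names (STATE_RULES, fault_model, the feature layout) are illustrative assumptions.

```python
# Minimal sketch: label event data with a status identifier under assumed
# preset state rules, join it with indicator-type running data, and query a
# pre-trained fault detection model. All names are illustrative.

NORMAL, ABNORMAL = 0, 1  # status identifiers: normal / abnormal

# hypothetical preset state rules: event type -> predicate on the event payload
STATE_RULES = {
    "cpu_event": lambda e: e["cpu_util"] > 0.9,
    "mem_event": lambda e: e["rss_mb"] > 4096,
}

def add_status_identifier(event):
    """Attach a normal/abnormal identifier to one piece of event data."""
    rule = STATE_RULES.get(event["type"], lambda e: False)
    event["status"] = ABNORMAL if rule(event) else NORMAL
    return event

def build_first_data(running_data, event_data):
    """First data = current running data + event data after labeling."""
    labeled_events = [add_status_identifier(dict(e)) for e in event_data]
    return {"running": running_data, "events": labeled_events}

def detect(fault_model, running_data, event_data):
    """`fault_model` stands for a pre-trained model exposing predict()."""
    first_data = build_first_data(running_data, event_data)
    features = [running_data["cpu_util"], running_data["disk_free_gb"],
                sum(e["status"] for e in first_data["events"])]
    return fault_model.predict([features])[0]   # e.g. "cpu_fault" or "no_fault"
```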
  • the fault detection method also includes: obtaining historical operating system data, and training to obtain a fault detection model based on a preset fault identification algorithm and historical operating system data.
  • the preset fault identification algorithm may include: a classification algorithm and a parameter optimization algorithm; the above method of training a fault detection model based on the preset fault identification algorithm and historical operating system data specifically includes: adding a state identifier to the historical operating system event data based on the preset state rules, and training the model to be trained based on the classification algorithm and the second data to obtain the model to be adjusted.
  • the second data includes historical operating system operation data and historical operating system event data after adding status identifiers;
  • the model to be trained includes: an application feature classification model to be trained and a fault classification model to be trained.
  • the parameters of the model to be adjusted can be adjusted based on historical operating system operating data and parameter optimization algorithms to obtain a fault detection model;
  • the fault detection model includes: application feature classification model and fault classification model.
  • the preset fault identification algorithm can include a classification algorithm and a parameter optimization algorithm
  • the classification algorithm can be used to perform classification training on the initial model of the fault detection model
  • the parameter optimization algorithm can adjust the model's parameters during the classification training process to obtain the fault detection model, so that the system fault detection results can be quickly and accurately determined through the fault detection model, thereby improving the fault detection efficiency (a minimal training sketch is given below).
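  • The following is a minimal sketch of the two-stage training idea under assumed tools, not the patented implementation: a generic classification algorithm (here a random forest, as an assumption) is trained on placeholder "second data" features, and a generic parameter optimization step (here grid search) tunes it. The feature and label arrays are synthetic stand-ins.

```python
# Minimal sketch: classification training plus parameter optimization for the
# application feature classification model and the fault classification model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 6))              # "second data" flattened into features (placeholder)
y_app = rng.integers(0, 3, 200)       # application feature labels (assumed)
y_fault = rng.integers(0, 4, 200)     # fault class labels (assumed)

param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}

# "model to be trained" -> classification training -> "model to be adjusted"
# -> parameter optimization -> final application feature / fault classifiers
app_feature_model = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
app_feature_model.fit(X, y_app)

fault_model = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
fault_model.fit(X, y_fault)

print(app_feature_model.best_params_, fault_model.best_params_)
```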
  • the above-mentioned method of inputting the first data into a pre-trained fault detection model to obtain a system fault detection result specifically includes: inputting the first data into the application feature classification model to obtain the application feature fault detection result, and inputting the first data corresponding to the application feature fault detection result into the fault classification model to obtain the system fault detection result.
  • In this way, the electronic device can determine different application features, as well as the fault classification corresponding to each application feature, based on the application feature classification model and the fault classification model in turn, improving the accuracy of fault detection.
  • the fault detection method also includes: constructing an application dependency relationship based on the current operating system operating data and preset dependency rules, where the application dependency relationship is used to represent the interdependence between each process or thread and each application instance in the operating system to be detected; constructing an event propagation relationship based on the application dependency relationship, where the event propagation relationship is used to represent the propagation relationship between application events in the operating system to be detected, and an application event is an application event corresponding to a process or thread and an application instance in the operating system to be detected; and determining, according to the event propagation relationship, the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread.
  • In this way, the electronic device can also construct application dependency relationships and event propagation relationships to further achieve root cause localization of the fault.
  • the above method of constructing an application dependency relationship based on the current operating system operating data and preset dependency rules specifically includes: constructing an application dependency graph based on the current operating system operating data and preset dependency rules.
  • the application dependency graph includes multiple application nodes and edges between application nodes; a master application node among the multiple application nodes is used to represent the main thread of an application instance in the operating system to be detected; a slave application node among the multiple application nodes is used to represent a slave thread of an application instance in the operating system to be detected or a dependent application instance in the operating system to be detected; an edge between a first application node and a second application node among the multiple application nodes is used to represent that there is a dependency relationship between the first application node and the second application node.
  • constructing an event propagation relationship based on the application dependency relationship includes: constructing an event propagation graph based on the application dependency graph; the event propagation graph includes multiple event nodes corresponding one-to-one to the multiple application nodes and edges between event nodes; the multiple event nodes are used to represent the application instances corresponding to application events in the operating system to be detected; an edge between a first event node and a second event node among the multiple event nodes is used to represent that there is a propagation relationship between the first event corresponding to the first event node and the second event corresponding to the second event node.
  • the above-mentioned method of determining, according to the event propagation relationship, the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread, specifically includes:
  • determining the propagation event node that has an edge with the fault event node, where the fault event node is the event node corresponding to the fault event and the fault event is the fault event corresponding to the system fault detection result; determining the event corresponding to the propagation start event node among the propagation event nodes as the root cause event, determining the application instance corresponding to the propagation start event node as the root cause application instance, and/or determining the process or thread corresponding to the propagation start event node as the root cause process or thread.
  • the electronic device can quickly and accurately determine the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or root cause thread based on the application dependency graph and event propagation graph.
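  • The following is a minimal sketch of this graph-based idea under simplifying assumptions, not the patented implementation: the application dependency graph and the event propagation graph are plain adjacency maps, and the root cause is found by walking propagation edges from the faulty event node until a propagation-start node (no further upstream edges) is reached. All node names are illustrative.

```python
# Minimal sketch: dependency graph, mirrored event propagation graph, and a
# simple walk to the propagation-start (root cause) node.

# application dependency graph: node -> nodes it depends on
app_deps = {
    "app_A_main": ["app_A_worker"],      # main thread depends on a slave thread
    "app_A_worker": ["app_B_main"],      # which depends on another application instance
    "app_B_main": [],
}

# event propagation graph mirrors the dependency graph one-to-one:
# an edge means "the event at the source propagated from the target"
event_propagation = {f"evt::{node}": [f"evt::{dep}" for dep in deps]
                     for node, deps in app_deps.items()}

def root_cause(fault_event_node):
    """Follow propagation edges until a node with no upstream edge remains."""
    node = fault_event_node
    while event_propagation.get(node):
        node = event_propagation[node][0]   # single-path sketch
    return node                             # propagation start = root cause

print(root_cause("evt::app_A_main"))        # -> evt::app_B_main
```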
  • the system faults causing the system fault detection results include the following single types of faults or combinations of faults: network input/output (IO) faults, disk IO faults, scheduling faults, memory faults, process or thread faults, file system faults, disk faults, central processing unit (CPU) faults and container faults.
  • electronic equipment can determine the system fault detection results of various single types or combined types of faults, so that various types of system grayscale faults can be detected.
  • this application provides a fault detection device, including: an acquisition unit and a processing unit; the acquisition unit is used to acquire the current operating system operating data of the operating system to be detected; the processing unit is used to determine the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules; the current operating system event data includes: data generated during the running of process events or thread events of the current operating system in the operating system to be detected, and context information associated with the process events or thread events of the current operating system; the processing unit is also used to determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data.
  • the electronic device includes: a system layer processing module, an application layer processing module, and a network layer processing module.
  • the system layer processing module is used to determine the system fault detection result of the operating system to be detected; the system layer processing module is also used to send the system fault detection result to the application layer processing module and the network layer processing module; the application layer processing module is used to determine the application fault detection result of the electronic device based on the system fault detection result; the network layer processing module is used to determine the network fault detection result of the electronic device based on the system fault detection result.
  • the processing unit is specifically configured to: add a status identifier to the current operating system event data based on preset status rules, where the status identifier includes a normal identifier and an abnormal identifier; and input the first data into the pre-trained fault detection model to obtain the system fault detection result; the first data includes the current operating system operating data and the current operating system event data after adding the status identifier; the fault detection model is obtained by training based on the historical operating system data of the operating system to be detected, and the historical operating system data includes: historical operating system operating data and historical operating system event data.
  • the acquisition unit is also used to obtain historical operating system data; the processing unit is also used to train a fault detection model based on a preset fault identification algorithm and historical operating system data.
  • the preset fault identification algorithm includes: a classification algorithm and a parameter optimization algorithm; a processing unit specifically used to: add status identifiers to historical operating system event data based on preset status rules; based on the classification algorithm and The second data is to train the model to be trained to obtain the model to be adjusted; the second data includes historical operating system operating data and historical operating system event data after adding status identifiers; the model to be trained includes: the application feature classification model to be trained and The fault classification model to be trained; based on the historical operating system operating data and parameter optimization algorithm, the parameters of the model to be adjusted are adjusted to obtain the fault detection model; the fault detection model includes: application feature classification model and fault classification model.
  • the processing unit is specifically configured to: input the first data into the application feature classification model to obtain the application feature fault detection result; and input the first data corresponding to the application feature fault detection result into the fault classification model to obtain the system fault detection result.
  • the processing unit is also used to construct an application dependency relationship based on the current operating system operating data and preset dependency rules, where the application dependency relationship is used to represent the interdependence between each process or thread and each application instance in the operating system to be detected; the processing unit is also used to construct an event propagation relationship based on the application dependency relationship, where the event propagation relationship is used to represent the propagation relationship between application events in the operating system to be detected, and an application event is an application event corresponding to a process or thread and an application instance in the operating system to be detected; the processing unit is also used to determine, according to the event propagation relationship, the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread.
  • the processing unit is specifically used to: construct an application dependency graph based on current operating system operating data and preset dependency rules; the application dependency graph includes multiple application nodes and edges between application nodes;
  • a master application node among the multiple application nodes is used to represent the main thread of an application instance in the operating system to be detected; a slave application node among the multiple application nodes is used to represent a slave thread of an application instance in the operating system to be detected or a dependent application instance in the operating system to be detected;
  • the edge between the first application node and the second application node among multiple application nodes is used to indicate that there is a dependency relationship between the first application node and the second application node.
  • the processing unit is specifically configured to: construct an event propagation graph according to the application dependency graph; the event propagation graph includes multiple event nodes corresponding one-to-one to the multiple application nodes and edges between event nodes; the multiple event nodes are used to represent the application instances corresponding to application events in the operating system to be detected; an edge between a first event node and a second event node among the multiple event nodes is used to represent that there is a propagation relationship between the first event corresponding to the first event node and the second event corresponding to the second event node.
  • the processing unit is specifically configured to: determine the propagation event node that has an edge with the fault event node, where the fault event node is the event node corresponding to the fault event and the fault event is the fault event corresponding to the system fault detection result; determine the event corresponding to the propagation start event node among the propagation event nodes as the root cause event, determine the application instance corresponding to the propagation start event node as the root cause application instance, and/or determine the process or thread corresponding to the propagation start event node as the root cause process or thread.
  • the system faults causing the system fault detection results include the following single types of faults or combinations of faults: network input/output (IO) faults, disk IO faults, scheduling faults, memory faults, process or thread faults, file system faults, disk faults, central processing unit (CPU) faults and container faults.
  • the present application provides an electronic device, which may include: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the fault detection method in any possible implementation of the above first aspect.
  • the present application provides a computer-readable storage medium on which instructions are stored. When the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the fault detection method in any possible implementation of the above first aspect.
  • the present application provides a computer program product.
  • the computer program product includes computer instructions.
  • When the computer instructions run on the electronic device, the processor of the electronic device performs the fault detection method in any possible implementation of the first aspect.
  • Figure 1 is a schematic diagram of a grayscale fault scenario provided by an embodiment of the present application.
  • FIG. 2 is a schematic flow chart of a fault detection method in the general technology.
  • Figure 3 is a schematic structural diagram of a fault detection system provided by an embodiment of the present application.
  • Figure 4 is a block diagram of an internal module structure of an electronic device provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart 1 of a fault detection method provided by an embodiment of the present application.
  • Figure 6 is a schematic flow chart 2 of a fault detection method provided by an embodiment of the present application.
  • Figure 7 is a schematic flow chart 3 of a fault detection method provided by an embodiment of the present application.
  • Figure 8 is a schematic flow chart 4 of a fault detection method provided by an embodiment of the present application.
  • Figure 9 is a schematic flow chart 5 of a fault detection method provided by an embodiment of the present application.
  • Figure 10 is a schematic flow chart 6 of a fault detection method provided by an embodiment of the present application.
  • Figure 11 is a schematic flow chart 7 of a fault detection method provided by an embodiment of the present application.
  • Figure 12 is a schematic flowchart 8 of a fault detection method provided by an embodiment of the present application.
  • Figure 13 is a schematic flow chart 9 of a fault detection method provided by an embodiment of the present application.
  • Figure 14 is a schematic flowchart 10 of a fault detection method provided by an embodiment of the present application.
  • Figure 15 is a schematic flow chart 11 of a fault detection method provided by an embodiment of the present application.
  • Figure 16 is a schematic flowchart 12 of a fault detection method provided by an embodiment of the present application.
  • Figure 17 is a schematic flowchart 13 of a fault detection method provided by an embodiment of the present application.
  • Figure 18 is a schematic flowchart 14 of a fault detection method provided by an embodiment of the present application.
  • Figure 19 is a schematic flow chart 15 of a fault detection method provided by an embodiment of the present application.
  • Figure 20 is a schematic structural diagram of a fault detection device provided by an embodiment of the present application.
  • Figure 21 shows a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Figure 22 shows a schematic structural diagram of a server provided by an embodiment of the present application.
  • A/B can be understood as A or B.
  • first and second are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, features defined as “first” and “second” may explicitly or implicitly include one or more of these features. In the description of this embodiment, unless otherwise specified, “plurality” means two or more.
  • references to the terms “including” and “having” and any variations thereof in the description of this application are intended to cover non-exclusive inclusion.
  • a process, method, system, product or device that includes a series of steps or modules is not limited to the listed steps or modules, but optionally also includes other unlisted steps or modules, or optionally also Includes other steps or modules that are inherent to such processes, methods, products, or devices.
  • Metrics (indicator data): atomic data in the operating system that can be aggregated and calculated.
  • the indicator data can be CPU usage, system memory usage, interface response time, interface response query rate per second (queries per second, QPS), etc.
  • the indicator data can also be a measurement value used to represent the current depth of the task queue, which can be updated when an element enters or exits the queue.
  • the indicator data can also be a counter used to represent the number of HyperText Transfer Protocol (Hyper Text Transfer Protocol, HTTP) requests, which is accumulated when a new request arrives.
  • HTTP Hyper Text Transfer Protocol
  • indicator data are data values stored based on time series, which can be used for some aggregation calculations such as sums, averages, percentiles, etc. over a period of time, and used for subsequent data analysis and organization of the operating system.
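  • The following is a minimal sketch of indicator (metrics) data as described above, not taken from the patent: a counter, a gauge, and a time series aggregated over a window. All values are illustrative.

```python
# Minimal sketch: counters, gauges, and time-series aggregation of metrics.
from statistics import mean, quantiles

# (timestamp_seconds, value) samples of CPU utilization collected every 5 s
cpu_series = [(t, 0.40 + 0.05 * (t % 3)) for t in range(0, 60, 5)]

http_requests = 0          # counter: incremented on each new HTTP request
http_requests += 1

queue_depth = 0            # gauge: updated when an element enters/exits the queue
queue_depth += 1
queue_depth -= 1

window = [v for t, v in cpu_series if 0 <= t < 60]
print("sum:", sum(window))
print("avg:", mean(window))
print("p95:", quantiles(window, n=20)[-1])   # rough 95th percentile over the window
```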
  • Logging (log data): a record of events that occur while the operating system is running. It can be used to record discrete events of the operating system and provide detailed information for feedback and troubleshooting.
  • Link data (Tracing): used to record call information within the scope of message requests. Link data is a powerful tool for troubleshooting system performance problems. Link data is not only helpful in sorting out the relationship between interfaces and calls between services, but also helps in troubleshooting the causes of slow requests or exceptions.
  • link data can record the execution process and time consumption of a remote method call.
  • link data can clearly display all call information from the caller to the callee in a certain call chain.
  • Stateless data: that is, the operating system operating data in the embodiments of this application, used to represent the data corresponding to an operation in the operating system.
  • a collection probe for collecting data is deployed in the electronic device.
  • the operating system running data collected by the collection probe must all come from the information carried by the stateless data and the common information that can be used by all data.
  • Stateful data: that is, the operating system event data in the embodiments of this application, used to represent data with a data storage function in the operating system.
  • This kind of data can store the event content of the process event or thread event corresponding to the data (for example, the thread event is a main thread event), and/or the event status (for example, the thread event is an abnormal status event), etc.
  • the operating system event data collected by the collection probe generally includes information related to the operating system event data, that is, context information.
  • When the data collection probe collects stateful data and stateless data, it can make the stateful/stateless judgment by determining whether two pieces of data from the same tag have a contextual relationship within the operating system; that is, data with a contextual relationship is stateful data, and data without a contextual relationship is stateless data (a minimal sketch of this judgment is given below).
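  • The following is a minimal sketch of that judgment under assumed record fields (tag, id, parent), not the patented probe logic: two records sharing a tag are treated as contextually related if one explicitly links to the other.

```python
# Minimal sketch: classify records as stateful (event data) or stateless
# (running data) by checking for a contextual link between same-tag records.
records = [
    {"tag": "thread-A", "id": 1, "parent": None, "cpu_util": 0.31},
    {"tag": "thread-A", "id": 2, "parent": 1,    "cpu_util": 0.87},
    {"tag": "thread-B", "id": 3, "parent": None, "cpu_util": 0.12},
]

def is_stateful(record, all_records):
    """A record is stateful if another record with the same tag is linked to it."""
    return any(r["tag"] == record["tag"] and
               (r["parent"] == record["id"] or record["parent"] == r["id"])
               for r in all_records if r is not record)

for r in records:
    kind = "stateful (event data)" if is_stateful(r, records) else "stateless (running data)"
    print(r["id"], kind)
```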
  • As described above, cloud servers in large-scale cloud services incorporate a large number of reliability, availability and other mechanisms into their system architecture. Under these mechanisms, cloud servers can tolerate some system failures and continue to run.
  • As a result, from the APM perspective, the cloud service application detected by the cloud server is fault-free. However, from the perspective of the ITIM layer of the cloud server, that is, the OS layer, the cloud server may detect that the operating system of the cloud service has hidden faults. This kind of detection difference caused by different detection perspectives, such as the APM layer and the ITIM layer, is called a grayscale fault.
  • Grayscale faults usually include network input/output (IO) faults (such as random packet loss), disk IO faults (such as disk fragmentation), memory jitter/leak faults, CPU scheduling/interference faults, capacity pressure faults, and other non-fatal abnormal faults that degrade operating system performance.
  • Figure 1 shows a schematic diagram of a grayscale fault scenario provided by an embodiment of the present application.
  • the fault detection module (observation) 1 of the APM layer can obtain the detection reports (probe reports) of APPs such as application (application, APP) 1, APP 2, APP 3, etc. of the APM layer.
  • observation 2 of the NPMD layer can obtain the detection report of the network devices (network devices) of the NPMD layer.
  • observation 3 of the ITIM layer can obtain the detection reports of OS 1, OS 2, OS 3 and other OSs in the ITIM layer.
  • the fault detection result determined by observation 1 of the APM layer based on the obtained detection reports may be different from the fault detection result determined by observation 3 of the ITIM layer based on the obtained detection reports; that is, the detection difference caused by different detection perspectives such as the APM layer and the ITIM layer is a grayscale fault.
  • the general fault detection method takes the APM layer as the perspective, uses the perceived service level indicator (SLI) of the APP at the APM layer as the entry point, and conducts fault diagnosis in a "drill-down" manner; that is, the cloud server sequentially determines whether the APP at the APM layer is faulty, whether the network function at the NPMD layer is faulty, and whether the operating system at the ITIM layer is faulty.
  • SLI Service Level Indicator
  • general-purpose technology typically uses dozens of management and monitoring tools, often a combination of APM tools, NPMD tools, and countless silo-specific ITIM tools.
  • Because APM tools, NPMD tools and ITIM tools are not related to each other, that is, they essentially use different computer languages, these tools cannot obtain event information about events occurring at other levels and can only perform detection and analysis at their respective levels, namely the APM layer, the NPMD layer and the ITIM layer.
  • In addition, because the cloud server can only obtain basic information such as the operating system's CPU utilization and current disk space size, the general fault detection method needs to be combined with manual experience to further determine whether the operating system is faulty based on the above basic information, which is time-consuming, labor-intensive, and inefficient.
  • FIG. 2 shows a schematic flow chart of a grayscale fault detection method in general technology.
  • common grayscale fault detection methods include:
  • the detection module of the APM layer detects whether the application quality of the APP is abnormal.
  • abnormal application quality of APP specifically includes performance degradation of APP, poor user experience (such as usage lag), etc.
  • the detection module of the NPMD layer detects whether the network equipment has network problems.
  • When obtaining the general detection data of the ITIM layer, the data can be obtained through the data collection module of the ITIM layer or manually.
  • the general fault detection method also needs to be combined with manual experience to further determine whether the operating system is faulty based on the above basic information, which is time-consuming, labor-intensive, and inefficient.
  • this application provides a fault detection method applied to an electronic device, including: after obtaining the current operating system operating data of the operating system to be detected, the electronic device can determine the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules. Subsequently, the electronic device can determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data.
  • the above-mentioned current operating system running data is data generated during the running of process events or thread events of the current operating system in the operating system to be detected.
  • this kind of data generally only represents an indicator of a process event or thread event (such as CPU utilization, etc.). Therefore, the current operating system operating data is indicator type data.
  • the above current operating system event data not only includes data generated during the running of process events or thread events of the current operating system, but also includes context information associated with the process events or thread events of the current operating system.
  • this kind of data can not only represent an indicator of a process event or thread event, but also represent the event content of this process event or thread event (for example, the thread event is a main thread event) and/or the event status (for example, the thread event is an abnormal status event), etc. Therefore, the current operating system event data is event type data.
  • Therefore, when the electronic device detects a fault at the operating system level, it can not only obtain the indicator-type current operating system operating data, but also determine the event-type current operating system event data based on the current operating system operating data. Subsequently, compared with the method in the general technology that only uses basic information (that is, indicator-type current operating system operating data) combined with manual experience to detect faults, the fault detection method provided by the embodiments of the present application can quickly determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data, without the intervention of manual experience, which improves the efficiency of fault detection.
  • In addition, because the current operating system event data includes context information associated with the current operating system's process events or thread events, this method can mine the context information in the current operating system event data, obtain the relevant content of the events contextually related to the process event or thread event based on that context information, and then combine these contents with the current operating system operating data to accurately determine the system fault detection result, improving the accuracy of fault detection.
  • the fault detection method can also detect more fine-grained fault detection results (such as fault detection results of specific thread events or process events), further improving the accuracy of fault detection.
  • Furthermore, the fault detection method provided by the embodiments of the present application does not require adding various fault detection tools at the operating system level of the electronic device. It only needs the data collection function and data processing function of the operating system itself to quickly detect faults, realizing lightweight fault detection of the operating system.
  • FIG. 3 shows a schematic structural diagram of a fault detection system provided by an embodiment of the present application.
  • the fault detection system provided by the embodiment of the present application includes a cloud terminal 301, a network device 302 and a cloud server 303.
  • the network device 302 is communicatively connected to the cloud terminal 301 and the cloud server 303 respectively.
  • cloud terminal 301 is also called a virtual terminal. It is based on the premise of reliable, high-speed network communication, mass storage, multi-process or thread, and high-efficiency CPU technology, and uses remote virtualization technology to achieve management solutions for customer terminal equipment. The solution separates the remote terminal equipment hardware, operating software and hard disk data into different transmission levels to form a dynamic architecture of the host and client, thereby achieving a unified management platform for centralized management and dynamic grouping and decentralized management.
  • the cloud terminal 301 can request service resources provided by the cloud server 303 and provide corresponding cloud services to the user.
  • Cloud server is a simple, efficient, safe and reliable computing service with elastically scalable processing capabilities.
  • the cloud server 303 can provide service resources to the cloud terminal 301, so that the cloud terminal 301 provides corresponding cloud services to users.
  • Cloud terminal 301 in Figure 3 may be a device that provides voice and/or data connectivity to users, a handheld device with wireless connectivity capabilities, or other processing device connected to a wireless modem.
  • the cloud terminal 301 can communicate with one or more core networks via a radio access network (RAN).
  • A wireless terminal can be a mobile terminal, such as a mobile phone (or "cellular" phone) or a computer with a mobile terminal; it can also be a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile device that exchanges voice and/or data with the radio access network, such as a mobile phone, tablet, laptop, desktop computer, netbook, or personal digital assistant (PDA).
  • the network device 302 in Figure 3 may be a wireless communication base station or a base station controller, etc.
  • the base station may be a base transceiver station (BTS) in a global system for mobile communication (GSM) or code division multiple access (CDMA) network, an evolved base station (eNB) in a narrowband Internet of things (NB-IoT) network, or a base station in a public land mobile network (PLMN), etc., which is not limited in the embodiments of this application.
  • the cloud server 303 in Figure 3 can be a server in a server cluster (composed of multiple servers), a chip in a server, a system-on-chip in a server, or implemented by a virtual machine (VM) deployed on a physical machine, which is not limited in the embodiments of this application.
  • the fault detection method provided by the embodiment of the present application is mainly used to detect grayscale faults at the operating system level of cloud server 303.
  • the fault detection method provided by the embodiment of the present application can also be used to detect grayscale faults at the operating system level of cloud terminal 301, and can also be used to detect other operating system level, network level and application levels. Grayscale faults of electronic equipment at different levels are not limited in this application.
  • the embodiment of the present application takes the detection of grayscale faults at the operating system level of cloud server 303 as an example for description.
  • the cloud server 303 may include an APM layer, an NPMD layer and an ITIM layer.
  • Observation 1 of the APM layer can obtain the detection reports of APP 1, APP 2, APP 3 and other APPs of the APM layer, and based on the detection, determine the application fault detection results of the APM layer.
  • observation 2 of the NPMD layer can obtain the detection report of the network device of the NPMD layer, and determine the network fault detection result of the NPMD layer based on the detection.
  • In addition to providing general infrastructure detection capabilities, the ITIM layer also has operating system grayscale fault identification and diagnosis capabilities.
  • observation 3 of the ITIM layer can obtain detection reports of OS 1, OS 2, OS 3 and other OSs of the ITIM layer according to the fault detection method provided by the embodiments of this application, and determine the system fault detection result of the ITIM layer based on the detection.
  • observation 3 of the ITIM layer can send the system fault detection results of the ITIM layer to observation 1 of the APM layer and observation 2 of the NPMD layer, so that observation 1 of the APM layer and observation 2 of the NPMD layer combine the system fault detection results of the ITIM layer to determine the grayscale fault.
  • observation 1 of the APM layer can also send the application fault detection results of the APM layer to observation 2 of the NPMD layer and observation 3 of the ITIM layer, so that observation 2 of the NPMD layer and observation 3 of the ITIM layer combine the application fault detection results of the APM layer to determine the grayscale fault.
  • observation 2 of the NPMD layer can also send the network fault detection results of the NPMD layer to observation 1 of the APM layer and observation 3 of the ITIM layer, so that observation 1 of the APM layer and observation 3 of the ITIM layer combine the network fault detection results of the NPMD layer to determine the grayscale fault.
  • In this way, the fault detection method provided by the embodiments of the present application can integrate the isolated detection tools of the APM layer, NPMD layer and ITIM layer with each other, combining the top-down, application-perspective intelligent operation and maintenance fault detection of the general technology with the bottom-up scheme of actively identifying and locating grayscale faults from the perspective of the operating system provided by the embodiments of this application, thereby reducing manual intervention and improving operation and maintenance efficiency.
  • FIG. 4 is a block diagram of an internal module structure of an electronic device provided by an embodiment of the present application.
  • This electronic device can collect indicator data, log data, link data and other data in the operating system, determine events and other data based on the indicator data, log data, link data and other information, and then perform data analysis to identify grayscale faults. Subsequently, root cause inference is performed by constructing an application dependency graph and an event propagation graph, so as to detect operating system health, grayscale faults, and root cause location.
  • the electronic device may be the cloud server 303 in FIG. 3 , the cloud terminal 301 , or other electronic devices with an operating system level, a network level, and an application level, which are not limited in the embodiments of this application.
  • the electronic device may include: a data collection module, a grayscale fault identification module, a root cause analysis module, a detection module and a data storage module.
  • the data collection module is used to collect indicator data, log data, link data and other data in the operating system (that is, the operating system operating data in the embodiments of this application), and to generate event data (that is, the operating system event data in the embodiments of this application) based on the indicator data, log data, link data and other data.
  • Event data is used to represent real-time abnormal events generated based on indicator data, log data, link data and other data.
  • the grayscale fault identification module is used to perform correlation analysis on the event data generated based on the indicator data, log data, link data and other data, as well as on multi-dimensional long-period time series data (that is, indicator data, log data, link data and other data over multiple time periods), to complete the identification of grayscale faults.
  • the root cause analysis module is used to generate the application dependency graph through static service deployment methods, dynamically generate the event propagation graph based on domain knowledge and events, and perform root cause inference on the identified grayscale faults based on the constructed application dependency graph and event propagation graph.
  • the detection module is used to provide display functions or data interfaces for operating system health, abnormal events, and root causes of faults.
  • the data storage module is used to store the data collected by the data acquisition module, the grayscale faults identified by the grayscale fault identification module, and the root cause of the fault determined by the root cause analysis module, and provide the detection module with the data required by the detection module.
  • the fault detection method provided by the embodiment of the present application can be applied to electronic equipment.
  • FIG. 5 shows a fault detection method provided by an embodiment of the present application. As shown in Figure 5, the fault detection method specifically includes:
  • the electronic device obtains the current operating system operating data of the operating system to be detected.
  • the electronic device can trigger the fault detection process based on two trigger conditions.
  • the first trigger condition is that the electronic device actively triggers the fault detection process.
  • the electronic device can directly obtain the current operating system operating data of the operating system to be detected, thereby starting fault detection.
  • the electronic device can actively trigger the fault detection process with a preset scheduled task.
  • the electronic device can directly obtain the current operating system operating data of the operating system to be detected, thereby starting fault detection.
  • the above scheduled tasks may be periodic tasks.
  • the electronic device can periodically obtain the current operating system operating data of the operating system to be detected, thereby starting fault detection, thereby improving the fault detection efficiency of the operating system to be detected of the electronic device.
  • the second trigger condition is that the electronic device passively triggers the fault detection process.
  • the electronic device can respond to the received fault detection instruction and obtain the current operating system operating data of the operating system to be detected, thereby starting fault detection.
  • the above fault detection instruction may be sent by the APM layer or the NPMD layer, or may be generated in response to the user's fault detection operation, which is not limited in the embodiment of the present application.
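  • The following is a minimal sketch of the two trigger conditions just described, not the patented implementation: an active, periodic scheduled task and a passive trigger driven by a received fault detection instruction. The queue and run_fault_detection() are illustrative placeholders.

```python
# Minimal sketch: active (periodic) and passive (instruction-driven) triggers.
import queue
import threading
import time

instruction_queue: "queue.Queue[str]" = queue.Queue()

def run_fault_detection(reason: str) -> None:
    print(f"collecting current operating system running data ({reason})")

def active_trigger(period_s: float = 60.0) -> None:
    while True:                      # preset scheduled (periodic) task
        run_fault_detection("scheduled")
        time.sleep(period_s)

def passive_trigger() -> None:
    while True:                      # wait for an instruction from the APM/NPMD layer or a user
        instruction = instruction_queue.get()
        run_fault_detection(f"instruction: {instruction}")

threading.Thread(target=active_trigger, daemon=True).start()
threading.Thread(target=passive_trigger, daemon=True).start()
instruction_queue.put("from NPMD layer")
time.sleep(0.1)                      # let the daemon threads run once in this demo
```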
  • the operating system running data may be indicator data, log data, link data and other data in the operating system to be detected.
  • Indicator data, log data, link data and other data are also called stateless data, which can represent the indicator type data generated during the operation of the current operating system.
  • For descriptions of indicator data, log data, link data and other data, please refer to the brief introduction of the relevant elements involved in this application above, which will not be repeated here.
  • the current operating system operating data refers to the system operating data in the current time period.
  • electronic equipment performs fault detection in the current time period, it needs to obtain the system operation data of the current time period to ensure accurate detection of system grayscale faults in the current time period.
  • the electronic device wants to detect system grayscale faults in the historical period, it can obtain the system operation data in the historical period.
  • the electronic device can call the data collection module to obtain the current operating system operating data of the operating system to be detected.
  • the electronic device determines the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules.
  • the current operating system event data includes: data generated during the execution of process events or thread events of the current operating system in the operating system to be detected, and context information associated with process events or thread events of the current operating system.
  • the operating system event data may be event data in the operating system to be detected.
  • Event data, also known as stateful data, can represent data generated during process events or thread events of the current operating system, as well as context information associated with those process events or thread events.
  • the preset event rules may be pre-established based on human experience or domain knowledge. After obtaining the current operating system operating data, the electronic device can determine the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules.
  • the current operating system operating data may be a single point of operating data
  • the current operating system event data may be event data determined based on multiple operating data within a cycle.
  • the electronic device detects the CPU utilization of the operating system to be detected every five seconds.
  • the CPU utilization can be detected 12 times per minute, thus obtaining 12 CPU utilization detection results.
  • the above 12 CPU utilization detection results are the 12 current operating system operating data.
  • Preset event rules include: when the CPU utilization is higher than the preset value, it is a fault event.
  • if the CPU utilization exceeds the preset value during the first 40 seconds of a minute, the electronic device can obtain the 8 CPU utilization detection results of those 40 seconds and, based on these 8 detection results, determine that the first 40 seconds of the minute constitute a CPU fault event, that is, the current operating system event data.
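  • As an illustration of how such a preset event rule could be evaluated, the following minimal Python sketch (not the patented implementation) turns periodic CPU-utilization samples into a fault event record; the threshold value, five-second sampling period, and all function and field names are assumptions introduced here.

```python
# Minimal illustrative sketch: apply the preset rule "utilization above the
# threshold is a fault event" to one minute of samples (12 samples at 5 s).
from dataclasses import dataclass

@dataclass
class OsEvent:
    kind: str       # e.g. "cpu_fault"
    start_s: int    # offset of the first sample covered by the event
    end_s: int      # offset just past the last sample covered by the event
    samples: list   # the raw operating data the event was derived from

CPU_FAULT_THRESHOLD = 0.9   # assumed preset value from the event rule
SAMPLE_PERIOD_S = 5         # one CPU-utilization sample every five seconds

def derive_cpu_events(cpu_samples):
    """Merge consecutive above-threshold samples into CPU fault events."""
    events, run = [], []
    for i, value in enumerate(cpu_samples):
        if value > CPU_FAULT_THRESHOLD:
            run.append((i, value))
        elif run:
            events.append(_close_run(run))
            run = []
    if run:
        events.append(_close_run(run))
    return events

def _close_run(run):
    first, last = run[0][0], run[-1][0]
    return OsEvent("cpu_fault",
                   first * SAMPLE_PERIOD_S,
                   last * SAMPLE_PERIOD_S + SAMPLE_PERIOD_S,
                   [v for _, v in run])

# Example: 12 samples per minute; the first 8 (first 40 seconds) exceed the threshold.
minute = [0.95] * 8 + [0.40] * 4
print(derive_cpu_events(minute))  # one cpu_fault event covering seconds 0-40
```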
  • the current operating system event data may also include system activity type events such as vulnerability repairs and software upgrades within the current time period.
  • suppose the current operating system running data obtained by the electronic device includes: the CPU utilization of the operating system to be detected at the current moment is the first utilization rate, and the running thread of the operating system to be detected at the current moment is thread A.
  • the preset event rules include: at the same moment, the CPU utilization of the operating system to be detected is the CPU utilization occupied by the thread running at that moment, namely thread A.
  • the electronic device can then determine, based on the current operating system running data and the preset event rules, that the current operating system event data of the operating system to be detected includes: the CPU utilization occupied by thread A during the current running process is the first utilization rate.
  • the electronic device can also obtain context information about the running process of thread A at the current moment, such as the running instructions sent by thread B to thread A.
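  • Purely as an illustration of what event-type data carrying context information might look like, the following sketch shows one possible record for the thread A example above; every field name is an assumption, not a structure defined by the embodiment.

```python
# Illustrative only: one possible shape for "current operating system event data",
# bundling the indicator value with the associated context information.
thread_a_event = {
    "event": "thread_running",
    "thread": "A",
    "cpu_utilization": 0.35,          # the first utilization rate (indicator part)
    "context": {                      # context associated with the thread event
        "triggered_by": "thread B",   # e.g. the running instruction sent by thread B
        "instruction": "run",
    },
}
```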
  • the electronic device can call the data acquisition module to determine the current operating system event data of the operating system to be detected based on the obtained current operating system operating data and preset event rules.
  • the operating system to be detected includes a data plane and a management plane.
  • the data in the data plane refers to the data of the various system resources of the ITIM layer that are used when an APP of the APM layer runs on the ITIM layer, which can directly reflect the various indicators and resource usage of the APP.
  • the data plane may include multiple thread nodes (for example, node 1, node n, node... in Figure 6), and each node may include multiple containers. Each container can connect to the kernel layer through the system call layer.
  • the management module of the kernel layer can develop the data plane probe module (Probe Agent) based on the extended Berkeley Packet Filter (eBPF) technology.
  • the data collection module can obtain, based on the Probe Agent, stateless data such as indicator data, log data and link data, as well as stateful data in the form of event data, from the data plane, and store the obtained data in the data storage module.
  • the data in the management plane refers to the management data of the OS when the OS in the ITIM layer is running, such as OS version upgrade event data and disk expansion event data.
  • the data collection module can collect the event data of the activity class in the event data in the management plane based on the event agent module of the management plane, and store the obtained data in the data storage module.
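  • The following sketch only illustrates the direction of this collection flow; probe_agent and event_agent are hypothetical stand-ins for the eBPF-based data plane Probe Agent and the management plane event agent, and the interfaces shown are assumptions.

```python
# Sketch of the collection flow only; probe_agent / event_agent are hypothetical
# stand-ins, and poll() is an assumed interface rather than a real API.
class DataStorageModule:
    def __init__(self):
        self.stateless = []   # indicator / log / link data from the data plane
        self.stateful = []    # activity-class event data from the management plane

    def store_stateless(self, record):
        self.stateless.append(record)

    def store_stateful(self, record):
        self.stateful.append(record)

def collect_once(probe_agent, event_agent, store):
    for record in probe_agent.poll():   # e.g. per-container indicator/log/link data
        store.store_stateless(record)
    for event in event_agent.poll():    # e.g. OS version upgrade, disk expansion events
        store.store_stateful(event)
```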
  • the electronic device determines the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data.
  • after the electronic device obtains the current operating system operation data and the current operating system event data, it can determine the system fault detection result of the operating system to be detected through a pre-trained fault detection model; for the specific detection process, please refer to the description of the embodiment below.
  • system fault detection results corresponding to multiple types of system data may be pre-stored in the electronic device. After obtaining the current operating system running data and the current operating system event data, the electronic device can look up the pre-stored correspondence between system data and system fault detection results, thereby determining the system fault detection result of the operating system to be detected in the current time period.
  • the system fault detection results corresponding to various types of system data can be determined by the electronic device based on the system data and system fault detection results of historical time periods, or can be created in advance based on domain knowledge; the embodiments of this application do not limit this.
  • the system faults causing the system fault detection results include the following single types of faults or combined types of faults: network IO fault, disk IO fault, scheduling fault, memory fault, process or thread fault, file system fault, disk fault, CPU fault and container fault.
  • the electronic device can detect system fault detection results of various single types or combined types of faults.
  • the above current operating system running data refers to data generated during the execution of process events or thread events of the current operating system in the operating system to be detected. In other words, this kind of data generally only represents an indicator of a process event or thread event (such as CPU utilization). Therefore, the current operating system operating data is indicator type data.
  • the above-mentioned current operating system event data not only includes data generated during the execution of process events or thread events of the current operating system, but also includes context information associated with the process events or thread events of the current operating system.
  • this kind of data can not only represent an indicator of a process event or thread event, but also the event content of the process event or thread event (for example, that the thread event is the main thread event), and/or the event status (for example, that the thread event is an abnormal status event). Therefore, the current operating system event data is event type data.
  • in this way, when the electronic device detects a fault at the operating system level, it can not only obtain the current operating system operating data of the indicator type, but also determine the current operating system event data of the event type based on that operating data. Subsequently, compared with the approach in general technology of detecting faults using only basic information (that is, the indicator-type current operating system operating data) combined with manual experience, the fault detection method provided by the embodiment of the present application can quickly determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data, without the intervention of manual experience, which improves the efficiency of fault detection.
  • since the current operating system event data includes contextual information associated with the current operating system process events or thread events, this method can mine that context information, obtain the relevant content of the context-related events of the process event or thread event, and then, combining these contents with the current operating system running data, accurately determine the system fault detection result, improving the accuracy of fault detection.
  • the fault detection method can also detect more fine-grained fault detection results (such as fault detection results of specific thread events or process events), further improving the accuracy of fault detection.
  • the fault detection method provided by the embodiments of the present application does not require adding various fault detection tools at the operating system level of the electronic device; it only needs the data collection and data processing functions of the operating system itself to quickly detect faults, realizing lightweight fault detection at the operating system level.
  • in addition, the nodes of the cloud server may be many and the links long.
  • the data propagation effect corresponding to the process event or thread event may be delayed.
  • fault detection efficiency may be low.
  • the embodiments of the present application combine the current operating system running data and the current operating system event data, which have fine granularity (down to the thread events or process events of the operating system to be detected) and a wide range of detection data (including not only indicator type data but also event type data), and mine the contextual information in the current operating system event data, so the system fault detection result of the operating system to be detected can be determined quickly and accurately.
  • compared with general techniques that may take 0.5 hours to 72 hours to determine a fault, the embodiment of the present application combines the current operating system operation data and the current operating system event data and can shorten fault detection to the "minutes" level, quickly and accurately determining the system fault detection result of the operating system to be detected and improving the efficiency of fault detection.
  • the electronic device may include: a system layer processing module, an application layer processing module, and a network layer processing module.
  • the method for the electronic device to determine the system fault detection result of the operating system to be detected based on the current operating system running data and the current operating system event data specifically includes:
  • the system layer processing module determines the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data.
  • the electronic device can further determine the grayscale faults in combination with the fault detection results of the APM layer and/or NPMD layer.
  • the fault detection method provided by the embodiment of the present application also includes:
  • the system layer processing module sends the system fault detection result to the application layer processing module and the network layer processing module.
  • since the system fault detection result is the fault detection result of the system layer of the operating system to be detected, the grayscale fault of the electronic device can be further determined by combining it with the application fault detection result of the application layer and/or the network fault detection result of the network layer.
  • the system layer processing module can send the system fault detection result to the application layer processing module and the network layer processing module.
  • the application layer processing module determines the application fault detection result of the electronic device based on the system fault detection result.
  • the application layer processing module can determine the application fault detection result of the electronic device based on the system fault detection result.
  • the application layer processing module can determine, based on the system fault detection result sent by the system layer processing module, whether the application corresponding to the system fault detection result is faulty, and when it determines that the application is faulty, take the application fault as the application fault detection result.
  • the application layer processing module can first determine the application fault of the application layer based on the fault detection tool of the application layer. Then, the application layer processing module combines the detected application fault with the system fault detection result sent by the system layer processing module to further determine the application fault detection result of the electronic device.
  • the network layer processing module determines the network fault detection result of the electronic device based on the system fault detection result.
  • the network layer processing module can determine the network fault detection result of the electronic device based on the system fault detection result.
  • the network layer processing module can determine, based on the system fault detection result sent by the system layer processing module, whether the network device corresponding to the system fault detection result is faulty, and when it determines that the network device is faulty, take the network device fault as the network fault detection result.
  • the network layer processing module can first determine the network device fault of the network layer based on a fault detection tool of the network layer. Then, the network layer processing module combines the detected network device fault with the system fault detection result sent by the system layer processing module to further determine the network fault detection result of the electronic device.
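  • A compact sketch of this loose coupling is given below, with hypothetical module interfaces; only the direction of the data flow (the system layer pushing its result to the application layer and network layer processing modules) follows the text.

```python
# Hypothetical module interfaces; detect(), detect_with_tools() and combine()
# are assumptions introduced only to show the flow of the detection results.
def detect_grayscale_fault(system_layer, app_layer, network_layer):
    system_result = system_layer.detect()               # ITIM-layer system fault result
    app_result = app_layer.combine(                     # APM layer: own tools + pushed result
        own_faults=app_layer.detect_with_tools(),
        system_result=system_result)
    net_result = network_layer.combine(                 # NPMD layer: own tools + pushed result
        own_faults=network_layer.detect_with_tools(),
        system_result=system_result)
    return system_result, app_result, net_result
```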
  • in a general fault detection method, the application layer processing module needs to first determine the application fault detection result of the APM layer; then, after the application fault detection result of the APM layer is determined, the network layer processing module determines the network fault detection result of the NPMD layer; finally, the system layer processing module obtains the operating system operating data of the ITIM layer and determines the final grayscale fault detection result based on manual experience.
  • that is, in the general method the APM layer, NPMD layer and ITIM layer are tightly coupled, and the electronic equipment needs to determine the fault detection results of each layer in turn to determine the final grayscale fault detection result.
  • the ITIM layer is loosely coupled with the APM layer and NPMD layer respectively.
  • the APM layer, NPMD layer and ITIM layer can independently detect the fault detection results of the current layer.
  • the embodiment of the present application eliminates the step of drilling down layer by layer, improving fault detection efficiency.
  • the system layer processing module of the ITIM layer can provide independent data collection, grayscale fault identification and root cause analysis capabilities. After determining the system fault detection result, it can send the system fault detection result to the application layer processing module and the network layer processing module, so that the application layer processing module determines the application fault detection result of the electronic device based on the system fault detection result, and the network layer processing module determines the network fault detection result of the electronic device based on the system fault detection result.
  • compared with the general fault detection method, in which the APM layer, NPMD layer and ITIM layer must determine their fault detection results in sequence, the fault detection method provided by the embodiment of the present application enables the APM layer, NPMD layer and ITIM layer to determine the fault detection results of their respective layers synchronously, and combines the fault detection results of each layer to determine the grayscale fault detection result, improving the efficiency of fault detection.
  • in the general method, the fault detection results corresponding to each layer are independently output to the output platform corresponding to manual detection and are not combined, so the fault detection results of each layer form islands of data.
  • the embodiment of the present application can actively identify and locate grayscale faults from the bottom-up perspective of the operating system to be detected at the ITIM layer, and combine this with the top-down general intelligent operation and maintenance fault detection method from the perspective of the APM layer, thereby eliminating data islands and further improving the accuracy of grayscale fault detection.
  • the electronic device can determine the system fault detection result of the operating system to be detected based on a trained fault detection model.
  • the fault detection method provided by the embodiment of the present application may include: a process in which the electronic device is trained to obtain a fault detection model (referred to as the "fault detection model training" process), and a process in which the electronic device determines the system fault detection result of the operating system to be detected based on the fault detection model (referred to as the "fault detection" process).
  • the "fault detection model training" process provided by the embodiment of this application specifically includes:
  • the electronic device obtains historical operating system data.
  • historical operating system data includes: historical operating system operating data and historical operating system event data.
  • Historical operating system operating data refers to system operating data in historical time periods.
  • Historical operating system event data refers to system event data in historical time periods.
  • the historical operating system event data includes: data generated during the execution of process events or thread events of the historical operating system in the operating system to be detected, and contextual information associated with the process events or thread events of the historical operating system.
  • the electronic device can first obtain the system operating data of the historical time period (ie, historical operating system operating data), and then determine the historical operating system event data of the operating system to be detected based on the historical operating system operating data and preset event rules.
  • for the description of how the electronic device obtains the system operation data of the historical time period and determines the historical operating system event data of the operating system to be detected based on the historical operating system operating data and preset event rules, please refer to the description in S501 of how the electronic device obtains the current operating system operating data of the operating system to be detected and determines the current operating system event data based on the current operating system operating data and preset event rules; this will not be described again here.
  • the model can be trained based on system operation data and operating system event data in historical time periods.
  • the electronic device is trained to obtain a fault detection model based on the preset fault identification algorithm and historical operating system data.
  • the electronic device can obtain raw data of a historical time period corresponding to historical operating system operating data.
  • the electronic device can then perform preprocessing operations on these raw data to obtain historical operating system operating data, providing accurate baseline data for subsequent training of fault detection models.
  • the preprocessing operation is used to clean the raw data to obtain complete, smooth and noise-free historical operating system operating data.
  • the above preprocessing operations may include operations such as missing completion, anomaly denoising, and index smoothing, which are not limited in the embodiments of the present application.
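  • As an illustration of these preprocessing operations, the following sketch assumes pandas and uses illustrative parameters (interpolation for missing completion, 3-sigma clipping for anomaly denoising, a rolling mean for index smoothing); none of these choices are specified by the embodiment.

```python
# Illustrative preprocessing of a raw indicator series; the window size and the
# 3-sigma clipping rule are assumptions, not values taken from the embodiment.
import pandas as pd

def preprocess(raw: pd.Series) -> pd.Series:
    filled = raw.interpolate(limit_direction="both")             # missing completion
    mean, std = filled.mean(), filled.std()
    denoised = filled.clip(mean - 3 * std, mean + 3 * std)        # anomaly denoising
    smoothed = denoised.rolling(window=3, min_periods=1).mean()   # index smoothing
    return smoothed
```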
  • the fault detection model can be trained based on the conjugate gradient method and historical operating system data.
  • the fault detection model can also be trained based on the gradient descent method and historical operating system data, or based on another general fault identification algorithm and historical operating system data, which is not limited in the embodiments of the present application.
  • the preset fault identification algorithm may include a classification algorithm and a parameter optimization algorithm.
  • the classification algorithm is used for classification training of the initial model of the fault detection model.
  • the parameter optimization algorithm can adjust the parameters of the model during the classification training process to obtain a fault detection model.
  • the method for the electronic device to train the fault detection model based on the preset fault identification algorithm and historical operating system data specifically includes:
  • the electronic device adds a status identifier to the historical operating system event data based on the preset status rules.
  • the status identification includes: normal identification and abnormal identification.
  • the electronic device can add a status identifier to the historical operating system event data of the historical time period based on the preset status rules.
  • preset state rules can be created based on domain knowledge.
  • the preset status rules may include: when thread A is running and the CPU utilization is greater than the preset threshold, the status identifier of thread A is determined to be the abnormal identifier; when thread A is running and the memory space occupied is greater than the preset memory size, the status identifier of thread A is determined to be the abnormal identifier.
  • the electronic device can add state identifiers to part of the operating system data in the historical operating system event data based on the preset state rules.
  • for example, the preset status rules include: when the CPU utilization of thread A is greater than 70% during the running process of thread A from 13 o'clock to 14 o'clock, the status identifier of thread A is determined to be the abnormal identifier; and when the CPU utilization of thread A is greater than 80% during another running period starting at 18 o'clock, the status identifier of thread A is determined to be the abnormal identifier. In that case, if thread A runs with a CPU utilization of 75% at 15 o'clock, a status identifier cannot be accurately added to the operating system event data of thread A.
  • the electronic device may then add status identifiers only to the part of the historical operating system event data other than the operating system event data of thread A.
  • the electronic device can also add relevant explanation information to historical operating system event data based on domain knowledge.
  • suppose the historical operating system event data is: during the running process of thread A from 13:00 to 14:00, the CPU utilization is greater than 70%, and the status identifier of this historical operating system event data is the abnormal identifier.
  • the electronic device can also add relevant explanation information to the historical operating system event data based on domain knowledge, for example: during the running of thread A from 13:00 to 14:00, multiple redundant code segments appeared in thread A, resulting in a CPU utilization greater than 70% and an abnormal status.
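  • The following sketch illustrates how preset status rules of the kind described above might be applied to label historical event data and attach explanation information; the time windows, thresholds and field names only echo the examples above and are otherwise assumptions.

```python
# Illustrative labeling pass; rule boundaries and record fields are assumptions.
ABNORMAL = "abnormal"

STATUS_RULES = [
    # (hour window, CPU threshold, optional explanation from domain knowledge)
    ((13, 14), 0.70, "redundant code in thread A pushed CPU utilization above 70%"),
    ((18, 19), 0.80, None),
]

def label_event(event: dict) -> dict:
    """Attach a status identifier (and, when available, explanation information)
    to one historical operating-system event record; records covered by no rule
    are left without a status identifier."""
    for (start, end), threshold, note in STATUS_RULES:
        if start <= event["hour"] < end and event["cpu_utilization"] > threshold:
            event["status"] = ABNORMAL
            if note:
                event["explanation"] = note
            return event
    return event

# 75% utilization at 15 o'clock matches no rule, so no status identifier is added.
print(label_event({"thread": "A", "hour": 15, "cpu_utilization": 0.75}))
```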
  • the electronic device trains the model to be trained based on the classification algorithm and the second data to obtain the model to be adjusted.
  • the fault detection model can be a multi-classification model.
  • the electronic device trains the model to be trained based on the classification algorithm and the second data to obtain the model to be adjusted.
  • the second data includes historical operating system operation data and historical operating system event data after adding status identifiers;
  • the model to be trained includes: an application feature classification model to be trained and a fault classification model to be trained.
  • the above classification algorithm can be a K-Nearest Neighbor (KNN) classification algorithm, a Bayesian classifier algorithm, a logistic regression algorithm, or other algorithms.
  • the application feature classification model is used to determine the application features of system faults of the operating system to be detected.
  • the above application characteristics may include: CPU intensive, IO intensive, periodic characteristics, etc.
  • CPU-intensive is also called computing-intensive, which means that the hard disk and memory performance of the operating system are much better than that of the CPU. In this case, most of the time the operating system runs with the CPU fully loaded, and when the CPU needs to read from or write to the hard disk or memory, the hard disk or memory can complete the operation in a short time.
  • IO-intensive means that the CPU performance of the operating system is much better than that of the hard disk and memory. In this case, when the system is running, the CPU is mostly waiting for I/O (hard disk/memory) read/write operations, and the CPU load is not high. IO-intensive programs generally have low CPU utilization when they reach their performance limit, which may be because the task itself requires a large number of I/O operations while the pipeline is not smooth and the processing capability is not fully utilized.
  • Periodic characteristics refer to the periodic characteristics of processes, threads, or applications of the operating system.
  • the fault classification model is used to determine the fault type of application characteristics of system faults in the operating system to be detected.
  • the above fault types may include: status type faults, function type faults, performance type faults, etc.
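  • As a sketch of this two-stage structure (an application feature classification model followed by a fault classification model), the following example assumes scikit-learn with a KNN classifier and synthetic feature vectors; the feature construction and labels are illustrative, not the embodiment's actual models.

```python
# Two-stage classification sketch; the choice of KNN and the tiny synthetic
# "second data" (operating data plus status-labelled event data) are assumptions.
from sklearn.neighbors import KNeighborsClassifier

X = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.3, 0.3, 0.9]]
y_app_feature = ["cpu_intensive", "io_intensive", "periodic"]
y_fault_type = ["performance", "status", "function"]

# Stage 1: application feature classification model to be trained.
app_feature_model = KNeighborsClassifier(n_neighbors=1).fit(X, y_app_feature)
# Stage 2: fault classification model to be trained.
fault_model = KNeighborsClassifier(n_neighbors=1).fit(X, y_fault_type)

sample = [[0.85, 0.15, 0.05]]
print(app_feature_model.predict(sample))  # e.g. ['cpu_intensive']
print(fault_model.predict(sample))        # e.g. ['performance']
```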
  • in order to train a fault detection model with higher accuracy, the electronic device can obtain the raw data of the historical time period corresponding to the historical operating system operating data, and can then perform preprocessing operations on the raw data to obtain the historical operating system operating data, providing accurate baseline data for subsequent training of the fault detection model.
  • the preprocessing operation is used to clean the raw data to obtain complete, smooth and noise-free historical operating system operating data.
  • the above-mentioned preprocessing operations may include operations such as missing filling, anomaly denoising, and index smoothing, which are not limited in the embodiments of the present application.
  • the electronic device adjusts parameters of the model to be adjusted based on historical operating system operating data and parameter optimization algorithms to obtain a fault detection model.
  • the fault detection model includes: application feature classification model and fault classification model.
  • the parameters of the model to be adjusted need to be adjusted and optimized.
  • the electronic device can adjust parameters of the model to be adjusted based on historical operating system operating data and parameter optimization algorithms to obtain a fault detection model.
  • the parameter optimization algorithm is used to make the output accuracy of the model to be trained reach the preset accuracy.
  • in other words, the parameter optimization algorithm is used to continuously adjust the parameters of the model to be trained so that the output results of the model to be trained keep approaching the parameterized target value.
  • the parameter optimization algorithm may include: gradient descent algorithm, momentum optimization algorithm, adaptive learning rate optimization algorithm, etc.
  • the output accuracy of the above-mentioned model to be trained may include: recall rate, precision rate, F1 score (F1Score), etc.
  • the parameterized target value may include: a preset value corresponding to the recall rate, a preset value corresponding to the precision rate, and a preset value corresponding to the F1 score.
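  • The following sketch illustrates the parameter adjustment step, using scikit-learn's grid search as a stand-in for whichever parameter optimization algorithm is actually chosen, and checking precision, recall and F1 score against preset target values; all concrete choices here are assumptions.

```python
# Illustrative parameter tuning; GridSearchCV stands in for the unspecified
# parameter optimization algorithm (gradient descent, momentum, adaptive
# learning rate, ...), and the parameter grid and targets are assumptions.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

def tune(X_train, y_train, X_val, y_val, targets=(0.9, 0.9, 0.9)):
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 3, 5]},
                          scoring="f1_macro", cv=3)
    search.fit(X_train, y_train)
    model = search.best_estimator_
    pred = model.predict(X_val)
    scores = (precision_score(y_val, pred, average="macro"),
              recall_score(y_val, pred, average="macro"),
              f1_score(y_val, pred, average="macro"))
    reached = all(s >= t for s, t in zip(scores, targets))  # parameterized target values
    return model, scores, reached
```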
  • since the preset fault identification algorithm can include a classification algorithm and a parameter optimization algorithm, the classification algorithm can be used to perform classification training on the initial model of the fault detection model, and the parameter optimization algorithm can adjust the parameters of the model during the classification training process to obtain the fault detection model, so that the system fault detection result can be determined quickly and accurately through the fault detection model, thereby improving the fault detection efficiency.
  • the fault detection model training process can be completed based on training in an offline scenario, or based on training in the online usage scenario.
  • the offline scenario means that after the electronic device obtains the historical operating system data, it can migrate the historical operating system data to a laboratory scenario and train the fault detection model in the laboratory scenario based on the preset fault identification algorithm and the historical operating system data. Subsequently, when the fault detection model is used, the model trained in the offline scenario can be migrated to the online usage scenario for fault detection based on an online migration algorithm.
  • the online usage scenario means that after the electronic device obtains historical operating system data, it can directly train a fault detection model based on the preset fault identification algorithm and historical operating system data. Later, when using the fault detection model, the fault detection model can be used directly for fault detection.
  • Figure 10 shows a schematic flowchart of obtaining a fault detection model based on offline scenario training provided by an embodiment of the present application.
  • the "fault detection model training" process may include:
  • the data collection module of the electronic device can collect stateless data of the historical time period (i.e., the historical operating system operating data in the embodiment of the present application) and stateful data of the historical time period (i.e., the historical operating system event data in the embodiment of the present application).
  • the grayscale fault identification module of electronic equipment can perform operations such as missing completion, anomaly removal, and index smoothing on stateless data to provide accurate baseline data for subsequent steps.
  • the grayscale fault identification module of the electronic equipment can add application normal/abnormal status labels to the stateful data, and the content of the labels can be customized based on domain knowledge.
  • electronic devices can also add relevant explanation information to stateful data based on domain knowledge.
  • the electronic device can perform model training based on stateless data and data-tagged stateful data, thereby generating an application feature classification model and a fault classification model.
  • the electronic device can evaluate whether the application feature classification model and fault classification model have reached the training convergence condition by calculating the precision rate, recall rate, and F1-Score index of the application feature classification model and fault classification model.
  • the electronic device can generate a fault detection model in an offline scenario for test verification or online migration training.
  • the method for the electronic device to determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data specifically includes:
  • the electronic device adds a status identifier to the current operating system event data based on the preset status rules.
  • the status identification includes: normal identification and abnormal identification.
  • the electronic device can add a status identifier to the current operating system event data based on the preset status rules.
  • the electronic device inputs the first data into the pre-trained fault detection model to obtain the system fault detection result.
  • the first data includes the current operating system running data and the current operating system event data after the status identifiers are added; the fault detection model is trained based on the historical operating system data of the operating system to be detected; the historical operating system data includes: historical operating system operating data and historical operating system event data.
  • the electronic device can obtain raw data corresponding to the current time period corresponding to the current operating system operating data.
  • the electronic device can then perform preprocessing operations on these raw data to obtain current operating system operating data, providing accurate baseline data for subsequent determination of system fault detection results.
  • the preprocessing operation is used to clean the original data to obtain complete, smooth and noise-free current operating system operating data.
  • the above preprocessing operations may include operations such as missing completion, anomaly denoising, and index smoothing, which are not limited in the embodiments of the present application.
  • the fault detection model is trained based on the historical operating system data of the operating system to be detected, and the historical operating system data includes: historical operating system operating data and historical operating system event data, therefore, the electronic device determines the operating system to be detected When the system fault detection results are obtained, the fault detection model can be used to quickly and accurately determine the system fault detection results, improving the fault detection efficiency.
  • since the model to be trained includes an application feature classification model to be trained and a fault classification model to be trained, the electronic device can use the trained application feature classification model and fault classification model to determine the system fault detection result. In this case, the electronic device inputs the first data into the pre-trained fault detection model to obtain the system fault detection result.
  • the method specifically includes:
  • the electronic device inputs the first data into the application feature classification model to obtain the application feature fault detection result.
  • the application characteristic fault detection results may include: CPU-intensive application characteristic fault detection results, IO-intensive application characteristic fault detection results, periodic characteristic application characteristic fault detection results, etc.
  • the electronic device inputs the system data corresponding to the application characteristic fault detection result in the first data into the fault classification model to obtain the system fault detection result.
  • the system fault detection results may include: a status type fault occurs in a CPU-intensive application, a function type fault occurs in a CPU-intensive application, a performance type fault occurs in a CPU-intensive application, a status type fault occurs in an IO-intensive application, a function type fault occurs in an IO-intensive application, a performance type fault occurs in an IO-intensive application, a status type fault occurs in an application with periodic characteristics, a function type fault occurs in an application with periodic characteristics, and a performance type fault occurs in an application with periodic characteristics.
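  • A sketch of this two-step inference is shown below, with hypothetical model objects exposing predict(); the routing of the system data corresponding to the application characteristic fault detection result is simplified to pairing each input row with its predicted application feature.

```python
# Hypothetical two-step inference; the model objects are assumed to expose predict().
def detect_system_fault(first_data_rows, app_feature_model, fault_model):
    results = []
    features = app_feature_model.predict(first_data_rows)   # step 1: application feature
    for row, feature in zip(first_data_rows, features):
        fault_type = fault_model.predict([row])[0]           # step 2: fault type for that feature
        results.append({"application_feature": feature,      # e.g. "io_intensive"
                        "fault_type": fault_type})            # e.g. "performance"
    return results
```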
  • in this way, the electronic device can determine different application features, as well as the fault classification corresponding to each application feature, based on the application feature classification model and the fault classification model in turn, improving the accuracy of fault detection.
  • Figure 12 shows a schematic flowchart of fault detection based on a fault detection model obtained through pre-training provided by an embodiment of the present application.
  • the "fault detection model training" process may include:
  • the data collection module of the electronic device can collect stateless data of the current time period (i.e., the current operating system running data in the embodiment of the present application) and stateful data of the current time period (i.e., the current operating system event data in the embodiment of the present application).
  • the grayscale fault identification module of electronic equipment can perform operations such as missing completion, anomaly removal, and index smoothing on stateless data to provide accurate baseline data for subsequent steps.
  • the grayscale fault identification module of the electronic equipment can add application normal/abnormal status labels to the stateful data, and the content of the labels can be customized based on domain knowledge.
  • electronic devices can also add relevant explanation information to stateful data based on domain knowledge.
  • the electronic device can perform model training based on stateless data and data-tagged stateful data, thereby generating an application feature classification model and a fault classification model.
  • the electronic device can load a pre-trained fault detection model from an offline scenario based on an online migration algorithm.
  • through the application feature classification model, the electronic device performs fault detection according to the application threshold defined for each application feature, based on factors such as deviation and duration.
  • then, the final grayscale fault event (i.e., the system fault detection result in the embodiment of the present application) can be further determined through the fault classification model, such as abnormal application resource usage, abnormal application performance, abnormal application status, etc.
  • the detection results corresponding to the grayscale fault event may include abnormal attributes of the fault event, such as the time, location, type, etc. of occurrence.
  • the type of fault event includes resource usage, performance, status, etc. of the fault event.
  • the electronic device can also construct an application dependency relationship and an event propagation relationship so as to further achieve fault root cause location.
  • the fault detection method provided by the embodiment of the present application also includes:
  • the electronic device builds an application dependency relationship based on the current operating system operating data and preset dependency rules.
  • the application dependency relationship is used to represent the interdependence relationship between each process or thread and each application instance in the operating system to be detected.
  • the preset dependency rule may be: when there is an execution sequence relationship between a first process or thread and a second process or thread, it is determined that there is a dependency relationship between the first process or thread and the second process or thread; or, when the first process or thread needs the running result of the second process or thread in order to run, it is determined that there is a dependency relationship between the first process or thread and the second process or thread; it can also be another type of rule, which is not limited in the embodiments of this application.
  • the current operating system operating data includes operating data corresponding to each process or thread and each application instance. After obtaining the current operating system operating data, the electronic device can construct an application dependency relationship based on the current operating system operating data and preset dependency rules.
  • the electronic device can construct an application dependency relationship based on an application dependency graph including nodes and edges.
  • the method for the electronic device to build application dependencies based on the current operating system operating data and preset dependency rules specifically includes:
  • the electronic device builds an application dependency graph based on the current operating system operating data and preset dependency rules.
  • the application dependency graph includes multiple application nodes and edges between application nodes; a main application node among the multiple application nodes is used to represent the main thread of an application instance in the operating system to be detected; a slave application node among the multiple application nodes is used to represent a slave thread of an application instance in the operating system to be detected or an instance on which an application instance in the operating system to be detected depends; an edge between a first application node and a second application node among the multiple application nodes is used to indicate that there is a dependency relationship between the first application node and the second application node.
  • the slave thread of the application instance may include a proxy application instance (proxy), etc.
  • the attributes of application nodes can include time, status, resource usage and other information.
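  • A minimal sketch of such an application dependency graph is given below, using a plain adjacency structure; the node names, attribute fields and example edges are illustrative assumptions.

```python
# Illustrative application dependency graph: application nodes with attributes
# (time/status/resource usage) and directed edges meaning "depends on".
app_nodes = {
    "app11_main":  {"role": "main thread", "status": "normal", "cpu": 0.35},
    "app11_proxy": {"role": "slave/proxy", "status": "normal", "cpu": 0.05},
    "app12_main":  {"role": "main thread", "status": "normal", "cpu": 0.20},
}
app_edges = {
    ("app11_main", "app12_main"),   # app 11 depends on app 12 (assumed example)
    ("app11_proxy", "app11_main"),  # the proxy slave node depends on its application instance
}
```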
  • the electronic device constructs an event propagation relationship based on application dependencies.
  • the electronic device can construct an event propagation relationship based on the application dependencies.
  • the event propagation relationship is used to represent the propagation relationship between application events in the operating system to be detected.
  • Application events are application events corresponding to processes or threads and application instances in the operating system to be detected.
  • the electronic device can construct an event propagation relationship based on an event propagation graph including nodes and edges.
  • the method for electronic devices to construct event propagation relationships based on application dependencies specifically includes:
  • the electronic device constructs an event propagation graph based on the application dependency graph.
  • the event propagation graph includes multiple event nodes corresponding to the multiple application nodes and edges between event nodes; the multiple event nodes are used to represent the application instances corresponding to application events in the operating system to be detected; an edge between a first event node and a second event node among the multiple event nodes is used to indicate that there is a propagation relationship between the first event corresponding to the first event node and the second event corresponding to the second event node.
  • event nodes belong to application nodes.
  • the events corresponding to the event node include abnormal events, stateful events, and management plane events (such as application upgrades, system patches, etc.).
  • the attributes of event nodes can include information such as time, event type, exception level, etc.
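  • The following sketch illustrates one way event propagation edges could be derived from the application dependency edges, on the assumption that an event on a depended-upon application node may propagate to an event on the node that depends on it; the helper name and data shapes are assumptions.

```python
def build_event_propagation(app_edges, events_by_app):
    """app_edges holds (dependent_app, depended_app) pairs; events_by_app maps an
    application node to the event nodes raised on it. If application u depends on
    application v, an event on v may propagate to an event on u."""
    propagation_edges = set()
    for dependent_app, depended_app in app_edges:
        for source_event in events_by_app.get(depended_app, []):
            for affected_event in events_by_app.get(dependent_app, []):
                propagation_edges.add((source_event, affected_event))
    return propagation_edges

# With the dependency edge ("app11_main", "app12_main") from the sketch above and
# events {"app12_main": ["event_b"], "app11_main": ["event_a"]}, this yields the
# propagation edge ("event_b", "event_a"), i.e. event b may propagate to event a.
print(build_event_propagation({("app11_main", "app12_main")},
                              {"app12_main": ["event_b"], "app11_main": ["event_a"]}))
```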
  • the electronic device determines the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread based on the event propagation relationship.
  • the electronic device can determine, based on the event propagation relationship, the root cause event that caused the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread, achieving the effect of correlation root cause analysis across multi-dimensional and multi-type data and solving technical problems of general fault detection methods such as difficulty in fault root cause analysis and imprecise positioning granularity.
  • the electronic device can quickly and accurately determine the root event, as well as the root application instance and/or the root process or thread based on the application dependency graph and the event propagation graph.
  • the method for the electronic device to determine, according to the event propagation relationship, the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread, specifically includes:
  • the electronic device determines a propagation event node having an edge with the fault event node.
  • the fault event node is the event node corresponding to the fault event; the fault event is the fault event corresponding to the system fault detection result.
  • the electronic device determines the event corresponding to the propagation start event node among the propagation event nodes as the root cause event, determines the application instance corresponding to the propagation start event node as the root cause application instance, and/or determines the process or thread corresponding to the propagation start event node as the root cause process or thread.
  • in other words, the electronic device may determine the event corresponding to the propagation start event node among the propagation event nodes as the root cause event, determine the application instance corresponding to the propagation start event node as the root cause application instance, and/or determine the process or thread corresponding to the propagation start event node as the root cause process or thread.
  • in this way, after determining the application dependency graph and the event propagation graph, the fault detection method can analyze and locate the root cause event based on the propagation relationship of the event propagation graph, thereby locating the process/thread level application and the corresponding root cause event. By analyzing and locating the root cause of abnormal events at the granularity of event nodes and outputting application dependency paths and event propagation paths, the method achieves correlation root cause analysis across multi-dimensional and multi-type data and solves technical problems of general fault detection methods such as difficulty in analyzing the root cause of faults and imprecise positioning granularity.
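  • The following sketch illustrates this backward walk over the event propagation edges to the propagation start event node; the function name and data shapes are assumptions, and the example data mirrors the network IO case described later in this text.

```python
def find_root_cause(fault_event, propagation_edges, node_info):
    """propagation_edges holds (source_event, affected_event) pairs; node_info maps an
    event node to its application instance and/or process or thread. Walk backwards
    from the fault event node to the propagation start event node."""
    incoming = {}
    for source, affected in propagation_edges:
        incoming.setdefault(affected, []).append(source)
    current = fault_event
    while incoming.get(current):        # follow propagation edges back to the start node
        current = incoming[current][0]  # simplification: follow a single propagation path
    return {"root_cause_event": current, **node_info.get(current, {})}

# Mirroring the network IO example later in the text: b.3 -> b.2 -> b.1.
edges = {("b.3", "b.2"), ("b.2", "b.1")}
info = {"b.3": {"root_cause_application_instance": "second remote dictionary service cluster"}}
print(find_root_cause("b.1", edges, info))  # root cause event is b.3
```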
  • the master node 1 in the operating system to be detected may include: application 11 and application 12.
  • the master node 2 may include: application 21 and application 22.
  • the master node n may include: application n1 and application n2.
  • Application 11 is used to execute event a.
  • Application 12 is used to perform event b.
  • Application 21 is used to perform event c.
  • Application 22 is used to perform event d.
  • Application n1 is used to execute event m.
  • Application n2 is used to execute event n.
  • the electronic device can obtain the current operating system running data of application 11, application 12, application 21, application 22, application n1 and application n2 in the operating system to be detected. Then, the electronic device can build an application dependency relationship based on the current operating system operating data and preset dependency rules.
  • the current operating system running data includes: application 11 executed event a in the current time period, and application 12 executed event b in the current time period.
  • Preset dependency rules include: applications that execute events at the same time during the current time period have dependencies.
  • then the electronic device may determine that application 11 and application 12 have an application dependency relationship and, based on the fact that application 11 and application 12 have an application dependency relationship, determine that event a and event b have an event propagation relationship.
  • the electronic device can determine that application 11 has an application dependency relationship with application 22, application 22 has an application dependency relationship with application n1, and application n1 has an application dependency relationship with application 21.
  • the electronic device can determine that event d and event m have an event propagation relationship, and event m and event c have an event propagation relationship.
  • FIG. 15 shows a schematic flowchart of yet another fault detection method provided by an embodiment of the present application. As shown in Figure 15, the steps of this fault detection method specifically include:
  • the application layer processing module detects whether the application quality of the APP is abnormal.
  • the network layer processing module detects whether the network device has network problems.
  • the system layer processing module obtains the current operating system operating data of the operating system to be detected, and determines the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules.
  • the current operating system operating data collected by the system layer processing module may include: indicator data, log data, link data and other stateless data.
  • the current operating system event data determined by the system layer processing module may include stateful data of the event type.
  • the system layer processing module completes the identification of grayscale faults based on the obtained current operating system operating data and determined current operating system event data, as well as correlation analysis of multi-dimensional long-period time series data.
  • the system layer processing module constructs an application dependency relationship and an event propagation relationship, and determines the root cause event that causes the system fault in the system fault detection result, as well as the root cause application instance and/or the root cause process or thread based on the event propagation relationship.
  • in the embodiment of the present application, there is a loose coupling relationship among the application layer processing module detecting whether the application quality of the APP is abnormal, the network layer processing module detecting whether the network device has network problems, and the system layer processing module detecting the system fault detection result of the operating system to be detected.
  • the fault detection method provided by the embodiment of the present application eliminates the step of drilling down layer by layer.
  • the ITIM layer provides independent data collection, grayscale fault identification and root cause analysis capabilities, and can actively push the system fault detection results to the APM layer and the NPMD layer, so that the application layer processing module and the network layer processing module can accurately determine grayscale faults based on the system fault detection results.
  • Figure 16 shows an application dependency graph and an event propagation graph of a network IO abnormal event type provided by an embodiment of the present application.
  • the master node 1 in the operating system to be detected may include: an application market slave node and a search microservice slave node.
  • Master node 2 can include: negative one screen (nginx) slave node.
  • the master node n may include: a negative one-screen service slave node and a first remote dictionary service cluster (redis cluster server) slave node.
  • the master node n+1 may include: a second remote dictionary service cluster slave node.
  • the events at the APM layer corresponding to the slave node in the application market include: a.1. High-frequency search result null exception events.
  • the events of searching the APM layer corresponding to the microservice slave node include: a.2, application version upgrade event.
  • the events of the APM layer corresponding to master node 1 include: a.3, CPU high load event.
  • the events of the APM layer corresponding to the negative one screen slave node include: a.4. Microservice response timeout event.
  • the ITIM layer events corresponding to the negative one screen slave node include: b.1, N-T2 link round-trip time (RTT) abnormal event.
  • the events at the APM layer corresponding to the slave node of the negative one screen service include: a.5. Abnormal events in obtaining news announcement information.
  • the events at the ITIM layer corresponding to the negative one-screen service slave node include: b.2, T2-R2 link RTT abnormal event.
  • Events at the ITIM layer corresponding to the slave node of the second remote dictionary service cluster include: b.3. Abnormal packet loss events in the second remote dictionary service network.
  • Figure 17 shows a schematic flow chart of a fault detection method for network IO abnormal event type provided by an embodiment of the present application.
  • the application layer processing module detects APP abnormal events.
  • specifically, the application layer processing module detected an application performance degradation abnormality corresponding to the a.4 abnormal event on node 2 (microservice response timeout), and at the same time detected the a.1 (high-frequency search result null exception) and a.2 (application version upgrade) abnormal events on node 1 and the a.5 abnormal event (abnormal acquisition of news announcement information) on node n.
  • the network layer processing module does not detect network abnormal events.
  • the system layer processing module collects network I/O, disk I/O, scheduling, memory, process/thread, file system, disk, CPU and container data of node 1, node 2,..., node n+1.
  • the system layer processing module identifies the b.1 abnormal event on node 2, the b.2 abnormal event on node n, and the b.3 abnormal event on node n+1.
  • the system layer processing module can identify the b.1 abnormal event on node 2 (the negative one-screen service RTT delay abnormality from node 2's Nginx to node n), and the b.2 abnormal event on node n (node n The RTT delay of the negative one-screen service to the second remote dictionary service cluster of node n+1 is abnormal), and the b.3 abnormal event of node n+1 (the second remote dictionary service cluster has a network packet loss abnormality).
  • the system layer processing module determines the root cause location result.
  • the application dependencies may include: the application market of node 1 depends on the search microservice of node 1; the Nginx service of node 2 depends on the application market of node 1; the Nginx of node 2 depends on the negative one-screen service of node n; the negative one-screen service of node n depends on the first remote dictionary service cluster; the negative one-screen service of node n depends on the second remote dictionary service cluster of node n+1.
  • the specific process of determining the root cause location result based on the event propagation relationship is: Since the negative one-screen service of node n depends on the second remote dictionary service cluster of node n+1, b.3 can be propagated to b.2. Since node 2 Ngnix relies on the negative one-screen service of node n, so b.2 can be propagated to b.1.
  • the root cause of the abnormal event in b.3 is due to the network protocol stack buffer (buffer) queue being full, so b.1 Caused by b.3, the root cause is the root cause of the abnormal event b.3.
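  • As a minimal illustration (not taken from the patent text), the backward walk just described can be expressed over a small propagation map; the event identifiers, dependency directions and buffer-queue annotation come from the example above, while the Python structures and function names are assumptions made for this sketch.

```python
# Hypothetical encoding of the Figure 16/17 example: each ITIM-layer abnormal event points
# to the event it can propagate to, mirroring the application dependencies listed above.
propagates_to = {
    "b.3": "b.2",  # negative one-screen service (node n) depends on the redis cluster of node n+1
    "b.2": "b.1",  # Nginx (node 2) depends on the negative one-screen service (node n)
}
annotations = {"b.3": "network protocol stack buffer queue full"}

def root_cause_of(event: str) -> str:
    """Walk the propagation edges backwards from a detected fault event to its origin."""
    reverse = {dst: src for src, dst in propagates_to.items()}
    while event in reverse:
        event = reverse[event]
    return event

origin = root_cause_of("b.1")
print(origin, "->", annotations.get(origin))  # b.3 -> network protocol stack buffer queue full
```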
  • Figure 18 shows the application dependency graph and event propagation graph of a mixed grayscale fault provided by the embodiment of the present application.
  • Mixed grayscale faults refer to cases in which abnormal events of different types, such as network IO, disk IO, scheduling, memory, process/thread, file system, disk, CPU and container events, coexist on the abnormal event propagation path.
  • the master node 1 in the operating system to be detected may include: an application market slave node and a search microservice slave node.
  • Master node 2 can include: negative one screen (nginx) slave node.
  • the master node n may include: a negative one-screen service slave node and a first remote dictionary service cluster (redis cluster server) slave node.
  • the master node n+1 may include: a second remote dictionary service cluster slave node.
  • the events at the APM layer corresponding to the slave node in the application market include: a.1. High-frequency search result null exception events.
  • the events at the APM layer corresponding to the search microservice slave node include: a.2, application version upgrade event.
  • the events of the APM layer corresponding to master node 1 include: a.3, CPU high load event.
  • the events of the APM layer corresponding to the negative one screen slave node include: a.4. Microservice response timeout event.
  • the ITIM layer events corresponding to the negative one screen slave node include: b.1, N-T2 link round-trip time (RTT) abnormal event.
  • RTT round-trip time
  • the events at the APM layer corresponding to the slave node of the negative one screen service include: a.5. Abnormal events in obtaining news announcement information.
  • the events at the ITIM layer corresponding to the negative one-screen service slave node include: b.2, T2-R2 link RTT abnormal event.
  • Events at the ITIM layer corresponding to the slave node of the second remote dictionary service cluster include: b.3, second remote dictionary service CPU interference (CPU abnormality) event.
  • FIG. 19 shows a schematic flowchart of a fault detection method for mixed grayscale faults provided by an embodiment of the present application.
  • the application layer processing module detects APP abnormal events.
  • the application layer processing module detects an application performance degradation anomaly, corresponding to the a.4 abnormal event on node 2 (microservice response timeout), and also detects the a.1 (high-frequency search result null anomaly) and a.2 (application version upgrade) abnormal events on node 1 and the a.5 abnormal event on node n (abnormal acquisition of news announcement information).
  • the network layer processing module does not detect network abnormal events.
  • the system layer processing module collects the network I/O, disk I/O, scheduling, memory, process/thread, file system, disk, CPU and container data of node 1, node 2, ..., node n+1.
  • the system layer processing module identifies the b.1 abnormal event on node 2, the b.2 abnormal event on node n, and the b.3 abnormal event on node n+1.
  • the system layer processing module can identify the b.1 abnormal event on node 2 (abnormal RTT delay from the Nginx of node 2 to the negative one-screen service of node n), the b.2 abnormal event on node n (abnormal RTT delay from the negative one-screen service of node n to the second remote dictionary service cluster of node n+1), and the b.3 abnormal event on node n+1 (the second remote dictionary service cluster has a CPU interference abnormality).
  • the system layer processing module determines the root cause location result.
  • the application dependencies may include: the application market of node 1 depends on the search microservice of node 1; the Nginx service of node 2 depends on the application market of node 1; the Nginx of node 2 depends on the negative one-screen service of node n; the negative one-screen service of node n depends on the first remote dictionary service cluster; the negative one-screen service of node n depends on the second remote dictionary service cluster of node n+1; and the second remote dictionary service cluster of node n+1 has a two-way dependency with Spark Worker.
  • the specific process of determining the root cause location result based on the event propagation relationship is as follows: since the negative one-screen service of node n depends on the second remote dictionary service cluster of node n+1, b.3 can propagate to b.2; and since the Nginx of node 2 depends on the negative one-screen service of node n, b.2 can propagate to b.1.
  • the root cause of abnormal event b.3 is that the batch-processing workload of the in-memory cluster computing platform (Spark Worker) preempts the CPU; therefore, b.1 is caused by b.3, and the root cause is the root cause of abnormal event b.3.
  • in actual implementation, the fault detection device described in the embodiments of the present application may include one or more hardware structures and/or software modules for implementing the corresponding fault detection methods described above; these hardware structures and/or software modules may constitute a fault detection device.
  • FIG. 20 shows a schematic structural diagram of a fault detection device provided by an embodiment of the present application.
  • the fault detection device may include: an acquisition unit 2001 and a processing unit 2002;
  • the acquisition unit 2001 is used to acquire the current operating system running data of the operating system to be detected.
  • the acquisition unit 2001 is used to perform S501.
  • the processing unit 2002 is configured to determine the current operating system event data of the operating system to be detected based on the current operating system operating data and preset event rules; the current operating system event data includes: in the operating system to be detected, the data generated during the running of process events or thread events of the current operating system and the context information associated with the process events or thread events of the current operating system. For example, with reference to Figure 5, the processing unit 2002 is used to perform S502.
  • the processing unit 2002 is also used to determine the system fault detection result of the operating system to be detected based on the current operating system operating data and the current operating system event data. For example, with reference to Figure 5, the processing unit 2002 is used to perform S503.
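  • A skeleton of how the Figure 20 device could be laid out in code is given below; the unit names and step numbers mirror the text, while the method signatures and placeholder bodies are assumptions of this sketch rather than the patented implementation.

```python
class AcquisitionUnit:
    """Counterpart of acquisition unit 2001."""

    def acquire_running_data(self) -> dict:
        # S501: collect the current operating system running data (metric-type data).
        raise NotImplementedError


class ProcessingUnit:
    """Counterpart of processing unit 2002."""

    def __init__(self, event_rules):
        self.event_rules = event_rules  # preset event rules

    def derive_event_data(self, running_data: dict) -> list:
        # S502: derive event-type data (with associated context) from the running data.
        events = []
        for rule in self.event_rules:
            event = rule(running_data)
            if event is not None:
                events.append(event)
        return events

    def detect_faults(self, running_data: dict, event_data: list) -> dict:
        # S503: combine running data and event data into the system fault detection result.
        raise NotImplementedError
```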
  • the electronic device includes: a system layer processing module, an application layer processing module and a network layer processing module;
  • the system layer processing module is used to determine the system fault detection result of the operating system to be detected. For example, with reference to Figure 7, the system layer processing module is used to execute S701.
  • the system layer processing module is also used to send system fault detection results to the application layer processing module and the network layer processing module.
  • the system layer processing module is used to execute S702.
  • the application layer processing module is used to determine the application fault detection result of the electronic device based on the system fault detection result. For example, with reference to Figure 7, the application layer processing module is used to execute S703.
  • the network layer processing module is used to determine the network fault detection result of the electronic device based on the system fault detection result. For example, with reference to Figure 7, the network layer processing module is used to execute S704.
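  • The loosely coupled, push-based interaction between the three layer modules (S701-S704) could be sketched as follows; only the direction of the data flow comes from the text, and the class and method names are illustrative assumptions.

```python
class ApplicationLayerModule:
    def on_system_result(self, system_result: dict) -> dict:
        # S703: refine the application fault detection result using the pushed system result.
        return {"application_fault": bool(system_result.get("faults"))}


class NetworkLayerModule:
    def on_system_result(self, system_result: dict) -> dict:
        # S704: refine the network fault detection result using the pushed system result.
        return {"network_fault": False}


class SystemLayerModule:
    def __init__(self, subscribers):
        self.subscribers = subscribers  # application layer and network layer modules

    def publish(self, system_result: dict) -> list:
        # S702: push the system fault detection result instead of waiting to be drilled into.
        return [module.on_system_result(system_result) for module in self.subscribers]
```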
  • processing unit 2002 is specifically used to:
  • the status identifiers include: a normal identifier and an abnormal identifier.
  • the processing unit 2002 is used to perform S1101.
  • the processing unit 2002 is used to perform S1102.
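  • One way such a preset state rule could attach a status identifier to event data is sketched below; the 70% CPU threshold and the dictionary layout of the first data are assumed example values, not figures prescribed by the patent.

```python
NORMAL, ABNORMAL = "normal", "abnormal"

def add_status_identifier(event: dict, cpu_threshold: float = 0.70) -> dict:
    """Attach a normal/abnormal status identifier according to an assumed CPU-utilisation rule."""
    labeled = dict(event)
    labeled["status"] = ABNORMAL if event.get("cpu_util", 0.0) > cpu_threshold else NORMAL
    return labeled

# The first data (running data plus labeled event data) would then be fed to the
# pre-trained fault detection model in S1102.
first_data = {
    "running_data": [{"metric": "cpu_util", "value": 0.93}],
    "event_data": [add_status_identifier({"thread": "A", "cpu_util": 0.93})],
}
```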
  • the acquisition unit 2001 is also used to acquire historical operating system data.
  • the acquisition unit 2001 is used to perform S801.
  • the processing unit 2002 is also used to train a fault detection model based on the preset fault identification algorithm and historical operating system data. For example, with reference to FIG. 8 , the processing unit 2002 is used to perform S802.
  • the preset fault identification algorithm includes: a classification algorithm and a parameter optimization algorithm; the processing unit 2002 is specifically used for:
  • the model to be trained is trained to obtain the model to be adjusted;
  • the second data includes historical operating system operating data and historical operating system event data after adding status identifiers;
  • the model to be trained includes: the application feature classification model to be trained and the fault classification model to be trained.
  • the processing unit 2002 is used to perform S902.
  • the parameters of the model to be adjusted are adjusted to obtain a fault detection model;
  • the fault detection model includes: application feature classification model and fault classification model.
  • the processing unit 2002 is used to perform S903.
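  • A compact sketch of the two-stage training described here is shown below, assuming scikit-learn is available; KNN stands in for the unspecified classification algorithm and grid search for the unspecified parameter optimization algorithm, so none of this should be read as the exact training procedure.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

def train_fault_detection_model(second_data_features, app_feature_labels, fault_labels):
    """Train the application feature classification model, then the fault classification model."""
    param_grid = {"n_neighbors": [3, 5, 7]}

    # Stage 1: application feature classification model (classification + parameter optimization).
    feature_search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1_macro")
    feature_search.fit(second_data_features, app_feature_labels)

    # Stage 2: fault classification model, trained on the same labeled data.
    fault_search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1_macro")
    fault_search.fit(second_data_features, fault_labels)

    return feature_search.best_estimator_, fault_search.best_estimator_
```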
  • processing unit 2002 is specifically used to:
  • the system data in the first data that corresponds to the application feature fault detection result is input into the fault classification model to obtain the system fault detection result.
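  • The cascaded use of the two models can then be expressed as below; the dictionary layout of the first data is an assumption made purely to keep the sketch self-contained.

```python
def detect_system_fault(first_data, feature_model, fault_model):
    """Classify the application feature first, then classify the fault on the related system data."""
    feature_result = feature_model.predict(first_data["features"])[0]  # application feature fault detection result
    related_system_data = first_data["system_data_by_feature"][feature_result]
    return fault_model.predict(related_system_data)                    # system fault detection result
```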
  • the processing unit 2002 is also used to construct an application dependency relationship based on the current operating system operating data and preset dependency rules; the application dependency relationship is used to represent the interdependencies between the processes or threads and the application instances in the operating system to be detected. For example, with reference to Figure 13, the processing unit 2002 is used to perform S1301.
  • the processing unit 2002 is also used to construct an event propagation relationship based on the application dependency relationship; the event propagation relationship is used to represent the propagation relationship between application events in the operating system to be detected; an application event is the application event corresponding to a process or thread and an application instance in the operating system to be detected.
  • the processing unit 2002 is used to perform S1302.
  • the processing unit 2002 is also configured to determine the root cause event that causes the system fault in the system fault detection result according to the event propagation relationship, as well as the root cause application instance, and/or the root cause process or thread. For example, with reference to Figure 13, the processing unit 2002 is used to perform S1303.
  • processing unit 2002 is specifically used to:
  • an application dependency graph is constructed; the application dependency graph includes multiple application nodes and edges between the application nodes; a main application node among the multiple application nodes is used to represent a main thread of an application instance in the operating system to be detected; a slave application node among the multiple application nodes is used to represent: a slave thread of an application instance in the operating system to be detected and a dependency instance between application instances in the operating system to be detected; an edge between a first application node and a second application node among the multiple application nodes is used to indicate that there is a dependency relationship between the first application node and the second application node.
  • processing unit 2002 is specifically used to:
  • an event propagation graph is constructed; the event propagation graph includes multiple event nodes in one-to-one correspondence with the multiple application nodes and edges between the event nodes; the multiple event nodes are used to represent the application instances corresponding to the application events in the operating system to be detected; an edge between a first event node and a second event node among the multiple event nodes is used to indicate that there is a propagation relationship between a first event corresponding to the first event node and a second event corresponding to the second event node.
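  • A plain-Python sketch of the two graphs is given below; the one-to-one binding of event nodes to application nodes and the rule that a dependency edge induces a propagation edge come from the text, while the data structures themselves are assumptions.

```python
from collections import defaultdict

class DependencyAndPropagationGraphs:
    def __init__(self):
        self.app_edges = defaultdict(set)    # first application node -> second application nodes
        self.event_of_app = {}               # application node -> event node (one-to-one)
        self.event_edges = defaultdict(set)  # first event node -> second event nodes

    def add_dependency(self, first_app, second_app):
        """Record an edge of the application dependency graph."""
        self.app_edges[first_app].add(second_app)

    def bind_event(self, app_node, event_node):
        """Associate an event node with its application node."""
        self.event_of_app[app_node] = event_node

    def build_event_propagation_graph(self):
        """Derive the event propagation graph from the application dependency graph."""
        for first_app, second_apps in self.app_edges.items():
            for second_app in second_apps:
                if first_app in self.event_of_app and second_app in self.event_of_app:
                    self.event_edges[self.event_of_app[first_app]].add(self.event_of_app[second_app])
        return self.event_edges
```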
  • processing unit 2002 is specifically used to:
  • the fault event node is the event node corresponding to the fault event
  • the fault event is the fault event corresponding to the system fault detection result
  • the process or thread corresponding to the propagation start event node is determined to be the root cause process or thread.
  • the system faults causing the system fault detection result include the following single types of faults or combinations of fault types:
  • network input/output (IO) faults, disk input/output (IO) faults, scheduling faults, memory faults, process or thread faults, file system faults, disk faults, central processing unit (CPU) faults and container faults.
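  • For completeness, the single and combined fault types listed above could be modelled as a simple enumeration; the member names below are illustrative only.

```python
from enum import Enum

class FaultType(Enum):
    NETWORK_IO = "network IO"
    DISK_IO = "disk IO"
    SCHEDULING = "scheduling"
    MEMORY = "memory"
    PROCESS_OR_THREAD = "process or thread"
    FILE_SYSTEM = "file system"
    DISK = "disk"
    CPU = "CPU"
    CONTAINER = "container"

# A combined-type fault can be represented as a set of members, e.g. the mixed grayscale
# example above involves {FaultType.NETWORK_IO, FaultType.CPU}.
```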
  • embodiments of the present application can divide the fault detection device into functional modules according to the above method examples.
  • the above integrated modules can be implemented in the form of hardware or software function modules.
  • the division of modules in the embodiment of the present application is schematic and is only a logical function division. In actual implementation, there may be other division methods.
  • each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
  • An embodiment of the present application also provides an electronic device.
  • the electronic device may be a terminal, and the terminal may be a user terminal such as a mobile phone or a computer.
  • Figure 21 shows a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • the terminal may be the above-mentioned fault detection device, including at least one processor 61, a communication bus 62, a memory 63 and at least one communication interface 64.
  • the processor 61 may be a central processing unit (CPU), a microprocessing unit, an ASIC, or one or more integrated circuits used to control the execution of the programs of the present application.
  • CPU central processing units
  • ASIC application specific integrated circuit
  • the functions implemented by the processing unit 2002 in the fault detection device are the same as those implemented by the processor 61 in FIG. 21 .
  • Communication bus 62 may include a path for communicating information between the above-mentioned components.
  • Communication interface 64 uses any device such as a transceiver to communicate with other devices or communication networks, such as a server, Ethernet, a radio access network (RAN), a wireless local area network (WLAN), etc.
  • RAN radio access networks
  • WLAN wireless local area networks
  • Memory 63 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and be connected to the processing unit through a bus. The memory can also be integrated with the processing unit.
  • the memory 63 is used to store the application program code for executing the solution of the present application, and the processor 61 controls the execution.
  • the processor 61 is used to execute the application program code stored in the memory 63, thereby realizing the functions in the method of the present application.
  • the processor 61 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 21 .
  • the terminal may include multiple processors, such as processor 61 and processor 65 in Figure 21.
  • processors may be a single-CPU processor or a multi-CPU processor.
  • a processor here may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • the terminal may also include an input device 66 and an output device 67.
  • Input device 66 communicates with output device 67 and can accept user input in a variety of ways.
  • the input device 66 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
  • Output device 67 communicates with processor 61 and can display information in a variety of ways.
  • the output device 67 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, or the like.
  • LCD liquid crystal display
  • LED light emitting diode
  • the structure shown in Figure 21 does not constitute a limitation of the terminal; the terminal may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
  • An embodiment of the present application also provides an electronic device, which may be a server, for example.
  • Figure 22 shows a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server may be a fault detection device.
  • the server may vary greatly due to different configurations or performance, and may include one or more processors 71 and one or more memories 72. At least one instruction is stored in the memory 72, and the at least one instruction is loaded and executed by the processor 71 to implement the fault detection method provided by each of the above method embodiments.
  • the server can also have components such as wired or wireless network interfaces, keyboards, and input and output interfaces to facilitate input and output.
  • the server can also include other components for implementing device functions, which will not be described again here.
  • the present application also provides a computer-readable storage medium including instructions; the instructions are stored on the computer-readable storage medium, and when the instructions in the computer-readable storage medium are executed by a processor of a computer device, the computer device can execute the fault detection method provided by the embodiments shown above.
  • the computer-readable storage medium may be a memory 63 including instructions, and the instructions may be executed by the processor 61 of the terminal to complete the above method.
  • the computer-readable storage medium may be a memory 72 including instructions, and the instructions may be executed by the processor 71 of the server to complete the above method.
  • the computer-readable storage medium may be a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • the computer program product includes computer instructions.
  • when the computer instructions run on the fault detection device, the fault detection device is caused to execute the steps shown in any of Figures 5 to 19 above.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the general technology, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: flash memory, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Provided are a fault detection method and apparatus, an electronic device and a storage medium, relating to the field of computer technology and used for quickly and accurately determining a system fault detection result of an operating system to be detected. The fault detection method includes: after acquiring current operating system running data of the operating system to be detected, an electronic device may determine current operating system event data of the operating system to be detected on the basis of the current operating system running data and preset event rules, where the current operating system event data includes: in the operating system to be detected, data generated during the running of process events or thread events of the current operating system and context information associated with the process events or thread events of the current operating system. Subsequently, the electronic device may determine the system fault detection result of the operating system to be detected on the basis of the current operating system running data and the current operating system event data.

Description

一种故障检测方法、装置、电子设备及存储介质
本申请要求于2022年8月3日提交国家知识产权局、申请号为202210927989.5、申请名称为“一种故障检测方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种故障检测方法、装置、电子设备及存储介质。
背景技术
大规模云服务中的云服务器在系统架构上设计了大量的可靠性、可用性等机制。在这些机制的作用下,云服务器可以容忍一些系统故障继续运行。此时,在云服务器的应用性能管理(Application Performance Monitoring,APM)层,云服务器检测到的云服务应用是无故障的。但是,在云服务器的互联网技术基础设施管理(Internet Technology Infrastructure Monitoring,ITIM)层,即操作系统(Operating System,OS)层,云服务器会检测到云服务的操作系统可能存在故障隐患。这种以APM层、ITIM层等不同检测视角造成的检测差异称作灰度故障。
目前,在检测灰度故障时,云服务器通常通过“层层下钻”的方式进行故障诊断,即云服务器依次判断APM层的应用程序(Application,APP)是否故障、网络性能管理与诊断(Network Performance Monitoring and Diagnostics,NPMD)层的网络功能是否故障,以及ITIM层的操作系统是否故障。
而在判断ITIM层的操作系统是否故障时,由于云服务器只能获取到操作系统的中央处理器(central processing unit,CPU)占用率、当前磁盘空间大小等基础信息,因此,通用的故障检测方法还需要结合人工经验,进一步根据上述基础信息确定操作系统是否故障,耗时耗力,效率较低。
发明内容
本申请提供一种故障检测方法、装置、电子设备及存储介质,涉及计算机技术领域,用于快速、准确地检测操作系统中的故障。
为达到上述目的,本申请采用如下技术方案:
第一方面,本申请提供一种故障检测方法,应用于电子设备,包括:
在获取待检测操作系统的当前操作系统运行数据后,电子设备可以基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据。其中,当前操作系统事件数据包括:待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据和当前操作系统的进程事件或线程事件关联的上下文信息。后续,电子设备可以基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果。
上述当前操作系统运行数据为待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据。也就是说,这种数据一般仅仅表示一个进程事件或线程事件的指标(例如CPU利用率等),因此,当前操作系统运行数据为指标类型的数据。而上述当前操作系统事件数据不仅包括当前操作系统的进程事件或线程事件运行过程中产生的数据,还包括当前操作系统进程事件或线程事件关联的上下文信息。也就是说,这种数据不仅可以表示一个进程事件或线程事件的指标,还可以表示这个进程事件或线程事件的事件内容(例如该线程事件为主线程事件),和/或,事件状态(例如该线程事件为异常状态事件)等,因此,当前操作系统事件数据为事件类型的数据。
在这种情况下,电子设备在检测操作系统层级的故障时,不仅可以获取到指标类型的当前操作系统运行数据,还可以基于当前操作系统运行数据确定事件类型的当前操作系统事件数据。后续,电子设备进行故障检测的过程中,相比通用技术中仅通过基础信息(即指标类型的当前操作系统运行数据)结合人工经验进行故障检测的方法,本申请实施例提供的故障检测方法可以基于当前操作系统运行数据和当前操作系统事件数据,快速确定待检测操作系统的系统故障检测结果,无需人工经验的介入,提高了故障检测的效率。
其次,由于当前操作系统事件数据包括当前操作系统进程事件或线程事件关联的上下文信息,因此,电子设备在进行故障检测的过程中,相比通用技术仅通过基础信息结合人工经验进行故障检测,本申请实施例提供的故障检测方法可以挖掘当前操作系统事件数据中的上下文信息,从而基于上下文信息,获取进程事件或者线程事件的上下文关联事件的相关内容,进而根据这些内容结合当前操作系统运行数据,准确的确定系统故障检测结果,提高了故障检测的准确度。
再次,由于当前操作系统运行数据和当前操作系统事件数据具备细粒度(可以精细到待检测系统的线程事件或进程事件)、宽范围检测数据(不仅包括指标类型的数据,还包括事件类型的数据)的特点,因此,本申请实施例提供的故障检测方法也可以检测到更细粒度的故障检测结果(例如具体线程事件或进程事件的故障检测结果),进一步提高了故障检测的准确度。
此外,本申请实施例提供的故障检测方法无需在电子设备的操作系统层级添加各种故障检测工具,仅需操作系统自身具备的数据采集功能和数据处理功能,便可以快速检测故障,实现了基于操作系统轻量级的故障检测。
在一种可能的实现方式中,电子设备包括:系统层处理模块、应用层处理模块和网络层处理模块;上述确定待检测操作系统的系统故障检测结果的方法具体包括:系统层处理模块确定待检测操作系统的系统故障检测结果。
本申请实施例提供的故障检测方法还包括:系统层处理模块可以向应用层处理模块和网络层处理模块发送系统故障检测结果。相应的,应用层处理模块基于系统故障检测结果,确定电子设备的应用故障检测结果。相应的,网络层处理模块基于系统故障检测结果,确定电子设备的网络故障检测结果。
由上可知,电子设备中的系统层处理模块可以在确定系统故障检测结果后,向应用层处理模块和网络层处理模块发送系统故障检测结果,以使得应用层处理模块基于系统故障检测结果,确定电子设备的应用故障检测结果,以及网络层处理模块基于系统故障检测结果,确定电子设备的网络故障检测结果,从而实现了系统层处理模块分别与应用层和网络层之间通过松耦合的方式,快速、准确的确定待检测操作系统的灰度故障的故障检测结果,提高了故障检测效率。
在一种可能的实现方式中,上述基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果的方法具体包括:基于预设状态规则,对当前操作系统事件数据添加状态标识,并将第一数据输入到预先训练好的故障检测模型中,以得到系统故障检测结果。其中,状态标识包括:正常标识和异常标识;第一数据包括当前操作系统运行数据和添加状态标识后的当前操作系统事件数据;故障检测模型为根据待检测操作系统的历史操作系统数据训练得到的;历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据。
由上可知,由于故障检测模型为根据待检测操作系统的历史操作系统数据训练得到的,历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据,因此,电子设备确定待检测操作系统的系统故障检测结果时,可以通过故障检测模型,快速、准确的确定系统故障检测结果,提高了故障检测效率。
在一种可能的实现方式中,该故障检测方法还包括:获取历史操作系统数据,并基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型。
由上可知,电子设备可以基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型,以便于后续通过故障检测模型,快速、准确的确定系统故障检测结果,提高了故障检测效率。
在一种可能的实现方式中,预设故障识别算法可以包括:分类算法和参数优化算法;上述基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型的方法具体包括:基于预设状态规则,对历史操作系统事件数据添加状态标识,并基于分类算法和第二数据,对待训练模型进行训练,以得到待调整模型。其中,第二数据包括历史操作系统运行数据和添加状态标识后的历史操作系统事件数据;待训练模型包括:待训练的应用特征分类模型和待训练的故障分类模型。后续,可以基于历史操作系统运行数据和参数优化算法,对待调整模型进行参数调整,以得到故障检测模型;故障检测模型包括:应用特征分类模型和故障分类模型。
由上可知,由于预设故障识别算法可以包括分类算法和参数优化算法,因此,分类算法可以用于对故障检测模型的初始模型进行分类训练,参数优化算法可以对分类训练过程中的模型进行参数 调整,从而得到故障检测模型,以便于后续通过故障检测模型,快速、准确的确定系统故障检测结果,提高了故障检测效率。
在一种可能的实现方式中,上述将第一数据输入到预先训练好的故障检测模型中,以得到系统故障检测结果的方法具体包括:将第一数据输入到应用特征分类模型中,以得到应用特征故障检测结果,并将第一数据中,与应用特征故障检测结果对应的系统数据输入到故障分类模型中,以得到系统故障检测结果。
由上可知,由于故障检测模型包括应用特征分类模型和故障分类模型,因此,电子设备可以依次根据应用特征分类模型和故障分类模型确定不同的应用特征,以及每个应用特征对应的故障分类,提高了故障检测的准确度。
在一种可能的实现方式中,该故障检测方法还包括:
根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系,以及根据应用依赖关系,构建事件传播关系。其中,应用依赖关系用于表示待检测操作系统中各进程或线程及各应用实例之间的相互依赖关系;事件传播关系用于表示待检测操作系统中应用事件之间的传播关系;应用事件为待检测操作系统中的进程或线程及应用实例对应的应用事件。后续,可以根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程。
由上可知,在确定系统故障检测结果后,为了对待检测操作系统中发生故障的事件或应用进行根因定位,电子设备还可以构建应用依赖关系和事件传播关系,以此来进一步的实现故障根因定位。
在一种可能的实现方式中,上述根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系的方法具体包括:根据当前操作系统运行数据和预设依赖规则,构建应用依赖图;应用依赖图包括多个应用节点以及应用节点之间的边;多个应用节点中的主应用节点用于表示待检测操作系统中应用实例的主线程;多个应用节点中的从应用节点用于表示:待检测操作系统中应用实例的从线程和待检测操作系统中应用实例之间的依赖实例;多个应用节点中的第一应用节点与第二应用节点之间的边用于表示第一应用节点与第二应用节点之间存在依赖关系。
由上可知,电子设备可以以应用依赖图的显示形式,构建应用依赖关系,以便于后续可以方便、快捷的根据应用依赖图进行根因定位。
在一种可能的实现方式中,上述根据应用依赖关系,构建事件传播关系,包括:根据应用依赖图,构建事件传播图;事件传播图包括与多个应用节点一一对应的多个事件节点以及事件节点之间的边;多个事件节点用于表示待检测操作系统中应用事件对应的应用实例;多个事件节点中的第一事件节点与第二事件节点之间的边用于表示第一事件节点对应的第一事件与第二事件节点对应的第二事件之间存在传播关系。
由上可知,电子设备可以以事件传播图的显示形式,构建事件传播关系,以便于后续可以方便、快捷的根据事件传播图进行根因定位。
在一种可能的实现方式中,上述根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程的方法具体包括:
确定与故障事件节点之间具有边的传播事件节点,并将传播事件节点中的传播起始事件节点对应的事件确定为根因事件,以及将传播起始事件节点对应的应用实例确定为根因应用实例,和/或,将传播起始事件节点对应的线程确定为根因进程或线程。其中,故障事件节点为故障事件对应的事件节点;故障事件为系统故障检测结果对应的故障事件。
由上可知,电子设备可以根据应用依赖图和事件传播图,快速、准确的确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因线程。
在一种可能的实现方式中,引起系统故障检测结果中的系统故障包括以下单个类型的故障或组合类型的故障:网络输入输出IO故障、磁盘输入输出IO故障、调度故障、内存故障、进程或线程故障、文件系统故障、磁盘故障、中央处理器CPU故障和容器故障。
由上可知,电子设备可以确定各种单个类型或者组合类型的故障的系统故障检测结果,从而可以检测到各种类型的系统灰度故障。
第二方面,本申请提供一种故障检测装置,包括:获取单元和处理单元;获取单元,用于获取 待检测操作系统的当前操作系统运行数据;处理单元,用于基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据;当前操作系统事件数据包括:待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据和当前操作系统的进程事件或线程事件关联的上下文信息;处理单元,还用于基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果。
在一种可能的实现方式中,电子设备包括:系统层处理模块、应用层处理模块和网络层处理模块。系统层处理模块用于确定待检测操作系统的系统故障检测结果;系统层处理模块还用于向应用层处理模块和网络层处理模块发送系统故障检测结果;应用层处理模块用于基于系统故障检测结果,确定电子设备的应用故障检测结果;网络层处理模块用于基于系统故障检测结果,确定电子设备的网络故障检测结果。
在一种可能的实现方式中,处理单元,具体用于:基于预设状态规则,对当前操作系统事件数据添加状态标识;状态标识包括:正常标识和异常标识;将第一数据输入到预先训练好的故障检测模型中,以得到系统故障检测结果;第一数据包括当前操作系统运行数据和添加状态标识后的当前操作系统事件数据;故障检测模型为根据待检测操作系统的历史操作系统数据训练得到的;历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据。
在一种可能的实现方式中,获取单元,还用于获取历史操作系统数据;处理单元,还用于基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型。
在一种可能的实现方式中,预设故障识别算法包括:分类算法和参数优化算法;处理单元,具体用于:基于预设状态规则,对历史操作系统事件数据添加状态标识;基于分类算法和第二数据,对待训练模型进行训练,以得到待调整模型;第二数据包括历史操作系统运行数据和添加状态标识后的历史操作系统事件数据;待训练模型包括:待训练的应用特征分类模型和待训练的故障分类模型;基于历史操作系统运行数据和参数优化算法,对待调整模型进行参数调整,以得到故障检测模型;故障检测模型包括:应用特征分类模型和故障分类模型。
在一种可能的实现方式中,处理单元,具体用于:将第一数据输入到应用特征分类模型中,以得到应用特征故障检测结果;将第一数据中,与应用特征故障检测结果对应的系统数据输入到故障分类模型中,以得到系统故障检测结果。
在一种可能的实现方式中,处理单元,还用于根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系;应用依赖关系用于表示待检测操作系统中各进程或线程及各应用实例之间的相互依赖关系;处理单元,还用于根据应用依赖关系,构建事件传播关系;事件传播关系用于表示待检测操作系统中应用事件之间的传播关系;应用事件为待检测操作系统中的进程或线程及应用实例对应的应用事件;处理单元,还用于根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程。
在一种可能的实现方式中,处理单元,具体用于:根据当前操作系统运行数据和预设依赖规则,构建应用依赖图;应用依赖图包括多个应用节点以及应用节点之间的边;多个应用节点中的主应用节点用于表示待检测操作系统中应用实例的主线程;多个应用节点中的从应用节点用于表示:待检测操作系统中应用实例的从线程和待检测操作系统中应用实例之间的依赖实例;多个应用节点中的第一应用节点与第二应用节点之间的边用于表示第一应用节点与第二应用节点之间存在依赖关系。
在一种可能的实现方式中,处理单元,具体用于:根据应用依赖图,构建事件传播图;事件传播图包括与多个应用节点一一对应的多个事件节点以及事件节点之间的边;多个事件节点用于表示待检测操作系统中应用事件对应的应用实例;多个事件节点中的第一事件节点与第二事件节点之间的边用于表示第一事件节点对应的第一事件与第二事件节点对应的第二事件之间存在传播关系。
在一种可能的实现方式中,处理单元,具体用于:确定与故障事件节点之间具有边的传播事件节点;故障事件节点为故障事件对应的事件节点;故障事件为系统故障检测结果对应的故障事件;将传播事件节点中的传播起始事件节点对应的事件确定为根因事件,以及将传播起始事件节点对应的应用实例确定为根因应用实例,和/或,将传播起始事件节点对应的进程或线程确定为根因进程或线程。
在一种可能的实现方式中,引起系统故障检测结果中的系统故障包括以下单个类型的故障或组合类型的故障:网络输入输出IO故障、磁盘输入输出IO故障、调度故障、内存故障、进程或线程故障、文件系统故障、磁盘故障、中央处理器CPU故障和容器故障。
第三方面,本申请提供一种电子设备,可以包括:处理器和用于存储处理器可执行指令的存储器;其中,处理器被配置为执行所述指令,以实现上述第一方面中任一种可能的实现方式中的故障检测方法。
第四方面,本申请提供一种计算机可读存储介质,计算机可读存储介质上存储有指令,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得所述电子设备能够执行上述第一方面中任一种可能的实现方式中的故障检测方法。
第五方面,本申请提供一种计算机程序产品,该计算机程序产品包括计算机指令,当计算机指令在电子设备的处理器上运行时,使得电子设备的处理器执行如第一方面中任一种可能的实现方式中的实现方式所述的故障检测方法。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本申请。
可以理解地,上述各个方面所提供的故障检测装置、电子设备、计算机可读存储介质以及计算机程序产品均应用于上文所提供的故障检测方法,因此,其所能达到的有益效果可参考上文所提供的故障检测的方法中的有益效果,此处不再赘述。
附图说明
图1为本申请实施例提供的一种灰度故障的场景示意图;
图2为通用技术提供的故障检测方法的流程示意图;
图3为本申请实施例提供的一种故障检测系统的结构示意图;
图4为本申请实施例提供的一种电子设备的内部模块结构框图;
图5为本申请实施例提供的一种故障检测方法的流程示意图一;
图6为本申请实施例提供的一种故障检测方法的流程示意图二;
图7为本申请实施例提供的一种故障检测方法的流程示意图三;
图8为本申请实施例提供的一种故障检测方法的流程示意图四;
图9为本申请实施例提供的一种故障检测方法的流程示意图五;
图10为本申请实施例提供的一种故障检测方法的流程示意图六;
图11为本申请实施例提供的一种故障检测方法的流程示意图七;
图12为本申请实施例提供的一种故障检测方法的流程示意图八;
图13为本申请实施例提供的一种故障检测方法的流程示意图九;
图14为本申请实施例提供的一种故障检测方法的流程示意图十;
图15为本申请实施例提供的一种故障检测方法的流程示意图十一;
图16为本申请实施例提供的一种故障检测方法的流程示意图十二;
图17为本申请实施例提供的一种故障检测方法的流程示意图十三;
图18为本申请实施例提供的一种故障检测方法的流程示意图十四;
图19为本申请实施例提供的一种故障检测方法的流程示意图十五;
图20为本申请实施例提供的一种故障检测装置的结构示意图;
图21示出了本申请实施例提供的一种终端的结构示意图;
图22示出了本申请实施例提供的一种服务器的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
本申请中字符“/”,一般表示前后关联对象是一种“或者”的关系。例如,A/B可以理解为A或者 B。
术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
此外,本申请的描述中所提到的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或模块的过程、方法、系统、产品或设备没有限定于已列出的步骤或模块,而是可选地还包括其他没有列出的步骤或模块,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或模块。
另外,在本申请实施例中,“示例性的”、或者“例如”等词用于表示作例子、例证或说明。本申请中被描述为“示例性的”或“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”、或者“例如”等词旨在以具体方式呈现概念。
在对本申请提供的故障检测方法进行详细介绍之前,先对本申请涉及的相关要素、应用场景、实施环境进行简单介绍。
首先,对本申请涉及的相关要素进行简单介绍。
指标数据(metrics):操作系统中一些可进行聚合计算的原子型数据。
示例性的,指标数据可以是CPU占用情况、系统内存占用、接口响应时间、接口响应每秒查询率(queries per second,QPS)等。
又一示例性的,指标数据还可以是用于表示任务队列当前深度的度量值,在元素入队或出队时可以被更新。
又一示例性的,指标数据还可以是用于表示超文本传输协议(Hyper Text Transfer Protocol,HTTP)请求个数的计数器,在新请求到来时进行累加。
这些指标数据都是根据时间序列存储的数据值,可以在一段时间内进行一些求和、求平均、百分位等聚合计算,并用于后续操作系统的数据分析与整理。
日志数据(logging):logging是操作系统运行时发生的一个个事件的记录,可以用于记录操作系统的离散事件,以及为反馈排查问题提供详细的信息。
链路数据(Tracing):用于记录消息请求范围内的调用信息。链路数据是排查系统性能问题的利器。链路数据不仅有利于梳理接口及服务间调用的关系,还有助于排查慢请求产生的原因或异常发生的原因。
示例性的,链路数据可以记录一次远程方法调用的执行过程和耗时。
又一示例性的,在微服务中一般有多个调用信息,如从最外层的网关开始,A服务调用B服务,调用数据库、缓存等。在操作系统中,链路数据可以清楚展现某条调用链中从主调方到被调方内部所有的调用信息。
无状态数据(Stateless):即本申请实施例中的操作系统运行数据,用于表示操作系统中,一次操作对应的数据。本申请实施例中,电子设备中部署有用于采集数据的采集探针。采集探针采集到的操作系统运行数据必须全部来自于无状态数据所携带的信息以及可以被所有数据所使用的公共信息。
有状态数据(Stateful):即本申请实施例中的操作系统事件数据,用于表示操作系统中,有数据存储功能的数据。这种数据可以存储该数据对应的进程事件或线程事件的事件内容(例如该线程事件为主线程事件),和/或,事件状态(例如该线程事件为异常状态事件)等。本申请实施例中,采集探针采集到的操作系统事件数据一般包括该操作系统事件数据的相关信息,即上下文信息。
数据采集探针在采集有状态数据和无状态数据时,可以通过确定两个来自相同标签的数据在操作系统内是否具备上下文关系,来进行状态化的判断,即具备上下文关系的数据为有状态数据,不具备上下文关系的数据为无状态数据。
如背景技术所描述,大规模云服务中的云服务器在系统架构上设计了大量的可靠性、可用性等机制。在这些机制的作用下,云服务器可以容忍一些系统故障继续运行。此时,在云服务器的APM层,云服务器检测到的云服务应用是无故障的。但是,在云服务器的ITIM层,即OS层,云服务器会检测到云服务的操作系统可能存在故障隐患。这种以APM层、ITIM层等不同检测视角造成的 检测差异称作灰度故障。
灰色故障通常包括网络输入输出(Input/Output,IO)故障(例如随机丢包)、磁盘IO故障(例如产生磁盘碎片)、内存抖动/泄漏故障、CPU调度/干扰故障、容量压力故障等造成操作系统的性能下降的非致命异常故障。据统计分析,灰度故障是大规模云服务生产环境中影响最大的一类故障。
图1示出了本申请实施例提供的一种灰度故障的场景示意图。如图1所示,在APM层,APM层的故障检测模块(observation)1可以获取APM层的应用程序(application,APP)1、APP 2、APP 3等APP的检测报告(probe report)。
相应的,在NPMD层,NPMD层的observation 2可以获取NPMD层的网络设备(network devices)的检测报告。
相应的,在ITIM层,ITIM层的observation 3可以获取ITIM层的OS 1、OS 2、OS 3等OS的检测报告。
由于大规模云服务中的云服务器在系统架构上设计了大量的可靠性、可用性等机制,因此,APM层的observation 1基于获取的检测报告确定的故障检测结果与ITIM层基于获取的检测报告确定的故障检测结果不同,即APM层、ITIM层等不同检测视角造成的检测差异即为灰度故障。
目前,在检测灰度故障时,智能运维(Artificial intelligence for IT Operations,AIOps)的故障检测方法是以APM层为中心,自上而下(Top-Down)的技术体系进行故障检测。
具体的,通用的故障检测方法以APM层为视角,以感知APM层的APP的服务等级指标(Service Level Indicator,SLI)为入口,“层层下钻”的方式进行故障诊断,即云服务器依次判断APM层的APP是否故障、NPMD层的网络功能是否故障,以及ITIM层的操作系统是否故障。
在这种情况下,通用技术通常使用数十种管理和监控工具,通常是APM工具、NPMD工具和无数特定于孤岛的ITIM工具的组合。在这种情况下,因为APM工具、NPMD工具和ITIM工具不相关,即它们本质上使用的计算机语言不同,导致这些工具无法获得有关其他层级正在发生的事件的事件信息,只能实现APM层、NPMD层、ITIM层各自层级的检测与分析。
而在判断ITIM层的操作系统是否故障时,由于云服务器只能获取到操作系统的CPU利用率、当前磁盘空间大小等基础信息,因此,通用的故障检测方法还需要结合人工经验,进一步根据上述基础信息确定操作系统是否故障,耗时耗力,效率较低。
图2示出了通用技术中,一种灰度故障检测方法的流程示意图。如图2所示,通用的灰度故障检测方法具体包括:
S1、APM层的检测模块检测APP的应用质量是否异常。
其中,APP的应用质量异常具体包括APP的性能下降、用户体验不佳(例如使用卡顿)等。
S2、NPMD层的检测模块检测网络设备是否有网络问题。
S3、获取ITIM层的通用检测数据。
可选的,获取ITIM层的通用检测数据时,可以通过ITIM层的数据采集模块获取,也可以人工获取。
S4、基于APM层的检测结果、NPMD层的检测结果、ITIM层的检测结果和运维专家经验确定灰度故障检测结果。
S5、基于灰度故障检测结果和运维专家经验确定故障诊断结果。
由上可知,通用的故障检测方法还需要结合人工经验,进一步根据上述基础信息确定操作系统是否故障,耗时耗力,效率较低。
针对上述问题,本申请提供一种故障检测方法,应用于电子设备,包括:在获取待检测操作系统的当前操作系统运行数据后,电子设备可以基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据。后续,电子设备可以基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果。
上述当前操作系统运行数据为待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据。也就是说,这种数据一般仅仅表示一个进程事件或线程事件的指标(例如CPU利用率等),因此,当前操作系统运行数据为指标类型的数据。而上述当前操作系统事件数据不仅 包括当前操作系统的进程事件或线程事件运行过程中产生的数据,还包括当前操作系统进程事件或线程事件关联的上下文信息。也就是说,这种数据不仅可以表示一个进程事件或线程事件的指标,还可以表示这个进程事件或线程事件的事件内容(例如该线程事件为主线程事件),和/或,事件状态(例如该线程事件为异常状态事件)等,因此,当前操作系统事件数据为事件类型的数据。
在这种情况下,电子设备在检测操作系统层级的故障时,不仅可以获取到指标类型的当前操作系统运行数据,还可以基于当前操作系统运行数据确定事件类型的当前操作系统事件数据。后续,电子设备进行故障检测的过程中,相比通用技术中仅通过基础信息(即指标类型的当前操作系统运行数据)结合人工经验进行故障检测的方法,本申请实施例提供的故障检测方法可以基于当前操作系统运行数据和当前操作系统事件数据,快速确定待检测操作系统的系统故障检测结果,无需人工经验的介入,提高了故障检测的效率。
其次,由于当前操作系统事件数据包括当前操作系统进程事件或线程事件关联的上下文信息,因此,电子设备在进行故障检测的过程中,相比通用技术仅通过基础信息结合人工经验进行故障检测,本申请实施例提供的故障检测方法可以挖掘当前操作系统事件数据中的上下文信息,从而基于上下文信息,获取进程事件或者线程事件的上下文关联事件的相关内容,进而根据这些内容结合当前操作系统运行数据,准确的确定系统故障检测结果,提高了故障检测的准确度。
再次,由于当前操作系统运行数据和当前操作系统事件数据具备细粒度(可以精细到待检测系统的线程事件或进程事件)、宽范围检测数据(不仅包括指标类型的数据,还包括事件类型的数据)的特点,因此,本申请实施例提供的故障检测方法也可以检测到更细粒度的故障检测结果(例如具体线程事件或进程事件的故障检测结果),进一步提高了故障检测的准确度。
此外,本申请实施例提供的故障检测方法无需在电子设备的操作系统层级添加各种故障检测工具,仅需操作系统自身具备的数据采集功能和数据处理功能,便可以快速检测故障,实现了基于操作系统轻量级的故障检测。
下面将结合附图对本实施例的实施方式进行详细描述。
本申请实施例提供的故障检测方法可以应用于故障检测系统。图3示出了本申请实施例提供的一种故障检测系统的结构示意图。
如图3所示,本申请实施例提供的故障检测系统包括云终端301、网络设备302和云服务器303。其中,网络设备302分别与云终端301和云服务器303通信连接。其中,云终端301又叫虚拟终端,它是基于可靠、高速的网络通信、海量存储的实现以及多进程或线程、高效率CPU技术的前提下,利用远程虚拟化技术实现客户终端设备的管理解决方案,通过将远程终端设备硬件、运行软件及硬盘数据等分离成不同的传输层次,构成主机与客户端的动态架构,从而实现集中管理和动态分组分权管理的统一化管理平台。本申请实施例中,云终端301可以请求云服务器303提供的服务资源,并向用户提供对应的云服务。
云服务器是一种简单高效、安全可靠、处理能力可弹性伸缩的计算服务。本申请实施例中,云服务器303可以向云终端301提供服务资源,以使得云终端301向用户提供对应的云服务。
图3中的云终端301可以是指向用户提供语音和/或数据连通性的设备,具有无线连接功能的手持式设备、或连接到无线调制解调器的其他处理设备。云终端301可以经无线接入网(radio access network,RAN)与一个或多个核心网进行通信。无线终端可以是移动终端,如移动电话(或称为“蜂窝”电话)和具有移动终端的计算机,也可以是便携式、袖珍式、手持式、计算机内置的或者车载的移动装置,它们与无线接入网交换语言和/或数据,例如,手机、平板电脑、笔记本电脑、台式电脑、上网本、个人数字助理(personal digital assistant,PDA)。
图3中的网络设备302可以是无线通信的基站或基站控制器等。在本申请实施例中,所述基站可以是全球移动通信系统(global system for mobile communication,GSM),码分多址(code division multiple access,CDMA)中的基站(base transceiver station,BTS),宽带码分多址(wideband code division multiple access,WCDMA)中的基站(node B),物联网(internet of things,IoT)或者窄带物联网(narrow band-internet of things,NB-IoT)中的基站(eNB),未来5G移动通信网络或者未来演进的公共陆地移动网络(public land mobile network,PLMN)中的基站,本申请实施例对此不作任何限制。
图3中的云服务器303可以是服务器集群(由多个服务器组成)中的一个服务器,也可以是该服务器中的芯片,还可以是该服务器中的片上系统,还可以通过部署在物理机上的虚拟机(virtual machine,VM)实现,本申请实施例对此不作限定。
在一种可以实现的方式中,本申请实施例提供的故障检测方法主要用于检测云服务器303操作系统层级的灰度故障。
在又一种可以实现的方式中,本申请实施例提供的故障检测方法也可以用于检测云终端301操作系统层级的灰度故障,还可以用于检测其他具有操作系统层级、网络层级和应用层级的电子设备灰度故障,本申请对此不作限定。
为了便于描述,本申请实施例以检测云服务器303操作系统层级的灰度故障为例进行说明。
如图3所示,云服务器303可以包括APM层、NPMD层和ITIM层。
APM层的observation 1可以获取APM层的APP 1、APP 2、APP 3等APP的检测报告,并基于该检测包括确定APM层的应用故障检测结果。
相应的,NPMD层的observation 2可以获取NPMD层的网络设备的检测报告,并基于该检测包括确定NPMD层的网络故障检测结果。
本申请实施例中,ITIM层除了提供通用的基础设施检测能力,还具备操作系统灰度故障识别与诊断能力。
也就是说,ITIM层的observation 3可以根据本申请实施例提供的故障检测方法,获取ITIM层的OS 1、OS 2、OS 3等OS的检测报告,并基于该检测包括确定ITIM层的故障检测结果。
后续,ITIM层的observation 3可以向APM层的observation 1、NPMD层的observation 2发送ITIM层的系统故障检测结果,以使得APM层的observation 1、NPMD层的observation 2结合ITIM层的系统故障检测结果,确定灰度故障。
相应的,APM层的observation 1也可以向NPMD层的observation 2、ITIM层的observation 3发送APM层的应用故障检测结果,以使得NPMD层的observation 2、ITIM层的observation 3结合APM层的应用故障检测结果,确定灰度故障。
相应的,NPMD层的observation 2也可以向APM层的observation 1、ITIM层的observation 3发送NPMD层的网络故障检测结果,以使得APM层的observation 1、ITIM层的observation 3结合NPMD层的网络故障检测结果,确定灰度故障。
这样,本申请实施例提供的故障检测方法可以使得APM层、NPMD层、ITIM层的孤岛检测工具相互融合。将通用技术中,以应用为视角自上而下的智能运维的故障检测方法,与本申请实施例提供的以操作系统为视角自下而上主动识别、定位灰度故障的方案相结合,减少人工介入,提高运维效率。
图4是本申请实施例提供的电子设备的内部模块结构框图。该电子设备可以采集操作系统中的指标数据、日志数据、链路数据等数据,并基于指标数据、日志数据、链路数据等信息确定事件(Events)等数据,然后进行数据分析识别灰度故障。后续,通过构建应用依赖图及事件传播图进行根因推理,实现对操作系统健康度、灰度故障、根因定位的检测。
该电子设备可以是图3中的云服务器303,也可以是云终端301,还可以是其他具有操作系统层级、网络层级和应用层级的电子设备,本申请实施例对此不作限定。
如图4所示,本申请实施例提的电子设备可以包括:数据采集模块、灰度故障识别模块、根因分析模块、检测模块和数据存储模块。
其中,数据采集模块用于采集操作系统中的指标数据、日志数据、链路数据等数据(即本申请实施例中的操作系统运行数据),并基于指标数据、日志数据、链路数据等数据生成事件数据(即本申请实施例中的操作系统事件数据)。
事件数据用于表示基于指标数据、日志数据、链路数据等数据生成的实时异常事件。
灰度故障识别模块用于基于指标数据、日志数据、链路数据等数据生成的事件数据,以及多维长周期时序数据(即多个时间段下的指标数据、日志数据、链路数据等数据)的关联分析,完成灰度故障的识别。
根因分析模块用于通过静态的服务部署方式生成应用依赖图,并基于领域知识及事件动态生成 事件传播图,并基于构建好的应用依赖图和事件传播图对识别到的灰度故障进行根因推理。
检测模块用于提供对操作系统健康度、异常事件、故障根因的展示功能或数据接口。
数据存储模块用于存储数据采集模块采集到的数据、灰度故障识别模块识别出的灰度故障以及根因分析模块确定的故障根因,并向检测模块提供检测模块所需的数据。
基于上述电子设备中的各个模块的功能,下面对本申请实施例提供的技术方案进行详细说明。
本申请实施例提供的故障检测方法可以应用于电子设备。
下面结合附图对本申请实施例提供的故障检测方法进行详细介绍。
图5示出了本申请实施例提供一种故障检测方法。如图5所示,该故障检测方法具体包括:
S501、电子设备获取待检测操作系统的当前操作系统运行数据。
具体的,在进行灰度故障检测的过程中,电子设备可以基于两个触发条件触发故障检测流程。
第一个触发条件为电子设备主动触发故障检测流程。在这种情况下,电子设备可以直接获取待检测操作系统的当前操作系统运行数据,从而开始故障检测。
可选的,电子设备主动触发故障检测流程可以是预先设定有定时任务。在当前时刻满足定时任务设定的时间时,电子设备可以直接获取待检测操作系统的当前操作系统运行数据,从而开始故障检测。
可选的,上述定时任务可以是周期性的任务。在这种情况下,电子设备可以周期性的获取待检测操作系统的当前操作系统运行数据,从而开始故障检测,从而提高电子设备的待检测操作系统的故障检测效率。
第二个触发条件为电子设备被动触发故障检测流程。在这种情况下,电子设备可以响应于接收到的故障检测指令,获取待检测操作系统的当前操作系统运行数据,从而开始故障检测。
可选的,上述故障检测指令可以是APM层或者NPMD层发送的,也可以是响应于用户的故障检测操作生成的,本申请实施例对此不作限定。
在一种可以实现的方式中,操作系统运行数据可以是待检测操作系统中的指标数据、日志数据、链路数据等数据。指标数据、日志数据、链路数据等数据又称为无状态数据,可以表示当前操作系统运行过程中产生的指标类型的数据。
关于指标数据、日志数据、链路数据等数据的描述可以参考申请涉及的相关要素的简单介绍,在此不再赘述。
当前操作系统运行数据是指当前时间段的系统运行数据。电子设备在当前时间段进行故障检测时,需要获取当前时间段的系统运行数据,从而保证准确的检测到在当前时间段的系统灰度故障。
在一种可以实现的方式中,若电子设备想要检测历史时间段的系统灰度故障,则可以获取历史时间段的系统运行数据。
在一种可以实现的方式中,结合图3,电子设备可以调用数据采集模块,获取待检测操作系统的当前操作系统运行数据。
S502、电子设备基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据。
其中,当前操作系统事件数据包括:待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据和当前操作系统的进程事件或线程事件关联的上下文信息。
在一种可以实现的方式中,操作系统事件数据可以是待检测操作系统中的事件数据。事件数据又称为有状态数据,可以表示当前操作系统的进程事件或线程事件过程中产生的数据和当前操作系统的进程事件或线程事件关联的上下文信息。关于事件数据的描述可以参考申请涉及的相关要素的简单介绍,在此不再赘述。
在一种可以实现的方式中,预设事件规则可以是根据人为经验或者领域知识预先建立好的。在获取到当前操作系统运行数据后,电子设备可以基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据。
也就是说,当前操作系统运行数据可以是单点的运行数据,当前操作系统事件数据可以是根据一个周期内的多个运行数据确定的事件数据。
示例性的,电子设备每五秒检测一次待检测操作系统的CPU利用率,那在当前时间段内的一 分钟可以检测12次CPU利用率,从而得到12个CPU利用率检测结果。在这种情况下,上述12个CPU利用率检测结果即为12个当前操作系统运行数据。
预设事件规则包括:CPU利用率均高于预设数值时为故障事件。当这一分钟的前40秒的CPU利用率均高于预设数值时,电子设备可以获取前40秒中的8个CPU利用率检测结果,并基于这8个CPU利用率检测结果确定这一分钟的前40秒为CPU故障事件,即当前操作系统事件数据。
在一种可以实现的方式中,当前操作系统事件数据还可以包括当前时间段内的漏洞修复、软件升级等系统活动类型的事件。
又一示例性的,预设电子设备获取到的当前操作系统运行数据包括:待检测操作系统在当前时刻的CPU利用率为第一利用率、待检测操作系统在当前时刻的运行线程为线程A。预设事件规则包括:在同一时刻下,待检测操作系统的CPU利用率为A线程的CPU利用率。在这种情况下,电子设备可以基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据包括:线程A在当前时刻的运行过程中占用的CPU利用率为第一利用率。
此外,电子设备还可以获取到关于线程A在当前时刻的运行过程中的上下文信息,例如线程B向线程A发送的运行指令等。
在一种可以实现的方式中,结合图3,电子设备可以调用数据采集模块,基于获取到的当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据。
在又一种可以实现的方式中,结合图3,如图6所示,待检测操作系统中包括数据面和管理面。
其中,数据面中的数据是指APM层的APP在ITIM层运行时,对ITIM层的各种系统资源使用的数据,可以直接反应APP的各种指标与使用情况。
数据面可以包括多个线程节点(例如图6中的节点1、节点n、节点...),每个节点可以包括多个容器。每个容器可以通过系统调用层连接内核层。内核层的管理模块可以基于扩展的伯克利包过滤器(extended Berkeley Packet Filter,eBPF)技术,开发数据面探针模块(Probe Agent)。
在故障检测的过程中,数据采集模块可以基于Probe Agent从数据面中获取指标数据、日志数据、链路数据等无状态数据以及事件数据中的有状态数据,并将获取到的数据存储到数据存储模块中。
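The following is a minimal, illustrative sketch of an eBPF-based probe of the kind the data-plane Probe Agent relies on, written with the bcc toolkit; it merely counts scheduler wake-ups per PID as a stand-in for the stateless metric stream, and the program text, map name and sampling interval are assumptions of this sketch rather than the Probe Agent described here (running it requires root privileges and a kernel with eBPF support).

```python
from bcc import BPF  # assumes the bcc toolkit and kernel headers are installed
import time

BPF_PROGRAM = r"""
BPF_HASH(wakeups, u32, u64);

TRACEPOINT_PROBE(sched, sched_wakeup) {
    u32 pid = args->pid;
    u64 zero = 0;
    u64 *count = wakeups.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;
    }
    return 0;
}
"""

def sample_wakeup_counts(seconds: int = 5) -> dict:
    """Attach the probe, wait, then read back the per-PID wake-up counters."""
    bpf = BPF(text=BPF_PROGRAM)
    time.sleep(seconds)
    return {key.value: leaf.value for key, leaf in bpf["wakeups"].items()}

if __name__ == "__main__":
    print(sample_wakeup_counts(2))
```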
管理面中的数据是指ITIM层中的OS在运行时,OS的管理数据,例如OS版本升级事件数据、磁盘扩容事件数据。
在故障检测的过程中,数据采集模块可以基于管理面的事件代理模块(event Agent)从管理面中事件数据中的活动类的事件数据,并将获取到的数据存储到数据存储模块中。
S503、电子设备基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果。
在一种可以实现的方式中,电子设备在获取到当前操作系统运行数据和当前操作系统事件数据后,可以通过预先训练好的故障检测模型,确定待检测操作系统的系统故障检测结果,具体检测过程可以参考下文实施例的描述。
在又一种可以实现的方式中,电子设备中可以预先存储有多种类型的系统数据(包括系统运行数据和操作系统事件数据)对应的系统故障检测结果。在获取到当前操作系统运行数据和当前操作系统事件数据后,电子设备可以读取系统数据与系统故障检测结果,从而确定待检测操作系统在当前时间段的系统故障检测结果。
其中,多种类型的系统数据(包括系统运行数据和操作系统事件数据)对应的系统故障检测结果可以是电子设备根据历史时间段的系统数据和历史时间段的系统故障检测结果确定的,也可以是根据领域知识预先创建的,本申请实施例对此不作限定。
在一种可以实现的方式中,引起系统故障检测结果中的系统故障包括以下单个类型的故障或组合类型的故障:网络IO故障、磁盘IO故障、调度故障、内存故障、进程或线程故障、文件系统故障、磁盘故障、CPU故障和容器故障。这样,电子设备可以检测到各种单个类型或者组合类型的故障的系统故障检测结果。
上述当前操作系统运行数据为待检测操作系统中,当前操作系统的进程事件或线程事件运行过 程中产生的数据。也就是说,这种数据一般仅仅表示一个进程事件或线程事件的指标(例如CPU利用率等),因此,当前操作系统运行数据为指标类型的数据。而上述当前操作系统事件数据不仅包括当前操作系统的进程事件或线程事件运行过程中产生的数据,还包括当前操作系统进程事件或线程事件关联的上下文信息。也就是说,这种数据不仅可以表示一个进程事件或线程事件的指标,还可以表示这个进程事件或线程事件的事件内容(例如该线程事件为主线程事件),和/或,事件状态(例如该线程事件为异常状态事件)等,因此,当前操作系统事件数据为事件类型的数据。
在这种情况下,电子设备在检测操作系统层级的故障时,不仅可以获取到指标类型的当前操作系统运行数据,还可以基于当前操作系统运行数据确定事件类型的当前操作系统事件数据。后续,电子设备进行故障检测的过程中,相比通用技术中仅通过基础信息(即指标类型的当前操作系统运行数据)结合人工经验进行故障检测的方法,本申请实施例提供的故障检测方法可以基于当前操作系统运行数据和当前操作系统事件数据,快速确定待检测操作系统的系统故障检测结果,无需人工经验的介入,提高了故障检测的效率。
其次,由于当前操作系统事件数据包括当前操作系统进程事件或线程事件关联的上下文信息,因此,电子设备在进行故障检测的过程中,相比通用技术仅通过基础信息结合人工经验进行故障检测,本申请实施例提供的故障检测方法可以挖掘当前操作系统事件数据中的上下文信息,从而基于上下文信息,获取进程事件或者线程事件的上下文关联事件的相关内容,进而根据这些内容结合当前操作系统运行数据,准确的确定系统故障检测结果,提高了故障检测的准确度。
再次,由于当前操作系统运行数据和当前操作系统事件数据具备细粒度(可以精细到待检测系统的线程事件或进程事件)、宽范围检测数据(不仅包括指标类型的数据,还包括事件类型的数据)的特点,因此,本申请实施例提供的故障检测方法也可以检测到更细粒度的故障检测结果(例如具体线程事件或进程事件的故障检测结果),进一步提高了故障检测的准确度。
此外,本申请实施例提供的故障检测方法无需在电子设备的操作系统层级添加各种故障检测工具,仅需操作系统自身具备的数据采集功能和数据处理功能,便可以快速检测故障,实现了基于操作系统轻量级的故障检测。
在一种可以实现的方式中,在包括大量的云终端和云服务器的大规模云场景下,由于云服务器的服务部署的复杂性,可能导致云服务器的节点(即云服务器的进程或线程)多、链路长。在这种情况下,考虑到时间累积效应,可能造成进程事件或线程事件对应的数据传播效应滞后。如果按照通用的故障检测方法,可能导致故障检测效率低下。本申请实施例可以结合当前操作系统运行数据和当前操作系统事件数据,基于当前操作系统运行数据和当前操作系统事件数据具备细粒度(可以精细到待检测系统的线程事件或进程事件)、宽范围检测数据(不仅包括指标类型的数据,还包括事件类型的数据)的特点,挖掘当前操作系统事件数据中的上下文信息,从而可以快速、准确的确定待检测操作系统的系统故障检测结果。
示例性的,在实际应用场景下,通过数据统计,基于通用的故障检测方法需要0.5天-7天的时间,才能确定待检测操作系统的系统故障检测结果。而本申请实施例结合当前操作系统运行数据和当前操作系统事件数据,仅需0.5小时-72小时,便可以快速、准确的确定待检测操作系统的系统故障检测结果。相比通用技术以“天”级的故障检测速度,本申请实施例提供的故障检测方法可以使得故障检测速度缩短到“分钟”级,提高了故障检测的效率。
在一种可以实现的方式中,电子设备可以包括:系统层处理模块、应用层处理模块和网络层处理模块。在这种情况下,结合图5,如图7所示,上述S503中,电子设备基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果的方法具体包括:
S701、系统层处理模块基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果。
为了提高灰度故障的检测效率,在确定待检测操作系统的系统故障检测结果后,电子设备可以结合APM层和/或NPMD层的故障检测结果,进一步的确定灰度故障。在这种情况下,本申请实施例提供的故障检测方法还包括:
S702、系统层处理模块向应用层处理模块和网络层处理模块发送系统故障检测结果。
具体的,由于系统故障检测结果是待检测操作系统的系统层的故障检测结果,因此,为了结合 应用层的应用故障检测结果,和/或,网络层的网络故障检测结果进一步判定电子设备的灰度故障,系统层处理模块可以向应用层处理模块和网络层处理模块发送系统故障检测结果。
S703、应用层处理模块基于系统故障检测结果,确定电子设备的应用故障检测结果。
具体的,在接收到系统层处理模块发送的系统故障检测结果后,应用层处理模块可以基于系统故障检测结果,确定电子设备的应用故障检测结果。
在一种可以实现的方式中,应用层处理模块可以基于系统层处理模块发送的系统故障检测结果,确定与该系统故障检测结果对应的应用是否故障,并在确定该应用故障时,将该应用故障确定为应用故障检测结果。
在又一种可以实现的方式中,应用层处理模块可以先基于应用层的故障检测工具,确定应用层的应用故障。然后,应用层处理模块结合检测到的应用故障和系统层处理模块发送的系统故障检测结果,进一步的确定电子设备的应用故障检测结果。
S704、网络层处理模块基于系统故障检测结果,确定电子设备的网络故障检测结果。
具体的,在接收到系统层处理模块发送的系统故障检测结果后,网络层处理模块可以基于系统故障检测结果,确定电子设备的网络故障检测结果。
在一种可以实现的方式中,网络层处理模块可以基于系统层处理模块发送的系统故障检测结果,确定与该系统故障检测结果对应的网络设备是否故障,并在确定该网络设备故障时,将该网络设备故障确定为网络故障检测结果。
在又一种可以实现的方式中,网络层处理模块可以先基于网络层的故障检测工具,确定网络层的网络设备故障。然后,网络层处理模块结合检测到的网络设备故障和系统层处理模块发送的系统故障检测结果,进一步的确定电子设备的网络故障检测结果。
通用技术中,电子设备在检测灰度故障时,应用层处理模块需要先检测APM层的应用故障检测结果。然后,在确定APM层的应用故障检测结果之后,网络层处理模块再确定NPMD层的网络故障检测结果。最后,系统层处理模块获取ITIM层的操作系统运行数据,结合人工经验确定最终的灰度故障检测结果。
由上可知,通用技术的灰度故障检测方法中,APM层、NPMD层和ITIM层之间是紧耦合的,电子设备需要依次确定每个层级的故障检测结果,从而确定最终的灰度故障检测结果。而本申请中,ITIM层分别与APM层和NPMD层之间是松耦合的。也就是说,APM层、NPMD层和ITIM层之间可以独立检测当前层级的故障检测结果。相比通用技术,本申请实施例取消了层层下钻的步骤,提高了故障检测效率。
其次,ITIM层的系统层处理模块可以提供独立的数据采集、灰度故障识别、根因分析能力,并在确定系统故障检测结果后,可以向应用层处理模块和网络层处理模块发送系统故障检测结果,以使得应用层处理模块基于系统故障检测结果,确定电子设备的应用故障检测结果,以及网络层处理模块基于系统故障检测结果,确定电子设备的网络故障检测结果,相比通用的故障检测方法需要APM层、NPMD层和ITIM层依次确定故障检测结果,本申请实施例提供的故障检测方法可以使得APM层、NPMD层和ITIM层同步确定相应的层级对应的故障检测结果,并结合各个层级的检测结果,确定灰度故障的检测结果,提高了故障检测的效率。
此外,通用的故障检测方法中,APM层、NPMD层和ITIM层确定相应层级对应的故障检测结果后,需要通过人工经验进行灰度故障的判断。即各个层级对应的故障检测结果独立输出到人工检测对应的输出平台,各个层级对应的故障检测结果无结合处理,这就导致了各个层级对应的故障检测结果形成了孤岛数据。而本申请实施例可以以ITIM层的待检测操作系统为视角,自下而上主动识别、定位灰度故障,与以APM层为视角自上而下的通用的智能运维故障检测方法结合,可以消除数据孤岛,进一步提高了灰度故障检测的准确度。
在一种可以实现的方式中,为了快速、准确的确定待检测操作系统的系统故障检测结果,电子设备可以基于训练好的故障检测模型,确定待检测操作系统的系统故障检测结果。在这种情况下,本申请实施例提供的故障检测方法可以包括:电子设备训练得到故障检测模型的流程(简称为“故障检测模型训练”流程),以及电子设备根据故障检测模型确定待检测操作系统的系统故障检测结果的流程(简称为“故障检测”流程)。
下面先对“故障检测模型训练”流程进行描述。
如图8所示,本申请实施例提供的“故障检测模型训练”流程具体包括:
S801、电子设备获取历史操作系统数据。
其中,历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据。
历史操作系统运行数据是指历史时间段的系统运行数据。历史操作系统事件数据是指历史时间段的系统事件数据。
历史操作系统事件数据包括:待检测操作系统中,历史操作系统的进程事件或线程事件运行过程中产生的数据和历史操作系统的进程事件或线程事件关联的上下文信息。
电子设备可以先获取历史时间段的系统运行数据(即历史操作系统运行数据),然后基于历史操作系统运行数据和预设事件规则,确定待检测操作系统的历史操作系统事件数据。
电子设备获取历史时间段的系统运行数据,以及基于历史操作系统运行数据和预设事件规则,确定待检测操作系统的历史操作系统事件数据的描述,可以参考S501中,电子设备获取待检测操作系统的当前操作系统运行数据,并基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据的相关描述,在此不再赘述。
电子设备在训练得到故障检测模型时,可以基于历史时间段的系统运行数据和操作系统事件数据进行模型训练。
关于历史时间段的系统运行数据的描述可以参考S501中,关于当前时间段的系统运行数据的相关描述,在此不再赘述。
S802、电子设备基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型。
在一种可以实现的方式中,为了提高故障检测模型的模型精度,电子设备可以获取与历史操作系统运行数据对应的历史时间段的原始数据。然后,电子设备可以对这些原始数据执行预处理操作,以得到历史操作系统运行数据,为后续训练得到故障检测模型提供准确的基准数据。
其中,预处理操作用于对原始数据进行数据清洗,以得到完整、平滑且无噪声的当前操作系统运行数据。
可选的,上述预处理操作可以包括:缺失补齐、异常去躁、指标平滑等操作,本申请实施例对此不作限定。
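As an illustration of the preprocessing operations mentioned above (missing-value completion, anomaly denoising and metric smoothing), a small pandas-based sketch is given below; the interpolation method, the 3-sigma clipping rule and the rolling window are assumed choices, not parameters specified by this embodiment.

```python
import pandas as pd

def preprocess_metrics(raw: pd.DataFrame, window: int = 5, z_thresh: float = 3.0) -> pd.DataFrame:
    """Clean a table of raw metric time series (one column per metric, one row per sample)."""
    df = raw.sort_index()
    # Missing-value completion: interpolate, then fill any remaining gaps at the edges.
    df = df.interpolate(method="linear").ffill().bfill()
    # Anomaly denoising: clip samples deviating more than z_thresh standard deviations.
    mean, std = df.mean(), df.std().replace(0, 1.0)
    df = df.clip(lower=mean - z_thresh * std, upper=mean + z_thresh * std, axis=1)
    # Metric smoothing: short rolling mean.
    return df.rolling(window=window, min_periods=1).mean()
```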
电子设备训练得到故障检测模型时,可以基于共轭梯度法和历史操作系统数据,训练得到故障检测模型,也可以基于梯度下降法和历史操作系统数据,训练得到故障检测模型,还可以基于其他通用的故障识别算法和历史操作系统数据,训练得到故障检测模型,本申请实施例对此不作限定。
由上可知,电子设备可以基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型,以便于后续通过故障检测模型,快速、准确的确定系统故障检测结果,提高了故障检测效率。
在一种可以实现的方式中,预设故障识别算法可以包括分类算法和参数优化算法。分类算法用于对故障检测模型的初始模型进行分类训练。参数优化算法可以对分类训练过程中的模型进行参数调整,从而得到故障检测模型。在这种情况下,结合图8,如图9所示,上述S802中,电子设备基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型的方法具体包括:
S901、电子设备基于预设状态规则,对历史操作系统事件数据添加状态标识。
其中,状态标识包括:正常标识和异常标识。
具体的,在确定待检测操作系统的历史操作系统事件数据后,为了训练得到精度较高的故障检测模型,电子设备可以基于预设状态规则,对历史时间段的历史操作系统事件数据添加状态标识。
可选的,预设状态规则可以是根据领域知识创建的。
示例性的,预设状态规则可以包括:线程A在运行过程中,CPU利用率大于预设阈值时,确定线程A的状态标识为异常标识;线程A在运行过程中,内存占用空间大于预设内存大小时,确定线程A的状态标识为异常标识。
在一种可以实现的方式中,由于不同的系统事件对应的预设状态规则可能不同,因此,电子设备可以基于预设状态规则,对历史操作系统事件数据中的部分操作系统数据添加状态标识。
示例性的,当预设状态规则包括:线程A在13点-14点的运行过程中,CPU利用率大于70%时,确定线程A的状态标识为异常标识,以及,线程A在18点-19点的运行过程中,CPU利用率 大于80%时,确定线程A的状态标识为异常标识时,若线程A在15点的运行过程中,CPU利用率为75%,则无法准确的为线程A的操作系统事件数据添加状态标识。在这种情况下,电子设备可以仅对历史操作系统事件数据中,除线程A的操作系统事件数据以外的部分操作系统数据添加状态标识。
在一种可以实现的方式中,电子设备还可以基于领域知识,对历史操作系统事件数据添加相关的解释信息。
结合上述示例,预设历史操作系统事件数据为:线程A在13点-14点的运行过程中,CPU利用率大于70%,并且该历史操作系统事件数据的状态标识为异常标识。在这种情况下,电子设备还可以基于领域知识,对历史操作系统事件数据添加相关的解释信息:因为线程A在13点-14点的运行过程中,线程A中出现了多个冗余代码,从而导致CPU利用率大于70%,状态异常。
这样,通过基于领域知识,对历史操作系统事件数据添加相关的解释信息,可以使得训练得到的故障检测模型输出的系统故障检测结果具备可解释性。
S902、电子设备基于分类算法和第二数据,对待训练模型进行训练,以得到待调整模型。
具体的,故障检测模型可以是一种多分类模型。在这种情况下,电子设备基于分类算法和第二数据,对待训练模型进行训练,以得到待调整模型。
其中,第二数据包括历史操作系统运行数据和添加状态标识后的历史操作系统事件数据;待训练模型包括:待训练的应用特征分类模型和待训练的故障分类模型。
在一种可以实现的方式中,上述分类算法可以是K最近邻(k-Nearest Neighbor,KNN)分类算法,也可以是贝叶斯分类器算法,也可以是逻辑回归算法,还可以是其他用于分类的算法,本申请实施例对此不作限定。
应用特征分类模型用于确定待检测操作系统的系统故障的应用特征。
可选的,上述应用特征可以包括:CPU密集型、IO密集型、周期特征等。
CPU密集型也叫计算密集型,指的是操作系统的硬盘、内存性能相对CPU要好很多。此时,操作系统运作大部分的状况是CPU完全加载,CPU要读或者写硬盘或者内存时,硬盘或者内存在很短的时间就可以完成。
IO密集型指的是操作系统的CPU性能相对硬盘、内存要好很多,此时,系统运作,大部分的状况是CPU在等I/O(硬盘/内存)的读/写操作,此时CPU加载率并不高。IO密集型的程序一般在达到性能极限时,CPU利用率仍然较低。这可能是因为任务本身需要大量I/O操作,而管道链路不顺畅,没有充分利用处理器能力。
周期特征指的是操作系统的进程或线程或应用具有周期性的特征。
故障分类模型用于确定待检测操作系统发生系统故障的应用特征的故障类型。
可选的,上述故障类型可以包括:状态类型故障、功能类型故障、性能类型故障等。
在一种可以实现的方式中,为了训练得到精度较高的故障检测模型,电子设备可以获取与历史操作系统运行数据对应历史时间段的原始数据。然后,电子设备可以对这些原始数据执行预处理操作,以得到历史操作系统运行数据,为后续训练得到故障检测模型提供准确的基准数据。
其中,预处理操作用于对原始数据进行数据清洗,以得到完整、平滑且无噪声的当前操作系统运行数据。
可选的,上述预处理操作可以包括:缺失补齐、异常去躁、指标平滑等操作,本申请实施例对此不作限定。
S903、电子设备基于历史操作系统运行数据和参数优化算法,对待调整模型进行参数调整,以得到故障检测模型。
其中,故障检测模型包括:应用特征分类模型和故障分类模型。
具体的,在对待调整模型进行模型训练的过程中,需要对待调整模型的参数进行调整优化。在这种情况下,电子设备可以基于历史操作系统运行数据和参数优化算法,对待调整模型进行参数调整,以得到故障检测模型。
参数优化算法用于使得待训练模型的输出精度达到预设精度,通过将预设精度参数化,采用参数优化算法,不断的调整待训练模型的参数,使得待训练模型的输出结果不断接近参数化的目标 值。
在一种可以实现的方式中,参数优化算法可以包括:梯度下降算法、动量优化算法、自适应学习率优化算法等。
在一种可以实现的方式中,上述待训练模型的输出精度可以包括:查全率、查准率、F1分数(F1Score)等。
相应的,参数化的目标值可以包括:查全率对应的预设数值、查准率对应的预设数值、F1分数对应的预设数值。
由上可知,由于预设故障识别算法可以包括分类算法和参数优化算法,因此,分类算法可以用于对故障检测模型的初始模型进行分类训练,参数优化算法可以对分类训练过程中的模型进行参数调整,从而得到故障检测模型,以便于后续通过故障检测模型,快速、准确的确定系统故障检测结果,提高了故障检测效率。
在一种可以实现的方式中,上述“故障检测模型训练”流程可以是基于离线场景训练完成的,也可以是基于在线使用场景训练完成的。
离线场景是指在电子设备可以在获取历史操作系统数据后,将历史操作系统数据迁移到实验室场景下,并在实验室场景下基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型。后续,在使用故障检测模型时,可以基于在线迁移算法,将离线场景下训练得到的故障检测模型迁移到在线使用场景下进行故障检测。
在线使用场景是指在电子设备可以在获取历史操作系统数据后,直接基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型。后续,在使用故障检测模型时,可以直接使用故障检测模型进行故障检测。
图10示出了本申请实施例提供的一种基于离线场景训练得到故障检测模型的流程示意图。结合图4,如图10所示,该“故障检测模型训练”流程可以包括:
S1001、数据预处理。
具体的,电子设备的数据采集模块可以采集历史时间段的无状态数据(即本申请实施例中的历史操作系统运行数据)和历史时间段的有状态数据(即本申请实施例中的历史操作系统事件数据)。
然后,电子设备的灰度故障识别模块可以对无状态数据做缺失补齐、异常去躁、指标平滑等操作,为后续步骤提供准确的基准数据。
S1002、数据标签化。
具体的,由于有状态数据包含上下文信息,因此,电子设备的灰度故障识别模块可以为有状态数据添加应用正常/异常等状态的标签,标签的内容可以基于领域知识自定义。
其次,电子设备还可以基于领域知识,对有状态数据添加相关的解释信息。
S1003、故障检测模型训练。
具体的,电子设备可以基于无状态数据和数据标签化后的有状态数据进行模型训练,从而生成应用特征分类模型和故障分类模型。
S1004、故障识别算法评估。
具体的,电子设备可以通过计算应用特征分类模型和故障分类模型的查准率、查全率、F1-Score指标,评估应用特征分类模型和故障分类模型是否达到训练收敛条件。
S1005、故障检测模型生成。
具体的,在故障识别算法评估后,电子设备可以生成离线场景下的故障检测模型,供测试验证或线上迁移训练使用。
下面再对“故障检测”流程进行描述。
结合图5,如图11所示,上述S503中,电子设备基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果的方法具体包括:
S1101、电子设备基于预设状态规则,对当前操作系统事件数据添加状态标识。
其中,状态标识包括:正常标识和异常标识。
具体的,在确定待检测操作系统的当前操作系统事件数据后,为了使得故障检测模型可以准确 的确定待检测操作系统的系统故障检测结果,电子设备可以基于预设状态规则,对当前操作系统事件数据添加状态标识。
电子设备基于预设状态规则,对当前操作系统事件数据添加状态标识的相关描述,可以参考S901中,电子设备基于预设状态规则,对历史操作系统事件数据添加状态标识的相关描述,在此不再赘述。
S1102、电子设备将第一数据输入到预先训练好的故障检测模型中,以得到系统故障检测结果。
其中,第一数据包括当前操作系统运行数据和添加状态标识后的当前操作系统事件数据;故障检测模型为根据待检测操作系统的历史操作系统数据训练得到的;历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据。
在一种可以实现的方式中,为了提高故障检测模型输出系统故障检测结果的准确度,电子设备可以获取与当前操作系统运行数据对应当前时间段的原始数据。然后,电子设备可以对这些原始数据执行预处理操作,以得到当前操作系统运行数据,为后续确定系统故障检测结果提供准确的基准数据。
其中,预处理操作用于对原始数据进行数据清洗,以得到完整、平滑且无噪声的当前操作系统运行数据。
可选的,上述预处理操作可以包括:缺失补齐、异常去躁、指标平滑等操作,本申请实施例对此不作限定。
由上可知,由于故障检测模型为根据待检测操作系统的历史操作系统数据训练得到的,历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据,因此,电子设备确定待检测操作系统的系统故障检测结果时,可以通过故障检测模型,快速、准确的确定系统故障检测结果,提高了故障检测效率。
在一种可以实现的方式中,由S902可知,待训练模型包括:待训练的应用特征分类模型和待训练的故障分类模型,因此,电子设备可以通过训练好的应用特征分类模型和故障分类模型确定系统故障检测结果。在这种情况下,电子设备将第一数据输入到预先训练好的故障检测模型中,以得到系统故障检测结果的方法具体包括:
电子设备将第一数据输入到应用特征分类模型中,以得到应用特征故障检测结果。
其中,应用特征故障检测结果可以包括:CPU密集型的应用特征故障检测结果、IO密集型的应用特征故障检测结果、周期特征的应用特征故障检测结果等。
电子设备将第一数据中,与应用特征故障检测结果对应的系统数据输入到故障分类模型中,以得到系统故障检测结果。
其中,系统故障检测结果可以包括:CPU密集型的应用出现状态类型的故障、CPU密集型的应用出现功能类型的故障、CPU密集型的应用出现性能类型的故障、IO密集型的应用出现状态类型的故障、IO密集型的应用出现功能类型的故障、IO密集型的应用出现性能类型的故障、周期特征的应用出现状态类型的故障、周期特征的应用出现功能类型的故障、周期特征的应用出现性能类型的故障。
由上可知,由于故障检测模型包括应用特征分类模型和故障分类模型,因此,电子设备可以依次根据应用特征分类模型和故障分类模型确定不同的应用特征,以及每个应用特征对应的故障分类,提高了故障检测的准确度。
图12示出了本申请实施例提供的一种基于预先训练得到故障检测模型进行故障检测的流程示意图。结合图4,如图12所示,该“故障检测模型训练”流程可以包括:
S1201、数据预处理。
具体的,电子设备的数据采集模块可以采集当前时间段的无状态数据(即本申请实施例中的当前操作系统运行数据)和当前时间段的有状态数据(即本申请实施例中的当前操作系统事件数据)。
然后,电子设备的灰度故障识别模块可以对无状态数据做缺失补齐、异常去躁、指标平滑等操作,为后续步骤提供准确的基准数据。
S1202、数据标签化。
具体的,由于有状态数据包含上下文信息,因此,电子设备的灰度故障识别模块可以为有状态数据添加应用正常/异常等状态的标签,标签的内容可以基于领域知识自定义。
其次,电子设备还可以基于领域知识,对有状态数据添加相关的解释信息。
S1203、故障检测模型训练或加载离线场景下的故障检测模型。
在一种可以实现的方式中,电子设备可以基于无状态数据和数据标签化后的有状态数据进行模型训练,从而生成应用特征分类模型和故障分类模型。
在又一种可以实现的方式中,电子设备可以基于在线迁移算法,从离线场景下加载预先训练好的故障检测模型。
S1204、灰度故障检测。
具体的,电子设备通过应用特征分类模型,按照应用特征定义的应用阈值,基于偏离度、持续时间等因素进行故障检测。
S1205、灰度故障事件生成。
具体的,在通过应用特征分类模型进行灰度故障检测后,可以进一步通过故障分类模型,确定最终的灰度故障事件(即本申请实施例中的系统故障检测结果),例如应用资源使用异常、应用性能异常、应用状态异常等。
其中,灰度故障事件对应的检测结果可以包括故障事件的异常属性,例如发生的时间、位置、类型等。
其中,故障事件的类型包括故障事件的资源使用、性能、状态等。
在一种可以实现的方式中,在确定系统故障检测结果后,为了对待检测操作系统中发生故障的事件或应用进行根因定位,电子设备还可以构建应用依赖关系和事件传播关系,以此来进一步的实现故障根因定位。
在这种情况下,结合图5,如图13所示,本申请实施例提供的故障检测方法还包括:
S1301、电子设备根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系。
其中,应用依赖关系用于表示待检测操作系统中各进程或线程及各应用实例之间的相互依赖关系。
在一种可以实现的方式中,预设依赖规则可以是当第一进程或线程与第二进程或线程之间存在执行先后顺序的关系时,确定第一进程或线程与第二进程或线程之间具有依赖关系;也可是当第一进程或线程需要基于第二进程或线程的运行结果才能实现运行时,确定第一进程或线程与第二进程或线程之间具有依赖关系;还可以是其他类型的规则,本申请实施例对此不作限定。
当前操作系统运行数据中包括各进程或线程及各应用实例对应的运行数据。在获取当前操作系统运行数据后,电子设备可以根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系。
在一种可以实现的方式中,电子设备可以基于包括节点和边的应用依赖图的形式,构建应用依赖关系。在这种情况下,电子设备根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系的方法具体包括:
电子设备根据当前操作系统运行数据和预设依赖规则,构建应用依赖图。
其中,应用依赖图包括多个应用节点以及应用节点之间的边;多个应用节点中的主应用节点用于表示待检测操作系统中应用实例的主线程;多个应用节点中的从应用节点用于表示:待检测操作系统中应用实例的从线程和待检测操作系统中应用实例之间的依赖实例;多个应用节点中的第一应用节点与第二应用节点之间的边用于表示第一应用节点与第二应用节点之间存在依赖关系。
在一种可以实现的方式中,主节点与主节点、主节点和从节点间存在单向或者双向直接依赖关系,从节点间存在间接依赖或没有依赖关系。
应用实例的从线程可以包括代理应用实例(proxy)等。应用节点的属性可以包含时间、状态、资源使用等信息。
S1302、电子设备根据应用依赖关系,构建事件传播关系。
具体的,由于应用事件的执行主体为待检测操作系统中的进程或线程及应用实例,因此,在确定应用依赖关系后,电子设备可以根据应用依赖关系,构建事件传播关系。
其中,事件传播关系用于表示待检测操作系统中应用事件之间的传播关系。应用事件为待检测操作系统中的进程或线程及应用实例对应的应用事件。
在一种可以实现的方式中,电子设备可以基于包括节点和边的事件传播图的形式,构建事件传播关系。在这种情况下,电子设备根据应用依赖关系,构建事件传播关系的方法具体包括:
电子设备根据应用依赖图,构建事件传播图。
其中,事件传播图包括与多个应用节点一一对应的多个事件节点以及事件节点之间的边;多个事件节点用于表示待检测操作系统中应用事件对应的应用实例;多个事件节点中的第一事件节点与第二事件节点之间的边用于表示第一事件节点对应的第一事件与第二事件节点对应的第二事件之间存在传播关系。
具体的,事件节点属于应用节点。
在一种可以实现的方式中,事件节点对应的事件包括异常事件、有状态事件以及管理面的事件(例如应用升级、系统打补丁等)。事件节点的属性可以包括时间、事件类型、异常等级等信息。
在一种可以实现的方式中,有依赖关系的节点中发生的事件之间存在单向、双向、无传播关系。
S1303、电子设备根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程。
具体的,在确定事件传播关系后,电子设备可以根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程,实现了从多维度、多类型数据进行关联根因分析的效果,解决了通用的故障检测方法中,故障根因分析困难、定位粒度不精细等技术问题。
在一种可以实现的方式中,电子设备可以基于应用依赖图和事件传播图,快速、准确的确定根因事件,以及根因应用实例和/或根因进程或线程。在这种情况下,电子设备根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程的方法具体包括:
电子设备确定与故障事件节点之间具有边的传播事件节点。
其中,故障事件节点为故障事件对应的事件节点;故障事件为系统故障检测结果对应的故障事件。
电子设备将传播事件节点中的传播起始事件节点对应的事件确定为根因事件,以及将传播起始事件节点对应的应用实例确定为根因应用实例,和/或,将传播起始事件节点对应的进程或线程确定为根因进程或线程。
具体的,电子设备可以将传播事件节点中的传播起始事件节点对应的事件确定为根因事件,以及将传播起始事件节点对应的应用实例确定为根因应用实例,和/或,将传播起始事件节点对应的进程或线程确定为根因进程或线程。
这样,本申请提供的故障检测方法可以在确定应用依赖图和事件传播图后,基于事件传播图的传播关系,分析定位根因事件,从而定位到进程/线程级应用及对应的根因事件,并基于事件传播图的传播关系逐事件节点分析定位异常事件的根因,输出应用依赖路径和事件传播路径,实现了从多维度、多类型数据进行关联根因分析的效果,解决了通用的故障检测方法中,故障根因分析困难、定位粒度不精细等技术问题。
在一种可以实现的示例中,如图14所示,待检测操作系统中的主节点1中可以包括:应用11和应用12。主节点2中可以包括:应用21和应用22。主节点n中可以包括:应用n1和应用n2。应用11用于执行事件a。应用12用于执行事件b。应用21用于执行事件c。应用22用于执行事件d。应用n1用于执行事件m。应用n2用于执行事件n。电子设备可以获取待检测操作系统中,应用11、应用12、应用21、应用22、应用n1和应用n2的当前操作系统运行数据。接着,电子设备可以根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系。
示例性的,当前操作系统运行数据中包括:应用11在当前时间段执行了事件a,应用12在当前时间段执行了事件b。预设依赖规则包括:在当前时间段同时执行事件的应用具有依赖关系。在这种情况下,电子设备可以确定应用11与应用12具有应用依赖关系,并基于应用11与应用12具 有应用依赖关系,确定事件a和事件b具有事件传播关系。
基于同样的方法,电子设备可以确定应用11与应用22具有应用依赖关系,应用22与应用n1具有应用依赖关系,应用n1与应用21具有应用依赖关系。
相应的,电子设备可以确定事件d和事件m具有事件传播关系,事件m和事件c具有事件传播关系。
图15示出了本申请实施例提供的又一种故障检测方法的流程示意图。如图15所示,该故障检测方法的步骤具体包括:
S1501、应用层处理模块检测APP的应用质量是否异常。
S1502、网络层处理模块检测网络设备是否有网络问题。
S1503、系统层处理模块获取待检测操作系统的当前操作系统运行数据,以及基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据。
具体的,系统层处理模块采集的当前操作系统运行数据可以包括:指标数据、日志数据、链路数据等无状态数据。
系统层处理模块确定的当前操作系统事件数据可以包括事件类型的有状态数据。
S1504、系统层处理模块基于获取到的当前操作系统运行数据和确定的当前操作系统事件数据,以及多维长周期时序数据的关联分析,完成灰度故障的识别。
S1505、系统层处理模块构建应用依赖关系和事件传播关系,并根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程。
需要说明的是,本申请实施例对于S1501、S1502和S1503-S1505之间先后执行顺序不作限定。
也就是说,本申请实施例可以将应用层处理模块检测APP的应用质量是否异常。网络层处理模块检测网络设备是否有网络问题、系统层处理模块检测待检测操作系统的系统故障检测结果之间是松耦合的关系。相比通用的故障检测方法,本申请实施例提供的故障检测方法取消了层层下钻的步骤,ITIM层提供独立的数据采集、灰度故障识别、根因分析能力,并可以将系统故障检测结果主动推送到APM层和NPMD层,以使得应用层处理模块和网络层处理模块根据系统故障检测结果准确的确定灰度故障。
图16示出了本申请实施例提供的一种网络IO类异常事件类型的应用依赖图和事件传播图。
如图16所示,待检测操作系统中的主节点1中可以包括:应用市场从节点和搜索微服务从节点。
主节点2中可以包括:负一屏(nginx)从节点。
主节点n中可以包括:负一屏服务从节点和第一远程字典服务集群(redis cluster server)从节点。
主节点n+1中可以包括:第二远程字典服务集群从节点。
应用市场从节点对应的APM层的事件包括:a.1、高频搜索结果空异常事件。
搜索微服务从节点对应的APM层的事件包括:a.2、应用版本升级事件。
主节点1对应的APM层的事件包括:a.3、CPU高负载事件。
负一屏从节点对应的APM层的事件包括:a.4、微服务响应超时事件。
负一屏从节点对应的ITIM层的事件包括:b.1、N-T2链路往返时间(round-trip time,RTT)异常事件。
负一屏服务从节点对应的APM层的事件包括:a.5、新闻公告信息获取异常事件。
负一屏服务从节点对应的ITIM层的事件包括:b.2、T2-R2链路RTT异常事件。
第二远程字典服务集群从节点对应的ITIM层的事件包括:b.3、第二远程字典服务网络丢包异常事件。
结合图16,图17示出了本申请实施例提供的一种网络IO类异常事件类型的故障检测方法流程示意图。
S1701、应用层处理模块检测APP异常事件。
具体的,应用层处理模块检测到应用性能下降异常,对应节点2上的a.4异常事件(微服务响 应超时),同时也检测到节点1的a.1(高频搜索结果空异常)、a.2(应用版本升级)、节点n的a.5异常事件(新闻公告信息获取异常)。
S1702、网络层处理模块未检测到网络异常事件。
S1703、系统层处理模块采集节点1、节点2、…、节点n+1的网络I/O、磁盘I/O、调度、内存、进程/线程、文件系统、磁盘、CPU及容器数据。
S1704、系统层处理模块识别节点2上的b.1异常事件,节点n上的b.2异常事件,节点n+1的b.3异常事件。
具体的,系统层处理模块可以识别到节点2上的b.1异常事件(节点2的Nginx到节点n的负一屏服务RTT时延异常),节点n上的b.2异常事件(节点n的负一屏服务到节点n+1的第二远程字典服务集群RTT时延异常),节点n+1的b.3异常事件(第二远程字典服务集群存在网络丢包异常)。
S1705、系统层处理模块确定根因定位结果。
具体的,应用依赖关系可以包括:节点1的应用市场依赖于节点1的搜索微服务;节点2的Ngnix服务依赖于节点1的应用市场;节点2的Ngnix依赖于节点n的负一屏服务;节点n的负一屏服务依赖于第一远程字典服务集群;节点n的负一屏服务依赖于节点n+1的第二远程字典服务集群。
根据事件传播关系确定根因定位结果的具体过程为:由于节点n的负一屏服务依赖于节点n+1的第二远程字典服务集群,所以b.3可以传播到b.2,由于节点2的Ngnix依赖于节点n的负一屏服务,所以b.2可以传播到b.1,其中b.3异常事件的根因是由于网络协议栈缓冲器(buffer)队列满导致,所以b.1由b.3导致,根因即为b.3异常事件的根因。
图18示出了本申请实施例提供的一种混合类灰度故障的应用依赖图和事件传播图。
混合类灰度故障指是在异常事件传播路径中存在网络IO、磁盘IO、调度、内存、进程/线程、文件系统、磁盘、CPU及容器等不同类的异常事件。
如图18所示,待检测操作系统中的主节点1中可以包括:应用市场从节点和搜索微服务从节点。
主节点2中可以包括:负一屏(nginx)从节点。
主节点n中可以包括:负一屏服务从节点和第一远程字典服务集群(redis cluster server)从节点。
主节点n+1中可以包括:第二远程字典服务集群从节点。
应用市场从节点对应的APM层的事件包括:a.1、高频搜索结果空异常事件。
搜索微服务从节点对应的APM层的事件包括:a.2、应用版本升级事件。
主节点1对应的APM层的事件包括:a.3、CPU高负载事件。
负一屏从节点对应的APM层的事件包括:a.4、微服务响应超时事件。
负一屏从节点对应的ITIM层的事件包括:b.1、N-T2链路往返时间(round-trip time,RTT)异常事件。
负一屏服务从节点对应的APM层的事件包括:a.5、新闻公告信息获取异常事件。
负一屏服务从节点对应的ITIM层的事件包括:b.2、T2-R2链路RTT异常事件。
第二远程字典服务集群从节点对应的ITIM层的事件包括:b.3、第二远程字典服务CPU被干扰(CPU异常)事件。
结合图18,图19示出了本申请实施例提供的一种混合类灰度故障的故障检测方法流程示意图。
S1901、应用层处理模块检测APP异常事件。
具体的,应用层处理模块检测到应用性能下降异常,对应节点2上的a.4异常事件(微服务响应超时),同时也检测到节点1的a.1(高频搜索结果空异常)、a.2(应用版本升级)、节点n的a.5异常事件(新闻公告信息获取异常)。
S1902、网络层处理模块未检测到网络异常事件。
S1903、系统层处理模块采集节点1、节点2、…、节点n+1的网络I/O、磁盘I/O、调度、内 存、进程/线程、文件系统、磁盘、CPU及容器数据。
S1904、系统层处理模块识别节点2上的b.1异常事件,节点n上的b.2异常事件,节点n+1的b.3异常事件。
具体的,系统层处理模块可以识别到节点2上的b.1异常事件(节点2的Nginx到节点n的负一屏服务RTT时延异常),节点n上的b.2异常事件(节点n的负一屏服务到节点n+1的第二远程字典服务集群RTT时延异常),节点n+1的b.3异常事件(第二远程字典服务集群存在CPU被干扰异常)。
S1905、系统层处理模块确定根因定位结果。
具体的，应用依赖关系可以包括：节点1的应用市场依赖于节点1的搜索微服务；节点2的Nginx服务依赖于节点1的应用市场；节点2的Nginx依赖于节点n的负一屏服务；节点n的负一屏服务依赖于第一远程字典服务集群；节点n的负一屏服务依赖于节点n+1的第二远程字典服务集群；节点n+1的第二远程字典服务集群与Spark Worker双向依赖。
根据事件传播关系确定根因定位结果的具体过程为：由于节点n的负一屏服务依赖于节点n+1的第二远程字典服务集群，所以b.3可以传播到b.2；由于节点2的Nginx依赖于节点n的负一屏服务，所以b.2可以传播到b.1。其中，b.3异常事件的根因是内存集群计算平台的工作节点(Spark Worker)上的批处理业务抢占CPU，因此b.1由b.3导致，根因即为b.3异常事件的根因。
需要说明的是,本申请实施例对于S1901、S1902和S1903-S1905之间先后执行顺序不作限定。
可以理解的是，在实际实施时，本申请实施例所述的故障检测设备可以包含有用于实现前述对应故障检测方法的一个或多个硬件结构和/或软件模块，这些硬件结构和/或软件模块可以构成一个故障检测设备。
本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
基于这样的理解,本申请实施例还对应提供一种故障检测装置,可以应用于故障检测设备。图20示出了本申请实施例提供的故障检测装置的结构示意图。如图20所示,该故障检测装置可以包括:获取单元2001和处理单元2002;
获取单元2001,用于获取待检测操作系统的当前操作系统运行数据。例如,结合图5,获取单元2001用于执行S501。
处理单元2002,用于基于当前操作系统运行数据和预设事件规则,确定待检测操作系统的当前操作系统事件数据;当前操作系统事件数据包括:待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据和当前操作系统的进程事件或线程事件关联的上下文信息。例如,结合图5,处理单元2002用于执行S502。
处理单元2002,还用于基于当前操作系统运行数据和当前操作系统事件数据,确定待检测操作系统的系统故障检测结果。例如,结合图5,处理单元2002用于执行S503。
在一种可以实现的方式中,电子设备包括:系统层处理模块、应用层处理模块和网络层处理模块;
系统层处理模块用于确定待检测操作系统的系统故障检测结果。例如,结合图7,系统层处理模块用于执行S701。
系统层处理模块还用于向应用层处理模块和网络层处理模块发送系统故障检测结果。例如,结合图7,系统层处理模块用于执行S702。
应用层处理模块用于基于系统故障检测结果,确定电子设备的应用故障检测结果。例如,结合图7,应用层处理模块用于执行S703。
网络层处理模块用于基于系统故障检测结果,确定电子设备的网络故障检测结果。例如,结合图7,网络层处理模块用于执行S704。
在一种可以实现的方式中,处理单元2002,具体用于:
基于预设状态规则,对当前操作系统事件数据添加状态标识;状态标识包括:正常标识和异常标识。例如,结合图11,处理单元2002用于执行S1101。
将第一数据输入到预先训练好的故障检测模型中,以得到系统故障检测结果;第一数据包括当前操作系统运行数据和添加状态标识后的当前操作系统事件数据;故障检测模型为根据待检测操作系统的历史操作系统数据训练得到的;历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据。例如,结合图11,处理单元2002用于执行S1102。
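示例性的，上述"添加状态标识并输入故障检测模型"的处理流程可以用如下Python代码示意，其中的状态规则接口与模型接口均为假设：

    # 示意：先按预设状态规则为事件数据添加状态标识，再将运行数据与带状态标识的
    # 事件数据（即"第一数据"）输入预先训练好的故障检测模型。
    def add_state_labels(event_data, state_rules):
        for event in event_data:
            is_abnormal = state_rules(event)            # 预设状态规则，返回布尔值
            event["state"] = "异常" if is_abnormal else "正常"
        return event_data

    def detect_faults(model, run_data, event_data, state_rules):
        labeled_events = add_state_labels(event_data, state_rules)
        first_data = {"run_data": run_data, "event_data": labeled_events}
        return model.predict(first_data)                # 输出系统故障检测结果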
在一种可以实现的方式中,获取单元2001,还用于获取历史操作系统数据。例如,结合图8,获取单元2001用于执行S801。
处理单元2002,还用于基于预设故障识别算法和历史操作系统数据,训练得到故障检测模型。例如,结合图8,处理单元2002用于执行S802。
在一种可以实现的方式中,预设故障识别算法包括:分类算法和参数优化算法;处理单元2002,具体用于:
基于预设状态规则,对历史操作系统事件数据添加状态标识。例如,结合图9,处理单元2002用于执行S901。
基于分类算法和第二数据,对待训练模型进行训练,以得到待调整模型;第二数据包括历史操作系统运行数据和添加状态标识后的历史操作系统事件数据;待训练模型包括:待训练的应用特征分类模型和待训练的故障分类模型。例如,结合图9,处理单元2002用于执行S902。
基于历史操作系统运行数据和参数优化算法,对待调整模型进行参数调整,以得到故障检测模型;故障检测模型包括:应用特征分类模型和故障分类模型。例如,结合图9,处理单元2002用于执行S903。
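示例性的，上述"先基于分类算法训练、再基于参数优化算法调参"的两阶段训练过程可以参考如下示意代码。此处以随机森林分类器与网格搜索为例，仅用于说明流程，具体算法与参数取值由实际实现决定：

    # 示意：第一阶段基于分类算法与"第二数据"训练得到待调整模型，
    # 第二阶段基于历史操作系统运行数据与参数优化算法（此处为网格搜索）调参。
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    def train_fault_model(second_x, second_y, history_x, history_y):
        # 第一步：基于分类算法与第二数据训练待训练模型，得到待调整模型
        pending_model = RandomForestClassifier(n_estimators=50).fit(second_x, second_y)

        # 第二步：基于历史操作系统运行数据与参数优化算法对待调整模型进行参数调整
        search = GridSearchCV(
            pending_model,
            param_grid={"n_estimators": [50, 100, 200], "max_depth": [None, 10]},
            cv=3,
        ).fit(history_x, history_y)
        return search.best_estimator_      # 即参数调整后的故障检测模型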
在一种可以实现的方式中,处理单元2002,具体用于:
将第一数据输入到应用特征分类模型中,以得到应用特征故障检测结果;
将第一数据中,与应用特征故障检测结果对应的系统数据输入到故障分类模型中,以得到系统故障检测结果。
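示例性的，上述两级模型的级联推理过程可以用如下Python代码示意，其中第一数据的字段划分与"对应的系统数据"的选取逻辑均为假设：

    # 示意：级联推理——先由应用特征分类模型得到应用特征故障检测结果，
    # 再将第一数据中与该结果对应的系统数据输入故障分类模型。
    def cascaded_detect(first_data, app_feature_model, fault_model):
        app_result = app_feature_model.predict(first_data["features"])
        # 从第一数据中选出与应用特征故障检测结果对应的系统数据（选取逻辑为假设）
        related = [d for d, r in zip(first_data["system_data"], app_result) if r == "异常"]
        return fault_model.predict(related)          # 得到系统故障检测结果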
在一种可以实现的方式中,处理单元2002,还用于根据当前操作系统运行数据和预设依赖规则,构建应用依赖关系;应用依赖关系用于表示待检测操作系统中各进程或线程及各应用实例之间的相互依赖关系。例如,结合图13,处理单元2002用于执行S1301。
处理单元2002,还用于根据应用依赖关系,构建事件传播关系;事件传播关系用于表示待检测操作系统中应用事件之间的传播关系;应用事件为待检测操作系统中的进程或线程及应用实例对应的应用事件。例如,结合图13,处理单元2002用于执行S1302。
处理单元2002,还用于根据事件传播关系,确定引起系统故障检测结果中的系统故障的根因事件,以及根因应用实例,和/或根因进程或线程。例如,结合图13,处理单元2002用于执行S1303。
在一种可以实现的方式中,处理单元2002,具体用于:
根据当前操作系统运行数据和预设依赖规则,构建应用依赖图;应用依赖图包括多个应用节点以及应用节点之间的边;多个应用节点中的主应用节点用于表示待检测操作系统中应用实例的主线程;多个应用节点中的从应用节点用于表示:待检测操作系统中应用实例的从线程和待检测操作系统中应用实例之间的依赖实例;多个应用节点中的第一应用节点与第二应用节点之间的边用于表示第一应用节点与第二应用节点之间存在依赖关系。
在一种可以实现的方式中,处理单元2002,具体用于:
根据应用依赖图,构建事件传播图;事件传播图包括与多个应用节点一一对应的多个事件节点以及事件节点之间的边;多个事件节点用于表示待检测操作系统中应用事件对应的应用实例;多个事件节点中的第一事件节点与第二事件节点之间的边用于表示第一事件节点对应的第一事件与第二事件节点对应的第二事件之间存在传播关系。
在一种可以实现的方式中,处理单元2002,具体用于:
确定与故障事件节点之间具有边的传播事件节点;故障事件节点为故障事件对应的事件节点;故障事件为系统故障检测结果对应的故障事件;
将传播事件节点中的传播起始事件节点对应的事件确定为根因事件,以及将传播起始事件节点对应的应用实例确定为根因应用实例,和/或,将传播起始事件节点对应的进程或线程确定为根因进程或线程。
在一种可以实现的方式中,引起系统故障检测结果中的系统故障包括以下单个类型的故障或组合类型的故障:
网络输入输出IO故障、磁盘输入输出IO故障、调度故障、内存故障、进程或线程故障、文件系统故障、磁盘故障、中央处理器CPU故障和容器故障。
如上所述,本申请实施例可以根据上述方法示例对故障检测设备进行功能模块的划分。其中,上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。另外,还需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。
关于上述实施例中的故障检测装置,其中各个模块执行操作的具体方式、以及具备的有益效果,均已经在前述方法实施例中进行了详细描述,此处不再赘述。
本申请实施例还提供一种电子设备。该电子设备可以是终端,该终端可以是手机、电脑等用户终端。图21示出了本申请实施例提供的终端的结构示意图。
该终端可以是上述故障检测装置,包括至少一个处理器61,通信总线62,存储器63以及至少一个通信接口64。
处理器61可以是一个中央处理器(central processing unit,CPU)、微处理单元、ASIC，或一个或多个用于控制本申请方案程序执行的集成电路。作为一个示例，结合图20，故障检测设备中的处理单元2002实现的功能与图21中的处理器61实现的功能相同。
通信总线62可包括一条通路，用于在上述组件之间传送信息。
通信接口64,使用任何收发器一类的装置,用于与其他设备或通信网络通信,如服务器、以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。作为一个示例,结合图20,故障检测设备中的获取单元2001实现的功能与图21中的通信接口64实现的功能相同。
存储器63可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备，随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备，也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质，但不限于此。存储器可以独立存在，通过通信总线62与处理器61相连接。存储器也可以和处理器61集成在一起。
其中,存储器63用于存储执行本申请方案的应用程序代码,并由处理器61来控制执行。处理器61用于执行存储器63中存储的应用程序代码,从而实现本申请方法中的功能。
在具体实现中,作为一种实施例,处理器61可以包括一个或多个CPU,例如图21中的CPU0和CPU1。
在具体实现中,作为一种实施例,终端可以包括多个处理器,例如图21中的处理器61和处理器65。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。
在具体实现中，作为一种实施例，终端还可以包括输入设备66和输出设备67。输入设备66和处理器61通信，可以以多种方式接收用户的输入。例如，输入设备66可以是鼠标、键盘、触摸屏设备或传感设备等。输出设备67和处理器61通信，可以以多种方式来显示信息。例如，输出设备67可以是液晶显示器(liquid crystal display,LCD)、发光二极管(light emitting diode,LED)显示设备等。
本领域技术人员可以理解,图21中示出的结构并不构成对终端的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
本申请实施例还提供一种电子设备,该电子设备例如可以是服务器。图22示出了本申请实施例提供的服务器的结构示意图。该服务器可以是故障检测装置。该服务器可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器71和一个或一个以上的存储器72。其中,存储器72中存储有至少一条指令,至少一条指令由处理器71加载并执行以实现上述各个方法实施例提供的故障检测方法。当然,该服务器还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器还可以包括其他用于实现设备功能的部件,在此不做赘述。
本申请还提供了一种包括指令的计算机可读存储介质,所述计算机可读存储介质上存储有指令,当所述计算机可读存储介质中的指令由计算机设备的处理器执行时,使得计算机能够执行上述所示实施例提供的故障检测方法。例如,计算机可读存储介质可以为包括指令的存储器63,上述指令可由终端的处理器61执行以完成上述方法。又例如,计算机可读存储介质可以为包括指令的存储器72,上述指令可由服务器的处理器71执行以完成上述方法。
可选地,计算机可读存储介质可以是非临时性计算机可读存储介质,例如,所述非临时性计算机可读存储介质可以是ROM、RAM、CD-ROM、磁带、软盘和光数据存储设备等。
本申请还提供了一种计算机程序产品,该计算机程序产品包括计算机指令,当所述计算机指令在故障检测设备上运行时,使得所述故障检测设备执行上述图5-图19任一附图所示的故障检测方法。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求指出。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对通用技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此,任何在本申请实施例揭露的技术范围内的变化或替换,都应涵盖在本申请实施例的保护范围之内。因此,本申请实施例的保护范围应以所述权利要求的保护范围为准。

Claims (14)

  1. 一种故障检测方法,其特征在于,应用于电子设备,包括:
    获取待检测操作系统的当前操作系统运行数据;
    基于所述当前操作系统运行数据和预设事件规则,确定所述待检测操作系统的当前操作系统事件数据;所述当前操作系统事件数据包括:所述待检测操作系统中,当前操作系统的进程事件或线程事件运行过程中产生的数据和所述当前操作系统的进程事件或线程事件关联的上下文信息;
    基于所述当前操作系统运行数据和所述当前操作系统事件数据,确定所述待检测操作系统的系统故障检测结果。
  2. 根据权利要求1所述的故障检测方法,其特征在于,所述电子设备包括:系统层处理模块、应用层处理模块和网络层处理模块;所述确定所述待检测操作系统的系统故障检测结果,包括:
    所述系统层处理模块确定所述待检测操作系统的系统故障检测结果;
    所述方法还包括:
    所述系统层处理模块向所述应用层处理模块和所述网络层处理模块发送所述系统故障检测结果;
    所述应用层处理模块基于所述系统故障检测结果,确定所述电子设备的应用故障检测结果;
    所述网络层处理模块基于所述系统故障检测结果,确定所述电子设备的网络故障检测结果。
  3. 根据权利要求1所述的故障检测方法,其特征在于,所述基于所述当前操作系统运行数据和所述当前操作系统事件数据,确定所述待检测操作系统的系统故障检测结果,包括:
    基于预设状态规则,对所述当前操作系统事件数据添加状态标识;所述状态标识包括:正常标识和异常标识;
    将第一数据输入到预先训练好的故障检测模型中,以得到所述系统故障检测结果;所述第一数据包括所述当前操作系统运行数据和添加所述状态标识后的当前操作系统事件数据;所述故障检测模型为根据所述待检测操作系统的历史操作系统数据训练得到的;所述历史操作系统数据包括:历史操作系统运行数据和历史操作系统事件数据。
  4. 根据权利要求3所述的故障检测方法,其特征在于,还包括:
    获取所述历史操作系统数据;
    基于预设故障识别算法和所述历史操作系统数据,训练得到所述故障检测模型。
  5. 根据权利要求4所述的故障检测方法,其特征在于,所述预设故障识别算法包括:分类算法和参数优化算法;所述基于预设故障识别算法和所述历史操作系统数据,训练得到所述故障检测模型,包括:
    基于所述预设状态规则,对所述历史操作系统事件数据添加所述状态标识;
    基于所述分类算法和第二数据,对待训练模型进行训练,以得到待调整模型;所述第二数据包括所述历史操作系统运行数据和添加所述状态标识后的历史操作系统事件数据;所述待训练模型包括:待训练的应用特征分类模型和待训练的故障分类模型;
    基于所述历史操作系统运行数据和所述参数优化算法,对所述待调整模型进行参数调整,以得到所述故障检测模型;所述故障检测模型包括:应用特征分类模型和故障分类模型。
  6. 根据权利要求5所述的故障检测方法,其特征在于,所述将第一数据输入到预先训练好的故障检测模型中,以得到所述系统故障检测结果,包括:
    将所述第一数据输入到所述应用特征分类模型中,以得到应用特征故障检测结果;
    将所述第一数据中,与所述应用特征故障检测结果对应的系统数据输入到所述故障分类模型中,以得到所述系统故障检测结果。
  7. 根据权利要求1所述的故障检测方法,其特征在于,还包括:
    根据所述当前操作系统运行数据和预设依赖规则,构建应用依赖关系;所述应用依赖关系用于表示所述待检测操作系统中各进程或线程及各应用实例之间的相互依赖关系;
    根据所述应用依赖关系,构建事件传播关系;所述事件传播关系用于表示所述待检测操作系统中应用事件之间的传播关系;所述应用事件为所述待检测操作系统中的进程或线程及应用实例对应的应用事件;
    根据所述事件传播关系，确定引起所述系统故障检测结果中的系统故障的根因事件，以及根因应用实例和/或根因进程或线程。
  8. 根据权利要求7所述的故障检测方法,其特征在于,所述根据所述当前操作系统运行数据和预设依赖规则,构建应用依赖关系,包括:
    根据所述当前操作系统运行数据和所述预设依赖规则,构建应用依赖图;所述应用依赖图包括多个应用节点以及应用节点之间的边;所述多个应用节点中的主应用节点用于表示所述待检测操作系统中应用实例的主线程;所述多个应用节点中的从应用节点用于表示:所述待检测操作系统中应用实例的从线程和所述待检测操作系统中应用实例之间的依赖实例;所述多个应用节点中的第一应用节点与第二应用节点之间的边用于表示所述第一应用节点与所述第二应用节点之间存在依赖关系。
  9. 根据权利要求8所述的故障检测方法,其特征在于,所述根据所述应用依赖关系,构建事件传播关系,包括:
    根据所述应用依赖图,构建事件传播图;所述事件传播图包括与所述多个应用节点一一对应的多个事件节点以及事件节点之间的边;所述多个事件节点用于表示所述待检测操作系统中应用事件对应的应用实例;所述多个事件节点中的第一事件节点与第二事件节点之间的边用于表示所述第一事件节点对应的第一事件与所述第二事件节点对应的第二事件之间存在传播关系。
  10. 根据权利要求9所述的故障检测方法,其特征在于,所述根据所述事件传播关系,确定引起所述系统故障检测结果中的系统故障的根因事件,以及根因应用实例和/或根因进程或线程,包括:
    确定与故障事件节点之间具有边的传播事件节点;所述故障事件节点为故障事件对应的事件节点;所述故障事件为所述系统故障检测结果对应的故障事件;
    将所述传播事件节点中的传播起始事件节点对应的事件确定为所述根因事件,以及将所述传播起始事件节点对应的应用实例确定为所述根因应用实例,和/或,将所述传播起始事件节点对应的线程确定为所述根因进程或线程。
  11. 根据权利要求1-10任一项所述的故障检测方法,其特征在于,引起所述系统故障检测结果中的系统故障包括以下单个类型的故障或组合类型的故障:
    网络输入输出IO故障、磁盘输入输出IO故障、调度故障、内存故障、进程或线程故障、文件系统故障、磁盘故障、中央处理器CPU故障和容器故障。
  12. 一种电子设备,其特征在于,所述电子设备包括:
    存储器;
    通信接口;
    一个或多个处理器;
    其中,所述存储器中存储有一个或多个计算机程序,所述一个或多个计算机程序包括指令,当所述指令被所述电子设备执行时,使得所述电子设备执行如权利要求1-11中任一项所述的故障检测方法。
  13. 一种计算机可读存储介质,其特征在于,包括计算机指令,当所述计算机指令在电子设备上运行时,使得所述电子设备执行如权利要求1-11中任一项所述的故障检测方法。
  14. 一种计算机程序产品,包括指令,其特征在于,当所述指令在电子设备上运行时,使得所述电子设备执行如权利要求1-11中任一项所述的故障检测方法。