CN116955071A

CN116955071A - Fault classification methods, devices, equipment and storage media

Info

Publication number: CN116955071A
Application number: CN202310827081.1A
Authority: CN
Inventors: 王春华; 陈劼; 王菁菁; 宋潇; 文韬
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-10-27

Abstract

The application belongs to the technical field of data analysis and discloses a fault classification method, device, equipment and storage medium. The method obtains real-time performance indicators of pod containers and host nodes in a real-time consumption system; performs anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result; inputs the anomaly detection result into a fault identification model to obtain a fault identification result; inputs the fault identification result into a fault classification model for fault prediction to obtain the fault category at the current moment; and generates fault information according to that fault category. Performing anomaly detection on the mean of several tail unit points eliminates the influence of spikes (glitch points) while maintaining high performance, and running fault identification and fault classification on the anomaly detection result speeds up both.

Description

Fault classification methods, devices, equipment and storage media

Technical Field

The present invention relates to the technical field of data analysis, and in particular to a fault classification method, device, equipment and storage medium.

Background Art

In operation and maintenance systems under a traditional microservice architecture, fault identification and fault classification usually depend heavily on the quality of anomaly detection: the anomaly detection results are analyzed to decide whether a component is faulty and to locate the cause of the fault. Fault classification then assigns an identified fault to a category, for example k8s container CPU load, k8s container read IO load, or node memory consumption. Two approaches are common in the prior art: a) take a period of historical data from the system, rely on expert experience to filter out core indicators closely related to faults, label each historical moment (usually at minute granularity) as faulty or not, and then train a comprehensive classification model on the historical data of these indicators to identify faults and fault categories; or b) perform anomaly detection separately on single dimensions such as golden business indicators, performance indicators, log data and call chain data, combine the anomaly detection results of all dimensions to locate the fault moment, and determine the root cause from the anomaly score of each component together with historical call relationship data between components. In approach b), the abnormal indicator of the root-cause component is generally taken as the root cause, and the fault category is then judged from the indicator category. These approaches suffer from severely imbalanced sample data, long labeling time, and anomaly detection scores that do not represent fault severity, which results in low credibility.

The above content is only intended to assist in understanding the technical solution of the present invention, and does not constitute an admission that it is prior art.

Summary of the Invention

The main purpose of the present invention is to provide a fault classification method, device, equipment and storage medium, aiming to solve the technical problem that the prior art cannot achieve rapid anomaly detection together with fault identification and fault classification.

To achieve the above purpose, the present invention provides a fault classification method, which includes the following steps:

obtaining real-time performance indicators of pod containers and host nodes in a real-time consumption system;

performing anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result;

inputting the anomaly detection result into a fault identification model for fault identification to obtain a fault identification result;

inputting the fault identification result into a fault classification model for fault prediction to obtain the fault category at the current moment;

generating fault information according to the fault category at the current moment.

Optionally, before inputting the anomaly detection result into the fault identification model for fault identification to obtain the fault identification result, the method further includes:

obtaining historical performance indicators of the pod containers and host nodes;

grouping the historical performance indicators to obtain several groups of core indicators;

traversing historical fault samples in chronological order to obtain a set of historical fault occurrence times;

determining, according to the set of historical fault occurrence times, the historical performance indicator data corresponding to each historical fault time;

performing anomaly detection on each item of historical performance indicator data to obtain anomaly detection results;

inputting the anomaly detection results into an initial fault identification model for training to obtain the fault identification model.

Optionally, after inputting the anomaly detection results into the initial fault identification model for training to obtain the fault identification model, the method further includes:

analyzing the fault identification result to obtain a first abnormal indicator set within a first preset time after the fault moment;

obtaining a second abnormal indicator set that appeared within a second preset time before the fault moment;

filtering the first abnormal indicator set according to the second abnormal indicator set to obtain a new abnormal indicator set that appeared after the fault moment;

inputting the new abnormal indicator set into an initial fault classification model for training to obtain the fault classification model.

Optionally, inputting the anomaly detection results into the initial fault identification model for training to obtain the fault identification model includes:

labeling the abnormal indicators in the new abnormal indicator set to obtain sample labels, where the new abnormal indicator set includes at least one indicator group and each indicator group includes at least one performance unit indicator;

determining the abnormal state of the current indicator group according to the sample labels;

when the current indicator group is in an abnormal state, counting the number of performance unit indicators in the current indicator group that are in an abnormal state, together with the sample labels;

training a corresponding number of fault identification models according to the number of performance unit indicators and the sample labels, there being at least one fault identification model.

Optionally, inputting the new abnormal indicator set into the initial fault classification model for training to obtain the fault classification model includes:

analyzing the sample labels to determine the fault category;

obtaining the corresponding fault identification result according to the indicator group corresponding to the sample labels;

training, through a fault classification algorithm, the fault classification model according to the fault category and the fault identification result.

Optionally, inputting the anomaly detection result into the fault identification model for fault identification to obtain the fault identification result includes:

analyzing the anomaly detection result to determine the grouping information of the real-time performance indicators;

calling the corresponding fault identification model according to the grouping information;

inputting the anomaly detection result into the corresponding fault identification model so that the model analyzes the anomaly detection result and produces an analysis result, where the analysis result includes the status of the real-time performance indicators;

when the analysis result shows that the real-time performance indicators are abnormal, obtaining the fault identification result of the real-time performance indicators.

Optionally, inputting the fault identification result into the fault classification model for fault prediction to obtain the fault category at the current moment includes:

determining a sample serial number according to the fault identification result;

generating a two-dimensional table according to the sample serial number and the corresponding indicator group;

performing fault prediction on the two-dimensional table with the xgboost algorithm to obtain a fault prediction result;

obtaining the fault category at the current moment according to the fault prediction result.

In addition, to achieve the above purpose, the present invention also proposes a fault classification device, which includes:

an indicator acquisition module, used to obtain real-time performance indicators of pod containers and host nodes in a real-time consumption system;

an anomaly detection module, used to perform anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result;

a fault identification module, used to input the anomaly detection result into a fault identification model for fault identification to obtain a fault identification result;

a fault classification module, used to input the fault identification result into a fault classification model for fault prediction to obtain the fault category at the current moment;

an information output module, used to generate fault information according to the fault category at the current moment.

In addition, to achieve the above purpose, the present invention also proposes fault classification equipment, which includes a memory, a processor, and a fault classification program stored in the memory and executable on the processor, the fault classification program being configured to implement the steps of the fault classification method described above.

In addition, to achieve the above purpose, the present invention also proposes a storage medium on which a fault classification program is stored; when the fault classification program is executed by a processor, the steps of the fault classification method described above are implemented.

The present invention obtains real-time performance indicators of pod containers and host nodes in a real-time consumption system; performs anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result; inputs the anomaly detection result into a fault identification model for fault identification to obtain a fault identification result; inputs the fault identification result into a fault classification model for fault prediction to obtain the fault category at the current moment; and generates fault information according to the fault category at the current moment. Performing anomaly detection on the mean of several unit points at the tail of the real-time performance indicators eliminates the influence of spikes (glitch points) while maintaining high performance, and performing fault identification and fault classification on the anomaly detection result speeds up both.

Brief Description of the Drawings

Figure 1 is a schematic structural diagram of the fault classification equipment in the hardware operating environment involved in the embodiments of the present invention;

Figure 2 is a schematic flow chart of the first embodiment of the fault classification method of the present invention;

Figure 3 is a schematic flow chart of the second embodiment of the fault classification method of the present invention;

Figure 4 is a structural block diagram of the first embodiment of the fault classification device of the present invention.

The realization of the purpose, the functional features and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

Referring to Figure 1, Figure 1 is a schematic structural diagram of the fault classification equipment in the hardware operating environment involved in the embodiments of the present invention.

As shown in Figure 1, the fault classification equipment may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the structure shown in Figure 1 does not limit the fault classification equipment, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in Figure 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a fault classification program.

In the fault classification equipment shown in Figure 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be provided in the fault classification equipment, which calls, through the processor 1001, the fault classification program stored in the memory 1005 and executes the fault classification method provided by the embodiments of the present invention.

An embodiment of the present invention provides a fault classification method. Referring to Figure 2, Figure 2 is a schematic flow chart of the first embodiment of the fault classification method of the present invention.

In this embodiment, the fault classification method includes the following steps:

Step S10: Obtain real-time performance indicators of pod containers and host nodes in the real-time consumption system.

It should be noted that the execution subject of this embodiment is the fault classification equipment, which has functions such as data processing, data communication and program execution. The fault classification equipment may be an integrated controller, a control computer or a similar device, and may of course be any other device with similar functions; this embodiment places no limitation on this.

It should be understood that the real-time consumption system may be a Kafka real-time consumption system. This type of real-time consumption system is highly concurrent and can serve access requests from multiple users at the same point in time. In k8s, a pod is the smallest resource management component and the smallest resource object for running containerized applications. The real-time performance indicators may be the current memory indicators, CPU indicators and so on of the pod containers and host nodes.

It can be understood that the fault classification equipment can monitor the real-time performance indicator data of the resource management components in the real-time consumption system, such as pod containers and host nodes, and can aggregate this real-time performance data by minute.
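The minute-level aggregation mentioned above can be sketched as follows. This is only an illustration: the text does not say which aggregate is applied, so the per-minute mean is an assumption, and `aggregate_by_minute` and the `(timestamp, value)` sample layout are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_minute(samples):
    """Group (timestamp, value) samples into one-minute buckets.

    Returns a dict mapping the start of each minute to the mean of the
    values observed in that minute, ordered by time. Using the mean is an
    assumption; the source only says the data is aggregated by minute.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its minute.
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


# Hypothetical CPU readings for one indicator.
readings = [
    (datetime(2023, 7, 6, 18, 12, 5), 40.0),
    (datetime(2023, 7, 6, 18, 12, 35), 60.0),
    (datetime(2023, 7, 6, 18, 13, 1), 70.0),
]
per_minute = aggregate_by_minute(readings)
```

Each resulting value is one "unit point" of the series that the later detection step consumes.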

Step S20: Perform anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result.

It should be noted that real-time performance indicators, once aggregated, usually amount to a large volume of data containing many spikes (glitch points). To eliminate their influence, the mean of the last 2 unit points is used as the detection target value for the point under test, which removes the influence of spikes while maintaining high performance.

In a specific implementation, the aggregated real-time performance indicator data is checked with the tail2-3sigma algorithm. To preserve accuracy without sacrificing performance, the mean of the last 2 unit points of the aggregated data is taken as the target point. The target point is tested to judge whether the real-time performance indicator data is abnormal, and the judgment yields the detection result for the current real-time performance indicator data.
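One possible reading of the tail2-3sigma check is sketched below. The source does not give the exact formula, so this is an assumption: the baseline mean and standard deviation are taken from the points before the tail, and the tail mean is flagged when it deviates from the baseline mean by more than three standard deviations. The function name and parameters are hypothetical.

```python
from statistics import mean, pstdev


def tail_mean_3sigma(values, tail=2, k=3.0):
    """Return True when the mean of the last `tail` points is abnormal.

    Assumed rule: abnormal if the tail mean lies more than `k` standard
    deviations away from the mean of the earlier (baseline) points.
    """
    if len(values) <= tail + 1:
        raise ValueError("need at least tail + 2 points")
    history, recent = values[:-tail], values[-tail:]
    mu = mean(history)
    sigma = pstdev(history)
    target = mean(recent)  # detection target value for the point under test
    if sigma == 0:
        # Flat baseline: any deviation of the tail mean counts as abnormal.
        return target != mu
    return abs(target - mu) > k * sigma


baseline = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50]
single_spike = tail_mean_3sigma(baseline + [50, 55])     # one-point spike is averaged out
sustained_shift = tail_mean_3sigma(baseline + [56, 57])  # shift across both tail points
```

With this baseline, the lone value 55 would exceed the 3-sigma band if tested on its own, but averaging it with its neighbour keeps the target inside the band, which is the spike-damping effect the text describes.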

Step S30: Input the anomaly detection result into the fault identification model for fault identification to obtain the fault identification result.

It should be noted that the fault identification model is obtained by training an initial fault identification model and can identify faults from the detection results; which fault identification model is used is determined by the indicator group currently being checked.

In a specific implementation, the binary fault identification model corresponding to the indicator group currently being checked is selected. The binary model judges from the input value under test whether the corresponding performance indicator is faulty; in particular, when no indicator group is abnormal, the detection moment is normal. For example, if the current inputs are the CPU indicator group of the host node and the memory indicator group of the dynamic pod containers on the host node, the host CPU binary fault model and the host dynamic pod memory binary fault model are selected, and each judges its own indicator group to determine the current fault identification result. When at least one abnormal result occurs, the abnormal result and its indicator group are recorded and output together once fault identification is complete. If no indicator group produces an abnormal result, a no-fault identification result may be returned.
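The per-group model selection described above can be sketched as a simple dispatch table. Everything here is illustrative: the group names, `identify_faults`, and `make_threshold_model` are hypothetical, and the threshold rule merely stands in for a trained binary model, which the source does not specify.

```python
from typing import Callable, Dict, List


def make_threshold_model(min_abnormal: int) -> Callable[[List[int]], bool]:
    """Stand-in for a trained binary model (assumption, not the patent's model).

    The group is judged faulty when at least `min_abnormal` of its unit
    indicators were flagged abnormal (1) by anomaly detection.
    """
    def model(flags: List[int]) -> bool:
        return sum(flags) >= min_abnormal
    return model


def identify_faults(detections: Dict[str, List[int]],
                    models: Dict[str, Callable[[List[int]], bool]]) -> dict:
    """Run each indicator group through its own model and collect abnormal groups."""
    abnormal_groups = {}
    for group, flags in detections.items():
        if models[group](flags):          # select the model by indicator group
            abnormal_groups[group] = flags
    # If no group is abnormal, the detection moment is reported as fault-free.
    return {"faulty": bool(abnormal_groups), "abnormal_groups": abnormal_groups}


models = {"node_cpu": make_threshold_model(2), "pod_memory": make_threshold_model(1)}
result = identify_faults(
    {"node_cpu": [1, 0, 1, 0, 0], "pod_memory": [0, 0, 0, 0, 0]},
    models,
)
```

Abnormal groups are accumulated and returned in one result, mirroring the "record, then output together" behaviour in the text.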

Step S40: Input the fault identification result into the fault classification model for fault prediction to obtain the fault category at the current moment.

It should be noted that the fault classification model is obtained by training an initial fault classification model: the fault identification model produces a number of fault identification results, and these are used as the input of the initial fault classification model for training.

In a specific implementation, the fault identification results are used as the input of the fault classification model, which classifies the fault identification results together with their corresponding performance indicators. A single fault category can have many causes, so one category often corresponds to multiple performance indicators and identification results. The fault classification model uses the xgboost algorithm. When classifying faults, each fault identification result is first labeled with a number that serves as its unique id, and features are extracted from each result. Feature extraction may use statistics such as variance, mean and skewness, or may fit the features, for example with an isolation forest, to produce a feature representation. A feature set is generated from the extracted features, and fault classification is performed with the xgboost algorithm to obtain the fault classification result.
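The statistical feature extraction mentioned above (mean, variance, skewness) and the assignment of a unique id per result can be sketched with the standard library alone. The function names and row layout are hypothetical, population moments are an assumption, and the classifier itself is deliberately left out: the resulting rows would be what a gradient-boosted classifier such as `xgboost.XGBClassifier` is trained on.

```python
from statistics import mean, pvariance


def extract_features(series):
    """Return mean, population variance and population skewness of a series."""
    mu = mean(series)
    var = pvariance(series, mu)
    if var == 0:
        skew = 0.0  # a constant series has no asymmetry
    else:
        third_moment = sum((x - mu) ** 3 for x in series) / len(series)
        skew = third_moment / var ** 1.5
    return {"mean": mu, "variance": var, "skewness": skew}


def build_feature_table(results):
    """Turn {indicator_group: series} into rows, one per result, with a unique id."""
    rows = []
    for sample_id, (group, series) in enumerate(sorted(results.items())):
        rows.append({"id": sample_id, "group": group, **extract_features(series)})
    return rows


table = build_feature_table({
    "node_cpu": [1, 2, 3],
    "pod_memory": [1, 1, 1, 10],
})
```

Each row pairs a sample id with the features of one indicator group, which matches the two-dimensional table of sample serial numbers and indicator groups described in the summary.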

Step S50: Generate fault information according to the fault category at the current moment.

In a specific implementation, after the fault classification result is obtained, fault information is generated from it. The fault information includes the fault classification result and the corresponding fault indicator groups. Once generated, the fault information can be sent to the operation and maintenance system so that operation and maintenance personnel learn the cause of the fault promptly and can repair it quickly.
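As a rough illustration, the fault information could be serialized as a small JSON message before being sent to the operation and maintenance system. The field names, the `build_fault_info` helper and the use of JSON are all assumptions; the source only states that the message carries the classification result and the corresponding indicator groups.

```python
import json
from datetime import datetime


def build_fault_info(category, indicator_groups, detected_at):
    """Serialize a fault record; all field names are hypothetical."""
    return json.dumps(
        {
            "fault_category": category,
            "indicator_groups": sorted(indicator_groups),
            "detected_at": detected_at.isoformat(),
        },
        sort_keys=True,
    )


message = build_fault_info(
    "k8s container cpu load",
    ["node_cpu"],
    datetime(2023, 7, 6, 18, 12, 23),  # arbitrary example timestamp
)
```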

This embodiment obtains real-time performance indicators of pod containers and host nodes in a real-time consumption system; performs anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result; inputs the anomaly detection result into a fault identification model for fault identification to obtain a fault identification result; inputs the fault identification result into a fault classification model for fault prediction to obtain the fault category at the current moment; and generates fault information according to the fault category at the current moment. Performing anomaly detection on the mean of several unit points at the tail of the real-time performance indicators eliminates the influence of spikes (glitch points) while maintaining high performance, and performing fault identification and fault classification on the anomaly detection result speeds up both.

Referring to Figure 3, Figure 3 is a schematic flow chart of the second embodiment of the fault classification method of the present invention.

Based on the first embodiment above, the fault classification method of this embodiment further includes, before step S30:

Step S301: Obtain historical performance indicators of the pod containers and host nodes;

Step S302: Traverse historical fault samples in chronological order to obtain a set of historical fault occurrence times;

Step S303: Determine, according to the set of historical fault occurrence times, the historical performance indicator data corresponding to each historical fault time;

Step S304: Perform anomaly detection on each item of historical performance indicator data to obtain anomaly detection results;

Step S305: Input the anomaly detection results into the initial fault identification model for training to obtain the fault identification model.

In a specific implementation, after the historical performance indicators of the pod containers and host nodes are obtained, they can be grouped by resource type into K groups of core indicators closely related to system faults, for example:

CPU indicator group of the system host node:

system.cpu.user,

system.load.1,

system.cpu.pct_usage,

system.load.5,

system.load.15

Memory indicator group of the dynamic pod containers on the host node:

container_fs_writes./dev/vda,

container_fs_inodes./dev/vda1,

container_fs_reads_MB./dev/vda,

container_file_descriptors,

container_fs_writes_MB./dev/vda

Here, items such as system.cpu.user are performance indicators. When training the fault identification model, the occurrence times of the historical faults are traversed one by one in chronological order from the historically labeled fault samples. For each fault time point, the system's historical performance indicator data for the 3 hours preceding that time point is queried and aggregated by minute. The historically labeled fault samples contain n*k faults, where n is the number of fault samples. During training, the first fault time point is determined first; if it is 18:12:23, the historical performance indicator data of the preceding 3 hours is queried, that is, the historical performance indicators in the period from 15:12:23 to 18:12:23 are obtained and aggregated by minute. This process is repeated until all historically labeled fault samples have been traversed. In this way, n*k sets of historical performance indicator panel data corresponding to the n*k fault moments are finally obtained. For each set of panel data, anomaly detection is performed per indicator group using the tail2-3sigma anomaly detection algorithm, yielding n*k anomaly detection results, where each indicator group contains 1 to m specific performance unit indicators. The anomaly detection results are then input into the initial fault identification model for training to obtain the fault identification model.
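The windowing and detection steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `aggregate_by_minute` and `tail2_3sigma` are hypothetical, and the choice of baseline (mean and population standard deviation of all points before the tail) is an assumption, since the text only states that the mean of the tail points is tested with a 3-sigma rule.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def aggregate_by_minute(samples, fault_time, hours=3):
    """Average raw (timestamp, value) samples per minute over the window
    [fault_time - hours, fault_time]; returns the values in time order."""
    start = fault_time - timedelta(hours=hours)
    buckets = defaultdict(list)
    for ts, value in samples:
        if start <= ts <= fault_time:
            buckets[ts.replace(second=0, microsecond=0)].append(value)
    return [mean(buckets[minute]) for minute in sorted(buckets)]


def tail2_3sigma(series, tail=2, n_sigma=3.0):
    """Return 1 (anomalous) or 0 (normal) for the tail of `series`.

    The mean of the last `tail` points is used as the value under test,
    which dampens a single spike. Assumption: the baseline is the mean and
    population standard deviation of the points before the tail.
    """
    if len(series) < tail + 2:
        raise ValueError("series too short to form a baseline")
    history, tail_points = series[:-tail], series[-tail:]
    mu, sigma = mean(history), pstdev(history)
    target = mean(tail_points)
    if sigma == 0:
        return int(target != mu)
    return int(abs(target - mu) > n_sigma * sigma)
```

With a baseline of `[10, 12, 8, 10, 12, 8, 10, 10]`, a single spike in the tail `[10, 16]` is averaged to 13 and stays inside the 3-sigma band, whereas a sustained shift `[16, 16]` is flagged.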

Further, after the anomaly detection results are input into the initial fault identification model for training to obtain the fault identification model, the method also includes:

inputting the new abnormal indicator set into the initial fault classification model for training to obtain a fault classification model, wherein this training further includes: analyzing the sample labels to determine the fault category;

obtaining the fault classification model by training with a fault classification algorithm according to the fault category and the fault identification result.

It should be noted that the first abnormal indicator set within the first preset time refers to the set formed by the abnormal indicators that appear within a period of time after the fault occurs, and the second abnormal indicator set within the second preset time before the fault moment refers to the set formed by the abnormal indicators that appeared within a period of time before the fault occurred.

In a specific implementation, the abnormal indicators that appear within a period of time after the fault moment can be counted. This period can be 2 minutes, 3 minutes, and so on, preferably 2 minutes, and can also be set according to the actual situation. The abnormal indicators within 2 minutes after the fault moment are recorded to obtain the first abnormal indicator set. The abnormal indicators that appeared during a period before the fault moment, for example within the 5 minutes before it, are likewise recorded to obtain the second abnormal indicator set; this period can also be set according to the specific situation, and this embodiment places no limitation on it. After the first and second abnormal indicator sets are obtained, the first set is filtered in order to optimize the data set: abnormal indicators that also appear in the second set are removed, leaving the abnormal indicators that are new to this fault. These new abnormal indicators can be regarded as anomalies that arose only after the fault occurred. The indicators remaining after filtering form the new abnormal indicator set, which is used as the input of the initial fault classification model for training to obtain the fault classification model.
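The filtering described above amounts to a set difference between the post-fault and pre-fault anomaly sets. A minimal sketch, assuming anomalies are available as `(timestamp, indicator_name)` records (a hypothetical representation) and using the 2-minute and 5-minute windows given as examples in the text:

```python
from datetime import datetime, timedelta


def new_anomaly_indicators(anomalies, fault_time,
                           after=timedelta(minutes=2),
                           before=timedelta(minutes=5)):
    """Return indicators that are anomalous after the fault but not before.

    anomalies: iterable of (timestamp, indicator_name) anomaly records.
    """
    # First abnormal indicator set: anomalies within `after` of the fault.
    first_set = {name for ts, name in anomalies
                 if fault_time <= ts <= fault_time + after}
    # Second abnormal indicator set: anomalies within `before` of the fault.
    second_set = {name for ts, name in anomalies
                  if fault_time - before <= ts < fault_time}
    # New abnormal indicator set: drop anything already abnormal beforehand.
    return first_set - second_set
```

An indicator that was already abnormal three minutes before the fault is dropped even if it is still abnormal afterwards, so only indicators that turned abnormal at the fault remain.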

Further, inputting the anomaly detection results into the initial fault identification model for training to obtain the fault identification model includes:

marking the abnormal indicators in the new abnormal indicator set to obtain sample labels, where the new abnormal indicator set includes at least one indicator group and each indicator group includes at least one performance unit indicator;

It should be noted that the sample labels can be used to distinguish the abnormal performance indicators of different groups. The new abnormal indicator set obtained after filtering contains n*k anomaly detection results. These n*k anomaly detection results are trained with the logistic regression algorithm to obtain k binary fault identification models. The form of the input data is shown in Table 1:

During training, since the dependent variable in the logistic regression model takes only the two values 0 and 1, for the p independent variables x_i the probability that y takes the value 1 is denoted p = P(y = 1 | X), and the probability that it takes the value 0 is 1 - p. Logistic regression uses the following formula, where h_θ(x) denotes the probability that the result is 1:
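In its standard form, assumed here to be the formula intended, the logistic regression hypothesis is written with a parameter vector θ (a symbol implied by the notation h_θ(x) but not otherwise defined in the text):

```latex
h_\theta(x) = P(y = 1 \mid x) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad
P(y = 0 \mid x) = 1 - h_\theta(x)
```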

In Table 1, the first column, id, is the serial number of the sample; each indicator group has n fault label samples in total. Because the number of fault label samples may differ between indicator groups, the number of samples per group is uniformly set to n to simplify training. The last column, y, is the label of the sample and indicates the fault status; for example, 1 means that the indicator group is currently faulty. The middle columns, 2 to 7, give the names of the specific indicators in the indicator group and the corresponding anomaly detection results.

Training the initial fault identification model yields k binary fault identification models. As more historical fault samples are labeled, these models can be retrained and updated dynamically, so that their prediction quality improves as the fault label samples grow richer, making fuller use of supervised classification.
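Training one binary model per indicator group can be sketched as below. This is an illustrative, dependency-free version using plain batch gradient descent; the function names, learning rate, and epoch count are assumptions, and a production system would more likely use an established library implementation of logistic regression.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss.

    rows: list of feature lists (0/1 anomaly detection results per indicator).
    labels: list of 0/1 fault labels, one per row.
    """
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


def predict_proba(model, x):
    """Probability that the indicator group is faulty for feature list x."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_group_models(tables):
    """tables: {group_name: (rows, labels)} -> {group_name: fitted model}."""
    return {group: train_logistic(rows, labels)
            for group, (rows, labels) in tables.items()}
```

Retraining as new labeled faults arrive is then a matter of calling `train_group_models` again on the enlarged tables.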

Further, inputting the fault identification result into the fault classification model for fault prediction to obtain the fault category at the current moment includes:

It should be noted that the sample serial number refers to the serial number of a result in the fault identification result, and the two-dimensional table is used to reflect the fault information across the fault identification results.

In a specific implementation, the sample serial numbers and the corresponding indicator groups in the current fault identification result are determined first, and a two-dimensional table is built from them to represent the specific faulty performance indicators in the current fault result, as shown in Table 2:

Here the first column, id, is the serial number of the sample, for a total of n*7 fault label samples. In practice the number of fault label samples may differ between indicator groups; for convenience, the number of samples per indicator group is set uniformly to n. The last column, y, is the fault category label of the sample: for example, 1 denotes a system CPU fault, 2 a pod container CPU fault, 3 a system memory fault, and 4 a pod container read IO fault. The middle columns, 2 to 7, give the category name of each indicator group and the result of the corresponding fault identification model, where 1 means faulty and 0 means normal. The fault status of each indicator group is used to build decision trees with the xgboost algorithm, and the optimal point in the decision trees gives the fault type at the current moment.
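The two-dimensional table described above can be assembled as in the following sketch. The category mapping is taken from the text; the function name `build_table` and the indicator group names in the usage example are hypothetical. Fitting the classifier itself is left out, since it depends on the xgboost library.

```python
# Fault categories as enumerated in the description.
FAULT_CATEGORIES = {
    1: "system CPU fault",
    2: "pod container CPU fault",
    3: "system memory fault",
    4: "pod container read IO fault",
}


def build_table(group_names, samples):
    """Build the header and rows of the classification input table.

    samples: list of (per_group_results, label) pairs, where
    per_group_results maps an indicator group name to 0 (normal) or
    1 (faulty), and label is a key of FAULT_CATEGORIES (or None when the
    category is still to be predicted).
    """
    header = ["id", *group_names, "y"]
    rows = []
    for sample_id, (results, label) in enumerate(samples, start=1):
        # Groups with no identification result are treated as normal (0).
        rows.append([sample_id, *(results.get(g, 0) for g in group_names), label])
    return header, rows
```

For instance, `build_table(["system_cpu", "pod_cpu"], [({"system_cpu": 1}, 1)])` yields one row whose middle columns are the per-group results and whose last column is the category label. The middle columns of each row would then serve as the feature vector, and `y` as the target, for a gradient-boosted tree classifier.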

This embodiment uses the mean of the last two values of the target time series as the value of the point to be detected, which eliminates the error introduced by burr points (isolated spikes). It also filters the data from a period before and after the fault moment, removing the abnormal information that was already present before the fault so as to obtain the new anomalies produced at the current fault moment. The corresponding fault identification model is then selected for fault identification, and the final fault type is determined by the fault classification model using the xgboost algorithm. This achieves fast anomaly detection together with fault identification and fault classification.

In addition, an embodiment of the present invention also provides a storage medium on which a fault classification program is stored; when the fault classification program is executed by a processor, the steps of the fault classification method described above are implemented.

Referring to Figure 4, Figure 4 is a structural block diagram of the first embodiment of the fault classification device of the present invention.

As shown in Figure 4, the fault classification device proposed by the embodiment of the present invention includes:

an indicator acquisition module 10, used to obtain real-time performance indicators of pod containers and host nodes in the real-time consumption system;

an anomaly detection module 20, used to perform anomaly detection on the mean of several unit points at the tail of the real-time performance indicators to obtain an anomaly detection result;

a fault identification module 30, used to input the anomaly detection result into the fault identification model for fault identification to obtain a fault identification result;

a fault classification module 40, used to input the fault identification result into the fault classification model for fault prediction to obtain the fault category at the current moment;

an information output module 50, used to generate fault information according to the fault category at the current moment.
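The five modules form a linear pipeline in which each module consumes the previous module's output. A minimal structural sketch, assuming each module can be represented as a callable (the class name `FaultClassificationPipeline` and its field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FaultClassificationPipeline:
    acquire: Callable[[], Any]       # indicator acquisition module 10
    detect: Callable[[Any], Any]     # anomaly detection module 20
    identify: Callable[[Any], Any]   # fault identification module 30
    classify: Callable[[Any], Any]   # fault classification module 40
    report: Callable[[Any], str]     # information output module 50

    def run(self) -> str:
        """Run the modules in order and return the generated fault information."""
        indicators = self.acquire()
        anomalies = self.detect(indicators)
        identified = self.identify(anomalies)
        category = self.classify(identified)
        return self.report(category)
```

Keeping the modules as injected callables lets each stage, for example the identification models, be retrained or replaced without touching the others.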

In one embodiment, the fault identification module 30 is also used to obtain historical performance indicators of pod containers and host nodes; group the historical performance indicators to obtain several groups of core indicators; traverse historical fault samples in chronological order to obtain a set of historical fault occurrence times; determine, according to the set of historical fault occurrence times, the historical performance indicator data corresponding to each historical fault time; perform anomaly detection on each set of historical performance indicator data to obtain anomaly detection results; and input the anomaly detection results into the initial fault identification model for training to obtain the fault identification model.

In one embodiment, the fault identification module 30 is also used to analyze the fault identification result to obtain a first abnormal indicator set within a first preset time of the fault moment; obtain a second abnormal indicator set that appeared within a second preset time before the fault moment; filter the first abnormal indicator set according to the second abnormal indicator set to obtain a new abnormal indicator set that appears after the fault moment; and input the new abnormal indicator set into the initial fault classification model for training to obtain the fault classification model.

In one embodiment, the fault identification module 30 is also used to mark the abnormal indicators in the new abnormal indicator set to obtain sample labels, where the new abnormal indicator set includes at least one indicator group and each indicator group includes at least one performance unit indicator; determine the abnormal state of the current indicator group according to the sample labels; when the current indicator group is in an abnormal state, count the number of performance unit indicators in the current indicator group that are in an abnormal state, together with the sample labels; and train a corresponding number of fault identification models according to the number of performance unit indicators and the sample labels, there being at least one fault identification model.

In one embodiment, the fault identification module 30 is also used to analyze the sample labels to determine the fault category, and to obtain the corresponding fault identification result according to the indicator group corresponding to the sample labels.

In one embodiment, the fault identification module 30 is also used to analyze the anomaly detection result and determine the grouping information of the real-time performance indicators; call the corresponding fault identification model according to the grouping information; input the anomaly detection result into the corresponding fault identification model so that the model analyzes the anomaly detection result and produces an analysis result, which includes the status of the real-time performance indicators; and, when the analysis result shows that a real-time performance indicator is abnormal, obtain the fault identification result for that indicator.

In one embodiment, the fault classification module 40 is also used to determine the sample serial number according to the fault identification result; generate a two-dimensional table from the sample serial number and the corresponding indicator group; perform fault prediction on the two-dimensional table with the xgboost algorithm to obtain a fault prediction result; and obtain the fault category at the current moment according to the prediction result.

It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In specific applications, those skilled in the art can make settings as needed, and the present invention places no limitation on this.

It should be understood that although the steps in the flow charts of the embodiments of the present application are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed one after another but may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.

It should be noted that the workflow described above is only illustrative and does not limit the scope of protection of the present invention. In practical applications, those skilled in the art can select some or all of it according to actual needs to achieve the purpose of this embodiment, and no limitation is imposed here.

Furthermore, it should be noted that, as used herein, the terms "include", "comprise", and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as read-only memory (ROM)/RAM, a magnetic disk, or an optical disk) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. A fault classification method, characterized in that the fault classification method includes:

Obtain real-time performance indicators of pod containers and host nodes in the real-time consumption system;

Perform anomaly detection on the mean value of several unit points at the tail of the real-time performance indicator to obtain an anomaly detection result;

Input the abnormal detection results into the fault identification model for fault identification, and obtain the fault identification results;

Input the fault identification results into the fault classification model for fault prediction to obtain the fault category at the current moment;

Fault information is generated according to the fault category at the current moment.

2. The method according to claim 1, characterized in that, before inputting the anomaly detection result into a fault identification model for fault identification and obtaining the fault identification result, the method further includes:

Obtain historical performance indicators of pod containers and host nodes;

Group the historical performance indicators to obtain several groups of core indicators;

Traverse historical fault samples in chronological order to obtain a set of historical fault occurrence times;

Determine the historical performance indicator data corresponding to each historical fault node time according to the historical fault occurrence time set;

Perform anomaly detection on each of the historical performance indicator data to obtain anomaly detection results;

The anomaly detection results are input into the initial fault identification model for training to obtain a fault identification model.

3. The method according to claim 2, wherein the anomaly detection result is input into an initial fault identification model for training, and after the fault identification model is obtained, the method further includes:

Analyze the fault identification results to obtain the first set of abnormal indicators within the first preset time at the fault moment;

Obtain a second set of abnormal indicators that occurred within a second preset time before the fault moment;

Filter the first abnormal indicator set according to the second abnormal indicator set to obtain a new abnormal indicator set that appears after the fault moment;

The new anomaly indicator set is input to the initial fault classification model for training to obtain a fault classification model.

4. The method of claim 2, wherein said inputting the anomaly detection results into an initial fault identification model for training to obtain a fault identification model includes:

Mark the abnormal indicators in the new abnormal indicator set to obtain sample labels. The new abnormal indicator set includes at least one indicator group, and each of the indicator groups includes at least one performance unit indicator;

Determine the abnormal status of the current indicator group according to the sample label;

When the current indicator group is in an abnormal state, count the number of performance unit indicators in the current indicator group that are in an abnormal state and the sample labels;

According to the number of performance unit indicators and the sample labels, a corresponding number of fault identification models are obtained through training, and there is at least one fault identification model.

5. The method of claim 4, wherein the new abnormal indicator set is input to an initial fault classification model for training to obtain a fault classification model, including:

Analyze the sample labels to determine the fault category;

Obtain the corresponding fault identification result according to the indicator group corresponding to the sample label;

A fault classification model is obtained by training a fault classification algorithm according to the fault category and the fault identification result.

6. The method of claim 1, wherein said inputting the anomaly detection result into a fault identification model for fault identification to obtain a fault identification result includes:

Analyze the abnormality detection results and determine the grouping information of the real-time performance indicators;

Call the corresponding fault identification model according to the grouping information;

Input the abnormal detection results into the corresponding fault identification model, causing the corresponding fault identification model to analyze the abnormal detection results to obtain analysis results, where the analysis results include the status of the real-time performance indicators;

When there is an abnormality in the real-time performance index in the analysis result, a fault identification result of the real-time performance index is obtained.

7. The method of claim 1, wherein said inputting the fault identification result into a fault classification model for fault prediction to obtain the fault category at the current moment includes:

Determine the sample serial number according to the fault identification result;

Generate a two-dimensional table according to the sample serial number and the corresponding indicator group;

Use the xgboost algorithm to perform fault prediction based on the two-dimensional table to obtain fault prediction results;

The fault category at the current moment is obtained according to the prediction result.

8. A fault classification device, characterized in that the fault classification device includes:

The indicator acquisition module is used to obtain real-time performance indicators of pod containers and host nodes in the real-time consumption system;

An anomaly detection module, used to perform anomaly detection on the average value of several unit points at the tail of the real-time performance indicator to obtain an anomaly detection result;

A fault identification module, used to input the abnormal detection results into the fault identification model for fault identification, and obtain the fault identification results;

A fault classification module, used to input the fault identification results into the fault classification model for fault prediction, and obtain the fault category at the current moment;

An information output module is used to generate fault information according to the fault category at the current moment.

9. A fault classification device, characterized in that the device includes: a memory, a processor, and a fault classification program stored on the memory and operable on the processor, and the fault classification program is configured to implement the steps of the fault classification method according to any one of claims 1 to 7.

10. A storage medium, characterized in that a fault classification program is stored on the storage medium, and when the fault classification program is executed by a processor, the steps of the fault classification method according to any one of claims 1 to 7 are implemented.