CN116955071A - Fault classification method, device, equipment and storage medium - Google Patents

Fault classification method, device, equipment and storage medium

Info

Publication number
CN116955071A
Authority
CN
China
Prior art keywords
fault
result
index
abnormal
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310827081.1A
Other languages
Chinese (zh)
Inventor
王春华
陈劼
王菁菁
宋潇
文韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Jiangsu Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Jiangsu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Jiangsu Co Ltd
Priority to CN202310827081.1A
Publication of CN116955071A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application belongs to the technical field of data analysis and discloses a fault classification method, device, equipment and storage medium. The method acquires real-time performance indexes of the pod container and the host node in a real-time consumption system; performs anomaly detection on the average value of a plurality of single points at the tail of the real-time performance index to obtain an anomaly detection result; inputs the anomaly detection result into a fault recognition model for fault recognition to obtain a fault recognition result; inputs the fault recognition result into a fault classification model for fault prediction to obtain the fault category at the current moment; and generates fault information according to the fault category at the current moment. Performing anomaly detection on the average value of several tail points eliminates the influence of glitch (burr) points while maintaining high performance, and performing fault identification and fault classification on the anomaly detection result speeds up both.

Description

Fault classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a fault classification method, device, apparatus, and storage medium.
Background
In a conventional operation and maintenance system under a micro-service architecture, fault identification and fault classification generally depend on the effect of anomaly detection: whether a component has failed, and the cause of the failure, are determined from the anomaly detection results. Fault classification assigns an identified fault to a specific category, for example k8s container CPU load, k8s container I/O load, or node memory consumption. Two methods are commonly used in the prior art. a) A certain number of core indexes closely related to faults are selected from the system based on manual expert experience; after these important indexes are screened out, each historical time point (generally at minute granularity) is labeled as faulty or not, and a comprehensive classification model is then trained on the historical data of the indexes to identify faults and fault categories. b) Anomaly detection is performed separately on single dimensions such as golden business indexes, performance indexes, log data and call chain data; the anomaly detection results of all dimensions are combined to locate the fault time; the root cause of the fault is judged from the anomaly scores of the components and the historical call relationships among them, the abnormal index of the root-cause component generally being taken as the root cause; and the fault category is judged from the category of that index. These methods suffer from problems such as severely imbalanced sample data, long labeling time, and low reliability because the anomaly detection score of an index cannot represent the severity of a fault.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The application mainly aims to provide a fault classification method, device, equipment and storage medium, and aims to solve the technical problem that the prior art cannot simultaneously realize rapid abnormality detection, fault identification and fault classification.
To achieve the above object, the present application provides a fault classification method, comprising the steps of:
acquiring real-time performance indexes of a pod container and a host node in a real-time consumption system;
performing anomaly detection on the average value of a plurality of single points at the tail of the real-time performance index to obtain an anomaly detection result;
inputting the abnormal detection result into a fault recognition model to perform fault recognition to obtain a fault recognition result;
inputting the fault identification result into a fault classification model for fault prediction to obtain a fault class at the current moment;
and generating fault information according to the fault category at the current moment.
Optionally, before inputting the abnormality detection result into the fault recognition model for fault recognition to obtain the fault recognition result, the method further includes:
acquiring historical performance indexes of the pod container and the host node;
grouping the historical performance indexes to obtain a plurality of groups of core indexes;
traversing the historical fault samples in chronological order to obtain a historical fault occurrence time set;
determining historical performance index data corresponding to each historical fault node time according to the historical fault occurrence time set;
performing anomaly detection on each historical performance index data to obtain an anomaly detection result;
and inputting the abnormality detection result into an initial fault recognition model for training to obtain a fault recognition model.
Optionally, after the step of inputting the abnormality detection result into an initial fault recognition model for training to obtain a fault recognition model, the method further includes:
analyzing the fault identification result to obtain a first abnormal index set within a first preset time of a fault moment;
acquiring a second abnormal index set occurring in a second preset time before the fault moment;
filtering the first abnormal index set according to the second abnormal index set to obtain a new abnormal index set after the fault moment;
and inputting the new abnormal index set into an initial fault classification model for training to obtain a fault classification model.
Optionally, the inputting the abnormality detection result into an initial fault recognition model for training to obtain a fault recognition model includes:
marking the abnormal indexes in a new abnormal index set to obtain a sample label, wherein the new abnormal index set comprises at least one index group, and each index group comprises at least one performance unit index;
judging the abnormal state of the current index group according to the sample label;
when the current index group is in an abnormal state, counting the number of performance unit indexes in the current index group which are in the abnormal state and the sample label;
and training according to the number of the performance unit indexes and the sample labels to obtain a corresponding number of fault identification models, wherein the number of the fault identification models is at least one.
Optionally, inputting the new abnormal index set into an initial fault classification model for training to obtain a fault classification model includes:
analyzing the sample label to determine fault types;
obtaining a corresponding fault identification result according to the index group corresponding to the sample label;
and training through a fault classification algorithm according to the fault category and the fault recognition result to obtain a fault classification model.
Optionally, the inputting the abnormality detection result to a fault recognition model for fault recognition to obtain a fault recognition result includes:
analyzing the abnormal detection result to determine grouping information of the real-time performance index;
calling a corresponding fault identification model according to the grouping information;
inputting the abnormal detection result into the corresponding fault identification model, so that the corresponding fault identification model analyzes the abnormal detection result to obtain an analysis result, wherein the analysis result comprises the state of the real-time performance index;
and when the real-time performance index is abnormal in the analysis result, obtaining a fault identification result of the real-time performance index.
Optionally, inputting the fault recognition result to a fault classification model for fault prediction to obtain a fault class at the current moment, including:
determining a sample sequence number according to the fault identification result;
generating a two-dimensional table according to the sample sequence number and the corresponding index group;
performing fault prediction by using an xgboost algorithm according to the two-dimensional table to obtain a fault prediction result;
and obtaining the fault category at the current moment according to the prediction result.
In addition, to achieve the above object, the present application also proposes a fault classification device including:
the index acquisition module is used for acquiring real-time performance indexes of the pod container and the host node in the real-time consumption system;
the abnormality detection module is used for carrying out abnormality detection on the average value of a plurality of single points at the tail of the real-time performance index to obtain an abnormality detection result;
the fault identification module is used for inputting the abnormal detection result into a fault identification model to carry out fault identification so as to obtain a fault identification result;
the fault classification module is used for inputting the fault identification result into a fault classification model to perform fault prediction so as to obtain a fault class at the current moment;
and the information output module is used for generating fault information according to the fault category at the current moment.
In addition, to achieve the above object, the present application also proposes a fault classification apparatus including: a memory, a processor, and a fault classification program stored on the memory and executable on the processor, the fault classification program configured to implement the steps of the fault classification method as described above.
In addition, to achieve the above object, the present application also proposes a storage medium having stored thereon a fault classification program which, when executed by a processor, implements the steps of the fault classification method as described above.
The application acquires real-time performance indexes of the pod container and the host node in a real-time consumption system; performs anomaly detection on the average value of a plurality of single points at the tail of the real-time performance index to obtain an anomaly detection result; inputs the anomaly detection result into a fault recognition model for fault recognition to obtain a fault recognition result; inputs the fault recognition result into a fault classification model for fault prediction to obtain the fault category at the current moment; and generates fault information according to the fault category at the current moment. Performing anomaly detection on the average value of several tail points eliminates the influence of glitch (burr) points while maintaining high performance, and performing fault identification and fault classification on the anomaly detection result speeds up both.
Drawings
FIG. 1 is a schematic diagram of a fault classification device of a hardware operating environment according to an embodiment of the present application;
FIG. 2 is a flow chart of a first embodiment of the fault classification method according to the present application;
FIG. 3 is a flow chart of a second embodiment of the fault classification method according to the present application;
FIG. 4 is a block diagram showing the structure of a first embodiment of the fault classification device according to the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a fault classification device of a hardware running environment according to an embodiment of the present application.
As shown in fig. 1, the fault classification apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the structure shown in fig. 1 does not constitute a limitation of the fault classification apparatus, and may include more or fewer components than shown, or may combine certain components, or may be arranged in a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a failure classification program may be included in the memory 1005 as one type of storage medium.
In the fault classification device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the fault classification apparatus of the present application may be provided in a fault classification apparatus that invokes a fault classification program stored in the memory 1005 through the processor 1001 and executes the fault classification method provided by the embodiment of the present application.
An embodiment of the present application provides a fault classification method, referring to fig. 2, and fig. 2 is a schematic flow chart of a first embodiment of a fault classification method according to the present application.
In this embodiment, the fault classification method includes the following steps:
Step S10: acquiring real-time performance indexes of the pod container and the host node in the real-time consumption system.
It should be noted that, the execution body of the embodiment is a fault classification device, where the fault classification device has functions of data processing, data communication, program running, and the like, and the fault classification device may be an integrated controller, a control computer, and other devices with similar functions, and the embodiment is not limited to this.
It should be understood that the real-time consumption system may be a Kafka real-time consumption system. Such a real-time consumption system is highly concurrent and can serve access requests from multiple users at the same point in time. The pod container is the smallest resource management component in k8s and also the smallest resource object for running a containerized application. The real-time performance indexes may be, for example, the current memory indexes and CPU indexes of the pod container and the host node.
It will be appreciated that the fault classification device can monitor the real-time performance index data of each resource management component in the real-time consumption system, such as the pod containers, the host nodes and other resource components, and can aggregate the real-time performance data in units of minutes.
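As a minimal sketch of the per-minute aggregation described above, the following Python snippet averages raw samples that fall within the same minute. The sample layout (timestamp, value), the function name `aggregate_by_minute` and the example date are assumptions made for illustration; the application does not specify the aggregation function, and the mean is used here only as one plausible choice.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_minute(samples):
    """Average (timestamp, value) samples that fall within the same minute.

    `samples` is an iterable of (datetime, float) pairs; this layout is an
    assumption for illustration, not something the application prescribes.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its minute.
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    # Return one averaged point per minute, in chronological order.
    return [(minute, mean(values)) for minute, values in sorted(buckets.items())]


# Hypothetical raw CPU samples collected every 30 seconds or so.
raw = [
    (datetime(2023, 1, 1, 18, 12, 5), 40.0),
    (datetime(2023, 1, 1, 18, 12, 35), 60.0),
    (datetime(2023, 1, 1, 18, 13, 10), 55.0),
]
per_minute = aggregate_by_minute(raw)
```

Each element of `per_minute` is then one single point of the aggregated index series on which the anomaly detection operates.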
Step S20: performing anomaly detection on the average value of a plurality of single points at the tail of the real-time performance index to obtain an anomaly detection result.
It should be noted that the aggregated real-time performance indexes contain a large amount of data with many glitch (burr) points. To eliminate the influence of these points while maintaining high performance, the average value of the last 2 single points is used as the detection target value of the point to be detected.
In a specific implementation, anomaly detection is performed on the aggregated real-time performance index data with the tail2-3sigma algorithm. To keep the data accurate without sacrificing performance, the average value of the last 2 single points of the aggregated real-time performance index data is taken as the target point for detection; whether the real-time performance index data is abnormal is judged by detecting this target point, and the detection result for the current real-time performance index data is determined from that judgment.
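The application names the tail2-3sigma algorithm without spelling out its details. The sketch below shows one plausible reading in Python: the detection target is the mean of the last 2 points, and it is flagged when it lies more than 3 standard deviations from the mean of the preceding points. The choice of baseline (all points before the tail), the handling of a flat baseline, and the function name `tail2_3sigma` are assumptions, not statements of the claimed method.

```python
from statistics import mean, pstdev


def tail2_3sigma(series, tail=2, n_sigma=3.0):
    """Return True if the mean of the last `tail` points is a 3-sigma outlier.

    Assumption: the baseline mean and standard deviation are computed from
    the points that precede the tail; the application does not state this.
    """
    if len(series) <= tail + 1:
        raise ValueError("need at least tail + 2 points")
    history, tail_points = series[:-tail], series[-tail:]
    target = mean(tail_points)  # detection target value
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        # Flat baseline: any deviation at all counts as abnormal.
        return target != mu
    return abs(target - mu) > n_sigma * sigma


baseline = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50]
sustained = baseline + [95, 97]   # two consecutive high points
glitch = baseline + [50, 55]      # a single isolated spike
sustained_flag = tail2_3sigma(sustained)
glitch_flag = tail2_3sigma(glitch)
```

With this data the sustained rise is flagged while the isolated spike is not, because averaging the spike with its neighbour halves its deviation. Testing the last point alone against the same 3-sigma band would have flagged the spike, which illustrates the glitch-suppression effect the text describes.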
Step S30: inputting the anomaly detection result into a fault recognition model to perform fault recognition, so as to obtain a fault recognition result.
It should be noted that the fault recognition model is obtained by training an initial fault recognition model and can recognize faults from the detection results; which fault recognition model is used is determined by the index group currently being detected.
In a specific implementation, the corresponding fault identification classification model is selected according to the index group currently under detection. The model judges, from the input values to be detected, whether the corresponding performance index has a fault; in particular, when none of the index groups is abnormal, the detection moment is considered normal. For example, if the currently input index groups are the CPU index group of a host node and the memory index group of the dynamic pods on that host node, the corresponding host CPU fault classification model and host dynamic pod memory fault classification model are selected, the CPU indexes of the host node and the memory indexes of the host's dynamic pods are judged separately, and the current fault recognition result is determined. When at least one abnormal result occurs, the abnormal result and its index group are recorded; once fault recognition is complete, the recorded abnormal results and index groups are output together. If no abnormal result is obtained after fault recognition has been performed on every input index group, a fault-free recognition result can be returned.
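The per-group model selection described above can be sketched as a simple dispatch table. In this Python sketch the data layouts, the group names `host_cpu` and `pod_memory`, and the stand-in threshold "models" are all hypothetical; in the application the models are trained classifiers, which are replaced here by plain callables so the control flow is visible.

```python
def identify_faults(detections, models):
    """Run each index group's detection results through that group's model.

    detections: {group_name: {metric_name: 0 or 1}} anomaly flags per metric.
    models:     {group_name: callable returning True when the group is faulty}.
    Both layouts are assumptions made for this sketch.
    """
    abnormal = {}
    for group, flags in detections.items():
        model = models[group]        # select the model for this index group
        if model(flags):
            abnormal[group] = flags  # record the abnormal result and its group
    # An empty dict stands for the fault-free recognition result.
    return abnormal


# Hypothetical stand-in models: a group is faulty if at least 2 metrics are abnormal.
models = {
    "host_cpu": lambda flags: sum(flags.values()) >= 2,
    "pod_memory": lambda flags: sum(flags.values()) >= 2,
}
detections = {
    "host_cpu": {"system.cpu.user": 1, "system.load.1": 1, "system.load.5": 0},
    "pod_memory": {"container_file_descriptors": 0},
}
result = identify_faults(detections, models)
```

Collecting all abnormal groups before returning mirrors the text, where results are recorded during recognition and output together at the end.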
Step S40: inputting the fault recognition result into a fault classification model to perform fault prediction, so as to obtain the fault category at the current moment.
The fault classification model is obtained by training an initial fault classification model: the fault recognition model produces a plurality of fault recognition results, and these are used as the input for training the initial fault classification model.
In a specific implementation, the fault recognition result is used as the input of the fault classification model, which classifies the fault recognition result together with its corresponding performance indexes. Because a fault of a given category can have many causes, one category usually corresponds to several performance indexes and recognition results. When the fault classification model classifies faults with the xgboost algorithm, each fault recognition result is first labeled and given a unique id. Features are then extracted from each fault recognition result, for example variance, mean and skewness, or feature fitting is performed to generate feature representations such as those of an isolation forest. A feature set is generated from the extracted features, and fault classification is performed with the xgboost algorithm to obtain the fault classification result.
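The feature extraction step mentioned above (variance, mean, skewness) can be illustrated with the standard library alone. The following Python sketch computes these three statistics for one window of values; the function name, the use of population (biased) statistics and the zero-skewness convention for a constant window are assumptions. The resulting feature rows would then be passed to a gradient-boosted tree classifier such as xgboost, a step that is omitted here.

```python
from statistics import mean, pvariance


def extract_features(values):
    """Return mean, population variance and skewness of a metric window."""
    mu = mean(values)
    var = pvariance(values, mu)
    if var == 0:
        skew = 0.0  # convention chosen here for a constant window
    else:
        # Population (biased) skewness: third central moment / sigma^3.
        m3 = sum((v - mu) ** 3 for v in values) / len(values)
        skew = m3 / var ** 1.5
    return {"mean": mu, "variance": var, "skewness": skew}


# Hypothetical window with one large value, giving a right-skewed distribution.
features = extract_features([1.0, 2.0, 3.0, 10.0])
```

A positive skewness indicates a tail of unusually high values, which is one way a load surge can show up in the feature set.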
Step S50: generating fault information according to the fault category at the current moment.
In a specific implementation, after the fault classification result is obtained, generating fault information according to the fault classification result, wherein the fault information comprises the fault classification result and also comprises a corresponding fault index group. After generating the fault information, the fault information can be sent to an operation and maintenance system, so that operation and maintenance personnel can know the cause of the fault in time, and quick repair is realized.
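A fault message of the kind described above could, for instance, be serialized as JSON before being sent to the operation and maintenance system. The field names, the timestamp format and the function name `build_fault_info` in this Python sketch are hypothetical; the application only states that the message contains the fault classification result and the corresponding fault index groups.

```python
import json


def build_fault_info(category, index_groups, timestamp):
    """Assemble a fault message containing the category and its index groups."""
    return json.dumps(
        {
            "timestamp": timestamp,                       # hypothetical field
            "fault_category": category,                   # classification result
            "fault_index_groups": sorted(index_groups),   # abnormal index groups
        },
        ensure_ascii=False,
    )


message = build_fault_info(
    "k8s container cpu load", {"host_cpu"}, "2023-01-01T18:12:00"
)
```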
This embodiment acquires real-time performance indexes of the pod container and the host node in a real-time consumption system; performs anomaly detection on the average value of a plurality of single points at the tail of the real-time performance index to obtain an anomaly detection result; inputs the anomaly detection result into a fault recognition model for fault recognition to obtain a fault recognition result; inputs the fault recognition result into a fault classification model for fault prediction to obtain the fault category at the current moment; and generates fault information according to the fault category at the current moment. Performing anomaly detection on the average value of several tail points eliminates the influence of glitch (burr) points while maintaining high performance, and performing fault identification and fault classification on the anomaly detection result speeds up both.
Referring to fig. 3, fig. 3 is a flow chart of a second embodiment of a fault classification method according to the present application.
Based on the first embodiment, the fault classification method of the present embodiment further includes, before the step S30:
Step S301: acquiring historical performance indexes of the pod container and the host node;
Step S302: traversing the historical fault samples in chronological order to obtain a historical fault occurrence time set;
Step S303: determining historical performance index data corresponding to each historical fault node time according to the historical fault occurrence time set;
Step S304: performing anomaly detection on each historical performance index data to obtain an anomaly detection result;
Step S305: inputting the anomaly detection result into an initial fault recognition model for training to obtain a fault recognition model.
In a specific implementation, after the historical performance indexes of the pod container and the host node are obtained, the performance indexes can be grouped according to the resource types and the categories, so as to obtain K groups of core indexes closely related to system faults, for example:
CPU index group of the system host node:
system.cpu.user,
system.load.1,
system.cpu.pct_usage,
system.load.5,
system.load.15
Memory index group of the dynamic pods on the host node:
container_fs_writes./dev/vda,
container_fs_inodes./dev/vda1,
container_fs_reads_MB./dev/vda,
container_file_descriptors,
container_fs_writes_MB./dev/vda
Here, system.cpu.user is one performance index. When the fault recognition model is trained, the fault occurrence times of the historically labeled fault samples are traversed one by one in chronological order. For each fault time point, the system's historical performance index data for the 3 hours before that time point is queried and aggregated by minute. Here n denotes the number of historically labeled fault samples and k the number of index groups. In the training process, the first fault time point is determined first; if it is 18:12:23, the historical performance index data of the preceding 3 hours is queried, giving the historical performance indexes for the period from 15:12:23 to 18:12:23, which are aggregated in units of minutes. This process is repeated until all historically labeled fault samples have been traversed. In this way, n×k historical performance index panel data sets, one per fault moment and index group, are finally obtained. Anomaly detection is performed on each panel data set, by index group, with the tail2-3sigma anomaly detection algorithm, yielding n×k anomaly detection results, where each index group contains 1 to m specific performance unit indexes. The anomaly detection results are input into the initial fault recognition model for training to obtain the fault recognition model.
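The 3-hour lookback in the example above (fault at 18:12:23, window from 15:12:23 to 18:12:23) can be reproduced with a few lines of Python. The function names and the calendar date are hypothetical; only the 3-hour span and the per-minute granularity come from the text.

```python
from datetime import datetime, timedelta


def lookback_window(fault_time, hours=3):
    """Return the (start, end) of the query window preceding a fault."""
    return fault_time - timedelta(hours=hours), fault_time


def minute_points(start, end):
    """List the minute-aligned timestamps covered by [start, end]."""
    current = start.replace(second=0, microsecond=0)
    points = []
    while current <= end:
        points.append(current)
        current += timedelta(minutes=1)
    return points


fault_time = datetime(2023, 1, 1, 18, 12, 23)  # the date itself is arbitrary
start, end = lookback_window(fault_time)
points = minute_points(start, end)
```

Repeating this for each of the n fault times and each of the k index groups gives the n×k panel data sets that the anomaly detection step then processes.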
Further, after the step of inputting the abnormality detection result into an initial fault recognition model for training to obtain a fault recognition model, the method further includes:
analyzing the fault identification result to obtain a first abnormal index set within a first preset time of a fault moment;
acquiring a second abnormal index set occurring in a second preset time before the fault moment;
filtering the first abnormal index set according to the second abnormal index set to obtain a new abnormal index set after the fault moment;
inputting the new abnormal index set into an initial fault classification model for training to obtain a fault classification model.
Wherein, inputting the new abnormal index set into the initial fault classification model for training to obtain the fault classification model further includes:
analyzing the sample label to determine the fault category;
obtaining a corresponding fault identification result according to the index group corresponding to the sample label;
and training through a fault classification algorithm according to the fault category and the fault recognition result to obtain a fault classification model.
The first abnormal index set in the first preset time refers to an abnormal index set formed by abnormal indexes in a period of time after the occurrence of the fault, and the second abnormal index set in the second preset time before the moment of the fault refers to an abnormal index set formed by abnormal indexes in a period of time before the occurrence of the fault.
In a specific implementation, the abnormal indexes occurring within a period of time after the fault moment can be counted and recorded to obtain the first abnormal index set. This period may be 2 minutes, 3 minutes and so on, preferably 2 minutes, or may be set otherwise according to actual conditions. The abnormal indexes occurring within a period of time before the fault moment, for example the 5 minutes before it, may be recorded to obtain the second abnormal index set; this period can likewise be set according to the specific situation, which is not limited in this embodiment. After the first and second abnormal index sets are obtained, the first abnormal index set is filtered to optimize the data set: the abnormal indexes that also appear in the second abnormal index set are removed, leaving the abnormal indexes that are new at the time of the fault, which can be regarded as new faults generated only after the fault occurred. The indexes remaining after the first abnormal index set is filtered form the new abnormal index set, which is used as the input of the initial fault classification model for model training to obtain the fault classification model.
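The filtering step above amounts to a set difference: indexes that were already abnormal before the fault are removed from the indexes that are abnormal after it. A minimal Python sketch follows, with hypothetical metric memberships and function name.

```python
def new_abnormal_indexes(after_fault, before_fault):
    """Keep only the indexes that became abnormal after the fault moment."""
    return set(after_fault) - set(before_fault)


# First set: abnormal within 2 minutes after the fault (example values).
first_set = {"system.cpu.user", "system.load.1", "container_file_descriptors"}
# Second set: already abnormal in the 5 minutes before the fault.
second_set = {"container_file_descriptors"}
new_set = new_abnormal_indexes(first_set, second_set)
```

Here `container_file_descriptors` is dropped because it was abnormal before the fault and therefore carries no information specific to it.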
Further, the step of inputting the abnormality detection result to an initial fault recognition model for training to obtain a fault recognition model includes:
marking the abnormal indexes in the new abnormal index set to obtain a sample label, wherein the new abnormal index set comprises at least one index group, and each index group comprises at least one performance unit index;
judging the abnormal state of the current index group according to the sample label;
when the current index group is in an abnormal state, counting the number of performance unit indexes in the current index group which are in the abnormal state and the sample label;
and training according to the number of the performance unit indexes and the sample labels to obtain a corresponding number of fault identification models, wherein the number of the fault identification models is at least one.
It should be noted that the sample label may be used to distinguish the abnormal performance indexes of different groups. The new abnormal index set obtained after filtering includes n×k anomaly detection results, which are trained with a logistic regression algorithm to obtain k fault identification classification models. The form of the input data is shown in Table 1:
In the training process, the dependent variable y in the logistic regression model takes only the two values 0 and 1. Given the p independent variables X = (x_1, ..., x_p), the probability that y takes the value 1 is P = P(y = 1 | X), and the probability that it takes the value 0 is 1 - P. Logistic regression uses the following formula, where $h_\theta(x)$ represents the probability that the result is 1:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$
According to Table 1, the first column, id, is the serial number of the sample; there are n fault label samples for each index group. Because the number of fault label samples may differ between index groups, it is set uniformly to n for convenience of training. The last column, y, is the fault class represented by the sample label; for example, 1 indicates that the current index group has failed. The middle columns 2-7 give the index names of the specific index group and the corresponding anomaly detection results.
After the initial fault identification model is trained, k fault identification classification models are obtained. The models can be dynamically retrained and updated as historical fault-labelled samples accumulate, so that their prediction effect improves as the fault label samples are enriched, further exploiting the supervised classification function.
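Training one logistic regression model per index group can be illustrated with a small, dependency-free sketch. Everything below is an assumption made for illustration: the group names, the toy rule that a group is faulty when at least two of its three unit indexes are abnormal, and the learning rate and epoch count. The application does not specify how its models are fitted; batch gradient descent is used here only because it is short to write down.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(theta: Sequence[float], row: Sequence[int]) -> float:
    """Probability that the label is 1; theta is [bias, w1, ..., wm]."""
    return sigmoid(theta[0] + sum(w * v for w, v in zip(theta[1:], row)))


def train_logistic(
    rows: List[List[int]], labels: List[int], lr: float = 0.5, epochs: int = 2000
) -> List[float]:
    """Fit [bias, w1, ..., wm] by batch gradient descent on the log-loss."""
    n, m = len(rows), len(rows[0])
    theta = [0.0] * (m + 1)
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for row, label in zip(rows, labels):
            err = predict_proba(theta, row) - label
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta


# Toy samples: each row holds the 0/1 anomaly detection results of three unit
# indexes in one group; the label is 1 when at least two of them are abnormal.
toy_rows = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
toy_labels = [int(sum(r) >= 2) for r in toy_rows]

# One training set per index group (hypothetical group names), giving k models.
training_sets: Dict[str, Tuple[List[List[int]], List[int]]] = {
    "node_cpu": (toy_rows, toy_labels),
    "pod_memory": (toy_rows, toy_labels),
}
models = {group: train_logistic(X, y) for group, (X, y) in training_sets.items()}

print(predict_proba(models["node_cpu"], [1, 1, 1]) > 0.5)
```

Because each group has its own model, retraining after new labelled samples arrive only requires calling `train_logistic` again for the affected group.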
Further, the step of inputting the fault recognition result to a fault classification model for fault prediction to obtain a fault class at the current moment includes:
determining a sample sequence number according to the fault identification result;
generating a two-dimensional table according to the sample sequence number and the corresponding index group;
performing fault prediction by using an xgboost algorithm according to the two-dimensional table to obtain a fault prediction result;
and obtaining the fault category at the current moment according to the prediction result.
The sample number refers to a result number in the fault recognition result, and the two-dimensional table is used for reflecting fault information between each fault detection result.
In a specific implementation, the sample sequence number and the corresponding index group are first determined from the current fault identification result, and a two-dimensional table is constructed from them. The two-dimensional table represents the specific fault performance indexes in the current fault result, as shown in Table 2:
Column 1, id, is the serial number of the sample; there are n×7 fault label samples in total. Because the number of fault label samples may differ between index groups according to actual conditions, it is set uniformly to n for convenience. The last column, y, is the labelled fault category of the sample; for example, 1 represents a system CPU fault, 2 a CPU fault of the pod container, 3 a system memory fault, and 4 a read fault of the pod container. The middle columns 2-7 give the category names of the index groups and the recognition results of the corresponding fault identification models, where 1 represents a fault and 0 represents normal. The xgboost algorithm generates decision trees from the fault conditions corresponding to each index group, and the fault category at the current moment is obtained from the optimal point in the decision tree.
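As an illustration of how the two-dimensional table might be assembled before classification, the sketch below builds one row per sample from the per-group recognition results. The six group names, the function name `build_table`, and the dictionary layout are hypothetical; the text only states that columns 2-7 hold the index groups and that 1 means fault and 0 means normal. The classifier itself is deliberately left out.

```python
from typing import Dict, List

# Hypothetical names for the six index groups that fill columns 2-7.
GROUPS = ["node_cpu", "node_memory", "node_disk", "pod_cpu", "pod_memory", "pod_disk"]

# Fault categories as listed in the description (column y of the table).
CATEGORY_NAMES = {
    1: "system CPU fault",
    2: "pod container CPU fault",
    3: "system memory fault",
    4: "pod container read fault",
}


def build_table(results: List[Dict[str, int]]) -> List[List[int]]:
    """Build rows of [id, r_1, ..., r_6]; a group missing from a result counts as normal (0)."""
    table = []
    for sample_id, result in enumerate(results, start=1):
        table.append([sample_id] + [int(result.get(group, 0)) for group in GROUPS])
    return table


identification_results = [
    {"node_cpu": 1},                 # only the node CPU group was flagged
    {"pod_cpu": 1, "pod_disk": 1},   # two pod groups were flagged
]
for row in build_table(identification_results):
    print(row)
```

The feature columns (everything except `id`) are what a gradient-boosted tree classifier would consume, with the category codes in `CATEGORY_NAMES` as the prediction targets.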
In this embodiment, the mean value of the last two points of the target time series data to be detected is used as the value of the target point to be detected, which eliminates the error caused by burr (spike) points. Meanwhile, the data before and after the fault moment are filtered: abnormal information present before the fault occurred is filtered out to obtain the new fault type generated at the current fault moment, the corresponding fault identification model is selected according to the new fault type to perform fault identification, and the final fault category is determined by the fault classification model through the xgboost algorithm, thereby achieving rapid anomaly detection, fault identification, and fault classification together.
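The effect of averaging the last two points can be seen in a short sketch. The fixed threshold and the function names are assumptions made for illustration; the passage above does not say which detection rule is applied to the averaged value.

```python
from typing import Sequence


def tail_value(series: Sequence[float], k: int = 2) -> float:
    """Mean of the last k points, used in place of the single latest point."""
    if len(series) < k:
        raise ValueError("series is shorter than the tail length k")
    return sum(series[-k:]) / k


def exceeds_threshold(series: Sequence[float], threshold: float, k: int = 2) -> bool:
    """Hypothetical detection rule: flag when the tail mean is above a fixed threshold."""
    return tail_value(series, k) > threshold


single_spike = [50.0, 51.0, 50.0, 95.0]   # one isolated burr point at the end
sustained = [50.0, 51.0, 92.0, 95.0]      # two consecutive high readings

print(exceeds_threshold(single_spike, threshold=80.0))
print(exceeds_threshold(sustained, threshold=80.0))
```

With a threshold of 80, the isolated spike averages to 72.5 and is not flagged, while the sustained rise averages to 93.5 and is flagged; checking only the latest point would have flagged both.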
In addition, the embodiment of the application also provides a storage medium, wherein the storage medium is stored with a fault classification program, and the fault classification program realizes the steps of the fault classification method when being executed by a processor.
Referring to fig. 4, fig. 4 is a block diagram showing the construction of a first embodiment of the fault classification device according to the present application.
As shown in fig. 4, the fault classification device provided by the embodiment of the application includes:
an index obtaining module 10, configured to obtain real-time performance indexes of a pod container and a host node in a real-time consumption system;
the abnormality detection module 20 is configured to perform abnormality detection on the average value of a plurality of single sites at the tail of the real-time performance index, so as to obtain an abnormality detection result;
the fault recognition module 30 is configured to input the abnormality detection result to a fault recognition model for performing fault recognition, so as to obtain a fault recognition result;
the fault classification module 40 is configured to input the fault recognition result to a fault classification model for performing fault prediction, so as to obtain a fault class at the current moment;
and the information output module 50 is used for generating fault information according to the fault category at the current moment.
This embodiment obtains the real-time performance indexes of the pod container and the host node in the real-time consumption system; performs anomaly detection on the average value of a plurality of single sites at the tail of the real-time performance index to obtain an anomaly detection result; inputs the anomaly detection result into a fault recognition model to perform fault recognition and obtain a fault recognition result; inputs the fault recognition result into a fault classification model for fault prediction to obtain the fault class at the current moment; and generates fault information according to the fault category at the current moment. Performing anomaly detection on the average value of a plurality of single sites at the tail of the real-time performance index eliminates the influence of burr points while maintaining high performance, and performing fault identification and fault classification according to the anomaly detection result accelerates both steps.
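The chaining of the five modules can be sketched as a small pipeline object. This is only an illustration of the data flow: the class name `FaultPipeline`, the field names, the data shapes, and the stub functions standing in for each module are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, List[float]]   # index name -> recent time series
Flags = Dict[str, int]             # name -> 0 (normal) or 1 (abnormal / fault)


@dataclass
class FaultPipeline:
    acquire: Callable[[], Metrics]      # index obtaining module 10
    detect: Callable[[Metrics], Flags]  # abnormality detection module 20
    identify: Callable[[Flags], Flags]  # fault recognition module 30
    classify: Callable[[Flags], int]    # fault classification module 40
    report: Callable[[int], str]        # information output module 50

    def run(self) -> str:
        metrics = self.acquire()
        anomalies = self.detect(metrics)
        group_faults = self.identify(anomalies)
        category = self.classify(group_faults)
        return self.report(category)


# Stub modules standing in for the real components.
pipeline = FaultPipeline(
    acquire=lambda: {"node_cpu_usage": [40.0, 42.0, 96.0, 97.0]},
    detect=lambda m: {name: int(sum(v[-2:]) / 2 > 90.0) for name, v in m.items()},
    identify=lambda a: {"node_cpu": int(a.get("node_cpu_usage", 0) == 1)},
    classify=lambda g: 1 if g.get("node_cpu") else 0,
    report=lambda c: f"fault category {c}",
)
print(pipeline.run())
```

Keeping each module behind a plain callable makes it straightforward to swap a stub for a trained model without changing the order of the stages.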
In one embodiment, the fault identification module 30 is further configured to obtain historical performance indexes of the pod container and the host node; grouping the historical performance indexes to obtain a plurality of groups of core indexes; traversing the historical fault samples in time order to obtain a historical fault occurrence time set; determining historical performance index data corresponding to each historical fault node time according to the historical fault occurrence time set; performing anomaly detection on each historical performance index data to obtain an anomaly detection result; and inputting the abnormality detection result into an initial fault recognition model for training to obtain a fault recognition model.
In an embodiment, the fault identification module 30 is further configured to analyze the fault identification result to obtain a first abnormal index set within a first preset time of the fault moment; acquiring a second abnormal index set occurring in a second preset time before the fault moment; filtering the first abnormal index set according to the second abnormal index set to obtain a new abnormal index set after the fault moment; and inputting the new abnormal index set into an initial fault classification model for training according to the new abnormal index set to obtain a fault classification model.
In an embodiment, the fault identification module 30 is further configured to label the abnormal indicators in the new abnormal indicator set to obtain a sample label, where the new abnormal indicator set includes at least one indicator set, and each of the indicator sets includes at least one performance unit indicator; judging the abnormal state of the current index group according to the sample label; when the current index group is in an abnormal state, counting the number of performance unit indexes in the current index group which are in the abnormal state and the sample label; and training according to the number of the performance unit indexes and the sample labels to obtain a corresponding number of fault identification models, wherein the number of the fault identification models is at least one.
In an embodiment, the fault identification module 30 is further configured to analyze the sample tag to determine a fault class; obtaining a corresponding fault identification result according to the index group corresponding to the sample label; and training through a fault classification algorithm according to the fault category and the fault recognition result to obtain a fault classification model.
In an embodiment, the fault identification module 30 is further configured to analyze the anomaly detection result and determine grouping information of the real-time performance index; calling a corresponding fault identification model according to the grouping information; inputting the abnormal detection result into the corresponding fault identification model, so that the corresponding fault detection model analyzes the abnormal detection result to obtain an analysis result, wherein the analysis result comprises the state of the real-time performance index; and when the real-time performance index is abnormal in the analysis result, obtaining a fault identification result of the real-time performance index.
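Selecting the fault identification model by grouping information can be sketched as a dictionary lookup. The mapping from index names to groups, the function name `identify_faults`, and the stand-in models (which flag a group when at least two of its indexes are abnormal) are assumptions for illustration; in practice each entry would be a trained model for that group.

```python
from typing import Callable, Dict, List

# A model takes the 0/1 detection results of one group and returns 0 or 1.
Model = Callable[[List[int]], int]


def identify_faults(
    detections: Dict[str, int],
    index_to_group: Dict[str, str],
    models: Dict[str, Model],
) -> Dict[str, int]:
    """Group the detection results, then call the model registered for each group."""
    grouped: Dict[str, List[int]] = {}
    for index_name, flag in detections.items():
        grouped.setdefault(index_to_group[index_name], []).append(flag)
    return {group: models[group](flags) for group, flags in grouped.items()}


# Hypothetical grouping information and stand-in models.
index_to_group = {
    "node_cpu_user": "node_cpu",
    "node_cpu_system": "node_cpu",
    "pod_mem_rss": "pod_memory",
    "pod_mem_cache": "pod_memory",
}
at_least_two: Model = lambda flags: int(sum(flags) >= 2)
group_models = {"node_cpu": at_least_two, "pod_memory": at_least_two}

detections = {"node_cpu_user": 1, "node_cpu_system": 1, "pod_mem_rss": 1, "pod_mem_cache": 0}
print(identify_faults(detections, index_to_group, group_models))
```

For the sample input, the node CPU group is reported as faulty and the pod memory group as normal, since only one of its two indexes is abnormal.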
In an embodiment, the fault classification module 40 is further configured to determine a sample number according to the fault identification result; generating a two-dimensional table according to the sample sequence number and the corresponding index group; performing fault prediction by using an xgboost algorithm according to the two-dimensional table to obtain a fault prediction result; and obtaining the fault category at the current moment according to the prediction result.
It should be understood that the foregoing is illustrative only and is not limiting, and that in specific applications, those skilled in the art may set the application as desired, and the application is not limited thereto.
It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; they are not necessarily executed in sequence, and may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present application, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., read-only memory (ROM)/RAM, magnetic disk, optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the method according to the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A fault classification method, characterized in that the fault classification method comprises:
acquiring real-time performance indexes of a pod container and a host node in a real-time consumption system;
performing anomaly detection on the average value of a plurality of single sites at the tail part of the real-time performance index to obtain an anomaly detection result;
inputting the abnormal detection result into a fault recognition model to perform fault recognition to obtain a fault recognition result;
inputting the fault identification result into a fault classification model for fault prediction to obtain a fault class at the current moment;
and generating fault information according to the fault category at the current moment.
2. The method of claim 1, wherein the inputting the anomaly detection result into a fault recognition model for fault recognition, before obtaining a fault recognition result, further comprises:
acquiring historical performance indexes of the pod container and the host node;
grouping the historical performance indexes to obtain a plurality of groups of core indexes;
traversing the historical fault samples in time order to obtain a historical fault occurrence time set;
determining historical performance index data corresponding to each historical fault node time according to the historical fault occurrence time set;
performing anomaly detection on each historical performance index data to obtain an anomaly detection result;
and inputting the abnormality detection result into an initial fault recognition model for training to obtain a fault recognition model.
3. The method of claim 2, wherein the step of inputting the abnormality detection result to an initial failure recognition model for training, after obtaining a failure recognition model, further comprises:
analyzing the fault identification result to obtain a first abnormal index set within a first preset time of a fault moment;
acquiring a second abnormal index set occurring in a second preset time before the fault moment;
filtering the first abnormal index set according to the second abnormal index set to obtain a new abnormal index set after the fault moment;
and inputting the new abnormal index set into an initial fault classification model for training according to the new abnormal index set to obtain a fault classification model.
4. The method of claim 2, wherein inputting the anomaly detection result into an initial fault recognition model for training to obtain a fault recognition model, comprising:
marking the abnormal indexes in a new abnormal index set to obtain a sample label, wherein the new abnormal index set comprises at least one index group, and each index group comprises at least one performance unit index;
judging the abnormal state of the current index group according to the sample label;
when the current index group is in an abnormal state, counting the number of performance unit indexes in the current index group which are in the abnormal state and the sample label;
and training according to the number of the performance unit indexes and the sample labels to obtain a corresponding number of fault identification models, wherein the number of the fault identification models is at least one.
5. The method of claim 4, wherein the training based on the new anomaly index set input to an initial fault classification model to obtain a fault classification model comprises:
analyzing the sample label to determine fault types;
obtaining a corresponding fault identification result according to the index group corresponding to the sample label;
and training through a fault classification algorithm according to the fault category and the fault recognition result to obtain a fault classification model.
6. The method of claim 1, wherein inputting the anomaly detection result to a fault recognition model for fault recognition to obtain a fault recognition result comprises:
analyzing the abnormal detection result to determine grouping information of the real-time performance index;
calling a corresponding fault identification model according to the grouping information;
inputting the abnormal detection result into the corresponding fault identification model, so that the corresponding fault detection model analyzes the abnormal detection result to obtain an analysis result, wherein the analysis result comprises the state of the real-time performance index;
and when the real-time performance index is abnormal in the analysis result, obtaining a fault identification result of the real-time performance index.
7. The method of claim 1, wherein inputting the fault recognition result into a fault classification model for fault prediction to obtain a fault class at a current time comprises:
determining a sample sequence number according to the fault identification result;
generating a two-dimensional table according to the sample sequence number and the corresponding index group;
performing fault prediction by using an xgboost algorithm according to the two-dimensional table to obtain a fault prediction result;
and obtaining the fault category at the current moment according to the prediction result.
8. A fault classification device, characterized in that the fault classification device comprises:
the index acquisition module is used for acquiring real-time performance indexes of the pod container and the host node in the real-time consumption system;
the abnormality detection module is used for carrying out abnormality detection on the average value of a plurality of single sites at the tail part of the real-time performance index to obtain an abnormality detection result;
the fault identification module is used for inputting the abnormal detection result into a fault identification model to carry out fault identification so as to obtain a fault identification result;
the fault classification module is used for inputting the fault identification result into a fault classification model to perform fault prediction so as to obtain a fault class at the current moment;
and the information output module is used for generating fault information according to the fault category at the current moment.
9. A fault classification device, the device comprising: a memory, a processor and a fault classification program stored on the memory and executable on the processor, the fault classification program being configured to implement the steps of the fault classification method of any one of claims 1 to 7.
10. A storage medium having stored thereon a fault classification program which when executed by a processor performs the steps of the fault classification method according to any of claims 1 to 7.
CN202310827081.1A 2023-07-06 2023-07-06 Fault classification method, device, equipment and storage medium Pending CN116955071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310827081.1A CN116955071A (en) 2023-07-06 2023-07-06 Fault classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310827081.1A CN116955071A (en) 2023-07-06 2023-07-06 Fault classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116955071A true CN116955071A (en) 2023-10-27

Family

ID=88447073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310827081.1A Pending CN116955071A (en) 2023-07-06 2023-07-06 Fault classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116955071A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540199A (en) * 2024-01-05 2024-02-09 中国汽车技术研究中心有限公司 Fault prediction method, device and storage medium for fuel cell vehicle
CN117540199B (en) * 2024-01-05 2024-05-07 中国汽车技术研究中心有限公司 Fault prediction method, device and storage medium for fuel cell vehicle

Similar Documents

Publication Publication Date Title
CN116450399B (en) Fault diagnosis and root cause positioning method for micro service system
CN111340250A (en) Equipment maintenance device, method and computer readable storage medium
CN110471945B (en) Active data processing method, system, computer equipment and storage medium
CN113010389A (en) Training method, fault prediction method, related device and equipment
CN115879915B (en) Cross-platform standardized overhaul method for power plant
CN116955071A (en) Fault classification method, device, equipment and storage medium
CN114519524A (en) Enterprise risk early warning method and device based on knowledge graph and storage medium
CN109783384A (en) Log use-case test method, log use-case test device and electronic equipment
CN111666978B (en) Intelligent fault early warning system for IT system operation and maintenance big data
CN115204536A (en) Building equipment fault prediction method, device, equipment and storage medium
CN115188688A (en) Abnormality detection method and apparatus, electronic device, and storage medium
CN113723555A (en) Abnormal data detection method and device, storage medium and terminal
CN113806343B (en) Evaluation method and system for Internet of vehicles data quality
CN113282920B (en) Log abnormality detection method, device, computer equipment and storage medium
CN114138601A (en) Service alarm method, device, equipment and storage medium
CN111047125A (en) Product failure analysis device, method and computer readable storage medium
CN117112336A (en) Intelligent communication equipment abnormality detection method, equipment, storage medium and device
CN115494431A (en) Transformer fault warning method, terminal equipment and computer readable storage medium
CN118396246B (en) Intelligent management method and device for life cycle of industrial key unit
CN112508433A (en) Data inspection method and device for operation and maintenance system
CN111985651A (en) Operation and maintenance method and device for business system
CN117076327B (en) Automatic interface detection and repair method and system
CN113112160B (en) Diagnostic data processing method, diagnostic data processing device and electronic equipment
CN117439899B (en) Communication machine room inspection method and system based on big data
CN109474445B (en) Distributed system root fault positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination