WO2023214446A1 - Classification device, classification method, and program

Publication number
WO2023214446A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
observability
event
failure
classification
Prior art date
Application number
PCT/JP2022/019498
Other languages
English (en)
Japanese (ja)
Inventor
幸次 佐々木
優 酒井
謙輔 高橋
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2022/019498
Publication of WO2023214446A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Definitions

  • The present invention relates to a classification device, a classification method, and a program.
  • An autonomous control loop method has been proposed as a maintenance automation technology that supports the decisions of maintenance personnel in maintenance operations.
  • The autonomous control loop method is a technology in which maintenance operation functions are turned into components, allowing each operating component to operate autonomously.
  • The autonomous control loop method aims to monitor the target service, autonomously follow the addition of new functions to the service or changes to its specifications, and achieve automatic recovery in the event of a failure.
  • Non-Patent Document 1 proposes a method for acquiring observability data, called Logs/Metrics/Tracing, for use in the autonomous control loop method.
  • Non-Patent Document 2 proposes a method for processing observability data and searching for causes when a failure occurs. These technologies can help maintenance personnel recover quickly by acquiring observability data from monitored services and analyzing it.
  • Non-Patent Documents 1 and 2 describe a data acquisition method and a cause search when a failure occurs, but do not describe how to respond to the failure.
  • Autonomous recovery from every type of failure is difficult to achieve because it requires a device that can determine all actions, such as investigating the cause and formulating a response policy. A more practical approach is therefore to restore the monitored service autonomously, by performing recovery processing without the intervention of maintenance personnel, for known failures that maintenance personnel have handled in the past, and to handle unknown failures that maintenance personnel have not yet experienced separately. This requires distinguishing known failures from unknown ones.
  • The present invention has been made in view of the above, and its purpose is to classify whether a failure that has occurred is one that has been dealt with in the past or one that has not been dealt with before.
  • A classification device according to the present invention classifies a failure that occurs in a monitored service. It includes an extraction unit that extracts abnormal observability data from observability data when a failure occurs in the monitored service, and a classification unit that, for each known event that has been dealt with in the past, classifies the observability data of the known event and the abnormal observability data into clusters, calculates the ratio of abnormal observability data included in the cluster of the known event, and, based on the ratio calculated for each known event, classifies the failure that occurred as either an unknown event that has not been dealt with in the past or the known event.
  • According to the present invention, it is possible to classify whether a failure that has occurred is one that has been dealt with in the past or one that has not been dealt with before.
  • FIG. 1 is a diagram showing an example of the overall configuration including a classification device according to this embodiment.
  • FIG. 2 is a flowchart showing an example of the flow of learning processing.
  • FIG. 3 is a diagram illustrating an example of observability data stored in the known event data storage unit.
  • FIG. 4 is a diagram illustrating an example of data obtained by dimensionally compressing the observability data of FIG. 3.
  • FIG. 5 is a flowchart illustrating an example of the flow of processing for classifying observability data into normal data and failure data.
  • FIG. 6 is a diagram illustrating an example of observability data when a failure occurs.
  • FIG. 7 is a diagram illustrating an example of data obtained by dimensionally compressing the observability data of FIG. 6.
  • FIG. 8 is a flowchart illustrating an example of the flow of processing for classifying failure events.
  • FIG. 9 is a diagram illustrating an example of data in which observability data of known events and failure data are combined.
  • FIG. 10 is a diagram illustrating an example of clustering of observability data of known events and failure data.
  • FIG. 11 is a diagram showing an example of the hardware configuration of the classification device.
  • The classification device 10 shown in FIG. 1 analyzes observability data of a monitored service when a failure occurs, and determines whether the failure is a known event or an unknown event.
  • When the failure is a known event, the system in FIG. 1 performs recovery processing autonomously without the intervention of maintenance personnel. When the failure is an unknown event, it performs cause search processing and provides the estimated cause of the failure to the maintenance personnel.
  • A known event is a failure that has been dealt with in the past.
  • An unknown event is a failure that has not been dealt with in the past and is being handled for the first time.
  • A data storage unit 30 is connected to the classification device 10.
  • The data storage unit 30 stores observability data acquired from the managed targets that provide the monitored service.
  • Managed targets include, for example, devices and containers used to provide services, and software that runs on those devices and containers.
  • Observability data may be acquired from each operating component of the autonomous control loop method using the method described in Non-Patent Document 1.
  • Observability data is, for example, logs, metrics, and traces that can be obtained from the managed targets.
  • The observability data of the managed targets may be combined into one piece of observability data per point in time using the method described in Non-Patent Document 2. This eliminates time differences in the observability data and makes it possible to take into account the possibility that each managed target is influencing other managed targets.
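  • As a minimal sketch of combining observability data per point in time, the following joins per-target records on a shared timestamp. The method of Non-Patent Document 2 is not reproduced here; the record layout, the function name `combine_by_time`, and the sample values are assumptions made only for illustration.

```python
from collections import defaultdict

def combine_by_time(records):
    """Merge per-target observations into one row per timestamp.

    `records` is an iterable of (timestamp, target, metrics) tuples, where
    `metrics` maps a metric name to a value. Column names are prefixed with
    the target name so values from different targets do not collide.
    (Hypothetical layout, not taken from the patent.)
    """
    rows = defaultdict(dict)
    for timestamp, target, metrics in records:
        for name, value in metrics.items():
            rows[timestamp][f"{target}.{name}"] = value
    # Return (timestamp, merged_metrics) pairs in time order.
    return sorted(rows.items(), key=lambda item: item[0])

# Made-up sample records from two managed targets at 10-second intervals.
records = [
    (0, "web", {"cpu": 0.20, "errors": 0}),
    (0, "db", {"cpu": 0.35}),
    (10, "web", {"cpu": 0.95, "errors": 7}),
    (10, "db", {"cpu": 0.40}),
]
combined = combine_by_time(records)
for timestamp, row in combined:
    print(timestamp, row)
```

  • Each combined row then covers all managed targets at one point in time, which is the shape assumed by the later dimension reduction and classification steps.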
  • The classification device 10 includes a failure data extraction unit 11, an event classification unit 12, and a known event data storage unit 13.
  • The failure data extraction unit 11 classifies observability data at the time of a failure into normal data and failure data, and extracts the failure data. Specifically, the failure data extraction unit 11 acquires observability data around the time period in which the failure occurred from the data storage unit 30, and uses a machine learning model to classify each piece of observability data as normal data or failure data. The failure data is output to the event classification unit 12.
  • Normal data is observability data obtained when each of the managed targets of the monitored service is operating normally. Failure data is abnormal observability data obtained when a failure occurs or is about to occur in the monitored service.
  • The machine learning model is trained, using the observability data stored in the known event data storage unit 13 as training data, to classify observability data into normal data and failure data.
  • The failure data extraction unit 11 may train the machine learning model at the time it extracts failure data from observability data.
  • The event classification unit 12 compares the failure data with the observability data of each known event and determines whether the failure is a known event or an unknown event. If the failure is a known event, the event classification unit 12 determines which known event it corresponds to. Specifically, the event classification unit 12 acquires the observability data of each known event from the known event data storage unit 13 and, for each known event, clusters the combination of the failure data and the observability data of that known event into two clusters: a "known event cluster" and an "other cluster." If most of the failure data is classified into the other cluster in the trials for all known events, the event classification unit 12 determines the failure to be an unknown event.
  • Otherwise, the event classification unit 12 determines that the failure is the known event for which the percentage of failure data classified into the known event cluster is greater than a threshold value.
  • The event classification unit 12 may further classify failure data determined to be a known event using a machine learning model to improve the determination accuracy.
  • The known event data storage unit 13 stores observability data to which an event label has been added.
  • The event label is information indicating the classification of the known event.
  • The classification of known events is also referred to as the failure type.
  • The known event data storage unit 13 stores not only failure data but also normal data.
  • The observability data stored in the known event data storage unit 13 is used as training data for the machine learning model that classifies observability data into normal data and failure data. It is also used when the event classification unit 12 determines the failure type of failure data.
  • The classification results of the classification device 10 can be used by the cause search processing unit 50 and the recovery processing operation components 60.
  • If the failure is classified as an unknown event, the classification device 10 transmits the failure data to the cause search processing unit 50.
  • The cause search processing unit 50 analyzes the failure data, estimates the cause of the failure, and presents it to the maintenance personnel.
  • For the cause search, for example, the method described in Non-Patent Document 2 can be used. If the failure is classified as one of the known events 1-N, the classification device 10 instructs the recovery processing operation component 60 corresponding to that known event to perform recovery processing for the failure.
  • The recovery processing operation component 60 executes recovery processing in a predetermined procedure for each known event 1-N, such as restarting software or restarting a device. Note that even if the cause search processing unit 50 and the recovery processing operation component 60 are not provided, prompt recovery can be expected simply by presenting to the maintenance personnel whether the failure is an unknown event or a known event.
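  • The routing of a classification result can be sketched as follows. This is only an illustration: the function `dispatch`, the string "unknown", and the handler callables are hypothetical stand-ins for the cause search processing unit 50 and the recovery processing operation components 60, whose actual interfaces are not described.

```python
def dispatch(result, failure_data, cause_search, recovery_components):
    """Route a classification result to cause search or to a recovery component.

    `result` is either the string "unknown" or a known event label.
    `cause_search` and the values of `recovery_components` are hypothetical
    callables standing in for unit 50 and the components 60.
    """
    if result == "unknown":
        # Unknown event: estimate the cause and hand it to maintenance personnel.
        return cause_search(failure_data)
    # Known event: run the recovery procedure registered for that event label.
    return recovery_components[result](failure_data)

# Hypothetical handlers for illustration only.
recovery_components = {
    1: lambda data: "restart software",
    2: lambda data: "restart device",
}
cause_search = lambda data: f"estimated cause from {len(data)} rows"

print(dispatch(2, [], cause_search, recovery_components))
print(dispatch("unknown", [{"cpu": 0.97}], cause_search, recovery_components))
```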
  • With reference to the flowchart in FIG. 2, an example of the learning process for the machine learning model that classifies observability data into normal data and failure data will be described.
  • The process shown in FIG. 2 may be executed when the process of classifying observability data into normal data and failure data, described later, is executed, or when the data in the known event data storage unit 13 is updated.
  • In step S11, the failure data extraction unit 11 acquires observability data from the known event data storage unit 13.
  • FIG. 3 shows an example of observability data stored in the known event data storage unit 13.
  • In this example, observability data acquired from each of the managed targets of the monitored service is combined into one piece for each time, and an event label is assigned to each piece of observability data.
  • An event label of 0 is given to normal data, and a numerical value corresponding to the failure type is given to failure data.
  • The failure data extraction unit 11 converts the event labels of the observability data into binary values indicating normal or abnormal. Specifically, the event labels of normal data are left as 0, and the event labels of all failure data are converted to 1.
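  • The label conversion described above amounts to a one-line mapping. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def binarize_event_labels(event_labels):
    """Keep label 0 (normal) as 0 and map every failure type to 1."""
    return [0 if label == 0 else 1 for label in event_labels]

# Made-up event labels: 0 is normal, 1 to 3 are failure types.
event_labels = [0, 0, 1, 2, 0, 3]
binary_labels = binarize_event_labels(event_labels)
print(binary_labels)  # [0, 0, 1, 1, 0, 1]
```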
  • In step S12, the failure data extraction unit 11 applies principal component analysis to the acquired observability data to reduce its dimensionality, and calculates the eigenvectors.
  • Observability data may include more than 100 types of data. When the number of items is this large, the number of data dimensions (the number of columns in the table in FIG. 3) is reduced by principal component analysis.
  • FIG. 4 shows an example of data obtained by applying principal component analysis to the observability data of FIG. 3 to reduce its dimensionality.
  • FIG. 4 shows the first principal component (PC1) and second principal component (PC2) obtained by principal component analysis. Principal component scores for the third and subsequent principal components may also be used.
  • The eigenvectors obtained by principal component analysis in the learning process are used to calculate the principal component scores of the observability data to be classified in the process, described later, of classifying observability data into normal data and failure data.
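  • The reuse of eigenvectors can be sketched with NumPy as follows. This is an illustration under assumptions: the data is random stand-in data, the helper names `fit_pca` and `project` are hypothetical, and centering on the training mean is a common convention that the description does not specify.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return the column means and the top eigenvectors of the covariance of X."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return mean, eigenvectors[:, order]

def project(X, mean, components):
    """Principal component scores of X using previously fitted eigenvectors."""
    return (X - mean) @ components

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 5))          # stand-in for stored observability data
mean, components = fit_pca(train)          # learning process: keep the eigenvectors
train_scores = project(train, mean, components)   # columns play the role of PC1, PC2

new_rows = rng.normal(size=(3, 5))         # stand-in for data at failure time
new_scores = project(new_rows, mean, components)  # same eigenvectors, no refit
print(train_scores.shape, new_scores.shape)
```

  • Keeping `mean` and `components` from the learning process ensures that the data to be classified is mapped into the same coordinate system as the training data.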
  • The failure data extraction unit 11 uses the observability data whose dimensionality was reduced in step S12 as training data to train a machine learning model that classifies observability data into normal data and failure data.
  • For example, the failure data extraction unit 11 uses random forest, which is one of the classification methods, to create the machine learning model that classifies observability data into normal data and failure data.
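  • A minimal sketch of this training step, assuming scikit-learn is available. The principal component scores are synthetic, well-separated stand-in data, and the hyperparameters (`n_estimators=50`, `random_state=0`) are arbitrary choices that the description does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic (PC1, PC2) scores: normal rows near (0, 0), failure rows near (5, 5).
normal_scores = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
failure_scores = rng.normal(loc=5.0, scale=0.5, size=(30, 2))
X_train = np.vstack([normal_scores, failure_scores])
y_train = np.array([0] * 30 + [1] * 30)   # binary labels: 0 = normal, 1 = failure

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Classify two new dimension-reduced rows.
predictions = model.predict(np.array([[0.1, -0.2], [4.8, 5.3]]))
print(predictions.tolist())
```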
  • In step S21 of the process in FIG. 5, the failure data extraction unit 11 acquires observability data around the time period in which the failure occurred from the data storage unit 30.
  • FIG. 6 shows an example of observability data when a failure occurs. The example of FIG. 6 shows observability data combined into one piece at 10-second intervals.
  • In step S22, the failure data extraction unit 11 uses the eigenvectors calculated in step S12 of FIG. 2 to dimensionally compress the observability data obtained in step S21.
  • FIG. 7 shows an example in which the observability data in FIG. 6 is dimensionally compressed.
  • The failure data extraction unit 11 calculates PC1 and PC2 shown in FIG. 7 from the observability data shown in FIG. 6 and the eigenvectors. Note that FIG. 7 also shows the determination result of the next step, S23.
  • In step S23, the failure data extraction unit 11 inputs the dimensionally compressed observability data to the machine learning model trained in the process of FIG. 2, and classifies the observability data into normal data and failure data.
  • In the example of FIG. 7, the two rows of observability data indicated by arrows are classified as failure data.
  • In step S24, the failure data extraction unit 11 extracts the observability data, before dimensional compression, that corresponds to the observability data classified as failure data, and outputs it to the event classification unit 12.
  • In this example, the failure data extraction unit 11 outputs to the event classification unit 12 the two rows of observability data in FIG. 6 that were determined to be failure data.
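  • Steps S23 and S24 can be sketched as follows. The point illustrated is that classification runs on the compressed rows while the rows passed on are the uncompressed originals. All values are made up, and `toy_classifier` is a hypothetical stand-in for the trained machine learning model.

```python
def extract_failure_rows(raw_rows, compressed_rows, classify):
    """Return the raw rows whose compressed counterpart is classified as failure (1)."""
    labels = [classify(row) for row in compressed_rows]
    return [raw for raw, label in zip(raw_rows, labels) if label == 1]

# Hypothetical stand-in for the trained model: flags rows with a large PC1 score.
def toy_classifier(compressed_row):
    return 1 if compressed_row[0] > 2.0 else 0

# Made-up observability rows at 10-second intervals and their (PC1, PC2) scores.
raw_rows = [
    {"time": "00:00:00", "cpu": 0.21, "errors": 0},
    {"time": "00:00:10", "cpu": 0.97, "errors": 8},
    {"time": "00:00:20", "cpu": 0.95, "errors": 9},
    {"time": "00:00:30", "cpu": 0.22, "errors": 0},
]
compressed_rows = [(-1.1, 0.2), (3.4, -0.3), (3.1, 0.4), (-1.0, 0.1)]

failure_rows = extract_failure_rows(raw_rows, compressed_rows, toy_classifier)
print([row["time"] for row in failure_rows])
```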
  • In step S31 of the process in FIG. 8, the event classification unit 12 acquires the observability data of one known event from the known event data storage unit 13. That is, it acquires the observability data to which the same event label is attached. For example, in the first execution of the loop, observability data with an event label of 1 is acquired, and in the second execution, observability data with an event label of 2 is acquired.
  • The processing of steps S31 to S35 is repeated N times, once for each of the known events 1-N, excluding normal data.
  • In step S32, the event classification unit 12 combines the observability data of the known event obtained in step S31 with the failure data.
  • In step S33, the event classification unit 12 applies principal component analysis to the combined data to compress its dimensions.
  • FIG. 9 shows an example of data in which the observability data of a known event and the failure data are combined in the row direction and dimensionally compressed.
  • The upper three rows of data in the example of FIG. 9 are observability data of the same known event acquired from the known event data storage unit 13.
  • The lower two rows of data in the example of FIG. 9 are the failure data to be classified.
  • In step S34, the event classification unit 12 clusters the combined data into two clusters.
  • For clustering, for example, Mini-Batch K-means, which is a type of unsupervised learning, can be used.
  • One cluster is the known event cluster, which includes the observability data of the known event, and the other is the other cluster.
  • FIG. 10 shows how the observability data of known events and the failure data are clustered for each known event. As shown in FIG. 10, for each of the known events 1-N, the data combining the observability data of that known event with the failure data is clustered into two clusters.
  • Because the event classification unit 12 always clusters the observability data of the known event and the failure data into two clusters, some of the observability data of the known event and some of the failure data may be classified into the other cluster even when the failure that occurred is that known event.
  • In step S35, the event classification unit 12 calculates the proportion of the failure data that is classified into the known event cluster. If the known event cluster includes much of the failure data, it can be estimated that the failure is that known event.
  • Steps S31 to S35 are executed for each of the known events 1-N, and the proportion of failure data included in the known event cluster is calculated for each of the known events 1-N.
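  • One pass of the loop (combine, cluster, compute the proportion) can be sketched as follows. Several choices here are assumptions rather than the described method: a plain two-means with a deterministic start is swapped in for Mini-Batch K-means so that the example is reproducible, the dimension compression of step S33 is left out for brevity, and the known event cluster is taken to be the cluster holding most of the known-event rows. All data values are made up.

```python
import numpy as np

def two_means(X, n_iter=20):
    """Plain two-cluster k-means with a deterministic farthest-point start."""
    X = np.asarray(X, dtype=float)
    first = X[0]
    second = X[np.argmax(np.linalg.norm(X - first, axis=1))]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    # Final assignment against the converged centers.
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def known_event_ratio(known_rows, failure_rows):
    """Fraction of failure rows that land in the cluster holding most known-event rows."""
    combined = np.vstack([known_rows, failure_rows])   # combine in the row direction
    labels = two_means(combined)
    known_labels = labels[: len(known_rows)]
    failure_labels = labels[len(known_rows):]
    known_cluster = np.bincount(known_labels, minlength=2).argmax()
    return float(np.mean(failure_labels == known_cluster))

known_rows = [[0.0, 0.0], [0.2, 0.1], [0.1, -0.1], [-0.1, 0.2]]
similar_failure = [[0.1, 0.1], [0.0, 0.2], [10.0, 10.0]]     # mostly like the known event
different_failure = [[10.0, 10.0], [10.2, 9.9], [9.8, 10.1]]  # unlike the known event

ratio_similar = known_event_ratio(known_rows, similar_failure)
ratio_different = known_event_ratio(known_rows, different_failure)
print(ratio_similar, ratio_different)
```

  • In this toy data, two of the three similar failure rows fall into the known event cluster, while none of the different failure rows do.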
  • In step S36, the event classification unit 12 classifies the failure that occurred as either an unknown event or one of the known events 1-N, based on the proportion of failure data included in the known event cluster calculated for each known event 1-N. Specifically, if the proportion of failure data classified into the known event cluster is lower than a threshold for all known events 1-N, the event classification unit 12 classifies the failure as an unknown event. Otherwise, the event classification unit 12 determines that the failure is the known event for which the proportion of failure data classified into the known event cluster is higher than the threshold.
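  • The decision in step S36 can be sketched as follows. The threshold value of 0.5 and the tie-breaking rule (choosing the known event with the highest proportion when several exceed the threshold) are assumptions; the description specifies neither. The function name and the proportions are hypothetical.

```python
def classify_failure(ratios, threshold=0.5):
    """Return the best-matching known event label, or "unknown".

    `ratios` maps each known event label to the proportion of failure data
    that fell into that event's known event cluster.
    """
    above = {label: ratio for label, ratio in ratios.items() if ratio > threshold}
    if not above:
        return "unknown"
    # Assumption: if several known events exceed the threshold, pick the highest.
    return max(above, key=above.get)

print(classify_failure({1: 0.1, 2: 0.8, 3: 0.6}))
print(classify_failure({1: 0.1, 2: 0.3}))
```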
  • When classifying the failure into one of the known events 1-N, the event classification unit 12 may run a classifier such as random forest on the failure data again to confirm that the failure data is classified into that known event.
  • As described above, the present embodiment is a classification device 10 that classifies failures that occur in monitored services, and includes a failure data extraction unit 11 and an event classification unit 12.
  • The failure data extraction unit 11 extracts failure data from observability data when a failure occurs in a monitored service.
  • For each known event that has been dealt with in the past, the event classification unit 12 classifies the observability data of the known event and the failure data into clusters, calculates the proportion of failure data included in the cluster of the known event, and, based on the proportion calculated for each known event, classifies the failure as either an unknown event that has not been dealt with in the past or a known event. This makes it possible to determine whether a failure that occurs in a monitored service is an unknown event that has not been dealt with in the past or a known event that has been dealt with in the past, enabling rapid failure recovery.
  • The classification device 10 described above can be implemented using, for example, a general-purpose computer system that includes a central processing unit (CPU) 901, a memory 902, a storage 903, a communication device 904, an input device 905, and an output device 906, as shown in FIG. 11.
  • The classification device 10 is realized by the CPU 901 executing a predetermined program loaded onto the memory 902.
  • This program can be recorded on a computer-readable recording medium such as a magnetic disk, optical disk, or semiconductor memory, or can be distributed via a network.

Abstract

The invention relates to a classification device (10) for classifying failures that have occurred in a monitored service, comprising a failure data extraction unit (11) and an event classification unit (12). The failure data extraction unit (11) extracts failure data from observability data at the time a failure occurred in the monitored service. For each known event that has been dealt with in the past, the event classification unit (12) classifies the observability data of the known event and the failure data into clusters, calculates the proportion of the failure data included in the cluster of the known event, and, based on the proportions calculated for the respective known events, classifies the failure that occurred as either a known event or an unknown event that has not been dealt with in the past.
PCT/JP2022/019498 2022-05-02 2022-05-02 Classification device, classification method, and program WO2023214446A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/019498 WO2023214446A1 (fr) 2022-05-02 2022-05-02 Classification device, classification method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/019498 WO2023214446A1 (fr) 2022-05-02 2022-05-02 Classification device, classification method, and program

Publications (1)

Publication Number Publication Date
WO2023214446A1 (fr) 2023-11-09

Family

ID=88646390

Family Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/019498 WO2023214446A1 (fr) 2022-05-02 2022-05-02 Classification device, classification method, and program

Country Status (1)

Country Link
WO (1) WO2023214446A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016132717A1 (fr) * 2015-02-17 2016-08-25 日本電気株式会社 Log analysis system, log analysis method, and program recording medium
WO2017081865A1 (fr) * 2015-11-13 2017-05-18 日本電気株式会社 Log analysis system and method, and recording medium
JP2020046883A (ja) * 2018-09-18 2020-03-26 株式会社東芝 Classification device, classification method, and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22940808
Country of ref document: EP
Kind code of ref document: A1