WO2023103344A1 - Data processing method and apparatus, device, and storage medium - Google Patents

Publication number
WO2023103344A1
WO2023103344A1 PCT/CN2022/100708 CN2022100708W WO2023103344A1 WO 2023103344 A1 WO2023103344 A1 WO 2023103344A1 CN 2022100708 W CN2022100708 W CN 2022100708W WO 2023103344 A1 WO2023103344 A1 WO 2023103344A1
Authority
WO
WIPO (PCT)
Prior art keywords
alarm
subsystem
abnormal
node device
node
Prior art date
Application number
PCT/CN2022/100708
Other languages
French (fr)
Chinese (zh)
Inventor
陈鉴镔
杨军
卢道和
陈刚
程志峰
朱嘉伟
罗海湾
李勋棋
汪晓雪
周琪
郭英亚
李兴龙
胡仲臣
周佳振
文玉茹
何勇彬
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2023103344A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The present application discloses a data processing method and apparatus, a device, and a storage medium. The method comprises: obtaining alarm information of at least two node devices comprised in the data processing system; determining at least one anomaly subsystem on the basis of the at least two node devices; performing first processing on each anomaly subsystem in the at least one anomaly subsystem to obtain alarm information of the at least one anomaly subsystem, the first processing comprising: converging the alarm information of at least one node device comprised in the anomaly subsystem as alarm information of the anomaly subsystem; and correspondingly displaying the alarm information of the anomaly subsystem for each anomaly subsystem in the at least one anomaly subsystem. According to the solution, the anomaly subsystem and the corresponding alarm information can be clearly displayed, so that fault positioning efficiency is improved.

Description

A data processing method, apparatus, device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202111491784.9, filed on December 8, 2021, the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the technical field of data processing, and in particular, but not exclusively, to a data processing method, apparatus, device, and storage medium.
Background
With the rapid development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). The security and real-time requirements of the financial industry, however, place higher demands on these technologies. As network technology develops, a network system (also called a data processing system) contains more and more devices, and the relationships among those devices grow increasingly complex. Because the failure of one device may cause related devices to fail as well, locating a fault is particularly important in a complex data processing system.
In the related art, a system is generally monitored with the distributed monitoring tool Zabbix and the open-source monitoring tool open-falcon. Specifically, an agent process is deployed on each monitored device; the monitored device collects alarm information through the agent and reports it, via Zabbix and open-falcon, to a proxy; the proxy then reports the alarm information to a monitoring device (server) for aggregation, after which the alarm information can be displayed on the monitoring device at instance granularity.
However, when the alarm information of each monitored device is simply listed with the device as the processing dimension, a large number of alarms may produce an alarm storm, which makes fault locating difficult.
Summary
This application provides a data processing method, apparatus, device, and storage medium. The solution clearly displays the abnormal subsystems together with their corresponding alarm information, thereby improving the efficiency of fault locating.
The technical solution of this application is implemented as follows:
This application provides a data processing method applied to a control device in a data processing system, where the data processing system further includes node devices. The method includes:
acquiring alarm information of at least two node devices included in the data processing system;
determining at least one abnormal subsystem based on the at least two node devices;
executing first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, where the first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem; and
for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem.
This application provides a data processing apparatus deployed in a control device in a data processing system, where the data processing system further includes node devices. The apparatus includes:
an acquiring unit, configured to acquire alarm information of at least two node devices included in the data processing system;
a determining unit, configured to determine at least one abnormal subsystem based on the at least two node devices;
a processing unit, configured to execute first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, where the first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem; and
a display unit, configured to correspondingly display, for each abnormal subsystem of the at least one abnormal subsystem, the alarm information of the abnormal subsystem.
This application further provides an electronic device including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the above data processing method when executing the program.
This application further provides a storage medium storing a computer program that, when executed by a processor, implements the above data processing method.
The data processing method, apparatus, device, and storage medium provided in this application include: acquiring alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, where the first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem; and, for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem. The solution converges device-level alarm information into subsystem-level alarm information and displays it per subsystem, so that the displayed content makes clear which subsystems are abnormal and which are normal, improving the efficiency of locating a faulty subsystem.
Brief Description of the Drawings
FIG. 1 is an optional schematic structural diagram of a data processing system provided in an embodiment of this application;
FIG. 2 is an optional schematic flowchart of a data processing method provided in an embodiment of this application;
FIG. 3 is an optional schematic flowchart of a data processing method provided in an embodiment of this application;
FIG. 4 is an optional schematic flowchart of a data processing method provided in an embodiment of this application;
FIG. 5 is an optional schematic flowchart of a data processing method provided in an embodiment of this application;
FIG. 6 is an optional schematic flowchart of a data processing method provided in an embodiment of this application;
FIG. 7 is an optional schematic flowchart of a data processing method provided in an embodiment of this application;
FIG. 8 is an optional schematic structural diagram of an executor provided in an embodiment of this application;
FIG. 9 is an optional schematic structural diagram of a data processing apparatus provided in an embodiment of this application;
FIG. 10 is an optional schematic structural diagram of an electronic device provided in an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the specific technical solutions of the application are described in further detail below with reference to the drawings in the embodiments. The following embodiments illustrate this application and are not intended to limit its scope.
In the following description, "some embodiments" describes a subset of all possible embodiments. It should be understood that "some embodiments" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with one another where no conflict arises.
In the following description, the terms "first/second/third" merely distinguish different objects; they do not denote a specific ordering of the objects and impose no limitation on sequence. Where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of this application described here can be implemented in an order other than the one illustrated or described.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field to which this application belongs. The terms used herein serve only to describe the embodiments of this application and are not intended to limit this application.
Embodiments of this application provide a data processing method, apparatus, device, and storage medium. In practice, the data processing method may be implemented by a data processing apparatus, and each functional entity in the apparatus may be implemented cooperatively by the hardware resources of the control device, such as computing resources (for example, a processor) and communication resources (for example, resources supporting communication over optical cable, cellular, and other channels).
The data processing method provided in the embodiments of this application is applied to a data processing system that includes a control device and node devices. The control device performs the following: acquiring alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, where the first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem; and, for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem.
As an example, the structure of the data processing system 10 may be as shown in FIG. 1 and includes a control device 101, node devices 102, and a network 103. The control device 101 can communicate with the node devices 102 through the network 103.
The control device 101 is configured to perform the following: acquiring alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, where the first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem; and, for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem.
A node device 102 may be a hardware device such as a server or a router, or a virtual device such as a virtual machine or a container.
The network 103 carries the communication between the control device 101 and the node devices 102. The network 103 may include a wired network, a wireless network, and so on.
It should be noted that the embodiments of this application do not specifically limit the number of node devices 102 or the number of control devices 101 in the data processing system; these can be configured according to actual requirements. In one example, there is one control device 101 and there are multiple node devices 102, and the multiple node devices 102 belong to different subsystems.
The embodiments of the data processing method, apparatus, device, and storage medium provided in this application are described below with reference to the schematic diagram of the data processing system shown in FIG. 1.
In a first aspect, an embodiment of this application provides a data processing method applied to a data processing apparatus, where the data processing apparatus may be deployed in the control device 101 in FIG. 1. The data processing procedure provided in the embodiments of this application is described below.
FIG. 2 is an optional schematic flowchart of the data processing method provided in an embodiment of this application. The method may include, but is not limited to, S201 to S204 shown in FIG. 2.
S201. The control device acquires alarm information of at least two node devices included in the data processing system.
The data processing system includes multiple node devices. When at least two of these node devices become abnormal, the control device acquires the alarm information of the at least two abnormal node devices.
It should be noted that the at least two abnormal node devices may be some or all of the multiple node devices included in the data processing system. The embodiments of this application do not specifically limit the number of node devices that have alarm information; it can be configured according to actual requirements.
The alarm information of a node device indicates that the node device is abnormal. The embodiments of this application do not limit the specific content of the alarm information of a node device; it can be configured according to actual requirements.
In a possible implementation, the alarm information of a node device may include at least one first alarm level and first alarm quantities in one-to-one correspondence with the at least one first alarm level. For example, the alarm information of a node device may include: a critical alarm with a corresponding quantity of 1; a major alarm with a corresponding quantity of 2; and a minor alarm with a corresponding quantity of 3. Here, the quantity 1 indicates that a critical alarm occurred once, the quantity 2 indicates that a major alarm occurred twice, and the quantity 3 indicates that a minor alarm occurred three times.
In a possible implementation, the alarm information of a node device may further include, in one-to-one correspondence with the at least one first alarm level, an alarm type, an alarm time, an alarm log, and so on.
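The content of a node device's alarm information described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `LevelDetail` and `NodeAlarmInfo`, their field names, and the sample values are hypothetical and do not come from the application, which leaves the concrete representation open.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LevelDetail:
    """Details kept for one first alarm level (field names are hypothetical)."""
    count: int = 0                                   # first alarm quantity
    alarm_types: List[str] = field(default_factory=list)
    alarm_times: List[str] = field(default_factory=list)
    log_lines: List[str] = field(default_factory=list)


@dataclass
class NodeAlarmInfo:
    """Alarm information of one node device, keyed by first alarm level."""
    node_id: str
    levels: Dict[str, LevelDetail] = field(default_factory=dict)

    def record(self, level: str, alarm_type: str, alarm_time: str, log_line: str) -> None:
        # Create the per-level entry on first use, then append the details.
        detail = self.levels.setdefault(level, LevelDetail())
        detail.count += 1
        detail.alarm_types.append(alarm_type)
        detail.alarm_times.append(alarm_time)
        detail.log_lines.append(log_line)

    def counts(self) -> Dict[str, int]:
        # Level -> quantity view, matching the critical/major/minor example.
        return {level: detail.count for level, detail in self.levels.items()}


# Reproduce the example from the text: 1 critical, 2 major, 3 minor alarms.
info = NodeAlarmInfo(node_id="node-a")  # hypothetical node identifier
info.record("critical", "disk", "2021-12-08T10:00:00", "disk unavailable")
for minute in range(2):
    info.record("major", "memory", f"2021-12-08T10:0{minute}:30", "memory exception")
for minute in range(3):
    info.record("minor", "latency", f"2021-12-08T10:1{minute}:00", "slow response")
print(info.counts())
```

Keeping the per-level details alongside the count means the same object can later serve both the converged counts and a drill-down view.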
The implementation of S201 may include: the control device receives the alarm logs reported by the at least two node devices and, for each of the at least two node devices, processes the alarm log of that node device to obtain its alarm information.
S202. The control device determines at least one abnormal subsystem based on the at least two node devices.
The control device determines the subsystems to which the at least two node devices with alarm information belong as the at least one abnormal subsystem. The at least two node devices may belong to a single abnormal subsystem or to multiple abnormal subsystems.
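A minimal sketch of S202 follows, assuming the control device holds a lookup table from node device to subsystem. The table `NODE_TO_SUBSYSTEM`, the function name, the IP addresses, and the subsystem names are all hypothetical; the application does not say how membership is stored.

```python
from typing import Dict, Iterable, List

# Hypothetical membership table: node device -> subsystem it belongs to.
NODE_TO_SUBSYSTEM: Dict[str, str] = {
    "10.0.0.1": "payment",
    "10.0.0.2": "payment",
    "10.0.0.3": "login",
    "10.0.0.4": "risk",
}


def find_abnormal_subsystems(
    alarming_nodes: Iterable[str],
    node_to_subsystem: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group the node devices that have alarm information by subsystem.

    Every subsystem that owns at least one alarming node is treated as abnormal.
    """
    grouped: Dict[str, List[str]] = {}
    for node in alarming_nodes:
        subsystem = node_to_subsystem[node]
        grouped.setdefault(subsystem, []).append(node)
    return grouped


# Three alarming nodes that fall into two abnormal subsystems.
abnormal = find_abnormal_subsystems(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"], NODE_TO_SUBSYSTEM
)
print(abnormal)
```

The "risk" subsystem does not appear in the result because none of its nodes reported an alarm.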
S203. The control device executes first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem.
The first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem.
The alarm information of a subsystem indicates that the subsystem is abnormal. The embodiments of this application do not limit the specific content of the alarm information of a subsystem; it can be configured according to actual requirements.
In a possible implementation, the alarm information of a subsystem may include at least one second alarm level and second alarm quantities in one-to-one correspondence with the at least one second alarm level. For example, the alarm information of a subsystem may include: a critical alarm with a corresponding quantity of 4; a major alarm with a corresponding quantity of 5; and a minor alarm with a corresponding quantity of 6.
In a possible implementation, the alarm information of a subsystem may further include, in one-to-one correspondence with the at least one second alarm level, an alarm type, an alarm time, a node device, and so on.
The control device executes the first processing for each abnormal subsystem of the at least one abnormal subsystem, converging the alarm information of the at least one node device included in that abnormal subsystem into the alarm information of the abnormal subsystem, and thereby obtains the alarm information of the at least one abnormal subsystem.
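The first processing can be illustrated as follows. This sketch assumes that convergence means summing the per-level quantities of all node devices in a subsystem; that rule is an assumption made for illustration, and the function name and sample counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def converge(node_alarm_infos: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Converge per-node alarm counts into one per-subsystem count.

    Assumption: the subsystem-level quantity for a level is the sum of the
    node-level quantities for that level.
    """
    total: Counter = Counter()
    for info in node_alarm_infos:
        total.update(info)  # Counter.update adds counts from a mapping
    return dict(total)


# Two node devices of the same abnormal subsystem (sample values).
node_a = {"critical": 1, "major": 2, "minor": 3}
node_b = {"critical": 3, "major": 3, "minor": 3}
subsystem_info = converge([node_a, node_b])
print(subsystem_info)
```

With these sample inputs the converged quantities are 4 critical, 5 major, and 6 minor, the same figures used in the subsystem example above.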
S204. For each abnormal subsystem of the at least one abnormal subsystem, the control device correspondingly displays the alarm information of the abnormal subsystem.
The embodiments of this application do not specifically limit the manner in which an abnormal subsystem and its alarm information are correspondingly displayed; it can be configured according to actual requirements.
In a possible implementation, the display may take the form of a diagram. Correspondingly, the implementation of S204 may include: showing the at least one abnormal subsystem of the data processing system in the diagram, and showing the alarm information of each abnormal subsystem at the position corresponding to that subsystem. In this implementation, the display may be built on an existing diagram of the system (for example, a flowchart or a structural diagram) or on a newly created diagram.
In another possible implementation, the display may take the form of text. Correspondingly, the implementation of S204 may include: correspondingly displaying, as text, the at least one abnormal subsystem of the data processing system and the alarm information of each abnormal subsystem.
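The text form of display might look like the sketch below, which prints one line per abnormal subsystem. The line layout, the `[ABNORMAL]` tag, the level ordering, and the function name are assumptions chosen for illustration; the application does not prescribe a format.

```python
from typing import Dict

# Assumed display order, most severe first.
LEVEL_ORDER = ("critical", "major", "minor")


def render_text(subsystem_alarms: Dict[str, Dict[str, int]]) -> str:
    """Render one text line per abnormal subsystem with its alarm counts."""
    lines = []
    for name in sorted(subsystem_alarms):
        counts = subsystem_alarms[name]
        summary = ", ".join(
            f"{level}={counts[level]}" for level in LEVEL_ORDER if level in counts
        )
        lines.append(f"[ABNORMAL] {name}: {summary}")
    return "\n".join(lines)


report = render_text(
    {
        "payment": {"minor": 6, "critical": 4, "major": 5},
        "login": {"major": 1},
    }
)
print(report)
```

Sorting by subsystem name keeps the output stable between refreshes; sorting by severity would be an equally reasonable choice.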
The data processing solution provided in the embodiments of this application includes: acquiring alarm information of at least two node devices included in the data processing system; determining at least one abnormal subsystem based on the at least two node devices; executing first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, where the first processing includes converging the alarm information of at least one node device included in the abnormal subsystem into alarm information of the abnormal subsystem; and, for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem. The solution converges device-level alarm information into subsystem-level alarm information and displays it per subsystem, so that the displayed content makes clear which subsystems are abnormal and which are normal, improving the efficiency of locating a faulty subsystem.
The process in S201 by which the control device acquires the alarm information of at least two node devices included in the data processing system is described below. The control device acquires the alarm information of each node device in a similar way, so the process is described using one node device as an example. The process may include, but is not limited to, S2011 to S2013 below.
S2011. The control device processes the alarm log reported by the node device to obtain keyword fields that represent alarm prompts.
The control device splits the information in the alarm log according to a splitting rule to obtain multiple fields, identifies the keyword fields among them, and extracts the content of each keyword field (the alarm prompt).
It should be noted that the splitting rule must match the specific log format. For example, if a log format separates the pieces of information (for example, alarm device, alarm time, and alarm level) with spaces, the splitting rule is to split on spaces, which yields the alarm device, the alarm time, and the alarm level.
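The space-delimited splitting rule in the example above can be sketched as follows. The log line layout (device, time, level, each without embedded spaces), the ISO-style timestamp, and the function name are assumptions for illustration; a real log format would need its own rule.

```python
from typing import Dict


def split_log_line(line: str) -> Dict[str, str]:
    """Split one space-delimited alarm log line into its three fields.

    Assumed layout: "<alarm device> <alarm time> <alarm level>".
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"unexpected log format: {line!r}")
    device, alarm_time, level = parts
    return {"device": device, "time": alarm_time, "level": level}


fields = split_log_line("8.8.8.8 2021-12-08T10:15:00 major")
print(fields)
```

Raising on an unexpected field count makes a mismatch between the splitting rule and the log format visible immediately instead of silently producing wrong fields.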
S2012. Based on the alarm prompts represented by the keyword fields, the control device determines M first alarm levels for the node device and M first alarm quantities in one-to-one correspondence with the M first alarm levels.
Here, M is greater than or equal to 1.
The control device sorts the alarm prompts in descending order of the number of times each appears in the log, determines the top K alarm prompts, determines the alarm level corresponding to each of these prompts to obtain the M first alarm levels, and counts the number of first alarms corresponding to each first alarm level.
Here, K is greater than or equal to M; that is, one alarm level may correspond to one or more types of alarm prompt. For example, the alarm prompt "JDBC exception" corresponds to an important alarm, and the alarm prompt "memory exception" also corresponds to an important alarm.
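A short sketch of S2012 under stated assumptions: the prompt-to-level table `PROMPT_TO_LEVEL` is hypothetical (only the "JDBC exception" and "memory exception" prompts come from the text, and mapping their "important" level to the key `major` is an assumption), and the per-level quantity is taken to be the total number of occurrences of the top-K prompts that map to that level.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical prompt -> alarm level table.
PROMPT_TO_LEVEL: Dict[str, str] = {
    "JDBC exception": "major",
    "memory exception": "major",
    "disk full": "critical",
    "slow response": "minor",
}


def levels_from_prompts(prompts: List[str], k: int) -> Dict[str, int]:
    """Take the K most frequent prompts and count alarms per alarm level."""
    top_k = Counter(prompts).most_common(k)  # [(prompt, occurrences), ...]
    level_counts: Counter = Counter()
    for prompt, occurrences in top_k:
        level_counts[PROMPT_TO_LEVEL[prompt]] += occurrences
    return dict(level_counts)


prompts = ["JDBC exception"] * 3 + ["memory exception"] * 2 + ["disk full"]
top2 = levels_from_prompts(prompts, k=2)
top3 = levels_from_prompts(prompts, k=3)
print(top2, top3)
```

With K = 2, both selected prompts map to the same level, so M = 1; this is the K ≥ M case described above.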
S2013、控制设备基于所述M个第一告警级别,以及所述M个第一告警级数量,更新针对所述节点设备的第一告警字典,得到所述节点设备的告警信息。S2013. The control device updates a first alarm dictionary for the node device based on the M first alarm levels and the number of the M first alarm levels, to obtain alarm information of the node device.
其中,N大于或等于M。Wherein, N is greater than or equal to M.
第一告警字典包括N个第一告警级别,以及与N个第一告警级别一一对应的N个第一预设告警数量。The first warning dictionary includes N first warning levels and N first preset warning numbers corresponding to the N first warning levels.
本申请实施例对第一告警字典的表达形式不做具体限定,可以根据实际需求进行配置。示例性的,第一告警字典可以配置为:{‘minor’:0;‘major’:0;‘critical’:0}。The embodiment of the present application does not specifically limit the expression form of the first alarm dictionary, which may be configured according to actual requirements. Exemplarily, the first alarm dictionary can be configured as: {'minor': 0; 'major': 0; 'critical': 0}.
示例性的,得到的节点设备(IP为8.8.8.8的容器)的告警信息包括:{‘minor’:0,‘major’:1,‘critical’:0}。对应的,M个第一告警级别为‘major’,与M个第一告警级数量为1。Exemplarily, the obtained alarm information of the node device (the container whose IP is 8.8.8.8) includes: {'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, the M first warning levels are 'major', and the number of M first warning levels is 1.
控制设备基于M个第一告警级别,以及与M个第一告警级别一一对应的第一告警级数量,更新针对节点设备的第一告警字典,得到节点设备的告警信息。Based on the M first warning levels and the number of first warning levels corresponding to the M first warning levels, the control device updates the first warning dictionary for the node device to obtain the warning information of the node device.
这样,由于针对节点设备的第一告警字典的形式是预先设置的,在获得设备的告警信息时,只需根据具体告警内容更新告警字典,实现简单、清楚。In this way, since the form of the first alarm dictionary for the node device is preset, when obtaining the alarm information of the device, it is only necessary to update the alarm dictionary according to the specific alarm content, which is simple and clear.
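As an illustration of S2012 and S2013, the following is a minimal Python sketch, not the implementation of the application. It assumes a hypothetical log layout of four space-separated fields (device, time, level, alarm prompt) and a hypothetical `PROMPT_LEVEL` mapping from alarm prompt to alarm level; the function name `node_alarm_info` and the sample log lines are likewise illustrative only.

```python
from collections import Counter

# Hypothetical mapping from alarm prompt (keyword) to alarm level.
PROMPT_LEVEL = {
    "JDBC_exception": "major",
    "memory_exception": "major",
    "disk_full": "critical",
}


def node_alarm_info(log_lines, top_k=3):
    """Build the first alarm dictionary for one node device."""
    prompts = Counter()
    for line in log_lines:
        # Assumed cutting rule: split on spaces; the 4th field is the prompt.
        fields = line.split(" ")
        if len(fields) >= 4:
            prompts[fields[3]] += 1

    # Preset form of the first alarm dictionary.
    alarm_dict = {"minor": 0, "major": 0, "critical": 0}
    # Keep only the top K prompts, ordered by number of occurrences.
    for prompt, count in prompts.most_common(top_k):
        level = PROMPT_LEVEL.get(prompt)
        if level is not None:
            alarm_dict[level] += count
    return alarm_dict


logs = [
    "8.8.8.8 2022-06-23T10:00:00 major JDBC_exception",
    "8.8.8.8 2022-06-23T10:05:00 major memory_exception",
    "8.8.8.8 2022-06-23T10:07:00 major JDBC_exception",
]
print(node_alarm_info(logs))  # {'minor': 0, 'major': 3, 'critical': 0}
```

Both prompts map to the same level here, which mirrors the case K > M described above.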
The following describes the process in S203 by which the control device converges the alarm information of the at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem. The process may include, but is not limited to, S2031 to S2033 below.
S2031. The control device obtains the alarm information of the at least one node device included in the abnormal subsystem.
The alarm information of a node device includes M first alarm levels and M first alarm counts in one-to-one correspondence with the M first alarm levels.
The control device obtains the alarm information of the node devices included in the abnormal subsystem according to the identifiers of those node devices. The abnormal subsystem may include one or more node devices.
S2032. Based on the M first alarm levels and the M first alarm counts of each node device among the at least one node device, the control device determines P second alarm levels for the abnormal subsystem, and P second alarm counts in one-to-one correspondence with the P second alarm levels.
P is greater than or equal to M.
The control device takes the union of the M first alarm levels of every node device among the at least one node device as the P second alarm levels. For each of the P second alarm levels, the control device sums, over the node devices, the first alarm counts of the first alarm level that has the same content as that second alarm level, and uses the sum as the second alarm count corresponding to that second alarm level.
S2033. Based on the P second alarm levels and the P second alarm counts, the control device updates a second alarm dictionary for the abnormal subsystem to obtain the alarm information of the abnormal subsystem.
The second alarm dictionary includes Q second alarm levels and Q second preset alarm counts in one-to-one correspondence with the Q second alarm levels.
Q is greater than or equal to P.
The embodiments of the present application do not specifically limit the form of the second alarm dictionary, which may be configured according to actual requirements. For example, the second alarm dictionary may be configured as {'minor': 0, 'major': 0, 'critical': 0}.
For example, abnormal subsystem 1 includes two node devices, node device 1 and node device 2. The alarm information of node device 1 is {'minor': 0, 'major': 1, 'critical': 0}; the alarm information of node device 2 is {'minor': 0, 'major': 1, 'critical': 1}; the converged alarm information of abnormal subsystem 1 is {'minor': 0, 'major': 2, 'critical': 1}.
In this way, because the form of the second alarm dictionary for the subsystem is preset, obtaining the alarm information of the subsystem only requires updating the second alarm dictionary according to the contents of the alarm dictionaries of the node devices, that is, updating the alarm levels of the subsystem and the number of alarms under each alarm level, which is simple and clear to implement.
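The convergence in S2031 to S2033 can be sketched in Python as follows. This is a minimal sketch under the assumption that each node device's alarm information is a plain dictionary keyed by alarm level; the function name `converge` is hypothetical.

```python
def converge(node_infos):
    """Sum the per-level counts of all node devices into one subsystem dictionary."""
    # Preset form of the second alarm dictionary.
    subsystem = {"minor": 0, "major": 0, "critical": 0}
    for info in node_infos:
        for level, count in info.items():
            # Levels outside the preset form are added on first sight (union of levels).
            subsystem[level] = subsystem.get(level, 0) + count
    return subsystem


node1 = {"minor": 0, "major": 1, "critical": 0}
node2 = {"minor": 0, "major": 1, "critical": 1}
print(converge([node1, node2]))  # {'minor': 0, 'major': 2, 'critical': 1}
```

The printed result reproduces the example of abnormal subsystem 1 given above.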
The following describes the process in S204 by which the control device, for each abnormal subsystem among the at least one abnormal subsystem, displays the alarm information of that abnormal subsystem. The process may include, but is not limited to, Embodiment A1 or Embodiment A2 below.
Embodiment A1: display in the form of a diagram.
Embodiment A2: display in the form of text.
Embodiment A1 may include, but is not limited to, S2041 to S2043 below.
S2041. The control device obtains a transaction flowchart, in which the at least one abnormal subsystem is shown.
In one possible implementation, S2041 may be implemented as follows: the control device uses Graphviz to draw the system flow involved in the transaction process and obtains the transaction flowchart.
In another possible implementation, the transaction flowchart is drawn in advance and stored at a fixed location, and S2041 may be implemented as follows: the control device obtains the transaction flowchart from the fixed location where it is stored.
S2042. In the transaction flowchart, for each abnormal subsystem among the at least one abnormal subsystem, the control device determines the subsystem node to which the abnormal subsystem belongs.
For each abnormal subsystem, the control device determines the subsystem node to which it belongs in the transaction flowchart according to the node identifiers of the subsystems in the transaction flowchart.
S2043. The control device displays the alarm information of the abnormal subsystem at the corresponding subsystem node.
The embodiments of the present application do not specifically limit how the control device displays the alarm information of the abnormal subsystem at the subsystem node; this may be configured according to actual requirements. For example, the alarm information of the abnormal subsystem may be displayed at a preset position relative to the subsystem node (left, right, above, below, and so on).
Embodiment A2 may include: displaying, in text form, the at least one abnormal subsystem in the data processing system together with the alarm information of each abnormal subsystem.
With Embodiment A1, displaying the alarm information of the abnormal subsystems in a diagram shows not only the alarm information of each abnormal subsystem but also the logical relationship between the abnormal subsystem and the other subsystems in the system.
With Embodiment A2, displaying in text form is easy to implement, because text has a small data volume and good compatibility.
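A text display in the sense of Embodiment A2 could look like the following minimal sketch. The output format and the function name `render_text` are assumptions chosen for illustration; the application does not prescribe a format.

```python
LEVELS = ("minor", "major", "critical")


def render_text(subsystem_alarms):
    """Return one text line per abnormal subsystem with its per-level counts."""
    lines = []
    for name in sorted(subsystem_alarms):
        info = subsystem_alarms[name]
        counts = ", ".join(f"{level}={info.get(level, 0)}" for level in LEVELS)
        lines.append(f"{name}: {counts}")
    return "\n".join(lines)


print(render_text({"subsystem_1": {"minor": 0, "major": 2, "critical": 1}}))
# subsystem_1: minor=0, major=2, critical=1
```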
With the data processing method provided by the embodiments of the present application, after an abnormal subsystem is determined, the abnormal subsystem may be processed further. The processing is similar for every abnormal subsystem, so one abnormal subsystem is taken as an example below.
As shown in FIG. 3, the process may include, but is not limited to, S205 to S207 below.
S205. The control device determines a node device in the abnormal subsystem that satisfies a first condition as a node device to be processed.
In one possible implementation, if only one node device in the abnormal subsystem is abnormal, that is, only one node device has alarm information, that node device is determined as the node device to be processed.
In another possible implementation, if multiple node devices in one abnormal subsystem are abnormal, that is, multiple node devices have alarm information, all of those node devices are determined as node devices to be processed.
S206. The control device judges whether the node device to be processed is faulty.
The embodiments of the present application do not specifically limit how to judge whether the node device to be processed is faulty; this may be configured according to actual requirements.
S207. If the node device to be processed is faulty, the control device sends a first instruction to the node device to be processed, so that the node device to be processed performs the corresponding operation as indicated by the first instruction.
The corresponding operations may include: dumping memory, threads, and processes; restarting; isolating; automatic scale-out; and so on.
In this way, when a node device is faulty, the corresponding operation can be performed automatically, which improves the efficiency of fault handling.
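S207 can be pictured as a small dispatch table from operation names to handlers. The sketch below is purely illustrative: the handler functions, the `EXECUTORS` table, and the default order (dump, then isolate, then restart) are assumptions, and the handlers only return descriptive strings instead of touching a real node.

```python
def dump(node):
    return f"dump memory, threads and processes of {node}"


def isolate(node):
    return f"isolate traffic from {node}"


def restart(node):
    return f"restart {node}"


# Hypothetical executor table: operation name -> handler.
EXECUTORS = {"dump": dump, "isolate": isolate, "restart": restart}


def handle_fault(node, faulty, plan=("dump", "isolate", "restart")):
    """Run the planned operations only when the node was judged faulty."""
    if not faulty:
        return []
    return [EXECUTORS[op](node) for op in plan]


for action in handle_fault("8.8.8.8", faulty=True):
    print(action)
```

Dumping before restarting is a design choice of this sketch: it keeps the diagnostic state that a restart would otherwise discard.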
The following describes the process in S205 by which the control device determines a node device in the abnormal subsystem that satisfies the first condition as the node device to be processed. The process may include, but is not limited to, S2051 and S2052 below.
S2051. The control device determines the alarm score of each node device in the abnormal subsystem.
For example, the control device may determine the alarm score of a node device by Formula (1) below.
F = Q(minor) × f1 + Q(major) × f2 + Q(critical) × f3    Formula (1)
Here, F is the alarm score of the node device; Q(minor), Q(major), and Q(critical) are the first alarm counts corresponding to the minor, major, and critical alarm levels, respectively; and f1, f2, and f3 are the alarm scores assigned to the minor, major, and critical alarm levels, respectively.
In one example, f1 may be 1 point, f2 may be 2 points, and f3 may be 4 points.
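Formula (1) with the example weights f1 = 1, f2 = 2, and f3 = 4 can be written as a short Python function. The names `WEIGHTS` and `alarm_score` are illustrative, and the input dictionary is a made-up example.

```python
# Example weights from the text: f1 = 1 (minor), f2 = 2 (major), f3 = 4 (critical).
WEIGHTS = {"minor": 1, "major": 2, "critical": 4}


def alarm_score(alarm_info, weights=WEIGHTS):
    """Formula (1): F = Q(minor)*f1 + Q(major)*f2 + Q(critical)*f3."""
    return sum(alarm_info.get(level, 0) * w for level, w in weights.items())


print(alarm_score({"minor": 0, "major": 1, "critical": 1}))  # 0*1 + 1*2 + 1*4 = 6
```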
S2052. If the alarm score of a first node device is greater than or equal to a first score threshold, and the alarm score of a second node device is less than or equal to a second score threshold, the control device determines the first node device as the node device to be processed.
The first score threshold is greater than the second score threshold.
The first node device is any node device in the abnormal subsystem, and the second node device is a node device in the abnormal subsystem other than the first node device.
If no first node device satisfies the above condition, the control device determines that there is no node device to be processed.
Determining whether a node device is a node device to be processed by calculating its alarm score is simple to implement.
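The selection rule of S2052 might be sketched as below. The sketch reads "second node device" as every other node device in the subsystem, which is an interpretation rather than something the text states explicitly; the function name and the threshold values are hypothetical.

```python
def find_pending_node(scores, first_threshold, second_threshold):
    """Return the node whose score stands out, or None if there is none.

    scores: dict mapping node identifier -> alarm score.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    for node, score in scores.items():
        others = [s for n, s in scores.items() if n != node]
        # Assumed reading: all remaining nodes must stay at or below the second threshold.
        if score >= first_threshold and all(s <= second_threshold for s in others):
            return node
    return None


print(find_pending_node({"node1": 6, "node2": 1}, 4, 2))  # node1
print(find_pending_node({"node1": 6, "node2": 5}, 4, 2))  # None
```

Because the first threshold exceeds the second, at most one node can satisfy the condition under this reading, which matches the single-instance-fault case.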
The following describes the process in S206 by which the control device judges whether the node device to be processed is faulty. The process may include, but is not limited to, S2061 to S2063 below.
S2061. The control device obtains at least one indicator of the node device to be processed.
The at least one indicator may include, but is not limited to, at least one of the following: latency, success rate, and transaction volume.
S2062. For each of the at least one indicator, the control device calculates a first distance between the indicator and the reference value of the indicator, and obtains a judgment result for the indicator.
If the first distance between the indicator and its reference value is greater than or equal to a first distance threshold, the judgment result for the indicator is abnormal; if the first distance between the indicator and its reference value is less than the first distance threshold, the judgment result for the indicator is normal.
The embodiments of the present application do not limit the specific value of the first distance threshold, which may be configured according to actual requirements.
In one possible implementation, a single isolation forest judger is built for all of the at least one indicator, and running this isolation forest judger yields the judgment result of every indicator.
In another possible implementation, one isolation forest judger is built for each of the at least one indicator, and running the isolation forest judger corresponding to an indicator yields the judgment result of that indicator.
S2062 may be implemented as follows: for each of the at least one indicator, the control device calculates the first distance between the indicator and the reference value of the indicator, and obtains the judgment result for the indicator.
The embodiments of the present application do not limit the specific value of the reference value of an indicator, which may be configured according to actual requirements. For example, the reference value of an indicator may be determined from the historical data of the indicator, or from empirical values.
The embodiments of the present application do not limit how the judgment result is obtained; this may be configured according to actual requirements. In one possible implementation, the judgment result of an indicator may be obtained from the isolation forest judger.
For example, the control device inputs each of the at least one indicator into the isolation forest judger and runs it; the judger outputs a first value or a second value. The first value indicates that the judgment result for the indicator is abnormal, and the second value indicates that the judgment result for the indicator is normal.
S2063. If the judgment results for all of the at least one indicator are normal, the control device determines that the node device to be processed is not faulty; otherwise, the control device determines that the node device to be processed is faulty.
Judging whether the node device to be processed is faulty by means of the isolation forest judger is simple to implement and efficient.
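The distance-and-threshold logic of S2062 and the aggregation rule of S2063 can be sketched without any machine learning library. Note that this sketch uses the absolute difference as the "first distance" and one threshold per indicator; both are assumptions, and it is a stand-in for, not an implementation of, the isolation forest judger. All values are made up.

```python
def judge_indicator(value, reference, distance_threshold):
    """S2062: compare the distance to the reference value with the threshold."""
    distance = abs(value - reference)  # assumed distance measure
    return "abnormal" if distance >= distance_threshold else "normal"


def judge_node(indicators, references, thresholds):
    """S2063: the node is faulty unless every indicator is judged normal."""
    results = {
        name: judge_indicator(value, references[name], thresholds[name])
        for name, value in indicators.items()
    }
    faulty = any(result == "abnormal" for result in results.values())
    return faulty, results


indicators = {"latency_ms": 850, "success_rate": 0.99}
references = {"latency_ms": 200, "success_rate": 0.995}
thresholds = {"latency_ms": 300, "success_rate": 0.05}
print(judge_node(indicators, references, thresholds))
# (True, {'latency_ms': 'abnormal', 'success_rate': 'normal'})
```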
Optionally, after S2063, if the judgment result is abnormal, the data processing method provided by the embodiments of the present application may further correct the judgment result to improve its accuracy. The correction process may include, but is not limited to, Embodiment B1 and Embodiment B2 below.
Embodiment B1: correction based on the baseline value of each indicator.
Embodiment B2: correction based on the normal range of each indicator.
Embodiment B1 may include: when the indicators include latency, if the latency of the node device to be processed is less than or equal to a latency baseline value, the control device changes the judgment result for the latency to normal.
When the indicators include success rate, if the success rate of the node device to be processed is greater than or equal to a success rate baseline value, the control device changes the judgment result for the success rate to normal.
The embodiments of the present application do not limit the specific values of the latency baseline value and the success rate baseline value, which may be configured according to actual requirements.
In one example, the latency baseline value and the success rate baseline value may be determined from historical data or empirical values.
With Embodiment B1, correcting by the baseline value of each indicator allows the judgment result to be corrected according to the specific characteristics of each indicator, which gives high accuracy and flexibility.
Embodiment B2 may include: the control device determines the normal range of an indicator based on historical normal indicator data; the control device judges whether the indicator of the node device to be processed falls within the normal range; if it does, the control device changes the judgment result to normal.
The embodiments of the present application do not specifically limit how the normal range is determined; this may be configured according to actual requirements.
In one example, the normal range of an indicator may be determined from historical data.
The control device determines a normal range for each indicator separately.
With Embodiment B2, correcting by the normal range of each indicator is highly adaptable, because the normal range of an indicator is determined from the historical normal data of that indicator.
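The two corrections can be illustrated with the following minimal sketch. The indicator names `latency` and `success_rate`, the use of the historical minimum and maximum as the normal range, and all sample values are assumptions made for the example.

```python
def correct_b1(results, indicators, baselines):
    """Embodiment B1: reset a judgment to normal when the baseline is met."""
    corrected = dict(results)
    if "latency" in indicators and indicators["latency"] <= baselines["latency"]:
        corrected["latency"] = "normal"
    if "success_rate" in indicators and indicators["success_rate"] >= baselines["success_rate"]:
        corrected["success_rate"] = "normal"
    return corrected


def correct_b2(result, value, normal_history):
    """Embodiment B2: reset to normal when the value lies in the historical range."""
    # Assumed definition of the normal range: [min, max] of historical normal data.
    low, high = min(normal_history), max(normal_history)
    return "normal" if low <= value <= high else result


results = {"latency": "abnormal", "success_rate": "abnormal"}
indicators = {"latency": 120, "success_rate": 0.90}
baselines = {"latency": 200, "success_rate": 0.99}
print(correct_b1(results, indicators, baselines))
# {'latency': 'normal', 'success_rate': 'abnormal'}
print(correct_b2("abnormal", 150, [100, 180, 140]))  # normal
```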
The following describes the data processing method provided by the embodiments of the present application, taking as an example the case where the node device is an instance device (also called an instance or a monitored device) and the control device is a monitoring device.
To make the following embodiments easier to understand, some technical terms are briefly explained.
Dump: used to save relevant environment information and generate a dump file. For example, it can be used to dump information such as memory, threads, and processes.
Isolation forest algorithm: an algorithm used for outlier detection.
Golden indicators: indicators that affect system reliability. For example, the golden indicators may include success rate, latency, transaction volume, and so on.
Sklearn (scikit-learn): a Python library for machine learning.
Graphviz: a tool for drawing graphs.
In the related art, Zabbix and open-falcon are generally used to monitor a system. Specifically, an agent process is deployed on each monitored device (instance); the monitored device collects alarm information through the agent and reports it to a proxy through Zabbix and open-falcon; the proxy then reports the alarm information to the monitoring device (server) for aggregation, after which the alarm information can be displayed on the monitoring device at instance granularity.
For the alarm information of each monitored device, the monitoring device checks the indicators with a single algorithm (for example, standard deviation, decision tree, the Autoregressive Integrated Moving Average (ARIMA) model, or a Long Short-Term Memory (LSTM) network) or with several algorithms combined (for example, ensemble learning that forms a voter and follows the majority rule), so as to determine the faulty device.
In practice, however, there are three problems. First, a system generally includes multiple subsystems. The monitoring device does not know the logical relationships among the subsystems and cannot aggregate or count the alarm information of each subsystem; it merely lists every alarm one by one. This easily produces an alarm storm and affects the judgment of operations engineers.
Second, even when outliers are checked with a single algorithm or several combined algorithms, misjudgment is likely without further correction, which wastes a great deal of effort.
Third, for a device determined to be faulty, only a notification is issued; simple faults, such as a single-instance fault, are not handled automatically, which affects system availability.
The detection method provided by the embodiments of the present application can overcome the first to third problems above. As shown in FIG. 4, it may include, but is not limited to, stage 1 to stage 3 below.
Stage 1. Collect alarm information and draw an alarm link diagram.
Stage 2. For an abnormal subsystem, determine the first instance in the abnormal subsystem, perform anomaly detection, and determine whether the first instance is faulty.
The first instance, which may also be called a single-instance-fault instance, corresponds to the node device to be processed described above. Because the alarm information alone cannot finally confirm whether the instance is really faulty, a further judgment is needed. The present application can perform a second check based on the isolation forest judger, human experience, historical data, and the like, which increases the accuracy of fault identification.
Stage 3. Send the detection result of the first instance to the corresponding executor, which performs the corresponding operations on the first instance.
The corresponding operations may include saving state, isolating traffic, restarting, and so on.
Overall, on the one hand, the collected alarm information of the instances is converged into the alarm information of the subsystems, which is then displayed; on the other hand, anomaly detection is performed on the first instance using the collected alarm information and indicator information of the instances, and if the first instance is determined to be a faulty instance, the executor is called to perform the corresponding operations on it.
下面,对阶段1收集告警信息,绘制告警链路图的过程进行详细说明。In the following, the process of collecting alarm information and drawing an alarm link diagram in phase 1 will be described in detail.
当一个子系统出现问题时,非常容易出现告警风暴,导致监控系统被刷屏,这样,很难找到故障设备。When there is a problem with a subsystem, it is very easy to have an alarm storm, causing the monitoring system to be swiped, so that it is difficult to find the faulty device.
所以,本申请先对告警信息进行收敛和汇总,并对收敛后的告警信息,采 用python3+graphviz的方式绘制告警链路图。这样,可以清楚的看出哪个子系统出现了告警。Therefore, this application first converges and summarizes the alarm information, and uses python3+graphviz to draw the alarm link diagram for the converged alarm information. In this way, it can be clearly seen which subsystem has an alarm.
阶段1具体可以包括但不限于下述步骤A1至步骤A5。 Stage 1 may specifically include but not limited to the following steps A1 to A5.
步骤A1、收集各实例的告警信息。Step A1, collecting alarm information of each instance.
示例性的,收集各实例在最近两个小时一直存在的告警信息。Exemplarily, the alarm information of each instance that has existed in the last two hours is collected.
告警的等级可以包括:通知(Info)告警、警告(Warning)告警、次要(Minor)告警、主要(Major)告警和紧急(Critical)告警。其中,Info告警和Warning告警的告警级别较低,所以,本申请对Info告警和Warning告警不做处理,只针对Minor告警、Major告警和Critical告警进行处理。Alarm levels may include: notification (Info) alarm, warning (Warning) alarm, minor (Minor) alarm, major (Major) alarm, and critical (Critical) alarm. Among them, the alarm levels of Info alarms and Warning alarms are relatively low, so this application does not process Info alarms and Warning alarms, and only processes Minor alarms, Major alarms, and Critical alarms.
具体的,监控设备对告警日志中的信息进行切割,提取各信息中的关键字(告警类型),并将各告警类型按照日志中出现的数量从大到小进行排序,确定出排名前三的告警类型,获取这三个告警类型的告警级别。Specifically, the monitoring device cuts the information in the alarm log, extracts keywords (alarm types) in each information, and sorts each alarm type according to the number of occurrences in the log from large to small, and determines the top three Alarm type, to obtain the alarm levels of these three alarm types.
需要说明的是,切割时的切割规则需要与具体的日志规范对应。例如:一个日志规范以空格为间隔对各信息(例如包括告警设备、告警时间、告警级别、告警类型)进行区分,则切割规则为以空格为界限进行切割,得到告警设备、告警时间、告警级别以及告警类型对应的信息。It should be noted that the cutting rules during cutting need to correspond to specific log specifications. For example: a log specification distinguishes various information (for example, including alarm device, alarm time, alarm level, and alarm type) with spaces as intervals, and the cutting rule is to cut with spaces as the boundary to obtain alarm devices, alarm time, and alarm levels and information corresponding to the alarm type.
以实例为维度,创建一个实例的字典A,并初始化次数。Take the instance as the dimension, create a dictionary A of the instance, and initialize the number of times.
例如,字典A:{‘minor’:0;‘major’:0;‘critical’:0}。For example, dictionary A: {'minor': 0; 'major': 0; 'critical': 0}.
示例1:实例1(ip为8.8.8.8的容器)的告警统计:{‘minor’:0,‘major’:1,‘critical’:0}。对应的,关键字:(‘JDBC异常’:1);其中,JDBC异常表示数据库异常,1表示异常数量为1。Example 1: Alarm statistics of instance 1 (container with IP 8.8.8.8): {'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, keyword: ('JDBC exception': 1); wherein, JDBC exception indicates database exception, and 1 indicates that the number of exceptions is 1.
实例2(ip为7.7.7.7的虚拟机)的告警统计:{‘minor’:0,‘major’:1,‘critical’:0}。对应的,关键字:(‘JDBC异常’:1)。Alarm statistics of instance 2 (virtual machine with ip 7.7.7.7): {'minor': 0, 'major': 1, 'critical': 0}. Correspondingly, keyword: ('JDBC exception': 1).
Step A2: The alarm information of each instance is counted using a prescribed counting method to obtain the alarm information of each subsystem.
Taking the subsystem as the dimension, a global dictionary B is formed; all alarms of all instances in the subsystem are pulled and traversed in a loop. For example, if instance 1 of subsystem A raises a major alarm, the alarm is matched to subsystem A by its field, the major entry in the second-level dictionary of instance 1 is incremented by 1, and the "total" entry of the second-level dictionary is incremented by 1 at the same time. Traversing the alarms of all instances in subsystem A yields the alarms of subsystem A.
Example 2, based on Example 1: subsystem_A: {total: 2; instance 1: {'minor': 0, 'major': 1, 'critical': 0}; instance 2: {'minor': 0, 'major': 1, 'critical': 0}}. The corresponding keyword is 'JDBC exception'.
Example 3, based on Example 1: subsystem_A: {total: 2; instance 1: 1; instance 2: 1}. The corresponding keyword is 'JDBC exception'.
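The counting in steps A1 and A2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation; the alarm record keys ('subsystem', 'instance', 'level', 'keyword') and the function names are assumptions introduced for the example.

```python
from collections import Counter

LEVELS = ("minor", "major", "critical")


def new_instance_dict():
    """Dictionary A: one counter per alarm level, initialised to zero."""
    return {level: 0 for level in LEVELS}


def aggregate(alarms):
    """Build dictionary B: subsystem -> totals, per-instance counts, keywords.

    Each alarm is a dict with the hypothetical keys
    'subsystem', 'instance', 'level' and 'keyword'.
    """
    result = {}
    for alarm in alarms:
        sub = result.setdefault(
            alarm["subsystem"],
            {"total": 0, "instances": {}, "keywords": Counter()},
        )
        inst = sub["instances"].setdefault(alarm["instance"], new_instance_dict())
        inst[alarm["level"]] += 1                 # per-instance level count
        sub["total"] += 1                         # subsystem-wide total
        sub["keywords"][alarm["keyword"]] += 1    # keyword statistics
    return result


alarms = [
    {"subsystem": "subsystem_A", "instance": "8.8.8.8",
     "level": "major", "keyword": "JDBC exception"},
    {"subsystem": "subsystem_A", "instance": "7.7.7.7",
     "level": "major", "keyword": "JDBC exception"},
]
dictionary_b = aggregate(alarms)
```

With the two alarms of Example 1, dictionary_b["subsystem_A"]["total"] is 2 and each instance holds one major alarm, matching Example 2.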
Step A3: Obtain the flow chart of the transaction.
The flow chart of each transaction needs to be written according to the logical relationships among the subsystems invoked during the transaction.
As shown in FIG. 5, the flow of the transaction includes: a request passes from the layer-4 Linux Virtual Server (LVS) subsystem, which serves as the load-balancing subsystem, to the layer-7 proxy service (Nginx) subsystem, and then to the Backend For Frontend (BFF) subsystem at the internal Application Programming Interface (API) gateway layer; the BFF subsystem forwards the request to the UM interface subsystem, and the UM interface subsystem sends the request to the ACL subsystem or the whitelist subsystem.
It can be understood that when one subsystem fails, a large number of subsystems may generate alarms, resulting in an alarm storm. Therefore, the cause of the fault needs to be located quickly from the flow chart.
Exemplarily, the flow chart may be drawn using Graphviz.
Step A4: Display the alarm information of each subsystem in the flow chart to obtain the alarm link diagram.
According to the subsystem names, the alarm information of each subsystem is displayed in the flow chart to obtain the alarm link diagram.
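One way to annotate the flow chart with alarm information is to generate Graphviz DOT source as text and render it separately. The sketch below assumes the subsystem names of FIG. 5 and a simple mapping from subsystem name to alarm total; the function name and the label layout are illustrative assumptions.

```python
def build_alarm_dot(edges, subsystem_alarms):
    """Return Graphviz DOT source for a flow chart annotated with alarm totals."""
    nodes = []
    for src, dst in edges:
        for name in (src, dst):
            if name not in nodes:
                nodes.append(name)
    lines = ["digraph alarm_link {", "  rankdir=LR;"]
    for name in nodes:
        total = subsystem_alarms.get(name, 0)
        if total:
            # "\n" inside a DOT label is interpreted by Graphviz as a line break.
            lines.append(f'  "{name}" [label="{name}\\nalarms: {total}", color=red];')
        else:
            lines.append(f'  "{name}" [label="{name}"];')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Call chain taken from the description of FIG. 5.
edges = [("LVS", "Nginx"), ("Nginx", "BFF"), ("BFF", "UM"),
         ("UM", "ACL"), ("UM", "whitelist")]
dot_source = build_alarm_dot(edges, {"UM": 2})
```

The resulting text can be passed to the Graphviz dot tool to produce the picture.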
Where there are multiple key links, and correspondingly multiple flow charts, the flow charts of the multiple links may also be processed together as follows:
One core subsystem may appear in the flow charts of multiple links.
Each key link has one or more core subsystems. When an upstream or downstream system has a problem, the core subsystem is also affected and generates an alarm. For example, in FIG. 5, UM is a core subsystem.
A key link may be a link defined along the dimension of a major function, such as a login link, an authorization link, or an SMS sending link.
The processing specifically includes: Sub-step 1: all core subsystems are used as the keys of a dictionary; when all core subsystems of a link have alarms, the link is selected and sub-step 2 is entered.
Sub-step 2: After sub-step 1, one or more links may have been obtained. The non-core subsystems appearing in these links are matched against the alarms; when a non-core subsystem has an alarm, the link is selected, and if the non-core subsystem appears in multiple links, multiple links are matched. The alarm summary result of each subsystem obtained by summarizing the alarms into a dictionary (equivalent to the alarm information of the abnormal subsystem) is written after the corresponding subsystem, and the alarm link diagram shown in FIG. 6 is then generated.
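The two sub-steps above can be sketched as a pair of set filters. The link names, the 'core' and 'non_core' keys, and the SMS-gateway subsystem are hypothetical and only serve the illustration.

```python
def select_links(links, alarming):
    """Return the names of the links selected by sub-steps 1 and 2.

    links:    link name -> {"core": set of subsystems, "non_core": set of subsystems}
    alarming: set of subsystems that currently have alarms
    """
    # Sub-step 1: keep links whose core subsystems all have alarms.
    candidates = {
        name: link for name, link in links.items() if link["core"] <= alarming
    }
    # Sub-step 2: of those, keep links with at least one alarming non-core subsystem.
    return sorted(
        name for name, link in candidates.items() if link["non_core"] & alarming
    )


links = {
    "login": {"core": {"UM"},
              "non_core": {"LVS", "Nginx", "BFF", "ACL", "whitelist"}},
    "sms": {"core": {"SMS-gateway"}, "non_core": {"Nginx"}},
}
selected = select_links(links, {"UM", "ACL"})
```

Here only the login link is selected, because its core subsystem UM and its non-core subsystem ACL both have alarms.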
Step A5: Send the alarm link diagram to the alarm group.
The picture is generated with Graphviz and read with Python; it is then converted into base64 format (a representation of binary data using 64 printable characters), and the message interface of the robot is called to send the alarm to the alarm group.
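The base64 conversion can be done with the Python standard library. In the sketch below the payload layout is a hypothetical stand-in, since the text does not describe the robot's message interface; only the encoding step reflects the description.

```python
import base64


def encode_image(image_bytes):
    """Encode raw image bytes as a base64 ASCII string."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_message(image_bytes):
    """Wrap the encoded picture in a hypothetical robot message payload."""
    return {"msgtype": "image", "image": {"base64": encode_image(image_bytes)}}


# Stand-in for the bytes read from the rendered picture, e.g. via
# open("alarm_link.png", "rb").read(); only the PNG signature is used here.
png_bytes = b"\x89PNG\r\n\x1a\n"
message = build_message(png_bytes)
```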
Next, stage 2 is described, in which anomaly detection is performed on the first instance to determine whether the first instance is faulty. Stage 2 may specifically include, but is not limited to, the following steps B1 to B4.
Step B1: Determine the first instance.
The failure score of each instance is calculated, and an instance that satisfies the preset condition is determined as the first instance.
For example, where a minor-level alarm is worth 1 point, a major-level alarm 2 points, and a critical-level alarm 4 points, the score of an instance = number of minor alarms × 1 + number of major alarms × 2 + number of critical alarms × 4.
The preset condition is: if the score of one instance in a system is greater than or equal to 4 points and the scores of the other instances are less than or equal to 1 point, the instance whose score is greater than or equal to 4 points is determined as the first instance.
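The scoring rule and the preset condition of step B1 can be sketched as follows. The weights and thresholds come from the text; the function names and the instance names are illustrative.

```python
WEIGHTS = {"minor": 1, "major": 2, "critical": 4}


def instance_score(counts):
    """Weighted alarm score of one instance."""
    return sum(WEIGHTS[level] * counts.get(level, 0) for level in WEIGHTS)


def find_first_instance(instances, high=4, low=1):
    """Return the instance scoring >= high while all others score <= low, else None."""
    scores = {name: instance_score(counts) for name, counts in instances.items()}
    for name, score in scores.items():
        others = [s for other, s in scores.items() if other != name]
        if score >= high and all(s <= low for s in others):
            return name
    return None


instances = {
    "instance_1": {"minor": 0, "major": 0, "critical": 1},  # score 4
    "instance_2": {"minor": 1, "major": 0, "critical": 0},  # score 1
}
first_instance = find_first_instance(instances)
```

If every instance of the subsystem scores similarly (for example, one major alarm each), no instance stands out and the function returns None.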
Step B2: Determine whether the first instance is abnormal based on the isolation forest judger.
First, the golden indicators of the first instance are obtained and classified, yielding the transaction volume, delay, success rate, and so on of all instances.
For each indicator, a corresponding isolation forest judger model is trained.
The premise of the isolation forest judger is that anomalies are always in the minority.
The principle of the isolation forest judger is to find the few anomalous points mathematically.
Next, taking one indicator (delay) as an example, the process of obtaining the isolation forest judger is described.
Input data type of the isolation forest judger: continuous business indicator data.
Training method: unsupervised learning; a result can be obtained simply by feeding in the data. However, to tune the model parameters, some labeled data is needed for validation.
Definition of an anomaly: a point that is easy to isolate, that is, a point in a sparsely populated region far from high-density clusters.
Implementation: Python 3 with the sklearn library.
The specific training process may include:
Training data (one month of delay indicator data) is obtained; 80% of the data is selected as the training set and 20% as the test set, and each data point is labeled as normal or abnormal. For example, a delay of 0.3 ms is normal and a delay of 2 ms is abnormal.
An isolation forest judger model (model) is created with sklearn's IsolationForest class via model = IsolationForest(n_estimators, max_samples); 80% of the data, X, is used as the training set and passed to the training function model.fit(X), giving a preliminarily trained isolation forest judger.
After the model has been trained on the training set, it is evaluated on the labeled test set (20% of the data) to obtain the model score of the preliminarily trained isolation forest judger.
For example, the model score of the isolation forest judger can be calculated by the following formula (2).
Formula (2): Figure PCTCN2022100708-appb-000001
In formula (2), the first number is the number of results that are judged abnormal and are correct; the second number is the number of results that are judged normal and are correct; the normal score and the abnormal score are both 50.
The order of the training data is shuffled and, while keeping the numbers of normal and abnormal samples the same across runs (for example, 10,000 normal and 10 abnormal each time), the parameters n_estimators (the number of subtrees) and max_depth (the maximum growth depth of a tree) are varied with all other parameters left at their defaults. This yields multiple isolation forest judger models and, finally, an array of scores.
These two parameters are chosen because they have the greatest influence on the model. n_estimators is selected from the range 3 to 23, and max_depth from the range 5 to 25. First max_depth is fixed at 5 and n_estimators is traversed; then n_estimators is fixed at 3 and max_depth is traversed.
Finally, the model with the highest score is selected as the model to be used.
The training data is a two-dimensional array; for example, the success rate data is [[0.88], [0.90], [0.88], [0.99]]. The result labels of the data form a one-dimensional array that corresponds to the data in order, [-1, 1, 1, -1], where -1 means abnormal and 1 means normal. For example, the first 0.88 corresponds to -1 and is therefore abnormal data.
In short, an isolation forest judger model (model) is created with sklearn's IsolationForest class via model = IsolationForest(n_estimators, max_samples); 80% of the data, X, is passed to the training function model.fit(X) for training; the remaining 20% of the data, Y, is passed to the prediction function model.predict(Y); the predictions are compared with the labeled results; and the pair of parameter values with the highest model score is selected and saved.
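The training and selection loop can be sketched with scikit-learn as below. Several points are assumptions: the data is synthetic; the scoring function is one plausible reading of the description (50 points weighted by the share of correctly judged abnormal points plus 50 points weighted by the share of correctly judged normal points) and is not a reproduction of formula (2); and because scikit-learn's IsolationForest exposes max_samples rather than a max_depth argument, the sketch sweeps n_estimators and max_samples over a small grid.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def model_score(y_true, y_pred, abnormal_score=50, normal_score=50):
    """Assumed scoring: weighted share of correct abnormal and normal verdicts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    abnormal, normal = y_true == -1, y_true == 1
    first = np.sum(abnormal & (y_pred == -1))   # judged abnormal and correct
    second = np.sum(normal & (y_pred == 1))     # judged normal and correct
    score = 0.0
    if abnormal.any():
        score += abnormal_score * first / abnormal.sum()
    if normal.any():
        score += normal_score * second / normal.sum()
    return float(score)


# Synthetic delay data (ms): many normal points near 0.3, a few abnormal near 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.3, 0.05, size=(1000, 1)),
               rng.normal(2.0, 0.1, size=(10, 1))])
y = np.array([1] * 1000 + [-1] * 10)
order = rng.permutation(len(X))                 # shuffle the data
X, y = X[order], y[order]
split = int(0.8 * len(X))                       # 80% train, 20% test
X_train, X_test, y_test = X[:split], X[split:], y[split:]

best = None
for n_estimators in range(3, 24, 5):
    for max_samples in (64, 128, 256):
        model = IsolationForest(n_estimators=n_estimators,
                                max_samples=max_samples, random_state=0)
        model.fit(X_train)                      # unsupervised: labels not used
        score = model_score(y_test, model.predict(X_test))
        if best is None or score > best[0]:
            best = (score, n_estimators, max_samples)

best_score, best_n_estimators, best_max_samples = best
```

The labels are used only for scoring the candidates, which matches the description of unsupervised training with labeled data reserved for validation.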
Anomaly check of the first instance at the current time point:
1. The n_estimators and max_samples parameters obtained in the training phase are saved and passed to IsolationForest to build the isolation forest judger model.
2. The data of the last two days is arranged into a two-dimensional array with the structure described above and passed to the model's training function model.fit to train the corresponding isolation forest judger model.
3. The data of the first instance at the current time point is likewise arranged into a two-dimensional array and passed to the prediction function of the trained model (model.predict); running the isolation forest judger model gives a judgment result.
4. A judgment result of -1 indicates abnormal; a judgment result of 1 indicates normal.
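The four steps above can be sketched as a single helper. The default parameter values and the sample data are placeholders rather than values from the text; in practice the saved parameters from the training phase would be passed in.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def check_current_point(recent_values, current_value,
                        n_estimators=13, max_samples=128):
    """Fit on recent data and return -1 (abnormal) or 1 (normal) for the current value."""
    # Step 2: arrange the recent data as a two-dimensional array, one value per row.
    X_recent = np.asarray(recent_values, dtype=float).reshape(-1, 1)
    # Step 1: build the model from the saved parameters.
    model = IsolationForest(
        n_estimators=n_estimators,
        max_samples=min(max_samples, len(X_recent)),
        random_state=0,
    )
    model.fit(X_recent)
    # Steps 3 and 4: the current point is also passed as a two-dimensional array.
    return int(model.predict([[float(current_value)]])[0])


recent_delays = [0.28 + 0.001 * i for i in range(50)]  # placeholder for two days of data
verdict = check_current_point(recent_delays, 2.0)
```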
Step B3: Determine whether the first instance is faulty based on empirical information.
Process B3 may also be called the anomaly correction process, in which the judgment result of the isolation forest judger is corrected.
Reason for the correction: when most values are abnormal and only a small part are normal, the normal instances are judged abnormal, so a second correction of the result is required.
Problem solved: when most indicator values are abnormal and a few are normal, misjudgment occurs; normal indicators are judged abnormal, which in turn leads to erroneous operations.
The specific correction process may include:
For the delay indicator, the lower the delay, the lower the likelihood of an anomaly. Therefore, when the isolation forest judges indicator A to be abnormal but its delay is lower than the average of the same indicator over the other instances, the result is modified and indicator A is judged normal.
For the success rate, the higher the success rate, the lower the likelihood of an anomaly. If the isolation forest judges indicator B of an instance to be abnormal but its success rate is higher than the average success rate of the other instances, the result is modified and the instance is judged to be a normal instance.
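The correction rules for delay and success rate can be sketched as follows. The metric names 'delay' and 'success_rate' and the function name are illustrative; the verdict encoding (-1 abnormal, 1 normal) follows the text.

```python
from statistics import mean


def correct_verdict(metric, value, peer_values, verdict):
    """Flip an abnormal verdict (-1) to normal (1) when the value beats the peer average."""
    if verdict != -1 or not peer_values:
        return verdict
    peer_avg = mean(peer_values)
    if metric == "delay" and value < peer_avg:
        return 1    # lower delay than the other instances: treat as normal
    if metric == "success_rate" and value > peer_avg:
        return 1    # higher success rate than the other instances: treat as normal
    return verdict


# A fast instance among slow peers was isolated by the forest; correct it.
corrected = correct_verdict("delay", 0.2, [1.8, 2.1], -1)
```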
Step B4: Determine whether the first instance is faulty based on historical data.
Process B4 may also be called the historical data verification process, in which the corrected result of B3 is verified.
Purpose of the verification: to reduce useless operations and alarms.
Suppose indicator C of instance A is abnormal. Verification begins by selecting the historical data of the most recent week and checking whether it contains abnormal values.
First judgment: if the history already contains abnormal data, determine whether indicator C is more severe than the historical abnormal values. For example, if indicator C is a success rate of 80% and 95% was already an abnormal point in the history, then 80% must be an abnormal point.
Second judgment: remove the 5 highest points and the 5 lowest points, then compute the maximum and minimum of the remaining points; if the abnormal point lies between the minimum and the maximum, the result is changed from abnormal to normal.
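The two judgments can be sketched as below. The comparison against the mildest historical abnormal value in the first judgment, and the fallback of leaving short histories untrimmed, are assumptions made for the illustration; the text only gives the 80% versus 95% example and the removal of 5 points at each end.

```python
def historical_check(value, history, historical_abnormal=(),
                     higher_is_better=True, trim=5):
    """Return -1 (abnormal) or 1 (normal) for a value already judged abnormal."""
    # First judgment: at least as severe as a known historical anomaly.
    if historical_abnormal:
        if higher_is_better and value <= max(historical_abnormal):
            return -1
        if not higher_is_better and value >= min(historical_abnormal):
            return -1
    # Second judgment: drop the extremes, then test against the remaining range.
    ordered = sorted(history)
    if trim > 0 and len(ordered) > 2 * trim:
        ordered = ordered[trim:-trim]
    if ordered and ordered[0] <= value <= ordered[-1]:
        return 1
    return -1


# One week of success rates between 0.960 and 0.989 (illustrative data).
history = [0.960 + 0.001 * i for i in range(30)]
verified = historical_check(0.80, history, historical_abnormal=[0.95])
```

For a delay-type indicator, higher_is_better=False reverses the direction of the first comparison.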
Overall, as shown in FIG. 7, anomaly detection is divided into three steps. The first step uses the isolation forest judger to make a preliminary judgment; the second step corrects that judgment according to existing empirical rules; and the third step reconfirms the corrected judgment against historical data.
Next, stage 3 is described, in which the detection result of the first instance is sent to the corresponding executor and the executor performs the corresponding operation on the first instance.
When the detection result of the first instance is abnormal, the detection result is sent to the monitored device to which the first instance belongs, and the corresponding operation is performed by the executor deployed on the monitored device.
The operations include: dumping memory, threads, and processes; restarting; isolation; automatic scaling; and so on.
An operation rule table is set: for a fault caused by an abnormal success rate, dump and isolation are performed; for a fault caused by an abnormal delay, dump and isolation are performed; for a fault caused by an abnormal transaction volume, an alarm is issued.
It should be noted that, when isolating an instance, it must be ensured that 2/3 of the instances in the subsystem remain available.
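The rule table and the availability constraint can be sketched as a lookup plus a guard. The operation names are illustrative, and the guard interprets the constraint as "at least 2/3 of the instances remain available after the isolation", which is an assumption about the exact threshold.

```python
OPERATION_RULES = {
    "success_rate": ["dump", "isolate"],
    "delay": ["dump", "isolate"],
    "transaction_volume": ["alert"],
}


def can_isolate(total_instances, available_instances):
    """True if isolating one more instance leaves at least 2/3 of instances available."""
    return 3 * (available_instances - 1) >= 2 * total_instances


def plan_operations(metric, total_instances, available_instances):
    """Look up the operations for an abnormal metric, dropping isolation if unsafe."""
    operations = list(OPERATION_RULES.get(metric, []))
    if "isolate" in operations and not can_isolate(total_instances, available_instances):
        operations.remove("isolate")
    return operations


plan = plan_operations("delay", total_instances=3, available_instances=3)
```

With three healthy instances one may be isolated; once only two are available, a further isolation is skipped and only the dump is performed.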
Next, the processing of the executor is described.
An agent written in Go is deployed on the monitored device (for example, a virtual machine or a container) as a client process, which reduces the logic required in the business code. For a container, as shown in FIG. 8, the sidecar approach is used: within a pod, an agent container is started, which may be called a sidecar container.
The functions of the executor are as follows:
Collect error information from the logs of the instance.
Collect business indicators from the logs of the instance.
Collect the corresponding basic information and indicator information (memory, CPU, and so on), then summarize and report it.
Receive instructions and perform specific operations on the subsystem accordingly, for example isolation, restarting a process, or dumping memory, threads, and processes.
For example, if the received instruction is isolation, the memory, threads, and processes of the instance are dumped and the instance is then isolated.
For example, if the received instruction indicates excessive memory usage, the memory, threads, and processes of the instance are dumped and the instance is then restarted.
When indicator information is collected, the indicators are processed in specific ways. For example, when a large number of failures occur on an instance of a system, the instance is isolated; when the business indicators of an instance are abnormal, the instance may also be isolated.
When the success rate or the delay is abnormal, the dump operation is performed directly, followed by isolation; after isolation, it is analyzed whether resources are insufficient, and automatic scaling is performed.
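The instruction handling of the executor can be sketched as a dispatch table. The text states that the agent is written in Go; Python is used here only to stay consistent with the other examples. The instruction names, step names, and callables are hypothetical.

```python
def handle_instruction(instruction, actions):
    """Run the steps planned for an instruction and return their names in order."""
    plans = {
        # Isolation: save the instance environment first, then isolate.
        "isolate": ["dump_memory", "dump_threads", "dump_processes", "isolate"],
        # Excessive memory usage: save the environment, then restart.
        "memory_too_high": ["dump_memory", "dump_threads", "dump_processes", "restart"],
    }
    executed = []
    for step in plans.get(instruction, []):
        actions[step]()          # each action is a caller-supplied callable
        executed.append(step)
    return executed


# Stand-in actions that only record their names.
log = []
step_names = ("dump_memory", "dump_threads", "dump_processes", "isolate", "restart")
actions = {name: (lambda n=name: log.append(n)) for name in step_names}
executed = handle_instruction("memory_too_high", actions)
```

Keeping the dump steps ahead of the disruptive step preserves the instance environment for later analysis.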
The data processing method provided by the embodiments of the present application has the following beneficial effects:
1. An alarm link diagram is generated by combining the flow chart with the alarm dictionary. It visually presents, for each subsystem, the number of alarms at each level, the alarm content, and the upstream and downstream relationships, and thereby helps locate the fault point quickly.
2. Abnormal instances are first identified from the alarms and then passed to anomaly detection. On the one hand, because the abnormal instances are screened out by the alarms and only then judged again by the algorithm, the data of all instances need not be fed into the algorithm, which reduces the amount of computation and saves cost. On the other hand, the anomaly correction and the historical data verification improve the accuracy of the algorithm.
3. By modifying the monitoring agent to add an instruction execution function, operations such as saving the instance environment and isolating traffic become more accurate and faster.
To implement the above data processing method, an embodiment of the present application provides a data processing apparatus, which is described below with reference to the schematic structural diagram shown in FIG. 9.
As shown in FIG. 9, the data processing apparatus 90 includes an acquisition unit 901, a determination unit 902, a processing unit 903, and a display unit 904, wherein:
the acquisition unit 901 is configured to acquire alarm information of at least two node devices included in the data processing system;
the determination unit 902 is configured to determine at least one abnormal subsystem based on the at least two node devices;
the processing unit 903 is configured to perform first processing for each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, the first processing including converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem;
the display unit 904 is configured to display, for each abnormal subsystem of the at least one abnormal subsystem, the alarm information of the abnormal subsystem correspondingly.
In some embodiments, the acquisition unit 901 is further configured to:
perform the following processing for each node device of the at least two node devices:
process the alarm log reported by the node device to obtain a keyword field representing an alarm prompt;
determine, based on the alarm prompt represented by the keyword field, M first alarm levels for the node device and M first alarm quantities in one-to-one correspondence with the M first alarm levels, where M is greater than or equal to 1;
update, based on the M first alarm levels and the M first alarm quantities, a first alarm dictionary for the node device to obtain the alarm information of the node device, where the first alarm dictionary includes N first alarm levels and N first preset alarm quantities in one-to-one correspondence with the N first alarm levels, and N is greater than or equal to M.
In some embodiments, the processing unit 903 is further configured to:
obtain the alarm information of at least one node device included in the abnormal subsystem, where the alarm information of the node device includes M first alarm levels and M first alarm quantities in one-to-one correspondence with the M first alarm levels;
determine, based on the M first alarm levels and the M first alarm quantities of each node device of the at least one node device, P second alarm levels for the abnormal subsystem and P second alarm quantities in one-to-one correspondence with the P second alarm levels, where P is greater than or equal to M;
update, based on the P second alarm levels and the P second alarm quantities, a second alarm dictionary for the abnormal subsystem to obtain the alarm information of the abnormal subsystem, where the second alarm dictionary includes Q second alarm levels and second preset alarm quantities in one-to-one correspondence with the Q second alarm levels, and Q is greater than or equal to P.
In some embodiments, the display unit 904 is further configured to:
obtain a transaction flow chart, where the at least one abnormal subsystem is shown in the transaction flow chart;
determine, in the transaction flow chart and for each abnormal subsystem of the at least one abnormal subsystem, the subsystem node to which the abnormal subsystem belongs in the transaction flow chart;
display the alarm information of the abnormal subsystem in correspondence with the subsystem node.
In some embodiments, the data processing apparatus 90 further includes an execution unit configured to perform the following processing for each abnormal subsystem of the at least one abnormal subsystem:
determine a node device in the abnormal subsystem that satisfies a first condition as a node device to be processed;
determine whether the node device to be processed is faulty;
when the node device to be processed is faulty, send a first instruction to the node device to be processed, so that the node device to be processed performs a corresponding operation as indicated by the first instruction.
In some embodiments, the execution unit is further configured to:
determine an alarm score of each node device in the abnormal subsystem;
if the alarm score of a first node device is greater than or equal to a first score threshold and the alarm score of a second node device is less than or equal to a second score threshold, determine the first node device as the node device to be processed, where the first score threshold is greater than the second score threshold, the first node device is any node device in the abnormal subsystem, and the second node device is a node device in the abnormal subsystem other than the first node device.
In some embodiments, the execution unit is further configured to:
obtain at least one indicator of the node device to be processed;
for each indicator of the at least one indicator, calculate a first distance between the indicator and a reference value of the indicator to obtain a judgment result of the indicator, where the judgment result is determined to be abnormal if the first distance is greater than or equal to a first distance threshold, and normal if the first distance is less than the first distance threshold;
if the judgment result of every indicator of the at least one indicator is normal, determine that the node device to be processed is not faulty; otherwise, determine that the node device to be processed is faulty.
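The distance-based judgment can be sketched as follows. The text does not define how the first distance is computed; the absolute difference used here is an assumption, as are the function names and the sample values.

```python
def judge_indicator(value, reference, distance_threshold):
    """Return 'abnormal' if the (assumed absolute) distance reaches the threshold."""
    distance = abs(value - reference)
    return "abnormal" if distance >= distance_threshold else "normal"


def is_faulty(indicators):
    """indicators: name -> (value, reference, distance_threshold)."""
    return any(
        judge_indicator(value, reference, threshold) == "abnormal"
        for value, reference, threshold in indicators.values()
    )


faulty = is_faulty({
    "delay_ms": (2.0, 0.3, 1.0),         # distance 1.7 reaches the threshold
    "success_rate": (0.98, 0.99, 0.05),  # distance 0.01 is below the threshold
})
```

A single abnormal indicator is enough to mark the node device as faulty, which mirrors the "otherwise" branch above.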
In some embodiments, when the judgment result is abnormal, the execution unit is further configured to:
when the indicator includes a delay, modify the judgment result for the delay to normal if the delay of the node device to be processed is less than or equal to a delay reference value;
when the indicator includes a success rate, modify the judgment result for the success rate to normal if the success rate of the node device to be processed is greater than or equal to a success rate reference value.
In some embodiments, if the judgment result of the indicator is abnormal, the execution unit is further configured to:
determine a normal range of the indicator based on historical normal indicator data;
determine whether the indicator of the node device to be processed falls within the normal range;
if the indicator of the node device to be processed falls within the normal range, modify the judgment result of the indicator to normal.
It should be noted that the units included in the data processing apparatus provided by the embodiments of the present application may be implemented by a processor in an electronic device, or by a specific logic circuit. In implementation, the processor may be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or the like.
The description of the above apparatus embodiments is similar to the description of the above method embodiments, and the apparatus embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be noted that, in the embodiments of the present application, if the above data processing method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
为实现上述数据处理方法,本申请实施例提供一种电子设备,包括存储器和处理器,所述存储器存储有可在处理器上运行的计算机程序,所述处理器执行所述程序时实现上述实施例中提供的数据处理方法中的步骤。In order to implement the above data processing method, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the above implementation when executing the program The steps in the data processing method provided in the example.
下面结合图10所示的电子设备100,对电子设备的结构图进行说明。The structural diagram of the electronic device will be described below with reference to the electronic device 100 shown in FIG. 10 .
In an example, the electronic device 100 may be the electronic device described above. As shown in FIG. 10, the electronic device 100 includes a processor 1001, at least one communication bus 1002, a user interface 1003, at least one external communication interface 1004, and a memory 1005. The communication bus 1002 is configured to provide connection and communication between these components. The user interface 1003 may include a display screen, and the external communication interface 1004 may include standard wired and wireless interfaces.
The memory 1005 is configured to store instructions and applications executable by the processor 1001. It may also cache data that is to be processed, or has already been processed, by the processor 1001 and by the modules of the electronic device (for example, image data, audio data, voice communication data, and video communication data). The memory 1005 may be implemented as flash memory (FLASH) or random access memory (RAM).
In a fourth aspect, an embodiment of the present application provides a storage medium, namely a computer-readable storage medium, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the data processing method provided in the above embodiments.
It should be pointed out that the above descriptions of the storage medium and device embodiments are similar to the descriptions of the method embodiments and have similar beneficial effects. For technical details not disclosed in the storage medium and device embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to that embodiment is included in at least one embodiment of the present application. Accordingly, occurrences of "in one embodiment" or "in some embodiments" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way. The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
It should be noted that, in this document, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the direction of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated units of the present application are implemented as software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the related art, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The above are only implementations of the present application, and the protection scope of the present application is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be determined by the protection scope of the claims.

Claims (11)

  1. A data processing method, applied to a control device in a data processing system, the data processing system further comprising node devices, the method comprising:
    acquiring alarm information of at least two node devices included in the data processing system;
    determining at least one abnormal subsystem based on the at least two node devices;
    performing first processing on each abnormal subsystem of the at least one abnormal subsystem to obtain alarm information of the at least one abnormal subsystem, the first processing comprising: converging the alarm information of at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem; and
    for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem.
  2. The method according to claim 1, wherein acquiring the alarm information of the at least two node devices included in the data processing system comprises:
    performing the following processing for each node device of the at least two node devices:
    processing an alarm log reported by the node device to obtain a keyword field representing an alarm prompt;
    determining, based on the alarm prompt represented by the keyword field, M first alarm levels for the node device and M first alarm quantities in one-to-one correspondence with the M first alarm levels, M being greater than or equal to 1; and
    updating a first alarm dictionary for the node device based on the M first alarm levels and the M first alarm quantities to obtain the alarm information of the node device, the first alarm dictionary comprising N first alarm levels and N first preset alarm quantities in one-to-one correspondence with the N first alarm levels, N being greater than or equal to M.
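
The per-node processing of claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the log line format, the regular expression, the level names in `ALARM_LEVELS`, and the preset quantity of 0 are all hypothetical, since the claim does not fix any of them.

```python
import re
from collections import Counter

# Hypothetical N first alarm levels; the claim only requires N >= M.
ALARM_LEVELS = ("critical", "major", "minor", "warning")

# Hypothetical log format: "<timestamp> ALARM level=<level> msg=<text>".
KEYWORD_PATTERN = re.compile(r"ALARM level=(\w+)")


def extract_keyword_fields(alarm_log: str) -> list[str]:
    """Return the alarm-level keyword of every alarm line found in the log."""
    return [match.group(1).lower() for match in KEYWORD_PATTERN.finditer(alarm_log)]


def update_first_alarm_dictionary(alarm_log: str) -> dict[str, int]:
    """Build the node's alarm information from its reported alarm log."""
    # First alarm dictionary: N levels, each with a preset quantity (assumed 0).
    dictionary = {level: 0 for level in ALARM_LEVELS}
    # M first alarm levels with their M first alarm quantities.
    counts = Counter(extract_keyword_fields(alarm_log))
    for level, quantity in counts.items():
        if level in dictionary:  # levels outside the dictionary are ignored here
            dictionary[level] += quantity
    return dictionary
```

In this sketch the returned dictionary plays the role of the node device's alarm information: every known level is present, and only the levels seen in the log have non-zero quantities.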
  3. The method according to claim 1, wherein converging the alarm information of the at least one node device included in the abnormal subsystem into the alarm information of the abnormal subsystem comprises:
    obtaining the alarm information of the at least one node device included in the abnormal subsystem, the alarm information of each node device comprising M first alarm levels and M first alarm quantities in one-to-one correspondence with the M first alarm levels;
    determining, based on the M first alarm levels and the M first alarm quantities of each node device of the at least one node device, P second alarm levels for the abnormal subsystem and P second alarm quantities in one-to-one correspondence with the P second alarm levels, P being greater than or equal to M; and
    updating a second alarm dictionary for the abnormal subsystem based on the P second alarm levels and the P second alarm quantities to obtain the alarm information of the abnormal subsystem, the second alarm dictionary comprising Q second alarm levels and Q second preset alarm quantities in one-to-one correspondence with the Q second alarm levels, Q being greater than or equal to P.
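
The convergence step of claim 3 can be sketched as an aggregation over the nodes of one abnormal subsystem. The sketch assumes that the second alarm levels reuse the first alarm level names and that a second alarm quantity is the sum of the matching first alarm quantities; the claim itself does not state the aggregation rule, and `SUBSYSTEM_LEVELS` and `converge_alarms` are hypothetical names.

```python
from collections import Counter

# Hypothetical Q second alarm levels of the subsystem-level dictionary.
SUBSYSTEM_LEVELS = ("critical", "major", "minor", "warning")


def converge_alarms(node_alarms: dict[str, dict[str, int]]) -> dict[str, int]:
    """Converge per-node alarm information into one subsystem-level dictionary.

    node_alarms maps a node id to {first alarm level: first alarm quantity}.
    """
    # P second alarm levels and quantities: assumed to be per-level sums.
    totals: Counter = Counter()
    for per_node in node_alarms.values():
        totals.update(per_node)  # Counter.update adds the quantities
    # Second alarm dictionary: Q levels, each with a preset quantity (assumed 0).
    second_dictionary = {level: 0 for level in SUBSYSTEM_LEVELS}
    for level, quantity in totals.items():
        if level in second_dictionary:
            second_dictionary[level] += quantity
    return second_dictionary
```

One subsystem-level entry then replaces many node-level entries, which is what reduces the number of alarms a viewer has to read.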
  4. The method according to claim 1, wherein, for each abnormal subsystem of the at least one abnormal subsystem, correspondingly displaying the alarm information of the abnormal subsystem comprises:
    obtaining a transaction flowchart, the transaction flowchart showing the at least one abnormal subsystem;
    for each abnormal subsystem of the at least one abnormal subsystem, determining the subsystem node to which the abnormal subsystem belongs in the transaction flowchart; and
    displaying the alarm information of the abnormal subsystem at the corresponding subsystem node.
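
The display step of claim 4 can be illustrated with a text-only stand-in for the transaction flowchart. The sketch assumes the flowchart is an ordered list of subsystem node names and that an abnormal subsystem is matched to its node by name; the function name and the label format are hypothetical.

```python
def annotate_flowchart(
    flow_nodes: list[str],
    subsystem_alarms: dict[str, dict[str, int]],
) -> list[str]:
    """Attach each abnormal subsystem's alarm information to its flowchart node."""
    lines = []
    for node in flow_nodes:
        alarms = subsystem_alarms.get(node, {})
        # Show only the levels that actually have alarms.
        summary = ", ".join(f"{level}={qty}" for level, qty in alarms.items() if qty)
        lines.append(f"{node} [ALARM: {summary}]" if summary else node)
    return lines
```

A graphical front end would render the same mapping as badges on the flowchart nodes instead of text labels.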
  5. The method according to any one of claims 1 to 4, further comprising:
    performing the following processing for each abnormal subsystem of the at least one abnormal subsystem:
    determining a node device in the abnormal subsystem that satisfies a first condition as a node device to be processed;
    judging whether the node device to be processed is faulty; and
    when the node device to be processed is faulty, sending a first instruction to the node device to be processed, so that the node device to be processed performs a corresponding operation as indicated by the first instruction.
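
The control flow of claim 5 can be sketched with the three decisions passed in as callables. Everything named here is hypothetical: the claim does not say what the first instruction contains, so `"restart"` is only a placeholder, and the two predicates stand in for the checks detailed in claims 6 and 7.

```python
from typing import Callable, Iterable


def handle_abnormal_subsystem(
    nodes: Iterable[str],
    meets_first_condition: Callable[[str], bool],
    is_faulty: Callable[[str], bool],
    send_instruction: Callable[[str, str], None],
    instruction: str = "restart",  # placeholder for the "first instruction"
) -> list[str]:
    """Return the nodes that were sent the first instruction."""
    instructed = []
    for node in nodes:
        if not meets_first_condition(node):
            continue  # not a node device to be processed
        if is_faulty(node):
            send_instruction(node, instruction)
            instructed.append(node)
    return instructed
```

Keeping the predicates injectable lets the selection rule and the fault check be replaced independently of the dispatch logic.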
  6. The method according to claim 5, wherein determining the node device in the abnormal subsystem that satisfies the first condition as the node device to be processed comprises:
    determining an alarm score of each node device in the abnormal subsystem; and
    if the alarm score of a first node device is greater than or equal to a first score threshold and the alarm score of a second node device is less than or equal to a second score threshold, determining the first node device as the node device to be processed, wherein the first score threshold is greater than the second score threshold, the first node device is any node device in the abnormal subsystem, and the second node device is a node device in the abnormal subsystem other than the first node device.
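
The selection rule of claim 6 can be sketched as follows. The sketch reads "second node device" as every other node in the subsystem, which is one possible interpretation, and it takes the alarm scores as given because the claim does not say how they are computed. The function and parameter names are hypothetical.

```python
def find_pending_nodes(
    scores: dict[str, float],
    first_threshold: float,
    second_threshold: float,
) -> list[str]:
    """Return nodes whose score stands out while all other nodes stay low."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    pending = []
    for node, score in scores.items():
        others = [s for other, s in scores.items() if other != node]
        # Assumed reading: every other node must be at or below the second threshold.
        if score >= first_threshold and all(s <= second_threshold for s in others):
            pending.append(node)
    return pending
```

Under this reading the rule isolates a single outlier: when two nodes in the same subsystem both score high, neither is selected, which suggests a shared cause rather than a single bad node.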
  7. The method according to claim 5, wherein judging whether the node device to be processed is faulty comprises:
    obtaining at least one indicator of the node device to be processed;
    for each indicator of the at least one indicator, calculating a first distance between the indicator and a reference value of the indicator to obtain a judgment result for the indicator, wherein the judgment result is determined to be abnormal if the first distance is greater than or equal to a first distance threshold, and is determined to be normal if the first distance is less than the first distance threshold; and
    if the judgment result for every indicator of the at least one indicator is normal, determining that the node device to be processed is not faulty; otherwise, determining that the node device to be processed is faulty.
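
The fault judgment of claim 7 can be sketched as below. The claim does not define the distance, so the sketch assumes the absolute difference between a scalar indicator and its reference value; the indicator names and the `(value, reference, threshold)` tuple layout are hypothetical.

```python
def judge_indicator(value: float, reference: float, distance_threshold: float) -> str:
    """Judge one indicator by its distance from the reference value."""
    distance = abs(value - reference)  # assumed distance: absolute difference
    return "abnormal" if distance >= distance_threshold else "normal"


def is_node_faulty(indicators: dict[str, tuple[float, float, float]]) -> bool:
    """indicators maps a name to (value, reference value, distance threshold)."""
    results = {
        name: judge_indicator(value, reference, threshold)
        for name, (value, reference, threshold) in indicators.items()
    }
    # The node is not faulty only if every indicator is judged normal.
    return any(result == "abnormal" for result in results.values())
```

Note that a distance exactly equal to the threshold counts as abnormal, matching the "greater than or equal to" wording of the claim.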
  8. The method according to claim 7, wherein, if the judgment result for the indicator is abnormal, the method further comprises:
    when the indicator includes a latency, if the latency of the node device to be processed is less than or equal to a latency reference value, modifying the judgment result for the latency to normal; and
    when the indicator includes a success rate, if the success rate of the node device to be processed is greater than or equal to a success rate reference value, modifying the judgment result for the success rate to normal.
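
The correction in claim 8 addresses a weakness of a pure distance check: a latency far below its reference, or a success rate far above it, is distant from the reference but is not a problem. A minimal sketch, with hypothetical indicator names `"latency"` and `"success_rate"`:

```python
def recheck_against_baseline(name: str, value: float, result: str, baseline: float) -> str:
    """Flip an 'abnormal' result to 'normal' when the deviation is in the good direction."""
    if result != "abnormal":
        return result
    if name == "latency" and value <= baseline:
        return "normal"  # faster than the latency reference value
    if name == "success_rate" and value >= baseline:
        return "normal"  # at least as successful as the reference value
    return result
```

For example, a latency of 40 ms against a 100 ms reference would be re-labelled normal, while 180 ms would stay abnormal.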
  9. The method according to claim 7, wherein, if the judgment result for the indicator is abnormal, the method further comprises:
    determining a normal range of the indicator based on historical normal indicator data;
    judging whether the indicator of the node device to be processed falls within the normal range; and
    if the indicator of the node device to be processed falls within the normal range, modifying the judgment result for the indicator to normal.
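
The history-based correction of claim 9 can be sketched as below. The claim does not say how the normal range is derived from historical normal data; the sketch assumes mean plus or minus `k` population standard deviations, which is a common choice but only an assumption here, and all names are hypothetical.

```python
import statistics


def normal_range(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Assumed normal range: mean +/- k population standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean - k * spread, mean + k * spread


def recheck_with_history(value: float, result: str, history: list[float], k: float = 3.0) -> str:
    """Flip an 'abnormal' result to 'normal' when the value is historically ordinary."""
    if result != "abnormal":
        return result
    low, high = normal_range(history, k)
    return "normal" if low <= value <= high else result
```

A percentile band or a seasonal baseline could replace `normal_range` without changing the recheck logic.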
  10. An electronic device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the data processing method according to any one of claims 1 to 9.
  11. A storage medium storing a computer program which, when executed by a processor, implements the data processing method according to any one of claims 1 to 9.
PCT/CN2022/100708 2021-12-08 2022-06-23 Data processing method and apparatus, device, and storage medium WO2023103344A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111491784.9 2021-12-08
CN202111491784.9A CN114157553A (en) 2021-12-08 2021-12-08 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023103344A1 true WO2023103344A1 (en) 2023-06-15

Family

ID=80453536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100708 WO2023103344A1 (en) 2021-12-08 2022-06-23 Data processing method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114157553A (en)
WO (1) WO2023103344A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114157553A (en) * 2021-12-08 2022-03-08 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090193436A1 (en) * 2008-01-30 2009-07-30 Inventec Corporation Alarm display system of cluster storage system and method thereof
CN107423194A (en) * 2017-06-30 2017-12-01 阿里巴巴集团控股有限公司 Front end abnormality alarming processing method, apparatus and system
CN110806921A (en) * 2019-09-30 2020-02-18 烽火通信科技股份有限公司 OVS (optical virtual system) abnormity alarm monitoring system and method
CN112596990A (en) * 2020-12-24 2021-04-02 科华恒盛股份有限公司 Alarm storm processing method and device and terminal equipment
CN114157553A (en) * 2021-12-08 2022-03-08 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4152866B2 (en) * 2003-11-19 2008-09-17 株式会社日立製作所 Storage device, storage device system, and communication control method
CN102638100B (en) * 2012-04-05 2014-02-19 华北电力大学 District power network equipment abnormal alarm signal association analysis and diagnosis method
CN108809757B (en) * 2018-05-22 2021-06-15 平安科技(深圳)有限公司 System alarm method, storage medium and server
CN110830438A (en) * 2019-09-25 2020-02-21 杭州优行科技有限公司 Abnormal log warning method and device and electronic equipment
CN111158977B (en) * 2019-12-12 2023-07-11 深圳前海微众银行股份有限公司 Abnormal event root cause positioning method and device
US20210105338A1 (en) * 2020-01-06 2021-04-08 Intel Corporation Quality of service (qos) management with network-based media processing (nbmp)
CN111585782A (en) * 2020-03-18 2020-08-25 国网江苏省电力有限公司信息通信分公司 Comprehensive centralized alarm automatic processing system and method
CN112711507A (en) * 2020-12-17 2021-04-27 浙江高速信息工程技术有限公司 Device alarm method, electronic device, and medium
CN112699007A (en) * 2021-01-04 2021-04-23 网宿科技股份有限公司 Method, system, network device and storage medium for monitoring machine performance

Also Published As

Publication number Publication date
CN114157553A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
US10387899B2 (en) Systems and methods for monitoring and analyzing computer and network activity
CN107729210B (en) Distributed service cluster abnormity diagnosis method and device
WO2021003810A1 (en) Service system update method, electronic device and readable storage medium
CN102937930B (en) Application program monitoring system and method
WO2022089202A1 (en) Fault identification model training method, fault identification method, apparatus and electronic device
WO2021213247A1 (en) Anomaly detection method and device
CN112087334B (en) Alarm root cause analysis method, electronic device and storage medium
WO2023045417A1 (en) Fault knowledge graph construction method and apparatus
US20210014102A1 (en) Reinforced machine learning tool for anomaly detection
WO2023071761A1 (en) Anomaly positioning method and device
US10783453B2 (en) Systems and methods for automated incident response
CN107423205A (en) A kind of system failure method for early warning and system for anti-data-leakage system
US11605010B1 (en) Computer incident scoring
WO2023103344A1 (en) Data processing method and apparatus, device, and storage medium
WO2024031930A1 (en) Error log detection method and apparatus, and electronic device and storage medium
US11531821B2 (en) Intent resolution for chatbot conversations with negation and coreferences
CN113361969A (en) Intelligent quality inspection system capable of flexibly configuring templates
CN109582670A (en) A kind of recommended method and relevant device of vehicle maintenance scheme
CN112910733A (en) Full link monitoring system and method based on big data
US20230385735A1 (en) System and method for optimized predictive risk assessment
CN114706893A (en) Fault detection method, device, equipment and storage medium
US11870933B2 (en) Emergency dispatch command information management system, device, and method capable of providing relevant emergency dispatch command information
WO2024027127A1 (en) Fault detection method and apparatus, and electronic device and readable storage medium
US20230026656A1 (en) Machine learning for categorizing text
CN117873828A (en) Alarm processing method, device, equipment and medium of server

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22902781

Country of ref document: EP

Kind code of ref document: A1