CN115544202A - Alarm processing method, device and storage medium - Google Patents

Publication number: CN115544202A
Authority: CN (China)
Legal status: Pending
Application number: CN202110725324.1A
Other languages: Chinese (zh)
Inventors: 李立锟, 毕航华
Current assignee: Beijing Huawei Digital Technologies Co Ltd
Original assignee: Beijing Huawei Digital Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

Embodiments of the present application disclose an alarm processing method, an alarm processing device, and a storage medium, and belong to the technical field of AI. In the embodiments, an NLP model trained on alarm description text processes the information of each alarm in an alarm event to obtain one or more candidate fault causes of the alarm event, without manually preset diagnostic rules. The fault information corresponding to the alarm event is then determined according to the one or more candidate fault causes, which is intelligent and efficient. When new alarms are added later, they can be analyzed and processed simply by adding the corresponding alarm description text and training the NLP model. Rule code does not need to be redeveloped, so the workload is small and the maintenance cost is low.

Description

Alarm processing method, device and storage medium

Technical field

The embodiments of the present application relate to the technical field of artificial intelligence (AI), and in particular to an alarm processing method, device, and storage medium.

Background

Network devices in a network may experience various faults during operation, and these faults trigger alarms. After receiving the alarm information reported by a network device, a control device can process the alarm information to determine where the fault is located, and then repair the fault or provide a method for repairing it.

In the related art, after receiving alarm information, the control device uses manually preset diagnostic rules to determine what triggered the alarm, and then locates the fault according to the determined trigger cause.

The related art therefore relies on manually preset diagnostic rules to process alarm information. Whenever a new rule has to be added, the rule code must be redeveloped, which means a heavy workload, high development difficulty, and high maintenance cost.

Summary

Embodiments of the present application provide an alarm processing method, device, and storage medium that can process the alarms in an alarm event intelligently and efficiently. The technical solutions are as follows:

In a first aspect, an alarm processing method is provided. The method includes: obtaining an alarm event, where the alarm event includes information of multiple alarms; processing the information of each of the multiple alarms through a first natural language processing (NLP) model to obtain one or more candidate fault items corresponding to the alarm event, where the first NLP model is trained on alarm description text; and determining fault information corresponding to the alarm event according to the one or more candidate fault items.

In the embodiments of the present application, an NLP model trained on alarm description text processes the information of each alarm in the alarm event to obtain one or more candidate fault causes of the alarm event. No manually preset diagnostic rules are needed to obtain the candidate fault causes, and the fault information corresponding to the alarm event is then determined according to them, which is intelligent and efficient. When new alarms are added later, they can be analyzed and processed simply by adding the corresponding alarm description text and training the NLP model. Rule code does not need to be redeveloped, so the workload is small and the maintenance cost is low.
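
The three steps of the first aspect can be sketched as a small pipeline. This is only an illustration under stated assumptions: `Alarm`, `AlarmEvent`, `keyword_model`, and `process_alarm_event` are hypothetical names, and a hand-written keyword table stands in for the trained first NLP model, which the application does not specify at code level.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    description: str


@dataclass(frozen=True)
class AlarmEvent:
    alarms: List[Alarm]


# Hypothetical stand-in for the first NLP model. A real model would be
# trained on alarm description text instead of using this fixed table.
KEYWORD_TO_FAULT: Dict[str, str] = {
    "link down": "optical_module_fault",
    "loss of signal": "fiber_break",
}


def keyword_model(alarms: List[Alarm]) -> List[str]:
    """Map the information of each alarm to candidate fault items."""
    candidates: List[str] = []
    for alarm in alarms:
        for keyword, fault in KEYWORD_TO_FAULT.items():
            if keyword in alarm.description.lower() and fault not in candidates:
                candidates.append(fault)
    return candidates


def process_alarm_event(
    event: AlarmEvent,
    model: Callable[[List[Alarm]], List[str]],
    is_real_fault: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Obtain candidates from the model, then keep the confirmed ones."""
    candidates = model(event.alarms)
    confirmed = [item for item in candidates if is_real_fault(item)]
    return {"candidate_fault_items": candidates, "fault_items": confirmed}


event = AlarmEvent(alarms=[
    Alarm("A1", "Port 3 link down"),
    Alarm("A2", "Loss of signal on port 3"),
])
# The confirmation callback is a placeholder for the diagnosis step.
result = process_alarm_event(event, keyword_model, lambda item: item == "fiber_break")
print(result)
```

Passing the model and the confirmation step as callables keeps the sketch independent of any particular NLP framework.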

Optionally, processing the information of each of the multiple alarms through the first NLP model to obtain the one or more candidate fault items corresponding to the alarm event may be implemented as follows: the information of each of the multiple alarms is used as the input of the first NLP model, and the first NLP model processes it to obtain a diagnosis chain of the alarm event. The diagnosis chain includes the multiple alarms and one or more fault items, and represents the logical relationships among the multiple alarms and the one or more fault items. The one or more fault items included in the diagnosis chain are used as the one or more candidate fault items corresponding to the alarm event.

After the NLP model processes the information of the multiple alarms included in the alarm event, the diagnosis chain of the alarm event is obtained. The diagnosis chain includes multiple alarms and one or more fault items, and represents the logical relationships between alarms and between alarms and fault items, so the reasoning from alarms to fault items is highly interpretable.

It should be noted that, in the embodiments of the present application, the control device may also include the diagnosis chain of the alarm event in the subsequent fault information.
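
One way to picture a diagnosis chain is as a small directed graph whose nodes are alarms or fault items. The sketch below is an assumption made for illustration; the `DiagnosisChain` class, its "cause explains effect" edge convention, and the node names are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DiagnosisChain:
    # Node name -> "alarm" or "fault".
    node_types: Dict[str, str] = field(default_factory=dict)
    # Directed edges (cause, effect): the cause node explains the effect node.
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_edge(self, cause: str, cause_type: str, effect: str, effect_type: str) -> None:
        self.node_types[cause] = cause_type
        self.node_types[effect] = effect_type
        self.edges.append((cause, effect))

    def candidate_fault_items(self) -> List[str]:
        """The fault items in the chain are the candidate fault items."""
        return sorted(n for n, t in self.node_types.items() if t == "fault")

    def explain(self, alarm: str) -> List[str]:
        """Walk edges backwards from an alarm and list every node that explains it."""
        seen: Set[str] = set()
        stack = [alarm]
        while stack:
            node = stack.pop()
            for cause, effect in self.edges:
                if effect == node and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return sorted(seen)


chain = DiagnosisChain()
chain.add_edge("fiber_break", "fault", "LOS", "alarm")
chain.add_edge("LOS", "alarm", "LINK_DOWN", "alarm")
chain.add_edge("optical_module_fault", "fault", "LINK_DOWN", "alarm")

print(chain.candidate_fault_items())
print(chain.explain("LINK_DOWN"))
```

Because the chain keeps the edges rather than only the final fault items, the path from any alarm back to a fault item can be shown to an operator, which is the interpretability property described above.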

Optionally, before the first NLP model processes the information of the alarms in the alarm event, the method further includes: obtaining the alarm description text; and training a first initial NLP network according to the alarm explanation information and alarm cause corresponding to each of multiple sample alarms in the alarm description text, to obtain the first NLP model.

That is, in the embodiments of the present application, the first NLP model can learn the logical relationships between the sample alarms in the alarm description text, and between the sample alarms and fault items.
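
As a rough illustration of how alarm description text might be turned into supervised training samples, the sketch below pairs each sample alarm's explanation with its cause. The record layout, the field names, and `build_training_pairs` are assumptions; the application does not describe the training data format or the network architecture.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of alarm description text, e.g. parsed from an alarm manual.
ALARM_DESCRIPTION_TEXT: List[Dict[str, str]] = [
    {
        "alarm": "LINK_DOWN",
        "explanation": "The physical link of the port is down.",
        "cause": "optical_module_fault",
    },
    {
        "alarm": "LINK_DOWN",
        "explanation": "The physical link of the port is down.",
        "cause": "fiber_break",
    },
    {
        "alarm": "LOS",
        "explanation": "No optical signal is received on the port.",
        "cause": "fiber_break",
    },
]


def build_training_pairs(entries: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (input text, label) pairs: explanation text in, alarm cause out."""
    pairs: List[Tuple[str, str]] = []
    for entry in entries:
        text = f"{entry['alarm']}: {entry['explanation']}"
        pairs.append((text, entry["cause"]))
    return pairs


pairs = build_training_pairs(ALARM_DESCRIPTION_TEXT)
for text, label in pairs:
    print(label, "<-", text)
```

One alarm can appear with several causes, so the same input text may map to more than one label. A real training setup would have to handle that, for example as a multi-label task.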

Optionally, determining the fault information corresponding to the alarm event according to the one or more candidate fault items includes: calling the diagnostic interface of the diagnostic item corresponding to each of the one or more candidate fault items; obtaining the status information of each diagnostic item through its diagnostic interface; and determining the fault information corresponding to the alarm event according to the status information of the diagnostic item corresponding to each candidate fault item.

In the embodiments of the present application, after the one or more candidate fault items corresponding to the alarm event are determined, the automatically generated diagnostic interfaces can be used to diagnose the candidate fault items and obtain the real fault item, that is, the actual cause of the alarm event. The whole process is intelligent and efficient.
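
Calling one diagnostic interface per candidate fault item can be sketched as a lookup in a registry of callables. The registry, the `collect_status` function, and the status strings below are hypothetical; real diagnostic interfaces would query devices rather than return constants.

```python
from typing import Callable, Dict, List

# Hypothetical type: a diagnostic interface returns the current status
# of the diagnostic item it belongs to.
DiagnosticInterface = Callable[[], str]


def collect_status(
    candidates: List[str],
    registry: Dict[str, DiagnosticInterface],
) -> Dict[str, str]:
    """Call the diagnostic interface of each candidate fault item."""
    status: Dict[str, str] = {}
    for item in candidates:
        interface = registry.get(item)
        if interface is None:
            # No interface is available for this item in the registry.
            status[item] = "unknown"
            continue
        status[item] = interface()
    return status


registry: Dict[str, DiagnosticInterface] = {
    "optical_module_fault": lambda: "in_position",
    "fiber_break": lambda: "no_light",
}
status = collect_status(
    ["optical_module_fault", "fiber_break", "board_fault"], registry
)
print(status)
```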

Optionally, the method further includes: identifying, through a second NLP model, the processing steps corresponding to each of the multiple sample alarms included in the alarm description text, to obtain the diagnostic items corresponding to those processing steps; and generating a diagnostic interface for the diagnostic item corresponding to each processing step.

In the embodiments of the present application, an NLP model can identify the processing steps contained in the alarm description text to obtain the corresponding diagnostic items, and the diagnostic interfaces for those diagnostic items are then generated automatically. Later, when the candidate fault items corresponding to an alarm event are diagnosed, the automatically generated diagnostic interfaces can be called directly, which makes the whole diagnosis process intelligent and efficient.
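
The idea of turning handling steps into callable diagnostic interfaces can be sketched as follows. Regular expressions stand in for the second NLP model here, which is a deliberate simplification; `STEP_PATTERNS`, `recognize_diagnostic_item`, `generate_interfaces`, and the `probe` callback are hypothetical names introduced for the example.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical patterns standing in for the second NLP model.
STEP_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "optical_module_status": re.compile(r"optical module", re.IGNORECASE),
    "fiber_connection": re.compile(r"\bfiber\b", re.IGNORECASE),
}


def recognize_diagnostic_item(step: str) -> Optional[str]:
    """Return the diagnostic item a processing step refers to, if any."""
    for item, pattern in STEP_PATTERNS.items():
        if pattern.search(step):
            return item
    return None


def generate_interfaces(
    steps: List[str],
    probe: Callable[[str], str],
) -> Dict[str, Callable[[], str]]:
    """Create one zero-argument diagnostic interface per recognized item."""
    interfaces: Dict[str, Callable[[], str]] = {}
    for step in steps:
        item = recognize_diagnostic_item(step)
        if item is not None and item not in interfaces:
            # Bind item as a default argument so each closure keeps its own value.
            interfaces[item] = lambda item=item: probe(item)
    return interfaces


steps = [
    "Check whether the optical module is in position.",
    "Check whether the fiber is connected properly.",
    "Contact technical support.",
]
interfaces = generate_interfaces(steps, probe=lambda item: f"status of {item}")
print(sorted(interfaces))
print(interfaces["fiber_connection"]())
```

Steps that name no checkable item, such as the last one, produce no interface in this sketch.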

Optionally, determining the fault information corresponding to the alarm event according to the status information of each candidate fault item may be implemented as follows: determining, from the one or more candidate fault items, a target fault item whose corresponding diagnostic item has status information inconsistent with the preset status information of that diagnostic item; and generating the fault information corresponding to the alarm event according to the target fault item and the repair method corresponding to the target fault item.

The obtained status information of the diagnostic items is used to identify the diagnostic items whose status is abnormal, and the real fault item among the candidate fault items is determined on that basis, which makes the result reliable.
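
The comparison against preset status information can be sketched in a few lines. The dictionaries and the `determine_fault_info` function are hypothetical, and the layout of the returned fault information is an assumption made for the example.

```python
from typing import Dict, List


def determine_fault_info(
    status: Dict[str, str],
    preset_status: Dict[str, str],
    repair_methods: Dict[str, str],
) -> List[Dict[str, str]]:
    """Keep the items whose observed status differs from the preset status."""
    fault_info: List[Dict[str, str]] = []
    for item, observed in status.items():
        expected = preset_status.get(item)
        if expected is not None and observed != expected:
            fault_info.append({
                "fault_item": item,
                "observed": observed,
                "expected": expected,
                "repair": repair_methods.get(item, "no repair method recorded"),
            })
    return fault_info


status = {"optical_module_fault": "in_position", "fiber_break": "no_light"}
preset_status = {"optical_module_fault": "in_position", "fiber_break": "light_ok"}
repair_methods = {"fiber_break": "Replace or reconnect the fiber."}

fault_info = determine_fault_info(status, preset_status, repair_methods)
print(fault_info)
```

Items without a preset status are skipped in this sketch rather than reported as faults; a real implementation might treat them differently.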

Optionally, the alarm description text includes at least one of product information text, an alarm manual, and fault maintenance experience text.

In the embodiments of the present application, the alarm description text may be at least one of product information text, an alarm manual, or fault maintenance experience text. That is, providing product information text, an alarm manual, or fault maintenance experience text is enough to obtain the first NLP model and thereby identify various alarms intelligently. When a new device or service is added, only the corresponding alarm description text needs to be added to train the first NLP model, after which alarms of the new device or service can be processed. Compared with identifying alarms through manually set rules, rule code does not need to be redeveloped, so the workload is small and the maintenance cost is low.

In a second aspect, an alarm processing device is provided. The alarm processing device has the function of implementing the behavior of the alarm processing method in the first aspect. The alarm processing device includes at least one module, and the at least one module is configured to implement the alarm processing method provided in the first aspect.

In a third aspect, an alarm processing device is provided. The structure of the alarm processing device includes a processor and a memory. The memory is configured to store a program that supports the alarm processing device in executing the alarm processing method provided in the first aspect, and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory. The alarm processing device may further include a communication bus, which is used to establish a connection between the processor and the memory.

In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores instructions that, when run on a computer, cause the computer to execute the alarm processing method described in the first aspect.

In a fifth aspect, a computer program product containing instructions is provided. When run on a computer, it causes the computer to execute the alarm processing method described in the first aspect.

The technical effects obtained by the second, third, fourth, and fifth aspects are similar to those obtained by the corresponding technical means in the first aspect, and are not repeated here.

The beneficial effects of the technical solutions provided by the embodiments of the present application include at least the following:

In the embodiments of the present application, an NLP model trained on alarm description text processes the information of each alarm in the alarm event to obtain one or more candidate fault causes of the alarm event. No manually preset diagnostic rules are needed to obtain the candidate fault causes, and the fault information corresponding to the alarm event is then determined according to them, which is intelligent and efficient. When new alarms are added later, they can be analyzed and processed simply by adding the corresponding alarm description text and training the NLP model. Rule code does not need to be redeveloped, so the workload is small and the maintenance cost is low.

Description of drawings

FIG. 1 is an architecture diagram of a network system involved in an alarm processing method provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application;

FIG. 3 is a flowchart of an alarm processing method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a diagnosis chain of an alarm event provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an alarm processing device provided by an embodiment of the present application.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the implementations of the present application are described in further detail below with reference to the accompanying drawings.

Before the embodiments of the present application are explained in detail, the application scenarios involved are introduced first.

With the development of network technology, network architectures are becoming increasingly complex, and the number of network devices in a network keeps growing. When a network device in the network fails, it sends an alarm to the control device. The fault then has to be located quickly according to the alarm so that it can be repaired and its impact on services minimized. The alarm processing method provided by the embodiments of the present application can be applied in this scenario to analyze and process alarms and obtain the fault information corresponding to them.

Optionally, the alarm processing method provided by the embodiments of the present application can also be used to process alarms in other scenarios. For example, in a banking system, an alarm can be reported to the control device when a risky user appears, and the control device can use the alarm processing method to identify the risky user. As another example, in a power system, a received alarm can be processed with the alarm processing method to determine the fault point in the power system that triggered the alarm, such as an abnormal power consumption point or an abnormal device. As a further example, in industrial production, the control device in an automated factory can receive alarms reported by detection devices on the production line, and can likewise use the method to detect abnormal points in the production process.

It should be noted that the above are only a few exemplary application scenarios. The embodiments of the present application are equally applicable to other scenarios in which alarms need to be processed to determine information about fault points, risk points, or abnormal points, and these are not described further here.

Next, the system architecture involved in the alarm processing method provided by the embodiments of the present application is introduced.

FIG. 1 is a schematic diagram of a network system architecture involved in the alarm processing method provided by the embodiments of the present application. As shown in FIG. 1, the network system includes an alarm processing device 101 and multiple network devices 102, and communication connections are established between the alarm processing device 101 and the multiple network devices 102.

In the embodiments of the present application, the alarm processing device 101 may be the control device corresponding to the multiple network devices 102. In this case, each network device 102 can trigger an alarm when it fails or when a service is abnormal, and send the alarm information to the alarm processing device 101. Alternatively, a network device 102 can trigger an alarm upon receiving an alarm from another network device 102, and send the alarm information to the alarm processing device 101.

After receiving the alarm information from the network devices 102, the alarm processing device 101 can cluster the alarms according to the reported alarm information to obtain an alarm event that contains the information of multiple alarms. The alarm processing device 101 can then process each alarm in the alarm event with the alarm processing method provided by the embodiments of the present application to obtain the fault information corresponding to the alarm event. The alarms in one alarm event are alarms that are associated with one another.
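
The application does not say how alarms are clustered into alarm events. Purely as an illustration, the sketch below groups alarms whose timestamps lie close together; the time-window criterion, the `cluster_by_time` function, and the 5-second window are assumptions, and a real system would more likely also use topology or service relationships.

```python
from typing import List, Optional, Tuple


def cluster_by_time(alarms: List[Tuple[str, float]], window: float) -> List[List[str]]:
    """Group alarms that arrive within `window` seconds of the previous alarm.

    Each alarm is an (alarm_id, timestamp_in_seconds) pair.
    """
    events: List[List[str]] = []
    last_time: Optional[float] = None
    for alarm_id, timestamp in sorted(alarms, key=lambda a: a[1]):
        # Start a new alarm event when the gap to the previous alarm is too large.
        if last_time is None or timestamp - last_time > window:
            events.append([])
        events[-1].append(alarm_id)
        last_time = timestamp
    return events


alarms = [("A3", 30.0), ("A1", 0.0), ("A2", 2.0), ("A4", 31.5)]
events = cluster_by_time(alarms, window=5.0)
print(events)
```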

Optionally, in some possible cases, the alarm processing device 101 is not the control device corresponding to the multiple network devices 102, but a server or terminal device with a data processing function. In this case, the network system may further include the control device corresponding to the multiple network devices 102. Each network device 102 can then report alarm information to the control device, and the control device clusters the alarm information reported by the network devices 102 to obtain an alarm event. The control device can then send the clustered alarm event to the alarm processing device 101, which processes each alarm in the alarm event according to the alarm processing method provided by the embodiments of the present application.

When the alarm processing device 101 is the control device corresponding to the multiple network devices, it may be a network cloud engine (NCE). For example, the alarm processing device 101 is a software-defined network (SDN) controller or a cloud management platform. In this case, the multiple network devices 102 are devices managed and controlled by the alarm processing device 101, such as terminal devices and gateway devices, which is not limited in the embodiments of the present application.

When the alarm processing device 101 is not the control device corresponding to the multiple network devices, it may be a server or a terminal device. For example, the alarm processing device 101 is a server or a server cluster in a data center, or the alarm processing device 101 is a client device, which is not limited in the embodiments of the present application.

Optionally, in some possible cases, when the alarm processing method provided by the embodiments of the present application is applied in other scenarios, such as a power system or industrial production, the alarm processing device 101 may be a device that centrally manages and controls other devices, and the multiple network devices 102 are then the devices managed and controlled by the alarm processing device 101. Alternatively, the alarm processing device 101 may likewise be a server or terminal device rather than the control device corresponding to the multiple network devices, which is not limited in the embodiments of the present application.

In the following embodiments, the alarm processing method provided by the embodiments of the present application is introduced using the example in which the alarm processing device 101 is the control device corresponding to the multiple network devices 102. For cases in which the alarm processing device 101 is another device, reference may be made to the implementations below, and details are not repeated.

FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application. The alarm processing device in the system architecture shown in FIG. 1 can be implemented by the computer device shown in FIG. 2. Referring to FIG. 2, the computer device includes one or more processors 201, a communication bus 202, a memory 203, and one or more communication interfaces 204.

The processor 201 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for implementing the solutions of the present application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The communication bus 202 is used to transfer information between the above components. The communication bus 202 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.

The memory 203 may be, but is not limited to, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, and so on), a magnetic disk storage medium or another magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 203 may exist independently and be connected to the processor 201 through the communication bus 202. The memory 203 may also be integrated with the processor 201.

The communication interface 204 uses any transceiver-type apparatus to communicate with other devices or a communication network. The communication interface 204 includes a wired communication interface and may also include a wireless communication interface. The wired communication interface may be, for example, an Ethernet interface. The Ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.

In some embodiments, the computer device may include multiple processors, such as the processor 201 and the processor 205 shown in Figure 2. Each of these processors may be a single-core processor or a multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).

In a specific implementation, as one embodiment, the computer device may further include an output device 206 and an input device 207. The output device 206 communicates with the processor 201 and can display information in a variety of ways. For example, the output device 206 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 207 communicates with the processor 201 and can receive user input in a variety of ways. For example, the input device 207 may be a mouse, a keyboard, a touch screen device, or a sensing device.

In some embodiments, the memory 203 stores the program code 208 that implements the solution of the present application, and the processor 201 can execute the program code 208 stored in the memory 203. The program code may include one or more software modules. Through the processor 201 and the program code 208 in the memory 203, the computer device can implement the alarm processing method provided in the embodiment of Figure 3 below.

The alarm processing method provided by the embodiments of the present application is described next.

Figure 3 is a flowchart of an alarm processing method provided by an embodiment of the present application. The method can be applied to the alarm processing device in the aforementioned system architecture. In the following description, the alarm processing device is taken to be the control device corresponding to multiple network devices. Referring to Figure 3, the method includes the following steps:

Step 301: Acquire an alarm event, where the alarm event includes information about multiple alarms.

In this embodiment of the present application, when a network device detects its own fault or a service anomaly, or receives an alarm from another device, it can trigger its own alarm and report the information of that alarm to the control device.

The alarm information may include alarm content, which indicates the type of the alarm. For example, the alarm content may be an Ethernet signal loss alarm. In addition, the alarm information may include an alarm location, which may be the location information of the network device that reported the alarm. For example, the alarm location is the Internet Protocol (IP) address of the network device that reported the alarm information, the media access control (MAC) address of that network device, or the device identifier of that network device, where the device identifier uniquely identifies the network device.

After receiving the alarm information reported by the network devices, the control device may cluster the received alarm information to obtain one or more alarm events. Each alarm event includes information about multiple alarms, and the alarms within one alarm event are related to one another in some way. For example, the alarms may be those reported within a certain time period by several network devices that are connected to one another. The embodiments of the present application do not specifically limit the relationship among the alarms in an alarm event.
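The clustering criterion is left open by the text. As a minimal sketch of the one example it does give (alarms reported within a time period by connected devices), the following groups alarms with a union-find structure. The `Alarm` fields, the `links` set, and the 60-second `window` are assumptions made for illustration, not part of the described method.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Alarm:
    content: str      # alarm type, e.g. "ETH_LOS"
    location: str     # reporting device (IP, MAC, or device identifier)
    timestamp: float  # hypothetical field: report time in seconds


def cluster_alarms(alarms, links, window=60.0):
    """Group alarms into alarm events.

    Two alarms are merged when they come from the same device or from two
    directly connected devices, and were reported within `window` seconds.
    `links` is a set of frozenset({device_a, device_b}) pairs.
    """
    parent = list(range(len(alarms)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(alarms)), 2):
        a, b = alarms[i], alarms[j]
        related = (a.location == b.location
                   or frozenset((a.location, b.location)) in links)
        if related and abs(a.timestamp - b.timestamp) <= window:
            parent[find(i)] = find(j)

    events = {}
    for i, alarm in enumerate(alarms):
        events.setdefault(find(i), []).append(alarm)
    return list(events.values())
```

Because merging is transitive, a chain of connected devices reporting in quick succession ends up in one alarm event even if its two ends are not directly linked.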

It should be noted that this embodiment describes the alarm processing method by taking as an example the case in which the control device has acquired one alarm event.

Step 302: Process the information of each of the multiple alarms through a first NLP model to obtain one or more candidate fault items of the alarm event, where the first NLP model is obtained by training on alarm description text.

After acquiring the alarm event, the control device uses the information of each of the multiple alarms included in the alarm event as the input of the first NLP model, and the first NLP model processes this information to obtain a diagnosis chain of the alarm event. The diagnosis chain includes the multiple alarms and one or more fault items, and it represents the logical relationships among the alarms and the fault items. The one or more fault items contained in the diagnosis chain are then used as the one or more candidate fault items of the alarm event.

In this embodiment of the present application, the first NLP model, obtained by training on alarm description text, is deployed on the control device. It should be noted that NLP algorithms are a hot topic in the field of AI; they aim to let machines understand human natural language and respond to users accordingly. In this embodiment, before the control device processes the alarm information in the alarm event through the first NLP model, it may first acquire alarm description text and use it to train a first initial NLP network, thereby obtaining the first NLP model.

For example, the control device may acquire alarm description text, extract from it the alarm explanation information and alarm causes corresponding to each of multiple sample alarms, and train the first initial NLP network on this alarm explanation information and these alarm causes to obtain the first NLP model.

The alarm description text includes at least one of product information text, an alarm manual, and fault maintenance experience text. The product information text may include the product manual of each network device controlled by the control device; the product manual may contain the usage instructions of the corresponding network device, the alarm explanation information of alarms it may generate, the alarm causes, and the corresponding processing steps. The alarm manual may include documentation of the alarms that may occur in the network, including the alarm explanation information, alarm causes, and corresponding processing steps of alarms that network devices in the network may generate. The fault maintenance experience text may include text uploaded by customers or other relevant personnel that describes alarms of network devices in the network and the related handling methods. For example, the fault maintenance experience text may include work orders filled in when handling alarms of network devices. The embodiments of the present application do not specifically limit this.

The control device may acquire one or more of the product information text, the alarm manual, and the fault maintenance experience text as the alarm description text, and there may be one or more texts of each kind. The embodiments of the present application do not limit this.

After acquiring the alarm description text, the control device may use it to train the first initial NLP network.

Specifically, the control device may use the alarm description text as the input of the first initial NLP network. The network detects keywords such as the sample alarm identifier, "alarm explanation", and "alarm cause" in the alarm description text and extracts the paragraphs or sentences that contain them, which yields the alarm explanation information and alarm causes corresponding to multiple sample alarms. The network then extracts the keywords from the alarm causes of each sample alarm. Next, the alarm explanation information of each sample alarm is used as the prediction source, the keywords in the alarm causes of that sample alarm are used as the label of the alarm explanation information, and the first initial NLP network is trained. During training, the network performs semantic analysis and text similarity calculation on the alarm explanation information of each sample alarm and outputs a corresponding alarm cause. The output alarm cause is compared with the label set for that alarm explanation information to obtain the deviation between the two, and the parameters of the first initial NLP network are adjusted repeatedly according to this deviation. In this way the network gradually learns the logical relationships among the sample alarms, and the first NLP model is obtained.

For example, for the Ethernet signal loss alarm, whose identifier is ETH_LOS, the control device can use the first initial NLP network to detect keywords such as ETH_LOS, "alarm explanation", and "alarm cause" in the alarm description text, and then extract the paragraphs or sentences containing these keywords to obtain the alarm explanation information and alarm causes corresponding to the ETH_LOS alarm.
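The keyword-driven extraction can be illustrated with a plain string-matching stand-in for the NLP network. This is only a sketch: the manual layout (an `Alarm:` line followed by paragraphs that start with a heading keyword, separated by blank lines) and the heading names are assumptions, and a real alarm manual would need a far more tolerant parser.

```python
def extract_sections(text, alarm_id,
                     headings=("Alarm explanation", "Alarm cause")):
    """Return {heading: paragraph body} for the alarm named `alarm_id`.

    Assumes paragraphs are separated by blank lines, each alarm starts with a
    line "Alarm: <ID>", and each relevant paragraph starts with a heading.
    """
    result = {}
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    in_alarm = False
    for para in paragraphs:
        first_line = para.splitlines()[0]
        if first_line.startswith("Alarm:"):
            # Entering a new alarm block; track whether it is the one we want.
            in_alarm = alarm_id in first_line
            continue
        if not in_alarm:
            continue
        for heading in headings:
            if first_line.startswith(heading):
                # Drop the heading keyword and its separator, keep the body.
                result[heading] = para[len(heading):].lstrip(" :\n")
    return result


# Made-up sample text, for illustration only.
SAMPLE_MANUAL = """Alarm: ETH_LOS

Alarm explanation: Ethernet signal loss.

Alarm cause: Peer board fault; local board fault.

Alarm: BD_STATUS

Alarm explanation: Physical board offline.
"""
```

Calling `extract_sections(SAMPLE_MANUAL, "ETH_LOS")` returns the explanation and cause paragraphs for that alarm only; the pairs obtained this way are what the training step would consume.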

For example, the alarm explanation information of ETH_LOS is as follows:

[Images BDA0003138391170000071 and BDA0003138391170000081: table of ETH_LOS alarm explanation information; not reproduced in this text.]

For example, the alarm causes of ETH_LOS are as follows:

[Image BDA0003138391170000082: table of ETH_LOS alarm causes; not reproduced in this text.]

It should be noted that the above is only one possible implementation, provided by the embodiments of the present application, of extracting alarm explanation information and alarm causes. In other possible implementations, the alarm explanation information and alarm causes may be extracted by other algorithms, or a third NLP model may detect other keywords that indicate alarm explanation information and alarm causes in order to extract them for the sample alarms. The embodiments of the present application do not limit this.

Optionally, in some possible implementations, the process of training the first initial NLP network to obtain the first NLP model may be performed on another device. That is, another device may train the first NLP model, which is then deployed on the control device. The way the other device trains the first initial NLP network to obtain the first NLP model is the same as or similar to the implementation described above, and the embodiments of the present application do not limit this.

Because the first NLP model has learned the logical relationships among the sample alarms in the alarm description text, as well as the alarm causes of each sample alarm, the control device can, after acquiring an alarm event, input the information of the multiple alarms in that alarm event into the first NLP model. Based on the learned logical relationships and alarm causes, the first NLP model looks up the other alarms and the alarm causes associated with each alarm in the alarm event, and then generates the diagnosis chain of the alarm event. The diagnosis chain thus includes the fault items corresponding to each alarm in the alarm event and the other alarms that trigger those alarms. In other words, the diagnosis chain represents both the relationships among the alarms and the logical relationships between each alarm and one or more fault items.

For example, assume the alarm event includes three alarms: the link connectivity loss alarm TUNNEL_LOCV, the Ethernet signal loss alarm ETH_LOS, and the physical board offline alarm BD_STATUS. Among the logical relationships and alarm causes learned by the first NLP model, the trigger cause of TUNNEL_LOCV is ETH_LOS; that is, TUNNEL_LOCV is triggered by the ETH_LOS alarm. The trigger causes of ETH_LOS include two alarms and two fault items: the two alarms are the board hardware error alarm HARD_BAD and BD_STATUS, and the two fault items are a peer board fault and a local board fault. That is, ETH_LOS may be triggered by at least one of the HARD_BAD and BD_STATUS alarms, or by at least one of the peer board fault and the local board fault. The trigger causes of BD_STATUS include the local board fault and the power abnormality alarm POWER_ABNORMAL, and the trigger causes of HARD_BAD include the lost power module fault. Based on these learned relationships and alarm causes, the first NLP model can obtain the diagnosis chain of the alarm event shown in Figure 4. The diagnosis chain contains multiple fault items and multiple alarms, and it represents both the logical relationships among the alarms and the logical relationships between the alarms and the fault items.

After obtaining the diagnosis chain of the alarm event, the control device may use the distinct fault items included in the diagnosis chain as the one or more candidate fault items of the alarm event.

For example, in the diagnosis chain of the alarm event in Figure 4, the fault items of ETH_LOS are the peer board fault and the local board fault, the fault item of BD_STATUS is the local board fault, and the fault item of HARD_BAD, which triggers ETH_LOS, is the lost power module. Deduplicating these fault items yields the fault items of the diagnosis chain: peer board fault, local board fault, and lost power module. These fault items are used as the candidate fault items of the alarm event.
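The traversal and deduplication in this example can be sketched with an explicit lookup table standing in for what the first NLP model has learned. The `CAUSES` dictionary below encodes only the relationships stated in the text; the table layout and the function name `build_diagnosis_chain` are illustrative assumptions, and the real model is not a static table.

```python
from collections import deque

# Learned relationships from the example: for each alarm, the upstream alarms
# that may trigger it and the fault items that may cause it.
CAUSES = {
    "TUNNEL_LOCV": {"alarms": ["ETH_LOS"], "faults": []},
    "ETH_LOS": {
        "alarms": ["HARD_BAD", "BD_STATUS"],
        "faults": ["peer board fault", "local board fault"],
    },
    "BD_STATUS": {
        "alarms": ["POWER_ABNORMAL"],
        "faults": ["local board fault"],
    },
    "HARD_BAD": {"alarms": [], "faults": ["lost power module"]},
}


def build_diagnosis_chain(event_alarms, causes):
    """Walk the cause table breadth-first from the alarms in the event.

    Returns (edges, candidate_faults): edges are (cause, effect) pairs, and
    candidate_faults is the deduplicated list of fault items in visit order.
    """
    edges, faults, seen = [], [], set()
    queue = deque(event_alarms)
    while queue:
        alarm = queue.popleft()
        if alarm in seen:
            continue
        seen.add(alarm)
        entry = causes.get(alarm, {})
        for fault in entry.get("faults", []):
            edges.append((fault, alarm))
            if fault not in faults:  # deduplicate candidate fault items
                faults.append(fault)
        for upstream in entry.get("alarms", []):
            edges.append((upstream, alarm))
            queue.append(upstream)
    return edges, faults
```

For the three alarms of the example, the second return value is the deduplicated list of peer board fault, local board fault, and lost power module, matching the candidate fault items named above.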

Step 303: Determine the fault information corresponding to the alarm event according to the one or more candidate fault items of the alarm event.

After obtaining the one or more candidate fault items of the alarm event, the control device may determine, from among them, the real fault item corresponding to the alarm event, and then generate the fault information corresponding to the alarm event according to that real fault item.

For example, the control device may call the diagnostic interface of the diagnostic item corresponding to each of the one or more candidate fault items, acquire the state information of each diagnostic item through its diagnostic interface, and determine the fault information corresponding to the alarm event according to the state information of the diagnostic items corresponding to each candidate fault item.

It should be noted that the control device may store a mapping between fault items and their corresponding diagnostic items. On this basis, after obtaining the one or more candidate fault items, the control device may, for a first candidate fault item among them, obtain from the mapping the one or more diagnostic items corresponding to the first candidate fault item. Then, for a first diagnostic item among those diagnostic items, the control device may call the diagnostic interface of the first diagnostic item.

It should be noted that the diagnostic interface corresponding to each diagnostic item may be obtained in advance, after the control device has used a second NLP model to process the processing steps of each sample alarm in the alarm description text.

For example, the control device may use the second NLP model in advance to identify the processing steps corresponding to each of the multiple sample alarms contained in the alarm description text, so as to obtain the diagnostic item corresponding to each processing step, and then generate the diagnostic interface of the diagnostic item corresponding to each processing step.

The control device may first extract the processing steps corresponding to each sample alarm from the alarm description text. Because each sample alarm may have multiple alarm causes, and different alarm causes may call for different processing steps, each sample alarm may have multiple sets of processing steps. After obtaining them, the control device uses the processing steps corresponding to each alarm cause of a sample alarm as the input of the second NLP model. The second NLP model performs semantic analysis and keyword extraction on the received processing steps to obtain the corresponding diagnostic item. The control device may then, based on the diagnostic item, obtain code entered by relevant personnel that implements the diagnosis for that item, and generate the diagnostic interface corresponding to the diagnostic item from that code.

For example, assume the processing steps corresponding to each alarm cause of ETH_LOS are as follows:

[Image BDA0003138391170000091: table of processing steps for each ETH_LOS alarm cause; not reproduced in this text.]

Identifying the above processing steps through the second NLP model yields three diagnostic items: querying whether the port is enabled, querying the working mode of the ports at both ends of the connection, and querying the loopback state of the peer port.

After obtaining these three diagnostic items, for the diagnostic item of querying the working mode of the ports at both ends of the connection, the control device can obtain the code entered by relevant personnel that implements this query, and then automatically generate the diagnostic interface corresponding to the diagnostic item from that code.

For each diagnostic item obtained by the second NLP model, the control device may follow the above method to automatically generate the corresponding diagnostic interface.

It should be noted that, in this embodiment of the present application, after generating the diagnostic interface corresponding to a diagnostic item, the control device may name the diagnostic interface after that diagnostic item. Alternatively, in some possible implementations, the control device may assign an interface identifier to the generated diagnostic interface and store the interface identifier in association with the diagnostic item.

In addition, as described above, a sample alarm may have multiple alarm causes, and different alarm causes may have different processing steps. In this case, after the second NLP model has analyzed the processing steps of each sample alarm and produced the diagnostic items of those processing steps, the control device may also treat the alarm cause corresponding to each processing step as the fault item corresponding to that processing step, and store the fault item and the diagnostic items of each processing step in association with each other, thereby obtaining the mapping between fault items and diagnostic items.

As described above, after generating the diagnostic interface corresponding to a diagnostic item, the control device may name the diagnostic interface after the diagnostic item; that is, the diagnostic item serves as the interface identifier of its diagnostic interface. In this case, taking the first diagnostic item as an example, when the control device calls the diagnostic interface corresponding to the first diagnostic item, it calls the diagnostic interface whose interface identifier is the first diagnostic item. Alternatively, as also described above, the control device may assign an interface identifier to the diagnostic interface of a diagnostic item and store the interface identifier in association with the diagnostic item. In this case, the control device obtains the interface identifier corresponding to the first diagnostic item from the stored mapping between diagnostic items and interface identifiers, and then calls the diagnostic interface of the first diagnostic item according to that interface identifier.
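One way to picture the two stored mappings (fault item to diagnostic items, and diagnostic item to diagnostic interface) is a small registry in which each interface is registered under the name of its diagnostic item. This is a sketch under assumptions: the decorator, the dictionary names, and the placeholder function body are hypothetical, and the item strings are paraphrased from the example in the text.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Diagnostic item name -> callable implementing its diagnostic interface.
DIAGNOSTIC_INTERFACES: Dict[str, Callable[[dict], str]] = {}


def diagnostic_interface(item: str):
    """Register a function as the diagnostic interface named after `item`."""
    def register(func):
        DIAGNOSTIC_INTERFACES[item] = func
        return func
    return register


@diagnostic_interface("query working mode of ports at both ends")
def query_working_mode(location: dict) -> str:
    # Placeholder: a real interface would query the devices at `location`.
    return "unknown"


# Fault item -> diagnostic items, as stored by the control device.
FAULT_TO_DIAGNOSTICS: Dict[str, List[str]] = {
    "peer port working mode mismatch or peer port inloop enabled": [
        "query working mode of ports at both ends",
        "query loopback state of peer port",
    ],
}


def interfaces_for(fault: str) -> List[Tuple[str, Optional[Callable]]]:
    """Return (item, interface) pairs; interface is None if not registered."""
    return [(item, DIAGNOSTIC_INTERFACES.get(item))
            for item in FAULT_TO_DIAGNOSTICS.get(fault, [])]
```

Using the item name as the registry key corresponds to the first naming option; the second option would add one more dictionary from diagnostic item to an assigned interface identifier and key the registry by that identifier instead.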

After calling the diagnostic interface of the first diagnostic item, the control device can acquire the state information of the first diagnostic item through that diagnostic interface.

For example, the control device may obtain the alarm location contained in the information of the alarm in the alarm event that is triggered by the first candidate fault item, and determine the location information of the first diagnostic item according to that alarm location and the diagnostic content indicated by the first diagnostic item. Then, according to the location information of the first diagnostic item, the control device acquires the state information of the first diagnostic item through the corresponding diagnostic interface.

For example, assume the first candidate fault item is that the working mode of the peer port does not match or the peer port has inloop enabled. The first candidate fault item then corresponds to two diagnostic items: querying the working mode of the ports at both ends of the connection, and querying the loopback state of the peer port. Assume the first diagnostic item is querying the working mode of the ports at both ends of the connection, and the alarm location contained in the information of the alarm triggered by the first candidate fault item is port 1 of network device A. The control device can then determine, from this alarm location and the diagnostic content indicated by the first diagnostic item, that the location information of the first diagnostic item is port 1 and the peer port connected to port 1.
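The location lookup in this example can be sketched as a function of the alarm location and the scope a diagnostic item covers. The `scope` labels, the `topology` table, and device B port 3 as the peer are assumptions introduced for illustration; the text only states that port 1 of device A and its connected peer port are selected.

```python
def locate(scope, alarm_location, topology):
    """Return the ports a diagnostic item should inspect.

    `scope` is a hypothetical label derived from the diagnostic content:
    "local", "peer", or "both_ends". `topology` maps a (device, port) pair
    to the (device, port) pair at the other end of the link.
    """
    peer = topology.get(alarm_location)
    if scope == "local":
        return [alarm_location]
    if scope == "peer":
        return [peer] if peer else []
    if scope == "both_ends":
        return [alarm_location] + ([peer] if peer else [])
    raise ValueError(f"unknown scope: {scope}")


# Port 1 of device A is linked to a hypothetical port 3 of device B.
TOPOLOGY = {("A", 1): ("B", 3), ("B", 3): ("A", 1)}
```

With this table, the "query working mode of ports at both ends" item (scope `both_ends`) resolves to port 1 of device A plus its peer port, while the "query loopback state of peer port" item (scope `peer`) resolves to the peer port alone.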

After obtaining the location information of the first diagnostic item, the control device may use it as the query parameter of the diagnostic interface of the first diagnostic item in order to acquire the state information of the first diagnostic item.

It should be noted that the control device may send a diagnostic instruction directly to the corresponding device through the diagnostic interface of the first diagnostic item in order to acquire the state information.

Optionally, the control device may instead display, through the diagnostic interface of the first diagnostic item, a prompt for acquiring the state information of the first diagnostic item, so that the user can obtain the corresponding state information according to the prompt and enter it into the control device. The control device then receives the state information of the first diagnostic item entered by the user.

For example, if the location information of the first diagnostic item includes the address of a network device and a port number, the control device may send a diagnostic instruction, through the diagnostic interface of the first diagnostic item, to the network device indicated by that address. The diagnostic instruction may carry the port number from the location information. After receiving the diagnostic instruction, the network device queries the state information of the corresponding port according to the port number and returns it to the control device. After receiving the state information returned by the network device, the control device uses it as the state information of the first diagnostic item.
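The request and reply exchange can be sketched with the transport injected as a callable, since the text does not specify a protocol. The instruction fields (`type`, `item`, `port`), the reply field `status`, and the `send` signature are all hypothetical.

```python
def query_status(item, location, send):
    """Ask the device at location["address"] for the state of one port.

    `send(address, instruction)` is a caller-supplied transport function that
    delivers the instruction and returns the device's reply as a dict.
    """
    instruction = {
        "type": "diagnose",        # hypothetical message type
        "item": item,              # diagnostic item being queried
        "port": location["port"],  # port number carried in the instruction
    }
    reply = send(location["address"], instruction)
    # The returned state becomes the state information of the diagnostic item.
    return reply["status"]
```

Injecting `send` keeps the sketch independent of any management protocol and lets the same function be exercised with a stub in tests.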

For each diagnostic item corresponding to the first candidate fault item, the control device may acquire its state information in the same way as described above for the first diagnostic item.

Further, the control device may process each of the one or more candidate fault items of the alarm event in the same way as the first candidate fault item, thereby obtaining the state information of the diagnostic items corresponding to each candidate fault item.

After obtaining the state information of the diagnostic items corresponding to each candidate fault item, the control device may determine, from the one or more candidate fault items, the target fault item, namely the candidate whose diagnostic item has state information inconsistent with the preset state information of that diagnostic item. The control device then generates the fault information corresponding to the alarm event according to the target fault item and its repair method.

For each diagnostic item, the control device can store preset status information, which is the status information of that diagnostic item when the device is operating normally. The control device can therefore compare the obtained status information of each diagnostic item with the preset status information of that diagnostic item. If the two differ, the status of the diagnostic item is abnormal, and the candidate fault item corresponding to that diagnostic item is taken as the target fault item. The target fault item is the real fault item corresponding to the alarm event.
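The comparison described above amounts to a simple filter over the candidate fault items. The following is a minimal sketch, assuming that status information can be represented as plain strings and that each candidate fault item maps to a list of diagnostic item names; the function name, the dictionary layout, and the sample values are hypothetical.

```python
def find_target_fault_items(candidates, observed, preset):
    """Return candidate fault items with at least one abnormal diagnostic item.

    candidates: maps fault item name -> list of diagnostic item names
    observed:   maps diagnostic item name -> state obtained via its interface
    preset:     maps diagnostic item name -> state expected in normal operation
    """
    targets = []
    for fault_item, diagnostic_items in candidates.items():
        # A diagnostic item is abnormal when its observed state differs
        # from the preset (normal-operation) state.
        if any(observed[item] != preset[item] for item in diagnostic_items):
            targets.append(fault_item)
    return targets


candidates = {
    "optical module fault": ["port_status"],
    "board overheating": ["board_temperature"],
}
observed = {"port_status": "down", "board_temperature": "normal"}
preset = {"port_status": "up", "board_temperature": "normal"}
targets = find_target_fault_items(candidates, observed, preset)
print(targets)
```

Only the fault item whose diagnostic item deviates from its preset state is kept; candidates whose diagnostic items all match their preset states are discarded.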

After determining the real fault item corresponding to the alarm event, the control device generates the fault information of the alarm event according to the real fault item. The fault information includes the real fault item corresponding to the alarm event. In addition, the control device can obtain, from the alarm description text, the processing steps or repair method corresponding to the real fault item, and include those processing steps or that repair method in the fault information of the alarm event.
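One possible shape for the resulting fault information is sketched below. The `FaultInfo` record, the `build_fault_info` helper, and the `repair_index` mapping (fault item to repair steps, assumed to have been extracted from the alarm description text beforehand) are illustrative assumptions, not structures defined by the application.

```python
from dataclasses import dataclass, field


@dataclass
class FaultInfo:
    """Fault information reported for one alarm event."""
    alarm_event_id: str
    fault_items: list[str]
    repair_methods: dict[str, list[str]] = field(default_factory=dict)


def build_fault_info(alarm_event_id, real_fault_items, repair_index):
    """Attach repair steps, where known, to the real fault items.

    repair_index is a hypothetical mapping from fault item to the processing
    steps found for it in the alarm description text.
    """
    repairs = {
        item: repair_index[item]
        for item in real_fault_items
        if item in repair_index
    }
    return FaultInfo(alarm_event_id, list(real_fault_items), repairs)


repair_index = {
    "optical module fault": [
        "Reseat the optical module.",
        "Replace the optical module.",
    ],
}
info = build_fault_info("event-001", ["optical module fault"], repair_index)
print(info)
```

A fault item without a known repair method is still reported; it simply has no entry in `repair_methods`.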

Optionally, in some possible implementations, after generating the fault information, the control device can send it to an upper-layer operation and maintenance device, which displays it so that operation and maintenance personnel can carry out the corresponding repair.

Alternatively, in some possible implementations, after obtaining the fault information of the alarm event, the control device can automatically carry out the repair according to the repair method or processing steps included in the fault information. This is not limited in the embodiments of the present application.

In the embodiments of the present application, the information of each alarm in an alarm event is processed by an NLP model trained on alarm description text to obtain one or more candidate fault causes of the alarm event. The candidate fault causes are obtained without diagnostic rules being set manually in advance, and the fault information corresponding to the alarm event is then determined from them, which makes the process intelligent and efficient. When new alarms are added later, it is only necessary to add the corresponding alarm description text and train the NLP model on it in order to analyze and process those alarms. No rule code needs to be redeveloped, so the workload is small and the maintenance cost is low.

Second, in the embodiments of the present application, the fault item corresponding to an alarm event is identified by an NLP model. Because the NLP model is trained on alarm description text, it can still produce the correct result even if the alarm information in the alarm event deviates in its natural-language wording. In the related art, by contrast, once the wording of the alarm information deviates, manually defined rules can no longer derive the corresponding fault item. The method provided by the embodiments of the present application is therefore more widely applicable and more stable.

Third, in the embodiments of the present application, after the information of the multiple alarms included in an alarm event is processed by the NLP model, a diagnostic chain of the alarm event can be obtained. The diagnostic chain includes the multiple alarms and one or more fault items, and it represents the logical relationships among the alarms and between the alarms and the fault items. As a result, the reasoning that links alarms to fault items is highly interpretable.
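To make the idea of a diagnostic chain concrete, the sketch below models it as a small directed graph in which an edge points from an alarm to the alarm or fault item that causes it. The application does not prescribe a data structure, so the `DiagnosticChain` class, its method names, and the sample alarms are assumptions made for illustration.

```python
from collections import defaultdict


class DiagnosticChain:
    """Directed graph linking alarms to the alarms or fault items behind them."""

    def __init__(self):
        self._causes = defaultdict(set)  # node -> nodes that cause it
        self._fault_items = set()

    def add_cause(self, effect, cause, is_fault_item=False):
        # Record that `cause` explains `effect`; mark fault items explicitly.
        self._causes[effect].add(cause)
        if is_fault_item:
            self._fault_items.add(cause)

    def candidate_fault_items(self):
        """Fault items in the chain serve as the candidate fault items."""
        return sorted(self._fault_items)

    def explain(self, alarm):
        """Return every path from the alarm down to a fault item."""
        paths = []
        stack = [[alarm]]
        while stack:
            path = stack.pop()
            node = path[-1]
            if node in self._fault_items:
                paths.append(path)
                continue
            for cause in sorted(self._causes.get(node, ())):
                if cause not in path:  # guard against cycles
                    stack.append(path + [cause])
        return paths


chain = DiagnosticChain()
chain.add_cause("service interrupted", "link down")
chain.add_cause("link down", "optical module fault", is_fault_item=True)
print(chain.candidate_fault_items())
print(chain.explain("service interrupted"))
```

Each path returned by `explain` is a human-readable line of reasoning from an alarm to a fault item, which is the kind of interpretability the paragraph above refers to.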

Finally, in the embodiments of the present application, the processing steps in the alarm description text can be analyzed in advance by a second NLP model to obtain the corresponding diagnostic items, and the diagnostic interface corresponding to each diagnostic item can then be generated automatically. When the candidate fault items corresponding to an alarm event are subsequently diagnosed, the automatically generated diagnostic interfaces can be called directly, which makes the whole diagnosis process intelligent and efficient.
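The following sketch illustrates how diagnostic interfaces might be generated from processing steps. Note the substitution: the second NLP model is replaced here by a small keyword lookup (`extract_diagnostic_item`) purely so that the example is self-contained, which is not how the application identifies diagnostic items. The item names, the patterns, and the `query_backend` callable are likewise hypothetical.

```python
import re


def extract_diagnostic_item(processing_step):
    """Stand-in for the second NLP model: a keyword lookup, not a real model."""
    patterns = {
        "port_status": r"\bport\b.*\bstat(e|us)\b",
        "board_temperature": r"\btemperature\b",
    }
    for item, pattern in patterns.items():
        if re.search(pattern, processing_step, re.IGNORECASE):
            return item
    return None


def generate_diagnostic_interface(item, query_backend):
    """Create a callable that queries the state of one diagnostic item."""
    def diagnostic_interface(location):
        return query_backend(item, location)
    diagnostic_interface.__name__ = f"diagnose_{item}"
    return diagnostic_interface


def build_interfaces(processing_steps, query_backend):
    """Map each recognized diagnostic item to its generated interface."""
    interfaces = {}
    for step in processing_steps:
        item = extract_diagnostic_item(step)
        if item is not None and item not in interfaces:
            interfaces[item] = generate_diagnostic_interface(item, query_backend)
    return interfaces


def fake_backend(item, location):
    # Hypothetical backend that just echoes what would be queried.
    return f"{item}@{location}"


steps = ["Check whether the port status is Up.", "Check the board temperature."]
interfaces = build_interfaces(steps, fake_backend)
print(sorted(interfaces))
print(interfaces["port_status"]("10.0.0.1:2"))
```

Because the interfaces are built once from the description text, diagnosing a candidate fault item later reduces to looking up its diagnostic item and calling the stored function.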

Next, the alarm processing apparatus provided by the embodiments of the present application is described.

Referring to FIG. 5, an embodiment of the present application provides an alarm processing apparatus 500. The apparatus 500 includes an acquisition module 501, a processing module 502, and a determining module 503.

The acquisition module 501 is configured to perform step 301 in the foregoing embodiment.

The processing module 502 is configured to perform step 302 in the foregoing embodiment.

The determining module 503 is configured to perform step 303 in the foregoing embodiment.

The acquisition module 501, the processing module 502, and the determining module 503 can be implemented by the processor in the computer device shown in FIG. 2.
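The division into three modules can be pictured as three callables composed in sequence. The sketch below is a hypothetical arrangement: the class name `AlarmProcessingApparatus`, the `handle` method, and the toy module bodies (in particular the keyword check standing in for the NLP model) are assumptions used only to show how the modules hand data to one another.

```python
class AlarmProcessingApparatus:
    """Composes acquisition, processing, and determining modules in sequence."""

    def __init__(self, acquisition_module, processing_module, determining_module):
        self.acquisition_module = acquisition_module
        self.processing_module = processing_module
        self.determining_module = determining_module

    def handle(self, raw_alarms):
        # Acquire the alarm event, derive candidate fault items from it,
        # then determine the fault information from those candidates.
        alarm_event = self.acquisition_module(raw_alarms)
        candidates = self.processing_module(alarm_event)
        return self.determining_module(alarm_event, candidates)


def acquire(raw_alarms):
    return {"alarms": list(raw_alarms)}


def process(alarm_event):
    # Placeholder for the first NLP model: a keyword check, not a real model.
    if any("link" in alarm for alarm in alarm_event["alarms"]):
        return ["optical module fault"]
    return []


def determine(alarm_event, candidates):
    return {"alarm_count": len(alarm_event["alarms"]), "fault_items": candidates}


apparatus = AlarmProcessingApparatus(acquire, process, determine)
result = apparatus.handle(["link down on port 2", "service interrupted"])
print(result)
```

Passing the modules in as callables reflects the later remark that the functions can be redistributed among different functional modules as needed.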

Optionally, the processing module 502 is mainly configured to:

take the information of each of the multiple alarms as input to the first NLP model and process the information of each of the multiple alarms through the first NLP model to obtain a diagnostic chain of the alarm event, where the diagnostic chain of the alarm event includes the multiple alarms and one or more fault items, and the diagnostic chain represents the logical relationships among the multiple alarms and the one or more fault items; and

take the one or more fault items included in the diagnostic chain as the one or more candidate fault items corresponding to the alarm event.

Optionally, the apparatus 500 is further configured to:

obtain the alarm description text; and

train a first initial NLP network according to the alarm explanation information and alarm causes respectively corresponding to multiple sample alarms in the alarm description text, to obtain the first NLP model.
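The training step pairs each sample alarm's explanation with its cause. The sketch below shows only that input and output shape, using a deliberately simple token-overlap matcher in place of an NLP network; it is not the model described in the application. The class `ToyAlarmCauseModel`, its `fit` and `predict` methods, and the sample texts are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyAlarmCauseModel:
    """Token-overlap matcher standing in for a trained NLP model."""

    def fit(self, samples):
        # samples: list of (alarm explanation text, alarm cause) pairs
        self._examples = [
            (Counter(tokenize(text)), cause) for text, cause in samples
        ]
        return self

    def predict(self, alarm_info):
        query = Counter(tokenize(alarm_info))

        def overlap(example):
            tokens, _ = example
            # Multiset intersection counts the shared tokens.
            return sum((tokens & query).values())

        _, best_cause = max(self._examples, key=overlap)
        return best_cause


samples = [
    ("The optical power received on the port is too low", "optical module fault"),
    ("The board temperature exceeds the threshold", "board overheating"),
]
model = ToyAlarmCauseModel().fit(samples)
prediction = model.predict("port receive optical power low")
print(prediction)
```

Even this crude matcher tolerates some rewording of the alarm text, which hints at why a learned model can cope with wording deviations that break exact-match rules; a real implementation would use a trained language model rather than token counts.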

Optionally, the determining module 503 includes:

a calling submodule, configured to call the diagnostic interface of the diagnostic item corresponding to each of the one or more candidate fault items;

an obtaining submodule, configured to obtain the status information of the corresponding diagnostic item through the diagnostic interface of the diagnostic item corresponding to each candidate fault item; and

a determining submodule, configured to determine the fault information corresponding to the alarm event according to the status information of the diagnostic item corresponding to each candidate fault item.

Optionally, the apparatus 500 is further configured to:

identify, through a second NLP model, the processing steps corresponding to each of the multiple sample alarms contained in the alarm description text, to obtain the diagnostic item corresponding to each processing step; and

generate the diagnostic interface of the diagnostic item corresponding to each processing step.

Optionally, the determining submodule is mainly configured to:

determine, from the one or more candidate fault items, a target fault item for which the status information of a corresponding diagnostic item is inconsistent with the preset status information of that diagnostic item; and

generate the fault information corresponding to the alarm event according to the target fault item and the repair method corresponding to the target fault item.

Optionally, the alarm description text includes at least one of a product information text, an alarm manual, and a fault maintenance experience text.

In summary, in the embodiments of the present application, the information of each alarm in an alarm event is processed by an NLP model trained on alarm description text to obtain one or more candidate fault causes of the alarm event. The candidate fault causes are obtained without diagnostic rules being set manually in advance, and the fault information corresponding to the alarm event is then determined from them, which makes the process intelligent and efficient. When new alarms are added later, it is only necessary to add the corresponding alarm description text and train the NLP model on it in order to analyze and process those alarms. No rule code needs to be redeveloped, so the workload is small and the maintenance cost is low.

It should be noted that when the alarm processing apparatus provided in the foregoing embodiment processes alarms, the division into the functional modules described above is only an example. In practical applications, the functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the alarm processing apparatus provided in the foregoing embodiment and the alarm processing method embodiments belong to the same concept. For the specific implementation process, see the method embodiments; details are not repeated here.

The foregoing embodiments can be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation can take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium can be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Those of ordinary skill in the art will understand that all or part of the steps of the foregoing embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing is not intended to limit the embodiments of the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.

Claims (15)

1. An alarm processing method, characterized in that the method comprises:
acquiring an alarm event, wherein the alarm event comprises information of a plurality of alarms;
processing the information of each alarm in the plurality of alarms through a first natural language processing (NLP) model to obtain one or more candidate fault items corresponding to the alarm event, wherein the first NLP model is obtained through alarm description text training;
and determining fault information corresponding to the alarm event according to one or more candidate fault items corresponding to the alarm event.
2. The method according to claim 1, wherein the processing information of each of the plurality of alarms through the first NLP model to obtain one or more candidate fault items corresponding to the alarm event comprises:
taking information of each alarm in the plurality of alarms as input of the first NLP model, and processing the information of each alarm in the plurality of alarms through the first NLP model to obtain a diagnosis chain of the alarm event, where the diagnosis chain of the alarm event includes the plurality of alarms and one or more fault items, and the diagnosis chain is used to characterize a logical relationship between the plurality of alarms and the one or more fault items;
and taking one or more fault items contained in the diagnosis chain as one or more candidate fault items corresponding to the alarm event.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
acquiring the alarm description text;
and training a first initial NLP network according to alarm explanation information and alarm reasons respectively corresponding to a plurality of sample alarms in the alarm description text to obtain the first NLP model.
4. The method according to claim 1, wherein the determining fault information corresponding to the alarm event according to one or more candidate fault items of the alarm event comprises:
calling a diagnosis interface of a diagnosis item corresponding to each candidate fault item in the one or more candidate fault items;
acquiring the state information of the corresponding diagnosis item through the diagnosis interface of the diagnosis item corresponding to each candidate fault item;
and determining the fault information corresponding to the alarm event according to the state information of the diagnosis item corresponding to each candidate fault item.
5. The method of claim 4, further comprising:
identifying the processing step corresponding to each sample alarm in the plurality of sample alarms contained in the alarm description text through a second NLP model to obtain a diagnosis item corresponding to the corresponding processing step;
and generating a diagnosis interface of the diagnosis item corresponding to each processing step.
6. The method according to claim 4 or 5, wherein the determining fault information corresponding to the alarm event according to the state information of each candidate fault item comprises:
determining a target fault item, of which the state information of the corresponding diagnosis item is inconsistent with the preset state information of the corresponding diagnosis item, from the one or more candidate fault items;
and generating fault information corresponding to the alarm event according to the target fault item and the repair method corresponding to the target fault item.
7. The method of claim 1, 3 or 5, wherein the alarm description text comprises at least one of a product information text, an alarm manual, and a fault maintenance experience text.
8. An alarm processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire an alarm event, wherein the alarm event comprises information of a plurality of alarms;
the processing module is used for processing the information of each alarm in the plurality of alarms through a first Natural Language Processing (NLP) model to obtain one or more candidate fault items corresponding to the alarm event, and the first NLP model is obtained through alarm description text training;
and the determining module is used for determining the fault information corresponding to the alarm event according to the one or more candidate fault items corresponding to the alarm event.
9. The apparatus of claim 8, wherein the processing module is configured to:
taking information of each alarm in the plurality of alarms as input of the first NLP model, and processing the information of each alarm in the plurality of alarms through the first NLP model to obtain a diagnosis chain of the alarm event, where the diagnosis chain of the alarm event includes the plurality of alarms and one or more fault items, and the diagnosis chain is used to characterize a logical relationship between the plurality of alarms and the one or more fault items;
and taking one or more fault items contained in the diagnosis chain as one or more candidate fault items corresponding to the alarm event.
10. The apparatus of claim 8 or 9, wherein the apparatus is further configured to:
acquiring the alarm description text;
and training the first initial NLP network according to the alarm explanation information and the alarm reason which respectively correspond to the plurality of sample alarms and are included in the alarm description text to obtain the first NLP model.
11. The apparatus of claim 8, wherein the determining module comprises:
the calling submodule is used for calling a diagnosis interface of a diagnosis item corresponding to each candidate fault item in the one or more candidate fault items;
the acquisition submodule is used for acquiring the state information of the corresponding diagnosis item through the diagnosis interface of the diagnosis item corresponding to each candidate fault item;
and the determining submodule is used for determining the fault information corresponding to the alarm event according to the state information of the diagnosis item corresponding to each candidate fault item.
12. The apparatus of claim 11, wherein the apparatus is further configured to:
identifying the processing step corresponding to each sample alarm in the plurality of sample alarms contained in the alarm description text through a second NLP model to obtain a diagnosis item corresponding to the corresponding processing step;
and generating a diagnosis interface of the diagnosis item corresponding to each processing step.
13. The apparatus of claim 11 or 12, wherein the determination submodule is configured to:
determining a target fault item, of which the state information of the corresponding diagnosis item is inconsistent with the preset state information of the corresponding diagnosis item, from the one or more candidate fault items;
and generating fault information corresponding to the alarm event according to the target fault item and the repair method corresponding to the target fault item.
14. The apparatus of claim 8, 10 or 12, wherein the alarm description text comprises at least one of a product information text, an alarm manual, and a fault maintenance experience text.
15. A computer-readable storage medium having stored thereon instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.
CN202110725324.1A 2021-06-29 2021-06-29 Alarm processing method, device and storage medium Pending CN115544202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110725324.1A CN115544202A (en) 2021-06-29 2021-06-29 Alarm processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110725324.1A CN115544202A (en) 2021-06-29 2021-06-29 Alarm processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115544202A true CN115544202A (en) 2022-12-30

Family

ID=84717313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110725324.1A Pending CN115544202A (en) 2021-06-29 2021-06-29 Alarm processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115544202A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932148A (en) * 2023-09-19 2023-10-24 山东浪潮数据库技术有限公司 Problem diagnosis system and method based on AI
CN116932148B (en) * 2023-09-19 2024-01-19 山东浪潮数据库技术有限公司 Problem diagnosis system and method based on AI

Similar Documents

Publication Publication Date Title
Chen et al. Automatic root cause analysis via large language models for cloud incidents
US12333398B2 (en) Integration optimization using machine learning algorithms
RU2682018C2 (en) Identification of options for troubleshooting to detect network failures
US11348023B2 (en) Identifying locations and causes of network faults
US11138058B2 (en) Hierarchical fault determination in an application performance management system
US20140040916A1 (en) Automatic event correlation in computing environments
CN114285725A (en) Network fault determination method and device, storage medium and electronic equipment
CN115037597A (en) Fault detection method and equipment
CN114172785B (en) Alarm information processing method, device, equipment and storage medium
CN117041029A (en) Network equipment fault processing method and device, electronic equipment and storage medium
US12284089B2 (en) Alert correlating using sequence model with topology reinforcement systems and methods
CN110851471A (en) Distributed log data processing method, device and system
KR20190001501A (en) Artificial intelligence operations system of telecommunication network, and operating method thereof
CN112311574A (en) Method, device and equipment for checking network topology connection
CN116662058A (en) Method, device, equipment and storage medium for constructing fault propagation relationship
CN116841902A (en) Health state checking method, device, equipment and storage medium
CN114756301A (en) Log processing method, device and system
CN115544202A (en) Alarm processing method, device and storage medium
CN114385398A (en) Request response state determination method, device, equipment and storage medium
CN115102836A (en) Network equipment failure analysis method, device and storage medium
CN117827784A (en) Noise log filtering method and system
CN119272037A (en) Data anomaly detection model training method, data anomaly detection method and device
CN113626288A (en) Fault processing method, system, device, storage medium and electronic equipment
CN111290870B (en) Method and device for detecting abnormality
Kannan et al. A differential approach for configuration fault localization in cloud environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination