WO2024066346A1 - Alarm processing method and apparatus, and storage medium and electronic apparatus - Google Patents

Alarm processing method and apparatus, and storage medium and electronic apparatus

Info

Publication number
WO2024066346A1
Authority
WO
WIPO (PCT)
Prior art keywords
alarm
processed
processing
alarms
alarm information
Prior art date
Application number
PCT/CN2023/091861
Other languages
French (fr)
Chinese (zh)
Inventor
王超
彭浩宇
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2024066346A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/04 Arrangements for maintaining operational condition

Definitions

  • the embodiments of the present disclosure relate to the field of communications, and in particular, to an alarm processing method, device, storage medium, and electronic device.
  • the embodiments of the present disclosure provide an alarm processing method, device, storage medium and electronic device to at least solve the problem in the related art that operation and maintenance personnel handle alarms by relying on alarm root cause reports and their own experience, which, given the large number of alarms, leads to high personnel costs and makes manual processing difficult.
  • a method for processing an alarm comprising:
  • the alarm information is screened, the alarm information to be manually processed is eliminated, and the alarms to be processed are obtained;
  • the pending alarm is processed according to the alarm solution.
  • an alarm processing device comprising:
  • a first determination module is configured to determine a root cause of a fault in the alarm information, and determine a user intention based on the root cause of the fault;
  • the screening module is configured to, when the user intention is to handle the fault, screen the alarm information and remove the alarm information to be manually processed, so as to obtain the alarms to be processed;
  • a second determination module is configured to determine an alarm solution for the alarm to be processed
  • the processing module is configured to process the to-be-processed alarm according to the alarm solution.
  • a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps of any of the above method embodiments when running.
  • an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • FIG1 is a hardware structure block diagram of a base station device of an alarm processing method according to an embodiment of the present disclosure
  • FIG2 is a flow chart of an alarm processing method according to an embodiment of the present disclosure.
  • FIG3 is a schematic diagram of automated alarm processing based on machine learning in the intent network domain according to an embodiment of the present disclosure
  • FIG4 is a schematic diagram of intent capture according to this embodiment.
  • FIG5 is a schematic diagram of alarm screening according to this embodiment.
  • FIG6 is a schematic diagram of a script engine according to this embodiment.
  • FIG7 is a flow chart of determining an alarm solution according to the present embodiment.
  • FIG8 is a block diagram of an alarm processing apparatus according to an embodiment of the present disclosure.
  • FIG1 is a hardware structure block diagram of the base station device of the alarm processing method of the embodiment of the present disclosure.
  • the base station device may include one or more (only one is shown in FIG1) processors 102 (the processor 102 may include but is not limited to a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, wherein the above-mentioned base station device may also include a transmission device 106 and an input and output device 108 for communication functions.
  • the structure shown in FIG1 is only for illustration, and it does not limit the structure of the above-mentioned base station device.
  • the base station device may also include more or fewer components than those shown in FIG1, or have a configuration different from that shown in FIG1.
  • the memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the alarm processing method in the embodiment of the present disclosure.
  • the processor 102 executes various functional applications and service chain address pool slice processing by running the computer program stored in the memory 104, that is, to implement the above method.
  • the memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory remotely arranged relative to the processor 102, and these remote memories may be connected to the base station device via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the transmission device 106 is used to receive or send data via a network.
  • the specific example of the above network may include a wireless network provided by a communication provider of a base station device.
  • the transmission device 106 includes a network adapter (Network Interface Controller, referred to as NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 can be a radio frequency (Radio Frequency, referred to as RF) module, which is used to communicate with the Internet wirelessly.
  • FIG. 2 is a flow chart of the alarm processing method according to an embodiment of the present disclosure. As shown in FIG. 2 , the process includes the following steps:
  • Step S202 determining the root cause of the fault in the alarm information, and determining the user intention according to the root cause of the fault;
  • the user intent capture in the above step S202 obtains diagnostic information, provides one-click diagnosis, and analyzes the root cause of the alarm. It relies on the single-board diagnostic data of actual network elements and on existing processed alarm data, so the sample is sufficient and keeps growing over time. Combined with the alarm types and processing schemes reported by actual network elements, result-driven model training adjusts the weight of each diagnostic factor in the selection of a network element alarm processing scheme. A comprehensive analysis is then performed across the dimensions that affect base station operation, such as diagnostic data and environmental data, to determine the root cause of the fault and accurately capture the user intent.
  • Step S204 when the user intends to handle the fault, the alarm information is screened, and the alarm information to be manually handled is eliminated to obtain the alarms to be handled;
  • Step S206 determining an alarm solution for the alarm to be processed
  • Step S208 Process the pending alarm according to the alarm solution.
  • the above-mentioned step S208 may specifically include: adjusting the script execution order of multiple solutions based on the priorities of multiple solutions included in the alarm solution; calling the script framework corresponding to the network management system to generate alarm processing use cases corresponding to the multiple solutions; and executing the alarm processing use cases corresponding to the multiple solutions in sequence according to the script execution order until one of the multiple solutions is successfully executed.
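  • The execution order described for step S208 can be sketched as follows. This is a minimal illustration under stated assumptions, not the script framework of the network management system: `Solution`, `run_until_success`, and `make_script` are hypothetical names, the scripts are stand-in callables, and a larger priority value is assumed to mean "run earlier".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Solution:
    name: str                   # hypothetical label for one processing solution
    priority: float             # assumption: larger value means run earlier
    script: Callable[[], bool]  # returns True when the alarm is cleared


def run_until_success(solutions: List[Solution]) -> Optional[str]:
    """Run scripts in descending priority and stop at the first success."""
    for solution in sorted(solutions, key=lambda s: s.priority, reverse=True):
        if solution.script():
            return solution.name
    return None  # nothing worked; the alarm would go to manual handling


attempted: List[str] = []


def make_script(name: str, ok: bool) -> Callable[[], bool]:
    """Build a stand-in script that records its call and reports a fixed result."""
    def script() -> bool:
        attempted.append(name)
        return ok
    return script


solutions = [
    Solution("reset_board", 0.2, make_script("reset_board", True)),
    Solution("restart_link", 0.7, make_script("restart_link", False)),
    Solution("reload_config", 0.5, make_script("reload_config", True)),
]
winner = run_until_success(solutions)
```

  • In this example the highest-priority script fails, the second one succeeds, and the lowest-priority script is never run.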
  • Alarm processing suggestions are programmable commands and scripts.
  • the scripts are developed, submitted, and shared by operation and maintenance personnel. Based on the success rate of past alarm processing, weight features of relevant solutions (including execution order) are assigned, and machine learning models are used to generate alarm solutions. Based on diagnostic data and weights, the priority of the solutions is evaluated, and the execution order of relevant scripts is adjusted to improve execution efficiency and the success rate of alarm processing.
  • the corresponding script framework of the network management system is called to dynamically generate automated alarm processing use cases, and they are executed in sequence.
  • an Isolation Forest (iForest) is used in combination with a one-class support vector machine (OC-SVM): the model is trained on normal data after abnormal noise points have been removed, and is then used to find anomalies in new data.
  • the model constructed by the algorithm performs the initial screening of alarms: current abnormal alarms are removed and handed over to manual processing, and alarms that can be processed automatically are retained.
  • the above step S204 can specifically include: using the Isolation Forest algorithm in combination with the OC-SVM model to screen the alarm information, remove the alarm information to be manually processed, and screen out the alarms to be processed. Furthermore, diagnostic data and environmental data are collected from the alarm information.
  • diagnostic data and environmental data on each board in the corresponding baseband unit (BBU) and remote radio unit (RRU) are collected from the alarm information; the diagnostic data and environmental data are used to form an N-dimensional scatter plot; the degree of alienation between the scattered points in the scatter plot is calculated using the isolation forest algorithm, and abnormal scattered points are removed according to the alienation degree to obtain an N-dimensional preliminary screening scatter plot; based on the OC-SVM model, the points to be manually processed are removed from the preliminary screening scatter plot.
  • the environmental data is subjected to dimensionality reduction processing, and the preset type features are eliminated according to the actual status of the current network element obtained to obtain the processed environmental data;
  • the diagnostic data is divided into hardware analysis data and software analysis data;
  • the hardware analysis data, software analysis data and processed environmental data are used as feature values for dimensionality reduction to form a target scatter plot;
  • the target scatter plot is screened for the second time based on the OC-SVM model to obtain the alarms to be processed.
  • the above-mentioned secondary screening of the target scatter plot based on the OC-SVM model to obtain the alarm to be processed may specifically include: determining the position of the sphere where the cluster is located in the N-dimensional space of the above-mentioned target scatter plot and calculating the radius of the sphere; if the scatter points corresponding to the alarm information exceed the radius position, the alarm information is judged to be alarm information to be manually processed; and the alarm information to be manually processed is eliminated from the target scatter plot to obtain the alarm to be processed.
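  • The radius test described above can be illustrated with a deliberately simplified sketch. It is not a trained OC-SVM: the sphere here is assumed to be the centroid of the normal points with a radius equal to the largest training distance, in the input space and without a kernel. The names `fit_sphere` and `needs_manual_handling` and the sample coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def fit_sphere(points: List[Point]) -> Tuple[List[float], float]:
    """Return (center, radius) of a sphere that encloses all training points."""
    dims = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def needs_manual_handling(point: Point, center: Point, radius: float) -> bool:
    """A point outside the sphere is treated as an alarm for manual handling."""
    return math.dist(point, center) > radius


# Scatter points of alarms that were handled automatically in the past.
normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = fit_sphere(normal)

# New alarms: keep only those whose scatter point lies inside the sphere.
alarms = {"a1": (1.0, 1.5), "a2": (6.0, 6.0)}
to_process = [key for key, point in alarms.items()
              if not needs_manual_handling(point, center, radius)]
```

  • Here alarm `a1` falls inside the sphere and is kept, while `a2` lies beyond the radius and would be handed to manual processing.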
  • the above-mentioned step S206 may specifically include: inputting the alarm to be processed into a pre-trained target integrated alarm decision tree, and obtaining multiple solutions and corresponding priorities output by the target integrated alarm decision tree, wherein the target integrated alarm decision tree is based on the processing success rate of the processed alarms, assigns weights of the solutions corresponding to the processed alarms, and is trained based on the training data generated by the processed alarms and the corresponding weights, and the above-mentioned alarm solutions include multiple solutions.
  • the method also includes: dividing the training data to form multiple alarm decision trees, and using decision tree pruning to prune some marginal results of the multiple alarm decision trees to obtain multiple target alarm decision trees; using a random forest algorithm to combine the multiple target alarm decision trees to obtain an integrated alarm decision tree; and performing overfitting-mitigation processing on the integrated alarm decision tree to obtain a target integrated alarm decision tree.
  • the method further includes: counting the processing success rate of the alarms to be processed; adjusting the weights of the solutions corresponding to the processed alarms in the above training set according to the processing success rate; and updating the above target integrated alarm decision tree according to the adjusted training set.
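  • The feedback step above, counting success rates and adjusting the training weights, could look roughly like this. The record layout (`solution`, `weight` keys), the function names, and the choice to use the raw success rate as the new weight are all assumptions made for illustration; the text does not specify the update rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rates(history: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Success rate per solution from (solution_name, cleared) records."""
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for name, cleared in history:
        attempts[name] += 1
        successes[name] += int(cleared)
    return {name: successes[name] / attempts[name] for name in attempts}


def reweight(samples: List[dict], rates: Dict[str, float]) -> List[dict]:
    """Copy the training samples, setting each weight to its solution's rate.

    Solutions with no recorded attempts keep their previous weight.
    """
    return [
        {**sample, "weight": rates.get(sample["solution"], sample["weight"])}
        for sample in samples
    ]


history = [
    ("restart_link", True),
    ("restart_link", False),
    ("reset_board", True),
    ("restart_link", True),
]
training_set = [
    {"solution": "restart_link", "weight": 0.5},
    {"solution": "replace_fan", "weight": 0.4},
]
rates = success_rates(history)
updated = reweight(training_set, rates)
```

  • The updated training set would then be used to retrain the integrated decision tree; that retraining step is outside this sketch.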
  • the success rate of clearing the alarm is counted, the newly added script execution order is recorded, and the training set weight of the machine learning algorithm is updated according to the success rate.
  • once the validity of the newly added execution order is verified, it will be released as a preset solution with the next version.
  • a text message or email will be forwarded to the user, allowing manual intervention to optimize the strategy or replace the hardware.
  • This embodiment can be integrated into a network management system, which is a telecommunication-class operation and maintenance management (Operation and Maintenance Management, referred to as OMM) system based on B/S communication agent components.
  • the OMM system manages no less than 15,000 base stations, and the system has no less than 2,000 preset alarms.
  • when a base station fails during daily operation and maintenance, the alarm is reported to the OMM system after conversion by the middleware. After observing the alarm reported by the corresponding facility, the operation and maintenance personnel process it.
  • FIG. 3 is a schematic diagram of automated alarm processing based on machine learning in the intent network domain according to an embodiment of the present disclosure, as shown in Figure 3, including:
  • Step 1 intent capture, according to the alarm design principle, use one-click diagnosis to analyze and capture the user's specific intention for this alarm from the alarm information of the OMM system. There may be multiple reasons for the generation of an alarm.
  • This step diagnoses the network or hardware to identify the root cause of the alarm, narrow the scope of fault location in the problem domain, and quickly determine the user's intention.
  • Figure 4 is a schematic diagram of intent capture according to this embodiment. As shown in Figure 4, taking the "Link between OMM and NE broken" alarm as an example, the transmission between the evolved Node B (eNB) and the network management (OMM) will go through multiple routes, and the first hop route (Gateway) from the network element to the network management is referred to as the network element first hop gateway.
  • By initiating diagnosis from the OMM, a ping test is sent to this node (the first-hop gateway). If the ping fails, the root cause of the fault is considered to be situation 2 in Figure 3: the gateway configured at the IP layer referenced by the network-management-to-network-element OMC channel is abnormal. This is judged to be a transmission problem, and the link from the Operations & Maintenance Center (OMC) to this device is disconnected.
  • If the ping succeeds, the gateway configured at the IP layer referenced by the network-management-to-network-element OMC channel is normal, so at least the link from the OMC to this device is normal. The next check is then for a transmission problem between the first hop and the network element, or a hardware problem in the network element itself.
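  • The two branches of this diagnosis can be written as a small decision function. This is only a sketch: `diagnose_link_broken` is a hypothetical name, the `ping` callable is injected so that no real network access is needed, and the address shown comes from a documentation-only range rather than from the source.

```python
from typing import Callable


def diagnose_link_broken(gateway_ip: str, ping: Callable[[str], bool]) -> str:
    """Classify a 'Link between OMM and NE broken' alarm via the first-hop gateway."""
    if not ping(gateway_ip):
        # Gateway unreachable: the OMC-to-gateway segment is the suspect.
        return "transmission fault between OMC and first-hop gateway"
    # Gateway reachable: look further down the path or at the NE itself.
    return "check first-hop-to-NE transmission or NE hardware"


# Stand-in for a real ping: only addresses in this set are "reachable".
reachable = {"192.0.2.1"}


def fake_ping(ip: str) -> bool:
    return ip in reachable


result_ok = diagnose_link_broken("192.0.2.1", fake_ping)
result_fail = diagnose_link_broken("192.0.2.254", fake_ping)
```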
  • the operation and maintenance of base station equipment in wireless systems is mainly carried out through alarms. Once an alarm occurs, field operation and maintenance personnel troubleshoot according to the alarm handling suggestions and their own experience. In practice, the skill level of front-line operation and maintenance personnel varies greatly, so accurate operation and maintenance prompts are particularly important for troubleshooting efficiency.
  • one-click diagnosis is used to obtain the root cause of the fault, which can accurately identify the fault behind the alarm and prepare for the next step of troubleshooting.
  • Intent capture means capturing in the system the state that the user wants the network to reach.
  • the OMM system takes into account various fault causes that induce alarms.
  • When the system reports an alarm, the mobile communication network is considered to have a corresponding fault. If the root cause of the alarm can be accurately identified, the user's intention is considered to have been captured.
  • Step 2 alarm screening, converts the intent into a set of configuration changes or network configurations that need to be executed, applies the algorithm model to preliminarily screen the reported alarms, and screens out the alarms that meet the requirements of automated processing. Before implementing intent analysis, it is necessary to analyze whether the acquired alarms meet the requirements of automated processing. In this step, the isolation forest algorithm is combined with the OC-SVM model to preliminarily screen the alarms, so as to identify the fault solution in the solution domain.
  • Isolation Forest is an unsupervised learning algorithm that can be used for anomaly detection, and is often used for outlier detection and singular value detection.
  • iForest is a method for removing outliers from training data. Unlike other anomaly detection algorithms that use quantitative indicators such as distance and density to characterize the degree of alienation between samples, iForest detects outliers through the isolation of sample points.
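  • To make the isolation idea concrete, the sketch below scores a point by how quickly random axis-aligned splits separate it from a sample of the data. It is a simplified stand-in for iForest, not a full implementation: instead of building trees once and reusing them, it follows only the branch that contains the query point, and all names (`c_factor`, `path_length`, `anomaly_score`) and data are hypothetical. It assumes the data set holds at least two points.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def c_factor(n: int) -> float:
    """Average path length of an unsuccessful binary-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(x: Point, data: List[Point], depth: int, limit: int,
                rng: random.Random) -> float:
    """Depth at which random splits isolate x within data."""
    if depth >= limit or len(data) <= 1:
        return depth + c_factor(len(data))
    dim = rng.randrange(len(x))
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:
        return depth + c_factor(len(data))
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as x.
    side = [p for p in data if (p[dim] < split) == (x[dim] < split)]
    return path_length(x, side, depth + 1, limit, rng)


def anomaly_score(x: Point, data: List[Point], trees: int = 100,
                  sample_size: int = 64, seed: int = 0) -> float:
    """Isolation-style score; values closer to 1 mean easier to isolate."""
    rng = random.Random(seed)
    n = min(sample_size, len(data))
    limit = math.ceil(math.log2(n))
    total = 0.0
    for _ in range(trees):
        sample = rng.sample(data, n)
        total += path_length(x, sample, 0, limit, rng)
    return 2.0 ** (-(total / trees) / c_factor(n))


cluster = [(i * 0.1, j * 0.1) for i in range(8) for j in range(8)]
data = cluster + [(10.0, 10.0)]
score_outlier = anomaly_score((10.0, 10.0), data)
score_inlier = anomaly_score((0.4, 0.4), data)
```

  • The isolated point is separated after very few splits and therefore receives a higher score than a point in the middle of the cluster.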
  • diagnostic data: voltage, link status, bit error, power, CPU occupancy, board temperature, etc.
  • environmental data: inlet and outlet temperature, fan speed, input voltage, etc.
  • FIG5 is a schematic diagram of alarm screening according to the present embodiment.
  • because the OC-SVM is sensitive to dimensionality, the environmental data must first undergo dimensionality reduction. Irrelevant features such as the clock state and the read/write speed are eliminated according to the actual state of the current network element.
  • the diagnostic data is then divided into hardware analysis data and software analysis data. These data are used as feature values, reduced in dimension, and then input into the OC-SVM for secondary screening.
  • a scatter plot in N-dimensional space is made according to the existing type dimensions, and the formula is used:
  • z is a new data point
  • K(z, z) is the inner product of z with itself
  • x_i is a vector in the training set
  • K(z, x_i) is the inner product of the point z and the corresponding x_i
  • R is the sphere radius of the OC-SVM.
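  • The formula itself is not reproduced in this text. One standard form that fits the symbol definitions above is the kernel distance test used in support vector data description, given here as an assumption rather than as the equation of the disclosure. The coefficients \alpha_i (the dual weights of the training vectors) are not named in the text and are introduced only for illustration.

```latex
\[
K(z, z) \;-\; 2\sum_{i} \alpha_i \, K(z, x_i) \;+\; \sum_{i}\sum_{j} \alpha_i \alpha_j \, K(x_i, x_j) \;\le\; R^2
\]
```

  • Under this reading, a new point z that satisfies the inequality lies inside the sphere and remains among the alarms to be processed; a point that violates it lies beyond the radius and is treated as an alarm for manual handling.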
  • After the initial screening of the alarms, and once the fault root cause analysis component has obtained the user's potential intention, the OMM system provides a series of predefined processing suggestions, weighted by their historical fault resolution success rate. If there is no manual intervention, the suggestion with the highest weight is selected and converted into specific steps, which are then associated with the script arrangement and execution described below. Take the "network element link broken" alarm as an example:
  • Step 3 policy execution, generates a predictive priority model through a machine learning algorithm based on the alarm handling solutions previously handled by the user and related diagnostic data, and automatically scripts and processes the predicted alarm handling solutions, outputs automated scripts, and executes network configuration changes.
  • the OMM system should have a REST interface based on XML or JSON encoding to support CLI (command line interface) and OPEN API (open application programming interface).
  • FIG6 is a schematic diagram of the script engine according to the present embodiment.
  • the script development and script designer (Open Script Designer, referred to as OSD) is an online script tool provided to engineering technicians and developers. Through this tool, script projects are developed, compiled and published to meet the needs of customized scripts. In addition to providing script compilation and publishing functions, the tool also provides auxiliary functions such as syntax checking, code blocks, automatic completion and online help.
  • the open script engine provides intelligent syntax prompts, a business Python SDK library, and a script layout designer, allowing developers to write scripts more conveniently and lowering the development threshold. A script can be marked as containing important operations, together with prompt information about the impact of those operations. When a script containing important operations is executed, a verification code and the prompt information are shown.
  • Script arrangement and execution: after finding the script in the Open Script Execution Engine (OSE) application list, you can arrange the script. For example, you can associate the "export network element parameter file script" with the "export alarm script". After execution is completed, you can download the corresponding attachments to your local computer.
  • Script management: scripts can be categorized and managed by tags so that they can be found quickly.
  • the scripts come with help files, output samples and other information, which can provide more detailed guidance on script use. All operation and maintenance personnel or customized development experts can develop automated scripts based on alarm processing suggestions and push the scripts to the server. After verifying the validity of the script, it will be sent out as a built-in script with future product versions.
  • the corresponding process can be divided into the following 16 processing solutions and 9 unitized processing cases. Based on the 16 judgment categories that can be used as decision tree categories (i.e., the corresponding priority content of the corresponding processing solutions can be judged), the corresponding unitized processing cases can be designed using the script designer to ensure that the script can correspond to the corresponding processing steps one by one.
  • the tree diagram construction solution is used to arrange the script and build the corresponding automatic alarm processing solution for subsequent execution.
  • the corresponding alarm type is confirmed, and by collecting the specific diagnostic data of the network element, a decision tree is generated to identify the network element alarm solution.
  • the decision tree is a top-down analysis method.
  • the data division rules are obtained first, and construction starts from the root node of the decision tree. Therefore, once the specific alarm type has been obtained, the network element diagnostic data is randomly divided into several subsets, and the data attributes are evaluated with reference to the Gini impurity. The lower the value, the fewer categories are mixed in the subset; a value of 0 indicates that all samples in the subset belong to the same category. On this basis, the Gini coefficient of the data set is calculated.
  • the formula is as follows:
  • D is the total number of samples
  • ci is the number of samples in the i-th category.
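  • The formula is not reproduced in this text. Assuming it is the usual Gini impurity, that is, one minus the sum over categories of the squared proportion c_i / D, it can be computed as in the sketch below. The function name `gini` and the sample counts are illustrative.

```python
from typing import Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini impurity from per-category sample counts c_i, with D = sum(counts)."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty subset is treated as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


pure = gini([10, 0])   # all samples in one category
mixed = gini([5, 5])   # two categories, evenly split
```

  • A subset containing a single category gives 0, matching the statement that a value of 0 means the subset and its category are consistent; an even two-way split gives 0.5.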
  • Data is divided according to the training data to form multiple decision trees.
  • Decision tree pruning is used to reduce some marginal results of the decision tree, and multiple decision tree classifiers are combined using the random forest algorithm to achieve an integrated decision tree classifier with better prediction effect.
  • the random forest method builds on bagging (bootstrap aggregating), that is, the idea of ensembling, which in effect samples both the training samples and the features, so overfitting can be mitigated.
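  • The two ingredients named here, bootstrap sampling and combining several classifiers, can be sketched in a few lines. The three "trees" below are hand-written stand-in rules, not trained decision trees, and the feature names and thresholds are hypothetical; only the sampling and majority-vote mechanics are the point.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Tree = Callable[[dict], str]


def bootstrap(rows: Sequence[dict], rng: random.Random) -> List[dict]:
    """Draw len(rows) rows with replacement (one bagging sample)."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees: List[Tree], features: dict) -> str:
    """Majority vote over the individual tree predictions."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Stand-in classifiers; in practice each would be trained on its own
# bootstrap sample and a random subset of the features.
trees: List[Tree] = [
    lambda f: "reset_board" if f["ping_ok"] else "restart_link",
    lambda f: "restart_link" if f["bit_error"] > 0.01 else "reset_board",
    lambda f: "replace_fan" if f["board_temp"] > 85 else "restart_link",
]

features = {"ping_ok": False, "bit_error": 0.001, "board_temp": 60}
decision = forest_predict(trees, features)

rows = [{"id": i} for i in range(5)]
sample = bootstrap(rows, random.Random(0))
```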
  • Figure 7 is a flow chart of the alarm solution determined according to the present embodiment.
  • the decision tree obtained after the overfitting-mitigation processing takes the acquired network element diagnostic data as input, predicts the alarm processing solution, and pre-generates an automated alarm processing use case so that alarms are handled promptly. The priority of the alarm processing solutions is confirmed according to the post-order traversal sequence of the tree, and a backup processing solution for the current alarm is kept in the buffer to protect the success rate of automated processing. A MAX value is also set for each solution that is processed in parallel with alarm processing, namely the number of cycles it may stay in the processing area of the processing loop, so that an alarm that cannot be resolved does not block subsequent alarms for a long time.
  • Step 4 network feedback, the base station or server provides network status feedback information to confirm whether the alarm is processed successfully. If unsuccessful, the next solution predicted in the strategy execution step is continued in a loop until the maximum number of loops is reached.
  • the intent network needs to monitor the operation status of the network in real time.
  • it collects network performance data and alarm data to observe whether the alarm has been restored, whether the network performance has returned to normal, and whether the configuration data has been synchronized to the base station equipment normally.
  • it continuously predicts network equipment failures and abnormal conditions, for example, if alarm A is restored, whether alarm B is associated.
  • the system will continue to verify in real time whether the original business intent has been met, and can perform corrective actions if the preset intent is not achieved, forming a continuous closed-loop system, which improves the availability and agility of the network. Only a continuous closed-loop system can guarantee the effectiveness of the intent and ensure that the intent will not be disturbed by sudden network conditions.
  • Step 5 strategy optimization, the analysis component verifies the received network-driven feedback information through the requested intent to verify whether the requested intent is running according to the request and design expectations, and collects the successfully processed solutions and assigns them corresponding weights as a supplementary data set to further improve the AI prediction model in steps 1, 2, and 3.
  • the characteristic of this step is that in a commercial mobile communication network, alarm reports are very frequent, reaching 100,000 per day. After each alarm report triggers the execution of the strategy, the effect of the strategy (fault resolution speed and fault resolution situation) can be verified, and the prediction model can be adjusted in reverse guidance.
  • Step 6 Intent Feedback, reports the status and operation of the requested intent through value-based business outcomes.
  • the intent network needs to provide timely feedback to the intent capture link to re-convert, verify and execute the user intent.
  • Because the training data of the alarm processing decision tree comes from the user's own previously successful processing cases, the training data itself is a natural starting point: with the cooperation of operation and maintenance personnel, it can be optimized and screened in a targeted manner so that decisions reflect more of their experience and intent, follow the manual processing workflow more closely, and become more accurate.
  • the weight of the new model is compared with the old model, and the selection weight of this number is increased according to the success rate of processing, and it is fed back to the user processing.
  • This embodiment collects single-board diagnostic data, applies machine learning algorithms, converts the alarm operation and maintenance personnel's intention to solve network failures into strategies, and then implements them.
  • the corresponding automated processing suggestion for the alarm will be triggered to execute relevant scripts and command lines for automatic repair.
  • based on the diagnostic data of the network element single boards, a large number of repetitive alarms are processed automatically, so that limited resources are allocated reasonably.
  • the automated solution for network element alarm processing can be applied to scenarios with high operation and maintenance personnel costs, difficult manual processing, and many conventional alarms.
  • for alarm information whose diagnostic data deviates abnormally and which is unlikely to be processed automatically, multi-dimensional processing analysis based on the degree of deviation of the corresponding data can additionally be provided to the operation and maintenance personnel.
  • Each alarm processing can assign weights to the processing plan and pre-process the plan. A large amount of data is also conducive to the application of this automated processing method to larger-scale regions.
  • FIG8 is a block diagram of the alarm processing device according to an embodiment of the present disclosure. As shown in FIG8 , the device includes:
  • a first determination module 82 is configured to determine a root cause of a fault in the alarm information, and determine a user intention based on the root cause of the fault;
  • a screening module 84 is configured to screen the alarm information and remove the alarm information to be manually processed to obtain the alarms to be processed when the user intends to process the fault;
  • a second determination module 86 is configured to determine an alarm solution for the alarm to be processed;
  • the processing module 88 is configured to process the to-be-processed alarm according to the alarm solution.
  • the screening module 84 is further used to screen the alarm information by using an isolation forest algorithm combined with an OC-SVM model, to eliminate the alarm information to be manually processed, and to screen out the alarms to be processed.
  • the screening module 84 includes:
  • a collection submodule configured to collect diagnostic data and environmental data from the alarm information
  • a formation submodule configured to form an N-dimensional scatter plot of the diagnostic data and the environmental data
  • a first elimination submodule is configured to calculate the degree of alienation between the scattered points in the scatter plot by using the isolation forest algorithm, and eliminate abnormal scattered points according to the alienation degree to obtain an N-dimensional preliminary screening scatter plot;
  • the second elimination submodule is configured to eliminate the alarm information to be manually processed from the preliminary screening scatter plot based on the OC-SVM model to obtain the alarm to be processed.
  • the second elimination submodule includes:
  • a dimension reduction unit is configured to perform dimension reduction processing on the environmental data in the preliminary screening scatter plot, and remove preset type features according to the acquired actual state of the current network element to obtain processed environmental data;
  • a composition unit configured to divide the diagnostic data into hardware analysis data and software analysis data, and use the hardware analysis data, the software analysis data and the processed environment data as feature values for dimension reduction to form a target scatter plot;
  • the secondary screening unit is configured to perform secondary screening on the target scatter plot based on the OC-SVM model to obtain the alarm to be processed.
  • the secondary screening unit is further configured to determine the position of the sphere where the cluster is located in the N-dimensional space of the target scatter plot and calculate the radius of the sphere; if the scatter points corresponding to the alarm information exceed the radius position, the alarm information is judged to be the alarm information to be manually processed; the alarm information to be manually processed is eliminated from the target scatter plot to obtain the alarm to be processed.
  • the second determination module 86 is further configured to input the alarm to be processed into a pre-trained target integrated alarm decision tree to obtain multiple solutions and corresponding priorities output by the target integrated alarm decision tree, wherein the target integrated alarm decision tree is obtained by assigning weights to the solutions corresponding to processed alarms according to the processing success rate of those alarms and training on the training data generated from the processed alarms and the corresponding weights, and the alarm solution includes the multiple solutions.
  • the device further comprises:
  • a data partitioning module is configured to perform data partitioning on the training data to form a plurality of alarm decision trees, and to use decision tree pruning to prune some edge results of the plurality of alarm decision trees to obtain a plurality of target alarm decision trees;
  • a combination module configured to use a random forest algorithm to combine the multiple target alarm decision trees to obtain an integrated alarm decision tree
  • the overfitting module is configured to apply overfitting-handling processing to the integrated alarm decision tree to obtain a target integrated alarm decision tree.
  • the device further comprises:
  • a statistics module configured to collect statistics on the success rate of processing the pending alarms
  • An adjustment module configured to adjust the weight of the solution corresponding to the processed alarm in the training set according to the processing success rate
  • the updating module is configured to update the target integrated alarm decision tree according to the adjusted training set.
  • the processing module 88 is further configured to adjust the script execution order of the multiple solutions based on the priorities of the multiple solutions; call the script framework corresponding to the network management system to generate the alarm processing use cases corresponding to the multiple solutions; and execute the alarm processing use cases corresponding to the multiple solutions in sequence according to the script execution order until one of the multiple solutions is successfully executed.
  • An embodiment of the present disclosure further provides a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the steps of any of the above method embodiments when running.
  • the above-mentioned computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk or an optical disk, and other media that can store computer programs.
  • An embodiment of the present disclosure further provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
  • modules or steps of the present disclosure can be implemented by a general computing device, they can be concentrated on a single computing device, or distributed on a network composed of multiple computing devices, they can be implemented by a program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases, the steps shown or described can be executed in a different order than here, or they can be made into individual integrated circuit modules, or multiple modules or steps therein can be made into a single integrated circuit module for implementation.
  • the present disclosure is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Provided in the embodiments of the present disclosure are an alarm processing method and apparatus, and a storage medium and an electronic apparatus. The method comprises: determining a fault root cause of alarm information, and determining a user intention according to the fault root cause; when the user intention is fault processing, screening the alarm information, and removing alarm information to be manually processed, so as to obtain an alarm to be processed; determining an alarm solution for said alarm; and processing said alarm according to the alarm solution. In this way, the problems, in the relevant art, of the cost of operation and maintenance personnel being relatively high, and it being relatively difficult to perform manual processing due to the operation and maintenance personnel using experience thereof to process alarms on the basis of an alarm root cause report and there being a relatively large number of alarms can be solved, the alarm troubleshooting processing efficiency of key facilities of a mobile communication network can be improved, and a fault time can be shortened.

Description

Alarm processing method, device, storage medium and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This disclosure is based on Chinese patent application CN202211183805.5, filed on September 27, 2022 and entitled "An alarm processing method, device, storage medium and electronic device", and claims priority to that application, the entire disclosure of which is incorporated herein by reference.
Technical Field
The embodiments of the present disclosure relate to the field of communications, and in particular to an alarm processing method, device, storage medium, and electronic device.
Background
In the field of mobile communication networks, alarms mainly serve to improve system reliability, limit the time needed to resolve faults, and narrow the scope of their impact. In a large mobile communication network, the code, the environment, and human operations change every day. Because of this constant change and disorder, potential anomalies must be discovered quickly and met with a prompt and correct response.
In current network management systems, when operation and maintenance personnel receive an alarm message, they open a page to view it. This page provides the asset information, configuration information, personnel information, and monitoring indicator data of the alarm source, together with how alarms were handled that day. They then perform related operations, such as creating a work order, silencing the alarm, escalating the alarm, or confirming alarms in batches. Intelligent network management provides correlated-alarm merging and one-click diagnosis, which help reduce the number of alarms that need to be processed and provide alarm root cause reports. However, operation and maintenance personnel do not need to know why an alarm was generated; what they need is to resolve the fault behind the alarm and thereby clear it. The related art offers no effective solution for this, so operation and maintenance personnel have to resolve faults based on experience. In addition, conventional alarms are numerous, operation and maintenance personnel are costly, and manual handling is difficult.
In the related art, operation and maintenance personnel handle alarms from experience on the basis of alarm root cause reports, alarms are numerous, personnel costs are high, and manual processing is difficult. No solution to these problems has yet been proposed.
Summary of the Invention
The embodiments of the present disclosure provide an alarm processing method, device, storage medium and electronic device, to at least solve the problems in the related art that operation and maintenance personnel handle alarms from experience on the basis of alarm root cause reports, alarms are numerous, personnel costs are high, and manual processing is difficult.
According to an embodiment of the present disclosure, an alarm processing method is provided, the method comprising:
determining a fault root cause of alarm information, and determining a user intention according to the fault root cause;
when the user intention is to handle the fault, screening the alarm information and removing the alarm information to be manually processed, to obtain the alarms to be processed;
determining an alarm solution for the alarms to be processed;
processing the alarms to be processed according to the alarm solution.
According to another embodiment of the present disclosure, an alarm processing device is further provided, the device comprising:
a first determination module, configured to determine a fault root cause of alarm information, and determine a user intention according to the fault root cause;
a screening module, configured to, when the user intention is to handle the fault, screen the alarm information and remove the alarm information to be manually processed, to obtain the alarms to be processed;
a second determination module, configured to determine an alarm solution for the alarms to be processed;
a processing module, configured to process the alarms to be processed according to the alarm solution.
According to yet another embodiment of the present disclosure, a computer-readable storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps of any of the above method embodiments.
According to yet another embodiment of the present disclosure, an electronic device is further provided, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of the hardware structure of a base station device for the alarm processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of an alarm processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of automated alarm processing based on machine learning in the intent network domain according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of intent capture according to this embodiment;
FIG. 5 is a schematic diagram of alarm screening according to this embodiment;
FIG. 6 is a schematic diagram of a script engine according to this embodiment;
FIG. 7 is a flow chart of determining an alarm solution according to this embodiment;
FIG. 8 is a block diagram of an alarm processing device according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings.
It should be noted that the terms "first", "second", etc. in the specification and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
The method embodiments provided in the embodiments of the present disclosure can be executed in a base station device or a similar computing device. Taking execution on a base station device as an example, FIG. 1 is a block diagram of the hardware structure of a base station device for the alarm processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the base station device may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data. The base station device may also include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the base station device. For example, the base station device may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the alarm processing method in the embodiments of the present disclosure. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and service chain address pool slice processing, that is, implements the above method. The memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the base station device via a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the base station device. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 can be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
This embodiment provides an alarm processing method that runs on the above base station device. FIG. 2 is a flow chart of the alarm processing method according to an embodiment of the present disclosure. As shown in FIG. 2, the process includes the following steps:
Step S202, determining the fault root cause of the alarm information, and determining the user intention according to the fault root cause;
The user intent capture in step S202 proceeds as follows: diagnostic information is obtained, one-click diagnosis is provided, and the alarm root cause is analyzed. The method relies on single-board diagnostic data from actual network elements and on existing data for alarms that have already been processed, so the sample is sufficient and keeps growing over time. Combining the alarm types reported by actual network elements with their processing schemes, result-driven model training adjusts the weight of each diagnostic factor in the selection of a network element alarm processing scheme. A comprehensive analysis across the multiple dimensions that affect base station operation, such as diagnostic data and environmental data, then determines the fault root cause and captures the user intent accurately.
Step S204, when the user intention is to handle the fault, screening the alarm information and removing the alarm information to be manually processed, to obtain the alarms to be processed;
Step S206, determining an alarm solution for the alarms to be processed;
Step S208, processing the alarms to be processed according to the alarm solution.
In this embodiment, step S208 may specifically include: adjusting the script execution order of the multiple solutions contained in the alarm solution based on their priorities; calling the script framework corresponding to the network management system to generate the alarm processing use cases corresponding to the multiple solutions; and executing those alarm processing use cases in sequence according to the script execution order until one of the multiple solutions is executed successfully.
The alarm details page provides online documentation that explains the design principle of the alarm. The alarm processing suggestions take the form of programmable commands and scripts, which are developed, submitted, shared, and published by operation and maintenance personnel. Weight features are assigned to the relevant solutions (including their execution order) according to the success rate of past alarm processing, and a machine learning model is used to generate the alarm solution. The priority of each solution is evaluated from the diagnostic data and the weights, and the execution order of the relevant scripts is adjusted accordingly to improve execution efficiency and the success rate of alarm processing. The script framework corresponding to the network management system is then called to dynamically generate automated alarm processing use cases, which are executed in that order.
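The ordering and early-exit behavior of step S208 can be illustrated with a minimal Python sketch. The `Solution` class, the script names, and the convention that a larger priority value is tried first are assumptions made for illustration only; the disclosure does not specify a data model or a script interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Solution:
    name: str                # hypothetical label for a repair script
    priority: float          # assumption: a higher value is tried earlier
    run: Callable[[], bool]  # returns True when the alarm was cleared


def process_alarm(solutions: List[Solution]) -> Optional[str]:
    """Run solutions from highest to lowest priority and stop at the first success."""
    for solution in sorted(solutions, key=lambda s: s.priority, reverse=True):
        if solution.run():
            return solution.name
    return None  # no script cleared the alarm: hand over to manual processing


attempted: List[str] = []


def make_script(name: str, succeeds: bool) -> Callable[[], bool]:
    """Build a stand-in script that records its invocation and reports a fixed result."""
    def script() -> bool:
        attempted.append(name)
        return succeeds
    return script


candidates = [
    Solution("reset_board", 0.2, make_script("reset_board", True)),
    Solution("restart_link", 0.7, make_script("restart_link", False)),
    Solution("reload_config", 0.5, make_script("reload_config", True)),
]
winner = process_alarm(candidates)
```

In this toy run the highest-priority script fails, the second one succeeds, and the lowest-priority script is never executed, which mirrors "until one of the multiple solutions is executed successfully".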
Steps S202 to S208 can solve the problems in the related art that operation and maintenance personnel handle alarms from experience on the basis of alarm root cause reports, alarms are numerous, personnel costs are high, and manual processing is difficult. They can improve the efficiency of alarm troubleshooting for key facilities of a mobile communication network and shorten fault time.
In this embodiment, an Isolation Forest (IForest) is combined with a one-class support vector machine (OC-SVM): after abnormal noise points are removed, the model is trained on normal data and then used to find anomalies in new data. This is an unsupervised approach. The model constructed in this way performs an initial screening of alarms: currently abnormal alarms are removed and handed over to manual processing, and alarms that can be processed automatically are retained. Step S204 may specifically include: screening the alarm information using the isolation forest algorithm combined with the OC-SVM model, removing the alarm information to be manually processed, and screening out the alarms to be processed.
Further, diagnostic data and environmental data are collected from the alarm information; specifically, the diagnostic data and environmental data of each single board inside the corresponding baseband unit (BBU) and remote radio unit (RRU) frames are collected. The diagnostic data and environmental data are formed into an N-dimensional scatter plot. The isolation forest algorithm is used to calculate the degree of isolation between the scatter points, and abnormal scatter points are removed according to that degree to obtain an N-dimensional preliminary screening scatter plot. Based on the OC-SVM model, the alarm information to be manually processed is then removed from the preliminary screening scatter plot to obtain the alarms to be processed. Specifically, in the preliminary screening scatter plot, the environmental data is subjected to dimensionality reduction, and features of preset types are removed according to the acquired actual state of the current network element, to obtain processed environmental data; the diagnostic data is divided into hardware analysis data and software analysis data; the hardware analysis data, the software analysis data, and the processed environmental data are used as feature values and, after dimensionality reduction, form a target scatter plot; and the target scatter plot is screened a second time based on the OC-SVM model to obtain the alarms to be processed.
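The two-stage screening can be sketched with scikit-learn's `IsolationForest` and `OneClassSVM`. This is an illustrative sketch rather than the implementation of the disclosure: the synthetic three-dimensional features, the `contamination` and `nu` values, and the function name `screen_alarms` are all assumptions, and the dimensionality-reduction and feature-removal steps are omitted.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM


def screen_alarms(features: np.ndarray, contamination: float = 0.05, nu: float = 0.05):
    """Return (auto_idx, manual_idx): rows kept for automation and rows sent to manual handling."""
    indices = np.arange(len(features))

    # Stage 1: the isolation forest labels the most isolated points as -1.
    forest = IsolationForest(contamination=contamination, random_state=0)
    keep = forest.fit_predict(features) == 1
    prelim_idx = indices[keep]

    # Stage 2: a one-class SVM is fitted on the pre-screened points and
    # rejects those that fall outside the boundary it learns.
    svm = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    inside = svm.fit_predict(features[prelim_idx]) == 1

    auto_idx = prelim_idx[inside]
    manual_idx = np.setdiff1d(indices, auto_idx)
    return auto_idx, manual_idx


rng = np.random.default_rng(0)
routine = rng.normal(0.0, 1.0, size=(200, 3))   # synthetic "routine" alarm features
abnormal = np.array([[10.0, 10.0, 10.0]])       # one synthetic, clearly abnormal alarm
X = np.vstack([routine, abnormal])
auto_idx, manual_idx = screen_alarms(X)
```

Every row ends up in exactly one of the two index sets, and the far-away synthetic point (row 200) is routed to manual handling by the first stage.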
In an optional embodiment, the secondary screening of the target scatter plot based on the OC-SVM model to obtain the alarms to be processed may specifically include: determining the position of the sphere that contains the cluster in the N-dimensional space of the target scatter plot and calculating the radius of the sphere; if the scatter point corresponding to a piece of alarm information lies beyond the radius, judging that alarm information to be alarm information to be manually processed; and removing the alarm information to be manually processed from the target scatter plot to obtain the alarms to be processed.
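The sphere test can be made concrete with a deliberately simplified Python sketch. It assumes the sphere centre is the mean of a reference cluster and the radius is the largest distance from that centre to a cluster member; a trained OC-SVM or support vector data description would instead obtain the boundary by optimization, so this is a stand-in for the geometric idea only. The function names and sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def fit_sphere(cluster: List[Point]) -> Tuple[List[float], float]:
    """Assumed rule: centre = mean of the cluster, radius = largest distance to the centre."""
    dims = len(cluster[0])
    centre = [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]
    radius = max(math.dist(p, centre) for p in cluster)
    return centre, radius


def needs_manual_processing(point: Point, centre: Point, radius: float) -> bool:
    """A scatter point beyond the radius is treated as an alarm for manual handling."""
    return math.dist(point, centre) > radius


reference_cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
centre, radius = fit_sphere(reference_cluster)
near = needs_manual_processing((1.5, 1.0), centre, radius)  # inside the sphere
far = needs_manual_processing((5.0, 5.0), centre, radius)   # outside the sphere
```

The same check works unchanged for any number of dimensions, since `math.dist` accepts points of arbitrary equal length.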
In this embodiment, step S206 may specifically include: inputting the alarms to be processed into a pre-trained target integrated alarm decision tree, and obtaining the multiple solutions and corresponding priorities output by the target integrated alarm decision tree. The target integrated alarm decision tree is obtained by assigning weights to the solutions corresponding to processed alarms according to the processing success rate of those alarms, and training on the training data generated from the processed alarms and the corresponding weights. The alarm solution includes the multiple solutions.
In one embodiment, the method further includes: partitioning the training data to form multiple alarm decision trees, and using decision tree pruning to trim some edge results of the multiple alarm decision trees, to obtain multiple target alarm decision trees; combining the multiple target alarm decision trees using a random forest algorithm to obtain an integrated alarm decision tree; and applying overfitting-handling processing to the integrated alarm decision tree to obtain the target integrated alarm decision tree.
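One possible way to realize a weighted, pruned tree ensemble that returns ranked solutions is sketched below with scikit-learn's `RandomForestClassifier`. The mapping is an assumption: cost-complexity pruning (`ccp_alpha`) and a depth limit stand in for the pruning and overfitting handling, per-sample weights stand in for the success-rate weights, and class probabilities stand in for solution priorities. The features, labels, and weight values are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic training data: two features per processed alarm, label = solution id 0, 1 or 2.
X_train = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(40, 2)),
    rng.normal([3.0, 0.0], 0.3, size=(40, 2)),
    rng.normal([0.0, 3.0], 0.3, size=(40, 2)),
])
y_train = np.repeat([0, 1, 2], 40)

# Hypothetical success-rate weights: samples resolved by solution 1 count more.
weights = np.where(y_train == 1, 0.9, 0.6)

model = RandomForestClassifier(
    n_estimators=50,   # number of trees combined into the ensemble
    max_depth=4,       # shallow trees limit overfitting
    ccp_alpha=0.01,    # cost-complexity pruning applied to each tree
    random_state=0,
)
model.fit(X_train, y_train, sample_weight=weights)


def rank_solutions(alarm_features):
    """Return (solution id, score) pairs, highest score first."""
    proba = model.predict_proba([alarm_features])[0]
    order = np.argsort(proba)[::-1]
    return [(int(model.classes_[i]), float(proba[i])) for i in order]


ranking = rank_solutions([2.9, 0.1])
```

The ranked list can be fed directly to an executor that tries solutions in order, so the model only has to produce scores and never has to pick a single answer.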
In another embodiment, the method further includes: collecting statistics on the processing success rate of the alarms to be processed; adjusting, according to the processing success rate, the weights of the solutions corresponding to processed alarms in the training set; and updating the target integrated alarm decision tree according to the adjusted training set. After the commands are executed, the success rate of clearing alarms is counted, any newly added script execution order is recorded, and the training set weights of the machine learning algorithm are updated according to the success rate. Once the effectiveness of a newly added execution order has been verified, it is released as a preset solution with the next version. When the automated scripts cannot clear the fault, the alarm is forwarded to the user by text message or email so that a person can step in to optimize the strategy or replace hardware.
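The feedback loop can be illustrated with a small success-rate tracker. This is a sketch under assumptions: the weight of a solution is taken to be its plain observed success rate, and a neutral prior of 0.5 is used for solutions that have never been tried; the disclosure does not give the update formula. The class and solution names are hypothetical.

```python
from collections import defaultdict


class SuccessTracker:
    """Counts attempts and successes per solution and derives a training weight."""

    def __init__(self) -> None:
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)

    def record(self, solution: str, cleared: bool) -> None:
        """Store the outcome of one automated processing attempt."""
        self.attempts[solution] += 1
        if cleared:
            self.successes[solution] += 1

    def weight(self, solution: str, prior: float = 0.5) -> float:
        """Observed success rate, or the prior when the solution has no history."""
        tried = self.attempts.get(solution, 0)
        if tried == 0:
            return prior
        return self.successes.get(solution, 0) / tried


tracker = SuccessTracker()
for outcome in (True, True, True, False):
    tracker.record("restart_link", outcome)
tracker.record("reset_board", False)

updated_weights = {name: tracker.weight(name) for name in ("restart_link", "reset_board", "reload_config")}
```

Weights produced this way could be attached to the corresponding training samples before the decision tree ensemble is retrained.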
This embodiment can be integrated into a network management system. The network management system is a carrier-grade operation and maintenance management (OMM) system based on B/S communication agent components. The OMM system manages no fewer than 15,000 base stations and has no fewer than 2,000 preset alarm types. When a base station fails during daily operation and maintenance, the fault is converted by middleware and reported to the OMM system as an alarm. After operation and maintenance personnel see the alarm reported by the corresponding facility, they process it.
This embodiment is applied to the self-healing repair of faults in key facilities of a mobile communication network. Through the relevant resource configuration, executable processing suggestions are associated with alarms, the relevant scripts or commands are executed automatically, faulty base stations are marked, and the corresponding site-visit work process or hardware maintenance process is submitted, so that fault processing is automated and manpower is reduced and freed up. FIG. 3 is a schematic diagram of automated alarm processing based on machine learning in the intent network domain according to an embodiment of the present disclosure. As shown in FIG. 3, it includes:
步骤1,意图捕获,根据告警设计原理,利用一键诊断,从OMM系统的告警信息中分析并捕获用户对于这个告警的特定意图。告警的产生可能有多条原因,此步骤通过对网络或硬件进行诊断,来识别告警根因,在问题域达到缩小故障定位范围,迅速确定用户意图。图4是根据本实施例的意图捕获的示意图,如图4所示,以“网元链路断(Linkbetween OMM and NE broken)”告警为例,演进型基站(Evolved Node B,简称为eNB)与网管(OMM)之间传输会经过多个路由,从网元出发至网管的第一跳路由(Gateway),简称为网元第一跳网关。 Step 1, intent capture, according to the alarm design principle, use one-click diagnosis to analyze and capture the user's specific intention for this alarm from the alarm information of the OMM system. There may be multiple reasons for the generation of an alarm. This step diagnoses the network or hardware to identify the root cause of the alarm, narrow the scope of fault location in the problem domain, and quickly determine the user's intention. Figure 4 is a schematic diagram of intent capture according to this embodiment. As shown in Figure 4, taking the "Link between OMM and NE broken" alarm as an example, the transmission between the evolved Node B (eNB) and the network management (OMM) will go through multiple routes, and the first hop route (Gateway) from the network element to the network management is referred to as the network element first hop gateway.
通过从OMM发起诊断,对此节点(第一跳网关)发起ping测试,如果不能Ping通,则认为故障根因是图3中情况②,即认定网管至网元OMC通道引用的IP层配置的网关异常,判断为传输问题,操作与维护中心(Operations&Maintenance Center,简称为OMC)至此设备的链路断开的。By initiating diagnosis from the OMM, a ping test is initiated to this node (first-hop gateway). If the ping fails, the root cause of the fault is considered to be situation ② in Figure 3, that is, it is determined that the gateway configured at the IP layer referenced by the network management to network element OMC channel is abnormal, which is judged to be a transmission problem, and the link from the Operations & Maintenance Center (OMC) to this device is disconnected.
如果能Ping通,则认定为图中情况①,即网管至网元OMC通道引用的IP层配置的网关正常,由此判断,至少OMC至此设备的链路正常。则需要排查第一跳到网元间的传输问题或网元自身硬件问题。If the ping is successful, it is considered as situation ① in the figure, that is, the gateway configured in the IP layer referenced by the network management to the network element OMC channel is normal. Therefore, at least the link from the OMC to this device is normal. Then you need to check the transmission problem from the first hop to the network element or the hardware problem of the network element itself.
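The ping-based branching above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `RootCause` and `capture_intent` are hypothetical, and the ping itself is passed in as a callable so that the sketch does not depend on any particular OMM interface.

```python
from enum import Enum
from typing import Callable


class RootCause(Enum):
    # Situation 2: the first-hop gateway cannot be reached from the OMM.
    GATEWAY_UNREACHABLE = "transmission problem between OMC and first-hop gateway"
    # Situation 1: the gateway answers, so the fault lies beyond the first hop.
    BEYOND_FIRST_HOP = "transmission beyond first hop or NE hardware problem"


def capture_intent(gateway_ip: str, ping: Callable[[str], bool]) -> RootCause:
    """Classify a 'link broken' alarm from one ping to the first-hop gateway.

    `ping` is a hypothetical callable returning True when the address replies.
    """
    if ping(gateway_ip):
        return RootCause.BEYOND_FIRST_HOP
    return RootCause.GATEWAY_UNREACHABLE
```

In a real deployment the `ping` callable would wrap whatever diagnostic command the network management system exposes; here it can be any function that returns a boolean.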
The operation and maintenance of base station equipment in wireless systems relies mainly on alarms. Once an alarm occurs, field operation and maintenance personnel troubleshoot according to the alarm handling suggestions and their own experience. In practice, the skill level of front-line personnel varies widely, so accurate operation and maintenance prompts are particularly important for troubleshooting efficiency.
In the intent capture phase, one-click diagnosis is used to obtain the root cause of the fault. This accurately identifies the fault behind the alarm and prepares for resolving it in the next step.
Intent capture means capturing into the system the state that the user wants the network to reach. When alarms are designed for the OMM system, the various fault causes that can trigger each alarm are taken into account. Conversely, when the system reports an alarm, the mobile communication network is considered to have the corresponding fault; if the root cause of the alarm can be accurately identified, the user's intent is considered to have been captured.
Step 2, alarm screening. The intent is converted into a set of configuration changes or network configurations to be executed, and an algorithm model performs a preliminary screening of the uploaded alarms to select those suitable for automated processing. Before the intent is analyzed and implemented, it must be determined whether the acquired alarms are suitable for automated processing. This step uses the isolation forest algorithm combined with an OC-SVM model to pre-screen the alarms, so that the fault resolution method is identified in the solution domain.
Isolation Forest (iForest) is an unsupervised learning algorithm for anomaly detection, commonly used for outlier detection and novelty detection. iForest serves as a method for removing outliers from the training data. Whereas other anomaly detection algorithms characterize how far apart samples are through quantitative measures such as distance or density, iForest detects anomalies by isolating sample points.
First, diagnostic data (voltage, link status, bit errors, power, CPU usage, board temperature, and so on) for each board in the BBU and RRU frames of the corresponding network element, together with environmental data (air inlet and outlet temperature, fan speed, input voltage, and so on), are collected from the acquired alarm information. There are currently more than 200 types of diagnostic data. The data are arranged as an N-dimensional scatter plot, and the isolation forest algorithm computes the degree of isolation between sample points. For example, an overvoltage caused by a short circuit inevitably produces a sample point that deviates sharply from the others. The alarm information removed in this way most likely cannot be handled by automated processing and is therefore transferred to manual processing.
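The isolation idea can be illustrated with a minimal pure-Python isolation forest. This is a sketch, not the implementation used in the embodiment: the two features (voltage, temperature), the sample values, and all function names are assumptions, and no subsampling is performed because the data set is tiny. Points that are isolated after few random splits receive scores close to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """Expected path length c(n) used to normalise isolation depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Split recursively on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # nothing to split on this feature
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(point, tree, depth=0):
    """Depth at which `point` lands, plus the leaf-size correction."""
    if tree[0] == "leaf":
        return depth + avg_path_length(tree[1])
    _, dim, split, left, right = tree
    child = left if point[dim] < split else right
    return path_length(point, child, depth + 1)


def anomaly_scores(points, n_trees=100, seed=0):
    """Return one score per point (needs at least two points); higher = more isolated."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(points)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    norm = avg_path_length(len(points))
    return [2.0 ** (-sum(path_length(p, t) for t in trees) / (n_trees * norm))
            for p in points]


# Hypothetical samples: (input voltage in V, board temperature in deg C).
normal = [(47.8 + 0.1 * i, 39.0 + 0.5 * j) for i in range(5) for j in range(5)]
short_circuit = (95.0, 90.0)  # assumed overvoltage sample
scores = anomaly_scores(normal + [short_circuit])
```

The last entry of `scores` belongs to the assumed short-circuit sample; alarms whose samples score well above the rest would be the ones handed to manual processing.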
Figure 5 is a schematic diagram of alarm screening according to this embodiment. As shown in Figure 5, removing the abnormal alarm reports with the isolation forest reduces the noise seen by the OC-SVM alarm model at run time. Because the OC-SVM is sensitive to dimensionality, the environmental data must be reduced in dimension: irrelevant features such as clock status and read/write speed are removed according to the acquired actual state of the current network element. The diagnostic data are then divided into hardware analysis data and software analysis data, which are dimension-reduced as feature values and input into the OC-SVM for secondary screening. A scatter plot in N-dimensional space is built from the existing type dimensions, and the following formula is applied:
In the formula, z is a new data point, K(z, z) is the outer product of z with itself, α_i is the vector data in the training set, K(z, x_i) is the outer product of the point with the corresponding α_i, α_j is the transpose of α_i (that is, α_j = α_i^T), and R is the radius of the OC-SVM sphere.
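The formula itself is not reproduced in the text. As an assumption, the symbols listed match the standard hypersphere test of support vector data description, in which K is a kernel function, the x_i are training points, and the α_i are their learned coefficients. Under that reading, a new point z lies inside the sphere when its squared kernel distance to the sphere centre does not exceed R squared:

```latex
\[
K(z, z) \;-\; 2\sum_{i} \alpha_i\, K(z, x_i) \;+\; \sum_{i}\sum_{j} \alpha_i \alpha_j\, K(x_i, x_j) \;\le\; R^{2}
\]
```

This standard form treats K as a kernel (an inner product in feature space) and the α_i as scalar coefficients, which may differ from the wording used in the description.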
The position of the sphere containing the main cluster in the N-dimensional space is computed, along with the radius of the sphere. If a point lies beyond this radius, the alarm is judged to fall outside the scope of automated processing; it is removed from the automated processing list and transferred to manual processing.
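The radius test can be sketched with a deliberately simplified stand-in: a Euclidean sphere around the centroid of the main cluster, rather than a trained OC-SVM. The names `fit_sphere` and `route` and the `keep_fraction` parameter are assumptions made for illustration.

```python
import math


def fit_sphere(points, keep_fraction=0.9):
    """Return (center, radius) of a sphere covering `keep_fraction` of the points.

    Simplified stand-in for the OC-SVM boundary: the center is the centroid
    and the radius is the distance of the point at the given coverage rank.
    """
    dims = len(points[0])
    center = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    dists = sorted(math.dist(p, center) for p in points)
    idx = max(0, math.ceil(keep_fraction * len(dists)) - 1)
    return center, dists[idx]


def route(point, center, radius):
    """Alarms inside the sphere stay automated; the rest go to manual handling."""
    return "automated" if math.dist(point, center) <= radius else "manual"
```

For example, fitting on the four corners of a unit square gives a centre of (0.5, 0.5); a sample at (5, 5) then falls outside the radius and is routed to manual handling.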
After the preliminary alarm screening, once the fault root cause analysis component has identified the user's likely intent, the OMM system provides a series of predefined handling suggestions. These suggestions are weighted according to their historical fault resolution success rates. Without manual intervention, the suggestion with the highest weight is selected and converted into concrete steps, and those steps are linked to the script orchestration and execution described later. Taking the "network element link broken" alarm as an example:
1. Check whether the network element type is "managed network element (MO SDR)". Yes -> 2; No -> 3.
2. Check the additional text of the alarm.
a. If it indicates "voltage abnormality" or "main control board power failure", check the power supply of the network element, find and clear the related faults, and check whether the alarm has cleared. Yes -> End; No -> b.
b. If it indicates "transmission abnormal from the network management system to the network element OMC channel gateway", check the transmission line from the network management system to the network element OMC channel gateway, find and clear the related faults, and check whether the alarm has cleared. Yes -> End; No -> c.
c. If it indicates "transmission normal from the network management system to the network element OMC channel gateway", check the transmission line from the network element to the network element OMC channel gateway, find and clear the related faults, and check whether the alarm has cleared. Yes -> End; No -> 3.
3. Check the transmission from the network management system to the network element.
a. On the network management server, run "ping IP" and check whether the connection between the network management server and the network element is normal. Yes -> 4; No -> b.
b. Check whether the "management network element IP address" of the corresponding network element is correct. Yes -> 4; No -> c.
c. Change the IP address, wait 3 minutes, and check whether the alarm has cleared. Yes -> End; No -> 4.
4. Contact the network element and network maintenance engineers, find and clear the related faults, and check whether the alarm has cleared. Yes -> End; No -> 5.
5. Seek higher-level equipment maintenance support.
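The numbered procedure above can be sketched as a table-driven flow. The step labels follow the list; the `FLOW` table, the `walk` function, and the idea of answering each question through a callback are assumptions made for illustration. Steps 2 and 3 are headings in the list, so the table enters them at sub-steps 2a and 3a.

```python
# step -> (question, next step on "yes", next step on "no")
FLOW = {
    "1":  ("Is the NE type 'managed network element (MO SDR)'?", "2a", "3a"),
    "2a": ("Power supply fault handled; has the alarm cleared?", "end", "2b"),
    "2b": ("NMS-side transmission line handled; has the alarm cleared?", "end", "2c"),
    "2c": ("NE-side transmission line handled; has the alarm cleared?", "end", "3a"),
    "3a": ("Does 'ping IP' from the NMS server to the NE succeed?", "4", "3b"),
    "3b": ("Is the management NE IP address correct?", "4", "3c"),
    "3c": ("IP changed and 3 minutes waited; has the alarm cleared?", "end", "4"),
    "4":  ("Maintenance engineers engaged; has the alarm cleared?", "end", "5"),
}


def walk(answer):
    """Follow the flow from step 1; `answer(step)` returns True for 'yes'.

    Returns the list of visited steps, ending in 'end' (alarm cleared)
    or '5' (escalate to higher-level maintenance support).
    """
    step, visited = "1", []
    while step in FLOW:
        visited.append(step)
        _, yes_next, no_next = FLOW[step]
        step = yes_next if answer(step) else no_next
    visited.append(step)
    return visited
```

Keeping the flow in a table rather than nested conditionals makes each step map to one script, which fits the one-to-one correspondence between scripts and handling steps described later.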
Step 3, policy execution. Based on the alarm handling solutions previously used by the user and the related diagnostic data, a machine learning algorithm generates a priority prediction model. The predicted alarm handling solutions are then automatically orchestrated into scripts with a processing order, and the automation scripts are output and executed to apply the network configuration changes.
Network equipment is complex: it includes base station equipment, network management servers, routers, switches, and so on, all of which may report alarms. For the intent network to operate autonomously, every network device in it must be adapted, which is a very large amount of work. Automation and orchestration help achieve this agility by simplifying network operation and management. The simplest way to automate a network is through programmability, using standard low-level APIs to provide fine-grained control over mobile base station equipment, down to the chip level.
The OMM system should provide a REST interface with XML or JSON encoding to support a CLI (command-line interface) and an open API (open application programming interface). Programmability is essential for network-aware applications and application-aware networks. Network programmability lies not in the various interfaces and specifications but in an abstraction of the network that truly reflects user intent, reducing network complexity and raising the level of automation by eliminating manual configuration. After the network devices in the intent network receive the delivered configuration policies, they execute them in sequence.
Figure 6 is a schematic diagram of the script engine according to this embodiment. As shown in Figure 6, for script development, the script designer (Open Script Designer, OSD) is an online scripting tool provided to engineering and development personnel. Script projects are developed, compiled, and published with this tool to meet the need for customized scripts. In addition to compiling and publishing, the tool provides auxiliary functions such as syntax checking, code blocks, auto-completion, and online help. The open script engine offers intelligent syntax hints, a business Python SDK library, and a script orchestration designer, which make script development more convenient and lower the barrier to development. A script can be flagged as containing important operations, with a message describing the impact of those operations; when such a script is executed, a verification code and the message are presented.
For script orchestration and execution, once a script has been found in the application list of the script execution engine (Open Script Execution Engine, OSE), it can be orchestrated with other scripts. For example, the "export network element parameter file" script can be linked with the "export alarms" script; after execution, the corresponding attachments can be downloaded locally.
For script management, scripts can be classified and quickly located by tagging. Each script carries its own help file, sample output, and similar information, which gives more detailed guidance on its use. Any operation and maintenance engineer or customization expert can develop automation scripts based on the alarm handling suggestions and push them to the server. Once a script has been verified as effective, it is shipped as a built-in script with subsequent product versions.
The corresponding flow can be divided into 16 processing solutions and 9 unitized processing use cases. The 16 solutions serve as the 16 decision categories for building the decision tree (that is, the categories from which the priority of each processing solution is determined). The script designer is used to design the unitized processing use cases so that scripts correspond one-to-one with processing steps. Scripts are then orchestrated according to the tree structure to build the automated alarm handling solution for subsequent execution.
The alarm type is confirmed from the alarm code, and the specific diagnostic data of the network element are collected to generate a decision tree that identifies the alarm solution for the network element. A decision tree is built by top-down analysis: before generation, the data partitioning rules are obtained, and construction starts from the root node. Therefore, once the specific alarm type has been obtained, the network element diagnostic data are randomly divided into several subsets, and the data attributes are evaluated by Gini purity. The lower the coefficient, the less mixed the classes in the subset; a coefficient of 0 indicates that all samples in the subset belong to the same class. On this basis, the Gini coefficient is calculated as follows:
$$\mathrm{Gini} = 1 - \sum_{i} \left( \frac{c_i}{D} \right)^{2}$$
where D is the total number of samples and c_i is the number of samples in the i-th class.
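As a small worked example, the Gini coefficient of a subset is one minus the sum of the squared class proportions c_i / D. The function name `gini` and the label values below are assumptions used only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a list of class labels: 1 - sum((c_i / D) ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())
```

A subset whose alarms were all resolved by the same solution has a coefficient of 0, while a subset split evenly between two solutions has a coefficient of 0.5, so a split that produces lower values separates the solutions more cleanly.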
The training data are partitioned to form multiple decision trees, and pruning is used to remove marginal branches from them. The random forest algorithm then combines the individual decision tree classifiers into an ensemble classifier with better predictive performance. When a node selects a feature to split on, it does not search all features for the one that maximizes the criterion (such as information gain). Instead, a subset of features is drawn at random, and the best split among the drawn features is applied to the node. Because random forests use bagging (bootstrap aggregating), that is, the ensemble idea, both samples and features are effectively subsampled, which helps to avoid overfitting.
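The three ingredients named above (bootstrap sampling of rows, random feature subsets at each split, and combining the trees' outputs) can be sketched as follows. The function names are assumptions, and the tree-building step itself is omitted; only the sampling and voting mechanics are shown.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one node split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Combine the per-tree predictions into one ensemble prediction."""
    return Counter(predictions).most_common(1)[0][0]
```

Each tree would be trained on its own bootstrap sample and would call `random_feature_subset` at every node; the ensemble's answer for an alarm is then the solution predicted by most trees.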
Figure 7 is a flowchart of determining the alarm solution according to this embodiment. As shown in Figure 7, once the decision tree has undergone overfitting processing, the acquired network element diagnostic data are input into the tree to predict alarm handling solutions, and automated alarm handling use cases are generated in advance so that alarms are handled promptly. The priority of the alarm handling solutions is determined by the post-order traversal of the tree, and fallback solutions for the current alarm are held in a buffer to safeguard the success rate of automated handling. A MAX value is set for the processing solutions, which run in parallel with alarm processing; MAX is the number of processing-loop iterations an alarm may remain in the processing area, so that an alarm that cannot be resolved does not block subsequent alarms for a long time.
Step 4, network feedback. The base station or server provides network status feedback to confirm whether the alarm has been handled. If not, the next solution predicted in the policy execution step is executed, and the loop continues until the maximum number of iterations is reached.
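The execute-and-check loop can be sketched as below. The function `handle_alarm`, its parameters, and the "unresolved" outcome label are assumptions; `run_script` and `alarm_cleared` stand in for script execution and for the network status feedback, respectively.

```python
def handle_alarm(solutions, run_script, alarm_cleared, max_rounds=3):
    """Try solutions in priority order until the alarm clears or MAX is hit.

    solutions     -- solution names, highest priority first
    run_script    -- hypothetical callable that executes one solution
    alarm_cleared -- hypothetical callable returning True once feedback
                     from the base station or server shows the alarm cleared
    max_rounds    -- the MAX value: upper bound on loop iterations
    """
    attempts = []
    for name in solutions[:max_rounds]:
        attempts.append(name)
        run_script(name)
        if alarm_cleared():
            return "resolved", attempts
    return "unresolved", attempts
```

Bounding the loop with `max_rounds` is what keeps one stubborn alarm from occupying the processing area while later alarms wait.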
After the OSE scripts have delivered the network configuration to the network devices and it has been executed successfully, the intent network must monitor the network's operating state in real time. On the one hand, it collects network performance data and alarm data to observe whether the alarm has cleared, whether network performance has returned to normal, and whether the configuration data have been synchronized to the base station equipment. On the other hand, it continuously predicts network equipment faults and anomalies, for example whether clearing alarm A has caused an associated alarm B.
The system continuously verifies in real time whether the original business intent has been met and can take corrective action when the preset intent has not been achieved. This forms a continuous closed-loop system that improves the availability and agility of the network. Only a continuous closed loop can guarantee that the intent remains effective and is not disturbed by sudden changes in network conditions.
Step 5, policy optimization. The analysis component checks the received network-driven feedback against the requested intent to verify that the intent is operating as requested and designed. Successfully executed solutions are collected and assigned weights, forming a supplementary data set that further refines the AI prediction models of steps 1, 2, and 3. What distinguishes this step is that in a commercial mobile communication network alarms are reported very frequently, up to 100,000 per day. Each time an alarm report triggers policy execution, the effect of the policy (the speed and outcome of fault resolution) can be verified and fed back to adjust the prediction model.
Step 6, intent feedback. The status and operation of the requested intent are reported through value-based business outcomes.
Once a network anomaly is detected, the intent network must promptly feed it back to the intent capture stage, so that the user intent is converted, verified, and delivered for execution again.
Decision tree prediction of alarm handling is affected by considerable randomness and by reliance on experience, so the fitting of the decision tree needs further optimization. Two approaches are available for optimizing the fitting of the tree:
Because the training data for the alarm handling decision tree come from the user's own previously successful handling cases, the training data can be the starting point: with the help of operation and maintenance personnel, they are optimized and selectively filtered so that decisions reflect more practical experience and follow the manual handling process more closely, which improves handling accuracy. The new model is weighted against the old model, the selection weight of the tree is raised according to its handling success rate, and the result is fed back into the user's handling.
Alternatively, weights can be assigned to each handling solution according to its success ratio for the corresponding alarm and added to the training data as a new feature. The assigned weight raises the proportion of cases in which that solution is chosen, continuously improving prediction accuracy and the efficiency of automated alarm handling.
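One way to turn success ratios into weights is sketched below. The text does not specify how weights are computed, so the normalization used here (each solution's success rate divided by the sum of all rates) and the name `update_weights` are assumptions.

```python
def update_weights(stats):
    """Map {solution: (successes, attempts)} to weights that sum to 1.

    Assumed rule: weight is proportional to the observed success rate;
    if no solution has succeeded yet, all weights are equal.
    """
    rates = {name: (ok / tried if tried else 0.0)
             for name, (ok, tried) in stats.items()}
    total = sum(rates.values())
    if total == 0:
        return {name: 1.0 / len(rates) for name in rates}
    return {name: rate / total for name, rate in rates.items()}
```

The resulting weight per solution could then be appended to each training record as the additional feature described above and recomputed whenever new handling results arrive.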
This embodiment collects board diagnostic data and applies machine learning algorithms to convert the operation and maintenance personnel's intent to resolve network faults into policies, which are then executed. When an alarm is reported, the automated handling suggestion for that alarm is triggered, and the related scripts and command lines are executed to repair the fault automatically. Large numbers of repetitive alarms are handled automatically on the basis of the board diagnostic data of the network element, so that limited resources are allocated sensibly. The automated alarm handling scheme suits scenarios in which operation and maintenance staff are costly, manual handling is difficult, and routine alarms are numerous. For alarms whose diagnostic data deviate abnormally and that most likely cannot be handled automatically, the degree of deviation of the data can be used to give operation and maintenance personnel a multi-dimensional analysis. Every alarm handled allows handling solutions to be weighted and pre-processed, and the large volume of data also favors applying this automated approach over larger regions.
According to another aspect of the embodiments of the present disclosure, an alarm processing apparatus is further provided. Figure 8 is a block diagram of the alarm processing apparatus according to an embodiment of the present disclosure. As shown in Figure 8, the apparatus includes:
a first determination module 82, configured to determine the fault root cause of the alarm information and to determine the user intent according to the fault root cause;
a screening module 84, configured to screen the alarm information when the user intent is to handle the fault, removing the alarm information that is to be handled manually, to obtain the alarms to be processed;
a second determination module 86, configured to determine the alarm solution for the alarms to be processed;
a processing module 88, configured to process the alarms to be processed according to the alarm solution.
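The way the four modules chain together can be sketched as a small pipeline. The class `AlarmProcessor`, the dictionary-shaped alarms, and the intent label "handle_fault" are assumptions; each module is represented by a callable so that any concrete implementation can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlarmProcessor:
    determine_intent: Callable[[dict], str]            # first determination module 82
    screen: Callable[[List[dict]], List[dict]]         # screening module 84
    determine_solutions: Callable[[dict], List[str]]   # second determination module 86
    process: Callable[[dict, List[str]], bool]         # processing module 88

    def run(self, alarms: List[dict]) -> Dict[int, bool]:
        """Return {alarm id: handled?} for alarms that stay automated."""
        # Only alarms whose intent is to handle a fault are screened.
        wanted = [a for a in alarms if self.determine_intent(a) == "handle_fault"]
        results = {}
        for alarm in self.screen(wanted):
            results[alarm["id"]] = self.process(alarm, self.determine_solutions(alarm))
        return results
```

Alarms removed by the screening callable simply do not appear in the result, which mirrors their transfer to manual handling.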
In one embodiment, the screening module 84 is further configured to screen the alarm information using the isolation forest algorithm combined with the OC-SVM model, removing the alarm information that is to be handled manually and selecting the alarms to be processed.
In one embodiment, the screening module 84 includes:
a collection submodule, configured to collect diagnostic data and environmental data from the alarm information;
a construction submodule, configured to arrange the diagnostic data and the environmental data as an N-dimensional scatter plot;
a first elimination submodule, configured to compute the degree of isolation between the points in the scatter plot using the isolation forest algorithm, and to remove abnormal points according to the degree of isolation, to obtain an N-dimensional preliminary screening scatter plot;
a second elimination submodule, configured to remove, based on the OC-SVM model, the alarm information that is to be handled manually from the preliminary screening scatter plot, to obtain the alarms to be processed.
In one embodiment, the second elimination submodule includes:
a dimension reduction unit, configured to reduce the dimension of the environmental data in the preliminary screening scatter plot and to remove features of preset types according to the acquired actual state of the current network element, to obtain processed environmental data;
a composition unit, configured to divide the diagnostic data into hardware analysis data and software analysis data, and to form a target scatter plot from the hardware analysis data, the software analysis data, and the processed environmental data after dimension reduction as feature values;
a secondary screening unit, configured to perform secondary screening on the target scatter plot based on the OC-SVM model, to obtain the alarms to be processed.
In one embodiment, the secondary screening unit is further configured to determine the position of the sphere containing the cluster in the N-dimensional space of the target scatter plot and to compute the radius of the sphere; if the point corresponding to an item of alarm information lies beyond the radius, to judge that alarm information to be alarm information that is to be handled manually; and to remove the alarm information that is to be handled manually from the target scatter plot, to obtain the alarms to be processed.
In one embodiment, the second determination module 86 is further configured to input the alarms to be processed into a pre-trained target ensemble alarm decision tree and to obtain the multiple solutions and corresponding priorities output by the target ensemble alarm decision tree. The target ensemble alarm decision tree is trained on training data generated by assigning weights to the solutions of already-processed alarms according to their handling success rates and combining the processed alarms with those weights. The alarm solution includes the multiple solutions.
In one embodiment, the apparatus further includes:
a data partitioning module, configured to partition the training data to form multiple alarm decision trees, and to prune marginal results from the multiple alarm decision trees by decision tree pruning, to obtain multiple target alarm decision trees;
a combination module, configured to combine the multiple target alarm decision trees using the random forest algorithm, to obtain an ensemble alarm decision tree;
an overfitting module, configured to perform overfitting processing on the ensemble alarm decision tree, to obtain the target ensemble alarm decision tree.
In one embodiment, the apparatus further includes:
a statistics module, configured to collect statistics on the handling success rate of the alarms to be processed;
an adjustment module, configured to adjust, according to the handling success rate, the weights of the solutions corresponding to the processed alarms in the training set;
an update module, configured to update the target ensemble alarm decision tree according to the adjusted training set.
In one embodiment, the processing module 88 is further configured to adjust the script execution order of the multiple solutions based on their priorities; to call the script framework of the network management system to generate the alarm handling use cases corresponding to the multiple solutions; and to execute those alarm handling use cases one after another in the script execution order until one of the multiple solutions is executed successfully.
An embodiment of the present disclosure further provides a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps of any one of the above method embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
An embodiment of the present disclosure further provides an electronic apparatus comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any one of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which are not repeated here.
Those skilled in the art should understand that the modules or steps of the present disclosure described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. They can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be executed in an order different from the one given here, or the modules or steps can each be made into a separate integrated circuit module, or several of them can be made into a single integrated circuit module. Thus, the present disclosure is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present disclosure and are not intended to limit it. For those skilled in the art, the present disclosure may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (12)

  1. An alarm processing method, the method comprising:
    determining a fault root cause of alarm information, and determining a user intention according to the fault root cause;
    in a case where the user intention is to handle the fault, screening the alarm information to remove alarm information to be manually processed, to obtain an alarm to be processed;
    determining an alarm solution for the alarm to be processed;
    processing the alarm to be processed according to the alarm solution.
  2. The method according to claim 1, wherein screening the alarm information to remove the alarm information to be manually processed, to obtain the alarm to be processed, comprises:
    screening the alarm information by using an isolation forest algorithm in combination with an OC-SVM model, to remove the alarm information to be manually processed and screen out the alarm to be processed.
  3. The method according to claim 2, wherein screening the alarm information by using the isolation forest algorithm in combination with the OC-SVM model, to remove the alarm information to be manually processed and screen out the alarm to be processed, comprises:
    collecting diagnostic data and environmental data from the alarm information;
    forming an N-dimensional scatter plot from the diagnostic data and the environmental data;
    calculating, by using the isolation forest algorithm, the degree of isolation between the points in the scatter plot, and removing abnormal points according to the degree of isolation, to obtain an N-dimensional preliminary screening scatter plot;
    removing, based on the OC-SVM model, the alarm information to be manually processed from the preliminary screening scatter plot, to obtain the alarm to be processed.
  4. The method according to claim 3, wherein removing, based on the OC-SVM model, the alarm information to be manually processed from the preliminary screening scatter plot, to obtain the alarm to be processed, comprises:
    performing dimensionality reduction on the environmental data in the preliminary screening scatter plot, and removing features of a preset type according to the acquired actual state of the current network element, to obtain processed environmental data;
    dividing the diagnostic data into hardware analysis data and software analysis data;
    using the hardware analysis data, the software analysis data, and the processed environmental data as feature values and, after dimensionality reduction, forming a target scatter plot;
    performing secondary screening on the target scatter plot based on the OC-SVM model, to obtain the alarm to be processed.
  5. The method according to claim 4, wherein performing secondary screening on the target scatter plot based on the OC-SVM model, to obtain the alarm to be processed, comprises:
    determining the position of the sphere in which the cluster is located in the N-dimensional space of the target scatter plot, and calculating the radius of the sphere;
    if the point corresponding to a piece of alarm information lies beyond the radius, determining that the alarm information is alarm information to be manually processed;
    removing the alarm information to be manually processed from the target scatter plot, to obtain the alarm to be processed.
  6. The method according to claim 1, wherein determining the alarm solution for the alarm to be processed comprises:
    inputting the alarm to be processed into a pre-trained target integrated alarm decision tree, to obtain multiple solutions and corresponding priorities output by the target integrated alarm decision tree, wherein the target integrated alarm decision tree is obtained by assigning weights to the solutions corresponding to processed alarms according to the processing success rate of the processed alarms, and training on training data generated from the processed alarms and the corresponding weights, and the alarm solution comprises the multiple solutions.
  7. The method according to claim 6, wherein the method further comprises:
    partitioning the training data to form multiple alarm decision trees, and pruning some edge results of the multiple alarm decision trees by decision tree pruning, to obtain multiple target alarm decision trees;
    combining the multiple target alarm decision trees by using a random forest algorithm, to obtain an integrated alarm decision tree;
    performing overfitting processing on the integrated alarm decision tree, to obtain the target integrated alarm decision tree.
  8. The method according to claim 6 or 7, wherein the method further comprises:
    collecting statistics on the processing success rate of the alarm to be processed;
    adjusting, according to the processing success rate, the weights of the solutions corresponding to the processed alarms in the training set;
    updating the target integrated alarm decision tree according to the adjusted training set.
  9. The method according to claim 6, wherein processing the alarm to be processed according to the alarm solution comprises:
    adjusting the script execution order of the multiple solutions based on the priorities of the multiple solutions;
    calling a script framework corresponding to a network management system to generate alarm processing use cases corresponding to the multiple solutions;
    executing the alarm processing use cases corresponding to the multiple solutions one by one in the script execution order until one of the multiple solutions is executed successfully.
  10. An alarm processing apparatus, the apparatus comprising:
    a first determination module, configured to determine a fault root cause of alarm information, and to determine a user intention according to the fault root cause;
    a screening module, configured to screen the alarm information, in a case where the user intention is to handle the fault, to remove alarm information to be manually processed and obtain an alarm to be processed;
    a second determination module, configured to determine an alarm solution for the alarm to be processed;
    a processing module, configured to process the alarm to be processed according to the alarm solution.
  11. A computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 9.
  12. An electronic apparatus, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 9.
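The sphere test recited in claim 5 can be illustrated with the short Python sketch below. It is a simplified stand-in and not an OC-SVM: the sphere is centered on the centroid of historical points, its radius is taken as a quantile of their distances to that centroid, and the earlier isolation forest stage and the dimensionality reduction are omitted. The `coverage` parameter, the function names, and the sample coordinates are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]

def fit_sphere(points: List[Point], coverage: float = 0.95) -> Tuple[Tuple[float, ...], float]:
    """Locate the cluster: centroid as center, distance quantile as radius."""
    n = len(points)
    center = tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))
    distances = sorted(math.dist(center, p) for p in points)
    index = max(0, math.ceil(coverage * n) - 1)
    return center, distances[index]

def screen(alarms: Dict[str, Point], center: Point, radius: float) -> Tuple[List[str], List[str]]:
    """Split alarms into those inside the sphere and those beyond its radius."""
    to_process: List[str] = []
    to_manual: List[str] = []
    for alarm_id, point in alarms.items():
        inside = math.dist(center, point) <= radius
        (to_process if inside else to_manual).append(alarm_id)
    return to_process, to_manual

# Hypothetical feature points of previously handled alarms (N = 2 here).
history = [(0.4, 0.5), (0.6, 0.5), (0.5, 0.4), (0.5, 0.6)]
center, radius = fit_sphere(history)

# New alarms: A1 falls inside the cluster, A2 lies far outside it.
alarms = {"A1": (0.5, 0.55), "A2": (3.0, 3.0)}
to_process, to_manual = screen(alarms, center, radius)
```

Here `A1` is kept as an alarm to be processed and `A2` is set aside for manual processing. A trained OC-SVM would learn the boundary in a kernel feature space rather than from a plain centroid.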
PCT/CN2023/091861 2022-09-27 2023-04-28 Alarm processing method and apparatus, and storage medium and electronic apparatus WO2024066346A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211183805.5A CN117792864A (en) 2022-09-27 2022-09-27 Alarm processing method and device, storage medium and electronic device
CN202211183805.5 2022-09-27

Publications (1)

Publication Number Publication Date
WO2024066346A1 true WO2024066346A1 (en) 2024-04-04

Family

ID=90378624

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/091861 WO2024066346A1 (en) 2022-09-27 2023-04-28 Alarm processing method and apparatus, and storage medium and electronic apparatus

Country Status (2)

Country Link
CN (1) CN117792864A (en)
WO (1) WO2024066346A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183967A1 (en) * 2015-05-19 2016-11-24 中兴通讯股份有限公司 Failure alarm method and apparatus for key component, and big data management system
CN108989132A (en) * 2018-08-24 2018-12-11 深圳前海微众银行股份有限公司 Fault warning processing method, system and computer readable storage medium
CN112446511A (en) * 2020-11-20 2021-03-05 中国建设银行股份有限公司 Fault handling method, device, medium and equipment
CN115086148A (en) * 2022-07-15 2022-09-20 中国电信股份有限公司 Optical network alarm processing method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN117792864A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
EP2871803B1 (en) Network node failure predictive system
US11294754B2 (en) System and method for contextual event sequence analysis
US9152925B2 (en) Method and system for prediction and root cause recommendations of service access quality of experience issues in communication networks
CN109075991A (en) Cloud verifying and test automation
EP3207432B1 (en) A method for managing subsystems of a process plant using a distributed control system
CN114064196A (en) System and method for predictive assurance
EP3792769A1 (en) Model control platform
US11258659B2 (en) Management and control for IP and fixed networking
US20220150132A1 (en) Utilizing machine learning models to determine customer care actions for telecommunications network providers
US11934855B2 (en) System and method to autonomously manage hybrid information technology (IT) infrastructure
US11704186B2 (en) Analysis of deep-level cause of fault of storage management
WO2019061364A1 (en) Failure analyzing method and related device
CN106549807A (en) A kind of classification report method of daily record and system
CN113448947B (en) Method and device for distributed deployment operation and maintenance of mongo database
CN114675956A (en) Method for configuration and scheduling of Pod between clusters based on Kubernetes
WO2024066346A1 (en) Alarm processing method and apparatus, and storage medium and electronic apparatus
WO2019186243A1 (en) Global data center cost/performance validation based on machine intelligence
JP2022037107A (en) Failure analysis device, failure analysis method, and failure analysis program
Velayutham et al. Artificial Intelligence assisted Canary Testing of Cloud Native RAN in a mobile telecom system
CN115941433A (en) Network slice performance optimization guarantee method and system, storage medium and electronic equipment
Yates et al. Artificial Intelligence for Network Operations
CN116582462B (en) Converged service monitoring method and device
EP4361745A1 (en) Autonomous operation of modular industrial plants
US20230188408A1 (en) Enhanced analysis and remediation of network performance
US20240119369A1 (en) Contextual learning at the edge

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23869601

Country of ref document: EP

Kind code of ref document: A1