CN120567646A - Network fault removal method, equipment and medium under AI drive - Google Patents

Network fault removal method, equipment and medium under AI drive

Info

Publication number
CN120567646A
CN120567646A CN202510693202.7A CN202510693202A CN120567646A CN 120567646 A CN120567646 A CN 120567646A CN 202510693202 A CN202510693202 A CN 202510693202A CN 120567646 A CN120567646 A CN 120567646A
Authority
CN
China
Prior art keywords
network
fault
data
diagnosis
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510693202.7A
Other languages
Chinese (zh)
Inventor
马章竞
陈翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanmai Network Technology Shandong Co ltd
Original Assignee
Yuanmai Network Technology Shandong Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanmai Network Technology Shandong Co ltd filed Critical Yuanmai Network Technology Shandong Co ltd
Priority to CN202510693202.7A priority Critical patent/CN120567646A/en
Publication of CN120567646A publication Critical patent/CN120567646A/en
Pending legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

本发明公开了一种AI驱动下的网络故障排除方法、设备及介质,属于网络故障监测技术领域,用于解决现有大规模网络环境中故障排除方法中存在的效率低下、易出错、自动化程度低、缺乏深度诊断能力、无法有效整合实时数据与领域知识的技术问题。方法包括:对相关网络设备进行基础上下文信息收集处理,得到初始上下文数据;对触发事件进行有关集中告警的事件聚类处理,得到事件聚类结果;将触发事件、初始上下文数据以及事件聚类结果进行基于大型语言模型的初步故障诊断处理,得到初始分析诊断数据;对初始分析诊断数据进行迭代细化下的诊断更新处理,得到最终分析诊断数据;对最终分析诊断数据进行故障解决方案的查寻处理,确定出网络故障解决策略。

The present invention discloses an AI-driven network troubleshooting method, device, and medium, which belongs to the field of network fault monitoring technology and is used to solve the technical problems of low efficiency, easy error, low degree of automation, lack of deep diagnostic capabilities, and inability to effectively integrate real-time data and domain knowledge in existing troubleshooting methods in large-scale network environments. The method includes: collecting and processing basic context information of relevant network devices to obtain initial context data; clustering events related to centralized alarms on triggering events to obtain event clustering results; performing preliminary fault diagnosis processing based on a large language model on the triggering events, initial context data, and event clustering results to obtain initial analysis and diagnosis data; performing iterative and refined diagnosis and update processing on the initial analysis and diagnosis data to obtain final analysis and diagnosis data; searching and processing the final analysis and diagnosis data for fault solutions to determine a network fault resolution strategy.

Description

Network fault removal method, equipment and medium under AI drive
Technical Field
The present application relates to the field of network fault monitoring, and in particular, to a method, an apparatus, and a medium for network fault removal under AI driving.
Background
Modern network environments are becoming increasingly complex. As network scale increases and dynamics increase, network troubleshooting becomes a critical and challenging operation and maintenance task. The problems of Border Gateway Protocol (BGP) routing, system log (syslog) analysis, alarm processing, etc., are involved, and high requirements are placed on the expertise and experience of network engineers.
Currently, troubleshooting of cloud data centers or intelligent computing network environments mainly relies on the following approaches:
manual troubleshooting, in which a network engineer logs in to the switch through SSH according to experience, manually executes a series of show commands (such as checking BGP neighbor state, interface counter, log, etc.), analyzes the output information, and gradually locates the problem. This requires the engineer to have a deep knowledge of the network protocols.
Basic automation script-simple script (such as Python, bash) is used to automatically execute some repetitive checking commands and aggregate the results for presentation.
A Network Management System (NMS) provides basic monitoring instrument panels and alarm lists, and can display information such as equipment states, interface flow and the like, but is not provided with intelligent diagnosis capability.
Generic LLM is directly applied by attempting to use generic LLM (e.g., chatGPT, claude, etc.) directly for analyzing fault descriptions or log segments, but lacks the ability to access real-time network data and may generate inaccurate or factually inconsistent "illusion" information.
LLM+RAG is used for configuration, and the existing research explores the generation, translation or conversion of network configuration by using LLM and RAG technologies, and is mainly applied to a configuration deployment stage, but not real-time fault diagnosis.
However, the prior art has some drawbacks (1) manual methods, which are time consuming, inefficient, prone to error, especially in large scale, dynamic networks, and highly dependent on the personal experience and ability level of engineers. (2) The basic script has poor flexibility, can only process a predefined simple scene, and can not cope with complex and unknown faults. (3) NMS systems, lacking deep diagnosis and root cause analysis capabilities, typically only find surface phenomena and fail to provide specific troubleshooting guidelines. (4) The general LLM is directly applied, namely, the risk of 'illusion' exists, and real-time network state and configuration data cannot be acquired, so that analysis results are unreliable or irrelevant. (5) LLM+RAG is used for configuration, and is not suitable for fault removal scenes requiring real-time data interaction and dynamic diagnosis. In general, the prior art has the problems of low automation degree, insufficient diagnosis depth, incapability of effectively combining field knowledge with real-time data, efficiency and accuracy to be improved and the like in the aspect of network fault elimination.
Disclosure of Invention
The embodiment of the application provides a network fault removal method, equipment and medium driven by an AI (automatic identification) for solving the technical problems that the existing fault removal method in a large-scale network environment is low in efficiency, easy to make mistakes, low in automation degree, lacks deep diagnosis capability, cannot effectively integrate real-time data with domain knowledge and the like.
The embodiment of the application adopts the following technical scheme:
On one hand, the embodiment of the application provides a network fault elimination method under the driving of an AI (automatic identification), which comprises the steps of collecting and processing basic context information under relevant predefining on the basis of triggering events corresponding to network fault information to obtain initial context data, carrying out event clustering processing on the triggering events to obtain event clustering results, carrying out preliminary fault diagnosis processing on the triggering events, the initial context data and the event clustering results on the basis of a large language model to obtain initial analysis diagnosis data, carrying out diagnosis updating processing on the initial analysis diagnosis data under iterative refinement according to a function call request to obtain final analysis diagnosis data, and carrying out fault solution searching processing on the final analysis diagnosis data through a preset solution generator to determine a network fault solving strategy.
According to the embodiment of the application, through the automatic information collection, analysis and diagnosis processes, the fault removal time is greatly shortened, and the dependence on manual intervention is reduced. By combining LLM (Large Language Model) reasoning capability, RAG (RETRIEVAL-Augmented Generation) domain knowledge and function call real-time data, more comprehensive and deeper analysis can be performed, and erroneous judgment caused by insufficient information or experience deviation is reduced. The end-to-end automatic flow from fault triggering to solution proposal is realized, and the burden of network engineers is reduced. The method can cope with complex network fault scenes, and gradually approaches the root cause through iterative information acquisition and analysis. The problem of 'illusion' is relieved through the RAG module, and the problem of lack of real-time data access is solved through function call, so that the LLM module can be reliably applied to the professional network operation and maintenance field. The RAG knowledge base can continuously accumulate historical fault cases and solutions, so that the diagnostic capability of the system is continuously improved. Through the event clustering engine, a large number of syslog (System Log) and alarms can be effectively processed, key modes are extracted, and information flooding is avoided.
In a possible implementation manner, based on a trigger event corresponding to network fault information, relevant network equipment is subjected to relevant pre-defined basic context information collection processing to obtain initial context data, and the method specifically comprises the steps of receiving fault trigger signals from other ends of a network through an event receiving port, wherein the other ends of the network at least comprise a monitoring system end, a log aggregator end and a user input end, the fault trigger signals at least comprise BGP neighbor Down alarm signals and specific system log information, carrying out event type identification on the fault trigger signals through a preset diagnosis engine to determine the trigger event under relevant network fault information, calling a context collector when the trigger event is identified, collecting a group of relevant basic context information under pre-definition in the relevant network equipment through the context collector based on the time type of the trigger event, wherein the basic context information at least comprises state in a real-time routing protocol, real-time routing table neighbor information and real-time relevant interface state, and carrying out initial context structure processing on the relevant basic context information to obtain the initial context data.
In a feasible implementation mode, the method comprises the steps of carrying out event clustering processing of related concentrated alarms on the triggering events to obtain event clustering results, and concretely comprises the steps of starting an alarm clustering engine when the related events under the concentrated alarms are identified, carrying out multi-feature grouping processing on the triggering events through the alarm clustering engine based on a clustering algorithm and clustering features to obtain clustering grouping results, wherein the multi-features at least comprise time features, source features and template similarity features, and carrying out identification processing of related event modes and event abstracts on the clustering grouping results to generate the event clustering results.
In a feasible implementation mode, the trigger event, the initial context data and the event clustering result are subjected to preliminary fault diagnosis processing based on a large language model to obtain initial analysis diagnosis data, specifically, the initial analysis diagnosis data are obtained by structurally integrating the trigger event, the initial context data and the event clustering result to obtain an integrated data set, the integrated data set, network expert information and diagnosis fault description information are constructed into specific prompt information, the specific prompt information is sent to an LLM module, the LLM module is an artificial intelligent model with natural language understanding and generating capability, the inquiry processing of a network fault elimination knowledge base is used for carrying out preliminary diagnosis assumption on the network fault information in the integrated data set through the LLM module, and fault information gaps in the network fault information are identified to generate the initial analysis diagnosis data, and the network fault elimination knowledge base comprises a history case, a solution, a best practice, a manual segment and an error mode.
In a feasible implementation mode, the initial analysis diagnosis data is subjected to diagnosis updating processing under iterative refinement according to a function call request to obtain final analysis diagnosis data, and the method specifically comprises the steps of carrying out identification processing on response request information of an LLM module through a diagnosis engine, connecting the function call module to network equipment and executing a function call request command if the response request information contains the function call request, carrying out iterative fault analysis on the initial analysis diagnosis data through the executed function call request command to obtain new analysis diagnosis data, feeding the new analysis diagnosis data back to the LLM module, carrying out secondary diagnosis on the new analysis diagnosis data through the LLM module, identifying a secondary fault information gap in network fault information, and carrying out loop iteration until the final analysis diagnosis data with high confidence is output.
In a feasible implementation mode, the method comprises the steps of carrying out searching processing of a fault solution on final analysis diagnosis data through a preset solution generator to determine a network fault solving strategy, specifically comprising the steps of driving the solution generator by a diagnosis engine when the final analysis diagnosis data is obtained, carrying out searching processing of the fault solution on the final analysis diagnosis data through the solution generator based on a LLM module and a searching result of a network fault elimination knowledge base to obtain an initial fault solving strategy, and carrying out adaptive adjustment on the initial fault solving strategy according to initial context data to obtain the network fault solving strategy.
In one possible implementation, an execution agent is deployed into the network device, wherein the execution agent is a centralized API gateway under a lightweight agent, and the function call module is connected into the network device through the execution agent to receive and execute the function call request command.
In a possible implementation manner, the key features of the trigger event are identified through the clustering features, wherein the clustering features comprise a timestamp difference feature, a source identification feature, a log template feature and a fault severity level feature.
In a second aspect, the embodiment of the application further provides a network fault removal device under AI driving, which comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor so that the at least one processor can execute the network fault removal method under AI driving according to any one of the embodiments.
In a third aspect, an embodiment of the present application further provides a non-volatile computer storage medium, where the storage medium is a non-volatile computer readable storage medium, where at least one program is stored, where each program includes instructions that, when executed by a terminal, cause the terminal to perform a network failure removal method under AI driving as set forth in any one of the above embodiments.
The application provides a network fault removal method, equipment and medium under AI drive, and compared with the prior art, the embodiment of the application has the following beneficial technical effects:
1. the efficiency is obviously improved, the fault removal time is greatly shortened and the dependence on manual intervention is reduced through the automatic information collection, analysis and diagnosis process.
2. The accuracy is improved, namely, the comprehensive and deeper analysis can be performed by combining the reasoning capability of LLM, the domain knowledge of RAG and the real-time data of function call, and the misjudgment caused by insufficient information or experience deviation is reduced.
3. The automation level is enhanced, the end-to-end automation flow from fault triggering to solution proposal is realized, and the burden of network engineers is lightened.
4. The capability of processing complex problems is that complex network fault scenes can be dealt with, and the root causes are approximated gradually through iterative information acquisition and analysis.
5. The LLM limitation is overcome, the problem of 'illusion' is relieved through RAG, and the problem of lack of real-time data access is solved through function call, so that the LLM can be reliably applied to the professional network operation and maintenance field.
6. And the knowledge precipitation and utilization, namely the RAG knowledge base can continuously accumulate historical fault cases and solutions, so that the diagnosis capability of the system is continuously improved.
7. And processing information overload, namely effectively processing a large number of syslogs and alarms through an event clustering engine, extracting key modes and avoiding information flooding.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art. In the drawings:
fig. 1 is a flowchart of a method for removing network faults under AI driving according to an embodiment of the present application;
FIG. 2 is a flowchart of a fault diagnosis method based on a diagnosis engine according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a network fault clearing device under AI driving according to an embodiment of the present application.
Detailed Description
In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the key technology related to the present application is network operation and maintenance, including using Command Line Interface (CLI) or API to view device status, configuration, log, performance counter, etc. Common commands such as show bg summary, show interface status, show logging, etc. are the basis for daily barrier removal. Network Management Systems (NMS) provide network monitoring, alerting and basic automation capabilities, but often lack deep diagnostic and root cause analysis functionality. The Large Language Model (LLM) is excellent in natural language processing, has reasoning and generating capabilities, and brings potential for automatic operation and maintenance. Retrieval enhancement generation (RAG) alleviates the LLM "illusion" problem by incorporating an external knowledge base and provides it with domain-specific expertise. Function call (Function call) enables LLM to request execution of external functions, thereby obtaining real-time information or performing specific operations.
The application also provides an AI-driven network fault removal system, the core architecture of which comprises the following components working cooperatively:
LLM Core (LLM Core) module, the "brain" of the system, is responsible for reasoning, analysis and decision. Advanced LLM such as DeepSeek, claude, gemini, etc. can be selected.
RAG Module (RAG Module) is connected to an internally constructed network troubleshooting knowledge base (containing historical cases, solutions, best practices, manual segments, error patterns, etc., suggested to be implemented using a vector database). Providing relevant knowledge according to LLM inquiry, alleviating illusion and providing specialty.
The function call module (Function Calling Module) is used as a bridge between the LLM and the real-time network environment. Defining a series of function interfaces (corresponding to show commands or API calls), LLM can safely obtain real-time configuration, status, log, counter, etc. information by calling these functions. A well-defined function Schema (name, parameters, return format) is required.
The context collector (Context Collector) actively collects a set of basic context information (e.g., show bg summary, related interface status, etc.) in advance by the function call module according to the initial event type (e.g., BGP Down, syslog alert) at the start of the diagnostic procedure.
An event receiving interface (Event Ingestion Interface) receives a fault trigger signal from the monitoring system, log aggregator, or user input.
Syslog/alert clustering engine (Syslog/Alarm Clustering Engine) clusters events (based on time, source, template similarity, etc., DBSCAN, etc., algorithms may be used) when a large number of related events occur, identifies patterns (such as interface jitter), reduces noise, and provides the clustered results to LLM.
Diagnostic engines (diagnostic engines) the coordinator of the system orchestrates the entire workflow. Receiving event- > call context collector- > integration information for LLM- > parsing LLM response (including function call request) - > scheduling function call execution- > feedback result for LLM- > driven solution generation.
A solution generator (Solution Generator) generates specific repair suggestions or actions based on the LLM's final diagnostic conclusion and the related solutions retrieved by the RAG module.
The data source is that the system integrates the data from the real-time exchanger (obtained by function call), internal RAG fault removal database, internal configuration/planning database, LLM internal knowledge base and monitoring system.
The event receiving interface triggers the diagnosis engine- > the diagnosis engine to call the context collector to acquire initial data- > the diagnosis engine to send data+event (+ clustering result) to the LLM core- > LLM for analysis, possibly queries the RAG module, possibly generates a function call request- > the diagnosis engine to analyze the request, calls the function call module to execute- > the interaction of the function call module and the network equipment, returns the result- > the diagnosis engine to feed back the result to the LLM- > the LLM for iterative analysis (repeatedly queries the RAG, requests the function call) - > until the LLM determines the root cause- > the diagnosis engine to drive the solution generator, and combines the LLM conclusion and the RAG knowledge generation scheme.
It should be noted that the abbreviations and key term definitions of the present application include LLM (Large Language Model): large language model. An artificial intelligence model with powerful natural language understanding and generating capabilities. RAG (RETRIEVAL-Augmented Generation) search enhancement generation. Techniques for enhancing LLM responses using an external knowledge base in combination with information retrieval and text generation techniques. BGP (Border Gateway Protocol) border gateway protocol. Routing protocols for cores on the internet. Syslog (System Log) System Log. Standard protocols for delivering log messages in IP networks. NMS (Network MANAGEMENT SYSTEM) a Network management system. A system for monitoring and managing a computer network. AI (Artificial Intelligence) Artificial Intelligence. Function call. A mechanism that allows LLM to interact with external tools or APIs to perform certain operations (e.g., obtain real-time data). CLI (Command LINE INTERFACE) Command line interface. The user interacts with the computer by way of text commands. API (Application Programming Interface) application programming interface. Specification of interactions between software components. JSON (JavaScript Object Notation) a lightweight data exchange format. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) a Density-based clustering algorithm.
The embodiment of the application provides a network fault removal method under AI driving, as shown in fig. 1, the method specifically comprises the steps of S101-S105:
s101, based on a trigger event corresponding to the network fault information, the relevant network equipment is subjected to basic context information collection processing under relevant predefining, and initial context data are obtained.
Specifically, firstly, through the event receiving port, a fault trigger signal from other ends of the network is received. The network other end at least comprises a monitoring system end, a log aggregator end and a user input end. The fault trigger signal at least comprises a BGP neighbor Down alarm signal and a specific system log message.
Further, through a preset diagnosis engine, event type identification is carried out on the fault trigger signal, and trigger events under relevant network fault information are determined.
When a trigger event is identified, a context collector is invoked.
Further, by means of a context collector, and based on the time type of the trigger event, a set of said base context information most relevant under predefined conditions in the relevant network device is collected. The basic context information at least comprises neighbor state in a real-time routing protocol, real-time routing table abstract information and real-time related interface state.
Further, the basic context information is subjected to data structuring processing to obtain initial context data.
In one embodiment, fig. 2 is a flowchart of a fault diagnosis method based on a diagnosis engine according to an embodiment of the present application, and as shown in fig. 2, an event receiving interface receives a fault trigger signal (such as BGP neighbor Down alarm, specific Syslog message). The diagnostic engine identifies the event type and immediately invokes the context collector. The context collector proactively and purposefully obtains a set of predefined baseline context information from the relevant network device via the function call module (e.g., obtain show BGP summary for BGP Down show ip BGP neighbors < filtered_ip >, relevant interface status). Finally, the collected basic context data is structured to obtain initial context data (such as JSON).
S102, carrying out event clustering processing on the triggering events in a centralized alarm mode to obtain event clustering results.
Specifically, when a related event under a centralized alert is identified, an alert clustering engine is started.
Further, the clustering result is obtained by carrying out multi-feature grouping processing on the triggering event through the alarm clustering engine based on a clustering algorithm and clustering features. The multi-feature at least comprises a time feature, a source feature and a template similarity feature.
Further, the clustering grouping result is subjected to recognition processing of related event modes and event abstracts, and an event clustering result is generated.
As a possible implementation manner, the key features of the triggering event are identified through the clustering features. The clustering features comprise a timestamp difference feature, a source identification feature, a log template feature and a fault severity level feature.
In one implementation, as shown in FIG. 2, if a large number of relevant Syslog or alarms are received in a short time, the Syslog/alarm clustering engine is started. The engine groups events using clustering algorithms (e.g., DBSCAN) and features (time, source, message templates). Finally, the cluster grouping result is identified and processed by the related event mode and the event abstract to generate an event cluster result
A succinct pattern or summary of events (e.g. "interface X frequent UP/DOWN") is identified and output.
S103, performing primary fault diagnosis processing based on the large language model on the trigger event, the initial context data and the event clustering result to obtain initial analysis diagnosis data.
Specifically, the trigger event, the initial context data and the event clustering result are further required to be integrated in a structured manner to obtain an integrated data set.
Further, the integrated data set, the network expert information and the diagnostic trouble description information are constructed as specific prompt information, and the specific prompt information is sent to the LLM module. Wherein the LLM module is an artificial intelligent model with natural language understanding and generating capability.
Further, the LLM module performs preliminary diagnosis assumption on the network fault information in the integrated data set based on query processing of the network fault elimination knowledge base, and identifies fault information gaps in the network fault information to generate initial analysis diagnosis data. The network troubleshooting knowledge base comprises historical cases, solutions, best practices, manual fragments and error modes.
In one embodiment, as shown in FIG. 2, the diagnostic engine integrates trigger events, actively collected initial context data, and event cluster results (if any) into one structured input. And then constructing a specific Prompt (Prompt) to be sent to the LLM module core, wherein the specific Prompt comprises role setting (network expert), task description (fault diagnosis), input data and indication of the reason identified by the LLM module, and when the information is insufficient, additional information is requested through function call, and the RAG knowledge base can be queried. The LLM module understands and infers, possibly initiates a query (e.g., "BGP flapping common causes") to the RAG module, proposes preliminary diagnostic assumptions, and identifies information gaps. If more information is required, the LLM generates one or more function call requests according to a predefined Schema (JSON format, specifying function names and parameters, such as
get_interface_counters(interface_name='Ethernet0'))。
And S104, performing diagnosis updating processing under iterative refinement on the initial analysis diagnosis data according to the function call request to obtain final analysis diagnosis data.
Specifically, the response request information of the LLM module is first identified by the diagnostic engine. And if the response request information contains the function call request, connecting a function call module to the network equipment, and executing a function call request command.
Further, through the executed function call request command, iterative fault analysis is carried out on the initial analysis diagnosis data to obtain new analysis diagnosis data, and the new analysis diagnosis data is fed back to the LLM module.
Further, the LLM module is used for carrying out secondary diagnosis assumption on the new analysis diagnosis data and identifying secondary fault information gaps in the network fault information. And continuing to iterate the steps until the final analysis diagnosis data under high confidence is output.
As a possible implementation, an execution agent may also be deployed into the network device. Wherein the execution agent is a centralized API gateway under the lightweight agent. The function call module is coupled to the network device by the execution agent to receive and execute the function call request command.
In one embodiment, as shown in FIG. 2, the diagnostic engine parses the LLM response. And if the function call request is included, indicating the function call module to execute. The function call module is securely connected to the network device, executes the requested command (e.g., show interface counters Ethernet a 0), and obtains real-time output. The execution results (structured data or error information) are fed back to the LLM core by the diagnostic engine. LLM receives new data, performs reasoning in combination with previous analysis, and updates diagnostic decisions. The LLM may query the RAG module again or request more/different function calls based on the new determination. That is, the cycle of "LLM analysis- > request information- > function call acquisition- > LLM analysis" continues until LLM finds the root cause (with high confidence) or has explored a reasonable path, resulting in final analysis diagnostic data.
S105, searching and processing the fault solution to the final analysis and diagnosis data through a preset solution generator, and determining a network fault solution strategy.
Specifically, when the final analysis diagnostic data is obtained, the diagnostic engine drives the solution generator.
Further, the solution generator is used for searching and processing the fault solution to the final analysis and diagnosis data based on the LLM module and the search result of the network fault removal knowledge base, so as to obtain an initial fault solution strategy.
Further, according to the initial context data, the initial fault resolution strategy is adaptively adjusted, and the network fault resolution strategy is obtained.
In one implementation, as shown in FIG. 2, when the LLM determines the root cause, i.e., the determined final analysis diagnostic data, the diagnostic engine drives the solution generator. LLM is based on a final diagnostic concept solution. The solution generator (or LLM itself) queries the RAG database for validated solutions, configuration command examples, or operational steps related to the reason for the diagnosis. And combining initial context information (such as a network switch version) to adjust the advice, ensuring applicability, completing adaptive adjustment of the initial fault resolution strategy, and finally obtaining the network fault resolution strategy. The system may also output a diagnostic report containing the root cause, confidence level, and one or more specific, operational solution suggestions.
As a possible implementation, in terms of data representation and integration, it is emphasized that all data acquired from the network device (show command output) should be parsed into the structured JSON format, with metadata such as timestamp, device ID, etc. The RAG database uses a vector database (e.g., FAISS, milvus) and an appropriate Embedding model (e.g., BGE) to store and retrieve knowledge pieces, supporting semantic searching. The function call Schema needs to be strictly defined to ensure that the LLM can generate requests in the correct format.
As a possible implementation, in LLM interaction and prompt engineering, a LLM model with strong understanding ability to codes and technical documents is selected. Careful design of System hints (System Prompt) to set LLM roles, available tools (functions, RAGs) and expected output formats. A Task Prompt (Task Prompt) is dynamically built, containing the current context, new data, RAG information, and dialog history. A small amount of example Learning (Few-shot Learning) may be used to direct LLM to generate an output (e.g., a list of function call requests) in a particular format.
As one possible implementation, in terms of a function call implementation, a secure execution agent (a lightweight agent on a centralized API gateway or device) is deployed to receive and execute function call requests. The proxy is responsible for converting JSON requests into actual network device CLI commands (executed by SSH) or API calls. Strict authentication authorization must be achieved, limiting the operations that can be performed (avoiding dangerous commands). A robust error handling mechanism is needed to feed back execution failure information to the LLM.
As a possible implementation, in terms of the clustering engine implementation, a suitable clustering algorithm (e.g., DBSCAN for finding arbitrarily shaped clusters and noise, or an algorithm based on log template similarity) is selected based on the data characteristics. Feature engineering is critical and may include timestamp differences, source identification (device/module/process), log templates (extracted with Drain, etc. tools), severity level. The goal is to reduce noise and refine the information.
The embodiment of the application also has the following steps:
The system integration innovation is that three technologies of a Large Language Model (LLM), retrieval enhancement generation (RAG) and Function call (Function call) are specifically and systematically integrated and applied to a real-time troubleshooting scene of a large-scale network environment for the first time.
And an active context prefetching mechanism, wherein before the LLM performs primary analysis, the system actively collects a group of basic and most probably related real-time context information through function call according to the fault type, and provides necessary initial information for the LLM so as to avoid guessing in information vacuum. This is a key step that is distinguished from passive waiting for LLM instructions.
The LLM driven iterative diagnosis flow is characterized in that a closed-loop iterative diagnosis workflow is designed, wherein LLM analysis- > identification information gap- > generation function call request- > system execution request is used for acquiring real-time data- > data and feeding the real-time data back to the LLM- > LLM for re-analysis based on new data. This dynamic interaction simulates expert thinking and is the core methodology.
And when a large number of syslogs or alarms are processed, an event clustering engine is introduced to preprocess and pattern recognition the original events, and the extracted information (instead of the original lengthy log) is input into the LLM, so that the efficiency and the accuracy of the LLM analysis are improved.
Cooperation of RAG with function call RAG (providing historical experience and static knowledge) and function call (providing real-time dynamic data) are used cooperatively throughout the diagnostic and solution generation process, making LLM decisions based on a combination of historical verification knowledge and current network live.
The whole system and method are designed for large-scale network environment and comprise the design of function call interfaces (corresponding to network equipment commands), the content of RAG knowledge base (network equipment specific cases and documents) and the processing logic of common network problems (such as BGP and interface problems).
The application integrates the reasoning capability of LLM, the domain knowledge enhancement capability of RAG and the real-time data acquisition capability of function call, and adopts the workflow of active context collection and iterative diagnosis. Has certain universality. In addition to large-scale network environments, the method can theoretically migrate applications to troubleshooting other complex IT systems or network environments, such as:
1. Other network operating systems, such as Cisco IOS/NX-OS, juniper Junos, etc., need only adapt the function call module to execute the commands/APIs of the corresponding platform and construct the corresponding RAG knowledge base.
2. And diagnosing faults of the cloud platform, such as faults of virtual machines, containers, network services and the like in cloud environments such as AWS, azure, GCP and the like, calling an API (application program interface) capable of docking the cloud platform by a function, wherein a RAG library comprises cloud service documents and cases.
3. The fault diagnosis of the Server operating system, such as Linux, windows Server, and the like, the function call can execute system commands, inquire log files, and the RAG library contains OS documents and common problem solutions.
4. Application system fault diagnosis, namely, for a complex distributed application system, function call can query monitoring indexes, logs and configuration of an application, and a RAG library comprises application architecture documents and historical fault records.
At the time of migration, the following adjustments are mainly required:
1) And (4) adapting a function call interface, namely re-implementing according to the command, the API or the tool of the target platform.
2) And constructing an RAG knowledge base of a specific field, namely collecting and arranging documents, cases, best practices and the like related to the target field.
3) And (3) adjusting a context collection strategy, namely defining specific content of initial context collection according to the characteristics of common problems in the target field.
4) Fine-tuning LLM cues and behaviors-it may be desirable to adjust promt so that LLM better understands terms and concepts of the target domain.
5) Redesign/adapt event clustering logic-if the target domain log/event format and pattern are different, the clustering algorithm and features need to be adjusted.
In addition, the embodiment of the present application further provides a network fault removal device under AI driving, as shown in fig. 3, where the network fault removal device 300 under AI driving specifically includes:
At least one processor 301. And a memory 302 communicatively coupled to the at least one processor 301. Wherein the memory 302 stores instructions executable by the at least one processor 301 to enable the at least one processor to perform:
Based on a triggering event corresponding to the network fault information, carrying out basic context information collection processing under relevant predefining on related network equipment to obtain initial context data;
Carrying out event clustering processing on the triggering events in a centralized alarm mode to obtain event clustering results;
Performing primary fault diagnosis processing based on a large language model on the trigger event, the initial context data and the event clustering result to obtain initial analysis diagnosis data;
According to the function call request, performing diagnosis updating processing under iterative refinement on the initial analysis diagnosis data to obtain final analysis diagnosis data;
And carrying out fault solution searching processing on the final analysis and diagnosis data through a preset solution generator to determine a network fault solution strategy.
According to the embodiment of the application, through the automatic information collection, analysis and diagnosis processes, the fault removal time is greatly shortened, and the dependence on manual intervention is reduced. By combining LLM (Large Language Model) reasoning capability, RAG (RETRIEVAL-Augmented Generation) domain knowledge and function call real-time data, more comprehensive and deeper analysis can be performed, and erroneous judgment caused by insufficient information or experience deviation is reduced. The end-to-end automatic flow from fault triggering to solution proposal is realized, and the burden of network engineers is reduced. The method can cope with complex network fault scenes, and gradually approaches the root cause through iterative information acquisition and analysis. The problem of 'illusion' is relieved through the RAG module, and the problem of lack of real-time data access is solved through function call, so that the LLM module can be reliably applied to the professional network operation and maintenance field. The RAG knowledge base can continuously accumulate historical fault cases and solutions, so that the diagnostic capability of the system is continuously improved. Through the event clustering engine, a large number of syslog (System Log) and alarms can be effectively processed, key modes are extracted, and information flooding is avoided.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the apparatus and medium embodiments, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the section of the method embodiments being relevant.
The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1.一种AI驱动下的网络故障排除方法,其特征在于,所述方法包括:1. An AI-driven network troubleshooting method, characterized in that the method comprises: 基于网络故障信息对应的触发事件,对相关网络设备进行有关预定义下的基础上下文信息收集处理,得到初始上下文数据;Based on the trigger event corresponding to the network fault information, collect and process the predefined basic context information of the relevant network devices to obtain initial context data; 对所述触发事件进行有关集中告警的事件聚类处理,得到事件聚类结果;Performing event clustering processing on the triggering event related to the centralized alarm to obtain an event clustering result; 将所述触发事件、所述初始上下文数据以及所述事件聚类结果进行基于大型语言模型的初步故障诊断处理,得到初始分析诊断数据;Performing preliminary fault diagnosis processing on the trigger event, the initial context data, and the event clustering result based on a large language model to obtain initial analysis and diagnosis data; 根据函数调用请求,对所述初始分析诊断数据进行迭代细化下的诊断更新处理,得到最终分析诊断数据;According to the function call request, the initial analysis and diagnosis data is subjected to iterative refinement and diagnosis update processing to obtain final analysis and diagnosis data; 通过预设的解决方案生成器,对所述最终分析诊断数据进行故障解决方案的查寻处理,确定出网络故障解决策略。The final analysis and diagnosis data is searched and processed for a fault solution by a preset solution generator to determine a network fault solution strategy. 2.根据权利要求1所述的一种AI驱动下的网络故障排除方法,其特征在于,基于网络故障信息对应的触发事件,对相关网络设备进行有关预定义下的基础上下文信息收集处理,得到初始上下文数据,具体包括:2. The AI-driven network troubleshooting method according to claim 1, characterized in that, based on a trigger event corresponding to network fault information, predefined basic context information is collected and processed for relevant network devices to obtain initial context data, specifically comprising: 通过事件接收口,接收来自网络其他端的故障触发信号;其中,所述网络其他端至少包括:监控系统端、日志聚合器端以及用户输入端;所述故障触发信号至少包括:BGP邻居Down告警信号以及特定系统日志消息;Receiving a fault trigger signal from other network terminals through an event receiving port; wherein the other network terminals include at least a monitoring system terminal, a log aggregator terminal, and a user input terminal; the fault trigger signal includes at least a BGP neighbor down alarm signal and a specific system log message; 通过预设的诊断引擎,对所述故障触发信号进行事件类型识别,确定出有关网络故障信息下的触发事件;Using a preset diagnostic engine, the fault trigger signal is subjected to event type identification to determine the triggering event associated with the network fault information; 当识别到所述触发事件时,调用上下文收集器;When the trigger event is identified, calling the context collector; 通过所述上下文收集器,并基于所述触发事件的时间类型,收集所述相关网络设备中一组处于预定义下最相关的所述基础上下文信息;其中,所述基础上下文信息至少包括:实时路由协议中的邻居状态、实时路由表摘要信息以及实时相关接口状态;The context collector collects a set of predefined most relevant basic context information of the relevant network devices based on the time type of the triggering event; wherein the basic context information includes at least: neighbor status, real-time routing table summary information, and real-time related interface status in a real-time routing protocol; 对所述基础上下文信息进行数据结构化处理,得到所述初始上下文数据。The basic context information is subjected to data structuring processing to obtain the initial context data. 3.根据权利要求1所述的一种AI驱动下的网络故障排除方法,其特征在于,对所述触发事件进行有关集中告警的事件聚类处理,得到事件聚类结果,具体包括:3. The AI-driven network troubleshooting method according to claim 1, wherein the triggering event is subjected to event clustering processing related to centralized alarms to obtain event clustering results, specifically comprising: 当识别到处于集中告警下的相关事件时,启动告警聚类引擎;When relevant events under centralized alarm are identified, the alarm clustering engine is started; 通过所述告警聚类引擎,并基于聚类算法与聚类特征,对所述触发事件进行多特征分组处理,得到聚类分组结果;其中,所述多特征至少包括:时间特征、来源特征、模版相似性特征;The triggering event is subjected to multi-feature grouping processing by the alarm clustering engine based on the clustering algorithm and clustering features to obtain a clustering grouping result; wherein the multi-features include at least: time feature, source feature, and template similarity feature; 对所述聚类分组结果进行有关事件模式与事件摘要的识别处理,生成所述事件聚类结果。The clustering grouping results are subjected to identification processing related to event patterns and event summaries to generate the event clustering results. 4.根据权利要求1所述的一种AI驱动下的网络故障排除方法,其特征在于,将所述触发事件、所述初始上下文数据以及所述事件聚类结果进行基于大型语言模型的初步故障诊断处理,得到初始分析诊断数据,具体包括:4. The AI-driven network troubleshooting method according to claim 1, wherein the trigger event, the initial context data, and the event clustering result are subjected to preliminary fault diagnosis processing based on a large language model to obtain initial analysis and diagnosis data, specifically comprising: 将所述触发事件、所述初始上下文数据以及所述事件聚类结果进行结构化整合,得到整合数据组;Performing structured integration on the trigger event, the initial context data, and the event clustering result to obtain an integrated data group; 将所述整合数据组、网络专家信息以及诊断故障描述信息构建为特定提示信息,并将所述特定提示信息发送到LLM模块;其中,所述LLM模块为具备自然语言理解和生成能力的人工智能模型;Constructing the integrated data set, network expert information, and diagnostic fault description information into specific prompt information, and sending the specific prompt information to an LLM module; wherein the LLM module is an artificial intelligence model with natural language understanding and generation capabilities; 通过所述LLM模块,基于网络故障排除知识库的查询处理,对所述整合数据组中的网络故障信息进行初步诊断假设,并识别所述网络故障信息中的故障信息缺口,生成所述初始分析诊断数据;其中,所述网络故障排除知识库包含:历史案例、解决方案、最佳实践、手册片段以及错误模式。The LLM module generates the initial analysis and diagnostic data by performing preliminary diagnostic hypotheses on the network fault information in the integrated data set based on query processing of a network troubleshooting knowledge base, and identifying fault information gaps in the network fault information. The network troubleshooting knowledge base includes historical cases, solutions, best practices, manual snippets, and error patterns. 5.根据权利要求1所述的一种AI驱动下的网络故障排除方法,其特征在于,根据函数调用请求,对所述初始分析诊断数据进行迭代细化下的诊断更新处理,得到最终分析诊断数据,具体包括:5. The AI-driven network troubleshooting method according to claim 1, wherein the method further comprises: performing an iterative and refined diagnostic update process on the initial analysis and diagnosis data according to a function call request to obtain final analysis and diagnosis data; 通过诊断引擎,对LLM模块的响应请求信息进行识别处理;Identify and process the response request information of the LLM module through the diagnosis engine; 若所述响应请求信息包含所述函数调用请求,则将函数调用模块连接到网络设备中,并执行函数调用请求命令;If the response request information includes the function call request, connecting the function call module to the network device and executing the function call request command; 通过执行后的所述函数调用请求命令,对所述初始分析诊断数据进行迭代故障分析,得到新分析诊断数据,并将所述新分析诊断数据反馈到所述LLM模块中;Performing iterative fault analysis on the initial analysis and diagnosis data by executing the function call request command to obtain new analysis and diagnosis data, and feeding the new analysis and diagnosis data back to the LLM module; 通过所述LLM模块,对所述新分析诊断数据进行二次诊断假设,并识别网络故障信息中的二次故障信息缺口;循环迭代执行,直至输出高置信度的所述最终分析诊断数据。Through the LLM module, a secondary diagnosis hypothesis is made on the new analysis and diagnosis data, and a secondary fault information gap in the network fault information is identified; and the loop is iteratively executed until the final analysis and diagnosis data with high confidence is output. 6.根据权利要求1所述的一种AI驱动下的网络故障排除方法、设备及介质方法,其特征在于,通过预设的解决方案生成器,对所述最终分析诊断数据进行故障解决方案的查寻处理,确定出网络故障解决策略,具体包括:6. The AI-driven network troubleshooting method, device, and medium method according to claim 1, wherein a preset solution generator searches for a fault solution on the final analysis and diagnosis data to determine a network fault resolution strategy, specifically comprising: 当获取到所述最终分析诊断数据时,诊断引擎驱动所述解决方案生成器;When the final analysis diagnosis data is obtained, the diagnosis engine drives the solution generator; 通过所述解决方案生成器,并基于LLM模块以及网络故障排除知识库的检索结果,对所述最终分析诊断数据进行故障解决方案的查寻处理,得到初始故障解决策略;By means of the solution generator, and based on the search results of the LLM module and the network troubleshooting knowledge base, the final analysis and diagnosis data is searched for a fault solution to obtain an initial fault resolution strategy; 根据初始上下文数据,对所述初始故障解决策略进行适应性调整,得到所述网络故障解决策略。The initial fault resolution strategy is adaptively adjusted according to the initial context data to obtain the network fault resolution strategy. 7.根据权利要求5所述的一种AI驱动下的网络故障排除方法,其特征在于,7. The AI-driven network troubleshooting method according to claim 5, characterized in that: 将执行代理部署到所述网络设备中;其中,所述执行代理为轻量级代理下的中心化API网关;Deploy an execution agent to the network device; wherein the execution agent is a centralized API gateway under a lightweight agent; 通过所述执行代理,将所述函数调用模块连接到网络设备中,以接收以及执行所述函数调用请求命令。The function call module is connected to the network device through the execution agent to receive and execute the function call request command. 8.根据权利要求3所述的一种AI驱动下的网络故障排除方法,其特征在于,8. The AI-driven network troubleshooting method according to claim 3, characterized in that: 通过所述聚类特征,对所述触发事件进行关键特征的识别处理;Performing key feature identification processing on the trigger event through the clustering features; 其中,所述聚类特征包括:时间戳差特征、来源标识特征、日志模板特征以及故障严重性级别特征。The clustering features include: timestamp difference feature, source identification feature, log template feature and fault severity level feature. 9.一种AI驱动下的网络故障排除设备,其特征在于,所述设备包括:9. An AI-driven network troubleshooting device, characterized in that the device includes: 至少一个处理器;以及,at least one processor; and, 与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein, 所述存储器存储有能够被所述至少一个处理器执行的指令,以使所述至少一个处理器能够执行根据权利要求1-8任一项所述的一种AI驱动下的网络故障排除方法。The memory stores instructions that can be executed by the at least one processor, so that the at least one processor can execute the AI-driven network troubleshooting method according to any one of claims 1-8. 10.一种非易失性计算机存储介质,其特征在于,所述存储介质为非易失性计算机可读存储介质,所述非易失性计算机可读存储介质存储有至少一个程序,每个所述程序包括指令,所述指令当被终端执行时,使所述终端执行根据权利要求1-8任一项所述的一种AI驱动下的网络故障排除方法。10. A non-volatile computer storage medium, characterized in that the storage medium is a non-volatile computer-readable storage medium, and the non-volatile computer-readable storage medium stores at least one program, each of which includes instructions, and when the instructions are executed by a terminal, the terminal executes an AI-driven network troubleshooting method according to any one of claims 1-8.
CN202510693202.7A 2025-05-27 2025-05-27 Network fault removal method, equipment and medium under AI drive Pending CN120567646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510693202.7A CN120567646A (en) 2025-05-27 2025-05-27 Network fault removal method, equipment and medium under AI drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510693202.7A CN120567646A (en) 2025-05-27 2025-05-27 Network fault removal method, equipment and medium under AI drive

Publications (1)

Publication Number Publication Date
CN120567646A true CN120567646A (en) 2025-08-29

Family

ID=96832916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510693202.7A Pending CN120567646A (en) 2025-05-27 2025-05-27 Network fault removal method, equipment and medium under AI drive

Country Status (1)

Country Link
CN (1) CN120567646A (en)

Similar Documents

Publication Publication Date Title
EP3798846B1 (en) Operation and maintenance system and method
CN107832196B (en) Monitoring device and monitoring method for abnormal content of real-time log
CN108521339B (en) Feedback type node fault processing method and system based on cluster log
US9891971B1 (en) Automating the production of runbook workflows
CN113553238B (en) Cloud platform resource abnormality automatic processing system and method
CN113760677B (en) Abnormal link analysis method, device, equipment and storage medium
CN105577411A (en) Cloud service monitoring method and device based on service origin
CN119292810A (en) Fault alarm self-healing system and method
CN115150252A (en) A network fault detection method, system and device
CN115280741A (en) System and method for autonomous monitoring and recovery in hybrid energy management
WO2015187001A2 (en) System and method for managing resources failure using fast cause and effect analysis in a cloud computing system
CN118550791A (en) Operation and maintenance management method, device and equipment of cloud server and storage medium
CN112612802A (en) Real-time data middlebox processing method, device and platform
CN118939380A (en) Cluster evaluation method, device, electronic device and storage medium
CN119885168A (en) Virtual machine mirror image static scanning method and system based on super fusion platform
CN120602308A (en) A data center intelligent operation and maintenance monitoring method and system
WO2024169467A1 (en) Fault location method for distributed network, network device, and storage medium
CN120994446A (en) Methods, apparatus, and equipment for alarm event root cause analysis based on large language models and business topology
CN120880873A (en) Automatic inspection method and device and readable storage medium
CN120856528A (en) An integrated information automation monitoring system and method for edge devices
CN120567646A (en) Network fault removal method, equipment and medium under AI drive
CN118484482A (en) Big data analysis processing method for digital information
CN114780578A (en) A query statement processing method and related device
CN121542064B (en) Methods, apparatus, equipment, storage media, and software products for operating cloud resources.
Kandan et al. A Generic Log Analyzer for automated troubleshooting in container orchestration system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination