Disclosure of Invention
The embodiments of the application provide an AI (artificial intelligence)-driven network troubleshooting method, device and medium, which address the technical problems of existing troubleshooting methods in large-scale network environments: low efficiency, proneness to error, a low degree of automation, a lack of deep diagnostic capability, and an inability to effectively combine real-time data with domain knowledge.
The embodiment of the application adopts the following technical scheme:
In a first aspect, an embodiment of the application provides an AI (artificial intelligence)-driven network troubleshooting method, comprising: collecting, based on a trigger event corresponding to network fault information, predefined basic context information from the relevant network devices to obtain initial context data; performing event clustering on the trigger events to obtain an event clustering result; performing, based on a large language model, preliminary fault diagnosis on the trigger event, the initial context data and the event clustering result to obtain initial analysis diagnosis data; iteratively refining the initial analysis diagnosis data according to function call requests to obtain final analysis diagnosis data; and searching, by a preset solution generator, for a fault solution for the final analysis diagnosis data to determine a network fault resolution strategy.
According to the embodiment of the application, automating information collection, analysis and diagnosis greatly shortens troubleshooting time and reduces the dependence on manual intervention. Combining the reasoning capability of the LLM (Large Language Model), the domain knowledge supplied by RAG (Retrieval-Augmented Generation) and the real-time data obtained through function calls allows a more comprehensive and deeper analysis, and reduces misjudgments caused by insufficient information or biased experience. An end-to-end automated flow from fault triggering to solution proposal is realized, which reduces the burden on network engineers. The method can cope with complex network fault scenarios and approaches the root cause step by step through iterative information acquisition and analysis. The RAG module mitigates the "hallucination" problem and function calls solve the lack of real-time data access, so that the LLM module can be applied reliably in the professional network operation and maintenance field. The RAG knowledge base can continuously accumulate historical fault cases and solutions, so that the diagnostic capability of the system keeps improving. The event clustering engine processes large volumes of syslog (System Log) messages and alarms effectively, extracts key patterns and avoids information flooding.
In a possible implementation, collecting, based on the trigger event corresponding to network fault information, predefined basic context information from the relevant network devices to obtain initial context data specifically comprises: receiving, through an event receiving interface, fault trigger signals from other network endpoints, wherein the other network endpoints at least comprise a monitoring system, a log aggregator and a user input terminal, and the fault trigger signals at least comprise a BGP neighbor Down alarm signal and specific system log messages; identifying, by a preset diagnostic engine, the event type of the fault trigger signal to determine the trigger event for the relevant network fault information; invoking a context collector when the trigger event is identified; collecting, by the context collector and based on the event type of the trigger event, a set of predefined, most relevant basic context information from the relevant network devices, wherein the basic context information at least comprises the real-time neighbor state of the routing protocol, real-time routing table summary information and the real-time state of the relevant interfaces; and structuring the basic context information to obtain the initial context data.
In a possible implementation, performing event clustering on the trigger events to obtain an event clustering result specifically comprises: starting an alarm clustering engine when a burst of related events is identified; grouping the trigger events by multiple features, through the alarm clustering engine and based on a clustering algorithm and clustering features, to obtain a cluster grouping result, wherein the multiple features at least comprise a time feature, a source feature and a template similarity feature; and identifying the event patterns and event summaries in the cluster grouping result to generate the event clustering result.
In a possible implementation, performing preliminary fault diagnosis based on a large language model on the trigger event, the initial context data and the event clustering result to obtain initial analysis diagnosis data specifically comprises: integrating the trigger event, the initial context data and the event clustering result in a structured manner to obtain an integrated data set; constructing specific prompt information from the integrated data set, network expert role information and a description of the fault diagnosis task, and sending the specific prompt information to an LLM module, wherein the LLM module is an artificial intelligence model with natural language understanding and generation capabilities; and forming, by the LLM module and based on queries to a network troubleshooting knowledge base, a preliminary diagnostic hypothesis for the network fault information in the integrated data set, and identifying fault information gaps in the network fault information, to generate the initial analysis diagnosis data, wherein the network troubleshooting knowledge base comprises historical cases, solutions, best practices, manual excerpts and error patterns.
In a possible implementation, iteratively refining the initial analysis diagnosis data according to function call requests to obtain final analysis diagnosis data specifically comprises: parsing, by the diagnostic engine, the response of the LLM module; if the response contains a function call request, connecting a function call module to the network device and executing the requested command; performing iterative fault analysis on the initial analysis diagnosis data using the result of the executed command to obtain new analysis diagnosis data, and feeding the new analysis diagnosis data back to the LLM module; performing, by the LLM module, a second round of diagnosis on the new analysis diagnosis data and identifying further fault information gaps in the network fault information; and repeating this loop until final analysis diagnosis data with high confidence is output.
In a possible implementation, searching, by a preset solution generator, for a fault solution for the final analysis diagnosis data to determine a network fault resolution strategy specifically comprises: driving the solution generator by the diagnostic engine when the final analysis diagnosis data is obtained; searching, by the solution generator and based on the LLM module and the retrieval results from the network troubleshooting knowledge base, for a fault solution for the final analysis diagnosis data to obtain an initial fault resolution strategy; and adapting the initial fault resolution strategy according to the initial context data to obtain the network fault resolution strategy.
In one possible implementation, an execution agent is deployed for the network device, wherein the execution agent is a centralized API gateway or a lightweight agent on the device, and the function call module connects to the network device through the execution agent, which receives and executes the function call request command.
In a possible implementation manner, the key features of the trigger event are identified through the clustering features, wherein the clustering features comprise a timestamp difference feature, a source identification feature, a log template feature and a fault severity level feature.
In a second aspect, the embodiment of the application further provides a network fault removal device under AI driving, which comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor so that the at least one processor can execute the network fault removal method under AI driving according to any one of the embodiments.
In a third aspect, an embodiment of the present application further provides a non-volatile computer storage medium, where the storage medium is a non-volatile computer readable storage medium, where at least one program is stored, where each program includes instructions that, when executed by a terminal, cause the terminal to perform a network failure removal method under AI driving as set forth in any one of the above embodiments.
The application provides an AI-driven network troubleshooting method, device and medium. Compared with the prior art, the embodiments of the application have the following beneficial technical effects:
1. Significantly improved efficiency: automating information collection, analysis and diagnosis greatly shortens troubleshooting time and reduces the dependence on manual intervention.
2. Improved accuracy: combining the reasoning capability of the LLM, the domain knowledge of RAG and the real-time data of function calls allows a more comprehensive and deeper analysis, and reduces misjudgments caused by insufficient information or biased experience.
3. A higher level of automation: an end-to-end automated flow from fault triggering to solution proposal is realized, which lightens the burden on network engineers.
4. Ability to handle complex problems: complex network fault scenarios can be handled, and the root cause is approached step by step through iterative information acquisition and analysis.
5. LLM limitations are overcome: RAG mitigates the "hallucination" problem and function calls solve the lack of real-time data access, so that the LLM can be applied reliably in the professional network operation and maintenance field.
6. Knowledge accumulation and reuse: the RAG knowledge base can continuously accumulate historical fault cases and solutions, so that the diagnostic capability of the system keeps improving.
7. Information overload is handled: the event clustering engine processes large volumes of syslog messages and alarms effectively, extracts key patterns and avoids information flooding.
Detailed Description
In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the key technologies related to the present application are as follows. Network operation and maintenance involves using a Command Line Interface (CLI) or an API to view device status, configuration, logs, performance counters and so on; common commands such as show bgp summary, show interface status and show logging are the basis of daily troubleshooting. Network Management Systems (NMS) provide network monitoring, alerting and basic automation capabilities, but often lack deep diagnostic and root cause analysis functionality. The Large Language Model (LLM) excels at natural language processing, has reasoning and generation capabilities, and opens up potential for automated operation and maintenance. Retrieval-Augmented Generation (RAG) mitigates the LLM "hallucination" problem by incorporating an external knowledge base, and supplies the LLM with domain-specific expertise. Function calling enables the LLM to request the execution of external functions, thereby obtaining real-time information or performing specific operations.
The application also provides an AI-driven network fault removal system, the core architecture of which comprises the following components working cooperatively:
LLM core module (LLM Core): the "brain" of the system, responsible for reasoning, analysis and decision making. An advanced LLM such as DeepSeek, Claude or Gemini can be selected.
RAG module (RAG Module): connected to an internally built network troubleshooting knowledge base (containing historical cases, solutions, best practices, manual excerpts, error patterns and so on; a vector database is the suggested implementation). It supplies relevant knowledge in response to LLM queries, mitigating hallucination and providing domain expertise.
Function call module (Function Calling Module): the bridge between the LLM and the real-time network environment. It defines a series of function interfaces (corresponding to show commands or API calls); by calling these functions, the LLM can safely obtain real-time configuration, status, log and counter information. A well-defined function Schema (name, parameters, return format) is required.
Context collector (Context Collector): at the start of the diagnostic procedure, it proactively collects a set of basic context information in advance (e.g., show bgp summary, the status of related interfaces) through the function call module, according to the initial event type (e.g., BGP Down, syslog alert).
Event receiving interface (Event Ingestion Interface): receives fault trigger signals from the monitoring system, the log aggregator or user input.
Syslog/alarm clustering engine (Syslog/Alarm Clustering Engine): when a large number of related events occur, it clusters the events (based on time, source, template similarity and so on; an algorithm such as DBSCAN may be used), identifies patterns (such as interface flapping), reduces noise, and provides the clustered result to the LLM.
Diagnostic engine (Diagnostic Engine): the coordinator of the system, which orchestrates the entire workflow: receive event -> call the context collector -> integrate the information for the LLM -> parse the LLM response (including function call requests) -> schedule function call execution -> feed the results back to the LLM -> drive solution generation.
Solution generator (Solution Generator): generates specific repair suggestions or actions based on the LLM's final diagnostic conclusion and the related solutions retrieved by the RAG module.
Data sources: the system integrates data from the live switches (obtained through function calls), the internal RAG troubleshooting database, the internal configuration/planning database, the LLM's internal knowledge and the monitoring system.
The overall interaction flow is as follows: the event receiving interface triggers the diagnostic engine -> the diagnostic engine calls the context collector to acquire initial data -> the diagnostic engine sends the data and the event (plus the clustering result, if any) to the LLM core -> the LLM analyzes, possibly queries the RAG module, and possibly generates a function call request -> the diagnostic engine parses the request and calls the function call module to execute it -> the function call module interacts with the network device and returns the result -> the diagnostic engine feeds the result back to the LLM -> the LLM analyzes iteratively (repeatedly querying the RAG module and requesting function calls) until it determines the root cause -> the diagnostic engine drives the solution generator, which combines the LLM conclusion with RAG knowledge to generate a solution.
It should be noted that the abbreviations and key terms used in the present application are defined as follows:
LLM (Large Language Model): an artificial intelligence model with powerful natural language understanding and generation capabilities.
RAG (Retrieval-Augmented Generation): a technique that combines information retrieval with text generation, using an external knowledge base to enhance LLM responses.
BGP (Border Gateway Protocol): the core routing protocol of the Internet.
Syslog (System Log): a standard protocol for delivering log messages in IP networks.
NMS (Network Management System): a system for monitoring and managing a computer network.
AI (Artificial Intelligence).
Function call (Function Calling): a mechanism that allows the LLM to interact with external tools or APIs to perform certain operations (e.g., obtaining real-time data).
CLI (Command Line Interface): a way for the user to interact with a computer through text commands.
API (Application Programming Interface): a specification of the interactions between software components.
JSON (JavaScript Object Notation): a lightweight data exchange format.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a density-based clustering algorithm.
The embodiment of the application provides a network fault removal method under AI driving, as shown in fig. 1, the method specifically comprises the steps of S101-S105:
S101, based on a trigger event corresponding to the network fault information, predefined basic context information is collected from the relevant network devices to obtain initial context data.
Specifically, a fault trigger signal from other network endpoints is first received through the event receiving interface. The other network endpoints at least comprise a monitoring system, a log aggregator and a user input terminal. The fault trigger signal at least comprises a BGP neighbor Down alarm signal and specific system log messages.
Further, through a preset diagnosis engine, event type identification is carried out on the fault trigger signal, and trigger events under relevant network fault information are determined.
When a trigger event is identified, a context collector is invoked.
Further, the context collector collects, based on the event type of the trigger event, a set of predefined, most relevant basic context information from the relevant network devices. The basic context information at least comprises the real-time neighbor state of the routing protocol, real-time routing table summary information and the real-time state of the relevant interfaces.
Further, the basic context information is subjected to data structuring processing to obtain initial context data.
In one embodiment, fig. 2 is a flowchart of a fault diagnosis method based on a diagnostic engine according to an embodiment of the present application. As shown in fig. 2, the event receiving interface receives a fault trigger signal (such as a BGP neighbor Down alarm or a specific Syslog message). The diagnostic engine identifies the event type and immediately invokes the context collector. The context collector proactively and purposefully obtains a set of predefined baseline context information from the relevant network device via the function call module (for example, for BGP Down: show bgp summary, show ip bgp neighbors <filtered_ip>, and the status of the relevant interfaces). Finally, the collected basic context data is structured (for example, as JSON) to obtain the initial context data.
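The collection step above can be illustrated with a minimal sketch. It assumes a hypothetical table `BASELINE_COMMANDS` that maps each event type to its predefined commands, a hypothetical `peer_ip` parameter, and a `run_command` callable standing in for the function call module; none of these names come from the application itself.

```python
import json
from datetime import datetime, timezone

# Hypothetical mapping from event type to predefined baseline commands.
BASELINE_COMMANDS = {
    "BGP_NEIGHBOR_DOWN": [
        "show bgp summary",
        "show ip bgp neighbors {peer_ip}",
        "show interface status",
    ],
    "INTERFACE_FLAP": ["show interface status", "show logging"],
}


def collect_initial_context(event, run_command):
    """Run the predefined commands for the event type and structure the output."""
    results = []
    for template in BASELINE_COMMANDS.get(event["type"], []):
        # Fill in event-specific parameters such as the failed peer address.
        command = template.format(**event.get("params", {}))
        output = run_command(event["device_id"], command)
        results.append({"command": command, "output": output})
    return {
        "device_id": event["device_id"],
        "event_type": event["type"],
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }


def fake_run_command(device_id, command):
    # Stand-in for the function call module; returns canned text only.
    return f"<output of '{command}' on {device_id}>"


event = {
    "type": "BGP_NEIGHBOR_DOWN",
    "device_id": "leaf-01",
    "params": {"peer_ip": "10.0.0.2"},
}
initial_context = collect_initial_context(event, fake_run_command)
print(json.dumps(initial_context, indent=2))
```

Keeping the command table as data rather than code means a new event type can be supported by adding one entry, which is one plausible reading of "predefined" in this step.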
S102, event clustering is performed on the trigger events when related alarms arrive in a burst, to obtain an event clustering result.
Specifically, when a burst of related events is identified, the alarm clustering engine is started.
Further, the alarm clustering engine groups the trigger events by multiple features, based on a clustering algorithm and clustering features, to obtain a cluster grouping result. The multiple features at least comprise a time feature, a source feature and a template similarity feature.
Further, the event patterns and event summaries in the cluster grouping result are identified to generate the event clustering result.
As a possible implementation manner, the key features of the triggering event are identified through the clustering features. The clustering features comprise a timestamp difference feature, a source identification feature, a log template feature and a fault severity level feature.
In one implementation, as shown in FIG. 2, if a large number of related Syslog messages or alarms are received within a short time, the Syslog/alarm clustering engine is started. The engine groups the events using a clustering algorithm (e.g., DBSCAN) and features (time, source, message template). Finally, the event patterns and event summaries in the cluster grouping result are identified to generate the event clustering result; that is, a succinct pattern or summary of the events (e.g., "interface X frequent UP/DOWN") is identified and output.
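As a rough illustration of the grouping described above, the sketch below buckets events by source and message template and then splits each bucket into bursts by time gap. This is a deliberately simplified stand-in, not DBSCAN; the field names (`ts`, `source`, `template`) and the thresholds `max_gap` and `min_size` are assumptions made for the example.

```python
from collections import defaultdict


def split_bursts(bucket, max_gap):
    """Split time-ordered events into bursts separated by more than max_gap seconds."""
    bursts = [[bucket[0]]]
    for ev in bucket[1:]:
        if ev["ts"] - bursts[-1][-1]["ts"] <= max_gap:
            bursts[-1].append(ev)
        else:
            bursts.append([ev])
    return bursts


def cluster_events(events, max_gap=30.0, min_size=3):
    """Return (cluster summaries, noise events) for a list of event dicts."""
    buckets = defaultdict(list)
    for ev in sorted(events, key=lambda e: e["ts"]):
        buckets[(ev["source"], ev["template"])].append(ev)

    summaries, noise = [], []
    for (source, template), bucket in buckets.items():
        for burst in split_bursts(bucket, max_gap):
            if len(burst) < min_size:
                noise.extend(burst)  # too sparse to count as a pattern
                continue
            summaries.append({
                "source": source,
                "template": template,
                "count": len(burst),
                "first_ts": burst[0]["ts"],
                "last_ts": burst[-1]["ts"],
            })
    return summaries, noise


flap = "Interface <*> link state changed"
events = [
    {"ts": 0, "source": "sw1", "template": flap},
    {"ts": 4, "source": "sw1", "template": flap},
    {"ts": 9, "source": "sw1", "template": flap},
    {"ts": 12, "source": "sw1", "template": flap},
    {"ts": 500, "source": "sw2", "template": "Fan speed warning"},
]
summaries, noise = cluster_events(events)
print(summaries)
print(noise)
```

Only the compact summaries would be passed to the LLM, which matches the stated aim of replacing lengthy raw logs with a short pattern description.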
S103, preliminary fault diagnosis based on the large language model is performed on the trigger event, the initial context data and the event clustering result to obtain initial analysis diagnosis data.
Specifically, the trigger event, the initial context data and the event clustering result are first integrated in a structured manner to obtain an integrated data set.
Further, the integrated data set, the network expert information and the diagnostic trouble description information are constructed as specific prompt information, and the specific prompt information is sent to the LLM module. Wherein the LLM module is an artificial intelligent model with natural language understanding and generating capability.
Further, the LLM module performs preliminary diagnosis assumption on the network fault information in the integrated data set based on query processing of the network fault elimination knowledge base, and identifies fault information gaps in the network fault information to generate initial analysis diagnosis data. The network troubleshooting knowledge base comprises historical cases, solutions, best practices, manual fragments and error modes.
In one embodiment, as shown in FIG. 2, the diagnostic engine integrates the trigger event, the proactively collected initial context data and the event clustering result (if any) into one structured input. It then constructs a specific prompt (Prompt) and sends it to the LLM core. The prompt comprises a role setting (network expert), a task description (fault diagnosis), the input data, and instructions telling the LLM to identify the cause, to request additional information through function calls when the information is insufficient, and that the RAG knowledge base can be queried. The LLM module interprets the input and reasons over it, possibly initiates a query to the RAG module (e.g., "BGP flapping common causes"), proposes preliminary diagnostic hypotheses and identifies information gaps. If more information is required, the LLM generates one or more function call requests according to a predefined Schema (JSON format, specifying function names and parameters), for example get_interface_counters(interface_name='Ethernet0').
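A minimal sketch of how such requests might be checked against a predefined Schema is given below. The JSON shape with `name` and `arguments` keys, the `FUNCTION_SCHEMAS` table and the `get_bgp_summary` entry are assumptions for illustration; only `get_interface_counters` with `interface_name` appears in the text.

```python
import json

# Hypothetical schema table: function name -> expected parameter names and types.
FUNCTION_SCHEMAS = {
    "get_interface_counters": {
        "description": "Return traffic and error counters for one interface.",
        "parameters": {"interface_name": str},
    },
    "get_bgp_summary": {
        "description": "Return the BGP neighbor summary.",
        "parameters": {},
    },
}


def parse_function_calls(llm_response_text):
    """Parse a JSON list of call requests and validate each against the schema."""
    validated = []
    for call in json.loads(llm_response_text):
        name = call.get("name")
        args = call.get("arguments", {})
        schema = FUNCTION_SCHEMAS.get(name)
        if schema is None:
            raise ValueError(f"unknown function: {name}")
        expected = schema["parameters"]
        if set(args) != set(expected):
            raise ValueError(f"{name}: expected parameters {sorted(expected)}")
        for key, expected_type in expected.items():
            if not isinstance(args[key], expected_type):
                raise ValueError(f"{name}: parameter {key} has the wrong type")
        validated.append({"name": name, "arguments": args})
    return validated


response = (
    '[{"name": "get_interface_counters",'
    ' "arguments": {"interface_name": "Ethernet0"}}]'
)
calls = parse_function_calls(response)
print(calls)
```

Rejecting malformed or unknown requests before execution gives the diagnostic engine a clear error to feed back to the LLM instead of passing an ill-formed command to a device.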
S104, the initial analysis diagnosis data is iteratively refined according to function call requests to obtain final analysis diagnosis data.
Specifically, the response request information of the LLM module is first identified by the diagnostic engine. And if the response request information contains the function call request, connecting a function call module to the network equipment, and executing a function call request command.
Further, through the executed function call request command, iterative fault analysis is carried out on the initial analysis diagnosis data to obtain new analysis diagnosis data, and the new analysis diagnosis data is fed back to the LLM module.
Further, the LLM module forms a second-round diagnostic hypothesis on the new analysis diagnosis data and identifies further fault information gaps in the network fault information. These steps are repeated until final analysis diagnosis data with high confidence is output.
As a possible implementation, an execution agent may also be deployed for the network device. The execution agent is a centralized API gateway or a lightweight agent on the device. The function call module connects to the network device through the execution agent, which receives and executes the function call request command.
In one embodiment, as shown in FIG. 2, the diagnostic engine parses the LLM response. If it contains a function call request, the diagnostic engine instructs the function call module to execute it. The function call module connects securely to the network device, executes the requested command (e.g., show interface counters Ethernet0) and obtains the real-time output. The execution result (structured data or error information) is fed back to the LLM core by the diagnostic engine. The LLM receives the new data, reasons over it in combination with the previous analysis, and updates its diagnosis. Based on the updated diagnosis, the LLM may query the RAG module again or request more or different function calls. That is, the cycle "LLM analysis -> request information -> function call acquisition -> LLM analysis" continues until the LLM finds the root cause (with high confidence) or has explored all reasonable paths, yielding the final analysis diagnosis data.
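The loop described above can be sketched as follows. The LLM is replaced by a scripted stand-in (`scripted_llm`) and the function call module by `fake_execute`; the reply format (a dict with optional `function_calls`, `diagnosis` and `confidence` keys) and the `max_rounds` cap are assumptions of this sketch, not details given in the application.

```python
def run_diagnosis(initial_input, ask_llm, execute_call, max_rounds=5):
    """Alternate LLM analysis and function calls until a diagnosis is returned."""
    history = [{"role": "input", "content": initial_input}]
    for round_no in range(1, max_rounds + 1):
        reply = ask_llm(history)
        history.append({"role": "llm", "content": reply})
        calls = reply.get("function_calls", [])
        if not calls:
            # No further data requested: treat the reply as the final diagnosis.
            return {
                "diagnosis": reply.get("diagnosis"),
                "confidence": reply.get("confidence"),
                "rounds": round_no,
            }
        for call in calls:
            try:
                result = execute_call(call)
            except Exception as exc:  # feed failures back instead of aborting
                result = {"error": str(exc)}
            history.append({"role": "function", "name": call["name"], "content": result})
    # Round budget exhausted without a conclusion.
    return {"diagnosis": None, "confidence": None, "rounds": max_rounds}


def scripted_llm(history):
    # Stand-in for the LLM: ask for counters once, then conclude.
    if not any(msg["role"] == "function" for msg in history):
        return {"function_calls": [
            {"name": "get_interface_counters", "arguments": {"interface_name": "Ethernet0"}}
        ]}
    return {"diagnosis": "CRC errors on Ethernet0 suggest a faulty link", "confidence": "high"}


def fake_execute(call):
    # Stand-in for the function call module; returns canned counters.
    return {"interface": call["arguments"]["interface_name"], "crc_errors": 1204}


report = run_diagnosis({"event": "BGP neighbor Down"}, scripted_llm, fake_execute)
print(report)
```

The round cap corresponds to the case where the LLM "has explored all reasonable paths": the loop must terminate even when no high-confidence root cause emerges.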
S105, searching and processing the fault solution to the final analysis and diagnosis data through a preset solution generator, and determining a network fault solution strategy.
Specifically, when the final analysis diagnostic data is obtained, the diagnostic engine drives the solution generator.
Further, the solution generator is used for searching and processing the fault solution to the final analysis and diagnosis data based on the LLM module and the search result of the network fault removal knowledge base, so as to obtain an initial fault solution strategy.
Further, according to the initial context data, the initial fault resolution strategy is adaptively adjusted, and the network fault resolution strategy is obtained.
In one implementation, as shown in FIG. 2, when the LLM determines the root cause, that is, when the final analysis diagnosis data is determined, the diagnostic engine drives the solution generator. The LLM conceives a solution based on the final diagnosis. The solution generator (or the LLM itself) queries the RAG database for validated solutions, configuration command examples or operational steps related to the diagnosed cause. The suggestions are then adjusted in light of the initial context information (such as the network switch version) to ensure applicability; this completes the adaptive adjustment of the initial fault resolution strategy and yields the network fault resolution strategy. The system may also output a diagnostic report containing the root cause, a confidence level, and one or more specific, actionable solution suggestions.
As a possible implementation, in terms of data representation and integration, all data acquired from the network device (show command output) should be parsed into a structured JSON format and carry metadata such as a timestamp and a device ID. The RAG database uses a vector database (e.g., FAISS, Milvus) and a suitable embedding model (e.g., BGE) to store and retrieve knowledge fragments, supporting semantic search. The function call Schema needs to be strictly defined to ensure that the LLM can generate requests in the correct format.
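To make the retrieval step concrete, the sketch below ranks knowledge fragments by cosine similarity to a query. It uses a toy bag-of-words vector in place of a real embedding model and a plain list in place of a vector database, so it illustrates only the ranking idea; the fragment texts and identifiers are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system would use a model such as BGE)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented knowledge fragments standing in for the troubleshooting knowledge base.
KNOWLEDGE = [
    {"id": "case-001", "text": "BGP neighbor flapping caused by CRC errors on the uplink interface"},
    {"id": "case-002", "text": "High CPU usage after software upgrade"},
    {"id": "manual-017", "text": "BGP hold timer expired because of MTU mismatch"},
]


def retrieve(query, top_k=2):
    """Return the top_k fragments most similar to the query."""
    q = embed(query)
    ranked = sorted(KNOWLEDGE, key=lambda doc: cosine(q, embed(doc["text"])), reverse=True)
    return ranked[:top_k]


hits = retrieve("BGP flapping common causes")
print([doc["id"] for doc in hits])
```

Word-count vectors miss synonyms and paraphrases, which is the gap a learned embedding model is meant to close.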
As a possible implementation, in terms of LLM interaction and prompt engineering, an LLM with a strong understanding of code and technical documents is selected. The system prompt (System Prompt) is carefully designed to set the LLM's role, the available tools (functions, RAG) and the expected output format. A task prompt (Task Prompt) is built dynamically and contains the current context, new data, RAG information and the dialog history. Few-shot learning (Few-shot Learning) may be used to guide the LLM to generate output in a particular format (e.g., a list of function call requests).
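The prompt structure described above might be assembled as in the following sketch. The wording of `SYSTEM_PROMPT`, the section titles and the single few-shot example are illustrative assumptions; the application does not specify the prompt text.

```python
import json

# Illustrative system prompt: role, available tools, expected output format.
SYSTEM_PROMPT = (
    "You are a network expert performing fault diagnosis.\n"
    "Available tools: function calls (real-time device data) and the RAG knowledge base.\n"
    "If information is insufficient, reply with a JSON list of function call requests; "
    "otherwise reply with the root cause and a confidence level."
)

# One few-shot example showing the expected request format.
FEW_SHOT_EXAMPLE = (
    "Example reply when more data is needed:\n"
    '[{"name": "get_interface_counters", "arguments": {"interface_name": "Ethernet0"}}]'
)


def build_task_prompt(context, new_data, rag_snippets, history):
    """Assemble the dynamic task prompt from its four inputs."""
    sections = [
        ("Current context", json.dumps(context, sort_keys=True)),
        ("New data", json.dumps(new_data, sort_keys=True)),
        ("Knowledge base excerpts", "\n".join(f"- {s}" for s in rag_snippets) or "(none)"),
        ("Dialog history", "\n".join(history) or "(none)"),
    ]
    body = "\n\n".join(f"## {title}\n{text}" for title, text in sections)
    return f"{body}\n\n{FEW_SHOT_EXAMPLE}"


prompt = build_task_prompt(
    context={"event": "BGP neighbor Down", "device_id": "leaf-01"},
    new_data={"crc_errors": 1204},
    rag_snippets=["BGP flapping is often caused by unstable links."],
    history=[],
)
print(SYSTEM_PROMPT)
print(prompt)
```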
As one possible implementation, in terms of function call implementation, a secure execution agent (a centralized API gateway or a lightweight agent on the device) is deployed to receive and execute function call requests. The agent is responsible for converting JSON requests into actual network device CLI commands (executed over SSH) or API calls. Strict authentication and authorization must be enforced, restricting the operations that can be performed (to avoid dangerous commands). A robust error handling mechanism is needed to feed execution failure information back to the LLM.
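One way to restrict the operations an agent can perform is an allow-list of read-only command templates, sketched below. The function names, the templates and the argument pattern are assumptions for illustration, actual CLI syntax varies by network operating system, and `run_cli` is an injected stand-in for the SSH or API transport.

```python
import re

# Hypothetical allow-list: only these read-only commands can ever be produced.
COMMAND_TEMPLATES = {
    "get_bgp_summary": "show bgp summary",
    "get_interface_counters": "show interface counters {interface_name}",
    "get_logging": "show logging",
}

# Arguments may contain only characters that cannot chain or alter a command.
SAFE_ARGUMENT = re.compile(r"^[A-Za-z0-9_./:-]+$")


def to_cli_command(request):
    """Translate a JSON function call request into a CLI command string."""
    template = COMMAND_TEMPLATES.get(request.get("name"))
    if template is None:
        raise PermissionError(f"function not allowed: {request.get('name')}")
    args = request.get("arguments", {})
    for key, value in args.items():
        if not isinstance(value, str) or not SAFE_ARGUMENT.match(value):
            raise ValueError(f"unsafe value for argument {key}")
    try:
        return template.format(**args)
    except KeyError as exc:
        raise ValueError(f"missing argument: {exc}") from None


def handle_request(request, run_cli):
    """Execute a request and return a structured result, including failures."""
    try:
        command = to_cli_command(request)
        return {"ok": True, "command": command, "output": run_cli(command)}
    except (PermissionError, ValueError) as exc:
        return {"ok": False, "error": str(exc)}


ok = handle_request(
    {"name": "get_interface_counters", "arguments": {"interface_name": "Ethernet0"}},
    run_cli=lambda cmd: f"<output of {cmd}>",
)
denied = handle_request({"name": "reload", "arguments": {}}, run_cli=lambda cmd: "")
print(ok)
print(denied)
```

Returning failures as structured data rather than raising lets the diagnostic engine pass the error text back to the LLM, in line with the error handling requirement above.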
As a possible implementation, in terms of clustering engine implementation, a suitable clustering algorithm is selected based on the data characteristics (e.g., DBSCAN, which finds arbitrarily shaped clusters and noise, or an algorithm based on log template similarity). Feature engineering is critical and may include timestamp differences, source identification (device/module/process), log templates (extracted with tools such as Drain) and severity level. The goal is to reduce noise and distill the information.
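The log template feature can be illustrated with a small masking sketch. It replaces variable tokens with placeholders using regular expressions, which is a much simpler technique than Drain and is shown only to convey what a template is; the interface name prefixes and placeholder tokens are assumptions.

```python
import re

# Masking rules, applied in order: IP addresses, interface names, bare numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b(?:Ethernet|Vlan|PortChannel)\d+(?:/\d+)*\b"), "<IF>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable parts of a log message with placeholders."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


print(to_template("Interface Ethernet0 changed state to down"))
print(to_template("BGP neighbor 10.0.0.2 Down after 35 seconds"))
```

Messages that differ only in interface, address or counter values then share one template, so they can be grouped by exact match before any distance-based clustering is applied.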
The embodiment of the application also has the following steps:
The system integration innovation is that three technologies of a Large Language Model (LLM), retrieval enhancement generation (RAG) and Function call (Function call) are specifically and systematically integrated and applied to a real-time troubleshooting scene of a large-scale network environment for the first time.
And an active context prefetching mechanism, wherein before the LLM performs primary analysis, the system actively collects a group of basic and most probably related real-time context information through function call according to the fault type, and provides necessary initial information for the LLM so as to avoid guessing in information vacuum. This is a key step that is distinguished from passive waiting for LLM instructions.
The LLM driven iterative diagnosis flow is characterized in that a closed-loop iterative diagnosis workflow is designed, wherein LLM analysis- > identification information gap- > generation function call request- > system execution request is used for acquiring real-time data- > data and feeding the real-time data back to the LLM- > LLM for re-analysis based on new data. This dynamic interaction simulates expert thinking and is the core methodology.
And when a large number of syslogs or alarms are processed, an event clustering engine is introduced to preprocess and pattern recognition the original events, and the extracted information (instead of the original lengthy log) is input into the LLM, so that the efficiency and the accuracy of the LLM analysis are improved.
Cooperation of RAG with function call RAG (providing historical experience and static knowledge) and function call (providing real-time dynamic data) are used cooperatively throughout the diagnostic and solution generation process, making LLM decisions based on a combination of historical verification knowledge and current network live.
The whole system and method are designed for a large-scale network environment, including the design of the function call interfaces (corresponding to network device commands), the content of the RAG knowledge base (cases and documents specific to network devices) and the processing logic for common network problems (such as BGP and interface problems).
The application integrates the reasoning capability of the LLM, the domain knowledge enhancement capability of RAG and the real-time data acquisition capability of function calls, adopts a workflow of active context collection and iterative diagnosis, and therefore has a certain universality. In addition to large-scale network environments, the method can in theory be migrated to troubleshooting other complex IT systems or network environments, such as:
1. Other network operating systems, such as Cisco IOS/NX-OS and Juniper Junos: only the function call module needs to be adapted to execute the commands/APIs of the corresponding platform, and a corresponding RAG knowledge base needs to be constructed.
2. Cloud platform fault diagnosis, such as faults of virtual machines, containers and network services in cloud environments such as AWS, Azure and GCP: the function calls can interface with the APIs of the cloud platform, and the RAG library contains cloud service documents and cases.
3. Server operating system fault diagnosis, such as Linux and Windows Server: the function calls can execute system commands and query log files, and the RAG library contains OS documents and solutions to common problems.
4. Application system fault diagnosis: for a complex distributed application system, the function calls can query the monitoring metrics, logs and configuration of the application, and the RAG library contains application architecture documents and historical fault records.
At the time of migration, the following adjustments are mainly required:
1) Adapt the function call interface: re-implement it according to the commands, APIs or tools of the target platform.
2) Construct a domain-specific RAG knowledge base: collect and organize documents, cases, best practices and the like related to the target domain.
3) Adjust the context collection strategy: define the specific content of the initial context collection according to the characteristics of common problems in the target domain.
4) Fine-tune LLM prompts and behavior: it may be necessary to adjust the prompts so that the LLM better understands the terms and concepts of the target domain.
5) Redesign or adapt the event clustering logic: if the log/event formats and patterns of the target domain differ, the clustering algorithm and features need to be adjusted.
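Adjustment 1) above may, for example, be realized by hiding platform commands behind a common adapter interface, so that the diagnosis workflow only sees abstract function names. The following is a sketch under assumptions: the `PlatformAdapter` protocol, the `CommandAdapter` class, the abstract function names and the injected `execute` callable are hypothetical, and the command strings are illustrative examples of commands on the named platforms.

```python
from typing import Callable, Dict, Protocol


class PlatformAdapter(Protocol):
    """Interface the diagnosis workflow depends on (hypothetical)."""

    def run(self, function_name: str, target: str) -> str: ...


# Illustrative command tables; a real deployment would maintain its own.
CISCO_IOS_COMMANDS: Dict[str, str] = {
    "get_bgp_summary": "show ip bgp summary",
    "get_interface_status": "show interfaces",
}
JUNOS_COMMANDS: Dict[str, str] = {
    "get_bgp_summary": "show bgp summary",
    "get_interface_status": "show interfaces terse",
}


class CommandAdapter:
    """Maps abstract function names to platform commands.

    The transport (SSH, API client, and so on) is injected as `execute`,
    so the same class serves any command-line based platform.
    """

    def __init__(self, command_map: Dict[str, str],
                 execute: Callable[[str, str], str]) -> None:
        self._command_map = command_map
        self._execute = execute

    def run(self, function_name: str, target: str) -> str:
        command = self._command_map.get(function_name)
        if command is None:
            raise KeyError(f"function not supported on this platform: {function_name}")
        return self._execute(target, command)
```

With such a design, migrating to another platform amounts to supplying a new command table (or a new adapter class for API-based platforms) while the prefetching and iterative diagnosis logic stays unchanged.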
In addition, an embodiment of the present application further provides an AI-driven network fault removal device. As shown in FIG. 3, the AI-driven network fault removal device 300 specifically includes:
at least one processor 301; and a memory 302 communicatively coupled to the at least one processor 301, wherein the memory 302 stores instructions executable by the at least one processor 301 to enable the at least one processor 301 to perform:
based on a trigger event corresponding to network fault information, performing predefined basic context information collection on the relevant network devices to obtain initial context data;
performing event clustering on the trigger events in a centralized alarm mode to obtain an event clustering result;
performing preliminary fault diagnosis, based on a large language model, on the trigger event, the initial context data and the event clustering result to obtain initial analysis diagnosis data;
according to a function call request, iteratively refining and updating the initial analysis diagnosis data to obtain final analysis diagnosis data; and
performing a fault solution search on the final analysis diagnosis data through a preset solution generator to determine a network fault solution strategy.
According to the embodiment of the application, the automated information collection, analysis and diagnosis process greatly shortens the fault removal time and reduces the dependence on manual intervention. By combining the reasoning capability of the LLM (Large Language Model), the domain knowledge of RAG (Retrieval-Augmented Generation) and the real-time data of function calls, a more comprehensive and deeper analysis can be performed, and misjudgment caused by insufficient information or biased experience is reduced. An end-to-end automated flow from fault triggering to solution proposal is realized, reducing the burden on network engineers. The method can cope with complex network fault scenarios and gradually approaches the root cause through iterative information acquisition and analysis. The hallucination problem is mitigated by the RAG module, and the lack of real-time data access is addressed by function calls, so that the LLM module can be reliably applied in the professional field of network operation and maintenance. The RAG knowledge base can continuously accumulate historical fault cases and solutions, so that the diagnostic capability of the system keeps improving. Through the event clustering engine, a large number of syslog (System Log) messages and alarms can be processed effectively, key patterns are extracted, and information flooding is avoided.
The embodiments of the present application are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.