CN115296975A - Method and system for operation, maintenance and troubleshooting through natural language processing - Google Patents


Info

Publication number
CN115296975A
CN115296975A (application CN202210685072.9A)
Authority
CN
China
Prior art keywords
fault
scheme
log
maintenance
coping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210685072.9A
Other languages
Chinese (zh)
Inventor
胡恺
余贵荣
丁庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Media Tech Co ltd
Original Assignee
Shanghai Media Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Media Tech Co ltd filed Critical Shanghai Media Tech Co ltd
Priority to CN202210685072.9A
Publication of CN115296975A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications

Abstract

The invention relates to the technical field of operation, maintenance and troubleshooting, and in particular to a method and a system for operation, maintenance and troubleshooting through natural language processing. The method comprises: S1, receiving log data of an information system and converting the continuously incoming unstructured log data into structured log data; S2, performing log-pattern clustering on historical fault logs; S3, analyzing each class of fault, recording the fault causes, establishing a fault knowledge base, recording fault information after each fault occurs, extracting keywords from the fault logs, and matching each fault to its keywords; and S4, performing keyword matching on subsequently incoming log data and configuring a fault coping scheme based on the matching result.

Description

Method and system for operation, maintenance and troubleshooting through natural language processing
Technical Field
The invention relates to the technical field of operation, maintenance and troubleshooting, in particular to a method and a system for operation, maintenance and troubleshooting through natural language processing.
Background
With the rapid advance of informatization, computer systems have become an integral part of modern enterprises. In recent years, informatization construction across industries has been continuously improved, and business operations are increasingly concentrated on information systems and platforms. The operation and maintenance work that keeps these systems running normally is therefore increasingly important, and repairing and resolving the system promptly when an emergency occurs is one of the key duties of operation and maintenance personnel.
At present, the industry generally handles sudden failures by step-by-step troubleshooting. For a network interruption at a client, for example, it is necessary to determine whether the channel from the user (or terminal) to the gateway device has a problem, whether devices such as the layer-3 core router are normal, whether external dependencies such as DNS are normal, and whether the boundary security devices have a problem; the alarms of multiple devices must be checked before the corresponding handling can be performed.
This step-by-step working mode consumes a great deal of the operation and maintenance personnel's energy on manual troubleshooting, so omissions easily occur, precise and safe operation and maintenance cannot be achieved, and management loopholes remain. Moreover, step-by-step troubleshooting responds poorly in real time once a problem occurs, so problems cannot be discovered promptly. In addition, because the personnel face a wide variety of failure causes, the results of after-the-fact analysis are difficult to apply effectively to preventing or handling the next failure.
Disclosure of Invention
The invention aims to provide a method for operation, maintenance and troubleshooting through natural language processing that solves the above technical problems.
The invention further aims to provide a system for operation, maintenance and troubleshooting through natural language processing that solves the same technical problems.
The technical problems addressed by the invention can be solved by the following technical solutions:
a method for operation, maintenance and troubleshooting through natural language processing comprises,
the method comprises the following steps that S1, log data of an information system are received, and continuously transmitted unstructured log data are converted into structured log data;
s2, carrying out log mode clustering on historical fault logs;
s3, analyzing each type of fault, recording fault reasons, establishing a fault knowledge base, recording fault information after each fault occurs, extracting keywords of the fault log, and matching and corresponding the fault with the keywords;
s4, carrying out keyword matching on the subsequently transmitted log data, determining a fault type based on a matching result if the matching is successful, and associating the fault type with a fault coping scheme; otherwise, determining the fault log as a new fault type, clustering the new fault log and adding the new fault log into the fault knowledge base.
Preferably, the log-pattern clustering in step S2 includes:
preprocessing, and/or constructing a bag-of-words model, and/or clustering the log texts with a document topic model;
the preprocessing converts special character strings in the fault logs into uniform tag representations;
the bag-of-words model is expressed as an M × V matrix, where M is the length of the corpus and V is the length of the dictionary;
the log-text clustering uses the document topic model: p(w|d) = Σ_t p(w|t) p(t|d), where d is a log text, w is a word of the log text, t is a topic of the log text, p(w|t) is the word distribution, and p(t|d) is the topic distribution.
Preferably, step S3 further includes setting fault information in the fault knowledge base, where the fault information includes the fault cause, the fault level, the fault influence range, the upstream-downstream relationship of the fault, the technical-field classification related to the fault, and the responsible persons related to the fault.
Preferably, extracting the keywords of the fault log in step S3 includes: the time-period keywords of the fault, the data-source keywords of the fault, the IP keywords of the fault, the level keywords of the fault, the impact-result keywords of the fault, and the network protocols related to the fault.
Preferably, step S4 further includes auditing the fault coping scheme, where the audit content comprises:
auditing the processing timeliness of the fault coping scheme, auditing the operation authority of the scheme, auditing the operation commands of the scheme, auditing the upstream and downstream equipment of the scheme, auditing the responsible persons related to the scheme, and auditing the iterative maintenance of the scheme.
Preferably, the method further comprises:
S5, sorting the fault coping schemes in a recommendation order to form a scheme recommendation list, and either selecting a fault coping scheme from the list or adopting a brand-new fault coping scheme to handle the fault;
S6, adjusting the recommendation order of the fault coping schemes in the scheme recommendation list based on the manner of selection in step S5.
Preferably, the evaluation content for the recommendation order of the fault coping schemes includes:
the processing duration of the fault coping scheme, the complexity of its operation commands, the number of its operation nodes, the usage frequency of key instructions within it, and the traceability and reversibility of its related operations.
Preferably, the scheme recommendation list comprises, in order, a first scheme, a second scheme, ..., and an Nth scheme, where the first scheme is the default scheme and the remaining fault coping schemes are alternatives; each fault coping scheme is annotated with fault information and matched to the fault logs, and N is a positive integer.
Preferably, step S6 specifically comprises:
if the default scheme is selected, the recommendation order of the scheme recommendation list is unchanged; or
if an alternative is selected, the selected alternative is promoted to the default scheme and the original default scheme is demoted to an alternative; or
if a brand-new fault coping scheme is adopted, the new coping operation process is recorded as operation data and incorporated into the fault knowledge base, a new fault coping scheme is established and associated with the information system, the new scheme is added to the scheme recommendation list and adjusted to be the first scheme, and the remaining fault coping schemes in the list are demoted or deleted.
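The reordering rules of step S6 can be sketched as a small routine. This is a minimal illustration of the three cases only; the function name `reorder` and the list-of-names representation are assumptions, not part of the patent:

```python
def reorder(recommendations, selected):
    """Adjust a scheme recommendation list after the administrator's choice.

    recommendations[0] is the default scheme; the rest are alternatives.
    `selected` is either an existing scheme or a brand-new scheme.
    """
    if not recommendations or selected == recommendations[0]:
        # Case 1: the default scheme was chosen, the order is unchanged.
        return list(recommendations)
    if selected in recommendations:
        # Case 2: an alternative was chosen, promote it to default and
        # demote the old default (it becomes the first alternative).
        return [selected] + [s for s in recommendations if s != selected]
    # Case 3: a brand-new scheme, it becomes the first scheme and all
    # existing schemes are demoted (deletion is omitted in this sketch).
    return [selected] + list(recommendations)
```

The sketch deliberately leaves out persistence into the knowledge base; it only captures the list-manipulation logic.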
A system for operation, maintenance and troubleshooting through natural language processing, applied to the above method, comprises:
an acquisition module, arranged in the information system and used to acquire the log data;
an operation and maintenance management server, connected to the acquisition module and used to cluster the fault logs, analyze the fault logs, record the fault information and extract the keywords; a recommendation module is arranged in the operation and maintenance management server and is used to configure fault coping schemes for the faults according to historical processing experience, update the fault coping schemes, and configure the recommendation order of the fault coping schemes;
and a fault knowledge base, connected to the recommendation module, in which the fault coping schemes are stored.
The beneficial effects of the invention are: with the above technical solution, the invention builds a fault knowledge base, by means of natural language processing, for the problem events that require operation, maintenance and troubleshooting in an information system, forms a detailed analysis system around specific historical events, and establishes an intelligent solution recommendation mechanism.
Drawings
FIG. 1 is a schematic diagram of the steps of a method for operation, maintenance and troubleshooting through natural language processing according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for operation, maintenance and troubleshooting through natural language processing according to an embodiment of the present invention;
FIG. 3 is an architecture diagram of a system for operation, maintenance and troubleshooting through natural language processing according to an embodiment of the present invention;
FIG. 4 is an architecture diagram of the operation and maintenance troubleshooting system when an administrator adopts a new fault coping scheme according to an embodiment of the present invention.
In the drawings: 1, acquisition module; 2, recommendation module; 3, fault knowledge base; 4, operation and maintenance management server.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
A method for operation, maintenance and troubleshooting through natural language processing, as shown in FIG. 1, FIG. 2 and FIG. 3, comprises:
S1, receiving log data of an information system and converting continuously incoming unstructured log data into structured log data;
S2, performing log-pattern clustering on historical fault logs;
S3, analyzing each class of fault, recording the fault causes, establishing a fault knowledge base 3, recording fault information after each fault occurs, extracting keywords from the fault logs, and matching each fault to its keywords;
S4, performing keyword matching on subsequently incoming log data; if matching succeeds, determining the fault type from the matching result and associating the fault type with a fault coping scheme; otherwise, treating the log as a new fault type, clustering the new fault log, and adding it to the fault knowledge base 3.
Specifically, the fault coping schemes for operation, maintenance and troubleshooting are based on the fault knowledge base 3. The fault knowledge base 3 collects historical log data of an information system over a long period and converts the unstructured log data of various software and hardware devices into formatted log data, laying the foundation for retrieval, statistics, analysis, tagging and clustering of the log data. The formatted log data is clustered to distinguish fault logs from normal logs; the characteristic keywords of the fault logs are extracted and tagged; each class of specific faults (such as a network-switch power failure, a firewall active/standby switchover failure, a website under massive attack, or a database process-lock failure) is combined with an explanation of its meaning; the causes, occurrence frequency, influence range, upstream-downstream associations and other circumstances of the faults are analyzed; and, according to historical processing experience, fault coping schemes are configured for the faults and recommendation priorities are configured when multiple schemes exist. Once a system fault is found during operation and maintenance, a fault type can be matched from the fault knowledge base 3 according to the log information of the faulty equipment, and the corresponding fault coping scheme obtained.
The invention can comprehensively analyze, from multiple dimensions, the running conditions of the software and hardware equipment of every information system in an enterprise network environment, and can compile and collect the differently formatted unstructured log information of various manufacturers' software and hardware, converting it into structured log data convenient for retrieval and further processing. Log data of a consistent type can be aggregated into log events of the same class, and the log events of interest can be given detailed fault analysis and annotation, converting raw log content that is hard to read into a vocabulary the administrator can understand. This helps the administrator quickly locate accurate fault information when a fault occurs, and a fault coping scheme is recommended in time to help resolve it, effectively shortening the duration of the fault's impact, safeguarding the availability of the service system, and reducing secondary disasters caused by improper fault handling.
Specifically, the log data of the system equipment is received and analyzed multi-dimensionally: network device logs can be acquired from the core switches and the access switches; security device logs can be acquired from the firewall, the intrusion detection device and the load balancing device; website logs can be acquired from Apache and Nginx. Receiving and analyzing log data multi-dimensionally widens the log acquisition range to the important system equipment in the network environment, covering the whole key network environment in which the information system sits, and avoids the problems caused by single-source log collection: insufficient ability to analyze fault causes, fuzzy fault localization, a mismatch with the actual fault influence range, and a fault coping scheme that is not the optimal choice. The information systems in a network are not isolated from one another; they are inseparably related to the surrounding network infrastructure. For example, when a website responds slowly because it is under network attack, the fault cause cannot be confirmed by checking the web server alone, nor can an effective solution be adopted; a comprehensive analysis associating the network security device logs is required.
In a preferred embodiment, step S1 converts the continuously incoming unstructured IT log data into structured IT log data and extracts field information such as the timestamp, operation object, operation time, operation location, operation type, and authorization level.
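As a minimal illustration of the structured conversion in step S1, the sketch below parses one free-form log line into named fields with a regular expression. The line format, the regex, and the field names are assumptions for illustration; a real collector would need one pattern per device type:

```python
import re

# Illustrative pattern: "<timestamp> <level> <object> <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+)\s+"
    r"(?P<object>\S+)\s+"
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of structured fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
```

For example, `parse_log_line("2022-06-16 10:00:01 ERROR core-switch-01 port Gi0/1 down")` yields a dict whose `level` field is `"ERROR"` and whose `object` field is `"core-switch-01"`.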
The unstructured IT log collection specifically comprises the following:
Network device logs, including switch logs, router logs, AP logs, WLC logs and the like. Security device logs are supported, including firewall logs, anti-virus wall logs, load balancing device logs, VPN device logs, intrusion detection device logs, WAF device logs, malicious-code prevention device logs, gateway device logs, bastion host logs and the like.
Log data of common database systems is collected and analyzed, including but not limited to Oracle, SQL, MySQL and similar database system logs; incremental extraction of data from the data tables of conventional databases (such as Oracle, SQL, MySQL) is supported.
Operating system logs are obtained from operating systems such as Linux and Windows, including the Windows system log, security log, audit log, WMI data, and the logs attached to system services.
Application software logs are collected from the information system servers, including logs in compressed-file form (tar.gz, tar and zip formats) and data in various encodings such as GBK, UTF-8 and UTF-16; collecting logs from a server by specifying a local path is supported.
Specifically, in this embodiment, the network device logs are acquired by: collection via the network device's syslog server mode, collection via the network device's syslog trap mode, collection via the API interface of an existing network monitoring server, and collection via log forwarding from an existing network monitoring server. The collected log level is preset by an administrator; considering that the large volume of routine packet-forwarding logs is of little use, the log sending level can be raised appropriately so that only important-level device logs are collected.
The network device log attention categories include: device log in/log out (WEB management page), device log in/log out (remote/local command line), device master/slave status monitoring log, device port enable/disable log, object (IP group or policy group) change log, configuration backup/restore log, user/IP traffic monitoring log, user/IP session monitoring log, and the like.
For DHCP or WLC class network devices, the categories of concern include: communication logs between the management node and its uplink core, management-node cluster health monitoring logs, synchronization logs between the management node and the user authentication server, identity authentication success/failure logs when a user accesses a node, user access quality monitoring logs, roaming logs of terminals switching nodes, and the like.
The security device log categories of concern include: DDoS attack alarm logs, network intrusion alarm logs, botnet/trojan/worm alarm logs, boundary sensitive-port access logs, device illegal outbound-connection logs, terminal malicious domain name resolution/access logs, network scanning logs, and load balancing health state monitoring logs.
For anti-virus or desktop-management security devices, the categories of concern include: server/client virus database update logs, server/client version update logs, client policy synchronization logs, client virus alarm logs, client update status logs, server policy change logs, terminal remote maintenance logs, scheduled task execution logs, work software push logs, patch repair logs, patch status logs, vulnerability scanning logs, security protection policy trigger logs, and the like.
Operating system log categories of concern include: system login/logout/startup/shutdown management logs, system service activation/deactivation logs, system security object access audit logs, system account management logs, system resource monitoring alarm logs, system scheduled task execution logs, system backup/restore audit logs, system security protection alarm logs, and the like.
Database class log attention: the system comprises a database login/logout type management log, a database instance creation/deletion log, a database query log, a database backup/recovery log, a database permission change audit log, a database resource monitoring alarm log and the like.
Website log categories of concern: Apache logs, Nginx logs, Tomcat logs, and the like.
Application class logging concerns: software foreground/background service status logs, user login/log-out logs, user authority change logs, task flow status monitoring logs, organization architecture change logs, upstream and downstream software synchronous verification logs, cluster node health monitoring logs, connection number bearing monitoring logs and the like.
In a preferred embodiment, the log-pattern clustering in step S2 specifically includes preprocessing and/or constructing a bag-of-words model and/or clustering log texts with a document topic model.
Specifically, the preprocessing converts special character strings into uniform tag representations: IP addresses, numbers and percentages are converted into uniform tags, abstracting the text pattern and reducing the dimensionality of the dictionary;
specifically, a bag-of-words model is constructed: the bag-of-words model is expressed as an M multiplied by V matrix, wherein M is the length of the corpus, and V is the length of the dictionary;
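The preprocessing and bag-of-words steps described above can be sketched as follows; the tag strings (`<IP>`, `<PCT>`, `<NUM>`) and the function names are illustrative choices, not mandated by the text:

```python
import re
from collections import Counter

def preprocess(text):
    """Replace IP addresses, percentages and numbers with uniform tags,
    then tokenize. Lower-casing happens first so the tags stay intact."""
    text = text.lower()
    text = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", text)
    text = re.sub(r"\b\d+(?:\.\d+)?%", "<PCT>", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "<NUM>", text)
    return text.split()

def bag_of_words(corpus):
    """Build the M x V count matrix: M documents over a V-word dictionary."""
    docs = [preprocess(doc) for doc in corpus]
    vocab = sorted({w for doc in docs for w in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts.get(w, 0) for w in vocab])
    return matrix, vocab
```

Note the substitution order: IP addresses are replaced before bare numbers so the octets are not consumed first.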
Specifically, log text clustering is performed using a document topic model (the LDA model): p(w|d) = Σ_t p(w|t) p(t|d), where p denotes a probability, d is a log text, w is a word of the log text, t is a topic of the log text, p(w|t) is the word distribution, and p(t|d) is the topic distribution;
further, when a new text is predicted, word segmentation of the new text is also preprocessed to form word bag vectors, gibbs of the new text is sampled until doc-topic distribution is converged, and the final topic distribution is counted to be the result;
the topic of each word is obtained by sampling from the distribution, so that a topic with a small probability is probably obtained, when the text is too short, the distribution is not easy to converge, and the convergence effect is not necessarily good, so that the too short text with only three or four words is avoided as much as possible.
The hyper-parameters of the model comprise the topic number K, the prior parameter alpha of the doc-topic distribution, and the prior parameter beta of the topic-word distribution; the latter two are set directly to the symmetric value 1/topic_num, and the topic number K can be selected by computing perplexity. To predict new texts, the model only needs to store the K topic-word distributions, i.e. the phi matrix. Updating the model does not depend on the original training corpus, but the dictionary used to convert the new corpus into bag-of-words vectors must remain the same.
After the topic distribution of a text is obtained, this embodiment selects the topics with probability greater than 0.4, or else the topic with the maximum probability, as its labels; if no topic has probability greater than 0.2, the text is classed as unclassified. Preferably, these probability thresholds are adjustable.
Each topic label corresponds to a word distribution, and words with high word distribution probability can be used as keywords of the topic.
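The labelling rule of this embodiment (topics above 0.4 become labels, otherwise the highest-probability topic, and below 0.2 the text is unclassified) can be sketched as a small function; the function name and the dict representation of the topic distribution are assumptions:

```python
def label_topics(topic_dist, high=0.4, low=0.2):
    """topic_dist: {topic_id: probability} for one log text.

    Returns the topic labels, or an empty list for 'unclassified'.
    The thresholds default to the embodiment's 0.4/0.2 and are adjustable.
    """
    strong = [t for t, p in topic_dist.items() if p > high]
    if strong:
        # One or more dominant topics: all of them become labels.
        return sorted(strong)
    if max(topic_dist.values(), default=0.0) > low:
        # No dominant topic, but the best one clears the lower bar.
        return [max(topic_dist, key=topic_dist.get)]
    return []  # unclassified
```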
In a preferred embodiment, step S3 further includes setting fault information in the fault knowledge base 3, where the fault information includes the fault cause, the fault level, the fault influence range, the upstream-downstream relationship of the fault, the technical-field classification related to the fault, and the responsible persons related to the fault.
In particular, a comprehensive explanation is provided for the information disclosed by the fault log. For example, the fault causes include: insufficient disk space, exhausted memory resources, CPU continuously exceeding a threshold, unexpected HA switchover, and the like. The fault levels include: major faults, high-risk faults, medium faults, low-level faults, and the like. The fault influence ranges include: only a single node affected, the whole server cluster affected, the whole network link affected, and the whole network security domain affected. The upstream-downstream relationships of the fault include: only the service itself affected, the identity authentication of a downstream service affected, the task flow of an upstream service affected, the storage mount service (storage devices) affected, and the proxy service (network proxy devices) affected.
In a preferred embodiment, extracting the keywords of the fault log in step S3 includes: fault time-period keywords, fault data-source keywords, fault IP keywords, fault level keywords (low, med, high, cri), fault impact-result keywords (up, down, change, success, failed), and network protocols related to the fault (http, https, SMB, RDP, SSH).
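A minimal sketch of keyword extraction over the categories above; the keyword vocabularies mirror the examples given in the text (low/med/high/cri, up/down/change/success/failed, http/https/SMB/RDP/SSH), while the function name and the returned structure are assumptions:

```python
import re

LEVEL_WORDS = {"low", "med", "high", "cri"}
RESULT_WORDS = {"up", "down", "change", "success", "failed"}
PROTOCOLS = {"http", "https", "smb", "rdp", "ssh"}
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def extract_keywords(log_line):
    """Pick out level, impact-result, protocol and IP keywords from a line."""
    words = set(re.findall(r"[a-z]+", log_line.lower()))
    return {
        "level": sorted(words & LEVEL_WORDS),
        "result": sorted(words & RESULT_WORDS),
        "protocol": sorted(words & PROTOCOLS),
        "ip": IP_RE.findall(log_line),
    }
```

Time-period and data-source keywords are omitted here since their formats are device-specific.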
In step S4, the fault coping scheme is audited; the audit content comprises auditing the processing timeliness of the fault coping scheme, auditing the operation authority of the scheme, auditing the operation commands of the scheme, auditing the upstream and downstream equipment of the scheme, auditing the responsible persons related to the scheme, and auditing the iterative maintenance of the scheme.
Specifically, auditing the timeliness of a scheme includes: real-time switchover, minute-level, hour-level, next-day, and the like. Auditing the operation authority of a scheme includes: system administrator authority, security administrator authority, software administrator authority, database administrator authority, network administrator authority, and the like. Auditing the upstream and downstream equipment of a scheme includes: whether an upstream service system is affected, whether a downstream service system is affected, whether the network backbone is affected, whether a network security domain is affected, whether the influence is platform-wide, and the like. Auditing the responsible persons of a scheme includes: the operation and maintenance team, the development team, the on-duty team, the management team, and the like.
In a preferred embodiment, the method further comprises,
s5, sorting the fault coping schemes in a recommendation order to form a scheme recommendation list, and either selecting a fault coping scheme from the scheme recommendation list or adopting a brand-new fault coping scheme to cope with the fault;
and S6, adjusting the recommendation sequence of the fault handling schemes in the scheme recommendation list based on the selection mode of the fault handling schemes in the step S5.
In a preferred embodiment, the evaluation criteria for the recommendation order of the fault coping schemes include:
the processing duration of the scheme, the complexity of its operation commands, the number of its operation nodes, the use frequency of key instructions in the scheme, and the traceability and reversibility of its related operations.
Specifically, with processing duration as the basis, shorter durations rank higher; with operation-command complexity as the basis, fewer commands rank higher; with the number of operation nodes as the basis, fewer nodes rank higher; with the use frequency of key instructions as the basis, lower frequency ranks higher; and operations that are traceable and reversible rank higher.
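These criteria can be combined into a sort key in which every component prefers smaller values, so that lexicographic tuple comparison yields the recommendation order. The unweighted tuple below is one possible realization, with assumed field names; the patent does not specify how the criteria are weighted.

```python
def recommendation_score(scheme: dict) -> tuple:
    """Lower tuple sorts first: shorter duration, fewer commands,
    fewer operation nodes, lower key-instruction frequency, reversible preferred."""
    return (
        scheme["duration_s"],
        scheme["command_count"],
        scheme["node_count"],
        scheme["key_instr_freq"],
        0 if scheme["reversible"] else 1,
    )

schemes = [
    {"name": "enable connection verification", "duration_s": 60,
     "command_count": 2, "node_count": 1, "key_instr_freq": 0.1, "reversible": True},
    {"name": "per-IP blacklist", "duration_s": 3600,
     "command_count": 40, "node_count": 3, "key_instr_freq": 0.5, "reversible": True},
]
ranked = sorted(schemes, key=recommendation_score)
print([s["name"] for s in ranked])
```

In this sketch the fast, two-command scheme sorts ahead of the slow blacklist scheme, matching the DDoS embodiment described later.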
In a preferred embodiment, the scheme recommendation list is arranged as the first scheme, the second scheme, and so on up to the Nth scheme, i.e. N fault coping schemes in total, where N is a positive integer; the first scheme is the default scheme and the remaining fault coping schemes are alternatives.
In a preferred embodiment, each fault coping scheme is annotated with the fault information and matched with the fault logs, so that a user can view in detail the log information matched to each scheme.
In a preferred embodiment, step S6 specifically includes:
if the default scheme is selected, the recommendation order of the scheme recommendation list is left unchanged; or
if an alternative scheme is selected, the selected alternative is promoted to the default scheme and the original default scheme is demoted to an alternative; or
if a brand-new fault coping scheme is adopted, the new fault coping operation process is recorded as operation data and incorporated into the fault knowledge base 3; a new fault coping scheme is established from it, associated with the information system, added to the scheme recommendation list and promoted to the first scheme, while the remaining fault coping schemes in the list are demoted or deleted.
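The three branches of step S6 can be sketched as a single list-adjustment function. The function below is an illustrative reading of the step, not the patent's implementation:

```python
def adjust_recommendations(schemes: list, chosen: str, is_new: bool = False) -> list:
    """Reorder the recommendation list per step S6:
    - default (index 0) confirmed: order unchanged;
    - an alternative chosen: promote it to the front;
    - a brand-new scheme adopted: insert it as the first scheme."""
    if is_new:
        return [chosen] + schemes          # new scheme becomes the first scheme
    if schemes and schemes[0] == chosen:
        return schemes                     # default confirmed: order unchanged
    rest = [s for s in schemes if s != chosen]
    return [chosen] + rest                 # alternative promoted to default

lst = ["scheme A", "scheme B", "scheme C"]
print(adjust_recommendations(lst, "scheme B"))            # alternative promoted
print(adjust_recommendations(lst, "scheme D", is_new=True))
```

Note that this sketch only demotes displaced schemes by one position; the patent also allows deleting superseded schemes outright.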
A system for operation, maintenance and troubleshooting through natural language processing applies the method for operation, maintenance and troubleshooting through natural language processing of any of the above embodiments. As shown in figures 3 and 4, the system comprises:
the acquisition module 1 is arranged in the information system and used for acquiring log data;
the operation and maintenance management server 4 is connected with the acquisition module 1 and is used for clustering fault logs, analyzing them, recording fault information, and extracting keywords; a recommendation module 2 is arranged in the operation and maintenance management server 4 and is used for configuring and updating a fault coping scheme for each fault according to historical processing experience, and for configuring a recommendation order for the fault coping schemes;
and the fault knowledge base 3 is connected with the recommendation module 2 and stores the fault coping schemes.
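The wiring between the recommendation module and the fault knowledge base might look like the following minimal sketch; the class names and the `query` method are assumptions introduced for illustration:

```python
class KnowledgeBase:
    """Maps a fault type to its ordered list of coping schemes (module 3)."""
    def __init__(self):
        self.schemes = {}

    def query(self, fault_type: str) -> list:
        return self.schemes.get(fault_type, [])

class Recommender:
    """Queries the knowledge base and returns the recommendation list (module 2)."""
    def __init__(self, kb: KnowledgeBase):
        self.kb = kb

    def recommend(self, fault_type: str) -> list:
        return self.kb.query(fault_type)

kb = KnowledgeBase()
kb.schemes["ddos"] = ["enable connection verification", "edge blacklist"]
print(Recommender(kb).recommend("ddos")[0])
```

In the full system the acquisition module would feed matched fault types into `recommend`, and an empty result would trigger the new-fault-type branch of step S4.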
In a preferred embodiment, the method for operation, maintenance and troubleshooting through natural language processing forms an intelligent fault-coping-scheme recommendation mechanism. Specifically, two operation modes are distinguished: a sequential processing mode (handling the fault with a scheme from the current knowledge base) and an iterative processing mode (handling the fault with a new scheme and iterating the scheme list in the knowledge base).
Sequential processing mode: when the information system fails, the acquisition module 1 sends the fault log information to the recommendation module 2; the recommendation module 2 queries the fault knowledge base for coping schemes and obtains a scheme recommendation list for the fault, containing a first scheme, a second scheme, a third scheme, and so on. The fault coping schemes are presented to an administrator in recommendation order.
If the administrator selects or confirms the default first fault coping scheme, it is retained and its recommendation level is unchanged. Conversely, if the administrator selects another scheme from the recommendation list, the scheme ranked first by default is demoted, the newly chosen scheme becomes the default associated with the information system, and the selected scheme is applied to the fault of the information system.
Iterative processing mode: before the administrator selects a coping scheme, this flow is identical to the sequential flow. When the information system fails, the acquisition module 1 sends the fault log information to the recommendation module 2; the recommendation module 2 queries the fault knowledge base 3 for coping schemes and obtains a recommendation list containing a first scheme (preferred recommendation), a second scheme, a third scheme, and so on, which are presented to the administrator in recommendation order.
Continuing with FIG. 4, if the administrator decides not to use any of the recommendations and instead handles the fault with completely new operation steps, the recommendation module 2 treats these as a new fault coping scheme. The recommendation module 2 transmits the fault information to the administrator; the administrator performs the new troubleshooting operation; the related operation process is converted into operation data and fed back to the recommendation module 2, which incorporates it into the knowledge base, establishes a new fault coping scheme, associates it with the information system, and promotes it to the first scheme. The original recommendations are demoted or deleted.
In a preferred embodiment of the invention, the technical scheme is used for operation and maintenance troubleshooting of an emergency. Specifically, a website suddenly raises an alarm that access is very slow. By collecting attack logs and website-server load logs, keywords from the related device logs are matched against the fault types in the knowledge base; the fault matches the type of a sudden DDoS attack, with characteristics such as short duration, high frequency, large traffic, multiple attack targets, and widely distributed attack sources. Several coping schemes are recommended: setting a blacklist on the boundary firewall, setting a whitelist on the load-balancing device, and enabling connection verification on the firewall with a limit on access sessions per unit time, the last being the first-recommended scheme. After the administrator confirms, the recommended scheme is configured globally based on the fault characteristics, which takes effect most directly and rapidly. The alternative schemes protect more slowly: the various blacklist and whitelist operations involve heavy workloads, and with widely distributed attack sources and targets the configuration would take a long time, which is unsuitable under the circumstances. The administrator therefore agrees to perform troubleshooting with the recommended scheme. After it is executed, the DDoS attack traffic drops rapidly and the website system quickly returns to normal.
In another preferred embodiment of the invention, the technical scheme is used to assist an operation and maintenance team after a new system is handed over. Specifically, after a project group completes construction of the graphics processing system of a television channel and a period of trial operation, the operation and maintenance work of the system is handed over to the operation and maintenance team. Because the team has little implementation experience and only participated in operation and maintenance during the trial run, when the system suddenly fails (the foreground responds normally but the background responds slowly, stalling the progress bar of the user's task list), the team immediately uses the system provided by the invention to obtain coping schemes for the fault from the fault knowledge base. The search reveals that the fault is caused by an excessive number of background-database task connections that are not released for a long time and keep accumulating; the fault is likely to occur once the system has run for more than one month without restart maintenance. The recommended coping schemes are: forcibly stopping the tasks to end the sessions quickly, restarting the database service, and restarting the background server. The operation and maintenance team manager adopts the default scheme and forcibly stops the tasks; the background response quickly returns to normal and service is restored, and the background server is restarted and maintained at a later time.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A method for operation, maintenance and troubleshooting through natural language processing is characterized by comprising the following steps,
the method comprises the following steps that S1, log data of an information system are received, and continuously transmitted unstructured log data are converted into structured log data;
s2, performing log mode clustering on historical fault logs;
s3, analyzing each type of fault, recording fault reasons, establishing a fault knowledge base, recording fault information after each fault occurs, extracting keywords of the fault log, and matching and corresponding the fault with the keywords;
s4, carrying out keyword matching on the subsequently transmitted log data, determining a fault type based on a matching result if the matching is successful, and associating the fault type with a fault coping scheme; and if not, determining the fault log as a new fault type, clustering the new fault log and adding the new fault log into the fault knowledge base.
2. The method of claim 1, wherein the log pattern clustering in step S2 comprises,
preprocessing and/or constructing a bag-of-words model and/or clustering log texts by using a document theme generation model;
the preprocessing converts the special character strings in the fault log into a uniform label representation;
the bag-of-words model is expressed as an M × V matrix, where M is the length of the corpus and V is the length of the dictionary;
the log text clustering using the document topic generation model: p(w|d) = Σ_t p(w|t)·p(t|d), where d is the log text, w is a word of the log text, t is a topic of the log text, p(w|t) is the word distribution under a topic, and p(t|d) is the topic distribution of the document.
3. The method according to claim 1, wherein step S3 further comprises setting the fault information in the fault knowledge base, where the fault information includes fault cause, fault level, fault influence range, fault upstream-downstream relationship, fault correlation technique field classification, and fault correlation responsible person.
4. The method for operation, maintenance and troubleshooting through natural language processing as claimed in claim 1, wherein in step S3, extracting the keywords of the fault log comprises: fault time-period keywords, fault data-source keywords, fault IP keywords, fault level keywords, fault result keywords, and fault-related network protocols.
5. The method for troubleshooting operation and maintenance through natural language processing as claimed in claim 1, wherein the step S4 further comprises auditing the failure handling scheme, the auditing content comprises,
and auditing according to the processing timeliness of the fault handling scheme, auditing the operation authority of the fault handling scheme, auditing the operation command of the fault handling scheme, auditing upstream and downstream equipment of the fault handling scheme, auditing responsible persons related to the fault handling scheme, and auditing the iterative maintenance of the fault handling scheme.
6. The method for troubleshooting operation and maintenance through natural language processing as claimed in claim 1 further comprising,
s5, sequencing the fault coping schemes according to a recommendation sequence to form a scheme recommendation list of the fault coping schemes, and selecting the fault coping schemes from the scheme recommendation list or adopting a brand-new fault coping scheme to cope with the fault;
and S6, adjusting the recommendation sequence of the fault handling schemes in the scheme recommendation list based on the selection mode of the fault handling schemes in the step S5.
7. The method for operation and maintenance troubleshooting through natural language processing as recited in claim 6, wherein the recommended order evaluation content of the failure handling solution comprises:
the processing duration of the fault handling scheme, the complexity of an operation command of the fault handling scheme, the number of operation nodes of the fault handling scheme, the use frequency of key instructions in the fault handling scheme, and the traceability and retrogradability of relevant operations of the fault handling scheme.
8. The method for operation, maintenance and troubleshooting through natural language processing as recited in claim 6, wherein the arrangement of the scheme recommendation list comprises: N fault coping schemes, where N is a positive integer; the first scheme is the default scheme and the remaining fault coping schemes are alternatives; each fault coping scheme is annotated with the fault information and matched with the fault logs.
9. The method for operation, maintenance and troubleshooting through natural language processing as recited in claim 8, wherein the step S6 specifically comprises:
if the default scheme is selected, the recommendation order of the scheme recommendation list is left unchanged; or
if an alternative scheme is selected, the selected alternative is promoted to the default scheme and the original default scheme is demoted to an alternative; or
if a brand-new fault coping scheme is adopted, the new fault coping operation process is recorded as operation data and incorporated into the fault knowledge base; a new fault coping scheme is established, associated with the information system, added to the scheme recommendation list and promoted to the first scheme, while the remaining fault coping schemes in the list are demoted or deleted.
10. A system for operation and maintenance troubleshooting through natural language processing, which is applied to the method for operation and maintenance troubleshooting through natural language processing as claimed in any one of claims 1 to 9, and is characterized by comprising,
the acquisition module is arranged in the information system and used for acquiring the log data;
the operation and maintenance management server is connected with the acquisition module and is used for clustering the fault logs, analyzing them, recording the fault information, and extracting the keywords; a recommendation module arranged in the operation and maintenance management server is used for configuring and updating the fault coping scheme for the fault according to historical processing experience, and for configuring a recommendation order for the fault coping schemes;
and the fault knowledge base is connected with the recommendation module, and the fault coping schemes are stored in the fault knowledge base.
CN202210685072.9A 2022-06-15 2022-06-15 Method and system for operation, maintenance and troubleshooting through natural language processing Pending CN115296975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210685072.9A CN115296975A (en) 2022-06-15 2022-06-15 Method and system for operation, maintenance and troubleshooting through natural language processing

Publications (1)

Publication Number Publication Date
CN115296975A true CN115296975A (en) 2022-11-04

Family

ID=83819864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210685072.9A Pending CN115296975A (en) 2022-06-15 2022-06-15 Method and system for operation, maintenance and troubleshooting through natural language processing

Country Status (1)

Country Link
CN (1) CN115296975A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008110A (en) * 2013-02-26 2014-08-27 成都勤智数码科技股份有限公司 Method for automatically transferring operation and maintenance work order to knowledge base
CN106921526A (en) * 2017-04-13 2017-07-04 湖南森纳信息科技有限公司 Intelligent campus network O&M system
CN107341068A (en) * 2017-06-28 2017-11-10 北京优特捷信息技术有限公司 The method and apparatus that O&M troubleshooting is carried out by natural language processing
CN107657375A (en) * 2017-09-25 2018-02-02 国网上海市电力公司 A kind of method for electric network fault judgement, verification and fault incidence analysis
CN109271272A (en) * 2018-10-15 2019-01-25 江苏物联网研究发展中心 Big data component faults based on unstructured log assist repair system
CN109525614A (en) * 2017-09-15 2019-03-26 上海明匠智能系统有限公司 Industrial cloud operational system
CN113869791A (en) * 2021-10-20 2021-12-31 深圳供电局有限公司 Power grid operation and maintenance repair method based on log model

Similar Documents

Publication Publication Date Title
CN104063473B (en) A kind of database audit monitoring system and its method
US10701096B1 (en) Systems and methods for anomaly detection on core banking systems
KR100561628B1 (en) Method for detecting abnormal traffic in network level using statistical analysis
KR102033169B1 (en) intelligence type security log analysis method
CN103546343B (en) The network traffics methods of exhibiting of network traffic analysis system and system
Brahmi et al. Towards a multiagent-based distributed intrusion detection system using data mining approaches
CN114553537A (en) Abnormal flow monitoring method and system for industrial Internet
Wang et al. A centralized HIDS framework for private cloud
CN109787844A (en) A kind of distribution master station communication fault fast positioning system
Skopik et al. synERGY: Cross-correlation of operational and contextual data to timely detect and mitigate attacks to cyber-physical systems
Wang et al. Unsupervised learning for log data analysis based on behavior and attribute features
KR102418594B1 (en) Ict equipment management system and method there of
Wenhui et al. A novel intrusion detection system model for securing web-based database systems
CN115296975A (en) Method and system for operation, maintenance and troubleshooting through natural language processing
CN114760083B (en) Method, device and storage medium for issuing attack detection file
CN103248505B (en) Based on method for monitoring network and the device of view
US20230034914A1 (en) Machine Learning Systems and Methods for API Discovery and Protection by URL Clustering With Schema Awareness
CN112437070B (en) Operation-based spanning tree state machine integrity verification calculation method and system
CN115529268A (en) Processing instructions to configure a network device
Gao et al. Study on data acquisition solution of network security monitoring system
Cisco Chapter 2: Content Engine Management Configuration and Features
Alghamdi et al. Pattern extraction for behaviours of multi-stage threats via unsupervised learning
CN110933066A (en) Monitoring system and method for illegal access of network terminal to local area network
CN114124459B (en) Cluster server security protection method, device, equipment and storage medium
KR102411941B1 (en) Firewall redundancy system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination