CN115617953A

CN115617953A - Intelligent diagnosis method and system for network service link fault

Info

Publication number: CN115617953A
Application number: CN202211420860.1A
Authority: CN
Inventors: 邹昆; 李丽娟; 霍曦; 段军; 原小卫; 杨海琴; 张驰; 郭春江; 李亮; 李晨华洋; 汪俊贵; 古训; 刘越
Original assignee: Chengdu Jiuzhou Electronic Technology Co Ltd
Current assignee: Chengdu Jiuzhou Electronic Technology Co Ltd
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2023-01-17

Abstract

The invention provides a method and a system for intelligently diagnosing network service link faults, belonging to the technical field of computers. The method comprises the following steps: sampling and extracting the logs of the system service software; extracting features from the sampled data and marking them as sample data; training a TID model on the sample data by using a PCA algorithm; collecting system logs and analyzing them with the TID model; and outputting a fault diagnosis result. The invention comprehensively analyzes and processes the logs of the system service software and can accurately diagnose abnormal conditions of the system; because the logs are processed with NLP, the method is applicable to different programming languages and different systems; the system logs can be diagnosed online, the running state of the system is fed back in near real time, and the operation and maintenance cost is greatly reduced.

Description

Intelligent diagnosis method and system for network service link fault

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a method and a system for intelligently diagnosing network service link faults.

Background

With the deepening construction of network service systems, more and more supporting software is deployed across the whole system, such as data governance, data subscription, data downloading and data cataloging software. This software covers everything from the acquisition, processing and storage of front-end data sources to the final display, and the individual stages are connected in series to form a network service link.

In previous system maintenance work, operation and maintenance personnel traced fault sources step by step through the related software logs, following the error prompts raised by the software during inspection; if they did not solve the problem in time, the whole system could not be used normally. Moreover, because a service link involves so much software, both the difficulty and the duration of troubleshooting increase, so that business personnel cannot use the software for a long time, which affects their normal work. The traditional approach of manually locating the fault source through logs suffers from low locating accuracy, long time consumption and low working efficiency; such an operation and maintenance mode directly affects the whole service link and the system, and greatly increases the operation and maintenance cost.

To guarantee the normal work of business personnel and the effective operation of the network service system, higher requirements are placed on system monitoring and fault diagnosis. The invention therefore studies intelligent diagnosis technology for system software links, so as to meet the requirement of quickly and accurately diagnosing link faults of the system software and to provide a reliable guarantee for the normal operation of the whole system.

Therefore, the existing troubleshooting process has the following problems: (1) The links of the service link related to software are too many, and any software failure can cause the whole link to be incapable of being normally used; (2) data loss or errors, requiring full link troubleshooting; (3) the error information is imperfect, and the fault location is inaccurate; (4) the troubleshooting time is too long; (5) Business personnel need to perform related work on data stored and displayed in a business link in real time.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides the intelligent diagnosis method and the intelligent diagnosis system for the network service link fault, which aim to solve the existing problems.

In order to achieve the above purpose, the invention adopts the technical scheme that:

The scheme provides an intelligent diagnosis method for network service link faults, which comprises the following steps:

S1, sampling and extracting system service software logs;

S2, extracting features from the sampled data and marking them as sample data;

S3, training the TID model by using a PCA algorithm according to the sample data;

S4, collecting system service software logs, analyzing the system service software logs of different types by using the TID model, obtaining a fault diagnosis result according to the analysis result, and completing fault diagnosis of the network service link.

The invention has the beneficial effects that: the invention takes a network service system as background, provides a TID model on the basis of analyzing a traditional fault diagnosis model and a common diagnosis method aiming at the characteristics of the network service system and the requirement of system fault diagnosis, provides a learning mechanism based on a PCA algorithm for improving the fault diagnosis efficiency, and can quickly and accurately diagnose and position the system fault; according to the invention, the system service software log is comprehensively analyzed and processed, so that the abnormal condition of the system can be accurately diagnosed; the invention can be applied to different programming languages and different systems by extracting NLP characteristics of the log; the invention can diagnose the system log on line, feed back the system running state in near real time and greatly reduce the operation and maintenance cost.

Further, the step S1 specifically includes: sampling and extracting the full data of the INFO level logs, DEBUG level logs, WARN level logs and ERROR level logs of the system service software within a certain time period.

The beneficial effects of the above further scheme are: the full-data collection of the logs can be tailored to each service; the pipeline for collecting logs is arranged through configuration files, which reduces intrusion into the source code; and log data, log files and other information from different sources and in different formats are collected, aggregated and sampled.

Still further, the step S2 specifically includes: using natural language processing (NLP) to regard each log as a piece of text, extracting and marking the key features of the log, and taking the key features as sample data; or

using a rule-based log feature extraction method to extract the key features of different fields of the log through keyword extraction or specified filtering rules, and marking them as sample data.

The beneficial effects of the further scheme are as follows: key feature extraction is based on word frequency statistics. Words at different positions are given different position weights by means of a paragraph labeling technique; word similarity is calculated for words in the segmentation result that share the same part of speech and have a high word frequency; words with high similarity are merged; and the keywords are obtained by ranking according to weights combined with inverse word frequency. Compared with traditional Chinese keyword extraction methods, this solves the problem of low keyword extraction precision caused by ignoring words with high similarity; the improved algorithm achieves better precision and recall than the original one, and the extracted keyword set better reflects the text content.
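The weighting idea can be illustrated with a heavily simplified sketch. It assumes that "position weight" simply means the first token of a log counts more, and it omits the similarity-based merging step entirely; `keyword_scores`, `lead_weight` and the toy token lists are illustrative names and data, not part of the patent.

```python
import math
from collections import Counter

def keyword_scores(docs, lead_weight=2.0):
    """Score words by position-weighted term frequency times inverse document frequency."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # number of logs containing each word
    scores = []
    for doc in docs:
        tf = Counter()
        for position, word in enumerate(doc):
            # Assumed position weighting: the leading token counts more.
            tf[word] += lead_weight if position == 0 else 1.0
        scores.append({
            word: (count / len(doc)) * math.log(len(docs) / doc_freq[word] + 1.0)
            for word, count in tf.items()
        })
    return scores

docs = [
    ["timeout", "download", "failed", "timeout"],
    ["download", "finished"],
    ["catalog", "download", "ok"],
]
scores = keyword_scores(docs)
top_keyword = max(scores[0], key=scores[0].get)
```

Words that occur in every log (here `download`) receive a low inverse-frequency factor, so rarer words rank higher.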

Still further, using natural language processing (NLP) to regard each log as a piece of text, extracting and marking the key features of the log, and taking the key features as sample data specifically includes:

treating each log as a text by using Natural Language Processing (NLP);

acquiring a word vector of an input word in each text;

inputting the word vector into an encoder to obtain an information matrix C;

inputting the information matrix C into a decoder, and extracting the key features of the log by using the decoder;

and marking key characteristics of the log as sample data.

The beneficial effects of the further scheme are as follows: the method realizes key feature extraction based on the Transformer, and the Transformer breaks through the limitation that the RNN model can not be calculated in parallel; compared to CNN, the number of operations required to calculate the association between two locations does not increase with distance; self-attention may produce a more interpretable model from which attention distributions may be examined and the individual heads (attention heads) may learn to perform different tasks.

Still further, the step S3 includes the steps of:

S301, standardizing the sample data;

S302, calculating a covariance matrix according to the standardized sample data;

S303, calculating the eigenvectors and eigenvalues of the covariance matrix by using singular value decomposition;

S304, calculating the variance contribution rate from the eigenvalues;

S305, judging whether the variance contribution rate is greater than 95%; if so, obtaining the number of principal components and proceeding to step S306; otherwise, returning to step S301;

S306, obtaining a result matrix according to the number of principal components and the eigenvectors, determining the m vectors of the corresponding principal components according to the result matrix, and outputting the TID model;

S307, in order to simulate data that have not yet appeared, splitting the available data into two parts, which are used as a training set and a test set respectively;

S308, training the TID model with the training set, predicting the test set with the trained TID model, and optimizing the TID model parameters according to the prediction results to complete the training of the TID model.

The beneficial effects of the further scheme are as follows: in model generation, the distribution characteristics of all data are learned from the sample data, so learning converges faster, and as the sample size increases the learned model converges to the true model more quickly; for different service scenarios, different algorithms can be selected through configuration as the basis for deriving the TID model.

Still further, the expression of the covariance matrix is as follows:

wherein,

representing a variance matrix, n representing the number of variables, i representing an index, X_i representing the i-th variable,

represents the mean value of the variables.
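Read together, these definitions match the conventional sample covariance matrix. As a hedged sketch only, with Σ as an assumed symbol for the covariance matrix, X̄ as an assumed symbol for the mean, and the common 1/(n-1) normalization assumed (a 1/n normalization is equally compatible with the definitions above):

```latex
\Sigma = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(X_i-\bar{X}\right)^{\mathsf{T}},
\qquad
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
```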

Still further, the expression of the variance contribution rate is as follows:

$$\eta_k = \frac{\lambda_k}{\sum_{n}\lambda_n}$$

wherein $\eta_k$ represents the variance contribution rate of the k-th principal component, $\lambda_k$ represents the variance of the k-th principal component, and $\lambda_n$ represents the n-th eigenvalue.

Still further, the step S4 includes the steps of:

S401, labeling the collected system service software logs of different types;

S402, cleaning the labeled log data, and segmenting the cleaned, labeled text data;

S403, replacing the words in the TID model and in the segmentation result with word numbers, to form a word index sequence corresponding to the TID model and a word index sequence corresponding to the verification set respectively;

S404, mapping all word index sequences from step S403 to the dimensions of the log key features through vector calculation, and calculating the vector distance between the feature vector and the normal sub-feature space;

S405, judging whether the vector distance exceeds a preset threshold; if so, determining that the network service link is abnormal; otherwise, determining that the network service link is not abnormal, and ending the process.
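The word-number replacement of step S403 can be sketched as follows. This is a minimal illustration that assumes the logs are already cleaned and tokenized; reserving index 0 for unknown words is a choice made here, and `build_vocab` and `to_index_sequence` are hypothetical helper names.

```python
def build_vocab(token_lists):
    """Assign each distinct word a number; 0 is reserved for unknown words."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def to_index_sequence(tokens, vocab):
    """Replace every word with its number (0 when the word was never seen)."""
    return [vocab.get(tok, 0) for tok in tokens]

# Toy tokenized logs standing in for the segmented training text.
train_tokens = [["download", "failed", "timeout"], ["download", "finished"]]
vocab = build_vocab(train_tokens)
seq = to_index_sequence(["download", "failed", "disk"], vocab)
```

Using one shared vocabulary for both the model side and the verification side keeps the two word index sequences comparable.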

The beneficial effects of the further scheme are as follows: for labeling, a service attribute value can be used directly as a label without processing or conversion; labels can also be derived by simple analysis of the data according to the rules of common behaviors in the service; data can further be labeled by establishing recognition rules and performing component analysis on common data. Whether the label design is reasonable is verified by extracting sample data and checking label accuracy and coverage; the degree of reliability is calculated directly from indexes such as recall and precision on the model samples; and the accuracy of the labels is verified through observation and statistics of subsequent result data.

The invention provides a fault diagnosis system of a network service link, which comprises:

the sampling module is used for sampling and extracting the system service software log;

the characteristic extraction module is used for extracting and marking the characteristics of the sampled data as sample data;

the model training module is used for training the TID model by utilizing a PCA algorithm according to the sample data;

and the fault diagnosis module is used for acquiring system service software logs, analyzing the system service software logs of different types by using the TID model, obtaining a fault diagnosis result according to the analysis result and completing fault diagnosis of the network service link.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a flow chart of the PCA algorithm in this embodiment.

Fig. 3 is a schematic diagram of the system structure of the present invention.

Detailed Description

The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of the embodiments; it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Examples

As shown in fig. 1, the present invention provides an intelligent diagnosis method for a network service link failure, which is implemented as follows:

s1, sampling and extracting a system service software log, which specifically comprises the following steps: sampling and extracting full data of an INFO level log, a DEBUG level log, a WARN level log and an ERROR level log of system service software in a certain time period;

In this embodiment, the system service software logs include INFO logs, DEBUG logs, WARN logs and ERROR logs, and ETL sampling extraction is used to extract the full data of a certain time period. Data are collected through the ODS area; data quality analysis is performed on the data in the ODS area to verify their correctness, integrity, consistency, completeness, validity, timeliness and accessibility; the checked data are then converted appropriately, and data in the source database that are ambiguous, duplicated, incomplete, or in violation of business or logic rules are handled in a unified manner; finally, the converted data are loaded in their original order.
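A minimal sketch of the full-data extraction for one time period is given below. It assumes a hypothetical line format `YYYY-MM-DD HH:MM:SS LEVEL message`, since the patent does not specify a log format; `extract_window` and the sample lines are illustrative only.

```python
from datetime import datetime

LEVELS = {"INFO", "DEBUG", "WARN", "ERROR"}

def extract_window(lines, start, end):
    """Return (timestamp, level, message) for every log line in [start, end)."""
    records = []
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # skip lines that do not match the assumed format
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
        level, message = parts[2], parts[3]
        if level in LEVELS and start <= ts < end:
            records.append((ts, level, message))
    return records

sample_lines = [
    "2022-11-15 10:00:01 INFO data subscription started",
    "2022-11-15 10:00:05 ERROR download failed: timeout",
    "2022-11-15 11:30:00 WARN catalog latency high",
    "malformed line",
]
window = extract_window(
    sample_lines,
    datetime(2022, 11, 15, 10, 0, 0),
    datetime(2022, 11, 15, 11, 0, 0),
)
```

Malformed lines are dropped here for brevity; a quality-checking stage like the one described above would more likely count and report them.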

S2, performing feature extraction and marking on the sampled data as sample data, wherein the sample data specifically comprises the following steps:

each log is regarded as a section of text by using natural language processing NLP, key features of the log are extracted and marked, and the key features are used as sample data; or

Using natural language processing (NLP) to regard each log as a piece of text, extracting and marking the key features of the log, and taking the key features as sample data specifically includes the following steps:

each log is regarded as a text by natural language processing NLP;

acquiring a word vector of an input word in each text;

inputting the word vector into an encoder to obtain an information matrix C;

and marking key features of the log as sample data.

In this embodiment, feature extraction means extracting the key features of the logs, and the extraction methods can be divided into NLP-based (natural language processing) and rule-based log feature extraction. The log files are processed by means of topic extraction, type classification, structural analysis, semantic representation and the like. Various techniques from the NLP field, such as synonym replacement, semantic normalization, omission/error correction, automatic word segmentation, part-of-speech analysis, syntactic analysis and semantic analysis, are applied to each log text, keywords are extracted or filtering rules are specified, and the different fields of the log are extracted.

In this embodiment, log feature extraction is performed by a feature extractor composed of two major parts, an encoder and a decoder, each containing 6 blocks. All encoder blocks are structurally identical but do not share parameters; they are responsible for mapping the natural language sequence into hidden layers that contain a representation of the sequence.

The first step is as follows: and acquiring a Word vector X of the input Word, wherein the X is obtained by adding Word embedding and position embedding, and the Word embedding is obtained by adopting Word2Vec or Transformer algorithm pre-training. The position information of each word is set so as to identify the sequential relationship in the language. The position information PE of the Transformer model is expressed by linear transformation of sin and cos:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

wherein pos represents the position of a word in the sentence; for example, if the sentence consists of 10 words, pos takes any position in [0, 9], and in general its value range is [0, max sequence length); d represents the dimension of the PE, which equals the dimension of the word vector (for example, 256); i indexes the pairs of dimensions, so that 2i denotes an even dimension (sin) and 2i+1 the following odd dimension (cos), and for a 256-dimensional word vector the dimension indices 2i and 2i+1 together cover [0, 255]. Each pair of sin and cos formulas thus corresponds to one pair of even and odd embedding dimensions. Using the sin function and the cos function in this way produces different periodic variations across dimensions, from which the dependency between positions and the sequential characteristics of natural language can be obtained.
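The position information can be computed directly from the two formulas. The sketch below assumes an even embedding dimension d; `positional_encoding` is an illustrative name.

```python
import math

def positional_encoding(max_len, d):
    """Return a max_len x d table: sin on even dimensions, cos on odd ones."""
    pe = [[0.0] * d for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d // 2):            # i indexes one (even, odd) dimension pair
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

# A 10-word sentence with 256-dimensional word vectors, as in the example above.
pe = positional_encoding(max_len=10, d=256)
```

Row pos of the table is added element-wise to the word embedding of the word at position pos to give the input vector X.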

The second step: the vector matrix obtained in the first step is passed to the encoder, which comprises 6 blocks and outputs an encoded information matrix C; the output of each encoder block has exactly the same dimensions as its input.
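The dimension-preserving behaviour of an encoder block can be seen in its core operation, self-attention. The sketch below is a single-head, NumPy-only illustration with random weights; a full Transformer encoder block additionally uses multiple heads, residual connections, layer normalization and a feed-forward layer, none of which are shown, and all names are illustrative.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))                   # stand-in for the vectors X
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
```

The `attn` matrix is the attention distribution that can be inspected for interpretability: row j shows how strongly position j attends to every other position.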

The third step: the encoded information matrix C output by the encoder is passed to the decoder, which translates the next word i+1 in turn according to the already translated words 1 to i, thereby extracting the features.

In this embodiment, rule-based extraction of structured log information assumes a hypothetical or known log structure and extracts the different fields of the log through keyword extraction or pre-defined filtering rules. It mainly targets relational logs, which have a specific format (e.g., POSIX format) and logical structure, so that information is easily extracted by regular expressions. Web application logs likewise have relatively consistent formats, and the variable information in them can be extracted through simple regular-expression matching.
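As an illustration of such regular-expression matching, the sketch below extracts named fields from a web access log line. The pattern assumes a common access-log layout chosen for this example; it is not taken from the patent, and `ACCESS_PATTERN` and `extract_fields` are hypothetical names.

```python
import re

# Hypothetical web-access-log layout; a real deployment would supply its own pattern.
ACCESS_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)

def extract_fields(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = ACCESS_PATTERN.match(line)
    return match.groupdict() if match else None

fields = extract_fields(
    '10.0.0.7 - - [15/Nov/2022:10:00:05 +0800] "GET /catalog/list HTTP/1.1" 500 312'
)
```

Lines that do not fit the known structure return `None`, which makes them natural candidates for the NLP-based path instead.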

S3, training the TID model by using a PCA algorithm according to the sample data, wherein the implementation method comprises the following steps:

S301, standardizing the sample data;

S302, calculating a covariance matrix according to the standardized sample data:

wherein,

represents the mean value of the variables;

S304, calculating the variance contribution rate from the eigenvalues;

S305, judging whether the variance contribution rate is greater than 95%; if so, obtaining the number of principal components and proceeding to step S306; otherwise, returning to step S301;

S306, obtaining a result matrix according to the number of principal components and the eigenvectors, determining the m vectors of the corresponding principal components according to the result matrix, and outputting the TID model;

S307, in order to simulate data that have not yet appeared, splitting the available data into two parts, which are used as a training set and a test set respectively;

S308, training the TID model with the training set, predicting the test set with the trained TID model, and optimizing the parameters of the TID model according to the prediction results to complete the training of the TID model.

In this embodiment, the variance contribution rate is obtained by linearly combining the eigenvector of the covariance matrix and the original variable:

wherein Y represents the covariance contribution rate, P represents the covariance matrix, P_n represents the n-th value of the covariance matrix, X represents the vector of original variables, e_nn represents the unit eigenvector, and X_n represents the weight information of the n-th original variable vector.

The variance of a principal component measures how much of the variance of the data set it can explain. The variance of a principal component is an eigenvalue λ of the covariance matrix of X, so the variance of the k-th principal component is λ_k. An index called the variance contribution rate of the principal component Y_k is therefore defined as the ratio of the variance of the k-th principal component to the total variance:

$$\eta_k = \frac{\lambda_k}{\sum_{n}\lambda_n}$$

wherein $\eta_k$ represents the variance contribution rate of the principal component $Y_k$, $\lambda_k$ represents the variance of the k-th principal component, and $\lambda_n$ represents the n-th eigenvalue.
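A small worked sketch of this ratio follows. It assumes that the 95% test is applied to the cumulative contribution rate of the leading components, which is the common reading but is not stated explicitly in the text; the eigenvalues and the names `contribution_rates` and `components_needed` are illustrative.

```python
def contribution_rates(eigenvalues):
    """Contribution rate of each component: its eigenvalue over the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def components_needed(eigenvalues, threshold=0.95):
    """Smallest m whose cumulative contribution rate exceeds the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for m, rate in enumerate(contribution_rates(ordered), start=1):
        cumulative += rate
        if cumulative > threshold:
            return m
    return len(ordered)

eigenvalues = [6.0, 3.0, 0.8, 0.2]          # made-up eigenvalues for illustration
rates = contribution_rates(eigenvalues)      # roughly 0.60, 0.30, 0.08, 0.02
m = components_needed(eigenvalues)
```

With these numbers the first two components explain about 90% of the variance, so a third is needed to pass 95%.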

In this embodiment, the PCA algorithm is a statistical method that recombines the original variables into a new set of mutually independent composite variables, from which a smaller number of composite variables can be selected according to actual needs so as to reflect as much of the information of the original variables as possible; mathematically, it is a dimension reduction method.

In this embodiment, the key features of normal system operation are learned from the log feature vectors, and anomalies are the outliers found either by comparison against these key features or by unsupervised clustering of the log set or the set of log feature vectors. The method generally uses the feature vectors of a single log or of a log set to construct the TID model, and TID model training and tuning are carried out on a Spark distributed platform; the parameters of the TID model are adjusted according to evaluation indexes to achieve the best offline effect; the effectiveness of improvements to the TID model is verified by comparing experimental results; finally, the value ranges of most log template frequency vectors on the key feature dimensions are calculated, for example the range within which 95% of the data fall on each key feature dimension, and these values are recorded to generate the normal sub-feature space.
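Recording the 95% value range per key feature dimension might look like the sketch below. It assumes the range is the central interval between the 2.5th and 97.5th percentiles, which is one possible reading of the text; `normal_range`, `inside` and the toy data are illustrative.

```python
import numpy as np

def normal_range(projected, coverage=0.95):
    """Per-dimension interval containing the central `coverage` share of the data."""
    tail = (1.0 - coverage) / 2.0 * 100.0
    low = np.percentile(projected, tail, axis=0)
    high = np.percentile(projected, 100.0 - tail, axis=0)
    return low, high

def inside(low, high, vector):
    """True when the vector lies within the recorded range on every dimension."""
    return bool(np.all((vector >= low) & (vector <= high)))

# Toy data: 201 vectors already mapped onto two key feature dimensions.
projected = np.column_stack([np.linspace(-1.0, 1.0, 201), np.linspace(0.0, 10.0, 201)])
low, high = normal_range(projected)
```

The pair `(low, high)` is what would be stored as the normal sub-feature space and consulted at detection time.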

In this embodiment, the flow of the PCA algorithm is shown in fig. 2: (1) ETL sampling extraction is used to extract the full data of a certain time period, and a sample data set A is obtained after processing; (2) the key features of the samples are extracted by the feature extractor, namely the Transformer; (3) the sample feature data are standardized to eliminate the influence of differing dimensions (units), and the covariance matrix of the standardized variables is calculated; (4) the eigenvectors and eigenvalues of the covariance matrix are computed by singular value decomposition; (5) the variance contribution rate is calculated from the eigenvalues; (6) it is judged whether the variance contribution rate is greater than 95%: if not, the variance contribution rate is recalculated, and if so, the number m of principal components is obtained; (7) the result matrix T = AU is obtained from the number of principal components and the eigenvectors; (8) the m vectors of the corresponding principal components are determined; (9) the TID model is output.
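The numerical core of this flow (standardization, covariance matrix, singular value decomposition, 95% selection, T = AU) can be sketched with NumPy. This is an illustration under assumptions, not the patented TID model: the synthetic data, the cumulative reading of the 95% test, and the names `fit_pca` and `transform` are all hypothetical.

```python
import numpy as np

def fit_pca(a, threshold=0.95):
    """Fit PCA on sample matrix a (rows = samples) and keep m components."""
    mean, std = a.mean(axis=0), a.std(axis=0, ddof=1)
    std[std == 0] = 1.0                       # guard against constant features
    z = (a - mean) / std                      # standardization
    cov = z.T @ z / (len(z) - 1)              # covariance of the standardized data
    u, s, _ = np.linalg.svd(cov)              # columns of u: eigenvectors, s: eigenvalues
    cumulative = np.cumsum(s) / s.sum()       # cumulative variance contribution rate
    m = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    m = min(m, len(s))
    return {"mean": mean, "std": std, "components": u[:, :m], "eigenvalues": s[:m]}

def transform(model, a):
    """Project standardized samples onto the kept components (T = A U)."""
    z = (a - model["mean"]) / model["std"]
    return z @ model["components"]

# Synthetic features: two latent factors spread over six dimensions plus small noise.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 2.0, 0.5, -1.0, 1.5, 0.3],
                   [0.5, -1.0, 2.0, 1.0, -0.5, 1.2]])
a = rng.normal(size=(200, 2)) @ mixing + 0.01 * rng.normal(size=(200, 6))
model = fit_pca(a)
t = transform(model, a)
```

Because the covariance matrix is symmetric and positive semi-definite, its singular values coincide with its eigenvalues, which is why the decomposition yields both quantities at once.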

S4, collecting system service software logs, analyzing the system service software logs of different types by using a TID model, obtaining a fault diagnosis result according to the analysis result, and completing fault diagnosis of the network service link, wherein the implementation method comprises the following steps:

S401, labeling the collected system service software logs of different types;

S404, mapping all word index sequences from step S403 to the dimensions of the log key features through vector calculation, and calculating the vector distance between the feature vector and the normal sub-feature space;

In this embodiment, in the anomaly detection stage, the log template frequency vector of the online log is mapped to the key feature dimensions through vector calculation, and the vector distance between this vector and the normal sub-feature space is calculated; if the distance exceeds a certain threshold, the system is determined to be abnormal. The anomaly detection stage checks whether the online log contains fault-related log information and thereby judges whether the online log is abnormal. Key features are extracted from the logs and reduced in dimension by the PCA method, all logs are clustered by the K-means algorithm, and the outliers found are taken as the detected anomalies.
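One way to read "vector distance to the normal sub-feature space" is the length of the part of the vector that the principal components cannot reproduce. The sketch below uses that residual distance as an assumption; the text does not fix the distance measure, and the toy subspace, the threshold and the function names are illustrative.

```python
import numpy as np

def residual_distance(components, z):
    """Distance from vector z to the subspace spanned by the orthonormal components."""
    projection = components @ (components.T @ z)
    return float(np.linalg.norm(z - projection))

def is_abnormal(components, z, threshold):
    """Flag the vector when its distance to the normal subspace exceeds the threshold."""
    return residual_distance(components, z) > threshold

# Toy normal subspace: the first two axes of a 3-dimensional feature space.
components = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
normal_vec = np.array([0.7, -1.2, 0.05])
faulty_vec = np.array([0.7, -1.2, 3.0])
```

With a threshold of 1.0, `normal_vec` stays within the normal subspace tolerance while `faulty_vec` is flagged; in practice the threshold would be tuned on the test set.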

By the design, the logs of the system service software are comprehensively analyzed and processed, and the abnormal condition of the system can be accurately diagnosed; the log of the service software is subjected to NLP processing, and the method is suitable for different programming languages and different systems; the system log can be diagnosed on line, the running state of the system can be fed back in a near real-time manner, and the operation and maintenance cost is greatly reduced.

Example 2

As shown in fig. 3, the present invention provides an intelligent diagnosis system for network service link failure, including:

the sampling module is used for sampling and extracting the system service software logs;

and the fault diagnosis module is used for acquiring system service software logs, analyzing different types of system service software logs by using the TID model, obtaining a fault diagnosis result according to the analysis result and completing fault diagnosis of the network service link.

The fault diagnosis system for a network service link provided in the embodiment shown in fig. 3 may execute the technical solution shown in the intelligent fault diagnosis method for a network service link in the above embodiment, and the implementation principle and the beneficial effect are similar, which are not described herein again.

In the embodiment of the invention, the functional units can be divided according to the intelligent network service link fault diagnosis method; for example, each function can be assigned to its own functional unit, or two or more functions can be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the present invention is schematic and is only a logical division; other division manners are possible in actual implementation.

In the embodiment of the invention, in order to realize the principle and the beneficial effects of the intelligent network service link fault diagnosis method, the fault diagnosis system of the network service link comprises a hardware structure and/or a software module corresponding to each function. Those of ordinary skill in the art will readily appreciate that the exemplary units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in hardware or in a combination of hardware and computer software. Whether a function is implemented as hardware or as computer software depends on the particular application and the design constraints of the technical solution; the described functions may be implemented in different ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.

The invention carries out comprehensive analysis processing on the log of the system service software, and can accurately diagnose the abnormal condition of the system; the log of the service software is subjected to NLP processing and can be suitable for different programming languages and different systems; the system log can be diagnosed on line, the running state of the system can be fed back in a near real-time manner, and the operation and maintenance cost is greatly reduced.

Claims

1. An intelligent diagnosis method for network service link faults is characterized by comprising the following steps:

S1, sampling and extracting system service software logs;

2. The method according to claim 1, wherein the step S1 specifically comprises: and in a certain time period, carrying out sampling extraction on full data of an INFO level log, a DEBUG level log, a WARN level log and an ERROR level log of system service software.

3. The method according to claim 2, wherein the step S2 specifically comprises: using natural language processing (NLP) to regard each log as a piece of text, extracting and marking key features of the log, and taking the key features as sample data; or

4. The method according to claim 3, wherein using natural language processing (NLP) to regard each log as a piece of text, extracting and marking key features of the log, and taking the key features as sample data specifically includes:

treating each log as a text by using Natural Language Processing (NLP);

acquiring a word vector of an input word in each text;

inputting the word vector into an encoder to obtain an information matrix C;

and marking key features of the log as sample data.

5. The intelligent network service link fault diagnosis method according to claim 4, wherein the step S3 comprises the following steps:

S301, standardizing sample data;

6. The method of claim 5, wherein the covariance matrix is expressed as follows:

wherein,

represents the mean value of the variables.

7. The method of claim 6, wherein the variance contribution rate is expressed as follows:

$$\eta_k = \frac{\lambda_k}{\sum_{n}\lambda_n}$$

wherein $\eta_k$ represents the variance contribution rate of the k-th principal component, $\lambda_k$ represents the variance of the k-th principal component, and $\lambda_n$ represents the n-th eigenvalue.

8. The method according to claim 7, wherein the step S4 comprises the steps of:

S401, marking collected system service software logs of different types;

9. An intelligent diagnostic system for network service link failure, comprising: