CN114168373A

CN114168373A - NLP-based disaster recovery system abnormal point detection method

Info

Publication number: CN114168373A
Application number: CN202111363265.4A
Authority: CN
Inventors: 董惠良; 姜学峰; 汪炎平
Original assignee: China Tobacco Zhejiang Industrial Co Ltd
Current assignee: China Tobacco Zhejiang Industrial Co Ltd
Priority date: 2021-11-17
Filing date: 2021-11-17
Publication date: 2022-03-11

Abstract

The invention discloses an NLP-based method for detecting abnormal points in a disaster recovery system, comprising the following steps: obtaining a log text data set, extracting a key-value pair structure based on NLP semantic analysis, establishing a multi-dimensional word vector matrix for each word in a dictionary, and constructing a word vector matrix for each log text data; inputting the current-time log text data and the previous-time abnormal risk confidence into a long short-term memory (LSTM) network to obtain an abnormal risk confidence, and comparing it with an abnormal risk threshold to obtain a risk log text data sequence; and inputting the risk log text data sequence into a Markov chain to obtain transition probabilities, taking the current log text data as an abnormal point of the risk log text data sequence when the transition probability reaches an abnormal transition probability threshold, and determining an abnormal station in the production process based on the abnormal point. The method can accurately detect abnormal points in log text.

Description

NLP-based disaster recovery system abnormal point detection method

Technical Field

The invention belongs to the field of disaster recovery backup of computer application systems, and particularly relates to a method for detecting an abnormal point of a disaster recovery system based on NLP.

Background

As the business of the tobacco industry depends more and more on information systems, the requirements for the stability, high availability, and rapid fault recovery of those systems keep rising. This places higher demands on information center administrators: problems must be located and handled quickly across hardware, operating systems, databases, middleware, and even business systems, which greatly increases the administrators' workload and the difficulty of their work. At present, the national tobacco administration and the provincial companies have established local or remote disaster recovery systems for their core business systems and can, in general, perform manual disaster recovery switching according to the disaster level. Switching efficiency is low, however: when a real disaster occurs, the manpower and material resources consumed by switching are huge, and the switching instruction is issued essentially on the basis of the system administrators' judgment of the fault, so the accuracy of the instruction needs to be improved. In particular, now that semi-automatic and automatic switching modes have been introduced, determining exactly when to issue the automatic switching instruction has become an urgent problem.

At present, in big data analysis for disaster recovery switching, methods that analyze anomalies from service indicators and machine indicators are mature, but methods that analyze system logs remain at the stage of preliminary comparison. Existing log-based anomaly detection methods fall into three main types: anomaly detection using PCA, anomaly detection by analyzing the correlation between different log categories, and anomaly detection using a workflow model. Although these methods can be applied effectively in specific fields, none of them is a general online anomaly detection method. It is therefore necessary to detect anomalies automatically from system logs. Detecting abnormal patterns in system logs is complex, however, and the main challenges are as follows:

First, logs are unstructured, and their format and semantics vary greatly between systems. Some existing approaches address this with rules, but rule design relies on domain knowledge, for example the regular expressions commonly used in industry. More importantly, rule-based methods are not suitable for general anomaly detection, because it is almost impossible to know in advance which points of interest different types of logs contain.

Second, timeliness. For users to discover system anomalies in time, anomaly detection must itself be timely. Log data arrives as a data stream, which means that methods that analyze the entire log data set are not applicable.

Third, there are many types of anomalies. Systems and applications may generate various types of anomalies. An anomaly detection system should not only target specific anomaly types but also be able to detect unknown anomalies. Log messages also contain rich information, such as the log key, parameter values, and timestamp. Most existing anomaly detection methods analyze only specific parts of the log message (such as log keys), which limits the types of anomalies they can detect.

Anomaly detection is therefore an important task in building a safe and reliable computer system. As tobacco business systems become more complex, they are exposed to more and more vulnerabilities and defects, so anomaly detection is increasingly challenging, and many conventional anomaly detection methods based on standard mining algorithms are no longer effective. The system log records system states and events at different points in time to help diagnose performance problems and faults, and the system log of the disaster recovery system for the tobacco core business system is especially helpful in deciding whether to issue a disaster recovery switching instruction and in providing the data that supports the instruction. Because disaster recovery system logs record the key events that occur during operation, they are an excellent source of information for understanding system state, online monitoring, and anomaly detection. Anomaly detection is a key step in building an efficient disaster recovery system, so improving the accuracy of anomaly detection is a necessary condition for improving the accuracy of automatic disaster recovery switching instructions.

Disclosure of Invention

The invention provides a method for detecting abnormal points of a disaster recovery system based on NLP, which can accurately detect the abnormal points in a log text.

An NLP-based disaster recovery system abnormal point detection method comprises the following steps:

(1) obtaining a log text data set, extracting each log text data based on NLP semantic analysis to obtain a key value pair structure, carrying out word frequency screening on a plurality of key value pair structures to construct a dictionary of the log text data set, establishing a multi-dimensional word vector matrix for each word in the dictionary by using a word2vec algorithm, and constructing each log text data word vector matrix by using a plurality of multi-dimensional word vector matrices;

(2) sequentially inputting the current-time log text data and the previous-time abnormal risk confidence coefficient into a Long Short-Term Memory network (LSTM) in time order to obtain the current-time abnormal risk confidence coefficient, taking the current-time log text data as risk log text data when the current-time abnormal risk confidence coefficient reaches an abnormal risk threshold value, and constructing a risk log text data sequence from the log text data before the current time and the risk log text data;

(3) inputting each log text data in the risk log text data sequence into a Markov chain in time order to obtain the transition probability of each log text data to the next log text data, taking the current log text data as an abnormal point of the risk log text data sequence when the transition probability reaches an abnormal transition probability threshold, and determining an abnormal station in the production process based on the abnormal point.

Extracting each log text data based on NLP semantic analysis to obtain a key-value pair structure, comprising:

based on NLP semantic analysis, extracting event types and time content text data corresponding to the event types in each log text data to construct a key value pair structure, wherein the event types are keys, and the corresponding time content text data are values.

Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, and it is a science that integrates linguistics, computer science, and mathematics. Its aim is to extract information from text data so that a computer can process or "understand" natural language and perform tasks such as automatic translation, text classification, and sentiment analysis.

A dictionary is constructed by performing word frequency screening on each word in the event type of each log text data and in the time content text data corresponding to that event type.
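The key-value extraction and word frequency screening described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the "type: content" line format, the simple split that stands in for NLP semantic analysis, and the names `to_key_value`, `build_dictionary`, and `min_count` are all assumptions. The sample lines are the six logs from the embodiment, with the trailing asterisk of the third log omitted.

```python
from collections import Counter

# Log lines from the embodiment; the "type: content" layout is an assumed format.
RAW_LOGS = [
    "instance: Terminating instance",
    "instance: Instance destroyed successfully",
    "instance: Deleting instance files",
    "instance: Permission Denied",
    "instance: Took 5s to destroy the instance",
    "instance: Termination of instance complete",
]

def to_key_value(line):
    """Split one log line into (event type, content); a stand-in for NLP parsing."""
    key, _, value = line.partition(":")
    return key.strip().lower(), value.strip().lower()

def build_dictionary(pairs, min_count=1):
    """Keep words whose frequency over all keys and values is at least min_count."""
    counts = Counter()
    for key, value in pairs:
        counts.update([key])
        counts.update(value.split())
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}

pairs = [to_key_value(line) for line in RAW_LOGS]
vocab = build_dictionary(pairs, min_count=2)
```

With `min_count=2` only the word "instance" survives the screening in this tiny sample; a real log set would use a threshold tuned to its size.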

Sequentially inputting the current-time log text data and the previous-time abnormal risk confidence coefficient into the long short-term memory network in time order to obtain the current-time abnormal risk confidence coefficient comprises:

sorting the log text data by time, splicing the word vector matrix of each log text data with the abnormal risk confidence coefficient of the previous moment, and inputting the splicing result into the long short-term memory network to obtain the abnormal risk confidence coefficient of the current moment.

The risk log text data is determined from the abnormal risk confidence coefficient output by the long short-term memory network. The abnormal risk confidence coefficient is a floating point number between 0 and 1; the larger the value, the higher the abnormal risk it represents. An abnormal risk threshold is set, and when the abnormal risk confidence coefficient reaches the threshold, the risk log text data is located.

Splicing the word vector matrix of each log text data with the abnormal risk confidence coefficient of the previous moment comprises:

converting the abnormal risk confidence coefficient into a multi-dimensional one-hot code, splicing the one-hot code with the multi-dimensional word vector matrix of each log text data, and taking the splicing result as the input of the long short-term memory network.

The abnormal risk confidence coefficient is converted into the multi-dimensional one-hot code t as follows:

t = round(p × n)

where p is the floating point value of the abnormal risk confidence and n is the word vector dimension of each log text data.
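The conversion t = round(p × n) and the splicing step can be illustrated with a short sketch. Two points are assumptions rather than statements from the patent: positions in the one-hot code are counted from 1, so that t = 0 yields an all-zero vector, and the encoded confidence is appended to the word vector matrix as one extra row. The function names are illustrative.

```python
def confidence_to_one_hot(p, n):
    """Encode confidence p in [0, 1] as an n-dimensional vector via t = round(p * n).

    Assumption: positions are counted from 1, so t = 0 gives the all-zero vector.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    t = round(p * n)  # Python rounds exact halves to the nearest even integer
    vector = [0.0] * n
    if t > 0:
        vector[t - 1] = 1.0
    return vector

def splice(word_matrix, p):
    """Append the encoded confidence as one extra row of the word vector matrix."""
    n = len(word_matrix[0])
    return word_matrix + [confidence_to_one_hot(p, n)]

encoded = confidence_to_one_hot(0.6, 5)            # t = 3, third position set
spliced = splice([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], 1.0)
```

Because Python's built-in `round` sends exact halves to the nearest even integer, a different rounding rule would need an explicit implementation.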

The transition probability λ_ij is:

[formula not reproduced in the source text]

where N_ij is the number of transitions from the log text data at the i-th time to the log text data at the j-th time within the unit time interval Δs, [symbol missing in source] is the number of log text data at the i-th time in the log text data set within time T, and Δs is the unit time interval between two adjacent pieces of log text data.

The transition probability of each log text data in the risk log text data sequence to the log text data at the next time is determined by a Markov chain, wherein the Markov chain is a multidimensional asymmetric sparse matrix whose diagonal elements are:

λ_ii = -∑_{j≠i} λ_ij.
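The diagonal rule means that every row of the matrix sums to zero. A minimal sketch of filling the diagonal from given off-diagonal entries is shown below; the numeric values are hypothetical and the function name `fill_diagonal` is illustrative.

```python
def fill_diagonal(rates):
    """Return a copy of a square matrix with lambda_ii = -sum over j != i of lambda_ij."""
    size = len(rates)
    result = [list(row) for row in rates]
    for i in range(size):
        result[i][i] = -sum(rates[i][j] for j in range(size) if j != i)
    return result

# Hypothetical off-diagonal values for three log states (not from the patent).
off_diagonal = [
    [0.0, 0.5, 0.25],
    [0.0, 0.0, 0.75],
    [0.125, 0.0, 0.0],
]
matrix = fill_diagonal(off_diagonal)
```

The matrix is asymmetric because a transition from state i to state j need not be as likely as the reverse, and sparse because most pairs of log states never follow one another.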

Compared with the prior art, the invention has the following beneficial effects:

(1) The log text of the current tobacco industry disaster recovery system is analyzed word by word with an NLP semantic analysis algorithm, which improves anomaly detection precision, provides strong data support for issuing automatic switching instructions, and offers an approach for subsequent fully automatic disaster recovery switching.

(2) The long short-term memory network first locates the risk log text, so that abnormal text is located quickly; the log texts related to the risk log text are then extracted in time order to obtain the risk log text sequence of a workflow, and the Markov chain accurately and quickly obtains the abnormal points in the risk log text sequence.

Drawings

Fig. 1 is a block flow diagram of a method for detecting an abnormal point of a disaster recovery system based on NLP according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a method for detecting an abnormal point of an NLP-based disaster recovery system according to an embodiment of the present invention.

Detailed Description

As shown in fig. 1, the present invention provides a method for detecting an abnormal point of a disaster recovery system based on NLP, which comprises the following steps:

(1) preprocessing a text log, which comprises the following specific steps:

log raw data, namely a log text data set, is obtained. Each log text data is divided by NLP semantic analysis into a key-value pair structure of event type and time content text data, so that the log text data set comprises a plurality of key-value pair structures. Word frequency screening is performed on each word in the key-value pair structures to construct a dictionary of the log text data set, a multi-dimensional word vector matrix is established for each word in the dictionary with the word2vec algorithm, and the word vector matrix of each log text data is constructed from these multi-dimensional word vector matrices. Converting the log text data into multi-dimensional word vector matrices turns the semantic data of the log text into multi-dimensional word vector data, so that abnormal data can be detected more accurately and the abnormal log text data obtained;

(2) path anomaly detection with the long short-term memory network, which comprises the following specific steps:

all log text data are sorted by time. The current log text data is spliced with the abnormal risk confidence obtained from the long short-term memory network at the previous time, and the splicing result is input into the long short-term memory network again to obtain the abnormal risk confidence at the current time. Log text data continue to be input in time order, so the abnormal risks of the input log text data accumulate. When the abnormal risk confidence at the current time reaches the abnormal risk threshold, the current log text data is taken as risk log text data, and a risk log text data sequence is constructed from the risk log text data and the related log text data that precede the current time;

the abnormal risk confidence is a floating point number between 0 and 1; the larger the value, the higher the abnormal risk it represents. An abnormal risk threshold is set, and when the abnormal risk confidence reaches the threshold, the risk log text data is located. The abnormal risk confidence is converted into a multi-dimensional one-hot code, the one-hot code is spliced with the multi-dimensional word vector matrix of each log text data, and the splicing result is used as the input of the long short-term memory network. The multi-dimensional one-hot code t is:

t = round(p × n)

where p is the floating point value of the abnormal risk confidence and n is the word vector dimension of each log text data;

(3) workflow anomaly detection, which comprises the following specific steps:

the anomaly probability that the LSTM-based algorithm outputs for the current log text data can be regarded as the accumulated anomaly confidence of the preceding log text data. In actual production, by the time the abnormal risk exceeds the threshold, an error may already have occurred earlier. To investigate and locate it more precisely, a workflow model, namely a Markov chain model, is used to compute a transition probability matrix that quantifies the probability of the next log record;

a workflow model is constructed from the transitions between log text data, and the transition probability matrix is calculated. The workflow model is a multidimensional asymmetric sparse matrix whose diagonal elements are:

λ_ii = -∑_{j≠i} λ_ij

where λ_ij is the transition probability, N_ij is the number of transitions from the log text data at the i-th time to the log text data at the j-th time within the unit time interval Δs, [symbol missing in source] is the number of log text data at the i-th time in the log text data set within time T, and Δs is the unit time interval between two adjacent log text data;

the log with a higher abnormal risk probability at the current time t and the logs before time t form a subsequence, namely the risk log text data sequence. The risk log text data sequence is input into the workflow model to obtain the transition probability of each log text data to the log text data at the next time; when the transition probability reaches the abnormal transition probability threshold, the current log text data is taken as an abnormal point of the risk log text data sequence, and the abnormal station in the production process is determined based on the abnormal point.
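The feedback loop of step (2), in which each confidence value is fed back together with the next log until the threshold is reached, can be sketched as follows. The trained LSTM is replaced here by a hypothetical scoring function `toy_score`, and the `window` parameter, which fixes how many logs are kept counting back from the risky one, is an assumption: the patent only says that related preceding logs are used.

```python
def find_risk_sequence(logs, score, threshold, window):
    """Feed each log and the previous confidence to `score`; stop at the first risky log.

    `score(log, previous_confidence)` stands in for the trained LSTM and must
    return a float in [0, 1]. `window` is how many logs to keep, counting back
    from the risky one. Both names are illustrative.
    """
    confidence = 0.0  # initial risk, as in the patent's example
    for index, log in enumerate(logs):
        confidence = score(log, confidence)
        if confidence >= threshold:
            start = max(0, index - window + 1)
            return logs[start:index + 1], confidence
    return [], confidence

def toy_score(log, previous):
    """Hypothetical scorer: risk accumulates slowly and jumps on 'denied'."""
    bump = 0.75 if "denied" in log.lower() else 0.0625
    return min(1.0, previous + bump)

logs = [
    "Terminating instance",
    "Instance destroyed successfully",
    "Deleting instance files",
    "Permission Denied",
    "Took 5s to destroy the instance",
    "Termination of instance complete",
]
sequence, risk = find_risk_sequence(logs, toy_score, threshold=0.5, window=3)
```

With these toy settings the loop stops at the fourth log and returns the second to fourth logs as the risk sequence, which is then the input to the workflow model.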

Examples

As shown in Fig. 2, a portion of a log sequence is selected and its abnormal points are detected with the NLP-based disaster recovery system abnormal point detection method, as follows:

The following log raw data are selected:

l₁：instance：Terminating instance

l₂：instance：Instance destroyed successfully

l₃：instance：Deleting instance files*

l₄：instance：Permission Denied

l₅：instance：Took 5s to destroy the instance

l₆：instance：Termination of instance complete

S1, the log data l_i, i = 1, ..., 6, are processed into key-value pairs, where the key is the log type and the value is the log text content, and are then processed by the word2vec model into a (k+1) × n matrix, where k is the number of words in the text content of log l_i; that is, the log sequence is converted into a three-dimensional array of 6 × (k+1) × n.
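The 6 × (k+1) × n shape can be made concrete with a small sketch. Several things here are assumptions: k is read as the number of value words per log, shorter logs are padded with zero rows so that all matrices share one shape, and `toy_vector` is a deterministic placeholder used instead of a trained word2vec model.

```python
import hashlib

def toy_vector(word, n):
    """Deterministic placeholder for a word2vec embedding (not a trained model)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(n)]

def log_matrix(key, value, k, n):
    """Build a (k + 1) x n matrix: one row for the key, k rows for the value words.

    Shorter values are padded with zero rows and longer ones truncated; the
    padding rule is an assumption, since the patent does not state one.
    """
    words = value.split()[:k]
    rows = [toy_vector(key, n)] + [toy_vector(word, n) for word in words]
    rows += [[0.0] * n for _ in range(k - len(words))]
    return rows

pairs = [
    ("instance", "Terminating instance"),
    ("instance", "Instance destroyed successfully"),
    ("instance", "Deleting instance files"),
    ("instance", "Permission Denied"),
    ("instance", "Took 5s to destroy the instance"),
    ("instance", "Termination of instance complete"),
]
k, n = 6, 8  # k covers the longest value (6 words); n is an arbitrary toy dimension
tensor = [log_matrix(key, value, k, n) for key, value in pairs]
```

In practice the rows would come from a word2vec model trained on the screened dictionary, and n would be that model's vector size.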

S2, the matrices of the log sequence are input into the LSTM in turn for path anomaly detection. Specifically, the risk probability is initialized to 0 and converted into an n-dimensional zero vector, which is spliced with the matrix of log l₁ to form a (k+2) × n matrix; this matrix is input into the LSTM, which outputs the abnormal risk probability p₁. Since p₁ is less than the risk probability threshold σ, the analysis continues: p₁ × n is rounded and converted into an n-dimensional one-hot code, which is spliced with the matrix of log l₂ to form the (k+2) × n input matrix of the LSTM; the LSTM outputs the abnormal risk probability p₂ of log l₂, and p₂ is found to be less than σ. The process continues in the same way until the risk probability p₄ of log l₄ is found to be greater than the threshold σ. The preceding logs are then selected to form the log subsequence {l₂, l₃, l₄}, and the method proceeds to S3.

S3, the workflow model analyzes the subsequence {l₂, l₃, l₄} to accurately locate where the anomaly occurs. Specifically, the starting state is set to l₂. According to the transition probability matrix of the workflow model, the log content with the highest transition probability from l₂ should be "Deleting instance files", which is consistent with l₃, so the transition from l₂ to l₃ [formula not reproduced in the source text] is a normal transition. Then, taking l₃ as the starting state, the log with the highest transition probability should be "Deletion of files complete", whereas the transition probability from l₃ to l₄ is [formula not reproduced in the source text].

The anomaly therefore occurs at log l₄, and the corresponding system flow is located accurately.
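The reasoning of S3 can be illustrated with a simplified sketch. It is not the patent's formula, which is not legible in the source: the transition probabilities here are plain empirical frequencies of adjacent log pairs, a transition is treated as abnormal when its probability falls below a threshold, and the history of normal runs is hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next | current) from counts of adjacent log pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

def first_abnormal(sequence, probs, threshold):
    """Return the first log reached by a transition whose probability is below threshold."""
    for current, nxt in zip(sequence, sequence[1:]):
        if probs.get(current, {}).get(nxt, 0.0) < threshold:
            return nxt
    return None

# Hypothetical history of normal runs (not data from the patent).
normal_run = [
    "Instance destroyed successfully",
    "Deleting instance files",
    "Deletion of files complete",
]
probs = transition_probabilities([normal_run, normal_run, normal_run])

risk_sequence = [
    "Instance destroyed successfully",
    "Deleting instance files",
    "Permission Denied",
]
abnormal_point = first_abnormal(risk_sequence, probs, threshold=0.5)
```

Here the step from the second to the third log has never been observed in the normal history, so the third log, corresponding to l₄, is reported as the abnormal point.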

The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims

1. An NLP-based disaster recovery system abnormal point detection method is characterized by comprising the following steps:

(2) sequentially inputting the log text data at the current moment and the abnormal risk confidence of the previous moment into a long short-term memory network in time order to obtain the abnormal risk confidence at the current moment, taking the log text data at the current moment as risk log text data when the abnormal risk confidence at the current moment reaches an abnormal risk threshold, and constructing a risk log text data sequence from the log text data before the current moment and the risk log text data;

2. The NLP-based disaster recovery system anomaly point detection method according to claim 1, wherein extracting each log text data based on NLP semantic analysis to obtain a key-value pair structure comprises:

3. The NLP-based disaster recovery system abnormal point detection method according to claim 2, wherein a dictionary is constructed by performing word frequency filtering of each word on the event type of each log text data and the time content text data corresponding to the event type.

4. The NLP-based disaster recovery system abnormal point detection method according to claim 1, wherein sequentially inputting the current-time log text data and the previous-time abnormal risk confidence into the long short-term memory network in time order to obtain the current-time abnormal risk confidence comprises:

5. The NLP-based disaster recovery system abnormal point detection method according to claim 1 or 4, wherein the risk log text data is determined based on the abnormal risk confidence output by the long short-term memory network, the abnormal risk confidence is a floating point number between 0 and 1, a larger value represents a higher abnormal risk, an abnormal risk threshold is set, and when the abnormal risk confidence reaches the abnormal risk threshold, the risk log text data is located.

6. The NLP-based disaster recovery system abnormal point detection method according to claim 5, wherein the splicing of each log text data word vector matrix and the abnormal risk confidence of the previous time includes:

7. The NLP-based disaster recovery system abnormal point detection method according to claim 6, wherein the abnormal risk confidence is converted into the multidimensional one-hot code t as follows:

t = round(p × n)

8. The NLP-based disaster recovery system abnormal point detection method according to claim 1, wherein the transition probability λ_ij is:

[formula not reproduced in the source text]

wherein N_ij is the number of transitions from the log text data at the i-th time to the log text data at the j-th time within the unit time interval Δs, [symbol missing in source] is the number of log text data at the i-th time in the log text data set within the time T, and Δs is the unit time interval between two adjacent pieces of log text data.

9. The NLP-based disaster recovery system abnormal point detection method according to claim 8, wherein a Markov chain is used to determine the transition probability of each log text data in the risk log text data sequence to the log text data at the next time, wherein the Markov chain is a multidimensional asymmetric sparse matrix whose diagonal elements are:

λ_ii = -∑_{j≠i} λ_ij.