CN115828888A - Method for semantic analysis and structurization of various weblogs - Google Patents


Info

Publication number
CN115828888A
CN115828888A (application CN202211444888.9A)
Authority
CN
China
Prior art keywords
log
logs
word
convolution
weblogs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211444888.9A
Other languages
Chinese (zh)
Inventor
徐润
李瑶
樊一鸣
陈鑫
林小竺
周仲波
陈静怡
郑智浩
阙兴黔
邓德茂
张红月
胡兵轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zunyi Power Supplying Bureau of Guizhou Power Grid Co Ltd
Original Assignee
Zunyi Power Supplying Bureau of Guizhou Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zunyi Power Supplying Bureau of Guizhou Power Grid Co Ltd
Priority to CN202211444888.9A
Publication of CN115828888A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for performing semantic parsing and structuring on various weblogs, comprising the following steps: data preprocessing, in which raw log data are processed into the standard input required by the algorithm, including named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on; and log source detection, in which logs from different sources are analyzed, their log formats are summarized, regular expressions are extracted, a log format is constructed for the logs of each source, and the log source is detected according to that format. The method can perform semantic and structural analysis on file/folder operation anomalies, network anomalies, database anomalies, hardware anomalies, system anomalies, and other anomalies, and can quickly test logs of components from different sources; with 10000 logs selected per component for testing, the accuracy reaches 99.95%.

Description

Method for semantic analysis and structurization of various weblogs
Technical Field
The invention relates to the technical field of log analysis and network security, and particularly discloses a method for performing semantic parsing and structuring on various weblogs.
Background
With the continuous development of information technology, information systems and facilities provide great convenience for production and life across industries, and network security has become a key link in public security and even national security; real-time monitoring of network attacks and illegal behavior has become a necessary measure for protecting critical information infrastructure.
Semantic parsing refers to the task of converting a natural-language question into a logical form. A logical form is a structured semantic expression, usually an executable statement such as a lambda expression or an SQL query, which can be executed directly by a program to retrieve an answer from a database. Because of its tight coupling with knowledge bases, semantic parsing is often applied to automatic question answering over knowledge graphs or databases.
To construct a semantic parser in a new domain, researchers first need a large amount of training data, and usually start by writing template rules over (canonical question, logical form) tuples.
However, when the corpus is generated only from template rules, the trained naive semantic parser performs poorly on real (natural-language) questions and generalizes badly, because the data distributions of canonical sentences and natural sentences differ markedly. A method for performing semantic parsing and structuring on various weblogs is therefore proposed.
Disclosure of Invention
In view of the above drawbacks and deficiencies of the prior art, the present application provides a method for semantic parsing and structuring of various weblogs, comprising:
Step one: data preprocessing. Raw log data are processed into the standard input required by the algorithm; the preprocessing includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on.
Step two: log source detection. Logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format.
Step three: log data are acquired and parsed, and a VCNN (variable convolutional neural network) server classifies the processed logs based on log semantics and service completion strength.
Step four: the VCNN server uses wide convolution, and the convolution result is a two-dimensional feature map; the output vectors over each word-vector component are concatenated to obtain the final output feature map c_emw ∈ R^(n×k); the variable pooling layer applies max pooling and average pooling to the features extracted by the variable convolution layer, and the combined results form the input of the fully connected layer of the convolutional neural network.
Step five: the fully connected layer acts as the classifier of the whole convolutional neural network; through its convolution, five homogeneous and heterogeneous classification clusters are obtained according to service strength from failure to success.
Step six: an improved Bayesian classification based on the correlation among words is performed, and the classification results are correlated with the performance of the online services within each class to find the log source texts related to service anomalies; the five homogeneous and heterogeneous classification clusters output by the VCNN server are then classified in turn according to online service fault categories.
Step seven: through the above steps, the service-completion-strength level to which a log belongs is identified; if that level has a high service failure rate, the service performance associated with the log is identified; by continuously collecting the system logs of the online service and repeating the above steps, online service anomaly detection is completed.
Preferably, named entity recognition identifies entities that frequently appear in logs: timestamp, url, ip, file, path, number, and email.
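As an illustration of this named entity recognition step, the following Python sketch masks such entities with placeholder tokens. The regular expressions are illustrative assumptions, not patterns taken from the disclosure:

```python
import re

# Ordered entity patterns: earlier rules win when spans overlap (e.g. a url
# is masked before its path part could match). Illustrative, not from the patent.
PATTERNS = [
    ("TIMESTAMP", r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
    ("URL", r"https?://\S+"),
    ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+"),
    ("IP", r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    ("PATH", r"(?:/[\w.-]+){2,}"),
    ("NUMBER", r"\b\d+\b"),
]

def mask_entities(line: str) -> str:
    """Replace frequently occurring log entities with placeholder tokens."""
    for name, pattern in PATTERNS:
        line = re.sub(pattern, f"<{name}>", line)
    return line
```

Masking before word segmentation keeps variable values from bloating the vocabulary, which also supports the compression of repeated logs into templates.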
Preferably, the overall structure of the VCNN server comprises an input layer for the word-vector matrix, a variable convolution layer, a variable pooling layer, a fully connected layer, and an output layer.
Preferably, the variable convolution layer extracts features along both the sentence-length direction and the word-vector-component direction of the word-vector matrix.
Preferably, the input matrix of the variable convolution layer is s ∈ R^(n×k), where R denotes the real number space, n the length of the input sentence, and k the dimension of the word vector.
Preferably, in step one, word segmentation takes into account the common camelCase expressions in logs; in log vectorization, word vectors are trained on the general corpus, the system/middleware log corpus, and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
Preferably, in addition to one-dimensional convolution along the sentence-length direction, the VCNN server also convolves along the word-vector-component direction; the convolution kernel size is w × 1, where w is the width of the kernel along the sentence length. Each word-vector component has its own convolution kernel: wg ∈ R^(w×1) denotes the one-dimensional kernel applied to the g-th dimension of the input matrix. Along the sentence-length direction, si denotes the word vector of the i-th word, and si:g denotes the concatenation of the word vectors from the i-th word to the g-th word. The kernel wg convolves word sequences to generate features: the kernel wg for the g-th word-vector component is applied to all possible word sequences in the g-th component of the sentence to generate the corresponding feature map.
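The per-component convolution described above can be sketched in plain Python as follows; the matrix and kernel values are illustrative, and a real implementation would use an optimized numerical library:

```python
# Per-component one-dimensional convolution along the sentence-length
# direction: each of the k word-vector components g has its own width-w
# kernel wg. Pure-Python sketch with illustrative sizes.

def component_convolution(s, kernels):
    """s: n x k word-vector matrix (list of rows); kernels: k kernels of width w.
    Returns an (n - w + 1) x k feature map (valid positions only)."""
    n, k = len(s), len(s[0])
    w = len(kernels[0])
    feature_map = []
    for i in range(n - w + 1):            # slide along the sentence length
        row = []
        for g in range(k):                # kernel wg for the g-th component
            row.append(sum(kernels[g][j] * s[i + j][g] for j in range(w)))
        feature_map.append(row)
    return feature_map
```

For a 3-word sentence with 2-dimensional word vectors and kernels of width 2, the output has 2 positions and 2 components, matching the R^(n×k)-shaped feature maps described above.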
Beneficial effects: the method for semantic parsing and structuring of various weblogs can classify log anomaly types into six categories: file/folder operation anomalies, network anomalies, database anomalies, hardware anomalies, system anomalies, and other anomalies, and can quickly test logs of components from different sources. With 10000 logs selected per component for testing, the accuracy reaches 99.94%; for mature systems/middleware components, constructing rules for source detection achieves extremely high accuracy.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a block diagram of a system for semantic parsing and structuring of various weblogs in accordance with the present invention.
Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not restrict it. For convenience of description, only the portions related to the invention are shown in the drawings.
Regarding the drawings in the embodiments of the invention: the different types of cross-hatching in the figures do not follow the national standard and do not indicate the materials of the elements; they serve only to distinguish the cross-sectional views of the elements.
Referring to FIG. 1, a method for semantic parsing and structuring of various weblogs comprises the following steps:
Step one: data preprocessing. Raw log data are processed into the standard input required by the algorithm; the preprocessing includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on.
Step two: log source detection. Logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format.
Step three: log data are acquired and parsed, and a VCNN (variable convolutional neural network) server classifies the processed logs based on log semantics and service completion strength.
Step four: the VCNN server uses wide convolution, and the convolution result is a two-dimensional feature map; the output vectors over each word-vector component are concatenated to obtain the final output feature map c_emw ∈ R^(n×k); the variable pooling layer applies max pooling and average pooling to the features extracted by the variable convolution layer, and the combined results form the input of the fully connected layer of the convolutional neural network.
Step five: the fully connected layer acts as the classifier of the whole convolutional neural network; through its convolution, five homogeneous and heterogeneous classification clusters are obtained according to service strength from failure to success.
Step six: an improved Bayesian classification based on the correlation among words is performed, and the classification results are correlated with the performance of the online services within each class to find the log source texts related to service anomalies; the five homogeneous and heterogeneous classification clusters output by the VCNN server are then classified in turn according to online service fault categories.
Step seven: through the above steps, the service-completion-strength level to which a log belongs is identified; if that level has a high service failure rate, the service performance associated with the log is identified; by continuously collecting the system logs of the online service and repeating the above steps, online service anomaly detection is completed.
Named entity recognition identifies entities that frequently appear in logs: timestamp, url, ip, file, path, number, and email.
The overall structure of the VCNN server comprises an input layer for the word-vector matrix, a variable convolution layer, a variable pooling layer, a fully connected layer, and an output layer.
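The variable pooling stage of this structure, which combines max pooling and average pooling before the fully connected layer, can be sketched as follows (illustrative Python, not part of the disclosure):

```python
# Variable pooling: max pooling and average pooling are applied to each
# component's feature sequence, and the two results are concatenated to
# form the input of the fully connected layer.

def variable_pooling(feature_map):
    """feature_map: rows are sentence positions, columns are components.
    Returns [max per component] + [average per component]."""
    columns = list(zip(*feature_map))     # one feature sequence per component
    maxes = [max(col) for col in columns]
    averages = [sum(col) / len(col) for col in columns]
    return maxes + averages
```

Combining the two pooling operators keeps both the strongest activation and the overall level of each feature, at the cost of doubling the fully connected layer's input width.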
Wherein the variable convolution layer extracts features along both the sentence-length direction and the word-vector-component direction of the word-vector matrix.
Wherein the input matrix of the variable convolution layer is s ∈ R^(n×k), where R denotes the real number space, n the length of the input sentence, and k the dimension of the word vector.
Word segmentation takes into account the common camelCase expressions in logs; in log vectorization, word vectors are trained on the general corpus, the system/middleware log corpus, and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
In addition to one-dimensional convolution along the sentence-length direction, the VCNN server also convolves along the word-vector-component direction; the convolution kernel size is w × 1, where w is the width of the kernel along the sentence length. Each word-vector component has its own convolution kernel: wg ∈ R^(w×1) denotes the one-dimensional kernel applied to the g-th dimension of the input matrix. Along the sentence-length direction, si denotes the word vector of the i-th word, and si:g denotes the concatenation of the word vectors from the i-th word to the g-th word. The kernel wg convolves word sequences to generate features: the kernel wg for the g-th word-vector component is applied to all possible word sequences in the g-th component of the sentence to generate the corresponding feature map.
It should be noted that logs record detailed information about a software system at run time; development and operations personnel can analyze abnormal behaviors and errors of the system by monitoring the logs. Log anomaly detection can be divided into semantic anomalies (execution results), execution anomalies (execution log sequences), and performance anomalies (execution times).
A log records that, at a certain point in time, the system performed certain operations, together with the results of those operations.
Anomaly types can be broadly categorized, for example as network anomalies, database anomalies, hardware anomalies, I/O anomalies, operating system anomalies, and so on. Each type can be subdivided; taking hardware anomalies as an example, there may be CPU anomalies, insufficient disk space, disk damage, and the like.
The prerequisite for automatically judging log anomaly types is a uniform standard for describing them, together with fine-grained classes and characteristics within each category.
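As a baseline for the Bayesian classification of anomaly types mentioned in step six, a plain multinomial naive Bayes over log words can be sketched as follows. The training samples and labels are made up for illustration, and the patent's "improved" variant that models word correlations is not reproduced here:

```python
import math
from collections import Counter, defaultdict

# Plain multinomial naive Bayes over log words, with Laplace smoothing.
# Labels and samples below are hypothetical.

def train(samples):
    """samples: iterable of (words, label). Returns priors, per-label counts, vocab."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for words, label in samples:
        priors[label] += 1
        counts[label].update(words)
        vocab.update(words)
    return priors, counts, vocab

def classify(words, priors, counts, vocab):
    """Return the label with the highest smoothed log posterior."""
    total = sum(priors.values())
    best_label, best_logp = None, -math.inf
    for label in priors:
        n_label = sum(counts[label].values())
        logp = math.log(priors[label] / total)
        for word in words:  # Laplace smoothing over the shared vocabulary
            logp += math.log((counts[label][word] + 1) / (n_label + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Working in log space avoids underflow when a log line contains many words.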
Logs differ from natural-language text in several ways:
(1) A log is semi-structured text, usually comprising a log header and log description information; the header often contains fields such as a timestamp, a source, and a log level, while the description contains the current operation and its corresponding result and is rich in semantic information;
(2) Logs contain a large amount of repetition: the description information mixes constant text with variable values, and once the variable values are symbolized as parameters, a large number of logs can be compressed into a single log template;
(3) Logs contain many camelCase strings, related to the naming conventions of functions, classes, and so on in different programming languages;
(4) The vocabulary contained in the log data of a mature system/middleware is small.
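Point (2) above, compressing repetitive logs into templates by symbolizing variable values, can be sketched as follows (illustrative Python; the masking rules are assumptions):

```python
import re

# Once variable values are symbolised as parameters, many distinct raw
# lines collapse into one log template.

def to_template(line: str) -> str:
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<*>", line)  # ip addresses
    line = re.sub(r"\b\d+\b", "<*>", line)                      # other numbers
    return line

raw_lines = [f"Connection from 10.0.0.{i} closed after {i * 7} ms" for i in range(100)]
templates = {to_template(line) for line in raw_lines}           # 100 lines, 1 template
```

Here 100 distinct raw lines reduce to the single template "Connection from <*> closed after <*> ms", which is the compression effect described above.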
Vectorization of logs
A vectorized representation of logs requires consideration of the following issues:
(1) Before log vectorization, the log description field needs to be extracted and initialized;
(2) Variable values in logs are usually meaningless values or varying ip, url, path, and the like, and need to be replaced;
(3) The special writing conventions of logs require new segmentation rules;
(4) The more repetitive the logs and the more mature the system, the more consistent the format and descriptions become, so the effective log vocabulary is small; out-of-vocabulary (OOV) problems then arise downstream, and the log data must be combined with general data for vectorization training.
When performing semantic parsing on various weblogs, data preprocessing comes first: raw log data are processed into the standard input required by the algorithm, which includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on.
Named entity recognition identifies entities frequently appearing in logs, such as timestamp, url, ip, file, path, number, and email;
word segmentation takes into account the common camelCase expressions in logs;
in log vectorization, word vectors are trained on a general corpus (wikidata) plus the system/middleware log corpus and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
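The camelCase segmentation rule mentioned above can be sketched as follows (illustrative Python; the split heuristic is an assumption, not the patent's exact rule):

```python
import re

# camelCase handling in word segmentation: insert a split at every
# lowercase/digit to uppercase boundary, then lowercase the parts.

def split_camel(token: str) -> list[str]:
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token)
    return [part.lower() for part in spaced.split()]
```

Tokens like "getUserInfoFailed" then contribute ordinary English words to the vocabulary instead of one opaque identifier.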
Log source detection: logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format.
The rule-based log source detection method was tested on logs of components from different sources; with 10000 logs selected per component, the accuracy reaches 99.94%. For mature systems/middleware components, constructing rules for source detection can achieve extremely high accuracy.
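The rule-based source detection described above can be sketched as follows; the two format patterns (an nginx-style access-log line and a syslog-style line) are illustrative assumptions, not the rules used in the patent's test:

```python
import re

# Rule-based log source detection: one format regular expression per source;
# a line is attributed to the first source whose format matches.

SOURCE_FORMATS = {
    "nginx": re.compile(r'^\d{1,3}(?:\.\d{1,3}){3} - \S+ \[[^\]]+\] "'),
    "syslog": re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \S+ \S+:"),
}

def detect_source(line: str) -> str:
    """Return the first source whose format regex matches the line."""
    for source, pattern in SOURCE_FORMATS.items():
        if pattern.match(line):
            return source
    return "unknown"
```

Anchoring each pattern at the start of the line (`^`) keeps the rules cheap and makes false positives unlikely for mature, consistently formatted components.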
It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with one another in the absence of conflict. The present application is described in detail with reference to the embodiments and the attached drawings.
The above description is only a preferred embodiment of the application and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention referred to in the present application is not limited to embodiments with the specific combination of features described above; it also covers other embodiments formed by any combination of the above features or their equivalents without departing from the inventive concept, for example embodiments in which the above features are replaced with (but not limited to) features with similar functions disclosed in the present application.

Claims (7)

1. A method for semantic parsing and structuring of various weblogs, characterized in that it comprises the following steps:
step one: data preprocessing, in which raw log data are processed into the standard input required by the algorithm; the preprocessing includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on;
step two: log source detection, in which logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format;
step three: log data are acquired and parsed, and a VCNN (variable convolutional neural network) server classifies the processed logs based on log semantics and service completion strength;
step four: the VCNN server uses wide convolution, and the convolution result is a two-dimensional feature map; the output vectors over each word-vector component are concatenated to obtain the final output feature map c_emw ∈ R^(n×k); the variable pooling layer applies max pooling and average pooling to the features extracted by the variable convolution layer, and the combined results form the input of the fully connected layer of the convolutional neural network;
step five: the fully connected layer acts as the classifier of the whole convolutional neural network; through its convolution, five homogeneous and heterogeneous classification clusters are obtained according to service strength from failure to success;
step six: an improved Bayesian classification based on the correlation among words is performed, and the classification results are correlated with the performance of the online services within each class to find the log source texts related to service anomalies; the five homogeneous and heterogeneous classification clusters output by the VCNN server are then classified in turn according to online service fault categories;
step seven: through the above steps, the service-completion-strength level to which a log belongs is identified; if that level has a high service failure rate, the service performance associated with the log is identified; by continuously collecting the system logs of the online service and repeating the above steps, online service anomaly detection is completed.
2. The method for semantic parsing and structuring of various weblogs according to claim 1, characterized in that: named entity recognition identifies entities that frequently appear in logs: timestamp, url, ip, file, path, number, and email.
3. The method according to claim 2, characterized in that: the overall structure of the VCNN server comprises an input layer for the word-vector matrix, a variable convolution layer, a variable pooling layer, a fully connected layer, and an output layer.
4. The method according to claim 1, characterized in that: the variable convolution layer extracts features along both the sentence-length direction and the word-vector-component direction of the word-vector matrix.
5. The method according to claim 1, characterized in that: the input matrix of the variable convolution layer is s ∈ R^(n×k), where R denotes the real number space, n the length of the input sentence, and k the dimension of the word vector.
6. The method according to claim 1, characterized in that: in step one, word segmentation takes into account the common camelCase expressions in logs; in log vectorization, word vectors are trained on the general corpus, the system/middleware log corpus, and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
7. The method according to claim 4, characterized in that: in addition to one-dimensional convolution along the sentence-length direction, the VCNN server also convolves along the word-vector-component direction; the convolution kernel size is w × 1, where w is the width of the kernel along the sentence length; each word-vector component has its own convolution kernel: wg ∈ R^(w×1) denotes the one-dimensional kernel applied to the g-th dimension of the input matrix; along the sentence-length direction, si denotes the word vector of the i-th word, and si:g denotes the concatenation of the word vectors from the i-th word to the g-th word; the kernel wg convolves word sequences to generate features: the kernel wg for the g-th word-vector component is applied to all possible word sequences in the g-th component of the sentence to generate the corresponding feature map.
CN202211444888.9A 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs Pending CN115828888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211444888.9A CN115828888A (en) 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211444888.9A CN115828888A (en) 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs

Publications (1)

Publication Number Publication Date
CN115828888A 2023-03-21

Family

ID=85528952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211444888.9A Pending CN115828888A (en) 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs

Country Status (1)

Country Link
CN (1) CN115828888A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628451A (en) * 2023-05-31 2023-08-22 江苏华存电子科技有限公司 High-speed analysis method for information to be processed

Citations (5)

Publication number Priority date Publication date Assignee Title
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service abnormity detection method based on log semantic analysis
CN113297051A (en) * 2021-07-26 2021-08-24 云智慧(北京)科技有限公司 Log analysis processing method and device
CN113377607A (en) * 2021-05-13 2021-09-10 长沙理工大学 Method and device for detecting log abnormity based on Word2Vec and electronic equipment
US20210357282A1 (en) * 2020-05-13 2021-11-18 Mastercard International Incorporated Methods and systems for server failure prediction using server logs
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20210357282A1 (en) * 2020-05-13 2021-11-18 Mastercard International Incorporated Methods and systems for server failure prediction using server logs
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service abnormity detection method based on log semantic analysis
CN113377607A (en) * 2021-05-13 2021-09-10 长沙理工大学 Method and device for detecting log abnormity based on Word2Vec and electronic equipment
CN113297051A (en) * 2021-07-26 2021-08-24 云智慧(北京)科技有限公司 Log analysis processing method and device
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116628451A (en) * 2023-05-31 2023-08-22 江苏华存电子科技有限公司 High-speed analysis method for information to be processed
CN116628451B (en) * 2023-05-31 2023-11-14 江苏华存电子科技有限公司 High-speed analysis method for information to be processed

Similar Documents

Publication Publication Date Title
US20220405592A1 (en) Multi-feature log anomaly detection method and system based on log full semantics
CN109697162B (en) Software defect automatic detection method based on open source code library
US6047277A (en) Self-organizing neural network for plain text categorization
US20050246353A1 (en) Automated transformation of unstructured data
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN111967761A (en) Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN112000802A (en) Software defect positioning method based on similarity integration
Verma et al. A novel approach for text summarization using optimal combination of sentence scoring methods
CN112115326B (en) Multi-label classification and vulnerability detection method for Etheng intelligent contracts
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN115757695A (en) Log language model training method and system
CN115828888A (en) Method for semantic analysis and structurization of various weblogs
Sharma et al. Ideology detection in the Indian mass media
US11604923B2 (en) High volume message classification and distribution
Vu et al. Revising FUNSD dataset for key-value detection in document images
CN116881971A (en) Sensitive information leakage detection method, device and storage medium
CN116932753A (en) Log classification method, device, computer equipment, storage medium and program product
CN114783446B (en) Voice recognition method and system based on contrast predictive coding
Khritankov et al. Discovering text reuse in large collections of documents: A study of theses in history sciences
Merlo et al. Feed‐forward and recurrent neural networks for source code informal information analysis
CN115373982A (en) Test report analysis method, device, equipment and medium based on artificial intelligence
Hisham et al. An innovative approach for fake news detection using machine learning
CN111859896B (en) Formula document detection method and device, computer readable medium and electronic equipment
Sulaiman et al. South China Sea Conflicts Classification Using Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging
Wunderle et al. Pointer Networks: A Unified Approach to Extracting German Opinions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20230321
