CN115828888A - Method for semantic analysis and structurization of various weblogs - Google Patents


Info

Publication number
CN115828888A
CN115828888A (application CN202211444888.9A)
Authority
CN
China
Prior art keywords
log
logs
word
convolution
weblogs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211444888.9A
Other languages
Chinese (zh)
Inventor
徐润
李瑶
樊一鸣
陈鑫
林小竺
周仲波
陈静怡
郑智浩
阙兴黔
邓德茂
张红月
胡兵轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zunyi Power Supplying Bureau of Guizhou Power Grid Co Ltd
Original Assignee
Zunyi Power Supplying Bureau of Guizhou Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zunyi Power Supplying Bureau of Guizhou Power Grid Co Ltd
Priority to CN202211444888.9A
Publication of CN115828888A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for performing semantic parsing and structuring on various weblogs, comprising the following steps: data preprocessing, in which raw log data are processed into the standard input required by the algorithm, including named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on; and log source detection, in which logs from different sources are analyzed, their log formats are summarized, regular expressions are extracted, a log format is constructed for the logs of each source, and the log source is detected according to that format. The method can perform semantic and structural analysis on file/folder operation anomalies, network anomalies, database anomalies, hardware anomalies, system anomalies, and other anomalies, and can quickly test logs of components from different sources; with 10000 logs selected per component for testing, the accuracy reaches 99.95%.

Description

Method for semantic analysis and structurization of various weblogs
Technical Field
The invention relates to the technical field of log analysis and network security, and particularly discloses a method for performing semantic parsing and structuring on various weblogs.
Background
With the continuous development of information technology, information systems and facilities provide great convenience for production and life across industries, and network security has become a key link in public security and even national security; real-time monitoring of network attacks and illegal behavior has become a necessary measure for protecting critical information infrastructure.
Semantic parsing refers to the task of converting a natural-language question into a logical form. A logical form is a structured semantic expression, usually an executable statement such as a lambda expression or an SQL query, which can be executed directly by a program to retrieve an answer from a database. Because of its tight coupling with knowledge bases, semantic parsing is often applied to automatic question answering over knowledge graphs or databases.
To construct a semantic parser in a new domain, researchers first need a large amount of training data, and usually start by writing template rules over (canonical question, logical form) tuples.
However, when the corpus is generated only from template rules, the trained naive semantic parser performs poorly on real (natural-language) questions and generalizes badly, because the data distributions of canonical sentences and natural sentences differ markedly. A method for performing semantic parsing and structuring on various weblogs is therefore proposed.
Disclosure of Invention
In view of the above drawbacks and deficiencies of the prior art, the present application provides a method for semantic parsing and structuring of various weblogs, comprising:
Step one: data preprocessing. Raw log data are processed into the standard input required by the algorithm; the preprocessing includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on.
Step two: log source detection. Logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format.
Step three: log data are acquired and parsed, and a VCNN (variable convolutional neural network) server classifies the processed logs based on log semantics and service completion strength.
Step four: the VCNN server uses wide convolution, and the convolution result is a two-dimensional feature map; the output vectors over each word-vector component are concatenated to obtain the final output feature map c_emw ∈ R^(n×k); the variable pooling layer applies max pooling and average pooling to the features extracted by the variable convolution layer, and the combined results form the input of the fully connected layer of the convolutional neural network.
Step five: the fully connected layer acts as the classifier of the whole convolutional neural network; through its convolution, five homogeneous and heterogeneous classification clusters are obtained according to service strength from failure to success.
Step six: an improved Bayesian classification based on the correlation among words is performed, and the classification results are correlated with the performance of the online services within each class to find the log source texts related to service anomalies; the five homogeneous and heterogeneous classification clusters output by the VCNN server are then classified in turn according to online service fault categories.
Step seven: through the above steps, the service-completion-strength level to which a log belongs is identified; if that level has a high service failure rate, the service performance associated with the log is identified; by continuously collecting the system logs of the online service and repeating the above steps, online service anomaly detection is completed.
Preferably, named entity recognition identifies entities that frequently appear in logs: timestamp, url, ip, file, path, number, and email.
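As an illustration of this named entity recognition step, the following Python sketch masks such entities with placeholder tokens. The regular expressions are illustrative assumptions, not patterns taken from the disclosure:

```python
import re

# Ordered entity patterns: earlier rules win when spans overlap (e.g. a url
# is masked before its path part could match). Illustrative, not from the patent.
PATTERNS = [
    ("TIMESTAMP", r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
    ("URL", r"https?://\S+"),
    ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+"),
    ("IP", r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    ("PATH", r"(?:/[\w.-]+){2,}"),
    ("NUMBER", r"\b\d+\b"),
]

def mask_entities(line: str) -> str:
    """Replace frequently occurring log entities with placeholder tokens."""
    for name, pattern in PATTERNS:
        line = re.sub(pattern, f"<{name}>", line)
    return line
```

Masking before word segmentation keeps variable values from bloating the vocabulary, which also supports the compression of repeated logs into templates.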
Preferably, the overall structure of the VCNN server comprises an input layer for the word-vector matrix, a variable convolution layer, a variable pooling layer, a fully connected layer, and an output layer.
Preferably, the variable convolution layer extracts features along both the sentence-length direction and the word-vector-component direction of the word-vector matrix.
Preferably, the input matrix of the variable convolution layer is s ∈ R^(n×k), where R denotes the real number space, n the length of the input sentence, and k the dimension of the word vector.
Preferably, in step one, word segmentation takes into account the common camelCase expressions in logs; in log vectorization, word vectors are trained on the general corpus, the system/middleware log corpus, and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
Preferably, in addition to one-dimensional convolution along the sentence-length direction, the VCNN server also convolves along the word-vector-component direction; the convolution kernel size is w × 1, where w is the width of the kernel along the sentence length. Each word-vector component has its own convolution kernel: wg ∈ R^(w×1) denotes the one-dimensional kernel applied to the g-th dimension of the input matrix. Along the sentence-length direction, si denotes the word vector of the i-th word, and si:g denotes the concatenation of the word vectors from the i-th word to the g-th word. The kernel wg convolves word sequences to generate features: the kernel wg for the g-th word-vector component is applied to all possible word sequences in the g-th component of the sentence to generate the corresponding feature map.
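The per-component convolution described above can be sketched in plain Python as follows; the matrix and kernel values are illustrative, and a real implementation would use an optimized numerical library:

```python
# Per-component one-dimensional convolution along the sentence-length
# direction: each of the k word-vector components g has its own width-w
# kernel wg. Pure-Python sketch with illustrative sizes.

def component_convolution(s, kernels):
    """s: n x k word-vector matrix (list of rows); kernels: k kernels of width w.
    Returns an (n - w + 1) x k feature map (valid positions only)."""
    n, k = len(s), len(s[0])
    w = len(kernels[0])
    feature_map = []
    for i in range(n - w + 1):            # slide along the sentence length
        row = []
        for g in range(k):                # kernel wg for the g-th component
            row.append(sum(kernels[g][j] * s[i + j][g] for j in range(w)))
        feature_map.append(row)
    return feature_map
```

For a 3-word sentence with 2-dimensional word vectors and kernels of width 2, the output has 2 positions and 2 components, matching the R^(n×k)-shaped feature maps described above.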
Beneficial effects: the method for semantic parsing and structuring of various weblogs can classify log anomaly types into six categories: file/folder operation anomalies, network anomalies, database anomalies, hardware anomalies, system anomalies, and other anomalies, and can quickly test logs of components from different sources. With 10000 logs selected per component for testing, the accuracy reaches 99.94%; for mature systems/middleware components, constructing rules for source detection achieves extremely high accuracy.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a block diagram of a system for semantic parsing and structuring of various weblogs in accordance with the present invention.
Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not restrict it. For convenience of description, only the portions related to the invention are shown in the drawings.
Regarding the drawings in the embodiments of the invention: the different types of cross-hatching in the figures do not follow the national standard and do not indicate the materials of the elements; they serve only to distinguish the cross-sectional views of the elements.
Referring to FIG. 1, a method for semantic parsing and structuring of various weblogs comprises the following steps:
Step one: data preprocessing. Raw log data are processed into the standard input required by the algorithm; the preprocessing includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on.
Step two: log source detection. Logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format.
Step three: log data are acquired and parsed, and a VCNN (variable convolutional neural network) server classifies the processed logs based on log semantics and service completion strength.
Step four: the VCNN server uses wide convolution, and the convolution result is a two-dimensional feature map; the output vectors over each word-vector component are concatenated to obtain the final output feature map c_emw ∈ R^(n×k); the variable pooling layer applies max pooling and average pooling to the features extracted by the variable convolution layer, and the combined results form the input of the fully connected layer of the convolutional neural network.
Step five: the fully connected layer acts as the classifier of the whole convolutional neural network; through its convolution, five homogeneous and heterogeneous classification clusters are obtained according to service strength from failure to success.
Step six: an improved Bayesian classification based on the correlation among words is performed, and the classification results are correlated with the performance of the online services within each class to find the log source texts related to service anomalies; the five homogeneous and heterogeneous classification clusters output by the VCNN server are then classified in turn according to online service fault categories.
Step seven: through the above steps, the service-completion-strength level to which a log belongs is identified; if that level has a high service failure rate, the service performance associated with the log is identified; by continuously collecting the system logs of the online service and repeating the above steps, online service anomaly detection is completed.
Named entity recognition identifies entities that frequently appear in logs: timestamp, url, ip, file, path, number, and email.
The overall structure of the VCNN server comprises an input layer for the word-vector matrix, a variable convolution layer, a variable pooling layer, a fully connected layer, and an output layer.
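The variable pooling stage of this structure, which combines max pooling and average pooling before the fully connected layer, can be sketched as follows (illustrative Python, not part of the disclosure):

```python
# Variable pooling: max pooling and average pooling are applied to each
# component's feature sequence, and the two results are concatenated to
# form the input of the fully connected layer.

def variable_pooling(feature_map):
    """feature_map: rows are sentence positions, columns are components.
    Returns [max per component] + [average per component]."""
    columns = list(zip(*feature_map))     # one feature sequence per component
    maxes = [max(col) for col in columns]
    averages = [sum(col) / len(col) for col in columns]
    return maxes + averages
```

Combining the two pooling operators keeps both the strongest activation and the overall level of each feature, at the cost of doubling the fully connected layer's input width.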
Wherein the variable convolution layer extracts features along both the sentence-length direction and the word-vector-component direction of the word-vector matrix.
Wherein the input matrix of the variable convolution layer is s ∈ R^(n×k), where R denotes the real number space, n the length of the input sentence, and k the dimension of the word vector.
Word segmentation takes into account the common camelCase expressions in logs; in log vectorization, word vectors are trained on the general corpus, the system/middleware log corpus, and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
In addition to one-dimensional convolution along the sentence-length direction, the VCNN server also convolves along the word-vector-component direction; the convolution kernel size is w × 1, where w is the width of the kernel along the sentence length. Each word-vector component has its own convolution kernel: wg ∈ R^(w×1) denotes the one-dimensional kernel applied to the g-th dimension of the input matrix. Along the sentence-length direction, si denotes the word vector of the i-th word, and si:g denotes the concatenation of the word vectors from the i-th word to the g-th word. The kernel wg convolves word sequences to generate features: the kernel wg for the g-th word-vector component is applied to all possible word sequences in the g-th component of the sentence to generate the corresponding feature map.
It should be noted that logs record detailed information about a software system at run time; development and operations personnel can analyze abnormal behaviors and errors of the system by monitoring the logs. Log anomaly detection can be divided into semantic anomalies (execution results), execution anomalies (execution log sequences), and performance anomalies (execution times).
A log records that, at a certain point in time, the system performed certain operations, together with the results of those operations.
Anomaly types can be broadly categorized, for example as network anomalies, database anomalies, hardware anomalies, I/O anomalies, operating system anomalies, and so on. Each type can be subdivided; taking hardware anomalies as an example, there may be CPU anomalies, insufficient disk space, disk damage, and the like.
The prerequisite for automatically judging log anomaly types is a uniform standard for describing them, together with fine-grained classes and characteristics within each category.
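As a baseline for the Bayesian classification of anomaly types mentioned in step six, a plain multinomial naive Bayes over log words can be sketched as follows. The training samples and labels are made up for illustration, and the patent's "improved" variant that models word correlations is not reproduced here:

```python
import math
from collections import Counter, defaultdict

# Plain multinomial naive Bayes over log words, with Laplace smoothing.
# Labels and samples below are hypothetical.

def train(samples):
    """samples: iterable of (words, label). Returns priors, per-label counts, vocab."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for words, label in samples:
        priors[label] += 1
        counts[label].update(words)
        vocab.update(words)
    return priors, counts, vocab

def classify(words, priors, counts, vocab):
    """Return the label with the highest smoothed log posterior."""
    total = sum(priors.values())
    best_label, best_logp = None, -math.inf
    for label in priors:
        n_label = sum(counts[label].values())
        logp = math.log(priors[label] / total)
        for word in words:  # Laplace smoothing over the shared vocabulary
            logp += math.log((counts[label][word] + 1) / (n_label + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Working in log space avoids underflow when a log line contains many words.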
Logs differ from natural-language text in several ways:
(1) A log is semi-structured text, usually comprising a log header and log description information; the header often contains fields such as a timestamp, a source, and a log level, while the description contains the current operation and its corresponding result and is rich in semantic information;
(2) Logs contain a large amount of repetition: the description information mixes constant text with variable values, and once the variable values are symbolized as parameters, a large number of logs can be compressed into a single log template;
(3) Logs contain many camelCase strings, related to the naming conventions of functions, classes, and so on in different programming languages;
(4) The vocabulary contained in the log data of a mature system/middleware is small.
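Point (2) above, compressing repetitive logs into templates by symbolizing variable values, can be sketched as follows (illustrative Python; the masking rules are assumptions):

```python
import re

# Once variable values are symbolised as parameters, many distinct raw
# lines collapse into one log template.

def to_template(line: str) -> str:
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<*>", line)  # ip addresses
    line = re.sub(r"\b\d+\b", "<*>", line)                      # other numbers
    return line

raw_lines = [f"Connection from 10.0.0.{i} closed after {i * 7} ms" for i in range(100)]
templates = {to_template(line) for line in raw_lines}           # 100 lines, 1 template
```

Here 100 distinct raw lines reduce to the single template "Connection from <*> closed after <*> ms", which is the compression effect described above.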
Vectorization of logs
A vectorized representation of logs requires consideration of the following issues:
(1) Before log vectorization, the log description field needs to be extracted and initialized;
(2) Variable values in logs are usually meaningless values or varying ip, url, path, and the like, and need to be replaced;
(3) The special writing conventions of logs require new segmentation rules;
(4) The more repetitive the logs and the more mature the system, the more consistent the format and descriptions become, so the effective log vocabulary is small; out-of-vocabulary (OOV) problems then arise downstream, and the log data must be combined with general data for vectorization training.
When performing semantic parsing on various weblogs, data preprocessing comes first: raw log data are processed into the standard input required by the algorithm, which includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on.
Named entity recognition identifies entities frequently appearing in logs, such as timestamp, url, ip, file, path, number, and email;
word segmentation takes into account the common camelCase expressions in logs;
in log vectorization, word vectors are trained on a general corpus (wikidata) plus the system/middleware log corpus and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
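The camelCase segmentation rule mentioned above can be sketched as follows (illustrative Python; the split heuristic is an assumption, not the patent's exact rule):

```python
import re

# camelCase handling in word segmentation: insert a split at every
# lowercase/digit to uppercase boundary, then lowercase the parts.

def split_camel(token: str) -> list[str]:
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token)
    return [part.lower() for part in spaced.split()]
```

Tokens like "getUserInfoFailed" then contribute ordinary English words to the vocabulary instead of one opaque identifier.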
Log source detection: logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format.
The rule-based log source detection method was tested on logs of components from different sources; with 10000 logs selected per component, the accuracy reaches 99.94%. For mature systems/middleware components, constructing rules for source detection can achieve extremely high accuracy.
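The rule-based source detection described above can be sketched as follows; the two format patterns (an nginx-style access-log line and a syslog-style line) are illustrative assumptions, not the rules used in the patent's test:

```python
import re

# Rule-based log source detection: one format regular expression per source;
# a line is attributed to the first source whose format matches.

SOURCE_FORMATS = {
    "nginx": re.compile(r'^\d{1,3}(?:\.\d{1,3}){3} - \S+ \[[^\]]+\] "'),
    "syslog": re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \S+ \S+:"),
}

def detect_source(line: str) -> str:
    """Return the first source whose format regex matches the line."""
    for source, pattern in SOURCE_FORMATS.items():
        if pattern.match(line):
            return source
    return "unknown"
```

Anchoring each pattern at the start of the line (`^`) keeps the rules cheap and makes false positives unlikely for mature, consistently formatted components.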
It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with one another in the absence of conflict. The present application is described in detail with reference to the embodiments and the attached drawings.
The above description is only a preferred embodiment of the application and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention referred to in the present application is not limited to embodiments with the specific combination of features described above; it also covers other embodiments formed by any combination of the above features or their equivalents without departing from the inventive concept, for example embodiments in which the above features are replaced with (but not limited to) features with similar functions disclosed in the present application.

Claims (7)

1. A method for semantic parsing and structuring of various weblogs, characterized in that it comprises the following steps:
step one: data preprocessing, in which raw log data are processed into the standard input required by the algorithm; the preprocessing includes named entity recognition, word segmentation, filtering, case conversion, vectorization, and so on;
step two: log source detection, in which logs from different sources are analyzed, their log formats are summarized, and regular expressions are extracted; a log format is constructed for the logs of each source, and the log source is detected according to that format;
step three: log data are acquired and parsed, and a VCNN (variable convolutional neural network) server classifies the processed logs based on log semantics and service completion strength;
step four: the VCNN server uses wide convolution, and the convolution result is a two-dimensional feature map; the output vectors over each word-vector component are concatenated to obtain the final output feature map c_emw ∈ R^(n×k); the variable pooling layer applies max pooling and average pooling to the features extracted by the variable convolution layer, and the combined results form the input of the fully connected layer of the convolutional neural network;
step five: the fully connected layer acts as the classifier of the whole convolutional neural network; through its convolution, five homogeneous and heterogeneous classification clusters are obtained according to service strength from failure to success;
step six: an improved Bayesian classification based on the correlation among words is performed, and the classification results are correlated with the performance of the online services within each class to find the log source texts related to service anomalies; the five homogeneous and heterogeneous classification clusters output by the VCNN server are then classified in turn according to online service fault categories;
step seven: through the above steps, the service-completion-strength level to which a log belongs is identified; if that level has a high service failure rate, the service performance associated with the log is identified; by continuously collecting the system logs of the online service and repeating the above steps, online service anomaly detection is completed.
2. The method for semantic parsing and structuring of various weblogs according to claim 1, characterized in that: named entity recognition identifies entities that frequently appear in logs: timestamp, url, ip, file, path, number, and email.
3. The method according to claim 2, characterized in that: the overall structure of the VCNN server comprises an input layer for the word-vector matrix, a variable convolution layer, a variable pooling layer, a fully connected layer, and an output layer.
4. The method according to claim 1, characterized in that: the variable convolution layer extracts features along both the sentence-length direction and the word-vector-component direction of the word-vector matrix.
5. The method according to claim 1, characterized in that: the input matrix of the variable convolution layer is s ∈ R^(n×k), where R denotes the real number space, n the length of the input sentence, and k the dimension of the word vector.
6. The method according to claim 1, characterized in that: in step one, word segmentation takes into account the common camelCase expressions in logs; in log vectorization, word vectors are trained on the general corpus, the system/middleware log corpus, and the business log corpus; the final word-vector dimension is 200 and the vocabulary size is 583511.
7. The method according to claim 4, characterized in that: in addition to one-dimensional convolution along the sentence-length direction, the VCNN server also convolves along the word-vector-component direction; the convolution kernel size is w × 1, where w is the width of the kernel along the sentence length; each word-vector component has its own convolution kernel: wg ∈ R^(w×1) denotes the one-dimensional kernel applied to the g-th dimension of the input matrix; along the sentence-length direction, si denotes the word vector of the i-th word, and si:g denotes the concatenation of the word vectors from the i-th word to the g-th word; the kernel wg convolves word sequences to generate features: the kernel wg for the g-th word-vector component is applied to all possible word sequences in the g-th component of the sentence to generate the corresponding feature map.
CN202211444888.9A 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs Pending CN115828888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211444888.9A CN115828888A (en) 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211444888.9A CN115828888A (en) 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs

Publications (1)

Publication Number Publication Date
CN115828888A 2023-03-21

Family

ID=85528952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211444888.9A Pending CN115828888A (en) 2022-11-18 2022-11-18 Method for semantic analysis and structurization of various weblogs

Country Status (1)

Country Link
CN (1) CN115828888A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628451A (en) * 2023-05-31 2023-08-22 江苏华存电子科技有限公司 High-speed analysis method for information to be processed

Citations (5)

Publication number Priority date Publication date Assignee Title
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service abnormity detection method based on log semantic analysis
CN113297051A (en) * 2021-07-26 2021-08-24 云智慧(北京)科技有限公司 Log analysis processing method and device
CN113377607A (en) * 2021-05-13 2021-09-10 长沙理工大学 Method and device for detecting log abnormity based on Word2Vec and electronic equipment
US20210357282A1 (en) * 2020-05-13 2021-11-18 Mastercard International Incorporated Methods and systems for server failure prediction using server logs
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20210357282A1 (en) * 2020-05-13 2021-11-18 Mastercard International Incorporated Methods and systems for server failure prediction using server logs
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service abnormity detection method based on log semantic analysis
CN113377607A (en) * 2021-05-13 2021-09-10 长沙理工大学 Method and device for detecting log abnormity based on Word2Vec and electronic equipment
CN113297051A (en) * 2021-07-26 2021-08-24 云智慧(北京)科技有限公司 Log analysis processing method and device
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116628451A (en) * 2023-05-31 2023-08-22 江苏华存电子科技有限公司 High-speed analysis method for information to be processed
CN116628451B (en) * 2023-05-31 2023-11-14 江苏华存电子科技有限公司 High-speed analysis method for information to be processed

Similar Documents

Publication Publication Date Title
US20220405592A1 (en) Multi-feature log anomaly detection method and system based on log full semantics
CN109697162B (en) Software defect automatic detection method based on open source code library
US6047277A (en) Self-organizing neural network for plain text categorization
US20050246353A1 (en) Automated transformation of unstructured data
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN111967761A (en) Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN112000802A (en) Software defect positioning method based on similarity integration
Verma et al. A novel approach for text summarization using optimal combination of sentence scoring methods
CN112115326B (en) Multi-label classification and vulnerability detection method for Etheng intelligent contracts
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN115757695A (en) Log language model training method and system
CN115828888A (en) Method for semantic analysis and structurization of various weblogs
Sharma et al. Ideology detection in the Indian mass media
US11604923B2 (en) High volume message classification and distribution
Vu et al. Revising FUNSD dataset for key-value detection in document images
CN116881971A (en) Sensitive information leakage detection method, device and storage medium
CN116932753A (en) Log classification method, device, computer equipment, storage medium and program product
CN114783446B (en) Voice recognition method and system based on contrast predictive coding
Khritankov et al. Discovering text reuse in large collections of documents: A study of theses in history sciences
Merlo et al. Feed‐forward and recurrent neural networks for source code informal information analysis
CN115373982A (en) Test report analysis method, device, equipment and medium based on artificial intelligence
Hisham et al. An innovative approach for fake news detection using machine learning
CN111859896B (en) Formula document detection method and device, computer readable medium and electronic equipment
Sulaiman et al. South China Sea Conflicts Classification Using Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging
Wunderle et al. Pointer Networks: A Unified Approach to Extracting German Opinions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20230321
