CN114997158A - Log key error automatic identification method and device based on Openstack - Google Patents

Log key error automatic identification method and device based on Openstack

Info

Publication number
CN114997158A
Authority
CN
China
Prior art keywords
data
log
data set
parameter
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210711115.6A
Other languages
Chinese (zh)
Inventor
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Digital Intelligence Technology Co Ltd filed Critical China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202210711115.6A priority Critical patent/CN114997158A/en
Publication of CN114997158A publication Critical patent/CN114997158A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an Openstack-based method and device for automatically identifying key log errors, and belongs to the technical field of data mining. The method comprises the following steps: processing and translating historical log data, converting it into digital data with digital characteristics, and performing a zero-filling operation on the digital data to obtain a training data set; training a recognition model with the training data set to complete the training of the model; configuring the parameters of the recognition model, inputting log data into the recognition model for classification, and outputting the error-reporting keywords of the logs to form a classified keyword data set; and performing frequency analysis on the classified keyword data to obtain high-frequency keywords and re-saving the logs corresponding to the high-frequency keywords into a key error reporting log collection text. The method automatically stores the keywords found in log data into a keyword lexicon and reduces manual maintenance cost.

Description

Log key error automatic identification method and device based on Openstack
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to an Openstack-based log key error automatic identification method and device.
Background
How to carry out operation and maintenance efficiently is a long-standing problem. Some technicians can solve problems by searching directly through huge logs or even by reading the source code, but such technicians are too few, and when clusters of tens or hundreds of thousands of nodes must be operated and maintained there are simply not enough hands. To cope with this, the places where errors are actually reported must be extracted from the huge logs so that they can be addressed.
Openstack is a cloud platform management project with very wide application in the field of cloud computing; it is currently one of the mainstream cloud computing technologies and is suitable for various application scenarios such as business and markets. Openstack runs on a container basis and is managed uniformly by Ansible containers. Nova is the compute component and is internally divided into Nova Compute, Nova Scheduler, Nova API and so on; together with Keystone, Neutron, Cinder and the other components, each component function runs as a container and therefore has its own independent log. If any one of Keystone, Nova, Neutron or Cinder cannot work normally, the whole Openstack system cannot work: usually one container reports an error while running, several containers then fail to run, other containers report errors in turn, and these errors interleave with each other, so that a large amount of error information accumulates in the logs.
In the log error localization scenario, the conventional approach is to update the keyword lexicon manually. In such a mode, however, each person's ideas differ and are difficult to unify, they are even harder to unify across version upgrades, and the maintenance cost is very high: technicians have to perform precise, targeted maintenance at regular intervals, which greatly increases the required labor cost.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and to provide an Openstack-based method and device for automatically identifying key log errors, which can automatically store the keywords in log data into a keyword lexicon and reduce manual maintenance cost.
According to one aspect, the invention provides an Openstack-based log key error automatic identification method, which comprises the following steps:
s1: collecting historical log data, processing and translating the historical log data, converting it into digital data with digital characteristics, and performing a zero-filling operation on the digital data to obtain a training data set;
s2: training a recognition model with the training data set to complete the training of the model;
s3: configuring the parameters of the recognition model, inputting log data into the recognition model for classification, and outputting the error-reporting keywords of the logs to form a classified keyword data set;
s4: performing frequency analysis on the classified keyword data to obtain high-frequency keywords, re-saving the logs corresponding to the high-frequency keywords into a key error reporting log collection text, and, when the logs are analyzed, displaying the key errors of the logs simply by opening the key error reporting log collection text.
Preferably, collecting the historical log data and processing the historical log data comprises:
selecting a preset number of physical servers, obtaining the log of each physical server and saving it in text form to obtain the preset number of log texts; combining these log texts in time order to form a complete log data set, the content of which consists of log time, log content and error-reporting content; and cutting off the log time, retaining the log content and the error-reporting content, and performing word-vector segmentation to complete the division into keywords.
Preferably, translating the historical log data and converting it into digital data with digital characteristics comprises:
translating the log data set to obtain an English data set, converting all lower-case letters in the English data set to upper case, erasing interference data that is not in letter form so as to form a data set containing only upper-case English letters and spaces, and turning each piece of data into a list to obtain a data set with a two-dimensional tensor structure; and replacing the 26 letters with the numbers 1-26 and assigning the space character an effective feature value of 50, thereby obtaining the digital data with digital characteristics.
Preferably, the recognition model is an LSTM model, and configuring the parameters of the recognition model comprises:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Preferably, performing frequency analysis on the classified keyword data to obtain the high-frequency keywords comprises:
according to the number of LSTM classes, performing a weighted calculation on the components by classification category to obtain a first parameter value; counting the frequency of each piece of classified data to obtain a second parameter value; performing a weighted operation on the first and second parameter values to obtain a weight value; and taking the keywords whose weight value meets a preset threshold condition as the high-frequency keywords.
According to another aspect of the present invention, the present invention further provides an Openstack-based log key error automatic identification device, where the device includes:
the processing module is used for acquiring historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set;
the training module is used for training the recognition model by utilizing the training data set to finish the training of the model;
the recognition module is used for configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
and the analysis module is used for carrying out frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into the key error log collection text again, and realizing the display of the key errors of the logs by opening the key error log collection text when the logs are analyzed.
Preferably, the processing module collects historical log data, and processing the historical log data includes:
selecting a preset number of physical servers, respectively obtaining logs of each physical server, and storing the logs in a text form to obtain a preset number of log texts; combining the preset number of log texts according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log content and error reporting content, and partitioning by adopting word vectors to finish the partition of keywords.
Preferably, the translating the historical log data into the digital data with the digital characteristics by the processing module comprises:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
Preferably, the recognition model is an LSTM model, and the configuring parameters of the recognition model by the recognition module includes:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Preferably, the analyzing module performs frequency analysis on the classified keyword data to obtain high-frequency keywords, and the obtaining of the high-frequency keywords includes:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
Beneficial effects: the method adopts an Openstack-based automatic key log error identification technique, acquiring the log data sets of multiple physical machines and applying word vector segmentation; the different word vectors are set up as LSTM data sets; the LSTM performs multi-class classification of the word vectors; the medium- and high-frequency words of the same type are stored into the lexicon as keywords; the keyword lexicon is thus obtained automatically, the whole process requires no manual operation, the keywords in the log data are stored into the keyword lexicon automatically, and manual maintenance cost is reduced.
The features and advantages of the present invention will become apparent by reference to the following drawings and detailed description of specific embodiments of the invention.
Drawings
FIG. 1 is a flow chart of a method for automatically identifying key log errors based on Openstack;
FIG. 2 is a schematic diagram of an Openstack-based log key error automatic identification device.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
FIG. 1 is a flowchart of an Openstack-based log key error automatic identification method. As shown in fig. 1, this embodiment provides an Openstack-based log key error automatic identification method, where the method includes the following steps:
s1: collecting historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set.
Preferably, the collecting historical log data, and the processing the historical log data includes:
selecting a preset number of physical servers, and respectively obtaining logs of each physical server to store in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log content and error reporting content, and partitioning by adopting word vectors to finish the partition of keywords.
Specifically, a log collection module is added on each physical server, a new folder is created, and the log data is saved as txt text. The logs are checked against the UUID of the host. The logs of the container components Nova, Neutron, Keystone, Cinder and so on are obtained separately, and then all the logs of the whole physical machine are merged and saved in txt form.
Only a one-month range of logs is selected from each machine, with the time range chosen so that it covers typical error types; this makes the logs more meaningful, allows the data to be processed accordingly, reduces the total data volume and improves data quality.
The preset number may be 50-100: complete log data are obtained from each of 50-100 machines with representative, typical errors and stored as txt, and then all the log data are merged to complete the log collection of the cluster. All the txt data in the 50-100 log files are combined from top to bottom in time order to form a complete log data set M, which covers the full data volume of the 50-100 txt logs.
The length of M is generally in the millions of entries, and the content of M consists of log time + log content + error-reporting content. Next, a Python re regular expression is used to cut off the log time, retaining the log content and the error-reporting content.
Word vector segmentation mainly divides a sentence according to its content and meaning. The mainstream segmentation method for English is to split on spaces, because in an English sentence the words are separated by spaces; segmentation in this way first separates out every word, after which the content with different meanings, namely subject, predicate and object, can be divided systematically according to meaning.
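For illustration only, the collection, merging, time-cutting and space-based segmentation described above can be sketched as follows in Python; the directory layout, timestamp format and regular expression shown here are assumptions and are not fixed by this description:

    import glob
    import re

    # Merge the per-server txt logs top-to-bottom (assumed already in time order)
    # into the complete data set M.
    log_lines = []
    for path in sorted(glob.glob("logs/*.txt")):        # one txt file per physical server
        with open(path, encoding="utf-8") as f:
            log_lines.extend(f.readlines())

    # Cut off the leading log time with a Python re regular expression
    # (the timestamp pattern is an assumed example), keeping only
    # log content and error-reporting content.
    timestamp = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}[.,]?\d*\s*")
    dataset_m = [timestamp.sub("", ln).strip() for ln in log_lines if ln.strip()]

    # Word vector segmentation for English: split each entry on spaces.
    segmented = [entry.split() for entry in dataset_m]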
The recognition model then performs multiple recognitions, separating subjects, predicates and objects and classifying the different objects again, which completes the division into keywords. Before the recognition model is used for classification, the data are labeled and judged, and the model is then trained to obtain its training parameters; with these parameters as reference, the model can directly produce accurate classification results for unknown data.
Preferably, the translating the historical log data into the digital data with digital characteristics includes:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
Specifically, the effective features for the text data are:
A-->0,B-->1,C-->2,D-->3,E-->4,F-->5,G-->6,H-->7,I-->8,J-->9,K-->10,L-->11,M-->12,N-->13,O-->14,P-->15,Q-->16,R-->17,S-->18,T-->19,U-->20,V-->21,W-->22,X-->23,Y-->24,Z-->25;
‘_’-->51,‘?’-->75,‘.’-->76,‘-’-->77,‘,’-->78,‘!’-->79,‘Ⅱ’-->80,‘(’-->81,‘)’-->82,‘:’-->83,‘/’-->84,‘+’-->85,‘%’-->86。
To sum up: the 26 letter features A, B, C, ... have their valid segments replaced with 1, 2, 3, ..., 26; the space separating two words is replaced with 50; and where a space is itself used as a feature it is replaced with an underscore "_" whose valid segment is represented by 51. These values are distinct from the letters and use widely spaced numbers, which gives each number in the whole one-dimensional vector a stronger individual feature; such a simple integer feature expresses less information than a float feature like 1.222345678765432 but does so more effectively.
Each segment of data is processed according to the word vector segmentation above to obtain a processed data set H from the data set M. All lower-case letters are then converted to upper case, and interfering characters such as !@#...&() are erased. Because spaces separate English words, they are kept in order to preserve the digital features: whenever a space is encountered it is read as data and its effective feature value is set to 50. After this final refinement the data retain only the upper-case forms of the 26 letters and the digital feature formed by the spaces. Each of the millions of entries in the H set is turned into a list, yielding data with a two-dimensional tensor structure. After each entry has been listed, the 26 letters are replaced by the numbers 1-26, giving millions of one-dimensional tensors of digital data of different lengths that carry all the feature properties of the original English data. The existing data are thus converted from H to P; the P data set is entirely new digital data whose interior consists of millions of list entries converted from the original data.
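A minimal sketch of this character-to-number conversion is given below; it follows the mapping table listed earlier (A through Z, the space as 50, the underscore as 51 and the listed punctuation marks), and the handling of any character outside that table is an assumption:

    # Map upper-case letters, the space, the underscore and a few punctuation
    # marks to small integers, following the mapping table above.
    CHAR_MAP = {chr(ord("A") + i): i for i in range(26)}        # A->0 ... Z->25
    CHAR_MAP.update({" ": 50, "_": 51, "?": 75, ".": 76, "-": 77, ",": 78,
                     "!": 79, "(": 81, ")": 82, ":": 83, "/": 84, "+": 85, "%": 86})

    def encode_line(text):
        """Upper-case the text, drop unmapped characters, return a 1-D list of integers."""
        return [CHAR_MAP[ch] for ch in text.upper() if ch in CHAR_MAP]

    sample = "ERROR nova-compute: instance failed to spawn"     # assumed example log line
    print(encode_line(sample))                                   # one row of the P data set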
The P data set carries the digital characteristics of the data, but each entry has a different length, which makes training difficult: an LSTM model can only really recognize a two-dimensional tensor in which every data list has equal length. The data therefore need secondary deep processing. The element 0 can represent the absence of data, which makes it well suited to the completion operation: in the data set P, the length a of the longest list and the shortest length b are determined, and, with reference to a and b, the remaining lists are completed with supplementary elements all set to 0. This yields a brand-new training data set T whose length is in the millions of log entries and in which every internal entry has equal length.
The original data set M is translated to obtain H, H is converted into P after the digitization processing, and a zero-padding operation on P produces the data set T that the model can actually use, completing the series of conversions. The data are then divided into two parts according to the label type: T_Pos, the positive sample data set, and T_Neg, the negative sample data set. The two data sets are read into the model in sequence and the later training stage begins.
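The zero-padding and the split into T_Pos and T_Neg can be sketched as below; the label source and the toy stand-in data are assumptions, since the description only names the two sample sets:

    import torch

    def pad_to_length(seqs, pad_value=0):
        """Zero-pad every encoded log line so all rows of the 2-D tensor are equally long."""
        max_len = max(len(s) for s in seqs)
        return torch.tensor([s + [pad_value] * (max_len - len(s)) for s in seqs],
                            dtype=torch.long)

    # Tiny stand-in for the integer-encoded data set P from the previous step.
    dataset_p = [[5, 18, 18, 15, 18], [14, 15, 22, 1, 50, 3, 15, 13, 16, 21, 20, 5]]
    dataset_t = pad_to_length(dataset_p)                        # training data set T

    labels = torch.tensor([1, 0])                               # assumed per-entry labels
    t_pos = dataset_t[labels == 1]                              # positive sample data set
    t_neg = dataset_t[labels == 0]                              # negative sample data set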
S2: and training the recognition model by utilizing the training data set to finish the training of the model.
In this step, the training data obtained in step S1 is input to the recognition model, and the training of the recognition model is completed.
S3: and configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword data set.
Preferably, the recognition model is an LSTM model, and configuring the parameters of the recognition model comprises:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Specifically, in this step the LSTM is used as the model for deep learning training; the data T_Pos and T_Neg obtained in the previous step are used as the input of the model, the LSTM multi-classification technique is realized, and the different types of log errors are counted and classified.
First, the data are input into the LSTM model and the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1. The word vector conversion method torch.nn.Embedding is then used with its internal length parameter configured to 180; the result produced by the embedding in the previous step is fed into the LSTM model, and the output is processed directly by a fully connected layer.
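As a sketch only, the configuration above could be realized in PyTorch as follows; the class structure, the vocabulary size of 87, the reading of the "padding" parameter as padding_idx = 0 (matching the zero-filling step) and the abbreviated training loop are assumptions, while the numeric parameter values are the ones named in this description:

    import torch
    import torch.nn as nn

    class LogKeywordLSTM(nn.Module):
        """LSTM classifier using the parameter values named in the description."""
        def __init__(self, vocab_size=87, embedding_dim=27, hidden_dim=17,
                     num_layers=3, output_size=2, seq_len=180):
            super().__init__()
            # torch.nn.Embedding; index 0 is reserved for the zero padding added earlier.
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
            self.fc = nn.Linear(hidden_dim, output_size)   # fully connected output layer
            self.seq_len = seq_len                         # internal length parameter 180

        def forward(self, x):
            x = x[:, :self.seq_len]
            out, _ = self.lstm(self.embedding(x))
            return self.fc(out[:, -1, :])                  # scores for the output_size classes

    model = LogKeywordLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Toy batch standing in for the padded T_Pos / T_Neg data.
    batch = torch.randint(1, 87, (4, 180))
    labels = torch.tensor([1, 1, 0, 0])
    for _ in range(3):                                     # abbreviated training loop
        optimizer.zero_grad()
        loss = loss_fn(model(batch), labels)
        loss.backward()
        optimizer.step()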
The training of the model is thus completed. The model and its training parameter file are placed into a Nova API: a new Nova API is built as the carrier, and its body part takes the whole log data set M of a given machine. The API divides the data in M according to semantics, extracts features, converts them to digital features matching the LSTM model data, and finally classifies them with the model; the API then outputs the error-reporting keywords of that machine's log.
S4: and performing frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into a key error reporting log collection text again, and displaying the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
Preferably, the frequency analysis of the classified keyword data to obtain the high-frequency keyword includes:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; and counting the frequency of each classified data to obtain a second parameter value, carrying out weighted operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
Specifically, a keyword data set can be obtained quickly with the trained LSTM model. Frequency analysis is then performed on the data of each classification category according to the composition of the M set: a weighted judgment of the components is made from the frequency and the number of LSTM classes; a second parameter is designed by counting the frequency of each piece of data to be classified; the two parameters are combined in a weighted operation; and the top 30% by weight are taken as the data of the keyword data set, a proportion that can be adjusted appropriately according to actual conditions.
Keywords are selected with the LSTM method and stored in a database. After the keywords are read back from the database, the frequency of each keyword in the log data is divided by the total number of words to obtain a percentage, which is used as its weight; each category thus has a set of keywords sorted by weight ratio. Because the base size of the data set is huge and only the positions where high-frequency words appear are of interest, the keywords of each category are divided a second time and only the top 30% are kept. This number is proportional to the size of the data set, and the parameters are adjusted manually as required to ensure that large numbers of worthless classified keywords do not appear in the keyword data set.
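A minimal sketch of the weighting and top-30% selection described above is given below; the exact combination of the two parameters is an assumption, since the description only states that a per-category weighting and a frequency ratio are combined:

    from collections import Counter

    def select_high_frequency_keywords(classified, total_words, keep_ratio=0.30):
        """For each LSTM class, weight keywords by their frequency ratio and keep the top 30%."""
        result = {}
        for category, keywords in classified.items():
            counts = Counter(keywords)
            # frequency of the keyword / total word count, used as the weight (a percentage)
            ranked = sorted(counts, key=lambda k: counts[k] / total_words, reverse=True)
            keep = max(1, int(len(ranked) * keep_ratio))
            result[category] = ranked[:keep]
        return result

    # Toy example: two classes of keywords produced by the classifier.
    classified = {"nova": ["TIMEOUT", "TIMEOUT", "NO VALID HOST", "TRACE"],
                  "neutron": ["PORT BINDING FAILED", "TIMEOUT"]}
    print(select_high_frequency_keywords(classified, total_words=1000))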
From the preceding operations a keyword data set is obtained that retains only the high-frequency keywords of each important category, and it is stored in a database. The log data are then searched according to the keyword judgment, and the log statements corresponding to the keywords are saved again into a file N, a brand-new key error reporting log collection text; when the logs are analyzed later, the key log errors can be displayed simply by opening N.
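The final re-saving step can be sketched as follows; the file name key_error_logs_N.txt and the simple substring match stand in for the keyword judgment and are assumptions:

    # Re-save every log line that contains one of the high-frequency keywords
    # into the key error reporting log collection text N.
    high_freq = {"TIMEOUT", "NO VALID HOST"}                    # from the previous sketch
    dataset_m = ["ERROR nova-compute: TIMEOUT waiting for port",
                 "INFO keystone: token issued"]                 # stand-in merged log lines
    with open("key_error_logs_N.txt", "w", encoding="utf-8") as out:
        for line in dataset_m:
            if any(kw in line.upper() for kw in high_freq):
                out.write(line + "\n")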
In this embodiment, the Openstack-based automatic key log error identification technique acquires the log data sets of multiple physical machines and applies word vector segmentation; the different word vectors are set up as LSTM data sets; the LSTM performs multi-class classification of the word vectors; the medium- and high-frequency words of the same type are stored into the lexicon as keywords; the keyword lexicon is thus obtained automatically, the whole process requires no manual operation, the keywords in the log data can be stored into the keyword lexicon automatically, and manual maintenance cost is reduced.
Example 2
Fig. 2 is a schematic diagram of an Openstack-based log key error automatic identification device. As shown in fig. 2, the present invention further provides an Openstack-based log key error automatic identification device, where the device includes:
the processing module 201 is configured to collect historical log data, process and translate the historical log data, convert the historical log data into digital data with digital features, and perform zero padding operation on the digital data to obtain a training data set;
a training module 202, configured to train a recognition model by using the training data set, so as to complete training of the model;
the recognition module 203 is used for configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
the analysis module 204 is configured to perform frequency analysis on the classified keyword data to obtain high-frequency keywords, re-save the logs corresponding to the high-frequency keywords into a key error reporting log collection text, and display the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
Preferably, the processing module 201 collects historical log data, and processing the historical log data includes:
selecting a preset number of physical servers, and respectively obtaining logs of each physical server to store in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log content and error reporting content, and partitioning by adopting word vectors to finish the partition of keywords.
Preferably, the translating the historical log data into the digital data with the digital characteristic by the processing module 201 includes:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only comprising English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structure data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
Preferably, the recognition model is an LSTM model, and the configuring of the parameters of the recognition model by the recognition module 203 includes:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Preferably, the analyzing module 204 performs frequency analysis on the classified keyword data to obtain high-frequency keywords, including:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
The specific implementation process of the functions implemented by each module in this embodiment 2 is the same as the implementation process of each step in embodiment 1, and is not described herein again.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structural changes made by using the contents of the specification and drawings, or any other related technical fields, which are directly or indirectly applied to the present invention, are included in the scope of the present invention.

Claims (10)

1. An Openstack-based log key error automatic identification method is characterized by comprising the following steps:
s1: collecting historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set;
s2: training a recognition model by using the training data set to finish the training of the model;
s3: configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
s4: and performing frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into a key error reporting log collection text again, and displaying the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
2. The method of claim 1, wherein the collecting historical log data, processing the historical log data comprises:
selecting a preset number of physical servers, respectively obtaining logs of each physical server, and storing the logs in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log contents and error reporting contents, and partitioning by adopting word vectors to finish the partition of the keywords.
3. The method of claim 2, wherein translating the historical log data into digitized data having digital characteristics comprises:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
4. The method of claim 3, wherein the recognition model is an LSTM model, and wherein configuring the parameters of the recognition model comprises:
the LSTM has the parameter configuration that embedding _ dim parameter is 27, hidden _ dim parameter is 17, num _ layers parameter is 3, output _ size parameter is 2, padding parameter is 1, the word vector conversion method torch.
5. The method of claim 4, wherein the frequency analyzing the classified keyword data to obtain high-frequency keywords comprises:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
6. An Openstack-based log key error automatic identification device, the device comprising:
the processing module is used for acquiring historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set;
the training module is used for training the recognition model by utilizing the training data set to finish the training of the model;
the recognition module is used for configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
and the analysis module is used for carrying out frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into the key error reporting log collection text again, and displaying the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
7. The apparatus of claim 6, wherein the processing module collects historical log data, and wherein processing the historical log data comprises:
selecting a preset number of physical servers, respectively obtaining logs of each physical server, and storing the logs in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log contents and error reporting contents, and partitioning by adopting word vectors to finish the partition of the keywords.
8. The apparatus of claim 7, wherein the processing module translates the historical log data into digitized data having a numerical characteristic comprising:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
9. The apparatus of claim 8, wherein the recognition model is an LSTM model, and wherein the recognition module configures parameters of the recognition model comprises:
the LSTM parameter is configured to embed _ dim parameter 27, hide _ dim parameter 17, num _ layers parameter 3, output _ size parameter 2, padding parameter 1, and the internal length parameter is configured to 180 using the word vector conversion method torch.
10. The apparatus of claim 9, wherein the analysis module performs frequency analysis on the classified keyword data to obtain high-frequency keywords, and comprises:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
CN202210711115.6A 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack Pending CN114997158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210711115.6A CN114997158A (en) 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210711115.6A CN114997158A (en) 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack

Publications (1)

Publication Number Publication Date
CN114997158A true CN114997158A (en) 2022-09-02

Family

ID=83037574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210711115.6A Pending CN114997158A (en) 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack

Country Status (1)

Country Link
CN (1) CN114997158A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120239657A1 (en) * 2011-03-18 2012-09-20 Fujitsu Limited Category classification processing device and method
US20190065343A1 (en) * 2017-08-29 2019-02-28 Fmr Llc Automated Log Analysis and Problem Solving Using Intelligent Operation and Deep Learning
WO2021120875A1 (en) * 2019-12-20 2021-06-24 华为技术有限公司 Search method and apparatus, terminal device and storage medium
CN111130877A (en) * 2019-12-23 2020-05-08 国网江苏省电力有限公司信息通信分公司 NLP-based weblog processing system and method
CN113094198A (en) * 2021-04-13 2021-07-09 中国工商银行股份有限公司 Service fault positioning method and device based on machine learning and text classification
CN113986863A (en) * 2021-10-27 2022-01-28 济南浪潮数据技术有限公司 Method, device and equipment for classifying error logs of cloud platform and readable medium
CN114491044A (en) * 2022-02-11 2022-05-13 中国工商银行股份有限公司 Log processing method and device
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics
CN114997157A (en) * 2022-06-21 2022-09-02 中电信数智科技有限公司 New coronary pneumonia symptom text data identification method and device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梅御东 (Mei Yudong); 陈旭 (Chen Xu); 孙毓忠 (Sun Yuzhong); 牛逸翔 (Niu Yixiang); 肖立 (Xiao Li); 王海荣 (Wang Hairong); 冯百明 (Feng Baiming): "A software system anomaly detection method based on log information and CNN-text" (一种基于日志信息和CNN-text的软件系统异常检测方法), Chinese Journal of Computers (计算机学报), no. 02, 9 July 2019 (2019-07-09) *

Similar Documents

Publication Publication Date Title
CN108391446B (en) Automatic extraction of training corpus for data classifier based on machine learning algorithm
US11361004B2 (en) Efficient data relationship mining using machine learning
CN109344230B (en) Code library file generation, code search, coupling, optimization and migration method
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN109492106B (en) Automatic classification method for defect reasons by combining text codes
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN112597283B (en) Notification text information entity attribute extraction method, computer equipment and storage medium
CN112732934A (en) Power grid equipment word segmentation dictionary and fault case library construction method
EP3968245A1 (en) Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus
CN113924582A (en) Machine learning processing pipeline optimization
CN111522901A (en) Method and device for processing address information in text
CN114239588A (en) Article processing method and device, electronic equipment and medium
CN112685374B (en) Log classification method and device and electronic equipment
CN107797979B (en) Analysis device and analysis method
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
EP4254279A1 (en) Machine learning pipeline augmented with explanation
CN114997158A (en) Log key error automatic identification method and device based on Openstack
EP3965024A1 (en) Automatically labeling functional blocks in pipelines of existing machine learning projects in a corpus adaptable for use in new machine learning projects
Ríos-Vila et al. End-to-End Full-Page Optical Music Recognition for Mensural Notation.
CN114443803A (en) Text information mining method and device, electronic equipment and storage medium
CN116028620B (en) Method and system for generating patent abstract based on multi-task feature cooperation
CN113313184B (en) Heterogeneous integrated self-bearing technology liability automatic detection method
JP2020166443A (en) Data processing method recommendation system, data processing method recommendation method, and data processing method recommendation program
CN113821618B (en) Method and system for extracting class items of electronic medical record
CN114492308B (en) Industry information indexing method and system combining knowledge discovery and text mining

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination