CN114997158A - Log key error automatic identification method and device based on Openstack - Google Patents

Log key error automatic identification method and device based on Openstack

Info

Publication number
CN114997158A
Authority
CN
China
Prior art keywords
data
log
data set
parameter
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210711115.6A
Other languages
Chinese (zh)
Inventor
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Digital Intelligence Technology Co Ltd filed Critical China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202210711115.6A priority Critical patent/CN114997158A/en
Publication of CN114997158A publication Critical patent/CN114997158A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an Openstack-based method and device for automatically identifying key log errors, and belongs to the technical field of data mining. The method comprises the following steps: processing and translating historical log data, converting it into digital data with digital characteristics, and performing a zero-filling operation on the digital data to obtain a training data set; training a recognition model with the training data set to complete the training of the model; configuring the parameters of the recognition model, inputting log data into the recognition model for classification, and outputting the error-reporting keywords of the logs to form a classified keyword data set; and performing frequency analysis on the classified keyword data to obtain high-frequency keywords and re-saving the logs corresponding to the high-frequency keywords into a key error reporting log collection text. The method automatically stores the keywords found in log data into a keyword lexicon and reduces manual maintenance cost.

Description

Log key error automatic identification method and device based on Openstack
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to an Openstack-based log key error automatic identification method and device.
Background
How to carry out operation and maintenance efficiently is a long-standing problem. Some technicians can solve problems by searching directly through huge logs or even by reading the source code, but such technicians are too few, and when clusters of tens or hundreds of thousands of nodes must be operated and maintained there are simply not enough hands. To cope with this, the places where errors are actually reported must be extracted from the huge logs so that they can be addressed.
Openstack is a cloud platform management project with very wide application in the field of cloud computing; it is currently one of the mainstream cloud computing technologies and is suitable for various application scenarios such as business and markets. Openstack runs on a container basis and is managed uniformly by Ansible containers. Nova is the compute component and is internally divided into Nova Compute, Nova Scheduler, Nova API and so on; together with Keystone, Neutron, Cinder and the other components, each component function runs as a container and therefore has its own independent log. If any one of Keystone, Nova, Neutron or Cinder cannot work normally, the whole Openstack system cannot work: usually one container reports an error while running, several containers then fail to run, other containers report errors in turn, and these errors interleave with each other, so that a large amount of error information accumulates in the logs.
In the log error localization scenario, the conventional approach is to update the keyword lexicon manually. In such a mode, however, each person's ideas differ and are difficult to unify, they are even harder to unify across version upgrades, and the maintenance cost is very high: technicians have to perform precise, targeted maintenance at regular intervals, which greatly increases the required labor cost.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and to provide an Openstack-based method and device for automatically identifying key log errors, which can automatically store the keywords in log data into a keyword lexicon and reduce manual maintenance cost.
According to one aspect, the invention provides an Openstack-based log key error automatic identification method, which comprises the following steps:
s1: collecting historical log data, processing and translating the historical log data, converting it into digital data with digital characteristics, and performing a zero-filling operation on the digital data to obtain a training data set;
s2: training a recognition model with the training data set to complete the training of the model;
s3: configuring the parameters of the recognition model, inputting log data into the recognition model for classification, and outputting the error-reporting keywords of the logs to form a classified keyword data set;
s4: performing frequency analysis on the classified keyword data to obtain high-frequency keywords, re-saving the logs corresponding to the high-frequency keywords into a key error reporting log collection text, and, when the logs are analyzed, displaying the key errors of the logs simply by opening the key error reporting log collection text.
Preferably, collecting the historical log data and processing the historical log data comprises:
selecting a preset number of physical servers, obtaining the log of each physical server and saving it in text form to obtain the preset number of log texts; combining these log texts in time order to form a complete log data set, the content of which consists of log time, log content and error-reporting content; and cutting off the log time, retaining the log content and the error-reporting content, and performing word-vector segmentation to complete the division into keywords.
Preferably, translating the historical log data and converting it into digital data with digital characteristics comprises:
translating the log data set to obtain an English data set, converting all lower-case letters in the English data set to upper case, erasing interference data that is not in letter form so as to form a data set containing only upper-case English letters and spaces, and turning each piece of data into a list to obtain a data set with a two-dimensional tensor structure; and replacing the 26 letters with the numbers 1-26 and assigning the space character an effective feature value of 50, thereby obtaining the digital data with digital characteristics.
Preferably, the recognition model is an LSTM model, and configuring the parameters of the recognition model comprises:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Preferably, performing frequency analysis on the classified keyword data to obtain the high-frequency keywords comprises:
according to the number of LSTM classes, performing a weighted calculation on the components by classification category to obtain a first parameter value; counting the frequency of each piece of classified data to obtain a second parameter value; performing a weighted operation on the first and second parameter values to obtain a weight value; and taking the keywords whose weight value meets a preset threshold condition as the high-frequency keywords.
According to another aspect of the present invention, the present invention further provides an Openstack-based log key error automatic identification device, where the device includes:
the processing module is used for acquiring historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set;
the training module is used for training the recognition model by utilizing the training data set to finish the training of the model;
the recognition module is used for configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
and the analysis module is used for carrying out frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into the key error log collection text again, and realizing the display of the key errors of the logs by opening the key error log collection text when the logs are analyzed.
Preferably, the processing module collects historical log data, and processing the historical log data includes:
selecting a preset number of physical servers, respectively obtaining logs of each physical server, and storing the logs in a text form to obtain a preset number of log texts; combining the preset number of log texts according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log content and error reporting content, and partitioning by adopting word vectors to finish the partition of keywords.
Preferably, the translating the historical log data into the digital data with the digital characteristics by the processing module comprises:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
Preferably, the recognition model is an LSTM model, and the configuring parameters of the recognition model by the recognition module includes:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Preferably, the analyzing module performs frequency analysis on the classified keyword data to obtain high-frequency keywords, and the obtaining of the high-frequency keywords includes:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
Beneficial effects: the method adopts an Openstack-based automatic key log error identification technique, acquiring the log data sets of multiple physical machines and applying word vector segmentation; the different word vectors are set up as LSTM data sets; the LSTM performs multi-class classification of the word vectors; the medium- and high-frequency words of the same type are stored into the lexicon as keywords; the keyword lexicon is thus obtained automatically, the whole process requires no manual operation, the keywords in the log data are stored into the keyword lexicon automatically, and manual maintenance cost is reduced.
The features and advantages of the present invention will become apparent by reference to the following drawings and detailed description of specific embodiments of the invention.
Drawings
FIG. 1 is a flow chart of a method for automatically identifying key log errors based on Openstack;
FIG. 2 is a schematic diagram of an Openstack-based log key error automatic identification device.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
FIG. 1 is a flowchart of an Openstack-based log key error automatic identification method. As shown in fig. 1, this embodiment provides an Openstack-based log key error automatic identification method, where the method includes the following steps:
s1: collecting historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set.
Preferably, the collecting historical log data, and the processing the historical log data includes:
selecting a preset number of physical servers, and respectively obtaining logs of each physical server to store in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log content and error reporting content, and partitioning by adopting word vectors to finish the partition of keywords.
Specifically, a log collection module is added on each physical server, a new folder is created, and the log data is saved as txt text. The logs are checked against the UUID of the host. The logs of the container components Nova, Neutron, Keystone, Cinder and so on are obtained separately, and then all the logs of the whole physical machine are merged and saved in txt form.
Only a one-month range of logs is selected from each machine, with the time range chosen so that it covers typical error types; this makes the logs more meaningful, allows the data to be processed accordingly, reduces the total data volume and improves data quality.
The preset number may be 50-100: complete log data are obtained from each of 50-100 machines with representative, typical errors and stored as txt, and then all the log data are merged to complete the log collection of the cluster. All the txt data in the 50-100 log files are combined from top to bottom in time order to form a complete log data set M, which covers the full data volume of the 50-100 txt logs.
The length of M is generally in the millions of entries, and the content of M consists of log time + log content + error-reporting content. Next, a Python re regular expression is used to cut off the log time, retaining the log content and the error-reporting content.
Word vector segmentation mainly divides a sentence according to its content and meaning. The mainstream segmentation method for English is to split on spaces, because in an English sentence the words are separated by spaces; segmentation in this way first separates out every word, after which the content with different meanings, namely subject, predicate and object, can be divided systematically according to meaning.
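For illustration only, the collection, merging, time-cutting and space-based segmentation described above can be sketched as follows in Python; the directory layout, timestamp format and regular expression shown here are assumptions and are not fixed by this description:

    import glob
    import re

    # Merge the per-server txt logs top-to-bottom (assumed already in time order)
    # into the complete data set M.
    log_lines = []
    for path in sorted(glob.glob("logs/*.txt")):        # one txt file per physical server
        with open(path, encoding="utf-8") as f:
            log_lines.extend(f.readlines())

    # Cut off the leading log time with a Python re regular expression
    # (the timestamp pattern is an assumed example), keeping only
    # log content and error-reporting content.
    timestamp = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}[.,]?\d*\s*")
    dataset_m = [timestamp.sub("", ln).strip() for ln in log_lines if ln.strip()]

    # Word vector segmentation for English: split each entry on spaces.
    segmented = [entry.split() for entry in dataset_m]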
The recognition model then performs multiple recognitions, separating subjects, predicates and objects and classifying the different objects again, which completes the division into keywords. Before the recognition model is used for classification, the data are labeled and judged, and the model is then trained to obtain its training parameters; with these parameters as reference, the model can directly produce accurate classification results for unknown data.
Preferably, the translating the historical log data into the digital data with digital characteristics includes:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
Specifically, the effective features for the text data are:
A-->0,B-->1,C-->2,D-->3,E-->4,F-->5,G-->6,H-->7,I-->8,J-->9,K-->10,L-->11,M-->12,N-->13,O-->14,P-->15,Q-->16,R-->17,S-->18,T-->19,U-->20,V-->21,W-->22,X-->23,Y-->24,Z-->25;
‘_’-->51,‘?’-->75,‘.’-->76,‘-’-->77,‘,’-->78,‘!’-->79,‘Ⅱ’-->80,‘(’-->81,‘)’-->82,‘:’-->83,‘/’-->84,‘+’-->85,‘%’-->86。
To sum up: the 26 letter features A, B, C, ... have their valid segments replaced with 1, 2, 3, ..., 26; the space separating two words is replaced with 50; and where a space is itself used as a feature it is replaced with an underscore "_" whose valid segment is represented by 51. These values are distinct from the letters and use widely spaced numbers, which gives each number in the whole one-dimensional vector a stronger individual feature; such a simple integer feature expresses less information than a float feature like 1.222345678765432 but does so more effectively.
Each segment of data is processed according to the word vector segmentation above to obtain a processed data set H from the data set M. All lower-case letters are then converted to upper case, and interfering characters such as !@#...&() are erased. Because spaces separate English words, they are kept in order to preserve the digital features: whenever a space is encountered it is read as data and its effective feature value is set to 50. After this final refinement the data retain only the upper-case forms of the 26 letters and the digital feature formed by the spaces. Each of the millions of entries in the H set is turned into a list, yielding data with a two-dimensional tensor structure. After each entry has been listed, the 26 letters are replaced by the numbers 1-26, giving millions of one-dimensional tensors of digital data of different lengths that carry all the feature properties of the original English data. The existing data are thus converted from H to P; the P data set is entirely new digital data whose interior consists of millions of list entries converted from the original data.
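A minimal sketch of this character-to-number conversion is given below; it follows the mapping table listed earlier (A through Z, the space as 50, the underscore as 51 and the listed punctuation marks), and the handling of any character outside that table is an assumption:

    # Map upper-case letters, the space, the underscore and a few punctuation
    # marks to small integers, following the mapping table above.
    CHAR_MAP = {chr(ord("A") + i): i for i in range(26)}        # A->0 ... Z->25
    CHAR_MAP.update({" ": 50, "_": 51, "?": 75, ".": 76, "-": 77, ",": 78,
                     "!": 79, "(": 81, ")": 82, ":": 83, "/": 84, "+": 85, "%": 86})

    def encode_line(text):
        """Upper-case the text, drop unmapped characters, return a 1-D list of integers."""
        return [CHAR_MAP[ch] for ch in text.upper() if ch in CHAR_MAP]

    sample = "ERROR nova-compute: instance failed to spawn"     # assumed example log line
    print(encode_line(sample))                                   # one row of the P data set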
The P data set carries the digital characteristics of the data, but each entry has a different length, which makes training difficult: an LSTM model can only really recognize a two-dimensional tensor in which every data list has equal length. The data therefore need secondary deep processing. The element 0 can represent the absence of data, which makes it well suited to the completion operation: in the data set P, the length a of the longest list and the shortest length b are determined, and, with reference to a and b, the remaining lists are completed with supplementary elements all set to 0. This yields a brand-new training data set T whose length is in the millions of log entries and in which every internal entry has equal length.
The original data set M is translated to obtain H, H is converted into P after the digitization processing, and a zero-padding operation on P produces the data set T that the model can actually use, completing the series of conversions. The data are then divided into two parts according to the label type: T_Pos, the positive sample data set, and T_Neg, the negative sample data set. The two data sets are read into the model in sequence and the later training stage begins.
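The zero-padding and the split into T_Pos and T_Neg can be sketched as below; the label source and the toy stand-in data are assumptions, since the description only names the two sample sets:

    import torch

    def pad_to_length(seqs, pad_value=0):
        """Zero-pad every encoded log line so all rows of the 2-D tensor are equally long."""
        max_len = max(len(s) for s in seqs)
        return torch.tensor([s + [pad_value] * (max_len - len(s)) for s in seqs],
                            dtype=torch.long)

    # Tiny stand-in for the integer-encoded data set P from the previous step.
    dataset_p = [[5, 18, 18, 15, 18], [14, 15, 22, 1, 50, 3, 15, 13, 16, 21, 20, 5]]
    dataset_t = pad_to_length(dataset_p)                        # training data set T

    labels = torch.tensor([1, 0])                               # assumed per-entry labels
    t_pos = dataset_t[labels == 1]                              # positive sample data set
    t_neg = dataset_t[labels == 0]                              # negative sample data set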
S2: and training the recognition model by utilizing the training data set to finish the training of the model.
In this step, the training data obtained in step S1 is input to the recognition model, and the training of the recognition model is completed.
S3: and configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword data set.
Preferably, the recognition model is an LSTM model, and configuring the parameters of the recognition model comprises:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Specifically, in this step the LSTM is used as the model for deep learning training; the data T_Pos and T_Neg obtained in the previous step are used as the input of the model, the LSTM multi-classification technique is realized, and the different types of log errors are counted and classified.
First, the data are input into the LSTM model and the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1. The word vector conversion method torch.nn.Embedding is then used with its internal length parameter configured to 180; the result produced by the embedding in the previous step is fed into the LSTM model, and the output is processed directly by a fully connected layer.
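As a sketch only, the configuration above could be realized in PyTorch as follows; the class structure, the vocabulary size of 87, the reading of the "padding" parameter as padding_idx = 0 (matching the zero-filling step) and the abbreviated training loop are assumptions, while the numeric parameter values are the ones named in this description:

    import torch
    import torch.nn as nn

    class LogKeywordLSTM(nn.Module):
        """LSTM classifier using the parameter values named in the description."""
        def __init__(self, vocab_size=87, embedding_dim=27, hidden_dim=17,
                     num_layers=3, output_size=2, seq_len=180):
            super().__init__()
            # torch.nn.Embedding; index 0 is reserved for the zero padding added earlier.
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
            self.fc = nn.Linear(hidden_dim, output_size)   # fully connected output layer
            self.seq_len = seq_len                         # internal length parameter 180

        def forward(self, x):
            x = x[:, :self.seq_len]
            out, _ = self.lstm(self.embedding(x))
            return self.fc(out[:, -1, :])                  # scores for the output_size classes

    model = LogKeywordLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Toy batch standing in for the padded T_Pos / T_Neg data.
    batch = torch.randint(1, 87, (4, 180))
    labels = torch.tensor([1, 1, 0, 0])
    for _ in range(3):                                     # abbreviated training loop
        optimizer.zero_grad()
        loss = loss_fn(model(batch), labels)
        loss.backward()
        optimizer.step()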
The training of the model is thus completed. The model and its training parameter file are placed into a Nova API: a new Nova API is built as the carrier, and its body part takes the whole log data set M of a given machine. The API divides the data in M according to semantics, extracts features, converts them to digital features matching the LSTM model data, and finally classifies them with the model; the API then outputs the error-reporting keywords of that machine's log.
S4: and performing frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into a key error reporting log collection text again, and displaying the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
Preferably, the frequency analysis of the classified keyword data to obtain the high-frequency keyword includes:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; and counting the frequency of each classified data to obtain a second parameter value, carrying out weighted operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
Specifically, a keyword data set can be obtained quickly with the trained LSTM model. Frequency analysis is then performed on the data of each classification category according to the composition of the M set: a weighted judgment of the components is made from the frequency and the number of LSTM classes; a second parameter is designed by counting the frequency of each piece of data to be classified; the two parameters are combined in a weighted operation; and the top 30% by weight are taken as the data of the keyword data set, a proportion that can be adjusted appropriately according to actual conditions.
Keywords are selected with the LSTM method and stored in a database. After the keywords are read back from the database, the frequency of each keyword in the log data is divided by the total number of words to obtain a percentage, which is used as its weight; each category thus has a set of keywords sorted by weight ratio. Because the base size of the data set is huge and only the positions where high-frequency words appear are of interest, the keywords of each category are divided a second time and only the top 30% are kept. This number is proportional to the size of the data set, and the parameters are adjusted manually as required to ensure that large numbers of worthless classified keywords do not appear in the keyword data set.
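A minimal sketch of the weighting and top-30% selection described above is given below; the exact combination of the two parameters is an assumption, since the description only states that a per-category weighting and a frequency ratio are combined:

    from collections import Counter

    def select_high_frequency_keywords(classified, total_words, keep_ratio=0.30):
        """For each LSTM class, weight keywords by their frequency ratio and keep the top 30%."""
        result = {}
        for category, keywords in classified.items():
            counts = Counter(keywords)
            # frequency of the keyword / total word count, used as the weight (a percentage)
            ranked = sorted(counts, key=lambda k: counts[k] / total_words, reverse=True)
            keep = max(1, int(len(ranked) * keep_ratio))
            result[category] = ranked[:keep]
        return result

    # Toy example: two classes of keywords produced by the classifier.
    classified = {"nova": ["TIMEOUT", "TIMEOUT", "NO VALID HOST", "TRACE"],
                  "neutron": ["PORT BINDING FAILED", "TIMEOUT"]}
    print(select_high_frequency_keywords(classified, total_words=1000))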
From the preceding operations a keyword data set is obtained that retains only the high-frequency keywords of each important category, and it is stored in a database. The log data are then searched according to the keyword judgment, and the log statements corresponding to the keywords are saved again into a file N, a brand-new key error reporting log collection text; when the logs are analyzed later, the key log errors can be displayed simply by opening N.
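The final re-saving step can be sketched as follows; the file name key_error_logs_N.txt and the simple substring match stand in for the keyword judgment and are assumptions:

    # Re-save every log line that contains one of the high-frequency keywords
    # into the key error reporting log collection text N.
    high_freq = {"TIMEOUT", "NO VALID HOST"}                    # from the previous sketch
    dataset_m = ["ERROR nova-compute: TIMEOUT waiting for port",
                 "INFO keystone: token issued"]                 # stand-in merged log lines
    with open("key_error_logs_N.txt", "w", encoding="utf-8") as out:
        for line in dataset_m:
            if any(kw in line.upper() for kw in high_freq):
                out.write(line + "\n")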
In this embodiment, the Openstack-based automatic key log error identification technique acquires the log data sets of multiple physical machines and applies word vector segmentation; the different word vectors are set up as LSTM data sets; the LSTM performs multi-class classification of the word vectors; the medium- and high-frequency words of the same type are stored into the lexicon as keywords; the keyword lexicon is thus obtained automatically, the whole process requires no manual operation, the keywords in the log data can be stored into the keyword lexicon automatically, and manual maintenance cost is reduced.
Example 2
Fig. 2 is a schematic diagram of an Openstack-based log key error automatic identification device. As shown in fig. 2, the present invention further provides an Openstack-based log key error automatic identification device, where the device includes:
the processing module 201 is configured to collect historical log data, process and translate the historical log data, convert the historical log data into digital data with digital features, and perform zero padding operation on the digital data to obtain a training data set;
a training module 202, configured to train a recognition model by using the training data set, so as to complete training of the model;
the recognition module 203 is used for configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
the analysis module 204 is configured to perform frequency analysis on the classified keyword data to obtain high-frequency keywords, re-save the logs corresponding to the high-frequency keywords into a key error reporting log collection text, and display the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
Preferably, the processing module 201 collects historical log data, and processing the historical log data includes:
selecting a preset number of physical servers, and respectively obtaining logs of each physical server to store in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log content and error reporting content, and partitioning by adopting word vectors to finish the partition of keywords.
Preferably, the translating the historical log data into the digital data with the digital characteristic by the processing module 201 includes:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only comprising English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structure data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
Preferably, the recognition model is an LSTM model, and the configuring of the parameters of the recognition model by the recognition module 203 includes:
the LSTM parameters are configured as embedding_dim = 27, hidden_dim = 17, num_layers = 3, output_size = 2 and padding = 1, and the internal length parameter of the word vector conversion method torch.nn.Embedding is configured to 180.
Preferably, the analyzing module 204 performs frequency analysis on the classified keyword data to obtain high-frequency keywords, including:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
The specific implementation process of the functions implemented by each module in this embodiment 2 is the same as the implementation process of each step in embodiment 1, and is not described herein again.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structural changes made by using the contents of the specification and drawings, or any other related technical fields, which are directly or indirectly applied to the present invention, are included in the scope of the present invention.

Claims (10)

1. An Openstack-based log key error automatic identification method is characterized by comprising the following steps:
s1: collecting historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set;
s2: training a recognition model by using the training data set to finish the training of the model;
s3: configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
s4: and performing frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into a key error reporting log collection text again, and displaying the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
2. The method of claim 1, wherein the collecting historical log data, processing the historical log data comprises:
selecting a preset number of physical servers, respectively obtaining logs of each physical server, and storing the logs in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log contents and error reporting contents, and partitioning by adopting word vectors to finish the partition of the keywords.
3. The method of claim 2, wherein translating the historical log data into digitized data having digital characteristics comprises:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
4. The method of claim 3, wherein the recognition model is an LSTM model, and wherein configuring the parameters of the recognition model comprises:
the LSTM has the parameter configuration that embedding _ dim parameter is 27, hidden _ dim parameter is 17, num _ layers parameter is 3, output _ size parameter is 2, padding parameter is 1, the word vector conversion method torch.
5. The method of claim 4, wherein the frequency analyzing the classified keyword data to obtain high-frequency keywords comprises:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
6. An Openstack-based log key error automatic identification device, the device comprising:
the processing module is used for acquiring historical log data, processing and translating the historical log data, converting the historical log data into digital data with digital characteristics, and performing zero filling operation on the digital data to obtain a training data set;
the training module is used for training the recognition model by utilizing the training data set to finish the training of the model;
the recognition module is used for configuring parameters of the recognition model, inputting log data into the recognition model for classification, and outputting keywords of log error report to form a classified keyword dataset;
and the analysis module is used for carrying out frequency analysis on the classified keyword data to obtain high-frequency keywords, storing the logs corresponding to the high-frequency keywords into the key error reporting log collection text again, and displaying the key errors of the logs by opening the key error reporting log collection text when the logs are analyzed.
7. The apparatus of claim 6, wherein the processing module collects historical log data, and wherein processing the historical log data comprises:
selecting a preset number of physical servers, respectively obtaining logs of each physical server, and storing the logs in a text form to obtain a preset number of log texts; combining the log texts of the preset number according to a time sequence to form a complete log data set, wherein the content of the log data set consists of log time, log content and error reporting content; and cutting log time, reserving log contents and error reporting contents, and partitioning by adopting word vectors to finish the partition of the keywords.
8. The apparatus of claim 7, wherein the processing module translates the historical log data into digitized data having a numerical characteristic comprising:
translating the log data set to obtain an English data set, converting all lower case letters in the English data set into upper case letters, erasing non-letter-form interference data in the data set to form a data set only including English upper case letters and spaces, listing each piece of data to obtain a two-dimensional tensor-structured data set; and replacing 26 letters by 1-26 numbers, and assigning the effective characteristic value of the space data to be 50 to obtain the digital data with the digital characteristics.
9. The apparatus of claim 8, wherein the recognition model is an LSTM model, and wherein the recognition module configures parameters of the recognition model comprises:
the LSTM parameter is configured to embed _ dim parameter 27, hide _ dim parameter 17, num _ layers parameter 3, output _ size parameter 2, padding parameter 1, and the internal length parameter is configured to 180 using the word vector conversion method torch.
10. The apparatus of claim 9, wherein the analysis module performs frequency analysis on the classified keyword data to obtain high-frequency keywords, and comprises:
according to the number of LSTM classifications, performing weighted calculation on the components according to classification categories to obtain a first parameter value; counting the frequency of each classified data to obtain a second parameter value, carrying out weighting operation on the first parameter value and the second parameter value to obtain a weight value, and taking a keyword of which the weight value meets a preset threshold condition as a high-frequency keyword.
CN202210711115.6A 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack Pending CN114997158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210711115.6A CN114997158A (en) 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210711115.6A CN114997158A (en) 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack

Publications (1)

Publication Number Publication Date
CN114997158A true CN114997158A (en) 2022-09-02

Family

ID=83037574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210711115.6A Pending CN114997158A (en) 2022-06-22 2022-06-22 Log key error automatic identification method and device based on Openstack

Country Status (1)

Country Link
CN (1) CN114997158A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120239657A1 (en) * 2011-03-18 2012-09-20 Fujitsu Limited Category classification processing device and method
US20190065343A1 (en) * 2017-08-29 2019-02-28 Fmr Llc Automated Log Analysis and Problem Solving Using Intelligent Operation and Deep Learning
WO2021120875A1 (en) * 2019-12-20 2021-06-24 华为技术有限公司 Search method and apparatus, terminal device and storage medium
CN111130877A (en) * 2019-12-23 2020-05-08 国网江苏省电力有限公司信息通信分公司 NLP-based weblog processing system and method
CN113094198A (en) * 2021-04-13 2021-07-09 中国工商银行股份有限公司 Service fault positioning method and device based on machine learning and text classification
CN113986863A (en) * 2021-10-27 2022-01-28 济南浪潮数据技术有限公司 Method, device and equipment for classifying error logs of cloud platform and readable medium
CN114491044A (en) * 2022-02-11 2022-05-13 中国工商银行股份有限公司 Log processing method and device
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics
CN114997157A (en) * 2022-06-21 2022-09-02 中电信数智科技有限公司 New coronary pneumonia symptom text data identification method and device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梅御东 (Mei Yudong); 陈旭 (Chen Xu); 孙毓忠 (Sun Yuzhong); 牛逸翔 (Niu Yixiang); 肖立 (Xiao Li); 王海荣 (Wang Hairong); 冯百明 (Feng Baiming): "A software system anomaly detection method based on log information and CNN-text" (一种基于日志信息和CNN-text的软件系统异常检测方法), Chinese Journal of Computers (计算机学报), no. 02, 9 July 2019 (2019-07-09) *

Similar Documents

Publication Publication Date Title
CN108391446B (en) Automatic extraction of training corpus for data classifier based on machine learning algorithm
US11361004B2 (en) Efficient data relationship mining using machine learning
CN109344230B (en) Code library file generation, code search, coupling, optimization and migration method
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN109492106B (en) Automatic classification method for defect reasons by combining text codes
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN112597283B (en) Notification text information entity attribute extraction method, computer equipment and storage medium
CN112732934A (en) Power grid equipment word segmentation dictionary and fault case library construction method
EP3968245A1 (en) Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus
CN113924582A (en) Machine learning processing pipeline optimization
CN111522901A (en) Method and device for processing address information in text
CN114239588A (en) Article processing method and device, electronic equipment and medium
CN112685374B (en) Log classification method and device and electronic equipment
CN107797979B (en) Analysis device and analysis method
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
EP4254279A1 (en) Machine learning pipeline augmented with explanation
CN114997158A (en) Log key error automatic identification method and device based on Openstack
EP3965024A1 (en) Automatically labeling functional blocks in pipelines of existing machine learning projects in a corpus adaptable for use in new machine learning projects
Ríos-Vila et al. End-to-End Full-Page Optical Music Recognition for Mensural Notation.
CN114443803A (en) Text information mining method and device, electronic equipment and storage medium
CN116028620B (en) Method and system for generating patent abstract based on multi-task feature cooperation
CN113313184B (en) Heterogeneous integrated self-bearing technology liability automatic detection method
JP2020166443A (en) Data processing method recommendation system, data processing method recommendation method, and data processing method recommendation program
CN113821618B (en) Method and system for extracting class items of electronic medical record
CN114492308B (en) Industry information indexing method and system combining knowledge discovery and text mining

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination