CN115437877A - Online analysis method and system for multi-source log, electronic equipment and storage medium - Google Patents
- Publication number
- CN115437877A (application CN202210990274.4A)
- Authority
- CN
- China
- Prior art keywords
- log
- logs
- speech
- word
- verb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses an online analysis method and system for multi-source logs, an electronic device and a storage medium, wherein the method comprises the following steps: collecting multi-source logs; classifying the collected logs by using a log tree and extracting key field information of the logs; preprocessing the message field of the key field information; grouping the logs by first word based on the preprocessed log message fields; extracting verb part-of-speech features from the grouped log message field contents; and extracting log event templates and updating the log templates for logs sharing the same verb part-of-speech feature through a Longest Common Subsequence (LCS) algorithm. The method distinguishes execution-class logs from state-class logs through the part-of-speech features of log verbs and combines the multi-source information of the logs, so that it can provide an important data basis for constructing workflow diagrams and for root cause analysis in subsequent log analysis, and it effectively addresses the overfitting problem in which logs with similar character structures but different semantics are misjudged as the same log event.
Description
Technical Field
The invention relates to the technical field of log analysis, in particular to an online analysis method and system for multi-source logs, electronic equipment and a storage medium.
Background
As software grows in size and complexity, the volume of logs it generates grows as well. According to statistics, a large cloud application generates about 10 GB of logs per hour; traditional manual analysis of log information cannot keep up with this volume, so automatic log analysis techniques have been developed.
In daily operation and maintenance, anomaly detection and root cause analysis using unstructured logs are very challenging, and detecting anomalies effectively and in real time from massive log data is more challenging still. The content of a log can generally be divided into a constant part and a variable part: the constant part is fixed text and represents the log event template, while the variable part reflects runtime information of the system, such as state values and parameters (IP address, duration, file path and the like). The purpose of log parsing is to separate the log event information from the parameter information, converting an original unstructured log into a structured log that a computer can easily identify and process, which solves the problem that unstructured log data is difficult to analyze. As the first step of log analysis, log parsing plays a key role in the effectiveness of subsequent log analysis.
At present, research in the field of log parsing includes similarity-based clustering methods, frequent item mining methods and heuristic methods. The existing methods mainly perform well on logs from specific domains, and most of them are only suitable for offline log parsing and cannot meet the computational requirements of online log parsing. Typical methods that support online log parsing include Drain and Spell; however, experimental results show that they are prone to overfitting when they encounter log event templates with similar structures, and such templates are common in large-scale software systems. Two logs of OpenStack are as follows:
Instance spawned successfully.
Instance destroyed successfully.
When event template extraction and common subsequence calculation are applied to these two logs, they are often treated as the same event template (Instance * successfully), but they are actually two types of log events with completely opposite meanings.
Event logs with similar structures are also found in the nova component of OpenStack:
2022-04-21 18:14:29.104 19134 INFO nova.compute.manager[req-67a9597b-4d98-486d-a8dc-b2b3718357d5-----][instance:ac8ca295-9d99-4b5f-9a73-694c87af0f3c]VM Started(Lifecycle Event)
2022-04-21 18:14:29.172 19134 INFO nova.compute.manager[req-67a9597b-4d98-486d-a8dc-b2b3718357d5-----][instance:ac8ca295-9d99-4b5f-9a73-694c87af0f3c]VM Paused(Lifecycle Event)
2022-04-21 18:14:31.339 19134 INFO nova.compute.manager[req-67a9597b-4d98-486d-a8dc-b2b3718357d5-----][instance:ac8ca295-9d99-4b5f-9a73-694c87af0f3c]VM Resumed(Lifecycle Event)
During log parsing, the keywords "Started", "Paused" and "Resumed" in these log events are easily marked as variables when a common subsequence is computed, which causes several log events with different semantic features to be misinterpreted as one event, VM <*> (Lifecycle Event).
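The overfitting described above can be reproduced with a minimal Python sketch. It assumes whitespace tokenization and a textbook dynamic-programming LCS; the function name `lcs_tokens` is hypothetical and is not taken from Drain, Spell or the invention. It only illustrates why a plain common-subsequence comparison keeps the shared tokens and loses the one word that carries the meaning.

```python
def lcs_tokens(a, b):
    """Return one longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover the subsequence itself.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


spawned = "Instance spawned successfully.".split()
destroyed = "Instance destroyed successfully.".split()
common = lcs_tokens(spawned, destroyed)
print(common)  # the verb, the only semantically distinguishing token, is dropped
```

Two of the three tokens are shared, so any threshold of "more than half" would merge the two opposite events into one template.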
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and to provide an online analysis method for multi-source logs, which quickly classifies, queries and retrieves logs through a log tree and classifies them by verb part-of-speech features, thereby effectively solving the problem that log events with different semantic features are wrongly classified as one type of event.
In order to achieve the purpose, the invention adopts the following technical scheme:
In one aspect, the invention provides an online analysis method for multi-source logs, which comprises the following steps:
collecting logs originated from a plurality of different application components in a distributed system in a log stream mode;
classifying and retrieving the collected logs by using a log tree, and extracting log key field information, wherein the log key field information comprises: the log identification ID, host, path, timestamp, level and log content message;
preprocessing the message field in the log key field information, and replacing common variable tokens in the message field, such as IP addresses, numeric variables and URL addresses, with wildcards through regular expressions;
grouping the logs by using the first word for the preprocessed log message field;
verb part-of-speech feature extraction is carried out on the grouped log message field contents, classification is carried out according to the verb part-of-speech feature, the logs containing the verb part-of-speech feature are classified into execution class logs, and the logs without the verb part-of-speech feature are classified into state class logs;
and extracting log event templates and updating the log templates for logs that share the same verb part-of-speech feature through the longest common subsequence (LCS) algorithm.
Preferably, the collecting is performed in a log stream manner, specifically:
adopting the Elastic Stack open-source log management solution, collecting server logs through the lightweight Filebeat component, and configuring the directories and files to be collected.
Preferably, the log tree is used to classify and retrieve the collected logs and extract the key field information of the logs, and the method specifically includes:
classifying and retrieving through a log tree, wherein the first layer of the log tree is the host node, the second layer is the program module node, and the third layer is the log key field information; when a new log is input, the host node is matched first and then the program module node, a branch node is added if no match is found, and the log key fields are then extracted; through these steps, log key field information in a uniform format is extracted from logs of different application components; the log key field information comprises a log identification, log source information and log field information, wherein the log identification is the unique ID number of the log; the log source information comprises the host name (host) where the log is located and the path (path) of the log; and the log field information comprises the time sequence feature of the log (timestamp), the specific log content (message) and the importance level feature of the log (level), where level is set to the default value default for logs that do not contain a level field.
Preferably, the grouping the logs by using the first word specifically includes:
when selecting the first word, a corpus is used for judgment: if the first word is in the corpus, it is used directly as the first-word identifier of the log; if it is not in the corpus, the second word is examined instead, and so on; after the first-word identifier is determined, if a group for that identifier already exists, the ID of the log is added directly to the group, and if no such group exists, a new first-word group is created automatically and the ID of the log is added to it.
Preferably, the verb part-of-speech feature extraction is performed on the grouped log message field contents, the logs containing the verb part-of-speech feature are classified according to the verb part-of-speech feature, the logs containing the verb part-of-speech feature are classified into execution class logs, and the execution class logs are grouped according to the specific verb feature; the log without verb part-of-speech characteristics is classified as a state log, which specifically comprises the following steps:
performing part-of-speech tagging on the log by using the NLTK part-of-speech tagging toolkit and extracting the verb part-of-speech features in the log; classifying logs that contain a verb part-of-speech feature as execution-class logs and logs without one as state-class logs; for an execution-class log, if a class for its verb part-of-speech feature already exists, adding the ID of the log to that class; if no such class exists, automatically creating a new verb part-of-speech feature class and adding the ID of the log to it; and if the input log contains no verb part-of-speech feature, classifying it as a state-class log.
Preferably, the extracting the log event template and the updating the log template of the log with the characteristic of the part of speech of the verb of the same kind by the longest common subsequence LCS algorithm specifically include:
arranging the existing log event templates in the same verb part-of-speech feature group in descending order of length, and computing, in that order, the longest common subsequence between the newly added log and each log event template in the group; if the length of the longest common subsequence is more than half the length of the existing log template, the match succeeds and the log ID is added directly to that template's group, and when the existing template is longer than the longest common subsequence, the template is updated with the longest common subsequence; if the length of the longest common subsequence is less than half the length of the existing template, the match fails, a separate new log template is created, and the log ID is added to it.
In another aspect, the invention provides an online analysis system for multi-source logs, which applies the online analysis method for multi-source logs and comprises: a log collection module, a key field extraction module, a log preprocessing module, a first word grouping module, a verb part-of-speech feature classification module and a log event template extraction module;
the log collection module is used for collecting logs from a plurality of different application components in the distributed system in a log stream mode;
the key field extraction module is used for classifying and retrieving the collected logs by using a log tree and extracting log key field information, wherein the log key field information comprises: the log identification ID, host, path, timestamp, level and log content message;
the log preprocessing module is used for preprocessing the message field in the log key field information and replacing the special mark in the message field with a wildcard character through a regular expression;
the first word grouping module is used for grouping the logs by using the first words for the preprocessed log message fields;
the verb part-of-speech feature classification module is used for extracting verb part-of-speech features of message field contents of the logs, classifying the messages according to the verb part-of-speech features, classifying the logs containing the verb part-of-speech features into execution class logs, and classifying the logs without the verb part-of-speech features into state class logs;
the log event template extraction module is used for extracting a log event template and updating the log template from the log with the similar verb part-of-speech characteristics through a Longest Common Subsequence (LCS) algorithm.
Yet another aspect of the present invention provides an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the method for online parsing of a multi-source log.
Still another aspect of the present invention provides a computer-readable storage medium storing a program, where the program is executed by a processor to implement the online parsing method for multi-source logs.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention designs a data structure and a general method that support online analysis of multi-source logs by taking into account the characteristics of logs in distributed software systems, distinguishes execution-class logs from state-class logs by the verb part-of-speech features of the logs, and combines the multi-source information of the logs to provide an important data basis for constructing workflow diagrams and performing root cause analysis in subsequent log analysis, thereby effectively solving the overfitting problem in which logs with the same structure but different semantics are mistakenly treated as the same log event during template extraction.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of an online parsing method for multi-source logs according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a lookup tree of an online parsing method for multi-source logs according to an embodiment of the present invention.
Fig. 3 is a schematic configuration diagram of an online parsing system for multi-source logs according to an embodiment of the present invention.
Fig. 4 is a structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As shown in fig. 1 and fig. 2, an online parsing method for multi-source logs of the present embodiment includes the following steps:
S1, collecting logs originating from a plurality of different application components in a distributed system in a log stream mode.
Further, the collecting in a log stream manner specifically includes:
adopting the Elastic Stack open-source log management solution, collecting server logs through the lightweight Filebeat component, and configuring the directories and files to be collected.
Furthermore, the log streams are input from the control service log /var/log/nova/nova-api.log of the control host control01 and the compute service log /var/log/nova/nova-compute.log of the compute node host compute3.
[1]control01/var/log/nova/nova-api.log 2022-04-27 18:33:47.956 1343 INFO nova.api.openstack.compute.server_external_events [req-288b9b4f-cda5-4e29-a3a1-ede2f4073674 ba82731d70654e39bc5e832454a6fd06 d4d0d97635944209a9093fcd443531d7-default default]Creating event network-changed:830ec68b-e073-4932-bcec-94c12afe1377 for instance b9000564-fe1a-409b-b8cc-1e88b294cd1d on compute1.
[2]control01/var/log/nova/nova-api.log 2022-04-27 18:34:44.264 1317 INFO nova.api.openstack.compute.server_external_events [req-ae103bb1-d263-4fd5-b425-ce546d38f655 ba82731d70654e39bc5e832454a6fd06 d4d0d97635944209a9093fcd443531d7-default default]Creating event network-vif-plugged:830ec68b-e073-4932-bcec-94c12afe1377 for instance 9069c8d7-cfd7-4440-88b4-0e1577cddd76 on compute3.
[3]compute3/var/log/nova/nova-compute.log 2022-04-27 18:34:44.315 19134 INFO nova.virt.libvirt.driver[-][instance:9069c8d7-cfd7-4440-88b4-0e1577cddd76]Instance spawned successfully.
[4]compute3/var/log/nova/nova-compute.log 2022-04-27 19:02:40.632 19134 INFO nova.virt.libvirt.driver[-][instance:9069c8d7-cfd7-4440-88b4-0e1577cddd76]Instance destroyed successfully.
S2, classifying and retrieving the collected logs by using a log tree, and extracting log key field information, wherein the log key field information comprises: the log identification ID, host, path, timestamp, level and log content message;
Furthermore, the first layer of the log tree is the host node, the second layer is the program module node, and the third layer is the log key field information; when a new log is input, the host node is matched first and then the program module node, a branch node is added if no match is found, and the log key fields are then extracted; through these steps, log key field information in a uniform format is extracted from logs of different application components; the log key field information comprises a log identification, log source information and log field information, wherein the log identification is the unique ID number of the log; the log source information comprises the host name (host) where the log is located and the path (path) of the log; and the log field information comprises the time sequence feature of the log (timestamp), the specific log content (message) and the importance level feature of the log (level), where level is set to the default value default for logs that do not contain a level field.
Further, the field information in a uniform format is extracted for logs from different components, as shown in the following table.
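The log tree and key field extraction of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `LogTree`, the regular expression `LINE_RE`, the use of the module name printed in the log line as the second-layer node, and the sequential integer IDs are all hypothetical choices, and the expression is written only for the nova-style lines shown in S1. The host and path are assumed to arrive as collector metadata alongside each raw line.

```python
import itertools
import re

# Assumed layout of a nova-style line:
#   <date> <time> <pid> <LEVEL> <module>[ctx][ctx]...<message>
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<module>[\w.]+)\s*"
    r"(?:\[[^\]]*\]\s*)*(?P<message>.*)$"
)


class LogTree:
    """Layer 1: host node, layer 2: program module node, layer 3: key-field records."""

    def __init__(self):
        self.hosts = {}
        self._ids = itertools.count(1)  # unique log ID generator

    def insert(self, host, path, line):
        match = LINE_RE.match(line)
        fields = match.groupdict() if match else {}
        record = {
            "id": next(self._ids),
            "host": host,
            "path": path,
            "timestamp": fields.get("timestamp"),
            # Logs without a level field get the value "default".
            "level": fields.get("level") or "default",
            "message": fields.get("message", line),
        }
        module = fields.get("module", "unknown")
        # Match the host node, then the module node; add a branch when missing.
        self.hosts.setdefault(host, {}).setdefault(module, []).append(record)
        return record


tree = LogTree()
rec = tree.insert(
    "compute3",
    "/var/log/nova/nova-compute.log",
    "2022-04-27 18:34:44.315 19134 INFO nova.virt.libvirt.driver[-]"
    "[instance:9069c8d7-cfd7-4440-88b4-0e1577cddd76]Instance spawned successfully.",
)
print(rec)
```

Each record carries the six key fields named in S2 (ID, host, path, timestamp, level, message), so logs from different components end up in one uniform shape.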
S3, preprocessing the message field in the log key field information, and replacing common variable tokens in the message field, such as IP addresses, numeric variables and URL addresses, with wildcards through regular expressions.
[1]Creating event network-changed:* for instance * on compute1
[2]Creating event network-vif-plugged:* for instance * on compute3
[3]Instance spawned successfully.
[4]Instance destroyed successfully.
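The preprocessing of step S3 can be approximated with a few regular expressions. The patterns below are assumptions chosen for illustration; in particular, the UUID pattern is not named in the text but is suggested by the sample output above, where instance and event identifiers appear as wildcards while "compute1" is kept. The function name `preprocess` is hypothetical.

```python
import re

# Ordered (pattern, replacement) pairs; more specific patterns run first.
PATTERNS = [
    re.compile(r"https?://\S+"),                                        # URL addresses
    re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"),  # UUIDs (assumed)
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),                         # IPv4 addresses
    re.compile(r"\b\d+\b"),                                             # standalone numbers
]


def preprocess(message):
    """Replace common variable tokens in a log message with the wildcard '*'."""
    for pattern in PATTERNS:
        message = pattern.sub("*", message)
    return message


raw = (
    "Creating event network-changed:830ec68b-e073-4932-bcec-94c12afe1377 "
    "for instance b9000564-fe1a-409b-b8cc-1e88b294cd1d on compute1."
)
print(preprocess(raw))
```

The number pattern uses word boundaries, so digits embedded in a name such as compute1 are left untouched.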
S4, grouping the logs by first word based on the preprocessed log message field.
Further, a corpus is used for judgment when selecting the first word: if the first word is in the corpus, it is used directly as the first-word identifier of the log; if it is not in the corpus, the second word is examined instead, and so on; after the first-word identifier is determined, if a group for that identifier already exists, the ID of the log is added directly to the group, and if no such group exists, a new first-word group is created automatically and the ID of the log is added to it.
Creating:{1,2}
Instance:{3,4}
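A minimal sketch of the first-word grouping in step S4 follows. The tiny `CORPUS` set stands in for whatever word corpus an implementation would actually use, and the function names and the fallback to the first token when no word is in the corpus are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real English word corpus.
CORPUS = {"creating", "event", "instance", "for", "on",
          "spawned", "destroyed", "successfully"}


def first_word_identifier(message, corpus=CORPUS):
    """Return the first word of the message that appears in the corpus."""
    tokens = re.findall(r"[A-Za-z]+", message)
    for token in tokens:
        if token.lower() in corpus:
            return token
    # Assumed fallback: no corpus word found, use the first token (or "").
    return tokens[0] if tokens else ""


def group_by_first_word(messages, corpus=CORPUS):
    """Map each first-word identifier to the set of log IDs that share it."""
    groups = defaultdict(set)
    for log_id, message in messages.items():
        groups[first_word_identifier(message, corpus)].add(log_id)
    return dict(groups)


messages = {
    1: "Creating event network-changed:* for instance * on compute1",
    2: "Creating event network-vif-plugged:* for instance * on compute3",
    3: "Instance spawned successfully.",
    4: "Instance destroyed successfully.",
}
groups = group_by_first_word(messages)
print(groups)
```

Using `defaultdict` mirrors the rule in the text: an existing group receives the log ID, and a missing group is created on first use.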
S5, extracting verb part-of-speech features from the grouped log message field contents and classifying according to them: logs containing a verb part-of-speech feature are classified as execution-class logs and grouped by the specific verb feature, and logs without a verb part-of-speech feature are classified as state-class logs.
Further, part-of-speech tagging is performed on the log by using the NLTK part-of-speech tagging toolkit, and the verb part-of-speech features in the log are extracted; logs that contain a verb part-of-speech feature are classified as execution-class logs and logs without one as state-class logs; for an execution-class log, if a class for its verb part-of-speech feature already exists, the ID of the log is added to that class; if no such class exists, a new verb part-of-speech feature class is created automatically and the ID of the log is added to it; if the input log contains no verb part-of-speech feature, it is classified as a state-class log.
('Creating','VBG'):{1,2}
('spawned','VBD'):{3}
('destroyed','VBD'):{4}
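The classification in step S5 can be sketched as below. The tagger is passed in as a parameter so the logic can be read and tried without any model download; the default `nltk_tagger` wraps `nltk.pos_tag`, which requires NLTK and its tagger data to be installed, and its tags on short log phrases may differ from those listed above. Taking the first verb-tagged token as the feature, and the function names, are assumptions.

```python
from collections import defaultdict


def nltk_tagger(tokens):
    """Default tagger: NLTK's perceptron tagger (needs NLTK data installed)."""
    import nltk  # imported lazily so the rest of the sketch runs without NLTK
    return nltk.pos_tag(tokens)


def first_verb(message, tagger=nltk_tagger):
    """Return (word, tag) for the first token tagged as a verb, or None."""
    tokens = [t.strip(".,:;") for t in message.split()]
    tokens = [t for t in tokens if t and t != "*"]  # skip bare wildcards
    for word, tag in tagger(tokens):
        if tag.startswith("VB"):  # Penn Treebank verb tags: VB, VBD, VBG, ...
            return (word, tag)
    return None


def classify(messages, tagger=nltk_tagger):
    """Split logs into execution-class groups (keyed by verb) and state-class IDs."""
    execution = defaultdict(set)
    state = set()
    for log_id, message in messages.items():
        verb = first_verb(message, tagger)
        if verb is None:
            state.add(log_id)
        else:
            execution[verb].add(log_id)
    return dict(execution), state
```

With NLTK available, `classify(messages)` can be called directly on a mapping of log ID to preprocessed message; a stub tagger that returns fixed tags is enough to exercise the grouping logic in isolation.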
S6, extracting log event templates and updating the log templates for logs with the same verb part-of-speech feature through the Longest Common Subsequence (LCS) algorithm.
Further, the existing log event templates in the same verb part-of-speech feature group are arranged in descending order of length, and the longest common subsequence between the newly added log and each log event template in the group is computed in that order; if the length of the longest common subsequence is more than half the length of the existing log template, the match succeeds and the log ID is added directly to that template's group, and when the existing template is longer than the longest common subsequence, the template is updated with the longest common subsequence; if the length of the longest common subsequence is less than half the length of the existing template, the match fails, a separate new log template is created, and the log ID is added to it.
Creating event * for instance * on * {1,2}
Instance spawned successfully. {3}
Instance destroyed successfully. {4}
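The matching and update rule of step S6 can be sketched as follows, assuming whitespace tokens, lengths counted in tokens, and a wildcard inserted wherever the old template and the common subsequence diverge. The names `lcs`, `merge` and `add_log` are hypothetical, and treating a subsequence of exactly half the template length as a failed match is an assumption, since the text only covers "more than" and "less than" half.

```python
def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def merge(template, common):
    """Keep tokens of `template` found in `common`; collapse the rest into '*'."""
    out, k = [], 0
    for token in template:
        if k < len(common) and token == common[k]:
            out.append(token)
            k += 1
        elif not out or out[-1] != "*":
            out.append("*")
    return out


def add_log(templates, log_id, tokens):
    """Match a log against one verb group's templates, or create a new template."""
    templates.sort(key=lambda t: len(t["tokens"]), reverse=True)  # longest first
    for template in templates:
        common = lcs(template["tokens"], tokens)
        if len(common) > len(template["tokens"]) / 2:  # more than half: match
            template["ids"].add(log_id)
            if len(template["tokens"]) > len(common):
                template["tokens"] = merge(template["tokens"], common)
            return template
    new_template = {"tokens": list(tokens), "ids": {log_id}}
    templates.append(new_template)
    return new_template


verb_groups = {}
logs = [
    (1, "Creating", "Creating event network-changed:* for instance * on compute1"),
    (2, "Creating", "Creating event network-vif-plugged:* for instance * on compute3"),
    (3, "spawned", "Instance spawned successfully."),
    (4, "destroyed", "Instance destroyed successfully."),
]
for log_id, verb, message in logs:
    add_log(verb_groups.setdefault(verb, []), log_id, message.split())

for verb, templates in verb_groups.items():
    for t in templates:
        print(" ".join(t["tokens"]), sorted(t["ids"]))
```

Because logs 3 and 4 sit in different verb groups, they are never compared with each other, which is what keeps the two opposite events in separate templates.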
Based on the same idea as the online analysis method of the multi-source logs in the embodiment, the invention also provides an online analysis system of the multi-source logs, and the system can be used for executing the online analysis method of the multi-source logs. For convenience of illustration, only the parts related to the embodiments of the present invention are shown in the schematic structural diagram of an embodiment of an online parsing system for multi-source logs, and those skilled in the art will understand that the illustrated structure does not constitute a limitation to the apparatus, and may include more or less components than those illustrated, or combine some components, or arrange different components.
As shown in fig. 3, in another embodiment of the present invention, an online parsing system 100 for multi-source logs is provided, which includes a log collecting module 101, a key field extracting module 102, a log preprocessing module 103, a first word grouping module 104, a verb part-of-speech feature classifying module 105, and a log event template extracting module 106;
the log collection module 101 is configured to collect logs originating from a plurality of different application components in the distributed system in a log stream manner;
the key field extracting module 102 is configured to classify and retrieve the collected logs by using a log tree, and extract log key field information, where the log key field information includes: the log identification ID, host, path, timestamp, level and log content message;
The log preprocessing module 103 is configured to preprocess a message field in the log key field information, and replace a special mark in the message field with a wildcard character through a regular expression;
the first word grouping module 104 is used for grouping the logs by using the first word for the preprocessed log message fields;
the verb part-of-speech feature classification module 105 is configured to perform verb part-of-speech feature extraction on the message field content of the log, classify according to the verb part-of-speech feature, classify the log containing the verb part-of-speech feature into an execution class log, and classify the log without the verb part-of-speech feature into a state class log;
the log event template extracting module 106 is configured to extract a log event template and update the log template from the log with the part-of-speech feature of the verb of the same class by using the longest common subsequence LCS algorithm.
It should be noted that the online parsing system for multi-source logs of the present invention corresponds one to one with the online parsing method for multi-source logs of the present invention; the technical features and beneficial effects described in the method embodiment above all apply to the system embodiment, and for specific details reference may be made to the description of the method embodiment, which is not repeated here.
In addition, in the implementation manner of the online parsing system for multi-source logs in the foregoing embodiment, the logical division of each program module is only an example, and in practical applications, the foregoing function distribution may be completed by different program modules according to needs, for example, due to configuration requirements of corresponding hardware or due to convenience of implementation of software, that is, the internal structure of the online parsing system for multi-source logs is divided into different program modules to complete all or part of the functions described above.
As shown in fig. 4, in another embodiment of the present invention, an electronic device for online parsing of multi-source logs is provided, where the electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program stored in the first memory 202 and executable on the first processor 201, such as an online parsing program 203 of multi-source logs.
The first memory 202 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a removable hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 may be used to store not only application software installed in the electronic device 200 and various types of data, such as codes of the online parsing program 203 of the multi-source log, but also temporarily store data that has been output or will be output.
The first processor 201 may in some embodiments be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of packaged integrated circuits having the same or different functions, and includes one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The first processor 201 is the control unit of the electronic device: it connects the components of the whole electronic device through various interfaces and lines, and performs the functions of the electronic device 200 and processes its data by running or executing the programs or modules stored in the first memory 202 and calling the data stored in the first memory 202.
Fig. 4 shows only an electronic device with certain components; those skilled in the art will appreciate that the structure shown in Fig. 4 does not limit the electronic device 200, which may include fewer or more components than shown, combine some components, or arrange the components differently.
The online parsing program 203 of the multi-source log stored in the first memory 202 of the electronic device 200 is a combination of multiple instructions, and when running in the first processor 201, can implement:
collecting logs originated from a plurality of different application components in a distributed system in a log stream mode;
classifying and retrieving the collected logs by using a log tree, and extracting log key field information, wherein the log key field information comprises: the log identification ID, host, path, timestamp, level and log content message;
preprocessing the message field in the log key field information, and replacing the special marks in the message field with wildcard characters through regular expressions;
grouping the logs by the first word of the preprocessed log message field;
verb part-of-speech feature extraction is carried out on the grouped log message field contents, classification is carried out according to the verb part-of-speech feature, the logs containing the verb part-of-speech feature are classified into execution class logs, and the logs without the verb part-of-speech feature are classified into state class logs;
extracting log event templates from logs that share the same verb part-of-speech feature, and updating the log templates, through the longest common subsequence (LCS) algorithm.
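As an illustration of the preprocessing step listed above, the following Python sketch replaces variable tokens in a message field with a wildcard. The concrete patterns (URL, IPv4 address with an optional port, bare integer), the `<*>` wildcard string, and the function name `preprocess` are assumptions made for this example; the embodiment only states that special marks are replaced through regular expressions.

```python
import re

# Hypothetical wildcard marker; the embodiment does not fix its spelling.
WILDCARD = "<*>"

# Assumed patterns, applied in order. URLs go first so that digits inside
# a URL are not rewritten separately by the later patterns.
_PATTERNS = [
    re.compile(r"https?://\S+"),                            # URL address
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),    # IPv4, optional port
    re.compile(r"\b\d+\b"),                                 # numeric variable
]


def preprocess(message: str) -> str:
    """Replace variable tokens in a log message with the wildcard."""
    for pattern in _PATTERNS:
        message = pattern.sub(WILDCARD, message)
    return message


if __name__ == "__main__":
    print(preprocess("Received block 42 from 10.0.0.1:50010 via http://node1/status"))
```

The order of the patterns is a design choice of this sketch: applying the most specific pattern first keeps a URL or an address from being split into several wildcards.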
Further, the modules/units integrated in the electronic device 200, if implemented in the form of software functional units and sold or used as independent products, may be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive (U-disk), a removable hard disk, a magnetic disk, an optical disk, computer memory, and read-only memory (ROM).
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
For brevity, not all possible combinations of the technical features in the above embodiments are described; however, any combination of these features that involves no contradiction should be considered within the scope of the present disclosure.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.
Claims (9)
1. An online analysis method for multi-source logs, characterized by comprising the following steps:
collecting logs originated from a plurality of different application components in a distributed system in a log stream mode;
classifying and retrieving the collected logs by using a log tree, and extracting log key field information, wherein the log key field information comprises: the log identification ID, the host name host, the path of the log, the time-sequence feature timestamp of the log, the level of the log, and the log content message;
preprocessing the message field in the log key field information, and replacing common variable tokens in the message field, including IP addresses, numeric variables and URL addresses, with wildcard characters through regular expressions;
grouping the logs by the first word of the preprocessed log message field;
verb part-of-speech feature extraction is carried out on the grouped log message field content, classification is carried out according to verb part-of-speech features, logs containing verb part-of-speech features are classified into execution class logs, and logs without verb part-of-speech features are classified into state class logs;
extracting log event templates from logs that share the same verb part-of-speech feature, and updating the log templates, through the longest common subsequence (LCS) algorithm.
2. The method for on-line parsing of a multi-source log according to claim 1, wherein the collecting is performed in a log stream manner, specifically:
adopting the Elastic Stack open-source log management solution, collecting the logs of the server through the lightweight Filebeat component, and configuring the corresponding directory files to be collected.
3. The on-line analysis method for multi-source logs according to claim 1, wherein the log tree is used for classifying and retrieving the collected logs and extracting key field information of the logs, and the method specifically comprises the following steps:
classifying and retrieving through a log tree, wherein the first layer of the log tree consists of host nodes, the second layer consists of program module nodes, and the third layer holds the log key field information; when a new log is input, the host node is matched first and the program module node second, a branch node is added if matching fails, and the log key fields are then extracted; through these steps, log key field information in a uniform format is extracted from the logs of different application components, the log key field information comprising a log identification, log source information and log field information, wherein the log identification is the unique ID number of the log; the log source information comprises the host name host of the log and the path of the log; and the log field information comprises the time-sequence feature timestamp of the log, the specific log content message of the log, and the importance-level feature level of the log, the level being set to a default value when the log contains no level field.
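The three-layer log tree of claim 3 can be pictured with the following Python sketch. It is only one possible reading: the nested-dictionary layout, the `module` key used to name the program module, the default level `"INFO"`, and the sequential integer IDs are all assumptions introduced here, not details given in the claim.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List

DEFAULT_LEVEL = "INFO"  # assumed default; the claim only says "a default value"


@dataclass
class LogRecord:
    """Uniform key-field record extracted from one raw log."""
    id: int
    host: str
    path: str
    timestamp: str
    level: str
    message: str


class LogTree:
    """Layer 1: host nodes; layer 2: program module nodes; layer 3: records."""

    def __init__(self) -> None:
        self.hosts: Dict[str, Dict[str, List[LogRecord]]] = {}
        self._ids = count(1)  # hypothetical unique-ID generator

    def insert(self, raw: dict) -> LogRecord:
        # Match the host node first, then the program module node;
        # setdefault adds a new branch when no node matches.
        host_node = self.hosts.setdefault(raw["host"], {})
        leaf = host_node.setdefault(raw.get("module", "unknown"), [])
        record = LogRecord(
            id=next(self._ids),
            host=raw["host"],
            path=raw.get("path", ""),
            timestamp=raw.get("timestamp", ""),
            level=raw.get("level") or DEFAULT_LEVEL,
            message=raw.get("message", ""),
        )
        leaf.append(record)
        return record
```

A caller would pass each collected log as a dictionary, for example `LogTree().insert({"host": "node1", "module": "datanode", "message": "..."})`, and receive the uniform record back.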
4. The on-line parsing method for multi-source logs according to claim 1, wherein the first word is used for grouping the logs, and specifically comprises:
when selecting the first word, a corpus-based check is applied: if the candidate first word is not in the corpus, the second word is selected as the first-word identifier, and so on; if the candidate is in the corpus, it is used directly as the first-word identifier of the log; after the first-word identifier is determined, if a group for that identifier already exists, the ID of the log is added directly to that group; otherwise, a new first-word group is automatically created and the ID of the log is added to it.
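The first-word selection and grouping of claim 4 can be sketched in Python as follows. The corpus is represented here as a plain set of lowercase words, tokens are obtained by whitespace splitting, and a log with no token in the corpus falls back to its first token; all three choices are assumptions of this sketch, since the claim does not specify them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def first_word_identifier(message: str, corpus: Set[str]) -> str:
    """Return the first token of the message that appears in the corpus."""
    tokens = message.split()
    for token in tokens:
        if token.lower() in corpus:
            return token.lower()
    # Assumed fallback when no token is in the corpus (not covered by the claim).
    return tokens[0].lower() if tokens else ""


class FirstWordGroups:
    """Maps each first-word identifier to the IDs of the logs in that group."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.corpus: Set[str] = {word.lower() for word in corpus}
        self.groups: Dict[str, List[int]] = defaultdict(list)

    def add(self, log_id: int, message: str) -> str:
        key = first_word_identifier(message, self.corpus)
        # defaultdict creates the group automatically when it does not exist.
        self.groups[key].append(log_id)
        return key
```

With a corpus containing `receiving` and `deleting`, the messages `Receiving block <*>` and `<*> Deleting block <*>` would land in the `receiving` and `deleting` groups respectively, because the leading wildcard is skipped as a non-corpus token.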
5. The method for analyzing the multi-source log on line according to claim 1, wherein verb part-of-speech feature extraction is performed on the grouped log message field contents, classification is performed according to verb part-of-speech features, logs containing verb part-of-speech features are classified as execution class logs, and the logs are grouped according to specific verb features; the log without verb part-of-speech features is classified as a state log, and specifically includes:
performing part-of-speech tagging on the log by using the NLTK part-of-speech tagging toolkit and extracting the verb part-of-speech features of the log; classifying logs that contain verb part-of-speech features as execution class logs and further grouping them by verb part-of-speech feature: if a class for the log's verb part-of-speech feature already exists, the ID of the log is added to that class; otherwise, a new verb part-of-speech feature class is automatically created and the ID of the log is added to it; and classifying an input log that contains no verb part-of-speech feature as a state class log.
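A minimal Python sketch of the verb-based classification in claim 5 is given below. It calls `nltk.pos_tag`, whose Penn Treebank verb tags begin with `VB`; NLTK and its tagger data must be installed separately. Using the first verb found as the class key, splitting on whitespace, and the class and function names are assumptions of this sketch. The tagger is passed in as a parameter so that another tagger can be substituted.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Tagger = Callable[[Sequence[str]], List[Tuple[str, str]]]


def nltk_tagger(tokens: Sequence[str]) -> List[Tuple[str, str]]:
    """Tag tokens with NLTK; requires NLTK and its tagger model to be installed."""
    import nltk  # imported lazily so the module loads without NLTK present
    return nltk.pos_tag(list(tokens))


def verb_features(message: str, tagger: Tagger = nltk_tagger) -> List[str]:
    """Return the lowercased tokens whose tag marks them as verbs (VB*)."""
    tagged = tagger(message.split())
    return [word.lower() for word, tag in tagged if tag.startswith("VB")]


class VerbClassifier:
    """Splits logs into execution-class groups (by verb) and state-class logs."""

    def __init__(self, tagger: Tagger = nltk_tagger) -> None:
        self.tagger = tagger
        self.execution: Dict[str, List[int]] = {}  # verb feature -> log IDs
        self.state: List[int] = []                 # IDs of logs without verbs

    def add(self, log_id: int, message: str) -> str:
        verbs = verb_features(message, self.tagger)
        if not verbs:
            self.state.append(log_id)
            return "state"
        key = verbs[0]  # assumption: the first verb identifies the class
        self.execution.setdefault(key, []).append(log_id)
        return "execution"
```

Tagging quality on terse log messages depends on the tagger, so the resulting classes should be treated as approximate.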
6. The method according to claim 1, wherein extracting log event templates from logs that share the same verb part-of-speech feature, and updating the log templates, through the longest common subsequence (LCS) algorithm specifically comprises:
sorting the existing log event templates in the same verb part-of-speech feature group in descending order of length, and computing, in that order, the longest common subsequence of the newly added log and each log event template in the group; if the length of the longest common subsequence is greater than half the length of the existing log template, the match succeeds and the log ID is added directly to the existing log template group, and, when the length of the original log template is greater than the length of the longest common subsequence, the log template is updated to the longest common subsequence; if the length of the longest common subsequence is less than half the length of the existing template, the match fails, a separate new log template is created, and the ID is added to that template.
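The matching rule of claim 6 can be sketched in Python as follows, using a standard dynamic-programming LCS over whitespace-separated tokens. Working at token level, treating an LCS length exactly equal to half the template length as a failed match, and the names `lcs` and `TemplateStore` are assumptions of this sketch; the claim does not address those points.

```python
from typing import Dict, List


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover the subsequence itself.
    out: List[str] = []
    i, j = m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


class TemplateStore:
    """Log event templates of one verb part-of-speech feature group."""

    def __init__(self) -> None:
        self.templates: List[Dict] = []  # each: {"tokens": [...], "ids": [...]}

    def add(self, log_id: int, message: str) -> List[str]:
        tokens = message.split()
        # Try existing templates from longest to shortest.
        for tpl in sorted(self.templates, key=lambda t: len(t["tokens"]), reverse=True):
            common = lcs(tokens, tpl["tokens"])
            if len(common) > len(tpl["tokens"]) / 2:  # match succeeds
                tpl["ids"].append(log_id)
                if len(tpl["tokens"]) > len(common):
                    tpl["tokens"] = common  # update template to the LCS
                return tpl["tokens"]
        # No template matched: create a separate new template.
        self.templates.append({"tokens": tokens, "ids": [log_id]})
        return tokens
```

Each `TemplateStore` is meant to hold the templates of a single verb feature group, so one instance per group would be kept by the caller.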
7. An online parsing system for multi-source logs, comprising: the system comprises a log collection module, a key field extraction module, a log preprocessing module, a first word grouping module, a verb part-of-speech feature classification module and a log event template extraction module;
the log collection module is used for collecting logs from a plurality of different application components in the distributed system in a log stream mode;
the key field extraction module is used for classifying and retrieving the collected logs by using a log tree and extracting the key field information of the logs, wherein the key field information of the logs comprises: the log identification ID, host, path, timestamp, level and log content message;
the log preprocessing module is used for preprocessing the message field in the log key field information and replacing the special mark in the message field with a wildcard character through a regular expression;
the first word grouping module is used for grouping the preprocessed log message fields by using the first words;
the verb part-of-speech feature classification module is used for extracting verb part-of-speech features of message field contents of the logs, classifying the messages according to the verb part-of-speech features, classifying the logs containing the verb part-of-speech features into execution class logs, and classifying the logs without the verb part-of-speech features into state class logs;
the log event template extraction module is used for extracting log event templates from logs that share the same verb part-of-speech feature, and updating the log templates, through the longest common subsequence (LCS) algorithm.
8. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform a method of online parsing of a multi-source log according to any of claims 1-6.
9. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements a method for online parsing of a multi-source log according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210990274.4A CN115437877A (en) | 2022-08-18 | 2022-08-18 | Online analysis method and system for multi-source log, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115437877A (en) | 2022-12-06 |
Family
ID=84242398
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210990274.4A Pending CN115437877A (en) | 2022-08-18 | 2022-08-18 | Online analysis method and system for multi-source log, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115437877A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116628451A (en) * | 2023-05-31 | 2023-08-22 | 江苏华存电子科技有限公司 | High-speed analysis method for information to be processed |
CN116628451B (en) * | 2023-05-31 | 2023-11-14 | 江苏华存电子科技有限公司 | High-speed analysis method for information to be processed |
CN117215902A (en) * | 2023-11-09 | 2023-12-12 | 北京集度科技有限公司 | Log analysis method, device, equipment and storage medium |
CN118093325A (en) * | 2024-04-28 | 2024-05-28 | 中国民航大学 | Log template acquisition method, electronic equipment and storage medium |
CN118378387A (en) * | 2024-06-24 | 2024-07-23 | 陕西空天信息技术有限公司 | Method, device and storage medium for analyzing geometric file of impeller machine |
CN118378387B (en) * | 2024-06-24 | 2024-10-22 | 陕西空天信息技术有限公司 | Method, device and storage medium for analyzing geometric file of impeller machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||