WO2021088385A1 - Online log parsing method, system, and electronic terminal device thereof - Google Patents

Online log parsing method, system, and electronic terminal device thereof

Info

Publication number
WO2021088385A1
WO2021088385A1 PCT/CN2020/098701 CN2020098701W WO2021088385A1 WO 2021088385 A1 WO2021088385 A1 WO 2021088385A1 CN 2020098701 W CN2020098701 W CN 2020098701W WO 2021088385 A1 WO2021088385 A1 WO 2021088385A1
Authority
WO
WIPO (PCT)
Prior art keywords
log
sequence
string
unparsed
group
Prior art date
Application number
PCT/CN2020/098701
Other languages
English (en)
French (fr)
Inventor
施凌鹏
卢士达
陈晓露
陈琰
陆正嘉
顾中坚
富宇
顾荣斌
李天宇
刘赛
祝蓓
李静
刘云飞
Original Assignee
国网上海市电力公司
南京航空航天大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国网上海市电力公司, 南京航空航天大学
Priority to EP20789812.3A (EP3846048A1)
Publication of WO2021088385A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • the invention relates to the technical field of log analysis, in particular to an online log analysis method, system and electronic terminal equipment based on string clustering and template spanning tree.
  • The system log records the operating information of a system. As important data for online monitoring and anomaly detection, it helps administrators debug system failures and analyze anomalies, which in turn helps ensure the security of the system.
  • Logs are typically massive in volume and unstructured.
  • Log parsing can convert a large number of unstructured logs into structured data, that is, extract templates from the logs.
  • As the first step of system anomaly detection, log parsing improves the efficiency of anomaly detection.
  • Log parsing methods can be divided into offline methods and online methods.
  • An offline method parses logs offline: the logs produced by the system over a period of time must be collected and loaded into memory to train a parsing model, and the model is then used to parse logs. Because parsing happens offline, logs cannot be parsed in real time and the system administrator cannot detect abnormal system behavior promptly. In addition, if a new log type appears after training, the parsing model must be trained again, which consumes additional time and human resources.
  • An online method can parse newly generated logs in real time, so the system administrator can discover and handle system anomalies promptly.
  • Current online log parsing methods have two shortcomings: first, existing online methods leave room for improvement in both accuracy and efficiency; second, they are designed for specific log systems, so the parsing parameters of one system do not carry over to another, and parsing efficiency and accuracy suffer.
  • The purpose of the present invention is to provide an online log parsing method, system, and electronic terminal device.
  • The method performs online parsing based on string clustering and a template spanning tree: it classifies logs by length, computes log string similarity, clusters the logs a second time based on that similarity, and finally uses the template spanning tree to extract log templates.
  • The method can extract log templates from unstructured logs efficiently and accurately, providing system administrators with concise, intuitive summaries and clean, standardized structured data, and making it easier for data analysts to perform more advanced analysis and processing of the logs.
  • An online log analysis method includes the following steps:
  • Step S3: use the unparsed log sequence as a query item and match it against the common nodes of the templates in the template spanning tree of the second log group to obtain the log template of the unparsed log sequence; subsequent unparsed log sequences repeat steps S2 and S3 in turn to obtain their log templates.
  • step S1 specifically includes:
  • each unparsed log is divided by a predefined separator to obtain unparsed log sequences of different log lengths, and each unparsed log sequence is classified into different first log groups according to different log lengths.
  • the step S2 specifically includes:
  • A second log group corresponding to the unparsed log sequence is obtained; the second log group is the set of log sequences whose log string similarity reaches a similarity threshold, and one first log group contains several second log groups.
  • The log string is the string composed of the first letter of each word in the log sequence.
  • When the first character of a word is a digit or an identifier, it is represented by a wildcard; when the first character of a word is a letter, it is represented by the letter itself.
  • The log string similarity in step S22 is calculated by the following formula:
  • sim(S_m1, S_m2) represents the log string similarity
  • len(|S_m1|, |S_m2|) represents the length of the two log strings
  • S_m1(i) represents the i-th character of the log string S_m1
  • S_m2(j) represents the j-th character of the log string S_m2, and Ⅱ_f(S_m1(i), S_m2(j)) is defined as follows:
  • the step S23 is specifically:
  • the step S3 is specifically:
  • Each word of the unparsed log sequence is inserted into the tree structure in turn, and the last node of each branch is marked with a special character. The search starts from the root node P for the unparsed log sequence and looks for the node holding the word being sought. If there is no matching node, the unparsed log sequence is inserted into the tree structure as a new branch; if there is a matching node, the corresponding subtree is selected and the search continues in that subtree until every word of the unparsed log sequence has been searched, which yields the log template of the unparsed log sequence. Subsequent unparsed log sequences repeat steps S2 and S3 in turn to obtain their log templates.
  • the method further comprises:
  • Determine whether the log template is successfully obtained in step S3, specifically: introduce a bounding value X. When X is less than or equal to the threshold λ, the template has been generated successfully; the template spanning tree is updated and the unmatched nodes are updated to wildcards. When X is greater than λ, the unparsed log sequence is inserted into the template spanning tree as a new branch. The bounding value X is expressed as:
  • ∑unmatch(t_mi) represents the number of nodes that failed to match
  • L is the length of the log
  • σ is the number of wildcards on the search path.
  • the method further comprises:
  • the maximum tree depth parameter MaxDepth is introduced. When the leaf node of the template spanning tree reaches the maximum tree depth parameter MaxDepth, all untraversed words in the unparsed log sequence will no longer be inserted into the template spanning tree, and the untraversed words are replaced with wildcards.
  • a log analysis system for analyzing logs using the online log analysis method, the system comprising:
  • the preprocessing module is used to preprocess the unparsed log to obtain the unparsed log sequence corresponding to the unparsed log;
  • the log sequence allocation module allocates the unparsed log sequence to the corresponding first log group according to its log length
  • the log string acquisition module is used to extract the log sequences in the first log group to obtain the log string of each log sequence
  • the log string similarity calculation module is used to calculate the log string similarity of the log string of the unparsed log sequence and the log string of the parsed log sequence in the first log group;
  • An online clustering module configured to divide the unparsed log sequence into a corresponding second log group according to the similarity of the log string
  • the template spanning tree matching module is used to match the unparsed log sequence with the template spanning tree of the second log group to obtain a log template of the unparsed log sequence.
  • the judging module is used to judge whether the log template of the unparsed log sequence is successfully obtained.
  • an electronic terminal device includes:
  • the present invention has the following advantages:
  • The online log parsing method of the present invention parses logs based on string clustering and a template spanning tree: it classifies logs by length, computes log string similarity, clusters the logs a second time based on that similarity, and finally uses the template spanning tree to extract log templates.
  • The method can extract log templates from unstructured logs efficiently and accurately, providing system administrators with concise, intuitive summaries and clean, standardized structured data, and making it easier for data analysts to perform more advanced analysis and processing of the logs;
  • The method introduces the concept of the log string and computes log string similarity. System logs of the same length are clustered online based on that similarity, which reduces the useless comparison steps during log template extraction and improves log parsing efficiency;
  • The present invention adopts a template extraction method based on a template spanning tree and uses the tree to extract and update templates, thereby improving the accuracy of online log parsing.
  • Fig. 1 shows an example of the structure of a log;
  • Fig. 2 is a first log group with a log length of 4 in the embodiment;
  • Fig. 3 is a log template set corresponding to the first log group in Fig. 2;
  • Fig. 4 is a log string set corresponding to the first log group in Fig. 2;
  • Fig. 5 is a schematic diagram of the log template generation process;
  • Fig. 6 is a schematic diagram of parsing with the online log parsing method of the present invention.
  • Log parsing is an efficient and accurate method for converting unstructured logs into structured forms. It obtains hidden information from massive system logs to guide subsequent operations.
  • the present invention discloses an online log analysis method.
  • the method includes the following steps:
  • Each log m can be expressed as: m = {m_h, m_c}
  • m_h represents the log header
  • m_c represents the log content.
  • The log header m_h contains trivial variable information such as the timestamp, location, log type, security, and facility, whereas log parsing mainly needs to process the log content m_c. Because the variable information in the log header m_h is a structured message type that is intuitive and easy for system administrators to understand, regular expressions alone suffice to identify and match the log header. Log parsing mainly consists of extracting the log template from the log content m_c.
  • Step S1 specifically includes: first, regular expressions are used to identify the matching variable information in the log header m_h of each unparsed log m, and that variable information is deleted.
  • The regular expressions here are very simple, because they only need to match simple tokens such as timestamps and IP addresses rather than the log content m_c; usually at most two regular expressions are enough to identify the log header m_h of every unparsed log m.
  • Next, the log content m_c of each unparsed log m is split using a predefined separator to obtain unparsed log sequences of different log lengths.
  • A space separator can usually handle the splitting of most system logs.
  • The separator splits the log content m_c into an unparsed log sequence T_m of length L (the size of a log sequence is usually called its length):
  • T_m = [t_m1, t_m2, ..., t_mL]
  • t_mi is the i-th word in the unparsed log sequence T_m
  • A log sequence length of L means that the unparsed log sequence T_m contains L words in total.
  • Finally, each unparsed log sequence T_m is assigned to a first log group according to its log length.
  • Each first log group contains all log sequences with the same log length.
  • The first log group is denoted G_L. In practice, a first log group G_L may still contain logs with different templates, as Fig. 2 and Fig. 3 together show. The log data in Fig. 2 comes from the log content m_c of supercomputer system (BlueGene/L, BGL) logs and contains six unparsed logs.
  • Fig. 3 is a log template set corresponding to Fig. 2 (matching against different log sequences yields different log templates); the log template set is the collection of the log templates of all log sequences. Fig. 2 shows a first log group G_L in which every log sequence has a log length of 4, yet the templates of the log sequences all differ, so extracting templates immediately after this step would lead to higher complexity.
  • To address this, a new concept is introduced: the log string, denoted S_m.
  • The log string S_m is the string composed of the first character of each word in the log sequence.
  • Compared with the log length, the log string S_m is a more representative description of a log.
  • Step S2 specifically includes: S21, obtain the log string of each log sequence in the first log group G_L; the log strings of all parsed log sequences in the first log group G_L form the log string set of that first log group G_L.
  • In step S21, to keep the time complexity of log string extraction low, the first character of each word of a log sequence in the first log group G_L is taken as that word's representative character.
  • The first characters of all words are then arranged into a string according to the positions of their words in the log sequence; this string is called the log string S_m of the log sequence.
  • When the first character of a word is a digit or an identifier, it is represented by a wildcard; in this embodiment, the wildcard is written as *.
  • Fig. 4 is the log string set S corresponding to Fig. 2.
  • Although every log sequence in Fig. 2 has the same length, each has a unique log string S_m.
  • The log string S_m of the first log sequence is "*dae"
  • The log string S_m of the second log sequence is "dTei"
  • The subsequent ones are "esr*", "piii", "fpie", and "piuo"
  • The log string S_m of every log sequence is different, so the log string S_m is a more representative description of a log.
  • sim(S_m1, S_m2) represents the similarity of the log strings
  • S_m1 and S_m2 represent the two log strings being compared
  • len(|S_m1|, |S_m2|) represents the length of the two log strings
  • S_m1(i) represents the i-th character of the log string S_m1
  • S_m2(j) represents the j-th character of the log string S_m2
  • Ⅱ_f(S_m1(i), S_m2(j)) is defined as follows:
  • The second log group G_s corresponding to the unparsed log sequence T_m is obtained; the second log group G_s is the set of log sequences whose log string similarity reaches the similarity threshold.
  • One first log group G_L contains several second log groups G_s.
  • Step S23 specifically includes: set a similarity threshold st. If there is a log string whose similarity is greater than st, the log strings S_m1 and S_m2 are similar (that is, the log string of the unparsed log sequence T_m is similar to the log string it was compared with), and the unparsed log sequence T_m is merged into the second log group G_s1 corresponding to that log string; the log string similarity of every log sequence in G_s1 is greater than st. If no log string has a similarity greater than st, a new second log group G_s2 is created and the log string set S of the first log group G_L is updated.
  • After this step, the number of unprocessed log sequences T_m in each first log group G_L that do not yet belong to a second log group decreases, and the probability that the log sequences within each second log group G_s share the same template increases.
  • Performing the subsequent template extraction within a second log group G_s of highly similar logs can greatly reduce the time complexity of the algorithm.
  • A trie is also known as a prefix tree or dictionary tree
  • A trie is an ordered tree whose nodes usually store strings.
  • The trie is a data structure widely used for string lookup and prefix matching. The cost of a match depends only on the length of the queried string, not on the number of stored strings, so a query of length k takes O(k) time.
  • In a standard trie, all strings that share a common prefix hang under the same node, so every common prefix present in the strings is stored compactly.
  • the log similarity in the second log group G s is high, the template types are few, and the template spanning tree space complexity is low, which effectively improves the efficiency and accuracy of extracting templates from the logs.
  • Step S3 is specifically: the L words t_mi of the unparsed log sequence T_m are used as L queries (query items), and each path is traversed by depth-first search. The search starts from the root node P (an empty node) for the unparsed log sequence T_m and looks for the node holding the word t_mi.
  • If there is no matching node (node matching fails), the unparsed log sequence T_m is inserted into the tree structure as a new branch; if there is a matching node (node matching succeeds), the corresponding subtree is selected according to that node and the search continues in that subtree until every word t_mi has been searched, which yields the log template of the unparsed log sequence T_m.
  • Subsequent unparsed log sequences repeat steps S2 and S3 in turn to obtain their log templates. For example, Fig. 5 shows how the query log T_m: [connected,to,7342,exactly] is successfully matched against the template spanning tree to obtain a template and update the template spanning tree.
  • If a second log group G_s2 contains only one log sequence (a second log group created for the unparsed log sequence T_m alone), a template tree is built in that second log group from this log sequence, and the log template of the second log group is the log template of that log sequence.
  • The six unparsed log sequences T_m in the first log group G_L of Fig. 2 belong to six different second log groups G_s. If each second log group G_s contains only that one unparsed log sequence T_m, the log template of each second log group G_s is the log template of the unparsed log sequence T_m in the group.
  • The method provided by the present invention further includes: judging whether the log template is generated successfully in step S3, specifically by introducing a bounding value X, as shown in formula (3),
  • ∑unmatch(t_mi) represents the number of nodes that failed to match
  • L is the length of the log
  • σ is the number of wildcard characters '*' on the search path.
  • When X is less than or equal to the threshold λ, the template has been generated successfully: the template spanning tree is updated and the unmatched nodes are updated to the wildcard '*'. When X is greater than λ, the unparsed log sequence T_m is inserted into the template spanning tree as a new branch.
  • Fig. 6 shows the process of obtaining log templates with the online log parsing method of the present invention.
  • In Fig. 6 there are five unparsed log sequences T_m, and a template is obtained for each of them in turn using the algorithm above.
  • The online log parsing method of the present invention can automatically parse each newly generated unparsed log and quickly extract its template, with a high accuracy rate.
  • The method also includes introducing a maximum tree depth parameter MaxDepth: once a leaf node of the template spanning tree reaches MaxDepth, the untraversed words t_mi of the unparsed log sequence T_m are no longer inserted into the template spanning tree and are replaced by the wildcard '*'.
  • The present invention also includes a log parsing system that parses logs using the online log parsing method.
  • The system includes: a preprocessing module, a log sequence allocation module, a log string acquisition module, a log string similarity calculation module, an online clustering module, and a template spanning tree matching module.
  • Each module in the log analysis system is specifically: a preprocessing module, which is used to preprocess the unparsed log to obtain the unparsed log sequence corresponding to the unparsed log;
  • the log sequence allocation module allocates the unparsed log sequence to the corresponding first log group according to its log length
  • the log string acquisition module is used to extract the log sequences in the first log group to obtain the log string of each log sequence
  • the log string similarity calculation module is used to calculate the log string similarity of the log string of the unparsed log sequence and the log string of the parsed log sequence in the first log group;
  • An online clustering module configured to divide the unparsed log sequence into a corresponding second log group according to the similarity of the log string
  • the template spanning tree matching module is used to match the unparsed log sequence with the template spanning tree of the second log group to obtain a log template of the unparsed log sequence.
  • the log analysis system further includes a judgment module for judging whether the log template of the unparsed log sequence in step S3 is successfully obtained.
  • The preprocessing module, log sequence allocation module, log string acquisition module, log string similarity calculation module, online clustering module, template spanning tree matching module, and judgment module may each be one or more independent processors or processing chips, or may share one integrated processor; where necessary, they may also include memory, related interfaces, and a system transmission bus.
  • The processor or processing chip executes the program code to realize the corresponding function.
  • The present invention also includes an electronic terminal device comprising a processor, a memory, and a computer program stored in the memory and runnable on the processor; when executing the program, the processor implements the online log parsing method described in any of the above.
  • The present invention provides an online log parsing method, system, and electronic terminal device.
  • The method performs online parsing based on string clustering and a template spanning tree: it classifies logs by length, computes log string similarity, clusters the logs a second time based on that similarity, and finally extracts the log templates using the template spanning tree.
  • The method can extract log templates from unstructured logs efficiently and accurately, providing system administrators with concise, intuitive summaries and clean, standardized structured data, and making it easier for data analysts to perform more advanced analysis and processing of the logs.
  • The method requires neither source code analysis nor the involvement of domain experts. It can automatically parse each unparsed log and quickly extract its template, and it parses logs online efficiently, generically, and with high accuracy.
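The MaxDepth rule described in the list above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_at_max_depth` is hypothetical, and counting depth in words (ignoring the root and any end marker) is an assumption.

```python
def split_at_max_depth(sequence: list[str], max_depth: int, wildcard: str = "*"):
    """Return (words to insert into the tree, resulting template).

    Assumption: depth is counted in words, so only the first max_depth
    words are inserted; every word beyond that becomes a wildcard.
    """
    inserted = sequence[:max_depth]
    template = inserted + [wildcard] * (len(sequence) - len(inserted))
    return inserted, template


inserted, template = split_at_max_depth(["connected", "to", "7342", "exactly"], max_depth=2)
print(inserted)  # ['connected', 'to']
print(template)  # ['connected', 'to', '*', '*']
```

Capping the depth bounds the size of each branch, at the cost of losing constant words that appear late in long log messages.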

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

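An online log parsing method, system, and electronic terminal device thereof. The method comprises the following steps: S1, performing log preprocessing on each unparsed log to obtain multiple unparsed log sequences of different log lengths, and assigning them to the corresponding first log groups; S2, obtaining the log string of each log sequence in the first log group, calculating log string similarity, and clustering online based on the log string similarity; S3, using the unparsed log sequence as a query item and matching it against the common nodes of the templates in the template spanning tree of the second log group to obtain a template. The advantages are that logs are classified by length, clustered a second time based on log string similarity, and finally have their templates extracted with a template spanning tree; the method can extract log templates from unstructured logs efficiently and accurately, making it easier for data analysts to perform more advanced analysis and processing of the logs.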

Description

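Online log parsing method, system, and electronic terminal device thereof
Technical Field
The present invention relates to the technical field of log parsing, and in particular to an online log parsing method, system, and electronic terminal device based on string clustering and a template spanning tree.
Background
System logs record the operating information of a system. As important data for online monitoring and anomaly detection, they help administrators debug system failures and analyze anomalies, and thus further ensure the security of the system. Logs are typically massive and unstructured; log parsing converts large volumes of unstructured logs into structured data, that is, it extracts templates from the logs. As the first step of system anomaly detection, log parsing improves the efficiency of anomaly detection.
Log parsing methods are generally divided into offline methods and online methods. An offline method parses logs offline: the logs produced by the system over a period of time must be collected and loaded into memory to train a parsing model, and the model is then used to parse logs. Because parsing happens offline, logs cannot be parsed in real time and the system administrator cannot detect abnormal system behavior promptly. In addition, an offline method has to load the logs into memory for training; if a new log type appears after training, the parsing model must be trained again, which consumes additional time and human resources.
Unlike an offline method, an online method can parse newly generated logs in real time, so the system administrator can discover and handle system anomalies promptly. Current online log parsing methods nevertheless have two shortcomings: first, existing online methods leave room for improvement in both accuracy and efficiency; second, they are designed for specific log systems, so the parsing parameters of one system do not carry over to another, and parsing efficiency and accuracy suffer.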
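Summary of the Invention
The purpose of the present invention is to provide an online log parsing method, system, and electronic terminal device. The method performs online parsing based on string clustering and a template spanning tree: it classifies logs by length, computes log string similarity, clusters the logs a second time based on that similarity, and finally uses the template spanning tree to extract log templates. The method can extract log templates from unstructured logs efficiently and accurately, providing system administrators with concise, intuitive summaries and clean, standardized structured data, and making it easier for data analysts to perform more advanced analysis and processing of the logs.
To achieve the above purpose, the present invention is implemented through the following technical solutions:
An online log parsing method, comprising the following steps:
S1. Perform log preprocessing on each unparsed log to obtain multiple unparsed log sequences of different log lengths, and assign each unparsed log sequence to its corresponding first log group. A first log group is the set of log sequences having the same log length, and it contains both parsed and unparsed log sequences;
S2. Obtain the log string of each log sequence in the first log group, and calculate the log string similarity between the log string of the first unparsed log sequence in the first log group and the log string of each parsed log sequence in that group; cluster online based on the log string similarity to obtain the second log group to which the unparsed log sequence belongs;
S3. Use the unparsed log sequence as a query item and match it against the common nodes of the templates in the template spanning tree of the second log group to obtain the log template of the unparsed log sequence; subsequent unparsed log sequences repeat steps S2 and S3 in turn to obtain their log templates.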
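Preferably, step S1 specifically comprises:
using regular expressions to identify the matching variable information in the log header of each unparsed log, and deleting that variable information;
splitting the log content of each unparsed log with a predefined separator to obtain unparsed log sequences of different log lengths, and assigning each unparsed log sequence to a first log group according to its log length.
Preferably, step S2 specifically comprises:
S21. Obtain the log string of each log sequence in the first log group; the log strings of all parsed log sequences in the first log group form the log string set of that first log group;
S22. Calculate the log string similarity between the log string of the first unparsed log sequence in the first log group and each log string in the log string set;
S23. Cluster online based on the log string similarity to obtain the second log group to which the unparsed log sequence belongs. A second log group is the set of log sequences whose log string similarity reaches the similarity threshold, and one first log group contains several second log groups.
Preferably, in step S21,
the log string is the string composed of the first letter of each word in the log sequence.
Preferably, when the first character of a word is a digit or an identifier, it is represented by a wildcard; when the first character of a word is a letter, it is represented by the letter itself.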
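Preferably, the log string similarity in step S22 is calculated by the following formula:
Figure PCTCN2020098701-appb-000001
where sim(S_m1, S_m2) denotes the log string similarity, S_m1 and S_m2 each denote a log string, len(|S_m1|, |S_m2|) denotes the length of the two log strings, S_m1(i) denotes the i-th character of the log string S_m1, S_m2(j) denotes the j-th character of the log string S_m2, and Ⅱ_f(S_m1(i), S_m2(j)) is defined as follows:
Figure PCTCN2020098701-appb-000002
Preferably, step S23 is specifically:
set a similarity threshold; if there is a log string whose log string similarity is greater than the similarity threshold, merge the unparsed log sequence into the second log group corresponding to that log string; if no log string has a similarity greater than the similarity threshold, create a new second log group and update the log string set of the first log group.
Preferably, step S3 is specifically:
insert each word of the unparsed log sequence into the tree structure in turn, and mark the last node of each branch with a special character; start the search from the root node P for the unparsed log sequence and look for the node holding the word being sought. If there is no matching node, insert the unparsed log sequence into the tree structure as a new branch; if there is a matching node, select the corresponding subtree according to the matching node and continue the search in that subtree until every word of the unparsed log sequence has been searched, which yields the log template of the unparsed log sequence. Subsequent unparsed log sequences repeat steps S2 and S3 in turn to obtain their log templates.
Preferably, the method further comprises:
judging whether the log template is successfully obtained in step S3, specifically: introduce a bounding value X. When X is less than or equal to the threshold λ, the template has been generated successfully; the template spanning tree is updated and the unmatched nodes are updated to wildcards. When X is greater than the threshold λ, the unparsed log sequence is inserted into the template spanning tree as a new branch, where the bounding value X is expressed as:
Figure PCTCN2020098701-appb-000003
where ∑unmatch(t_mi) denotes the number of nodes that failed to match, L is the log length, and σ is the number of wildcards on the search path.
Preferably, the method further comprises:
introducing a maximum tree depth parameter MaxDepth; when a leaf node of the template spanning tree reaches MaxDepth, the untraversed words of the unparsed log sequence are no longer inserted into the template spanning tree and are all replaced by wildcards.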
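Preferably, a log parsing system that parses logs using the online log parsing method, the system comprising:
a preprocessing module, used to preprocess an unparsed log to obtain the unparsed log sequence corresponding to that unparsed log;
a log sequence allocation module, which assigns the unparsed log sequence to the corresponding first log group according to its log length;
a log string acquisition module, used to process the log sequences in the first log group and obtain the log string of each log sequence;
a log string similarity calculation module, used to calculate the log string similarity between the log string of the unparsed log sequence in the first log group and the log string of each parsed log sequence;
an online clustering module, used to assign the unparsed log sequence to the corresponding second log group according to the log string similarity;
a template spanning tree matching module, used to match the unparsed log sequence against the template spanning tree of the second log group to obtain the log template of the unparsed log sequence.
Preferably, the system further comprises:
a judgment module, used to judge whether the log template of the unparsed log sequence has been successfully obtained.
Preferably, an electronic terminal device, comprising:
a processor, a memory, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the online log parsing method described in any one of the above.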
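Compared with the prior art, the present invention has the following advantages:
(1) The online log parsing method of the present invention parses logs based on string clustering and a template spanning tree: it classifies logs by length, computes log string similarity, clusters the logs a second time based on that similarity, and finally uses the template spanning tree to extract log templates. The method can extract log templates from unstructured logs efficiently and accurately, providing system administrators with concise, intuitive summaries and clean, standardized structured data, and making it easier for data analysts to perform more advanced analysis and processing of the logs;
(2) The method introduces the concept of the log string and computes log string similarity. System logs of the same length are clustered online based on that similarity, which reduces the useless comparison steps during log template extraction and improves log parsing efficiency;
(3) The present invention adopts a template extraction method based on a template spanning tree and uses the tree to extract and update templates, which improves the accuracy of online log parsing.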
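Brief Description of the Drawings
Fig. 1 is an example of the structure of a log;
Fig. 2 is a first log group with a log length of 4 in the embodiment;
Fig. 3 is a log template set corresponding to the first log group in Fig. 2;
Fig. 4 is the log string set corresponding to the first log group in Fig. 2;
Fig. 5 is a schematic diagram of the log template generation process;
Fig. 6 is a schematic diagram of parsing with the online log parsing method of the present invention.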
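Detailed Description
The present invention is further explained below through a detailed description of a preferred embodiment with reference to the drawings.
Log parsing is an efficient and accurate way to convert unstructured logs into a structured form; it extracts hidden information from massive system logs to guide subsequent operations.
The present invention discloses an online log parsing method comprising the following steps:
S1. Perform log preprocessing on each unparsed log to obtain multiple unparsed log sequences of different log lengths, and assign each unparsed log sequence to its corresponding first log group. A first log group is the set of log sequences having the same log length, and it contains both parsed and unparsed log sequences.
As shown in Fig. 1, during log parsing each log m can be expressed as:
m = {m_h, m_c}
where m_h denotes the log header and m_c denotes the log content. The log header m_h contains trivial variable information such as the timestamp, location, log type, security, and facility, whereas log parsing mainly needs to process the log content m_c. Because the variable information in the log header m_h is a structured message type that is intuitive and easy for system administrators to understand, regular expressions alone suffice to identify and match the log header. Log parsing mainly consists of extracting the log template from the log content m_c.
Step S1 is specifically: first, regular expressions are used to identify the matching variable information in the log header m_h of each unparsed log m, and that variable information is deleted. The regular expressions here are very simple, because they only need to match simple tokens such as timestamps and IP addresses rather than the log content m_c; usually at most two regular expressions are enough to identify the log header m_h of every unparsed log m.
Next, the log content m_c of each unparsed log m is split using a predefined separator to obtain unparsed log sequences of different log lengths. A space separator can usually handle the splitting of most system logs. The separator splits the log content m_c into an unparsed log sequence T_m of length L (the size of a log sequence is usually called its length):
T_m = [t_m1, t_m2, …, t_mL]
where t_mi is the i-th word in the unparsed log sequence T_m; a log sequence length of L means that the unparsed log sequence T_m contains L words in total.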
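The preprocessing in step S1 can be sketched as follows. This is a minimal illustration under stated assumptions: the two header patterns (a timestamp and an IPv4 address), the sample log line, and the function name `preprocess` are all hypothetical, and for simplicity the patterns are applied to the whole line rather than to a separately identified header.

```python
import re

# Hypothetical header patterns: a timestamp and an IPv4 address.
HEADER_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
]


def preprocess(raw_log: str, separator: str = " ") -> list[str]:
    """Delete header fields, then split the remaining content into words."""
    content = raw_log
    for pattern in HEADER_PATTERNS:
        content = pattern.sub("", content)
    # Drop the empty strings left behind by consecutive separators.
    return [word for word in content.split(separator) if word]


sequence = preprocess("2005-06-03 15:42:50 10.0.0.1 connected to 7342 exactly")
print(sequence)       # ['connected', 'to', '7342', 'exactly']
print(len(sequence))  # 4
```

The length of the returned list is the log length L used to pick the first log group.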
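Finally, each unparsed log sequence T_m is assigned to a first log group according to its log length. Each first log group contains all log sequences with the same log length and is denoted G_L. In practice, a first log group G_L may still contain logs with different templates, as Fig. 2 and Fig. 3 together show. The log data in Fig. 2 comes from the log content m_c of supercomputer system (BlueGene/L, BGL) logs and contains six unparsed logs. Fig. 3 is a log template set corresponding to Fig. 2 (matching against different log sequences yields different log templates); the log template set is the collection of the log templates of all log sequences. Fig. 2 shows a first log group G_L in which every log sequence has a log length of 4, yet the templates of the log sequences all differ, so extracting templates immediately after this step would lead to higher complexity.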
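Grouping by log length can be sketched as below. The function name `group_by_length` and the sample sequences are hypothetical; the sketch only shows that the length L serves as the key of each first log group.

```python
from collections import defaultdict


def group_by_length(sequences: list[list[str]]) -> dict[int, list[list[str]]]:
    """Map each log length L to the list of log sequences of that length."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq)].append(seq)
    return dict(groups)


first_groups = group_by_length([
    ["connected", "to", "7342", "exactly"],
    ["session", "closed"],
    ["connected", "to", "9001", "exactly"],
])
print(sorted(first_groups))   # [2, 4]
print(len(first_groups[4]))   # 2
```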
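To solve this problem, we introduce a new concept: the log string, denoted S_m. The log string S_m is the string composed of the first character of each word in the log sequence. Compared with the log length, the log string S_m is a more representative description of a log.
S2. Obtain the log string of each log sequence in the first log group, and calculate the log string similarity between the log string of the first unparsed log sequence T_m in the first log group and the log string of each parsed log sequence in that group; cluster online based on the log string similarity to obtain the second log group to which the unparsed log sequence T_m belongs. In other words, the log string similarity is computed and used to cluster the log sequences of the first log group G_L a second time, yielding second log groups, denoted G_s, whose templates are more similar.
Step S2 specifically comprises: S21. Obtain the log string of each log sequence in the first log group G_L; the log strings of all parsed log sequences in the first log group G_L form the log string set of that first log group G_L.
In step S21, to keep the time complexity of log string extraction low, we take the first character of each word of a log sequence in the first log group G_L as that word's representative character, and then arrange the first characters of all words into a string according to the positions of their words in the log sequence; this string is called the log string S_m of the log sequence. The set of the log strings S_m of all parsed log sequences in a first log group G_L is called the log string set of that first log group G_L, denoted S. That is, each log string S_m is stored in the log string set S as one string type.
To keep the number of string types in the log string set S from exploding, when the first character of a word is a digit or an identifier it is represented by a wildcard, and when the first character of a word is a letter it is represented by the letter itself. In this embodiment, the wildcard is written as *.
As an example, consider Fig. 2 and Fig. 4 together. Fig. 4 is the log string set S corresponding to Fig. 2. Although every log sequence in Fig. 2 has the same length, each has a unique log string S_m: the log string S_m of the first log sequence is "*dae", that of the second log sequence is "dTei", and the subsequent ones are "esr*", "piii", "fpie", and "piuo". The log string S_m of every log sequence is different, so the log string S_m is a more representative description of a log.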
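Log string extraction can be sketched as follows. The function name `log_string` is hypothetical, and treating every non-letter first character as "a digit or an identifier" is an assumption; the text does not define identifiers precisely. The sample sequence is the query log shown for Fig. 5.

```python
def log_string(sequence: list[str], wildcard: str = "*") -> str:
    """Build the log string from the first character of each word.

    Assumption: any first character that is not a letter is replaced by
    the wildcard. Words are assumed to be non-empty.
    """
    return "".join(word[0] if word[0].isalpha() else wildcard for word in sequence)


print(log_string(["connected", "to", "7342", "exactly"]))  # ct*e
```

Letter case is preserved, which is consistent with log strings such as "dTei" in the example above.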
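S22. Compute the log string similarity between the log string of the first unparsed log sequence T_m in the first log group and each log string in the log string set S.
The log string similarity is computed according to formula (1):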
Figure PCTCN2020098701-appb-000004
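where sim(S_m1, S_m2) denotes the log string similarity, S_m1 and S_m2 denote the two log strings being compared, len(|S_m1|, |S_m2|) denotes the length of the two log strings, S_m1(i) denotes the i-th character of log string S_m1, S_m2(j) denotes the j-th character of log string S_m2, and Ⅱ_f(S_m1(i), S_m2(j)) is defined as follows: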
Figure PCTCN2020098701-appb-000005
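S23. Cluster online on the basis of the log string similarity to obtain the second log group G_s to which the unparsed log sequence T_m belongs. A second log group G_s is a set of log sequences whose log string similarities reach the similarity threshold, and one first log group G_L contains several second log groups G_s.
Specifically, step S23 sets a similarity threshold st. If there is a log string whose similarity is greater than st, the log strings S_m1 and S_m2 are similar (that is, the log string of the unparsed log sequence T_m is similar to the log string it was compared with), and T_m is merged into the second log group G_s1 corresponding to that log string; the log string similarities of the log sequences within G_s1 are all greater than st. If no log string has a similarity greater than st, a new second log group G_s2 is created and the log string set S of the first log group G_L is updated.
After this step, the number of unprocessed log sequences T_m in each first log group G_L that do not yet belong to a second log group G_s decreases, and the probability that the log sequences within each second log group G_s share one template increases. Performing the subsequent template extraction within second log groups G_s of high log similarity greatly reduces the time complexity of the algorithm.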
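The online clustering of step S23 can be sketched as follows. Two points are assumptions rather than statements of the source: formula (1) is available here only as a figure, so `positional_similarity` (the fraction of positions holding the same character) merely stands in for it, and the log string set S is simplified to one representative string per second log group. When several strings exceed the threshold, this sketch picks the most similar one, which the text does not prescribe.

```python
def positional_similarity(a, b):
    """Assumed stand-in for formula (1): share of positions with equal characters."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def assign_second_group(s_m, string_set, st, sim=positional_similarity):
    """Return the index of the second log group for log string s_m.

    string_set plays the role of S (one representative per group, an
    assumption); it is extended in place when a new group is created.
    """
    best_idx, best_score = -1, st
    for idx, representative in enumerate(string_set):
        score = sim(s_m, representative)
        if score > best_score:          # strictly greater than the threshold st
            best_idx, best_score = idx, score
    if best_idx == -1:                  # no similar string: open a new group
        string_set.append(s_m)
        return len(string_set) - 1
    return best_idx

S = []
g1 = assign_second_group("ct*e", S, st=0.5)   # first string opens group 0
g2 = assign_second_group("ct*f", S, st=0.5)   # 3 of 4 positions match
g3 = assign_second_group("piii", S, st=0.5)   # nothing similar, new group
```

Passing a different `sim` callable lets the actual formula (1) be substituted without touching the clustering logic.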
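S3. Obtain the template within the second log group G_s on the basis of a template generation tree: the unparsed log sequence T_m is used as the query and matched against the templates in the template generation tree of G_s to find common nodes, which yields the log template of T_m (that is, the log template of the unparsed log). Steps S2 and S3 are then repeated for each subsequent unparsed log sequence to obtain its log template.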
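A Trie, also called a prefix tree or dictionary tree, is an ordered tree whose nodes usually store strings. It is a data structure widely used for string lookup, prefix matching, and similar tasks; the cost of each match depends only on the length of the queried string, not on the number of strings stored, so a lookup takes O(L) time for a query of length L. In a standard Trie, all strings that share a common prefix hang under the same node, so the tree concisely stores every common prefix present in the strings. Following the Trie idea, we build a template generation tree for each second log group G_s and match the unparsed log sequence T_m, as the query, against the templates in the tree to find common nodes. Because the logs within a second log group G_s are highly similar and have few template types, the template generation tree has low space complexity, which effectively improves the efficiency and accuracy of extracting templates from logs.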
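A log content m_c has already been split into a log sequence by the steps above. The log sequence is traversed and each of its words is inserted into the tree structure in turn, and the last node of each branch is marked with a special character. The path from the root node to any special character then represents the matching path of one log content m_c. The tree depth is L+1, where L is the log length. In this embodiment the special character is '$'.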
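The word-level tree with a '$' end marker can be sketched with nested dictionaries. This representation and the names `insert_sequence` and `paths` are assumptions made for illustration; the source does not say how nodes are stored. The sketch also assumes no log word is literally "$".

```python
END = "$"  # special character marking the last node of a branch

def insert_sequence(root, seq):
    """Insert the words of a log sequence along one branch of the tree."""
    node = root
    for word in seq:
        node = node.setdefault(word, {})  # reuse the child if the prefix exists
    node[END] = {}

def paths(root, prefix=()):
    """Yield every root-to-'$' path, i.e. every stored log sequence."""
    for key, child in root.items():
        if key == END:
            yield list(prefix)
        else:
            yield from paths(child, prefix + (key,))

tree = {}
insert_sequence(tree, ["connected", "to", "7342", "exactly"])
insert_sequence(tree, ["connected", "to", "2312", "error"])
stored = list(paths(tree))
```

The two sequences share the nodes for "connected" and "to" and diverge only at the third word, so each branch has L word nodes plus the terminal marker, matching the stated depth of L+1.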
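Specifically, step S3 treats the words t_mi of the unparsed log sequence T_m as L queries and traverses each path by depth-first search. The search starts from the root node P (an empty node) of the template generation tree and looks for the node holding the word t_mi. If no matching node exists (node matching fails), the unparsed log sequence T_m is inserted into the tree structure as a new branch. If a matching node exists (node matching succeeds), the corresponding subtree is selected according to that node and the search continues in that subtree until every word t_mi has been searched, which yields the log template of T_m. Steps S2 and S3 are then repeated for each subsequent unparsed log sequence to obtain its log template. For example, Fig. 5 shows how the query log T_m: [connected, to, 7342, exactly] is successfully matched against the template generation tree to obtain a template, and how the tree is updated.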
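Suppose the word t_mi of the unparsed log sequence T_m is identical to the word stored in the i-th node Q of the template generation tree; then P = Q, and the search continues with t_m(i+1) in place of t_mi. If t_m(i+1) fails to match the (i+1)-th node, the failed node is recorded, the root node P is updated to the (i+2)-th node, and t_m(i+2) goes on to be matched against the (i+2)-th node. The search for one log ends only after all L queries have been completed. The words stored in successfully matched nodes are extracted, and each failed node returns the wildcard '*', which together give the log template.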
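The node-by-node matching rule can be sketched against a single stored path. This is a simplification under stated assumptions: the tree walk is reduced to a position-wise comparison with one existing template, a stored wildcard node is assumed to match any word, and the name `merge_template` is hypothetical.

```python
def merge_template(template, seq, wildcard="*"):
    """Compare a query sequence with one stored path, node by node.

    Matched nodes keep their stored word; failed nodes become the wildcard.
    Returns the resulting template and the number of failed nodes.
    """
    if len(template) != len(seq):
        raise ValueError("sequences in one first log group share a length")
    merged, unmatched = [], 0
    for node_word, word in zip(template, seq):
        if node_word == word or node_word == wildcard:
            merged.append(node_word)    # match succeeded, keep the node's word
        else:
            merged.append(wildcard)     # match failed, record and return '*'
            unmatched += 1
    return merged, unmatched

template, failed = merge_template(
    ["connected", "to", "2312", "error"],
    ["connected", "to", "7342", "exactly"],
)
```

The count of failed nodes returned here is the kind of quantity the success check on the generated template relies on.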
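If a second log group G_s2 contains only one log sequence (a second log group created for a single unparsed log sequence T_m), a template tree is built in G_s2 from that log sequence, and the log template of the second log group is simply the log template of that sequence. As shown in Fig. 2, the six unparsed log sequences T_m in the first log group G_L of Fig. 2 belong to six different second log groups G_s; if each second log group G_s contains only that one unparsed log sequence T_m, the log template of each second log group G_s is the log template of the unparsed log sequence T_m it contains.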
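In addition, the method provided by the present invention further comprises determining whether the log template in step S3 was generated successfully. Specifically, a delimiting value X is introduced, as given by formula (3):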
Figure PCTCN2020098701-appb-000006
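where ∑unmatch(t_mi) denotes the number of nodes that failed to match, L is the log length, and σ is the number of wildcards '*' on the search path.
When X is less than or equal to a threshold λ, template generation has succeeded: the template generation tree is updated and the unmatched nodes are replaced with the wildcard '*'. When X is greater than λ, the unparsed log sequence T_m is inserted into the template generation tree as a new branch.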
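The threshold decision can be sketched as below. Formula (3) is available here only as a figure, so the sketch does not compute X; it takes X as an input supplied by the caller. Storing templates in a flat list instead of a tree, and the name `apply_threshold`, are likewise assumptions for illustration.

```python
def apply_threshold(templates, index, candidate, seq, x, lam):
    """Apply the success check on a generated template.

    templates: list of stored templates (a flat stand-in for the tree)
    index:     position of the template the query was matched against
    candidate: that template with failed nodes replaced by '*'
    x, lam:    delimiting value X (computed elsewhere) and threshold lambda
    """
    if x <= lam:
        templates[index] = list(candidate)  # success: update with wildcards
        return "updated"
    templates.append(list(seq))             # failure: add the log as a new branch
    return "inserted"

store = [["connected", "to", "2312", "error"]]
first = apply_threshold(store, 0, ["connected", "to", "*", "*"],
                        ["connected", "to", "7342", "exactly"], x=0.5, lam=0.5)
second = apply_threshold(store, 0, ["*", "*", "*", "*"],
                         ["link", "is", "now", "down"], x=1.0, lam=0.5)
```

The boundary case X = λ counts as success, following the "less than or equal to" wording of the text.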
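For example, Fig. 6 shows the process of obtaining log templates with the online log parsing method of the present invention. Fig. 6 contains five unparsed log sequences T_m, and the template of each is obtained in order with the algorithm above. For instance, when the unparsed log sequence T_m1 [connected, to, 2312, error] is processed, it is first assigned to the first log group G_L2 of length 4. When a second log group is sought for it, no log string has a similarity greater than the similarity threshold, so a new second log group G_S2 is created and a template tree is built in G_S2 from T_m1; at this point the log template of G_S2 is the template of T_m1, [connected, to, 2312, error]. The log sequences that follow T_m1 repeat this process, and in the end a template generation tree rebuilt from all log sequences in G_S2 generates the template. As this shows, the online log parsing method of the present invention can automatically parse each newly generated unparsed log and quickly extract its template, with relatively high accuracy.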
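Furthermore, the method introduces a maximum tree depth parameter MaxDepth. When a leaf node of the template generation tree reaches MaxDepth, the words t_mi of the unparsed log sequence T_m that have not yet been traversed are no longer inserted into the template generation tree and are all replaced with the wildcard '*'.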
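The effect of MaxDepth on a sequence can be sketched as follows. The sketch assumes that depth is counted in word levels, so the first `max_depth` words are kept and the rest become wildcards; the function name `truncate_to_max_depth` is hypothetical.

```python
def truncate_to_max_depth(seq, max_depth, wildcard="*"):
    """Keep the first max_depth words; replace untraversed words with '*'."""
    kept = list(seq[:max_depth])
    return kept + [wildcard] * max(0, len(seq) - max_depth)

limited = truncate_to_max_depth(["connected", "to", "7342", "exactly"], 2)
```

The result keeps the original length L, so the sequence still belongs to the same first log group after truncation.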
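In addition, the present invention includes a log parsing system that parses logs with the online log parsing method. The system comprises a preprocessing module, a log sequence assignment module, a log string acquisition module, a log string similarity calculation module, an online clustering module, and a template generation tree matching module.
The modules of the log parsing system are as follows. The preprocessing module preprocesses an unparsed log to obtain the corresponding unparsed log sequence;
the log sequence assignment module assigns the unparsed log sequence to the corresponding first log group according to its log length;
the log string acquisition module extracts from the log sequences in the first log group to obtain the log string of each log sequence;
the log string similarity calculation module computes the log string similarity between the log string of an unparsed log sequence in the first log group and the log string of each parsed log sequence;
the online clustering module assigns the unparsed log sequence to the corresponding second log group according to the log string similarity;
the template generation tree matching module matches the unparsed log sequence against the template generation tree of the second log group to obtain the log template of the unparsed log sequence.
Preferably, the log parsing system further comprises a judgment module for determining whether the log template of the unparsed log sequence in step S3 was obtained successfully.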
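The preprocessing module, log sequence assignment module, log string acquisition module, log string similarity calculation module, online clustering module, template generation tree matching module, and judgment module may each be one or more independent processors or processing chips, or may share a single integrated processor; where needed they may also include memory, associated interfaces, and a system transmission bus. The processor or processing chip executes the relevant program code to implement the corresponding functions.
In addition, the present invention includes an electronic terminal device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the online log parsing method of any one of the embodiments when executing the program.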
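In summary, the present invention provides an online log parsing method, a system, and an electronic terminal device thereof. The method parses logs online on the basis of string clustering and a template generation tree: it classifies logs by length, computes log string similarity, clusters the logs a second time by that similarity, and finally extracts log templates with the template generation tree. The method extracts log templates from unstructured logs efficiently and accurately, gives system administrators concise, intuitive summaries and clean, normalized structured data, and makes it easier for data analysts to perform more advanced analysis and processing of logs. Moreover, the method requires neither source code analysis nor the involvement of domain experts; it automatically parses each unparsed log and quickly extracts its template, so logs can be parsed online efficiently and generically with relatively high accuracy.
Although the present invention has been described in detail through the preferred embodiment above, the description should not be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the above. The scope of protection of the present invention should therefore be defined by the appended claims.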

Claims (13)

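  1. An online log parsing method, characterized in that the method comprises the following steps:
    S1. preprocessing each unparsed log to obtain unparsed log sequences of various log lengths, and assigning each unparsed log sequence to its corresponding first log group, wherein the first log group is a set of log sequences having the same log length and contains parsed log sequences and unparsed log sequences;
    S2. obtaining the log string of each log sequence in the first log group, computing the log string similarity between the log string of the first unparsed log sequence in the first log group and the log string of each parsed log sequence in the first log group, and clustering online on the basis of the log string similarity to obtain the second log group to which the unparsed log sequence belongs;
    S3. using the unparsed log sequence as a query, matching it against the templates in the template generation tree of the second log group to find common nodes, and obtaining the log template of the unparsed log sequence, wherein steps S2 and S3 are repeated in turn for subsequent unparsed log sequences to obtain their log templates.
  2. The online log parsing method according to claim 1, characterized in that step S1 specifically comprises:
    using regular expressions to identify the variable information matched in the log header of each unparsed log and deleting the variable information;
    splitting the log content of each unparsed log with predefined delimiters to obtain unparsed log sequences of various log lengths, and assigning the unparsed log sequences to different first log groups according to their log lengths.
  3. The online log parsing method according to claim 1, characterized in that step S2 specifically comprises:
    S21. obtaining the log string of each log sequence in the first log group, and forming the log string set of the first log group from the log strings of all parsed log sequences in the first log group;
    S22. computing the log string similarity between the log string of the first unparsed log sequence in the first log group and each log string in the log string set;
    S23. clustering online on the basis of the log string similarity to obtain the second log group to which the unparsed log sequence belongs, wherein the second log group is a set of log sequences whose log string similarities reach a similarity threshold, and one first log group contains several second log groups.
  4. The online log parsing method according to claim 3, characterized in that, in step S21,
    the log string is the string formed from the first letter of each word in a log sequence.
  5. The online log parsing method according to claim 4, characterized in that
    when the first character of a word is a digit or a symbol, it is represented by a wildcard, and when the first character of a word is a letter, it is represented by the letter itself.
  6. The online log parsing method according to claim 3, characterized in that the log string similarity in step S22 is computed by the formula: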
    Figure PCTCN2020098701-appb-100001
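    where sim(S_m1, S_m2) denotes the log string similarity, S_m1 and S_m2 each denote a log string, len(|S_m1|, |S_m2|) denotes the length of the two log strings, S_m1(i) denotes the i-th character of log string S_m1, S_m2(j) denotes the j-th character of log string S_m2, and Ⅱ_f(S_m1(i), S_m2(j)) is defined as follows: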
    Figure PCTCN2020098701-appb-100002
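  7. The online log parsing method according to claim 3, characterized in that step S23 is specifically:
    setting a similarity threshold; if there is a log string whose log string similarity is greater than the similarity threshold, merging the unparsed log sequence into the second log group corresponding to that log string; if no log string has a log string similarity greater than the similarity threshold, creating a new second log group and updating the log string set of the first log group.
  8. The online log parsing method according to claim 3, characterized in that step S3 is specifically:
    inserting each word of the unparsed log sequence into the tree structure in turn and marking the last node of each branch with a special character; starting the search from the root node P of the unparsed log sequence and locating the node of the word being sought; if no matching node exists, inserting the unparsed log sequence into the tree structure as a new branch; if a matching node exists, selecting the corresponding subtree according to the matching node and continuing the search in that subtree until every word of the unparsed log sequence has been searched, thereby obtaining the log template of the unparsed log sequence, wherein steps S2 and S3 are repeated in turn for subsequent unparsed log sequences to obtain their log templates.
  9. The online log parsing method according to claim 1 or 8, characterized in that the method further comprises:
    determining whether the log template of the unparsed log sequence in step S3 was obtained successfully, specifically: introducing a delimiting value X; when the delimiting value X is less than or equal to a threshold λ, template generation has succeeded, the template generation tree is updated, and the unmatched nodes are updated to wildcards; when the delimiting value X is greater than the threshold λ, the unparsed log sequence is inserted into the template generation tree as a new branch, wherein the delimiting value X is expressed as: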
    Figure PCTCN2020098701-appb-100003
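    where ∑unmatch(t_mi) denotes the number of nodes that failed to match, L is the log length, and σ is the number of wildcards on the search path.
  10. The online log parsing method according to claim 1, characterized in that the method further comprises:
    introducing a maximum tree depth parameter MaxDepth; when a leaf node of the template generation tree reaches the maximum tree depth parameter MaxDepth, all untraversed words of the unparsed log sequence are no longer inserted into the template generation tree and are replaced with wildcards.
  11. A log parsing system that parses logs with the online log parsing method according to claims 1 to 10, characterized in that the system comprises:
    a preprocessing module for preprocessing an unparsed log to obtain the unparsed log sequence corresponding to the unparsed log;
    a log sequence assignment module that assigns the unparsed log sequence to the corresponding first log group according to its log length;
    a log string acquisition module for extracting from the log sequences in the first log group to obtain the log string of each log sequence;
    a log string similarity calculation module for computing the log string similarity between the log string of an unparsed log sequence in the first log group and the log string of each parsed log sequence;
    an online clustering module for assigning the unparsed log sequence to the corresponding second log group according to the log string similarity;
    a template generation tree matching module for matching the unparsed log sequence against the template generation tree of the second log group to obtain the log template of the unparsed log sequence.
  12. The log parsing system according to claim 11, characterized by further comprising:
    a judgment module for determining whether the log template of the unparsed log sequence was obtained successfully.
  13. An electronic terminal device, characterized by comprising:
    a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the online log parsing method according to any one of claims 1 to 10 when executing the program.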
PCT/CN2020/098701 2019-11-06 2020-06-29 Online log analysis method, system, and electronic terminal device thereof WO2021088385A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP20789812.3A EP3846048A1 (en) 2019-11-06 2020-06-29 Online log analysis method, system, and electronic terminal device thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911077285.8 2019-11-06
CN201911077285.8A CN110888849B (zh) 2019-11-06 2019-11-06 Online log analysis method, system, and electronic terminal device thereof

Publications (1)

Publication Number Publication Date
WO2021088385A1 true WO2021088385A1 (zh) 2021-05-14

Family

ID=69746909

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/098701 WO2021088385A1 (zh) 2019-11-06 2020-06-29 Online log analysis method, system, and electronic terminal device thereof

Country Status (3)

Country Link
EP (1) EP3846048A1 (zh)
CN (1) CN110888849B (zh)
WO (1) WO2021088385A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535955A (zh) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 一种日志快速归类方法及装置
CN115017268A (zh) * 2022-08-04 2022-09-06 北京航空航天大学 一种基于树结构的启发式日志抽取方法及系统
CN115185525A (zh) * 2022-05-17 2022-10-14 贝壳找房(北京)科技有限公司 数据倾斜代码块定位方法、装置、设备、介质及程序产品
CN115860836A (zh) * 2022-12-07 2023-03-28 广东南粤分享汇控股有限公司 一种基于用户行为大数据分析的电商服务推送方法及系统
CN117033464A (zh) * 2023-08-11 2023-11-10 上海鼎茂信息技术有限公司 一种基于聚类的日志并行解析算法及应用
CN117407242A (zh) * 2023-10-10 2024-01-16 浙江大学 基于大语言模型的低成本、零样本的在线日志解析方法

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888849B (zh) * 2019-11-06 2022-07-22 国网上海市电力公司 一种在线日志解析方法、系统及其电子终端设备
CN111581220A (zh) * 2020-05-28 2020-08-25 泰康保险集团股份有限公司 用于时间序列数据的存储及检索方法、装置、设备及存储介质
CN111832280B (zh) * 2020-07-09 2023-06-30 北京奇艺世纪科技有限公司 剧本信息处理方法、装置、电子设备及存储介质
CN112000806B (zh) * 2020-08-25 2023-06-16 携程旅游信息技术(上海)有限公司 异常日志监控分析方法、系统、设备及存储介质
CN113254438A (zh) * 2020-11-20 2021-08-13 云智慧(北京)科技有限公司 一种基于树结构的日志解析方法和系统
CN112463957B (zh) * 2020-12-14 2023-06-02 清华大学 非结构化文本日志流的摘要提取方法和装置
CN112732655B (zh) * 2021-01-13 2024-02-06 北京六方云信息技术有限公司 针对无格式日志的在线解析方法及系统
CN112882997B (zh) * 2021-02-19 2022-06-07 武汉大学 一种基于N-gram与频繁模式挖掘的系统日志解析方法
CN112883004B (zh) * 2021-02-24 2023-04-07 上海浦东发展银行股份有限公司 一种基于日志聚合的日志知识库与健康度获取方法及系统
CN113590421B (zh) * 2021-07-27 2024-04-26 招商银行股份有限公司 日志模板提取方法、程序产品及存储介质
CN113595787B (zh) * 2021-07-27 2024-03-29 招商银行股份有限公司 基于日志模板的实时日志自动告警方法、程序及介质
CN114598597B (zh) * 2022-02-24 2023-12-01 烽台科技(北京)有限公司 多源日志解析方法、装置、计算机设备及介质
JP7466878B2 (ja) 2022-06-16 2024-04-15 ソフトバンク株式会社 情報処理装置、情報処理方法及びプログラム
CN115543950B (zh) * 2022-09-29 2023-06-16 杭州中电安科现代科技有限公司 一种日志范化的数据处理系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180091404A1 (en) * 2016-09-23 2018-03-29 Hewlett Packard Enterprise Development Lp Identifying problematic messages
CN109144964A (zh) * 2018-08-21 2019-01-04 杭州安恒信息技术股份有限公司 基于机器学习的日志解析方法和装置
CN109981625A (zh) * 2019-03-18 2019-07-05 中国人民解放军陆军炮兵防空兵学院郑州校区 一种基于在线层次聚类的日志模板抽取方法
CN110175158A (zh) * 2019-05-23 2019-08-27 湖南大学 一种基于向量化的日志模板提取方法和系统
CN110888849A (zh) * 2019-11-06 2020-03-17 国网上海市电力公司 一种在线日志解析方法、系统及其电子终端设备

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339293B (zh) * 2016-08-20 2019-04-12 南京理工大学 一种基于签名的日志事件提取方法
US11113317B2 (en) * 2016-09-29 2021-09-07 Micro Focus Llc Generating parsing rules for log messages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3846048A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535955A (zh) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 一种日志快速归类方法及装置
CN115185525A (zh) * 2022-05-17 2022-10-14 贝壳找房(北京)科技有限公司 数据倾斜代码块定位方法、装置、设备、介质及程序产品
CN115017268A (zh) * 2022-08-04 2022-09-06 北京航空航天大学 一种基于树结构的启发式日志抽取方法及系统
CN115017268B (zh) * 2022-08-04 2022-10-11 北京航空航天大学 一种基于树结构的启发式日志抽取方法及系统
CN115860836A (zh) * 2022-12-07 2023-03-28 广东南粤分享汇控股有限公司 一种基于用户行为大数据分析的电商服务推送方法及系统
CN115860836B (zh) * 2022-12-07 2023-09-26 广东南粤分享汇控股有限公司 一种基于用户行为大数据分析的电商服务推送方法及系统
CN117033464A (zh) * 2023-08-11 2023-11-10 上海鼎茂信息技术有限公司 一种基于聚类的日志并行解析算法及应用
CN117033464B (zh) * 2023-08-11 2024-04-02 上海鼎茂信息技术有限公司 一种基于聚类的日志并行解析算法及应用
CN117407242A (zh) * 2023-10-10 2024-01-16 浙江大学 基于大语言模型的低成本、零样本的在线日志解析方法
CN117407242B (zh) * 2023-10-10 2024-04-05 浙江大学 基于大语言模型的低成本、零样本的在线日志解析方法

Also Published As

Publication number Publication date
EP3846048A4 (en) 2021-07-07
CN110888849B (zh) 2022-07-22
CN110888849A (zh) 2020-03-17
EP3846048A1 (en) 2021-07-07

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020789812

Country of ref document: EP

Effective date: 20201022

NENP Non-entry into the national phase

Ref country code: DE