WO2021052177A1 - Log parsing method, apparatus, server and storage medium - Google Patents

Log parsing method, apparatus, server and storage medium

Info

Publication number
WO2021052177A1
WO2021052177A1 (PCT/CN2020/113060; CN2020113060W)
Authority
WO
WIPO (PCT)
Prior art keywords
log
cluster
sample
clusters
quality score
Prior art date
Application number
PCT/CN2020/113060
Other languages
English (en)
French (fr)
Inventor
韩静
刘建伟
陈力
叶峰
刘峥
凌航
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Priority to US 17/624,243 (published as US20220365957A1)
Priority to EP 20866066.2A (published as EP3968178A4)
Publication of WO2021052177A1

Classifications

    • G06F16/35: Clustering; Classification (information retrieval of unstructured textual data)
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F16/258: Data format conversion from or to a database
    • G06F16/322: Indexing structures; Trees
    • G06F17/40: Data acquisition and logging
    • H04L41/0636: Management of faults, events, alarms or notifications using root cause analysis based on a decision tree analysis
    • H04L41/069: Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • G06F11/0751: Error or fault detection not based on redundancy

Definitions

  • The embodiments of the present application relate to the field of network operation and maintenance, and in particular to a log parsing method, apparatus, server, and storage medium.
  • Log parsing is to convert unstructured log text into structured log text.
  • Log parsing has important applications in areas such as root cause location of system errors and anomaly detection. From the log parsing results, the running sequence of programs in the system can be clearly understood, which is used for constructing program workflows in the system and for detecting anomalies.
  • Existing log parsing methods include: parsing log templates online in streaming form according to the longest common subsequence, usually generating prefix trees in actual use to reduce the search time; and using the first several fields of a log to construct a log parse tree with a fixed depth, to speed up the search.
  • the purpose of the embodiments of the present application is to provide a log analysis method, device, server, and storage medium.
  • The embodiment of the present application provides a log parsing method, including: obtaining sample log data; performing clustering processing on the sample log data according to the length and the first and last keywords of each sample log in the sample log data to obtain multiple log clusters; determining the quality score of each log cluster in the multiple log clusters; and parsing logs using the multiple log clusters and the quality scores; wherein the quality score is used to determine the adaptive similarity threshold of a log when the log is parsed.
  • The embodiment of the present application also provides a log parsing device, which includes: an acquisition module for acquiring sample log data; a clustering module for clustering the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters; a scoring module for determining the quality score of each log cluster in the multiple log clusters; and a parsing module for parsing logs using the multiple log clusters and the quality scores; wherein the quality score is used to determine the adaptive similarity threshold of a log when the log is parsed.
  • The embodiment of the present application also provides a server, including: at least one processor; and a memory communicatively connected with the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the foregoing log parsing method.
  • The embodiment of the present application also provides a computer-readable storage medium that stores a computer program, where the computer program, when executed by a processor, implements the above-mentioned log parsing method.
  • Fig. 1 is a flowchart of a log analysis method according to the first embodiment of the present application
  • Fig. 2 is a flowchart of a log analysis method according to a second embodiment of the present application
  • Fig. 3 is a flowchart of a log analysis method according to a third embodiment of the present application.
  • Fig. 4 is a flowchart of a log analysis method according to a fourth embodiment of the present application.
  • Fig. 5 is a flowchart of a log analysis method according to a fifth embodiment of the present application.
  • FIG. 6 is a structural block diagram of a log analysis device according to a sixth embodiment of the present application.
  • Fig. 7 is a structural block diagram of a server according to a seventh embodiment of the present application.
  • the first embodiment of the present application relates to a log analysis method.
  • Sample log data is obtained; the sample log data is clustered according to the length and the first and last keywords of each sample log in the sample log data to obtain multiple log clusters; the quality score of each log cluster among the multiple log clusters is determined; and logs are parsed using the multiple log clusters and the quality scores; wherein the quality score is used to determine the adaptive similarity threshold of a log when the log is parsed.
  • A log parsing method in this embodiment is shown in the flowchart of Fig. 1, and may specifically include the following steps:
  • Step 101 Obtain sample log data.
  • Sample log data is obtained; the obtained sample log data includes multiple sample logs, where the sample logs are filtered through regular expressions to filter out irrelevant information.
  • Step 102 Perform clustering processing on the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters.
  • The length and the first and last keywords of each sample log are used as classification conditions; the sample log data is clustered, and multiple log clusters are obtained after clustering.
  • A log cluster is a special data structure used to store a set of log signatures, a row index list (used to store the ID numbers of the logs belonging to the log cluster; in this application, the log length and the first and last keywords are used as the row index), the number of constant fields of each log signature, and the similarity threshold for being divided into the log cluster.
  • The length of the logs in the log cluster and the first and last keywords are jointly used as the key, and the logs in the log cluster are used as the value for caching.
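  As a minimal sketch of such a data structure (the class name, field names, and default threshold are illustrative assumptions, not the patent's exact layout), a log cluster keyed by length plus first and last keywords might look like:

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class LogCluster:
        """A cluster keyed by (log length, first keyword, last keyword)."""
        signature: list                                 # template fields; "*" marks a wildcard
        row_index: list = field(default_factory=list)   # IDs of the logs in this cluster
        similarity_threshold: float = 0.5               # threshold to join this cluster

        @property
        def constant_count(self) -> int:
            # number of constant (non-wildcard) fields in the signature
            return sum(1 for f in self.signature if f != "*")

    def cache_key(tokens):
        """Cache key combining log length with the first and last keywords."""
        return (len(tokens), tokens[0], tokens[-1])

    cluster = LogCluster(signature=["RAS", "KERNEL", "*", "generating", "core"])
    key = cache_key(cluster.signature)
    ```

  The cache query in later steps is then a single dictionary lookup on this key.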
  • Step 103 Determine the quality score of each log cluster in the multiple log clusters.
  • The quality score is used to determine the adaptive similarity threshold of a log when the log is parsed. Determining the quality score of each log cluster in the multiple log clusters includes: computing the compactness within each log cluster and the separability between different log clusters, and using the product of the normalized compactness and the separability as the quality score of each log cluster in the multiple log clusters.
  • Compactness refers to the ratio of constant fields to the total length of a log signature. The compactness is defined as g, and its calculation formula is: g = ct_r / l_r, where ct_r represents the number of constant fields in the log signature r and l_r represents the length of r.
  • Separability refers to the difference of the discrete item sets of different log signatures.
  • The discrete item sets include multiple discrete item pairs; the discrete item pairs refer to all pairwise combinations of the fields of a log after the wildcards are deleted.
  • The difference is generally measured statistically: the separability of a log cluster is obtained from the Jaccard distance between the discrete item pairs of the log clusters. The specific calculation formula of the Jaccard distance J is: J(A, B) = 1 - |A ∩ B| / |A ∪ B|, where A is the set of all field pairs generated by the compared log signature and B is the set of all field pairs generated by the existing log signature.
  • For example, the log signature "RAS KERNEL*generating core" in a log cluster contains a wildcard *, so its compactness is g = 4/5 = 0.8 (four constant fields out of five). The discrete item pairs of the log signature include (RAS, KERNEL), (RAS, generating), (RAS, core), (KERNEL, generating), (KERNEL, core), and (generating, core); the Jaccard distance between the discrete item pairs of each pair of log clusters is computed to obtain the separability of the log cluster from the other log clusters.
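  The compactness, discrete item pairs, and Jaccard distance described above can be sketched as follows (function names are illustrative):

    ```python
    from itertools import combinations

    def compactness(signature):
        """Ratio of constant fields to total length: g = ct_r / l_r."""
        constants = [f for f in signature if f != "*"]
        return len(constants) / len(signature)

    def discrete_pairs(signature):
        """All pairwise field combinations after wildcards are removed."""
        fields = [f for f in signature if f != "*"]
        return set(combinations(fields, 2))

    def jaccard_distance(a, b):
        """J(A, B) = 1 - |A ∩ B| / |A ∪ B| over discrete item pair sets."""
        if not (a | b):
            return 0.0
        return 1.0 - len(a & b) / len(a | b)

    sig = ["RAS", "KERNEL", "*", "generating", "core"]
    g = compactness(sig)         # 4 constants out of 5 fields -> 0.8
    pairs = discrete_pairs(sig)  # the 6 pairs listed in the example above
    ```

  The quality score of a cluster would then be the product of the normalized compactness and the separability derived from these Jaccard distances.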
  • Step 104 Analyze logs using multiple log clusters and quality scores.
  • This embodiment performs clustering processing on the sample log data according to the length and the first and last keywords of each sample log, determines the quality score of each log cluster among the multiple log clusters obtained by clustering, and parses logs based on the multiple log clusters and quality scores. By selecting the log length and keywords as the conditions for clustering, the log parsing speed is greatly improved without losing parsing accuracy. Based on the multiple log clusters and quality scores obtained by clustering, log parsing is performed while iteratively updating and re-clustering, which further improves the accuracy of log parsing, thereby improving the efficiency of log parsing while ensuring its accuracy.
  • the second embodiment of the present application relates to a log analysis method.
  • the second embodiment is roughly the same as the first embodiment.
  • The main difference is that the second embodiment of this application specifically describes how to cluster the sample log data according to the length and the first and last keywords of each sample log in the sample log data to obtain multiple log clusters.
  • A log parsing method in this embodiment is shown in the flowchart of Fig. 2, which may specifically include the following steps:
  • Step 201 Obtain sample log data. This step is similar to step 101 in the first embodiment, and will not be repeated here.
  • Step 202 Obtain a single sample log in the sample log data.
  • Step 203 Determine whether there is a log cluster matching the length of the sample log, if yes, go to step 204, otherwise go to step 205.
  • Multiple log clusters whose log length is the same as that of the sample log are determined from the existing log clusters; if there is no log cluster matching the length of the sample log, a log cluster is created from the sample log.
  • Step 204 Determine whether there is a log cluster matching the beginning and end keywords of the sample log in the log clusters matching the length of the sample log, if yes, go to step 206, otherwise go to step 205.
  • It is checked whether the first and last keywords of the sample log contain any special characters. If a keyword contains a special character such as "_", the keyword at that position is replaced by the wildcard "*"; otherwise the keyword itself is used as the judgment condition. Among the log clusters already determined to match the log length of the sample log, the log clusters matching the first and last keywords of the sample log are determined. If there is no log cluster matching the first and last keywords of the sample log, a log cluster is created from the sample log.
  • the sample log is "RAS KERNEL INFO generating core”.
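  The special-character check above can be sketched as follows (the patent names "_" as one example of a special character; restricting the check to "_" alone is an assumption of this sketch):

    ```python
    def edge_keywords(tokens, special="_"):
        """First and last keywords of a tokenized log; a token containing a
        special character such as '_' is replaced by the wildcard '*'."""
        def norm(tok):
            return "*" if special in tok else tok
        return norm(tokens[0]), norm(tokens[-1])

    # keywords without special characters are kept as-is
    head, tail = edge_keywords(["RAS", "KERNEL", "INFO", "generating", "core"])
    ```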
  • Step 205 Use a single piece of sample data to create a log cluster.
  • Step 206 Determine whether, among the log clusters matching the first and last keywords of the sample log, there is a log whose similarity to the sample log is higher than a preset threshold; if yes, go to step 207, otherwise go to step 205.
  • Step 207 Insert the sample log into the log cluster to which the log corresponding to the similarity higher than the preset threshold belongs.
  • The similarity between the sample log and the existing logs in the log cluster is computed to determine whether it exceeds the specified threshold, comparing in turn with all logs in the log cluster. If the similarity between the two is greater than the threshold, the sample log is assigned to the cluster with the highest similarity among all matching log clusters. If every similarity is lower than the specified threshold, a new log cluster is created from the sample log.
  • Here, seq is the sample log and tem is the compared log in the log cluster; cost(seq_i, tem_i) represents the per-position comparison, i.e., if seq_i and tem_i are the same, the cost is 1, and if seq_i and tem_i are different, the cost is 0, where seq_i is the i-th field of seq, tem_i is the i-th field of tem, l is the log length of the sample log, and n_c is the number of constant fields of the compared log in the current log cluster.
  • The preset threshold t is calculated from the following quantities: diff represents the count of positions at which the sample log and the log in the compared log cluster differ; float represents the forced conversion of the variable data type to a floating-point type; and n_c is the number of constant fields of the currently compared log in the log cluster.
  • For example, if the sample log is "RAS KERNEL INFO generating core" and the log cluster already contains the log "RAS KERNEL*generating core", the computed similarity exceeds the similarity threshold, so the sample log is inserted into the log cluster.
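  The similarity comparison can be sketched as below. The similarity follows the cost definition above (matching positions divided by the log length l); the exact formula for the preset threshold t is not reproduced in this text, so the form t = float(n_c - diff) / n_c used here is an assumption built only from the variables diff and n_c that the text defines:

    ```python
    def similarity(seq, tem):
        """sim = sum of cost(seq_i, tem_i) over the l positions, divided by l;
        cost is 1 when the fields are identical and 0 otherwise."""
        matches = sum(1 for s, t in zip(seq, tem) if s == t)
        return matches / len(seq)

    def threshold(seq, tem):
        """ASSUMED form of the preset threshold t, built from the defined
        quantities diff (differing positions) and n_c (constant fields)."""
        n_c = sum(1 for f in tem if f != "*")
        diff = sum(1 for s, t in zip(seq, tem) if s != t)
        return float(n_c - diff) / n_c

    seq = ["RAS", "KERNEL", "INFO", "generating", "core"]
    tem = ["RAS", "KERNEL", "*", "generating", "core"]
    sim = similarity(seq, tem)   # 4 of 5 positions match -> 0.8
    t = threshold(seq, tem)      # (4 - 1) / 4 = 0.75 under the assumed form
    ```

  Under this assumption sim exceeds t, matching the worked example: the sample log joins the cluster.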
  • Step 208 Determine whether all sample logs in the sample log data have been inserted into the log cluster, if yes, go to step 209, otherwise go to step 202;
  • Step 209 Determine the quality score of each log cluster in the plurality of log clusters.
  • Step 210 Analyze logs using multiple log clusters and quality scores.
  • Step 209 and step 210 in this embodiment are similar to step 103 and step 104 in the first embodiment, and will not be repeated here.
  • This embodiment performs clustering processing on the sample log data according to the length and the first and last keywords of each sample log, determines the quality score of each log cluster among the multiple log clusters obtained by clustering, and parses logs based on the multiple log clusters and quality scores. By selecting the log length and keywords as the conditions for clustering, the speed of log parsing is greatly improved without losing parsing accuracy.
  • the third embodiment of the present application relates to a log analysis method.
  • The process of parsing logs using the multiple log clusters and the quality scores is further refined: obtaining the target log to be parsed; using regular expressions to filter out irrelevant information in the target log; determining a matching log cluster among the multiple log clusters according to the length and the first and last keywords of the target log, and determining the adaptive similarity between the target log and the matching log cluster; determining the adaptive similarity threshold between the target log and the matching log cluster according to the quality score; and determining whether the adaptive similarity is greater than the adaptive similarity threshold: if so, the target log is inserted into the matching log cluster; otherwise, a log cluster is created from the target log.
  • A log parsing method in this embodiment is shown in the flowchart of Fig. 3, and may specifically include the following steps:
  • Step 301 Obtain sample log data.
  • Step 302 Perform clustering processing on the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters.
  • Step 303 Determine the quality score of each log cluster in the multiple log clusters.
  • Steps 301 to 303 in this embodiment are similar to steps 101 to 103 in the first embodiment, and will not be repeated here.
  • Step 304 Obtain the target log to be parsed.
  • Step 305 Use regular expressions to filter out irrelevant information in the target log.
  • The first aspect mainly concerns the processing of position-fixed items: in the original log, there are usually some fixed items that appear at the same position in logs from the same source. For example, the first field in a log is usually the timestamp at which the log was generated; or the items change but share the same attribute, for example, a fixed position in nova data generates a request ID. Such positionally meaningless data can be filtered out by analyzing the log characteristics; for example, the records in the first column in this example are all of the timestamp attribute.
  • The second aspect mainly concerns the processing of items with uncertain positions: there are some items in the log text whose positions vary but whose attributes are the same.
  • For example, if the original log to be parsed is "2017-10-07 12:00:13 RAS KERNEL INFO generating core", the processed log is "RAS KERNEL INFO generating core"; that is, after the regular expressions filter out the irrelevant information in the target log, the log to be parsed becomes "RAS KERNEL INFO generating core".
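  The preprocessing step can be sketched as a list of regular-expression filters; the patent does not give the concrete patterns, so the single timestamp-stripping pattern below is an assumption chosen to match the worked example:

    ```python
    import re

    # assumption: the irrelevant-information filters are a configurable list
    # of regular expressions; this one strips a leading "YYYY-MM-DD hh:mm:ss"
    FILTERS = [re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s*")]

    def preprocess(raw_log):
        """Apply every filter in turn, removing matched irrelevant text."""
        for pattern in FILTERS:
            raw_log = pattern.sub("", raw_log)
        return raw_log

    log = preprocess("2017-10-07 12:00:13 RAS KERNEL INFO generating core")
    ```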
  • Step 306 Determine a matching log cluster among multiple log clusters according to the length of the target log and the first and last keywords.
  • The length and the first and last keywords of the log to be parsed are directly used as the key, and a log cluster whose log length and first and last keywords match the log to be parsed is obtained from the multiple log clusters through a cache query.
  • Step 307 Determine the adaptive similarity and the threshold of the adaptive similarity between the target log and the matched log cluster.
  • The adaptive similarity calculation method is the same as the similarity calculation method in the second embodiment of the present application; the adaptive similarity threshold is obtained by multiplying the similarity threshold of the second embodiment by the quality score.
  • The adaptive similarity sim is determined in the same way: seq is the log to be parsed and tem is the matched log in the log cluster; cost(seq_i, tem_i) represents the per-position comparison, i.e., the cost is 1 if seq_i and tem_i are the same and 0 if they differ, where seq_i is the i-th field of seq, tem_i is the i-th field of tem, l is the log length of the log to be parsed, and n_c is the number of constant fields of the matched log in the log cluster.
  • The adaptive similarity threshold t′ is obtained by multiplying the similarity threshold t by the quality score: t′ = t × q, where diff represents the count of positions at which the log to be parsed and the log in the matching log cluster differ; float() represents the forced conversion of the variable data type to a floating-point type; n_c is the number of constant fields of the currently compared log in the log cluster; and q is the quality score of the matched log cluster.
  • Step 308 Determine whether the adaptive similarity is greater than the adaptive similarity threshold, if yes, go to step 309, otherwise go to step 310.
  • Step 309 Insert the target log into the matching log cluster.
  • For example, if the log to be parsed is "RAS KERNEL INFO generating core" and the log cluster already contains the log "RAS KERNEL*generating core", then, since q ≤ 1, the adaptive similarity threshold is no greater than the base threshold; the adaptive similarity is therefore greater than the adaptive similarity threshold, and the log is inserted into the log cluster.
  • Step 310 Create a log cluster using the target log.
  • Parsing logs using the multiple log clusters and quality scores includes: obtaining the overall data set and processing the target logs in turn.
  • For example, suppose the original log currently to be parsed is "2017-10-07 12:00:13 Delete block blk_2342" and there is already a log cluster "Delete block*".
  • After filtering, the log to be parsed becomes "Delete block blk_2342".
  • Its first and last keywords are "Delete" and "*" ("blk_2342" contains the special character "_", so "*" is used instead); the log cluster "Delete block*" is determined as the matching log cluster, and the log is compared with the logs in that cluster.
  • Since q ≤ 1, the adaptive similarity exceeds the adaptive similarity threshold, so the log can be directly inserted into the log cluster.
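  The whole adaptive matching step for this example can be sketched end to end. The multiplication t′ = t × q follows the text above; the base threshold value of 0.5 and the quality score of 0.9 are illustrative assumptions:

    ```python
    def edge(tok):
        # a token containing the special character "_" becomes the wildcard "*"
        return "*" if "_" in tok else tok

    def parse(tokens, clusters, q_scores, base_threshold=0.5):
        """clusters maps (length, first, last) -> template; q_scores maps the
        same key to the cluster's quality score q.  Adaptive threshold t' = t * q."""
        key = (len(tokens), edge(tokens[0]), edge(tokens[-1]))
        if key not in clusters:
            return "new-cluster"
        tem = clusters[key]
        sim = sum(1 for s, t in zip(tokens, tem) if s == t) / len(tokens)
        t_adapt = base_threshold * q_scores[key]   # q <= 1, so t' <= t
        return "inserted" if sim > t_adapt else "new-cluster"

    clusters = {(3, "Delete", "*"): ["Delete", "block", "*"]}
    q_scores = {(3, "Delete", "*"): 0.9}   # illustrative quality score
    result = parse(["Delete", "block", "blk_2342"], clusters, q_scores)
    ```

  Here sim = 2/3 against an adaptive threshold of 0.45, so the log joins the existing cluster, as in the worked example.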
  • This embodiment obtains sample log data; clusters the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters; determines the quality score of each log cluster; and parses logs using the multiple log clusters and quality scores, where the parsing includes: obtaining the target log to be parsed; filtering out irrelevant information in the target log with regular expressions; determining a matching log cluster among the multiple log clusters according to the length and the first and last keywords of the target log, and determining the adaptive similarity between the target log and the matching log cluster; determining the adaptive similarity threshold between the target log and the matching log cluster according to the quality score; and judging whether the adaptive similarity is greater than the adaptive similarity threshold: if so, the target log is inserted into the matching log cluster; otherwise, a log cluster is created from the target log.
  • the fourth embodiment of the present application relates to a log analysis method.
  • This embodiment is further improved on the basis of the first embodiment: after analyzing logs using multiple log clusters and the quality score, it also includes: splitting log clusters that meet preset splitting conditions;
  • the split condition includes: the number of logs in the log cluster exceeds a preset log number threshold.
  • A log parsing method in this embodiment is shown in the flowchart of Fig. 4, which may specifically include the following steps:
  • Step 401 Obtain sample log data.
  • Step 402 Perform clustering processing on the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters.
  • Step 403 Determine the quality score of each log cluster in the multiple log clusters.
  • Step 404 Analyze logs using multiple log clusters and quality scores.
  • Steps 401 to 404 in this embodiment are similar to steps 101 to 104 in the first embodiment, and will not be repeated here.
  • Step 405 Determine the log cluster to be split.
  • Each log cluster among all the log clusters whose number of logs exceeds a preset log number threshold is split into multiple log clusters. If the number of logs in a log cluster is less than the preset log number threshold, that is, the currently parsed log cluster contains too few logs and the information stored in the parameters at each log position is limited, the log cluster is temporarily not split. If the number of logs in a log cluster is greater than the preset log number threshold, that is, the currently parsed log cluster contains many logs, which affects the speed of log parsing, the log cluster is split. The preset log number threshold can be set by the technician according to the actual situation.
  • Step 406 Determine the candidate location of the log cluster to be split.
  • A position in each log cluster to be split whose number of distinct fields is less than a preset field number threshold and greater than 1 is a candidate position. If the number of distinct fields at a position exceeds the preset field number threshold, there are many kinds of parameters at that position, which to a certain extent means the position is more likely to be a parameter, so it is not used to split the log cluster. The field number threshold can be set by the technician according to the actual situation.
  • Step 407 Determine the optimal position according to the Gini value of each candidate position.
  • The Gini value is the index used for selecting the optimal feature when the classical decision tree CART is applied to classification problems; the candidate position with the maximum Gini value is selected as the optimal position.
  • The Gini value gini of each position can be determined by the following calculation formula: gini = 1 - Σ_{i=1}^{l} (n_i / cnt)², where l is the total number of unique fields appearing at the current position, n_i is the number of occurrences of the corresponding unique field, and cnt is the total number of fields appearing at that position.
  • Not every candidate position is suitable for splitting, so a maximum Gini value for log parameters is set: only when the Gini value of a candidate position is less than this maximum Gini value can the position be used as a splitting position. The maximum Gini value of the log parameter can be set by the technician according to the actual situation.
  • Step 408 Split the log cluster to be split into multiple log clusters according to the optimal position.
  • For example, given a log cluster containing "RAS KERNEL INFO generating core" and "RAS KERNEL FAILED generating core", the positions whose numbers of distinct fields are less than the preset field number threshold and greater than 1 are taken as candidate positions (here, the position where INFO and FAILED differ is a candidate position); the Gini value is calculated for each candidate position and compared with the maximum Gini value of the log parameter.
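  The candidate selection and Gini computation above can be sketched as follows (positions are 0-indexed in this sketch, and the two threshold values passed in are illustrative):

    ```python
    from collections import Counter

    def gini(fields):
        """gini = 1 - sum((n_i / cnt)^2) over the unique fields at one position."""
        cnt = len(fields)
        return 1.0 - sum((n / cnt) ** 2 for n in Counter(fields).values())

    def best_split_position(logs, max_field_count, max_gini):
        """Candidate positions have more than one but fewer than max_field_count
        distinct fields; among candidates whose Gini value is below max_gini,
        the position with the largest Gini value is the optimal split position."""
        best = None
        for pos in range(len(logs[0])):
            column = [log[pos] for log in logs]
            if 1 < len(set(column)) < max_field_count:
                g = gini(column)
                if g < max_gini and (best is None or g > best[1]):
                    best = (pos, g)
        return best

    logs = [["RAS", "KERNEL", "INFO", "generating", "core"],
            ["RAS", "KERNEL", "FAILED", "generating", "core"]]
    best = best_split_position(logs, max_field_count=5, max_gini=0.9)
    ```

  For these two logs, only the INFO/FAILED position qualifies, with Gini value 1 - 2 × (1/2)² = 0.5; the cluster would then be split on that position.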
  • This embodiment obtains sample log data; clusters the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters; determines the quality score of each log cluster; parses logs using the multiple log clusters and quality scores; and splits the log clusters meeting the preset splitting condition, where the splitting condition includes: the number of logs in the log cluster exceeds the preset log number threshold.
  • the fifth embodiment of the present application relates to a log analysis method.
  • This embodiment is further improved on the basis of the fourth embodiment: after splitting each log cluster whose number of logs exceeds the preset log number threshold into multiple log clusters, it further includes: among the log clusters obtained by splitting, querying the logs in which the same keyword appears consecutively but whose other fields are identical as candidate deletion logs; and, among the candidate deletion logs, deleting all logs except the log with the fewest occurrences of that keyword.
  • A log parsing method in this embodiment is shown in the flowchart of Fig. 5, and may specifically include the following steps:
  • Step 501 Obtain sample log data.
  • Step 502 Perform clustering processing on the sample log data according to the length and the first and last keywords of each sample log to obtain multiple log clusters.
  • Step 503 Determine the quality score of each log cluster in the multiple log clusters.
  • Step 504 Analyze the logs using multiple log clusters and quality scores.
  • Steps 501 to 504 in this embodiment are similar to steps 101 to 104 in the first embodiment, and will not be repeated here.
  • Step 505 Split each log cluster whose number of logs exceeds the preset log number threshold in all log clusters into multiple log clusters.
  • Step 506: among the logs in which a keyword appears consecutively while the other fields are identical, delete all logs except the log in which the keyword appears the fewest times.
  • this step mainly addresses the situation where logs carrying the same information have different lengths, considering mainly the following two cases: (1) consecutive identical preprocessed variables are merged into a single preprocessed variable, e.g., both "delete block blk_object blk_object" and "delete block blk_object blk_object blk_object" are normalized to "delete block blk_object"; (2) consecutive parameter wildcards are merged into a single parameter wildcard, e.g., both "A B**C" and "A B***C" are normalized to "A B*C".
  • this embodiment obtains sample log data; clusters the sample log data according to the length and the beginning and end keywords of each sample log to obtain multiple log clusters; determines the quality score of each log cluster; uses the multiple log clusters and quality scores to parse logs; splits each log cluster whose number of logs exceeds the preset log-number threshold into multiple log clusters; in each log cluster obtained by splitting, queries logs in which a keyword appears consecutively while the other fields are identical, as candidate deletion logs; and among the candidate deletion logs, deletes all logs except the log in which the keyword appears the fewest times. By deleting duplicate logs in which a keyword appears consecutively while the other fields are identical, the log parsing speed is improved.
  • the sixth embodiment of the present application relates to a log analysis device, including: an acquisition module 601, a clustering module 602, a scoring module 603, and a parsing module 604.
  • the specific structure is shown in FIG. 6:
  • the obtaining module 601 is used to obtain sample log data.
  • the clustering module 602 is configured to perform clustering processing on the sample log data according to the length and the first and last keywords of each sample log in the sample log data to obtain multiple log clusters.
  • the scoring module 603 is used to determine the quality score of each log cluster in the plurality of log clusters.
  • the parsing module 604 is configured to analyze the log using the plurality of log clusters and the quality score.
  • the clustering processing of the sample log data according to the length and the first and last keywords of each sample log in the sample log data includes performing the following processing on each sample log in the sample log data: determining whether there is a log cluster matching the length of the sample log; if yes, obtaining the log cluster matching the length of the sample log, and if not, creating a log cluster from the sample log; determining whether, among the log clusters matching the length of the sample log, there is a log cluster matching the first and last keywords of the sample log; if yes, obtaining the log cluster matching the first and last keywords of the sample log, and if not, creating a log cluster from the sample log; and determining the similarity between the sample log and all log signatures in the log clusters matching the first and last keywords of the sample log; if any of the similarities is higher than the preset threshold, inserting the sample log into the log cluster to which the log corresponding to that similarity belongs, and otherwise creating a log cluster from the sample log.
  • the determining the quality score of each log cluster in the plurality of log clusters includes: determining the quality score of each log cluster according to the intra-cluster compactness of each of the log clusters and the inter-cluster separability between different log clusters.
  • the parsing log using the multiple log clusters and the quality score includes: obtaining a target log to be parsed; using regular expressions to filter out irrelevant information in the target log; determining a matching log cluster among the multiple log clusters according to the length and the first and last keywords of the target log, and determining the adaptive similarity between the target log and the matched log cluster; determining the adaptive similarity threshold between the target log and the matched log cluster according to the quality score; and determining whether the adaptive similarity is greater than the adaptive similarity threshold; if so, inserting the target log into the matching log cluster, and otherwise creating a log cluster from the target log.
  • the method further includes: splitting log clusters that meet preset splitting conditions, wherein the splitting conditions include: the number of logs in the log cluster exceeds the preset log-number threshold.
  • the splitting of log clusters that meet preset splitting conditions includes: determining the positions in the log cluster to be split whose number of fields is less than a preset field-count threshold as candidate positions; calculating the Gini values of all the candidate positions, and splitting the log cluster according to the candidate position corresponding to the largest Gini value in the log cluster to be split.
  • in each log cluster obtained by splitting, logs in which a keyword appears consecutively while the other fields are identical are queried as candidate deletion logs; among the candidate deletion logs, all logs except the log in which the keyword appears the fewest times are deleted.
  • this embodiment is an example of a device corresponding to the first embodiment, and this embodiment can be implemented in cooperation with the first embodiment.
  • the related technical details mentioned in the first embodiment are still valid in this embodiment, and in order to reduce repetition, they will not be repeated here.
  • the related technical details mentioned in this embodiment can also be applied in the first embodiment.
  • modules involved in this embodiment are all logical modules.
  • in practical applications, a logical unit may be one physical unit, part of a physical unit, or implemented as a combination of multiple physical units.
  • this embodiment does not introduce units that are not closely related to solving the technical problems proposed by this application, but this does not mean that there are no other units in this embodiment.
  • the seventh embodiment of the present application relates to a server.
  • the server includes at least one processor 701; a memory 702 communicatively connected with the at least one processor 701; and a communication component 703 communicatively connected with the log analysis device, the communication component 703 receiving and sending data under the control of the processor 701; wherein the memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 to implement the log analysis method embodiments described above.
  • the electronic device includes: one or more processors 701 and a memory 702, and one processor 701 is taken as an example in FIG. 7.
  • the processor 701 and the memory 702 may be connected through a bus or in other ways. In FIG. 7, the connection through a bus is taken as an example.
  • the memory 702, as a computer-readable storage medium, can be used to store computer software programs, computer-executable programs, and modules.
  • the processor 701 executes various functional applications and data processing of the device by running computer software programs, instructions, and modules stored in the memory 702, that is, realizes the above-mentioned log analysis method.
  • the memory 702 may include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function; the storage data area may store a list of options and the like.
  • the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 702 may optionally include a memory remotely provided with respect to the processor 701, and these remote memories may be connected to an external device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • One or more modules are stored in the memory 702, and when executed by one or more processors 701, the log analysis method in any of the foregoing method implementations is executed.
  • the memory and the processor are connected in a bus manner, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors and various circuits of the memory together.
  • the bus can also connect various other circuits such as peripheral devices, voltage regulators, and power management circuits, etc., which are all known in the art, and therefore, no further description will be given herein.
  • the bus interface provides an interface between the bus and the transceiver.
  • the transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices on the transmission medium.
  • the data processed by the processor is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor.
  • the processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions.
  • the memory can be used to store data used by the processor when performing operations.
  • the eighth embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • the computer program is executed by the processor, the above log analysis method embodiment is realized.
  • the program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A log parsing method, apparatus, server and storage medium, relating to the field of network operation and maintenance. The method includes: obtaining sample log data (101); clustering the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters (102); determining a quality score of each log cluster in the plurality of log clusters obtained by the clustering (103); and parsing logs using the plurality of log clusters and the quality scores (104).

Description

Log parsing method, apparatus, server and storage medium
CROSS-REFERENCE TO RELATED APPLICATION
This application is based on and claims priority to Chinese patent application No. 201910893383.2, filed on September 20, 2019, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
Embodiments of the present application relate to the field of network operation and maintenance, and in particular to a log parsing method.
BACKGROUND
Log parsing converts unstructured log text into structured log text and has important applications in fields such as locating the root cause of system errors and anomaly detection. From the parsing results, the execution order of the programs in a system can be clearly understood, which can further be used to construct program workflows and to detect anomalies.
Currently common log parsing methods include: parsing log templates online in a streaming fashion based on the longest common subsequence, usually generating a prefix tree in practice to reduce search time; and building a log parsing tree of fixed depth by assigning the first several fields of a log to different tree nodes, so as to speed up the search.
The inventors found at least the following problem in the prior art: the constructed log parsing tree uses only a few fields of a log; although this accelerates the parsing process to some extent, it substantially reduces parsing accuracy.
SUMMARY
An object of the embodiments of the present application is to provide a log parsing method, apparatus, server and storage medium.
An embodiment of the present application provides a log parsing method, including: obtaining sample log data; clustering the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters; determining a quality score of each log cluster in the plurality of log clusters; and parsing logs using the plurality of log clusters and the quality scores, wherein the quality score is used to determine an adaptive similarity threshold for a log during the parsing.
An embodiment of the present application further provides a log parsing apparatus, including: an obtaining module configured to obtain sample log data; a clustering module configured to cluster the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters; a scoring module configured to determine a quality score of each log cluster in the plurality of log clusters; and a parsing module configured to parse logs using the plurality of log clusters and the quality scores, wherein the quality score is used to determine an adaptive similarity threshold for a log during the parsing.
An embodiment of the present application further provides a server, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above log parsing method.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above log parsing method.
BRIEF DESCRIPTION OF THE DRAWINGS
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not constitute a limitation on the embodiments.
FIG. 1 is a flowchart of a log parsing method according to a first embodiment of the present application;
FIG. 2 is a flowchart of a log parsing method according to a second embodiment of the present application;
FIG. 3 is a flowchart of a log parsing method according to a third embodiment of the present application;
FIG. 4 is a flowchart of a log parsing method according to a fourth embodiment of the present application;
FIG. 5 is a flowchart of a log parsing method according to a fifth embodiment of the present application;
FIG. 6 is a structural block diagram of a log parsing apparatus according to a sixth embodiment of the present application;
FIG. 7 is a structural block diagram of a server according to a seventh embodiment of the present application.
DETAILED DESCRIPTION
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the drawings. However, those of ordinary skill in the art will appreciate that many technical details are set forth in the embodiments to help the reader better understand the application; the technical solutions claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description; this division does not limit specific implementations of the present application, and the embodiments may be combined with and refer to each other where they do not contradict.
A first embodiment of the present application relates to a log parsing method. In this embodiment, sample log data is obtained; the sample log data is clustered according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters; a quality score of each log cluster in the plurality of log clusters is determined; and logs are parsed using the plurality of log clusters and the quality scores, wherein the quality score is used to determine an adaptive similarity threshold for a log during the parsing.
Implementation details of the log parsing method of this embodiment are described below; these details are provided only for ease of understanding and are not required to implement this solution.
The log parsing method of this embodiment is shown in the flowchart of FIG. 1 and may specifically include the following steps:
Step 101: obtain sample log data.
Specifically, sample log data containing a plurality of sample logs is obtained, irrelevant information having been filtered out of the sample logs by regular expressions.
Step 102: cluster the sample log data according to the length and the head and tail keywords of each sample log to obtain a plurality of log clusters.
Specifically, the length and the head and tail keywords of each sample log are used as classification conditions to cluster the sample log data into a plurality of log clusters. A log cluster is a special data structure that stores a set of log signatures, a row-index list (which stores the IDs of the logs belonging to the cluster; in this application the log length and the head and tail keywords are used as the row index), the number of constant fields of each log signature, and the similarity threshold for assigning a log to the cluster.
Specifically, when the plurality of log clusters are obtained, the length and the head and tail keywords of the logs in each cluster are used as a key and the logs in the cluster as the value for caching.
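As an illustrative sketch of the (length, head keyword, tail keyword) cache described above, the key can be a plain dictionary index. The class, function and field names here are ours, not the application's, and the "join the first match" policy is a simplification of the similarity test introduced later.

```python
class LogCluster:
    """Minimal log-cluster record: stored signatures plus a row index of member log IDs."""
    def __init__(self, signature, log_id):
        self.signatures = [signature]   # list of token lists (log signatures)
        self.log_ids = [log_id]         # row-index list

def cluster_key(tokens):
    """Key a log by its length and head/tail keywords, as in step 102."""
    return (len(tokens), tokens[0], tokens[-1])

index = {}  # (length, head, tail) -> list of LogCluster

def insert_log(log_id, line):
    tokens = line.split()
    key = cluster_key(tokens)
    clusters = index.setdefault(key, [])
    if clusters:                        # simplified policy: join the first matching cluster
        clusters[0].log_ids.append(log_id)
        clusters[0].signatures.append(tokens)
    else:
        clusters.append(LogCluster(tokens, log_id))

insert_log(0, "RAS KERNEL INFO generating core")
insert_log(1, "RAS KERNEL FAILED generating core")
```

Both logs share length 5 and the head/tail keywords "RAS" and "core", so a single cache entry holds them.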
Step 103: determine the quality score of each log cluster in the plurality of log clusters.
Specifically, the quality score is used to determine the adaptive similarity threshold of a log during parsing. Determining the quality score of each log cluster includes: computing the intra-cluster compactness of each log cluster and the inter-cluster separability between different log clusters, and taking the product of the normalized compactness and separability as the quality score of each log cluster.
Further, compactness is the ratio of the number of constants in a log signature to its total length. Denoting the compactness by g, it is computed as:
g = ct_r / l_r
where ct_r denotes the number of constant fields in log signature r, and l_r denotes the length of r.
Separability refers to the difference between the discrete item sets of different log signatures. A discrete item set consists of discrete item pairs, i.e., all pairs generated from every two fields of a log after wildcards are removed. Separability is generally obtained by computing the Jaccard distance between the discrete item pairs of different log clusters; the Jaccard distance J is computed as:
J(A, B) = 1 - |A ∩ B| / |A ∪ B|
where A is the set of all field pairs generated from the log signature being compared, and B is the set of all field pairs generated from an existing log signature.
In a specific example, the log signature "RAS KERNEL*generating core" in a log cluster contains one wildcard *, so its compactness is
g = 4 / 5 = 0.8
The discrete item pairs of this signature include (RAS, KERNEL), (RAS, generating), (RAS, core), (KERNEL, generating), (KERNEL, core) and (generating, core); the separability between this cluster and the other clusters is obtained by computing the Jaccard distance between the discrete item pairs of the respective clusters.
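The compactness and separability quantities above can be sketched as follows. This is a minimal reconstruction under our own naming; the rendered formula images were lost, so the Jaccard distance form (1 minus the intersection-over-union) is an assumption consistent with the surrounding definitions.

```python
from itertools import combinations

def compactness(signature):
    """Ratio of constant tokens to signature length; '*' marks a parameter."""
    constants = sum(1 for tok in signature if tok != "*")
    return constants / len(signature)

def item_pairs(signature):
    """Discrete item pairs: all 2-field pairs after wildcards are removed."""
    fields = [tok for tok in signature if tok != "*"]
    return set(combinations(fields, 2))

def jaccard_distance(a, b):
    """Separability between two signatures via Jaccard distance of their pair sets."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

sig = "RAS KERNEL * generating core".split()
print(compactness(sig))        # 0.8, matching the worked example
print(len(item_pairs(sig)))    # 6 discrete item pairs
```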
Step 104: parse logs using the plurality of log clusters and the quality scores.
Specifically, the logs to be parsed are input one by one in a streaming fashion and compared against the existing log clusters; during the comparison, the quality scores are used to determine the adaptive similarity threshold between an input log and the compared log cluster, thereby completing the parsing of the input log.
Compared with the prior art, this embodiment clusters the sample log data according to the length and the head and tail keywords of each sample log in the sample log data, determines the quality score of each of the log clusters obtained by the clustering, and then parses logs using the plurality of log clusters and the quality scores. By selecting the log length and keywords as clustering conditions, the parsing speed is greatly improved without losing parsing accuracy; and by parsing logs with the clusters and quality scores while iteratively updating and re-clustering, the parsing accuracy is further improved, so that parsing efficiency is increased while parsing accuracy is guaranteed.
A second embodiment of the present application relates to a log parsing method. The second embodiment is substantially the same as the first embodiment; the main difference is that the second embodiment specifically provides the method for clustering the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters.
The log parsing method of this embodiment is shown in the flowchart of FIG. 2 and may specifically include the following steps:
Step 201: obtain sample log data. This step is similar to step 101 of the first embodiment and is not repeated here.
Step 202: obtain a single sample log from the sample log data.
Step 203: determine whether a log cluster matching the length of the sample log exists; if yes, proceed to step 204, otherwise proceed to step 205.
Specifically, among the existing log clusters, the clusters whose log length equals that of the sample log are determined; if no cluster matches the length of the sample log, a log cluster is created from the sample log.
In a specific example, the sample log is "RAS KERNEL INFO generating core", and a plurality of clusters with "length=5" are determined.
Step 204: determine whether, among the log clusters matching the length of the sample log, a log cluster matching the head and tail keywords of the sample log exists; if yes, proceed to step 206, otherwise proceed to step 205.
Specifically, it is first analyzed whether the head and tail keywords of the sample log contain any special character; if a keyword contains any special character such as "_", the keyword at that position is replaced with the wildcard "*"; otherwise the keyword itself is used as the matching condition. Among the clusters already determined to match the length of the sample log, the clusters matching the head and tail keywords of the sample log are determined; if no cluster matches the head and tail keywords of the sample log, a log cluster is created from the sample log.
In a specific example, the sample log is "RAS KERNEL INFO generating core"; among the "length=5" clusters, the clusters whose log signatures have the head and tail keywords "RAS" and "core" are determined.
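The head/tail keyword extraction with wildcard substitution can be sketched as follows. The function name and the particular set of "special characters" are our assumptions for illustration; the application only names "_" as an example of a special character.

```python
def head_tail_keywords(tokens, special="_0123456789"):
    """Return the head and tail lookup keys for a tokenized log: a keyword
    containing any special character is replaced by the wildcard '*'
    (the character set here is illustrative, not from the application)."""
    def key(tok):
        return "*" if any(ch in tok for ch in special) else tok
    return key(tokens[0]), key(tokens[-1])

print(head_tail_keywords("RAS KERNEL INFO generating core".split()))
# -> ('RAS', 'core')
print(head_tail_keywords("Delete block blk_2342".split()))
# -> ('Delete', '*'), since "blk_2342" contains the special character '_'
```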
Step 205: create a log cluster from the single sample log.
Step 206: determine whether, among the log clusters matching the head and tail keywords of the sample log, there is a log whose similarity to the sample log is higher than a preset threshold; if yes, proceed to step 207, otherwise proceed to step 205.
Step 207: insert the sample log into the log cluster to which the log whose similarity is higher than the preset threshold belongs.
Specifically, the similarity between the sample log and the existing logs in the cluster is compared in turn against the specified threshold. If the similarity exceeds the threshold, the sample log is assigned to the cluster with the highest similarity among all matching clusters; if every similarity is below the specified threshold, a new log cluster is created from the sample log.
Further, the similarity sim can be determined by the following formula:
sim = (Σ_{i=1}^{l} cost(seq_i, tem_i)) / n_c
where seq is the sample log, tem is a log in the compared log cluster, cost(seq_i, tem_i) denotes an XNOR operation, i.e., cost is 1 if seq_i and tem_i are identical and 0 otherwise, seq_i is the i-th field of seq, tem_i is the i-th field of tem, l is the length of the sample log, and n_c is the number of constant fields of the currently compared log in the cluster.
Further, the preset threshold t can be determined by the following formula:
t = float(n_c - diff) / n_c
where diff denotes the count of positions at which the sample log and the compared log in the cluster differ, float denotes casting the variable to a floating-point type, and n_c is the number of constant fields of the currently compared log in the cluster.
In a specific example, the sample log is "RAS KERNEL INFO generating core" and the cluster already contains the log "RAS KERNEL*generating core"; the similarity is
sim = 4 / 4 = 1
and the similarity threshold is
t = (4 - 1) / 4 = 0.75
so the sample log is inserted into this cluster.
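The similarity and threshold computations above can be sketched as follows. The rendered formula images were lost, so the exact forms are a reconstruction from the variable definitions and the worked example (which they reproduce); the function names are ours.

```python
def similarity(seq, tem):
    """sim = (positions where seq and tem agree) / n_c, where n_c is the
    number of constant fields in tem (wildcard '*' positions excluded)."""
    n_c = sum(1 for tok in tem if tok != "*")
    matches = sum(1 for s, t in zip(seq, tem) if s == t)
    return matches / n_c

def threshold(seq, tem):
    """t = (n_c - diff) / n_c, diff being the count of differing positions."""
    n_c = sum(1 for tok in tem if tok != "*")
    diff = sum(1 for s, t in zip(seq, tem) if s != t)
    return (n_c - diff) / n_c

seq = "RAS KERNEL INFO generating core".split()
tem = "RAS KERNEL * generating core".split()
print(similarity(seq, tem))  # 4/4 = 1.0
print(threshold(seq, tem))   # (4-1)/4 = 0.75
```

Since 1.0 > 0.75, the sample log joins the cluster, matching the example above.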
Step 208: determine whether all sample logs in the sample log data have been inserted into log clusters; if yes, proceed to step 209, otherwise return to step 202.
Step 209: determine the quality score of each log cluster in the plurality of log clusters.
Step 210: parse logs using the plurality of log clusters and the quality scores.
Steps 209 and 210 of this embodiment are similar to steps 103 and 104 of the first embodiment and are not repeated here.
Compared with the prior art, this embodiment clusters the sample log data according to the length and the head and tail keywords of each sample log in the sample log data, determines the quality score of each of the log clusters obtained by the clustering, and then parses logs using the plurality of log clusters and the quality scores. By selecting the log length and keywords as clustering conditions, the parsing speed is greatly improved without losing parsing accuracy.
A third embodiment of the present application relates to a log parsing method. In this embodiment, the process of parsing logs using the plurality of log clusters and the quality scores is further refined: obtaining a target log to be parsed; filtering out irrelevant information from the target log using regular expressions; determining a matching log cluster among the plurality of log clusters according to the length and the head and tail keywords of the target log, and determining the adaptive similarity between the target log and the matching cluster; determining the adaptive similarity threshold between the target log and the matching cluster according to the quality score; and determining whether the adaptive similarity is greater than the adaptive similarity threshold; if yes, inserting the target log into the matching cluster, and otherwise creating a log cluster from the target log.
The log parsing method of this embodiment is shown in the flowchart of FIG. 3 and may specifically include the following steps:
Step 301: obtain sample log data.
Step 302: cluster the sample log data according to the length and the head and tail keywords of each sample log to obtain a plurality of log clusters.
Step 303: determine the quality score of each log cluster in the plurality of log clusters.
Steps 301 to 303 of this embodiment are similar to steps 101 to 103 of the first embodiment and are not repeated here.
Step 304: obtain the target log to be parsed.
Step 305: filter out irrelevant information from the target log using regular expressions.
Specifically, this step mainly covers two aspects. The first is the handling of position-invariant items: an original log usually contains fixed items that appear at the same position in logs from the same source, for example the first field is usually the timestamp at which the log was produced; or items that vary but share the same attribute, such as the request ID produced at a fixed position in nova data. By analyzing log characteristics, such meaningless data at fixed positions can be filtered out; in this example, every record in the first column is a timestamp attribute. The second is the handling of position-variable related items: log text also contains items that share the same attribute but do not appear at fixed positions; they are important parts of the log and cannot be deleted arbitrarily. Although log text formats differ between systems, some fields in logs follow "international conventions", such as IP addresses and file directory structures; therefore obtaining preprocessed sample log data, i.e., applying preset regular expressions, can greatly reduce the cost of parsing logs.
In a specific example, the original log to be parsed is "2017-10-07 12:00:13 RAS KERNEL INFO generating core"; after processing it becomes "RAS KERNEL INFO generating core", i.e., after the irrelevant information in the target log is filtered out with regular expressions, the log to be parsed becomes "RAS KERNEL INFO generating core".
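The regular-expression preprocessing can be sketched as follows. The rule set is illustrative (the application does not list concrete patterns); real deployments would tune the rules per log source.

```python
import re

# Illustrative preprocessing rules: strip a leading timestamp, mask IPv4 addresses.
RULES = [
    (re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s*"), ""),   # position-invariant timestamp
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "*"),              # conventional field: IPv4
]

def preprocess(line):
    """Apply each filtering rule in turn to remove irrelevant information."""
    for pattern, repl in RULES:
        line = pattern.sub(repl, line)
    return line

print(preprocess("2017-10-07 12:00:13 RAS KERNEL INFO generating core"))
# -> "RAS KERNEL INFO generating core"
```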
Step 306: determine a matching log cluster among the plurality of log clusters according to the length and the head and tail keywords of the target log.
Specifically, using the length and the head and tail keywords of the log to be parsed directly as a key, a cluster whose log length and head and tail keywords both match the log to be parsed is obtained from the plurality of clusters through a cache lookup.
Step 307: determine the adaptive similarity between the target log and the matching cluster, and the adaptive similarity threshold.
Specifically, the adaptive similarity is computed in the same way as the similarity in the second embodiment of the present application; the adaptive similarity threshold is obtained by multiplying the similarity threshold of the second embodiment by the quality score.
Further, the adaptive similarity sim can be determined by the following formula:
sim = (Σ_{i=1}^{l} cost(seq_i, tem_i)) / n_c
where seq is the log to be parsed, tem is a log in the matching cluster, cost(seq_i, tem_i) denotes an XNOR operation, i.e., cost is 1 if seq_i and tem_i are identical and 0 otherwise, seq_i is the i-th field of seq, tem_i is the i-th field of tem, l is the length of the log to be parsed, and n_c is the number of constant fields of the compared log in the cluster.
Further, the adaptive similarity threshold t′ can be determined by the following formula:
t′ = q · float(n_c − diff) / n_c
where diff denotes the count of positions at which the log to be parsed and the compared log in the cluster differ, float() denotes casting the variable to a floating-point type, n_c is the number of constant fields of the currently compared log in the cluster, and q is the quality score of the matching cluster.
Step 308: determine whether the adaptive similarity is greater than the adaptive similarity threshold; if yes, proceed to step 309, otherwise proceed to step 310.
Step 309: insert the target log into the matching log cluster.
In a specific example, the log to be parsed is "RAS KERNEL INFO generating core" and the cluster already contains the log "RAS KERNEL*generating core"; the adaptive similarity is
sim = 4 / 4 = 1
and the adaptive similarity threshold is
t′ = q · (4 − 1) / 4 = 0.75q
where q < 1, i.e., the adaptive similarity is greater than the adaptive similarity threshold, so the log is inserted into this cluster.
Step 310: create a log cluster from the target log.
In a specific example, parsing logs using the plurality of log clusters and the quality scores includes: obtaining the whole data set and processing the target logs in turn. The current original log to be parsed is "2017-10-07 12:00:13 Delete block blk_2342" and the cluster "Delete block*" already exists. After regular-expression filtering, the log to be parsed becomes "Delete block blk_2342". Since its length is "length=3" and its head and tail keywords are "Delete" and "*" ("blk_2342" contains the special character "_" and is replaced with "*"), the cluster "Delete block*" is determined as the matching cluster. Comparing the adaptive similarity between this log and the logs in the cluster gives
sim = 2 / 2 = 1
and the adaptive similarity threshold is
t′ = q · (2 − 1) / 2 = 0.5q
where q < 1, so the log can be inserted directly into this cluster.
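The adaptive threshold, i.e. the plain threshold scaled by the cluster's quality score q, can be sketched as follows. The quality score value is illustrative, and the formula is reconstructed from the lost formula images consistently with the worked example above.

```python
def similarity(seq, tem):
    """Matching positions divided by n_c, the constant-field count of tem."""
    n_c = sum(1 for tok in tem if tok != "*")
    return sum(1 for s, t in zip(seq, tem) if s == t) / n_c

def adaptive_threshold(seq, tem, quality):
    """t' = q * (n_c - diff) / n_c: the second embodiment's threshold
    scaled by the matching cluster's quality score q (0 < q < 1)."""
    n_c = sum(1 for tok in tem if tok != "*")
    diff = sum(1 for s, t in zip(seq, tem) if s != t)
    return quality * (n_c - diff) / n_c

seq = "Delete block blk_2342".split()
tem = "Delete block *".split()
q = 0.9                                  # illustrative quality score
sim = similarity(seq, tem)               # 2/2 = 1.0
t = adaptive_threshold(seq, tem, q)      # 0.9 * (2-1)/2 = 0.45
print(sim > t)                           # True -> insert into the matched cluster
```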
It is worth mentioning that while the target logs are being parsed, the plurality of log clusters already obtained can also be iteratively re-clustered and the quality score of each cluster recomputed. The clusters are continuously updated along with the real-time parsing process, further improving parsing accuracy during parsing.
Compared with the prior art, this embodiment obtains sample log data; clusters the sample log data according to the length and the head and tail keywords of each sample log to obtain a plurality of log clusters; determines the quality score of each cluster; and parses logs using the plurality of log clusters and the quality scores, wherein the parsing includes: obtaining a target log to be parsed; filtering out irrelevant information from the target log using regular expressions; determining a matching cluster among the plurality of log clusters according to the length and the head and tail keywords of the target log, and determining the adaptive similarity between the target log and the matching cluster; determining the adaptive similarity threshold between them according to the quality score; and inserting the target log into the matching cluster if the adaptive similarity is greater than the threshold, otherwise creating a log cluster from the target log. The clusters most similar to the log to be parsed are found quickly via the log length and head and tail keywords, the best-matching cluster is then determined by each cluster's quality score, and re-clustering and score recomputation are updated iteratively, greatly increasing parsing speed while guaranteeing parsing accuracy.
A fourth embodiment of the present application relates to a log parsing method. This embodiment further improves on the first embodiment: after parsing logs using the plurality of log clusters and the quality scores, the method further includes splitting log clusters that meet a preset splitting condition, wherein the splitting condition includes: the number of logs in the log cluster exceeds a preset log-number threshold.
The log parsing method of this embodiment is shown in the flowchart of FIG. 4 and may specifically include the following steps:
Step 401: obtain sample log data.
Step 402: cluster the sample log data according to the length and the head and tail keywords of each sample log to obtain a plurality of log clusters.
Step 403: determine the quality score of each log cluster in the plurality of log clusters.
Step 404: parse logs using the plurality of log clusters and the quality scores.
Steps 401 to 404 of this embodiment are similar to steps 101 to 104 of the first embodiment and are not repeated here.
Step 405: determine the log clusters to be split.
Specifically, each log cluster whose number of logs exceeds the preset log-number threshold is split into multiple clusters. If a cluster contains fewer logs than the threshold, i.e., too few logs have been parsed into it so far and the parameters at each log position carry limited information, the cluster is not split for the time being. If a cluster contains more logs than the threshold, i.e., many logs have been parsed into it, parsing speed is affected; the preset log-number threshold can be set by technicians according to the actual situation.
Step 406: determine the candidate positions of a cluster to be split.
Specifically, the positions in each cluster to be split whose number of fields is less than a preset field-count threshold and greater than 1 are determined as candidate positions. If the number of fields at a position in the logs exceeds the preset field-count threshold, there are very many kinds of parameters at that position, which to some extent indicates a higher likelihood that the position is a parameter; in that case the cluster is not split at that position. The field-count threshold can be set by technicians according to the actual situation.
Step 407: determine the optimal position according to the Gini value of each candidate position.
Specifically, the Gini values of all candidate positions are computed (the Gini value is the index used by the classic CART decision tree to select the optimal feature in classification problems) and each is compared with the maximum Gini value of the log parameter; among the candidate positions whose comparison result is less than the maximum Gini value of the log parameter, the position with the largest Gini value is selected as the optimal position. The Gini value gini of a position can be determined by the following formula:
gini = 1 − Σ_{i=1}^{l} (cnt_i / cnt)²
where l is the total number of unique fields appearing at the current position, cnt_i is the number of occurrences of the corresponding i-th unique field, and cnt is the total number of fields appearing at the position.
Specifically, a position can serve as a splitting candidate only if the entropy of the parameters at that position exceeds the minimum entropy of the log parameter; hence the maximum Gini value of the log parameter is set, and when the Gini value of a candidate position is less than the maximum Gini value of the log parameter, the condition that the parameter entropy is greater than the minimum parameter entropy is satisfied. The maximum Gini value of the log parameter can be set by technicians according to the actual situation.
Step 408: split the cluster to be split into multiple clusters according to the optimal position.
Specifically, the different fields at the optimal position are assigned to different clusters while the fields at the other positions remain unchanged, splitting the cluster to be split into multiple clusters.
In a specific example, suppose a given cluster contains "RAS KERNEL INFO generating core" and "RAS KERNEL FAILED generating core". First it is determined whether the number of logs in the cluster exceeds the preset log-number threshold; if not, splitting of the cluster is skipped. If the threshold is exceeded (e.g., the cluster contains 100 logs and the threshold is set to 50), the number of fields at each position of the cluster's log signatures is examined: position 1 has only one field, RAS; position 2 has only one field, KERNEL; position 3 has two different fields, INFO and FAILED. Positions with only one field are skipped, and positions whose field count is less than the preset field-count threshold are taken as candidate positions (position 2 and position 3 are selected as candidate positions here). The Gini value of each candidate position is computed and compared with the maximum Gini value of the log parameter, the candidate with the largest Gini value is chosen as the optimal position, and the fields at that position are split. For example, splitting the cluster by the fields INFO and FAILED at position 3 yields the corresponding clusters "RAS KERNEL INFO generating core" and "RAS KERNEL FAILED generating core".
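The Gini computation and the choice of a split position can be sketched as follows. The `max_gini` setting stands in for the "maximum Gini value of the log parameter" and is an illustrative value, as is the simplification of skipping only constant positions.

```python
from collections import Counter

def gini(column):
    """gini = 1 - sum((cnt_i / cnt)^2) over the unique fields at one position."""
    cnt = len(column)
    return 1 - sum((c / cnt) ** 2 for c in Counter(column).values())

def best_split_position(logs, max_gini=0.6):
    """Among positions with more than one distinct field, pick the candidate
    whose Gini value is largest while staying below max_gini (a stand-in for
    the 'maximum Gini value of the log parameter'; 0.6 is illustrative)."""
    best_pos, best_val = None, -1.0
    for pos in range(len(logs[0])):
        column = [log[pos] for log in logs]
        if len(set(column)) <= 1:
            continue                      # constant position, skip
        g = gini(column)
        if g < max_gini and g > best_val:
            best_pos, best_val = pos, g
    return best_pos

logs = [
    "RAS KERNEL INFO generating core".split(),
    "RAS KERNEL FAILED generating core".split(),
]
print(best_split_position(logs))  # index 2, i.e. the INFO/FAILED position
```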
Compared with the prior art, this embodiment obtains sample log data; clusters the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters; determines the quality score of each cluster; parses logs using the plurality of log clusters and the quality scores; and splits log clusters that meet the preset splitting condition, wherein the splitting condition includes: the number of logs in the log cluster exceeds the preset log-number threshold. Splitting clusters whose log count exceeds the threshold both raises the quality score of each cluster and prevents parsing from slowing down because a cluster contains too many logs; it is also an update and optimization of all the clusters.
A fifth embodiment of the present application relates to a log parsing method. This embodiment further improves on the fourth embodiment: after each log cluster whose number of logs exceeds the preset log-number threshold is split into multiple log clusters, the method further includes: in each cluster obtained by splitting, querying logs in which a keyword appears consecutively while the other fields are identical, as candidate deletion logs; and among the candidate deletion logs, deleting all logs except the log in which the keyword appears the fewest times.
The log parsing method of this embodiment is shown in the flowchart of FIG. 5 and may specifically include the following steps:
Step 501: obtain sample log data.
Step 502: cluster the sample log data according to the length and the head and tail keywords of each sample log to obtain a plurality of log clusters.
Step 503: determine the quality score of each log cluster in the plurality of log clusters.
Step 504: parse logs using the plurality of log clusters and the quality scores.
Steps 501 to 504 of this embodiment are similar to steps 101 to 104 of the first embodiment and are not repeated here.
Step 505: split each log cluster whose number of logs exceeds the preset log-number threshold into multiple log clusters.
Step 506: among the logs in which a keyword appears consecutively while the other fields are identical, delete all logs except the log in which the keyword appears the fewest times.
Specifically, this step mainly addresses the situation where logs carrying the same information have different lengths, considering mainly the following two cases:
(1) consecutive identical preprocessed variables are merged into a single preprocessed variable, e.g., both "delete block blk_object blk_object" and "delete block blk_object blk_object blk_object" are normalized to "delete block blk_object";
(2) consecutive parameter wildcards are merged into a single parameter wildcard, e.g., both "A B**C" and "A B***C" are normalized to "A B*C".
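Both normalization cases can be sketched as one pass that collapses consecutive duplicates; the function name and the split-on-whitespace tokenization are our assumptions.

```python
import re

def normalize(line):
    """Collapse consecutive duplicates: (2) runs of the parameter wildcard '*'
    become a single '*', then (1) runs of an identical token become one token."""
    line = re.sub(r"\*{2,}", "*", line)      # case (2): adjacent wildcards
    out = []
    for tok in line.split():
        if not out or out[-1] != tok:        # case (1): repeated variables
            out.append(tok)
    return " ".join(out)

print(normalize("delete block blk_object blk_object blk_object"))
# -> "delete block blk_object"
print(normalize("A B***C"))
# -> "A B*C"
```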
Compared with the prior art, this embodiment obtains sample log data; clusters the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters; determines the quality score of each cluster; parses logs using the plurality of log clusters and the quality scores; splits each cluster whose log count exceeds the preset log-number threshold into multiple clusters; in each cluster obtained by splitting, queries logs in which a keyword appears consecutively while the other fields are identical as candidate deletion logs; and among the candidate deletion logs, deletes all logs except the one in which the keyword appears the fewest times. Deleting duplicate logs in which a keyword appears consecutively while the other fields are identical improves parsing speed.
The division of the steps of the above methods is only for clarity of description; in implementation, steps may be merged into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as the same logical relationship is included. Adding insignificant modifications to the algorithms or flows, or introducing insignificant designs, without changing the core design of the algorithms and flows, also falls within the protection scope of this patent.
A sixth embodiment of the present application relates to a log parsing apparatus, including: an obtaining module 601, a clustering module 602, a scoring module 603 and a parsing module 604, with the specific structure shown in FIG. 6:
the obtaining module 601 is configured to obtain sample log data;
the clustering module 602 is configured to cluster the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters;
the scoring module 603 is configured to determine the quality score of each log cluster in the plurality of log clusters;
the parsing module 604 is configured to parse logs using the plurality of log clusters and the quality scores.
In a specific example, clustering the sample log data according to the length and the head and tail keywords of each sample log in the sample log data includes performing the following processing on each sample log in the sample log data: determining whether a log cluster matching the length of the sample log exists; if yes, obtaining the log cluster matching the length of the sample log, and if not, creating a log cluster from the sample log; determining whether, among the log clusters matching the length of the sample log, a log cluster matching the head and tail keywords of the sample log exists; if yes, obtaining the log cluster matching the head and tail keywords of the sample log, and if not, creating a log cluster from the sample log; and determining the similarity between the sample log and all log signatures in the log clusters matching the head and tail keywords of the sample log; if any of the similarities is higher than a preset threshold, inserting the sample log into the cluster to which the log with that similarity belongs, and otherwise creating a log cluster from the sample log.
In a specific example, determining the quality score of each log cluster in the plurality of log clusters includes: determining the quality score of each cluster according to the intra-cluster compactness of each of the log clusters and the inter-cluster separability between different log clusters.
In a specific example, parsing logs using the plurality of log clusters and the quality scores includes: obtaining a target log to be parsed; filtering out irrelevant information from the target log using regular expressions; determining a matching cluster among the plurality of log clusters according to the length and the head and tail keywords of the target log, and determining the adaptive similarity between the target log and the matching cluster; determining the adaptive similarity threshold between the target log and the matching cluster according to the quality score; and determining whether the adaptive similarity is greater than the adaptive similarity threshold; if yes, inserting the target log into the matching cluster, and otherwise creating a log cluster from the target log.
In a specific example, after parsing logs using the plurality of log clusters and the quality scores, the method further includes: splitting log clusters that meet a preset splitting condition, wherein the splitting condition includes: the number of logs in the log cluster exceeds the preset log-number threshold.
In a specific example, splitting log clusters that meet the preset splitting condition includes: determining the positions in the cluster to be split whose number of fields is less than the preset field-count threshold as candidate positions; computing the Gini values of all the candidate positions, and splitting the cluster according to the candidate position corresponding to the largest Gini value in the cluster to be split.
In a specific example, after splitting the log clusters that meet the preset splitting condition, the method further includes: in each cluster obtained by splitting, querying logs in which a keyword appears consecutively while the other fields are identical as candidate deletion logs; and among the candidate deletion logs, deleting all logs except the one in which the keyword appears the fewest times.
It is easy to see that this embodiment is an apparatus embodiment corresponding to the first embodiment and can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the related technical details mentioned in this embodiment can also be applied in the first embodiment.
It is worth mentioning that the modules involved in this embodiment are all logical modules; in practical applications, a logical unit may be one physical unit, part of a physical unit, or implemented as a combination of multiple physical units. In addition, to highlight the innovative part of the present application, units not closely related to solving the technical problem raised by the application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
A seventh embodiment of the present application relates to a server. As shown in FIG. 7, the server includes at least one processor 701; a memory 702 communicatively connected to the at least one processor 701; and a communication component 703 communicatively connected to the log parsing apparatus, the communication component 703 receiving and sending data under the control of the processor 701. The memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 to implement the above log parsing method embodiments.
Specifically, the electronic device includes one or more processors 701 and a memory 702; one processor 701 is taken as an example in FIG. 7. The processor 701 and the memory 702 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 7. The memory 702, as a computer-readable storage medium, can be used to store computer software programs, computer-executable programs and modules. The processor 701 executes the various functional applications and data processing of the device, i.e., implements the above log parsing method, by running the computer software programs, instructions and modules stored in the memory 702.
The memory 702 may include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application program required by at least one function, and the data storage area may store a list of options and the like. In addition, the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 702 may optionally include memories remotely arranged relative to the processor 701, and these remote memories may be connected to an external device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 702 and, when executed by the one or more processors 701, perform the log parsing method of any of the above method embodiments.
The above product can perform the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to performing the method; for technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
The memory and the processor are connected by a bus; the bus may include any number of interconnected buses and bridges and connects the one or more processors and the various circuits of the memory together. The bus may also connect various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore not further described herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna; further, the antenna also receives data and transmits it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions, while the memory may be used to store data used by the processor when performing operations.
An eighth embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above log parsing method embodiments.
That is, those skilled in the art will understand that all or part of the steps of the methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific examples of implementing the present application and that, in practical applications, various changes in form and detail may be made to them without departing from the spirit and scope of the present application.

Claims (11)

  1. A log parsing method, comprising:
    obtaining sample log data;
    clustering the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters;
    determining a quality score of each log cluster in the plurality of log clusters;
    parsing logs using the plurality of log clusters and the quality scores;
    wherein the quality score is used to determine an adaptive similarity threshold for a log during the parsing.
  2. The log parsing method according to claim 1, wherein clustering the sample log data according to the length and the head and tail keywords of each sample log in the sample log data comprises:
    performing the following processing on each sample log in the sample log data:
    determining whether a log cluster matching the length of the sample log exists; if yes, obtaining the log cluster matching the length of the sample log; if not, creating a log cluster from the sample log;
    determining whether, among the log clusters matching the length of the sample log, a log cluster matching the head and tail keywords of the sample log exists; if yes, obtaining the log cluster matching the head and tail keywords of the sample log; if not, creating a log cluster from the sample log;
    determining the similarity between the sample log and all log signatures in the log clusters matching the head and tail keywords of the sample log; if any of the similarities is higher than a preset threshold, inserting the sample log into the log cluster to which the log corresponding to the similarity higher than the preset threshold belongs; otherwise, creating a log cluster from the sample log.
  3. The log parsing method according to claim 1, wherein determining the quality score of each log cluster in the plurality of log clusters comprises:
    determining the quality score of each log cluster according to the intra-cluster compactness of each of the log clusters and the inter-cluster separability between different log clusters.
  4. The log parsing method according to claim 3, wherein the compactness is defined as the ratio of constants in a log to its total length, and the separability is defined as the difference between the discrete item sets of two different logs;
    wherein a discrete item set is a set of discrete item pairs, and a discrete item pair refers to any of the item pairs generated from every two fields of each log.
  5. The log parsing method according to claim 1, wherein parsing logs using the plurality of log clusters and the quality scores comprises:
    obtaining a target log to be parsed;
    filtering out irrelevant information from the target log using regular expressions;
    determining a matching log cluster among the plurality of log clusters according to the length and the head and tail keywords of the target log, and determining an adaptive similarity between the target log and the matching log cluster;
    determining an adaptive similarity threshold between the target log and the matching log cluster according to the quality score;
    determining whether the adaptive similarity is greater than the adaptive similarity threshold; if yes, inserting the target log into the matching log cluster; otherwise, creating a log cluster from the target log.
  6. The log parsing method according to claim 1, further comprising, after parsing logs using the plurality of log clusters and the quality scores:
    splitting log clusters that meet a preset splitting condition;
    wherein the splitting condition comprises: the number of logs in the log cluster exceeds a preset log-number threshold.
  7. The log parsing method according to claim 6, wherein splitting the log clusters that meet the preset splitting condition comprises:
    determining positions in the log cluster to be split whose number of fields is less than a preset field-count threshold as candidate positions;
    computing the Gini values of all the candidate positions, and splitting the log cluster according to the candidate position corresponding to the largest Gini value in the log cluster to be split.
  8. The log parsing method according to claim 6, further comprising, after splitting the log clusters that meet the preset splitting condition:
    in each log cluster obtained by the splitting, querying logs in which a keyword appears consecutively while the other fields are identical, as candidate deletion logs;
    among the candidate deletion logs, deleting all the logs except the log in which the keyword appears the fewest times.
  9. A log parsing apparatus, comprising:
    an obtaining module configured to obtain sample log data;
    a clustering module configured to cluster the sample log data according to the length and the head and tail keywords of each sample log in the sample log data to obtain a plurality of log clusters;
    a scoring module configured to determine a quality score of each log cluster in the plurality of log clusters;
    a parsing module configured to parse logs using the plurality of log clusters and the quality scores;
    wherein the quality score is used to determine an adaptive similarity threshold for a log during the parsing.
  10. A server, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the log parsing method according to any one of claims 1 to 8.
  11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the log parsing method according to any one of claims 1 to 8.
PCT/CN2020/113060 2019-09-20 2020-09-02 Log parsing method, apparatus, server and storage medium WO2021052177A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/624,243 US20220365957A1 (en) 2019-09-20 2020-09-02 Log parsing method and device, server and storage medium
EP20866066.2A EP3968178A4 (en) 2019-09-20 2020-09-02 METHOD AND DEVICE FOR LOG-PARSING, SERVER AND STORAGE MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910893383.2 2019-09-20
CN201910893383.2A CN112541074A (zh) 2019-09-20 2019-09-20 Log parsing method, apparatus, server and storage medium

Publications (1)

Publication Number Publication Date
WO2021052177A1 true WO2021052177A1 (zh) 2021-03-25

Family

ID=74883875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113060 WO2021052177A1 (zh) 2019-09-20 2020-09-02 日志解析方法、装置、服务器和存储介质

Country Status (4)

Country Link
US (1) US20220365957A1 (zh)
EP (1) EP3968178A4 (zh)
CN (1) CN112541074A (zh)
WO (1) WO2021052177A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114584619A (zh) * 2022-03-07 2022-06-03 北京北信源软件股份有限公司 设备数据解析方法、装置、电子设备及存储介质
CN114722081A (zh) * 2022-06-09 2022-07-08 杭银消费金融股份有限公司 一种基于中转库模式的流式数据时间序列传输方法及系统

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282751B (zh) * 2021-05-28 2023-12-15 腾讯科技(深圳)有限公司 日志分类方法及装置
CN113553309A (zh) * 2021-07-28 2021-10-26 恒安嘉新(北京)科技股份公司 一种日志模板的确定方法、装置、电子设备及存储介质
CN114385396B (zh) * 2021-12-27 2023-03-24 华青融天(北京)软件股份有限公司 一种日志解析方法、装置、设备及介质
CN115794563B (zh) * 2023-02-06 2023-04-11 北京升鑫网络科技有限公司 一种系统审计日记的降噪方法、装置、设备及可读介质
US12056090B1 (en) 2023-05-10 2024-08-06 Micro Focus Llc Automated preprocessing of complex logs
CN117033464B (zh) * 2023-08-11 2024-04-02 上海鼎茂信息技术有限公司 一种基于聚类的日志并行解析算法及应用
CN116910592B (zh) * 2023-09-13 2023-11-24 中移(苏州)软件技术有限公司 日志检测方法、装置、电子设备及存储介质
CN117234776A (zh) * 2023-09-18 2023-12-15 厦门国际银行股份有限公司 一种批处理报错作业的智能判定方法、装置及设备
CN118133207B (zh) * 2024-04-30 2024-08-06 苏州元脑智能科技有限公司 跨领域日志异常检测模型构建方法、装置、设备及介质
CN118568309A (zh) * 2024-07-31 2024-08-30 中南大学 一种基于日志审计的企业内部数据流通全流程追溯方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196174A1 (en) * 2015-01-02 2016-07-07 Tata Consultancy Services Limited Real-time categorization of log events
US20180102938A1 (en) * 2016-10-11 2018-04-12 Oracle International Corporation Cluster-based processing of unstructured log messages
US20180241654A1 (en) * 2017-02-21 2018-08-23 Hewlett Packard Enterprise Development Lp Anomaly detection
CN109032909A (zh) * 2018-07-18 2018-12-18 携程旅游信息技术(上海)有限公司 应用程序崩溃日志的处理方法、系统、设备和存储介质
CN109993179A (zh) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 一种对数据进行聚类的方法和装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6871201B2 (en) * 2001-07-31 2005-03-22 International Business Machines Corporation Method for building space-splitting decision tree
US11226975B2 (en) * 2015-04-03 2022-01-18 Oracle International Corporation Method and system for implementing machine learning classifications
WO2016199433A1 (ja) * 2015-06-11 2016-12-15 日本電気株式会社 メッセージ分析装置、メッセージ分析方法、および、記憶媒体
US11170177B2 (en) * 2017-07-28 2021-11-09 Nia Marcia Maria Dowell Computational linguistic analysis of learners' discourse in computer-mediated group learning environments
US11640421B2 (en) * 2019-05-14 2023-05-02 International Business Machines Corporation Coverage analysis with event clustering

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196174A1 (en) * 2015-01-02 2016-07-07 Tata Consultancy Services Limited Real-time categorization of log events
US20180102938A1 (en) * 2016-10-11 2018-04-12 Oracle International Corporation Cluster-based processing of unstructured log messages
US20180241654A1 (en) * 2017-02-21 2018-08-23 Hewlett Packard Enterprise Development Lp Anomaly detection
CN109993179A (zh) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 一种对数据进行聚类的方法和装置
CN109032909A (zh) * 2018-07-18 2018-12-18 携程旅游信息技术(上海)有限公司 应用程序崩溃日志的处理方法、系统、设备和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3968178A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114584619A (zh) * 2022-03-07 2022-06-03 Beijing VRV Software Co., Ltd. Device data parsing method and apparatus, electronic device, and storage medium
CN114584619B (zh) * 2022-03-07 2024-02-23 Beijing VRV Software Co., Ltd. Device data parsing method and apparatus, electronic device, and storage medium
CN114722081A (zh) * 2022-06-09 2022-07-08 Hangyin Consumer Finance Co., Ltd. Streaming data time-series transmission method and system based on a transfer repository mode
CN114722081B (zh) * 2022-06-09 2022-09-02 Hangyin Consumer Finance Co., Ltd. Streaming data time-series transmission method and system based on a transfer repository mode

Also Published As

Publication number Publication date
US20220365957A1 (en) 2022-11-17
EP3968178A1 (en) 2022-03-16
EP3968178A4 (en) 2022-07-06
CN112541074A (zh) 2021-03-23

Similar Documents

Publication Publication Date Title
WO2021052177A1 (zh) Log parsing method, apparatus, server, and storage medium
US11714862B2 (en) Systems and methods for improved web searching
WO2021000671A1 (zh) Database query method, apparatus, server, and medium
US11321421B2 (en) Method, apparatus and device for generating entity relationship data, and storage medium
US9922102B2 (en) Templates for defining fields in machine data
US9342553B1 (en) Identifying distinct combinations of values for entities based on information in an index
JP5155001B2 (ja) Document retrieval device
US20150234927A1 (en) Application search method, apparatus, and terminal
US9177251B2 (en) Impulse regular expression matching
WO2016078592A1 (zh) Batch data query method and apparatus
US9753977B2 (en) Method and system for managing database
US20220358178A1 (en) Data query method, electronic device, and storage medium
US20230252140A1 (en) Methods and systems for identifying anomalous computer events to detect security incidents
US11113348B2 (en) Device, system, and method for determining content relevance through ranked indexes
WO2015188315A1 (zh) Data query method, apparatus, and system
US10019483B2 (en) Search system and search method
US11531665B2 (en) Automated database index management
KR101598471B1 (ko) RDF triple data type-based data storage and retrieval system
CN110716900A (zh) Data query method and system
US11256694B2 (en) Tolerance level-based tuning of query processing
WO2021143010A1 (zh) Response method and device for distributed computing tasks
CN113656659A (zh) Data extraction method, apparatus, system, and computer-readable storage medium
CN113590650A (zh) Feature-expression-based structured query statement discrimination method and apparatus
CN114090558A (zh) Data quality management method and apparatus for a database
CN103891244B (zh) Method and apparatus for data storage and retrieval

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20866066

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020866066

Country of ref document: EP

Effective date: 20211210

NENP Non-entry into the national phase

Ref country code: DE