WO2020259280A1 - Log management method and apparatus, network device, and readable storage medium - Google Patents

Log management method and apparatus, network device, and readable storage medium

Info

Publication number
WO2020259280A1
WO2020259280A1
Authority
WO
WIPO (PCT)
Prior art keywords
log
word
speech
logs
word weight
Prior art date
Application number
PCT/CN2020/095162
Other languages
English (en)
French (fr)
Inventor
韩静
刘建伟
叶峰
陈力
王超锋
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Priority to EP20830538.3A (EP3992812A4)
Publication of WO2020259280A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present disclosure relates to, but is not limited to, the field of computer technology.
  • embodiments of the present disclosure provide a log management method, including: determining keywords in a log to be extracted; determining the word weight corresponding to each of the keywords; calculating the edit distance between logs on the basis of log entries and according to the word weights; clustering the logs based on the edit distance between the logs; and extracting a log backbone from the clustering results.
  • embodiments of the present disclosure also provide a log management apparatus, including: a keyword extraction module configured to determine keywords in a log to be extracted; a word weight determination module configured to determine the word weight corresponding to each of the keywords; an edit distance calculation module configured to calculate the edit distance between logs on the basis of log entries and according to the word weights; a clustering module configured to cluster the logs based on the edit distance between the logs; and a backbone extraction module configured to extract a log backbone from the clustering results.
  • embodiments of the present disclosure also provide a network device, which includes a processor, a memory, and a communication bus; the communication bus is used to implement connection and communication between the processor and the memory, and the processor is used to execute one or more computer programs stored in the memory to implement any log management method described herein.
  • embodiments of the present disclosure also provide a computer storage medium; the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement any log management method described herein.
  • FIG. 1 is a flowchart of a log management method provided by some embodiments of the present disclosure;
  • FIG. 2 is a flowchart of an edit distance calculation method provided by some embodiments of the present disclosure;
  • FIG. 3 is a flowchart of a clustering method provided by some embodiments of the present disclosure;
  • FIG. 4 is a flowchart of a word weight calculation method provided by some embodiments of the present disclosure;
  • FIG. 5 is a flowchart of another word weight calculation method provided by some embodiments of the present disclosure;
  • FIG. 6 is a schematic diagram of the composition of a log management apparatus provided by some embodiments of the present disclosure;
  • FIG. 7 is a schematic structural diagram of a network device provided by some embodiments of the present disclosure.
  • a log management method is provided. Referring to FIG. 1, the log management method includes the following steps S101 to S105: S101, determining keywords in a log to be extracted; S102, determining the word weight corresponding to each keyword; S103, calculating the edit distance between logs on the basis of log entries and according to the word weights; S104, clustering the logs based on the edit distance between the logs; S105, extracting a log backbone from the clustering results.
  • extracting the log backbone facilitates log management and reduces the number of logs that need to be processed, and it directly affects the effectiveness of system anomaly detection and fault localization.
  • the edit distance is combined with the word weights of the keywords to measure the degree of difference precisely, and the logs are clustered based on that degree of difference.
  • the edit distance is an important measure of the degree of difference between texts and is often used in fields such as text clustering.
  • the edit distance is a quantitative measure of the difference between two character strings (for example, English words): it counts the minimum number of operations required to turn one string into the other. The edit distance can be used in natural language processing, for example in spell checking, where the edit distance between a misspelled word and correct words suggests which word (or words) was most likely intended.
  • DNA can also be regarded as a string over the alphabet A, C, G, and T, so the edit distance is also used in bioinformatics to judge the similarity of two DNA sequences.
  • there are several different definitions of the edit distance; they differ in which operations may be performed on the strings: the Levenshtein distance allows deletion, insertion, and substitution of any character and is the most common definition; the Damerau-Levenshtein distance additionally allows swapping two adjacent characters in a single operation; the LCS (longest common subsequence) distance allows only deletions and insertions; the Jaro distance allows only transpositions; the Hamming distance allows only substitutions.
  • the word weight is the weight carried by the word corresponding to a keyword in the log; the higher a keyword's weight, the greater its influence on the clustering of the log when the edit distance is calculated. For example, if logs A and B both contain a high-weight keyword x, the word-weighted edit distance between A and B will, because of x, be smaller than the edit distance calculated directly between the two, so the clustering process comes closer to the true clustering.
  • determining the keywords in the log data to be extracted may include: determining, as keywords, the words in the log other than vocabulary of preset forms.
  • vocabulary of preset forms, such as fixed phrases, prepositions, and conjunctions, has no concrete meaning and serves only as transitions, so such words are not regarded as keywords; a keyword is a word with actual meaning that characterizes the log, such as a verb, which may indicate the action of the log, or a noun, which may indicate the executor or the object of the log.
  • determining, as keywords, the words in the log other than vocabulary of preset forms may include: using regular expressions to remove the vocabulary of preset forms from the log and taking the remaining words as the keywords.
  • determining the word weight corresponding to each keyword may include: calculating a first word weight based on the part of speech of the keyword, and outputting the first word weight as the result.
  • the word weight can be determined directly according to the part of speech of the keyword; for example, verbs and nouns carry relatively large weights, while adjectives and adverbs carry relatively small weights.
  • calculating the first word weight may include: inputting the keyword into a part-of-speech tagging tool to obtain the part of speech of the word, and assigning a weight to the word according to the part of speech to obtain the first word weight; the parts of speech include nouns, verbs, adjectives, and others.
  • determining the word weight corresponding to each keyword may further include, after calculating the first word weight based on the part of speech of the keyword: judging whether a clustering/classification label exists, and if so, calculating a second word weight based on the tf-idf (Term Frequency-Inverse Document Frequency) index; and averaging the first word weight and the second word weight to obtain an average word weight, which is output as the result.
  • calculating the second word weight based on the tf-idf index may include: organizing the classes and text features according to the clustering/classification labels, and calculating the tf-idf weight of each word.
  • determining the word weight corresponding to each keyword may include: training a part-of-speech probability neural network model; assigning the weights that keywords of different parts of speech carry in log backbone extraction; calculating each part-of-speech probability with the neural network; and calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
  • training the part-of-speech probability neural network model includes: training the model on existing Chinese and English data sets; for a single sentence, the model can judge the part-of-speech probability value of each word, such as verb, noun, or adjective.
  • assigning the weights that keywords of different parts of speech carry in log backbone extraction may include: presetting, based on prior knowledge, the weights that words of different parts of speech exert during log backbone extraction. Prior knowledge refers to empirical conclusions drawn from past extractions: the weights that certain words carried in historical logs serve as a reference for the current backbone extraction, and the preset weights can be adjusted for different log data sets.
  • calculating the edit distance between logs may include: traversing all word-position pairs between two logs and updating the edit distance matrix using the word weights, and taking the lower-left element of the edit distance matrix as the edit distance.
  • clustering the logs based on the edit distance between the logs may include: calculating a log distance adjacency matrix based on the edit distance between the logs; calculating the degree of difference between the current log and each cluster in the set; adding logs whose degree of difference is less than or equal to a threshold to the current cluster, adding logs whose degree of difference is greater than the threshold to a new cluster, and updating the cluster information.
  • the threshold defines whether two logs belong to the same cluster.
  • if the difference value between two logs is less than or equal to the threshold, the current log is close to the logs in an existing cluster and can be grouped with them, so the log is added to the current cluster; if the difference value is greater than the threshold, the current log is far from the logs in the existing clusters, so a new cluster is created based on this log and the log is placed in the new cluster.
  • before calculating the log distance adjacency matrix based on the edit distance between the logs, the method may further include: initializing the cluster set.
  • the initialization includes: initializing a set in which every log initially forms its own cluster.
  • calculating the degree of difference between the current log and each cluster in the set may include: traversing all clusters in the set, calculating the degree of difference between the log and each cluster, and finding the cluster with the smallest degree of difference, where the degree of difference is the maximum edit distance between the log and the logs in the cluster.
  • this embodiment provides a log management method: determining keywords in a log to be extracted; determining the word weight corresponding to each keyword; calculating the edit distance between logs on the basis of log entries and according to the word weights; clustering the logs based on the edit distance between the logs; and extracting the log backbone from the clustering results.
  • clustering by combining the word weights of the keywords with the edit distance significantly improves the accuracy of log backbone extraction; the method is easy to implement, and the clustering process is efficient.
  • FIG. 1 is a flowchart of a log backbone extraction method based on the word-weighted edit distance according to an embodiment of the present disclosure. As shown in FIG. 1, the process includes steps S101 to S105 as described above.
  • log keyword extraction includes: using regular expressions to remove parameters of known forms from the log, and treating the remaining words as keywords.
  • any one of the word weight calculation methods in the embodiments of the present disclosure may be used, or multiple calculation methods may be combined.
  • calculating the word-weighted edit distance between each pair of logs includes the following steps S1030 to S1033:
  • S1031, initializing the edit distance matrix for the pair; S1032, traversing all word-position pairs between the two logs and updating the edit distance matrix, using the precomputed word weight information when calculating the transfer cost; S1033, taking the lower-left element of the edit distance matrix as the final word-weighted edit distance and filling it into the corresponding position of the log distance adjacency matrix and its diagonally symmetric position.
  • clustering based on the edit distance with word weight includes the following steps S301 to S307:
  • Step S301, initializing the cluster set;
  • Step S302, calculating the log distance adjacency matrix based on the word-weighted edit distance;
  • Step S303, calculating the degree of difference between the current log and each cluster in the set;
  • Step S304, judging whether the degree of difference is less than the threshold; if so, proceeding to step S305, otherwise proceeding to step S306;
  • Step S305, adding the log to the current cluster;
  • Step S306, creating a new cluster, adding the log to the new cluster, and updating the cluster set;
  • Step S307, judging whether all logs have been traversed; if so, the clustering is complete, otherwise returning to step S303 to continue the clustering process.
  • initializing the cluster set includes: initializing an empty set, and each element of the set is a cluster.
  • calculating the log distance adjacency matrix based on the word-weighted edit distance may include: first initializing the log distance adjacency matrix, then calculating the word-weighted edit distance for each pair of logs to fill and update it, and compressing the adjacency matrix structure.
  • calculating the degree of difference between the current log and each cluster in the set includes: traversing all clusters in the set, calculating the degree of difference between the log and each cluster, and finding the cluster with the smallest degree of difference, where the degree of difference is defined as the maximum word-weighted edit distance between the log and the logs in the cluster.
  • extracting the log backbone from the clustering results includes: after the clusters have stabilized, extracting the backbone from each cluster.
  • a word is considered to belong to the backbone if it appears in more than half of the logs in the cluster; all backbone words are kept, and the remaining words are removed or replaced with <p>.
  • after obtaining the log information, the method shown in FIG. 2 extracts the log keywords, calculates the word weights, calculates the word-weighted edit distance between each pair of logs, clusters based on the word-weighted edit distance, and then extracts the log backbone from the clustering results. This process helps technicians in related fields extract the log backbone effectively and with high accuracy, which brings convenience to subsequent log classification, log analysis, and anomaly detection.
  • this embodiment innovatively combines word weights with the edit distance and applies them to log backbone extraction, which extracts log features better and improves the accuracy and performance of backbone extraction.
  • this log backbone extraction method remains fast and satisfies real-time log analysis; by making full use of the semi-structured nature of logs, it improves the effectiveness and stability of backbone extraction, and the results are easy to reproduce, which facilitates follow-up analysis by professionals.
  • FIG. 4 is a flowchart of the method for calculating word weight according to an embodiment of the present disclosure. As shown in FIG. 4, the process includes the following steps S401 to S404:
  • Step S401, calculating the part-of-speech-based word weight;
  • Step S402, judging whether a manual clustering/classification label exists; if so, proceeding to step S403, otherwise proceeding to step S404;
  • Step S403, calculating the tf-idf-based word weight and averaging it with the word weight obtained in step S401;
  • Step S404, outputting the word weight.
  • log keyword extraction includes: using regular expressions, removing some known form parameters in the log, and treating the remaining words as keywords;
  • calculating the part-of-speech-based word weight includes: inputting the keywords into the nltk part-of-speech tagging tool to obtain each word's part of speech, and assigning weights by part of speech: nouns and verbs receive weight 3, adjectives receive weight 2, and other words receive weight 1, where weight 3 is greater than weight 2 and weight 2 is greater than weight 1; that is, content words such as nouns and verbs carry higher weights.
  • calculating the tf-idf-based word weight includes: organizing the classes and text features according to the manually curated clustering/classification labels; calculating the tf-idf weight of each word; and sorting the words by tf-idf weight in descending order, giving the top 30% weight 3, the next 30%-60% weight 2, and the remaining 60%-100% weight 1.
  • through the above steps, the word weights can be calculated, which helps the analysis of semi-structured text, such as log frequent-item detection and log anomaly detection.
  • a word weight calculation method that combines the characteristics of logs and uses a neural network for part-of-speech judgment is provided.
  • FIG. 5 is a flowchart of the word weight calculation method provided by some embodiments of the present disclosure; the process includes the following steps S501 to S504:
  • S501, training a part-of-speech probability neural network model; S502, assigning the part-of-speech weights that words of different parts of speech carry in log backbone extraction; S503, for a single log sentence, calculating each word's part-of-speech probabilities with the neural network; S504, for each log word, calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
  • training the part-of-speech probability neural network model includes: training the model on existing Chinese and English data sets; for a single sentence, the model can determine the part-of-speech probability value of each word, such as noun, verb, or adjective.
  • assigning the weights of words of different parts of speech for log backbone extraction includes: based on prior knowledge, manually setting the weights that words of different parts of speech may exert in log backbone extraction, and adjusting the weights for different log data sets.
  • the weights of the same word in different sentences can be averaged again and used as the final weight of the word, saving subsequent computation.
  • this embodiment innovatively uses a neural network for word weight calculation, avoiding the limitation of global processing that uses prior knowledge or statistical information directly; for the same word in different sentences, the neural network can make more accurate part-of-speech judgments, effectively improving the recognition rate of ambiguous words; combining prior knowledge to perform weighting based on the part-of-speech weights yields more accurate word weights, facilitating subsequent log processing.
  • the apparatus includes the following modules: a keyword extraction module 61 configured to determine keywords in a log to be extracted; a word weight determination module 62 configured to determine the word weight corresponding to each keyword; an edit distance calculation module 63 configured to calculate the edit distance between logs on the basis of log entries and according to the word weights; a clustering module 64 configured to cluster the logs based on the edit distance between the logs; and a backbone extraction module 65 configured to extract a log backbone from the clustering results.
  • determining the keywords in the log data to be extracted may include: determining, as keywords, the words in the log other than vocabulary of preset forms.
  • determining, as keywords, the words in the log other than vocabulary of preset forms may include: using regular expressions to remove the vocabulary of preset forms from the log and taking the remaining words as the keywords.
  • determining the word weight corresponding to each keyword may include: calculating a first word weight based on the part of speech of the keyword, and outputting the first word weight as the result.
  • calculating the first word weight may include: inputting the keyword into a part-of-speech tagging tool to obtain the part of speech of the word, and assigning a weight to the word according to the part of speech to obtain the first word weight; the parts of speech include nouns, verbs, adjectives, and others.
  • determining the word weight corresponding to each keyword may further include, after calculating the first word weight based on the part of speech of the keyword: judging whether a clustering/classification label exists, and if so, calculating a second word weight based on the tf-idf (term frequency-inverse document frequency) index; and averaging the first word weight and the second word weight to obtain an average word weight, which is output as the result.
  • calculating the second word weight based on the tf-idf index may include: organizing the classes and text features according to the clustering/classification labels, and calculating the tf-idf weight of each word.
  • determining the word weight corresponding to each keyword may include: training a part-of-speech probability neural network model; assigning the weights that keywords of different parts of speech carry in log backbone extraction; calculating each part-of-speech probability with the neural network; and calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
  • training the part-of-speech probability neural network model may include: training the model on existing data sets.
  • assigning the weights that keywords of different parts of speech carry in log backbone extraction may include: presetting, based on prior knowledge, the weights that words of different parts of speech exert during log backbone extraction.
  • calculating the edit distance between logs may include: traversing all word-position pairs between two logs and updating the edit distance matrix using the word weights, and taking the lower-left element of the edit distance matrix as the edit distance.
  • clustering the logs based on the edit distance between the logs may include: calculating a log distance adjacency matrix based on the edit distance between the logs; calculating the degree of difference between the current log and each cluster in the set; adding logs whose degree of difference is less than or equal to a threshold to the current cluster, adding logs whose degree of difference is greater than the threshold to a new cluster, and updating the cluster information.
  • before calculating the log distance adjacency matrix based on the edit distance between the logs, the method may further include: initializing the cluster set.
  • the initialization includes: initializing a set in which every log initially forms its own cluster.
  • calculating the degree of difference between the current log and each cluster in the set may include: traversing all clusters in the set, calculating the degree of difference between the log and each cluster, and finding the cluster with the smallest degree of difference, where the degree of difference is the maximum edit distance between the log and the logs in the cluster.
  • Some other embodiments of the present disclosure also provide a network device. As shown in FIG. 7, it includes a processor 71, a memory 72, and a communication bus 73.
  • the communication bus 73 is used to implement connection and communication between the processor 71 and the memory 72.
  • the processor 71 is configured to execute one or more computer programs stored in the memory 72 to implement the steps of the log management method in the foregoing embodiments, which will not be repeated here.
  • This embodiment also provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules, or other data).
  • Computer-readable storage media include, but are not limited to, RAM (random access memory), ROM (read-only memory), EEPROM (electrically erasable programmable read-only memory), flash memory or other memory technology, CD-ROM (compact disc read-only memory), digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage systems, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • the computer-readable storage medium in this embodiment can be used to store one or more computer programs, and the stored one or more computer programs can be executed by a processor to implement at least one step of the log management method in the foregoing embodiments.
  • This embodiment also provides a computer program (or computer software), which can be distributed on a computer-readable medium and executed by a computing system to implement at least one step of the log management method in the foregoing embodiments.
  • This embodiment also provides a computer program product, including a computer-readable system on which the computer program described above is stored.
  • the computer-readable system in this embodiment may include the computer-readable storage medium shown above.
  • communication media usually contain computer-readable instructions, data structures, computer program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. Therefore, the present disclosure is not limited to any specific hardware and software combination.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A log management method and apparatus, a network device, and a readable storage medium. The method includes: determining keywords in a log to be extracted (S101); determining the word weight corresponding to each keyword (S102); calculating the edit distance between logs on the basis of log entries and according to the word weights (S103); clustering the logs based on the edit distance between the logs (S104); and extracting a log backbone from the clustering results (S105).

Description

Log management method and apparatus, network device, and readable storage medium
Technical Field
The present disclosure relates to, but is not limited to, the field of computer technology.
Background
With the development and growing maturity of cloud computing technology, system complexity keeps increasing and large-scale log data is generated. This log data records the state of operations inside the system; technicians can use it to understand and optimize system behavior and to detect system anomalies. As the volume of log data from 5G and the underlying supporting IaaS/PaaS platforms grows ever larger, system operation and maintenance becomes increasingly difficult, and manual analysis alone is far from sufficient. To facilitate log analysis and processing, the related art extracts a log backbone; however, current log backbone extraction relies on a single clustering technique, which leads to poor accuracy of the extracted backbone. Improving the accuracy of log backbone extraction has therefore become an urgent problem.
Summary
An embodiment of the present disclosure provides a log management method, including: determining keywords in a log to be extracted; determining the word weight corresponding to each of the keywords; calculating the edit distance between logs on the basis of log entries and according to the word weights; clustering the logs based on the edit distance between the logs; and extracting a log backbone from the clustering results.
An embodiment of the present disclosure further provides a log management apparatus, including: a keyword extraction module configured to determine keywords in a log to be extracted; a word weight determination module configured to determine the word weight corresponding to each of the keywords; an edit distance calculation module configured to calculate the edit distance between logs on the basis of log entries and according to the word weights; a clustering module configured to cluster the logs based on the edit distance between the logs; and a backbone extraction module configured to extract a log backbone from the clustering results.
An embodiment of the present disclosure further provides a network device, including a processor, a memory, and a communication bus, where the communication bus is used to implement connection and communication between the processor and the memory, and the processor is used to execute one or more computer programs stored in the memory to implement any log management method described herein.
An embodiment of the present disclosure further provides a computer storage medium; the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement any log management method described herein.
Brief Description of the Drawings
FIG. 1 is a flowchart of a log management method provided by some embodiments of the present disclosure;
FIG. 2 is a flowchart of an edit distance calculation method provided by some embodiments of the present disclosure;
FIG. 3 is a flowchart of a clustering method provided by some embodiments of the present disclosure;
FIG. 4 is a flowchart of a word weight calculation method provided by some embodiments of the present disclosure;
FIG. 5 is a flowchart of another word weight calculation method provided by some embodiments of the present disclosure;
FIG. 6 is a schematic diagram of the composition of a log management apparatus provided by some embodiments of the present disclosure;
FIG. 7 is a schematic structural diagram of a network device provided by some embodiments of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present disclosure and do not limit it.
Some embodiments of the present disclosure provide a log management method. Referring to FIG. 1, the log management method includes the following steps S101 to S105:
S101, determining keywords in a log to be extracted;
S102, determining the word weight corresponding to each keyword;
S103, calculating the edit distance between logs on the basis of log entries and according to the word weights;
S104, clustering the logs based on the edit distance between the logs;
S105, extracting a log backbone from the clustering results.
The log backbone is extracted to facilitate log management and to reduce the number of logs that need to be processed, and it directly affects the effectiveness of system anomaly detection and fault localization. In the embodiments of the present disclosure, the edit distance is combined with the word weights of the keywords to measure the degree of difference precisely, and the logs are clustered based on that degree of difference. The edit distance is an important measure of the degree of difference between texts and is often used in fields such as text clustering. It is a quantitative measure of the difference between two character strings (for example, English words): it counts the minimum number of operations required to turn one string into the other. The edit distance can be used in natural language processing; for example, spell checking can use the edit distance between a misspelled word and other correct words to decide which word (or words) was most likely intended. DNA can also be viewed as a string over the alphabet A, C, G, and T, so the edit distance is also used in bioinformatics to judge the similarity of two DNA sequences. There are several different definitions of the edit distance; they differ in which operations may be performed on the strings, as follows:
In the Levenshtein distance, any character may be deleted, inserted, or substituted; this is the most commonly used definition of the edit distance. Other definitions also exist; for example, the Damerau-Levenshtein distance is a variant of the Levenshtein distance that additionally allows a single operation to swap two adjacent characters (a character transposition), so the distance from AB to BA is 1 (one swap) rather than 2 (a deletion plus an insertion, or two substitutions).
The LCS (longest common subsequence) distance allows only deletions and insertions.
The Jaro distance allows only character transpositions.
The Hamming distance allows only substitutions.
Any of the above edit distances is applicable in the embodiments of the present disclosure; when applied, the concrete edit distance is calculated in combination with the word weights.
The word weight is the weight carried by the word corresponding to a keyword in the log. Among the words composing a log, the higher a keyword's weight, the greater its influence on the clustering of that log when the edit distance is calculated. For example, if logs A and B both contain a high-weight keyword x, then the edit distance between A and B calculated in combination with the word weights will, because of x, be smaller than the edit distance calculated directly between the two. In this way, the clustering process comes closer to the true clustering.
In some embodiments, determining the keywords in the log data to be extracted may include: determining, as keywords, the words in the log other than vocabulary of preset forms. Vocabulary of preset forms, such as fixed phrases, prepositions, and conjunctions, has no concrete meaning and serves only as transitions, so such words are not regarded as keywords. A keyword is a word with actual meaning that characterizes the log: a verb may indicate the action of the log, and a noun may indicate the executor or the object of the log.
In some embodiments, determining, as keywords, the words in the log other than vocabulary of preset forms may include: using regular expressions to remove the vocabulary of preset forms from the log and taking the remaining words as the keywords.
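By way of illustration, a minimal Python sketch of this regular-expression cleanup follows. The parameter patterns below (IPv4 addresses, hexadecimal identifiers, bare numbers) are illustrative assumptions only; the disclosure does not enumerate the preset forms.

```python
import re

# Illustrative "preset form" patterns; the disclosure does not enumerate them,
# so these are assumptions for demonstration only.
PARAM_PATTERNS = [
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b",  # IPv4 addresses
    r"\b0x[0-9a-fA-F]+\b",           # hexadecimal identifiers
    r"\b\d+\b",                      # bare numbers
]

def extract_keywords(log_line: str) -> list[str]:
    """Remove preset-form tokens from a log line; the remaining words are keywords."""
    cleaned = log_line
    for pattern in PARAM_PATTERNS:
        cleaned = re.sub(pattern, " ", cleaned)
    return re.findall(r"[A-Za-z_]+", cleaned)

print(extract_keywords("Connection from 10.0.0.7 failed after 3 retries"))
# ['Connection', 'from', 'failed', 'after', 'retries']
```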
In some embodiments, determining the word weight corresponding to each keyword may include: calculating a first word weight based on the part of speech of the keyword, and outputting the first word weight as the result. In this embodiment the word weight can be determined directly from the part of speech of the keyword; for example, verbs and nouns carry relatively large weights, while adjectives and adverbs carry relatively small weights.
In some embodiments, calculating the first word weight based on the part of speech of the keyword may include: inputting the keyword into a part-of-speech tagging tool to obtain the part of speech of the word, and assigning a weight to the word according to its part of speech to obtain the first word weight, where the parts of speech include nouns, verbs, adjectives, and others.
In some embodiments, determining the word weight corresponding to each keyword may further include, after calculating the first word weight based on the part of speech of the keyword: judging whether a clustering/classification label exists, and if so, calculating a second word weight based on the tf-idf (Term Frequency-Inverse Document Frequency) index; and averaging the first word weight and the second word weight to obtain an average word weight, which is output as the result. In addition to deriving the word weight directly from the part of speech, the presence of a clustering/classification label can be used to decide whether to calculate the second word weight based on tf-idf; if such a label exists, the arithmetic or weighted mean of the first and second word weights can be taken as the average word weight and output as the final word weight.
In some embodiments, calculating the second word weight based on the tf-idf index may include: organizing the classes and text features according to the clustering/classification labels, and calculating the tf-idf weight of each word.
In some embodiments, determining the word weight corresponding to each keyword may include: training a part-of-speech probability neural network model; assigning the weights that keywords of different parts of speech carry in log backbone extraction; calculating each part-of-speech probability with the neural network; and calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
In some embodiments, training the part-of-speech probability neural network model includes: training the model on existing data sets. The model is trained on existing Chinese and English data sets and can judge, for a single sentence, the part-of-speech probability values of each word, such as verb, noun, or adjective.
In some embodiments, assigning the weights that keywords of different parts of speech carry in log backbone extraction may include: presetting, based on prior knowledge, the weights that words of different parts of speech exert during log backbone extraction. Prior knowledge refers to empirical conclusions drawn from past extractions: the weights that certain words carried in historical logs serve as a reference for the current backbone extraction. These preset weights can be adjusted for different log data sets.
In some embodiments, calculating the edit distance between logs on the basis of log entries and according to the word weights may include: traversing all word-position pairs between the two logs and updating the edit distance matrix using the word weights, and taking the lower-left element of the edit distance matrix as the edit distance.
In some embodiments, clustering the logs based on the edit distance between the logs may include: calculating a log distance adjacency matrix based on the edit distance between the logs; calculating the degree of difference between the current log and each cluster in the set; adding logs whose degree of difference is less than or equal to a threshold to the current cluster, adding logs whose degree of difference is greater than the threshold to a new cluster, and updating the cluster information. The threshold defines whether two logs belong to the same cluster: if the difference value between two logs is less than or equal to the threshold, the current log is close to the logs in an existing cluster and can be grouped with them, so the log is added to the current cluster; if the difference value is greater than the threshold, the current log is far from the logs in the existing clusters, so a new cluster is created based on this log and the log is placed in the new cluster.
In some embodiments, before calculating the log distance adjacency matrix based on the edit distance between the logs, the method may further include: initializing the cluster set, where the initialization includes initializing a set in which every log initially forms its own cluster.
In some embodiments, calculating the degree of difference between the current log and each cluster in the set may include: traversing all clusters in the set, calculating the degree of difference between the log and each cluster, and finding the cluster with the smallest degree of difference, where the degree of difference is the maximum edit distance between the log and the logs in the cluster.
This embodiment provides a log management method: determining keywords in a log to be extracted; determining the word weight corresponding to each keyword; calculating the edit distance between logs on the basis of log entries and according to the word weights; clustering the logs based on the edit distance between the logs; and extracting the log backbone from the clustering results. By combining the word weights of the keywords with the edit distance for clustering, the accuracy of log backbone extraction is significantly improved; the method is easy to implement, and the clustering process is efficient.
Other embodiments of the present disclosure provide a log management method. FIG. 1 is a flowchart of a log backbone extraction method based on the word-weighted edit distance according to an embodiment of the present disclosure. As shown in FIG. 1, the process includes the following steps S101 to S105:
S101, determining keywords in a log to be extracted;
S102, determining the word weight corresponding to each keyword;
S103, calculating the edit distance between logs on the basis of log entries and according to the word weights;
S104, clustering the logs based on the edit distance between the logs;
S105, extracting a log backbone from the clustering results.
In some embodiments, log keyword extraction includes: using regular expressions to remove parameters of known forms from the log, and treating the remaining words as keywords.
In some embodiments, the word weights may be calculated with any one of the word weight calculation methods in the embodiments of the present disclosure, or with several of them combined.
In some embodiments, referring to FIG. 2, calculating the word-weighted edit distance between each pair of logs includes the following steps S1030 to S1033:
S1031, initializing the edit distance matrix for the pair;
S1032, traversing all word-position pairs between the two logs and updating the edit distance matrix, using the precomputed word weight information when calculating the transfer cost;
S1033, taking the lower-left element of the edit distance matrix as the final word-weighted edit distance between the two logs, and filling it into the corresponding position of the log distance adjacency matrix and its diagonally symmetric position (a sketch of this computation follows below).
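To make the matrix update concrete, the following is a minimal sketch of a word-weighted edit distance under stated assumptions: the cost of deleting or inserting a word is taken to be that word's weight, and the substitution cost is taken as the larger of the two word weights; the disclosure does not fix these transfer costs, so they are illustrative. In the row-major convention used here, the final element dp[m][n] plays the role of what the text calls the lower-left element of the matrix.

```python
def weighted_edit_distance(a: list[str], b: list[str], w: dict[str, float]) -> float:
    """Word-level edit distance in which each operation costs the weight of the
    word involved (words without an assigned weight default to 1.0)."""
    def cost(word: str) -> float:
        return w.get(word, 1.0)

    m, n = len(a), len(b)
    # dp[i][j]: distance between the first i words of a and the first j words of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + cost(a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # matching words cost nothing
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + cost(a[i - 1]),   # delete a word of log a
                    dp[i][j - 1] + cost(b[j - 1]),   # insert a word of log b
                    dp[i - 1][j - 1] + max(cost(a[i - 1]), cost(b[j - 1])),  # substitute
                )
    return dp[m][n]
```

Under this scheme, disagreements on low-weight words (parameters, filler words) contribute little to the distance, so two logs that share their high-weight keywords end up close together, which is the behavior the clustering step relies on.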
In some embodiments, referring to FIG. 3, clustering based on the word-weighted edit distance includes the following steps S301 to S307:
Step S301, initializing the cluster set;
Step S302, calculating the log distance adjacency matrix based on the word-weighted edit distance;
Step S303, calculating the degree of difference between the current log and each cluster in the set;
Step S304, judging whether the degree of difference is less than the threshold; if so, proceeding to step S305, otherwise proceeding to step S306;
Step S305, adding the log to the current cluster;
Step S306, creating a new cluster, adding the log to the new cluster, and updating the cluster set;
Step S307, judging whether all logs have been traversed; if so, the clustering is complete, otherwise returning to step S303 to continue the clustering process (a sketch of this loop follows below).
In some embodiments, initializing the cluster set includes: initializing an empty set, each element of which is a cluster.
In some embodiments, calculating the log distance adjacency matrix based on the word-weighted edit distance may include: first initializing the log distance adjacency matrix, then calculating the word-weighted edit distance for each pair of logs to fill and update it, and compressing the adjacency matrix structure.
In some embodiments, calculating the degree of difference between the current log and each cluster in the set includes: traversing all clusters in the set, calculating the degree of difference between the log and each cluster, and finding the cluster with the smallest degree of difference, where the degree of difference is defined as the maximum word-weighted edit distance between the log and the logs in the cluster.
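A minimal sketch of the loop in steps S301 to S307 follows, assuming a precomputed distance function dist(i, j) (for example, a lookup into the log distance adjacency matrix); the names are illustrative, not taken from the disclosure.

```python
def cluster_logs(num_logs: int, dist, threshold: float) -> list[list[int]]:
    """Greedy clustering per steps S301 to S307: each log joins the nearest
    existing cluster when the degree of difference is below the threshold,
    and otherwise starts a new cluster."""
    clusters: list[list[int]] = []                   # S301: initialize the cluster set
    for i in range(num_logs):
        best, best_diff = None, float("inf")
        for c, members in enumerate(clusters):       # S303: difference to each cluster
            # Degree of difference: max distance to any log already in the cluster.
            diff = max(dist(i, j) for j in members)
            if diff < best_diff:
                best, best_diff = c, diff
        if best is not None and best_diff < threshold:
            clusters[best].append(i)                 # S304/S305: join the nearest cluster
        else:
            clusters.append([i])                     # S306: create a new cluster
    return clusters                                  # S307: all logs traversed
```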
In some embodiments, extracting the log backbone from the clustering results includes: after the clusters have stabilized, extracting the backbone from each cluster. A word is considered to belong to the backbone if it appears in more than half of the logs in the cluster; all backbone words are kept, and the remaining words are removed or replaced with <p>.
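A minimal sketch of this more-than-half rule, assuming each log is a token list; taking the first log of the cluster as the template whose non-backbone positions become <p> is an assumption, since the disclosure does not specify which log supplies the word positions.

```python
from collections import Counter

def extract_backbone(logs_in_cluster: list[list[str]]) -> list[str]:
    """Keep the words that appear in more than half of the logs in the cluster;
    the remaining positions of a template log are replaced with <p>."""
    n = len(logs_in_cluster)
    # Number of logs in the cluster in which each word appears at least once.
    doc_freq = Counter(word for log in logs_in_cluster for word in set(log))
    backbone = {word for word, count in doc_freq.items() if count > n / 2}
    template = logs_in_cluster[0]  # assumed choice of template log
    return [word if word in backbone else "<p>" for word in template]
```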
The above description shows that the method illustrated in FIG. 2, after obtaining the log information, extracts the log keywords, calculates the word weights, calculates the word-weighted edit distance between each pair of logs, clusters based on the word-weighted edit distance, and then extracts the log backbone from the clustering results. This process helps technicians in related fields extract the log backbone effectively and with high accuracy, which brings convenience to subsequent log classification, log analysis, anomaly detection, and similar work.
This embodiment innovatively combines word weights with the edit distance and applies them to log backbone extraction, which extracts log features better and improves the accuracy and performance of backbone extraction. Compared with existing techniques, this log backbone extraction method remains fast and satisfies real-time log analysis while making full use of the semi-structured nature of logs to improve the effectiveness and stability of backbone extraction; the results are easy to reproduce, which facilitates follow-up analysis by professionals.
Other embodiments of the present disclosure provide a method for calculating word weights. FIG. 4 is a flowchart of the word weight calculation method according to an embodiment of the present disclosure. As shown in FIG. 4, the process includes the following steps S401 to S404:
Step S401, calculating the part-of-speech-based word weight;
Step S402, judging whether a manual clustering/classification label exists; if so, proceeding to step S403, otherwise proceeding to step S404;
Step S403, calculating the tf-idf-based word weight and averaging it with the word weight obtained in step S401;
Step S404, outputting the word weight.
Here, log keyword extraction includes: using regular expressions to remove parameters of known forms from the log and treating the remaining words as keywords.
In some embodiments, calculating the part-of-speech-based word weight includes: inputting the keywords into the nltk part-of-speech tagging tool to obtain each word's part of speech, and assigning weights by part of speech: nouns and verbs receive weight 3, adjectives receive weight 2, and other words receive weight 1, where weight 3 is greater than weight 2 and weight 2 is greater than weight 1; that is, content words such as nouns and verbs carry higher weights.
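As a rough illustration of this step with the nltk tagger: nltk.pos_tag returns Penn Treebank tags, so the NN*/VB*/JJ* prefixes are mapped onto the 3/2/1 scheme here. Tagging keywords outside their sentence context, as below, is a simplification, and the tagger model must have been downloaded beforehand.

```python
import nltk

# One-time download of the tagger model (name as of current nltk releases):
# nltk.download("averaged_perceptron_tagger")

PREFIX_WEIGHT = {"N": 3, "V": 3, "J": 2}  # nouns/verbs -> 3, adjectives -> 2, others -> 1

def pos_word_weights(keywords: list[str]) -> dict[str, int]:
    """Weight each keyword by the first letter of its Penn Treebank tag."""
    return {word: PREFIX_WEIGHT.get(tag[:1], 1)
            for word, tag in nltk.pos_tag(keywords)}

print(pos_word_weights(["connection", "failed", "slow", "after"]))
# e.g. {'connection': 3, 'failed': 3, 'slow': 2, 'after': 1}
```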
In some embodiments, calculating the tf-idf-based word weight includes: organizing the classes and text features according to the manually curated clustering/classification labels; calculating the tf-idf weight of each word; and sorting the words by tf-idf weight in descending order, giving the top 30% weight 3, the next 30%-60% weight 2, and the remaining 60%-100% weight 1.
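A sketch of the tf-idf bucketing follows, simplified to rank words by their mean tf-idf over the labeled logs; the use of scikit-learn is an assumed tooling choice, since the disclosure only specifies computing tf-idf weights from the curated labels and sorting them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_word_weights(logs: list[str]) -> dict[str, int]:
    """Rank words by mean tf-idf across the logs; the top 30% get weight 3,
    the next 30% weight 2, and the rest weight 1."""
    vec = TfidfVectorizer()
    scores = np.asarray(vec.fit_transform(logs).mean(axis=0)).ravel()
    terms = vec.get_feature_names_out()   # requires scikit-learn >= 1.0
    order = scores.argsort()[::-1]        # indices, highest mean tf-idf first
    weights = {}
    for rank, idx in enumerate(order):
        frac = rank / len(terms)
        weights[terms[idx]] = 3 if frac < 0.3 else (2 if frac < 0.6 else 1)
    return weights
```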
Through the above steps (extracting the log keywords, calculating the part-of-speech-based word weight, judging whether a manual clustering/classification label exists, calculating the tf-idf-based word weight, and outputting the word weight), the word weights can be calculated, which helps the analysis of semi-structured text, such as log frequent-item detection and log anomaly detection.
Other embodiments of the present disclosure provide a word weight calculation method that combines the characteristics of logs and uses a neural network for part-of-speech judgment. Referring to FIG. 5, which is a flowchart of the word weight calculation method provided by some embodiments of the present disclosure, the process includes the following steps S501 to S504:
S501, training a part-of-speech probability neural network model;
S502, assigning the part-of-speech weights that words of different parts of speech carry in log backbone extraction;
S503, for a single log sentence, calculating each word's part-of-speech probabilities with the neural network;
S504, for each log word, calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
In some embodiments, training the part-of-speech probability neural network model includes: training the model on existing Chinese and English data sets; for a single sentence, the model can judge the part-of-speech probability values of each word, such as noun, verb, or adjective.
In some embodiments, assigning the weights that words of different parts of speech carry in log backbone extraction includes: based on prior knowledge, manually setting the weights that words of different parts of speech may exert in log backbone extraction, and adjusting the weights for different log data sets.
In some embodiments, the weights of the same word in different sentences can be averaged again and used as the final weight of the word, saving subsequent computation.
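A sketch of step S504 follows, assuming the trained model exposes, for each word of a sentence, a dictionary of part-of-speech probabilities; both that interface and the concrete numbers below are illustrative, not taken from the disclosure.

```python
# Assumed part-of-speech weights in the 3/2/1 style used elsewhere in this disclosure.
POS_WEIGHT = {"noun": 3.0, "verb": 3.0, "adjective": 2.0, "other": 1.0}

def word_weights_from_pos_probs(tagged):
    """S504: the word weight is the weighted sum of the part-of-speech
    probabilities and the preset part-of-speech weights."""
    return {word: sum(probs.get(pos, 0.0) * weight for pos, weight in POS_WEIGHT.items())
            for word, probs in tagged}

# `tagged` stands in for the neural model's per-sentence output (assumed format).
tagged = [("connection", {"noun": 0.9, "verb": 0.05, "other": 0.05}),
          ("failed", {"verb": 0.8, "adjective": 0.15, "other": 0.05})]
print(word_weights_from_pos_probs(tagged))
# {'connection': 2.9, 'failed': 2.75}
```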
Through the above steps (training the part-of-speech probability neural network model, assigning the weights of words of different parts of speech for log backbone extraction, calculating each word's part-of-speech probabilities with the neural network for a single log sentence, and calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight), the process uses a neural network to calculate part-of-speech probabilities and combines prior knowledge to calculate the word weights. It effectively helps technicians in related fields judge accurately how a word affects the processing of the current sentence, avoids the limitations of judging directly from prior knowledge or statistical information, and provides a more reliable basis for subsequent processing.
This embodiment innovatively uses a neural network for word weight calculation, avoiding the limitation of global processing that uses prior knowledge or statistical information directly. For the same word in different sentences, the neural network can make more accurate part-of-speech judgments, effectively improving the recognition rate of ambiguous words; combining prior knowledge to perform weighting based on the part-of-speech weights yields more accurate word weights, facilitating subsequent log processing.
Other embodiments of the present disclosure provide a log management apparatus. Referring to FIG. 6, the apparatus includes the following modules: a keyword extraction module 61 configured to determine keywords in a log to be extracted; a word weight determination module 62 configured to determine the word weight corresponding to each keyword; an edit distance calculation module 63 configured to calculate the edit distance between logs on the basis of log entries and according to the word weights; a clustering module 64 configured to cluster the logs based on the edit distance between the logs; and a backbone extraction module 65 configured to extract a log backbone from the clustering results.
In some embodiments, determining the keywords in the log data to be extracted may include: determining, as keywords, the words in the log other than vocabulary of preset forms.
In some embodiments, determining, as keywords, the words in the log other than vocabulary of preset forms may include: using regular expressions to remove the vocabulary of preset forms from the log and taking the remaining words as the keywords.
In some embodiments, determining the word weight corresponding to each keyword may include: calculating a first word weight based on the part of speech of the keyword, and outputting the first word weight as the result.
In some embodiments, calculating the first word weight based on the part of speech of the keyword may include: inputting the keyword into a part-of-speech tagging tool to obtain the part of speech of the word, and assigning a weight to the word according to its part of speech to obtain the first word weight, where the parts of speech include nouns, verbs, adjectives, and others.
In some embodiments, determining the word weight corresponding to each keyword may further include, after calculating the first word weight based on the part of speech of the keyword: judging whether a clustering/classification label exists, and if so, calculating a second word weight based on the tf-idf (term frequency-inverse document frequency) index; and averaging the first word weight and the second word weight to obtain an average word weight, which is output as the result.
In some embodiments, calculating the second word weight based on the tf-idf index may include: organizing the classes and text features according to the clustering/classification labels, and calculating the tf-idf weight of each word.
In some embodiments, determining the word weight corresponding to each keyword may include: training a part-of-speech probability neural network model; assigning the weights that keywords of different parts of speech carry in log backbone extraction; calculating each part-of-speech probability with the neural network; and calculating the weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
In some embodiments, training the part-of-speech probability neural network model may include: training the model on existing data sets.
In some embodiments, assigning the weights that keywords of different parts of speech carry in log backbone extraction may include: presetting, based on prior knowledge, the weights that words of different parts of speech exert during log backbone extraction.
In some embodiments, calculating the edit distance between logs on the basis of log entries and according to the word weights may include: traversing all word-position pairs between the two logs and updating the edit distance matrix using the word weights, and taking the lower-left element of the edit distance matrix as the edit distance.
In some embodiments, clustering the logs based on the edit distance between the logs may include: calculating a log distance adjacency matrix based on the edit distance between the logs; calculating the degree of difference between the current log and each cluster in the set; adding logs whose degree of difference is less than or equal to a threshold to the current cluster, adding logs whose degree of difference is greater than the threshold to a new cluster, and updating the cluster information.
In some embodiments, before calculating the log distance adjacency matrix based on the edit distance between the logs, the method may further include: initializing the cluster set, where the initialization includes initializing a set in which every log initially forms its own cluster.
In some embodiments, calculating the degree of difference between the current log and each cluster in the set may include: traversing all clusters in the set, calculating the degree of difference between the log and each cluster, and finding the cluster with the smallest degree of difference, where the degree of difference is the maximum edit distance between the log and the logs in the cluster.
Other embodiments of the present disclosure further provide a network device. As shown in FIG. 7, the network device includes a processor 71, a memory 72, and a communication bus 73, where the communication bus 73 is used to implement connection and communication between the processor 71 and the memory 72, and the processor 71 is used to execute one or more computer programs stored in the memory 72 to implement the steps of the log management method in the foregoing embodiments, which are not repeated here.
This embodiment further provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, computer program modules, or other data). Computer-readable storage media include, but are not limited to, RAM (random access memory), ROM (read-only memory), EEPROM (electrically erasable programmable read-only memory), flash memory or other memory technology, CD-ROM (compact disc read-only memory), digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage systems, or any other medium that can be used to store the desired information and can be accessed by a computer.
The computer-readable storage medium in this embodiment can be used to store one or more computer programs, and the stored one or more computer programs can be executed by a processor to implement at least one step of the log management method in the foregoing embodiments.
This embodiment further provides a computer program (also called computer software), which can be distributed on a computer-readable medium and executed by a computing system to implement at least one step of the log management method in the foregoing embodiments.
This embodiment further provides a computer program product, including a computer-readable system on which the computer program described above is stored. The computer-readable system in this embodiment may include the computer-readable storage medium described above.
It can be seen that those skilled in the art should understand that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and devices, may be implemented as software (which may be implemented with computer program code executable by a computing system), firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit.
In addition, as is well known to those of ordinary skill in the art, communication media usually contain computer-readable instructions, data structures, computer program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. Therefore, the present disclosure is not limited to any specific combination of hardware and software.
The above content further details the embodiments of the present disclosure with reference to specific implementations, and the specific implementation of the present disclosure cannot be considered limited to these descriptions. For those of ordinary skill in the art to which the present disclosure belongs, several simple deductions or substitutions may be made without departing from the concept of the present disclosure, and all of them shall be regarded as falling within the protection scope of the present disclosure.

Claims (17)

  1. A log management method, comprising:
    determining keywords in a log to be extracted;
    determining a word weight corresponding to each of the keywords;
    calculating an edit distance between logs on the basis of log entries and according to the word weights;
    clustering the logs based on the edit distance between the logs; and
    extracting a log backbone from the results of the clustering.
  2. The log management method of claim 1, wherein determining the keywords in the log to be extracted comprises:
    determining, as keywords, words in the log other than vocabulary of preset forms.
  3. The log management method of claim 2, wherein determining, as keywords, the words in the log other than vocabulary of preset forms comprises:
    using regular expressions to remove the vocabulary of preset forms from the log, and taking the remaining words as the keywords.
  4. The log management method of claim 1, wherein determining the word weight corresponding to each of the keywords comprises:
    calculating a first word weight based on a part of speech of the keyword, and outputting the first word weight as a result.
  5. The log management method of claim 4, wherein calculating the first word weight based on the part of speech of the keyword comprises:
    inputting the keyword into a part-of-speech tagging tool to obtain the part of speech of the word, and assigning a weight to the word according to the part of speech to obtain the first word weight, the parts of speech comprising nouns, verbs, and adjectives.
  6. The log management method of claim 4, wherein determining the word weight corresponding to each of the keywords further comprises, after calculating the first word weight based on the part of speech of the keyword:
    judging whether a clustering/classification label exists, and if so, calculating a second word weight based on a term frequency-inverse document frequency (tf-idf) index; and
    averaging the first word weight and the second word weight to obtain an average word weight, and outputting the average word weight as a result.
  7. The log management method of claim 6, wherein calculating the second word weight based on the tf-idf index comprises:
    organizing classes and text features according to the clustering/classification labels, and calculating the tf-idf weight of each word.
  8. The log management method of any one of claims 1-7, wherein determining the word weight corresponding to each of the keywords comprises:
    training a part-of-speech probability neural network model;
    assigning weights that keywords of different parts of speech carry in log backbone extraction;
    calculating each part-of-speech probability based on the neural network; and
    calculating a weighted sum of the part-of-speech probabilities and the part-of-speech weights as the word weight.
  9. The log management method of claim 8, wherein training the part-of-speech probability neural network model comprises:
    training the part-of-speech probability neural network model based on existing data sets.
  10. The log management method of claim 8, wherein assigning the weights that keywords of different parts of speech carry in log backbone extraction comprises:
    presetting the weights that words of different parts of speech exert during log backbone extraction.
  11. The log management method of any one of claims 1-7, wherein calculating the edit distance between logs on the basis of log entries and according to the word weights comprises:
    traversing all word-position pairs between two logs and updating an edit distance matrix using the word weights; and
    taking the lower-left element of the edit distance matrix as the edit distance.
  12. The log management method of any one of claims 1-7, wherein clustering the logs based on the edit distance between the logs comprises:
    calculating a log distance adjacency matrix based on the edit distance between the logs;
    calculating a degree of difference between a current log and each cluster in a set; and
    adding logs whose degree of difference is less than or equal to a threshold to the current cluster, adding logs whose degree of difference is greater than the threshold to a new cluster, and updating the cluster information.
  13. The log management method of claim 12, further comprising, before calculating the log distance adjacency matrix based on the edit distance between the logs:
    initializing the cluster set, the initialization comprising: initializing a set in which every log forms its own cluster.
  14. The log management method of claim 12, wherein calculating the degree of difference between the current log and each cluster in the set comprises:
    traversing all clusters in the set, calculating the degree of difference between the current log and each cluster in the set, and finding the cluster with the smallest degree of difference, the degree of difference being the maximum edit distance between the current log and the logs in the clusters of the set.
  15. A log management apparatus, comprising:
    a keyword extraction module configured to determine keywords in a log to be extracted;
    a word weight determination module configured to determine a word weight corresponding to each of the keywords;
    an edit distance calculation module configured to calculate an edit distance between logs on the basis of log entries and according to the word weights;
    a clustering module configured to cluster the logs based on the edit distance between the logs; and
    a backbone extraction module configured to extract a log backbone from the results of the clustering.
  16. A network device, comprising a processor, a memory, and a communication bus, wherein:
    the communication bus is configured to implement connection and communication between the processor and the memory; and
    the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the log management method of any one of claims 1-14.
  17. A computer-readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to implement the steps of the log management method of any one of claims 1-14.
PCT/CN2020/095162 2019-06-28 2020-06-09 Log management method and apparatus, network device, and readable storage medium WO2020259280A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP20830538.3A EP3992812A4 (en) 2019-06-28 2020-06-09 LOG MANAGEMENT METHOD AND DEVICE, NETWORK DEVICE AND READABLE STORAGE MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910579380.1A 2019-06-28 2019-06-28 Log management method and apparatus, network device, and readable storage medium
CN201910579380.1 2019-06-28

Publications (1)

Publication Number Publication Date
WO2020259280A1 true WO2020259280A1 (zh) 2020-12-30

Family

ID=68623486

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/095162 WO2020259280A1 (zh) 2019-06-28 2020-06-09 Log management method and apparatus, network device, and readable storage medium

Country Status (3)

Country Link
EP (1) EP3992812A4 (zh)
CN (1) CN110516034A (zh)
WO (1) WO2020259280A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113438114A (zh) * 2021-06-29 2021-09-24 平安普惠企业管理有限公司 Method, apparatus, device and storage medium for monitoring the running state of an Internet system

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516034A (zh) 2019-06-28 2019-11-29 中兴通讯股份有限公司 Log management method and apparatus, network device and readable storage medium
CN111159413A (zh) 2019-12-31 2020-05-15 深信服科技股份有限公司 Log clustering method, apparatus, device and storage medium
CN111581057B (zh) 2020-05-08 2022-06-14 厦门服云信息科技有限公司 Universal log parsing method, terminal device and storage medium
CN112367222B (zh) 2020-10-30 2022-09-27 中国联合网络通信集团有限公司 Network anomaly detection method and apparatus
CN113407656B (zh) 2021-06-24 2023-03-07 上海上讯信息技术股份有限公司 Method and device for fast online log clustering
CN113254255B (zh) 2021-07-15 2021-10-29 苏州浪潮智能科技有限公司 Method, system, device and medium for analyzing cloud platform logs

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608970A (zh) * 2017-09-29 2018-01-19 百度在线网络技术(北京)有限公司 Part-of-speech tagging model generation method and apparatus
CN108427720A (zh) * 2018-02-08 2018-08-21 中国科学院计算技术研究所 System log classification method
CN110516034A (zh) * 2019-06-28 2019-11-29 中兴通讯股份有限公司 Log management method and apparatus, network device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224604B (zh) * 2015-09-01 2019-01-29 天津大学 Microblog emergency detection method based on heap optimization and detection apparatus thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608970A (zh) * 2017-09-29 2018-01-19 百度在线网络技术(北京)有限公司 Part-of-speech tagging model generation method and apparatus
CN108427720A (zh) * 2018-02-08 2018-08-21 中国科学院计算技术研究所 System log classification method
CN110516034A (zh) * 2019-06-28 2019-11-29 中兴通讯股份有限公司 Log management method and apparatus, network device and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JING HONG , YUE QUN-QIN: "Sentence Similarity Computation and Application", COMPUTER SYSTEMS & APPLICATIONS, vol. 23, no. 11, 15 November 2014 (2014-11-15), pages 136 - 139, XP055773976 *
QIANG FU; JIAN-GUANG LOU; YI WANG; JIANG LI: "Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis", NINTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, 2009 (ICDM '09), 6 December 2009 (2009-12-06), pages 149 - 158, XP031585329 *
See also references of EP3992812A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113438114A (zh) * 2021-06-29 2021-09-24 平安普惠企业管理有限公司 Method, apparatus, device and storage medium for monitoring the running state of an Internet system
CN113438114B (zh) * 2021-06-29 2022-10-14 平安普惠企业管理有限公司 Method, apparatus, device and storage medium for monitoring the running state of an Internet system

Also Published As

Publication number Publication date
EP3992812A1 (en) 2022-05-04
CN110516034A (zh) 2019-11-29
EP3992812A4 (en) 2023-04-26

Similar Documents

Publication Publication Date Title
WO2020259280A1 (zh) Log management method and apparatus, network device, and readable storage medium
US20180107933A1 (en) Web page training method and device, and search intention identifying method and device
US20200081899A1 (en) Automated database schema matching
JP5936698B2 (ja) Word semantic relationship extraction device
WO2020133960A1 (zh) Text quality inspection method, electronic apparatus, computer device and storage medium
US20160188568A1 (en) System and method for determining the meaning of a document with respect to a concept
EP3748548A1 (en) Adversarial learning-based text annotation method and device
CN109299280B (zh) Short text clustering analysis method, apparatus and terminal device
CN110083832B (zh) Method, apparatus, device and readable storage medium for identifying article reposting relationships
WO2018201600A1 (zh) Information mining method, system, electronic apparatus and readable storage medium
WO2022048363A1 (zh) Website classification method, apparatus, computer device and storage medium
US11580119B2 (en) System and method for automatic persona generation using small text components
WO2017091985A1 (zh) Stop word recognition method and apparatus
WO2020114100A1 (zh) Information processing method and apparatus, and computer storage medium
WO2021196825A1 (zh) Abstract generation method and apparatus, electronic device, and medium
US11151327B2 (en) Autonomous detection of compound issue requests in an issue tracking system
WO2022042297A1 (zh) Text clustering method and apparatus, electronic device, and storage medium
CN112328747B (zh) Event context generation method and apparatus, terminal device, and storage medium
CN115687621A (zh) Short text label annotation method and apparatus
WO2022143608A1 (zh) Language annotation method and apparatus, computer device, and storage medium
WO2019041528A1 (zh) News sentiment direction judgment method, electronic device, and computer-readable storage medium
CN111930949B (zh) Search string processing method and apparatus, computer-readable medium, and electronic device
CN111062208B (zh) File review method, apparatus, device, and storage medium
CN112579781A (zh) Text categorization method and apparatus, electronic device, and medium
CN116629238A (zh) Text augmentation quality evaluation method, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20830538

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2020830538

Country of ref document: EP