WO2021068547A1 - Log template extraction method and device - Google Patents

Log template extraction method and device

Info

Publication number
WO2021068547A1
WO2021068547A1 · PCT/CN2020/096134 · CN2020096134W
Authority
WO
WIPO (PCT)
Prior art keywords
log
log record
record
template
records
Prior art date
Application number
PCT/CN2020/096134
Other languages
English (en)
French (fr)
Inventor
Wang Chen (王琛)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021068547A1 publication Critical patent/WO2021068547A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs

Definitions

  • This application relates to the field of computer technology, and in particular to a method and device for extracting log templates.
  • The real-time status of software operation can be recorded in text, which is called a log.
  • Software developers or operation and maintenance staff can read the log to grasp the real-time status of software operation.
  • the log includes multiple lines of log records (also known as log statements). Each line of log record is used to record an event during software operation.
  • The log records in a log usually follow an implicit log template (schema), that is, the pattern or format of the records themselves.
  • Logs can be divided into two types: homogeneous logs and heterogeneous logs.
  • A homogeneous log means that all log records in the log share the same log template; a heterogeneous log means that there is no single log template covering every log record in the log.
  • Currently, the method of extracting the log template is as follows: tokenize each line of log records to obtain multiple tokens; based on the tokenization result of each line, hierarchically cluster the log records in the log to obtain multiple classes of log records; perform template extraction on each class of log records, and use the obtained log templates of the multiple classes as the log templates of the heterogeneous log.
  • the embodiments of the present application provide a method and device for extracting a log template, which can solve the problem that the current method for extracting a log template is relatively expensive.
  • the technical solution is as follows:
  • a method for extracting a log template includes:
  • The log records are grouped by the locality-sensitive hash code of each log record, and the locality-sensitive hash code can reflect the similarity of the corresponding log records of different lines, so that the grouping achieves the same effect as a clustering process while effectively reducing computational complexity.
  • The locality-sensitive hash code is a characteristic of the log record itself, and obtaining the locality-sensitive hash code of a log record does not require considering other log records, so each log record in the log is decorrelated from the others during grouping. In this way, for a given log, the grouping of its multiple lines of log records can be executed in parallel, which effectively reduces operation delay and improves operation efficiency.
  • the log template of the log may be obtained by processing each of the first log record groups respectively.
  • the processing procedures for each first log record group can be executed in parallel, thereby reducing the operation delay.
  • The amount of data required for each computation is much smaller than the total data amount of the log, which effectively reduces the operation cost while increasing operational efficiency.
  • the determining the locally sensitive hash code of each log record in the multi-line log record of the log includes:
  • word segmentation is to cut each line of log records into a sequence of entries.
  • Through word segmentation, the processing complexity of log records can be reduced, the computational cost of obtaining subsequent locality-sensitive hash codes can be lowered, and computational efficiency can be improved.
  • word segmentation can be performed based on different methods. For example, use spaces to segment words; or, use special characters to segment words; or use natural language to segment words.
  • each term obtained by word segmentation includes only one semantic unit.
  • the word segmentation method is relatively simple, easy to implement, and fast in word segmentation.
  • each term obtained by word segmentation includes multiple semantic units. That is, each entry includes m semantic units, m is an integer greater than 1, and the semantic unit is a word or symbol.
  • the entries obtained in this way can reduce undesired hash collisions.
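  • As an illustrative sketch (the function name and the default m=2 are assumptions for this example, not specified by the application), entries that each contain m consecutive semantic units can be produced with an overlapping sliding window, so that every two adjacent entries share m-1 semantic units:

```python
def shingle_tokens(record, m=2):
    """Split a log record into overlapping entries of m semantic units each.

    Each semantic unit is a whitespace-delimited word or symbol; adjacent
    entries overlap by m-1 units, which reduces undesired hash collisions
    compared with single-unit entries.
    """
    units = record.split()
    if len(units) < m:
        return [" ".join(units)]
    return [" ".join(units[i:i + m]) for i in range(len(units) - m + 1)]
```

  • For example, shingle_tokens("User 123 login at 10:00", m=2) yields ["User 123", "123 login", "login at", "at 10:00"].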
  • Optionally, the determining the locality-sensitive hash code of each log record in the multi-line log records of the log includes: replacing p designated characters in each log record in the log with q fixed characters to obtain each updated log record, where 1 ≤ q < p; and determining the locality-sensitive hash code of each log record based on each updated log record.
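  • One illustrative choice of designated and fixed characters (an assumption for this sketch, not mandated by the application) is to collapse each run of decimal digits, p characters long, into a single fixed character, so q = 1:

```python
import re

def normalize(record):
    """Replace designated characters before hashing.

    Here each run of decimal digits (p characters) is collapsed to a single
    '0' (q = 1), so records that differ only in numeric variable parts
    produce the same updated record.
    """
    return re.sub(r"[0-9]+", "0", record)
```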
  • the determining the locally sensitive hash code of each log record based on at least one entry of each log record includes:
  • Based on at least one entry of the first log record and the weight of each entry, the locality-sensitive hash code of the first log record is determined, where the first log record is any log record in the multi-line log records of the log that includes multiple entries, and the weights of at least two entries included in the first log record are different from each other. For example, the weight of each entry is determined based on the position of the entry in the first log record.
  • the weight of the entry in each log record may be related to its own attribute.
  • The weights of the entries in the constant part are usually greater than those of the entries in the variable part, and the first entries in a log record usually belong to the constant part. Therefore, the weights of the first g entries in the log record can be set to be greater than the weights of the other entries, where g < k, g is a positive integer, and k is the length of the log record. For example, the weights of the first g entries are decreasing, and the other weights are equal and smaller than the minimum weight of the first g entries; g can be 1. In this way, associating the weight of an entry with its position attribute allows the locality-sensitive hash code to be computed more accurately, further reducing undesired hash collisions.
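  • A minimal SimHash-style sketch of such a position-weighted locality-sensitive hash (the weight schedule g+1, g, ..., 2 for the first g entries and 1 for the rest, plus the use of MD5 as the per-token hash, are assumptions for this example, not specified by the application):

```python
import hashlib

def weighted_simhash(tokens, g=1, bits=64):
    """SimHash-style locality-sensitive hash with position-dependent weights.

    The first g tokens (assumed to belong to the constant part) get the
    decreasing weights g+1, g, ..., 2; all remaining tokens get weight 1,
    which is smaller than the minimum weight of the first g tokens.
    """
    acc = [0] * bits
    for i, tok in enumerate(tokens):
        weight = (g + 1 - i) if i < g else 1
        # 64-bit per-token hash derived from MD5 (illustrative choice).
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for b in range(bits):
            acc[b] += weight if (h >> b) & 1 else -weight
    return sum(1 << b for b in range(bits) if acc[b] > 0)
```

  • Records that share most tokens, especially the leading constant-part tokens, accumulate similar bit votes and therefore receive identical or nearly identical hash codes.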
  • The determining at least one first log record group includes: grouping the multiple lines of log records in the log based on the locality-sensitive hash code of each line of log record and the target feature of each line of log record, to obtain the at least one first log record group.
  • The obtaining a log template of the log by processing each first log record group in the at least one first log record group includes: separately extracting the log template of each first log record group; and determining the log template of the log based on the log template of each first log record group.
  • the separately extracting the log templates in each of the first log record groups includes:
  • The determining the log template of the log based on the log template of each first log record group includes: performing clustering processing on the log templates of the at least one first log record group to obtain the log template of the log; or, merging the log templates of the at least one first log record group to obtain the log template of the log; or, using the log templates of the at least one first log record group as the log template of the log.
  • The obtaining the log template of the log by processing each first log record group in the at least one first log record group includes: determining the log template of the log based on the target log records of each first log record group, where the target log records of a first log record group are part of the log records in that first log record group.
  • The processing of the target log records is equivalent to the processing of all log records in the first log record group, but it effectively reduces the amount of data actually processed; this is equivalent to data sampling, which reduces the solution space and further reduces the computational cost.
  • the target log record of the first log record group is a row of log records randomly selected in the first log record group. In this way, it can be ensured that the probability of each log record in the first log record group being selected as the target log record is equal.
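  • A hedged sketch of this random selection (the function name and the dictionary representation of groups are illustrative assumptions): drawing the target record of each first log record group uniformly at random gives every record in a group an equal chance of being selected.

```python
import random

def pick_targets(groups, rng=random):
    """Select one target log record uniformly at random from each group.

    `groups` maps a group key (e.g. a locality-sensitive hash code) to the
    list of log records in that group; each record in a group has equal
    probability of being chosen as the group's target log record.
    """
    return {key: rng.choice(records) for key, records in groups.items()}
```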
  • The determining the log template of the log based on the target log record of each first log record group includes: determining at least one second log record group, where different second log record groups include different target log records among the target log records corresponding to the at least one first log record group, and all target log records included in each second log record group have the same target feature; and processing each second log record group separately to obtain the log template of the log.
  • the solution space can be further reduced by grouping.
  • The processing of each second log record group can be executed in parallel, thereby reducing operation delay; moreover, the amount of data to be computed in each execution of the processing is much smaller than the total data amount of the log, which effectively reduces the computational cost while improving computational efficiency.
  • The processing each second log record group separately to obtain the log template of the log includes: performing clustering processing on the log records in each second log record group to obtain at least one class of log records corresponding to each second log record group; performing template extraction on each class of log records obtained by the clustering processing to obtain the log template of each class of log records; and determining the log template of the log based on the log template of each class of log records in the at least one class of log records.
  • The processing of the various classes of log records can be executed in parallel, thereby reducing operation delay, and the amount of data to be computed in each execution of the processing is small, which effectively reduces computational cost while improving computational efficiency.
  • The determining the log template of the log based on the log template of each class of log records in the at least one class of log records includes: using the log template of each class of log records as the log template of the log; or, merging the log templates of the various classes of log records obtained by the clustering processing, and using the merged log template as the log template of the log.
  • the target feature of the log record includes at least one of the length of the log record, the first character of the log record, and the first word of the log record.
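  • As a hedged sketch of the grouping step (the function name, the dictionary-of-lists representation, and the use of all three target features at once are illustrative assumptions), records are grouped only when their locality-sensitive hash code and every target feature agree:

```python
from collections import defaultdict

def group_records(records, lsh):
    """Group log records whose LSH code and target features all match.

    `lsh` is any locality-sensitive hash function over a record string.
    The target features used here are the record length (in tokens), the
    first character, and the first token, as listed above.
    """
    groups = defaultdict(list)
    for rec in records:
        tokens = rec.split()
        key = (lsh(rec), len(tokens), rec[:1], tokens[0] if tokens else "")
        groups[key].append(rec)
    return dict(groups)
```

  • Because the key depends only on the record itself, the loop body has no cross-record dependency and the grouping can be parallelized, as the application notes.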
  • In a second aspect, an apparatus for extracting a log template is provided. The apparatus may include at least one module, and the at least one module may be used to implement the log template extraction method provided in the first aspect or any of its possible implementations.
  • the present application provides a computer device including a processor and a memory.
  • The memory stores computer instructions; when the processor executes the computer instructions stored in the memory, the computer device executes the method provided by the foregoing first aspect or any of its possible implementations, so that the computer device deploys the log template extraction apparatus provided by the foregoing second aspect or any of its possible implementations.
  • The present application provides a computer-readable storage medium having computer instructions stored therein; the computer instructions instruct the computer device to execute the method provided by the foregoing first aspect or any of its possible implementations, or instruct the computer device to deploy the log template extraction apparatus provided by the second aspect or any of its possible implementations.
  • the present application provides a computer program product.
  • the computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • The processor of the computer device can read the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by the above-mentioned first aspect or any of its possible implementations, and so that the computer device deploys the log template extraction apparatus provided by the second aspect or any of its possible implementations.
  • An analysis system is provided, including a terminal and an analysis device, where the analysis device includes the log template extraction apparatus described in the second aspect or any of its possible implementations, or the computer device described in the third aspect.
  • In a seventh aspect, a chip is provided.
  • The chip may include a programmable logic circuit and/or program instructions, and is used to implement the log template extraction method according to any one of the implementations of the first aspect when the chip is running.
  • the log records are grouped by the locally sensitive hash code of each log record, and the locally sensitive hash code can reflect the similarity of the corresponding log records of different rows, so that the grouping is similar to the clustering process.
  • the locality sensitive hash code and the target feature are the characteristics of the log record itself.
  • The processing of each first log record group can be executed in parallel, thereby reducing operation delay; moreover, the amount of data to be computed in each execution of the processing is much smaller than the total data amount of the log, which effectively reduces the computational cost while improving computational efficiency.
  • The log template extraction methods provided in the embodiments of the present application all filter out most log records, so that the solution space is reduced exponentially, which effectively reduces the operation cost and improves operation efficiency.
  • FIG. 1 is a schematic diagram of part of log content in a log provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an application environment involved in a log template extraction method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of an application environment involved in another log template extraction method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a locally sensitive hash algorithm provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a log template extraction method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of word segmentation results involved in a log template extraction method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of word segmentation results involved in another log template extraction method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a word segmentation result involved in another log template extraction method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the process of obtaining the locally sensitive hash codes of log records X3 and X4 provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the calculation process of the locally sensitive hash codes of the log record X3 and the log record X4 shown in FIG. 9;
  • FIG. 11 is a schematic diagram of a word segmentation result of log records X7 and X8 provided by an embodiment of the present application;
  • FIG. 12 is a schematic diagram of another word segmentation result of log records X7 and X8 provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a grouping process of a first log record group according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a template of a first log record group provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a word segmentation result provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a result of obtaining a locally sensitive hash code according to an embodiment of the present application.
  • FIG. 17 is a schematic diagram of a grouping result of a first log record group according to an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a target log record of a first log record group according to an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a grouping result of a second log record group according to an embodiment of the present application.
  • FIG. 20 is a schematic diagram of a log template provided by an embodiment of the present application.
  • FIG. 21 is a schematic diagram of a log template extraction device provided by an embodiment of the present application.
  • FIG. 22 is a schematic diagram of a first determining module provided by an embodiment of the present application.
  • FIG. 23 is a schematic diagram of a processing module provided by an embodiment of the present application.
  • FIG. 24 is a schematic diagram of a computing device provided by an embodiment of the present application.
  • the log analysis scenario includes an offline analysis scenario and an online analysis scenario.
  • In offline analysis scenarios, the log data for analysis can be batch log data, such as log files or log data queried from a log database; in online analysis scenarios, the log data for analysis can be real-time log data, also called log stream data.
  • log files are usually files downloaded by users, software developers, or operation and maintenance personnel, or files obtained through keyword searches.
  • Figure 1 is a schematic diagram of part of the log content in a log.
  • the log includes multiple lines of log records (also called log text), and each line of log records is used to record an event when the software is running.
  • Each log record is composed of multiple characters, and the multiple characters may include letters and/or symbols.
  • a log record includes a constant part and a variable part, the constant part includes at least one character, and the variable part includes at least one character.
  • For example, the pseudo code corresponding to a log record that records user login information is defined as: "logging.info('User %d login at %s', $uid, $time)".
  • The "$" in the pseudo code corresponding to the log record is used to mark the variable part (also called a variable name).
  • "$IP" means an Internet Protocol (IP) address. It can be seen from the pseudo code corresponding to the log record that the log record contains two variable parts, which are used to record the user name (uid) and login time (time) of each login.
  • the log records in the log usually have an implicit log template (also called a log pattern (pattern)), and the log template refers to a standard style or a fixed format used to generate the log records in the log.
  • When the pseudo code corresponding to the aforementioned log record is actually run, it outputs multiple lines of log records into the log for recording user login information.
  • the log where the multi-line log records are recorded is referred to as the first log:
  • the log template of the log is the log template of the log record in the log.
  • When the variable part of a log record is identified, the variable part is marked with a preset variable identifier.
  • This marking method essentially replaces the variable part with the variable identifier.
  • The variable identifier is usually the wildcard "*". After the variable part is replaced with the wildcard "*", the log template obtained for each log record is "User * login at *"; thus, the log template of the first log is "User * login at *".
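  • A minimal sketch of this template extraction (the function name and the column-wise comparison are illustrative assumptions; it assumes the records being compared have the same number of entries, which holds within a group whose target features include the record length): positions where all records agree form the constant part, and varying positions become the wildcard "*".

```python
def extract_template(records):
    """Derive a log template from same-length tokenized records.

    Positions where all records agree are kept as the constant part;
    positions that vary across records are replaced with the wildcard "*".
    """
    token_lists = [r.split() for r in records]
    template = []
    for column in zip(*token_lists):
        template.append(column[0] if len(set(column)) == 1 else "*")
    return " ".join(template)
```

  • For the first log, extract_template(["User 123 login at 10:00", "User 456 login at 11:30"]) yields "User * login at *".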
  • During matching, the variable identifier can be determined to match any character or entry. For example, when matching "*" with "046523", it can be determined that "*" matches "046523".
  • Logs can be divided into two types: homogeneous logs and heterogeneous logs.
  • In the following, the homogeneous log is taken as an example for description.
  • The log records in a homogeneous log have a unified log template.
  • the extraction process of the log template can be realized by means of regular expressions.
  • Part of the content of the heterogeneous log is as follows (the "#" in the heterogeneous log is used to mark the line number; in practice, the log content may not include the line number mark or the specific line number. This example includes them mainly for the reader's convenience, so the line number mark and the specific line number are ignored in subsequent processing).
  • the log where the multi-line log record is shown as follows is called the second log:
  • The current extraction process includes hierarchical clustering of the log records in the log. Due to the large amount of data to be processed, the clustering algorithm requires many operations to obtain the multiple classes of log records, which is computationally expensive.
  • the embodiment of the present application provides a method for extracting a log template, which can reduce the computational cost in the process of extracting the log template.
  • FIG. 2 is a schematic diagram of an application environment involved in a log template extraction method provided by an embodiment of the present application.
  • the application environment includes a terminal 110, an analysis device 120, and a network device 130.
  • The terminal 110 may be a display, a computer, a smart phone, a tablet computer, a laptop computer, or another device capable of interacting with a user.
  • The analysis device 120 may be a server, a server cluster composed of several servers, or another device capable of performing data analysis.
  • the analysis device 120 may be a cloud server (also referred to as a cloud computing server), for example, a deep learning server for providing a deep learning service (Deep Learning Service, DLS).
  • the terminal 110 establishes a wired or wireless communication connection with the analysis device 120 through a communication network.
  • the network device 130 may be a device capable of running software and generating log data, such as a sensor or a terminal.
  • the network device 130 is used to provide the analysis device 120 with data to be analyzed, the analysis device 120 is used to analyze log data, and the terminal 110 is used to present the analysis result to the user.
  • The communication network involved in the embodiments of this application is a second-generation (2-Generation, 2G) communication network, a third-generation (3rd Generation, 3G) communication network, a long-term evolution (Long Term Evolution, LTE) communication network, or a fifth-generation (5th Generation, 5G) communication network, etc.
  • the aforementioned application environment may also include a storage device, which is used to store data required by the terminal 110, the analysis device 120, and/or the network device 130.
  • the storage device may be a distributed storage device.
  • the terminal 110, the analysis device 120 and/or the network device 130 can read and write data stored in the storage device.
  • the storage device performs data storage, which can reduce the load of the analysis device and improve the data analysis efficiency of the analysis device.
  • the storage device may not be provided.
  • the functions of the terminal 110 and the analysis device 120 may also be implemented by the same device, such as a computer.
  • the application environment includes two parts: the foreground 201 and the background 202.
  • The foreground 201 is used to present data to the user and receive data input by the user, realizing interaction with the user; the background 202 is used to exchange data with the foreground 201 and to perform management operations and/or data processing.
  • The foreground 201 may be deployed in the aforementioned terminal 110.
  • the background 202 can be deployed in the aforementioned analysis device 120.
  • A client, a script, or a browser may be installed in the terminal 110 to implement the deployment of the foreground 201.
  • the terminal 110 may present the user interface in the form of a client interface, a terminal interface, or a webpage corresponding to a browser.
  • the log template extraction method provided in the embodiments of the present application can be used in scenarios such as software debugging, performance optimization, or business analysis. Specifically, it can be applied to anomaly detection scenarios in these scenarios. Anomaly detection refers to the detection of patterns that do not meet expectations.
  • the data source of anomaly detection is log data generated by the operation of software in an application, process, operating system, device, or network.
  • the aforementioned analysis device 120 may use a deep learning algorithm to detect anomalies in log data.
  • the log template extraction method provided in the embodiment of the present application can also be used in other scenarios such as log compression and keyword retrieval, which is not limited in the embodiment of the present application.
  • the Locality Sensitive Hash (LSH) code is a hash code obtained based on the Locality Sensitive Hash algorithm.
  • The locality-sensitive hash code can reflect the similarity of the data (which may be called input data) to be processed by the locality-sensitive hashing algorithm.
  • the data may be the data recorded in the aforementioned log.
  • The locality-sensitive hashing algorithm can preserve the similarity relationship between input data.
  • For similar input data, the obtained locality-sensitive hash codes (which may be referred to as output data) are also very similar; in scenarios where the input data are very similar, the obtained locality-sensitive hash codes may even produce a hash collision, that is, for different but similar input data, the output locality-sensitive hash codes are exactly the same.
  • the log template is extracted based on this characteristic of the local sensitive hash code.
  • an embodiment of the present application provides a log template extraction method, which is applied in the application environment shown in FIG. 2 or FIG. 3.
  • Figure 5 illustrates the application of this method to an anomaly detection scenario as an example. The method includes:
  • Step 301 The analysis device obtains a log, and the log includes multiple lines of log records.
  • the analysis device supports the analysis of these two forms of logs.
  • the analysis device periodically obtains log files or obtains log files during a specified time period to obtain batch log data.
  • The specified time period may be a low-power-consumption period of the terminal and/or server (that is, a time period during which the power consumption is less than a specified power consumption threshold), which can reduce the impact of log file acquisition and subsequent log analysis on other functions of the terminal and/or server. In another optional example, the analysis device continuously obtains real-time log data. In yet another optional example, the analysis device obtains batch log data or real-time log data after receiving an analysis instruction.
  • the analysis instruction may be triggered by the user at the terminal and sent by the terminal to the analysis device.
  • When the analysis device obtains and analyzes the log stream in real time, it can monitor the log stream promptly; if an exception occurs in the log stream, it can be discovered and reported in time, which improves the effectiveness of anomaly detection, avoids the occurrence of large-scale anomalies, and improves user experience.
  • The log template extraction method provided in the embodiments of the present application can be used for log template extraction of the aforementioned homogeneous logs, and can also be used for template extraction of the aforementioned heterogeneous logs.
  • In one optional manner, the analysis device may directly perform step 302. In another optional manner, since the log template extraction method provided in this embodiment of the application is more computationally efficient when applied to heterogeneous logs, the log type can be detected first: if the log is a homogeneous log, regular expressions are used to extract its template; if the log is a heterogeneous log, the subsequent step 302 is performed.
  • Step 302 The analysis device determines the locally sensitive hash code of each log record in the multi-line log record of the log.
  • the analysis device can determine the local sensitive hash code in a variety of ways.
  • the embodiment of this application takes the following two optional implementation methods as examples for illustration:
  • the process of determining the locally sensitive hash code of each log record in the multi-line log record of the log may include:
  • Step A1 The analysis device obtains at least one token of each log record in the log.
  • the analysis device may segment each log record in the log by word segmentation technology to obtain at least one entry of each log record after the word segmentation.
  • a line of log records can be divided to obtain at least two entries; in a few cases, a line of log records can be divided to obtain one entry, and the embodiment of the present application does not limit the number of entries obtained by the division.
  • word segmentation is to cut each line of log records into a collection of entries.
  • Through word segmentation, the processing complexity of log records can be reduced, the computational cost of obtaining subsequent locality-sensitive hash codes can be lowered, and computational efficiency can be improved.
  • Word segmentation can be performed based on different methods. For example, use space-based segmentation (this method can use a String.split() statement to split on spaces); alternatively, use special characters for segmentation; or use natural-language-based segmentation.
  • space segmentation refers to dividing a line of log records into multiple entries according to the space.
  • When space segmentation is used, the segmentation process is simple and the segmentation efficiency is high. When special characters are used for word segmentation, the special characters are usually characters specified by the user, such as "##"; this can make the semantic units included in the segmented entries more accurate, with higher segmentation accuracy. The method of natural-language segmentation is more general: log records can be directly input into natural-language-based tokenizers such as NLTK Word_Tokenizer, TreeBank_Tokenizer, or S-Expression_tokenizer.
  • the subsequent embodiments all take the word segmentation processing based on spaces for log records as an example for description.
• Depending on the word segmentation mechanism used, the word segmentation results obtained are different. The embodiment of the present application illustrates the word segmentation results in the following two optional ways:
• In the first optional way, each entry obtained by word segmentation includes only one semantic unit. The semantic unit is a word or a symbol; the symbol can be a number symbol (abbreviated as a number), such as 1 or 2, or another symbol, such as "/" or ":".
• Fig. 6 shows the word segmentation result obtained by performing word segmentation on each log record of the aforementioned second log, where each entry includes only one semantic unit. Taking the first line of log record in Fig. 6 as an example, word segmentation yields 7 entries: "mod_jk", "child", "workerEnv", "in", "error", "state" and "6".
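• The space-based segmentation described above can be sketched as follows (a minimal illustration; the function name is hypothetical and not part of the embodiment):

```python
# Minimal sketch of space-based word segmentation (step A1).
# Each resulting entry ("token") holds exactly one semantic unit.
def tokenize(log_record: str) -> list:
    return log_record.split()

# The first line of log record from the Fig. 6 example yields 7 entries.
print(tokenize("mod_jk child workerEnv in error state 6"))
```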
• In the second optional way, each entry obtained by word segmentation includes multiple semantic units; that is, each entry includes m semantic units, where m is an integer greater than 1 and less than the total number of semantic units in the log record. The semantic unit is a word or a symbol. In every two adjacent entries, the last m-1 semantic units of the first entry are the same as the first m-1 semantic units of the second entry, where the first entry is the entry preceding the second entry. This word segmentation action can be implemented using a sliding window mechanism.
• In practice, the analysis device can input each line of log record as a character stream into a designated tokenizer, which performs the word segmentation; the analysis device then receives the word segmentation result output by the tokenizer. The different word segmentation mechanisms corresponding to the foregoing first and second optional ways are implemented by different tokenizers, and the analysis device can support at least one word segmentation mechanism.
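• The sliding-window segmentation of the second optional way (each entry holding m overlapping semantic units) can be sketched as follows; the function is a hypothetical illustration:

```python
# Sliding-window word segmentation: each entry contains m semantic units,
# and adjacent entries overlap in m-1 units, as described above.
def sliding_window_tokens(log_record: str, m: int) -> list:
    units = log_record.split()  # single semantic units first
    if m >= len(units):
        return [tuple(units)]
    return [tuple(units[i:i + m]) for i in range(len(units) - m + 1)]

print(sliding_window_tokens("flush cost time is 122", 2))
```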
• Step A2: The analysis device determines the locality-sensitive hash code of each log record based on the at least one entry in each log record.
• In an optional manner, the analysis device may determine the locality-sensitive hash code of each log record based on a target locality-sensitive hash algorithm and the at least one entry in each log record. The locality-sensitive hash calculation process in the target locality-sensitive hash algorithm can refer to that of the Simhash algorithm or the Minhash algorithm. In this case, the smallest unit of data processed by the target locality-sensitive hash algorithm is an entry.
• For any log record, a weighted summation method may be adopted to determine its locality-sensitive hash code; this process can refer to the Simhash algorithm. The process of determining the locality-sensitive hash code of a log record by weighted summation may include:
• Step A21: For any log record, calculate the hash code of each entry in the log record, where the hash code is composed of binary digits 0 and 1.
• Step A22: Perform a weighted summation on the hash codes of the calculated entries based on the weight assigned to each entry. The product of each hash code and its weight follows this rule: when a bit of the hash code is 1, the weight contributes positively at that position; when a bit of the hash code is 0, the weight contributes negatively at that position.
• Step A23: Perform dimensionality reduction on the obtained weighted summation result to obtain the locality-sensitive hash code. Dimensionality reduction here means setting each value greater than 0 in the weighted summation result to 1 and each value not greater than 0 to 0.
• For example, suppose log record X3 is "saveLogSize cost time is 1057" and log record X4 is "flush cost time is 122", and each entry obtained after word segmentation of X3 and X4 includes only one semantic unit. Figure 9 shows the process of obtaining the locality-sensitive hash codes of log records X3 and X4.
• Assume the hash code of "flush" is "10010111" and its weight is 1; its product with the weight is then "1, -1, -1, 1, -1, 1, 1, 1" (the commas are separators and do not exist in the actual calculation). Performing a weighted summation on the hash codes of the calculated entries means summing the weighted hash codes position by position. Assume the final weighted summation result is "5, -3, -1, 1, -3, -1, 5, 3": the first position, 5, is the sum of the first position of the product of each entry's hash code and its weight; the second position, -3, is the sum of the second position of each such product, that is, (-1)+(-1)+1+(-1)+(-1); the other positions are calculated in the same way.
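• Steps A21 to A23 can be sketched as follows. This is an assumed implementation: MD5 stands in for the unspecified per-entry hash function, and a 64-bit code is produced rather than the 8-bit codes of the figure:

```python
import hashlib

def simhash(tokens, weights=None, bits=64):
    # Step A21: hash each entry; step A22: weighted sum (+w for bit 1, -w for bit 0);
    # step A23: dimensionality reduction (>0 -> 1, otherwise 0).
    if weights is None:
        weights = [1] * len(tokens)
    acc = [0] * bits
    for tok, w in zip(tokens, weights):
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i, v in enumerate(acc) if v > 0)
```

• With all weights equal to 1, the summation is order-insensitive, which is exactly the second type of undesired hash collision discussed below; position-dependent weights remove it.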
• When all weights in the weighted summation are the same, the calculated locality-sensitive hash codes may cause undesired hash collisions; such a scheme is therefore usually used in scenarios where the requirement on analysis accuracy is not high but the requirement on calculation delay is high.
• The first type of undesired hash collision is a hash collision caused by different log record contents. For example, log records X3 and X4 have the same length, and the locality-sensitive hash codes determined by the foregoing target locality-sensitive hash algorithm are both "10010011". Although the locality-sensitive hash codes obtained in this case are the same, the two log records actually differ greatly because their contents are different. The locality-sensitive hash codes therefore cannot effectively reflect the similarity of log records X3 and X4, so this is called an undesired hash collision.
• The second type of undesired hash collision is a hash collision caused by a different order of entries in log records. For example, suppose log record X5 is "flush cost time is 122" and log record X6 is "122 is flush cost time", and each entry obtained after word segmentation of X5 and X6 includes only one semantic unit. Log records X5 and X6 essentially contain the same entries, only in a different order; therefore, when the foregoing target locality-sensitive hash algorithm is used with all weights set to 1, the finally determined locality-sensitive hash codes are the same.
• The traditional Simhash algorithm is used to compare article similarity, and its processing object is an article; the weights configured for the segmented entries are positively correlated with word frequency, that is, the higher the word frequency, the greater the weight. If the weights are set in this traditional manner, identical entries receive identical weights, and the final locality-sensitive hash codes of X5 and X6 are still the same. Although the locality-sensitive hash codes obtained in this case are the same, the two lines of log records are actually quite different because of their different entry orders. The locality-sensitive hash codes therefore cannot effectively reflect the similarity of log records X5 and X6, so this is also called an undesired hash collision.
• In the embodiment of the present application, different weights can be set for different entries in each log record to reduce the occurrence of the foregoing two types of hash collisions. Correspondingly, the process of determining the locality-sensitive hash code of each log record based on the at least one entry of each log record may include: determining the locality-sensitive hash code of a first log record based on the multiple entries in the first log record and the weight assigned to each entry, where the weights of at least two entries included in the first log record are different from each other. The method of obtaining the locality-sensitive hash codes of the other log records in the multi-line log records can refer to that of the first log record. The locality-sensitive hash code of the first log record can be determined by the foregoing weighted summation method; for the specific process, refer to the foregoing steps A21 to A23. When the weights set for entries in the same position of every log record are the same, and the weights set for at least two entries of the same log record are different, the occurrence of undesired hash collisions can be effectively reduced.
• The weights of the entries in the same line of log record can be set according to actual conditions, for example increasing or decreasing as an arithmetic sequence, or in other ways. For example, if the weight of the first entry is 3, the weight of the second entry is 2, and the weight of every other entry is 1, the calculation process of the locality-sensitive hash codes of log records X3 and X4 is as shown in Fig. 10, and the final locality-sensitive hash codes of the two log records are different; in this way, the first type of undesired hash collision is resolved. If the same weights are applied to log records X5 and X6, the finally determined locality-sensitive hash codes of the two log records are also different; in this way, the second type of undesired hash collision is also resolved.
• As noted above, the traditional Simhash algorithm compares the similarity of articles and weights entries by word frequency. In the embodiment of the present application, by contrast, the weight of an entry in each log record may be related to the entry's own attributes and decorrelated from word frequency. In log records, the weights of entries in the constant part are usually set larger than those of entries in the variable part, and the first several entries in a log record usually belong to the constant part. Therefore, the weights of the first g entries in the log record can be set greater than the weights of the other entries, where g < k, g is a positive integer, and k is the length of the log record. Optionally, the weights of the first g entries decrease successively, and the other weights are equal and smaller than the minimum weight among the first g entries; g can be 1. In this way, the weight of an entry in a log record is associated with its position attribute, that is, the weight of each entry is determined based on the position of the entry in the log record, so that the locality-sensitive hash code can be calculated more accurately, further reducing undesired hash collisions.
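• The position-based weighting just described (first g weights decreasing, the rest equal and smaller) can be sketched as follows; the concrete values are assumptions matching the g = 2 example above:

```python
def position_weights(k, g=2, base=1):
    # First g entries get decreasing weights (g+base, g+base-1, ...),
    # the remaining entries share the smaller equal weight `base`.
    return [g - i + base if i < g else base for i in range(k)]

print(position_weights(5))  # first entry weight 3, second 2, others 1
```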
• In addition, obtaining entries through the second optional way in the foregoing step A1 (each entry including multiple semantic units) can also reduce the foregoing first and second types of undesired hash collisions. For example, suppose log record X7 is "detected a failure in network connection" and log record X8 is "network connection: a failure is detected". If a word segmentation method in which each entry includes only one semantic unit is used, the word segmentation result of log record X7 is {detected, a, failure, in, network, connection}, and the word segmentation result of log record X8 is {detected, a, failure, is, network, connection}. Because these entry sets are almost identical, the locality-sensitive hash codes of log records X7 and X8 may be very similar, and a collision, the first type of undesired hash collision, may even occur. By adopting a word segmentation method in which an entry includes multiple semantic units, the foregoing second type of undesired hash collision can likewise be avoided.
• In the second optional implementation manner, the process of determining the locality-sensitive hash code of each log record in the multi-line log records of the log may include: directly determining the locality-sensitive hash code of each log record based on the content of each log record, that is, without performing the foregoing step A1; this process can refer to the foregoing step A2. For example, the analysis device can determine the locality-sensitive hash code of each log record based on the foregoing target locality-sensitive hash algorithm and the content of each log record: it may input the content of each log record (that is, the character stream) into the algorithm model of the target locality-sensitive hash algorithm and receive the locality-sensitive hash code of each log record output by the algorithm model. In this case, the smallest unit of data processed by the target locality-sensitive hash algorithm is a character. In the second optional implementation, the data granularity (that is, the smallest unit of data processed by the foregoing target locality-sensitive hash algorithm) when obtaining the locality-sensitive hash code of each log record is a character, whereas in the first optional implementation it is an entry. The first optional implementation therefore has a larger data granularity, requires fewer calculations, and can save calculation cost compared with the second optional implementation.
• In the third optional implementation manner, the process of determining the locality-sensitive hash code of each log record in the multi-line log records of the log may include: for each log record, taking every n characters as the minimum data processing unit to obtain the locality-sensitive hash code of each log record, where n is an integer greater than 1. Each log record can be divided by the sliding window mechanism (this process can refer to the second optional way of step A1 above, with the unit of division changed from m semantic units to n characters). For example, the analysis device may input each line of log record into the algorithm model of the target locality-sensitive hash algorithm in units of n characters and receive the locality-sensitive hash code of each line of log record output by the algorithm model; that is, the smallest unit of data processed by the target locality-sensitive hash algorithm is n characters. This process can refer to the n-gram algorithm (a language model). In the third optional implementation, the data granularity when obtaining the locality-sensitive hash code of each log record is n characters. Therefore, the first optional implementation requires fewer calculations than the third and can save calculation cost, and the third optional implementation requires fewer calculations than the second and can save calculation cost.
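• The character n-gram division of the third optional implementation can be sketched as follows (a hypothetical helper; the downstream hashing is unchanged):

```python
def char_ngrams(log_record: str, n: int) -> list:
    # Slide a window of n characters over the record; each n-gram is the
    # minimum data processing unit fed to the locality-sensitive hash.
    if n >= len(log_record):
        return [log_record]
    return [log_record[i:i + n] for i in range(len(log_record) - n + 1)]

print(char_ngrams("flush", 3))
```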
• It should be noted that, in the process of determining the locality-sensitive hash code of each line of log record (such as when implementing the foregoing first or second optional implementation), the analysis device can also preprocess each line of log record to improve the efficiency of obtaining the locality-sensitive hash codes and reduce the calculation cost. Correspondingly, the process of determining the locality-sensitive hash code of each log record in the multi-line log records of the log may include:
• Step B1: The analysis device replaces p designated characters in each line of log record in the log with q fixed characters to obtain an updated line of log record.
• Optionally, the designated character may be a number, and the fixed character may be a number or another symbol, for example 1, 2 or "*". Optionally, 1 ≤ q < p, that is, the number of designated characters to be replaced is greater than the number of fixed characters; in this way, the number of characters contained in the log record is reduced to a certain extent, thereby reducing the calculation complexity of the subsequent locality-sensitive hash code.
• For example, if the designated character is a number, the fixed character is "*", and log record X9 is "Connected to 10.110.12.01 at 2019-11-04 15:40:00", the updated log record X9 obtained in this way can be: "Connected to *.**.*.* at **-*-**:*:*".
• In another optional manner, the analysis device can replace multiple consecutive designated characters in each line of log record in the log with one fixed character to obtain an updated line of log record. For example, if the designated character is a number, the fixed character is "*", and log record X9 is "Connected to 10.110.12.01 at 2019-11-04 15:40:00", the updated log record X9 obtained in this way is: "Connected to *.*.*.* at *-*-* *:*:*".
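• The second replacement manner (collapsing every run of consecutive digits into one "*") can be sketched with a regular expression; this reproduces the updated X9 shown above:

```python
import re

def mask_numbers(log_record: str) -> str:
    # Step B1, second manner: replace each run of consecutive digits
    # with a single "*", shrinking the record before hashing.
    return re.sub(r"\d+", "*", log_record)

print(mask_numbers("Connected to 10.110.12.01 at 2019-11-04 15:40:00"))
```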
• Step B2: Determine the locality-sensitive hash code of each log record based on each updated log record.
• Step B2 can refer to the process of step A2 in the foregoing first optional implementation, that is, determining the locality-sensitive hash code based on the word segmentation result; it can also refer to the foregoing second optional implementation, that is, determining the locality-sensitive hash code directly based on the content of each log record without word segmentation; the foregoing third optional implementation or other implementations may also be adopted, which is not limited in the embodiment of the present application. When step B2 is implemented in the manner of step A2 of the first optional implementation, the foregoing step B1 can be executed before word segmentation (that is, step A1) or after it; in other words, the process of determining the locality-sensitive hash code of each log record based on each updated log record includes steps A1, B1, and B2 performed in sequence, or steps B1, A1, and B2 performed in sequence.
• Step 303: The analysis device determines at least one first log record group, where different first log record groups include different lines of log records in the log, and all log records included in each first log record group have the same locality-sensitive hash code.
• In an optional manner, the analysis device may divide the log records with the same locality-sensitive hash code among the multiple lines of log records of the log into the same first log record group to obtain the at least one first log record group. Since the locality-sensitive hash code can reflect the similarity of the corresponding log records in different lines, each first log record group is equivalent to one type of log record, and the effect of grouping is similar to that of clustering. However, clustering needs to calculate the distance between every pair of features (if clustering were applied to the log template extraction process, each feature would be all the entries included in a log record), and calculating these pairwise distances is costly; the computational complexity of the foregoing grouping method is therefore far less than that of clustering.
• For example, the process of clustering by hierarchical clustering includes: calculating the distance matrix based on a defined distance function (distance measurement), which can be the Jaccard distance function; determining multiple pairs of aggregatable log records based on the distance matrix (each pair of aggregatable log records is usually the pair with the highest similarity, which can be determined based on the minimum value of each column in the distance matrix); and aggregating each determined pair of log records. The aggregation result is represented by a binary tree, such as a dendrogram.
• Hierarchical clustering of u elements has a computational complexity on the order of O(u³), where O represents complexity and one element is one line of log record; even after optimization, the complexity can only be reduced to O(u²·log u). If the log includes tens of thousands of log records, hundreds of millions to tens of billions of calculations are needed to perform the foregoing hierarchical clustering, which causes performance bottlenecks and affects user experience and system stability.
• Moreover, a clustering algorithm needs to calculate the distance between each line of log record in the log and the other lines of log records; that is, the log records in a log are correlated with each other, and each line of log record cannot be computed independently during clustering, so the computational complexity is higher and the calculation delay is longer. In contrast, the locality-sensitive hash code is a characteristic of a log record itself, and obtaining the locality-sensitive hash code of one log record does not require considering other log records, thereby decorrelating the log records in the log from each other in the grouping process.
• Therefore, the grouping of the multiple lines of log records can be executed in parallel (also known as concurrent execution): the locality-sensitive hash code of each line of log record is calculated separately, and grouping is performed based on the calculated locality-sensitive hash codes to obtain one or more first log record groups, so that the subsequent processing of each first log record group can be performed separately. The subsequent processing (such as step 304) of each first log record group can likewise be executed in parallel, thereby reducing the calculation delay; moreover, the amount of data to be calculated in each execution of the processing is far smaller than the overall data volume of the log, effectively reducing the calculation cost and improving calculation efficiency.
• In another optional manner, the analysis device may also determine the at least one first log record group in the following way: grouping the log records based on the locality-sensitive hash code of each line of log record in the log and the target feature of each line of log record, to obtain the at least one first log record group. In this manner too, the grouping of the multiple lines of log records can be executed in parallel, which effectively reduces the operation delay and improves operation efficiency. The division rule may be: all log records included in each first log record group have the same locality-sensitive hash code and the same target feature. That is, the analysis device can divide the log records with the same target feature and the same locality-sensitive hash code among the multiple lines of log records of the log into the same first log record group to obtain the at least one first log record group.
• Optionally, the target feature of a log record includes at least one of the length of the log record, the first character of the log record, and the first word of the log record. The length of a log record is expressed by the number of entries included in the log record; for example, the length of log record X1 in Fig. 8 is 4, and the length of log record X2 is 4. Since the length of a log record is a typical feature of log records, the probability that log records of different lengths use the same log template is low. Therefore, using the length of log records as the target feature can effectively prevent some substantially dissimilar log records (such as log records with different lengths but similar content) from being divided into the same first log record group, thereby reducing the probability of undesired hash collisions (such as the foregoing first type of undesired hash collision).
• The beginning part of a log record is usually the constant part, and the first character and the first word of a log record are usually constants. Log records with different beginnings therefore have a low probability of using the same log template. Using the first character or the first word of the log record as the target feature can effectively prevent some substantially dissimilar log records (for example, log records with different beginning parts but similar other parts) from being divided into the same first log record group, thereby reducing the probability of undesired hash collisions (such as the foregoing first and second types of undesired hash collisions).
• In practice, the analysis device usually determines the at least one first log record group by traversing each line of log record in the log; that is, it traverses each log record in the log and successively divides log records with the same locality-sensitive hash code into the same first log record group. If the division rule is that all log records included in each first log record group have the same locality-sensitive hash code and the same target feature, the analysis device traverses each log record in the log and successively divides log records with the same target feature and the same locality-sensitive hash code among the multiple lines of log records into the same first log record group. In each first log record group, the log records are written into the group one by one in the form of a text stream (that is, line by line). Each first log record group can be established before the division action (that is, during initialization, in which case the first log record group starts empty) or during the division process, which is not limited in the embodiment of this application.
• For example, suppose the foregoing word segmentation process ignores the line number mark and the specific line number, and grouping is based on the locality-sensitive hash code of each log record and the length of the log record (that is, the target feature of the log record is the length of the log record). Assume that the locality-sensitive hash code and length of the log record in line 0 and those of the log record in line 4 are the same, those of the log records in lines 3 and 5 are the same, and those of the log records in lines 1 and 2 are the same. The analysis device traverses the log records from line 0 to line 5 in the log. For the line-0 log record, since no related group exists, first log record group 0 is established and the line-0 log record is divided into first log record group 0; for the line-1 log record, since no related group exists, first log record group 1 is established and the line-1 log record is divided into first log record group 1; for the line-2 log record, since its locality-sensitive hash code and length are the same as those of the line-1 log record, it is divided into first log record group 1; the remaining lines are handled in the same way.
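• The traversal-based grouping of step 303 (same locality-sensitive hash code and same length into one group) can be sketched as follows; `lsh` stands for any locality-sensitive hash function over a token list and is left abstract here:

```python
from collections import defaultdict

def group_records(records, lsh):
    # Divide records sharing (LSH code, length) into the same
    # first log record group, in a single traversal of the log.
    groups = defaultdict(list)
    for rec in records:
        tokens = rec.split()
        groups[(lsh(tokens), len(tokens))].append(rec)
    return list(groups.values())
```

• Because each record's grouping key depends only on the record itself, the key computation can run in parallel over the lines of the log, unlike pairwise clustering.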
• Step 304: The analysis device obtains a log template of the log by processing each first log record group in the at least one first log record group.
• As noted above, each first log record group is equivalent to one type of log record; on this basis, there may be multiple optional processing methods for obtaining the log templates involved in the entire log. The embodiment of the present application takes the following optional processing methods as examples to illustrate the process of obtaining the log templates involved in the log.
• The first optional processing method is to obtain the log template of each first log record group, and determine the log templates of the log based on the obtained log templates. The process includes:
• Step C1: The analysis device separately extracts the log template of each first log record group.
• Optionally, step C1 may include the following steps:
  • Step C11 For each log record in each first log record group, the analysis device may compare the log record with the historical log template of the first log record group.
• As described above, the analysis device determines the at least one first log record group based on the locality-sensitive hash code and, optionally, other target features; therefore, the lengths of the log records in the same resulting first log record group may be the same or different.
• In the first case, the lengths of the log records in the same first log record group are the same; that is, in the foregoing step 303, the analysis device determines the at least one first log record group based on the locality-sensitive hash code and the length of the log record (that is, the target feature of the log record includes at least the length). In this case, the log record can be compared with the historical log template of the first log record group by positional comparison.
• In the first example, if the proportion of the number of identical entries in the log record length (that is, the total number of entries in the log record) is greater than a first proportion threshold, it is determined that the log record matches the historical log template; if the proportion of the number of identical entries in the log record length is not greater than the first proportion threshold, it is determined that the log record does not match the historical log template. In the second example, if the proportion of the number of different entries (entries in the log record that differ from the historical log template) in the log record length is less than a second proportion threshold, it is determined that the log record matches the historical log template; if that proportion is not less than the second proportion threshold, it is determined that the log record does not match. In the third example, if the proportion of the number of identical entries in the log record length is greater than the first proportion threshold and the number of different entries is less than a first number threshold, the log record matches the historical log template; if the proportion of identical entries is not greater than the first proportion threshold, or the number of different entries is not less than the first number threshold, the log record does not match. In the fourth example, if the number of identical entries is greater than a second number threshold, the log record matches the historical log template; if it is not greater than the second number threshold, the log record does not match. In the fifth example, if the number of different entries is less than a third number threshold, the log record matches the historical log template; if it is not less than the third number threshold, the log record does not match. There may be other ways to determine whether the log record matches the historical log template of the first log record group, which is not limited in the embodiment of the present application.
• The so-called positional comparison compares the entries at the same position between the log record and the historical log template.
• For example, suppose log record X10 is "User Yang Xiao Yu has been logged in" and the historical log template is "User * * * has been logged in". Assume each entry includes one semantic unit; then "User", "Yang", "Xiao", "Yu", "has", "been", "logged", "in" and "User", "*", "*", "*", "has", "been", "logged", "in" are compared in one-to-one correspondence. Because the symbol "*" in the template matches any entry, the number of identical entries is 8 and the log record length is 8. Assuming the first proportion threshold is 1/2, since 8/8 is greater than 1/2, it is determined that the log record matches the historical log template.
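• The positional comparison with the first proportion threshold can be sketched as follows; treating "*" in the template as matching any entry is an assumption consistent with the X10 example:

```python
def matches_template(record_tokens, template_tokens, threshold=0.5):
    # Same-length positional comparison: count positions whose entries are
    # identical (a template "*" matches anything); match if the share of
    # identical entries exceeds the first proportion threshold.
    if len(record_tokens) != len(template_tokens):
        return False
    same = sum(1 for r, t in zip(record_tokens, template_tokens)
               if t == "*" or r == t)
    return same / len(record_tokens) > threshold
```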
  • In an optional manner, the lengths of the log records in the same first log record group may differ. That is, in the foregoing step 303, the analysis device determines the at least one first log record group based only on the locality-sensitive hash code, or based on the locality-sensitive hash code and target features other than the length of the log record. In this case, the log record and the historical log template are each treated as an entry sequence, and the log record can be compared with the historical log template of the first log record group by finding the longest common subsequence (LCS) of the two.
  • Based on the length of the determined longest common subsequence (that is, the total number of entries in the longest common subsequence) and its proportion in the log record length (that is, the total number of entries in the log record), it can be determined whether the log record matches the historical log template.
  • For example, if the proportion of the length of the sequence other than the longest common subsequence in the log record (that is, the total number of entries in the log record minus the total number of entries in the longest common subsequence) in the log record length is less than the fourth proportion threshold, it is determined that the log record matches the historical log template; if that proportion is not less than the fourth proportion threshold, it is determined that the log record does not match the historical log template.
  • For another example, if the length of the longest common subsequence is greater than the first length threshold, it is determined that the log record matches the historical log template; if the length of the longest common subsequence is not greater than the first length threshold, it is determined that the log record does not match the historical log template.
  • There are other ways to determine whether the log record matches the historical log template of the first log record group, which are not limited in the embodiment of the present application.
  • Finding the longest common subsequence between the log record and the historical log template means obtaining the longest common subsequence of their two entry sequences.
  • The longest common subsequence can be obtained by recursion or dynamic programming. Taking each entry including one semantic unit as an example, suppose that in a first log record group, the log record X10 is: "User Yang Xiao Yu has been logged in"; the first historical log template is: "User * has been logged in". The longest common subsequence of the two is: "User has been logged in".
  • Suppose also that the log record X11 is: "User Yang Xiao Yu has been logged in"; the second historical log template is: "User * * registered successfully". Then the longest common subsequence of the two is: "User".
  • Assuming the first length threshold is 3: the length of the longest common subsequence of the log record X10 and the first historical log template is 5, which is greater than 3, so it is determined that the log record X10 matches the first historical log template; the length of the longest common subsequence of the log record X11 and the second historical log template is 1, which is not greater than 3, so it is determined that the log record X11 does not match the second historical log template.
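A dynamic-programming sketch of the LCS comparison (illustrative; because the variable indicator "*" equals no entry of the record, the DP naturally skips it, yielding the subsequences of the X10/X11 example):

```python
def lcs(a, b):
    """Longest common subsequence of two entry sequences (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] \
                else max(dp[i + 1][j], dp[i][j + 1])
    seq, i, j = [], m, n          # backtrack to recover the subsequence itself
    while i and j:
        if a[i - 1] == b[j - 1]:
            seq.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return seq[::-1]

x10 = "User Yang Xiao Yu has been logged in".split()
print(lcs(x10, "User * has been logged in".split()))
# ['User', 'has', 'been', 'logged', 'in'] -> length 5 > threshold 3: match
print(len(lcs(x10, "User * * registered successfully".split())))
# 1 -> not greater than 3: no match
```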
  • In most cases, only one log template exists in a first log record group; in a few cases, there are multiple log templates in the first log record group.
  • When there are multiple log templates, the log record can be compared with each of the multiple log templates; the comparison process refers to the two cases described above. Alternatively, the distance between the log record and each of the multiple log templates can be calculated (for example, using the Jaccard distance function), and the log record compared only with the nearest log template, which can reduce the computational cost.
  • Step C12 When the log record matches the historical log template, a new log template of the first log record group is determined based on the log record and the historical log template.
  • the historical log template is an extracted template, and the log record matches the historical log template, the historical log template can be directly used as the new log template of the first log record group.
  • In an optional manner, the analysis device can replace the part of the historical log template that differs from the log record with a variable indicator. Alternatively, the analysis device first determines whether the part of the historical log template that differs from the log record is only the part where a variable indicator is located: if the differing part also includes parts other than those where variable indicators are located, those other parts are replaced with variable indicators to obtain the new log template; if the differing part includes only the parts where variable indicators are located, the historical log template can be used directly as the new log template. In this way, a more accurate log template can be obtained.
  • It should be noted that after the new log template is obtained, the corresponding historical log template can be updated; for example, the historical log template can be deleted, or overwritten by the new log template, to ensure that no duplicate log templates exist in the first log record group.
  • In one alternative, a variable indicator replaces only one entry, and a variable indicator in the log template is regarded as having a length of 1.
  • In another alternative, one variable indicator can replace one or more consecutive entries during the replacement operation.
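Step C12 for the aligned (equal-length) case can be sketched as below, replacing each differing entry with one variable indicator (an illustrative sketch; the function name is hypothetical):

```python
def derive_template(record, template, indicator="*"):
    """Replace every entry of the template that differs from the record
    with the variable indicator (one indicator per entry)."""
    rec, tpl = record.split(), template.split()
    assert len(rec) == len(tpl), "aligned case assumes equal lengths"
    return " ".join(t if r == t else indicator for r, t in zip(rec, tpl))

print(derive_template("User Yang Xiao Yu has been logged in",
                      "User Li Xiao Hua has been logged in"))
# User * Xiao * has been logged in
```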
  • Step C13 When the log record does not match the historical log template, the log template extracted from the log record is added as a new log template of the first log record group.
  • For example, the log record is directly used as the new log template of the first log record group; or, designated characters in the log record are replaced with a fixed character to obtain the new log template. The designated character may be a number, and the fixed character may be a number or another symbol, for example, 1, 2 or *.
  • In the foregoing step 303, the analysis device usually determines the at least one first log record group by traversing each line of log records in the log. The aforementioned process of extracting the log templates of each first log record group can then be executed after all log records have been grouped, or executed in real time during the grouping process. Extracting the log templates of each first log record group in real time during grouping can reduce the time delay of template extraction and improve the overall timeliness of the template extraction process.
  • When the log templates are extracted in real time during the grouping process, the log template of each first log record group should be extracted separately. For any first log record group, the process of extracting its log template may include: after a line of log records is received, comparing the received log record with the historical log template; when the received log record matches the historical log template, determining the new log template of the first log record group based on the received log record and the historical log template; when the received log record does not match the historical log template, adding the log template extracted from the received log record as the new log template of the first log record group.
  • Assuming that the grouping result is the one shown in Figure 14: for the first log record group 0, after the line-0 log record is received, the historical log template of the first log record group 0 is empty, so the line-0 log record does not match any historical log template; the log template of the line-0 log record is extracted and added as the new log template of the first log record group: "mod_jk child workerEnv in error state*". After the line-4 log record is received, the received log record is compared with the historical log template "mod_jk child workerEnv in error state*"; since the received line-4 log record matches the historical log template, the historical log template "mod_jk child workerEnv in error state*" of the first log record group 0 is used as the new log template.
  • The template extraction method of the first log record group 1 and the first log record group 2 is the same as that of the first log record group 0.
  • The finally extracted template of each first log record group is shown in FIG. and will not be described in detail again.
  • In the foregoing, the historical log template refers to the log template that already exists at the current moment, and the new log template refers to the log template newly generated at the current moment.
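The real-time extraction loop of steps C11 to C13 can be sketched as follows (illustrative; `group_key`, `matches`, `merge`, and `extract` stand in for the grouping, comparison, and template-derivation operations described above, and overwriting the matched historical template keeps each group free of duplicates; the toy thresholds are hypothetical):

```python
from collections import defaultdict

templates = defaultdict(list)   # group key -> log templates of that group

def on_log_record(line, group_key, matches, merge, extract=lambda s: s):
    """Process one received log record for its first log record group."""
    key = group_key(line)
    for i, hist in enumerate(templates[key]):
        if matches(line, hist):
            templates[key][i] = merge(line, hist)  # step C12: derive new template
            return
    templates[key].append(extract(line))           # step C13: add new template

# toy helpers loosely mirroring the Figure 14 example
key_fn = lambda s: len(s.split())
match_fn = lambda l, t: sum(a == b or b == "*" for a, b in
                            zip(l.split(), t.split())) / len(t.split()) > 0.5
merge_fn = lambda l, t: " ".join(b if a == b else "*"
                                 for a, b in zip(l.split(), t.split()))

on_log_record("mod_jk child workerEnv in error state 6", key_fn, match_fn, merge_fn)
on_log_record("mod_jk child workerEnv in error state 7", key_fn, match_fn, merge_fn)
print(templates[7])   # ['mod_jk child workerEnv in error state *']
```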
  • Step C2 Determine the log template of the log based on the log template of each first log recording group.
  • the process of determining the log template of the log can be implemented in the following three ways:
  • clustering is performed on the log templates of at least one first log record group to obtain log templates of the logs.
  • Clustering processing is essentially a grouping method, used to group similar processing objects into one category, and dissimilar processing objects into different categories.
  • After the log template of each first log record group is obtained, the analysis device can classify the one or more log templates through clustering processing. In particular, when multiple log templates are obtained, the log templates of different first log record groups may be similar; through clustering, similar log templates can be divided into one type of log template, so that the one or more types of log templates obtained by the division can be used as the log templates of the log.
  • the one or more types of log templates may be presented to the user, so that the user can intuitively see that there are several types of log templates in the log.
  • the clustering processing may be hierarchical clustering, and the processing process can refer to the foregoing hierarchical clustering process, which is not described in detail in this embodiment of the application.
  • the log templates obtained through hierarchical clustering have hierarchical relationships, and users can adjust the accuracy (also called granularity) of clustering to obtain different clustering results.
  • In the second way, the log templates of the at least one first log record group are merged to obtain the log templates of the log.
  • Merging processing refers to the process of integrating the same or similar processing objects into one object, and its processing effect is similar to the effect of de-duplication processing.
  • The log templates of the at least one first log record group are merged as follows: when the log templates of the at least one first log record group include at least two log templates, for every two log templates, check whether the similarity of the constant parts of the two log templates is 1; when the similarity of the constant parts is 1, replace the variable part of one of the two log templates with one variable identifier and delete the other log template (equivalent to keeping the constant part of either log template and inserting the variable identifier at the positions of the original variable parts between the constant parts).
  • variable identifier can be a wildcard "*".
  • The similarity of the constant parts of the two log templates can be determined by calculating the distance between the constant parts of the two log templates; for example, the similarity can be calculated using the Jaccard similarity (also known as the Jaccard coefficient) algorithm. It should be noted that when the similarity of the constant parts of the two log templates is not 1, the two log templates are not processed.
  • For example, the two templates are: "User * * has logged in" and "User * * * has logged in". The constant part of both contains four entries: {User, has, logged, in}, so the similarity of the constant parts of the two is 1. Therefore, the variable part "* *" of "User * * has logged in" can be replaced with "*" and the other template "User * * * has logged in" deleted, obtaining the merged log template: "User * has logged in".
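The constant-part similarity check and merge can be sketched as follows (an illustrative sketch; Jaccard similarity over the constant entries, with a run of variable identifiers collapsed to a single "*"; the function name is hypothetical):

```python
import re

def try_merge(t1, t2):
    """Merge two templates whose constant parts are identical (similarity 1);
    return None otherwise, leaving both templates untouched."""
    c1 = {e for e in t1.split() if e != "*"}
    c2 = {e for e in t2.split() if e != "*"}
    similarity = len(c1 & c2) / len(c1 | c2)     # Jaccard similarity
    if similarity == 1:
        # keep one template, collapsing its run of "*" to a single "*"
        return re.sub(r"\*( \*)+", "*", t1)
    return None

print(try_merge("User * * has logged in", "User * * * has logged in"))
# User * has logged in
print(try_merge("User * has logged in", "User * registered successfully"))
# None
```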
  • In the third way, the log templates of the at least one first log record group are used as the log templates of the log.
  • the analysis device may support one or more of the foregoing three methods.
  • When multiple methods are supported, the terminal may present the trigger buttons (or icons) of all the multiple methods on the user interface, present the trigger buttons of the multiple methods in a scrolling manner, or present only the trigger button of the more frequently used method among the multiple methods (the trigger buttons of the other methods can be displayed after the user triggers another button, which may be a drop-down button), etc., which is not limited in the embodiment of the present application.
  • When the user wants to view the log template of the log obtained in a certain method, the user clicks, or otherwise triggers, the trigger button corresponding to that method.
  • the terminal receives the user's selection instruction, and the selection instruction carries the certain method.
  • the terminal sends the selection instruction to the analysis device, and the analysis device obtains the log template of the log in a corresponding manner based on the acquired selection instruction, and presents the log template to the user on the user interface by the terminal.
  • When the log template is presented in the aforementioned first way, the log template can be presented in a multi-level file directory structure or a tree structure (such as a binary tree); when the log template is presented in the aforementioned second or third way, if there are multiple log templates, the multiple log templates can be presented in a list.
  • In the first optional processing method, the log template of the log is determined by extracting the log template of each first log record group, without directly using the log records in the first log record groups to compute the log template of the log, so that the solution space decreases exponentially, which effectively improves the computational efficiency.
  • the second optional processing method is to obtain the target log records in each first log record group separately, and determine the log template of the log based on the obtained target log records. This process includes:
  • Step D1 Obtain the target log record of each first log record group.
  • the target log record of the first log record group is a part of log records in the first log record group, for example, is a line of log records in the first log record group. Since a first log record group contains log records with the same local sensitive hash code, that is, it contains the same or similar log records, the target log record can be selected to represent the log records in the first log record group.
  • the processing of the target log record is equivalent to the processing of all the log records in the first log record group, but it effectively reduces the amount of data actually processed, which is equivalent to performing data sampling, reducing the solution space, and further Reduce the computational cost.
  • the target log record of the first log record group may be a row of log records randomly selected in the first log record group. In this way, it can be ensured that the probability of each log record in the first log record group being selected as the target log record is equal. It is worth noting that the target log records may also be filtered in the first log record group according to other preset conditions. For example, select the first log record in the first log record group, or select the latest log record (for example, the latest time stamp) in the first log record group.
  • Before step D1, the analysis device may also detect the number of log records in each first log record group. When a first log record group includes multiple lines of log records, step D1 is executed; for example, a part of the log records of the first log record group is filtered out as the target log record. When a first log record group includes only one line of log records, step D1 is executed by using that one line of log records of the first log record group as the target log record.
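Step D1's selection strategies can be sketched as follows (an illustrative sketch; the function name is hypothetical, and the `latest` strategy assumes each record carries a timestamp):

```python
import random

def target_record(group, strategy="random"):
    """Select the target log record representing a first log record group."""
    if len(group) == 1:
        return group[0]                        # single-line group: that line is the target
    if strategy == "random":
        return random.choice(group)            # every record equally likely to be chosen
    if strategy == "first":
        return group[0]                        # first record of the group
    if strategy == "latest":
        return max(group, key=lambda r: r[0])  # assumes (timestamp, text) tuples
    raise ValueError(f"unknown strategy: {strategy}")
```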
  • Step D2 Determine a log template of the log based on the target log record of each first log record group.
  • the process of determining the log template of the log may include:
  • Step D21 Determine at least one second log record group.
  • Different second log record groups include different target log records among the target log records obtained from the first log record groups, and all target log records included in each second log record group have the same target characteristics.
  • In implementation, the analysis device may divide the acquired target log records that have the same target characteristics into the same second log record group to obtain the at least one second log record group. This can further reduce the solution space.
  • the target characteristic of the log record includes at least one of the length of the log record, the first character of the log record, and the first word of the log record. It is worth noting that the target feature in step D21 and the target feature in step 303 may be the same or different.
  • one or more sets of second log record groups can be obtained, so that the subsequent process of processing each second log record group separately can be performed.
  • the subsequent processing of each second log record group (such as step D22) can be executed in parallel, thereby reducing the operation delay, and the amount of data required to calculate each time the processing is executed is much smaller than the data of the entire log It can effectively reduce the calculation cost and improve the calculation efficiency at the same time.
  • Before step D21, the analysis device can also detect the number of first log record groups. When there are multiple first log record groups, step D21 is executed; when there is only one first log record group, step D21 may not be executed.
  • Step D22 Obtain a log template of the log by separately processing each second log record group.
  • the process of obtaining the log template of the log may include:
  • Step D221 Perform clustering processing on the log records in each second log record group to obtain at least one type of log record corresponding to each second log record group.
  • The log records in each second log record group come from different first log record groups; therefore, some log records may still be similar to one another.
  • By performing clustering processing (such as hierarchical clustering) on the log records in each second log record group, similar log records can be divided into one category, so that in the subsequent process, template extraction can be performed separately for each type of log record obtained by the clustering processing.
  • The processing of each type of log record can be executed in parallel, thereby reducing the calculation delay; moreover, the amount of data required for each execution of the processing is small, which effectively reduces the computational cost and improves the computational efficiency.
  • Before step D221, the analysis device can also detect the number of second log record groups. When there are multiple second log record groups, steps D221 to D223 are performed; when there is only one second log record group, steps D221 to D223 may not be performed, and the log template of that second log record group can be directly used as the log template of the log.
  • Step D222 Perform template extraction on each type of log record obtained by the clustering process to obtain a log template of each type of log record.
  • step D222 may refer to the process of step C1, that is, a type of log record is equivalent to the foregoing first log record group, which is not described in detail in the embodiment of the present application.
  • Step D223 Determine the log template of the log based on the log template of each type of log record in the at least one type of record.
  • the process of determining the log template of the log can be implemented in the following two ways:
  • the first way is to use the log template of each type of log record as the log template of the log.
  • a type of log record is equivalent to the foregoing first log record group, which is not described in detail in the embodiment of the present application.
  • the second method is to merge the log templates of various log records obtained by the clustering process, and use the merged log template as the log template of the log.
  • a type of log record is equivalent to the foregoing first log record group, which is not described in detail in the embodiment of the present application.
  • the second optional processing method is explained with the following example.
  • the third log is shown on the left side of Fig. 15, and the word segmentation result shown on the right side of Fig. 15 is obtained after step A1 in the aforementioned step 302 .
  • the local sensitive hash code as shown in FIG. 16 is obtained.
  • Assuming that the analysis device determines the first log record groups based on the locality-sensitive hash code and the length of the log record (that is, the target feature is the length of the log record), the relationship among the log record, the length of the log record, and the locality-sensitive hash code is shown in Table 1.
  • The grouping result determined by the analysis device based on the locality-sensitive hash code and the length of the log record is shown in Figure 17: the log records in lines 1 and 6 are divided into one first log record group, the log records in lines 7 and 9 are divided into another first log record group, and each remaining line of log records is divided into a first log record group of its own.
  • Assuming that the analysis device determines the second log record groups based on the length of the log record (that is, the target feature is the length of the log record), a total of four second log record groups, as shown in FIG. 19, are finally obtained.
  • After the log records in each second log record group are clustered hierarchically in step D221 and templates are extracted in step D222, five log templates, as shown on the right side of FIG. 20, are obtained. The finally obtained log template of the log includes these five log templates.
  • Step 305 The analysis device performs abnormality detection on the log based on the log template of the log.
  • the analysis device performs feature extraction on the log based on the log template of the log; and performs anomaly detection based on the feature of the extracted log.
  • the characteristics of the log refer to the characteristics of the log records contained in the log. For example, it may include: the number of appearances of the log template, the frequency of appearance of the log template, and/or the appearance period of the log template.
  • The number of occurrences of a log template refers to the number of log records corresponding to the log template in the log; the occurrence frequency of a log template refers to the ratio of the number of log records corresponding to the log template to the total number of log records contained in the log; the appearance period of a log template refers to the time period to which the occurrence time or collection time of the log records corresponding to the log template belongs.
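Feature extraction over the log can be sketched as follows (an illustrative sketch; `match_template` stands in for the template matching described above and returns a record's template, or None when no template matches):

```python
from collections import Counter

def log_features(records, match_template):
    """Occurrence count and frequency of each log template over a log."""
    counts = Counter()
    for record in records:
        template = match_template(record)
        if template is not None:
            counts[template] += 1
    total = len(records)
    return {t: {"count": c, "frequency": c / total} for t, c in counts.items()}

log = ["login ok", "login ok", "disk full"]
feats = log_features(log, lambda r: r)   # trivial matcher: record is its own template
print(feats["login ok"])                 # count 2, frequency 2/3
```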
  • Taking the feature of the number of occurrences of the first log template as an example, the analysis device can divide the log into multiple time windows, detect the log records matching the first log template among the lines of log records included in each time window, and count the feature of the log to be determined in that time window, such as the number of occurrences of the first log template.
  • The analysis device then compares the features of the log across the multiple time windows, and determines a time window whose feature differs from those of the other time windows by more than a specified gap threshold as an abnormal time window; the log records of the first log template in the abnormal time window are the log records in which an abnormality occurs.
  • the aforementioned multiple time windows may be time windows of fixed size and non-overlapping each other, or time windows determined by a sliding window algorithm.
  • For example, when the analysis device finds a time window in which the number of occurrences of the first log template is significantly higher (for example, the difference from the number of occurrences in other time windows, or from the mean number of occurrences of the first log template across all time windows, is greater than a specified difference threshold), the analysis device can locate a hot event and issue an alarm. When the analysis device finds a time window in which the number of occurrences of the first log template is significantly lower (for example, the difference from the mean number of occurrences of the first log template in other time windows or all time windows is negative and its absolute value is greater than the specified difference threshold), the analysis device can mark a cold event and issue a warning message.
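The hot/cold-event check against the mean across time windows can be sketched as follows (an illustrative sketch; the function name and the threshold value are hypothetical):

```python
def detect_anomalous_windows(counts_per_window, diff_threshold):
    """Flag time windows whose first-log-template occurrence count deviates
    from the mean across all windows by more than the threshold."""
    mean = sum(counts_per_window) / len(counts_per_window)
    hot, cold = [], []
    for i, count in enumerate(counts_per_window):
        if count - mean > diff_threshold:
            hot.append(i)       # significantly higher: locate hot event, alarm
        elif mean - count > diff_threshold:
            cold.append(i)      # significantly lower: mark cold event, warn
    return hot, cold

print(detect_anomalous_windows([10, 11, 9, 40, 10, 0], diff_threshold=12))
# ([3], [5]): window 3 is a hot event, window 5 a cold event
```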
  • the terminal can display the log template of the log determined by the analysis device, and the user can specify the target log template.
  • The terminal receives the template selection instruction and sends the template selection instruction carrying the identification of the target log template to the analysis device; the analysis device then performs abnormality detection on the target log template, and the detection process can refer to the abnormality detection process of the first log template described above. In this way, the analysis device can perform abnormality detection on a specific log template according to the user's instruction, which improves the pertinence of abnormality detection and ensures user experience.
  • In another optional manner, the analysis device detects unknown events based on the log templates of the log.
  • For example, the analysis device can use each log template to match the log records in the log; when there is a log record that matches none of the log templates, it is determined that the log record is an unknown log record, and the event corresponding to the unknown log record is an unknown event, which may be an abnormal event.
  • the analysis device may also detect anomalies in the log in other ways, which is not limited in the embodiment of the present application.
  • It should be noted that, in the embodiment of the present application, the log records in the log are grouped using the locality-sensitive hash code; therefore, the distribution of the log records follows the hash distribution rule, that is, the key-value distribution rule, so that load balancing can be achieved.
  • Hash distribution is a data distribution method based on a hash function, which may also be called a hashing function.
  • Among them, the key (also known as the key value) is the input of the hash function, the value (also known as the hash value) is the output of the hash function, and f is the hash function.
  • the hash bucket algorithm is a special hash algorithm that can resolve hash conflicts.
  • a hash bucket is a container for placing a linked list of different keys (also called a hash table).
  • the hash bucket is also called an f (key) set or a value set.
  • The values corresponding to the same hash bucket are the same. Referring to the foregoing example, the number of hash buckets can be set to the value of the modulus, that is, 5. The multiple values correspond one-to-one to the multiple hash buckets, and each value can be used as the index or number of its hash bucket. Each hash bucket stores the keys with the same value.
  • The conflicting keys in the same hash bucket are stored in a singly linked list, thereby resolving hash conflicts.
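The modulo-5 hash-bucket scheme above can be sketched as follows (an illustrative sketch; a Python list plays the role of the linked-list chain):

```python
class HashBuckets:
    """Hash buckets with chaining: value = key % modulus indexes the bucket,
    and conflicting keys sharing one value are chained together."""
    def __init__(self, modulus=5):
        self.modulus = modulus                      # number of buckets = modulus
        self.buckets = [[] for _ in range(modulus)]

    def put(self, key):
        self.buckets[key % self.modulus].append(key)

hb = HashBuckets()
for key in (3, 8, 13, 4):
    hb.put(key)
print(hb.buckets[3])   # [3, 8, 13]: keys with the same value share a bucket
```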
  • In addition, the hash function may also be a remainder (modulo) function, in which case the number of hash buckets is the value of the modulus, or another function, which is not limited in the embodiment of the present application.
  • each of the aforementioned first log record groups can be identified by a hash bucket.
  • Each second log record group can also be identified by a hash bucket.
  • In practice, each hash bucket has a bucket identifier, which can be determined by the corresponding grouping method. For example, in step 303, if only the locality-sensitive hash code is used for grouping, the bucket identifier satisfies the following first formula: Id = f(lsh); where Id represents the bucket identifier, lsh represents the locality-sensitive hash code, and f is a preset function.
  • If the locality-sensitive hash code and the target feature are both used for grouping, the bucket identifier satisfies the following second formula: Id = f(x1, x2, ..., xm, lsh); where Id represents the bucket identifier, lsh represents the locality-sensitive hash code, f is a preset function, x1, x2, ..., xm respectively represent the m features included in the target feature, and m is the total number of features included in the target feature.
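The two bucket-identifier formulas can be sketched with a stand-in preset function f (a tuple hash here; the actual f is not specified by the text, so this choice is an assumption):

```python
def bucket_id(lsh, *target_features):
    """Id = f(lsh) when grouping by the locality-sensitive hash code alone;
    Id = f(x1, ..., xm, lsh) when m target features are also used.
    A plain tuple hash stands in for the preset function f."""
    return hash((*target_features, lsh))

id_lsh_only = bucket_id("1a2b3c")                  # first formula: Id = f(lsh)
id_with_feats = bucket_id("1a2b3c", 7, "mod_jk")   # second formula: Id = f(x1, x2, lsh)
```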
  • For the implementation of step 305, any method that can be easily conceived by a person skilled in the art within the technical scope disclosed in this application shall be covered by the protection scope of this application, and is therefore not repeated here.
  • In summary, in the log template extraction method provided by the embodiment of the present application, the log records are grouped by the locality-sensitive hash code of each log record; since the locality-sensitive hash code can reflect the similarity of the log records of different lines, the grouping achieves the same effect as clustering processing, thereby effectively reducing the computational complexity.
  • Moreover, the locality-sensitive hash code and the target feature are characteristics of the log record itself; when grouping based on the locality-sensitive hash code and target feature of each log record, no other log records need to be considered, so the log records in the log are decorrelated from one another in the grouping process. In this way, for a log, the grouping of its multiple lines of log records can be executed in parallel, which effectively reduces the operation delay and improves the operation efficiency.
  • Further, the processing of each first log record group can be executed in parallel, thereby reducing the operation delay; and the amount of data involved in each execution of the processing is much smaller than the data of the entire log, which effectively reduces the calculation cost while improving the calculation efficiency.
  • In addition, the first optional processing method and the second optional processing method filter out most of the log records, so that the solution space decreases exponentially, effectively reducing the computational cost and improving the computational efficiency.
  • When extracting the log templates of a heterogeneous log, the traditional template extraction method, in an ideal state, takes about 5 seconds to completely extract the log templates of a log containing about 50,000 log records; with the log template extraction method provided by the embodiment of the present application, in an ideal state, it takes about 1 second, which effectively reduces the calculation delay, improves the calculation performance compared with the traditional method, and improves the user experience.
  • An embodiment of the present application provides a log template extraction device 40. As shown in FIG. 21, the device includes:
  • the first determining module 401 is used to determine the locally sensitive hash code of each log record in the multi-line log record of the log;
  • the second determining module 402 is configured to determine at least one first log record group, where different first log record groups include different rows of log records in the log; and all log records included in each first log record group Have the same locally sensitive hash code;
  • the processing module 403 is configured to obtain a log template of the log by processing each first log record group in the at least one first log record group.
  • In the device, the second determining module groups the log records by the locality-sensitive hash code of each log record; since the locality-sensitive hash code can reflect the similarity of the log records of different lines, this grouping achieves the same effect as clustering processing, thereby effectively reducing the computational complexity.
  • the first determining module 401 includes:
• the obtaining submodule 4011 is used to obtain at least one term of each log record in the log;
• the first determining sub-module 4012 is configured to determine the locality-sensitive hash code of each log record based on the at least one term of each log record.
• each term includes m semantic units, where m is an integer greater than 1, and a semantic unit is a word or symbol.
• for a log record including at least two terms, in every two adjacent terms the last m-1 semantic units of the first term are the same as the first m-1 semantic units of the second term, and the first term is the term preceding the second term.
  • the first determining module 401 is configured to:
  • the first determining submodule 4012 is configured to:
• determine the locality-sensitive hash code of a first log record based on multiple terms of the first log record and a weight assigned to each term, where the first log record is any log record among the multiple lines of log records of the log that includes multiple terms, and the weights of at least two terms included in the first log record are different from each other. For example, the weight of each term is determined based on the position of the term in the first log record.
  • the second determining module 402 is configured to:
  • the processing module 403 includes:
  • the extraction sub-module 4031 is used to extract the log templates in each of the first log record groups respectively;
  • the second determining submodule 4032 is configured to determine the log template of the log based on the log template of each of the first log record groups.
  • the extraction submodule 4031 is used to:
  • the second determining submodule 4032 is configured to:
  • processing module 403 is configured to:
• the third determining sub-module is configured to determine the log template of the log based on the target log records of each first log record group, where the target log records of a first log record group are part of the log records in the first log record group.
  • the target log record of the first log record group is a row of log records randomly selected in the first log record group.
  • the third determining submodule is configured to:
• determine at least one second log record group, where different second log record groups include different target log records among the target log records corresponding to the at least one first log record group; all target log records included in each second log record group have the same target feature; and obtain the log template of the log by separately processing each second log record group.
  • the third determining submodule is configured to:
  • the third determining submodule is configured to:
• use the log template of each class of log records as the log template of the log; or merge the log templates of the classes of log records obtained by the clustering process and use the merged log template as the log template of the log.
  • the target feature of the log record includes at least one of the length of the log record, the first character of the log record, and the first word of the log record.
  • FIG. 24 schematically provides a possible basic hardware architecture of the computing device described in this application.
  • the computing device may be a server.
  • the computing device 500 includes a processor 501, a memory 502, a communication interface 503, and a bus 504.
  • the number of processors 501 may be one or more, and FIG. 24 only illustrates one of the processors 501.
  • the processor 501 may be a central processing unit (CPU). If the computing device 500 has multiple processors 501, the types of the multiple processors 501 may be different or may be the same. Optionally, multiple processors 501 of the computing device 500 may also be integrated into a multi-core processor.
  • the memory 502 stores computer instructions and data; the memory 502 can store computer instructions and data required to implement the log template extraction method provided by the present application.
  • the memory 502 stores instructions for implementing the steps of the log template extraction method.
• the memory 502 may be any one or any combination of the following storage media: non-volatile memory (for example, read-only memory (ROM), solid state drive (SSD), hard disk drive (HDD), or optical disc) and volatile memory.
  • the communication interface 503 may be any one or any combination of the following devices: a network interface (for example, an Ethernet interface), a wireless network card, and other devices with a network access function.
  • the communication interface 503 is used for data communication between the computing device 500 and other computing devices or terminals.
  • the bus 504 can connect the processor 501 with the memory 502 and the communication interface 503. In this way, through the bus 504, the processor 501 can access the memory 502, and can also use the communication interface 503 to interact with other computing devices or terminals.
  • the computing device 500 executes the computer instructions in the memory 502, so that the computing device 500 implements the log template extraction method provided in this application, or causes the computing device 500 to deploy a log template extraction device.
• a non-transitory computer-readable storage medium including instructions is provided, such as a memory including instructions, which can be executed by a processor of a server to complete the log template extraction method shown in each embodiment of the present application.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • An embodiment of the present application provides an analysis system, including: a terminal and an analysis device, and the analysis device includes any one of the aforementioned log template extraction devices.
• the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
• when implemented by software, they may be implemented in whole or in part in the form of a computer program product, which includes one or more computer instructions.
  • the computer may be a general-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
• the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center through wired (such as coaxial cable, optical fiber, or digital subscriber line) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
• the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium, or a semiconductor medium (for example, a solid state drive).
• when the log template extraction device provided in the above embodiment extracts a log template, the division into the above-mentioned functional modules is used only as an example for illustration; in practical applications, the above-mentioned functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the log template extraction device provided in the foregoing embodiment and the log template extraction method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.

Abstract

A log template extraction method and apparatus, belonging to the field of computer technology. The method includes: determining a locality-sensitive hash code of each line of log record among multiple lines of log records of a log; determining at least one first log record group, where different first log record groups include different lines of log records in the log, and all log records included in each first log record group have the same locality-sensitive hash code; and obtaining a log template of the log by processing each first log record group in the at least one first log record group. The method solves the problem that current log template extraction methods incur a high computational cost, and is applied to log template extraction for logs.

Description

Log template extraction method and apparatus

This application claims priority to Chinese Patent Application No. 201910969835.0, filed on October 12, 2019 and entitled "Method, apparatus, server and storage medium for log pattern extraction", the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of computer technology, and in particular to a log template extraction method and apparatus.

Background

By adding specific pseudocode to software source code, the real-time state of software operation can be recorded in text, which is called logs. Software developers (or operations and maintenance staff) can read the logs to grasp the real-time situation of software operation.

A log includes multiple lines of log records (also called log statements), and each line of log record records one event during software operation. The log records in a log usually have an implicit log template (schema), that is, the pattern or format of the record itself. Based on differences in their log templates, logs can be divided into two categories: homogeneous logs (Homologous logs) and heterogeneous logs (Heterogeneous logs). A homogeneous log is one in which all lines of log records share the same log template; a heterogeneous log is one in which the lines of log records have no unified log template. Identifying the log template of a log enables functions such as fast retrieval of key data in the log.

Currently, for a heterogeneous log, its log template is extracted as follows: each line of log record is tokenized (tokenization) to obtain multiple terms (tokens); based on the tokenization result of each line, hierarchical clustering (Hierarchical Clustering) is performed on the log records in the log to obtain multiple classes of log records; template extraction is performed on each class of log records, and the log templates of the resulting classes are taken as the log templates of the heterogeneous log.

However, clustering (Clustering) requires many computations to obtain the multiple classes of log records, which incurs a high computational cost.
Summary

Embodiments of this application provide a log template extraction method and apparatus, which can solve the problem that current log template extraction methods incur a high computational cost. The technical solution is as follows:

In a first aspect, a log template extraction method is provided, the method including:

determining a locality-sensitive hash code of each line of log record among multiple lines of log records of a log; determining at least one first log record group, where different first log record groups include different lines of log records in the log, and all log records included in each first log record group have the same locality-sensitive hash code; and obtaining a log template of the log by processing each first log record group in the at least one first log record group.

Log records are grouped by the locality-sensitive hash code of each line, and the locality-sensitive hash code reflects the similarity of the corresponding lines of log records, so this grouping achieves the same effect as clustering while effectively reducing computational complexity.

Moreover, in the embodiments of this application, the locality-sensitive hash code is a feature of the log record itself; when obtaining the locality-sensitive hash code of each line, other lines need not be considered. This decorrelates the lines of the log during grouping. Thus, for one log, the grouping of its multiple lines of log records can be performed in parallel, effectively reducing computation latency and improving computational efficiency.

When there are multiple first log record groups, the log template of the log can be obtained by processing each first log record group separately. The processing of the individual first log record groups can then be performed in parallel, reducing computation latency; the amount of data to be processed in each run is far smaller than the data volume of the entire log, effectively reducing computational cost while improving efficiency.

In one possible implementation, determining the locality-sensitive hash code of each line of log record among the multiple lines of log records of the log includes:

obtaining at least one term of each line of log record in the log; and determining the locality-sensitive hash code of each line based on its at least one term.

The purpose of tokenization is to cut each line of log record into a sequence of terms; tokenization reduces the processing complexity of log records, lowers the cost of the subsequent locality-sensitive hash computation, and improves efficiency. In the embodiments of this application, tokenization can be performed in different ways, for example by whitespace, by special characters, or by natural-language tokenization.

In one option, each term obtained by tokenization includes only one semantic unit. This approach is simple, easy to implement, and fast.

In another option, each term obtained by tokenization includes multiple semantic units. That is, each term includes m semantic units, where m is an integer greater than 1 and a semantic unit is a word or symbol; for a log record including at least two terms, in every two adjacent terms the last m-1 semantic units of the first term are the same as the first m-1 semantic units of the second term, the first term being the term preceding the second term.

Terms obtained in this way can reduce undesired hash collisions.
In one possible implementation, determining the locality-sensitive hash code of each line of log record among the multiple lines of log records of the log includes: replacing p specified characters in each line of log record with q fixed characters to obtain an updated line, where 1 ≤ q < p; and determining the locality-sensitive hash code of each line based on the updated line.

For any log record, since multiple characters are replaced by one fixed character, the number of characters in the record is reduced, effectively lowering the computational complexity of the subsequent locality-sensitive hash code.

In one possible implementation, determining the locality-sensitive hash code of each line based on its at least one term includes:

determining the locality-sensitive hash code of a first log record based on multiple terms of the first log record and a weight assigned to each term, where the first log record is any log record among the multiple lines of log records of the log that includes multiple terms, and the weights of at least two terms included in the first log record differ from each other. For example, the weight of each term is determined based on the position of the term in the first log record.

By setting different weights for different terms in each line of log record, the two kinds of hash collisions described above are reduced. In the target locality-sensitive hash algorithm provided by the embodiments of this application, the weight of a term of a line of log record can be related to its own attributes. Terms in the constant part usually outnumber terms in the variable part, and the leading terms of a log record usually belong to the constant part. Therefore, the weights of the first g terms of a log record can be set larger than the weights of the other terms, where g < k, g is a positive integer, and k is the length of the log record. For example, the weights of the first g terms decrease, while the other weights are equal and smaller than the minimum weight of the first g terms; g may be 1. Associating the weight of a term with its position attribute allows the locality-sensitive hash code to be computed more accurately, further reducing undesired hash collisions.

In one possible implementation, determining the at least one first log record group includes: grouping the multiple lines of log records of the log based on the locality-sensitive hash code of each line and a target feature of each line, to obtain the at least one first log record group.

In this way, a new grouping feature is added on top of the locality-sensitive hash code, which can improve grouping precision and ensure higher similarity among the log records placed in the same first log record group.
In one possible implementation, obtaining the log template of the log by processing each first log record group in the at least one first log record group includes: separately extracting the log template in each first log record group; and determining the log template of the log based on the log template of each first log record group.

In this way, the log template of the log is determined by extracting the template of each first log record group, without directly using the log records of a first log record group in the computation of the log's template, so that the solution space decreases exponentially and computational efficiency is effectively improved.

In one possible implementation, separately extracting the log template in each first log record group includes:

for each line of log record in each first log record group, comparing the log record with a historical log template of the first log record group; when the log record matches the historical log template, determining a new log template of the first log record group based on the log record and the historical log template; and when the log record does not match the historical log template, adding a log template extracted from the log record as a new log template of the first log record group.

In one possible implementation, determining the log template of the log based on the log template of each first log record group includes: clustering the log templates of the at least one first log record group to obtain the log template of the log; or merging the log templates of the at least one first log record group to obtain the log template of the log; or taking the log templates of the at least one first log record group as the log template of the log.

In one possible implementation, obtaining the log template of the log by processing each first log record group includes: determining the log template of the log based on the target log records of each first log record group, where the target log records of a first log record group are part of the log records in that group.

Processing the target log records is equivalent to processing all the log records of the first log record group, but it effectively reduces the amount of data actually processed; it amounts to data sampling, shrinking the solution space and thus further reducing computational cost.

In one possible implementation, the target log record of a first log record group is a line of log record randomly selected from the first log record group. This ensures that every line of log record in the group has an equal probability of being selected as the target log record.

In one possible implementation, determining the log template of the log based on the target log records of each first log record group includes: determining at least one second log record group, where different second log record groups include different target log records among the target log records corresponding to the at least one first log record group, and all target log records included in each second log record group have the same target feature; and obtaining the log template of the log by separately processing each second log record group. Grouping can further shrink the solution space.

When there are multiple second log record groups, the processing of the individual second log record groups can be performed in parallel, reducing computation latency; the amount of data to be processed in each run is far smaller than the data volume of the entire log, effectively reducing computational cost while improving efficiency.

In one possible implementation, obtaining the log template of the log by separately processing each second log record group includes: clustering the log records in each second log record group to obtain at least one class of log records corresponding to each second log record group; performing template extraction on each class of log records obtained by the clustering to obtain the log template of each class of log records; and determining the log template of the log based on the log template of each class of log records in the at least one class.

When a second log record group contains multiple classes of log records, the processing of the classes can be performed in parallel, reducing computation latency; the amount of data to be processed in each run is small, effectively reducing computational cost while improving efficiency.

In one possible implementation, determining the log template of the log based on the log template of each class of log records includes: taking the log template of each class of log records as the log template of the log; or merging the log templates of the classes obtained by clustering and taking the merged log template as the log template of the log.

In one possible implementation, the target feature of a log record includes at least one of: the length of the log record, the first character of the log record, and the first word of the log record.
In a second aspect, a log template extraction apparatus is provided. The apparatus may include at least one module, and the at least one module may be used to implement the log template extraction method provided by the first aspect or the various possible implementations of the first aspect.

In a third aspect, this application provides a computer device including a processor and a memory. The memory stores computer instructions; when the processor executes the computer instructions stored in the memory, the computer device performs the method provided by the first aspect or its various possible implementations, so that the computer device deploys the log template extraction apparatus provided by the second aspect or its various possible implementations.

In a fourth aspect, this application provides a computer-readable storage medium storing computer instructions that instruct a computer device to perform the method provided by the first aspect or its various possible implementations, or instruct the computer device to deploy the log template extraction apparatus provided by the second aspect or its various possible implementations.

In a fifth aspect, this application provides a computer program product including computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the method provided by the first aspect or its various possible implementations, so that the computer device deploys the log template extraction apparatus provided by the second aspect or its various possible implementations.

In a sixth aspect, an analysis system is provided, including a terminal and an analysis device, where the analysis device includes the log template extraction apparatus described in the second aspect or its various possible implementations, or the computer device described in the third aspect.

In a seventh aspect, a chip is provided. The chip may include a programmable logic circuit and/or program instructions, and when running, the chip is used to implement the template extraction method of any implementation of the first aspect.

In the embodiments of this application, log records are grouped by the locality-sensitive hash code of each line, and the locality-sensitive hash code reflects the similarity of the corresponding lines of log records, so this grouping achieves the same effect as clustering while effectively reducing computational complexity. Moreover, the locality-sensitive hash code and the target feature are features of the log record itself; when obtaining them for each line, other lines need not be considered, which decorrelates the lines of the log during grouping. Thus, for one log, the grouping of its multiple lines of log records can be performed in parallel, effectively reducing computation latency and improving efficiency. When there are multiple first log record groups, the processing of the individual groups can be performed in parallel, reducing computation latency; the amount of data to be processed in each run is far smaller than the data volume of the entire log, effectively reducing computational cost while improving efficiency.

The log template extraction methods provided by the embodiments of this application all filter out most of the log records, so that the solution space decreases exponentially, effectively reducing computational cost while improving efficiency.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of part of the log content in a log provided by an embodiment of this application;

FIG. 2 is a schematic diagram of an application environment involved in a log template extraction method provided by an embodiment of this application;

FIG. 3 is a schematic diagram of an application environment involved in another log template extraction method provided by an embodiment of this application;

FIG. 4 is a schematic diagram of a locality-sensitive hash algorithm provided by an embodiment of this application;

FIG. 5 is a schematic flowchart of a log template extraction method provided by an embodiment of this application;

FIG. 6 is a schematic diagram of a tokenization result involved in a log template extraction method provided by an embodiment of this application;

FIG. 7 is a schematic diagram of a tokenization result involved in another log template extraction method provided by an embodiment of this application;

FIG. 8 is a schematic diagram of a tokenization result involved in yet another log template extraction method provided by an embodiment of this application;

FIG. 9 is a schematic diagram of the process of obtaining the locality-sensitive hash codes of log records X3 and X4 provided by an embodiment of this application;

FIG. 10 is a schematic diagram of the computation of the locality-sensitive hash codes of log records X3 and X4 shown in FIG. 9;

FIG. 11 is a schematic diagram of one tokenization result of log records X7 and X8 provided by an embodiment of this application;

FIG. 12 is a schematic diagram of another tokenization result of log records X7 and X8 provided by an embodiment of this application;

FIG. 13 is a schematic diagram of the grouping process of first log record groups provided by an embodiment of this application;

FIG. 14 is a schematic diagram of the templates of first log record groups provided by an embodiment of this application;

FIG. 15 is a schematic diagram of a tokenization result provided by an embodiment of this application;

FIG. 16 is a schematic diagram of a result of obtaining locality-sensitive hash codes provided by an embodiment of this application;

FIG. 17 is a schematic diagram of a grouping result of first log record groups provided by an embodiment of this application;

FIG. 18 is a schematic diagram of the target log records of first log record groups provided by an embodiment of this application;

FIG. 19 is a schematic diagram of a grouping result of second log record groups provided by an embodiment of this application;

FIG. 20 is a schematic diagram of a log template provided by an embodiment of this application;

FIG. 21 is a schematic diagram of a log template extraction apparatus provided by an embodiment of this application;

FIG. 22 is a schematic diagram of a first determining module provided by an embodiment of this application;

FIG. 23 is a schematic diagram of a processing module provided by an embodiment of this application;

FIG. 24 is a schematic diagram of a computing device provided by an embodiment of this application.
Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.

A log is used to record the real-time state of software operation. By analyzing logs, one can grasp the real-time situation of software operation, perform anomaly detection on logs, and so on. In the embodiments of this application, log analysis scenarios include offline analysis scenarios and online analysis scenarios. In the offline analysis scenario, the log data analyzed can be batch log data, such as a log file or log data retrieved from a log database; in the online analysis scenario, the log data analyzed can be real-time log data, also called log stream data. A log file is usually a file downloaded by a user, software developer, or operations and maintenance staff member, or a file obtained through a keyword search.

As shown in FIG. 1, which is a schematic diagram of part of the content of a log, a log includes multiple lines of log records (also called log text), each recording one event during software operation. Each line of log record consists of multiple characters, which may include letters and/or symbols. A line of log record includes a constant part and a variable part, each containing at least one character. For example, suppose the pseudocode corresponding to a log record is defined to record user login information: "logging.info('User%d login at%s',$uid,$time)", where "$" in the pseudocode defines the variable part (also called a variable name). For example, "$IP" means an Internet Protocol (IP) address. It can be seen from this pseudocode that the log record contains two variable parts, recording the user name (uid) and login time (time) of each login.

Log records in a log usually have an implicit log template (also called a log pattern), which refers to the standard style, or fixed format, used to generate the log records in the log. For example, after the above pseudocode actually runs, it outputs multiple lines of log records recording user login information. In the embodiments of this application, the log containing these lines is called the first log:
“User 025862 login at 2018-12-03 02:03:00
User 045210 login at 2018-12-04 02:03:15
User 033658 login at 2018-12-05 02:03:38
User 010100 login at 2018-12-06 02:04:06
User 023025 login at 2018-12-07 02:04:51
User 046523 login at 2018-12-08 02:05:22”。
The log template of a log is the log template of the log records in the log. Usually, when extracting the log template of a log record, if the variable part is identified, it is marked with a preset variable identifier; this marking essentially replaces the variable part with the identifier. The variable identifier is usually the wildcard "*". For example, when extracting the log template of the lines of the first log above, the variable part can be replaced with the wildcard "*", giving each line the log template "User * login at *", so the log template of the first log is "User * login at *". It is worth noting that in the matching processes described later, the variable identifier can be judged to be identical to any character or term. For example, when matching "*" against "046523", it can be determined that "*" and "046523" are the same.
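The wildcard judgment just described can be sketched in a few lines of Python (an illustrative sketch, not part of the embodiments: the function name `matches_template`, whitespace tokenization, and the simplification that one "*" stands for exactly one whitespace-delimited token are all assumptions):

```python
def matches_template(record, template):
    """Return True when the record fits the template; the variable
    identifier '*' is judged identical to any single token."""
    rec = record.split()
    tpl = template.split()
    if len(rec) != len(tpl):
        return False
    return all(t == "*" or t == r for t, r in zip(tpl, rec))
```

For example, `matches_template("User 046523 login at 2018-12-08", "User * login at *")` evaluates to True, while a record with "logout" in place of "login" does not match. A timestamp containing a space would need a multi-token wildcard, which this sketch deliberately omits.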
Logs can be divided into homogeneous logs and heterogeneous logs. The first log in the preceding embodiment is described as an example of a homogeneous log: its log records have a unified log template, and template extraction can be implemented with regular expressions. The multiple lines of a heterogeneous log have no unified log template, and business logs are usually heterogeneous. As an example, part of a heterogeneous log is shown below ("#" in this heterogeneous log marks the line number; in practice the log content may not include the line-number marks or the specific line numbers — this example is mainly for the reader's convenience, so the marks and numbers are ignored in subsequent processing). In the embodiments of this application, the log containing the lines shown below is called the second log:
“#0 mod_jk child workerEnv in error state 6
#1 jk2_init()Found child 6740 in scoreboard slot 7
#2 jk2_init()Found child 6741 in scoreboard slot 8
#3 workerEnv.init()ok/etc/httpd/conf/workers2.properties
#4 mod_jk child workerEnv in error state 7
#5 workerEnv.init()ok/etc/httpd/conf/workers2.properties”。
Since its lines of log records have no unified log template, the extraction process is more complex. The current extraction process includes hierarchically clustering the log records in the log; because the amount of data to process is large, the clustering algorithm needs many computations to obtain the multiple classes of log records, incurring a high computational cost.

An embodiment of this application provides a log template extraction method that can reduce the computational cost of the log template extraction process. Please refer to FIG. 2, a schematic diagram of an application environment involved in a log template extraction method provided by an embodiment of this application. The application environment includes a terminal 110, an analysis device 120, and a network device 130.

The terminal 110 can be a device capable of interacting with a user, such as a display, a computer, a smartphone, a tablet, or a laptop. The analysis device 120 can be a device capable of data analysis, such as one server or a server cluster composed of several servers. Optionally, the analysis device 120 can be a cloud server (also called a cloud computing server), for example a deep learning server providing a Deep Learning Service (DLS). The terminal 110 establishes a wired or wireless communication connection with the analysis device 120 over a communication network. The network device 130 can be a device capable of running software and producing log data, such as a sensor or a terminal. The network device 130 provides the data to be analyzed to the analysis device 120, the analysis device 120 analyzes the log data, and the terminal 110 presents the analysis result to the user. The communication network involved in the embodiments of this application is a 2nd-Generation (2G), 3rd-Generation (3G), Long Term Evolution (LTE), or 5th-Generation (5G) communication network, etc.

Optionally, the application environment may further include a storage device for storing the data that the terminal 110, the analysis device 120, and/or the network device 130 need to store; the storage device can be a distributed storage device, and the terminal 110, the analysis device 120, and/or the network device 130 can read from and write to the data it stores. When the scenario involves a large amount of data, having the storage device store the data can lighten the load of the analysis device and improve its data analysis efficiency. Note that when the amount of data in the application environment is small, the storage device may be omitted. In that case, the functions of the terminal 110 and the analysis device 120 can also be implemented by the same device, for example a computer.

As shown in FIG. 3, the application environment includes a front end 201 and a back end 202. The front end 201 presents data to the user and receives data input by the user, implementing interaction with the user; the back end 202 exchanges data with the front end 201 and performs management operations and/or data processing. The front end 201 can be deployed in the aforementioned terminal 110, and the back end 202 can be deployed in the aforementioned analysis device 120. For example, a client, a script, or a browser can be installed in the terminal 110 to deploy the front end 201, so that the terminal 110 can present a user interface in the form of a client interface, a terminal interface, or a web page corresponding to the browser.

The log template extraction method provided by the embodiments of this application can be used in scenarios such as software debugging, performance optimization, or business analysis, and specifically in anomaly detection within these scenarios. Anomaly detection refers to detecting patterns that do not match expectations. In the embodiments of this application, the data source for anomaly detection is log data produced by software running in an application, process, operating system, device, or network. For example, the aforementioned analysis device 120 can use a deep learning algorithm for anomaly detection on log data. It is worth noting that the log template extraction method provided by the embodiments of this application can also be used in other scenarios such as log compression and keyword retrieval, which the embodiments of this application do not limit.

A Locality Sensitive Hash (LSH) code is a hash code obtained with a locality-sensitive hashing algorithm. A locality-sensitive hash code reflects the similarity of the data processed by the algorithm (which can be called the input data); in the embodiments of this application, that data can be the data of the aforementioned log records. A locality-sensitive hashing algorithm preserves the similarity relationships among input data. As shown in FIG. 4, for similar input data, the resulting locality-sensitive hash codes (which can be called the output data) are also very close; for extremely similar inputs, the codes can even produce a hash collision: for different but similar input data, the output locality-sensitive hash codes are exactly the same. The embodiments of this application perform log template extraction based on this property of the locality-sensitive hash code. As shown in FIG. 5, an embodiment of this application provides a log template extraction method applied in the application environment of FIG. 2 or FIG. 3. FIG. 5 is described taking application of the method to an anomaly detection scenario as an example; the method includes:
Step 301: the analysis device obtains a log, the log including multiple lines of log records.

As noted above, logs come in two forms: batch log data and real-time log data. In the embodiments of this application, the analysis device supports analyzing both forms of logs. In one optional example, the analysis device obtains log files periodically, or during a specified period, to obtain batch log data; the specified period can be a low-power period of the terminal and/or server (that is, a period when power consumption is below a specified power threshold), which reduces the impact of log file acquisition and subsequent log analysis on the other functions of the terminal and/or server. In another optional example, the analysis device continuously obtains real-time log data. In yet another optional example, the analysis device obtains batch log data or real-time log data after receiving an analysis instruction, which can be generated by a user trigger at the terminal and sent by the terminal to the analysis device.

When the analysis device obtains and analyzes a log stream in real time, the log stream can be monitored promptly; anomalies appearing in the stream can be found and reported in time, improving the timeliness of anomaly detection, avoiding large-scale anomalies, and thereby improving user experience.

The log template extraction method provided by the embodiments of this application can be used for template extraction of the aforementioned homogeneous logs as well as the aforementioned heterogeneous logs. In one option, the analysis device executes step 302 directly after executing step 301. In another option, since the method provided by the embodiments of this application is more computationally efficient when applied to template extraction of heterogeneous logs, the type of the log can be detected first: if the log is a homogeneous log, regular expressions are used for template extraction; if the log is a heterogeneous log, the subsequent step 302 is executed.

Step 302: the analysis device determines the locality-sensitive hash code of each line of log record among the multiple lines of log records of the log.
The analysis device can determine the locality-sensitive hash code in multiple ways; the embodiments of this application take the following two optional implementations as examples:

In the first optional implementation, the process of determining the locality-sensitive hash code of each line of log record among the multiple lines of the log can include:

Step A1: the analysis device obtains at least one term (token) of each line of log record in the log.

Optionally, the analysis device can tokenize each line of log record in the log using a tokenization technique to obtain at least one term for each tokenized line. Usually a line of log record can be divided into at least two terms; in a few cases, a line yields one term. The embodiments of this application do not limit the number of terms obtained.

The purpose of tokenization is to cut each line of log record into a set of terms; tokenization reduces the processing complexity of log records, lowers the cost of the subsequent locality-sensitive hash computation, and improves efficiency. In the embodiments of this application, tokenization can be performed in different ways: for example, by whitespace (which can use a String.split() statement); by special characters; or by natural-language tokenization.

Whitespace tokenization cuts a line of log record into multiple terms at spaces; its implementation is simple and efficient. With special-character tokenization, the special characters are usually user-specified characters, such as "|", "##", or "=", which can make the semantic units included in the resulting terms more accurate and the cutting more precise. Natural-language tokenization is common; in this approach, the log record can be fed directly into a natural-language tokenizer, such as NLTK Word_Tokenizer, TreeBank_Tokenizer, or S-Expression_tokenizer.

For the reader's convenience, subsequent embodiments all use whitespace tokenization of log records as the example. Different tokenization mechanisms produce different tokenization results; the embodiments of this application illustrate the results with the following two options:

In the first option, each term obtained by tokenization includes only one semantic unit. A semantic unit is a word or symbol; the symbol can be a numeric symbol (a digit for short, such as 1 or 2) or another symbol, such as "/" or ":". FIG. 6 shows the tokenization result obtained by tokenizing each line of the aforementioned second log with each term including only one semantic unit. Taking the first line of log record in FIG. 6 as an example, tokenization yields 7 terms: "mod_jk", "child", "workerEnv", "in", "error", "state", and "6".

In the second option, each term obtained by tokenization includes multiple semantic units. That is, each term includes m semantic units, where m is an integer greater than 1 and less than the total number of terms. A semantic unit is a word or symbol.

The length of a term can be expressed by the number of semantic units composing it. If a term is too long, tokenization may be ineffective — for example, a whole line of log record may end up as a single term — so a term must not be too long; its length usually needs to allow a line of log record to be divided into at least 2 terms. Typically m = 2 or m = 3.

For a log record including at least two terms, in every two adjacent terms the last m-1 semantic units of the first term are the same as the first m-1 semantic units of the second term, the first term being the term preceding the second term. In the resulting tokenization, every two adjacent terms overlap in semantic units. As an example, suppose log record X1 is "detected a failure in network connection" and log record X2 is "network connection a failure is detected"; with m = 2 their tokenization results are shown in FIG. 7, and with m = 3 their tokenization results are shown in FIG. 8. Optionally, the tokenization action can be implemented with a sliding-window mechanism.
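The sliding-window tokenization just described can be sketched as follows (an illustrative sketch: the function name `tokenize` and whitespace splitting are assumptions; for m > 1, adjacent terms overlap in m-1 semantic units):

```python
def tokenize(record, m=1):
    """Split a log record into terms: for m == 1 each term is one
    whitespace-delimited semantic unit; for m > 1 a sliding window
    emits overlapping terms of m semantic units each."""
    units = record.split()
    if m <= 1 or len(units) < m:
        return units
    # each window starts one unit after the previous one, so the
    # last m-1 units of a term equal the first m-1 units of the next
    return [" ".join(units[i:i + m]) for i in range(len(units) - m + 1)]
```

With m = 2, `tokenize("detected a failure in network connection", 2)` yields the five overlapping terms "detected a", "a failure", "failure in", "in network", "network connection", matching the shape of the result in FIG. 7.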
Optionally, in both of the preceding options, the analysis device can feed each line of log record as a character stream into a designated tokenizer, which performs the tokenization; the analysis device then receives the tokenization result output by the tokenizer. The different tokenization mechanisms corresponding to the two options are implemented by different tokenizers, and the analysis device can support at least one tokenization mechanism.

Step A2: the analysis device determines the locality-sensitive hash code of each line of log record based on its at least one term.

Optionally, the analysis device can determine the locality-sensitive hash code of each line based on a target locality-sensitive hashing algorithm and the at least one term of each line. For example, the locality-sensitive hash computation in the target algorithm can refer to that of the Simhash algorithm or the Minhash algorithm. The minimum unit of data processed by the target locality-sensitive hashing algorithm is the term.

Optionally, in the target locality-sensitive hashing algorithm, after obtaining at least one term of a log record, the locality-sensitive hash code of that log record can be determined by weighted summation; this process can refer to the Simhash algorithm. Determining the locality-sensitive hash code of a log record by weighted summation can include:

Step A21: for any log record, compute the hash code of each term in it; the hash code consists of the binary digits 0 and 1.

Step A22: compute a weighted sum of the computed hash codes of the terms, that is, W = ∑ Hash × weight, where W is the hash sequence after weighted summation and Hash is the hash code of each term.

If the weight of every term is 1, then weight = 1 and the locality-sensitive hash code is the sum of the hash codes of the terms: W = ∑ Hash.

Step A23: perform dimension reduction on the weighted-sum result to obtain the locality-sensitive hash code.

In the weighted summation of step A22, the product of each hash code and its weight follows this rule: where the hash code has a 1, the contribution at that position is the weight multiplied by +1; where the hash code has a 0, the contribution at that position is the weight multiplied by -1.

The dimension reduction in step A23 maps values greater than 0 to 1 and values not greater than 0 to 0. That is, in the weighted-sum result, values greater than 0 are set to 1 and values not greater than 0 are set to 0.

For example, suppose log record X3 is "saveLogSize cost time is 1057" and log record X4 is "flush cost time is 122", and suppose each term of X3 and X4 after tokenization includes only one semantic unit. FIG. 9 shows the process of obtaining the locality-sensitive hash codes of X3 and X4. In FIG. 9, the hash code of "flush" is "10010111", and its product with weight 1 is "1,-1,-1,1,-1,1,1,1" (the commas are separators and do not exist in the actual computation). The weighted sum of the terms' hash codes means summing the weighted codes position by position. Taking log record X3 in FIG. 9 as an example, the final weighted-sum result is "5,-3,-1,1,-3,-1,5,3": the first position, 5, is the sum of the first positions of each term's product with weight 1, that is 1+1+1+1+1; the second position, -3, is the sum of the second positions, that is (-1)+(-1)+1+(-1)+(-1); the other positions are computed likewise. The dimension-reduced result corresponding to the weighted-sum result "5,-3,-1,1,-3,-1,5,3" is "10010011", so the locality-sensitive hash code of log record X3 is "10010011".
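The weighted-sum computation of steps A21–A23 can be sketched as below (an illustrative sketch: the embodiments do not fix a per-term hash function, so an assumed 8-bit hash derived from MD5 is used here, and term weights default to 1):

```python
import hashlib

def term_hash(term, bits=8):
    """Assumed per-term hash: the low `bits` bits of the term's MD5 digest."""
    return int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)

def lsh_code(terms, weights=None, bits=8):
    """Step A21: hash each term; step A22: accumulate +w where the term's
    hash has a 1 and -w where it has a 0; step A23: reduce each position
    to 1 if the sum is positive, else 0."""
    weights = weights or [1] * len(terms)
    acc = [0] * bits
    for term, w in zip(terms, weights):
        h = term_hash(term, bits)
        for pos in range(bits):
            bit = (h >> (bits - 1 - pos)) & 1
            acc[pos] += w if bit else -w
    return "".join("1" if v > 0 else "0" for v in acc)
```

Identical term sequences give identical codes, and position-dependent weights such as [3, 2, 1, 1, 1] can be passed in to reduce the collisions discussed next.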
With the target locality-sensitive hashing algorithm above, if all weights are set to 1, the computation latency is short and efficiency is high, but the resulting locality-sensitive hash codes may suffer undesired hash collisions; this setting is usually applied in scenarios with low analysis-precision requirements but high latency requirements.

For the reader's convenience, the embodiments of this application take the following two kinds of undesired hash collisions as examples:

The first kind of undesired hash collision: collisions caused by differing log record content.

Taking log records X3 and X4 in FIG. 9 again as the example, X3 and X4 have the same length, and the locality-sensitive hash codes determined with the target algorithm above are both "10010011".

Although the codes obtained in this case are the same, the two lines of log records differ greatly in practice because their content differs; the locality-sensitive hash code fails to effectively reflect the similarity of X3 and X4, hence the term undesired hash collision.

The second kind of undesired hash collision: collisions caused by differing log record sequence (that is, order).

Suppose log record X5 is "flush cost time is 122" and log record X6 is "122 is flush cost time", and suppose each term of X5 and X6 after tokenization includes only one semantic unit. X5 and X6 essentially contain the same terms, only in different orders, and the locality-sensitive hash codes determined with the target algorithm above are the same. Since the term content of X5 and X6 after tokenization is substantively the same but the term order differs, the target algorithm with all weights set to 1 yields identical final codes.

It is worth noting that the traditional Simhash algorithm is used for document similarity comparison, its processing objects being documents, and the weights configured for the tokenized terms are positively correlated with term frequency — the higher the frequency, the larger the weight. In the embodiments of this application, if weights were set in the traditional way, identical terms would receive identical weights because weights follow term frequency, and the final locality-sensitive hash codes would still be identical.

Although the codes obtained in this case are the same, the two lines of log records differ greatly in practice because their orders differ; the locality-sensitive hash code fails to effectively reflect the similarity of X5 and X6, hence also an undesired hash collision.

In the embodiments of this application, the two kinds of hash collisions above can be reduced by setting different weights for different terms in each line of log record. Then, assuming the first log record is any log record among the multiple lines of the log that includes multiple terms, the process of determining each line's locality-sensitive hash code based on its at least one term can include: determining the locality-sensitive hash code of the first log record based on its multiple terms and a weight assigned to each term, where the weights of at least two terms included in the first log record differ from each other. The locality-sensitive hash codes of the other lines of the log can all be obtained in the same way as for the first log record. For example, based on the multiple terms of the first log record and the weight assigned to each term, the code of the first log record can be determined by the weighted-summation approach above; for the specific process, refer to steps A21 to A23.

In the embodiments of this application, terms at the same position in different lines of log records are given the same weight, while at least two terms in the same line are given different weights; this effectively reduces undesired hash collisions. The weights of the terms of one line of log record can be set according to the actual situation — for example, increasing or decreasing as an arithmetic progression, or in other ways.

As an example, suppose the target locality-sensitive hashing algorithm configures the weight of the first term as 3, the weight of the second term as 2, and the weights of the other terms as 1. The computation of the locality-sensitive hash codes of log records X3 and X4 shown in FIG. 9 is then as shown in FIG. 10, and the final codes of the two lines differ. This resolves the first kind of undesired hash collision. Likewise, for log records X5 and X6, with these weights the final codes of the two lines differ, which also resolves the second kind of undesired hash collision.

As noted above, the traditional Simhash algorithm compares documents, and the weights configured for terms correlate positively with term frequency — the higher the frequency, the larger the weight. In the target locality-sensitive hashing algorithm provided by the embodiments of this application, the weight of a term of a line of log record can relate to the term's own attributes and be decorrelated from term frequency. Terms in the constant part usually outnumber terms in the variable part, and the leading terms of a log record usually belong to the constant part. Therefore, the weights of the first g terms of a log record can be set larger than the weights of the other terms, where g < k, g is a positive integer, and k is the length of the log record. For example, the weights of the first g terms decrease, while the other weights are equal and smaller than the minimum weight of the first g terms; g may be 1. Associating the weight of a term with its position attribute — that is, determining the weight of each term based on its position in the first log record — allows the locality-sensitive hash code to be computed more accurately, further reducing undesired hash collisions.
It is worth noting that obtaining terms via the second option of step A1 above can also reduce the first and second kinds of undesired hash collisions.

As an example, suppose log record X7 is "detected a failure in network connection" and log record X8 is "network connection:a failure is detected". These are two different log records, but their wording is similar: if tokenized with one semantic unit per term, X7's tokenization result is {detected, a, failure, in, network, connection} and X8's is {detected, a, failure, is, network, connection}. Only 1 term differs between the two lines — X7's term is "in" where X8's term is "is". The locality-sensitive hash codes of X7 and X8 may therefore be very close, and may even collide, producing the first kind of undesired hash collision.

When each term includes two semantic units, the tokenization results of X7 and X8 are as shown in FIG. 11: 5 terms differ and only one term is the same. The locality-sensitive hash codes obtained from this tokenization result differ considerably, effectively eliminating the first kind of undesired hash collision.

When each term includes three semantic units, the tokenization results of X7 and X8 are as shown in FIG. 12: 4 terms differ. The locality-sensitive hash codes obtained from this tokenization result differ even more, effectively eliminating the first kind of undesired hash collision.

Likewise, the second kind of undesired hash collision can also be avoided with a tokenization approach in which a term includes multiple semantic units.

In the second optional implementation, determining the locality-sensitive hash code of each line of the log can include: determining the code of each line directly from the content of each line — that is, without executing step A1. The process can refer to step A2 above: the analysis device can determine the code of each line based on the target locality-sensitive hashing algorithm and the content of each line. For example, the analysis device can feed the content of each line (that is, a character stream) into the algorithm model of the target algorithm and receive the locality-sensitive hash code of each line output by the model. The minimum unit of data processed by the target algorithm is then the character.

In the second optional implementation, the data granularity (that is, the minimum unit of data processed by the target algorithm) when computing each line's locality-sensitive hash code is the character, whereas in the first optional implementation it is the term. The first implementation thus has coarser data granularity than the second when computing the code of each line; consequently, the first implementation requires fewer operations than the second, saving computational cost.

In the third optional implementation, determining the locality-sensitive hash code of each line of the log can include: for each line, obtaining the code with every n characters as the minimum data-processing unit, where n is an integer greater than 1 and each line can be divided with a sliding-window mechanism (the process can refer to the second option of step A1, except that the unit of division changes from m semantic units to n characters). For example, the analysis device can feed each line into the algorithm model of the target algorithm in units of n characters and receive the code of each line output by the model; the minimum unit of data processed by the target algorithm is then n characters. This process can refer to the n-gram (a language model) algorithm.

In the third optional implementation, the data granularity when computing each line's code is n characters. Therefore, the first implementation requires fewer operations than the third, and the third requires fewer operations than the second, each accordingly saving computational cost.
In the embodiments of this application, while determining the locality-sensitive hash code of each line of log record (for example, during the first or second optional implementation), the analysis device can also preprocess each line to improve the efficiency of obtaining the codes and reduce computational cost.

Then, in the fourth optional implementation, determining the locality-sensitive hash code of each line of the log can include:

Step B1: the analysis device replaces p specified characters in each line of log record of the log with q fixed characters, obtaining an updated line of log record.

Since some specified characters in log records are in most cases variables, performing a certain replacement on these specified characters can reduce the computational complexity of the subsequent locality-sensitive hash code. For example, the specified character can be a digit, and the fixed character can be a digit or another symbol, such as 1, 2, or "*". Here 1 ≤ q < p — that is, the number of specified characters replaced exceeds the number of fixed characters. This reduces the number of characters in the log record to some extent, lowering the computational complexity of the subsequent locality-sensitive hash code.

For example, with digits as the specified characters and "*" as the fixed character, if log record X9 is "Connected to 10.110.12.01 at 2019-11-04 15:40:00", the updated log record X9 obtained this way can be: "Connected to *.**.*.* at **-*-* *:*:*".

Optionally, the analysis device can replace each run of multiple consecutive specified characters in each line with one fixed character, obtaining the updated line.

For any log record, since multiple characters are replaced by one fixed character, the number of characters in the record is reduced, effectively lowering the computational complexity of the subsequent locality-sensitive hash code.

For example, with digits as the specified characters and "*" as the fixed character, if log record X9 is "Connected to 10.110.12.01 at 2019-11-04 15:40:00", the updated log record X9 obtained this way is: "Connected to *.*.*.* at *-*-* *:*:*".
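The replacement of each run of consecutive digits with a single fixed character can be sketched with one regular expression (an illustrative sketch; digits as the specified characters and "*" as the fixed character are the example choices from the text, and the name `mask_digits` is an assumption):

```python
import re

def mask_digits(record):
    """Replace every run of consecutive digits with one '*'."""
    return re.sub(r"\d+", "*", record)
```

Applied to log record X9, this reproduces the updated record shown above: "Connected to *.*.*.* at *-*-* *:*:*".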
Step B2: determine the locality-sensitive hash code of each line of log record based on the updated lines.

The process of step B2 can refer to the process of step A2 in the first optional implementation — that is, determining the locality-sensitive hash code based on the tokenization result; or to the second optional implementation — that is, without tokenization, determining the code of each line directly from its content; or it can be implemented with the third optional implementation or other implementations, which the embodiments of this application do not limit.

It is worth noting that if step B2 is implemented in the manner of step A2 of the first optional implementation, step B1 can be executed before tokenization (that is, step A1) or after it — that is, the process of determining each line's code based on the updated lines includes steps A1, B1, and B2 executed in order, or steps B1, A1, and B2 executed in order.

Step 303: the analysis device determines at least one first log record group, where different first log record groups include different lines of log records in the log, and all log records included in each first log record group have the same locality-sensitive hash code.

Usually, the grouping yields multiple first log record groups. The analysis device can place the lines of the log whose locality-sensitive hash codes are all identical into the same first log record group, obtaining the at least one first log record group. Since the locality-sensitive hash code reflects the similarity of the corresponding lines of log records, each first log record group corresponds to one class of log records, and the effect of the grouping is similar to that of clustering. Clustering first requires computing the distances among the features (if clustering were applied to log template extraction, each feature would be all the terms of a line of log record), and computing based on the distances among the features is costly; the computational complexity of the grouping approach above is far smaller than that of clustering.

For example, with u lines of log records in a log, hierarchical clustering proceeds as follows: compute a distance matrix (Distance Matrix) based on a defined distance function (Distance Measurement), which can be the Jaccard distance function; determine multiple pairs of mergeable log records from the distance matrix (each pair is usually the pair of most similar records, determinable from the minimum of each column of the matrix); merge each determined pair; and represent the merge result as a binary tree, for example a dendrogram. Finding the minimum value of each column of the distance matrix has complexity O(u²); each merge then turns 2 elements into 1, decreasing the element count by 1, and u-1 merges in total are needed to build the binary tree. The final complexity is therefore O(u²) × O(u) = O(u³), where O denotes complexity and one element is one line of log record. Even after optimization, the complexity only falls to O(u²·log u). If a log includes tens of thousands of lines, hundreds of millions to tens of billions of operations are needed to perform the hierarchical clustering, causing a performance bottleneck that harms user experience and system stability.

In the embodiments of this application, obtaining each line's locality-sensitive hash code requires neither computing the distances between lines nor merging elements, so the computational complexity can reach O(u), u being the number of lines of log records in a log. The computational complexity is thus far smaller than that of clustering, avoiding the performance bottleneck and reducing the impact on user experience and system stability.

Moreover, as the preceding discussion shows, a clustering algorithm must compute the distance of each line of the log to every other line — that is, the log records of one log are interrelated, and the lines cannot be computed independently during clustering — so computational complexity and latency are high. In the embodiments of this application, the locality-sensitive hash code is a feature of the log record itself; obtaining each line's code requires no other lines, which decorrelates the lines of the log during grouping. Thus, for one log, the grouping of its multiple lines can be executed in parallel (also called concurrently: each line's code is computed separately and the corresponding line is grouped based on its computed code), effectively reducing computation latency and improving efficiency.

Further, by grouping the log's records with locality-sensitive hash codes, the embodiments of this application can obtain one or more first log record groups and then execute the subsequent processing of each group separately. When there are multiple first log record groups, the subsequent processing of the individual groups (such as step 304) can run in parallel, reducing latency; the amount of data to be processed in each run is far smaller than the data volume of the entire log, effectively reducing computational cost while improving efficiency.
Optionally, the analysis device can also determine the at least one first log record group as follows: grouping the multiple lines of log records of the log based on the locality-sensitive hash code of each line and the target feature of each line, to obtain the at least one first log record group.

In the embodiments of this application, hash collisions could also be reduced by lengthening the locality-sensitive hash code, but computational complexity would increase correspondingly. Introducing the target feature into the grouping on top of the locality-sensitive hash code adds a new grouping feature, which can further reduce hash collisions while keeping the code length short. It also improves grouping precision, ensuring higher similarity among the log records placed in the same first log record group. Further, the target feature is also a feature of the log record itself; obtaining each line's locality-sensitive hash code and target feature requires no other lines, which decorrelates the lines of the log during grouping. Therefore, for one log, the grouping of its multiple lines can be executed in parallel, effectively reducing latency and improving efficiency.

As an example, the partitioning rule can be: all log records included in each first log record group have the same locality-sensitive hash code and the same target feature. That is, the analysis device can place the lines of the log whose target feature and locality-sensitive hash code are both identical into the same first log record group, obtaining the at least one first log record group.

For example, the target feature of a log record includes at least one of: the length of the log record, the first character of the log record, and the first word of the log record.

Usually, the length of a log record is expressed by the number of terms it includes. For example, in FIG. 8 log record X1 has length 4 and log record X2 has length 4. Since length is a fairly typical feature of a log record and records of different lengths rarely share a log template, using length as a target feature effectively prevents some substantively dissimilar records (for example, records of different lengths but similar content) from being placed in the same first log record group, reducing the probability of undesired hash collisions (such as the first kind above).

The beginning of a line of log record is usually the constant part — for example, the first character and the first word of a log record are usually constants — and records with different beginnings rarely share a log template. Using the first character or the first word of a log record as a target feature effectively prevents some substantively dissimilar records (for example, records with different beginnings but similar remaining parts) from being placed in the same first log record group, reducing the probability of undesired hash collisions (the first and second kinds above).

Further, in step 303, the analysis device usually determines the at least one first log record group by traversing each line of the log. That is, the analysis device traverses the lines and in turn places the lines with identical locality-sensitive hash codes into the same first log record group. If the partitioning rule is that all records in a group have the same code and the same target feature, then the device traverses the lines and in turn places the lines whose target feature and code are both identical into the same group. Thus, for each first log record group, log records are written into the group one by one (that is, line by line) as a text stream. Each first log record group can be established before the partitioning action — that is, created empty at initialization — or during the partitioning; the embodiments of this application do not limit this.

For the reader's convenience, take the aforementioned second log as an example; suppose the tokenization ignores the line-number marks and the specific line numbers, and the grouping features are the locality-sensitive hash code and the length of each log record (that is, the target feature is the record's length). Among lines 0 to 5, lines 0 and 4 have identical codes and lengths, lines 3 and 5 have identical codes and lengths, and lines 1 and 2 have identical codes and lengths. As shown in FIG. 13, the grouping process is as follows: the analysis device traverses lines 0 to 5 of the log. For line 0, since no relevant group exists, first log record group 0 is established and line 0 is placed in it; for line 1, since no relevant group exists, first log record group 1 is established and line 1 is placed in it; for line 2, since its code and length are identical to those of line 1, line 2 is placed in group 1; for line 3, since no relevant group exists, first log record group 2 is established and line 3 is placed in it; for line 4, since its code and length are identical to those of line 0, line 4 is placed in group 0; for line 5, since its code and length are identical to those of line 3, line 5 is placed in group 2.
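The streaming grouping walked through above can be sketched as follows (an illustrative sketch: the name `group_records` is an assumption, and any locality-sensitive hash function — such as the `lsh_code` sketched earlier — can be plugged in for the `lsh_code` parameter):

```python
from collections import defaultdict

def group_records(records, lsh_code):
    """Place lines whose (LSH code, term count) keys are identical into the
    same first log record group, creating a group on first encounter."""
    groups = defaultdict(list)
    for line_no, record in enumerate(records):
        key = (lsh_code(record), len(record.split()))
        groups[key].append(line_no)
    return list(groups.values())
```

With a toy code function that keys on the first token, six records whose lines 0/4, 1/2, and 3/5 share keys are grouped exactly as in FIG. 13: [[0, 4], [1, 2], [3, 5]].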
Step 304: the analysis device obtains the log template of the log by processing each first log record group in the at least one first log record group.

In the embodiments of this application, since each first log record group corresponds to one class of log records, there are multiple optional processing approaches for obtaining the log templates involved in the whole log. The embodiments of this application describe the process of obtaining the log's templates with the following optional approaches as examples:

In the first optional processing approach, the log template of each first log record group can be obtained, and the log's template determined based on the obtained templates; the process includes:

Step C1: the analysis device separately extracts the log template in each first log record group.

As an example, step C1 can include the following steps:

Step C11: for each line of log record in each first log record group, the analysis device can compare the line with the historical log template of the first log record group.

Because in step 303 the analysis device determined the at least one first log record group based on the locality-sensitive hash code and possibly other target features, the log records in one resulting group may or may not have equal lengths. This application provides two comparison methods for these two cases:
First case: the log records in one first log record group have the same length — that is, in step 303 the analysis device determined the groups based on the locality-sensitive hash code and the log record length (the target feature includes at least length). Then the log record can be compared with the group's historical log template position by position. In a first example, if the proportion of identical terms in the record length (that is, the record's total number of terms) exceeds a first proportion threshold, the record is determined to match the historical template; if the proportion does not exceed the first proportion threshold, the record is determined not to match. In a second example, if the proportion, in the record length, of differing terms (terms of one record that differ from the other record) is below a second proportion threshold, the record matches; otherwise it does not. In a third example, if the proportion of identical terms exceeds the first proportion threshold and the number of differing terms is below a first count threshold, the record matches; if the proportion of identical terms does not exceed the first proportion threshold, or the number of differing terms is not below the first count threshold, it does not match. In a fourth example, if the number of identical terms exceeds a second count threshold, the record matches; otherwise it does not. In a fifth example, if the number of differing terms is below a third count threshold, the record matches; otherwise it does not. There are still other ways to determine whether a record matches a group's historical template, which the embodiments of this application do not limit.

Position-wise comparison means comparing the terms of the log record and of the historical template at the same positions. Suppose log record X10 is "User Yang Xiao Yu has been logged in" and the historical template is "User***has been logged in". With one semantic unit per term, "User", "Yang", "Xiao", "Yu", "has", "been", "logged", and "in" are compared one-to-one with "User", "*", "*", "*", "has", "been", "logged", and "in". Using the first example's criterion, the number of identical terms is 8 and the record length is 8; with a first proportion threshold of 1/2, 8/8 is greater than 1/2, so the record matches the historical template.

Second case: the log records in one first log record group have different lengths — that is, in step 303 the analysis device determined the groups based only on the locality-sensitive hash code, or based on the code plus target features other than length. Then, treating the log record and the historical template each as a term sequence, the comparison can be performed by computing their Longest Common Subsequence (LCS). In a first example, if the proportion of the LCS length (that is, the number of terms in the LCS) in the record length (the record's total number of terms) exceeds a third proportion threshold, the record matches the historical template; otherwise it does not. In a second example, if the proportion, in the record length, of the length of the sequence outside the LCS (that is, the record's total terms minus the LCS's terms) is below a fourth proportion threshold, the record matches; otherwise it does not. In a third example, if the LCS length exceeds a first length threshold, the record matches; if not, it does not. There are still other ways to determine whether a record matches a group's historical template, which the embodiments of this application do not limit.

Computing the LCS of the record and the historical template means finding the longest subsequence of their common parts, which can be done recursively or with dynamic programming. Taking one semantic unit per term as the example: suppose in one first log record group, log record X10 is "User Yang Xiao Yu has been logged in" and the first historical template is "User * has been logged in"; their LCS is "User has been logged in". Suppose in another first log record group, log record X11 is "User Yang Xiao Yu has been logged in" and the second historical template is "User ** registered successfully"; their LCS is "User".

Using the third example's criterion, with a first length threshold of 3: the LCS of log record X10 has length 5, so X10 matches the first historical template; the LCS of log record X11 has length 1, so X11 does not match the second historical template.
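The LCS-based comparison of the second case can be sketched with standard dynamic programming (an illustrative sketch: the threshold rule follows the third example above, and the names `lcs_len` and `matches` are assumptions):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two term sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def matches(record, template, first_length_threshold=3):
    """Third example: match when the LCS exceeds the first length threshold."""
    return lcs_len(record.split(), template.split()) > first_length_threshold
```

On the X10/X11 examples above, the LCS with "User * has been logged in" has length 5 (a match with threshold 3), and the LCS with "User ** registered successfully" has length 1 (no match).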
需要说明的是,通常情况下一个第一日志记录组中存在一个日志模板,少数情况下,第一日志记录组存在多个日志模板。在第一日志记录组存在多个日志模板时,对于任一日志记录,可以将该日志记录分别与该多个日志模板中的每个日志模板进行比较,该比较过程参考前述两种情况的过程;或者,计算该日志记录与多个日志模板中每个日志模板的距离(如采用Jaccard距离函数计算该距离),将给日志记录与距离最近的日志模板进行比较,如此能够减少运算代价。
步骤C12、当该日志记录与历史日志模板匹配,基于该日志记录与历史日志模板,确定第一日志记录组的新的日志模板。
在第一种示例中,由于历史日志模板是已提取好的模板,而该日志记录又与该历史日志模板匹配,因此,可以直接将历史日志模板作为第一日志记录组的新的日志模板。
在第二种示例中,由于历史日志模板和日志记录可能存在一些不同的部分,这些不同的部分可以认为是变量部分。分析设备可以将历史日志模板中,与该日志记录不同的部分采用变量指示符替换;或者,分析设备先判断历史日志模板中与该日志记录不同的部分是否仅为变量指示符所在部分,若该不同的部分还包括变量指示符所在部分之外的其他部分,将其他部分采用变量指示符替换,得到新的日志模板;若该不同的部分仅为变量指示符所在部分,可以直接将历史日志模板作为新的日志模板。如此处理可以得到更为准确的日志模板。
上述两种示例中,可以采用新的日志模板更新对应的历史日志模板,例如删除历史日志模板,或者采用新的日志模板覆盖对应的历史日志模板,从而保证第一日志记录组中不存在重复的日志模板。
值得说明的是,与前述第一种情况相应的,为了保证日志模板与日志记录的长度一致性,进行替换操作时,一个变量指示符仅替换一个词条,一个变量指示符在日志模板中的长度视为1。
与前述第二种情况相应的,由于无需保证日志模板与日志记录的长度一致性,进行替换操作时,一个变量指示符可以替换一个或多个连续的词条。
步骤C13、当该日志记录与历史日志模板不匹配,将从该日志记录提取的日志模板,添加为第一日志记录组的新的日志模板。
当该日志记录与历史日志模板不匹配,说明第一日志记录组中当前不存在与该日志记录匹配的日志模板,需要生成一个与该日志记录对应的新的日志模板。
在第一种示例中,直接将日志记录作为第一日志记录组的新的日志模板。
在第二种示例中,参考前述B1,由于日志记录中的一些指定字符在多数情况下都是变量,因此,可以将这些指定字符采用变量指示符替换,从而生成第一日志记录组的新的日志模板。示例的,该指定字符可以为数字,例如1或2;该变量指示符可以为通配符“*”等符号。
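该第二种示例可用如下示意性Python片段说明(此处假设指定字符为数字、变量指示符为“*”;正则表达式与函数名均为说明用的假设):

```python
import re

def template_from_record(record, variable_indicator="*"):
    """从单行日志记录生成初始日志模板:将其中的连续数字
    (假设其为变量)替换为变量指示符。指定字符集可按需扩展。"""
    return re.sub(r"\d+", variable_indicator, record)
```

例如,对记录“mod_jk child init 1-2”应用该函数,可得到模板“mod_jk child init *-*”。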
在前述步骤303中,分析设备通常通过遍历日志中的每行日志记录来确定至少一个第一日志记录组。则前述分别提取每个第一日志记录组中的日志模板的过程可以在所有日志记录分组完成后执行,也可以在日志记录的分组过程中实时执行。其中,在日志记录的分组过程中实时提取每个第一日志记录组中的日志模板,可以减少模板提取的时延,提高模板提取过程的整体时效性。
示例的,以在日志记录的分组过程中实时提取每个第一日志记录组中的日志模板为例,与前述步骤303对应的,对于每个第一日志记录组,该分别提取每个第一日志记录组中的日志模板的过程可以包括:在接收到一行日志记录后,将接收到的日志记录与历史日志模板进行比较,当接收到的日志记录与历史日志模板匹配,基于接收到的日志记录与历史日志模板,确定第一日志记录组的新的日志模板,当接收到的日志记录与历史日志模板不匹配,将从接收到的日志记录提取的日志模板,添加为第一日志记录组的新的日志模板。
仍然以前述第二日志为例,假设分组结果为图14所示的分组结果,则对于第一日志记录组0,在接收到第0行日志记录后,第一日志记录组0的历史日志模板为空,第0行日志记录与历史日志模板不匹配,提取第0行日志记录的日志模板,添加为第一日志记录组的新的日志模板:“mod_jk child workerEnv in error state *”;在接收到第4行日志记录后,将接收到的日志记录与历史日志模板:“mod_jk child workerEnv in error state *”进行比较,由于接收到的第4行日志记录与历史日志模板匹配,可以将第一日志记录组0的历史日志模板“mod_jk child workerEnv in error state *”作为新的日志模板。第一日志记录组1和第一日志记录组2的模板提取方式与该模板提取方式相同,最终提取到的每个第一日志记录组的模板如图14所示,本申请实施例对此不再赘述。
需要说明的是,前述历史日志模板和新的日志模板是相对的概念,历史日志模板指的是当前时刻已存在的日志模板,新的日志模板指的是当前时刻新生成的日志模板。
步骤C2、基于每个第一日志记录组的日志模板,确定日志的日志模板。
可选的,基于每个第一日志记录组的日志模板,确定日志的日志模板的过程可以通过以下三种方式实现:
第一种方式,对至少一个第一日志记录组的日志模板进行聚类处理,得到日志的日志模板。
聚类处理实质上是一种分组方式,用于使相似的处理对象归为一类,不相似的处理对象归为不同类。本申请实施例中,分析设备在获取了至少一个第一日志记录组的一个或多个日志模板后,可以通过聚类处理,将该一个或多个日志模板进行分类,尤其在获取的日志模板有多个时,可能存在不同的第一日志记录组的日志模板相似的情况,通过分类,可以把相似的日志模板划分为一类日志模板,从而将划分得到的一类或多类日志模板作为日志的日志模板。在后续呈现给用户时,可以将该一类或多类日志模板呈现给用户,以使用户直观地看到日志中存在几类日志模板。示例的,该聚类处理可以为层次聚类,其处理过程参考前述的层次聚类过程,本申请实施例对此不做赘述。
通过层次聚类得到的日志模板具有层次关系,用户可以调节聚类的精度(也称粒度),以得到不同的聚类结果。
第二种方式,对至少一个第一日志记录组的日志模板进行合并(merging)处理,得到日志的日志模板。
合并处理指的是将相同或相似的处理对象整合成一个对象的过程,其处理效果类似于去重处理的效果。本申请实施例中,对至少一个第一日志记录组的日志模板进行合并处理,得到日志的日志模板的过程包括:在至少一个第一日志记录组的日志模板包括至少两个日志模板时,对于每两个日志模板,检测两个日志模板的常量部分的相似度是否为1;当两个日志模板的常量部分的相似度为1,将两个日志模板中的一个日志模板的变量部分采用一个变量标识符替换,并删除另一日志模板(相当于保留任一日志模板的常量部分,并在常量部分之间原变量部分所在的位置插入变量标识符)。示例的,该变量标识符可以为通配符“*”。其中,两个日志模板的常量部分的相似度可以通过计算两个日志模板的常量部分的距离确定。该相似度可以采用杰卡德相似度(Jaccard similarity,也称杰卡德系数)算法计算。需要说明的是,当两个日志模板的常量部分的相似度不为1,不对两个日志模板进行处理。
例如两个模板分别为:“User ** has logged in”和“User *** has logged in”,两者的常量部分均包含四个词条:{User,has,logged,in}。两者的相似度为1。因此,可以将“User ** has logged in”的变量部分“**”替换为“*”,删除“User *** has logged in”,得到的是合并后的日志模板:“User * has logged in”。
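上述合并处理可用如下示意性Python片段说明(其中杰卡德相似度按词条集合计算,“**”“***”等连续星号词条均视为变量部分;实现细节为说明用的假设):

```python
def jaccard_similarity(a, b):
    """杰卡德相似度:两个词条集合交集大小与并集大小之比。"""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def merge_templates(t1, t2, indicator="*"):
    """当两个模板的常量部分相似度为1时合并:保留常量部分,
    并将连续的变量指示符折叠为一个指示符;否则不处理。"""
    def is_var(w):  # “**”“***”等星号词条均视为变量部分
        return set(w) == {indicator}
    c1 = [w for w in t1.split() if not is_var(w)]
    c2 = [w for w in t2.split() if not is_var(w)]
    if jaccard_similarity(c1, c2) != 1.0:
        return None  # 常量部分不同,不合并
    merged, prev_var = [], False
    for w in t1.split():
        v = is_var(w)
        if not (v and prev_var):  # 折叠连续的变量指示符
            merged.append(indicator if v else w)
        prev_var = v
    return " ".join(merged)
```

对应前述示例,合并“User ** has logged in”与“User *** has logged in”即得到“User * has logged in”。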
第三种方式,将至少一个第一日志记录组的日志模板,作为日志的日志模板。
由于在多数情况下,每个第一日志记录组的日志模板通常为一个,且若前述第一日志记录组的分组方式得当,相同或相似的日志模板较少,则可以不进行聚类处理或合并处理,直接将获取的各个日志记录组的日志模板作为日志的日志模板。
本申请实施例中,该分析设备可以支持前述三种方式中一种或多种方式。终端可以在用户界面全部呈现该多个方式的触发按钮(或图标),或者以滚动方式呈现该多个方式的触发按钮,还可以呈现该多个方式的触发按钮中使用频率较高的一个或多个方式的触发按钮(其他方式的触发按钮可以由用户再次触发其他按钮后显示,该其他按钮可以为下拉按钮)等,本申请实施例对此不做限定。用户想要查看某一方式所对应的日志的日志模板时,通过点击等方式触发该某一方式所对应的触发按钮,相应的,终端接收用户的选择指令,该选择指令携带有该某一方式的标识,终端将该选择指令发送至分析设备,分析设备基于获取的选择指令采用对应的方式获取日志的日志模板,并由终端在用户界面呈现给用户。其中,采用第一种方式呈现日志模板时,可以以多层文件目录结构或树结构(如二叉树)的方式呈现日志模板;采用前述第二种或第三种方式呈现日志模板时,若日志模板有多个,可以以列表方式呈现该多个日志模板。
前述第一种可选处理方式中,通过提取每个第一日志记录组的日志模板来确定日志的日志模板,无需直接采用第一日志记录组中的日志记录参与日志的日志模板的计算,使得解空间呈指数级别下降,有效提高了运算效率。
第二种可选处理方式,可以通过分别获取每个第一日志记录组中的目标日志记录,基于获取的目标日志记录来确定日志的日志模板,该过程包括:
步骤D1、获取每个第一日志记录组的目标日志记录。
可选的,第一日志记录组的目标日志记录为第一日志记录组中的部分日志记录,例如,为第一日志记录组中的一行日志记录。由于一个第一日志记录组中包含的是相同局部敏感哈希码的日志记录,即包含了相同或相似的日志记录,因此可以选择目标日志记录来代表该第一日志记录组中的日志记录,对该目标日志记录的处理相当于对该第一日志记录组中的所有日志记录的处理,但是有效减少了实际处理的数据量,相当于进行了数据采样(sampling),减少解空间,从而进一步降低运算代价。
示例的,该第一日志记录组的目标日志记录可以是在第一日志记录组中随机选择的一行日志记录。这样可以保证第一日志记录组中每行日志记录被选取为目标日志记录的概率相等。值得说明的是,目标日志记录也可以按照其他预设条件在第一日志记录组中筛选。例如,选择第一日志记录组中第一行日志记录,或者选择第一日志记录组中最新(例如时间戳最新)的日志记录。
可选的,在步骤D1之前,处理器还可以检测第一日志记录组中日志记录的个数,当第一日志记录组包括多行日志记录时,执行步骤D1,例如,从第一日志记录组的日志记录中筛选部分日志记录作为目标日志记录。当第一日志记录组仅包括一行日志记录时,执行步骤D1,步骤D1的具体执行过程为将第一日志记录组的该行日志记录作为目标日志记录。
步骤D2、基于每个第一日志记录组的目标日志记录,确定日志的日志模板。
可选的,基于每个第一日志记录组的目标日志记录,确定日志的日志模板的过程可以包括:
步骤D21、确定至少一个第二日志记录组。
通常情况下,分组得到的第二日志记录组有多个。其中,不同第二日志记录组包括至少一个第一日志记录组对应的目标日志记录中的不同目标日志记录;每个第二日志记录组包括的所有目标日志记录具有相同的目标特征。分析设备可以将获取的目标日志记录中,目标特征均相同的日志记录,划分至同一第二日志记录组中,以得到该至少一个第二日志记录组。如此可以进一步降低解空间。
例如,日志记录的目标特征包括:日志记录的长度、日志记录的首字符和日志记录的首个单词中的至少一种。值得说明的是,步骤D21中的目标特征与前述步骤303中的目标特征可以相同也可以不同。
通过采用目标特征对获取的目标日志记录进行分组,可以获取一组或多组第二日志记录组,从而可以执行后续分别对每个第二日志记录组进行处理的过程,当第二日志记录组有多组时,对于各个第二日志记录组的后续处理过程(如步骤D22)可以并行执行,从而减少运算时延,每次执行处理过程时所需要运算的数据量远远小于日志整体的数据量,有效降低运算代价,同时提高运算效率。
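按目标特征对目标日志记录分组的过程可用如下示意性Python片段说明(此处假设目标特征为日志记录的长度、首字符和首个单词;特征组合与函数名均为说明用的假设,并非限定):

```python
from collections import defaultdict

def group_by_target_features(records):
    """按目标特征(日志记录的长度、首字符、首个单词)对目标
    日志记录分组,目标特征均相同的记录划入同一第二日志记录组。"""
    groups = defaultdict(list)
    for rec in records:
        tokens = rec.split()
        key = (len(tokens),                      # 长度(词条总数)
               rec[0] if rec else "",            # 首字符
               tokens[0] if tokens else "")      # 首个单词
        groups[key].append(rec)
    return list(groups.values())
```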
值得说明的是,在步骤D21之前,分析设备还可以检测第一日志记录组的个数,当第一日志记录组有多个时,再执行步骤D21,当第一日志记录组仅有一个,可以不执行步骤D21。
步骤D22、通过分别对每个第二日志记录组进行处理,得到日志的日志模板。
示例的,通过分别对每个第二日志记录组进行处理,得到日志的日志模板的过程可以包括:
步骤D221、对每个第二日志记录组中的日志记录进行聚类处理,得到每个第二日志记录组对应的至少一类日志记录。
各个第二日志记录组中的日志记录来自于不同的第一日志记录组,因此,一些日志记录还存在相似的情况。本申请实施例中,对每个第二日志记录组中的日志记录进行聚类处理(如层次聚类),可以将相似的日志记录划分为一类,从而后续过程中,可以执行分别对聚类处理得到的每一类日志记录进行模板提取的过程,当第二日志记录组中的日志记录有多类时,对于各类日志记录的处理过程可以并行执行,从而减少运算时延,每次执行处理过程时所需要运算的数据量较小,有效降低运算代价,同时提高运算效率。
值得说明的是,在步骤D221之前,分析设备还可以检测第二日志记录组的个数,当第二日志记录组有多个时,再执行步骤D221至D223,当第二日志记录组仅有一个,可以不执行步骤D221至D223,直接将第二日志记录组的日志模板作为日志的日志模板。
步骤D222、分别对聚类处理得到的每一类日志记录进行模板提取,得到每一类日志记录的日志模板。
步骤D222的过程可以参考前述步骤C1的过程,即一类日志记录相当于前述一个第一日志记录组,本申请实施例对此不做赘述。
步骤D223、基于至少一类日志记录中每一类日志记录的日志模板,确定日志的日志模板。
示例的,基于至少一类日志记录中每一类日志记录(即所有类的日志记录)的日志模板,确定日志的日志模板的过程可以通过以下两种方式实现:
第一种方式、将每一类日志记录的日志模板作为日志的日志模板。
该第一种方式可以参考前述步骤C2中的第三种方式,即一类日志记录相当于前述一个第一日志记录组,本申请实施例对此不做赘述。
第二种方式、对聚类处理得到各类日志记录的日志模板进行合并处理,将合并得到的日志模板,作为日志的日志模板。
该第二种方式可以参考前述步骤C2中的第二种方式,即一类日志记录相当于前述一个第一日志记录组,本申请实施例对此不做赘述。
为了便于读者理解,以如下例子对第二种可选处理方式进行说明,假设第三日志如图15左侧所示,经过前述步骤302中的步骤A1得到如图15右侧所示的分词结果。经过前述步骤302中的步骤A2得到如图16所示的局部敏感哈希码。假设前述步骤303中,分析设备基于局部敏感哈希码以及日志记录的长度(即目标特征为日志记录的长度)确定第一日志记录组,则日志记录、日志记录的长度和局部敏感哈希码的关系如表1所示。
表1
(表1以图像形式给出,未能转录;其内容为各行日志记录、日志记录的长度与局部敏感哈希码的对应关系。)
分析设备基于局部敏感哈希码以及日志记录的长度确定的分组结果如图17所示,第1行和第6行日志记录划分为一个第一日志记录组,第7行和第9行日志记录划分为一个第一日志记录组,其余行日志记录各自划分为一个第一日志记录组。假设经过前述步骤D1获取每个第一日志记录组的目标日志记录如图18所示,即在第1行和第6行日志记录所属第一日志记录组中选择第1行日志记录作为目标日志记录,在第7行和第9行日志记录所属第一日志记录组中选择第7行日志记录作为目标日志记录,在其他第一日志记录组中选择各自的一行日志记录作为目标日志记录。假设前述步骤D21中,分析设备基于日志记录的长度(即目标特征为日志记录的长度)确定第二日志记录组,则最终得到如图19所示的共四个第二日志记录组,分别为第二日志记录组Z1至Z4。假设步骤D221中对每个第二日志记录组中的日志记录进行层次聚类处理,在经过步骤D222的模板提取动作后,得到如图20的右侧所示的5个日志模板,该5个日志模板分别为:
“workerEnv.init() ok /etc/httpd/conf/workers2.properties
mod_jk child init 1-2
mod_jk child workerEnv in error state *
jk2_init() Found child * in scoreboard slot *
jk2_init() Can't find child 1566 in the scoreboard”。
假设采用前述步骤D223的第一种方式,将每一类日志记录的日志模板作为日志的日志模板,则最终得到的日志的日志模板即包括该5个日志模板。
步骤305、分析设备基于日志的日志模板,对日志进行异常检测。
在一种可选方式中,分析设备基于日志的日志模板,对日志进行特征提取;并基于提取的日志的特征进行异常检测。
日志的特征指的是日志包含的日志记录所具有的特征。示例的,其可以包括:日志模板的出现次数、日志模板的出现频率和/或日志模板的出现时段。其中,日志模板的出现次数指的是日志中该日志模板对应的日志记录的个数;日志模板的出现频率指的是该日志模板对应的日志记录的个数与日志所包含的日志记录总个数的比值;日志模板的出现时段指的是该日志模板对应的日志记录的发生时刻或采集时刻所属时段。
示例的,假设第一日志模板为日志中的任一日志模板,对于第一日志模板,分析设备可以将日志划分为多个时间窗,检查每个时间窗中包括的每行日志记录,以检测与第一日志模板匹配的日志记录,并统计该时间窗中所需确定的日志的特征,如该第一日志模板的出现次数。分析设备通过比较多个时间窗中的日志的特征,将与其他时间窗特征差距大于指定差距阈值的时间窗确定为异常时间窗,则该异常时间窗中的第一日志模板的日志记录为出现异常的日志记录。前述多个时间窗可以为固定大小,且互不重叠的时间窗,或者为通过滑窗算法确定的时间窗。
例如,当待分析的特征为第一日志模板的出现次数时,若分析设备发现某一时间窗中第一日志模板的出现次数明显较高(例如与其他时间窗,或者与所有时间窗中第一日志模板的出现次数的均值相比,差值为正,且该差值大于指定差值阈值),分析设备可以将其标定为热事件,发出告警信息;若分析设备发现某一时间窗中第一日志模板的出现次数明显较低(例如与其他时间窗,或者与所有时间窗中第一日志模板的出现次数的均值相比,差值为负,且该差值的绝对值大于指定差值阈值),分析设备可以将其标定为冷事件,发出告警信息。
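基于时间窗内日志模板出现次数检测热/冷事件的判定逻辑,可用如下示意性Python片段说明(与所有时间窗均值比较的方式及差值阈值均为说明用的假设):

```python
def detect_hot_cold_windows(counts, diff_threshold):
    """counts 为各时间窗中某一日志模板的出现次数。
    与所有时间窗均值之差为正且超过阈值的时间窗标定为热事件;
    差值为负且绝对值超过阈值的时间窗标定为冷事件。"""
    mean = sum(counts) / len(counts)
    events = []
    for i, c in enumerate(counts):
        if c - mean > diff_threshold:
            events.append((i, "hot"))   # 出现次数明显偏高
        elif mean - c > diff_threshold:
            events.append((i, "cold"))  # 出现次数明显偏低
    return events
```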
值得说明的是,终端可以显示分析设备所确定的日志的日志模板,用户可以指定目标日志模板,相应的,终端接收到模板选择指令,将携带目标日志模板的标识的模板选择指令发送至分析设备,分析设备对该目标日志模板进行异常检测,该检测过程可以参考前述第一日志模板的异常检测过程。如此,分析设备可以根据用户指示进行特定的日志模板的异常检测,提高异常检测的针对性,保证用户体验。
在第二种可选方式中,分析设备基于日志的日志模板,检测未知事件。分析设备可以采用每个日志模板分别与日志中的日志记录进行匹配,当存在与所有日志模板均不匹配的日志记录时,确定该日志记录为未知日志记录,该未知日志记录所对应的事件即为未知事件,其可能是异常事件。
在异常检测场景中,分析设备还可以采用其他方式对日志中的异常进行检测,本申请实施例对此不做限定。
需要说明的是,前述步骤302中,日志中的日志记录在分组时,采用了局部敏感哈希码,因此,日志记录的分布规则遵循哈希(Hash)分布规则,即键-值(key-value)分布规则,如此可以实现负载均衡。
为了便于读者理解,本申请实施例对哈希分布原理进行简单介绍。哈希分布是基于哈希函数的一种数据分布方法,哈希函数也可以称为散列函数。哈希函数是基于数据的键(key,也称键值,在分布式系统中也称分布键值),得到值(value,也称哈希值)的一种函数。即value=f(key),函数f即为哈希函数。以表2为例,假设哈希函数为f(key)=key mod 5,“mod”表示取模,即该哈希函数为取模运算(modulo operation)函数。则假设key分别为1、2、3、4、5、6、7、8和9,则对应的value分别为1、2、3、4、0、1、2、3和4。
表2
key 1 2 3 4 5 6 7 8 9
value 1 2 3 4 0 1 2 3 4
由上可知,key为1和6时,value都为1。因此,采用哈希函数确定value可能存在不同的key对应相同的value的情况,这种情况称为哈希冲突。哈希桶算法是一种特殊的哈希算法,其能够解决哈希冲突。哈希桶为放置不同key链表(也称哈希表)的容器,该哈希桶也称f(key)集合,或value集合。同一哈希桶对应的value相同。参考前述例子,可以设置哈希桶的个数为模数(也称模)的值,即5。多个value值与多个哈希桶一一对应。示例的,可以采用value值作为哈希桶的索引或编号,每个哈希桶存放具有相同value的key,同一个哈希桶中冲突的key之间用单向链表进行存储,这样就解决了哈希冲突。在查找与key对应的数据时,只需要通过key索引到对应value的哈希桶,然后从哈希桶的首地址对应的节点开始查找,即按照链表顺序查找,对比key的值,直到找到对应key,再基于查找到的key索引到对应的数据。如表2所示,key为1和6时,存储在哈希桶1中,key为2和7时,存储在哈希桶2中;key为3和8时,存储在哈希桶3中;key为4和9时,存储在哈希桶4中;key为5时,存储在哈希桶0中。
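上述哈希桶的分布方式可用如下示意性Python片段说明(以取模函数f(key)=key mod 5为例;桶内列表的顺序对应单向链表的插入顺序,函数名为说明用的假设):

```python
def hash_buckets(keys, modulus=5):
    """哈希桶示意:value = key mod modulus,value 相同的 key
    放入同一个桶,桶内冲突的 key 按插入顺序保存。"""
    buckets = {v: [] for v in range(modulus)}  # 桶的个数等于模数
    for k in keys:
        buckets[k % modulus].append(k)
    return buckets
```

对于key为1至9,得到的分桶结果与表2一致,例如哈希桶1中存放key 1和6。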
需要说明的是,前述实施例仅以哈希函数为取模的函数为例进行说明,实际上该哈希函数还可以为取余的函数(此时,该哈希函数为取余运算(remainder operation)函数,哈希桶的个数为模数的值),或者其他函数,本申请实施例对此不做限定。
参考前述介绍,本申请实施例可以引入哈希桶算法来进行日志记录的分布,从而避免哈希冲突,在这种情况下,通常以哈希桶为单位来标识分布的数据。因此,前述每个第一日志记录组可以由一个哈希桶标识。每个第二日志记录组也可以由一个哈希桶标识。示例的,每个哈希桶具有一个桶标识,其可以由对应的分组方式确定。例如,在步骤303中,若仅采用局部敏感哈希码进行分组,桶标识满足以下第一公式:
Id=f(lsh);
其中,Id表示桶标识;lsh表示局部敏感哈希码,f为预设函数。
若采用局部敏感哈希码以及目标特征进行分组,桶标识满足以下第二公式:
Id=f(x1,x2,…,xm,lsh);
其中,Id表示桶标识;lsh表示局部敏感哈希码,f为预设函数,x1,x2,…,xm分别表示目标特征所包括的m个特征,m为目标特征所包括的特征的总数。
例如,目标特征仅包括日志记录的长度,则m=1,x1表示日志记录的长度,桶标识满足以下第三公式:
Id=f(x1,lsh)。
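桶标识函数f的一种可能实现如下(本申请并未限定f的具体形式;此处假设为对局部敏感哈希码与各目标特征拼接后做简单多项式哈希再取模,仅为示意):

```python
def bucket_id(lsh, *features, num_buckets=1024):
    """由局部敏感哈希码 lsh 与 m 个目标特征计算桶标识。
    将各部分转为字符串拼接,逐字符做多项式滚动哈希并对桶数取模。"""
    acc = 0
    for part in (lsh, *features):
        for ch in str(part):
            acc = (acc * 131 + ord(ch)) % num_buckets
    return acc
```

例如,目标特征仅包括日志记录的长度时,可调用 bucket_id(lsh, length) 得到对应的桶标识。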
本申请实施例提供的日志模板提取方法的步骤先后顺序可以进行适当调整,步骤也可以根据情况进行相应增减,例如,在其他应用场景中,如日志压缩或关键词检索等,可以不执行前述步骤305,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化的方法,都应涵盖在本申请的保护范围之内,因此不再赘述。
综上所述,本申请实施例中,通过每行日志记录的局部敏感哈希码进行日志记录的分组,而局部敏感哈希码又可以反映对应的不同行日志记录的相似度,如此分组达到了与聚类处理相同的效果,从而有效降低了运算复杂度。
并且,本申请实施例中,局部敏感哈希码和目标特征是日志记录本身的特征,在获取每行日志记录的局部敏感哈希码和目标特征时,无需考虑其他行日志记录。从而实现了日志中各行日志记录在分组过程中的去相关。如此,对于一个日志,其多行日志记录的分组过程可以并行执行,有效减少运算时延,提高运算效率。
当第一日志记录组有多组时,对于各个第一日志记录组的处理过程可以并行执行,从而减少运算时延,每次执行处理过程时所需要运算的数据量远远小于日志整体的数据量,有效降低运算代价,同时提高运算效率。
进一步的,在前述304中,采用第一种可选处理方式和第二种可选处理方式,均筛选掉多数的日志记录,使得解空间呈指数级别下降,有效降低运算代价,同时提高运算效率。
在进行异质化日志的日志模板提取时,传统的模板提取方法,在理想状态下,对于包含5万行左右日志记录的日志,需要5秒左右的时间才能实现日志模板的完全提取。而采用本申请实施例提供的日志模板提取方法,在理想状态下,需要1秒左右的时间即可实现日志模板的完全提取,相对于传统的方法有效降低了运算时延,提高了运算性能,提高了用户体验。
本申请实施例提供一种日志模板提取装置40,如图21所示,所述装置包括:
第一确定模块401,用于确定日志的多行日志记录中每行日志记录的局部敏感哈希码;
第二确定模块402,用于确定至少一个第一日志记录组,不同所述第一日志记录组包括所述日志中的不同行日志记录;每个所述第一日志记录组包括的所有日志记录具有相同的局部敏感哈希码;
处理模块403,用于通过对所述至少一个第一日志记录组中每个第一日志记录组进行处理,得到所述日志的日志模板。
综上所述,本申请实施例中,第二确定模块通过每行日志记录的局部敏感哈希码进行日志记录的分组,而局部敏感哈希码又可以反映对应的不同行日志记录的相似度,如此分组达到了与聚类处理相同的效果,从而有效降低了运算复杂度。
可选的,如图22所示,第一确定模块401,包括:
获取子模块4011,用于获取所述日志中每行日志记录的至少一个词条;
第一确定子模块4012,用于基于所述每行日志记录的至少一个词条,确定所述每行日志记录的局部敏感哈希码。
可选的,每个所述词条包括m个语义单元,m为大于1的整数,所述语义单元为单词或符号,对于包括至少两个词条的日志记录,每两个相邻的词条中第一词条的后m-1个语义单元与第二词条的前m-1个语义单元相同,所述第一词条为所述第二词条的前一个词条。
可选的,所述第一确定模块401,用于:
将所述日志中每行日志记录中的p个指定字符替换为q个固定字符,得到更新的每行日志记录,1≤q<p;基于所述更新后的每行日志记录,确定所述每行日志记录的局部敏感哈希码。
可选的,所述第一确定子模块4012,用于:
基于第一日志记录的多个词条,以及为每个所述词条分配的权值,确定所述第一日志记录的局部敏感哈希码,所述第一日志记录为所述日志的多行日志记录中包括多个词条的任一日志记录,所述第一日志记录中包括的至少两个词条的权值互不相同,例如,每个词条的权值基于所述词条在所述第一日志记录中的位置确定。
可选的,所述第二确定模块402,用于:
基于所述日志中的每行日志记录的局部敏感哈希码以及所述每行日志记录的目标特征,对所述日志中的多行日志记录进行分组,得到所述至少一个第一日志记录组。
可选的,如图23所示,所述处理模块403,包括:
提取子模块4031,用于分别提取每个所述第一日志记录组中的日志模板;
第二确定子模块4032,用于基于每个所述第一日志记录组的日志模板,确定所述日志的日志模板。
可选的,所述提取子模块4031,用于:
对于每个所述第一日志记录组中的每行日志记录,将所述日志记录与所述第一日志记录组的历史日志模板进行比较;当所述日志记录与所述历史日志模板匹配,基于所述日志记录与所述历史日志模板,确定所述第一日志记录组的新的日志模板;当所述日志记录与所述历史日志模板不匹配,将从所述日志记录提取的日志模板,添加为所述第一日志记录组的新的日志模板。
可选的,所述第二确定子模块4032,用于:
对所述至少一个第一日志记录组的日志模板进行聚类处理,得到所述日志的日志模板;或者,对所述至少一个第一日志记录组的日志模板进行合并处理,得到所述日志的日志模板;或者,将所述至少一个第一日志记录组的日志模板,作为所述日志的日志模板。
可选的,所述处理模块403,包括:
第三确定子模块,用于基于每个所述第一日志记录组的目标日志记录,确定所述日志的日志模板,所述第一日志记录组的目标日志记录为所述第一日志记录组中的部分日志记录。
可选的,所述第一日志记录组的目标日志记录是在所述第一日志记录组中随机选择的一行日志记录。
可选的,所述第三确定子模块,用于:
确定至少一个第二日志记录组,不同所述第二日志记录组包括所述至少一个第一日志记录组对应的目标日志记录中的不同目标日志记录;每个所述第二日志记录组包括的所有目标日志记录具有相同的目标特征;通过分别对每个所述第二日志记录组进行处理,得到所述日志的日志模板。
可选的,所述第三确定子模块,用于:
对每个所述第二日志记录组中的日志记录进行聚类处理,得到每个所述第二日志记录组对应的至少一类日志记录;分别对所述聚类处理得到的每一类日志记录进行模板提取,得到所述每一类日志记录的日志模板;基于所述至少一类日志记录中每一类日志记录的日志模板,确定所述日志的日志模板。
可选的,所述第三确定子模块,用于:
将所述每一类日志记录的日志模板作为所述日志的日志模板;或者,对聚类处理得到各类日志记录的日志模板进行合并处理,将合并得到的日志模板,作为所述日志的日志模板。
可选的,所述日志记录的目标特征包括:日志记录的长度、日志记录的首字符和日志记录的首个单词中的至少一种。
可选地,图24示意性地提供本申请所述计算设备的一种可能的基本硬件架构。该计算设备可以为服务器。
参见图24,计算设备500包括处理器501、存储器502、通信接口503和总线504。
计算设备500中,处理器501的数量可以是一个或多个,图24仅示意了其中一个处理器501。可选地,处理器501,可以是中央处理器(central processing unit,CPU)。如果计算设备500具有多个处理器501,多个处理器501的类型可以不同,或者可以相同。可选地,计算设备500的多个处理器501还可以集成为多核处理器。
存储器502存储计算机指令和数据;存储器502可以存储实现本申请提供的日志模板提取方法所需的计算机指令和数据,例如,存储器502存储用于实现日志模板提取方法的步骤的指令。存储器502可以是以下存储介质的任一种或任一种组合:非易失性存储器(例如只读存储器(ROM)、固态硬盘(SSD)、硬盘(HDD)、光盘),易失性存储器。
通信接口503可以是以下器件的任一种或任一种组合:网络接口(例如以太网接口)、无线网卡等具有网络接入功能的器件。
通信接口503用于计算设备500与其它计算设备或者终端进行数据通信。
总线504可以将处理器501与存储器502和通信接口503连接。这样,通过总线504,处理器501可以访问存储器502,还可以利用通信接口503与其它计算设备或者终端进行数据交互。
在本申请中,计算设备500执行存储器502中的计算机指令,使得计算设备500实现本申请提供的日志模板提取方法,或者使得计算设备500部署日志模板提取装置。
在示例性实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器,上述指令可由服务器的处理器执行以完成本申请各个实施例所示的日志模板提取方法。例如,所述非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
本申请实施例提供一种分析系统,包括:终端和分析设备,该分析设备包括前述任一所述的日志模板提取装置。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现,所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机的可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质,或者半导体介质(例如固态硬盘)等。
需要说明的是:上述实施例提供的日志模板提取装置在进行日志模板提取时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的日志模板提取装置与日志模板提取方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。
在本申请中,术语“第一”和“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性。术语“多个”指两个或两个以上,除非另有明确的限定。A参考B,指的是A与B相同或者A为B的简单变形。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。

Claims (32)

  1. 一种日志模板提取方法,其特征在于,所述方法包括:
    确定日志的多行日志记录中每行日志记录的局部敏感哈希码;
    确定至少一个第一日志记录组,不同所述第一日志记录组包括所述日志中的不同行日志记录;每个所述第一日志记录组包括的所有日志记录具有相同的局部敏感哈希码;
    通过对所述至少一个第一日志记录组中每个所述第一日志记录组进行处理,得到所述日志的日志模板。
  2. 根据权利要求1所述的方法,其特征在于,所述确定日志的多行日志记录中每行日志记录的局部敏感哈希码,包括:
    获取所述日志中每行日志记录的至少一个词条;
    基于所述每行日志记录的至少一个词条,确定所述每行日志记录的局部敏感哈希码。
  3. 根据权利要求2所述的方法,其特征在于,每个所述词条包括m个语义单元,m为大于1的整数,所述语义单元为单词或符号,对于包括至少两个词条的日志记录,每两个相邻的词条中第一词条的后m-1个语义单元与第二词条的前m-1个语义单元相同,所述第一词条为所述第二词条的前一个词条。
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述确定日志的多行日志记录中每行日志记录的局部敏感哈希码,包括:
    将所述日志中每行日志记录中的p个指定字符替换为q个固定字符,得到更新的每行日志记录,1≤q<p;
    基于所述更新后的每行日志记录,确定所述每行日志记录的局部敏感哈希码。
  5. 根据权利要求2或3所述的方法,其特征在于,所述基于所述每行日志记录的至少一个词条,确定所述每行日志记录的局部敏感哈希码,包括:
    基于第一日志记录的多个词条,以及为每个所述词条分配的权值,确定所述第一日志记录的局部敏感哈希码,所述第一日志记录为所述日志的多行日志记录中包括多个词条的任一日志记录,所述第一日志记录中包括的至少两个词条的权值互不相同,每个词条的权值基于所述词条在所述第一日志记录中的位置确定。
  6. 根据权利要求1至5任一所述的方法,其特征在于,所述确定至少一个第一日志记录组,包括:
    基于所述日志中的每行日志记录的局部敏感哈希码以及所述每行日志记录的目标特征,对所述日志中的多行日志记录进行分组,得到所述至少一个第一日志记录组。
  7. 根据权利要求1至6任一所述的方法,其特征在于,所述通过对所述至少一个第一日志记录组中每个第一日志记录组进行处理,得到所述日志的日志模板,包括:
    分别提取每个所述第一日志记录组中的日志模板;
    基于每个所述第一日志记录组的日志模板,确定所述日志的日志模板。
  8. 根据权利要求7所述的方法,其特征在于,所述分别提取每个所述第一日志记录组中的日志模板,包括:
    对于每个所述第一日志记录组中的每行日志记录,将所述日志记录与所述第一日志记录组的历史日志模板进行比较;
    当所述日志记录与所述历史日志模板匹配,基于所述日志记录与所述历史日志模板,确定所述第一日志记录组的新的日志模板;
    当所述日志记录与所述历史日志模板不匹配,将从所述日志记录提取的日志模板,添加为所述第一日志记录组的新的日志模板。
  9. 根据权利要求7所述的方法,其特征在于,所述基于每个所述第一日志记录组的日志模板,确定所述日志的日志模板,包括:
    对所述至少一个第一日志记录组的日志模板进行聚类处理,得到所述日志的日志模板;
    或者,对所述至少一个第一日志记录组的日志模板进行合并处理,得到所述日志的日志模板;
    或者,将所述至少一个第一日志记录组的日志模板,作为所述日志的日志模板。
  10. 根据权利要求1至6任一所述的方法,其特征在于,所述通过对所述至少一个第一日志记录组中每个第一日志记录组进行处理,得到所述日志的日志模板,包括:
    基于每个所述第一日志记录组的目标日志记录,确定所述日志的日志模板,所述第一日志记录组的目标日志记录为所述第一日志记录组中的部分日志记录。
  11. 根据权利要求10所述的方法,其特征在于,所述第一日志记录组的目标日志记录是在所述第一日志记录组中随机选择的一行日志记录。
  12. 根据权利要求10所述的方法,其特征在于,所述基于每个所述第一日志记录组的目标日志记录,确定所述日志的日志模板,包括:
    确定至少一个第二日志记录组,不同所述第二日志记录组包括所述至少一个第一日志记录组对应的目标日志记录中的不同目标日志记录;每个所述第二日志记录组包括的所有目标日志记录具有相同的目标特征;
    通过分别对每个所述第二日志记录组进行处理,得到所述日志的日志模板。
  13. 根据权利要求12所述的方法,其特征在于,所述通过分别对每个所述第二日志记录组进行处理,得到所述日志的日志模板,包括:
    对每个所述第二日志记录组中的日志记录进行聚类处理,得到每个所述第二日志记录组对应的至少一类日志记录;
    分别对所述聚类处理得到的每一类日志记录进行模板提取,得到所述每一类日志记录的日志模板;
    基于所述至少一类日志记录中每一类日志记录的日志模板,确定所述日志的日志模板。
  14. 根据权利要求13所述的方法,其特征在于,所述基于所述至少一类日志记录中每一类日志记录的日志模板,确定所述日志的日志模板,包括:
    将所述每一类日志记录的日志模板作为所述日志的日志模板;
    或者,对聚类处理得到各类日志记录的日志模板进行合并处理,将合并得到的日志模板,作为所述日志的日志模板。
  15. 根据权利要求6或12所述的方法,其特征在于,所述日志记录的目标特征包括:日志记录的长度、日志记录的首字符和日志记录的首个单词中的至少一种。
  16. 一种日志模板提取装置,其特征在于,所述装置包括:
    第一确定模块,用于确定日志的多行日志记录中每行日志记录的局部敏感哈希码;
    第二确定模块,用于确定至少一个第一日志记录组,不同所述第一日志记录组包括所述日志中的不同行日志记录;每个所述第一日志记录组包括的所有日志记录具有相同的局部敏感哈希码;
    处理模块,用于通过对所述至少一个第一日志记录组中每个第一日志记录组进行处理,得到所述日志的日志模板。
  17. 根据权利要求16所述的装置,其特征在于,所述第一确定模块,包括:
    获取子模块,用于获取所述日志中每行日志记录的至少一个词条;
    第一确定子模块,用于基于所述每行日志记录的至少一个词条,确定所述每行日志记录的局部敏感哈希码。
  18. 根据权利要求17所述的装置,其特征在于,每个所述词条包括m个语义单元,m为大于1的整数,所述语义单元为单词或符号,对于包括至少两个词条的日志记录,每两个相邻的词条中第一词条的后m-1个语义单元与第二词条的前m-1个语义单元相同,所述第一词条为所述第二词条的前一个词条。
  19. 根据权利要求16至18任一所述的装置,其特征在于,所述第一确定模块,用于:
    将所述日志中每行日志记录中的p个指定字符替换为q个固定字符,得到更新的每行日志记录,1≤q<p;
    基于所述更新后的每行日志记录,确定所述每行日志记录的局部敏感哈希码。
  20. 根据权利要求17或18所述的装置,其特征在于,所述第一确定子模块,用于:
基于第一日志记录的多个词条,以及为每个所述词条分配的权值,确定所述第一日志记录的局部敏感哈希码,所述第一日志记录为所述日志的多行日志记录中包括多个词条的任一日志记录,所述第一日志记录中包括的至少两个词条的权值互不相同,每个词条的权值基于所述词条在所述第一日志记录中的位置确定。
  21. 根据权利要求16至20任一所述的装置,其特征在于,所述第二确定模块,用于:
    基于所述日志中的每行日志记录的局部敏感哈希码以及所述每行日志记录的目标特征,对所述日志中的多行日志记录进行分组,得到所述至少一个第一日志记录组。
  22. 根据权利要求16至21任一所述的装置,其特征在于,所述处理模块,包括:
    提取子模块,用于分别提取每个所述第一日志记录组中的日志模板;
    第二确定子模块,用于基于每个所述第一日志记录组的日志模板,确定所述日志的日志模板。
  23. 根据权利要求22所述的装置,其特征在于,所述提取子模块,用于:
    对于每个所述第一日志记录组中的每行日志记录,将所述日志记录与所述第一日志记录组的历史日志模板进行比较;
    当所述日志记录与所述历史日志模板匹配,基于所述日志记录与所述历史日志模板,确定所述第一日志记录组的新的日志模板;
    当所述日志记录与所述历史日志模板不匹配,将从所述日志记录提取的日志模板,添加为所述第一日志记录组的新的日志模板。
  24. 根据权利要求22所述的装置,其特征在于,所述第二确定子模块,用于:
    对所述至少一个第一日志记录组的日志模板进行聚类处理,得到所述日志的日志模板;
    或者,对所述至少一个第一日志记录组的日志模板进行合并处理,得到所述日志的日志模板;
    或者,将所述至少一个第一日志记录组的日志模板,作为所述日志的日志模板。
  25. 根据权利要求16至21任一所述的装置,其特征在于,所述处理模块,包括:
    第三确定子模块,用于基于每个所述第一日志记录组的目标日志记录,确定所述日志的日志模板,所述第一日志记录组的目标日志记录为所述第一日志记录组中的部分日志记录。
  26. 根据权利要求25所述的装置,其特征在于,所述第一日志记录组的目标日志记录是在所述第一日志记录组中随机选择的一行日志记录。
  27. 根据权利要求25所述的装置,其特征在于,所述第三确定子模块,用于:
    确定至少一个第二日志记录组,不同所述第二日志记录组包括所述至少一个第一日志记录组对应的目标日志记录中的不同目标日志记录;每个所述第二日志记录组包括的所有目标日志记录具有相同的目标特征;
    通过分别对每个所述第二日志记录组进行处理,得到所述日志的日志模板。
  28. 根据权利要求27所述的装置,其特征在于,所述第三确定子模块,用于:
    对每个所述第二日志记录组中的日志记录进行聚类处理,得到每个所述第二日志记录组对应的至少一类日志记录;
    分别对所述聚类处理得到的每一类日志记录进行模板提取,得到所述每一类日志记录的日志模板;
    基于所述至少一类日志记录中每一类日志记录的日志模板,确定所述日志的日志模板。
  29. 根据权利要求28所述的装置,其特征在于,所述第三确定子模块,用于:
    将所述每一类日志记录的日志模板作为所述日志的日志模板;
    或者,对聚类处理得到各类日志记录的日志模板进行合并处理,将合并得到的日志模板,作为所述日志的日志模板。
  30. 根据权利要求21或27所述的装置,其特征在于,所述日志记录的目标特征包括:日志记录的长度、日志记录的首字符和日志记录的首个单词中的至少一种。
  31. 一种计算机设备,其特征在于,包括处理器和存储器;
    在所述处理器执行所述存储器存储的计算机指令时,所述计算机设备执行权利要求1至15任一所述的模板提取方法。
  32. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质包括计算机指令,所述计算机指令指示计算机设备执行权利要求1至15任一所述的模板提取方法。
PCT/CN2020/096134 2019-10-12 2020-06-15 日志模板提取方法及装置 WO2021068547A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201910969835.0 2019-10-12
CN201910969835 2019-10-12
CN201911215541.5A CN111160021A (zh) 2019-10-12 2019-12-02 日志模板提取方法及装置
CN201911215541.5 2019-12-02

Publications (1)

Publication Number Publication Date
WO2021068547A1 true WO2021068547A1 (zh) 2021-04-15

Family

ID=70556284

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096134 WO2021068547A1 (zh) 2019-10-12 2020-06-15 日志模板提取方法及装置

Country Status (2)

Country Link
CN (1) CN111160021A (zh)
WO (1) WO2021068547A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220644A (zh) * 2021-05-28 2021-08-06 北京微纳星空科技有限公司 一种文件处理方法、装置、设备及存储介质
CN113535955A (zh) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 一种日志快速归类方法及装置
CN115329748A (zh) * 2022-10-14 2022-11-11 北京优特捷信息技术有限公司 一种日志解析方法、装置、设备及存储介质
CN115860836A (zh) * 2022-12-07 2023-03-28 广东南粤分享汇控股有限公司 一种基于用户行为大数据分析的电商服务推送方法及系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160021A (zh) * 2019-10-12 2020-05-15 华为技术有限公司 日志模板提取方法及装置
CN111737950B (zh) * 2020-08-27 2020-12-08 北京安帝科技有限公司 一种电厂区域设备异常判断方法
CN112068979B (zh) * 2020-09-11 2021-10-08 重庆紫光华山智安科技有限公司 一种业务故障确定方法及装置
CN116226681B (zh) * 2023-02-22 2023-11-28 北京麦克斯泰科技有限公司 一种文本相似性判定方法、装置、计算机设备和存储介质
CN116346729B (zh) * 2023-02-24 2024-02-09 安芯网盾(北京)科技有限公司 一种数据日志上报的限流方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220663A1 (en) * 2016-01-29 2017-08-03 AppDynamics, Inc. Log Event Summarization for Distributed Server System
CN107659566A (zh) * 2017-09-20 2018-02-02 深圳市创梦天地科技股份有限公司 对服务器异常访问的识别频率确定方法、装置及服务器
CN109981625A (zh) * 2019-03-18 2019-07-05 中国人民解放军陆军炮兵防空兵学院郑州校区 一种基于在线层次聚类的日志模板抽取方法
CN111160021A (zh) * 2019-10-12 2020-05-15 华为技术有限公司 日志模板提取方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049247B (zh) * 2015-07-06 2019-04-26 中国科学院信息工程研究所 一种网络安全日志模板抽取方法及装置
CN105205397B (zh) * 2015-10-13 2018-10-16 北京奇安信科技有限公司 恶意程序样本分类方法及装置
CN109144964A (zh) * 2018-08-21 2019-01-04 杭州安恒信息技术股份有限公司 基于机器学习的日志解析方法和装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220663A1 (en) * 2016-01-29 2017-08-03 AppDynamics, Inc. Log Event Summarization for Distributed Server System
CN107659566A (zh) * 2017-09-20 2018-02-02 深圳市创梦天地科技股份有限公司 对服务器异常访问的识别频率确定方法、装置及服务器
CN109981625A (zh) * 2019-03-18 2019-07-05 中国人民解放军陆军炮兵防空兵学院郑州校区 一种基于在线层次聚类的日志模板抽取方法
CN111160021A (zh) * 2019-10-12 2020-05-15 华为技术有限公司 日志模板提取方法及装置

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220644A (zh) * 2021-05-28 2021-08-06 北京微纳星空科技有限公司 一种文件处理方法、装置、设备及存储介质
CN113220644B (zh) * 2021-05-28 2022-04-26 北京微纳星空科技有限公司 一种文件处理方法、装置、设备及存储介质
CN113535955A (zh) * 2021-07-16 2021-10-22 中国工商银行股份有限公司 一种日志快速归类方法及装置
CN113535955B (zh) * 2021-07-16 2022-10-28 中国工商银行股份有限公司 一种日志快速归类方法及装置
CN115329748A (zh) * 2022-10-14 2022-11-11 北京优特捷信息技术有限公司 一种日志解析方法、装置、设备及存储介质
CN115329748B (zh) * 2022-10-14 2023-01-10 北京优特捷信息技术有限公司 一种日志解析方法、装置、设备及存储介质
CN115860836A (zh) * 2022-12-07 2023-03-28 广东南粤分享汇控股有限公司 一种基于用户行为大数据分析的电商服务推送方法及系统
CN115860836B (zh) * 2022-12-07 2023-09-26 广东南粤分享汇控股有限公司 一种基于用户行为大数据分析的电商服务推送方法及系统

Also Published As

Publication number Publication date
CN111160021A (zh) 2020-05-15

Similar Documents

Publication Publication Date Title
WO2021068547A1 (zh) 日志模板提取方法及装置
US10474513B2 (en) Cluster-based processing of unstructured log messages
US11238069B2 (en) Transforming a data stream into structured data
CN111612041B (zh) 异常用户识别方法及装置、存储介质、电子设备
US11113317B2 (en) Generating parsing rules for log messages
US8793120B1 (en) Behavior-driven multilingual stemming
US11392620B2 (en) Clustering log messages using probabilistic data structures
US9633088B1 (en) Event log versioning, synchronization, and consolidation
US10754830B2 (en) Activity information schema discovery and schema change detection and notification
US20190228085A1 (en) Log file pattern identifier
EP3591585A1 (en) Systems and methods for a data search engine based on data profiles
WO2022222943A1 (zh) 科室推荐方法、装置、电子设备及存储介质
US20200112475A1 (en) Real-time adaptive infrastructure scenario identification using syntactic grouping at varied similarity
WO2021109724A1 (zh) 日志异常检测方法及装置
CN114461792A (zh) 告警事件关联方法、装置、电子设备、介质及程序产品
US20230252140A1 (en) Methods and systems for identifying anomalous computer events to detect security incidents
CN117093556A (zh) 日志分类方法、装置、计算机设备及计算机可读存储介质
CN115051863B (zh) 异常流量检测的方法、装置、电子设备及可读存储介质
CN115203435A (zh) 基于知识图谱的实体关系生成方法及数据查询方法
CN113128213A (zh) 日志模板提取方法及装置
Boyagane et al. vue4logs--Automatic Structuring of Heterogeneous Computer System Logs
JP5020274B2 (ja) 意味ドリフトの発生評価方法及び装置
WO2023185377A1 (zh) 一种多粒度数据模式挖掘方法及相关设备
US20230073627A1 (en) Analytics database and monitoring system for structuring and storing data streams
US11244007B2 (en) Automatic adaption of a search configuration

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20874935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20874935

Country of ref document: EP

Kind code of ref document: A1