CN111160021A - Log template extraction method and device


Info

Publication number
CN111160021A
Authority
CN
China
Prior art keywords
log
template
record
log record
records
Prior art date
Legal status
Pending
Application number
CN201911215541.5A
Other languages
Chinese (zh)
Inventor
王琛
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN111160021A
Priority to PCT/CN2020/096134 (WO2021068547A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs

Abstract

The application discloses a log template extraction method and device, and belongs to the field of computer technologies. The method includes: determining a locality-sensitive hash code of each row of log records among a plurality of rows of log records of a log; determining at least one first log record group, where different first log record groups comprise different rows of log records of the log and each first log record group comprises all log records having the same locality-sensitive hash code; and obtaining a log template of the log by processing each first log record group of the at least one first log record group. The method and device solve the problem that existing log template extraction methods have a high computational cost, and are applied to extracting log templates from logs.

Description

Log template extraction method and device
The present application claims priority to Chinese Patent Application No. 201910969835.0, entitled "Method, apparatus, server, and storage medium for log pattern extraction", filed on 12/10/2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting a log template.
Background
By adding specific logging code to software source code, the real-time state of running software can be recorded in text called a log. Software developers (or operation and maintenance staff) can learn the real-time status of running software by reading the log.
A log comprises a plurality of rows of log records (also called log statements), each row of log records recording an event that occurred while the software was running. The log records in a log usually have an implicit log template (schema), that is, the pattern or format of the records themselves. Based on differences in their log templates, logs can be divided into homogeneous logs and heterogeneous logs. A homogeneous log is one in which all rows of log records share the same log template; a heterogeneous log is one in which the rows of log records do not share a uniform log template. By identifying the log template of a log, functions such as quickly searching for key data in the log can be implemented.
At present, for a heterogeneous log, one method of extracting its log templates is as follows: performing word segmentation (tokenization) on each row of log records to obtain a plurality of entries (tokens); performing hierarchical clustering on the log records of the log based on the word segmentation result of each row of log records to obtain multiple classes of log records; and performing template extraction on each class of log records, the obtained log templates of the classes serving as the log templates of the heterogeneous log.
However, clustering requires multiple passes of computation to obtain the multiple classes of log records, so the computational cost is high.
Disclosure of Invention
Embodiments of this application provide a log template extraction method and device, which can solve the problem that existing log template extraction methods have a high computational cost. The technical solutions are as follows:
In a first aspect, a log template extraction method is provided, the method including:
determining a locality-sensitive hash code of each row of log records among a plurality of rows of log records of a log; determining at least one first log record group, where different first log record groups comprise different rows of log records of the log and each first log record group comprises all log records having the same locality-sensitive hash code; and obtaining a log template of the log by processing each first log record group of the at least one first log record group.
The log records are grouped by the locality-sensitive hash code of each row of log records. Because the locality-sensitive hash codes reflect the similarity of the corresponding rows of log records, the grouping achieves the same effect as clustering while the computational complexity is effectively reduced.
In addition, in embodiments of this application, the locality-sensitive hash code is a characteristic of an individual log record: obtaining the locality-sensitive hash code of a row of log records does not require considering the other rows. The rows of log records in the log are therefore decoupled during grouping, so for one log the grouping of the multiple rows of log records can be executed in parallel, which effectively reduces computation latency and improves computational efficiency.
When there are multiple first log record groups, the log template of the log is obtained by processing each first log record group, and the processing of the different first log record groups can be executed in parallel, which reduces computation latency. Moreover, the amount of data handled in each processing pass is far smaller than the overall amount of data of the log, so the computational cost is effectively reduced and computational efficiency is improved.
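For illustration only, the grouping step described above can be sketched in Python as follows; the helper lsh_code() stands for whatever locality-sensitive hash computation is used (for example, the Simhash-style procedure described later) and is an assumption of this sketch rather than part of the claimed method.

    from collections import defaultdict

    def group_by_lsh(log_lines, lsh_code):
        """Group log lines whose locality-sensitive hash codes are identical.

        lsh_code is assumed to be a callable mapping one log line to its
        hash code (e.g. a bit string). Because the code of each line depends
        only on that line, this loop could be parallelized across lines.
        """
        groups = defaultdict(list)   # hash code -> one "first log record group"
        for line in log_lines:
            groups[lsh_code(line)].append(line)
        return list(groups.values())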
In one possible implementation, the determining a locality-sensitive hash code for each of a plurality of rows of log records of a log comprises:
acquiring at least one entry of each row of log records in the log; and determining the locality-sensitive hash code of each row of log records based on the at least one entry of that row.
Word segmentation aims to cut each row of log records into a sequence of entries. Word segmentation reduces the processing complexity of the log records, lowers the computational cost of the subsequent locality-sensitive hashing, and improves computational efficiency. In embodiments of this application, word segmentation may be performed in different ways, for example using space-based segmentation, special-character segmentation, or natural-language segmentation.
In one alternative, each entry resulting from word segmentation includes only one semantic unit. This segmentation manner is simple, easy to implement, and fast.
In another alternative, each entry resulting from word segmentation includes a plurality of semantic units. That is, each entry includes m semantic units, where m is an integer greater than 1 and a semantic unit is a word or a symbol. For a log record that includes at least two entries, in every two adjacent entries the last m-1 semantic units of the first entry are the same as the first m-1 semantic units of the second entry, the first entry being the one preceding the second entry.
The entries obtained in this manner may reduce undesirable hash collisions.
In one possible implementation, the determining a locality-sensitive hash code for each of a plurality of rows of log records of a log comprises: replacing p designated characters in each row of log records of the log with q fixed characters to obtain an updated version of each row of log records, where 1 ≤ q < p; and determining the locality-sensitive hash code of each row of log records based on the updated row.
For any log record, because a plurality of characters are replaced by a smaller number of fixed characters, the number of characters contained in the log record is reduced, which effectively reduces the computational complexity of the subsequent locality-sensitive hashing.
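As a minimal sketch of this preprocessing, assuming purely for illustration that the designated characters are runs of decimal digits and the fixed character is "0":

    import re

    def normalize_line(line):
        """Replace each run of designated characters (here: digits, an
        illustrative assumption) with the single fixed character '0',
        shrinking the line before the locality-sensitive hash is computed."""
        return re.sub(r"\d+", "0", line)

    # "User 025862 login at 2018-12-03 02:03:00" -> "User 0 login at 0-0-0 0:0:0"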
In one possible implementation, the determining the locality-sensitive hash code for each row of log records based on at least one entry of the each row of log records includes:
determining a locality sensitive hash code of a first log record based on a plurality of entries of the first log record and a weight value allocated to each entry, wherein the first log record is any one of a plurality of rows of log records of the log, and the weight values of at least two entries included in the first log record are different from each other. For example, the weight of each entry is determined based on the location of the entry in the first log record.
Setting different weights for different entries in each row of log records reduces the two kinds of undesirable hash collision described hereinafter. In the target locality-sensitive hashing algorithm provided in embodiments of this application, the weight of each entry of a row of log records may be related to an attribute of the entry. Entries of the constant part usually outnumber entries of the variable part, and the leading entries of a log record usually belong to the constant part. Therefore, the weights of the first g entries of a log record may be set greater than the weights of the other entries, where g < k, g is a positive integer, and k is the length of the log record. For example, the weights of the first g entries decrease while the other weights are equal and smaller than the minimum weight of the first g entries; g may be 1. In this way, the weight of an entry is associated with its position attribute, the locality-sensitive hash code can be computed more accurately, and undesirable hash collisions are further reduced.
In one possible implementation, the determining at least one first log record group includes: and grouping a plurality of rows of log records in the log based on the locality sensitive hash code of each row of log records in the log and the target characteristics of each row of log records to obtain at least one first log record group.
In this way, additional grouping features are added on top of the locality-sensitive hash code, which can improve grouping precision and ensure higher similarity among the log records assigned to the same first log record group.
In one possible implementation, the obtaining a log template of the log by processing each first log record group of the at least one first log record group includes: respectively extracting a log template in each first log record group; determining a log template for the log based on the log template for each of the first log record groups.
In this way, the log template of the log is determined from the log templates extracted from the first log record groups, and the log records in the first log record groups do not need to participate directly in the computation of the log template of the log, so the solution space is reduced exponentially and computational efficiency is effectively improved.
In one possible implementation, the separately extracting the log template in each of the first log record groups includes:
for each row of log records in each first log record group, comparing the log record with a historical log template of the first log record group; when the log record matches the historical log template, determining a new log template of the first log record group based on the log record and the historical log template; and when the log record does not match the historical log template, adding a log template extracted from the log record as a new log template of the first log record group.
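The following sketch illustrates one possible form of such per-group template maintenance; the token-wise similarity test and the 0.5 threshold are illustrative assumptions and are not specified by the application.

    WILDCARD = "*"

    def similarity(record_tokens, template_tokens):
        """Fraction of positions at which a tokenized record agrees with a
        template (wildcard positions count as agreement); 0 if lengths differ."""
        if len(record_tokens) != len(template_tokens) or not template_tokens:
            return 0.0
        same = sum(t == WILDCARD or t == r
                   for r, t in zip(record_tokens, template_tokens))
        return same / len(template_tokens)

    def update_group_templates(templates, record_tokens, threshold=0.5):
        """Match a tokenized record against the group's historical templates:
        merge it into the first sufficiently similar template (differing
        positions become wildcards), otherwise add it as a new template."""
        for i, tpl in enumerate(templates):
            if similarity(record_tokens, tpl) >= threshold:
                templates[i] = [t if t == r else WILDCARD
                                for r, t in zip(record_tokens, tpl)]
                return templates
        templates.append(list(record_tokens))
        return templates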
In one possible implementation, the determining a log template for the log based on the log template for each of the first log record groups includes: clustering the log templates of the at least one first log record group to obtain the log templates of the logs; or, the log templates of the at least one first log record group are merged to obtain the log template of the log; or, the log template of the at least one first log record group is used as the log template of the log.
In one possible implementation, the obtaining a log template of the log by processing each first log record group of the at least one first log record group includes: determining a log template for the log based on a target log record of each of the first log record groups, the target log record of the first log record group being a partial log record of the first log record group.
Processing the target log records is equivalent to processing all log records in the first log record group, but the amount of data actually processed is effectively reduced; that is, the data are sampled and the solution space is reduced, which further reduces the computational cost.
In one possible implementation, the target log record of the first set of log records is a randomly selected row of log records in the first set of log records. This ensures that the probability of each row of log records in the first set of log records being selected as target log records is equal.
In one possible implementation, the determining a log template for the log based on the target log record of each of the first log record groups includes: determining at least one second log record group, wherein different second log record groups comprise different target log records in the target log records corresponding to the at least one first log record group; each second log record group comprises all target log records with the same target characteristics; and processing each second log record group respectively to obtain a log template of the log. The solution space can be further reduced by grouping.
When the second log record groups have multiple groups, the processing processes of the second log record groups can be executed in parallel, so that the operation time delay is reduced, the data volume required to be operated is far smaller than the integral data volume of the log when the processing processes are executed every time, the operation cost is effectively reduced, and the operation efficiency is improved.
In one possible implementation, the obtaining the log template of the log by separately processing each second log record group includes: clustering the log records in each second log record group to obtain at least one type of log records corresponding to each second log record group; respectively carrying out template extraction on each type of log record obtained by clustering to obtain a log template of each type of log record; and determining a log template of the log based on the log template of each log record in the at least one type of log record.
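A simple single-pass clustering of the log records in a second log record group could look like the sketch below, reusing the similarity helper sketched earlier; the 0.7 threshold and the choice of a cluster's first record as its representative are illustrative assumptions.

    def cluster_records(token_rows, threshold=0.7):
        """Assign each tokenized record to the first cluster whose
        representative (its first record) is similar enough; otherwise
        start a new cluster. Each resulting cluster is one class of log
        records from which a template can then be extracted."""
        clusters = []
        for row in token_rows:
            for cluster in clusters:
                if similarity(row, cluster[0]) >= threshold:
                    cluster.append(row)
                    break
            else:
                clusters.append([row])
        return clusters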
When the log records in the second log record group have multiple types, the processing processes of the log records can be executed in parallel, so that the operation time delay is reduced, the data amount required to be operated is small when the processing process is executed every time, the operation cost is effectively reduced, and the operation efficiency is improved.
In one possible implementation, the determining a log template for the log based on the log template for each of the at least one type of log record includes: taking the log template of each type of log record as the log template of the log; or merging the log templates of various log records obtained by clustering, and taking the merged log template as the log template of the log.
In one possible implementation, the target characteristics of a log record include at least one of: the length of the log record, the first character of the log record, and the first word of the log record.
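For illustration, a grouping key covering the three target characteristics listed above might look like the following; which subset is actually used, and whether length is measured in characters or in entries, is an implementation choice and an assumption of this sketch.

    def target_features(record_text):
        """(length, first character, first word) of a log record; any one of
        these, or a combination, can serve as the target characteristic."""
        tokens = record_text.split()
        return (len(record_text),                 # length in characters (assumed)
                record_text[:1],                  # first character
                tokens[0] if tokens else "")      # first word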
In a second aspect, an apparatus for extracting a log template is provided, where the apparatus may include at least one module, and the at least one module may be configured to implement the log template extracting method provided in the first aspect or various possible implementations of the first aspect.
In a third aspect, the present application provides a computer device comprising a processor and a memory. The memory stores computer instructions; when the processor executes the computer instructions stored in the memory, the computer device executes the methods provided by the first aspect or the various possible implementations of the first aspect, so that the computer device deploys the log template extraction apparatus provided by the second aspect or the various possible implementations of the second aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, in which computer instructions are stored, and the computer instructions instruct a computer device to execute the method provided by the first aspect or the various possible implementations of the first aspect, or instruct the computer device to deploy the log template extraction apparatus provided by the second aspect or the various possible implementations of the second aspect.
In a fifth aspect, the present application provides a computer program product comprising computer instructions stored in a computer readable storage medium. A processor of the computer device may read the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method provided by the first aspect or the various possible implementations of the first aspect, so that the computer device deploys the log template extraction apparatus provided by the second aspect or the various possible implementations of the second aspect.
In a sixth aspect, an analysis system is provided, including a terminal and an analysis device, where the analysis device includes the log template extraction apparatus provided in the second aspect or any of its possible implementations, or the computer device of the third aspect.
In a seventh aspect, a chip is provided. The chip may include a programmable logic circuit and/or program instructions, and when the chip runs, it is configured to implement the log template extraction method according to the first aspect or any of its possible implementations.
In the embodiment of the application, the log records are grouped through the locality sensitive hash code of each row of log records, and the locality sensitive hash code can reflect the similarity of the corresponding log records of different rows, so that the grouping achieves the same effect as the clustering treatment, and the operation complexity is effectively reduced. In addition, in the embodiment of the application, the locality sensitive hash codes and the target characteristics are characteristics of the log records, and when the locality sensitive hash codes and the target characteristics of each row of log records are obtained, other row of log records do not need to be considered. Therefore, the decorrelation of each row of log records in the log in the grouping process is realized. Therefore, for one log, the grouping process of a plurality of rows of log records can be executed in parallel, the operation time delay is effectively reduced, and the operation efficiency is improved. When the first log record groups have multiple groups, the processing processes of the first log record groups can be executed in parallel, so that the operation time delay is reduced, the data volume required to be operated is far smaller than the data volume of the whole log when the processing processes are executed every time, the operation cost is effectively reduced, and the operation efficiency is improved.
According to the log template extraction method, most log records are screened out, so the solution space is reduced exponentially, the computational cost is effectively reduced, and computational efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of part of the log content in a log according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application environment involved in a log template extraction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an application environment involved in another log template extraction method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a locality-sensitive hashing algorithm according to an embodiment of the present application;
FIG. 5 is a flowchart of a log template extraction method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a word segmentation result involved in a log template extraction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a word segmentation result involved in another log template extraction method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a word segmentation result involved in yet another log template extraction method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the process of obtaining the locality-sensitive hash codes of log records X3 and X4 according to an embodiment of the present application;
FIG. 10 is a schematic diagram of the calculation process of the locality-sensitive hash codes of log records X3 and X4 shown in FIG. 9;
FIG. 11 is a schematic diagram of a word segmentation result of log records X7 and X8 according to an embodiment of the present application;
FIG. 12 is a schematic diagram of another word segmentation result of log records X7 and X8 according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a grouping process of first log record groups according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a template of a first log record group according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a word segmentation result according to an embodiment of the present application;
FIG. 16 is a schematic diagram of a locality-sensitive hash code obtaining result according to an embodiment of the present application;
FIG. 17 is a schematic diagram of grouping results of first log record groups according to an embodiment of the present application;
FIG. 18 is a schematic diagram of target log records of a first log record group according to an embodiment of the present application;
FIG. 19 is a schematic diagram of grouping results of second log record groups according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a log template according to an embodiment of the present application;
FIG. 21 is a schematic diagram of a log template extraction apparatus according to an embodiment of the present application;
FIG. 22 is a schematic diagram of a first determining module according to an embodiment of the present application;
FIG. 23 is a schematic diagram of a processing module according to an embodiment of the present application;
FIG. 24 is a schematic diagram of a computing device according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the embodiments of this application in detail with reference to the accompanying drawings.
A log records the real-time state of running software; by analyzing the log, the real-time status of the software can be learned, or anomaly detection can be performed on the log. In embodiments of this application, log analysis scenarios include offline analysis scenarios and online analysis scenarios. In an offline analysis scenario, the log data to be analyzed may be batch log data, such as a log file or log data obtained by querying a log database; in an online analysis scenario, the log data being analyzed may be real-time log data, also referred to as log stream data. A log file may be a file downloaded by a user, a software developer, or operation and maintenance staff, or a file found by a keyword search.
As shown in FIG. 1, FIG. 1 is a schematic diagram of part of the log content in a log. The log includes multiple rows of log records (also called log text), and each row of log records is used to record an event during the software's runtime. Each row of log records consists of a plurality of characters, which may include letters and/or symbols, etc. A row of log records includes a constant part and a variable part, each comprising at least one character. For example, assume that the pseudo code corresponding to a certain log record is defined to record information about a user login: info('User %d login at %s', $uid, $time), where "$" in the pseudo code marks a variable part (also called a variable name). For example, "$IP" denotes an Internet Protocol (IP) address. As can be seen from the pseudo code corresponding to the log record, the log record contains two variable parts, namely the user name (uid) and the login time (time) of each login.
The log records in a log typically have an implicit log template (also called a log pattern), which refers to the standard pattern, or fixed format, used to generate the log records in the log. For example, after the pseudo code corresponding to the log record above is actually run, multiple rows of log records recording user-login information are output into the log. In embodiments of this application, the log in which the following multiple rows of log records are located is referred to as a first log:
“User 025862 login at 2018-12-03 02:03:00
User 045210 login at 2018-12-04 02:03:15
User 033658 login at 2018-12-05 02:03:38
User 010100 login at 2018-12-06 02:04:06
User 023025 login at 2018-12-07 02:04:51
User 046523 login at 2018-12-08 02:05:22”.
The log template of a log is the log template of the log records in the log. Generally, when log template extraction is performed on a log record, if a variable part of the log record is identified, the variable part is marked with a preset variable identifier; the marking essentially replaces the variable part with the variable identifier. The variable identifier is typically a wildcard character, such as "*". For example, when log template extraction is performed on the multiple rows of log records of the first log, each variable part may be replaced with the wildcard character, and the obtained log template of each row of log records is "User * login at *", so the log template of the first log is "User * login at *". It should be noted that, in the matching processes described below, the variable identifier may be determined to be identical to any character or entry. For example, when "*" is matched with "046523", it may be determined that "*" is the same as "046523".
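As a concrete illustration of how such a wildcard template could be obtained from the first log, the sketch below compares the space-separated tokens of the lines position by position and replaces positions that differ with the wildcard. Note that under this simple space-based tokenization the timestamp splits into two variable tokens, so the sketch yields "User * login at * *" rather than the single trailing wildcard shown above; the helper name is an assumption of this sketch.

    def extract_template(lines, wildcard="*"):
        """Token-wise template of same-length log lines: positions whose
        tokens agree across all lines are kept, differing positions become
        the wildcard."""
        token_rows = [line.split() for line in lines]  # assumes equal token counts
        return " ".join(col[0] if len(set(col)) == 1 else wildcard
                        for col in zip(*token_rows))

    first_log = [
        "User 025862 login at 2018-12-03 02:03:00",
        "User 045210 login at 2018-12-04 02:03:15",
        "User 033658 login at 2018-12-05 02:03:38",
    ]
    print(extract_template(first_log))   # User * login at * *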
Logs can be divided into homogeneous logs and heterogeneous logs. The first log in the foregoing example is a homogeneous log: its log records have a uniform log template, and the log template can be extracted using a regular expression. The multiple rows of log records of a heterogeneous log do not have a uniform log template, and service logs are generally heterogeneous logs. For example, part of the content of a heterogeneous log is shown below (the "#" in the heterogeneous log marks a row number; in practical applications, the log content may not include the row-number mark and the specific row number; this example is mainly for the reader's convenience, so the row-number marks and row numbers are omitted in the subsequent processing). In embodiments of this application, the log in which the following multiple rows of log records are located is referred to as a second log:
“#0 mod_jk child workerEnv in error state 6
#1 jk2_init() Found child 6740 in scoreboard slot 7
#2 jk2_init() Found child 6741 in scoreboard slot 8
#3 workerEnv.init() ok /etc/httpd/conf/workers2.properties
#4 mod_jk child workerEnv in error state 7
#5 workerEnv.init() ok /etc/httpd/conf/workers2.properties”.
Because the rows of log records of a heterogeneous log have no uniform log template, its template extraction process is more complicated. The current extraction process includes hierarchical clustering of the log records in the log; because the amount of data to be processed is large, the clustering algorithm needs multiple passes of computation to obtain the multiple classes of log records, so the computational cost is high.
Embodiments of this application provide a log template extraction method that can reduce the computational cost of the log template extraction process. Referring to FIG. 2, FIG. 2 is a schematic diagram of an application environment involved in a log template extraction method according to an embodiment of this application. The application environment includes a terminal 110, an analysis device 120, and a network device 130.
The terminal 110 may be a display, a computer, a smartphone, a tablet computer, a laptop computer, or another device capable of interacting with a user. The analysis device 120 may be a server, a server cluster composed of several servers, or another device capable of performing data analysis. Alternatively, the analysis device 120 may be a cloud server (also referred to as a cloud computing server), for example a deep learning server that provides a deep learning service (DLS). The terminal 110 establishes a wired or wireless communication connection with the analysis device 120 through a communication network. The network device 130 may be a sensor or a terminal that can run software and generate log data. The network device 130 is configured to provide the analysis device 120 with the data to be analyzed, the analysis device 120 is configured to analyze the log data, and the terminal 110 is configured to present the analysis result to a user. The communication network in embodiments of this application is a second-generation (2G) communication network, a third-generation (3G) communication network, a long term evolution (LTE) communication network, a fifth-generation (5G) communication network, or the like.
Optionally, the foregoing application environment may further include a storage device configured to store data for the terminal 110, the analysis device 120, and/or the network device 130. The storage device may be a distributed storage device, and the terminal 110, the analysis device 120, and/or the network device 130 may read and write the data stored in it. When there is a large amount of data in the application scenario, having the storage device store the data reduces the load on the analysis device and improves its data analysis efficiency. When the amount of data in the application environment is small, the storage device may be omitted; in this case, the functions of the terminal 110 and the analysis device 120 may also be implemented by the same device, such as a computer.
As shown in fig. 3, the application environment includes two parts, a foreground 201 and a background 202. The foreground 201 is used for presenting data to a user, receiving data input by the user, and realizing interaction with the user; the background 202 is used for performing data interaction with the foreground 201, and performing management operation and/or data processing and the like. Wherein, the foreground 201 may be deployed in the aforementioned terminal 110. The background 202 may be deployed in the aforementioned analysis device 120. For example, a client, a script, or a browser may be installed in the terminal 110 to implement the deployment of the foreground 201. As such, the terminal 110 may present the user interface in the form of a client interface, a terminal interface, or a web page corresponding to a browser.
The log template extraction method provided in embodiments of this application can be used in scenarios such as software debugging, performance optimization, or service analysis, and in particular in the anomaly-detection scenarios within these scenarios. Anomaly detection refers to detecting patterns that do not match expectations. In embodiments of this application, the data source for anomaly detection is log data generated by software running in an application, a process, an operating system, a device, or a network. For example, the aforementioned analysis device 120 may employ a deep learning algorithm to perform anomaly detection on the log data. It is worth noting that the log template extraction method provided in embodiments of this application may also be used in other scenarios, such as log compression and keyword retrieval, which is not limited in embodiments of this application.
A Locality Sensitive Hash (LSH) code is a Hash code obtained based on a Locality Sensitive Hash algorithm. The locality sensitive hash code can reflect the similarity of the data (which may be referred to as input data) that needs to be processed using the locality sensitive hash algorithm. In this embodiment, the data may be the data recorded in the log. The locality sensitive hashing algorithm may maintain similarity relationships between input data. As shown in fig. 4, for similar input data, the obtained locality sensitive hash codes (which may be referred to as output data) are also very similar; for scenarios where the input data is very similar, the resulting locality sensitive hash code even generates hash collisions: that is, the output locality-sensitive hash codes are identical for different but similar input data. In the embodiment of the application, the log template is extracted based on the characteristic of the locality sensitive hash code. As shown in fig. 5, an embodiment of the present application provides a log template extraction method, which is applied to the application environment shown in fig. 2 or fig. 3. Fig. 5 illustrates an example in which the method is applied to an anomaly detection scenario, where the method includes:
step 301, the analysis device obtains a log, wherein the log comprises a plurality of rows of log records.
As mentioned above, log data takes two forms: batch log data and real-time log data. In embodiments of this application, the analysis device supports analyzing logs in both forms. In one optional example, the analysis device periodically obtains a log file, or obtains a log file within a specified time period, to obtain the batch log data; the specified time period may be a low-power-consumption time period of the terminal and/or the server (that is, a time period in which power consumption is less than a specified power consumption threshold), which reduces the influence of obtaining the log file and of the subsequent log analysis on other functions of the terminal and/or the server. In another optional example, the analysis device continuously acquires real-time log data. In yet another optional example, the analysis device obtains batch log data or real-time log data after receiving an analysis instruction. The analysis instruction may be triggered by a user at the terminal and sent by the terminal to the analysis device.
When the analysis device acquires and analyzes a log stream in real time, the log stream can be monitored promptly; if an anomaly occurs in the log stream, it can be found and reported in time, which improves the effectiveness of anomaly detection, avoids large-scale anomalies, and improves user experience.
The log template extraction method provided in embodiments of this application can be used to extract log templates of homogeneous logs as well as of heterogeneous logs. In one alternative, the analysis device may perform step 302 directly after performing step 301. In another alternative, because the method is more efficient when applied to template extraction of heterogeneous logs, the type of the log may be detected first: if the log is a homogeneous log, a regular expression is used to extract its template; if the log is a heterogeneous log, the subsequent step 302 is performed.
Step 302, the analysis device determines the locality sensitive hash code of each row of log records in a plurality of rows of log records of the log.
The analysis device may determine the locality-sensitive hash code in multiple ways; the following optional implementations are described as examples in embodiments of this application:
in a first alternative implementation, the determining the locality-sensitive hash code for each log record in the multiple log records of the log may include:
step A1, the analysis device obtains at least one entry (token) of each row of log record in the log.
Optionally, the analysis device may perform word segmentation on each row of log records in the log by using a word segmentation technology, so as to obtain at least one entry of each row of log records after word segmentation. In general, a row of log records may be divided into at least two entries; in a few cases, one row of log records can be divided to obtain one entry, and the number of the entries obtained through division is not limited in the embodiment of the application.
The purpose of word segmentation is to cut each row of log records into a set of entries; word segmentation reduces the processing complexity of the log records, lowers the computational cost of the subsequent locality-sensitive hashing, and improves computational efficiency. In embodiments of this application, word segmentation may be performed in different ways. For example, space-based segmentation is used (this approach may use a string.split() statement); or special-character segmentation is used; or natural-language segmentation is used.
With space-based segmentation, a row of log records is split into a plurality of entries at the spaces; the splitting is simple to implement and efficient. With special-character segmentation, the special character is usually a character designated by the user, such as "|", "###", or "=", so that the semantic units contained in the resulting entries are more accurate and the segmentation precision is higher. Natural-language segmentation is also commonly used; with it, the log records can be directly input into a natural-language tokenizer, such as the NLTK word tokenizer, Treebank tokenizer, or S-Expression tokenizer.
For the convenience of the reader to understand, the following embodiments all use word segmentation processing for performing log recording based on space as an example for explanation. For different word segmentation mechanisms, the obtained word segmentation results are different, and the embodiment of the application explains the word segmentation results in the following two optional ways:
in a first alternative, each entry resulting from word segmentation comprises only one semantic unit. Semantic units are words or symbols, which may be numeric symbols, abbreviated numbers such as 1, 2, or other symbols such as "/" or ": ". As shown in fig. 6, fig. 6 is a word segmentation result obtained by performing word segmentation processing on each row of log records of the second log, and each entry includes only one semantic unit. Taking the first row of log records in fig. 6 as an example, word segmentation results in 7 entries of "mod _ jk", "child", "workerEnv", "in", "error", "state", and "6".
In a second alternative, each entry obtained by word segmentation includes a plurality of semantic units. That is, each entry includes m semantic units, where m is an integer greater than 1 and less than the total number of entries. The semantic units are words or symbols.
The length of an entry can be expressed as the number of semantic units that make up the entry. Because an overly long entry may make the word segmentation meaningless (for example, an entire row of log records ends up segmented into a single entry), the entry length cannot be too large; it usually needs to be such that a row of log records is segmented into at least 2 entries, for example m = 2 or m = 3.
For a log record that includes at least two entries, the last m-1 semantic units of the first entry in every two adjacent entries are the same as the first m-1 semantic units of the second entry, the first entry being the one preceding the second entry; that is, every two adjacent entries in the word segmentation result overlap in their semantic units. For example, assume log record X1 is "detected a failure in network connection" and log record X2 is "network connection a failure is detected". If m is 2, their word segmentation results are shown in FIG. 7; if m is 3, the results are shown in FIG. 8. Optionally, this word segmentation may be implemented using a sliding-window mechanism.
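A minimal sketch of this overlapping word segmentation, implemented as a sliding window over space-separated semantic units (the implementation details are assumptions for illustration):

    def sliding_window_entries(line, m=2):
        """Cut a log record into entries of m semantic units each, with
        adjacent entries overlapping by m-1 units (sliding-window scheme)."""
        units = line.split()                 # semantic units: words/symbols
        if len(units) <= m:
            return [" ".join(units)] if units else []
        return [" ".join(units[i:i + m]) for i in range(len(units) - m + 1)]

    print(sliding_window_entries("detected a failure in network connection", m=2))
    # ['detected a', 'a failure', 'failure in', 'in network', 'network connection']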
Optionally, in both the first and the second optional manner, the analysis device may input each row of log records as a character stream into a designated word segmenter, have the word segmenter perform the segmentation, and receive the word segmentation result output by the word segmenter. The different word segmentation mechanisms corresponding to the first and second optional manners are implemented by different word segmenters. The analysis device may support at least one word segmentation mechanism.
Step A2, the analysis device determines the locality sensitive hash code of each row of log record based on at least one entry of each row of log record.
Optionally, the analysis device may determine the locality-sensitive hash code of each row of log records based on a target locality-sensitive hashing algorithm and the at least one entry of the row. For example, the locality-sensitive hash calculation in the target locality-sensitive hashing algorithm may follow the locality-sensitive hash calculation of the Simhash algorithm or the Minhash algorithm. The minimum unit of data processed by the target locality-sensitive hashing algorithm is an entry.
Optionally, in the target locality-sensitive hashing algorithm, after the at least one entry of a log record is obtained, weighted summation may be used to determine the locality-sensitive hash code of that log record. The process may follow the Simhash algorithm. Determining the locality-sensitive hash code of the log record by weighted summation may include:
Step A21: for any log record, calculate the hash code of each entry of the log record, where the hash code consists of the binary digits 0 and 1.
Step A22: perform a weighted summation of the hash codes of the entries, that is, W = Σ(hash × weight), where W denotes the hash sequence after weighted summation and hash denotes the hash code of each entry.
If the weight of every entry is 1, the weighted-sum sequence is simply the sum of the hash codes of the entries, that is, W = Σ hash.
Step A23: perform numerical dimensionality reduction on the weighted-sum result to obtain the locality-sensitive hash code.
In the weighted summation of Step A22, the product of each hash code and its weight follows this rule: when a bit of the hash code is 1, the contribution of that position is +1 × weight; when a bit is 0, the contribution of that position is -1 × weight.
The dimensionality reduction in Step A23 means that values greater than 0 are reduced to 1 and values not greater than 0 are reduced to 0; that is, in the weighted-sum result, each value greater than 0 is set to 1 and each value not greater than 0 is set to 0.
For example, assume that log record X3 is "saveLogSize cost time is 1057" and log record X4 is "flush cost time is 122", and assume that each entry resulting from the word segmentation of log records X3 and X4 includes only one semantic unit. FIG. 9 shows the process of obtaining the locality-sensitive hash codes of log records X3 and X4. In FIG. 9, the hash code of "flush" is "10010111"; multiplied by the weight 1, it becomes "1, -1, -1, 1, -1, 1, 1, 1" (the commas are only separators and do not appear in the actual calculation). The hash codes of the entries are then weighted and summed, that is, the weighted hash codes are summed bit by bit (position by position). Taking log record X3 in FIG. 9 as an example, the final weighted-sum result is "5, -3, -1, 1, -3, -1, 5, 3", where the first value, 5, is the sum of the first bits of the products of the entries and the weight 1, namely 1+1+1+1+1, and the second value, -3, is the sum of the second bits, namely (-1)+(-1)+1+(-1)+(-1); the other bits are calculated in the same way. The weighted-sum result "5, -3, -1, 1, -3, -1, 5, 3" corresponds, after dimensionality reduction, to "10010011"; that is, the locality-sensitive hash code of log record X3 is "10010011".
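The computation walked through above can be sketched as follows. The 8-bit width and the use of hashlib.md5 as the per-entry hash are illustrative assumptions rather than the hash function of the application, so the resulting codes will differ from those in FIG. 9, whose per-entry hash function is not specified.

    import hashlib

    def entry_hash(entry, bits=8):
        """Fixed-width bit vector for one entry (md5 is an illustrative choice)."""
        digest = int(hashlib.md5(entry.encode("utf-8")).hexdigest(), 16)
        return [(digest >> i) & 1 for i in range(bits - 1, -1, -1)]

    def locality_sensitive_hash(entries, weights=None, bits=8):
        """Simhash-style code: add +weight for a 1 bit and -weight for a 0 bit
        at each position, then keep 1 where the sum is positive, else 0."""
        if weights is None:
            weights = [1] * len(entries)      # the simple all-ones weighting
        totals = [0] * bits
        for entry, w in zip(entries, weights):
            for i, bit in enumerate(entry_hash(entry, bits)):
                totals[i] += w if bit else -w
        return "".join("1" if t > 0 else "0" for t in totals)

    print(locality_sensitive_hash("saveLogSize cost time is 1057".split()))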
With the target locality-sensitive hashing algorithm, if all weights are set to 1, the computation latency is low and the computational efficiency is high, but the computed locality-sensitive hash codes may produce undesirable hash collisions; this setting is generally applied to scenarios with low requirements on analysis precision and high requirements on computation latency.
For the convenience of the reader to understand, the embodiments of the present application take the following two types of undesirable hash collisions as examples for illustration:
first undesired hash collisions: hash collision caused by different log record contents:
still taking the log records X3 and X4 in fig. 9 as an example, the log records X3 and X4 are the same length, and the locality sensitive hash codes determined by the target locality sensitive hash algorithm are all "10010011".
Although the obtained locality sensitive hash codes in this case are the same, the two rows of log records actually have a large difference due to different contents, and the locality sensitive hash codes cannot effectively reflect the similarity of the log records X3 and X4, so that the locality sensitive hash codes are called an unexpected hash collision.
Second type of undesirable hash collision: hash collisions caused by different log record sequences (also called orders):
assume that log record X5 is: "flush cost time is 122", log record X6 is: "122 is flush cost time", and assuming that each entry obtained after the segmentation of the log records X5 and X6 only includes one semantic unit, the log records X5 and X6 substantially include the same entry, except that the order of the entries is different, and the locality-sensitive hash codes determined by using the target locality-sensitive hash algorithm are the same. Because the entry contents obtained after the log records X5 and X6 are segmented are substantially the same, but the entry sequences are different, when the target locality sensitive hash algorithm is adopted and the weights are all set to be 1, the finally determined locality sensitive hash codes are the same.
It is worth noting that the traditional Simhash algorithm is used to compare the similarity of articles, and its processing object is an article; the weight assigned to each entry obtained by word segmentation is positively correlated with term frequency, that is, the higher the frequency, the larger the weight. In embodiments of this application, if weights were set in the traditional manner, i.e., according to term frequency, identical entries would receive identical weights and the finally determined locality-sensitive hash codes would still be the same.
Although the obtained locality sensitive hash codes in this case are the same, the two rows of log records actually have a large difference due to different sequences, and the locality sensitive hash codes cannot effectively reflect the similarity of the log records X5 and X6, so that the locality sensitive hash codes are called an unexpected hash collision.
In embodiments of this application, the two kinds of hash collision can be reduced by setting different weights for different entries in each row of log records. Assuming that the first log record is any log record, among the plurality of rows of log records of the log, that includes a plurality of entries, the process of determining the locality-sensitive hash code of each row of log records based on its at least one entry may include: determining the locality-sensitive hash code of the first log record based on the plurality of entries of the first log record and the weight value assigned to each entry, where the weight values of at least two entries of the first log record differ from each other. The locality-sensitive hash codes of the other rows of log records can be obtained in the same manner as for the first log record. For example, based on the plurality of entries of the first log record and the weight assigned to each entry, the locality-sensitive hash code of the first log record may be determined by the weighted summation described above; for the specific process, refer to Steps A21 to A23.
In embodiments of this application, the weights set for entries at the same position in different rows of log records are the same, while the weights set for at least two entries within the same row differ, which effectively reduces undesirable hash collisions. The weights of the entries in a row of log records may be set according to the actual situation, for example increasing or decreasing as an arithmetic progression, or in other manners.
For example, assuming that in the target locality-sensitive hashing algorithm, the weight of the first entry is 3, the weight of the second entry is 2, and the weights of the other entries are 1, the calculation process of locality-sensitive hashing codes of the log record X3 and the log record X4 shown in fig. 9 is as shown in fig. 10, and the locality-sensitive hashing codes of two rows of log records determined finally are different. This may resolve the first type of undesirable hash collisions. Similarly, for the log record X5 and the log record X6, the locality sensitive hash codes of the two rows of log records finally determined by using the aforementioned weights are different. This also addresses the second type of undesirable hash collisions.
As mentioned above, the traditional Simhash algorithm is used to compare the similarity of articles, and its processing object is an article; the weight assigned to an entry obtained by word segmentation is positively correlated with term frequency, that is, the higher the frequency, the larger the weight. In the target locality-sensitive hashing algorithm provided in embodiments of this application, the weight of each entry of a row of log records may instead be related to an attribute of the entry and decoupled from term frequency. Entries of the constant part usually outnumber entries of the variable part, and the leading entries of a log record usually belong to the constant part. Therefore, the weights of the first g entries of a log record may be set greater than the weights of the other entries, where g < k, g is a positive integer, and k is the length of the log record. For example, the weights of the first g entries decrease while the other weights are equal and smaller than the minimum weight of the first g entries; g may be 1. In this way, the weight of an entry is associated with its position attribute, that is, the weight of each entry is determined based on the position of the entry in the first log record, so the locality-sensitive hash code can be computed more accurately and undesirable hash collisions are further reduced.
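Continuing the sketch given earlier, a position-based weighting of this kind (the first g entries weighted more, in decreasing order; all later entries equal and smaller) could be written as follows; the concrete values are illustrative.

    def position_weights(num_entries, g=2, base=1):
        """First g entries get decreasing weights larger than `base`;
        the remaining entries all get `base` (illustrative values)."""
        return [base + (g - i) if i < g else base for i in range(num_entries)]

    # With g=2 and base=1 the weights are [3, 2, 1, 1, ...], matching the
    # example above where the first entry has weight 3 and the second weight 2.
    tokens = "saveLogSize cost time is 1057".split()
    code = locality_sensitive_hash(tokens, position_weights(len(tokens)))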
It should be noted that the first and second kinds of undesirable hash collision may also be reduced by obtaining the entries in the manner of the second alternative of Step A1.
For example, assume log record X7 is "detected a failure in network connection" and log record X8 is "network connection a failure is detected". These are two different log records, but because their wording is similar, if word segmentation is performed with each entry containing only one semantic unit, the word segmentation result of log record X7 is {detected, a, failure, in, network, connection} and that of log record X8 is {detected, a, failure, is, network, connection}. Only one entry differs between the two rows: the entry of log record X7 is "in" while that of log record X8 is "is". Therefore, the locality-sensitive hash codes of log records X7 and X8 may be very similar and may even collide, producing the first kind of undesirable hash collision.
When each entry includes two semantic units, the word segmentation results of log X7 and log X8 are shown in fig. 11. It can be seen that 5 entries in the word segmentation results of the log record X7 and the log record X8 are different, and only one entry is the same. The local sensitive hash codes obtained based on the word segmentation result are greatly different, and the first unexpected hash collision is effectively avoided.
When each entry includes three semantic units, the word segmentation results of log records X7 and X8 are shown in FIG. 12. It can be seen that the word segmentation results of log records X7 and X8 each contain 4 entries, and the entries of the two rows differ even more than in FIG. 11. The locality-sensitive hash codes obtained from these word segmentation results are accordingly more different, effectively avoiding the first kind of undesirable hash collision.
Similarly, the second undesirable hash collision may be avoided by using a word segmentation method in which one entry includes a plurality of semantic units.
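As a rough illustration of this kind of word segmentation, the following sketch uses a window of m semantic units that slides forward one unit at a time; treating each whitespace-separated word as one semantic unit is a simplifying assumption.

```python
def segment(log_record, m=2):
    """Split a log record into entries of m consecutive semantic units (words),
    using a window that slides forward one unit at a time."""
    units = log_record.split()   # one semantic unit per whitespace-separated word
    if m <= 1 or len(units) <= m:
        return units if m <= 1 else [" ".join(units)]
    return [" ".join(units[i:i + m]) for i in range(len(units) - m + 1)]

x7 = segment("detected a failure in network connection", m=2)
x8 = segment("network connection a failure is detected", m=2)
shared = set(x7) & set(x8)   # far fewer shared entries than with one-unit entries
```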
In a second alternative implementation, the process of determining the locality-sensitive hash code of each row of log record in the multiple rows of log records of the log may include: determining the locality-sensitive hash code of each row of log record directly based on the content of the log record, that is, without performing the aforementioned step A1. With reference to step A2, the analysis device may determine the locality-sensitive hash code of each row of log record based on the target locality-sensitive hash algorithm and the content of that row of log record. For example, the analysis device may input the content (i.e., the character stream) of each row of log record into the algorithm model of the target locality-sensitive hash algorithm, and receive the locality-sensitive hash code of that row of log record output by the algorithm model. In this case, the minimum unit of data processed by the target locality-sensitive hash algorithm is a character.
In the second optional implementation manner, the data granularity (i.e., the minimum unit of data processed by the target locality-sensitive hash algorithm) used when obtaining the locality-sensitive hash code of each row of log record is a character, whereas in the first optional implementation manner it is an entry. The first optional implementation manner therefore uses a larger data granularity than the second, so it requires fewer operations and can save operation overhead.
In a third alternative implementation, the process of determining the locality-sensitive hash code of each row of log record in the multiple rows of log records of the log may include: for each row of log record, obtaining the locality-sensitive hash code of that row of log record with every n characters as the minimum data processing unit, where n is an integer greater than 1. Each row of log record may be divided by using a sliding window mechanism (the process may refer to the second alternative of the aforementioned step A1, except that the unit of division is changed from m semantic units to n characters). For example, the analysis device may input each row of log record into the algorithm model of the target locality-sensitive hash algorithm in units of n characters, and receive the locality-sensitive hash code of that row of log record output by the algorithm model. That is, the minimum unit of data processed by the target locality-sensitive hash algorithm is n characters. The process may refer to the n-gram (a language model) algorithm.
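A corresponding sketch with characters as the basic unit is shown below; the window size n and the overlapping-window choice are assumptions in line with the n-gram reference.

```python
def char_ngrams(log_record, n=3):
    """Split a log record into overlapping units of n characters; these units
    would then feed the locality-sensitive hash in place of word entries."""
    if len(log_record) <= n:
        return [log_record]
    return [log_record[i:i + n] for i in range(len(log_record) - n + 1)]

units = char_ngrams("Connected to 10.110.12.01", n=3)
```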
In the third optional implementation manner, the data granularity used when obtaining the locality-sensitive hash code of each row of log record is n characters. Therefore, the first optional implementation manner requires fewer operations than the third and can save operation overhead, and the third optional implementation manner requires fewer operations than the second and can likewise save operation overhead.
In the embodiment of the present application, in the process of determining the locality-sensitive hash code of each row of log record (for example, in the process of implementing the first optional implementation manner or the second optional implementation manner), the analysis device may further perform preprocessing on each row of log record, so as to improve the obtaining efficiency of the locality-sensitive hash code and reduce the operation cost.
Then, in a fourth alternative implementation, the process of determining the locality-sensitive hash code for each of the plurality of log records of the log may include:
Step B1, the analysis device replaces p designated characters in each row of log record in the log with q fixed characters, to obtain each updated row of log record.
Because certain designated characters in a log record are in most cases variables, performing a replacement on these designated characters can reduce the computational complexity of the subsequent locality-sensitive hash code. For example, the designated characters may be numbers, and the fixed character may be a number or another symbol, such as 1, 2 or "*". Here q is greater than or equal to 1 and less than p, that is, the number of replaced designated characters is greater than the number of fixed characters. The number of characters contained in the log record is thus reduced to a certain extent, which reduces the computational complexity of the subsequent locality-sensitive hash code.
For example, assuming that the designated characters are numbers, the fixed character is "+", and log record X9 is: "Connected to 10.110.12.01 at 2019-11-04 15:40:00", the updated log record X9 obtained in this way may be, for example, the record with its numeric characters replaced by a smaller number of "+" characters.
Optionally, the analysis device may replace a plurality of consecutive designated characters in each row of log records in the log with a fixed character, so as to obtain each row of updated log records.
For any log record, because a plurality of characters are replaced by one fixed character, the number of characters contained in the log record is reduced, and the calculation complexity of the subsequent locality sensitive hash code is effectively reduced.
For example, assuming that the designated characters are numbers, the fixed character is "+", and log record X9 is: "Connected to 10.110.12.01 at 2019-11-04 15:40:00", the updated log record X9 obtained in this way is: "Connected to +.+.+.+ at +-+-+ +:+:+".
And step B2, determining the locality sensitive hash code of each row of log record based on each row of updated log record.
The process of step B2 may refer to the process of step A2 in the first optional implementation manner, that is, determining the locality-sensitive hash code based on the word segmentation result; it may also refer to the second optional implementation manner, that is, determining the locality-sensitive hash code of each row of log record directly based on the content of that row without word segmentation; the third optional implementation manner or other implementation manners may also be adopted, which is not limited in the embodiment of the present application.
It should be noted that, if step B2 is implemented by step A2 of the first optional implementation manner, step B1 may be performed before word segmentation (i.e., step A1) or after word segmentation. That is, the process of determining the locality-sensitive hash code of each row of log record based on each updated row of log record includes steps A1, B1 and B2 performed in sequence, or steps B1, A1 and B2 performed in sequence.
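A minimal sketch of the consecutive-replacement variant of step B1 follows, assuming the designated characters are digits and the fixed character is "+" as in the example above; the use of a regular expression is an implementation assumption.

```python
import re

def preprocess(log_record, fixed="+"):
    """Step B1 sketch: replace every run of consecutive digits with a single
    fixed character before computing the locality-sensitive hash code."""
    return re.sub(r"\d+", fixed, log_record)

# "Connected to 10.110.12.01 at 2019-11-04 15:40:00"
# -> "Connected to +.+.+.+ at +-+-+ +:+:+"
updated = preprocess("Connected to 10.110.12.01 at 2019-11-04 15:40:00")
```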
Step 303, the analysis device determines at least one first log record group, wherein different first log record groups comprise different row log records in the log; each first log record group includes all log records having the same locality-sensitive hash code.
In general, grouping yields a plurality of first log record groups. The analysis device may divide log records having identical locality-sensitive hash codes among the multiple rows of log records of the log into the same first log record group, to obtain the at least one first log record group. Because the locality-sensitive hash codes reflect the similarity of different rows of log records, each first log record group is equivalent to one type of log record, and the grouping effect is similar to a clustering effect. However, a clustering process needs to calculate the distance between features (when applied to log template extraction, the feature of a row of log record is all the entries it includes), and computing these pairwise distances is costly; the computational complexity of the grouping method is far lower than that of the clustering process.
For example, suppose the number of log records in a log is u. The process of clustering in a hierarchical clustering manner includes: calculating a distance matrix based on a defined distance function (distance measurement), which may be a Jaccard distance function; determining a plurality of pairs of aggregatable log records based on the distance matrix (each pair of aggregatable log records is typically the pair with the highest similarity, which may be determined based on the minimum value of each column in the distance matrix); aggregating each determined pair of log records; and representing the aggregation result by a binary tree, for example a dendrogram. The complexity of finding the minimum value of each column in the distance matrix is O(u²); each aggregation merges 2 elements into 1 and reduces the number of elements by 1, so u-1 aggregations are needed in total to complete the construction of the binary tree. Thus, the final complexity is O(u²)×O(u)=O(u³), where O denotes complexity and one element is one row of log record. Even if optimized, the complexity can only be reduced to O(u² log u). If the log includes tens of thousands of log records, hundreds of millions of calculations are needed to implement the hierarchical clustering, which creates a performance bottleneck and affects user experience and system stability.
In the embodiment of the present application, in the process of obtaining the locality-sensitive hash code of each row of log record, there is no need to calculate the distance between rows of log records or to aggregate elements; the computational complexity can reach O(u), where u is the number of rows of log records in one log, so the computational complexity is far lower than that of the clustering process. This avoids the performance bottleneck and reduces the impact on user experience and system stability.
Moreover, as can be seen from the foregoing, a clustering algorithm needs to calculate the distance between each row of log record in the log and every other row, that is, the log records in one log are correlated and each row of log record cannot be processed independently during clustering, so the computational complexity is high and the computation delay is long. In the embodiment of the present application, the locality-sensitive hash code is a characteristic of the log record itself, and obtaining the locality-sensitive hash code of one row of log record does not require considering the other rows. Decorrelation of the rows of log records in the grouping process is therefore achieved. Accordingly, for one log, the grouping of the multiple rows of log records can be executed in parallel (also called concurrent execution, that is, the locality-sensitive hash code of each row of log record is calculated independently and the corresponding log record is grouped based on the calculated code), which effectively reduces the computation delay and improves the computation efficiency.
Further, in the embodiment of the present application, the log records of the log are grouped by using the locality-sensitive hash codes, so that one or more first log record groups can be obtained and the subsequent processing of each first log record group can be performed. When there are multiple first log record groups, the subsequent processing of the groups (such as step 304) can be executed in parallel, which reduces the computation delay; and the amount of data to be processed in each execution is far smaller than the data amount of the whole log, which effectively reduces the operation overhead and improves the computation efficiency.
Optionally, the analysis device may also determine the at least one first log record group by: and grouping a plurality of rows of log records in the log based on the locality sensitive hash code of each row of log records in the log and the target characteristics of each row of log records to obtain at least one first log record group.
In the embodiment of the present application, hash collisions may also be reduced by increasing the code length of the locality-sensitive hash code, but this correspondingly increases the computational complexity. Introducing the target feature to participate in grouping on top of the locality-sensitive hash code adds a new grouping feature, and can further reduce hash collisions while keeping the locality-sensitive hash code short. In addition, the grouping precision can be improved, ensuring that log records divided into the same first log record group have higher similarity. Furthermore, the target feature is also a characteristic of the log record itself, and obtaining the locality-sensitive hash code and target feature of one row of log record does not require considering other rows, so decorrelation of the rows of log records in the grouping process is still achieved. Accordingly, for one log, the grouping of the multiple rows of log records can be executed in parallel, which effectively reduces the computation delay and improves the computation efficiency.
As an example, the partitioning rule may be: each first set of log records includes all log records having the same locality-sensitive hash code, and the same target characteristics. That is, the analysis device may divide log records with the same target characteristic and locality-sensitive hash code among multiple log records of the log into the same first log record group to obtain the at least one first log record group.
For example, the target characteristics of the log records include: at least one of a length of the log record, a first character of the log record, and a first word of the log record.
Typically, the length of a log record is expressed as the number of entries that the log record includes. For example, the length of log record X1 in fig. 8 is 4, and the length of log record X2 is 4. Because the length of the log record is a typical characteristic of log records and the probability that log records with different lengths adopt the same log template is low, using the length of the log record as a target feature can effectively avoid dividing log records that are substantially dissimilar (for example, log records with different lengths but similar contents) into the same first log record group, thereby reducing the probability of undesirable hash collisions such as the aforementioned first undesirable hash collision.
The beginning part of a row of log record is usually a constant part; for example, the first character of a log record and the first word of a log record are usually constants, and the probability that log records with different beginning parts adopt the same log template is low. Therefore, using the first character of the log record or the first word of the log record as a target feature can effectively avoid dividing log records that are substantially dissimilar (for example, log records with different beginning parts but similar other parts) into the same first log record group, thereby reducing the probability of undesirable hash collisions such as the aforementioned first and second undesirable hash collisions.
Further, in step 303, the analysis device typically determines the at least one first log record group by traversing each row of log record in the log. That is, the analysis device traverses each row of log record in the log and sequentially divides log records having the same locality-sensitive hash code into the same first log record group. If the partitioning rule is that each first log record group includes all log records having the same locality-sensitive hash code and the same target feature, the analysis device traverses each row of log record in the log and sequentially divides log records having the same target feature and locality-sensitive hash code into the same first log record group. In this way, for each first log record group, log records are written in one by one (i.e., row by row) in the form of a text stream. Each first log record group may be established before the dividing action, that is, created as an empty group at initialization; or each first log record group may be established during the dividing process, which is not limited in the embodiment of the present application.
For ease of understanding, take the aforementioned second log as an example and assume that the aforementioned word segmentation process ignores the line number mark and the specific line number, and that grouping is based on the locality-sensitive hash code of the log record and the length of the log record (i.e., the target feature of the log record is the length of the log record). Among the log records of lines 0 to 5, the locality-sensitive hash codes and lengths of the line-0 and line-4 log records are the same, those of the line-3 and line-5 log records are the same, and those of the line-1 and line-2 log records are the same. As shown in fig. 13, the grouping process is as follows: the analysis device traverses the log records of lines 0 to 5 in the log. For the line-0 log record, since no related group exists, a first log record group 0 is established and the line-0 log record is divided into first log record group 0. For the line-1 log record, since no related group exists, a first log record group 1 is established and the line-1 log record is divided into first log record group 1. For the line-2 log record, since its locality-sensitive hash code and length are the same as those of the line-1 log record, it is divided into first log record group 1. For the line-3 log record, since no related group exists, a first log record group 2 is established and the line-3 log record is divided into first log record group 2. For the line-4 log record, since its locality-sensitive hash code and length are the same as those of the line-0 log record, it is divided into first log record group 0. For the line-5 log record, since its locality-sensitive hash code and length are the same as those of the line-3 log record, it is divided into first log record group 2.
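The traversal grouping of step 303 can be sketched as follows; the grouping key, the helper names, and the use of the entry count as the target feature are illustrative assumptions, and the locality-sensitive hash function is the one sketched earlier.

```python
from collections import defaultdict

def group_records(records, lsh_code, target_feature=len):
    """Traverse the log row by row and put records whose locality-sensitive
    hash code and target feature (here: entry count) are both identical into
    the same first log record group."""
    groups = defaultdict(list)                 # one first log record group per key
    for record in records:                     # each record is a list of entries
        key = (lsh_code(record), target_feature(record))
        groups[key].append(record)             # a group is created on first use
    return list(groups.values())
```

Because the key of a record depends only on that record, the loop body could also be distributed over several workers and the partial groups merged afterwards, which is one way to realize the parallel grouping mentioned above.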
Step 304, the analysis device obtains a log template of the log by processing each first log record group of the at least one first log record group.
In the embodiment of the present application, since each first log record group corresponds to one type of log record, there are many alternative processing ways to obtain the log template related to the whole log. The following optional processing modes are taken as examples in the embodiment of the present application to describe a process of obtaining a log template related to a log:
in a first alternative processing manner, a log template of each first log record group may be obtained, and a log template of a log may be determined based on the obtained log template, where the process includes:
and step C1, the analysis device extracts the log template in each first log record group respectively.
As an example, step C1 may include the steps of:
step C11, for each row of log records in each first set of log records, the analytics device may compare the row of log records to the historical log template for the first set of log records.
As described in step 303 above, the analysis device determines the at least one first log record group based on the locality-sensitive hash code alone, or based on the locality-sensitive hash code together with other target features. Therefore, the lengths of the log records in the same first log record group may be the same or different. The present application provides two comparison manners for these two cases:
in the first case, the log records in the same first log record group are of the same length. I.e. in the aforementioned step 303, the analyzing device determines at least one first log record group based on the locality-sensitive hash code and the length of the log records, i.e. the target features of the log records comprise at least the length. In this way, the log records can be compared with the historical log templates of the first log record group by bit comparison. In a first example, if the ratio of the number of the same entries in the log record length (i.e. the total number of entries of the log record) is greater than a first proportional threshold, determining that the log record matches the historical log template; and if the ratio of the number of the same entries in the log record length is not greater than a first proportional threshold, determining that the log record is not matched with the historical log template. In a second example, if the ratio of the number of different entries (the different entries refer to entries in a row of log records that are different from one log record) in the length of the log record is less than a second ratio threshold, determining that the log record matches the historical log template; if the ratio of the number of different entries in the log record length is not less than a second ratio threshold, determining that the log record is not matched with the historical log template; in a third example, if the ratio of the number of the same entries in the log record length is greater than a first ratio threshold and the number of different entries is less than a first number threshold, determining that the log record is matched with the historical log template; and if the ratio of the number of the same entries in the log record length is not greater than a first ratio threshold, or the number of different entries is not less than a first number threshold, determining that the log record is not matched with the historical log template. In a fourth example, if the number of the same entries is greater than a second number threshold, determining that the log record matches the historical log template; and if the number of the same entries is not greater than the second number threshold, determining that the log record is not matched with the historical log template. In a fifth example, if the number of different entries is less than a third number threshold, determining that the log record matches the historical log template; and if the number of the different entries is not less than the third number threshold, determining that the log record is not matched with the historical log template. There are other ways to determine whether the log record matches the historical log template of the first log record group, which is not limited in this embodiment of the present application.
Comparison position by position means comparing the entry of the log record with the entry of the historical log template at the same position. Assume that log record X10 is: "User Yang Xiao Yu has been logged in" and the historical log template is: "User * * * has been logged in". Assuming that each entry includes one semantic unit, the entries "User", "Yang", "Xiao", "Yu", "has", "been", "logged" and "in" are compared one-to-one with the entries "User", "*", "*", "*", "has", "been", "logged" and "in" of the template, where a variable indicator "*" is regarded as matching any entry. Taking the aforementioned first example determination manner as an example, the number of identical entries is 8 and the log record length is 8; assuming that the first ratio threshold is 1/2, since 8/8 is greater than 1/2, it is determined that the log record matches the historical log template.
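Under the reading above (equal lengths, with a wildcard position counted as a match), the rule of the first example might be sketched as follows; the threshold value and wildcard symbol are taken from the example.

```python
def matches_template(record, template, ratio_threshold=0.5, wildcard="*"):
    """First case: compare entry by entry at the same position; a wildcard
    position in the template is treated as matching any entry."""
    same = sum(1 for r, t in zip(record, template) if t == wildcard or r == t)
    return same / len(record) > ratio_threshold

record = ["User", "Yang", "Xiao", "Yu", "has", "been", "logged", "in"]
template = ["User", "*", "*", "*", "has", "been", "logged", "in"]
assert matches_template(record, template)   # 8 identical positions out of 8 > 1/2
```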
In the second case, the log records in the same first log record group have different lengths, that is, in the aforementioned step 303 the analysis device determines the at least one first log record group based on the locality-sensitive hash code alone, or based on the locality-sensitive hash code and a target feature other than the length of the log record. In this case, the log record and the historical log template are each regarded as a sequence of entries, and the log record can be compared with the historical log template of the first log record group by finding their longest common subsequence (LCS). In a first example, if the ratio of the length of the determined longest common subsequence (i.e., the total number of entries in the longest common subsequence) to the length of the log record (i.e., the total number of entries of the log record) is greater than a third ratio threshold, it is determined that the log record matches the historical log template; otherwise it is determined that the log record does not match. In a second example, if the ratio of the length of the remaining sequence of the log record outside the longest common subsequence (i.e., the total number of entries of the log record minus the total number of entries in the longest common subsequence) to the length of the log record is less than a fourth ratio threshold, it is determined that the log record matches the historical log template; otherwise it does not match. In a third example, if the length of the longest common subsequence is greater than a first length threshold, it is determined that the log record matches the historical log template; otherwise it does not match. Other ways of determining whether the log record matches the historical log template of the first log record group may also be used, which is not limited in the embodiment of the present application.
The longest common subsequence of the log record and the historical log template is the longest subsequence common to both. Illustratively, the longest common subsequence may be found in a recursive manner or by dynamic programming. Taking the case where each entry includes one semantic unit as an example, assume that in one first log record group, log record X10 is: "User Yang Xiao Yu has been logged in" and the first historical log template is: "User * has been logged in"; the longest common subsequence of the two is: "User has been logged in". Assume that in another first log record group, log record X11 is: "User Yang Xiao Yu has been logged in" and the second historical log template is: "User registered success"; the longest common subsequence of the two is: "User".
Taking the aforementioned third example determination manner as an example, assume that the first length threshold is 3. The length of the longest common subsequence for log record X10 is 5, so it is determined that log record X10 matches the first historical log template; the length of the longest common subsequence for log record X11 is 1, so it is determined that log record X11 does not match the second historical log template.
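A dynamic-programming sketch of the LCS comparison of the second case follows, reusing the example records and the length threshold of 3; the entry lists are the reconstructions assumed above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two entry sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

record = "User Yang Xiao Yu has been logged in".split()
assert lcs_length(record, "User * has been logged in".split()) == 5   # > 3: match
assert lcs_length(record, "User registered success".split()) == 1     # <= 3: no match
```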
It should be noted that, in general, one log template exists in one first log record group; in a few cases, a plurality of log templates exist in a first log record group. When a plurality of log templates exist in the first log record group, for any log record, the log record may be compared with each of the plurality of log templates, with the comparison process referring to the two cases above; alternatively, the distance between the log record and each of the plurality of log templates is calculated (for example, using a Jaccard distance function), and the log record is compared only with the log template closest to it, which can reduce operation overhead.
Step C12, when the log record matches the historical log template, determining a new log template for the first log record group based on the log record and the historical log template.
In the first example, since the history log template is an extracted template and the log record is matched with the history log template, the history log template can be directly used as a new log template of the first log record group.
In a second example, since there may be parts of the historical log template that differ from the log record, these differing parts may be regarded as variable parts. The analysis device may replace the parts of the historical log template that differ from the log record with a variable indicator. Alternatively, the analysis device first judges whether the parts of the historical log template that differ from the log record are only the positions where variable indicators are already located: if the differing parts also include positions other than those of the variable indicators, those positions are replaced with the variable indicator to obtain a new log template; if the differing parts include only the positions where the variable indicators are located, the historical log template can be directly used as the new log template. This processing yields a more accurate log template.
In the above two examples, the new log template may be used to update the corresponding historical log template, for example, to delete the historical log template, or the new log template may be used to overwrite the corresponding historical log template, so as to ensure that no duplicate log template exists in the first log record group.
It should be noted that, in order to ensure the consistency between the log template and the log record length, when performing the replacement operation, one variable indicator replaces only one entry, and the length of one variable indicator in the log template is regarded as 1.
For the second case, since there is no need to ensure length consistency between the log template and the log record, one variable indicator may replace one or more consecutive entries when the replacement operation is performed.
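For the equal-length case, the template update of step C12 might look like the following sketch; the example entries are hypothetical and "*" is used as the variable indicator.

```python
def update_template(template, record, indicator="*"):
    """Step C12, equal-length case: keep agreeing entries and replace each
    differing position with one variable indicator, so the new template keeps
    the same length as the record."""
    return [t if t == r or t == indicator else indicator
            for t, r in zip(template, record)]

old_template = ["User", "Yang", "has", "been", "logged", "in"]
new_template = update_template(old_template, ["User", "Wang", "has", "been", "logged", "in"])
# new_template == ["User", "*", "has", "been", "logged", "in"]
```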
Step C13, when the log record does not match the historical log template, adding the log template extracted from the log record as a new log template for the first log record group.
When the log record is not matched with the historical log template, the log template matched with the log record does not exist in the first log record group at present, and a new log template corresponding to the log record needs to be generated.
In a first example, the log record is directly used as a new log template of the first log record group.
In a second example, referring to the aforementioned step B1, since some designated characters in log records are in most cases variables, these designated characters can be replaced with variable indicators to generate a new log template for the first log record group. For example, the designated characters may be numbers, and the variable indicator may be a number or another symbol.
In the aforementioned step 303, the analysis device determines at least one first log record group, typically by traversing each row of log records in the log. The aforementioned process of extracting the log template in each first log record group can be performed after all log records are grouped, or can be performed in real time during the grouping process of the log records. The log template in each first log record group is extracted in real time in the grouping process of log records, so that the time delay of template extraction can be reduced, and the overall timeliness of the template extraction process is improved.
For example, taking real-time extraction of the log template of each first log record group during the grouping of log records as an example, and corresponding to the aforementioned step 303, the process of extracting the log template of each first log record group may include, for each first log record group: upon receiving a row of log record, comparing the received log record with the historical log template; when the received log record matches the historical log template, determining a new log template of the first log record group based on the received log record and the historical log template; and when the received log record does not match the historical log template, adding the log template extracted from the received log record as a new log template of the first log record group.
Still taking the second log as an example and assuming the grouping result shown in fig. 14: for first log record group 0, after the line-0 log record is received, the historical log template of first log record group 0 is empty, so the line-0 log record does not match any historical log template; the log template of the line-0 log record is therefore extracted and added as a new log template of the first log record group: "mod_jk child workerEnv in error state". After the line-4 log record is received, it is compared with the historical log template "mod_jk child workerEnv in error state"; since the received line-4 log record matches the historical log template, the historical log template "mod_jk child workerEnv in error state" of first log record group 0 may be used as the new log template. The template extraction of first log record group 1 and first log record group 2 is performed in the same way, and the finally extracted template of each first log record group is shown in fig. 14, which is not described again in the embodiment of the present application.
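Combining the grouping and per-group extraction into the real-time flow described above could look roughly like the following sketch; it reuses the matching and update helpers sketched earlier, assumes equal-length groups, and all names are illustrative.

```python
def stream_templates(records, lsh_code):
    """Real-time variant of steps 303 and C1: group each incoming record by
    (locality-sensitive hash code, length) and update that group's templates
    as the record arrives."""
    groups = {}                                      # group key -> list of templates
    for record in records:
        key = (lsh_code(record), len(record))
        templates = groups.setdefault(key, [])
        for i, tpl in enumerate(templates):
            if matches_template(record, tpl):        # sketched above
                templates[i] = update_template(tpl, record)
                break
        else:                                        # no template matched
            templates.append(list(record))           # the record itself becomes a template
    return [tpl for templates in groups.values() for tpl in templates]
```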
It should be noted that the foregoing historical log template and the new log template are relative concepts, the historical log template refers to a log template existing at the current time, and the new log template refers to a log template newly generated at the current time.
Step C2, determining a log template for the log based on the log template for each first log record group.
Optionally, based on the log template of each first log record group, the process of determining the log template of the log may be implemented in three ways:
in the first mode, the log templates of at least one first log record group are clustered to obtain the log templates of the logs.
Clustering is essentially a grouping method that groups similar objects into one class and dissimilar objects into different classes. In the embodiment of the present application, after obtaining the one or more log templates of the at least one first log record group, the analysis device may classify these log templates through clustering. In particular, when there are multiple obtained log templates, the log templates of different first log record groups may be similar; through classification, similar log templates can be grouped into one class of log template, and the resulting one or more classes of log templates are used as the log templates of the log. The one or more classes of log templates may subsequently be presented to the user, so that the user can intuitively see how many classes of log templates exist in the log. For example, the clustering process may be hierarchical clustering; the processing refers to the hierarchical clustering process and is not described again in this embodiment.
The log templates obtained through hierarchical clustering have hierarchical relations, and a user can adjust the clustering precision (also called granularity) to obtain different clustering results.
In the second mode, the log templates of at least one first log record group are merged (merging) to obtain the log template of the log.
The merging process refers to integrating identical or similar processing objects into one object; its effect is similar to that of deduplication. In the embodiment of the present application, merging the log templates of the at least one first log record group to obtain the log template of the log includes: when the log templates of the at least one first log record group include at least two log templates, detecting, for every two log templates, whether the similarity of the constant parts of the two log templates is 1; and when the similarity of the constant parts of the two log templates is 1, replacing the variable part of one of the two log templates with a variable identifier and deleting the other log template (which is equivalent to retaining the constant part of either log template and inserting the variable identifier at the position of the original variable part between the constant parts). By way of example, the variable identifier may be a wildcard "*". The similarity of the constant parts of the two log templates may be determined by calculating the distance between the constant parts, for example using a Jaccard similarity (also called Jaccard coefficient) algorithm. It should be noted that when the similarity of the constant parts of the two log templates is not 1, the two log templates are not processed.
For example, assume the two templates are: "User * has logged in" and "User * * has logged in". The constant parts of both include the same four entries: {User, has, logged, in}, so the similarity of the constant parts is 1. Therefore, the variable part of one template may be replaced with "*" and the other template deleted, resulting in the merged log template: "User * has logged in".
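A simplified sketch of this merging is shown below, where the similarity-equals-1 test is reduced to direct equality of the constant parts; templates are assumed to be entry lists whose variable positions already hold the wildcard.

```python
def merge_templates(templates, wildcard="*"):
    """Second mode sketch: templates whose constant parts are identical are
    collapsed into a single template."""
    merged = {}
    for tpl in templates:
        constant_part = tuple(e for e in tpl if e != wildcard)   # drop variable entries
        merged.setdefault(constant_part, tpl)                    # keep the first one seen
    return list(merged.values())

t1 = "User * has logged in".split()
t2 = "User * * has logged in".split()
merged = merge_templates([t1, t2])   # the two collapse into ["User", "*", "has", "logged", "in"]
```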
In a third mode, a log template of at least one first log record group is used as a log template of the log.
In most cases, there is usually one log template for each first log record group, and if the first log record groups are grouped in a proper manner and there are fewer log templates that are the same or similar, the obtained log templates of the respective log record groups can be directly used as the log templates of the logs without performing clustering processing or merging processing.
In the embodiment of the present application, the analysis device may support one or more of the above three modes. The terminal may present the trigger buttons (or icons) of all the supported modes in the user interface, or present the trigger buttons of the modes in a scrolling manner, or present only the trigger button(s) of one or more modes with a higher frequency of use (the trigger buttons of the other modes may be displayed after the user triggers another button, such as a pull-down button), which is not limited in the embodiment of the present application. When the user wants to view the log template of the log corresponding to a certain mode, the user triggers the corresponding trigger button, for example by clicking; correspondingly, the terminal receives a selection instruction carrying the identifier of that mode and sends the selection instruction to the analysis device, the analysis device obtains the log template of the log in the corresponding mode based on the received selection instruction, and the terminal presents the log template to the user on the user interface. When the log template obtained in the first mode is presented, it may be presented in a multi-layer file directory structure or a tree structure (such as a binary tree); when the log template obtained in the second or third mode is presented, if there are a plurality of log templates, they may be presented in a list.
In the first optional processing mode, the log template of the log is determined by extracting the log template of each first log record group, and the log records in the first log record group do not need to be directly adopted to participate in the calculation of the log template of the log, so that the solution space is reduced in an exponential level, and the operation efficiency is effectively improved.
In a second alternative processing manner, the target log record in each first log record group is obtained, and the log template of the log is determined based on the obtained target log record, where the process includes:
step D1, obtain the target log record for each first log record group.
Optionally, the target log record of a first log record group is a part of the log records in the group, for example one row of log record in the group. Because one first log record group contains log records with the same locality-sensitive hash code, that is, identical or similar log records, a target log record can be selected to represent the log records in the group; processing the target log record is equivalent to processing all log records in the group, but the amount of data actually processed is effectively reduced, which is equivalent to data sampling. The solution space is thereby reduced, and the operation cost is further reduced.
For example, the target log record of the first log record group may be a randomly selected row of log records in the first log record group. This ensures that the probability of each row of log records in the first set of log records being selected as target log records is equal. It should be noted that the target log records may be filtered in the first log record group according to other preset conditions. For example, the first row of log records in the first log record group is selected, or the most recent (e.g., most recent in timestamp) log records in the first log record group are selected.
Optionally, before step D1, the analysis device may further detect the number of log records in the first log record group. When the first log record group includes multiple rows of log records, step D1 is performed by screening part of the log records of the first log record group as the target log record. When the first log record group includes only one row of log record, step D1 is performed by directly taking that row of log record of the first log record group as the target log record.
Step D2, determining a log template for the log based on the target log record for each first log record group.
Optionally, the process of determining a log template of the log based on the target log record of each first log record group may include:
and D21, determining at least one second log record group.
In general, there are a plurality of second log record groups obtained by grouping. Wherein the different second log record groups comprise different target log records in the target log records corresponding to the at least one first log record group; each second set of log records includes all target log records having the same target characteristic. The analysis device may divide log records with the same target characteristics in the obtained target log records into the same second log record group to obtain the at least one second log record group. This may further reduce the solution space.
For example, the target characteristics of the log records include: at least one of a length of the log record, a first character of the log record, and a first word of the log record. It should be noted that the target feature in step D21 may be the same as or different from the target feature in step 303.
Grouping the obtained target log records by the target feature yields one or more second log record groups, so that the subsequent processing of each second log record group can be performed. When there are multiple second log record groups, the subsequent processing of each group (such as step D22) can be executed in parallel, which reduces the computation delay; and the amount of data to be processed in each execution is far smaller than the overall data amount of the log, which effectively reduces the operation overhead and improves the computation efficiency.
It should be noted that, before step D21, the analysis device may detect the number of first log record groups; step D21 may be executed when there are a plurality of first log record groups, and may be skipped when there is only one first log record group.
And D22, processing each second log record group to obtain a log template of the log.
For example, the process of obtaining the log template of the log by separately processing each second log record group may include:
and step D221, clustering the log records in each second log record group to obtain at least one type of log records corresponding to each second log record group.
The log records in each second log record group come from different first log record groups, so some of them may still be similar. In the embodiment of the present application, the log records in each second log record group are clustered (for example, by hierarchical clustering) so that similar log records can be divided into one class; in the subsequent process, template extraction can then be performed separately on each class of log records obtained by clustering. When there are multiple classes of log records in a second log record group, the processing of the classes can be executed in parallel, which reduces the computation delay; and the amount of data to be processed in each execution is smaller, which effectively reduces the operation overhead and improves the computation efficiency.
It should be noted that, before step D221, the analysis device may further detect the number of second log record groups; when there are a plurality of second log record groups, steps D221 to D223 are performed, and when there is only one second log record group, steps D221 to D223 may not be performed and the log template of that second log record group is directly used as the log template of the log.
And D222, respectively extracting templates of all types of log records obtained by clustering to obtain a log template of each type of log record.
The process of step D222 may refer to the process of step C1, that is, one type of log record corresponds to the first log record group, which is not described in detail in this embodiment of the present application.
And D223, determining a log template of the log based on the log template of each class of log records obtained by clustering.
For example, based on the log templates of the classes of log records, the process of determining the log template of the log can be implemented in two ways:
in the first mode, a log template of each type of log record is used as a log template of the log.
For the first manner, reference may be made to the third manner in the aforementioned step C2, with one class of log records corresponding to one first log record group, which is not described in detail in the embodiment of the present application.
And in the second mode, the log templates of various log records obtained by clustering are merged, and the merged log template is used as the log template of the log.
For the second manner, reference may be made to the second manner in the aforementioned step C2, with one class of log records corresponding to one first log record group, which is not described in detail in the embodiment of the present application.
For the convenience of the reader, the second alternative processing manner is described in the following example. Assume that the third log is shown on the left side of fig. 15; the word segmentation result shown on the right side of fig. 15 is obtained through step A1 in the aforementioned step 302, and the locality-sensitive hash codes shown in fig. 16 are obtained through step A2 in the aforementioned step 302. Assuming that in the aforementioned step 303 the analysis device determines the first log record groups based on the locality-sensitive hash code and the length of the log record (i.e., the length of the log record is the target feature), the relationship among the log records, the lengths of the log records and the locality-sensitive hash codes is shown in Table 1.
TABLE 1
The grouping result determined by the analysis device based on the locality-sensitive hash code and the length of the log record is shown in fig. 17, where the log records of lines 1 and 6 are divided into one first log record group, the log records of lines 7 and 9 are divided into another first log record group, and each of the remaining rows of log records is divided into its own first log record group. Assume that the target log record of each first log record group obtained through the aforementioned step D1 is as shown in fig. 18, that is, the line-1 log record is selected as the target log record of the first log record group to which the line-1 and line-6 log records belong, the line-7 log record is selected as the target log record of the first log record group to which the line-7 and line-9 log records belong, and in each of the other first log record groups its single row of log record is selected as the target log record. Assume that in the aforementioned step D21 the analysis device determines the second log record groups based on the length of the log record (i.e., the length of the log record is the target feature); four second log record groups, namely second log record groups Z1 to Z4 shown in fig. 19, are finally obtained. Assume that the log records in each second log record group are clustered hierarchically in step D221; after the template extraction operation of step D222, the 5 log templates shown on the right side of fig. 20 are obtained, where the 5 log templates are:
“workerEnv.init() ok /etc/httpd/conf/workers2.properties
mod_jk child init 1-2
mod_jk child workerEnv in error state *
jk2_init() Found child * in scoreboard slot *
jk2_init() Can't find child 1566 in the scoreboard”.
Assuming that the first manner of the aforementioned step D223 is adopted, that is, the log template of each class of log records is used as a log template of the log, the finally obtained log templates of the log include the above 5 log templates.
Step 305, the analysis device performs anomaly detection on the log based on the log template of the log.
In an optional mode, the analysis device performs feature extraction on the log based on a log template of the log; and anomaly detection is performed based on the extracted features of the log.
The features of the log refer to characteristics of the log records contained in the log. Exemplarily, they may include: the number of occurrences of a log template, the frequency of occurrence of a log template, and/or the occurrence period of a log template. The number of occurrences of a log template refers to the number of log records in the log corresponding to the log template; the frequency of occurrence of a log template refers to the ratio of the number of log records corresponding to the log template to the total number of log records contained in the log; the occurrence period of a log template refers to the occurrence time or acquisition period of the log records corresponding to the log template.
For example, assume that the first log template is any log template of the log. For the first log template, the analysis device may divide the log into a plurality of time windows, examine each row of log record included in each time window to detect the log records matching the first log template, and count, within the time window, the features of the log that need to be determined, such as the number of occurrences of the first log template. By comparing the features of the log in the plurality of time windows, the analysis device determines a time window whose feature differs from the other time windows by more than a specified difference threshold as an abnormal time window; the log records of the first log template in the abnormal time window are then the abnormal log records. The time windows may be fixed-size, non-overlapping time windows, or may be determined by a sliding window algorithm.
For example, when the feature to be analyzed is the number of occurrences of the first log template, the analysis device may mark a hot event and send an alarm message for a time window in which the number of occurrences of the first log template is significantly higher (for example, the difference from the other time windows, or from the mean of the numbers of occurrences of the first log template in all time windows, is positive and greater than a specified difference threshold); the analysis device may mark a cold event and send an alarm message for a time window in which the number of occurrences of the first log template is significantly lower (for example, the difference from the other time windows, or from the mean of the numbers of occurrences of the first log template in all time windows, is negative and its absolute value is greater than the specified difference threshold).
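As a rough sketch of this window-based check, the following uses fixed, non-overlapping windows; the window size, the threshold, and the comparison against the mean over all windows are assumptions drawn from the examples above, and the matching function is the kind of template match sketched earlier.

```python
from collections import Counter

def detect_anomalous_windows(timestamped_records, template, matches,
                             window_seconds=300, diff_threshold=10):
    """Count log records matching `template` per fixed time window and flag
    windows whose count deviates from the mean count by more than the
    threshold: hot events above the mean, cold events below it."""
    counts = Counter()
    for ts, record in timestamped_records:         # ts: unix timestamp, record: entry list
        if matches(record, template):
            counts[int(ts) // window_seconds] += 1
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return [(window, "hot" if count > mean else "cold")
            for window, count in counts.items()
            if abs(count - mean) > diff_threshold]
```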
It should be noted that the terminal may display the log templates of the log determined by the analysis device, and the user may specify a target log template. Correspondingly, the terminal receives a template selection instruction and sends the template selection instruction carrying the identifier of the target log template to the analysis device, and the analysis device performs anomaly detection on the target log template, where the detection process may refer to the anomaly detection process of the first log template. In this way, the analysis device can perform anomaly detection on a specific log template according to the user's indication, which improves the pertinence of anomaly detection and ensures user experience.
In another optional manner, the analysis device detects unknown events based on the log template of the log. The analysis device may match each log template with the log records in the log; when a log record matches none of the log templates, the log record is determined to be an unknown log record, and the event corresponding to the unknown log record is an unknown event, which may be an abnormal event.
In an anomaly detection scenario, the analysis device may also detect an anomaly in the log in other manners, which is not limited in the embodiment of the present application.
It should be noted that, when the log records in the log are grouped, the locality-sensitive hash code determined in the foregoing step 302 is used, so the distribution of the log records follows a hash distribution rule, that is, a key-value distribution rule, and load balancing can thus be achieved.
For the convenience of readers, the embodiment of the present application briefly introduces the hash distribution principle. Hash distribution is a data distribution method based on a hash function. The hash function is a function that obtains a value (also called a hash value) based on a key of the data (also called a key value, or a distribution key value in a distributed system). That is, value = f(key), where the function f is the hash function. Taking Table 2 as an example, assume that the hash function is f(key) = key mod 5, where "mod" represents the modulo operation, that is, the hash function is a modulo operation function. Then, assuming the keys are 1, 2, 3, 4, 5, 6, 7, 8 and 9, the corresponding values are 1, 2, 3, 4, 0, 1, 2, 3 and 4, respectively.
TABLE 2
key 1 2 3 4 5 6 7 8 9
value 1 2 3 4 0 1 2 3 4
As can be seen from the above, when the key is 1 or 6, the value is 1. Therefore, when a value is determined by using a hash function, different keys may correspond to the same value; this case is called a hash collision. The hash bucket algorithm is a special hash algorithm that can resolve hash collisions. A hash bucket is a container that holds a linked list of different keys (also called a hash table), also called an f(key) set or a value set; the values corresponding to the same hash bucket are the same. Referring to the foregoing example, the number of hash buckets may be set to the modulus, that is, 5, and the values correspond to the hash buckets one to one. For example, the values may be used as indexes or numbers of the hash buckets, each hash bucket stores the keys with the same value, and the colliding keys in the same hash bucket are stored by using a singly linked list, so that hash collisions are resolved. When searching for the data corresponding to a key, it is only necessary to index, through the key, the hash bucket corresponding to the value, then start searching from the node corresponding to the head address of the hash bucket, that is, search along the linked list and compare the keys until the corresponding key is found, and then index the corresponding data based on the found key. As shown in Table 2, keys 1 and 6 are stored in hash bucket 1; keys 2 and 7 are stored in hash bucket 2; keys 3 and 8 are stored in hash bucket 3; keys 4 and 9 are stored in hash bucket 4; and key 5 is stored in hash bucket 0.
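For ease of understanding, the following sketch (illustrative only) reproduces the hash bucket arrangement described above, with f(key) = key mod 5, one bucket per hash value, and colliding keys chained inside the bucket:

class HashBuckets:
    # One bucket per hash value; colliding keys are chained in a list,
    # mirroring the singly linked list described above.
    def __init__(self, modulus=5):
        self.modulus = modulus
        self.buckets = [[] for _ in range(modulus)]  # bucket index == value

    def put(self, key, data):
        value = key % self.modulus                   # value = f(key)
        self.buckets[value].append((key, data))

    def get(self, key):
        value = key % self.modulus
        for k, data in self.buckets[value]:          # walk the chain and compare keys
            if k == key:
                return data
        return None

hb = HashBuckets()
for key in range(1, 10):                             # keys 1..9 as in Table 2
    hb.put(key, "record-%d" % key)
print([k for k, _ in hb.buckets[1]])                 # -> [1, 6]: keys 1 and 6 share bucket 1
print(hb.get(6))                                     # -> 'record-6'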
It should be noted that the foregoing embodiment is described only by taking a modulo function as an example of the hash function. In fact, the hash function may also be a remainder function (in which case the number of hash buckets is still the modulus value), or another function, which is not limited in this embodiment.
With reference to the foregoing description, the embodiments of the present application may introduce the hash bucket algorithm to distribute the log records, so as to resolve hash collisions; in this case, the distributed data is generally identified in units of hash buckets. Thus, each of the foregoing first log record groups may be identified by a hash bucket, and each second log record group may also be identified by a hash bucket. Illustratively, each hash bucket has a bucket identifier, which may be determined by the corresponding grouping basis. For example, in step 303, if only the locality sensitive hash code is used for grouping, the bucket identifier satisfies the following first formula:
Id=f(lsh);
wherein Id represents the bucket identifier, lsh denotes the locality sensitive hash code, and f is a preset function.
If the locality sensitive hash codes and the target features are used for grouping, the bucket identifier meets the following second formula:
Id=f(x1,x2,…,xm,lsh);
wherein Id represents the bucket identifier; lsh denotes the locality sensitive hash code; f is a preset function; x1, x2, …, xm respectively represent the m features included in the target feature, and m is the total number of features included in the target feature.
For example, if the target feature includes only the length of the log record, then m is 1, x1 represents the length of the log record, and the bucket identifier satisfies the following third formula:
Id=f(x1,lsh)。
The order of the steps of the log template extraction method provided in the embodiments of the present application may be appropriately adjusted, and steps may be added or removed as required; for example, in other application scenarios such as log compression or keyword search, the foregoing step 305 may not be executed. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application and is therefore not described again.
In summary, in the embodiment of the present application, the log records are grouped by the locality sensitive hash code of each row of log records, and the locality sensitive hash code can reflect the similarity of the corresponding log records of different rows, so that the grouping achieves the same effect as the clustering, thereby effectively reducing the operation complexity.
In addition, in the embodiments of the present application, the locality sensitive hash code and the target feature are both characteristics of a single log record, and obtaining the locality sensitive hash code and the target feature of one row of log record does not require considering any other row of log record. Therefore, the rows of log records in the log are decoupled from one another in the grouping process, so that for one log the grouping of the plurality of rows of log records can be executed in parallel, which effectively reduces the operation delay and improves the operation efficiency.
When there are a plurality of first log record groups, the processing of the first log record groups can also be executed in parallel, which reduces the operation delay; moreover, the data volume to be processed in each processing pass is far smaller than the data volume of the whole log, which effectively reduces the operation overhead and improves the operation efficiency.
Further, in the foregoing step 304, in both the first optional processing manner and the second optional processing manner, most of the log records are screened out, so that the solution space is reduced exponentially, which effectively reduces the operation overhead and improves the operation efficiency.
When extracting the log template of a heterogeneous log containing about 50,000 log records, a conventional template extraction method needs about 5 seconds under ideal conditions to completely extract the log template. With the log template extraction method provided in the embodiments of the present application, complete extraction of the log template can be achieved in about 1 second under ideal conditions; compared with the conventional method, this effectively reduces the operation delay, improves the operation performance, and improves user experience.
An embodiment of the present application provides a log template extracting apparatus 40, as shown in fig. 21, the apparatus includes:
a first determining module 401, configured to determine a locality sensitive hash code of each row of log records in multiple rows of log records of a log;
a second determining module 402, configured to determine at least one first log record group, different first log record groups comprising different row log records in the log; each of the first set of log records includes all log records having the same locality-sensitive hash code;
a processing module 403, configured to obtain a log template of the log by processing each first log record group in the at least one first log record group.
In summary, in the embodiment of the present application, the second determining module groups log records through the locality sensitive hash code of each row of log records, and the locality sensitive hash code can reflect the similarity of corresponding log records of different rows, so that the grouping achieves the same effect as the clustering, thereby effectively reducing the operation complexity.
Optionally, as shown in fig. 22, the first determining module 401 includes:
the obtaining submodule 4011 is configured to obtain at least one entry of each row of log records in the log;
the first determining sub-module 4012 is configured to determine the locality sensitive hash code of each row of log records based on at least one entry of each row of log records.
Optionally, each entry includes m semantic units, where m is an integer greater than 1 and a semantic unit is a word or a symbol; for a log record including at least two entries, the last m-1 semantic units of the first entry in every two adjacent entries are the same as the first m-1 semantic units of the second entry, and the first entry is the entry preceding the second entry.
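Illustratively, such entries amount to sliding n-grams of semantic units over the log record; the following sketch (the tokenization rule and m = 2 are assumptions) builds entries of m consecutive semantic units in which every two adjacent entries overlap by m-1 units:

import re

def split_semantic_units(log_record):
    # Semantic units are words or symbols; split into words and keep each
    # punctuation symbol as its own unit (a simplifying assumption).
    return re.findall(r"\w+|[^\w\s]", log_record)

def make_entries(log_record, m=2):
    # Entries of m consecutive semantic units; adjacent entries share m-1 units.
    units = split_semantic_units(log_record)
    if len(units) < m:
        return [tuple(units)]
    return [tuple(units[i:i + m]) for i in range(len(units) - m + 1)]

print(make_entries("open file failed: timeout", m=2))
# -> [('open', 'file'), ('file', 'failed'), ('failed', ':'), (':', 'timeout')]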
Optionally, the first determining module 401 is configured to:
replacing p designated characters in each row of log record in the log with q fixed characters to obtain an updated each row of log record, where q is greater than or equal to 1 and less than p; and determining the locality sensitive hash code of each row of log record based on the updated each row of log record.
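For instance, the replacement can be implemented as a simple regular-expression substitution; in the sketch below (the choice of digits as the designated characters and '0' as the fixed character are assumptions), every run of digits is replaced by a single fixed character:

import re

def normalize_record(log_record):
    # Replace each run of designated characters (digits here) with one fixed
    # character before computing the locality sensitive hash code.
    return re.sub(r"\d+", "0", log_record)

print(normalize_record("worker 12345 finished in 872 ms"))
# -> 'worker 0 finished in 0 ms'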
Optionally, the first determining sub-module 4012 is configured to:
determining the locality sensitive hash code of a first log record based on a plurality of entries of the first log record and a weight value allocated to each entry, where the first log record is any row of log record that includes a plurality of entries, and the weight values of at least two entries included in the first log record are different from each other; for example, the weight value of each entry is determined based on the position of the entry in the first log record.
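One common way to realize such a weighted locality sensitive hash code is a SimHash-style computation; the embodiments do not prescribe SimHash, so the sketch below (including the position-based weighting scheme and the use of MD5 as the per-entry hash) is only an assumed illustration:

import hashlib

def weighted_lsh(entries, bits=64):
    # Accumulate a weighted vote per bit; entries earlier in the record get
    # larger weights, reflecting weights that depend on entry position.
    acc = [0.0] * bits
    n = len(entries)
    for pos, entry in enumerate(entries):
        weight = n - pos                                   # assumed weighting scheme
        h = int(hashlib.md5(str(entry).encode()).hexdigest(), 16)
        for b in range(bits):
            acc[b] += weight if (h >> b) & 1 else -weight
    return sum(1 << b for b in range(bits) if acc[b] > 0)  # sign of each vote -> one bit

entries = [("open", "file"), ("file", "failed"), ("failed", "timeout")]
print(hex(weighted_lsh(entries)))

Records whose entries largely coincide produce identical or nearly identical codes, which is why grouping records with the same code can approximate clustering as described above.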
Optionally, the second determining module 402 is configured to:
and grouping a plurality of rows of log records in the log based on the locality sensitive hash code of each row of log records in the log and the target characteristics of each row of log records to obtain at least one first log record group.
Optionally, as shown in fig. 23, the processing module 403 includes:
an extraction submodule 4031, configured to extract a log template in each first log record group respectively;
a second determining submodule 4032, configured to determine, based on a log template of each of the first log record groups, a log template of the log.
Optionally, the extracting sub-module 4031 is configured to:
for each row of log records in each of the first log record groups, comparing the log records to a historical log template for the first log record group; when the log record is matched with the historical log template, determining a new log template of the first log record group based on the log record and the historical log template; and when the log record is not matched with the historical log template, adding the log template extracted from the log record as a new log template of the first log record group.
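A minimal online version of this matching-and-merging step might look as follows (the whitespace tokenization, the similarity measure, and the 0.6 threshold are assumptions, not taken from the embodiments):

WILDCARD = "<*>"

def merge(template, tokens):
    # Keep positions that agree; replace differing positions with the wildcard.
    return [t if t == r or t == WILDCARD else WILDCARD
            for t, r in zip(template, tokens)]

def similarity(template, tokens):
    if len(template) != len(tokens):
        return 0.0
    same = sum(1 for t, r in zip(template, tokens) if t == r or t == WILDCARD)
    return same / len(tokens)

def update_templates(templates, log_record, threshold=0.6):
    # Match the record against the historical templates of its group:
    # on a match, merge the record into that template; otherwise the record
    # itself becomes a new template of the group.
    tokens = log_record.split()
    for i, tpl in enumerate(templates):
        if similarity(tpl, tokens) >= threshold:
            templates[i] = merge(tpl, tokens)
            return templates
    templates.append(tokens)
    return templates

templates = []
for rec in ["open file a.txt failed", "open file b.txt failed", "close file b.txt ok"]:
    update_templates(templates, rec)
print([" ".join(t) for t in templates])
# -> ['open file <*> failed', 'close file b.txt ok']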
Optionally, the second determining sub-module 4032 is configured to:
clustering the log templates of the at least one first log record group to obtain the log templates of the logs; or, the log templates of the at least one first log record group are merged to obtain the log template of the log; or, the log template of the at least one first log record group is used as the log template of the log.
Optionally, the processing module 403 includes:
a third determining sub-module, configured to determine a log template of the log based on a target log record of each of the first log record groups, where the target log record of the first log record group is a partial log record in the first log record group.
Optionally, the target log record of the first log record group is a randomly selected row of log records in the first log record group.
Optionally, the third determining sub-module is configured to:
determining at least one second log record group, wherein different second log record groups comprise different target log records in the target log records corresponding to the at least one first log record group; each second log record group comprises all target log records with the same target characteristics; and processing each second log record group respectively to obtain a log template of the log.
Optionally, the third determining sub-module is configured to:
clustering the log records in each second log record group to obtain at least one type of log records corresponding to each second log record group; respectively carrying out template extraction on each type of log record obtained by clustering to obtain a log template of each type of log record; and determining a log template of the log based on the log template of each log record in the at least one type of log record.
Optionally, the third determining sub-module is configured to:
taking the log template of each type of log record as the log template of the log; or merging the log templates of various log records obtained by clustering, and taking the merged log template as the log template of the log.
Optionally, the target characteristics of the log record include: at least one of a length of the log record, a first character of the log record, and a first word of the log record.
Alternatively, FIG. 24 schematically provides one possible basic hardware architecture for a computing device as described herein. The computing device may be a server.
Referring to fig. 24, computing device 500 includes a processor 501, memory 502, a communication interface 503, and a bus 504.
In the computing device 500, the number of the processors 501 may be one or more, and fig. 24 illustrates only one of the processors 501. Alternatively, the processor 501 may be a Central Processing Unit (CPU). If the computing device 500 has multiple processors 501, the types of the multiple processors 501 may be different, or may be the same. Optionally, multiple processors 501 of computing device 500 may also be integrated into a multi-core processor.
Memory 502 stores computer instructions and data; the memory 502 may store computer instructions and data required to implement the log template extraction methods provided herein, e.g., the memory 502 stores instructions for implementing the steps of the log template extraction methods. The memory 502 may be any one or any combination of the following storage media: nonvolatile memory (e.g., Read Only Memory (ROM), Solid State Disk (SSD), hard disk (HDD), optical disk), volatile memory.
The communication interface 503 may be any one or any combination of the following devices: a network interface (e.g., an ethernet interface), a wireless network card, etc. having a network access function.
Communication interface 503 is used for data communication by computing device 500 with other computing devices or terminals.
The bus 504 may connect the processor 501 with the memory 502 and the communication interface 503. Thus, via bus 504, processor 501 may access memory 502 and may also interact with other computing devices or terminals via communication interface 503.
In the present application, the computing device 500 executes computer instructions in the memory 502, causing the computing device 500 to implement the log template extraction method provided herein, or causing the computing device 500 to deploy the log template extraction apparatus.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, e.g., a memory comprising instructions, executable by a processor of a server to perform the log template extraction method shown in various embodiments of the present application is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
An embodiment of the present application provides an analysis system, including a terminal and an analysis device, where the analysis device includes any one of the foregoing log template extraction apparatuses.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When software is used, the implementation may be wholly or partially in the form of a computer program product, which includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium, or a semiconductor medium (for example, a solid state disk).
It should be noted that: in the log template extraction apparatus provided in the foregoing embodiment, when performing log template extraction, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the log template extraction device provided by the above embodiment and the log template extraction method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
In this application, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise. "A refers to B" means that A is the same as B, or that A is a simple variation of B.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (32)

1. A log template extraction method is characterized by comprising the following steps:
determining a locality sensitive hash code of each row of log record in a plurality of rows of log records of the log;
determining at least one first log record group, different said first log record groups comprising different row log records in said log; each of the first set of log records includes all log records having the same locality-sensitive hash code;
and obtaining a log template of the log by processing each first log record group in the at least one first log record group.
2. The method of claim 1, wherein determining the locality-sensitive hash code for each of a plurality of rows of log records of the log comprises:
acquiring at least one entry recorded by each row of log in the log;
and determining the locality sensitive hash code of each row of log records based on at least one entry of each row of log records.
3. The method of claim 2, wherein each entry comprises m semantic units, m being an integer greater than 1, the semantic units being words or symbols, and wherein for a log record comprising at least two entries, the last m-1 semantic units of a first entry in each two adjacent entries are the same as the first m-1 semantic units of a second entry, the first entry being the previous entry of the second entry.
4. The method of any of claims 1 to 3, wherein determining the locality-sensitive hash code for each of the plurality of rows of log records of the log comprises:
replacing p designated characters in each row of log records in the log with q fixed characters to obtain updated each row of log records, wherein q is greater than or equal to 1 and is less than p;
and determining the locality sensitive hash code of each row of log record based on the updated each row of log record.
5. The method of claim 2 or 3, wherein determining the locality sensitive hash code for the per-line log record based on the at least one entry for the per-line log record comprises:
determining a locality sensitive hash code of a first log record based on a plurality of entries of the first log record and a weight value allocated to each entry, wherein the first log record is any one of a plurality of rows of log records including the plurality of entries, the weight values of at least two entries included in the first log record are different from each other, and the weight value of each entry is determined based on the position of the entry in the first log record.
6. The method of any of claims 1 to 5, wherein determining at least one first log record group comprises:
and grouping a plurality of rows of log records in the log based on the locality sensitive hash code of each row of log records in the log and the target characteristics of each row of log records to obtain at least one first log record group.
7. The method according to any one of claims 1 to 6, wherein obtaining the log template of the log by processing each of the at least one first log record group comprises:
respectively extracting a log template in each first log record group;
determining a log template for the log based on the log template for each of the first log record groups.
8. The method of claim 7, wherein said separately extracting the log template in each of the first log record groups comprises:
for each row of log records in each of the first log record groups, comparing the log records to a historical log template for the first log record group;
when the log record is matched with the historical log template, determining a new log template of the first log record group based on the log record and the historical log template;
and when the log record is not matched with the historical log template, adding the log template extracted from the log record as a new log template of the first log record group.
9. The method of claim 7, wherein determining the log template for the log based on the log template for each of the first set of log records comprises:
clustering the log templates of the at least one first log record group to obtain the log templates of the logs;
or, the log templates of the at least one first log record group are merged to obtain the log template of the log;
or, the log template of the at least one first log record group is used as the log template of the log.
10. The method according to any one of claims 1 to 6, wherein obtaining the log template of the log by processing each of the at least one first log record group comprises:
determining a log template for the log based on a target log record of each of the first log record groups, the target log record of the first log record group being a partial log record of the first log record group.
11. The method of claim 10, wherein the target log record of the first set of log records is a randomly selected row of log records in the first set of log records.
12. The method of claim 10, wherein determining a log template for the log based on the target log record for each of the first set of log records comprises:
determining at least one second log record group, wherein different second log record groups comprise different target log records in the target log records corresponding to the at least one first log record group; each second log record group comprises all target log records with the same target characteristics;
and processing each second log record group respectively to obtain a log template of the log.
13. The method of claim 12, wherein obtaining the log template of the log by separately processing each of the second log record groups comprises:
clustering the log records in each second log record group to obtain at least one type of log records corresponding to each second log record group;
respectively carrying out template extraction on each type of log record obtained by clustering to obtain a log template of each type of log record;
and determining a log template of the log based on the log template of each log record in the at least one type of log record.
14. The method of claim 13, wherein determining the log template for the log based on the log template for each of the at least one type of log record comprises:
taking the log template of each type of log record as the log template of the log;
or merging the log templates of various log records obtained by clustering, and taking the merged log template as the log template of the log.
15. The method of claim 6 or 12, wherein the target characteristics of the log records comprise: at least one of a length of the log record, a first character of the log record, and a first word of the log record.
16. An apparatus for extracting a log template, the apparatus comprising:
the first determining module is used for determining the locality sensitive hash code of each row of log record in a plurality of rows of log records of the log;
a second determining module for determining at least one first log record group, different first log record groups comprising different row log records in the log; each of the first set of log records includes all log records having the same locality-sensitive hash code;
and the processing module is used for processing each first log record group in the at least one first log record group to obtain a log template of the log.
17. The apparatus of claim 16, wherein the first determining module comprises:
the obtaining submodule is used for obtaining at least one entry of each row of log record in the log;
a first determining submodule, configured to determine a locality sensitive hash code of each row of log records based on at least one entry of each row of log records.
18. The apparatus of claim 17, wherein each entry comprises m semantic units, m being an integer greater than 1, the semantic units being words or symbols, and wherein for a log record comprising at least two entries, the last m-1 semantic units of a first entry in each two adjacent entries are the same as the first m-1 semantic units of a second entry, the first entry being the previous entry of the second entry.
19. The apparatus of any one of claims 16 to 18, wherein the first determining module is configured to:
replacing p designated characters in each row of log records in the log with q fixed characters to obtain updated each row of log records, wherein q is greater than or equal to 1 and is less than p;
and determining the locality sensitive hash code of each row of log record based on the updated each row of log record.
20. The apparatus of claim 17 or 18, wherein the first determining submodule is configured to:
determining a locality sensitive hash code of a first log record based on a plurality of entries of the first log record and a weight value allocated to each entry, wherein the first log record is any one of a plurality of rows of log records including the plurality of entries, the weight values of at least two entries included in the first log record are different from each other, and the weight value of each entry is determined based on the position of the entry in the first log record.
21. The apparatus according to any of the claims 16 to 20, wherein the second determining means is configured to:
and grouping a plurality of rows of log records in the log based on the locality sensitive hash code of each row of log records in the log and the target characteristics of each row of log records to obtain at least one first log record group.
22. The apparatus of any one of claims 16 to 21, wherein the processing module comprises:
the extraction submodule is used for respectively extracting the log template in each first log record group;
and the second determining submodule is used for determining the log template of the log based on the log template of each first log record group.
23. The apparatus of claim 22, wherein the extraction sub-module is configured to:
for each row of log records in each of the first log record groups, comparing the log records to a historical log template for the first log record group;
when the log record is matched with the historical log template, determining a new log template of the first log record group based on the log record and the historical log template;
and when the log record is not matched with the historical log template, adding the log template extracted from the log record as a new log template of the first log record group.
24. The apparatus of claim 22, wherein the second determining submodule is configured to:
clustering the log templates of the at least one first log record group to obtain the log templates of the logs;
or, the log templates of the at least one first log record group are merged to obtain the log template of the log;
or, the log template of the at least one first log record group is used as the log template of the log.
25. The apparatus according to any one of claims 16 to 21, wherein the processing module comprises:
a third determining sub-module, configured to determine a log template of the log based on a target log record of each of the first log record groups, where the target log record of the first log record group is a partial log record in the first log record group.
26. The apparatus of claim 25, wherein the target log record of the first set of log records is a randomly selected row of log records in the first set of log records.
27. The apparatus of claim 25, wherein the third determining submodule is configured to:
determining at least one second log record group, wherein different second log record groups comprise different target log records in the target log records corresponding to the at least one first log record group; each second log record group comprises all target log records with the same target characteristics;
and processing each second log record group respectively to obtain a log template of the log.
28. The apparatus of claim 27, wherein the third determining submodule is configured to:
clustering the log records in each second log record group to obtain at least one type of log records corresponding to each second log record group;
respectively carrying out template extraction on each type of log record obtained by clustering to obtain a log template of each type of log record;
and determining a log template of the log based on the log template of each log record in the at least one type of log record.
29. The apparatus of claim 28, wherein the third determining submodule is configured to:
taking the log template of each type of log record as the log template of the log;
or merging the log templates of various log records obtained by clustering, and taking the merged log template as the log template of the log.
30. The apparatus of claim 21 or 27, wherein the target characteristics of the log record comprise: at least one of a length of the log record, a first character of the log record, and a first word of the log record.
31. A computer device comprising a processor and a memory;
the computer device, when executing the computer instructions stored by the memory, performs the template extraction method of any of claims 1 to 15.
32. A computer-readable storage medium comprising computer instructions that direct a computer device to perform the template extraction method of any of claims 1 to 15.
CN201911215541.5A 2019-10-12 2019-12-02 Log template extraction method and device Pending CN111160021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/096134 WO2021068547A1 (en) 2019-10-12 2020-06-15 Log schema extraction method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910969835 2019-10-12
CN2019109698350 2019-10-12

Publications (1)

Publication Number Publication Date
CN111160021A true CN111160021A (en) 2020-05-15

Family

ID=70556284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215541.5A Pending CN111160021A (en) 2019-10-12 2019-12-02 Log template extraction method and device

Country Status (2)

Country Link
CN (1) CN111160021A (en)
WO (1) WO2021068547A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737950A (en) * 2020-08-27 2020-10-02 北京安帝科技有限公司 Log carrier format extraction method and device based on natural language
CN112068979A (en) * 2020-09-11 2020-12-11 重庆紫光华山智安科技有限公司 Service fault determination method and device
WO2021068547A1 (en) * 2019-10-12 2021-04-15 华为技术有限公司 Log schema extraction method and apparatus
CN116226681A (en) * 2023-02-22 2023-06-06 北京麦克斯泰科技有限公司 Text similarity judging method and device, computer equipment and storage medium
CN116346729A (en) * 2023-02-24 2023-06-27 安芯网盾(北京)科技有限公司 Data log reporting current limiting method and system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220644B (en) * 2021-05-28 2022-04-26 北京微纳星空科技有限公司 File processing method, device, equipment and storage medium
CN113535955B (en) * 2021-07-16 2022-10-28 中国工商银行股份有限公司 Method and device for quickly classifying logs
CN115329748B (en) * 2022-10-14 2023-01-10 北京优特捷信息技术有限公司 Log analysis method, device, equipment and storage medium
CN115860836B (en) * 2022-12-07 2023-09-26 广东南粤分享汇控股有限公司 E-commerce service pushing method and system based on user behavior big data analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049247A (en) * 2015-07-06 2015-11-11 中国科学院信息工程研究所 Network safety log template extraction method and device
CN105205397A (en) * 2015-10-13 2015-12-30 北京奇虎科技有限公司 Rogue program sample classification method and device
US20170220663A1 (en) * 2016-01-29 2017-08-03 AppDynamics, Inc. Log Event Summarization for Distributed Server System
CN109144964A (en) * 2018-08-21 2019-01-04 杭州安恒信息技术股份有限公司 log analysis method and device based on machine learning
CN109981625A (en) * 2019-03-18 2019-07-05 中国人民解放军陆军炮兵防空兵学院郑州校区 A kind of log template abstracting method based on online hierarchical clustering

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107659566B (en) * 2017-09-20 2021-01-19 深圳市创梦天地科技股份有限公司 Method and device for determining identification frequency of abnormal access of server and server
CN111160021A (en) * 2019-10-12 2020-05-15 华为技术有限公司 Log template extraction method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049247A (en) * 2015-07-06 2015-11-11 中国科学院信息工程研究所 Network safety log template extraction method and device
CN105205397A (en) * 2015-10-13 2015-12-30 北京奇虎科技有限公司 Rogue program sample classification method and device
US20170220663A1 (en) * 2016-01-29 2017-08-03 AppDynamics, Inc. Log Event Summarization for Distributed Server System
CN109144964A (en) * 2018-08-21 2019-01-04 杭州安恒信息技术股份有限公司 log analysis method and device based on machine learning
CN109981625A (en) * 2019-03-18 2019-07-05 中国人民解放军陆军炮兵防空兵学院郑州校区 A kind of log template abstracting method based on online hierarchical clustering

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021068547A1 (en) * 2019-10-12 2021-04-15 华为技术有限公司 Log schema extraction method and apparatus
CN111737950A (en) * 2020-08-27 2020-10-02 北京安帝科技有限公司 Log carrier format extraction method and device based on natural language
CN112068979A (en) * 2020-09-11 2020-12-11 重庆紫光华山智安科技有限公司 Service fault determination method and device
CN116226681A (en) * 2023-02-22 2023-06-06 北京麦克斯泰科技有限公司 Text similarity judging method and device, computer equipment and storage medium
CN116226681B (en) * 2023-02-22 2023-11-28 北京麦克斯泰科技有限公司 Text similarity judging method and device, computer equipment and storage medium
CN116346729A (en) * 2023-02-24 2023-06-27 安芯网盾(北京)科技有限公司 Data log reporting current limiting method and system
CN116346729B (en) * 2023-02-24 2024-02-09 安芯网盾(北京)科技有限公司 Data log reporting current limiting method and system

Also Published As

Publication number Publication date
WO2021068547A1 (en) 2021-04-15

Similar Documents

Publication Publication Date Title
CN111160021A (en) Log template extraction method and device
US11238069B2 (en) Transforming a data stream into structured data
JP6643211B2 (en) Anomaly detection system and anomaly detection method
US20160092552A1 (en) Method and system for implementing efficient classification and exploration of data
US20220004878A1 (en) Systems and methods for synthetic document and data generation
US20220342921A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US11113317B2 (en) Generating parsing rules for log messages
CN111612041A (en) Abnormal user identification method and device, storage medium and electronic equipment
US10331648B2 (en) Method, device and medium for knowledge base construction
US20180349250A1 (en) Content-level anomaly detector for systems with limited memory
US20170206458A1 (en) Computer-readable recording medium, detection method, and detection apparatus
US10824694B1 (en) Distributable feature analysis in model training system
WO2021109724A1 (en) Log anomaly detection method and apparatus
WO2020140624A1 (en) Method for extracting data from log, and related device
US20130198147A1 (en) Detecting statistical variation from unclassified process log
US10127192B1 (en) Analytic system for fast quantile computation
WO2016093839A1 (en) Structuring of semi-structured log messages
US20200110815A1 (en) Multi contextual clustering
US10509712B2 (en) Methods and systems to determine baseline event-type distributions of event sources and detect changes in behavior of event sources
CN113760891A (en) Data table generation method, device, equipment and storage medium
CN115051863B (en) Abnormal flow detection method and device, electronic equipment and readable storage medium
CN113128213A (en) Log template extraction method and device
WO2023154779A2 (en) Methods and systems for identifying anomalous computer events to detect security incidents
US11822578B2 (en) Matching machine generated data entries to pattern clusters
US9235639B2 (en) Filter regular expression

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220211

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.