WO2021109724A1 - Log anomaly detection method and apparatus - Google Patents

Log anomaly detection method and apparatus

Info

Publication number
WO2021109724A1
Authority
WO
WIPO (PCT)
Prior art keywords
log
fragments
fragment
distance
abnormal
Prior art date
Application number
PCT/CN2020/121544
Other languages
English (en)
Chinese (zh)
Inventor
王琛
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021109724A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F 11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F 16/1815 Journaling file systems

Definitions

  • This application relates to the field of computer technology, and in particular to a log abnormality detection method and device.
  • during software operation, its real-time status can be recorded in a file in the form of text.
  • such a file is called a log or log file.
  • a log includes multiple lines of log records (also called log statements), and each line of log records is used to record an event when the software is running.
  • by analyzing the log, software developers or operation and maintenance staff can learn the running status of the software and locate problems.
  • the log records in the log usually follow an implicit log template (schema), that is, a pattern or format of the record itself.
  • in one existing approach, the analysis device compares the log (or the log within a period of time) with a specified reference log, obtains the change between the templates of the two logs, and presents the change; the software developer then recognizes anomalies in the log based on the presented content.
  • the embodiments of the present application provide a log anomaly detection method and device, which can solve the problem of relatively high computational cost of the current log anomaly detection method.
  • the technical solution is as follows:
  • in a first aspect, a log anomaly detection method is provided, and the method includes:
  • obtaining multiple log fragments of a log, where each of the multiple log fragments includes multiple lines of log records in the log, and at least one line of log records differs between different log fragments; determining the distance between every two log fragments in the multiple log fragments; and determining, based on the distance between every two log fragments, whether there is an abnormal log fragment in the multiple log fragments.
  • in this way, the distance between every two log fragments is obtained, and based on the obtained distances it is determined whether there is an abnormal log fragment among the multiple log fragments, so that the abnormal log fragment can be located without manually identifying whether the log contains an abnormality, which effectively improves the efficiency of log anomaly detection.
  • in an optional implementation, the process of determining the distance between every two log fragments in the multiple log fragments includes: determining, based on the locally sensitive hash code of each log fragment in the multiple log fragments, the distance between every two log fragments in the multiple log fragments.
  • because the locally sensitive hash code of a log fragment reflects its similarity to other log fragments, the distance between the locally sensitive hash codes of every two log fragments can be obtained and determined as the distance between the corresponding two log fragments. In this way, the distance between every two log fragments can be determined quickly.
  • optionally, the method further includes: determining the locally sensitive hash code of each log fragment in the multiple log fragments based on multiple entries of each log fragment. In this way, the data granularity when obtaining the locally sensitive hash code of each log fragment is the entry, so fewer operations are required and the operation cost can be reduced.
  • optionally, the process of determining the locally sensitive hash code of each log fragment based on the multiple entries of each log fragment includes: deduplicating the multiple entries of each log fragment to obtain an entry set; and determining the locally sensitive hash code of each log fragment based on the entry set corresponding to each log fragment.
  • the deduplication processing can reduce the number of entries used in the subsequent calculation of the locally sensitive hash code, thereby improving the operation efficiency of the analysis device.
  • optionally, the process of determining the locally sensitive hash code of each log fragment includes: calculating the sum of the hash codes of all entries in the entry set corresponding to each log fragment; and performing dimensionality reduction on the sum of hash codes corresponding to each log fragment to obtain the locally sensitive hash code of each log fragment.
  • calculating the sum of the hash codes of all entries in the entry set corresponding to each log fragment is equivalent to giving all entries a weight of 1, so the calculation delay is shorter and the calculation efficiency is higher. It is also equivalent to amplifying the influence of abnormal entries, which increases the probability of identifying abnormal log fragments.
  • the process of determining whether there is an abnormal log fragment in the plurality of log fragments based on the distance between every two log fragments in the plurality of log fragments includes:
  • determining the K-distance of each log fragment in the multiple log fragments, where the K-distance of any log fragment is the distance between that log fragment and the K-th closest log fragment to it among the multiple log fragments, K is a positive integer, K is less than G, and G is the total number of the multiple log fragments; and determining, based on the K-distance of each log fragment, whether there is an abnormal log fragment among the multiple log fragments.
  • in this way, the detection process needs to be executed only once, which effectively simplifies the detection of abnormal log fragments, reduces computational complexity, and improves the efficiency of abnormality determination.
  • the process of determining whether there is an abnormal log fragment in the plurality of log fragments includes:
  • determining a target value range [μ-3σ, μ+3σ], where μ is the mean of the K-distances of the multiple log fragments and σ is the standard deviation of the K-distances of the multiple log fragments; when the K-distance of any one of the multiple log fragments is not in the target value range [μ-3σ, μ+3σ], determining that the log fragment is an abnormal log fragment;
  • or, for a first log fragment in the multiple log fragments, determining a target value range [μ'-3σ', μ'+3σ'], where μ' is the mean of the K-distances of the log fragments other than the first log fragment among the multiple log fragments, and σ' is the standard deviation of the K-distances of the log fragments other than the first log fragment; when the K-distance of the first log fragment is not within the target value range [μ'-3σ', μ'+3σ'], determining that the first log fragment is an abnormal log fragment;
  • or, determining the entropy value corresponding to each log fragment in the multiple log fragments, where the entropy value corresponding to any one log fragment is the entropy of the K-distances of the remaining log fragments after that log fragment is removed from the multiple log fragments; the log fragment corresponding to the largest entropy value is determined to be an abnormal log fragment.
  • the process of determining whether there is an abnormal log fragment in the plurality of log fragments based on the distance between every two log fragments in the plurality of log fragments includes:
  • the analysis device divides the multiple log fragments into multiple log fragment sets based on the distance between every two log fragments, where each log fragment set includes at least one log fragment; when the number of log fragments in any log fragment set is less than a specified number threshold, the log fragments in that set are determined to be abnormal log fragments; when the number of log fragments in any log fragment set is not less than the specified number threshold, the log fragments in that set are determined not to be abnormal log fragments.
  • the analysis device can divide the log into fragments according to the principle of equal division to ensure the accuracy of the abnormal log fragments finally located. For example, among the log fragments obtained by division, different log fragments include the same number of rows of log records; or, different log fragments include log records of the same amount of data.
  • the distance between every two log fragments is negatively correlated with the similarity of the two log fragments: the smaller the distance, the higher the similarity; the larger the distance, the lower the similarity.
  • the analysis device may determine whether there is an abnormal log fragment in the multiple log fragments based on the similarity between every two log fragments in the plurality of log fragments.
  • the process includes: when the similarity between any log fragment and the other log fragments is less than a similarity threshold, determining that the log fragment is an abnormal log fragment; when the similarity between any log fragment and the other log fragments is not less than the similarity threshold, determining that the log fragment is not an abnormal log fragment.
  • the analysis device may determine the K-distance of each log fragment in the plurality of log fragments based on the similarity between every two log fragments in the plurality of log fragments.
  • the K-distance of any log fragment in the multiple log fragments is the similarity between that log fragment and the K-th farthest log fragment from it among the multiple log fragments, where K is a positive integer, K is less than G, and G is the total number of the multiple log fragments; this K-distance therefore differs in definition from the aforementioned K-distance determined based on the distance between every two log fragments; based on the K-distance of each log fragment in the multiple log fragments, it is determined whether there is an abnormal log fragment among the multiple log fragments.
  • in a second aspect, the embodiments of the present application provide a log anomaly detection method, which includes:
  • obtaining multiple log fragments of a log, where each of the multiple log fragments includes multiple lines of log records in the log, and at least one line of log records differs between different log fragments; determining the similarity between every two log fragments in the multiple log fragments; and determining, based on the similarity between every two log fragments, whether there is an abnormal log fragment in the multiple log fragments.
  • in a third aspect, a log anomaly detection device is provided; the device may include at least one module, and the at least one module may be used to implement the log anomaly detection method provided by the foregoing first aspect, second aspect, or their various possible implementations.
  • in a fourth aspect, the present application provides a computer device including a processor and a memory.
  • the memory stores computer instructions; when the processor executes the computer instructions stored in the memory, the computer device executes the methods provided by the foregoing first aspect, second aspect, or their various possible implementations, or the computer device deploys the log anomaly detection device provided by the foregoing third aspect or its various possible implementations.
  • the present application further provides a computer-readable storage medium having computer instructions stored therein; the computer instructions instruct a computer device to execute the methods provided by the first aspect, the second aspect, or their various possible implementations, or instruct the computer device to deploy the log anomaly detection device provided by the third aspect or its various possible implementations.
  • the present application provides a computer program product.
  • the computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device can read the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device executes the methods provided by the first aspect, the second aspect, or their various possible implementations, or deploys the log anomaly detection device provided by the foregoing third aspect or its various possible implementations.
  • an analysis system is further provided, including a terminal and an analysis device, where the analysis device includes the log anomaly detection device described in the third aspect or its various possible implementations, or the computer device described in the fourth aspect.
  • a chip is provided.
  • the chip may include a programmable logic circuit and/or program instructions, and when the chip runs, it is used to implement the methods provided by the first aspect, the second aspect, or their various possible implementations.
  • in the technical solutions provided by the embodiments of the present application, the distance between every two log fragments in the multiple log fragments is obtained, and based on the obtained distances it is determined whether there is an abnormal log fragment among the multiple log fragments, so that the abnormal log fragment can be located without manually identifying whether the log contains an abnormality, which effectively improves the efficiency of log anomaly detection.
  • the log abnormality detection method provided by the embodiment of the application can support the log abnormality detection function.
  • on the one hand, the log anomaly detection function can be triggered manually or automatically, for example at a specified time point or within a specified time period, or executed periodically and automatically; on another hand, the log anomaly detection function does not need a specified reference log; on yet another hand, the log anomaly detection function can identify abnormal log fragments, based on which abnormal log records can be accurately located.
  • therefore, the flexibility of anomaly detection is higher, the implementation process is simple, and abnormal log fragments can be located, thereby effectively improving the efficiency of log anomaly detection.
  • the log abnormality detection method provided by the embodiment of the present application locates abnormal log fragments by comparing the similarity of the contents of multiple log fragments of the log, and can detect unknown abnormal log fragments.
  • abnormal log fragments can be quickly located, effectively reducing the time complexity and space complexity of log positioning.
  • FIG. 1 is a schematic diagram of part of log content in a log provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an application environment involved in a log abnormality detection method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of an application environment involved in another log abnormality detection method provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a log abnormality detection method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a process of fragmenting real-time log data according to an embodiment of the present application
  • FIG. 6 is a schematic diagram of a process of fragmenting log data into batches according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a processing flow of a locally sensitive hash algorithm provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a word segmentation result obtained by using a space segmentation method to perform word segmentation according to an embodiment of the present application
  • FIG. 9 is a schematic diagram of another word segmentation result obtained by using a space segmentation method according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a word segmentation result obtained by using a special character word segmentation method according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a process of obtaining a locally sensitive hash code provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of another local sensitive hash code acquisition process provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a distance matrix provided by an embodiment of this application.
  • FIG. 14 is a schematic diagram of the distribution of spatial points corresponding to log fragments according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a normal distribution principle provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a log abnormality detection device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of another log abnormality detection device provided by an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of part of the log content in a log.
  • the log includes multiple lines of log records (also called log text), and each line of log records (also called each log record) is used to record an event when the software is running.
  • Each log record is composed of multiple characters, and the multiple characters may include letters and/or symbols.
  • anomaly detection can be performed on the log, so that software developers or operation and maintenance staff can optimize software performance based on the detected abnormal situation.
  • however, whether there is an abnormality in the current log still needs to be identified manually, so the efficiency of anomaly detection is low.
  • the embodiment of the present application provides a log abnormality detection method, which can improve the efficiency of log abnormality detection.
  • FIG. 2 is a schematic diagram of an application environment involved in a log abnormality detection method provided by an embodiment of the present application.
  • the application environment includes a terminal 110, an analysis device 120, and a network device 130.
  • the terminal 110 may be a device capable of interacting with a user, such as a display, a computer, a smartphone, a tablet computer, or a laptop computer.
  • the analysis device 120 may be a server, a server cluster composed of several servers, or another device capable of performing data analysis.
  • the analysis device 120 may be a cloud server (also referred to as a cloud computing server), for example, a deep learning server used to provide a deep learning service (DLS).
  • the terminal 110 establishes a wired or wireless communication connection with the analysis device 120 through a communication network.
  • the network device 130 may be a device capable of running software and generating log data, such as a sensor or a terminal.
  • the network device 130 is used to provide the analysis device 120 with data to be analyzed, the analysis device 120 is used to analyze log data, and the terminal 110 is used to present the analysis result to the user.
  • the communication network involved in the embodiments of this application is a second-generation (2G) communication network, a third-generation (3G) communication network, a Long Term Evolution (LTE) communication network, a fifth-generation (5G) communication network, or the like.
  • the aforementioned application environment may also include a storage device, which is used to store data required by the terminal 110, the analysis device 120, and/or the network device 130.
  • the storage device may be a distributed storage device.
  • the terminal 110, the analysis device 120 and/or the network device 130 can read and write data stored in the storage device.
  • the storage device stores the data, which can reduce the load of the analysis device and improve the data analysis efficiency of the analysis device.
  • the functions of the terminal 110 and the analysis device 120 may also be implemented by the same device, such as a computer.
  • the application environment includes two parts: the foreground 201 and the background 202.
  • the foreground 201 is used for presenting data to the user and receiving data input by the user, so as to realize interaction with the user; the background 202 is used for data interaction with the foreground 201, and for performing management operations and/or data processing.
  • the foreground 201 may be deployed in the aforementioned terminal 110.
  • the background 202 can be deployed in the aforementioned analysis device 120.
  • a client, a script, or a browser may be installed in the terminal 110 to implement the deployment of the foreground 201.
  • the terminal 110 may present the user interface in the form of a client interface, a terminal interface, or a webpage corresponding to a browser.
  • the log abnormality detection method provided by the embodiment of the present application can be used in log analysis scenarios such as software debugging, performance optimization, or business analysis. Specifically, it can be applied to anomaly detection scenarios in these log analysis scenarios. Anomaly detection refers to the detection of patterns that do not meet expectations.
  • the data source for anomaly detection is log data generated by the operation of applications, processes, operating systems, devices, or networks, and these data can be stored in a database, a local file, or a message queue. In a streaming scenario where the log is a log stream, the data is stored in a message queue.
  • the message queue is a Kafka message queue.
  • the aforementioned analysis device 120 may use a deep learning algorithm to detect anomalies in log data.
  • the embodiment of the present application provides a log anomaly detection method: by obtaining multiple log fragments (also called log segments) of the log and comparing the similarity of the content of the log fragments, abnormal log fragments are detected.
  • the principle is that the content of an abnormal log fragment is significantly different from the content of other log fragments. Based on this principle, as shown in FIG. 4, the method includes:
  • Step 301 The analysis device obtains a log, and the log includes multiple lines of log records.
  • the log analysis scenarios include offline analysis scenarios and online analysis scenarios.
  • in an offline analysis scenario, the log data to be analyzed can be batch log data stored in a log database, such as log files or log data obtained by querying the log database, where the log files are usually files downloaded by users, software developers, or operation and maintenance staff, or files obtained by keyword search.
  • the analysis device can read the log in the log database to obtain the log.
  • the log data to be analyzed may be log data collected in real time, also called log stream data.
  • the analysis device can collect logs through the collector to realize the acquisition of logs.
  • the log has two forms: batch log data and real-time log data.
  • the analysis device supports the analysis of these two forms of logs.
  • the analysis device periodically obtains log files or obtains log files during a specified time period to obtain batch log data.
  • the specified time period may be a low-power-consumption period of the terminal and/or server (that is, a time period in which the power consumption is less than a specified power consumption threshold), which can reduce the impact of log file acquisition and subsequent log analysis on other functions of the terminal and/or server; in another optional example, the analysis device continuously obtains real-time log data; in yet another optional example, the analysis device obtains batch log data or real-time log data after receiving an analysis instruction.
  • the analysis instruction may be triggered by the user at the terminal and sent by the terminal to the analysis device.
  • when the analysis device obtains and analyzes the log stream in real time, it can monitor the log stream in time; if an exception occurs in the log stream, it can be discovered and reported promptly, which improves the effectiveness of anomaly detection, avoids large-scale anomalies, and improves the user experience.
  • Step 302 The analysis device obtains multiple log fragments of the log.
  • the analysis device may obtain multiple log fragments based on the log.
  • Each log fragment in the multiple log fragments includes multiple rows of log records in the log, that is, one log fragment is a collection of multiple rows of log records.
  • each log fragment includes multiple consecutive lines of log records in the log, and at least one line of log records differs between different log fragments.
  • because the embodiment of the present application judges log abnormality by comparing the similarity between log fragments, the accuracy of the judgment is higher when the log fragments are of the same or similar size. Therefore, in this step, the multiple log fragments obtained are of the same or similar size.
  • the embodiments of the present application take the following ways as examples for description:
  • in a first way, the log is divided into multiple log fragments according to the number of rows of log records.
  • the log fragmentation rule is: different log fragments obtained by the division include log records with the same number of rows.
  • the specified order can be the order of log records in the log from front to back.
  • for example, the analysis device can divide every m rows of log records, in the specified order, into one log fragment. When the log is real-time log data, the analysis device can store the obtained log records as one log fragment after every m rows of log records are obtained (for example, each time m rows of log records are read from the data source), so as to obtain multiple log fragments.
  • in the first way, the last log fragment obtained by division is likely to contain fewer than m rows of log records; assuming that the number of remaining log records at this time is n, the m-n rows of log records adjacent to the n remaining rows and the n remaining rows can be divided into one log fragment.
  • in this way, the last two adjacent log fragments obtained by division share the same m-n rows of log records.
  • when the embodiments of this application are actually implemented, other m-n rows of log records and the n remaining rows may also be divided into one log fragment, as long as it is ensured that each log fragment obtained by the final division includes m rows of log records.
  • in a second way, the log is divided into multiple log fragments according to the data amount of the log records.
  • the log segmentation rule is: different log segments obtained by the segmentation include log records with the same amount of data.
  • the analysis device can divide the log records with the specified data volume in a specified order into one log segment, and the specified data volume can be 5 megabytes to 15 megabytes, for example, 10 megabytes.
  • the specified order can be the order of log records in the log from front to back.
  • the analysis device can store the obtained log records as a log segment after each log record of the specified amount of data (for example, each time the log record of the specified amount of data is read from the data source) , To get multiple log fragments.
  • similarly, in the second way, the last log fragment obtained by division is likely to contain less than the specified data amount of log records. Assuming that the data amount of the remaining log records at this time is x and the specified data amount is y, one or more log records with a data amount of y-x adjacent to the remaining log records and the remaining log records can be divided into one log fragment. In this way, the last two adjacent log fragments share log records with a data amount of y-x. When the embodiments of this application are actually implemented, other log records with a data amount of y-x and the remaining log records may also be divided into one log fragment, as long as it is ensured that each log fragment obtained by the final division includes the specified data amount of log records.
  • Figure 5 uses the log as real-time log data (ie, log stream) as an example to illustrate the fragmentation process.
  • when the analysis device reads log records of the log, it writes the read log records to a message queue (such as a Kafka message queue).
  • the analysis device first caches the log records read from the message queue, and splices newly read log records with the cached log records until the spliced log records reach the target size; the log records of the target size are then divided into one log fragment.
  • FIG. 5 takes a target size of 1000 rows for the log fragments obtained by segmentation as an example, but this is not limited.
  • Figure 6 uses the log as an example of batch log data to illustrate the fragmentation process.
  • the analysis device can load all the log records into memory at one time, traverse the log records, and perform the fragmentation operation according to the target size (for example, m rows or the specified data amount) until all the log records are divided into fragments.
  • FIG. 6 takes division into 4 log fragments as an example, but this is not limited.
  • both the first method and the second method described above divide log fragments according to the principle of equal division.
  • when the embodiments of this application are actually implemented, the numbers of rows of log records included in the multiple log fragments obtained by division may also be different, as long as the difference in the number of rows of log records between any two log fragments is within a specified row-number difference range.
  • the amount of data included in the multiple log fragments obtained by the division may also be different, as long as it is ensured that the data amount difference of the log records included in any two log fragments is within the specified data amount difference range.
  • the foregoing multiple log fragments may also be divided in a sliding window division manner, which is not described in detail in the embodiment of the present application.
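  • as an illustration of the first fragmentation rule above, the following is a minimal Python sketch (the function name and the choice of m are illustrative, not part of the embodiment) that splits a list of log lines into fragments of m rows each and, when the last fragment would be short, reuses the adjacent rows so that every fragment contains exactly m rows.

```python
def split_into_fragments(log_lines, m):
    """Divide log lines into fragments of m rows each (first fragmentation rule).

    If the last fragment would contain only n < m rows, the m - n rows adjacent
    to it are reused, so the last two fragments share those m - n rows.
    """
    fragments = [log_lines[i:i + m] for i in range(0, len(log_lines), m)]
    if len(fragments) > 1 and len(fragments[-1]) < m:
        fragments[-1] = log_lines[len(log_lines) - m:]  # borrow adjacent rows
    return fragments

# Example: 10 log records split into fragments of 4 rows each.
lines = [f"log record {i}" for i in range(10)]
for fragment in split_into_fragments(lines, 4):
    print(len(fragment), fragment)  # 4, 4, 4 rows; the last two fragments overlap by 2 rows
```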
  • Step 303 The analysis device determines the distance between every two log fragments in the multiple log fragments.
  • the distance between every two log shards is used to reflect the similarity between every two log shards.
  • the process of the analysis device determining the distance between every two log fragments in the plurality of log fragments includes:
  • Step A1 Determine the locally sensitive hash code of each log fragment in the multiple log fragments.
  • the local sensitive hash code is a hash code obtained based on the local sensitive hash algorithm.
  • the local sensitive hash code can reflect the similarity of the data (which can be called input data) that needs to be processed by the local sensitive hash algorithm.
  • the data may be data of the aforementioned log fragments.
  • a locality-sensitive hashing algorithm maintains the similarity relationship between input data. As shown in FIG. 7, for similar input data, the obtained locally sensitive hash codes (which can be called output data) are also very similar; in scenarios where the input data are extremely similar (in FIG. 7, the input data are shown in two rows), the obtained locally sensitive hash codes may even produce hash collisions, that is, for different but similar input data, the output locally sensitive hash codes are exactly the same (the output data in FIG. 7 are all "1101101" as an example).
  • it can be seen from the above that the locally sensitive hash code can be used as a feature of a log fragment, which the embodiment of the present application calls the signature of the log fragment. The more similar the signatures of two log fragments are, the closer the contents of the two log fragments are.
  • the analysis device can determine the locally sensitive hash code of each log segment in a variety of ways.
  • the embodiment of the present application takes the following two optional implementation methods as examples for illustration:
  • the process of determining the locally sensitive hash code of each log fragment in the multiple log fragments includes:
  • Step A11 The analysis device obtains multiple tokens of each log segment in the multiple log segments.
  • the analysis device may segment each log record in each log segment by word segmentation technology to obtain multiple entries for each log segment.
  • the term includes at least one minimum semantic unit, and usually only includes one minimum semantic unit.
  • the semantic unit is a word, phrase, or symbol.
  • the symbol can be a number symbol, abbreviated as a number, such as 1 or 2, or other symbols, such as "/" or ":”.
  • generally, a line of log records can be divided to obtain at least two entries; in a few cases, a line of log records is divided to obtain only one entry. This embodiment of the present application does not limit the number of entries obtained by dividing each log fragment.
  • word segmentation is to cut each log record of each log segment into a collection of entries.
  • through word segmentation processing, the processing complexity of log records can be reduced, the calculation cost of the subsequent locally sensitive hash codes can be reduced, and the calculation efficiency can be improved.
  • different methods can be used for word segmentation. For example, use spaces to separate words; alternatively, use special characters to separate words; alternatively, use designated segmentation characters that include spaces or special characters; or use natural language segmentation.
  • Figure 8 is a schematic diagram of a word segmentation result using space segmentation.
  • Using spaces to segment words refers to dividing a line of log records into multiple entries according to spaces.
  • the segmentation process is simple and the segmentation efficiency is high.
  • however, when spaces are used for segmentation, the first entry and the last entry may contain a lot of information; for example, the first entry may contain a time, a class name, and other digital information.
  • word segmentation using special characters refers to dividing a line of log records into multiple entries according to special characters, such as "##"; the word segmentation result obtained by using special character segmentation is shown in FIG. 10.
  • each entry obtained by this segmentation is a minimum semantic unit, so the segmentation accuracy is higher.
  • word segmentation using designated segmentation characters is a combination of the space-based and special-character-based segmentation methods.
  • Natural language word segmentation is more commonly used.
  • for natural language word segmentation, the log records in a log fragment can be directly input to a natural-language-based tokenizer, such as the Word_Tokenizer, TreeBank_Tokenizer, or S-Expression_tokenizer in the NLTK (Natural Language Toolkit).
  • the analysis device may input each log segment as a character stream into a designated tokenizer, and the tokenizer performs word segmentation processing, and the analysis device receives the word segmentation result output by the tokenizer.
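  • the word segmentation and deduplication described above can be sketched as follows; the delimiter set and the function names are assumptions for illustration (a production implementation could instead call one of the NLTK tokenizers mentioned above).

```python
import re

def tokenize(log_record):
    """Split one line of a log record into entries, using spaces and an assumed
    set of special characters as delimiters (illustrative delimiter set)."""
    return [t for t in re.split(r"[ \t:,=\[\]()]+", log_record) if t]

def entry_set(log_fragment_lines):
    """Word-segment every log record in a fragment and deduplicate the entries."""
    entries = []
    for line in log_fragment_lines:
        entries.extend(tokenize(line))
    return set(entries)  # deduplicated entry set used for later hash computation

print(entry_set(["2020-10-16 10:21:03 INFO flush memtable size=128",
                 "2020-10-16 10:21:04 INFO flush memtable size=256"]))
```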
  • Step A12 The analysis device determines the locally sensitive hash code of each log fragment in the multiple log fragments based on the multiple entries of each log fragment in the multiple log fragments.
  • each log segment is stored in the form of a collection of entries.
  • in the entry set, the multiple entries of each log fragment no longer have a sequential relationship.
  • specifically, the analysis device can directly use the set of multiple entries of a log fragment obtained by word segmentation as the entry set; or, the analysis device can perform deduplication processing on the multiple entries of the log fragment to obtain the entry set. Then, the analysis device can determine the locally sensitive hash code of each log fragment based on the entry set corresponding to each log fragment. The aforementioned deduplication processing can reduce the number of entries used in the subsequent calculation of the locally sensitive hash code, thereby improving the operation efficiency of the analysis device.
  • for example, the analysis device can determine the locally sensitive hash code of each log fragment based on a target locality-sensitive hash algorithm and the entry set corresponding to each log fragment.
  • the local sensitive hash calculation process in the target local sensitive hash algorithm can refer to the local sensitive hash calculation process in the Simhash algorithm or the Minhash algorithm.
  • the smallest unit of data processed by the target local sensitive hash algorithm is a term.
  • in an optional manner, a weighted summation method can be used to determine the locally sensitive hash code of a log fragment.
  • This process can refer to the Simhash algorithm.
  • the process of determining the locally sensitive hash code of a log fragment by means of weighted summation may include:
  • Step A121 For any log segment, calculate the hash code of each entry in the entry set corresponding to any log segment, where the hash code is composed of binary numbers 0 and 1.
  • the weight of each term may be positively correlated with the term frequency of the term in the term set. That is, the higher the word frequency, the greater the weight. Normally, the weight of each term is equal to the term frequency of the term in the term set.
  • Term frequency refers to the number of occurrences of the term. For example, if the term “we” appears 5 times in a term set, the term frequency of the term "we” is 5.
  • Step A122 Perform weighted summation on the hash codes of the entries in the entry set based on the weight of each entry.
  • Step A123 Perform dimensionality reduction processing on the obtained weighted summation result to obtain a locally sensitive hash code.
  • in the foregoing step A122, the product of each hash code and its weight follows this rule: when a bit of the hash code is 1, the value at the corresponding position is the weight multiplied by +1; when a bit of the hash code is 0, the value at the corresponding position is the weight multiplied by -1.
  • the dimensionality reduction in the aforementioned step A123 refers to reducing the value greater than 0 to 1, and reducing the value not greater than 0 to 0.
  • the process of performing dimensionality reduction processing on the obtained weighted sum result includes setting the value greater than 0 in the obtained weighted sum result to 1, and setting the value not greater than 0 in the weighted sum result to 0 .
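  • the following is a minimal sketch, in the spirit of the SimHash-style procedure of steps A121 to A123, of how a locally sensitive hash code could be derived from an entry set; the 8-bit hash width, the helper names, and the use of Python's built-in MD5 hashing are assumptions for illustration, so the bit patterns it produces will not match the example values shown in FIG. 11.

```python
import hashlib

BITS = 8  # width of the locally sensitive hash code (illustrative)

def entry_hash(entry, bits=BITS):
    """Step A121: hash one entry to a fixed-width string of 0s and 1s."""
    digest = int(hashlib.md5(entry.encode("utf-8")).hexdigest(), 16)
    return format(digest % (1 << bits), f"0{bits}b")

def locality_sensitive_hash(entries, weights=None, bits=BITS):
    """Steps A122 and A123: weighted summation of the entry hash codes, then
    dimensionality reduction (value > 0 becomes 1, otherwise 0)."""
    weights = weights or {}
    sums = [0] * bits
    for entry in entries:
        w = weights.get(entry, 1)               # default weight 1 (steps A124/A125)
        for i, bit in enumerate(entry_hash(entry, bits)):
            sums[i] += w if bit == "1" else -w  # bit 1 -> +weight, bit 0 -> -weight
    return "".join("1" if s > 0 else "0" for s in sums)

entries = {"flush", "memtable", "size", "INFO"}
print(locality_sensitive_hash(entries))                        # all weights equal to 1
print(locality_sensitive_hash(entries, weights={"flush": 3}))  # word-frequency-style weights
```
  • passing per-entry word frequencies as the weights corresponds to the weighted variant above, while omitting them corresponds to the variant of steps A124 and A125 below, in which every entry has a weight of 1.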
  • FIG. 11 shows the process of obtaining the locally sensitive hash code of a log fragment X1.
  • in FIG. 11, the weight of the first entry is 3, the weight of the second entry is 2, and the weight of the other entries is 1; the hash code of the entry "flush" is "10010111".
  • its product with the weight 3 is "3, -3, -3, 3, -3, 3, 3, 3" (the commas are only for spacing and do not exist in the actual calculation process).
  • Performing a weighted summation on the calculated hash codes of each entry refers to the summation of the weighted hash codes (that is, the summation of the corresponding positions).
  • the final weighted summation result is "6, -4, -6, 6, -6, 0, 8, 4", where the first position, 6, is the sum of the first positions of the products of each entry's hash code and the corresponding weight, that is, 3+2+1+1-1; the second position, -4, is the sum of the second positions, namely (-3)+(-2)+1+(-1)+1; the other positions are calculated in the same way.
  • the result of the weighted summation "6, -4, -6, 6, -6, 0, 8, 4" corresponds to the dimensionality reduction result "10010111", that is, the locally sensitive hash code of the log segment X1 is " 10010111".
  • in an optional manner, the word frequency of each entry before the deduplication processing can be recorded to determine the weight of each entry, and the locally sensitive hash code can then be calculated by using the aforementioned steps A121 to A123.
  • in another optional manner, the word frequency of each entry is not recorded during the deduplication processing, and the weight of each entry is set to 1; in this way, for every entry in the same log fragment, the weight is equal to 1. In this case, the calculation process may include:
  • Step A124 The analysis device calculates the sum of the hash codes of all the entries in the entry set corresponding to each log segment.
  • Step A125 Perform dimensionality reduction processing on the sum of the hash codes corresponding to each log segment to obtain the locally sensitive hash code of each log segment.
  • when the weight of every entry is set to 1, the calculation delay is shorter and the calculation efficiency is higher; directly calculating the sum of the hash codes of all entries is equivalent to giving all entries a weight of 1. If the weight were set according to the word frequency of each entry, then, because the number of normal entries in a log fragment is usually much greater than the number of abnormal entries, the weight of normal entries would be much greater than that of abnormal entries. Setting the weight of all entries to 1 therefore reduces the relative weight of normal entries, which is equivalent to amplifying the influence of abnormal entries and increases the probability of identifying abnormal log fragments. For example, suppose a log fragment X includes 5 abnormal entries and 1000 normal entries, and a log fragment Y includes 1000 normal entries that are the same as the 1000 normal entries of log fragment X but includes no abnormal entries; by setting the weight of all entries to 1, the locally sensitive hash code of log fragment X and that of log fragment Y can be made significantly different, so that the abnormal log fragment can be effectively distinguished in the subsequent process.
  • the abnormal log fragment includes an entry that is obviously different from other log fragments, and the obviously different entry is the aforementioned abnormal entry.
  • Fig. 12 shows another process of obtaining a locally sensitive hash code of the log segment X1.
  • the weight of all entries is 1.
  • the hash code of "flush” is: "10010111”
  • the product of its and weight 1 is "1, -1, -1, 1,-1,1,1,1”
  • the comma is for interval and does not exist in the actual calculation process.
  • Performing a weighted summation on the calculated hash codes of each entry refers to the summation of the weighted hash codes (that is, the summation of the corresponding positions).
  • the final weighted summation result determined for log fragment X1 is "3, -1, -3, 3, -3, -1, 5, 1", where the first position, 3, is the sum of the first positions of the products of each entry's hash code and the weight 1, namely 1+1+1+1-1; the second position, -1, is the sum of the second positions, namely (-1)+(-1)+1+(-1)+1; the other positions are calculated in the same way.
  • the result of the weighted summation "3, -1, -3, 3, -3, -1, 5, 1" corresponds to the dimensionality reduction result "10010011", that is, the locally sensitive hash code of the log record X1 is " 10010011".
  • in a second optional implementation, the process of determining the locally sensitive hash code of each log fragment among the multiple log fragments includes: determining the locally sensitive hash code of each log fragment directly based on the content of each log fragment, without performing the entry acquisition step of step A11.
  • the analysis device may determine the locally sensitive hash code of each log record based on the aforementioned target locality sensitive hash algorithm and the content of each log segment. For example, the analysis device may separately input the content (ie, character stream) of each log segment into the algorithm model of the target locality-sensitive hashing algorithm, and receive the locality-sensitive hash code of each log segment output by the algorithm model.
  • the smallest unit of data processed by the target local sensitive hash algorithm is a character.
  • in the second optional implementation, the data granularity (that is, the smallest unit of data processed by the aforementioned target locality-sensitive hash algorithm) when obtaining the locally sensitive hash code of each log fragment is a character, whereas in the aforementioned first optional implementation the data granularity is an entry. Therefore, compared with the second optional implementation, the first optional implementation uses a larger data granularity when calculating the locally sensitive hash code of each log fragment; it can be seen that the first optional implementation requires fewer calculations than the second optional implementation, which can save calculation costs.
  • in another optional manner, the multiple entries of each log fragment are stored in the form of an entry sequence.
  • in the entry sequence corresponding to each log fragment, the multiple entries of the log fragment retain their sequential relationship.
  • the analysis device can directly arrange the multiple entries of the log fragment obtained by word segmentation in their order before word segmentation to obtain the entry sequence, and then determine the locally sensitive hash code of each log fragment based on the entry sequence corresponding to each log fragment.
  • for the process in which the analysis device determines the locally sensitive hash code of each log fragment based on the entry sequence corresponding to each log fragment, refer to the aforementioned process of determining the locally sensitive hash code of each log fragment based on the entry set obtained without deduplication, which is not repeated in this embodiment of the present application.
  • Step A2 The analysis device determines the distance between every two log fragments in the multiple log fragments based on the locally sensitive hash code of each log fragment.
  • as described in step A1, because the locally sensitive hash code of a log fragment can reflect the similarity between the data of this log fragment and the data of other log fragments, the distance between the locally sensitive hash codes of every two log fragments can be obtained and determined as the distance between the corresponding two log fragments. In this way, the distance between every two log fragments can be determined quickly.
  • specifically, the analysis device can calculate the distance between the locally sensitive hash codes of every two log fragments based on the locally sensitive hash code of each log fragment and a specified distance algorithm, and determine the calculated distance as the distance between the corresponding two log fragments.
  • the specified distance algorithm may be the Hamming distance algorithm, and correspondingly, the obtained distance is the Hamming distance. The analysis device may then determine the Hamming distance between the locally sensitive hash codes of every two log fragments as the distance between the two log fragments. The Hamming distance refers to the number of positions at which two character sequences have different values. For example, for the character sequences 010 and 001, the second and third positions are different, so the Hamming distance of the two character sequences is 2.
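  • a minimal sketch of the Hamming distance between two equal-length locally sensitive hash codes (the function name is illustrative):

```python
def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length bit strings differ."""
    assert len(code_a) == len(code_b)
    return sum(a != b for a, b in zip(code_a, code_b))

print(hamming_distance("010", "001"))  # the 2nd and 3rd positions differ, so the distance is 2
```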
  • the specified distance algorithm may also be other distance algorithms.
  • for example, the specified distance algorithm is the Euclidean distance algorithm. The analysis device may then determine the Euclidean distance between the locally sensitive hash codes of every two log fragments as the distance between the two log fragments, where the Euclidean distance is the distance between points in space.
  • the analysis device can also use other methods to determine the distance between every two log shards.
  • the analysis device can also use the Jaccard similarity function to determine the distance between every two log fragments; this distance is called the Jaccard distance, Jaccard similarity, or Jaccard coefficient, and is computed from R1 and R2, where D represents the Jaccard distance, R1 is the intersection of the entry sets of the two log fragments, and R2 is the union of the entry sets of the two log fragments.
  • when the Jaccard similarity function is used to determine the distance between every two log fragments, the multiple entries of each log fragment need to be obtained; for the process of obtaining the multiple entries of each log fragment, refer to the aforementioned step A11.
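  • as a sketch of the Jaccard-based alternative, the following computes the intersection R1 and the union R2 of the entry sets of two log fragments; because the extracted text does not show the exact formula used for D, the sketch returns both the Jaccard coefficient |R1|/|R2| and its complementary form 1 - |R1|/|R2| (names and example sets are illustrative).

```python
def jaccard(entries_a, entries_b):
    """Return the Jaccard coefficient |R1| / |R2| and its complement for two entry sets."""
    r1 = entries_a & entries_b   # R1: intersection of the two entry sets
    r2 = entries_a | entries_b   # R2: union of the two entry sets
    coefficient = len(r1) / len(r2) if r2 else 1.0
    return coefficient, 1.0 - coefficient

set_x = {"flush", "memtable", "size", "INFO"}
set_y = {"flush", "memtable", "ERROR", "INFO"}
print(jaccard(set_x, set_y))  # (0.6, 0.4): 3 shared entries out of 5 distinct entries
```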
  • since the analysis device has obtained the distance between each log fragment and every other log fragment in the multiple log fragments, and a log can be divided into multiple log fragments, such as 3 to 8, the analysis device finally obtains multiple distance values. For example, suppose there are w log fragments: if the obtained distance values include the distance between each log fragment and itself (that is, a distance of 0), the number of obtained distance values is w²; if the obtained distance values do not include the distance between a log fragment and itself, the number of obtained distance values is (w² - w).
  • the multiple distance values can be represented by a distance matrix as shown in FIG. 13.
  • Figure 13 assumes that the analysis device has acquired 4 log fragments, log fragments 1 to 4, and the locally sensitive hash codes of log fragments 1 to 4 are 01010101, 01010111, 00010111, and 11110010, respectively.
  • the Hamming distance of the locally sensitive hash code of the log fragments is determined as the distance between two log fragments.
  • then, the distances between log fragment 1 and log fragments 2 to 4 are 1, 2, and 5 respectively; the distances between log fragment 2 and log fragments 1, 3, and 4 are 1, 1, and 4 respectively; the distances between log fragment 3 and log fragments 1, 2, and 4 are 2, 1, and 5 respectively; and the distances between log fragment 4 and log fragments 1 to 3 are 5, 4, and 5 respectively.
  • the distance values in the lower-left and upper-right triangles of the distance matrix in FIG. 13 are symmetrically distributed. Therefore, the content in the lower-left or upper-right triangle of the distance matrix alone can be used to indicate the distance between each log fragment and every other log fragment in the multiple log fragments.
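  • the pairwise distance matrix of the FIG. 13 example can be reproduced from the four locally sensitive hash codes listed above (variable names are illustrative):

```python
codes = ["01010101", "01010111", "00010111", "11110010"]  # log fragments 1 to 4

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Symmetric distance matrix; the diagonal (distance of a fragment to itself) is 0.
matrix = [[hamming(a, b) for b in codes] for a in codes]
for row in matrix:
    print(row)
# [0, 1, 2, 5]
# [1, 0, 1, 4]
# [2, 1, 0, 5]
# [5, 4, 5, 0]
```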
  • Step 304 The analysis device determines whether there is an abnormal log fragment in the multiple log fragments based on the distance between every two log fragments in the plurality of log fragments.
  • the analysis device can determine whether there is an abnormal log fragment among the multiple log fragments based on the distance between every two log fragments through a variety of anomaly detection methods.
  • the following two anomaly detection methods are used as examples for description:
  • the first anomaly detection method determines whether there is an abnormal log fragment in the multiple log fragments based on the K-distance (K-Distance) of each log fragment in the multiple log fragments.
  • the anomaly detection method includes:
  • Step B1 The analysis device determines the K-distance of each log fragment in the plurality of log fragments based on the distance between each two log fragments in the plurality of log fragments.
  • in step B1, the larger the K value, the lower the sensitivity and the higher the accuracy of the finally determined abnormal log fragments. Therefore, a smaller K value can be set here; for example, K is 1, or, when G×5% is greater than 1, K is less than or equal to G×5%.
  • specifically, the analysis device can first obtain the distance between every two log fragments; then, for each log fragment, sort the obtained distances between that log fragment and the other log fragments, for example in ascending or descending order, and determine the K-distance of the log fragment based on the sorting result.
  • the analysis device can obtain multiple distance values corresponding to each log segment.
  • the multiple distance values can be represented by a distance matrix. If the analysis of abnormal log fragments were performed directly based on the distance matrix, then, supposing there are w log fragments, for each log fragment it would be necessary to determine, based on its (w-1) distances to the other log fragments, whether it is an abnormal log fragment; the detection process would therefore need to be executed once for each of the multiple log fragments (for the detection process, refer to the subsequent step B2).
  • by determining the K-distance of each log fragment, the obtained distance matrix (such as the aforementioned distance matrix including (w² - w) distance values) is converted into a one-dimensional set of distance values, and the distance value set includes w distance values (one K-distance per log fragment).
  • in this case, the detection process only needs to be executed once, which effectively simplifies the detection of abnormal log fragments, reduces computational complexity, and improves the efficiency of abnormality determination.
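  • a minimal sketch of step B1 (names are illustrative): for each log fragment, sort its distances to the other log fragments and take the K-th smallest as its K-distance.

```python
def k_distances(matrix, k):
    """For each log fragment, return the distance to its K-th closest other fragment."""
    result = []
    for i, row in enumerate(matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)  # ascending distances
        result.append(others[k - 1])                             # K-th closest (1-based)
    return result

# Using the FIG. 13 distance matrix with K = 1 (the closest other fragment).
matrix = [[0, 1, 2, 5],
          [1, 0, 1, 4],
          [2, 1, 0, 5],
          [5, 4, 5, 0]]
print(k_distances(matrix, 1))  # [1, 1, 1, 4]: log fragment 4 is far from all others
```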
  • each of the multiple log fragments obtained in step 302 can actually be regarded as a point in a high-dimensional space, and the distance between every two log fragments is the distance between two points in that high-dimensional space.
  • the dimension of the high-dimensional space may be the number of bits of the locality-sensitive hash code. For example, for a 128-bit locally sensitive hash code, it can be regarded as a point in a 128-dimensional space.
  • FIG. 14 takes five log fragments, log fragments 1 to 5, as an example, where each log fragment is a point in a two-dimensional space; the five log fragments correspond one-to-one to points A to E.
  • Step B2 The analysis device determines whether there is an abnormal log fragment in the multiple log fragments based on the K-distance of each log fragment in the plurality of log fragments.
• In the embodiment of the present application, there are multiple ways to determine, based on the K-distance of each of the multiple log fragments, whether there is an abnormal log fragment among the multiple log fragments. The following optional manners are taken as examples for description:
  • the abnormal log fragments are determined based on the Raida criterion (also called the 3-sigma rule).
  • the embodiment of the present application provides the following two optional examples to determine whether each log fragment in the multiple log fragments is an abnormal log fragment:
• In a first optional example, the analysis device determines a target value range [μ−3σ, μ+3σ] based on the K-distances of the multiple log fragments, where μ is the mean value of the K-distances of the multiple log fragments and σ is the standard deviation of the K-distances of the multiple log fragments. When the K-distance of any one of the multiple log fragments is not within the target value range [μ−3σ, μ+3σ], it is determined that this log fragment is an abnormal log fragment; otherwise, it is determined that this log fragment is not an abnormal log fragment.
• The principle of the first optional example is as follows: because most of the points corresponding to the multiple log fragments are normal points, the abnormal points have little effect on the mean and standard deviation. Assuming that there is no abnormal point among the multiple points, the mean value and standard deviation of the multiple points are calculated, whether there is a point beyond the target value range is determined on this basis, and if there is such a point, that point is determined to be an abnormal point.
  • the first optional example is applicable when there are a large number of log fragments, that is, when there are many sample points. In this way, the target value range only needs to be calculated once, and the calculation overhead is small.
• The process of judging whether point A is an abnormal point, that is, whether log fragment 1 is an abnormal log fragment, is as follows: calculate the mean and standard deviation of the K-distances of points A to E, and determine the target value range based on the calculated mean and standard deviation; when the K-distance of point A is not within the target value range, point A is determined to be an abnormal point and log fragment 1 is an abnormal log fragment; when the K-distance of point A is within the target value range, point A is determined not to be an abnormal point and log fragment 1 is not an abnormal log fragment.
  • the calculation method from point B to point E is the same, and will not be repeated in the embodiment of the present application.
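• A hedged sketch of the first optional example follows, using the K-distance values produced by the sketch above as input (the function name and the toy values are assumptions):

```python
import numpy as np

def abnormal_by_3sigma(kdist: np.ndarray) -> list:
    """Flag fragments whose K-distance lies outside [mu - 3*sigma, mu + 3*sigma],
    where mu and sigma are computed once over the K-distances of all fragments."""
    mu, sigma = kdist.mean(), kdist.std()
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return [i for i, v in enumerate(kdist) if v < low or v > high]

print(abnormal_by_3sigma(np.array([2., 1., 1., 7., 2.])))  # [] for this tiny sample
```

• With only five points, even the clearly distant fragment is not flagged, which illustrates why this variant suits cases with many log fragments.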
• In a second optional example, the log fragment detection process provided in this embodiment of the present application includes: for a first log fragment, determining the corresponding target value range [mu−3×sigma, mu+3×sigma], where the first log fragment is any one of the multiple log fragments, mu is the mean value of the K-distances of the log fragments other than the first log fragment among the multiple log fragments, and sigma is the standard deviation of the K-distances of the log fragments other than the first log fragment among the multiple log fragments. When the K-distance of the first log fragment is not within the target value range [mu−3×sigma, mu+3×sigma], it is determined that the first log fragment is an abnormal log fragment; when the K-distance of the first log fragment is within the target value range [mu−3×sigma, mu+3×sigma], it is determined that the first log fragment is not an abnormal log fragment, that is, it is a normal log fragment.
• For the detection process of the other log fragments among the multiple log fragments, reference may be made to the aforementioned detection process of the first log fragment, which is not described in detail in the embodiment of the present application.
• The principle of this second optional example is as follows: when there are fewer normal points among the multiple points corresponding to the multiple log fragments, abnormal points have a greater impact on the mean and standard deviation. Therefore, for each of the multiple points, it is assumed that the point is an abnormal point and the remaining points are normal points, the mean and standard deviation of the remaining points are calculated, and whether the point exceeds the target value range is determined on this basis, thereby determining whether the point is an abnormal point.
  • This second optional example is applicable when the number of log fragments is small, that is, when the number of sample points is small.
• For example, the process of judging whether point A is an abnormal point is as follows: calculate the mean and standard deviation of the remaining points other than point A (that is, points B to E), determine the corresponding target value range, and judge whether the K-distance of point A is within it. Similarly, the process of judging whether point D is an abnormal point is as follows: calculate the mean and standard deviation of the remaining points other than point D (that is, points A, B, C, and E), determine the corresponding target value range, and judge whether the K-distance of point D is within it.
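• A corresponding sketch of the second optional example follows (also an illustrative assumption rather than the claimed implementation); the mean and standard deviation are recomputed for each fragment over the remaining fragments:

```python
import numpy as np

def abnormal_by_loo_3sigma(kdist: np.ndarray) -> list:
    """Leave-one-out variant: for each fragment, compute mu and sigma over the
    K-distances of the *other* fragments and flag the fragment if its own
    K-distance falls outside [mu - 3*sigma, mu + 3*sigma]."""
    abnormal = []
    for i, v in enumerate(kdist):
        rest = np.delete(kdist, i)
        mu, sigma = rest.mean(), rest.std()
        if v < mu - 3 * sigma or v > mu + 3 * sigma:
            abnormal.append(i)
    return abnormal

print(abnormal_by_loo_3sigma(np.array([2., 1., 1., 7., 2.])))  # [3]
```

• Here the small sample no longer dilutes the statistics, so the distant fragment (index 3 in the toy K-distance vector) is flagged.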
• The aforementioned two optional examples can be selected according to the actual situation. For example, after the analysis device obtains the log fragments, when the number of obtained log fragments is greater than a specified number threshold (that is, the number of log fragments, and hence of corresponding points, is large), the method provided in the first example above is used to determine the abnormal log fragments; when the number of obtained log fragments is not greater than the specified number threshold (that is, the number of log fragments, and hence of corresponding points, is small), the method provided in the second example above is used to determine the abnormal log fragments. In the latter case, although the target value range of each log fragment needs to be determined separately, since the number of log fragments is small, the overall amount of calculation is still within an acceptable range.
  • the abnormal log fragments are determined based on the principle of entropy change.
• Entropy is a quantity originally used to describe the degree of disorder of the states of molecules; here it is used to characterise how uniformly a sample is distributed.
• For example, for a sample U = [1,2,3,4,5,6,7,8,9,10], the distribution of the values is not uniform, so its entropy value is lower than that of the aforementioned sample T; for a sample V = [1,2,3,4,5,6,7,8,900,1000], the distribution is even less uniform than that of U, so its entropy value is the lowest.
• That is, the entropy value is positively related to the uniformity of the sample distribution: the higher the entropy value, the more uniform the sample distribution, and the lower the entropy value, the less uniform the sample distribution.
• The embodiment of the present application provides a formula for calculating the entropy value of a sample, in which H(i) represents the entropy value of the sample and i represents the data in the sample.
  • the principle of determining the entropy value of the sample based on the entropy value calculation formula is called the entropy change principle.
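• As one reading of the entropy value calculation (an assumption for illustration: the sample values are normalised into a probability distribution and the Shannon entropy is taken), a minimal sketch:

```python
import numpy as np

def entropy(values) -> float:
    """Shannon-style entropy of a sample: normalise the values into a
    probability distribution p and return -sum(p * ln p).
    The normalisation is an illustrative assumption."""
    p = np.asarray(values, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # ignore zero-probability terms
    return float(-(p * np.log(p)).sum())

print(entropy(np.ones(10)))                                     # highest: uniform sample
print(entropy(np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])))       # lower: sample U
print(entropy(np.array([1, 2, 3, 4, 5, 6, 7, 8, 900, 1000])))   # lowest: sample V
```

• Under this reading, the ordering matches the description above: the more uneven the sample, the lower its entropy value.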
• In an optional manner, the log fragment detection process includes: determining the entropy value corresponding to each log fragment among the multiple log fragments, where, referring to the aforementioned entropy value calculation formula, the entropy value H(i) corresponding to any log fragment is the entropy value of the K-distances of the remaining log fragments after removing that log fragment from the multiple log fragments, and i represents the K-distances of the remaining log fragments. The log fragment corresponding to the largest entropy value can then be determined to be the abnormal log fragment.
• When the difference between the largest entropy value and the smallest entropy value among the obtained entropy values is not greater than a specified difference threshold, it means that the uniformity of the distribution of the remaining log fragments does not change much whichever log fragment is removed, that is, the contents of the multiple log fragments are relatively close to one another, and there is usually no abnormal log fragment.
• In the example of Figure 14, point D may be an abnormal point: if the specified difference threshold is 0.2 and the difference between the largest and smallest entropy values is greater than this threshold, point D, whose removal yields the largest entropy value, is determined to be an abnormal point, and log fragment 4 is an abnormal log fragment.
• In the situation described above, there is one abnormal log fragment, namely the log fragment corresponding to the largest entropy value. In actual implementation, there may be multiple abnormal log fragments. In that case, after an abnormal log fragment is determined based on the aforementioned second optional manner, the abnormal log fragment can be removed and the remaining log fragments taken as the updated multiple log fragments; the aforementioned second optional manner is then used again to determine whether there is an abnormal log fragment among the updated multiple log fragments, and so on, until there is no abnormal log fragment among the updated log fragments.
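• A hedged sketch of this entropy-change detection follows, reusing the entropy() helper and the toy K-distance vector from the earlier sketches; the difference threshold of 0.2 is taken from the example above and is only a placeholder:

```python
import numpy as np

def abnormal_by_entropy(kdist: np.ndarray, diff_threshold: float = 0.2):
    """For each fragment, compute the entropy of the K-distances of the
    remaining fragments after removing it. If the spread of these entropy
    values exceeds the threshold, the fragment whose removal yields the
    largest entropy is flagged; otherwise no fragment is flagged."""
    entropies = np.array([entropy(np.delete(kdist, i)) for i in range(len(kdist))])
    if entropies.max() - entropies.min() <= diff_threshold:
        return None
    return int(entropies.argmax())

def iterate_abnormal_by_entropy(kdist) -> list:
    """Repeatedly remove the detected abnormal fragment until none remains."""
    kdist = np.asarray(kdist, dtype=float)
    remaining = list(range(len(kdist)))
    abnormal = []
    while len(remaining) > 2:          # need a few points for the entropy to be meaningful
        idx = abnormal_by_entropy(kdist[remaining])
        if idx is None:
            break
        abnormal.append(remaining.pop(idx))
    return abnormal

print(iterate_abnormal_by_entropy([2., 1., 1., 7., 2.]))   # [3]
```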
• The second anomaly detection method determines, based on a hierarchical clustering algorithm, whether there are abnormal log fragments among the multiple log fragments.
  • Hierarchical clustering refers to clustering elements belonging to the same category together based on the distance between the elements being clustered.
  • the process of the second abnormality detection method includes:
• The analysis device divides the multiple log fragments into multiple log fragment sets based on the distance between every two log fragments among the multiple log fragments, each log fragment set including at least one log fragment. When the number of log fragments in any log fragment set is less than a specified number threshold, it is determined that the log fragments in that log fragment set are abnormal log fragments; when the number of log fragments in any log fragment set is not less than the specified number threshold, it is determined that the log fragments in that log fragment set are not abnormal log fragments.
• For example, assume that the specified number threshold is 2 and that the number of log fragments in a log fragment set G1 is less than 2, that is, the log fragment set G1 contains only one log fragment. This log fragment is not of the same type as the other log fragments, that is, it is different from the other log fragments, so it is an abnormal log fragment.
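• A minimal sketch of the second anomaly detection method, using SciPy's hierarchical clustering (the average-linkage choice, the cut-off distance and the minimum cluster size of 2 are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def abnormal_by_clustering(dist_matrix: np.ndarray,
                           cut_distance: float,
                           min_cluster_size: int = 2) -> list:
    """Cluster log fragments from their pairwise distances and flag every
    fragment that ends up in a cluster smaller than min_cluster_size."""
    condensed = squareform(dist_matrix, checks=False)   # condensed form for linkage()
    labels = fcluster(linkage(condensed, method='average'),
                      t=cut_distance, criterion='distance')
    sizes = np.bincount(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] < min_cluster_size]

d = np.array([[0, 2, 3, 9, 2],
              [2, 0, 1, 8, 3],
              [3, 1, 0, 7, 4],
              [9, 8, 7, 0, 8],
              [2, 3, 4, 8, 0]], dtype=float)
print(abnormal_by_clustering(d, cut_distance=5.0))   # [3]: that fragment sits alone in its set
```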
• The embodiment of the present application may also adopt other manners to determine an abnormal log fragment based on the distance between every two log fragments among the multiple log fragments. For example, the distance between every two log fragments among the multiple log fragments is presented to the user in a stereogram, table or histogram together with the identifier of each log fragment, so that the user can select the log fragment that the user considers abnormal; a selection instruction triggered by the user is then received, the selection instruction carrying the identifier of the target log fragment selected by the user, and the target log fragment is determined to be an abnormal log fragment. The embodiment of the present application does not limit the manner of determining an abnormal log fragment based on the distance between every two log fragments among the multiple log fragments.
• Step 305: The analysis device determines the abnormal log record in the abnormal log fragment.
• In the embodiment of the present application, the analysis device can determine the abnormal log record in a variety of manners. In an optional manner, the analysis device presents the abnormal log fragment and the user selects the abnormal log record. In another optional manner, the analysis device determines the log templates of the abnormal log fragment and presents these log templates; the user selects an abnormal log template, and after the abnormal log template is obtained, the log records corresponding to the abnormal log template are presented.
  • the analysis device may also use other methods to determine the abnormal log record, which is not limited in the embodiment of the present application.
  • the log record in the log usually has an implicit log template.
• The log template refers to a standard style or a fixed format used to generate the log records in the log. For example, after the code corresponding to the aforementioned log record is actually executed, it outputs multiple lines of log records in the log for recording user login information. For convenience of description, the log in which these multiple lines of log records are recorded is referred to as the first log.
  • the log template of the log is the log template of the log record in the log.
• If the variable part of a log record is identified, the variable part is marked with a preset variable identifier; the marking essentially replaces the variable part with the variable identifier, which is usually the wildcard character "*".
• For example, the wildcard character "*" can be used to replace the variable parts of the aforementioned log records, and the log template obtained for each of these log records is "User*login at*"; the log template of the first log is therefore "User*login at*".
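• A toy sketch of such template extraction, masking variable parts with the wildcard "*" (the regular expressions that decide what counts as a variable part are purely illustrative; real template mining is more involved):

```python
import re

# Illustrative patterns for variable parts: a timestamp, and the user name after "User ".
VARIABLE_PATTERNS = [
    re.compile(r'\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}'),
    re.compile(r'(?<=User )\S+'),
]

def to_template(log_record: str) -> str:
    """Replace the identified variable parts of a log record with '*'."""
    for pattern in VARIABLE_PATTERNS:
        log_record = pattern.sub('*', log_record)
    return log_record

print(to_template("User Alice login at 2019-12-02 10:15:30"))
# -> "User * login at *", i.e. the same template for every such record
```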
  • the sequence of steps in the log abnormality detection method provided in the embodiments of the present application can be adjusted appropriately, and the steps can also be increased or decreased according to the situation. For example, in other application scenarios, such as keyword retrieval, the foregoing step 305 may not be performed.
• The distance between every two log fragments is negatively related to the similarity of the two log fragments, that is, the closer (smaller) the distance, the higher the similarity, and the farther (larger) the distance, the lower the similarity. Usually, the distance d is a non-negative real number, and the numerical range of the similarity is [0,1].
  • the analysis device may determine whether there is an abnormal log fragment in the multiple log fragments based on the similarity between every two log fragments in the plurality of log fragments.
• The process includes: when the similarity between any log fragment and the other log fragments is less than a similarity threshold, determining that this log fragment is an abnormal log fragment; when the similarity between any log fragment and the other log fragments is not less than the similarity threshold, determining that this log fragment is not an abnormal log fragment. In this way, the aforementioned steps B1 and B2 do not need to be executed.
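• A short sketch of this similarity-based variant (it reads "the similarity to other log fragments" as the similarity to the closest other fragment; this reading, the matrix and the threshold are assumptions):

```python
import numpy as np

def abnormal_by_similarity(sim_matrix: np.ndarray, sim_threshold: float) -> list:
    """Flag a fragment as abnormal when even its most similar other fragment
    falls below the similarity threshold."""
    w = sim_matrix.shape[0]
    abnormal = []
    for i in range(w):
        best = max(sim_matrix[i][j] for j in range(w) if j != i)
        if best < sim_threshold:
            abnormal.append(i)
    return abnormal

sim = np.array([[1.0, 0.9, 0.8, 0.1],
                [0.9, 1.0, 0.7, 0.2],
                [0.8, 0.7, 1.0, 0.1],
                [0.1, 0.2, 0.1, 1.0]])
print(abnormal_by_similarity(sim, sim_threshold=0.5))   # [3]
```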
• Optionally, the analysis device may determine the K-distance of each log fragment among the multiple log fragments based on the similarity between every two log fragments among the multiple log fragments. In this case, the K-distance of any one of the multiple log fragments is the similarity between that log fragment and the log fragment that is the K-th farthest from it among the multiple log fragments, where K is a positive integer and K is less than G, G being the total number of the multiple log fragments. It should be noted that this definition of K-distance is different from the definition of K-distance in step B1.
  • the analysis device may also use a cosine angle algorithm (also called a cosine similarity algorithm) to determine the similarity between every two log segments.
• The cosine angle algorithm uses the cosine of the angle between two vectors in a vector space as a measure of the difference between the two vectors: a cosine value close to 1 means the angle tends to 0 and the two vectors are more similar, while a cosine value close to 0 means the angle tends to 90 degrees and the two vectors are more dissimilar. Therefore, after the locality-sensitive hash code of each of the multiple log fragments is obtained, the cosine value between every two locality-sensitive hash codes can be taken as the similarity between the corresponding two log fragments.
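• A small sketch of the cosine similarity between two locality-sensitive hash codes, treating each code as a vector of 0/1 bits (an illustrative assumption about how the codes are represented):

```python
import numpy as np

def cosine_similarity(code_a, code_b) -> float:
    """Cosine of the angle between two hash codes viewed as bit vectors."""
    a = np.asarray(code_a, dtype=float)
    b = np.asarray(code_b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Two 8-bit codes differing in a single bit are still close to 1:
print(cosine_similarity([1, 0, 1, 1, 0, 0, 1, 0],
                        [1, 0, 1, 1, 0, 0, 1, 1]))   # ≈ 0.894
```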
• In the embodiment of the present application, the analysis device may also obtain the multiple log fragments of the log in ways other than step 302, and then determine whether there is an abnormal log fragment among the multiple log fragments based on the similarity between every two log fragments among the multiple log fragments.
• In the related art, the analysis device compares the log (or the log within a period of time) with a specified reference log, obtains the changes in the log templates of the two, and presents the changes. The software developer then identifies the abnormal situation in the log based on the presented content; this function is called the log comparison (logcompare) function. On the one hand, the log comparison function needs to be triggered manually; on the other hand, a reference log needs to be specified manually; in addition, the log comparison function does not actually identify log anomalies, but only provides reference information for software developers to find log anomalies.
  • the log abnormality detection method provided in the embodiment of the application can support the log abnormality detection function.
• On the one hand, the log abnormality detection function can be triggered manually or automatically, for example executed automatically at a specified time point, within a specified period of time, or periodically; on the other hand, the log abnormality detection function does not require a reference log to be specified; in addition, the log abnormality detection function can identify abnormal log fragments, based on which abnormal log records can be accurately located.
  • the flexibility of anomaly detection is higher, the implementation process is simple, and the location of abnormal log fragments can be performed, thereby effectively improving the efficiency of log anomaly detection.
  • the log abnormality detection method provided by the embodiment of the present application locates abnormal log fragments by comparing the similarity of the contents of multiple log fragments of the log, and can detect unknown abnormal log fragments.
  • abnormal log fragments can be quickly located, effectively reducing the time complexity and space complexity of log positioning.
  • An embodiment of the present application provides a log abnormality detection device 40. As shown in FIG. 16, the device includes:
• The obtaining module 401 is configured to obtain a plurality of log fragments of a log, each of the plurality of log fragments including multiple lines of log records in the log, where at least one line of log record differs between different log fragments among the plurality of log fragments;
  • the first determining module 402 is configured to determine the distance between every two log fragments in the plurality of log fragments
  • the second determining module 403 is configured to determine whether there is an abnormal log fragment in the plurality of log fragments based on the distance between each two log fragments in the plurality of log fragments.
• In this way, the distance between every two log fragments among the multiple log fragments is obtained, and based on the obtained distance between every two log fragments it is determined whether there is an abnormal log fragment among the multiple log fragments, so that the abnormal log fragment can then be located; there is no need to manually identify whether there is an abnormality in the log, which effectively improves the efficiency of anomaly detection for the log.
  • the first determining module 402 is configured to: determine every two log fragments in the plurality of log fragments based on the locally sensitive hash code of each log fragment in the plurality of log fragments The distance between log fragments.
• The device 40 further includes: a third determining module 404, configured to determine the locality-sensitive hash code of each of the plurality of log fragments based on the multiple entries of each of the plurality of log fragments.
• The third determining module 404 is configured to: de-duplicate the multiple entries of each log fragment to obtain an entry set; and determine the locality-sensitive hash code of each log fragment based on the entry set corresponding to each log fragment.
• The third determining module 404 is configured to: calculate the sum of the hash codes of all the entries in the entry set corresponding to each log fragment; and perform dimensionality reduction processing on the sum of the hash codes corresponding to each log fragment to obtain the locality-sensitive hash code of each log fragment.
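• The entry-set processing described for the third determining module 404 resembles a SimHash-style construction; the following sketch is written under that assumption (the hash function, bit width and entry strings are illustrative):

```python
import hashlib
import numpy as np

def locality_sensitive_code(entry_set, bits: int = 128) -> np.ndarray:
    """SimHash-style sketch: hash each deduplicated entry, accumulate +1/-1
    per bit position across all entries ("sum of the hash codes"), then reduce
    the sum to a 0/1 vector ("dimensionality reduction")."""
    acc = np.zeros(bits, dtype=int)
    for entry in entry_set:
        digest = int(hashlib.md5(entry.encode('utf-8')).hexdigest(), 16)
        for pos in range(bits):
            acc[pos] += 1 if (digest >> pos) & 1 else -1
    return (acc > 0).astype(int)

entries = {"User * login at *", "Connection closed by *", "Heartbeat ok"}
print(locality_sensitive_code(entries, bits=16))
```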
• The second determining module 403 is configured to: determine the K-distance of each log fragment among the plurality of log fragments based on the distance between every two log fragments among the plurality of log fragments, where the K-distance of any one of the plurality of log fragments is the distance between that log fragment and the log fragment that is the K-th closest to it among the plurality of log fragments, K is a positive integer, K is less than G, and G is the total number of the plurality of log fragments; and determine, based on the K-distance of each log fragment among the plurality of log fragments, whether there is an abnormal log fragment among the plurality of log fragments.
• The second determining module 403 is configured to: determine a target value range [μ−3σ, μ+3σ] based on the K-distance of each log fragment among the plurality of log fragments, where μ is the mean value of the K-distances of the plurality of log fragments and σ is the standard deviation of the K-distances of the plurality of log fragments, and, when the K-distance of any log fragment among the plurality of log fragments is not within the target value range [μ−3σ, μ+3σ], determine that this log fragment is an abnormal log fragment; or, determine a target value range [mu−3×sigma, mu+3×sigma] corresponding to a first log fragment, where the first log fragment is any one of the plurality of log fragments, mu is the mean value of the K-distances of the log fragments other than the first log fragment among the plurality of log fragments, and sigma is the standard deviation of the K-distances of the log fragments other than the first log fragment among the plurality of log fragments, and, when the K-distance of the first log fragment is not within the target value range [mu−3×sigma, mu+3×sigma], determine that the first log fragment is an abnormal log fragment.
• Optionally, the second determining module 403 is configured to: determine the entropy value corresponding to each log fragment among the plurality of log fragments, where the entropy value corresponding to any log fragment is the entropy value of the K-distances of the remaining log fragments after removing that log fragment; and, when the difference between the largest entropy value and the smallest entropy value among the obtained entropy values is greater than a specified difference threshold, determine the log fragment corresponding to the largest entropy value as an abnormal log fragment.
  • different log fragments include log records with the same number of rows; or, different log fragments include log records with the same amount of data.
  • FIG. 18 schematically provides a possible basic hardware architecture of the computing device described in this application.
  • the computing device may be a server.
  • the computing device 500 includes a processor 501, a memory 502, a communication interface 503, and a bus 504.
  • the number of processors 501 may be one or more, and FIG. 18 only illustrates one of the processors 501.
  • the processor 501 may be a central processing unit (CPU). If the computing device 500 has multiple processors 501, the types of the multiple processors 501 may be different or may be the same. Optionally, multiple processors 501 of the computing device 500 may also be integrated into a multi-core processor.
  • the memory 502 stores computer instructions and data; the memory 502 can store computer instructions and data required to implement the log anomaly detection method provided by the present application.
  • the memory 502 stores instructions for implementing the steps of the log anomaly detection method.
• The memory 502 may be any one or any combination of the following storage media: non-volatile memory (for example, read-only memory (ROM), solid-state drive (SSD), hard disk drive (HDD), or optical disc) and volatile memory.
  • the communication interface 503 may be any one or any combination of the following devices: a network interface (for example, an Ethernet interface), a wireless network card, and other devices with a network access function.
  • the communication interface 503 is used for data communication between the computing device 500 and other computing devices or terminals.
  • the bus 504 can connect the processor 501 with the memory 502 and the communication interface 503. In this way, through the bus 504, the processor 501 can access the memory 502, and can also use the communication interface 503 to interact with other computing devices or terminals.
  • the computing device 500 executes the computer instructions in the memory 502, so that the computing device 500 implements the log abnormality detection method provided in this application, or causes the computing device 500 to deploy a log abnormality detection device.
• In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is further provided, for example a memory including instructions, and the instructions can be executed by a processor of a server to complete the log abnormality detection method shown in each embodiment of the present application.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • An embodiment of the present application provides an analysis system, including: a terminal and an analysis device, and the analysis device includes any one of the aforementioned log abnormality detection devices.
• The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
• When software is used for implementation, the implementation may be in whole or in part in the form of a computer program product, the computer program product including one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
• The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (such as coaxial cable, optical fiber, or digital subscriber line) or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium, or a semiconductor medium (for example, a solid-state hard disk).
• It should be noted that, when the log abnormality detection device provided in the above embodiment performs log abnormality detection, the division of the above functional modules is only used as an example for description. In actual application, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the log anomaly detection device provided in the foregoing embodiment and the log anomaly detection method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.

Abstract

The invention relates to a log anomaly detection method and system, belonging to the technical field of computers. The method comprises: acquiring a plurality of log fragments of a log, each of the plurality of log fragments comprising a plurality of lines of log records in the log, and at least one line of log record differing between different log fragments of the plurality of log fragments; determining the distance between every two log fragments of the plurality of log fragments; and, on the basis of the distance between every two log fragments of the plurality of log fragments, determining whether there is an abnormal log fragment among the plurality of log fragments. The present invention solves the existing problem of relatively poor log anomaly detection, and can be applied to log anomaly detection.
PCT/CN2020/121544 2019-12-02 2020-10-16 Procédé et appareil de détection d'anomalie de journal WO2021109724A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201911214265 2019-12-02
CN201911214265.0 2019-12-02
CN202010066339.7A CN111240942A (zh) 2019-12-02 2020-01-20 日志异常检测方法及装置
CN202010066339.7 2020-01-20

Publications (1)

Publication Number Publication Date
WO2021109724A1 true WO2021109724A1 (fr) 2021-06-10

Family ID=70878054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121544 WO2021109724A1 (fr) 2019-12-02 2020-10-16 Procédé et appareil de détection d'anomalie de journal

Country Status (2)

Country Link
CN (1) CN111240942A (fr)
WO (1) WO2021109724A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111240942A (zh) * 2019-12-02 2020-06-05 华为技术有限公司 日志异常检测方法及装置
CN111538642B (zh) * 2020-07-02 2020-10-02 杭州海康威视数字技术股份有限公司 一种异常行为的检测方法、装置、电子设备及存储介质
CN114844778B (zh) * 2022-04-25 2023-05-30 中国联合网络通信集团有限公司 核心网的异常检测方法、装置、电子设备及可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514398A (zh) * 2013-10-18 2014-01-15 中国科学院信息工程研究所 一种实时在线日志检测方法及系统
CN104951555A (zh) * 2015-06-30 2015-09-30 浪潮(北京)电子信息产业有限公司 一种日志信息管理方法及日志信息管理终端
CN105183912A (zh) * 2015-10-12 2015-12-23 北京百度网讯科技有限公司 异常日志确定方法和装置
CN107707545A (zh) * 2017-09-29 2018-02-16 深信服科技股份有限公司 一种异常网页访问片段检测方法、装置、设备及存储介质
CN110210512A (zh) * 2019-04-19 2019-09-06 北京亿阳信通科技有限公司 一种自动化日志异常检测方法及系统
EP3582115A1 (fr) * 2018-06-15 2019-12-18 Dynatrace LLC Procédé et système d'analyse de données de journal sur la base de signatures superminhash
CN111240942A (zh) * 2019-12-02 2020-06-05 华为技术有限公司 日志异常检测方法及装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7937334B2 (en) * 2006-05-31 2011-05-03 Lockheed Martin Corporation System and method for defining normal operating regions and identifying anomalous behavior of units within a fleet, operating in a complex, dynamic environment
CN101452704B (zh) * 2007-11-29 2011-05-11 中国科学院声学研究所 一种基于信息传递的说话人聚类方法
US11194692B2 (en) * 2017-09-22 2021-12-07 Nec Corporation Log-based system maintenance and management
CN108776654A (zh) * 2018-05-30 2018-11-09 昆明理工大学 一种基于改进的simhash文本对比方法
CN110175158B (zh) * 2019-05-23 2020-11-10 湖南大学 一种基于向量化的日志模板提取方法和系统

Also Published As

Publication number Publication date
CN111240942A (zh) 2020-06-05
