CN111240942A - Log anomaly detection method and device - Google Patents

Log anomaly detection method and device

Info

Publication number
CN111240942A
Authority
CN
China
Prior art keywords
log
fragments
fragment
distance
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010066339.7A
Other languages
Chinese (zh)
Inventor
王琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN111240942A
Priority to PCT/CN2020/121544 (published as WO2021109724A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3065: Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072: Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/1805: Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815: Journaling file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a log anomaly detection method and device, belonging to the field of computer technology. The method comprises the following steps: obtaining a plurality of log fragments of a log, wherein each of the plurality of log fragments comprises a plurality of lines of log records in the log, and different log fragments among the plurality of log fragments differ in at least one line of log records; determining a distance between every two of the plurality of log fragments; and determining, based on the distance between every two of the plurality of log fragments, whether an abnormal log fragment exists among them. The method and device address the currently low efficiency of log anomaly detection and are applied to anomaly detection of logs.

Description

Log anomaly detection method and device
The present application claims priority to Chinese patent application No. 201911214265.0, entitled "method, apparatus, server, and storage medium for log anomaly detection", filed on December 2, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technologies, and in particular, to a log anomaly detection method and apparatus.
Background
By adding specific code to software source code, real-time records of the software's running state can be written to a file in text form; such a file is called a log (or log file). A log comprises multiple lines of log records (also called log statements), each of which records an event that occurred while the software was running. When an anomaly occurs in the log, a software developer (or operation and maintenance engineer) can optimize software performance based on that anomaly.
A log record in a log typically has an implicit log template (schema), i.e., the pattern or format of the record itself. At present, if it is necessary to determine whether an anomaly exists in a log, an analysis device compares the log (or the log within a time period) with a specified reference log to obtain the changes between the log templates of the two logs and presents those changes, and a software developer identifies anomalies in the log based on the presented content.
However, since the presence or absence of an anomaly in the log still requires manual identification, anomaly detection is inefficient.
Disclosure of Invention
The embodiments of the present application provide a log anomaly detection method and device, which can address the high operating cost of existing log anomaly detection methods. The technical solution is as follows:
In a first aspect, a log anomaly detection method is provided, the method comprising:
obtaining a plurality of log fragments of a log, wherein each of the plurality of log fragments comprises a plurality of lines of log records in the log, and different log fragments among the plurality of log fragments differ in at least one line of log records; determining a distance between every two of the plurality of log fragments; and determining, based on the distance between every two of the plurality of log fragments, whether an abnormal log fragment exists among them.
In the embodiments of the present application, the distance between every two of the plurality of log fragments of a log is obtained, and whether an abnormal log fragment exists among them is determined based on the obtained distances, thereby locating the abnormal log fragment. Whether the log is abnormal does not need to be identified manually, which effectively improves the efficiency of log anomaly detection.
Optionally, determining the distance between every two of the plurality of log fragments includes:
determining the distance between every two of the plurality of log fragments based on the locality-sensitive hash code of each of the plurality of log fragments.
Because the locality-sensitive hash code of a log fragment reflects the similarity between that fragment's data and the data of the other log fragments, the distance between the locality-sensitive hash codes of every two log fragments can be obtained and determined as the distance between the corresponding two log fragments. This enables the distance between every two log fragments to be determined quickly.
Optionally, the method further comprises: determining the locality-sensitive hash code of each of the plurality of log fragments based on the plurality of entries of each log fragment. With this approach, the data granularity used when obtaining the locality-sensitive hash code of each log fragment is the entry, so fewer operations are required and computation cost is saved.
Optionally, determining the locality-sensitive hash code of each of the plurality of log fragments based on the plurality of entries of each log fragment comprises: deduplicating the plurality of entries of each log fragment to obtain an entry set; and determining the locality-sensitive hash code of each log fragment based on the entry set corresponding to that fragment. Deduplication reduces the number of entries used in the subsequent locality-sensitive hash computation, thereby improving the computational efficiency of the analysis device.
Optionally, determining the locality-sensitive hash code of each log fragment based on the entry set corresponding to that fragment includes: calculating the sum of the hash codes of all entries in the entry set corresponding to each log fragment; and performing dimension reduction on the sum of hash codes corresponding to each log fragment to obtain its locality-sensitive hash code.
Calculating the sum of the hash codes of all entries in the entry set of each log fragment means that every entry has a weight of 1, so the calculation latency is short and the calculation efficiency is high. This also effectively amplifies the contribution of abnormal entries, which improves the probability of identifying abnormal log fragments.
In an optional manner, determining whether an abnormal log fragment exists among the plurality of log fragments based on the distance between every two of them includes:
determining a K-distance for each of the plurality of log fragments based on the distance between every two of them, where the K-distance of any log fragment is the distance between that log fragment and its K-th nearest log fragment among the plurality of log fragments, K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments; and determining, based on the K-distance of each log fragment, whether an abnormal log fragment exists among the plurality of log fragments.
By converting the distances between every two of the plurality of log fragments into a K-distance for each log fragment, the detection process needs to be executed only once, which effectively simplifies the detection of abnormal log fragments, reduces computational complexity, and improves the efficiency of anomaly determination.
Optionally, determining whether an abnormal log fragment exists among the plurality of log fragments based on the K-distance of each log fragment includes:
determining a target value range [μ - 3σ, μ + 3σ] based on the K-distances of the plurality of log fragments, where μ is the mean of the K-distances of the plurality of log fragments and σ is the standard deviation of those K-distances; when the K-distance of any log fragment is not within the target value range [μ - 3σ, μ + 3σ], that log fragment is determined to be an abnormal log fragment;
or determining a target value range [μ - 3σ, μ + 3σ] corresponding to a first log fragment, where the first log fragment is any one of the plurality of log fragments, μ is the mean of the K-distances of the log fragments other than the first log fragment, and σ is the standard deviation of the K-distances of the log fragments other than the first log fragment; when the K-distance of the first log fragment is not within the target value range [μ - 3σ, μ + 3σ], the first log fragment is determined to be an abnormal log fragment;
or determining an entropy value corresponding to each of the plurality of log fragments, where the entropy value corresponding to any log fragment is the entropy of the K-distances of the remaining log fragments after that log fragment is removed; when the difference between the maximum and minimum of the obtained entropy values is greater than a specified difference threshold, the log fragment corresponding to the maximum entropy value is determined to be an abnormal log fragment.
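A minimal sketch of the first option above (the 3σ rule over K-distances), assuming the pairwise distances are already available as a matrix; the function names and the choice of K are illustrative, not part of the claimed method.

    import math

    def k_distance(dist_row, k):
        # K-distance of one log fragment: the distance to its K-th nearest fragment
        # (dist_row holds its distances to every other fragment).
        return sorted(dist_row)[k - 1]

    def detect_by_3sigma(dist_matrix, k=3):
        # Flag fragments whose K-distance falls outside [mu - 3*sigma, mu + 3*sigma].
        g = len(dist_matrix)
        assert 0 < k < g, "K must be a positive integer smaller than G"
        k_dists = [k_distance([d for j, d in enumerate(row) if j != i], k)
                   for i, row in enumerate(dist_matrix)]  # drop the zero self-distance
        mu = sum(k_dists) / g
        sigma = math.sqrt(sum((d - mu) ** 2 for d in k_dists) / g)
        low, high = mu - 3 * sigma, mu + 3 * sigma
        return [i for i, d in enumerate(k_dists) if not (low <= d <= high)]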
In another optional manner, determining whether an abnormal log fragment exists among the plurality of log fragments based on the distance between every two of them includes:
the analysis device divides the plurality of log fragments into a plurality of log fragment sets based on the distance between every two log fragments, where each log fragment set comprises at least one log fragment; when the number of log fragments in any log fragment set is smaller than a specified count threshold, the log fragments in that set are determined to be abnormal log fragments; when the number of log fragments in any log fragment set is not smaller than the specified count threshold, the log fragments in that set are determined not to be abnormal log fragments.
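One possible reading of this alternative, sketched below: log fragments are grouped with a simple single-linkage distance threshold, and any group smaller than the count threshold is flagged. The thresholding scheme is an assumption, since the text does not fix a particular grouping algorithm.

    def group_by_distance(dist_matrix, dist_threshold):
        # Union fragments whose pairwise distance is within dist_threshold.
        g = len(dist_matrix)
        parent = list(range(g))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i in range(g):
            for j in range(i + 1, g):
                if dist_matrix[i][j] <= dist_threshold:
                    parent[find(i)] = find(j)

        groups = {}
        for i in range(g):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    def small_groups_as_anomalies(dist_matrix, dist_threshold, count_threshold):
        # Fragments in groups smaller than count_threshold are treated as abnormal.
        return [i for grp in group_by_distance(dist_matrix, dist_threshold)
                if len(grp) < count_threshold for i in grp]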
Optionally, the analysis device may divide the log into fragments according to an equal-division principle to ensure the accuracy of the finally located abnormal log fragments; for example, in the finally divided log fragments, different log fragments contain the same number of lines of log records, or different log fragments contain the same amount of data.
Optionally, after the distance between every two of the plurality of log fragments is obtained, the similarity between every two log fragments may also be determined based on the obtained distances. In the embodiments of the present application, the distance between two log fragments is negatively correlated with their similarity: the smaller the distance, the higher the similarity, and the larger the distance, the lower the similarity. For example, the similarity d between two log fragments satisfies d = 1/(1 + s), where s is the distance between those two log fragments. The analysis device may determine whether an abnormal log fragment exists among the plurality of log fragments based on the similarity between every two of them. In one example, the process includes: when the similarity between any log fragment and every other log fragment is smaller than a similarity threshold, that log fragment is determined to be an abnormal log fragment; when the similarity between a log fragment and the other log fragments is not smaller than the similarity threshold, that log fragment is determined not to be an abnormal log fragment.
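A two-line sketch of the distance-to-similarity conversion and threshold check just described; the distances and the threshold value are illustrative assumptions.

    def similarity(distance):
        # Monotonically decreasing mapping: smaller distance -> higher similarity.
        return 1.0 / (1.0 + distance)

    # Hypothetical distances from one fragment to every other fragment; with an
    # assumed similarity threshold of 0.2, this fragment would be reported as abnormal.
    is_abnormal = all(similarity(s) < 0.2 for s in [12, 15, 9])  # True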
In another example, the analysis device may determine a K-distance for each of the plurality of log fragments based on the similarity between every two of them, where the K-distance of any log fragment is the similarity between that log fragment and the log fragment ranked K-th from it, K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments; this K-distance therefore differs in definition from the K-distance determined directly from the distance between every two log fragments. Whether an abnormal log fragment exists among the plurality of log fragments is then determined based on the K-distance of each log fragment.
Based on the same concept as that of the first aspect, a second aspect of an embodiment of the present application provides a log anomaly detection method, including:
obtaining a plurality of log fragments of a log, wherein each of the plurality of log fragments comprises a plurality of lines of log records in the log, and different log fragments among the plurality of log fragments differ in at least one line of log records; determining a similarity between every two of the plurality of log fragments; and determining, based on the similarity between every two of the plurality of log fragments, whether an abnormal log fragment exists among them.
In a third aspect, an apparatus for detecting log anomalies is provided, where the apparatus may include at least one module, and the at least one module may be configured to implement the first aspect, the second aspect, or various possible implementations of the first aspect and the second aspect.
In a fourth aspect, the present application provides a computer device comprising a processor and a memory. The memory stores computer instructions; when the processor executes the computer instructions stored in the memory, the computer device executes the methods provided in the first aspect, the second aspect, or the various possible implementations of the first aspect and the second aspect, so that the computer device deploys the log anomaly detection apparatus provided in the third aspect or the various possible implementations of the third aspect.
In a fifth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions that instruct a computer device to execute the method provided in the first aspect, the second aspect, or various possible implementations of the first aspect and the second aspect, or instruct the computer device to deploy the log anomaly detection apparatus provided in the third aspect or various possible implementations of the third aspect.
In a sixth aspect, the present application provides a computer program product comprising computer instructions stored in a computer readable storage medium. A processor of the computer device may read the computer instructions from the computer-readable storage medium, and execute the computer instructions to cause the computer device to execute the method provided by the first aspect, the second aspect, or various possible implementations of the first aspect and the second aspect, so that the computer device deploys the log anomaly detection apparatus provided by the third aspect or various possible implementations of the third aspect.
In a seventh aspect, an analysis system is provided, comprising a terminal and an analysis device, wherein the analysis device comprises the log anomaly detection apparatus provided in the third aspect or its various possible implementations, or the computer device of the fourth aspect.
In an eighth aspect, a chip is provided, which may comprise a programmable logic circuit and/or program instructions and which, when run, is configured to implement the method provided by the first aspect, the second aspect, or their various possible implementations.
In the embodiments of the present application, the distance between every two of the plurality of log fragments of a log is obtained, and whether an abnormal log fragment exists among them is determined based on the obtained distances, thereby locating the abnormal log fragment. Whether the log is abnormal does not need to be identified manually, which effectively improves the efficiency of log anomaly detection.
The log anomaly detection method provided by the embodiments of the present application supports a log anomaly detection function. On one hand, the function can be triggered manually or automatically, for example at a specified time point or within a specified time period, and can be executed periodically and automatically; on the other hand, the function does not require a reference log to be specified; in yet another aspect, the function can identify abnormal log fragments, based on which abnormal log records can be accurately located. In summary, the log anomaly detection method provided by the embodiments of the present application offers greater flexibility in anomaly detection, a simple implementation process, and the ability to locate abnormal log fragments, so the efficiency of log anomaly detection can be effectively improved. In addition, because the method locates abnormal log fragments by comparing the similarity of the contents of the plurality of log fragments of the log, it can also detect unknown abnormal log fragments.
Furthermore, when the log anomaly detection method provided by the embodiments of the present application is applied to an online analysis scenario, abnormal log fragments can be located quickly, effectively reducing the time complexity and space complexity of locating them.
Drawings
FIG. 1 is a schematic diagram of a part of log contents in a log provided by an embodiment of the present application;
fig. 2 is a schematic view of an application environment related to a log anomaly detection method provided in an embodiment of the present application;
fig. 3 is a schematic diagram of another application environment related to the log anomaly detection method provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of a log anomaly detection method according to an embodiment of the present application;
fig. 5 is a schematic flow chart of the fragmentation process where the log is real-time log data, according to an embodiment of the present application;
fig. 6 is a schematic flow chart of the fragmentation process where the log is batch log data, according to an embodiment of the present application;
fig. 7 is a schematic processing flow diagram of a locality sensitive hashing algorithm according to an embodiment of the present application;
fig. 8 is a schematic diagram of a word segmentation result obtained by performing word segmentation in a space word segmentation manner according to an embodiment of the present application;
fig. 9 is a schematic diagram of another word segmentation result obtained by performing word segmentation in a space word segmentation manner according to the embodiment of the present application;
fig. 10 is a schematic diagram of a word segmentation result obtained by performing word segmentation by using a special character word segmentation method according to an embodiment of the present application;
fig. 11 is a schematic diagram illustrating an obtaining process of a locality sensitive hash code according to an embodiment of the present application;
fig. 12 is a schematic diagram illustrating an acquisition process of another locality-sensitive hash code according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a distance matrix provided in an embodiment of the present application;
fig. 14 is a schematic distribution diagram of spatial points corresponding to log fragments according to an embodiment of the present application;
FIG. 15 is a schematic diagram illustrating a normal distribution principle provided by an embodiment of the present application;
fig. 16 is a schematic structural diagram of a log anomaly detection apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of another log anomaly detection apparatus provided in the embodiment of the present application;
fig. 18 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
A log records the real-time state of running software. As shown in fig. 1, which is a schematic diagram of part of the content of a log, a log comprises multiple lines of log records (also called log text), and each line of log records (also called a record) is used to record an event that occurred while the software was running. Each line of log records consists of multiple characters, which may include letters and/or symbols.
Anomaly detection of a log can be performed by analyzing the log. When an anomaly occurs in the log, a software developer (or operation and maintenance engineer) can optimize software performance based on that anomaly. However, at present, whether an anomaly exists in the log still needs to be identified manually, so anomaly detection is inefficient.
The embodiment of the application provides a log anomaly detection method which can improve the efficiency of log anomaly detection. Referring to fig. 2, fig. 2 is a schematic view of an application environment related to a log anomaly detection method according to an embodiment of the present application. The application environment includes a terminal 110, an analysis device 120, and a network device 130.
The terminal 110 may be a device capable of interacting with a user, such as a display, a computer, a smartphone, a tablet, or a laptop. The analysis device 120 may be a server, a server cluster composed of several servers, or another device capable of performing data analysis. Alternatively, the analysis device 120 may be a cloud server (also referred to as a cloud computing server), for example a deep learning server for providing a Deep Learning Service (DLS). The terminal 110 establishes a wired or wireless communication connection with the analysis device 120 through a communication network. The network device 130 may be a sensor or a terminal that can run software and generate log data. The network device 130 is configured to provide the analysis device 120 with the data to be analyzed, the analysis device 120 is configured to analyze the log data, and the terminal 110 is configured to present the analysis result to the user. The communication network referred to in the embodiments of the present application is a second-generation (2G) communication network, a third-generation (3G) communication network, a Long Term Evolution (LTE) communication network, a fifth-generation (5G) communication network, or the like.
Optionally, the foregoing application environment may further include a storage device configured to store the data that the terminal 110, the analysis device 120 and/or the network device 130 need to store. The storage device may be a distributed storage device, and the terminal 110, the analysis device 120 and/or the network device 130 may read and write the data stored in it. When the application scenario involves a large amount of data, having the storage device store the data reduces the load on the analysis device and improves its data analysis efficiency. It should be noted that when the amount of data in the application environment is small, a separate storage device need not be provided; in this case, the functions of the terminal 110 and the analysis device 120 may also be implemented by the same device, such as a computer.
As shown in fig. 3, the application environment includes two parts, a foreground 201 and a background 202. The foreground 201 is used for presenting data to a user, receiving data input by the user, and realizing interaction with the user; the background 202 is used for performing data interaction with the foreground 201, and performing management operation and/or data processing and the like. Wherein, the foreground 201 may be deployed in the aforementioned terminal 110. The background 202 may be deployed in the aforementioned analysis device 120. For example, a client, a script, or a browser may be installed in the terminal 110 to implement the deployment of the foreground 201. As such, the terminal 110 may present the user interface in the form of a client interface, a terminal interface, or a web page corresponding to a browser.
The log anomaly detection method provided by the embodiments of the present application can be used in log analysis scenarios such as software debugging, performance optimization or service analysis, and in particular in the anomaly detection scenarios within those log analysis scenarios. Anomaly detection refers to detecting patterns that do not match expectations. In the embodiments of the present application, the data source for anomaly detection is log data generated by software running in an application, a process, an operating system, a device or a network, and the data may be stored in a database, a local file or a message queue. For example, in a streaming scenario where the log is a log stream, the data is stored in a message queue, optionally a Kafka message queue. For example, the aforementioned analysis device 120 may employ a deep learning algorithm to perform anomaly detection on the log data.
The embodiments of the present application provide a log anomaly detection method in which a plurality of log fragments of a log are obtained and abnormal log fragments are detected by comparing the similarity of the fragments' contents, the content of an abnormal log fragment differing significantly from that of the other log fragments. Based on this principle, as shown in fig. 4, the method includes:
step 301, the analysis device obtains a log, wherein the log comprises a plurality of rows of log records.
The log analysis scene comprises an off-line analysis scene and an on-line analysis scene. In an offline analysis scenario, the log data to be analyzed may be batch (batch) log data stored in a log database, such as a log file, or log data obtained by querying in the log database, where the log file is typically a file downloaded by a user, a software developer, or an operation and maintenance worker, or a file obtained by keyword search. The analysis device can read the log in the log database to obtain the log. In an online analysis scenario, the log data to be analyzed may be log data collected in real time, which is also called log stream (log stream) data. The analysis device can collect the logs through the collector to achieve the acquisition of the logs.
As previously mentioned, the log has both bulk log data and real-time log data. In the embodiment of the application, the analysis device supports the analysis of the logs in the two forms. In an optional example, the analysis device periodically obtains the log file, or obtains the log file in a specified time period to obtain the batch of log data, where the specified time period may be a low power consumption time period (i.e., a time period in which power consumption is less than a specified power consumption threshold) of the terminal and/or the server, so that the influence of the log file obtaining and subsequent log analysis on other functions of the terminal and/or the server can be reduced; in another alternative example, the analysis device continuously acquires real-time log data; in yet another alternative example, the analysis device obtains batch log data or real-time log data after receiving the analysis instruction. The analysis instruction can be generated by triggering at the terminal by a user and sent to the analysis device by the terminal.
When the analysis equipment acquires and analyzes the log stream in real time, the log stream can be monitored in time, and if the log stream is abnormal, the log stream can be found and reported in time, so that the effectiveness of abnormal detection is improved, the occurrence of large-scale abnormality is avoided, and the user experience is improved.
Step 302, the analysis device obtains a plurality of log fragments of the log.
In the embodiments of the present application, after obtaining the log, the analysis device may obtain a plurality of log fragments based on it. Each of the plurality of log fragments comprises multiple lines of log records in the log, i.e., a log fragment is a collection of multiple lines of log records. Optionally, each log fragment comprises consecutive lines of log records in the log. Different log fragments among the plurality of log fragments differ in at least one line of log records.
The embodiments of the present application determine whether the log is abnormal by comparing the similarity between log fragments, and the determination is most accurate when the log fragments are the same or similar in size. Therefore, in this step, the obtained log fragments are the same or similar in size. The plurality of log fragments may be obtained in various ways; the embodiments of the present application take the following ways as examples:
In the first mode, the log is divided into a plurality of log fragments according to the number of lines of log records.
Optionally, the division rule for log fragments is: the divided log fragments all contain the same number of lines of log records. Accordingly, the analysis device can divide every m consecutive lines of log records into one log fragment according to a specified order, where m is an integer greater than 1; optionally, m is greater than or equal to 500 and less than or equal to 1500, for example m = 1000. The specified order may be the front-to-back order of the log records in the log.
When the log is real-time log data, the analysis device may store the acquired log records as one log fragment each time m lines of log records have been acquired (for example, each time m lines of log records have been read from the data source), thereby obtaining a plurality of log fragments.
When the log is batch log data and the first mode is used, the last log fragment may end up with fewer than m lines of log records. If n lines of log records remain at that point, the m - n lines of log records adjacent to those n lines can be combined with them into one log fragment, so that the last two adjacent log fragments share the same m - n lines of log records. In practice, the embodiments of the present application may also combine some other m - n lines of log records with the remaining n lines into one log fragment, as long as every divided log fragment ends up containing m lines of log records.
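A minimal sketch of the first mode for batch log data, assuming the log is already available as a list of lines; the tail handling mirrors the description above, borrowing the m - n preceding lines so that every fragment contains exactly m lines. The function name and the default value of m are illustrative only.

    def shard_by_rows(lines, m=1000):
        # Split the log records into fragments of exactly m consecutive lines.
        shards = [lines[i:i + m] for i in range(0, len(lines), m)]
        if len(shards) > 1 and len(shards[-1]) < m:
            # The tail has only n < m lines: borrow the m - n lines immediately
            # before it, so the last two fragments overlap by m - n lines.
            shards[-1] = lines[len(lines) - m:]
        return shards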
In the second mode, the log is divided into a plurality of log fragments according to the data volume of the log records.
Optionally, the division rule for log fragments is: the divided log fragments all contain log records of the same data volume. Accordingly, the analysis device may divide log records amounting to a specified data volume into one log fragment according to a specified order; the specified data volume may be 5 to 15 megabytes, for example 10 megabytes. The specified order may be the front-to-back order of the log records in the log.
When the log is real-time log data, the analysis device may store the acquired log records as one log fragment each time the specified data volume of log records has been acquired (for example, each time the specified data volume of log records has been read from the data source), thereby obtaining a plurality of log fragments.
When the log is batch log data and the second mode is used, the last log fragment may end up with less than the specified data volume of log records. Suppose the data volume of the remaining log records is x and the specified data volume is y; in this case, one or more lines of log records with data volume y - x adjacent to the remaining log records can be combined with them into one log fragment, so that the last two adjacent log fragments share log records with data volume y - x. In practice, other log records with data volume y - x may also be combined with the remaining log records into one log fragment, as long as every divided log fragment ends up containing the specified data volume of log records.
For ease of understanding, fig. 5 illustrates the fragmentation flow with the log being real-time log data (i.e., a log stream). After reading log records of the log, the analysis device writes them into a message queue (such as a Kafka message queue). When log records are read from the message queue and the data read has not yet reached a target size, for example m lines or the specified data volume, the analysis device first caches the read log records; each subsequent read from the message queue is concatenated with the cached log records until the concatenated log records reach the target size, at which point the log records of the target size are divided into one log fragment. Fig. 5 takes a target size of 1000 lines and 4 divided log fragments as an example, which is not limiting.
Fig. 6 illustrates the fragmentation flow with the log being batch log data. The analysis device may load all the log records of the log into memory at once, then traverse them and perform the fragmentation operation according to the target size (for example, m lines or the specified data volume) until all the log records have been fragmented. Fig. 6 takes 4 divided log fragments as an example, which is not limiting.
It should be noted that the first and second modes divide the log fragments according to an equal-division principle. In an actual implementation of the embodiments of the present application, the divided log fragments may also contain different numbers of lines of log records, as long as the difference in line count between any two log fragments is within a specified line-count difference range; or the divided log fragments may contain different data volumes, as long as the difference in data volume between any two log fragments is within a specified data-volume difference range.
Optionally, the plurality of log fragments may also be obtained by sliding-window division, which is not described in detail in the embodiments of the present application.
Step 303, the analysis device determines the distance between every two log fragments in the plurality of log fragments.
The distance between every two log fragments reflects the similarity between them. Optionally, the process by which the analysis device determines the distance between every two of the plurality of log fragments includes:
Step A1, determining a locality-sensitive hash code for each of the plurality of log fragments.
For ease of understanding, a brief introduction to the Locality-Sensitive Hash (LSH) code follows. A locality-sensitive hash code is a hash code obtained with a locality-sensitive hashing algorithm, and it reflects the similarity of the data processed by that algorithm (the input data); in this embodiment, that data is the data of the aforementioned log fragments. A locality-sensitive hashing algorithm preserves the similarity relationships between input data. As shown in fig. 7, for similar input data, the resulting locality-sensitive hash codes (the output data) are also very similar; when the input data are extremely similar (fig. 7 takes two lines of log records as the input data), the resulting locality-sensitive hash codes may even collide, i.e., different but similar input data yield identical locality-sensitive hash codes (in fig. 7 the output data are all "1101101"). It follows that the locality-sensitive hash code can be used as a feature of a log fragment; in the embodiments of the present application it is referred to as the signature of the log fragment, and the more similar the signatures of two log fragments are, the closer the contents of the two log fragments are.
In the embodiments of the present application, the analysis device may determine the locality-sensitive hash code of each log fragment in multiple ways; the following two optional implementations are taken as examples:
in a first alternative implementation, the determining a locality-sensitive hash code for each log slice of the plurality of log slices includes:
step A11, the analysis device obtains a plurality of terms (tokens) of each log fragment of a plurality of log fragments.
Optionally, the analysis device may perform word segmentation on each row of log records in each log fragment through a word segmentation technique to obtain a plurality of entries of each log fragment. The term includes at least one minimum semantic unit, typically only one minimum semantic unit. The semantic unit is a word, a phrase or a symbol, and the symbol can be a numeric symbol, a number for short, such as 1 or 2, or other symbols, such as "/" or ": ". In general, in each log fragment, a row of log records may be divided into at least two entries; in a few cases, one row of log records may be divided into one entry, and the number of entries divided in each log fragment is not limited in the embodiment of the present application.
The purpose of word segmentation is to cut each row of log records of each log fragment into a set of entries, and the word segmentation processing can reduce the processing complexity of the log records, reduce the operation cost of subsequent local sensitive hash codes and improve the operation efficiency.
In the embodiment of the application, word segmentation can be performed in different ways. For example, space division is adopted; or, special character word segmentation is adopted; or, dividing words by using a specified segmentation character, wherein the specified segmentation character comprises a blank space or a special character; or to use natural language segmentation.
As shown in fig. 8, fig. 8 is a schematic diagram of a tokenization result obtained by splitting on spaces. Splitting on spaces means cutting a line of log records into multiple entries at the spaces; the implementation is simple and the splitting is efficient.
When splitting on special characters, the special characters are usually characters specified by the user, such as "\n", "[", "{", "(", "\t", "\r", "|" or "##". If an entry is meant to contain only one minimum semantic unit, splitting on special characters makes the semantic units in the resulting entries more accurate than splitting on spaces. For example, for the log record "20171223-22:15:35:11|Step_sputits|30002312|gettodaytodetaldetaldail delaysteps = 1514038440000##7015##548365##1301##13026##27177962", the result of splitting on spaces, shown in fig. 9, leaves both the first and last entries containing a large amount of information; the last entry, for instance, still contains several numbers, each of which carries its own meaning and should be cut out separately. Splitting on spaces therefore cannot efficiently cut every entry down to a minimum semantic unit. Assuming the special characters include "|", "##" and "=", the tokenization result obtained by splitting on special characters is shown in fig. 10: every entry obtained is a minimum semantic unit, so the splitting precision is higher.
Splitting on a specified delimiter is a combination of space splitting and special-character splitting. For example, the specified delimiters may include "\n", "[", "{", "(", "\t", "\r", "|", "##" and the space.
Tokenizing with a natural-language tokenizer is also common. In this approach, the log records in a log fragment can be fed directly into a natural-language tokenizer, such as Word_Tokenizer, TreeBank_Tokenizer or S-Expression_Tokenizer in NLTK (Natural Language Toolkit). Compared with splitting on specified delimiters or special characters, natural-language tokenization does not require the user to specify splitting symbols in advance, which simplifies the user's operation.
Optionally, during tokenization, the analysis device may feed each log fragment as a character stream into a designated tokenizer, which performs the tokenization; the analysis device then receives the tokenization result output by the tokenizer.
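A minimal tokenizer sketch using the space-plus-special-character splitting described above; the delimiter set below is an assumption drawn from the examples in the text, and the helper names are illustrative only.

    import re

    # Assumed delimiter set: whitespace plus a few of the special characters
    # listed above, such as "|", "=" and "##".
    _SPLIT_PATTERN = re.compile(r"##|[|=\[\]{}()\s]+")

    def tokenize(log_line):
        # Cut one line of log records into entries, dropping empty pieces.
        return [tok for tok in _SPLIT_PATTERN.split(log_line) if tok]

    def fragment_entries(fragment_lines):
        # All entries of a log fragment; deduplicate with set(...) if an entry
        # set rather than an entry sequence is required.
        return [tok for line in fragment_lines for tok in tokenize(line)]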
Step A12, the analysis device determines the locality-sensitive hash code of each of the plurality of log fragments based on the plurality of entries of each log fragment.
In the embodiments of the present application, the plurality of entries of each log fragment may be stored in different forms, and the way the locality-sensitive hash code of each log fragment is determined differs accordingly. The embodiments of the present application take the following two cases as examples:
In the first case, the plurality of entries of each log fragment are stored as an entry set. In the entry set corresponding to a log fragment, the entries of that fragment no longer have an order relationship.
For each log fragment, the analysis device may directly use the collection of entries obtained by tokenization as the entry set, or it may deduplicate the entries of the log fragment to obtain the entry set. The analysis device may then determine the locality-sensitive hash code of each log fragment based on the entry set corresponding to that fragment. Deduplication reduces the number of entries used in the subsequent locality-sensitive hash computation, thereby improving the computational efficiency of the analysis device.
The analysis device may determine the locality-sensitive hash code of each log fragment based on the entry set corresponding to that fragment; for example, it may do so based on a target locality-sensitive hashing algorithm and the entry set corresponding to each log fragment. The locality-sensitive hash computation in the target locality-sensitive hashing algorithm may follow, for example, the computation in the Simhash algorithm or the Minhash algorithm. The minimum unit of data processed by the target locality-sensitive hashing algorithm is the entry. The embodiments of the present application are illustrated with the following two optional examples:
in a first optional example, in the target locality-sensitive hashing algorithm, after obtaining the entry set corresponding to each log slice, a locality-sensitive hashing code of the certain log record may be determined in a weighted summation manner. The process may refer to the Simhash algorithm. The process of determining the locality-sensitive hash code of the certain log record by using weighted summation may include:
step A121, for any log fragment, calculating a hash code of each entry in the entry set corresponding to the log fragment, where the hash code is composed of binary numbers 0 and 1.
And step a122, performing weighted summation on the Hash codes of the calculated entries, that is, W ∑ Hash × weight, where W represents a Hash sequence obtained after the weighted summation, Hash represents the Hash code of each entry, and weight represents a weight of each entry.
Optionally, the weight of each entry may be positively correlated with the word frequency of the entry in the entry set. I.e. the higher the word frequency, the larger the weight. Typically, the weight of each entry is equal to the word frequency of the entry in the set of entries. Term frequency refers to the number of times an entry occurs. For example, if the entry "we" appears 5 times in a set of entries, the frequency of the entry "we" is 5.
And step A123, performing dimensionality reduction on the obtained weighted summation result to obtain the locality sensitive hash code.
In the weighted summation process in the foregoing step a122, the product of each hash code and its weight is expressed by the following rule: when the value in the hash code is 1, the summation result of the corresponding positions is: 1 and the weight are multiplied positively, and if the value in the hash code is 0, the summation result of the corresponding position is as follows: and 1 is multiplied by the weight negatively.
The dimension reduction in the foregoing step a123 means that a value greater than 0 is reduced to 1, and a value not greater than 0 is reduced to 0. The process of performing dimensionality reduction on the obtained weighted summation result includes setting a value greater than 0 in the obtained weighted summation result to be 1, and setting a value not greater than 0 in the weighted summation result to be 0.
For example, suppose the entry set corresponding to log fragment X1 contains the entries "flush", "cost", "time", "is" and "122". Fig. 11 shows the acquisition process of the locality-sensitive hash of log fragment X1. In fig. 11 it is assumed that the weight of the first entry is 3, the weight of the second entry is 2, and the weights of the other entries are 1. The hash code of "flush" is "10010111"; multiplied by its weight 3, it becomes "3, -3, -3, 3, -3, 3, 3, 3" (the commas only separate positions and do not exist in the actual calculation). The weighted hash codes of the entries are then summed position by position (i.e., corresponding positions are added). The final weighted sum is "6, -4, -6, 6, -6, 0, 8, 4", where the first position, 6, is the sum of the first positions of the products of each entry and its weight, i.e., 3 + 2 + 1 + 1 - 1, and the second position, -4, is the sum of the second positions of those products, i.e., (-3) + (-2) + 1 + (-1) + 1; the other positions are calculated in the same way. After dimension reduction, the weighted sum "6, -4, -6, 6, -6, 0, 8, 4" becomes "10010011", i.e., the locality-sensitive hash code of log fragment X1 is "10010011".
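A compact sketch of steps A121 to A123 in the style of Simhash. The 8-bit token hash is an illustrative stand-in for whatever hash function an implementation actually uses, so the resulting bits differ from the illustrative values in fig. 11; the function and variable names are assumptions.

    import hashlib
    from collections import Counter

    BITS = 8  # illustrative signature width; real systems often use 64 bits

    def entry_hash(entry, bits=BITS):
        # Deterministic bits-wide hash of one entry (step A121).
        digest = int(hashlib.md5(entry.encode("utf-8")).hexdigest(), 16)
        return [(digest >> i) & 1 for i in range(bits)]

    def locality_sensitive_hash(entries, weights=None, bits=BITS):
        # Weighted summation (step A122): bit 1 contributes +weight, bit 0 contributes -weight.
        weights = weights or {}
        acc = [0] * bits
        for entry in entries:
            w = weights.get(entry, 1)  # default weight 1
            for pos, bit in enumerate(entry_hash(entry, bits)):
                acc[pos] += w if bit == 1 else -w
        # Dimension reduction (step A123): values > 0 become 1, others become 0.
        return [1 if v > 0 else 0 for v in acc]

    # Term-frequency weighting: the weight of each entry is its number of
    # occurrences in the fragment before deduplication.
    entries_before_dedup = ["flush", "cost", "time", "is", "122", "flush"]
    signature = locality_sensitive_hash(set(entries_before_dedup),
                                        weights=Counter(entries_before_dedup))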
It should be noted that when the entry set is obtained through deduplication, one option is to record the term frequency of each entry before deduplication and use it to determine the entry's weight; in that case, whether or not deduplication is applied, the same locality-sensitive hash code is computed for the same log fragment via steps A121 to A123. Another option is not to record the pre-deduplication term frequencies and to set the weight of every entry to 1, so that all entries of the same log fragment have an equal weight of 1. In that case, the process by which the analysis device determines the locality-sensitive hash code of each log fragment based on the target locality-sensitive hashing algorithm and the corresponding entry set includes:
step A124, the analysis device calculates the sum of the hash codes of all the entries in the entry set corresponding to each log fragment.
Step A124 corresponds to steps A121 and A122 with weight = 1: for any log fragment, the pre-reduction value is simply the sum of the hash codes of its entries, i.e., W = Σ hash, where hash denotes the hash code of an entry.
Step A125, perform dimension reduction on the sum of hash codes corresponding to each log fragment to obtain the locality-sensitive hash code of that log fragment.
Step A125 corresponds to step A123 with weight = 1.
With the target locality-sensitive hashing algorithm, setting every weight to 1 yields a shorter computation latency and higher efficiency, because the sum of the entries' hash codes is taken directly and all weights equal 1. If the weights were instead set according to term frequency, the weights of normal entries would far exceed those of abnormal entries, because a log fragment usually contains far more normal entries than abnormal ones. Setting all weights to 1 effectively lowers the weight of the normal entries, which amplifies the contribution of the abnormal entries and improves the probability of identifying abnormal log fragments. For example, if a log fragment X contains 5 abnormal entries and 1000 normal entries, while another log fragment Y contains 1005 normal entries, 1000 of which are identical to the 1000 normal entries of log fragment X, and contains no abnormal entries, then setting all weights to 1 makes the locality-sensitive hashes of log fragment X and log fragment Y clearly different, so the abnormal log fragment can be effectively distinguished in the subsequent process. An abnormal log fragment contains entries that differ markedly from those of the other log fragments; those markedly different entries are the abnormal entries.
Fig. 12 illustrates another acquisition process of the locality-sensitive hash code of log fragment X1, in which the weight of every entry is assumed to be 1. For example, in fig. 12 the hash code of "flush" is "10010111"; multiplied by the weight 1, it becomes "1, -1, -1, 1, -1, 1, 1, 1" (the commas only separate positions and do not exist in the actual calculation). The weighted hash codes of the entries are then summed position by position (i.e., corresponding positions are added). The final weighted sum for log fragment X1 is "3, -1, -3, 3, -3, -1, 5, 1", where the first position, 3, is the sum of the first positions of the products of each entry and the weight 1, i.e., 1 + 1 + 1 + 1 - 1, and the second position, -1, is the sum of the second positions of those products, i.e., (-1) + (-1) + 1 + (-1) + 1; the other positions are calculated in the same way. After dimension reduction, the weighted sum "3, -1, -3, 3, -3, -1, 5, 1" becomes "10010011", i.e., the locality-sensitive hash code of log fragment X1 is "10010011".
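For this all-weights-equal-to-1 variant (steps A124 and A125), the sketch above can be called without a weight table. Because the 8-bit token hash there is only a stand-in, the resulting bits will not match the illustrative "10010011" of fig. 12.

    # Unweighted variant: every entry contributes with weight 1.
    signature_x1 = locality_sensitive_hash(["flush", "cost", "time", "is", "122"])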
In a second alternative implementation, the determining the locality-sensitive hash code for each log shard of the plurality of log shards includes: and determining the locality sensitive hash code of each log fragment directly based on the content of each log fragment, that is, not performing the entry obtaining step of the step a 11. The analysis device may determine the locality sensitive hash code of each row of log records based on the aforementioned target locality sensitive hash algorithm and the content of each log slice. For example, the analysis device may input the content (i.e., character stream) of each log slice into the algorithm model of the target locality-sensitive hash algorithm, and receive the locality-sensitive hash code of each log slice output by the algorithm model. The minimum unit of data processed by the target locality-sensitive hashing algorithm is a character.
In the second optional implementation manner, the data granularity (i.e., the minimum unit of data processed by the target locality-sensitive hash algorithm) when the locality-sensitive hash code of each log slice is obtained is a character, and in the first optional implementation manner, the data granularity when the locality-sensitive hash code of each log slice is obtained is a term. Therefore, compared with the second optional implementation manner, the first optional implementation manner has a larger data granularity when the locality sensitive hash code of each log fragment is obtained, so that the first optional implementation manner has a smaller operation frequency compared with the second optional implementation manner, and the operation cost can be saved.
In the second case, the entries of each log slice are stored in a sequence of entries. In the entry sequence corresponding to each log fragment, a plurality of entries of each log fragment have an order relationship.
For each log fragment, the analysis device may directly arrange the multiple entries of the log fragment obtained by word segmentation according to their order before word segmentation to obtain an entry sequence. The analysis device may then determine the locality sensitive hash code of each log fragment based on the entry sequence corresponding to each log fragment.
The process in which the analysis device determines the locality sensitive hash code of each log fragment based on the entry sequence corresponding to each log fragment may refer to the process of determining the locality sensitive hash code of each log fragment based on the entry set obtained without performing deduplication processing, which is not described in detail herein.
Step a2, the analysis device determines the distance between every two log fragments in the plurality of log fragments based on the locality sensitive hash code of each log fragment.
As described in step a1, since the locality-sensitive hash code of a log slice can reflect the similarity between the data of the log slice and the data of other log slices, the distance between the locality-sensitive hash codes of every two log slices can be obtained, and the obtained distance is determined as the distance between every two corresponding log slices. This enables a fast determination of the distance between every two log slices.
For example, the analysis device may calculate a distance between the locality sensitive hash codes of every two log slices based on the locality sensitive hash code of each log slice and a specified distance algorithm, and determine the calculated distance as a distance between every two corresponding log slices.
For example, the specified distance algorithm may be a Hamming distance algorithm, and accordingly, the obtained distance is a Hamming distance. The analysis device may determine the Hamming distance between the locality sensitive hash codes of every two log fragments as the distance between the two log fragments. The Hamming distance is the number of positions at which two character sequences of equal length differ. For example, for the character sequences 0110 and 0000, the second and third bits are different, so the Hamming distance between the two character sequences is 2.
Alternatively, the specified distance algorithm may be another distance algorithm, for example a Euclidean distance algorithm. In that case, the analysis device may determine the Euclidean distance between the locality sensitive hash codes of every two log fragments as the distance between the two log fragments, where the Euclidean distance is the straight-line distance between two points in space.
It should be noted that in the embodiment of the present application, other algorithms may also be used to calculate the distance between the locally sensitive hash codes of every two log fragments, which is not described in detail in this embodiment of the present application.
It should be noted that the foregoing steps a1 and a2 are only an exemplary way of determining the distance between every two log fragments in the plurality of log fragments provided by the embodiment of the present application. In actual implementation, the analysis device may also determine the distance between every two log fragments in other manners. For example, the analysis device may determine the distance between every two log fragments by using a Jaccard similarity function, and the resulting distance is referred to as a Jaccard distance, a Jaccard similarity, or a Jaccard coefficient. By way of example, the Jaccard similarity function is: D = R1/R2, where D represents the Jaccard distance, R1 is the intersection of the contents of the two log fragments, and R2 is the union of the contents of the two log fragments; or D represents the Jaccard distance, R1 is the intersection of the entries of the two log fragments, and R2 is the union of the entries of the two log fragments. Optionally, when the Jaccard similarity function is used to determine the distance between every two log fragments and multiple entries of each log fragment need to be obtained, the process of obtaining the multiple entries of each log fragment may refer to the foregoing step a11.
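As a minimal sketch of the Jaccard-based alternative, assuming the entries of each log fragment have already been obtained (for example, by the word segmentation of step a11), the coefficient D = R1/R2 described above can be computed as follows; the example entries are hypothetical.

```python
def jaccard_coefficient(entries_a, entries_b):
    # D = |intersection| / |union| of the two entry sets, as defined above.
    set_a, set_b = set(entries_a), set(entries_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty fragments are treated as identical (assumption)
    return len(set_a & set_b) / len(union)

# Example with two small, hypothetical entry lists.
d = jaccard_coefficient(["user", "login", "ok"], ["user", "login", "error"])
print(d)  # 0.5
```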
Since the analysis device obtains the distance between each log fragment and every other log fragment in the plurality of log fragments, and the number of log fragments obtained by dividing one log is usually large, for example 3 to 8, the analysis device finally obtains a plurality of distance values. For example, if there are w log fragments and the obtained distance values include the distance between each log fragment and itself (that distance being 0), w² distance values are obtained; if the obtained distance values do not include the distance between a log fragment and itself, (w² - w) distance values are obtained. The plurality of distance values may be represented by a distance matrix as shown in fig. 13. Fig. 13 assumes that the analysis device acquires 4 log fragments, namely log fragments 1 to 4, that the locality sensitive hash codes of log fragments 1 to 4 are 01010101, 01010111, 00010111 and 11110010, respectively, and that the Hamming distance between the locality sensitive hash codes of every two log fragments is determined as the distance between the two log fragments. Then, as shown in fig. 13, the distances between log fragment 1 and log fragments 2 to 4 are 1, 2 and 5, respectively; the distances between log fragment 2 and log fragments 1, 3 and 4 are 1, 1 and 4, respectively; the distances between log fragment 3 and log fragments 1, 2 and 4 are 2, 1 and 5, respectively; and the distances between log fragment 4 and log fragments 1 to 3 are 5, 4 and 5, respectively. The distance values in the lower left corner and the upper right corner of the distance matrix in fig. 13 are symmetrically distributed; therefore, the distance between each log fragment and every other log fragment in the plurality of log fragments can be represented by the content of the lower left corner or the upper right corner of the distance matrix alone.
As can be seen from the distance matrix in fig. 13, the distances between log fragments 1 to 3 are relatively small, that is, the contents of log fragments 1 to 3 are relatively close, while the distances between the locality sensitive hash code of log fragment 4 and the locality sensitive hash codes of log fragments 1 to 3 are relatively large, so that the content of log fragment 4 differs considerably from the contents of log fragments 1 to 3 and log fragment 4 may be abnormal. The abnormal log fragment can be identified in subsequent step 304.
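As an illustration of how the distance matrix of fig. 13 can be obtained from the locality sensitive hash codes listed above, the following sketch computes the pairwise Hamming distances; it reproduces the distance values described for fig. 13, but is only an illustrative sketch, not the implementation of the method.

```python
def hamming_distance(code_a, code_b):
    # Number of positions at which the two equal-length bit strings differ.
    assert len(code_a) == len(code_b)
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

# Locality sensitive hash codes of log fragments 1 to 4, as in the fig. 13 example.
codes = ["01010101", "01010111", "00010111", "11110010"]

# Distance matrix: entry [i][j] is the distance between log fragment i+1 and j+1.
matrix = [[hamming_distance(a, b) for b in codes] for a in codes]
for row in matrix:
    print(row)
# [0, 1, 2, 5]
# [1, 0, 1, 4]
# [2, 1, 0, 5]
# [5, 4, 5, 0]
```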
Step 304, the analysis device determines whether an abnormal log fragment exists in the plurality of log fragments based on the distance between every two log fragments in the plurality of log fragments.
In this embodiment of the present application, the process of determining, by the analysis device, whether an abnormal log fragment exists in the plurality of log fragments based on the distance between every two log fragments in the plurality of log fragments may be implemented by using multiple abnormality detection manners, and this embodiment of the present application takes the following two abnormality detection manners as examples:
a first anomaly detection method for determining whether an anomalous log fragment exists in a plurality of log fragments based on a K-Distance (K-Distance) between every two log fragments in the plurality of log fragments, the anomaly detection method comprising:
step B1, the analysis device determines a K-distance of each of the plurality of log shards based on a distance between every two of the plurality of log shards.
The K-distance of any log fragment in the plurality of log fragments is the distance between that log fragment and the log fragment that is the Kth closest to it among the plurality of log fragments, where K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments. That is, K is less than or equal to G - 1. For example, assuming that K is 2 and G is 8, each log fragment has distances to 7 other log fragments, and for any log fragment X2 in the plurality of log fragments, its K-distance is the distance between log fragment X2 and the log fragment that is the 2nd closest to log fragment X2 among the plurality of log fragments. In step B1, the larger the K value is, the lower the sensitivity is and the lower the accuracy of the finally determined abnormal log fragment is; therefore, a smaller K value may be set here, for example, K = 1, or K ≤ G × 5% when G × 5% is greater than 1.
In obtaining the K-distance of each log slice, the analysis device may first obtain the distance between every two log slices, then, for each log slice, sort the obtained distances to the log slice, for example, in an ascending order or a descending order, and then determine the K-distance of the log slice based on the sorting result.
As shown in step 303, the analysis device may obtain a plurality of distance values corresponding to each log fragment, and the plurality of distance values may be represented by a distance matrix. If the abnormal log fragments were analyzed directly based on the distance matrix, then, with w log fragments, it would be necessary to determine whether each log fragment is an abnormal log fragment based on the (w - 1) distances between that log fragment and the other log fragments, so that (w - 1) log fragment detection processes would be needed for the plurality of log fragments; the detection process may refer to the procedure of subsequent step B2. In the embodiment of the present application, since the K-distance algorithm is adopted, the obtained distance matrix (as mentioned above, comprising (w² - w) distance values) is converted into a one-dimensional set of w distance values, so that the log fragment detection process only needs to be executed once. This effectively simplifies the detection process of the abnormal log fragments, reduces the computational complexity, and improves the efficiency of the abnormality judgment.
Each log fragment in the plurality of log fragments obtained in step 302 may in fact be regarded as a point in a high-dimensional space, and the distance between every two log fragments is then the distance between two points in that space. If each log fragment is identified by its locality sensitive hash code as described above, the dimension of the high-dimensional space may be the number of bits of the locality sensitive hash code; for example, a 128-bit locality sensitive hash code may be regarded as a point in a 128-dimensional space. For ease of understanding, fig. 14 illustrates an example in which the plurality of log fragments includes 5 log fragments, namely log fragments 1 to 5, and each log fragment is treated as a point in a 2-dimensional space; in fig. 14, the 5 log fragments are represented by points A to E in one-to-one correspondence.
Table 1 is a distance matrix recording the distances between the log fragments shown in fig. 14. As can be seen from fig. 14 and Table 1, assuming that K is 2, the distance from point A to point B is 1, to point C is 2, to point D is 9, and to point E is 1.5; point E is therefore the second closest point to point A, and the distance from point A to point E, namely 1.5, is the 2-distance of point A. The 2-distance of each of the other points can be calculated in the same way, and the K-distances of the log fragments (corresponding to points A to E) finally obtained based on Table 1 are 1.5, 1.5, 2, 8 and 2, respectively, as shown in Table 2.
Table 1 records 5 log fragments in total, that is, w = 5, and it records w² = 25 distance values. Even if the distance between every two log fragments is recorded using only the data in the upper right corner of the distance matrix, 10 distance values still need to be recorded, the log fragment detection process has to be performed directly on these 10 distance values, and the detection process has to be performed 4 times. In contrast, only 5 K-distances are recorded in Table 2, and the log fragment detection process needs to be performed only once. Therefore, performing the log fragment detection process based on the K-distance of each log fragment in the plurality of log fragments can effectively save computation overhead.
TABLE 1

                  Log fragment 1   Log fragment 2   Log fragment 3   Log fragment 4   Log fragment 5
Log fragment 1         -                1                2                9               1.5
Log fragment 2         1                -               1.5               9                2
Log fragment 3         2               1.5               -                8                2
Log fragment 4         9                9                8                -                8
Log fragment 5        1.5               2                2                8                -
TABLE 2

                  Log fragment 1   Log fragment 2   Log fragment 3   Log fragment 4   Log fragment 5
K-distance            1.5              1.5               2                8                2
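The K-distance computation of step B1 can be sketched as follows; the sorting-based selection mirrors the procedure described above but is only an illustrative sketch. The distances are those of Table 1, and with K = 2 the output matches the K-distances listed in Table 2.

```python
def k_distances(distance_matrix, k):
    # For each log fragment, sort its distances to the other fragments in
    # ascending order and take the k-th smallest one (1-based).
    result = []
    for i, row in enumerate(distance_matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)
        result.append(others[k - 1])
    return result

# Distances between log fragments 1 to 5 (points A to E), as in Table 1.
table1 = [
    [0,   1,   2,   9,   1.5],
    [1,   0,   1.5, 9,   2  ],
    [2,   1.5, 0,   8,   2  ],
    [9,   9,   8,   0,   8  ],
    [1.5, 2,   2,   8,   0  ],
]
print(k_distances(table1, k=2))  # [1.5, 1.5, 2, 8, 2]
```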
Step B2, the analysis device determines whether an abnormal log fragment exists in the plurality of log fragments based on the K-distance of each log fragment in the plurality of log fragments.
In this embodiment of the present application, there are various ways to determine whether an abnormal log fragment exists in a plurality of log fragments based on a K-distance of each log fragment in the plurality of log fragments, and this embodiment of the present application takes the following several optional ways as an example for description:
in a first alternative, the abnormal log fragments are determined based on the Pauta criterion (also known as the 3-sigma rule).
Based on statistical principles, if a sample obeys a normal distribution, it satisfies the 3-sigma rule shown in fig. 15: if the mean value of the sample is μ and the standard deviation of the sample is σ, the probability that a sample value falls within the value range [μ - 3 × σ, μ + 3 × σ] is 99.7%, which is approximately equal to 100%. Therefore, if a value falls outside this range, that is, is less than μ - 3 × σ or greater than μ + 3 × σ, its probability of occurrence is only 0.3%. This can be regarded as a small-probability event and is defined as an abnormal event.
Based on the 3-sigma rule, the embodiments of the present application provide the following two optional examples to determine whether each log fragment in the plurality of log fragments is an abnormal log fragment:
in a first optional example, the analysis device may determine, based on the K-distance of each log fragment in the plurality of log fragments, a target value range [μ - 3 × σ, μ + 3 × σ], where μ is the mean value of the K-distances of the plurality of log fragments and σ is the standard deviation of the K-distances of the plurality of log fragments; when the K-distance of any log fragment in the plurality of log fragments is not within the target value range [μ - 3 × σ, μ + 3 × σ], the analysis device determines that the log fragment is an abnormal log fragment, and when the K-distance of any log fragment in the plurality of log fragments is within the target value range [μ - 3 × σ, μ + 3 × σ], the analysis device determines that the log fragment is not an abnormal log fragment.
The principle of this first optional example is as follows: when most of the points corresponding to the plurality of log fragments are normal points, the abnormal points have little influence on the mean value and the standard deviation, so the target value range can be calculated once from all the points, and any point whose K-distance falls outside the range is determined to be an abnormal point. The first optional example is suitable for the case where the number of log fragments is large, that is, where there are many sample points, so that the target value range only needs to be calculated once and the calculation cost is small.
Still taking the example shown in fig. 14, assume that the abnormal log fragment is determined in the manner of the first optional example. Whether point A is an abnormal point, that is, whether log fragment 1 is an abnormal log fragment, is then determined by the 3-sigma rule as follows: calculate the mean value and the standard deviation of the K-distances of points A to E, and determine the target value range based on the calculated mean value and standard deviation; when the K-distance of point A is not within the target value range, point A is an abnormal point and log fragment 1 is an abnormal log fragment; when the K-distance of point A is within the target value range, point A is not an abnormal point and log fragment 1 is not an abnormal log fragment. The calculation for points B to E is the same and is not repeated in the embodiment of the present application.
In a second optional example, assuming that the first log fragment is any log fragment of the plurality of log fragments, the log fragment detection process provided in the embodiment of the present application includes: determining a target value range [μ - 3 × σ, μ + 3 × σ] corresponding to the first log fragment, where μ is the mean value of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments and σ is the standard deviation of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments; when the K-distance of the first log fragment is not within the target value range [μ - 3 × σ, μ + 3 × σ], determining that the first log fragment is an abnormal log fragment; and when the K-distance of the first log fragment is within the target value range [μ - 3 × σ, μ + 3 × σ], determining that the first log fragment is not an abnormal log fragment, that is, it is a normal log fragment. The detection process of the other log fragments in the plurality of log fragments may refer to the detection process of the first log fragment, and is not described again in this embodiment.
The principle of this second optional example is as follows: when the number of points corresponding to the plurality of log fragments is small, an abnormal point has a large influence on the mean value and the standard deviation. Therefore, for each of the points, it is assumed that the point may be an abnormal point and the remaining points are normal points; the mean value and the standard deviation of the remaining points are calculated, and whether the K-distance of the point exceeds the corresponding target value range is then checked, thereby determining whether the point is an abnormal point. This second optional example is suitable for the case where the number of log fragments is small, that is, where there are few sample points.
Still taking the example shown in fig. 14, assume that the abnormal log fragment is determined in the manner of the second optional example. Whether point A is an abnormal point, that is, whether log fragment 1 is an abnormal log fragment, is then determined by the 3-sigma rule as follows: calculate the mean value and the standard deviation of the points other than point A among points A to E (that is, points B to E).
The mean value of the remaining points is (1.5 + 2 + 8 + 2) / 4 = 3.375, and the standard deviation of the remaining points is σ = {[(3.375 - 1.5)² + (3.375 - 2)² + (3.375 - 8)² + (3.375 - 2)²] / (4 - 1)}^0.5 = [(3.52 + 1.89 + 21.39 + 1.89) / 3]^0.5 ≈ 3.09. The target value range corresponding to point A is therefore [3.375 - 3 × 3.09, 3.375 + 3 × 3.09] = [-5.90, 12.65]. Because the K-distance of point A (1.5) is within the target value range, point A can be judged to be a normal point, and log fragment 1 is not an abnormal log fragment.
Similarly, whether point D is an abnormal point, that is, whether log fragment 4 is an abnormal log fragment, is determined by the 3-sigma rule as follows: calculate the mean value and the standard deviation of the points other than point D among points A to E (that is, points A, B, C and E).
The mean value of the remaining points is (1.5 + 1.5 + 2 + 2) / 4 = 1.75, and the standard deviation of the remaining points is σ = {[(1.75 - 1.5)² + (1.75 - 1.5)² + (1.75 - 2)² + (1.75 - 2)²] / (4 - 1)}^0.5 = (0.25 / 3)^0.5 ≈ 0.29. The target value range corresponding to point D is therefore [1.75 - 3 × 0.29, 1.75 + 3 × 0.29] = [0.88, 2.62]. Because the K-distance of point D (8) is not within the target value range, it can be determined that point D is an abnormal point, and log fragment 4 is an abnormal log fragment.
It should be noted that the foregoing two optional examples may be selectively used according to an actual situation, for example, after the analysis device acquires the log fragments, when the number of the acquired log fragments is greater than a specified number threshold (the number of the log fragments is large, and the number of corresponding points is large), the abnormal log fragments are determined by using the method provided by the foregoing first example; when the number of the acquired log fragments is not greater than the specified number threshold (the number of the log fragments is small, and the number of corresponding points is small), the abnormal log fragments are determined in the manner provided by the second example. In this way, although the target value range of each log fragment needs to be determined, the total amount of computation is within an acceptable range because the number of log fragments is small.
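The two optional 3-sigma examples described above can be sketched as follows. The sketch uses the sample standard deviation (dividing by n - 1); whether the sample or population standard deviation is used is an implementation choice assumed here. With the K-distances of Table 2, the leave-one-out variant flags log fragment 4, consistent with the worked example above, while the global variant detects nothing on such a small sample, which illustrates why the second example suits few sample points.

```python
import statistics

def detect_3sigma_global(k_dists):
    # First optional example: one target range computed from all K-distances.
    mu = statistics.mean(k_dists)
    sigma = statistics.stdev(k_dists)          # sample standard deviation
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return [i for i, d in enumerate(k_dists) if not (low <= d <= high)]

def detect_3sigma_leave_one_out(k_dists):
    # Second optional example: for each fragment, the range is computed
    # from the K-distances of the remaining fragments only.
    anomalies = []
    for i, d in enumerate(k_dists):
        rest = [x for j, x in enumerate(k_dists) if j != i]
        mu = statistics.mean(rest)
        sigma = statistics.stdev(rest)
        if not (mu - 3 * sigma <= d <= mu + 3 * sigma):
            anomalies.append(i)
    return anomalies

# K-distances of log fragments 1 to 5 (Table 2).
k_dists = [1.5, 1.5, 2, 8, 2]
print(detect_3sigma_global(k_dists))         # [] -> with few points the global range misses it
print(detect_3sigma_leave_one_out(k_dists))  # [3] -> log fragment 4 is abnormal
```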
In a second alternative, the abnormal log slice is determined based on the principle of entropy change.
As known from thermodynamic principles, entropy is used to describe the degree of disorder of molecular states. In the data processing process, the concept of entropy is borrowed to describe the uncertainty of data. For example, for a set of samples T = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], the samples in T are evenly distributed, so the certainty is high and the entropy value is large. In contrast, for another set of samples U = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], the samples are not uniformly distributed, so the certainty is lower and the entropy value is lower than that of T. For yet another set of samples V = [1, 2, 3, 4, 5, 6, 7, 8, 900, 1000], the sample distribution is even less uniform than that of U, and its entropy value is the lowest of the three. The entropy value is therefore positively related to the distribution uniformity of the samples: the higher the entropy value, the more uniform the distribution of the samples, and the lower the entropy value, the less uniform the distribution of the samples. The embodiment of the present application uses the following entropy calculation formula for a sample:
H = -∑ P(i) × ln(P(i)), where P(i) = i / ∑i
where H represents the entropy value of the sample, i represents a data value in the sample, ∑i is the sum of all data values in the sample, P(i) is the proportion of i in that sum, and the summation in the formula runs over all data values in the sample. The principle of determining the entropy value of a sample based on this entropy calculation formula is referred to as the principle of entropy change.
Based on the principle of entropy change, the log fragment detection process provided by the embodiment of the present application includes: determining an entropy value corresponding to each log fragment in the plurality of log fragments, where the entropy value corresponding to any log fragment is the entropy value, calculated with the above entropy calculation formula, of the K-distances of the remaining log fragments after that log fragment is removed from the plurality of log fragments; that is, i represents the K-distance of a remaining log fragment. When the difference between the maximum entropy value and the minimum entropy value among the obtained entropy values is greater than a specified difference threshold, the distribution uniformity of the remaining log fragments varies considerably depending on which log fragment is removed, and for the log fragment corresponding to the maximum entropy value, the distribution of the remaining log fragments is the most uniform after that log fragment is removed. Accordingly, the log fragment corresponding to the maximum entropy value may be determined to be an abnormal log fragment. When the difference between the maximum entropy value and the minimum entropy value among the obtained entropy values is not greater than the specified difference threshold, the distribution uniformity of the remaining log fragments does not change much regardless of which log fragment is removed, that is, the contents of the plurality of log fragments are relatively close, and there is usually no abnormal log fragment.
Still taking the example shown in fig. 14, assume that the set formed by the plurality of log fragments corresponding to points A to E is called the fragment set. The entropy value H1 obtained after removing log fragment 1 from the fragment set is 1.1199994100487753, as shown in Table 3; the entropy value H2 obtained after removing log fragment 4 from the fragment set is 1.3760552852604169, as shown in Table 4, and H2 is the maximum entropy value.
TABLE 3 (log fragment 1 removed)

Point      P(i)
B          1.5/13.5 = 0.111
C          2/13.5 = 0.148
D          8/13.5 = 0.593
E          2/13.5 = 0.148
Entropy    1.1199994100487753
TABLE 4 (log fragment 4 removed)

Point      P(i)
A          1.5/7 = 0.214
B          1.5/7 = 0.214
C          2/7 = 0.286
E          2/7 = 0.286
Entropy    1.3760552852604169
As shown in Table 3, after point A is removed, that is, after log fragment 1 is removed, the entropy value H1 is relatively small, which means that the K-distances of the fragment set remain unevenly distributed after log fragment 1 is removed; as shown in Table 4, the entropy value H2 is larger after point D is removed, which indicates that the K-distances of the fragment set become more uniform after log fragment 4 is removed. Therefore, point D may be an abnormal point. Assuming that the specified difference threshold is 0.2, since the difference between the entropy value H2 and the entropy value H1 is greater than 0.2, it is determined that point D is an abnormal point and log fragment 4 is an abnormal log fragment.
It should be noted that the second optional manner assumes by default that, if an abnormal log fragment exists, there is exactly one abnormal log fragment, namely the log fragment corresponding to the maximum entropy value. In actual implementation, there may be multiple abnormal log fragments. In that case, after one abnormal log fragment is determined in the second optional manner, it may be removed and the remaining log fragments taken as updated log fragments; the second optional manner is then applied again to determine whether an abnormal log fragment exists among the updated log fragments, and so on, until no abnormal log fragment exists among the updated log fragments.
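The entropy-change detection described above can be sketched as follows. It uses the same leave-one-out entropy as Tables 3 and 4 (natural logarithm, with P(i) equal to each remaining K-distance divided by their sum) and, with the difference threshold of 0.2 taken from the example above, flags log fragment 4; this is only an illustrative sketch.

```python
import math

def entropy(values):
    # H = -sum(P(i) * ln(P(i))), with P(i) = i / sum of the values.
    total = sum(values)
    return -sum((v / total) * math.log(v / total) for v in values)

def detect_by_entropy_change(k_dists, diff_threshold=0.2):
    # Entropy of the remaining K-distances after removing each fragment in turn.
    entropies = [entropy([x for j, x in enumerate(k_dists) if j != i])
                 for i in range(len(k_dists))]
    if max(entropies) - min(entropies) > diff_threshold:
        return entropies.index(max(entropies))   # removal makes the rest most uniform
    return None                                  # no abnormal fragment

k_dists = [1.5, 1.5, 2, 8, 2]                    # K-distances of log fragments 1 to 5
print(detect_by_entropy_change(k_dists))         # 3 -> log fragment 4 is abnormal
```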
In the second anomaly detection manner, whether an abnormal log fragment exists in the plurality of log fragments is determined based on a hierarchical clustering algorithm. Hierarchical clustering refers to clustering elements belonging to the same class together based on the distances between the clustered elements. In the embodiment of the present application, the process of the second anomaly detection manner includes:
the analysis device divides the plurality of log fragments into a plurality of log fragment sets based on the distance between every two log fragments in the plurality of log fragments, where each log fragment set includes at least one log fragment; when the number of log fragments in any log fragment set is smaller than a specified number threshold, the log fragments in that log fragment set are determined to be abnormal log fragments; when the number of log fragments in any log fragment set is not smaller than the specified number threshold, the log fragments in that log fragment set are determined not to be abnormal log fragments. For example, assume that the specified number threshold is 2 and that the number of log fragments in log fragment set G1 is smaller than 2, that is, log fragment set G1 contains only 1 log fragment. That log fragment does not belong to the same class as any other log fragment in the plurality of log fragments, that is, it is different from all the other log fragments, and it is therefore an abnormal log fragment.
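A simplified stand-in for the hierarchical-clustering detection described above is sketched below: log fragments connected by pairwise distances not exceeding a merge threshold are grouped together (single-linkage style), and any group smaller than the specified number threshold is reported as abnormal. The merge threshold of 2 and the number threshold of 2 are assumptions chosen so that the Table 1 example separates log fragment 4; an actual implementation may instead use a full hierarchical clustering library.

```python
def cluster_by_distance(distance_matrix, merge_threshold):
    # Single-linkage style grouping: fragments i and j end up in the same set
    # if they are connected by a chain of distances <= merge_threshold.
    n = len(distance_matrix)
    labels = list(range(n))                      # each fragment starts in its own set

    def find(x):
        while labels[x] != x:
            labels[x] = labels[labels[x]]
            x = labels[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if distance_matrix[i][j] <= merge_threshold:
                labels[find(i)] = find(j)        # union the two sets
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def detect_small_clusters(distance_matrix, merge_threshold=2, min_size=2):
    anomalies = []
    for group in cluster_by_distance(distance_matrix, merge_threshold):
        if len(group) < min_size:                # specified number threshold
            anomalies.extend(group)
    return anomalies

# Distance matrix of Table 1 (log fragments 1 to 5).
table1 = [
    [0,   1,   2,   9,   1.5],
    [1,   0,   1.5, 9,   2  ],
    [2,   1.5, 0,   8,   2  ],
    [9,   9,   8,   0,   8  ],
    [1.5, 2,   2,   8,   0  ],
]
print(detect_small_clusters(table1))             # [3] -> log fragment 4 is abnormal
```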
It is worth mentioning that the embodiment of the present application may also determine an abnormal log fragment based on the distance between every two log fragments in the plurality of log fragments in other manners. For example, the distances between every two log fragments in the plurality of log fragments may be presented to a user in the form of a three-dimensional diagram, a table or a histogram together with an identifier of each log fragment, so that the user selects the log fragment that the user considers abnormal; a selection instruction triggered by the user is received, where the selection instruction carries the identifier of the target log fragment selected by the user, and the target log fragment is determined to be the abnormal log fragment. The embodiment of the present application does not limit the manner of determining the abnormal log fragment based on the distance between every two log fragments in the plurality of log fragments.
Step 305, the analysis device determines an abnormal log record in the abnormal log slice.
After determining the abnormal log fragment, the analysis device may determine the abnormal log record in a variety of ways. In one optional manner, the analysis device presents the abnormal log fragment, and the user selects the abnormal log record. In another optional manner, the analysis device determines the log templates of the abnormal log fragment and presents them, the user selects the abnormal log template, and after obtaining the abnormal log template the analysis device presents the log records corresponding to it. The analysis device may also determine the abnormal log record in other manners, which is not limited in the embodiment of the present application.
For the convenience of the reader, the embodiment of the present application briefly introduces the log template. The log records in a log usually follow an implicit log template, and a log template refers to the standard style or fixed format used to generate the log records in the log. For example, after the code corresponding to the log records is run, multiple lines of log records recording user login information are output to the log. In the embodiment of the present application, the log in which the following multiple rows of log records are located is referred to as the first log:
“User 025862 login at 2018-12-03 02:03:00
User 045210 login at 2018-12-04 02:03:15
User 033658 login at 2018-12-05 02:03:38
User 010100 login at 2018-12-06 02:04:06
User 023025 login at 2018-12-07 02:04:51
User 046523 login at 2018-12-08 02:05:22”.
The log template of a log is the log template of the log records in the log. Generally, when log anomaly detection is performed on log records, once the variable part of a log record is identified, the variable part is marked with a preset variable identifier; the marking essentially replaces the variable part with the variable identifier, which is typically a wildcard character. For example, when detecting log anomalies in the multiple log records of the first log, the variable parts may be replaced with the wildcard character "*", so that the log template obtained for each log record is "User * login at *", and the log template of the first log is therefore "User * login at *".
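A minimal sketch of obtaining a log template by replacing variable parts with a wildcard is shown below. The regular expressions for the user ID and the timestamp are assumptions chosen to match the illustrative first log above; they are not a general-purpose template extraction method.

```python
import re

def to_template(log_record):
    # Replace the variable parts (here: the timestamp and the numeric user ID)
    # with the wildcard "*" to obtain the log template.
    template = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "*", log_record)
    template = re.sub(r"\b\d+\b", "*", template)
    return template

record = "User 025862 login at 2018-12-03 02:03:00"
print(to_template(record))   # User * login at *
```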
The order of steps of the log anomaly detection method provided in the embodiment of the present application may be appropriately adjusted, and the steps may also be increased or decreased according to the situation, for example, in other application scenarios, such as keyword search, the foregoing step 305 may not be executed.
For example, in the foregoing step 304, after the distance between every two log fragments in the plurality of log fragments is obtained, the similarity between every two log fragments may also be determined based on the obtained distance. In the embodiment of the present application, the distance between every two log fragments is negatively correlated with the similarity of the two log fragments, that is, the smaller the distance, the higher the similarity, and the larger the distance, the lower the similarity. For example, the similarity s between every two log fragments may satisfy s = 1 / (1 + d), where d is the distance between the two log fragments; since the distance d is a non-negative real number, the similarity then takes values in (0, 1]. The analysis device may determine whether an abnormal log fragment exists in the plurality of log fragments based on the similarity between every two log fragments in the plurality of log fragments. In one example, when the similarity between any log fragment and the other log fragments is smaller than a similarity threshold, that log fragment is determined to be an abnormal log fragment; when the similarity between any log fragment and the other log fragments is not smaller than the similarity threshold, that log fragment is determined not to be an abnormal log fragment. In this case, the foregoing steps B1 and B2 may not be performed. In another example, the analysis device may determine a K-distance of each log fragment in the plurality of log fragments based on the similarity between every two log fragments, where the K-distance of any log fragment is the similarity between that log fragment and the log fragment that is the Kth farthest from it among the plurality of log fragments, K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments; this K-distance therefore differs from the definition of the K-distance in the foregoing step B1. In this step, the smaller the K value is, the lower the sensitivity is and the lower the accuracy of the finally determined abnormal log fragment is, so a larger K value may be set here, for example, K = G - 1. Whether an abnormal log fragment exists in the plurality of log fragments is then determined based on the K-distance of each log fragment, for which reference may be made to the foregoing step B2.
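The conversion s = 1/(1 + d) and one reading of the similarity-threshold check described above can be sketched as follows; the similarity threshold of 0.2 and the rule that a fragment is abnormal when its similarity to every other fragment is below the threshold are assumptions made for this sketch.

```python
def similarity(distance):
    # s = 1 / (1 + d): distance 0 gives similarity 1; larger distances give
    # similarities approaching 0.
    return 1.0 / (1.0 + distance)

def detect_by_similarity(distance_matrix, sim_threshold=0.2):
    # A fragment is treated as abnormal if its similarity to every other
    # fragment is below the similarity threshold (assumption).
    n = len(distance_matrix)
    anomalies = []
    for i in range(n):
        sims = [similarity(distance_matrix[i][j]) for j in range(n) if j != i]
        if all(s < sim_threshold for s in sims):
            anomalies.append(i)
    return anomalies

# Distance matrix of Table 1 (log fragments 1 to 5).
table1 = [
    [0,   1,   2,   9,   1.5],
    [1,   0,   1.5, 9,   2  ],
    [2,   1.5, 0,   8,   2  ],
    [9,   9,   8,   0,   8  ],
    [1.5, 2,   2,   8,   0  ],
]
print(detect_by_similarity(table1))   # [3] -> log fragment 4 is abnormal
```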
For another example, the analysis device may also determine the similarity between every two log fragments by using a cosine angle algorithm (also called a cosine similarity algorithm). The cosine angle algorithm measures the difference between two vectors in a vector space by the cosine of the angle between them: the closer the cosine value is to 1, the closer the angle is to 0 and the more similar the two vectors are; the closer the cosine value is to 0, the closer the angle is to 90 degrees and the more dissimilar the two vectors are. Therefore, after the locality sensitive hash code of each log fragment in the plurality of log fragments is obtained, the cosine value between every two locality sensitive hash codes may be used as the similarity between the corresponding two log fragments.
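The cosine similarity between two locality sensitive hash codes can be sketched as follows, treating each code as a vector of 0/1 components; whether 0/1 or ±1 components (or some other vectorization) are used is an implementation choice not fixed by the description above.

```python
import math

def cosine_similarity(code_a, code_b):
    # Treat each hash code as a vector of 0/1 components.
    a = [int(c) for c in code_a]
    b = [int(c) for c in code_b]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0          # convention for an all-zero code (assumption)
    return dot / (norm_a * norm_b)

print(cosine_similarity("01010101", "01010111"))  # about 0.89: similar fragments
print(cosine_similarity("01010101", "11110010"))  # about 0.45: dissimilar fragments
```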
For another example, based on the same concept as that of the foregoing embodiment of the log anomaly detection method, in this embodiment of the application, after the multiple log fragments of the log are obtained in step 302, the analysis device may obtain the similarity between every two log fragments in the multiple log fragments in other manners, and determine whether an anomalous log fragment exists in the multiple log fragments based on the similarity between every two log fragments in the multiple log fragments.
In the embodiments of the present application, the similarity and the distance can be converted into each other. Any variation of the method that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application, and is therefore not described in detail.
In the related art, if it is required to determine whether an abnormality exists in a log, an analysis device compares the log (or a log in a time period) with a specified reference log to obtain a change condition of log templates of the log and present the change condition, and a software developer identifies the abnormality in the log based on the presented content, which is called a log comparison (log compare) function. On one hand, the log comparison function needs to be triggered manually; on the other hand, the reference log needs to be manually specified; in yet another aspect, the log comparison function does not substantially identify log anomalies, but merely provides reference information for software developers to find log anomalies.
The log anomaly detection method provided by the embodiment of the present application can support a log anomaly detection function. On one hand, the log anomaly detection function can be triggered manually or automatically, for example at a specified time point or in a specified time period, and can be executed periodically and automatically; on the other hand, the log anomaly detection function does not need a specified reference log; in yet another aspect, the log anomaly detection function can identify abnormal log fragments, based on which abnormal log records can be accurately located. In summary, with the log anomaly detection method provided by the embodiment of the present application, the flexibility of anomaly detection is higher, the implementation process is simple, and the abnormal log fragments can be located, so that the efficiency of log anomaly detection can be effectively improved. In addition, according to the log anomaly detection method provided by the embodiment of the present application, the abnormal log fragments are located by comparing the similarity of the contents of the plurality of log fragments of the log, so that unknown abnormal log fragments can also be detected.
Furthermore, when the log anomaly detection method provided by the embodiment of the application is applied to an online analysis scene, the anomalous log fragments can be quickly positioned, and the time complexity and the space complexity of log positioning are effectively reduced.
An embodiment of the present application provides a log anomaly detection apparatus 40, as shown in fig. 16, the apparatus includes:
an obtaining module 401, configured to obtain multiple log fragments of a log, where each log fragment of the multiple log fragments includes multiple rows of log records in the log; at least one row of log records of different log fragments in the plurality of log fragments are different;
a first determining module 402 for determining a distance between every two log shards of the plurality of log shards;
a second determining module 403, configured to determine whether an abnormal log fragment exists in the plurality of log fragments based on a distance between every two log fragments in the plurality of log fragments.
According to the embodiment of the application, the distance between every two log fragments in the plurality of log fragments of the log is obtained, and whether abnormal log fragments exist in the plurality of log fragments is determined based on the obtained distance between every two log fragments, so that the abnormal log fragments are positioned, whether the log is abnormal or not is not required to be manually identified, and the abnormal detection efficiency of the log is effectively improved.
Optionally, the first determining module 402 is configured to: determining a distance between every two of the plurality of log shards based on the locality-sensitive hash code of each of the plurality of log shards.
Optionally, as shown in fig. 17, the apparatus 40 further includes: a third determining module 404, configured to determine a locality sensitive hash code of each of the plurality of log fragments based on the plurality of entries of each of the plurality of log fragments.
Optionally, the third determining module 404 is configured to: performing duplicate removal processing on the multiple entries of each log fragment to obtain an entry set; and determining the locality sensitive hash code of each log fragment based on the entry set corresponding to each log fragment.
Optionally, the third determining module 404 is configured to: calculating the sum of hash codes of all entries in the entry set corresponding to each log fragment;
and performing dimension reduction processing on the sum of the hash codes corresponding to each log fragment to obtain the locality sensitive hash code of each log fragment.
Optionally, the second determining module 403 is configured to: determine a K-distance of each log fragment in the plurality of log fragments based on the distance between every two log fragments in the plurality of log fragments, where the K-distance of any log fragment in the plurality of log fragments is the distance between that log fragment and the log fragment that is the Kth closest to it among the plurality of log fragments, K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments; and determine whether an abnormal log fragment exists in the plurality of log fragments based on the K-distance of each log fragment in the plurality of log fragments.
Optionally, the second determining module 403 is configured to: determine a target value range [μ - 3 × σ, μ + 3 × σ] based on the K-distance of each log fragment in the plurality of log fragments, where μ is the mean value of the K-distances of the plurality of log fragments and σ is the standard deviation of the K-distances of the plurality of log fragments, and when the K-distance of any log fragment in the plurality of log fragments is not within the target value range [μ - 3 × σ, μ + 3 × σ], determine that the log fragment is an abnormal log fragment; or determine a target value range [μ - 3 × σ, μ + 3 × σ] corresponding to a first log fragment, where the first log fragment is any log fragment in the plurality of log fragments, μ is the mean value of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments, and σ is the standard deviation of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments, and when the K-distance of the first log fragment is not within the target value range [μ - 3 × σ, μ + 3 × σ], determine that the first log fragment is an abnormal log fragment; or determine an entropy value corresponding to each log fragment in the plurality of log fragments, where the entropy value corresponding to any log fragment is the entropy value of the K-distances of the remaining log fragments after that log fragment is removed from the plurality of log fragments, and when the difference between the maximum entropy value and the minimum entropy value among the obtained entropy values is greater than a specified difference threshold, determine the log fragment corresponding to the maximum entropy value to be an abnormal log fragment.
Optionally, different log fragments include log records with the same number of rows; or, different log fragments include log records of the same data volume.
Alternatively, fig. 18 schematically provides one possible basic hardware architecture for a computing device as described herein. The computing device may be a server.
Referring to fig. 18, computing device 500 includes a processor 501, memory 502, a communication interface 503, and a bus 504.
In the computing device 500, the number of the processors 501 may be one or more, and fig. 18 illustrates only one of the processors 501. Alternatively, the processor 501 may be a Central Processing Unit (CPU). If the computing device 500 has multiple processors 501, the types of the multiple processors 501 may be different, or may be the same. Optionally, multiple processors 501 of computing device 500 may also be integrated into a multi-core processor.
Memory 502 stores computer instructions and data; the memory 502 may store computer instructions and data required to implement the log anomaly detection methods provided herein, e.g., the memory 502 stores instructions for implementing the steps of the log anomaly detection methods. The memory 502 may be any one or any combination of the following storage media: nonvolatile memory (e.g., Read Only Memory (ROM), Solid State Disk (SSD), hard disk (HDD), optical disk), volatile memory.
The communication interface 503 may be any one or any combination of the following devices: a network interface (e.g., an ethernet interface), a wireless network card, etc. having a network access function.
Communication interface 503 is used for data communication by computing device 500 with other computing devices or terminals.
The bus 504 may connect the processor 501 with the memory 502 and the communication interface 503. Thus, via bus 504, processor 501 may access memory 502 and may also interact with other computing devices or terminals via communication interface 503.
In the present application, the computing device 500 executes computer instructions in the memory 502, causing the computing device 500 to implement the log anomaly detection method provided herein, or causing the computing device 500 to deploy a log anomaly detection apparatus.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, e.g., a memory comprising instructions, executable by a processor of a server to perform the log anomaly detection method shown in the various embodiments of the present application is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
An embodiment of the present application provides an analysis system, including: the terminal and the analytical equipment, this analytical equipment includes any one of the aforesaid log anomaly detection device.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product, which comprises one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium, or a semiconductor medium (e.g., solid state disk).
It should be noted that: in the log anomaly detection apparatus provided in the above embodiment, only the division of the functional modules is used for illustration when performing log anomaly detection, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the log anomaly detection device and the log anomaly detection method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
In this application, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise. A referring to B means that A is the same as B or that A is a simple variant of B.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (18)

1. A log anomaly detection method, the method comprising:
obtaining a plurality of log fragments of a log, wherein each log fragment of the plurality of log fragments comprises a plurality of rows of log records in the log; at least one row of log records of different log fragments in the plurality of log fragments are different;
determining a distance between every two log shards of the plurality of log shards;
determining whether an abnormal log fragment exists in the plurality of log fragments based on a distance between every two log fragments in the plurality of log fragments.
2. The method of claim 1, wherein the determining the distance between every two log shards of the plurality of log shards comprises:
determining a distance between every two of the plurality of log shards based on the locality-sensitive hash code of each of the plurality of log shards.
3. The method of claim 2, further comprising:
determining a locality sensitive hash code for each of the plurality of log shards based on the plurality of terms for each of the plurality of log shards.
4. The method of claim 3, wherein determining the locality sensitive hash code for each of the plurality of log shards based on the plurality of terms for each of the plurality of log shards comprises:
performing duplicate removal processing on the multiple entries of each log fragment to obtain an entry set;
and determining the locality sensitive hash code of each log fragment based on the entry set corresponding to each log fragment.
5. The method of claim 4, wherein the determining the locality sensitive hash code for each log slice based on the entry set corresponding to the log slice comprises:
calculating the sum of hash codes of all entries in the entry set corresponding to each log fragment;
and performing dimension reduction processing on the sum of the hash codes corresponding to each log fragment to obtain the locality sensitive hash code of each log fragment.
6. The method of any of claims 1 to 5, wherein the determining whether an anomalous log shard exists in the plurality of log shards based on a distance between every two log shards in the plurality of log shards comprises:
determining a K-distance of each log fragment of the plurality of log fragments based on a distance between every two log fragments of the plurality of log fragments, wherein the K-distance of any log fragment of the plurality of log fragments is a distance between the log fragment that is the Kth closest to the any log fragment among the plurality of log fragments and the any log fragment, K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments;
determining whether an abnormal log fragment exists in the plurality of log fragments based on the K-distance of each log fragment in the plurality of log fragments.
7. The method of claim 6, wherein determining whether an anomalous log shard exists in the plurality of log shards based on the K-distance of each log shard in the plurality of log shards comprises:
determining a target value range [μ - 3 × σ, μ + 3 × σ] based on the K-distance of each log fragment in the plurality of log fragments, wherein μ is the mean value of the K-distances of the plurality of log fragments and σ is the standard deviation of the K-distances of the plurality of log fragments, and when the K-distance of any log fragment in the plurality of log fragments is not within the target value range [μ - 3 × σ, μ + 3 × σ], determining that the any log fragment is an abnormal log fragment;
or determining a target value range [μ - 3 × σ, μ + 3 × σ] corresponding to a first log fragment, wherein the first log fragment is any log fragment in the plurality of log fragments, μ is the mean value of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments, and σ is the standard deviation of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments, and when the K-distance of the first log fragment is not within the target value range [μ - 3 × σ, μ + 3 × σ], determining that the first log fragment is an abnormal log fragment;
or determining an entropy value corresponding to each log fragment in the plurality of log fragments, wherein the entropy value corresponding to any log fragment is the entropy value of the K-distances of the remaining log fragments after the any log fragment is removed from the plurality of log fragments, and when the difference between the maximum entropy value and the minimum entropy value among the obtained entropy values is greater than a specified difference threshold, determining the log fragment corresponding to the maximum entropy value to be an abnormal log fragment.
8. The method of any of claims 1 to 7, wherein different log fragments of the plurality of log fragments comprise log records with the same number of rows; or, different log fragments comprise log records of the same data volume.
9. An apparatus for log anomaly detection, the apparatus comprising:
the device comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a plurality of log fragments of a log, and each log fragment of the plurality of log fragments comprises a plurality of rows of log records in the log; at least one row of log records of different log fragments in the plurality of log fragments are different;
a first determining module, configured to determine a distance between every two log shards in the plurality of log shards;
a second determining module, configured to determine whether an abnormal log fragment exists in the plurality of log fragments based on a distance between every two log fragments in the plurality of log fragments.
10. The apparatus of claim 9, wherein the first determining module is configured to:
determining a distance between every two of the plurality of log shards based on the locality-sensitive hash code of each of the plurality of log shards.
11. The apparatus of claim 10, further comprising:
a third determining module, configured to determine a locality sensitive hash code of each of the plurality of log fragments based on the plurality of entries of each of the plurality of log fragments.
12. The apparatus of claim 11, wherein the third determining module is configured to:
performing deduplication processing on the plurality of entries of each log fragment to obtain an entry set;
and determining the locality sensitive hash code of each log fragment based on the entry set corresponding to each log fragment.
13. The apparatus of claim 12, wherein the third determining module is configured to:
calculating the sum of hash codes of all entries in the entry set corresponding to each log fragment;
and performing dimension reduction processing on the sum of the hash codes corresponding to each log fragment to obtain the locality sensitive hash code of each log fragment.
14. The apparatus of any of claims 9 to 13, wherein the second determining module is configured to:
determining a K-distance of each log fragment in the plurality of log fragments based on the distance between every two log fragments in the plurality of log fragments, wherein the K-distance of any log fragment in the plurality of log fragments is the distance between that log fragment and the log fragment that is the K-th closest to it among the plurality of log fragments, K is a positive integer, K is smaller than G, and G is the total number of the plurality of log fragments;
determining whether an abnormal log fragment exists in the plurality of log fragments based on the K-distance of each log fragment in the plurality of log fragments.
15. The apparatus of claim 14, wherein the second determining module is configured to:
determining a target value range [μ − 3σ, μ + 3σ] based on the K-distance of each log fragment in the plurality of log fragments, wherein μ is the mean of the K-distances of the plurality of log fragments and σ is the standard deviation of the K-distances of the plurality of log fragments, and when the K-distance of any log fragment in the plurality of log fragments is not within the target value range [μ − 3σ, μ + 3σ], determining that log fragment to be an abnormal log fragment;
or determining a target value range [μ − 3σ, μ + 3σ] corresponding to a first log fragment, wherein the first log fragment is any log fragment in the plurality of log fragments, μ is the mean of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments, and σ is the standard deviation of the K-distances of the log fragments other than the first log fragment in the plurality of log fragments, and when the K-distance of the first log fragment is not within the target value range [μ − 3σ, μ + 3σ], determining that the first log fragment is an abnormal log fragment;
or determining an entropy value corresponding to each log fragment in the plurality of log fragments, wherein the entropy value corresponding to any log fragment is the entropy of the K-distances of the remaining log fragments after that log fragment is removed, and when the difference between the maximum entropy value and the minimum entropy value among the obtained entropy values is greater than a specified difference threshold, determining the log fragment corresponding to the maximum entropy value to be an abnormal log fragment.
16. The apparatus according to any one of claims 9 to 15, wherein different log fragments in the plurality of log fragments comprise the same number of rows of log records; or, different log fragments in the plurality of log fragments comprise log records of the same data volume.
17. A computer device comprising a processor and a memory;
the computer device performs the log anomaly detection method of any one of claims 1 to 8 when the processor executes the computer instructions stored in the memory.
18. A computer-readable storage medium comprising computer instructions that, when run on a computer device, cause the computer device to perform the log anomaly detection method of any one of claims 1 to 8.
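
The following sketches are editorial illustrations only and are not part of the claims. First, a minimal Python sketch of one plausible reading of claims 11 to 13: the entries of a log fragment are deduplicated, the hash codes of all entries in the entry set are summed, and the sum is reduced to a single SimHash-style locality sensitive hash code. The function names, the 64-bit width, the MD5 hash, and the whitespace tokenization are assumptions, not taken from the patent.

import hashlib

def tokenize(log_fragment_lines):
    # Split the rows of a log fragment into entries; whitespace tokens are an assumption.
    entries = []
    for line in log_fragment_lines:
        entries.extend(line.split())
    return entries

def simhash_fingerprint(log_fragment_lines, bits=64):
    # Deduplicate the entries of the fragment (claim 12), sum the per-bit hash
    # contributions of all entries in the entry set (claim 13), then reduce the
    # summed vector to a single bit string by keeping only the sign of each component.
    entry_set = set(tokenize(log_fragment_lines))
    totals = [0] * bits
    for entry in entry_set:
        h = int.from_bytes(hashlib.md5(entry.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            totals[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if totals[i] > 0)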
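
Next, a sketch of how the distance between every two log fragments, the K-distance of claims 6 and 14, and the [μ − 3σ, μ + 3σ] check of claims 7 and 15 might be computed from such fingerprints. The Hamming distance, the default K = 3, and the use of the population standard deviation are assumptions.

import statistics

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def k_distance(fingerprints, k=3):
    # For each fragment, the distance to its k-th closest other fragment (k < number of fragments).
    kd = []
    for i, fp in enumerate(fingerprints):
        dists = sorted(hamming_distance(fp, other)
                       for j, other in enumerate(fingerprints) if j != i)
        kd.append(dists[k - 1])
    return kd

def three_sigma_outliers(kd):
    # Flag fragments whose K-distance is not within [mu - 3*sigma, mu + 3*sigma].
    mu = statistics.mean(kd)
    sigma = statistics.pstdev(kd)
    return [i for i, d in enumerate(kd) if not (mu - 3 * sigma <= d <= mu + 3 * sigma)]

For example, three_sigma_outliers(k_distance([simhash_fingerprint(f) for f in fragments])) would return the indices of the fragments whose K-distance falls outside the target value range.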
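
Finally, a sketch of the entropy-based branch of claims 7 and 15: the entropy value of each fragment is the entropy of the K-distances of the remaining fragments after that fragment is removed, and the fragment with the maximum entropy is flagged when the spread of entropies exceeds a threshold. The threshold value and the exact-value binning used for the entropy are assumptions.

import math
from collections import Counter

def shannon_entropy(values):
    # Entropy of the empirical distribution of the values (exact-value binning).
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_outlier(kd, diff_threshold=0.5):
    # Leave each fragment out in turn, compute the entropy of the remaining K-distances,
    # and flag the fragment with the maximum entropy if the entropy spread exceeds the threshold.
    entropies = [shannon_entropy(kd[:i] + kd[i + 1:]) for i in range(len(kd))]
    if max(entropies) - min(entropies) > diff_threshold:
        return entropies.index(max(entropies))
    return None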
CN202010066339.7A 2019-12-02 2020-01-20 Log abnormity detection method and device Pending CN111240942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/121544 WO2021109724A1 (en) 2019-12-02 2020-10-16 Log anomaly detection method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911214265 2019-12-02
CN2019112142650 2019-12-02

Publications (1)

Publication Number Publication Date
CN111240942A true CN111240942A (en) 2020-06-05

Family

ID=70878054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010066339.7A Pending CN111240942A (en) 2019-12-02 2020-01-20 Log abnormity detection method and device

Country Status (2)

Country Link
CN (1) CN111240942A (en)
WO (1) WO2021109724A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514398B (en) * 2013-10-18 2016-08-17 中国科学院信息工程研究所 A kind of real-time online log detection method and system
US11431475B2 (en) * 2018-06-15 2022-08-30 Dynatrace Llc Method and system for log data analytics based on SuperMinHash signatures
CN110210512B (en) * 2019-04-19 2024-03-26 北京亿阳信通科技有限公司 Automatic log anomaly detection method and system
CN111240942A (en) * 2019-12-02 2020-06-05 华为技术有限公司 Log abnormity detection method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7937334B2 (en) * 2006-05-31 2011-05-03 Lockheed Martin Corporation System and method for defining normal operating regions and identifying anomalous behavior of units within a fleet, operating in a complex, dynamic environment
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer
CN104951555A (en) * 2015-06-30 2015-09-30 浪潮(北京)电子信息产业有限公司 Log information management method and log information management terminal
CN105183912A (en) * 2015-10-12 2015-12-23 北京百度网讯科技有限公司 Abnormal log determination method and device
WO2019060043A1 (en) * 2017-09-22 2019-03-28 Nec Laboratories America, Inc. Log-based system maintenance and management
CN107707545A (en) * 2017-09-29 2018-02-16 深信服科技股份有限公司 A kind of abnormal web page access fragment detection method, device, equipment and storage medium
CN108776654A (en) * 2018-05-30 2018-11-09 昆明理工大学 One kind being based on improved simhash transcription comparison methods
CN110175158A (en) * 2019-05-23 2019-08-27 湖南大学 A kind of log template extraction method and system based on vectorization

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021109724A1 (en) * 2019-12-02 2021-06-10 华为技术有限公司 Log anomaly detection method and apparatus
CN111538642A (en) * 2020-07-02 2020-08-14 杭州海康威视数字技术股份有限公司 Abnormal behavior detection method and device, electronic equipment and storage medium
CN114844778A (en) * 2022-04-25 2022-08-02 中国联合网络通信集团有限公司 Core network anomaly detection method and device, electronic equipment and readable storage medium
CN114844778B (en) * 2022-04-25 2023-05-30 中国联合网络通信集团有限公司 Abnormality detection method and device for core network, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
WO2021109724A1 (en) 2021-06-10

Similar Documents

Publication Publication Date Title
WO2021109724A1 (en) Log anomaly detection method and apparatus
CN110826648B (en) Method for realizing fault detection by utilizing time sequence clustering algorithm
WO2021068547A1 (en) Log schema extraction method and apparatus
CN111563521A (en) Site-specific anomaly detection
US11580222B2 (en) Automated malware analysis that automatically clusters sandbox reports of similar malware samples
US20200104498A1 (en) Independent malware detection architecture
CN108536868B (en) Data processing method and device for short text data on social network
CN113255370B (en) Industry type recommendation method, device, equipment and medium based on semantic similarity
US10581845B2 (en) Method and apparatus for assigning device fingerprints to internet devices
CN110750615B (en) Text repeatability judgment method and device, electronic equipment and storage medium
CN113254255B (en) Cloud platform log analysis method, system, device and medium
JPWO2018159337A1 (en) Profile generation device, attack detection device, profile generation method, and profile generation program
US10824694B1 (en) Distributable feature analysis in model training system
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN111869176A (en) System and method for malware signature generation
CN106294406B (en) Method and equipment for processing application access data
US11947572B2 (en) Method and system for clustering executable files
CN113723555A (en) Abnormal data detection method and device, storage medium and terminal
CN117216239A (en) Text deduplication method, text deduplication device, computer equipment and storage medium
CN117312825A (en) Target behavior detection method and device, electronic equipment and storage medium
CN113128213A (en) Log template extraction method and device
CN113874888A (en) Information processing apparatus, generation method, and generation program
CN111931229B (en) Data identification method, device and storage medium
CN115048345A (en) Abnormal log detection method and device, electronic equipment and storage medium
US11210605B1 (en) Dataset suitability check for machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220217

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.