WO2024031930A1 - Abnormal log detection method and apparatus, electronic device, and storage medium - Google Patents

Abnormal log detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2024031930A1
Authority
WO
WIPO (PCT)
Prior art keywords
abnormal
log
vocabulary
template
vector
Prior art date
Application number
PCT/CN2023/071830
Other languages
English (en)
French (fr)
Inventor
赵利强
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024031930A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Definitions

  • Embodiments of the present application relate to the field of log processing, and in particular to an abnormal log detection method, device, electronic device and non-volatile readable storage medium.
  • Log information is a widely available data resource used to record system status and key events when various software systems are running. Developers often use log information to view system running status, detect anomalies, and deduce the cause of failures. However, with the increase in the scale and complexity of modern computer systems, log information has exploded, which also poses challenges for efficient detection of log information.
  • The purpose of the embodiments of this application is to provide an abnormal log detection method, device, electronic device and non-volatile readable storage medium, which can use a finite state automaton constructed from abnormal vocabulary to perform anomaly detection on log information, improving the efficiency of log anomaly detection and reducing the usage of computing resources.
  • an abnormal log detection method including:
  • the log information is determined to be an abnormal log.
  • The finite state automaton is an AC automaton.
  • Using the finite state automaton constructed from abnormal vocabulary to detect the target abnormal vocabulary contained in the log information, and using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal vocabulary to determine the total abnormal value corresponding to the log information, includes:
  • the characters in the log information are input into the AC automaton in sequence for matching, and the node corresponding to each character in the AC automaton and the state corresponding to that node are determined;
  • the abnormal word corresponding to the state and the other abnormal words are set as the target abnormal vocabulary corresponding to the character, and the dynamic programming algorithm and the preset abnormal values corresponding to the character's target abnormal vocabulary are used to determine the total abnormal value.
  • Determining the total abnormal value using the dynamic programming algorithm and the preset abnormal values corresponding to the character's target abnormal vocabulary includes:
  • $s$ represents the string of the log information;
  • $s_{n-1}$ and $s_n$ represent the $(n-1)$-th and $n$-th characters in the string;
  • $f(s_{n-1})$ and $f(s_n)$ represent the total abnormal values corresponding to $s_{n-1}$ and $s_n$;
  • $state_n$ represents the state corresponding to the character $s_n$;
  • $state_n \notin error\_word$ indicates that $state_n$ has no corresponding target abnormal word;
  • $state_n \in error\_word$ indicates that $state_n$ has a corresponding target abnormal word;
  • $score(state_n)$ represents the sum of the preset abnormal values corresponding to the target abnormal words of the character $s_n$.
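The recurrence itself is missing from the text above. A reconstruction consistent with the symbol definitions (an assumption based on those definitions, not a quotation of the original formula) would be:

```latex
f(s_n) =
\begin{cases}
f(s_{n-1}), & state_n \notin error\_word \\
f(s_{n-1}) + score(state_n), & state_n \in error\_word
\end{cases}
```

Under this reading, the total abnormal value only grows when the state reached by the current character has at least one target abnormal word, and the value for the whole log is $f$ evaluated at its last character.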
  • Before using the finite state automaton constructed from abnormal vocabulary to detect the target abnormal vocabulary contained in the log information, the method also includes:
  • the step of detecting the target abnormal vocabulary contained in the log information by using the finite state automaton constructed from abnormal vocabulary is entered.
  • Calculating the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template includes:
  • $a$ represents the log vector to be detected;
  • $b$ represents the normal log vector;
  • $similarity(a, b)$ represents the similarity value;
  • $a_i$ and $b_i$ represent the $i$-th words in the log vector to be detected and in the normal log vector, respectively.
  • Before using the log information to generate the log vector to be detected, the method also includes:
  • the original log template corresponding to the log template vectors included in the accumulated template categories is set as the normal log template.
  • Classifying the log template vectors to obtain template categories includes:
  • the vector to be processed is set as a template kernel vector and added to the template kernel vector set;
  • the vector to be processed is added to the template category corresponding to the target template kernel vector with the smallest lexicographic order.
  • The finite state automaton is an AC automaton. Before using the finite state automaton constructed from abnormal vocabulary to detect the target abnormal vocabulary contained in the log information, the method also includes:
  • the abnormal vocabulary library contains a plurality of abnormal words, and each abnormal word has a corresponding preset abnormal value;
  • Obtaining the abnormal vocabulary library includes:
  • Extracting the abnormal vocabulary from the vocabulary to be processed based on the TF-IDF value includes:
  • in order of TF-IDF value from high to low, the top preset proportion of the words to be processed are set as the abnormal words, and the TF-IDF values are used to set corresponding preset abnormal values for the abnormal words.
  • Using the TF-IDF values to set corresponding preset abnormal values for the abnormal words includes:
  • $\text{tf-idf}_i$ represents the TF-IDF value of the $i$-th abnormal word;
  • $e$ represents the base of the natural logarithm.
  • Calculating the TF-IDF value corresponding to the vocabulary to be processed includes:
  • $\text{tf-idf}_i = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$;
  • $\text{tf-idf}_i$ represents the TF-IDF value of the $i$-th word to be processed;
  • $t$ represents the $i$-th word to be processed;
  • $d$ represents the abnormal log;
  • $D$ represents the set containing all the abnormal logs;
  • $\mathrm{tf}(t, d)$ represents the term frequency of the word $t$, which is calculated as follows:
  • $t' \in d$ ranges over all words in the abnormal log;
  • $\mathrm{idf}(t, D)$ represents the inverse document frequency of the word $t$, which is calculated as follows:
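The two formulas referred to above are missing from the text. For orientation, the textbook definitions of term frequency and inverse document frequency are given below; here $\mathrm{count}(t, d)$ is an assumed helper symbol for the number of occurrences of $t$ in $d$, and the original formulas may differ in detail (for example by smoothing the denominator).

```latex
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{\sum_{t' \in d} \mathrm{count}(t', d)},
\qquad
\mathrm{idf}(t, D) = \log \frac{|D|}{\left|\{\, d \in D : t \in d \,\}\right|}
```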
  • An embodiment of the present application also provides an abnormal log detection device, including:
  • an acquisition module, configured to obtain log information;
  • a detection module, configured to use a finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, and to use a dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words to determine the total abnormal value corresponding to the log information;
  • a determination module, configured to determine that the log information is an abnormal log when the total abnormal value is determined to be greater than the first preset threshold.
  • An embodiment of the present application also provides an electronic device, including:
  • the processor is configured to implement the above-mentioned abnormal log detection method when executing the above-mentioned computer program.
  • Embodiments of the present application also provide a non-volatile readable storage medium in which computer-executable instructions are stored; when the instructions are executed, the steps of the above abnormal log detection method are implemented.
  • Embodiments of the present application provide an abnormal log detection method, which includes: obtaining log information; using a finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, and using a dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words to determine the total abnormal value corresponding to the log information; and, when the total abnormal value is determined to be greater than the first preset threshold, determining that the log information is an abnormal log.
  • The embodiment of the present application can use a finite state automaton constructed from abnormal words to detect abnormalities in log information.
  • The automaton can automatically detect the target abnormal words contained in the log information; a dynamic programming algorithm and the preset abnormal values corresponding to these words can then be used to determine the total abnormal value corresponding to the log information, and when the total abnormal value is greater than the preset threshold, the log information can be determined to be an abnormal log.
  • Because the embodiment of the present application uses the target abnormal vocabulary extracted by the finite state automaton to determine whether the log information is an abnormal log, and the target abnormal vocabulary is plain text data, plain text logs can be detected, which avoids the limitation of existing methods that can only detect log data containing time series data. In addition, compared with traditional machine learning and deep learning methods, finite state automata are more computationally efficient and need less code to implement. The embodiments can therefore both improve the efficiency of abnormal log detection and reduce the computing resources it consumes, so that the detection function can be deployed on hardware devices with limited computing resources, which effectively broadens the scenarios where abnormal log detection is applicable.
  • Embodiments of the present application also provide an abnormal log detection device, electronic equipment and a non-volatile readable storage medium, which have the above beneficial effects.
  • Figure 1 is a flow chart of an abnormal log detection method provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of an AC automaton provided by an embodiment of the present application.
  • Figure 3 is a flow chart of another abnormal log detection method provided by an embodiment of the present application.
  • Figure 4 is a structural block diagram of an abnormal log detection device provided by an embodiment of the present application.
  • Figure 5 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • Figure 6 is a structural block diagram of a non-volatile readable storage medium provided by an embodiment of the present application.
  • In the related art, abnormal log detection usually uses a method based on principal component analysis or on deep learning to examine the timing parameters in the log information and extract the abnormal information from it.
  • However, not all abnormal logs reveal problems through time series data.
  • Many error logs contain no time series variables and are pure text data.
  • In addition, deep learning model training often requires a large amount of computing resources, the various word vectors take up a lot of storage, and real-time computing performance is often strained when dealing with large-scale streaming log data.
  • In view of this, embodiments of the present application can provide an abnormal log detection method that uses a finite state automaton constructed from abnormal vocabulary to perform anomaly detection on log information, which can improve the efficiency of log anomaly detection and reduce the usage of computing resources.
  • Figure 1 is a flow chart of an abnormal log detection method provided by an embodiment of the present application. The method may include:
  • the embodiments of this application do not limit the source and type of log information.
  • the log information can belong to any system or service.
  • the embodiments of this application do not limit the method of collecting log information. It can be understood that the method of collecting log information is related to the data source and the communication protocol used by the data source, and can be set based on actual application requirements and related technologies.
  • The embodiments of this application do not limit the timing of obtaining log information. For example, it can be obtained in real time, or all logs generated within a period can be obtained periodically; this can be set according to actual application requirements. In one possible case, log information can be obtained in real time so that abnormal conditions are detected promptly.
  • The embodiment of the present application uses a finite state automaton constructed from abnormal words to detect abnormalities in log information, where the abnormal words are words extracted from abnormal logs, and the automaton can be a deterministic finite automaton (DFA). Because automata match characters efficiently and require little code to build, they are well suited to scenarios with limited computing resources; for example, in embedded application scenarios where computing resources are relatively scarce, finite state automata can achieve higher performance.
  • the embodiment of the present application uses finite state automata to detect abnormal log information. Compared with existing machine learning methods and deep learning methods, it can achieve better results while significantly reducing the usage of computing resources.
  • The automaton in the embodiment of the present application does not use a time series method to detect anomalies in log information but a character matching method. It can therefore effectively detect plain text logs and avoid the problem that the related technology can only detect log information containing time series vocabulary.
  • the abnormal words used to construct the above finite state automata are all set with corresponding preset abnormal values.
  • the preset abnormal values corresponding to the target abnormal words can be used to determine the total abnormal values corresponding to the log information, and then based on the total abnormal values, it can be determined whether the log information belongs to the abnormal log.
  • the embodiments of this application do not limit the preset abnormal value corresponding to each abnormal word, which can be set according to actual application requirements.
  • The embodiments of this application do not limit how these preset abnormal values are set. For example, they can be set according to preset operation and maintenance detection rules, according to the frequency with which the words appear in abnormal logs and other such information, or according to actual application requirements.
  • the embodiments of this application do not limit the number of exception words required to build a finite state automaton, which can be set according to actual application requirements.
  • The embodiments of the present application do not limit the way a finite state automaton is constructed from abnormal words.
  • For example, it can be constructed on the basis of the Aho-Corasick algorithm, a commonly used algorithm for multi-pattern matching; the finite state automaton it constructs is also called an AC automaton.
  • To facilitate understanding of how the AC automaton constructed by the Aho-Corasick algorithm is used to detect abnormal logs, please refer to Figure 2.
  • Figure 2 is a schematic diagram of an AC automaton provided in an embodiment of the present application, where root represents the root node, the other nodes represent characters, the solid lines represent the branches of the dictionary tree from which the AC automaton is built, and the dotted lines represent the failure pointers (fail) of the AC automaton.
  • When matching fails at a node of the dictionary tree, the failure pointer lets matching jump directly to the best matching node and continue from there, avoiding a return to the root node to start over as far as possible. The path between nodes represents a word: for example, the root, h and e nodes form the word "he", and the root, h, e and r nodes form the word "her".
  • When matching a string, each character in the string is input to the AC automaton in turn, and the automaton starts from the root node and matches along the path. For example, for the string to be tested "her", h, e and r are input to the AC automaton in sequence: the automaton first matches the node h corresponding to the character h downward from the root node, then matches the node e corresponding to the character e downward from node h, and finally matches the node r corresponding to the character r downward from node e.
  • Each node in the automaton has a corresponding "state", which corresponds to an actual word during string matching. For example, in Figure 2 the node e on the leftmost branch can correspond to the word "he" and the node r on the leftmost branch can correspond to the word "her", while the node h on the leftmost branch has no corresponding word.
  • Special nodes with corresponding words are marked in gray in Figure 2; in this embodiment of the present application, these special nodes correspond to abnormal words.
  • These special nodes can also be marked with the preset abnormal values corresponding to their abnormal words.
  • Further, for a node that has a corresponding abnormal word, the target node pointed to by its failure pointer may also have a corresponding abnormal word, and so may the node pointed to by the failure pointer of that target node.
  • For example, the failure pointer of e node 1 on the path root, s, h, e points to e node 2 on the path root, h, e, and the failure pointer of e node 2 points to the root node. So when calculating the total abnormal value at e node 1, in addition to accumulating the preset abnormal value corresponding to the word "she", the preset abnormal value corresponding to the word "he" must also be accumulated.
  • A dynamic programming algorithm can be used to optimize the calculation of the total abnormal value. The embodiment of the present application does not limit the form of the recurrence used when calculating the total abnormal value; it can be set according to actual application requirements.
  • The finite state automaton is an AC automaton.
  • Using the finite state automaton constructed from abnormal vocabulary to detect the target abnormal vocabulary contained in the log information, and using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal vocabulary to determine the total abnormal value corresponding to the log information, includes:
  • Step 11 Input the characters in the log information into the AC automaton in sequence for matching, and determine the node corresponding to each character in the AC automaton and the state corresponding to that node;
  • Step 12 When the state has a corresponding abnormal word, use the failure pointer to find the other abnormal words corresponding to other nodes between the node and the root node;
  • Step 13 Set the abnormal word corresponding to the state and the other abnormal words as the target abnormal vocabulary corresponding to the character, use the dynamic programming algorithm and the preset abnormal values corresponding to the character's target abnormal vocabulary to determine the total abnormal value, and process the next character;
  • Step 14 When the state does not have a corresponding abnormal word, process the next character.
  • Determining the total abnormal value using the dynamic programming algorithm and the preset abnormal values corresponding to the character's target abnormal vocabulary may include:
  • Note that when state n corresponds to an abnormal word, the node pointed to by its failure pointer may also correspond to an abnormal word, so the score function should follow the failure pointers in a loop, accumulating all possible abnormal words until it backtracks to the root node.
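Steps 11 to 14 and the score function can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `Node`, `build_automaton`, `score`, `total_abnormal_value` and `is_abnormal_log`, and the example words and values, are all hypothetical. One deliberate simplification is that `score` always walks the failure chain, which is the usual Aho-Corasick behaviour and also catches an abnormal word that is a suffix of a non-word state.

```python
from collections import deque


class Node:
    """Trie node; `score` is the preset abnormal value of the word ending here (0.0 if none)."""

    def __init__(self):
        self.children = {}   # character -> child Node
        self.fail = None     # failure pointer
        self.score = 0.0


def build_automaton(word_scores):
    """Build the dictionary tree, then add failure pointers breadth-first."""
    root = Node()
    for word, value in word_scores.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.score = value
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail
            while f is not root and ch not in f.children:
                f = f.fail
            child.fail = f.children.get(ch, root)
            queue.append(child)
    return root


def score(node, root):
    """Sum the preset values of every abnormal word ending at this state,
    following failure pointers until the root is reached."""
    total = 0.0
    while node is not root:
        total += node.score
        node = node.fail
    return total


def total_abnormal_value(log, root):
    """f(s_n) = f(s_{n-1}) + score(state_n); score is 0 when no word ends at state_n."""
    total = 0.0
    node = root
    for ch in log:
        while node is not root and ch not in node.children:
            node = node.fail
        node = node.children.get(ch, root)
        total += score(node, root)
    return total


def is_abnormal_log(log, root, threshold):
    """Compare the total abnormal value with the first preset threshold."""
    return total_abnormal_value(log, root) > threshold
```

With the words "she", "he" and "her" from the Figure 2 discussion, reaching the e node of "she" accumulates the values of both "she" and "he", as described above.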
  • the embodiments of the present application do not limit the detailed value of the first preset threshold, which can be set according to actual application requirements.
  • the log information is determined to be an abnormal log, corresponding alarm information can also be generated and output.
  • the embodiments of this application do not limit the detailed form of the alarm information, which can be set according to actual application requirements.
  • The embodiments of this application do not limit the detailed method of outputting alarm information. For example, it can be output to a display device of an electronic device, or sent to the device of designated operation and maintenance personnel by SMS or email; this can be set according to actual application requirements.
  • Based on the above, embodiments of the present application can use a finite state automaton constructed from abnormal words to detect abnormalities in log information.
  • The automaton can automatically detect the target abnormal words contained in the log information; a dynamic programming algorithm and the preset abnormal values corresponding to these words can then be used to determine the total abnormal value corresponding to the log information, and when the total abnormal value is greater than the preset threshold, the log information can be determined to be an abnormal log.
  • Because the embodiment of the present application uses the target abnormal vocabulary extracted by the finite state automaton to determine whether the log information is an abnormal log, and the target abnormal vocabulary is plain text data, plain text logs can be detected, which avoids the limitation of existing methods that can only detect log data containing time series data. In addition, compared with traditional machine learning and deep learning methods, finite state automata are more computationally efficient and need less code to implement. The embodiments can therefore both improve the efficiency of abnormal log detection and reduce the computing resources it consumes, so that the detection function can be deployed on hardware devices with limited computing resources, which effectively broadens the scenarios where abnormal log detection is applicable.
  • the generation process of the finite state automaton is introduced in detail below.
  • the finite state automaton is an AC automaton.
  • Before using the finite state automaton constructed from abnormal vocabulary to detect the target abnormal vocabulary contained in the log information, the method may also include:
  • The abnormal vocabulary library is used to store abnormal words.
  • The embodiments of the present application do not limit the construction process of the abnormal vocabulary library.
  • For example, abnormal logs containing abnormal information can be collected, and the library can be constructed from the abnormal vocabulary contained in those logs.
  • The embodiment of this application does not limit how abnormal words are extracted from the abnormal logs. For example, they can be extracted according to preset rules, or the TF-IDF value of each word in the log can be calculated and extraction performed based on this value, where TF-IDF means term frequency-inverse document frequency.
  • In one possible case, extraction can be performed based on the TF-IDF value.
  • Obtaining the abnormal vocabulary library may include:
  • Step 31 Obtain the exception log and segment the exception log to obtain the vocabulary to be processed
  • Step 32 Calculate the TF-IDF value corresponding to the vocabulary to be processed, and extract abnormal vocabulary from the vocabulary to be processed based on the TF-IDF value;
  • Step 33 Add the abnormal vocabulary to the abnormal vocabulary library.
  • The TF-IDF value can be calculated as follows:
  • Calculating the TF-IDF value corresponding to the vocabulary to be processed may include:
  • Step 41 Use the following method to calculate the TF-IDF value corresponding to the vocabulary to be processed: $\text{tf-idf}_i = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$;
  • $\text{tf-idf}_i$ represents the TF-IDF value of the $i$-th word to be processed;
  • $t$ represents the $i$-th word to be processed;
  • $d$ represents the abnormal log;
  • $D$ represents the set containing all the abnormal logs;
  • $\mathrm{tf}(t, d)$ represents the term frequency of the word $t$, which is calculated as follows:
  • $t' \in d$ ranges over all words in the abnormal log;
  • $\mathrm{idf}(t, D)$ represents the inverse document frequency of the word $t$, which is calculated as follows:
  • After the TF-IDF values are calculated, the words to be processed can be ranked by TF-IDF value from high to low, and the top preset proportion of them set as abnormal words and added to the abnormal vocabulary library.
  • the embodiment of the present application does not limit the detailed value of the preset ratio, which may be the top 2%, for example.
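The extraction described above (segment the abnormal logs, compute TF-IDF, keep the top preset proportion) might look like the sketch below. Several points are assumptions rather than statements from the text: the tokenizer, the textbook tf and idf definitions, and keeping the maximum TF-IDF value a word reaches across the logs. The function names are hypothetical; only the 2% default comes from the example above.

```python
import math
import re
from collections import Counter


def tokenize(log):
    """Assumed segmentation: split on anything that is not a letter or digit."""
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", log) if w]


def tf_idf_per_word(logs):
    """Return word -> highest tf(t, d) * idf(t, D) over all abnormal logs d in D."""
    docs = [Counter(tokenize(log)) for log in logs]
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())   # number of logs containing each word
    best = {}
    for counts in docs:
        total = sum(counts.values())
        for word, n in counts.items():
            value = (n / total) * math.log(len(docs) / doc_freq[word])
            best[word] = max(best.get(word, 0.0), value)
    return best


def top_proportion(values, proportion=0.02):
    """Keep the top `proportion` of words by TF-IDF value (at least one word)."""
    ranked = sorted(values.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = max(1, math.ceil(len(ranked) * proportion))
    return dict(ranked[:keep])
```

A word that occurs in every abnormal log gets an idf of zero under this definition, so it never ranks near the top, which matches the intent of picking words that are distinctive for particular failures.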
  • TF-IDF values can also be used to set the preset abnormal values of the abnormal words.
  • Extracting abnormal words from the vocabulary to be processed based on the TF-IDF value may include:
  • Step 51 In order of TF-IDF value from high to low, set the top preset proportion of the words to be processed as abnormal words, and use the TF-IDF values to set corresponding preset abnormal values for the abnormal words.
  • The embodiments of the present application do not limit the detailed method of using the TF-IDF value to set the preset abnormal value of an abnormal word.
  • For example, the TF-IDF value can be divided by the base of the natural logarithm to obtain the preset abnormal value; it can also be set in other ways.
  • Using the TF-IDF values to set corresponding preset abnormal values for the abnormal words may include:
  • Step 61 Use the TF-IDF value to set the corresponding preset abnormal value for the abnormal vocabulary in the following way, where $\text{tf-idf}_i$ represents the TF-IDF value of the $i$-th abnormal word and $e$ represents the base of the natural logarithm.
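The formula for Step 61 is missing from the text. Going only by the example given above (dividing the TF-IDF value by the base of the natural logarithm), one possible reading is sketched below; the function name is hypothetical and the original formula may differ.

```python
import math


def preset_abnormal_values(tfidf_by_word):
    """Assumed mapping: preset abnormal value = TF-IDF value / e."""
    return {word: value / math.e for word, value in tfidf_by_word.items()}
```

The result can serve as the per-word values marked on the nodes of the dictionary tree.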
  • Preset rules can also be used to extract abnormal words from the vocabulary to be processed and to assign corresponding preset abnormal values to them.
  • The preset abnormal values of words extracted in this way can be higher than those of abnormal words extracted using TF-IDF values, and can be set according to actual application requirements.
  • Step 71 Extract the target abnormal vocabulary from the vocabulary to be processed according to the preset rules, and add the corresponding preset abnormal value to the target abnormal vocabulary;
  • Step 72 Add the target abnormal vocabulary to the abnormal vocabulary library.
  • S202 Construct a dictionary tree using the abnormal vocabulary database, and mark preset abnormal values for nodes corresponding to the abnormal vocabulary in the dictionary tree.
  • The dictionary tree should meet the following conditions: 1. The root node contains no character, and every node other than the root contains exactly one character; 2. Concatenating the characters on the path from the root node to a given node yields the string corresponding to that node; 3. All child nodes of any one node contain different characters. After the dictionary tree has been built, preset abnormal values can be marked on the nodes corresponding to the abnormal words so that the total abnormal value can be calculated later.
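A dictionary tree meeting the three conditions, with preset abnormal values marked on word-ending nodes, can be sketched with nested dictionaries. The node layout and the names `build_trie` and `words_in_trie` are illustrative assumptions, not taken from the source.

```python
def build_trie(word_scores):
    """Build a dictionary tree; each node is {'children': {...}, 'score': value or None}."""
    root = {"children": {}, "score": None}   # condition 1: the root holds no character
    for word, value in word_scores.items():
        node = root
        for ch in word:
            # condition 3: children are keyed by character, so siblings always differ
            node = node["children"].setdefault(ch, {"children": {}, "score": None})
        node["score"] = value                # mark the preset abnormal value
    return root


def words_in_trie(node, prefix=""):
    """Condition 2: the characters on the path from the root spell the node's string."""
    if node["score"] is not None:
        yield prefix, node["score"]
    for ch, child in sorted(node["children"].items()):
        yield from words_in_trie(child, prefix + ch)
```

Keying children by character makes the third condition hold by construction, and shared prefixes such as "he" and "her" reuse the same nodes.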
  • the embodiments of the present application can construct a finite state automaton required for abnormal log detection according to the construction method of the AC automaton, which can ensure efficient abnormal log detection while occupying less computing resources.
  • Before using the finite state automaton to detect log information, and in order to improve detection efficiency, existing normal log templates can also be used to filter the log information and extract the target log information that is more likely to be abnormal; the finite state automaton is then used to detect that target log information.
  • Before using the finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, the method may also include:
  • For the log information, the corresponding log vector to be detected can first be generated; then the similarity between this vector and the normal log vector corresponding to each normal log template is calculated to determine how similar the log information is to each normal log template. When the log information is found to differ from every normal log template, that is, when the similarity between the log vector to be detected and each normal log vector is less than the preset threshold, it can be determined that the log information is more likely to be abnormal, and the finite state automaton should be used for detection.
  • the normal log template is the document template used for general normal log information.
  • each element in the log vector is generated by the vocabulary in the log information. For example, you can first segment the log information text to obtain the log text words, then extract the first letter of each log text word, and use the sequence composed of the first letters as a log vector, for example, for the log "log(error):hello world.” , after word segmentation according to punctuation marks, it is divided into 4 log text words: log, error, hello, world, then the feature vector of this log is [l, e, h, w].
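The first-letter example above can be reproduced with a few lines. This is a minimal sketch: the function name `log_vector` is hypothetical, and splitting on every non-word character is an assumed reading of "word segmentation according to punctuation marks".

```python
import re


def log_vector(log):
    """Split the log text on punctuation and whitespace, then keep each word's first letter."""
    words = [w for w in re.split(r"[^\w]+", log) if w]
    return [w[0] for w in words]


print(log_vector("log(error):hello world."))   # ['l', 'e', 'h', 'w']
```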
  • the embodiments of the present application do not limit the detailed calculation method of similarity.
  • the embodiment of the present application does not limit the detailed value of the second preset threshold, which can be set according to actual application requirements, for example, it can be set to 0.8.
  • Step 81 Obtain all original log templates, and use each original log template to generate the corresponding log template vector.
  • log templates here include both normal log templates and abnormal log templates.
  • For the method of generating a log template vector, refer to the above embodiment; the details are not repeated here.
  • Step 82 Classify the log template vectors to obtain template categories, and sort the template categories from large to small according to the number of log template vectors corresponding to each template category;
  • log templates are classified to obtain template categories, which may include:
  • Step 91 Create a template kernel vector set and set the first log template vector as the vector to be processed.
  • The template kernel vector set is an empty set when it is first created.
  • Step 92 When it is determined that the template kernel vector set is empty, or that it contains no target template kernel vector whose similarity to the vector to be processed is greater than the fourth preset threshold, set the vector to be processed as a template kernel vector and add it to the template kernel vector set.
  • the template kernel vector in the embodiment of this application is a representative vector of the template category.
  • To determine the template category of a vector to be processed, the similarity between that vector and each template kernel vector should first be calculated. If the similarity does not exceed the preset threshold, the vector to be processed and the corresponding template kernel vector do not belong to the same category; conversely, if the similarity exceeds the preset threshold, they can belong to the same category.
  • When the similarity between the vector to be processed and every template kernel vector does not exceed the preset threshold, the vector to be processed belongs to no existing template category; it can then be set as the template kernel vector of a new template category and added to the template kernel vector set.
  • the embodiment of the present application does not limit the detailed value of the fourth preset threshold, which can be set according to actual application requirements, for example, it can be 0.8.
  • For the calculation method of the above similarity, refer to the above embodiment; it is not repeated here.
  • If the set contains no template kernel vector, the vector to be processed can be directly set as a template kernel vector and added to the set.
  • Step 93 When it is determined that the target template kernel vector exists in the template kernel vector set, add the vector to be processed to the template category corresponding to the target template kernel vector with the smallest lexicographic order.
  • The vector to be processed may be highly similar to several target template kernel vectors in the template kernel vector set. In this case, the embodiment of the present application preferably adds the vector to be processed to the template category of the target template kernel vector that is smallest in lexicographic order among them, where lexicographic order means ordering by alphabetical sequence.
  • Step 94 Set the next log template vector as the vector to be processed and repeat the above steps until all log template vectors have been processed.
  • the abnormal template category composed of abnormal log templates is not only significantly different from the normal template category composed of normal log templates, but also the number of log template vectors contained in the abnormal template category is significantly less than that contained in the normal template category.
  • Step 83 Extract the number of log template vectors corresponding to the template category from the sorting sequence and accumulate them, and after each accumulation, calculate the ratio between the current accumulated number and the total number of log templates;
  • Step 84 When it is determined that the ratio is greater than the third preset threshold, set the original log template corresponding to the log template vector included in the accumulated template category as a normal log template.
  • the embodiment of the present application does not limit the detailed value of the third preset threshold.
  • it can be 98%, that is, the log template with the top 98% of the total volume is defined as the normal log template.
  • Before the finite state automaton is used to detect log information, the embodiments of the present application can also use existing normal log templates to filter the log information, extracting the target log information that is more likely to be abnormal and then applying the finite state automaton only to that target log information, which improves detection efficiency.
  • Figure 3 is a flow chart of another abnormal log detection method provided by an embodiment of the present application.
  • the method can include:
  • 4. Use the knowledge of operation and maintenance experts to extract abnormal words from the segmented error logs of step 2, assign them corresponding abnormal scores, and add them to the abnormal word list.
  • Abnormal words extracted with expert knowledge should be highly discriminative, that is, their scores should be clearly higher than most of the scores assigned in step 3.
  • 7. For each log initially determined to be abnormal, use the finite state automaton from step 5 and the dynamic programming algorithm to calculate its abnormality score.
  • The abnormal log detection device, electronic device and non-volatile readable storage medium provided by the embodiments of the present application are introduced below. The device, electronic device and storage medium described below and the abnormal log detection method described above may be referred to in correspondence with each other.
  • FIG 4 is a structural block diagram of an abnormal log detection device provided by an embodiment of the present application.
  • the device may include:
  • the acquisition module 401 is configured to obtain log information
  • the detection module 402 is configured to use a finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, and to use a dynamic programming algorithm and a preset abnormal value corresponding to the target abnormal word to determine the total abnormal value corresponding to the log information. ;
  • the determination module 403 is configured to determine that the log information is an abnormal log when it is determined that the total abnormal value is greater than the first preset threshold.
  • the finite state automaton is an AC automaton
  • the detection module 402 may include:
  • the matching submodule is configured to input the characters in the log information into the AC automaton in sequence for matching, and determine the nodes corresponding to the characters in the AC automaton and the corresponding status of the nodes;
  • the search submodule is set to use the failure pointer to find other exception words corresponding to other nodes between the node and the root node when the state has a corresponding exception word;
  • the calculation submodule is configured to set the abnormal vocabulary corresponding to the state and other abnormal words to the target abnormal vocabulary corresponding to the character, and determine the total abnormal value using the dynamic programming algorithm and the preset abnormal value corresponding to the target abnormal vocabulary of the character.
  • calculation submodule is set to:
  • the device may also include:
  • the filtering module is configured to use log information to generate a log vector to be detected, and calculate the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template;
  • the detection module is further configured to, when it is determined that the similarity value is less than the second preset threshold, enter the step of detecting the target abnormal vocabulary contained in the log information using a finite state automaton constructed from the abnormal vocabulary.
  • the filter module can include:
  • the device may also include:
  • the template acquisition module is configured to obtain all original log templates and use each original log template to generate the corresponding log template vector;
  • the classification module is configured to classify log template vectors to obtain template categories, and sort the template categories from large to small according to the number of log template vectors corresponding to each template category;
  • the accumulation module is set to extract and accumulate the number of log template vectors corresponding to the template categories from the sorting sequence, and after each accumulation, calculate the ratio between the current accumulated number and the total number of original log templates;
  • the setting module is configured to set the original log template corresponding to the log template vector included in the accumulated template category as a normal log template when it is determined that the ratio is greater than the third preset threshold.
  • the classification module can include:
  • the first setting submodule is set to create a template core vector set and set the first log template vector as the vector to be processed;
  • the first processing submodule is configured to: when it is determined that the template kernel vector set is empty, or that it contains no target template kernel vector whose similarity to the vector to be processed is greater than the fourth preset threshold, set the vector to be processed as a template kernel vector and add it to the template kernel vector set;
  • the second processing submodule is configured to add the vector to be processed to the template category corresponding to the target template kernel vector with the smallest lexicographic order when it is determined that the target template kernel vector exists in the template kernel vector set;
  • the second setting submodule is configured to enter the step of setting the next log template vector as a vector to be processed until all log template vectors are processed.
  • the finite state automaton is an AC automaton
  • the device may also include:
  • the abnormal vocabulary acquisition module is set to obtain the abnormal vocabulary; the abnormal vocabulary contains multiple abnormal words, and each abnormal word has a corresponding preset abnormal value;
  • the dictionary tree building module is configured to build a dictionary tree using the abnormal vocabulary library, and label the nodes corresponding to the abnormal vocabulary in the dictionary tree with preset abnormal values;
  • the prefix pointer calculation module is configured to perform prefix pointer calculation on the dictionary tree using breadth-first search to construct a failure pointer in the dictionary tree to obtain an AC automaton.
  • the exception vocabulary acquisition module can include:
  • the exception log acquisition sub-module is configured to obtain exception logs and segment the exception logs to obtain the vocabulary to be processed;
  • the TF-IDF processing submodule is set to calculate the TF-IDF value corresponding to the vocabulary to be processed, and extract abnormal vocabulary from the vocabulary to be processed based on the TF-IDF value;
  • the first adding submodule is configured to add abnormal words to the abnormal vocabulary library.
  • the TF-IDF processing sub-module can include:
  • the abnormal word extraction unit is configured to rank the words to be processed from high to low TF-IDF value, set the top preset proportion of them as abnormal words, and use the TF-IDF values to set corresponding preset abnormal values for the abnormal words.
  • the abnormal vocabulary extraction unit may include:
  • the preset abnormal value setting subunit is configured to use the TF-IDF value to set the corresponding preset abnormal value for an abnormal word as $score_i = \text{tf-idf}_i / e$, where $\text{tf-idf}_i$ represents the TF-IDF value of the i-th abnormal word and $e$ represents the base of the natural logarithm.
  • the TF-IDF processing sub-module can include:
  • the TF-IDF calculation unit is configured to calculate the TF-IDF value corresponding to each word to be processed as $\text{tf-idf}_i = tf(t,d)\cdot idf(t,D)$, where $\text{tf-idf}_i$ represents the TF-IDF value of the i-th word to be processed, $t$ represents the i-th word to be processed, $d$ represents an abnormal log, and $D$ represents the set containing all abnormal logs.
  • $tf(t,d)$ represents the term frequency of word $t$, calculated as $tf(t,d) = n_{t,d} / \sum_{t' \in d} n_{t',d}$, where $n_{t,d}$ is the number of times word $t$ occurs in log $d$ and $t' \in d$ ranges over all words in the abnormal log.
  • $idf(t,D)$ represents the inverse document frequency of word $t$; its formula is not preserved in the source text.
  • the exception vocabulary acquisition module can also include:
  • the abnormal vocabulary extraction submodule is configured to extract target abnormal words from the vocabulary to be processed according to preset rules, and add corresponding preset abnormal values to the target abnormal words;
  • the second adding sub-module is configured to add the target abnormal vocabulary to the abnormal vocabulary library.
  • Figure 5 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • An embodiment of the present application also provides an electronic device, including:
  • Memory 501 configured to store computer programs
  • the processor 502 is configured to implement the above-mentioned steps of the abnormal log detection method when executing the computer program.
  • Figure 6 is a structural block diagram of a non-volatile readable storage medium provided by an embodiment of the present application.
  • An embodiment of the present application also provides a non-volatile readable storage medium.
  • the readable storage medium 601 stores a computer program. When the computer program is executed by the processor, the steps of the abnormal log detection method of any of the above embodiments are implemented.
  • the embodiment of the non-volatile readable storage medium part corresponds to the embodiment of the abnormal log detection method part, for the embodiment of the storage medium part, please refer to the description of the embodiment of the abnormal log detection method part, and will not be described again here.
  • The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

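Embodiments of the present application provide an abnormal log detection method and device, an electronic device, and a storage medium, relating to the field of log processing. The method includes: obtaining log information; using a finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, and using a dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words to determine a total abnormal value for the log information; and, when the total abnormal value is determined to be greater than a first preset threshold, determining that the log information is an abnormal log. The embodiments use a finite state automaton constructed from abnormal words to detect anomalies in plain-text log information, which avoids being limited to log data that carries time-series data. The automaton also improves the efficiency of abnormal log detection and reduces its consumption of computing resources, so that the detection function can be deployed on hardware with fewer computing resources, extending the scenarios in which abnormal log detection is applicable.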

Description

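Abnormal log detection method and device, electronic device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210964876.2, entitled "Abnormal log detection method and device, electronic device, and storage medium", filed with the China Patent Office on August 12, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of the present application relate to the field of log processing, and in particular to an abnormal log detection method and device, an electronic device, and a non-volatile readable storage medium.
Background
Log information is a widely available data resource that records the system state and key events of software systems at run time. Developers often use log information to inspect the running state of a system, detect anomalies, and infer the causes of failures. However, as modern computer systems grow in scale and complexity, the volume of log information has exploded, which makes efficient inspection of log information a challenge.
In the related art, methods based on principal component analysis or on deep learning are usually applied to the time-series parameters in log information in order to extract the abnormal information. However, not every abnormal log can be found through time-series data: many error logs contain no time-series variables and are plain-text data. In addition, training a deep learning model often consumes a large amount of computing resources, the various word vectors occupy considerable storage, and real-time computing performance often falls short on large streaming log data.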
发明内容
本申请实施例的目的是提供一种异常日志检测方法、装置、电子设备及非易失性可读存储介质,可使用由异常词汇构建的有限状态自动机对日志信息进行异常检测,能够提升日志异常检测的效率并降低对计算资源的占用率。
为解决上述技术问题,本申请实施例提供一种异常日志检测方法,包括:
获取日志信息;
利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇,并利用动态规划算法及上述目标异常词汇对应的预设异常值确定上述日志信息对应的总异常值;
当确定上述总异常值大于第一预设阈值时,判定上述日志信息为异常日志。
可选地,上述有限状态自动机为AC自动机,上述利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇,并利用动态规划算法及上述目标异常词汇对应的预设异常值确定上述日志信息对应的总异常值,包括:
将上述日志信息中的字符依次输入至上述AC自动机中进行匹配,确定上述字符在上述AC自动机中对应的节点及上述节点对应的状态;
当上述状态具有对应的异常词汇时,通过失败指针查找上述节点与根节点间的其他节点对应的其他异常词汇;
将上述状态对应的异常词汇和上述其他异常词汇设置为上述字符对应的目标异常词汇,并利用动态规划算法及上述字符的目标异常词汇对应的预设异常值确定上述总异常值。
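Optionally, using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words of the character to determine the total abnormal value includes:
calculating the total abnormal value with the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words of the character as follows:
$$f(s_n)=\begin{cases} f(s_{n-1}), & state_n \neq error\_word \\ f(s_{n-1}) + score(state_n), & state_n = error\_word \end{cases}$$
where $s$ denotes the string of the log information; $s_{n-1}$ and $s_n$ denote the (n-1)-th and n-th characters of the string; $f(s_{n-1})$ and $f(s_n)$ denote the total abnormal values corresponding to $s_{n-1}$ and $s_n$; $state_n$ denotes the state corresponding to the character $s_n$; $state_n \neq error\_word$ means that $state_n$ has no corresponding target abnormal word; $state_n = error\_word$ means that $state_n$ has a corresponding target abnormal word; and $score(state_n)$ denotes the sum of the preset abnormal values corresponding to the target abnormal words of the character $s_n$.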
可选地,在利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇之前,还包括:
利用上述日志信息生成待检测日志向量,并计算上述待检测日志向量与正常日志模板对应的正常日志向量之间的相似度值;
当确定上述相似度值小于第二预设阈值时,进入上述利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇的步骤。
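Optionally, calculating the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template includes:
calculating the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template as follows:
$$similarity(a,b)=\frac{\sum_{i=1}^{\min(len(a),\,len(b))}(a_i=b_i)}{\max(len(a),\,len(b))}$$
where $a$ denotes the log vector to be detected, $b$ denotes the normal log vector, and $similarity(a,b)$ denotes the similarity value; $a_i$ and $b_i$ denote the i-th word of the log vector to be detected and the i-th word of the normal log vector respectively; the value of $a_i=b_i$ is 1 when $a_i$ and $b_i$ are equal and 0 when they are not; $\min(\cdot)$ is the minimum function, $\max(\cdot)$ is the maximum function, and $len(\cdot)$ is the vector length.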
可选地,在利用上述日志信息生成待检测日志向量之前,还包括:
获取所有原始日志模板,并利用各上述原始日志模板生成对应的日志模板向量;
对日志模板向量进行分类得到模板类别,并根据各模板类别对应的日志模板向量数量,按从大到小的顺序对上述模板类别进行排序;
从排序序列中依次提取模板类别对应的日志模板向量数量进行累加,并在每次累加结束后,计算当前累加数量与日志模板总数量间的比值;
当确定上述比值大于第三预设阈值时,将已累加的模板类别所包含的日志模板向量对应的原始日志模板设置为上述正常日志模板。
可选地,上述对日志模板向量进行分类得到模板类别,包括:
创建模板核向量集合,并将首个日志模板向量设置为待处理向量;
当确定上述模板核向量集合为空,或上述模板核向量集合中不存在与上述待处理向量间的相似度大于第四预设阈值的目标模板核向量时,将上述待处理向量设置为模板核向量并添加至上述模板核向量集合;
当确定上述模板核向量集合中存在上述目标模板核向量时,将上述待处理向量添加至字典序最小的目标模 板核向量对应的模板类别中;
对下一日志模板向量进入上述设置为待处理向量的步骤,直至完成对所有上述日志模板向量的处理。
可选地,上述有限状态自动机为AC自动机,在利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇之前,还包括:
获取异常词库;上述异常词库包含多个上述异常词汇,每一上述异常词汇均有对应的预设异常值;
利用上述异常词库构建字典树,并在上述字典树中为与上述异常词汇对应的节点标注上述预设异常值;
使用广度优先搜索对上述字典树进行前缀指针计算,以在上述字典树中构造失败指针,得到上述AC自动机。
可选地,上述获取异常词库,包括:
获取异常日志,并对上述异常日志进行分词得到待处理词汇;
计算上述待处理词汇对应的TF-IDF值,并根据上述TF-IDF值从上述待处理词汇中提取上述异常词汇;
将上述异常词汇添加至上述异常词库。
可选地,上述根据上述TF-IDF值从上述待处理词汇中提取上述异常词汇,包括:
按照上述TF-IDF值从高到低的顺序,将前预设比例的待处理词汇设置为上述异常词汇,并利用上述TF-IDF值为上述异常词汇设置对应的预设异常值。
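Optionally, using the TF-IDF value to set the corresponding preset abnormal value for the abnormal word includes:
using the TF-IDF value to set the corresponding preset abnormal value for the abnormal word as follows:
$$score_i=\frac{\text{tf-idf}_i}{e}$$
where $\text{tf-idf}_i$ denotes the TF-IDF value of the i-th abnormal word and $e$ denotes the base of the natural logarithm.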
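Optionally, calculating the TF-IDF values corresponding to the words to be processed includes:
calculating the TF-IDF value corresponding to each word to be processed as follows:
$$\text{tf-idf}_i = tf(t,d)\cdot idf(t,D)$$
where $\text{tf-idf}_i$ denotes the TF-IDF value of the i-th word to be processed, $t$ denotes the i-th word to be processed, $d$ denotes an abnormal log, and $D$ denotes the set containing all the abnormal logs. $tf(t,d)$ denotes the term frequency of word $t$, calculated as follows:
$$tf(t,d)=\frac{n_{t,d}}{\sum_{t'\in d} n_{t',d}}$$
where $n_{t,d}$ is the number of times word $t$ occurs in log $d$ and $t' \in d$ ranges over all words in the abnormal log. $idf(t,D)$ denotes the inverse document frequency of word $t$; its formula is not preserved in the source text.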
可选地,在对上述异常日志进行分词得到待处理词汇之后,还包括:
根据预设规则从上述待处理词汇中提取目标异常词汇,并为上述目标异常词汇添加对应的预设异常值;
将上述目标异常词汇添加至上述异常词库。
本申请实施例还提供一种异常日志检测装置,包括:
获取模块,被设置为获取日志信息;
检测模块,被设置为利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇,并利用动态规划算法及上述目标异常词汇对应的预设异常值确定上述日志信息对应的总异常值;
判定模块,被设置为当确定上述总异常值大于第一预设阈值时,判定上述日志信息为异常日志。
本申请实施例还提供一种电子设备,包括:
存储器,被设置为存储计算机程序;
处理器,被设置为执行上述计算机程序时实现如上上述的异常日志检测方法。
本申请实施例还提供一种非易失性可读存储介质,上述非易失性可读存储介质中存储有计算机可执行指令,上述计算机可执行指令被处理器加载并执行时,实现如上上述的异常日志检测方法。
本申请实施例提供一种异常日志检测方法,包括:获取日志信息;利用由异常词汇构建的有限状态自动机检测上述日志信息中包含的目标异常词汇,并利用动态规划算法及上述目标异常词汇对应的预设异常值确定上述日志信息对应的总异常值;当确定上述总异常值大于第一预设阈值时,判定上述日志信息为异常日志。
可见,本申请实施例可使用由异常词汇构建的有限状态自动机对日志信息进行异常检测,该自动机能够自动检测日志信息中所包含的目标异常词汇,进而可利用动态规划算法及这些词汇对应的预设异常值确定日志信息对应的总异常值,并在确定总异常值大于预设阈值时,可判定日志信息为异常日志。由于本申请实施例利用由有限状态自动机提取得到的目标异常词汇确定日志信息是否为异常日志,且目标异常词汇属于纯文本数据,因此可对纯文本日志进行检测,且能够避免现有方法仅能对具有时序数据的日志数据进行检测的情况;此外,由于相较于传统的机器学习和深度学习方法,有限状态自动机的计算效率更高,且实现所需的代码更加精简,因此本申请实施例不仅能够采用有限状态自动机提升异常日志检测的效率,同时还能够降低异常日志检测对计算资源的消耗量,以确保该检测功能可配置在计算资源更低的硬件设备中,进而可有效提升异常日志检测的适用场景。本申请实施例还提供一种异常日志检测装置、电子设备及非易失性可读存储介质,具有上述有益效果。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请实施例所提供的一种异常日志检测方法的流程图;
图2为本申请实施例所提供的一种AC自动机的示意图;
图3为本申请实施例所提供的另一种异常日志检测方法的流程图;
图4为本申请实施例所提供的一种异常日志检测装置的结构框图;
图5为本申请实施例所提供的一种电子设备的结构框图;
图6为本申请实施例所提供的一种非易失性可读存储介质的结构框图。
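Detailed description
To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the embodiments of the present application.
In the related art, abnormal log detection usually applies methods based on principal component analysis or on deep learning to the time-series parameters in log information in order to extract the abnormal information. However, not every abnormal log can be found through time-series data: many error logs contain no time-series variables and are plain-text data. In addition, training a deep learning model often consumes a large amount of computing resources, the various word vectors occupy considerable storage, and real-time computing performance often falls short on large streaming log data. In view of this, the embodiments of the present application provide an abnormal log detection method that uses a finite state automaton constructed from abnormal words to detect anomalies in log information, which improves the efficiency of log anomaly detection and reduces the occupation of computing resources. Referring to Figure 1, which is a flow chart of an abnormal log detection method provided by an embodiment of the present application, the method may include: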
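S101. Obtain log information.
It should be noted that the embodiments of the present application do not limit the source or type of the log information, which may belong to any system or service. Nor do they limit how the log information is collected; the collection method depends on the data source and the communication protocol it uses, and can be set according to actual application requirements and related technologies. The timing of acquisition is likewise not limited: log information may be obtained in real time, or all logs generated within a period may be obtained periodically, as actual application requirements dictate. In one possible case, log information is obtained in real time so that abnormal situations are discovered promptly.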
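S102. Use a finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, and use a dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words to determine the total abnormal value corresponding to the log information.
The embodiments of the present application use a finite state automaton constructed from abnormal words to detect abnormal log information, where abnormal words are words extracted from abnormal logs and the automaton may be a deterministic finite automaton (DFA). Because an automaton matches characters efficiently and requires little code to build, it suits scenarios with limited computing resources; in embedded applications, where computing resources are relatively scarce, a finite state automaton can deliver higher performance. For this reason, compared with existing machine learning and deep learning methods, detection with a finite state automaton can achieve good results while significantly reducing the occupation of computing resources. It can efficiently inspect the log data streams generated in real time by large systems, and it adapts to more application scenarios, in particular embedded scenarios where computing resources are scarce. It is also worth noting that the automaton in the embodiments detects anomalies by character matching rather than by time-series analysis, so it can effectively detect plain-text logs and avoids the limitation of the related art, which can only inspect log information containing time-series words.
It should be noted that every abnormal word used to construct the finite state automaton has a corresponding preset abnormal value. After the automaton has determined the target abnormal words contained in the log information, the preset abnormal values of those words are used to determine the total abnormal value of the log information, from which it can be judged whether the log information is an abnormal log. The embodiments do not limit the preset abnormal value of each abnormal word, nor how these values are set: they may, for example, be set according to preset operation and maintenance detection rules, or according to information such as how frequently a word appears in abnormal logs. The number of abnormal words used to construct the finite state automaton is also not limited. All of these can be set according to actual application requirements.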
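It should be noted that the embodiments of the present application do not limit how the finite state automaton is constructed from the abnormal words. It may, for example, be built on the Aho-Corasick algorithm, an algorithm commonly used for multi-pattern matching; the finite state automaton it builds is also called an AC automaton. To explain how an AC automaton built with the Aho-Corasick algorithm is used for abnormal log detection, refer to Figure 2, a schematic diagram of an AC automaton provided by an embodiment of the present application. In the figure, root denotes the root node and every other node denotes a character; solid lines are the branches of the dictionary tree (trie) from which the AC automaton is constructed, and dashed lines are the failure pointers (fail) of the AC automaton. After a match fails at some node of the trie, the failure pointer lets matching jump directly to the best matching node and continue, avoiding as far as possible a return to the root node to start over. A path between nodes represents a word: for example, the nodes root, h, e form the word "he", and the nodes root, h, e, r form the word "her".
When a string to be detected is received, its characters are input to the AC automaton one by one, and the automaton starts from the root node and matches along the path. For the string "her", for example, h, e and r are input in turn: the automaton first moves down from the root to the node h matching the character h, then from node h down to the node e matching e, and finally from node e down to the node r matching r. Every node in the automaton has a corresponding "state", which during string matching corresponds to an actual word: the node e on the leftmost branch corresponds to the word "he", the node r on the leftmost branch corresponds to the word "her", and the node h on the leftmost branch has no corresponding word. Special nodes that have a corresponding word are marked in gray in Figure 2; in the embodiments of the present application these special nodes correspond to abnormal words. To make the total abnormal value quicker to compute, the preset abnormal value of the corresponding abnormal word may also be marked on each special node.
It should be specifically pointed out that when a state has a corresponding abnormal word, the target node pointed to by the failure pointer of its node may also have a corresponding abnormal word, and so may the node that the target node's failure pointer points to. When the total abnormal value is calculated, therefore, in addition to accumulating the preset abnormal value of the abnormal word of the current state, the failure pointers must be followed to find the other nodes between the current node and the root node, the other abnormal words corresponding to those nodes determined, and their preset abnormal values also accumulated into the total. In Figure 2, for example, the failure pointer of e node 1 on the path root, s, h, e (the word "she") points to e node 2 on the path root, h, e, and the failure pointer of e node 2 points to the root node. When the total abnormal value is calculated at e node 1, the preset abnormal value of the word "he" must therefore be accumulated in addition to that of the word "she".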
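To make the total abnormal value more efficient to compute, the embodiments of the present application may optimize the calculation with a dynamic programming algorithm. The embodiments do not limit the form of the recurrence used to calculate the total abnormal value, which can be set according to actual application requirements.
In one possible case, the finite state automaton is an AC automaton, and using the finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information, and using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words to determine the total abnormal value corresponding to the log information, includes:
Step 11: Input the characters of the log information into the AC automaton one by one for matching, and determine the node corresponding to each character in the AC automaton and the state corresponding to that node;
Step 12: When the state has a corresponding abnormal word, follow the failure pointers to find the other abnormal words corresponding to the other nodes between the node and the root node;
Step 13: Set the abnormal word corresponding to the state and the other abnormal words as the target abnormal words of the character, use the dynamic programming algorithm and the preset abnormal values of the character's target abnormal words to determine the total abnormal value, and then process the next character;
Step 14: When the state has no corresponding abnormal word, process the next character.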
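In one possible case, using the dynamic programming algorithm and the preset abnormal values of the character's target abnormal words to determine the total abnormal value may include:
Step 21: Calculate the total abnormal value with the dynamic programming algorithm and the preset abnormal values of the character's target abnormal words as follows:
$$f(s_n)=\begin{cases} f(s_{n-1}), & state_n \neq error\_word \\ f(s_{n-1}) + score(state_n), & state_n = error\_word \end{cases}$$
where $s$ denotes the string of the log information; $s_{n-1}$ and $s_n$ denote the (n-1)-th and n-th characters of the string; $f(s_{n-1})$ and $f(s_n)$ denote the total abnormal values corresponding to $s_{n-1}$ and $s_n$; $state_n$ denotes the state corresponding to the character $s_n$; $state_n \neq error\_word$ means that $state_n$ has no corresponding target abnormal word; $state_n = error\_word$ means that $state_n$ has a corresponding target abnormal word; and $score(state_n)$ denotes the sum of the preset abnormal values corresponding to the target abnormal words of the character $s_n$.
In other words, if $state_n$ corresponds to an abnormal word, the node its failure pointer points to may also correspond to an abnormal word; the score function should therefore follow the failure pointers in a loop, collecting every possible abnormal word until it traces back to the root node.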
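The matching and scoring steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the list-based node layout and the starting value f(s_0) = 0 are all hypothetical. One deliberate difference is noted in the comments: the sketch follows the failure chain at every state, not only at states that themselves end an abnormal word, so that a short abnormal word embedded in a longer, still unfinished path is not missed.

```python
from collections import deque


def build_automaton(scores):
    """Build a trie with failure pointers from {abnormal_word: preset_value}."""
    goto = [{}]      # goto[node][char] -> child node; node 0 is the root
    fail = [0]       # failure pointer of each node
    value = [0.0]    # preset abnormal value if the node ends a word, else 0
    for word, score in scores.items():
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                value.append(0.0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        value[node] = score
    # Breadth-first search: a parent's failure pointer is final before
    # its children are visited. Depth-1 nodes keep fail = root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            queue.append(child)
    return goto, fail, value


def total_abnormal_value(text, automaton):
    """Apply f(s_n) = f(s_{n-1}) + score(state_n), with f(s_0) = 0 assumed."""
    goto, fail, value = automaton
    total = 0.0
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        # score(state_n): walk the failure chain back to the root.
        # Assumption: the walk is done at every state, which is a superset
        # of the "state has an abnormal word" case described in the text.
        hit = node
        while hit:
            total += value[hit]
            hit = fail[hit]
    return total
```

With the hypothetical library `{"he": 1.0, "she": 2.0, "her": 3.0}`, the string "she" scores 3.0, because the node for "she" contributes 2.0 and its failure pointer leads to the node for "he", which contributes 1.0. This mirrors the Figure 2 example of e node 1 and e node 2.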
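S103. When the total abnormal value is determined to be greater than the first preset threshold, determine that the log information is an abnormal log.
It should be noted that the embodiments of the present application do not limit the value of the first preset threshold, which can be set according to actual application requirements. So that operation and maintenance personnel can investigate and fix anomalies promptly, corresponding alarm information may also be generated and output when log information is determined to be an abnormal log. The embodiments do not limit the form of the alarm information or how it is output; for example, it may be shown on a display of the electronic device, or sent by text message or e-mail to the devices of designated operation and maintenance personnel, according to actual application requirements.
Based on the above embodiment, the embodiments of the present application use a finite state automaton constructed from abnormal words to detect anomalies in log information. The automaton automatically detects the target abnormal words contained in the log information; a dynamic programming algorithm and the preset abnormal values of those words then determine the total abnormal value of the log information, and the log information is determined to be an abnormal log when the total abnormal value is greater than the preset threshold. Because the decision rests on target abnormal words extracted by the finite state automaton, and those words are plain-text data, plain-text logs can be detected, and the limitation of existing methods to log data carrying time-series data is avoided. Moreover, compared with traditional machine learning and deep learning methods, a finite state automaton computes more efficiently and needs less code to implement. The embodiments therefore both improve the efficiency of abnormal log detection and reduce its consumption of computing resources, so that the detection function can be deployed on hardware with fewer computing resources, which effectively widens the scenarios in which abnormal log detection is applicable.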
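Based on the above embodiment, the generation of the finite state automaton is described in detail below. In one possible case, the finite state automaton is an AC automaton, and before the finite state automaton constructed from abnormal words is used to detect the target abnormal words contained in the log information, the method may further include:
S201. Obtain an abnormal vocabulary library; the library contains multiple abnormal words, each with a corresponding preset abnormal value.
In the embodiments of the present application, the abnormal vocabulary library stores the abnormal words. The embodiments do not limit how the library is built; for example, abnormal logs containing abnormal information may be collected and the library built from the abnormal words they contain. Nor do the embodiments limit how abnormal words are extracted from abnormal logs: they may be extracted according to preset rules, or the TF-IDF (term frequency-inverse document frequency) value of each word in the logs may be calculated and the words extracted according to that value. In the embodiments of the present application, abnormal words may be extracted according to their TF-IDF values so that extraction is efficient.
In one possible case, obtaining the abnormal vocabulary library may include:
Step 31: Obtain abnormal logs and segment them into words to obtain the words to be processed;
Step 32: Calculate the TF-IDF value corresponding to each word to be processed, and extract abnormal words from the words to be processed according to the TF-IDF values;
Step 33: Add the abnormal words to the abnormal vocabulary library.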
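In one possible case, calculating the TF-IDF values corresponding to the words to be processed may include:
Step 41: Calculate the TF-IDF value corresponding to each word to be processed as follows:
$$\text{tf-idf}_i = tf(t,d)\cdot idf(t,D)$$
where $\text{tf-idf}_i$ denotes the TF-IDF value of the i-th word to be processed, $t$ denotes the i-th word to be processed, $d$ denotes an abnormal log, and $D$ denotes the set containing all abnormal logs. $tf(t,d)$ denotes the term frequency of word $t$, calculated as follows:
$$tf(t,d)=\frac{n_{t,d}}{\sum_{t'\in d} n_{t',d}}$$
where $n_{t,d}$ is the number of times word $t$ occurs in log $d$ and $t' \in d$ ranges over all words in the abnormal log. $idf(t,D)$ denotes the inverse document frequency of word $t$; its formula is not preserved in the source text.
After the TF-IDF value of each word to be processed has been obtained, the words can be ranked from high to low TF-IDF value, and the top preset proportion of them set as abnormal words and added to the abnormal vocabulary library. The embodiments of the present application do not limit the preset proportion; it may, for example, be the top 2%. The TF-IDF value can also be used to set the preset abnormal value of an abnormal word.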
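In one possible case, extracting abnormal words from the words to be processed according to the TF-IDF values may include:
Step 51: Rank the words to be processed from high to low TF-IDF value, set the top preset proportion of them as abnormal words, and use the TF-IDF values to set corresponding preset abnormal values for the abnormal words.
It should be noted that the embodiments of the present application do not limit how the TF-IDF value is used to set the preset abnormal value of an abnormal word. For example, the TF-IDF value may be divided by the base of the natural logarithm to obtain the preset abnormal value; other approaches may also be used.
In one possible case, using the TF-IDF value to set the corresponding preset abnormal value for an abnormal word includes:
Step 61: Use the TF-IDF value to set the corresponding preset abnormal value for the abnormal word as follows:
$$score_i=\frac{\text{tf-idf}_i}{e}$$
where $\text{tf-idf}_i$ denotes the TF-IDF value of the i-th abnormal word and $e$ denotes the base of the natural logarithm.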
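Steps 31 to 61 can be sketched as a short Python routine. Several details here are assumptions rather than statements of the source: the inverse document frequency is taken in its common unsmoothed form, log(|D| / number of logs containing the word); a word that occurs in several logs keeps its highest per-log TF-IDF value; and the function name and the `top_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def extract_abnormal_words(logs, top_ratio=0.02):
    """Return {abnormal_word: preset_abnormal_value}.

    logs: list of already segmented abnormal logs (each a list of words).
    """
    n_docs = len(logs)
    # Number of logs in which each word appears at least once.
    doc_freq = Counter(word for log in logs for word in set(log))
    best = {}
    for log in logs:
        counts = Counter(log)
        total = sum(counts.values())
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])  # assumed unsmoothed idf
            # Assumption: keep the highest TF-IDF seen for the word.
            best[word] = max(best.get(word, 0.0), tf * idf)
    ranked = sorted(best, key=best.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_ratio))
    # Preset abnormal value: TF-IDF divided by the natural logarithm base.
    return {word: best[word] / math.e for word in ranked[:keep]}
```

The returned dictionary has the shape needed to build the automaton: each abnormal word maps to its preset abnormal value. Words added by expert rules can be merged into the same dictionary with higher values.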
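After the words to be processed have been obtained, predetermined rules may also be used to extract abnormal words from them and to assign corresponding preset abnormal values. To emphasize the abnormal words extracted by rules, their preset abnormal values may be set higher than those of the abnormal words extracted by TF-IDF value, according to actual application requirements.
In one possible case, after the abnormal logs are segmented into the words to be processed, the method may further include:
Step 71: Extract target abnormal words from the words to be processed according to preset rules, and assign corresponding preset abnormal values to the target abnormal words;
Step 72: Add the target abnormal words to the abnormal vocabulary library.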
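S202. Build a dictionary tree (trie) from the abnormal vocabulary library, and mark the preset abnormal value on the node of the trie that corresponds to each abnormal word.
It should be noted that the embodiments of the present application do not limit how the trie is constructed; related technology may be consulted. The trie should satisfy the following conditions: 1. the root node contains no character, and every other node contains exactly one character; 2. concatenating the characters on the path from the root node to a given node yields the string corresponding to that node; 3. the children of any node all contain different characters. Once the trie has been built, the preset abnormal values can be marked on the nodes corresponding to abnormal words for use in the later calculation of the total abnormal value.
S203. Use breadth-first search to compute prefix pointers on the trie, thereby constructing the failure pointers in the trie and obtaining the finite state automaton.
It should be noted that the embodiments of the present application do not limit how the failure pointers are constructed; related technology may be consulted.
Based on the above embodiment, the embodiments of the present application can construct the finite state automaton needed for abnormal log detection in the way an AC automaton is constructed, which ensures efficient abnormal log detection while occupying few computing resources.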
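Based on the above embodiment, before the finite state automaton is used to detect log information, existing normal log templates may also be used to filter the log information so as to extract the target log information that is more likely to be abnormal; the finite state automaton is then applied to the target log information, which improves detection efficiency. In one possible case, before the finite state automaton constructed from abnormal words is used to detect the target abnormal words contained in the log information, the method may further include:
S301. Use the log information to generate a log vector to be detected, and calculate the similarity value between the log vector to be detected and the normal log vector corresponding to a normal log template.
In the embodiments of the present application, before log information is input to the finite state automaton, a corresponding log vector to be detected is first generated. The similarity between this vector and the normal log vector of each normal log template is then calculated to determine how closely the log information resembles each normal log template. When the log information differs from every normal log template, that is, when the similarity between the log vector to be detected and every normal log vector is less than the preset threshold, the log information is more likely to be abnormal and should be examined with the finite state automaton. A normal log template is the document template used by ordinary normal log information; it can be specified manually or determined automatically by classification. Each element of a log vector is generated from a word in the log information. For example, the log text can first be segmented into words, the first letter of each word extracted, and the sequence of first letters used as the log vector. For the log "log(error):hello world.", segmentation at punctuation marks yields four words (log, error, hello, world), so the feature vector of this log is [l, e, h, w].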
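It should be noted that the embodiments of the present application do not limit how the similarity is calculated; cosine similarity, Euclidean distance or edit distance may be used, for example. To improve computational efficiency, the embodiments of the present application may calculate the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template as follows:
$$similarity(a,b)=\frac{\sum_{i=1}^{\min(len(a),\,len(b))}(a_i=b_i)}{\max(len(a),\,len(b))}$$
where $a$ denotes the log vector to be detected, $b$ denotes the normal log vector, and $similarity(a,b)$ denotes the similarity value; $a_i$ and $b_i$ denote the i-th word of the log vector to be detected and the i-th word of the normal log vector respectively; $a_i=b_i$ is a Boolean operation whose value is 1 when $a_i$ and $b_i$ are equal and 0 when they are not; $\min(\cdot)$ is the minimum function, $\max(\cdot)$ is the maximum function, and $len(\cdot)$ is the vector length.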
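The first-letter log vector and the position-wise similarity described above can be sketched in Python as follows. The function names are hypothetical, and two choices are assumptions: words are split at any run of non-alphanumeric characters, and two empty vectors are treated as identical.

```python
import re


def log_vector(log_text):
    """Split at non-alphanumeric characters; keep each word's first character."""
    words = [w for w in re.split(r"[^0-9A-Za-z]+", log_text) if w]
    return [w[0] for w in words]


def similarity(a, b):
    """Count equal elements at equal positions, divided by the longer length."""
    if not a and not b:
        return 1.0  # assumption: two empty vectors count as identical
    matches = sum(1 for x, y in zip(a, b) if x == y)  # zip stops at the shorter vector
    return matches / max(len(a), len(b))
```

For the log "log(error):hello world." the vector is [l, e, h, w]. Compared with a template vector [l, e, h], three positions match and the longer length is 4, giving 0.75; with the example threshold of 0.8 this log would be passed on to the finite state automaton.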
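S302. When the similarity value is determined to be less than the second preset threshold, proceed to the step of using the finite state automaton constructed from abnormal words to detect the target abnormal words contained in the log information.
It should be noted that the embodiments of the present application do not limit the value of the second preset threshold, which can be set according to actual application requirements; it may, for example, be set to 0.8.
The automatic selection of normal log templates is described in detail below. Before the log information is used to generate the log vector to be detected, the method may further include:
Step 81: Obtain all original log templates, and use each original log template to generate a corresponding log template vector.
The original log templates here include both normal log templates and abnormal log templates. For the method of generating a log template vector, refer to the above embodiment; it is not repeated here.
Step 82: Classify the log template vectors to obtain template categories, and sort the template categories in descending order of the number of log template vectors in each category;
It should be noted that the embodiments of the present application do not limit how the log template vectors are classified; they may, for example, be classified automatically by clustering. For convenience, the embodiments may simply decide whether two log template vectors belong to the same category by whether their similarity exceeds a preset threshold. In one possible case, classifying the log template vectors to obtain template categories may include: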
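Step 91: Create a template kernel vector set and set the first log template vector as the vector to be processed.
The template kernel vector set is an empty set when it is first created.
Step 92: When it is determined that the template kernel vector set is empty, or that it contains no target template kernel vector whose similarity to the vector to be processed is greater than the fourth preset threshold, set the vector to be processed as a template kernel vector and add it to the template kernel vector set.
A template kernel vector in the embodiments of the present application is the representative vector of a template category. To determine the template category of a vector to be processed, the similarity between that vector and each template kernel vector is calculated first. If the similarity does not exceed the preset threshold, the vector to be processed and the corresponding template kernel vector do not belong to the same category; if it does, they can belong to the same category. It follows that when the similarity between the vector to be processed and every template kernel vector does not exceed the preset threshold, the vector belongs to no existing template category; it can then be set as the template kernel vector of a new template category and added to the template kernel vector set. The embodiments do not limit the value of the fourth preset threshold, which can be set according to actual application requirements; it may, for example, be 0.8. For the calculation of the similarity, refer to the above embodiment; it is not repeated here. If the set contains no template kernel vector, the vector to be processed can be directly set as a template kernel vector and added to the set.
Step 93: When it is determined that a target template kernel vector exists in the template kernel vector set, add the vector to be processed to the template category corresponding to the target template kernel vector with the smallest lexicographic order.
The vector to be processed may be highly similar to several target template kernel vectors in the set. In that case the embodiments of the present application preferably add it to the template category of the target template kernel vector that is smallest in lexicographic order among them, where lexicographic order means ordering by alphabetical sequence.
Step 94: Set the next log template vector as the vector to be processed and repeat the above steps until all log template vectors have been processed. After the log template vectors have been classified, the template categories are sorted in descending order of the number of log template vectors in each category. This is because abnormal log templates have low similarity to normal log templates and make up a small share of all original log templates. In other words, the abnormal template categories formed by abnormal log templates differ markedly from the normal template categories formed by normal log templates, and they contain far fewer log template vectors. It is therefore sufficient to sort the template categories in descending order of the number of log template vectors they contain and take the categories with the most log template vectors from the sorted sequence to obtain the normal log templates.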
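Steps 91 to 94 amount to a single greedy pass over the log template vectors, which can be sketched in Python as follows. The function name, the use of tuples as dictionary keys and the `similarity` callable passed in by the caller are illustrative assumptions; any similarity measure that returns a value between 0 and 1 can be supplied.

```python
def classify_templates(vectors, similarity, threshold=0.8):
    """Greedy one-pass grouping of log template vectors.

    Returns {kernel_vector (tuple): [member vectors]}.
    """
    kernels = []      # template kernel vector set, empty at creation
    categories = {}   # kernel vector -> vectors assigned to its category
    for vec in vectors:
        # Target kernel vectors: those more similar than the threshold.
        targets = [k for k in kernels if similarity(vec, k) > threshold]
        if not targets:
            # No kernel is close enough (or the set is empty):
            # the vector becomes the kernel of a new category.
            kernel = tuple(vec)
            kernels.append(kernel)
            categories[kernel] = [vec]
        else:
            # Several kernels may qualify; choose the lexicographically
            # smallest one (tuples compare element by element).
            categories[min(targets)].append(vec)
    return categories
```

Because each vector is compared only with the kernel vectors, not with every earlier vector, the result depends on the order in which the templates are processed; this matches the sequential procedure described in the steps.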
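Step 83: Take the template categories from the sorted sequence one by one and accumulate the number of log template vectors in each; after each accumulation, calculate the ratio of the current accumulated number to the total number of log templates;
Step 84: When the ratio is determined to be greater than the third preset threshold, set the original log templates corresponding to the log template vectors contained in the accumulated template categories as normal log templates.
It should be noted that the embodiments of the present application do not limit the value of the third preset threshold. It may, for example, be 98%, in which case the log templates making up the top 98% of the total are taken as normal log templates.
Based on the above embodiment, before the finite state automaton is used to detect log information, the embodiments of the present application can also use existing normal log templates to filter the log information, extracting the target log information that is more likely to be abnormal and applying the finite state automaton only to that target log information, which improves detection efficiency.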
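The accumulation in steps 83 and 84 can be sketched in Python as follows. The function name and the input layout (one list of original templates per template category) are hypothetical; as in the steps above, the category whose addition pushes the ratio past the threshold is itself included among the normal templates.

```python
def select_normal_templates(categories, ratio_threshold=0.98):
    """categories: iterable of lists, one list of templates per category."""
    ordered = sorted(categories, key=len, reverse=True)  # largest category first
    total = sum(len(members) for members in ordered)
    normal, accumulated = [], 0
    for members in ordered:
        normal.extend(members)
        accumulated += len(members)
        # Ratio of the accumulated count to the total number of templates.
        if accumulated / total > ratio_threshold:
            break
    return normal
```

For example, with categories of sizes 6, 3 and 1 and a threshold of 0.8, the ratios after each accumulation are 0.6 and 0.9, so the templates of the first two categories are selected and the single template of the smallest category is left out.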
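Based on the above embodiments, the abnormal log detection method is described below using an optional example. Referring to Figure 3, which is a flow chart of another abnormal log detection method provided by an embodiment of the present application, the method may include:
1. Use a clustering algorithm to model all system log templates, and take the system log templates making up the top 98% of the total as normal log templates;
2. Obtain the error log data of the various systems and segment it into words.
3. Calculate the TF-IDF score of every word in all the error log data, and add the words ranked in the top 2% by score to the abnormal word list as abnormal words. The default abnormal score of word $w_i$ is $\text{tf-idf}_i / e$, where $\text{tf-idf}_i$ is the TF-IDF score of the word and $e$ is the base of the natural logarithm.
4. Use the knowledge of operation and maintenance experts to extract abnormal words from the segmented error logs of step 2, assign them corresponding abnormal scores, and add them to the abnormal word list. Abnormal words extracted with expert knowledge should be highly discriminative, that is, their scores should be clearly higher than most of the scores assigned in step 3.
5. Use the Aho-Corasick algorithm to build a finite state automaton (DFA) from the abnormal vocabulary library, and mark the abnormal score of each word on the corresponding node.
6. For the real-time log data generated by the system, extract the log vector to be detected of each log and perform preliminary filtering with the normal log template library. If the log belongs to some category of normal log template (similarity >= 0.8), it is determined to be a normal log; otherwise it is preliminarily determined to be an abnormal log;
7. For each log preliminarily determined to be abnormal, use the finite state automaton from step 5 and the dynamic programming algorithm to calculate its abnormal score.
8. Set an alarm threshold α. If the abnormal score calculated in step 7 is greater than α, the log is determined to be an abnormal log and an alarm is raised.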
下面对本申请实施例提供的异常日志检测装置、电子设备及非易失性可读存储介质进行介绍,下文描述的异常日志检测装置、电子设备及非易失性可读存储介质与上文描述的异常日志检测方法可相互对应参照。
请参考图4,图4为本申请实施例所提供的一种异常日志检测装置的结构框图,该装置可以包括:
获取模块401,被设置为获取日志信息;
检测模块402,被设置为利用由异常词汇构建的有限状态自动机检测日志信息中包含的目标异常词汇,并利用动态规划算法及目标异常词汇对应的预设异常值确定日志信息对应的总异常值;
判定模块403,被设置为当确定总异常值大于第一预设阈值时,判定日志信息为异常日志。
可选地有限状态自动机为AC自动机,检测模块402,可以包括:
匹配子模块,被设置为将日志信息中的字符依次输入至AC自动机中进行匹配,确定字符在AC自动机中对应的节点及节点对应的状态;
查找子模块,被设置为当状态具有对应的异常词汇时,通过失败指针查找节点与根节点间的其他节点对应的其他异常词汇;
计算子模块,被设置为将状态对应的异常词汇和其他异常词汇设置为字符对应的目标异常词汇,并利用动态规划算法及字符的目标异常词汇对应的预设异常值确定总异常值。
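Optionally, the calculation submodule is configured to:
calculate the total abnormal value with the dynamic programming algorithm and the preset abnormal values of the character's target abnormal words as follows:
$$f(s_n)=\begin{cases} f(s_{n-1}), & state_n \neq error\_word \\ f(s_{n-1}) + score(state_n), & state_n = error\_word \end{cases}$$
where $s$ denotes the string of the log information; $s_{n-1}$ and $s_n$ denote the (n-1)-th and n-th characters of the string; $f(s_{n-1})$ and $f(s_n)$ denote the total abnormal values corresponding to $s_{n-1}$ and $s_n$; $state_n$ denotes the state corresponding to the character $s_n$; $state_n \neq error\_word$ means that $state_n$ has no corresponding target abnormal word; $state_n = error\_word$ means that $state_n$ has a corresponding target abnormal word; and $score(state_n)$ denotes the sum of the preset abnormal values corresponding to the target abnormal words of the character $s_n$.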
可选地,该装置还可以包括:
过滤模块,被设置为利用日志信息生成待检测日志向量,并计算待检测日志向量与正常日志模板对应的正常日志向量之间的相似度值;
检测模块,还被设置为当确定相似度值小于第二预设阈值时,进入利用由异常词汇构建的有限状态自动机检测日志信息中包含的目标异常词汇的步骤。
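Optionally, the filtering module may include:
a similarity value calculation submodule, configured to calculate the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template as follows:
$$similarity(a,b)=\frac{\sum_{i=1}^{\min(len(a),\,len(b))}(a_i=b_i)}{\max(len(a),\,len(b))}$$
where $a$ denotes the log vector to be detected, $b$ denotes the normal log vector, and $similarity(a,b)$ denotes the similarity value; $a_i$ and $b_i$ denote the i-th word of the log vector to be detected and the i-th word of the normal log vector respectively; the value of $a_i=b_i$ is 1 when $a_i$ and $b_i$ are equal and 0 when they are not; $\min(\cdot)$ is the minimum function, $\max(\cdot)$ is the maximum function, and $len(\cdot)$ is the vector length.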
可选地,该装置还可以包括:
模板获取模块,被设置为获取所有原始日志模板,并利用各原始日志模板生成对应的日志模板向量;
分类模块,被设置为对日志模板向量进行分类得到模板类别,并根据各模板类别对应的日志模板向量数量,按从大到小的顺序对模板类别进行排序;
累加模块,被设置为从排序序列中依次提取模板类别对应的日志模板向量数量进行累加,并在每次累加结束后,计算当前累加数量与原始日志模板总数量间的比值;
设置模块,被设置为当确定比值大于第三预设阈值时,将已累加的模板类别所包含的日志模板向量对应的原始日志模板设置为正常日志模板。
可选地,分类模块,可以包括:
第一设置子模块,被设置为创建模板核向量集合,并将首个日志模板向量设置为待处理向量;
第一处理子模块,被设置为当确定模板核向量集合为空,或模板核向量集合中不存在与待处理向量间的相似度大于第四预设阈值的目标模板核向量时,将待处理向量设置为模板核向量并添加至模板核向量集合;
第二处理子模块,被设置为当确定模板核向量集合中存在目标模板核向量时,将待处理向量添加至字典序最小的目标模板核向量对应的模板类别中;
第二设置子模块,被设置为对下一日志模板向量进入设置为待处理向量的步骤,直至完成对所有日志模板向量的处理。
可选地,有限状态自动机为AC自动机,该装置还可以包括:
异常词库获取模块,被设置为获取异常词库;异常词库包含多个异常词汇,每一异常词汇均有对应的预设异常值;
字典树构建模块,被设置为利用异常词库构建字典树,并在字典树中为与异常词汇对应的节点标注预设异常值;
前缀指针计算模块,被设置为使用广度优先搜索对字典树进行前缀指针计算,以在字典树中构造失败指针,得到AC自动机。
可选地,异常词库获取模块,可以包括:
异常日志获取子模块,被设置为获取异常日志,并对异常日志进行分词得到待处理词汇;
TF-IDF处理子模块,被设置为计算待处理词汇对应的TF-IDF值,并根据TF-IDF值从待处理词汇中提取异常词汇;
第一添加子模块,被设置为将异常词汇添加至异常词库。
可选地,TF-IDF处理子模块,可以包括:
异常词汇提取单元,被设置为按照TF-IDF值从高到低的顺序,将前预设比例的待处理词汇设置为异常词汇,并利用TF-IDF值为异常词汇设置对应的预设异常值。
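Optionally, the abnormal word extraction unit may include:
a preset abnormal value setting subunit, configured to use the TF-IDF value to set the corresponding preset abnormal value for an abnormal word as follows:
$$score_i=\frac{\text{tf-idf}_i}{e}$$
where $\text{tf-idf}_i$ denotes the TF-IDF value of the i-th abnormal word and $e$ denotes the base of the natural logarithm.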
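Optionally, the TF-IDF processing sub-module may include:
a TF-IDF calculation unit, configured to calculate the TF-IDF value of each candidate word as follows:
$\text{tf-idf}_i = tf(t, d) \cdot idf(t, D)$;
where $\text{tf-idf}_i$ denotes the TF-IDF value of the $i$-th candidate word, $t$ denotes the $i$-th candidate word, $d$ denotes an abnormal log, and $D$ denotes the set of all abnormal logs; $tf(t, d)$ denotes the term frequency of word $t$, calculated as follows:

where $t' \in d$ ranges over all words in the abnormal log; $idf(t, D)$ denotes the inverse document frequency of word $t$, calculated as follows: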
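The term-frequency and inverse-document-frequency formulas are not reproduced in this text, so the following Python sketch substitutes the common textbook definitions: tf as the word's count divided by the number of words in the log, and idf as the natural logarithm of the number of logs divided by the number of logs containing the word. Both definitions, and all function names, are assumptions and may differ from the formulas in the application.

```python
import math
from collections import Counter
from typing import Sequence


def tf(term: str, doc: Sequence[str]) -> float:
    """Term frequency: occurrences of term over all word occurrences in doc."""
    if not doc:
        return 0.0
    return Counter(doc)[term] / len(doc)


def idf(term: str, corpus: Sequence[Sequence[str]]) -> float:
    """Inverse document frequency (assumed form: ln(|D| / document frequency))."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term absent from every log gets weight 0
    return math.log(len(corpus) / containing)


def tf_idf(term: str, doc: Sequence[str], corpus: Sequence[Sequence[str]]) -> float:
    """TF-IDF value as the product tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, corpus)
```

Under these assumed definitions, a word that appears in every abnormal log receives a TF-IDF value of 0, while a word concentrated in a few logs scores higher and is more likely to fall in the top preset proportion.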
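Optionally, the abnormal lexicon acquisition module may further include:
an abnormal word extraction sub-module, configured to extract target abnormal words from the candidate words according to preset rules and to assign each target abnormal word a corresponding preset abnormal value;
a second adding sub-module, configured to add the target abnormal words to the abnormal lexicon.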
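Referring to FIG. 5, which is a structural block diagram of an electronic device provided by an embodiment of the present application, an embodiment of the present application further provides an electronic device, including:
a memory 501, configured to store a computer program;
a processor 502, configured to implement the steps of the abnormal log detection method described above when executing the computer program.
Because the embodiments of the electronic device correspond to the embodiments of the abnormal log detection method, refer to the description of the method embodiments for details of the electronic device; they are not repeated here.
Referring to FIG. 6, which is a structural block diagram of a non-volatile readable storage medium provided by an embodiment of the present application, an embodiment of the present application further provides a non-volatile readable storage medium 601 on which a computer program is stored; when executed by a processor, the computer program implements the steps of the abnormal log detection method of any of the above embodiments.
Because the embodiments of the non-volatile readable storage medium correspond to the embodiments of the abnormal log detection method, refer to the description of the method embodiments for details of the storage medium; they are not repeated here.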
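The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts the embodiments share can be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the disclosed method, its description is relatively brief; refer to the method description for the relevant details.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the embodiments of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The abnormal log detection method, apparatus, electronic device, and non-volatile readable storage medium provided by the embodiments of the present application have been described in detail above. Optional specific examples are used herein to explain the principles and implementations of the embodiments; the description of the above embodiments is intended only to aid understanding of the method and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the embodiments without departing from their principles, and such improvements and modifications also fall within the scope of protection of the claims of the embodiments of the present application.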

Claims (20)

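  1. An abnormal log detection method, characterized by comprising:
    acquiring log information;
    detecting target abnormal words contained in the log information using a finite state automaton built from abnormal words, and determining a total abnormal value corresponding to the log information using a dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words;
    determining that the log information is an abnormal log when the total abnormal value is determined to be greater than a first preset threshold.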
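  2. The abnormal log detection method according to claim 1, characterized in that the finite state automaton is an AC automaton, and the detecting of the target abnormal words contained in the log information using the finite state automaton built from abnormal words, and the determining of the total abnormal value corresponding to the log information using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words, comprise:
    inputting the characters of the log information into the AC automaton one by one for matching, and determining the node corresponding to each character in the AC automaton and the state corresponding to the node;
    when the state has a corresponding abnormal word, searching, via failure pointers, for other abnormal words corresponding to the other nodes between the node and the root node;
    setting the abnormal word corresponding to the state and the other abnormal words as the target abnormal words corresponding to the character, and determining the total abnormal value using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words of the character.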
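  3. The abnormal log detection method according to claim 2, characterized in that the determining of the total abnormal value using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words of the character comprises:
    calculating the total abnormal value using the dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words of the character, as follows:

    $$
    f(s_n) =
    \begin{cases}
    f(s_{n-1}), & state_n \neq error\_word \\
    f(s_{n-1}) + score(state_n), & state_n = error\_word
    \end{cases}
    $$

    where $s$ denotes the character string of the log information; $s_{n-1}$ and $s_n$ denote the $(n-1)$-th and $n$-th characters of the string; $f(s_{n-1})$ and $f(s_n)$ denote the total abnormal values corresponding to $s_{n-1}$ and $s_n$; $state_n$ denotes the state corresponding to character $s_n$; $state_n \neq error\_word$ indicates that $state_n$ has no corresponding target abnormal word; $state_n = error\_word$ indicates that $state_n$ has a corresponding target abnormal word; and $score(state_n)$ denotes the sum of the preset abnormal values of the target abnormal words of character $s_n$.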
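  4. The abnormal log detection method according to claim 1, characterized in that, before the detecting of the target abnormal words contained in the log information using the finite state automaton built from abnormal words, the method further comprises:
    generating a log vector to be detected from the log information, and calculating a similarity value between the log vector to be detected and the normal log vector corresponding to a normal log template;
    when the similarity value is determined to be less than a second preset threshold, proceeding to the step of detecting the target abnormal words contained in the log information using the finite state automaton built from abnormal words.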
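  5. The abnormal log detection method according to claim 4, characterized in that the calculating of the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template comprises:
    calculating the similarity value between the log vector to be detected and the normal log vector corresponding to the normal log template as follows:

    $$
    similarity(a, b) = \frac{\sum_{i=1}^{\min(len(a),\, len(b))} (a_i = b_i)}{\max(len(a),\, len(b))}
    $$

    where $a$ denotes the log vector to be detected, $b$ denotes the normal log vector, and $similarity(a, b)$ denotes the similarity value; $a_i$ and $b_i$ denote the $i$-th word of the log vector to be detected and of the normal log vector, respectively; the expression $a_i = b_i$ evaluates to 1 when $a_i$ and $b_i$ are equal and to 0 when they are not; $\min(\cdot)$ is the minimum function, $\max(\cdot)$ is the maximum function, and $len(\cdot)$ is the vector length.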
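  6. The abnormal log detection method according to claim 4, characterized in that, before the generating of the log vector to be detected from the log information, the method further comprises:
    acquiring all original log templates, and generating a corresponding log template vector from each original log template;
    classifying the log template vectors into template categories, and sorting the template categories in descending order of the number of log template vectors each contains;
    taking the template categories from the sorted sequence one by one and accumulating their log template vector counts, and, after each accumulation, calculating the ratio of the current accumulated count to the total number of original log templates;
    when the ratio is determined to be greater than a third preset threshold, setting the original log templates corresponding to the log template vectors in the accumulated template categories as the normal log templates.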
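  7. The abnormal log detection method according to claim 6, characterized in that the classifying of the log template vectors into template categories comprises:
    creating a template kernel vector set, and setting the first log template vector as the vector to be processed;
    when the template kernel vector set is determined to be empty, or when the set contains no target template kernel vector whose similarity to the vector to be processed is greater than a fourth preset threshold, setting the vector to be processed as a template kernel vector and adding it to the template kernel vector set;
    when the template kernel vector set is determined to contain target template kernel vectors, adding the vector to be processed to the template category of the lexicographically smallest target template kernel vector;
    returning to the step of setting a vector as the vector to be processed for the next log template vector, until all log template vectors have been processed.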
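  8. The abnormal log detection method according to any one of claims 1 to 7, characterized in that the finite state automaton is an AC automaton, and, before the detecting of the target abnormal words contained in the log information using the finite state automaton built from abnormal words, the method further comprises:
    acquiring an abnormal lexicon; the abnormal lexicon contains a plurality of the abnormal words, each of which has a corresponding preset abnormal value;
    building a trie from the abnormal lexicon, and labeling the nodes corresponding to the abnormal words in the trie with the preset abnormal values;
    computing prefix pointers over the trie using breadth-first search so as to construct failure pointers in the trie, yielding the AC automaton.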
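  9. The abnormal log detection method according to claim 8, characterized in that the acquiring of the abnormal lexicon comprises:
    acquiring abnormal logs, and segmenting the abnormal logs into candidate words;
    calculating the TF-IDF value of each candidate word, and extracting the abnormal words from the candidate words according to the TF-IDF values;
    adding the abnormal words to the abnormal lexicon.
  10. The abnormal log detection method according to claim 9, characterized in that the extracting of the abnormal words from the candidate words according to the TF-IDF values comprises:
    setting the top preset proportion of candidate words, in descending order of TF-IDF value, as the abnormal words, and setting a corresponding preset abnormal value for each abnormal word using its TF-IDF value.
  11. The abnormal log detection method according to claim 10, characterized in that the setting of the corresponding preset abnormal value for each abnormal word using its TF-IDF value comprises:
    setting the corresponding preset abnormal value for each abnormal word from its TF-IDF value as follows:
    where $\text{tf-idf}_i$ denotes the TF-IDF value of the $i$-th abnormal word and $e$ denotes the base of the natural logarithm.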
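  12. The abnormal log detection method according to claim 9, characterized in that the calculating of the TF-IDF value of each candidate word comprises:
    calculating the TF-IDF value of each candidate word as follows:
    $\text{tf-idf}_i = tf(t, d) \cdot idf(t, D)$;
    where $\text{tf-idf}_i$ denotes the TF-IDF value of the $i$-th candidate word, $t$ denotes the $i$-th candidate word, $d$ denotes an abnormal log, and $D$ denotes the set of all the abnormal logs; $tf(t, d)$ denotes the term frequency of word $t$, calculated as follows:
    where $t' \in d$ ranges over all words in the abnormal log; $idf(t, D)$ denotes the inverse document frequency of word $t$, calculated as follows: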
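  13. The abnormal log detection method according to claim 9, characterized in that, after the segmenting of the abnormal logs into candidate words, the method further comprises:
    extracting target abnormal words from the candidate words according to preset rules, and assigning each target abnormal word a corresponding preset abnormal value;
    adding the target abnormal words to the abnormal lexicon.
  14. The abnormal log detection method according to claim 8, characterized in that the acquiring of the abnormal lexicon comprises:
    collecting abnormal logs that contain abnormal information;
    building the abnormal lexicon from the abnormal words contained in the abnormal logs.
  15. The abnormal log detection method according to claim 13, characterized in that the preset abnormal value assigned to a target abnormal word is higher than the preset abnormal value set for an abnormal word using its TF-IDF value.
  16. The abnormal log detection method according to claim 4, characterized in that the generating of the log vector to be detected from the log information comprises:
    segmenting the log information into log text words, extracting the first letter of each log text word, and using the sequence formed by those first letters as the log vector to be detected.
  17. The abnormal log detection method according to claim 4, characterized in that the normal log template is the document template used by normal log information.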
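  18. An abnormal log detection apparatus, characterized by comprising:
    an acquisition module, configured to acquire log information;
    a detection module, configured to detect target abnormal words contained in the log information using a finite state automaton built from abnormal words, and to determine a total abnormal value corresponding to the log information using a dynamic programming algorithm and the preset abnormal values corresponding to the target abnormal words;
    a determination module, configured to determine that the log information is an abnormal log when the total abnormal value is determined to be greater than a first preset threshold.
  19. An electronic device, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the abnormal log detection method according to any one of claims 1 to 17 when executing the computer program.
  20. A non-volatile readable storage medium, characterized in that the non-volatile readable storage medium stores computer-executable instructions which, when loaded and executed by a processor, implement the abnormal log detection method according to any one of claims 1 to 17.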
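
The vector generation recited in claim 16 (segment the log, keep the first letter of each word) can be illustrated with a short Python sketch. It is only an illustration: splitting on whitespace stands in for the unspecified word segmentation step, and the function name is hypothetical.

```python
from typing import List


def log_vector(message: str) -> List[str]:
    """Build a log vector from the first character of each word.

    Whitespace splitting is an assumed stand-in for word segmentation.
    """
    return [word[0] for word in message.split()]
```

Two logs produced by the same template tend to yield the same first-letter sequence even when variable fields such as identifiers or addresses differ, which is what makes a position-wise comparison against normal log vectors workable as a cheap pre-filter.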
PCT/CN2023/071830 2022-08-12 2023-01-11 Abnormal log detection method and apparatus, electronic device, and storage medium WO2024031930A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210964876.2A CN115034220B (zh) 2022-08-12 2022-08-12 Abnormal log detection method and apparatus, electronic device, and storage medium
CN202210964876.2 2022-08-12

Publications (1)

Publication Number Publication Date
WO2024031930A1 (zh) 2024-02-15

Family

ID=83130585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071830 WO2024031930A1 (zh) 2022-08-12 2023-01-11 Abnormal log detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115034220B (zh)
WO (1) WO2024031930A1 (zh)

Also Published As

Publication number Publication date
CN115034220A (zh) 2022-09-09
CN115034220B (zh) 2023-01-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851166

Country of ref document: EP

Kind code of ref document: A1