CN117332083A - Log clustering method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117332083A
CN117332083A
Authority
CN
China
Prior art keywords
word
frequent
log data
linear
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311209999.6A
Other languages
Chinese (zh)
Inventor
陈浩
周伯仰
钱继安
黄星焱
李刚
陈金牛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN202311209999.6A
Publication of CN117332083A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a log clustering method and device, an electronic device, and a storage medium. Frequent words in each piece of log data are determined according to a preset support threshold. After each first candidate cluster and each corresponding first linear template are determined according to the frequent words in each piece of log data, the similarity between the first linear templates is determined according to the frequent words they contain; the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold are then merged, and each target cluster and each corresponding target linear template are determined. Merging the first candidate clusters in this way mitigates the scattered or overfitted log clustering caused by an inaccurate preset support threshold, and improves the accuracy of log clustering.

Description

Log clustering method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular to a log clustering method and device, an electronic device, and a storage medium.
Background
Data centers and computer applications have advanced in data processing capacity, scale, and complexity. Application systems typically generate large amounts of log data every day, on the order of tens of GB or even hundreds of GB. For example, a security log management system may receive approximately 1 million events per day. This places great operational pressure on operations personnel. To simplify the management of log data, many studies suggest using data mining methods to discover event patterns from event logs. These methods can ultimately serve many different purposes, for example: developing log event association rules, detecting system faults and network anomalies, visualizing event association patterns, identifying and reporting network traffic, and automatically constructing alert classifiers for intrusion detection systems.
Prior-art log clustering methods such as SLCT (Simple Logfile Clustering Tool) are designed to mine linear templates and abnormal events from logs. During clustering, SLCT assigns the event log lines that conform to the same linear template to the same cluster, and reports every detected cluster to the user as a linear template. To find clusters in the log data, the user must define a support threshold s, where s specifies the minimum number of lines in each cluster. SLCT begins clustering by making a pass over the input dataset to identify the frequent words, i.e. the words that occur in at least s lines. In addition, each frequent word records its position in the log line.
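As a rough illustration of this first pass (a minimal sketch, not SLCT itself; function names and the sample logs are invented for the example), frequent (word, position) pairs can be mined as follows:

```python
from collections import Counter

def mine_frequent_words(log_lines, s):
    """Find (word, position) pairs that occur in at least s log lines,
    keeping each frequent word's position in the line."""
    counts = Counter()
    for line in log_lines:
        for pos, word in enumerate(line.split()):
            counts[(word, pos)] += 1
    return {wp for wp, c in counts.items() if c >= s}

logs = [
    "Application Gateway start at node Sfmmbkas",
    "Application NGapp start at node Ngap",
    "Interface eth0 down at node router1",
]
frequent = mine_frequent_words(logs, s=2)
# ("Application", 0), ("start", 2), ("at", 3), ("node", 4) meet the threshold
```

Words such as "Gateway" or "router1" occur on only one line here and so are not frequent at s=2.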
The problem with the prior art is that the log clustering quality depends on a user-defined support threshold. The support threshold is set according to user experience, its accuracy cannot be guaranteed, and logs of different scales require different support thresholds. An incorrectly set support threshold leads to scattered or overfitted log clusters, so the accuracy of the log clustering is poor.
Disclosure of Invention
The application provides a log clustering method, a log clustering device, an electronic device, and a storage medium, which are used to solve the problem of poor log clustering accuracy in the prior art.
In a first aspect, the present application provides a log clustering method, the method including:
acquiring each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset support threshold;
according to the frequent words in each piece of log data, determining each first candidate cluster and each corresponding first linear template;
and determining the similarity between the first linear templates according to the frequent words in the first linear templates, merging the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold, and determining each target cluster and each corresponding target linear template.
In a second aspect, the present application provides a log clustering device, the device including:
the first determining module is used for acquiring each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset support threshold;
the second determining module is used for determining each first candidate cluster and each corresponding first linear template according to the frequent words in each piece of log data;
and the third determining module is used for determining the similarity between the first linear templates according to the frequent words in the first linear templates, merging the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold, and determining each target cluster and each corresponding target linear template.
In a third aspect, the present application provides an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
and the processor is configured to implement the steps of the above method when executing the program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium having a computer program stored therein, which, when executed by a processor, implements the steps of the above method.
The application provides a log clustering method, a log clustering device, an electronic device, and a storage medium, where the method includes: acquiring each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset support threshold; determining each first candidate cluster and each corresponding first linear template according to the frequent words in each piece of log data; and determining the similarity between the first linear templates according to the frequent words in the first linear templates, merging the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold, and determining each target cluster and each corresponding target linear template.
The technical scheme has the following advantages or beneficial effects:
in the method, frequent words in each piece of log data are determined according to a preset support threshold. After each first candidate cluster and each corresponding first linear template are determined according to the frequent words in each piece of log data, the similarity between the first linear templates is determined according to the frequent words they contain; the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold are then merged, and each target cluster and each corresponding target linear template are determined. Merging the first candidate clusters in this way mitigates the scattered or overfitted log clustering caused by an inaccurate preset support threshold, and improves the accuracy of log clustering.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a log clustering process provided in the present application;
FIG. 2 is a log clustering flowchart provided in the present application;
FIG. 3 is a schematic diagram of generating candidate class clusters and linear templates provided herein;
FIG. 4 is a schematic diagram illustrating candidate cluster merging provided herein;
FIG. 5 is an exemplary schematic diagram of candidate cluster connectivity provided herein;
fig. 6 is a schematic structural diagram of a log clustering device provided in the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
For clarity, exemplary implementations of the present application are described below completely with reference to the accompanying drawings, in which those exemplary implementations are illustrated. It is apparent that the described implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," "second," "third," and the like in the description, in the claims, and in the above drawings are used to distinguish between similar objects or entities and do not necessarily limit a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.
Fig. 1 is a schematic diagram of a log clustering process provided in the present application, where the process includes the following steps:
S101: and acquiring each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset supporting threshold value.
S102: and determining each first candidate cluster and each corresponding first linear template according to the frequent words in each piece of log data.
S103: and determining the similarity between the first linear templates according to the frequent words in the first linear templates, merging first candidate class clusters corresponding to the first linear templates with the similarity larger than a preset similarity threshold, and determining each target class cluster and each corresponding target linear template.
The log clustering method is applied to an electronic device, which may be a PC, a tablet computer, or another terminal device, or may be a server.
The electronic device first acquires each piece of log data to be clustered; a preset support threshold is stored in the electronic device. Frequent words in each piece of log data are then determined according to the preset support threshold: words whose occurrence frequency across the log data is greater than the preset support threshold are taken as frequent words. Each first candidate cluster is determined according to the frequent words in each piece of log data. For example, suppose the preset support threshold is 10 and the words whose occurrence frequency exceeds it include "Application", "start", "at", and "node". Then the log data "Application Gateway start at node Sfmmbkas" and the log data "Application NG app start at node Ngap" are log data in the same first candidate cluster. Preferably, when the frequent words are determined according to the preset support threshold, the variable words can also be determined, that is, words whose occurrence frequency is not greater than the preset support threshold are taken as variable words. A wildcard is used to replace the variable words, and the wildcard is also included in the first linear template corresponding to the first candidate cluster. For example, if the wildcard is denoted by the symbol "*", the first linear template corresponding to the first candidate cluster may be expressed as "Application * start at node *".
In order to make the first linear template corresponding to the first candidate cluster more precise, the wildcard may further carry a range for the number of variable words it replaces, in addition to the wildcard symbol. For example, the first linear template corresponding to the first candidate cluster is expressed as "Application *{1,2} start at node *{1,1}", where {1,2} and {1,1} are both ranges for the number of replaced variable words: {1,2} means that in each piece of log data in the cluster, at least one and at most two variable words occur between "Application" and "start", and {1,1} means that at least one and at most one variable word follows "node".
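The template construction described above can be sketched as follows (a simplified illustration that assumes every line in the cluster shares the same frequent-word sequence; the function name and sample data are invented):

```python
def build_template(lines, frequent_words):
    """Replace each run of variable (non-frequent) words with a wildcard
    carrying the min/max run length observed across the cluster's lines."""
    runs_per_slot = None
    for line in lines:
        slots, run = [], 0
        for w in line.split():
            if w in frequent_words:
                slots.append(run)  # variable-word run before this frequent word
                run = 0
            else:
                run += 1
        slots.append(run)          # trailing variable words
        if runs_per_slot is None:
            runs_per_slot = [[r] for r in slots]
        else:
            for acc, r in zip(runs_per_slot, slots):
                acc.append(r)
    freq_seq = [w for w in lines[0].split() if w in frequent_words]
    parts = []
    for i, word in enumerate(freq_seq + [None]):
        lo, hi = min(runs_per_slot[i]), max(runs_per_slot[i])
        if hi > 0:
            parts.append("*{%d,%d}" % (lo, hi))
        if word is not None:
            parts.append(word)
    return " ".join(parts)

cluster = [
    "Application Gateway start at node Sfmmbkas",
    "Application NG app start at node Ngap",
]
template = build_template(cluster, {"Application", "start", "at", "node"})
# → "Application *{1,2} start at node *{1,1}"
```

The first line has one variable word between "Application" and "start", the second has two, which yields the {1,2} range from the example above.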
After each first candidate cluster and each corresponding first linear template are determined, the similarity between first linear templates is determined according to the frequent words they contain. For any two first linear templates, the more identical frequent words the two templates contain, the greater the similarity; the fewer identical frequent words they contain, the smaller the similarity. Optionally, for any two first linear templates, the number of identical frequent words they contain and the total number of words they contain may be counted, and the ratio of the number of identical frequent words to the total number of words is taken as the similarity of the two templates. Alternatively, for any two first linear templates, the number of identical frequent words, the number of identical wildcards, and the total number of words and wildcards contained in the two templates may be counted; the sum of the number of identical frequent words and the number of identical wildcards is computed, and the ratio of this sum to the total number of words and wildcards in the two templates is taken as the similarity. Wildcards with the same range for the number of replaced variable words are regarded as identical wildcards.
For example, consider the first linear template "remote_addr 172.18.179.21 access @timestamp *{1,1}" and the first linear template "remote_addr *{1,1} access @timestamp 20210517164312". The first template contains the words "remote_addr", "172.18.179.21", "access", and "@timestamp" and the wildcard "*{1,1}"; the second contains the words "remote_addr", "access", "@timestamp", and "20210517164312" and the wildcard "*{1,1}". If the ratio of the number of identical frequent words to the total number of distinct words in the two templates is taken as the similarity, the similarity is 3/6 = 0.5. If the ratio of the sum of the number of identical frequent words and the number of identical wildcards to that total is taken as the similarity, the similarity is 4/6 ≈ 0.67.
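Treating each template as a set of tokens, the two similarity variants described above can be sketched as follows (the helper name is invented, and token spellings such as "remote_addr" and "@timestamp" are a tidied rendering of the example above):

```python
def template_similarity(t1, t2, count_wildcards=True):
    """Set-based similarity: shared tokens over the union of all tokens.
    A wildcard such as *{1,1} matches only an identical wildcard; with
    count_wildcards=False, only shared (frequent) words are counted."""
    a, b = set(t1.split()), set(t2.split())
    shared = a & b
    if not count_wildcards:
        shared = {w for w in shared if not w.startswith("*")}
    return len(shared) / len(a | b)

t1 = "remote_addr 172.18.179.21 access @timestamp *{1,1}"
t2 = "remote_addr *{1,1} access @timestamp 20210517164312"
sim_words = template_similarity(t1, t2, count_wildcards=False)  # 3/6 = 0.5
sim_all = template_similarity(t1, t2)                           # 4/6 ≈ 0.67
```

The two variants differ only in whether the shared "*{1,1}" wildcard is counted alongside the three shared frequent words.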
The electronic device stores a preset similarity threshold, for example 0.4 or 0.5. After the similarity between the first linear templates is determined, the first candidate clusters corresponding to first linear templates whose similarity is greater than the preset similarity threshold are merged, and each target cluster and each corresponding target linear template are determined. Continuing the example above, the similarity between the first linear template "remote_addr 172.18.179.21 access @timestamp *{1,1}" and the first linear template "remote_addr *{1,1} access @timestamp 20210517164312" is greater than the preset similarity threshold, and the target linear template obtained after merging is expressed as "remote_addr *{1,1} access @timestamp *{1,1}". The first candidate clusters corresponding to the two first linear templates are merged to obtain the target cluster.
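The merge step above can be sketched for equal-length templates: positions where the two templates disagree collapse to a "*{1,1}" wildcard (a simplified illustration with an invented helper name; real templates of differing lengths would need wider count ranges):

```python
def merge_templates(t1, t2):
    """Merge two similar, equal-length templates: tokens that agree are
    kept, tokens that differ become a *{1,1} wildcard."""
    merged = [x if x == y else "*{1,1}" for x, y in zip(t1.split(), t2.split())]
    return " ".join(merged)

merged = merge_templates(
    "remote_addr 172.18.179.21 access @timestamp *{1,1}",
    "remote_addr *{1,1} access @timestamp 20210517164312",
)
# → "remote_addr *{1,1} access @timestamp *{1,1}"
```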
In the method, frequent words in each piece of log data are determined according to a preset support threshold. After each first candidate cluster and each corresponding first linear template are determined according to the frequent words in each piece of log data, the similarity between the first linear templates is determined according to the frequent words they contain; the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold are then merged, and each target cluster and each corresponding target linear template are determined. Merging the first candidate clusters in this way mitigates the scattered or overfitted log clustering caused by an inaccurate preset support threshold, and improves the accuracy of log clustering.
In the present application, acquiring each piece of log data to be clustered and determining frequent words in each piece of log data according to a preset support threshold includes:
acquiring each piece of log data to be clustered, determining each word in the log data through a word segmentation algorithm, and taking the words whose occurrence frequency is greater than the preset support threshold as the frequent words.
After each piece of log data to be clustered is acquired, each piece is first tokenized by a word segmentation algorithm to determine its words; the words whose occurrence frequency across the log data is greater than the preset support threshold are then taken as frequent words. Word segmentation algorithms include, but are not limited to, the maximum matching algorithm, the shortest-path segmentation algorithm, the N-gram segmentation algorithm, the HMM segmentation algorithm, and the CRF segmentation algorithm.
In order to make the determination of frequent words more accurate, taking the words whose occurrence frequency is greater than the preset support threshold as the frequent words includes:
for each piece of log data, de-duplicating the words within that piece, and taking the words whose occurrence frequency after the per-piece de-duplication is greater than the preset support threshold as the frequent words.
In the present application, each piece of log data is de-duplicated, that is, only one copy of any identical word within a single piece of log data is retained. The occurrence frequency of each word across the de-duplicated pieces is then counted, and words whose frequency is greater than the preset support threshold are taken as the frequent words. It should be noted that the de-duplication is applied only when determining the frequent words; the original (non-de-duplicated) log data is used when determining the first candidate clusters and their corresponding first linear templates.
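The per-piece de-duplication described above can be sketched as follows (a minimal illustration; all names and the sample logs are invented):

```python
from collections import Counter

def find_frequent_words(log_lines, s):
    """Count each word at most once per log line (per-line de-duplication),
    then split words into frequent words (line count > support threshold s)
    and variable words (line count <= s)."""
    line_freq = Counter()
    for line in log_lines:
        for word in set(line.split()):   # de-duplicate within the line
            line_freq[word] += 1
    frequent = {w for w, c in line_freq.items() if c > s}
    variable = set(line_freq) - frequent
    return frequent, variable

logs = [
    "Application start start at node n1",   # "start" is counted once here
    "Application start at node n2",
    "Interface down at node router1",
]
frequent, variable = find_frequent_words(logs, s=1)
# frequent: {"Application", "start", "at", "node"}; the rest are variable
```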
Acquiring each piece of log data to be clustered and determining frequent words in each piece of log data according to a preset support threshold includes:
acquiring each piece of log data to be clustered, determining each word in the log data through a word segmentation algorithm, de-duplicating the words within each piece, taking the words whose occurrence frequency after the de-duplication is greater than the preset support threshold as the frequent words, and taking the words whose occurrence frequency after the de-duplication is not greater than the preset support threshold as the variable words.
Determining each first candidate cluster and each corresponding first linear template according to the frequent words in each piece of log data includes:
taking pieces of log data that contain the same frequent words with the same relative position information as log data of the same first candidate cluster; and replacing the variable words in the log data of the same first candidate cluster with wildcards to obtain the first linear template corresponding to that cluster, where the first linear template includes the frequent words and the wildcards, and each wildcard includes a wildcard symbol and a range for the number of variable words it replaces.
The relative position information of the frequent words in a piece of log data refers to their relative order within that piece. For example, if the frequent words are "Application", "start", "at", and "node", then the log data "Application Gateway start at node Sfmmbkas" and the log data "Application NG app start at node Ngap" contain the same frequent words with the same relative position information, so the two pieces are log data of the same first candidate cluster; the variable words in their log data are replaced with wildcards to obtain the first linear template corresponding to the cluster. In contrast, the log data "Application Gateway start at node Sfmmbkas" and the log data "NG app start at node Application Ngap" contain the same frequent words but with different relative position information, so they do not belong to the same first candidate cluster.
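Grouping log data by the sequence of frequent words they contain, as described above, can be sketched as follows (names and sample logs are illustrative):

```python
from collections import defaultdict

def first_candidate_clusters(log_lines, freq_words):
    """Group lines whose ordered sequence of frequent words (the frequent
    words plus their relative positions) is identical."""
    clusters = defaultdict(list)
    for line in log_lines:
        signature = tuple(w for w in line.split() if w in freq_words)
        clusters[signature].append(line)
    return dict(clusters)

logs = [
    "Application Gateway start at node Sfmmbkas",
    "Application NGapp start at node Ngap",
    "NGapp start at node Application Ngap2",   # same words, different order
]
clusters = first_candidate_clusters(logs, {"Application", "start", "at", "node"})
# the first two lines share one cluster; the third forms its own cluster
```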
Determining the similarity between the first linear templates according to the frequent words in the first linear templates includes:
for any two first linear templates, determining their similarity according to the number of identical frequent words, the number of identical wildcards, and the total number of words in the two templates.
Optionally, the sum of the number of identical frequent words and the number of identical wildcards in the two first linear templates is determined, and the ratio of this sum to the total number of words is taken as the similarity of the two first linear templates.
In order to further make each determined target cluster more accurate, in the present application, merging the first candidate clusters corresponding to first linear templates whose similarity is greater than the preset similarity threshold and determining each target cluster and each corresponding target linear template includes:
merging the first candidate clusters corresponding to first linear templates whose similarity is greater than the preset similarity threshold to obtain second candidate clusters and the second linear template corresponding to each second candidate cluster;
for each frequent word in each second linear template, determining the degree of fit of that frequent word according to the frequency with which it co-occurs with each other frequent word in the second candidate cluster corresponding to the second linear template and the total frequency with which each frequent word occurs on its own, and determining the frequent words to be merged according to the degree of fit of each frequent word and a preset fit threshold;
and merging the second candidate clusters whose second linear templates have the same frequent words but different frequent words to be merged, to obtain each target cluster and each corresponding target linear template.
In the present application, the first candidate clusters corresponding to first linear templates whose similarity is greater than the preset similarity threshold are merged; the merged cluster serves as a second candidate cluster, and the first linear templates are merged to obtain the second linear template. For example, suppose the second linear template is "Interface *{1,1} down at node router1", where "Interface", "down", "at", "node" and "router1" are frequent words. For "Interface", the frequency with which "Interface" occurs together with "down", with "at", with "node" and with "router1" is determined from the log data in the second candidate cluster corresponding to the template, and the individual occurrence frequencies of "Interface", "down", "at", "node" and "router1" are also determined. Optionally, a first sum is computed over the co-occurrence frequencies of "Interface" with "down", "at", "node" and "router1", a second sum is computed over the individual occurrence frequencies of "Interface", "down", "at", "node" and "router1", and the ratio of the first sum to the second sum is taken as the degree of fit of "Interface".
Preferably, the average of the co-occurrence frequencies of "Interface" with "down", "at", "node" and "router1" is computed, a second sum is computed over the individual occurrence frequencies of "down", "at", "node" and "router1", and the ratio of the average to the second sum is taken as the degree of fit of "Interface".
For the second linear template "Interface *{1,1} down at node router1", the degree of fit of each of the frequent words "Interface", "down", "at", "node" and "router1" can be determined by the above method. The electronic device stores a preset degree-of-fit threshold, and the frequent words in the second linear template whose degree of fit is smaller than this threshold are taken as the frequent words to be merged.
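The optional degree-of-fit calculation can be sketched as follows. Counting co-occurrence and occurrence per log line (each line contributing at most 1 to each count) is an assumption, since the application leaves the counting granularity implicit:

```python
def degree_of_fit(word, frequent_words, cluster_lines):
    """Degree of fit of `word` in one candidate cluster, following the
    'optional' variant: (sum of co-occurrence frequencies of `word` with
    every other frequent word) / (sum of the individual occurrence
    frequencies of all frequent words)."""
    others = [w for w in frequent_words if w != word]
    co = {w: 0 for w in others}            # co-occurrence counts with `word`
    occ = {w: 0 for w in frequent_words}   # individual occurrence counts
    for line in cluster_lines:
        tokens = set(line.split())
        for w in frequent_words:
            if w in tokens:
                occ[w] += 1
        if word in tokens:
            for w in others:
                if w in tokens:
                    co[w] += 1
    second_sum = sum(occ.values())
    return sum(co.values()) / second_sum if second_sum else 0.0
```

Frequent words that rarely co-occur with the rest of the template score low and become candidates for merging.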
The second candidate clusters whose second linear templates have the same frequent words but different frequent words to be merged are then merged to obtain each target cluster and each corresponding target linear template. For example, the frequent word to be merged in the second linear template "Interface *{1,1} down at node router1" is "router1", and the frequent word to be merged in the second linear template "Interface *{2,3} down at node router2" is "router2"; the remaining frequent words of the two templates are the same while the frequent words to be merged differ, so the two second linear templates are merged into the target linear template "Interface *{1,3} down at node router1|router2". The second candidate clusters corresponding to the two second linear templates are merged into one cluster, namely the target cluster corresponding to the target linear template. The "*{1,3}" in the target linear template is the result of merging the wildcards of the two second linear templates, and "router1|router2" is the result of merging their frequent words to be merged, where "|" denotes an OR relationship.
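The merge described above — widening wildcard ranges to cover both templates and joining differing frequent-words-to-merge with "|" — can be sketched as follows, again under the token-list and "*{min,max}" representation assumptions:

```python
import re

def merge_templates(t1, t2, merge_words):
    """Merge two second linear templates whose ordinary frequent words
    match and whose frequent-words-to-merge differ: wildcard ranges are
    widened to cover both, and differing words-to-merge become an OR
    alternative joined by '|'."""
    wc = re.compile(r"\*\{(\d+),(\d+)\}")
    merged = []
    for a, b in zip(t1, t2):
        ma, mb = wc.fullmatch(a), wc.fullmatch(b)
        if ma and mb:
            lo = min(int(ma.group(1)), int(mb.group(1)))
            hi = max(int(ma.group(2)), int(mb.group(2)))
            merged.append(f"*{{{lo},{hi}}}")   # widened wildcard range
        elif a == b:
            merged.append(a)                   # shared frequent word
        elif a in merge_words and b in merge_words:
            merged.append(f"{a}|{b}")          # OR of words to merge
        else:
            raise ValueError("templates are not mergeable")
    return merged
```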
Building on the preceding clustering method, the log data clustering method of the present application introduces a candidate cluster merging optimization and a candidate cluster connection strategy, which can optimize the candidate clusters and linear templates parsed from logs and abnormal log data, improving the accuracy of log clustering. In this method, the linear templates and candidate clusters are generated first; the similarity between linear templates is then evaluated, the candidate clusters corresponding to linear templates with higher similarity are merged, and the linear templates are updated accordingly. Cluster selection is then performed: after selection, the degree of fit of each frequent word is calculated from the dependency relationships among the frequent words, frequent words with a low degree of fit are replaced by an auxiliary identifier, and finally the identified clusters with identical linear templates are merged into new clusters.
The frequent words in the linear templates determined by this method carry no positional information and are therefore insensitive to changes in word-segmentation positions. After the candidate clusters are determined, the candidate cluster merging optimization and candidate cluster connection strategies ensure that the linear templates do not overfit.
The present application provides a log parsing and clustering method based on frequent-word linear templates, which can be summarized in four main steps. Step 1: traverse the log data and count the frequent words according to the support threshold. Step 2: traverse the log data again to create the frequent-word linear templates and candidate clusters. Step 3: to address over-splitting of candidate clusters caused by an improperly set support threshold, merge and optimize the candidate clusters after they are generated. Step 4: perform connection optimization on the merged and optimized clusters.
Fig. 2 is a log clustering flowchart provided in the present application, including the following steps:
S201: input each piece of log data to be clustered, L = [l1, l2, ……, ln], and a support threshold s.
S202: and counting frequent words according to the support threshold value, and creating a frequent word set.
S203: a first linear template is determined from each piece of log data and the set of frequent words.
S204: and determining the similarity between the first linear templates according to the frequent words in the first linear templates, and merging the first candidate clusters corresponding to the first linear templates with the similarity greater than a preset similarity threshold to obtain second candidate clusters and the second linear templates corresponding to the second candidate clusters.
S205: for each frequent word in each second linear template, determine the degree of fit of the frequent word according to the frequency with which it occurs together with each other frequent word in the second candidate cluster corresponding to the second linear template, and the total frequency with which each frequent word in the same second candidate cluster occurs individually; determine the frequent words to be merged according to the degree of fit of each frequent word and a preset degree-of-fit threshold; and merge the second candidate clusters whose second linear templates have the same frequent words but different frequent words to be merged, to obtain each target cluster and each corresponding target linear template.
The construction of frequent word sets is described as follows:
The present application considers the frequency of occurrence of each word in the log data (the number of log lines in which the word occurs), but not the positional information of the word. For a linear template to reach the support threshold s, its frequent words must occur in at least s pieces of log data. After each line of log data is segmented, the segmented words within the line are deduplicated, the word frequency of each segmented word is counted, words whose frequency exceeds the support threshold s are the frequent words, and these are added to the frequent word set.
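A minimal sketch of this construction, using whitespace splitting in place of the unspecified word-segmentation algorithm:

```python
from collections import Counter

def build_frequent_word_set(log_lines, s):
    """Count, with per-line deduplication, how many log lines each
    segmented word appears in; words whose count is strictly greater
    than the support threshold s (matching the claims' 'greater than a
    preset support threshold') form the frequent word set."""
    counts = Counter()
    for line in log_lines:
        counts.update(set(line.split()))  # dedupe within one line
    return {w for w, c in counts.items() if c > s}
```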
Creating linear template generation candidate class clusters is described as follows:
After the frequent word set is constructed, candidate clusters (groups of log lines sharing the same linear template) are generated. For each line of log data, all frequent words are extracted and assembled into a tuple that preserves their original positions in the line; this tuple serves as the identifier of a candidate cluster, and the line is assigned to the corresponding candidate cluster. If the candidate cluster does not yet exist, a support count (for that linear template) is initialized to 1 and a new linear template is created from the line. If the candidate cluster exists, its support count is incremented and the linear template is adjusted to cover the current line. The support count is the number of pieces of log data matching the linear template.
FIG. 3 is a schematic diagram of generating candidate clusters and linear templates provided herein. The determined frequent words include "Application", "start", "at" and "node". For log data 1, "Application Gateway start at node Sfmmbkas", it is determined whether a matching linear template exists; if not, the linear template "Application *{1,1} start at node *{1,1}" is generated, and if so, the support count of that template is incremented by 1. For log data 2, "Application NG app start at node Ngap", the linear template already exists, so its support count is incremented by 1 again and the template is updated to "Application *{1,2} start at node *{1,1}". Here "*{1,2}" means that between "Application" and "start" there is at least one and at most two segmented words.
In a specific implementation, log data having the same frequent words and the same relative position information of those frequent words can be merged according to the frequent words in each piece of log data, and the corresponding linear templates extracted, yielding the linear template extraction result.
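The cluster construction of this step can be sketched as follows. Whitespace tokenization, and treating the gaps before the first and after the last frequent word the same as inner gaps, are assumptions made for illustration:

```python
from collections import defaultdict

def build_candidate_clusters(log_lines, frequent_words):
    """Assign each line to a candidate cluster identified by its ordered
    tuple of frequent words; per cluster, track a support count and, for
    each gap (before, between, and after the frequent words), the min
    and max number of variable words seen so far, which become the
    *{min,max} wildcards of the linear template."""
    clusters = defaultdict(lambda: {"support": 0, "gaps": None})
    for line in log_lines:
        tokens = line.split()
        pos = [i for i, t in enumerate(tokens) if t in frequent_words]
        key = tuple(tokens[i] for i in pos)
        bounds = [-1] + pos + [len(tokens)]
        gaps = [bounds[j + 1] - bounds[j] - 1 for j in range(len(bounds) - 1)]
        c = clusters[key]
        c["support"] += 1
        if c["gaps"] is None:
            c["gaps"] = [[g, g] for g in gaps]
        else:
            for rng, g in zip(c["gaps"], gaps):
                rng[0], rng[1] = min(rng[0], g), max(rng[1], g)
    return clusters

def render_template(key, gaps):
    """Render a cluster key and its gap ranges as a linear template
    string; gaps that are always empty produce no wildcard."""
    parts = []
    for i, word in enumerate(key):
        lo, hi = gaps[i]
        if hi > 0:
            parts.append(f"*{{{lo},{hi}}}")
        parts.append(word)
    lo, hi = gaps[len(key)]
    if hi > 0:
        parts.append(f"*{{{lo},{hi}}}")
    return " ".join(parts)
```

Applied to the two log lines of Fig. 3, this reproduces the updated template "Application *{1,2} start at node *{1,1}" with a support count of 2.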
The candidate cluster merging optimization is described as follows:
After all log data have been processed, the construction of candidate clusters is complete. The candidate clusters then need to be adjusted to prevent the linear templates from overfitting; this can be done through a heuristic strategy. A more detailed linear template developed for a candidate cluster is merged into the currently given candidate linear template, and the support count of the current linear template is updated. After merging and optimization, linear templates with a support count smaller than the support threshold s are deleted, yielding the linear template extraction result.
Fig. 4 is a schematic diagram of a candidate cluster merging example provided in the present application. As shown in Fig. 4, the support count of the linear template "remote_addr 172.18.179.21 access @timestamp *{1,1}" is 10, the support count of the linear template "remote_addr *{1,1} access @timestamp 20210517164312" is 25, and the support count of the linear template "remote_addr *{1,1} access @timestamp *{1,1}" is 115. The three linear templates merge into "remote_addr *{1,1} access @timestamp *{1,1}", with a support count of 150.
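The merge optimization can be sketched as follows, using illustrative template strings in the style of Fig. 4. Token-aligned templates of equal length and single-level folding (no chained merges) are simplifying assumptions, not the full matcher:

```python
import re

WC = re.compile(r"\*\{(\d+),(\d+)\}")

def covers(general, specific):
    """Whether `general` covers `specific` token by token (equal length,
    as in the Fig. 4 example): equal tokens match; a *{lo,hi} wildcard
    in `general` matches a single concrete token when lo <= 1 <= hi, or
    a narrower wildcard in `specific`."""
    if len(general) != len(specific):
        return False
    for g, s in zip(general, specific):
        mg, ms = WC.fullmatch(g), WC.fullmatch(s)
        if g == s:
            continue
        if mg and ms:
            if int(mg.group(1)) <= int(ms.group(1)) and int(ms.group(2)) <= int(mg.group(2)):
                continue
            return False
        if mg and not ms and int(mg.group(1)) <= 1 <= int(mg.group(2)):
            continue
        return False
    return True

def merge_optimize(templates, s):
    """Fold the support count of each more-specific template into a
    covering template, then drop templates whose merged support count
    is below the support threshold s. `templates` maps a template
    string to its support count."""
    toks = {t: t.split() for t in templates}
    merged = dict(templates)
    for t in templates:
        for g in templates:
            if g != t and covers(toks[g], toks[t]):
                merged[g] = merged.get(g, 0) + templates[t]
                merged.pop(t, None)   # absorbed by the covering template
                break
    return {t: c for t, c in merged.items() if c >= s}
```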
Candidate cluster-like connections are described as follows:
For each frequent word in a linear template, the degree of fit between that frequent word and the other frequent words (based on their frequency of simultaneous occurrence) is calculated and expressed as a fit weight. A threshold t is set as the degree-of-fit threshold; according to this weight model, an auxiliary identifier is created for the linear template of each candidate cluster to mark frequent words with a low degree of fit, and finally the identified clusters with identical linear templates are merged into a new cluster.
When two or more clusters are connected, the support count of the joint cluster is set to the sum of the support counts of the original clusters, and the linear template of the joint cluster is adjusted to cover the linear templates of all the original clusters.
Since the linear templates of joint clusters consist of highly correlated words, they are not subject to overfitting. Furthermore, low-weight frequent words are incorporated into the linear template as an alternative list, so no data is lost. Finally, joining clusters reduces the number of linear templates, making cluster inspection easier for human experts.
Fig. 5 is an exemplary schematic diagram of candidate cluster connection provided in the present application. As shown in Fig. 5, the support count of the linear template "Interface *{1,1} down at node router1" is 50 and the support count of the linear template "Interface *{2,3} down at node router2" is 50; the target linear template obtained after connection is "Interface *{1,3} down at node router1|router2", with a support count of 100.
Fig. 6 is a schematic structural diagram of a log clustering device provided in the present application, including:
the first determining module 61 is configured to obtain each piece of log data to be clustered, and determine frequent words in each piece of log data according to a preset support threshold;
a second determining module 62, configured to determine each first candidate cluster and each corresponding first linear template according to the frequent word in each log data;
A third determining module 63, configured to determine, according to the frequent words in the first linear templates, the similarity between the first linear templates, merge the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold, and determine each target cluster and each corresponding target linear template.
The first determining module 61 is configured to obtain each piece of log data to be clustered, determine the segmented words in the log data through a word segmentation algorithm, and take, as the frequent words, the segmented words whose frequency of occurrence is greater than a preset support threshold.
The first determining module 61 is configured to perform, for each piece of log data, deduplication processing on the segmented words in the log data, and to take, as the frequent words, the segmented words whose frequency of occurrence after the deduplication processing of each piece of log data is greater than a preset support threshold.
The first determining module 61 is configured to obtain each piece of log data to be clustered, determine the segmented words in the log data through a word segmentation algorithm, perform deduplication processing on the segmented words, take, as the frequent words, the segmented words whose frequency of occurrence after the deduplication processing is greater than a preset support threshold, and take, as the variable words, the segmented words whose frequency of occurrence after the deduplication processing is not greater than the preset support threshold;
The second determining module 62 is configured to take, as log data in the same first candidate cluster, log data in which the frequent words are the same and the relative position information of the frequent words in the log data is the same, and to replace the variable words in the log data of the same first candidate cluster with wildcards to obtain the first linear template corresponding to the first candidate cluster; wherein the first linear template comprises frequent words and wildcards, and each wildcard comprises a wildcard symbol and a range for the number of replaced variable words.
The third determining module 63 is configured to determine, for any two first linear templates, a similarity of the any two first linear templates according to the same number of frequent words, the same number of wildcards, and the total number of segmentation words in the any two first linear templates.
The third determining module 63 is configured to merge the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold, to obtain each second candidate cluster and each corresponding second linear template; for each frequent word in each second linear template, determine the degree of fit of the frequent word according to the frequency with which it occurs together with each other frequent word in the second candidate cluster corresponding to the second linear template, and the total frequency with which each frequent word occurs individually; determine the frequent words to be merged according to the degree of fit of each frequent word and a preset degree-of-fit threshold; and merge the second candidate clusters whose second linear templates have the same frequent words but different frequent words to be merged, to obtain each target cluster and each corresponding target linear template.
The present application also provides an electronic device, as shown in fig. 7, including: the processor 71, the communication interface 72, the memory 73 and the communication bus 74, wherein the processor 71, the communication interface 72 and the memory 73 complete communication with each other through the communication bus 74;
the memory 73 has stored therein a computer program which, when executed by the processor 71, causes the processor 71 to perform any of the above method steps.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the figure shows only one thick line, but this does not mean there is only one bus or one type of bus.
The communication interface 72 is used for communication between the above-described electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
The present application also provides a computer-readable storage medium having stored thereon a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform any of the above method steps.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (14)

1. A method of clustering logs, the method comprising:
acquiring each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset supporting threshold;
according to the frequent words in each piece of log data, determining each first candidate cluster and each corresponding first linear template;
determining the similarity between the first linear templates according to the frequent words in the first linear templates, merging the first candidate clusters corresponding to first linear templates whose similarity is greater than a preset similarity threshold, and determining each target cluster and each corresponding target linear template.
2. The method of claim 1, wherein the obtaining each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset support threshold value comprises:
each piece of log data to be clustered is obtained, each word in the log data is determined through a word segmentation algorithm, and the word with the occurrence frequency larger than a preset supporting threshold value in each word is used as the frequent word.
3. The method of claim 2, wherein the step of using, as the frequent word, a word having a frequency of occurrence greater than a preset support threshold value among the individual words comprises:
performing, for each piece of log data, deduplication processing on the segmented words in the log data, and taking, as the frequent words, the segmented words whose frequency of occurrence after the deduplication processing of each piece of log data is greater than a preset support threshold.
4. The method of claim 3, wherein obtaining each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset support threshold comprises:
acquiring each piece of log data to be clustered, determining the segmented words in the log data through a word segmentation algorithm, performing deduplication processing on the segmented words, and taking, as the frequent words, the segmented words whose frequency of occurrence after the deduplication processing is greater than a preset support threshold; and taking, as the variable words, the segmented words whose frequency of occurrence after the deduplication processing is not greater than the preset support threshold;
according to the frequent words in each log data, determining each first candidate cluster and each corresponding first linear template comprises:
taking, as log data in the same first candidate cluster, the log data in which the frequent words are the same and the relative position information of the frequent words in the log data is the same; and replacing the variable words in the log data of the same first candidate cluster with wildcards to obtain the first linear template corresponding to the first candidate cluster; wherein the first linear template comprises frequent words and wildcards, and each wildcard comprises a wildcard symbol and a range for the number of replaced variable words.
5. The method of claim 4, wherein determining similarities between the respective first linear templates based on the frequent words in the respective first linear templates comprises:
for any two first linear templates, determining the similarity of the two first linear templates according to the number of identical frequent words, the number of identical wildcards, and the total number of segmented words in the two first linear templates.
6. The method of any one of claims 1-5, wherein merging the first candidate class clusters corresponding to the first linear templates having the similarity greater than a preset similarity threshold, and determining each target class cluster and each corresponding target linear template comprises:
combining the first candidate clusters corresponding to the first linear templates with the similarity larger than a preset similarity threshold to obtain second candidate clusters and second linear templates corresponding to the second candidate clusters;
aiming at each frequent word in each second linear template, determining the fitting degree of the frequent word according to the frequency of the simultaneous occurrence of the frequent word and each frequent word in the second candidate cluster corresponding to the second linear template and the total frequency of the respective occurrence of the frequent word; determining frequent words to be combined according to the fit degree of the frequent words and a preset fit degree threshold;
merging the second candidate clusters whose second linear templates have the same frequent words but different frequent words to be merged, to obtain each target cluster and each corresponding target linear template.
7. A log clustering device, the device comprising:
the first determining module is used for acquiring each piece of log data to be clustered, and determining frequent words in each piece of log data according to a preset supporting threshold;
the second determining module is used for determining each first candidate cluster and each corresponding first linear template according to the frequent words in each piece of log data;
and the third determining module is used for determining the similarity between the first linear templates according to the frequent words in the first linear templates, merging the first candidate clusters corresponding to the first linear templates with the similarity larger than a preset similarity threshold, and determining each target cluster and each corresponding target linear template.
8. The apparatus of claim 7, wherein the first determining module is configured to obtain each piece of log data to be clustered, determine each word segment in the log data through a word segmentation algorithm, and use, as the frequent word, a word segment in each word segment, where the frequency of occurrence of the word segment is greater than a preset support threshold.
9. The apparatus of claim 8, wherein the first determining module is configured to perform word segmentation and duplication removal processing on each word segment in the log data for each piece of log data, and use, as the frequent word, a word segment whose occurrence frequency after the word segmentation and duplication removal processing of each log data is greater than a preset support threshold.
10. The apparatus of claim 9, wherein the first determining module is configured to obtain each piece of log data to be clustered, determine each word segment in the log data through a word segmentation algorithm, perform word segmentation and duplication removal processing on each word segment, and use, as the frequent word, a word segment whose occurrence frequency after the word segmentation and duplication removal processing is greater than a preset support threshold; taking the word segmentation with the occurrence frequency not greater than a preset support threshold value after the word segmentation de-duplication treatment as the variable word;
the second determining module is configured to take, as log data in the same first candidate cluster, log data in which the frequent words are the same and the relative position information of the frequent words in the log data is the same, and to replace the variable words in the log data of the same first candidate cluster with wildcards to obtain the first linear template corresponding to the first candidate cluster; wherein the first linear template comprises frequent words and wildcards, and each wildcard comprises a wildcard symbol and a range for the number of replaced variable words.
11. The apparatus of claim 10, wherein the third determining module is configured to determine, for any two first linear templates, a similarity of the any two first linear templates based on a number of identical frequent words, a number of identical wildcards, and a total number of segmentations in the any two first linear templates.
12. The apparatus of any one of claims 7 to 11, wherein the third determining module is configured to combine first candidate clusters corresponding to the first linear templates with the similarity greater than a preset similarity threshold to obtain each second candidate cluster and each corresponding second linear template; aiming at each frequent word in each second linear template, determining the fitting degree of the frequent word according to the frequency of the simultaneous occurrence of the frequent word and each frequent word in the second candidate cluster corresponding to the second linear template and the total frequency of the respective occurrence of the frequent word; determining frequent words to be combined according to the fit degree of the frequent words and a preset fit degree threshold; and merging the second candidate class clusters corresponding to the second linear templates with the same frequent words and different frequent words to be merged to obtain each target class cluster and each corresponding target linear template.
13. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1-6 when executing a program stored on a memory.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
CN202311209999.6A 2023-09-18 2023-09-18 Log clustering method and device, electronic equipment and storage medium Pending CN117332083A (en)

Publications (1)

Publication Number Publication Date
CN117332083A true CN117332083A (en) 2024-01-02

Family

ID=89282138



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination