CN113590558A - Log correlation analysis method and device, storage medium and equipment - Google Patents

Log correlation analysis method and device, storage medium and equipment

Info

Publication number
CN113590558A
Authority
CN
China
Prior art keywords
item set
log
events
support degree
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110880651.4A
Other languages
Chinese (zh)
Inventor
周鹏
葛思江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110880651.4A
Publication of CN113590558A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/1734Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a log correlation analysis method, a log correlation analysis apparatus, a storage medium, and a device. The logs in a log set are divided according to the generation time of each log to obtain a plurality of item sets. Redundant events are removed from each item set to obtain valid item sets. The events contained in each valid item set are queried to obtain all events that have occurred, and a 1st item set is constructed from those events. Frequent item set mining is performed on the 1st item set to obtain a frequent item set. The frequent item set is parsed to obtain a plurality of subsets. For any two subsets, if the confidence between them is greater than a preset confidence threshold, a strong correlation is determined to exist between the events contained in the two subsets. The method and apparatus can thereby effectively improve the accuracy of log correlation analysis.

Description

Log correlation analysis method and device, storage medium and equipment
Technical Field
The present application relates to the field of log event correlation analysis, and in particular, to a log correlation analysis method, apparatus, storage medium, and device.
Background
With the development of modern internet technology, enterprises deploy various IT systems both internally and externally, and these systems generate a large volume of logs every day, including operating system logs, network behavior logs, application logs, network infrastructure logs, and security host logs. Log data continuously records the activity of IT devices and contains rich operation-and-maintenance and security knowledge, which makes it an important basis for fault root-cause analysis and security event tracing in information security.
Because the volume of log data is huge and grows rapidly over time, manual analysis is very difficult. As big data technology develops, data storage capacity keeps growing and the trend toward data-driven enterprise business becomes more pronounced. Logs are typical unstructured data, and mining valuable information from massive unstructured data is increasingly important. If correlations between event occurrences can be found in a large number of logs, security events that may be latent in them can be warned of in time, and the probability or confidence of their occurrence can be labeled.
At present, log correlation analysis methods and apparatuses on the market mainly collect log data and perform correlation analysis on it so as to discover problems or disasters in advance and issue early warnings. The most common form of log analysis is event correlation analysis, which has two problems. First, the item set segmentation used by event correlation analysis extracts all feature data of each log as an item set, which produces a large number of dense item sets and reduces the accuracy of the analysis. Second, similar feature data in log information is not classified and integrated, and there is no method or apparatus for uniformly classifying similar events in the logs into the same item. As a result, current log correlation analysis lacks flexibility: correlation analysis can be performed only for specific situations and log event correlations cannot be dynamically corrected, which reduces the accuracy of the analysis results.
Disclosure of Invention
The application provides a log correlation analysis method, a log correlation analysis device, a storage medium and a device, and aims to improve the accuracy of log correlation analysis.
In order to achieve the above object, the present application provides the following technical solutions:
A log correlation analysis method, comprising:
acquiring a plurality of logs collected by a log source, and constructing a log set based on each log and the generation time of each log, wherein each log contains one or more events;
dividing the logs in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set;
identifying redundant events in each item set by using a preset regular expression, and removing the redundant events from each item set to obtain valid item sets;
querying the events contained in each valid item set to obtain all events that have occurred, and constructing a 1st item set based on all the events that have occurred;
performing frequent item set mining on the 1st item set to obtain a frequent item set;
parsing the frequent item set to obtain a plurality of subsets;
calculating the confidence between the subsets; and
for any two subsets, if the confidence between the two subsets is greater than a preset confidence threshold, determining that a strong correlation exists between the events contained in the two subsets.
Optionally, the identifying redundant events in each item set by using a preset regular expression comprises:
for each item set, identifying the variables of the events in the item set by using the preset regular expression;
identifying a plurality of events that have the same variable as target events; and
randomly selecting one of the target events as a valid event, and identifying the other target events as redundant events.
Optionally, the performing frequent item set mining on the 1st item set to obtain a frequent item set comprises:
combining the 1st item set with itself to obtain a 2nd item set, and calculating the support of each event in the 2nd item set;
determining whether the 2nd item set contains an event whose support is less than a preset support threshold;
if the 2nd item set contains no event whose support is less than the preset support threshold, identifying the 2nd item set as the frequent item set;
if the 2nd item set contains an event whose support is less than the preset support threshold, iteratively executing a preset step until the (k+1)th item set contains no event whose support is less than the preset support threshold, and identifying the (k+1)th item set as the frequent item set;
wherein k = 2, 3, 4, …, and the preset step comprises: removing from the kth item set the events whose support is less than the preset support threshold to obtain a new kth item set, combining the 1st item set with the new kth item set to obtain a (k+1)th item set, calculating the support of each event in the (k+1)th item set, and determining whether the (k+1)th item set contains an event whose support is less than the preset support threshold.
Optionally, the acquiring a plurality of logs collected by a log source and constructing a log set based on each log and the generation time of each log comprises:
acquiring the plurality of logs collected by the log source, setting a log source tag for each log, and performing wildcard vocabulary initialization on each log to obtain valid logs; and
constructing the log set based on the valid logs and the generation time of each valid log.
A log correlation analysis apparatus, comprising:
an obtaining unit, configured to acquire a plurality of logs collected by a log source, and construct a log set based on each log and the generation time of each log, wherein each log contains one or more events;
a dividing unit, configured to divide the logs in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set;
an identification unit, configured to identify redundant events in each item set by using a preset regular expression, and remove the redundant events from each item set to obtain valid item sets;
a construction unit, configured to query the events contained in each valid item set to obtain all events that have occurred, and construct a 1st item set based on all the events that have occurred;
a mining unit, configured to perform frequent item set mining on the 1st item set to obtain a frequent item set;
a parsing unit, configured to parse the frequent item set to obtain a plurality of subsets;
a calculation unit, configured to calculate the confidence between the subsets; and
a determining unit, configured to determine, for any two subsets, that a strong correlation exists between the events contained in the two subsets if the confidence between the two subsets is greater than a preset confidence threshold.
Optionally, the identification unit is specifically configured to:
for each item set, identify the variables of the events in the item set by using the preset regular expression;
identify a plurality of events that have the same variable as target events; and
randomly select one of the target events as a valid event, and identify the other target events as redundant events.
Optionally, the mining unit is specifically configured to:
combine the 1st item set with itself to obtain a 2nd item set, and calculate the support of each event in the 2nd item set;
determine whether the 2nd item set contains an event whose support is less than a preset support threshold;
if the 2nd item set contains no event whose support is less than the preset support threshold, identify the 2nd item set as the frequent item set;
if the 2nd item set contains an event whose support is less than the preset support threshold, iteratively execute a preset step until the (k+1)th item set contains no event whose support is less than the preset support threshold, and identify the (k+1)th item set as the frequent item set;
wherein k = 2, 3, 4, …, and the preset step comprises: removing from the kth item set the events whose support is less than the preset support threshold to obtain a new kth item set, combining the 1st item set with the new kth item set to obtain a (k+1)th item set, calculating the support of each event in the (k+1)th item set, and determining whether the (k+1)th item set contains an event whose support is less than the preset support threshold.
Optionally, the obtaining unit is specifically configured to:
acquire the plurality of logs collected by the log source, set a log source tag for each log, and perform wildcard vocabulary initialization on each log to obtain valid logs; and
construct the log set based on the valid logs and the generation time of each valid log.
A computer-readable storage medium comprising a stored program, wherein the program, when run, executes the log correlation analysis method.
A log correlation analysis device, comprising: a processor, a memory, and a bus, wherein the processor and the memory are connected through the bus;
the memory is configured to store a program, and the processor is configured to run the program, wherein the log correlation analysis method is executed when the program runs.
According to the above technical solution, a plurality of logs collected by a log source are acquired, and a log set is constructed based on each log and its generation time. The logs in the log set are divided according to their generation times to obtain a plurality of item sets. Redundant events in each item set are identified by using a preset regular expression and removed to obtain valid item sets. The events contained in each valid item set are queried to obtain all events that have occurred, and a 1st item set is constructed from them. Frequent item set mining is performed on the 1st item set to obtain a frequent item set. The frequent item set is parsed to obtain a plurality of subsets, and the confidence between the subsets is calculated. For any two subsets, if the confidence between them is greater than a preset confidence threshold, a strong correlation is determined to exist between the events contained in the two subsets. In addition, because the redundant events (which can be understood as similar events) in each item set are identified with the preset regular expression and removed, similar events in the logs are integrated and their influence on the log correlation analysis is avoided. Compared with the prior art, the solution is therefore more flexible: it is not adversely affected by growth in the number of logs, and it eliminates the adverse effect of redundant events on the analysis, thereby effectively improving the accuracy of log correlation analysis.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a log correlation analysis method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a log collection architecture according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a log correlation analysis apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in Fig. 1, the log correlation analysis method provided in an embodiment of the present application includes the following steps:
S101: Acquire a plurality of logs collected by a log source, and construct a log set based on each log and the generation time of each log.
Each log contains one or more events.
Optionally, after the plurality of logs collected by the log source are acquired, a log source tag may be set for each log, and wildcard vocabulary initialization may be performed on each log to obtain valid logs. The log set is then constructed based on the valid logs and the generation time of each valid log.
Log collection is a common technical means in the art, so the log correlation analysis method described in the present application can be applied to most log collection architectures, and specifically to the architecture shown in Fig. 2.
In the embodiment of the present application, the expression form of the log is shown as formula (1).
L1 = strr11 + x1, strr12 + y1, strr13 + z1, … (1)
In formula (1), L1 denotes a log, and strr11, strr12 and strr13 are fixed strings. strr11 + x denotes event r11, where x is the variable of event r11 and x1 is a value of x; strr12 + y denotes event r12, where y is the variable of event r12 and y1 is a value of y; strr13 + z denotes event r13, where z is the variable of event r13 and z1 is a value of z. In general, strr11 + x1 and strr11 + x2 both denote the same event r11; they differ only in the value of the variable x.
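The event notation in formula (1) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each event is written as `fixed_string=value` and that events are separated by commas; the function name `parse_log` and this textual layout are hypothetical and are not specified by the application.

```python
def parse_log(line):
    """Split one log line into (fixed_string, variable_value) pairs.

    Assumed layout (hypothetical): "name1=value1, name2=value2, ...".
    """
    events = []
    for field in line.split(","):
        name, separator, value = field.partition("=")
        if separator:  # skip fragments that carry no variable
            events.append((name.strip(), value.strip()))
    return events


# Two logs whose first event differs only in the variable value
log_a = parse_log("login_user=alice, src_ip=10.0.0.1")
log_b = parse_log("login_user=bob, src_ip=10.0.0.1")
same_event = log_a[0][0] == log_b[0][0]  # both denote the "login_user" event
```

Under this assumption, `login_user=alice` and `login_user=bob` play the role of strr11 + x1 and strr11 + x2: the same event with different variable values.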
Further, a combination of the log and its own generation time can be represented by (L1, t1), and t1 represents the generation time of the log L1. Accordingly, the expression form of the log set may be as shown in equation (2).
{(L1, t1), (L2, t2), (L3, t3), (L4, t4), …, (Ln, tn)} (2)
In the formula (2), n is a positive integer, and the greater the value of n, the higher the accuracy of the log correlation analysis shown in this embodiment is.
S102: Divide the logs in the log set according to the generation time of each log to obtain a plurality of item sets.
Logs whose generation times fall within the same preset time range are divided into the same item set. The time granularity indicated by the preset time range includes but is not limited to: year, month, week, day, and hour. Generally speaking, the larger the total number of logs in the log set, the smaller the time granularity should be, which improves the accuracy of the log correlation analysis.
Specifically, assume that the log set shown in formula (2) covers 365 days and that the time granularity indicated by the preset time range is one day. The logs in the log set are then divided according to their generation times to obtain a plurality of item sets, and the division result is shown in formula (3).
[Formula (3): rendered as an image in the original publication (Figure BDA0003191910550000081)]
It should be noted that the above specific implementation process is only for illustration.
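The day-granularity division in S102 can be sketched as follows. This is a minimal illustration under stated assumptions: each log is represented as a list of event names paired with a `datetime` generation time, and the function name `split_by_day` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def split_by_day(log_set):
    """Group the events of all logs generated on the same calendar day.

    log_set: iterable of (events, generated_at) pairs, where events is a
    list of event names and generated_at is a datetime (assumed layout).
    Returns a dict mapping each date to the item set for that day.
    """
    item_sets = defaultdict(list)
    for events, generated_at in log_set:
        item_sets[generated_at.date()].extend(events)
    return dict(item_sets)


example_log_set = [
    (["login_failed", "account_locked"], datetime(2021, 1, 1, 9, 0)),
    (["disk_full"], datetime(2021, 1, 1, 23, 30)),
    (["login_failed"], datetime(2021, 1, 2, 0, 5)),
]
daily_item_sets = split_by_day(example_log_set)
```

A different granularity (hour, week, month) would only change the grouping key, for example truncating `generated_at` to the hour instead of calling `.date()`.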
S103: Identify the redundant events in each item set by using a preset regular expression, and remove the redundant events from each item set to obtain valid item sets.
The specific process of identifying redundant events in an item set by using a preset regular expression is common knowledge familiar to those skilled in the art.
Optionally, for each item set, the variables of the events in the item set are identified by using the preset regular expression; a plurality of events that have the same variable are identified as target events; one of the target events is randomly selected as a valid event, and the other target events are identified as redundant events.
Specifically, taking the item sets shown in formula (3) as an example, the events contained in each item set are shown in formula (4). In formula (4), any two events may be the same event, in which case they are redundant and one of them must be removed; or they may be different events, in which case both are retained.
[Formula (4): rendered as an image in the original publication (Figure BDA0003191910550000082)]
It should be noted that removing the redundant events from each item set to obtain the valid item sets reduces the workload of the log correlation analysis and avoids wasting computing resources, thereby improving analysis efficiency.
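The redundant-event removal of S103 might look like the sketch below. The regular expression, the `fixed_string=value` event layout, and the function name `remove_redundant_events` are all assumptions made for illustration; the application does not disclose its actual expression.

```python
import random
import re

# Hypothetical pattern: everything before the first "=" is the fixed string
# that identifies the event, everything after it is the variable value.
VARIABLE_PATTERN = re.compile(r"^(?P<fixed>.*?)=(?P<value>.*)$")


def remove_redundant_events(item_set, rng=None):
    """Keep one randomly chosen event per group of events sharing a variable."""
    rng = rng or random.Random()
    groups = {}
    for event in item_set:
        match = VARIABLE_PATTERN.match(event)
        key = match.group("fixed") if match else event
        groups.setdefault(key, []).append(event)
    # One valid event per group; the remaining ones are treated as redundant.
    return [rng.choice(candidates) for candidates in groups.values()]


valid_item_set = remove_redundant_events(
    ["login_user=alice", "login_user=bob", "disk_full=/var"],
    rng=random.Random(0),  # seeded only to make the illustration repeatable
)
```

Passing a seeded `random.Random` keeps the random selection reproducible in tests while still following the "randomly select one" wording of the step.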
S104: Query the events contained in each valid item set to obtain all events that have occurred, and construct the 1st item set based on all the events that have occurred.
Specifically, taking the item sets shown in formula (4) as an example, the events contained in each valid item set are queried to obtain all events that have occurred, and these events are collected to construct the 1st item set, which can be expressed as shown in formula (5).
[Formula (5): rendered as an image in the original publication (Figure BDA0003191910550000091)]
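Collecting all events that have occurred into the 1st item set, as in S104, amounts to taking the union of the valid item sets. The sketch below assumes events are represented by their names; the function name `build_first_item_set` is hypothetical.

```python
def build_first_item_set(valid_item_sets):
    """Return every distinct event that occurs in any valid item set."""
    occurred = set()
    for item_set in valid_item_sets:
        occurred.update(item_set)
    return sorted(occurred)  # sorted only to give a stable order


first_item_set = build_first_item_set([
    ["login_failed", "account_locked"],
    ["login_failed", "disk_full"],
])
```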
S105: Perform frequent item set mining on the 1st item set by using a frequent item set mining algorithm to obtain a frequent item set.
The specific implementation process of performing frequent item set mining on the 1st item set includes the following steps:
1. Combine the 1st item set with itself to obtain a 2nd item set, and calculate the support of each event in the 2nd item set.
In general, the specific process of calculating the support of each event in an item set is common general knowledge familiar to those skilled in the art.
2. Determine whether the 2nd item set contains an event whose support is less than a preset support threshold.
3. If the 2nd item set contains no event whose support is less than the preset support threshold, identify the 2nd item set as the frequent item set.
4. If the 2nd item set contains an event whose support is less than the preset support threshold, iteratively execute step 5 until the (k+1)th item set contains no event whose support is less than the preset support threshold, and identify the (k+1)th item set as the frequent item set.
Here k = 2, 3, 4, ….
5. Remove from the kth item set the events whose support is less than the preset support threshold to obtain a new kth item set, combine the 1st item set with the new kth item set to obtain a (k+1)th item set, calculate the support of each event in the (k+1)th item set, and determine whether the (k+1)th item set contains an event whose support is less than the preset support threshold.
In the embodiment of the present application, a frequent item set mining algorithm is used to perform frequent item set mining on the 1st item set to obtain a frequent item set, which can be expressed as shown in formula (6).
{strrn1 + xn, strrn2 + yn, strrn3 + zn} (6)
In the formula (6), n1, xn, n2, yn, n3, zn are all fixed values.
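The level-wise mining in S105 resembles the well-known Apriori approach, and a minimal sketch in that style is given below. It is not claimed to reproduce the application's exact procedure. It assumes that support is the fraction of valid item sets (transactions) containing a candidate, and the names `support` and `mine_frequent_item_sets` are hypothetical.

```python
def support(candidate, transactions):
    """Fraction of transactions that contain every event in candidate."""
    hits = sum(1 for transaction in transactions if candidate <= transaction)
    return hits / len(transactions)


def mine_frequent_item_sets(transactions, min_support):
    """Grow candidates one event at a time, pruning those below min_support."""
    transactions = [frozenset(t) for t in transactions]
    singles = sorted({event for t in transactions for event in t})
    level = [frozenset([e]) for e in singles
             if support(frozenset([e]), transactions) >= min_support]
    frequent = list(level)
    while level:
        # Combine each surviving k-item candidate with the single events.
        candidates = {c | frozenset([e]) for c in level for e in singles if e not in c}
        level = sorted(
            (c for c in candidates if support(c, transactions) >= min_support),
            key=sorted,
        )
        frequent.extend(level)
    return frequent


example_transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
frequent_item_sets = mine_frequent_item_sets(example_transactions, min_support=0.5)
```

In this example every single event and every pair meets the 0.5 threshold, while the triple {a, b, c} appears in only two of five transactions and is pruned.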
S106: Parse the frequent item set to obtain a plurality of subsets.
Specifically, taking the frequent item set shown in formula (6) as an example, parsing the frequent item set yields the following subsets:
{strrn1 + xn, strrn2 + yn, strrn3 + zn}, {strrn1 + xn}, {strrn2 + yn}, {strrn3 + zn}, {strrn1 + xn, strrn2 + yn}, {strrn2 + yn, strrn3 + zn}, {strrn1 + xn, strrn3 + zn}.
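The seven subsets listed above are exactly the non-empty subsets of a three-event frequent item set. A short sketch using the standard library's `itertools.combinations` enumerates them; the function name `non_empty_subsets` and the event names are illustrative only.

```python
from itertools import combinations


def non_empty_subsets(frequent_item_set):
    """Enumerate every non-empty subset of a frequent item set."""
    items = sorted(frequent_item_set)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


subsets = non_empty_subsets({"event_r1", "event_r2", "event_r3"})
```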
S107: Calculate the confidence between the subsets.
The specific process of calculating the confidence between the subsets is well known to those skilled in the art.
Specifically, taking the subsets obtained by parsing formula (6) as an example, the confidences to be calculated include:
{{strrn1+xn,strrn2+yn,strrn3+zn}}→{{strrn1+xn}};
{{strrn1+xn,strrn2+yn,strrn3+zn}}→{{strrn2+yn}};
{{strrn1+xn,strrn2+yn,strrn3+zn}}→{{strrn3+zn}};
……
{{strrn1+xn}}→{{strrn2+yn}}。
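The application leaves the confidence computation to common knowledge. The sketch below uses the standard association-rule definition, confidence(A → B) = support(A ∪ B) / support(A), which is an assumption here rather than a formula stated in the source; the function name `confidence` is hypothetical.

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing antecedent that also contain consequent."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # the antecedent never occurs, so no rule can be supported
    both = sum(1 for t in with_antecedent if consequent <= set(t))
    return both / len(with_antecedent)


example_transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
conf_a_to_b = confidence({"a"}, {"b"}, example_transactions)
```

Here event a occurs in four transactions and b accompanies it in three of them, so the confidence of a → b is 3/4.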
S108: For any two subsets, if the confidence between the two subsets is greater than a preset confidence threshold, determine that a strong correlation exists between the events contained in the two subsets.
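The thresholding in S108 can be sketched as a simple filter over precomputed confidences. The dictionary layout (a pair of subsets mapped to its confidence) and the function name `strong_correlations` are assumptions for illustration.

```python
def strong_correlations(confidences, threshold):
    """Return subset pairs whose confidence strictly exceeds the threshold.

    confidences: dict mapping (subset_a, subset_b) to a confidence value.
    Pairs are returned in order of decreasing confidence.
    """
    strong = [(pair, value) for pair, value in confidences.items() if value > threshold]
    strong.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in strong]


example_confidences = {("a", "b"): 0.75, ("b", "c"): 0.5, ("a", "c"): 0.9}
strong_pairs = strong_correlations(example_confidences, threshold=0.5)
```

The comparison is strict, following the "greater than a preset confidence threshold" wording, so a pair whose confidence equals the threshold is not reported.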
In summary, dividing the logs into item sets according to their generation times prevents growth in the number of item sets from degrading the accuracy of the log correlation analysis. In addition, identifying the redundant events (which can be understood as similar events) in each item set with a preset regular expression and removing them integrates similar events in the logs and prevents them from influencing the analysis. Compared with the prior art, the solution of this embodiment is therefore more flexible: it is not adversely affected by growth in the number of logs, and it eliminates the adverse effect of redundant events, so the accuracy of log correlation analysis is effectively improved.
Corresponding to the log correlation analysis method provided by the embodiment of the application, the embodiment of the application also provides a log correlation analysis device.
As shown in Fig. 3, the log correlation analysis apparatus provided in an embodiment of the present application includes:
An obtaining unit 100, configured to acquire a plurality of logs collected by a log source, and construct a log set based on each log and the generation time of each log; each log contains one or more events.
The obtaining unit 100 is specifically configured to: acquire the plurality of logs collected by the log source, set a log source tag for each log, and perform wildcard vocabulary initialization on each log to obtain valid logs; and construct the log set based on the valid logs and the generation time of each valid log.
A dividing unit 200, configured to divide the logs in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set.
An identification unit 300, configured to identify the redundant events in each item set by using a preset regular expression, and remove the redundant events from each item set to obtain valid item sets.
The identification unit 300 is specifically configured to: for each item set, identify the variables of the events in the item set by using the preset regular expression; identify a plurality of events that have the same variable as target events; randomly select one of the target events as a valid event, and identify the other target events as redundant events.
A construction unit 400, configured to query the events contained in each valid item set to obtain all events that have occurred, and construct the 1st item set based on all the events that have occurred.
A mining unit 500, configured to perform frequent item set mining on the 1st item set to obtain a frequent item set.
The mining unit 500 is specifically configured to: combine the 1st item set with itself to obtain a 2nd item set, and calculate the support of each event in the 2nd item set; determine whether the 2nd item set contains an event whose support is less than a preset support threshold; if the 2nd item set contains no event whose support is less than the preset support threshold, identify the 2nd item set as the frequent item set; if the 2nd item set contains an event whose support is less than the preset support threshold, iteratively execute a preset step until the (k+1)th item set contains no event whose support is less than the preset support threshold, and identify the (k+1)th item set as the frequent item set; wherein k = 2, 3, 4, …, and the preset step includes: removing from the kth item set the events whose support is less than the preset support threshold to obtain a new kth item set, combining the 1st item set with the new kth item set to obtain a (k+1)th item set, calculating the support of each event in the (k+1)th item set, and determining whether the (k+1)th item set contains an event whose support is less than the preset support threshold.
A parsing unit 600, configured to parse the frequent item set to obtain a plurality of subsets.
A calculation unit 700, configured to calculate the confidence between the subsets.
A determining unit 800, configured to determine, for any two subsets, that a strong correlation exists between the events contained in the two subsets if the confidence between the two subsets is greater than a preset confidence threshold.
In summary, dividing the logs into item sets according to their generation times prevents growth in the number of item sets from degrading the accuracy of the log correlation analysis. In addition, identifying the redundant events (which can be understood as similar events) in each item set with the preset regular expression and removing them integrates similar events in the logs and prevents them from influencing the analysis. Compared with the prior art, the solution of this embodiment is therefore more flexible: it is not adversely affected by growth in the number of logs, and it eliminates the adverse effect of redundant events, so the accuracy of log correlation analysis is effectively improved.
The application also provides a computer readable storage medium, which includes a stored program, wherein the program executes the log correlation analysis method provided by the application.
The present application further provides a log correlation analysis device, including: a processor, a memory, and a bus. The processor is connected to the memory through the bus; the memory is used for storing a program, and the processor is used for running the program, wherein the program, when run, executes the log correlation analysis method provided by the application, which comprises the following steps:
obtaining a plurality of logs collected by a log source, and constructing a log set based on each log and the generation time of each log, wherein each log contains one or more events;
dividing the logs in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set;
identifying redundant events in each item set by using a preset regular expression, and eliminating the redundant events from each item set to obtain the effective item sets;
querying the events contained in each effective item set to obtain all events that have occurred, and constructing a 1st item set based on all the events that have occurred;
performing frequent item set mining on the 1st item set to obtain a frequent item set;
parsing the frequent item set to obtain a plurality of subsets;
calculating the confidence between the subsets;
and, for any two subsets, if the confidence between the two subsets is greater than a preset confidence threshold, determining that a strong correlation exists between the events contained in the two subsets.
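As a non-limiting illustration of the dividing step, the following Python sketch groups logs into item sets by fixed-length time window. The window length, the use of timezone-aware UTC timestamps, the function name and the sample events are assumptions made for the example; the application only requires that logs whose generation times fall within the same preset time range end up in the same item set.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_by_window(log_set, window_seconds):
    """Group the events of logs whose generation time falls in the same window.

    log_set: iterable of (generation_time, events) pairs, where generation_time
    is a timezone-aware datetime and events is a list of event strings.
    Returns one item set (list of events) per non-empty window, in time order.
    """
    buckets = defaultdict(list)
    for generated_at, events in log_set:
        window_index = int(generated_at.timestamp()) // window_seconds
        buckets[window_index].extend(events)
    return [buckets[index] for index in sorted(buckets)]


# Hypothetical log set: three logs, the first two within the same minute.
log_set = [
    (datetime(2021, 8, 2, 10, 0, 5, tzinfo=timezone.utc), ["login_failed"]),
    (datetime(2021, 8, 2, 10, 0, 40, tzinfo=timezone.utc), ["account_locked"]),
    (datetime(2021, 8, 2, 10, 1, 10, tzinfo=timezone.utc), ["disk_full"]),
]
item_sets = partition_by_window(log_set, window_seconds=60)
```

Because the number of item sets depends on the window length rather than on the number of logs, a larger log volume fills each item set further instead of creating more item sets.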
Optionally, identifying redundant events in each item set by using a preset regular expression includes:
for each item set, identifying the variables of the events in the item set by using the preset regular expression;
identifying a plurality of events having the same variable as target events;
and randomly selecting one of the target events as an effective event, and identifying the other target events as redundant events.
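As a non-limiting illustration of the redundant-event step, the following Python sketch extracts a variable from each event with a regular expression, groups events that share the same variable, and keeps one randomly chosen event per group. The pattern (an IPv4-like token), the rule that events without a variable are kept unchanged, and all names are assumptions for the example; the application does not specify the preset regular expression.

```python
import random
import re
from collections import defaultdict

# Hypothetical preset regular expression: the variable of an event is taken
# to be an IPv4-like token appearing in the event text.
VARIABLE_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def eliminate_redundant_events(item_set, pattern=VARIABLE_PATTERN, rng=random):
    """Return the effective item set: one event per distinct variable value."""
    groups = defaultdict(list)
    effective = []
    for event in item_set:
        match = pattern.search(event)
        if match is None:
            # Assumption: an event with no recognizable variable is kept as-is.
            effective.append(event)
        else:
            groups[match.group(0)].append(event)
    for target_events in groups.values():
        # One randomly selected target event is effective; the rest are redundant.
        effective.append(rng.choice(target_events))
    return effective


item_set = [
    "connect from 10.0.0.1 ok",
    "retry from 10.0.0.1",
    "connect from 10.0.0.2 ok",
    "service started",
]
effective_item_set = eliminate_redundant_events(item_set)
```

Here the two events mentioning 10.0.0.1 are target events, so exactly one of them survives in the effective item set.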
Optionally, performing frequent item set mining on the 1st item set to obtain a frequent item set includes:
combining the two 1st item sets to obtain a 2nd item set, and calculating the support degree of each event in the 2nd item set;
judging whether the 2nd item set contains an event whose support degree is smaller than a preset support degree threshold;
if the 2nd item set contains no event whose support degree is smaller than the preset support degree threshold, identifying the 2nd item set as a frequent item set;
if the 2nd item set contains an event whose support degree is smaller than the preset support degree threshold, iteratively executing a preset step until the (k+1)-th item set contains no event whose support degree is smaller than the preset support degree threshold, and identifying the (k+1)-th item set as the frequent item set;
where k = 2, 3, 4, ..., and the preset step includes: removing the events whose support degree is smaller than the preset support degree threshold from the k-th item set to obtain a new k-th item set, combining the 1st item set and the new k-th item set to obtain a (k+1)-th item set, calculating the support degree of each event in the (k+1)-th item set, and judging whether the (k+1)-th item set contains an event whose support degree is smaller than the preset support degree threshold.
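As a non-limiting illustration of the mining step, the following Python sketch follows the iteration described above in an Apriori-like form. Several points are assumptions rather than statements of the application: the 2nd item set is read as the pairwise combinations of the events in the 1st item set; each entry of the k-th item set is treated as a candidate combination of k events; candidates having a removed k-event subset are discarded (the standard Apriori pruning); and when no larger candidate can be formed, the last surviving level is returned. All names and sample data are hypothetical.

```python
from itertools import combinations


def support(candidate, transactions):
    """Fraction of effective item sets that contain every event in candidate."""
    return sum(1 for t in transactions if candidate <= t) / len(transactions)


def mine_frequent_item_sets(one_item_set, transactions, min_support):
    if not transactions:
        return set()
    items = sorted(one_item_set)
    k = 2
    # 2nd item set: pairwise combinations of the events in the 1st item set.
    candidates = {frozenset(pair) for pair in combinations(items, 2)}
    while True:
        kept = {c for c in candidates if support(c, transactions) >= min_support}
        if len(kept) == len(candidates):
            # No entry falls below the threshold: this level is the result.
            return kept
        # Preset step: combine the 1st item set with the new k-th item set.
        extended = {c | {item} for c in kept for item in items if item not in c}
        # Assumption: drop candidates with any removed k-event subset.
        next_candidates = {
            c for c in extended
            if all(frozenset(sub) in kept for sub in combinations(c, k))
        }
        if not next_candidates:
            return kept  # Assumption: fall back to the last surviving level.
        candidates = next_candidates
        k += 1


# Hypothetical effective item sets, one per time window.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"d"},
]
frequent = mine_frequent_item_sets({"a", "b", "c", "d"}, transactions, min_support=0.5)
```

In this example the pairs involving event d fall below the threshold at k = 2, so the iteration continues to k = 3, where the single remaining candidate {a, b, c} has support 0.75 and is identified as the frequent item set.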
Optionally, obtaining a plurality of logs collected by a log source and constructing a log set based on each log and the generation time of each log includes:
obtaining a plurality of logs collected by a log source, setting a log source label for each log, and performing wildcard vocabulary initialization processing on each log to obtain the effective logs;
and constructing a log set based on the effective logs and the generation time of each effective log.
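As a non-limiting illustration of constructing the log set, the following Python sketch attaches a log source label to each log and applies a caller-supplied normalization function to each event before ordering the logs by generation time. The application does not detail the wildcard vocabulary initialization processing, so it is represented only by the hypothetical normalize parameter; the record layout, the time ordering and all names are likewise assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    source: str              # log source label
    generated_at: datetime   # generation time of the log
    events: Tuple[str, ...]  # events contained in the log


def build_log_set(
    raw_logs: Iterable[Tuple[datetime, List[str]]],
    source: str,
    normalize: Callable[[str], str] = lambda event: event,
) -> List[LogRecord]:
    """Label each log with its source, normalize its events, and order by time."""
    records = [
        LogRecord(source, generated_at, tuple(normalize(e) for e in events))
        for generated_at, events in raw_logs
    ]
    return sorted(records, key=lambda record: record.generated_at)


# Hypothetical raw logs; str.strip stands in for the normalization step.
raw = [
    (datetime(2021, 8, 2, 10, 1, 10, tzinfo=timezone.utc), ["Disk Full "]),
    (datetime(2021, 8, 2, 10, 0, 5, tzinfo=timezone.utc), [" Login Failed"]),
]
log_set = build_log_set(raw, source="app-server-01", normalize=str.strip)
```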
The functions described in the method of the embodiments of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of the present application that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A log correlation analysis method, comprising:
obtaining a plurality of logs collected by a log source, and constructing a log set based on each log and the generation time of each log, wherein each log contains one or more events;
dividing the logs in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set;
identifying redundant events in each item set by using a preset regular expression, and eliminating the redundant events from each item set to obtain the effective item sets;
querying the events contained in each effective item set to obtain all events that have occurred, and constructing a 1st item set based on all the events that have occurred;
performing frequent item set mining on the 1st item set to obtain a frequent item set;
parsing the frequent item set to obtain a plurality of subsets;
calculating the confidence between the subsets;
and, for any two subsets, if the confidence between the two subsets is greater than a preset confidence threshold, determining that a strong correlation exists between the events contained in the two subsets.
2. The method of claim 1, wherein identifying redundant events in each item set by using a preset regular expression comprises:
for each item set, identifying the variables of the events in the item set by using the preset regular expression;
identifying a plurality of events having the same variable as target events;
and randomly selecting one of the target events as an effective event, and identifying the other target events as redundant events.
3. The method of claim 1, wherein performing frequent item set mining on the 1st item set to obtain a frequent item set comprises:
combining the two 1st item sets to obtain a 2nd item set, and calculating the support degree of each event in the 2nd item set;
judging whether the 2nd item set contains an event whose support degree is smaller than a preset support degree threshold;
if the 2nd item set contains no event whose support degree is smaller than the preset support degree threshold, identifying the 2nd item set as a frequent item set;
if the 2nd item set contains an event whose support degree is smaller than the preset support degree threshold, iteratively executing a preset step until the (k+1)-th item set contains no event whose support degree is smaller than the preset support degree threshold, and identifying the (k+1)-th item set as the frequent item set;
where k = 2, 3, 4, ..., and the preset step includes: removing the events whose support degree is smaller than the preset support degree threshold from the k-th item set to obtain a new k-th item set, combining the 1st item set and the new k-th item set to obtain a (k+1)-th item set, calculating the support degree of each event in the (k+1)-th item set, and judging whether the (k+1)-th item set contains an event whose support degree is smaller than the preset support degree threshold.
4. The method of claim 1, wherein obtaining a plurality of logs collected by a log source and constructing a log set based on each log and the generation time of each log comprises:
obtaining a plurality of logs collected by a log source, setting a log source label for each log, and performing wildcard vocabulary initialization processing on each log to obtain the effective logs;
and constructing a log set based on the effective logs and the generation time of each effective log.
5. A log correlation analysis apparatus, comprising:
an obtaining unit, configured to obtain a plurality of logs collected by a log source and construct a log set based on each log and the generation time of each log, wherein each log contains one or more events;
a dividing unit, configured to divide the logs in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set;
an identification unit, configured to identify redundant events in each item set by using a preset regular expression and eliminate the redundant events from each item set to obtain the effective item sets;
a construction unit, configured to query the events contained in each effective item set to obtain all events that have occurred, and construct a 1st item set based on all the events that have occurred;
a mining unit, configured to perform frequent item set mining on the 1st item set to obtain a frequent item set;
a parsing unit, configured to parse the frequent item set to obtain a plurality of subsets;
a calculation unit, configured to calculate the confidence between the subsets;
and a determining unit, configured to determine, for any two subsets, that a strong correlation exists between the events contained in the two subsets if the confidence between the two subsets is greater than a preset confidence threshold.
6. The apparatus according to claim 5, wherein the identification unit is specifically configured to:
for each item set, identify the variables of the events in the item set by using the preset regular expression;
identify a plurality of events having the same variable as target events;
and randomly select one of the target events as an effective event, and identify the other target events as redundant events.
7. The apparatus according to claim 5, wherein the mining unit is specifically configured to:
combine the two 1st item sets to obtain a 2nd item set, and calculate the support degree of each event in the 2nd item set;
judge whether the 2nd item set contains an event whose support degree is smaller than a preset support degree threshold;
if the 2nd item set contains no event whose support degree is smaller than the preset support degree threshold, identify the 2nd item set as a frequent item set;
if the 2nd item set contains an event whose support degree is smaller than the preset support degree threshold, iteratively execute a preset step until the (k+1)-th item set contains no event whose support degree is smaller than the preset support degree threshold, and identify the (k+1)-th item set as the frequent item set;
where k = 2, 3, 4, ..., and the preset step includes: removing the events whose support degree is smaller than the preset support degree threshold from the k-th item set to obtain a new k-th item set, combining the 1st item set and the new k-th item set to obtain a (k+1)-th item set, calculating the support degree of each event in the (k+1)-th item set, and judging whether the (k+1)-th item set contains an event whose support degree is smaller than the preset support degree threshold.
8. The apparatus according to claim 5, wherein the obtaining unit is specifically configured to:
obtain a plurality of logs collected by a log source, set a log source label for each log, and perform wildcard vocabulary initialization processing on each log to obtain the effective logs;
and construct a log set based on the effective logs and the generation time of each effective log.
9. A computer-readable storage medium comprising a stored program, wherein the program performs the log correlation analysis method of any one of claims 1-4.
10. A log correlation analysis apparatus, comprising: a processor, a memory, and a bus; the processor and the memory are connected through the bus;
the memory is used for storing a program, and the processor is used for executing the program, wherein the program executes the log correlation analysis method according to any one of claims 1 to 4 when running.
CN202110880651.4A 2021-08-02 2021-08-02 Log correlation analysis method and device, storage medium and equipment Pending CN113590558A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110880651.4A CN113590558A (en) 2021-08-02 2021-08-02 Log correlation analysis method and device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110880651.4A CN113590558A (en) 2021-08-02 2021-08-02 Log correlation analysis method and device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN113590558A true CN113590558A (en) 2021-11-02

Family

ID=78253808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110880651.4A Pending CN113590558A (en) 2021-08-02 2021-08-02 Log correlation analysis method and device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113590558A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination