CN113590558A

CN113590558A - Log correlation analysis method and device, storage medium and equipment

Info

Publication number: CN113590558A
Application number: CN202110880651.4A
Authority: CN
Inventors: 周鹏; 葛思江
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2021-08-02
Filing date: 2021-08-02
Publication date: 2021-11-02

Abstract

The application discloses a log correlation analysis method, apparatus, storage medium and device. Each log in a log set is divided according to its generation time to obtain a plurality of item sets. Redundant events in each item set are removed to obtain the valid item sets. The events contained in each valid item set are queried to obtain all events that have occurred, and a 1st item set is constructed from them. Frequent item set mining is performed on the 1st item set to obtain a frequent item set. The frequent item set is analyzed to obtain a plurality of subsets. For any two subsets, if the confidence between them is greater than a preset confidence threshold, a strong correlation is determined to exist between the events contained in the two subsets. The method and apparatus can therefore effectively improve the accuracy of log correlation analysis.

Description

Log correlation analysis method and device, storage medium and equipment

Technical Field

The present application relates to the field of log event correlation analysis, and in particular, to a log correlation analysis method, apparatus, storage medium, and device.

Background

With the development of modern internet technology, enterprises deploy a variety of IT systems both internally and externally, and these systems generate a large volume of logs every day. The logs include operating system logs, network behavior logs, application logs, network infrastructure logs, security host logs, and the like. Log data continuously records the activity of IT devices, contains rich operation, maintenance and security knowledge, and is an important basis for root cause analysis of failures and for tracing security events in information security.

Because the volume of log data is huge and grows rapidly over time, relying on manual analysis is very difficult. As big data technology develops, data storage capacity keeps increasing and the trend toward enterprise data business becomes increasingly evident. Logs are typical unstructured data, and mining valuable information from such massive unstructured data is more and more important. If the correlation between event occurrences can be found in a large number of logs, security events that may be hidden in them can be warned of in time, and the probability or confidence of a security event occurring can be labeled.

At present, the log correlation analysis methods and apparatuses on the market mainly collect log data and perform correlation analysis on it, so as to discover problems or disasters in advance and issue early warnings accordingly. The most common form of log analysis is event correlation analysis. However, event correlation analysis has two problems. First, the item set segmentation it uses extracts all feature data of each log as one item set, which makes the item sets numerous and dense and reduces the accuracy of the analysis. Second, similar feature data in the log information is not classified and integrated, and there is no method or apparatus for uniformly classifying similar events in the logs into the same item. As a result, current log correlation analysis lacks flexibility, can only be performed for specific situations, and cannot dynamically correct log event correlations, which reduces the accuracy of the analysis result.

Disclosure of Invention

The application provides a log correlation analysis method, a log correlation analysis device, a storage medium and a device, and aims to improve the accuracy of log correlation analysis.

In order to achieve the above object, the present application provides the following technical solutions:

a log relevance analysis method, comprising:

acquiring a plurality of logs acquired by a log source, and constructing a log set based on each log and the generation time of each log; each of the logs contains one or more events;

dividing each log in the log set according to the generation time of each log to obtain a plurality of item sets, wherein a plurality of logs whose generation times fall within the same preset time range are divided into the same item set;

identifying redundant events in each item set by using a preset regular expression, and eliminating the redundant events in each item set to obtain each effective item set;

querying the events contained in each effective item set to obtain all events that have occurred, and constructing a 1st item set based on all the events that have occurred;

performing frequent item set mining on the 1st item set to obtain a frequent item set;

analyzing the frequent item set to obtain a plurality of subsets;

calculating confidence levels between the subsets;

and, for any two subsets, if the confidence between the two subsets is greater than a preset confidence threshold, determining that a strong correlation exists between the events contained in the two subsets.

Optionally, the identifying, by using a preset regular expression, a redundant event in each item set includes:

for each item set, identifying variables of events in the item set by using a preset regular expression;

identifying a plurality of events with the same variable as a target event;

and randomly selecting one event from the target events as an effective event, and identifying other events as redundant events.

Optionally, the performing frequent item set mining on the 1st item set to obtain a frequent item set includes:

combining the 1st item set in pairs to obtain a 2nd item set, and calculating the support of each event in the 2nd item set;

judging whether the 2nd item set contains an event whose support is less than a preset support threshold;

if the 2nd item set does not contain an event whose support is less than the preset support threshold, identifying the 2nd item set as a frequent item set;

if the 2nd item set contains an event whose support is less than the preset support threshold, iteratively executing a preset step until the (k+1)th item set does not contain an event whose support is less than the preset support threshold, and identifying the (k+1)th item set as the frequent item set;

wherein k = 2, 3, 4, …, and the preset step includes: removing the events whose support is less than the preset support threshold from the kth item set to obtain a new kth item set, combining the 1st item set and the new kth item set to obtain a (k+1)th item set, calculating the support of each event in the (k+1)th item set, and judging whether the (k+1)th item set contains an event whose support is less than the preset support threshold.

Optionally, the obtaining a plurality of logs collected by a log source, and constructing a log set based on each log and the generation time of each log, includes:

acquiring a plurality of logs acquired by a log source, setting a log source label for each log, and performing wildcard vocabulary initialization processing on each log to obtain each effective log;

and constructing a log set based on the effective logs and the generation time of each effective log.

A log correlation analysis apparatus comprising:

the obtaining unit is used for obtaining a plurality of logs collected by a log source and constructing a log set based on each log and the generation time of each log; each of the logs contains one or more events;

the dividing unit is used for dividing each log in the log set according to the generation time of each log to obtain a plurality of item sets; dividing a plurality of logs of which the generation time belongs to the same preset time range into the same item set;

the identification unit is used for identifying the redundant events in each item set by using a preset regular expression and eliminating the redundant events in each item set to obtain each effective item set;

the construction unit is used for querying the events contained in each effective item set to obtain all events that have occurred, and constructing the 1st item set based on all the events that have occurred;

the mining unit is used for performing frequent item set mining on the 1st item set to obtain a frequent item set;

the analysis unit is used for analyzing the frequent item set to obtain a plurality of subsets;

a calculation unit for calculating a confidence between the subsets;

and the determining unit is used for determining, for any two subsets, that a strong correlation exists between the events contained in the two subsets if the confidence between the two subsets is greater than a preset confidence threshold.

Optionally, the identification unit is specifically configured to:

identifying a plurality of events with the same variable as a target event;

Optionally, the mining unit is specifically configured to:

wherein k = 2, 3, 4, …, and the preset step comprises: removing the events whose support is less than a preset support threshold from the kth item set to obtain a new kth item set, combining the 1st item set and the new kth item set to obtain a (k+1)th item set, calculating the support of each event in the (k+1)th item set, and judging whether the (k+1)th item set contains an event whose support is less than the preset support threshold.

Optionally, the obtaining unit is specifically configured to:

A computer-readable storage medium comprising a stored program, wherein the program performs the log correlation analysis method.

A log correlation analysis device, comprising: a processor, a memory, and a bus; the processor and the memory are connected through the bus;

the memory is used for storing a program, and the processor is used for executing the program, wherein the log correlation analysis method is executed when the program runs.

According to the above technical solution, a plurality of logs collected by a log source are obtained, and a log set is constructed based on each log and the generation time of each log. Each log in the log set is divided according to its generation time to obtain a plurality of item sets. Redundant events in each item set are identified by using a preset regular expression and removed to obtain the valid item sets. The events contained in each valid item set are queried to obtain all events that have occurred, and a 1st item set is constructed from them. Frequent item set mining is performed on the 1st item set to obtain a frequent item set. The frequent item set is analyzed to obtain a plurality of subsets, and the confidence between the subsets is calculated. For any two subsets, if the confidence between them is greater than a preset confidence threshold, a strong correlation is determined to exist between the events contained in the two subsets. In addition, identifying and removing the redundant events in each item set by using the preset regular expression integrates the redundant events (which can be understood as similar events) in the logs, so that they do not affect the log correlation analysis. Compared with the prior art, the solution is therefore more flexible: it is not adversely affected by growth in the number of logs, and it eliminates the adverse effect of redundant events on log correlation analysis, thereby effectively improving the accuracy of the analysis.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a log correlation analysis method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a log collection architecture according to an embodiment of the present disclosure;

Fig. 3 is a schematic structural diagram of a log correlation analysis apparatus according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

As shown in fig. 1, a schematic diagram of a log correlation analysis method provided in an embodiment of the present application includes the following steps:

S101: Obtain a plurality of logs collected by a log source, and construct a log set based on each log and the generation time of each log.

Wherein each log contains one or more events.

Optionally, after the plurality of logs collected by the log source are obtained, a log source tag may be set for each log, and wildcard vocabulary initialization processing may be performed on each log to obtain the valid logs. Finally, the log set is constructed based on the valid logs and the generation time of each valid log.

Generally, the collection process of the log is a common technical means in the art, so the log correlation analysis method described in the present application can be applied to most log collection architectures, and specifically, can be applied to the architecture shown in fig. 2.

In the embodiment of the present application, the expression form of the log is shown as formula (1).

L1 = strr11+x1, strr12+y1, strr13+z1, …   (1)

In formula (1), L1 represents a log, and strr11, strr12 and strr13 are fixed strings. strr11+x represents an event r11, where x is the variable of event r11 and x1 is a value of x. Likewise, strr12+y represents an event r12 with variable y and value y1, and strr13+z represents an event r13 with variable z and value z1. In general, the events strr11+x1 and strr11+x2 both represent the same event r11 and differ only in the value of the variable x.

Further, a combination of the log and its own generation time can be represented by (L1, t1), and t1 represents the generation time of the log L1. Accordingly, the expression form of the log set may be as shown in equation (2).

{(L1, t1), (L2, t2), (L3, t3), (L4, t4), …, (Ln, tn)}   (2)

In the formula (2), n is a positive integer, and the greater the value of n, the higher the accuracy of the log correlation analysis shown in this embodiment is.
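As an illustration of formula (2), the following is a minimal sketch of assembling a log set in which each log is paired with its generation time and a log source tag. The class name `LogRecord`, the function `build_log_set`, the field names and the sample log lines are all hypothetical, and the wildcard vocabulary initialization is not shown because the text does not specify it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    """One (L, t) pair from formula (2), plus a log source tag."""
    source: str             # log source tag (hypothetical field name)
    text: str               # raw log content L
    generated_at: datetime  # generation time t


def build_log_set(raw_logs):
    """Build a log set from (source, text, ISO-8601 timestamp) tuples."""
    return [
        LogRecord(source, text, datetime.fromisoformat(ts))
        for source, text, ts in raw_logs
    ]


# Made-up sample input, for illustration only.
raw = [
    ("firewall", "login failed user=alice", "2021-08-02T10:15:00"),
    ("app", "disk usage pct=91", "2021-08-02T10:20:00"),
]
log_set = build_log_set(raw)
```

Keeping the generation time as a `datetime` value rather than a string makes the later division by time range a simple attribute lookup.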

S102: Divide each log in the log set according to the generation time of each log to obtain a plurality of item sets.

Logs whose generation times fall within the same preset time range are divided into the same item set. The time granularity indicated by the preset time range includes but is not limited to: year, month, week, day, hour, etc. Generally speaking, the larger the total number of logs in the log set, the smaller the time granularity should be set, which improves the accuracy of the log correlation analysis.

Specifically, the log set shown in formula (2) is assumed to be a log set within 365 days, and the time granularity shown in the preset time range is assumed to be days. Correspondingly, according to the generation time, each log in the log set is divided to obtain a plurality of item sets, and the division result is shown as formula (3).

It should be noted that the above specific implementation process is only for illustration.
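The division by generation time can be sketched as follows, assuming a time granularity of one day and a log set given as (log, generation time) pairs. The function name `partition_by_day` and the sample logs are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_day(log_set):
    """Group (log, generation_time) pairs into item sets keyed by calendar day."""
    item_sets = defaultdict(list)
    for log, generated_at in log_set:
        item_sets[generated_at.date()].append(log)
    return dict(item_sets)


# Made-up sample log set in the shape of formula (2).
log_set = [
    ("login failed user=alice", datetime(2021, 8, 2, 10, 15)),
    ("disk usage pct=91", datetime(2021, 8, 2, 23, 59)),
    ("login failed user=bob", datetime(2021, 8, 3, 0, 1)),
]
item_sets = partition_by_day(log_set)
```

A different granularity (for example, hour) would only change the grouping key, not the structure of the loop.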

S103: Identify the redundant events in each item set by using a preset regular expression, and remove the redundant events in each item set to obtain each valid item set.

The specific process of identifying redundant events in each item set by using a preset regular expression is common knowledge familiar to those skilled in the art.

Optionally, for each item set, the variables of the events in the item set are identified by using a preset regular expression; a plurality of events with the same variable are identified as target events; one event is randomly selected from the target events as a valid event, and the other events are identified as redundant events.

Specifically, taking the item sets shown in formula (3) as an example, the events contained in each item set are shown in formula (4). In formula (4), any two events may be the same event, in which case they are redundant and one of them needs to be removed, or they may be different events, in which case both are retained.

It should be noted that, the redundant events in each item set are removed to obtain each effective item set, which can reduce the workload of log correlation analysis and avoid the waste of computing resources, thereby improving the analysis efficiency.
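A minimal sketch of the redundant-event removal in S103 follows. It assumes, purely for illustration, that an event is written as a fixed string ending in `=` followed by the variable's value; the pattern `EVENT_PATTERN` and the helper names are hypothetical. The text selects the retained event at random, whereas this sketch keeps the first occurrence so that the result is reproducible.

```python
import re

# Hypothetical event layout: a fixed string ending in '=' followed by a value.
EVENT_PATTERN = re.compile(r"^(?P<fixed>.*?=)(?P<value>\S+)$")


def event_key(event):
    """Return the fixed part of an event, ignoring the variable's value."""
    match = EVENT_PATTERN.match(event)
    return match.group("fixed") if match else event


def remove_redundant(item_set):
    """Keep one event per fixed string; later occurrences count as redundant."""
    kept = {}
    for event in item_set:
        kept.setdefault(event_key(event), event)
    return list(kept.values())


# Made-up item set: the first two events differ only in the variable's value.
item_set = ["login failed user=alice", "login failed user=bob", "disk usage pct=91"]
valid_item_set = remove_redundant(item_set)
```

Events that do not match the pattern are treated as their own key, so they are only dropped when they repeat verbatim.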

S104: Query the events contained in each valid item set to obtain all events that have occurred, and construct the 1st item set based on all the events that have occurred.

Specifically, taking the item sets shown in formula (4) as an example, the events contained in each valid item set are queried to obtain all events that have occurred, and these events are collected to construct the 1st item set, whose expression form may be as shown in formula (5).
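Collecting all events that have occurred amounts to taking the union of the valid item sets. The sketch below assumes each event is represented by an identifier such as "r11"; the function name and the sample data are hypothetical.

```python
def build_first_item_set(valid_item_sets):
    """Collect every distinct event that occurs in any valid item set."""
    events = set()
    for item_set in valid_item_sets:
        events.update(item_set)
    return sorted(events)


# Made-up valid item sets, one per time range.
valid_item_sets = [["r11", "r12"], ["r12", "r13"], ["r11", "r12", "r13"]]
first_item_set = build_first_item_set(valid_item_sets)
```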

S105: Perform frequent item set mining on the 1st item set by using a frequent item set mining algorithm to obtain a frequent item set.

The specific implementation process of performing frequent item set mining on the 1st item set comprises the following steps:

1. Combine the 1st item set in pairs to obtain the 2nd item set, and calculate the support of each event in the 2nd item set.

In general, the specific process of calculating the support of each event in an item set is common general knowledge familiar to those skilled in the art.

2. Judge whether the 2nd item set contains an event whose support is less than a preset support threshold.

3. If the 2nd item set does not contain an event whose support is less than the preset support threshold, identify the 2nd item set as a frequent item set.

4. If the 2nd item set contains an event whose support is less than the preset support threshold, iteratively execute step 5 until the (k+1)th item set does not contain an event whose support is less than the preset support threshold, and identify the (k+1)th item set as the frequent item set.

Wherein k = 2, 3, 4, ….

5. Remove the events whose support is less than the preset support threshold from the kth item set to obtain a new kth item set, combine the 1st item set and the new kth item set to obtain the (k+1)th item set, calculate the support of each event in the (k+1)th item set, and judge whether the (k+1)th item set contains an event whose support is less than the preset support threshold.

In the embodiment of the application, a frequent item set mining algorithm is used to perform frequent item set mining on the 1st item set to obtain a frequent item set, and the expression form of the frequent item set may be as shown in formula (6).

{strrn1+xn,strrn2+yn,strrn3+zn}(6)

In the formula (6), n1, xn, n2, yn, n3, zn are all fixed values.
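The level-wise mining of S105 resembles the well-known Apriori procedure. The following sketch uses a conventional Apriori-style loop that stops when no candidate survives pruning, which is a swapped-in simplification and not necessarily the exact stopping rule described above. Support is taken to be the fraction of valid item sets containing a candidate, which is also an assumption; all names and the sample data are hypothetical.

```python
def support(candidate, transactions):
    """Fraction of transactions that contain every event in the candidate."""
    hits = sum(1 for t in transactions if candidate <= t)
    return hits / len(transactions)


def mine_frequent_item_sets(transactions, min_support):
    """Apriori-style level-wise mining; returns {frozenset: support}."""
    transactions = [frozenset(t) for t in transactions]
    first_items = sorted({e for t in transactions for e in t})
    # Level 1: single events that meet the support threshold.
    level = {}
    for e in first_items:
        s = support(frozenset([e]), transactions)
        if s >= min_support:
            level[frozenset([e])] = s
    frequent = dict(level)
    while level:
        # Extend each surviving candidate with one frequent single event.
        singles = [next(iter(c)) for c in frequent if len(c) == 1]
        candidates = {c | {e} for c in level for e in singles if e not in c}
        level = {}
        for c in candidates:
            s = support(c, transactions)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
    return frequent


# Made-up valid item sets (one per time range), events named by identifier.
transactions = [
    {"r11", "r12", "r13"},
    {"r11", "r12"},
    {"r11", "r12", "r13"},
    {"r12", "r13"},
]
frequent = mine_frequent_item_sets(transactions, min_support=0.5)
```

With this sample data, the largest surviving item set contains all three events, matching the three-event shape of formula (6).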

S106: Analyze the frequent item set to obtain a plurality of subsets.

Specifically, taking the frequent item set shown in formula (6) as an example, the frequent item set is analyzed, and the obtained subsets include:

{strrn1+xn, strrn2+yn, strrn3+zn}, {strrn1+xn}, {strrn2+yn}, {strrn3+zn}, {strrn1+xn, strrn2+yn}, {strrn2+yn, strrn3+zn}, {strrn1+xn, strrn3+zn}.
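The seven subsets listed above are the non-empty subsets of the three-event frequent item set. A minimal sketch of enumerating them, with hypothetical names and placeholder event identifiers, is:

```python
from itertools import combinations


def non_empty_subsets(item_set):
    """Enumerate every non-empty subset of a frequent item set."""
    items = sorted(item_set)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Placeholder identifiers standing in for the three events of formula (6).
subsets = non_empty_subsets({"r_n1", "r_n2", "r_n3"})
```

A frequent item set with m events yields 2^m - 1 non-empty subsets, so the enumeration grows quickly with m.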

S107: Calculate the confidence between the respective subsets.

The specific process of calculating the confidence between the subsets is well known to those skilled in the art.
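The text treats this calculation as common knowledge and does not state a formula. Assuming the conventional association-rule definitions are intended (an assumption), with T denoting the collection of valid item sets and A, B denoting two subsets of the frequent item set, support and confidence can be written as:

```latex
\mathrm{supp}(A) = \frac{\lvert \{\, t \in T : A \subseteq t \,\} \rvert}{\lvert T \rvert},
\qquad
\mathrm{conf}(A \rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}
```

Under this definition confidence is directional, so conf(A → B) and conf(B → A) generally differ.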

Specifically, taking each subset obtained by analyzing the formula (6) as an example, the process of calculating the confidence level between each subset includes:

{{strrn1+xn,strrn2+yn,strrn3+zn}} → {{strrn1+xn}};

{{strrn1+xn,strrn2+yn,strrn3+zn}} → {{strrn2+yn}};

{{strrn1+xn,strrn2+yn,strrn3+zn}} → {{strrn3+zn}};

……

{{strrn1+xn}} → {{strrn2+yn}}.

S108: For any two subsets, if the confidence between the two subsets is greater than a preset confidence threshold, determine that a strong correlation exists between the events contained in the two subsets.
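The threshold check of S108 can be sketched as follows. The sketch assumes the conventional definition conf(A → B) = supp(A ∪ B) / supp(A), with support measured as the fraction of valid item sets containing a subset, and it only pairs disjoint subsets, which is a simplification. All names, the sample data and the threshold value are hypothetical.

```python
def support(item_set, transactions):
    """Fraction of transactions that contain every event in item_set."""
    hits = sum(1 for t in transactions if item_set <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = supp(A ∪ B) / supp(A); 0.0 if A never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


def strong_correlations(subsets, transactions, min_confidence):
    """Return (A, B, confidence) for disjoint ordered pairs above the threshold."""
    rules = []
    for a in subsets:
        for b in subsets:
            if a & b:
                continue  # simplification: skip overlapping (and identical) pairs
            c = confidence(a, b, transactions)
            if c > min_confidence:
                rules.append((a, b, c))
    return rules


# Made-up valid item sets and single-event subsets.
transactions = [
    frozenset({"r11", "r12", "r13"}),
    frozenset({"r11", "r12"}),
    frozenset({"r11", "r12", "r13"}),
    frozenset({"r12", "r13"}),
]
subsets = [frozenset({"r11"}), frozenset({"r12"}), frozenset({"r13"})]
rules = strong_correlations(subsets, transactions, min_confidence=0.8)
```

The comparison is strict, following the "greater than a preset confidence threshold" wording of S108.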

In summary, dividing the logs into item sets according to the generation time of each log prevents an increase in the number of item sets from degrading the accuracy of the log correlation analysis. In addition, identifying the redundant events in each item set by using a preset regular expression and removing them integrates the redundant events (which can be understood as similar events) in the logs, so that they do not affect the analysis. Compared with the prior art, the solution of this embodiment is therefore more flexible: it is not adversely affected by growth in the number of logs, and it eliminates the adverse effect of redundant events on log correlation analysis, so that the accuracy of the analysis is effectively improved.

Corresponding to the log correlation analysis method provided by the embodiment of the application, the embodiment of the application also provides a log correlation analysis device.

As shown in fig. 3, an architecture diagram of a log correlation analysis apparatus provided in the embodiment of the present application includes:

an obtaining unit 100, configured to obtain multiple logs collected by a log source, and construct a log set based on each log and a generation time of each log; each log contains one or more events.

The obtaining unit 100 is specifically configured to: acquiring a plurality of logs acquired by a log source, setting a log source label for each log, and performing wildcard vocabulary initialization processing on each log to obtain each effective log; and constructing a log set based on the effective logs and the generation time of each effective log.

The dividing unit 200 is configured to divide each log in the log set according to the generation time of each log to obtain a plurality of item sets, wherein logs whose generation times fall within the same preset time range are divided into the same item set.

The identifying unit 300 is configured to identify the redundant events in each item set by using a preset regular expression, and remove the redundant events in each item set to obtain each valid item set.

The identification unit 300 is specifically configured to: for each item set, identifying variables of all events in the item set by using a preset regular expression; identifying a plurality of events with the same variable as a target event; one event is randomly selected from all target events to serve as an effective event, and other events are marked as redundant events.

The constructing unit 400 is configured to query the events contained in each valid item set to obtain all events that have occurred, and construct the 1st item set based on all the events that have occurred.

The mining unit 500 is configured to perform frequent item set mining on the 1st item set to obtain a frequent item set.

The mining unit 500 is specifically configured to: combine the 1st item set in pairs to obtain a 2nd item set, and calculate the support of each event in the 2nd item set; judge whether the 2nd item set contains an event whose support is less than a preset support threshold; if the 2nd item set does not contain such an event, identify the 2nd item set as a frequent item set; if the 2nd item set contains such an event, iteratively execute a preset step until the (k+1)th item set does not contain an event whose support is less than the preset support threshold, and identify the (k+1)th item set as the frequent item set; wherein k = 2, 3, 4, …, and the preset step includes: removing the events whose support is less than the preset support threshold from the kth item set to obtain a new kth item set, combining the 1st item set and the new kth item set to obtain a (k+1)th item set, calculating the support of each event in the (k+1)th item set, and judging whether the (k+1)th item set contains an event whose support is less than the preset support threshold.

The parsing unit 600 is configured to parse the frequent item set to obtain a plurality of subsets.

A calculation unit 700 for calculating a confidence between the respective subsets.

A determining unit 800, configured to determine, for any two subsets, that a strong correlation exists between the events contained in the two subsets if the confidence between the two subsets is greater than a preset confidence threshold.

In summary, dividing the logs into item sets according to generation time prevents an increase in the number of item sets from degrading the accuracy of the log correlation analysis. In addition, identifying the redundant events in each item set by using the preset regular expression and removing them integrates the redundant events (which can be understood as similar events) in the logs, so that they do not affect the analysis. Compared with the prior art, the solution of this embodiment is therefore more flexible: it is not adversely affected by growth in the number of logs, and it eliminates the adverse effect of redundant events on log correlation analysis, so that the accuracy of the analysis is effectively improved.

The application also provides a computer readable storage medium, which includes a stored program, wherein the program executes the log correlation analysis method provided by the application.

The present application further provides a log correlation analysis device, including: a processor, a memory, and a bus. The processor is connected with the memory through a bus, the memory is used for storing programs, and the processor is used for running the programs, wherein the program runs to execute the log correlation analysis method provided by the application, and the method comprises the following steps:

dividing each log in the log set according to the generation time of each log to obtain a plurality of item sets; dividing a plurality of logs of which the generation time belongs to the same preset time range into the same item set;

analyzing the frequent item set to obtain a plurality of subsets;

calculating confidence levels between the subsets;

identifying a plurality of events with the same variable as a target event;

The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, the part of the embodiments of the present application that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A log relevance analysis method, comprising:

analyzing the frequent item set to obtain a plurality of subsets;

calculating confidence levels between the subsets;

2. The method of claim 1, wherein identifying redundant events in each of the item sets using a preset regular expression comprises:

identifying a plurality of events with the same variable as a target event;

3. The method of claim 1, wherein the performing frequent item set mining on the 1st item set to obtain a frequent item set comprises:

4. The method of claim 1, wherein obtaining a plurality of logs collected by a log source and constructing a log set based on each log and a generation time of each log comprises:

5. A log correlation analysis apparatus, comprising:

a calculation unit for calculating a confidence between the subsets;

6. The apparatus according to claim 5, wherein the identification unit is specifically configured to:

identifying a plurality of events with the same variable as a target event;

7. The apparatus according to claim 5, wherein the mining unit is specifically configured to:

8. The apparatus according to claim 5, wherein the obtaining unit is specifically configured to:

9. A computer-readable storage medium comprising a stored program, wherein the program performs the log correlation analysis method of any one of claims 1-4.

10. A log correlation analysis apparatus, comprising: a processor, a memory, and a bus; the processor and the memory are connected through the bus;

the memory is used for storing a program, and the processor is used for executing the program, wherein the program executes the log correlation analysis method according to any one of claims 1 to 4 when running.