CN114615063A

CN114615063A - Attack tracing method and device based on log correlation analysis

Info

Publication number: CN114615063A
Application number: CN202210248730.8A
Authority: CN
Inventors: 王瑞华; 周博雅; 万海; 焦伟; 严人宁; 孙逸伦; 赵曦滨
Original assignee: China Bond Jinke Information Technology Co ltd; Tsinghua University
Current assignee: China Bond Jinke Information Technology Co ltd; Tsinghua University
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-06-10

Abstract

The invention discloses an attack tracing method and device based on log correlation analysis. The method comprises: collecting logs of all levels in a system and constructing a log connection graph according to the association relationships among the logs of all levels; connecting the tracing elements in each single-source tracing graph by using the mark nodes in the log connection graph to obtain a fusion tracing graph; executing an attack tracing algorithm in the fusion tracing graph and mining, based on a short-circuit mechanism, the communication paths formed around dependent explosion nodes in the fusion tracing graph; calculating the evaluation score of each path searched by the attack tracing algorithm, using whether the search path contains a communication path as one of the path evaluation factors; and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm. The method enables attack tracing over log files and addresses the problem in the prior art that approaches to alleviating dependency explosion in attack tracing struggle to achieve the expected effect.

Description

Attack tracing method and device based on log correlation analysis

Technical Field

The invention relates to the technical field of network security, in particular to an attack tracing method and device based on log association analysis.

Background

Advanced Persistent Threats (APT) have recently become one of the most critical cyberspace threats facing enterprises and organizations. APT attacks usually last a long time and use covert techniques that make them difficult to detect. The related art shows that a traceback graph can help a security operator trace back from attack symptoms to the initial attack entry point. In particular, a traceback graph may be constructed by collecting and analyzing system logs, where nodes represent entities (e.g., processes, files, network sockets) and edges represent operations between nodes. However, these methods face serious problems when dealing with long-running applications, which perform a large number of input and output operations during their lifecycle. This results in nodes with a large number of incoming and outgoing edges, as can be seen for the Firefox process in FIG. 1. Such nodes hinder backtracking analysis because every outgoing edge is treated as depending on every incoming edge. This is the so-called "dependency explosion" problem.

Two technologies, taint analysis and binary instrumentation, can divide a long-running process into finer-grained units, each representing one cycle of event processing. The dependencies previously attached to the process can then be bound more accurately to these units, and dependency explosion is weakened because one unit has fewer dependencies than the original process entity. However, both methods have their own problems. Taint analysis usually brings huge time and space overhead and therefore cannot work in APT scenarios with huge data volumes; binary instrumentation typically requires modification of the source code or binary file, which is often difficult in an enterprise environment.

In recent years, researchers have tried to perform dependency analysis in the attack tracing process based on non-instrumented log fusion methods, using not only information from the process itself but also information from outside the process for dependency modeling. They utilize a wider variety of logs, especially application-level logs, to accomplish the process-to-unit partitioning. Because application logs provide high-level semantics of application behavior that are difficult to capture from traditional audit logs, these approaches handle dependency explosion well without binary instrumentation. The design of a dependency analysis method based on log fusion has two key points: the log fusion mechanism and the multi-level log fusion capability. For example, AlChemist provides a general execution model with 135 rules set in it, while UIScope uses only one rule (timestamp matching) to merge the GUI-layer traceback graph and the audit-layer traceback graph. Intuitively, an ideal dependency analysis framework based on log fusion should have the following characteristics: first, the fusion mechanism needs to be generic and stable, so that a change of data source does not require a large change in the rules or even a complete rewrite; second, the framework needs to be able to fuse as many log levels and categories as possible, since more information brings more semantics and thereby enables the analysis algorithm to derive more precise dependencies. However, AlChemist cannot cover all execution models and its fusion rules are not complete, while UIScope can only fuse two data sources of specific levels and is therefore not universal.
Therefore, existing dependency analysis approaches struggle to achieve the expected effect in alleviating dependency explosion during attack tracing and cannot accurately restore the causal relationships between attack steps.

Disclosure of Invention

The invention provides an attack tracing method and device based on log association analysis. Log association analysis is used to merge the tracing graphs of multiple data sources, and attack tracing based on a short-circuit mechanism is executed on the merged tracing graph. This addresses the problem in the prior art that dependency analysis struggles to achieve the expected effect in alleviating dependency explosion during attack tracing and therefore cannot accurately restore the causal relationships between attack steps. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides an attack tracing method based on log association analysis, where the method includes:

collecting logs of all levels in a system, and constructing a log connection graph according to the association relationships among the logs of all levels, wherein in the log connection graph log nodes and mark nodes appear alternately, one log node parses at least one mark node, and one mark node is parsed by at least one log node;

connecting the tracing elements in each single-source tracing graph by using the marking nodes in the log connection graph to obtain a fusion tracing graph, wherein the single-source tracing graph is obtained by converting log files of each data source by using a log analysis algorithm, and the tracing elements comprise nodes and edges;

executing an attack tracing algorithm in the fusion tracing graph, and mining a communication path formed around a dependent explosion node in the fusion tracing graph based on a short circuit mechanism, wherein the dependent explosion node is a node of which the sum of an input edge and an output edge is greater than a preset value;

and, for each step of the search path in the attack tracing algorithm, calculating the evaluation score of each path searched by the attack tracing algorithm, using whether the search path contains the communication path as one of the path evaluation factors, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm.

Optionally, the constructing a log connection graph according to the association relationship among the logs of each layer includes:

taking each row of logs as a log node, analyzing the characteristics expressing the association relation between the log nodes in the logs of each layer, and forming a marked node set;

and aiming at each mark node in the mark node set, forming a connecting link by using the mark nodes analyzed by at least two log nodes, connecting the at least two log nodes with the mark nodes according to the connecting link, and drawing a log connection graph.

Optionally, the analyzing the feature that expresses the association relationship between the log nodes in the logs of each layer to form a marked node set includes:

for log nodes in the same level, mining common patterns representing the association relationships between the log nodes in that level by using frequent item analysis, to form a mark node set;

for cross-level log nodes, expressing the features of the association relationships between the log nodes by using pre-enumerated cross-level connection marks, to form a mark node set.

Optionally, for each marker node in the marker node set, a connection link is formed by using the marker nodes analyzed by at least two log nodes, and the at least two log nodes are connected with the marker node according to the connection link, so as to draw a log connection graph, including:

for each mark node in the mark node set, judging whether the mark node is also parsed by other log nodes, and if so, taking the mark node as a mark node parsed by at least two log nodes;

and forming a connection link with the mark node to express the sharing relationship between the mark node and the log nodes, and drawing a log connection graph according to the sharing relationship, wherein the mark node is connected with the at least two log nodes associated with it in the sharing relationship.

Optionally, the connecting the tracing elements in each single-source tracing graph by using the marker nodes in the log connection graph to obtain the fusion tracing graph includes:

traversing all the marked nodes in the log connection graph, inquiring the incidence relation among the log nodes in the log connection graph, and analyzing the tracing elements with the incidence relation in each single-source tracing graph according to the incidence relation among the nodes;

and connecting the tracing elements with the incidence relation in the single-source tracing graphs to obtain a fusion tracing graph.
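The fusion step above can be illustrated with a minimal Python sketch. It assumes that every tracing element remembers the id of the log line it was derived from; that mapping, the function name `fuse_tracing_graphs`, and the dictionary-based inputs are illustrative assumptions and are not specified by the source.

```python
from itertools import combinations


def fuse_tracing_graphs(mark_to_logs, log_to_element):
    """Return fusion edges linking tracing elements whose logs share a mark.

    mark_to_logs: dict mapping a mark, e.g. ("ip", "172.16.0.1"), to the set
        of log ids it was parsed from (a view of the log connection graph).
    log_to_element: dict mapping a log id to (source_name, element_id), the
        tracing element that log produced in its single-source tracing graph
        (assumed mapping, for illustration only).
    """
    fused_edges = set()
    for mark, log_ids in mark_to_logs.items():
        # Collect the tracing elements reachable through this mark node.
        elements = sorted(
            {log_to_element[log_id] for log_id in log_ids if log_id in log_to_element}
        )
        # Every pair of distinct elements sharing the mark gets a fusion edge.
        for left, right in combinations(elements, 2):
            fused_edges.add((left, right, mark))
    return fused_edges
```

A mark parsed from only one log yields no fusion edge, which matches the requirement that a connection link needs at least two log nodes.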

Optionally, the executing an attack tracing algorithm in the fused tracing graph, and mining a communication path formed around a dependent explosion node in the fused tracing graph based on a short-circuit mechanism includes:

determining in advance the nodes in the fusion tracing graph whose sum of input edges and output edges is greater than a preset value as dependent explosion nodes;

and executing an attack tracing algorithm in the fusion tracing graph, and, when the attack tracing algorithm reaches a dependent explosion node, carrying out a depth-first search on the paths around the dependent explosion node based on a short-circuit mechanism, and mining the communication path formed around the dependent explosion node.
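The text does not spell out the short-circuit mechanism at this point, so the following Python sketch shows only one possible reading: dependent explosion nodes are found by a degree threshold, and a bounded depth-first search then looks for paths that lead from a predecessor of such a node to one of its successors without passing through the node itself. The function names, the edge-list representation, and the `max_depth` limit are assumptions made for illustration.

```python
def find_explosion_nodes(edges, threshold):
    """Nodes whose number of input edges plus output edges exceeds threshold."""
    degree = {}
    for src, dst in edges:
        degree[src] = degree.get(src, 0) + 1
        degree[dst] = degree.get(dst, 0) + 1
    return {node for node, count in degree.items() if count > threshold}


def bypass_paths(edges, hub, max_depth=4):
    """Depth-first search for paths around `hub` (assumed interpretation).

    A returned path starts at a predecessor of `hub`, ends at a successor of
    `hub`, and never visits `hub` itself.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    hub_preds = {src for src, dst in edges if dst == hub}
    hub_succs = {dst for src, dst in edges if src == hub}
    found = []

    def dfs(node, path):
        if node in hub_succs and len(path) > 1:
            found.append(list(path))
            return
        if len(path) > max_depth:  # bound the search depth
            return
        for nxt in successors.get(node, []):
            if nxt != hub and nxt not in path:
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    for pred in sorted(hub_preds):
        dfs(pred, [pred])
    return found
```

In this reading, a non-empty result means the tracing algorithm can follow the bypass instead of expanding every edge of the high-degree node.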

Optionally, the calculating, by using the search path including the communication path as one of path evaluation factors for each step number of the search path in the attack tracing algorithm, an evaluation score of each path searched by the attack tracing algorithm, and selecting a path with a highest evaluation score as an attack chain output by the attack tracing algorithm includes:

for each step of the search path in the attack tracing algorithm, setting the time difference between log nodes, the occurrence frequency of log nodes, and whether the search path contains the communication path as path evaluation factors;

weighting and summing the different path evaluation factors to obtain the evaluation score of each path searched by the attack tracing algorithm, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm.
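As a hedged illustration of the weighted scoring described above, the Python sketch below combines the three named factors into one score and picks the best path. The source names the factors but not their formulas or weights, so the normalisation of each term, the direction in which each factor counts (closer in time and rarer is better), the default weights, and all identifiers are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CandidatePath:
    nodes: list
    time_gap: float    # time difference between log nodes on the path, in seconds
    frequency: float   # relative occurrence frequency of the log nodes, in [0, 1]
    uses_bypass: bool  # True if the path contains a communication path


def score(path, w_time=1.0, w_freq=1.0, w_bypass=1.0):
    """Weighted sum of the three path evaluation factors (assumed formulas)."""
    time_term = 1.0 / (1.0 + path.time_gap)  # assumption: smaller gap scores higher
    freq_term = 1.0 - path.frequency         # assumption: rarer events score higher
    bypass_term = 1.0 if path.uses_bypass else 0.0
    return w_time * time_term + w_freq * freq_term + w_bypass * bypass_term


def select_attack_chain(paths, **weights):
    """Return the candidate path with the highest evaluation score."""
    return max(paths, key=lambda p: score(p, **weights))
```

In a step-wise search, `select_attack_chain` would be called on the candidate paths available at each step; the weights are the natural place to tune how strongly a communication path is preferred.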

Optionally, after calculating, for each step of the search path in the attack tracing algorithm, the evaluation score of each path searched by the attack tracing algorithm using whether the search path contains the communication path as one of the path evaluation factors, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm, the method further includes:

reconstructing an attack chain of each log data set aiming at the preset log data sets containing different attack scenes, and acquiring a reference standard for evaluating the attack from the attack chain of each log data set;

outputting the attack chain obtained by tracing the log data set with the attack tracing algorithm based on log association analysis;

and according to the reference standard for evaluating the attack, evaluating the accuracy of the attack chain obtained by tracing the log data set.

Optionally, the reconstructing an attack chain of each log data set for log data sets containing different attack scenarios preset, and obtaining a reference standard for evaluating an attack, includes:

reconstructing an attack chain of each log data set by using log data sets which are preset to contain different attack scenes, and marking a sensitive entity in the attack chain;

and matching the attack logs from the log data set by using the identification information corresponding to the sensitive entity to serve as a reference standard for evaluating the attack.
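The evaluation procedure above can be sketched in Python as follows. The source only says that accuracy is evaluated against the reference standard; using precision and recall as the concrete measures, matching identifiers by substring, and the function names are all assumptions made for this sketch.

```python
def build_ground_truth(logs, sensitive_ids):
    """Ids of log lines that mention an identifier of a marked sensitive entity.

    logs: dict mapping log id to raw log text.
    sensitive_ids: identifiers (e.g. IPs, file names) of the sensitive
        entities marked in the reconstructed attack chain.
    """
    return {
        log_id
        for log_id, text in logs.items()
        if any(identifier in text for identifier in sensitive_ids)
    }


def precision_recall(traced_log_ids, ground_truth):
    """Compare the traced attack chain with the reference attack logs."""
    traced = set(traced_log_ids)
    hits = len(traced & ground_truth)
    precision = hits / len(traced) if traced else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```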

In a second aspect, an embodiment of the present invention provides an attack tracing apparatus based on log association analysis, where the apparatus includes:

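the building unit is used for collecting logs of all levels in a system and constructing a log connection graph according to the association relationships among the logs of all levels, wherein in the log connection graph log nodes and mark nodes appear alternately, one log node parses at least one mark node, and one mark node is parsed by at least one log node;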

the connection unit is used for connecting the tracing elements in each single-source tracing graph by using the marking nodes in the log connection graph to obtain a fusion tracing graph, wherein the single-source tracing graph is obtained by converting log files of each data source by using a log analysis algorithm, and the tracing elements comprise nodes and edges;

the mining unit is used for executing an attack tracing algorithm in the fusion tracing graph and mining a communication path formed around a dependent explosion node in the fusion tracing graph based on a short circuit mechanism, wherein the dependent explosion node is a node of which the sum of an input edge and an output edge is greater than a preset value;

and the selecting unit is used for calculating, for each step of the search path in the attack tracing algorithm, the evaluation score of each path searched by the attack tracing algorithm, using whether the search path contains the communication path as one of the path evaluation factors, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm.

Optionally, the building unit includes:

the analysis module is used for analyzing the characteristics of the incidence relation among the log nodes in the logs of all levels by taking each row of log as a log node to form a marked node set;

and the drawing module is used for forming, for each mark node in the mark node set, a connection link by using the mark nodes parsed by at least two log nodes, connecting the at least two log nodes with the mark node according to the connection link, and drawing a log connection graph.

Optionally, the parsing module is specifically configured to, for log nodes in the same level, mine common patterns representing the association relationships between the log nodes in that level by using frequent item analysis, and form a mark node set;

the parsing module is further specifically configured to, for cross-level log nodes, express the features of the association relationships between the log nodes by using pre-enumerated cross-level connection marks, and form a mark node set.

Optionally, the drawing module is specifically configured to, for each marker node in the marker node set, determine whether the marker node is analyzed by other log nodes, and use the marker node as a marker node analyzed by at least two log nodes;

the drawing module is specifically further configured to form a connection link with the mark node to express a sharing relationship between the mark node and the log node, and draw a log connection graph according to the sharing relationship, where the mark node is connected with at least two log nodes associated with the mark node in the sharing relationship.

Optionally, the connection unit includes:

the query module is used for traversing all the mark nodes in the log connection graph, querying the incidence relation among the log nodes in the log connection graph, and analyzing the tracing elements with the incidence relation in each single-source tracing graph according to the incidence relation among the nodes;

and the connection module is used for connecting the tracing elements with the incidence relation in each single-source tracing graph to obtain a fusion tracing graph.

Optionally, the mining unit includes:

the determining module is used for determining in advance the nodes in the fusion tracing graph whose sum of input edges and output edges is greater than a preset value as dependent explosion nodes;

and the mining module is used for executing an attack tracing algorithm in the fusion tracing graph, and, when the attack tracing algorithm reaches a dependent explosion node, performing a depth-first search on the paths around the dependent explosion node based on a short-circuit mechanism, and mining the communication path formed around the dependent explosion node.

Optionally, the selecting unit includes:

the setting module is used for setting, for each step of the search path in the attack tracing algorithm, the time difference between log nodes, the occurrence frequency of log nodes, and whether the search path contains the communication path as path evaluation factors;

and the selecting module is used for weighting and summing the different path evaluation factors to obtain the evaluation score of each path searched by the attack tracing algorithm, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm.

Optionally, the apparatus further comprises:

the obtaining unit is used for reconstructing an attack chain of each log data set for preset log data sets containing different attack scenarios, and obtaining a reference standard for evaluating the attack from the attack chain of each log data set;

the output unit is used for outputting the attack chain obtained by tracing the log data set with the attack tracing algorithm based on log association analysis;

and the evaluation unit is used for evaluating the accuracy of the attack chain obtained by tracing the log data set according to the reference standard for evaluating the attack.

Optionally, the obtaining unit includes:

the marking module is used for reconstructing an attack chain of each log data set by using the preset log data sets containing different attack scenarios, and marking the sensitive entities in the attack chain;

and the matching module is used for matching the attack logs from the log data set by utilizing the identification information corresponding to the sensitive entity and taking the attack logs as a reference standard for evaluating the attack.

In a third aspect, an embodiment of the present invention provides a storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to implement the method of the first aspect.

In a fourth aspect, an embodiment of the present invention provides an attack tracing apparatus based on log association analysis, including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.

As can be seen from the above, the attack tracing method and apparatus based on log association analysis according to the embodiments of the present invention collect logs of each level in a system and construct a log connection graph according to the association relationships among those logs, and can thereby effectively fuse logs of different levels and different data sources. In the log connection graph, log nodes and mark nodes appear alternately; one log node parses at least one mark node, and one mark node is parsed by at least one log node. The mark nodes in the log connection graph are then used to connect the tracing elements in each single-source tracing graph to obtain a fusion tracing graph, where each single-source tracing graph is obtained by converting the log files of a data source with a log parsing algorithm and the tracing elements include nodes and edges. An attack tracing algorithm is then executed in the fusion tracing graph, and the communication paths formed around dependent explosion nodes in the fusion tracing graph are mined based on a short-circuit mechanism. The communication paths strengthen the causal relationships between events while taking the association relationships between logs into account. Compared with the prior-art approach of alleviating dependency explosion through dependency analysis during attack tracing, the attack chain output here can accurately restore the causal relationships between attack steps, so that dependent explosion nodes are effectively bypassed during tracing and the dependency explosion problem is alleviated.

In addition, the technical effects that the embodiment can also realize include:

(1) The attack tracing approach adopts a general tracing graph fusion framework based on log correlation analysis, which can merge the tracing graphs of the individual data sources to generate a general tracing graph rich in high-level semantics.

(2) The attack tracing approach uses a short-circuit-based attack tracing algorithm on the fusion tracing graph, which can effectively alleviate the dependency explosion problem and improve the accuracy of attack tracing.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is to be understood that the drawings in the following description are merely exemplary of some embodiments of the invention. For a person skilled in the art, without inventive effort, further figures can be obtained from these figures.

FIG. 1 is a diagram illustrating an example of dependency explosion in a Firefox process according to an embodiment of the present invention;

FIG. 2 is a flowchart of an attack tracing method based on log association analysis according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the hierarchical structure of a log connection graph according to an embodiment of the present invention;

FIG. 4 is a block diagram illustrating a merging mechanism for tracing graph elements according to an embodiment of the present invention;

FIG. 5 is a schematic system flow diagram of the attack tracing method provided in the embodiment of the present invention as applied to a dependency analysis and attack tracing framework;

FIG. 6 is a diagram of the attack chain of a data set and the topology of the system according to the embodiment of the present invention;

FIG. 7 is a diagram illustrating a short-circuit mechanism according to an embodiment of the present invention;

FIG. 8 shows the growth in data volume of the APT-microservice data set provided by the embodiment of the present invention over 9 days;

FIG. 9 shows memory usage statistics of the APT-microservice data set according to an embodiment of the present invention;

FIG. 10 is a comparison of a log connection graph generated by Pro-Navigator and a log connection graph generated by HERCULE according to an embodiment of the present invention;

FIG. 11 is a block diagram of an attack tracing apparatus based on log association analysis according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It should be apparent that the described embodiments are only some of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

It is to be noted that the terms "comprises" and "comprising" and any variations thereof in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

The invention provides an attack tracing method and device based on log association analysis. Log association analysis is used to merge the tracing graphs of multiple data sources, and attack tracing based on a short-circuit mechanism is executed on the merged tracing graph. This addresses the problem in the prior art that dependency analysis struggles to achieve the expected effect in alleviating dependency explosion during attack tracing and therefore cannot accurately restore the causal relationships between attack steps. To address the difficulty that dependency explosion causes for attack tracing in APT scenarios, solutions in the related art use log fusion to exploit multi-level information, and obtain results exceeding those of the traditional binary instrumentation and taint analysis methods. However, these fusion mechanisms are built either on complex fusion rules that lack flexibility or on overly simple rules that can handle only special scenarios. The embodiment of the invention provides a general dependency analysis and attack tracing framework based on tracing graph fusion that is both general and flexible. It merges the tracing graphs of multiple data sources by log association analysis to generate a fusion tracing graph; because the merging operates at the level of the tracing graph, an analyst can ignore the differences between data sources. An attack tracing algorithm based on a short-circuit mechanism is then designed on the fusion tracing graph, so that dependent explosion nodes can be bypassed and the dependency explosion problem alleviated.

The following provides a detailed description of embodiments of the invention.

FIG. 2 is a schematic flowchart of an attack tracing method based on log association analysis according to an embodiment of the present invention. The method may comprise the following steps:

S100: Collecting logs of all levels in the system, and constructing a log connection graph according to the association relationships among the logs of all levels.

In the log connection graph, log nodes and mark nodes appear alternately, one log node parses at least one mark node, and one mark node is parsed by at least one log node. The log connection graph can be constructed according to these association relationships.

It should be noted that the log connection graph in the related art was first proposed by HERCULE, which defines an association relationship between individual log lines. It treats each log line as a node, parses attributes from the log, and provides 29 equations for determining association. If two log lines satisfy some of the equations, they are directly connected by an undirected edge in the log connection graph. The time complexity of this approach is O(N^2), since each newly arriving log line must be checked against all the log lines before it. This brings a heavy time and space load, making it unusable in APT scenarios. The embodiment of the invention changes the format of the log connection graph used in the related art: a mark set is parsed from the logs and used as the connection link for constructing the log connection graph, which greatly reduces the query complexity, and the resulting near-linear complexity works well in APT scenarios.

Specifically, in constructing the log connection graph according to the association relationships among the logs of each level, each log line is taken as a log node, and the features expressing the association relationships between log nodes are parsed from the logs of each level to form a mark node set. Then, for each mark node in the mark node set, a connection link is formed from the mark nodes parsed by at least two log nodes, the at least two log nodes are connected with the mark node according to the connection link, and the log connection graph is drawn. In an actual application scenario, the hierarchical structure of the log connection graph may be as shown in FIG. 3. The construction proceeds according to Algorithm 1: the input parameter Lall represents all collected logs, and PARSELOGTAGS represents the mark parsing process described above and returns the parsed mark set. For each mark, the algorithm checks whether the mark exists in the hash table. If it does not exist, a mark node is created and added to the hash table; if it exists, the corresponding node is taken directly from the hash table. Finally, the mark node and the log node are connected. The query complexity of the hash table is O(1). The system supports k types of marks (k is a small constant, so k hash tables need to be stored), and each log is checked type by type to determine which marks can be parsed from it. With N logs in total, the time complexity is O(kN), and this near-linear complexity works well in APT scenarios.

Algorithm 1: Constructing the log connection graph
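The listing of Algorithm 1 is not reproduced in this text. The Python sketch below follows only the prose description of the algorithm (one hash table per mark type, look up or create the mark node, then connect it to the log node). The function `parse_log_tags` stands in for PARSELOGTAGS; the parser interface, the `user=` pattern, and the sample log lines are hypothetical and chosen only to make the sketch runnable.

```python
import re

USER_RE = re.compile(r"user=(\w+)")  # hypothetical mark type for the demo


def parse_log_tags(log_line, parsers):
    """Stand-in for PARSELOGTAGS: return the (mark_type, value) marks of a log."""
    tags = set()
    for mark_type, parser in parsers.items():
        for value in parser(log_line):
            tags.add((mark_type, value))
    return tags


def build_log_connection_graph(all_logs, parsers):
    """Connect each log node to its mark nodes using k hash tables.

    Returns the hash tables (one per mark type) and the list of
    (log_index, mark_node) edges. Each lookup is O(1), so N logs and
    k mark types give O(kN) work overall.
    """
    tag_tables = {mark_type: {} for mark_type in parsers}
    edges = []
    for index, line in enumerate(all_logs):
        for mark_type, value in sorted(parse_log_tags(line, parsers)):
            table = tag_tables[mark_type]
            if value not in table:
                table[value] = (mark_type, value)  # create the mark node
            edges.append((index, table[value]))    # connect log node and mark node
    return tag_tables, edges


demo_logs = ["login user=alice from tty1", "sudo user=alice cmd=ls", "cron tick"]
demo_tables, demo_edges = build_log_connection_graph(demo_logs, {"user": USER_RE.findall})
```

In the demo, the first two log lines end up attached to the same mark node, so they are associated through it without any pairwise comparison of log lines.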

In the embodiment of the invention, the input of dependency analysis and attack tracing is logs of different levels, which have different formats and are separated by a semantic gap. To express the association relationships between logs, some features, called marks, need to be parsed from the logs. When two logs share any identical mark, they are considered to be directly associated, and a log connection graph can be drawn according to the relationships between logs and marks. Because log nodes may belong to the same level or to different levels, two cases are handled: for log nodes in the same level, common patterns representing the association relationships between log nodes in that level are mined by frequent item analysis to form the mark node set; for cross-level log nodes, the features of the association relationships between log nodes are expressed by pre-enumerated cross-level connection marks to form the mark node set.

Specifically, for mark parsing of same-level logs, unlike HERCULE, which uses predefined features to connect logs, the embodiment of the present invention uses frequent item analysis to select the mark set for same-level log association. Logs of the same level generally have similar formats: for example, network-level logs always have the two columns <ip, port> or the three columns <url, http_status, method>, audit-level logs always have the six columns <uid, gid, suid, sgid, fuid, fgid>, and GUI-level logs always have the two columns <x_pos, y_pos>. When multiple log data sources in the same layer are processed, mining their common frequent items reveals common patterns that can be used to connect different logs within the layer. This approach is very flexible and can be customized for any layer. For example, the applications of some companies have a unified and unique log format, and their association analysis can be completed without predefined rules.
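As an illustration of mining common frequent items within one layer, the following Python sketch counts how many log sources contain each field combination and keeps the combinations that reach a minimum support. The source names, field names, and thresholds are hypothetical, and the embodiment does not specify which frequent item mining procedure it uses; this brute-force count is only one simple possibility.

```python
from collections import Counter
from itertools import combinations


def frequent_field_sets(sources, min_support=2, max_size=2):
    """Return field combinations present in at least `min_support` log sources."""
    counts = Counter()
    for fields in sources.values():
        for size in range(1, max_size + 1):
            # Sorting makes each combination a canonical tuple.
            for combo in combinations(sorted(fields), size):
                counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}


# Hypothetical network-layer sources and the columns each one records.
network_sources = {
    "firewall": {"ip", "port", "action"},
    "proxy": {"ip", "port", "url"},
    "dns": {"ip", "query"},
}
common = frequent_field_sets(network_sources)
```

Here the pair (ip, port) is shared by two sources and so qualifies as a candidate mark for connecting logs inside the network layer, whereas source-specific fields such as url are discarded.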

Specifically, for mark parsing of cross-level logs, logs of different levels often cannot be parsed into the same mark types and their formats differ greatly, yet an association relationship sometimes does exist. For example, a row of Apache log "172.16.0.1 ... GET /index.html HTTP/1.1" can be parsed into an IP mark with the value "172.16.0.1"; similarly, the audit log may record a row for an ssh remote connection containing "hostname 172.16.0.1". There is clearly a relationship between the two rows: the IP first launched a web penetration attack, then obtained the ssh password of the host, and then logged in to the host directly through ssh. In this example, the IP can serve as a mark associating two logs that belong to the server layer and the audit layer, respectively. By analyzing a large number of logs, 65 possible cross-layer connection marks were summarized, and the 5 most common ones are listed in Table 1. Compared with methods that use only timestamps, such as UIScope, the feature set used in the embodiment of the present invention supports more general and more complex cross-layer log association analysis.

Table 1: cross-layer connection tag enumeration

Mark	Constraint condition
Timestamp	The two rows of logs are recorded at the same time
Processes	The two rows of logs have the same process number or parent process number
SNAME	The same character string (such as a file name) can be extracted from both rows of logs
PNAME	The file paths in the two rows of logs are identical
IPNAME	The IP addresses in the two rows of logs are the same
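The constraint conditions of Table 1 can be read as predicates over two parsed log rows. The Python sketch below is one hedged interpretation: the record layout (keys ts, pid, ppid, strings, path, ip), the time window, and the reading of the Processes constraint as "any process or parent process number in common" are assumptions made for illustration only.

```python
def shared_marks(a, b, time_window=0.0):
    """Return the names of the Table 1 marks that connect parsed rows a and b."""
    marks = set()
    if abs(a["ts"] - b["ts"]) <= time_window:
        marks.add("Timestamp")
    # Assumed reading: the rows share a process number or parent process number.
    procs_a = {a.get("pid"), a.get("ppid")} - {None}
    procs_b = {b.get("pid"), b.get("ppid")} - {None}
    if procs_a & procs_b:
        marks.add("Processes")
    if set(a.get("strings", ())) & set(b.get("strings", ())):
        marks.add("SNAME")
    if a.get("path") is not None and a.get("path") == b.get("path"):
        marks.add("PNAME")
    if a.get("ip") is not None and a.get("ip") == b.get("ip"):
        marks.add("IPNAME")
    return marks


# Hypothetical parsed rows modelled on the Apache and ssh example in the text.
apache_row = {"ts": 100.0, "ip": "172.16.0.1", "path": "/index.html"}
audit_row = {"ts": 160.0, "ip": "172.16.0.1", "pid": 4242}
cross_layer = shared_marks(apache_row, audit_row)
```

For the two example rows only the IPNAME mark applies, which matches the description of the IP acting as the mark between the server layer and the audit layer.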

Some methods use logs directly for attack investigation without parsing the entities in the logs. In contrast to the causal relationships in a tracing graph, a log connection graph allows tracing to start from a problematic log and follow association relationships directly, so that a large number of attack-related logs are found and used directly for attack investigation. However, a method that completes attack investigation directly on the log connection graph has coarse tracing granularity: it considers only the log level and ignores the rich semantics contained in the logs and the rich causal relationships that can be parsed from those semantics. Therefore, the attack tracing analysis method provided by the embodiment of the invention uses the log connection graph only as the rule for merging tracing graphs and does not trace directly on the log connection graph, which improves the accuracy of attack tracing.

Attack tracing algorithms usually run on a tracing graph and include backward analysis and forward analysis. The backward analysis algorithm finds the origin of an attack: it traces back from a symptom event and obtains a causal chain according to the timestamps of events. The forward analysis algorithm finds the series of events affected by the attack, and usually takes the nodes found by backward analysis as its input. Traditional tracing analysis algorithms suffer from the dependency explosion problem, and because APT attacks run for a long time and are covert, this brings heavy system load and many false alarms. When tracing graphs are constructed for some typical applications, the method draws on the idea of backward analysis, and some graph construction strategies are formulated for the tracing graph of the GUI layer, so that the attack tracing algorithm can be better applied to various scenarios. Correspondingly, if the tracing graph of each data source is not constructed with high quality, the effect of the USR algorithm is also reduced. The attack tracing algorithm provided by the embodiment of the invention may be called Prov-Navigator and is applied in a precise dependency analysis and attack tracing framework based on log association. In attack tracing, the backward analysis algorithm acts as a navigator that leads the analyst across different data sources, continually searching for shortcuts to explore a deeper and more complete attack path.

S110: connecting the tracing elements in the single-source tracing graphs by using the marked nodes in the log connection graph to obtain a fusion tracing graph.

The single-source tracing graph is obtained by converting the log file of each data source with a log parsing algorithm, and the tracing elements comprise nodes and edges. The tracing graph is a directed acyclic graph whose direction follows time, and it represents the causal relationships between subjects (processes, threads and the like) and objects (files, registry entries, network sockets and the like) in the system. The tracing graph can express the causal relationship between two events no matter how long the time interval between them is, so a security expert can use it to complete attack investigation. Since BackTracker, most work has used system logs to generate the tracing graph.

The tracing graph in the embodiment of the invention follows the construction approaches of the related art, including Beep, ProTracer, UIScope and backing. Log parsing is used to obtain entities and edges, where the edges represent the tracing relationships between the entities, and the logs of each data source are converted into a single tracing graph according to the algorithm of the log connection graph.

It can be understood that, since the traceback graph is a directed graph, each new edge introduced between the single-source traceback graphs must be given a direction, and the direction of the edge represents the time relationship between the traceback elements. FIG. 4 shows the merging mechanism for traceback graph elements in the embodiment of the present invention and illustrates how the direction of the edge is determined in different situations. In particular, when both traceback elements are edges, the sortenties function is used to ensure that the generated traceback graph is loop-free.

Specifically, the association relationships among the log nodes in the log connection graph can be queried by traversing all the marked nodes in the log connection graph, and the tracing elements that are associated in the single-source tracing graphs are determined from the association relationships among the log nodes; the associated tracing elements in the single-source tracing graphs are then connected to obtain the fusion tracing graph. In the process of merging the separate tracing graphs into one fusion tracing graph, once the log connection graph CLG has been obtained, merging only requires connecting the tracing graph elements (points or edges) parsed from two directly associated rows of logs. Only three cases need to be considered in the merging process: point-to-point matching, point-to-edge matching, and edge-to-edge matching. In the following single-source tracing graph merging algorithm, the parameter CLG represents the constructed log connection graph, and Ps represents the set of individual tracing graphs. Because two associated rows of logs are directly connected through a mark in the log connection graph CLG, the fusion tracing graph can be constructed simply by traversing all the marks and processing each pair of directly connected logs with the PROGRAPHMERGE function.

Algorithm 2: merging multiple single-source traceback graphs
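The three matching cases (point-to-point, point-to-edge, edge-to-edge) can be illustrated with the Python sketch below. It does not reproduce the merging mechanism of FIG. 4 or the PROGRAPHMERGE function; the only property taken from the text is that the new edge is oriented by time. The Node and Edge types, their ts fields, and the rule for choosing which endpoint of an edge to attach to are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    ts: float  # assumed: time at which the entity was observed


@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    ts: float  # time of the operation


def anchor(element, earlier):
    """Pick the node of `element` to which the new cross-source edge attaches."""
    if isinstance(element, Node):
        return element
    # Assumption: leave an earlier edge at its destination and
    # enter a later edge at its source.
    return element.dst if earlier else element.src


def merge_pair(x, y):
    """Return a new Edge linking two associated elements, earlier to later."""
    first, second = sorted((x, y), key=lambda element: element.ts)
    return Edge(anchor(first, True), anchor(second, False), second.ts)


# Hypothetical elements parsed from two directly associated log rows.
request = Node("http_request", 10.0)
write = Edge(Node("apache", 1.0), Node("/var/www/upload", 12.0), 12.0)
link = merge_pair(write, request)
```

In this example the request node precedes the write operation, so the new edge runs from the request to the source of the write edge. A full implementation would also need the loop-free guarantee that the text attributes to the sortenties function, which this sketch does not attempt.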

In practice, the accuracy of the attack tracing algorithm in the embodiment of the present invention depends on the quality with which the tracing graph of each data source is constructed. If the tracing graph of each data source retains rich semantics during construction, the query results of the USR are also very good, that is, the attack tracing effect is better. The related art offers a number of mature approaches for constructing the bottom-layer tracing graph, but there is currently no universal approach for constructing graphs from upper-layer logs.

It should be noted that the attack tracing algorithm in the embodiment of the present invention handles streaming data well, and all three steps of the algorithm support streaming computation. In the construction of the log connection graph, when a new log arrives, only its marks need to be parsed in order to associate it with the existing logs, which improves analysis and processing efficiency and copes well with floods of data. In addition, a tracing graph generation algorithm with good streaming support is adopted in real scenarios, and the merging of tracing graphs is also fully streaming, so a user can query the attack chain from alarm information at any time and obtain an attack report for the current moment.

The merging mode of the tracing graphs is not required to be completed based on rules, is universal and flexible, and no matter which log is used, the merging mode in the embodiment of the invention can be used for merging the single-source tracing graphs into a merged tracing graph as long as log association analysis is completed and a log connection graph is constructed.

S120: executing an attack tracing algorithm in the fusion tracing graph, and mining, based on a short circuit mechanism, communication paths formed around the dependent explosion nodes in the fusion tracing graph.

Here, a dependent explosion node is a node for which the sum of the numbers of input edges and output edges is greater than a preset value. The dependency explosion problem occurs mainly in long-running processes, which interact with many other entities during their lifecycle and thereby create a large number of dependencies. When tracing is performed in a tracing graph, all output edges of such a process depend on all of its input edges, yet only a few of these dependencies are truly attack-related. The Firefox scenario shown in FIG. 1 illustrates the dependency explosion problem in the tracing graph: all socket entities (over 300) depend on all file entities (over 200), resulting in tens of thousands of dependencies, which makes tracing analysis difficult. In the embodiment of the invention, the tracing elements in the single-source tracing graphs are fused, so that the fusion tracing graph contains some high-level tracing paths through which the attack tracing algorithm can subsequently bypass dependent explosion nodes; these paths are called shortcuts. During attack tracing, whenever the node ahead is a dependent explosion node, a depth-first search (DFS) is performed that does not pass through that node. If a shortcut is found that bypasses the dependent explosion node and reaches its other side, that path is selected to continue tracing.
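The definition of a dependent explosion node translates directly into a degree count. The following Python sketch assumes the graph is given as a list of (source, destination) pairs; the node names and the threshold value are hypothetical, since the embodiment only states that a preset value is used.

```python
from collections import Counter


def explosion_nodes(edges, threshold):
    """Return nodes whose input plus output edge count exceeds `threshold`."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1  # one more output edge for the source
        degree[dst] += 1  # one more input edge for the destination
    return {node for node, count in degree.items() if count > threshold}


# Toy version of the Firefox scenario: 3 sockets feed the process, 2 files follow.
edges = [(f"socket{i}", "firefox") for i in range(3)]
edges += [("firefox", f"file{i}") for i in range(2)]
hubs = explosion_nodes(edges, threshold=4)

# Every output is treated as dependent on every input, so the number of
# implied dependencies is the product of the two counts.
implied_dependencies = 3 * 2
```

With the figures quoted in the text (over 300 sockets and over 200 files), the same product exceeds 60,000, which is the scale described as tens of thousands of dependencies.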

According to the attack tracing algorithm provided by the embodiment of the invention, when an alarm occurs, the alarm is input into the algorithm as a symptom event to reconstruct the attack chain. Specifically, nodes in the fusion tracing graph for which the sum of the numbers of input edges and output edges is greater than a preset value may be determined in advance as dependent explosion nodes. The attack tracing algorithm is then executed in the fusion tracing graph; when it reaches a dependent explosion node, a depth-first search is performed on the paths around that node based on the short circuit mechanism, and the communication paths formed around the dependent explosion node are mined.

It can be appreciated that dependent explosion nodes typically exist in the audit-level tracing graph. When describing the same behavior, a high-level tracing graph often has rich semantics but lacks connectivity, whereas the audit-level tracing graph is the opposite: it often has very strong connectivity but lacks rich semantic information. If only the bottom-layer tracing graph is used, the tracing algorithm inevitably runs into dependent explosion nodes. After the tracing graphs are merged, however, the lack of connectivity in the high-level tracing graphs is alleviated, and many high-level paths exist in the connected fusion tracing graph; these paths make it possible to bypass dependent explosion nodes. In the fusion tracing graph, when the attack tracing algorithm encounters a dependent explosion node, it performs one depth-first search, and if a high-level path bypasses the dependent explosion node and reaches its neighbor, that path is called a short circuit. Based on observation, short circuits are common in merged tracing graphs, because high-level and low-level data sources both tend to log the same system behavior. The essence of the short circuit is that, when bottom-layer tracing explodes, high-level records are used instead for more precise tracing; the high-level records are not only more precise but also carry richer semantics, so a better attack chain can be restored. Therefore, when a short circuit is found, the weight of the path is increased significantly so that the tracing algorithm is more inclined to select it. This is how the short circuit mechanism works.
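The depth-first search that looks for a path around a dependent explosion node can be sketched as follows. This is an illustrative Python fragment under assumptions: the graph is an adjacency dictionary, the node names attack1, attack2, benign3, benign4 and mysql_pid echo the FIG. 7 example, and the high-level nodes http_request and sql_query are hypothetical.

```python
def find_shortcut(graph, start, target, blocked):
    """Iterative DFS from start to target that never visits `blocked`.

    Returns the path as a list of nodes, or None if no bypass exists.
    """
    stack = [(start, [start])]
    seen = {start}
    while stack:
        node, path = stack.pop()
        if node == target:
            return path
        for nxt in graph.get(node, ()):
            if nxt != blocked and nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None


# Audit-layer edges pass through the hub; merged high-level edges go around it.
fused = {
    "attack1": ["mysql_pid", "http_request"],
    "mysql_pid": ["attack2", "benign3", "benign4"],
    "http_request": ["sql_query"],
    "sql_query": ["attack2"],
}
shortcut = find_shortcut(fused, "attack1", "attack2", blocked="mysql_pid")

# Without the merged high-level edges there is nothing to bypass with.
audit_only = {"attack1": ["mysql_pid"], "mysql_pid": fused["mysql_pid"]}
no_shortcut = find_shortcut(audit_only, "attack1", "attack2", blocked="mysql_pid")
```

The path found in the fused graph never touches the hub, so the benign neighbors of the hub stay outside the tracing range, whereas the audit-only graph offers no such path.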

S130: for each step of the path search in the attack tracing algorithm, calculating the evaluation score of each path searched by the attack tracing algorithm, using whether the search path includes a communication path as one of the path evaluation factors, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm.

Specifically, for each step of the path search in the attack tracing algorithm, the time difference between log nodes, the occurrence frequency of the log nodes, and whether the search path includes a communication path may be set as path evaluation factors. The scores contributed by the different path evaluation factors are weighted and summed to obtain the evaluation score of each path searched by the attack tracing algorithm, and the path with the highest evaluation score is selected as the attack chain output by the attack tracing algorithm.

The attack tracing approach in the embodiment of the invention improves on the traditional algorithm and is an attack tracing algorithm based on a short circuit mechanism. In the algorithm flow shown below, Step represents the maximum number of search steps, and K represents the number of candidates explored at each step, that is, the K highest-scoring candidates are selected at each step to continue the search. The score of a path is calculated by the CALCULATESCORE function, which considers three factors: time difference, frequency, and short circuit. The time and frequency factors are inspired by other works, NODOZE, HOLMES and MORSE, which hold that in APT attacks, events with longer time spans and lower frequencies are more likely to be in the attack chain. Unlike those works, however, the dependency analysis and attack tracing framework does not need to traverse the whole graph to count frequencies; instead it directly uses the number of logs connected by the corresponding mark in the log connection graph to express frequency (since, according to the above merging strategy, every edge across data sources must be derived from a mark in the HPG), which is an important advantage of merging the tracing graphs using the log connection graph CLG. The details of the path score are given in equation 1: the time difference and the frequency are weighted, and the additional score contributed by the short circuit mechanism is added directly. As described above, when a dependent explosion node is encountered, the CALCULATESCORE function adds an additional score to the shortcut paths, making these paths more likely to be selected by the tracing algorithm. Finally, when the paths of each step are selected, the K highest-scoring entities are chosen as the entry nodes of the next search, until no new entities are found or the set number of steps is reached. A subgraph containing the found entities and edges is then output, which is the attack chain output by the dependency analysis and attack tracing framework.

Algorithm 3: attack tracing algorithm based on short circuit mechanism
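The search flow described above (at most Step steps, the K best-scoring candidates kept at each step, a score that combines time difference, frequency and a short circuit bonus) can be sketched in Python as follows. Equation 1 is not reproduced in this text, so the scoring formula, the weights, the bonus value and the use of an inverse frequency are all hypothetical choices made only so that longer time spans, rarer events and shortcuts score higher.

```python
def calculate_score(time_diff, frequency, is_shortcut,
                    w_time=1.0, w_freq=1.0, bonus=10.0):
    """Hypothetical stand-in for CALCULATESCORE (not the patent's equation 1)."""
    score = w_time * time_diff + w_freq / frequency
    return score + bonus if is_shortcut else score


def trace(candidates_of, start, steps, k):
    """Keep the K best-scoring new entities at each of at most `steps` steps."""
    found, frontier = {start}, [start]
    for _ in range(steps):
        scored = []
        for node in frontier:
            for nxt, time_diff, freq, shortcut in candidates_of(node):
                if nxt not in found:
                    scored.append((calculate_score(time_diff, freq, shortcut), nxt))
        if not scored:
            break  # no new entities were found
        scored.sort(reverse=True)
        frontier = []
        for _, nxt in scored:
            if nxt not in found and len(frontier) < k:
                found.add(nxt)
                frontier.append(nxt)
    return found


# Hypothetical candidates: (entity, time difference, frequency, via shortcut).
neighbors = {
    "alert": [("mysql_pid", 1.0, 50, False), ("sql_query", 1.0, 2, True)],
    "sql_query": [("http_request", 2.0, 1, False)],
}
chain = trace(lambda node: neighbors.get(node, []), "alert", steps=3, k=1)
```

With K = 1 the shortcut candidate outscores the high-frequency hub at the first step, so the hub never enters the output. The frequency argument would, per the text, come from the number of logs attached to the corresponding mark in the log connection graph rather than from a scan of the whole graph.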

In summary, the attack tracing based on log association analysis in the embodiment of the invention establishes a framework that integrates precise and universal dependency analysis with attack tracing. The framework is general and flexible, and uses a threat model similar to that of other works. First, log integrity is assumed: an attacker can intrude into the system but cannot tamper with the logs collected at the various levels and data sources. In practice, there are established means to ensure this. Second, it is assumed that the tracing graph generated from the logs and the process of generating it are not attacked; in other words, the generated tracing graph also has integrity. Third, it is assumed that the system is not under attack before the dependency analysis and attack tracing framework is deployed. Fourth, it is assumed that the hardware, kernel, and audit system are part of the trusted computing base.

In an actual application scenario, the attack tracing method provided by the embodiment of the present invention may follow the system flow of the dependency analysis and attack tracing framework shown in FIG. 5. First, logs of each level in the system are collected, including the audit layer, application layer, service layer, network layer, GUI layer, user layer, and the like; then the association relationships between the logs are analyzed to construct a log connection graph (CLG). The log connection graph is used as the rule for the subsequent merging of tracing graphs, which is based on two basic assumptions: (1) the log of each data source can generate its own tracing graph; (2) if two rows of logs are directly associated in the log connection graph, the tracing graph elements (entities or edges) parsed from them are also directly associated. Therefore, once the log connection graph is obtained, the tracing graphs generated from the individual logs can be merged directly into a fusion tracing graph. The fusion tracing graph contains some high-level tracing paths through which the attack tracing algorithm can bypass dependent explosion nodes, and the attack tracing algorithm based on the short circuit mechanism is proposed on this basis. When an alarm occurs, the corresponding node in the tracing graph is input into the attack tracing algorithm as a symptom event, and the attack chain is reconstructed.

Furthermore, in order to verify the tracing effect achieved by the attack tracing manner based on log association analysis in the embodiment of the present invention, an attack chain of each log data set may be reconstructed for the log data sets which are preset to include different attack scenarios, and a reference standard for evaluating an attack is obtained from the attack chain of each log data set; outputting an attack chain obtained by tracing from a log data set by using an attack tracing algorithm of log correlation analysis; and according to the reference standard for evaluating the attack, evaluating the accuracy of the attack chain obtained by tracing the log data set.

As the setup for the experimental evaluation, the performance of PROV-NAVIGATOR is evaluated using 6 APT attacks in the embodiment of the present invention. The operating systems used in these experiments include Ubuntu 18.04 and Windows 10, all of which run audit logging and other application-level or user-level logging modules. The audit log detail level is set to DEBUG or INFO. Most of the time, the systems carry out benign activity, and attacks occur only rarely during the entire experiment. The server used to test the overhead and effectiveness of the PROV-NAVIGATOR algorithm is equipped with an Intel Xeon Platinum 8255C 2.50GHz CPU and 16GB of memory, and runs Ubuntu 18.04.4. The 6 attacks are described as follows:

(1) APT-1 to APT-5

To check the accuracy and recall of Prov-Navigator and the validity of the short circuit mechanism, five sets of experiments, APT-1 to APT-5, were designed; their details are listed in Table 2. These data sets contain logs at the audit level, application level, and user level collected from multiple machines, and each data set contains a complete, multi-hop attack. The attacks were designed with reference to the attack model of the DARPA TC data set and some typical attack models such as ATT&CK.

Table 2: basic information of APT data set

(2) APT-microservice

The APT-microservice data set was collected over 9 days from a distributed microservice environment comprising 13 nodes. It is used to evaluate the space overhead (RQ3) of Prov-Navigator and its execution speed in the APT scenario. The structure and mode of operation of this scenario are as follows.

The specific scenario is designed as follows: the environment is the internal office network of a medium-sized financial enterprise whose main business is intelligent quantitative investment. The attack links and topology of the data set are shown in FIG. 6, where solid and dashed lines represent benign behavior and attack paths, respectively. The normal business process of the enterprise is as follows: employees submit trading strategies, managers approve and release the submitted strategies, and the released strategies automatically operate the fund pool and complete trades. In addition, employees also perform daily operations such as clocking in.

The attack scenario is designed as follows: an employee responsible for maintaining the company web portal clicks a phishing link while working from home on a company-issued laptop, which allows a hacker to remotely control the laptop and log in to the company intranet. To obtain a stable intrusion path, the hacker implants a backdoor in the back end of the company portal, so that the intranet can be reliably entered from the portal server. The hacker observes behavior in the intranet for a long time and steals a manager's password through SQL injection against the employee clock-in database, which allows the hacker to log in to the approval node. The hacker submits a malicious trading strategy by imitating the behavior of a normal employee, then logs in to the approval node and releases the malicious strategy. The trades executed by the malicious strategy eventually cause huge financial losses to the company.

Further, in order to show the performance of reconstructing the attack chain, specifically, in the process of acquiring the reference standard for evaluating the attack, the preset log data sets containing different attack scenes can be used for reconstructing the attack chain of each log data set, and the sensitive entities in the attack chain are marked; and matching the attack logs from the log data set by using the identification information corresponding to the sensitive entity to serve as a reference standard for evaluating the attack.

Based on APT-1 to APT-5, as the five groups of experiments are all multi-machine and multi-hop complex experiments, the process of obtaining reference answers for evaluating the accuracy needs to be described in detail: firstly, manually reconstructing an attack chain of each data set according to an attack script; then, the sensitive entities (such as ip, process number, file name, etc.) are marked; finally, three types of matching are performed based on these entities. Firstly, name matching is carried out, and logs directly containing the entity names are classified into standard answers; secondly, time range matching is carried out, logs which occur at the same time with the logs found by name matching are classified as standard answers, and a large number of logs which occur at the same time are often generated in one attack step; the last step is manual review, because the first two steps may include normal behavior, so the domain expert marks the normal event and excludes it from the answer set.
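The first two matching steps (name matching and time range matching) can be sketched in Python as follows; the third step, manual review by a domain expert, is not automated. The record layout, the sample rows, and the width of the time window are assumptions for illustration.

```python
def match_ground_truth(logs, sensitive_names, window=1.0):
    """Return candidate ground-truth rows before manual review."""
    # Step 1: name matching on the marked sensitive entities.
    named = [log for log in logs
             if any(name in log["text"] for name in sensitive_names)]
    # Step 2: time range matching around the name-matched rows.
    times = [log["ts"] for log in named]
    return [log for log in logs
            if any(abs(log["ts"] - t) <= window for t in times)]


# Hypothetical rows: the second has no sensitive name but is concurrent.
sample_logs = [
    {"ts": 10.0, "text": "sshd login from 172.16.0.1"},
    {"ts": 10.5, "text": "bash spawned pid=4242"},
    {"ts": 90.0, "text": "cron started job"},
]
candidates = match_ground_truth(sample_logs, ["172.16.0.1"])
```

The concurrent row is pulled in by the time window even though it names no sensitive entity, which is why the text follows these two steps with a manual review that removes normal events.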

After the steps, reference answers for evaluating the accuracy are obtained, the attack tracing method in the embodiment of the invention is operated on the five data sets, and the number of hit answers in the results searched by the attack tracing is counted to calculate the accuracy; meanwhile, the method also focuses on how many attack steps are covered by the query result, and calculates the recall rate according to the attack steps.

The specific experimental results are shown in Table 3 below, which illustrates the effect of the attack tracing algorithm provided by the embodiment of the present invention on the data sets, where TP represents the number of events that hit the ground truth, FP represents the number of normal behaviors falsely reported by the attack tracing algorithm, and TP/(TP + FP) represents the accuracy. An average accuracy of 95.3% was achieved over the five data sets. The accuracy on the APT-1 and APT-4 data sets is relatively low because, when ETW is used as the auditing tool, a very large number of behaviors accompanying the attack are recorded, which leads to some false positives. Nevertheless, the accuracy on both data sets exceeded 90%, and 100% accuracy was achieved on the APT-2 data set, which also used ETW. Meanwhile, an average recall rate of 98.4% was obtained on the five data sets: all attack steps were found on the first four data sets, and only 1 step was not found on the APT-5 data set, which has the most steps, and that step is a branch step (not a main step). This demonstrates the effectiveness of the attack tracing algorithm provided by the embodiment of the invention in the APT scenario.
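The two metrics can be written out as a short Python sketch. The TP and FP counts below are hypothetical because the contents of Table 3 are not reproduced in this text; the step counts for recall use the figure stated later in this description, namely that one of 64 attack steps was not found.

```python
def accuracy(tp, fp):
    """Accuracy as defined in the text: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(found_steps, total_steps):
    """Share of attack steps covered by the query result."""
    return found_steps / total_steps


example_accuracy = accuracy(tp=95, fp=5)  # hypothetical counts
overall_recall = recall(found_steps=63, total_steps=64)
print(f"{example_accuracy:.1%} {overall_recall:.1%}")
```

Pooling all steps gives 63/64, about 98.4%, which agrees with the stated average recall rate; whether the reported average was computed over pooled steps or per data set is not specified.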

Table 3: performance of the attack tracing algorithm in the embodiment of the invention on each data set

Furthermore, to illustrate how the tracing graph merging adopted by the attack tracing algorithm provided by the embodiment of the present invention mitigates the dependency explosion problem, two sets of comparison tests are designed: the first omits the shortcut mechanism, and the second uses only the bottom-layer logs (that is, in addition to omitting the shortcut mechanism, all high-layer logs are removed). There is no need for a set of experiments that drops the high-level logs but retains shortcuts, because shortcuts are only introduced when high-level logs are present. When only the bottom-layer logs are used, the accuracy drops to 84.4% and the recall rate drops sharply to 69%. The reason is that, in the bottom-layer tracing graph, the tracing algorithm cannot bypass dependent explosion nodes through high-level shortcut paths; passing through a dependent explosion node naturally produces a large number of false positives (reducing the accuracy), and the direction and depth of tracing are also affected, so that not all attack steps can be found (reducing the recall rate). This shows that fusing the high-level logs helps the attack tracing algorithm in the embodiment of the invention to alleviate the dependency explosion problem in the APT scenario.

It should be noted that the embodiments of the present invention do not reproduce the approaches of other related art. On the one hand, Alchemist does not disclose all 135 of its fusion rules and therefore cannot be reproduced; on the other hand, UIScope is a method for UI analysis only: it merges only the tracing graphs of the GUI layer and the audit layer, cannot process the logs of the multiple other layers considered herein, and lacks generality.

The following conclusions can be drawn from the above experiments: the attack tracing algorithm provided by the embodiment of the invention performs well on the five APT attack data sets, achieving an average accuracy of 95.3% and an average recall rate of 98.4%, with only one of the 64 attack steps in the five data sets not found; and the richer the high-level logs, the better the attack tracing algorithm performs.

Further, the short circuit mechanism can be evaluated. Specifically, Table 3 includes a group of experiments in which the short circuit mechanism is removed from the attack tracing algorithm. After the short circuit mechanism is removed, the accuracy of the attack tracing algorithm drops to 83.4% and the recall rate drops to 89.8%, and on every data set except APT-4, 1 to 3 attack steps are not found. This shows that using a tracing algorithm with the short circuit mechanism on the fusion tracing graph effectively bypasses dependent explosion nodes and improves the accuracy and recall rate of tracing.

To illustrate the effect of the short circuit mechanism more concretely, an example is given in FIG. 7, which is a real case from the APT-3 data set. In this case, the attacker sends a request to the back end of the website, the request initiates a mysql transaction, and the SQL injection queries a data table containing sensitive information in mysql. Both the bottom-layer and high-layer logs record this behavior, but the mysql database always produces some dependent explosion processes, such as the "mysql pid" node in the figure. If only the tracing graph of the audit layer is used, then when tracing from the attack1 node to the attack2 node through the dependent explosion node, benign3 and benign4 are inevitably included in the tracing range, which causes false alarms. Through the tracing graph merging in the embodiment of the invention, the relevant parts of the database-level tracing graph and the application-level tracing graph are merged to form a shortcut path. This path bypasses the bottom-layer dependent explosion node and reaches the attack2 node from the attack1 node precisely through the high-level path, which avoids the false alarms and adds richer semantic information to the attack path. This is how the shortcut mechanism works.

Further, the speed of the attack tracing algorithm can be evaluated. To verify whether the attack tracing algorithm can construct the fused tracing graph quickly on the massive data of an APT scenario, an experiment can be performed on the APT-microservice data set. Table 4 lists the daily growth in data volume and the time consumed in constructing the tracing graph over 9 days of operation. The experimental results show that the attack tracing algorithm takes only 134 seconds to construct, from 25 GB of raw log data accumulated over 9 days, a huge tracing graph containing 10.37 million entities and 91.27 million edges, which corresponds to an average data processing speed of 187 MB/s. Therefore, the attack tracing algorithm provided by the embodiment of the invention is capable of processing the massive data produced by APT scenarios and distributed clusters, and completes the construction of the fused tracing graph at a very high processing speed.
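The throughput figure follows directly from the two numbers quoted above. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB), which is the convention under which the quoted 187 MB/s is reproduced:

```python
# Figures quoted in the text: 25 GB of raw logs processed in 134 seconds.
raw_log_gb = 25
build_seconds = 134

# Assumption: decimal units (1 GB = 1000 MB).
throughput_mb_s = raw_log_gb * 1000 / build_seconds
print(f"{throughput_mb_s:.1f} MB/s")  # prints 186.6 MB/s, i.e. about 187 MB/s
```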

Table 4: graph construction time statistics of the APT-microservice data set

Furthermore, the space overhead of the attack tracing algorithm in the embodiment of the invention can be evaluated, mainly in the following two aspects:

(1) Disk occupancy overhead. The disk occupancy overhead is defined as the ratio of the tracing graph size (occupied disk space) to the original log size (occupied disk space). To measure the space overhead, the logs from the 6 experiments were processed on the same system. The results are shown in Table 5: the average space overhead is only 3.1%.

Table 5: space overhead statistics of the APT-1 to APT-5 and APT-microservice data sets

Data set name	Space overhead
APT-1	5.2%
APT-2	1.0%
APT-3	1.2%
APT-4	3.5%
APT-5	4.8%
APT-microservice	3.0%
Average	3.1%

For APT-2, the system introduces an overhead of 1.0%, while for APT-1 it introduces an overhead of 5.2%. This difference arises because application logs carry a larger share of useful semantic information than audit logs, so less useless information can be compressed away. APT-1 has a higher space overhead because it contains a higher proportion of application logs. The overall overhead on APT-microservice is also evaluated here: under a normal workload over 9 days, the system generates 25 GB of logs. FIG. 8 shows the growth of the space occupied by the original logs over the 9 days of the APT-microservice data set. The results are further compared with OmegaLog using the same space overhead evaluation method. In OmegaLog, the space overhead ratio ranges from 1% to 8%, whereas the attack tracing algorithm in the embodiment of the invention performs better, with results in the range of 1% to 5%.
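The overhead definition and the reported average can be reproduced with a few lines. The sketch below uses the per-data-set percentages from Table 5; the helper name `disk_overhead` and the byte counts in the usage line are hypothetical and only illustrate the ratio defined in item (1).

```python
def disk_overhead(graph_bytes, raw_log_bytes):
    """Ratio of tracing graph size to original log size, as defined in item (1)."""
    return graph_bytes / raw_log_bytes

# Per-data-set overheads (in percent) as listed in Table 5.
overhead_pct = {
    "APT-1": 5.2, "APT-2": 1.0, "APT-3": 1.2,
    "APT-4": 3.5, "APT-5": 4.8, "APT-microservice": 3.0,
}
average_pct = sum(overhead_pct.values()) / len(overhead_pct)
print(f"{average_pct:.1f}%")  # prints 3.1%

# Hypothetical sizes: a 30 MB graph built from 1000 MB of logs is a 3% overhead.
example_ratio = disk_overhead(30_000_000, 1_000_000_000)
```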

(2) Memory overhead. The memory overhead of the attack tracing algorithm in the embodiment of the invention is evaluated on a system with a relatively heavy workload, and the results show that the attack tracing algorithm has low memory overhead. FIG. 9 shows memory usage statistics for the APT-microservice data set, covering construction of the CLG and the HHPG and querying of the attack kill chain. Memory usage increases over time and finally reaches a peak of 5.1 GB, which is acceptable given the data size in this scenario.

It is emphasized here that the attack tracing algorithm of this embodiment can handle the situation in which different logs arrive out of order, which is in fact not common in real environments. The log connection graph construction algorithm does not require logs to arrive in any particular order: whatever the arrival order, the final graph construction result does not change, and therefore the subsequent tracing graph fusion and attack chain query are not affected.
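The order-independence property can be illustrated with a minimal sketch. It assumes that each log has already been reduced to a set of marks and that the graph is represented as a set of (log, mark) edges; the function name, the input shape and the sample marks are hypothetical and are not taken from the patent.

```python
import random
from collections import defaultdict

def build_log_connection_graph(logs):
    """Return the (log_id, mark) edges for marks shared by at least two logs.

    `logs` is an iterable of (log_id, marks) pairs (an assumed input shape).
    """
    mark_to_logs = defaultdict(set)
    for log_id, marks in logs:
        for mark in marks:
            mark_to_logs[mark].add(log_id)
    # The result depends only on set membership, never on arrival order.
    return frozenset(
        (log_id, mark)
        for mark, log_ids in mark_to_logs.items()
        if len(log_ids) >= 2
        for log_id in log_ids
    )

logs = [
    ("audit-1", {"pid=4242", "ip=10.0.0.5"}),
    ("nginx-7", {"ip=10.0.0.5", "req=abc"}),
    ("mysql-3", {"pid=4242", "req=abc"}),
    ("audit-2", {"pid=9999"}),
]
shuffled = logs[:]
random.Random(0).shuffle(shuffled)  # simulate out-of-order arrival

in_order = build_log_connection_graph(logs)
out_of_order = build_log_connection_graph(shuffled)
print(in_order == out_of_order)  # prints True
```

Because marks are accumulated into sets before any edge is emitted, every permutation of the input yields the same graph in this sketch.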

Related studies can be classified into log processing and attack tracing. Log processing comprises log auditing, log association analysis and log analysis; attack tracing comprises data tracing and tracing analysis. Among these, the works most relevant to the embodiments of the present invention are selected for comparison in Table 6, and the comparison results are shown below:

Table 6: comparison of various traceability frameworks

Conventional approaches exploit information from the process itself to determine the relationship between its inputs and outputs, thereby addressing the dependency explosion problem. The two most classical solutions are taint analysis and binary instrumentation. Taint analysis methods, such as RAIN, detect anomalies in the tracing graph and prune it. RAIN uses dynamic flow tracing techniques for attack replay, using taint flow operations to trace the taint propagation logic between programs. However, such methods incur considerable runtime and space overhead and cannot be used in a real production environment. Binary instrumentation methods, such as BEEP, MPI and TRACE, partition an application process into smaller units and translate dependencies on the process into dependencies between units. BEEP performs execution partitioning using event processing loops; TRACE uses UBSI technology to implement a data structure identification. Although binary instrumentation significantly alleviates dependency explosion and improves tracing accuracy, developers cannot guarantee that modifying the binary code will not affect the stability and security of the system, so it is difficult to use in a production environment.

In recent years, more and more researchers have tried to introduce high-level semantic information into the traditional tracing graph to generate a fused tracing graph containing rich context information. The fused tracing graph draws on more information sources during attack investigation and also provides more detail. OmegaLog and ALchemist adopt a log fusion approach: they fuse the information in various logs into a single fused log, which is then converted directly into a fused tracing graph. OmegaLog takes the audit log as a bus and embeds information from other logs onto the bus. It uses a control flow graph to enhance log fusion, but its approach cannot handle multi-threading and background processes. ALchemist extracts common fields from different logs, establishes 135 fusion rules on these fields, and uses a rule reasoning engine to assist log fusion. ALchemist divides logs only into a high level and a low level, so when a finer-grained level such as a GUI level or a database level needs to be introduced, its rules must be greatly extended or even completely rewritten. Therefore, the fusion rules of ALchemist cannot cope with hierarchy expansion. UIScope offers a new perspective on generating the fused tracing graph. It claims that the dependency explosion problem can be solved by fusing the GUI-layer tracing graph with the audit-layer tracing graph: the fused graph binds a long-running process to multiple GUI events, enabling fine-grained unit partitioning of the process. UIScope mainly uses timestamps to connect the GUI-layer and audit-layer tracing graphs, which works well in some specific cases. However, UIScope cannot serve as a general tracing graph fusion scheme because many applications run without a GUI.

The method and the device use log association technology to mine and analyze the association relationships among different logs so as to complete an anomaly detection task. FIG. 10 compares the log connection graph generated by Pro-Navigator with the log connection graph generated by HERCULE. Under the log connection graph concept proposed by HERCULE, various logs such as DNS logs, HTTP logs and audit logs are first parsed, fields such as IP addresses are extracted from the logs using rules written by experts, and 29 equations over these fields are then used to determine whether an association relationship exists between two logs; the logs are taken as nodes, and associated logs are connected together to form the log connection graph. HERCULE then applies social network analysis and community detection techniques and uses machine learning methods to optimize the weights. Its graph construction has quadratic complexity, O(N²), which makes it difficult to apply in APT scenarios. In addition, HERCULE performs tracing analysis directly on the log connection graph (CLG), so only the association relationships between logs are considered and the causality between events is ignored. Compared with methods based on the tracing graph, the attack chain it obtains cannot accurately restore the causal relationships between attack steps.

In the attack tracing method of the embodiment, logs of all layers and types are first used to construct a log connection graph. The log connection graph is then used to merge the single-source tracing graphs generated from the logs of individual data sources. Finally, the attack tracing algorithm uses a novel query method that makes full use of the high-level short circuits produced by tracing graph fusion, reconstructs the attack chain from the symptom event, and completes the attack investigation. The final evaluation experiments show that the attack tracing algorithm can effectively find the subgraphs related to the attack, achieves high recall and accuracy, offers strong usability and stability, and works well at very low cost in APT scenarios without using any instrumentation technology.

The attack tracing method based on log association analysis provided by the embodiment of the invention collects the logs of each layer in the system and constructs a log connection graph according to the association relationships among them, so that logs of different levels and different data sources can be effectively fused. In the log connection graph, log nodes and mark nodes appear alternately: each log node parses out at least one mark node, and each mark node is parsed out by at least one log node. The mark nodes in the log connection graph are then used to connect the tracing elements, namely nodes and edges, of the single-source tracing graphs to obtain a fused tracing graph, where each single-source tracing graph is obtained by converting the log file of one data source with a log parsing algorithm. The attack tracing algorithm is then executed on the fused tracing graph, and the communication paths formed around dependent explosion nodes are mined based on the short-circuit mechanism. Compared with the prior art, which relieves the dependency explosion problem through dependence analysis during attack tracing, the communication path takes the association relationships between logs into account while strengthening the causal relationships between events, so the attack chain output by attack tracing accurately restores the causal relationships between attack steps. Dependent explosion nodes are thereby effectively bypassed during tracing, and the dependency explosion problem is relieved.

Based on the foregoing embodiment, another embodiment of the present invention provides an attack tracing apparatus based on log association analysis. As shown in FIG. 11, the apparatus includes:

the construction unit 20 may be configured to collect logs of each layer in the system and construct a log connection graph according to the association relationships between the logs of each layer, where log nodes and mark nodes appear alternately in the log connection graph, each log node parses out at least one mark node, and each mark node is parsed out by at least one log node;

the connection unit 22 may be configured to connect, by using a mark node in the log connection graph, the tracing elements in each single source tracing graph to obtain a fused tracing graph, where the single source tracing graph is obtained by converting a log file of each data source by using a log parsing algorithm, and the tracing elements include nodes and edges;

the mining unit 24 may be configured to execute an attack tracing algorithm in the fused tracing graph, and mine a communication path formed around a dependent explosion node in the fused tracing graph based on a short-circuit mechanism, where the dependent explosion node is a node whose sum of input edges and output edges is greater than a preset value;

the selecting unit 26 may be configured to, for each step of the search path in the attack tracing algorithm, calculate an evaluation score of each path searched by the attack tracing algorithm by using the search path including the communication path as one of path evaluation factors, and select a path with a highest evaluation score as an attack chain output by the attack tracing algorithm.

In a specific application scenario, the construction unit 20 includes:

the parsing module may be configured to take each row of logs as a log node and parse, from the logs of each layer, the features expressing the association relationships between log nodes, to form a mark node set;

the drawing module may be configured to, for each mark node in the mark node set, form a connection link using a mark node parsed from at least two log nodes, and connect the at least two log nodes with the mark node according to the connection link to draw the log connection graph.

In a specific application scenario, the parsing module may be specifically configured to, for log nodes at the same level, mine common patterns representing the association relationships between the log nodes by using frequent item analysis, and form a mark node set;

the parsing module may be further configured to, for cross-level log nodes, express the features of the association relationships between the log nodes by using pre-enumerated cross-level connection marks, and form a mark node set.

In a specific application scenario, the drawing module may be specifically configured to determine, for each mark node in the mark node set, whether the mark node is also parsed from other log nodes, and if so, take it as a mark node parsed from at least two log nodes;

the drawing module may be further configured to use the mark node to form a connection link expressing the sharing relationship between the mark node and the log nodes, and draw the log connection graph according to the sharing relationship, where the mark node is connected to the at least two log nodes associated with it.
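The division of work between the parsing module and the drawing module can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions in `MARK_PATTERNS` stand in for the pre-enumerated connection marks and are hypothetical, as are the function names and the sample log lines. Frequent item analysis for same-level logs is not shown.

```python
import re
from collections import defaultdict

# Hypothetical mark extractors; the patent does not list concrete patterns.
MARK_PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "pid": re.compile(r"\bpid=(\d+)\b"),
}

def parse_marks(line):
    """Parsing step: each log row is a log node; return the marks it yields."""
    marks = set()
    for kind, pattern in MARK_PATTERNS.items():
        for match in pattern.finditer(line):
            value = match.group(1) if pattern.groups else match.group(0)
            marks.add((kind, value))
    return marks

def draw_log_connection_graph(lines):
    """Drawing step: keep only marks shared by at least two log nodes."""
    mark_to_logs = defaultdict(set)
    for log_id, line in enumerate(lines):
        for mark in parse_marks(line):
            mark_to_logs[mark].add(log_id)
    return {mark: ids for mark, ids in mark_to_logs.items() if len(ids) >= 2}

lines = [
    "audit: pid=4242 connect 10.0.0.5",
    "nginx: GET /login from 10.0.0.5",
    "mysql: pid=4242 SELECT * FROM users",
    "audit: pid=9999 open /tmp/x",
]
graph = draw_log_connection_graph(lines)
for mark, log_ids in sorted(graph.items()):
    print(mark, sorted(log_ids))
```

In this sketch the mark for pid 9999 is dropped because only one log node yields it, so it cannot serve as a connection link.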

In a specific application scenario, the connection unit 22 includes:

the query module may be configured to traverse all the mark nodes in the log connection graph, query the association relationships among the log nodes in the log connection graph, and identify, according to those association relationships, the tracing elements that are associated with each other in the single-source tracing graphs;

the connection module may be configured to connect the tracing elements having an association relationship in each single-source tracing graph to obtain a fused tracing graph.
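A minimal sketch of the query and connection steps is given below. It assumes that each log node maps to exactly one element of a single-source tracing graph and that a shared mark is represented as one extra edge between the two elements; these representations, the function name `fuse_tracing_graphs`, and all sample data are hypothetical choices made for illustration.

```python
from itertools import combinations

def fuse_tracing_graphs(single_source_graphs, element_of_log, log_connection_graph):
    """Union the single-source edges, then link elements whose logs share a mark.

    single_source_graphs: {source: set of (src, dst, operation) edges}
    element_of_log:       {log_id: node in some single-source graph} (assumed)
    log_connection_graph: {mark: set of log_ids sharing that mark}
    """
    fused = set()
    for edges in single_source_graphs.values():
        fused |= set(edges)
    for mark, log_ids in log_connection_graph.items():
        # Sorting fixes an arbitrary but deterministic direction for the link.
        for log_a, log_b in combinations(sorted(log_ids), 2):
            node_a, node_b = element_of_log[log_a], element_of_log[log_b]
            if node_a != node_b:
                fused.add((node_a, node_b, f"shared:{mark}"))
    return fused

single_source_graphs = {
    "audit": {("proc:nginx", "proc:mysqld", "connect")},
    "application": {("http:/login", "handler:login", "dispatch")},
    "database": {("txn:17", "table:users", "select")},
}
element_of_log = {
    "audit-1": "proc:nginx",
    "nginx-7": "http:/login",
    "app-2": "handler:login",
    "mysql-3": "txn:17",
}
log_connection_graph = {
    "ip=10.0.0.5": {"audit-1", "nginx-7"},
    "req=abc": {"app-2", "mysql-3"},
}
fused = fuse_tracing_graphs(single_source_graphs, element_of_log, log_connection_graph)
print(len(fused))  # prints 5: three original edges plus two cross-source links
```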

In a specific application scenario, the mining unit 24 includes:

the determining module may be configured to determine in advance, as dependent explosion nodes, the nodes in the fused tracing graph whose sum of input edges and output edges is greater than a preset value;

the mining module may be configured to execute the attack tracing algorithm in the fused tracing graph and, when the attack tracing algorithm reaches a dependent explosion node, perform a depth-first search on the paths around the dependent explosion node based on the short-circuit mechanism to mine the communication paths formed around the dependent explosion node.
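The two steps can be sketched on a toy graph modeled loosely on the FIG. 7 example. This is a minimal sketch under stated assumptions: the threshold value, the function names, the edge list, and the nodes `benign1`, `benign2`, `http_request` and `sql_transaction` are hypothetical, and a communication path is taken here to be any path from the current node to a successor of the dependent explosion node that never enters that node.

```python
from collections import defaultdict

def explosion_nodes(edges, threshold):
    """Nodes whose number of input edges plus output edges exceeds the threshold."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return {node for node, d in degree.items() if d > threshold}

def communication_paths(edges, start, hub):
    """Depth-first search from `start` that avoids `hub` and stops at hub successors."""
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)
    targets = set(successors[hub])
    found = []

    def dfs(node, path):
        for nxt in sorted(successors[node]):
            if nxt == hub or nxt in path:
                continue  # never enter the explosion node, never revisit a node
            if nxt in targets:
                found.append(path + [nxt])
            else:
                dfs(nxt, path + [nxt])

    dfs(start, [start])
    return found

edges = [
    ("attack1", "mysql_pid"), ("benign1", "mysql_pid"), ("benign2", "mysql_pid"),
    ("mysql_pid", "attack2"), ("mysql_pid", "benign3"), ("mysql_pid", "benign4"),
    # High-level edges contributed by the application and database layers.
    ("attack1", "http_request"), ("http_request", "sql_transaction"),
    ("sql_transaction", "attack2"),
]
hubs = explosion_nodes(edges, threshold=4)
paths = communication_paths(edges, "attack1", "mysql_pid")
print(hubs)   # prints {'mysql_pid'}
print(paths)  # prints [['attack1', 'http_request', 'sql_transaction', 'attack2']]
```

The only path found reaches attack2 through the high-level nodes, so benign3 and benign4 never enter the result in this toy case.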

In a specific application scenario, the selecting unit 26 includes:

the setting module may be configured to set, for each step of the search path in the attack tracing algorithm, the time difference between log nodes, the occurrence frequency of log nodes, and whether the search path contains the communication path as path evaluation factors;

the selecting module may be configured to compute a weighted sum of the different path evaluation factors for each path searched by the attack tracing algorithm to obtain the evaluation score of each path, and select the path with the highest evaluation score as the attack chain output by the attack tracing algorithm.
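The weighted scoring can be illustrated with a minimal sketch. The three factors come from the description above, but everything else is an assumption made for illustration: the weights, the normalization of each factor, the direction of each factor (smaller time gaps and rarer nodes score higher), and the candidate values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CandidatePath:
    name: str
    time_gap_seconds: float        # time difference between log nodes
    mean_node_frequency: float     # how often the log nodes on the path occur
    uses_communication_path: bool  # whether the path contains a communication path

# Hypothetical weights; the patent does not disclose concrete values.
WEIGHTS = {"time": 0.3, "rarity": 0.3, "shortcut": 0.4}

def evaluation_score(path):
    """Weighted sum of the three path evaluation factors (assumed normalization)."""
    time_factor = 1.0 / (1.0 + path.time_gap_seconds)
    rarity_factor = 1.0 / (1.0 + path.mean_node_frequency)
    shortcut_factor = 1.0 if path.uses_communication_path else 0.0
    return (WEIGHTS["time"] * time_factor
            + WEIGHTS["rarity"] * rarity_factor
            + WEIGHTS["shortcut"] * shortcut_factor)

candidates = [
    CandidatePath("via_explosion_node", 2.0, 50.0, False),
    CandidatePath("via_communication_path", 2.0, 3.0, True),
]
attack_chain = max(candidates, key=evaluation_score)
print(attack_chain.name)  # prints via_communication_path
```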

In a specific application scenario, the apparatus further includes:

the evaluation unit may be configured to evaluate, against a reference standard for evaluating attacks, the accuracy of the attack chain obtained by tracing from the log data set.

In a specific application scenario, the obtaining unit includes:

Based on the above method embodiments, another embodiment of the present invention provides a storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the above method.

Based on the foregoing embodiment, another embodiment of the present invention provides an attack tracing device based on log association analysis, including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.

The above device embodiment corresponds to the method embodiment and has the same technical effect; for a specific description, refer to the method embodiment section, which is not repeated here. Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and the blocks or flows in the figures are not necessarily required to practice the present invention.

Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An attack tracing method based on log association analysis is characterized by comprising the following steps:

2. The method of claim 1, wherein constructing a log connection graph according to the association relationship between the logs of the respective layers comprises:

3. The method of claim 2, wherein the parsing the features in the logs of the respective levels that express the association relationship between the log nodes to form a set of tagged nodes comprises:

4. The method of claim 2, wherein for each marker node in the set of marker nodes, forming a connection tie using the marker nodes parsed from at least two log nodes, and connecting the at least two log nodes with the marker nodes according to the connection tie, drawing a log connection graph, comprising:

5. The method according to claim 1, wherein the connecting the tracing elements in each single-source tracing graph by using the label nodes in the log connection graph to obtain a fused tracing graph comprises:

6. The method of claim 1, wherein the executing an attack tracing algorithm in the fused tracing graph and mining communication paths formed around dependent explosion nodes in the fused tracing graph based on a short circuit mechanism comprises:

7. The method according to any one of claims 1 to 6, wherein for each step number of a search path in the attack tracing algorithm, using the search path including the communication path as one of path evaluation factors, calculating an evaluation score of each path searched by the attack tracing algorithm, and selecting a path with the highest evaluation score as an attack chain output by the attack tracing algorithm, comprises:

8. The method according to any one of claims 1 to 6, wherein after, for each step of the search path in the attack tracing algorithm, using the search path containing the communication path as one of the path evaluation factors, calculating the evaluation score of each path searched by the attack tracing algorithm, and selecting the path with the highest evaluation score as the attack chain output by the attack tracing algorithm, the method further comprises:

outputting an attack chain obtained by tracing from a log data set by using an attack tracing algorithm of log correlation analysis;

9. The method according to claim 8, wherein the reconstructing attack chain of each log data set aiming at the log data sets which are preset to contain different attack scenarios and obtaining the reference standard for evaluating the attack comprises:

reconstructing an attack chain of each log data set by using the preset log data sets containing different attack scenes, and marking a sensitive entity in the attack chain;

10. An attack tracing apparatus based on log association analysis, the apparatus comprising: