CN115396138A

CN115396138A - Tracing graph reduction method and device

Info

Publication number: CN115396138A
Application number: CN202210623444.5A
Authority: CN
Inventors: 万海; 周博雅; 孙逸伦; 焦伟; 严人宁; 王瑞华; 赵曦滨
Original assignee: China Bond Jinke Information Technology Co ltd; Tsinghua University
Current assignee: China Bond Jinke Information Technology Co ltd; Tsinghua University
Priority date: 2022-06-01
Filing date: 2022-06-01
Publication date: 2022-11-25

Abstract

The invention discloses a method and a device for reducing a tracing graph. The method comprises: obtaining logs of the applications through which a system interacts with the outside by performing co-occurrence analysis on historical operating data of a host; parsing the application logs and converting them into normalized logs according to a preset field format; extracting entities from the normalized logs using a predefined rule set; for each normalized log, establishing a mapping between the entities and nodes in a tracing graph using common fields, so as to obtain the target nodes in the tracing graph that are associated with the entities; taking the target nodes as trigger points and capturing their context with a random walk-based algorithm; extracting, for each target node, a tracing subgraph starting from that node; and combining all tracing subgraphs into a reduced tracing graph. The method reduces the tracing graph and thereby addresses the problem in the prior art that the huge data volume of the tracing graph impairs the precision of attack tracing detection.

Description

Method and device for reducing tracing graph

Technical Field

The invention relates to the technical field of network security, in particular to a method and a device for reducing a tracing graph.

Background

Advanced Persistent Threats (APTs) have recently become one of the most critical cyberspace threats facing enterprises and organizations. An APT attack is a stealthy, long-running computer network intrusion, usually carried out with a specific intent. It is characterized by a low-frequency, low-speed attack pattern and a strategy of attacking through remote control points. APT attacks typically exploit zero-day vulnerabilities and, after gaining control of the target system, may remain hidden in the system for a long time, typically several months, to avoid arousing suspicion. Because the consequences of an APT attack are serious, detecting it is especially important for network security.

There are many approaches to APT detection, among which detection based on the tracing graph is widely used. A tracing graph is a directed graph in which nodes represent system entities and edges represent system calls; it captures the information flow in the system and the causal relationships over system time. A tracing graph can be constructed from system logs (e.g., audit logs and application logs), so that almost all system behavior can be captured. These characteristics make it a suitable data source for APT detection. UNICORN is currently the most advanced tracing graph-based APT detection method: it trains a model on tracing graphs representing normal system behavior and then uses the model to check whether the tracing graph representing current system behavior contains an APT attack, with a detection precision above 94% in most cases. However, the data volume of a tracing graph is usually very large. One host can generate more than 35 GB of data per day, and in practice a high-throughput host may generate more than 1 GB in as little as 5 minutes. Furthermore, because an APT attack usually lasts a long time, the corresponding volume of tracing data grows, so the tracing graph becomes larger and detection takes longer.

The related art uses data reduction methods to alleviate the data size problem faced by tracing graph-based APT detection, mainly graph structure-based reduction and graph embedding-based reduction. Graph structure-based reduction methods attempt to remove redundant nodes and edges while preserving the connectivity of the tracing graph (i.e., its causal relationships). However, because of the low-frequency, low-speed attack pattern of APTs, the parts of the tracing graph that are relevant to detection are usually not contiguous. If connectivity must be preserved, too much information is retained to reach a reduction rate sufficient for timely APT detection. Graph embedding-based reduction methods use graph embedding techniques to map the tracing graph into a low-dimensional space in which graph information is preserved. They can usually achieve very high reduction rates, but because the information of the entire tracing graph is embedded and the graph undoubtedly contains noise, detection based on such embeddings rarely achieves satisfactory results.

Disclosure of Invention

The invention provides a method and a device for reducing a tracing graph. By establishing a mapping between entities in application logs and nodes in the tracing graph, the method captures the context of those nodes and uses the resulting subgraph as the input of an attack tracing detection method, thereby addressing the problem in the prior art that the huge data volume of the tracing graph impairs attack tracing detection precision. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a method for reducing a tracing graph, where the method includes:

obtaining application program logs of the interaction between a system and the outside by performing co-occurrence analysis on historical operating data of a host;

analyzing and converting the application program log into a normalized log according to a preset field format, and extracting an entity from the normalized log by utilizing a predefined rule set;

for each normalized log, establishing a mapping relation between the entity and a node in a tracing graph by using a common field to obtain a target node related to the entity in the tracing graph, wherein the tracing graph is generated by capturing information flow of a system kernel by using a tracing tool;

and capturing the context of the target node by taking the target node as a trigger point and adopting a random walk-based algorithm, extracting the tracing subgraphs from the target node aiming at each target node, and combining all the tracing subgraphs to form a reduced tracing graph.

Optionally, obtaining the application program logs of the interaction between the system and the outside by performing co-occurrence analysis on the historical operating data of the host includes:

acquiring, by performing co-occurrence analysis on the historical operating data of the host, a list of application programs that co-occur with the application programs through which the system interacts with the outside;

and selecting from the co-occurrence application program list the target application programs whose co-occurrence frequency ranks before a first preset numerical value, and collecting the logs of the target application programs.

Optionally, the collecting the log of the target application includes:

acquiring an application log of a target application program by inquiring a process corresponding to the target application program interacted with the outside on a host;

and for a target application program which cannot obtain the application log, collecting the system call of the target application program by using an auditing tool to obtain the auditing log of the target application program.

Optionally, the analyzing and converting the application program log into a normalized log according to a preset field format includes:

acquiring data information of a preset field from the application program log by analyzing the application program log;

and normalizing the data information of the preset field into a log table according to a preset field format to form a normalized log.

Optionally, the predefined rule set includes default rules and custom extension rules, and the extracting entities from the normalized log by using the predefined rule set includes:

using the default rule, determining whether a preset field of the normalized log contains a known attack signature and/or a key system command and, if so, extracting the matching field from the normalized log as an accurate entity;

otherwise, using the custom extension rule, extracting nouns and their request timestamps from the preset fields of the normalized log as result entities.

Optionally, determining, using the default rule, whether the preset field of the normalized log contains a known attack signature and/or a key system command and, if so, extracting the matching field from the normalized log as an accurate entity includes:

using the default rule, if a preset field of the normalized log contains a known attack signature, determining whether the application program field in the normalized log matches the known attack signature and, if so, extracting the application program field as an accurate entity;

and using the default rule, if the preset field of the normalized log contains a key system command, processing the application program field in the normalized log and the key system command contained in the preset field.

Optionally, extracting nouns and their request timestamps from the preset fields of the normalized log as result entities using the custom extension rule includes:

marking the preset fields that appear in the normalized log using the custom extension rule;

and performing part-of-speech analysis with natural language processing to extract nouns and their request timestamps from the preset fields as result entities.
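As a rough illustration of the rule set described above, the following Python sketch applies a default rule and falls back to a simplified extension rule. The signature pattern, the command list, the record layout and the function name are all hypothetical. Treating a key system command hit as yielding both the application field and the command is an assumption, and the fallback uses plain tokenization as a stand-in for part-of-speech analysis, so it also returns tokens that are not nouns.

```python
import re

# Hypothetical rule contents; a real deployment would load its own rule set.
ATTACK_SIGNATURES = [re.compile(r"select\s*\*\s*from", re.IGNORECASE)]
KEY_COMMANDS = {"nohup", "telnetd"}
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_.-]*")


def extract_entities(record):
    """Return a list of (entity_name, request_timestamp) pairs.

    `record` is assumed to be a dict with the normalized-log fields
    Time, PNAME, Path and Msg.
    """
    ts, app, msg = record["Time"], record["PNAME"], record["Msg"]
    found = []

    # Default rule, part 1: a known attack signature in Msg makes the
    # application field an accurate entity.
    if any(sig.search(msg) for sig in ATTACK_SIGNATURES):
        found.append((app, ts))

    # Default rule, part 2 (assumption): a key system command in Msg yields
    # the application field and the command itself.
    commands = [tok for tok in WORD.findall(msg) if tok in KEY_COMMANDS]
    if commands:
        found.append((app, ts))
        found.extend((cmd, ts) for cmd in commands)

    if found:
        return list(dict.fromkeys(found))  # drop duplicates, keep order

    # Extension rule stand-in: plain tokenization instead of real
    # part-of-speech tagging.
    tokens = WORD.findall(record.get("Path", "")) + WORD.findall(msg)
    return list(dict.fromkeys((tok, ts) for tok in tokens))
```

A real implementation would replace the tokenization fallback with an actual part-of-speech tagger and keep only the tokens tagged as nouns.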

Optionally, the establishing, for each normalized log, a mapping relationship between the entity and a node in a tracing graph by using a common field to obtain a target node associated with the entity in the tracing graph includes:

selecting, for each normalized log, the name and the request timestamp as the common fields;

and if the entity has the same name as a node in the tracing graph and the difference between their request timestamps is within a preset range, mapping the entity to that node, the node being a target node associated with the entity in the tracing graph.
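The name-and-timestamp matching described above can be sketched in Python as follows. The node layout (dicts with `id`, `name` and `timestamp` keys), the function name and the `max_skew` parameter standing for the preset range are assumptions made for illustration only.

```python
def map_entities_to_nodes(entities, nodes, max_skew):
    """Return the ids of tracing-graph nodes matched by the entities.

    entities: list of (name, request_timestamp) pairs.
    nodes:    list of dicts with hypothetical keys "id", "name", "timestamp".
    max_skew: largest allowed timestamp difference (the preset range).
    """
    # Index nodes by name so each entity only scans same-named nodes.
    by_name = {}
    for node in nodes:
        by_name.setdefault(node["name"], []).append(node)

    targets = []
    for name, ts in entities:
        for node in by_name.get(name, []):
            if abs(node["timestamp"] - ts) <= max_skew:
                targets.append(node["id"])
    # Remove duplicates while keeping the first-seen order.
    return list(dict.fromkeys(targets))
```

With a tolerance of 5 time units, an entity `("apache2", 102.0)` would match an `apache2` node stamped 100.0 but not one stamped 500.0.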

Optionally, the capturing, with the target node as a trigger point, the context of the target node by using a random walk-based algorithm, and extracting, for each target node, a traceable subgraph starting from the target node, includes:

performing a random walk with the target node as the trigger point, and recording how often each node is visited during the random walk;

and sorting the visited nodes by visit frequency, retaining the nodes ranked before a second preset numerical value as the context of the target node, and extracting, for each target node, a tracing subgraph starting from that target node.
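A minimal Python sketch of the random walk step and the final merge is given below. It assumes the tracing graph is held as a dict of adjacency lists and that walks follow outgoing edges only; the function names and the parameters `num_walks`, `walk_length`, `top_n` and `seed` are hypothetical, with `top_n` playing the role of the second preset numerical value.

```python
import random
from collections import Counter


def walk_context(graph, target, num_walks, walk_length, top_n, rng):
    """Return the target plus its top_n most frequently visited nodes."""
    visits = Counter()
    for _ in range(num_walks):
        current = target
        for _ in range(walk_length):
            successors = graph.get(current, [])
            if not successors:
                break  # dead end: stop this walk early
            current = rng.choice(successors)
            visits[current] += 1
    context = {node for node, _ in visits.most_common(top_n)}
    context.add(target)
    return context


def reduce_tracing_graph(graph, targets, num_walks=100, walk_length=3,
                         top_n=10, seed=0):
    """Union the per-target subgraphs into one reduced graph."""
    rng = random.Random(seed)
    nodes, edges = set(), set()
    for target in targets:
        context = walk_context(graph, target, num_walks, walk_length,
                               top_n, rng)
        nodes |= context
        # Keep only the edges whose endpoints both lie in this context.
        edges |= {(u, v) for u in context
                  for v in graph.get(u, []) if v in context}
    return nodes, edges
```

Only a bounded number of steps is taken per target node, so the cost of this sketch depends on the number of target nodes and walk parameters rather than on the size of the whole graph.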

In a second aspect, an embodiment of the present invention provides an apparatus for reducing a tracing graph, where the apparatus includes:

the acquisition unit is used for acquiring application program logs of the system interacting with the outside by carrying out co-occurrence analysis on historical operating data of the host;

the extraction unit is used for analyzing and converting the application program log into a normalized log according to a preset field format and extracting an entity from the normalized log by utilizing a predefined rule set;

the establishing unit is used for establishing a mapping relation between the entity and nodes in a tracing graph by using a common field aiming at each normalized log so as to obtain target nodes related to the entity in the tracing graph, wherein the tracing graph is generated by capturing information flow of a system kernel by using a tracing tool;

and the capturing unit is used for capturing the context of the target node by taking the target node as a trigger point and adopting a random walk-based algorithm, extracting the tracing subgraph from the target node aiming at each target node, and combining all the tracing subgraphs to form a reduced tracing graph.

Optionally, the obtaining unit includes:

the analysis module is used for performing co-occurrence analysis on historical operating data of the host and acquiring a co-occurrence application program list from application programs interacted between the system and the outside;

and the collecting module is used for selecting a target application program with the co-occurrence frequency ranking before a first preset numerical value from the co-occurrence application program list and collecting a log of the target application program.

Optionally, the collection module is specifically configured to collect an application log of a target application program by querying a process corresponding to the target application program interacting with the outside on a host;

the collection module is specifically configured to collect, by using an auditing tool, a system call of a target application program for which an application log cannot be obtained, so as to obtain the audit log of the target application program.

Optionally, the extracting unit includes:

the analysis module is used for acquiring data information of a preset field from the application program log by analyzing the application program log;

and the normalization module is used for normalizing the data information of the preset field into a log table according to a preset field format to form a normalized log.

Optionally, the predefined rule set includes a default rule and a custom extension rule, and the extracting unit is specifically configured to determine, using the default rule, whether a preset field of the normalized log contains a known attack signature and/or a key system command and, if so, to extract the matching field from the normalized log as an accurate entity;

the extracting unit is further specifically configured to otherwise extract, using the custom extension rule, nouns and their request timestamps from the preset fields of the normalized log as result entities.

Optionally, the extracting unit is further specifically configured to determine, using the default rule, if a preset field of the normalized log contains a known attack signature, whether the application program field in the normalized log matches the known attack signature and, if so, to extract the application program field as an accurate entity;

the extracting unit is further specifically configured to process, using the default rule, if the preset field of the normalized log contains a key system command, the application program field in the normalized log and the key system command contained in the preset field.

Optionally, the extracting unit is further specifically configured to mark a preset field appearing in the normalized log by using the customized extension rule;

the extracting unit is specifically further configured to perform part-of-speech analysis in combination with natural language processing to extract a noun and a timestamp requested by the noun from the preset field as a result entity.

Optionally, the establishing unit includes:

the selecting module is used for selecting the name and the requested time stamp as a common field aiming at each normalized log;

and the mapping module is used for mapping the entity to a node in the tracing graph if the entity has the same name as that node and the difference between their request timestamps is within a preset range, the node being a target node associated with the entity in the tracing graph.

Optionally, the capturing unit includes:

the recording module is used for performing a random walk with the target node as the trigger point and recording how often each node is visited during the random walk;

and the sequencing module is used for sequencing according to the visited frequency of the passing nodes, reserving the nodes sequenced before a second preset numerical value as the context of the target nodes, and extracting a traceable subgraph from the target nodes aiming at each target node.

In a third aspect, an embodiment of the present invention provides a storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to implement the method of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a device for reducing a tracing graph, including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.

As can be seen from the above, the method and apparatus for reducing a tracing graph provided in the embodiments of the present invention perform co-occurrence analysis on historical operating data of a host to obtain application logs of the interaction between a system and the outside, parse the application logs and convert them into normalized logs according to a preset field format, extract entities from the normalized logs using a predefined rule set, and, for each normalized log, establish a mapping between the entities and nodes in the tracing graph using common fields so as to obtain the target nodes associated with the entities in the tracing graph, where the tracing graph is generated by capturing the information flow of the system kernel with a tracing tool. The target nodes are then taken as trigger points, their context is captured with a random walk-based algorithm, a tracing subgraph starting from each target node is extracted, and all tracing subgraphs are combined into a reduced tracing graph. Compared with the graph structure-based and graph embedding-based reduction methods in the prior art, the embodiments establish a mapping between entities in application logs and nodes in the tracing graph, capture the context of those nodes, and use the resulting subgraph as the input of attack tracing detection, thereby reducing the tracing graph.

In addition, the embodiment can also achieve the technical effects of:

(1) The tracing graph reduction method adopts an algorithm based on random walk to effectively capture the context of the entity in the tracing graph, and the calculation complexity is greatly reduced.

(2) In the simulated APT scene and the real-world APT scene, different APT detection methods are used for carrying out extensive evaluation on the tracing graph reduction method. The result shows that the tracing graph reduction method in the embodiment of the invention achieves high compression rate.

(3) The tracing graph reduction method is beneficial to APT detection, can be integrated with different APT detection methods, and improves the precision of the traditional anomaly detection method and the current most advanced APT detector.

Of course, it is not necessary for any product or method to achieve all of the above-described advantages at the same time for practicing the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below. It is to be understood that the drawings in the following description are of some embodiments of the invention only. For a person skilled in the art, without inventive effort, further figures can be obtained from these figures.

Fig. 1 is a flowchart of a method for reducing a tracing graph according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the overall workflow of the tracing graph reduction method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the data distribution before and after reducing a tracing graph according to an embodiment of the present invention;

Figs. 4a-4b are schematic diagrams of the runtime overhead in the APT-1 experimental scenario according to an embodiment of the present invention;

Fig. 5 is a block diagram of a tracing graph reduction apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It should be apparent that the described embodiments are only some of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art based on the embodiments of the present invention without inventive step, are within the scope of the present invention.

It is to be noted that the terms "comprises" and "comprising" and any variations thereof in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

The invention provides a tracing graph reduction method and device. By establishing a mapping between entities in application logs and nodes in the tracing graph, the method captures the context of those nodes and uses the resulting subgraph as the input of an attack tracing detection method, thereby addressing the problem in the prior art that the huge data volume of the tracing graph impairs attack tracing detection precision. Advanced persistent threats (APTs) pose a great threat to network security because they are novel, hard to perceive and long-lasting. Tracing graph-based detection methods can detect APTs effectively, but the data volume is too large for them to achieve good precision and performance. Existing tracing graph reduction methods are all general-purpose or forensic analysis-oriented reduction algorithms and do not achieve a good reduction effect in the APT detection scenario. The embodiments therefore provide an efficient application log-guided tracing graph reduction method: entities relevant to attacks are identified through application logs, the context of the nodes matching those entities is then explored in the tracing graph with a random walk algorithm, and reduced subgraphs are generated from that context. Experimental evaluation shows that the method reduces data well: the reduction rate reaches up to 150.46 times, and the algorithm incurs only 16% short-term overhead. The method can be integrated with different APT detection methods; it improves the performance of the current leading APT detection method by 14.3% on average and the performance of traditional attack detection methods in the APT detection scenario by up to 68.42%.

Before the specific implementation process is described, and to make it easier to understand, note that the method of the embodiment of the present invention is mainly based on the following observations:

1. An attacker relies on processes, files and sockets in the target system to launch an attack. In almost all scenarios, an attacker inevitably establishes a connection with the target system and interacts with the processes and files of the system.

2. Not all applications are related to attacks. The tracing graph contains a large number of nodes representing processes, files and sockets, which raises the question of which nodes can be removed without affecting the detection effect. According to the APT kill chain, attackers make use of public-facing applications, such as Web servers and download tools. Thus, the target can be narrowed down to public-facing applications.

3. The application log embodies the behavior of the process. Because the application is the entry point for an attack, there must be a lot of high-level semantic information in the application log that is relevant to the attack. In practice, it has been found that there are many of the same entities in the application log as in the tracing graph that underlies it, meaning that the entities in the application log can be mapped to nodes in the tracing graph. Therefore, with the help of the application log, which nodes of the traceback graph are relevant to detection can be found out.

4. To avoid discovery, the actual attack process lasts only a short time. Although APT attackers may remain hidden for a long time after penetrating the system, when they actually launch an attack they usually choose to complete it in the shortest possible time to avoid being discovered. Thus, only a small context around the attack entry point needs to be captured, which yields a higher reduction rate.

In general, APT detection only needs to focus on the nodes in the tracing graph that are related to the applications interacting with the outside, together with their limited contexts. Embodiments of the invention are described in detail below.

Fig. 1 is a flowchart illustrating a method for reducing a tracing graph according to an embodiment of the present invention. The method may comprise the steps of:

S100: Acquire the application program logs of the interaction between the system and the outside by performing co-occurrence analysis on the historical operating data of the host.

In enterprise environments, where microservice architectures are widely used, each host is likely to run only one or a few applications that interact with the outside world. In other words, only the few processes corresponding to such applications on each host need to be monitored and their logs collected. However, collecting only the logs of public-facing applications can lose information, because other applications in the system may implicitly depend on the processes of the applications interacting with the outside world. Therefore, co-occurrence analysis is performed on the historical operating data of the host to obtain the top K applications that co-occur with the applications interacting with the outside world, and the logs of these top K applications are collected for subsequent analysis.

Specifically, co-occurrence analysis of the historical operating data of the host yields a list of applications that co-occur with the applications through which the system interacts with the outside. Target applications whose co-occurrence frequency ranks before a first preset numerical value are then selected from this list, and their logs are collected. Co-occurrence analysis here means selecting the applications that appear in the same time range as the externally interacting processes, which reasonably expands the range of possible attack paths and takes as much context as possible into account.

The logs comprise application logs and audit logs. Specifically, when the logs of a target application are collected, its application log can be collected by querying the process on the host that corresponds to the target application interacting with the outside. For a target application whose application log cannot be obtained, an auditing tool is used to collect the system calls of the target application so as to obtain its audit log.

For example, for target applications whose application logs cannot be obtained, such as commercial software that does not provide application logs or applications whose logs are removed to preserve performance in production environments, an auditing tool (e.g., the Linux auditing framework) is used to collect their system calls.

Specifically, in this step, a list of applications that need attention and logs thereof, including an application log and an audit log, are obtained, and specific examples of the logs are as shown below.

Application log example:

Audit log example:

In an actual application scenario, the above process is implemented mainly in Python. The tracing graph and the application logs are first collected from alternative data sources. In practice, the predefined rules can be selected according to the scenario. In the present embodiment, CamFlow is used as the reference implementation. Note that for applications without well-designed logs, it is conservatively assumed that system call logs associated with system objects (e.g., processes, files and sockets) can be used instead. Xanthus is an automated tool that orchestrates virtual machines to reproduce scenarios and generate real logs; this capability is used to build three real attack datasets. However, Xanthus supports only Linux and macOS and therefore cannot fully meet the requirements of cross-platform support and generality. The embodiment of the invention therefore extends and modifies Xanthus, adding some practical options and making it compatible with Windows.

To find the processes that need to be tracked during whole-system tracing, the embodiment of the invention introduces one possible solution: target process co-occurrence analysis. The goal is to identify other processes that may be active while the attack entry process is running, so as to reasonably expand the range of suspicious processes. Here, atop is selected as the data source to be analyzed. Specifically, the logs extracted by the atop parser are taken as a corpus. The process names in the log are then grouped into units that share the same timestamp. This allows a co-occurrence matrix of process names to be constructed and a node dictionary and an edge dictionary to be generated. The node dictionary contains node names and node weights (frequencies), and the edge dictionary contains start points, end points and edge weights (frequencies). After system processes (systemd, audiod, etc.) are filtered out, the processes whose edges to the target process have the K largest weights are taken as the processes to be tracked.
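The co-occurrence analysis just described can be sketched in Python as follows. This is a simplified illustration, not the embodiment's code: the input is assumed to be already parsed into (timestamp, process name) pairs, and the function names and the contents of `SYSTEM_PROCESSES` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical filter list of system processes to ignore.
SYSTEM_PROCESSES = {"systemd"}


def build_cooccurrence(samples):
    """Build node and edge dictionaries from (timestamp, name) pairs.

    Returns (node_weights, edge_weights): node_weights counts in how many
    timestamp units a process appears, edge_weights counts in how many
    units a pair of processes appears together.
    """
    units = {}
    for ts, name in samples:
        units.setdefault(ts, set()).add(name)

    node_weights, edge_weights = Counter(), Counter()
    for names in units.values():
        node_weights.update(names)
        # Sorting makes (a, b) a canonical key for the undirected pair.
        for a, b in combinations(sorted(names), 2):
            edge_weights[(a, b)] += 1
    return node_weights, edge_weights


def top_k_cooccurring(edge_weights, target, k, ignore=SYSTEM_PROCESSES):
    """Return up to k processes with the heaviest edges to `target`."""
    scores = Counter()
    for (a, b), weight in edge_weights.items():
        if a == target:
            other = b
        elif b == target:
            other = a
        else:
            continue
        if other not in ignore:
            scores[other] += weight
    return [name for name, _ in scores.most_common(k)]
```

For a target such as `apache2`, `top_k_cooccurring(edge_weights, "apache2", k)` returns the non-system processes that most often share a timestamp unit with it.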

S110: Parse the application program log and convert it into a normalized log according to a preset field format, and extract entities from the normalized log using a predefined rule set.

The preset field format is in effect a table-filling format defined for the fields in a log. Specifically, to parse and convert an application log into a normalized log, the data of the preset fields is first obtained by parsing the application log, and this data is then normalized into a log table according to the preset field format to form the normalized log.

For example, Table 1 below shows application logs converted into the normalized format. The header lists the fields of the normalized log: {Time, PID, PNAME, IP, Protocol, Path, Msg}. Time is the timestamp of the request, PID is the process ID of the running application, PNAME is the process name of the application, IP is the IP address the request comes from, Protocol is the protocol used by the request, and Path is the related system file path or URL. The Msg field is populated with network traffic information. All application logs (including the corresponding audit logs) are parsed and converted into normalized logs.

TABLE 1 Normalized log example

| No. | Time | PID | PNAME | IP | Protocol | Path | Msg |
|---|---|---|---|---|---|---|---|
| 1 | t1 | p1 | apache2 | IP1 | HTTP | a.com | SELECT * FROM table1 |
| 2 | t2 | p2 | smbd | IP2 | SMB | /usr/sbin/smbd | /=`nohup telnetd -l /bin/sh -p 4444` |
| 3 | t3 | p3 | TermServices.exe | IP3 | RDP | - | \x02\xf0\x80\x7f\x65\x82\x01\x94\x04 |

The normalization process is explained further with reference to the application log example and the audit log example above. On line 2 of the application log example, the apache2 log shows that at time t1 a request from IP1 was sent to apache2 in an attempt to access a.com. In lines 1-4 of the audit log example, the audit log shows that the pid of apache2 is p1 and that the network traffic information of the request is SELECT * FROM table1. This yields normalized log 1, the first row in Table 1. On line 7 of the application log example, the smbd log shows that at time t2 a request from IP2 was sent to smbd. In lines 5-8 of the audit log example, the related audit log shows that the pid of smbd is p2, the path of smbd is /usr/sbin/smbd, and the network traffic information of the request is /=`nohup telnetd -l /bin/sh -p 4444`. This yields normalized log 2, the second row in Table 1. TermServices.exe has no application log, so only its audit log is used. In lines 10-11 of the audit log example, the audit log shows that the pid of TermServices.exe is p3, that the request was sent from IP3, and that the network traffic information of the request is \x02\xf0\x80\x7f\x65\x82\x01\x94\x04. This yields normalized log 3, the third row in Table 1.

Further, 14 parsers may be developed specifically for the application logs. If a new application is added, only a new parser needs to be written for that application. Because the log format of an application typically changes little, the modification cost is relatively low.

Specifically, after the normalized log is obtained, entities are extracted from it using a predefined rule set. Because these entities come from the logs of applications that interact with the outside world, which an attacker may target, they are likely to be related to the attacker.

In practical applications, the main goal of application log parsing is to generate normalized logs that can expose detection-related entities. Because logs exist at multiple levels, multiple parsers are required, one for each data source. The tracing graph reduction method provided by the invention uses 14 parsers based on regular expressions; more complex context-free or even context-sensitive grammars are not needed.
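As an illustration of what one regular-expression-based parser might look like, the sketch below converts a web server access line in Common Log Format into the normalized fields {Time, PID, PNAME, IP, Protocol, Path, Msg}. It is only an assumed example: the source does not disclose the 14 parsers, and the function name `parse_apache_line`, the choice of log format, and the way PID and Msg are supplied from the audit log (as plain parameters) are hypothetical.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method path protocol/version" status size
APACHE_CLF = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"/]+)/[^"]*" \d{3} \S+'
)


def parse_apache_line(line, pid=None, msg=""):
    """Return one normalized log record, or None if the line does not match.

    pid and msg are assumed to be joined in from the corresponding audit log.
    """
    m = APACHE_CLF.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "Time": ts.isoformat(),  # generalize the timestamp format
        "PID": pid,
        "PNAME": "apache2",
        "IP": m.group("ip"),
        "Protocol": m.group("proto"),
        "Path": m.group("path"),
        "Msg": msg,
    }
```

Supporting a new application would then amount to adding one more pattern and one more small function of this kind.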

During parsing, fields that have the same meaning often differ in form across log formats, so a generalization operation is applied to further normalize the logs. For example, version numbers, timestamp formats, and paths are generally variable even when they carry the same meaning. The format of the normalized log is described above. The Path field and the Msg field of the normalized log typically contain system commands and files, which are often critical to extracting detection-related entities.

The predefined rule set comprises default rules and custom extension rules. When entities are extracted from a normalized log using the predefined rule set, the default rules are applied first: if a preset field of the normalized log contains a known attack signature and/or a key system command, the matching fields are extracted from the normalized log as precise entities; otherwise, the custom extension rules are used to extract nouns, together with the timestamps of their requests, from the preset field of the normalized log as result entities. To extract precise entities, if the preset field of the normalized log contains a known attack signature, the default rules judge whether the application program fields in the normalized log match that known attack signature and, if so, the application program fields are extracted as precise entities; the default rules also judge whether the preset field contains a key system command and, if so, the application program fields in the normalized log and the key system command contained in the preset field are processed. To extract result entities, the custom extension rules tag the preset field of the normalized log and, in combination with natural language processing, perform part-of-speech analysis to extract nouns and the timestamps of their requests from the preset field as result entities.

The following predefined rules are applied in order to each normalized log:

1. If the Msg field contains the signature of a known attack, the PNAME, Protocol and Path (if not empty) fields are checked against the known attack. If they match, the PID, PNAME and Path (if not empty) are extracted as entities. For example, the Msg in normalized log 3 of Table 1 is part of a message exploiting the MS12-020 vulnerability. The PNAME is TermServices.exe and the Protocol is RDP, which is consistent with the known attack, so the extracted entities are {p3, TermServices.exe}.

2. If the Msg field contains critical system commands, the PID, PNAME, Path and the critical system commands in Msg are processed. This step comes from the intuition that an attacker can manipulate system entities using built-in system commands. Thus, if the system is attacked, the Path and Msg fields of the normalized log may contain system commands, e.g., cp, rm, ls. For example, the Msg of normalized log 2 in Table 1 contains /bin/sh, a critical system command, which suggests that someone is trying to manipulate the system, so the extracted entities are {p2, smbd, /usr/sbin/smbd, /bin/sh}.

3. If neither of the above two conditions holds, the entities appearing in the Msg field are extracted. Msg carries semantics about user behavior, sometimes in a structured text format. Thus, natural language processing (NLP) can be used to tag the Msg and perform part-of-speech analysis to extract nouns and their times from the Msg as entities. For example, table1 in the Msg of normalized log 1 of Table 1 is a noun, so the resulting entities are {p1, table1}.

As can be seen from the above, the first two rules are problem-specific, and the last is the default rule (the fallback rule). The problem-specific rules are intended to extract an exact set of entities from the normalized log; when none of the problem-specific rules applies, the default rule is used. Furthermore, the rule set is extensible, and new rules can easily be added.
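The ordered application of the three rules can be sketched as below, using the rows of Table 1 as input. This is a simplified illustration under stated assumptions: the signature table `KNOWN_SIGNATURES`, the command list `CRITICAL_COMMANDS`, and the function `extract_entities` are hypothetical, and the third rule uses a crude keyword filter as a stand-in for the NLP part-of-speech analysis described in the text.

```python
import re

# Hypothetical signature table: signature text -> expected PNAME and Protocol.
KNOWN_SIGNATURES = [
    {"text": r"\x02\xf0\x80\x7f\x65", "pname": "TermServices.exe", "protocol": "RDP"},
]

# Hypothetical list of critical system commands.
CRITICAL_COMMANDS = ["/bin/sh", "cp", "rm", "ls"]
COMMAND_RE = re.compile(
    r"(?<![\w/])(" + "|".join(re.escape(c) for c in CRITICAL_COMMANDS) + r")(?![\w/])"
)

# Stand-in for part-of-speech tagging: words that are not treated as nouns.
STOPWORDS = {"select", "from", "where", "insert", "into", "update", "delete", "and", "or"}


def extract_entities(log):
    """Apply the rules in order and return the entities of one normalized log."""
    msg = log.get("Msg") or ""
    path = log.get("Path") or ""
    base = [log["PID"], log["PNAME"]]
    if path and path != "-":  # "-" marks an empty Path in Table 1
        base.append(path)

    # Rule 1: known attack signature in Msg, confirmed by PNAME and Protocol.
    for sig in KNOWN_SIGNATURES:
        if (sig["text"] in msg and log["PNAME"] == sig["pname"]
                and log["Protocol"].upper() == sig["protocol"]):
            return base

    # Rule 2: critical system commands in Msg.
    commands = list(dict.fromkeys(COMMAND_RE.findall(msg)))  # de-duplicate, keep order
    if commands:
        return base + commands

    # Rule 3 (fallback): noun-like tokens in Msg, together with the PID.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", msg)
    nouns = [t for t in tokens if t.lower() not in STOPWORDS]
    return [log["PID"]] + nouns
```

In a fuller version each returned entity would be stored together with the Time field of its log, since the timestamp is needed later when entities are mapped to nodes of the tracing graph.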

It can be understood that the extraction of detection-related entities is crucial to the tracing graph reduction method of the present invention and greatly affects the reduced tracing graph. At the same time, the definition of detection-related entities relies heavily on the predefined rules. When a file is read or written using a system command, the file name is first assumed to be a detection-related entity. In addition to the file names disclosed jointly by the tracing graph and the application log, the table entities disclosed by application messages are also considered detection-related entities. In particular, as mentioned above, application messages contain content such as SELECT * FROM table1, and operations on table1 are likely to be malicious. Normal application messages are also considered to contain semantic information about the expected behavior of the user. The application-layer message is treated as a sentence. To extract the entities in the application-layer log, spaCy is used because it contains a fast entity recognition model. In addition to the detection-related entities themselves, their timestamps are recorded; the timestamps and the names of the detection-related entities are later used to match trigger points on the tracing graph.

S120: for each normalized log, establishing a mapping relation between the entities and the nodes in the tracing graph by using common fields, so as to obtain the target nodes associated with the entities in the tracing graph.

The tracing graph is generated by capturing the information flow of the system kernel with a tracing tool. It is a directed acyclic graph whose direction follows time, and it represents the causal relationships between subjects (processes, threads, etc.) and objects (files, registry entries, network sockets, etc.) in the system. The tracing graph can express the causal relationship between two events no matter how long the time interval between them is, so a security expert can use it to carry out attack investigation.

For each normalized log, after the corresponding entities are obtained, the nodes directly related to these entities can be found in the tracing graph according to the names and times of the entities; these nodes are called trigger points. Specifically, for each normalized log, the name and the request timestamp are selected as the common fields. If an entity has the same name as a node in the tracing graph and the difference between their timestamps is within a preset range, the entity is mapped to that node, which becomes a target node associated with the entity in the tracing graph.
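A minimal sketch of this mapping step is given below. The data layout is assumed for illustration only: entities are (name, timestamp) pairs, tracing graph nodes are stored in a dictionary with `name` and `time` attributes, timestamps are numeric seconds, and `max_skew` stands for the preset range, whose actual value the source does not specify.

```python
def map_entities_to_nodes(entities, nodes, max_skew=5.0):
    """Return the ids of tracing graph nodes matched by some entity (trigger points).

    entities: iterable of (name, timestamp) pairs extracted from normalized logs.
    nodes:    dict node_id -> {"name": str, "time": float} (assumed layout).
    max_skew: allowed timestamp difference; a hypothetical default.
    """
    # Index nodes by name so each entity only scans nodes with the same name.
    by_name = {}
    for node_id, attrs in nodes.items():
        by_name.setdefault(attrs["name"], []).append((node_id, attrs["time"]))

    triggers = set()
    for name, ts in entities:
        for node_id, node_ts in by_name.get(name, []):
            if abs(node_ts - ts) <= max_skew:
                triggers.add(node_id)
    return triggers
```

Entities without a matching node, such as a table name that never appears in the tracing graph, simply produce no trigger point.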

S130: taking the target nodes as trigger points, capturing the context of the target nodes by using a random-walk-based algorithm, extracting, for each target node, a tracing subgraph starting from that target node, and combining all the tracing subgraphs to form the reduced tracing graph.

Here, starting from the trigger points, a random-walk-based algorithm (Algorithm 1 below) is used to capture the context of each trigger point. For each trigger point, Algorithm 1 outputs the subgraph extracted from the original tracing graph starting at that point; the subgraphs extracted for all trigger points are then combined to form the reduced tracing graph.

Algorithm 1: Graph reduction algorithm

An algorithm based on random walks effectively avoids the dependency explosion problem. It should be noted, however, that many detection methods rely on the stability of the data, and a plain random walk obviously cannot meet this requirement. Thus, in designing the algorithm, the weight of the neighbors of a trigger point is increased by retaining the more frequently visited nodes during the random walks, so as to capture as much of the entity's context as possible. This is also consistent with the observation above that an actual attack lasts only a short time in order to avoid discovery, so generally only the context of the entities associated with the attack needs attention.

In addition, some applications may not provide logs, or the information in their logs may be insufficient, so that few trigger points are matched and the generated subgraph may not support the subsequent APT detection process. To address this issue, when the number of trigger points is found to be less than a predefined threshold, Algorithm 1 falls back to breadth-first search (BFS) to capture as much of the entity's context as possible.

Specifically, a random walk may be performed with the target node as the trigger point, and the number of times each node is visited is recorded during the walk. The nodes are then sorted by visit frequency, the nodes ranked within the second preset value are retained as the context of the target node, and, for each target node, the tracing subgraph starting from that target node is extracted. For example, suppose the trigger point is node v and the surrounding nodes are visited repeatedly by random walks, with node x1 visited with probability 1/q, node x2 with probability 2/q, and node x3 with probability 3/q. Node x3, which has the highest visit probability, is retained as the context of the target node, forming a tracing subgraph starting from the target node.

In practice, after the detection-related entities are extracted, they are mapped to the tracing graph of the entire system using their common fields. The currently selected common fields are name and timestamp. That is, if a detection-related entity has the same name as a node of the tracing graph and the timestamp difference between them is within a certain range, the entity is mapped to that node, and the node is selected as a starting point for the subsequent graph reduction. This process is repeated until all entities have been mapped and all the nodes they map to have been selected. The selected nodes are referred to as "trigger points".

The concrete graph reduction process is implemented using NetworkX, a graph processing library written in Python. Inspired by the node2vec algorithm, each trigger point is used as the starting point for generating a number of node sequences, and the visit frequencies of the nodes are kept in a frequency dictionary. After sorting in descending order, only a given number of the most frequent nodes are retained. Because the context of a node is generally more important, when the number of trigger points is below a certain threshold the tracing graph reduction algorithm of the present application degenerates to breadth-first search (BFS) to preserve more context.
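The reduction step can be sketched as follows. Several simplifications are assumptions of this example rather than details from the source: the graph is a plain adjacency dictionary instead of a NetworkX graph (to keep the sketch dependency-free), the walk is uniform rather than biased, edges are followed only in the forward direction, and all parameter names and default values (`min_triggers`, `num_walks`, `walk_length`, `keep_top`, `max_depth`) are hypothetical.

```python
import random
from collections import Counter, deque


def random_walk_context(adj, trigger, num_walks, walk_length, keep_top, rng):
    """Walk from the trigger repeatedly and keep the most frequently visited nodes."""
    freq = Counter()  # frequency dictionary: node -> visit count
    for _ in range(num_walks):
        node = trigger
        for _ in range(walk_length):
            neighbors = adj.get(node, [])
            if not neighbors:
                break
            node = rng.choice(neighbors)  # uniform choice; the source uses a biased walk
            freq[node] += 1
    kept = {n for n, _ in freq.most_common(keep_top)}
    kept.add(trigger)
    return kept


def bfs_context(adj, trigger, max_depth):
    """Fixed-depth breadth-first search from the trigger."""
    seen = {trigger}
    queue = deque([(trigger, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def reduce_graph(adj, triggers, min_triggers=3, num_walks=50, walk_length=5,
                 keep_top=10, max_depth=3, seed=0):
    """Merge the per-trigger contexts and return the induced subgraph."""
    rng = random.Random(seed)
    use_bfs = len(triggers) < min_triggers  # too few trigger points: fall back to BFS
    kept = set()
    for t in sorted(triggers):
        if use_bfs:
            kept |= bfs_context(adj, t, max_depth)
        else:
            kept |= random_walk_context(adj, t, num_walks, walk_length, keep_top, rng)
    return {u: [v for v in adj.get(u, []) if v in kept] for u in kept}
```

Replacing `rng.choice` with a weighted sampler would turn the uniform walk into a biased one; the fixed `seed` only makes the sketch reproducible.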

In summary, to address the huge data volume of the tracing graph, the related art offers many tracing graph reduction methods, but they still have several disadvantages. First, because these methods are all general-purpose or oriented toward forensic analysis rather than APT detection, their reduction efficiency is relatively low, usually only about a 10x reduction rate, i.e., the reduced data is 1/10 the size of the original data. General-purpose or forensics-oriented reduction methods require that the reduced tracing graph still retain causal relationships (that is, the connectivity of the tracing graph), which leaves redundant information. Second, these methods depend on the whole tracing graph and need memory and disk space as large as the graph data, which makes them impractical given the scale of graph data in APT scenarios. Third, these methods are designed only for certain specific scenarios and do not generalize across scenarios; for example, NodeMerge learns the file access patterns of processes in order to merge redundant nodes, but because file access patterns differ between scenarios, it may not be applicable elsewhere. Finally, almost all of these methods reduce the tracing graph based on the graph itself alone, without fully exploiting other information that could be introduced to obtain higher reduction rates.

An ideal data reduction method, however, should have the following features: a high reduction rate, low overhead, and no negative influence on APT detection. It is generally believed that a reduction algorithm tailored to APT detection (rather than a general-purpose or forensics-oriented one) can achieve higher reduction rates, because more detection-independent information can be removed. In addition, according to experience, an attacker can only launch an attack through the applications of the system that interact with the outside, and the application logs contain high-level semantic information related to the APT attack that can be used for detection. Moreover, within the same system, the application logs and the tracing graph share many of the same entities. Therefore, the high-level semantics related to APT detection in the application logs can be fused into the tracing graph to achieve more efficient data reduction. The embodiment of the invention provides an application-log-guided tracing graph reduction method oriented toward efficient APT detection, which removes redundant information from the tracing graph, obtains a high reduction rate, and maintains the precision of the APT detection method. The following assumptions are made. First, the integrity of the data (primarily the application logs and the provenance data) is assumed. That is, an attacker may compromise an application but cannot modify data that has already been collected and stored; ensuring the integrity of the data is outside the scope of this description. Second, the system is assumed not to have been attacked before the tracing graph reduction method is deployed. Finally, the hardware, the kernel, the running programs, and the tracing graph reduction method are assumed to be implemented correctly.

Specifically, the tracing graph reduction method collects the logs of the applications that interact with the outside and the tracing graph of the system, extracts entities from the application logs, and establishes a mapping relation between the entities and nodes in the tracing graph. Starting from the nodes that are mapped to entities, it then uses a random-walk-based algorithm to capture the context of those nodes, namely a subgraph, and finally uses the generated subgraph as the input of an APT detection method.

In an actual application scenario, the overall workflow of the tracing graph reduction method provided by the embodiment of the present invention is shown in Fig. 2. First, log filtering: co-occurrence analysis is performed on the host to find the applications that interact with the outside, or the set of applications related to them, and their application logs are obtained. Then, log parsing: the logs are parsed into the normalized format. Next, entity extraction: entities are extracted from the normalized logs according to the predefined rules. Finally, entity mapping and graph reduction: the extracted entities are mapped to nodes in the tracing graph, the contexts of these nodes are captured using the biased random walk algorithm provided by the invention, and a subgraph is filtered out of the original tracing graph. The resulting subgraph can be used as the input of an APT detection method.

Furthermore, to verify the reduction effect achieved by the tracing graph reduction method of the embodiment of the invention, the method can be evaluated on preset attack datasets. Specifically, for dataset selection, three types of attacks are implemented based on detailed reports and collected logs of real-world APT activities. To simulate an APT scenario, the attacks strictly follow a typical APT kill chain, which comprises 7 main stages: information collection, planning, tool delivery, penetration, backdoor installation, host control and follow-up operations. The generated datasets are described below, and their detailed characteristics are shown in Table 2:

1. Supply chain dataset. In this dataset, two common supply chain attack scenarios are simulated.

SC-1 scenario. The attacker first identifies a remote enterprise CI server that frequently uses wget to download Debian packages from different repositories. The attacker discovers that the wget version on the server is 1.17, which is vulnerable to arbitrary remote file upload (CVE-2016-4971) when the victim requests a malicious URL from a compromised server. The attacker embeds a common remote access trojan in a Debian package and compromises one of the repositories. Thus, any request to download the seemingly legitimate package is unknowingly redirected to the attacker's FTP server, which hosts the trojan-embedded package. The CI server downloads and installs the malicious package, thereby also installing the trojan. The trojan then establishes a C&C channel with the attacker, creating a reverse TCP shell session on the CI server. Finally, the attacker modifies the CI server configuration to gain control over the CI deployment output.

SC-2 scenario. The setup is largely the same as in SC-1, but the attacker exploits a different vulnerability, one present in Bash 4.3. In short, this vulnerability (CVE-2014-6271) allows an attacker to execute arbitrary code in Bash by appending a trailing string after a function definition. Both of the above scenarios represent a compromise of the supply chain.

2. APT-1 dataset. Compared with the two attacks above, the APT-1 attack scenario is more complex. The attacker first uses a vulnerability scanner to discover an SQL injection vulnerability in an enterprise website and thereby gains control of the web server. The attacker then uses the web server to scan the intranet slowly and finds that the target OA server has the EternalBlue (MS17-010) vulnerability. The attacker uses this vulnerability to log in to the OA server, and obtains root access to the OA server by uploading and executing an MS15-015 exploit. Finally, the critical information is stolen and sent back to the attacker.

TABLE 2 data set characteristics

It should be noted that all the above experiments were run on a Linux host with an 8-core Intel Xeon Cascade Lake 8255C CPU (2.50 GHz) and 32 GB of RAM. To evaluate the tracing graph reduction method of the embodiment of the present invention systematically, the accuracy, precision, recall and F1 score before and after reduction are compared to check how well detection performance is retained on the reduced tracing graph. In addition, the runtime overhead is evaluated to determine whether the tracing graph reduction method of the embodiment of the invention has reasonable overhead.
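For reference, the four evaluation metrics can be computed from the counts of a binary confusion matrix as in the sketch below. These are the standard definitions; the function name and the convention of returning 0.0 for an undefined ratio are choices made for this example, and the counts passed in would come from running a detector on the original and on the reduced tracing graphs.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives, false negatives,
    where "positive" means a graph classified as containing an attack.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Comparing the dictionaries obtained before and after reduction gives the per-metric change for a detector.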

Further, to check whether the tracing graph reduction method of the embodiment of the present invention has little negative impact on the performance of an APT detector, UNICORN, currently the most advanced tracing-graph-based APT detection method, is selected. UNICORN is a state-of-the-art anomaly-based APT detection algorithm that uses the tracing graph. It adopts a novel graph sketching technique that summarizes the tracing graph into an incrementally updatable, fixed-size structure and can model long-term behavior and its evolution. It can therefore detect APTs effectively as the tracing graph changes.

Regarding the data reduction rate, to evaluate the effectiveness of the tracing graph reduction method of the embodiment of the present invention, the method is applied to the above 3 datasets and compared with four other data reduction methods: LogGC, CPR, FD/SD and GS/SS. The reduced data sizes and reduction rates are shown in Table 3. On average, the reduction rate of the tracing graph reduction method of the embodiment of the present invention is 78.14x, which is higher than that of the other methods.

TABLE 3 Reduction results

Because the size of the application logs is counted when the reduction rate is calculated, the reduction rate on the APT-1 dataset appears lower than that of other methods. If only the size of the reduced tracing graph is considered, the reduction rate on APT-1 reaches 98.74x, far higher than the reported result. The same holds for the other datasets. In practice, enterprises typically already collect application logs for purposes such as debugging and user statistics. Therefore, the tracing graph reduction method of the embodiment of the invention achieves a higher effective reduction rate in the real world.
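The effect of counting the application logs can be illustrated with a small calculation. The formula below is an assumption about how the rate is computed (original size divided by the total size that must be kept); the source only states that the application log size is taken into account. The sizes in the usage note are hypothetical and are not results from the experiments.

```python
def reduction_rate(original_size, reduced_graph_size, app_log_size=0.0):
    """Reduction rate as original size over retained size (assumed definition).

    All sizes must use the same unit. Passing app_log_size=0 gives the rate
    that considers only the reduced tracing graph.
    """
    retained = reduced_graph_size + app_log_size
    if retained <= 0:
        raise ValueError("retained size must be positive")
    return original_size / retained
```

With hypothetical sizes of 1000 MB for the original graph, 10 MB for the reduced graph and 5 MB of application logs, `reduction_rate(1000, 10)` gives 100x, while `reduction_rate(1000, 10, 5)` gives roughly 66.7x.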

In addition, the reduction rate of the method is relatively low on the SC-2 and APT-1 datasets. Note that the reduction rate is inversely related to the number of entities matched between the application logs and the tracing graph. In the SC-2 dataset, the Bash log contains almost all user actions, so more entities match and the reduction rate is lower. In the APT-1 dataset, the scenario is much more complex and more entities match, so the reduction rate is also lower. However, because the tracing graph reduction method of the embodiment of the present invention can be used together with other reduction methods, redundant entities in the tracing graph can be removed by additionally applying methods such as LogGC and NodeMerge.

Regarding the impact on APT detection, the tracing graph reduction method of the embodiment of the invention and four other reduction methods (LogGC, CPR, FD/SD and GS/SS) are applied to the 3 datasets. Table 4 shows the change in UNICORN's evaluation metrics (accuracy, precision, recall, F1 score) before and after reduction on each dataset. As Table 4 shows, the other methods have a more or less negative impact on performance, whereas the tracing graph reduction method of the embodiment of the present invention improves all metrics.

TABLE 4 evaluation results on UNICORN

Because the attacker knows the target in advance, and generally knows at least the general flow of the target system's supply chain, the attacker no longer needs to search the system for vulnerabilities. In this case, the traces left by the attacker in the system are greatly reduced, which makes supply chain attacks harder to detect. Furthermore, the APT attacks launched in the APT-1 dataset scenario are stealthy and complex. This is one of the reasons UNICORN does not perform well on the original tracing graph. The tracing graph reduction method provided by the embodiment of the invention retains the detection-related information and deletes the information unrelated to the attack, and therefore has a positive effect on the performance of UNICORN.

Further, to better understand the influence on APT detection performance, the well-known dimensionality reduction method t-SNE can be used to visualize the graph data before and after graph reduction. The data distributions before and after reduction of the tracing graph are shown in Fig. 3, where the distribution before reduction is drawn with triangles and the distribution after reduction with squares. As can be seen from Fig. 3, the data after reduction is more separable, which improves the accuracy of the detector. The original graphs that contain attacks and those that do not overlap and cannot be separated. After reduction, the attack graphs form a line and a planar cluster, while the normal graphs form a line and a planar cluster that partly overlap those of the attack graphs. Although the overlap is not completely eliminated, a reasonable hyperplane can still be found that separates most of the attack data from the normal data. The remaining overlap is considered acceptable, because some benign operations are certain to be performed during a network attack.

With respect to runtime overhead, although the main purpose of the tracing graph reduction method provided by the embodiment of the present invention is to preserve detection-related subgraphs, the runtime overhead should also be kept low. In this embodiment, biased random walks are the first choice for generating subgraphs. Specifically, the alias method, a discrete sampling method with a sampling time complexity of O(1), can be used. Note also that when there are few trigger points, the subgraph generation algorithm degenerates to BFS in order to better preserve context information, and a fixed-step BFS traversal is likewise an algorithm with a time complexity of O(1). In summary, the total time complexity of the tracing graph reduction method can be guaranteed to be O(N).
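The alias method mentioned above can be sketched as follows (Vose's construction). The function names are chosen for this example. Building the table takes time linear in the number of outcomes, after which each sample costs O(1); in a biased random walk, one table would typically be built per node over the weights of its outgoing edges.

```python
import random


def build_alias_table(weights):
    """Build alias tables for sampling index i with probability weights[i] / sum(weights)."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]  # mean of scaled values is 1.0
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        g = large.pop()
        prob[s] = scaled[s]  # keep s with this probability, otherwise use its alias g
        alias[s] = g
        scaled[g] = scaled[g] + scaled[s] - 1.0  # g donated mass to fill column s
        (small if scaled[g] < 1.0 else large).append(g)
    for i in large + small:  # leftovers are exactly (or numerically) full columns
        prob[i] = 1.0
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Draw one index in O(1): pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

For example, the weights [1, 2, 3] correspond to neighbors visited with probabilities 1/q, 2/q and 3/q for q = 6.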

The CPU and memory usage of the tracing graph reduction method of the embodiment of the present invention is measured here to understand in more depth whether the method has reasonable overhead. Figs. 4a-4b show the runtime overhead in the APT-1 experimental scenario, in which a 1.8 GB tracing graph is reduced in one round. In terms of CPU utilization, the main concern is the average CPU utilization over a long run with UNICORN in its baseline configuration. As can be seen from Fig. 4a, the average CPU utilization is around 16%. In addition, as shown in Fig. 4b, the tracing graph reduction method of the embodiment of the present invention requires 1,320 MB of memory on average at runtime, with a peak of 5,773 MB, which is acceptable in most cases. Overall, the runtime overhead is low.

In addition to UNICORN, other graph-based anomaly detection methods have been tested to evaluate support for conventional detection methods.

1. ODDBALL: ODDBALL is a graph-based anomaly detection algorithm that finds anomalies in static weighted graphs based on egonet patterns. It is a fast, unsupervised method that detects abnormal nodes in a weighted graph and assigns each node an outlier score, which can further be used to evaluate how anomalous the node is.

2. Anomaly detection based on graph embedding: traditional graph embedding finds a low-dimensional representation of a graph while preserving certain properties. Some studies have used graph embedding for anomaly detection. For the graph embedding approach, Node2Vec, LINE and SDNE are each implemented here. Node2Vec learns a mapping of nodes to a low-dimensional feature space by maximizing the probability of occurrence of subsequent nodes in fixed-length random walks. LINE explicitly defines two functions, for first-order and second-order proximity respectively, and minimizes the combination of the two. SDNE uses a deep autoencoder to preserve first-order and second-order network proximity.

Further, LOF and IF are selected to generate outliers from the graph embeddings. LOF (Local Outlier Factor) is a density-based method that assigns each object a degree of being an outlier, called the object's local outlier factor. IF (Isolation Forest) builds a set of iTrees for a given dataset; anomalies are the instances with short average path lengths on the iTrees.

To divide the data into attack data and normal data, the number of outliers is used as the feature and XGBoost is used as the binary classifier. XGBoost is a state-of-the-art method for many binary classification problems and is available as an open-source package. It introduces a novel sparsity-aware algorithm for parallel tree learning, and its theoretically justified weighted quantile sketch procedure enables instance weights to be handled in approximate tree learning.

Table 5 shows that the tracing graph reduction method of the embodiment of the present invention is applicable to various graph-based anomaly detection methods. Past work has demonstrated that conventional approaches may not be suitable for APT detection because they model the long-term behavior of the system poorly. The graph reduction method proposed here, however, improves the performance of these conventional methods, which also broadens the generality of the graph reduction. As the above experiments show, both static graphs and streaming graphs are considered in the present invention, which indicates that tracing graph reduction does not require a particular type of graph. Furthermore, the tracing graph reduction method of the embodiment of the present invention is designed to be applicable to almost all types of logs; thanks to the pre-built parsers, it places few requirements on log format. Its runtime overhead is also reasonable in an enterprise environment. Therefore, the tracing graph reduction method provided by the embodiment of the invention can be applied to a variety of scenarios and enterprise environments. Finally, the tracing graph reduction method is orthogonal to the other reduction methods and can therefore, in theory, be used together with them to achieve higher reduction rates.

Table 5. Support for conventional detection methods

The effectiveness of the tracing graph reduction method in an embodiment of the invention is evaluated on systems in simulated APT scenarios and in real-world APT scenarios. The evaluation results show that the tracing graph reduction method provided by the embodiment of the invention achieves a reduction rate of 78.14 on average and 150.46 at maximum, with a CPU occupation of 16% for a short time. In addition, the tracing graph reduction method can be integrated with different APT detection methods without degrading their performance. In the evaluation scenarios, the precision of UNICORN, the current state-of-the-art APT detection method, is improved by 14.3% on average, and the precision of several traditional anomaly detection methods is improved by up to 68.42%. This shows that conventional anomaly detection methods can detect APTs when they use the tracing graph reduced by the tracing graph reduction method in the embodiment of the present invention.

The tracing graph reduction method provided by the embodiment of the invention obtains application logs of interaction between the system and the outside by performing co-occurrence analysis on historical operating data of the host; parses and converts the application logs into normalized logs according to a preset field format; extracts entities from the normalized logs using a predefined rule set; and, for each normalized log, establishes a mapping relation between the entities and nodes in the tracing graph using a common field, so as to obtain the target nodes associated with the entities in the tracing graph, where the tracing graph is generated by capturing the information flow of the system kernel with a tracing tool. The target nodes are then used as trigger points, their context is captured with a random-walk-based algorithm, a tracing subgraph starting from each target node is extracted, and all tracing subgraphs are combined to form the reduced tracing graph. Compared with the graph-structure-based and graph-embedding-based reduction methods in the prior art, the method uses the entities extracted from application logs to capture the context of the relevant nodes in the tracing graph, which effectively reduces the data volume of the tracing graph and thereby improves the accuracy of attack tracing detection.

Further, the tracing graph reduction algorithm in the embodiment of the present invention can achieve more beneficial effects:

Possibility of extension to a streaming algorithm: although intrusion detection and forensic analysis are usually performed after the fact, they are usually time-sensitive tasks, so it is still necessary to design a real-time tracing graph reduction method. The tracing graph reduction method in the embodiment of the present invention can be directly extended to a streaming algorithm. Specifically, at any point in time, the method obtains the whole-system tracing graph and the other multi-level system logs, and can continuously match them against the graph according to the names and timestamps of the extracted entities to generate trigger points, so that subgraphs can be generated at run time. In addition, the experiments carried out on UNICORN demonstrate the potential of the tracing graph reduction method to benefit streaming APT detection.
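
The continuous matching described above can be sketched as a generator that emits a trigger point as soon as an entity and a graph node with the same name and close timestamps have both been observed. The event tuple layout, the function name `stream_trigger_points` and the `max_skew` tolerance are assumptions made for illustration only.

```python
from collections import defaultdict

def stream_trigger_points(events, max_skew=5.0):
    """Yield (entity, node) pairs as soon as a matching pair has been seen.

    `events` is an iterable of (kind, name, timestamp) tuples, where kind is
    "entity" or "node"; this layout is an assumption of the sketch.
    """
    seen = {"entity": defaultdict(list), "node": defaultdict(list)}
    for kind, name, ts in events:
        other = "node" if kind == "entity" else "entity"
        # Match the new item against everything of the other kind seen so far.
        for other_ts in seen[other][name]:
            if abs(ts - other_ts) <= max_skew:
                entity = (name, ts) if kind == "entity" else (name, other_ts)
                node = (name, other_ts) if kind == "entity" else (name, ts)
                yield entity, node
        seen[kind][name].append(ts)

# Hypothetical interleaved stream of graph nodes and extracted entities.
events = [
    ("node", "/etc/passwd", 100.0),
    ("entity", "/etc/passwd", 102.0),
    ("entity", "nginx", 200.0),
    ("node", "nginx", 300.0),
]
matches = list(stream_trigger_points(events))
```

The sketch keeps every item it has seen; a long-running deployment would need to expire entries older than the tolerance window.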

Effect of filtering: the performance of the tracing graph reduction method of the present invention may be affected by the quality of process filtering. In particular implementations, co-occurrence analysis can be performed on atop data to select the applications that should be considered. However, because atop is a coarse-grained system analysis tool, the filtering may be inaccurate.

Handling of performance bottlenecks: modern APT attacks always produce large amounts of log data. Extracting detection-related entities from such a large volume of data is difficult and places high demands on computational performance. The implementation of the tracing graph reduction method in the invention uses spaCy, which is well suited to large-scale information extraction tasks. However, there is still no guarantee that the throughput of the method will be sufficient in all scenarios.

Based on the foregoing embodiments, another embodiment of the present invention provides a device for reducing a tracing graph, as shown in fig. 5, the device includes:

the obtaining unit 20 may be configured to obtain an application log of interaction between the system and the outside by performing co-occurrence analysis on historical operating data of the host;

the extracting unit 22 may be configured to parse and convert the application log into a normalized log according to a preset field format, and extract an entity from the normalized log by using a predefined rule set;

the establishing unit 24 may be configured to establish, for each normalized log, a mapping relationship between the entity and a node in a tracing graph using a common field to obtain a target node associated with the entity in the tracing graph, where the tracing graph is generated by capturing an information stream of a system kernel using a tracing tool;

the capturing unit 26 may be configured to capture the context of the target node by using the target node as a trigger point and using a random walk-based algorithm, extract, for each target node, a traceable subgraph starting from the target node, and combine all the traceable subgraphs to form a reduced traceable graph.

In a specific application scenario, the obtaining unit 20 includes:

the analysis module can be used for acquiring a co-occurrence application program list from application programs interacted between the system and the outside by performing co-occurrence analysis on historical operating data of the host;

the collecting module may be configured to select a target application with a co-occurrence frequency ranking before a first preset value from the co-occurrence application list, and collect a log of the target application.
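
One plausible reading of the co-occurrence analysis and ranking performed by these two modules is sketched below. It assumes that the historical operating data has already been reduced to snapshots, each listing the applications seen interacting with the outside during one sampling interval, and it scores an application by how often it appears together with other applications. The snapshot format, the scoring rule and the function name are assumptions, not details given by the embodiment.

```python
from collections import Counter
from itertools import combinations

def top_cooccurring_apps(snapshots, top_n):
    """Rank applications by how often they appear together with other applications."""
    pair_counts = Counter()
    for apps in snapshots:
        # Count every unordered pair of distinct applications in the snapshot once.
        for a, b in combinations(sorted(set(apps)), 2):
            pair_counts[(a, b)] += 1
    app_scores = Counter()
    for (a, b), count in pair_counts.items():
        app_scores[a] += count
        app_scores[b] += count
    # Sort by score (descending), then by name so that ties are resolved deterministically.
    ranked = sorted(app_scores.items(), key=lambda item: (-item[1], item[0]))
    return [app for app, _ in ranked[:top_n]]

# Hypothetical snapshots derived from historical host data.
snapshots = [
    ["nginx", "mysqld", "sshd"],
    ["nginx", "mysqld"],
    ["nginx", "cron"],
    ["sshd"],
]
# top_n plays the role of the first preset value.
targets = top_cooccurring_apps(snapshots, top_n=2)
```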

In a specific application scenario, the collection module may be specifically configured to collect an application log of a target application program by querying a process corresponding to the target application program interacting with the outside on a host;

the collection module may be further specifically configured to collect, by using an auditing tool, a system call of a target application program for which an application log cannot be obtained, so as to obtain the audit log of the target application program.

In a specific application scenario, the extracting unit 22 includes:

the parsing module can be used for obtaining data information of a preset field from the application log by parsing the application log;

and the normalization module can be used for normalizing the data information of the preset field into a log table according to a preset field format to form a normalized log.
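
The parsing and normalization steps can be sketched as follows. The field names in `FIELDS` and the raw line layout matched by `LINE_PATTERN` are hypothetical; the actual preset field format is whatever the embodiment defines.

```python
import re
from datetime import datetime, timezone

# Hypothetical preset field format.
FIELDS = ("timestamp", "application", "message")

# Assumed raw layout: "<ISO-8601 time> <application>: <free text>".
LINE_PATTERN = re.compile(r"^(?P<time>\S+)\s+(?P<app>[\w.-]+):\s*(?P<msg>.*)$")

def normalize(raw_line):
    """Parse one raw log line into a normalized record, or None if it cannot be parsed."""
    match = LINE_PATTERN.match(raw_line.strip())
    if match is None:
        return None
    try:
        moment = datetime.fromisoformat(match["time"]).astimezone(timezone.utc)
    except ValueError:
        return None  # the first token was not a valid timestamp
    return dict(zip(FIELDS, (moment.timestamp(), match["app"], match["msg"])))

raw_lines = [
    "2022-06-01T08:00:00+00:00 nginx: GET /index.php?cmd=whoami 200",
    "malformed line",
]
# The log table keeps only the lines that could be normalized.
log_table = [record for record in map(normalize, raw_lines) if record is not None]
```

Storing the timestamp as seconds since the epoch makes the later timestamp comparison against graph nodes a simple subtraction.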

In a specific application scenario, the predefined rule set includes a default rule and a custom extension rule, and the extracting unit 22 may be specifically configured to determine, using the default rule, whether a preset field of the normalized log contains a known attack signature and/or a key system command, and if so, to extract the matching field from the normalized log as an accurate entity;

the extracting unit 22 may be further specifically configured to otherwise extract, using the custom extension rule, a noun and its request timestamp from a preset field of the normalized log as a result entity.

In a specific application scenario, the extracting unit 22 may be further configured to use the default rule to determine, if a preset field of the normalized log includes a known attack signature, whether an application field in the normalized log is matched with the known attack signature, and if so, extract the application field as an accurate entity;

the extracting unit 22 may be further configured to determine, by using the default rule, that if the preset field of the normalized log includes a key system command, process the application field in the normalized log and the key system command included in the preset field.

In a specific application scenario, the extracting unit 22 may be further configured to mark a preset field appearing in the normalized log by using the customized extension rule;

the extracting unit 22 may be further configured to perform part-of-speech analysis using natural language processing, so as to extract a noun and its request timestamp from the preset field as a result entity.
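
The two-stage rule set can be sketched as follows. The signature and command lists are hypothetical, the sketch reports the matched string itself rather than a whole field, and `naive_nouns` is only a stand-in for real part-of-speech tagging; in practice a tagger such as the one mentioned in the description would be passed in as `noun_extractor`.

```python
import re

# Hypothetical rule data; a real deployment would load curated lists.
ATTACK_SIGNATURES = ("../../", "union select", "/etc/passwd")
KEY_COMMANDS = ("whoami", "wget", "chmod")

def naive_nouns(text):
    """Stand-in for part-of-speech tagging: treat alphabetic tokens of 3+ letters as nouns."""
    return re.findall(r"[A-Za-z_]{3,}", text)

def extract_entities(record, noun_extractor=naive_nouns):
    """Apply the default rule first; fall back to the custom extension rule otherwise."""
    message = record["message"].lower()
    hits = [s for s in ATTACK_SIGNATURES + KEY_COMMANDS if s in message]
    if hits:
        # Default rule: matched signatures or commands are reported as accurate entities.
        return [{"name": h, "timestamp": record["timestamp"], "kind": "accurate"} for h in hits]
    # Custom extension rule: nouns plus the request timestamp become result entities.
    return [
        {"name": n, "timestamp": record["timestamp"], "kind": "result"}
        for n in noun_extractor(record["message"])
    ]

accurate = extract_entities(
    {"timestamp": 1.0, "application": "nginx", "message": "GET /index.php?cmd=whoami 200"}
)
fallback = extract_entities(
    {"timestamp": 2.0, "application": "sshd", "message": "Accepted password for alice"}
)
```

The fallback output shows why a real tagger is needed: the naive tokenizer also returns words such as "for" that are not nouns.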

In a specific application scenario, the establishing unit 24 includes:

the selecting module can be used for selecting a name and a requested timestamp as a common field aiming at each normalized log;

the mapping module may be configured to map the entity to the corresponding node in the tracing graph if the entity and the node have the same name and the difference between their request timestamps is within a preset range, where that node is a target node associated with the entity in the tracing graph.
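
The mapping rule can be sketched as a name lookup followed by a timestamp comparison. The dictionary layout of entities and nodes and the `max_skew` value standing for the preset range are assumptions made for illustration.

```python
def map_entities_to_nodes(entities, nodes, max_skew=5.0):
    """Return IDs of graph nodes sharing a name with an entity and lying within max_skew seconds."""
    # Index nodes by name so that each entity only inspects candidates with the same name.
    nodes_by_name = {}
    for node in nodes:
        nodes_by_name.setdefault(node["name"], []).append(node)
    targets = []
    for entity in entities:
        for node in nodes_by_name.get(entity["name"], []):
            close_in_time = abs(node["timestamp"] - entity["timestamp"]) <= max_skew
            if close_in_time and node["id"] not in targets:
                targets.append(node["id"])
    return targets

# Hypothetical tracing graph nodes and one extracted entity.
nodes = [
    {"id": "n1", "name": "whoami", "timestamp": 100.0},
    {"id": "n2", "name": "whoami", "timestamp": 500.0},
    {"id": "n3", "name": "bash", "timestamp": 101.0},
]
entities = [{"name": "whoami", "timestamp": 102.0}]
target_nodes = map_entities_to_nodes(entities, nodes)
```

Node n2 has the right name but falls outside the time window, and n3 is close in time but has a different name, so only n1 becomes a target node.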

In a specific application scenario, the capturing unit 26 includes:

the recording module can be used for performing random walks with the target node as the trigger point and recording how frequently each node is visited during the random walks;

and the sorting module can be used for sorting the visited nodes by visit frequency, retaining the nodes ranked before a second preset value as the context of the target node, and extracting, for each target node, the tracing subgraph starting from that target node.
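
The recording, ranking and merging steps can be sketched as follows. The graph is assumed to be a dictionary of adjacency lists in which a repeated successor stands for repeated events between the same pair of nodes; the walk count, walk length, `keep` (playing the role of the second preset value) and the fixed seed are illustrative choices, not values given by the embodiment.

```python
import random
from collections import Counter

def extract_subgraph(graph, trigger, walks=200, walk_length=2, keep=2, seed=0):
    """Random-walk from `trigger`, keep the `keep` most visited nodes, return the induced edges."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    visits = Counter()
    for _ in range(walks):
        current = trigger
        for _ in range(walk_length):
            successors = graph.get(current, [])
            if not successors:
                break
            current = rng.choice(successors)
            visits[current] += 1
    # The context is the trigger point plus the most frequently visited nodes.
    context = {trigger} | {node for node, _ in visits.most_common(keep)}
    return {(u, v) for u in context for v in graph.get(u, []) if v in context}

def reduce_graph(graph, triggers, **kwargs):
    """Union of the subgraphs extracted around every trigger point."""
    reduced = set()
    for trigger in triggers:
        reduced |= extract_subgraph(graph, trigger, **kwargs)
    return reduced

# Hypothetical tracing graph; "cron" and "logrotate" are unrelated to the trigger point.
graph = {
    "nginx": ["bash", "bash", "bash", "access.log"],
    "bash": ["payload.sh"],
    "payload.sh": [],
    "access.log": [],
    "cron": ["logrotate"],
    "logrotate": [],
}
reduced_edges = reduce_graph(graph, ["nginx"])
```

Rarely visited nodes such as the log file, and nodes that are unreachable from any trigger point, drop out of the reduced graph.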

Based on the above method embodiments, another embodiment of the present invention provides a storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the above method.

Based on the foregoing embodiment, another embodiment of the present invention provides a tracing graph reduction device, including:

one or more processors;

a storage device to store one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.

The above device embodiment corresponds to the method embodiment and has the same technical effect. Because the device embodiment is derived from the method embodiment, reference may be made to the method embodiment section for the specific description, which is not repeated here. Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and that the blocks or flows in the figures are not necessarily required to practice the present invention.

Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for reducing a tracing graph, the method comprising:

aiming at each normalized log, establishing a mapping relation between the entity and a node in a tracing graph by using a common field so as to obtain a target node which is associated with the entity in the tracing graph, wherein the tracing graph is generated by capturing information flow of a system kernel by using a tracing tool;

2. The method of claim 1, wherein obtaining application logs of system interactions with the outside world by co-occurrence analysis of historical operating data of the host comprises:

the method comprises the steps that co-occurrence analysis is carried out on historical operating data of a host, and a co-occurrence application program list is obtained from application programs of interaction between a system and the outside;

3. The method of claim 2, wherein the collecting the log of the target application comprises:

collecting the application log of the target application program by querying, on the host, the process corresponding to the target application program that interacts with the outside;

and for the target application program which cannot obtain the application log, collecting the system call of the target application program by using an auditing tool to obtain the auditing log of the target application program.

4. The method of claim 1, wherein parsing and converting the application log into a normalized log according to a preset field format comprises:

5. The method of claim 1, wherein the predefined rule set includes default rules and custom extension rules, and wherein extracting entities from the normalized log using the predefined rule set comprises:

6. The method of claim 5, wherein the determining, using the default rule, if a predetermined field of the normalized log contains a known attack signature and/or a key system command, extracting a matching field from the normalized log as an accurate entity comprises:

7. The method as claimed in claim 5, wherein the extracting nouns and requested timestamps from the preset fields of the normalized log as result entities using the custom extension rules comprises:

8. The method according to any one of claims 1-7, wherein said establishing, for each normalized log, a mapping relationship between the entity and a node in a tracing graph using a common field to obtain a target node associated with the entity in the tracing graph comprises:

and if the name of the entity and the node in the tracing graph have the same name and the difference of the requested time stamps between the name of the entity and the node in the tracing graph is in a preset range, mapping the entity to the corresponding node in the tracing graph, wherein the node is a target node associated with the entity in the tracing graph.

9. The method according to any one of claims 1 to 7, wherein said capturing the context of the target node by using the algorithm based on random walk with the target node as the trigger point, extracting the tracing subgraph from the target node for each target node, comprises:

and sorting the visited nodes by visit frequency, retaining the nodes ranked before a second preset value as the context of the target node, and extracting, for each target node, a tracing subgraph starting from that target node.

10. An apparatus for reducing a tracing graph, the apparatus comprising: