CN117540367A - Attack investigation method based on behavior sequence and language model - Google Patents

Attack investigation method based on behavior sequence and language model

Info

Publication number
CN117540367A
CN117540367A (application CN202210909030.9A)
Authority
CN
China
Prior art keywords
behavior
sequence
behavior sequence
attack
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210909030.9A
Other languages
Chinese (zh)
Inventor
胡威
高雅婷
赵金梦
王景初
尚智婕
李家威
张茹
刘建毅
陈连栋
程凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Beijing University of Posts and Telecommunications
Information and Telecommunication Branch of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Beijing University of Posts and Telecommunications
Information and Telecommunication Branch of State Grid Hebei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Beijing University of Posts and Telecommunications, Information and Telecommunication Branch of State Grid Hebei Electric Power Co Ltd filed Critical State Grid Information and Telecommunication Co Ltd
Priority to CN202210909030.9A priority Critical patent/CN117540367A/en
Publication of CN117540367A publication Critical patent/CN117540367A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/554Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses an attack investigation method based on a behavior sequence and a language model. An audit log is converted into behavior sequences, a Transformer-based deep bidirectional pre-training model (BERT) performs self-supervised learning on the behavior sequences, and the classification of attack behavior sequences and normal behavior sequences is realized by fine-tuning. The method comprises the following steps: behavior sequence generation, in which a behavior dependency subgraph is extracted from the behavior dependency graph, the subgraph is converted into a behavior sequence, and the entities in the sequence are processed with a lemmatization method; pre-training, in which the model performs representation learning on unlabeled behavior sequences in a self-supervised manner; and fine-tuning for the downstream task, in which the model is fine-tuned with labeled data to obtain a trained model that realizes the classification of behavior sequences. By constructing a method based on a behavior sequence and a language model, the invention provides a new design idea for attack investigation.

Description

Attack investigation method based on behavior sequence and language model
Technical Field
The invention belongs to the field of log analysis, and particularly relates to attack investigation based on audit logs and language models.
Background
Security events in large enterprises and organizations are on an increasing trend, and complex attacks, represented by advanced persistent threats (APTs), have become a major threat to enterprises and organizations. To address these threats, enterprises deploy threat detection software such as Intrusion Prevention Systems (IPS) and Security Information and Event Management (SIEM) tools. This software continually monitors enterprise-wide activity, captures the behavior and state of system execution, and generates threat alerts when suspicious activity is found. Network security analysts respond to these events, screening the alerts through methods such as provenance analysis or causality analysis to discover the root cause and damage scope of an attack. However, the automated software often adopts simple matching methods, causing a high false-positive rate, so security analysts must spend a great deal of time screening alerts, and real attacks cannot be discovered in time. In addition, APT attacks are stealthy: they easily bypass these automatic detection tools and remain hidden, and the analysis of individual events makes it difficult to discover these disguised attacks.
To overcome the above deficiencies, recent work has considered the correlation between IOCs and the context information of alert events. In fact, this correlation and context information contains the behavior and goals of the attacker, which differ greatly from those of normal users and are difficult to hide. Attack investigation automatically analyzes the authenticity of alerts offline, using the context information of the alert events, to reduce false positives, and discovers undetected threat behavior by analyzing historical audit logs. Some attack-investigation solutions, such as matching against rule knowledge bases or using tag strategies, require the manual involvement of domain experts, and the completeness and accuracy of the expert knowledge affect the analysis results.
Aiming at these problems, NoDoze constructs an event-frequency database to replace the rule knowledge base, based on the observation that audit events related to attacks rarely occur. However, to avoid detection, an attacker may disguise the attack as normal behavior or use a normal process such as svchost.exe, which affects the accuracy of methods that calculate threat scores based on matching or single-event frequency; moreover, the above methods only consider reducing the false positives of threat detection software and do not address the difficult problem of false negatives. Some existing work extracts attack behavior from threat intelligence and designs matching algorithms to search audit logs for these known attacks. However, many APT attacks are never disclosed by security companies. In addition, APT attacks are targeted, and network weapons are upgraded or intrusion strategies changed when a new target is attacked. This makes threat-intelligence-based methods unable to discover unknown attacks.
Pre-trained language models have proven effective in enhancing many natural language processing tasks. Pre-training-based methods perform representation learning on large amounts of unlabeled data; by labeling a certain amount of data, the learned data representations can easily be transferred to downstream tasks through fine-tuning-based or feature-based methods. However, many language models are unidirectional and cannot consider the context in both directions when processing token-level tasks.
Disclosure of Invention
The invention provides an attack investigation method based on a behavior sequence and a language model: behavior sequences of events are generated from a behavior dependency graph constructed from an audit log, a Transformer-based deep bidirectional pre-trained language model performs representation learning on the behaviors in the sequences, and the model learns normal behavior sequences and attack behavior sequences from labeled data, thereby realizing the analysis and discovery of attack behavior events.
The invention provides an attack investigation method based on a behavior sequence and a language model, which comprises the following steps:
1) Extracting the relations between entities from the audit log and constructing a behavior dependency graph, wherein nodes in the graph represent entities in the audit log and directed edges represent the relations between entities;
2) Respectively traversing the behavior dependency graph forwards and backwards, taking attack behavior events and normal behavior events as starting points, to generate subgraphs containing the context information of the behavior events, optimizing the subgraphs, and merging similar behaviors and nodes;
3) Generating a behavior sequence from the extracted behavior dependency subgraph, processing the entities in the behavior sequence based on the idea of lemmatization in natural language processing, and retaining the characteristics of the entities;
4) Performing representation learning on the behavior sequences with a Transformer-based deep bidirectional pre-training model (BERT) to obtain vector representations of the entities and behaviors in the sequences, and fine-tuning the model with labeled behavior sequences to apply it to the downstream classification task, so as to realize the discovery of attack sequences, thereby reducing false positives and discovering unknown attack behavior.
Further, the behavior sequence generation and the language model construction and training include:
a) The behavior dependency graph is a graph data structure extracted from the audit log that represents the causal relations among entities; the dependency graph consists of nodes and directed edges, where the nodes represent entities in the audit log, such as processes and files, and the directed edges represent behaviors, such as read and connect;
b) Starting from a behavior event, extracting the context information of the behavior event from the dependency graph by depth-first traversal (DFS), where the DFS suspension condition for an attack behavior event is reaching a normal entity and the DFS suspension condition for a normal behavior event is reaching an attack entity, and optimizing the generated subgraphs by merging events that are close in time and similar;
c) Converting the subgraph into a behavior sequence in time order, where a behavior event in the sequence is represented as (E_i, Action_i, E_j), E_i and E_j representing specific entities and Action_i representing the behavior between the entities, and mapping the entities in the sequence, based on the idea of lemmatization in natural language processing, as E_i → E'_i,
wherein E'_i retains the basic characteristics of the entity, such as the file type or process name; preserving these features in the sequence facilitates learning based on the behavior sequence;
d) Constructing a Transformer-based deep bidirectional pre-training model and designing the pre-training task: the input of the model is a large number of unlabeled behavior sequences; 15% of the tokens in an input sequence are randomly selected, of which 90% are replaced with the special token [MASK] and the remaining 10% are replaced with random tokens, and the model is trained to predict the tokens at the covered positions based on the context;
e) For the downstream task, labeled behavior sequences are input into the model and the pre-trained model is fine-tuned on the classification task, realizing the classification of attack behavior sequences and normal behavior sequences.
Further, in step 1) there are three entity types, namely process, file, and IP address, and the relations between entities represent operations, such as a process reading and writing a file (read, write) or a process making a connection (connect);
further, in the forward depth traversal process taking the attack action event as the starting point in the step 2), when the starting node of the traversed action event is normal action, the forward depth traversal is stopped, when the ending node of the traversed action event is normal action, the backward depth traversal is stopped, and when the depth traversal taking the normal action event as the starting point, the depth traversal is opposite to the attack action event;
further, in step 2), merging multiple behaviors of one entity to another entity in a short time, and merging the entities for operation of the entity to multiple similar entities in a short time to optimize the dependency subgraph;
further, in the step 3), morphological reduction is mainly performed according to the characteristics of the entity, including removing the process ID, mapping the file according to the file type, which retains the basic characteristics of the entity and is beneficial to the representation learning of the model on the sequence;
Compared with existing methods, the method can effectively reduce the false-positive rate of automated threat detection software and discover hidden attack behavior:
1. The invention provides an attack investigation method based on a behavior sequence and a language model, which fully considers the context information of behavior events, reduces the dependence on labeled data through self-supervised learning on behavior sequences, and fine-tunes the model for the downstream task using the learned data representations, thereby learning sentence-level features and achieving better results in identifying attack sequences;
2. The invention provides a lemmatization method for entities, which greatly reduces the size of the dictionary while keeping the system-level characteristics of the entities, benefiting the model's representation learning on behavior sequences;
3. The invention uses a Transformer-based deep bidirectional pre-training model and designs a pre-training task based on a masked language model; compared with other language models, this considers the context in both directions simultaneously, improving the model's representation learning of entities;
4. The invention requires only a small amount of annotated data, enabling the model to learn sentence-level features by fine-tuning, achieving better performance on the downstream task while reducing the dependence on annotated data.
Drawings
FIG. 1 is a block diagram of the framework of the method of the present invention, which mainly comprises behavior sequence generation, language model pre-training, and fine-tuning for the downstream task.
Fig. 2 is the entity lemmatization mapping table.
FIG. 3 shows the model's accuracy, recall, precision, and F1-score on different data sets.
Fig. 4 is a schematic structural diagram of an attack investigation device based on a behavior sequence and a language model.
Detailed Description
In order that the above features and advantages of the present invention may be more readily understood, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
The attack investigation method designed by the invention is based on a behavior sequence and a language model, and is suitable for investigating the authenticity of alerts using audit logs and discovering undetected attack behavior. The method expresses the audit log as a behavior dependency graph, generates behavior sequences, has the model learn the features of the behavior sequences, and realizes the classification of attack behavior sequences and normal behavior sequences. The implementation flow is shown in FIG. 1 and mainly comprises the following steps:
Step 101, extracting a behavior dependency graph from the audit log, the graph containing entity nodes and the information of the operations among entity nodes.
In the embodiments of the application, the audit log may come from the logging facilities of different operating systems, such as ETW (Event Tracing for Windows) on Windows or the audit log system of Linux.
In the embodiments of the application, the content of an entity node is an entity name from the audit log, for example a file name, process ID, or IP address, and the operations between entity nodes are the operational behaviors of entities, for example a process reading, writing, or deleting a file.
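A minimal sketch of building such a dependency graph from already-parsed audit records follows. The (subject, action, object, timestamp) record format is an assumption for illustration; real logs (ETW, the Linux audit system) require format-specific parsing first.

```python
# Hypothetical sketch of step 101: constructing the behavior dependency
# graph. Nodes are entity names; each edge keeps its action and timestamp.

def build_dependency_graph(records):
    """records: iterable of (subject, action, object, timestamp) tuples.
    Returns (nodes, edges) where nodes is a set of entity names and edges
    a list of (subject, action, object, timestamp) directed edges."""
    nodes, edges = set(), []
    for subj, action, obj, ts in records:
        nodes.add(subj)
        nodes.add(obj)
        edges.append((subj, action, obj, ts))
    return nodes, edges
```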
Step 201, starting from a behavior event, performing depth-first traversal on the behavior dependency graph based on an entity executing the operation in the event and the operated entity, and stopping the traversal based on a specific condition to generate a behavior dependency subgraph.
Step 202, traversing the behavior dependency subgraph, finding behaviors in which one entity performs multiple operations on another entity within 10 s and merging them, finding behaviors in which one entity performs the same operation on multiple similar entities and merging the similar entities, thereby optimizing the behavior dependency subgraph.
In the embodiments of the application, multiple directed edges may exist between two entity nodes, representing operations executed at different times between the entities; merging the operations generated within a time window T effectively reduces and optimizes the graph. Similar entities are files of the same type under the same path.
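The edge-merging optimization within the time window T could be sketched as below. The edge tuple layout is the same assumption used earlier, and the policy of keeping the first occurrence in each window is one reasonable reading of the description.

```python
# Hypothetical sketch of step 202: repeated identical operations between
# the same pair of entities inside a time window (10 s in the text) are
# collapsed into a single edge.

def merge_repeated_edges(edges, window=10):
    """edges: list of (src, action, dst, timestamp).
    Keeps an edge only if no identical (src, action, dst) edge was kept
    within `window` seconds before it."""
    merged, last_kept = [], {}
    for src, action, dst, ts in sorted(edges, key=lambda e: e[3]):
        key = (src, action, dst)
        if key in last_kept and ts - last_kept[key] <= window:
            continue  # same behavior inside the window: drop the repeat
        last_kept[key] = ts
        merged.append((src, action, dst, ts))
    return merged
```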
Step 301, sorting the behavior events in the behavior dependency subgraph in time order to generate a behavior sequence and, for attack behavior sequences, adding attack behavior sequence samples by randomly removing an attack-related entity.
In the embodiments of the application, the behavior sequence is expressed in the form (E_i, Action_i, E_j), where E_i and E_j represent entity nodes in the graph and Action_i represents a directed edge. To solve the training problem caused by sample imbalance, an attack-related entity in an attack behavior sequence is removed to generate a new attack behavior sequence.
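Step 301 could be sketched as follows: order the subgraph's edges by timestamp into (E_i, Action_i, E_j) triples, then create an extra attack sample by dropping the events that touch one randomly chosen attack-related entity. The augmentation policy shown (dropping all events of one entity) is an assumption about how "randomly removing an entity" is realized.

```python
import random

def to_sequence(edges):
    """Time-ordered list of (subject, action, object) triples from
    (subject, action, object, timestamp) edges."""
    return [(s, a, o) for s, a, o, _ in sorted(edges, key=lambda e: e[3])]

def augment_attack_sequence(seq, attack_entities, rng=random):
    """New attack sample: remove every event involving one randomly
    chosen attack-related entity (used to balance the training set)."""
    victim = rng.choice(sorted(attack_entities))
    return [(s, a, o) for s, a, o in seq if victim not in (s, o)]
```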
Step 302, traversing the behavior sequence and lemmatizing the entities in it: common files are mapped based on their file suffix or type, and heuristics are designed to hide irrelevant paths and to map files whose type cannot be judged, based on the files' locations.
In the embodiments of the application, lemmatizing an entity is similar to lemmatization in natural language processing: irrelevant paths and file names are removed and the file type is preserved, as shown in Fig. 2. This step removes irrelevant information from the entity and helps the model learn features.
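One possible reading of the entity lemmatization (cf. Fig. 2) is sketched below: strip trailing process IDs, keep process names, collapse IP addresses to one token, and reduce other files to their suffix. The concrete rules and token spellings are assumptions for illustration, not the mapping table of Fig. 2 itself.

```python
import re

def lemmatize_entity(entity):
    # strip a trailing process ID, e.g. "svchost.exe:1234" -> "svchost.exe"
    entity = re.sub(r":\d+$", "", entity)
    # process names are a basic feature and are kept (path removed)
    if entity.endswith(".exe"):
        return entity.rsplit("/", 1)[-1]
    # IP addresses collapse to a generic token
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", entity):
        return "[IP]"
    # other files keep only path-independent type information: the suffix
    if "." in entity:
        return "[FILE_." + entity.rsplit(".", 1)[-1] + "]"
    return entity
```

This kind of mapping shrinks the token dictionary drastically while preserving the system-level features the model learns from.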
Step 401, tokenizing all behavior sequences and adding the special tokens [CLS], [SEP], and [PAD], where [CLS] and [SEP] represent the beginning and end of a sequence and [PAD] pads shorter sequences to a fixed length.
In the embodiments of the application, tokenization expresses the words in a sequence in numeric form so that the sequence can conveniently be input into the model.
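Step 401 could be sketched as below: flatten each triple into tokens, wrap the sequence with [CLS]/[SEP], pad with [PAD] to a fixed length, and map tokens to integer ids via a dictionary built from the corpus. The vocabulary layout is an illustrative assumption.

```python
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(sequences):
    """Token dictionary: special tokens first, then corpus tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for seq in sequences:
        for triple in seq:
            for tok in triple:
                vocab.setdefault(tok, len(vocab))
    return vocab

def encode(seq, vocab, max_len):
    """[CLS] tokens... [SEP], padded with [PAD] to max_len, as ids."""
    toks = ["[CLS]"] + [t for triple in seq for t in triple] + ["[SEP]"]
    toks += ["[PAD]"] * (max_len - len(toks))
    return [vocab[t] for t in toks[:max_len]]
```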
Step 402, randomly selecting 15% of the non-special tokens in a sequence, replacing 90% of them with the mask token [MASK], replacing the remaining 10% with randomly selected tokens, and inputting the tokenized sequence together with vectors representing position information into the model for self-supervised learning.
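The masking procedure in this step could be sketched as follows. The 15%/90%/10% split follows the description; everything else (id layout, return format) is an illustrative assumption. The model is then trained to predict the recorded original tokens at the selected positions.

```python
import random

def mask_sequence(ids, vocab, rng):
    """Pick 15% of non-special positions; 90% of the picks become [MASK],
    the remaining 10% a random regular token. Returns the masked id list
    and a dict position -> original id (the prediction targets)."""
    special = {vocab[t] for t in ("[PAD]", "[CLS]", "[SEP]", "[MASK]")}
    candidates = [i for i, tok in enumerate(ids) if tok not in special]
    n_pick = max(1, round(0.15 * len(candidates)))
    picked = rng.sample(candidates, n_pick)
    masked, labels = list(ids), {}
    regular = [t for t in vocab.values() if t not in special]
    for pos in picked:
        labels[pos] = ids[pos]                 # target to predict
        if rng.random() < 0.9:
            masked[pos] = vocab["[MASK]"]      # 90%: mask token
        else:
            masked[pos] = rng.choice(regular)  # 10%: random token
    return masked, labels
```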
Step 403, defining the downstream task as the classification of attack behavior sequences and normal behavior sequences, inputting the labeled data into the model, and fine-tuning it to obtain the trained model.
In the embodiments of the application, equal numbers of normal behavior sequences and abnormal behavior sequences are selected, labeled, and input into the pre-trained model; some parameters of the model are updated during training to learn sentence-level behavior features.
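The balanced labeling described above could be sketched as follows. The label convention (0 = normal, 1 = attack) and the choice to subsample the larger class are assumptions; the actual fine-tuning itself would feed this data to the BERT classifier head.

```python
import random

def build_finetune_set(normal_seqs, attack_seqs, rng):
    """Label equal numbers of normal (0) and attack (1) sequences and
    shuffle them into one fine-tuning data set."""
    n = min(len(normal_seqs), len(attack_seqs))
    sampled_normal = rng.sample(normal_seqs, n)
    data = [(s, 1) for s in attack_seqs[:n]] + [(s, 0) for s in sampled_normal]
    rng.shuffle(data)
    return data
```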
Step 404, converting an unknown behavior event into an unknown behavior sequence and inputting it into the trained model to complete the classification of the behavior sequence and discover attack behavior.
The method has been experimentally analyzed for feasibility, false-positive identification accuracy, and the ability to investigate unknown attacks, using the DARPA CADETS data set and an audit log data set generated from a public simulated APT attack. The experimental results show that the method can effectively reduce the false-positive rate of automated threat detection tools and can discover undetected unknown attacks.
Next, an attack investigation device based on a behavior sequence and a language model provided in the embodiment of the present application will be described. As shown in fig. 4, the attack investigation apparatus includes a behavior sequence generation module 401, a language model pre-training module 402, a downstream task fine-tuning module 403, and an attack investigation module 404.
The behavior sequence generation module 401 is configured to convert the audit log into a behavior dependency graph, and generate a behavior sequence from the behavior dependency graph;
the language model pre-training module 402 is configured to tokenize the behavior sequence, and learn and represent words in the behavior sequence by using a self-supervised learning manner to obtain a vectorized representation of the words;
the fine tuning pre-training language model module 403 is configured to learn sentence-level features of the attack behavior sequence and the normal behavior sequence, and update parameters of the model to obtain a model capable of implementing the task;
The attack investigation module 404 is configured to classify an unknown behavior sequence, determine whether it is an attack behavior sequence, and complete the attack investigation based on the classification result.
Embodiments of the present disclosure also provide an electronic device comprising a memory and a processor, the memory further storing computer instructions executable by the processor, the computer instructions, when executed, implementing the processing method according to any one of claims 1 to 6.
Embodiments of the present disclosure also provide a computer readable storage medium, characterized in that the computer readable storage medium stores computer instructions, which when run on a computer, implement the processing method according to any one of claims 1 to 6.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the present application may be implemented in various computer languages, for example the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. An attack investigation method based on a behavior sequence and a language model is characterized by comprising the following steps:
extracting a behavior dependency graph from an audit log and generating a behavior sequence: traversing the behavior dependency graph from a behavior event as the starting point to generate a behavior dependency subgraph, the behavior dependency subgraph representing the context information of the behavior event, converting the behavior dependency subgraph into sequence form in time order, and performing lemmatization on the entities in the sequence to generate the behavior sequence;
constructing a Transformer-based deep bidirectional pre-trained language model, tokenizing the behavior sequence, designing a pre-training task, and inputting the tokenized behavior sequence into the pre-training model to obtain vector representations of the words in the sequence;
fine-tuning the pre-trained language model: sampling and labeling behavior sequences to obtain an attack behavior sequence classification training data set, and inputting the vector representations of the words and the training data set into the pre-trained language model for training to obtain a fine-tuned attack behavior sequence classification model;
judging an unknown event: extracting a behavior dependency subgraph from the audit log according to the unknown event, generating an unknown behavior sequence, and inputting it into the attack behavior sequence classification model to obtain the model's classification of the unknown sequence, the attack investigation being determined according to the classification.
2. The method of claim 1, wherein extracting the behavior dependency graph from the audit log and generating the behavior sequence comprises:
extracting a behavior dependency graph from an audit log, wherein nodes represent entities, the types of the entities are processes, files and IP addresses, and directed edges among the entities represent operations among the entities;
taking an attack behavior event and a normal behavior event as starting points, performing forward and backward traversal in a behavior dependency graph to extract a behavior dependency subgraph, wherein a depth-first traversal suspension condition for the attack behavior event is a normal entity, and a depth-first traversal suspension condition for the normal behavior event is an attack entity;
optimizing the behavior dependent subgraph, setting a time window T, merging the behaviors of a certain entity in the T for executing a plurality of operations on another entity, searching the behaviors of the certain entity in the subgraph for executing the same operations on a plurality of similar entities, and merging the similar entities;
converting the subgraph into a behavior sequence Seq = {Event_1, Event_2, …, Event_n} in time order, wherein Event_i = (E_i, Action_i, E_j) represents a behavior event, E_i represents an entity, and Action_i represents the behavior between entities, and lemmatizing the entities in the sequence as E_i → E'_i,
wherein E'_i preserves the basic characteristics of the entity, including the file type and process name.
3. The attack investigation method based on a behavior sequence and a language model according to claim 1, characterized in that the tokenizing the behavior sequence, designing a pre-training task, comprises:
tokenizing all behavior sequences, and adding special tokens [ CLS ], [ SEP ] and [ PAD ] to construct a token dictionary, wherein [ CLS ] and [ SEP ] are used for representing the beginning and the end of the sequences, and [ PAD ] is used for filling shorter sequences into fixed lengths;
designing the pre-training task: randomly selecting 15% of the non-special tokens in a sequence, replacing 90% of them with the mask token [MASK] and the remaining 10% with random tokens, inputting the tokenized sequence and vectors representing position information into the model, and training the model to predict the tokens at the covered positions based on the context.
4. The method of claim 1, wherein fine tuning the pre-trained language model comprises:
marking sample data as an attack behavior sequence or a normal behavior sequence, carrying out data enhancement on the attack behavior sequence in the behavior sequence, and carrying out random sampling on the normal behavior sequence to obtain a training data set;
and inputting the vectorized representations of the words and the training data set into the model for training, and adjusting the model parameters during training so that the model learns sequence-level characteristics, obtaining the fine-tuned attack sequence classification model.
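The training-set construction in claim 4 (augmenting the scarce attack sequences, randomly sampling the normal ones) can be sketched as follows. The augmentation shown — oversampling with single-token dropout — and the 1:2 class ratio are assumptions; the patent does not fix these details here.

```python
import random

# Sketch of claim 4's data preparation: attack sequences are augmented by
# oversampling with light token dropout (an assumed technique), and normal
# sequences are randomly down-sampled. Labels: 1 = attack, 0 = normal.
def build_training_set(attack_seqs, normal_seqs, rng=random):
    augmented = list(attack_seqs)
    while len(augmented) < len(normal_seqs) // 2:
        seq = rng.choice(attack_seqs)
        i = rng.randrange(len(seq))
        augmented.append(seq[:i] + seq[i + 1:])  # drop one token as augmentation
    sampled = rng.sample(normal_seqs, min(len(normal_seqs), 2 * len(augmented)))
    data = [(s, 1) for s in augmented] + [(s, 0) for s in sampled]
    rng.shuffle(data)
    return data
```

Balancing the classes this way matters because audit logs are overwhelmingly benign, and an unbalanced set would bias the fine-tuned classifier toward the "normal" label.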
5. The attack investigation method based on a behavior sequence and a language model according to claim 1, wherein the judging of the unknown event comprises:
based on the steps of extracting a behavior dependency graph from the audit log and generating a behavior sequence as claimed in claim 2, extracting the behavior dependency subgraph corresponding to an unknown behavior event and converting it into an unknown behavior sequence;
based on the tokenizing step of claim 3, tokenizing the unknown behavior sequence according to the constructed token dictionary, inputting the tokenized unknown behavior sequence into the fine-tuned model of claim 4 to obtain a classification result for the unknown sequence, and completing attack investigation of the unknown event according to the classification result.
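End to end, the judging step of claim 5 amounts to encoding the unknown sequence with the existing dictionary and querying the fine-tuned classifier. In this sketch the classifier is a stand-in callable, and mapping out-of-vocabulary tokens to [MASK] is an assumption, not the patent's stated policy.

```python
# Sketch of claim 5's inference path: encode the unknown behavior sequence
# with the pre-built dictionary and hand it to the fine-tuned classifier.
# `classify` stands in for the fine-tuned language model (1 = attack).
def investigate(seq, vocab, classify, max_len=16):
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(tok, vocab["[MASK]"]) for tok in seq]  # OOV -> [MASK]
    ids.append(vocab["[SEP]"])
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return "attack" if classify(ids[:max_len]) == 1 else "normal"
```

In a deployment, `classify` would be the fine-tuned model's forward pass followed by an argmax over the two sequence-level labels.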
6. An attack investigation apparatus based on a behavior sequence and a language model, comprising:
the behavior sequence generation module is used for converting the audit log into a behavior dependency graph and generating a behavior sequence from the behavior dependency graph;
the language model pre-training module is used for tokenizing the behavior sequence, and learning and representing words in the behavior sequence by utilizing a self-supervision learning mode to obtain a vectorization representation form of the words;
the fine tuning pre-training language model module is used for learning sentence-level characteristics of the attack behavior sequence and the normal behavior sequence, and updating parameters of the model to obtain a model capable of realizing the classification task of the attack behavior sequence;
and the attack investigation module is used for classifying the behavior sequences generated by the unknown event, judging whether the behavior sequences are attack behavior sequences or not, and determining attack investigation according to the judgment.
7. The attack investigation apparatus based on a behavior sequence and language model according to claim 6, wherein the behavior sequence generation module comprises:
the behavior dependency graph construction module is used for constructing the audit log into a behavior dependency graph, extracting the system-layer entity information and the relationships between entities contained in the audit log, and representing them in the form of a directed labeled graph;
the behavior dependent sub-graph construction module is used for generating behavior dependent sub-graphs of attack events and normal events from the behavior dependent graph and optimizing the behavior dependent sub-graphs;
the behavior sequence generation module is used for converting the behavior dependency subgraph into a behavior sequence according to the time order in which the events occur, and performing morphological reduction on the entities in the behavior sequence.
8. The attack investigation apparatus of claim 6, wherein the language model pre-training module comprises:
the behavior sequence tokenizing module is used for tokenizing words in the behavior sequence, constructing a dictionary, adding special tokens and setting all sequences to be of fixed length;
the pre-training task module is used for masking part of the words in the behavior sequence and replacing a portion of them with other tokens;
and the training module is used for inputting the behavior sequence into the pre-training model for training, capturing the contextual characteristics of the entity and the behavior, and obtaining the embedded representation of the entity and the behavior.
9. An electronic device comprising a memory and a processor, the memory storing computer instructions executable by the processor, the computer instructions, when executed, implementing the processing method of any one of claims 1 to 5.
10. A computer readable storage medium storing computer instructions which, when run on a computer, implement the processing method of any one of claims 1 to 5.
CN202210909030.9A 2022-07-29 2022-07-29 Attack investigation method based on behavior sequence and language model Pending CN117540367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210909030.9A CN117540367A (en) 2022-07-29 2022-07-29 Attack investigation method based on behavior sequence and language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210909030.9A CN117540367A (en) 2022-07-29 2022-07-29 Attack investigation method based on behavior sequence and language model

Publications (1)

Publication Number Publication Date
CN117540367A true CN117540367A (en) 2024-02-09

Family

ID=89782819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210909030.9A Pending CN117540367A (en) 2022-07-29 2022-07-29 Attack investigation method based on behavior sequence and language model

Country Status (1)

Country Link
CN (1) CN117540367A (en)

Similar Documents

Publication Publication Date Title
Uwagbole et al. Applied machine learning predictive analytics to SQL injection attack detection and prevention
Sun et al. Detecting anomalous user behavior using an extended isolation forest algorithm: an enterprise case study
US11423146B2 (en) Provenance-based threat detection tools and stealthy malware detection
JP7436501B2 (en) Inferring temporal relationships about cybersecurity events
AU2016204068B2 (en) Data acceleration
Ren et al. CSKG4APT: A cybersecurity knowledge graph for advanced persistent threat organization attribution
US20240129327A1 (en) Context informed abnormal endpoint behavior detection
Ullah et al. Clone detection in 5G-enabled social IoT system using graph semantics and deep learning model
Lin et al. Collaborative alert ranking for anomaly detection
Strüder et al. Feature-oriented defect prediction
CN103679034B (en) A kind of computer virus analytic system based on body and feature extracting method thereof
Dewanje et al. A new malware detection model using emerging machine learning algorithms
Yesir et al. Malware detection and classification using fastText and BERT
CN114662096A (en) Threat hunting method based on graph kernel clustering
Boot Applying supervised learning on malware authorship attribution
Studiawan et al. Automatic graph-based clustering for security logs
CN117540367A (en) Attack investigation method based on behavior sequence and language model
Gaykar et al. A Hybrid Supervised Learning Approach for Detection and Mitigation of Job Failure with Virtual Machines in Distributed Environments.
Luh et al. Advanced threat intelligence: detection and classification of anomalous behavior in system processes
Amjad et al. A multi-classifier framework for open source malware forensics
Tu et al. Detecting malware based on dynamic analysis techniques using deep graph learning
Mamun et al. TapTree: Process-Tree Based Host Behavior Modeling and Threat Detection Framework via Sequential Pattern Mining
Ding et al. A Cyber-Attack Behavior Detection Model Based on Log Activity Graph
Zhu et al. Nip in the Bud: Forecasting and Interpreting Post-exploitation Attacks in Real-time through Cyber Threat Intelligence Reports
Yang et al. Bayesian Networks for Interpretable Cyberattack Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination