CN115859277A - Host intrusion detection method based on system call sequence

Info

Publication number: CN115859277A
Application number: CN202310072261.3A
Authority: CN
Inventors: 李涛; 唐聪; 何俊江; 兰小龙; 方文波; 陈姿妤
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2023-02-07
Filing date: 2023-02-07
Publication date: 2023-03-28
Anticipated expiration: 2043-02-07
Also published as: CN115859277B

Abstract

The invention discloses a host intrusion detection method based on system call sequences, which relates to the technical field of computer security and comprises the following steps: S1: capturing system call information and dividing it into a plurality of system call sequences; S2: defining the abnormal activity trajectory represented by an abnormal sequence; S3: storing the mapping relations between features of different granularities; S4: converting the relational map into an abstract behavior tree; S5: pruning the abstract behavior tree; S6: converting the captured system call sequence into a leaf node sequence and extracting features from the new leaf node sequence; S7: performing feature dimensionality reduction on the extracted feature vectors; S8: taking the dimension-reduced feature vectors as the input of a machine learning model and classifying the corresponding leaf node sequences as abnormal or normal. The invention addresses the excessive vector dimensionality and excessive time consumption of feature extraction in the prior art, and can reduce the hardware cost required for deployment on a host.

Description

Host intrusion detection method based on system call sequence

Technical Field

The invention relates to the technical field of computer security, in particular to a host intrusion detection method based on a system call sequence.

Background

Existing Intrusion Detection Systems (IDS) can be classified into Network-based Intrusion Detection Systems (NIDS) and Host-based Intrusion Detection Systems (HIDS). NIDS are typically deployed at backbone network nodes and identify network intrusion events by inspecting network traffic, while HIDS are deployed on hosts and monitor various host data, such as logs, directories, files, and registries, to detect and prevent malicious activity. In contrast to NIDS, HIDS can detect internal attacks and Advanced Persistent Threats (APT), and can therefore be considered the last line of defense for securing network assets.

At present, host intrusion detection systems based on machine learning or deep learning offer the best performance and are the most widely applied. Good features are essential for detection. Because system call information is the most primitive and finest-grained information an operating system provides, the system call sequence has become the feature most widely and frequently used by HIDS to construct an intrusion detection engine.

Feature extraction is a key task of intrusion detection systems, but because the operation is itself very time consuming, some attacks may already have been carried out before feature selection and extraction are complete. At present, typical feature extraction methods used to construct a system-call-based intrusion detection engine include the N-gram sliding window, TF-IDF (Term Frequency-Inverse Document Frequency) and the window frequency method (which combines N-gram and TF-IDF). N-gram scans the whole system call sequence and extracts segments of N consecutive system calls from it. It retains the order information of system call execution, but does not consider how important the different extracted features are for distinguishing intrusions. In contrast, the TF-IDF method can distinguish the importance of different features, but cannot preserve the order information of the system calls. The window frequency method combines the advantages of N-gram and TF-IDF and compensates for the shortcomings of each.

The flow of the window frequency method is shown in FIG. 2, and the specific steps are as follows:

A. Capture system call information from the system log and divide it into system call sequences S1, S2, ..., Si of different lengths to facilitate subsequent data processing.

B. Label each of the system call sequences S1, S2, ..., Si as normal or abnormal to facilitate the subsequent construction of a machine learning intrusion detection engine.

C. Extract features from the system call sequences using the window frequency method, converting each system call sequence into a feature vector suitable as input to a machine learning intrusion detection engine. First, N-gram takes feature segments of fixed length N from the system call sequence (for example, if the fixed length is 3, the feature segments of S1 are [4 168 42, ..., 168 4, 168 4 240]); then the TF-IDF method assigns a weight to each extracted feature segment (for example, the weight of "4 168 42" is 0.01045553). In this way, different system call sequences are converted into vector representations suitable for the intrusion detection engine.

D. Feed the vector representations of the system call sequences and the corresponding classification labels into a machine learning model for training; training on a large amount of system call sequence data yields a machine learning engine for intrusion detection.
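Steps A to C can be sketched roughly as follows. This is a minimal illustration, not the reference implementation: the function names `ngrams` and `window_frequency_vectors` are hypothetical, the input sequences are made up, and the TF-IDF weighting (term frequency within a sequence multiplied by the logarithm of the number of sequences over one plus the document frequency) is an assumed common variant, since the background does not state the exact formula.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Return all contiguous length-n windows of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def window_frequency_vectors(sequences, n):
    """Turn each sequence into a vector of TF-IDF weighted n-gram features."""
    grams_per_seq = [Counter(ngrams(s, n)) for s in sequences]
    # Shared vocabulary: every distinct n-gram seen in any sequence.
    vocab = sorted({g for counts in grams_per_seq for g in counts})
    # Document frequency: number of sequences that contain each n-gram.
    df = Counter(g for counts in grams_per_seq for g in counts)
    total_docs = len(sequences)
    vectors = []
    for counts in grams_per_seq:
        total = sum(counts.values())
        vec = []
        for g in vocab:
            tf = counts[g] / total if total else 0.0
            idf = math.log(total_docs / (1 + df[g]))  # assumed smoothing: +1 in the denominator
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

# Made-up system call sequences, used only to exercise the sketch.
sequences = [[4, 168, 42, 168, 4, 240], [4, 168, 42, 4, 168, 42], [3, 3, 3, 4]]
vocab, vectors = window_frequency_vectors(sequences, 3)
```

Each row of `vectors` has one entry per distinct 3-gram in `vocab`, which is why the vector dimension grows with the number of distinct feature segments.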

However, the existing window frequency method extracts features directly from the original system call sequence. To supply the detection engine with feature segments of different lengths, several different fixed lengths must be set when capturing the segments, which causes the number of feature segments to grow exponentially. This in turn makes the dimensionality of the extracted feature vectors too high and the feature extraction time too long, so an intrusion detection engine constructed by this method consumes a large amount of storage and computing resources.

Disclosure of Invention

The invention aims to solve the problems of excessive vector dimensionality and excessive time consumption during feature extraction in the prior art, and to reduce the hardware cost required for deployment on a host, by providing a host intrusion detection method and device based on system call sequences.

In a first aspect, the present invention provides an intrusion detection method based on system calls, which comprises a system call feature extraction stage and a leaf node sequence detection stage.

The system call feature extraction stage comprises the following steps:

S1, capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking corresponding sequence labels;

S2, defining the abnormal activity trajectory represented by an abnormal sequence through feature representations of different granularities;

S3, storing the mapping relations among the features of different granularities using a relational map;

S4, converting the relational map into an abstract behavior tree;

S5, pruning the abstract behavior tree, and storing the structure of the pruned abstract behavior tree;

The leaf node sequence detection stage comprises the following steps:

S6, performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence using the window frequency method;

S7, performing feature dimensionality reduction on the extracted feature vectors;

S8, taking the dimension-reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences as abnormal or normal.

Optionally, in step S2, the granularity feature representations include:

an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained and high-level coarse-grained, respectively.

Optionally, in step S3, storing the mapping relationship between the features of different granularities by using a relationship map includes:

the mapping relation between the original system call and the system behavior is many-to-one, the mapping relation between the system behavior and the system kernel module is many-to-one, and the mapping relation between the characteristics with different granularities is stored through a relational mapping graph;

optionally, in step S4, converting the relational map into the abstract behavior tree includes:

and converting the graph storage mode of the relational mapping graph into a tree storage mode, and storing the relational mapping graph by using an abstract tree structure.

Optionally, in step S5, pruning the abstract behavior tree, and storing the pruned abstract behavior tree structure includes:

and selecting to cut off different leaf nodes each time, measuring the pruning effect through the accuracy of the model, and when the accuracy reaches a certain preset threshold, considering that the current abstract behavior tree meets the characteristic extraction requirement and storing the current abstract behavior tree structure.

Optionally, in step S7, performing feature dimension reduction on the extracted feature vector includes:

and performing feature dimensionality reduction on the extracted feature vectors through singular value decomposition.

Optionally, in step S8, the feature vector after the dimension reduction is used as an input of a machine learning model, and the corresponding leaf node sequence is divided into an abnormal type and a normal type, including:

dividing the dimension-reduced feature vectors and classification labels into a training set, a test set and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and adjust the hyper-parameters of the model, and the validation set is used to verify the generalization ability of the model; and selecting different machine learning algorithm models, performing parameter selection, and evaluating the model effect using cross validation.

In a second aspect, the present invention provides an intrusion detection device based on system calls, which comprises a system call feature extraction unit and a leaf node sequence detection unit.

The system call feature extraction unit includes:

a capturing unit, used for capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking corresponding sequence labels;

the granularity unit is used for defining an abnormal activity track represented by the abnormal sequence through different granularity characteristic representation modes;

a mapping unit, configured to store a mapping relationship between the features of different granularities by using a relationship map;

the tree conversion unit is used for converting the relational mapping chart into an abstract behavior tree;

the pruning unit is used for pruning the abstract behavior tree and storing the structure of the abstracted behavior tree after pruning;

the leaf node sequence detection unit includes:

the leaf conversion unit is used for mapping leaf nodes through an abstract behavior tree, converting the captured system calling sequence into a leaf node sequence, and extracting characteristics from the new leaf node sequence by using a window frequency method;

the dimensionality reduction unit is used for performing characteristic dimensionality reduction on the extracted characteristic vector;

and the output unit is used for taking the feature vector after the dimension reduction as the input of the machine learning model and dividing the corresponding leaf node sequence into an abnormal leaf node sequence and a normal leaf node sequence.

Optionally, in the granularity unit, the granularity feature representations include:

an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained and high-level coarse-grained, respectively.

Optionally, the mapping unit, configured to store the mapping relationship between the features of different granularities by using a relational map, includes:

optionally, the tree converting unit, configured to convert the relational map into the abstract behavior tree, includes:

Optionally, the pruning unit is configured to prune the abstract behavior tree, and the storing the pruned abstract behavior tree structure includes:

and selecting to cut different leaf nodes each time, measuring the pruning effect through the accuracy of the model, and when the accuracy reaches a preset threshold, considering that the current abstract behavior tree meets the characteristic extraction requirement and storing the current abstract behavior tree structure.

Optionally, the dimension reduction unit, configured to perform feature dimension reduction on the extracted feature vector, includes:

Optionally, the output unit is configured to use the feature vector after the dimension reduction as an input of a machine learning model, and divide the corresponding leaf node sequence into an abnormal type and a normal type, where the method includes:

Compared with the prior art, the technical scheme of the invention has the following advantages:

the number of the feature fragments generated by feature extraction is reduced, the dimensionality of the feature vectors generated by the feature fragments is reduced, and the time overhead of feature extraction is reduced.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of a window frequency method;

FIG. 3 is a mapping relationship between three granularity characterization manners;

fig. 4 is a schematic diagram of pruning an abstract behavior tree.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below by referring to the accompanying drawings and examples.

Example 1

As shown in FIG. 1, the present invention provides a host intrusion detection method based on a system call sequence, which mainly comprises a system call feature extraction stage and a leaf node sequence detection stage.

The system calling feature extraction stage is as follows:

S1, capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking corresponding sequence labels.

In the original system call information capturing stage, system call information is captured by a dedicated system call capturing program; the captured system call log information is segmented into a number of system call sequences, each labeled with its corresponding category.

Different operating systems provide a process tracing system call interface through which process tracing can be implemented. With it, a parent process can control a child process and change the child's core image, including reading and writing data in the child's address space. The basic principle of capturing original system call information is that, while process tracing is in use, all signals sent to the traced child process are forwarded to the parent process and the child process is blocked. After receiving a signal, the parent process can inspect and modify the stopped child process and then allow it to continue running. System call information can be captured in this way; it is then processed into a plurality of original system call sequences, which are marked with corresponding sequence labels to facilitate subsequent feature extraction and model training.
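The segmentation and labeling part of S1 can be sketched as below. This is only an illustration under assumptions: captured records are taken to be `(pid, syscall_number)` pairs, one sequence is formed per process, and the label encoding (1 for abnormal, 0 for normal) and all names are hypothetical. The capture itself (process tracing) is not shown.

```python
from collections import defaultdict

def split_into_sequences(records):
    """Group (pid, syscall_number) records into one ordered sequence per process."""
    per_process = defaultdict(list)
    for pid, syscall_number in records:
        per_process[pid].append(syscall_number)
    return dict(per_process)

def label_sequences(sequences, malicious_pids):
    """Attach a label to each sequence: 1 = abnormal, 0 = normal (assumed encoding)."""
    return [(seq, 1 if pid in malicious_pids else 0) for pid, seq in sequences.items()]

# Made-up capture log: interleaved records from two traced processes.
records = [(101, 5), (202, 3), (101, 125), (202, 4), (101, 6)]
sequences = split_into_sequences(records)
labeled = label_sequences(sequences, malicious_pids={202})
```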

And S2, defining the abnormal activity track represented by the abnormal sequence in different granularity characteristic characterization modes.

Optionally, the granularity feature representations include: an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained and high-level coarse-grained, respectively.

As shown in FIG. 3, the representations include an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained and high-level coarse-grained, respectively. The abnormal activity trajectory represented by an abnormal sequence is defined through these three representations of different granularity. According to the behavior expressed by the system call sequence, more than one hundred system call numbers (fine granularity) are mapped to more than seventy system behaviors (low-level coarse granularity); according to the function of each system behavior, the more than seventy system behavior interfaces are mapped to seven kernel function sub-modules (high-level coarse granularity). Each original system call sequence can therefore be converted into a system behavior sequence and a system kernel module sequence under the different granularity representations.
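The two many-to-one mappings can be pictured with a pair of lookup tables. The entries below are hypothetical excerpts chosen for illustration (the names loosely follow the leaf node names that appear in the worked example later in the description); the actual tables covering every system call number are not given in this text.

```python
# Hypothetical excerpts of the two many-to-one mappings (illustration only).
SYSCALL_TO_BEHAVIOR = {
    5: "fs-xattr",
    6: "fs-xattr",
    3: "io-rw",
    4: "io-rw",
    125: "kernel-sched",
}
BEHAVIOR_TO_MODULE = {
    "fs-xattr": "fs",
    "io-rw": "io",
    "kernel-sched": "kernel",
}

def to_behavior_sequence(syscalls):
    """Fine-grained sequence -> low-level coarse-grained sequence."""
    return [SYSCALL_TO_BEHAVIOR[s] for s in syscalls]

def to_module_sequence(syscalls):
    """Fine-grained sequence -> high-level coarse-grained sequence."""
    return [BEHAVIOR_TO_MODULE[SYSCALL_TO_BEHAVIOR[s]] for s in syscalls]

behaviors = to_behavior_sequence([5, 125, 6, 5, 3])
modules = to_module_sequence([5, 125, 6, 5, 3])
```

Because several system calls share one behavior and several behaviors share one module, the coarser sequences draw on a smaller alphabet, which is what later limits the number of distinct feature segments.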

And S3, storing the mapping relation between the features with different granularities by using a relational mapping chart.

Optionally, storing the mapping relationship between the features of different granularities by using a relationship map comprises: the mapping relation between the original system call and the system behavior is many-to-one, the mapping relation between the system behavior and the system kernel module is many-to-one, and the mapping relation between the characteristics with different granularities is stored through a relational mapping graph.

As shown in FIG. 3, a specific mapping relationship exists among the three granularity representations, and it is expressed using a relational map. Intrusion behavior can be detected from the original system call sequence, the system behavior sequence, or the system kernel module sequence. The original system call sequence gives the best detection effect, the system behavior sequence a lower one, and the system kernel module sequence the poorest, while the time overhead and performance overhead of the three decrease in the same order. To balance detection effect against resource overhead, a multi-granularity mixed sequence is constructed from the three sequences.

And S4, converting the relational mapping graph into an abstract behavior tree.

Optionally, converting the relational map into the abstract behavior tree comprises: and converting the graph storage mode of the relational mapping graph into a tree storage mode, and storing the relational mapping graph by using an abstract tree structure.

The mapping relation of the relational mapping graph is many-to-one, is the same as the node relation of the tree structure, and the mapping relation (named as an abstract behavior tree) is stored through the tree structure so as to be convenient for subsequent adjustment of the tree structure.
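One way to picture the conversion is to store the many-to-one mappings as child-to-parent links. The sketch below assumes a three-level tree (kernel module, system behavior, system call) under a single root; the node encoding as `(kind, name)` tuples and the function names are hypothetical choices, not taken from the patent.

```python
def build_abstract_behavior_tree(syscall_to_behavior, behavior_to_module):
    """Return a child -> parent dictionary; the root is the string 'root'."""
    parent = {}
    for module in set(behavior_to_module.values()):
        parent[("module", module)] = "root"
    for behavior, module in behavior_to_module.items():
        parent[("behavior", behavior)] = ("module", module)
    for syscall, behavior in syscall_to_behavior.items():
        parent[("syscall", syscall)] = ("behavior", behavior)
    return parent

def leaf_nodes(parent):
    """Leaves are the nodes that never appear as somebody's parent."""
    internal = set(parent.values())
    return {node for node in parent if node not in internal}

# Hypothetical mapping excerpts, for illustration only.
tree = build_abstract_behavior_tree(
    {5: "fs-xattr", 6: "fs-xattr", 3: "io-rw"},
    {"fs-xattr": "fs", "io-rw": "io"},
)
leaves = leaf_nodes(tree)
```

Before any pruning, every leaf is a system call node, which matches the statement that the leaves of the initial tree consist of fine-grained features.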

And S5, pruning the abstract behavior tree and storing the structure of the abstracted behavior tree after pruning.

Optionally, pruning the abstract behavior tree, and storing the pruned abstract behavior tree structure includes: and selecting to cut different leaf nodes each time, measuring the pruning effect through the accuracy of the model, and when the accuracy reaches a preset threshold, considering that the current abstract behavior tree meets the characteristic extraction requirement and storing the current abstract behavior tree structure.

As shown in FIG. 4, the leaf nodes of the initial abstract behavior tree consist of fine-grained features. The tree is pruned by cutting different leaf nodes in each round, after which its leaf nodes are represented by features of different granularities (fine-grained, low-level coarse-grained and high-level coarse-grained). After several rounds of pruning, the leaf nodes that finally remain consist of system call nodes, system behavior nodes and system kernel module nodes. Each system call number corresponds to one leaf node of the abstract behavior tree, so every system call sequence composed of system call numbers can be converted into a new leaf node sequence through the tree.
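The pruning and the leaf node mapping can be sketched as follows. This is one possible reading, stated as an assumption: pruning at a node removes all of its descendants so that the node itself becomes a leaf, and a system call is mapped to its nearest ancestor that survives in the pruned tree. The tiny tree, the `evaluate` callback (standing in for training a model and measuring its accuracy) and all names are hypothetical.

```python
# Child -> parent links of a tiny hypothetical abstract behavior tree.
FULL_TREE = {
    ("module", "fs"): "root",
    ("module", "io"): "root",
    ("behavior", "fs-xattr"): ("module", "fs"),
    ("behavior", "io-rw"): ("module", "io"),
    ("syscall", 5): ("behavior", "fs-xattr"),
    ("syscall", 6): ("behavior", "fs-xattr"),
    ("syscall", 3): ("behavior", "io-rw"),
    ("syscall", 4): ("behavior", "io-rw"),
}

def prune(tree, node):
    """Return a copy of tree in which every descendant of node is removed."""
    pruned = dict(tree)
    pending = [child for child, parent in tree.items() if parent == node]
    while pending:
        child = pending.pop()
        pending.extend(c for c, parent in tree.items() if parent == child)
        pruned.pop(child, None)
    return pruned

def leaf_for(syscall, full_tree, pruned_tree):
    """Walk up from the system call node until a node kept by the pruned tree is found."""
    node = ("syscall", syscall)
    while node not in pruned_tree:
        node = full_tree[node]
    return node

def to_leaf_sequence(syscalls, full_tree, pruned_tree):
    return [leaf_for(s, full_tree, pruned_tree) for s in syscalls]

def prune_until(full_tree, candidates, evaluate, threshold):
    """Prune candidate nodes one by one until evaluate(tree) reaches the threshold."""
    tree = full_tree
    for node in candidates:
        tree = prune(tree, node)
        if evaluate(tree) >= threshold:
            break
    return tree

pruned = prune(FULL_TREE, ("module", "io"))
leaf_sequence = to_leaf_sequence([5, 3, 6, 4], FULL_TREE, pruned)
```

In this example the two I/O system calls collapse into a single module-level leaf, while the file system calls stay fine-grained, giving a mixed-granularity leaf node sequence.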

The leaf node sequence detection stage comprises the following steps:

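S6, performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence using the window frequency method.

The window frequency method extracts features from the new leaf node sequence. Compared with the original sequence, the leaf node sequence retains the information contained in the original sequence while greatly reducing the vector dimension generated during feature extraction, and markedly reduces the time overhead and computation overhead.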

And S7, performing feature dimension reduction on the extracted feature vector.

Optionally, performing feature dimension reduction on the extracted feature vector through singular value decomposition;

The machine learning model places relatively high demands on the feature dimension, so the extracted feature vectors need to be reduced in dimension. Singular value decomposition is fast and effective, and can reduce the extracted feature vectors to a specified dimensionality.

S8, taking the dimension-reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences as abnormal or normal.

Optionally, the dimension-reduced feature vectors and classification labels are divided into a training set, a test set and a validation set; the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and adjust the hyper-parameters of the model, and the validation set is used to verify the generalization ability of the model, so that an efficient, low-cost intrusion detection engine is finally obtained.

The data is divided into a training set and a test set, different machine learning algorithm models are selected, parameter selection is performed, cross validation is used to evaluate the model effect, and the parameters are tuned iteratively to construct an intrusion detection engine with high accuracy and low cost. Finally, the engine is deployed on a host to perform intrusion detection on that host.
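The cross validation step can be sketched in plain Python as below. It is a generic k-fold scheme offered as an assumption; the description does not say which variant, fold count or models are used. The `fit` and `score` callbacks are hypothetical placeholders for training a classifier and measuring its effect.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

def cross_validate(features, labels, k, fit, score):
    """Average the score of models fitted on k-1 folds and evaluated on the remaining fold."""
    scores = []
    for train, held_out in k_fold_indices(len(features), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        scores.append(score(model,
                            [features[i] for i in held_out],
                            [labels[i] for i in held_out]))
    return sum(scores) / len(scores)

# Toy usage: a "model" that always predicts the majority training label.
def majority_fit(X, y):
    return max(set(y), key=y.count)

def accuracy(model, X, y):
    return sum(1 for target in y if target == model) / len(y)

mean_accuracy = cross_validate([[0.0]] * 10, [0] * 10, 5, majority_fit, accuracy)
```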

The following example of a system call sequence illustrates the principle and process by which the present invention reduces the number of feature segments.

Original system call sequence T: {5 125 6 5 3 6 91 4 78 78 125 122 192};

Take T as the whole corpus for feature extraction and extract features from it using the window frequency method; let the fixed length of the feature segment be K and the number of generated feature segments (the feature vector dimension) be N.

If K = 1, the extracted feature segments are:

[5], [125], [6], [3], [91], [4], [78], [122], [192]; N = T(1) = 9;

If K = 2, the extracted feature segments are:

[5 125], [125 6], [6 5], [5 3], [3 6], [6 91], [91 4], [4 78], [78 78], [78 125], [125 122], [122 192]; N = T(2) = 12;

If K = 3, the extracted feature segments are:

[5 125 6], [125 6 5], [6 5 3], …, [125 122 192]; N = T(3) = 12;

If K = 4, the extracted feature segments are:

[5 125 6 5], [125 6 5 3], [6 5 3 6], …, [78 125 122 192]; N = T(4) = 12;

If K = 5, the extracted feature segments are:

[5 125 6 5 3], [125 6 5 3 6], [6 5 3 6 91], …, [78 78 125 122 192]; N = T(5) = 11;

When K = 1-5, the number of feature segments generated is: T(1-5) = T(1) + T(2) + T(3) + T(4) + T(5) = 56;

When the method of the invention is adopted, the abstract behavior tree is used to convert the original system call sequence into the following leaf node sequence:

L: {fs-xattr kernel-sched fs-xattr fs-xattr io fs-xattr kernel-capability io fs-stat fs-stat fs-stat kernel-sched kernel-sched ipc-sem};

Take L as the whole corpus for feature extraction and extract features from it using the window frequency method; let the fixed length of the feature segment be K and the number of generated feature segments be N.

If K = 1, the extracted feature segments are:

[fs-xattr], [kernel-sched], [io], [kernel-capability], [fs-stat], [ipc-sem]; N = L(1) = 6;

If K = 2, the extracted feature segments are:

[fs-xattr kernel-sched], [kernel-sched fs-xattr], …, [kernel-sched ipc-sem]; N = L(2) = 12;

If K = 3, the extracted feature segments are:

[fs-xattr kernel-sched fs-xattr], …, [kernel-sched kernel-sched ipc-sem]; N = L(3) = 12;

If K = 4, the extracted feature segments are:

[fs-xattr kernel-sched fs-xattr fs-xattr], …, [fs-xattr fs-xattr kernel-sched ipc-sem]; N = L(4) = 11;

If K = 5, the extracted feature segments are:

[fs-xattr kernel-sched fs-xattr fs-xattr io], …, [fs-stat fs-stat kernel-sched kernel-sched ipc-sem]; N = L(5) = 10;

When K = 1-5, the number of feature segments generated is: L(1-5) = L(1) + L(2) + L(3) + L(4) + L(5) = 51;
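The counts for the leaf node sequence can be checked with a few lines of Python. The sketch assumes that N counts distinct feature segments of each length, which is the reading that reproduces the figures above; the helper name `distinct_ngrams` is hypothetical.

```python
def distinct_ngrams(seq, n):
    """Return the set of distinct contiguous length-n segments of seq."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

# Leaf node sequence L from the worked example.
L = ("fs-xattr kernel-sched fs-xattr fs-xattr io fs-xattr kernel-capability io "
     "fs-stat fs-stat fs-stat kernel-sched kernel-sched ipc-sem").split()

counts = {k: len(distinct_ngrams(L, k)) for k in range(1, 6)}
total = sum(counts.values())
# counts -> {1: 6, 2: 12, 3: 12, 4: 11, 5: 10}; total -> 51
```

The only repeated segment of length 2 is `fs-stat fs-stat`, which is why 13 windows yield 12 distinct segments.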

From the above, L(i) ≤ T(i) and L(i-j) ≤ T(i-j), where i and j are positive integers and L(i-j) and T(i-j) denote the totals over segment lengths i through j; this explains the principle by which the present invention is effective;

In a practical application scenario, the corpus is far larger than the one in this example: it consists of tens of thousands of original system call sequences, and the advantage of the method becomes more evident as the corpus grows;

Therefore, the ADFA-LD data set is used as the corpus, and the advantage of the method in reducing feature segments is evaluated on it;

The numbers of feature segments T(k) generated from the original system call sequences on the ADFA-LD corpus and L(k) generated from the leaf node sequences of the present invention are shown in Table 1 below:

TABLE 1

The feature fragment extraction method of the present invention is described above, and the vectorization method of the feature fragment of the present invention is explained next;

Define the corpus formed by the selected leaf node sequences as $D = \{d_1, d_2, \ldots, d_n\}$, where $n$ represents the number of leaf node sequences in the corpus.

Define the $i$-th leaf node sequence in the corpus as $d_i = (B_i, y_i)$, where $B_i$ represents the feature segments contained in the leaf node sequence and $y_i$ is the label corresponding to the leaf node sequence, indicating whether the sequence is a normal sequence or a malicious sequence.

The present invention uses the tf-idf technique to convert feature segments into vectors, giving them an input format suitable for various classifier models;

The tf-idf value of a feature segment is calculated as follows:

Term frequency $tf_{i,j}$: the calculation formula is shown in (1). $tf_{i,j}$ represents the frequency of the $i$-th feature segment $b_i$ of the $j$-th leaf node sequence, where $n_{i,j}$ represents the number of times feature segment $b_i$ appears in the entire corpus and $\sum_k n_{k,j}$ represents the total number of all feature segments contained in the corpus;

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (1)$$

Inverse document frequency $idf_i$: the calculation formula is shown in (2). $idf_i$ represents the inverse document frequency of the $i$-th feature segment $b_i$, where $|D|$ represents the total number of leaf node sequences contained in the corpus, $|\{j : b_i \in B_j\}|$ represents the number of leaf node sequences that contain feature segment $b_i$, and 1 is added to avoid a denominator of 0 (which occurs when a feature segment in the test set does not appear in the corpus formed by the training set);

$$idf_i = \log \frac{|D|}{1 + |\{j : b_i \in B_j\}|} \qquad (2)$$

Therefore, the tf-idf value is calculated as shown in (3): the tf-idf of feature segment $b_i$ equals the product of its term frequency $tf_{i,j}$ and its inverse document frequency $idf_i$;

$$tfidf_{i,j} = tf_{i,j} \times idf_i \qquad (3)$$

Denoting the tf-idf value of feature segment $b_i$ as $w_i$, a leaf node sequence containing feature segments $b_1, b_2, \ldots, b_m$ is converted into the vector representation $(w_1, w_2, \ldots, w_m)$;

Generally, when the fixed length value of the feature segment is set to be larger or when a plurality of feature segments with different lengths are required to be included, the feature vector after conversion may have a higher dimension.

To reduce the dimension of the feature vectors more quickly, the invention adopts singular value decomposition (SVD) for dimensionality reduction, because SVD is computationally more efficient than principal component analysis.
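A common way to use SVD for this purpose is truncation to the leading singular directions. The formulation below is a sketch under assumptions: the symbols $X$, $U$, $\Sigma$, $V$ and the target dimension $k$ are introduced here for illustration and do not appear in the original text. $X \in \mathbb{R}^{n \times m}$ stacks the $n$ tf-idf vectors of dimension $m$ as rows.

```latex
X = U \Sigma V^{\top}, \qquad
X_k = X V_k = U_k \Sigma_k \in \mathbb{R}^{n \times k}
```

Here $V_k$ holds the first $k$ right singular vectors, $U_k$ the first $k$ left singular vectors, and $\Sigma_k$ the $k$ largest singular values. Each row of $X_k$ is the $k$-dimensional reduced feature vector of one leaf node sequence.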

The dimension-reduced feature vectors are then used as the input of various machine learning models (four machine learning classification models), and the corresponding leaf node sequences are finally classified as abnormal or normal.

Compared with the prior art, in which the relevant features are extracted directly from the original system call sequence by the window frequency method, the present method first maps the original system call sequence into a leaf node sequence and then extracts feature segments from the leaf node sequence. This markedly slows the growth in the number of feature segments, reduces the time consumed by feature extraction, and improves the accuracy of the constructed intrusion detection engine to a certain extent.

Performance was evaluated on the ADFA-LD data set. When the fixed length n of the feature segments extracted by the window frequency method is set to 3, the numbers of feature segments generated from the leaf node sequence and from the original system call sequence are given as 18316 and 8632 respectively, a relative difference of 112.19%. When the segment length is set to 1-5 (that is, counting all feature segments with fixed lengths of 1 through 5), the numbers of feature segments generated from the leaf node sequence and from the original system call sequence are 135485 and 160035 respectively, a reduction of 15.34% relative to the system call sequence.

Intrusion detection engines were then constructed with machine learning models such as SVM and evaluated on four indexes: precision, recall, F1 score and false alarm rate. With the feature segment length set to 1 through 5 respectively, 16 of the 20 resulting index values are superior to those obtained with the prior art. In addition, the average time of feature extraction is reduced by 1.026.s, 140.59, 140.s and 140.59.
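The percentages quoted for the feature segment counts can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical and the counts are the ones stated in the text. It suggests that 15.34% is a reduction measured against the larger count, while 112.19% matches the difference between 18316 and 8632 measured against the smaller count.

```python
def reduction_vs_larger(larger: int, smaller: int) -> float:
    """Percentage by which `smaller` falls below `larger` (relative to `larger`)."""
    return (larger - smaller) / larger * 100.0


def excess_vs_smaller(larger: int, smaller: int) -> float:
    """Percentage by which `larger` exceeds `smaller` (relative to `smaller`)."""
    return (larger - smaller) / smaller * 100.0


# Segment lengths 1-5: 160035 (original sequence) vs 135485 (leaf node sequence).
pct_lengths_1_to_5 = reduction_vs_larger(160035, 135485)

# Segment length 3: the two counts quoted in the text are 18316 and 8632.
pct_n3_vs_smaller = excess_vs_smaller(18316, 8632)
pct_n3_vs_larger = reduction_vs_larger(18316, 8632)

print(f"lengths 1-5, relative to larger count: {pct_lengths_1_to_5:.2f}%")
print(f"n = 3, relative to smaller count:      {pct_n3_vs_smaller:.2f}%")
print(f"n = 3, relative to larger count:       {pct_n3_vs_larger:.2f}%")
```

A reduction measured against the larger count can never exceed 100%, which is why the n = 3 figure only makes sense as a ratio to the smaller count.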

Compared with the prior art, the invention reduces the number of feature segments generated by feature extraction, the dimensionality of the feature vectors built from those segments, and the time overhead of feature extraction.
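The four evaluation indexes named in the performance evaluation (precision, recall, F1 score and false alarm rate) can be computed from a confusion matrix. The text does not give their formulas, so the sketch below assumes the conventional definitions, treats "abnormal" as the positive class, and uses made-up counts; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # abnormal sequences classified as abnormal
    fp: int  # normal sequences classified as abnormal (false alarms)
    tn: int  # normal sequences classified as normal
    fn: int  # abnormal sequences classified as normal (missed attacks)


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(c: ConfusionCounts) -> dict:
    """Compute the four indexes under their conventional definitions."""
    precision = _safe_div(c.tp, c.tp + c.fp)
    recall = _safe_div(c.tp, c.tp + c.fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    false_alarm_rate = _safe_div(c.fp, c.fp + c.tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_alarm_rate": false_alarm_rate,
    }


# Hypothetical counts, for illustration only (not results from the patent).
example = detection_metrics(ConfusionCounts(tp=80, fp=10, tn=190, fn=20))
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

For an intrusion detection engine, a lower false alarm rate is better, while the other three indexes are better when higher.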

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed.

Claims

1. A method for intrusion detection based on system calls, comprising the following steps:

1) System call feature extraction stage;

S1: capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking each sequence with its corresponding label;

S2: defining the abnormal activity tracks represented by abnormal sequences under feature characterization modes of different granularities;

S3: using a relational map to store the mapping relations between the features of different granularities;

S4: converting the relational map into an abstract behavior tree;

S5: pruning the abstract behavior tree, and storing the structure of the pruned abstract behavior tree;

2) Leaf node sequence detection stage;

S6: performing leaf node mapping through the abstract behavior tree to convert the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence by using a window frequency method;

S7: performing feature dimensionality reduction on the extracted feature vectors;

S8: taking the dimension-reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences into abnormal leaf node sequences and normal leaf node sequences.
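As a rough illustration of step S6, the sketch below maps a system call trace onto leaf node labels and then counts fixed-length windows. It assumes that the window frequency method means sliding a window of length n over the sequence and counting each distinct segment. The mapping table, the leaf names and the trace are hypothetical and are not taken from the patent.

```python
from collections import Counter

# Hypothetical many-to-one mapping from system calls to leaf nodes.
LEAF_OF = {
    "open": "file_io", "read": "file_io", "write": "file_io",
    "socket": "network", "connect": "network",
    "fork": "process", "execve": "process",
}


def to_leaf_sequence(syscalls, leaf_of, unknown="other"):
    """Replace every system call by its leaf node label."""
    return [leaf_of.get(name, unknown) for name in syscalls]


def window_frequencies(sequence, n):
    """Count every contiguous segment of length n (sliding window, stride 1)."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))


# Hypothetical trace used only to show the effect of the mapping.
trace = ["open", "read", "write", "open", "read", "socket", "connect", "write"]
leaf_trace = to_leaf_sequence(trace, LEAF_OF)

raw_features = window_frequencies(trace, 3)
leaf_features = window_frequencies(leaf_trace, 3)

print("distinct segments on raw trace: ", len(raw_features))
print("distinct segments on leaf trace:", len(leaf_features))
```

Because several system calls collapse into one leaf label, the leaf trace draws on a smaller alphabet and tends to produce fewer distinct segments, which is the effect the description attributes to the method.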

2. The method for intrusion detection based on system call according to claim 1, wherein the step S2: the granularity characteristic characterization mode comprises the following steps:

3. The method for intrusion detection based on system call as claimed in claim 2, wherein the step S3: using a relational map to store the mapping relationships between the features of different granularities, comprising:

the mapping relation between original system calls and system behaviors is many-to-one, the mapping relation between system behaviors and system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in the relational map.
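The two many-to-one mappings of this claim can be sketched as a pair of lookup tables. The example below is an assumption-laden illustration: the system call, behavior and module names are hypothetical, and the nested grouping at the end is only one convenient way to view the stored relations, not the claimed construction of the abstract behavior tree.

```python
from collections import defaultdict

# Hypothetical relational map: each key has exactly one value (many-to-one).
SYSCALL_TO_BEHAVIOR = {
    "open": "file_access",
    "read": "file_access",
    "connect": "net_connect",
    "sendto": "net_transfer",
    "recvfrom": "net_transfer",
}
BEHAVIOR_TO_MODULE = {
    "file_access": "fs",
    "net_connect": "net",
    "net_transfer": "net",
}


def module_of(syscall, to_behavior, to_module):
    """Compose the two mappings: system call -> behavior -> kernel module."""
    return to_module[to_behavior[syscall]]


def group_by_granularity(to_behavior, to_module):
    """Invert the mappings into module -> behavior -> [system calls]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for syscall, behavior in to_behavior.items():
        grouped[to_module[behavior]][behavior].append(syscall)
    return {module: dict(behaviors) for module, behaviors in grouped.items()}


hierarchy = group_by_granularity(SYSCALL_TO_BEHAVIOR, BEHAVIOR_TO_MODULE)
print(module_of("read", SYSCALL_TO_BEHAVIOR, BEHAVIOR_TO_MODULE))
print(hierarchy)
```

Storing each relation as a dictionary enforces the many-to-one property by construction, since a key cannot map to two values.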

4. The method for intrusion detection based on system call as claimed in claim 1, wherein the step S4: converting the relational map to an abstract behavior tree, comprising:

5. The method for intrusion detection based on system call as claimed in claim 1, wherein the step S5: pruning the abstract behavior tree, and storing the structure of the abstracted behavior tree after pruning, which comprises the following steps:

6. The method for intrusion detection based on system call as claimed in claim 1, wherein the step S7: and performing feature dimensionality reduction on the extracted feature vector, wherein the feature dimensionality reduction comprises the following steps:

7. The method for intrusion detection based on system call as claimed in claim 1, wherein the step S8: the feature vector after dimensionality reduction is used as the input of a machine learning model, and the corresponding leaf node sequence is divided into an abnormal type and a normal type, and the method comprises the following steps:

dividing the dimension-reduced feature vectors and their classification labels into a training set, a test set and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and adjust the hyper-parameters of the model, and the validation set is used to verify the generalization ability of the model; selecting different machine learning algorithm models and tuning their parameters; and evaluating the effect of each model by using cross validation.
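The data split and cross validation described in this claim can be sketched with the standard library alone. The 60/20/20 proportions, the random seed and the function names below are assumptions made for illustration; the claim does not specify them. The set names follow the wording of the claim.

```python
import random


def split_dataset(samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle and split samples into training, test and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, test, validation


def k_fold_indices(n_samples, k):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross validation."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out_id, held_out in enumerate(folds):
        train = [i for fold_id, fold in enumerate(folds)
                 if fold_id != held_out_id for i in fold]
        yield train, held_out


# Hypothetical (feature_vector, label) pairs standing in for real data.
dataset = [([float(i)], i % 2) for i in range(10)]
training, test, validation = split_dataset(dataset)
folds = list(k_fold_indices(len(training), k=3))

print(len(training), len(test), len(validation))
for train_idx, held_out_idx in folds:
    print(sorted(train_idx), held_out_idx)
```

Each fold holds out a disjoint part of the training data, so every training sample is used for evaluation exactly once across the k rounds.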