CN115859277A - Host intrusion detection method based on system call sequence - Google Patents

Host intrusion detection method based on system call sequence

Info

Publication number
CN115859277A
Authority
CN
China
Prior art keywords
sequence
leaf node
feature
system call
intrusion detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310072261.3A
Other languages
Chinese (zh)
Other versions
CN115859277B (en)
Inventor
李涛
唐聪
何俊江
兰小龙
方文波
陈姿妤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202310072261.3A
Publication of CN115859277A
Application granted
Publication of CN115859277B
Legal status: Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a host intrusion detection method based on a system call sequence, which relates to the technical field of computer security and comprises the following steps: S1: capturing system call information and dividing it into a plurality of system call sequences; S2: defining the abnormal activity track represented by an abnormal sequence; S3: storing the mapping relations between features of different granularities; S4: converting the relational map into an abstract behavior tree; S5: pruning the abstract behavior tree; S6: converting the captured system call sequence into a leaf node sequence and extracting features from the new leaf node sequence; S7: performing feature dimensionality reduction on the extracted feature vectors; S8: using the reduced feature vectors as the input of a machine learning model and classifying the corresponding leaf node sequences into abnormal leaf node sequences and normal leaf node sequences. The invention addresses the excessive vector dimensionality and long processing time of feature extraction in the prior art, and can reduce the hardware cost required for deployment on a host.

Description

Host intrusion detection method based on system call sequence
Technical Field
The invention relates to the technical field of computer security, in particular to a host intrusion detection method based on a system call sequence.
Background
Existing Intrusion Detection Systems (IDS) can be classified into Network-based Intrusion Detection Systems (NIDS) and Host-based Intrusion Detection Systems (HIDS). A NIDS is typically deployed at a backbone network node and identifies network intrusion events by inspecting network traffic, while a HIDS is deployed on a host and monitors host data such as logs, directories, files, and registries to detect and prevent malicious activity. Unlike NIDS, HIDS can detect internal attacks and Advanced Persistent Threats (APT), and can be regarded as the last line of defense for securing network assets.
At present, among host intrusion detection systems, those based on machine learning/deep learning perform best and are the most widely applied. Because system call information is the most primitive and finest-grained information of an operating system, the system call sequence has become the feature most widely and frequently used by HIDS to construct an intrusion detection engine.
Feature extraction is a key task of intrusion detection systems, but because the operation is itself very time-consuming, some attacks may already have been carried out before the feature selection/extraction task completes. Typical feature extraction methods for building a system-call-based intrusion detection engine include the N-gram sliding window, TF-IDF (Term Frequency-Inverse Document Frequency), and the window frequency method (which combines N-gram and TF-IDF). The N-gram method scans the whole system call sequence and extracts runs of N consecutive system calls, which retains the order information of system call execution but does not consider how important the different extracted features are for distinguishing intrusions. The TF-IDF method, in contrast, can distinguish the importance of different features but cannot preserve the order information of the system calls. The window frequency method combines the advantages of N-gram and TF-IDF and compensates for their respective shortcomings.
The flow of the window frequency method is shown in FIG. 2, and the specific steps are as follows:
A. Capture system call information from the system log and divide it into system call sequences S1, S2, ..., Si of different lengths to facilitate subsequent data processing.
B. Label each of the system call sequences S1, S2, ..., Si as normal or abnormal to facilitate construction of the subsequent machine learning intrusion detection engine.
C. Extract features from the system call sequences using the window frequency method, converting each system call sequence into a feature vector suitable as input to a machine learning intrusion detection engine. An N-gram first takes feature fragments of fixed length N from the system call sequence (for example, with a fixed length of 3, the feature fragments of S1 are [4 168 42, ..., 168 4 240]); the TF-IDF method then assigns a weight to each extracted feature fragment (for example, the weight of "4 168 42" is 0.01045553). In this way, different system call sequences are converted into vector representations suitable for the intrusion detection engine.
D. Feed the vector representations of the system call sequences and their classification labels into a machine learning model for training; training on a large amount of system call sequence data yields a machine learning engine for intrusion detection.
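Steps A to C can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy sequences, and the smoothed TF-IDF variant (logarithm of the number of sequences over one plus the document frequency) are hypothetical choices, not taken from the patent.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Return every contiguous window of length n as a tuple."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def window_frequency(sequences, n):
    """Return (vocabulary, vectors): one TF-IDF weighted vector per sequence."""
    fragments = [ngrams(seq, n) for seq in sequences]
    vocab = sorted({frag for frags in fragments for frag in frags})
    # Number of sequences in which each fragment occurs at least once.
    doc_freq = Counter(frag for frags in fragments for frag in set(frags))
    vectors = []
    for frags in fragments:
        counts = Counter(frags)
        total = max(len(frags), 1)  # guard against sequences shorter than n
        vectors.append([
            (counts[frag] / total) * math.log(len(sequences) / (1 + doc_freq[frag]))
            for frag in vocab
        ])
    return vocab, vectors

# Toy sequences of system call numbers (made up for illustration).
sequences = [[4, 168, 42, 168, 4], [4, 168, 42], [1, 2, 3], [5, 6, 7]]
vocab, vectors = window_frequency(sequences, n=3)
print(len(vocab), "fragments;", len(vectors), "vectors")
```

Each vector has one entry per distinct fragment in the corpus, so the vector dimension grows with the number of distinct fragments. With this particular smoothing, a fragment that occurs in every sequence receives a non-positive weight.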
However, the existing window frequency method extracts features directly from the original system call sequence. To give the detection engine feature fragments of different lengths, several different fixed lengths must be set when capturing fragments, which causes the number of feature fragments to grow exponentially. This in turn makes the dimensionality of the extracted feature vectors too high and the feature extraction time too long, so an intrusion detection engine built in this way consumes a large amount of storage and computing resources.
Disclosure of Invention
The invention aims to solve the problems of excessive vector dimensionality and excessive time consumption in the feature extraction process of the prior art and to reduce the hardware cost required for host deployment, and to this end provides a host intrusion detection method and device based on a system call sequence.
In a first aspect, the present invention provides an intrusion detection method based on system calls, comprising a system call feature extraction stage and a leaf node sequence detection stage.
The system call feature extraction stage comprises the following steps:
S1, capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking them with corresponding sequence labels;
S2, defining the abnormal activity track represented by an abnormal sequence through feature characterization modes of different granularities;
S3, storing the mapping relations among the features of different granularities by using a relational map;
S4, converting the relational map into an abstract behavior tree;
S5, pruning the abstract behavior tree, and storing the structure of the pruned abstract behavior tree;
The leaf node sequence detection stage comprises the following steps:
S6, performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence by using the window frequency method;
S7, performing feature dimensionality reduction on the extracted feature vectors;
S8, using the reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences into abnormal leaf node sequences and normal leaf node sequences.
Optionally, in step S2, the granularity feature characterization modes include:
an original system call sequence characterization mode, a system behavior characterization mode, and a system kernel module characterization mode, whose feature granularities are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively.
Optionally, in step S3, storing the mapping relations between the features of different granularities by using a relational map includes:
the mapping from original system calls to system behaviors is many-to-one, the mapping from system behaviors to system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in a relational map.
Optionally, in step S4, converting the relational map into the abstract behavior tree includes:
converting the graph storage form of the relational map into a tree storage form, and storing the relational map as an abstract tree structure.
Optionally, in step S5, pruning the abstract behavior tree and storing the pruned abstract behavior tree structure includes:
cutting off different leaf nodes in each round and measuring the pruning effect by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement and its structure is stored.
Optionally, in step S7, performing feature dimensionality reduction on the extracted feature vectors includes:
performing feature dimensionality reduction on the extracted feature vectors through singular value decomposition.
Optionally, in step S8, using the reduced feature vectors as the input of a machine learning model and classifying the corresponding leaf node sequences as abnormal or normal includes:
dividing the reduced feature vectors and their classification labels into a training set, a test set, and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and tune the hyper-parameters of the model, and the validation set is used to verify the generalization ability of the model; and selecting different machine learning algorithm models, performing parameter selection, and evaluating the model effect by cross validation.
In a second aspect, the present invention provides an intrusion detection apparatus based on system calls, comprising a system call feature extraction unit and a leaf node sequence detection unit.
The system call feature extraction unit includes:
a capturing unit, configured to capture system call information, divide the captured system call information into a plurality of system call sequences, and mark them with corresponding sequence labels;
a granularity unit, configured to define the abnormal activity track represented by an abnormal sequence through feature characterization modes of different granularities;
a mapping unit, configured to store the mapping relations between the features of different granularities by using a relational map;
a tree conversion unit, configured to convert the relational map into an abstract behavior tree;
a pruning unit, configured to prune the abstract behavior tree and store the structure of the pruned abstract behavior tree.
The leaf node sequence detection unit includes:
a leaf conversion unit, configured to perform leaf node mapping through the abstract behavior tree, convert the captured system call sequence into a leaf node sequence, and extract features from the new leaf node sequence by using the window frequency method;
a dimensionality reduction unit, configured to perform feature dimensionality reduction on the extracted feature vectors;
an output unit, configured to use the reduced feature vectors as the input of a machine learning model and classify the corresponding leaf node sequences into abnormal leaf node sequences and normal leaf node sequences.
Optionally, in the granularity unit, the granularity feature characterization modes include:
an original system call sequence characterization mode, a system behavior characterization mode, and a system kernel module characterization mode, whose feature granularities are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively.
Optionally, the mapping unit being configured to store the mapping relations between the features of different granularities by using a relational map includes:
the mapping from original system calls to system behaviors is many-to-one, the mapping from system behaviors to system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in a relational map.
Optionally, the tree conversion unit being configured to convert the relational map into the abstract behavior tree includes:
converting the graph storage form of the relational map into a tree storage form, and storing the relational map as an abstract tree structure.
Optionally, the pruning unit being configured to prune the abstract behavior tree and store the pruned abstract behavior tree structure includes:
cutting off different leaf nodes in each round and measuring the pruning effect by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement and its structure is stored.
Optionally, the dimensionality reduction unit being configured to perform feature dimensionality reduction on the extracted feature vectors includes:
performing feature dimensionality reduction on the extracted feature vectors through singular value decomposition.
Optionally, the output unit being configured to use the reduced feature vectors as the input of a machine learning model and classify the corresponding leaf node sequences as abnormal or normal includes:
dividing the reduced feature vectors and their classification labels into a training set, a test set, and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and tune the hyper-parameters of the model, and the validation set is used to verify the generalization ability of the model; and selecting different machine learning algorithm models, performing parameter selection, and evaluating the model effect by cross validation.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the number of feature fragments generated by feature extraction is reduced, the dimensionality of the feature vectors built from those fragments is reduced, and the time overhead of feature extraction is reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of a window frequency method;
FIG. 3 shows the mapping relationship between the three granularity characterization modes;
FIG. 4 is a schematic diagram of pruning an abstract behavior tree.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below by referring to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, the present invention provides a host intrusion detection method based on a system call sequence, which mainly comprises a system call feature extraction stage and a leaf node sequence detection stage.
The system call feature extraction stage is as follows:
S1, capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking them with corresponding sequence labels.
In the original system call information capture stage, system call information is captured by a dedicated system call capture program; the captured system call log information is segmented into a number of system call sequences, each labeled with its corresponding category.
Different operating systems provide a process tracing system call interface, through which process tracing can be implemented. With it, a parent process can control a child process and modify the child's core image, including reading and writing data in the child's address space. The basic principle of original system call information capture is that, while a process is being traced, every signal sent to the traced child process is forwarded to the parent process and the child process is blocked. After receiving the signal, the parent process can inspect and modify the stopped child process and then let it continue running. System call information is captured in this way, processed into a number of original system call sequences, and then marked with corresponding sequence labels to facilitate subsequent feature extraction and model training.
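The segmentation and labeling part of S1 can be illustrated with a minimal sketch. The text does not say how the captured log is split, so the record format (pairs of process id and system call number), the one-sequence-per-process rule, and the label values below are all assumptions made for illustration.

```python
from collections import defaultdict

def split_into_sequences(records, labels, default_label="normal"):
    """Group (pid, syscall_number) records into one labeled sequence per process.

    `records` is a hypothetical capture format; `labels` maps a pid to its label.
    """
    by_pid = defaultdict(list)
    for pid, number in records:
        by_pid[pid].append(number)  # keep the calls of each process in order
    return [(calls, labels.get(pid, default_label)) for pid, calls in by_pid.items()]

# Made-up trace: two interleaved processes, the second one marked abnormal.
records = [(10, 5), (11, 4), (10, 125), (10, 6), (11, 78)]
labeled = split_into_sequences(records, labels={11: "abnormal"})
print(labeled)
```

Keeping the label next to each sequence makes it straightforward to hand both to the later feature extraction and training steps.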
S2, defining the abnormal activity track represented by an abnormal sequence through feature characterization modes of different granularities.
Optionally, the granularity feature characterization modes include: an original system call sequence characterization mode, a system behavior characterization mode, and a system kernel module characterization mode, whose feature granularities are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively.
As shown in FIG. 2, the abnormal activity track represented by an abnormal sequence is defined through these three characterization modes of different granularity. According to the behavior expressed by the system calls, more than one hundred system call numbers (fine granularity) are mapped to more than seventy system behaviors (low-level coarse granularity); according to the function of the system behaviors, these seventy-odd system behavior interfaces are mapped to seven kernel function sub-modules (high-level coarse granularity). Each original system call sequence can therefore be converted into a system behavior sequence and a system kernel module sequence under the different granularity characterization modes.
S3, storing the mapping relations between the features of different granularities by using a relational map.
Optionally, storing the mapping relations between the features of different granularities by using a relational map includes: the mapping from original system calls to system behaviors is many-to-one, the mapping from system behaviors to system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in a relational map.
As shown in FIG. 3, specific mapping relations exist among the three granularity characterization modes, and these relations are expressed by a relational map. Intrusion behavior can be detected from the original system call sequence, the system behavior sequence, or the system kernel module sequence: the original system call sequence gives the best detection effect, the system behavior sequence the next best, and the system kernel module sequence the poorest, while their time overhead and performance overhead decrease in the same order. To balance detection effect against resource overhead, a multi-granularity mixed sequence is constructed from the three sequences.
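As a minimal sketch of such a relational map, two many-to-one dictionaries are enough to convert a system call sequence to the coarser granularities. The concrete entries below are hypothetical; the text does not list the actual mapping tables, and the names merely imitate the style of the leaf node names used later in the example.

```python
# Hypothetical many-to-one maps; the real tables are not given in the text.
SYSCALL_TO_BEHAVIOR = {
    5: "fs-xattr", 6: "fs-xattr",
    78: "fs-stat",
    125: "kernel-sched", 122: "kernel-sched",
}
BEHAVIOR_TO_MODULE = {
    "fs-xattr": "fs", "fs-stat": "fs",
    "kernel-sched": "kernel",
}

def to_behaviors(calls):
    """Fine-grained sequence -> low-level coarse-grained sequence."""
    return [SYSCALL_TO_BEHAVIOR[c] for c in calls]

def to_modules(calls):
    """Fine-grained sequence -> high-level coarse-grained sequence."""
    return [BEHAVIOR_TO_MODULE[b] for b in to_behaviors(calls)]

calls = [5, 125, 6, 78, 122]
behaviors = to_behaviors(calls)
modules = to_modules(calls)
print(behaviors)
print(modules)
```

Because every mapping is many-to-one, each coarser sequence has the same length as the original but draws on a smaller alphabet.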
S4, converting the relational map into an abstract behavior tree.
Optionally, converting the relational map into the abstract behavior tree includes: converting the graph storage form of the relational map into a tree storage form, and storing the relational map as an abstract tree structure.
The mapping relations in the relational map are many-to-one, the same as the child-to-parent relation between nodes in a tree, so the mapping relations are stored in a tree structure (named the abstract behavior tree) to facilitate subsequent adjustment of the tree structure.
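The conversion can be sketched by treating every many-to-one mapping as a set of child-to-parent edges. The function name, the artificial `root` node, and the example entries are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical many-to-one maps (same style as a relational map).
SYSCALL_TO_BEHAVIOR = {5: "fs-xattr", 6: "fs-xattr", 78: "fs-stat",
                       125: "kernel-sched", 122: "kernel-sched"}
BEHAVIOR_TO_MODULE = {"fs-xattr": "fs", "fs-stat": "fs", "kernel-sched": "kernel"}

def build_tree(*child_to_parent_maps, root="root"):
    """Merge many-to-one maps into parent pointers and child lists of one tree."""
    parent = {}
    for mapping in child_to_parent_maps:
        for child, par in mapping.items():
            if child in parent and parent[child] != par:
                raise ValueError(f"{child!r} has two parents; not a tree")
            parent[child] = par
    # Nodes that are parents but have no parent themselves hang under the root.
    tops = sorted({p for p in parent.values() if p not in parent}, key=str)
    for top in tops:
        parent[top] = root
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    return parent, dict(children)

parent, children = build_tree(SYSCALL_TO_BEHAVIOR, BEHAVIOR_TO_MODULE)
leaves = [node for node in parent if node not in children]
print(children["root"], leaves)
```

Before any pruning, the leaves are exactly the system call numbers, which matches the description of the unpruned tree.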
S5, pruning the abstract behavior tree and storing the structure of the pruned abstract behavior tree.
Optionally, pruning the abstract behavior tree and storing the pruned abstract behavior tree structure includes: cutting off different leaf nodes in each round and measuring the pruning effect by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement and its structure is stored.
As shown in FIG. 4, the leaf nodes of the initial abstract behavior tree consist of fine-grained features. The tree is then pruned, with different leaf nodes cut off in each round, so that after pruning the leaf nodes are represented by features of different granularities (fine-grained, low-level coarse-grained, and high-level coarse-grained). After several rounds of leaf node pruning, the leaf nodes finally retained consist of system call nodes, system behavior nodes, and system kernel module nodes. Each system call number corresponds to one leaf node of the abstract behavior tree, so every system call sequence composed of system call numbers can be converted into a new leaf node sequence through the tree.
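One possible reading of this accuracy-guided pruning is a greedy loop that keeps a cut only while the measured accuracy stays at or above the threshold. The sketch below is an assumption-laden illustration: `prune`, `toy_accuracy`, and the candidate order are hypothetical, and `toy_accuracy` merely stands in for training and scoring a real model on the converted sequences.

```python
def prune(candidates, evaluate, threshold):
    """Greedily collapse subtrees while the evaluated accuracy meets the threshold.

    `candidates` are internal nodes whose children would be cut off, turning the
    node itself into a leaf; `evaluate` returns the accuracy for a set of cuts.
    """
    collapsed = set()
    for node in candidates:
        trial = collapsed | {node}
        if evaluate(trial) >= threshold:
            collapsed = trial  # keep this cut; accuracy is still acceptable
    return collapsed

def toy_accuracy(collapsed):
    # Stand-in for model training and scoring: pretend that collapsing the
    # whole "fs" module loses too much information.
    return 0.80 if "fs" in collapsed else 0.95

kept = prune(["fs-xattr", "kernel-sched", "fs"], toy_accuracy, threshold=0.90)
print(sorted(kept))
```

In this toy run the two behavior-level cuts are kept and the module-level cut is rejected, so the resulting leaves mix granularities as described above.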
The leaf node sequence detection stage comprises the following steps:
S6, performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence by using the window frequency method.
The window frequency method extracts features from the new leaf node sequence. Compared with the original sequence, the leaf node sequence retains the information contained in the original sequence while greatly reducing the vector dimensionality produced during feature extraction and significantly reducing time and computation overhead.
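The leaf node mapping can be sketched as a walk from each system call up the tree to the highest ancestor that became a leaf through pruning. The parent table, the set of collapsed nodes, and the toy sequence below are hypothetical values chosen for illustration.

```python
# Hypothetical abstract behavior tree as child -> parent pointers.
PARENT = {5: "fs-xattr", 6: "fs-xattr", 78: "fs-stat",
          125: "kernel-sched", 122: "kernel-sched",
          "fs-xattr": "fs", "fs-stat": "fs", "kernel-sched": "kernel"}
# Internal nodes whose children were cut off by pruning (assumed result).
COLLAPSED = {"fs-xattr", "kernel-sched"}

def leaf_of(node, parent, collapsed):
    """Return the highest collapsed ancestor of `node`, or `node` itself if none."""
    leaf = node
    while node in parent:
        node = parent[node]
        if node in collapsed:
            leaf = node
    return leaf

def distinct_ngrams(seq, n):
    """Set of distinct contiguous windows of length n."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

calls = [5, 125, 6, 5, 78, 122, 6, 125]
leaf_sequence = [leaf_of(c, PARENT, COLLAPSED) for c in calls]
before = len(distinct_ngrams(calls, 2))
after = len(distinct_ngrams(leaf_sequence, 2))
print(leaf_sequence, before, after)
```

Calls that share a collapsed ancestor become indistinguishable, so the leaf node sequence produces fewer distinct fragments than the original sequence for the same window length.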
S7, performing feature dimensionality reduction on the extracted feature vectors.
Optionally, feature dimensionality reduction is performed on the extracted feature vectors through singular value decomposition.
Machine learning models are sensitive to feature dimensionality, so the extracted feature vectors need to be reduced in dimension. Singular value decomposition is fast and effective for this purpose, and can reduce the extracted feature vectors to a specified dimensionality.
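For reference, the standard way singular value decomposition is used for dimensionality reduction can be written as follows. The notation is assumed here for illustration and does not appear in the original text: X is the matrix of feature vectors with one row per leaf node sequence, k is the target dimension, and the subscript k keeps only the k largest singular values and their singular vectors.

```latex
X = U \Sigma V^{\top}, \qquad
X \approx U_k \Sigma_k V_k^{\top}, \qquad
Z = X V_k = U_k \Sigma_k
```

Each row of Z is the k-dimensional representation of the corresponding feature vector.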
S8, using the reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences into abnormal leaf node sequences and normal leaf node sequences.
Optionally, the reduced feature vectors and their classification labels are divided into a training set, a test set, and a validation set; the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and tune the hyper-parameters of the model, and the validation set is used to verify the generalization ability of the model, so that an efficient, low-cost intrusion detection engine is finally obtained.
The data are divided into a training set and a test set, different machine learning algorithm models are selected, parameter selection is performed, the model effect is evaluated by cross validation, and the parameters are tuned iteratively to construct an intrusion detection engine with high accuracy and low cost. The engine is finally deployed on a host to perform intrusion detection on that host.
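The data handling in S8 can be sketched without committing to any particular classifier. The split ratios, the function names, and the use of integers as stand-ins for (feature vector, label) pairs are assumptions; the names of the three parts follow the order in which the text lists them.

```python
import random

def three_way_split(samples, train=0.6, test=0.2, seed=0):
    """Shuffle and split samples into training, test and validation parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_test = round(len(items) * test)
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])

def k_fold(items, k):
    """Yield (fit_part, held_out_part) pairs for k-fold cross validation."""
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit, held_out

samples = list(range(10))  # stand-ins for (feature vector, label) pairs
train_set, test_set, validation_set = three_way_split(samples)
folds = list(k_fold(train_set, k=3))
print(len(train_set), len(test_set), len(validation_set), len(folds))
```

Each candidate model and parameter setting would be fitted on `fit_part` and scored on `held_out_part` for every fold, and the averaged score used to compare candidates.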
The following uses one system call sequence as an example to illustrate the principle and process by which the present invention reduces the number of feature fragments.
Original system call sequence T: {5 125 6 5 3 6 91 4 78 78 125 122 192};
Taking T as the whole corpus for feature extraction, features are extracted from it by the window frequency method; the fixed length of the feature fragments is K, and the number of feature fragments generated is N (the feature vector dimension).
If K = 1, the extracted feature fragments are:
[5],[125],[6],[3],[91],[4],[78],[122],[192]; N = T(1) -> 9;
if K = 2, the extracted feature fragments are:
[5 125],[125 6],[6 5],[5 3],[3 6],[6 91],[91 4],[4 78],[78 78],[78 125],[125 122],[122 192]; N = T(2) -> 12;
if K = 3, the extracted feature fragments are:
[5 125 6],[125 6 5],[6 5 3],…,[125 122 192]; N = T(3) -> 12;
if K = 4, the extracted feature fragments are:
[5 125 6 5],[125 6 5 3],[6 5 3 6],…,[78 125 122 192]; N = T(4) -> 12;
if K = 5, the extracted feature fragments are:
[5 125 6 5 3],[125 6 5 3 6],[6 5 3 6 91],…,[78 78 125 122 192]; N = T(5) -> 11;
when K = 1-5, the number of feature fragments generated is: T(1-5) = T(1) + T(2) + T(3) + T(4) + T(5) = 56;
When the method of the invention is adopted, the abstract behavior tree is used to convert the original system call sequence into the following leaf node sequence:
L: {fs-xattr kernel-sched fs-xattr fs-xattr io fs-xattr kernel-capability io fs-stat fs-stat fs-stat kernel-sched kernel-sched ipc-sem};
Taking L as the whole corpus for feature extraction, features are extracted from it by the window frequency method; the fixed length of the feature fragments is K, and the number of feature fragments generated is N.
If K = 1, the extracted feature fragments are:
[fs-xattr],[kernel-sched],[io],[kernel-capability],[fs-stat],[ipc-sem]; N = L(1) -> 6;
if K = 2, the extracted feature fragments are:
[fs-xattr kernel-sched],[kernel-sched fs-xattr],…,[kernel-sched ipc-sem]; N = L(2) -> 12;
if K = 3, the extracted feature fragments are:
[fs-xattr kernel-sched fs-xattr],…,[kernel-sched kernel-sched ipc-sem]; N = L(3) -> 12;
if K = 4, the extracted feature fragments are:
[fs-xattr kernel-sched fs-xattr fs-xattr],…,[fs-stat kernel-sched kernel-sched ipc-sem]; N = L(4) -> 11;
if K = 5, the extracted feature fragments are:
[fs-xattr kernel-sched fs-xattr fs-xattr io],…,[fs-stat fs-stat kernel-sched kernel-sched ipc-sem]; N = L(5) -> 10;
when K = 1-5, the number of feature fragments generated is: L(1-5) = L(1) + L(2) + L(3) + L(4) + L(5) = 51.
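The counts L(K) for the leaf node sequence can be checked with a short script that collects the distinct windows of each length. The helper name `distinct_fragments` is a hypothetical choice; the sequence is the leaf node sequence L given above.

```python
def distinct_fragments(seq, k):
    """Return the set of distinct contiguous fragments of length k."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

# Leaf node sequence L from the example above.
L = ("fs-xattr kernel-sched fs-xattr fs-xattr io fs-xattr kernel-capability io "
     "fs-stat fs-stat fs-stat kernel-sched kernel-sched ipc-sem").split()

counts = [len(distinct_fragments(L, k)) for k in range(1, 6)]
total = sum(counts)
print(counts, total)
```

Only distinct fragments are counted, which is why the repeated pair [fs-stat fs-stat] contributes once to L(2).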
From the above, L(i) <= T(i) and L(i-j) <= T(i-j), where i and j are positive integers; this explains why the present invention is effective.
In a practical application scenario, the corpus is far larger than the one in this example and consists of tens of thousands of original system call sequences, and the advantage of the method becomes more evident as the corpus grows.
Therefore, the ADFA-LD data set is used as the corpus, and the advantage of the method in reducing feature fragments is evaluated on it.
The numbers of feature fragments T(k) generated from the original system call sequences on the ADFA-LD corpus and L(k) generated from the leaf node sequences of the present invention are shown in Table 1:
TABLE 1
Figure SMS_1
The feature fragment extraction method of the present invention is described above; the vectorization method for the feature fragments is explained next.
Define the corpus formed by the selected leaf node sequences as [Figure SMS_2]. Define the i-th leaf node sequence in the corpus as [Figure SMS_3], where [Figure SMS_4] represents the feature fragments contained in the leaf node sequence and [Figure SMS_5] is the label (normal or malicious) corresponding to the leaf node sequence: [Figure SMS_6] indicates that the sequence is a normal sequence, and [Figure SMS_7] indicates that the sequence is a malicious sequence. [Figure SMS_8] represents the number of leaf node sequences in the corpus.
The present invention uses the tf-idf technique to convert feature segments into vectors, giving them an input format suitable for various classifier models;
the calculation of the tf-idf value of a feature segment is described in detail as follows:
The term frequency (Figure SMS_9) is calculated as shown in formula (1), where (Figure SMS_10) denotes the term frequency of the i-th feature segment (Figure SMS_11) of the j-th leaf node sequence, (Figure SMS_12) denotes the number of times feature segment b_i appears in the entire corpus, and (Figure SMS_13) denotes the total number of feature segments contained in the corpus:
Figure SMS_14
(1)
The inverse document frequency (Figure SMS_15) is calculated as shown in formula (2), where (Figure SMS_16) denotes the inverse document frequency of the i-th feature segment (Figure SMS_17), (Figure SMS_18) denotes the total number of leaf node sequences contained in the corpus, and (Figure SMS_19) denotes the number of leaf node sequences that contain the feature segment (Figure SMS_20); (Figure SMS_21) is used to avoid a denominator of 0 (which occurs when a feature segment in the test set does not appear in the corpus formed by the training set):
Figure SMS_22
(2)
Therefore, the tf-idf value is calculated as shown in formula (3): the tf-idf of the feature segment (Figure SMS_23) equals the product of its term frequency (Figure SMS_24) and its inverse document frequency (Figure SMS_25). The tf-idf value of the feature segment (Figure SMS_26) is denoted (Figure SMS_27); a leaf node sequence (Figure SMS_28) containing multiple feature segments is then converted into the vector representation (Figure SMS_29):
Figure SMS_30
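The tf-idf vectorization described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: term counts are taken within each leaf node sequence (the usual tf-idf convention; the wording of the text could also be read as corpus-wide counts), the logarithm is natural, the denominator of the inverse document frequency is the number of containing sequences plus 1, and the toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter


def fragments(sequence, k):
    """Sliding-window fragments of length k, repeats included."""
    return [tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def document_frequency(corpus_fragments):
    """Number of sequences in which each fragment occurs at least once."""
    df = Counter()
    for frags in corpus_fragments:
        df.update(set(frags))
    return df


def tfidf_vector(frags, vocabulary, df, n_sequences):
    """One tf-idf value per vocabulary entry for a single sequence."""
    counts = Counter(frags)
    total = len(frags)
    vector = []
    for b in vocabulary:
        tf = counts[b] / total if total else 0.0
        # The +1 keeps the denominator non-zero for fragments never seen
        # in the corpus (df[b] is 0 for such fragments).
        idf = math.log(n_sequences / (df[b] + 1))
        vector.append(tf * idf)
    return vector


# Hypothetical toy corpus of leaf node sequences.
corpus = [
    ["fs-xattr", "kernel-sched", "fs-xattr"],
    ["kernel-sched", "ipc-sem", "ipc-sem"],
    ["fs-xattr", "kernel-sched", "ipc-sem"],
]
corpus_fragments = [fragments(seq, 2) for seq in corpus]
df = document_frequency(corpus_fragments)
vocabulary = sorted(df)  # fixed ordering gives every vector the same layout
vectors = [tfidf_vector(f, vocabulary, df, len(corpus)) for f in corpus_fragments]
```

Each sequence becomes a vector whose length equals the vocabulary size, which is why larger window lengths lead directly to higher-dimensional inputs.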
Generally, when the fixed length of the feature segments is set to a large value, or when feature segments of several different lengths must be included, the converted feature vector may have a high dimension.
To reduce the dimension of the feature vector quickly, the invention uses singular value decomposition (SVD), because SVD is computationally more efficient than principal component analysis.
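As an illustration of SVD-based dimension reduction, the following Python sketch projects a small feature matrix onto its leading singular directions. It assumes NumPy is available; the matrix values, the target dimension k and the function name svd_reduce are hypothetical and not taken from the patent.

```python
import numpy as np


def svd_reduce(X, k):
    """Project the rows of X onto its top-k right singular vectors.

    Returns the reduced matrix (n_samples x k) and the k projection
    directions (k x n_features), which can be reused on new samples.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:k]
    reduced = U[:, :k] * S[:k]  # equals X @ components.T
    return reduced, components


# Hypothetical tf-idf matrix: 5 leaf node sequences, 4 feature segments.
X = np.array([
    [0.0, 0.2, 0.0, 0.1],
    [0.3, 0.0, 0.1, 0.0],
    [0.0, 0.4, 0.0, 0.2],
    [0.6, 0.0, 0.2, 0.0],
    [0.1, 0.1, 0.0, 0.0],
])
reduced, components = svd_reduce(X, k=2)
print(reduced.shape)
```

Unlike principal component analysis, this sketch does not center the data before decomposing it, so sparse tf-idf matrices do not have to be densified first.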
The dimension-reduced feature vectors are then used as the input of several machine learning models (four machine learning classification models), and the corresponding leaf node sequences are finally classified as abnormal or normal.
Compared with the prior art, which extracts the relevant features directly from the original system call sequence with the window frequency method, the method first maps the original system call sequence to a leaf node sequence and then extracts feature segments from the leaf node sequence. This markedly slows the growth in the number of feature segments, shortens feature extraction, and improves the accuracy of the resulting intrusion detection engine to a certain extent. Performance was evaluated on the ADFA-LD data set. When the fixed length n of the feature segments extracted by the window frequency method is set to 3, the leaf node sequences and the original system call sequences generate 8632 and 18316 feature segments respectively; the original sequences thus generate 112.19% more segments than the leaf node sequences, which corresponds to a reduction of about 52.87% relative to the original sequences. When the length is set to 1-5 (all feature segments with fixed lengths from 1 to 5), the leaf node sequences and the original system call sequences generate 135485 and 160035 feature segments respectively, a reduction of 15.34%. Intrusion detection engines built with machine learning models such as SVM were evaluated on four metrics (precision, recall, F1 score and false alarm rate) with the feature segment length set to 1-5; of the 20 metric values compared with the prior art, 16 are better. In addition, the average time of feature extraction is reduced by 1.026.s, 140.59, 140.s and 140.59.
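The four evaluation metrics and the relative reduction in feature segments can be computed as in the Python sketch below. The label convention (1 for a malicious sequence, 0 for a normal one), the toy labels and the function names are assumptions for illustration; only the two segment counts passed to relative_reduction come from the text.

```python
def detection_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 score and false alarm rate for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # False alarm rate: share of normal sequences flagged as malicious.
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, f1, false_alarm_rate


def relative_reduction(before, after):
    """Reduction of `after` relative to `before`, in percent."""
    return (before - after) / before * 100


# Hypothetical labels: 1 = malicious, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1, far = detection_metrics(y_true, y_pred)

# Segment counts for lengths 1-5 as reported in the text.
reduction_1_5 = relative_reduction(160035, 135485)
print(round(reduction_1_5, 2))
```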
Compared with the prior art, the invention reduces the number of feature segments generated by feature extraction, the dimensionality of the feature vectors generated from those segments, and the time overhead of feature extraction.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed.

Claims (7)

1. A method for intrusion detection based on system calls, comprising the steps of:
1) A system call feature extraction stage;
S1: capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking the corresponding sequence labels;
S2: defining the abnormal activity tracks represented by abnormal sequences under feature characterization modes of different granularities;
S3: using a relational map to store the mapping relations between the features of different granularities;
S4: converting the relational map into an abstract behavior tree;
S5: pruning the abstract behavior tree, and storing the structure of the abstract behavior tree after pruning;
2) A leaf node sequence detection stage;
S6: performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and performing feature extraction from the new leaf node sequence by using a window frequency method;
S7: performing feature dimensionality reduction on the extracted feature vectors;
S8: taking the feature vectors after dimensionality reduction as the input of a machine learning model, and dividing the corresponding leaf node sequences into abnormal leaf node sequences and normal leaf node sequences.
2. The method for intrusion detection based on system calls according to claim 1, wherein in step S2 the granularity feature characterization modes comprise:
an original system call sequence feature characterization mode, a system behavior feature characterization mode and a system kernel module feature characterization mode, whose feature granularities are fine-grained, low-level coarse-grained and high-level coarse-grained, respectively.
3. The method for intrusion detection based on system calls according to claim 2, wherein step S3, using a relational map to store the mapping relations between the features of different granularities, comprises:
the mapping relation between original system calls and system behaviors is many-to-one, the mapping relation between system behaviors and system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in a relational map.
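The two many-to-one mappings can be pictured as a pair of lookup tables, as in the Python sketch below. The entries, the behavior and module names, and the function to_granularity are hypothetical and serve only to show how one sequence can be expressed at each of the three granularities.

```python
# Hypothetical many-to-one mapping: original system call -> system behavior.
SYSCALL_TO_BEHAVIOR = {
    "getxattr": "fs-xattr",
    "setxattr": "fs-xattr",
    "stat": "fs-stat",
    "sched_yield": "kernel-sched",
    "semget": "ipc-sem",
    "semop": "ipc-sem",
}

# Hypothetical many-to-one mapping: system behavior -> system kernel module.
BEHAVIOR_TO_MODULE = {
    "fs-xattr": "fs",
    "fs-stat": "fs",
    "kernel-sched": "kernel",
    "ipc-sem": "ipc",
}


def to_granularity(sequence, level):
    """Express a system call sequence at the requested granularity."""
    if level == "syscall":    # fine-grained
        return list(sequence)
    behaviors = [SYSCALL_TO_BEHAVIOR[call] for call in sequence]
    if level == "behavior":   # low-level coarse-grained
        return behaviors
    if level == "module":     # high-level coarse-grained
        return [BEHAVIOR_TO_MODULE[b] for b in behaviors]
    raise ValueError(f"unknown granularity level: {level}")


calls = ["getxattr", "stat", "sched_yield", "semop"]
print(to_granularity(calls, "behavior"))
print(to_granularity(calls, "module"))
```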
4. The method for intrusion detection based on system calls according to claim 1, wherein step S4, converting the relational map into an abstract behavior tree, comprises:
converting the graph storage mode of the relational map into a tree storage mode, and storing the relational map in an abstract tree structure.
5. The method for intrusion detection based on system calls according to claim 1, wherein step S5, pruning the abstract behavior tree and storing the structure of the abstract behavior tree after pruning, comprises:
cutting off different leaf nodes each time and measuring the pruning effect by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement and its structure is stored.
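The pruning loop can be sketched in Python as follows. This is only one possible reading of the step: the tree is reduced to a leaf-to-parent table, one leaf is cut per trial, and the evaluate_accuracy callback stands in for training and scoring a model. The tree contents, the made-up accuracies and all names are hypothetical.

```python
def prune_tree(leaf_to_parent, evaluate_accuracy, threshold):
    """Cut one leaf per trial until the accuracy threshold is reached.

    leaf_to_parent: dict mapping each leaf node to its parent node.
    evaluate_accuracy: callable that takes a candidate mapping and returns
        the accuracy of a model built with it (hypothetical stand-in).
    Returns (accepted_mapping, removed_leaf), or (None, None) when no
    single cut reaches the threshold.
    """
    for leaf in sorted(leaf_to_parent):
        candidate = {k: v for k, v in leaf_to_parent.items() if k != leaf}
        if evaluate_accuracy(candidate) >= threshold:
            return candidate, leaf
    return None, None


# Toy tree and a stand-in scorer with made-up accuracies per removed leaf.
TREE = {"fs-xattr": "fs", "fs-stat": "fs",
        "kernel-sched": "kernel", "ipc-sem": "ipc"}
FAKE_ACCURACY = {"fs-stat": 0.80, "fs-xattr": 0.92,
                 "ipc-sem": 0.85, "kernel-sched": 0.70}


def toy_evaluate(candidate):
    """Look up the made-up accuracy for whichever leaf was removed."""
    removed = (set(TREE) - set(candidate)).pop()
    return FAKE_ACCURACY[removed]


pruned, removed_leaf = prune_tree(TREE, toy_evaluate, threshold=0.9)
print(removed_leaf, sorted(pruned))
```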
6. The method for intrusion detection based on system calls according to claim 1, wherein step S7, performing feature dimensionality reduction on the extracted feature vectors, comprises:
performing feature dimensionality reduction on the extracted feature vectors through singular value decomposition.
7. The method for intrusion detection based on system calls according to claim 1, wherein step S8, taking the feature vectors after dimensionality reduction as the input of a machine learning model and dividing the corresponding leaf node sequences into abnormal and normal, comprises:
dividing the feature vectors after dimensionality reduction and their classification labels into a training set, a testing set and a verification set, wherein the training set is used for training the model and determining its parameters, the testing set is used for determining the network structure and adjusting the hyper-parameters of the model, and the verification set is used for verifying the generalization ability of the model; different machine learning algorithm models are selected for parameter selection, and the effect of the model is evaluated by cross validation.
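A three-way split and the fold indices for cross validation can be sketched in plain Python as below. The split ratios, the random seed, the number of folds and the function names are assumptions; the claim does not specify them.

```python
import random


def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split samples into training, testing and verification sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = round(len(items) * ratios[0])
    n_test = round(len(items) * ratios[1])
    training = items[:n_train]
    testing = items[n_train:n_train + n_test]
    verification = items[n_train + n_test:]
    return training, testing, verification


def k_fold_indices(n_samples, k):
    """Yield (train_indices, held_out_indices) for k-fold cross validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Hypothetical data: (feature vector, label) pairs, 1 = abnormal, 0 = normal.
data = [([0.1 * i, 0.2 * i], i % 2) for i in range(10)]
training, testing, verification = split_dataset(data)
folds = list(k_fold_indices(len(training), k=3))
```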

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310072261.3A CN115859277B (en) 2023-02-07 2023-02-07 Host intrusion detection method based on system call sequence


Publications (2)

Publication Number Publication Date
CN115859277A true CN115859277A (en) 2023-03-28
CN115859277B CN115859277B (en) 2023-05-02

Family

ID=85657673


Country Status (1)

Country Link
CN (1) CN115859277B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040205474A1 (en) * 2001-07-30 2004-10-14 Eleazar Eskin System and methods for intrusion detection with dynamic window sizes
CN102546638A (en) * 2012-01-12 2012-07-04 冶金自动化研究设计院 Scene-based hybrid invasion detection method and system
CN110298381A (en) * 2019-05-24 2019-10-01 中山大学 A kind of cloud security service functional tree Network Intrusion Detection System
CN110597735A (en) * 2019-09-25 2019-12-20 北京航空航天大学 Software defect prediction method for open-source software defect feature deep learning
CN111563234A (en) * 2020-04-23 2020-08-21 华南理工大学 Feature extraction method of system call data in host anomaly detection
CN111931175A (en) * 2020-09-23 2020-11-13 四川大学 Industrial control system intrusion detection method based on small sample learning
CN112134862A (en) * 2020-09-11 2020-12-25 国网电力科学研究院有限公司 Coarse-fine granularity mixed network anomaly detection method and device based on machine learning
CN112613032A (en) * 2020-12-15 2021-04-06 中国科学院信息工程研究所 Host intrusion detection method and device based on system call sequence
CN113094713A (en) * 2021-06-09 2021-07-09 四川大学 Self-adaptive host intrusion detection sequence feature extraction method and system
CN114816909A (en) * 2022-04-13 2022-07-29 北京计算机技术及应用研究所 Real-time log detection early warning method and system based on machine learning
CN115278752A (en) * 2022-06-10 2022-11-01 广州大学 AI (artificial intelligence) detection method for abnormal logs of 5G (fifth generation) communication system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
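SEYIT AHMET CAMTEPE et al.: "Modeling and detection of complex attacks" *
Wu Tong (吴桐): "Research on intrusion detection technology based on machine learning" *
Chen Tao (陈涛): "Research on network threat detection models and behavior sequence analysis methods" *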



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant