CN115378702B

CN115378702B - Attack detection system based on Linux system call

Info

Publication number: CN115378702B
Application number: CN202211004258.XA
Authority: CN
Inventors: 万邦睿; 何雨多; 钱鹰; 黄江平; 金霜
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2024-04-02
Anticipated expiration: 2042-08-22
Also published as: CN115378702A

Abstract

The invention belongs to the technical field of computer security and in particular relates to an attack detection method and system based on Linux system calls. The method comprises: acquiring the system calls generated by a system, cutting the system call sequence into equal-length subsequences that serve as sequences to be detected, and converting them into detection sequences in word-vector form; making a preliminary judgment of the category of each word-vector detection sequence with a deep learning detection model, and, if the sequence is judged abnormal, putting it into the attack library and updating the detection matching library, or, if it is judged normal, passing it on for further judgment; comparing each sequence preliminarily judged normal against the detection matching library by matching degree; and applying cluster calculation to any sequence whose category the matching library cannot determine, thereby obtaining the detection result. The invention uses a deep learning model together with a matching library to detect derivative attacks and uses cluster detection for unknown attacks, thereby addressing redundant calls and overlong sequences in system call traces and the missed-detection rate of intrusion detection.

Description

Attack detection system based on Linux system call

Technical Field

The invention belongs to the technical field of computer security, and particularly relates to an attack detection system based on Linux system call.

Background

The information technology application innovation ("Xinchuang") industry is developing rapidly, and domestic operating systems built on the Linux kernel are being adopted by a growing number of individuals and enterprises by virtue of their particular advantages. Malicious attacks on these operating systems are increasing as well, so intrusion detection research for Linux-based systems has become urgent.

A Linux system call is an application programming interface implemented by the Linux kernel for interaction between application programs and the kernel; that is, an application running on a Linux system interacts with the kernel through system calls. Because the system call sequences generated by normal operation and by malicious operation differ markedly, intrusion detection can be built on Linux system calls.

Existing attack detection based on Linux system calls works by analyzing the system calls of a program. The collected system call sequence is analyzed to establish either a baseline of normal system calls or a feature profile of abnormal system calls, and a sequence is judged to be an attack when its behavior deviates from the normal baseline or matches the abnormal call features.

The prior art has the following problems:

(1) Attack detection research on Linux system calls mainly targets derivative attacks of known attack types, so current intrusion detection systems have a non-negligible missed-detection (false-negative) rate.

(2) Attack detection research on Linux system calls remains largely theoretical. In practical deployments, Linux system call sequences contain redundant calls and are overly long, and these problems are left unhandled.

Disclosure of Invention

To solve the above technical problems, the invention provides an attack detection method and an attack detection system based on Linux system calls.

An attack detection method based on Linux system calls comprises the following steps:

S1: acquiring the system calls generated while the system is running, and acquiring attack sequences and normal sequences from an existing database to construct a data set;

S2: removing redundant system calls from the calls generated during system operation, generating a system call sequence for a specified time period, cutting the system call sequence into equal-length subsequences, and cutting the sequences in the data set into equal-length subsequences;

S3: dividing the equal-length subsequences of the data set into attack sequences and normal sequences according to data type, and storing them in two separate sequence libraries to obtain the detection sequence matching library;

S4: taking the equal-length subsequences cut from the system call sequence as sequences to be detected, and converting them into detection sequences in word-vector form;

S5: making a preliminary judgment of the category of each word-vector detection sequence with a deep learning model; if the sequence is judged abnormal, putting it into the attack library and updating the detection matching library; if it is judged normal, passing it on for further judgment;

S6: comparing each sequence preliminarily judged normal against the detection matching library by matching degree, and determining the category of the sequence;

S7: applying cluster calculation to the sequences whose category the matching library cannot determine, thereby obtaining the detection result.
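The decision flow of S5 to S7 can be sketched as a three-stage cascade. This is only an illustrative sketch: the function name `detect`, the label strings, and the three callables standing in for the deep learning model, the matching library, and the cluster calculation are hypothetical and are not defined by the text.

```python
from typing import Callable, List, Optional, Sequence

Label = str  # "attack" or "normal" (label strings are an assumption)


def detect(
    seq: Sequence[str],
    classify: Callable[[Sequence[str]], Label],
    match: Callable[[Sequence[str]], Optional[Label]],
    cluster: Callable[[Sequence[str]], Label],
    attack_library: List[List[str]],
) -> Label:
    """Run one sequence through the S5 -> S6 -> S7 cascade."""
    # S5: preliminary judgment by the deep learning model.
    if classify(seq) == "attack":
        # An abnormal sequence is added to the attack library,
        # which updates the detection matching library.
        attack_library.append(list(seq))
        return "attack"
    # S6: re-check a sequence judged normal against the matching library.
    matched = match(seq)
    if matched is not None:
        return matched
    # S7: the matching library could not decide, so fall back to
    # cluster calculation across hosts.
    return cluster(seq)
```

Passing the three stages as callables keeps the cascade independent of the particular model, similarity measure, and host group that a deployment uses.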

Preferably, cutting the system call sequence into equal-length subsequences specifically comprises:

removing redundant system calls by a statistical analysis method to generate a system call sequence, and cutting the generated sequence into call sequences of fixed length: a sequence longer than the fixed length is cut with a sliding window, and a sequence shorter than the fixed length is padded with 0 at its tail.

Preferably, removing redundant system calls by the statistical analysis method specifically comprises:

for the sequences of each type in the data set, calculating the TF-IDF value of every system call in each sequence of the two types; sorting the TF-IDF values within each type in descending order and selecting the last 40 system calls of each type; selecting the system calls that occur repeatedly among the last 40 of each type; comparing the two selections and taking the system calls common to both types as redundant system calls; and comparing the selected redundant system calls against all system calls generated during system operation, finding the calls that are the same as the selected redundant calls, and removing them.

Further, consistent with the symbol definitions that follow, the TF-IDF value of a system call in each system call sequence is calculated as:

$$\mathrm{TF\text{-}IDF}_{a_1} = \frac{TF_{a_1,a}}{\sum_{k} a_{1,k}} \times \log\frac{|N|}{|\{N_{a_1}\}|}$$

where $\mathrm{TF\text{-}IDF}_{a_1}$ denotes the TF-IDF value of system call a1, $TF_{a_1,a}$ denotes how frequently system call a1 occurs in sequence a, $\sum_{k} a_{1,k}$ denotes the total number of occurrences of system call a1 in the generated sequences k, $|N|$ denotes the total number of sequences of the normal type or the attack type among the generated sequences, and $|\{N_{a_1}\}|$ denotes the number of sequences of the current class in which system call a1 occurs.
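The redundant-call screening described above can be sketched as follows. The sketch makes several assumptions that the text does not settle: it uses the common TF-IDF definition (count in the sequence divided by sequence length, times the logarithm of class size over document frequency), whose normalization may differ from the formula of the invention; it reads "repeatedly occurring" as appearing in the bottom-k list of at least two sequences; and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def tfidf_per_sequence(seqs: List[Sequence[str]]) -> List[Dict[str, float]]:
    """TF-IDF of every call in every sequence of one class."""
    n = len(seqs)
    # Document frequency: number of sequences that contain each call.
    df = Counter(call for s in seqs for call in set(s))
    scores = []
    for s in seqs:
        counts = Counter(s)
        total = len(s)
        scores.append(
            {c: (k / total) * math.log(n / df[c]) for c, k in counts.items()}
        )
    return scores


def low_value_calls(
    seqs: List[Sequence[str]], k: int = 40, min_repeats: int = 2
) -> Set[str]:
    """Calls that fall in the bottom-k TF-IDF ranking of several sequences."""
    repeats: Counter = Counter()
    for score in tfidf_per_sequence(seqs):
        ranked = sorted(score, key=score.get, reverse=True)  # descending order
        repeats.update(ranked[-k:])  # the last k entries have the lowest values
    return {c for c, r in repeats.items() if r >= min_repeats}


def redundant_calls(
    normal: List[Sequence[str]], attack: List[Sequence[str]], k: int = 40
) -> Set[str]:
    """Calls that carry little information in both classes."""
    return low_value_calls(normal, k) & low_value_calls(attack, k)


def strip_redundant(trace: Sequence[str], redundant: Set[str]) -> List[str]:
    """Drop the redundant calls from a collected trace."""
    return [c for c in trace if c not in redundant]
```

A call that appears in every sequence of a class has an inverse document frequency of zero, so it sinks to the bottom of the ranking in both classes and is selected as redundant.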

Preferably, comparing the sequence preliminarily judged normal with the sequences in the matching library by matching degree and determining its type specifically comprises:

setting the matching degree to 0.8 and comparing the similarity between the sequence to be detected and the sequences in the two matching libraries: if its similarity to a sequence in the normal sequence matching library is greater than 0.8, the sequence is a normal sequence; if its similarity to a sequence in the attack sequence matching library is greater than 0.8, the sequence is an attack sequence; and if its similarity to the sequences in both the normal and the attack sequence matching libraries is less than 0.8, the type of the sequence cannot be identified.
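The matching step can be sketched as below. The text does not name the similarity measure, so this sketch assumes the ratio computed by Python's `difflib.SequenceMatcher`; checking the normal library before the attack library is likewise an assumption, as are the function names.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Sequence


def best_similarity(seq: Sequence[str], library: List[Sequence[str]]) -> float:
    """Highest similarity between seq and any sequence in the library."""
    return max(
        (SequenceMatcher(None, seq, ref, autojunk=False).ratio() for ref in library),
        default=0.0,  # an empty library matches nothing
    )


def match_label(
    seq: Sequence[str],
    normal_lib: List[Sequence[str]],
    attack_lib: List[Sequence[str]],
    degree: float = 0.8,
) -> Optional[str]:
    """Return "normal", "attack", or None when neither library matches."""
    if best_similarity(seq, normal_lib) > degree:
        return "normal"
    if best_similarity(seq, attack_lib) > degree:
        return "attack"
    return None  # type cannot be identified; defer to cluster calculation
```

Returning `None` for the unidentified case lets the caller hand the sequence to the cluster calculation without introducing a third label.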

Preferably, the cluster calculation specifically comprises:

S1: marking a detection sequence that the detection unit cannot identify as seq₀₁, and converting seq₀₁ into the word-vector sequence seqm₀₁;

S2: selecting the sequence seq₁₁ generated by host h1 of the intranet-connected host group in the same time period as seq₀₁, converting it into the word-vector sequence seqm₁₁, and calculating the Euclidean distance d(seqm₀₁, seqm₁₁);

S3: setting a threshold distance; if d(seqm₀₁, seqm₁₁) > distance, determining that the sequence generated by the host is similar to the detection sequence; repeating S2-S3 to count the number of hosts, hostNumber, that have a sequence similar to the detection sequence; if hostNumber >= threshold, determining that the detection sequence is an attack sequence, and if hostNumber < threshold, determining that it is a normal sequence, where threshold denotes the preset number of hosts with similar sequences.

Further, the Euclidean distance d(seqm₀₁, seqm₁₁) is calculated as:

$$d(\mathrm{seqm}_{01}, \mathrm{seqm}_{11}) = \sqrt{\sum_{i} (m_i - n_i)^2}$$

where seqm₀₁ denotes the detection sequence in word-vector form, seqm₁₁ denotes the word-vector sequence generated by host h1 of the intranet-connected host group in the same time period as the detection sequence, $m_i$ denotes the i-th system call of seqm₀₁, and $n_i$ denotes the i-th system call of seqm₁₁.
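The cluster calculation can be sketched as a host count over Euclidean distances. Two points are assumptions of this sketch rather than statements of the text: each word-vector sequence is treated as a matrix of floats and the distance is taken over all of its elements, and a host is counted as similar when its distance is at most `distance`, because a smaller Euclidean distance normally indicates closer sequences, which departs from the literal comparison `d > distance` in step S3. The function and parameter names are hypothetical.

```python
import math
from typing import Dict, Sequence

Matrix = Sequence[Sequence[float]]  # one row of floats per system call


def euclidean(m: Matrix, n: Matrix) -> float:
    """Euclidean distance over all elements of two equally shaped matrices."""
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_m, row_n in zip(m, n)
            for a, b in zip(row_m, row_n)
        )
    )


def cluster_label(
    seqm01: Matrix,
    peers: Dict[str, Matrix],
    distance: float,
    threshold: int,
) -> str:
    """Label seqm01 by how many peer hosts produced a similar sequence."""
    # Assumption: "similar" means distance <= the set threshold distance.
    host_number = sum(
        1 for seqm in peers.values() if euclidean(seqm01, seqm) <= distance
    )
    return "attack" if host_number >= threshold else "normal"
```

Here `peers` maps each host of the intranet-connected group to the word-vector sequence it generated in the same time period as the detection sequence.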

An attack detection system based on Linux system calls comprises a collection module, a training module, and a detection module;

the collection module comprises a system call acquisition unit, a processing unit, and a data transmission unit;

the system call acquisition unit collects information about the system calls made during execution of a designated process, and acquires attack sequences and normal sequences from an existing database to construct a data set;

the related information includes: call time, system call name, process name, and thread name;

the processing unit processes the system call information collected by the system call acquisition unit according to length or call time to generate a system call sequence for a specified time period, cuts the system call sequence into equal-length subsequences, and cuts the sequences in the data set into equal-length subsequences;

the data transmission unit transmits the subsequences generated from the system call information to the detection module as detection sequences, and transmits the subsequences of the data set to the training module;

the training module comprises a data unit, a conversion unit, and a training unit;

the data unit divides the equal-length subsequences of the data set into attack sequences and normal sequences according to data type, and stores them in an attack library and a normal library respectively to obtain the detection sequence matching library;

the conversion unit converts the sequences stored in the sequence libraries into a word vector matrix;

the training unit trains the detection model on the word vector matrix using deep learning;

the detection module comprises a conversion unit, a detection unit, a matching unit, a calculation unit, and an identification unit;

the conversion unit converts the sequence to be detected into word-vector form, using the same word-vector form generated by the conversion unit of the training module;

the detection unit makes a preliminary classification of the word-vector sequence converted from the sequence to be detected with the deep learning model; if the sequence is judged abnormal, it is sent to the training module to update the matching library, and if it is judged normal, it is sent to the matching unit for further judgment;

the matching unit re-checks each sequence that the detection unit preliminarily judged normal by comparing its matching degree with the sequences in the matching library, and determines the type of the sequence;

when the detection unit cannot identify the type of the detection sequence, the calculation unit calculates the similarity between that sequence and the sequences generated by the other hosts of the intranet-connected host group in the same time period, decides whether they are similar according to a set similarity threshold, and judges the sequence to be an attack sequence if it is similar to the sequences of a plurality of hosts in the group, thereby obtaining the detection result.

The beneficial effects of the invention are as follows: attacks of both known and unknown types can be detected at the level of system call sequences, which reduces the missed-detection rate and makes the method better suited to practical application scenarios.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a flow chart of a sequence process of the present invention;

FIG. 3 is a flow chart of the detection of the invention;

FIG. 4 is a schematic diagram of the system of the present invention;

FIG. 5 is a system call sample of the system of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can obtain from these embodiments without inventive effort fall within the scope of protection of the invention.

An attack detection method based on Linux system calls, as shown in FIG. 1, comprises the following steps:

S4: taking the equal-length subsequences cut from the system call sequence as sequences to be detected, and converting them into word-vector form;

S5: making a preliminary judgment of the category of each word-vector sequence with a deep learning model; if the sequence is judged abnormal, putting it into the attack library and updating the detection matching library; if it is judged normal, passing it on for further judgment;

Cutting the system call sequence into equal-length subsequences specifically comprises:

the information captured by the system call acquisition unit is processed into a system call sequence, either by length or by call time: the system call name and generation time are extracted from the information, and a fixed-length system call sequence for a specified time period is generated. As shown in FIG. 2, redundant system calls are removed during processing and equal-length call sequences of length 500 are generated; a sequence longer than 500 is cut with a sliding window whose step size is 250. If the sequence is shorter than 500 and the waiting time during collection exceeds 20 seconds, the sequence is padded with 0 at its tail.
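The windowing and padding can be sketched as follows, using the window length of 500 and step size of 250 given above. The sketch assumes that system calls have already been mapped to positive integer identifiers so that 0 is free to act as the padding value, and that a trailing partial window is padded in the same way as a short sequence; the 20-second collection timeout is not modeled. The name `split_fixed` is hypothetical.

```python
from typing import List

PAD = 0  # padding value; assumes real call identifiers are positive


def split_fixed(seq: List[int], length: int = 500, step: int = 250) -> List[List[int]]:
    """Cut seq into windows of exactly `length` items, zero-padding the tail."""
    if len(seq) <= length:
        # Short sequence: a single window padded at the tail.
        return [seq + [PAD] * (length - len(seq))]
    windows = []
    for start in range(0, len(seq), step):
        chunk = seq[start:start + length]
        windows.append(chunk + [PAD] * (length - len(chunk)))
        if start + length >= len(seq):
            break  # this window already reaches the end of the sequence
    return windows
```

With a step of half the window length, consecutive windows overlap by 250 calls, so a pattern that straddles a window boundary still appears whole in the neighbouring window.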

In this embodiment, a statistical analysis method is used to remove redundant system calls, which specifically comprises:

for the sequences of each type in the data set, calculating the TF-IDF value of every system call in each sequence of the two types and sorting the TF-IDF values within each type in descending order; because a smaller TF-IDF value means the system call has less influence on the type of the sequence it belongs to, selecting the last 40 system calls of each type; selecting the system calls that occur repeatedly among the last 40 of each type; comparing the two selections and taking the system calls common to both types as redundant system calls; and comparing the selected redundant system calls against all system calls generated during system operation, finding the calls that are the same as the selected redundant calls, and removing them.

The TF-IDF value of the system call in each system call sequence is calculated as follows:

Converting a sequence to be detected into word-vector form specifically comprises:

a single system call in a sequence is regarded as a word and the whole system call sequence as a sentence; the word vector dimension is set to 250; the word vectors are trained with the Gensim tool over the distinct system calls in the word vector corpus; and the set of system call sequences is input into the word vector model to generate a word vector matrix, giving each sequence in word-vector form.
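The lookup from a system call sequence to its 500x250 word vector matrix can be sketched as below. The sketch deliberately does not train anything: `toy_embeddings` is a hypothetical stand-in that fills the table with seeded random vectors, whereas in the embodiment the table would come from the word vector model trained with Gensim. Mapping padding positions and unknown calls to the zero vector is also an assumption.

```python
import random
from typing import Dict, List, Sequence

DIM = 250  # word vector dimension stated in the text


def toy_embeddings(vocab: Sequence[str], dim: int = DIM, seed: int = 0) -> Dict[str, List[float]]:
    """Hypothetical stand-in for trained word vectors (random, not learned)."""
    rng = random.Random(seed)
    return {call: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for call in vocab}


def to_matrix(
    seq: Sequence[str],
    table: Dict[str, List[float]],
    length: int = 500,
    dim: int = DIM,
) -> List[List[float]]:
    """Turn a sequence of call names into a length x dim matrix."""
    zero = [0.0] * dim
    # Each call is a "word"; unknown calls fall back to the zero vector.
    rows = [table.get(call, zero) for call in seq[:length]]
    # Pad short sequences with zero rows up to the fixed length.
    rows += [zero] * (length - len(rows))
    return rows
```

Identical calls always map to the same row, which is what lets the downstream model treat the sequence as a sentence of repeated words.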

Making the preliminary judgment of the category of the word-vector sequence with the deep learning model specifically comprises:

the generated 500x250 word vector matrix is taken as the input sequence, with a training batch size of 20, a learning rate of 0.01, an input vector dimension of 250, an embedding layer dimension of 125, a hidden layer dimension of 125, an output dimension of 125, the ReLU activation function, and 2 network layers; the output of the last hidden layer is taken as the classification result. The deep learning detection model includes, but is not limited to, CNN, RNN, LSTM, and GRU models.

Comparing the sequence preliminarily judged normal with the sequences in the matching library by matching degree and determining its type, as shown in FIG. 3, specifically comprises:

The cluster calculation specifically comprises the following steps:

S3: setting a threshold distance; if d(seqm₀₁, seqm₁₁) > distance, determining that the sequence generated by the host is similar to the detection sequence; repeating S2-S3 to count the number of hosts, hostNumber, that have a sequence similar to the detection sequence; if hostNumber >= threshold, determining that the detection sequence is an attack sequence, and if hostNumber < threshold, determining that it is a normal sequence, where threshold denotes the preset number of hosts with similar sequences.

The Euclidean distance d(seqm₀₁, seqm₁₁) is calculated as:

An attack detection system based on Linux system calls, as shown in FIG. 4, comprises a collection module, a training module, and a detection module;

the detection unit makes a preliminary classification of the word-vector sequence converted from the sequence to be detected with deep learning; if the sequence is judged abnormal, it is sent to the training module to update the matching library, and if it is judged normal, it is sent to the matching unit for further judgment;

The collection module collects the names and numbers of the system calls made during execution of the designated process, generates system call sequences in the order of interception, cuts the call sequences into equal-length subsequences as input for the training and detection modules, and stores the resulting sequences locally. The module comprises a system call acquisition unit, a processing unit, and a receiving and transmitting unit.

The system call acquisition unit collects information about the system calls generated during execution of the designated process. As shown in FIG. 5, the information includes the call time, system call name, process name, thread number, and the like.
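The collected information and the generation of a sequence for a specified time period can be sketched as below. The record layout is hypothetical: the field names, the use of seconds as the time unit, and the half-open time interval are assumptions of the sketch, since the sample of FIG. 5 is not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SyscallRecord:
    time: float    # call time in seconds (unit is an assumption)
    name: str      # system call name, e.g. "openat"
    process: str   # process name
    thread: str    # thread name or number


def sequence_in_period(
    records: Iterable[SyscallRecord], process: str, start: float, end: float
) -> List[str]:
    """Ordered call names of one process within the interval [start, end)."""
    picked = [r for r in records if r.process == process and start <= r.time < end]
    # Sort by call time so the sequence reflects execution order.
    return [r.name for r in sorted(picked, key=lambda r: r.time)]
```

Only the call name and call time are used to build the sequence, matching the description that these two fields are extracted from the collected information.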

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An attack detection method based on Linux system calls, characterized by comprising the following steps:

S5: making a preliminary judgment of the category of each word-vector detection sequence with a deep learning detection model; if the detection sequence is judged abnormal, putting it into the attack library and updating the detection matching library; if it is judged normal, passing it on for further judgment;

S7: applying cluster calculation to the sequences whose category cannot be determined, to obtain the detection result;

the cluster calculation specifically comprises the following steps:

S71: marking a detection sequence that the detection unit cannot identify as seq₀₁, and converting seq₀₁ into the word-vector sequence seqm₀₁;

S72: selecting the sequence seq₁₁ generated by host h1 of the intranet-connected host group in the same time period as seq₀₁, converting it into the word-vector sequence seqm₁₁, and calculating the Euclidean distance d(seqm₀₁, seqm₁₁);

S73: setting a threshold distance; if d(seqm₀₁, seqm₁₁) > distance, determining that the sequence generated by the host is similar to the detection sequence; repeating S72-S73 to count the number of hosts, hostNumber, that have a sequence similar to the detection sequence; if hostNumber >= threshold, determining that the detection sequence is an attack sequence, and if hostNumber < threshold, determining that it is a normal sequence, where threshold denotes the preset number of hosts with similar sequences.

2. The attack detection method based on Linux system calls according to claim 1, wherein cutting the system call sequence into equal-length subsequences specifically comprises:

3. The attack detection method based on Linux system calls according to claim 2, wherein removing redundant system calls by the statistical analysis method specifically comprises:

for the sequences of known type in the data set, calculating the TF-IDF value of every system call in each sequence of the two types; sorting the TF-IDF values within each type in descending order and selecting the last 40 system calls of each type; selecting the system calls that occur repeatedly among the last 40 of each type; comparing the two selections and taking the system calls common to both types as redundant system calls; and comparing the selected redundant system calls against all system calls generated during system operation, finding the calls that are the same as the selected redundant calls, and removing them.

4. The attack detection method based on Linux system calls according to claim 3, wherein, consistent with the symbol definitions that follow, the TF-IDF value of a system call is calculated as:

$$\mathrm{TF\text{-}IDF}_{a_1} = \frac{TF_{a_1,a}}{\sum_{k} a_{1,k}} \times \log\frac{|N|}{|\{N_{a_1}\}|}$$

where $\mathrm{TF\text{-}IDF}_{a_1}$ denotes the TF-IDF value of system call a1, $TF_{a_1,a}$ denotes how frequently system call a1 occurs in sequence a, $\sum_{k} a_{1,k}$ denotes the total number of occurrences of system call a1 in the generated sequences k, $|N|$ denotes the total number of sequences of the normal type or the attack type among the generated sequences, and $|\{N_{a_1}\}|$ denotes the number of sequences of the current class in which system call a1 occurs.

5. The attack detection method based on Linux system calls according to claim 1, wherein comparing the sequence preliminarily judged normal with the sequences in the matching library by matching degree and determining its type specifically comprises:

6. The attack detection method based on Linux system calls according to claim 1, wherein the Euclidean distance d(seqm₀₁, seqm₁₁) is calculated as:

7. An attack detection system based on Linux system calls, comprising a collection module, a training module, and a detection module;