CN114266040A - Linux host intrusion detection method

Info

Publication number: CN114266040A
Application number: CN202111527054.XA
Authority: CN
Inventors: 杨航; 郭乔进; 吴其华; 高沙沙; 产院东; 张欣怡; 张峰; 汪义飞; 相银堂; 刘蔚棣
Original assignee: CETC 28 Research Institute
Current assignee: CETC 28 Research Institute
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2022-04-01

Abstract

The invention discloses a Linux host intrusion detection method. The detection system involved comprises a monitoring process module, a processing module, a detection module and a log module. The monitoring process module screens processes and starts a monitoring process for each process to be monitored. The processing module extracts the system calls from the monitoring output, maintains a time-ordered text sequence of system call events, and periodically sends that sequence to the detection module. The detection module places a fine-tuned BERT model upstream and a text classification task downstream: it takes the system call text sequence of a specific process as input, embeds the text into word vectors through the model, and outputs a classification result. The log module analyzes the classification results and performs logging and alerting accordingly. The method is independent of hardware, is general, and facilitates model migration and wide application.

Description

Linux host intrusion detection method

Technical Field

The invention relates to an intrusion detection method, in particular to a Linux host intrusion detection method.

Background

With the continued spread and development of computer networks, the requirements for computer and network security protection have also increased. When a host is compromised, the normal operation of the computer and of its services can be severely damaged. Host intrusion detection plays an important role in achieving computer and network security. Compared with network intrusion detection, host intrusion detection requires no additional network equipment, costs less to implement, is flexible to configure, and detects well. As a popular open-source operating system, Linux is receiving more and more attention with respect to its security. Against this background, effective and efficient host intrusion detection for the Linux operating system has gradually become a research focus.

At present, research on Linux host intrusion detection at home and abroad has a certain foundation. System calls are the only way for a user program to access system resources; they describe the trace of program execution, and program anomalies and intrusions can be judged from the order in which they are executed. Researchers at the University of Electronic Science and Technology studied a variable-length-pattern system call sequence matching algorithm and implemented a new double-chain tree storage structure that markedly reduces storage overhead and improves search efficiency. Dai Feng of Hunan Agricultural University constructs a finite state automaton for each function and uses the automaton to detect whether the system call flow is abnormal; test results show that the technique can effectively detect intrusion behavior. Zheng Haixiang of Guangdong University of Technology built a training module and a detection module based on a hidden Markov model and improved the detection method. In summary, current host intrusion detection methods based on system call sequences mainly capture system call numbers and analyze the resulting numerical sequence. They still have several disadvantages: the system call number is not directly linked to its meaning, interpretability is poor, temporal features are hard to exploit fully, and models are hard to migrate because system call numbers differ between architectures.

Disclosure of Invention

Purpose of the invention: the technical problem to be solved is to provide a Linux host intrusion detection method that addresses the shortcomings of the prior art.

In order to solve the technical problem, the invention discloses a Linux host intrusion detection method.

A Linux host intrusion detection method, characterized in that a host intrusion detection system based on text classification is constructed; the host intrusion detection system comprises a monitoring process module, a processing module, a detection module and a log module;

the monitoring process module is used for process screening, white list maintenance and life cycle management of the monitoring processes;

the processing module processes the output of the monitoring process module, extracts the system call behavior, and maintains the text sequence of system call events;

the detection module performs call event text embedding and intrusion detection based on a BERT pre-trained model;

the log module analyzes the detection results, raises intrusion alarms and stores the logs.

In the invention, a system call text sequence with actual meaning is obtained through the system call mapping of the host intrusion detection system, the latent temporal features of that text sequence are mined with a BERT pre-trained model, and the host intrusion detection system performs the following steps to realize intrusion detection:

step 1, starting a monitoring process;

step 2, calling sequence maintenance;

step 3, intrusion detection classification;

step 4, log processing;

step 5, updating the classification model;

step 6, repeating steps 1 to 5 to realize continuous Linux host intrusion detection.

In step 1 of the invention, a process white list is established on the Linux host and the processes running in the system are screened against it. If a running process is not on the white list, the monitoring process module starts strace, the Linux process tracing tool, binds the strace process to the monitored process, and monitors the system calls of that process.

In step 2 of the invention, after receiving the strace output for a monitored process, the processing module maintains a system call text sequence for each monitored process. It extracts the system call output of the corresponding process, extracts the actual system call operation, converts it into the corresponding system call semantics according to the mapping, and appends it to the end of that process's system call text sequence, so that the operation text is stored in system call order. The text sequence preserves the timing information of the operations, so that latent temporal features can be captured and matched against the temporal patterns of intrusion operations.

In step 3 of the invention, the corresponding system call text sequence is passed to the detection module as input to its BERT-based text classification model. The BERT model is pre-trained on a large scale; it mines the hidden information of the text and maps it to word embedding vectors, and a classification task deployed downstream classifies these vectors. Before use, the text classification model is fine-tuned on text sequence data of the same kind, and the trained representation layer is embedded into the host intrusion detection text classification task. The detection module receives the system call text sequence as input, maps the text to the corresponding word embedding vectors through the BERT model, and outputs the classification result through a Softmax layer at the top.

In step 4 of the invention, when the classification result output for the system call text sequence of a monitored process is an intrusion event, an alarm is raised, the system call event is stored in an alarm queue, and alarm information is sent to the user. If the classification result is a non-intrusion event, the system call event is treated as a general event and stored in the working log.

In step 5 of the invention, the system is continuously updated during long-term operation. It supports importing the classified and manually verified system call sequences stored in the log, together with their labels; the model is periodically fine-tuned and updated, so that the detection library is regularly enriched and continuously upgraded.

The monitoring process module maintains a white list of processes that may skip inspection; for a given process that is not on the white list, it starts and binds a monitoring process and manages the whole life cycle of that monitoring process. The processing module obtains the output of the corresponding monitoring process, shields differences in the underlying hardware, extracts system call events from that output, and maintains the text sequence of call events. The detection module feeds the text sequence into the fine-tuned pre-trained model to generate text embedding vectors, which are then handed to the downstream text classification task for intrusion detection classification. The log module takes different measures depending on the detection result: for an event marked as an intrusion, it sends an alarm to the user and records the event in the alarm log; for a general event, it stores the event in the general working log. All logs can be reviewed by the user during the retention period.

Beneficial effects:

The Linux host intrusion detection method provided by the invention constructs a host intrusion detection system based on text classification and performs host intrusion detection with a BERT text classification model based on a time-ordered mapping of system calls. The invention detects intrusions with high precision and strong interpretability, does not affect the normal operation of the monitored host, and supports real-time online monitoring. The final system model is independent of hardware, shields differences in the underlying hardware, is general, and facilitates migration and wide application of the model.

1. The method of the invention has high precision, is very sensitive to intrusion events, and can quickly identify suspicious events and raise an alarm.

2. The method converts traditional system calls into semantic text, so the model is strongly interpretable.

3. The method adopts a BERT-based pre-trained model, which can mine the temporal features of the system call sequence and capture latent features.

4. The method shields hardware-specific differences between system calls, is highly general, and lends itself to migration and wide application of the model.

Drawings

The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.

FIG. 1 is a flow chart of a Linux host intrusion detection method.

FIG. 2 is a schematic diagram of a detection process of an unknown process system call sequence.

FIG. 3 is a schematic diagram of a classification model of a detection module.

FIG. 4 is a schematic diagram illustrating a classification model updating process of a detection module.

Detailed Description

The invention relates to a Linux host intrusion detection method, which specifically comprises the following steps:

1. A process white list is established on the Linux host and the processes running in the system are screened. If a process is not on the white list, the monitoring process module starts a strace process to monitor the system calls of that process, with the specific parameters "-p PID -s 15", where PID is the process number; the monitoring module thereby binds strace to the monitored process and monitors its system calls;

2. After receiving the strace output for the monitored process, the processing module extracts the system call output of the corresponding process, extracts the actual system call operation, and generates the system call sequence of the monitored process. The processing module maintains one system call sequence per monitored process, and the semantics of the latest system call operation are appended to the end of the sequence;

3. The corresponding system call text sequence is passed to the detection module as input to its BERT-based text classification model. The text classification model is a BERT model fine-tuned on text sequence data of the same kind. The model embeds the trained representation layer into the host intrusion detection text classification task and outputs the classification result through a Softmax layer at the top;

4. When the classification result output for the system call sequence of a monitored process is an intrusion event, an alarm is raised and the system call event is stored in the alarm queue. If the system call events of the process are classified as intrusion on 3 consecutive detections, the process is preliminarily presumed to be carrying out a host intrusion and alarm information is sent to the user. If the classification result is a non-intrusion event, the system call event sequence is treated as a general event and stored in the working log for the user to check;

5. The system can be continuously updated during long-term operation: it supports importing labeled system call sequences, the model is periodically fine-tuned and updated, and the detection library can be regularly enriched and continuously upgraded;

6. Steps 1 to 5 are repeated to achieve real-time detection.

Examples

The invention discloses a Linux host intrusion detection method, which constructs a host intrusion detection system based on text classification, and utilizes a Bert text classification model to carry out host intrusion detection based on system calling time sequence mapping. The working process mainly comprises five steps of monitoring process starting, calling sequence maintenance, intrusion detection classification, log processing and classification model updating. The flow chart of the first four steps is shown in detail in fig. 2, and the last step of updating the classification model is shown in detail in fig. 4.

First, starting the monitoring process.

To implement host intrusion detection, a process screening filter of appropriate granularity should be set. If the granularity is too coarse, many suspicious processes are ignored and potential intrusion behavior goes unrecognized. If it is too fine, many safe processes are monitored as well, which increases the performance overhead on the system. Because the granularity is hard to control and varies with the security level required, the method generates a process white list for the system. Different security levels can correspond to white lists of different granularity, and the user can also modify the list manually.

During actual detection, the monitoring process module obtains the list of processes currently running in the Linux system and checks each process against the white list. If a process is not on the white list, its system calls need to be monitored (otherwise it is skipped). The module obtains the process number and runs the command "strace -p PID -s 15", where PID is that process number. This binds the monitoring process to the corresponding process by its process number and sets the maximum length of strings printed in the system call output to 15 (which is enough to extract the system call action).
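The screening and binding step can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the white list contents, the function names `select_unlisted` and `strace_command`, and the representation of the process list as a mapping from process number to process name are all assumptions. Only the `-p` and `-s` options of strace are taken from the description.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical white list of process names that skip monitoring.
WHITELIST: Set[str] = {"systemd", "sshd", "cron"}


def select_unlisted(processes: Dict[int, str], whitelist: Iterable[str]) -> List[int]:
    """Return the process numbers whose process name is not on the white list."""
    allowed = set(whitelist)
    return sorted(pid for pid, name in processes.items() if name not in allowed)


def strace_command(pid: int, max_string_length: int = 15) -> List[str]:
    """Build the strace argument list: attach to pid (-p), cap printed strings (-s)."""
    return ["strace", "-p", str(pid), "-s", str(max_string_length)]
```

For a process list such as `{1: "systemd", 4242: "unknown_tool"}`, only process 4242 would be selected, and the resulting argument list could be handed to a process launcher such as `subprocess.Popen`. Attaching to another process generally requires sufficient privileges on the host.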

After that, the monitor process module directs the monitoring result to the processing module for further processing.

Second, call sequence maintenance.

The processing module obtains the output of the monitoring process bound to each process and, whenever new output is produced, extracts the corresponding system call action according to the output format. At this point the strace command has already consulted the system call tables of the different hardware architectures (such as arm, bfin and m68k): for the same action, the different system call numbers used on different architectures are mapped to a unified semantic name. Because the whole model is built on strace, the underlying hardware is shielded and the same system call event always maps to the same semantics.
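Extracting the system call action from one line of monitoring output might look like the sketch below. It assumes the usual strace line layout, in which the call name comes first and is followed by an opening parenthesis, optionally preceded by a `[pid N]` prefix; the description does not specify the exact format, and the function name `extract_syscall` is hypothetical. Signal and exit notices, which do not start with a call name, are skipped.

```python
import re
from typing import Optional

# Call name at the start of the line, optionally after a "[pid N]" prefix.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([A-Za-z_][A-Za-z0-9_]*)\(")


def extract_syscall(line: str) -> Optional[str]:
    """Return the system call name in a strace output line, or None if absent."""
    match = _SYSCALL_RE.match(line.strip())
    return match.group(1) if match else None
```

Because the name is already architecture-independent text, no lookup of system call numbers is needed at this stage.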

After obtaining the semantics of a system call, the processing module appends it to the end of the system call sequence of the process. The processing module maintains a system call sequence with a maximum length of 64. While the sequence is shorter than 64, it is not sent. Once the length would exceed 64, the oldest entry at the head of the queue is popped so that the length stays at 64, and the sequence is turned into a system call statement in time order and passed to the detection module.
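The fixed-length call sequence behaves like a sliding window, which can be sketched with a bounded double-ended queue. The class name `CallSequence` and the choice of a space-separated statement are assumptions. The sketch also assumes that a statement is emitted as soon as the window is full, since the description does not say exactly what happens at a length of 64.

```python
from collections import deque
from typing import Deque, Optional


class CallSequence:
    """Sliding window over the most recent system call names of one process."""

    def __init__(self, max_length: int = 64) -> None:
        self.max_length = max_length
        # A deque with maxlen drops its oldest entry when a new one is appended.
        self._window: Deque[str] = deque(maxlen=max_length)

    def append(self, syscall: str) -> Optional[str]:
        """Add one call; return the time-ordered statement once the window is full."""
        self._window.append(syscall)
        if len(self._window) < self.max_length:
            return None
        return " ".join(self._window)
```

With `max_length=3`, appending `open`, `read`, `write`, `close` yields no statement for the first two calls, then `open read write`, then `read write close`.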

Third, intrusion detection classification.

BERT pre-trained model: the BERT model (see Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need [J]. arXiv, 2017.) is a novel pre-trained word vector model. BERT first pre-trains the model on a large unlabeled corpus and then fine-tunes it on a small labeled corpus to complete a specific natural language processing (NLP) task. Compared with earlier models, BERT demonstrates the importance of bidirectional pre-training for language representation. It uses a masking method: some tokens of the sequence are selected at random and replaced by a mask token, so that context information is used during training without leaking label information, which yields a deep bidirectional pre-trained representation. BERT also removes the need to modify the architecture for each task; as a fine-tuning-based representation model, it achieves state-of-the-art performance on a large number of sentence-level and token-level tasks.

The detection module deploys a host intrusion detection text classification task downstream of the BERT pre-trained model; its structure is shown in fig. 3. Upstream is the fine-tuned BERT pre-trained model; downstream is the text classification task. During fine-tuning, the model takes the token outputs of BERT and passes the embedding vector corresponding to the [CLS] token to the softmax logistic regression layer for classification training. The classification task may be a multi-class task (with an explicit label for each kind of intrusion) or a binary task (intrusion events versus general events). Whichever task is chosen, fine-tuning is required and the fine-tuning data must be explicitly labeled.
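The downstream classification layer can be illustrated in isolation. The sketch below assumes that the embedding vector of the [CLS] token has already been produced by the upstream model and applies a linear layer followed by softmax; the function names, the weight layout (one row per class) and the toy dimensions are assumptions, and the BERT encoder itself is not reproduced here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector: Sequence[float],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> List[float]:
    """Apply one linear layer (one weight row per class) and softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def predict(probabilities: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])
```

In the binary case the layer has two weight rows; in the multi-class case it has one row per intrusion label plus one for general events.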

Overall, the detection module takes a time-ordered system call statement as input, passes it into the text classification model, and outputs a classification result through the softmax logistic regression layer. If the classification result is 0, the system call sequence is judged to be a general event; otherwise it is identified as an intrusion event. The final classification results of the text sequences are sent to the log module for processing.

Fourth, log processing.

The log module records and stores the event sequences under test. After receiving the detection result from the detection module, if the classification result indicates an intrusion event, the process is marked as an alarm process, the system call event is stored in the alarm queue, and it is recorded in the alarm log. Detection continues on the subsequent sequences of the process; if they are classified as intrusion events 3 times in a row, an alarm is sent directly to the user with a reminder to check. If the classification result indicates no intrusion event, the system call sequence is treated as a general event and stored in the general working log (retained for 30 days), where the user can review it during the retention period.
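The alarm logic can be sketched as a per-process counter of consecutive intrusion results. The class name `LogModule` and its fields are hypothetical. Two points are assumptions of this sketch rather than statements of the description: the 3 consecutive detections are counted including the first one, and a general event resets the count. The 30-day retention of the working log is not modeled.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class LogModule:
    """Stores classified statements and decides when to alert the user."""

    def __init__(self, alert_threshold: int = 3) -> None:
        self.alert_threshold = alert_threshold
        self.alarm_queue: List[Tuple[int, str]] = []
        self.work_log: List[Tuple[int, str]] = []
        self._streak: DefaultDict[int, int] = defaultdict(int)

    def record(self, pid: int, statement: str, label: int) -> bool:
        """Store one result; return True when the user should be alerted."""
        if label == 0:  # label 0 is a general event
            self._streak[pid] = 0
            self.work_log.append((pid, statement))
            return False
        self._streak[pid] += 1
        self.alarm_queue.append((pid, statement))
        # Keeps returning True while the streak stays at or above the threshold.
        return self._streak[pid] >= self.alert_threshold
```

Three intrusion results in a row for the same process trigger the alert on the third call; a general event in between starts the count again.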

After entering the log module, the user can manually review the alarm log and the general working log and manually mark a single record. Meanwhile, the user can select records in batches for updating the model.

Fifth, updating the classification model.

Intrusion methods are constantly changing. Just as virus libraries are constantly updated, the intrusion detection database should also be updated and maintained regularly. The classification model update process of the detection module is shown in fig. 4.

The update data for the classification model comes from outside and inside the system: update records imported from outside, and records selected by the user in the log module. An update consists of fine-tuning the existing text classification model while keeping a stored copy of it; once the update is complete, the model can be replaced. The latest model has then learned the features implied by the event text sequences in the update records and can be used for subsequent detection.
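The update flow (merge the two record sources, fine-tune a copy, keep the existing model, then replace it) can be sketched as below. The class `ModelStore`, the `Record` type and the `fine_tune` callable are hypothetical placeholders; the actual fine-tuning procedure is supplied by the caller and is not specified here.

```python
import copy
from typing import Any, Callable, List, Sequence, Tuple

Record = Tuple[str, int]  # (system call statement, label)


class ModelStore:
    """Holds the active classification model and archives replaced versions."""

    def __init__(self, model: Any) -> None:
        self.current = model
        self.archive: List[Any] = []

    def update(self,
               external: Sequence[Record],
               selected: Sequence[Record],
               fine_tune: Callable[[Any, List[Record]], Any]) -> Any:
        """Fine-tune a copy on both record sources, archive the old model, swap."""
        records = list(external) + list(selected)
        candidate = fine_tune(copy.deepcopy(self.current), records)
        self.archive.append(self.current)  # the existing model is kept, not lost
        self.current = candidate
        return self.current
```

Working on a deep copy means detection can continue with the existing model until the fine-tuned one is ready, and the archived version remains available if the new model has to be rolled back.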

The present invention provides a method and a system for Linux host intrusion detection, and there are many methods and ways to implement this technical solution. The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the invention, and these should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized with the prior art.

Claims

1. A Linux host intrusion detection method, characterized in that a host intrusion detection system based on text classification is constructed; the host intrusion detection system comprises a monitoring process module, a processing module, a detection module and a log module;

the detection module executes calling event text embedding and intrusion detection based on a Bert pre-training model;

2. The method according to claim 1, wherein a system call text sequence with actual meaning is obtained through the system call mapping of the host intrusion detection system, the latent temporal features of the system call text sequence are mined with a BERT pre-trained model, and the host intrusion detection system performs the following steps to realize intrusion detection:

step 1, starting a monitoring process;

step 2, calling sequence maintenance;

step 3, intrusion detection classification;

step 4, log processing;

step 5, updating the classification model;

3. The method according to claim 2, wherein in step 1, a process white list is established on the Linux host and the processes running in the system are screened; if a running process is not on the white list, the monitoring process module starts strace, the Linux process tracing tool, to monitor system calls, and the monitoring module binds strace to the monitored process to monitor the system calls of that process.

4. The method according to claim 3, wherein in step 2, after receiving the strace output for a monitored process, the processing module maintains a system call text sequence for each monitored process, extracts the system call output of the corresponding process, extracts the actual system call operation, converts it into the corresponding system call semantics according to the mapping, appends it to the end of the system call text sequence of the monitored process, and stores the corresponding operation text in system call order; the text sequence preserves the timing information of the operations, so that latent temporal features are captured and matched against the temporal patterns of intrusion operations.

5. The method according to claim 4, wherein in step 3, the corresponding system call text sequence is passed to the detection module as input to its BERT-based text classification model; the BERT model is pre-trained on a large scale, mines the hidden information of the text and maps it to word embedding vectors, and a classification task deployed downstream classifies the word vectors; before use, the text classification model is fine-tuned on text sequence data of the same kind, and the trained representation layer is embedded into the host intrusion detection text classification task; the detection module receives the system call text sequence as input, maps the text to the corresponding word embedding vectors through the BERT model, and outputs the classification result through a Softmax layer at the top.

6. The method according to claim 5, wherein in step 4, when the classification result output for the system call text sequence of a monitored process is an intrusion event, an alarm is raised, the system call event is stored in an alarm queue, and alarm information is sent to the user; if the classification result is a non-intrusion event, the system call event is treated as a general event and stored in a working log.

7. The method according to claim 6, wherein in step 5, the system is continuously updated during long-term operation and supports importing the classified and manually verified system call sequences stored in the log, together with their labels; the model is periodically fine-tuned and updated, and the detection library is regularly enriched and continuously upgraded.

8. The method according to claim 7, wherein the monitoring process module maintains a white list of processes that may skip inspection and, for a given process that is not on the white list, starts and binds a monitoring process while managing the whole life cycle of that monitoring process; the processing module obtains the output of the corresponding monitoring process, shields differences in the underlying hardware, extracts system call events from that output, and maintains the text sequence of call events.

9. The method according to claim 8, wherein the detection module inputs a text sequence into the fine-tuned pre-trained model to generate text embedding vectors, which are then passed to a downstream text classification task for intrusion detection classification.

10. The method according to claim 9, wherein the log module takes different measures according to the detection result: for an event marked as an intrusion, it sends an alarm to the user and records the event in an alarm log; for a general event, it stores the event in a general working log; and all logs can be reviewed by the user during the retention period.