CN114266040A - Linux host intrusion detection method - Google Patents

Publication number
CN114266040A
Authority
CN
China
Prior art keywords
module
system call
text
sequence
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111527054.XA
Other languages
Chinese (zh)
Inventor
杨航
郭乔进
吴其华
高沙沙
产院东
张欣怡
张峰
汪义飞
相银堂
刘蔚棣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202111527054.XA priority Critical patent/CN114266040A/en
Publication of CN114266040A publication Critical patent/CN114266040A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a Linux host intrusion detection method. The detection system of the invention comprises a monitoring process module, a processing module, a detection module and a log module. The monitoring process module screens processes and starts monitoring processes. The processing module extracts the system call output of the monitored processes, maintains a time-ordered text sequence of system call events, and periodically sends that sequence to the detection module. The detection module places a fine-tuned BERT model upstream and a text classification task downstream: it takes the system call text sequence of a specific process as input, embeds the text into word vectors through the model, and outputs a classification result. The log module analyzes the classification results and performs logging and alerting accordingly. The method is independent of hardware, is general-purpose, and facilitates model migration and wide application.

Description

Linux host intrusion detection method
Technical Field
The invention relates to an intrusion detection method, in particular to a Linux host intrusion detection method.
Background
With the continued spread and development of computer networks, the demands on computer and network security protection have also grown. When a host is compromised, both the normal operation of the computer and the services it runs can be seriously damaged. Host intrusion detection therefore plays an important role in computer and network security. Compared with network intrusion detection, host intrusion detection requires no additional network equipment, costs less to implement, is flexible to configure, and detects effectively. As a popular open-source operating system, Linux is receiving increasing attention with respect to its security. Against this background, effective and efficient host intrusion detection for Linux has gradually become a research focus.
Research on Linux host intrusion detection already has a certain foundation, both in China and abroad. System calls are the only way for a user program to access system resources; they can be used to describe the trace of program execution, so program anomalies and intrusions can be judged from the order in which the calls are executed. Lin Xia of the University of Electronic Science and Technology studied a variable-length pattern matching algorithm for system call sequences and implemented a new double-chain tree storage and collection method that markedly reduces storage overhead and improves search efficiency. Dai Feng of Hunan Agricultural University constructs a finite state automaton for each function and uses the automaton to detect whether the system call flow is abnormal; test results show that this technique can effectively detect intrusion behavior. Zheng Haixiang of Guangdong University of Technology built a training module and a detection module based on a hidden Markov model and improved the detection method. In summary, existing host intrusion detection methods based on system call sequences mainly capture system call numbers and analyze the resulting numeric sequence. They still have shortcomings: the system call number is not directly linked to its meaning, interpretability is poor, time-series features are difficult to exploit fully, and models are difficult to migrate because system call numbers differ across architectures.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the technical problem of providing a Linux host intrusion detection method aiming at the defects of the prior art.
In order to solve the technical problem, the invention discloses a Linux host intrusion detection method.
A Linux host intrusion detection method, characterized in that a host intrusion detection system based on text classification is constructed; the host intrusion detection system comprises a monitoring process module, a processing module, a detection module and a log module;
the monitoring process module is used for process screening, white list maintenance and life cycle management of the monitoring processes;
the processing module processes the output of the monitoring process module, extracts system call behavior and maintains the text sequence of system call events;
the detection module performs call event text embedding and intrusion detection based on a BERT pre-trained model;
the log module is used for analyzing the detection results, raising intrusion alarms and storing logs.
In the invention, the host intrusion detection system uses a system call mapping to obtain a system call text sequence with actual meaning, and uses a BERT pre-trained model to mine the latent time-series features of that text sequence. The host intrusion detection system performs the following steps to realize intrusion detection:
step 1, starting a monitoring process;
step 2, calling sequence maintenance;
step 3, intrusion detection classification;
step 4, log processing;
step 5, updating the classification model;
step 6, repeating steps 1-5 to realize continuous Linux host intrusion detection.
In step 1 of the invention, a process white list is established on the Linux host and the processes running in the system are screened. If a running process is not in the white list, the monitoring process module starts the Linux process tracing tool strace, binds the strace process to the monitored process, and monitors the system calls of that process.
In step 2 of the invention, after receiving the strace output for a monitored process, the processing module maintains a system call text sequence for each monitored process: it extracts the system call output of the corresponding process, extracts the actual system call operation, converts it into system call semantics according to the mapping, and appends it to the tail of the system call text sequence of the monitored process, so that the operation text is stored in system call order. The text sequence preserves the time-order information of the operations, which allows latent time-series features to be captured and the time-series patterns of intrusion operations to be matched.
In step 3 of the invention, the corresponding system call text sequence is passed to the detection module as input to its BERT-based text classification model. The BERT model is pre-trained on a large scale; it mines the hidden information of the text and maps it to word embedding vectors, and a classification task deployed downstream classifies those vectors. Before use, the text classification model is fine-tuned on text sequence data of the same kind, and the trained representation layer is embedded into the host intrusion detection text classification task. The detection module receives the system call text sequence as input, maps the text to the corresponding word embedding vectors through the BERT model, and outputs the classification result through a Softmax layer at the top.
In step 4 of the invention, when the classification result output for the system call text sequence of a monitored process is an intrusion event, an alarm is raised, the system call event is stored in an alarm queue, and alarm information is sent to the user. If the classification result is a non-intrusion event, the system call event is regarded as a general event and stored in the working log.
In step 5 of the invention, the system is continuously updated during long-term operation. It supports importing classified and manually verified system call sequences stored in the log; these sequences are labeled, the model is periodically fine-tuned and updated with them, and the detection library is thereby periodically enriched and continuously upgraded.
The monitoring process module maintains a white list of processes that may skip inspection; for a given process that is not in the white list, it starts and binds a monitoring process and manages the whole life cycle of that monitoring process. The processing module obtains the output of the corresponding monitoring process, shields differences in the underlying hardware, extracts system call events from the output, and maintains the text sequence of call events. The detection module feeds the text sequence into the fine-tuned pre-trained model to generate text embedding vectors, which are then handed to the downstream text classification task for intrusion detection classification. The log module takes different measures according to the detection result: for results marked as intrusion events, it sends an alarm to the user and records the event in the alarm log; general events are stored in the general working log. All logs can be reviewed by the user during the retention period.
Advantageous effects:
The Linux host intrusion detection method provided by the invention constructs a host intrusion detection system based on text classification and performs host intrusion detection with a BERT text classification model based on system call time-sequence mapping. The invention detects intrusions with high precision and strong interpretability, does not affect the normal operation of the monitored host, and supports real-time online monitoring. The final system model is independent of hardware: it shields differences in the underlying hardware, is general-purpose, and facilitates migration and wide application of the model.
1. The method of the invention has high precision, is sensitive to intrusion events, and can quickly identify suspicious events and raise an alarm.
2. The method converts traditional system call numbers into semantic text, so the model is strongly interpretable.
3. The method adopts a BERT-based pre-trained model, which can mine the time-series features of the system call sequence and capture latent features.
4. The method shields the differences in system calls between hardware platforms, is highly general, and lends itself to migration and wide application of the model.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of a Linux host intrusion detection method.
FIG. 2 is a schematic diagram of a detection process of an unknown process system call sequence.
FIG. 3 is a schematic diagram of a classification model of a detection module.
FIG. 4 is a schematic diagram illustrating a classification model updating process of a detection module.
Detailed Description
The invention relates to a Linux host intrusion detection method, which specifically comprises the following steps:
1. A process white list is established on the Linux host and the processes running in the system are screened. If a process is not in the white list, the monitoring process module starts a strace process to monitor the system calls of that process, with the parameters "-p <process number> -s 15"; the monitoring module thereby binds strace to the monitored process and monitors its system calls;
2. After receiving the strace output for the monitored process, the processing module extracts the system call output of the corresponding process, extracts the actual system call operation, and generates the system call sequence of the monitored process. The processing module maintains one system call sequence per monitored process, and the semantics of the latest system call operation are appended to the tail of the sequence;
3. The corresponding system call text sequence is passed to the detection module as input to its BERT-based text classification model. The text classification model is a BERT model fine-tuned on text sequence data of the same kind. The model embeds the trained representation layer into the host intrusion detection text classification task and outputs the classification result through a Softmax layer at the top;
4. When the classification result output for the system call sequence of a monitored process is an intrusion event, an alarm is raised and the system call event is stored in an alarm queue. If the system call events of the process are classified as intrusion 3 times in succession, the process is preliminarily presumed to be carrying out host intrusion behavior and alarm information is sent to the user. If the classification result is a non-intrusion event, the system call event sequence is regarded as a general event and stored in the working log for the user to review;
5. The system is continuously updated during long-term operation: it supports importing labeled system call time sequences, the model is periodically fine-tuned and updated, and the detection library is thereby periodically enriched and continuously upgraded;
6. Steps 1-5 are repeated to achieve real-time detection.
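One pass through steps 1 to 4 above might be wired together roughly as follows. This is only an illustrative sketch: the function name `detection_cycle` and the injected callables `read_calls` and `classify` are hypothetical stand-ins for the strace reader and the fine-tuned classifier, neither of which is specified at code level in the patent.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def detection_cycle(
    running: Dict[int, str],
    whitelist: Set[str],
    read_calls: Callable[[int], Iterable[str]],
    classify: Callable[[str], int],
) -> List[Tuple[int, int]]:
    """Run steps 1-4 once and return (pid, label) pairs for the log module."""
    results = []
    for pid, name in running.items():
        if name in whitelist:
            # Step 1: whitelisted processes are not monitored.
            continue
        # Step 2: join the traced call names into one time-ordered sentence.
        sentence = " ".join(read_calls(pid))
        # Step 3: label 0 means general event, anything else means intrusion.
        label = classify(sentence)
        # Step 4: the result is handed on for logging and alerting.
        results.append((pid, label))
    return results
```

Passing the reader and classifier in as parameters keeps the loop independent of how tracing and inference are actually implemented.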
Examples
The invention discloses a Linux host intrusion detection method, which constructs a host intrusion detection system based on text classification, and utilizes a Bert text classification model to carry out host intrusion detection based on system calling time sequence mapping. The working process mainly comprises five steps of monitoring process starting, calling sequence maintenance, intrusion detection classification, log processing and classification model updating. The flow chart of the first four steps is shown in detail in fig. 2, and the last step of updating the classification model is shown in detail in fig. 4.
First, starting the monitoring process.
To implement host intrusion detection, a process screening filter of appropriate granularity should be set. If the granularity is too coarse, many suspicious processes are ignored and potential intrusion behavior goes unrecognized. If the granularity is too fine, many safe processes are monitored as well, which increases the system performance overhead. Because the granularity is difficult to control and varies with the requirements of different security levels, the method generates a process white list for the system. Different security levels can correspond to process white lists of different granularity, and the user can also modify the list manually.
During actual detection, the monitoring process module obtains the list of processes currently running in the Linux system and checks each process against the white list. If the process is not on the white list, its system calls need to be monitored (otherwise the process is skipped). The module obtains the process number of the process and runs the command "strace -p <process number> -s 15", which binds the monitoring process to the corresponding process by its process number and sets the maximum string length printed in the system call output to 15 (enough to extract the system call action).
After that, the monitoring process module directs the monitoring output to the processing module for further processing.
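The screening and attach step could look roughly like this in Python. The helper names (`pids_to_monitor`, `strace_command`, `start_monitor`) are hypothetical, and matching the white list by process name is an assumption, since the patent does not say what the white list entries contain. The `-p` and `-s` options are the ones named in the text; strace writes its trace to standard error by default, and attaching to another process normally requires sufficient privileges.

```python
import subprocess
from typing import Dict, List, Set


def pids_to_monitor(running: Dict[int, str], whitelist: Set[str]) -> List[int]:
    """Return the PIDs whose process name is not on the white list."""
    return sorted(pid for pid, name in running.items() if name not in whitelist)


def strace_command(pid: int, max_string_len: int = 15) -> List[str]:
    """Build the argument list for `strace -p <pid> -s 15`."""
    return ["strace", "-p", str(pid), "-s", str(max_string_len)]


def start_monitor(pid: int) -> subprocess.Popen:
    """Attach strace to a running process and expose its trace as a text pipe."""
    return subprocess.Popen(
        strace_command(pid),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # strace prints the trace on stderr
        text=True,
    )
```

Keeping the command construction separate from the process launch makes the screening logic easy to check without actually attaching to anything.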
Second, call sequence maintenance.
The processing module obtains the output of the monitoring process bound to the monitored process and, whenever new output is produced, extracts the corresponding system call action according to the output format. The strace command already consults the system call tables of the different hardware architectures (such as arm, bfin, m68k), so the different system call numbers that the same action has on different architectures are mapped to one unified semantic name. Because the whole model is built on strace output, the underlying hardware is shielded and the same system call event always maps to the same semantics.
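Extracting the call name from one line of strace output can be sketched with a regular expression. The pattern below is an assumption about the usual line shape (`name(args) = result`, optionally prefixed by `[pid N]` when several processes are traced); exit notices, signal lines and resumed-call fragments are simply skipped. The function name `extract_syscall` is hypothetical.

```python
import re
from typing import Optional

# A call name at the start of the line, optionally after a "[pid N]" prefix.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def extract_syscall(line: str) -> Optional[str]:
    """Return the system call name on a strace output line, or None."""
    match = _SYSCALL_RE.match(line.strip())
    return match.group(1) if match else None
```

Because strace already prints names such as `openat` or `read` rather than architecture-specific numbers, the extracted name can be used directly as the unified semantic token.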
After obtaining the semantics of a system call, the processing module appends it to the end of the system call sequence of the process. The processing module maintains a system call sequence with a maximum length of 64. While the sequence is shorter than 64, no call sequence is sent. Once the length would exceed 64, the text at the head of the queue is popped so that the queue length stays at 64, and the sequence is turned into a system call statement in time order and passed to the detection module.
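The fixed-length queue described above maps naturally onto a bounded deque. In this sketch the class name `CallWindow` is hypothetical, and one detail is assumed: a statement is emitted as soon as the window first holds 64 calls, and again on every later call, since the text does not say what happens at exactly 64.

```python
from collections import deque
from typing import Deque, Optional


class CallWindow:
    """Sliding window over the most recent system call names of one process."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self.calls: Deque[str] = deque(maxlen=size)

    def push(self, call: str) -> Optional[str]:
        """Append one call; return the time-ordered statement once the window is full."""
        self.calls.append(call)
        if len(self.calls) < self.size:
            return None
        return " ".join(self.calls)
```

One such window would be kept per monitored process, so that calls from different processes never mix in a single statement.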
Third, intrusion detection classification.
BERT pre-trained model: the BERT model, which is built on the Transformer architecture (see Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need [J]. arXiv, 2017.), is a novel pre-trained word vector model. BERT first pre-trains the model on a large unlabeled corpus and then fine-tunes it on a small labeled corpus to complete a specific natural language processing (NLP) task. Compared with earlier models, BERT demonstrates the importance of bidirectional pre-training for language representation. It uses a masking method: some randomly selected tokens of the sequence are replaced with a mask marker, so that context information is used during training without leaking label information, which yields deep bidirectional pre-trained representations. BERT also reduces the need for engineering work that modifies the architecture for each task; as a fine-tuning-based representation model, it achieved state-of-the-art performance on a large number of sentence-level and token-level tasks.
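The masking idea can be illustrated with a deliberately simplified sketch: each token is independently replaced by a mask marker with some probability, and the original token is kept as the training target for that position. This is not BERT's exact procedure (the published recipe selects about 15% of tokens and replaces only most of those with the mask marker); the function name `mask_tokens` and the `rate` parameter are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], rate: float, rng: random.Random
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly mask tokens; labels hold the original token at masked positions."""
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            labels.append(token)  # the model must recover this token from context
        else:
            masked.append(token)
            labels.append(None)   # this position is not scored
    return masked, labels
```

Applied to a system call statement, the model would have to predict a hidden call name from the calls on both sides of it, which is what makes the learned representation bidirectional.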
The detection module deploys the host intrusion detection text classification task downstream of the BERT pre-trained model; the structure is shown in fig. 3. Upstream is the fine-tuned BERT pre-trained model; downstream is the text classification task. During fine-tuning, the model takes the token outputs of BERT and passes the embedding vector corresponding to the [CLS] token to the softmax logistic regression layer for classification training. The classification task may be a multi-class task (with a distinct class label for each kind of intrusion) or a binary task (intrusion events versus general events only). In either case, fine-tuning is required and the fine-tuning data must be explicitly labeled.
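The softmax layer on top of the [CLS] embedding can be written out as follows. The symbols are introduced here for illustration and do not appear in the patent: h is the [CLS] embedding of dimension d, W and b are the weights and bias of the classification layer, K is the number of classes (K = 2 in the binary case), and y is the labeled class of a fine-tuning example. The cross-entropy loss is the usual choice for such a layer and is an assumption, since the patent does not name the loss.

```latex
\mathbf{z} = W\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b},
\qquad W \in \mathbb{R}^{K \times d},\ \mathbf{b} \in \mathbb{R}^{K}

p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)},
\qquad k = 1, \dots, K

\mathcal{L} = -\log p_{y}
```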
Overall, the detection module receives a time-ordered system call statement, passes it as input to the text classification model, and outputs a classification result through the softmax logistic regression layer. If the classification result is 0, the system call sequence is judged to be a general event; otherwise it is identified as an intrusion event. The final classification result for the text sequence is sent to the log module for processing.
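The decision rule at the end of the detection module can be sketched in plain Python, leaving out the BERT encoder itself. The functions `softmax`, `classify` and `is_intrusion` are hypothetical names; `cls_vector` stands for the [CLS] embedding that the encoder would produce, and `weights` and `bias` stand for the parameters of the classification layer.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    cls_vector: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> int:
    """Apply a linear layer plus softmax and return the most probable class index."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def is_intrusion(label: int) -> bool:
    """Class 0 is a general event; every other class is an intrusion event."""
    return label != 0
```

The same rule covers both the binary and the multi-class setting, because any nonzero class is treated as some kind of intrusion.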
Fourth, log processing.
The log module records and stores the event sequences under test. After receiving a detection result from the detection module, if the classification result indicates an intrusion event, the process is marked as an alarm process, the system call event is stored in the alarm queue, and the event is recorded in the alarm log. Detection then continues on the subsequent sequences of the process; if they are classified as intrusion events 3 times in succession, an alarm is sent directly to the user with a reminder to check. If the classification result indicates no intrusion event, the system call sequence is regarded as a general event and stored in the general working log (retained for 30 days), where the user can review it during the retention period.
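The alarm rule can be sketched as a per-process streak counter. The class name `LogModule` and its fields are hypothetical, and two details are assumed because the text leaves them open: the first intrusion result counts toward the three consecutive detections, and a general event resets the streak. The 30-day retention of the working log is not modeled here.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class LogModule:
    """Routes classification results to an alarm queue or a general working log."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.streak: DefaultDict[int, int] = defaultdict(int)
        self.alarm_queue: List[Tuple[int, str]] = []
        self.work_log: List[Tuple[int, str]] = []

    def record(self, pid: int, sentence: str, label: int) -> bool:
        """Store one result; return True when the user should be alerted."""
        if label == 0:
            # General event: reset the streak and keep it in the working log.
            self.streak[pid] = 0
            self.work_log.append((pid, sentence))
            return False
        # Intrusion event: queue it and count consecutive detections.
        self.streak[pid] += 1
        self.alarm_queue.append((pid, sentence))
        return self.streak[pid] >= self.threshold
```

Requiring several consecutive detections before notifying the user trades a short delay for fewer alerts caused by a single misclassified window.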
After entering the log module, the user can manually review the alarm log and the general working log and manually mark a single record. Meanwhile, the user can select records in batches for updating the model.
Fifth, updating the classification model.
Intrusion techniques change constantly. Just as virus libraries are continually updated, the intrusion detection database should also be updated and maintained regularly. The classification model update process of the detection module is shown in fig. 4.
The update data for the classification model comes from both outside and inside the system: update records imported from outside and records selected by the user in the log module. An update consists of fine-tuning the existing text classification model while keeping a stored copy of the existing model; the model is replaced once the update is complete. The latest model has then also learned the features implied by the event text sequences in the update records and can be used for subsequent detection.
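The update flow can be sketched as two small functions: one merges the external records with the user-selected log records, the other fine-tunes and keeps the previous model as a backup. All names here (`merge_records`, `update_model`, the `fine_tune` callable) are hypothetical, and dropping exact duplicates is an added assumption rather than something the patent specifies.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Model = TypeVar("Model")
Record = Tuple[str, int]  # (system call statement, label)


def merge_records(external: Iterable[Record], selected: Iterable[Record]) -> List[Record]:
    """Combine both sources, dropping exact duplicates while keeping order."""
    seen = set()
    merged: List[Record] = []
    for record in list(external) + list(selected):
        if record not in seen:
            seen.add(record)
            merged.append(record)
    return merged


def update_model(
    current: Model,
    records: List[Record],
    fine_tune: Callable[[Model, List[Record]], Model],
) -> Tuple[Model, Model]:
    """Return (new_model, backup); the current model is kept as the backup."""
    if not records:
        return current, current
    return fine_tune(current, records), current
```

Returning the previous model alongside the new one lets the caller swap models only after the update has finished, and roll back if the new model performs worse.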
The present invention provides a Linux host intrusion detection method and system, and there are many ways to implement this technical solution. The above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also fall within the protection scope of the invention. Any component not specified in this embodiment can be realized with the prior art.

Claims (10)

1. A Linux host intrusion detection method, characterized in that a host intrusion detection system based on text classification is constructed; the host intrusion detection system comprises a monitoring process module, a processing module, a detection module and a log module;
the monitoring process module is used for process screening, white list maintenance and monitoring process life cycle management;
the processing module processes the output of the monitoring process module, extracts the system calling behavior and maintains the text sequence of the system calling event;
the detection module executes calling event text embedding and intrusion detection based on a Bert pre-training model;
the log module is used for analyzing the detection result, alarming the intrusion and storing the log.
2. The method according to claim 1, wherein the host intrusion detection system uses a system call mapping to obtain a system call text sequence with actual meaning and uses a BERT pre-trained model to mine the latent time-series features of the system call text sequence, and the host intrusion detection system performs the following steps to realize intrusion detection:
step 1, starting a monitoring process;
step 2, calling sequence maintenance;
step 3, intrusion detection classification;
step 4, log processing;
step 5, updating the classification model;
step 6, repeating steps 1-5 to realize continuous Linux host intrusion detection.
3. The method according to claim 2, wherein in step 1, a process white list is created on the Linux host and the processes running in the system are screened; if a running process is not in the white list, the monitoring process module starts the Linux process tracing tool strace to monitor system calls, and binds the strace process to the monitored process to monitor the system calls of that process.
4. The method according to claim 3, wherein in step 2, after receiving the strace output for a monitored process, the processing module maintains a system call text sequence for each monitored process, extracts the system call output of the corresponding process, extracts the actual system call operation, converts it into system call semantics according to the mapping, appends it to the tail of the system call text sequence of the monitored process, and stores the corresponding operation text in system call order; the text sequence preserves the time-order information of the operations, so that latent time-series features are captured and the time-series patterns of intrusion operations are matched.
5. The method according to claim 4, wherein in step 3, the corresponding system call text sequence is passed to the detection module as input to its BERT-based text classification model; the BERT model is pre-trained on a large scale, mines the hidden information of the text and maps it to word embedding vectors, and a classification task deployed downstream classifies the word vectors; the text classification model is fine-tuned on text sequence data of the same kind before use, and the trained representation layer is embedded into the host intrusion detection text classification task; the detection module receives the system call text sequence as input, maps the text to the corresponding word embedding vectors through the BERT model, and outputs the classification result through a Softmax layer at the top.
6. The method according to claim 5, wherein in step 4, when the classification result of the system call text sequence output of the monitored process is an intrusion event, an alarm is issued, the system call event is stored in an alarm queue, and an alarm message is issued to the user, and if the classification result is a non-intrusion event, the system call event is regarded as a general event and stored in a working log.
7. The method according to claim 6, wherein in step 5, the system is continuously updated during long-term operation and supports importing classified and manually verified system call sequences stored in the log; the sequences are labeled, the model is periodically fine-tuned and updated, and the detection library is periodically enriched and continuously upgraded.
8. The method according to claim 7, wherein the monitor process module is capable of maintaining a white list of processes that can skip inspection, and for a given process, if it is not in the white list, the monitor process is started and bound, while managing the entire life cycle of the monitor process; the processing module can obtain the output of the corresponding monitoring process, shield the difference of the bottom layer of the hardware, extract the system calling event from the difference and maintain the text sequence of the calling event.
9. The method according to claim 8, wherein the detection module is capable of inputting the text sequence into the fine-tuned pre-trained model to generate text embedding vectors, which are then handed to the downstream text classification task for intrusion detection classification.
10. The method according to claim 9, wherein the log module is capable of taking different measures according to the detection result, and for the intrusion event flag, sending an alarm to the user, recording in a warning log, and for a general event, storing in a general working log, and all logs can be reviewed and viewed by the user in a storage period.
CN202111527054.XA 2021-12-14 2021-12-14 Linux host intrusion detection method Pending CN114266040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111527054.XA CN114266040A (en) 2021-12-14 2021-12-14 Linux host intrusion detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111527054.XA CN114266040A (en) 2021-12-14 2021-12-14 Linux host intrusion detection method

Publications (1)

Publication Number Publication Date
CN114266040A true CN114266040A (en) 2022-04-01

Family

ID=80827001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111527054.XA Pending CN114266040A (en) 2021-12-14 2021-12-14 Linux host intrusion detection method

Country Status (1)

Country Link
CN (1) CN114266040A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115983171A (en) * 2023-03-17 2023-04-18 摩尔线程智能科技(北京)有限责任公司 Method and simulation platform for post-simulation of system on chip

Similar Documents

Publication Publication Date Title
CN109697162B (en) Software defect automatic detection method based on open source code library
CN111475804B (en) Alarm prediction method and system
US10552762B2 (en) Machine learning of physical conditions based on abstract relations and sparse labels
US20200410164A1 (en) Methods and systems using cognitive artifical intelligence to implement adaptive linguistic models to process data
CN112579728B (en) Behavior abnormity identification method and device based on mass data full-text retrieval
CN109840157A (en) Method, apparatus, electronic equipment and the storage medium of fault diagnosis
US20210067531A1 (en) Context informed abnormal endpoint behavior detection
CN111522708B (en) Log recording method, computer equipment and storage medium
KR20170035892A (en) Recognition of behavioural changes of online services
CN111611218A (en) Distributed abnormal log automatic identification method based on deep learning
US10365995B2 (en) Composing future application tests including test action data
CN109753286A (en) A method of the code method based on functional label counts its call number
CN111428236A (en) Malicious software detection method, device, equipment and readable medium
CN109801151A (en) Financial fraud risk monitoring and control method, apparatus, computer equipment and storage medium
CN111552843A (en) Fault prediction method based on weighted causal dependency graph
CN107111609A (en) Lexical analyzer for neural language performance identifying system
CN111860981A (en) Enterprise national industry category prediction method and system based on LSTM deep learning
Zeng et al. EtherGIS: a vulnerability detection framework for ethereum smart contracts based on graph learning features
CN110245077A (en) A kind of response method and equipment of program exception
CN114266040A (en) Linux host intrusion detection method
US20220198331A1 (en) Machine model update method and apparatus, medium, and device
KR20210011822A (en) Method of detecting abnormal log based on artificial intelligence and system implementing thereof
CN114356257A (en) Log printing method, apparatus, computer device, storage medium, and program product
Haug et al. Change detection for local explainability in evolving data streams
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination