CN110334508A

CN110334508A - Host sequence intrusion detection method

Info

Publication number: CN110334508A
Application number: CN201910596409.7A
Authority: CN
Inventors: 卢逸君
Original assignee: Information Security Test And Appraisal Center Guangdong Province
Current assignee: Information Security Test And Appraisal Center Guangdong Province
Priority date: 2019-07-03
Filing date: 2019-07-03
Publication date: 2019-10-15
Anticipated expiration: 2039-07-03
Also published as: CN110334508B

Abstract

A host sequence intrusion detection method, comprising the following steps: S1, extracting m characteristic commands from each training sequence and each test sequence; S2, constructing a characteristic command set from the characteristic commands extracted from the training sequences and test sequences; S3, calculating the distribution of the training sequences in the feature-dimension space of the characteristic command set; S4, mapping the m characteristic commands extracted from each test sequence to a new vector in the characteristic command set, forming a distribution vector in the feature-dimension space of the characteristic command set; S5, for the vector formed from each test sequence, finding the k most similar training sequences in the distribution, and taking the type that occurs most frequently among the types of those k training sequences as the discriminated type of the test sequence. The invention provides a host sequence intrusion detection method that is low in cost, simple to implement, and good in overall performance.

Description

Host sequence intrusion detection method

Technical field

The invention belongs to the field of computer network intrusion detection, and in particular relates to a host sequence intrusion detection method.

Background technique

In host-based computer network intrusion detection, abnormal user behavior is detected mainly through host system call sequences (host sequences for short). In host sequence intrusion detection, the host sequence under inspection is the sequence of low-level operating system commands that the user invokes through the command line or through programs. A "sequence" denotes the series of commands collected, and a "command" denotes a single command within it.

Existing intrusion detection methods based on host system call sequences fall mainly into the following four classes:

1. Methods based on serialized features

Modeling serialized features with N-Gram is the more mainstream technique. The method was proposed in 1996: system calls are treated as words and call sequences as phrases; with k as the sequence length, the window size is k+1, and as the window slides, the set of subsequent sequences of each word is recorded in a database. The drawbacks of this method are a high false alarm rate and the need to build a sufficiently large feature database. Experiments on the ADFA data set show that although the TPR of this method can exceed 90% and its efficiency is high, the false alarm rate reaches 30%, and lowering it requires a sufficiently large number of training sequences.
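The sliding-window idea can be illustrated with a short Python sketch. It is a simplified variant written for illustration, not the 1996 algorithm itself: it stores every length-n window seen in normal traces and reports the fraction of unseen windows in a new trace. The function names `build_ngram_db` and `mismatch_rate` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def build_ngram_db(traces: Iterable[List[int]], n: int) -> Set[Tuple[int, ...]]:
    """Collect every length-n window that occurs in the normal training traces."""
    db: Set[Tuple[int, ...]] = set()
    for trace in traces:
        for i in range(len(trace) - n + 1):
            db.add(tuple(trace[i:i + n]))
    return db


def mismatch_rate(trace: List[int], db: Set[Tuple[int, ...]], n: int) -> float:
    """Fraction of length-n windows in `trace` that never occurred in training."""
    windows = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
    if not windows:
        return 0.0
    unseen = sum(1 for w in windows if w not in db)
    return unseen / len(windows)


normal_db = build_ngram_db([[1, 2, 3, 4]], n=2)
print(mismatch_rate([1, 2, 4], normal_db, n=2))
```

A trace is flagged when its mismatch rate exceeds a chosen threshold, which is why a small training set (and hence a small window database) tends to inflate the false alarm rate.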

2. Methods based on document-frequency statistical features

Frequency-based methods include bag-of-words, TF-IDF, HMM and others. Taking term frequency-inverse document frequency (TF-IDF) as the representative, the method vectorizes system calls: sequences are first scored with TF-IDF, and algorithms such as SVM and KNN then classify the scored sequences. The drawback of frequency-based characteristic command extraction is that feature extraction is based entirely on probability rather than on semantic features, so important features may be lost. Experiments on the ADFA data set show that when TF-IDF is used to extract a small number of characteristic commands from a sequence, the features do not cluster well, and computing IDF consumes considerable computing resources.

3. Methods based on word vector and sentence vector embedding models

This approach does not consider word frequency; instead it extracts features from the similarity between words, measured as distance in a high-dimensional space. The sequences are used to train a shallow neural network, each command or each string of sub-commands in the training set is mapped into a vector space of specified dimension, and dimensionality reduction is then performed. The drawback of this class of algorithms is that computation is resource-intensive, whereas the application scenarios of host-based intrusion detection call for more lightweight algorithms.

4. Methods based on neural networks

Ghosh et al. applied artificial neural networks (ANN) to misuse detection and anomaly detection; the ROC curve shows a TPR of 77.3% and an FPR of 2.2%. Han and Cho introduced evolutionary neural networks (ENN) in their research, with a 2:1 ratio of normal records to attack data in the training sequences. Experiments on the DARPA99 data set show that ENN achieves a false alarm rate of only 0.0011% and a detection rate of 100%, with an ENN training time of about 1 hour. Methods such as ENN are prone to over-fitting, and the TPR of the ANN method is not satisfactory.

Summary of the invention

The main object of the present invention is to overcome the deficiencies of the prior art and to provide a host sequence intrusion detection method that is low in cost, simple to implement, and good in overall performance.

To achieve the above object, the invention adopts the following technical scheme:

A host sequence intrusion detection method, comprising the following steps:

S1, extracting m characteristic commands from each training sequence and each test sequence;

S2, constructing a characteristic command set from the characteristic commands extracted from the training sequences and test sequences;

S3, calculating the distribution of the training sequences in the feature-dimension space of the characteristic command set;

S4, mapping the m characteristic commands extracted from each test sequence to a new vector in the dimensional space of the characteristic command set, forming a distribution vector in the feature-dimension space of the characteristic command set;

S5, for the vector formed from each test sequence, finding the k most similar characteristic command sequences in the distribution, and taking the type that occurs most frequently among the types corresponding to those k characteristic command sequences as the discriminated type of the test sequence.

Further, the method also comprises the following steps:

S6, judging whether m is greater than a set threshold; if not, updating the value of m to m+1 and repeating steps S1-S5;

S7, evaluating the classification performance of the type discrimination results under different values of m, and taking the discriminated type under the value of m with the best classification performance as the final discriminated type.

In step S7, TPR and FPR are used as the indicators for evaluating classification performance.

In step S1, extracting m characteristic commands from a sequence comprises:

converting the sequence into a directed weighted graph G = (V, E), where V is the vertex set of the commands in the sequence, V = {V₁, V₂, …, V_i, …, V_n | 1 ≤ i ≤ n}, and E is the set of edges with weights w, representing the context relationships between commands;

calculating the score of command V_i according to the following formula:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where, for commands V_i and V_j, w_ij is the context weight between any two commands V_i and V_j in the sequence, namely the number of times the command immediately following V_i is V_j; In(V_i) is the set of commands whose subsequent command is V_i; Out(V_j) is the set of commands that follow V_j; d is the damping coefficient, with value range (0, 1); and WS(V_j) is the score of command V_j;

when calculating the score of each command, a preset initial value is assigned to all commands, and the calculation is repeated recursively over multiple iterations until convergence;

after the WS values of all commands in the sequence have been calculated by the above method, the m commands with the largest WS values are extracted.

Preferably, d is 0.85.

In step S5, the k vectors in the distribution nearest to the vector are calculated, and the mode of their class labels, taken by majority vote, is used as the discriminated type of the test sequence.

Step S5 comprises:

S51, calculating the distance between every vector in the distribution and the test vector;

S52, sorting by distance, for example in ascending order;

S53, selecting the k vectors with the smallest distance to the test vector;

S54, determining the frequency of occurrence of the classes to which the k vectors belong;

S55, returning the class with the highest frequency of occurrence as the predicted class of the test vector.

In step S1, the initial value of m is greater than or equal to 1.

A computer-readable storage medium storing a computer program, the computer program being executable by a processor to implement the method.

The invention has the following beneficial effects:

The invention proposes an intrusion detection method for host-based intrusion detection systems (HIDS). It addresses the sensitivity of host intrusion detection systems to algorithm time cost and to intrusion variants. Compared with the prior art, it is an anomaly detection method that is low in cost, easy to implement, and good in overall performance, and it achieves better characteristic command extraction and detection results.

The advantages of embodiments of the method include:

1. The m characteristic commands extracted from each sequence by this method carry a degree of semantic information, so representative commands are extracted effectively and key attack behavior is highlighted.

2. The command scores calculated by this method do not depend on the training sequences; compared with the traditional STIDE algorithm, FPR performance is effectively improved when the number of training samples is small.

3. Compared with the traditional TF-IDF method, this method needs fewer resources to extract characteristic commands when m is small, because computing IDF in the TF-IDF algorithm has higher complexity. When m is small, assuming N training sequences of average length P, the total length of all sequences is P*N; TF-IDF takes f(P*N) time, whereas this method takes f(N).

4. The extracted characteristic commands carry a degree of semantic information, so key commands are extracted effectively without regard to how those commands appear in the training sequences; the method therefore copes with new attack types and with attack sequences that have been deliberately altered.

5. The method effectively improves detection performance when samples are sparse and the training set is imbalanced.

6. Compared with TF-IDF extraction, the method improves characteristic command extraction efficiency; compared with the STIDE method, it improves false alarm rate performance.

Brief description of the drawings

Fig. 1 is a flow chart of the host sequence intrusion detection method according to an embodiment of the present invention.

Fig. 2 shows the result of converting a sequence into a directed graph in an embodiment of the present invention.

Fig. 3 compares the detection results of an embodiment of the present invention and the STIDE method on the ADFA data set.

Fig. 4 compares the number of features extracted by an embodiment of the present invention and by the TF-IDF method on the ADFA data set.

Fig. 5 compares the detection time of an embodiment of the present invention and the TF-IDF method on the ADFA data set.

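Detailed description of the embodiments

Embodiments of the present invention are described in detail below. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its application.

Referring to Fig. 1, in one embodiment, a host sequence intrusion detection method comprises the following steps:

S1, extracting m characteristic commands from each training sequence and each test sequence;

S2, constructing a characteristic command set from the characteristic commands extracted from the training sequences and test sequences;

S3, calculating the distribution of the training sequences in the feature-dimension space of the characteristic command set;

S4, mapping the m characteristic commands extracted from each test sequence to a new vector in the dimensions of the characteristic command set, forming a distribution vector in the feature-dimension space of the characteristic command set;

S5, for the vector formed from each test sequence, finding the k most similar characteristic command sequences in the distribution, and taking the type that occurs most frequently among the types corresponding to those k characteristic command sequences as the discriminated type of the test sequence.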

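In a preferred embodiment, the method further comprises the following steps:

S6, judging whether m is greater than a set threshold; if not, updating the value of m to m+1 and repeating steps S1-S5;

S7, evaluating the classification performance of the type discrimination results under different values of m, and taking the discriminated type under the value of m with the best classification performance as the final discriminated type.

In a preferred embodiment, in step S7, TPR and FPR are used as the indicators for evaluating classification performance.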

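In a preferred embodiment, in step S1, extracting m characteristic commands from a sequence comprises:

converting the sequence into a directed weighted graph G = (V, E), where V is the vertex set of the commands in the sequence and E is the set of weighted edges representing the context relationships between commands;

calculating the score of command V_i according to the following formula:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where w_ij is the number of times the command immediately following V_i is V_j, In(V_i) is the set of commands whose subsequent command is V_i, Out(V_j) is the set of commands that follow V_j, d is the damping coefficient with value range (0, 1), and WS(V_j) is the score of command V_j. A preset initial value is assigned to all commands, the calculation is iterated until convergence, and the m commands with the largest WS values are extracted.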

The m characteristic commands extracted from each sequence by this method carry a degree of semantic information, so representative commands are extracted effectively and key attack behavior is highlighted. Because the extracted characteristic commands carry this semantic information, key attack commands are extracted effectively without regard to how they appear in the training data, and the method copes with new attack types and with attack sequences that have been deliberately altered. In this method the command score does not depend on the training sequences; compared with the traditional STIDE algorithm, FPR performance is effectively improved when the number of training samples is small. The method effectively improves detection performance when samples are sparse and the training set is imbalanced. Compared with TF-IDF extraction, the method improves characteristic command extraction efficiency; compared with the STIDE method, it improves false alarm rate performance.

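In a preferred embodiment, d is 0.85.

In a preferred embodiment, in step S5, the k vectors in the distribution nearest to the vector are calculated, and the mode of their class labels, taken by majority vote, is used as the discriminated type of the test sequence.

In a preferred embodiment, step S5 comprises:

S51, calculating the distance between every vector in the distribution and the test vector;

S52, sorting by distance, for example in ascending order;

S53, selecting the k vectors with the smallest distance to the test vector;

S54, determining the frequency of occurrence of the classes to which the k vectors belong;

S55, returning the class with the highest frequency of occurrence as the predicted class of the test vector.

In one embodiment, in step S1, the initial value of m is greater than or equal to 1.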

The features and advantages of specific embodiments of the invention are further described below with reference to the drawings.

In host sequence intrusion detection, the host sequence under inspection is the sequence of low-level operating system commands that the user invokes through the command line or through programs. A "sequence" denotes a series of commands, and a "command" denotes a single command within it. Each sequence corresponds to one classification result, namely "normal" or "abnormal".

The method of the invention can be divided into two major steps: the first is characteristic command extraction from sequences, and the second is classification and detection.

Step 1: characteristic command extraction. Characteristic commands are extracted from the system call sequences that make up the host sequences.

In this step, the host sequence is converted into a directed weighted graph G = (V, E), where V denotes the commands, converted into the vertex set, and E denotes the context relationships between commands, converted into the edges of the graph. The context weight between any two commands V_i and V_j is w_ij, the number of times the command immediately following V_i is V_j. For a given command V_i, In(V_i) is the set of commands pointing to it, and Out(V_i) is the set of commands that V_i points to.
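As a minimal sketch of this conversion, the Python function below counts adjacent command pairs to obtain the weights w_ij and derives In(V_i) and Out(V_i) from them. The name `build_graph` is hypothetical, and keeping self-loops (a command followed by itself) is an assumption, since the text does not say how they are treated.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_graph(sequence: List[int]) -> Tuple[Dict[Tuple[int, int], int],
                                              Dict[int, Set[int]],
                                              Dict[int, Set[int]]]:
    """Return (weights, in_sets, out_sets) for a command sequence.

    weights[(i, j)] is the number of times command i is immediately
    followed by command j; in_sets[i] and out_sets[i] play the roles
    of In(V_i) and Out(V_i).
    """
    weights: Dict[Tuple[int, int], int] = defaultdict(int)
    in_sets: Dict[int, Set[int]] = {c: set() for c in sequence}
    out_sets: Dict[int, Set[int]] = {c: set() for c in sequence}
    for cur, nxt in zip(sequence, sequence[1:]):
        weights[(cur, nxt)] += 1   # assumption: self-loops are counted too
        out_sets[cur].add(nxt)
        in_sets[nxt].add(cur)
    return dict(weights), in_sets, out_sets


# A short illustrative sequence of system call numbers.
weights, in_sets, out_sets = build_graph([6, 6, 63, 6, 42, 120, 6])
print(weights)
print(in_sets[6], out_sets[6])
```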

The score of command V_i is then calculated by the following formula:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where d is the damping coefficient with value range (0, 1), representing the probability that a given command points to any other command and usually taken as 0.85, and WS(V_j) is the score of command V_j.

The characteristic command extraction process for each sequence is: calculate the WS values of all commands in the sequence by the above method, and take the m commands with the largest WS values.

When calculating the score of each command with this method, a specific initial value is assigned to all commands and the calculation is repeated recursively until convergence. Because the WS scores of the commands depend on one another, multiple iterations are needed before they converge.

The method embodies the following idea: if a command appears after many other commands, the command is more important; and if a command immediately follows a command with a very high WS score, its own WS score rises accordingly.

It is called for example, recording a system with serializing, certain a string system call sequence is with following system call number To indicate:

6 6 63 6 42 120 6 195 120 6 6 114 114 1 1 2 5 2 2 5 2 2 5 2 1 1 1 1 1 1 1 1 2 5 2 1 1 1 1 1 2 5 2 1 1 1 1 2 5 2 2 5 2 2 5 2 2 5 2 2 5 2 1 1 2 5 2

This is an example of extracting characteristic commands from one sequence (the string of numbers is the sequence being processed, and each number denotes one kind of command). The commands of the sequence are converted by the above method so that the commands and their context relationships form a directed graph (see Fig. 2, which shows the result of converting an attack sequence into a directed graph). The WS formula is then used to calculate the score of each command; after the scores of all commands in a sequence have been calculated, the m commands with the highest scores are selected as the features of the sequence, which also serves to reduce dimensionality. With the number of characteristic commands m=4, the characteristic commands finally extracted are 1,252,6,120. These four commands are selected because they have the highest scores under the preceding formula, without regard to how often they occur in the training sequences. As the selection result in Fig. 2 shows, these commands occupy key positions in the directed graph and are structurally representative. m may take other values; 4 is used here only as an example.

The core characteristic command extraction code is specified as follows:

Input: the sequence to be processed, sequence; the number of characteristic commands, m

Output: the characteristic commands sorted by WS value in descending order
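The code listing itself is not reproduced in this text. The following Python sketch shows one way the specified input and output could be realized from the WS formula; it is an illustration under stated assumptions, not the original implementation. The initial score of 1.0, the convergence tolerance `tol`, the iteration cap `max_iter`, and breaking score ties by command number are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def extract(sequence: List[int], m: int, d: float = 0.85,
            tol: float = 1e-6, max_iter: int = 200) -> List[int]:
    """Return the m commands with the largest WS score, in descending order."""
    commands = sorted(set(sequence))
    if not commands:
        return []
    # w[j][i]: number of times command j is immediately followed by command i.
    w: Dict[int, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(sequence, sequence[1:]):
        w[cur][nxt] += 1
    out_weight = {j: sum(w[j].values()) for j in commands}
    # preds[i] plays the role of In(V_i): commands j with an edge j -> i.
    preds: Dict[int, List[int]] = {c: [] for c in commands}
    for j in commands:
        for i in w[j]:
            preds[i].append(j)
    ws = {c: 1.0 for c in commands}  # assumed initial value
    for _ in range(max_iter):
        new_ws = {}
        for i in commands:
            rank = sum(w[j][i] / out_weight[j] * ws[j] for j in preds[i])
            new_ws[i] = (1 - d) + d * rank
        delta = max(abs(new_ws[c] - ws[c]) for c in commands)
        ws = new_ws
        if delta < tol:  # scores have stopped changing
            break
    ranked = sorted(commands, key=lambda c: (-ws[c], c))
    return ranked[:m]


example = [int(tok) for tok in (
    "6 6 63 6 42 120 6 195 120 6 6 114 114 1 1 2 5 2 2 5 2 2 5 2 "
    "1 1 1 1 1 1 1 1 2 5 2 1 1 1 1 1 2 5 2 1 1 1 1 2 5 2 2 5 2 "
    "2 5 2 2 5 2 2 5 2 1 1 2 5 2").split()]
print(extract(example, 4))
```

Because d is below 1, each update shrinks the difference between successive score vectors, so the loop terminates well within the iteration cap for sequences of this size.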

Step 2: classification and detection

1. Construct the characteristic command set S from the characteristic commands extracted from the training sequences and test sequences in the first step;

2. Calculate the distribution distri_train of the training sequences over the feature dimensions of the characteristic command set S;

3. For the m characteristic commands extracted from each test sequence, form a distribution vector in the feature-dimension space of the characteristic command set S;

4. Calculate the k vectors in distri_train nearest to that vector, take the mode of their class labels by majority vote as its class, and use the classification result as the judgment for the test sequence.
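Items 1 to 3 can be sketched in Python as follows. The text does not state what each vector component holds, so this sketch assumes a binary indicator per element of S (1 if the command is among the sequence's characteristic commands, 0 otherwise). The helper names `build_feature_set` and `to_vector` are hypothetical.

```python
from typing import List, Sequence


def build_feature_set(feature_lists: Sequence[List[int]]) -> List[int]:
    """Union of all extracted characteristic commands, in a fixed order (set S)."""
    return sorted({cmd for feats in feature_lists for cmd in feats})


def to_vector(features: List[int], feature_set: List[int]) -> List[int]:
    """Map one sequence's characteristic commands onto the dimensions of S.

    Assumption: each dimension is a 0/1 indicator of membership.
    """
    present = set(features)
    return [1 if cmd in present else 0 for cmd in feature_set]


train_feats = [[1, 2], [2, 6]]          # characteristic commands per training sequence
test_feats = [[120, 1]]                 # characteristic commands per test sequence
S = build_feature_set(train_feats + test_feats)
distri_train = [to_vector(f, S) for f in train_feats]
test_vectors = [to_vector(f, S) for f in test_feats]
print(S, distri_train, test_vectors)
```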

The mode is the value that occurs most often. In step 4, the k vectors nearest to the detection target are taken, and the label that occurs most often among the labels of those k vectors is selected by vote. The method classifies by measuring the distance between feature values. The idea is: if most of the k samples most similar to a sample in feature space (that is, its nearest neighbors in feature space) belong to a certain class, then the sample also belongs to that class.

The specific steps may include:

1. Calculate the distance between every vector in distri_train and the test vector;

2. Sort by increasing distance;

3. Select the k vectors with the smallest distance to the test vector;

4. Determine the frequency of occurrence of the classes of those k vectors;

5. Return the class that occurs most often among those k vectors as the predicted class of the test vector.
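The five steps above map directly onto a few lines of Python. This is a sketch that assumes Euclidean distance, which the text does not specify; the function name `knn_predict` and the label strings are illustrative.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def knn_predict(train: Sequence[Tuple[Sequence[float], str]],
                test_vector: Sequence[float], k: int) -> str:
    """Predict the class of test_vector by majority vote among its k nearest neighbors."""
    # Step 1: distance from every training vector to the test vector
    # (assumption: Euclidean distance).
    scored = [(math.dist(vec, test_vector), label) for vec, label in train]
    # Steps 2-3: sort by increasing distance and keep the k nearest.
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    # Steps 4-5: count the class labels and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [([0, 0], "normal"), ([0, 1], "normal"), ([5, 5], "abnormal")]
print(knn_predict(train, [0, 0.2], k=3))
```

With k = 1, the setting reported as best in the experiments, the vote reduces to returning the label of the single nearest training vector.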

In a specific embodiment, the method comprises the following steps:

1. Data processing and initialization. Shuffle the data of the different classes and randomly take 90% for training; initialize m=1.

2. For all training sequences and test sequences, use the characteristic command extraction algorithm extract() to obtain m characteristic commands from each. (The characteristic command extraction algorithm covers the whole of the first step: converting the commands of the sequence into a directed graph, calculating the score of each command with the WS formula, and taking the m commands with the highest scores.)

3. Construct the characteristic command set S from all characteristic commands.

4. Calculate the distribution distri_train of the training sequences over the characteristic command set S.

5. Map the m characteristic commands extracted from each test sequence to a new vector V in the characteristic command set S, forming a distribution vector in the feature-dimension space of S.

6. Determine the sequence type by KNN classification: for the vector V formed from each test sequence, find the k nearest characteristic command sequences in distri_train; the type that occurs most often among the types of those k sequences is the discriminated type of the test sequence.

7. For the current value of m, evaluate the TPR and FPR of the classification results.

TPR and FPR are the indicators of classification performance. In a binary classification problem, TPR and FPR measure the classification results of a model more accurately. TPR is the true positive rate: the proportion of all actual positive samples that are classified as positive, TPR = TP / (TP + FN). FPR is the false positive rate: the proportion of all actual negative samples that are wrongly classified as positive, FPR = FP / (FP + TN). The closer TPR is to 1 and the closer FPR is to 0, the better the result.

8. When m > the preset upper limit (the minimum number of distinct commands over all sequences), exit; otherwise increment m by 1 and repeat from step 2.

9. Determine the final value of m according to which value of m gives the best TPR and FPR.
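The evaluation in steps 7 to 9 can be sketched in Python as below. The text does not say how TPR and FPR are combined when choosing m; this sketch assumes the value of m that maximizes TPR minus FPR is preferred, with ties going to the smaller m. The function names, the "abnormal" positive label, and the sample figures in `results` are illustrative only.

```python
from typing import Dict, List, Tuple


def tpr_fpr(y_true: List[str], y_pred: List[str],
            positive: str = "abnormal") -> Tuple[float, float]:
    """Compute (TPR, FPR) with `positive` as the attack class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def pick_best_m(results: Dict[int, Tuple[float, float]]) -> int:
    """Choose m from {m: (TPR, FPR)}.

    Assumption: maximize TPR - FPR; on ties prefer the smaller m.
    """
    return max(results, key=lambda m: (results[m][0] - results[m][1], -m))


# Illustrative figures only, not measured results.
results = {1: (0.80, 0.10), 2: (0.90, 0.05), 3: (0.90, 0.06)}
print(pick_best_m(results))
```

In a full run, the loop over m would call the extraction, vectorization, and KNN stages for each m, record `tpr_fpr` of the predictions in `results`, and stop once m exceeds the preset upper limit.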

Experimental result

The experiments were carried out on the ADFA-LD data set. The ADFA data set is a collection of data for host-level intrusion detection systems released publicly by the Australian Defence Force Academy, and is widely used for testing intrusion detection products. In the data set, the various system calls have already been characterized, and attack types are labeled.

The method was verified on the ADFA-LD data set, taking 90% as training sequences and the remaining 10% as test sequences. Tests on the same data from the ADFA-LD data set show that this method has clear advantages over traditional methods in both effectiveness and processing efficiency. In this experiment, the best result is generally obtained with m = 5 and k = 1, at which point TPR reaches 92% and FPR is about 1.9%.

In terms of effectiveness:

Compared with traditional STIDE (TPR about 90%, FPR about 30%), this method effectively improves the detection rate and reduces the false alarm rate. Fig. 3 compares the detection results of this method and the STIDE method on the ADFA data set. The results show that the detection rate of this method is better than that of STIDE, and its false alarm rate is far lower.

In terms of processing efficiency:

When m takes the optimal value 5, this method is more than 3 times as efficient as the TF-IDF method. Although the detection performance of the two is close, this method needs fewer features than TF-IDF and therefore has a lower feature dimension. Referring to Fig. 4, when m=5 the two methods show the same, best detection performance, but this method uses 89 characteristic commands in total while TF-IDF uses 139. Moreover, this method saves the overhead of computing IDF, so its overall time cost is substantially lower than that of TF-IDF. Referring to Fig. 5, the results show that when m=5 the two methods perform the same and at their best, but this method takes only 30% of the time TF-IDF takes.

The above is a further detailed description of the invention in combination with specific and preferred embodiments, and the specific implementation of the invention is not to be regarded as limited to these descriptions. Those of ordinary skill in the art to which the invention belongs may make substitutions or modifications to the described embodiments without departing from the inventive concept, and all such substitutions or variants shall be regarded as falling within the protection scope of the invention.

Claims

1. A host sequence intrusion detection method, characterized by comprising the following steps:

S1, extracting m characteristic commands from each training sequence and each test sequence;

S2, constructing a characteristic command set from the characteristic commands extracted from the training sequences and test sequences;

S3, calculating the distribution of the training sequences in the feature-dimension space of the characteristic command set;

S4, mapping the m characteristic commands extracted from each test sequence to a new vector in the feature-dimension space of the characteristic command set, forming a distribution vector in the feature-dimension space of the characteristic command set;

S5, for the vector formed from each test sequence, finding the k most similar characteristic command sequences in the distribution, and taking the type that occurs most frequently among the types corresponding to those k characteristic command sequences as the discriminated type of the test sequence.

2. The host sequence intrusion detection method according to claim 1, characterized by further comprising the following steps:

S6, judging whether m is greater than a set threshold; if not, updating the value of m to m+1 and repeating steps S1-S5;

S7, evaluating the classification performance of the type discrimination results under different values of m, and taking the discriminated type under the value of m with the best classification performance as the final discriminated type.

3. The host sequence intrusion detection method according to claim 2, characterized in that, in step S7, TPR and FPR are used as the indicators for evaluating classification performance.

4. The host sequence intrusion detection method according to any one of claims 1 to 3, characterized in that, in step S1, extracting m characteristic commands from a sequence comprises:

converting the sequence into a directed weighted graph G = (V, E), where V is the vertex set of the commands in the sequence, V = {V₁, V₂, …, V_i, …, V_n | 1 ≤ i ≤ n}, and E is the set of edges with weights w, representing the context relationships between commands;

calculating the score of command V_i according to the following formula:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where, for commands V_i and V_j, w_ij is the context weight between any two commands V_i and V_j in the sequence, namely the number of times the command immediately following V_i is V_j; In(V_i) is the set of commands whose subsequent command is V_i; Out(V_j) is the set of commands that follow V_j; d is the damping coefficient, with value range (0, 1); and WS(V_j) is the score of command V_j;

wherein, when calculating the score of each command, a preset initial value is assigned to all commands, and the calculation is repeated recursively over multiple iterations until convergence;

after the WS values of all commands in the sequence have been calculated by the above method, the m commands with the largest WS values are extracted.

5. The host sequence intrusion detection method according to claim 4, characterized in that d is 0.85.

6. The host sequence intrusion detection method according to any one of claims 1 to 5, characterized in that, in step S5, the k vectors in the distribution nearest to the vector are calculated, and the mode of their class labels, taken by majority vote, is used as the discriminated type of the test sequence.

7. The host sequence intrusion detection method according to claim 6, characterized in that step S5 comprises:

S51, calculating the distance between every vector in the distribution and the test vector;

S52, sorting by distance;

S53, selecting the k vectors with the smallest distance to the test vector;

S54, determining the frequency of occurrence of the classes to which the k vectors belong;

S55, returning the class with the highest frequency of occurrence as the predicted class of the test vector.

8. The host sequence intrusion detection method according to any one of claims 1 to 7, characterized in that, in step S1, the initial value of m is greater than or equal to 1.

9. A computer-readable storage medium, characterized in that it stores a computer program, the computer program being executable by a processor to implement the method according to any one of claims 1 to 8.