CN109450860A

CN109450860A - A detection method for advanced persistent threats based on entropy and support vector machines

Info

Publication number: CN109450860A
Application number: CN201811200227.5A
Authority: CN
Inventors: 谭佳雨; 王箭
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2019-03-08

Abstract

The invention discloses a method for detecting advanced persistent threats based on entropy and support vector machines, and relates to the field of information security technology. After organizing the raw traffic packets, the invention computes, through data statistics, the entropy of the traffic information in a local area network, extracts new feature vectors from it, and then trains a support vector machine to establish a machine learning model, thereby detecting advanced persistent threats. The advantage of the invention is that, by segmenting the continuous traffic data, the massive data is divided into traffic segments of a smaller order of magnitude, so that classification training and detection can be carried out quickly. At the same time, the task of detecting specific malicious traffic is converted into detecting traffic segments that contain malicious traffic, which greatly simplifies data processing and reduces the false alarm rate.

Description

A detection method for advanced persistent threats based on entropy and support vector machines

Technical field

The present invention relates to the field of information security technology, and in particular to a method for detecting advanced persistent threats.

Background art

Over the past decade or more, the global Internet has developed at high speed. Meanwhile, network attacks have emerged one after another, evolving continuously in both number and means. In 2010, the notorious network attack Stuxnet was discovered by security researchers, and a large amount of attack detail was disclosed. Since then, advanced persistent threats (APT) have repeatedly come to public attention. An APT is a long-term network attack that is highly customized for a specific target. Because its attack characteristics are diverse and new attack methods keep emerging, traditional detection based on pattern matching can hardly be effective.

In recent years, there has been much work on APT detection. Many security vendors and researchers have proposed a variety of solutions for APT attacks. Techniques such as "traditional signature matching" and "black and white lists" cannot effectively discover unknown APT attacks. Therefore, after studying many APT attack cases, Xu Wang et al. proposed detecting, in the communication stage of an APT attack, whether command and control behavior and the traffic it generates are present. Their scheme found that the correlation among the destination addresses accessed by command and control behavior differs markedly from that of the address information in normal traffic, and it finds suspicious traffic by computing the correlation of each accessed address. The shortcoming of this method is that detection takes too long, so the losses caused by an APT attack cannot be discovered and reduced in time. In addition, Sana Siddiqui proposed a machine learning classification method based on fractal dimension, which builds a detection model by learning the features of attack traffic generated by malware used in known APT attack cases, and which achieved better classification performance than the common kNN algorithm. Although this scheme can detect part of the malicious traffic, it produces a very high false alarm rate, and processing massive traffic data requires very high computing power, so its detection capability is greatly restricted. Fàtima Barceló-Rico et al. focused on detecting the behavioral features of HTTP requests, experimented with semi-supervised learning models, namely Genetic Programming (GP), two Decision Tree Classifiers (DTC), and Support Vector Machine (SVM), and compared the strengths and weaknesses of these machine learning algorithms. They found that SVM and DTC can still maintain good performance when only a small number of known suspicious examples exist in massive data.

Summary of the invention

Object of the invention: in view of the deficiencies of the prior art, the object of the present invention is to provide a method for detecting advanced persistent threats. The method can identify APT attacks quickly and accurately from massive data, and by extracting new features it greatly reduces the amount of data that needs to be processed.

Technical solution: to achieve the above object, the present invention proposes a detection method for advanced persistent threats based on entropy and support vector machines, comprising the following steps:

a) recording traffic data on the central switch of a local area network and collecting traffic information, the traffic information including but not limited to the source and destination addresses, source and destination ports, byte count, and timestamp of each data packet;

b) restoring the collected and saved data into data flows through network packet analysis, and saving several items of feature information of the data flows;

c) setting a time period t, dividing the data obtained in step b) into n samples each spanning a time interval t, and calculating the entropy of each item of feature information for each sample to form a new data feature vector;

d) taking the data feature vectors obtained in step c) as input, and training, by machine learning, a model that can identify samples containing abnormal traffic, until the evaluation metrics of the trained model reach a specified threshold;

e) classifying the traffic in any time interval t using the model obtained in step d), so as to determine whether abnormal data flows exist in that segment of traffic.

Further, in step b, the feature information includes the number of bytes transmitted, the number of data packets transmitted, and the duration of the data flow.

Further, in step b, the collected and saved data include normal traffic data samples and abnormal traffic data samples, wherein the normal traffic data samples come from the traffic data collected in step a, and the abnormal traffic data samples come from the public Contagio malware database contributed by Mila Parkour.

Further, in step c, the time period t is set to 10 seconds in a 100 Mbps network, and the device that processes the data is configured with an Intel 4-core 2.5 GHz CPU, 8 GB of RAM, and a 2.5 TB hard disk.

Further, in step c, when the entropy of the data flow feature information is calculated to form the new feature vector, Shannon's discrete source information entropy is used:

$$H(X) = -\sum_{i=1}^{m} P(x_i) \log_b P(x_i)$$

where H(X) denotes the discrete source entropy of the data flow feature information X, P(x_i) denotes the probability that the feature information X, taking the value x_i, falls within a certain value interval in the sample, b is the base of the logarithm, and m denotes the number of sample values of the feature information X in the sample.
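The entropy calculation can be illustrated with a minimal Python sketch. It assumes that the "value intervals" are fixed-width bins; the function name `shannon_entropy` and the parameter `bin_width` are illustrative choices and do not appear in the patent.

```python
import math
from collections import Counter


def shannon_entropy(values, bin_width, base=2):
    """Entropy of the empirical distribution of `values` over fixed-width intervals.

    Assumption: each observation is assigned to the interval
    [k * bin_width, (k + 1) * bin_width); the patent does not fix the binning.
    """
    if not values:
        return 0.0
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    entropy = 0.0
    for count in counts.values():
        p = count / total                 # P(x_i): share of observations in this interval
        entropy -= p * math.log(p, base)  # accumulate -P(x_i) * log_b P(x_i)
    return entropy


# Hypothetical flow durations in seconds, binned into 5-second intervals.
print(shannon_entropy([28.7, 21.9, 20.9, 23.9], bin_width=5.0))
```

Only occupied intervals contribute to the sum, because an interval with probability zero adds nothing to the entropy.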

Further, in step d, the machine learning model is a support vector machine with decision function f(t), where f(t) ∈ {-1, +1}, (·)' denotes matrix transposition, sign(ζ) denotes the sign of the number ζ, t is an element of the standardized data feature vectors, i.e. a test sample, α is the Lagrange variable, x_s (s = 1, ..., |S|) are the support vectors, S denotes the training sample data set, p denotes the degree of the polynomial, and k is the classification parameter obtained during training;

The kernel function of the support vector machine is a Gaussian kernel function. For a sample data point x_i and an observed data point x_j, with target value y, the Gaussian kernel function is:

$$k(x_i, x_j) = e^{-y \lVert x_i - x_j \rVert^2}$$
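As a rough illustration, the Gaussian kernel and a kernel-based decision can be sketched in Python as follows. The quantity written y in the kernel formula is passed as the parameter `gamma`. The decision function shown is the standard textbook dual form of a kernel support vector machine and may differ in notation from the formula of the invention; the names `svm_decide`, `coefficients`, and `bias` are hypothetical.

```python
import math


def gaussian_kernel(x_i, x_j, gamma):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def svm_decide(t, support_vectors, coefficients, bias, gamma):
    """Textbook kernel-SVM decision (an assumption, not the patent's exact formula).

    `coefficients[s]` is the Lagrange multiplier of support vector s multiplied
    by its class label; the result is +1 or -1.
    """
    score = bias + sum(
        c * gaussian_kernel(x_s, t, gamma)
        for x_s, c in zip(support_vectors, coefficients)
    )
    return 1 if score >= 0 else -1


# Two hypothetical support vectors, one per class.
vectors = [[0.0, 0.0], [2.0, 2.0]]
print(svm_decide([0.1, 0.2], vectors, coefficients=[1.0, -1.0], bias=0.0, gamma=1.0))
```

The kernel value decays from 1 toward 0 as the two points move apart, so each support vector mainly influences test samples that lie close to it.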

Further, in step d, the evaluation metrics include accuracy, precision, recall, and F1 score.

Further, in step d, the specified threshold is set as 90% or more.

Beneficial effects:

By segmenting the continuous traffic data, the present invention divides the massive data into traffic segments of a smaller order of magnitude, and the new feature vectors extracted through statistical calculation further reduce the amount of data ultimately used to train the machine learning model, so that classification training and detection can be carried out quickly. At the same time, the task of detecting specific malicious traffic is converted into detecting traffic segments that contain malicious traffic, which greatly simplifies data processing and reduces the false alarm rate.

Brief description of the drawings

Fig. 1 is the overall flow chart of the method of the present invention;

Fig. 2 is an example of the probability distribution of data flow duration.

Detailed description of the embodiments:

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

The present invention is a detection method for advanced persistent threats based on entropy and support vector machines. After the traffic packet data are organized, the entropy of the traffic information in the local area network is computed through data statistics and new feature vectors are extracted; a support vector machine is then trained to establish a machine learning model, so as to detect advanced persistent threats. The specific implementation steps are as follows, and the detailed flow is shown in Fig. 1.

Step S1: traffic data are collected from the central switch, key traffic information is extracted, the binary data stream is restored, and the TCP and UDP packets are saved in the form of PCAP (packet capture) files. The traffic information includes but is not limited to the source and destination addresses, source and destination ports, byte count, and timestamp of each data packet. The collected, unprocessed raw traffic packet data are shown in Table 1.

Table 1. Raw traffic packet data

Step S2: the traffic packets are organized using the network packet analysis tool Wireshark. First, a filter is used to remove zero-length packets and retransmitted packets; the specific filter condition is "!tcp.analysis.retransmission and tcp.len>0". After the preprocessed data are obtained, the command-line tool tshark is used to reorder the data packets by sequence number and timestamp, assemble them into data flows, and export and save them to a txt file. An example of the exported data is shown in Table 2.
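The effect of the filter and the reordering in step S2 can be sketched in Python on already decoded packets. This is only an illustration of the logic, not a replacement for Wireshark or tshark; the `Packet` record and its field names are hypothetical, and sorting by sequence number first and timestamp second is an assumption about how the two keys are combined.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp: float       # capture time in seconds
    seq: int               # TCP sequence number
    payload_len: int       # TCP payload length in bytes (tcp.len)
    retransmission: bool   # True if the packet is flagged as a retransmission


def clean_and_order(packets):
    """Drop retransmitted and zero-length packets, then order the remainder."""
    kept = [p for p in packets if not p.retransmission and p.payload_len > 0]
    # Assumed ordering: sequence number first, timestamp as tie-breaker.
    return sorted(kept, key=lambda p: (p.seq, p.timestamp))


# Hypothetical packets of a single connection.
packets = [
    Packet(2.0, 200, 100, False),
    Packet(1.0, 100, 100, False),
    Packet(3.0, 100, 100, True),   # retransmission, removed
    Packet(4.0, 300, 0, False),    # empty payload, removed
]
print([p.seq for p in clean_and_order(packets)])
```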

Table 2. Example of exported data

Address:port A	Address:port B	Packets	Bytes	Duration
172.18.125.127:50504	42.202.151.230:80	14989	11543526	28.7467
172.18.19.8:7037	58.221.74.144:80	14821	16151885	21.8665
172.26.28.188:26528	182.247.250.19:80	12694	9610185	20.8633
172.21.61.237:57685	172.106.33.219:10019	6030	4645450	23.8794
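Records in the layout of Table 2 can be loaded with a short Python sketch. It assumes that the exported txt file contains whitespace-separated rows with exactly the five columns shown; the `Flow` record and the function `parse_flows` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    endpoint_a: str    # "address:port" of side A
    endpoint_b: str    # "address:port" of side B
    packets: int       # number of packets in the flow
    byte_count: int    # number of bytes in the flow
    duration: float    # flow duration as exported


def parse_flows(text):
    """Parse rows shaped like Table 2; header and blank lines are skipped."""
    flows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or not fields[2].isdigit():
            continue  # not a data row
        flows.append(
            Flow(fields[0], fields[1], int(fields[2]), int(fields[3]), float(fields[4]))
        )
    return flows


sample = """Address:port A\tAddress:port B\tPackets\tBytes\tDuration
172.18.125.127:50504\t42.202.151.230:80\t14989\t11543526\t28.7467
172.18.19.8:7037\t58.221.74.144:80\t14821\t16151885\t21.8665
"""
for flow in parse_flows(sample):
    print(flow)
```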

Step S3: a time period t is set, and the data flows within one period t are taken as one sample, so that the existing traffic data are divided into n segments, giving n samples. Fig. 2 shows the probability distribution of duration in one of the samples: the horizontal axis is the data flow duration, and the vertical axis is the probability that the duration of a data flow falls within a given interval. Then, by the Shannon entropy formula

$$H(X) = -\sum_{i=1}^{m} P(x_i) \log_b P(x_i)$$

where H(X) denotes the discrete source entropy of the data flow feature information X, P(x_i) denotes the probability that the feature information X, taking the value x_i, falls within a certain value interval in the sample, b is the base of the logarithm (usually b = 2), and m denotes the number of sample values of the feature information X in the sample, the entropy of each sample is calculated and saved as a record. The features to be calculated are the entropy of the byte count, the packet count, and the duration of each sample, and the entropies of these three features form the feature vector fed to the learner. Table 3 lists, for 4 of the samples, the computed entropy of each feature and the flag used for classification.
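Step S3 as a whole, segmentation into periods of length t followed by one entropy per feature, might look like the following Python sketch. It assumes that each flow carries a start time (a field that Table 2 does not show) and that value intervals are fixed-width bins; `window_features` and `bin_widths` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


def entropy(values, bin_width, base=2):
    """Shannon entropy of `values` over fixed-width value intervals."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    result = 0.0
    for count in counts.values():
        p = count / total
        result -= p * math.log(p, base)
    return result


def window_features(flows, t, bin_widths):
    """Map each period index to (Ep_Packets, Ep_Bytes, Ep_Duration).

    `flows` holds tuples (start_time, packets, byte_count, duration);
    `bin_widths` gives one interval width per feature, in the same order.
    """
    windows = defaultdict(list)
    for flow in flows:
        windows[int(flow[0] // t)].append(flow)  # period that contains the flow start
    return {
        index: tuple(
            entropy([flow[col] for flow in group], bin_widths[col - 1])
            for col in (1, 2, 3)
        )
        for index, group in sorted(windows.items())
    }


# Hypothetical flows: two start in the first 10-second period, one in the second.
flows = [(0.0, 10, 1000, 1.0), (3.0, 20, 3000, 1.2), (12.0, 10, 1000, 1.0)]
print(window_features(flows, t=10, bin_widths=(5, 1000, 0.5)))
```

Each period yields a single three-element vector regardless of how many flows it contains, which is the source of the data reduction that the method relies on.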

Table 3. Entropy of each feature and the classification flag

Ep_Packets	Ep_Bytes	Ep_Duration	flag
1.62997396765	1.13345078829	7.70188552486	0
1.57602566801	1.26045873852	7.59871075104	0
1.14233680811	0.910838676236	4.64659436002	1
1.13322375711	0.859611891072	1.74121380584	1
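Because the classifier operates on standardized feature vectors, the rows of Table 3 would typically be rescaled before training. The sketch below assumes column-wise z-score standardization, which the patent does not specify; `standardize` is a hypothetical helper.

```python
import statistics


def standardize(rows):
    """Column-wise z-score: subtract the mean, divide by the population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; dividing by 1.0 leaves it at zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / stdev for value, mean, stdev in zip(row, means, stdevs)]
        for row in rows
    ]


# Feature vectors (Ep_Packets, Ep_Bytes, Ep_Duration) and flags from Table 3.
rows = [
    [1.62997396765, 1.13345078829, 7.70188552486],
    [1.57602566801, 1.26045873852, 7.59871075104],
    [1.14233680811, 0.910838676236, 4.64659436002],
    [1.13322375711, 0.859611891072, 1.74121380584],
]
flags = [0, 0, 1, 1]
for vector, flag in zip(standardize(rows), flags):
    print(flag, [round(v, 3) for v in vector])
```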

Step S4: with the results of step S3 as input, a support vector machine is used as the learner for machine learning training and a classification model is established, wherein the kernel function of the support vector machine is a Gaussian kernel function. The learner is trained iteratively until every metric exceeds 90%; these metrics are accuracy, precision, recall, and F1 score. The support vector machine is as defined in the technical solution above.

Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 score:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy measures the proportion of predictions that are correct, precision measures the proportion of predicted positives that are truly positive, and recall measures the proportion of actual positives that are found. The F1 score combines precision and recall; a higher F1 indicates a more robust model.
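The four metrics and the 90% stopping criterion can be checked with a small Python sketch. The confusion-matrix counts in the example are hypothetical and are not results reported for the invention; `evaluation_metrics` and `meets_threshold` are illustrative names.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def meets_threshold(metrics, threshold=0.90):
    """True when every metric reaches the specified threshold."""
    return all(value >= threshold for value in metrics.values())


# Hypothetical counts for 100 classified traffic segments.
metrics = evaluation_metrics(tp=45, tn=47, fp=3, fn=5)
print(metrics, meets_threshold(metrics))
```

Requiring all four metrics to pass guards against a model that reaches high accuracy merely by labeling almost every segment as normal.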

Step S5: after model training is complete, subsequent traffic data that have undergone the same processing can be classified.

The embodiments of the present invention have been described above by way of example, but the scope of the present invention is not limited to the above examples and may be changed or modified according to purpose within the scope recorded in the claims.

Claims

1. A detection method for advanced persistent threats based on entropy and support vector machines, characterized by comprising the following steps:

a) recording traffic data on the central switch of a local area network and collecting traffic information, the traffic information including but not limited to the source and destination addresses, source and destination ports, byte count, and timestamp of each data packet;

b) restoring the collected and saved data into data flows through network packet analysis, and saving several items of feature information of the data flows;

c) setting a time period t, dividing the data obtained in step b) into n samples each spanning a time interval t, and calculating the entropy of each item of feature information for each sample to form a new data feature vector;

d) taking the data feature vectors obtained in step c) as input, and training, by machine learning, a model that can identify samples containing abnormal traffic, until the evaluation metrics of the trained model reach a specified threshold;

e) classifying the traffic in any time interval t using the model obtained in step d), so as to determine whether abnormal data flows exist in that segment of traffic.

2. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 1, characterized in that, in step b, the feature information includes the number of bytes transmitted, the number of data packets transmitted, and the duration of the data flow.

3. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 1, characterized in that, in step b, the collected and saved data include normal traffic data samples and abnormal traffic data samples, wherein the normal traffic data samples come from the traffic data collected in step a, and the abnormal traffic data samples come from the public Contagio malware database contributed by Mila Parkour.

4. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 1, characterized in that, in step c, the time period t is set to 10 seconds in a 100 Mbps network, and the device that processes the data is configured with an Intel 4-core 2.5 GHz CPU, 8 GB of RAM, and a 2.5 TB hard disk.

5. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 1, characterized in that, in step c, when the entropy of the data flow feature information is calculated to form the new feature vector, Shannon's discrete source information entropy is used:

$$H(X) = -\sum_{i=1}^{m} P(x_i) \log_b P(x_i)$$

where H(X) denotes the discrete source entropy of the data flow feature information X, P(x_i) denotes the probability that the feature information X, taking the value x_i, falls within a certain value interval in the sample, b is the base of the logarithm, and m denotes the number of sample values of the feature information X in the sample.

6. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 1, characterized in that, in step d, the machine learning model is a support vector machine with decision function f(t), where f(t) ∈ {-1, +1}, (·)' denotes matrix transposition, sign(ζ) denotes the sign of the number ζ, t is an element of the standardized data feature vectors, i.e. a test sample, α is the Lagrange variable, x_s (s = 1, ..., |S|) are the support vectors, S denotes the training sample data set, p denotes the degree of the polynomial, and k is the classification parameter obtained during training;

the kernel function of the support vector machine is a Gaussian kernel function; for a sample data point x_i and an observed data point x_j, with target value y, the Gaussian kernel function is:

$$k(x_i, x_j) = e^{-y \lVert x_i - x_j \rVert^2}$$

7. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 1, characterized in that, in step d, the evaluation metrics include accuracy, precision, recall, and F1 score.

8. The detection method for advanced persistent threats based on entropy and support vector machines according to claim 7, characterized in that, in step d, the specified threshold is set to 90% or more.