CN110022313A

CN110022313A - Polymorphic worm feature extraction and polymorphic worm discrimination method based on machine learning

Info

Publication number: CN110022313A
Application number: CN201910226995.6A
Authority: CN
Inventors: 王方伟; 王长广; 杨少杰; 赵冬梅
Original assignee: Hebei Normal University
Current assignee: Hebei Normal University
Priority date: 2019-03-25
Filing date: 2019-03-25
Publication date: 2019-07-16
Anticipated expiration: 2039-03-25
Also published as: CN110022313B

Abstract

The invention discloses a machine-learning-based polymorphic worm feature extraction method and a polymorphic worm discrimination method. The feature extraction method comprises loading polymorphic worms and dividing them into a test set and a training set, then establishing, training, and verifying a mathematical model of malicious polymorphic worm behavior features. The discrimination method monitors the running state in real time and records the results of the polymorphic worm feature extraction experiments. The invention extracts the features of polymorphic worms more quickly and accurately, can be applied in real time to network traffic monitoring, is highly extensible and easy to control, and provides a user-friendly visualization window.

Description

Polymorphic worm feature extraction and polymorphic worm discrimination method based on machine learning

Technical field

The present invention relates to a polymorphic worm feature extraction method and a polymorphic worm discrimination method, and in particular to a machine-learning-based polymorphic worm feature extraction method and polymorphic worm discrimination method, and belongs to the technical field of network security.

Background technique

With the widespread adoption of the Internet and its application in every field, worms have become one of the main threats to cyberspace security and host security, and the rapid detection of polymorphic worms and the accuracy of their feature extraction have become major issues in Internet security. As computer network technology continues to develop, polymorphic worms mutate quickly, spread rapidly, are highly destructive, replicate themselves readily, and are difficult to detect, causing immeasurable losses to production and daily life. As the ways in which polymorphic worms are generated and propagated keep advancing, they can spread across the globe in a short time and strike network environments hard; they can also replicate through personal hosts and spread within intranets. Traditional sequence alignment methods can no longer extract features quickly and effectively enough to protect the normal network environment. Improving the accuracy and efficiency of polymorphic worm feature extraction has therefore become an urgent problem.

Summary of the invention

The technical problem to be solved by the present invention is to provide a machine-learning-based polymorphic worm feature extraction method and polymorphic worm discrimination method.

To solve the above technical problem, the present invention adopts the following technical solutions:

Technical solution one:

A polymorphic worm feature extraction method based on machine learning, comprising the following steps:

Step 1: Load the polymorphic worms and divide them into a test set and a training set: construct a regular expression for segmentation, load the polymorphic worm data sets one by one, and divide the malicious worm data set into a test set and a training set according to a preset ratio; the test set and the training set are further grouped to form a file set, and each malicious worm corresponds to one file in the file set.

Step 2: Train the malicious polymorphic worm behavior model: use unsupervised machine learning to extract one or more sub-behavior features of the malicious worms in the training set; the resulting malicious polymorphic worm behavior model is the weighted average of the sub-behavior features according to their weights.

The machine-learning-based polymorphic worm feature extraction method further comprises Step 3: Refine the mathematical model of malicious polymorphic worm behavior features: use supervised learning to extract the sub-behavior features of the malicious worms again from the malicious worms in the test set; identical sub-behavior features extracted in both Step 2 and Step 3 whose corresponding weights differ by less than a preset value are retained, and the remaining behavior features are discarded.

The regular expression in Step 1 is "GET.*http/1.1.*\r\n.*?".
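Step 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation (which the description says is written in Scala). The pattern is the one given in the text, with the HTTP line terminator `\r\n` assumed; the case-insensitive flag, the 0.8 training ratio, the fixed shuffle seed, and the function names `segment_requests` and `split_samples` are all hypothetical choices made for the example.

```python
import random
import re

# Pattern from the text, with "\r\n" assumed as the line terminator.
# re.IGNORECASE is an assumption so that "HTTP/1.1" in real traffic also matches.
REQUEST_PATTERN = re.compile(r"GET.*http/1.1.*\r\n.*?", re.IGNORECASE)


def segment_requests(raw):
    """Return every GET request line found in a raw capture string."""
    return [match.group(0) for match in REQUEST_PATTERN.finditer(raw)]


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them into (training set, test set).

    The text only says "a preset ratio"; 0.8 is an assumed value.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


raw_capture = (
    "GET /a.ida?XXXX HTTP/1.1\r\nHost: x\r\n\r\n"
    "GET /b HTTP/1.1\r\nHost: y\r\n\r\n"
)
segments = segment_requests(raw_capture)
train_set, test_set = split_samples(range(10))
```

Because `.` does not match a line feed, each match stops at the end of the request line, so the capture above yields one segment per GET request.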

The weight of each behavior feature in the malicious polymorphic worm behavior model in Step 2 is calculated as:

The forward term frequency is calculated as:

The feature conversion is calculated as:

The inverse document frequency is calculated as:

Technical solution two:

A polymorphic worm discrimination method that applies the polymorphic worm feature extraction method of Technical solution one, comprising the following steps:

The running state is monitored in real time, and the log of the polymorphic worm feature extraction is recorded and displayed in a visualization window. The log includes: whether the monitoring program is in an abnormal state, the time node of each program step, the grouping state of the data set, the raw data of each worm code segment, the features preliminarily extracted from the corresponding code segment, and the polymorphic worm features confirmed after program testing.
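The log contents listed above could be held in a record such as the following. This is a hypothetical Python sketch: the class name `ExtractionLogEntry`, its field names, and the one-line `summary` format are assumptions introduced for illustration and do not appear in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class ExtractionLogEntry:
    """One log record per worm code segment (all field names are assumed)."""
    abnormal: bool                      # whether the monitoring program is in an abnormal state
    step_times: Dict[str, datetime]     # time node of each program step
    grouping_state: str                 # grouping state of the data set
    raw_segment: bytes                  # raw data of the worm code segment
    preliminary_features: List[str]     # features preliminarily extracted from the segment
    confirmed_features: List[str] = field(default_factory=list)  # confirmed after testing

    def summary(self):
        """Return a one-line status string suitable for a display window."""
        status = "ABNORMAL" if self.abnormal else "ok"
        return (
            f"[{status}] {self.grouping_state}: "
            f"{len(self.confirmed_features)}/{len(self.preliminary_features)} "
            "features confirmed"
        )


entry = ExtractionLogEntry(
    abnormal=False,
    step_times={"load": datetime(2019, 3, 25, 12, 0, 0)},
    grouping_state="train",
    raw_segment=b"GET /default.ida?NNNN HTTP/1.1\r\n",
    preliminary_features=[".ida?", "Host:"],
    confirmed_features=[".ida?"],
)
line = entry.summary()
```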

The technical effects achieved by adopting the above technical solutions are as follows:

1. With the improved hash calculation, memory usage does not grow with data volume and dimensionality, and each forward term frequency value can be mapped one-to-one to a position, which improves the accuracy of the forward term frequency calculation.

2. The improved forward term frequency calculation makes features of different dimensions more comparable numerically, which greatly improves the accuracy of feature extraction.

3. The improved inverse document frequency calculation accelerates the convergence of the algorithm, so that genuine polymorphic worm behaviors are more clearly distinguished from other feature behaviors in terms of weight.

4. When the final behavior features are calculated, the weight calculation method avoids the excessively high weights that the inverse document frequency calculation assigns to rare features, which further optimizes the feature extraction process.

5. The invention extracts the features of polymorphic worms more quickly and accurately.

6. The invention can be applied in real time to network traffic monitoring to detect polymorphic worms.

7. The invention is highly extensible and easy to control, and its visualization window is user-friendly.

Brief description of the drawings

Fig. 1 is a flow chart of the invention.

Detailed description of the embodiments

The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

Embodiment 1

The machine-learning-based polymorphic worm feature extraction method further comprises Step 3: Refine the mathematical model of malicious polymorphic worm behavior features: use supervised learning to extract the sub-behavior features of the malicious worms again from the malicious worms in the test set; identical sub-behavior features extracted in both Step 2 and Step 3 whose corresponding weights differ by less than 0.67 are retained, and the remaining behavior features are discarded.

The regular expression in Step 1 is "GET.*http/1.1.*\r\n.*?".

In Step 2, the weight of each sub-behavior feature in the malicious polymorphic worm behavior model is the product of its forward term frequency and its inverse document frequency.

High-dimensional data is processed with an improved hash: the data set is traversed to build a mapping, so that each calculated forward term frequency value can be traced to the specific position of the corresponding worm code segment. The specific formula is as follows:

The forward term frequency is calculated by combining a double normalization algorithm with the improved hash algorithm, which makes high-dimensional features more comparable numerically. The specific formula is as follows:

The importance of each sub-feature within the entire file is calculated with the inverse document frequency algorithm. The specific formula is as follows:

The final weight is calculated. The specific formula is as follows:

The polymorphic worm behavior features are finally determined by the weighted average method.

The local environment for the model is built with sbt (Simple Build Tool), and the program is implemented in the Scala programming language. Building the local environment requires interaction with a remote server to download the environment dependencies needed by the experimental algorithm program. Introducing additional dependency packages improves the feasibility of the program so that the environment runs normally and efficiently, and a local service repository based on the Spark framework is built.

Step 3: Refine the mathematical model of malicious polymorphic worm behavior features: according to the mathematical model in Step 2, use unsupervised machine learning to extract the weight of each sub-behavior feature of the malicious worms in the training set.

The feature extraction of polymorphic worms is completed by unsupervised machine learning. In a noise-free state, the advantage of unsupervised machine learning is that the algorithm can automatically extract a large number of suspicious polymorphic worm sub-features without any known polymorphic worm sub-features being specified.

Step 4: Correct the mathematical model of malicious polymorphic worm behavior features: use supervised learning and the malicious worms in the test set to verify the sub-behavior features extracted in Step 3. If an extracted sub-feature contains a known worm feature segment, the sub-feature is a genuine malicious polymorphic worm feature; otherwise it is not. The weight of a polymorphic worm sub-feature differs from the weight of a non-polymorphic worm sub-feature by about 1.37.
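The verification rule in Step 4 amounts to a containment check, which can be sketched in Python as follows. This is a minimal illustration that assumes features and known segments are plain strings; the function name `verify_features` and the sample data are hypothetical.

```python
def verify_features(candidates, known_segments):
    """Split candidate sub-features into (confirmed, rejected).

    A candidate is confirmed when it contains at least one known worm
    feature segment as a substring.
    """
    confirmed, rejected = [], []
    for feature in candidates:
        if any(segment in feature for segment in known_segments):
            confirmed.append(feature)
        else:
            rejected.append(feature)
    return confirmed, rejected


# Illustrative data only.
candidates = ["GET /default.ida?NNNN", "Host: example.org"]
known_segments = [".ida?"]
confirmed, rejected = verify_features(candidates, known_segments)
```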

With supervised learning, the program labels the features automatically and tests the labeled features, and finally confirms whether each tested feature is a genuine polymorphic worm feature. Under strong noise, this method identifies polymorphic worm data more accurately, and after repeated algorithmic checks it automatically extracts more comprehensive and accurate polymorphic worm features. The method can be applied to routine traffic monitoring.

Extracting polymorphic worm features by combining unsupervised and supervised learning captures those features more comprehensively. Unsupervised learning has the disadvantage that its feature extraction is not targeted. Supervised learning, because the program's "learning ability" is weaker than that of unsupervised learning, has the disadvantage that its feature extraction is incomplete. Combining the two approaches and verifying the result with the test set compensates for both weaknesses and yields more comprehensive and accurate polymorphic worm features.

Embodiment 2

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also fall within the protection scope of the present invention.

Claims

1. A polymorphic worm feature extraction method based on machine learning, characterized by comprising the following steps:

Step 1: Load the polymorphic worms and divide them into a test set and a training set: construct a regular expression for segmentation, load the polymorphic worm data sets one by one, and divide the malicious worm data set into a test set and a training set according to a preset ratio; the test set and the training set are further grouped to form a file set, and each malicious worm corresponds to one file in the file set.

Step 2: Train the malicious polymorphic worm behavior model: use unsupervised machine learning to extract one or more sub-behavior features of the malicious worms in the training set; the resulting malicious polymorphic worm behavior model is the weighted average of the sub-behavior features according to their weights.

2. The polymorphic worm feature extraction method based on machine learning according to claim 1, characterized by further comprising Step 3: Refine the mathematical model of malicious polymorphic worm behavior features: use supervised learning to extract the sub-behavior features of the malicious worms again from the malicious worms in the test set; identical sub-behavior features extracted in both Step 2 and Step 3 whose corresponding weights differ by less than a preset value are retained, and the remaining sub-behavior features are discarded.

3. The polymorphic worm feature extraction method based on machine learning according to claim 1, characterized in that the regular expression in Step 1 is "GET.*http/1.1.*\r\n.*?".

4. The polymorphic worm feature extraction method based on machine learning according to claim 1, characterized in that the weight of each sub-feature in the malicious polymorphic worm behavior model in Step 2 is calculated as:

The forward term frequency is calculated as:

The feature conversion is calculated as:

The inverse document frequency is calculated as:

.

5. A polymorphic worm discrimination method using the polymorphic worm feature extraction method based on machine learning according to any one of claims 1 to 4, comprising the following steps:

The running state is monitored in real time, and the log of the polymorphic worm feature extraction is recorded and displayed in a visualization window. The log includes: whether the monitoring program is in an abnormal state, the time node of each program step, the grouping state of the data set, the raw data of each worm code segment, the features preliminarily extracted from the corresponding code segment, and the polymorphic worm features confirmed after program testing.