CN111177724A

CN111177724A - Automatic detection method for polymorphic worm virus

Info

Publication number: CN111177724A
Application number: CN201911272282.XA
Authority: CN
Inventors: 王方伟; 杨少杰; 王长广; 李青茹; 黄文艳; 李军
Original assignee: Hebei Normal University
Current assignee: Hebei Normal University
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2020-05-19

Abstract

The invention discloses an automatic detection method for polymorphic worm viruses and belongs to the technical field of machine learning and network security. The method first selects a polymorphic worm data sequence for training, smooths it, and extracts a characteristic sequence; it then performs optimized extraction with a Gaussian-Bernoulli RBM algorithm to obtain a new characteristic sequence, which forms a polymorphic worm characteristic library. The invention can quickly and accurately extract the characteristic sequences of different types of polymorphic worm viruses, and can analyze the potential variation risk in polymorphic worm viruses and in suspected polymorphic worm segment sequences, so that defense is ultimately achieved more accurately and quickly. The training parameters can be adjusted manually to control the training mode of the system and improve the accuracy with which the system extracts polymorphic worms.

Description

Automatic detection method for polymorphic worm virus

Technical Field

The invention belongs to the technical field of machine learning and network security, and particularly relates to an automatic detection method for polymorphic worm viruses.

Background

A polymorphic worm is a self-mutating virus whose form changes with every infection. It encrypts itself and modifies its own code while preserving the intent of the original worm, and then spreads rapidly through the network, causing great harm to the Internet.

Polymorphic worm viruses, such as 'zero-day' worms, cause great harm to the Internet. Ransomware among the polymorphic worm viruses has shifted its targets from consumers to enterprises; this year the enterprise infection rate rose by 12 percent compared with 2018, causing incalculable losses to enterprises. Furthermore, the many variant worms of ransomware viruses are harder to detect, and the losses they cause are harder to predict.

In particular, with the arrival of the Internet's big data era, extracting the characteristics of polymorphic worms efficiently has become increasingly difficult, and the Internet security problem has become especially serious. The detection of and defense against computer worm viruses has therefore become a very important problem in the field of network security.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an automatic detection method for polymorphic worm viruses that can efficiently extract the characteristic data of polymorphic worm viruses, provide a powerful tool for inspecting and defending against them, and better ensure network security.

In order to solve the above technical problem, the invention adopts the following technical scheme: an automatic detection method for polymorphic worm viruses, comprising the following steps:

step 1, selecting a polymorphic worm data sequence for training, smoothing the training sequence by using an N-gram smoothing algorithm, and extracting to obtain a characteristic sequence, wherein the specific operation comprises the following steps:

1.1) firstly, segmenting a training sequence according to a flow specification;

1.2) carrying out data smoothing on the obtained sequences by utilizing an N-gram smoothing algorithm;

1.3) carrying out feature extraction on the smoothed data to obtain a feature sequence;

step 2, carrying out noise reduction on the characteristic sequence obtained in step 1 by using a Gaussian-Bernoulli RBM algorithm and extracting features to form a polymorphic worm characteristic sequence under a low-noise condition, wherein the specific operation comprises the following steps:

2.1) firstly, carrying out noise reduction on the characteristic sequence obtained in step 1 by using the Gaussian-Bernoulli RBM algorithm to obtain a noise-reduced characteristic sequence;

2.2) carrying out feature extraction on the noise-reduced characteristic sequence by using the Gaussian-Bernoulli RBM algorithm to obtain the polymorphic worm characteristic sequence under the low-noise condition;

step 3, repeating step 2 to obtain a series of polymorphic worm characteristic sequences, which form a characteristic library.

Preferably, the N-gram smoothing algorithm in substep 1.2 is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.

The smooth formula of the Laplace N-gram algorithm is as follows:

wherein P(x_i) is the probability that the word x_i occurs, C(x_i) is the number of occurrences of the word x_i in the training polymorphic worm data sequence, and V is the Laplace smoothing parameter.

Preferably, the value of V ranges from 0.0001 to 1.
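The effect of the smoothing parameter V can be illustrated with a minimal Python sketch. It assumes the common add-V form P(x_i) = (C(x_i) + V) / (N + V·|vocabulary|), which may differ from the exact formula of the method; the function name `laplace_unigram_probs` and the token list are hypothetical.

```python
from collections import Counter

def laplace_unigram_probs(tokens, vocab, v=1.0):
    """Add-V smoothed unigram probabilities (assumed form, for illustration only)."""
    counts = Counter(tokens)
    # Denominator: total token count plus V for every vocabulary entry.
    denom = len(tokens) + v * len(vocab)
    return {w: (counts[w] + v) / denom for w in vocab}

# Hypothetical tokens taken from three segmented request lines.
tokens = ["GET", "HTTP/1.1", "Host",
          "GET", "HTTP/1.1", "Host",
          "POST", "HTTP/1.1", "Host"]
vocab = sorted(set(tokens) | {"PUT"})  # "PUT" never occurs in the training data

probs = laplace_unigram_probs(tokens, vocab, v=0.0001)
```

With any V in the stated range, the unseen word "PUT" receives a small non-zero probability instead of zero, while the probabilities still sum to 1.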

Preferably, the smoothing formula of the Good-Turing N-gram algorithm in substep 1.2 is:

wherein P(x_i) is the probability that the word x_i occurs, λ is a random linear interpolation coefficient, and λ₁, λ₂, λ₃ are assigned randomly subject to the constraint that they sum to 1.
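The description of three weights that sum to 1 matches linear interpolation of unigram, bigram, and trigram estimates. The following Python sketch assumes that interpolated form; the function name `interpolated_trigram_prob`, the fixed weights, and the token list are hypothetical, and the weights are fixed here rather than randomly assigned.

```python
from collections import Counter

def interpolated_trigram_prob(tokens, w1, w2, w3, lambdas=(0.2, 0.3, 0.5)):
    """Estimate P(w3 | w1, w2) as l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Each component is a maximum likelihood estimate; 0.0 when the context is unseen.
    p1 = uni[w3] / len(tokens) if tokens else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

tokens = ["GET", "/", "HTTP/1.1", "Host", "GET", "/", "HTTP/1.1", "Host"]
p_seen = interpolated_trigram_prob(tokens, "GET", "/", "HTTP/1.1")
p_unseen = interpolated_trigram_prob(tokens, "POST", "/", "HTTP/1.1")
```

Because the lower-order terms still contribute, a trigram whose context never appeared in training (`p_unseen`) keeps a non-zero probability, though a lower one than a trigram that was observed (`p_seen`).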

Preferably, the construction of the Gaussian-Bernoulli RBM algorithm in step 2 is specifically divided into the following sub-steps:

step 1: constructing an RBM algorithm: the RBM joint probability distribution function is shown as formula (3):

wherein the variables v and h are the two subsets into which the n-dimensional binary random vector x ∈ {0, 1}ⁿ of the Boltzmann machine is decomposed: the visible layer v and the hidden layer h. E denotes the energy function and Z denotes the normalization constant (the partition function), as given in formulas (4) and (5):

E(v，h)＝-v^TWh-b^Tv-c^Th (4)

Z＝∑_v∑_hexp{-E(v，h)} (5)

W is the weight matrix of the model parameters, and b and c are bias vectors;

combining formula (3), (4), and (5) to obtain formula (6):
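As a worked sketch of this substitution, assume formula (3) takes the standard Boltzmann form P(v, h) = exp{-E(v, h)}/Z. Inserting the energy from formula (4) and the partition function from formula (5) then gives the expression below. This is a reconstruction from formulas (4) and (5), not a verbatim copy of formula (6).

```latex
\begin{align}
P(v,h) &= \frac{1}{Z}\exp\{-E(v,h)\} \\
       &= \frac{\exp\{v^{T}Wh + b^{T}v + c^{T}h\}}
               {\sum_{v}\sum_{h}\exp\{v^{T}Wh + b^{T}v + c^{T}h\}}
\end{align}
```

The sign of every term flips relative to formula (4) because the exponent is -E(v, h).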

step 2: constructing a Bernoulli RBM algorithm: the Bernoulli RBM probability distribution function is shown as (7):

at this time, σ is represented as

x is v or h;
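For a binary RBM with the energy of formula (4), the conditional activation probabilities take the well-known logistic form. The Python sketch below assumes that σ denotes the logistic sigmoid σ(x) = 1/(1 + e^(-x)); the function names and the small weight matrix are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, assumed here to be the function written as sigma."""
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * W[i][j])."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j W[i][j] * h_j)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

# Two visible units, one hidden unit (hypothetical parameters).
W = [[2.0], [-1.0]]
p_h = hidden_probs([1, 1], W, c=[0.5])      # sigmoid(0.5 + 2.0 - 1.0)
p_v = visible_probs([1], W, b=[0.0, 0.0])   # [sigmoid(2.0), sigmoid(-1.0)]
```

Both conditionals factorize over units because formula (4) contains no visible-visible or hidden-hidden interaction terms.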

step 3: constructing the Gaussian-Bernoulli RBM algorithm: the Gaussian distribution is parameterized by the precision matrix, and the probability distribution function of the algorithm is given by the following formula:

wherein,
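A Gaussian-Bernoulli RBM with a diagonal precision can be sketched in Python as follows. The sketch assumes the textbook energy E(v, h) = ½ Σ_i β_i v_i² − Σ_i Σ_j β_i v_i W_ij h_j − Σ_j c_j h_j, with a precision vector β and no visible bias; this form, the function names, and all parameter values are assumptions and may differ from the formula of the method.

```python
import math

def gb_energy(v, h, W, c, beta):
    """Assumed energy of a Gaussian-Bernoulli RBM with diagonal precision beta."""
    quad = 0.5 * sum(beta[i] * v[i] ** 2 for i in range(len(v)))
    inter = sum(beta[i] * v[i] * W[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    bias = sum(c[j] * h[j] for j in range(len(h)))
    return quad - inter - bias

def gb_hidden_probs(v, W, c, beta):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i beta_i * v_i * W[i][j])."""
    return [1.0 / (1.0 + math.exp(-(c[j] + sum(beta[i] * v[i] * W[i][j]
                                               for i in range(len(v))))))
            for j in range(len(c))]

def gb_visible_mean(h, W):
    """Mean of p(v | h) under the assumed energy: the vector W h (variance 1/beta_i)."""
    return [sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(W))]

# Hypothetical real-valued feature vector and parameters.
v, W, c, beta = [0.5, -1.0], [[1.0], [2.0]], [0.1], [2.0, 1.0]
features = gb_hidden_probs(v, W, c, beta)    # hidden activations
reconstruction = gb_visible_mean([1], W)     # visible mean given h = [1]
```

One possible reading of sub-steps 2.1 and 2.2 is that the reconstruction serves as the noise-reduced sequence and the hidden activations serve as the extracted features; that mapping is an interpretation, not something the text states.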

the technical effect obtained by adopting the technical scheme is as follows:

The invention can quickly and accurately extract the characteristic sequences of different types of polymorphic worm viruses. Using the characteristic sequences generated by the N-gram smoothing algorithm, it can analyze the potential variation risk in polymorphic worm viruses and in suspected polymorphic worm segment sequences, so that defense is ultimately achieved more accurately and quickly. The training parameters can be adjusted manually to control the training mode of the system and improve the accuracy with which the system extracts polymorphic worms.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, an automatic detection method for polymorphic worm virus comprises the following steps:

step 1, selecting a polymorphic worm data sequence for training, and smoothing the training sequence by using an N-gram smoothing algorithm to obtain a characteristic sequence.

The base formula is constructed under the Markov assumption, with maximum likelihood estimation used for the conditional probabilities; the model is trained on sequence data from various polymorphic worms in order to infer suspicious segment sequences.

The specific operation comprises the following steps:

1.1) firstly segmenting a training sequence according to a flow specification, wherein the training method is as follows;

The polymorphic worm data sequence used for training generally comprises more than one phrase sequence. Each phrase sequence consists of n (n ≥ 1) phrases x_i (1 ≤ i ≤ n), and each sequence must be composed of several words x_i; otherwise it does not conform to the data transmission specification.

The data after segmentation are as follows:

S1：GET...HTTP/1.1\r\n...\r\nHost...

S2：GET...HTTP/1.1\r\n...\r\nHost...

S3：POST...HTTP/1.1\r\n...\r\nHost...
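As an illustration of segmentation, the Python sketch below splits a raw HTTP request into words, keeping each line terminator as its own token. The splitting rule, the function name `segment_request`, and the sample request are hypothetical; the elided sequences S1 to S3 above are not reconstructed here.

```python
import re

def segment_request(raw):
    """Split a raw request into words; each CRLF is kept as a separate token."""
    # The capture group keeps "\r\n" in the output; runs of spaces or tabs are dropped.
    # re.split yields None or "" for unmatched groups and empty pieces, so filter them out.
    return [t for t in re.split(r"(\r\n)|[ \t]+", raw) if t]

# Hypothetical request, not taken from the training data.
sample = "GET /index.html HTTP/1.1\r\nHost: example.com\r\n"
words = segment_request(sample)
```

Each resulting list plays the role of one phrase sequence x₁, ..., x_n in the probability model that follows.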

Taking sentence S1 as an example: S1 is a training polymorphic worm data sequence, and the probability that it forms a sentence of length n is P(S1) = P(x₁, x₂, ..., x_n), where x_i (1 ≤ i ≤ n) are the phrases constituting S1. The following situation can therefore arise: the computer treats the word x₁ = "POST" in sentence S3 as a spelling error, because the sentence probabilities satisfy P(S1) ≈ P(S2) > P(S3), and "GET" ends up as the feature for x₁.

Based on the above description, the formula is first written as:

P(S1)＝p(x₁，x₂，...，x_n)

Then a formula is constructed according to Markov chain theory, in which each word x_i (1 ≤ i ≤ n) is associated only with the preceding m-1 (m ∈ N⁺) words, giving the following formula (1):

Suppose C(x_i) is the number of occurrences of the word x_i in the polymorphic worm data sequence used to train the model, and C(x_{i-n-1}, ..., x_i) is the total number of occurrences of the words x_{i-n-1}, ..., x_{i-1}; the maximum likelihood estimate can then be expressed as:

taking the Apache-Knacker worm sequence as an example, equation (10) can be expressed as:

wherein P₃ is the probability that three words occur in association, and P₂ is the probability that two words occur in association.
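The maximum likelihood estimate can be sketched in Python as a ratio of counts, assuming the usual form P(x_i | context) = C(context, x_i) / C(context). The function name `mle_ngram_prob` and the short token sequences are hypothetical.

```python
from collections import Counter

def mle_ngram_prob(sequences, context, word):
    """Maximum likelihood estimate of P(word | context) from n-gram counts."""
    n = len(context) + 1
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1  # count the context of every observed n-gram
    ctx = tuple(context)
    # An unseen context yields 0.0, which is the zero-probability problem.
    return ngrams[ctx + (word,)] / contexts[ctx] if contexts[ctx] else 0.0

# Hypothetical segmented sequences.
seqs = [["GET", "/", "HTTP/1.1"],
        ["GET", "/", "HTTP/1.1"],
        ["POST", "/", "HTTP/1.1"]]
p_bigram = mle_ngram_prob(seqs, ["GET"], "/")    # C(GET, /) / C(GET)
p_unseen = mle_ngram_prob(seqs, ["PUT"], "/")    # context never observed
```

The zero returned for the unseen context is the case that the smoothing in sub-step 1.2 is meant to handle.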

1.2) carrying out data smoothing on the obtained sequences by using a smoothing algorithm;

The N-gram smoothing algorithm is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.

The Laplace N-gram algorithm may also be referred to as a modified smoothing method. Its first purpose is to mitigate the zero-probability problem: starting from the N-gram model formula under the maximum likelihood probability distribution, Laplace smoothing adds a term V in each operation of the formula, where V is the Laplace smoothing parameter used to adjust the smoothing during the probability distribution calculation. Adding the smoothing parameter V to each operation in formula (10) turns formula (10) into the smoothed formula (1):

Taking equation (3) as the basis for smoothing the data, V is called the Laplace smoothing parameter, and its value ranges from 0.0001 to 1.

Good-Turing is a statistical technique that counts the words in a data set, groups them by the number of times they occur, and finally uses the grouped result to estimate the probability of new words that may appear. The Good-Turing N-gram algorithm can also be called a linear smoothing method, and its specific smoothing form is formula (2):

Claims

1. An automatic detection method for polymorphic worm viruses, characterized by comprising the following steps:

1.1) firstly, segmenting a training sequence according to a flow specification;

2.1) firstly, carrying out noise reduction treatment on the characteristic sequence obtained in the step 1 by utilizing a Gaussian-Bernoulli RBM algorithm to obtain a noise reduction characteristic sequence;

2.2) carrying out feature extraction on the noise reduction feature sequence by utilizing a Gaussian-Bernoulli RBM algorithm to obtain a polymorphic worm feature sequence under a low noise condition.

2. The method according to claim 1, wherein the N-gram smoothing algorithm in substep 1.2 is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.

3. The method according to claim 2, wherein the smooth formula of the Laplace N-gram algorithm is as follows:

4. The method according to claim 3, wherein the value of V ranges from 0.0001 to 1.

5. The method according to claim 2, wherein the Good-Turing N-gram algorithm in substep 1.2 has a smoothing formula of:

6. The automatic detection method for polymorphic worm viruses according to claim 1, wherein the construction of the Gaussian-Bernoulli RBM algorithm in the step 2 is specifically divided into the following sub-steps: