CN111177724A - Automatic detection method for polymorphic worm virus - Google Patents


Publication number
CN111177724A
Authority
CN
China
Prior art keywords
algorithm
sequence
polymorphic
smoothing
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911272282.XA
Other languages
Chinese (zh)
Inventor
王方伟
杨少杰
王长广
李青茹
黄文艳
李军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Normal University
Original Assignee
Hebei Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Normal University
Priority to CN201911272282.XA
Publication of CN111177724A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/145Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses an automatic detection method for polymorphic worm viruses, belonging to the technical field of machine learning and network security. The method first selects a polymorphic worm data sequence for training, smooths it, and extracts a feature sequence; it then refines the extraction with a Gaussian-Bernoulli RBM algorithm to obtain a new feature sequence, from which a polymorphic worm feature library is formed. The invention can quickly and accurately extract the feature sequences of different types of polymorphic worm viruses, and can analyze the potential mutation risk in polymorphic worm viruses and in suspected polymorphic worm fragment sequences, so that defense is ultimately achieved more accurately and quickly. The training parameters can be adjusted manually to control how the system is trained and to improve the accuracy with which the system extracts polymorphic worm features.

Description

Automatic detection method for polymorphic worm virus
Technical Field
The invention belongs to the technical field of machine learning and network security, and particularly relates to an automatic detection method for polymorphic worm viruses.
Background
A polymorphic worm is a mutating virus whose form can change with every infection. While encrypting itself and preserving the intent of the original worm, it can modify its own code and then spread rapidly through a network, causing great harm to the Internet.
Polymorphic worm viruses such as "zero-day" worms cause great harm to the Internet. Ransomware, one class of polymorphic worm virus, has shifted its targets from consumers to enterprises; this year the enterprise infection rate rose by 12 percent compared with 2018, causing incalculable losses to enterprises. Furthermore, the many variant ransomware worms are harder to detect, and the losses they cause are harder to predict.
In particular, with the arrival of the Internet big-data era, efficiently extracting the features of polymorphic worms is becoming increasingly difficult, and the Internet security problem is especially serious. The detection of and defense against computer worm viruses has therefore become a very important problem in the field of network security.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic detection method for polymorphic worm viruses that can efficiently extract the feature data of polymorphic worm viruses, provide a powerful tool for inspecting and defending against them, and better ensure network security.
To solve the above technical problem, the invention adopts the following technical scheme: an automatic detection method for polymorphic worm viruses, comprising the following steps:
Step 1: select a polymorphic worm data sequence for training, smooth the training sequence using an N-gram smoothing algorithm, and extract a feature sequence. The specific operations are as follows:
1.1) first, segment the training sequence according to the flow specification;
1.2) smooth the resulting sequences using an N-gram smoothing algorithm;
1.3) perform feature extraction on the smoothed data to obtain a feature sequence.
Step 2: denoise the feature sequence obtained in step 1 using a Gaussian-Bernoulli RBM algorithm and extract features, forming a polymorphic worm feature sequence under low-noise conditions. The specific operations are as follows:
2.1) first, denoise the feature sequence obtained in step 1 using the Gaussian-Bernoulli RBM algorithm to obtain a denoised feature sequence;
2.2) perform feature extraction on the denoised feature sequence using the Gaussian-Bernoulli RBM algorithm to obtain a polymorphic worm feature sequence under low-noise conditions.
Step 3: repeat step 2 to obtain a series of polymorphic worm feature sequences, which form a feature library.
Preferably, the N-gram smoothing algorithm in substep 1.2 is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.
The smoothing formula of the Laplace N-gram algorithm is as follows:
Figure BDA0002314550960000021
where P(x_i) is the estimated probability of occurrence of the word x_i, C(x_i) is the number of occurrences of the word x_i in the training polymorphic worm data sequence, and V is the Laplace smoothing parameter.
Preferably, V takes a value in the range 0.0001 to 1.
Preferably, the smoothing formula of the Good-Turing N-gram algorithm in substep 1.2 is:
Figure BDA0002314550960000022
where P(x_i) is the estimated probability of occurrence of the word x_i, and λ denotes the linear interpolation weights; λ_1, λ_2, λ_3 may be assigned arbitrarily, but their sum must be 1.
Preferably, the construction of the Gaussian-Bernoulli RBM algorithm in step 2 is specifically divided into the following sub-steps:
Step 1: construct the RBM algorithm. The RBM joint probability distribution function is shown in formula (3):
Figure BDA0002314550960000031
where v and h are the two subsets into which the Boltzmann machine's n-dimensional binary random vector x ∈ {0, 1}^n is decomposed, namely the visible layer v and the hidden layer h; E denotes the energy function and Z the partition function (the normalization constant), given by formulas (4) and (5):
E(v, h) = -v^T W h - b^T v - c^T h    (4)
Z = \sum_v \sum_h \exp\{-E(v, h)\}    (5)
W is the weight matrix of the model parameters, and b and c are bias vectors;
combining formulas (3), (4), and (5) gives formula (6):
Figure BDA0002314550960000032
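The energy function (4) and partition function (5) can be checked numerically on a very small RBM. The following is a minimal sketch, assuming the joint distribution takes the conventional form P(v, h) = exp(-E(v, h)) / Z; the 2x2 weights and the bias values are hypothetical, and the brute-force sum over all states is only feasible for tiny models.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h, as in formula (4)."""
    vWh = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -vWh - sum(bi * vi for bi, vi in zip(b, v)) - sum(cj * hj for cj, hj in zip(c, h))

def partition(W, b, c):
    """Z = sum over all binary (v, h) of exp(-E(v, h)), as in formula (5)."""
    nv, nh = len(b), len(c)
    return sum(
        math.exp(-energy(v, h, W, b, c))
        for v in itertools.product((0, 1), repeat=nv)
        for h in itertools.product((0, 1), repeat=nh)
    )

def joint_prob(v, h, W, b, c):
    """Assumed conventional form: P(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, W, b, c)) / partition(W, b, c)

# Hypothetical parameters for a 2-visible, 2-hidden RBM.
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
c = [-0.1, 0.2]
```

Summing `joint_prob` over all sixteen (v, h) states gives 1, which is a quick way to confirm that Z normalizes the distribution.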
Step 2: construct the Bernoulli RBM algorithm. The Bernoulli RBM probability distribution function is shown in formula (7):
Figure BDA0002314550960000033
Here σ is given by
Figure BDA0002314550960000034
where x is v or h;
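For a binary (Bernoulli) RBM with the energy function (4), the conditional distributions factorize into logistic sigmoids. The sketch below assumes that σ is the standard logistic function σ(x) = 1 / (1 + exp(-x)) and uses the same bias naming as formula (4) (b for the visible layer, c for the hidden layer); the function names are illustrative only.

```python
import math

def sigmoid(x):
    """Logistic function, assumed here to be the sigma of the text."""
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden_given_visible(v, W, c):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij)."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def p_visible_given_hidden(h, W, b):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]
```

Because each unit's probability depends only on the opposite layer, a whole layer can be sampled in one pass, which is what makes block Gibbs sampling in an RBM practical.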
Step 3: construct the Gaussian-Bernoulli RBM algorithm. The Gaussian distribution is parameterized by the precision matrix, and the probability distribution function of the algorithm is as follows:
Figure BDA0002314550960000035
where
Figure BDA0002314550960000036
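A Gaussian-Bernoulli RBM with a precision-parameterized Gaussian can be illustrated as follows. This is a sketch under stated assumptions, not the formula of the figures above: it assumes a diagonal precision vector beta, a visible conditional v | h ~ Normal(W h, 1/beta), and a hidden conditional P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i beta_i W_ij), which is one common textbook formulation. All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def visible_mean(h, W):
    """Mean of the Gaussian over the visible units: (W h)_i."""
    return [sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(W))]

def sample_visible(h, W, beta, rng):
    """Draw v_i ~ Normal(mean_i, 1 / beta_i); beta holds per-unit precisions (assumption)."""
    return [rng.gauss(m, 1.0 / math.sqrt(p)) for m, p in zip(visible_mean(h, W), beta)]

def hidden_prob(v, W, c, beta):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * beta_i * W_ij) under the assumed energy."""
    return [sigmoid(c[j] + sum(v[i] * beta[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def reconstruct(v, W, c, beta):
    """One deterministic up-down pass: real-valued v -> hidden probabilities -> expected v."""
    return visible_mean(hidden_prob(v, W, c, beta), W)
```

A larger precision beta_i means a narrower Gaussian around the mean, so the reconstruction of that visible unit follows the hidden layer more tightly; how the method actually uses this for denoising is not specified in the text above.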
The technical effects obtained by adopting the above technical scheme are as follows:
The invention can quickly and accurately extract the feature sequences of different types of polymorphic worm viruses and, using the feature sequences generated by the N-gram smoothing algorithm, can analyze the potential mutation risk in polymorphic worm viruses and in suspected polymorphic worm fragment sequences, so that defense is ultimately achieved more accurately and quickly. The training parameters can be adjusted manually to control how the system is trained and to improve the accuracy with which the system extracts polymorphic worm features.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an automatic detection method for polymorphic worm virus comprises the following steps:
Step 1: select a polymorphic worm data sequence for training, and smooth the training sequence using an N-gram smoothing algorithm to obtain a feature sequence.
An improved formula is constructed from the original one according to the Markov assumption, maximum likelihood estimation is chosen for the conditional probabilities, and the sequence data of various polymorphic worms are used for training so that suspicious fragment sequences can be inferred.
The specific operation comprises the following steps:
1.1) first, segment the training sequence according to the flow specification; the training method is as follows:
The polymorphic worm data sequence used for training generally contains more than one word sequence. Each word sequence consists of n (n ≥ 1) words x_i (1 ≤ i ≤ n); every sequence must be composed of such words x_i, otherwise it does not conform to the data transmission specification.
The data after segmentation are as follows:
S1:GET...HTTP/1.1\r\n...\r\nHost...
S2:GET...HTTP/1.1\r\n...\r\nHost...
S3:POST...HTTP/1.1\r\n...\r\nHost...
Taking sentence S1 as an example: S1 is a training polymorphic worm data sequence, and the probability of it occurring as a sentence of length n is P(S1) = P(x_1, x_2, ..., x_n), where x_i (1 ≤ i ≤ n) are the words that make up S1. The following situation can therefore arise: the computer treats the word x_1 = "POST" in sentence S3 as a spelling error, because the sentence probabilities satisfy P(S1) ≈ P(S2) > P(S3), and "GET" finally emerges as the feature for x_1.
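The segmentation and the "GET" versus "POST" observation can be sketched in a few lines. The tokenizer below is an assumption (the text does not specify how a flow is split into words): it splits on whitespace and keeps each CRLF as its own word. The three flows are hypothetical stand-ins shaped like S1 to S3, with made-up paths and hosts.

```python
import re
from collections import Counter

def segment(payload):
    """Split a raw request into words; CRLF line breaks are kept as separate words."""
    return re.findall(r"\r\n|\S+", payload)

# Hypothetical flows shaped like S1-S3 (paths and host names are invented).
flows = [
    "GET /a HTTP/1.1\r\nHost: h1\r\n",
    "GET /b HTTP/1.1\r\nHost: h2\r\n",
    "POST /c HTTP/1.1\r\nHost: h3\r\n",
]
sentences = [segment(f) for f in flows]

# Count which word occupies position x_1 across the training sentences.
first_words = Counter(s[0] for s in sentences)
x1_feature = first_words.most_common(1)[0][0]
```

With these three flows, "GET" occupies position x_1 twice and "POST" once, so "GET" is selected as the x_1 feature, mirroring the example in the text.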
Following the above description, the formula is first written as:
P(S1) = P(x_1, x_2, ..., x_n)
Then a formula is constructed according to Markov chain theory, in which each word x_i (1 ≤ i ≤ n) depends only on the m-1 words preceding it (m ∈ N+), giving the following formula (1):
Figure BDA0002314550960000051
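For orientation, an (m-1)-order Markov assumption of the kind just described is conventionally written as the factorization below. This is the standard textbook form, given as an assumption about what the figure expresses rather than a transcription of it; the symbols follow the surrounding text.

```latex
P(S) = P(x_1, x_2, \ldots, x_n) \approx \prod_{i=1}^{n} P\left(x_i \mid x_{i-m+1}, \ldots, x_{i-1}\right)
```

For m = 3 each factor conditions on the two preceding words (a trigram model), and for m = 2 on one preceding word (a bigram model).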
Suppose C(x_i) is the number of occurrences of the word x_i in the polymorphic worm data sequence used to train the model, and C(x_{i-n-1}, ..., x_i) is the total number of occurrences of the words x_{i-n-1}, ..., x_{i-1}; the maximum likelihood estimate can then be expressed as:
Figure BDA0002314550960000052
Taking the Apache-Knacker worm sequence as an example, equation (10) can be expressed as:
Figure BDA0002314550960000053
where P_3 is the probability that three words occur together and P_2 is the probability that two words occur together.
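The maximum likelihood estimate of an N-gram conditional probability is a ratio of counts. The sketch below assumes the conventional form P(word | context) = C(context followed by word) / C(context); the helper names and the tiny corpus are hypothetical and are not taken from the Apache-Knacker data.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def mle_prob(words, context, word):
    """Maximum likelihood estimate: C(context + word) / C(context); 0.0 if the context is unseen."""
    n = len(context) + 1
    joint = ngram_counts(words, n)[tuple(context) + (word,)]
    prior = ngram_counts(words, n - 1)[tuple(context)]
    return joint / prior if prior else 0.0

# Hypothetical training words (three short request lines run together).
corpus = "GET / HTTP/1.1 Host: a GET / HTTP/1.1 Host: b POST / HTTP/1.1 Host: c".split()

p_trigram = mle_prob(corpus, ["GET", "/"], "HTTP/1.1")   # three-word relation
p_bigram = mle_prob(corpus, ["Host:"], "a")              # two-word relation
p_unseen = mle_prob(corpus, ["GET", "/"], "Host:")       # never observed
```

The last value is exactly zero because the trigram never occurs in the training words, which is the zero-probability problem that the smoothing step addresses.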
1.2) smooth the resulting sequences using a smoothing algorithm;
The N-gram smoothing algorithm is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.
The Laplace N-gram algorithm may also be called an improved smoothing method. It first addresses the zero-probability problem that readily arises: starting from the N-gram model formula under the maximum likelihood probability distribution, Laplace smoothing adds a parameter V in each operation of the formula, where V is the Laplace smoothing parameter used to adjust the smoothing during the probability calculation. Adding the smoothing parameter V to each operation in formula (10) turns formula (10) into the smoothed formula (1):
Figure BDA0002314550960000054
Equation (3) is taken as the basis for smoothing the data. V is called the Laplace smoothing parameter, and its value lies in the range 0.0001 to 1.
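The effect of the smoothing parameter V can be shown with ordinary additive smoothing. The sketch below assumes the common form (C(word) + V) / (N + V * |vocabulary|), which may differ in detail from the smoothed formula in the figure; the vocabulary and counts are hypothetical.

```python
from collections import Counter

def additive_prob(counts, vocab, word, v=1.0):
    """Additive estimate (C(word) + v) / (N + v * |vocab|); v plays the role of V in the text."""
    total = sum(counts[w] for w in vocab)
    return (counts[word] + v) / (total + v * len(vocab))

words = ["GET", "GET", "POST"]
vocab = {"GET", "POST", "PUT"}  # "PUT" never appears in the training words
counts = Counter(words)

p_get = additive_prob(counts, vocab, "GET", v=1.0)
p_put_large_v = additive_prob(counts, vocab, "PUT", v=1.0)
p_put_small_v = additive_prob(counts, vocab, "PUT", v=0.0001)
```

With V = 1 the unseen word "PUT" receives a sizeable share of the probability mass, while V = 0.0001 keeps its probability positive but very small, so the choice of V within the stated range trades off how strongly unseen words are weighted.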
Good-Turing is a statistical technique that counts the words in a data set, clusters them by number of occurrences, and finally uses the clustered result to estimate the probability of new words that may appear. The Good-Turing N-gram algorithm may also be called a linear smoothing method; its specific smoothing formula is formula (2):
Figure BDA0002314550960000055
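The description of three weights λ_1, λ_2, λ_3 that must sum to 1 matches linear interpolation of unigram, bigram, and trigram estimates. The sketch below illustrates that technique under this assumption; it is not a transcription of formula (2), and the weights and training words are hypothetical.

```python
from collections import Counter

def counts(words, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def interpolated_prob(words, w1, w2, w3, lambdas=(0.2, 0.3, 0.5)):
    """lambda1 * P(w3) + lambda2 * P(w3 | w2) + lambda3 * P(w3 | w1, w2)."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    uni, bi, tri = counts(words, 1), counts(words, 2), counts(words, 3)
    p1 = uni[(w3,)] / len(words)
    p2 = bi[(w2, w3)] / uni[(w2,)] if uni[(w2,)] else 0.0
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

words = "GET / HTTP/1.1 POST / HTTP/1.1".split()
# "PUT /" was never seen, so the trigram term is 0 and the lower orders carry the estimate.
p = interpolated_prob(words, "PUT", "/", "HTTP/1.1")
```

Because the lower-order terms stay non-zero, an unseen three-word context still yields a usable probability, which is the purpose of interpolating across orders.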

Claims (6)

1. An automatic detection method for polymorphic worm viruses, characterized by comprising the following steps:
Step 1: select a polymorphic worm data sequence for training, smooth the training sequence using an N-gram smoothing algorithm, and extract a feature sequence, the specific operations being as follows:
1.1) first, segment the training sequence according to the flow specification;
1.2) smooth the resulting sequences using an N-gram smoothing algorithm;
1.3) perform feature extraction on the smoothed data to obtain a feature sequence;
Step 2: denoise the feature sequence obtained in step 1 using a Gaussian-Bernoulli RBM algorithm and extract features, forming a polymorphic worm feature sequence under low-noise conditions, the specific operations being as follows:
2.1) first, denoise the feature sequence obtained in step 1 using the Gaussian-Bernoulli RBM algorithm to obtain a denoised feature sequence;
2.2) perform feature extraction on the denoised feature sequence using the Gaussian-Bernoulli RBM algorithm to obtain a polymorphic worm feature sequence under low-noise conditions;
Step 3: repeat step 2 to obtain a series of polymorphic worm feature sequences, which form a feature library.
2. The method according to claim 1, wherein the N-gram smoothing algorithm in substep 1.2 is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.
3. The method according to claim 2, wherein the smoothing formula of the Laplace N-gram algorithm is as follows:
Figure FDA0002314550950000011
where P(x_i) is the estimated probability of occurrence of the word x_i, C(x_i) is the number of occurrences of the word x_i in the training polymorphic worm data sequence, and V is the Laplace smoothing parameter.
4. The method according to claim 3, wherein the value of V lies in the range 0.0001 to 1.
5. The method according to claim 2, wherein the smoothing formula of the Good-Turing N-gram algorithm in substep 1.2 is:
Figure FDA0002314550950000021
where P(x_i) is the estimated probability of occurrence of the word x_i, and λ denotes the linear interpolation weights; λ_1, λ_2, λ_3 may be assigned arbitrarily, but their sum must be 1.
6. The automatic detection method for polymorphic worm viruses according to claim 1, wherein the construction of the Gaussian-Bernoulli RBM algorithm in the step 2 is specifically divided into the following sub-steps:
Step 1: construct the RBM algorithm. The RBM joint probability distribution function is shown in formula (3):
Figure FDA0002314550950000022
where v and h are the two subsets into which the Boltzmann machine's n-dimensional binary random vector x ∈ {0, 1}^n is decomposed, namely the visible layer v and the hidden layer h; E denotes the energy function and Z the partition function (the normalization constant), given by formulas (4) and (5):
E(v, h) = -v^T W h - b^T v - c^T h    (4)
Z = \sum_v \sum_h \exp\{-E(v, h)\}    (5)
W is the weight matrix of the model parameters, and b and c are bias vectors;
combining formulas (3), (4), and (5) gives formula (6):
Figure FDA0002314550950000023
Step 2: construct the Bernoulli RBM algorithm. The Bernoulli RBM probability distribution function is shown in formula (7):
Figure FDA0002314550950000024
Here σ is given by
Figure FDA0002314550950000025
where x is v or h;
Step 3: construct the Gaussian-Bernoulli RBM algorithm. The Gaussian distribution is parameterized by the precision matrix, and the probability distribution function of the algorithm is as follows:
Figure FDA0002314550950000031
where
Figure FDA0002314550950000032
CN201911272282.XA 2019-12-12 2019-12-12 Automatic detection method for polymorphic worm virus Pending CN111177724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911272282.XA CN111177724A (en) 2019-12-12 2019-12-12 Automatic detection method for polymorphic worm virus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911272282.XA CN111177724A (en) 2019-12-12 2019-12-12 Automatic detection method for polymorphic worm virus

Publications (1)

Publication Number Publication Date
CN111177724A true CN111177724A (en) 2020-05-19

Family

ID=70655430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911272282.XA Pending CN111177724A (en) 2019-12-12 2019-12-12 Automatic detection method for polymorphic worm virus

Country Status (1)

Country Link
CN (1) CN111177724A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030021859A (en) * 2001-09-08 2003-03-15 주식회사 비즈모델라인 System and Method for mailing warning message against the worm virus and anti-virus vaccine automatically against it via wireless networks
CN109299357A (en) * 2018-08-31 2019-02-01 昆明理工大学 A kind of Laotian text subject classification method
CN110022313A (en) * 2019-03-25 2019-07-16 河北师范大学 Polymorphic worm feature extraction and polymorphic worm discrimination method based on machine learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
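张绍辉: "A deep RBM algorithm integrating adaptive parameter adjustment and hidden-layer denoising" *
曾键; 赵辉: "An N-Gram-based method for automatic extraction of computer virus signatures" *
林伟; 柳荣其; 徐熙: "Research on an N-Gram-based spam filtering method" *
陈琦: "An RBM-based deep neural network method for phoneme recognition" *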

Similar Documents

Publication Publication Date Title
CN108737406B (en) Method and system for detecting abnormal flow data
CN107832787B (en) Radar radiation source identification method based on bispectrum self-coding characteristics
CN111914253B (en) Method, system, equipment and readable storage medium for intrusion detection
CN110674865B (en) Rule learning classifier integration method oriented to software defect class distribution unbalance
CN110363001B (en) Application layer malicious request detection method based on Transformer model
Wang et al. Evaluating CNN and LSTM for web attack detection
CN113011194B (en) Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113628059B (en) Associated user identification method and device based on multi-layer diagram attention network
CN110019653B (en) Social content representation method and system fusing text and tag network
CN111259397A (en) Malware classification method based on Markov graph and deep learning
CN113642717A (en) Convolutional neural network training method based on differential privacy
Chen et al. SpecMark: A spectral watermarking framework for IP protection of speech recognition systems.
CN107240100B (en) Image segmentation method and system based on genetic algorithm
JP2019086979A (en) Information processing device, information processing method, and program
Perez et al. Efficient projection algorithms onto the weighted ℓ1 ball
CN111177724A (en) Automatic detection method for polymorphic worm virus
CN116502091A (en) Network intrusion detection method based on LSTM and attention mechanism
Zhao et al. Graph similarity metric using graph convolutional network: Application to malware similarity match
CN111079143B (en) Trojan horse detection method based on multi-dimensional feature map
CN111008529B (en) Chinese relation extraction method based on neural network
Huang et al. Research on Malicious URL Identification and Analysis for Network Security
CN112329002B (en) Method for dynamically adjusting execution order of guessing rule according to plaintext characteristics of partial password
CN113918717B (en) Text backdoor defense method for cleaning data
EP4293956A1 (en) Method for predicting malicious domains
Li et al. Similarity measure for multivariate time series based on dynamic time warping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200519