CN111177724A - Automatic detection method for polymorphic worm virus - Google Patents
- Publication number
- CN111177724A (application number CN201911272282.XA)
- Authority
- CN
- China
- Prior art keywords
- algorithm
- sequence
- polymorphic
- smoothing
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/145—Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The invention discloses an automatic detection method for polymorphic worm viruses and belongs to the technical field of machine learning and network security. The method first selects a polymorphic worm data sequence for training, smooths it, and extracts a characteristic sequence; it then performs optimized extraction with a Gaussian-Bernoulli RBM algorithm to obtain a new characteristic sequence, forming a polymorphic worm characteristic library. The invention can quickly and accurately extract the characteristic sequences of different types of polymorphic worm viruses, and can analyze the potential variation risk in polymorphic worm viruses and in suspected polymorphic worm segment sequences, so that defense is ultimately achieved more accurately and quickly. The training parameters can be adjusted manually to control the training mode of the system and improve the accuracy with which the system extracts polymorphic worms.
Description
Technical Field
The invention belongs to the technical field of machine learning and network security, and particularly relates to an automatic detection method for polymorphic worm viruses.
Background
A polymorphic worm is a self-mutating virus whose form can change with every infection. It can modify its own code while encrypting itself and preserving the intent of the original worm, and it then spreads quickly through a network, causing great harm to the Internet.
Polymorphic worm viruses such as 'zero-day' worms cause great harm to the Internet. Ransomware among the polymorphic worm viruses has shifted its targets from consumers to enterprises; this year, the enterprise infection rate rose by 12 percent compared with 2018, causing incalculable losses to enterprises. Furthermore, the many variant worms of ransomware are more difficult to detect, and the losses they cause are more unpredictable.
In particular, with the arrival of the Internet big-data era, efficiently extracting the characteristics of polymorphic worms is becoming more and more difficult, and the Internet security problem is especially serious. The detection of and defense against computer worm viruses has therefore become a very important problem in the field of network security.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic detection method for polymorphic worm viruses that can efficiently extract the characteristic data of polymorphic worm viruses, provide a powerful tool for the inspection of and defense against them, and better ensure network security.
To solve the above technical problem, the invention adopts the following technical scheme: an automatic detection method for polymorphic worm viruses, comprising the following steps:
Step 1: select a polymorphic worm data sequence for training, smooth the training sequence with an N-gram smoothing algorithm, and extract a characteristic sequence. The specific operation comprises the following steps:
1.1) first, segment the training sequence according to the flow specification;
1.2) smooth the resulting sequences with an N-gram smoothing algorithm;
1.3) perform feature extraction on the smoothed data to obtain a characteristic sequence.
Step 2: perform noise reduction on the characteristic sequence obtained in step 1 with a Gaussian-Bernoulli RBM algorithm and extract features, forming a polymorphic worm characteristic sequence under low-noise conditions. The specific operation comprises the following steps:
2.1) first, perform noise reduction on the characteristic sequence obtained in step 1 with the Gaussian-Bernoulli RBM algorithm to obtain a noise-reduced characteristic sequence;
2.2) perform feature extraction on the noise-reduced characteristic sequence with the Gaussian-Bernoulli RBM algorithm to obtain a polymorphic worm characteristic sequence under low-noise conditions.
Step 3: repeat step 2 to obtain a series of polymorphic worm characteristic sequences, which form a characteristic library.
Preferably, the N-gram smoothing algorithm in substep 1.2 is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.
The smoothing formula of the Laplace N-gram algorithm is as follows:
where $P(x_i)$ is the probability of occurrence of a word $x_i$, $C(x_i)$ is the number of occurrences of the word $x_i$ in the training polymorphic worm data sequence, and $V$ is the Laplace smoothing parameter.
Preferably, the value of V ranges from 0.0001 to 1.
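The role of the smoothing parameter V can be illustrated with a minimal sketch. It assumes the common additive form $(C(x_i) + V) / (N + V \cdot |\text{vocabulary}|)$, which is an assumption made here for illustration and not necessarily the exact formula of the patent; the function name `additive_smoothing` and the sample tokens are hypothetical.

```python
from collections import Counter


def additive_smoothing(tokens, vocabulary, v=1.0):
    """Return smoothed unigram probabilities.

    Assumed form (not taken from the patent text):
        P(x) = (C(x) + V) / (N + V * |vocabulary|)
    """
    counts = Counter(tokens)
    denominator = len(tokens) + v * len(vocabulary)
    return {word: (counts[word] + v) / denominator for word in vocabulary}


# Hypothetical sample: "HEAD" is in the vocabulary but never observed.
tokens = ["GET", "GET", "POST", "GET"]
vocabulary = ["GET", "POST", "HEAD"]
probs = additive_smoothing(tokens, vocabulary, v=0.5)
```

With V in the stated range of 0.0001 to 1, a smaller V keeps the estimates closer to the raw relative frequencies, while a larger V moves more probability mass to words that were never observed.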
Preferably, the smoothing formula of the Good-Turing N-gram algorithm in substep 1.2 is:
where $P(x_i)$ is the probability of occurrence of a word $x_i$, $\lambda$ denotes the linear interpolation weights, and $\lambda_1$, $\lambda_2$, $\lambda_3$ may be assigned arbitrarily provided that they sum to 1.
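A weighted combination with three weights that sum to 1 is usually written as a linear interpolation of unigram, bigram, and trigram estimates. The sketch below assumes that form, $\lambda_1 P(x_i) + \lambda_2 P(x_i \mid x_{i-1}) + \lambda_3 P(x_i \mid x_{i-2}, x_{i-1})$, purely for illustration; the function name, the default weights, and the sample tokens are hypothetical.

```python
from collections import Counter


def interpolated_trigram_prob(tokens, w1, w2, w3, lambdas=(0.2, 0.3, 0.5)):
    """Assumed form: l1*P(w3) + l2*P(w3 | w2) + l3*P(w3 | w1, w2)."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Each component is a plain relative-frequency estimate; 0.0 if unseen.
    p_uni = uni[w3] / len(tokens) if tokens else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Hypothetical token stream built from two request lines.
tokens = ["GET", "/index", "HTTP/1.1", "GET", "/login", "HTTP/1.1"]
p_seen = interpolated_trigram_prob(tokens, "GET", "/index", "HTTP/1.1")
p_unseen = interpolated_trigram_prob(tokens, "GET", "/admin", "HTTP/1.1")
```

Because the unigram term is always included, a trigram that never occurred in training (here the one containing "/admin") still receives a non-zero probability.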
Preferably, the construction of the Gaussian-Bernoulli RBM algorithm in step 2 is divided into the following sub-steps:
Step 1: construct the RBM algorithm. The RBM joint probability distribution function is shown in formula (3):
where the variables $v$ and $h$ are the two subsets into which the $n$-dimensional binary random vector $x \in \{0, 1\}^n$ of the Boltzmann machine is decomposed, namely the visible layer $v$ and the hidden layer $h$; $E$ denotes the energy function and $Z$ the partition function (the normalizing constant), given by formulas (4) and (5):
$E(v,h) = -v^{T}Wh - b^{T}v - c^{T}h$ (4)
$Z = \sum_{v}\sum_{h}\exp\{-E(v,h)\}$ (5)
$W$ is the weight matrix of the model parameters, and $b$ and $c$ are bias vectors.
Combining formulas (3), (4), and (5) gives formula (6):
Step 2: construct the Bernoulli RBM algorithm. The Bernoulli RBM probability distribution function is shown in formula (7):
Step 3: construct the Gaussian-Bernoulli RBM algorithm. The Gaussian distribution is parameterized by the precision matrix, and the probability distribution function of the algorithm is as follows:
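The energy function (4) and partition function (5) of the plain binary RBM can be checked numerically on a very small model. The sketch below enumerates every binary state, which is only feasible for a handful of units, and assumes the standard joint distribution $P(v,h) = \exp\{-E(v,h)\} / Z$; that form is consistent with (4) and (5) but is an assumption here. All parameter values are hypothetical, and the sketch does not cover the Gaussian-Bernoulli variant.

```python
import itertools
import math


def energy(v, h, W, b, c):
    """E(v,h) = -v^T W h - b^T v - c^T h, following formula (4)."""
    vWh = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    bv = sum(bi * vi for bi, vi in zip(b, v))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    return -vWh - bv - ch


def partition_function(W, b, c):
    """Z = sum over all binary v and h of exp(-E(v,h)), following formula (5)."""
    return sum(
        math.exp(-energy(v, h, W, b, c))
        for v in itertools.product((0, 1), repeat=len(b))
        for h in itertools.product((0, 1), repeat=len(c))
    )


def joint_probability(v, h, W, b, c):
    """Assumed standard form P(v,h) = exp(-E(v,h)) / Z."""
    return math.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)


# Hypothetical parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.3]
total = sum(
    joint_probability(v, h, W, b, c)
    for v in itertools.product((0, 1), repeat=3)
    for h in itertools.product((0, 1), repeat=2)
)
```

Exhaustive enumeration grows as $2^{n}$ in the total number of units, which is why practical RBM training relies on sampling-based approximations instead of computing Z directly.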
The technical effects obtained by adopting the above technical scheme are as follows:
The invention can quickly and accurately extract the characteristic sequences of different types of polymorphic worm viruses. Using the characteristic sequences generated by the N-gram smoothing algorithm, it can analyze the potential variation risk in polymorphic worm viruses and in suspected polymorphic worm segment sequences, so that defense is ultimately achieved more accurately and quickly. The training parameters can be adjusted manually to control the training mode of the system and improve the accuracy with which the system extracts polymorphic worms.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an automatic detection method for polymorphic worm viruses comprises the following steps:
Step 1: select a polymorphic worm data sequence for training, and smooth the training sequence with an N-gram smoothing algorithm to obtain a characteristic sequence.
An improved form of the original formula is constructed according to the Markov assumption, maximum likelihood estimation is used for the conditional probabilities, and sequence data of various polymorphic worms are used for training in order to infer suspicious segment sequences.
The specific operation comprises the following steps:
1.1) First, segment the training sequence according to the flow specification, as follows.
The polymorphic worm data sequence used for training generally comprises more than one phrase sequence. Each phrase sequence consists of n (n ≥ 1) words $x_i$ (1 ≤ i ≤ n), and each sequence must be composed of several words $x_i$; otherwise it does not meet the data transmission specification.
The data after segmentation are as follows:
S1:GET...HTTP/1.1\r\n...\r\nHost...
S2:GET...HTTP/1.1\r\n...\r\nHost...
S3:POST...HTTP/1.1\r\n...\r\nHost...
Taking sentence S1 as an example: S1 is a training polymorphic worm data sequence, and the probability of it being a sentence of length n is $P(S1) = P(x_1, x_2, \ldots, x_n)$, where $x_i$ ($1 \le i \le n$) are the words constituting S1. The following case can therefore arise: the computer regards the word $x_1$ = "POST" in sentence S3 as a spelling error, because the sentence probabilities satisfy $P(S1) \approx P(S2) > P(S3)$, and in the end "GET" emerges as the feature for $x_1$.
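The observation that "GET" emerges as the feature for the first word can be illustrated with a short sketch. The three request strings are hypothetical stand-ins for S1 to S3 (the original sequences are elided), and whitespace splitting is an assumed, simplified segmentation rule rather than the patent's flow specification.

```python
from collections import Counter

# Hypothetical stand-ins for S1, S2, and S3; paths and hosts are made up.
flows = [
    "GET /a HTTP/1.1\r\nHost: example.test\r\n",
    "GET /b HTTP/1.1\r\nHost: example.test\r\n",
    "POST /c HTTP/1.1\r\nHost: example.test\r\n",
]


def tokenize(flow):
    """Split a raw request into whitespace-delimited words x_1..x_n."""
    return flow.split()


def position_feature(flows, position=0):
    """Return the most frequent word at a position and its relative frequency."""
    words = [tokenize(f)[position] for f in flows if len(tokenize(f)) > position]
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


feature, freq = position_feature(flows)
```

In this toy sample the first-position feature is "GET" with relative frequency 2/3, while "POST" is the minority value.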
According to the above description, the formula is first written as:
$P(S1) = P(x_1, x_2, \ldots, x_n)$
Then a formula is constructed according to Markov chain theory: each word $x_i$ ($1 \le i \le n$) is related only to the preceding $m-1$ words ($m \in \mathbb{N}^{+}$), which gives formula (1):
Suppose $C(x_i)$ is the number of occurrences of the word $x_i$ in the polymorphic worm data sequence used to train the model, and $C(x_{i-n-1}, \ldots, x_i)$ is the total number of occurrences of the word sequence $x_{i-n-1}, \ldots, x_i$; the maximum likelihood estimate can then be expressed as:
Taking the Apache-Knacker worm sequence as an example, equation (10) can be expressed as:
where $P_3$ is the probability of three words occurring together and $P_2$ is the probability of two words occurring together.
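The maximum likelihood estimate of an N-gram conditional probability is commonly computed as the count of the full N-gram divided by the count of its context. The sketch below assumes that standard ratio; the function name and the token stream are hypothetical and are not taken from the Apache-Knacker data.

```python
from collections import Counter


def mle_ngram_prob(tokens, context, word):
    """MLE estimate P(word | context) = C(context + word) / C(context).

    Contexts are counted only where a following word exists, so the
    estimates for a given context sum to 1.
    """
    context = tuple(context)
    n = len(context) + 1
    windows = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in windows)
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in windows)
    denominator = contexts[context]
    return ngrams[context + (word,)] / denominator if denominator else 0.0


# Hypothetical token stream built from three request lines.
tokens = ["GET", "/a", "HTTP/1.1", "GET", "/b", "HTTP/1.1", "POST", "/a", "HTTP/1.1"]
p2 = mle_ngram_prob(tokens, ["GET"], "/a")                # two-word estimate
p3 = mle_ngram_prob(tokens, ["GET", "/a"], "HTTP/1.1")    # three-word estimate
p_zero = mle_ngram_prob(tokens, ["POST"], "/b")           # unseen pair
```

The unseen pair receives probability zero under pure maximum likelihood, which is the zero-probability problem that the smoothing step in 1.2 is meant to address.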
1.2) Smooth the resulting sequences with a smoothing algorithm.
The N-gram smoothing algorithm is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.
The Laplace N-gram algorithm may also be called a modified smoothing method. It mainly addresses the zero-probability problem. Starting from the N-gram model formula under the maximum likelihood probability distribution, Laplace smoothing adds a term V to each operation in the formula, where V is the Laplace smoothing parameter used to adjust the smoothing in the probability distribution calculation. Adding the smoothing parameter V to each operation in formula (10) turns formula (10), after smoothing, into formula (1):
Taking equation (3) as the basis for smoothing the data, V is called the Laplace smoothing parameter, and its value ranges from 0.0001 to 1.
Good-Turing is a statistical technique that counts the words in a data set, groups them by the number of times they occur, and finally uses the grouped result to estimate the probability of new words that may appear. The Good-Turing N-gram algorithm may also be called a linear smoothing method, and its specific smoothing form is formula (2):
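The grouping-by-occurrence idea described for Good-Turing can be sketched with the basic textbook estimator, in which $N_c$ is the number of distinct words seen exactly $c$ times, the mass reserved for unseen words is $N_1 / N$, and the adjusted count is $c^* = (c+1) N_{c+1} / N_c$. This is the classical estimator, given here as an assumption for illustration and not as the patent's formula (2); the function names and tokens are hypothetical.

```python
from collections import Counter


def frequency_of_frequencies(tokens):
    """Return N_c: how many distinct words occur exactly c times."""
    return Counter(Counter(tokens).values())


def unseen_mass(tokens):
    """Probability mass reserved for words never seen: N_1 / N."""
    return frequency_of_frequencies(tokens)[1] / len(tokens)


def adjusted_count(tokens, c):
    """Basic Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    n_c = frequency_of_frequencies(tokens)
    if n_c[c] == 0:
        raise ValueError("no word occurs exactly c times")
    return (c + 1) * n_c[c + 1] / n_c[c]


# Hypothetical sample: GET x3, POST x2, HEAD x1, PUT x1.
tokens = ["GET", "GET", "GET", "POST", "POST", "HEAD", "PUT"]
mass = unseen_mass(tokens)
c_star_1 = adjusted_count(tokens, 1)
```

Practical implementations usually smooth the $N_c$ values first, because $N_{c+1}$ is often zero for large $c$ and the raw ratio then collapses to zero.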
Claims (6)
1. An automatic detection method for polymorphic worm viruses, characterized by comprising the following steps:
Step 1: select a polymorphic worm data sequence for training, smooth the training sequence with an N-gram smoothing algorithm, and extract a characteristic sequence. The specific operation comprises the following steps:
1.1) first, segment the training sequence according to the flow specification;
1.2) smooth the resulting sequences with an N-gram smoothing algorithm;
1.3) perform feature extraction on the smoothed data to obtain a characteristic sequence.
Step 2: perform noise reduction on the characteristic sequence obtained in step 1 with a Gaussian-Bernoulli RBM algorithm and extract features, forming a polymorphic worm characteristic sequence under low-noise conditions. The specific operation comprises the following steps:
2.1) first, perform noise reduction on the characteristic sequence obtained in step 1 with the Gaussian-Bernoulli RBM algorithm to obtain a noise-reduced characteristic sequence;
2.2) perform feature extraction on the noise-reduced characteristic sequence with the Gaussian-Bernoulli RBM algorithm to obtain a polymorphic worm characteristic sequence under low-noise conditions.
Step 3: repeat step 2 to obtain a series of polymorphic worm characteristic sequences, which form a characteristic library.
2. The method according to claim 1, wherein the N-gram smoothing algorithm in substep 1.2 is a Laplace N-gram algorithm or a Good-Turing N-gram algorithm.
3. The method according to claim 2, wherein the smoothing formula of the Laplace N-gram algorithm is as follows:
where $P(x_i)$ is the probability of occurrence of a word $x_i$, $C(x_i)$ is the number of occurrences of the word $x_i$ in the training polymorphic worm data sequence, and $V$ is the Laplace smoothing parameter.
4. The method according to claim 3, wherein the value of V ranges from 0.0001 to 1.
6. The automatic detection method for polymorphic worm viruses according to claim 1, wherein the construction of the Gaussian-Bernoulli RBM algorithm in step 2 is divided into the following sub-steps:
Step 1: construct the RBM algorithm. The RBM joint probability distribution function is shown in formula (3):
where the variables $v$ and $h$ are the two subsets into which the $n$-dimensional binary random vector $x \in \{0, 1\}^n$ of the Boltzmann machine is decomposed, namely the visible layer $v$ and the hidden layer $h$; $E$ denotes the energy function and $Z$ the partition function (the normalizing constant), given by formulas (4) and (5):
$E(v,h) = -v^{T}Wh - b^{T}v - c^{T}h$ (4)
$Z = \sum_{v}\sum_{h}\exp\{-E(v,h)\}$ (5)
$W$ is the weight matrix of the model parameters, and $b$ and $c$ are bias vectors.
Combining formulas (3), (4), and (5) gives formula (6):
Step 2: construct the Bernoulli RBM algorithm. The Bernoulli RBM probability distribution function is shown in formula (7):
Step 3: construct the Gaussian-Bernoulli RBM algorithm. The Gaussian distribution is parameterized by the precision matrix, and the probability distribution function of the algorithm is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911272282.XA CN111177724A (en) | 2019-12-12 | 2019-12-12 | Automatic detection method for polymorphic worm virus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111177724A (en) | 2020-05-19 |
Family
ID=70655430
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911272282.XA Pending CN111177724A (en) | 2019-12-12 | 2019-12-12 | Automatic detection method for polymorphic worm virus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111177724A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200519 |