CN108768954B

CN108768954B - DGA malicious software identification method

Info

Publication number: CN108768954B
Application number: CN201810419555.8A
Authority: CN
Inventors: 罗熙; 徐震; 王利明; 杨婧
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2020-07-10
Anticipated expiration: 2038-05-04
Also published as: CN108768954A

Abstract

The invention discloses a DGA malicious software identification method which can quickly identify DGA malicious software by exploiting a weakness of DGA technology. Because a host infected by DGA malware does not know the domain name of its control server, it must keep generating random domain names and attempting to connect until it successfully reaches the control server. Based on this weakness, and borrowing the idea of a random walk, the invention treats each failed domain name connection of the host as one random walk step, provides a method for calculating the random walk increment, and judges whether the host is infected by DGA malicious software by comparing the number of random walk steps and the accumulated random walk increment with preset thresholds. The method can complete detection before the infected host connects to the control server, effectively suppresses the use of DGA malicious software, and has broad application prospects in the field of network security.

Description

DGA malicious software identification method

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a DGA (Domain Generation Algorithm) malicious software identification method.

Background

Early DGA techniques were mainly used by botnets, and many related detection methods detect DGA malware based on characteristics of botnets (including synchronicity, periodicity, and node correlation). More recently, however, DGA techniques have been used in ransomware. Ransomware does not exhibit the botnet characteristics described above, so the conventional detection methods are difficult to apply to such scenarios.

The key weakness of DGA malware is that the infected host does not know the domain name of the control server: the infected host must continuously generate domain names through a random algorithm and try to connect until it successfully connects to the domain name of the control server. DGA malware can therefore be identified by analyzing the number of domain names for which the connection fails and the characteristics of those domain names.

The detection methods proposed by patent CN106576058A "System and method for detecting domain generation algorithm malware and systems infected by such malware", patent CN106992969A "Detection method for DGA-generated domain names based on statistical features of domain name strings", patent CN105577660A "DGA domain name detection method based on random forest", patent CN107046586A "Detection method for algorithmically generated domain names based on natural-language-like features", and patent US2013191915 (A1) "Method and system for detecting DGA-based malware" all take a single domain name, or parameters related to that domain name, as the object of analysis and detection. Because a large number of normal domain names exist in real network environments, especially short domain names, these detection methods all have high false alarm rates.

Disclosure of Invention

Problem solved by the invention: targeting the key weakness of DGA malware, namely that an infected host does not know the domain name of the control server, a DGA malware identification method is provided that can detect the infected host before it connects to the control server.

Technical solution: to achieve this purpose, the invention adopts the following technical scheme.

A DGA malware identification method comprising the steps of:

a) Each failed domain name connection of the host is treated as one random walk step; the random walk increment Δ_i is calculated from the domain name of the host's i-th failed connection, and the sum Λ_n of the first n random walk increments is obtained;

b) When Λ_n is greater than a preset upper threshold B_u, or the number n of random walk steps exceeds a preset threshold B_s, the host is determined to be infected with DGA malware; when Λ_n is less than a lower threshold B_l, the host is determined not to be infected with DGA malware;

c) When the host is determined to be infected, an alarm is raised and Λ₁ is reset to 0; when the host is determined to be in the normal state, Λ₁ is reset to 0 directly.

Further, in step a), the random walk increment Δ_i is calculated by

[formula not reproduced in this text]

where l is the domain name length, and Pr(α₀) and Pr(α_k | α_{k-1}) are derived statistically from the top 10 million Alexa-ranked domain names. Pr(α₀) is the statistical probability that the initial character of a domain name is α₀, and Pr(α_k | α_{k-1}) is the probability that the k-th character is α_k given that the (k-1)-th character is α_{k-1}.

Further, Pr(α_k | α_{k-1}) is calculated as the ratio of two counts: the number of times the two-character sequence α_{k-1}α_k occurs in all domain names, divided by the number of times any two-character sequence beginning with α_{k-1} occurs in all domain names.
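The two statistics described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `estimate_char_model`, the lowercasing step, the absence of smoothing, and the four-name stand-in corpus are all assumptions made here (the source derives the statistics from Alexa-ranked domain names).

```python
from collections import Counter


def estimate_char_model(domains):
    """Estimate Pr(a0) and Pr(a_k | a_{k-1}) from a list of domain names.

    Hypothetical helper: no smoothing is applied, so unseen characters and
    unseen two-character sequences are simply absent from the result.
    """
    first_counts = Counter()   # how often each character starts a name
    bigram_counts = Counter()  # occurrences of each two-character sequence
    prefix_counts = Counter()  # two-character sequences starting with a character
    total = 0
    for name in domains:
        name = name.lower()    # assumption: case-insensitive statistics
        if not name:
            continue
        total += 1
        first_counts[name[0]] += 1
        for prev, cur in zip(name, name[1:]):
            bigram_counts[(prev, cur)] += 1
            prefix_counts[prev] += 1
    p_first = {c: n / total for c, n in first_counts.items()}
    p_cond = {
        (prev, cur): n / prefix_counts[prev]
        for (prev, cur), n in bigram_counts.items()
    }
    return p_first, p_cond


# Tiny stand-in corpus (assumption), used only to exercise the function.
corpus = ["google", "github", "gmail", "amazon"]
p_first, p_cond = estimate_char_model(corpus)
print(p_first["g"])          # share of names starting with "g"
print(p_cond[("g", "o")])    # share of "g?" sequences that are "go"
```

A production version would likely need smoothing for unseen sequences, since a zero probability makes any product or logarithm of these values degenerate; the source text does not say how this case is handled.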

Further, in step b), the upper threshold B_u and the lower threshold B_l are calculated from the miss rate fnr and the false alarm rate fpr by

[formula not reproduced in this text]

The miss rate is the rate at which an infected host is determined to be normal; the false alarm rate is the rate at which a host that is not infected is nevertheless determined to be infected.

Further, in step b), the miss rate fnr, the false alarm rate fpr, and the threshold B_s can be determined comprehensively according to factors such as system security requirements and current network conditions.
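For orientation only, the sketch below shows how thresholds of this kind are commonly derived in a classical sequential test. It uses Wald's approximations for the sequential probability ratio test in log form, which is an assumption swapped in here and may differ from the patent's own formula; the function name `sprt_log_thresholds` is hypothetical.

```python
import math


def sprt_log_thresholds(fnr, fpr):
    """Wald's approximate SPRT thresholds in log form (assumed, not from the source).

    fnr: tolerated miss rate, fpr: tolerated false alarm rate.
    Returns (upper, lower); the upper bound is positive and the lower bound
    negative whenever fnr + fpr < 1.
    """
    if not (0 < fnr < 1 and 0 < fpr < 1):
        raise ValueError("fnr and fpr must lie strictly between 0 and 1")
    upper = math.log((1 - fnr) / fpr)   # cross above: decide "infected"
    lower = math.log(fnr / (1 - fpr))   # cross below: decide "normal"
    return upper, lower


b_u, b_l = sprt_log_thresholds(fnr=0.01, fpr=0.001)
print(b_u, b_l)
```

Under this assumed form, tightening fpr raises the upper bound, so more failed connections must accumulate before an alarm, while tightening fnr lowers the lower bound, so a host needs more evidence before it is cleared.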

Compared with the prior art, the invention has the following beneficial effects. Existing DGA malware identification methods can be broadly divided into two categories. The first category judges DGA domain names by extracting and analyzing the features of single domain names; because a large number of normal but irregular domain names exist in real network environments, particularly short domain names, these methods have high false alarm rates. The second category is based on the characteristics of botnets: it judges whether a domain name is abnormal by analyzing the features of multiple connections, so a DGA domain name can be detected only after the connection request has completed. The DGA malware identification method of the invention, based on a threshold random walk algorithm, does not need malicious samples as a training set and can complete detection before an infected host connects to the control server; the threshold random walk algorithm maximizes the detection rate while guaranteeing detection accuracy. Experiments verify that the false alarm rate can be kept below 3%, which demonstrates the effectiveness of the method.

Drawings

FIG. 1 is a diagram of a finite state machine according to the present invention.

Detailed Description

The following detailed description of specific embodiments of the invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but not to limit the scope of the invention.

The miss rate fnr and the false alarm rate fpr are set first, and the upper threshold B_u and the lower threshold B_l for the random walk increment sum are calculated from fnr and fpr. The step-count threshold B_s of the random walk is left unset for the time being.

For example, if fnr is set to 0.01 and fpr is set to 0.001, B_u and B_l can be calculated:

[calculated values not reproduced in this text]

The maximum number of end steps S for normal access under these parameters is then measured, and the step-count threshold B_s of the random walk is set according to S.

For example, if S is 12, B_s may be set to 15.
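The parameter choice in this example can be checked against the setting intervals stated in claim 4 (fnr in (0, 0.01], fpr in (0, 0.001], B_s in [S, S × 150%]). The following is a small sketch; the function names `step_threshold_range` and `check_parameters` are hypothetical, and rounding the upper end of the B_s interval down to an integer is an assumption.

```python
def step_threshold_range(s):
    """Return the inclusive interval [S, S * 150%] for the step threshold B_s.

    Assumption: the upper end is rounded down to a whole number of steps.
    """
    if s < 1:
        raise ValueError("S must be a positive number of steps")
    return s, (3 * s) // 2


def check_parameters(fnr, fpr, b_s, s):
    """Check fnr, fpr and B_s against the intervals given in claim 4."""
    low, high = step_threshold_range(s)
    return 0 < fnr <= 0.01 and 0 < fpr <= 0.001 and low <= b_s <= high


print(step_threshold_range(12))                               # interval for S = 12
print(check_parameters(fnr=0.01, fpr=0.001, b_s=15, s=12))    # the example above
```

With S = 12 the interval is 12 to 18 steps, so the example value B_s = 15 falls inside it.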

In the detection phase, as shown in FIG. 1, Λ₁ = 0 in the initial state. When the host attempts a domain name connection, Λ_n remains unchanged if the connection succeeds; if the connection fails, the random walk increment Δ_i is calculated according to the following formula, and the random walk increment sum Λ_n = ∑_i Δ_i is updated.

[formula not reproduced in this text]

where l is the domain name length, Pr(α₀) is the statistical probability that the initial character of a domain name is α₀, and Pr(α_k | α_{k-1}) is the probability that the k-th character is α_k given that the (k-1)-th character is α_{k-1}. Pr(α_k | α_{k-1}) is calculated as the ratio of the number of occurrences of the two-character sequence α_{k-1}α_k in all domain names to the number of occurrences of two-character sequences beginning with α_{k-1} in all domain names.

When Λ_n is greater than the preset upper threshold B_u, or the number of random walk steps exceeds the preset threshold B_s, the host is determined to be infected with DGA malware; when Λ_n is less than the lower threshold B_l, the host is determined not to be infected with DGA malware.

When the host is determined to be infected, an alarm is raised and the host returns to the initial state, i.e., Λ₁ is reset to 0; when the host is determined to be normal, it returns directly to the initial state, again with Λ₁ reset to 0.
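The detection loop described above can be sketched as a small state machine. This is an illustrative sketch under stated assumptions, not the patented implementation: the class name `ThresholdRandomWalk` is hypothetical, the increment function is passed in as a parameter because the Δ_i formula is not reproduced in this text, `toy_increment` is a made-up stand-in, and resetting the step count together with Λ on return to the initial state is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ThresholdRandomWalk:
    increment: Callable[[str], float]  # stand-in for the Δ_i calculation
    b_u: float                         # upper threshold B_u
    b_l: float                         # lower threshold B_l
    b_s: int                           # step-count threshold B_s
    total: float = 0.0                 # running sum Λ_n
    steps: int = 0                     # number of random walk steps n

    def observe(self, domain: str, connected: bool) -> Optional[str]:
        """Feed one connection attempt; return "infected", "normal" or None."""
        if connected:
            return None                # successful connections leave Λ_n unchanged
        self.steps += 1
        self.total += self.increment(domain)
        if self.total > self.b_u or self.steps > self.b_s:
            verdict = "infected"       # the caller is expected to raise the alarm
        elif self.total < self.b_l:
            verdict = "normal"
        else:
            return None                # undecided, keep walking
        # Return to the initial state (assumption: the step count resets too).
        self.total, self.steps = 0.0, 0
        return verdict


def toy_increment(domain: str) -> float:
    """Made-up scoring rule for demonstration only: digits look suspicious."""
    return 1.0 if any(ch.isdigit() for ch in domain) else -1.0


det = ThresholdRandomWalk(toy_increment, b_u=2.5, b_l=-2.5, b_s=15)
verdicts = [det.observe(d, connected=False) for d in ["x1a9", "q7w3", "z0k2"]]
print(verdicts)
```

Keeping one detector instance per host matches the per-host judgment in the text; how hosts are keyed and how failed connections are observed (for example, from DNS or connection logs) is left open by the source.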

The above embodiments are merely illustrative, and not restrictive, and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and therefore all equivalent technical solutions are intended to be included within the scope of the invention.

Claims

1. A DGA malware identification method, characterized by comprising the following steps:

(1) each failed domain name connection of the host is treated as one random walk step; the random walk increment Δ_i is calculated from the domain name of the host's i-th failed connection, and the sum Λ_n of the first n random walk increments is obtained;

(2) when Λ_n is greater than a preset upper threshold B_u, or n exceeds a preset threshold B_s, the host is determined to be infected with DGA malware; when Λ_n is less than a lower threshold B_l, the host is determined not to be infected with DGA malware;

(3) when the host is determined to be infected, an alarm is raised and Λ₁ is reset to 0; when the host is determined to be in the normal state, Λ₁ is reset to 0 directly;

in step (1), the random walk increment Δ_i is calculated by

[formula not reproduced in this text]

where l is the domain name length, Pr(α₀) is the statistical probability that the initial character of a domain name is α₀, and Pr(α_k | α_{k-1}) is the probability that the k-th character is α_k given that the (k-1)-th character is α_{k-1};

in step (1), the random walk increment sum Λ_n is Λ_n = ∑_i Δ_i;

in step (2), the upper threshold B_u and the lower threshold B_l are calculated from the miss rate fnr and the false alarm rate fpr by

[formula not reproduced in this text]

2. The DGA malware identification method of claim 1, wherein Pr(α_k | α_{k-1}) is calculated by the following formula:

[formula and symbol definitions not reproduced in this text]

3. The DGA malware identification method of claim 1, wherein in step (1), if the domain name accessed by the host is successfully connected, Λ_n remains unchanged.

4. The DGA malware identification method of claim 1, wherein in step (2), the miss rate fnr, the false alarm rate fpr, and the threshold B_s are set according to the following principles: the miss rate fnr is set within the interval (0, 0.01] to ensure that as many abnormal accesses as possible are identified; the false alarm rate fpr is set within the interval (0, 0.001] to ensure that normal access completes the whole identification process in a short time; the threshold B_s is set within the interval [S, S × 150%], where S is the maximum number of end steps in normal access when the threshold B_s is not set, that is, the maximum number of random walk steps required for the host to be determined to be in the normal state.