CN113792291B

CN113792291B - Host recognition method and device infected by domain generation algorithm malicious software

Info

Publication number: CN113792291B
Application number: CN202111063886.0A
Authority: CN
Inventors: 张道娟; 房磊; 张錋; 张英杰
Original assignee: Global Energy Interconnection Research Institute
Current assignee: Global Energy Interconnection Research Institute
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2023-08-18
Anticipated expiration: 2041-09-10
Also published as: CN113792291A

Abstract

The invention provides a method and device for identifying hosts infected by domain generation algorithm (DGA) malware. The method comprises: extracting a plurality of feature vectors according to the time intervals between a host's queries for non-existent domains; clustering the plurality of feature vectors to form at least one cluster; and performing outlier analysis on the clusters to determine a malicious cluster, the hosts corresponding to the feature vectors in the malicious cluster being determined as infected hosts. Feature vectors extracted from the query time intervals for non-existent domains allow hosts infected by DGA-based malware to be identified accurately. Analyzing clusters allows a large number of feature vectors to be processed quickly, so infected hosts are determined more efficiently, and because the relations among feature vectors are taken into account after clustering, the identification result is more accurate.

Description

Host recognition method and device infected by domain generation algorithm malicious software

Technical Field

The invention relates to the technical field of computer network security, in particular to a host recognition method and device infected by domain generation algorithm malicious software.

Background

A domain generation algorithm (DGA) is an advanced DNS technique that malware families often use to evade domain-name blacklist detection. An attacker can periodically generate thousands of domains that may serve for command and control (C&C) communication, and selects a small subset of them for the actual C&C. Because the C&C domains are randomly generated and short-lived, detection methods that rely on a static domain list become ineffective.

The generated domains are computed from a given seed, which may be a numeric constant, the current date or time, and so on, and they consist of random, unreadable character strings. In most cases, the feature distribution of randomly generated domains differs considerably from that of legitimate domains, so DGA malware can be detected from lexical properties. However, an attacker can adapt the DGA-generated domains to imitate the character distribution of popular domains or words and thereby escape such detection. In this case, more intrinsic features must be extracted to detect DGA malware.

The most common time-based features for detecting DGA-based malware are the periodicity of the C&C connections and the change points in the NXDomain traffic. Periodicity-based detection requires multiple C&C connections to extract features, and is ineffective if the infected host does not connect to the C&C server periodically. Change point detection, in turn, has difficulty detecting DGA-based malware accurately, because benign hosts are also likely to produce sudden increases in traffic.

Disclosure of Invention

The technical problem to be solved by the invention is therefore to overcome the difficulty of accurately detecting DGA-based malware in the prior art, by providing a method and device for identifying hosts infected by domain generation algorithm malware.

The first aspect of the present invention provides a method for identifying hosts infected by domain generation algorithm malware, comprising: extracting a plurality of feature vectors according to the time intervals between a host's queries for non-existent domains; clustering the plurality of feature vectors to form at least one cluster; and performing outlier analysis on the clusters to determine a malicious cluster, and determining the hosts corresponding to the feature vectors in the malicious cluster as infected hosts.

Optionally, in the method provided by the present invention, extracting feature vectors according to the host's query time intervals for non-existent domains comprises: acquiring the domain names of the non-existent domains queried by the host; filtering out the non-malicious domains among the non-existent domains according to the domain names; and extracting the feature vectors according to the host's query time intervals for the remaining non-existent domains.

Optionally, in the method provided by the present invention, clustering the plurality of feature vectors to form at least one cluster comprises: taking each feature vector as a candidate cluster; extracting the two candidate clusters that are currently closest to each other, comparing the distance between them with their respective internal distances, and judging whether they meet the merging condition; if they meet the merging condition, merging them into a new candidate cluster; and, while other mergeable candidate clusters exist, repeating the extracting, comparing, judging and merging steps until no mergeable candidate clusters remain, whereupon the candidate clusters that currently exist are determined as the clusters.

Optionally, the method provided by the present invention further comprises: if the two currently closest candidate clusters do not meet the merging condition, outputting one of the two candidate clusters.

Optionally, in the method provided by the present invention, a first threshold is formed according to the size of the feature vectors of the two currently closest candidate clusters; a second threshold is formed according to the mean and standard deviation of the internal distances of the two currently closest candidate clusters; and when the distance between the two currently closest candidate clusters is less than the maximum of the first threshold and the second threshold, the two candidate clusters are judged to meet the merging condition and are merged into a new cluster.

Optionally, in the method provided by the present invention, performing outlier analysis on the clusters to determine a malicious cluster comprises: determining a statistical test value for each cluster according to the size of the cluster and the standard deviation; performing a significance test on the statistical test value to judge whether the hypothesis that the cluster is not a very large cluster is rejected; and, if that hypothesis is rejected, determining the cluster to be a malicious cluster.

Optionally, in the method provided by the present invention, the statistical test value of each cluster is determined from the cluster sizes and their standard deviation as $G_i = \frac{|c_i| - \overline{|c|}}{s}$, where $|c_i|$ is the size of cluster $c_i$, $\overline{|c|}$ is the average cluster size, and $s$ is the standard deviation of the cluster sizes.

Optionally, in the method provided by the present invention, the significance test on the statistical test value proceeds as follows: when the test value exceeds the critical value, the hypothesis that the cluster is not a very large cluster is rejected and the cluster is determined to be a malicious cluster. The critical value is computed from the upper critical value of the t-distribution with q-2 degrees of freedom at the adjusted significance level, where α denotes the significance level.

A second aspect of the present invention provides a device for identifying hosts infected by domain generation algorithm malware, comprising: a feature extraction module, configured to extract a plurality of feature vectors according to the time intervals between a host's queries for non-existent domains; a clustering module, configured to cluster the plurality of feature vectors to form at least one cluster; and an infected host identification module, configured to perform outlier analysis on the clusters to determine a malicious cluster, and to determine the hosts corresponding to the feature vectors in the malicious cluster as infected hosts.

A third aspect of the present invention provides a computer apparatus comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to perform a host recognition method infected with domain generation algorithm malware as provided in the first aspect of the invention.

A fourth aspect of the invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform a host recognition method infected with domain generation algorithm malware as provided in the first aspect of the invention.

The technical scheme of the invention has the following advantages:

The host identification method and device provided by the invention extract feature vectors from the time intervals between a host's queries for non-existent domains. After being infected by a malware family, a host queries many non-existent domains with an inherent time delay in order to find the rendezvous point for the command and control (C&C) connection. DGA-based malware families tend to query domains at constant time intervals, and although some sophisticated malware draws its query intervals from a probability distribution (such as a Gaussian distribution) to mask this similarity, the query intervals remain characteristic. Feature vectors extracted from the query time intervals for non-existent domains therefore allow hosts infected by DGA-based malware to be identified accurately. In addition, when infected hosts attempt to connect to the C&C server, they query more non-existent domains than legitimate hosts do, and their feature vectors are more similar to one another. After extracting the feature vectors, the method therefore clusters them and determines the infected hosts by analyzing the clusters. Analyzing clusters allows a large number of feature vectors to be processed quickly, so infected hosts are determined more efficiently, and because the relations among feature vectors are taken into account after clustering, the identification result is more accurate. By implementing the method and device, hosts infected by domain generation algorithm malware can thus be identified more accurately and rapidly.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of one specific example of a host identification method infected with domain generation algorithm malware in an embodiment of the present invention;

FIG. 2 is a workflow diagram of a finite state machine for executing the host recognition method infected by domain generation algorithm malware according to the embodiment of the present invention;

FIG. 3 is a functional block diagram of one specific example of a host identification device infected with domain generation algorithm malware in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.

In the description of the present invention, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The embodiment of the invention provides a host identification method infected by domain generation algorithm malicious software, which comprises the following steps as shown in fig. 1:

Step S10: a plurality of feature vectors are extracted according to the time intervals between a host's queries for non-existent domains (NXDomains). In embodiments of the present invention, because DGA-based malware families typically map their servers to second-level domain names and to domains served by dynamic DNS (e.g., ddns.net), the analysis of non-existent domains covers only second-level domain names and domains served by dynamic DNS.

In an alternative embodiment, when extracting feature vectors according to the host's query time intervals for non-existent domains (NXDomains), the query time intervals may be grouped into one or more time interval sequences, and a separate feature vector is then formed from each time interval sequence. For example, if it is preset that one time interval sequence is formed for every 10 non-existent domains queried by the host, then after the host has queried 20 non-existent domains, a first time interval sequence is formed from the time intervals of the queries for domains 1 to 10, a second time interval sequence is formed from the time intervals of the queries for domains 11 to 20, and a feature vector is extracted from each of the two sequences.

Hosts infected with the same malware family query many NXDomains with an inherent time delay in order to find the rendezvous point for the C&C connection. DGA-based malware families tend to query domains at constant time intervals, and some sophisticated malware draws its query intervals from a probability distribution (e.g., a Gaussian distribution) to mask this similarity; in either case the hosts exhibit similar statistics, such as the mean and standard deviation of the query time intervals. In an alternative embodiment, feature vectors can therefore be constructed from the host's NXDomain query time intervals by extracting the mean, variance, median, maximum, minimum, mode, and so on of the intervals.
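The statistics listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `extract_features` and `split_into_sequences`, the feature order, and the use of the population variance are choices made here, not details given in the text.

```python
import statistics


def split_into_sequences(intervals, length):
    """Cut a list of query intervals into consecutive sequences of `length`.

    An incomplete tail is dropped (an assumption; the text does not say
    what happens to leftover intervals).
    """
    return [intervals[i:i + length]
            for i in range(0, len(intervals) - length + 1, length)]


def extract_features(intervals):
    """Build one feature vector from one sequence of query intervals (seconds).

    Assumed feature order: mean, variance, median, maximum, minimum, mode.
    """
    if not intervals:
        raise ValueError("intervals must be non-empty")
    return [
        statistics.fmean(intervals),
        statistics.pvariance(intervals),  # population variance (assumption)
        statistics.median(intervals),
        max(intervals),
        min(intervals),
        statistics.mode(intervals),       # first mode if several exist
    ]
```

Each sequence returned by `split_into_sequences` would be passed to `extract_features`, giving one feature vector per sequence.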

The features extracted by the embodiments of the present invention can be extracted in a shorter time and are not dependent on a particular temporal pattern.

In an alternative embodiment, since repeatedly queried domains cannot be used to find the C&C connection point, the method provided by the embodiment of the present invention acquires only the query time intervals between the host's queries for distinct domains within one day.

Step S20: the plurality of feature vectors are clustered to form at least one cluster.

In an alternative embodiment, since the number of DGA-based malware families in the monitored network is uncertain, a hierarchical clustering algorithm is used when clustering feature vectors, without the need to input the number of clusters. The hierarchical clustering algorithm merges the most similar cluster pairs as the hierarchy rises until a termination condition is met.

Step S30: outlier analysis is performed on the clusters to determine a malicious cluster, and the hosts corresponding to the feature vectors in the malicious cluster are determined as infected hosts.

In an alternative embodiment, when infected hosts attempt to connect to a C&C server, they query more NXDomains than legitimate hosts do, and their feature vectors are more similar to one another. When malware is present, the number of elements in the corresponding cluster is much larger than in other clusters because the infected hosts behave so similarly. Very large clusters can therefore be detected with an outlier (maximum value) detection algorithm, thereby identifying the malware.

The host identification method provided by the embodiment of the invention extracts feature vectors from the time intervals between a host's queries for non-existent domains. After being infected by a malware family, a host queries many non-existent domains with an inherent time delay in order to find the rendezvous point for the command and control (C&C) connection. DGA-based malware families tend to query domains at constant time intervals, and although some sophisticated malware draws its query intervals from a probability distribution (such as a Gaussian distribution) to mask this similarity, the query intervals remain characteristic. Feature vectors extracted from the query time intervals for non-existent domains therefore allow hosts infected by DGA-based malware to be identified accurately. In addition, when infected hosts attempt to connect to the C&C server, they query more non-existent domains than legitimate hosts do, and their feature vectors are more similar to one another. After extracting the feature vectors, the method therefore clusters them and determines the infected hosts by analyzing the clusters. Analyzing clusters allows a large number of feature vectors to be processed quickly, so infected hosts are determined more efficiently, and because the relations among feature vectors are taken into account after clustering, the identification result is more accurate. By implementing the method provided by the embodiment of the invention, hosts infected by domain generation algorithm malware can thus be identified more accurately and rapidly.

In an alternative embodiment, before step S10 is performed to extract a plurality of feature vectors according to the host's query time intervals for non-existent domains, it is first judged whether the number of NXDomains queried by the host ($n_{nx}$) is greater than a preset value ($n$). When the number of NXDomains queried by the host is less than or equal to the preset value, step S10 is not performed. When it is greater than the preset value, step S10 is executed: let the NXDomains be $d_1, d_2, \ldots, d_n$ and their query timestamps be $t_1, t_2, \ldots, t_n$; the query time intervals are then extracted as $S_t = \{l_k : t_{k+1} - t_k\}$, $k \in [1, n-1]$.

In an alternative embodiment, the preset value may be set according to actual requirements. For example, in a general campus network environment the preset value may be set to 10; when $n_{nx} > 10$, the query time intervals $S_t = \{l_k : t_{k+1} - t_k\}$, $k \in [1, 9]$, are extracted and step S10 is then performed.
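The threshold check and interval extraction can be sketched as below. The function name `query_intervals` is hypothetical, and the sketch takes one possible reading of the text: once more than `preset` NXDomains have been queried, the intervals are computed over the first `preset` timestamps, which gives `preset - 1` intervals (9 for a preset value of 10).

```python
def query_intervals(timestamps, preset=10):
    """Return the query time intervals S_t, or None if too few NXDomains.

    timestamps: query times (seconds) of the NXDomains queried by one host.
    preset:     the preset value n; step S10 runs only when more than
                `preset` NXDomains have been queried.
    """
    if len(timestamps) <= preset:
        return None  # n_nx <= n: step S10 is not performed
    # Assumed reading: use the first `preset` timestamps t_1..t_n.
    ordered = sorted(timestamps)[:preset]
    # l_k = t_{k+1} - t_k for k in [1, n-1]
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

Other readings, such as using all available timestamps, are equally easy to implement by dropping the slice.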

In an alternative embodiment, the step S10 specifically includes:

First, the domain names of the non-existent domains queried by the host are obtained.

In an alternative embodiment, the NXDomain traffic generated in the monitored network may be collected, so as to obtain the domain names of the non-existent domains queried by hosts in the monitored network.

Non-malicious domains in the non-existing domain are then filtered based on the domain name.

In an alternative embodiment, the non-existent domain may be considered to be a non-malicious domain when the domain name satisfies one of the following conditions:

1. Invalid top-level domain name: if the top-level domain name of the non-existent domain queried by the host is not in a preset list of registered top-level domain names, the non-existent domain is a non-malicious domain. Illustratively, the preset list may be the list of registered top-level domain names provided by the IANA.

2. Irregular characters: if the non-existent domain queried by the host contains illegal characters, it is judged to be a non-malicious domain. Illegal characters are characters that a legal domain cannot contain. For example, if a legal domain consists only of letters, numbers, dashes and hyphens, a queried non-existent domain containing any other character was probably caused by a typing error or a misconfiguration and is a non-malicious domain.

3. Popular domain name: if the domain name of the non-existent domain queried by the host is a preset popular domain name, the non-existent domain is a non-malicious domain. Illustratively, the preset popular domain names may be the top 100,000 domain names in Alexa and the websites of the Forbes 500 companies, which are popular legitimate domain names. Since such NXDomains are mostly used by legitimate services to transmit one-time signals, these non-existent domains may be determined to be non-malicious domains.
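The three filtering conditions above can be sketched as a single predicate. Everything named here is a placeholder: `VALID_TLDS` stands in for the IANA list, `POPULAR_DOMAINS` for the popular-domain list, and the allowed character set (letters, digits, hyphen, with dots as label separators) is an assumption about what counts as legal.

```python
import re

# Hypothetical stand-ins; a real deployment would load the full lists.
VALID_TLDS = {"com", "net", "org", "cn"}
POPULAR_DOMAINS = {"example.com"}
# Assumed legal characters: letters, digits, hyphen, and dot separators.
LEGAL_NAME = re.compile(r"^[A-Za-z0-9.-]+$")


def is_non_malicious(domain):
    """Return True if the NXDomain should be filtered out before step S10."""
    name = domain.rstrip(".").lower()
    labels = name.split(".")
    # 1. Invalid top-level domain name.
    if labels[-1] not in VALID_TLDS:
        return True
    # 2. Irregular characters (likely typos or misconfiguration).
    if not LEGAL_NAME.match(name):
        return True
    # 3. Popular domain name (the name itself or its last two labels).
    registered = ".".join(labels[-2:])
    return name in POPULAR_DOMAINS or registered in POPULAR_DOMAINS


def keep_suspicious(domains):
    """Keep only the NXDomains whose query intervals feed feature extraction."""
    return [d for d in domains if not is_non_malicious(d)]
```

Matching on the last two labels is a simplification; names under multi-label public suffixes would need a proper suffix list.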

Finally, feature vectors are extracted according to the query time interval of the host computer for other non-existence domains except the non-malicious domain, and the detailed description of the step S10 is referred to above.

In the host recognition method infected by the domain generation algorithm malicious software provided by the embodiment of the invention, the non-malicious domain is filtered, and then the feature vector is extracted through the query time interval of other filtered non-existing domains, so that the calculated amount can be reduced, and the recognition efficiency of the host infected by the domain generation algorithm malicious software is improved.

In an alternative embodiment, as shown in FIG. 2, the computer device operates as a finite state machine. It is in a waiting state while counting the non-existent domains queried by the host, in a ready state while executing step S10 to extract feature vectors, and in a detection state while executing steps S20 and S30. The device continuously monitors the number of non-existent domains queried by the host and the time interval between queries. Each time the number n of non-existent domains queried by the host reaches a certain value n_nx, step S10 is executed to extract a feature vector, and it is then judged whether the time t elapsed from the starting moment to the current moment exceeds the time window T. If t does not exceed T, the value of n is reset and monitoring continues; if t exceeds T, the detection state is entered and steps S20 and S30 are executed. The time window T may be set according to the current network environment; for example, in a general campus network environment, T may be set to 1 h. After steps S20 and S30 have been executed, the number of non-existent domains queried by the host and the time t are reset (t = 0, n_nx = 0), and the device returns to the waiting state.
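The waiting, ready and detection states can be sketched as a small class. The class name `HostMonitor`, its parameters, and the callback interface are all hypothetical; the sketch follows a single stream of query timestamps, whereas a deployment would presumably track many hosts and cluster their feature vectors together.

```python
class HostMonitor:
    """Minimal state machine: waiting -> ready -> (waiting | detection)."""

    def __init__(self, batch_size=10, window_seconds=3600.0):
        self.batch_size = batch_size   # count that triggers step S10
        self.window = window_seconds   # time window T (1 h in the example)
        self.state = "waiting"
        self.start = None              # starting moment of the window
        self.timestamps = []           # query times in the current batch
        self.features = []             # feature vectors collected so far

    def on_nxdomain(self, timestamp, extract, detect):
        """Feed one NXDomain query time; return detect()'s result or None."""
        if self.start is None:
            self.start = timestamp
        self.timestamps.append(timestamp)
        if len(self.timestamps) < self.batch_size:
            return None                # still waiting
        self.state = "ready"           # step S10 on the completed batch
        ts = self.timestamps
        self.features.append(extract([b - a for a, b in zip(ts, ts[1:])]))
        self.timestamps = []           # reset the count
        if timestamp - self.start <= self.window:
            self.state = "waiting"     # window not exceeded: keep monitoring
            return None
        self.state = "detection"       # steps S20 and S30
        result = detect(self.features)
        self.features, self.start = [], None
        self.state = "waiting"
        return result
```

Here `extract` plays the role of step S10 and `detect` the role of steps S20 and S30; both are supplied by the caller.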

Because a host queries non-existent domains continuously, the number of queried domains is large and the monitoring period is long. If features were extracted only after all time interval sequences had been collected, the computation would be heavy and slow. In the embodiment shown in FIG. 2, the feature vector is instead computed each time a time interval sequence is formed while the host's queries are being monitored, which avoids the long running time and large computational load of processing all time interval sequences at once.

In an alternative embodiment, the step S20 specifically includes:

First, each feature vector is taken as a candidate cluster.

Then, the two candidate clusters that are currently closest to each other are extracted, the distance between them is compared with their respective internal distances, and it is judged whether they meet the merging condition. In an alternative embodiment, the nearest cluster pair may be extracted by the getClusters() function.

If the two currently closest candidate clusters meet the merging condition, they are merged into a new candidate cluster.

In the embodiment of the invention, the merging condition applied to two candidate clusters is determined from those two candidate clusters themselves, so different pairs of candidate clusters are subject to different conditions. A fixed merging condition may not suit all cluster pairs, whereas deciding whether to merge according to a condition derived from the pair itself makes the merging of candidate clusters more reasonable.

If the two currently closest candidate clusters do not meet the merging condition, one of the two candidate clusters is output.

If other mergeable candidate clusters exist, the extracting, comparing, judging and merging steps are repeated: the two currently closest candidate clusters are extracted, the distance between them is compared with their respective internal distances, and, if they meet the merging condition, they are merged into a new candidate cluster. This continues until no mergeable candidate clusters remain, whereupon the candidate clusters that currently exist are determined as the clusters.

In an alternative embodiment, when the two currently closest candidate clusters are to be merged, judging whether they meet the merging condition specifically includes:

First, a first threshold is formed according to the size of the feature vectors of the two currently closest candidate clusters.

In an alternative embodiment, since the time delay of most DGA-based malware families is less than 1 second, the distance between the vectors they generate is likely to be less than a bound that depends on the vector size |v|, and this bound can be used as the first threshold.

In embodiments of the present invention, features extracted from the time lags may be compatible with periodic, change point or vocabulary based detection. Accuracy may also be improved by analyzing the time delay, for example, when a periodicity or change point is detected.

Then, a second threshold is formed based on the mean and standard deviation of the internal distances of the two currently closest candidate clusters.

In an alternative embodiment, if the two closest candidate clusters are $c_i$ and $c_j$, a first candidate threshold is calculated from the mean and standard deviation of the internal distances of the first candidate cluster $c_i$: $a_i = \mathrm{mean}(c_i) + 2 \cdot \mathrm{std}(c_i)$. A second candidate threshold is calculated in the same way from the second candidate cluster $c_j$: $a_j = \mathrm{mean}(c_j) + 2 \cdot \mathrm{std}(c_j)$. The minimum of the first and second candidate thresholds is then taken as the second threshold, $\min\{a_i, a_j\}$.

When the distance $d_{ij}$ between the two currently closest candidate clusters is less than the maximum of the first threshold and the second threshold, the two candidate clusters are judged to meet the merging condition and are merged into a new cluster.
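The merging procedure of step S20 can be sketched as follows. Several points are assumptions made for this illustration rather than details stated in the text: the distance between clusters is taken as the Euclidean distance between centroids, the first threshold is taken as the square root of the vector size |v|, the internal distances are all pairwise distances within a cluster, and when a pair fails the merging condition the first cluster of the pair is output.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [fmean(column) for column in zip(*cluster)]


def internal_threshold(cluster):
    """a = mean + 2 * std of the pairwise distances inside one cluster."""
    dists = [math.dist(a, b) for a, b in combinations(cluster, 2)]
    if not dists:
        return 0.0  # a singleton has no internal distances
    return fmean(dists) + 2 * pstdev(dists)


def adaptive_cluster(vectors):
    """Agglomerative clustering with a pair-specific merging condition."""
    candidates = [[list(v)] for v in vectors]
    finished = []
    while len(candidates) > 1:
        # Extract the two candidate clusters that are currently closest.
        i, j = min(
            combinations(range(len(candidates)), 2),
            key=lambda p: math.dist(centroid(candidates[p[0]]),
                                    centroid(candidates[p[1]])),
        )
        ci, cj = candidates[i], candidates[j]
        d_ij = math.dist(centroid(ci), centroid(cj))
        first = math.sqrt(len(ci[0]))  # assumed form of the first threshold
        second = min(internal_threshold(ci), internal_threshold(cj))
        if d_ij < max(first, second):
            # Merging condition met: replace the pair by the merged cluster.
            candidates = [c for k, c in enumerate(candidates)
                          if k not in (i, j)]
            candidates.append(ci + cj)
        else:
            # Condition not met: output one cluster of the pair.
            finished.append(candidates.pop(i))
    return finished + candidates
```

The search for the closest pair is recomputed on every pass, which is simple but roughly cubic in the number of vectors; a distance cache would be the obvious refinement for large inputs.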

In an optional embodiment, the determining the malicious cluster by performing outlier analysis on the cluster based on Grubbs test in the step S30 specifically includes:

First, the statistical test value of each cluster is determined according to the size of the cluster and the standard deviation:

$G_i = \frac{|c_i| - \overline{|c|}}{s}$

where $|c_i|$ is the size of cluster $c_i$, $\overline{|c|}$ is the average cluster size, and $s$ is the standard deviation of the cluster sizes.

Then, a significance test is performed on the statistical test value to judge whether the hypothesis that the cluster is not a very large cluster is rejected. In the embodiment of the invention, two hypotheses $H_1$ and $H_0$ are defined, stating respectively that a very large cluster exists and that no very large cluster exists.

Since infected hosts attempt to connect to the C&C server, they query more non-existent domains than legitimate hosts, and their feature vectors are more similar to one another. When malware is present, the number of elements in the corresponding cluster is therefore much larger than in other clusters, owing to the high similarity of the infected hosts' behavior. Large clusters can thus be detected with an outlier (large value) detection algorithm, thereby identifying the malware.

In an alternative embodiment, when the test value exceeds the critical value of the test, the hypothesis that the cluster is not a very large cluster is rejected and the cluster is determined to be a malicious cluster. The critical value is derived from the upper critical value of the t-distribution with $q-2$ degrees of freedom at the corresponding significance level, where $\alpha$ denotes the significance level (typically set to 0.001).

Finally, if the hypothesis that the cluster is not a very large cluster is rejected, the cluster is determined to be a malicious cluster. If the hypothesis is accepted, the verification process terminates.
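The outlier analysis above can be illustrated with the textbook one-sided Grubbs test for a single largest value. The sketch assumes that form, which may differ in detail from the embodiment: the statistic is $G = (\max_i |c_i| - \bar{c}) / s$ with $\bar{c}$ the mean cluster size and $s$ the sample standard deviation of the cluster sizes, $q$ is taken to be the number of clusters, and the critical value is $\frac{q-1}{\sqrt{q}} \sqrt{t^2 / (q - 2 + t^2)}$, where $t$ is the upper critical value of the t-distribution with $q-2$ degrees of freedom at level $\alpha/q$. The function names are hypothetical, and the t critical value is supplied by the caller because the Python standard library has no t-distribution quantile.

```python
import math
import statistics


def grubbs_statistic(sizes):
    """G for the largest cluster size: (max - mean) / sample standard deviation."""
    mean = statistics.fmean(sizes)
    s = statistics.stdev(sizes)
    return (max(sizes) - mean) / s


def grubbs_critical_value(q, t_crit):
    """One-sided Grubbs critical value for q observations.

    t_crit is assumed to be the upper critical value of the t-distribution
    with q - 2 degrees of freedom at level alpha / q, taken from a table or
    a statistics library.
    """
    return (q - 1) / math.sqrt(q) * math.sqrt(t_crit**2 / (q - 2 + t_crit**2))


def largest_cluster_is_outlier(sizes, t_crit):
    """True when the 'no very large cluster' hypothesis is rejected."""
    q = len(sizes)
    if q < 3:
        raise ValueError("the test needs at least three clusters")
    if len(set(sizes)) == 1:
        # All clusters have the same size, so none can be an outlier.
        return False
    return grubbs_statistic(sizes) > grubbs_critical_value(q, t_crit)
```

If the largest cluster is flagged, the hosts whose feature vectors fall in it would be reported as infected. Repeating the test on the remaining clusters is one possible way to look for further malicious clusters, which is consistent with the statement that verification terminates once the hypothesis is accepted.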

The embodiment of the invention provides a host recognition device infected by domain generation algorithm malicious software, as shown in fig. 3, comprising:

the feature extraction module 10 is configured to extract a plurality of feature vectors according to the query time intervals of the host for non-existent domains. For details, refer to the description in the above method embodiment, which is not repeated here.

The clustering module 20 is configured to cluster the plurality of feature vectors to form at least one cluster. For details, refer to the description of step S20 in the above embodiment, which is not repeated here.

The infected host identifying module 30 is configured to perform outlier analysis on the clusters to determine a malicious cluster, and to determine the hosts corresponding to the feature vectors in the malicious cluster as infected hosts. For details, refer to the description of step S30 in the above embodiment, which is not repeated here.

The host recognition device infected by domain generation algorithm malware provided by the embodiment of the invention extracts feature vectors according to the time intervals between a host's queries for non-existent domains. After a host is infected by a malware family, it queries a plurality of non-existent domains with an inherent time lag in order to find the rendezvous point for connecting to command and control (C&C). Domain generation algorithm malware families tend to query domains at constant time intervals, although some sophisticated malware may draw the query intervals from a probability distribution (such as a Gaussian distribution) to mask this similarity. Based on this characteristic, the feature vectors extracted from the host's query time intervals for non-existent domains make it possible to accurately identify hosts infected by malware based on a domain generation algorithm.

In addition, when infected hosts attempt to connect to the C&C server, they query more non-existent domains and their feature vectors are more similar. Therefore, after the feature vectors are extracted, the device clusters them and determines the infected hosts by performing outlier analysis on the resulting clusters. Analyzing clusters rather than individual vectors allows a large number of feature vectors to be processed quickly and the infected hosts to be determined more efficiently, and clustering lets the relations among the feature vectors be taken into account, which makes the identification result more accurate. By implementing the host recognition device provided by the embodiment of the invention, hosts infected by domain generation algorithm malware can thus be identified more accurately and rapidly.

The embodiment of the present invention provides a computer device, as shown in fig. 4, which mainly includes one or more processors 31 and a memory 32, and in fig. 4, one processor 31 is taken as an example.

The computer device may further include: an input device 33 and an output device 34.

The processor 31, the memory 32, the input device 33 and the output device 34 may be connected by a bus or in another manner; connection by a bus is taken as an example in fig. 4.

The processor 31 may be a central processing unit (Central Processing Unit, CPU). The processor 31 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or a combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The memory 32 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created from the use of host recognition devices infected with domain generation algorithm malware, etc. In addition, the memory 32 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 32 optionally includes memory remotely located with respect to processor 31, which may be connected via a network to host recognition devices infected with domain generation algorithm malware. The input device 33 may receive a user-entered computing request (or other numeric or character information) and generate key signal inputs related to a host recognition device infected with domain generation algorithm malware. The output device 34 may include a display device such as a display screen for outputting the calculation result.

Embodiments of the present invention provide a computer readable storage medium storing computer executable instructions that are capable of executing the host recognition method infected by domain generation algorithm malware in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kinds described above.

It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications in different forms will be apparent to those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims

1. A method for identifying a host infected with malware of a domain generation algorithm, comprising:

extracting a plurality of feature vectors according to the inquiry time interval of the host computer to the non-existence domain;

clustering the plurality of feature vectors to form at least one cluster;

performing outlier analysis on the cluster to determine a malicious cluster, and determining a host corresponding to a feature vector in the malicious cluster as an infected host;

performing outlier analysis on the cluster to determine a malicious cluster, including:

determining the statistical test value of each cluster according to the size and standard deviation of each cluster;

carrying out a significance test on the statistical test value, and judging whether the assumption that each cluster is not a very large cluster is rejected;

if the assumption that the cluster is not a very large cluster is rejected, determining the cluster as a malicious cluster;

the formula for determining the statistical test value of each cluster according to the size and standard deviation of each cluster is as follows:

wherein $|c_i|$ is the size of cluster $c_i$, the mean term is the average cluster size, and $s$ is the standard deviation;

the process of carrying out significance test on the statistical test value comprises the following steps:

when the test value exceeds the critical value of the test, the assumption that the cluster is not a very large cluster is rejected and the cluster is determined to be a malicious cluster, wherein the critical value is derived from the upper critical value of the t-distribution with $q-2$ degrees of freedom at the corresponding significance level, and $\alpha$ represents the significance level.

2. The method of claim 1, wherein extracting feature vectors based on a host's query time interval for non-existing domains comprises:

acquiring the domain names of the non-existent domains queried by the host;

screening out non-malicious domains among the non-existent domains according to the domain names;

and extracting feature vectors according to the query time intervals of the host for the non-existent domains other than the non-malicious domains.

3. The method of claim 1, wherein clustering the plurality of feature vectors to form at least one cluster comprises:

respectively taking each characteristic vector as a candidate cluster;

extracting two candidate clusters closest to the current distance, comparing the distance between the two candidate clusters closest to the current distance with the respective internal distances of the two candidate clusters, and judging whether the two candidate clusters closest to the current distance meet the merging condition;

if the two candidate clusters closest to the current distance meet the merging condition, merging the two candidate clusters closest to the current distance into a new candidate cluster;

4. A method according to claim 3, further comprising:

5. The method according to claim 3 or 4, wherein,

forming a first threshold according to the sizes of the feature vectors of the two candidate clusters closest to the current distance;

forming a second threshold according to the mean and standard deviation of the internal distances of the two candidate clusters closest to the current distance;

and when the distance between the two candidate clusters closest to the current distance is smaller than the maximum value between the first threshold value and the second threshold value, judging that the two candidate clusters closest to the current distance meet a merging condition, and merging the two candidate clusters closest to the current distance into a new cluster.

6. A host recognition device infected with domain generation algorithm malware, comprising:

the feature extraction module is used for extracting a plurality of feature vectors according to the inquiry time interval of the host computer to the non-existence domain;

the clustering module is used for clustering the plurality of feature vectors to form at least one cluster;

the infected host identifying module is used for carrying out outlier analysis on the cluster to determine a malicious cluster, and determining a host corresponding to the feature vector in the malicious cluster as an infected host;

7. A computer device, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to perform the host recognition method of malware infection by a domain generation algorithm as claimed in any one of claims 1 to 5.

8. A computer readable storage medium storing computer instructions for causing the computer to perform the host recognition method of malware infection by a domain generation algorithm according to any one of claims 1-5.