CN114257565B

CN114257565B - Method, system and server for mining potential threat domain names

Info

Publication number: CN114257565B
Application number: CN202010945102.6A
Authority: CN
Inventors: 何振财; 李彬; 全俊斌; 乔雅莉; 邓太良; 郝建忠; 钟雪慧; 刘峥; 余筱蕙; 孙际勇
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guangdong Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guangdong Co Ltd
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2023-09-05
Anticipated expiration: 2040-09-10
Also published as: CN114257565A

Abstract

The application discloses a method, a system and a server for mining potential threat domain names, relates to the field of communications, and aims to solve the problem that existing threat domain name identification methods analyze massive data inefficiently. The method comprises the following steps: acquiring a second feature set based on a first feature set acquired in advance, wherein the first feature set is a feature set related to potential threat domain names, and the second feature set is a subset of the first feature set and comprises dynamic scene features and static scene features; and obtaining a suspected threat domain name set by performing association calculation on the features in the second feature set. The application is used for mining threat domain names.

Description

Method, system and server for mining potential threat domain names

Technical Field

The present application relates to the field of communications, and in particular, to a method, a system, and a server for mining a potentially threatening domain name.

Background

With the progress of science and the development of communication technology, mobile phones have become an indispensable part of people's daily life. In recent years, more and more users browse web pages and watch videos over the mobile internet, and use various mobile phone applications (APPs) for social networking, entertainment, learning, daily life and the like, generating a huge volume of mobile internet access data. Many lawbreakers attack users' mobile phones through network servers, which greatly threatens network security. Efficient mining of potentially threatening domain names has therefore become an urgent task.

In the related art, a domain name content recognition engine is often built to mine potentially threatening domain names. This approach relies on similarity recognition of web page content: features are mined from the elements of a web page, which are then classified and predicted. However, because the content elements can be obtained only after the web page has been accessed successfully, analysis efficiency is low in the face of massive data, and effective analysis of the whole network log is difficult to achieve.

It can be seen that there is a need for a method for mining potentially threatening domain names that improves the efficiency of mining threatening domain names.

Disclosure of Invention

The embodiments of the application provide a method for mining potential threat domain names, which is used to solve the problem that existing threat domain name identification methods analyze massive data inefficiently.

The embodiments of the application also provide a system for mining potential threat domain names, which is used to solve the problem that, with existing threat domain name identification methods, analysis efficiency is low in the face of massive data and effective analysis of the whole network log is difficult to achieve.

The embodiment of the application adopts the following technical scheme:

In a first aspect, a method for mining a potentially threatening domain name is provided, comprising:

acquiring a second feature set based on a first feature set acquired in advance, wherein the first feature set is a feature set related to a potential threat domain name, and the second feature set is a subset of the first feature set and comprises dynamic scene features and static scene features; and

obtaining a suspected threat domain name set by performing association calculation on the features in the second feature set.

In a second aspect, a mining system for potentially threatening domain names is provided, comprising:

an acquisition module, configured to acquire a second feature set based on a first feature set acquired in advance, where the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

and the processing module is used for obtaining a suspected threat domain name set by carrying out association calculation on the features in the second feature set.

In a third aspect, a server is provided, comprising:

In a fourth aspect, there is provided a computer-readable storage medium, in which a program is stored, which when executed, performs the following process:

The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:

In the embodiments of the application, a second feature set is acquired based on a first feature set acquired in advance, wherein the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set comprises dynamic scene features and static scene features; and a suspected threat domain name set is obtained by performing association calculation on the features in the second feature set. Because the second feature set is derived from the first feature set acquired in advance, the features related to potential threat domain names can be delimited quickly within big data, and obtaining the suspected threat domain name set from the second feature set greatly reduces the amount of data to be processed. The efficiency of front-end threat domain name identification is thereby effectively improved, and effective analysis of the whole network log is achieved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for mining a potentially threatening domain name provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an exemplary feature association link in an embodiment of the present application;

FIG. 3 is a flow chart of determining a set of suspected threat domain names based on association analysis in an embodiment of the application;

FIG. 4 is a flow chart of a method of mining potentially threatening domain names provided by embodiments of the present application;

FIG. 5 is a block diagram of a system provided by an embodiment of the present application;

FIG. 6 is a block diagram of a server according to an embodiment of the present application.

Detailed Description

The embodiment of the application provides a method and a system for mining a potential threat domain name.

In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, shall fall within the scope of the application.

Fig. 1 is a flow chart of a method for mining a potentially threatening domain name provided by an embodiment of the present application. As shown in fig. 1, a method for mining a domain name with potential threat provided by an embodiment of the application may include the following steps:

Step 110: acquiring a second feature set based on a first feature set acquired in advance, wherein the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set comprises dynamic scene features and static scene features.

In an embodiment of the present application, the first feature set may specifically include, but is not limited to: at least two of domain name IP number, number of access users, number of access times, number of uniform resource locators (Uniform Resource Locator, URL), number of maximum access users, number of single user average URL, number of URL average users, number of URL average access times, number of user standard deviation, number of access standard deviation, URL access number discrete coefficient, number of access user discrete degree, number of times of frequent access of single user access domain name on the same day, number of maximum URL times of domain name access, number of single user access URL, domain name survival time, terminal operating system attribute, access carrier attribute, domain name return code attribute, IP region attribute, URL return code attribute.

In an embodiment of the present application, the second feature set may specifically include a dynamic scene feature and a static scene feature.

The dynamic scene features include: at least one of a communication feature and an association feature, wherein the communication feature comprises a domain name generation algorithm (Domain Generation Algorithm, DGA) attribute, the association feature comprising: at least one of a domain name return code attribute, a domain name IP region attribute, and a URL return code attribute; the static scene features include: at least one of behavioral characteristics and fingerprint characteristics, wherein the behavioral characteristics include: at least one of the number of access users, the dispersion of the number of access users, the maximum URL number of domain name access, the number of single user access URLs and the survival time of domain names, wherein the fingerprint feature comprises: at least one of a terminal operating system attribute and an access carrier attribute.

The DGA attribute measures how random and dispersed the characters of a domain name are. Specifically, the DGA attribute is the ratio of the entropy value of the domain name to the vowel ratio of the domain name, wherein the entropy value is a measure of character randomness, and the vowel ratio is the number of vowels in the domain name divided by the total length of the domain name.
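The DGA attribute described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes that the "entropy value" is the character-level Shannon entropy, that the vowels are a, e, i, o and u, and that the whole domain string (including dots) is scored. The function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: the source does not list the vowels


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character (assumed measure)."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def vowel_ratio(text: str) -> float:
    """Number of vowels divided by the total length of the string."""
    return sum(ch in VOWELS for ch in text) / len(text)


def dga_attribute(domain: str) -> float:
    """Entropy divided by vowel ratio; infinity when the domain has no vowels."""
    text = domain.lower()
    if not text:
        raise ValueError("domain must be a non-empty string")
    ratio = vowel_ratio(text)
    return math.inf if ratio == 0 else shannon_entropy(text) / ratio
```

Under these assumptions a random-looking name such as `xq7zk2vb.com` scores higher than `example.com`, because it has both higher entropy and a lower share of vowels.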

The domain name return code attribute specifically refers to the status code (for example, a redirect code) returned by the website when the domain name is accessed. For example, the domain name returns a 4xx error, but a URL under it returns 200 (request successful) or 204 (no content). Domain names that exhibit such return codes are highly suspicious.

The IP region attribute specifically refers to the region to which the IP address resolved from the domain name belongs. In general, an IP address outside the region carries a higher risk than one inside the region.

The URL return code attribute specifically refers to the status code returned when a URL under the domain name is accessed. For example, a URL that returns a 302 status code has been redirected and is therefore suspicious.
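The two return code examples above can be expressed as simple rules. The sketch below is only an illustration of those two examples; the function name and the flag labels are hypothetical, and a real system would presumably consider more status codes.

```python
from typing import List


def return_code_flags(domain_status: int, url_status: int) -> List[str]:
    """Return hypothetical flag labels for the two patterns described in the text."""
    flags = []
    # Pattern 1: the domain answers with a 4xx error, yet a URL under it
    # answers 200 (request successful) or 204 (no content).
    if 400 <= domain_status < 500 and url_status in (200, 204):
        flags.append("domain_4xx_url_success")
    # Pattern 2: the URL answers 302, meaning it was redirected.
    if url_status == 302:
        flags.append("url_redirected")
    return flags
```

For instance, `return_code_flags(404, 204)` yields one flag, while `return_code_flags(200, 200)` yields none.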

The number of access users: the more users access a domain name on the same day, the wider the domain name's range of influence; except for white-listed domain names, the risk degree is correspondingly high.

The access user number dispersion: because users' choices are subjective and random, access to white-listed domain names is evenly distributed. If the access users of a domain name and its URLs are concentrated, the domain name may be in a burst access stage, and the risk degree is high.
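The source does not give a formula for the dispersion. One plausible reading, consistent with the "discrete coefficient" features listed in the first feature set, is the coefficient of variation (standard deviation divided by mean) of the per-URL user counts. The sketch below rests on that assumption; the function name is hypothetical.

```python
from statistics import mean, pstdev


def dispersion_coefficient(user_counts):
    """Coefficient of variation of per-URL user counts (assumed definition).

    A value near 0 means access is spread evenly over the URLs; a large
    value means access is concentrated on a few URLs.
    """
    average = mean(user_counts)
    if average == 0:
        return 0.0
    return pstdev(user_counts) / average
```

With this definition, four URLs visited by 10 users each give 0.0, while a domain where all 40 users hit a single URL gives a much larger value.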

The maximum URL number of domain name access: a domain name has more than one URL under it. If access is concentrated on a single URL, then the greater the number of accesses, the more likely that URL is being exploited, and the higher the risk degree.

The number of single-user access URLs: if a user's access behavior is controlled, the user does not interact frequently with the domain name of the malicious server, which differs greatly from the behavior of an uncontrolled normal user. Therefore, the smaller the number of URLs accessed by a single user, the higher the risk degree.

The domain name survival time: to evade monitoring, malicious and gray domain names are periodically replaced with new domain names. The shorter the survival time, the higher the risk degree.

In an embodiment of the present application, the second feature set is a subset of the first feature set; that is, every feature in the second feature set comes from the first feature set. For example, if the first feature set consists of the user number standard deviation and the DGA attribute, the second feature set may consist of the DGA attribute alone.

In the embodiment of the present application, before step 110, the method for mining a domain name of a potential threat provided in the embodiment of the present application may further include a step of acquiring a first feature set. In particular, the process of obtaining the first feature set may include: acquiring data of an operator side; and obtaining a first feature set related to the potential threat domain name based on the obtained data of the operator side.

The operator side has a widely covered communication network, and mass data is generated at all times. The data on the operator side comprises basic information of the user and information of multiple dimensions such as communication data, social activity data, consumption behavior data, position information data and the like of the user. The data of the operator side has unique data integrity, continuity and richness, which is incomparable with any other industry data. Therefore, the embodiment of the application can obtain a large amount of user information, such as information which does not relate to user privacy, by utilizing the advantages of massive data at the operator side, further obtains the first feature set related to the potential threat domain name after summarizing the massive data, and can ensure the accuracy of the obtained first feature set.

In step 110, the acquiring the second feature set based on the first feature set acquired in advance includes: training the feature data in the first feature set acquired in advance by utilizing a random forest algorithm to acquire a second feature set related to the potential threat domain name.

The first feature set is trained through a random forest algorithm to obtain a second feature set, and the second feature set is a subset of the first feature set, so that the data volume of the obtained second feature set is obviously reduced compared with that of the original first feature set, and the data processing efficiency is improved.

Specifically, in the embodiment of the present application, the specific training process using the random forest algorithm may be performed according to the following steps:

Sub-step one: for the feature data in the first feature set acquired in advance, constructing n new training sets by sampling with replacement;

wherein the feature data in different new training sets may be repeated, as may the feature data in the same new training set.

Wherein, the feature data may refer to a specific value of each feature in the first feature set when aiming at a domain name.
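Sub-step one can be sketched with the standard library. This is a minimal illustration that assumes each new training set has the same size as the original data; the source does not state the size. The function name and the sample rows are hypothetical.

```python
import random


def bootstrap_training_sets(rows, n_sets, seed=None):
    """Build n_sets new training sets by sampling rows with replacement.

    Assumption: each new set has as many rows as the original data.
    """
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in rows] for _ in range(n_sets)]


# Hypothetical feature data: (domain, DGA value, number of access users)
rows = [("a.example", 8.5, 120), ("b.example", 43.0, 3), ("c.example", 12.1, 45)]
training_sets = bootstrap_training_sets(rows, n_sets=4, seed=0)
```

Because rows are drawn with replacement, the same row may appear several times within one set and in several different sets, as the text describes.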

Sub-step two: constructing a sub-decision tree according to the new training set; wherein each sub-decision tree corresponds to a new training set;

The random forest is composed of n sub-decision trees (n is a positive integer whose specific value is not fixed), and each node in a decision tree is a judgment condition on specific feature data. For example, a node taken as a parent node has two child nodes, on its left and right sides, referred to as the left child node and the right child node. If the parent node tests the DGA value, the left child node may be generated when the value is greater than a certain threshold and the right child node when the value is smaller than that threshold; if the parent node tests the terminal operating system attribute, yes/no judgment logic may be used to generate the left and right child nodes.

Sub-step three: voting on the end nodes of the n sub-decision trees to obtain the second feature set related to the potential threat domain names.

The voting performs statistical analysis on the end nodes of the n decision trees and selects the nodes that occur most often as the voting result.
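The voting step can be sketched as a frequency count. The source does not say how many features are kept or how ties are broken, so this sketch assumes that each tree votes at most once per feature, that the `top_k` most frequent features are kept, and that ties are broken alphabetically. All names are hypothetical.

```python
from collections import Counter


def vote_features(end_node_features_per_tree, top_k):
    """Select the features that appear at the end nodes of the most trees."""
    votes = Counter()
    for features in end_node_features_per_tree:
        votes.update(set(features))  # assumption: one vote per tree and feature
    # Sort by descending vote count, then by name so the result is deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical end-node features of three sub-decision trees
trees = [
    ["dga", "survival_time"],
    ["dga", "ip_region"],
    ["dga", "survival_time", "terminal_os"],
]
selected = vote_features(trees, top_k=2)  # ['dga', 'survival_time']
```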

In order to intuitively display the second feature set related to the potential threat domain name, exemplary data of the second feature set related to the potential threat domain name obtained through the above sub-steps is shown in table 1 below.

TABLE 1

Step 120: obtaining a suspected threat domain name set by performing association calculation on the features in the second feature set.

According to the method for mining potential threat domain names provided by the embodiments of the application, a second feature set is acquired based on a first feature set acquired in advance, wherein the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set comprises dynamic scene features and static scene features; and a suspected threat domain name set is obtained by performing association calculation on the features in the second feature set. Because the second feature set is derived from the first feature set acquired in advance, the features related to potential threat domain names can be delimited quickly within big data, and obtaining the suspected threat domain name set from the second feature set greatly reduces the amount of data to be processed. The efficiency of front-end threat domain name identification is thereby effectively improved, and effective analysis of the whole network log is achieved.

In the embodiment of the present application, the association calculation may be performed on the features in the second feature set in various manners, for example, using Apriori algorithm, FP-Tree algorithm, etc. The following description will be given by taking the Apriori algorithm as an example.

The association calculation may specifically be performed with the Apriori algorithm. The Apriori algorithm is an algorithm for mining association rules in data; it can mine the degree of association between the features in the second feature set, thereby forming a suspected threat domain name set.

Specifically, referring to fig. 3, in an embodiment of the present application, step 120 may include the following processes:

Step 1201: determining an evaluation standard of a potential threat domain name feature set, wherein the evaluation standard comprises at least one of a support degree and a confidence degree;

Evaluation standards for the potential threat domain name feature set include support, confidence and the like. In general, selecting the frequent K item sets in a data set requires a custom evaluation standard. In the Apriori algorithm, at least one of the support degree and the confidence degree is mainly used as the evaluation index.

The frequent K item set consists of K item data with tight association degree. Wherein K is a positive integer, and K is more than or equal to 2.

The support degree is the probability that a plurality of feature specific values in the second feature set appear in one domain name at the same time.

If two specific feature values X and Y in the second feature set are to be analyzed for association within one domain name, the corresponding support is

$$\mathrm{Support}(XY) = P(XY) = P(X)\,P(Y \mid X)$$

where $P(XY)$ represents the probability that X and Y occur simultaneously; $P(X)$ represents the probability that X occurs; $P(Y \mid X)$ represents the probability that Y occurs given that X occurs; and $\mathrm{Support}(XY)$ represents the support corresponding to XY.

Similarly, if three specific feature values X, Y and Z in the second feature set are to be analyzed for association within one domain name, the corresponding support is

$$\mathrm{Support}(XYZ) = P(XYZ) = P(X)\,P(Y \mid X)\,P(Z \mid XY)$$

where $P(XYZ)$ represents the probability that X, Y and Z occur simultaneously; $P(X)$ represents the probability that X occurs; $P(Y \mid X)$ represents the probability that Y occurs given that X occurs; $P(Z \mid XY)$ represents the probability that Z occurs given that X and Y occur; and $\mathrm{Support}(XYZ)$ represents the support corresponding to XYZ.

The confidence is a conditional probability in the statistical sense. Continuing the above examples, the confidences are

$$\mathrm{Confidence}(X \Leftarrow Y) = P(X \mid Y)$$

$$\mathrm{Confidence}(X \Leftarrow YZ) = P(X \mid YZ)$$

where $\mathrm{Confidence}(X \Leftarrow Y)$ represents the confidence of X given Y, $\mathrm{Confidence}(X \Leftarrow YZ)$ represents the confidence of X given YZ, $P(X \mid Y)$ represents the probability that X occurs given that Y occurs, and $P(X \mid YZ)$ represents the probability that X occurs given that Y and Z occur.
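The support and confidence definitions above can be estimated from data by counting. The sketch below assumes that each domain name is represented as a set of feature-value labels and that probabilities are estimated as relative frequencies; the single-letter labels and the function names are hypothetical.

```python
def support(records, items):
    """Fraction of records (domain names) that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= record for record in records) / len(records)


def confidence(records, consequent, antecedent):
    """Confidence(X <= Y) = P(X | Y), estimated as Support(XY) / Support(Y)."""
    joint = support(records, frozenset(consequent) | frozenset(antecedent))
    base = support(records, antecedent)
    return joint / base if base else 0.0


# Hypothetical data: four domain names and the feature values each one shows
records = [frozenset("XY"), frozenset("XYZ"), frozenset("Y"), frozenset("X")]
support_xy = support(records, "XY")             # 2 of 4 records -> 0.5
confidence_x_y = confidence(records, "X", "Y")  # 0.5 / 0.75
```

Here X and Y occur together in two of four records, and Y occurs in three, so the confidence of X given Y is two thirds.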

Step 1202: and carrying out association analysis on the features in the second feature set based on the determined evaluation standard and the Apriori algorithm so as to determine a suspected threat domain name set.

Specifically, step 1202 may include the following sub-step one through sub-step three.

Sub-step one: performing data connection on the data in the second feature set; pruning the candidate 1 item set obtained after connection according to a preset evaluation standard to obtain a frequent 1 item set;

The candidate 1 item set is obtained by connecting the initial valid feature data in pairs and eliminating identical items. For example, connecting data 0 and 1 in pairs yields the item sets 01 and 10. These are essentially the same item and differ only in order, so after identical items are eliminated, a single item set (01 or 10) remains in the candidate 1 item set. The other data are connected in pairs in the same way.

The pruning processing refers to deleting the candidate 1 item set lower than a preset evaluation standard.

Sub-step two: connecting elements in the frequent 1 item set to obtain a candidate 2 item set, and pruning elements in the candidate 2 item set lower than a preset evaluation standard to obtain the frequent 2 item set;

The elements in the frequent 1 item set are connected in the manner of sub-step one to obtain the candidate 2 item set, and the candidate 2 item set is pruned in the manner of sub-step one to obtain the frequent 2 item set. Note that if a candidate 1 item set has already been pruned, any candidate 2 item set derived from it need not be judged against the preset evaluation standard and can be pruned directly.

Sub-step three: for the frequent K item set, iterating according to the above process until the frequent K+1 item set cannot be found, and determining the obtained frequent K item set as a suspected threat domain name set; wherein K is a positive integer, and K is more than or equal to 2.

The candidate K item set in the application may be a set consisting of K+1 initial valid features (for example, the candidate 1 item set 01 combines two features); the frequent K item set refers to a candidate K item set that satisfies the preset evaluation standard.

During the operation of the Apriori algorithm, if a certain K item set is frequent, then all subsets of that K item set are frequent. Conversely, if a candidate item set falls below the preset evaluation standard during pruning, it is deleted, and every item set that contains it (its supersets) is pruned as well. This property greatly shortens the traversal time of the algorithm and thereby improves its processing efficiency.
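The connection and pruning loop described above can be sketched in a few lines. This is a generic, textbook-style Apriori sketch rather than the applicant's implementation: it uses support as the only evaluation standard, starts from single items (whereas the document's candidate 1 item set already holds pairs), and returns every frequent item set rather than only the last level. The function name and sample data are hypothetical.

```python
from itertools import combinations


def apriori(records, min_support):
    """Return {item set: support} for every item set meeting min_support."""
    total = len(records)

    def sup(itemset):
        return sum(itemset <= record for record in records) / total

    items = sorted({item for record in records for item in record})
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent = {s: sup(s) for s in level}
    size = 2
    while level:
        # Connection: union pairs of frequent sets that give exactly `size` items.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Pruning 1: a candidate with any infrequent sub-item-set cannot be
        # frequent, so it is dropped without computing its support.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        # Pruning 2: drop candidates below the evaluation standard.
        level = {c for c in candidates if sup(c) >= min_support}
        frequent.update({c: sup(c) for c in level})
        size += 1
    return frequent


records = [frozenset("ABC"), frozenset("AB"), frozenset("AC"),
           frozenset("A"), frozenset("BC")]
result = apriori(records, min_support=0.4)
```

In this toy data all three pairs reach the 0.4 threshold, but the triple ABC appears in only one of five records and is therefore not frequent.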

Network operation behaviors in all domain name scenarios can be obtained through the Apriori algorithm. They are illustrated here with a simple single-chain model; in an actual model, the network operation behaviors form a hybrid model composed of a plurality of single chains.

An example analysis of a relationship link in a simple dimension is given below. FIG. 2 is a schematic diagram of the feature association link analysis of the present application.

The sub-steps of step 1202 are illustrated below. As shown in FIG. 2, the initial valid feature data of the features in the second feature set for domain name A are the DGA, the IP attribute, the number of users accessing on the same day, and the domain name survival duration. The DGA is numbered 0, the IP attribute is numbered 1, the number of users accessing on the same day is numbered 2, and the domain name survival duration is numbered 3.

Sub-step 1 is performed on the numbered features: data connection is performed on the initial valid feature data of the features in the second feature set to obtain the connected data sets.

Specifically, the DGA (0) and the IP attribute (1) are connected to obtain data set 01; the DGA (0) and the number of users accessing on the same day (2) are connected to obtain data set 02; the DGA (0) and the domain name survival duration (3) are connected to obtain data set 03; the IP attribute (1) and the number of users accessing on the same day (2) are connected to obtain data set 12; the IP attribute (1) and the domain name survival duration (3) are connected to obtain data set 13; the number of users accessing on the same day (2) and the domain name survival duration (3) are connected to obtain data set 23.

The data sets obtained after connection are then pruned according to the preset evaluation standard. If data set 23 is lower than the preset evaluation standard, it is pruned, and the frequent 1 item set is obtained: 01, 02, 03, 12, 13.

Sub-step 2 is performed on the frequent 1 item set: the elements in the frequent 1 item set are connected to obtain the candidate 2 item set.

Specifically, the frequent 1 item sets 01, 02 and 12 are connected to obtain the candidate 2 item set 012; the frequent 1 item sets 01, 03 and 13 are connected to obtain the candidate 2 item set 013; the frequent 1 item sets 02 and 03 and the candidate 1 item set 23 are connected to obtain the candidate 2 item set 023; the frequent 1 item sets 12 and 13 and the candidate 1 item set 23 are connected to obtain the candidate 2 item set 123.

Elements in the candidate 2 item set that are lower than the preset evaluation standard are pruned to obtain the frequent 2 item set. Because the candidate 1 item set 23 was already pruned in sub-step 1, the candidate 2 item sets 023 and 123, which have the candidate 1 item set 23 as a parent node, are pruned along with it. The frequent 2 item sets after pruning are therefore 012 and 013.

Sub-step 3 is performed on the frequent 2 item set: for the frequent K item set (K=2 in this example), the above process is iterated until no frequent K+1 item set (K+1=3 in this example) can be found, and the obtained frequent K item set is determined to be the suspected threat domain name set.

Specifically, the frequent 2 item sets 012 and 013 and the candidate 2 item sets 023 and 123 are connected to obtain the candidate 3 item set 0123. Because the candidate 2 item sets 023 and 123 were already pruned in sub-step 2, the candidate 3 item set 0123, which has them as parent nodes, is pruned along with them. No frequent 3 item set can therefore be found, and the obtained frequent 2 item set is determined to be the suspected threat domain name set.
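The worked example can be replayed in code. The sketch below starts from the frequent 1 item set 01, 02, 03, 12, 13 given in the example and applies only the "parent already pruned" rule; it assumes, as the example does, that the surviving candidates 012 and 013 meet the preset evaluation standard. The function names are hypothetical.

```python
from itertools import combinations


def next_level(frequent):
    """Connect same-size item sets, then drop every candidate that has a
    sub-item-set missing from `frequent` (one that was already pruned)."""
    if not frequent:
        return set()
    size = len(next(iter(frequent))) + 1
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
    return {c for c in candidates
            if all(frozenset(sub) in frequent
                   for sub in combinations(c, size - 1))}


def labels(itemsets):
    """Render item sets in the document's digit notation, e.g. {0, 1} -> '01'."""
    return sorted("".join(str(i) for i in sorted(s)) for s in itemsets)


# 0 = DGA, 1 = IP attribute, 2 = users accessing on the same day,
# 3 = domain name survival duration; data set 23 was pruned in sub-step 1.
frequent_1 = {frozenset(p) for p in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]}
frequent_2 = next_level(frequent_1)
frequent_3 = next_level(frequent_2)
print(labels(frequent_2))  # ['012', '013']
print(labels(frequent_3))  # []
```

The candidates 023 and 123 are dropped because they contain the pruned pair 23, and 0123 is dropped because it contains 023 and 123, which matches the result of the example.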

Optionally, in an embodiment, the method for mining a domain name of a potential threat provided in the embodiment of the application may further include the following steps:

acquiring dynamic and static characteristics of domain name data in a monitoring data source;

The domain name data in the monitoring data source refers to the specific data of domain names observed in the network being monitored in real time.

And comparing the obtained dynamic and static characteristics of the domain name data with the obtained suspected threat domain name set to obtain a domain name detection result.
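The source does not specify how the comparison is carried out. One simple reading is that a monitored domain name is flagged when its observed dynamic and static features contain every feature of at least one item set in the suspected threat domain name set. The sketch below rests on that assumption; the function name and the feature labels are hypothetical.

```python
def match_suspected_sets(observed_features, suspected_sets):
    """Return the suspected item sets fully contained in the observed features."""
    observed = frozenset(observed_features)
    return [s for s in suspected_sets if s <= observed]


# Hypothetical suspected item sets, e.g. the frequent item sets from step 1202
suspected = [
    frozenset({"dga_high", "ip_outside_region", "many_users"}),
    frozenset({"dga_high", "ip_outside_region", "short_lived"}),
]
hits = match_suspected_sets(
    ["dga_high", "ip_outside_region", "short_lived", "os_android"], suspected
)
is_threat = bool(hits)
```

A non-empty result could then serve as the domain name detection result; an empty list means no suspected pattern matched.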

Optionally, in an embodiment, the method for mining a domain name of a potential threat provided in the embodiment of the application may further include the following steps: and feeding back the domain name detection result to a malicious link library under the condition that the domain name detection result is that the threat domain name is detected.

Fig. 4 is a flow chart of a method for mining a potentially threatening domain name provided by an embodiment of the present application. As shown in fig. 4, a method for mining a domain name with potential threat provided by an embodiment of the application may include the following steps:

Step 410: acquiring data of an operator side; and obtaining a first feature set related to the potential threat domain name based on the obtained data of the operator side.

Step 420: training the feature data in the first feature set acquired in advance by utilizing a random forest algorithm to acquire a second feature set related to the potential threat domain name.

The specific process of processing the data by the random forest algorithm is referred to above in step 110, and will not be described herein.

Step 430: determining an evaluation criterion for the potential threat domain name feature set, the evaluation criterion comprising at least one of a support degree and a confidence degree.

Step 440: performing data connection on the initial valid feature data of the features in the second feature set, and pruning the candidate 1 item set obtained after the connection according to the preset evaluation criterion to obtain a frequent 1 item set.

Step 450: connecting the elements in the frequent 1 item set to obtain a candidate 2 item set, and pruning the elements of the candidate 2 item set that fall below the preset evaluation criterion to obtain a frequent 2 item set.

Step 460: for the frequent K item set, iterating according to the above process until no frequent K+1 item set can be found, and determining the resulting frequent K item set as the suspected threat domain name set.

Step 470: acquiring dynamic and static characteristics of domain name data in the monitoring data source.

Step 480: comparing the acquired dynamic and static characteristics of the domain name data with the obtained suspected threat domain name set to obtain a domain name detection result.
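The comparison in steps 470 and 480, together with the optional feedback to the malicious link library, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: it assumes each feature is discretized into a string item such as "dga=high", that the suspected threat domain name set is a collection of such item combinations, and that a domain name is flagged when its observed items contain at least one whole combination. The function name `detect`, the item labels, and the example domain names are hypothetical.

```python
def detect(domain, observed_items, suspected_sets, malicious_library):
    """Return the suspected feature combinations fully matched by `domain`.

    Assumption: a match means every item of a suspected combination is
    present among the domain's observed dynamic and static feature items.
    """
    observed = frozenset(observed_items)
    matches = [s for s in suspected_sets if s <= observed]
    if matches:
        # Optional feedback step: record the detected domain in the library.
        malicious_library.add(domain)
    return matches


# Hypothetical suspected set, e.g. the output of the association analysis.
suspected_sets = [
    frozenset({"dga=high", "return_code=NXDOMAIN", "os=android"}),
]
malicious_library = set()

hit = detect(
    "xq7zkv2a.example",
    {"dga=high", "return_code=NXDOMAIN", "os=android", "carrier=A"},
    suspected_sets,
    malicious_library,
)
miss = detect(
    "news.example",
    {"dga=low", "return_code=OK", "os=android", "carrier=A"},
    suspected_sets,
    malicious_library,
)
```

Subset matching is only one possible reading of "comparing"; a deployment could equally score partial overlaps, which the application does not specify.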

Fig. 5 is a block diagram of a system for mining a potentially threatening domain name according to an embodiment of the present application. Referring to fig. 5, a mining system for potentially threatening domain names provided by an embodiment of the present application may include:

an obtaining module 502, configured to obtain a second feature set based on a first feature set obtained in advance, where the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

a processing module 504, configured to obtain a set of suspected threat domain names by performing association computation on the features in the second feature set.

The system for mining the potential threat domain name provided by the embodiment of the application acquires a second feature set based on a first feature set acquired in advance, wherein the first feature set is a feature set related to the potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set comprises dynamic scene features and static scene features; the system then obtains a suspected threat domain name set by carrying out association calculation on the features in the second feature set. Because the second feature set is derived from the first feature set acquired in advance, the feature set related to the potential threat domain name can be rapidly delineated within big data, and obtaining the suspected threat domain name set based on the second feature set greatly reduces the amount of data to be processed. The efficiency of front-end threat domain name identification is therefore effectively improved, and effective analysis of the whole network log is realized.

Optionally, in an embodiment of the present application, in the process of acquiring the second feature set, the obtaining module 502 may be configured to train on the feature data in the first feature set acquired in advance by using a random forest algorithm, so as to acquire the second feature set related to the potential threat domain name.

Optionally, in one embodiment of the present application, before the second feature set is acquired based on the first feature set acquired in advance, the obtaining module 502 is further configured to:

acquiring data of an operator side; and obtaining a first feature set related to a potential threat domain name based on the obtained data of the operator side.

In an embodiment of the present application, optionally, the first feature set includes: at least two of domain name IP number, access user number, access times, URL number, maximum access user number, single user average URL number, URL average user number, URL average access times, user number standard deviation, access number standard deviation, URL access number discrete coefficient, access user number discrete degree, single user access domain name frequency access user number on the same day, maximum URL number of domain name access, single user access URL number, domain name survival time length, terminal operating system attribute, access carrier attribute, domain name return code attribute, IP region attribute and URL return code attribute.

In an embodiment of the present application, optionally, the dynamic scene feature includes: at least one of a communication feature and an association feature, wherein the communication feature comprises a DGA attribute, the association feature comprising: at least one of a domain name return code attribute, a domain name IP region attribute, and a URL return code attribute; the static scene features include: at least one of behavioral characteristics and fingerprint characteristics, wherein the behavioral characteristics include: at least one of the number of access users, the dispersion of the number of access users, the maximum URL number of domain name access, the number of single user access URLs and the survival time of domain names, wherein the fingerprint feature comprises: at least one of a terminal operating system attribute and an access carrier attribute.
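The claims define the DGA attribute as the ratio of the entropy value of the domain name to its vowel ratio, where the vowel ratio is the number of vowels divided by the total length of the domain name. The sketch below illustrates that ratio under assumptions the application does not state: Shannon entropy over characters is used as the "character randomness degree value", the vowels are taken to be a, e, i, o, u, the input is a single non-empty domain label, and a label without vowels is mapped to infinity. All function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels only


def shannon_entropy(label):
    """Shannon entropy (bits per character) of the characters in `label`."""
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def vowel_ratio(label):
    """Number of vowels divided by the total length of `label`."""
    return sum(ch in VOWELS for ch in label) / len(label)


def dga_attribute(label):
    """Entropy divided by vowel ratio; larger values look more random."""
    label = label.lower()
    ratio = vowel_ratio(label)
    if ratio == 0:
        # Assumption: no vowels at all is treated as maximally suspicious.
        return math.inf
    return shannon_entropy(label) / ratio


readable = dga_attribute("google")
random_like = dga_attribute("xq7zkv2a")
```

A label made of many distinct characters with few vowels yields a high entropy and a low vowel ratio, so the quotient grows quickly for algorithmically generated names compared with pronounceable ones.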

In an embodiment of the present application, optionally, the processing module 504 is configured to:

determining an evaluation standard of a potential threat domain name feature set, wherein the evaluation standard comprises at least one of a support degree and a confidence degree;

and carrying out association analysis on the features in the second feature set based on the determined evaluation standard and the Apriori algorithm so as to determine a suspected threat domain name set.
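The embodiment names support degree and confidence degree as evaluation standards without restating their definitions. For reference, the standard association-rule definitions are given below; the symbols are introduced here for illustration and are not taken from the application: T denotes the collection of domain name records, each record t being a set of feature items, and X and Y denote feature item combinations.

```latex
\[
\operatorname{supp}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|},
\qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}
\]
```

Under these definitions, pruning against a support threshold discards feature combinations that occur in too few records, while a confidence threshold discards rules whose consequent rarely accompanies the antecedent.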

Optionally, in performing association analysis on the features in the second feature set based on the determined evaluation criteria and Apriori algorithm to determine a set of suspected threat domain names, the processing module 504 is configured to:

Performing data connection on initial valid feature data of the features in the second feature set; pruning the candidate 1 item set obtained after connection according to a preset evaluation standard to obtain a frequent 1 item set;

connecting elements in the frequent 1 item set to obtain a candidate 2 item set, and pruning elements in the candidate 2 item set lower than a preset evaluation standard to obtain the frequent 2 item set;

for the frequent K item set, iterating according to the above process until the frequent K+1 item set cannot be found, and determining the obtained frequent K item set as a suspected threat domain name set; wherein K is a positive integer, and K is more than or equal to 2.
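The connect, prune and iterate procedure described above can be sketched as follows. This is a minimal Apriori illustration under stated assumptions rather than the implementation of the application: each domain name record is assumed to be a set of discretized feature items, only the support degree is used as the evaluation standard, and the function names, item labels and threshold are hypothetical. Unlike the embodiment, which requires K to be at least 2, the sketch simply returns the largest frequent item sets it finds.

```python
from itertools import combinations


def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    return sum(1 for r in records if itemset <= r) / len(records)


def largest_frequent_itemsets(records, min_support):
    """Iterate join and prune until no frequent (K+1) item set exists."""
    records = [frozenset(r) for r in records]
    # Candidate 1 item sets: every distinct item seen in the data.
    candidates = {frozenset([item]) for r in records for item in r}
    frequent = {c for c in candidates if support(c, records) >= min_support}
    k = 1
    while frequent:
        # Connection: unite pairs of frequent K item sets into K+1 item sets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Pruning 1: every K-item subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Pruning 2: drop candidates below the support threshold.
        next_frequent = {c for c in candidates
                         if support(c, records) >= min_support}
        if not next_frequent:
            return frequent
        frequent = next_frequent
        k += 1
    return frequent


records = [
    {"dga=high", "return_code=NXDOMAIN", "os=android"},
    {"dga=high", "return_code=NXDOMAIN", "os=ios"},
    {"dga=high", "return_code=NXDOMAIN", "os=android"},
    {"dga=low", "return_code=OK", "os=android"},
]
suspected = largest_frequent_itemsets(records, min_support=0.5)
```

With the four hypothetical records and a support threshold of 0.5, the iteration stops at K = 3 because a single frequent 3 item set cannot be joined into any 4 item set.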

Optionally, the obtaining module 502 is further configured to acquire dynamic and static characteristics of domain name data in the monitoring data source. Accordingly, the processing module 504 is further configured to compare the acquired dynamic and static characteristics of the domain name data with the obtained suspected threat domain name set to obtain a domain name detection result.

The specific process of the steps executed by each module in the system for mining a potentially threatening domain name provided in the embodiment of the present application may refer to the method embodiment, and will not be described herein.

Fig. 6 is a block diagram of a server according to an embodiment of the present application. Referring to fig. 6, a server for a potentially threatening domain name provided by an embodiment of the present application may include:

An obtaining module 602, configured to obtain a second feature set based on a first feature set obtained in advance, where the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set includes dynamic scene features and static scene features;

and the processing module 604 is configured to obtain a set of suspected threat domain names by performing association calculation on the features in the second feature set.

Optionally, in an embodiment of the present application, in the process of acquiring the second feature set, the obtaining module 602 may be configured to train on the feature data in the first feature set acquired in advance by using a random forest algorithm, so as to acquire the second feature set related to the potential threat domain name.

Optionally, in one embodiment of the present application, before the second feature set is acquired based on the first feature set acquired in advance, the obtaining module 602 is further configured to:

In an embodiment of the present application, optionally, the processing module 604 is configured to:

Optionally, in performing association analysis on the features in the second feature set based on the determined evaluation criteria and Apriori algorithm to determine a set of suspected threat domain names, the processing module 604 is configured to:

Optionally, the obtaining module 602 is further configured to acquire dynamic and static characteristics of domain name data in the monitoring data source. Accordingly, the processing module 604 is further configured to compare the acquired dynamic and static characteristics of the domain name data with the obtained suspected threat domain name set to obtain a domain name detection result.

The specific process of the steps executed by each module in the server provided in the embodiment of the present application may refer to the method embodiment, and will not be described herein.

Further, an embodiment of the present application provides a computer-readable storage medium storing a program therein, which when executed, performs the following process:

For the specific implementation process of each step, refer to the description above; it is not repeated here.

The program stored in the storage medium provided by the embodiment of the application acquires a second feature set based on a first feature set acquired in advance, wherein the first feature set is a feature set related to a potential threat domain name, the second feature set is a subset of the first feature set, and the second feature set comprises dynamic scene features and static scene features; it then obtains a suspected threat domain name set by carrying out association calculation on the features in the second feature set. Because the second feature set is derived from the first feature set acquired in advance, the feature set related to the potential threat domain name can be rapidly delineated within big data, and obtaining the suspected threat domain name set based on the second feature set greatly reduces the amount of data to be processed. The efficiency of front-end threat domain name identification is therefore effectively improved, and effective analysis of the whole network log is realized.

It should be noted that in the embodiment of the application, the effective features associated with the threat domain name are obtained by training with a random forest algorithm, so that interference from features irrelevant to the threat domain name is effectively avoided. Because the random forest algorithm introduces two sources of randomness (random selection of the samples in each new training set and random selection of features), it is not prone to overfitting, and the effective features selected by voting are relatively accurate. The dynamic and static characteristics of the threat domain name are extracted, and the potential threat domain name feature set is mined in a mixed scene through the Apriori association rule algorithm, so that the static and dynamic rules of the domain name can be comprehensively reflected. A feature set formed by a plurality of effective features can fit an unknown domain name more accurately, which improves the ability to identify unknown domain names. When mining massive data in a mobile communication network, the processing flow of the method and the device can rapidly and accurately delineate the range of potential threat domain names within big data, and thereby further improves domain name recognition efficiency.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Guided by the present invention, those having ordinary skill in the art may devise many other forms without departing from the spirit of the present invention and the scope of the claims, all of which fall within the protection of the present invention.

Claims

1. A method of mining a potentially threatening domain name, comprising:

obtaining a suspected threat domain name set by performing association calculation on the features in the second feature set;

wherein the dynamic scene features include: at least one of a communication feature and an association feature, wherein the communication feature comprises a domain name generation algorithm DGA attribute, the association feature comprising: at least one of a domain name return code attribute, a domain name IP region attribute, and a URL return code attribute; the static scene features include: at least one of behavioral characteristics and fingerprint characteristics, wherein the behavioral characteristics include: at least one of the number of access users, the dispersion of the number of access users, the maximum URL number of domain name access, the number of single user access URLs and the survival time of domain names, wherein the fingerprint feature comprises: at least one of a terminal operating system attribute and an access carrier attribute; the DGA attribute specifically refers to the calculation of the random discrete degree of the domain name characters, namely the DGA attribute is the ratio of the entropy value of the domain name to the vowel ratio of the domain name, wherein the entropy value refers to the character randomness degree value, and the vowel ratio of the domain name refers to the ratio of the number of vowels of the domain name to the total length of the domain name.

2. The mining method of claim 1, wherein the acquiring a second feature set based on the pre-acquired first feature set comprises: training the feature data in the first feature set acquired in advance by utilizing a random forest algorithm to acquire a second feature set related to the potential threat domain name.

3. The mining method of claim 1, wherein prior to the acquiring the second feature set based on the pre-acquired first feature set, the method further comprises:

acquiring data of an operator side;

obtaining a first feature set related to a potential threat domain name based on the obtained data of the operator side;

wherein the first feature set comprises: at least two of domain name IP number, access user number, access times, URL number, maximum access user number, single user average URL number, URL average user number, URL average access times, user number standard deviation, access number standard deviation, URL access number discrete coefficient, access user number discrete degree, single user access domain name frequency access user number on the same day, maximum URL number of domain name access, single user access URL number, domain name survival time length, terminal operating system attribute, access carrier attribute, domain name return code attribute, IP region attribute, URL return code attribute and DGA attribute.

4. A mining method according to any one of claims 1 to 3, wherein said obtaining a set of suspected threat domain names by performing association calculation on the features in said second feature set comprises:

5. The mining method of claim 4, wherein the performing association analysis on the features in the second feature set based on the determined evaluation criteria and Apriori algorithm to determine a set of suspected threat domain names comprises:

6. The mining method of claim 1, further comprising:

7. A system for mining potentially threatening domain names, comprising:

the processing module is used for obtaining a suspected threat domain name set by carrying out association calculation on the features in the second feature set;

8. A server, comprising:

9. A computer-readable storage medium, wherein a program is stored in the computer-readable storage medium, which when executed, performs the following process: