CN108712403B

CN108712403B - Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Info

Publication number: CN108712403B
Application number: CN201810419153.8A
Authority: CN
Inventors: 张兆心; 程亚楠; 吴晓宝; 崔诗尧; 杜跃进; 陆柯羽
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Shandong Tianhe Cyberspace Security Technology Research Institute Co ltd
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2020-08-04
Anticipated expiration: 2038-05-04
Also published as: CN108712403A

Abstract

The invention provides an illegal domain name mining method based on domain name structure similarity, which solves the technical problem that existing methods cannot actively mine large numbers of illegal domain names. The method comprises the following steps: Step 1, read illegal domain names from a domain name blacklist; Step 2, determine whether any successfully aggregated class exists; if not, go to Step 10; otherwise, continue to the next step; Step 3, determine whether the current domain name can be assigned to the i-th aggregation class; if not, go to Step 10; otherwise, continue to the next step; the determination is based on whether the current domain name is similar to the central domain name, where the central domain name is a representative domain name of the aggregation class; Step 4, merge the current domain name into the i-th aggregation class, extract the generation pattern produced while matching the current domain name against the central domain name of that class, and continue to the next step; the generation pattern is a wildcard string extracted from each domain name in the aggregation class and the central domain name. The invention is widely applicable in the field of information technology.

Description

Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Technical Field

The invention relates to a method for mining illegal domain names, and in particular to an illegal domain name mining method based on domain name structure similarity.

Background

With the rapid development of the Internet, domain names, one of the products that emerged alongside it, have gradually become familiar and widely used. While domain names make websites easier to remember and IP addresses easier to change, they also carry some unavoidable security risks.

In recent years, more and more illegal organizations have used domain names to host illegal activities such as botnets, phishing websites, and pornography, gambling, and drug websites, causing incalculable financial and psychological harm to Internet users. A method for mining illegal domain names efficiently and quickly is therefore urgently needed.

At present, the vast majority of browsers rely on pre-prepared blacklists, which are updated and maintained periodically, to keep users from visiting illegal websites. However, because there is no method for actively mining large numbers of illegal domain names, this approach lacks timeliness.

Summary of the Invention

To address the technical problem that existing methods cannot actively mine large numbers of illegal domain names, the invention provides an illegal domain name mining method based on domain name structure similarity that can do so.

To this end, the technical solution of the invention comprises the following steps:

Step 1, read illegal domain names from the domain name blacklist;

Step 2, determine whether any successfully aggregated class exists; if not, go to Step 10; otherwise, continue to the next step;

Step 3, determine whether the current domain name can be assigned to the i-th aggregation class; if not, go to Step 10; otherwise, continue to the next step; the i-th aggregation class is the i-th class formed by grouping similar domain names according to the similarity rules;

The determination is based on whether the current domain name is similar to the central domain name, where the central domain name is a representative domain name of the aggregation class;

The specific method for determining whether the current domain name is similar to the central domain name is:

(1) If two domain names differ only in the top-level domain and are otherwise identical, the two domain names are similar;

(2) If two domain names have the same top-level domain, then: when the second-level domains have the same length and differ in no more than 2 characters at the same positions, or differ only in a run of consecutive identical characters at the same positions, the two domain names are similar; when the lengths of the second-level domains differ by 1 and the longer one becomes the shorter one after removing a single character, the two domain names are similar;

(3) If neither (1) nor (2) determines the two domain names to be similar, they are not similar;

Step 4, merge the current domain name into the i-th aggregation class, extract the generation pattern produced while matching the current domain name against the central domain name of that class, and continue to the next step;

The generation pattern is a wildcard string extracted from each domain name in the aggregation class and the central domain name;

Step 5, enumerate over the generation pattern to produce candidate domain names similar to the central domain name, filter out those candidates that are already stored as illegal domain names, and continue to the next step;

Step 6, check one by one whether each similar domain name remaining after the filtering in Step 5 exists by retrieving its WHOIS information; if it does not exist, discard it; otherwise, keep it and continue to the next step;

Step 7, detect whether the retained domain name is illegal; if so, add it to the illegal domain name set; otherwise, add it to the unknown domain name set; continue to the next step;

Step 8, determine whether all similar domain names remaining after the filtering in Step 5 have been checked; if so, continue to the next step; otherwise, go to Step 6;

Step 9, determine whether all illegal domain names from Step 1 have been clustered; if so, the algorithm ends; otherwise, go to Step 1;

Step 10, create a new class, set the current domain name as the central domain name of that class, and go to Step 9.
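
The clustering part of this flow (Steps 1 to 3, 9, and 10) can be sketched in Python as follows. This is only one reading of the flowchart, in which each blacklist entry is compared against the central domain name of every existing class in turn. The function name `cluster_domains`, the dictionary layout, and the `is_similar` callable passed in by the caller are illustrative assumptions, not names from the patent, and Steps 5 to 8 are deliberately left out.

```python
from typing import Callable, Dict, Iterable, List


def cluster_domains(
    blacklist: Iterable[str],
    is_similar: Callable[[str, str], bool],
) -> List[Dict]:
    """Greedy single-pass clustering around central domain names (a sketch)."""
    clusters: List[Dict] = []
    for domain in blacklist:                       # Step 1: next blacklist entry
        for cluster in clusters:                   # Steps 2-3: try each existing class
            if is_similar(domain, cluster["center"]):
                cluster["members"].append(domain)  # Step 4: merge into that class
                break
        else:
            # Step 10: no class matched (or none exists yet), so open a new
            # class with the current domain name as its central domain name.
            clusters.append({"center": domain, "members": [domain]})
    return clusters                                # Step 9: blacklist exhausted
```

Because the first domain name that fails to match any class becomes a new central domain name, the result of this sketch depends on the order in which the blacklist is read.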

Preferably, in Step 4, the generation pattern uses wildcards to replace the parts that differ between two illegal domain names, and uses indicators to specify how the wildcards are to be enumerated.

Preferably, in Step 7, the detection is performed through an authoritative third-party detection interface.

Beneficial effects of the invention: the method analyzes a large existing set of illegal domain names in order to mine a large number of illegal domain names not yet included in it. First, the set of illegal domain names in the prepared blacklist is clustered, grouping structurally similar illegal domain names into one class and thereby forming multiple aggregation classes. Then, one or more generation patterns are extracted from each class to obtain a set of generation patterns. Next, the generation patterns are enumerated to generate similar domain names suspected of being illegal. Finally, an authoritative third-party detection interface is used to check the generated set of suspected domain names and select the illegal ones. Starting from the structural similarity of illegal domain names, the method actively mines a large number of illegal domain names absent from the database; moreover, the illegal domain names mined in this way are strongly correlated with one another, which facilitates association analysis and group analysis of illegal domain names.

Description of the Drawings

Fig. 1 is the overall functional flowchart of an embodiment of the invention;

Fig. 2 is the method flowchart of an embodiment of the invention.

Detailed Description

The invention is further described below with reference to an embodiment.

Illegal domain names are structurally similar to one another: slightly modifying the structure of a single illegal domain name often yields a batch of illegal domain names, and the domain names in such a batch were very likely registered by the same registrant or the same illegal organization. For example, starting from the illegal domain name 00080d.com, further similar illegal domain names such as 00080e.com, 00080f.com, and 00080w.com can be mined.

As shown in Figs. 1 and 2, this embodiment provides an illegal domain name mining method based on domain name structure similarity. The main steps fall into four blocks: similarity clustering, generation pattern extraction, similar domain name generation, and detection of the existence and illegality of the similar domain names. This embodiment clusters a blacklist consisting of illegal gambling, pornography, and fraud domain names, using custom similarity rules to group structurally similar domain names into one class; it then extracts the generation patterns of each class, generates similar domain names, and finally detects the similar domain names that are both illegal and actually exist. The specific steps are as follows:

Step 3, determine whether the current domain name can be assigned to the i-th aggregation class; if not, go to Step 10; otherwise, continue to the next step;

Here, the determination is made by checking whether the domain name is similar to the central domain name, where the central domain name is a representative domain name of the aggregation class;

The i-th aggregation class is the i-th class formed by grouping similar domain names according to custom similarity rules. The similarity rules are as follows:

(1) If two domain names differ only in the top-level domain and are otherwise identical, such as 08vip.vip and 08vip.tv, the two domain names are similar;

(2) If two domain names have the same top-level domain, then: when the second-level domains have the same length and differ in no more than 2 characters at the same positions, such as 00037b.com and 00037c.com, 099sun.com and 099sky.com, or 1188030.com and 1388033.com, or differ only in a run of consecutive identical characters at the same positions, such as 4148ww.com and 4148nn.com or 4040uuu.com and 4040jjj.com, the two domain names are similar; when the lengths of the second-level domains differ by 1 and the longer one becomes the shorter one after removing a single character, such as 0000524.com and 00001524.com, the two domain names are similar;

(3) If neither (1) nor (2) determines the two domain names to be similar, they are not similar.
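
The three similarity rules can be sketched in Python as below. The sketch assumes two-label domain names of the form `name.tld`, treats everything before the last dot as the second-level domain, and treats two identical domain names as not similar; none of these choices is stated in the text. The names `split_domain` and `is_similar` are illustrative.

```python
from typing import Tuple


def split_domain(domain: str) -> Tuple[str, str]:
    """Split 'name.tld' into (second-level label, top-level domain)."""
    sld, _, tld = domain.lower().rpartition(".")
    return sld, tld


def is_similar(a: str, b: str) -> bool:
    """Apply similarity rules (1) to (3) to a pair of domain names."""
    sld_a, tld_a = split_domain(a)
    sld_b, tld_b = split_domain(b)

    # Rule (1): only the top-level domain differs.
    if sld_a == sld_b:
        return tld_a != tld_b
    # Rule (2) requires the same top-level domain.
    if tld_a != tld_b:
        return False

    if len(sld_a) == len(sld_b):
        diff = [i for i, (x, y) in enumerate(zip(sld_a, sld_b)) if x != y]
        # Rule (2), first case: at most 2 differing positions.
        if len(diff) <= 2:
            return True
        # Rule (2), second case: one contiguous run in which each side
        # repeats a single character (e.g. "uuu" versus "jjj").
        contiguous = diff[-1] - diff[0] + 1 == len(diff)
        run_a = {sld_a[i] for i in diff}
        run_b = {sld_b[i] for i in diff}
        return contiguous and len(run_a) == 1 and len(run_b) == 1

    # Rule (2), third case: lengths differ by 1 and deleting one character
    # from the longer label gives the shorter one.
    longer, shorter = (sld_a, sld_b) if len(sld_a) > len(sld_b) else (sld_b, sld_a)
    if len(longer) - len(shorter) == 1:
        return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

    # Rule (3): nothing matched.
    return False
```

For instance, `is_similar("4040uuu.com", "4040jjj.com")` is true under the run case even though three characters differ, while `is_similar("abcdef.com", "uvwxyz.com")` is false.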

Here, the generation pattern is a wildcard string extracted from each domain name in the aggregation class and the central domain name. A generation pattern is extracted by using wildcards to replace the parts that differ between two illegal domain names and using indicators to specify how the wildcards are to be enumerated, as detailed below:

(1) If two domain names differ only in the top-level domain, such as 08vip.vip and 08vip.tv, the generation pattern 08vip.% can be extracted;

(2) If two domain names are similar and differ in length by exactly one character, such as 0000524.com and 00001524.com, the pattern 00001524-.com or 0000524+.com can be extracted;

(3) If two domain names are similar and differ in no more than 2 characters at the same positions: when the differing characters are all digits, such as 1188030.com and 1388033.com, the pattern 1#8803#.com can be extracted; when the differing characters are all letters, such as 00037b.com and 00037c.com or 099sun.com and 099sky.com, the patterns 00037*.com and 099s**.com can be extracted; when the differing characters are a digit on one side and a letter on the other, such as 004zyz.com and 0044y8.com, the pattern 004$y$.com can be extracted;

(4) If two domain names are similar and differ only in a run of consecutive identical characters at the same positions, such as 4148ww.com and 4148nn.com, 4040uuu.com and 4040jjj.com, or 1186655.com and 1186699.com, the patterns 4148**&.com, 4040***&.com, and 11866##&.com can be extracted, respectively.
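
A possible Python sketch of these four extraction cases follows. It assumes the pair has already been judged similar, that both names have the form `name.tld`, and that the pattern is anchored on the central domain name. The names `wildcard_for` and `extract_pattern` are illustrative. Letters always map to `*` and digits to `#`, following the wildcard definitions given for enumeration.

```python
def wildcard_for(x: str, y: str) -> str:
    """Pick the wildcard covering a pair of differing characters."""
    if x.isdigit() and y.isdigit():
        return "#"   # both digits
    if x.isalpha() and y.isalpha():
        return "*"   # both letters
    return "$"       # one digit, one letter


def extract_pattern(center: str, other: str) -> str:
    """Extract a generation pattern from a similar pair (center, other)."""
    sld_c, _, tld_c = center.rpartition(".")
    sld_o, _, _tld_o = other.rpartition(".")

    # Case (1): only the top-level domain differs.
    if sld_c == sld_o:
        return f"{sld_c}.%"

    # Case (2): lengths differ by one; '-' means delete a character from
    # the center's label, '+' means add one to it.
    if len(sld_c) != len(sld_o):
        indicator = "-" if len(sld_c) > len(sld_o) else "+"
        return f"{sld_c}{indicator}.{tld_c}"

    # Cases (3) and (4): same length, replace differing positions.
    diff = [i for i, (x, y) in enumerate(zip(sld_c, sld_o)) if x != y]
    chars = list(sld_c)
    for i in diff:
        chars[i] = wildcard_for(sld_c[i], sld_o[i])

    # Case (4): a contiguous run where each side repeats one character
    # gets the continuity indicator '&'.
    is_run = (
        len(diff) > 1
        and diff[-1] - diff[0] + 1 == len(diff)
        and len({sld_c[i] for i in diff}) == 1
        and len({sld_o[i] for i in diff}) == 1
    )
    return f"{''.join(chars)}{'&' if is_run else ''}.{tld_c}"
```

Under these assumptions, `extract_pattern("004zyz.com", "0044y8.com")` gives `004$y$.com` and `extract_pattern("1186655.com", "1186699.com")` gives `11866##&.com`.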

Step 5, use the wildcards and indicators in the generation pattern to guide the enumeration of candidate domain names similar to the central domain name, filter out those candidates that are already stored as illegal domain names, and continue to the next step;

The wildcards and indicators are defined as follows:

(1) % is the top-level domain wildcard; during enumeration, % is replaced with the top-level domains extracted from the blacklist;

(2) - and + are indicators; they indicate that a character of the second-level domain is to be deleted or a character is to be added during enumeration;

(3) * is the letter wildcard, # is the digit wildcard, and $ is the alphanumeric wildcard; during enumeration, * is replaced with the letters a to z, # with the digits 0 to 9, and $ with 0 to 9 and a to z;

(4) & is the continuity indicator; it indicates that all wildcards are replaced with the same character during enumeration;
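
The enumeration in Step 5 might be sketched in Python as follows. The function name `enumerate_pattern`, its parameters, and the `ALPHABETS` table are illustrative. Two points are assumptions not fixed by the text: for `+` the sketch inserts any character from 0 to 9 and a to z at any position, and for `-` it deletes any single character.

```python
import string
from itertools import product
from typing import Iterable, List

# Replacement alphabets for the three character wildcards.
ALPHABETS = {
    "*": string.ascii_lowercase,
    "#": string.digits,
    "$": string.digits + string.ascii_lowercase,
}


def enumerate_pattern(
    pattern: str, tlds: Iterable[str], known: Iterable[str] = ()
) -> List[str]:
    """Expand a generation pattern and drop names already in `known`."""
    body, _, tld = pattern.rpartition(".")
    if tld == "%":
        # Top-level domain wildcard: substitute TLDs seen in the blacklist.
        candidates = [f"{body}.{t}" for t in tlds]
    elif body.endswith("-"):
        # Deletion indicator: remove one character from the label.
        sld = body[:-1]
        variants = {sld[:i] + sld[i + 1:] for i in range(len(sld))}
        candidates = sorted(f"{v}.{tld}" for v in variants)
    elif body.endswith("+"):
        # Addition indicator: insert one character (alphabet is assumed).
        sld = body[:-1]
        variants = {
            sld[:i] + ch + sld[i:]
            for i in range(len(sld) + 1)
            for ch in ALPHABETS["$"]
        }
        candidates = sorted(f"{v}.{tld}" for v in variants)
    else:
        linked = body.endswith("&")
        if linked:
            body = body[:-1]
        slots = [i for i, ch in enumerate(body) if ch in ALPHABETS]
        if linked and slots:
            # Continuity indicator: every wildcard takes the same character.
            combos = [(ch,) * len(slots) for ch in ALPHABETS[body[slots[0]]]]
        else:
            combos = product(*(ALPHABETS[body[i]] for i in slots))
        candidates = []
        for combo in combos:
            chars = list(body)
            for i, ch in zip(slots, combo):
                chars[i] = ch
            candidates.append("".join(chars) + "." + tld)
    seen = set(known)
    return [d for d in candidates if d not in seen]
```

With this sketch, `4148**&.com` expands to 26 candidates (4148aa.com through 4148zz.com), whereas `004$y$.com` expands to 36 × 36 = 1296, which shows how the continuity indicator keeps the candidate set small.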

Step 6, check one by one whether each similar domain name remaining after the filtering in Step 5 exists by retrieving its WHOIS information; if it does not exist, discard it; otherwise, keep it and continue to the next step;

Step 7, use an authoritative third-party detection interface to detect whether the retained domain name is illegal; if so, add it to the illegal domain name set; otherwise, add it to the unknown domain name set, and continue to the next step;

The domain names in the unknown domain name set are re-checked periodically to determine whether they are illegal; if a domain name is detected as illegal, it is added to the illegal domain name set; otherwise, it remains in the unknown domain name set;
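
Steps 6 and 7, together with the periodic re-check, can be sketched in Python as below. The text does not name a specific WHOIS client or detection service, so the sketch takes two hypothetical callables from the caller: `exists` stands in for the WHOIS lookup and `is_illegal` for the third-party detection interface. The function names `triage` and `recheck_unknown` are likewise illustrative.

```python
from typing import Callable, Iterable, Set, Tuple


def triage(
    candidates: Iterable[str],
    exists: Callable[[str], bool],       # hypothetical WHOIS existence check
    is_illegal: Callable[[str], bool],   # hypothetical third-party detector
) -> Tuple[Set[str], Set[str]]:
    """Sort candidates into (illegal set, unknown set)."""
    illegal: Set[str] = set()
    unknown: Set[str] = set()
    for domain in candidates:
        if not exists(domain):           # Step 6: discard unregistered names
            continue
        if is_illegal(domain):           # Step 7: classify the retained name
            illegal.add(domain)
        else:
            unknown.add(domain)
    return illegal, unknown


def recheck_unknown(
    illegal: Set[str], unknown: Set[str], is_illegal: Callable[[str], bool]
) -> None:
    """Periodic re-check: move newly detected names from unknown to illegal."""
    newly_illegal = {d for d in unknown if is_illegal(d)}
    illegal |= newly_illegal
    unknown -= newly_illegal
```

Passing the two checks in as callables keeps the sketch independent of any particular service and makes it easy to exercise with stub functions.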

The method analyzes a large existing set of illegal domain names in order to mine a large number of illegal domain names not yet included in it. First, the set of illegal domain names in the prepared blacklist is clustered, grouping structurally similar illegal domain names into one class and thereby forming multiple aggregation classes. Then, one or more generation patterns are extracted from each class to obtain a set of generation patterns. Next, the generation patterns are enumerated to generate similar domain names suspected of being illegal. Finally, an authoritative third-party detection interface is used to check the generated set of suspected domain names and select the illegal ones. Starting from the structural similarity of illegal domain names, the method actively mines a large number of illegal domain names absent from the database; moreover, the illegal domain names mined in this way are strongly correlated with one another, which facilitates association analysis, group analysis, and similar analyses of illegal domain names.

The above is merely a specific embodiment of the invention and does not limit the scope in which the invention may be practiced. Replacement by equivalent components, and equivalent changes and modifications made within the scope of patent protection of the invention, all remain within the scope covered by the claims of the invention.

Claims

1. An illegal domain name mining method based on domain name structure similarity, characterized in that it comprises the following steps:

Step 1, read illegal domain names from the domain name blacklist;

Step 2, judge whether there is a successfully aggregated class, if not, go to step 10; otherwise, continue to the next step;

Step 3, determine whether the current domain name can be assigned to the i-th aggregation class; if not, go to Step 10; otherwise, continue to the next step; the i-th aggregation class is the i-th class formed by grouping similar domain names according to the similarity rules;

The judgment is based on whether the current domain name is similar to the central domain name, and the central domain name refers to a representative domain name in the aggregation class;

The specific method for judging whether the current domain name is similar to the central domain name is:

(1) If the two domain names are different only in the top-level domain and other parts are the same, the two domain names are similar;

(2) If two domain names have the same top-level domain, then: when the second-level domains have the same length and differ in no more than 2 characters at the same positions, or differ only in a run of consecutive identical characters at the same positions, the two domain names are similar; when the lengths of the second-level domains differ by 1 and the longer one becomes the shorter one after removing a single character, the two domain names are similar;

(3) If neither of (1) and (2) is determined to be similar, the two domain names are not similar;

Step 4, merge the current domain name into the i-th aggregation class, and extract the generation pattern generated during the matching process between the current domain name and the central domain name of this class, and continue to the next step;

The generation mode is a wildcard string extracted from each domain name and the central domain name in the aggregation class;

Step 5, generate possible similar domain names similar to the central domain name by enumerating in the generation mode, and filter out the illegal domain names that have been stored in the similar domain names, and continue to the next step;

Step 6: Determine whether the similar domain names screened in Step 5 exist one by one by obtaining the domain name WHOIS information, if not, discard; otherwise, keep it and continue to the next step;

Step 7: Detect whether the reserved domain name is illegal, if it is detected illegal, add it to the illegal domain name set; otherwise, add it to the unknown domain name set; continue to the next step;

Step 8, determine whether the similar domain names screened in Step 5 have been detected, if the detection is completed, continue to the next step; otherwise, go to Step 6;

Step 9, determine whether the illegal domain names of step 1 have been clustered, if the clustering has been completed, the algorithm ends; otherwise, go to step 1;

Step 10, create a new class, set the current domain name as the central domain name of the class, and go to step 9.

2. The illegal domain name mining method based on domain name structure similarity according to claim 1, characterized in that, in Step 4, the generation pattern uses wildcards to replace the parts that differ between two illegal domain names, and uses indicators to specify how the wildcards are to be enumerated.

3. The method for mining illegal domain names based on the similarity of domain name structures according to claim 1, wherein in the step 7, the detection is performed through an authoritative third-party detection interface.