CN108712403B - Illegal Domain Name Mining Method Based on Domain Name Structure Similarity - Google Patents

Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Info

Publication number
CN108712403B
CN108712403B CN201810419153.8A CN201810419153A CN108712403B CN 108712403 B CN108712403 B CN 108712403B CN 201810419153 A CN201810419153 A CN 201810419153A CN 108712403 B CN108712403 B CN 108712403B
Authority
CN
China
Prior art keywords
domain name
domain
illegal
similar
names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810419153.8A
Other languages
Chinese (zh)
Other versions
CN108712403A (en)
Inventor
张兆心
程亚楠
吴晓宝
崔诗尧
杜跃进
陆柯羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Tianhe Cyberspace Security Technology Research Institute Co ltd
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai
Priority to CN201810419153.8A
Publication of CN108712403A
Application granted
Publication of CN108712403B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an illegal domain name mining method based on domain name structure similarity, which solves the technical problem that existing methods cannot actively mine large numbers of illegal domain names. The method comprises the following steps: Step 1, read illegal domain names from a domain name blacklist; Step 2, determine whether any successfully aggregated class exists; if not, go to Step 10; otherwise, continue to the next step; Step 3, determine whether the current domain name can be assigned to the i-th aggregation class; if not, go to Step 10; otherwise, continue to the next step, where the determination is based on whether the current domain name is similar to the central domain name, the central domain name being a representative domain name in the aggregation class; Step 4, merge the current domain name into the i-th aggregation class, extract the generation pattern produced while matching the current domain name against the class's central domain name, and continue to the next step, where a generation pattern is a wildcard string extracted from a domain name in the aggregation class and the central domain name. The invention is widely applicable in the field of information technology.

Description

Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Technical Field

The invention relates to a method for mining illegal domain names, and in particular to a method for mining illegal domain names based on the similarity of domain name structures.

Background

With the rapid development of the Internet, domain names, one of the products that emerged alongside it, have gradually become widely known and used. Domain names make websites easier to remember and IP addresses easier to change, but they also carry security risks that cannot be avoided.

In recent years, more and more illegal organizations have used domain names to host illegal activities such as botnets, phishing websites, and pornography, gambling, and drug websites, causing incalculable financial and psychological harm to Internet users. A method for mining illegal domain names efficiently and quickly is therefore urgently needed.

At present, the vast majority of browsers rely on a pre-prepared blacklist, which is updated and maintained periodically, to keep users from visiting illegal websites. However, because there is no method for actively mining large numbers of illegal domain names, this approach lacks timeliness.

Summary of the Invention

To address the technical problem that existing methods cannot actively mine large numbers of illegal domain names, the invention provides an illegal domain name mining method, based on domain name structure similarity, that can actively mine large numbers of illegal domain names.

To this end, the technical solution of the invention comprises the following steps:

Step 1: Read illegal domain names from the domain name blacklist.

Step 2: Determine whether any successfully aggregated class exists. If not, go to Step 10; otherwise, continue to the next step.

Step 3: Determine whether the current domain name can be assigned to the i-th aggregation class. If not, go to Step 10; otherwise, continue to the next step. The i-th aggregation class is the i-th class formed by grouping similar domain names according to the similarity rules.

The determination is based on whether the current domain name is similar to the central domain name, where the central domain name is a representative domain name in the aggregation class.

Whether the current domain name is similar to the central domain name is determined as follows:

(1) If the two domain names differ only in their top-level domain and are otherwise identical, they are similar.

(2) If the two domain names have the same top-level domain: when their second-level domains have the same length, they are similar if no more than 2 characters differ at corresponding positions, or if they differ only in a run of consecutive repeated characters at the same positions; when their second-level domain lengths differ by 1, they are similar if removing one character from the longer name yields the shorter name.

(3) If neither (1) nor (2) determines that they are similar, the two domain names are not similar.

Step 4: Merge the current domain name into the i-th aggregation class, extract the generation pattern produced while matching the current domain name against the class's central domain name, and continue to the next step.

A generation pattern is a wildcard string extracted from a domain name in the aggregation class and the central domain name.

Step 5: Enumerate over the generation pattern to produce candidate domain names that may exist and are similar to the central domain name, filter out candidates already stored as illegal domain names, and continue to the next step.

Step 6: Obtain WHOIS information for each candidate remaining after Step 5 to determine whether it exists. If it does not exist, discard it; otherwise, keep it and continue to the next step.

Step 7: Check whether each retained domain name is illegal. If so, add it to the illegal domain name set; otherwise, add it to the unknown domain name set. Continue to the next step.

Step 8: Determine whether all candidates remaining after Step 5 have been checked. If so, continue to the next step; otherwise, go to Step 6.

Step 9: Determine whether all illegal domain names from Step 1 have been clustered. If so, the algorithm ends; otherwise, go to Step 1.

Step 10: Create a new class, set the current domain name as its central domain name, and go to Step 9.

Preferably, in Step 4, the generation pattern uses wildcards to replace the parts that differ between two illegal domain names, and uses indicators to specify how those wildcards are to be enumerated.

Preferably, in Step 7, the check is performed through an authoritative third-party detection interface.

Beneficial effects of the invention: the method analyzes a large existing collection of illegal domain names in order to mine a large number of illegal domain names not yet included in it. First, the illegal domain names in a prepared blacklist are clustered so that structurally similar names fall into the same class, forming multiple aggregation classes. Next, one or more generation patterns are extracted from each class, yielding a set of generation patterns. Then the patterns are enumerated to generate similar domain names suspected of being illegal. Finally, a third-party authoritative detection interface checks the generated set, and the illegal similar domain names are selected. Starting from the structural similarity of illegal domain names, the method actively mines a large number of illegal domain names not present in the database. Because the names mined this way are strongly related to one another, the results also support association analysis and group analysis of illegal domain names.

Brief Description of the Drawings

FIG. 1 is an overall functional flowchart of an embodiment of the invention.

FIG. 2 is a flowchart of the method according to an embodiment of the invention.

Detailed Description

The invention is further described below with reference to an embodiment.

Illegal domain names are structurally similar to one another. A batch of illegal domain names can often be produced by slightly modifying the structure of a single illegal domain name, and the resulting batch was very likely registered by the same registrant or the same illegal organization. For example, starting from the illegal domain name 00080d.com, further similar illegal domain names such as 00080e.com, 00080f.com, and 00080w.com can be mined.

As shown in FIGS. 1 and 2, this embodiment provides an illegal domain name mining method based on domain name structure similarity. It consists of four main stages: similarity clustering, generation pattern extraction, similar domain name generation, and checking the existence and illegality of the similar domain names. The embodiment clusters a blacklist made up of gambling, pornography, and fraud domain names. Using self-defined similarity rules, it groups structurally similar domain names into classes, extracts the generation patterns of each class, generates similar domain names, and finally identifies the similar domain names that are both illegal and actually exist. The specific steps are as follows:

Step 1: Read illegal domain names from the domain name blacklist.

Step 2: Determine whether any successfully aggregated class exists. If not, go to Step 10; otherwise, continue to the next step.

Step 3: Determine whether the current domain name can be assigned to the i-th aggregation class. If not, go to Step 10; otherwise, continue to the next step.

The determination checks whether the domain name is similar to the central domain name, where the central domain name is a representative domain name in the aggregation class.

The i-th aggregation class is the i-th class formed by grouping similar domain names according to the self-defined similarity rules, which are as follows:

(1) If the two domain names differ only in their top-level domain and are otherwise identical, such as 08vip.vip and 08vip.tv, they are similar.

(2) If the two domain names have the same top-level domain: when their second-level domains have the same length, they are similar if no more than 2 characters differ at corresponding positions, such as 00037b.com and 00037c.com, 099sun.com and 099sky.com, or 1188030.com and 1388033.com; they are also similar if they differ only in a run of consecutive repeated characters at the same positions, such as 4148ww.com and 4148nn.com, or 4040uuu.com and 4040jjj.com. When their second-level domain lengths differ by 1, they are similar if removing one character from the longer name yields the shorter name, such as 0000524.com and 00001524.com.

(3) If neither (1) nor (2) determines that they are similar, the two domain names are not similar.
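The similarity rules above can be sketched as a small predicate. This is a minimal illustration, not the patent's implementation: the function names `split_domain` and `is_similar` are hypothetical, domain names are assumed to be lowercase two-label names of the form `sld.tld`, and treating two identical names as similar (zero differing characters) is an assumption the text does not address.

```python
def split_domain(domain: str):
    """Split 'sld.tld' at the last dot and return (sld, tld)."""
    sld, _, tld = domain.rpartition(".")
    return sld, tld


def is_similar(a: str, b: str) -> bool:
    """Return True if two domain names are similar under rules (1)-(3)."""
    sld_a, tld_a = split_domain(a)
    sld_b, tld_b = split_domain(b)

    # Rule (1): same second-level domain, different top-level domain.
    if sld_a == sld_b and tld_a != tld_b:
        return True
    # Rule (2) only applies when the top-level domains match.
    if tld_a != tld_b:
        return False

    if len(sld_a) == len(sld_b):
        # Positions at which the two second-level domains differ.
        diff = [i for i, (x, y) in enumerate(zip(sld_a, sld_b)) if x != y]
        if len(diff) <= 2:
            return True
        # More than 2 differences: accept only one contiguous run in which
        # each side repeats a single character (e.g. "uuu" vs "jjj").
        contiguous = diff[-1] - diff[0] + 1 == len(diff)
        one_char_a = len({sld_a[i] for i in diff}) == 1
        one_char_b = len({sld_b[i] for i in diff}) == 1
        return contiguous and one_char_a and one_char_b

    if abs(len(sld_a) - len(sld_b)) == 1:
        # Deleting one character from the longer name must give the shorter.
        longer, shorter = sorted((sld_a, sld_b), key=len, reverse=True)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))

    # Rule (3): nothing matched.
    return False
```

For instance, `is_similar("4040uuu.com", "4040jjj.com")` is accepted by the repeated-run branch, while a pair such as `abc.com` and `xyz.com` is rejected because the three differing characters are not a repeated run.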

Step 4: Merge the current domain name into the i-th aggregation class, extract the generation pattern produced while matching the current domain name against the class's central domain name, and continue to the next step.

A generation pattern is a wildcard string extracted from a domain name in the aggregation class and the central domain name. To extract it, wildcards replace the parts that differ between the two illegal domain names, and indicators specify how those wildcards are to be enumerated, as follows:

(1) If two domain names differ only in their top-level domain, such as 08vip.vip and 08vip.tv, the generation pattern 08vip.% can be extracted.

(2) If two domain names are similar and differ by only one character, such as 0000524.com and 00001524.com, the pattern 00001524-.com or 0000524+.com can be extracted.

(3) If two domain names are similar and no more than 2 characters differ at corresponding positions: when the differing characters are all digits, such as 1188030.com and 1388033.com, the pattern 1#8803#.com can be extracted; when the differing characters are all letters, such as 00037b.com and 00037c.com, or 099sun.com and 099sky.com, the patterns 00037*.com and 099s**.com can be extracted; when a differing position holds a digit in one name and a letter in the other, such as 004zyz.com and 0044y8.com, the pattern 004$y$.com can be extracted.

(4) If two domain names are similar and differ only in a run of consecutive repeated characters at the same positions, such as 4148ww.com and 4148nn.com, 4040uuu.com and 4040jjj.com, or 1186655.com and 1186699.com, the patterns 4148**&.com, 4040***&.com, and 11866##&.com can be extracted, respectively.
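The extraction cases above can be sketched as a single function that takes a central domain name and a similar domain name and returns a pattern string. The names `extract_pattern` and `_wildcard` are hypothetical. Two points are assumptions rather than statements from the text: the pattern for a one-character length difference is built from the central name, and a run of two or more repeated characters is given the `&` indicator in preference to case (3), which is how the example 4148ww.com / 4148nn.com is treated.

```python
from typing import Optional


def _wildcard(x: str, y: str) -> str:
    """Pick the wildcard for one differing position."""
    if x.isdigit() and y.isdigit():
        return "#"  # digit on both sides
    if x.isalpha() and y.isalpha():
        return "*"  # letter on both sides
    return "$"      # mixed letter and digit


def extract_pattern(center: str, other: str) -> Optional[str]:
    """Return the generation pattern for two similar names, else None."""
    c_sld, _, c_tld = center.rpartition(".")
    o_sld, _, o_tld = other.rpartition(".")

    # Case (1): only the top-level domain differs.
    if c_sld == o_sld:
        return c_sld + ".%" if c_tld != o_tld else None
    if c_tld != o_tld:
        return None

    if len(c_sld) == len(o_sld):
        diff = [i for i in range(len(c_sld)) if c_sld[i] != o_sld[i]]
        contiguous = diff[-1] - diff[0] + 1 == len(diff)
        repeated = (len({c_sld[i] for i in diff}) == 1
                    and len({o_sld[i] for i in diff}) == 1)
        is_run = len(diff) >= 2 and contiguous and repeated
        if len(diff) > 2 and not is_run:
            return None  # not similar under the rules
        chars = list(c_sld)
        for i in diff:
            chars[i] = _wildcard(c_sld[i], o_sld[i])
        # Case (4) appends "&"; case (3) leaves the wildcards independent.
        return "".join(chars) + ("&" if is_run else "") + "." + c_tld

    # Case (2): lengths differ by one and a single deletion links the names.
    longer, shorter = sorted((c_sld, o_sld), key=len, reverse=True)
    if len(longer) - len(shorter) == 1 and any(
        longer[:i] + longer[i + 1:] == shorter for i in range(len(longer))
    ):
        indicator = "-" if len(c_sld) > len(o_sld) else "+"
        return c_sld + indicator + "." + c_tld
    return None
```

Under these assumptions, `extract_pattern("1188030.com", "1388033.com")` yields `1#8803#.com` and `extract_pattern("004zyz.com", "0044y8.com")` yields `004$y$.com`, matching the examples in the text.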

Step 5: Guided by the wildcards and indicators in the generation pattern, enumerate candidate domain names that may exist and are similar to the central domain name, filter out candidates already stored as illegal domain names, and continue to the next step.

The wildcards and indicators are defined as follows:

(1) % is the top-level domain wildcard; during enumeration, % is replaced with the top-level domains extracted from the blacklist.

(2) - and + are indicators; during enumeration, they indicate that one character of the second-level domain is to be deleted or that one character is to be added, respectively.

(3) * is the letter wildcard, # is the digit wildcard, and $ is the alphanumeric wildcard; during enumeration, * is replaced with the letters a to z, # with the digits 0 to 9, and $ with 0 to 9 and a to z.

(4) & is the continuation indicator; during enumeration, it indicates that all wildcards are replaced with the same character.
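The wildcard and indicator semantics above can be sketched as an enumerator that expands one pattern into candidate names and drops those already in the blacklist. The function name `enumerate_pattern` and its parameters are hypothetical. The text does not say which characters the `+` indicator may add or at which positions, so this sketch assumes any of 0 to 9 and a to z at any position; likewise `-` is assumed to delete any single character.

```python
import itertools
import string

# Replacement alphabets for the three character wildcards.
ALPHABETS = {
    "*": string.ascii_lowercase,
    "#": string.digits,
    "$": string.digits + string.ascii_lowercase,
}


def enumerate_pattern(pattern, known_tlds, blacklist):
    """Expand a generation pattern and drop names already blacklisted."""
    sld, _, tld = pattern.rpartition(".")
    candidates = set()

    if tld == "%":
        # Top-level domain wildcard: try every TLD seen in the blacklist.
        candidates = {sld + "." + t for t in known_tlds}
    elif sld.endswith("-"):
        # Deletion indicator: remove each single character in turn.
        base = sld[:-1]
        candidates = {base[:i] + base[i + 1:] + "." + tld
                      for i in range(len(base))}
    elif sld.endswith("+"):
        # Addition indicator (assumed alphabet: 0-9, a-z; any position).
        base = sld[:-1]
        candidates = {base[:i] + ch + base[i:] + "." + tld
                      for i in range(len(base) + 1)
                      for ch in ALPHABETS["$"]}
    else:
        linked = sld.endswith("&")
        body = sld[:-1] if linked else sld
        slots = [i for i, ch in enumerate(body) if ch in ALPHABETS]
        if linked and slots:
            # "&": every wildcard takes the same character.
            combos = [(ch,) * len(slots) for ch in ALPHABETS[body[slots[0]]]]
        else:
            # Independent wildcards: full Cartesian product.
            combos = itertools.product(*(ALPHABETS[body[i]] for i in slots))
        for combo in combos:
            chars = list(body)
            for i, ch in zip(slots, combo):
                chars[i] = ch
            candidates.add("".join(chars) + "." + tld)

    return sorted(candidates - set(blacklist))
```

As a usage example, `enumerate_pattern("08vip.%", ["vip", "tv", "com"], {"08vip.vip", "08vip.tv"})` returns `["08vip.com"]`. A pattern with two independent digit wildcards expands to 100 names before filtering, whereas the same pattern with `&` would expand to only 10, which is why the continuation indicator matters for keeping the candidate set small.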

Step 6: Obtain WHOIS information for each candidate remaining after Step 5 to determine whether it exists. If it does not exist, discard it; otherwise, keep it and continue to the next step.

Step 7: Use an authoritative third-party detection interface to check whether each retained domain name is illegal. If so, add it to the illegal domain name set; otherwise, add it to the unknown domain name set. Continue to the next step.

The domain names in the unknown domain name set are re-checked periodically to determine whether they are illegal. Any name found to be illegal is added to the illegal domain name set; the others remain in the unknown domain name set.
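Steps 6 and 7, together with the periodic re-check, can be sketched as follows. The text names neither a WHOIS client nor a specific detection interface, so both are passed in as callables here: `whois_exists` and `is_illegal` are hypothetical stand-ins that a real deployment would back with an actual WHOIS query and an actual third-party detection service.

```python
from typing import Callable, Iterable, Set


def screen_candidates(
    candidates: Iterable[str],
    whois_exists: Callable[[str], bool],  # hypothetical WHOIS existence check
    is_illegal: Callable[[str], bool],    # hypothetical third-party detector
    illegal: Set[str],
    unknown: Set[str],
) -> None:
    """Steps 6-7: keep registered names and sort them into the two sets."""
    for name in candidates:
        if not whois_exists(name):
            continue                # Step 6: unregistered, discard
        if is_illegal(name):
            illegal.add(name)       # Step 7: confirmed illegal
        else:
            unknown.add(name)       # Step 7: registered, not yet flagged


def recheck_unknown(
    is_illegal: Callable[[str], bool],
    illegal: Set[str],
    unknown: Set[str],
) -> None:
    """Periodic re-check: promote names that are now detected as illegal."""
    for name in list(unknown):      # copy, since the set is modified
        if is_illegal(name):
            unknown.discard(name)
            illegal.add(name)


# Usage with stub callables in place of real services.
registered = {"00080e.com", "00080f.com"}
flagged = {"00080e.com"}
illegal_set, unknown_set = set(), set()
screen_candidates(
    ["00080e.com", "00080f.com", "00080g.com"],
    whois_exists=lambda n: n in registered,
    is_illegal=lambda n: n in flagged,
    illegal=illegal_set,
    unknown=unknown_set,
)
```

Keeping the unknown set separate, rather than discarding names that pass the first check, lets a later run of `recheck_unknown` catch names that the detection interface flags only after they become active.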

Step 8: Determine whether all candidates remaining after Step 5 have been checked. If so, continue to the next step; otherwise, go to Step 6.

Step 9: Determine whether all illegal domain names from Step 1 have been clustered. If so, the algorithm ends; otherwise, go to Step 1.

Step 10: Create a new class, set the current domain name as its central domain name, and go to Step 9.
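The clustering control flow of Steps 1 to 3, 9, and 10 can be sketched as a single pass over the blacklist. The function name `cluster_blacklist` is hypothetical, and the similarity test is injected as a parameter. Two details are assumptions: a name joins the first class whose central domain name it matches, and the first name placed in a class stays its central domain name. The predicate `same_length_small_diff` in the usage lines is a deliberately simplified stand-in, not the full rule set.

```python
from typing import Callable, Dict, List


def cluster_blacklist(
    blacklist: List[str],
    is_similar: Callable[[str, str], bool],
) -> List[Dict]:
    """Group blacklist names into classes around central domain names."""
    clusters: List[Dict] = []
    for name in blacklist:                        # Step 1: next name
        for cluster in clusters:                  # Steps 2-3: try each class
            if is_similar(name, cluster["center"]):
                cluster["members"].append(name)   # Step 4: merge into class
                break
        else:
            # Step 10: no class matched (or none exist), start a new one.
            clusters.append({"center": name, "members": [name]})
    return clusters                               # Step 9: all names handled


def same_length_small_diff(a: str, b: str) -> bool:
    """Simplified stand-in: equal length and at most 2 differing characters."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 2


result = cluster_blacklist(
    ["00080d.com", "00080e.com", "099sun.com", "00080f.com", "099sky.com"],
    same_length_small_diff,
)
```

In this run the five names fall into two classes, centred on 00080d.com and 099sun.com. Comparing each new name only against central domain names, rather than against every member, keeps the pass linear in the number of classes per name.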

The method analyzes a large existing collection of illegal domain names in order to mine a large number of illegal domain names not yet included in it. First, the illegal domain names in a prepared blacklist are clustered so that structurally similar names fall into the same class, forming multiple aggregation classes. Next, one or more generation patterns are extracted from each class, yielding a set of generation patterns. Then the patterns are enumerated to generate similar domain names suspected of being illegal. Finally, a third-party authoritative detection interface checks the generated set, and the illegal similar domain names are selected. Starting from the structural similarity of illegal domain names, the method actively mines a large number of illegal domain names not present in the database. Because the names mined this way are strongly related to one another, the results also support association analysis, group analysis, and similar analyses of illegal domain names.

The above are merely specific embodiments of the invention and do not limit the scope of its implementation. Substitutions of equivalent components, and equivalent changes and modifications made within the scope of patent protection of the invention, all remain within the scope covered by the claims of the invention.

Claims (3)

1. An illegal domain name mining method based on domain name structure similarity, characterized in that it comprises the following steps:

Step 1: read illegal domain names from a domain name blacklist;

Step 2: determine whether any successfully aggregated class exists; if not, go to Step 10; otherwise, continue to the next step;

Step 3: determine whether the current domain name can be assigned to the i-th aggregation class; if not, go to Step 10; otherwise, continue to the next step; the i-th aggregation class is the i-th class formed by grouping similar domain names according to similarity rules;

the determination is based on whether the current domain name is similar to the central domain name, the central domain name being a representative domain name in the aggregation class;

whether the current domain name is similar to the central domain name is determined as follows:

(1) if the two domain names differ only in their top-level domain and are otherwise identical, they are similar;

(2) if the two domain names have the same top-level domain: when their second-level domains have the same length, they are similar if no more than 2 characters differ at corresponding positions, or if they differ only in a run of consecutive repeated characters at the same positions; when their second-level domain lengths differ by 1, they are similar if removing one character from the longer name yields the shorter name;

(3) if neither (1) nor (2) determines that they are similar, the two domain names are not similar;

Step 4: merge the current domain name into the i-th aggregation class, extract the generation pattern produced while matching the current domain name against the class's central domain name, and continue to the next step;

the generation pattern is a wildcard string extracted from a domain name in the aggregation class and the central domain name;

Step 5: enumerate over the generation pattern to produce candidate domain names that may exist and are similar to the central domain name, filter out candidates already stored as illegal domain names, and continue to the next step;

Step 6: obtain WHOIS information for each candidate remaining after Step 5 to determine whether it exists; if it does not exist, discard it; otherwise, keep it and continue to the next step;

Step 7: check whether each retained domain name is illegal; if so, add it to the illegal domain name set; otherwise, add it to the unknown domain name set; continue to the next step;

Step 8: determine whether all candidates remaining after Step 5 have been checked; if so, continue to the next step; otherwise, go to Step 6;

Step 9: determine whether all illegal domain names from Step 1 have been clustered; if so, the algorithm ends; otherwise, go to Step 1;

Step 10: create a new class, set the current domain name as its central domain name, and go to Step 9.

2. The illegal domain name mining method based on domain name structure similarity according to claim 1, characterized in that, in Step 4, the generation pattern uses wildcards to replace the parts that differ between two illegal domain names, and uses indicators to specify how those wildcards are to be enumerated.

3. The illegal domain name mining method based on domain name structure similarity according to claim 1, characterized in that, in Step 7, the check is performed through an authoritative third-party detection interface.
CN201810419153.8A 2018-05-04 2018-05-04 Illegal Domain Name Mining Method Based on Domain Name Structure Similarity Active CN108712403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810419153.8A CN108712403B (en) 2018-05-04 2018-05-04 Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810419153.8A CN108712403B (en) 2018-05-04 2018-05-04 Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Publications (2)

Publication Number Publication Date
CN108712403A (en) 2018-10-26
CN108712403B (en) 2020-08-04

Family

ID=63867784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810419153.8A Active CN108712403B (en) 2018-05-04 2018-05-04 Illegal Domain Name Mining Method Based on Domain Name Structure Similarity

Country Status (1)

Country Link
CN (1) CN108712403B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109495475B (en) * 2018-11-19 2022-03-18 中国联合网络通信集团有限公司 Domain name detection method and device
CN109889491A (en) * 2019-01-02 2019-06-14 兰州理工大学 A fast detection method of malicious domain names based on lexical features
CN110336777B (en) * 2019-04-30 2020-10-16 北京邮电大学 Communication interface acquisition method and device for Android application
CN113157997B (en) * 2020-01-23 2024-09-27 华为技术有限公司 Domain name feature extraction method and feature extraction device
CN113315739A (en) * 2020-02-26 2021-08-27 深信服科技股份有限公司 Malicious domain name detection method and system
CN112073549B (en) * 2020-08-25 2023-06-02 山东伏羲智库互联网研究院 Domain name based system relation determining method and device
CN114710468B (en) * 2022-03-31 2024-05-14 绿盟科技集团股份有限公司 Domain name generation and identification method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098235A (en) * 2011-01-18 2011-06-15 南京邮电大学 Fishing mail inspection method based on text characteristic analysis
CN102523311A (en) * 2011-11-25 2012-06-27 中国科学院计算机网络信息中心 Illegal domain name recognition method and device
CN103812966A (en) * 2014-03-03 2014-05-21 刁永平 Implementation method of autonomous extensible IP internet (AEIP) by loose source and record route (LSRR)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110289138A1 (en) * 2010-05-20 2011-11-24 Bhavin Turakhia Method, machine and computer program product for sharing an application session across a plurality of domain names
US8850474B2 (en) * 2010-07-26 2014-09-30 Cisco Technology, Inc. Virtual content store in interactive services architecture
CN102299978A (en) * 2011-09-23 2011-12-28 上海西默通信技术有限公司 Black list adding, filtering and redirecting method applied to DNS (Domain Name System)
CN105912670A (en) * 2012-09-18 2016-08-31 北京奇虎科技有限公司 Method and device for network hotspot excavation
US8914883B2 (en) * 2013-05-03 2014-12-16 Fortinet, Inc. Securing email communications
CN106330811A (en) * 2015-06-15 2017-01-11 中兴通讯股份有限公司 Method and device for determining domain name credibility

Also Published As

Publication number Publication date
CN108712403A (en) 2018-10-26

Similar Documents

Publication Publication Date Title
CN108712403B (en) Illegal Domain Name Mining Method Based on Domain Name Structure Similarity
CN103544436B (en) System and method for distinguishing phishing websites
CN112910929B (en) Method and device for malicious domain name detection based on heterogeneous graph representation learning
JP6697123B2 (en) Profile generation device, attack detection device, profile generation method, and profile generation program
CN105827594B (en) A kind of dubiety detection method based on domain name readability and domain name mapping behavior
US20170295187A1 (en) Detection of malicious domains using recurring patterns in domain names
CN105224600B (en) A kind of detection method and device of Sample Similarity
CN107358075A (en) A kind of fictitious users detection method based on hierarchical clustering
US20170053031A1 (en) Information forecast and acquisition method based on webpage link parameter analysis
CN104168288A (en) Automatic vulnerability discovery system and method based on protocol reverse parsing
CN103607391B (en) SQL injection attack detection method based on K-means
CN108111526A (en) A kind of illegal website method for digging based on abnormal WHOIS information
CN103973697B (en) A kind of thing network sensing layer intrusion detection method
CN102622553A (en) Method and device for detecting webpage safety
CN110300027A (en) A kind of abnormal login detecting method
CN107743128A (en) An illegal website mining method based on the domain name associated with the homepage and the same service IP
CN104794051A (en) Automatic Android platform malicious software detecting method
CN111935097A (en) Method for detecting DGA domain name
CN105718795A (en) Malicious code evidence obtaining method and system on the basis of feature code under Linux
CN113706100B (en) Method and system for real-time detection and identification of IoT terminal equipment in distribution network
CN111614543A (en) A URL-based spear phishing email detection method and system
CN103455754B (en) A kind of malicious searches keyword recognition methods based on regular expression
CN107612911A (en) Method based on the infected main frame of DNS flow detections and C&C servers
CN106326746B (en) A kind of rogue program behavioural characteristic base construction method and device
CN108171057A (en) The matched Android platform malware detection method of feature based

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240924

Address after: 298-1 Huanhai Road, Sunjiatuan Town, Huancui District, Weihai City, Shandong Province 264200, China 201-2

Patentee after: Shandong Tianhe Cyberspace Security Technology Research Institute Co.,Ltd.

Country or region after: China

Address before: 264209 No. 2, Wenhua West Road, Shandong, Weihai

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY (WEIHAI)

Country or region before: China