CN109241483B - Website discovery method and system based on domain name recommendation - Google Patents

Website discovery method and system based on domain name recommendation

Info

Publication number
CN109241483B
Authority
CN
China
Prior art keywords
domain name
character string
character
candidate
root
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811008674.0A
Other languages
Chinese (zh)
Other versions
CN109241483A (en)
Inventor
张凯
刘春阳
吴昱明
王鹏
钟习
张旭
刘悦
李雄
俞晓明
张翔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS and National Computer Network and Information Security Management Center
Priority to CN201811008674.0A
Publication of CN109241483A
Application granted
Publication of CN109241483B
Legal status: Active
Anticipated expiration

Abstract

The invention relates to a website discovery method based on domain name recommendation, which comprises the following steps: randomly selecting an arbitrary arrangement and combination of characters from a domain name character set to obtain a root character string; forming a candidate character string from the root character string; splicing the candidate character string with a candidate domain name suffix to form a recommended domain name; performing DNS resolution on the recommended domain name to determine whether it is a legal domain name; and verifying whether the legal domain name has a corresponding website, and if so, acquiring the legal domain name as a target website.

Description

Website discovery method and system based on domain name recommendation
Technical Field
The invention belongs to the field of internet resource discovery, and particularly relates to a website discovery technology based on domain name recommendation.
Background
With the continuous development and evolution of the internet, websites of many types have appeared, and both supervision and search require more website domain names to be discovered. Here a website domain name mainly means a first-level domain name, that is, a domain name with only one non-top-level domain name before the top-level domain name, such as sina.
Traditional approaches usually find more domain names with a spreading method: data is collected from a website, more URLs (Uniform Resource Locators) are extracted from the data, and the extracted URLs are deduplicated against the known domain names to obtain new domain names. In addition, there are domain name discovery methods based on network interception. The Chinese patent "a method and a system for searching for a non-recorded website based on a multi-path data access mode", application number 201410299875.6, obtains domain names through multiple data access paths, screens out the non-recorded domain names and forms a domain name seed bank; performs DNS resolution on the non-recorded domain names to obtain the corresponding IP addresses; locates the IP addresses to obtain a non-recorded domain name library; and obtains the information of the non-recorded websites through activity verification.
All of the above methods are relatively passive: a domain name is acquired only if it appears in a collected web page or in an intercepted data stream. Relatively isolated websites, such as some personal blogs, are therefore hard to discover because very few sites link to them.
Disclosure of Invention
To address these problems, the invention provides a website discovery method based on domain name recommendation, which comprises the following steps: randomly selecting an arbitrary arrangement and combination of characters from a domain name character set to obtain a root character string; forming a candidate character string from the root character string; splicing the candidate character string with a candidate domain name suffix to form a recommended domain name; performing DNS resolution on the recommended domain name to determine whether it is a legal domain name; and verifying whether the legal domain name has a corresponding website, and if so, acquiring the legal domain name as a target website.
In the website discovery method of the invention, the root character string is used directly as the candidate character string, or the root character string is spliced with a prefix character string and/or a suffix character string to form the candidate character string.
In the website discovery method of the invention, the frequency $S_1$ with which a character string $A_1$ of length $M$ characters appears at the prefix position of a known domain name is obtained over all known domain names, and if $S_1 > m \cdot S$, the character string $A_1$ is taken as a prefix character string; likewise, the frequency $S_2$ with which a character string $A_2$ of length $M$ characters appears at the suffix position of a known domain name is obtained over all known domain names, and if $S_2 > m \cdot S$, the character string $A_2$ is taken as a suffix character string; where $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
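As a worked instance of this threshold, assume $K = 37$ (the value the detailed description gives for the domain name character set) and a hypothetical multiplier $m = 2$; the patent only requires $m > 1$ and does not state a specific value:

```latex
\begin{align*}
M = 2:\quad S &= \frac{1}{37^{2}} = \frac{1}{1369} \approx 7.30 \times 10^{-4}, & m \cdot S &\approx 1.46 \times 10^{-3},\\
M = 3:\quad S &= \frac{1}{37^{3}} = \frac{1}{50653} \approx 1.97 \times 10^{-5}, & m \cdot S &\approx 3.95 \times 10^{-5},\\
M = 4:\quad S &= \frac{1}{37^{4}} = \frac{1}{1874161} \approx 5.34 \times 10^{-7}, & m \cdot S &\approx 1.07 \times 10^{-6}.
\end{align*}
```

Under these assumed values, a two-character string would qualify as a prefix only if it begins more than about 0.146% of the known domain names.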
In the website discovery method of the invention, any number of characters are selected from the domain name character set and arranged and combined to generate a plurality of character strings, and the character strings with a character length of N are taken as root character strings; where N is a positive integer and 1 ≤ N ≤ 10.
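The root string generation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `DOMAIN_CHARSET`, `root_strings` and `root_set_size` are hypothetical, the 37-character set is assumed to be the lowercase letters, the digits and the hyphen, and characters are assumed to be reusable within a string (the patent's own examples include strings such as cc and ttt).

```python
from itertools import product
from string import ascii_lowercase, digits

# Assumed 37-character domain name character set: a-z, 0-9 and the hyphen.
DOMAIN_CHARSET = ascii_lowercase + digits + "-"


def root_strings(n, charset=DOMAIN_CHARSET):
    """Yield every string of exactly n characters drawn from charset."""
    if not 1 <= n <= 10:
        raise ValueError("N must satisfy 1 <= N <= 10")
    # product(..., repeat=n) enumerates arrangements with repetition.
    for chars in product(charset, repeat=n):
        yield "".join(chars)


def root_set_size(n, charset=DOMAIN_CHARSET):
    """Number of root strings of length n, i.e. K ** n."""
    return len(charset) ** n
```

Because the root set holds K ** N strings, a generator is used so that the set is never materialized in memory. A practical run would probably also drop strings that begin or end with a hyphen, since DNS labels may not do so; the patent does not discuss this filter.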
The invention also relates to a website discovery system based on domain name recommendation, which comprises the following modules:
a root character string module, used for randomly selecting an arbitrary arrangement and combination of characters from the domain name character set to obtain a root character string;
a candidate character string module, used for forming a candidate character string from the root character string;
a domain name generation module, used for splicing the candidate character string with a candidate domain name suffix to form a recommended domain name;
a domain name verification module, used for performing DNS resolution on the recommended domain name so as to determine whether the recommended domain name is a legal domain name;
and a website acquisition module, used for verifying whether the legal domain name has a corresponding website and, if so, acquiring the legal domain name as a target website.
The website discovery system of the present invention, wherein the candidate string module further comprises: the first candidate character string module is used for taking the root character string as the candidate character string; and the second candidate character string module is used for splicing the root character string and the prefix character string and/or the suffix character string into the candidate character string.
The second candidate character string module comprises: a prefix character string module for acquiring the prefix character string, which obtains the frequency $S_1$ with which a character string $A_1$ of length $M$ characters appears at the prefix position of a known domain name over all known domain names and, if $S_1 > m \cdot S$, takes the character string $A_1$ as the prefix character string; and a suffix character string module for acquiring the suffix character string, which obtains the frequency $S_2$ with which a character string $A_2$ of length $M$ characters appears at the suffix position of a known domain name over all known domain names and, if $S_2 > m \cdot S$, takes the character string $A_2$ as the suffix character string; where $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
The website discovery system of the present invention, wherein the root string module specifically comprises: selecting any number of characters from the domain name character set to be arranged and combined to generate a plurality of character strings, and taking the character string with the character length of N as the root character string; wherein N is a positive integer, and N is more than or equal to 1 and less than or equal to 10.
Drawings
Fig. 1 is a flowchart of a website discovery method based on domain name recommendation according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a website discovery system based on domain name recommendation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the following describes in detail a website discovery method and system based on domain name recommendation, which are provided by the present invention, with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Existing domain name discovery methods are mostly passive: undiscovered domain names are extracted through data stream analysis only after enough data streams have been acquired. If a website is isolated and few other websites link to it, data streams containing its URLs are difficult to acquire, and its domain name is therefore difficult to obtain.
The invention discloses a website discovery method based on domain name recommendation that generates domain names autonomously: one or more characters are selected from a domain name character set and arranged and combined to obtain a root character string; a candidate character string is formed from the root character string; the candidate character string is spliced in turn with each candidate domain name suffix to form recommended domain names; and DNS resolution is performed on each recommended domain name to verify its legality, yielding the legal domain names. Fig. 1 is a flowchart of a website discovery method based on domain name recommendation according to an embodiment of the present invention. As shown in Fig. 1, the website discovery method based on domain name recommendation according to the present invention specifically includes:
step 1, selecting characters from the domain name character set and arranging and combining them, taking the generated character strings with a length of N characters as root character strings, and constructing a root set; because the size of the root set grows rapidly as N increases, the range of N is limited so that the character strings in the root set do not exceed 10 characters in length;
step 2, counting, over all known domain names, the number of times each double-character string $A_2$ appears at the prefix position and at the suffix position, and from these counts calculating the frequency $S_1$ with which $A_2$ appears at the prefix position and the frequency $S_2$ with which it appears at the suffix position; a multiplier $m$ is set empirically, with $m > 1$; when domain names may contain $K$ distinct characters in total, the frequency with which any given double-character string would appear at the prefix position is $S = 1/K^2$; if $S_1 > m \cdot S = m/K^2$, the double-character string $A_2$ is taken as a prefix character string, and if $S_2 > m \cdot S = m/K^2$, the double-character string $A_2$ is taken as a suffix character string; because the characters in a domain name must satisfy the requirements of the domain name character set, $K = 37$ according to the range of that character set; three-character strings $A_3$ and four-character strings $A_4$ are processed in the same way to obtain three-character and four-character prefix character strings and suffix character strings; in some embodiments, five-character and six-character strings, for example, may also be processed in this way to obtain five-character and six-character prefix character strings and suffix character strings, and the invention is not limited in this respect; a prefix set is constructed from all prefix character strings and a suffix set is constructed from all suffix character strings; the selected characters are any characters in the domain name character set, such as the double characters ab, g2, cc, the three characters abc, hhu, ttt, yy6, and the four characters abcd, jjjyh, 7fff, and the invention is not limited thereto;
step 3, forming candidate character strings from the root character strings in the root set, in any of the following ways: selecting any root character string from the root set and using it as the candidate character string, so that if the selected root character string is cde, the candidate character string is cde; or selecting any prefix character string from the prefix set and any root character string from the root set and splicing them into a candidate character string, so that if the prefix character string is ab and the root character string is cde, the candidate character string is abcde; or selecting any root character string from the root set and any suffix character string from the suffix set and splicing them into a candidate character string, so that if the root character string is cde and the suffix character string is fgh, the candidate character string is cdefgh; or selecting any prefix character string, any root character string and any suffix character string from the prefix set, the root set and the suffix set respectively and splicing them in that order into a candidate character string, so that if the prefix character string is ab, the root character string is cde and the suffix character string is fgh, the candidate character string is abcdefgh; the invention is not limited thereto;
step 4, splicing each candidate character string with a candidate domain name suffix to form a recommended domain name; for example, if the candidate character string is abcde and the candidate domain name suffix is com, they are spliced into the recommended domain name abcde.com; candidate domain name suffixes include national domain name suffixes and international top-level domain name suffixes, such as .cn, .com.cn, .edu and .edu.cn, to which the present invention is not limited;
step 5, resolving the generated recommended domain names by asynchronous DNS resolution; DNS resolution of each recommended domain name is initiated through a plurality of DNS servers; if at least two DNS servers return the same IP address, the recommended domain name is verified as a legal domain name; otherwise it is treated as an illegal domain name and discarded;
step 6, verifying each legal domain name to judge whether a corresponding website exists; if so, the corresponding website is acquired as a target website, and if not, the legal domain name is discarded; the invention uses the curl tool to verify legal domain names, but other software tools with a website verification function may also be used, and the invention is not limited in this respect.
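The agreement rule of step 5 (at least two DNS servers returning the same IP address) can be sketched as a pure function. This is a minimal illustration under stated assumptions: the function name `is_legal_domain`, the `answers` mapping and the server names used below are hypothetical, and the asynchronous DNS queries themselves are left out so that the rule can be shown without network access.

```python
from collections import Counter


def is_legal_domain(answers, min_agree=2):
    """Return True if some IP address was returned by at least min_agree servers.

    answers maps a DNS server identifier to the IP addresses that server
    returned for one recommended domain name (empty if resolution failed).
    """
    votes = Counter()
    for ips in answers.values():
        for ip in set(ips):  # a server votes at most once per address
            votes[ip] += 1
    return any(count >= min_agree for count in votes.values())


# Hypothetical responses for one recommended domain name.
sample = {
    "resolver-a": ["192.0.2.10"],
    "resolver-b": ["192.0.2.10", "192.0.2.11"],
    "resolver-c": [],
}
print(is_legal_domain(sample))
```

Requiring agreement between servers, rather than accepting the first answer, is one plausible way to filter out resolvers that return a placeholder address for names that do not exist; the patent does not state its motivation for the rule.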
Fig. 2 is a schematic structural diagram of a website discovery system based on domain name recommendation according to an embodiment of the present invention. As shown in Fig. 2, the website discovery system based on domain name recommendation of the present invention includes a root character string module, a candidate character string module, a domain name generation module, a domain name verification module and a website acquisition module. The root character string module is used for randomly selecting an arbitrary arrangement and combination of characters from a domain name character set to obtain a root character string; the candidate character string module is used for forming candidate character strings from the root character strings; the domain name generation module is used for splicing the candidate character strings with the candidate domain name suffixes to form recommended domain names; the domain name verification module is used for performing DNS resolution on each recommended domain name so as to determine whether it is a legal domain name; and the website acquisition module is used for verifying whether a legal domain name has a corresponding website and, if so, acquiring it as a target website.
The root character string module of the website discovery system specifically selects any number of characters from the domain name character set, arranges and combines them to generate a plurality of character strings, and takes the character strings with a character length of N as root character strings, where N is a positive integer and 1 ≤ N ≤ 10. The selected characters are any characters in the domain name character set, such as the double characters ab, g2, cc, the three characters abc, hhu, ttt, yy6, and the four characters abcd, jjjyh, 7fff, and the invention is not limited thereto. The generated character strings with a length of N characters form a root set; because the size of the root set grows rapidly as N increases, the invention limits the range of N so that a root character string does not exceed 10 characters in length.
The website discovery system of the invention can use the root character string directly as the candidate character string and splice it with a candidate domain name suffix to form a recommended domain name, and can also optimize further on the basis of the root character string. The candidate character string module therefore comprises a first candidate character string module and a second candidate character string module: the first candidate character string module is used for taking a root character string as a candidate character string, and the second candidate character string module is used for splicing the root character string with a prefix character string and/or a suffix character string into a candidate character string. Accordingly, the second candidate character string module further includes a prefix character string module and a suffix character string module.
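The four ways of composing a candidate character string (root alone, prefix plus root, root plus suffix, prefix plus root plus suffix) and the subsequent splicing with a domain name suffix can be sketched as follows. The function names `candidate_strings` and `recommended_domains` are hypothetical; the sample strings are the ones used in the patent's own examples.

```python
from itertools import product


def candidate_strings(roots, prefixes=(), suffixes=()):
    """Yield root, prefix+root, root+suffix and prefix+root+suffix candidates."""
    # The empty string stands for "no prefix" or "no suffix".
    for prefix, root, suffix in product(("", *prefixes), roots, ("", *suffixes)):
        yield prefix + root + suffix


def recommended_domains(candidates, domain_suffixes):
    """Splice every candidate string with every candidate domain name suffix."""
    for candidate in candidates:
        for domain_suffix in domain_suffixes:
            yield f"{candidate}.{domain_suffix}"


candidates = list(candidate_strings(["cde"], prefixes=["ab"], suffixes=["fgh"]))
print(candidates)  # ['cde', 'cdefgh', 'abcde', 'abcdefgh']
print(list(recommended_domains(["abcde"], ["com", "com.cn"])))
# ['abcde.com', 'abcde.com.cn']
```

Treating "no prefix" and "no suffix" as empty strings lets a single Cartesian product cover all four cases without special-casing each one.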
The prefix character string module is used for acquiring the prefix character string: it obtains the frequency $S_1$ with which a character string $A_1$ of length $M$ characters appears at the prefix position over all known domain names and, if $S_1 > m \cdot S$, takes the character string $A_1$ as the prefix character string. The suffix character string module is used for acquiring the suffix character string: it obtains the frequency $S_2$ with which a character string $A_2$ of length $M$ characters appears at the suffix position over all known domain names and, if $S_2 > m \cdot S$, takes the character string $A_2$ as the suffix character string; where $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
Specifically, all known domain names are counted to obtain the number of times each double-character string $A_2$ appears at the prefix position and at the suffix position, from which the frequency $S_1$ with which $A_2$ appears at the prefix position and the frequency $S_2$ with which it appears at the suffix position are calculated; a multiplier $m$ is set empirically, with $m > 1$; when domain names may contain $K$ distinct characters in total, the frequency with which any given double-character string would appear at the prefix position is $S = 1/K^2$; if $S_1 > m \cdot S = m/K^2$, the double-character string $A_2$ is taken as a prefix character string, and if $S_2 > m \cdot S = m/K^2$, the double-character string $A_2$ is taken as a suffix character string; because the characters in a domain name must satisfy the requirements of the domain name character set, $K = 37$ according to the range of that character set; three-character strings $A_3$ and four-character strings $A_4$ are processed in the same way to obtain three-character and four-character prefix character strings and suffix character strings; in some embodiments, five-character and six-character strings, for example, may also be processed in this way to obtain five-character and six-character prefix character strings and suffix character strings, and the invention is not limited in this respect.
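The frequency test described above can be sketched with a counter over the known domain names. This is an illustrative sketch, not the patent's implementation: the function name `frequent_affixes` is hypothetical, the inputs are assumed to be domain name strings with the domain suffix already removed, and the frequency is assumed to be the share of names (of sufficient length) that begin or end with the string.

```python
from collections import Counter


def frequent_affixes(known_names, length, m=2.0, k=37):
    """Return (prefixes, suffixes) whose frequency exceeds m * S, with S = 1 / k**length.

    m is the preset multiplier (m > 1) and k the size of the character set.
    """
    threshold = m / k ** length
    # Assumption: names shorter than `length` are ignored.
    names = [name for name in known_names if len(name) >= length]
    if not names:
        return set(), set()
    total = len(names)
    prefix_counts = Counter(name[:length] for name in names)
    suffix_counts = Counter(name[-length:] for name in names)
    prefixes = {a for a, c in prefix_counts.items() if c / total > threshold}
    suffixes = {a for a, c in suffix_counts.items() if c / total > threshold}
    return prefixes, suffixes
```

With a small sample every observed string clears the threshold, because a single occurrence among a few hundred names already exceeds 2/37², so the test is only informative when the set of known domain names is large.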
The website acquisition module is used for verifying each legal domain name to judge whether a corresponding website exists; if so, the corresponding website is acquired as a target website, and if not, the legal domain name is discarded. The invention uses the curl tool to verify legal domain names, but other software tools with a website verification function may also be used, and the invention is not limited in this respect.
The invention generates domain names autonomously instead of relying on spreading-based discovery or other passive acquisition methods, which avoids the limitations of passive discovery and allows more website domain names to be detected; generating domain names by discovering and combining roots improves the success rate and pertinence of active detection.
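One way the curl-based check could look is sketched below. The patent names curl but gives no command line, so everything here is an assumption: the function names `curl_status` and `has_website`, the chosen curl options, the use of plain HTTP, and the rule that a 2xx or 3xx status counts as a live website. The sketch also assumes a POSIX system where `/dev/null` exists.

```python
import subprocess


def curl_status(domain, timeout=10):
    """Ask curl for the HTTP status code of http://<domain>/; return 0 on failure."""
    cmd = [
        "curl", "-s",              # silent: no progress output
        "-o", "/dev/null",         # discard the response body
        "-L",                      # follow redirects
        "--max-time", str(timeout),
        "-w", "%{http_code}",      # print only the final status code
        f"http://{domain}/",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout + 5)
    except (OSError, subprocess.TimeoutExpired):
        return 0
    output = result.stdout.strip()
    return int(output) if output.isdigit() else 0


def has_website(domain, status_of=curl_status):
    """Assumed criterion: any 2xx or 3xx response counts as a live website."""
    return 200 <= status_of(domain) < 400
```

Passing the status lookup in as `status_of` keeps the decision rule separate from the network call, so the rule can be exercised with a stub instead of a real request.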
The present invention may be embodied in other specific forms without departing from the spirit or scope of the invention, and it should be understood that various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A website discovery method based on domain name recommendation is characterized by comprising the following steps:
randomly selecting any character arrangement combination in a domain name character set to obtain a root character string;
taking the root character string as a candidate character string, or splicing the root character string with a prefix character string and/or a suffix character string to form the candidate character string;
splicing the candidate character string and the candidate domain name suffix to form a recommended domain name;
performing DNS resolution on the recommended domain name to determine whether the recommended domain name is a legal domain name;
and verifying whether the legal domain name has a corresponding website, and if so, acquiring the legal domain name as a target website.
2. The website discovery method of claim 1, wherein the frequency $S_1$ with which a character string $A_1$ of length $M$ characters appears at the prefix position of the known domain name is obtained over all known domain names, and if $S_1 > m \cdot S$, the character string $A_1$ is used as the prefix character string; wherein $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
3. The website discovery method of claim 1, wherein the frequency $S_2$ with which a character string $A_2$ of length $M$ characters appears at the suffix position of the known domain name is obtained over all known domain names, and if $S_2 > m \cdot S$, the character string $A_2$ is used as the suffix character string; wherein $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
4. The website discovery method according to claim 1, wherein any number of characters are selected from the domain name character set for permutation and combination to generate a plurality of character strings, and the character string with a character length of N is used as the root character string; wherein N is a positive integer, and N is more than or equal to 1 and less than or equal to 10.
5. A website discovery system based on domain name recommendation, comprising:
the root character string module is used for randomly selecting any character arrangement combination in the domain name character set to obtain a root character string;
a candidate character string module for forming a candidate character string by the root character string; the method comprises the following steps: the first candidate character string module is used for taking the root character string as the candidate character string; the second candidate character string module is used for splicing the root character string and the prefix character string and/or the suffix character string into the candidate character string;
the domain name generation module is used for splicing the candidate character string and a candidate domain name suffix to form a recommended domain name;
the domain name verification module is used for performing DNS resolution on the recommended domain name so as to determine whether the recommended domain name is a legal domain name;
and the website acquisition module is used for verifying whether the legal domain name has a corresponding website or not, and acquiring the legal domain name as a target website if the legal domain name has the corresponding website.
6. The website discovery system of claim 5, wherein the second candidate string module comprises: a prefix character string module for acquiring the prefix character string, which obtains the frequency $S_1$ with which a character string $A_1$ of length $M$ characters appears at the prefix position of the known domain name over all known domain names and, if $S_1 > m \cdot S$, uses the character string $A_1$ as the prefix character string; wherein $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
7. The website discovery system of claim 5, wherein the second candidate string module further comprises: a suffix character string module for acquiring the suffix character string, which obtains the frequency $S_2$ with which a character string $A_2$ of length $M$ characters appears at the suffix position of the known domain name over all known domain names and, if $S_2 > m \cdot S$, uses the character string $A_2$ as the suffix character string; wherein $S = 1/K^M$, $K$ is the number of characters in the domain name character set, $M$ is a positive integer with $2 \le M \le 4$, and $m$ is a preset value with $m > 1$.
8. The website discovery system of claim 5, wherein the root string module specifically comprises: selecting any number of characters from the domain name character set to be arranged and combined to generate a plurality of character strings, and taking the character string with the character length of N as the root character string; wherein N is a positive integer, and N is more than or equal to 1 and less than or equal to 10.
CN201811008674.0A 2018-08-31 2018-08-31 Website discovery method and system based on domain name recommendation Active CN109241483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811008674.0A CN109241483B (en) 2018-08-31 2018-08-31 Website discovery method and system based on domain name recommendation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811008674.0A CN109241483B (en) 2018-08-31 2018-08-31 Website discovery method and system based on domain name recommendation

Publications (2)

Publication Number Publication Date
CN109241483A 2019-01-18
CN109241483B 2021-10-12

Family

ID=65068896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811008674.0A Active CN109241483B (en) 2018-08-31 2018-08-31 Website discovery method and system based on domain name recommendation

Country Status (1)

Country Link
CN (1) CN109241483B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113285979B (en) * 2021-04-15 2022-11-29 北京奇艺世纪科技有限公司 Network request processing method, device, terminal and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104113539A (en) * 2014-07-11 2014-10-22 哈尔滨工业大学(威海) Phishing website engine detection method and device
CN107770132A (en) * 2016-08-18 2018-03-06 中兴通讯股份有限公司 A kind of method and device detected to algorithm generation domain name

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8504583B1 (en) * 2012-02-14 2013-08-06 Microsoft Corporation Multi-domain recommendations
CN104065532B (en) * 2014-06-26 2018-08-14 国家计算机网络与信息安全管理中心 A kind of non-recorded website search method and system based on multichannel data access way
CN106302438A (en) * 2016-08-11 2017-01-04 国家计算机网络与信息安全管理中心 A kind of method of actively monitoring fishing website of Behavior-based control feature by all kinds of means
CN106302440B (en) * 2016-08-11 2019-12-10 国家计算机网络与信息安全管理中心 Method for acquiring suspicious phishing websites through multiple channels
CN106503125B (en) * 2016-10-19 2019-10-15 中国互联网络信息中心 A kind of data source extended method and device
CN108124025A (en) * 2017-12-14 2018-06-05 北京锐安科技有限公司 Website converts detection method, the device and system of domain name

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104113539A (en) * 2014-07-11 2014-10-22 哈尔滨工业大学(威海) Phishing website engine detection method and device
CN107770132A (en) * 2016-08-18 2018-03-06 中兴通讯股份有限公司 A kind of method and device detected to algorithm generation domain name

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
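Research on phishing fraud detection technology; 张茜; 《网络与信息安全学报》 (Chinese Journal of Network and Information Security); 2017-07-31; Vol. 3 (No. 7); pp. 7-10 *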

Also Published As

Publication number Publication date
CN109241483A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
Marchal et al. PhishStorm: Detecting phishing with streaming analytics
US9258289B2 (en) Authentication of IP source addresses
US8990936B2 (en) Method and device for detecting flood attacks
JP5989919B2 (en) URL matching apparatus, URL matching method, and URL matching program
US20170053031A1 (en) Information forecast and acquisition method based on webpage link parameter analysis
CN107046586B (en) A kind of algorithm generation domain name detection method based on natural language feature
JP5415390B2 (en) Filtering method, filtering system, and filtering program
CN105635064B (en) CSRF attack detection method and device
CN106789849B (en) CC attack identification method, node and system
Marchal et al. PhishScore: Hacking phishers' minds
JP5465651B2 (en) List generation method, list generation apparatus, and list generation program
CN114328962A (en) Method for identifying abnormal behavior of web log based on knowledge graph
US8392421B1 (en) System and method for internet endpoint profiling
CN110233821B (en) Detection and safety scanning system and method for network space of intelligent equipment
He et al. Malicious domain detection via domain relationship and graph models
CN109241483B (en) Website discovery method and system based on domain name recommendation
CN109547294B (en) Networking equipment model detection method and device based on firmware analysis
JP2006215735A (en) Duplicate website detection device
CN106227741A (en) A kind of extensive URL matching process based on multilevel hash index chained list
Marchal et al. Semantic exploration of DNS
CN108200191B (en) Utilize the client dynamic URL associated script character string detection system of perturbation method
CN106161352A (en) A kind of matching process and client, server and matching unit
CN108170812B (en) Data filtering method and equipment
Na et al. Service identification of internet-connected devices based on common platform enumeration
JP2012118577A (en) Illegal domain detection device, illegal domain detection method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant