CN109241483B

CN109241483B - Website discovery method and system based on domain name recommendation

Info

Publication number: CN109241483B
Application number: CN201811008674.0A
Authority: CN
Inventors: 张凯; 刘春阳; 吴昱明; 王鹏; 钟习; 张旭; 刘悦; 李雄; 俞晓明; 张翔宇
Original assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2021-10-12
Anticipated expiration: 2038-08-31
Also published as: CN109241483A

Abstract

The invention relates to a website discovery method based on domain name recommendation, comprising the following steps: selecting arbitrary characters from a domain name character set and arranging and combining them to obtain a root character string; forming a candidate character string from the root character string; splicing the candidate character string with a candidate domain name suffix to form a recommended domain name; performing DNS resolution on the recommended domain name to determine whether it is a legal domain name; and verifying whether the legal domain name has a corresponding website and, if so, acquiring the legal domain name as a target website.

Description

Website discovery method and system based on domain name recommendation

Technical Field

The invention belongs to the field of internet resource discovery, and particularly relates to a website discovery technology based on domain name recommendation.

Background

With the continuous development of the internet, websites of many kinds have appeared, and more website domain names need to be discovered for supervision or search purposes. Here a website domain name mainly means a first-level domain name, that is, a domain name with exactly one non-top-level domain name before the top-level domain name, such as sina.

The traditional approach usually uses a spreading method to find more domain names: data is collected from a website, more URLs (Uniform Resource Locators) are extracted from that data, and the extracted URLs are deduplicated against the known domain names to obtain more domain names. There are also domain name discovery methods based on network interception. The Chinese national patent "a method and a system for searching for an unregistered website based on a multi-path data access mode", application number 201410299875.6, obtains domain names through a multi-path data access mode and screens out the unregistered domain names to form a domain name seed bank; performs DNS resolution on the unregistered domain names to obtain the corresponding IP addresses; locates the IP addresses to obtain an unregistered domain name library; and obtains information on the unregistered websites through liveness verification.

The above methods are relatively passive: a domain name is acquired only if it appears in a collected web page or in an intercepted data stream. Relatively isolated websites, such as some personal blogs, have too few inbound links and are therefore very difficult to discover in this way.

Disclosure of Invention

In view of the above problems, the invention provides a website discovery method based on domain name recommendation, comprising the following steps: selecting arbitrary characters from a domain name character set and arranging and combining them to obtain a root character string; forming a candidate character string from the root character string; splicing the candidate character string with a candidate domain name suffix to form a recommended domain name; performing DNS resolution on the recommended domain name to determine whether it is a legal domain name; and verifying whether the legal domain name has a corresponding website and, if so, acquiring the legal domain name as a target website.

The website discovery method takes the root character string as the candidate character string, or splices the root character string with a prefix character string and/or a suffix character string to form the candidate character string.

In the website discovery method of the invention, the frequency S1 with which a character string A1 having a length of M characters appears at the prefix position of the known domain names, among all known domain names, is obtained, and if S1 > m·S, the character string A1 is taken as the prefix character string; the frequency S2 with which a character string A2 having a length of M characters appears at the suffix position of the known domain names, among all known domain names, is obtained, and if S2 > m·S, the character string A2 is taken as the suffix character string; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.

In the website discovery method of the invention, any number of characters are selected from the domain name character set and arranged and combined to generate a plurality of character strings, and a character string with a length of N characters is taken as the root character string; wherein N is a positive integer and 1 ≤ N ≤ 10.

The invention also relates to a website discovery system based on domain name recommendation, which comprises the following components:

the root character string module is used for randomly selecting any character arrangement combination in the domain name character set to obtain a root character string;

a candidate character string module for forming a candidate character string by the root character string;

the domain name generation module is used for splicing the candidate character string and a candidate domain name suffix to form a recommended domain name;

the domain name verification module is used for performing DNS resolution on the recommended domain name to determine whether it is legal, a recommended domain name found to be legal being taken as a legal domain name;

and the website acquisition module is used for verifying whether the legal domain name has a corresponding website or not, and acquiring the legal domain name as a target website if the legal domain name has the corresponding website.

The website discovery system of the present invention, wherein the candidate string module further comprises: the first candidate character string module is used for taking the root character string as the candidate character string; and the second candidate character string module is used for splicing the root character string and the prefix character string and/or the suffix character string into the candidate character string.

The second candidate string module comprises: a prefix character string module for acquiring the prefix character string, which obtains the frequency S1 with which a character string A1 having a length of M characters appears at the prefix position of the known domain names, among all known domain names, and takes the character string A1 as the prefix character string if S1 > m·S; and a suffix character string module for acquiring the suffix character string, which obtains the frequency S2 with which a character string A2 having a length of M characters appears at the suffix position of the known domain names, among all known domain names, and takes the character string A2 as the suffix character string if S2 > m·S; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.

The website discovery system of the present invention, wherein the root string module specifically comprises: selecting any number of characters from the domain name character set to be arranged and combined to generate a plurality of character strings, and taking the character string with the character length of N as the root character string; wherein N is a positive integer, and N is more than or equal to 1 and less than or equal to 10.

Drawings

Fig. 1 is a flowchart of a website discovery method based on domain name recommendation according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a website discovery system based on domain name recommendation according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the following describes in detail a website discovery method and system based on domain name recommendation, which are provided by the present invention, with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

The existing domain name discovery method mostly adopts a passive mode, and only after enough data streams are acquired, undiscovered domain names are acquired from the data streams through data stream analysis, but if a certain website belongs to an isolated website and the linked websites are few, the data streams containing URLs of the isolated website are difficult to acquire, and further the website domain names are difficult to acquire.

The invention discloses a website discovery method based on domain name recommendation that generates domain names autonomously: one or more characters are selected from a domain name character set and arranged and combined to obtain a root character string; a candidate character string is formed from the root character string; the candidate character string is spliced in turn with each candidate domain name suffix to form recommended domain names; and DNS resolution is performed on each recommended domain name to verify its legality, yielding the legal domain names. Fig. 1 is a flowchart of a website discovery method based on domain name recommendation according to an embodiment of the present invention. As shown in Fig. 1, the website discovery method based on domain name recommendation according to the present invention specifically includes:

step 1, selecting arbitrary characters from a domain name character set and arranging and combining them, taking the generated character strings with a length of N characters as root character strings, and constructing a root set; because the size of the root set grows rapidly as N increases, the range of N is limited, and the length of the character strings in the root set does not exceed 10 characters;

step 2, counting all known domain names to obtain the number of times a double character A₂ appears at the prefix position and at the suffix position, and calculating from these counts the frequency S1 with which A₂ appears at the prefix position and the frequency S2 with which it appears at the suffix position; m is set according to experience, with m > 1; when K characters in total are available in domain names, the frequency S with which any given double character would appear at the prefix position if all characters were equally likely is 1/K²; if S1 > m·S = m/K², the double character A₂ is taken as a prefix character string, and if S2 > m·S = m/K², the double character A₂ is taken as a suffix character string; because the characters in a domain name must belong to the domain name character set, K = 37 according to the range of that set; three characters A₃ and four characters A₄ are processed in the same way to obtain three-character and four-character prefix character strings and suffix character strings; in some embodiments, five-character and six-character strings may also be processed in the same way to obtain five-character and six-character prefix character strings and suffix character strings, and the invention is not limited thereto; a prefix set is constructed from all prefix character strings and a suffix set is constructed from all suffix character strings; the selected characters are any characters in the domain name character set, such as the double characters ab, g2, cc, the three characters abc, hhu, ttt, yy6, and the four characters abcd, jjjyh, 7fff, and the invention is not limited thereto;

step 3, forming candidate character strings on the basis of the root character strings in the root set, wherein the candidate character strings comprise any root character string selected from the root set and used as candidate character strings, and if the root character string selected from the root set is cde, the candidate character strings are cde; or any prefix character string and any root character string are respectively selected from the prefix set and the root set and are spliced into a candidate character string, and if the prefix character string is ab and the root character string is cde, the candidate character string is abcde; or any root character string and any suffix character string are respectively selected from the root set and the suffix set and are spliced into candidate character strings, and if the root character string is cde and the suffix character string is fgh, the candidate character strings are cdefgh; or sequentially selecting any prefix character string, any root character string and any suffix character string from the prefix set, the root set and the suffix set, and splicing the prefix character string, the root character string and the suffix character string into a candidate character string, wherein if the prefix character string is ab, the root character string is cde and the suffix character string is fgh, the candidate character string is abcdefgh; the invention is not limited thereto;

step 4, splicing the candidate character strings with the candidate domain name suffixes to form recommended domain names; if the candidate character string is abcde and the candidate domain name suffix is com, they are spliced into the recommended domain name abcde.com; candidate domain name suffixes include national domain name suffixes and international top-level domain name suffixes, such as .cn, .com.cn, .edu, .edu.cn, and the invention is not limited thereto;

step 5, resolving the generated recommended domain names using asynchronous DNS resolution; DNS resolution of a recommended domain name is initiated through a plurality of DNS servers; if at least two DNS servers return the same IP address, the recommended domain name is determined to be a legal domain name; otherwise it is determined to be an illegal domain name and is discarded;

step 6, verifying each legal domain name obtained to determine whether a corresponding website exists; if so, the corresponding website is acquired as a target website, and if not, the legal domain name is discarded; the invention uses the curl tool to verify legal domain names, and other software tools with a website verification function may also be used; the invention is not limited thereto.
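The agreement rule of step 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `is_legal_domain` and `filter_legal`, the use of `asyncio`, and the resolver callables are all hypothetical. Each resolver stands in for a query to one DNS server and returns the set of IP addresses that server reported.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, Iterable, List, Set

# Hypothetical resolver type: one callable per DNS server, returning the
# IP addresses that server reports for a domain (empty set if none).
Resolver = Callable[[str], Awaitable[Set[str]]]


async def is_legal_domain(domain: str, resolvers: Iterable[Resolver]) -> bool:
    """Return True if at least two DNS servers return the same IP address."""
    answers = await asyncio.gather(
        *(resolve(domain) for resolve in resolvers), return_exceptions=True
    )
    votes: Counter = Counter()
    for answer in answers:
        if isinstance(answer, BaseException):
            continue  # a failed query casts no vote
        votes.update(set(answer))  # each server votes at most once per IP
    return any(count >= 2 for count in votes.values())


async def filter_legal(domains: Iterable[str], resolvers: List[Resolver]) -> List[str]:
    """Check all domains concurrently and keep only the legal ones."""
    domains = list(domains)
    flags = await asyncio.gather(*(is_legal_domain(d, resolvers) for d in domains))
    return [d for d, ok in zip(domains, flags) if ok]
```

Passing the resolvers in as callables keeps the rule independent of any particular DNS library; in practice each callable would wrap a query to one specific DNS server.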

Fig. 2 is a schematic structural diagram of a website discovery system based on domain name recommendation according to an embodiment of the present invention. As shown in Fig. 2, the website discovery system based on domain name recommendation of the present invention includes a root character string module, a candidate character string module, a domain name generation module, a domain name verification module and a website acquisition module. The root character string module is used for selecting arbitrary characters from a domain name character set and arranging and combining them to obtain a root character string; the candidate character string module is used for forming candidate character strings from the root character strings; the domain name generation module is used for splicing the candidate character strings with the candidate domain name suffixes to form recommended domain names; the domain name verification module is used for performing DNS resolution on a recommended domain name to determine whether it is legal, a recommended domain name found to be legal being taken as a legal domain name; and the website acquisition module is used for verifying whether the legal domain name has a corresponding website and, if so, acquiring it as the target website.
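The five modules can be read as stages of a pipeline. The sketch below wires them together with plain Python callables. The function name `discover_websites` and the injected `make_candidates`, `is_legal` and `has_website` callables are hypothetical placeholders for the candidate string, DNS and website checks, which the text does not specify at code level.

```python
from typing import Callable, Iterable, Iterator


def discover_websites(
    roots: Iterable[str],
    make_candidates: Callable[[str], Iterable[str]],
    tld_suffixes: Iterable[str],
    is_legal: Callable[[str], bool],
    has_website: Callable[[str], bool],
) -> Iterator[str]:
    """Yield recommended domain names that pass both verification stages."""
    tld_suffixes = list(tld_suffixes)
    for root in roots:                            # root character string module
        for candidate in make_candidates(root):   # candidate character string module
            for tld in tld_suffixes:              # domain name generation module
                domain = f"{candidate}.{tld.lstrip('.')}"
                if not is_legal(domain):          # domain name verification module
                    continue
                if has_website(domain):           # website acquisition module
                    yield domain
```

Because the function is a generator, recommended domain names are produced and checked one at a time rather than materialised in memory.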

The root character string module of the website discovery system is specifically used for: selecting any number of characters from the domain name character set and arranging and combining them to generate a plurality of character strings, and taking a character string with a length of N characters as the root character string; wherein N is a positive integer and 1 ≤ N ≤ 10. The selected characters are any characters in the domain name character set, such as the double characters ab, g2, cc, the three characters abc, hhu, ttt, yy6, and the four characters abcd, jjjyh, 7fff; the invention is not limited thereto. The root set is constructed from the generated character strings with a length of N characters; because the size of the root set grows rapidly as N increases, the invention limits the range of N so that a root character string does not exceed 10 characters in length.
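A minimal sketch of root string generation follows, together with a count that shows why N has to be bounded. It assumes the 37-character domain name character set consists of the letters a to z, the digits 0 to 9 and the hyphen; the names `DOMAIN_CHARSET`, `root_strings` and `root_set_size` are illustrative. The sketch does not filter out strings that begin or end with a hyphen.

```python
from itertools import product
from typing import Iterator
import string

# Assumed 37-character domain name character set: a-z, 0-9 and the hyphen.
DOMAIN_CHARSET = string.ascii_lowercase + string.digits + "-"


def root_strings(n: int, charset: str = DOMAIN_CHARSET) -> Iterator[str]:
    """Lazily yield every character string of length n over the charset."""
    if not 1 <= n <= 10:
        raise ValueError("N must satisfy 1 <= N <= 10")
    for chars in product(charset, repeat=n):
        yield "".join(chars)


def root_set_size(n: int, k: int = len(DOMAIN_CHARSET)) -> int:
    """Number of length-n strings over a k-character set, i.e. k ** n."""
    return k ** n
```

With K = 37, `root_set_size(3)` is 50,653, while `root_set_size(10)` is already about 4.8 × 10^15, which is why the generator yields strings lazily instead of building the whole set.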

The website discovery system of the invention may take the root character string directly as the candidate character string and splice it with the candidate domain name suffix to form the recommended domain name; the candidate character string may also be further optimized on the basis of the root character string. The candidate character string module therefore specifically comprises a first candidate character string module and a second candidate character string module; the first candidate character string module is used for taking a root character string as a candidate character string; and the second candidate character string module is used for splicing the root character string with a prefix character string and/or a suffix character string to form a candidate character string. Accordingly, the second candidate string module further includes a prefix string module and a suffix string module.
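The four ways of forming a candidate character string, and the subsequent splicing with a candidate domain name suffix, can be sketched as follows. The function names `candidate_strings` and `recommended_domains` are illustrative, and the prefix and suffix sets are assumed to have been built beforehand.

```python
from typing import Iterable, Iterator


def candidate_strings(
    root: str, prefixes: Iterable[str] = (), suffixes: Iterable[str] = ()
) -> Iterator[str]:
    """Yield root, prefix+root, root+suffix and prefix+root+suffix."""
    prefixes, suffixes = list(prefixes), list(suffixes)
    yield root                       # first candidate character string module
    for p in prefixes:               # second candidate character string module
        yield p + root
    for s in suffixes:
        yield root + s
    for p in prefixes:
        for s in suffixes:
            yield p + root + s


def recommended_domains(candidate: str, tld_suffixes: Iterable[str]) -> Iterator[str]:
    """Splice one candidate string with each candidate domain name suffix."""
    for tld in tld_suffixes:
        yield f"{candidate}.{tld.lstrip('.')}"
```

For the root cde, the prefix ab and the suffix fgh, `candidate_strings` yields cde, abcde, cdefgh and abcdefgh, matching the examples given for the method.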

The prefix character string module is used for acquiring the prefix character string: it obtains the frequency S1 with which a character string A1 having a length of M characters appears at the prefix position of the known domain names, among all known domain names, and takes the character string A1 as the prefix character string if S1 > m·S. The suffix character string module is used for acquiring the suffix character string: it obtains the frequency S2 with which a character string A2 having a length of M characters appears at the suffix position of the known domain names, among all known domain names, and takes the character string A2 as the suffix character string if S2 > m·S; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.
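One possible reading of this selection rule is sketched below. It assumes that the frequency of an M-character string is its count divided by the number of known domain names, and that `labels` holds the part of each known domain name before the domain name suffix; both are assumptions, as is the function name `frequent_affixes`.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def frequent_affixes(
    labels: Iterable[str], M: int, m: float, K: int = 37
) -> Tuple[Set[str], Set[str]]:
    """Return (prefix set, suffix set) of M-character strings whose observed
    frequency exceeds m * S, where S = 1 / K**M is the uniform baseline."""
    prefix_counts: Counter = Counter()
    suffix_counts: Counter = Counter()
    total = 0
    for label in labels:
        total += 1
        if len(label) >= M:
            prefix_counts[label[:M]] += 1    # first M characters
            suffix_counts[label[-M:]] += 1   # last M characters
    if total == 0:
        return set(), set()
    threshold = m / K ** M                   # m * S
    prefixes = {a for a, c in prefix_counts.items() if c / total > threshold}
    suffixes = {a for a, c in suffix_counts.items() if c / total > threshold}
    return prefixes, suffixes
```

Calling the function once for each M from 2 to 4 and taking the union of the results would give the prefix set and the suffix set.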

Specifically, all known domain names are counted to obtain the number of times a double character A₂ appears at the prefix position and at the suffix position, and from these counts the frequency S1 with which A₂ appears at the prefix position and the frequency S2 with which it appears at the suffix position are calculated; m is set according to experience, with m > 1; when K characters in total are available in domain names, the frequency S with which any given double character would appear at the prefix position if all characters were equally likely is 1/K²; if S1 > m·S = m/K², the double character A₂ is taken as a prefix character string, and if S2 > m·S = m/K², the double character A₂ is taken as a suffix character string; because the characters in a domain name must belong to the domain name character set, K = 37 according to the range of that set; three characters A₃ and four characters A₄ are processed in the same way to obtain three-character and four-character prefix character strings and suffix character strings; in some embodiments, five-character and six-character strings may also be processed in the same way to obtain five-character and six-character prefix character strings and suffix character strings, and the invention is not limited thereto.
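As a worked example of the threshold, take the double-character case (M = 2) with K = 37. The value m = 10 used below is purely illustrative; the text only requires m > 1.

```latex
S = \frac{1}{K^{M}} = \frac{1}{37^{2}} = \frac{1}{1369} \approx 7.3 \times 10^{-4},
\qquad
m \cdot S = \frac{10}{1369} \approx 7.3 \times 10^{-3}.
```

Under this assumed m, a double character is kept as a prefix character string only if it begins more than about 0.73% of the known domain names, and likewise for suffix character strings.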

The website acquisition module is used for verifying each legal domain name obtained to determine whether a corresponding website exists; if so, the corresponding website is acquired as a target website, and if not, the legal domain name is discarded. The invention uses the curl tool to verify legal domain names, and other software tools with a website verification function may also be used; the invention is not limited thereto.
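The text names curl but does not say how it is invoked or what counts as a successful check. The sketch below is therefore only one plausible arrangement: it asks curl for the HTTP status code and, as an assumption, treats any 2xx or 3xx status as evidence of a website. The function names, the 10-second timeout and the use of plain HTTP are illustrative, and `/dev/null` assumes a Unix-like system.

```python
import subprocess
from typing import Callable, List


def curl_command(domain: str, timeout_s: int = 10) -> List[str]:
    """Build a curl call that prints only the HTTP status code."""
    return [
        "curl", "-s", "-o", "/dev/null",   # silent, discard the response body
        "-w", "%{http_code}",              # write the status code to stdout
        "--max-time", str(timeout_s),      # give up after timeout_s seconds
        f"http://{domain}/",
    ]


def _run_curl(cmd: List[str]) -> str:
    """Run curl and return its stdout; an empty string if curl is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    except OSError:
        return ""


def has_website(domain: str, run: Callable[[List[str]], str] = _run_curl) -> bool:
    """Assumed criterion: any 2xx or 3xx status counts as a live website."""
    status = run(curl_command(domain)).strip()
    return status.isdigit() and 200 <= int(status) < 400
```

curl reports the status code 000 when no HTTP response is received, which this check treats as no website.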

The invention adopts the technology of automatically generating the domain name instead of spreading discovery or other passive acquisition methods, avoids the limitation of passive discovery and can detect more website domain names; the domain name is generated by adopting a root discovery and combination method, so that the detection success rate and pertinence of active detection are improved.

The present invention may be embodied in other specific forms without departing from the spirit or scope of the invention, and it should be understood that various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A website discovery method based on domain name recommendation is characterized by comprising the following steps:

randomly selecting any character arrangement combination in a domain name character set to obtain a root character string;

taking the root character string as a candidate character string, or splicing the root character string with a prefix character string and/or a suffix character string to form the candidate character string;

splicing the candidate character string and the candidate domain name suffix to form a recommended domain name;

performing DNS resolution on the recommended domain name to determine whether the recommended domain name is a legal domain name;

and verifying whether the legal domain name has a corresponding website, and if so, acquiring the legal domain name as a target website.

2. The website discovery method of claim 1, wherein the frequency S1 with which a character string A1 having a length of M characters appears at the prefix position of the known domain names, among all known domain names, is obtained, and if S1 > m·S, the character string A1 is taken as the prefix character string; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.

3. The website discovery method of claim 1, wherein the frequency S2 with which a character string A2 having a length of M characters appears at the suffix position of the known domain names, among all known domain names, is obtained, and if S2 > m·S, the character string A2 is taken as the suffix character string; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.

4. The website discovery method according to claim 1, wherein any number of characters are selected from the domain name character set for permutation and combination to generate a plurality of character strings, and the character string with a character length of N is used as the root character string; wherein N is a positive integer, and N is more than or equal to 1 and less than or equal to 10.

5. A website discovery system based on domain name recommendation, comprising:

a candidate character string module for forming a candidate character string from the root character string, comprising: a first candidate character string module for taking the root character string as the candidate character string; and a second candidate character string module for splicing the root character string with a prefix character string and/or a suffix character string to form the candidate character string;

6. The website discovery system of claim 5, wherein the second candidate string module comprises: a prefix character string module for acquiring the prefix character string, which obtains the frequency S1 with which a character string A1 having a length of M characters appears at the prefix position of the known domain names, among all known domain names, and takes the character string A1 as the prefix character string if S1 > m·S; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.

7. The website discovery system of claim 5, wherein the second candidate string module further comprises: a suffix character string module for acquiring the suffix character string, which obtains the frequency S2 with which a character string A2 having a length of M characters appears at the suffix position of the known domain names, among all known domain names, and takes the character string A2 as the suffix character string if S2 > m·S; wherein S = 1/K^M, K is the number of characters in the domain name character set, M is a positive integer with 2 ≤ M ≤ 4, and m is a preset value with m > 1.

8. The website discovery system of claim 5, wherein the root string module specifically comprises: selecting any number of characters from the domain name character set to be arranged and combined to generate a plurality of character strings, and taking the character string with the character length of N as the root character string; wherein N is a positive integer, and N is more than or equal to 1 and less than or equal to 10.