WO2018072363A1 - Method and device for extending data source - Google Patents
Method and device for extending data source
- Publication number
- WO2018072363A1 (PCT/CN2017/073611)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- resource locator
- uniform resource
- locator data
- data
- character
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Definitions
- the present invention belongs to the field of Internet security detection technologies, and more particularly, to a data source extension method and apparatus.
- Phishing, a form of security attack, creates a phishing website by mimicking the page content of a legitimate website and induces users to visit it in order to steal private information such as user names, bank accounts, and passwords.
- the detection methods for phishing websites mainly focus on the detection algorithm field, that is, researching efficient and accurate detection algorithms to detect websites, so as to find phishing websites from many websites.
- However, the discovery of the data source for these detection methods (i.e., the websites that may be phishing websites) relies on reports from netizens. As a result, the detection of phishing websites is passive, has no active discovery capability, and makes little secondary use of known phishing websites.
- An object of the present invention is to provide a data source extension method and apparatus that improve the secondary utilization of known phishing websites, expand the detection range, and effectively reduce both the lag of phishing discovery and its dependence on manual effort.
- the technical solutions are as follows:
- the present invention provides a data source extension method, the method comprising:
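- Acquiring all known uniform resource locator data, where all the known uniform resource locator data includes at least the uniform resource locator data of known phishing websites;
- Performing a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;
- Extending each of the uniform resource locator templates to obtain uniform resource locator data, corresponding to each of the uniform resource locator templates, that can be regarded as a phishing website.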
- the method further includes:
- the uniform resource locator data in each sub-level second-level domain name collection list is sorted so that uniform resource locator data with higher similarity is adjacent in the sorting.
- the sorting of the uniform resource locator data in each sub-level second-level domain name collection list, so that uniform resource locator data with higher similarity is adjacent in the sorting, includes:
- the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen are each sorted by length and in alphabetical order.
- performing the pairwise comparison on the all known uniform resource locator data to obtain a plurality of uniform resource locator templates including:
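- When the character at the j-th position in the i-th uniform resource locator data or the i+1th uniform resource locator data is the preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character, at the j-th position, that is not the preset hyphen;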
- the uniform resource locator data after the replacement of all the different characters is the uniform resource locator template corresponding to the i-th uniform resource locator data and the i+1th uniform resource locator data.
- the extending of each uniform resource locator template, to obtain the uniform resource locator data corresponding to each uniform resource locator template that can be regarded as a phishing website, includes:
- Extending the retained uniform resource locator template, where the extension process includes: replacing the first preset replacement symbol in the uniform resource locator template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the uniform resource locator template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates.
- the present invention also provides a data source expansion apparatus, the apparatus comprising:
- An obtaining unit configured to acquire all known uniform resource locator data, where the all known uniform resource locator data includes at least uniform resource locator data of a known phishing website;
- a comparison unit configured to perform a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates
- an extension unit configured to extend each of the uniform resource locator templates, and obtain uniform resource locator data corresponding to each of the uniform resource locator templates that can be regarded as a phishing website.
- the device further comprises:
- a list forming unit configured to acquire a second-level domain name of each uniform resource locator data, to form a second-level domain name set list
- a classification unit configured to perform, according to the top-level domain name in the second-level domain name collection list, obtain a list of sub-level second-level domain name collections having different top-level domain names;
- a sorting unit configured to sort the uniform resource locator data in each of the sub-level second-level domain name collection lists, so that uniform resource locator data with higher similarity is adjacent in the sorting.
- the sorting unit comprises:
- a classification subunit configured to classify the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data that contains the preset hyphen and uniform resource locator data that does not contain the preset hyphen;
- a sorting subunit configured to sort the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen, each by length and in alphabetical order.
- the comparing unit comprises:
- a comparison subunit configured to compare, in sequence, the characters at each position in the i-th uniform resource locator data and the i+1th uniform resource locator data when the two have the same length;
- an obtaining subunit configured to acquire, when the characters at the j-th position are different, the types of the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data;
- a first replacement subunit configured to replace the character at the j-th position with the first preset replacement symbol when the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data are both of a numeric type;
- a second replacement subunit configured to replace the character at the j-th position with the second preset replacement symbol when the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data are both of a letter type;
- a third replacement subunit configured to, when the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the i+1th uniform resource locator data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;
- a fourth replacement subunit configured to, when the character at the j-th position in the i-th uniform resource locator data or the i+1th uniform resource locator data is the preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character, at the j-th position, that is not the preset hyphen;
- a configuration subunit configured to take the uniform resource locator data in which all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the i+1th uniform resource locator data.
- the expansion unit comprises:
- a statistics subunit configured to count the number of occurrences of each uniform resource locator template, to obtain an ordered list of uniform resource locator templates;
- a retention subunit configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;
- an extension subunit configured to extend the retained uniform resource locator templates, where the extension process includes: replacing the first preset replacement symbol in the uniform resource locator template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the uniform resource locator template with each character of the type corresponding to the second preset replacement symbol in turn;
- a deduplication subunit configured to perform de-duplication processing on the extended uniform resource locator data against all known uniform resource locator data, to obtain uniform resource locator data that can be regarded as a phishing website.
- With the foregoing technical solution, uniform resource locator templates can be obtained from all known uniform resource locator data, and each template can be extended to obtain uniform resource locator data, corresponding to that template, that can be regarded as a phishing website.
- Phishing websites are thus discovered actively, which effectively reduces the lag of phishing discovery and its dependence on manual effort.
- The detection range can be expanded and losses can be reduced, and because the extension takes the uniform resource locator data of known phishing websites as its basis, the secondary utilization of known phishing websites is improved.
- FIG. 1 is a flowchart of a data source detecting method according to an embodiment of the present invention
- FIG. 2 is a flowchart of obtaining a URL template in the data source detecting method shown in FIG. 1;
- FIG. 3 is a flowchart of a URL template extension in the data source detection method shown in FIG. 1;
- FIG. 4 is another flowchart of a data source detecting method according to an embodiment of the present invention.
- FIG. 5 is a schematic diagram of obtaining a URL template according to an embodiment of the present invention.
- FIG. 6 is a schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention.
- Figure 7 is a schematic structural view of a comparison unit in the data source detecting device shown in Figure 6;
- FIG. 8 is a schematic structural diagram of an extension unit in the data source detecting apparatus shown in FIG. 6;
- FIG. 9 is another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention.
- The embodiment of the present invention provides a data source extension method for actively obtaining URL data that can be regarded as a phishing website and for improving the secondary utilization of the URL data of known phishing websites.
- FIG. 1 is a flowchart of a data source extension method provided by an embodiment of the present invention, which actively acquires URL data that can be regarded as a phishing website and improves the secondary utilization of the URL data of known phishing websites. The method may specifically include the following steps:
- All known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data extension, which improves the secondary utilization of that URL data; for example, the extension may be based on www.1oo86.cn. The URL data of other known legitimate websites may also be extended, for example based on www.360.com.
- the process of obtaining a URL template and extending each URL template in the embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
- the process of obtaining a URL template provided by an embodiment of the present invention may include the following steps:
- The first preset replacement symbol is a symbol preset to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of a numeric type, the first preset replacement symbol is used. For example, if the first preset replacement symbol is "#", the character at the j-th position is replaced with "#". Other symbols may also be used as the first preset replacement symbol, as determined by the actual application.
- The second preset replacement symbol is a symbol preset to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of a letter type, the second preset replacement symbol is used. For example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". Other symbols may also be used as the second preset replacement symbol, as determined by the actual application.
- When the types of the two characters at the j-th position differ, the type of the character in the i-th URL data decides the replacement: if the character at the j-th position in the i-th URL data is of a numeric type, the character at the j-th position is replaced with the first preset replacement symbol; if it is of a letter type, the character at the j-th position is replaced with the second preset replacement symbol.
- For example, in the URL data www.g2-bc.com and www.g-abb.com, one character at the sixth position is the digit 2 and the other is the preset hyphen "-". The digit corresponds to the first preset replacement symbol, so the first preset replacement symbol replaces the character at the sixth position.
- At the seventh position, one character is the preset hyphen "-" and the other is the letter a, so the character at the seventh position is replaced with the preset replacement symbol corresponding to letters, that is, the second preset replacement symbol.
- The URL data obtained after all differing characters have been replaced is the URL template corresponding to the i-th URL data and the i+1th URL data. For example, the URL template for the two URL data www.g2-bc.com and www.g-abb.com is www.g#@b@.com.
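The replacement rules above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `symbol_for` and `make_template` are hypothetical, and returning `None` for pairs of different length or for differing characters that are neither digits, letters, nor the hyphen is an assumption, since the text does not cover those cases.

```python
HYPHEN = "-"          # preset hyphen
DIGIT_SYMBOL = "#"    # first preset replacement symbol (numeric type)
LETTER_SYMBOL = "@"   # second preset replacement symbol (letter type)


def symbol_for(ch):
    """Return the replacement symbol for a character's type, or None."""
    if ch.isdigit():
        return DIGIT_SYMBOL
    if ch.isalpha():
        return LETTER_SYMBOL
    return None


def make_template(url_i, url_next):
    """Build a template from the i-th and i+1th URL data of equal length."""
    if len(url_i) != len(url_next):
        return None  # assumption: pairs of different length yield no template
    out = []
    for a, b in zip(url_i, url_next):
        if a == b:
            out.append(a)
            continue
        # Hyphen rule: use the type of the character that is not the hyphen.
        # Otherwise the character from the i-th URL decides, which covers
        # both the same-type case and the differing-type case.
        source = b if a == HYPHEN else a
        symbol = symbol_for(source)
        if symbol is None:
            return None  # assumption: unsupported character types are skipped
        out.append(symbol)
    return "".join(out)


print(make_template("www.g2-bc.com", "www.g-abb.com"))  # www.g#@b@.com
```

Applied to the worked example, position 6 (digit versus hyphen) becomes "#", positions 7 and 9 become "@", and every matching position is copied unchanged.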
- the process of expanding each URL template is as shown in FIG. 3, and may include the following steps:
- The preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of occurrences of the URL template. With charvalue as the preset maximum number of preset replacement symbols in a template and numvalue as the preset threshold for the number of times a URL template appears, the ordered list of URL templates is traversed and the number of templates is controlled by the following condition:
- if the number of occurrences of a URL template is not less than numvalue, the URL template is retained; otherwise it is deleted.
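A possible reading of this filtering step is sketched below in Python. The names `charvalue` and `numvalue` come from the text; the function name `filter_templates`, the default values, and the interpretation of `charvalue` as an upper bound on the number of "#" and "@" symbols in a template are assumptions made for illustration.

```python
from collections import Counter


def filter_templates(templates, charvalue=2, numvalue=1):
    """Count templates, order them by frequency, and keep those that pass.

    Assumed condition: a template is kept when it contains at most
    `charvalue` replacement symbols and occurs at least `numvalue` times.
    """
    counts = Counter(templates)
    kept = []
    # most_common() yields (template, count) pairs, most frequent first.
    for template, occurrences in counts.most_common():
        placeholders = template.count("#") + template.count("@")
        if placeholders <= charvalue and occurrences >= numvalue:
            kept.append(template)
    return kept


observed = ["a@c.com", "a@c.com", "a@@c.com", "g#@b@.com", "#bc.com"]
print(filter_templates(observed))  # ['a@c.com', 'a@@c.com', '#bc.com']
```

With these assumed defaults, a template such as g#@b@.com is dropped because it contains three replacement symbols.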
- The retained URL template is extended, where the extension process includes: replacing the first preset replacement symbol in the URL template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the URL template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the extended URL data corresponding to each URL template.
- Taking "#" as the first preset replacement symbol and "@" as the second preset replacement symbol as an example, each first preset replacement symbol in the URL template is replaced with the 10 digits 0-9 in turn, and each second preset replacement symbol in the URL template is replaced with the 26 English letters a to z in turn. After every preset replacement symbol in the URL template has been replaced, a plurality of URL data corresponding to each URL template is obtained.
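The extension step can be illustrated with a short Python sketch. The function name `expand_template` is hypothetical, and treating a template with several replacement symbols as the Cartesian product of all substitutions is an assumption; the text only says the symbols are replaced "in turn".

```python
import itertools
import string


def expand_template(template):
    """Expand '#' to the digits 0-9 and '@' to the letters a-z."""
    choices = []
    for ch in template:
        if ch == "#":
            choices.append(string.digits)           # "0123456789"
        elif ch == "@":
            choices.append(string.ascii_lowercase)  # "a" to "z"
        else:
            choices.append(ch)                      # fixed character
    # Every combination of one choice per position gives one URL.
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_template("#bc.com")[:3])    # ['0bc.com', '1bc.com', '2bc.com']
print(len(expand_template("a@@c.com")))  # 676
```

Under this assumption the number of generated URLs grows as 10 per "#" and 26 per "@", which is one reason to cap the number of replacement symbols per template.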
- The reason for this replacement is that the first preset replacement symbol and the second preset replacement symbol in a URL template each correspond to a character type, and this correspondence reflects which type of character the URL data of phishing websites is most easily tampered into. That is, the embodiment of the present invention determines, through statistics on the URL data of phishing websites, the types of characters that are easily tampered with, so that the obtained URL templates and the extended URL data conform to the way phishing websites tamper with URL data. This makes the URL templates and the extended URL data highly targeted, so that more accurate URL data can be obtained from less data, and the extended URL data can be used as a data source for phishing detection, which improves versatility.
- The embodiment of the present invention can obtain URL templates from all known URL data and extend each URL template to obtain URL data, corresponding to that template, that can be regarded as a phishing website. Phishing websites are thus discovered actively, which effectively reduces the lag of phishing discovery and its dependence on manual effort.
- The detection range can be expanded and losses can be reduced, and because the extension takes the URL data of known phishing websites as its basis, the secondary utilization of known phishing websites is improved.
- After the URL data is obtained, the data source detection method provided by the embodiment of the present invention may further sort the URL data so that URL data with higher similarity is adjacent. URL data with higher similarity is thereby grouped together, which makes it possible to determine which type of character legitimate URL data is most often tampered into, so that the URL data can be extended in a targeted manner.
- FIG. 4 shows another flowchart of the data source detecting method provided by the embodiment of the present invention, which may include the following steps:
- All known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data extension, which improves the secondary utilization of that URL data; for example, the extension may be based on www.1oo86.cn. The URL data of other known legitimate websites may also be extended, for example based on www.360.com.
- Steps 405 and 406 differ from steps 102 and 103 above only in that the URL template is obtained based on the second-level domain name corresponding to each URL data in the sub-level second-level domain name collection list; the replacement and extension methods are the same.
- the process of getting a URL template is:
- the current variable domain2 is first assigned to the variable domain1, and then the next line is read sequentially, and the value is assigned to the variable domain2.
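The rolling comparison described here, in which domain2 is shifted into domain1 before the next line is read, amounts to iterating over adjacent pairs of a sorted list. A minimal Python sketch follows; the variable names `domain1` and `domain2` are taken from the text, while the generator name `adjacent_pairs` and the use of an in-memory list of lines are assumptions.

```python
def adjacent_pairs(lines):
    """Yield (domain1, domain2) for each pair of consecutive lines."""
    domain2 = None
    for line in lines:
        # Shift the current domain2 into domain1, then read the next line.
        domain1, domain2 = domain2, line.strip()
        if domain1 is not None:
            yield domain1, domain2


for pair in adjacent_pairs(["abc.com\n", "acc.com\n", "1bc.com\n"]):
    print(pair)
# ('abc.com', 'acc.com')
# ('acc.com', '1bc.com')
```

Each yielded pair would then be passed to the character-by-character comparison that produces a URL template.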
- the data source detection method provided by the embodiment of the present invention is described below by using the preset hyphen as "-", the first preset replacement symbol being "#”, and the second preset replacement symbol being "@" as an example.
- all known URL data is URL data of a known phishing website, as shown in Table 1.
- Table 1: URL data of the phishing websites
- www.abc.com
- www.a-c.com
- mg.afgc.com
- tg.agm.net
- www.agbc.com
- m.acc.com
- www.g2-bc.com
- www.g-abb.com
- wap.abc.net
- www.1bc.com
- the list of second-level domain names obtained from the URL data in Table 1 above is: abc.com, ac.com, afgc.com, agm.net, agbc.com, acc.com, g2-bc.com, g-abb.com, abc.net, 1bc.com.
- the .net list is: abc.net, agm.net.
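Forming the second-level domain name list and splitting it by top-level domain name can be sketched as follows. This is an illustrative simplification: the function names are hypothetical, and taking the last two dot-separated labels as the second-level domain name is an assumption that does not handle multi-label suffixes such as .com.cn.

```python
from collections import defaultdict


def second_level_domain(url):
    """Return the last two labels of a host name, e.g. 'abc.com'."""
    labels = url.lower().split(".")
    return ".".join(labels[-2:])


def group_by_tld(urls):
    """Group second-level domain names by their top-level domain name."""
    groups = defaultdict(list)
    for url in urls:
        sld = second_level_domain(url)
        tld = "." + sld.rsplit(".", 1)[1]
        groups[tld].append(sld)
    return dict(groups)


urls = ["www.abc.com", "tg.agm.net", "wap.abc.net", "www.1bc.com"]
print(group_by_tld(urls))
# {'.com': ['abc.com', '1bc.com'], '.net': ['agm.net', 'abc.net']}
```

The groups keep input order here; sorting within each group is a separate step.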
- an ordered list of URL templates is obtained: a@c.com (2 times), a@@c.com (1 time), g#@b@.com (1 time), #bc.com (1 time).
- the preset condition is that the number of occurrences of # and @ in the URL template is not more than two, and the number of occurrences of the URL template is not less than one.
- the URL templates retained in the URL template list are: a@c.com, a@@c.com, #bc.com.
- the above three URL templates are extended, and #bc.com is taken as an example.
- the extended URL data includes: 0bc.com, 1bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
- the above extended URL data is deduplicated against the URL data 1bc.com of the known phishing website (1bc.com is a repeat), and the final extended URL data that can be regarded as phishing websites is: 0bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
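The de-duplication in this example can be reproduced with a short Python sketch. The function name `deduplicate` is hypothetical, and preserving the order of the extended list is a choice made for readability rather than something the text requires.

```python
def deduplicate(extended, known):
    """Drop URLs that are already known or that repeat in the extended list."""
    known_set = set(known)
    seen = set()
    result = []
    for url in extended:
        if url not in known_set and url not in seen:
            seen.add(url)
            result.append(url)
    return result


extended = [f"{d}bc.com" for d in range(10)]  # expansion of #bc.com
print(deduplicate(extended, known=["1bc.com"]))
# ['0bc.com', '2bc.com', '3bc.com', '4bc.com', '5bc.com',
#  '6bc.com', '7bc.com', '8bc.com', '9bc.com']
```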
- the embodiment of the present invention further provides a data source expansion apparatus.
- the schematic diagram of the structure is as shown in FIG. 6, and may include: an obtaining unit 11, a comparing unit 12, and an expanding unit 13.
- The obtaining unit 11 is configured to obtain all known URL data, where all known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data extension, which improves the secondary utilization of that URL data; for example, the extension may be based on www.1oo86.cn. The URL data of other known legitimate websites may also be extended, for example based on www.360.com.
- the comparing unit 12 is configured to perform a pairwise comparison on all known URL data to obtain a plurality of URL templates.
- All known URL data is compared in pairs because multiple URL data may correspond to the same URL template. Pairwise comparison makes it easy to count the number of occurrences of each URL template, so that extension based on the URL templates is more targeted.
- The comparison unit 12 may obtain a plurality of URL templates by using the structure shown in FIG. 7, where the comparison unit 12 may include: a comparison subunit 121, a recording subunit 122, an acquisition subunit 123, a first replacement subunit 124, a second replacement subunit 125, a third replacement subunit 126, a fourth replacement subunit 127, and a configuration subunit 128.
- the comparing subunit 121 is configured to compare the characters at each position in the i-th URL data and the i+1th URL data in sequence when the lengths of the i-th URL data and the i+1th URL data are the same.
- Here i is a natural number, i = 1, 2, ..., m-1, and m is the total number of URL data.
- the obtaining sub-unit 123 is configured to acquire the type of the character at the j-th position in the i-th URL data and the i+1th URL data when the characters at the j-th position are different.
- The first replacement subunit 124 is configured to replace the character at the j-th position with the first preset replacement symbol when the characters at the j-th position in the i-th URL data and the i+1th URL data are both of a numeric type.
- The first preset replacement symbol is a symbol preset to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of a numeric type, the first preset replacement symbol is used. For example, if the first preset replacement symbol is "#", the character at the j-th position is replaced with "#". Other symbols may also be used as the first preset replacement symbol, as determined by the actual application.
- The second replacement subunit 125 is configured to replace the character at the j-th position with the second preset replacement symbol when the characters at the j-th position in the i-th URL data and the i+1th URL data are both of a letter type.
- The second preset replacement symbol is a symbol preset to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of a letter type, the second preset replacement symbol is used. For example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". Other symbols may also be used as the second preset replacement symbol, as determined by the actual application.
- The third replacement subunit 126 is configured to, when the type of the character at the j-th position in the i-th URL data differs from the type of the character at the j-th position in the i+1th URL data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th URL data.
- That is, if the character at the j-th position in the i-th URL data is of a numeric type, the character at the j-th position is replaced with the first preset replacement symbol; if it is of a letter type, the character at the j-th position is replaced with the second preset replacement symbol.
- The fourth replacement subunit 127 is configured to, when the character at the j-th position in the i-th URL data or the i+1th URL data is the preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character, at the j-th position, that is not the preset hyphen.
- For example, in the URL data www.g2-bc.com and www.g-abb.com, one character at the sixth position is the digit 2 and the other is the preset hyphen "-". The digit corresponds to the first preset replacement symbol, so the first preset replacement symbol replaces the character at the sixth position.
- At the seventh position, one character is the preset hyphen "-" and the other is the letter a, so the character at the seventh position is replaced with the preset replacement symbol corresponding to letters, that is, the second preset replacement symbol.
- The configuration subunit 128 is configured to take the URL data in which all differing characters have been replaced as the URL template corresponding to the i-th URL data and the i+1th URL data. For example, the URL template for the two URL data www.g2-bc.com and www.g-abb.com is www.g#@b@.com.
- the extension unit 13 is configured to expand each URL template to obtain URL data corresponding to each URL template and can be regarded as a phishing website.
- the extension unit 13 may extend each URL template by using the structure shown in FIG. 8.
- The extension unit 13 includes: a statistics subunit 131, a retention subunit 132, an extension subunit 133, and a deduplication subunit 134.
- the statistics sub-unit 131 is configured to perform statistics on the number of URL templates to obtain an ordered list of URL templates. The number of times the URL template is counted is to count the number of occurrences of each URL template, and then merge the same URL templates to reduce the number of URL templates.
- the retention sub-unit 132 is configured to retain the URL template in the URL template list that meets the preset condition. After the URL template in the URL template list is compared with the preset condition, the partial URL template is deleted, and then the URL template conforming to the preset condition is retained as the URL template finally used for the extension, thereby further reducing the number of URL templates.
- The preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of occurrences of the URL template. With charvalue as the preset maximum number of preset replacement symbols in a template and numvalue as the preset threshold for the number of times a URL template appears, the ordered list of URL templates is traversed and the number of templates is controlled by the following condition:
- if the number of occurrences of a URL template is not less than numvalue, the URL template is retained; otherwise it is deleted.
- The extension sub-unit 133 is configured to extend the retained URL template, where the extension process includes: replacing the first preset replacement symbol in the URL template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the URL template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the extended URL data corresponding to each URL template.
- Taking "#" as the first preset replacement symbol and "@" as the second preset replacement symbol as an example, each first preset replacement symbol in the URL template is replaced with the 10 digits 0-9 in turn, and each second preset replacement symbol in the URL template is replaced with the 26 English letters a to z in turn. After every preset replacement symbol in the URL template has been replaced, a plurality of URL data corresponding to each URL template is obtained.
- The reason for this replacement is that the first preset replacement symbol and the second preset replacement symbol in a URL template each correspond to a character type, and this correspondence reflects which type of character the URL data of phishing websites is most easily tampered into. That is, the embodiment of the present invention determines, through statistics on the URL data of phishing websites, the types of characters that are easily tampered with, so that the obtained URL templates and the extended URL data conform to the way phishing websites tamper with URL data. This makes the URL templates and the extended URL data highly targeted, so that more accurate URL data can be obtained from less data, and the extended URL data can be used as a data source for phishing detection, which improves versatility.
- The deduplication subunit 134 is configured to perform de-duplication processing on the extended URL data against all known URL data, to obtain URL data that can be regarded as a phishing website.
- The embodiment of the present invention can obtain URL templates from all known URL data and extend each URL template to obtain URL data, corresponding to that template, that can be regarded as a phishing website. Phishing websites are thus discovered actively, which effectively reduces the lag of phishing discovery and its dependence on manual effort.
- The detection range can be expanded and losses can be reduced, and because the extension takes the URL data of known phishing websites as its basis, the secondary utilization of known phishing websites is improved.
- After acquiring the URL data, the data source detecting apparatus may further sort the URL data so that URL data with higher similarity is adjacent. URL data with higher similarity is thereby grouped together, which makes it possible to determine which type of character legitimate URL data is most often tampered into, so that the URL data can be extended in a targeted manner.
- FIG. 9 shows another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention. On the basis of FIG. 6, the apparatus may further include: a list forming unit 14, a classifying unit 15, and a sorting unit 16.
- The list forming unit 14 is configured to obtain the second-level domain name of each URL data and form a second-level domain name collection list. For example, the second-level domain name of "www.abc.com" is "abc.com"; the second-level domain names of all the URL data are then stored in one list to form the second-level domain name collection list.
- The classifying unit 15 is configured to classify the entries of the second-level domain name collection list by top-level domain name, obtaining sub-level second-level domain name collection lists having different top-level domain names. For example, if the top-level domain names of "www.abc.com" and "www.efg.com" are both ".com", the two URL data are stored in the sub-level second-level domain name collection list corresponding to ".com".
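The work of the list forming unit and the classifying unit can be sketched as follows. This is a simplified illustration with hypothetical function names: it assumes that the second-level domain name is the last two dot-separated labels of the host, as in the "www.abc.com" example, and it does not handle multi-label suffixes such as ".co.uk".

```python
from collections import defaultdict


def second_level_domain(host):
    """Return the last two labels of a host name, e.g. 'www.abc.com' -> 'abc.com'."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def group_by_tld(hosts):
    """Build one list of second-level domain names per top-level domain name."""
    groups = defaultdict(list)
    for host in hosts:
        sld = second_level_domain(host)
        tld = "." + sld.rsplit(".", 1)[-1]
        if sld not in groups[tld]:   # keep each second-level domain name once
            groups[tld].append(sld)
    return dict(groups)
```

Applied to hosts such as "www.abc.com", "www.efg.com" and "tg.agm.net", this produces one list under ".com" and another under ".net", each of which can then be sorted separately.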
- The sorting unit 16 is configured to sort the URL data in each of the sub-level second-level domain name collection lists, so that the URL data with higher similarity are adjacent in the sorting.
- The sorting unit includes a classifying subunit and a sorting subunit. The classifying subunit is configured to classify the URL data in each sub-level second-level domain name collection list based on a preset hyphen, obtaining URL data that contain the preset hyphen and URL data that do not contain the preset hyphen.
- The sorting subunit is configured to sort the URL data containing the preset hyphen and the URL data not containing the preset hyphen by length and then in alphabetical order, so that URL data with higher similarity are gathered together. This makes it possible to count which types of characters legitimate URL data are most often falsified into, and to extend the URL data in a targeted manner.
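The two-stage ordering performed by the sorting unit can be sketched as follows. The function name is hypothetical, and two details are assumptions because the text does not fix them: the hyphenated group is placed first, and length is sorted in ascending order. Alphabetical order here is plain character-code order.

```python
def sort_for_similarity(domains, hyphen="-"):
    """Split domains by presence of the hyphen, then sort each group
    by length and, for equal length, alphabetically."""
    with_hyphen = [d for d in domains if hyphen in d]
    without_hyphen = [d for d in domains if hyphen not in d]

    def key(domain):
        return (len(domain), domain)

    # Assumption: hyphenated group first, ascending length within each group.
    return sorted(with_hyphen, key=key) + sorted(without_hyphen, key=key)
```

With this ordering, equal-length names that differ in a single character tend to end up next to each other, which is what a pairwise comparison of adjacent entries relies on.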
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
www.abc.com | www.a-c.com | mg.afgc.com | tg.agm.net | www.agbc.com |
m.acc.com | www.g2-bc.com | www.g-abb.com | wap.abc.net | www.1bc.com |

.com sort results | .net sort results |
---|---|
g2-bc.com | abc.net |
g-abb.com | agm.net |
afgc.com | |
agbc.com | |
1bc.com | |
abc.com | |
acc.com | |
a-c.com | |
Claims (10)
- A data source extension method, characterized in that the method comprises: acquiring all known uniform resource locator data, wherein the all known uniform resource locator data include at least uniform resource locator data of known phishing websites; performing pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates; and extending each of the uniform resource locator templates to obtain, for each uniform resource locator template, corresponding uniform resource locator data that can be regarded as belonging to a phishing website.
- The method according to claim 1, characterized in that, after acquiring all known uniform resource locator data and before performing the pairwise comparison on all the known uniform resource locator data, the method further comprises: obtaining a second-level domain name of each uniform resource locator data to form a second-level domain name collection list; classifying according to the top-level domain names in the second-level domain name collection list to obtain sub-level second-level domain name collection lists having different top-level domain names; and sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity are adjacent in the sorting.
- The method according to claim 2, characterized in that sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity are adjacent in the sorting comprises: classifying the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen to obtain uniform resource locator data containing the preset hyphen and uniform resource locator data not containing the preset hyphen; and sorting the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen by length and then in alphabetical order.
- The method according to claim 1 or 2, characterized in that performing pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates comprises: when the i-th uniform resource locator data and the (i+1)-th uniform resource locator data have the same length, comparing in turn the characters at each position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data; when the characters at the j-th position are the same, recording the character at the j-th position and continuing to compare the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data; when the characters at the j-th position are different, obtaining the types of the characters at the j-th position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data; when the characters at the j-th position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the numeric type, replacing the character at the j-th position with a first preset replacement symbol; when the characters at the j-th position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the letter type, replacing the character at the j-th position with a second preset replacement symbol; when the type of the character at the j-th position of the i-th uniform resource locator data differs from the type of the character at the j-th position of the (i+1)-th uniform resource locator data, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position of the i-th uniform resource locator data; when the character at the j-th position of the i-th uniform resource locator data or of the (i+1)-th uniform resource locator data is a preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen; and taking the uniform resource locator data in which all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.
- The method according to claim 4, characterized in that extending each of the uniform resource locator templates to obtain, for each uniform resource locator template, corresponding uniform resource locator data that can be regarded as belonging to a phishing website comprises: counting the number of occurrences of the uniform resource locator templates to obtain an ordered uniform resource locator template list; retaining the uniform resource locator templates in the uniform resource locator template list that meet a preset condition; extending the retained uniform resource locator templates, wherein the extension process comprises replacing, in turn, the first preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the first preset replacement symbol, and replacing, in turn, the second preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each uniform resource locator template; and performing de-duplication processing on the extended uniform resource locator data and all known uniform resource locator data to obtain all uniform resource locator data that can be regarded as belonging to phishing websites.
- A data source extension device, characterized in that the device comprises: an acquiring unit, configured to acquire all known uniform resource locator data, wherein the all known uniform resource locator data include at least uniform resource locator data of known phishing websites; a comparison unit, configured to perform pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates; and an extension unit, configured to extend each of the uniform resource locator templates to obtain, for each uniform resource locator template, corresponding uniform resource locator data that can be regarded as belonging to a phishing website.
- The device according to claim 6, characterized in that the device further comprises: a list forming unit, configured to obtain a second-level domain name of each uniform resource locator data to form a second-level domain name collection list; a classifying unit, configured to classify according to the top-level domain names in the second-level domain name collection list to obtain sub-level second-level domain name collection lists having different top-level domain names; and a sorting unit, configured to sort the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity are adjacent in the sorting.
- The device according to claim 7, characterized in that the sorting unit comprises: a classifying subunit, configured to classify the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen to obtain uniform resource locator data containing the preset hyphen and uniform resource locator data not containing the preset hyphen; and a sorting subunit, configured to sort the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen by length and then in alphabetical order.
- The device according to claim 6 or 7, characterized in that the comparison unit comprises: a comparing subunit, configured to, when the i-th uniform resource locator data and the (i+1)-th uniform resource locator data have the same length, compare in turn the characters at each position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data; a recording subunit, configured to, when the characters at the j-th position are the same, record the character at the j-th position and trigger the comparing subunit to continue comparing the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data; an obtaining subunit, configured to, when the characters at the j-th position are different, obtain the types of the characters at the j-th position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data; a first replacement subunit, configured to, when the characters at the j-th position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the numeric type, replace the character at the j-th position with a first preset replacement symbol; a second replacement subunit, configured to, when the characters at the j-th position of the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the letter type, replace the character at the j-th position with a second preset replacement symbol; a third replacement subunit, configured to, when the type of the character at the j-th position of the i-th uniform resource locator data differs from the type of the character at the j-th position of the (i+1)-th uniform resource locator data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position of the i-th uniform resource locator data; a fourth replacement subunit, configured to, when the character at the j-th position of the i-th uniform resource locator data or of the (i+1)-th uniform resource locator data is a preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen; and a configuration subunit, configured to take the uniform resource locator data in which all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.
- The device according to claim 9, characterized in that the extension unit comprises: a statistics subunit, configured to count the number of occurrences of the uniform resource locator templates to obtain an ordered uniform resource locator template list; a retaining subunit, configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition; an extension subunit, configured to extend the retained uniform resource locator templates, wherein the extension process comprises replacing, in turn, the first preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the first preset replacement symbol, and replacing, in turn, the second preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each uniform resource locator template; and a deduplication subunit, configured to perform de-duplication processing on the extended uniform resource locator data and all known uniform resource locator data to obtain all uniform resource locator data that can be regarded as belonging to phishing websites.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610911941.X | 2016-10-19 | ||
CN201610911941.XA CN106503125B (en) | 2016-10-19 | 2016-10-19 | A kind of data source extended method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018072363A1 true WO2018072363A1 (en) | 2018-04-26 |
Family
ID=58294512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/073611 WO2018072363A1 (en) | 2016-10-19 | 2017-02-15 | Method and device for extending data source |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106503125B (en) |
WO (1) | WO2018072363A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241483B (en) * | 2018-08-31 | 2021-10-12 | 中国科学院计算技术研究所 | Website discovery method and system based on domain name recommendation |
CN109672678B (en) * | 2018-12-24 | 2021-05-14 | 亚信科技(中国)有限公司 | Phishing website identification method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222187A (en) * | 2011-06-02 | 2011-10-19 | 国家计算机病毒应急处理中心 | Domain name structural feature-based hang horse web page detection method |
CN104202291A (en) * | 2014-07-11 | 2014-12-10 | 西安电子科技大学 | Anti-phishing method based on multi-factor comprehensive assessment method |
CN104765882A (en) * | 2015-04-29 | 2015-07-08 | 中国互联网络信息中心 | Internet website statistics method based on web page characteristic strings |
US20150295942A1 (en) * | 2012-12-26 | 2015-10-15 | Sinan TAO | Method and server for performing cloud detection for malicious information |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8438642B2 (en) * | 2009-06-05 | 2013-05-07 | At&T Intellectual Property I, L.P. | Method of detecting potential phishing by analyzing universal resource locators |
CN102082792A (en) * | 2010-12-31 | 2011-06-01 | 成都市华为赛门铁克科技有限公司 | Phishing webpage detection method and device |
CN103491101A (en) * | 2013-09-30 | 2014-01-01 | 北京金山网络科技有限公司 | Phishing website detecting method and device and client-side |
CN103685307B (en) * | 2013-12-25 | 2017-08-11 | 北京奇虎科技有限公司 | The method and system of feature based storehouse detection fishing fraud webpage, client, server |
CN104615760B (en) * | 2015-02-13 | 2018-04-13 | 北京瑞星网安技术股份有限公司 | Fishing website recognition methods and system |
CN105138912A (en) * | 2015-09-25 | 2015-12-09 | 北京奇虎科技有限公司 | Method and device for generating phishing website detection rules automatically |
- 2016-10-19 CN CN201610911941.XA patent/CN106503125B/en active Active
- 2017-02-15 WO PCT/CN2017/073611 patent/WO2018072363A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN106503125B (en) | 2019-10-15 |
CN106503125A (en) | 2017-03-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17861640 Country of ref document: EP Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17861640 Country of ref document: EP Kind code of ref document: A1 |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17861640 Country of ref document: EP Kind code of ref document: A1 |
 | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/12/2019) |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17861640 Country of ref document: EP Kind code of ref document: A1 |