WO2018072363A1 - Method and device for extending data source - Google Patents

Publication number
WO2018072363A1
WO2018072363A1 PCT/CN2017/073611 CN2017073611W WO2018072363A1 WO 2018072363 A1 WO2018072363 A1 WO 2018072363A1 CN 2017073611 W CN2017073611 W CN 2017073611W WO 2018072363 A1 WO2018072363 A1 WO 2018072363A1
Authority
WO
WIPO (PCT)
Prior art keywords
resource locator
uniform resource
locator data
data
character
Prior art date
Application number
PCT/CN2017/073611
Other languages
French (fr)
Chinese (zh)
Inventor
李晓东
李雪妮
耿光刚
陈勇
Original Assignee
中国互联网络信息中心
Application filed by 中国互联网络信息中心
Publication of WO2018072363A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Definitions

  • The present invention belongs to the field of Internet security detection technologies and, more particularly, relates to a data source extension method and apparatus.
  • Phishing, as a form of security attack, creates a phishing website by mimicking the page content of a legitimate website and induces users to visit it in order to steal private information such as user names, bank accounts, and passwords.
  • Detection methods for phishing websites mainly focus on detection algorithms, that is, on researching efficient and accurate algorithms that pick out phishing websites from among many websites.
  • However, discovery of the data source for these detection methods (i.e., the possible phishing websites) relies on reports from Internet users. Detection of phishing websites is therefore passive, lacks any active discovery capability, and makes little secondary use of known phishing websites.
  • An object of the present invention is to provide a data source extension method and apparatus that improve the secondary utilization rate of known phishing websites, expand the detection range, and effectively reduce the lag and the dependence on manual reporting in phishing discovery.
  • The technical solutions are as follows:
  • The present invention provides a data source extension method, the method comprising:
  • Acquiring all known uniform resource locator data, where all the known uniform resource locator data includes at least the uniform resource locator data of known phishing websites;
  • Performing a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;
  • Extending each of the uniform resource locator templates to obtain the uniform resource locator data, corresponding to each of the uniform resource locator templates, that can be regarded as phishing websites.
  • The method further includes:
  • Sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity is adjacent in the sorting.
  • Sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity is adjacent in the sorting includes:
  • Sorting the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen, in turn, by length and by alphabetical order.
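The sorting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sort_for_adjacency` is hypothetical, the hyphen-free group is placed first (the text does not fix the order of the two groups), and "alphabetical order" is approximated by Python's default string comparison.

```python
def sort_for_adjacency(domains, hyphen="-"):
    """Split domains by presence of the preset hyphen, then sort each group
    by length and, within equal length, by string order."""
    def length_then_alpha(domain):
        return (len(domain), domain)

    without_hyphen = sorted((d for d in domains if hyphen not in d), key=length_then_alpha)
    with_hyphen = sorted((d for d in domains if hyphen in d), key=length_then_alpha)
    # Assumption: hyphen-free entries come first; the source does not say which group leads.
    return without_hyphen + with_hyphen


com_list = ["abc.com", "a-c.com", "afgc.com", "agbc.com",
            "acc.com", "g2-bc.com", "g-abb.com", "1bc.com"]
print(sort_for_adjacency(com_list))
```

Sorting by length first matters because only entries of equal length are compared position by position, so equal-length neighbours are the ones that can yield a template.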
  • Performing the pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates includes:
  • When the character at the j-th position in one of the two uniform resource locator data is the preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;
  • The uniform resource locator data obtained after all the differing characters have been replaced is the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.
  • Extending each uniform resource locator template to obtain the uniform resource locator data, corresponding to each uniform resource locator template, that can be regarded as phishing websites includes:
  • Extending the retained uniform resource locator templates, where the extension process includes: sequentially replacing the first preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates.
  • The present invention also provides a data source extension apparatus, the apparatus comprising:
  • an obtaining unit configured to acquire all known uniform resource locator data, where all the known uniform resource locator data includes at least the uniform resource locator data of known phishing websites;
  • a comparison unit configured to perform a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;
  • an extension unit configured to extend each of the uniform resource locator templates to obtain the uniform resource locator data, corresponding to each of the uniform resource locator templates, that can be regarded as phishing websites.
  • The apparatus further comprises:
  • a list forming unit configured to acquire the second-level domain name of each uniform resource locator data, to form a second-level domain name collection list;
  • a classification unit configured to classify according to the top-level domain names in the second-level domain name collection list, to obtain sub-level second-level domain name collection lists having different top-level domain names;
  • a sorting unit configured to sort the uniform resource locator data in each of the sub-level second-level domain name collection lists, so that uniform resource locator data with higher similarity is adjacent in the sorting.
  • The sorting unit comprises:
  • a classification subunit configured to classify the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data including the preset hyphen and uniform resource locator data not including the preset hyphen;
  • a sorting subunit configured to sort the uniform resource locator data including the preset hyphen and the uniform resource locator data not including the preset hyphen, in turn, by length and by alphabetical order.
  • The comparison unit comprises:
  • a comparison subunit configured to compare, in sequence, the characters at each position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data when the two have the same length;
  • an obtaining subunit configured to acquire, when the characters at the j-th position are different, the type of the character at the j-th position in the i-th uniform resource locator data and in the (i+1)-th uniform resource locator data;
  • a first replacement subunit configured to replace the character at the j-th position with the first preset replacement symbol when the type of the character at the j-th position in both the i-th uniform resource locator data and the (i+1)-th uniform resource locator data is a numeric type;
  • a second replacement subunit configured to replace the character at the j-th position with the second preset replacement symbol when the type of the character at the j-th position in both the i-th uniform resource locator data and the (i+1)-th uniform resource locator data is a letter type;
  • a third replacement subunit configured to, when the type of the character at the j-th position in the i-th uniform resource locator data and the type of the character at the j-th position in the (i+1)-th uniform resource locator data are different, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;
  • a fourth replacement subunit configured to, when the character at the j-th position in the i-th uniform resource locator data or in the (i+1)-th uniform resource locator data is the preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;
  • a configuration subunit configured to take the uniform resource locator data obtained after all the differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.
  • The extension unit comprises:
  • a statistics subunit configured to count the number of occurrences of each uniform resource locator template, to obtain an ordered list of uniform resource locator templates;
  • a retaining subunit configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;
  • an extension subunit configured to extend the retained uniform resource locator templates, where the extension process includes: sequentially replacing the first preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the uniform resource locator template with every character of the type corresponding to the second preset replacement symbol;
  • a deduplication subunit configured to deduplicate the extended uniform resource locator data against all the known uniform resource locator data, to obtain the uniform resource locator data that can be regarded as phishing websites.
  • The foregoing technical solution provided by the present invention can obtain uniform resource locator templates based on all known uniform resource locator data and extend those templates to obtain the uniform resource locator data, corresponding to each uniform resource locator template, that can be regarded as phishing websites. Phishing websites are thereby acquired autonomously, which effectively reduces the lag and the dependence on manual reporting in phishing discovery.
  • In addition, the detection range can be expanded and losses can be reduced, and because the uniform resource locator data of known phishing websites serves as the basis for extension, the secondary utilization rate of known phishing websites is improved.
  • FIG. 1 is a flowchart of a data source detecting method according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of obtaining a URL template in the data source detecting method shown in FIG. 1;
  • FIG. 3 is a flowchart of a URL template extension in the data source detection method shown in FIG. 1;
  • FIG. 4 is another flowchart of a data source detecting method according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of obtaining a URL template according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a comparison unit in the data source detecting apparatus shown in FIG. 6;
  • FIG. 8 is a schematic structural diagram of an extension unit in the data source detecting apparatus shown in FIG. 6;
  • FIG. 9 is another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention.
  • The embodiment of the present invention provides a data source extension method for actively obtaining URL data that can be regarded as phishing websites and for improving the secondary utilization of the URL data of known phishing websites.
  • FIG. 1 is a flowchart of a data source extension method provided by an embodiment of the present invention, which is used to actively acquire URL data that can be regarded as phishing websites and to improve the secondary utilization of the URL data of known phishing websites. The method may specifically include the following steps:
  • All known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data extension, thereby improving the secondary utilization rate of that URL data, for example by extending based on www.1oo86.cn. Of course, in the embodiment of the present invention, the URL data of other known legitimate websites may also be extended, for example based on www.360.com.
  • The process of obtaining a URL template and of extending each URL template in the embodiment of the present invention is described in detail below with reference to the accompanying drawings.
  • The process of obtaining a URL template provided by an embodiment of the present invention may include the following steps:
  • The first preset replacement symbol is a symbol set in advance to replace the corresponding character in the URL data. When the type of the character at the j-th position in both URL data is a numeric type, the first preset replacement symbol is used for the replacement. For example, if the first preset replacement symbol is "#", the character at the j-th position is replaced with "#". Of course, the first preset replacement symbol may also be another symbol, to be determined by the actual application.
  • The second preset replacement symbol is a symbol set in advance to replace the corresponding character in the URL data. When the type of the character at the j-th position in both URL data is a letter type, the second preset replacement symbol is used for the replacement. For example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". Of course, the second preset replacement symbol may also be another symbol, to be determined by the actual application.
  • If the type of the character at the j-th position in the i-th URL data is a numeric type, the character at the j-th position is replaced with the first preset replacement symbol; if the type of the character at the j-th position in the i-th URL data is a letter type, the character at the j-th position is replaced with the second preset replacement symbol.
  • For example, for the URL data www.g2-bc.com and www.g-abb.com above, the characters at the 6th position are the digit 2 in one and the preset hyphen "-" in the other, so the preset replacement symbol corresponding to a digit, that is, the first preset replacement symbol, replaces the character at the 6th position.
  • The characters at the 7th position are the preset hyphen "-" in one and the letter a in the other, so the character at the 7th position is replaced with the preset replacement symbol corresponding to a letter, that is, the second preset replacement symbol.
  • The URL data obtained after all the differing characters have been replaced is the URL template corresponding to the i-th URL data and the (i+1)-th URL data. For example, the URL template for the two URL data www.g2-bc.com and www.g-abb.com above is www.g#@b@.com.
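The replacement rules described above can be sketched as a single character-by-character pass. This is a minimal illustration, not the patent's implementation: the names `symbol_for` and `make_template` are hypothetical, and the fallback of keeping a character that is neither a digit nor a letter is an assumption the text does not cover.

```python
DIGIT_SYMBOL = "#"   # first preset replacement symbol (numeric type)
LETTER_SYMBOL = "@"  # second preset replacement symbol (letter type)
HYPHEN = "-"         # preset hyphen


def symbol_for(ch):
    """Return the replacement symbol for a character's type, or None if it has none."""
    if ch.isdigit():
        return DIGIT_SYMBOL
    if ch.isalpha():
        return LETTER_SYMBOL
    return None


def make_template(first, second):
    """Build a template from two URL strings; return None when their lengths differ."""
    if len(first) != len(second):
        return None
    out = []
    for a, b in zip(first, second):
        if a == b:
            out.append(a)  # identical characters are kept as they are
        elif a == HYPHEN:
            # Hyphen rule: the character that is not the hyphen decides the symbol.
            out.append(symbol_for(b) or b)
        else:
            # Same type, differing types, or b is the hyphen: the type of the
            # character in the first (i-th) URL decides the symbol.
            out.append(symbol_for(a) or a)  # assumption: untyped characters are kept
    return "".join(out)


print(make_template("www.g2-bc.com", "www.g-abb.com"))
```

Run on the two URL data from the example, the sketch yields the template www.g#@b@.com given in the text.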
  • The process of extending each URL template is shown in FIG. 3 and may include the following steps:
  • The preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of occurrences of the URL template. With charvalue as the preset maximum number of preset replacement symbols in a URL template and numvalue as the preset threshold on the number of times a URL template appears, the ordered list of URL templates is traversed and the number of templates is controlled by the following condition:
  • If the number of occurrences of the URL template is not less than the value of numvalue, the URL template is retained; otherwise it is deleted.
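A hedged sketch of this filtering step follows. It assumes charvalue caps the number of replacement symbols in a template and numvalue is the minimum number of occurrences (the reading consistent with "not less than numvalue"); the function name `keep_template` and the sample thresholds are illustrative choices, not values fixed by the text.

```python
def keep_template(template, occurrences, charvalue=2, numvalue=1, symbols="#@"):
    """Retain a template only if it has few enough placeholders and occurs often enough."""
    symbol_count = sum(template.count(s) for s in symbols)
    return symbol_count <= charvalue and occurrences >= numvalue


# Ordered template list as (template, occurrences) pairs.
ordered = [("a@c.com", 2), ("a@@c.com", 1), ("g#@b@.com", 1), ("#bc.com", 1)]
retained = [template for template, count in ordered if keep_template(template, count)]
print(retained)
```

Capping the placeholder count keeps the later extension step small, since every extra "#" multiplies the output by 10 and every extra "@" by 26.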
  • The retained URL templates are extended, where the extension process includes: sequentially replacing the first preset replacement symbol in the URL template with every character of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the URL template with every character of the type corresponding to the second preset replacement symbol, to obtain the extended URL data corresponding to each URL template.
  • Taking "#" as the first preset replacement symbol and "@" as the second preset replacement symbol as an example, each first preset replacement symbol in the URL template is replaced in turn with the 10 digits 0-9, and each second preset replacement symbol in the URL template is replaced in turn with the 26 English letters a to z. After every preset replacement symbol in the URL template has been replaced, a plurality of URL data corresponding to each URL template is obtained.
  • The reason for this replacement is that the first preset replacement symbol and the second preset replacement symbol in a URL template each correspond to a character type, and this correspondence reflects the types of characters into which the URL data of phishing websites is easily falsified. That is, the embodiment of the present invention derives the easily falsified character types by statistics over the URL data of phishing websites, so that the obtained URL templates and the extended URL data conform to the way phishing websites tamper with URL data. This makes the URL templates and the extended URL data highly targeted, so that more accurate URL data can be obtained from less data, and the extended URL data can be used as a data source for phishing detection, improving versatility.
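The extension step amounts to a Cartesian product over the placeholder positions. The sketch below assumes "#" stands for the digits 0-9 and "@" for the lowercase letters a-z, as in the example; the name `expand_template` is hypothetical.

```python
from itertools import product
from string import ascii_lowercase, digits

# Characters each preset replacement symbol can stand for.
ALPHABETS = {"#": digits, "@": ascii_lowercase}


def expand_template(template):
    """Return every URL string obtained by filling each placeholder with each candidate."""
    # A literal character contributes a single choice (itself).
    choices = [ALPHABETS.get(ch, ch) for ch in template]
    return ["".join(combo) for combo in product(*choices)]


print(expand_template("#bc.com"))
print(len(expand_template("a@@c.com")))
```

A template with k "#" symbols and m "@" symbols expands to 10^k × 26^m strings, which is one reason to limit the number of placeholders before extending.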
  • The embodiment of the present invention can obtain URL templates based on all known URL data and extend those templates to obtain the URL data, corresponding to each URL template, that can be regarded as phishing websites. Phishing websites are thereby acquired autonomously, which effectively reduces the lag and the dependence on manual reporting in phishing discovery.
  • In addition, the detection range can be expanded and losses can be reduced, and because the URL data of known phishing websites serves as the basis for extension, the secondary utilization rate of known phishing websites is improved.
  • The data source detection method provided by the embodiment of the present invention may further sort the URL data after it is obtained, so that URL data with higher similarity is adjacent. URL data with higher similarity can then be gathered together to determine, by statistics, which types of characters legitimate URL data is most often falsified into, so that the URL data can be extended in a targeted manner.
  • FIG. 4 shows another flowchart of the data source detecting method provided by the embodiment of the present invention, which may include the following steps:
  • All known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data extension, thereby improving the secondary utilization rate of that URL data, for example by extending based on www.1oo86.cn. Of course, in the embodiment of the present invention, the URL data of other known legitimate websites may also be extended, for example based on www.360.com.
  • Steps 405 and 406 differ from steps 102 and 103 above only in that the URL templates are obtained based on the second-level domain name corresponding to each URL data in the sub-level second-level domain name collection lists; the replacement and extension methods are the same.
  • The process of obtaining a URL template is:
  • The current value of the variable domain2 is first assigned to the variable domain1, then the next line is read in sequence and its value is assigned to the variable domain2.
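The domain1/domain2 hand-off describes a sliding window of width two over the sorted list. A minimal sketch, assuming the list is read line by line from any iterable; the generator name `adjacent_pairs` is hypothetical.

```python
def adjacent_pairs(lines):
    """Yield (domain1, domain2) for each pair of neighbouring entries."""
    iterator = iter(lines)
    domain2 = next(iterator, None)  # prime the window with the first line, if any
    for line in iterator:
        domain1 = domain2  # the current domain2 becomes domain1
        domain2 = line     # the next line read becomes domain2
        yield domain1, domain2


for domain1, domain2 in adjacent_pairs(["1bc.com", "abc.com", "acc.com"]):
    print(domain1, domain2)
```

Each yielded pair is what the comparison step would receive; because the list is sorted first, only neighbours need to be compared rather than every possible pair.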
  • The data source detection method provided by the embodiment of the present invention is described below, taking "-" as the preset hyphen, "#" as the first preset replacement symbol, and "@" as the second preset replacement symbol as an example.
  • All known URL data is the URL data of known phishing websites, as shown in Table 1.
  • Table 1: URL data of the phishing websites
  • www.abc.com, www.a-c.com, mg.afgc.com, tg.agm.net, www.agbc.com, m.acc.com, www.g2-bc.com, www.g-abb.com, wap.abc.net, www.1bc.com
  • The list of second-level domain names obtained from the URL data in Table 1 above is: abc.com, a-c.com, afgc.com, agm.net, agbc.com, acc.com, g2-bc.com, g-abb.com, abc.net, 1bc.com
  • .net list: abc.net, agm.net
  • An ordered list of URL templates is obtained: a@c.com (2 times), a@@c.com (1 time), g#@b@.com (1 time), #bc.com (1 time).
  • The preset condition is that the number of # and @ symbols in a URL template is not more than two and the number of occurrences of the URL template is not less than one.
  • The URL templates retained in the URL template list are: a@c.com, a@@c.com, #bc.com.
  • The above three URL templates are extended, taking #bc.com as an example.
  • The extended URL data includes: 0bc.com, 1bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
  • The above extended URL data is deduplicated against the URL data 1bc.com of the known phishing website (1bc.com is repeated), and the final extended URL data that can be regarded as phishing websites is: 0bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
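The final deduplication can be sketched as an order-preserving set difference. The function name `deduplicate` is hypothetical, and the known list below is simply the set of second-level domains from the example.

```python
def deduplicate(extended, known):
    """Drop extended entries that are already known or repeated, keeping their order."""
    known_set = set(known)
    seen = set()
    result = []
    for url in extended:
        if url not in known_set and url not in seen:
            seen.add(url)
            result.append(url)
    return result


extended = [f"{d}bc.com" for d in "0123456789"]  # extension of the template #bc.com
known = ["abc.com", "a-c.com", "afgc.com", "agm.net", "agbc.com",
         "acc.com", "g2-bc.com", "g-abb.com", "abc.net", "1bc.com"]
candidates = deduplicate(extended, known)
print(candidates)
```

Only 1bc.com is removed, leaving the nine candidate domains listed in the example.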
  • The embodiment of the present invention further provides a data source extension apparatus, a schematic structural diagram of which is shown in FIG. 6. The apparatus may include: an obtaining unit 11, a comparing unit 12, and an extension unit 13.
  • The obtaining unit 11 is configured to obtain all known URL data, where all known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data extension, thereby improving the secondary utilization rate of that URL data, for example by extending based on www.1oo86.cn. Of course, in the embodiment of the present invention, the URL data of other known legitimate websites may also be extended, for example based on www.360.com.
  • The comparing unit 12 is configured to perform a pairwise comparison on all known URL data to obtain a plurality of URL templates.
  • The reason all known URL data is compared pairwise is that multiple URL data may correspond to one URL template. Pairwise comparison makes it convenient to count the number of occurrences of each URL template, so that extension based on the URL templates is more targeted.
  • The comparing unit 12 may obtain the plurality of URL templates by using the structure shown in FIG. 7, where the comparing unit 12 may include: a comparison subunit 121, a recording subunit 122, an obtaining subunit 123, a first replacement subunit 124, a second replacement subunit 125, a third replacement subunit 126, a fourth replacement subunit 127, and a configuration subunit 128.
  • The comparison subunit 121 is configured to compare, in sequence, the characters at each position in the i-th URL data and the (i+1)-th URL data when the lengths of the two are the same, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of URL data.
  • The obtaining subunit 123 is configured to acquire the type of the character at the j-th position in the i-th URL data and in the (i+1)-th URL data when the characters at the j-th position are different.
  • The first replacement subunit 124 is configured to replace the character at the j-th position with the first preset replacement symbol when the type of the character at the j-th position in both the i-th URL data and the (i+1)-th URL data is a numeric type.
  • The first preset replacement symbol is a symbol set in advance to replace the corresponding character in the URL data. When the type of the character at the j-th position in both URL data is a numeric type, the first preset replacement symbol is used for the replacement. For example, if the first preset replacement symbol is "#", the character at the j-th position is replaced with "#". Of course, the first preset replacement symbol may also be another symbol, to be determined by the actual application.
  • The second replacement subunit 125 is configured to replace the character at the j-th position with the second preset replacement symbol when the type of the character at the j-th position in both the i-th URL data and the (i+1)-th URL data is a letter type.
  • The second preset replacement symbol is a symbol set in advance to replace the corresponding character in the URL data. When the type of the character at the j-th position in both URL data is a letter type, the second preset replacement symbol is used for the replacement. For example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". Of course, the second preset replacement symbol may also be another symbol, to be determined by the actual application.
  • The third replacement subunit 126 is configured to, when the type of the character at the j-th position in the i-th URL data and the type of the character at the j-th position in the (i+1)-th URL data are different, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th URL data.
  • If the type of the character at the j-th position in the i-th URL data is a numeric type, the character at the j-th position is replaced with the first preset replacement symbol; if the type of the character at the j-th position in the i-th URL data is a letter type, the character at the j-th position is replaced with the second preset replacement symbol.
  • The fourth replacement subunit 127 is configured to, when the character at the j-th position in the i-th URL data or in the (i+1)-th URL data is the preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen.
  • For example, for the URL data www.g2-bc.com and www.g-abb.com above, the characters at the 6th position are the digit 2 in one and the preset hyphen "-" in the other, so the preset replacement symbol corresponding to a digit, that is, the first preset replacement symbol, replaces the character at the 6th position.
  • The characters at the 7th position are the preset hyphen "-" in one and the letter a in the other, so the character at the 7th position is replaced with the preset replacement symbol corresponding to a letter, that is, the second preset replacement symbol.
  • The configuration subunit 128 is configured to take the URL data obtained after all the differing characters have been replaced as the URL template corresponding to the i-th URL data and the (i+1)-th URL data. For example, the URL template for the two URL data www.g2-bc.com and www.g-abb.com above is www.g#@b@.com.
  • The extension unit 13 is configured to extend each URL template to obtain the URL data, corresponding to each URL template, that can be regarded as phishing websites.
  • The extension unit 13 may extend each URL template by using the structure shown in FIG. 8, where the extension unit 13 includes: a statistics subunit 131, a retaining subunit 132, an extension subunit 133, and a deduplication subunit 134.
  • The statistics subunit 131 is configured to count the occurrences of the URL templates to obtain an ordered list of URL templates. Counting here means counting the number of occurrences of each URL template and then merging identical URL templates, which reduces the number of URL templates.
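Counting and merging identical templates maps naturally onto a frequency counter. The sketch below is illustrative: the function name `ordered_template_list` is hypothetical, and ordering by descending number of occurrences is an assumption, since the text only says the list is ordered.

```python
from collections import Counter


def ordered_template_list(templates):
    """Merge identical templates and return (template, occurrences) pairs,
    most frequent first."""
    return Counter(templates).most_common()


templates = ["#bc.com", "a@c.com", "a@@c.com", "a@c.com", "g#@b@.com"]
for template, occurrences in ordered_template_list(templates):
    print(template, occurrences)
```

The merged list has one entry per distinct template, so the later filtering and extension steps never process the same template twice.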
  • The retaining subunit 132 is configured to retain the URL templates in the URL template list that meet the preset condition. The URL templates in the URL template list are compared with the preset condition, some of them are deleted, and the URL templates that meet the preset condition are retained as the templates finally used for extension, which further reduces the number of URL templates.
  • The preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of occurrences of the URL template. With charvalue as the preset maximum number of preset replacement symbols in a URL template and numvalue as the preset threshold on the number of times a URL template appears, the ordered list of URL templates is traversed and the number of templates is controlled by the following condition:
  • If the number of occurrences of the URL template is not less than the value of numvalue, the URL template is retained; otherwise it is deleted.
  • The extension subunit 133 is configured to extend the retained URL templates, where the extension process includes: sequentially replacing the first preset replacement symbol in the URL template with every character of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the URL template with every character of the type corresponding to the second preset replacement symbol, to obtain the extended URL data corresponding to each URL template.
  • Taking "#" as the first preset replacement symbol and "@" as the second preset replacement symbol as an example, each first preset replacement symbol in the URL template is replaced in turn with the 10 digits 0-9, and each second preset replacement symbol in the URL template is replaced in turn with the 26 English letters a to z. After every preset replacement symbol in the URL template has been replaced, a plurality of URL data corresponding to each URL template is obtained.
  • The reason for this replacement is that the first preset replacement symbol and the second preset replacement symbol in a URL template each correspond to a character type, and this correspondence reflects the types of characters into which the URL data of phishing websites is easily falsified. That is, the embodiment of the present invention derives the easily falsified character types by statistics over the URL data of phishing websites, so that the obtained URL templates and the extended URL data conform to the way phishing websites tamper with URL data. This makes the URL templates and the extended URL data highly targeted, so that more accurate URL data can be obtained from less data, and the extended URL data can be used as a data source for phishing detection, improving versatility.
  • The deduplication subunit 134 is configured to deduplicate the extended URL data against all known URL data to obtain the URL data that can be regarded as phishing websites.
  • the embodiment of the present invention can obtain a URL template based on all known URL data, and expand the URL template, and obtain URL data corresponding to each URL template that can be regarded as a phishing website, and realize fishing.
  • the website's self-acquisition is effective, which reduces the lag and artificial dependence of phishing discovery.
  • the detection range can be expanded, the loss of profit can be reduced, and the URL data of the known phishing website can be expanded as a basis, thereby improving the secondary utilization rate of the known phishing website.
  • the data source detecting apparatus may further sort the URL data after acquiring it, so that URL data with higher similarity is adjacent. Highly similar URL data can then be gathered together to count which types of characters legitimate URL data is most often altered into, so that the URL data can be extended in a targeted manner.
  • FIG. 9 shows another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention. On the basis of FIG. 6, the apparatus may further include: a list forming unit 14, a classification unit 15, and a sorting unit 16.
  • the list forming unit 14 is configured to obtain the second-level domain name of each URL data and form a second-level domain name collection list. For example, the second-level domain name of “www.abc.com” is “abc.com”; the second-level domain name of each URL is stored in a list to form the second-level domain name collection list.
  • the classification unit 15 is configured to classify the second-level domain name collection list according to top-level domain name, and obtain sub-level second-level domain name collection lists having different top-level domain names. For example, if the top-level domain names of “www.abc.com” and “www.efg.com” are both “.com”, the two URL data are stored in the sub-level second-level domain name collection list corresponding to “.com”.
  • the sorting unit 16 is configured to sort the URL data in each of the sub-level second-level domain name collection lists, so that the URL data with higher similarity is adjacent in the sorting.
  • the sorting unit includes a classification subunit and a sorting subunit, wherein the classification subunit is configured to classify the URL data in each sub-level second-level domain name collection list based on a preset hyphen, and obtain URL data that contains the preset hyphen and URL data that does not contain the preset hyphen.
  • the sorting subunit is configured to sort the URL data containing the preset hyphen and the URL data not containing the preset hyphen by length and then in alphabetical order, so that highly similar URL data can be gathered together to count which types of characters legitimate URL data is most often altered into, and the URL data can then be extended in a targeted manner.
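The hyphen-based classification and the length-then-alphabetical ordering described above can be illustrated with a minimal Python sketch. The function name `sort_for_similarity`, the value of `HYPHEN`, and the sample domains are assumptions made for illustration; the text only requires that the two groups be sorted by length and alphabetical order, and does not say whether a full URL or only its second-level domain name is the sort key.

```python
HYPHEN = "-"  # assumed value of the "preset hyphen"


def sort_for_similarity(domains):
    """Split domains by presence of the hyphen, then sort each group.

    Each group is ordered by length first and alphabetically second, so that
    strings of equal length (the only ones compared pairwise later) sit next
    to each other.
    """
    with_hyphen = [d for d in domains if HYPHEN in d]
    without_hyphen = [d for d in domains if HYPHEN not in d]

    def by_length_then_alpha(d):
        return (len(d), d)

    return (sorted(with_hyphen, key=by_length_then_alpha),
            sorted(without_hyphen, key=by_length_then_alpha))


hyphenated, plain = sort_for_similarity(
    ["g-abb.com", "abc.com", "g2-bc.com", "1oo86.com", "abd.com"]
)
print(hyphenated)  # ['g-abb.com', 'g2-bc.com']
print(plain)       # ['abc.com', 'abd.com', '1oo86.com']
```

Sorting by length first keeps candidates for the later position-by-position comparison adjacent, since only URL data of equal length is compared.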

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and a device for extending a data source, the method comprising: obtaining uniform resource locator (URL) templates on the basis of all known uniform resource locator data, and extending the uniform resource locator templates to obtain, for each uniform resource locator template, uniform resource locator data that can be regarded as a phishing website. Phishing websites are thus acquired automatically and proactively, which effectively reduces the lag and manual dependence of phishing discovery. In this manner the detection range can be expanded and financial losses reduced, and because the extension is based on the uniform resource locator data of known phishing websites, the secondary utilization rate of known phishing websites is improved.

Description

Data source extension method and device
The present application claims priority to Chinese patent application No. 201610911941.X, entitled "Data source extension method and device", filed with the Chinese Patent Office on October 19, 2016, the entire contents of which are incorporated herein by reference.
Technical field
The present invention belongs to the field of Internet security detection technologies, and more particularly relates to a data source extension method and device.
Background
As an important part of modern life, the Internet is widely used by all kinds of groups and organizations for online trade, services, and similar matters, which also makes it more vulnerable to security attacks from all sides. Phishing, for example, is a form of security attack that creates a phishing website by imitating the page content of a legitimate website and induces users to visit it in order to steal personal private information such as user names, bank account numbers, and passwords.
With the rapid development of the Internet, and driven by profit, the underground industry chain engaged in phishing attacks is steadily growing. Detection methods for phishing websites therefore play an increasingly important role in the secure operation of enterprises in fields such as e-commerce and financial securities.
At present, detection methods for phishing websites focus mainly on detection algorithms, that is, on researching efficient and accurate algorithms that examine websites in order to find phishing websites among a large number of websites. As for the data source that these detection methods target (that is, possible phishing websites), its discovery relies on reports from Internet users. Under this approach the detection of phishing websites is passive and lacks any capability for proactive discovery, and the secondary utilization rate of known phishing websites is low.
Summary of the invention
In view of this, an object of the present invention is to provide a data source extension method and device for improving the secondary utilization rate of known phishing websites, expanding the detection range, and effectively reducing the lag and manual dependence of phishing discovery. The technical solutions are as follows:
The present invention provides a data source extension method, the method comprising:
obtaining all known uniform resource locator data, wherein all the known uniform resource locator data includes at least the uniform resource locator data of known phishing websites;
performing pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;
extending each of the uniform resource locator templates to obtain, for each uniform resource locator template, the corresponding uniform resource locator data that can be regarded as a phishing website.
Preferably, after obtaining all the known uniform resource locator data and before performing the pairwise comparison on all the known uniform resource locator data, the method further comprises:
obtaining the second-level domain name of each uniform resource locator data to form a second-level domain name collection list;
classifying the second-level domain name collection list according to top-level domain name, to obtain sub-level second-level domain name collection lists having different top-level domain names;
sorting the uniform resource locator data in each sub-level second-level domain name collection list, so that uniform resource locator data with higher similarity is adjacent in the sorting.
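As a rough illustration of the first two steps above, the following Python sketch extracts a second-level domain name and groups the results by top-level domain name. The helper names and the sample hosts are hypothetical. The sketch also assumes that the second-level domain name is simply the last two dot-separated labels of the host, which matches the "www.abc.com" to "abc.com" example elsewhere in this document but would not handle multi-label suffixes such as ".com.cn".

```python
from collections import defaultdict


def second_level_domain(host):
    """Return the last two labels of a host name (naive, assumed rule)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def group_by_tld(hosts):
    """Build one sub-list of second-level domains per top-level domain."""
    groups = defaultdict(list)
    for host in hosts:
        sld = second_level_domain(host)
        tld = "." + sld.rsplit(".", 1)[-1]
        groups[tld].append(sld)
    return dict(groups)


groups = group_by_tld(["www.abc.com", "www.efg.com", "www.10086.cn"])
print(groups)  # {'.com': ['abc.com', 'efg.com'], '.cn': ['10086.cn']}
```

Each value of `groups` plays the role of one sub-level second-level domain name collection list, ready to be sorted.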
Preferably, sorting the uniform resource locator data in each sub-level second-level domain name collection list, so that uniform resource locator data with higher similarity is adjacent in the sorting, comprises:
classifying the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data that contains the preset hyphen and uniform resource locator data that does not contain the preset hyphen;
sorting the uniform resource locator data that contains the preset hyphen and the uniform resource locator data that does not contain the preset hyphen by length and then in alphabetical order.
Preferably, performing the pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates comprises:
when the i-th uniform resource locator data and the (i+1)-th uniform resource locator data have the same length, comparing in turn the characters at each position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;
when the characters at the j-th position are the same, recording the character at the j-th position and continuing to compare the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data;
when the characters at the j-th position are different, obtaining the types of the characters at the j-th position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data;
when the characters at the j-th position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the digit type, replacing the character at the j-th position with a first preset replacement symbol;
when the characters at the j-th position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the letter type, replacing the character at the j-th position with a second preset replacement symbol;
when the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the (i+1)-th uniform resource locator data, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;
when the character at the j-th position in the i-th uniform resource locator data or the (i+1)-th uniform resource locator data is a preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;
taking the uniform resource locator data obtained after all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.
Preferably, extending each of the uniform resource locator templates to obtain, for each uniform resource locator template, the corresponding uniform resource locator data that can be regarded as a phishing website comprises:
counting the number of occurrences of the uniform resource locator templates to obtain an ordered uniform resource locator template list;
retaining the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;
extending the retained uniform resource locator templates, wherein the extension process comprises: replacing the first preset replacement symbol in the uniform resource locator template in turn with every character of the type corresponding to the first preset replacement symbol, and replacing the second preset replacement symbol in the uniform resource locator template in turn with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each uniform resource locator template;
deduplicating the extended uniform resource locator data against all the known uniform resource locator data, to obtain all the uniform resource locator data that can be regarded as phishing websites.
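The counting, retention, extension, and deduplication steps above might be sketched in Python as follows. Several points are assumptions rather than statements from the text: "#" and "@" are taken as the two preset replacement symbols, the "preset condition" is modelled as a hypothetical minimum occurrence count `min_count`, and a template with several replacement symbols is extended over every combination of substitutions (a Cartesian product), which is only one possible reading of "in turn".

```python
from collections import Counter
from itertools import product
import string

DIGIT_SYMBOL = "#"   # assumed first preset replacement symbol
LETTER_SYMBOL = "@"  # assumed second preset replacement symbol
CHOICES = {DIGIT_SYMBOL: string.digits, LETTER_SYMBOL: string.ascii_lowercase}


def rank_templates(templates, min_count=1):
    """Order templates by occurrence count and keep the frequent ones."""
    counts = Counter(templates)
    return [t for t, c in counts.most_common() if c >= min_count]


def expand_template(template):
    """Yield every URL obtained by filling each replacement symbol."""
    # Literal characters contribute a single-element pool.
    pools = [CHOICES.get(ch, ch) for ch in template]
    for combo in product(*pools):
        yield "".join(combo)


def extend(templates, known, min_count=1):
    """Expand the retained templates and drop URLs that are already known."""
    candidates = set()
    for template in rank_templates(templates, min_count):
        candidates.update(expand_template(template))
    return candidates - set(known)


templates = ["www.ab#.com", "www.ab#.com", "www.x@.cn"]
new_urls = extend(templates, known={"www.ab1.com"}, min_count=2)
print(sorted(new_urls)[:3])  # ['www.ab0.com', 'www.ab2.com', 'www.ab3.com']
```

Under the Cartesian-product reading, the candidate set grows quickly: a template with one digit symbol and two letter symbols yields 10 × 26 × 26 = 6760 strings, which is why retaining only frequent templates matters.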
In another aspect, the present invention further provides a data source extension device, the device comprising:
an obtaining unit, configured to obtain all known uniform resource locator data, wherein all the known uniform resource locator data includes at least the uniform resource locator data of known phishing websites;
a comparison unit, configured to perform pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;
an extension unit, configured to extend each of the uniform resource locator templates to obtain, for each uniform resource locator template, the corresponding uniform resource locator data that can be regarded as a phishing website.
Preferably, the device further comprises:
a list forming unit, configured to obtain the second-level domain name of each uniform resource locator data to form a second-level domain name collection list;
a classification unit, configured to classify the second-level domain name collection list according to top-level domain name, to obtain sub-level second-level domain name collection lists having different top-level domain names;
a sorting unit, configured to sort the uniform resource locator data in each sub-level second-level domain name collection list, so that uniform resource locator data with higher similarity is adjacent in the sorting.
Preferably, the sorting unit comprises:
a classification subunit, configured to classify the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data that contains the preset hyphen and uniform resource locator data that does not contain the preset hyphen;
a sorting subunit, configured to sort the uniform resource locator data that contains the preset hyphen and the uniform resource locator data that does not contain the preset hyphen by length and then in alphabetical order.
Preferably, the comparison unit comprises:
a comparison subunit, configured to, when the i-th uniform resource locator data and the (i+1)-th uniform resource locator data have the same length, compare in turn the characters at each position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;
a recording subunit, configured to, when the characters at the j-th position are the same, record the character at the j-th position and trigger the comparison subunit to continue comparing the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data;
an obtaining subunit, configured to, when the characters at the j-th position are different, obtain the types of the characters at the j-th position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data;
a first replacement subunit, configured to, when the characters at the j-th position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the digit type, replace the character at the j-th position with a first preset replacement symbol;
a second replacement subunit, configured to, when the characters at the j-th position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are both of the letter type, replace the character at the j-th position with a second preset replacement symbol;
a third replacement subunit, configured to, when the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the (i+1)-th uniform resource locator data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;
a fourth replacement subunit, configured to, when the character at the j-th position in the i-th uniform resource locator data or the (i+1)-th uniform resource locator data is a preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;
a configuration subunit, configured to take the uniform resource locator data obtained after all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.
Preferably, the extension unit comprises:
a statistics subunit, configured to count the number of occurrences of the uniform resource locator templates to obtain an ordered uniform resource locator template list;
a retention subunit, configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;
an extension subunit, configured to extend the retained uniform resource locator templates, wherein the extension process comprises: replacing the first preset replacement symbol in the uniform resource locator template in turn with every character of the type corresponding to the first preset replacement symbol, and replacing the second preset replacement symbol in the uniform resource locator template in turn with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each uniform resource locator template;
a deduplication subunit, configured to deduplicate the extended uniform resource locator data against all the known uniform resource locator data, to obtain all the uniform resource locator data that can be regarded as phishing websites.
Compared with the prior art, the above technical solutions provided by the present invention have the following advantages:
The above technical solutions can obtain uniform resource locator templates based on all known uniform resource locator data and extend those templates to obtain, for each uniform resource locator template, the corresponding uniform resource locator data that can be regarded as a phishing website. Phishing websites are thus acquired automatically and proactively, which effectively reduces the lag and manual dependence of phishing discovery. In this way the detection range can be expanded and financial losses reduced, and because the extension is based on the uniform resource locator data of known phishing websites, the secondary utilization rate of known phishing websites is improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data source detecting method according to an embodiment of the present invention;
FIG. 2 is a flowchart of obtaining a URL template in the data source detecting method shown in FIG. 1;
FIG. 3 is a flowchart of URL template extension in the data source detecting method shown in FIG. 1;
FIG. 4 is another flowchart of a data source detecting method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of obtaining a URL template according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the comparison unit in the data source detecting apparatus shown in FIG. 6;
FIG. 8 is a schematic structural diagram of the extension unit in the data source detecting apparatus shown in FIG. 6;
FIG. 9 is another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention.
Detailed description
Entering misspelled uniform resource locator (URL, Uniform Resource Locator) data in a browser is very common, and cybercriminals often exploit this to redirect user requests to phishing websites, a phenomenon known as "typosquatting". For phishing, cybercriminals typically register domain names similar to those of legitimate websites and then wait for users who mistype to visit, or exploit the visual similarity of URLs to induce users to click such lookalike URL links. For example, www.10086.cn is the official website of China Mobile, and cybercriminals might use phishing websites such as www.1oo86.cn (the letter "o" in place of the digit "0") or www.l0086.cn (the letter "l" in place of the digit "1") to deceive users into visiting. At present, the discovery of such phishing websites relies solely on reports from Internet users. To this end, an embodiment of the present invention provides a data source extension method that proactively obtains URL data that can be regarded as phishing websites and improves the secondary utilization rate of the URL data of known phishing websites.
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, a flowchart of a data source extension method according to an embodiment of the present invention is shown. The method is used to proactively obtain URL data that can be regarded as phishing websites and to improve the secondary utilization rate of the URL data of known phishing websites, and may specifically include the following steps:
101: Obtain all known URL data, where all the known URL data includes at least the URL data of known phishing websites. That is, in the embodiment of the present invention, data extension can be based at least on the URL data of known phishing websites, for example www.1oo86.cn, which improves the secondary utilization rate of that URL data. Of course, the extension may also be based on the URL data of other known legitimate websites, for example www.360.com.
102: Perform pairwise comparison on all the known URL data to obtain a plurality of URL templates. All the known URL data is compared pairwise because several URL data may correspond to the same URL template; pairwise comparison makes it easy to count how many times a given URL template occurs, so that the later extension based on URL templates is more targeted.
103: Extend each URL template to obtain, for each URL template, the corresponding URL data that can be regarded as a phishing website.
The process of obtaining URL templates and the process of extending each URL template in the embodiment of the present invention are described in detail below with reference to the drawings. FIG. 2 shows the process of obtaining a URL template according to an embodiment of the present invention, which may include the following steps:
1021: When the i-th URL data and the (i+1)-th URL data have the same length, compare in turn the characters at each position in the i-th URL data and the (i+1)-th URL data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of URL data.
Take www.g2-bc.com as the i-th URL data and www.g-abb.com as the (i+1)-th URL data as an example. A length comparison shows that the two URL data have the same length, so the characters at each position in them can be compared in turn. If the lengths of two URL data differ, other URL data continues to be obtained for comparison.
1022:当第j个位置处的字符相同时,记录下第j个位置处的字符,并继续比较下一个字符,j=1,2,……,n,n为第i个URL数据中字符总数。比如,第1至第4位置处的字符相同,则记录下这四个位置处的字符,继续比较第5位置处的字符。1022: When the characters at the jth position are the same, record the character at the jth position, and continue to compare the next character, j=1, 2, ..., n, n is the character in the i-th URL data. total. For example, if the characters at the first to fourth positions are the same, the characters at the four positions are recorded, and the characters at the fifth position are continuously compared.
1023:当第j个位置处的字符不同时,获取第i个URL数据和第i+1个URL数据中第j个位置处的字符的类型。1023: When the characters at the jth position are different, the type of the character at the jth position in the i-th URL data and the i+1th URL data is acquired.
1024:当第i个URL数据和第i+1个URL数据中第j个位置处的字符的类型为数字类型时,以第一预设替换符号替换第j个位置处的字符。1024: When the type of the character at the jth position in the i-th URL data and the i+1th URL data is a numeric type, the character at the j-th position is replaced with the first preset replacement symbol.
其中第一预设替换符号为预设用来替换URL数据中的对应字符的,当两个URL数据中第j个位置处的字符的类型均是数字类型,则会采用第一预设替换符号来替换,如第一预设替换符号可以为“#”,则会将第j个位置处的字符替换为“#”,当然第一预设替换符号还可以采用其他符号,具体可根据实际应用来确定。The first preset replacement symbol is preset to replace the corresponding character in the URL data. When the type of the character at the j-th position in the two URL data is a numeric type, the first preset replacement symbol is adopted. To replace, if the first preset replacement symbol can be "#", the character at the jth position will be replaced by "#". Of course, the first preset replacement symbol can also adopt other symbols, depending on the actual application. to make sure.
1025:当第i个URL数据和第i+1个URL数据中第j个位置处的字符的类型为字母类型时,以第二预设替换符号替换第j个位置处的字符。1025: When the type of the character at the jth position in the i-th URL data and the i+1th URL data is a letter type, the character at the j-th position is replaced with the second preset replacement symbol.
其中第二预设替换符号为预设用来替换URL数据中的对应字符的,当两个URL数据中第j个位置处的字符的类型均是字母类型,则会采用第二预设 替换符号来替换,如第二预设替换符号可以为“@”,则会将第j个位置处的字符替换为“@”,当然第二预设替换符号还可以采用其他符号,具体可根据实际应用来确定。The second preset replacement symbol is preset to replace the corresponding character in the URL data. When the type of the character at the jth position in the two URL data is a letter type, the second preset is adopted. Replace the symbol to replace, if the second preset replacement symbol can be "@", the character at the jth position will be replaced with "@", of course, the second preset replacement symbol can also adopt other symbols, depending on Practical application to determine.
1026: When the type of the character at the j-th position in the i-th URL data differs from the type of the character at the j-th position in the (i+1)-th URL data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th URL data.
For example, if the character at the j-th position in the i-th URL data is of the numeric type, the character at the j-th position is replaced with the first preset replacement symbol; if it is of the letter type, the character at the j-th position is replaced with the second preset replacement symbol.
1027: When the character at the j-th position in either the i-th URL data or the (i+1)-th URL data is the preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen.
Take the two URL data www.g2-bc.com and www.g-abb.com above as an example. At the 6th position one character is the digit 2 and the other is the preset hyphen "-", so the character at the 6th position is replaced with the preset replacement symbol corresponding to digits, namely the first preset replacement symbol. At the 7th position one character is the preset hyphen "-" and the other is the letter a, so the character at the 7th position is replaced with the preset replacement symbol corresponding to letters, namely the second preset replacement symbol.
1028: Once all differing characters have been replaced through the above steps, the URL data obtained after the replacement is the URL template corresponding to the i-th URL data and the (i+1)-th URL data. For example, the URL template of www.g2-bc.com and www.g-abb.com is www.g#@b@.com.
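The per-position rules of steps 1023 to 1028 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `symbol_for` and `url_template` are hypothetical, the symbols "#" and "@" are the example values from the text, and keeping a character that is neither a digit, a lowercase letter nor the hyphen is an assumption, since the text does not cover that case.

```python
import string

DIGIT_SYMBOL = "#"   # first preset replacement symbol (example value from the text)
LETTER_SYMBOL = "@"  # second preset replacement symbol (example value from the text)
HYPHEN = "-"         # preset hyphen


def symbol_for(ch):
    """Return the replacement symbol for a character's type, or None."""
    if ch in string.digits:
        return DIGIT_SYMBOL
    if ch in string.ascii_lowercase:
        return LETTER_SYMBOL
    return None


def url_template(first, second):
    """Build a template from two equal-length URLs, or None if lengths differ."""
    if len(first) != len(second):
        return None
    out = []
    for a, b in zip(first, second):
        if a == b:
            out.append(a)  # identical characters are recorded as-is
            continue
        # When the first URL has the hyphen, the other character decides the
        # type; otherwise the first URL's character decides, which also
        # covers the case where both characters have the same type.
        decider = b if a == HYPHEN else a
        # Assumption: a character outside digits, letters and the hyphen is
        # kept unchanged; the text does not describe this case.
        out.append(symbol_for(decider) or a)
    return "".join(out)


print(url_template("www.g2-bc.com", "www.g-abb.com"))  # www.g#@b@.com
```

Because the first URL decides the symbol when the types differ, the order of the two arguments matters: `url_template("1bc.com", "abc.com")` gives `#bc.com`, while swapping the arguments gives `@bc.com`.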
The process of expanding each URL template is shown in FIG. 3 and may include the following steps:
1031: Count the occurrences of the URL templates to obtain an ordered URL template list. The purpose of the count is to determine how many times each URL template occurs, so that identical URL templates can be merged to reduce the number of URL templates.
1032: Retain the URL templates in the URL template list that satisfy a preset condition. Each URL template in the list is checked against the preset condition; templates that fail are deleted, and those that satisfy it are retained as the URL templates finally used for expansion, which further reduces the number of URL templates.
In this embodiment of the present invention, the preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of times a URL template occurs. With charvalue as the preset maximum number of preset replacement symbols and numvalue as the preset threshold on the number of times a URL template occurs, the ordered URL template list is traversed and the number of templates is controlled by the following conditions:
if the total number of "@" and "#" symbols in a URL template is not greater than charvalue, the URL template is retained; otherwise it is deleted;
if the number of occurrences of a URL template is not less than numvalue, the URL template is retained; otherwise it is deleted.
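The counting and filtering of steps 1031 and 1032 might look like the sketch below. The function name `retain_templates` is hypothetical, and requiring both conditions to hold at once is one reading of the two rules above.

```python
from collections import Counter


def retain_templates(templates, charvalue, numvalue):
    """Merge identical templates, order them by count, and filter them."""
    counts = Counter(templates)  # merges identical templates
    kept = []
    # most_common() yields templates from most to least frequent.
    for template, occurrences in counts.most_common():
        symbols = template.count("@") + template.count("#")
        # Reading assumed here: a template must pass both conditions.
        if symbols <= charvalue and occurrences >= numvalue:
            kept.append(template)
    return kept


templates = ["g#@b@.com", "a@@c.com", "#bc.com", "a@c.com", "a@c.com"]
print(retain_templates(templates, charvalue=2, numvalue=1))
# ['a@c.com', 'a@@c.com', '#bc.com']
```

With `charvalue=2` the template `g#@b@.com` is dropped because it contains three replacement symbols; raising `numvalue` to 2 would keep only `a@c.com`.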
1033: Expand the retained URL templates. The expansion process includes: replacing the first preset replacement symbol in the URL template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the URL template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the expanded URL data corresponding to each URL template.
Take "#" as the first preset replacement symbol and "@" as the second preset replacement symbol. Each first preset replacement symbol in a URL template is replaced in turn with the 10 digits 0 to 9, and each second preset replacement symbol is replaced in turn with the 26 English letters a to z. After every preset replacement symbol in the URL template has been replaced, multiple URL data corresponding to each URL template are obtained.
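A short sketch of this expansion, assuming "#" and "@" as the two symbols; the name `expand_template` and the `CANDIDATES` table are illustrative only.

```python
import itertools
import string

# Characters that may fill each replacement symbol.
CANDIDATES = {"#": string.digits, "@": string.ascii_lowercase}


def expand_template(template):
    """Return every URL obtained by filling each placeholder in the template."""
    # Each position contributes either its candidate characters or itself.
    choices = [CANDIDATES.get(ch, ch) for ch in template]
    return ["".join(combo) for combo in itertools.product(*choices)]


urls = expand_template("#bc.com")
print(len(urls), urls[0], urls[-1])  # 10 0bc.com 9bc.com
```

A template with k "#" symbols and m "@" symbols expands to 10^k × 26^m URLs (for example 676 for `a@@c.com`), which is one reason to bound the number of replacement symbols per template.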
The replacement is done this way because the first and second preset replacement symbols in a URL template correspond to character types, and this correspondence reflects which types of characters the URL data of phishing websites tends to be tampered into. In other words, the embodiment of the present invention derives, from statistics over the URL data of phishing websites, which types of characters are likely to be substituted, so that the resulting URL templates and expanded URL data match the way phishing URL data is tampered with. The URL templates and expanded URL data are therefore well targeted, relatively accurate URL data can be obtained from a small amount of data, and each expanded URL data can serve as a data source for phishing detection, which improves generality.
1034: Deduplicate the expanded URL data against all known URL data to obtain all URL data that can be regarded as belonging to phishing websites.
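Step 1034 can be read as removing, from the expanded URL data, every entry that is already known. A minimal sketch under that reading follows; the name `remove_known` is hypothetical, and dropping repeats within the expanded data itself is an added assumption.

```python
def remove_known(expanded, known):
    """Drop URLs already known and repeated URLs, keeping first-seen order."""
    seen = set(known)
    result = []
    for url in expanded:
        if url not in seen:
            seen.add(url)  # also suppresses repeats within the expanded data
            result.append(url)
    return result


expanded = [f"{d}bc.com" for d in range(10)]
print(remove_known(expanded, known=["1bc.com", "abc.com"]))
```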
As can be seen from the above technical solution, the embodiment of the present invention can derive URL templates from all known URL data and expand the URL templates to obtain, for each URL template, URL data that can be regarded as belonging to phishing websites. Phishing websites are thereby obtained proactively, which effectively reduces the lag in phishing discovery and the dependence on manual work. This approach also widens the detection range and reduces losses, and because the URL data of known phishing websites serves as the basis for expansion, it improves the secondary utilization of known phishing websites.
In addition, the data source extension method provided by the embodiment of the present invention may, after obtaining the URL data, sort the URL data so that URL data with high similarity are adjacent. Grouping highly similar URL data in this way makes it more likely that the statistics capture which types of characters legitimate URL data is tampered into, so that the URL data can be expanded in a targeted manner. FIG. 4 shows another flowchart of the data source extension method provided by the embodiment of the present invention, which may include the following steps:
401: Obtain all known URL data, where all known URL data includes at least the URL data of known phishing websites. That is, in this embodiment at least the URL data of known phishing websites can be used as the basis for data expansion, which improves the secondary utilization of that data; for example, the expansion may be based on www.1oo86.cn. Of course, the expansion may also be based on the URL data of other known legitimate websites, for example www.360.com.
402: Obtain the second-level domain name of each URL data to form a second-level domain name set list. For example, the second-level domain name of "www.abc.com" is "abc.com". The second-level domain name of each URL is stored in a list, forming the second-level domain name set list.
403: Classify the entries of the second-level domain name set list by top-level domain (TLD) to obtain sub second-level domain name set lists, one for each top-level domain. For example, the TLD of both "www.abc.com" and "www.efg.com" is ".com", so both URL data are stored in the sub second-level domain name list corresponding to ".com".
404: Sort the URL data in each sub second-level domain name set list so that URL data with high similarity are adjacent in the ordering. For example, the URL data in each sub second-level domain name set list is first classified by the preset hyphen into URL data that contains the preset hyphen and URL data that does not, and each class is then sorted by length and by alphabetical order. Highly similar URL data is thereby grouped together, which makes it more likely that the statistics capture which types of characters legitimate URL data is tampered into, so that the URL data can be expanded in a targeted manner.
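Steps 402 to 404 can be sketched as below. Several points are assumptions rather than the text's method: the second-level domain is taken as the last two dot-separated labels (which ignores multi-label suffixes such as ".com.cn"), the function names are hypothetical, and the sort key (hyphenated names first, then longer names, then alphabetical) is only one plausible reading, because the text does not spell out the exact key. The resulting order may therefore differ from the ordering shown in the document's worked example.

```python
from collections import defaultdict


def second_level_domain(url):
    """Return the last two dot-separated labels, e.g. 'www.abc.com' -> 'abc.com'."""
    return ".".join(url.split(".")[-2:])


def group_and_sort(urls):
    """Group second-level domains by TLD and sort each group."""
    groups = defaultdict(list)
    for url in urls:
        domain = second_level_domain(url)
        tld = "." + domain.rsplit(".", 1)[-1]
        groups[tld].append(domain)
    # Assumed key: hyphenated names first, then longer names, then alphabetical.
    return {
        tld: sorted(names, key=lambda d: ("-" not in d, -len(d), d))
        for tld, names in groups.items()
    }


groups = group_and_sort(
    ["www.abc.com", "tg.agm.net", "www.g2-bc.com", "wap.abc.net", "www.1bc.com"]
)
print(groups[".com"])  # ['g2-bc.com', '1bc.com', 'abc.com']
print(groups[".net"])  # ['abc.net', 'agm.net']
```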
405: Compare all known URL data pairwise to obtain multiple URL templates.
406: Expand each URL template to obtain, for each URL template, the URL data that can be regarded as belonging to phishing websites.
In this embodiment of the present invention, steps 405 and 406 differ from steps 102 and 103 above only in that the URL templates are derived from the second-level domain names corresponding to the URL data in the sub second-level domain name set lists; the replacement method and the expansion method are the same. For example, the URL templates are obtained as follows:
(1) For each sorted sub second-level domain name list, read the second-level domain names in the list in order:
if the line currently read is the first line, read the second line as well and assign the two second-level domain names read to the variables domain1 and domain2 respectively;
if the line currently read is not the first line, first assign the current value of domain2 to domain1, then read the next line and assign it to domain2.
(2) If the variables domain1 and domain2 have the same length (say length = n), compare the characters at each position of the two variables in order from left to right:
1) if the characters at the i-th (i = 1, 2, …, n) position are the same, record that character and continue with the next character;
2) if the characters at the i-th (i = 1, 2, …, n) position differ, proceed as follows:
a) if both characters are of the numeric (0 to 9) type, replace with the first preset replacement symbol "#";
b) if both characters are of the English letter (a to z) type, replace with the second preset replacement symbol "@";
c) if one character is of the numeric (0 to 9) type and the other is of the English letter (a to z) type, the replacement follows the type of the character at the i-th position of domain1: if the i-th character of domain1 is a digit 0 to 9, replace with "#"; if it is an English letter a to z, replace with "@";
d) if one of the two characters is the hyphen "-", the replacement follows the type of the other character.
3) Repeat steps 1) and 2) to generate one URL template.
(3) If the lengths of the variables domain1 and domain2 differ, return to step (1).
(4) Repeat steps (1) to (3) until the end of the sub second-level domain name list is reached.
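The sliding comparison of steps (1) to (4) amounts to comparing each entry of the sorted list with its successor. The sketch below is one way to write it; `pair_template` and `scan_templates` are hypothetical names, and keeping characters other than digits, lowercase letters and the hyphen unchanged is an assumption.

```python
from collections import Counter

DIGITS = "0123456789"
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def pair_template(domain1, domain2):
    """Template for two equal-length domains (rules a to d above)."""
    chars = []
    for a, b in zip(domain1, domain2):
        if a == b:
            chars.append(a)
            continue
        decider = b if a == "-" else a  # rule d, otherwise domain1 decides
        if decider in DIGITS:
            chars.append("#")
        elif decider in LETTERS:
            chars.append("@")
        else:
            chars.append(a)  # assumption: other characters are kept
    return "".join(chars)


def scan_templates(sorted_domains):
    """Compare each domain with its successor and count the templates."""
    counts = Counter()
    for domain1, domain2 in zip(sorted_domains, sorted_domains[1:]):
        if len(domain1) == len(domain2):  # step (3): skip unequal lengths
            counts[pair_template(domain1, domain2)] += 1
    return counts


com_list = ["g2-bc.com", "g-abb.com", "afgc.com", "agbc.com",
            "1bc.com", "abc.com", "acc.com", "a-c.com"]
print(scan_templates(com_list).most_common(1))  # [('a@c.com', 2)]
```

Applied to the sorted .com list of the worked example in this description, the scan yields `a@c.com` twice and `g#@b@.com`, `a@@c.com` and `#bc.com` once each.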
For the expansion process of the URL templates, refer to FIG. 3; it is not described again here.
The data source extension method provided by the embodiment of the present invention is illustrated below, taking "-" as the preset hyphen, "#" as the first preset replacement symbol and "@" as the second preset replacement symbol. Assume that all known URL data is the URL data of known phishing websites, as shown in Table 1.
Table 1 URL data of known phishing websites
www.abc.com  www.a-c.com    mg.afgc.com    tg.agm.net   www.agbc.com
m.acc.com    www.g2-bc.com  www.g-abb.com  wap.abc.net  www.1bc.com
The second-level domain name set list obtained from the URL data in Table 1 is: abc.com, a-c.com, afgc.com, agm.net, agbc.com, acc.com, g2-bc.com, g-abb.com, abc.net, 1bc.com.
After classification by top-level domain, two sub second-level domain name lists are obtained:
.com list: abc.com, acc.com, agbc.com, afgc.com, g-abb.com, a-c.com, g2-bc.com, 1bc.com
.net list: abc.net, agm.net
The URL data in these two sub second-level domain name lists is sorted, and the sorting results are shown in Table 2:
Table 2 Sorting results of the sub second-level domain name lists
.com sorting result  .net sorting result
g2-bc.com            abc.net
g-abb.com            agm.net
afgc.com
agbc.com
1bc.com
abc.com
acc.com
a-c.com
Of the two sorting results above, the .com sorting result is used as an example to show how the URL templates are obtained.
The two URL data g2-bc.com and g-abb.com are read. Because they have the same length, the characters at each position are compared from left to right. The preset hyphen "-" is found at positions 2 and 3, and at those positions the URL data that does not have the hyphen has a character of the numeric type and of the letter type respectively, so the character at position 2 is replaced with "#" and the character at position 3 is replaced with "@". The characters at position 5 also differ and are both of the letter type, so that position is replaced with "@". The replacement process is shown in FIG. 5, and the resulting URL template is g#@b@.com.
Next, afgc.com is read and compared with g-abb.com. Because the two URL data differ in length, the remaining URL data in the list continues to be read and the corresponding URL templates are obtained. Specifically, agbc.com is obtained and compared with afgc.com, and the resulting URL template is a@@c.com.
1bc.com is then read and compared with agbc.com. Because their lengths differ, reading continues and abc.com is obtained and compared with 1bc.com. Their lengths are the same and 1bc.com comes first, so comparing from left to right gives the URL template #bc.com.
acc.com is then read and compared with abc.com. Their lengths are the same, so comparing from left to right gives the URL template a@c.com.
a-c.com is then read and compared with acc.com. Their lengths are the same and acc.com comes first, so comparing from left to right gives the URL template a@c.com.
Finally, the URL templates are sorted by number of occurrences to obtain the ordered URL template list: a@c.com (2 times), a@@c.com (1 time), g#@b@.com (1 time), #bc.com (1 time).
Taking as the preset condition that "#" and "@" occur no more than twice in total in a URL template and that the URL template occurs at least once, the URL templates in the list that satisfy this condition are: a@c.com, a@@c.com and #bc.com.
The three retained URL templates are then expanded. Taking #bc.com as an example, the expanded URL data is: 0bc.com, 1bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
Finally, the expanded URL data is deduplicated against the URL data of known phishing websites (1bc.com is a duplicate), giving the final expanded URL data that can all be regarded as belonging to phishing websites: 0bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Corresponding to the above method embodiments, an embodiment of the present invention further provides a data source extension apparatus. Its structure is shown schematically in FIG. 6 and may include an acquisition unit 11, a comparison unit 12 and an expansion unit 13.
The acquisition unit 11 is configured to obtain all known URL data, where all known URL data includes at least the URL data of known phishing websites. That is, in this embodiment at least the URL data of known phishing websites can be used as the basis for data expansion, which improves the secondary utilization of that data; for example, the expansion may be based on www.1oo86.cn. Of course, the expansion may also be based on the URL data of other known legitimate websites, for example www.360.com.
The comparison unit 12 is configured to compare all known URL data pairwise to obtain multiple URL templates. All known URL data is compared pairwise because several URL data may correspond to one URL template; pairwise comparison makes it easy to count how many times a given URL template occurs, so that the subsequent expansion based on URL templates is better targeted.
In this embodiment of the present invention, the comparison unit 12 may obtain the multiple URL templates using the structure shown in FIG. 7. The comparison unit 12 may include a comparison subunit 121, a recording subunit 122, an acquisition subunit 123, a first replacement subunit 124, a second replacement subunit 125, a third replacement subunit 126, a fourth replacement subunit 127 and a configuration subunit 128.
The comparison subunit 121 is configured to compare, in order, the characters at each position of the i-th URL data and the (i+1)-th URL data when the two have the same length, where i is a natural number, i = 1, 2, …, m-1, and m is the total number of URL data.
Take www.g2-bc.com as the i-th URL data and www.g-abb.com as the (i+1)-th URL data. A length comparison shows that the two URL data have the same length, so the characters at each position of the two URL data can be compared in order. If the two URL data differed in length, other URL data would be obtained for comparison.
The recording subunit 122 is configured to record the character at the j-th position when the characters at the j-th position are the same, and to trigger the comparison subunit 121 to continue with the next character, where j = 1, 2, …, n and n is the total number of characters in the i-th URL data. For example, if the characters at positions 1 to 4 are the same, the characters at these four positions are recorded and the comparison continues with the character at position 5.
The acquisition subunit 123 is configured to obtain, when the characters at the j-th position differ, the type of the character at the j-th position in the i-th URL data and in the (i+1)-th URL data.
The first replacement subunit 124 is configured to replace the character at the j-th position with the first preset replacement symbol when the characters at the j-th position in the i-th URL data and the (i+1)-th URL data are both of the numeric type.
The first preset replacement symbol is a symbol preset for replacing the corresponding character in the URL data. When the characters at the j-th position in both URL data are of the numeric type, the first preset replacement symbol is used for the replacement. For example, if the first preset replacement symbol is "#", the character at the j-th position is replaced with "#". The first preset replacement symbol may of course be another symbol, as determined by the actual application.
The second replacement subunit 125 is configured to replace the character at the j-th position with the second preset replacement symbol when the characters at the j-th position in the i-th URL data and the (i+1)-th URL data are both of the letter type.
The second preset replacement symbol is a symbol preset for replacing the corresponding character in the URL data. When the characters at the j-th position in both URL data are of the letter type, the second preset replacement symbol is used for the replacement. For example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". The second preset replacement symbol may of course be another symbol, as determined by the actual application.
The third replacement subunit 126 is configured to replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th URL data when that type differs from the type of the character at the j-th position in the (i+1)-th URL data.
For example, if the character at the j-th position in the i-th URL data is of the numeric type, the character at the j-th position is replaced with the first preset replacement symbol; if it is of the letter type, the character at the j-th position is replaced with the second preset replacement symbol.
The fourth replacement subunit 127 is configured to replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen, when the character at the j-th position in either the i-th URL data or the (i+1)-th URL data is the preset hyphen.
Take the two URL data www.g2-bc.com and www.g-abb.com above as an example. At the 6th position one character is the digit 2 and the other is the preset hyphen "-", so the character at the 6th position is replaced with the preset replacement symbol corresponding to digits, namely the first preset replacement symbol. At the 7th position one character is the preset hyphen "-" and the other is the letter a, so the character at the 7th position is replaced with the preset replacement symbol corresponding to letters, namely the second preset replacement symbol.
The configuration subunit 128 is configured to take the URL data obtained after all differing characters have been replaced as the URL template corresponding to the i-th URL data and the (i+1)-th URL data. For example, the URL template of www.g2-bc.com and www.g-abb.com is www.g#@b@.com.
The expansion unit 13 is configured to expand each URL template to obtain, for each URL template, the URL data that can be regarded as belonging to phishing websites. In this embodiment of the present invention, the expansion unit 13 may expand each URL template using the structure shown in FIG. 8, where the expansion unit 13 includes a statistics subunit 131, a retention subunit 132, an expansion subunit 133 and a deduplication subunit 134.
The statistics subunit 131 is configured to count the occurrences of the URL templates to obtain an ordered URL template list. The purpose of the count is to determine how many times each URL template occurs, so that identical URL templates can be merged to reduce the number of URL templates.
The retention subunit 132 is configured to retain the URL templates in the URL template list that satisfy a preset condition. Each URL template in the list is checked against the preset condition; templates that fail are deleted, and those that satisfy it are retained as the URL templates finally used for expansion, which further reduces the number of URL templates.
In this embodiment of the present invention, the preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of times a URL template occurs. With charvalue as the preset maximum number of preset replacement symbols and numvalue as the preset threshold on the number of times a URL template occurs, the ordered URL template list is traversed and the number of templates is controlled by the following conditions:
if the total number of "@" and "#" symbols in a URL template is not greater than charvalue, the URL template is retained; otherwise it is deleted;
if the number of occurrences of a URL template is not less than numvalue, the URL template is retained; otherwise it is deleted.
The extension subunit 133 is configured to extend the retained URL templates. The extension process includes: replacing the first preset replacement symbol in a URL template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the URL template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the extended URL data corresponding to each URL template.
Take the first preset replacement symbol as "#" and the second preset replacement symbol as "@" as an example. Each first preset replacement symbol in a URL template is replaced in turn with the 10 digits 0 to 9, and each second preset replacement symbol is replaced in turn with the 26 English letters a to z. After every preset replacement symbol in the URL template has been replaced, multiple URL data corresponding to each URL template are obtained.
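The replacement described above can be sketched in Python as follows. The function name expand_template is hypothetical, and the sketch assumes that a template containing several replacement symbols is expanded into every combination of substitutions (a Cartesian product), which the text does not state explicitly.

```python
import itertools
import string

DIGITS = string.digits            # "0123456789", substituted for "#"
LETTERS = string.ascii_lowercase  # "a".."z", substituted for "@"


def expand_template(template):
    """Return every URL obtained by filling in the replacement symbols.

    Assumption: all combinations of substitutions are generated.
    """
    pools = []
    for ch in template:
        if ch == "#":
            pools.append(DIGITS)
        elif ch == "@":
            pools.append(LETTERS)
        else:
            pools.append(ch)  # a fixed character is a pool of size one
    return ["".join(combo) for combo in itertools.product(*pools)]


urls = expand_template("ab#.com")  # hypothetical template
print(len(urls), urls[0], urls[-1])
```

Under this assumption each "#" multiplies the number of generated URLs by 10 and each "@" by 26, which is one reason to cap the number of replacement symbols per template.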
The replacement is done this way because the first and second preset replacement symbols in a URL template correspond to character types, and this correspondence reflects which type of character the URL data of phishing websites tends to be tampered into. In other words, the embodiment of the present invention derives, from statistics over the URL data of phishing websites, which type of character is likely to be substituted, so that the resulting URL templates and extended URL data match the way phishing URL data is tampered with. The URL templates and extended URL data are therefore well targeted, relatively accurate URL data can be obtained from a relatively small amount of data, and each extended URL datum can serve as a data source for phishing detection, which improves generality.
The deduplication subunit 134 is configured to deduplicate the extended URL data against all the known URL data, to obtain all the URL data that can be regarded as phishing websites.
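One possible reading of this deduplication step is sketched below in Python. The function name deduplicate is hypothetical, and the sketch assumes that deduplication means dropping extended URLs that are already known as well as repeated extended URLs; the text does not specify this in detail.

```python
def deduplicate(expanded_urls, known_urls):
    """Keep each extended URL once, skipping URLs that are already known.

    Assumption: known URLs are removed from the result rather than merged in.
    """
    known = set(known_urls)
    seen = set()
    result = []
    for url in expanded_urls:
        if url not in known and url not in seen:
            seen.add(url)
            result.append(url)  # preserves first-seen order
    return result


# Hypothetical sample data
print(deduplicate(["a1.com", "a2.com", "a2.com", "a3.com"], ["a1.com"]))
```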
As can be seen from the above technical solution, the embodiment of the present invention can obtain URL templates based on all the known URL data and extend the URL templates to obtain, for each URL template, URL data that can be regarded as phishing websites. Phishing websites are thereby obtained actively and automatically, which effectively reduces the lag in phishing discovery and the dependence on manual work. In addition, this approach widens the detection range and reduces financial losses, and because the URL data of known phishing websites serve as the basis for extension, the secondary utilization of known phishing websites is improved.
In addition, after acquiring the URL data, the data source detection apparatus provided by the embodiment of the present invention may further sort the URL data so that URL data with higher similarity are adjacent. URL data with higher similarity are thereby grouped together, which makes it easier to determine statistically which type of character legitimate URL data is tampered into, so that the URL data can be extended in a targeted manner. FIG. 9 shows another schematic structural diagram of the data source detection apparatus provided by the embodiment of the present invention. On the basis of FIG. 6, the apparatus may further include a list forming unit 14, a classification unit 15, and a sorting unit 16.
The list forming unit 14 is configured to obtain the second-level domain name of each URL datum and form a second-level domain name set list. For example, the second-level domain name of "www.abc.com" is "abc.com". The second-level domain name of each URL is stored in a list to form the second-level domain name set list.
The classification unit 15 is configured to classify the entries in the second-level domain name set list by top-level domain, to obtain sub second-level domain name set lists with different top-level domains. For example, the top-level domain (TLD) of both "www.abc.com" and "www.efg.com" is ".com", so both URL data are stored in the sub second-level domain name list corresponding to ".com".
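The list forming and classification steps can be sketched in Python as follows. The function names second_level_domain and group_by_tld are hypothetical. The sketch makes a simplifying assumption that the second-level domain is the last two dot-separated labels of the host name, which matches the "www.abc.com" example but does not hold for multi-label suffixes such as ".co.uk". It also stores the second-level domain, rather than the full URL, in each group.

```python
from collections import defaultdict


def second_level_domain(host):
    """Return the last two labels of a host name (simplifying assumption)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def group_by_tld(hosts):
    """Group second-level domains by their top-level domain."""
    groups = defaultdict(list)
    for host in hosts:
        sld = second_level_domain(host)
        tld = "." + sld.rsplit(".", 1)[-1]
        groups[tld].append(sld)
    return dict(groups)


print(second_level_domain("www.abc.com"))
print(group_by_tld(["www.abc.com", "www.efg.com", "www.xyz.cn"]))
```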
The sorting unit 16 is configured to sort the URL data in each sub second-level domain name set list so that URL data with higher similarity are adjacent in the ordering. For example, the sorting unit includes a classification subunit and a sorting subunit. The classification subunit is configured to classify the URL data in each sub second-level domain name set list based on a preset hyphen, to obtain URL data that contain the preset hyphen and URL data that do not. The sorting subunit is configured to sort the URL data that contain the preset hyphen and the URL data that do not, in each case by length and then alphabetically. URL data with higher similarity are thereby grouped together, which makes it easier to determine statistically which type of character legitimate URL data is tampered into, so that the URL data can be extended in a targeted manner.
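The sorting performed by the sorting unit can be sketched in Python as follows. The function name sort_for_comparison is hypothetical, and placing the hyphenated group before the non-hyphenated group is an assumption; the text only says that the two groups are each sorted by length and alphabetical order.

```python
def sort_for_comparison(domains, hyphen="-"):
    """Split domains by presence of the hyphen, then sort each group.

    Each group is ordered by length first and alphabetically second, so that
    strings of equal length with similar prefixes end up adjacent.
    Assumption: the hyphenated group is placed first.
    """
    with_hyphen = [d for d in domains if hyphen in d]
    without_hyphen = [d for d in domains if hyphen not in d]

    def key(d):
        return (len(d), d)

    return sorted(with_hyphen, key=key) + sorted(without_hyphen, key=key)


# Hypothetical sample data
print(sort_for_comparison(["abc2.com", "ab-c.com", "abc1.com", "a-b.com", "xy.com"]))
```

Sorting by length first matters because only URL data of equal length are compared position by position when the templates are built.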
For the working processes of the comparison unit 12 and the extension unit 13 in the data source detection apparatus shown in FIG. 9, refer to the related description in the foregoing method embodiments; they are not described again here.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

  1. A data source extension method, characterized in that the method comprises:
    acquiring all known uniform resource locator data, wherein all the known uniform resource locator data comprise at least the uniform resource locator data of known phishing websites;
    comparing all the known uniform resource locator data pairwise to obtain a plurality of uniform resource locator templates;
    extending each of the uniform resource locator templates to obtain, for each of the uniform resource locator templates, the corresponding uniform resource locator data that can be regarded as phishing websites.
  2. The method according to claim 1, characterized in that, after acquiring all the known uniform resource locator data and before comparing all the known uniform resource locator data pairwise, the method further comprises:
    obtaining the second-level domain name of each uniform resource locator datum to form a second-level domain name set list;
    classifying the entries in the second-level domain name set list by top-level domain to obtain sub second-level domain name set lists with different top-level domains;
    sorting the uniform resource locator data in each sub second-level domain name set list so that uniform resource locator data with higher similarity are adjacent in the ordering.
  3. The method according to claim 2, characterized in that sorting the uniform resource locator data in each sub second-level domain name set list so that uniform resource locator data with higher similarity are adjacent in the ordering comprises:
    classifying the uniform resource locator data in each sub second-level domain name set list based on a preset hyphen, to obtain uniform resource locator data that contain the preset hyphen and uniform resource locator data that do not contain the preset hyphen;
    sorting the uniform resource locator data that contain the preset hyphen and the uniform resource locator data that do not contain the preset hyphen, in each case by length and then alphabetically.
  4. The method according to claim 1 or 2, characterized in that comparing all the known uniform resource locator data pairwise to obtain a plurality of uniform resource locator templates comprises:
    when the i-th uniform resource locator datum and the (i+1)-th uniform resource locator datum have the same length, comparing in turn the characters at each position in the i-th and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;
    when the characters at the j-th position are the same, recording the character at the j-th position and continuing to compare the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator datum;
    when the characters at the j-th position are different, obtaining the types of the characters at the j-th position in the i-th and the (i+1)-th uniform resource locator data;
    when the characters at the j-th position in the i-th and the (i+1)-th uniform resource locator data are both of the digit type, replacing the character at the j-th position with a first preset replacement symbol;
    when the characters at the j-th position in the i-th and the (i+1)-th uniform resource locator data are both of the letter type, replacing the character at the j-th position with a second preset replacement symbol;
    when the type of the character at the j-th position in the i-th uniform resource locator datum differs from the type of the character at the j-th position in the (i+1)-th uniform resource locator datum, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator datum;
    when the character at the j-th position in the i-th or the (i+1)-th uniform resource locator datum is a preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;
    taking the uniform resource locator data obtained after all differing characters have been replaced as the uniform resource locator template corresponding to the i-th and the (i+1)-th uniform resource locator data.
  5. The method according to claim 4, characterized in that extending each of the uniform resource locator templates to obtain, for each of the uniform resource locator templates, the corresponding uniform resource locator data that can be regarded as phishing websites comprises:
    counting the occurrences of the uniform resource locator templates to obtain an ordered list of uniform resource locator templates;
    retaining the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;
    extending the retained uniform resource locator templates, wherein the extension process comprises: replacing the first preset replacement symbol in a uniform resource locator template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the uniform resource locator template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates;
    deduplicating the extended uniform resource locator data against all the known uniform resource locator data to obtain all the uniform resource locator data that can be regarded as phishing websites.
  6. A data source extension apparatus, characterized in that the apparatus comprises:
    an acquisition unit, configured to acquire all known uniform resource locator data, wherein all the known uniform resource locator data comprise at least the uniform resource locator data of known phishing websites;
    a comparison unit, configured to compare all the known uniform resource locator data pairwise to obtain a plurality of uniform resource locator templates;
    an extension unit, configured to extend each of the uniform resource locator templates to obtain, for each of the uniform resource locator templates, the corresponding uniform resource locator data that can be regarded as phishing websites.
  7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
    a list forming unit, configured to obtain the second-level domain name of each uniform resource locator datum to form a second-level domain name set list;
    a classification unit, configured to classify the entries in the second-level domain name set list by top-level domain to obtain sub second-level domain name set lists with different top-level domains;
    a sorting unit, configured to sort the uniform resource locator data in each sub second-level domain name set list so that uniform resource locator data with higher similarity are adjacent in the ordering.
  8. The apparatus according to claim 7, characterized in that the sorting unit comprises:
    a classification subunit, configured to classify the uniform resource locator data in each sub second-level domain name set list based on a preset hyphen, to obtain uniform resource locator data that contain the preset hyphen and uniform resource locator data that do not contain the preset hyphen;
    a sorting subunit, configured to sort the uniform resource locator data that contain the preset hyphen and the uniform resource locator data that do not contain the preset hyphen, in each case by length and then alphabetically.
  9. The apparatus according to claim 6 or 7, characterized in that the comparison unit comprises:
    a comparison subunit, configured to, when the i-th uniform resource locator datum and the (i+1)-th uniform resource locator datum have the same length, compare in turn the characters at each position in the i-th and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;
    a recording subunit, configured to, when the characters at the j-th position are the same, record the character at the j-th position and trigger the comparison subunit to continue comparing the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator datum;
    an acquisition subunit, configured to, when the characters at the j-th position are different, obtain the types of the characters at the j-th position in the i-th and the (i+1)-th uniform resource locator data;
    a first replacement subunit, configured to, when the characters at the j-th position in the i-th and the (i+1)-th uniform resource locator data are both of the digit type, replace the character at the j-th position with a first preset replacement symbol;
    a second replacement subunit, configured to, when the characters at the j-th position in the i-th and the (i+1)-th uniform resource locator data are both of the letter type, replace the character at the j-th position with a second preset replacement symbol;
    a third replacement subunit, configured to, when the type of the character at the j-th position in the i-th uniform resource locator datum differs from the type of the character at the j-th position in the (i+1)-th uniform resource locator datum, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator datum;
    a fourth replacement subunit, configured to, when the character at the j-th position in the i-th or the (i+1)-th uniform resource locator datum is a preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;
    a configuration subunit, configured to take the uniform resource locator data obtained after all differing characters have been replaced as the uniform resource locator template corresponding to the i-th and the (i+1)-th uniform resource locator data.
  10. The apparatus according to claim 9, characterized in that the extension unit comprises:
    a statistics subunit, configured to count the occurrences of the uniform resource locator templates to obtain an ordered list of uniform resource locator templates;
    a retention subunit, configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;
    an extension subunit, configured to extend the retained uniform resource locator templates, wherein the extension process comprises: replacing the first preset replacement symbol in a uniform resource locator template with each character of the type corresponding to the first preset replacement symbol in turn, and replacing the second preset replacement symbol in the uniform resource locator template with each character of the type corresponding to the second preset replacement symbol in turn, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates;
    a deduplication subunit, configured to deduplicate the extended uniform resource locator data against all the known uniform resource locator data to obtain all the uniform resource locator data that can be regarded as phishing websites.
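
For illustration only, the position-by-position comparison recited in claims 4 and 9 might be sketched in Python as follows. The names make_template and char_type are hypothetical, the symbols "#" and "@" follow the example given in the description, and returning None when a differing character is neither a digit, a letter, nor the hyphen is an assumption, since the claims do not cover that case.

```python
import string

HYPHEN = "-"
PLACEHOLDER = {"digit": "#", "letter": "@"}


def char_type(ch):
    """Classify a character as 'digit', 'letter', or None."""
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_letters:
        return "letter"
    return None


def make_template(url_a, url_b):
    """Build a template from two equal-length strings, or return None."""
    if len(url_a) != len(url_b):
        return None  # only equal-length data are compared
    out = []
    for a, b in zip(url_a, url_b):
        if a == b:
            out.append(a)  # identical characters are recorded unchanged
            continue
        # If the first character is the hyphen, use the other character's
        # type; otherwise use the type of the character in the first string.
        kind = char_type(b) if a == HYPHEN else char_type(a)
        if kind is None:
            return None  # assumption: unclassifiable difference, no template
        out.append(PLACEHOLDER[kind])
    return "".join(out)


print(make_template("abc1.com", "abc2.com"))
print(make_template("ab-c.com", "ab1c.com"))
```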
PCT/CN2017/073611 2016-10-19 2017-02-15 Method and device for extending data source WO2018072363A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610911941.X 2016-10-19
CN201610911941.XA CN106503125B (en) 2016-10-19 2016-10-19 A kind of data source extended method and device

Publications (1)

Publication Number Publication Date
WO2018072363A1 true WO2018072363A1 (en) 2018-04-26

Family

ID=58294512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/073611 WO2018072363A1 (en) 2016-10-19 2017-02-15 Method and device for extending data source

Country Status (2)

Country Link
CN (1) CN106503125B (en)
WO (1) WO2018072363A1 (en)

Also Published As

Publication number Publication date
CN106503125B (en) 2019-10-15
CN106503125A (en) 2017-03-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17861640

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17861640

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/12/2019)
