WO2018072363A1

WO2018072363A1 - Method and device for extending data source

Info

Publication number: WO2018072363A1
Application number: PCT/CN2017/073611
Authority: WO
Inventors: 李晓东; 李雪妮; 耿光刚; 陈勇
Original assignee: 中国互联网络信息中心
Priority date: 2016-10-19
Filing date: 2017-02-15
Publication date: 2018-04-26
Also published as: CN106503125B; CN106503125A

Abstract

A method and a device for extending a data source, the method comprising: obtaining uniform resource locator (URL) templates on the basis of all known uniform resource locator data, and extending the uniform resource locator templates to obtain, for each uniform resource locator template, uniform resource locator data that can be regarded as a phishing website. This achieves automatic and active acquisition of phishing websites and effectively reduces the lag in discovering phishing websites and the dependence on human reports. In the above-mentioned manner, it is possible to extend the detection range, reduce the loss of profit, and perform the extension on the basis of the uniform resource locator data of known phishing websites, improving the secondary utilization rate of known phishing websites.

Description

Data source expansion method and device

The present application claims priority to Chinese patent application No. 201610911941.X, filed with the Chinese Patent Office on October 19, 2016 and entitled "a data source expansion method and device", the entire contents of which are incorporated herein by reference.

Technical field

The present invention belongs to the field of Internet security detection technologies, and more particularly, to a data source extension method and apparatus.

Background technique

As an important part of modern life, the Internet has been widely used by various groups and organizations for online trade and services, which also makes the Internet more vulnerable to security attacks from all sides. For example, phishing, as a form of security attack, creates a phishing website by mimicking the page content of a legitimate website, and induces users to visit phishing websites to steal personal privacy information such as user names, bank accounts, and passwords.

With the rapid development of the Internet, and driven by profit, the black-market industry chain engaged in phishing attacks is growing. Detection methods for phishing websites therefore occupy an increasingly important position in the security operations of enterprises in fields such as e-commerce and financial securities.

At present, detection methods for phishing websites mainly focus on the detection algorithm, that is, on researching efficient and accurate detection algorithms that find phishing websites among many websites. The data source targeted by the detection method (i.e., the possible phishing websites), however, is discovered mainly through reports from netizens. As a result, the detection of phishing websites is passive, lacks the ability of active discovery, and makes little secondary use of known phishing websites.

Summary of the invention

In view of this, an object of the present invention is to provide a data source extension method and apparatus for improving the secondary utilization rate of known phishing websites, expanding the detection range, and effectively reducing the lag and human dependence of phishing discovery. The technical solutions are as follows:

The present invention provides a data source extension method, the method comprising:

Obtaining all known uniform resource locator data, wherein all the known uniform resource locator data includes at least the uniform resource locator data of known phishing websites;

Performing a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;

Extending each of the uniform resource locator templates to obtain the uniform resource locator data, corresponding to each of the uniform resource locator templates, that can be regarded as a phishing website.

Preferably, after obtaining all the known uniform resource locator data, before performing the pairwise comparison on the all known uniform resource locator data, the method further includes:

Obtaining a second-level domain name of each uniform resource locator data to form a second-level domain name collection list;

Classifying according to the top-level domain name in the second-level domain name collection list to obtain sub-level second-level domain name collection lists having different top-level domain names;

Sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity is adjacent in the sorting.

Preferably, sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity is adjacent in the sorting includes:

Classifying the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen to obtain uniform resource locator data containing the preset hyphen and uniform resource locator data not containing the preset hyphen;

Sorting the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen, each in turn, by length and by alphabetical order.

Preferably, performing the pairwise comparison on the all known uniform resource locator data to obtain a plurality of uniform resource locator templates, including:

When the i-th uniform resource locator data and the (i+1)-th uniform resource locator data have the same length, sequentially comparing the characters at each position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;

When the characters at the j-th position are the same, recording the character at the j-th position and continuing to compare the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data;

When the characters at the j-th position are different, obtaining the type of the character at the j-th position in the i-th uniform resource locator data and in the (i+1)-th uniform resource locator data;

When the characters at the j-th position in both the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are of the numeric type, replacing the character at the j-th position with a first preset replacement symbol;

When the characters at the j-th position in both the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are of the letter type, replacing the character at the j-th position with a second preset replacement symbol;

When the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the (i+1)-th uniform resource locator data, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;

When the character at the j-th position in the i-th uniform resource locator data or the (i+1)-th uniform resource locator data is a preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;

Taking the uniform resource locator data obtained after all the differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.

Preferably, extending each of the uniform resource locator templates to obtain the uniform resource locator data, corresponding to each uniform resource locator template, that can be regarded as a phishing website includes:

Counting the number of occurrences of the uniform resource locator templates to obtain an ordered list of uniform resource locator templates;

Retaining the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;

Extending the retained uniform resource locator templates, where the extension process includes: replacing the first preset replacement symbol in the uniform resource locator template, in turn, with every character of the type corresponding to the first preset replacement symbol, and replacing the second preset replacement symbol in the uniform resource locator template, in turn, with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates;

Performing deduplication processing on the extended uniform resource locator data and all the known uniform resource locator data to obtain all the uniform resource locator data that can be regarded as a phishing website.

In another aspect, the present invention also provides a data source expansion apparatus, the apparatus comprising:

An obtaining unit, configured to acquire all known uniform resource locator data, where the all known uniform resource locator data includes at least uniform resource locator data of a known phishing website;

a comparison unit, configured to perform a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;

And an extension unit, configured to extend each of the uniform resource locator templates, and obtain uniform resource locator data corresponding to each of the uniform resource locator templates that can be regarded as a phishing website.

Preferably, the device further comprises:

a list forming unit, configured to acquire a second-level domain name of each uniform resource locator data, to form a second-level domain name set list;

a classification unit, configured to perform classification according to the top-level domain name in the second-level domain name collection list, to obtain sub-level second-level domain name collection lists having different top-level domain names;

And a sorting unit, configured to sort the uniform resource locator data in each of the sub-level second-level domain name collection lists, so that uniform resource locator data with higher similarity is adjacent in the sorting.

Preferably, the sorting unit comprises:

a classification subunit, configured to classify the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data containing the preset hyphen and uniform resource locator data not containing the preset hyphen;

And a sorting subunit, configured to sort the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen, each in turn, by length and by alphabetical order.

Preferably, the comparing unit comprises:

a comparison subunit, configured to, when the i-th uniform resource locator data and the (i+1)-th uniform resource locator data have the same length, sequentially compare the characters at each position in the i-th uniform resource locator data and the (i+1)-th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;

a recording subunit, configured to record the character at the j-th position when the characters at the j-th position are the same, and to trigger the comparison subunit to continue comparing the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data;

an obtaining subunit, configured to obtain, when the characters at the j-th position are different, the type of the character at the j-th position in the i-th uniform resource locator data and in the (i+1)-th uniform resource locator data;

a first replacement subunit, configured to replace the character at the j-th position with the first preset replacement symbol when the characters at the j-th position in both the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are of the numeric type;

a second replacement subunit, configured to replace the character at the j-th position with the second preset replacement symbol when the characters at the j-th position in both the i-th uniform resource locator data and the (i+1)-th uniform resource locator data are of the letter type;

a third replacement subunit, configured to, when the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the (i+1)-th uniform resource locator data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;

a fourth replacement subunit, configured to, when the character at the j-th position in the i-th uniform resource locator data or the (i+1)-th uniform resource locator data is a preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;

a configuration subunit, configured to take the uniform resource locator data obtained after all the differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the (i+1)-th uniform resource locator data.

Preferably, the expansion unit comprises:

a statistics subunit, configured to count the number of occurrences of the uniform resource locator templates, to obtain an ordered list of uniform resource locator templates;

a retaining subunit, configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;

an extension subunit, configured to extend the retained uniform resource locator templates, where the extension process includes: replacing the first preset replacement symbol in the uniform resource locator template, in turn, with every character of the type corresponding to the first preset replacement symbol, and replacing the second preset replacement symbol in the uniform resource locator template, in turn, with every character of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates;

a deduplication subunit, configured to perform deduplication processing on the extended uniform resource locator data and all known uniform resource locator data to obtain uniform resource locator data that can be regarded as a phishing website.

Compared with the prior art, the above technical solution provided by the present invention has the following advantages:

The foregoing technical solution provided by the present invention can obtain uniform resource locator templates based on all known uniform resource locator data, and extend the uniform resource locator templates to obtain, for each uniform resource locator template, uniform resource locator data that can be regarded as a phishing website. Phishing websites are thus acquired automatically and actively, which effectively reduces the lag and human dependence of phishing discovery. In addition, the detection range can be expanded, the loss of profit can be reduced, and the extension can be based on the uniform resource locator data of known phishing websites, thereby improving the secondary utilization rate of known phishing websites.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art may obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of a data source detecting method according to an embodiment of the present invention;

FIG. 2 is a flowchart of obtaining a URL template in the data source detecting method shown in FIG. 1;

FIG. 3 is a flowchart of a URL template extension in the data source detecting method shown in FIG. 1;

FIG. 4 is another flowchart of a data source detecting method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of obtaining a URL template according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a comparison unit in the data source detecting apparatus shown in FIG. 6;

FIG. 8 is a schematic structural diagram of an extension unit in the data source detecting apparatus shown in FIG. 6;

FIG. 9 is another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention.

Detailed description

It is very common for a user to enter misspelled Uniform Resource Locator (URL) data in a browser, and cybercriminals often exploit this to mislead users into requesting a phishing website. This phenomenon is known as "typosquatting" (mistyped domain names). To carry out phishing, cybercriminals typically register domain names similar to those of legitimate websites and then wait for users who misspell the address to visit them, or exploit the visual similarity of URLs to induce users to click on such "high imitation" URL links. For example, www.10086.cn is the official website of China Mobile. Cybercriminals may use phishing websites such as www.1oo86.cn (with the letter "o" in place of the number "0") or www.l0086.cn (with the letter "l" in place of the number "1") to trick users into visiting. At present, the discovery of these phishing websites relies only on reports from netizens. For this reason, the embodiment of the present invention provides a data source extension method for actively obtaining URL data that can be regarded as a phishing website and improving the secondary utilization rate of the URL data of known phishing websites.

The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.

Please refer to FIG. 1, which is a flowchart of a data source extension method provided by an embodiment of the present invention. The method is used to actively acquire URL data that can be regarded as a phishing website and to improve the secondary utilization rate of the URL data of known phishing websites, and may specifically include the following steps:

101: Acquire all known URL data, wherein all known URL data includes at least URL data of known phishing websites. That is to say, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data expansion, for example expansion based on www.1oo86.cn, thereby improving the secondary utilization rate of the URL data of known phishing websites. Of course, in the embodiment of the present invention, the URL data of other known legitimate websites may also be extended, for example based on www.360.com.

102: Perform a pairwise comparison on all known URL data to obtain multiple URL templates. All the known URL data are compared in pairs because multiple URL data may correspond to one URL template; pairwise comparison makes it convenient to count the number of occurrences of each URL template, so that the subsequent expansion based on URL templates is more targeted.

103: Extend each URL template to obtain URL data corresponding to each URL template that can be regarded as a phishing website.

The process of obtaining a URL template and extending each URL template in the embodiment of the present invention will be described in detail below with reference to the accompanying drawings. As shown in FIG. 2, the process of obtaining a URL template provided by an embodiment of the present invention may include the following steps:

1021: When the i-th URL data and the (i+1)-th URL data have the same length, the characters at each position in the i-th URL data and the (i+1)-th URL data are compared in sequence, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of URL data.

Take www.g2-bc.com as the i-th URL data and www.g-abb.com as the (i+1)-th URL data as an example. A length comparison shows that the two URL data have the same length, so the characters at each position in the two URL data can be compared in sequence. If the lengths of two URL data are different, other URL data continue to be obtained for comparison.

1022: When the characters at the j-th position are the same, record the character at the j-th position and continue to compare the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th URL data. For example, if the characters at the first to fourth positions are the same, the characters at these four positions are recorded, and the comparison continues with the characters at the fifth position.

1023: When the characters at the jth position are different, the type of the character at the jth position in the i-th URL data and the i+1th URL data is acquired.

1024: When the type of the character at the jth position in the i-th URL data and the i+1th URL data is a numeric type, the character at the j-th position is replaced with the first preset replacement symbol.

The first preset replacement symbol is set in advance and is used to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of the numeric type, they are replaced with the first preset replacement symbol. For example, if the first preset replacement symbol is "#", the character at the j-th position is replaced with "#". Of course, the first preset replacement symbol may also be another symbol, determined according to the actual application.

1025: When the type of the character at the jth position in the i-th URL data and the i+1th URL data is a letter type, the character at the j-th position is replaced with the second preset replacement symbol.

The second preset replacement symbol is set in advance and is used to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of the letter type, they are replaced with the second preset replacement symbol. For example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". Of course, the second preset replacement symbol may also be another symbol, determined according to the actual application.

1026: When the type of the character at the j-th position in the i-th URL data is different from the type of the character at the j-th position in the (i+1)-th URL data, the character at the j-th position is replaced with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th URL data.

For example, if the type of the character at the jth position in the i-th URL data is a numeric type, the character at the j-th position is replaced with the first preset replacement symbol, if the j-th position in the i-th URL data is The type of the character is a letter type, and the character at the jth position is replaced with the second preset replacement symbol.

1027: When the character at the j-th position in the i-th URL data or the (i+1)-th URL data is the preset hyphen, the character at the j-th position is replaced with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen.

For example, for the above URL data www.g2-bc.com and www.g-abb.com, the characters at the sixth position are the number 2 in one and the preset hyphen "-" in the other, so the character at the sixth position is replaced with the preset replacement symbol corresponding to the number, that is, the first preset replacement symbol. The characters at the seventh position are the preset hyphen "-" in one and the letter a in the other, so the character at the seventh position is replaced with the preset replacement symbol corresponding to the letter, that is, the second preset replacement symbol.

1028: After all the differing characters have been replaced through the above steps, the resulting URL data is the URL template corresponding to the i-th URL data and the (i+1)-th URL data. For example, the URL template for the two URL data www.g2-bc.com and www.g-abb.com above is www.g#@b@.com.
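Steps 1021 to 1028 can be illustrated with a minimal Python sketch. The function names (symbol_for, derive_template, derive_templates) are hypothetical and do not come from the source. The symbols "#" and "@" follow the examples given for the preset replacement symbols. Leaving a character that is neither a digit, a letter, nor the hyphen unchanged is an assumption, since the text does not cover that case.

```python
import string
from typing import List, Optional, Sequence

DIGIT_SYMBOL = "#"   # first preset replacement symbol (numeric type)
LETTER_SYMBOL = "@"  # second preset replacement symbol (letter type)
HYPHEN = "-"         # preset hyphen


def symbol_for(ch: str) -> str:
    """Return the replacement symbol for the type of a character."""
    if ch in string.digits:
        return DIGIT_SYMBOL
    if ch in string.ascii_letters:
        return LETTER_SYMBOL
    # Assumption: a character of any other type is kept unchanged.
    return ch


def derive_template(url_a: str, url_b: str) -> Optional[str]:
    """Return the template for two equal-length URL data, else None."""
    if len(url_a) != len(url_b):
        return None  # step 1021: only equal-length URL data are compared
    out = []
    for a, b in zip(url_a, url_b):
        if a == b:
            out.append(a)              # step 1022: record the same character
        elif a == HYPHEN:
            out.append(symbol_for(b))  # step 1027: type of the non-hyphen character
        else:
            # Steps 1024 to 1026, and step 1027 when b is the hyphen:
            # in each of these cases the symbol follows the type of a.
            out.append(symbol_for(a))
    return "".join(out)


def derive_templates(urls: Sequence[str]) -> List[str]:
    """Compare the i-th and (i+1)-th entries for i = 1, ..., m-1."""
    templates = []
    for url_a, url_b in zip(urls, urls[1:]):
        template = derive_template(url_a, url_b)
        if template is not None:
            templates.append(template)
    return templates


print(derive_template("www.g2-bc.com", "www.g-abb.com"))  # www.g#@b@.com
```

The printed result matches the template stated in step 1028 for the same pair of URL data.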

The process of expanding each URL template is as shown in FIG. 3, and may include the following steps:

1031: Count the number of occurrences of the URL templates to obtain an ordered list of URL templates. That is, the number of occurrences of each URL template is counted, and identical URL templates are then merged to reduce the number of URL templates.
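A minimal Python sketch of step 1031 follows. The function name count_templates is hypothetical, and ordering by descending count is an assumption: the text only says that the list is ordered.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_templates(templates: Iterable[str]) -> List[Tuple[str, int]]:
    """Merge identical templates and return (template, count) pairs."""
    counts = Counter(templates)
    # Assumed order: most frequent first, ties broken by template text.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ordered = count_templates(["www.g#@b@.com", "a#.cn", "www.g#@b@.com"])
print(ordered)  # [('www.g#@b@.com', 2), ('a#.cn', 1)]
```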

1032: Keep the URL template in the URL template list that meets the preset conditions. After the URL template in the URL template list is compared with the preset condition, the partial URL template is deleted, and then the URL template conforming to the preset condition is retained as the URL template finally used for the extension, thereby further reducing the number of URL templates.

In the embodiment of the present invention, the preset condition may be determined according to the actual application, for example by limiting the number of preset replacement symbols in a URL template and the number of occurrences of the URL template. With charvalue as the preset maximum number of preset replacement symbols and numvalue as the preset threshold on the number of times a URL template appears, the ordered list of URL templates is traversed and the number of templates is controlled by the following conditions:

If the number of "@" and "#" symbols in the URL template is not greater than the value of charvalue, the URL template is retained; otherwise it is deleted;

If the number of occurrences of the URL template is not less than the value of numvalue, the URL template is retained; otherwise it is deleted.
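The two conditions can be sketched in Python as below. The function name keep_template is hypothetical. The sketch assumes that "@" and "#" are counted together against charvalue and that a template must satisfy both conditions to be retained. The values 3 and 2 used in the example are illustrative only.

```python
def keep_template(template: str, occurrences: int,
                  charvalue: int, numvalue: int) -> bool:
    """Return True when the template meets both preset conditions."""
    # Assumption: "@" and "#" are counted together.
    placeholders = template.count("@") + template.count("#")
    return placeholders <= charvalue and occurrences >= numvalue


# Illustrative thresholds, not values taken from the source.
print(keep_template("www.g#@b@.com", occurrences=2, charvalue=3, numvalue=2))
```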

1033: The retained URL templates are extended, where the extension process includes: replacing the first preset replacement symbol in the URL template, in turn, with every character of the type corresponding to the first preset replacement symbol, and replacing the second preset replacement symbol in the URL template, in turn, with every character of the type corresponding to the second preset replacement symbol, to obtain the extended URL data corresponding to each URL template.

Take the case where the first preset replacement symbol is "#" and the second preset replacement symbol is "@" as an example. Each first preset replacement symbol in the URL template is replaced in turn with the 10 numbers 0 to 9, and each second preset replacement symbol in the URL template is replaced in turn with the 26 English letters a to z. After every preset replacement symbol in the URL template has been replaced, a plurality of URL data corresponding to each URL template is obtained.
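The expansion described above can be sketched in Python with itertools.product. The function name expand_template is hypothetical, and the sketch assumes that "#" and "@" never occur as literal characters in a template.

```python
import itertools
import string
from typing import Iterator


def expand_template(template: str) -> Iterator[str]:
    """Yield every URL data obtained by filling in the placeholders."""
    choices = []
    for ch in template:
        if ch == "#":
            choices.append(string.digits)           # numbers 0 to 9
        elif ch == "@":
            choices.append(string.ascii_lowercase)  # letters a to z
        else:
            choices.append(ch)                      # fixed character
    for combo in itertools.product(*choices):
        yield "".join(combo)


# One "#" and two "@" give 10 * 26 * 26 candidates.
print(sum(1 for _ in expand_template("www.g#@b@.com")))  # 6760
```

A template with a digit placeholders and b letter placeholders yields 10^a * 26^b URL data, which is one reason to limit the number of placeholders per template when retaining templates.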

The reason for this replacement is that the first preset replacement symbol and the second preset replacement symbol in a URL template each correspond to a character type, and this correspondence reflects the type of character into which the URL data of phishing websites is most easily altered. That is, the embodiment of the present invention derives, from statistics over the URL data of phishing websites, the character types that are easily tampered with, so that the obtained URL templates and the extended URL data match the way phishing websites tamper with URL data. This makes the URL templates and the extended URL data highly targeted, so that more accurate URL data can be obtained from less data, and the extended URL data can be used as a data source for phishing detection, improving versatility.

1034: Perform deduplication processing on the extended URL data and all known URL data to obtain URL data that can be regarded as a phishing website.
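A minimal Python sketch of step 1034 follows. The text does not say whether known URL data is removed from the result or merged into it. The sketch assumes one reading: extended URL data that is already known, or that repeats an earlier extended entry, is dropped. The function name deduplicate is hypothetical.

```python
from typing import Iterable, List


def deduplicate(expanded: Iterable[str], known: Iterable[str]) -> List[str]:
    """Drop extended URL data that is already known or repeated."""
    known_set = set(known)
    seen = set()
    result = []
    for url in expanded:
        if url in known_set or url in seen:
            continue
        seen.add(url)
        result.append(url)  # first occurrence, in original order
    return result


print(deduplicate(["a1.cn", "a2.cn", "a1.cn", "a3.cn"], known=["a2.cn"]))
# ['a1.cn', 'a3.cn']
```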

As can be seen from the foregoing technical solutions, the embodiment of the present invention can obtain URL templates based on all known URL data, and extend the URL templates to obtain, for each URL template, URL data that can be regarded as a phishing website. Phishing websites are thus acquired automatically and actively, which effectively reduces the lag and human dependence of phishing discovery. In addition, the detection range can be expanded, the loss of profit can be reduced, and the extension can be based on the URL data of known phishing websites, thereby improving the secondary utilization rate of known phishing websites.

In addition, the data source detection method provided by the embodiment of the present invention may further sort the URL data after it is obtained, so that URL data with higher similarity is placed together. This makes it possible to determine which type of character legitimate URL data is most often altered into, so that the URL data can be expanded in a targeted manner. FIG. 4 shows another flowchart of the data source detecting method provided by the embodiment of the present invention, which may include the following steps:

401: Acquire all known URL data, wherein all known URL data includes at least URL data of known phishing websites. That is to say, in the embodiment of the present invention, at least the URL data of known phishing websites can be used as the basis for data expansion, for example expansion based on www.1oo86.cn, thereby improving the secondary utilization rate of the URL data of known phishing websites. Of course, in the embodiment of the present invention, the URL data of other known legitimate websites may also be extended, for example based on www.360.com.

402: Obtain the second-level domain name of each URL data to form a second-level domain name collection list. For example, the second-level domain name of "www.abc.com" is "abc.com"; the second-level domain name of each URL is stored in a list, forming the second-level domain name collection list.
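Step 402 can be sketched in Python as below. The function name second_level_domain is hypothetical. The sketch assumes that the input is a bare host name without scheme or path, and it simply keeps the last two labels, so multi-label suffixes such as ".com.cn" are not handled correctly.

```python
from typing import Iterable, List


def second_level_domain(host: str) -> str:
    """Return the last two labels of a host name (naive assumption)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def build_sld_list(hosts: Iterable[str]) -> List[str]:
    """Form the second-level domain name collection list."""
    return [second_level_domain(host) for host in hosts]


print(build_sld_list(["www.abc.com", "www.10086.cn"]))  # ['abc.com', '10086.cn']
```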

403: Classify according to the top-level domain name in the second-level domain name collection list to obtain sub-level second-level domain name collection lists with different top-level domain names. For example, since the top-level domain names of "www.abc.com" and "www.efg.com" are both ".com", the two URL data are stored in the sub-level second-level domain name collection list corresponding to ".com".
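A minimal Python sketch of step 403 follows. The function name group_by_tld is hypothetical, and the top-level domain name is taken to be the text after the last dot, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_tld(second_level_domains: Iterable[str]) -> Dict[str, List[str]]:
    """Split the collection list into one sub-list per top-level domain."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for domain in second_level_domains:
        # Assumption: the top-level domain is the last label.
        tld = "." + domain.rsplit(".", 1)[-1]
        groups[tld].append(domain)
    return dict(groups)


print(group_by_tld(["abc.com", "efg.com", "10086.cn"]))
# {'.com': ['abc.com', 'efg.com'], '.cn': ['10086.cn']}
```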

404: Sort the URL data in each sub-level second-level domain name collection list so that URL data with higher similarity is adjacent in the sorting. For example, the URL data in each sub-level second-level domain name collection list is first classified based on the preset hyphen into URL data containing the preset hyphen and URL data not containing the preset hyphen; the URL data containing the preset hyphen and the URL data not containing it are then each sorted by length and by alphabetical order. In this way URL data with higher similarity is gathered together, which makes it possible to determine which type of character legitimate URL data is most often altered into and to extend the URL data in a targeted manner.
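Step 404 can be sketched in Python as below. The function name sort_for_comparison is hypothetical. Placing the hyphenated group first and sorting by ascending length are assumptions, since the text fixes only the sort keys (length, then alphabetical order).

```python
from typing import Iterable, List


def sort_for_comparison(domains: Iterable[str], hyphen: str = "-") -> List[str]:
    """Classify by the preset hyphen, then sort each class."""
    items = list(domains)
    with_hyphen = [d for d in items if hyphen in d]
    without_hyphen = [d for d in items if hyphen not in d]

    def key(domain: str):
        return (len(domain), domain)  # length first, then alphabetical

    # Assumed order: hyphenated entries first, then the rest.
    return sorted(with_hyphen, key=key) + sorted(without_hyphen, key=key)


print(sort_for_comparison(["abc.com", "g-abb.com", "ab.com", "g2-bc.com"]))
# ['g-abb.com', 'g2-bc.com', 'ab.com', 'abc.com']
```

With this ordering, entries of equal length land next to each other, so comparing each entry with its successor tends to compare similar URL data.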

405: Perform a pairwise comparison on all known URL data to obtain multiple URL templates.

406: Extend each URL template to obtain URL data corresponding to each URL template that can be regarded as a phishing website.

In the embodiment of the present invention, the execution of steps 405 and 406 differs from that of steps 102 and 103 above only in that the URL template is obtained based on the second-level domain name corresponding to each URL data in the sub-level second-level domain name collection lists; the replacement and extension methods are the same. The process of obtaining a URL template is:

(1) For each sorted sub-level second-level domain name list, sequentially read each second-level domain name in the sub-level second-level domain name list:

If the current row is the first row, the second row is then read as well, and the two second-level domain names read are assigned to the variables domain1 and domain2 respectively;

If the current row is not the first row, the current value of the variable domain2 is first assigned to the variable domain1, then the next row is read and its value is assigned to the variable domain2.

(2) If the lengths of the two variables domain1, domain2 are the same (assuming length=n), then the characters at each position of the two variables are compared in order from left to right:

1) If the characters at the i-th (i = 1, 2, ..., n) positions are the same, record the same character and continue to compare the next character;

2) If the characters at the i-th (i = 1, 2, ..., n) positions are not the same, proceed as follows:

a) If both characters are of the digit (0~9) type, replace them with the first preset replacement symbol "#";

b) If both characters are of the English letter (a~z) type, replace them with the second preset replacement symbol "@";

c) If one character is of the digit (0~9) type and the other is of the English letter (a~z) type, replace according to the type of the character at the i-th position of domain1: if the i-th character of domain1 is a digit 0~9, replace with "#"; if the i-th character of domain1 is an English letter a~z, replace with "@";

d) If one of the two characters is a hyphen "-", replace according to the type of the other character.

3) Repeat steps 1) to 2) above to generate a URL template.

(3) If the lengths of the two variables domain1, domain2 are different, jump to step (1) to execute.

(4) Repeat steps (1) through (3) until the end of the sub-level domain name list.
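Steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `char_type`, `make_template` and `adjacent_templates` are hypothetical, and a pair whose differing characters fall outside rules a) to d) (for example a "." compared with a letter) is simply skipped, which the text above does not specify.

```python
SYMBOL = {"digit": "#", "letter": "@"}  # first and second preset replacement symbols


def char_type(ch):
    """Classify one character as 'digit', 'letter', 'hyphen' or 'other'."""
    if "0" <= ch <= "9":
        return "digit"
    if "a" <= ch <= "z":
        return "letter"
    if ch == "-":
        return "hyphen"
    return "other"


def make_template(domain1, domain2):
    """Compare two equal-length domains position by position (step (2)).

    Returns the template string, or None when the lengths differ (step (3))
    or when a differing pair is not covered by rules a) to d) (assumption).
    """
    if len(domain1) != len(domain2):
        return None
    out = []
    for c1, c2 in zip(domain1, domain2):
        if c1 == c2:
            out.append(c1)  # rule 1): identical characters are recorded as-is
            continue
        t1, t2 = char_type(c1), char_type(c2)
        if "other" in (t1, t2):
            return None  # assumption: unsupported character pair, skip this pair
        if t1 == "hyphen":
            chosen = t2  # rule d): use the type of the non-hyphen character
        else:
            chosen = t1  # rules a), b), c), and d) when domain2 holds the hyphen
        out.append(SYMBOL[chosen])
    return "".join(out)


def adjacent_templates(sorted_domains):
    """Slide over the sorted list, comparing each row with the next (steps (1), (4))."""
    templates = []
    for domain1, domain2 in zip(sorted_domains, sorted_domains[1:]):
        template = make_template(domain1, domain2)
        if template is not None:
            templates.append(template)
    return templates


print(make_template("g2-bc.com", "g-abb.com"))
print(adjacent_templates(["1bc.com", "abc.com", "acc.com", "a-c.com"]))
```

Because domain1 is always the earlier row, rule c) makes the result depend on the sort order: comparing "1bc.com" with "abc.com" yields a "#" at the first position, whereas the reverse order would yield "@".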

For the extension process of the URL template, refer to FIG. 3; it is not described again in the embodiment of the present invention.

The data source detection method provided by the embodiment of the present invention is described below by using the preset hyphen as "-", the first preset replacement symbol being "#", and the second preset replacement symbol being "@" as an example. Assume that all known URL data is URL data of a known phishing website, as shown in Table 1.

Table 1 shows the URL data of the phishing website.

www.abc.com	www.a-c.com	mg.afgc.com	tg.agm.net	www.agbc.com
m.acc.com	www.g2-bc.com	www.g-abb.com	wap.abc.net	www.1bc.com

The list of second-level domain names obtained from the URL data in Table 1 above is: abc.com, a-c.com, afgc.com, agm.net, agbc.com, acc.com, g2-bc.com, g-abb.com, abc.net, 1bc.com

After classification based on the top-level domain, two sub-level second-level domain name lists are obtained:

.com list: abc.com, acc.com, agbc.com, afgc.com, g-abb.com, a-c.com, g2-bc.com, 1bc.com

.net list: abc.net, agm.net
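The grouping shown above can be sketched as follows. This is an illustrative sketch only: the helper names `second_level_domain` and `group_by_tld` are hypothetical, and taking the last two dot-separated labels as the second-level domain is an assumption that does not handle multi-label suffixes such as ".com.cn". The order inside each group follows the input order, which may differ from the order in which the lists are printed above.

```python
from collections import defaultdict


def second_level_domain(host):
    """Return the last two labels of a host name, e.g. 'www.abc.com' -> 'abc.com'."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def group_by_tld(hosts):
    """Map each top-level domain to the second-level domains found under it."""
    groups = defaultdict(list)
    for host in hosts:
        sld = second_level_domain(host)
        tld = "." + sld.rsplit(".", 1)[-1]
        groups[tld].append(sld)
    return dict(groups)


# URL data of Table 1
table1 = [
    "www.abc.com", "www.a-c.com", "mg.afgc.com", "tg.agm.net", "www.agbc.com",
    "m.acc.com", "www.g2-bc.com", "www.g-abb.com", "wap.abc.net", "www.1bc.com",
]
for tld, domains in group_by_tld(table1).items():
    print(tld, domains)
```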

Sort the URL data in the above two sub-level second-level domain name lists, and the sorting result is shown in Table 2:

Table 2 Sort results of sub-level domain name list

.com sort results	.net sort results
g2-bc.com	abc.net
g-abb.com	agm.net
afgc.com
agbc.com
1bc.com
abc.com
acc.com
a-c.com
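The exact comparison rule behind Table 2 is not spelled out in the text. As a hedged sketch, one sort key that happens to reproduce Table 2 is: longer domain names first, and within the same length a character-by-character comparison in which the hyphen ranks after digits and letters. Both the key and the name `sort_key` are assumptions made for illustration.

```python
def sort_key(domain):
    """Longer names first; for equal length, compare per character with '-' ranked last."""
    return (-len(domain), [(ch == "-", ch) for ch in domain])


com_list = ["abc.com", "acc.com", "agbc.com", "afgc.com",
            "g-abb.com", "a-c.com", "g2-bc.com", "1bc.com"]
net_list = ["abc.net", "agm.net"]

print(sorted(com_list, key=sort_key))
print(sorted(net_list, key=sort_key))
```

Ranking the hyphen last keeps a hyphenated name such as a-c.com next to its same-length neighbours (here acc.com), so the pair is still compared when the list is read row by row.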

For the above two sorting results, the result of .com sorting is taken as an example to show how to get the URL template.

First, read the two URL data g2-bc.com and g-abb.com. Since the two have the same length, the characters at each position are compared from left to right. At the 2nd and 3rd positions, one URL data has the preset hyphen and the other does not; the non-hyphen characters at these two positions are of the digit type and the letter type respectively, so the character at the 2nd position is replaced with "#" and the character at the 3rd position is replaced with "@". The characters at the 5th position are different and both are of the letter type, so they are replaced with "@". The replacement process is shown in FIG. 5, and the resulting URL template is g#@b@.com.

Then read afgc.com and compare it with g-abb.com. Because the lengths of the two URL data are different, no URL template is produced, and the remaining URL data in the list continue to be read. Next, agbc.com is read and compared with afgc.com; the URL template obtained by comparing these two URL data is: a@@c.com.

Read 1bc.com and compare it with agbc.com. Because the lengths are different, continue reading the list to obtain abc.com, and compare it with 1bc.com. The lengths are the same and 1bc.com is the earlier of the two, so at the 1st position, where a digit is compared with a letter, the type of the character in 1bc.com is used. Comparing from left to right, the URL template is: #bc.com.

Read acc.com and compare it with abc.com. Since the lengths are the same, comparing from left to right gives the URL template: a@c.com.

Read a-c.com and compare it with acc.com. The lengths are the same and acc.com is the earlier of the two; the hyphen at the 2nd position is replaced according to the type of the other character, which is a letter, so comparing from left to right gives the URL template: a@c.com.

Finally, according to the number of occurrences of each URL template, an ordered list of URL templates is obtained: a@c.com (2 times), a@@c.com (1 time), g#@b@.com (1 time), #bc.com (1 time).

Taking as the preset condition that the number of occurrences of # and @ in a URL template is not more than two and that the URL template occurs at least once, the URL templates retained in the URL template list are: a@c.com, a@@c.com, #bc.com.
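The counting and filtering step can be sketched as below. The thresholds `max_symbols=2` and `min_count=1` are assumptions read off this worked example (g#@b@.com contains three replacement symbols and is dropped); the actual preset condition may differ, and the function name `retain_templates` is hypothetical.

```python
from collections import Counter


def retain_templates(templates, max_symbols=2, min_count=1):
    """Count each template, order by frequency, and keep those meeting the condition."""
    counts = Counter(templates)
    retained = []
    # most_common() orders by count, ties in order of first occurrence
    for template, count in counts.most_common():
        symbols = template.count("#") + template.count("@")
        if count >= min_count and symbols <= max_symbols:
            retained.append(template)
    return retained


# templates produced by the pairwise comparison of the .com list
templates = ["g#@b@.com", "a@@c.com", "#bc.com", "a@c.com", "a@c.com"]
print(retain_templates(templates))
```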

The above three URL templates are then extended. Taking #bc.com as an example, the extended URL data includes: 0bc.com, 1bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.

Finally, the above extended URL data is deduplicated against the URL data of the known phishing websites (1bc.com is a repeat), and the final extended URL data that can be regarded as phishing websites is: 0bc.com, 2bc.com, 3bc.com, 4bc.com, 5bc.com, 6bc.com, 7bc.com, 8bc.com, 9bc.com.
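The extension and deduplication of #bc.com can be sketched as follows. This is a hedged illustration: the names `expand_template` and `expand_and_deduplicate` are hypothetical, and for templates with several replacement symbols the sketch assumes that every combination of substitutions is generated, which this example (a single "#") does not confirm.

```python
from itertools import product
import string

# characters each preset replacement symbol stands for
ALPHABETS = {"#": string.digits, "@": string.ascii_lowercase}


def expand_template(template):
    """Replace every '#' with each digit and every '@' with each lowercase letter."""
    choices = [ALPHABETS.get(ch, ch) for ch in template]  # literal chars have one choice
    return ["".join(combo) for combo in product(*choices)]


def expand_and_deduplicate(templates, known):
    """Expand all templates and drop candidates that are already known or repeated."""
    known = set(known)
    seen = set()
    result = []
    for template in templates:
        for candidate in expand_template(template):
            if candidate not in known and candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result


print(expand_and_deduplicate(["#bc.com"], known=["1bc.com"]))
```

Under the all-combinations assumption a template with two "@" symbols, such as a@@c.com, expands to 26 × 26 = 676 candidates, which is one reason to cap the number of replacement symbols per template.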

For the foregoing method embodiments, for the sake of brevity, they are all described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention, Some steps can be performed in other orders or at the same time. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Corresponding to the foregoing method embodiment, the embodiment of the present invention further provides a data source expansion apparatus. The schematic diagram of the structure is as shown in FIG. 6, and may include: an obtaining unit 11, a comparing unit 12, and an expanding unit 13.

The obtaining unit 11 is configured to obtain all known URL data, wherein all known URL data includes at least URL data of a known phishing website. That is, in the embodiment of the present invention, at least the URL data of known phishing websites serves as the basis for data extension, for example extension based on www.1oo86.cn, thereby improving the secondary utilization rate of the URL data of known phishing websites. Of course, in the embodiment of the present invention, extension may also be based on the URL data of other known legitimate websites, for example www.360.com.

The comparing unit 12 is configured to perform a pairwise comparison on all known URL data to obtain a plurality of URL templates. All known URL data are compared in pairs because multiple URL data may correspond to the same URL template; pairwise comparison makes it convenient to count the number of occurrences of each URL template, so that extension based on a URL template is more targeted.

In the embodiment of the present invention, the comparison unit 12 may obtain a plurality of URL templates by using the structure shown in FIG. 7, where the comparison unit 12 may include: a comparison subunit 121, a recording subunit 122, an acquisition subunit 123, a first replacement subunit 124, a second replacement subunit 125, a third replacement subunit 126, a fourth replacement subunit 127, and a configuration subunit 128.

The comparing subunit 121 is configured to compare the characters at each position in the i-th URL data and the i+1th URL data in sequence when the lengths of the i-th URL data and the i+1th URL data are the same. i is a natural number, and i = 1, 2, ..., m-1, m is the total number of URL data.

Take www.g2-bc.com as the i-th URL data and www.g-abb.com as the i+1th URL data as an example. After length comparison, the lengths of the two URL data are found to be the same, so the characters at each position in the two URL data can be compared in sequence. If the lengths of the two URL data were different, other URL data would continue to be obtained for comparison.

The recording sub-unit 122 is configured to record the character at the j-th position when the characters at the j-th position are the same, and to trigger the comparison sub-unit 121 to continue comparing the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th URL data. For example, if the characters at the first to fourth positions are the same, the characters at these four positions are recorded, and the comparison continues with the characters at the fifth position.

The obtaining sub-unit 123 is configured to acquire the type of the character at the j-th position in the i-th URL data and the i+1th URL data when the characters at the j-th position are different.

The first replacement subunit 124 is configured to replace the character at the j-th position with the first preset replacement symbol when the characters at the j-th position in the i-th URL data and the i+1th URL data are both of the digit type.

The second replacement subunit 125 is configured to replace the character at the j-th position with the second preset replacement symbol when the characters at the j-th position in the i-th URL data and the i+1th URL data are both of the letter type.

The second preset replacement symbol is set in advance to replace the corresponding character in the URL data. When the characters at the j-th position in the two URL data are both of the letter type, the second preset replacement symbol is used for the replacement; for example, if the second preset replacement symbol is "@", the character at the j-th position is replaced with "@". Of course, the second preset replacement symbol may also be another symbol, depending on the actual application.

The third replacement sub-unit 126 is configured to, when the type of the character at the j-th position in the i-th URL data differs from the type of the character at the j-th position in the i+1th URL data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th URL data.

The fourth replacement subunit 127 is configured to, when the character at the j-th position in the i-th URL data or the i+1th URL data is a preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen.

The configuration subunit 128 is configured to take the URL data obtained after all differing characters have been replaced as the URL template corresponding to the i-th URL data and the i+1th URL data. For example, the URL template for the two URL data www.g2-bc.com and www.g-abb.com above is www.g#@b@.com.

The extension unit 13 is configured to extend each URL template to obtain URL data, corresponding to each URL template, that can be regarded as a phishing website. In the embodiment of the present invention, the extension unit 13 may extend each URL template by using the structure shown in FIG. 8. The extension unit 13 includes: a statistics subunit 131, a retention subunit 132, an extension subunit 133, and a deduplication subunit 134.

The statistics sub-unit 131 is configured to perform statistics on the number of URL templates to obtain an ordered list of URL templates. The number of times the URL template is counted is to count the number of occurrences of each URL template, and then merge the same URL templates to reduce the number of URL templates.

The retention sub-unit 132 is configured to retain the URL template in the URL template list that meets the preset condition. After the URL template in the URL template list is compared with the preset condition, the partial URL template is deleted, and then the URL template conforming to the preset condition is retained as the URL template finally used for the extension, thereby further reducing the number of URL templates.

The extension sub-unit 133 is configured to extend the retained URL templates, where the extension process includes: sequentially replacing the first preset replacement symbol in the URL template with each of the characters of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the URL template with each of the characters of the type corresponding to the second preset replacement symbol, to obtain the extended URL data corresponding to each URL template.

The deduplication sub-unit 134 is configured to perform deduplication processing on the extended URL data against all known URL data to obtain URL data that can be regarded as a phishing website.

According to the above technical solution, the embodiment of the present invention can obtain URL templates based on all known URL data and extend the URL templates to obtain URL data, corresponding to each URL template, that can be regarded as a phishing website. This realizes automatic and active acquisition of phishing websites and effectively reduces the lag and human dependence of phishing website discovery. In addition, the detection range can be expanded and the loss of interests can be reduced; and because the extension is based on the URL data of known phishing websites, the secondary utilization rate of known phishing websites is improved.

In addition, the data source detecting apparatus provided by the embodiment of the present invention may further sort the URL data after acquiring the URL data, so that URL data with higher similarity are adjacent to one another. Gathering similar URL data together makes it possible to count which type of character legitimate URL data is most often tampered into, so that the URL data can be extended in a targeted manner. FIG. 9 shows another schematic structural diagram of a data source detecting apparatus according to an embodiment of the present invention. On the basis of FIG. 6, the apparatus may further include: a list forming unit 14, a classifying unit 15, and a sorting unit 16.

The list forming unit 14 is configured to obtain the second-level domain name of each URL data and form a second-level domain name collection list. For example, the second-level domain name of “www.abc.com” is “abc.com”; the second-level domain name of each URL is stored in one list, which forms the second-level domain name collection list.

The classification unit 15 is configured to classify according to the top-level domain names in the second-level domain name collection list and obtain sub-level second-level domain name collection lists, each with a different top-level domain name. For example, the top-level domains of "www.abc.com" and "www.efg.com" are both ".com", so these two URL data are stored in the sub-level second-level domain name list corresponding to ".com".

The sorting unit 16 is configured to sort the URL data in each of the sub-level second-level domain name collection lists, so that URL data with higher similarity are adjacent in the sorting. For example, the sorting unit includes a classification subunit and a sorting subunit. The classification subunit is configured to classify the URL data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain URL data containing the preset hyphen and URL data not containing the preset hyphen. The sorting subunit is configured to sort the URL data containing the preset hyphen and the URL data not containing the preset hyphen by length and by alphabetical order, so that URL data with higher similarity are gathered together, which makes it possible to count which type of character legitimate URL data is most often tampered into and to extend the URL data in a targeted manner.

For the working process of the comparison unit 12 and the extension unit 13 in the data source detection device shown in FIG. 9, reference may be made to the related description in the foregoing method embodiments, which is not described in the embodiment of the present invention.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Since the device embodiment is basically similar to the method embodiment, its description is relatively brief, and for the relevant parts reference may be made to the description of the method embodiment.

Finally, it should also be noted that in this context, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprising", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a plurality of elements includes not only those elements but also other elements not specifically listed, or elements that are inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments are obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but the scope of the invention is to be accorded

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can also make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be considered as falling within the scope of protection of the present invention.

Claims

A data source extension method, the method comprising:

Obtaining all known uniform resource locator data, wherein the all known uniform resource locator data includes at least uniform resource locator data of a known phishing website;

Performing a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;

Extending each of the uniform resource locator templates to obtain uniform resource locator data, corresponding to each of the uniform resource locator templates, that can be regarded as a phishing website.
The method according to claim 1, wherein after acquiring all known uniform resource locator data and before performing the pairwise comparison on all the known uniform resource locator data, the method further comprises:

Obtaining a second-level domain name of each uniform resource locator data to form a second-level domain name collection list;

Classifying according to the top-level domain names in the second-level domain name collection list, to obtain sub-level second-level domain name collection lists having different top-level domain names;

Sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity are adjacent in the sorting.
The method according to claim 2, wherein sorting the uniform resource locator data in each sub-level second-level domain name collection list so that uniform resource locator data with higher similarity are adjacent in the sorting comprises:

Classifying the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data containing the preset hyphen and uniform resource locator data not containing the preset hyphen;

Sorting the uniform resource locator data containing the preset hyphen and the uniform resource locator data not containing the preset hyphen by length and by alphabetical order.
The method according to claim 1 or 2, wherein performing the pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates comprises:

When the i-th uniform resource locator data and the i+1th uniform resource locator data have the same length, sequentially comparing the characters at each position in the i-th uniform resource locator data and the i+1th uniform resource locator data, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;

When the characters at the j-th position are the same, recording the character at the j-th position and continuing to compare the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data;

When the characters at the j-th position are different, acquiring the types of the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data;

When the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data are both of the digit type, replacing the character at the j-th position with a first preset replacement symbol;

When the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data are both of the letter type, replacing the character at the j-th position with a second preset replacement symbol;

When the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the i+1th uniform resource locator data, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;

When the character at the j-th position in the i-th uniform resource locator data or the i+1th uniform resource locator data is a preset hyphen, replacing the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;

Taking the uniform resource locator data obtained after all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the i+1th uniform resource locator data.
The method according to claim 4, wherein extending each of the uniform resource locator templates to obtain uniform resource locator data, corresponding to each of the uniform resource locator templates, that can be regarded as a phishing website comprises:

Performing statistics on the number of occurrences of the uniform resource locator templates to obtain an ordered list of uniform resource locator templates;

Retaining the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;

Extending the retained uniform resource locator templates, where the extension process includes: sequentially replacing the first preset replacement symbol in the uniform resource locator template with each of the characters of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the uniform resource locator template with each of the characters of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates;

The extended uniform resource locator data is deduplicated with all known uniform resource locator data to obtain uniform resource locator data that can be regarded as a phishing website.
A data source expansion device, characterized in that the device comprises:

An obtaining unit, configured to acquire all known uniform resource locator data, where the all known uniform resource locator data includes at least uniform resource locator data of a known phishing website;

a comparison unit, configured to perform a pairwise comparison on all the known uniform resource locator data to obtain a plurality of uniform resource locator templates;

And an extension unit, configured to extend each of the uniform resource locator templates, and obtain uniform resource locator data corresponding to each of the uniform resource locator templates that can be regarded as a phishing website.
The device according to claim 6, wherein the device further comprises:

a list forming unit, configured to acquire a second-level domain name of each uniform resource locator data, to form a second-level domain name set list;

a classification unit, configured to classify according to the top-level domain names in the second-level domain name collection list, to obtain sub-level second-level domain name collection lists having different top-level domain names;

And a sorting unit, configured to sort the uniform resource locator data in each of the sub-level second-level domain name collection lists, so that uniform resource locator data with higher similarity are adjacent in the sorting.
The device according to claim 7, wherein the sorting unit comprises:

a classification subunit, configured to classify the uniform resource locator data in each sub-level second-level domain name collection list based on a preset hyphen, to obtain uniform resource locator data containing the preset hyphen and uniform resource locator data not containing the preset hyphen;

And a sorting subunit, configured to sequentially sort the uniform resource locator data including the preset hyphen and the uniform resource locator data not including the preset hyphen in order of length and alphabetic order.
The device according to claim 6 or 7, wherein the comparing unit comprises:

a comparison subunit, configured to sequentially compare the characters at each position in the i-th uniform resource locator data and the i+1th uniform resource locator data when the i-th uniform resource locator data and the i+1th uniform resource locator data have the same length, where i is a natural number, i = 1, 2, ..., m-1, and m is the total number of uniform resource locator data;

a recording subunit, configured to record the character at the j-th position when the characters at the j-th position are the same, and to trigger the comparison subunit to continue comparing the next character, where j = 1, 2, ..., n, and n is the total number of characters in the i-th uniform resource locator data;

an obtaining subunit, configured to acquire, when the characters at the j-th position are different, the types of the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data;

a first replacement subunit, configured to replace the character at the j-th position with a first preset replacement symbol when the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data are both of the digit type;

a second replacement subunit, configured to replace the character at the j-th position with a second preset replacement symbol when the characters at the j-th position in the i-th uniform resource locator data and the i+1th uniform resource locator data are both of the letter type;

a third replacement subunit, configured to, when the type of the character at the j-th position in the i-th uniform resource locator data differs from the type of the character at the j-th position in the i+1th uniform resource locator data, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position in the i-th uniform resource locator data;

a fourth replacement subunit, configured to, when the character at the j-th position in the i-th uniform resource locator data or the i+1th uniform resource locator data is a preset hyphen, replace the character at the j-th position with the preset replacement symbol corresponding to the type of the character at the j-th position that is not the preset hyphen;

a configuration subunit, configured to take the uniform resource locator data obtained after all differing characters have been replaced as the uniform resource locator template corresponding to the i-th uniform resource locator data and the i+1th uniform resource locator data.
The device according to claim 9, wherein the extension unit comprises:

a statistics subunit, configured to perform statistics on the number of occurrences of the uniform resource locator templates, to obtain an ordered list of uniform resource locator templates;

a retention subunit, configured to retain the uniform resource locator templates in the uniform resource locator template list that meet a preset condition;

an extension subunit, configured to extend the retained uniform resource locator templates, where the extension process includes: sequentially replacing the first preset replacement symbol in the uniform resource locator template with each of the characters of the type corresponding to the first preset replacement symbol, and sequentially replacing the second preset replacement symbol in the uniform resource locator template with each of the characters of the type corresponding to the second preset replacement symbol, to obtain the extended uniform resource locator data corresponding to each of the uniform resource locator templates;

a deduplication subunit, configured to perform deduplication processing on the extended uniform resource locator data against all known uniform resource locator data to obtain uniform resource locator data that can be regarded as a phishing website.