
CN103399912A - Phishing webpage clustering method and device

Info

Publication number: CN103399912A
Application number: CN2013103265762A
Authority: CN
Inventors: 罗焱
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2013-07-30
Filing date: 2013-07-30
Publication date: 2013-11-20
Anticipated expiration: 2033-07-30
Also published as: WO2015014279A1; CN103399912B

Abstract

The invention discloses a phishing webpage clustering method and device. The method comprises: receiving any phishing website; acquiring the domain name of the phishing website; acquiring the domain name type corresponding to the domain name from a preset domain name table; and clustering phishing webpages according to the domain name type. Because clustering is performed after the domain name type corresponding to the phishing website has been acquired, the method and device overcome the two defects that the prior-art clustering method exhibits when a phishing perpetrator uses a secondary domain name under a second-level domain name. Consequently, the false alarm rate and the missed-report rate of phishing webpages are reduced, the detection rate of phishing webpages is improved, and the spread of phishing webpages is stopped at the source.

Description

Phishing webpage clustering method and device

Technical Field

The invention relates to the field of information security, in particular to a phishing webpage clustering method and device.

Background

Phishing webpages are usually disguised as bank or e-commerce webpages, and their main harm is the theft of private information, such as bank account numbers and passwords, submitted by users. Phishing is a form of online fraud in which lawbreakers imitate the URL (webpage address) and page content of a real website by various means, or exploit vulnerabilities in the server program of a real website to insert dangerous HTML (hypertext markup language) code into some of its webpages, in order to trick users into giving up private data such as bank or credit card account numbers and passwords. Clustering phishing webpages refers to grouping webpages used for "phishing" together, so that the groups serve as a comparison criterion for detecting phishing webpages.

In the prior art there are several methods for clustering phishing webpages. The traditional method comprises the following steps: first, determining a standard time period, such as one natural day; second, presetting a threshold value and acquiring the number of phishing webpages detected in any station or domain; third, judging whether the acquired number exceeds the preset threshold value, and marking the whole station or whole domain whose detected number of phishing webpages exceeds the threshold as phishing.
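The traditional whole-domain method described above can be sketched as follows. This is only an illustrative sketch: the function name `prior_art_cluster`, the sample hosts, and the assumption that the "whole domain" is the last two labels of the host are not taken from the specification.

```python
from collections import Counter

def prior_art_cluster(detected_hosts, threshold):
    """Return every whole domain whose detections in one period exceed the threshold.

    detected_hosts: host names of phishing pages detected within the
    standard time period (for example one natural day).
    """
    counts = Counter()
    for host in detected_hosts:
        # Assumption: the "whole domain" is the last two labels of the host.
        whole_domain = ".".join(host.lower().split(".")[-2:])
        counts[whole_domain] += 1
    # The whole domain is marked once its count exceeds the preset threshold.
    return {domain for domain, n in counts.items() if n > threshold}

hosts = ["x.a.cn.ms", "y.a.cn.ms", "z.a.cn.ms"]
flagged = prior_art_cluster(hosts, threshold=2)
```

With these inputs the whole of cn.ms is marked, so an unrelated host such as b.cn.ms would be treated as phishing as well.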

However, since the prior-art phishing webpage clustering method clusters only to a station or a domain, it has two disadvantages when dealing with phishing perpetrators who are adept at using a secondary domain name under a second-level domain name to commit crimes:

First, when a phishing perpetrator uses a secondary domain name under a second-level domain name to commit a crime, the prior art identifies the entire second-level domain as phishing, which may cause false positives for those secondary domain names under it that are not used for crime. For example, suppose a large number of phishing webpages are detected under the second-level domain name cn.ms. Besides the secondary domain name that the perpetrator applied for (e.g., a.cn.ms), other secondary domain names under cn.ms that are not used for "phishing" (e.g., b.cn.ms) may be falsely reported as phishing webpages, so the prior-art clustering method has the disadvantage of a high false alarm rate.

Second, when a phishing perpetrator uses a secondary domain name under a second-level domain name to commit a crime, a technique known as wildcard domain name resolution is typically used. For example, b.a.cn.ms, c.a.cn.ms and d.e.a.cn.ms are all sub-domain names of a.cn.ms. With the prior-art phishing webpage clustering method, the three sub-stations b.a.cn.ms, c.a.cn.ms and d.e.a.cn.ms are usually each identified as phishing webpages. However, because the perpetrator uses wildcard resolution, a large number of further sub-domain names of a.cn.ms, i.e. *.a.cn.ms, can be generated automatically in a very short time. The prior-art method of clustering to the whole station or whole domain therefore cannot stop the spread of phishing webpages at the source.

Disclosure of Invention

In order to overcome the two defects of the prior-art clustering method when a phishing perpetrator uses a secondary domain name under a second-level domain name to commit a crime, the invention provides a phishing webpage clustering method and device, which can reduce the false alarm rate of phishing webpages and stop the spread of phishing webpages at the source.

The invention provides a phishing webpage clustering method, which comprises the following steps:

receiving any phishing website;

acquiring a domain name of the phishing website;

acquiring a domain name type corresponding to the domain name from a preset domain name table;

and according to the domain name type, realizing phishing webpage clustering.

Preferably, the clustering phishing webpages according to the domain name type includes:

judging whether the domain name type is a secondary domain name or not, and if so, acquiring a secondary domain of the domain name;

when the preset clustering information base does not comprise the secondary domain, increasing the counting result of the secondary domain by 1 to obtain the counting result of the secondary domain;

and judging whether the counting result of the secondary domain meets the clustering condition, and if so, clustering the secondary domain of the domain name to the clustering information base.

Preferably, the method further comprises:

when the domain name type is not a secondary domain name, increasing 1 to the counting result of the domain name to obtain the counting result of the domain name;

and judging whether the counting result of the domain name meets the clustering condition, and if so, clustering the domain name to the clustering information base.

Preferably, the clustering condition includes:

within a preset time, the counting result is larger than a preset threshold value;

or,

within the preset time, the ratio of the counting result to the number of websites of the whole domain or the secondary domain is greater than a preset ratio value.

The invention also provides a phishing webpage clustering device, which comprises:

the receiving module is used for receiving any phishing website;

the first acquisition module is used for acquiring the domain name of the phishing website;

the second acquisition module is used for acquiring the domain name type corresponding to the domain name from a preset domain name table;

and the clustering module is used for realizing the clustering of the phishing webpages according to the domain name types.

Preferably, the clustering module includes:

the first judgment sub-module is used for judging whether the domain name type is a secondary domain name;

the first obtaining sub-module is used for obtaining a secondary domain of the domain name when the result of the first judging sub-module is yes;

the first increasing submodule is used for increasing the counting result of the secondary domain by 1 to obtain the counting result of the secondary domain when the preset clustering information base does not comprise the secondary domain;

the second judgment submodule is used for judging whether the counting result of the secondary domain meets the clustering condition or not;

and the first clustering sub-module is used for clustering the secondary domain of the domain name to the clustering information base when the result of the second judging sub-module is yes.

Preferably, the clustering module further comprises:

the second increasing sub-module is used for increasing the counting result of the domain name by 1 to obtain the counting result of the domain name when the domain name type is not the second-level domain name;

the third judgment sub-module is used for judging whether the counting result of the domain name meets the clustering condition;

and the second clustering submodule is used for clustering the domain name to the clustering information base when the result of the third judging submodule is yes.

The method comprises the steps of firstly receiving any phishing website, secondly obtaining a domain name of the phishing website, thirdly obtaining a domain name type corresponding to the domain name from a preset domain name table, and finally realizing phishing webpage clustering according to the domain name type. Compared with the method for clustering the phishing websites to the station or the domain in the prior art, the method can realize the clustering of the phishing webpages according to the domain name type after the domain name type corresponding to the phishing website is obtained, so that two defects generated by the clustering method in the prior art when a phishing criminal uses a secondary domain name of a secondary domain name for crime are effectively overcome, the false alarm rate of the phishing webpages can be reduced, and the spread of the phishing webpages is thoroughly prevented from the source.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a flowchart of a phishing webpage clustering method according to a first embodiment of the present invention;

FIG. 2 is a diagram illustrating the distribution of domain names of phishing websites among various types of domain names;

FIG. 3 is a flowchart of a phishing webpage clustering method according to a second embodiment of the present invention;

FIG. 4 is a structural diagram of a phishing webpage clustering device according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a server according to the third embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Example one

This embodiment uses the domain name type to solve the phishing webpage clustering problem and thereby overcomes the defects of the prior art. When a phishing perpetrator uses a secondary domain name under a second-level domain name to commit a crime, the phishing webpages can be clustered directly to the secondary domain name that the perpetrator applied for, and that secondary domain name is marked as phishing, so that the spread of the phishing webpages is stopped at the source. By exploiting the domain name type in this way, the invention addresses both defects of the prior-art clustering method for perpetrators who use secondary domain names under second-level domain names.

Referring to fig. 1, fig. 1 is a flowchart of a phishing webpage clustering method provided in this embodiment, which may specifically include:

Step 101: receiving any phishing website.

In this embodiment, any phishing website is received, where the phishing website is one that has already been detected; the specific detection method is not limited in this embodiment.

Step 102: acquiring the domain name of the phishing website.

In this embodiment, after the phishing website is received, the domain name of the phishing website is acquired. Here, a domain name is the character-based address that corresponds to a numeric IP address on a network.

In actual operation, there are many ways to acquire the domain name of the phishing website, and this embodiment does not limit this. For example, if the phishing website is b.a.cn.ms, the acquired domain name is cn.ms, and if the phishing website is b.a.com/1.asp, the acquired domain name is a.com.
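One possible way to reproduce the two examples above is sketched below. The specification does not prescribe an extraction method; the helper names `host_of` and `domain_of`, and the rule of keeping the last two labels of the host, are assumptions chosen only because they match both examples.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the lower-cased host part of a URL given with or without a scheme."""
    if "//" not in url:
        url = "//" + url  # urlsplit only recognises a host that follows "//"
    return (urlsplit(url).hostname or "").lower()

def domain_of(url, labels=2):
    """Return the last `labels` labels of the host (an assumed definition of the domain name)."""
    parts = host_of(url).split(".")
    return ".".join(parts[-labels:])

domain_a = domain_of("b.a.cn.ms")       # "cn.ms"
domain_b = domain_of("b.a.com/1.asp")   # "a.com"
```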

Step 103: acquiring the domain name type corresponding to the domain name from a preset domain name table.

In this embodiment, after the domain name is obtained, the domain name type corresponding to the domain name is queried in a preset domain name table, where the domain name types in the preset domain name table may be divided into secondary domain names and non-secondary domain names. Table 1 is the preset domain name table provided in this embodiment. The specific form of the domain name table is not limited to the form of Table 1, and the domain name table may be obtained through manual statistics. The domain name table provided in this embodiment may be as follows:

Domain name type	Domain name
1	tk
1	co.cc
2	in
2	info
3	com
4	cn
5	cn.ms
5	net.tf
6	3322.org
7	vicp.net

TABLE 1
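The lookup in Table 1 can be sketched as a dictionary query. The specification only states that the type is queried in the preset table; matching the longest suffix of the host that appears in the table, and the names `DOMAIN_TABLE` and `lookup_domain_type`, are assumptions made for this illustration.

```python
# Table 1 as a mapping from domain name to domain name type.
DOMAIN_TABLE = {
    "tk": 1, "co.cc": 1,
    "in": 2, "info": 2,
    "com": 3,
    "cn": 4,
    "cn.ms": 5, "net.tf": 5,
    "3322.org": 6,
    "vicp.net": 7,
}

def lookup_domain_type(host, table=DOMAIN_TABLE):
    """Return (table entry, type) for the longest suffix of host found in the table."""
    labels = host.lower().split(".")
    # Try the full host first, then drop one leading label at a time.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in table:
            return candidate, table[candidate]
    return None, None  # the host is not covered by the preset table
```

Under this assumption, b.a.cn.ms resolves to the entry cn.ms with type 5, while b.a.com resolves to the entry com with type 3.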

There are many ways to divide domain name types. For example, they can be divided by the cost of obtaining the domain name. Since a free or low-cost domain name directly reduces the cost of a phishing crime, most phishing perpetrators use such domain names, and dividing domain name types by cost can therefore be effective against phishing crimes to some extent.

If domain name types are classified according to the cost of acquiring domain names, they can be divided into free domain names (including free top-level domain names and free second-level domain names), cheap domain names (including cheap top-level domain names and cheap second-level domain names such as dynamic domain names), and the like. Cheap or free domain names are gradually becoming the areas where phishing websites are most rampant. Referring to FIG. 2, FIG. 2 is a schematic diagram of the distribution of the domain names of phishing websites among various types of domain names, taken from the 2012 annual report of the Anti-Phishing Alliance of China. As can be seen from FIG. 2, apart from .com, the domain names are basically free or cheap domain names. Among them, .tk, .co.cc and .pl, which account for a large share, are representative free top-level domain names; the top-level domain names .ms and .tf contain a large number of free second-level domain names, such as cn.ms, hk.ms, net.tf and eu.tf; cheap top-level domain names include .to, .info, .in, etc.; and cheap second-level domain names are mostly provided by domestic dynamic domain name providers, such as 3322.org.

Step 104: clustering phishing webpages according to the domain name type.

In this embodiment, the clustering of the phishing webpages is realized according to the domain name type corresponding to the domain name.

In actual operation, whether the domain name belongs to a second-level domain name or not can be determined according to the domain name type, and subsequent clustering of phishing webpages is performed according to a determined result.

In this embodiment, firstly, any phishing website is received, secondly, a domain name of the phishing website is obtained, thirdly, a domain name type corresponding to the domain name is obtained from a preset domain name table, and finally, clustering of phishing webpages is achieved according to the domain name type. Compared with the method for clustering the phishing websites to the station or the domain in the prior art, the embodiment can realize the clustering of the phishing webpages according to the domain name type after the domain name type corresponding to the phishing website is obtained, so that two defects generated by the clustering method in the prior art when a phishing criminal uses a secondary domain name of a secondary domain name for crime are effectively overcome, the false alarm rate of the phishing webpages can be reduced, and the propagation of the phishing webpages is thoroughly prevented from the source.

Example two

Referring to fig. 3, fig. 3 is a flowchart of a phishing webpage clustering method provided in this embodiment, which may specifically include:

Step 301: receiving any phishing website;

Step 302: acquiring a domain name of the phishing website;

Step 303: acquiring a domain name type corresponding to the domain name from a preset domain name table;

Steps 301 to 303 in this embodiment are the same as steps 101 to 103 in the first embodiment, and are not described again here.

Step 304: judging whether the domain name type is a secondary domain name; if so, entering step 305, and if not, entering step 309.

In this embodiment, after obtaining the domain name type corresponding to the domain name, first determine whether the domain name type belongs to the second-level domain name, if so, go to step 305, otherwise, go to step 309.

In actual operation, after determining the domain name type, it may be determined whether the domain name type belongs to a second-level domain name with reference to table 2, where table 2 is a domain name type table, and a specific form of the domain name type table is not limited to the form provided in table 2, and meanwhile, the domain name type table may be obtained through manual statistics. In this embodiment, the domain name type corresponding to the domain name may be first obtained through table 1, and then, whether the domain name type belongs to the second-level domain name is queried in table 2. Specifically, table 2 may be as follows:

TABLE 2

Step 305: acquiring a secondary domain of the domain name.

In this embodiment, when the domain name type is a secondary domain name, the secondary domain of the domain name is obtained. For example, if the phishing website is b.a.cn.ms, the domain name of the phishing website is cn.ms, and the secondary domain of the phishing website is a.cn.ms.
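The example above (b.a.cn.ms has the domain name cn.ms and the secondary domain a.cn.ms) can be reproduced with a short helper. This is a sketch under the assumption that the secondary domain is the domain name plus the single label immediately to its left; the function name `secondary_domain` is hypothetical.

```python
def secondary_domain(host, domain):
    """Return the label directly left of `domain`, joined with `domain`.

    Returns None when the host has no label below the domain.
    """
    host, domain = host.lower(), domain.lower()
    if host == domain or not host.endswith("." + domain):
        return None
    prefix = host[: -(len(domain) + 1)]   # labels to the left of the domain
    return prefix.split(".")[-1] + "." + domain
```

Deeper hosts such as d.e.a.cn.ms map to the same secondary domain a.cn.ms, which is what allows hosts generated by wildcard resolution to be counted together.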

Specifically, there are many ways to obtain the secondary domain of the domain name, which is not limited in this embodiment.

Step 306: when the preset clustering information base does not include the secondary domain, increasing the counting result of the secondary domain by 1 to obtain the counting result of the secondary domain.

In this embodiment, it is first determined whether the obtained secondary domain belongs to a preset clustering information base, and if not, the counting result of the secondary domain is increased by 1, so as to obtain the final counting result of the secondary domain.

In actual operation, the number of times the secondary domain is detected is counted in real time, i.e. the counting result is incremented by 1 each time the secondary domain is detected. The counting method is not limited.
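Since the counting method is left open, the following is only one possible sketch of step 306. The class name `DetectionCounter`, the use of a calendar-day string as the time window, and the in-memory dictionary are all assumptions.

```python
from collections import defaultdict

class DetectionCounter:
    """Counts detections per (day, secondary domain) in memory."""

    def __init__(self):
        self.counts = defaultdict(int)  # (day, key) -> number of detections

    def record(self, key, day, cluster_base):
        """Count one detection of `key` unless it is already in the clustering information base.

        Returns the updated count, or None when the key is already clustered.
        """
        if key in cluster_base:
            return None
        self.counts[(day, key)] += 1
        return self.counts[(day, key)]
```

Keying the counter by day keeps detections from different time windows apart, so a later clustering condition can be evaluated per preset time period.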

Step 307: judging whether the counting result of the secondary domain meets the clustering condition; if so, entering step 308.

In this embodiment, after the counting result of the secondary domain is obtained, it is first determined whether the counting result meets a preset clustering condition, if so, step 308 is entered, otherwise, clustering of other phishing webpages may be continued.

In practical operation, the clustering condition may be: within a preset time, the counting result is larger than a preset threshold value; or, within the preset time, the ratio of the counting result to the website of the whole domain or the secondary domain is greater than a preset ratio value.

Referring to Table 2, the clustering condition may be set as "daily whole-domain blacklisting threshold: 50 phishing websites"; step 308 is then entered when the one-day counting result of the secondary domain is greater than 50. Similarly, the clustering condition may be set as "daily clustering blacklisting threshold: phishing website ratio of 50%"; step 308 is then entered when the one-day counting result of the secondary domain accounts for more than 50% of all blacklisted websites.
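The two alternative clustering conditions can be written as a single predicate. This is a sketch: the function name and parameters are hypothetical, the defaults 50 and 0.5 are the example values given above, and `total_sites` stands for whatever denominator an implementation uses for the ratio, which the text does not define precisely.

```python
def meets_clustering_condition(count, total_sites=None, threshold=50, ratio=0.5):
    """Return True when either clustering condition holds for one time window.

    count: detections of the domain or secondary domain within the preset time.
    total_sites: assumed denominator for the ratio condition; skipped when None or 0.
    """
    if count > threshold:                 # condition 1: count exceeds the threshold
        return True
    if total_sites:                       # condition 2: share exceeds the ratio
        return count / total_sites > ratio
    return False
```

Both comparisons are strict, matching the wording "greater than" in the conditions.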

Step 308: clustering the secondary domain of the domain name to the clustering information base.

In this embodiment, when the counting result of the secondary domain meets the preset clustering condition, the secondary domain is clustered into the clustering information base, that is, the secondary domain is determined as a phishing webpage.
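Once a secondary domain is in the clustering information base, any host beneath it can be recognised by a suffix check, including hosts generated later by wildcard resolution. The sketch below assumes the base is a plain set of domain strings; the function name `is_clustered` is hypothetical.

```python
def is_clustered(host, cluster_base):
    """Return True if the host or any of its parent domains is in the cluster base."""
    labels = host.lower().split(".")
    # Compare whole labels only, so "xa.cn.ms" is not mistaken for "a.cn.ms".
    return any(".".join(labels[i:]) in cluster_base for i in range(len(labels)))

cluster_base = {"a.cn.ms"}
covered = is_clustered("d.e.a.cn.ms", cluster_base)   # True
sibling = is_clustered("b.cn.ms", cluster_base)       # False
```

The sibling b.cn.ms stays outside the cluster, which is the false-alarm reduction the method aims for.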

Step 309: increasing the counting result of the domain name by 1 to obtain the counting result of the domain name.

In this embodiment, when the domain name type is not a secondary domain name, the counting result of the domain name is added by 1 to obtain the counting result.

Step 310: judging whether the counting result of the domain name meets the clustering condition; if so, entering step 311.

In this embodiment, after the counting result of the domain name is obtained, it is first determined whether the counting result meets a preset clustering condition, if so, step 311 is performed, otherwise, clustering is performed on other phishing webpages.

For example, for a domain name whose domain name type is 1, the clustering condition obtained through Table 2 may be "daily whole-domain blacklisting threshold: more than 50 phishing websites", that is, the whole domain is clustered when more than 50 of its websites are detected in one day.

Step 311: clustering the domain name to the clustering information base.

In this embodiment, when the counting result of the domain name meets the preset clustering condition, the domain name is clustered to the clustering information base.

According to the embodiment, the phishing webpage clustering can be realized according to the domain name type after the domain name type corresponding to the phishing website is obtained, so that two defects generated by the clustering method in the prior art when a phishing criminal uses a secondary domain name of a secondary domain name for committing a crime are effectively solved, the false alarm rate of the phishing webpage can be reduced, and the propagation of the phishing webpage can be thoroughly prevented from the source.
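Steps 302 to 311 can be put together in one compact sketch. Several parts are assumptions rather than details from the specification: the table is only a fragment of Table 1, the set `SECONDARY_TYPES` is a guess because Table 2 is not reproduced in the text, the domain name is taken as the last two labels of the host, and only the threshold form of the clustering condition is used.

```python
from collections import defaultdict

DOMAIN_TABLE = {"com": 3, "cn": 4, "cn.ms": 5, "net.tf": 5, "3322.org": 6}
SECONDARY_TYPES = {5, 6}  # assumed: types treated as secondary domain names

def cluster_step(host, day, counts, cluster_base, threshold=50):
    """Process one detected host; return the key clustered by this call, if any."""
    labels = host.lower().split(".")
    domain = ".".join(labels[-2:])                                   # step 302
    dtype = DOMAIN_TABLE.get(domain, DOMAIN_TABLE.get(labels[-1]))   # step 303
    if dtype is None:
        return None                       # not covered by the preset table
    if dtype in SECONDARY_TYPES and len(labels) >= 3:                # step 304
        key = ".".join(labels[-3:])                                  # step 305
        if key in cluster_base:                                      # step 306
            return None                   # already clustered, nothing to count
    else:
        key = domain                                                 # step 309
    counts[(day, key)] += 1
    if counts[(day, key)] > threshold:                               # steps 307 / 310
        cluster_base.add(key)                                        # steps 308 / 311
        return key
    return None

counts, cluster_base = defaultdict(int), set()
results = [cluster_step(h, "2013-07-30", counts, cluster_base, threshold=2)
           for h in ["b.a.cn.ms", "c.a.cn.ms", "d.e.a.cn.ms", "x.b.cn.ms", "b.a.com"]]
```

In this run only a.cn.ms reaches the threshold, so neither b.cn.ms nor a.com is added to the clustering information base.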

Example three

Referring to fig. 4, fig. 4 is a structural diagram of a phishing webpage clustering device provided in this embodiment, where the device may include:

a receiving module 401, configured to receive any phishing website;

a first obtaining module 402, configured to obtain a domain name of the phishing website;

a second obtaining module 403, configured to obtain a domain name type corresponding to the domain name from a preset domain name table;

and the clustering module 404 is configured to implement clustering of the phishing webpages according to the domain name types.

Wherein the clustering module may include:

Meanwhile, the clustering module may further include:

Referring to fig. 5, fig. 5 shows a server provided in the present embodiment, where the server may be used to implement the method provided in the foregoing embodiments. Specifically, the method comprises the following steps:

The server may include components such as a memory 510 including one or more readable storage media, an input unit 520, an output unit 530, a processor 540 including one or more processing cores, and a power supply 550. Wherein:

the memory 510 may be used to store software programs and modules, and the processor 540 may execute various functional applications and data processing by operating the software programs and modules stored in the memory 510. The memory 510 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer, and the like. Further, the memory 510 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 510 may also include a memory controller to provide the processor 540 and the input unit 520 access to the memory 510.

The input unit 520 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

The processor 540 is the control center of the server. It connects the various parts using various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 510 and calling the data stored in the memory 510, thereby monitoring the server as a whole. Optionally, the processor 540 may include one or more processing cores.

The server also includes a power supply 550 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 540 via a power management system to manage charging, discharging, and power consumption management functions via the power management system. The power supply 550 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

Specifically, in this embodiment, the processor 540 loads the executable file corresponding to the process of one or more application programs into the memory 510 according to the following instructions, and the processor 540 runs the application programs stored in the memory 510, so as to implement various functions:

receiving any phishing website;

acquiring a domain name of the phishing website;

and according to the domain name type, realizing phishing webpage clustering.

Preferably, the method further comprises:

Preferably, the clustering condition includes:

or,

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The method and the device for clustering phishing webpages provided by the embodiment of the invention are described in detail, a specific example is applied in the text to explain the principle and the embodiment of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A phishing webpage clustering method, the method comprising:

receiving any phishing website;

acquiring a domain name of the phishing website;

and according to the domain name type, realizing phishing webpage clustering.

2. The method of claim 1, wherein the enabling phishing web clustering based on the domain name type comprises:

3. The method of claim 2, further comprising:

4. The method according to claim 2 or 3, wherein the clustering condition comprises:

or,

5. A phishing webpage clustering apparatus, the apparatus comprising:

the receiving module is used for receiving any phishing website;

6. The apparatus of claim 5, wherein the clustering module comprises:

7. The apparatus of claim 6, wherein the clustering module further comprises: