WO2012089005A1

WO2012089005A1 - Method and apparatus for phishing web page detection

Info

Publication number: WO2012089005A1
Application number: PCT/CN2011/083745
Authority: WO
Inventors: 马勺布; 郭辉
Original assignee: 成都市华为赛门铁克科技有限公司
Priority date: 2010-12-31
Filing date: 2011-12-09
Publication date: 2012-07-05
Also published as: US20130086677A1; US9218482B2; CN102082792A

Abstract

Embodiments of the present invention provide a method and an apparatus for phishing web page detection. The method comprises: determining whether a unique domain name corresponding to a web page to be detected exists in a trusted domain name library; when the unique domain name does not exist in the trusted domain name library, determining the similarity between content features extracted from the web page to be detected and content features of each template file in a template file library respectively, the content features at least comprising: an encoding format, a document object model, words, and the number of words; and when the similarity between the content features extracted from the web page to be detected and the content features in at least one template file is greater than a preset similarity threshold, determining that the web page to be detected is a phishing web page. The embodiments of the present invention enhance the accuracy of phishing web page detection results.

Description

Phishing webpage detection method and device

This application claims priority to Chinese Patent Application No. 201010620647.6, filed with the Chinese Patent Office on December 31, 2010 and entitled "Phishing webpage detection method and device", which is incorporated herein by reference in its entirety.

Technical field

The embodiments of the present invention relate to network technologies, and in particular, to a phishing webpage detection method and device.

Background

The phishing website reporting mechanism is a basic solution for protecting against phishing attacks. Anti-phishing organizations encourage end users to submit the phishing information they discover, such as the Uniform Resource Locator (URL) and the mail content. The collected phishing information is then screened and organized into a knowledge base, for example as a URL list or as one-way hash values. The knowledge base is deployed in various security devices or client software. When a device detects that the currently visited webpage exists in the knowledge base, it intercepts and filters the webpage, thereby preventing the phishing attack.

Currently, the general approach is to integrate a phishing detection module into client software. When the user accesses a webpage through the browser, the phishing detection module calculates the suspiciousness of the webpage according to local or remote data query results and, when the suspiciousness is high, sends an alert message to the user. A remote anti-phishing server provides data update, query, filtering and other functions to many client phishing detection modules. The detection basis of the phishing detection module mainly includes: a list of known phishing URLs, a list of phishing IP addresses, a list of trusted domain names, phishing keywords, and general features of phishing pages. The general features of a phishing webpage include: HyperText Markup Language (HTML) input tags, data matching social security numbers, inconsistency between the displayed URL and the real URL, and so on.

The URLs, IP addresses, and domain names of phishing pages change frequently, and many normal web pages also contain phishing keywords. Therefore, when phishing webpages are detected by the above methods, the recognition rate for phishing webpages is low and the false positive rate for normal webpages is high, so the detection accuracy of existing phishing webpage detection methods is low.

Summary of the invention

The embodiment of the invention provides a method and a device for detecting a phishing webpage, which are used to improve the detection accuracy of the phishing website.

The embodiment of the invention provides a method for detecting a phishing webpage, including:

Determining whether there is a unique domain name corresponding to the to-be-detected webpage in the trusted domain name database;

When the unique domain name does not exist in the trusted domain name database, determining the similarity between the content features extracted from the to-be-detected webpage and the content features in each template file of the template file library, where the content features include at least: an encoding format, a document object model, vocabulary, and the number of words;

And when the similarity between the content features extracted from the to-be-detected webpage and the content features in at least one template file is greater than a preset similarity threshold, determining that the to-be-detected webpage is a phishing webpage.

The embodiment of the invention provides a phishing webpage detecting device, which comprises:

a trusted domain name library, configured to save the unique domain names corresponding to trusted webpages;

a template file library, configured to save a plurality of template files, where the template file includes content features extracted from a webpage; the content features include at least: a coding format of the webpage, a document object model, a vocabulary, and a number of words;

a domain name determining module, configured to determine whether there is a unique domain name corresponding to the to-be-detected webpage in the trusted domain name database;

a content extraction module, configured to extract content features from the to-be-detected webpage when the unique domain name does not exist in the trusted domain name database;

a similarity determining module, configured to respectively determine a similarity between a content feature extracted from the to-be-detected webpage and a content feature in each template file of the template file library;

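a phishing webpage determining module, configured to determine that the to-be-detected webpage is a phishing webpage when the similarity between the content features extracted from the to-be-detected webpage and the content features in at least one template file is greater than a preset similarity threshold.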

In the embodiments of the present invention, after it is determined that the unique domain name of the to-be-detected webpage is not a trusted domain name, the similarity between the content features of the to-be-detected webpage (such as the encoding format, the document object model, the vocabulary, and the number of words) and the content features in each template file in the template file library is determined, and this similarity is used to decide whether the to-be-detected webpage is a phishing webpage. Because the present invention uses content features to determine whether a webpage is a phishing webpage, it can improve the accuracy of phishing webpage detection results. In addition, because the present invention first determines, through the continuously updated trusted domain name library, whether the to-be-detected webpage is a trusted webpage, the probability of misidentifying a brand webpage as a phishing webpage is reduced.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of Embodiment 1 of a method for detecting a phishing webpage according to the present invention;

FIG. 2 is a flowchart of Embodiment 2 of a method for detecting a phishing webpage according to the present invention;

FIG. 3 is a flowchart of Embodiment 3 of a method for detecting a phishing webpage according to the present invention;

FIG. 4A is a schematic structural diagram of Embodiment 1 of a phishing webpage detecting device provided by the present invention;

FIG. 4B is a schematic diagram of an application scenario of a phishing webpage detecting device provided by the present invention;

FIG. 5 is a schematic structural diagram of Embodiment 2 of a phishing webpage detecting apparatus provided by the present invention;

FIG. 6 is a schematic structural diagram of a similarity determining module in FIG. 4 or FIG. 5;

FIG. 7 is a schematic structural diagram of Embodiment 3 of a phishing webpage detecting apparatus provided by the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 is a flowchart of Embodiment 1 of a method for detecting a phishing webpage according to the present invention. As shown in FIG. 1, this embodiment includes:

Step 11: Determine whether there is a unique domain name corresponding to the webpage to be detected in the trusted domain name database.

In this embodiment, the webpage to be detected may be acquired in several ways. One is to download the to-be-detected webpage according to its URL and store the downloaded webpage in a storage medium. Another is to extract data packets directly from network communication traffic; in that case, the data packets are further parsed to form an HTML file.

After the webpage to be detected is obtained, the unique domain name is extracted from the URL corresponding to the webpage, and the unique domain name is looked up in the trusted domain name database. When the unique domain name exists in the trusted domain name database, that is, the unique domain name is a trusted domain name, the to-be-detected webpage corresponding to the unique domain name is not a phishing webpage. When the unique domain name does not exist in the trusted domain name database, the to-be-detected webpage may or may not be a phishing webpage, and the subsequent content feature matching process is needed to detect whether it is a phishing webpage.

The trusted domain name library holds the unique domain names of tens of thousands, millions, or even tens of millions of trusted web pages. Its purpose is to exclude brand web pages, or web pages that have never been targeted by phishing, from phishing detection. The trusted domain name library needs to be updated periodically. Domain names are collected and extracted mainly according to the following principles. URLs are read one by one from the collected URL list. When the top-level domain in a URL is a non-country top-level domain, the second-level domain name is extracted from the URL and written into the trusted domain name library. When the top-level domain in the URL is a country-code domain and the second-level domain is a top-level domain string, the third-level domain name is extracted from the URL and written into the trusted domain name library.

For example, when the top-level domain in the URL is a non-country top-level domain such as ".com", ".org", ".edu", ".net", ".gov", ".int", ".mil", ".biz", ".info", ".pro", ".name" or ".idv", the second-level domain name is extracted from the URL. If the top-level domain is a country-code domain, it is determined whether the second-level domain is a commonly used top-level domain string, for example "com", "org", "net", "gov", "edu" or "biz". If so, the third-level domain name is extracted; otherwise only the second-level domain name is extracted. Examples of extracted domain names are: huawei.com, huawei.com.cn, sina.com.cn, apwg.org, apwg.net, etc. After the domain names are extracted, they are stored in a hash table to speed up subsequent queries. The hash algorithm used to build the hash table may be a standard algorithm such as MD5 or SHA1, or a custom algorithm.
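The extraction rule above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `unique_domain` and `is_trusted` are hypothetical, the TLD sets are copied from the examples in the text rather than from a complete registry, and a built-in Python set (which is hash-based) stands in for the MD5/SHA1 hash table.

```python
from urllib.parse import urlsplit

# Non-country top-level domains, as listed in the description.
GENERIC_TLDS = {"com", "org", "edu", "net", "gov", "int", "mil",
                "biz", "info", "pro", "name", "idv"}
# Second-level labels treated as top-level strings under a country code.
TLD_LIKE_SECOND_LEVEL = {"com", "org", "net", "gov", "edu", "biz"}


def unique_domain(url: str) -> str:
    """Return the unique domain name for a URL (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return host
    if labels[-1] in GENERIC_TLDS:
        return ".".join(labels[-2:])      # second-level name, e.g. huawei.com
    # Country-code TLD: keep three labels when the second level is TLD-like.
    if len(labels) >= 3 and labels[-2] in TLD_LIKE_SECOND_LEVEL:
        return ".".join(labels[-3:])      # third-level name, e.g. sina.com.cn
    return ".".join(labels[-2:])


# Assumption: a hash-based set replaces the hash table described in the text.
trusted_domains = {unique_domain(u) for u in (
    "http://www.huawei.com/en/",
    "http://news.sina.com.cn/",
    "http://www.apwg.org/",
)}


def is_trusted(url: str) -> bool:
    """Step 11: look up the URL's unique domain name in the trusted library."""
    return unique_domain(url) in trusted_domains
```

Note that a host such as `sina.com.cn.evil.example` reduces to `evil.example`, so embedding a trusted name as a subdomain prefix does not pass the lookup.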

Step 12: When there is no unique domain name in the trusted domain name database, determine the similarity between the content features extracted from the web page to be detected and the content features in each template file of the template file library.

The template file library can be a brand template library or a phishing template library. The template file library saves template files containing content features extracted from phishing webpages, or template files containing content features extracted from brand webpages. The content features extracted from a webpage include at least: an encoding format, a document object model, vocabulary, and the number of words.

When the trusted domain name library does not contain the unique domain name corresponding to the to-be-detected webpage, content features are extracted from the to-be-detected webpage and matched against the content features saved in each template file in the phishing template library. They may also be matched against the content features saved in each template file in the brand template library. In either case, the goal is to determine the similarity between the content features extracted from the web page to be detected and the content features in each template file.

Because a large number of phishing websites are generated by automated programs or directly counterfeit brand web pages, they usually share the same encoding format, similar vocabulary, a similar Document Object Model (DOM), and a similar number of words. The embodiments of the present invention can therefore determine the similarity between the to-be-detected webpage and a brand webpage or phishing webpage by analyzing content features that include the encoding format, the document object model, the vocabulary, and the number of words.
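As a rough illustration of what extracting these four content features might look like, the sketch below uses Python's standard `html.parser`. The names `ContentFeatures` and `extract_features` are hypothetical, and several choices are assumptions not stated in the description: the DOM is simplified to the sequence of start-tag names, the encoding is read from a `<meta>` tag, and words are plain `\w+` tokens outside `<script>` and `<style>`.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ContentFeatures:
    encoding: str     # declared character encoding, lower-cased
    dom: list         # start-tag names in document order (simplified DOM)
    words: set        # distinct words found in visible text
    word_count: int   # total number of words


class _FeatureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags, self.tokens, self.encoding = [], [], ""
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        attr = dict(attrs)
        if tag == "meta":
            if attr.get("charset"):
                self.encoding = attr["charset"].lower()
            else:
                m = re.search(r"charset=([\w-]+)", attr.get("content") or "", re.I)
                if m:
                    self.encoding = m.group(1).lower()
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:  # ignore script and style bodies
            self.tokens += re.findall(r"\w+", data.lower())


def extract_features(html: str) -> ContentFeatures:
    parser = _FeatureParser()
    parser.feed(html)
    parser.close()
    return ContentFeatures(parser.encoding, parser.tags,
                           set(parser.tokens), len(parser.tokens))
```

A template file could then simply be a serialized `ContentFeatures` record plus the name of the page it was extracted from.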

The phishing template library includes a plurality of phishing template files for storing content features extracted from each phishing webpage. When the phishing template library is created, the content features are extracted from multiple phishing webpages, and the content features of each phishing webpage are separately saved in the form of template files.

The brand template library includes multiple brand template files for saving content features extracted from various brand web pages. Brand pages are web pages that are often counterfeited or may be counterfeited, such as the pages of major banks around the world, insurance companies, online payment agencies, corporate websites, and social networking sites. When the brand template library is created, content features are extracted from multiple brand web pages, and the content features of each brand web page are saved separately in the form of a template file.

Step 13: When the similarity between the content features extracted from the webpage to be detected and the content features in at least one template file is greater than a preset similarity threshold, determine that the webpage to be detected is a phishing webpage.

When matching the content features extracted from the webpage to be detected against the phishing template library shows that one or more phishing template files are similar to the webpage to be detected, the webpage is determined to be a phishing webpage that does not counterfeit a brand webpage. The similarity can be a percentage value or another custom type. When the similarity is a percentage value, a higher percentage means greater similarity. The similarity can also be a value from 0 to 100, in which case a larger value means greater similarity. The preset similarity threshold may be an empirical value.

In addition, since each template file in the phishing template library corresponds to one phishing webpage, when the content features of the webpage to be detected are determined to be similar to the content features of a phishing webpage, the webpage name of that phishing webpage can also be determined.

When matching the content features extracted from the webpage to be detected against the brand template library shows that one or more brand template files are similar to the webpage to be detected, then, because the unique domain name corresponding to the webpage to be detected is not a trusted domain name, the webpage to be detected is determined to be a phishing webpage that counterfeits a brand webpage.
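Steps 11 to 13 can be summarized in a short Python sketch. Everything here is illustrative: `detect` and `jaccard` are hypothetical names, features are reduced to word sets, the Jaccard measure is only a stand-in for the similarity computation (which this embodiment leaves open), and the threshold of 80 is an arbitrary example of the "empirical value" mentioned above.

```python
from typing import List, Optional, Set, Tuple

Template = Tuple[str, Set[str]]  # (web page name, content features)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity on a 0-100 scale (assumption, not from the source)."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


def detect(domain: str, features: Set[str], trusted: Set[str],
           phishing_lib: List[Template], brand_lib: List[Template],
           similarity=jaccard, threshold: float = 80.0) -> Tuple[str, Optional[str]]:
    # Step 11: a trusted unique domain name ends the check immediately.
    if domain in trusted:
        return ("trusted", None)
    # Steps 12-13 against the phishing template library.
    for name, tpl in phishing_lib:
        if similarity(features, tpl) > threshold:
            return ("phishing", name)
    # Steps 12-13 against the brand template library: an untrusted domain
    # that looks like a brand page is treated as a counterfeit of that brand.
    for name, tpl in brand_lib:
        if similarity(features, tpl) > threshold:
            return ("counterfeit_brand", name)
    return ("no_match", None)
```

Returning the template name alongside the verdict mirrors the observation that the matched phishing page or counterfeited brand can be reported.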

In this embodiment of the present invention, after it is determined that the unique domain name of the to-be-detected webpage is not a trusted domain name, the similarity between the content features of the to-be-detected webpage and each template file in the template file library is used to determine whether the to-be-detected webpage is a phishing webpage. A brand template file saves the content features of a brand webpage: when the unique domain name of the webpage to be detected is not a trusted domain name and its content features are highly similar to those of a brand webpage, the webpage to be detected is determined to be a phishing webpage that counterfeits the brand webpage. A phishing template file saves the content features of a phishing webpage: when the content features of the webpage to be detected are highly similar to those in a phishing template file, the webpage to be detected is determined to be a phishing webpage that does not counterfeit a brand webpage. Because phishing web pages are usually generated by automated programs or directly counterfeit brand web pages, and the content features of most phishing web pages are basically similar, content features reflect the characteristics of phishing web pages. Therefore, by using content features to determine whether a webpage is a phishing webpage, the present invention can improve the accuracy of phishing webpage detection results. In addition, because the present invention first determines, through the continuously updated trusted domain name library, whether the to-be-detected web page is a trusted webpage, the chance of misjudging a brand page as a phishing page is reduced.

FIG. 2 is a flowchart of Embodiment 2 of a method for detecting a phishing webpage according to the present invention. This embodiment mainly describes how to match the content features of the web page to be detected against the phishing template files in the phishing template library. As shown in FIG. 2, this embodiment includes:

Step 20: Extract the content feature from the web page to be detected.

Before step 20, the trusted domain name database is first searched for the unique domain name of the web page to be detected. Because the trusted domain name database stores trusted unique domain names, when the unique domain name of the to-be-detected webpage exists in the trusted domain name database, the webpage is determined to be a trusted webpage. If the unique domain name of the webpage to be detected does not exist in the trusted domain name database, step 20 is performed to determine whether the webpage to be detected is a phishing webpage.

Step 21: Determine whether the phishing template library contains a phishing template file that has not yet been matched against the web page to be detected. If so, go to step 22; otherwise, end.

If the brand template files in the brand template library are to be matched against the web page to be detected, step 21 may instead be: determine whether the brand template library contains a brand template file that has not yet been matched against the web page to be detected.

Step 22: Read, in order, a phishing template file that has not yet been matched against the web page to be detected from the phishing template library.

When the phishing template library is created, to avoid storing phishing template files with similar content features, the content features extracted from each phishing webpage are first matched against the content features in the phishing template files already in the library. The resulting similarity between the extracted content features and each phishing template file determines whether the content features are written into the phishing template library as a new phishing template file. When the similarity between the content features extracted from the phishing webpage and every phishing template file is less than a preset similarity threshold, the extracted content features form a phishing template file that is written into the phishing template library.

Similarly, when the brand template library is created, to avoid saving brand template files with the same content features, the content features extracted from each brand webpage are matched against the content features of the brand template files already in the library. The resulting similarity between the extracted content features and each brand template file determines whether the content features are written into the brand template library as a new brand template file. When the similarity between the content features extracted from the brand webpage and every brand template file is less than the preset similarity threshold, the extracted content features form a brand template file that is written into the brand template library.

Step 23: Determine whether the encoding format of the to-be-detected webpage is the same as the encoding format in the current phishing template file. If it is not the same, go back to step 21 and if it is the same, go to step 24.

Step 24: When the encoding format of the to-be-detected webpage is the same as the encoding format in the current phishing template file, determine whether the absolute value of the difference between the number of words extracted from the to-be-detected webpage and the number of words in the current template file is within a preset quantity range. If it is not within the preset quantity range, return to step 21; if it is within the preset quantity range, go to step 25.

When the absolute value of the difference between the number of words extracted from the web page to be detected and the number of words in the current phishing template file is within the preset quantity range, the two numbers are relatively close, so the webpage to be detected may be a phishing webpage and further judgment is required. If the two numbers differ greatly, the web page to be detected is not considered similar to the current phishing template file. The preset quantity range can be set according to the number of words in the web page to be detected.

Step 25: When the difference in the number of words is within the preset quantity range, determine whether the vocabulary similarity between the vocabulary extracted from the webpage to be detected and the vocabulary in the current phishing template file lies between the vocabulary-similarity low preset value and the vocabulary-similarity high preset value. If the vocabulary similarity lies between the two preset values, perform step 26. If the vocabulary similarity is greater than the high preset value, perform step 27. If the vocabulary similarity is less than the low preset value, return to step 21.

Vocabulary similarity is a measure of how many words in the web page to be detected are the same as those in a phishing template file. It can generally be described by a formula. For example, if the web page to be detected has m words, a phishing template file has n words, and the two share s identical words, the vocabulary similarity can be expressed as a percentage value: [2 × s / (m + n)] × 100. When the value is higher than a certain threshold, the vocabulary of the web page to be detected is considered highly similar to that of the phishing template file.
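The formula translates directly into code. The sketch below assumes that m, n and s count distinct words, which makes identical vocabularies score exactly 100; the function name `vocabulary_similarity` is hypothetical.

```python
def vocabulary_similarity(page_words, template_words) -> float:
    """Return [2 * s / (m + n)] * 100 for two collections of words."""
    a, b = set(page_words), set(template_words)
    m, n, s = len(a), len(b), len(a & b)  # s = number of shared words
    if m + n == 0:
        return 0.0  # two empty vocabularies: treated as not similar
    return 2 * s / (m + n) * 100
```

For instance, two pages with four distinct words each and two words in common score 2 × 2 / 8 × 100 = 50.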

When the vocabulary similarity is greater than the vocabulary-similarity high preset value, the vocabulary of the webpage to be detected is largely the same as that of the current phishing template file. Because the webpage corresponding to the phishing template file is a phishing webpage, the webpage to be detected is determined to be a phishing webpage. If the current template file is a brand template file, whose corresponding webpage is a brand webpage, the webpage to be detected is also determined to be a phishing webpage, because it was already determined, before its content features were extracted, that its unique domain name does not exist in the trusted domain name database.

When the vocabulary similarity is less than the vocabulary-similarity low preset value, the web page to be detected shares few words with the template file, and it can be determined that the web page to be detected is not similar to the current template file.

Step 26: When the vocabulary similarity lies between the vocabulary-similarity low preset value and the vocabulary-similarity high preset value, determine whether the model similarity between the document object model extracted from the to-be-detected webpage and the document object model in the current phishing template file is greater than the model-similarity preset value. If so, perform step 27; otherwise, return to step 21.

When the model similarity between the document object model extracted from the web page to be detected and the document object model in the current phishing template file is greater than the model-similarity preset value, the two are similar in terms of the document object model. The model similarity can be expressed as a percentage or as a value from 0 to 100. When the model similarity is expressed as a percentage, the model-similarity preset value can be 80%. When the model similarity is expressed as a value from 0 to 100, the model-similarity preset value can be 50.
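The description does not say how the model similarity itself is computed. Purely as an assumption for illustration, the sketch below flattens each DOM to its sequence of tag names and scores the two sequences with `difflib.SequenceMatcher` from the Python standard library, scaled to 0 to 100. The function name `model_similarity` is hypothetical, and a tree-based comparison would be an equally plausible choice.

```python
from difflib import SequenceMatcher


def model_similarity(page_tags, template_tags) -> float:
    """Score two tag-name sequences on a 0-100 scale (assumed metric)."""
    matcher = SequenceMatcher(None, list(page_tags), list(template_tags),
                              autojunk=False)
    return matcher.ratio() * 100
```

With this metric, identical tag sequences score 100 and sequences with no tag in common score 0, so the example preset value of 50 on the 0 to 100 scale sits midway between the two extremes.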

Step 27: When the model similarity is greater than the model-similarity preset value, determine that the webpage to be detected is a phishing webpage, and output the name of the phishing webpage corresponding to the phishing template file. Then return to step 21.

After the webpage to be detected is determined to be a phishing webpage, matching continues with the subsequent template files. The purpose is to find, among the template files whose model similarity reaches the model-similarity preset value, the one with the highest similarity, and to output the name of the phishing page corresponding to that template file.

If the brand template file in the brand template library is read in step 22, the web page name of the brand web page corresponding to the brand template file is output in step 27.
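The loop formed by steps 21 to 27 can be sketched as follows. This is one reading of the flow under stated assumptions, not the definitive implementation: all names and threshold values are hypothetical, the number of words is taken as the count of distinct words, the DOM metric is the assumed tag-sequence ratio from `difflib`, and the best match is ranked by model similarity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Features:
    name: str          # web page name (meaningful for templates)
    encoding: str
    words: frozenset   # distinct words
    dom: tuple         # tag names in document order


# Hypothetical presets; the description treats such values as empirical.
COUNT_RANGE = 20       # step 24: allowed difference in number of words
VOCAB_LOW = 40.0       # step 25: vocabulary-similarity low preset value
VOCAB_HIGH = 80.0      # step 25: vocabulary-similarity high preset value
MODEL_PRESET = 50.0    # step 26: model-similarity preset value


def vocab_sim(a, b):
    total = len(a) + len(b)
    return 200.0 * len(a & b) / total if total else 0.0


def model_sim(a, b):
    # Assumption: tag-sequence ratio stands in for the unspecified DOM metric.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() * 100


def match(page, templates):
    """Return the name of the most similar matching template, or None."""
    best_name, best_score = None, -1.0
    for tpl in templates:                                   # steps 21-22
        if page.encoding != tpl.encoding:                   # step 23
            continue
        if abs(len(page.words) - len(tpl.words)) > COUNT_RANGE:  # step 24
            continue
        v = vocab_sim(page.words, tpl.words)                # step 25
        if v < VOCAB_LOW:
            continue
        d = model_sim(page.dom, tpl.dom)
        if v <= VOCAB_HIGH and d <= MODEL_PRESET:           # step 26
            continue
        if d > best_score:                                  # step 27
            best_name, best_score = tpl.name, d
    return best_name
```

Ordering the checks from cheapest (encoding comparison) to most expensive (DOM comparison) lets most non-matching templates be rejected early, which appears to be the motivation for the step order in this embodiment.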

It should be noted that a phishing template may include only some of the content features among the encoding format, the number of words, the vocabulary, and the document object model. The checks described above may also be combined flexibly, and the order in which the similarity judgments are made can be adjusted. For example:

Alternative 1:

Step 23 is omitted. After step 22, in which a phishing template file that has not yet been matched against the to-be-detected page is read in order from the phishing template library, the process goes directly to step 24 to determine whether the absolute value of the difference between the number of words extracted from the web page to be detected and the number of words in the current template file is within the preset quantity range. If it is not within the preset quantity range, return to step 21; if it is within the preset quantity range, go to step 25.

Alternative 2:

First, the judgments on the number of words and the vocabulary similarity described in steps 24 to 25 are performed. When it cannot be determined from the number of words and the vocabulary similarity whether the page is a phishing webpage, the encoding format judgment of step 23 is performed: if the encoding formats are the same, the page is determined to be a phishing page; otherwise, it is determined to be a non-phishing page.

The various other alternatives are not listed one by one here.

In this embodiment of the invention, the content features extracted from the webpage to be detected (its encoding format, vocabulary, number of words, and DOM) are matched against the content features saved in each phishing template file in the phishing template library. When the encoding format is the same as that of the currently matched phishing template file, the webpage to be detected is determined to be a phishing webpage, and matching continues with the next phishing template file. When the encoding format is different, the number of words is matched against that of the current phishing template file. When the number of words is close to that of the current phishing template file, the webpage to be detected is determined to be a phishing webpage; otherwise, vocabulary similarity matching against the phishing template file continues. When the vocabulary similarity reaches the vocabulary-similarity preset value, the webpage to be detected is determined to be a phishing webpage, and matching continues with the next phishing template file; otherwise, model similarity matching is performed against the DOM of the phishing template file, and when the model similarity reaches the model-similarity preset value, the webpage to be detected is determined to be a phishing webpage. When the webpage to be detected is determined to be a phishing webpage, the webpage name of the currently matched phishing template file is also output. In addition, the content features of the web page to be detected can be matched against each template file in the brand template library. When the webpage to be detected is then determined to be a phishing webpage, the name of the webpage corresponding to the template file may be output, that is, the name of the brand webpage counterfeited by the webpage to be detected.

FIG. 3 is a flowchart of Embodiment 3 of a method for detecting a phishing webpage according to the present invention. This embodiment mainly describes the process of creating brand template files in the brand template library. The process of creating phishing template files in the phishing template library is similar. The only difference is that the phishing template files in the phishing template library save the content features of known phishing webpages, whereas the brand template files in the brand template library save the content features of known brand webpages. As shown in FIG. 3, this embodiment includes:

Step 30: Determine whether there are still unprocessed URLs in the brand URL list. If so, go to step 31; otherwise, end.

Step 31: Read an unprocessed URL in order from the brand URL list.

Step 32: Download the corresponding web page according to the read URL.

Step 33: Extract the content features from the downloaded web page: its encoding format, vocabulary, vocabulary quantity, and DOM.

Step 34: Determine whether the brand template library still contains a brand template file that has not yet been matched against the content features extracted from the downloaded web page. If such a brand template file exists, go to step 35; otherwise, go to step 37.

Step 35: Read, in order, a brand template file that has not yet been matched from the brand template library.

Step 36: Determine whether the similarity between the content features of the downloaded web page and the content features of the current brand template file is less than a preset similarity threshold. If it is less than the preset similarity threshold, the downloaded web page is determined not to be similar to the current brand template file, and the process returns to step 34 to continue matching with the subsequent brand template files. If it is greater than the preset similarity threshold, the downloaded web page is determined to be similar to the current brand template file, so its content features do not need to be saved in the brand template library, and the process returns to step 30 to handle the web page corresponding to the next URL.

Step 37: Write the content features of the downloaded web page into the brand template library in the form of a brand template file, and return to step 30.
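The loop of steps 30 to 37 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: it assumes that the pages have already been downloaded and reduced to word sets, and the names `jaccard` and `build_template_library`, as well as the use of a Jaccard overlap as the similarity measure, are assumptions introduced here. The text does not say what happens when the similarity equals the threshold; the sketch treats that case as "not similar".

```python
def jaccard(a, b):
    """Assumed similarity measure: overlap of two word sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_template_library(pages, similarity, threshold):
    """Keep one template per group of similar pages.

    pages: iterable of (page_name, features) pairs, in URL-list order.
    Returns the list of (page_name, features) pairs that were saved.
    """
    library = []
    for name, features in pages:                 # steps 30 to 33
        is_similar_to_saved = any(               # steps 34 to 36
            similarity(features, saved) > threshold
            for _, saved in library
        )
        if not is_similar_to_saved:              # step 37
            library.append((name, features))
    return library
```

With a threshold of 0.7, for example, two login pages that share three of four words would produce a single template, while an unrelated page would be saved as a second template.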

When the brand template library is established in this embodiment of the present invention, the content features of the downloaded web page are matched against the existing brand template files in the brand template library. Only when no brand template file similar to the content features of the downloaded web page exists in the brand template library (that is, when the downloaded web page is not similar to any brand template file) are the content features of the downloaded web page stored in the brand template library as a brand template file. This avoids repeatedly saving brand template files of multiple similar web pages in the brand template library.

FIG. 4A is a schematic structural diagram of Embodiment 1 of a phishing webpage detecting apparatus provided by the present invention. As shown in FIG. 4A, this embodiment includes: a trusted domain name library 40, a domain name determining module 41, a content extracting module 42, a similarity determining module 43, a phishing webpage determining module 44, and a template file library 45.

The trusted domain name library 40 is configured to save trusted unique domain names. The template file library 45 is configured to save a plurality of template files, each of which includes content features extracted from a webpage; the content features include at least: the encoding format of the webpage, the document object model, the vocabulary, and the vocabulary quantity. Specifically, the template file library includes a phishing template library and a brand template library. The phishing template library is configured to save template files that include content features extracted from phishing webpages. The brand template library is configured to save template files that include content features extracted from brand webpages.

The domain name determining module 41 is configured to determine whether a unique domain name corresponding to the webpage to be detected exists in the trusted domain name library 40.

The content extraction module 42 is configured to extract content features from the webpage to be detected when the domain name determining module 41 determines that the unique domain name does not exist in the trusted domain name library.
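As an illustration of what extracting the four content features might look like, the following Python sketch uses the standard library `html.parser` module. It is a sketch under stated assumptions rather than the extraction method of the embodiment: the encoding is read from a `<meta>` tag, words are split with a simple regular expression, and the DOM is flattened to the sequence of opening tag names. The names `FeatureParser` and `extract_features` are hypothetical.

```python
import re
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collects encoding, words, and tag sequence while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.encoding = None
        self.tags = []      # opening tag names in document order
        self.words = []     # lower-cased words from visible text
        self._skip = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        attrs = dict(attrs)
        if tag == "meta":
            content = (attrs.get("content") or "").lower()
            if attrs.get("charset"):
                self.encoding = attrs["charset"].lower()
            elif "charset=" in content:
                self.encoding = content.split("charset=")[-1].strip()
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(re.findall(r"\w+", data.lower()))


def extract_features(html):
    """Return the four content features of one page as a dict."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "encoding": parser.encoding,
        "vocabulary": set(parser.words),
        "word_count": len(parser.words),
        "dom": tuple(parser.tags),
    }
```

A production extractor would also need word segmentation for languages written without spaces and a fallback to the HTTP `Content-Type` header for the encoding; both are outside the scope of this sketch.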

The similarity determining module 43 is configured to respectively determine the similarity between the content features extracted by the content extraction module 42 from the web page to be detected and the content features in the template files of the template file library 45.

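The phishing webpage determining module 44 is configured to determine that the webpage to be detected is a phishing webpage when the similarity between the content features extracted from the webpage to be detected and the content features in at least one template file is greater than a preset similarity threshold.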

In this embodiment of the present invention, the phishing webpage detecting device detects webpages without the cooperation of a remote device, and can be deployed at any network node to support detection of large traffic volumes. For example, it can be deployed on network traffic monitoring devices, firewall devices, and routers.

FIG. 4B is a schematic diagram of an application scenario of a phishing webpage detecting device provided by the present invention. As shown in FIG. 4B, the phishing webpage detecting device obtains the URL of the webpage to be detected from the network traffic monitoring device, downloads the webpage to be detected from the network according to the URL, and then outputs the detection result to other devices.

FIG. 4C is a schematic diagram of another application scenario of the phishing webpage detecting device provided by the present invention. As shown in FIG. 4C, the phishing webpage detecting device directly obtains HTTP data packets from the network traffic monitoring device for phishing webpage detection, and outputs the detection result to other devices.

Further, as shown in FIG. 5, this embodiment further includes a webpage name output module 46, configured to output the phishing webpage name, or the name of the counterfeited brand webpage, corresponding to the matching template file. For the working mechanism of each module, refer to the description of the embodiment corresponding to FIG. 1; details are not described herein again.

In the phishing webpage detecting device of this embodiment of the present invention, when a webpage is to be detected, the domain name determining module 41 searches the locally saved trusted domain name library for the unique domain name corresponding to the webpage to be detected. When the unique domain name does not exist in the trusted domain name library, the similarity determining module 43 matches the content features of the webpage to be detected against the locally saved template files to determine the similarity. Because phishing webpages are usually generated by automated programs or directly counterfeit brand webpages, their content features are largely similar to one another, and the content features reflect the characteristics of a phishing webpage. The present invention therefore determines whether a webpage is a phishing webpage by its content features, which improves the accuracy of the phishing webpage detection result. In addition, because the present invention first determines, by means of the continuously updated trusted domain name library, whether the webpage to be detected is a trusted webpage, the probability of misjudging a brand webpage as a phishing webpage is reduced.
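The two-stage flow summarized above (trusted-domain check first, template comparison second) can be sketched as follows. The function names, the representation of a template as a word set, and the `word_overlap` measure are assumptions made for illustration; the embodiment compares four content features, not word sets alone.

```python
def word_overlap(a, b):
    """Assumed stand-in for the similarity determining module."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_phishing(unique_domain, page_features, trusted_domains,
                templates, similarity, threshold):
    """Return True when an untrusted page resembles at least one template."""
    # Stage 1: a page whose unique domain name is trusted is never flagged.
    if unique_domain in trusted_domains:
        return False
    # Stage 2: flag the page if any template is similar enough.
    return any(similarity(page_features, tpl) > threshold
               for tpl in templates)
```

Checking the trusted domain name library first means that a genuine brand page, which naturally resembles its own brand template, is not compared at all.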

FIG. 6 is a schematic structural diagram of the similarity determining module in FIG. 4 or FIG. 5. As shown in FIG. 6, the similarity determining module 43 includes: a reading unit 431, an encoding format determining unit 432, a vocabulary quantity determining unit 433, a vocabulary determining unit 434, and an object model determining unit 435.

The reading unit 431 is configured to read a template file from the phishing template library or the brand template library.

The encoding format determining unit 432 is configured to determine whether the encoding format extracted from the web page to be detected is the same as the encoding format in the template file.

The vocabulary quantity determining unit 433 is configured to determine, when the encoding format determining unit 432 determines that the encoding format is the same, whether the number of vocabularies extracted from the web page to be detected is within a preset range corresponding to the number of vocabularies in the template file.

The vocabulary determining unit 434 is configured to determine, when the number of words is within the preset range, whether the vocabulary similarity between the vocabulary extracted from the webpage to be detected and the vocabulary in the template file is between the high vocabulary-similarity preset value and the low vocabulary-similarity preset value.

The object model determining unit 435 is configured to determine, when the vocabulary similarity is between the high vocabulary-similarity preset value and the low vocabulary-similarity preset value, the model similarity between the document object model extracted from the webpage to be detected and the document object model in the template file, and to determine whether the model similarity is greater than the model-similarity preset value.

The phishing webpage determining module 44 is configured to determine that the webpage to be detected is a phishing webpage when the object model determining unit 435 determines that the model similarity is greater than the model-similarity preset value, or when the vocabulary determining unit 434 determines that the vocabulary similarity is higher than the high vocabulary-similarity preset value.
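The chain of units 432 to 435 and module 44 amounts to a cascade of increasingly expensive checks against one template. The following Python sketch shows one possible reading of that cascade. All threshold values, the `Features` record, the Jaccard measure for vocabulary, and the use of `difflib.SequenceMatcher` on flattened tag sequences for the model similarity are assumptions chosen for illustration; the source only states that the values are preset and does not define the similarity formulas.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Features:
    encoding: str       # e.g. "utf-8"
    words: frozenset    # distinct words on the page
    word_count: int     # total number of words
    dom: tuple          # tag names in document order (flattened DOM)


# Hypothetical preset values.
COUNT_RANGE = 20        # allowed absolute difference in word count
VOCAB_HIGH = 0.9        # high vocabulary-similarity preset value
VOCAB_LOW = 0.5         # low vocabulary-similarity preset value
MODEL_MIN = 0.8         # model-similarity preset value


def vocab_similarity(a, b):
    """Assumed measure: Jaccard overlap of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def model_similarity(a, b):
    """Assumed measure: matching ratio of two tag sequences."""
    return SequenceMatcher(None, a, b).ratio()


def matches_template(page, tpl):
    """Apply the cascade of units 432 to 435 to one template."""
    if page.encoding.lower() != tpl.encoding.lower():          # unit 432
        return False
    if abs(page.word_count - tpl.word_count) > COUNT_RANGE:    # unit 433
        return False
    v = vocab_similarity(page.words, tpl.words)                # unit 434
    if v > VOCAB_HIGH:
        return True
    if v < VOCAB_LOW:
        return False
    return model_similarity(page.dom, tpl.dom) > MODEL_MIN     # unit 435
```

The ordering puts the cheapest comparisons first, so most non-matching templates are rejected before the DOM comparison, which is the most expensive step, is ever run.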

For the working mechanism of each module, refer to the description of the embodiment corresponding to FIG. 2; details are not described herein again.

In this embodiment of the present invention, the content features extracted from the webpage to be detected (the encoding format, vocabulary, vocabulary quantity, and DOM of the webpage) are matched against the content features saved in each template file in the phishing template library to obtain multiple similarities. When a template file whose similarity is greater than the preset similarity threshold is found, the webpage to be detected is determined to be a phishing webpage, and the webpage name corresponding to that template file is determined, so as to identify the phishing webpage that the webpage to be detected resembles. In addition, the content features of the webpage to be detected can be matched against each template file in the brand template library. When a template file whose similarity is greater than the preset similarity threshold is found in the brand template library, the webpage to be detected is determined to be a phishing webpage, and the name of the webpage corresponding to that template file is also output, that is, the name of the brand webpage counterfeited by the webpage to be detected.
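Selecting the webpage name to output from the multiple similarities can be sketched as below. This is a simplified illustration: templates are represented only by their word sets, the similarity is an assumed Jaccard overlap, and the function name `best_match` and the example page names are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity measure over word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_match(page_words, templates, threshold):
    """Return (page_name, similarity) of the closest template above the
    threshold, or None when no template is similar enough.

    templates: mapping of page name -> word set of that template.
    """
    hits = [
        (jaccard(page_words, words), name)
        for name, words in templates.items()
    ]
    hits = [(score, name) for score, name in hits if score > threshold]
    if not hits:
        return None
    score, name = max(hits)     # highest similarity wins
    return name, score
```

Run against the brand template library, the returned name identifies the brand page being counterfeited; run against the phishing template library, it identifies the known phishing page that the detected page resembles.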

FIG. 7 is a schematic structural diagram of Embodiment 3 of a phishing webpage detecting apparatus provided by the present invention. As shown in FIG. 7, the phishing template library building module 47, the brand template library building module 48, and the trust domain name database building module 49 are further included on the basis of FIG.

The phishing template library building module 47 is configured to match the content features extracted from a phishing webpage against the content features in the template files in the phishing template library, and determine the similarity between the content features extracted from the phishing webpage and each template file; when the similarity between the content features extracted from the phishing webpage and every template file is less than the preset similarity threshold, the content features extracted from the phishing webpage are formed into a template file and written into the phishing template library.

The brand template library building module 48 is configured to match the content features extracted from a brand webpage against the content features in the template files in the brand template library, and determine the similarity between the content features extracted from the brand webpage and each template file; when the similarity between the content features extracted from the brand webpage and every template file is less than the preset similarity threshold, the content features extracted from the brand webpage are formed into a template file and written into the brand template library.

The trusted domain name library establishing module 49 is configured to: if the top-level domain name in a URL is a non-national top-level domain name, extract the second-level domain name from the URL and write it into the trusted domain name library; if the top-level domain name in the URL is a national domain name and the second-level domain name is a top-level domain name string, extract the third-level domain name from the URL and write it into the trusted domain name library.
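The rule applied by the trusted domain name library establishing module can be illustrated with the short Python sketch below. The two sets of top-level domain strings are small illustrative subsets, not complete lists, and the function name `unique_domain` is hypothetical. The source does not state what to extract when the top-level domain name is national but the second-level label is not a top-level domain name string; the sketch assumes the second-level domain name is used in that case.

```python
from urllib.parse import urlsplit

# Illustrative subsets only; a real deployment would need complete lists.
GENERIC_TLDS = {"com", "net", "org", "edu", "gov"}
COUNTRY_TLDS = {"cn", "uk", "jp", "au"}


def unique_domain(url):
    """Return the unique domain name to store for a URL."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return host
    tld, second = labels[-1], labels[-2]
    # National TLD whose second-level label is itself a TLD string
    # (for example "com.cn"): keep three levels.
    if tld in COUNTRY_TLDS and second in GENERIC_TLDS and len(labels) >= 3:
        return ".".join(labels[-3:])
    # Otherwise keep two levels (assumed for the unspecified case too).
    return ".".join(labels[-2:])
```

Under these assumptions, `http://www.example.com/a` yields `example.com`, and `https://login.example.com.cn/x` yields `example.com.cn`.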

For the working mechanism of each module, refer to the description of the embodiment corresponding to FIG. 3; details are not described herein again. When the brand template library is established in this embodiment of the present invention, the content features of the downloaded webpage are matched against the existing template files in the brand template library, and only when no template file similar to the content features of the downloaded webpage exists in the brand template library is the downloaded webpage stored in the brand template library as a template file, thereby avoiding repeatedly saving the template files of multiple similar webpages in the brand template library.

A person skilled in the art can understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer readable storage medium, and when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Finally, the above embodiments are only used to illustrate the technical solutions of the present invention. The invention is described in detail with reference to the foregoing embodiments, and those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may be modified, or some of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

A phishing webpage detecting method, characterized in that:

The method for detecting a phishing webpage according to claim 1, wherein the determining the similarity between the content feature extracted from the webpage to be detected and the content features in each template file of the template file library includes:

Reading a template file from the template file library, and determining whether the encoding format extracted from the to-be-detected webpage is the same as the encoding format in the template file;

When the encoding format extracted from the to-be-detected webpage is the same as the encoding format in the template file, determining whether the absolute value of the difference between the number of words extracted from the to-be-detected webpage and the number of words in the template file is within a preset quantity range;

When the absolute value of the difference is within the preset quantity range, determining whether the vocabulary similarity between the vocabulary extracted from the to-be-detected webpage and the vocabulary in the template file is between a high vocabulary-similarity preset value and a low vocabulary-similarity preset value;

Calculating the model similarity between the document object model extracted from the to-be-detected webpage and the document object model in the template file when the vocabulary similarity is between the high vocabulary-similarity preset value and the low vocabulary-similarity preset value;

Determining that the to-be-detected webpage is a phishing webpage when the model similarity is greater than a model-similarity preset value or when the vocabulary similarity is higher than the high vocabulary-similarity preset value; reading the next template file from the phishing template library or the brand template library and repeating the above steps, until the template file with the highest model similarity is found among the template files that reach the model-similarity preset value.

The method for detecting a phishing webpage according to claim 1 or 2, wherein the trusted domain name database is used to store trusted unique domain names, and the template file library is a brand template library or a phishing template library. The template file in the phishing template library includes content features extracted from a phishing webpage, and the template file in the brand template library includes content features extracted from a brand webpage.

The method for detecting a phishing webpage according to claim 1 or 2, wherein after the determining that the webpage to be detected is a phishing webpage, the method further comprises: outputting the phishing webpage name, or the name of the counterfeited brand webpage, corresponding to the template file whose similarity is greater than the preset similarity threshold.

The method for detecting a phishing webpage according to claim 1, wherein before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name database, the method further comprises:

Matching the content features extracted from a phishing webpage against the content features in the template files in the phishing template library, and determining the similarity between the content features extracted from the phishing webpage and each of the template files; and when the similarity between the content features extracted from the phishing webpage and each of the template files is less than the preset similarity threshold, forming the content features extracted from the phishing webpage into a template file and writing the template file into the phishing template library.

Matching the content features extracted from a brand webpage against the content features in each template file in the brand template library, and determining the similarity between the content features extracted from the brand webpage and each of the template files; and when the similarity between the content features extracted from the brand webpage and each of the template files is less than the model-similarity preset value, forming the content features extracted from the brand webpage into a template file and writing the template file into the brand template library.

The method for detecting a phishing webpage according to claim 5 or 6, wherein before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name database, the method further comprises:

When the top-level domain name in the collected uniform resource locator is a non-national top-level domain name, extracting the second-level domain name from the uniform resource locator and writing it into the trusted domain name database; and when the top-level domain name in the collected uniform resource locator is a national domain name and the second-level domain name is a top-level domain name string, extracting the third-level domain name from the uniform resource locator and writing it into the trusted domain name database.

8. A phishing webpage detecting device, comprising:

a template file library, configured to save a plurality of template files, where the template file includes content features extracted from a webpage; the content features include at least: an encoding format, a document object model, a vocabulary, and a number of words;

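a phishing webpage determining module, configured to determine that the webpage to be detected is a phishing webpage when the similarity between the content features extracted from the webpage to be detected and the content features in at least one template file is greater than a preset similarity threshold.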

The phishing webpage detecting device according to claim 8, further comprising: a webpage name output module, configured to determine the phishing webpage name, or the name of the counterfeited brand webpage, corresponding to the template file that matches the content features extracted from the webpage to be detected.

The phishing webpage detecting device according to claim 9, wherein the similarity determining module comprises:

a reading unit, configured to read the template file from the phishing template library or the brand template library;

an encoding format determining unit, configured to determine whether the encoding format extracted from the to-be-detected webpage is the same as the encoding format in the template file;

a vocabulary quantity determining unit, configured to determine, when the encoding format extracted from the to-be-detected webpage is the same as the encoding format in the template file, whether the absolute value of the difference between the number of words extracted from the to-be-detected webpage and the number of words in the template file is within a preset quantity range;

a vocabulary determining unit, configured to determine, when the absolute value of the difference between the number of words extracted from the to-be-detected webpage and the number of words in the template file is within the preset quantity range, whether the vocabulary similarity between the vocabulary extracted from the to-be-detected webpage and the vocabulary in the template file is between a high vocabulary-similarity preset value and a low vocabulary-similarity preset value;

an object model determining unit, configured to determine, when the vocabulary similarity is between the high vocabulary-similarity preset value and the low vocabulary-similarity preset value, the model similarity between the document object model extracted from the to-be-detected webpage and the document object model in the template file, and to determine whether the model similarity is greater than a model-similarity preset value;

The phishing webpage determining module is configured to determine that the webpage to be detected is a phishing webpage when the model similarity is greater than the model-similarity preset value or when the vocabulary similarity is higher than the high vocabulary-similarity preset value.

The phishing webpage detecting device according to claim 10, wherein the template file library comprises:

a phishing template library, configured to save template files including content features extracted from phishing webpages; and a brand template library, configured to save template files including content features extracted from brand webpages.

The phishing webpage detecting device according to claim 11, further comprising: a phishing template library establishing module, configured to match content features extracted from a phishing webpage against the content features in each template file in the phishing template library, and determine the similarity between the content features extracted from the phishing webpage and each of the template files; and, when the similarity between the content features extracted from the phishing webpage and each of the template files is less than the preset similarity threshold, form the content features extracted from the phishing webpage into a template file and write the template file into the phishing template library;

a brand template library building module, configured to match content features extracted from a brand webpage against the content features in each template file in the brand template library, and determine the similarity between the content features extracted from the brand webpage and each of the template files; and, when the similarity between the content features extracted from the brand webpage and each of the template files is less than the preset similarity threshold, form the content features extracted from the brand webpage into a template file and write the template file into the brand template library.

The phishing webpage detecting device according to claim 12, further comprising: a trusted domain name database establishing module, configured to: when the top-level domain name in the collected uniform resource locator is a non-national top-level domain name, extract the second-level domain name from the uniform resource locator and write it into the trusted domain name database; and when the top-level domain name in the collected uniform resource locator is a national domain name and the second-level domain name is a top-level domain name string, extract the third-level domain name from the uniform resource locator and write it into the trusted domain name database.

14. A method for detecting a phishing webpage, comprising:

When the unique domain name does not exist in the trusted domain name database, the similarity between the content feature extracted from the to-be-detected webpage and the content feature in each template file of the template file library is determined; the content feature includes at least: Vocabulary, vocabulary quantity, and document object model;

The method for detecting a phishing webpage according to claim 14, wherein the determining the similarity between the content feature extracted from the webpage to be detected and the content features in each template file of the template file library includes:

Reading a template file from the template file library, and determining whether an absolute value of a difference between the number of words extracted from the to-be-detected web page and the number of words in the template file is within a preset number range;

Determining that the to-be-detected webpage is a phishing webpage when the model similarity is greater than a model-similarity preset value or when the vocabulary similarity is higher than a high vocabulary-similarity preset value; reading the next template file from the phishing template library or the brand template library and repeating the above steps, until the template file with the highest model similarity is found among the template files that reach the model-similarity preset value.

The method for detecting a phishing webpage according to claim 14 or 15, wherein the trusted domain name database is used to store trusted unique domain names, and the template file library is a brand template library or a phishing template library. The template file in the phishing template library includes content features extracted from a phishing webpage, and the template file in the brand template library includes content features extracted from a brand webpage.

The method for detecting a phishing webpage according to claim 14 or 15, wherein after the determining that the webpage to be detected is a phishing webpage, the method further comprises: outputting the phishing webpage name, or the name of the counterfeited brand webpage, corresponding to the template file whose similarity is greater than the preset similarity threshold.

The method for detecting a phishing webpage according to claim 14 or 15, wherein before determining whether a unique domain name corresponding to the webpage to be detected exists in the trusted domain name database, the method further comprises: matching content features extracted from a phishing webpage against the content features in each template file in the phishing template library, and determining the similarity between the content features extracted from the phishing webpage and each of the template files; and when the similarity between the content features extracted from the phishing webpage and each of the template files is less than the preset similarity threshold, forming the content features extracted from the phishing webpage into a template file and writing the template file into the phishing template library.

The method for detecting a phishing webpage according to claim 14, wherein before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name database, the method further comprises: