WO2012089005A1 - Method and apparatus for phishing web page detection - Google Patents

Method and apparatus for phishing web page detection

Info

Publication number
WO2012089005A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
phishing
template
similarity
domain name
Application number
PCT/CN2011/083745
Inventor
马勺布
郭辉
Original Assignee
成都市华为赛门铁克科技有限公司
Application filed by 成都市华为赛门铁克科技有限公司
Publication of WO2012089005A1
Priority to US13/689,230 (US9218482B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/44 Program or device authentication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Definitions

  • The embodiments of the present invention relate to network technologies, and in particular, to a phishing webpage detection method and device.

Background

  • the phishing website reporting mechanism is a basic solution to protect against phishing attacks.
  • the anti-phishing organization encourages the end user to submit the discovered phishing information.
  • the phishing information includes the Uniform Resource Locator (URL), the mail content, etc., and then the collected phishing information is discriminated and organized into a knowledge base. For example, a URL list method, a one-way hash (Hash) value method, and the like.
  • The knowledge base is deployed in various security devices or client software. When the device detects that the currently visited webpage is in the knowledge base, it intercepts and filters the webpage, thereby preventing phishing attacks.
  • the general method is to integrate the Phishing detection module into the client software.
  • The Phishing detection module calculates the suspiciousness of the webpage according to the local or remote data query result; when the suspiciousness is high, it sends an alert message to the user.
  • the remote Anti-Phishing server provides data update, query, filtering and other functions to many client Phishing detection modules.
  • The monitoring basis of the Phishing detection module mainly includes: a list of known phishing URLs, a list of Phishing IPs, a list of trusted domain names, phishing keywords, and general features of phishing pages.
  • The general features of a phishing webpage include: HyperText Markup Language (HTML) input tags, data matching social security numbers, inconsistency between the displayed URL and the real URL, etc.
  • the embodiment of the invention provides a method and a device for detecting a phishing webpage, which are used to improve the detection accuracy of the phishing website.
  • the embodiment of the invention provides a method for detecting a phishing webpage, including:
  • the similarity between the content feature extracted from the to-be-detected webpage and the content feature in each template file of the template file library is determined;
  • The content feature includes at least: an encoding format, a document object model, a vocabulary, and a number of words;
  • the embodiment of the invention provides a phishing webpage detecting device, which comprises:
  • a template file library configured to save a plurality of template files, where the template file includes content features extracted from a webpage; the content features include at least: a coding format of the webpage, a document object model, a vocabulary, and a number of words;
  • a domain name determining module configured to determine whether there is a unique domain name corresponding to the to-be-detected webpage in the trusted domain name database
  • a content extraction module configured to extract content features from the to-be-detected webpage when the unique domain name does not exist in the trusted domain name database
  • a similarity determining module configured to respectively determine a similarity between a content feature extracted from the to-be-detected webpage and a content feature in each template file of the template file library
  • a phishing webpage determining module configured to determine that the webpage to be detected is a phishing webpage when the similarity between the content features extracted from the webpage to be detected and the content features in at least one template file is greater than a preset similarity threshold.
  • In the embodiments of the invention, the content features of the webpage to be detected, such as the encoding format, the document object model, the vocabulary, and the number of words, are compared with the content features in each template file of the template file library.
  • The similarity between the extracted content features and the content features in each template file in the template file library determines whether the page to be detected is a phishing page. Therefore, the present invention can improve the accuracy of the phishing webpage detection result by using the content features to determine whether a webpage is a phishing webpage.
  • In addition, since the present invention first determines whether the web page to be detected is a trusted web page through the continuously updated trusted domain name library, the probability of misidentifying a brand web page as a phishing web page is reduced.
  • FIG. 1 is a flowchart of Embodiment 1 of a method for detecting a phishing webpage according to the present invention
  • FIG. 2 is a flowchart of Embodiment 2 of a method for detecting a phishing webpage according to the present invention
  • FIG. 3 is a flowchart of Embodiment 3 of a method for detecting a phishing webpage according to the present invention
  • FIG. 4A is a schematic structural diagram of Embodiment 1 of a phishing webpage detecting device provided by the present invention
  • FIG. 4B is a schematic diagram of an application scenario of a phishing webpage detecting device provided by the present invention
  • FIG. 5 is a schematic structural diagram of Embodiment 2 of a phishing webpage detecting apparatus provided by the present invention
  • FIG. 6 is a schematic structural diagram of a similarity determining module in FIG. 4 or FIG. 5;
  • FIG. 7 is a schematic structural diagram of Embodiment 3 of a phishing webpage detecting apparatus provided by the present invention.

Detailed description

  • FIG. 1 is a flowchart of Embodiment 1 of a method for detecting a phishing webpage according to the present invention. As shown in FIG. 1, this embodiment includes:
  • Step 11 Determine whether there is a unique domain name corresponding to the webpage to be detected in the trusted domain name database.
  • The webpage to be detected may be acquired in several ways. One is to download the to-be-detected webpage according to its URL and store the downloaded webpage in a storage medium; another is to extract data packets directly from network communication traffic, in which case the data packets are further parsed to form an HTML file.
  • the unique domain name is extracted from the URL corresponding to the webpage to be detected, and the unique domain name is searched in the trusted domain name database.
  • If the unique domain name exists in the trusted domain name database, that is, the unique domain name is a trusted domain name, the to-be-detected webpage corresponding to the unique domain name is not a phishing webpage.
  • If the unique domain name does not exist in the trusted domain name database, the to-be-detected webpage may or may not be a phishing webpage, and the subsequent content feature matching process is needed to detect whether it is a phishing webpage.
  • The purpose of this check is to exclude brand web pages, and web pages that have never been targeted by phishing, from phishing webpage detection.
  • The trusted domain name database needs to be updated periodically.
  • The collection and extraction of domain names are mainly based on the following principles:
  • The URLs are retrieved one by one from the collected URL list.
  • If the top-level domain in a URL is a non-country top-level domain, the second-level domain name is extracted from the URL and written into the trusted domain name database.
  • If the top-level domain in the URL is a country domain and the second-level domain is a common top-level domain string, the third-level domain name is extracted from the URL and written into the trusted domain name database.
  • For example, if the top-level domain in the URL is a non-country top-level domain such as ".com", ".org", ".edu", ".net", ".gov", ".int", ".mil", ".biz", ".info", ".pro", ".name" or ".idv", the second-level domain name is extracted from the URL. If the top-level domain is a country domain, it is determined whether the second-level domain is a commonly used top-level domain string, for example "com", "org", "net", "gov", "edu" or "biz"; if so, the third-level domain name is extracted, otherwise only the second-level domain name is extracted.
  • The extracted domain names are, for example: huawei.com, huawei.com.cn, sina.com.cn, apwg.org, apwg.net, etc.
  • The extracted domain names are stored in a hash table to facilitate subsequent queries.
  • the specific hash algorithm for establishing the hash table may adopt a standard algorithm such as MD5 or SHA1, or a custom algorithm.
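The extraction rules above can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the function names `unique_domain`, `domain_key`, `build_trusted_db` and `is_trusted` are hypothetical, the two label sets simply copy the examples listed in the text, and MD5 digests stored in a Python set stand in for the hash table.

```python
import hashlib
from urllib.parse import urlsplit

# Non-country top-level domains listed in the text (assumed to be exhaustive here).
GENERIC_TLDS = {"com", "org", "edu", "net", "gov", "int",
                "mil", "biz", "info", "pro", "name", "idv"}
# Second-level labels treated as "common top-level domain strings" under a country domain.
COMMON_SECOND_LEVEL = {"com", "org", "net", "gov", "edu", "biz"}


def unique_domain(url: str) -> str:
    """Return the unique domain name of a URL under the rules described above."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    if labels[-1] in GENERIC_TLDS:
        keep = 2          # e.g. www.huawei.com -> huawei.com
    elif len(labels) >= 2 and labels[-2] in COMMON_SECOND_LEVEL:
        keep = 3          # e.g. www.huawei.com.cn -> huawei.com.cn
    else:
        keep = 2          # country domain without a common second-level label
    return ".".join(labels[-keep:])


def domain_key(domain: str) -> str:
    """Hash a domain name; MD5 is one of the standard algorithms the text mentions."""
    return hashlib.md5(domain.encode("utf-8")).hexdigest()


def build_trusted_db(urls):
    """Build the trusted domain name database as a set of hashed unique domains."""
    return {domain_key(unique_domain(u)) for u in urls}


def is_trusted(url: str, trusted_db) -> bool:
    """Step 11: look up the unique domain name of the URL in the trusted database."""
    return domain_key(unique_domain(url)) in trusted_db
```

With this sketch, a URL such as `http://huawei.com.cn.evil.example/login` reduces to `evil.example`, so it is not mistaken for the trusted domain it embeds.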
  • Step 12 When the unique domain name does not exist in the trusted domain name database, determine the similarity between the content features extracted from the web page to be detected and the content features in each template file of the template file library.
  • the template file library can be a brand template library or a phishing template library.
  • The template file library is configured to save template files including content features extracted from phishing webpages, or to save template files including content features extracted from brand webpages; the content features include at least the following, extracted from the webpage: an encoding format, a document object model, a vocabulary, and a vocabulary quantity.
  • The content features are extracted from the to-be-detected webpage and matched with the content features saved in each template file in the phishing template library; alternatively, they may be matched with the content features saved in each template file in the brand template library, to determine the similarity between the content features extracted from the web page to be detected and the content features in each template file.
  • the embodiment of the present invention can determine the similarity between the to-be-detected webpage and the brand webpage or the phishing webpage by analyzing the content features including the encoding format, the document object model, the vocabulary and the vocabulary quantity.
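As an illustration of the four content features, the following Python sketch extracts them from an HTML string with the standard-library `html.parser`. The text does not specify how each feature is represented, so everything here is an assumption: the `ContentFeatures` type and `extract_features` function are hypothetical, the document object model is approximated by the sequence of opening tag names, the encoding format is read from a `<meta>` charset declaration, and the vocabulary quantity is taken to be the number of distinct words.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ContentFeatures:
    encoding: str      # declared character encoding, "" if none is found
    dom: list          # opening tag names in document order (simplified DOM)
    vocabulary: set    # distinct lower-cased words in the visible text
    word_count: int    # vocabulary quantity (assumed: number of distinct words)


class _Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags, self.text, self.charset = [], [], ""
        self._skip = 0  # depth inside <script>/<style>, whose text is ignored

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        attrs = dict(attrs)
        if tag == "meta":
            if attrs.get("charset"):
                self.charset = attrs["charset"].lower()
            else:
                m = re.search(r"charset=([\w-]+)", attrs.get("content") or "", re.I)
                if m:
                    self.charset = m.group(1).lower()
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text.append(data)


def extract_features(html: str) -> ContentFeatures:
    """Extract encoding format, simplified DOM, vocabulary and vocabulary quantity."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    words = set(re.findall(r"\w+", " ".join(collector.text).lower()))
    return ContentFeatures(collector.charset, collector.tags, words, len(words))
```

A template file could then be a serialized `ContentFeatures` record together with the name of the webpage it was extracted from.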
  • the phishing template library includes a plurality of phishing template files for storing content features extracted from each phishing webpage.
  • the content features are extracted from multiple phishing webpages, and the content features of each phishing webpage are separately saved in the form of template files.
  • the brand template gallery includes multiple brand template files for saving content features extracted from various brand web pages.
  • Brand webpages are webpages that are often counterfeited or that may be counterfeited, such as the webpages of major banks around the world, insurance companies, online payment agencies, enterprises, and social networking sites.
  • When the brand template library is created, content features are extracted from multiple brand web pages, and the content features of each brand web page are separately saved in the form of template files.
  • Step 13 When the similarity between the content feature extracted from the webpage to be detected and the content feature in at least one template file is greater than a preset similarity threshold, determine that the webpage to be detected is a phishing webpage.
  • the webpage to be detected is determined to be a phishing webpage of the non-counterfeit brand webpage.
  • the similarity can be a percentage value or other custom type.
  • the similarity is a percentage value, the higher the percentage value, the greater the similarity; the similarity can also be a value from 0 to 100. In this case, the larger the value, the greater the similarity, wherein the preset similarity threshold may be an empirical value.
  • Since each template file of the phishing template library corresponds to one phishing webpage, the webpage name of the phishing webpage similar to the webpage to be detected may be determined.
  • the webpage to be detected is a phishing webpage of a counterfeit brand webpage.
  • the brand template file saves the content characteristics of the brand webpage.
  • When the unique domain name of the webpage to be detected is not a trusted domain name but the similarity between its content features and those of a brand webpage is high, the webpage to be detected is determined to be a phishing webpage that counterfeits the brand webpage.
  • the template file saves the content feature of the phishing webpage or the content feature of the brand webpage.
  • The webpage to be detected is determined to be a phishing webpage of the non-phishing brand webpage. Since phishing web pages are usually generated by automated programs or directly counterfeit brand web pages, and the content features of most phishing web pages are basically similar, the content features reflect the characteristics of phishing web pages. Therefore, the present invention can improve the accuracy of the phishing webpage detection result by using the content features to determine whether a webpage is a phishing webpage. In addition, since the present invention first determines, through the continuously updated trusted domain name library, whether the to-be-detected webpage is a trusted webpage, the probability of misjudging a brand webpage as a phishing webpage is reduced.
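Steps 11 to 13 of Embodiment 1 can be summarised by the short Python sketch below. It is illustrative only: `detect` and `jaccard` are hypothetical names, the Jaccard score over word sets is merely a stand-in for whatever similarity measure is used, and the default threshold of 80 is an arbitrary placeholder for the empirical value the text mentions.

```python
def jaccard(a: set, b: set) -> float:
    """Stand-in similarity on word sets, scaled to 0..100 (an assumption)."""
    union = a | b
    return 100 * len(a & b) / len(union) if union else 0.0


def detect(domain, page, trusted_domains, templates, similarity=jaccard, threshold=80.0):
    """Return (is_phishing, name of the best-matching template or None).

    domain          -- unique domain name of the webpage to be detected
    page            -- content features of the webpage to be detected
    trusted_domains -- trusted domain name database
    templates       -- mapping of webpage name -> content features of a template file
    """
    if domain in trusted_domains:              # Step 11: trusted, no further checks
        return False, None
    best_name, best_score = None, threshold
    for name, features in templates.items():   # Step 12: compare with every template
        score = similarity(page, features)
        if score > best_score:                 # Step 13: above the preset threshold
            best_name, best_score = name, score
    return best_name is not None, best_name
```

The same function covers both library types: with a phishing template library the returned name is the similar phishing webpage, and with a brand template library it is the brand webpage being counterfeited.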
  • FIG. 2 is a flowchart of Embodiment 2 of a method for detecting a phishing webpage according to the present invention. This example mainly describes how to match the content features of the web page to be detected with the phishing template file in the phishing template library. As shown in FIG. 2, this embodiment includes:
  • Step 20 Extract the content feature from the web page to be detected.
  • Before the content feature is extracted, the trusted domain name database is first searched for the unique domain name of the web page to be detected. Since the trusted domain name database stores trusted unique domain names, when the unique domain name of the to-be-detected webpage exists in the trusted domain name database, the webpage is determined to be a trusted webpage. If the unique domain name of the webpage to be detected does not exist in the trusted domain name database, step 20 is performed to further determine whether the webpage to be detected is a phishing webpage.
  • Step 21 Determine whether there is a phishing template file in the phishing template library that has not yet been matched with the web page to be detected. If yes, go to step 22, otherwise end.
  • Alternatively, step 21 may be: determining whether the brand template library has a brand template file that has not yet been matched with the web page to be detected.
  • Step 22 Read a phishing template file that has not yet been matched with the page to be detected from the phishing template library.
  • When the phishing template library is established, the content features extracted from a phishing webpage are matched with the content features in each phishing template file in the phishing template library to determine the similarity between the content features extracted from the phishing webpage and each phishing template file, and the similarity is used to decide whether the content features are written into the phishing template library in the form of a phishing template file.
  • If the similarity between the content features extracted from the phishing webpage and each phishing template file is less than a preset similarity threshold, the content features extracted from the phishing webpage form a phishing template file that is written into the phishing template library.
  • Similarly, when the brand template library is established, the content features extracted from a brand webpage are matched with the content features in each brand template file in the brand template library to determine the similarity between the content features extracted from the brand webpage and each brand template file, and the similarity is used to decide whether the content features are written into the brand template library in the form of a brand template file.
  • If the similarity between the content features extracted from the brand webpage and each brand template file is less than the preset similarity threshold, the content features extracted from the brand webpage form a brand template file that is written into the brand template library.
  • Step 23 Determine whether the encoding format of the to-be-detected webpage is the same as the encoding format in the current phishing template file. If it is not the same, go back to step 21 and if it is the same, go to step 24.
  • Step 24 When the encoding format of the to-be-detected webpage is the same as the encoding format in the current phishing template file, determine whether the absolute value of the difference between the vocabulary quantity extracted from the to-be-detected webpage and the vocabulary quantity in the current template file is within the quantity similarity preset range. If it is not within the quantity similarity preset range, return to step 21; if it is within the range, go to step 25.
  • If the encoding formats are the same, the webpage to be detected may be a phishing webpage, and further judgment is required to determine whether it is a phishing webpage.
  • The vocabulary quantity extracted from the web page to be detected is compared with the vocabulary quantity in the current phishing template file. If the difference between the two is large, the web page to be detected is considered not to match the current phishing template file.
  • The quantity similarity preset range can be set according to the number of words in the web page to be detected.
  • Step 25 When the vocabulary quantity extracted from the webpage to be detected is within the quantity similarity preset range, determine whether the vocabulary similarity between the vocabulary extracted from the webpage to be detected and the vocabulary in the current phishing template file lies between the vocabulary similarity low preset value and the vocabulary similarity high preset value. If the vocabulary similarity lies between the low preset value and the high preset value, perform step 26. If the vocabulary similarity is greater than the vocabulary similarity high preset value, perform step 27. If the vocabulary similarity is less than the vocabulary similarity low preset value, return to step 21.
  • The vocabulary similarity is a measure of how many words in the web page to be detected are the same as those in a phishing template file.
  • The vocabulary similarity can be described by a formula. For example, suppose the web page to be detected has m words, a phishing template file has n words, and the two have s words in common.
  • The vocabulary similarity can then be described as a percentage value: [2 × s / (m + n)] × 100.
  • When the value is high, the vocabulary in the web page to be detected is considered highly similar to the vocabulary of the phishing template file.
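The formula [2 × s / (m + n)] × 100 translates directly into Python. The sketch below assumes that the vocabularies are treated as sets of distinct words, so m and n are set sizes and s is the size of their intersection; the function name `lexical_similarity` is hypothetical.

```python
def lexical_similarity(page_words, template_words) -> float:
    """Vocabulary similarity as a percentage: [2 * s / (m + n)] * 100."""
    page, template = set(page_words), set(template_words)
    m, n = len(page), len(template)
    if m + n == 0:
        return 0.0                      # two empty vocabularies: treated as no similarity
    s = len(page & template)            # words the two vocabularies have in common
    return 200 * s / (m + n)
```

For example, a page with 4 distinct words and a template with 6 distinct words that share 3 words give 2 × 3 / (4 + 6) × 100 = 60.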
  • In that case, the vocabulary of the webpage to be detected is essentially the same as the vocabulary of the phishing template file.
  • The webpage corresponding to the phishing template file is a phishing webpage. If the current template file is a brand template file whose corresponding webpage is a brand webpage, then, since it was determined before the content features were extracted that the unique domain name of the webpage to be detected does not exist in the trusted domain name database, the webpage to be detected is likewise determined to be a phishing webpage.
  • the vocabulary similarity is less than the vocabulary similarity high preset value, it indicates that the vocabulary of the web page to be detected is less than the same vocabulary of the template file, and it can
  • Step 26 When the vocabulary similarity lies between the vocabulary similarity low preset value and the vocabulary similarity high preset value, determine whether the model similarity between the document object model extracted from the to-be-detected webpage and the document object model in the current phishing template file is greater than the model similarity preset value. If so, perform step 27; otherwise return to step 21.
  • If the model similarity between the document object model extracted from the web page to be detected and the document object model in the current phishing template file is greater than the model similarity preset value, the two are similar in terms of the document object model.
  • The model similarity can be expressed as a percentage, or as a value from 0 to 100.
  • When the model similarity is a percentage, the model similarity preset value can be, for example, 80%.
  • When the model similarity is a value from 0 to 100, the model similarity preset value can be, for example, 50.
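The text does not say how the model similarity between two document object models is computed. Purely as an assumption, the sketch below flattens each DOM into its sequence of tag names and scores the two sequences with `difflib.SequenceMatcher` from the Python standard library, scaled to 0..100; `model_similarity` is a hypothetical name and any tree-comparison measure could be substituted.

```python
from difflib import SequenceMatcher


def model_similarity(page_tags, template_tags) -> float:
    """Similarity of two tag-name sequences on a 0..100 scale (assumed measure)."""
    matcher = SequenceMatcher(None, list(page_tags), list(template_tags), autojunk=False)
    return 100 * matcher.ratio()


def models_are_similar(page_tags, template_tags, preset=80.0) -> bool:
    """Step 26: compare the model similarity with the model similarity preset value."""
    return model_similarity(page_tags, template_tags) > preset
```

A flattened tag sequence ignores nesting depth, which keeps the comparison cheap; a measure that preserves the tree structure would be stricter.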
  • Step 27 When the model similarity is greater than the model similarity preset value, determine that the webpage to be detected is a phishing webpage, and output the phishing webpage name corresponding to the phishing template file. Then return to step 21 to continue matching with the subsequent template files.
  • The purpose of continuing to match with the subsequent template files is to find, among the template files whose model similarity reaches the model similarity preset value, the template file with the highest similarity, and thereby output the name of the phishing page corresponding to that template file.
  • When the brand template library is used, the web page name of the brand web page corresponding to the brand template file is output in step 27.
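The cascade of steps 21 to 27 can be sketched as a single loop over the template files. This is a hedged illustration, not the patent's implementation: the `Features` type, the helper functions and all preset values (`count_range`, `lex_low`, `lex_high`, `model_min`) are hypothetical, the model similarity uses `difflib.SequenceMatcher` as a stand-in, and ranking candidates by whichever score admitted them is an assumption.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Features:
    name: str          # webpage name of the template ("" for the page under test)
    encoding: str
    dom: list          # tag names in document order
    vocabulary: set


def lexical(a: set, b: set) -> float:
    total = len(a) + len(b)
    return 200 * len(a & b) / total if total else 0.0


def model(a: list, b: list) -> float:
    return 100 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def match_templates(page, templates, count_range=20,
                    lex_low=40.0, lex_high=90.0, model_min=80.0):
    """Return the name of the best-matching template, or None if nothing matches."""
    best = None                                            # (score, name)
    for t in templates:                                    # Steps 21-22: next template
        if page.encoding != t.encoding:                    # Step 23: encoding format
            continue
        if abs(len(page.vocabulary) - len(t.vocabulary)) > count_range:   # Step 24
            continue
        lex = lexical(page.vocabulary, t.vocabulary)       # Step 25: vocabulary similarity
        if lex < lex_low:
            continue                                       # too dissimilar: next template
        if lex > lex_high:
            score = lex                                    # similar enough: go to step 27
        else:
            score = model(page.dom, t.dom)                 # Step 26: model similarity
            if score <= model_min:
                continue
        if best is None or score > best[0]:                # Step 27: keep the best match
            best = (score, t.name)
    return best[1] if best else None
```

Putting the cheap checks (encoding format, vocabulary quantity) first means most templates are rejected before the more expensive vocabulary and DOM comparisons run.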
  • The phishing template may include only some of the content features among the encoding format, the vocabulary quantity, the vocabulary, and the document object model; the judgments above may also be flexibly combined, and the order in which the similarity judgments are made can be flexibly adjusted. For example:
  • Step 23 is omitted.
  • A phishing template file that has not yet been matched with the to-be-detected page is read in sequence from the phishing template library, and the process proceeds directly to step 24 to determine whether the absolute value of the difference between the vocabulary quantity extracted from the web page to be detected and the vocabulary quantity in the current template file is within the quantity similarity preset range. If it is not within the range, return to step 21; if it is within the range, go to step 25.
  • Alternatively, the vocabulary quantity and vocabulary similarity judgments described in steps 24 to 25 are performed first; when it cannot be determined from the vocabulary quantity and the vocabulary similarity whether the webpage is a phishing webpage, the encoding format judgment of step 23 is performed: if the encoding format is the same, the webpage is a phishing webpage; otherwise it is a non-phishing webpage.
  • Alternatively, the content features extracted from the webpage to be detected are matched in turn with the content features saved in each phishing template file in the phishing template library. If the encoding format is the same as that of the currently matched phishing template file, it is determined that the webpage to be detected is a phishing webpage, and matching continues with the next phishing template file.
  • If the encoding format is different, the vocabulary quantity is matched against that of the current phishing template file.
  • If the vocabulary quantity matches, the page to be detected is determined to be a phishing page; otherwise, vocabulary similarity matching is continued with the phishing template file.
  • If the vocabulary similarity matches, the webpage to be detected is determined to be a phishing webpage, and matching continues with the next phishing template file; otherwise, the model similarity is matched against the DOM of the phishing template file, and if the model similarity is greater than the model similarity preset value, the webpage to be detected is determined to be a phishing webpage.
  • The webpage name corresponding to the currently matched phishing template file is also output.
  • the content features of the web page to be detected can be matched with each template file in the brand template library.
  • When a match is found, the name of the webpage corresponding to the template file may be output, that is, the name of the brand webpage counterfeited by the webpage to be detected.
  • FIG. 3 is a flowchart of Embodiment 3 of a method for detecting a phishing webpage according to the present invention.
  • This example mainly describes the process of establishing a brand template file in the brand template library.
  • The process of establishing a phishing template file in the phishing template library is similar to that for the brand template library. The only difference is that the phishing template file in the phishing template library is used to save the content features of a known phishing webpage, and the brand template file in the brand template library is used to save the content features of a known brand webpage.
  • As shown in FIG. 3, this embodiment includes:
  • Step 30 Determine whether there are still unprocessed URLs in the brand URL list. If so, go to step 31; otherwise end.
  • Step 31 Read an unprocessed URL in order from the brand URL list.
  • Step 32 Download the corresponding web page according to the read URL.
  • Step 33 Extract the content features from the downloaded web page: the encoding format, vocabulary, vocabulary quantity, and DOM of the downloaded web page.
  • Step 34 Determine whether the brand template library contains a brand template file that has not yet been matched with the content features extracted from the downloaded web page. If there is such a brand template file, go to step 35; otherwise go to step 37.
  • Step 35 Read a brand template file that has not been matched in order from the brand template gallery.
  • Step 36 Determine whether the similarity between the content feature of the downloaded webpage and the content feature of the current brand template file is less than a preset similarity threshold. If it is less than the preset similarity threshold, it is determined that the downloaded webpage is not similar to the current brand template file, and the process returns to step 34 to continue matching with the subsequent brand template files. If it is greater than the preset similarity threshold, it is determined that the downloaded webpage is similar to the current brand template file, so the content feature of the downloaded webpage does not need to be saved in the brand template library, and the process returns to step 30 to process the downloaded webpage corresponding to the next URL.
  • Step 37 Write the content characteristics of the downloaded web page into the brand template library in the form of a brand template file. Go back to step 30 to continue.
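The library-building loop of steps 30 to 37 amounts to adding a page's content features only when no existing template is already similar to them. The following Python sketch makes that explicit under stated assumptions: downloading and feature extraction (steps 31 to 33) are abstracted away so that `pages` is already an iterable of content features, and `build_template_library` and `word_overlap` are hypothetical names, with a word-set overlap score standing in for the real similarity measure.

```python
def word_overlap(a: set, b: set) -> float:
    """Stand-in similarity on word sets: [2 * s / (m + n)] * 100."""
    total = len(a) + len(b)
    return 200 * len(a & b) / total if total else 0.0


def build_template_library(pages, similarity=word_overlap, threshold=80.0):
    """Build a template library containing no two mutually similar entries.

    pages -- content features of the downloaded web pages, in URL-list order
    """
    library = []
    for features in pages:                                   # Steps 30-33
        # Steps 34-36: compare with every template already in the library.
        is_new = all(similarity(features, t) < threshold for t in library)
        if is_new:
            library.append(features)                         # Step 37: write a new template
    return library
```

The source does not say what happens when the similarity equals the threshold exactly; this sketch treats that case as "similar" and does not add a new template.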
  • FIG. 4A is a schematic structural diagram of Embodiment 1 of a phishing webpage detecting apparatus provided by the present invention. As shown in FIG. 4A, the embodiment includes: a trusted domain name library 40, a domain name determining module 41, a content extracting module 42, a similarity determining module 43, a phishing webpage determining module 44, and a template file library 45.
  • The trusted domain name library 40 is used to save trusted unique domain names.
  • the template file library 45 is configured to save a plurality of template files, and the template file includes content features extracted from the webpage; the content features include at least: a coding format of the webpage, a document object model, a vocabulary, and a vocabulary quantity.
  • the template file library includes: a phishing template library and a brand template library.
  • a phishing template library for saving template files including content features extracted from a phishing web page.
  • a brand template library for saving template files that include content features extracted from brand web pages.
  • the domain name determining module 41 is configured to determine whether there is a unique domain name corresponding to the webpage to be detected in the trusted domain name library 40.
  • The content extraction module 42 is configured to extract content features from the webpage to be detected when the domain name determining module 41 determines that the unique domain name does not exist in the trusted domain name library.
  • the similarity determining module 43 is configured to respectively determine the similarity between the content features extracted by the content extraction module 42 from the web page to be detected and the content features in the template files of the template file library 45.
  • The phishing webpage determining module 44 is configured to determine that the webpage to be detected is a phishing webpage when the similarity between the content features extracted from the webpage to be detected and the content features in at least one template file is greater than a preset similarity threshold.
  • The phishing webpage detecting device can detect webpages without the cooperation of a remote device, and can be deployed at any network node to support high-traffic detection. For example, it can be deployed on network traffic monitoring devices, firewall devices, and routers.
  • FIG. 4B is a schematic diagram of an application scenario of a phishing webpage detecting device provided by the present invention. As shown in FIG. 4B, the phishing webpage detecting device obtains the URL of the webpage to be detected from the network traffic monitoring device, downloads the webpage to be detected from the network according to the URL, and then outputs the detection result to other devices.
  • FIG. 4C is a schematic diagram of another application scenario of the phishing webpage detecting device provided by the present invention. As shown in FIG. 4C, the phishing webpage detecting device directly obtains an HTTP data packet from the network traffic monitoring device for phishing webpage detection, and outputs the detection result to other devices.
  • The embodiment further includes: a webpage name output module 46, configured to output the phishing webpage name corresponding to the template files, or the corresponding counterfeited brand webpage name.
  • The domain name determining module 41 looks up the unique domain name corresponding to the webpage to be detected in the locally saved trusted domain name library; when the unique domain name does not exist in the trusted domain name library, the similarity determining module 43 matches the content features of the webpage to be detected against the locally saved template files to determine the similarity.
  • The present invention determines whether a webpage is a phishing webpage by its content features, which improves the accuracy of the phishing webpage detection result. In addition, because the present invention first determines whether the webpage to be detected is a trusted webpage through the continuously updated trusted domain name library, the probability of misjudging a brand webpage as a phishing webpage is reduced.
  • Fig. 6 is a schematic structural view of the similarity determining module in Fig. 4 or Fig. 5.
  • the similarity determining module 43 includes: a reading unit 431, an encoding format determining unit 432, a vocabulary number determining unit 433, a vocabulary determining unit 434, and an object model determining unit 435.
  • the reading unit 431 is configured to read a template file from the phishing template library or the brand template library.
  • the encoding format determining unit 432 is configured to determine whether the encoding format extracted from the web page to be detected is the same as the encoding format in the template file.
  • The vocabulary number determining unit 433 is configured to determine, when the encoding format determining unit 432 determines that the encoding formats are the same, whether the number of words extracted from the webpage to be detected is within a preset range corresponding to the number of words in the template file.
  • The vocabulary determining unit 434 is configured to determine, when the number of words is within the preset range, whether the vocabulary similarity between the words extracted from the webpage to be detected and the words in the template file is higher than the high vocabulary-similarity preset value, or lies between the high and low vocabulary-similarity preset values.
  • The object model determining unit 435 is configured to determine, when the vocabulary similarity lies between the high and low vocabulary-similarity preset values, the model similarity between the document object model extracted from the webpage to be detected and the document object model in the template file, and to judge whether the model similarity is greater than the model-similarity preset value.
  • The phishing webpage determining module 44 is configured to determine that the webpage to be detected is a phishing webpage when the object model determining unit 435 determines that the model similarity is greater than the model-similarity preset value, or when the vocabulary determining unit 434 determines that the vocabulary similarity is higher than the high vocabulary-similarity preset value.
  • The content features extracted from the webpage to be detected are matched against the content features saved in each template file in the phishing template library to obtain multiple similarities. When a similarity is greater than the preset similarity threshold, the webpage to be detected is determined to be a phishing webpage, and the webpage name corresponding to that template file is determined, which identifies the phishing webpage that the webpage to be detected resembles.
  • the content features of the web page to be detected can be matched with each template file in the brand template library.
  • When a template file whose similarity is greater than the preset similarity threshold is found in the brand template library, the webpage to be detected is determined to be a phishing webpage, and the name of the webpage corresponding to that template file, that is, the name of the brand webpage counterfeited by the webpage to be detected, is also output.
  • FIG. 7 is a schematic structural diagram of Embodiment 3 of a phishing webpage detecting apparatus provided by the present invention. As shown in FIG. 7, the phishing template library building module 47, the brand template library building module 48, and the trust domain name database building module 49 are further included on the basis of FIG.
  • The phishing template library building module 47 is configured to match the content features extracted from a phishing webpage against the content features in the template files in the phishing template library and determine the similarity with each template file; when the similarity between the content features extracted from the phishing webpage and every template file is less than the preset similarity threshold, the content features are written into the phishing template library as a new template file.
  • The brand template library building module 48 is configured to match the content features extracted from a brand webpage against the content features in the template files in the brand template library and determine the similarity with each template file; when the similarity between the content features extracted from the brand webpage and every template file is less than the preset similarity threshold, the content features are written into the brand template library as a new template file.
  • The trusted domain name library building module 49 is configured to: if the top-level domain in a URL is a non-country top-level domain, extract the second-level domain name from the URL and write it into the trusted domain name library; if the top-level domain in the URL is a country domain and the second-level label is a top-level domain string, extract the third-level domain name from the URL and write it into the trusted domain name library.
  • The content features of the downloaded webpage are matched against the existing template files in the brand template library, and the downloaded webpage is stored in the brand template library as a template file only when no template file similar to its content features exists there, which avoids repeatedly saving template files for multiple similar webpages.
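The cascaded comparison performed by units 432 to 435 and the deduplicating library construction of modules 47 and 48, both described above, can be sketched together in Python. This is only a sketch under stated assumptions: the feature dictionary keys (`encoding`, `word_count`, `words`, `dom`), every threshold value, the Jaccard word metric, and the `difflib`-based tag-sequence metric are illustrative choices, since the text says only that the values are preset. Reusing the cascade as the "is similar" test when building a library is likewise an assumption.

```python
from difflib import SequenceMatcher

# Illustrative thresholds (assumptions; the text only calls them "preset").
COUNT_TOLERANCE = 0.2   # word count must lie within +/-20% of the template's
WORD_SIM_HIGH = 0.9     # "high vocabulary-similarity preset value"
WORD_SIM_LOW = 0.5      # "low vocabulary-similarity preset value"
DOM_SIM_MIN = 0.8       # "model-similarity preset value"


def word_similarity(a, b):
    """Jaccard similarity of two word collections (an assumed metric)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def dom_similarity(a, b):
    """Similarity of two tag sequences via difflib (an assumed metric)."""
    return SequenceMatcher(None, list(a), list(b)).ratio()


def matches_template(page, tpl):
    """Return True when the page is judged similar to the template."""
    # Unit 432: the encoding formats must be the same.
    if page["encoding"] != tpl["encoding"]:
        return False
    # Unit 433: the word count must fall in a range around the template's.
    low = tpl["word_count"] * (1 - COUNT_TOLERANCE)
    high = tpl["word_count"] * (1 + COUNT_TOLERANCE)
    if not low <= page["word_count"] <= high:
        return False
    # Unit 434: a very high word similarity decides on its own.
    sim = word_similarity(page["words"], tpl["words"])
    if sim > WORD_SIM_HIGH:
        return True
    if sim < WORD_SIM_LOW:
        return False
    # Unit 435: in the middle band, fall back to DOM similarity.
    return dom_similarity(page["dom"], tpl["dom"]) > DOM_SIM_MIN


def add_if_new(library, features):
    """Modules 47/48: store features only if no stored template is similar."""
    if any(matches_template(features, tpl) for tpl in library):
        return False
    library.append(features)
    return True
```

Ordering the checks from cheapest (encoding, word count) to most expensive (DOM comparison) lets most template files be rejected early, which fits the stated goal of supporting high-traffic detection.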

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present invention provide a method and an apparatus for phishing web page detection. The method comprises: determining whether a unique domain name corresponding to a web page to be detected exists in a trusted domain name library; when the unique domain name does not exist in the trusted domain name library, determining the similarity between content features extracted from the web page to be detected and content features of each template file in a template file library respectively, the content features at least comprising: an encoding format, a document object model, words, and the number of words; and when the similarity between the content features extracted from the web page to be detected and the content features in at least one template file is greater than a preset similarity threshold, determining that the web page to be detected is a phishing web page. The embodiments of the present invention enhance the accuracy of phishing web page detection results.

Description

Phishing webpage detection method and device

This application claims priority to Chinese Patent Application No. 201010620647.6, filed with the Chinese Patent Office on December 31, 2010 and entitled "Phishing webpage detection method and device", which is incorporated herein by reference in its entirety.

Technical field
The embodiments of the present invention relate to network technologies, and in particular, to a phishing webpage detection method and device.

Background
A phishing website reporting mechanism is a basic approach to protecting against phishing attacks. Anti-phishing organizations encourage end users to submit the phishing information they find, such as Uniform Resource Locators (URLs) and mail content. The collected phishing information is then screened and organized into a knowledge base, for example as a URL list or as one-way hash values. The knowledge base is deployed in security devices or client software; when such a device detects that the currently visited webpage is in the knowledge base, it intercepts and filters that webpage to prevent the phishing attack.
Currently, a common approach is to integrate a phishing detection module into client software. When a user accesses a webpage through a browser, the phishing detection module computes a suspicion score for the webpage from local or remote data query results and alerts the user when the score is high. A remote anti-phishing server provides data update, query, and filtering functions to the many client phishing detection modules. The detection module relies mainly on: a list of known phishing URLs, a list of phishing IP addresses, a list of trusted domain names, phishing keywords, and general features of phishing webpages. These general features include the presence of HyperText Markup Language (HTML) input tags, data that matches the format of a social security number, and a mismatch between the displayed URL and the real URL.
Because the URLs, IP addresses, and domain names of phishing webpages change frequently, and many normal webpages also contain phishing keywords, detecting phishing webpages with the above methods yields both a low recognition rate for phishing webpages and a high false positive rate for normal webpages. The detection accuracy of existing phishing webpage detection methods is therefore low.

Summary of the invention
The embodiments of the present invention provide a phishing webpage detection method and device, to improve the accuracy of phishing website detection.
An embodiment of the present invention provides a phishing webpage detection method, including:
determining whether a unique domain name corresponding to a webpage to be detected exists in a trusted domain name library;
when the unique domain name does not exist in the trusted domain name library, determining the similarity between the content features extracted from the webpage to be detected and the content features in each template file of a template file library, where the content features include at least: an encoding format, a document object model, words, and the number of words; and
when the similarity between the content features extracted from the webpage to be detected and the content features in at least one of the template files is greater than a preset similarity threshold, determining that the webpage to be detected is a phishing webpage.
An embodiment of the present invention provides a phishing webpage detection device, including:
a trusted domain name library, configured to save the unique domain names corresponding to trusted webpages;
a template file library, configured to save a plurality of template files, where each template file includes content features extracted from a webpage, and the content features include at least: the encoding format of the webpage, a document object model, words, and the number of words;
a domain name determining module, configured to determine whether the unique domain name corresponding to a webpage to be detected exists in the trusted domain name library;
a content extraction module, configured to extract content features from the webpage to be detected when the unique domain name does not exist in the trusted domain name library;
a similarity determining module, configured to determine the similarity between the content features extracted from the webpage to be detected and the content features in each template file of the template file library; and
a phishing webpage determining module, configured to determine that the webpage to be detected is a phishing webpage when the similarity between the content features extracted from the webpage to be detected and the content features in at least one of the template files is greater than a preset similarity threshold.
In the embodiments of the present invention, after it is determined that the unique domain name of the webpage to be detected is not a trusted domain name, the similarity between the content features of the webpage to be detected (such as the encoding format, document object model, words, and number of words) and the content features in each template file of the template file library is used to determine whether the webpage to be detected is a phishing webpage. Determining whether a webpage is a phishing webpage by its content features can therefore improve the accuracy of the detection result. In addition, because the continuously updated trusted domain name library is first used to determine whether the webpage to be detected is a trusted webpage, the probability of misjudging a brand webpage as a phishing webpage is reduced.

Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of Embodiment 1 of the phishing webpage detection method provided by the present invention;
FIG. 2 is a flowchart of Embodiment 2 of the phishing webpage detection method provided by the present invention;
FIG. 3 is a flowchart of Embodiment 3 of the phishing webpage detection method provided by the present invention;
FIG. 4A is a schematic structural diagram of Embodiment 1 of the phishing webpage detection device provided by the present invention;
FIG. 4B is a schematic diagram of one application scenario of the phishing webpage detection device provided by the present invention;
FIG. 4C is a schematic diagram of another application scenario of the phishing webpage detection device provided by the present invention;
FIG. 5 is a schematic structural diagram of Embodiment 2 of the phishing webpage detection device provided by the present invention;
FIG. 6 is a schematic structural diagram of the similarity determining module in FIG. 4 or FIG. 5;
FIG. 7 is a schematic structural diagram of Embodiment 3 of the phishing webpage detection device provided by the present invention.

Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a flowchart of Embodiment 1 of the phishing webpage detection method provided by the present invention. As shown in FIG. 1, this embodiment includes:
Step 11: Determine whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name library.
In this embodiment, the webpage to be detected can be obtained in several ways. One is to download it according to its URL and store the downloaded webpage in a storage medium; another is to extract data packets directly from network traffic, in which case the packets are further parsed to form an HTML file.
After the webpage to be detected is obtained, the unique domain name is extracted from its URL and looked up in the trusted domain name library. If the unique domain name exists in the library, it is a trusted domain name, and the corresponding webpage is not a phishing webpage. If it does not exist in the library, the webpage may or may not be a phishing webpage, and the subsequent content feature matching process is needed to detect whether it is one.
The trusted domain name library stores the unique domain names of tens of thousands, millions, or even tens of millions of trusted webpages, so that, when phishing webpages are detected, brand webpages and webpages that have never been attacked by phishing websites can first be excluded by their unique domain names. The trusted domain name library needs to be updated periodically. Domain names are collected and extracted mainly according to the following principle: URLs are taken one by one from the collected URL list; when the top-level domain in a URL is a non-country top-level domain, the second-level domain name is extracted from the URL and written into the trusted domain name library; when the top-level domain in the URL is a country domain and the second-level label is a top-level domain string, the third-level domain name is extracted from the URL and written into the trusted domain name library.
For example, if the top-level domain in the URL is a non-country top-level domain such as ".com", ".org", ".edu", ".net", ".gov", ".int", ".mil", ".biz", ".info", ".pro", ".name", or ".idv", the second-level domain name is extracted from the URL. If the top-level domain is a country or region domain, it is determined whether the second-level label is a commonly used top-level domain string such as "com", "org", "net", "gov", "edu", or "biz"; if so, the third-level domain name is extracted, otherwise only the second-level domain name is extracted. Extracted domain names look like this: huawei.com, huawei.com.cn, sina.com.cn, apwg.org, apwg.net. After extraction, the domain names are converted into a hash table for storage to speed up later lookups; the hash table may be built with a standard algorithm such as MD5 or SHA1, or with a custom algorithm.
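The extraction rule and hash-table lookup described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `unique_domain`, `domain_key`, and `is_trusted` are hypothetical, the two label sets are copied from the examples in the text and are not exhaustive, SHA-1 via `hashlib` is just one of the algorithms the text permits, and URLs are assumed to carry a scheme so that `urlsplit` can locate the host.

```python
import hashlib
from urllib.parse import urlsplit

# Non-country top-level labels, as listed in the text (illustrative only).
GENERIC_TLDS = {"com", "org", "edu", "net", "gov", "int",
                "mil", "biz", "info", "pro", "name", "idv"}
# Second-level labels the text treats as "top-level domain strings".
SECOND_LEVEL_GENERIC = {"com", "org", "net", "gov", "edu", "biz"}


def unique_domain(url):
    """Extract the unique domain name from a URL that includes a scheme."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    if len(labels) < 2:
        return host
    if labels[-1] in GENERIC_TLDS:
        keep = 2            # non-country TLD: keep the second-level domain
    elif labels[-2] in SECOND_LEVEL_GENERIC:
        keep = 3            # e.g. "com.cn": keep the third-level domain
    else:
        keep = 2            # country TLD with an ordinary second-level label
    return ".".join(labels[-keep:])


def domain_key(domain):
    """Hash a domain name for storage; SHA-1 is an assumed choice."""
    return hashlib.sha1(domain.encode("utf-8")).hexdigest()


def build_trusted_library(urls):
    """Build the trusted library as a set of hashed unique domain names."""
    return {domain_key(unique_domain(u)) for u in urls}


def is_trusted(url, library):
    """Return True when the URL's unique domain name is in the library."""
    return domain_key(unique_domain(url)) in library
```

A Python `set` of digests stands in for the hash table here. A production system would more likely consult a maintained suffix list than two hard-coded label sets, but that goes beyond what the text describes.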
Step 12: When the unique domain name does not exist in the trusted domain name library, determine the similarity between the content features extracted from the webpage to be detected and the content features in each template file of the template file library.
The template file library may be a brand template library or a phishing template library. It is used to save template files that include content features extracted from phishing webpages, or template files that include content features extracted from brand webpages. The content features extracted from a webpage include at least: the encoding format, the document object model, the words, and the number of words.
When the unique domain name corresponding to the webpage to be detected does not exist in the trusted domain name library, content features are extracted from the webpage and matched against the content features saved in each template file in the phishing template library; they may also be matched against the content features saved in each template file in the brand template library. This determines the similarity between the content features extracted from the webpage to be detected and the content features in each template file.
Because a large number of phishing websites are generated by automated programs or directly counterfeit brand webpages, they usually use the same encoding format, closely matching words, and a similar Document Object Model (DOM), and their word counts are also close. The embodiments of the present invention can therefore determine the similarity between the webpage to be detected and a brand webpage or phishing webpage by analyzing content features that include the encoding format, document object model, words, and number of words.
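To make the four content features concrete, the following Python sketch extracts them from an HTML string with the standard-library `html.parser`. The choices here are assumptions rather than anything the text specifies: the encoding is read from a `<meta>` charset declaration, the sequence of start tags stands in for the document object model, words are lower-cased `\w+` tokens outside `<script>` and `<style>`, and the names `extract_features` and `_FeatureParser` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _FeatureParser(HTMLParser):
    """Collects the tag sequence, visible words, and declared charset."""

    def __init__(self):
        super().__init__()
        self.tags = []        # start tags in document order (DOM stand-in)
        self.words = []       # lower-cased word tokens from visible text
        self.encoding = None  # first charset declared in a <meta> tag
        self._skip = 0        # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "meta" and self.encoding is None:
            attr = dict(attrs)
            charset = attr.get("charset")
            if not charset:
                found = re.search(r"charset=([\w-]+)",
                                  attr.get("content") or "", re.IGNORECASE)
                charset = found.group(1) if found else None
            if charset:
                self.encoding = charset.lower()
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(re.findall(r"\w+", data.lower()))


def extract_features(html):
    """Return the content features of one page as a plain dictionary."""
    parser = _FeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "encoding": parser.encoding,
        "dom": tuple(parser.tags),
        "words": frozenset(parser.words),
        "word_count": len(parser.words),
    }
```

The returned dictionary is the kind of record that could be saved as one template file and later compared field by field against the features of a webpage to be detected.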
The phishing template library includes a plurality of phishing template files, which save the content features extracted from individual phishing webpages. When the phishing template library is built, content features are extracted from each of a number of phishing webpages, and the content features of each phishing webpage are saved separately as a template file.
The brand template library includes a plurality of brand template files, which save the content features extracted from individual brand webpages. A brand webpage is a webpage that is frequently counterfeited or likely to be counterfeited, such as the webpages of major banks worldwide, insurance company webpages, online payment institution or enterprise webpages, and the login pages of social networking sites. When the brand template library is built, content features are extracted from each of a number of brand webpages, and the content features of each brand webpage are saved separately as a template file.
Step 13: When the similarity between the content features extracted from the webpage to be detected and the content features in at least one template file is greater than a preset similarity threshold, determine that the webpage to be detected is a phishing webpage.
When the content features extracted from the webpage to be detected are similar to those of one or more phishing template files in the phishing template library, that is, when a phishing template file similar to the webpage to be detected exists, the webpage to be detected is determined to be a phishing webpage that does not counterfeit a brand webpage. For example, the similarity may be a percentage or another custom type. When it is a percentage, a higher percentage means greater similarity; it may also be a value from 0 to 100, in which case a larger value means greater similarity. The preset similarity threshold may be an empirical value.
In addition, because each template file in the phishing template library corresponds to one phishing webpage, when the content features of the webpage to be detected are determined to be the same as those of a phishing webpage, the webpage name of the phishing webpage similar to the webpage to be detected can also be determined.
When the content features extracted from the webpage to be detected are similar to those of one or more brand template files in the brand template library, that is, when a brand template file similar to the webpage to be detected exists, the webpage to be detected is determined to be a phishing webpage that counterfeits a brand webpage, because its unique domain name is not a trusted domain name.
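The decision in step 13 can be illustrated with a short Python sketch. It assumes the similarities have already been computed on a 0 to 100 scale and that the page's unique domain name was not found in the trusted domain name library; the function name `classify`, the threshold of 80, the verdict strings, and the choice to report brand matches before phishing-template matches are all illustrative assumptions.

```python
def classify(phishing_sims, brand_sims, threshold=80.0):
    """Classify an untrusted page from per-template similarities.

    Each argument maps the webpage name of a template file to its
    similarity (0 to 100) with the webpage to be detected.
    """
    brand_hits = sorted(n for n, s in brand_sims.items() if s > threshold)
    phishing_hits = sorted(n for n, s in phishing_sims.items() if s > threshold)
    if brand_hits:
        # Similar to a brand page but not on a trusted domain.
        return ("phishing: counterfeits a brand webpage", brand_hits)
    if phishing_hits:
        # Similar to a known phishing page.
        return ("phishing: similar to a known phishing webpage", phishing_hits)
    return ("no template matched", [])
```

Returning the matching webpage names alongside the verdict corresponds to the text's remark that the name of the similar phishing webpage, or of the counterfeited brand webpage, can also be determined.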
In the embodiments of the present invention, after it is determined that the unique domain name of the webpage to be detected is not a trusted domain name, the similarity between the content features of the webpage to be detected and each template file in the template file library is used to determine whether the webpage is a phishing webpage. A brand template file saves the content features of a brand webpage; when the unique domain name of the webpage to be detected is not a trusted domain name and its content features are highly similar to those of a brand webpage, the webpage is determined to be a phishing webpage that counterfeits the brand webpage. A template file saves the content features of a phishing webpage or of a brand webpage; when the content features of the webpage to be detected are highly similar to those in a phishing template file, the webpage is determined to be a phishing webpage that does not counterfeit a brand webpage. Because phishing webpages are usually generated by automated programs or directly counterfeit brand webpages, and the content features of most phishing webpages are largely similar, the content features reflect the characteristics of phishing webpages. Determining whether a webpage is a phishing webpage by its content features can therefore improve the accuracy of the detection result. In addition, because the continuously updated trusted domain name library is first used to determine whether the webpage to be detected is a trusted webpage, the probability of misjudging a brand webpage as a phishing webpage is reduced.
FIG. 2 is a flowchart of Embodiment 2 of the phishing web page detection method provided by the present invention. This embodiment mainly describes how the content features of the web page to be detected are matched against the phishing template files in the phishing template library. As shown in FIG. 2, this embodiment includes:
Step 20: Extract content features from the web page to be detected.
Before step 20, the unique domain name of the web page to be detected is first looked up in the trusted domain name library. Because the trusted domain name library stores trusted unique domain names, the web page to be detected is determined to be a trusted web page when its unique domain name is present in the library. If the unique domain name is not present in the trusted domain name library, step 20 is performed, and the content features of the web page are used to determine whether it is a phishing web page.
Step 21: Determine whether the phishing template library contains a phishing template file that has not yet been matched against the web page to be detected. If so, perform step 22; otherwise, end.
If brand template files in the brand template library are matched against the web page to be detected, step 21 may instead be: determine whether the brand template library contains a brand template file that has not yet been matched against the web page to be detected.
Step 22: Read, in order, from the phishing template library a phishing template file that has not yet been matched against the web page to be detected.
When the phishing template library is built, storing phishing template files with similar content features is avoided as follows. After content features are extracted from a phishing web page, they are matched against the content features in each phishing template file in the phishing template library to determine the similarity between the extracted content features and each phishing template file, and the similarity values decide whether the content features are written into the phishing template library as a phishing template file. When the similarity between the content features extracted from the phishing web page and every phishing template file is less than a preset similarity threshold, the extracted content features are formed into a phishing template file and written into the phishing template library.
Similarly, when the brand template library is built, storing brand template files with the same content features is avoided as follows. After content features are extracted from a brand web page, they are matched against the content features in each brand template file in the brand template library to determine the similarity between the extracted content features and each brand template file, and the similarity values decide whether the content features are written into the brand template library as a brand template file. When the similarity between the content features extracted from the brand web page and every brand template file is less than the preset similarity threshold, the extracted content features are formed into a brand template file and written into the brand template library.
Step 23: Determine whether the encoding format of the web page to be detected is the same as the encoding format in the current phishing template file. If not, return to step 21; if so, perform step 24.
Step 24: When the encoding format of the web page to be detected is the same as the encoding format in the current phishing template file, determine whether the absolute value of the difference between the word count extracted from the web page to be detected and the word count in the current phishing template file is within the preset word-count range. If not, return to step 21; if so, perform step 25.
When the absolute value of the difference between the word count extracted from the web page to be detected and the word count in the current phishing template file is within the preset word-count range, the two word counts are close, so the web page to be detected may be a phishing web page, and further checks are needed to decide whether it is one. The preset word-count range makes it possible to determine whether the two word counts are of the same order of magnitude; if they differ greatly, the web page to be detected is considered dissimilar to the current phishing template file. The preset word-count range may be set according to the word count of the web page to be detected.
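The word-count check of step 24 can be sketched in Python as follows. The text only says that the preset range may be set according to the word count of the web page to be detected; expressing it as a fixed fraction of that count, and the name `tolerance_ratio` with its default of 0.2, are assumptions made purely for illustration.

```python
def word_count_in_range(page_count: int, template_count: int,
                        tolerance_ratio: float = 0.2) -> bool:
    """Step 24 pre-filter: is |page_count - template_count| inside the preset range?

    tolerance_ratio is a hypothetical parameter; the source does not say how
    the range is derived from the page's word count.
    """
    allowed_difference = tolerance_ratio * page_count
    return abs(page_count - template_count) <= allowed_difference
```

A template whose word count falls outside the range is skipped without computing the more expensive lexical and document object model similarities.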
Step 25: When the word count extracted from the web page to be detected is within the preset word-count range, determine whether the lexical similarity between the words extracted from the web page to be detected and the words in the current phishing template file lies between the low lexical-similarity preset value and the high lexical-similarity preset value. If it does, perform step 26. If it does not, perform step 27 when the lexical similarity is greater than the high lexical-similarity preset value, and return to step 21 when the lexical similarity is less than the low lexical-similarity preset value.
Lexical similarity is a measure of how many words the web page to be detected has in common with a given phishing template file. In general it can be expressed as a formula. For example, if the web page to be detected has m words, a given phishing template file has n words, and the two have s words in common, the lexical similarity can be expressed as a percentage value: [2 × s / (m + n)] × 100. When this value is above a certain threshold, the words of the web page to be detected are considered highly similar to those of the phishing template file.
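The lexical-similarity formula [2 × s / (m + n)] × 100 can be written as a short Python function. This is a minimal sketch: treating each page's vocabulary as a set of distinct words is an assumption, since the text does not say whether repeated words are counted.

```python
def lexical_similarity(page_words, template_words) -> float:
    """Return [2 * s / (m + n)] * 100 for two collections of words.

    m and n are the numbers of distinct words on each side and s is the
    number of words they share (distinct-word counting is an assumption).
    """
    page_set = set(page_words)
    template_set = set(template_words)
    m, n = len(page_set), len(template_set)
    if m + n == 0:
        return 0.0  # two empty vocabularies: define the similarity as 0
    s = len(page_set & template_set)
    return 2 * s / (m + n) * 100
```

For two four-word vocabularies that share three words, the function gives 2 × 3 / 8 × 100 = 75.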
When the lexical similarity is greater than the high lexical-similarity preset value, the web page to be detected shares many words with the phishing template file. Because the web page corresponding to the current phishing template file is a phishing web page, the web page to be detected can be determined to be a phishing web page. If the current template file is a brand template file whose corresponding web page is a brand web page, the web page to be detected can likewise be determined to be a phishing web page, because it was already established, before its content features were extracted, that its unique domain name is not in the trusted domain name library.
When the lexical similarity is less than the low lexical-similarity preset value, the web page to be detected shares few words with the template file, and it can be determined that the web page to be detected is not a phishing web page corresponding to that template file.
Step 26: When the lexical similarity lies between the low lexical-similarity preset value and the high lexical-similarity preset value, determine whether the model similarity between the document object model extracted from the web page to be detected and the document object model in the current phishing template file is greater than the model-similarity preset value. If so, perform step 27; otherwise, return to step 21.
A model similarity greater than the model-similarity preset value indicates that the document object model extracted from the web page to be detected and the document object model in the current phishing template file are highly similar. The model similarity may be expressed as a percentage or as a value from 0 to 100. When it is expressed as a percentage, the model-similarity preset value may be 80%. When it is expressed as a value from 0 to 100, the model-similarity preset value may be 50.
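The text does not state how the model similarity between two document object models is calculated. One possible sketch, offered only as an assumption, reduces each page to its sequence of opening tags and compares the two sequences with Python's standard `difflib.SequenceMatcher`, scaling the result to 0 to 100. The helper names below are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def dom_tag_sequence(html):
    """Flatten an HTML document into its list of opening tag names."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def model_similarity(page_html, template_html):
    """Assumed DOM similarity on a 0-100 scale (not the patent's own metric)."""
    page_tags = dom_tag_sequence(page_html)
    template_tags = dom_tag_sequence(template_html)
    return SequenceMatcher(None, page_tags, template_tags).ratio() * 100
```

With this stand-in metric, step 26 would amount to comparing `model_similarity(page, template)` against the chosen model-similarity preset value.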
Step 27: When the model similarity is greater than the model-similarity preset value (or when step 25 found the lexical similarity to be greater than the high lexical-similarity preset value), determine that the web page to be detected is a phishing web page, and output the name of the phishing web page corresponding to the phishing template file. Return to step 21.
After the web page to be detected has been determined to be a phishing web page, matching continues with the subsequent template files so that, among the template files that reach the model-similarity preset value, the one with the highest model similarity can be found and the name of the phishing web page corresponding to that template file can be output.
If a brand template file from the brand template library was read in step 22, the name of the brand web page corresponding to that brand template file is output in step 27.
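The loop of steps 21 to 27 can be sketched in Python as follows. All threshold values, the `Features` record and the tag-sequence model similarity are assumptions chosen for illustration; the text fixes none of them. Ranking every matching template by model similarity is one reading of the "highest similarity" rule.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical preset values; the text does not fix them.
COUNT_TOLERANCE = 0.2                  # word-count range as a fraction of the page count
LEXICAL_LOW, LEXICAL_HIGH = 40.0, 90.0
MODEL_THRESHOLD = 80.0


@dataclass(frozen=True)
class Features:
    encoding: str
    words: frozenset
    word_count: int
    dom_tags: tuple                    # assumed DOM representation: opening tags in order
    name: str = ""                     # page name, set only on template files


def lexical_similarity(a, b):
    total = len(a.words) + len(b.words)
    return 2 * len(a.words & b.words) / total * 100 if total else 0.0


def model_similarity(a, b):
    return SequenceMatcher(None, a.dom_tags, b.dom_tags).ratio() * 100


def match_template(page, template):
    """Return the model similarity if the page matches the template, else None."""
    if page.encoding.lower() != template.encoding.lower():            # step 23
        return None
    if abs(page.word_count - template.word_count) > COUNT_TOLERANCE * page.word_count:
        return None                                                   # step 24
    lexical = lexical_similarity(page, template)                      # step 25
    if lexical < LEXICAL_LOW:
        return None
    model = model_similarity(page, template)
    if lexical > LEXICAL_HIGH or model > MODEL_THRESHOLD:             # steps 26-27
        return model
    return None


def detect(page, templates):
    """Try every template (step 21) and return the best-matching name, or None."""
    best_name, best_score = None, -1.0
    for template in templates:                                        # step 22
        score = match_template(page, template)
        if score is not None and score > best_score:
            best_name, best_score = template.name, score
    return best_name
```

A `None` result means no template matched; any other result is the page name stored with the most similar template file.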
It should be noted that a phishing template may contain only some of the content features among the encoding format, the word count, the lexical similarity and the document object model similarity; these features may be combined flexibly, and the order in which the similarity decisions are made may also be adjusted. For example:
Alternative 1:
Step 23 is omitted. After step 22 reads, in order, from the phishing template library a phishing template file that has not yet been matched against the web page to be detected, the method proceeds directly to step 24 and determines whether the absolute value of the difference between the word count extracted from the web page to be detected and the word count in the current template file is within the preset word-count range. If not, return to step 21; if so, perform step 25.
Alternative 2:
The word-count and lexical-similarity decisions of steps 24 and 25 are performed first. When the word count and lexical similarity do not suffice to determine that the page is a phishing web page, the encoding-format check of step 23 is then performed: if the encoding formats are the same, the page is a phishing web page; otherwise it is not.
The various alternatives are not listed one by one here.
In this embodiment of the present invention, the content features extracted from the web page to be detected (its encoding format, words, word count and DOM) are matched against the content features stored in each phishing template file in the phishing template library. When the encoding format is the same as that of the currently matched phishing template file, the web page to be detected is determined to be a phishing web page, and matching continues with the next phishing template file. When the encoding formats differ, the word count is matched against the word count in the current phishing template file; when the two word counts are close, the web page to be detected is determined to be a phishing web page, and otherwise lexical-similarity matching is performed against that phishing template file. When the lexical similarity reaches the lexical-similarity preset value, the web page to be detected is determined to be a phishing web page, and matching continues with the next phishing template file; otherwise model-similarity matching is performed against the DOM of that phishing template file, and when the model similarity reaches the model-similarity preset value, the web page to be detected is determined to be a phishing web page. When the web page to be detected is determined to be a phishing web page, the web page name corresponding to the currently matched phishing template file is also output. In addition, the content features of the web page to be detected may be matched against each template file in the brand template library. When the web page to be detected is determined to be a phishing web page, the name of the web page corresponding to the template file may also be output, that is, the name of the brand web page that the web page to be detected counterfeits.
FIG. 3 is a flowchart of Embodiment 3 of the phishing web page detection method provided by the present invention. This embodiment mainly describes how brand template files in the brand template library are created. Phishing template files in the phishing template library are created in a similar way; the only difference is that phishing template files store the content features of known phishing web pages, whereas brand template files store the content features of known brand web pages. As shown in FIG. 3, this embodiment includes:
Step 30: Determine whether the brand URL list still contains an unprocessed URL. If so, perform step 31; otherwise, end.
Step 31: Read, in order, an unprocessed URL from the brand URL list.
Step 32: Download the corresponding web page according to the URL that was read.
Step 33: Extract content features from the downloaded web page: its encoding format, words, word count and DOM.
Step 34: Determine whether the brand template library contains a brand template file that has not yet been matched, that is, one that has not yet been matched against the content features extracted from the downloaded web page. If such a brand template file exists, perform step 35; otherwise, perform step 37.
Step 35: Read, in order, from the brand template library a brand template file that has not yet been matched.
Step 36: Determine whether the similarity between the content features of the downloaded web page and the content features of the current brand template file is less than the preset similarity threshold. If it is less than the preset similarity threshold, the downloaded web page is determined to be dissimilar to the current brand template file, and the method returns to step 34 to continue matching against the subsequent brand template files. If it is greater than the preset similarity threshold, the downloaded web page is determined to be similar to the current brand template file, so its content features do not need to be stored in the brand template library, and the method returns to step 30 to process the downloaded web page corresponding to the next URL.
Step 37: Write the content features of the downloaded web page into the brand template library as a brand template file. Return to step 30.
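Steps 30 to 37 amount to building a library in which no two entries are similar. A minimal Python sketch follows, assuming that downloading and feature extraction have already produced one word set per URL and that a word-overlap measure stands in for the overall content-feature similarity. The threshold of 80 is hypothetical, and treating a similarity exactly equal to the threshold as "similar" is an assumption, since the text only covers the less-than and greater-than cases.

```python
def word_overlap(a, b):
    """Stand-in content-feature similarity on a 0-100 scale (an assumption)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total * 100 if total else 0.0


def build_template_library(candidates, similarity=word_overlap, threshold=80.0):
    """Keep a candidate only if it is dissimilar to every stored template."""
    library = []
    for features in candidates:                      # steps 30-33: next page's features
        is_new = all(similarity(features, stored) < threshold
                     for stored in library)          # steps 34-36: compare with each file
        if is_new:
            library.append(features)                 # step 37: write a new template file
    return library
```

Feeding the same page twice therefore adds only one template, which is the de-duplication effect the steps are meant to achieve.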
When the brand template library is built in this embodiment of the present invention, the content features of the downloaded web page are matched against the existing brand template files in the brand template library. Only when the brand template library contains no brand template file similar to the content features of the downloaded web page (that is, the downloaded web page is dissimilar to all brand template files) is the downloaded web page stored in the brand template library as a brand template file. This avoids repeatedly storing brand template files for multiple similar web pages in the brand template library.
FIG. 4A is a schematic structural diagram of Embodiment 1 of the phishing web page detection device provided by the present invention. As shown in FIG. 4A, this embodiment includes: a trusted domain name library 40, a domain name determining module 41, a content extraction module 42, a similarity determining module 43, a phishing web page determining module 44 and a template file library 45.
The trusted domain name library 40 is configured to store trusted unique domain names. The template file library 45 is configured to store a plurality of template files, each including content features extracted from a web page; the content features include at least the encoding format, document object model, words and word count of the web page. Specifically, the template file library includes a phishing template library and a brand template library. The phishing template library is configured to store template files that include content features extracted from phishing web pages. The brand template library is configured to store template files that include content features extracted from brand web pages.
The domain name determining module 41 is configured to determine whether the unique domain name corresponding to the web page to be detected is present in the trusted domain name library 40. The content extraction module 42 is configured to extract content features from the web page to be detected when the domain name determining module 41 determines that the unique domain name is not present in the trusted domain name library.
The similarity determining module 43 is configured to determine the similarity between the content features extracted by the content extraction module 42 from the web page to be detected and the content features in each template file of the template file library 45.
The phishing web page determining module 44 is configured to determine that the web page to be detected is a phishing web page when the content features extracted from the web page to be detected are similar to the content features of at least one template file.
Because the phishing web page detection device of this embodiment of the present invention detects web pages without requiring the cooperation of a remote device, it can be deployed at any network node and supports high-traffic detection. For example, it can be deployed on network traffic monitoring devices, firewall devices and routers. FIG. 4B is a schematic diagram of one application scenario of the phishing web page detection device provided by the present invention. As shown in FIG. 4B, the phishing web page detection device obtains the URL of the web page to be detected from a network traffic monitoring device, downloads the web page from the network according to the URL, performs detection, and outputs the detection result to other devices. FIG. 4C is a schematic diagram of another application scenario of the phishing web page detection device provided by the present invention. As shown in FIG. 4C, the phishing web page detection device obtains HTTP data packets directly from a network traffic monitoring device, performs phishing web page detection on them, and outputs the detection result to other devices.
Further, as shown in FIG. 5, this embodiment also includes a web page name output module 46, configured to determine the template files that are similar to the web page to be detected and to output the names of the phishing web pages corresponding to those template files, or the names of the corresponding counterfeited brand web pages.
For the working mechanism of the above modules, refer to the description of the embodiment corresponding to FIG. 1; details are not repeated here.
In the phishing detection device of this embodiment of the present invention, when a web page is to be detected, the domain name determining module 41 looks up the unique domain name corresponding to the web page in the locally stored trusted domain name library. When the unique domain name is not present in the trusted domain name library, the similarity determining module 43 matches the content features of the web page to be detected against the locally stored template files to determine the similarity. Because phishing web pages are usually generated by automated programs or directly counterfeit brand web pages, their content features are largely similar and reflect the characteristics of phishing web pages. Determining whether a web page is a phishing web page from its content features therefore improves the accuracy of the detection result. In addition, because the present invention first uses a continuously updated trusted domain name library to determine whether the web page to be detected is a trusted web page, the probability of misjudging a brand web page as a phishing web page is reduced.
FIG. 6 is a schematic structural diagram of the similarity determining module in FIG. 4 or FIG. 5. As shown in FIG. 6, the similarity determining module 43 includes: a reading unit 431, an encoding format determining unit 432, a word count determining unit 433, a word determining unit 434 and an object model determining unit 435.
The reading unit 431 is configured to read a template file from the phishing template library or the brand template library.
The encoding format determining unit 432 is configured to determine whether the encoding format extracted from the web page to be detected is the same as the encoding format in the template file.
The word count determining unit 433 is configured to determine, when the encoding format determining unit 432 determines that the encoding formats are the same, whether the word count extracted from the web page to be detected is within the preset word-count range corresponding to the word count in the template file.
The word determining unit 434 is configured to determine, when the word count determining unit 433 determines that the word count is within the preset word-count range, whether the lexical similarity between the words extracted from the web page to be detected and the words in the template file lies between the low lexical-similarity preset value and the high lexical-similarity preset value.
The object model determining unit 435 is configured to determine, when the word determining unit 434 determines that the lexical similarity lies between the low lexical-similarity preset value and the high lexical-similarity preset value, the model similarity between the document object model extracted from the web page to be detected and the document object model in the template file, and to determine whether the model similarity is greater than the model-similarity preset value.
The phishing web page determining module 44 is specifically configured to determine that the web page to be detected is a phishing web page when the object model determining unit 435 determines that the model similarity is greater than the model-similarity preset value, or when the word determining unit 434 determines that the lexical similarity is greater than the high lexical-similarity preset value.
For the working mechanism of the above modules, refer to the description of the embodiment corresponding to FIG. 2; details are not repeated here.
In this embodiment of the present invention, the content features extracted from the web page to be detected (its encoding format, words, word count and DOM) are matched against the content features stored in each template file in the phishing template library, yielding a plurality of similarities. As long as one of the similarities is greater than the preset similarity threshold, the web page to be detected is determined to be a phishing web page, and the web page name corresponding to the template file whose similarity is greater than the preset similarity threshold can also be determined, thereby identifying the phishing web page that the web page to be detected resembles. In addition, the content features of the web page to be detected may be matched against each template file in the brand template library. When a template file whose similarity is greater than the preset similarity threshold is found in the brand template library, the web page to be detected is determined to be a phishing web page, and the name of the web page corresponding to that template file may also be output, that is, the name of the brand web page that the web page to be detected counterfeits.
FIG. 7 is a schematic structural diagram of Embodiment 3 of the phishing web page detection device provided by the present invention. As shown in FIG. 7, in addition to what is shown in FIG. 5, the device further includes: a phishing template library building module 47, a brand template library building module 48 and a trusted domain name library building module 49.
The phishing template library building module 47 is configured to match the content features extracted from a phishing web page against the content features in each template file in the phishing template library, and to determine the similarity between the extracted content features and each template file; when the similarity between the content features extracted from the phishing web page and every template file is less than the preset similarity threshold, the extracted content features are formed into a template file and written into the phishing template library.
The brand template library building module 48 is configured to match the content features extracted from a brand web page against the content features in each template file in the brand template library, and to determine the similarity between the extracted content features and each template file; when the similarity between the content features extracted from the brand web page and every template file is less than the preset similarity threshold, the extracted content features are formed into a template file and written into the brand template library.
The trusted domain name library building module 49 is configured to: if the top-level domain in a URL is not a country-code top-level domain, extract the second-level domain name from the URL and write it into the trusted domain name library; if the top-level domain in the URL is a country-code domain and the second-level label is a top-level-domain string, extract the third-level domain name from the URL and write it into the trusted domain name library.
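The domain-extraction rule of the trusted domain name library building module 49 can be sketched in Python as follows. The set `GENERIC_LABELS` is a small hypothetical sample of "top-level-domain strings", country-code domains are recognised simply as two-letter labels, and the fallback for a country-code domain whose second-level label is not such a string is an assumption, because the text does not cover that case. Hosts given as IP addresses are not handled.

```python
from urllib.parse import urlsplit

# Hypothetical sample of labels that are themselves top-level-domain strings.
GENERIC_LABELS = {"com", "net", "org", "edu", "gov", "mil", "int"}


def unique_domain(url):
    """Return the unique domain name of a URL following the rule of module 49."""
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return None
    is_country_code = len(labels[-1]) == 2 and labels[-1].isalpha()
    if is_country_code and len(labels) >= 3 and labels[-2] in GENERIC_LABELS:
        return ".".join(labels[-3:])   # e.g. example.com.cn (third-level domain name)
    # Non-country top-level domain: second-level domain name. The same result is
    # used for other country-code hosts, which is an assumption of this sketch.
    return ".".join(labels[-2:])


def is_trusted(url, trusted_domains):
    """Look up the URL's unique domain name in the trusted domain name library."""
    return unique_domain(url) in trusted_domains
```

The same `unique_domain` function would be applied both when the library is built and when a web page to be detected is looked up, so that the two sides compare like with like.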
For the working mechanism of each of the above modules, refer to the description of the embodiment corresponding to FIG. 3; details are not repeated here.
When the brand template library is established in this embodiment of the present invention, the content features of a downloaded webpage are matched against the template files already in the brand template library. The downloaded webpage is stored in the brand template library as a template file only when the library contains no template file similar to the content features of that webpage, which avoids repeatedly saving template files for multiple similar webpages in the brand template library.
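By way of non-limiting illustration, the rule of adding a template only when no similar template already exists may be sketched as follows. The names `add_if_novel` and `jaccard` are hypothetical, and the Jaccard index over word sets is merely a stand-in for whatever similarity measure an embodiment uses.

```python
from typing import Callable, FrozenSet, List

Words = FrozenSet[str]


def jaccard(a: Words, b: Words) -> float:
    """Stand-in similarity: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def add_if_novel(library: List[Words],
                 features: Words,
                 threshold: float,
                 similarity: Callable[[Words, Words], float] = jaccard) -> bool:
    """Append `features` as a new template only if every existing template
    is less similar than `threshold`; report whether it was added."""
    if all(similarity(features, t) < threshold for t in library):
        library.append(features)
        return True
    return False
```

An empty library accepts its first template unconditionally, because there is no existing template to compare against.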
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A phishing webpage detection method, comprising:
determining whether a unique domain name corresponding to a webpage to be detected exists in a trusted domain name library;
when the unique domain name does not exist in the trusted domain name library, separately determining the similarity between content features extracted from the webpage to be detected and the content features in each template file of a template file library, wherein the content features comprise at least: an encoding format, a document object model, vocabulary, and a vocabulary quantity; and
when the similarity between the content features extracted from the webpage to be detected and the content features in at least one of the template files is greater than a preset similarity threshold, determining that the webpage to be detected is a phishing webpage.
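By way of non-limiting illustration, the two-stage flow of claim 1 may be sketched as follows. The names `is_phishing` and `word_overlap` are hypothetical, the content features are reduced to a word set for brevity, and treating a trusted domain as "not phishing" is an assumption of this sketch.

```python
from typing import Callable, FrozenSet, Iterable, Set

Words = FrozenSet[str]


def word_overlap(a: Words, b: Words) -> float:
    """Stand-in similarity over word sets; not the claimed measure."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_phishing(domain: str,
                trusted_domains: Set[str],
                page_features: Words,
                templates: Iterable[Words],
                threshold: float,
                similarity: Callable[[Words, Words], float] = word_overlap) -> bool:
    # Stage 1: a page whose unique domain name is trusted is not compared
    # against the templates (assumed here to mean "not phishing").
    if domain in trusted_domains:
        return False
    # Stage 2: flag the page if at least one template exceeds the threshold.
    return any(similarity(page_features, t) > threshold for t in templates)
```

The trusted-domain lookup comes first because it is a cheap set membership test, so the costlier content comparison runs only for unknown domains.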
2. The phishing webpage detection method according to claim 1, wherein the separately determining the similarity between the content features extracted from the webpage to be detected and the content features in each template file of the template file library comprises:
reading a template file from the template file library, and determining whether the encoding format extracted from the webpage to be detected is the same as the encoding format in the template file;
when the encoding format extracted from the webpage to be detected is the same as the encoding format in the template file, determining whether the absolute value of the difference between the vocabulary quantity extracted from the webpage to be detected and the vocabulary quantity in the template file is within a preset quantity similarity range;
when the vocabulary quantity is within the preset quantity similarity range, determining whether the vocabulary similarity between the vocabulary extracted from the webpage to be detected and the vocabulary in the template file lies between a vocabulary similarity high preset value and a vocabulary similarity low preset value;
when the vocabulary similarity lies between the vocabulary similarity high preset value and the vocabulary similarity low preset value, calculating the model similarity between the document object model extracted from the webpage to be detected and the document object model in the template file; and
when the model similarity is greater than a model similarity preset value, or when the vocabulary similarity is higher than the vocabulary similarity high preset value, determining that the webpage to be detected is a phishing webpage; and reading the next template file from the phishing template library or the brand template library and repeating the above steps until the template file with the highest similarity is found, according to model similarity, from among the multiple template files that reach the model similarity preset value.
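By way of non-limiting illustration, the staged comparison of claim 2 against a single template file may be sketched as follows. The `Features` structure, the representation of the document object model as a flat tuple of tag names, the Jaccard vocabulary measure, and the use of `difflib.SequenceMatcher` for model similarity are all assumptions of this sketch. The handling of the boundary values and of a vocabulary similarity below the low preset value (treated as no match) is likewise assumed.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Features:
    encoding: str
    words: FrozenSet[str]
    word_count: int
    dom_tags: Tuple[str, ...]  # hypothetical flattened DOM: tag names in order


def vocab_similarity(a: Features, b: Features) -> float:
    union = a.words | b.words
    return len(a.words & b.words) / len(union) if union else 1.0


def dom_similarity(a: Features, b: Features) -> float:
    return SequenceMatcher(None, a.dom_tags, b.dom_tags).ratio()


def matches_template(page: Features, tpl: Features, count_range: int,
                     vocab_low: float, vocab_high: float,
                     model_threshold: float) -> bool:
    # Stage 1: the encoding formats must be identical.
    if page.encoding != tpl.encoding:
        return False
    # Stage 2: the vocabulary quantities must be close enough.
    if abs(page.word_count - tpl.word_count) > count_range:
        return False
    # Stage 3: a very high vocabulary similarity decides on its own.
    v = vocab_similarity(page, tpl)
    if v > vocab_high:
        return True
    if v < vocab_low:
        return False  # assumed: too dissimilar to warrant a DOM comparison
    # Stage 4: in the middle band, the DOM similarity decides.
    return dom_similarity(page, tpl) > model_threshold
```

The cheaper checks run first, so the comparatively expensive DOM comparison is reached only for pages in the ambiguous middle band of vocabulary similarity.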
3. The phishing webpage detection method according to claim 1 or 2, wherein the trusted domain name library is used to store trusted unique domain names of webpages to be detected, and the template file library is a brand template library or a phishing template library; the template files in the phishing template library comprise content features extracted from phishing webpages, and the template files in the brand template library comprise content features extracted from brand webpages.
4. The phishing webpage detection method according to claim 1 or 2, wherein, after the determining that the webpage to be detected is a phishing webpage, the method further comprises: when a template file [text missing in the source] the similarity threshold is determined, outputting the name of the phishing webpage corresponding to the template file, or the name of the corresponding counterfeited brand webpage.
5. The phishing webpage detection method according to claim 1, wherein, before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name library, the method further comprises:
matching content features extracted from a phishing webpage against the content features in each template file in a phishing template library, and determining the similarity between the content features extracted from the phishing webpage and each of the template files; and
when the similarity between the content features extracted from the phishing webpage and each of the template files is less than the preset similarity threshold, forming the content features extracted from the phishing webpage into a template file and writing it into the phishing template library.
6. The phishing webpage detection method according to claim 1, wherein, before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name library, the method further comprises:
matching content features extracted from a brand webpage against the content features in each template file in a brand template library, and determining the similarity between the content features extracted from the brand webpage and each of the template files; and
when the similarity between the content features extracted from the brand webpage and each of the template files is less than the model similarity preset value, forming the content features extracted from the brand webpage into a template file and writing it into the brand template library.
7. The phishing webpage detection method according to claim 5 or 6, wherein, before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name library, the method further comprises:
when the top-level domain in a collected uniform resource locator is not a country-code top-level domain, extracting the second-level domain from the uniform resource locator and writing it into the trusted domain name library; and
when the top-level domain in the collected uniform resource locator is a country-code domain and the second-level domain is a top-level domain string, extracting the third-level domain from the uniform resource locator and writing it into the trusted domain name library.
8. A phishing webpage detection device, comprising:
a trusted domain name library, configured to store unique domain names corresponding to trusted webpages;
a template file library, configured to store multiple template files, wherein each template file comprises content features extracted from a webpage, and the content features comprise at least: an encoding format, a document object model, vocabulary, and a vocabulary quantity;
a domain name determining module, configured to determine whether a unique domain name corresponding to a webpage to be detected exists in the trusted domain name library;
a content extraction module, configured to extract content features from the webpage to be detected when the unique domain name does not exist in the trusted domain name library;
a similarity determining module, configured to separately determine the similarity between the content features extracted from the webpage to be detected and the content features in each of the template files of the template file library; and
a phishing webpage determining module, configured to, when the content features extracted from the webpage to be detected at least [text missing in the source], determine that the webpage to be detected is a phishing webpage.
9. The phishing webpage detection device according to claim 8, further comprising: a webpage name output module, configured to determine, with respect to the content features extracted from the webpage to be detected, [text missing in the source] the name of the phishing webpage corresponding to the file, or the name of the corresponding counterfeited brand webpage.
10. The phishing webpage detection device according to claim 9, wherein the similarity determining module comprises:
a reading unit, configured to read a template file from the phishing template library or the brand template library;
[text missing in the source] is the same as the encoding format in the template file;
[text missing in the source] when the encoding format in the template file is the same, determine whether the absolute value of the difference between the vocabulary quantity extracted from the webpage to be detected and the vocabulary quantity in the template file is within a preset quantity similarity range;
a vocabulary determining unit, configured to, when the absolute value of the difference between the vocabulary quantity extracted from the webpage to be detected and the vocabulary quantity in the template file is within the preset quantity similarity range, determine whether the vocabulary similarity between the vocabulary extracted from the webpage to be detected and the vocabulary in the template file lies between a vocabulary similarity high preset value and a vocabulary similarity low preset value; and
an object model determining unit, configured to, when the vocabulary similarity lies between the vocabulary similarity high preset value and the vocabulary similarity low preset value, determine the model similarity between the document object model extracted from the webpage to be detected and the document object model in the template file, and determine whether the model similarity is greater than the model similarity preset value;
wherein the phishing webpage determining module is specifically configured to determine that the webpage to be detected is a phishing webpage when the model similarity is greater than the model similarity preset value or when the vocabulary similarity is higher than the vocabulary similarity high preset value.
11. The phishing webpage detection device according to claim 10, wherein the template file library comprises:
a phishing template library, configured to store template files comprising content features extracted from phishing webpages; and
a brand template library, configured to store template files comprising content features extracted from brand webpages.
12. The phishing webpage detection device according to claim 11, further comprising:
a phishing template library establishing module, configured to match content features extracted from a phishing webpage against the content features in each template file in the phishing template library, and determine the similarity between the content features extracted from the phishing webpage and each of the template files; and, when the similarity between the content features extracted from the phishing webpage and each of the template files is less than the preset similarity threshold, form the content features extracted from the phishing webpage into a template file and write it into the phishing template library; and
a brand template library establishing module, configured to match content features extracted from a brand webpage against the content features in each template file in the brand template library, and determine the similarity between the content features extracted from the brand webpage and each of the template files; and, when the similarity between the content features extracted from the brand webpage and each of the template files is less than the preset similarity threshold, form the content features extracted from the brand webpage into a template file and write it into the brand template library.
13. The phishing webpage detection device according to claim 12, further comprising: a trusted domain name library establishing module, configured to, when the top-level domain in a collected uniform resource locator is not a country-code top-level domain, extract the second-level domain from the uniform resource locator and write it into the trusted domain name library; and, when the top-level domain in the collected uniform resource locator is a country-code domain and the second-level domain is a top-level domain string, extract the third-level domain from the uniform resource locator and write it into the trusted domain name library.
14. A phishing webpage detection method, comprising:
determining whether a unique domain name corresponding to a webpage to be detected exists in a trusted domain name library;
when the unique domain name does not exist in the trusted domain name library, separately determining the similarity between content features extracted from the webpage to be detected and the content features in each template file of a template file library, wherein the content features comprise at least: vocabulary, a vocabulary quantity, and a document object model; and
when the similarity between the content features extracted from the webpage to be detected and the content features in at least one of the template files is greater than a preset similarity threshold, determining that the webpage to be detected is a phishing webpage.
15. The phishing webpage detection method according to claim 14, wherein the separately determining the similarity between the content features extracted from the webpage to be detected and the content features in each template file of the template file library comprises:
reading a template file from the template file library, and determining whether the absolute value of the difference between the vocabulary quantity extracted from the webpage to be detected and the vocabulary quantity in the template file is within a preset quantity similarity range;
when the vocabulary quantity is within the preset quantity similarity range, determining whether the vocabulary similarity between the vocabulary extracted from the webpage to be detected and the vocabulary in the template file lies between a vocabulary similarity high preset value and a vocabulary similarity low preset value;
when the vocabulary similarity lies between the vocabulary similarity high preset value and the vocabulary similarity low preset value, calculating the model similarity between the document object model extracted from the webpage to be detected and the document object model in the template file; and
when the model similarity is greater than a model similarity preset value, or when the vocabulary similarity is higher than the vocabulary similarity high preset value, determining that the webpage to be detected is a phishing webpage; and reading the next template file from the phishing template library or the brand template library and repeating the above steps until the template file with the highest similarity is found, according to model similarity, from among the multiple template files that reach the model similarity preset value.
16. The phishing webpage detection method according to claim 14 or 15, wherein the trusted domain name library is used to store trusted unique domain names of webpages to be detected, and the template file library is a brand template library or a phishing template library; the template files in the phishing template library comprise content features extracted from phishing webpages, and the template files in the brand template library comprise content features extracted from brand webpages.
17. The phishing webpage detection method according to claim 14 or 15, wherein, after the determining that the webpage to be detected is a phishing webpage, the method further comprises: when a template file [text missing in the source] the similarity threshold is determined, outputting the name of the phishing webpage corresponding to the template file, or the name of the corresponding counterfeited brand webpage.
18. The phishing webpage detection method according to claim 14 or 15, wherein, before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name library, the method further comprises:
matching content features extracted from a phishing webpage against the content features in each template file in a phishing template library, and determining the similarity between the content features extracted from the phishing webpage and each of the template files; and
when the similarity between the content features extracted from the phishing webpage and each of the template files is less than the preset similarity threshold, forming the content features extracted from the phishing webpage into a template file and writing it into the phishing template library.
19. The phishing webpage detection method according to claim 14, wherein, before the determining whether the unique domain name corresponding to the webpage to be detected exists in the trusted domain name library, the method further comprises:
matching content features extracted from a brand webpage against the content features in each template file in a brand template library, and determining the similarity between the content features extracted from the brand webpage and each of the template files; and
when the similarity between the content features extracted from the brand webpage and each of the template files is less than the model similarity preset value, forming the content features extracted from the brand webpage into a template file and writing it into the brand template library.
PCT/CN2011/083745 2010-12-31 2011-12-09 Method and apparatus for phishing web page detection WO2012089005A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/689,230 US9218482B2 (en) 2010-12-31 2012-11-29 Method and device for detecting phishing web page

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2010106206476A CN102082792A (en) 2010-12-31 2010-12-31 Phishing webpage detection method and device
CN201010620647.6 2010-12-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/689,230 Continuation US9218482B2 (en) 2010-12-31 2012-11-29 Method and device for detecting phishing web page

Publications (1)

Publication Number Publication Date
WO2012089005A1 true WO2012089005A1 (en) 2012-07-05

Family

ID=44088544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083745 WO2012089005A1 (en) 2010-12-31 2011-12-09 Method and apparatus for phishing web page detection

Country Status (3)

Country Link
US (1) US9218482B2 (en)
CN (1) CN102082792A (en)
WO (1) WO2012089005A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156490A (en) * 2014-09-01 2014-11-19 北京奇虎科技有限公司 Method and device for detecting suspicious fishing webpage based on character recognition
US9839002B2 (en) 2012-12-31 2017-12-05 Huawei Technologies Co., Ltd Mobility management method and device

Families Citing this family (109)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL209960A0 (en) * 2010-12-13 2011-02-28 Comitari Technologies Ltd Web element spoofing prevention system and method
CN102082792A (en) * 2010-12-31 2011-06-01 成都市华为赛门铁克科技有限公司 Phishing webpage detection method and device
US20120331551A1 (en) * 2011-06-24 2012-12-27 Koninklijke Kpn N.V. Detecting Phishing Attempt from Packets Marked by Network Nodes
CN103179095B (en) * 2011-12-22 2016-03-30 阿里巴巴集团控股有限公司 A kind of method and client terminal device detecting fishing website
CN102436563B (en) * 2011-12-30 2014-07-09 奇智软件(北京)有限公司 Method and device for detecting page tampering
CN102682237B (en) * 2012-03-08 2015-08-05 珠海市君天电子科技有限公司 Malicious method and system are sentenced for web download file
CN102622553A (en) * 2012-04-24 2012-08-01 腾讯科技(深圳)有限公司 Method and device for detecting webpage safety
CN102737183B (en) * 2012-06-12 2014-08-13 腾讯科技(深圳)有限公司 Method and device for webpage safety access
WO2014079257A1 (en) * 2012-11-20 2014-05-30 Gao Jianqing Exclusion of limited items on the basis of partial hash value
CN103580948A (en) * 2012-12-27 2014-02-12 哈尔滨安天科技股份有限公司 Method and device for detecting network based on structural-file index information
CN103077208B (en) * 2012-12-28 2016-01-27 华为技术有限公司 URL(uniform resource locator) matched processing method and device
US9398038B2 (en) * 2013-02-08 2016-07-19 PhishMe, Inc. Collaborative phishing attack detection
US8966637B2 (en) 2013-02-08 2015-02-24 PhishMe, Inc. Performance benchmarking for simulated phishing attacks
US9356948B2 (en) 2013-02-08 2016-05-31 PhishMe, Inc. Collaborative phishing attack detection
JP6015546B2 (en) * 2013-04-30 2016-10-26 キヤノンマーケティングジャパン株式会社 Information processing apparatus, information processing method, and program
CN103281320B (en) * 2013-05-23 2016-12-07 中国科学院计算机网络信息中心 Brand counterfeit website detection method based on Web page icon coupling
CN103455758A (en) * 2013-08-22 2013-12-18 北京奇虎科技有限公司 Method and device for identifying malicious website
CN103442014A (en) * 2013-09-03 2013-12-11 中国科学院信息工程研究所 Method and system for automatic detection of suspected counterfeit websites
CN104462152B (en) * 2013-09-23 2019-04-09 深圳市腾讯计算机系统有限公司 A kind of recognition methods of webpage and device
CN103501306B (en) * 2013-10-23 2016-09-14 腾讯科技(武汉)有限公司 A kind of network address knows method for distinguishing, server and system
CN104717185B (en) * 2013-12-16 2019-03-26 腾讯科技(北京)有限公司 Displaying response method, device, server and the system of short uniform resource locator
US11017426B1 (en) * 2013-12-20 2021-05-25 BloomReach Inc. Content performance analytics
CN103685308B (en) * 2013-12-25 2017-04-26 北京奇虎科技有限公司 Detection method and system of phishing web pages, client and server
WO2015141665A1 (en) * 2014-03-19 2015-09-24 日本電信電話株式会社 Website information extraction device, system, website information extraction method, and website information extraction program
CN104008131B (en) * 2014-04-30 2018-07-13 广州市动景计算机科技有限公司 A kind of web data processing method and processing device
CN104135467B (en) * 2014-05-29 2015-09-23 腾讯科技(深圳)有限公司 Identify method and the device of malicious websites
CN104079560A (en) * 2014-06-05 2014-10-01 腾讯科技(深圳)有限公司 Web address security detecting method and device and server
CN104050257A (en) * 2014-06-13 2014-09-17 百度国际科技(深圳)有限公司 Detection method and device for phishing webpage
CN105373730A (en) * 2014-08-25 2016-03-02 中国信托商业银行股份有限公司 Method and system for automatically investigating phishing webpages
CN105391674B (en) * 2014-09-04 2020-10-16 腾讯科技(深圳)有限公司 Information processing method and system, server and client
US9398047B2 (en) * 2014-11-17 2016-07-19 Vade Retro Technology, Inc. Methods and systems for phishing detection
CN105488406B (en) * 2014-12-29 2019-02-26 哈尔滨安天科技股份有限公司 A kind of similar malice sample matches method and system based on feature vector
US10164927B2 (en) 2015-01-14 2018-12-25 Vade Secure, Inc. Safe unsubscribe
US9930025B2 (en) 2015-03-23 2018-03-27 Duo Security, Inc. System and method for automatic service discovery and protection
US9906539B2 (en) 2015-04-10 2018-02-27 PhishMe, Inc. Suspicious message processing and incident response
US9596265B2 (en) * 2015-05-13 2017-03-14 Google Inc. Identifying phishing communications using templates
CN106302319A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 A kind of detection method for phishing site and equipment
CN106330811A (en) * 2015-06-15 2017-01-11 中兴通讯股份有限公司 Domain name credibility determination method and device
EP3125147B1 (en) * 2015-07-27 2020-06-03 Swisscom AG System and method for identifying a phishing website
CN105187415A (en) * 2015-08-24 2015-12-23 成都秋雷科技有限责任公司 Phishing webpage detection method
CN105208002A (en) * 2015-08-24 2015-12-30 成都秋雷科技有限责任公司 Phishing website interception method
CN105138918B (en) * 2015-09-01 2019-03-29 百度在线网络技术(北京)有限公司 A kind of recognition methods of secure file and device
US9386037B1 (en) * 2015-09-16 2016-07-05 RiskIQ Inc. Using hash signatures of DOM objects to identify website similarity
WO2017049045A1 (en) * 2015-09-16 2017-03-23 RiskIQ, Inc. Using hash signatures of dom objects to identify website similarity
WO2017049042A1 (en) 2015-09-16 2017-03-23 RiskIQ, Inc. Identifying phishing websites using dom characteristics
US9674213B2 (en) * 2015-10-29 2017-06-06 Duo Security, Inc. Methods and systems for implementing a phishing assessment
CN106713246B (en) * 2015-11-17 2019-08-13 中国移动通信集团公司 A kind of detection method, device and mobile terminal that the application program page is kidnapped
CN105530251A (en) * 2015-12-14 2016-04-27 深圳市深信服电子科技有限公司 Method and device for identifying phishing website
US10893009B2 (en) * 2017-02-16 2021-01-12 eTorch Inc. Email fraud prevention
US10142366B2 (en) 2016-03-15 2018-11-27 Vade Secure, Inc. Methods, systems and devices to mitigate the effects of side effect URLs in legitimate and phishing electronic messages
CN107204960B (en) * 2016-03-16 2020-11-24 阿里巴巴集团控股有限公司 Webpage identification method and device and server
WO2017189727A1 (en) 2016-04-26 2017-11-02 RiskIQ, Inc. Techniques for monitoring version numbers of web frameworks
US11049161B2 (en) * 2016-06-20 2021-06-29 Mimeo.Com, Inc. Brand-based product management with branding analysis
US20180007066A1 (en) * 2016-06-30 2018-01-04 Vade Retro Technology Inc. Detection of phishing dropboxes
RU2634211C1 (en) 2016-07-06 2017-10-24 Общество с ограниченной ответственностью "Траст" Method and system of protocols analysis of harmful programs interaction with control centers and detection of computer attacks
US10193923B2 (en) 2016-07-20 2019-01-29 Duo Security, Inc. Methods for preventing cyber intrusions and phishing activity
CN106156348B (en) * 2016-07-21 2019-06-28 杭州安恒信息技术股份有限公司 A kind of auditing method of database object script risky operation
RU2649793C2 (en) 2016-08-03 2018-04-04 ООО "Группа АйБи" Method and system of detecting remote connection when working on web resource pages
US10498761B2 (en) 2016-08-23 2019-12-03 Duo Security, Inc. Method for identifying phishing websites and hindering associated activity
CN107786529B (en) * 2016-08-31 2020-12-01 阿里巴巴集团控股有限公司 Website detection method, device and system
RU2634209C1 (en) 2016-09-19 2017-10-24 Общество с ограниченной ответственностью "Группа АйБи ТДС" System and method of autogeneration of decision rules for intrusion detection systems with feedback
CN107870927B (en) * 2016-09-26 2021-08-13 博彦泓智科技(上海)有限公司 File evaluation method and device
US10404740B2 (en) 2016-10-03 2019-09-03 Telepathy Labs, Inc. System and method for deprovisioning
CN106503125B (en) * 2016-10-19 2019-10-15 中国互联网络信息中心 A kind of data source extended method and device
US10313352B2 (en) * 2016-10-26 2019-06-04 International Business Machines Corporation Phishing detection with machine learning
CN106603490A (en) * 2016-11-10 2017-04-26 上海斐讯数据通信技术有限公司 Phishing website detecting method and system
US20180173799A1 (en) * 2016-12-21 2018-06-21 Verisign, Inc. Determining a top level domain from a domain name
RU2637477C1 (en) * 2016-12-29 2017-12-04 Общество с ограниченной ответственностью "Траст" System and method for detecting phishing web pages
RU2671991C2 (en) * 2016-12-29 2018-11-08 Общество с ограниченной ответственностью "Траст" System and method for collecting information for detecting phishing
CN107181730A (en) * 2017-03-13 2017-09-19 烟台中科网络技术研究所 A kind of counterfeit website monitoring recognition methods and system
CN107800686B (en) * 2017-09-25 2020-06-12 中国互联网络信息中心 Phishing website identification method and device
RU2689816C2 (en) 2017-11-21 2019-05-29 ООО "Группа АйБи" Method for classifying sequence of user actions (embodiments)
US10009375B1 (en) * 2017-12-01 2018-06-26 KnowBe4, Inc. Systems and methods for artificial model building techniques
RU2677361C1 (en) 2018-01-17 2019-01-16 Общество с ограниченной ответственностью "Траст" Method and system of decentralized identification of malware programs
RU2677368C1 (en) 2018-01-17 2019-01-16 Общество С Ограниченной Ответственностью "Группа Айби" Method and system for automatic determination of fuzzy duplicates of video content
RU2680736C1 (en) 2018-01-17 2019-02-26 Общество с ограниченной ответственностью "Группа АйБи ТДС" Malware files in network traffic detection server and method
RU2668710C1 (en) 2018-01-17 2018-10-02 Общество с ограниченной ответственностью "Группа АйБи ТДС" Computing device and method for detecting malicious domain names in network traffic
RU2676247C1 (en) 2018-01-17 2018-12-26 Общество С Ограниченной Ответственностью "Группа Айби" Web resources clustering method and computer device
RU2681699C1 (en) 2018-02-13 2019-03-12 Общество с ограниченной ответственностью "Траст" Method and server for searching related network resources
CN110309402A (en) * 2018-02-27 2019-10-08 阿里巴巴集团控股有限公司 Detect the method and system of website
CN108304584A (en) * 2018-03-06 2018-07-20 百度在线网络技术(北京)有限公司 Illegal page detection method, apparatus, intruding detection system and storage medium
US20190319905A1 (en) * 2018-04-13 2019-10-17 Inky Technology Corporation Mail protection system
CN110647896B (en) * 2018-06-26 2023-02-03 深信服科技股份有限公司 Phishing page identification method based on logo image and related equipment
CN110647895B (en) * 2018-06-26 2023-02-03 深信服科技股份有限公司 Phishing page identification method based on login box image and related equipment
WO2020044469A1 (en) * 2018-08-29 2020-03-05 Bbソフトサービス株式会社 Illicit webpage detection device, illicit webpage detection device control method, and control program
EP3888335A4 (en) * 2018-11-26 2022-08-10 Cyberfish Ltd. Phishing protection methods and systems
CN111224923B (en) * 2018-11-26 2022-07-22 阿里巴巴集团控股有限公司 Detection method, device and system for counterfeit websites
RU2708508C1 (en) 2018-12-17 2019-12-09 Общество с ограниченной ответственностью "Траст" Method and a computing device for detecting suspicious users in messaging systems
RU2701040C1 (en) 2018-12-28 2019-09-24 Общество с ограниченной ответственностью "Траст" Method and a computer for informing on malicious web resources
WO2020176005A1 (en) 2019-02-27 2020-09-03 Общество С Ограниченной Ответственностью "Группа Айби" Method and system for identifying a user according to keystroke dynamics
US11233820B2 (en) 2019-09-10 2022-01-25 Paypal, Inc. Systems and methods for detecting phishing websites
RU2728498C1 (en) 2019-12-05 2020-07-29 Общество с ограниченной ответственностью "Группа АйБи ТДС" Method and system for determining software belonging by its source code
RU2728497C1 (en) 2019-12-05 2020-07-29 Общество с ограниченной ответственностью "Группа АйБи ТДС" Method and system for determining belonging of software by its machine code
RU2743974C1 (en) 2019-12-19 2021-03-01 Общество с ограниченной ответственностью "Группа АйБи ТДС" System and method for scanning security of elements of network architecture
US11470114B2 (en) 2019-12-27 2022-10-11 Paypal, Inc. Malware and phishing detection and mediation platform
US11381598B2 (en) 2019-12-27 2022-07-05 Paypal, Inc. Phishing detection using certificates associated with uniform resource locators
US11671448B2 (en) * 2019-12-27 2023-06-06 Paypal, Inc. Phishing detection using uniform resource locators
US12021894B2 (en) * 2019-12-27 2024-06-25 Paypal, Inc. Phishing detection based on modeling of web page content
SG10202001963TA (en) 2020-03-04 2021-10-28 Group Ib Global Private Ltd System and method for brand protection based on the search results
US11475090B2 (en) 2020-07-15 2022-10-18 Group-Ib Global Private Limited Method and system for identifying clusters of affiliated web resources
RU2743619C1 (en) 2020-08-06 2021-02-20 Общество с ограниченной ответственностью "Группа АйБи ТДС" Method and system for generating the list of compromise indicators
US11831417B2 (en) * 2020-09-28 2023-11-28 Focus IP Inc. Threat mapping engine
CN112217815B (en) * 2020-10-10 2022-09-13 杭州安恒信息技术股份有限公司 Phishing website identification method and device and computer equipment
CN115085952A (en) * 2021-03-10 2022-09-20 中国电信股份有限公司 Phishing website processing method and device, storage medium and electronic equipment
US11947572B2 (en) 2021-03-29 2024-04-02 Group IB TDS, Ltd Method and system for clustering executable files
NL2030861B1 (en) 2021-06-01 2023-03-14 Trust Ltd System and method for external monitoring a cyberattack surface
CN114070819B (en) * 2021-10-09 2022-11-18 北京邮电大学 Malicious domain name detection method, device, electronic device and storage medium
US20230188563A1 (en) * 2021-12-09 2023-06-15 Blackberry Limited Identifying a phishing attempt
US20240020347A1 (en) * 2022-07-18 2024-01-18 Bank Of America Corporation Browser Application Extension for Payload Detection

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1728655A (en) * 2004-11-25 2006-02-01 刘文印 Method and system for detecting and identifying counterfeit web page
CN101145902A (en) * 2007-08-17 2008-03-19 东南大学 Fishing webpage detection method based on image processing
CN101534306A (en) * 2009-04-14 2009-09-16 深圳市腾讯计算机系统有限公司 Detecting method and a device for fishing website
US7630987B1 (en) * 2004-11-24 2009-12-08 Bank Of America Corporation System and method for detecting phishers by analyzing website referrals
CN102082792A (en) * 2010-12-31 2011-06-01 成都市华为赛门铁克科技有限公司 Phishing webpage detection method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060080735A1 (en) * 2004-09-30 2006-04-13 Usa Revco, Llc Methods and systems for phishing detection and notification
CN101510887B (en) * 2009-03-27 2012-01-25 腾讯科技(深圳)有限公司 Method and device for identifying website
CN101826105B (en) * 2010-04-02 2013-06-05 南京邮电大学 Phishing webpage detection method based on Hungary matching algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9839002B2 (en) 2012-12-31 2017-12-05 Huawei Technologies Co., Ltd Mobility management method and device
CN104156490A (en) * 2014-09-01 2014-11-19 北京奇虎科技有限公司 Method and device for detecting suspicious fishing webpage based on character recognition

Also Published As

Publication number Publication date
US20130086677A1 (en) 2013-04-04
US9218482B2 (en) 2015-12-22
CN102082792A (en) 2011-06-01

Similar Documents

Publication Publication Date Title
WO2012089005A1 (en) Method and apparatus for phishing web page detection
CN104125209B (en) Malice website prompt method and router
US9756068B2 (en) Blocking domain name access using access patterns and domain name registrations
WO2019134334A1 (en) Network abnormal data detection method and apparatus, computer device and storage medium
US9614862B2 (en) System and method for webpage analysis
CA2816069C (en) Data loss monitoring of partial data streams
US20080235163A1 (en) System and method for online duplicate detection and elimination in a web crawler
CN109768992B (en) Webpage malicious scanning processing method and device, terminal device and readable storage medium
WO2021258838A1 (en) Phishing website detection method and apparatus, and device and computer readable storage medium
CN112929390B (en) Network intelligent monitoring method based on multi-strategy fusion
WO2014036801A1 (en) Method for detecting phishing website without depending on sample
WO2014032619A1 (en) Web address access method and system
CN111835777B (en) Abnormal flow detection method, device, equipment and medium
WO2014000537A1 (en) System and method for finding phishing website
WO2015139507A1 (en) Method and apparatus for detecting security of a downloaded file
CN108900554B (en) HTTP asset detection method, system, device and computer medium
CN102129528A (en) WEB page tampering identification method and system
CN106022126B (en) A kind of web page characteristics extracting method towards WEB trojan horse detections
WO2018077035A1 (en) Malicious resource address detecting method and apparatus, and storage medium
WO2015014221A1 (en) Trash information filtering method and device
WO2015074455A1 (en) Method and apparatus for computing url pattern of associated webpage
CN109150842B (en) Injection vulnerability detection method and device
CN111125704A (en) Webpage Trojan horse recognition method and system
Fatt et al. Phishdentity: Leverage website favicon to offset polymorphic phishing website
CN104021143A (en) Method and device for recording webpage access behavior

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 11853161; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: PCT application non-entry in European phase
Ref document number: 11853161; Country of ref document: EP; Kind code of ref document: A1