WO2015039553A1 - Method and system for identifying fraudulent websites priority claim and related application - Google Patents


Info

Publication number
WO2015039553A1
WO2015039553A1 · PCT/CN2014/085529 · CN2014085529W
Authority
WO
WIPO (PCT)
Prior art keywords
website
url
security risk
legitimate
respective message
Prior art date
Application number
PCT/CN2014/085529
Other languages
French (fr)
Inventor
Jie Liu
Li Lu
Wanglin CHEN
Qiuying CHEN
Wenwen DUAN
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2015039553A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/16 Implementing security features at a particular protocol layer
    • H04L63/168 Implementing security features at a particular protocol layer above the transport layer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Definitions

  • the present application relates to the field of Internet technologies, and, more particularly, to a method and system for identifying fraudulent websites.
  • Phishing websites mimic the uniform resource locator ("URL") and page content of legitimate websites (e.g., a banking website, a security website, an e-commerce website, etc.) in order to bait a user into divulging private information so as to subsequently steal the user's property, identity, and/or other virtual wealth.
  • Existing network security products identify phishing websites based on reports from users, and therefore cannot ensure the accuracy of the phishing website identification results.
  • the embodiments of the present disclosure provide methods and systems for identifying fraudulent websites.
  • a method of monitoring messages that link to fraudulent websites is performed at a server system (e.g., server system 108, Figures 1-2) with one or more processors and memory.
  • the method includes identifying, among a plurality of messages sent over a social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator ("URL").
  • the method includes determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the respective message.
  • the method includes determining a level of similarity between the legitimate URL and the potentially suspicious URL. In accordance with a second determination that the level of similarity exceeds a first predetermined threshold, the method includes: identifying the potentially suspicious URL in the respective message as a suspicious URL; and performing a security risk determination process on a first website corresponding to the suspicious URL.
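As an illustration only (the disclosure does not prescribe a particular similarity measure or threshold), the URL similarity check described above can be sketched in Python, here using a character-level ratio from the standard difflib module with an assumed threshold of 0.8:

```python
from difflib import SequenceMatcher

def is_suspicious(candidate_url: str, legitimate_url: str,
                  threshold: float = 0.8) -> bool:
    """Flag a URL as suspicious when it closely resembles, but is not
    identical to, the legitimate URL it appears to imitate."""
    if candidate_url == legitimate_url:
        return False  # an identical URL is the legitimate site itself
    similarity = SequenceMatcher(None, candidate_url, legitimate_url).ratio()
    return similarity > threshold
```

For example, "www.paypa1.com" differs from "www.paypal.com" by one character, so its similarity ratio exceeds 0.8 and it is flagged, while an unrelated URL is not.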
  • a computer system (e.g., server system 108, Figures 1-2) includes one or more processors and memory storing one or more programs for execution by the one or more processors; the one or more programs include instructions for performing, or controlling performance of, the operations of any of the methods described herein.
  • a non-transitory computer readable storage medium stores one or more programs, the one or more programs comprising instructions which, when executed by a computer system (e.g., server system 108, Figures 1-2) with one or more processors, cause the computer system to perform, or control performance of, the operations of any of the methods described herein.
  • a computer system (e.g., server system 108, Figures 1-2) includes means for performing, or controlling performance of, the operations of any of the methods described herein.
  • FIG. 1 is a block diagram of a server-client environment in accordance with some embodiments.
  • FIG. 2 is a block diagram of a server system in accordance with some embodiments.
  • Figure 3 is a block diagram of a client device in accordance with some embodiments.
  • Figure 4 is a flowchart diagram of a method of identifying fraudulent websites in accordance with some embodiments.
  • Figure 5 is a flowchart diagram of a method of identifying fraudulent websites in accordance with some embodiments.
  • Figures 6A-6D are a flowchart diagram of a method of monitoring messages that link to fraudulent websites in accordance with some embodiments.
  • FIG. 7 is a block diagram of a server-side module in accordance with some embodiments.
  • server-client environment 100 includes client-side processing 102-1, 102-2 (hereinafter “client-side module 102” ) executed on a client device 104-1, 104-2, and server-side processing 106 (hereinafter “server-side module 106” ) executed on a server system 108.
  • client-side module 102 communicates with server-side module 106 through one or more networks 110.
  • Client-side module 102 provides client-side functionalities for the social networking platform (e.g., instant messaging and social networking services) and communications with server-side module 106.
  • Server-side module 106 provides server-side functionalities for the social networking platform such as instant messaging and social networking services and/or monitoring messages that link to fraudulent websites for any number of client modules 102 each residing on a respective client device 104.
  • server-side module 106 includes one or more processors 112, messages database 114, fraudulent website database 116, an I/O interface to one or more clients 118, and an I/O interface to one or more external services 120.
  • I/O interface to one or more clients 118 facilitates the client-facing input and output processing for server-side module 106.
  • One or more processors 112 receive messages sent over the social networking platform and determine whether a respective message includes a link to a fraudulent website.
  • Messages database 114 stores messages previously sent over the social networking platform, and fraudulent website database 116 stores a list of identified fraudulent websites and associated URLs.
  • I/O interface to one or more external services 120 facilitates communications with one or more external services 122 (e. g., cloud-based service providers such as video and/or image hosting websites) .
  • client device 104 examples include, but are not limited to, a handheld computer, a wearable computing device, a personal digital assistant (PDA), a tablet computer, a laptop computer, a desktop computer, a cellular telephone, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, a game console, a television, a remote control, or a combination of any two or more of these data processing devices or other data processing devices.
  • Examples of one or more networks 110 include local area networks (LAN) and wide area networks (WAN) such as the Internet.
  • One or more networks 110 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Global System for Mobile Communications (GSM) , Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • Server system 108 is implemented on one or more standalone data processing apparatuses or a distributed network of computers.
  • server system 108 also employs various virtual devices and/or services of third party service providers (e. g., third-party cloud service providers) to provide the underlying computing resources and/or infrastructure resources of server system 108.
  • Server-client environment 100 shown in Figure 1 includes both a client-side portion (e.g., client-side module 102) and a server-side portion (e. g., server-side module 106) .
  • data processing is implemented as a standalone application installed on client device 104.
  • client-side module 102 is a thin-client that provides only user-facing input and output processing functions, and delegates all other data processing functionalities to a backend server (e. g. , server system 108) .
  • FIG. 2 is a block diagram illustrating server system 108 in accordance with some embodiments.
  • Server system 108 typically includes one or more processing units (CPUs) 112, one or more network interfaces 204 (e.g., including I/O interface to one or more clients 118 and I/O interface to one or more external services 120), memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 112. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some implementations, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • operating system 210 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • network communication module 212 for connecting server system 108 to other computing devices (e.g., client devices 104 and one or more external services 122) connected to one or more networks 110 via one or more network interfaces 204 (wired or wireless);
  • server-side module 106, which provides server-side data processing for the social networking platform (e.g., monitoring messages sent over the social networking platform that link to fraudulent websites), including, but not limited to:
    ◦ message monitoring module 222 for identifying a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious URL;
    ◦ URL determination module 224 for determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message;
    ◦ URL similarity module 226 for determining whether the legitimate URL and the potentially suspicious URL are identical, for determining a level of similarity between the legitimate URL and the potentially suspicious URL, and for identifying the potentially suspicious URL as a suspicious URL when the level of similarity exceeds a first predetermined threshold;
    ◦ security risk determination module 228 for determining whether a first website corresponding to the suspicious URL is a fraudulent website, including, but not limited to:
      ▪ security risk factor module 232 for determining a security risk factor for the first website corresponding to the suspicious URL, optionally based on at least one of: a first content similarity metric, a second content similarity metric, and a count of links in the first website that are either broken links or improper links;
      ▪ keyword extraction module 234 for extracting one or more sensitive keywords from the text information of the first website, where the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords in keyword list 252;
      ▪ first content similarity module 236 for determining the first content similarity metric based on common keywords between the one or more sensitive keywords extracted from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL;
      ▪ screenshot module 238 for obtaining a screenshot of the first website corresponding to the suspicious URL and a screenshot of a legitimate website corresponding to the legitimate URL;
      ▪ second content similarity module 240 for determining the second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and the screenshot of the legitimate website;
      ▪ link integrity module 242 for searching for links contained in the first website that are either broken links or improper links and determining the count of broken links and/or improper links in the first website;
    ◦ subsequent actions module 244 for sending a notification to a sender of the respective message indicating that a user account of the sender for the social networking platform may be compromised, after identifying the first website corresponding to the suspicious URL as a fraudulent website;
  • server data 250 storing data for the social networking platform, including, but not limited to:
    ◦ messages database 114 storing messages previously sent over the social networking platform; and
    ◦ keyword list 252 storing the predetermined group of sensitive keywords.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules; thus, various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206, optionally, stores a subset of the modules and data structures identified above.
  • memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is a block diagram illustrating a representative client device 104 associated with a user in accordance with some embodiments.
  • Client device 104 typically includes one or more processing units (CPUs) 302, one or more network interfaces 304, memory 306, and one or more communication buses 308 for interconnecting these components (sometimes called a chipset).
  • Client device 104 also includes a user interface 310.
  • User interface 310 includes one or more output devices 312 that enable presentation of media content, including one or more speakers and/or one or more visual displays.
  • User interface 310 also includes one or more input devices 314, including user interface components that facilitate user input such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, some client devices 104 use a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • Memory 306 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 306, optionally, includes one or more storage devices remotely located from one or more processing units 302.
  • Memory 306, or alternatively the non-volatile memory within memory 306, includes a non-transitory computer readable storage medium.
  • memory 306, or the non-transitory computer readable storage medium of memory 306, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • operating system 316 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • network communication module 318 for connecting client device 104 to other computing devices (e.g., server system 108) connected to one or more networks 110 via one or more network interfaces 304 (wired or wireless);
  • presentation module 320 for enabling presentation of information (e.g., a user interface for a social networking platform, widget, webpage, game, and/or application, audio and/or video content, text, etc.) at client device 104 via one or more output devices 312 (e.g., displays, speakers, etc.) associated with user interface 310;
  • input processing module 322 for detecting one or more user inputs or interactions from one of the one or more input devices 314 and interpreting the detected input or interaction;
  • client-side module 102, which provides client-side data processing and functionalities for the social networking platform; and
  • client data 340 storing data associated with the social networking platform, including, but not limited to:
    ◦ user profile 342 storing a profile associated with the user of client device 104, including custom parameters (e.g., age, location, hobbies, etc.) for the user, social network contacts associated with the user in the social networking platform, and identified trends and/or likes/dislikes of the user; and
    ◦ user data 344 storing data authored, saved, liked, or favorited by the user of client device 104 in the social networking platform.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules; thus, various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 306, optionally, stores a subset of the modules and data structures identified above.
  • memory 306, optionally, stores additional modules and data structures not described above.
  • Figure 4 is a flowchart diagram of a method 400 of identifying fraudulent websites in accordance with some embodiments.
  • method 400 is performed by a server system with one or more processors and memory.
  • method 400 is performed by server system 108 ( Figures 1-2) or a component thereof (e. g., server-side module 106, Figures 1-2) .
  • method 400 is governed by instructions that are stored in a non-transitory computer readable storage medium and the instructions are executed by one or more processors of the server system. Optional operations are indicated by dashed lines (e. g., boxes with dashed-line borders) .
  • a security risk determination process is carried out on a target website with reference to the Uniform Resource Locator ( “URL” ) and content of the target website.
  • a similarity score between the target website (e.g., a potentially fraudulent website) and a legitimate website is determined based on an analysis result, thereby improving the accuracy rate of identifying fraudulent websites and ensuring increased network security.
  • Any implementation requiring a determination of the similarity between websites is applicable to the process; therefore, such implementations are not recited redundantly in the following embodiments.
  • the process can be performed on a per-webpage basis or a per-website basis.
  • operations regarding a target website can be performed with respect to a webpage of a target website, or a target webpage to determine whether the website or webpage involved in the process is fraudulent.
  • For a target website, the server identifies (402) a URL corresponding to the target website and obtains content of the target website. In some embodiments, the server selects the target website from a number of Internet websites with a web crawler system, and subsequently determines whether the identified target website is a fraudulent website (e.g., a phishing website) via a security risk determination process. This process is performed one-by-one on Internet websites. For a website that has passed through the process, its URL is recorded; if the website is selected again, the process is not repeated.
  • the web crawler system first identifies the URL of the target website (or a webpage thereof) and sends a request to a host (e.g., an IP address) of the target website (or the webpage thereof) based on the URL, thereby obtaining content of the target website (e.g., the HTML (Hypertext Markup Language) of the target website (or the webpage thereof)).
  • the legitimate website is predetermined by the server, such as an online banking website, a shopping website, a website related to the personal information of a user, and the like.
  • the legitimate website involves the presentation and/or input of private user and/or property information, and thus has a high likelihood of being faked by a fraudulent website seeking to steal such information.
  • the server determines (404) a first content similarity metric between the URL of the target website and a URL of a corresponding legitimate website.
  • the URL and content of the legitimate website are both stored by the server in advance. Because a fraudulent website typically mimics the URL of the legitimate website, in step 404, the server determines a first content similarity metric between the URL of the target website and the URL of the legitimate website. Specifically, in some embodiments, the server determines the first content similarity metric as the number of characters in common between the two URLs or the similarity between the host addresses of the two URLs.
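A minimal sketch of one such first content similarity metric, assuming a character-overlap ratio between the two URLs' host addresses (the disclosure leaves the exact calculation open, naming only common characters and host-address similarity as examples):

```python
from collections import Counter
from urllib.parse import urlparse

def first_content_similarity(target_url: str, legitimate_url: str) -> float:
    """One possible first content similarity metric: the fraction of
    characters that the two URLs' host addresses have in common
    (multiset intersection over the longer host length)."""
    target_host = urlparse(target_url).netloc
    legit_host = urlparse(legitimate_url).netloc
    common = Counter(target_host) & Counter(legit_host)  # shared characters
    longest = max(len(target_host), len(legit_host))
    return sum(common.values()) / longest if longest else 0.0
```

Under this reading, a look-alike host such as "paypa1.com" scores close to 1.0 against "paypal.com", while unrelated hosts score low.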
  • the specific calculating method of the first content similarity metric is not intended to be limited in the application.
  • the server determines (406) a second content similarity metric between the content of the target website and the content of the legitimate website.
  • the fraudulent website typically also mimics the content of the legitimate website in addition to mimicking the URL of the legitimate website so as to confuse users.
  • the server determines a second content similarity metric between the content of the target website and the content of the legitimate website.
  • the second content similarity metric is determined based on the text information of the two websites.
  • step 406 comprises steps 408-414, which are described as follows.
  • the server analyzes (408) the HTML of the target website and extracts text information from the HTML.
  • the server parses the HTML of the target website by locating tags in the HTML and identifying text attributes associated with each tag so as to extract the text information.
  • the server filters out (410) invisible portions of the extracted text information.
  • the server identifies portion(s) of the text information that have the same font color as the background color of the target website. These text portion(s) are determined to be invisible and are filtered out. Subsequently, keyword extraction is performed on the text information excluding the invisible portion(s), thereby improving the accuracy of the second content similarity metric and the accuracy of identifying fraudulent websites. For example, some phishing websites deliberately add text information that differs from the legitimate website so as to evade safety detection, where the added text has the same color as the background and is therefore invisible to users.
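The invisible-text filtering step can be illustrated with a simplified sketch built on Python's standard html.parser; it assumes hidden text is marked by an inline style whose color equals a known background color, and does not handle CSS classes, external stylesheets, or unclosed void tags:

```python
import re
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect page text while skipping elements whose inline style sets
    the font color to the page background color (a common phishing trick)."""

    def __init__(self, background_color: str = "#ffffff"):
        super().__init__()
        self.background = background_color.lower()
        self.hidden_depth = 0       # > 0 while inside an invisible element
        self.visible_chunks = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        # Match a standalone "color:" property (not "background-color:").
        match = re.search(r"(?:^|;)\s*color\s*:\s*([^;]+)", style)
        hidden_here = bool(match) and match.group(1).strip().lower() == self.background
        if self.hidden_depth or hidden_here:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.visible_chunks.append(data.strip())
```

Feeding a page through this parser yields only the text a user would actually see, which is then passed to keyword extraction.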
  • the server extracts (412) keywords from the text information.
  • word segmentation is conducted on the text information prior to keyword extraction.
  • words having the highest occurrence frequency may be determined as the keywords.
  • words that match a group of predefined keywords are identified as keywords. For example, the words "payment," "password," "bank," etc., which appear frequently in websites that are imitated by the fraudulent websites, are included in the group of predefined keywords.
  • the server performs (414) a matching process between the extracted keywords and the keywords of the legitimate website so as to determine a second content similarity metric between the content of the target website and the content of the legitimate website.
  • keywords are extracted and stored in advance in the same manner as in steps 404 and 406.
  • the matching process determines an occurrence frequency of overlapping keywords between the target website and the legitimate website.
  • the occurrence frequency of overlapping keywords is the second content similarity metric. For example, if eight of the ten keywords extracted from the target website match the keywords of the legitimate website, then the second content similarity metric is a high value.
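Steps 412-414 can be sketched as follows; the sensitive keyword list is illustrative, and treating the metric as the ratio of matching keywords (e.g., 8 of 10 yields 0.8) is one possible reading of the matching process:

```python
import re
from collections import Counter

# Illustrative stand-in for the predetermined group of sensitive keywords.
SENSITIVE_KEYWORDS = {"payment", "password", "bank", "account", "login"}

def extract_keywords(text: str, top_n: int = 10) -> list:
    """Step 412: segment the text into words and keep the most frequent
    ones that belong to the predefined group of sensitive keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in SENSITIVE_KEYWORDS)
    return [word for word, _ in counts.most_common(top_n)]

def second_content_similarity(target_keywords, legitimate_keywords) -> float:
    """Step 414: ratio of the target website's keywords that also occur
    among the legitimate website's keywords."""
    if not target_keywords:
        return 0.0
    overlap = set(target_keywords) & set(legitimate_keywords)
    return len(overlap) / len(target_keywords)
```

The legitimate website's keywords would be extracted and stored in advance, so only the target website is processed at detection time.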
  • steps 404 and 406 are performed in parallel as opposed to sequentially.
  • the server determines (416) whether the target website is a fraudulent website based at least in part on the first content similarity metric and/or the second content similarity metric. In some embodiments, the server calculates a similarity score based on the first and/or the second content similarity metrics, and, in accordance with a determination that the similarity score exceeds a predetermined value, the server identifies the target website as a fraudulent website (e.g., a phishing website mimicking the legitimate website). In some embodiments, the similarity score is calculated based on one of a set of algorithms such as summation, weighted average, and the like.
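One way to combine the metrics into the similarity score of step 416; the equal weights and the 0.8 threshold are assumptions, since the disclosure only names summation and weighted average as example algorithms:

```python
def similarity_score(first_metric: float, second_metric: float,
                     w_first: float = 0.5, w_second: float = 0.5) -> float:
    """Weighted average of the URL similarity (first) and content
    similarity (second) metrics; the equal weights are illustrative."""
    return w_first * first_metric + w_second * second_metric

def is_fraudulent(first_metric: float, second_metric: float,
                  threshold: float = 0.8) -> bool:
    """Step 416: identify the target website as fraudulent when the
    combined similarity score exceeds the predetermined value."""
    return similarity_score(first_metric, second_metric) > threshold
```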
  • Figure 5 is a flowchart diagram of a method 500 of identifying fraudulent websites in accordance with some embodiments.
  • method 500 is performed by a server system with one or more processors and memory.
  • method 500 is performed by server system 108 ( Figures 1-2) or a component thereof (e. g., server-side module 106, Figures 1-2) .
  • method 500 is governed by instructions that are stored in a non-transitory computer readable storage medium and the instructions are executed by one or more processors of the server system. Optional operations are indicated by dashed lines (e. g., boxes with dashed-line borders) .
  • a security risk determination process is carried out on a target website with reference to the Uniform Resource Locator ( “URL” ) and content of the target website.
  • a similarity score between the target website (e. g., a potentially fraudulent website) and a legitimate website is determined based on an analysis result, thereby improving the accuracy rate of identifying fraudulent websites.
  • In addition to text information, the target website also includes picture information and website framework information. Therefore, the following embodiment further improves the accuracy of identifying fraudulent websites by analyzing screenshots of the target website and the legitimate website.
  • the server calculates (502) a similarity score according to a first content similarity metric and/or a second content similarity metric.
  • the first content similarity metric is determined in step 404 of Figure 4 and the second content similarity metric is determined in step 406 of Figure 4.
  • the server calculates a similarity score based on the first and/or the second content similarity metrics.
  • the similarity score is calculated based on one of a set of algorithms such as summation, weighted average, and the like.
  • the server determines (504) whether the similarity score is within a preset score interval. For example, the similarity score is a value between 0 and 1.0. In this example, when the similarity score exceeds 0.8, the server performs step 512. When the similarity score is within the preset score interval between 0.6 and 0.8, the server performs step 506. Continuing with this example, when the similarity score is less than 0.6, the server determines that the target website is not a fraudulent website.
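The three-way decision of step 504 can be sketched directly from the example thresholds in the text (0.6 and 0.8 are the illustrative values, not fixed parameters of the method):

```python
def route_by_score(score: float, lower: float = 0.6, upper: float = 0.8) -> str:
    """Route the security risk determination based on the similarity
    score, using the example thresholds from the text."""
    if score > upper:
        return "determine_fraudulent"   # step 512: decide from the score alone
    if score >= lower:
        return "screenshot_analysis"    # step 506: fall back to screenshots
    return "not_fraudulent"             # below the interval: not fraudulent
```

Reserving the expensive screenshot analysis for scores inside the interval is what keeps the overall process efficient.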
  • the server obtains (506) a screenshot of the target website.
  • the screenshot of the target website includes all of the content in the display area of the target website when viewing the target website with a web browser.
  • the server determines (508) a third content similarity metric between the screenshot of the target website and a screenshot of the legitimate website.
  • the server performs a screenshot analysis process between the two screenshots in order to calculate the third content similarity metric.
  • the screenshot analysis process consumes a relatively large amount of system resources and requires a relatively long processing time.
  • in order to further improve the efficiency of identifying fraudulent websites, the server only performs the screenshot analysis process when the similarity score calculated from the first and second content similarity metrics falls within the preset score interval.
  • the screenshot for the legitimate website is stored in advance and processing of the screenshot of the legitimate website is also performed in advance.
  • the screenshot analysis process is performed by calculating a characteristic value of each screenshot and comparing the proximity of the characteristic values. For example, gray-scale processing is first carried out on the whole screenshot of the target website to acquire corresponding gray-scale images, thereby determining a gray-scale value of each pixel of the screenshot. Then, the gray-scale values of each pixel on the screenshot are compared to the gray-scale values of corresponding pixels of the legitimate website so as to determine a number of pixels having the same gray-scale value, or pixels for which the difference of the gray-scale values is within a certain range. Subsequently, in some embodiments, the third content similarity metric is calculated according to this number of pixels.
  • the screenshot analysis process is not limited to the aforementioned implementation.
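One possible implementation of the gray-scale comparison described above is sketched below, assuming both screenshots have already been rendered at the same dimensions and decoded into flat lists of (R, G, B) pixels. The luma coefficients and the tolerance value are common image-processing choices, not values mandated by the patent.

```python
def to_grayscale(pixels):
    """Map (R, G, B) pixels to gray-scale values using a standard luma weighting."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]

def third_content_similarity(target_pixels, legit_pixels, tolerance=10):
    """Fraction of pixel positions whose gray-scale values match within `tolerance`."""
    target_gray = to_grayscale(target_pixels)
    legit_gray = to_grayscale(legit_pixels)
    close = sum(1 for t, l in zip(target_gray, legit_gray) if abs(t - l) <= tolerance)
    return close / len(target_gray)
```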
  • the server determines (510) whether the target website is a fraudulent website based at least in part on the third content similarity metric.
  • the server determines (512) whether the target website is a fraudulent website based at least in part on the similarity score. In some embodiments, in accordance with a determination that the similarity score exceeds a predetermined value, the server identifies the target website as a fraudulent website (e.g., a phishing website mimicking the legitimate website).
  • the screenshot analysis process is performed in parallel with steps 404 and 406 ( Figure 4) .
  • the server performs step 416 ( Figure 4) based at least in part on the third content similarity metric in addition to the first and second content similarity metrics.
  • Figures 6A-6D are a flowchart diagram of a method 600 of monitoring messages that link to fraudulent websites in accordance with some embodiments.
  • method 600 is performed by a server system with one or more processors and memory.
  • method 600 is performed by server system 108 ( Figures 1-2) or a component thereof (e.g., server-side module 106, Figures 1-2) .
  • method 600 is governed by instructions that are stored in a non-transitory computer readable storage medium and the instructions are executed by one or more processors of the server system.
  • Optional operations are indicated by dashed lines (e. g., boxes with dashed-line borders) .
  • the server system identifies (602), among a plurality of messages sent over the social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator ( “URL” ) .
  • server-side module 106 or a component thereof (e.g., message monitoring module 222, Figure 2) identifies a URL as a potentially suspicious URL when the respective message including the URL satisfies the predefined first criterion.
  • the message contains sensitive content, such as a request for private information, or the message is one of a plurality of identical messages.
  • the message is a private message, a broadcast message (e.g., a Tweet™ or a Facebook™ status update), a chat message, or a group message sent via the social networking platform.
  • a first user’s account for the social networking platform is hacked by a malicious entity.
  • the malicious entity uses the first user’s account to send messages to the first user’s friends and/or contacts, wherein the messages include a link to a phishing website.
  • the malicious entity intends to exploit the first user’s reputation and goodwill with his/her friends and/or contacts in order to bait the first user’s friends and/or contacts into divulging their private/personal information at the phishing website (e.g., credit card information, social security number, login information, and the like).
  • the predefined first criterion includes (604) one or more of: online activities associated with a sender of the respective message, content of the respective message, and a location from which the respective message was sent. In some embodiments, the predefined first criterion is satisfied when one or more of the following conditions are met. In one embodiment, the sender of the message has a history of suspicious activity. In another embodiment, the sender of the message also sent the same message to more than N recipients of the social networking platform, where N is a predetermined value. In another embodiment, the link included in the message has been included in X messages sent over the social networking platform in the last Y hours, where X and Y are predetermined values.
  • the message includes one or more of a plurality of keywords included in the list of keywords 252 ( Figure 2) .
  • the message was sent from an IP address located in a different continent from the legitimate entity that is being mimicked. For example, the message was sent from an IP address located in Mongolia whereas the legitimate entity is located in the USA.
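As a sketch, the example conditions for the first criterion (602/604) could be combined as below. Every parameter name, keyword, and threshold here is illustrative: the patent fixes only the categories (sender activity, message content, sending location), not concrete values.

```python
# Illustrative sensitive-content phrases; the patent's keyword list 252 is not disclosed.
SENSITIVE_KEYWORDS = {"password", "credit card", "social security", "verify your account"}

def satisfies_first_criterion(text, *, sender_suspicious=False,
                              identical_recipients=0, url_occurrences=0,
                              sender_continent=None, entity_continent=None,
                              n=20, x=100):
    """Return True when any example condition from the text holds."""
    return (
        sender_suspicious                                # history of suspicious activity
        or identical_recipients > n                      # same message to > N recipients
        or url_occurrences > x                           # link seen in > X recent messages
        or any(kw in text.lower() for kw in SENSITIVE_KEYWORDS)
        or (sender_continent is not None
            and entity_continent is not None
            and sender_continent != entity_continent)    # sent from a different continent
    )
```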
  • the server system determines (606) a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the respective message.
  • the contextual information includes the sender of the respective message, the text content of the message, and the link included in the respective message.
  • server-side module 106 or a component thereof (e.g., URL determination module 224, Figure 2) determines a legitimate URL being mimicked by the potentially suspicious URL included in the respective message based on the contextual information corresponding to the respective message.
  • the server system determines (608) a level of similarity between the legitimate URL and the potentially suspicious URL.
  • server-side module 106 or a component thereof determines whether the legitimate URL and the potentially suspicious URL are identical (i.e., the two URLs are composed of identical characters or have the same destination). For example, if the URLs are identical, then the link in the respective message is also a legitimate URL.
  • server-side module 106 or a component thereof determines a level of similarity between the legitimate URL and the potentially suspicious URL.
  • In accordance with a second determination (610) that the level of similarity exceeds a first predetermined threshold, the server system: identifies (612) the potentially suspicious URL in the respective message as a suspicious URL; and performs (614) a security risk determination process on a first website corresponding to the suspicious URL.
  • server-side module 106 or a component thereof (e.g., security determination module 228, Figure 2) identifies the potentially suspicious URL as a suspicious URL and performs a security risk determination process on a first website corresponding to the suspicious URL.
  • the level of similarity between the legitimate URL and the potentially suspicious URL exceeds the first predetermined threshold when the two URLs share X percentage of characters, where X is a predetermined value.
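One plausible way to implement the identity check (first determination) together with the similarity threshold of the second determination (610) is a character-level ratio. The patent speaks only of sharing X percent of characters, so using `difflib.SequenceMatcher` and a 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def url_similarity(legit_url, candidate_url):
    """Character-level similarity between two URLs, in [0, 1]."""
    return SequenceMatcher(None, legit_url.lower(), candidate_url.lower()).ratio()

def is_suspicious_url(legit_url, candidate_url, threshold=0.8):
    """Suspicious = not identical to the legitimate URL, yet very similar to it."""
    if legit_url.lower() == candidate_url.lower():
        return False  # identical URLs are legitimate (first determination)
    return url_similarity(legit_url, candidate_url) > threshold
```

A typosquatted domain such as "bamk.com" scores just below identity and above the threshold, which is exactly the region the patent treats as suspicious.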
  • the suspicious URL included in the respective message is stored in a list of suspicious URLs.
  • the server system performs the security risk determination process to determine if the website associated with the link is a website with malicious intent, such as a phishing website.
  • performing the security risk determination process on the first website corresponding to the suspicious URL comprises (616) : determining a security risk factor for the first website corresponding to the suspicious URL; and, in accordance with a third determination that the security risk factor for the first website meets a second predefined criterion, identifying the first website corresponding to the suspicious URL as a fraudulent website.
  • server-side module 106 or a component thereof (e.g., security risk module 232, Figure 2) determines a security risk factor for the first website.
  • the security risk factor is determined according to a predefined algorithm by security risk module 232 ( Figure 2) .
  • the predefined algorithm includes a first content similarity metric (e.g., the keyword-based metric described below with reference to step 618).
  • the predefined algorithm also takes into account whether the first website includes one or more text entry fields or the ability for a user to enter data. For example, if the first website does not include a text entry field, it is less likely that the first website is a phishing website.
  • server-side module 106 or a component thereof identifies the first website as a fraudulent website.
  • the second predefined criterion is satisfied when the security risk factor exceeds a predetermined value.
  • the fraudulent website is a phishing website used to gather login or financial credentials of users associated with the social networking platform.
  • server-side module 106 or a component thereof (e.g., security determination module 228, Figure 2) stores the suspicious URL and the IP address of the fraudulent website and, optionally, the screenshot of the fraudulent website in fraudulent website database 116.
  • the server system determines the security risk factor by (618) : retrieving text information from the first website corresponding to the suspicious URL; extracting one or more sensitive keywords from the text information, where the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords; and determining a first content similarity metric based on common keywords between the extracted one or more sensitive keywords from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL, where the security risk factor is based at least in part on the first content similarity metric.
  • server-side module 106 retrieves text information from the HTML code of the first website, and, subsequently, server-side module 106 or a component thereof (e.g., keyword extraction module 234, Figure 2) extracts one or more sensitive keywords from the text information.
  • server-side module 106 or a component thereof determines a first content similarity metric based on common keywords between the keywords extracted from the text information of the first website and keywords included in the legitimate website corresponding to the legitimate URL. For example, the first content similarity metric has a higher value when the ratio of common keywords is close to 1: 1.
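A minimal sketch of the keyword-based first content similarity metric (618): the text says only that the metric is higher as the common-keyword ratio approaches 1:1, so the Jaccard-style ratio below is one assumed way to obtain that behavior, not the patent's mandated formula.

```python
def first_content_similarity(site_keywords, legit_keywords):
    """Share of sensitive keywords common to both keyword sets, in [0, 1].

    Equals 1.0 when the extracted keywords match the legitimate site's
    keywords exactly (a 1:1 ratio) and decreases as they diverge.
    """
    site, legit = set(site_keywords), set(legit_keywords)
    if not site or not legit:
        return 0.0
    return len(site & legit) / len(site | legit)
```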
  • the server system identifies a portion of the retrieved text information that is not visible to a viewer of the first website corresponding to the suspicious URL; and removes the portion of the retrieved text information that is not visible to the viewer of the first website from the text information.
  • the non-visible portions of the first website include text whose font color is the same as the background color of the first website. For example, some phishing websites deliberately add text different from the legitimate website so as to avoid safety detection, where the added text has the same color as background to be invisible to the users.
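The hidden-text trick described above can be countered by dropping fragments whose font color equals the background color before keyword extraction. The pre-parsed `(text, color)` pairs below are a simplification; real code would have to resolve the effective CSS color of each element in the rendered page.

```python
def strip_invisible_text(fragments, background="#ffffff"):
    """Keep only text fragments whose color differs from the page background.

    `fragments` is a hypothetical pre-parsed list of (text, hex_color) pairs.
    """
    return " ".join(
        text for text, color in fragments
        if color.lower() != background.lower()
    )
```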
  • the server system determines the security risk factor by (622) : obtaining a screenshot of the first website corresponding to the suspicious URL; and determining a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL, where the security risk factor is based at least in part on the second content similarity metric.
  • server-side module 106 or a component thereof (e.g., screenshot module 238, Figure 2) captures a screenshot of the first website.
  • server-side module 106 or a component thereof (e.g., second content similarity module 240, Figure 2) determines a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL.
  • the second content similarity metric is a pixel similarity ratio calculated by matching the two screenshots pixel-by-pixel.
  • server-side module 106 only calculates the second content similarity metric when the value of the first content similarity metric is less than a predetermined threshold.
  • first content similarity module 236 ( Figure 2) and second content similarity module 240 ( Figure 2) determine the first and second content similarity metrics in parallel.
  • the server system determines the security risk factor by (624) searching for links contained in the first website that are either broken links or improper links, where the security risk factor is based at least in part on a count of links in the first website that are either broken links or improper links.
  • server-side module 106 or a component thereof (e.g., link integrity module 242, Figure 2) determines a count of links in the first website that are either broken links or improper links.
  • a link is improper when the destination page has little relatedness to the first website. For example, the first website pertains to banking whereas the destination page pertains to a wholly different subject.
  • a link is improper when more than X links in the first website link to the same destination page.
  • a link is improper when the destination page has minimum relatedness with anchor text for the link.
  • a small number of broken or improper links bears little weight in determining the security risk. However, if the count of broken and/or improper links exceeds a threshold count then it has a greater weight in the determination of the security risk factor.
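The link-count factor (624) and its threshold weighting might be sketched as below. Mapping "broken or improper" onto HTTP statuses of 400 and above plus an explicit `improper` flag, and zeroing the weight below the threshold, are illustrative simplifications of the weighting described above.

```python
def link_risk_weight(links, threshold=5):
    """Weight of the broken/improper-link count in the security risk factor.

    `links` maps each URL found on the first website to a (status, improper)
    pair observed by a hypothetical crawler. A small number of bad links
    contributes nothing; past the threshold, the bad-link ratio feeds into
    the risk factor.
    """
    bad = sum(1 for status, improper in links.values() if status >= 400 or improper)
    return 0.0 if bad <= threshold else bad / len(links)
```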
  • the server system sends (626) a notification to a sender of the respective message indicating that a user account of the sender for the social networking platform may be compromised.
  • server-side module 106 or a component thereof (e.g., subsequent action(s) module 244, Figure 2) sends a notification to an email address, phone number, etc. registered with the user account that sent the respective message.
  • the notification indicates that the user account may have been hacked or compromised.
  • an entity associated with a phishing website hacks a legitimate account or emulates a legitimate account.
  • server-side module 106 or a component thereof deletes messages sent over the social networking platform that include the suspicious URL to the fraudulent website or messages with other URLs whose destination is the fraudulent website.
  • server-side module 106 or a component thereof sends a message to recipients of the respective message indicating that the website corresponding to the URL included in the respective message is associated with a security risk and that the sender’s account may have been compromised.
  • FIG. 7 is a block diagram of a server-side module in accordance with some embodiments.
  • server-side module 106 is used to implement methods 400 and 500 of identifying fraudulent websites as described in Figures 4 and 5, respectively.
  • server-side module 106 includes the following components: obtaining unit 702, first content similarity unit 704, second content similarity unit 706, and identifying unit 708.
  • first content similarity unit 704 is configured to determine a first content similarity metric between the URL of the target website and the URL of the legitimate website.
  • analyzing subunit 722 is configured to analyze the HTML of the target website and to extract text information from the HTML.
  • matching subunit 728 is configured to perform a matching process between the extracted keywords and the keywords of the legitimate website so as to determine a second content similarity metric between the content of the target website and the content of the legitimate website.
  • identifying unit 708 is configured to calculate a similarity score based at least in part on the first and/or second content similarity metrics and to identify the target website as a fraudulent website when the similarity score exceeds a predetermined value.
  • screenshot unit 732 is configured to obtain a screenshot of the target website.

Abstract

A server system with processor (s) and memory identifies, among a plurality of messages sent over a social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious URL. The server system determines a legitimate URL that corresponds to the potentially suspicious URL based on contextual information corresponding to the respective message. In accordance with a first determination that the legitimate URL and the potentially suspicious URL in the respective message are not identical, the server system determines a level of similarity between the legitimate URL and the potentially suspicious URL. In accordance with a second determination that the level of similarity exceeds a first predetermined threshold, the server system: identifies the potentially suspicious URL in the respective message as a suspicious URL; and performs a security risk determination process on a first website corresponding to the suspicious URL.

Description

METHOD AND SYSTEM FOR IDENTIFYING FRAUDULENT WEBSITES PRIORITY CLAIM AND RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 201310443265.4, entitled “Method and Apparatus for Selecting Targets, ” filed on September 23, 2013, which is incorporated by reference in its entirety.
TECHNICAL FIELD
The present application relates to the field of Internet technologies, and, more particularly, to a method and system for identifying fraudulent websites.
BACKGROUND
Phishing websites mimic the uniform resource locator ( “URL” ) and page content of legitimate websites (e. g., a banking website, a security website, an e-commerce website, etc. ) in order to bait a user into divulging private information so as to subsequently steal the user’ s property, identity, and/or other virtual wealth. To combat phishing, most commercially available network security products provide protection during internet browsing sessions by identifying and blocking the phishing websites.
At present, network security products identify phishing websites based on reports from users, which cannot ensure the accuracy of the phishing website identification results.
SUMMARY
In order to address the problems stated in the background section, the embodiments of the present disclosure provide methods and systems for identifying fraudulent websites.
In some embodiments, a method of monitoring messages that link to fraudulent websites is performed at a server system (e. g., server system 108, Figures 1-2) with one or more processors and memory. The method includes identifying, among a plurality of messages sent over a social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator ( “URL” ) . The method includes determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the  respective message. In accordance with a first determination that the legitimate URL and the potentially suspicious URL in the respective message are not identical, the method includes determining a level of similarity between the legitimate URL and the potentially suspicious URL. In accordance with a second determination that the level of similarity exceeds a first predetermined threshold, the method includes: identifying the potentially suspicious URL in the respective message as a suspicious URL; and performing a security risk determination process on a first website corresponding to the suspicious URL.
In some embodiments, a computer system (e. g., server system 108, Figures 1-2) includes one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs include instructions for performing, or controlling performance of, the operations of any of the methods described herein. In some embodiments, a non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which, when executed by a computer system (e. g., server system 108, Figures 1-2) with one or more processors, cause the computer system to perform, or control performance of, the operations of any of the methods described herein. In some embodiments, a computer system (e. g., server system 108, Figures 1-2) includes means for performing, or controlling performance of, the operations of any of the methods described herein.
Various advantages of the present application are apparent in light of the descriptions below.
BRIEF DESCRIPTION OF THE DRAWINGS
The aforementioned features and advantages of the techniques as well as additional features and advantages thereof will be more clearly understood hereinafter as a result of a detailed description of preferred embodiments when taken in conjunction with the drawings.
To illustrate the technical solutions according to the embodiments of the present application more clearly, the accompanying drawings for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are only some embodiments of the present application; persons skilled in the art may obtain other drawings according to the accompanying drawings without paying any creative effort.
Figure 1 is a block diagram of a server-client environment in accordance with some embodiments.
Figure 2 is a block diagram of a server system in accordance with some embodiments.
Figure 3 is a block diagram of a client device in accordance with some embodiments.
Figure 4 is a flowchart diagram of a method of identifying fraudulent websites in accordance with some embodiments.
Figure 5 is a flowchart diagram of a method of identifying fraudulent websites in accordance with some embodiments.
Figures 6A-6D is a flowchart diagram of a method of monitoring messages that link to fraudulent websites in accordance with some embodiments.
Figure 7 is a block diagram of a server-side module in accordance with some embodiments.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the subject matter presented herein. But it will be apparent to one skilled in the art that the subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present application without creative efforts shall fall within the protection scope of the present application.
As shown in Figure 1, data processing for a social networking platform is implemented in a server-client environment 100 in accordance with some embodiments. In accordance with some embodiments, server-client environment 100 includes client-side processing 102-1, 102-2 (hereinafter “client-side module 102” ) executed on a client device 104-1, 104-2, and server-side processing 106 (hereinafter “server-side module 106” ) executed on a server system 108.  Client-side module 102 communicates with server-side module 106 through one or more networks 110. Client-side module 102 provides client-side functionalities for the social networking platform (e.g., instant messaging and social networking services) and communications with server-side module 106. Server-side module 106 provides server-side functionalities for the social networking platform such as instant messaging and social networking services and/or monitoring messages that link to fraudulent websites for any number of client modules 102 each residing on a respective client device 104.
  • In some embodiments, server-side module 106 includes one or more processors 112, messages database 114, fraudulent website database 116, an I/O interface to one or more clients 118, and an I/O interface to one or more external services 120. I/O interface to one or more clients 118 facilitates the client-facing input and output processing for server-side module 106. One or more processors 112 receive messages sent over the social networking platform and determine whether a respective message includes a link to a fraudulent website. Messages database 114 stores messages previously sent over the social networking platform, and fraudulent website database 116 stores a list of identified fraudulent websites and associated URLs. I/O interface to one or more external services 120 facilitates communications with one or more external services 122 (e.g., cloud-based service providers such as video and/or image hosting websites).
  • Examples of client device 104 include, but are not limited to, a handheld computer, a wearable computing device, a personal digital assistant (PDA), a tablet computer, a laptop computer, a desktop computer, a cellular telephone, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, a game console, a television, a remote control, or a combination of any two or more of these data processing devices or other data processing devices.
Examples of one or more networks 110 include local area networks (LAN) and wide area networks (WAN) such as the Internet. One or more networks 110 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Global System for Mobile Communications (GSM) , Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
Server system 108 is implemented on one or more standalone data processing apparatuses or a distributed network of computers. In some embodiments, server system 108 also employs various virtual devices and/or services of third party service providers (e. g., third-party  cloud service providers) to provide the underlying computing resources and/or infrastructure resources of server system 108.
Server-client environment 100 shown in Figure 1 includes both a client-side portion (e.g., client-side module 102) and a server-side portion (e. g., server-side module 106) . In some embodiments, data processing is implemented as a standalone application installed on client device 104. In addition, the division of functionalities between the client and server portions of client environment data processing can vary in different embodiments. For example, in some embodiments, client-side module 102 is a thin-client that provides only user-facing input and output processing functions, and delegates all other data processing functionalities to a backend server (e. g. , server system 108) .
Figure 2 is a block diagram illustrating server system 108 in accordance with some embodiments. Server system 108, typically, includes one or more processing units (CPUs) 112, one or more network interfaces 204 (e. g., including I/O interface to one or more clients 118 and I/O interface to one or more external services 120), memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset) . Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 112. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some implementations, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
·operating system 210 including procedures for handling various basic system services and for performing hardware dependent tasks;
·network communication module 212 for connecting server system 108 to other computing devices (e. g., client devices 104 and one or more external services 122) connected to one or more networks 110 via one or more network interfaces 204 (wired or wireless) ;
·server-side module 106, which provides server-side data processing for the social networking platform (e. g., monitoring messages sent over the social networking platform that link to fraudulent websites), includes, but is not limited to:
omessage monitoring module 222 for identifying a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious URL;
oURL determination module 224 for determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message;
oURL similarity module 226 for determining whether the legitimate URL and the potentially suspicious URL are identical, for determining a level of similarity between the legitimate URL and the potentially suspicious URL, and for identifying the potentially suspicious URL as a suspicious URL when the level of similarity exceeds a first predetermined threshold;
osecurity risk determination module 228 for determining whether a first website corresponding to the suspicious URL is a fraudulent website, including but not limited to:
·security risk factor module 232 for determining a security risk factor for the first website corresponding to the suspicious URL, optionally, based on at least one of: a first content similarity metric, a second content similarity metric, and a count of links in the first website that are either broken links or improper links;
·keyword extraction module 234 for extracting one or more sensitive keywords from the text information of the first website, the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords in keyword list 252;
·first content similarity module 236 for determining the first content similarity metric based on common keywords between the extracted one or more sensitive keywords from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL;
·screenshot module 238 for obtaining a screenshot of the first website corresponding to the suspicious URL and a screenshot of a legitimate website corresponding to the legitimate URL;
·second content similarity module 240 for determining the second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and the screenshot of a legitimate website; and
·link integrity module 242 for searching for links contained in the first website that are either broken links or improper links and determining the count of broken links and/or improper links in the first website; and
osubsequent actions module 244 for sending a notification to a sender of the respective message indicating that a user account of the sender for the social networking platform may be compromised after identifying the first website corresponding to the suspicious URL as a fraudulent website; and
·server data 250 storing data for the social networking platform, including but not limited to:
omessages database 114 storing messages previously sent over the social networking platform;
ofraudulent website database 116 storing a list of identified fraudulent websites and associated URLs; and
okeyword list 252 storing the predetermined group of sensitive keywords.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
Figure 3 is a block diagram illustrating a representative client device 104 associated with a user in accordance with some embodiments. Client device 104, typically, includes one or more processing units (CPUs) 302, one or more network interfaces 304, memory 306, and one or more communication buses 308 for interconnecting these components (sometimes called a chipset). Client device 104 also includes a user interface 310. User interface 310 includes one or more output devices 312 that enable presentation of media content, including one or more speakers and/or one or more visual displays. User interface 310 also includes one or more input devices 314, including user interface components that facilitate user input such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, some client devices 104 use a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. Memory 306 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 306, optionally, includes one or more storage devices remotely located from one or more processing units 302. Memory 306, or alternatively the non-volatile memory within memory 306, includes a non-transitory computer readable storage medium. In some implementations, memory 306, or the non-transitory computer readable storage medium of memory 306, stores the following programs, modules, and data structures, or a subset or superset thereof:
·operating system 316 including procedures for handling various basic system services and for performing hardware dependent tasks;
·network communication module 318 for connecting client device 104 to other computing devices (e.g., server system 108) connected to one or more networks 110 via one or more network interfaces 304 (wired or wireless);
·presentation module 320 for enabling presentation of information (e.g., a user interface for a social networking platform, widget, webpage, game, and/or application, audio and/or video content, text, etc.) at client device 104 via one or more output devices 312 (e.g., displays, speakers, etc.) associated with user interface 310;
·input processing module 322 for detecting one or more user inputs or interactions from one of the one or more input devices 314 and interpreting the detected input or interaction;
·client-side module 102, which provides client-side data processing and functionalities for the social networking platform;
·one or more applications 332-1–332-N for execution by client device 104; and
·client data 340 storing data associated with the social networking platform, including, but not limited to:
ouser profile 342 storing a profile associated with the user of client device 104 including custom parameters (e.g., age, location, hobbies, etc.) for the user, social network contacts associated with the user in the social networking platform, and identified trends and/or likes/dislikes of the user; and
ouser data 344 storing data authored, saved, liked, or favorited by the user of client device 104 in the social networking platform.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 306, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 306, optionally, stores additional modules and data structures not described above.
In some embodiments, at least some of the functions of the client-side module 102 are performed by the server-side module 106, and the corresponding sub-modules of these functions may be located within the server-side module 106 rather than the client-side module 102. In some embodiments, at least some of the functions of the server-side module 106 are performed by the client-side module 102, and the corresponding sub-modules of these functions may be located within the client-side module 102 rather than the server-side module 106. Server system 108 and client device 104 shown in Figures 2-3, respectively, are merely illustrative, and different configurations of the modules for implementing the functions described herein are possible in various embodiments.
Figure 4 is a flowchart diagram of a method 400 of identifying fraudulent websites in accordance with some embodiments. In some embodiments, method 400 is performed by a server system with one or more processors and memory. For example, in some embodiments, method 400 is performed by server system 108 (Figures 1-2) or a component thereof (e.g., server-side module 106, Figures 1-2). In some embodiments, method 400 is governed by instructions that are stored in a non-transitory computer readable storage medium and the instructions are executed by one or more processors of the server system. Optional operations are indicated by dashed lines (e.g., boxes with dashed-line borders).
In an exemplary embodiment, a security risk determination process is carried out on a target website with reference to the Uniform Resource Locator (“URL”) and content of the target website. A similarity score between the target website (e.g., a potentially fraudulent website) and a legitimate website is determined based on an analysis result, thereby improving the accuracy rate of identifying fraudulent websites and ensuring increased network security. It should be noted that, in all the embodiments of the application, a security risk determination process for identifying fraudulent websites (e.g., phishing websites) is taken as an example, but this process is not limited to identifying fraudulent websites. Any implementation requiring a determination of the similarity between websites is applicable to the process, and such implementations are not recited redundantly in the following embodiments. It should also be understood that the process can be performed on a per-webpage basis or a per-website basis. In the present disclosure, a person skilled in the art would understand that operations regarding a target website can be performed with respect to a webpage of a target website, or a target webpage, to determine whether the website or webpage involved in the process is fraudulent.
For a target website, the server identifies (402) a URL corresponding to the target website and obtains content of the target website. In some embodiments, the server selects the target website from a number of Internet websites with a web crawler system, and subsequently determines whether the identified target website is a fraudulent website (e.g., a phishing website) via a security risk determination process. This process is performed one-by-one on Internet websites. For a website that has passed through the process, its URL is recorded, and if the website is selected again, the process is not repeated.
In some embodiments, the web crawler system first identifies the URL of the target website (or a webpage thereof) and sends a request to a host (e.g., an IP address) of the target website (or the webpage thereof) based on the URL, thereby obtaining content of the target website (e.g., the HTML (Hypertext Markup Language) of the target website (or the webpage thereof)). In some embodiments, when selecting the target website (or the webpage thereof), abnormal websites (or abnormal webpages) are filtered out by the web crawler system. For example, a website (or a webpage thereof) whose HTML cannot be parsed or whose content cannot be displayed properly is filtered out, thereby improving the efficiency of the subsequent identification process. In some embodiments, the legitimate website is predetermined by the server, such as an online banking website, a shopping website, a website related to the personal information of a user, and the like. For example, the legitimate website involves the presentation and/or input of private user and/or property information, leading to a high likelihood of being faked by a fraudulent website seeking to steal such information.
The server determines (404) a first content similarity metric between the URL of the target website and a URL of a corresponding legitimate website. In some embodiments, the URL and content of the legitimate website are both stored by the server in advance. Because a fraudulent website typically mimics the URL of the legitimate website, in step 404, the server determines a first content similarity metric between the URL of the target website and the URL of the legitimate website. Specifically, in some embodiments, the server determines the first content similarity metric as the number of characters in common between the two URLs or the similarity between the host addresses of the two URLs. The specific method of calculating the first content similarity metric is not intended to be limited in this application.
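As one illustrative sketch of step 404 (not prescribed by this disclosure), the host-address similarity could be computed from matching character runs using Python's standard difflib; the example URLs are hypothetical:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def url_similarity(target_url: str, legitimate_url: str) -> float:
    """Return a 0.0-1.0 similarity between the host addresses of the
    two URLs, since a fraudulent URL typically mimics the legitimate
    host. SequenceMatcher scores matching character runs."""
    target_host = urlparse(target_url).netloc
    legitimate_host = urlparse(legitimate_url).netloc
    return SequenceMatcher(None, target_host, legitimate_host).ratio()

# A typosquatted host ("paypa1" vs. "paypal") scores high but below 1.0
score = url_similarity("http://paypa1.example.com/login",
                       "http://paypal.example.com/login")
```

The same shape applies if whole URLs, rather than hosts, are compared; the choice of component to compare is a design decision left open by the text.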
The server determines (406) a second content similarity metric between the content of the target website and the content of the legitimate website. The fraudulent website typically also mimics the content of the legitimate website, in addition to mimicking its URL, so as to confuse users. Thus, the server determines a second content similarity metric between the content of the target website and the content of the legitimate website. In some embodiments, the second content similarity metric is determined based on the text information of the two websites. In some embodiments, step 406 comprises steps 408-414, which are described as follows.
In some embodiments, the server analyzes (408) the HTML of the target website and extracts text information from the HTML. First, the server parses the HTML of the target website by locating tags in the HTML and identifying text attributes associated with each tag so as to extract the text information.
In some embodiments, the server filters out (410) invisible portions of the extracted text information. In some embodiments, the server identifies portion(s) of the text information that have the same font color as the background color of the target website. These text portion(s) are determined to be invisible and are filtered out. Subsequently, the extraction of keywords is performed based on the text information excluding the invisible portion(s), thereby improving the accuracy of the second content similarity metric and the accuracy of identifying fraudulent websites. For example, some phishing websites deliberately add text information different from the legitimate website so as to avoid safety detection, where the added text information has the same color as the background so as to be invisible to users.
In some embodiments, the server extracts (412) keywords from the text information. In some embodiments, word segmentation is conducted on the text information prior to keyword extraction. In some embodiments, words having the highest occurrence frequency may be determined to be the keywords. Alternatively, in some embodiments, words that match a group of predefined keywords are identified as keywords. For example, the words “payment,” “password,” “bank,” etc., which appear frequently in websites that are imitated by fraudulent websites, are included in the group of predefined keywords.
In some embodiments, the server performs (414) a matching process between the extracted keywords and the keywords of the legitimate website so as to determine a second content similarity metric between the content of the target website and the content of the legitimate website. In some embodiments, for the legitimate website, keywords are extracted and stored in advance in the same manner as in steps 404 and 406. In some embodiments, the matching process determines an occurrence frequency of overlapping keywords between the target website and the legitimate website. In some embodiments, the occurrence frequency of overlapping keywords is the second content similarity metric. For example, if eight of the ten keywords extracted from the target website match the keywords in the legitimate website, then the second content similarity metric has a high value.
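The matching process of step 414 can be sketched as a simple overlap ratio. The 8-of-10 figure follows the example in the text; the function shape and keyword values are assumptions for illustration:

```python
def keyword_overlap(target_keywords: list[str],
                    legitimate_keywords: set[str]) -> float:
    """Fraction (0.0-1.0) of the target site's extracted keywords
    that also appear among the legitimate site's stored keywords."""
    if not target_keywords:
        return 0.0
    matched = sum(1 for kw in target_keywords if kw in legitimate_keywords)
    return matched / len(target_keywords)

# Eight of ten keywords in common, as in the text's example
metric = keyword_overlap(
    ["payment", "password", "bank", "login", "account",
     "transfer", "card", "secure", "weather", "news"],
    {"payment", "password", "bank", "login", "account",
     "transfer", "card", "secure", "balance", "deposit"},
)
# metric == 0.8
```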
In some embodiments, steps 404 and 406 are performed in parallel as opposed to sequentially.
The server determines (416) whether the target website is a fraudulent website based at least in part on the first content similarity metric and/or the second content similarity metric. In some embodiments, the server calculates a similarity score based on the first and/or the second content similarity metrics, and, in accordance with a determination that the similarity score exceeds a predetermined value, the server identifies the target website as a fraudulent website (e.g., a phishing website mimicking the legitimate website). In some embodiments, the similarity score is calculated based on one of a set of algorithms such as summation, weighted average, and the like.
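The weighted-average variant of step 416 might be combined as follows; the weights and the predetermined value are illustrative assumptions, since the disclosure leaves the specific algorithm open:

```python
def similarity_score(first_metric: float, second_metric: float,
                     first_weight: float = 0.5) -> float:
    """Combine the first (URL) and second (content) similarity
    metrics by weighted average; the weights are illustrative."""
    return first_weight * first_metric + (1.0 - first_weight) * second_metric

def is_fraudulent(score: float, predetermined_value: float = 0.8) -> bool:
    """Identify the target website as fraudulent when the combined
    similarity score exceeds the predetermined value."""
    return score > predetermined_value
```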
Figure 5 is a flowchart diagram of a method 500 of identifying fraudulent websites in accordance with some embodiments. In some embodiments, method 500 is performed by a server system with one or more processors and memory. For example, in some embodiments, method 500 is performed by server system 108 (Figures 1-2) or a component thereof (e.g., server-side module 106, Figures 1-2). In some embodiments, method 500 is governed by instructions that are stored in a non-transitory computer readable storage medium and the instructions are executed by one or more processors of the server system. Optional operations are indicated by dashed lines (e.g., boxes with dashed-line borders).
In the following embodiment, a security risk determination process is carried out on a target website with reference to the Uniform Resource Locator (“URL”) and content of the target website. A similarity score between the target website (e.g., a potentially fraudulent website) and a legitimate website is determined based on an analysis result, thereby improving the accuracy rate of identifying fraudulent websites. In some embodiments, in addition to text information, the target website also includes picture information and website framework information. Therefore, the following embodiment further improves the accuracy of identifying fraudulent websites by analyzing screenshots of the target website and the legitimate website.
The server calculates (502) a similarity score according to a first content similarity metric and/or a second content similarity metric. For example, the first content similarity metric is determined in step 404 of Figure 4 and the second content similarity metric is determined in step 406 of Figure 4. In some embodiments, the server calculates a similarity score based on the first and/or the second content similarity metrics. In some embodiments, the similarity score is calculated based on one of a set of algorithms such as summation, weighted average, and the like.
The server determines (504) whether the similarity score is within a preset score interval. For example, the similarity score is a value between 0 and 1.0. In this example, when the similarity score exceeds 0.8, the server performs step 512. In this example, when the similarity score is within the preset score interval between 0.6 and 0.8, the server performs step 506. Continuing with this example, when the similarity score is less than 0.6, the server determines that the target website is not a fraudulent website.
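The three-way routing in this worked example can be sketched directly; the 0.8 and 0.6 cutoffs come from the text, while the function itself and its return labels are assumptions:

```python
def route_by_score(score: float,
                   fraud_cutoff: float = 0.8,
                   screenshot_cutoff: float = 0.6) -> str:
    """Route a similarity score per the worked example: high scores
    go straight to the fraud determination (step 512), mid-range
    scores trigger the costlier screenshot analysis (step 506), and
    low scores clear the target website."""
    if score > fraud_cutoff:
        return "determine-fraud"       # step 512
    if score >= screenshot_cutoff:
        return "screenshot-analysis"   # step 506
    return "not-fraudulent"
```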
When the similarity score is within the preset interval, the server obtains (506) a screenshot of the target website. For example, the screenshot of the target website includes all of the content in the display area of the target website when viewing the target website with a web browser.
The server determines (508) a third content similarity metric between the screenshot of the target website and a screenshot of the legitimate website. In some embodiments, the server performs a screenshot analysis process between the two screenshots in order to calculate the third content similarity metric. The screenshot analysis process consumes a large amount of system resources and requires a long processing time. Thus, in some embodiments, in order to further improve the efficiency of identifying fraudulent websites, the server only performs the screenshot analysis process when the similarity score calculated from the first and second content similarity metrics is within the preset score interval. In some embodiments, the screenshot of the legitimate website is stored in advance, and processing of the screenshot of the legitimate website is also performed in advance.
In some embodiments, the screenshot analysis process is performed by calculating a characteristic value of each screenshot and comparing the proximity of the characteristic values. For example, gray-scale processing is first carried out on the whole screenshot of the target website to acquire a corresponding gray-scale image, thereby determining a gray-scale value of each pixel of the screenshot. Then, the gray-scale value of each pixel of the screenshot is compared to the gray-scale value of the corresponding pixel of the screenshot of the legitimate website so as to determine the number of pixels having the same gray-scale value, or of pixels for which the difference between gray-scale values is within a certain range. Subsequently, in some embodiments, the third content similarity metric is calculated according to this number of pixels. However, the screenshot analysis process is not limited to the aforementioned implementation.
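A minimal sketch of the per-pixel gray-scale comparison described above, operating on already-converted gray-scale matrices (image capture and gray-scale conversion are omitted, and the tolerance value is an assumption):

```python
def pixel_similarity(gray_a: list[list[int]],
                     gray_b: list[list[int]],
                     tolerance: int = 10) -> float:
    """Ratio of pixels whose gray-scale values (0-255) differ by at
    most `tolerance`, between two equally sized gray-scale images
    represented as rows of integer values."""
    total = matched = 0
    for row_a, row_b in zip(gray_a, gray_b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            if abs(pa - pb) <= tolerance:
                matched += 1
    return matched / total if total else 0.0

# Two 2x2 screenshots differing in one pixel
ratio = pixel_similarity([[0, 0], [0, 0]], [[0, 0], [0, 200]])
# ratio == 0.75
```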
The server determines (510) whether the target website is a fraudulent website based at least in part on the third content similarity metric.
The server determines (512) whether the target website is a fraudulent website based at least in part on the similarity score. In some embodiments, in accordance with a determination that the similarity score exceeds a predetermined value, the server identifies the target website as a fraudulent website (e.g., a phishing website mimicking the legitimate website).
In some embodiments, the screenshot analysis process is performed in parallel with steps 404 and 406 (Figure 4). In some embodiments, the server performs step 416 (Figure 4) based at least in part on the third content similarity metric in addition to the first and second content similarity metrics.
Figures 6A-6D are a flowchart diagram of a method 600 of monitoring messages that link to fraudulent websites in accordance with some embodiments. In some embodiments, method 600 is performed by a server system with one or more processors and memory. For example, in some embodiments, method 600 is performed by server system 108 (Figures 1-2) or a component thereof (e.g., server-side module 106, Figures 1-2). In some embodiments, method 600 is governed by instructions that are stored in a non-transitory computer readable storage medium and the instructions are executed by one or more processors of the server system. Optional operations are indicated by dashed lines (e.g., boxes with dashed-line borders).
The server system identifies (602), among a plurality of messages sent over the social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator (“URL”). In some embodiments, server-side module 106 or a component thereof (e.g., message monitoring module 222, Figure 2) monitors messages being sent in the social networking platform so as to identify messages that include links to fraudulent websites. In some embodiments, message monitoring module 222 identifies a URL as a potentially suspicious URL when the respective message including the URL satisfies the predefined first criterion. For example, the message contains sensitive content, such as a request for private information, or the message is one of a plurality of identical messages. In some embodiments, the message is a private message, a broadcast message (e.g., a Tweet™ or a Facebook™ status update), a chat message, or a group message sent via the social networking platform.
For example, a first user’s account for the social networking platform is hacked by a malicious entity. Continuing with this example, the malicious entity uses the first user’s account to send messages to the first user’s friends and/or contacts, wherein the messages include a link to a phishing website. In this example, the malicious entity intends to exploit the first user’s reputation and goodwill with his/her friends and/or contacts in order to bait the first user’s friends and/or contacts into divulging their private/personal information at the phishing website (e.g., credit card information, social security number, login information, and the like).
In some embodiments, the predefined first criterion includes (604) one or more of: online activities associated with a sender of the respective message, content of the respective message, and a location from which the respective message was sent. In some embodiments, the predefined first criterion is satisfied when one or more of the following conditions are met. In one embodiment, the sender of the message has a history of suspicious activity. In another embodiment, the sender of the message also sent the same message to more than N recipients of the social networking platform, where N is a predetermined value. In another embodiment, the link included in the message has been included in X messages sent over the social networking platform in the last Y hours, where X and Y are predetermined values. In another embodiment, the message includes one or more of a plurality of keywords included in the list of keywords 252 (Figure 2). In another embodiment, the message was sent from an IP address located in a different continent from the legitimate entity that is being mimicked. For example, the message was sent from an IP address located in Mongolia whereas the legitimate entity is located in the USA.
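A hypothetical screening predicate combining two of the example conditions above (sensitive keywords in the message body, and a URL repeated across many recent messages); the field names, thresholds, and function shape are all illustrative assumptions, not part of the disclosure:

```python
def satisfies_first_criterion(message: dict[str, str],
                              sensitive_keywords: set[str],
                              recent_url_counts: dict[str, int],
                              repeat_threshold: int = 100) -> bool:
    """Flag a message as potentially suspicious when its body
    contains a sensitive keyword, or when its URL has appeared in
    more than `repeat_threshold` recent messages."""
    keyword_hit = any(kw in message["text"] for kw in sensitive_keywords)
    repeat_hit = recent_url_counts.get(message["url"], 0) > repeat_threshold
    return keyword_hit or repeat_hit
```

In practice a production screen would also weigh sender history and geolocation, per the embodiments above; those signals are omitted here to keep the sketch self-contained.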
The server system determines (606) a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the respective message. In some embodiments, the contextual information includes the sender of the respective message, the text content of the message, and the link included in the respective message. In some embodiments, server-side module 106 or a component thereof (e.g., URL determination module 224, Figure 2) determines a legitimate URL being mimicked by the potentially suspicious URL included in the respective message based on the contextual information corresponding to the respective message.
In accordance with a first determination that the legitimate URL and the potentially suspicious URL in the respective message are not identical, the server system determines (608) a level of similarity between the legitimate URL and the potentially suspicious URL. In some embodiments, server-side module 106 or a component thereof (e.g., URL similarity module 226, Figure 2) determines whether the legitimate URL and the potentially suspicious URL are identical (i.e., the two URLs are composed of identical characters or have the same destination). For example, if the URLs are identical, then the link in the respective message is also a legitimate URL. In some embodiments, after determining that the potentially suspicious URL and the legitimate URL are not identical, server-side module 106 or a component thereof (e.g., URL similarity module 226, Figure 2) determines a level of similarity between the legitimate URL and the potentially suspicious URL.
In accordance with a second determination (610) that the level of similarity exceeds a first predetermined threshold, the server system: identifies (612) the potentially suspicious URL in the respective message as a suspicious URL; and performs (614) a security risk determination process on a first website corresponding to the suspicious URL. In some embodiments, when the level of similarity between the legitimate URL and the potentially suspicious URL exceeds a first predetermined threshold, server-side module 106 or a component thereof (e.g., security risk determination module 228, Figure 2) identifies the potentially suspicious URL as a suspicious URL and performs a security risk determination process on a first website corresponding to the suspicious URL. In some embodiments, the level of similarity between the legitimate URL and the potentially suspicious URL exceeds the first predetermined threshold when the two URLs share X percent of their characters, where X is a predetermined value. For example, when identifying the potentially suspicious URL as a suspicious URL, the suspicious URL included in the respective message is stored in a list of suspicious URLs. For example, the server system performs the security risk determination process to determine whether the website associated with the link is a website with malicious intent, such as a phishing website.
In some embodiments, performing the security risk determination process on the first website corresponding to the suspicious URL comprises (616): determining a security risk factor for the first website corresponding to the suspicious URL; and, in accordance with a third determination that the security risk factor for the first website meets a second predefined criterion, identifying the first website corresponding to the suspicious URL as a fraudulent website. In some embodiments, server-side module 106 or a component thereof (e.g., security risk factor module 232, Figure 2) determines a security risk factor for the first website. In some embodiments, the security risk factor is determined according to a predefined algorithm by security risk factor module 232 (Figure 2). In some embodiments, the predefined algorithm includes a first content similarity metric (e.g., calculated in step 618 by first content similarity module 236, Figure 2), a second content similarity metric (e.g., calculated in step 622 by second content similarity module 240, Figure 2), and/or a count of links in the first website that are either broken links or improper links (e.g., determined in step 624 by link integrity module 242, Figure 2). In some embodiments, the predefined algorithm also takes into account whether the first website includes one or more text entry fields or the ability for a user to enter data. For example, if the first website does not include a text entry field, it is less likely that the first website is a phishing website.
In some embodiments, in accordance with a determination that the security risk factor satisfies a second predefined criterion, server-side module 106 or a component thereof (e.g., security risk determination module 228, Figure 2) identifies the first website as a fraudulent website. In some embodiments, the second predefined criterion is satisfied when the security risk factor exceeds a predetermined value. For example, the fraudulent website is a phishing website used to gather login or financial credentials of users associated with the social networking platform. In some embodiments, after identifying the first website as a fraudulent website, server-side module 106 or a component thereof (e.g., security risk determination module 228, Figure 2) stores the suspicious URL and the IP address of the fraudulent website and, optionally, the screenshot of the fraudulent website in fraudulent website database 116.
In some embodiments, the server system determines the security risk factor by (618): retrieving text information from the first website corresponding to the suspicious URL; extracting one or more sensitive keywords from the text information, where the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords; and determining a first content similarity metric based on common keywords between the extracted one or more sensitive keywords from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL, where the security risk factor is based at least in part on the first content similarity metric. In some embodiments, server-side module 106 retrieves text information from the HTML code of the first website, and, subsequently, server-side module 106 or a component thereof (e.g., keyword extraction module 234, Figure 2) extracts keywords from the text information that match the predefined list of keywords 252. For example, list of keywords 252 includes keywords included in websites that are most often the target of phishing, such as “payment,” “password,” “bank,” and the like. In some embodiments, server-side module 106 or a component thereof (e.g., first content similarity module 236, Figure 2) determines a first content similarity metric based on common keywords between the keywords extracted from the text information of the first website and keywords included in the legitimate website corresponding to the legitimate URL. For example, the first content similarity metric has a higher value when the ratio of common keywords is close to 1:1.
In some embodiments, prior to extracting one or more sensitive keywords from the text information, the server system (620): identifies a portion of the retrieved text information that is not visible to a viewer of the first website corresponding to the suspicious URL; and removes the portion of the retrieved text information that is not visible to the viewer of the first website from the text information. In some embodiments, the non-visible portions of the first website include text whose font color is the same as the background color of the first website. For example, some phishing websites deliberately add text different from the legitimate website so as to avoid safety detection, where the added text has the same color as the background so as to be invisible to users.
In some embodiments, the server system determines the security risk factor by (622): obtaining a screenshot of the first website corresponding to the suspicious URL; and determining a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL, where the security risk factor is based at least in part on the second content similarity metric. In some embodiments, server-side module 106 or a component thereof (e.g., screenshot module 238, Figure 2) captures a screenshot of the first website. In some embodiments, server-side module 106 or a component thereof (e.g., second content similarity module 240, Figure 2) determines a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL. In some embodiments, the second content similarity metric is a pixel similarity ratio calculated by matching the two screenshots pixel-by-pixel.
In some embodiments, server-side module 106 only calculates the second content similarity metric when the value of the first content similarity metric is less than a predetermined threshold. In some embodiments, first content similarity module 236 (Figure 2) and second content similarity module 240 (Figure 2) determine the first and second content similarity metrics in parallel.
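The pixel similarity ratio described above can be illustrated as follows. Screenshots are modeled here as flat lists of RGB tuples; the optional per-channel tolerance is an assumption of this sketch, and screenshot capture itself (e.g., with a headless browser) is out of scope.

```python
def pixel_similarity_ratio(shot_a, shot_b, tolerance: int = 0) -> float:
    """Fraction of pixel positions whose RGB channels each differ by <= tolerance."""
    if len(shot_a) != len(shot_b) or not shot_a:
        return 0.0  # differently sized (or empty) screenshots are not comparable
    similar = sum(
        1
        for pa, pb in zip(shot_a, shot_b)
        if all(abs(ca - cb) <= tolerance for ca, cb in zip(pa, pb))
    )
    return similar / len(shot_a)
```

A near-pixel-perfect clone of a login page would yield a ratio close to 1.0, pushing the security risk factor upward.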
In some embodiments, the server system determines the security risk factor by (624) searching for links contained in the first website that are either broken links or improper links, where the security risk factor is based at least in part on a count of links in the first website that are either broken links or improper links. In some embodiments, server-side module 106 or a component thereof (e.g., link integrity module 242, Figure 2) determines a count of links in the first website that are either broken links or improper links. In some embodiments, a link is improper when the destination page has suspiciously low similarity to the first website. For example, the first website pertains to banking whereas the destination page pertains to a wholly different subject. In some embodiments, a link is improper when more than X links in the first website link to the same destination page. In some embodiments, a link is improper when the destination page has minimal relatedness to the anchor text for the link. In some embodiments, a small number of broken or improper links bears little weight in determining the security risk factor. However, if the count of broken and/or improper links exceeds a threshold count, it carries greater weight in the determination of the security risk factor.
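A hedged sketch of this link-integrity count is shown below. It treats a link as broken when its HTTP status is outside the 2xx range and as improper when too many links share one destination (one of the improper-link heuristics above); the duplicate threshold of 3 is an illustrative assumption, and the status-fetching function is injected so the sketch stays self-contained.

```python
from collections import Counter

def count_suspect_links(links, fetch_status, duplicate_threshold: int = 3) -> int:
    """links: list of destination URLs; fetch_status: callable URL -> HTTP status code."""
    # Broken: the destination does not respond with a 2xx status.
    broken = [url for url in links if not 200 <= fetch_status(url) < 300]
    # Improper: more than duplicate_threshold links point at the same destination.
    dest_counts = Counter(links)
    improper = [
        url for url in links
        if url not in broken and dest_counts[url] > duplicate_threshold
    ]
    return len(broken) + len(improper)
```

In use, this count would feed into the security risk factor, with weight growing once it exceeds the threshold count mentioned above.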
In some embodiments, the sender of the message and/or the message with the link to the identified fraudulent website is used as a clue to identify other messages with the same suspicious URL to the fraudulent website or different URLs whose destination is the same fraudulent website. In some embodiments, the sender of the message and/or the message with the link to the identified fraudulent website is used as a clue to identify other compromised or deceptive users sending the same message or messages with the same link to the identified fraudulent website.
In some embodiments, after identifying the first website corresponding to the suspicious URL as a fraudulent website, the server system sends (626) a notification to a sender of the respective message indicating that a user account of the sender for the social networking platform may be compromised. In some embodiments, server-side module 106 or a component thereof (e.g., subsequent action(s) module 244, Figure 2) sends a notification to an email address, phone number, etc. registered with the user account that sent the respective message. For example, the notification indicates that the user account may have been hacked or compromised. For example, instead of using a throwaway account, an entity associated with a phishing website hacks a legitimate account or emulates a legitimate account.
In some embodiments, after identifying the first website corresponding to the suspicious URL as a fraudulent website, the server system (628): identifies one or more other messages sent by one or more other users of the social networking platform that did not satisfy the predefined criterion but include the suspicious URL; and, for a respective one of the one or more other users, sends a notification indicating that a respective user account of the other user for the social networking platform may be compromised. In some embodiments, server-side module 106 or a component thereof (e.g., subsequent action(s) module 244, Figure 2) sends a notification to an email address, phone number, etc. registered with the user accounts that sent messages over the social networking platform that did not previously satisfy the predefined first criterion but include the suspicious URL to the fraudulent website.
In some embodiments, in response to identifying the first website as a fraudulent website, server-side module 106 or a component thereof (e.g., subsequent action(s) module 244, Figure 2) deletes messages sent over the social networking platform that include the suspicious URL to the fraudulent website or messages with other URLs whose destination is the fraudulent website. Alternatively, in some embodiments, server-side module 106 or a component thereof (e.g., subsequent action(s) module 244, Figure 2) sends a message to recipients of the respective message indicating that the website corresponding to the URL included in the respective message is associated with a security risk and that the sender’s account may have been compromised.
Figure 7 is a block diagram of a server-side module in accordance with some embodiments. In some embodiments, server-side module 106 is used to implement methods 400 and 500 of identifying fraudulent websites as described in Figures 4 and 5, respectively. In some embodiments, server-side module 106 includes the following components: obtaining unit 702, first content similarity unit 704, second content similarity unit 706, and identifying unit 708.
In some embodiments, obtaining unit 702 is configured to obtain the URL and content of a target website on which server-side module 106 is performing a security risk determination process in order to determine whether the target website is a fraudulent website (e.g., a phishing website).
In some embodiments, first content similarity unit 704 is configured to determine a first content similarity metric between the URL of the target website and the URL of the legitimate website.
In some embodiments, second content similarity unit 706 is configured to determine a second content similarity metric between the content of the target website and the content of the legitimate website.
In some embodiments, second content similarity unit 706 includes the following subunits: analyzing subunit 722, filtering subunit 724, extracting subunit 726, and matching subunit 728.
In some embodiments, analyzing subunit 722 is configured to analyze the HTML of the target website and to extract text information from the HTML.
In some embodiments, filtering subunit 724 is configured to filter out invisible portions of the text information.
In some embodiments, extracting subunit 726 is configured to extract keywords from the text information.
In some embodiments, matching subunit 728 is configured to perform a matching process between the extracted keywords and the keywords of the legitimate website so as to determine a second content similarity metric between the content of the target website and the content of the legitimate website.
In some embodiments, identifying unit 708 is configured to calculate a similarity score based at least in part on the first and/or second content similarity metrics and to identify the target website as a fraudulent website when the similarity score exceeds a predetermined value.
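As a rough illustration of the scoring performed by identifying unit 708, the sketch below combines two similarity metrics into a single score with a weighted sum and flags the target when the score exceeds a predetermined value. The weights and the 0.8 threshold are hypothetical values chosen for this sketch, not values taken from the disclosure.

```python
def is_fraudulent(url_similarity: float,
                  keyword_similarity: float,
                  threshold: float = 0.8,
                  url_weight: float = 0.5) -> bool:
    """Flag the target website when the weighted similarity score exceeds the threshold."""
    score = url_weight * url_similarity + (1 - url_weight) * keyword_similarity
    return score > threshold
```

Under these assumed weights, a near-identical URL combined with heavy keyword overlap trips the detector, while a dissimilar URL with little shared content does not.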
In some embodiments, server-side module 106 further includes a screenshot unit 732 and a third content similarity unit 734.
In some embodiments, screenshot unit 732 is configured to obtain a screenshot of the target website.
In some embodiments, third content similarity unit 734 is configured to determine a third content similarity metric between the screenshot of the target website and a screenshot of the legitimate website.
While particular embodiments are described above, it will be understood that it is not intended to limit the application to these particular embodiments. On the contrary, the application includes alternatives, modifications and equivalents that are within the spirit and scope of the appended claims. Numerous specific details are set forth in order to provide a thorough understanding of the subject matter presented herein. However, it will be apparent to one of ordinary skill in the art that the subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.

Claims (20)

  1. A method of monitoring messages that link to fraudulent websites, comprising: at a server system of a social networking platform, the server system comprising one or more processors and memory:
    identifying, among a plurality of messages sent over the social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator (URL);
    determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the respective message;
    in accordance with a first determination that the legitimate URL and the potentially suspicious URL in the respective message are not identical, determining a level of similarity between the legitimate URL and the potentially suspicious URL; and
    in accordance with a second determination that the level of similarity exceeds a first predetermined threshold:
    identifying the potentially suspicious URL in the respective message as a suspicious URL; and
    performing a security risk determination process on a first website corresponding to the suspicious URL.
  2. The method of claim 1, wherein the predefined first criterion includes one or more of: online activities associated with a sender of the respective message, content of the respective message, and a location from which the respective message was sent.
  3. The method of any of claims 1-2, wherein performing the security risk determination process on the first website corresponding to the suspicious URL comprises:
    determining a security risk factor for the first website corresponding to the suspicious URL; and
    in accordance with a third determination that the security risk factor for the first website meets a second predefined criterion, identifying the first website corresponding to the suspicious URL as a fraudulent website.
  4. The method of claim 3, wherein determining the security risk factor comprises:
    retrieving text information from the first website corresponding to the suspicious URL;
    extracting one or more sensitive keywords from the text information, wherein the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords; and
    determining a first content similarity metric based on common keywords between the extracted one or more sensitive keywords from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL,
    wherein the security risk factor is based at least in part on the first content similarity metric.
  5. The method of claim 4, further comprising:
    prior to extracting one or more sensitive keywords from the text information:
    identifying a portion of the retrieved text information that is not visible to a viewer of the first website corresponding to the suspicious URL; and
    removing the portion of the retrieved text information that is not visible to the viewer of the first website from the text information.
  6. The method of any of claims 3-5, wherein determining the security risk factor comprises:
    obtaining a screenshot of the first website corresponding to the suspicious URL; and
    determining a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL,
    wherein the security risk factor is based at least in part on the second content similarity metric.
  7. The method of any of claims 3-6, wherein determining the security risk factor comprises:
    searching for links contained in the first website that are either broken links or improper links,
    wherein the security risk factor is based at least in part on a count of links in the first website that are either broken links or improper links.
  8. A server system, comprising:
    one or more processors; and
    memory storing one or more programs to be executed by the one or more processors, the one or more programs comprising instructions for:
    identifying, among a plurality of messages sent over a social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator (URL);
    determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the respective message;
    in accordance with a first determination that the legitimate URL and the potentially suspicious URL in the respective message are not identical, determining a level of similarity between the legitimate URL and the potentially suspicious URL; and
    in accordance with a second determination that the level of similarity exceeds a first predetermined threshold:
    identifying the potentially suspicious URL in the respective message as a suspicious URL; and
    performing a security risk determination process on a first website corresponding to the suspicious URL.
  9. The server system of claim 8, wherein the predefined first criterion includes one or more of: online activities associated with a sender of the respective message, content of the respective message, and a location from which the respective message was sent.
  10. The server system of any of claims 8-9, wherein performing the security risk determination process on the first website corresponding to the suspicious URL comprises:
    determining a security risk factor for the first website corresponding to the suspicious URL; and
    in accordance with a third determination that the security risk factor for the first website meets a second predefined criterion, identifying the first website corresponding to the suspicious URL as a fraudulent website.
  11. The server system of claim 10, wherein determining the security risk factor comprises:
    retrieving text information from the first website corresponding to the suspicious URL;
    extracting one or more sensitive keywords from the text information, wherein the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords; and
    determining a first content similarity metric based on common keywords between the extracted one or more sensitive keywords from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL,
    wherein the security risk factor is based at least in part on the first content similarity metric.
  12. The server system of claim 11, wherein the one or more programs further comprise instructions for:
    prior to extracting one or more sensitive keywords from the text information:
    identifying a portion of the retrieved text information that is not visible to a viewer of the first website corresponding to the suspicious URL; and
    removing the portion of the retrieved text information that is not visible to the viewer of the first website from the text information.
  13. The server system of any of claims 10-12, wherein determining the security risk factor comprises:
    obtaining a screenshot of the first website corresponding to the suspicious URL; and
    determining a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL,
    wherein the security risk factor is based at least in part on the second content similarity metric.
  14. The server system of any of claims 10-13, wherein determining the security risk factor comprises:
    searching for links contained in the first website that are either broken links or improper links,
    wherein the security risk factor is based at least in part on a count of links in the first website that are either broken links or improper links.
  15. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which, when executed by a server system with one or more processors and a display, cause the server system to perform operations comprising:
    identifying, among a plurality of messages sent over a social networking platform, a respective message that satisfies a predefined first criterion indicating that the respective message includes a potentially suspicious Uniform Resource Locator (URL);
    determining a legitimate URL that corresponds to the potentially suspicious URL in the respective message based at least in part on contextual information corresponding to the respective message;
    in accordance with a first determination that the legitimate URL and the potentially suspicious URL in the respective message are not identical, determining a level of similarity between the legitimate URL and the potentially suspicious URL; and
    in accordance with a second determination that the level of similarity exceeds a first predetermined threshold:
    identifying the potentially suspicious URL in the respective message as a suspicious URL; and
    performing a security risk determination process on a first website corresponding to the suspicious URL.
  16. The non-transitory computer readable storage medium of claim 15, wherein the predefined first criterion includes one or more of: online activities associated with a sender of the respective message, content of the respective message, and a location from which the respective message was sent.
  17. The non-transitory computer readable storage medium of any of claims 15-16, wherein performing the security risk determination process on the first website corresponding to the suspicious URL comprises:
    determining a security risk factor for the first website corresponding to the suspicious URL; and
    in accordance with a third determination that the security risk factor for the first website meets a second predefined criterion, identifying the first website corresponding to the suspicious URL as a fraudulent website.
  18. The non-transitory computer readable storage medium of claim 17, wherein determining the security risk factor comprises:
    retrieving text information from the first website corresponding to the suspicious URL;
    extracting one or more sensitive keywords from the text information, wherein the extracted one or more sensitive keywords are included in a predetermined group of sensitive keywords; and
    determining a first content similarity metric based on common keywords between the extracted one or more sensitive keywords from the text information retrieved from the first website and sensitive keywords included in a legitimate website corresponding to the legitimate URL,
    wherein the security risk factor is based at least in part on the first content similarity metric.
  19. The non-transitory computer readable storage medium of any of claims 17-18, wherein determining the security risk factor comprises:
    obtaining a screenshot of the first website corresponding to the suspicious URL; and
    determining a second content similarity metric based on a number or ratio of similar pixels between the screenshot of the first website and a screenshot of a legitimate website corresponding to the legitimate URL,
    wherein the security risk factor is based at least in part on the second content similarity metric.
  20. The non-transitory computer readable storage medium of any of claims 17-19, wherein determining the security risk factor comprises:
    searching for links contained in the first website that are either broken links or improper links,
    wherein the security risk factor is based at least in part on a count of links in the first website that are either broken links or improper links.
PCT/CN2014/085529 2013-09-23 2014-08-29 Method and system for identifying fraudulent websites priority claim and related application WO2015039553A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310443265.4A CN104462152B (en) 2013-09-23 2013-09-23 A kind of recognition methods of webpage and device
CN201310443265.4 2013-09-23

Publications (1)

Publication Number Publication Date
WO2015039553A1 true WO2015039553A1 (en) 2015-03-26

Family

ID=52688217

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/085529 WO2015039553A1 (en) 2013-09-23 2014-08-29 Method and system for identifying fraudulent websites priority claim and related application

Country Status (2)

Country Link
CN (1) CN104462152B (en)
WO (1) WO2015039553A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106936778A (en) * 2015-12-29 2017-07-07 北京国双科技有限公司 The abnormal detection method of website traffic and device
US10601866B2 (en) * 2017-08-23 2020-03-24 International Business Machines Corporation Discovering website phishing attacks
CN113221032A (en) * 2021-04-08 2021-08-06 北京智奇数美科技有限公司 Link risk detection method, device and storage medium

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105763543B (en) * 2016-02-03 2019-08-30 百度在线网络技术(北京)有限公司 A kind of method and device identifying fishing website
CN106021383A (en) * 2016-05-11 2016-10-12 乐视控股(北京)有限公司 Method and device for computing similarity of webpages
CN106055574B (en) * 2016-05-19 2019-12-24 微梦创科网络科技(中国)有限公司 Method and device for identifying illegal uniform resource identifier (URL)
CN106227823A (en) * 2016-07-21 2016-12-14 知几科技(深圳)有限公司 A kind of webpage update detection method, info web capture and rendering method
CN106371988A (en) * 2016-08-22 2017-02-01 浪潮(北京)电子信息产业有限公司 Automatic interface test method and device
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial fishing webpage detection method based on Web page characteristics
CN107181730A (en) * 2017-03-13 2017-09-19 烟台中科网络技术研究所 A kind of counterfeit website monitoring recognition methods and system
CN108009232A (en) * 2017-11-29 2018-05-08 北京小米移动软件有限公司 Advertisement shields method and device
CN108234474A (en) * 2017-12-28 2018-06-29 北京奇虎科技有限公司 A kind of method and apparatus of website identification
CN108304480B (en) * 2017-12-29 2020-08-04 东软集团股份有限公司 Text similarity determination method, device and equipment
CN108154031B (en) * 2018-01-17 2021-08-06 腾讯科技(深圳)有限公司 Method, device, storage medium and electronic device for identifying disguised application
CN108306878A (en) * 2018-01-30 2018-07-20 平安科技(深圳)有限公司 Detection method for phishing site, device, computer equipment and storage medium
CN109062981B (en) * 2018-07-01 2021-09-10 国网湖北省电力有限公司信息通信公司 Website similarity detection method
CN108810025A (en) * 2018-07-19 2018-11-13 平安科技(深圳)有限公司 A kind of security assessment method of darknet, server and computer-readable medium
CN108881517B (en) * 2018-08-01 2021-08-24 北京闲徕互娱网络科技有限公司 Domain name pool automatic management method and system
CN112149101A (en) * 2019-06-28 2020-12-29 北京智明星通科技股份有限公司 False game APP identification method and system
CN114124564B (en) * 2021-12-03 2023-11-28 北京天融信网络安全技术有限公司 Method and device for detecting counterfeit website, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510887A (en) * 2009-03-27 2009-08-19 腾讯科技(深圳)有限公司 Method and device for identifying website
CN102082792A (en) * 2010-12-31 2011-06-01 成都市华为赛门铁克科技有限公司 Phishing webpage detection method and device
CN102523210A (en) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Phishing website detection method and device
CN102647422A (en) * 2012-04-10 2012-08-22 中国科学院计算机网络信息中心 Phishing website detection method and device
WO2013009713A2 (en) * 2011-07-08 2013-01-17 Uab Research Foundation Syntactical fingerprinting
CN102932348A (en) * 2012-10-30 2013-02-13 常州大学 Real-time detection method and system of phishing website

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4822863B2 (en) * 2006-02-08 2011-11-24 富士通株式会社 Numerical analysis data creation method and apparatus, program, and storage medium
CN100550038C (en) * 2007-12-26 2009-10-14 腾讯科技(深圳)有限公司 Image content recognizing method and recognition system
EP2606438A4 (en) * 2010-08-20 2014-06-11 Hewlett Packard Development Co Systems and methods for filtering web page contents
CN102647408A (en) * 2012-02-27 2012-08-22 珠海市君天电子科技有限公司 Method for judging phishing website based on content analysis


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106936778A (en) * 2015-12-29 2017-07-07 北京国双科技有限公司 The abnormal detection method of website traffic and device
CN106936778B (en) * 2015-12-29 2020-05-05 北京国双科技有限公司 Method and device for detecting abnormal website traffic
US10601866B2 (en) * 2017-08-23 2020-03-24 International Business Machines Corporation Discovering website phishing attacks
CN113221032A (en) * 2021-04-08 2021-08-06 北京智奇数美科技有限公司 Link risk detection method, device and storage medium

Also Published As

Publication number Publication date
CN104462152B (en) 2019-04-09
CN104462152A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
WO2015039553A1 (en) Method and system for identifying fraudulent websites priority claim and related application
US10554655B2 (en) Method and system for verifying an account operation
US11403400B2 (en) Troll account detection
US9762588B2 (en) Imposter account detection and remediation in a social networking system
US11256812B2 (en) End user social network protection portal
US10601865B1 (en) Detection of credential spearphishing attacks using email analysis
US9027134B2 (en) Social threat scoring
US9674214B2 (en) Social network profile data removal
EP2859495B1 (en) Malicious message detection and processing
US11418527B2 (en) Malicious social media account identification
US9055097B1 (en) Social network scanning
US9674212B2 (en) Social network data removal
US10187419B2 (en) Method and system for processing notification messages of a website
US11671448B2 (en) Phishing detection using uniform resource locators
US20140337973A1 (en) Social risk management
US10789614B2 (en) Method and system for issuing recommended information
US11381598B2 (en) Phishing detection using certificates associated with uniform resource locators
US11394722B2 (en) Social media rule engine
US20140096242A1 (en) Method, system and client terminal for detection of phishing websites
US20180032599A1 (en) Grouped categorization of internet content
US11165801B2 (en) Social threat correlation
US9942255B1 (en) Method and system for detecting abusive behavior in hosted services
US20210203693A1 (en) Phishing detection based on modeling of web page content
US20210105302A1 (en) Systems And Methods For Determining User Intent At A Website And Responding To The User Intent
US9774625B2 (en) Phishing detection by login page census

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14846486

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.08.2016)

122 Ep: pct application non-entry in european phase

Ref document number: 14846486

Country of ref document: EP

Kind code of ref document: A1