TWI595373B - Method and system for identifying suspected phishing websites

Info

Publication number: TWI595373B
Application number: TW098119737A
Authority: TW
Inventors: li-ming Zhang; Po Wen; yong-wei Kong
Original assignee: Alibaba Group Holding Ltd
Priority date: 2009-06-12
Filing date: 2009-06-12
Publication date: 2017-08-11
Also published as: TW201044212A

Description

Method and system for identifying suspected phishing websites

The invention relates to the field of computer technology, and in particular to a method and system for identifying suspected phishing websites.

With the development of network technology, instant messaging (IM) tools have gradually become an important means for users to conduct online transactions and e-commerce. Some illicit websites, however, use addresses that closely resemble those of legitimate websites in order to win users' trust and harm their interests.

The industry currently shares the following understanding. Phishing website: a website whose address closely resembles that of a legitimate commercial website and whose purpose is to harm users' interests.

Phishing-website list: a list of addresses known to have been determined to be phishing websites. Entries are usually obtained through user complaints or manual screening, and the listed websites have generally already harmed users' interests.

Protected-website list: a list of the legitimate websites that need protection. These are generally websites that appear frequently in online transactions or e-commerce, such as Taobao, Alibaba, and Alipay, and they are also the websites most likely to be imitated.

Existing identification techniques provide database lookup of known legitimate or phishing websites; that is, they identify legitimate and/or phishing websites by querying the protected-website list and/or the phishing-website list. Although these techniques can identify known phishing websites, offenders can keep defrauding users simply by registering a new website address. Moreover, the database can be updated only after a report is received or an incident has occurred, so early identification and risk warning are not possible. In other words, existing identification is an exact match: a website address can be identified only after it has been stored in the database.

Embodiments of the present invention provide a method and system for identifying suspected phishing websites, so as to identify them in advance and reduce the probability that users visit phishing websites.

The invention discloses a method for identifying a suspected phishing website, comprising: a device obtaining a website address to be identified; and, after determining from the website address to be identified that the website is neither a legitimate website to be protected nor a phishing website, matching the website address to be identified against suspected-phishing-website rules using a second regular expression, and, if the match succeeds, determining that the website address to be identified is a suspected phishing website.

The step of the device obtaining the website address to be identified includes: matching any string and/or text obtained by the device against a preset first regular expression that captures the characteristics of a uniform resource locator (URL), and obtaining the website address to be identified from the match result; or, if the string and/or text obtained by the device already carries URL information, obtaining the website address to be identified directly from that string and/or text.

The step of matching the website address to be identified against the suspected-phishing-website rules using the second regular expression includes:

01) extracting host uniform resource locator information from the obtained website address to be identified;

02) determining whether the host uniform resource locator information contains interference characters; if so, performing step 03); if not, taking the extracted host uniform resource locator information as the keyword to be compared and then performing step 04);

03) deleting the interference characters from the extracted host uniform resource locator information, and taking the host uniform resource locator information with the interference characters removed as the keyword to be compared;

04) matching the keyword to be compared against the suspected-phishing-website rules using the second regular expression.

The interference characters include one or any combination of: underscore, minus sign, space, and dot.

The step of determining from the website address that the website is neither a legitimate website to be protected nor a phishing website includes:

determining whether the website address to be identified is in a preset protected-website list and, if it is not, determining that the address does not belong to a legitimate website to be protected; and determining whether the website address to be identified is in a preset phishing-website list and, if it is not, determining that the address is not a phishing website.

The device is a client device or a network-side server.

The client device includes an instant messaging tool and a mobile terminal.

The method further includes: the device notifying the user of the determination result.

The invention also discloses an apparatus for identifying a suspected phishing website, comprising: a website address obtaining unit, configured to obtain a website address to be identified; and a website address processing unit, configured to determine, from the website address to be identified, that the website is neither a legitimate website to be protected nor a phishing website, and, after the website address to be identified successfully matches the suspected-phishing-website rules using a second regular expression, to determine that the website address to be identified is a suspected phishing website.

The website address obtaining unit includes: a first website address obtaining unit, configured to match any string and/or text obtained by the device against a preset first regular expression that captures the characteristics of a uniform resource locator (URL), and to obtain the website address to be identified from the match result; and a second website address obtaining unit, configured to obtain the website address to be identified directly from the string and/or text when the string and/or text obtained by the device already carries URL information.

The website address processing unit includes: a legitimate-website determining unit, configured to determine that the website address to be identified does not belong to a legitimate website to be protected after determining that it is not in the preset protected-website list; a phishing-website determining unit, configured to determine that the website address to be identified is not a phishing website after determining that it is not in the preset phishing-website list; and a suspected-website determining unit, configured to determine that the website address to be identified is a suspected phishing website after it successfully matches the suspected-phishing-website rules using the second regular expression.

The suspected-website determining unit includes: an extracting unit, configured to extract host uniform resource locator information from the obtained website address to be identified; a keyword obtaining unit, configured to take the extracted host uniform resource locator information as the keyword to be compared when no interference characters are present, and, when interference characters are present, to delete them from the extracted host uniform resource locator information and take the result as the keyword to be compared; and a matching unit, configured to determine that the website address to be identified is a suspected phishing website after the keyword to be compared successfully matches the suspected-phishing-website rules using the second regular expression.

The apparatus is located in a client device or a network-side device.

The apparatus further includes: a prompting device, configured to notify the user of the determination result.

With the method and apparatus for identifying suspected phishing websites provided by the above embodiments, a suspected phishing website can be identified before the user suffers a loss. This achieves advance identification, reduces the probability of visiting a phishing website, provides an early risk warning, and minimizes possible losses.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

Some concepts are explained first. Suspected phishing website: a website whose address is named in the same manner as a phishing website, but for which it has not yet been determined whether it will harm users' interests. Take www.taopao.com as an example: until it is determined whether that website harms the interests of Taobao (www.taobao.com) users, it cannot be classified as a phishing website, but it can be defined as a suspected phishing website. A suspected phishing website will not necessarily harm users' interests, but an early warning is needed to reduce the probability that users visit a phishing website and to protect users' interests as far as possible.

FIG. 1 is a flowchart of a method for identifying a suspected phishing website according to an embodiment of the present invention. The method may be executed on the user terminal side or on the network side, and specifically includes:

Step 101: obtain the website address to be identified. The way in which the website address is obtained is not limited here. For example, the address may be obtained during a chat in instant messaging (IM) software, or from a user's personal signature. A website address obtained in any manner can be regarded as a website address to be identified.

The website address to be identified may be obtained in one or more application scenarios. In the field of instant messaging, the address may be obtained through an instant messaging tool in scenarios that include, but are not limited to, the following:

Scenario 1: when a user communicates through an instant messaging tool (including one-to-one chats, multi-user chat rooms, groups, and so on) and receives an instant message, a URL link can be obtained from the content of the message;

Scenario 2: when the user clicks the contact list, a group member list, or another form of contact list in the instant messaging tool, a URL link can be obtained from a contact's status field or signature area;

Scenario 3: when the user logs in to the communication software and receives an offline message (a message that arrived while the user was not logged in), a URL link can be obtained from the offline message;

Scenario 4: instant messaging software usually includes pop-up notifications, which generally appear as a window that pops up in the lower right corner of the system taskbar area; a URL link can be obtained from the content of the pop-up window.

In the field of browsers, the application scenarios include, but are not limited to, the following: when the user clicks any clickable element on a web page that carries a hyperlink, such as an image, text, or video, a URL link is obtained from the target of the clickable element.

Because the embodiment may be executed on the user terminal side or on the network side, the website address to be identified can be obtained in the same way on the client side and on the server side: any string and/or text available to the device is filtered with a preset first regular expression that captures the characteristics of a uniform resource locator (URL), and the URL is taken from the match result. The URL obtained in this way is the website address to be identified. Therefore, however the scenario changes, it is sufficient that a URL can be obtained by matching the string and/or text against the preset regular expression.

There is one exception: in the browser field, a clickable element (such as an image or text) already carries URL information, so no regular expression matching is needed and the URL that the element points to can be obtained directly.

For convenience of description, the regular expression used to obtain the URL from the string and/or text is referred to here as the first regular expression.
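As a minimal sketch of how a first regular expression might pull candidate addresses out of free text such as a chat message: the pattern `FIRST_REGEX` and the helper `extract_urls` below are illustrative assumptions, since the text does not give the actual expression.

```python
import re

# Hypothetical first regular expression (the source does not specify one):
# either "<scheme>://" or a bare "www." prefix, followed by a run of
# characters that are not whitespace, quotes, or angle brackets.
FIRST_REGEX = re.compile(r"(?:[A-Za-z][A-Za-z0-9+.-]*://|www\.)[^\s<>\"']+")


def extract_urls(text: str) -> list[str]:
    """Return every URL-like substring of text, in order of appearance."""
    # Trailing sentence punctuation is trimmed because it is rarely part of the URL.
    return [m.group(0).rstrip(".,;:!?)") for m in FIRST_REGEX.finditer(text)]


print(extract_urls("Cheap phones at http://www.tao-bao.com/item?id=1, hurry!"))
```

A real deployment would probably need a stricter pattern; this one only illustrates that any incoming string can be reduced to a list of addresses to be identified.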

Step 102: determine from the website address to be identified whether the website is a legitimate website to be protected or a phishing website. If it is neither, perform step 103; if it is either, the process ends.

Specifically: determine whether the website address to be identified is in the preset protected-website list; if it is not, the address does not belong to a legitimate website to be protected. Determine whether the website address to be identified is in the preset phishing-website list; if it is not, the address is not a phishing website.

The two determinations may be made in either order: the protected-website list may be checked first, or the phishing-website list may be checked first.

If the website address to be identified is in the preset protected-website list or in the preset phishing-website list, it can be determined to be a legitimate website or a phishing website respectively. The nature of the website is then already known, so the process can end directly without further steps.

Step 103: match the website address to be identified against the suspected-phishing-website rules using the second regular expression. If the match succeeds, the website address to be identified is determined to be a suspected phishing website; otherwise it is determined to be a non-suspected website.

Here, the regular expression used to match suspected phishing websites is referred to as the second regular expression.
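The decision order of steps 102 and 103 can be sketched as follows. This is an illustration under assumptions: `Verdict`, `classify`, and the `matches_suspect_rule` callback are hypothetical names, and the two lists are passed in as plain collections rather than read from a database.

```python
from enum import Enum
from typing import Callable, Collection


class Verdict(Enum):
    LEGITIMATE = "legitimate website to be protected"
    PHISHING = "phishing website"
    SUSPECTED = "suspected phishing website"
    NOT_SUSPECTED = "non-suspected website"


def classify(
    address: str,
    protected: Collection[str],
    phishing: Collection[str],
    matches_suspect_rule: Callable[[str], bool],
) -> Verdict:
    """Apply the exact-match list checks first, then the rule match."""
    # Step 102: the two list lookups may run in either order.
    if address in protected:
        return Verdict.LEGITIMATE
    if address in phishing:
        return Verdict.PHISHING
    # Step 103: only addresses found in neither list reach the rule match.
    if matches_suspect_rule(address):
        return Verdict.SUSPECTED
    return Verdict.NOT_SUSPECTED
```

Because the list checks run first, a rule that happens to match a protected address does no harm: that address never reaches the rule match.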

In addition, once the device has reached a determination, it may prompt the user in either of the following ways:

Method 1: prompt the user graphically. For example, if the website is determined to be a legitimate website to be protected, a “√” is drawn next to the website address; if it is determined to be a phishing website or a suspected phishing website, a “×” is drawn next to the address; if it is determined to be a non-suspected website, a “?” is drawn next to the address.

Method 2: prompt the user with text. For example, if the website is determined to be a legitimate website to be protected, the user is told that it is “clickable”; otherwise, the user is told that it is “unsafe” or “may be unsafe”.
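The two prompting methods amount to a lookup from the determination result to a symbol or a message. In the sketch below, the verdict keys and the `prompt_for` helper are hypothetical, and the choice of which non-legitimate result receives “unsafe” and which receives “may be unsafe” is an assumption, since the text does not assign them.

```python
# Hypothetical verdict keys mapped to (graphical symbol, text message).
# Assumption: "unsafe" is used for known phishing websites and
# "may be unsafe" for every other non-legitimate result.
PROMPTS = {
    "legitimate": ("√", "clickable"),
    "phishing": ("×", "unsafe"),
    "suspected": ("×", "may be unsafe"),
    "not_suspected": ("?", "may be unsafe"),
}


def prompt_for(verdict: str, graphical: bool = True) -> str:
    """Return the symbol (method 1) or the text message (method 2)."""
    symbol, message = PROMPTS[verdict]
    return symbol if graphical else message
```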

The above takes the terminal side as an example to explain how the user is informed of the result. The network side is similar, except that the result is first sent to the terminal side, which then prompts the user.

The following describes in detail how the website address to be identified is matched against the suspected-phishing-website rules using the second regular expression. FIG. 2 is a flowchart of this matching according to an embodiment of the present invention, and specifically includes:

Step 201: extract the host uniform resource locator (hosturl) information from the obtained website address to be identified. For example, if the obtained website address is Protocol://hosturl/pathurl, the path information, the protocol prefix, and so on are removed, and only the hosturl information is extracted.

Step 202: determine whether the hosturl information contains interference characters. If so, perform step 203; otherwise, perform step 204.

Interference characters are a common obfuscation technique used in imitation website addresses. They include separators such as the underscore (_), minus sign (-), space, and dot (.). In an implementation, the interference characters may be one or any combination of these.

Step 203: delete the interference characters from the extracted hosturl information and take the result as the keyword to be compared; then perform step 205.

Step 204: take the extracted hosturl information as the keyword to be compared; then perform step 205.

Step 205: match the keyword to be compared against the suspected-phishing-website rules using the second regular expression.
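Steps 201 to 204 can be sketched as two small functions. The names `extract_hosturl`, `to_keyword`, and `INTERFERENCE_CHARS` are hypothetical, and the host extraction is deliberately simplified: it drops the protocol prefix and everything from the first `/`, `?`, or `#`, and does not handle ports or user information.

```python
import re

# The four interference characters named in the text:
# underscore, minus sign, space, and dot.
INTERFERENCE_CHARS = "_- ."


def extract_hosturl(address: str) -> str:
    """Step 201: keep only the host part of Protocol://hosturl/pathurl."""
    rest = address.split("://", 1)[-1]              # drop the protocol prefix, if any
    host = re.split(r"[/?#]", rest, maxsplit=1)[0]  # drop path, query, and fragment
    return host.lower()


def to_keyword(hosturl: str) -> str:
    """Steps 202 to 204: produce the keyword to be compared."""
    if any(ch in INTERFERENCE_CHARS for ch in hosturl):   # step 202
        # Step 203: delete every interference character.
        return "".join(ch for ch in hosturl if ch not in INTERFERENCE_CHARS)
    return hosturl                                        # step 204


print(to_keyword(extract_hosturl("http://www.tao-bao.com/item/1")))
```

Note that deleting dots also joins the `www` and `com` labels to the name, so a rule applied in step 205 would have to search inside the keyword rather than match it as a whole.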

A uniform resource locator (URL), also called a web address, is the standard address of a resource on the Internet, and is specified in RFC 1738. A URL is a way of fully describing the address of a web page or other resource on the Internet. Every web page on the Internet has a unique identifier, usually called its URL. The address may refer to a local disk or to a computer on a local area network, but most often refers to a site on the Internet. Put simply, a URL is a Web address.

A suspected-phishing-website rule uses a regular expression to describe an important keyword in a host name together with its common variants. A keyword is a word, or combination of words, in the hosturl that represents the website; for example, taobao, alibaba, yahoo, and ebay can all be called keywords. The rules are not designed to be highly general. Instead, a regular expression is written by hand for each legitimate website in the protected-website list, which gives a simple and effective keyword-oriented match. These regular expressions form the suspected-phishing-website rules.

Common variants include, but are not limited to, the following forms:

1. Substituting characters that are hard to tell apart. For example, the letter O and the digit 0 are easily confused on a computer display, so taobao.com may be altered to taoba0.com (note that the second “o” is replaced by the digit 0 rather than the letter o);

2. Omitting letters of the English name whose absence does not affect reading, for example altering www.taobao.com to www.taoba.com;

3. Adding separator characters, for example altering www.taobao.com to www.tao-bao.com.

An example of a suspected-phishing-website rule is given below; the example looks for suspected phishing websites that target the website addresses of Alibaba and its subsidiaries.

Note that the hosturl supplied as input is already guaranteed to contain no interference characters (they are deleted before the regular expression matching is performed), so the regular expression that expresses the suspected-phishing-website rule does not need to account for them.
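To make the idea of a hand-written rule concrete, here is a hypothetical second regular expression for the keyword taobao. It is not the rule from the patent; `TAOBAO_RULE` and `is_suspected` are assumed names, and the pattern covers only the variants discussed in this description (the digit 0 for the letter o, an omitted final letter, and the b/p swap seen in www.taopao.com). The input is assumed to be a lowercase keyword from which interference characters have already been deleted.

```python
import re

# Hypothetical rule for the keyword "taobao":
#   [o0]   the letter o or the look-alike digit 0
#   [bp]   b, or the similar-looking p (as in taopao)
#   [o0]?  the final letter may be missing (as in taoba)
TAOBAO_RULE = re.compile(r"ta[o0][bp]a[o0]?")


def is_suspected(keyword: str) -> bool:
    """Return True if the rule matches anywhere inside the keyword."""
    return TAOBAO_RULE.search(keyword) is not None


for kw in ("wwwtaoba0com", "wwwtaobacom", "wwwtaopaocom", "wwwexampleorg"):
    print(kw, is_suspected(kw))
```

The rule also matches the keyword of the genuine www.taobao.com. That is acceptable in this scheme, because addresses in the protected-website list are filtered out before any rule is applied.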

The above embodiment may be executed by a user terminal or by a network-side server. The user terminal may be an instant messaging tool or a mobile terminal.

With the method for identifying suspected phishing websites provided by the above embodiment, a suspected phishing website can be identified before the user suffers a loss, which achieves advance identification. Notifying the user of the result reduces the probability of visiting a phishing website, provides an early risk warning, and minimizes possible losses.

The present invention also provides an apparatus for identifying a suspected phishing website. Referring to FIG. 3, the apparatus includes a website address obtaining unit 301 and a website address processing unit 302. The website address obtaining unit 301 is configured to obtain the website address to be identified.

The way in which the website address is obtained is not limited. For example, the address may be obtained during a chat in instant messaging (IM) software, or from a user's personal signature; a website address obtained in any manner can be regarded as a website address to be identified. The specific application scenarios are the same as those described above and are not repeated here.

The website address processing unit 302 is configured to determine, from the website address to be identified, that the website is neither a legitimate website to be protected nor a phishing website, and, after the website address to be identified successfully matches the suspected-phishing-website rules using the second regular expression, to determine that the website address to be identified is a suspected phishing website.

The apparatus may further include a prompting device, configured to notify the user of the determination result. If the apparatus is located on the terminal side, the prompting device can present the result to the user directly; if the apparatus is located on the network side, the prompting device first sends the result to the terminal, which displays it to the user.

FIG. 4 is a structural diagram of the website address processing unit according to an embodiment of the present invention. It may include a legitimate-website determining unit 3021, a phishing-website determining unit 3022, and a suspected-website determining unit 3023. The legitimate-website determining unit 3021 is configured to determine that the website address to be identified does not belong to a legitimate website to be protected after determining that it is not in the preset protected-website list. The phishing-website determining unit 3022 is configured to determine that the website address to be identified is not a phishing website after determining that it is not in the preset phishing-website list. The suspected-website determining unit 3023 is configured to determine that the website address to be identified is a suspected phishing website after it successfully matches the suspected-phishing-website rules using the second regular expression.

FIG. 5 is a structural diagram of the suspected-website determining unit according to an embodiment of the present invention. It may include an extracting unit 30231, a keyword obtaining unit 30232, and a matching unit 30233. The extracting unit 30231 is configured to extract the host uniform resource locator information from the obtained website address to be identified; specifically, the hosturl information can be extracted by removing the path information, the protocol prefix, and so on from the website address.

The keyword obtaining unit 30232 is configured to take the extracted host uniform resource locator information as the keyword to be compared when no interference characters are present, and, when interference characters are present, to delete them from the extracted host uniform resource locator information and take the result as the keyword to be compared. Interference characters are a common obfuscation technique used in imitation website addresses. They include separators such as the underscore (_), minus sign (-), space, and dot (.). In an implementation, the interference characters may be one or any combination of these.

The matching unit 30233 is configured to determine that the website address to be identified is a suspected phishing website after the second regular expression matching between the keyword to be compared and the suspected phishing website rule succeeds.
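The three sub-units of FIG. 5 can be sketched as three small functions. This is only an illustrative sketch under stated assumptions: the function names `extract_host`, `to_keyword`, and `is_suspected`, the constant names, and the rule pattern are hypothetical and not taken from the specification. Only the set of interference characters (underscore, minus sign, space, dot) comes from the text.

```python
import re

# Interference characters named in the description: underscore, minus sign, space, dot.
INTERFERENCE_CHARS = "_- ."

# Hypothetical suspected phishing website rule (the "second regular expression").
SUSPECT_RULE = re.compile(r"taobao|alipay", re.IGNORECASE)


def extract_host(address: str) -> str:
    """Extracting unit 30231: drop the protocol prefix and the path information."""
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", address.strip())
    # Everything from the first '/', '?' or '#' onward is path/query/fragment.
    return re.split(r"[/?#]", without_scheme, maxsplit=1)[0]


def to_keyword(host: str) -> str:
    """Keyword obtaining unit 30232: delete interference characters if any are present."""
    if any(ch in INTERFERENCE_CHARS for ch in host):
        return "".join(ch for ch in host if ch not in INTERFERENCE_CHARS)
    return host  # no interference character: the host itself is the keyword


def is_suspected(address: str) -> bool:
    """Matching unit 30233: match the keyword against the suspected phishing rule."""
    return SUSPECT_RULE.search(to_keyword(extract_host(address))) is not None
```

For example, `http://www.tao-bao.example/login` yields the host `www.tao-bao.example` and the keyword `wwwtaobaoexample`, which matches the hypothetical rule. Because the keyword of a genuine protected site would match as well, this check is meaningful only after the protected-list and phishing-list checks have already ruled the address out.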

FIG. 6 is a structural diagram of the website address obtaining unit according to an embodiment of the present invention. It may include a first website address obtaining unit 3011 and a second website address obtaining unit 3012. The first website address obtaining unit 3011 is configured to match any string and/or text obtained by the device against a preset first regular expression, according to the characteristics of a uniform resource locator (URL), and to obtain the website address to be identified from the matching result. The second website address obtaining unit 3012 is configured to obtain the website address to be identified directly from the string and/or text when any string and/or text obtained by the device already carries uniform resource locator information.
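The two acquisition paths of FIG. 6 might look like the sketch below. The specification does not disclose the first regular expression, so `URL_PATTERN` is an assumed, deliberately simple stand-in, and `obtain_addresses` is a hypothetical name. Treating "the input as a whole is a URL" as the condition for the second unit is also an interpretation made for this example.

```python
import re
from typing import List

# Assumed stand-in for the "first regular expression": tokens that start with
# http://, https:// or www. and run until whitespace or a quote/angle bracket.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def obtain_addresses(text: str) -> List[str]:
    """Return the website addresses to be identified found in `text`."""
    stripped = text.strip()
    # Second unit 3012: the input itself already is a URL, so take it directly.
    if URL_PATTERN.fullmatch(stripped):
        return [stripped]
    # First unit 3011: scan free text with the first regular expression and
    # trim trailing punctuation that belongs to the sentence, not the address.
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]
```

In an instant messaging setting, the first path would typically handle chat messages that embed a link in a sentence, while the second would handle input that consists of nothing but the link.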

The apparatus for identifying a suspected phishing website provided by the above embodiments of the present invention may be deployed on the user terminal side or on the network side; that is, the apparatus may be located in a user terminal or in a server on the network side. The user terminal may be an instant messaging tool or a mobile terminal.

By applying the apparatus for identifying a suspected phishing website provided by the above embodiments of the present invention, a suspected phishing website can be identified before the user suffers a loss, which achieves identification in advance. The invention notifies the user of the identification result, which lowers the probability of visiting a phishing website, gives a risk warning ahead of time, and minimizes possible losses.

For convenience of description, the parts of the above apparatus are divided by function into units and described separately. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software or hardware.

It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

301: Website address obtaining unit

302: Website address processing unit

3021: Regular website determining unit

3022: Phishing website determining unit

3023: Suspected website determining unit

30231: Extracting unit

30232: Keyword obtaining unit

30233: Matching unit

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for identifying a suspected phishing website according to an embodiment of the present invention;

FIG. 2 is a flowchart of regular expression matching between the website address to be identified and the suspected phishing website rule according to an embodiment of the present invention;

FIG. 3 is a structural diagram of an apparatus for identifying a suspected phishing website according to an embodiment of the present invention;

FIG. 4 is a structural diagram of the website address processing unit according to an embodiment of the present invention;

FIG. 5 is a structural diagram of the suspected website determining unit according to an embodiment of the present invention;

FIG. 6 is a structural diagram of the website address obtaining unit according to an embodiment of the present invention.

Claims

A method for identifying a suspected phishing website, comprising: obtaining a website address to be identified; and, after determining according to the website address to be identified that the website does not belong to a regular website to be protected and is not a phishing website, performing second regular expression matching between the website address to be identified and a suspected phishing website rule, and, if the matching succeeds, determining that the website address to be identified is a suspected phishing website, wherein performing the second regular expression matching between the website address to be identified and the suspected phishing website rule includes determining whether an interference character is present.

The method of claim 1, wherein the step of obtaining the website address to be identified comprises: matching any obtained string and/or text against a preset first regular expression according to the characteristics of a uniform resource locator (URL), and obtaining the website address to be identified from the matching result; or, if any obtained string and/or text already carries uniform resource locator information, obtaining the website address to be identified directly from the string and/or text.

The method of claim 1, wherein the step of performing the second regular expression matching between the website address to be identified and the suspected phishing website rule comprises: 01) extracting host uniform resource locator information from the obtained website address to be identified; 02) determining whether an interference character is present in the host uniform resource locator information; if so, performing step 03); if not, using the extracted host uniform resource locator information as the keyword to be compared and then performing step 04); 03) deleting the interference characters from the extracted host uniform resource locator information, and using the host uniform resource locator information with the interference characters deleted as the keyword to be compared; 04) performing the second regular expression matching between the keyword to be compared and the suspected phishing website rule.

The method of claim 3, wherein the interference character comprises one or any combination of an underscore, a minus sign, a space, and a dot.

The method of claim 1, wherein determining, according to the website address, that the website does not belong to a regular website to be protected and is not a phishing website comprises: determining whether the website address to be identified is in a preset list of websites to be protected, and, if it is not, determining that the obtained website address to be identified does not belong to a regular website to be protected; and determining whether the website address to be identified is in a preset phishing website list, and, if it is not, determining that the obtained website address to be identified is not a phishing website.

The method of claim 1, wherein the method further comprises: notifying the user of the result of the determination.

An apparatus for identifying a suspected phishing website, comprising: a website address obtaining unit, configured to obtain a website address to be identified; and a website address processing unit, configured to, after determining according to the website address to be identified that the website does not belong to a regular website to be protected and is not a phishing website, perform second regular expression matching between the website address to be identified and a suspected phishing website rule and, after the matching succeeds, determine that the website address to be identified is a suspected phishing website, wherein performing the second regular expression matching between the website address to be identified and the suspected phishing website rule includes determining whether an interference character is present.

The device of claim 7, wherein the website address obtaining unit comprises: a first website address obtaining unit, configured to match any string and/or text obtained by the device against a preset first regular expression according to the characteristics of a uniform resource locator (URL), and to obtain the website address to be identified from the matching result; and a second website address obtaining unit, configured to obtain the website address to be identified directly from the string and/or text when any string and/or text obtained by the device already carries uniform resource locator information.

The device of claim 7, wherein the website address processing unit comprises: a regular website determining unit, configured to determine that the website address to be identified does not belong to a regular website to be protected after determining that the address is not in a preset list of websites to be protected; a phishing website determining unit, configured to determine that the website address to be identified is not a phishing website after determining that the address is not in a preset phishing website list; and a suspected website determining unit, configured to determine that the website address to be identified is a suspected phishing website after the second regular expression matching between the website address to be identified and the suspected phishing website rule succeeds.

The device of claim 9, wherein the suspected website determining unit comprises: an extracting unit, configured to extract host uniform resource locator information from the obtained website address to be identified; a keyword obtaining unit, configured to use the extracted host uniform resource locator information as the keyword to be compared when no interference character is present and, when an interference character is present, to delete the interference characters from the extracted host uniform resource locator information and use the host uniform resource locator information with the interference characters deleted as the keyword to be compared; and a matching unit, configured to determine that the website address to be identified is a suspected phishing website after the second regular expression matching between the keyword to be compared and the suspected phishing website rule succeeds.

The device of claim 10, wherein the interference character comprises one or any combination of an underscore, a minus sign, a space, and a dot.

The device of claim 7, wherein the device is located at a client device or a network side device.

The device of claim 7, wherein the device further comprises: prompting means for notifying the user of the result of the determination.