WO2021258838A1

WO2021258838A1 - Phishing website detection method and apparatus, and device and computer readable storage medium

Info

Publication number: WO2021258838A1
Application number: PCT/CN2021/088986
Authority: WO
Inventors: 梁杰; 范渊
Original assignee: 杭州安恒信息技术股份有限公司
Priority date: 2020-06-22
Filing date: 2021-04-22
Publication date: 2021-12-30
Also published as: CN111756724A

Abstract

A phishing website detection method, a phishing website detection apparatus, a computer device, and a computer readable storage medium. The phishing website detection method comprises: obtaining multiple pieces of feature information of a website to be detected; obtaining a confidence value and a weight value corresponding to each piece of feature information; determining a weighted confidence value of said website according to the confidence values and the weight values respectively corresponding to the multiple pieces of feature information; and in the case that the weighted confidence value is greater than a preset threshold, determining that said website is a phishing website.

Description

Phishing website detection method, device, equipment, computer readable storage medium

Related application

This application claims priority to Chinese patent application No. 202010572801.0, filed on June 22, 2020 and entitled "Phishing website detection methods, devices, equipment, computer-readable storage media", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of Internet security technology, in particular to the detection method of phishing websites, the detection device of phishing websites, computer equipment and computer-readable storage media.

Background technique

Related terms are explained as follows:

URL (Uniform Resource Locator): Network address.

Threat intelligence: According to Gartner's definition, threat intelligence is evidence-based knowledge about existing or emerging threats or hazards to assets. This knowledge includes context, mechanisms, indicators, implications, and actionable recommendations, and it can provide a basis for decision-making in threat response. From the perspective of security practitioners, threat intelligence refers to a set of intrusion threat indicators that can be used to determine whether a target under test poses a security threat to the system.

IOC (Indicators of Compromise, intrusion threat indicators): Refers to specific indicators in threat intelligence.

Confidence: The degree to which the true value of a parameter falls, with a given probability, within a range around the measurement result.

Phishing website: Refers to a fake website that deceives users. Its interface is basically the same as that of a real website. It deceives consumers or steals account and password information submitted by visitors.

The current mainstream phishing website identification schemes can be divided into the following two types:

One is a blacklist-based detection method, which relies on websites that disclose phishing links, such as PhishTank, which lists a large number of phishing websites and is updated frequently. The currently visited link can be compared directly against the list to determine whether it is a phishing link. This method of identification is relatively intuitive. However, its biggest problem is a high rate of missed detections. At present, there are few websites that disclose phishing links in China, and very few that cover specific industries such as the financial industry. On the one hand, the sample size is small; on the other hand, disclosure happens after the fact, so only a small part of the phishing websites across the entire network can be intercepted.

The other is an offline judgment and detection method based on link characteristics, which constructs a characteristic model according to the link length, the correlation between the link path and the host, and whether there is an encryption protocol, and judges according to the characteristic model. This method is universal in scope. However, it cannot be combined with existing threat intelligence in real time. A large number of false intercepts will be generated through offline judgment, which requires increased manpower input for review and confirmation, resulting in high cost and lag.

Currently, there is no effective solution to the problem of low accuracy of detection results of phishing websites in related technologies.

Summary of the invention

The embodiments of the present application provide a method for detecting a phishing website, a device for detecting a phishing website, a computer device, and a computer-readable storage medium, so as to at least solve the problem of low accuracy of the detection result of the phishing website in related technologies.

In the first aspect, an embodiment of the present application provides a method for detecting a phishing website, including: obtaining multiple pieces of feature information of the website to be tested; obtaining the confidence value and weight value corresponding to each piece of feature information; determining the weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the multiple pieces of feature information; and, when the weighted confidence value is greater than a preset threshold, determining that the website to be tested is a phishing website.

In some of the embodiments, the multiple pieces of feature information include first feature information; obtaining multiple pieces of feature information of the website to be tested includes: obtaining the URL of the website; and obtaining the first feature information according to the URL, where the first feature information includes at least one of the following: IP, domain name, file hash value of an executable file, and Whois information.

In some of the embodiments, the multiple pieces of feature information further include second feature information; obtaining multiple pieces of feature information of the website to be tested further includes at least one of the following: obtaining, from an intelligence database, domain name information associated with the IP and IP information associated with the domain name, where the second feature information includes the domain name information associated with the IP and the IP information associated with the domain name; obtaining, through Whois reverse lookup, the IP and domain name information associated with the Whois information, where the second feature information includes the IP and domain name information associated with the Whois information; and obtaining, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, where the second feature information includes the IP and domain name information that the executable file connects back to.

In some of the embodiments, obtaining the confidence value and weight value corresponding to each piece of feature information includes: matching the multiple pieces of feature information of the website to be tested with preset feature information in the intelligence database; and obtaining, according to the matching result, the confidence value and the weight value corresponding to each piece of feature information of the website to be tested.

In some of the embodiments, obtaining the confidence value and the weight value corresponding to each piece of feature information further includes: determining third feature information from the multiple pieces of feature information, where the third feature information includes at least one of the following: the network transmission protocol information and the port information obtained from the URL; and obtaining, according to the third feature information, the confidence value and the weight value corresponding to the third feature information.

In some of the embodiments, after determining the weighted confidence value of the website to be tested according to the confidence values and weight values corresponding to the multiple pieces of feature information, the method further includes: determining whether the weighted confidence value is greater than a first preset threshold; in the case where the weighted confidence value is greater than the first preset threshold, determining that the website to be tested is a phishing website and denying access to the website to be tested; determining whether the weighted confidence value is greater than a second preset threshold; and, in the case where the weighted confidence value is greater than the second preset threshold, determining that the website to be tested is a suspected phishing website and issuing warning information indicating that the website to be tested is a suspected phishing website.
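The two-threshold decision described above can be sketched as follows. This is only an illustrative sketch: the names `Verdict` and `classify`, the example threshold values 0.8 and 0.5, and the assumption that the first preset threshold is higher than the second are not stated in this application.

```python
from enum import Enum


class Verdict(Enum):
    PHISHING = "phishing"    # access to the website is denied
    SUSPECTED = "suspected"  # warning information is issued
    CLEAN = "clean"          # neither threshold was exceeded


def classify(weighted_confidence: float,
             first_threshold: float = 0.8,   # assumed example value
             second_threshold: float = 0.5   # assumed example value
             ) -> Verdict:
    """Map a weighted confidence value to a verdict using two thresholds."""
    # Assumption: the first threshold is the stricter (higher) one.
    if not second_threshold < first_threshold:
        raise ValueError("second_threshold must be lower than first_threshold")
    if weighted_confidence > first_threshold:
        return Verdict.PHISHING
    if weighted_confidence > second_threshold:
        return Verdict.SUSPECTED
    return Verdict.CLEAN
```

Because both comparisons are strict ("greater than"), a value exactly equal to a threshold falls into the lower category.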

In some of these embodiments, after it is determined that the website to be tested is a phishing website or a suspected phishing website, the method further includes: recording the URL of the website to be tested and the multiple pieces of feature information of the website to be tested in the intelligence database.

In the second aspect, an embodiment of the present application provides a detection device for a phishing website, which includes: a first acquisition module, used to acquire multiple pieces of feature information of the website to be tested; a second acquisition module, used to acquire the confidence value and weight value corresponding to each piece of feature information; a first determination module, used to determine the weighted confidence value of the website to be tested according to the confidence values and weight values corresponding to the multiple pieces of feature information; and a second determination module, used to determine, in the case where the weighted confidence value is greater than a preset threshold, that the website to be tested is a phishing website.

In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor. When the processor executes the computer program, the method for detecting phishing websites as described in the first aspect is implemented.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for detecting a phishing website as described in the first aspect is implemented.

Compared with related technologies, the phishing website detection method, phishing website detection device, computer equipment, and computer-readable storage medium provided in the embodiments of the present application obtain multiple pieces of feature information of the website to be tested; obtain the confidence value and weight value corresponding to each piece of feature information; determine the weighted confidence value of the website to be tested according to the confidence values and weight values corresponding to the multiple pieces of feature information; and, when the weighted confidence value is greater than the preset threshold, determine that the website to be tested is a phishing website. This solves the problem of low accuracy of phishing website detection results in the related technology and improves the accuracy of those results.

The details of one or more embodiments of the present application are presented in the following drawings and descriptions, so as to make the other features, purposes and advantages of the present application more concise and understandable.

Description of the drawings

The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments and descriptions of the application are used to explain the application, and do not constitute an improper limitation of the application. In the drawings:

Fig. 1 is a flowchart of a method for detecting a phishing website according to an embodiment of the present application.

Fig. 2 is a schematic diagram of an association relationship between multiple feature information in a website to be tested according to an embodiment of the present application.

Fig. 3 is a schematic flowchart of a method for detecting a phishing website according to an optional embodiment of the present application.

Fig. 4 is a structural block diagram of a phishing website detection device according to an embodiment of the present application.

Fig. 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions, and advantages of this application clearer, the following describes and illustrates this application with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application. Based on the embodiments provided in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The reference to "embodiments" in this application means that a specific feature, structure, or characteristic described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those of ordinary skill in the art understand, explicitly and implicitly, that the embodiments described in this application can be combined with other embodiments where there is no conflict.

Unless otherwise defined, the technical terms or scientific terms involved in this application shall have the usual meanings understood by those with general skills in the technical field to which this application belongs. The words "a", "an", "one", "the" and similar words referred to in this application do not indicate a quantitative limit, and may indicate the singular or the plural. The terms "include", "comprise", "have" and any of their variations in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules (units) is not limited to the listed steps or units, but may further include unlisted steps or units, or may further include other steps or units inherent to these processes, methods, products, or devices. The terms "connected", "linked", "coupled" and the like mentioned in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The "plurality" referred to in this application means two or more. "And/or" describes the association relationship of the associated objects and covers three cases. For example, "A and/or B" can mean: A exists alone, A and B both exist, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first", "second", "third", etc. involved in this application merely distinguish similar objects, and do not represent a specific order for the objects.

This embodiment provides a method for detecting a phishing website. Fig. 1 is a flowchart of a method for detecting a phishing website according to an embodiment of the present application. As shown in Fig. 1, the process includes the following steps:

Step S101: Obtain multiple feature information of the website to be tested. The characteristic information includes IP, domain name, and file hash value.

Among them, an IP such as 192.168.1.1 is the only means of addressing by which a user client (such as a browser) locates the target server. The IP of a server on the Internet is globally unique. IPs are divided into IPv4 and IPv6, and this embodiment is applicable to both.

Among them, a domain name looks like www.baidu.com. The DNS service resolves the domain name to an IP, after which a request for service resources is initiated. Therefore, the domain name and the IP are related to each other. At any one time, a domain name can be configured to resolve to multiple IPs, resulting in a one-to-many relationship. Over time, a server with the same IP may have been pointed to by several different domain names, resulting in a many-to-many relationship.

Among them, the file hash value is a character string obtained by applying a message digest algorithm to the contents of a file, and the character string consists of letters and/or numbers. Under current conditions, the calculated file hash value corresponds one-to-one with a given file, so it can be used as a unique identifier that distinguishes the file from other files. In this embodiment, a message digest algorithm such as SHA (Secure Hash Algorithm) or MD5 (Message-Digest Algorithm 5) may be used to obtain the file hash value. Different message digest algorithms produce digests of different lengths; the longer the digest, the less likely it is that two files share it. Of these algorithms, SHA-256 can be chosen to obtain the file hash value. Because the MD5 string is relatively short, two different files may have the same MD5 value; algorithms with longer output produce strings that are too long and waste storage space. Therefore, this embodiment adopts SHA-256 to balance collision resistance against storage space.

Step S102: Obtain the confidence value and the weight value corresponding to each piece of feature information. In this embodiment, the feature information serves as the intrusion threat indicators, and a confidence value and a weight value are assigned to each intrusion threat indicator. The confidence value represents how accurate a given intrusion threat indicator is in deciding that the website to be tested is a phishing website, and the weight value represents how important that indicator is to the decision.

Step S103: Determine the weighted confidence value of the website to be tested according to the confidence values and the weight values respectively corresponding to the multiple pieces of feature information. In this embodiment, the product of the confidence value and the weight value percentage of each piece of feature information is used as that item's individual contribution, and the weighted confidence value of the website to be tested is the sum of these contributions. The weight value percentage is the ratio of the weight value of a given piece of feature information to the sum of the weight values of all pieces of feature information.

Related technologies usually make judgments directly based on single threat intelligence and characteristics, leading to large errors. This embodiment considers multiple feature information, considers the confidence and weight of each feature information, and finally determines the result with a weighted confidence value, which is beneficial to reduce errors.

Step S104: In the case that the weighted confidence value is greater than the preset threshold, it is determined that the website to be tested is a phishing website. Wherein, the preset threshold value may be an empirical value obtained through multiple tests, or may be a preset value.

Through the above steps, the problem of low accuracy of the detection result of the phishing website in the related technology is solved, and the accuracy of the detection result of the phishing website is improved.
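Steps S103 and S104 can be illustrated with a minimal sketch, assuming each piece of feature information is represented as a (confidence, weight) pair. The function names and the example threshold 0.6 are hypothetical and are not part of this application.

```python
from typing import Iterable, Tuple


def weighted_confidence(features: Iterable[Tuple[float, float]]) -> float:
    """Sum of confidence * (weight / total weight) over all features."""
    pairs = list(features)
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        # No usable feature information: treat the score as 0 (assumption).
        return 0.0
    return sum(c * w for c, w in pairs) / total_weight


def is_phishing(features: Iterable[Tuple[float, float]],
                threshold: float = 0.6) -> bool:  # assumed example threshold
    """Step S104: phishing if the weighted confidence exceeds the threshold."""
    return weighted_confidence(features) > threshold
```

For example, three features with (confidence, weight) of (0.9, 8), (0.0, 6), and (0.5, 2) give (7.2 + 0 + 1.0) / 16 = 0.5125, which does not exceed a threshold of 0.6.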

Malicious files usually operate either offline or online. In the offline mode, the purpose is to damage the user's computer, making it run poorly or even become unusable. In the online mode, the purpose is to collect user information illegally: the malicious file typically steals computer information and passwords, finds and connects to a remote server by means of a URL, and uploads the collected user information to that malicious server. Accordingly, the feature information can be obtained piece by piece by following the association relationships between the multiple pieces of feature information of the website to be tested.

Figure 2 is a schematic diagram of the association relationship between multiple feature information in the website to be tested. As shown in Figure 2, different feature information has an interconnection relationship and can be transformed into each other. The following embodiments will introduce how to extract multiple pieces of characteristic information according to the association relationship between the pieces of characteristic information.

In some of the embodiments, the multiple pieces of feature information include first feature information; acquiring multiple pieces of feature information of the website to be tested includes: acquiring the URL of the website; and acquiring the first feature information according to the URL, where the first feature information includes but is not limited to at least one of the following: IP, domain name, file hash value of an executable file, and Whois information.

The first characteristic information refers to the directly relevant and accurate information obtained from the URL of the website. The following will separately introduce the obtaining of the website URL and the obtaining of the first characteristic information.

(1) Get the website URL. The URL has the form <protocol>://[host]:[port]/[path]. Here, protocol is the protocol, such as http (Hyper Text Transfer Protocol), https (Hyper Text Transfer Protocol over Secure Socket Layer), or ftp (File Transfer Protocol). host is the host, which is either a domain name that can be resolved to an IP or an IP itself, such as www.baidu.com or 180.101.49.11. port is the port number; common website ports include 80, 443, and 8080, and the default ports 80 and 443 can be omitted. path is the path, such as index.html.

Phishing websites usually induce users to click through the browser to produce malicious behavior. As a unique identifier of a web page on the Internet, URL can be obtained in the following two ways. One is to copy the address bar at the top of the browser to obtain the URL when browsing the web; the other is to intercept the URL accessed by the user through a firewall device in an automated scenario.

After the URL is obtained, multi-dimensional feature extraction is performed. The first feature information, which is obtained directly, is recorded in variables whose names start with $base.

(2) Split the URL to get $base_protocal, $base_host, $base_port, and $base_path. In terms of implementation, the URL can be split on separators, or the parts can be extracted with regular expressions. A regular expression is a pattern-matching technique that can be used to test whether a character string has specified characteristics and to extract specified information from it.
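As one possible illustration of step (2), the sketch below splits a URL with Python's standard `urllib.parse.urlsplit` rather than hand-written separators or regular expressions. The dictionary keys mirror the $base_ variable names; the `DEFAULT_PORTS` table used to fill in an omitted port is an assumption added for the example.

```python
from urllib.parse import urlsplit

# Assumed defaults used when the URL omits the port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def split_url(url: str) -> dict:
    """Split a URL into protocol, host, port, and path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return {
        "base_protocal": scheme,            # spelling follows the text above
        "base_host": parts.hostname,        # domain name or IP
        "base_port": port,
        "base_path": parts.path.lstrip("/"),
    }
```

For instance, `split_url("http://180.101.49.11:8080/index.html")` yields the host `180.101.49.11`, the port `8080`, and the path `index.html`.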

(3) Extract the domain name and IP. If $base_host is a domain name, then $base_domain=$base_host, and DNS (Domain Name System) resolution is performed on $base_domain. The resolution can be done with the nslookup command on Windows or Linux, or with an online tool. More than one IP may be resolved; the result is recorded as $base_ip. If $base_host is an IP, then $base_ip=$base_host.
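Step (3) can be sketched as below, under the assumption that the system resolver (via `socket.getaddrinfo`) stands in for the nslookup command or online tool mentioned above. The function names are hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, List, Optional, Tuple


def is_ip(host: str) -> bool:
    """Return True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def resolve(domain: str) -> List[str]:
    """Resolve a domain name to its IPs using the system resolver."""
    infos = socket.getaddrinfo(domain, None)
    # The first element of each sockaddr tuple is the address string.
    return sorted({info[4][0] for info in infos})


def extract_domain_and_ip(
    base_host: str,
    resolver: Callable[[str], List[str]] = resolve,
) -> Tuple[Optional[str], List[str]]:
    """Return (base_domain, base_ip) for a host taken from the URL."""
    if is_ip(base_host):
        # The host is already an IP, so there is no domain name to record.
        return None, [base_host]
    return base_host, resolver(base_host)
```

Since a domain name may resolve to several addresses, the IP result is always a list.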

(4) Extract Whois information. Whois information describes the registered owner of a domain name, usually a business or an individual. From it, the registration time, expiration time, and email address of the registered owner can be queried. The query can be made with the whois command on Linux or Windows, or with an online tool; the result is denoted $base_whois.

(5) Extract the hash value of files in the webpage. Financial websites, especially online banking transaction websites, generally provide "Secret Shield" security plug-ins for users to download and install in order to protect information security. Counterfeit phishing websites also provide "Secret Shield" download links, but the executable files they provide are likely to be viruses or Trojan horse programs. In some embodiments, the HTML (Hyper Text Markup Language) code of the webpage of the website to be tested can be obtained by requesting the URL over http or by rendering it in a browser, and the executable file download links can be retrieved from that code. The downloaded executable file is recorded as $base_file, and its file hash value is calculated and recorded as $base_hash.
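A minimal sketch of step (5) follows, assuming the HTML has already been fetched and that executable downloads can be recognized by file suffix. The suffix list and the function names are illustrative assumptions; downloading the file itself is left out.

```python
import hashlib
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Assumed set of suffixes treated as executable downloads.
EXECUTABLE_SUFFIXES = (".exe", ".msi", ".apk", ".dmg")


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_executable_links(html: str, page_url: str) -> List[str]:
    """Return absolute URLs of links that look like executable downloads."""
    collector = _LinkCollector()
    collector.feed(html)
    links = [urljoin(page_url, href) for href in collector.hrefs]
    # Ignore any query string when checking the suffix.
    return [u for u in links
            if u.lower().split("?", 1)[0].endswith(EXECUTABLE_SUFFIXES)]


def file_sha256(data: bytes) -> str:
    """SHA-256 digest of the downloaded file contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()
```

The hex digest is always 64 characters long, which is the fixed length the intelligence database would store for $base_hash under this sketch.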

In some of the embodiments, the multiple pieces of feature information further include second feature information; obtaining multiple pieces of feature information of the website to be tested also includes but is not limited to at least one of the following: obtaining, from the intelligence database, domain name information associated with the IP and IP information associated with the domain name, where the second feature information includes the domain name information associated with the IP and the IP information associated with the domain name; obtaining, through Whois reverse lookup, the IP and domain name information associated with the Whois information, where the second feature information includes the IP and domain name information associated with the Whois information; and obtaining, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, where the second feature information includes the IP and domain name information that the executable file connects back to.

The second feature information refers to indirectly related and rough information relative to the first feature information. After the URL is obtained, multi-dimensional feature extraction will be performed. Among them, the information obtained indirectly through the search starts with $ext. The following will separately introduce the acquisition of the second characteristic information.

(6) Extract historically resolved IPs and domain names. The domain name or IP has been obtained in the above steps. The binding between a domain name and an IP is not permanent: the domain name owner can change the IP that the domain name points to at any time, and the same IP may be leased to different users at different times and bound to different domain names. In some embodiments, the domain name information associated with the IP and the IP information associated with the domain name can be obtained from the intelligence database according to the binding relationship between domain name and IP, where the intelligence database includes a threat intelligence database; the same information can also be obtained from an information collection website. The historically resolved domain names and IPs are recorded in $ext_domain and $ext_ip respectively.

(7) Reverse Whois lookup. An individual or organization may register multiple domain names. Through Whois reverse lookup, the Whois information can be used to find more domain names registered by the same individual or organization, which are recorded in $ext_domain. These domain names can then be resolved to their corresponding IPs, which are recorded in $ext_ip.

(8) Extract the server address from the executable file. A malicious executable program is likely to go online and upload user information to a malicious server in order to collect sensitive user data. Dynamic analysis of the executable file is used to find the domain names or IPs that the executable file connects back to, which are recorded in $ext_domain or $ext_ip. Dynamic analysis of executable files involves techniques such as binary analysis, disassembly, and sandboxing.

Through the above steps, the directly related, accurate $base_ data and the indirectly related, rough $ext_ data have been obtained. Table 1 summarizes the two types of feature information.

Table 1 Directly related and indirectly related feature information

The following embodiments will combine the two types of feature information given in Table 1 for further matching analysis.

In some of the embodiments, obtaining the confidence value and weight value corresponding to each piece of feature information includes: matching the multiple pieces of feature information of the website to be tested with preset feature information in the intelligence database; and obtaining, according to the matching result, the confidence value and weight value corresponding to each piece of feature information of the website to be tested.

The intelligence database in this embodiment includes a threat intelligence database, which may be a self-built intelligence database or a third-party intelligence database. What these have in common is that they hold core intrusion threat indicators along similar dimensions, such as domain names, IPs, and hash values of files with malicious behavior. This embodiment can optionally adopt a self-built intelligence database. In addition to the core intrusion threat indicators above, in some embodiments the self-built intelligence database can also store Whois information, historically resolved domain names and IPs, intelligence confidence, and malicious URLs. This information assists the acquisition of the multiple pieces of feature information in the preceding steps and the weighted judgment of that feature information afterwards. In addition, the executable file obtained in the preceding steps can be uploaded to the self-built intelligence database to analyze whether it exhibits malicious behavior.

The following will introduce how to make a weighted judgment on the acquired multiple feature information.

(1) Principle: When the first feature information ($base_) hits the intelligence database, a high weight value is assigned to its confidence value; when the second feature information ($ext_) hits the intelligence database, a low weight value is assigned to its confidence value.

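(2) Formula: Let the confidence value be c, 0≤c≤1; the weight value be w, 1≤w≤10; and the judgment score be S, where S is the weighted confidence value, 0≤S≤1. Using the weighted average algorithm:

$$S = \frac{\sum_{i=1}^{n} c_i w_i}{\sum_{i=1}^{n} w_i}$$

where c_i and w_i are the confidence value and weight value of the i-th piece of feature information, and n, the number of pieces of feature information, is a natural number.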

(3) Weighting of the first feature information: $base_domain, $base_ip, $base_whois, $base_hash, and $base_path are the pieces of information that can be looked up in the intelligence database.

Among them, $base_domain, $base_ip, and $base_hash are highly accurate and form the core matching content, so exact matching can be used: $base_domain must be exactly equal to a domain in the intelligence database, and the same applies to $base_ip. $base_hash can take different forms depending on the algorithm; in this embodiment, the SHA256 algorithm is used and the value is matched exactly against the SHA256 values in the intelligence database. For matched data, c is taken from the preset confidence field in the intelligence database; this field holds the confidence value and reflects the confidence of a given piece of feature information under current conditions. For unmatched data, the value of c is 0. The value of w is 8.

$base_whois mainly matches the data of the registered owner and the contact information of the registered owner, and an exact matching method can be used. For matched data, c is taken from the confidence field preset in the information library; for unmatched data, c takes the value of 0, and w takes the value of 6.

$base_path is matched by containment. For matched data, c is taken from the preset confidence field in the intelligence database; for unmatched data, c takes the value 0, and w takes the value 4.

It should be noted that the value of c above is determined by the preset confidence field in the intelligence database, and the confidence value recorded there for a piece of feature information is calculated from the credibility of that feature information under current conditions. The value of w can be taken from a preset value in the intelligence database, or can be adjusted as appropriate according to the accuracy of detection for that feature information.
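The matching rules for the first feature information can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `INTEL_DB` dictionary, its sample entries, and the function names are hypothetical; the weight is assumed to apply to a feature type whether or not it matches; and taking the highest confidence when several path entries match is a choice made here, not something the text specifies.

```python
import hashlib
from typing import Dict, Tuple

# Weight per feature type, using the values stated in the text (8, 6, 4).
BASE_WEIGHTS: Dict[str, int] = {"domain": 8, "ip": 8, "hash": 8, "whois": 6, "path": 4}

# Hypothetical intelligence database: feature type -> {indicator: confidence}.
INTEL_DB: Dict[str, Dict[str, float]] = {
    "domain": {"login-examp1e.test": 0.9},
    "ip": {"203.0.113.7": 0.8},
    "hash": {hashlib.sha256(b"sample payload").hexdigest(): 0.95},
    "whois": {"registrant@example.test": 0.7},
    "path": {"/secure/update-account": 0.6},
}


def sha256_hex(data: bytes) -> str:
    """SHA256 digest of a file's bytes as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def score_base_feature(kind: str, value: str) -> Tuple[float, int]:
    """Return (c, w) for one piece of first feature information."""
    w = BASE_WEIGHTS[kind]
    entries = INTEL_DB.get(kind, {})
    if kind == "path":
        # Containment match: a known path fragment appears inside the URL path.
        hits = [conf for frag, conf in entries.items() if frag in value]
        return (max(hits) if hits else 0.0, w)
    # Exact match for domain, IP, hash, and Whois data; c is 0 when unmatched.
    return (entries.get(value, 0.0), w)


print(score_base_feature("domain", "login-examp1e.test"))
print(score_base_feature("hash", sha256_hex(b"sample payload")))
print(score_base_feature("path", "/a/secure/update-account/index.html"))
```

A real deployment would replace the dictionary with queries against the self-built intelligence database; the lookup shape stays the same.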

(4) Weighting of the second feature information: these domain names and IPs are obtained through historical resolution records or by analyzing the network behavior of executable files. Compared with the first feature information obtained from the URL, the second feature information is slightly less relevant and consists of multiple pieces of data, so this embodiment assigns it a lower weight value. When the second feature information matches an intrusion threat indicator in the intelligence database, c is taken from the preset confidence field in the intelligence database; otherwise, if it does not match, c takes the value 0, and w takes the value 2. Each matched intrusion threat indicator counts as one accumulated item of the weighted average.
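A possible way to turn the second feature information into items of the weighted average is sketched below. The text states that each matched indicator is an accumulated item, but leaves open whether unmatched values (c = 0, w = 2) are also accumulated; the hypothetical `count_misses` flag covers both readings. The function name and the sample data are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

EXT_WEIGHT = 2  # weight stated in the text for $ext_domain and $ext_ip


def score_ext_features(
    ext_values: Iterable[str],
    intel: Dict[str, float],
    count_misses: bool = False,
) -> List[Tuple[float, int]]:
    """Return one (c, w) item per second-feature value.

    By default only values that hit the intelligence database are kept.
    With count_misses=True, unmatched values are kept with c = 0.
    """
    items: List[Tuple[float, int]] = []
    for value in ext_values:
        if value in intel:
            items.append((intel[value], EXT_WEIGHT))
        elif count_misses:
            items.append((0.0, EXT_WEIGHT))
    return items


# Hypothetical indicators and associated domains/IPs.
intel = {"evil-cdn.test": 0.7, "203.0.113.9": 0.5}
found = ["evil-cdn.test", "ok.test", "203.0.113.9"]
print(score_ext_features(found, intel))
print(score_ext_features(found, intel, count_misses=True))
```

Counting misses dilutes the score when many associated domains and IPs are benign, so the choice affects how strongly a single hit moves S.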

At present, https encrypted network transmission has been widely adopted, and websites in the financial industry tend to use the https protocol. Browsers such as Google Chrome mark websites that do not use https as insecure. An https certificate must be issued by a certificate authority to be trusted by the browser; untrusted or inconsistent certificates may be rejected by the browser.

It can be seen that network transmission protocol and port information can also be used as feature information for weighted judgment. In some of the embodiments, obtaining the confidence value and the weight value corresponding to each feature information further includes: determining third feature information from the multiple feature information, where the third feature information includes but is not limited to at least one of the following: network transmission protocol information and port information obtained from the URL; and obtaining, according to the third feature information, the confidence value and the weight value corresponding to the third feature information.

The third feature information refers to non-intelligence feature information, that is, network transmission protocol information and port information obtained from the URL.

If the website to be tested only uses the http protocol, let c=1 and w=2. If the website to be tested uses the https protocol but does not have a certificate issued by an authority, it is judged to be untrustworthy, and c=1 and w=3.

For the convenience of users, websites usually use well-known domain names that are easy to remember, and use default ports that can be omitted. For example, the http protocol usually uses port 80, and the https protocol usually uses port 443. If the website to be tested can only be accessed by IP, let c=1 and w=3. If the website to be tested uses a non-default port, such as 8080, 9090, or 8888, let c=1 and w=3.
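The non-intelligence rules for the third feature information can be sketched with the Python standard library. Several points are assumptions: certificate validation is not performed here and is passed in as the hypothetical `cert_trusted` argument; any port other than the scheme's default is treated as a hit; and only rules that fire produce an item, since the text does not say whether rules that do not fire are accumulated with c = 0.

```python
import ipaddress
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def score_third_features(
    url: str, cert_trusted: Optional[bool] = None
) -> List[Tuple[float, int]]:
    """Return (c, w) items for the protocol, host, and port rules."""
    parts = urlsplit(url)
    items: List[Tuple[float, int]] = []

    if parts.scheme == "http":
        items.append((1.0, 2))  # plain http only
    elif parts.scheme == "https" and cert_trusted is False:
        items.append((1.0, 3))  # https without a certificate from an authority

    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        items.append((1.0, 3))  # accessed by IP rather than by domain name
    except ValueError:
        pass  # host is a domain name

    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        items.append((1.0, 3))  # non-default port such as 8080, 9090, 8888

    return items


print(score_third_features("http://203.0.113.7:8080/login"))
print(score_third_features("https://www.example.com/", cert_trusted=True))
```

These items are then accumulated together with the intelligence-based items when S is computed.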

In some of these embodiments, after determining the weighted confidence value of the website to be tested according to the confidence values and weight values corresponding to the multiple feature information, the method further includes: judging whether the weighted confidence value is greater than a first preset threshold, and, if so, determining that the website to be tested is a phishing website and denying access to it; and judging whether the weighted confidence value is greater than a second preset threshold, and, if so, determining that the website to be tested is a suspected phishing website and issuing warning information indicating that the website to be tested is a suspected phishing website.

By weighting the above information comprehensively and substituting it into the weighted average formula, the score S is calculated. S can then be classified according to the preset thresholds, and a strategy for the website to be tested is adopted according to the classification result.

The first preset threshold may take a value of 0.8. If S ≥ 0.8, the website to be tested is determined to be a phishing website, and the user can be directly denied access.

The second preset threshold may take a value of 0.6. If S ≥ 0.6, the website to be tested is determined to be a suspected phishing website, and warning information can be issued to remind the user that the website may be fraudulent and should be judged carefully.

In some of these embodiments, after determining that the website to be tested is a phishing website or a suspected phishing website, the method further includes: including the URL of the website to be tested and multiple characteristic information of the website to be tested into the information database.

In this embodiment, a third preset threshold can also be set, for example 0.4. If S ≥ 0.4, the feature information of the website to be tested is added to a suspicious intelligence database for further verification and for later optimization of the algorithm parameters. If S < 0.4, the feature information obtained from the website to be tested is discarded.
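The three thresholds and the resulting handling can be sketched as below. The comparison uses ≥, following the numeric examples (S ≥ 0.8, S ≥ 0.6, S ≥ 0.4). The label strings, the `record` helper, and the use of plain lists as stand-ins for the intelligence database and the suspicious intelligence database are assumptions for illustration.

```python
from typing import Dict, List

FIRST_THRESHOLD = 0.8   # phishing website: deny access
SECOND_THRESHOLD = 0.6  # suspected phishing website: warn the user
THIRD_THRESHOLD = 0.4   # keep features for further verification


def classify(s: float) -> str:
    """Map the weighted confidence value S to a label."""
    if s >= FIRST_THRESHOLD:
        return "phishing"
    if s >= SECOND_THRESHOLD:
        return "suspected"
    if s >= THIRD_THRESHOLD:
        return "suspicious"
    return "discard"


def record(url: str, features: Dict[str, str], s: float,
           intel_db: List[dict], suspicious_db: List[dict]) -> str:
    """Store the result according to its label and return the label."""
    label = classify(s)
    entry = {"url": url, "features": features, "score": s}
    if label in ("phishing", "suspected"):
        intel_db.append(entry)       # fed back into the intelligence database
    elif label == "suspicious":
        suspicious_db.append(entry)  # held for further verification
    # "discard": nothing is stored
    return label


intel_db: List[dict] = []
suspicious_db: List[dict] = []
print(record("http://login-examp1e.test/", {"domain": "login-examp1e.test"},
             0.85, intel_db, suspicious_db))
print(record("http://maybe.test/", {"domain": "maybe.test"},
             0.45, intel_db, suspicious_db))
```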

The open source or commercial threat intelligence used in related technologies usually detects websites in an offline manner. Because intrusion threat indicators lag behind, the false alarm rate gradually increases over time. This embodiment instead adopts a self-built intelligence database, sets intrusion threat indicators in it, and feeds the suspicious information detected on the website currently being tested back into the intelligence database to update the intrusion threat indicators. This addresses the lag of intrusion threat indicators and helps reduce the false alarm rate; it also enriches the intelligence database, improves the accuracy and breadth of judgment, and maintains a virtuous circle.

The following describes and illustrates the embodiments of the present application through optional embodiments.

Fig. 3 is a schematic flowchart of a method for detecting a phishing website according to an optional embodiment of the present application. As shown in Fig. 3, the process includes the following steps:

Step S301: Obtain the URL of the website.

Step S302: Split the URL to obtain $base_protocal, $base_host, $base_port, and $base_path.

Step S303: Directly obtain or resolve $base_domain, $base_ip, $base_whois, and $base_hash.

Step S304: Indirectly obtain or resolve $ext_domain and $ext_ip.

Step S305: Match against the intelligence database to obtain the confidence value c and the weight value w; for non-intelligence feature information, determine the confidence value c and the weight value w directly.

Step S306: Use a weighted average algorithm to calculate the weighted confidence value, that is, calculate S with the following formula:

$$S = \frac{\sum_{i=1}^{n} c_i w_i}{\sum_{i=1}^{n} w_i}$$

Step S307: Make an access restriction decision based on the S score.

Step S308: Add the information of websites determined to be phishing websites or suspected phishing websites to the intelligence database.
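Steps S301 to S307 can be strung together in a compact sketch. It is deliberately simplified and rests on assumptions: the features from steps S303 and S304 (DNS resolution, Whois lookup, file hashing, associated domains and IPs) are supplied by the caller rather than gathered here; the intelligence database is a hypothetical dictionary from indicator to confidence; all lookups are exact matches; and of the non-intelligence rules only the plain http rule is included.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit

Item = Tuple[float, float]  # (confidence c, weight w)

# Weights stated in the description; "ext" covers $ext_domain and $ext_ip.
WEIGHTS = {"domain": 8, "ip": 8, "hash": 8, "whois": 6, "ext": 2}


def split_url(url: str) -> Dict[str, object]:
    """S302: split the URL into protocol, host, port, and path."""
    p = urlsplit(url)
    port = p.port or {"http": 80, "https": 443}.get(p.scheme)
    return {"base_protocal": p.scheme, "base_host": p.hostname,
            "base_port": port, "base_path": p.path or "/"}


def match_intel(features: Dict[str, List[str]],
                intel: Dict[str, float]) -> List[Item]:
    """S305: look each feature value up; unmatched values get c = 0."""
    return [(intel.get(value, 0.0), WEIGHTS[kind])
            for kind, values in features.items() for value in values]


def weighted_confidence(items: List[Item]) -> float:
    """S306: S = sum(c * w) / sum(w)."""
    total_w = sum(w for _, w in items)
    return sum(c * w for c, w in items) / total_w if total_w else 0.0


def decide(s: float) -> str:
    """S307: map the score to an access decision."""
    if s >= 0.8:
        return "deny"
    if s >= 0.6:
        return "warn"
    return "allow"


def detect(url: str, features: Dict[str, List[str]],
           intel: Dict[str, float]) -> Tuple[float, str]:
    parts = split_url(url)
    items = match_intel(features, intel)
    if parts["base_protocal"] == "http":
        items.append((1.0, 2))  # non-intelligence rule for plain http
    s = weighted_confidence(items)
    return s, decide(s)


features = {"domain": ["login-examp1e.test"], "ip": ["203.0.113.7"],
            "ext": ["evil-cdn.test"]}
intel = {"login-examp1e.test": 0.9, "203.0.113.7": 0.9, "evil-cdn.test": 0.8}
print(detect("http://login-examp1e.test/secure", features, intel))
```

With an empty intelligence database the same URL scores low, because the unmatched high-weight features pull the average toward 0 and only the http rule contributes.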

In some of these embodiments, the IOC information (intrusion threat indicators) in the intelligence database can be delivered directly to a web (World Wide Web) firewall, which intercepts traffic directly when feature information matches, thereby achieving precise interception.

Compared with related technologies, the embodiments of the present application include the following advantages:

(1) Using the confidence level built into the self-built intelligence database for each intrusion threat indicator helps determine how suspicious the target website is.

(2) Directly related information that can be matched against the intelligence database is extracted from the URL of the website to be tested, and additional related information is obtained from the attributes of that directly related information, so that a comprehensive judgment can be made with multi-dimensional matching. Compared with traditional methods, phishing websites can be found more accurately, which helps prevent users from being defrauded by fake websites and helps protect normal economic order.

(3) The confidence values of the various feature information are weighted, and the weighted result is used to determine the detection result, which solves the problem of large errors caused by making judgments directly from a single threat intelligence source or feature.

(4) The detection result can be stored in the database to expand the base data for later phishing website judgments; false positives can be addressed by adjusting the weight values.

This embodiment also provides a detection device for a phishing website, which is used to implement the above-mentioned embodiments and optional implementation manners, and those that have been described will not be repeated. As used below, the terms "module", "sub-module", etc., can be a combination of software and/or hardware that can implement predetermined functions. Although the devices described in the following embodiments are preferably implemented by software, implementation by hardware or a combination of software and hardware is also possible and conceived.

Fig. 4 is a structural block diagram of a detection device for a phishing website according to an embodiment of the present application. As shown in Fig. 4, the device includes:

The first obtaining module 41 is used to obtain multiple feature information of the website to be tested.

The second acquisition module 42 is coupled to the first acquisition module 41 and is configured to acquire the confidence value and the weight value corresponding to each feature information.

The first determination module 43 is coupled to the second acquisition module 42 and is configured to determine the weighted confidence value of the website to be tested according to the confidence value and the weight value corresponding to the plurality of feature information, respectively.

The second determining module 44, coupled to the first determining module 43, is configured to determine that the website to be tested is a phishing website when the weighted confidence value is greater than the preset threshold.

In some of the embodiments, the device includes: a first acquisition sub-module for acquiring the URL of the website; and a second acquisition sub-module for acquiring the first characteristic information according to the URL, where the first characteristic information includes but is not limited to at least one of the following: IP, domain name, file hash value of an executable file, and Whois information.

In some of the embodiments, the device includes: a third acquisition sub-module for acquiring, from the intelligence database, domain name information associated with the IP and IP information associated with the domain name, where the second characteristic information includes the domain name information associated with the IP and the IP information associated with the domain name; a Whois reverse lookup module for obtaining the IP and domain name information associated with the Whois information through Whois reverse lookup technology, where the second characteristic information includes the IP and domain name information associated with the Whois information; and an executable file dynamic analysis module for obtaining, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, where the second characteristic information includes the IP and domain name information that the executable file connects back to.

In some of the embodiments, the device includes: a matching module for matching the multiple feature information of the website to be tested with preset feature information in the intelligence database, and for obtaining, according to the matching result, the confidence value and the weight value corresponding to each feature information of the website to be tested.

In some of the embodiments, the device includes: a determining sub-module for determining third characteristic information from the multiple characteristic information, where the third characteristic information includes but is not limited to at least one of the following: network transmission protocol information and port information obtained from the URL; and a fourth acquisition sub-module for acquiring, according to the third characteristic information, the confidence value and the weight value corresponding to the third characteristic information.

In some of the embodiments, the device further includes: a first judging module for judging whether the weighted confidence value is greater than the first preset threshold, and, if so, determining that the website to be tested is a phishing website and denying access to it; and a second judging module for judging whether the weighted confidence value is greater than the second preset threshold, and, if so, determining that the website to be tested is a suspected phishing website and issuing warning information indicating that the website to be tested is a suspected phishing website.

In some of these embodiments, the device further includes: an inclusion module, which is used to include the URL of the website to be tested and multiple characteristic information of the website to be tested into the information database.

It should be noted that each of the above-mentioned modules may be a functional module or a program module, which may be implemented by software or hardware. For modules implemented by hardware, each of the foregoing modules may be located in the same processor; or each of the foregoing modules may also be located in different processors in any combination.

In addition, the method for detecting a phishing website in the embodiment of the present application described in conjunction with FIG. 1 can be implemented by a computer device. Fig. 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.

The computer device may include a processor 51 and a memory 52 storing computer program instructions.

Specifically, the foregoing processor 51 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

The memory 52 may include mass storage for data or instructions. By way of example and not limitation, the memory 52 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory 52 may include removable or non-removable (or fixed) media. Where appropriate, the memory 52 may be internal or external to the data processing device. In a specific embodiment, the memory 52 is a non-volatile memory. In a specific embodiment, the memory 52 includes a read-only memory (ROM) and a random access memory (RAM).

The memory 52 may be used to store or cache various data files that need to be processed and/or used for communication, and possible computer program instructions executed by the processor 51.

The processor 51 reads and executes the computer program instructions stored in the memory 52 to implement any one of the phishing website detection methods in the foregoing embodiments.

In some of the embodiments, the detection device of the phishing website may further include a communication interface 53 and a bus 50. Among them, as shown in FIG. 5, the processor 51, the memory 52, and the communication interface 53 are connected through the bus 50 and complete the mutual communication.

The communication interface 53 is used to implement communication between the modules, devices, and/or units in the embodiments of the present application. The communication interface 53 can also implement data communication with other components such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.

The bus 50 includes hardware, software, or both, and couples the components of the computer device to each other. The bus 50 includes but is not limited to at least one of the following: a data bus (Data Bus), an address bus (Address Bus), a control bus (Control Bus), an expansion bus (Expansion Bus), and a local bus (Local Bus). Where appropriate, the bus 50 may include one or more buses. Although the embodiments of this application describe and show a specific bus, this application considers any suitable bus or interconnection.

The computer device can execute the detection method of the phishing website in the embodiment of the present application based on the acquired URL of the website to be tested, thereby realizing the detection method of the phishing website described in conjunction with FIG. 1.

In addition, in combination with the detection method of the phishing website in the foregoing embodiment, the embodiment of the present application may provide a computer-readable storage medium for implementation. The computer-readable storage medium stores computer program instructions; when the computer program instructions are executed by the processor, any one of the phishing website detection methods in the above-mentioned embodiments is implemented.

The technical features of the above-mentioned embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above-mentioned embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and their description is relatively specific and detailed, but they should not be understood as a limitation on the scope of the patent application. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A method for detecting phishing websites, which is characterized in that the method includes:

Obtain multiple feature information of the website to be tested;

Obtain the confidence value and weight value corresponding to each feature information;

Determine the weighted confidence value of the website to be tested according to the confidence value and the weight value respectively corresponding to the multiple feature information;

In a case where the weighted confidence value is greater than a preset threshold, it is determined that the website to be tested is a phishing website.
The method for detecting a phishing website according to claim 1, wherein the plurality of characteristic information includes first characteristic information; and acquiring the plurality of characteristic information of the website to be tested includes:

Obtain the URL of the website;

The first characteristic information is acquired according to the URL, where the first characteristic information includes at least one of the following: IP, domain name, file hash value of the executable file, and Whois information.
The method for detecting a phishing website according to claim 2, wherein the plurality of characteristic information further includes second characteristic information; acquiring the plurality of characteristic information of the website to be tested further includes at least one of the following:

Obtain, from the information database, the domain name information associated with the IP and the IP information associated with the domain name, where the second characteristic information includes the domain name information associated with the IP and the IP information associated with the domain name;

Obtain the IP and domain name information associated with the Whois information through the Whois reverse search technology, and the second characteristic information includes the IP and domain name information associated with the Whois information;

Obtain, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, where the second characteristic information includes the IP and domain name information that the executable file connects back to.
The method for detecting a phishing website according to claim 3, wherein obtaining the confidence value and the weight value corresponding to each feature information comprises:

Matching multiple feature information of the website to be tested with preset feature information in the information database;

According to the matching result, a confidence value and a weight value corresponding to each feature information in the website to be tested are obtained.
The method for detecting a phishing website according to claim 4, wherein obtaining the confidence value and the weight value corresponding to each feature information further comprises:

Determine third characteristic information from the plurality of characteristic information, where the third characteristic information includes at least one of the following: network transmission protocol information and port information obtained from the URL;

According to the third characteristic information, a confidence value and a weight value corresponding to the third characteristic information are obtained.
The method for detecting a phishing website according to claim 3, wherein after the weighted confidence value of the website to be tested is determined according to the confidence value and the weight value corresponding to the plurality of feature information, the method further includes:

Determine whether the weighted confidence value is greater than a first preset threshold; if it is determined that the weighted confidence value is greater than the first preset threshold, determine that the website to be tested is a phishing website, and deny access to the website to be tested;

Determine whether the weighted confidence value is greater than a second preset threshold; if it is determined that the weighted confidence value is greater than the second preset threshold, determine that the website to be tested is a suspected phishing website, and issue warning information indicating that the website to be tested is a suspected phishing website.
The method for detecting a phishing website according to claim 6, wherein after determining that the website to be tested is a phishing website or a suspected phishing website, the method further comprises:

The URL of the website to be tested and multiple characteristic information of the website to be tested are included in the information database.
A detection device for phishing websites, which is characterized in that it comprises:

The first acquisition module is used to acquire multiple feature information of the website to be tested;

The second acquisition module is used to acquire the confidence value and weight value corresponding to each feature information;

The first determining module is configured to determine the weighted confidence value of the website to be tested according to the confidence value and the weight value respectively corresponding to the multiple feature information;

The second determining module is configured to determine that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor, when executing the computer program, implements the method for detecting a phishing website according to any one of claims 1 to 7.
A computer-readable storage medium having a computer program stored thereon, wherein the program is executed by a processor to realize the method for detecting a phishing website according to any one of claims 1 to 7.