CN111756724A

CN111756724A - Detection method, device and equipment for phishing website and computer readable storage medium

Info

Publication number: CN111756724A
Application number: CN202010572801.0A
Authority: CN
Inventors: 梁杰; 范渊
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: DBAPPSecurity Co Ltd; Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2020-06-22
Filing date: 2020-06-22
Publication date: 2020-10-09
Also published as: WO2021258838A1

Abstract

The application relates to a phishing website detection method, a phishing website detection apparatus, a computer device and a computer-readable storage medium. The phishing website detection method comprises: acquiring a plurality of pieces of feature information of a website to be tested; acquiring a confidence value and a weight value corresponding to each piece of feature information; determining a weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the plurality of pieces of feature information; and, when the weighted confidence value is greater than a preset threshold, determining that the website to be tested is a phishing website. The method and apparatus address the low accuracy of phishing website detection results in the related art and improve the accuracy of those results.

Description

Detection method, device and equipment for phishing website and computer readable storage medium

Technical Field

The present application relates to the field of internet security, and in particular, to a method and an apparatus for detecting a phishing website, a computer device, and a computer-readable storage medium.

Background

The relevant terms are explained as follows:

URL (Uniform Resource Locator): a network address.

Threat intelligence: according to Gartner's definition, threat intelligence is evidence-based knowledge about existing or emerging threats or hazards to an asset, including context, mechanisms, indicators, implications and actionable advice, which can inform decisions on how to respond to the threat. From the perspective of a security practitioner, threat intelligence refers to intrusion threat indicators that can be used to determine whether a target to be detected poses a security threat to a system.

IOC (Indicator of Compromise): an intrusion threat indicator, that is, an index item in threat intelligence.

Confidence: the degree to which the true value of a parameter falls, with a certain probability, within a range around the measured value.

Phishing website: a fake website that deceives users. Its interface is essentially identical to that of a real website, and it defrauds consumers or steals the account and password information submitted by visitors.

Currently mainstream phishing website identification schemes can be classified into the following two types:

One is a blacklist-based detection method that relies on phishing-link disclosure websites such as PhishTank, which list a large number of phishing websites and are updated frequently; whether a link is a phishing link can be judged by directly comparing the currently accessed link against the list. This identification mode is intuitive, but its biggest problem is a high miss rate. At present few websites in China disclose phishing links, and disclosure websites covering particular industries, such as the financial industry, are fewer still. Because the sample size is small on the one hand, and disclosure happens only after the fact on the other, only a small fraction of the phishing websites on the whole network can be intercepted.

The other is an offline detection method based on link features, which builds a feature model from the link length, the degree of correlation between the link path and the host, and whether an encryption protocol is used, and makes a judgment according to that model. The method is widely applicable, but it cannot incorporate existing threat intelligence in real time, and offline judgment produces a large number of false interceptions, so additional manpower is needed for auditing and confirmation; the cost is high and the results lag behind.

At present, no effective solution has been proposed for the problem of low accuracy of phishing website detection results in the related art.

Disclosure of Invention

The embodiment of the application provides a phishing website detection method, a phishing website detection device, computer equipment and a computer readable storage medium, and aims to at least solve the problem that the detection result accuracy of a phishing website in the related art is low.

In a first aspect, an embodiment of the present application provides a method for detecting a phishing website, including: acquiring a plurality of characteristic information of a website to be tested; acquiring a confidence value and a weight value corresponding to each feature information; determining a weighted confidence value of the website to be tested according to the confidence values and the weight values corresponding to the characteristic information respectively; and under the condition that the weighted confidence value is larger than a preset threshold value, determining that the website to be tested is a phishing website.

In some of these embodiments, the plurality of feature information includes first feature information; the method for acquiring the characteristic information of the website to be tested comprises the following steps: acquiring the URL of the website; acquiring the first characteristic information according to the URL, wherein the first characteristic information comprises at least one of the following: IP, domain name, file Hash value of executable file, Whois information.

In some of these embodiments, the plurality of feature information further includes second feature information; the acquiring of the plurality of feature information of the website to be tested further comprises at least one of the following steps: acquiring domain name information associated with the IP from an intelligence library, and acquiring IP information associated with the domain name, wherein the second characteristic information comprises the domain name information associated with the IP and the IP information associated with the domain name; acquiring IP and domain name information associated with the Whois information by a Whois back-check technology, wherein the second characteristic information comprises the IP and domain name information associated with the Whois information; and acquiring IP and domain name information which is connected back to the executable file through an executable file dynamic analysis technology, wherein the second characteristic information comprises the IP and domain name information which is connected back to the executable file.

In some embodiments, obtaining the confidence value and the weight value corresponding to each feature information includes: matching the plurality of characteristic information of the website to be tested with preset characteristic information in the information library; and obtaining a confidence value and a weight value corresponding to each feature information in the website to be tested according to the matching result.

In some embodiments, obtaining the confidence value and the weight value corresponding to each feature information further includes: determining third feature information from the plurality of feature information, wherein the third feature information comprises at least one of: network transmission protocol information and port information acquired from the URL; and acquiring a confidence value and a weight value corresponding to the third characteristic information according to the third characteristic information.

In some embodiments, after determining the weighted confidence value of the website to be tested according to the confidence value and the weight value corresponding to the plurality of feature information, the method further includes: judging whether the weighted confidence value is larger than a first preset threshold value or not; determining the website to be tested as a phishing website and refusing to access the website to be tested under the condition that the weighted confidence value is judged to be larger than a first preset threshold value; judging whether the weighted confidence value is larger than a second preset threshold value or not; and under the condition that the weighted confidence value is larger than a second preset threshold value, determining that the website to be detected is a suspected phishing website, and sending alarm information for indicating that the website to be detected is the suspected phishing website.
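The two-threshold decision described above can be sketched as follows. This is only an illustrative sketch: the names Verdict and classify and the default threshold values 0.8 and 0.5 are assumptions that do not appear in the application, and the sketch assumes that the second preset threshold is lower than the first.

```python
from enum import Enum


class Verdict(Enum):
    PHISHING = "phishing"    # access to the website is refused
    SUSPECTED = "suspected"  # alarm information is sent
    NORMAL = "normal"        # no action is taken


def classify(score: float,
             first_threshold: float = 0.8,
             second_threshold: float = 0.5) -> Verdict:
    """Map a weighted confidence value to a verdict using two thresholds.

    The default threshold values are hypothetical and for illustration only.
    """
    if second_threshold > first_threshold:
        raise ValueError("second_threshold must not exceed first_threshold")
    if score > first_threshold:
        return Verdict.PHISHING
    if score > second_threshold:
        return Verdict.SUSPECTED
    return Verdict.NORMAL
```

Because both comparisons are strict, a score exactly equal to a threshold falls into the lower category.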

In some embodiments, after determining that the website to be tested is a phishing website or a suspected phishing website, the method further includes: recording the URL of the website to be tested and the plurality of feature information of the website to be tested in the intelligence library.

In a second aspect, an embodiment of the present application provides a detection apparatus for a phishing website, including: the first acquisition module is used for acquiring a plurality of characteristic information of the website to be detected; the second acquisition module is used for acquiring a confidence value and a weight value corresponding to each piece of feature information; the first determining module is used for determining the weighted confidence value of the to-be-detected website according to the confidence values and the weight values respectively corresponding to the plurality of characteristic information; and the second determining module is used for determining the website to be tested as a phishing website under the condition that the weighted confidence value is greater than a preset threshold value.

In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for detecting a phishing website as described in the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for detecting phishing websites as described in the first aspect above.

Compared with the related art, the phishing website detection method, the phishing website detection device, the computer equipment and the computer readable storage medium provided by the embodiment of the application acquire a plurality of characteristic information of a website to be detected; acquiring a confidence value and a weight value corresponding to each feature information; determining a weighted confidence value of the to-be-detected website according to the confidence values and the weight values corresponding to the plurality of feature information respectively; and under the condition that the weighted confidence value is greater than the preset threshold value, determining that the website to be detected is the phishing website, solving the problem of low accuracy of the detection result of the phishing website in the related technology, and improving the accuracy of the detection result of the phishing website.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flow chart of a phishing website detection method according to an embodiment of the application;

FIG. 2 is a schematic diagram illustrating an association relationship between a plurality of feature information in a website to be tested according to an embodiment of the present application;

FIG. 3 is a flow chart illustrating a method for detecting phishing websites in accordance with a preferred embodiment of the present application;

FIG. 4 is a block diagram of a phishing website detection apparatus according to an embodiment of the present application;

FIG. 5 is a hardware configuration diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any creative effort belong to the protection scope of the present application.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.

The embodiment provides a detection method of a phishing website. Fig. 1 is a flowchart of a phishing website detection method according to an embodiment of the application, and as shown in fig. 1, the flowchart includes the following steps:

Step S101, acquiring a plurality of pieces of feature information of the website to be tested. The feature information includes an IP, a domain name and a file Hash value.

An IP has a form such as 192.168.1.1 and is the only addressing mode by which a user client (e.g. a browser) locates the target server. The IP of a server on the internet is globally unique and is either IPv4 or IPv6; this embodiment is applicable to both types of IP.

A domain name has a form such as www.baidu.com. The domain name is resolved to an IP by the DNS service, after which a request for service resources is issued. The domain name and the IP are therefore associated, and one domain name can be configured to resolve to several IPs at the same time, producing a one-to-many relationship. In the historical dimension, the server at a given IP may have changed hands and been resolved from several domain names, producing a many-to-many relationship.

The file Hash value is a character string, consisting of letters and/or digits, obtained by applying a message digest algorithm to the contents of a file. Under current conditions the computed Hash value corresponds one-to-one with a particular file, so it can serve as a unique identifier that distinguishes the file from other files. This embodiment may use a message digest algorithm such as SHA (Secure Hash Algorithm) or MD5 (Message-Digest Algorithm 5) to obtain the file Hash value. Different message digest algorithms produce digests of different lengths, and the longer the digest, the more unique the file feature. Of the two algorithms above, SHA-256 is preferred: an MD5 string is relatively short, so two different files may share the same MD5 value, whereas choosing an algorithm with a much longer string wastes storage. This embodiment therefore adopts SHA-256 to balance collision resistance against storage space.
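As an illustration of obtaining a SHA-256 file Hash value, the following minimal sketch uses Python's standard hashlib module. The function name file_sha256 and the chunk size are assumptions made for this example and are not part of the application.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 digest of a file as a 64-character hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large executables need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For comparison, an MD5 digest is 32 hexadecimal characters long, half the length of the SHA-256 digest returned here.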

Step S102, acquiring a confidence value and a weight value corresponding to each piece of feature information. In this embodiment the feature information consists of intrusion threat indicators, and each intrusion threat indicator is assigned a confidence value and a weight value. The confidence value represents how accurate the judgment is that the website to be tested is a phishing website on the basis of a given intrusion threat indicator, and the weight value represents how important that indicator is to the judgment.

Step S103, determining the weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the plurality of pieces of feature information. In this embodiment, the product of the confidence value of each piece of feature information and its weight ratio is taken as a single weighted term, and the weighted confidence value of the website to be tested is obtained by summing these terms. The weight ratio is the ratio of the weight value of a piece of feature information to the sum of the weight values of all pieces of feature information.
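The calculation in step S103 can be sketched as follows. The function name weighted_confidence and the sample (confidence, weight) pairs are hypothetical and chosen only to make the arithmetic concrete; returning 0.0 for an empty input is likewise an assumption.

```python
from typing import Iterable, Tuple


def weighted_confidence(features: Iterable[Tuple[float, float]]) -> float:
    """Sum c * (w / total weight) over all (confidence, weight) pairs."""
    pairs = list(features)
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0  # assumption: no feature information means no evidence
    return sum(c * (w / total_weight) for c, w in pairs)


# Hypothetical (confidence, weight) pairs for three pieces of feature information.
example = [(0.9, 8), (0.5, 6), (0.0, 2)]
score = weighted_confidence(example)
```

With these sample values the weights sum to 16, so the result is (0.9 × 8 + 0.5 × 6 + 0.0 × 2) / 16 = 0.6375.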

The related art generally makes a judgment directly on the basis of a single piece of threat intelligence or a single feature, so the error is large. This embodiment considers multiple pieces of feature information together with the confidence and weight of each, and determines the result from the weighted confidence value, which helps reduce the error.

Step S104, determining that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold. The preset threshold may be an empirical value obtained through repeated tests, or may be a preset value.

Through the steps, the problem that the detection result accuracy of the phishing website is low in the related technology is solved, and the accuracy of the detection result of the phishing website is improved.

Malicious files typically operate in either an offline or an online manner. The offline mode aims to damage the user's computer, so that it runs poorly or cannot be used at all. The online mode usually aims to collect user information illegally: it steals data and passwords from the user's computer, locates and connects to a remote server by means of a URL, and uploads the collected user information to the malicious server. The feature information can therefore be acquired piece by piece according to the association relationships among the feature information of the website to be tested.

Fig. 2 is a schematic diagram of an association relationship between a plurality of pieces of feature information in a website to be tested, and as shown in fig. 2, different pieces of feature information have an association relationship with each other and can be mutually converted. The following embodiment will describe how to extract a plurality of pieces of feature information based on the association relationship between the plurality of pieces of feature information.

In some of these embodiments, the plurality of characteristic information includes first characteristic information; the method for acquiring the characteristic information of the website to be tested comprises the following steps: acquiring a URL of a website; obtaining first characteristic information according to the URL, wherein the first characteristic information comprises but is not limited to at least one of the following: IP, domain name, file Hash value of executable file, Whois information.

The first feature information refers to directly related and accurate information obtained from the URL of the website, and the obtaining of the URL of the website and the obtaining of the first feature information will be described below separately.

(1) Acquire the website URL, which has the form <protocol>://[host]:[port]/[path]. The protocol is, for example, http (HyperText Transfer Protocol), https (HyperText Transfer Protocol over Secure Socket Layer) or ftp (File Transfer Protocol). host is the host, which may be a domain name that can be resolved to an IP or may be an IP directly, such as www.baidu.com or 180.101.49.11. port is the port number; common website ports include 80, 443 and 8080, and the default ports 80 and 443 can be omitted. path is the path, e.g., index.html.

Phishing websites generally lure users into clicking a link in a browser, which then triggers malicious behavior. The URL, as the unique identifier of a web page on the internet, can be obtained in two ways. One is to copy it from the address bar at the top of the browser while browsing the web page; the other is to intercept the URLs accessed by users through a firewall device in an automated scenario.

After the URL is acquired, multidimensional feature extraction is performed; the first feature information, which is acquired directly, is denoted with the prefix $base.

(2) Split the URL to obtain $base_protocol, $base_host, $base_port and $base_path. In one implementation, the URL can be split on separators or its parts can be extracted with regular expressions. A regular expression is a matching and extraction technique: for a character string with specified characteristics, it can be used both to judge whether the string matches and to extract specified information from the string.
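The splitting in step (2) can be sketched with Python's standard urllib.parse module instead of hand-written separators or regular expressions. The function name split_url, the dictionary keys (which mirror the $base_ variables of the text) and the DEFAULT_PORTS table are assumptions made for this illustration.

```python
from urllib.parse import urlsplit

# Assumed defaults used when the URL omits the port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def split_url(url: str) -> dict:
    """Split a URL into protocol, host, port and path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return {
        "base_protocol": scheme,
        "base_host": parts.hostname,
        "base_port": port,
        "base_path": parts.path.lstrip("/"),
    }


info = split_url("http://www.baidu.com:8080/index.html")
```

Here info holds "http", "www.baidu.com", 8080 and "index.html" under the four keys; for a URL without an explicit port, the default for its protocol is filled in.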

(3) Extract the domain name and the IP. If $base_host is a domain name, then $base_domain = $base_host. DNS (Domain Name System) resolution is then performed on $base_domain, either with the nslookup command on Windows or Linux or with an online tool. The resolution may return multiple IPs, which are recorded as $base_ip. If $base_host is an IP, then $base_ip = $base_host.
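The branch in step (3) between a domain-name host and an IP host can be sketched as follows. The sketch uses the system resolver through socket.getaddrinfo in place of nslookup; the function names and the injectable resolver parameter are assumptions introduced here so that the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, List, Optional, Tuple


def resolve_with_system_dns(domain: str) -> List[str]:
    """Resolve a domain name to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(domain, None)
    return sorted({info[4][0] for info in infos})


def extract_domain_and_ips(
    base_host: str,
    resolver: Callable[[str], List[str]] = resolve_with_system_dns,
) -> Tuple[Optional[str], List[str]]:
    """Return (base_domain, list of base_ip) for a host taken from a URL."""
    try:
        ip = ipaddress.ip_address(base_host)
    except ValueError:
        # Not an IP literal: treat the host as a domain name and resolve it.
        return base_host, resolver(base_host)
    # The host is already an IP (IPv4 or IPv6), so there is no domain name.
    return None, [str(ip)]
```

Calling extract_domain_and_ips("180.101.49.11") returns no domain name and the IP itself, while a domain-name host is passed to the resolver and may yield several IPs.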

(4) Extract Whois information. Whois information is the registered-owner information of a domain name; the owner is typically a business or an individual. It can be used to query information such as the registration time, the expiration time and the registrant's mailbox. The query can be performed with the whois command on Linux or Windows, or with an online tool, and the result is recorded as $base_whois.

(5) Extract the Hash value of files in the web page. Financial websites, especially online banking transaction websites, generally provide security plug-ins for users to download and install in order to protect information security. Counterfeit phishing websites also provide a 'secret shield' download link, but the executable file they offer is likely to be a virus or Trojan horse program. In some embodiments, the HTML (HyperText Markup Language) code of a web page of the website to be tested may be obtained by requesting the URL over HTTP or by rendering it in a browser; the executable file download link is retrieved from the HTML code, the downloaded executable file is recorded as $base_file, and the file Hash value of the executable file is calculated and recorded as $base_hash.
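Retrieving executable file download links from the HTML code, as in step (5), might look like the sketch below. It uses Python's standard html.parser module and inspects only anchor tags; the list of suffixes in EXECUTABLE_SUFFIXES and the function and class names are assumptions, and downloading the file and computing its Hash value are left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlsplit

# Assumed set of suffixes treated as executable downloads.
EXECUTABLE_SUFFIXES = (".exe", ".msi", ".apk", ".dmg")


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_executable_links(html: str, page_url: str) -> List[str]:
    """Return absolute URLs of links whose path ends in an executable suffix."""
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlsplit(absolute).path.lower().endswith(EXECUTABLE_SUFFIXES):
            links.append(absolute)
    return links
```

Links generated by scripts would not be found this way, which is one reason the text also mentions obtaining the HTML through browser rendering.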

In some of these embodiments, the plurality of characteristic information further includes second characteristic information; the obtaining of the plurality of feature information of the website to be tested further includes, but is not limited to, at least one of the following: acquiring domain name information associated with the IP from an intelligence base and acquiring IP information associated with the domain name, wherein the second characteristic information comprises the domain name information associated with the IP and the IP information associated with the domain name; obtaining IP and domain name information associated with the Whois information through a Whois back-check technology, wherein the second characteristic information comprises the IP and domain name information associated with the Whois information; and acquiring IP and domain name information connected back to the executable file through an executable file dynamic analysis technology, wherein the second characteristic information comprises the IP and domain name information connected back to the executable file.

The second feature information is information that is indirectly related and coarser than the first feature information. After the URL is acquired, multidimensional feature extraction is performed, and the indirectly acquired information is denoted with the prefix $ext. The acquisition of the second feature information is described below.

(6) Extract historically resolved IPs and domain names. A domain name or IP has been obtained in the preceding steps. The binding between a domain name and an IP is not fixed: a domain name owner can change the IP that the domain name points to at any time, and the same IP can be leased to different users at different times and bound to different domain names. In some embodiments, the domain name information associated with the IP and the IP information associated with the domain name may be obtained from an intelligence library according to the binding relationship between domain names and IPs, where the intelligence library includes a threat intelligence library; this information may also be retrieved from information-listing websites. The historically resolved domain names and IPs are recorded as $ext_domain and $ext_ip, respectively.

(7) Perform a Whois back-check (reverse lookup). An individual or an organization can register multiple domain names with a registration authority. Using the Whois back-check technique, more domain names registered under the same individual or organization can be obtained from the Whois information and recorded as $ext_domain; these domain names can in turn be resolved to their corresponding IPs, which are recorded as $ext_ip.

(8) Extract the server address from the executable file. If the executable program is malicious, it may operate in online mode, collecting the user's sensitive data and uploading it to a malicious server. Using an executable file dynamic analysis technique, the domain name or IP that the executable file connects back to is extracted and recorded as $ext_domain or $ext_ip. Executable file dynamic analysis involves techniques such as binary analysis, disassembly and sandboxing.

Through the above steps, the directly related, accurate $base_ data and the indirectly related, coarse $ext_ data have been acquired. Table 1 summarizes the two types of feature information obtained by the above steps.

TABLE 1 direct correlation and indirect correlation of characteristic information

The following examples will be combined with the two types of characteristic information given in table 1 for further matching analysis.

In some embodiments, obtaining the confidence value and the weight value corresponding to each feature information includes: matching a plurality of characteristic information of the website to be tested with preset characteristic information in an information library; and obtaining a confidence value and a weight value corresponding to each feature information in the website to be tested according to the matching result.

The intelligence library in this embodiment includes a threat intelligence library, which may comprise a self-built intelligence library and third-party intelligence libraries. What they have in common is that they record, along similar dimensions, core intrusion threat indicators with malicious behavior, such as domain names, IPs and file Hash values. In addition to these core intrusion threat indicators, in some embodiments the self-built intelligence library may also hold Whois information, historically resolved domain name and IP information, intelligence confidence and malicious URLs, which assist in acquiring the plurality of pieces of feature information in the previous steps and in the subsequent weighted judgment of that feature information. The executable file acquired in the previous steps can also be uploaded to the self-built intelligence library to analyze whether it exhibits malicious behavior.

How to perform the weighting determination on the acquired plurality of feature information will be described below.

(1) Principle: when the first feature information ($base_) hits the intelligence library, a high weight value is assigned to its confidence value; when the second feature information ($ext_) hits the intelligence library, a low weight value is assigned to its confidence value.

(2) Formula: let the confidence value be c, with 0 ≤ c ≤ 1; let the weight value be w, with 1 ≤ w ≤ 10; and let the judgment score be S, the weighted confidence value, with 0 ≤ S ≤ 1. Using a weighted average algorithm,

$$S = \frac{\sum_{i=1}^{n} c_i w_i}{\sum_{i=1}^{n} w_i}$$

where n is a natural number (the number of pieces of feature information being weighted), and c_i and w_i are the confidence value and the weight value of the i-th piece.
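The weighted average described above can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `weighted_confidence`, the representation of each piece of feature information as a (c, w) tuple, and the choice to return 0 when no feature information is supplied are assumptions made here.

```python
from typing import Iterable, Tuple


def weighted_confidence(items: Iterable[Tuple[float, float]]) -> float:
    """Return S = sum(c_i * w_i) / sum(w_i) over (c, w) pairs."""
    pairs = list(items)
    if not pairs:
        # Assumption: with no feature information there is no evidence, so S = 0.
        return 0.0
    for c, w in pairs:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence value out of range [0, 1]: {c}")
        if not 1.0 <= w <= 10.0:
            raise ValueError(f"weight value out of range [1, 10]: {w}")
    total_weight = sum(w for _, w in pairs)
    return sum(c * w for c, w in pairs) / total_weight
```

Because every c lies in [0, 1] and every w is positive, the result also lies in [0, 1], which agrees with the stated range of S.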

(3) First feature information weighting: $base_domain, $base_ip, $base_whois, $base_hash and $base_path are items of information that the intelligence library can record.

$base_domain, $base_ip and $base_hash are core matching content. A match on this type of feature information is highly accurate, so exact matching can be used: $base_domain must be exactly equal to a domain in the intelligence library, and likewise for $base_ip. A file hash can be expressed with different algorithms; this embodiment uses SHA256 and matches exactly against the SHA256 values in the intelligence library. For matched data, c is taken from the confidence field preset in the intelligence library; this field holds a confidence value reflecting how credible a given piece of feature information is under current conditions. For unmatched data, c is 0. The weight w is 8.

$base_whois mainly matches the registrant and the registrant's contact information, and exact matching can be used. For matched data, c is taken from the confidence field preset in the intelligence library; for unmatched data, c is 0. The weight w is 6.

$base_path is matched by containment: if it is contained, c is taken from the confidence field preset in the intelligence library; if not, c is 0. The weight w is 4.

It should be noted that the value of c is determined by the confidence field preset in the intelligence library, and the confidence value recorded there for a given piece of feature information is calculated from the credibility of that feature information under current conditions. The value of w can be taken from a preset value in the intelligence library and can also be adjusted as appropriate according to the detection accuracy of the feature information.
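The matching rules for the first feature information can be sketched as below. The in-memory dictionary `INTEL`, its sample indicators and confidence values, and the helper names are hypothetical; a real intelligence library would be a database or service. The direction of the containment test for $base_path (the URL path contains a recorded path) is also an assumption, since the text only says the match is by containment.

```python
import hashlib
from typing import Dict, Tuple

# Weights stated in the text for each kind of first feature information.
WEIGHTS: Dict[str, int] = {"domain": 8, "ip": 8, "hash": 8, "whois": 6, "path": 4}


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA256 digest of an executable file's bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical intelligence library: kind -> {indicator: confidence value}.
INTEL: Dict[str, Dict[str, float]] = {
    "domain": {"phish.example": 0.9},
    "ip": {"203.0.113.7": 0.8},
    "hash": {sha256_hex(b"fake-installer"): 1.0},
    "whois": {"registrant@phish.example": 0.7},
    "path": {"/login/verify.php": 0.6},
}


def match_first_feature(kind: str, value: str) -> Tuple[float, int]:
    """Return (c, w) for one piece of first feature information."""
    records = INTEL[kind]
    if kind == "path":
        # Containment match; assumed direction: the URL path contains a recorded path.
        c = max((conf for rec, conf in records.items() if rec in value), default=0.0)
    else:
        # Exact match for domain, IP, SHA256 hash and Whois data.
        c = records.get(value, 0.0)
    return c, WEIGHTS[kind]
```

An unmatched item still returns its weight with c = 0, so it lowers the weighted average instead of being ignored.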

(4) Second feature information weighting: the domain names and IPs are obtained through historical resolution records or through analysis of the executable file's network behavior. Compared with the first feature information obtained from the URL, the second feature information is slightly less relevant and larger in volume, so this embodiment assigns it a lower weight value. When a piece of second feature information matches an intrusion threat indicator recorded in the intelligence library, c is taken from the confidence field preset in the intelligence library; otherwise c is 0. The weight w is 2. Each matched intrusion threat indicator is used as one term in the sums of the weighted average.
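One possible reading of the second feature information rule is sketched below. The function name and the flat `intel` dictionary are hypothetical. The text does not say exactly how unmatched $ext_ items enter the sum, so the fallback of a single (0, w) term when nothing matches is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

EXT_WEIGHT = 2  # weight stated in the text for second feature information


def second_feature_terms(
    ext_indicators: Iterable[str], intel: Dict[str, float]
) -> List[Tuple[float, int]]:
    """Return one (c, w) term per $ext_ indicator found in the intelligence library."""
    terms = [(intel[ind], EXT_WEIGHT) for ind in ext_indicators if ind in intel]
    # Assumption: when nothing matches, a single (0, w) term stands for the whole group.
    return terms or [(0.0, EXT_WEIGHT)]
```

Under this reading, a site whose associated domains and IPs hit the library many times gains several low-weight terms, while unrelated associated hosts do not dilute the score.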

HTTPS encryption of network traffic is now widespread, and financial-industry websites in particular preferentially adopt the https protocol; mainstream browsers such as Google Chrome mark websites that do not use https as insecure. An https certificate must be issued by a certificate authority to be trusted by the browser; a certificate that is untrusted or inconsistent with the website may be rejected by the browser.

Therefore, the network transmission protocol and the port information can also be used as feature information for the weighted determination. In some embodiments, obtaining the confidence value and the weight value corresponding to each piece of feature information further includes: determining third feature information from the plurality of feature information, wherein the third feature information includes, but is not limited to, at least one of: network transmission protocol information and port information acquired from the URL; and obtaining a confidence value and a weight value corresponding to the third feature information according to the third feature information.

The third feature information refers to non-intelligence feature information, i.e. the network transmission protocol information and port information obtained from the URL.

If the website to be tested uses only the http protocol, c = 1 and w = 2. If the website to be tested uses the https protocol but does not have a certificate issued by a certificate authority, it is judged untrustworthy, and c = 1 and w = 3.

Users typically access websites through meaningful, memorable domain names and through default ports that need not be typed: the http protocol typically uses port 80 and the https protocol typically uses port 443. If the website to be tested can be accessed only by IP address, c = 1 and w = 3. If the website to be tested uses a non-default port, such as 8080, 9090 or 8888, c = 1 and w = 3.
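The non-intelligence rules above can be sketched as a small function that turns protocol, host and port into (c, w) terms. The function names are hypothetical, and two simplifications are assumed: "uses only http" is approximated by the URL scheme being http, and "accessed only by IP" by the host being an IP literal. Certificate validation is not shown; the caller supplies the result as `cert_trusted`.

```python
import ipaddress
from typing import List, Optional, Tuple

DEFAULT_PORTS = {"http": 80, "https": 443}


def is_ip_literal(host: str) -> bool:
    """True when the host is an IPv4 or IPv6 address rather than a domain name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def third_feature_terms(
    scheme: str, host: str, port: Optional[int], cert_trusted: bool
) -> List[Tuple[float, int]]:
    """Return the (c, w) terms triggered by protocol, host and port."""
    terms: List[Tuple[float, int]] = []
    if scheme == "http":
        terms.append((1.0, 2))  # plain http
    elif scheme == "https" and not cert_trusted:
        terms.append((1.0, 3))  # https without a certificate from an authority
    if is_ip_literal(host):
        terms.append((1.0, 3))  # reached by IP instead of a domain name
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        terms.append((1.0, 3))  # non-default port such as 8080, 9090 or 8888
    return terms
```

The text does not state what a site contributes when none of these conditions holds; this sketch returns no terms in that case.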

In some embodiments, after determining the weighted confidence value of the to-be-tested website according to the confidence value and the weight value corresponding to the plurality of feature information, the method further includes: judging whether the weighted confidence value is larger than a first preset threshold value or not; and under the condition that the weighted confidence value is larger than the first preset threshold value, determining that the website to be tested is a phishing website, and refusing to access the website to be tested. Judging whether the weighted confidence value is larger than a second preset threshold value or not; and under the condition that the weighted confidence value is larger than a second preset threshold value, determining that the website to be detected is a suspected phishing website, and sending alarm information for indicating that the website to be detected is the suspected phishing website.

The pieces of feature information are weighted together: their confidence values and weight values are substituted into the weighted average formula to obtain the score S, S is graded against preset thresholds, and a strategy for the website to be tested is adopted according to the grade.
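As a purely hypothetical illustration (the three pieces of feature information and their values are assumed here, not taken from the embodiment), suppose $base_domain matches with c = 0.9 and w = 8, $base_whois does not match so that c = 0 and w = 6, and the website uses only the http protocol so that c = 1 and w = 2. Taking the weighted average as the sum of the products c·w divided by the sum of the weights w gives:

$$S = \frac{0.9 \times 8 + 0 \times 6 + 1 \times 2}{8 + 6 + 2} = \frac{9.2}{16} = 0.575$$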

The first preset threshold may be 0.8: if S ≥ 0.8, the website to be tested is determined to be a phishing website, and user access can be denied directly.

The second preset threshold may be 0.6: if S ≥ 0.6, the website to be tested is determined to be a suspected phishing website, and alarm information can be issued to warn the user that the website may be fraudulent and should be judged with care.

In some embodiments, after determining that the website to be tested is a phishing website or a suspected phishing website, the method further includes: recording the URL of the website to be tested and the plurality of feature information of the website to be tested into the intelligence library.

In this embodiment, a third preset threshold may also be set, for example 0.4: if S ≥ 0.4, the feature information of the website to be tested is recorded in the suspicious intelligence library for further verification and for later tuning of the algorithm's parameters. If S < 0.4, the feature information acquired from the website to be tested is discarded.
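The three-level grading can be sketched as follows. The function name `grade` and the returned labels are hypothetical; the default thresholds are the example values 0.8, 0.6 and 0.4 given in the text.

```python
def grade(s: float, block: float = 0.8, warn: float = 0.6, record: float = 0.4) -> str:
    """Map a weighted confidence value S to a handling grade."""
    if not 0.0 <= s <= 1.0:
        raise ValueError(f"S out of range [0, 1]: {s}")
    if s >= block:
        return "phishing"    # deny access
    if s >= warn:
        return "suspected"   # issue alarm information
    if s >= record:
        return "suspicious"  # record feature information for later verification
    return "discard"         # drop the collected feature information
```

The comparisons use ≥, following the numeric examples above; the general description of the thresholds speaks of the value being "larger than" the threshold, so an implementation would need to settle which boundary behavior is intended.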

The open-source or commercial threat intelligence adopted in the related art generally detects websites offline; because intrusion threat indicators lag behind actual threats, the false alarm rate gradually rises over time. This embodiment uses a self-built intelligence library that holds intrusion threat indicators, and writes suspicious information detected on the current website to be tested back into the library to update those indicators. This addresses the lag of the intrusion threat indicators and helps reduce the false alarm rate; it also enriches the intelligence library and improves the accuracy and breadth of determination, forming a virtuous circle.
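The write-back loop can be sketched as below. The function name, the flat `intel` dictionary, and in particular the policy of seeding a new indicator's confidence with the site's score S are assumptions; the text states only that suspicious information is recorded back into the library.

```python
from typing import Dict, Iterable


def record_back(
    intel: Dict[str, float], indicators: Iterable[str], s: float, min_score: float = 0.4
) -> int:
    """Write indicators of a suspicious site into the library; return how many were added."""
    if s < min_score:
        return 0  # below the recording threshold: the feature information is discarded
    added = 0
    for ind in indicators:
        if ind not in intel:
            # Assumption: a new indicator starts with the site's score S as its confidence.
            intel[ind] = s
            added += 1
    return added
```

Existing entries are left untouched in this sketch, so a previously verified confidence value is not overwritten by a single new observation.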

The embodiments of the present application are described and illustrated below by means of preferred embodiments.

Fig. 3 is a flowchart illustrating a method for detecting phishing websites according to a preferred embodiment of the present application, and as shown in fig. 3, the flowchart includes the following steps:

Step S301, obtaining the URL of the website.

Step S302, splitting the URL to obtain $base_protocal, $base_host, $base_port and $base_path.

Step S303, directly obtaining, or obtaining by analysis, $base_domain, $base_ip, $base_whois and $base_hash.

Step S304, indirectly obtaining, or obtaining by analysis, $ext_domain and $ext_ip.

Step S305, matching against the intelligence library to obtain the confidence value c and the weight value w; for non-intelligence feature information, determining the confidence value c and the weight value w directly.

Step S306, calculating the weighted confidence value with the weighted average algorithm, that is, calculating S using the following formula:

$$S = \frac{\sum_{i=1}^{n} c_i w_i}{\sum_{i=1}^{n} w_i}$$

Step S307, making an access restriction decision according to the score S.

Step S308, recording the website information of phishing websites or suspected phishing websites in the intelligence library.
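The URL splitting of step S302 can be sketched with Python's standard `urllib.parse` module. The function name `split_url` and the dictionary keys are hypothetical, and the choice to leave the query string out of the path is an assumption, since the text does not say whether $base_path includes it.

```python
from typing import Dict, Union
from urllib.parse import urlsplit


def split_url(url: str) -> Dict[str, Union[str, int, None]]:
    """Split a URL into the protocol, host, port and path used as feature information."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,         # None when the URL omits the port
        "path": parts.path or "/",  # an empty path is treated as the root
    }
```

A missing port is kept as None instead of being replaced by 80 or 443, so that later steps can still distinguish an omitted default port from an explicitly typed one.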

In some embodiments, the IOC information (the intrusion threat indicators) in the intelligence library may be issued directly to a Web (World Wide Web) firewall, which intercepts traffic directly when it encounters matching feature information, thereby achieving accurate interception.

Compared with the related art, the embodiment of the application has the following advantages:

(1) The confidence values corresponding to the intrusion threat indicators, built into the self-built intelligence library, help judge how suspicious the target website is.

(2) Directly associated information that can be matched against the intelligence library is extracted from the URL of the website to be tested, and additional associated information is obtained from the attributes of that directly associated information, so that the judgment is comprehensive and the matching is multi-dimensional. Compared with the traditional approach, phishing websites can be found more accurately, which helps prevent users from being deceived by fake websites and protects normal economic order.

(3) The confidence values of the various pieces of feature information are weighted, and the detection result is finally determined from the weighted value, which solves the problem of large errors caused by judging directly from a single piece of threat intelligence or a single feature.

(4) Detection results can be stored to expand the data of the base library and used in judging subsequent phishing websites; false alarms can be addressed by adjusting the weight values.

This embodiment also provides a phishing website detection apparatus, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the terms "module," "sub-module," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 4 is a block diagram illustrating a configuration of a phishing website detection apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus includes:

A first obtaining module 41, configured to obtain a plurality of feature information of a website to be tested.

A second obtaining module 42, coupled to the first obtaining module 41 and configured to obtain a confidence value and a weight value corresponding to each piece of feature information.

A first determining module 43, coupled to the second obtaining module 42 and configured to determine a weighted confidence value of the website to be tested according to the confidence values and the weight values respectively corresponding to the plurality of feature information.

A second determining module 44, coupled to the first determining module 43 and configured to determine that the website to be tested is a phishing website if the weighted confidence value is greater than the preset threshold.
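How the four modules chain together can be sketched as a small Python class. The class name `PhishingDetector`, the callable-based wiring, and the default threshold of 0.8 (the example value given for the first preset threshold) are assumptions for illustration; the embodiment does not prescribe this structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Scored = Tuple[float, float]  # (confidence value c, weight value w)


@dataclass
class PhishingDetector:
    get_features: Callable[[str], List[str]]  # role of the first obtaining module 41
    score_feature: Callable[[str], Scored]    # role of the second obtaining module 42
    threshold: float = 0.8                    # preset threshold (example value)

    def weighted_confidence(self, scored: List[Scored]) -> float:
        """Role of the first determining module 43: weighted average of c by w."""
        total = sum(w for _, w in scored)
        return sum(c * w for c, w in scored) / total if total else 0.0

    def is_phishing(self, url: str) -> bool:
        """Role of the second determining module 44: compare S with the threshold."""
        scored = [self.score_feature(f) for f in self.get_features(url)]
        return self.weighted_confidence(scored) > self.threshold
```

Passing the feature extraction and scoring steps in as callables mirrors the coupling between modules: each stage consumes only the output of the stage before it, so either stage can be replaced without touching the others.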

In some of these embodiments, the apparatus further comprises: a first obtaining sub-module, configured to obtain the URL of the website; and a second obtaining sub-module, configured to obtain first feature information according to the URL, wherein the first feature information includes, but is not limited to, at least one of the following: IP, domain name, file hash value of an executable file, and Whois information.

In some of these embodiments, the apparatus further comprises: a third obtaining sub-module, configured to obtain, from the intelligence library, domain name information associated with the IP and IP information associated with the domain name, wherein the second feature information includes the domain name information associated with the IP and the IP information associated with the domain name; a Whois back-check module, configured to obtain IP and domain name information associated with the Whois information through a Whois back-check technique, wherein the second feature information includes the IP and domain name information associated with the Whois information; and an executable file dynamic analysis module, configured to obtain, through an executable file dynamic analysis technique, the IP and domain name information that the executable file connects back to, wherein the second feature information includes that IP and domain name information.

In some of these embodiments, the apparatus further comprises: a matching module, configured to match the plurality of feature information of the website to be tested against preset feature information in the intelligence library, and to obtain, according to the matching result, a confidence value and a weight value corresponding to each piece of feature information of the website to be tested.

In some of these embodiments, the apparatus further comprises: a determining sub-module, configured to determine third feature information from the plurality of feature information, wherein the third feature information includes, but is not limited to, at least one of: network transmission protocol information and port information acquired from the URL; and a fourth obtaining sub-module, configured to obtain the confidence value and the weight value corresponding to the third feature information according to the third feature information.

In some of these embodiments, the apparatus further comprises: a first judgment module, configured to judge whether the weighted confidence value is greater than a first preset threshold and, if so, to determine that the website to be tested is a phishing website and deny access to it; and a second judgment module, configured to judge whether the weighted confidence value is greater than a second preset threshold and, if so, to determine that the website to be tested is a suspected phishing website and send alarm information indicating that the website to be tested is a suspected phishing website.

In some embodiments, the apparatus further comprises: a recording module, configured to record the URL of the website to be tested and the feature information of the website to be tested into the intelligence library.

The above modules may be functional modules or program modules, and may be implemented by software or hardware. Modules implemented by hardware may all be located in the same processor, or may be distributed across different processors in any combination.

In addition, the method for detecting phishing websites in the embodiment of the present application described in conjunction with fig. 1 can be implemented by a computer device. Fig. 5 is a hardware structure diagram of a computer device according to an embodiment of the present application.

The computer device may comprise a processor 51 and a memory 52 in which computer program instructions are stored.

Specifically, the processor 51 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.

Memory 52 may include mass storage for data or instructions. By way of example and not limitation, memory 52 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 52 may include removable or non-removable (fixed) media, where appropriate. Memory 52 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, memory 52 is non-volatile memory. In particular embodiments, memory 52 includes read-only memory (ROM) and random access memory (RAM).

The memory 52 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 51.

The processor 51 may read and execute the computer program instructions stored in the memory 52 to implement any one of the phishing website detection methods in the above embodiments.

In some of these embodiments, the computer device may also include a communication interface 53 and a bus 50. As shown in fig. 5, the processor 51, the memory 52, and the communication interface 53 are connected via the bus 50 and communicate with one another over it.

The communication interface 53 is used to implement communication among the modules, apparatuses, units and/or devices in the embodiments of the present application. The communication interface 53 may also carry data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.

Bus 50 comprises hardware, software, or both, coupling the components of the computer device to each other. Bus 50 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. Bus 50 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable bus or interconnect is contemplated by the application.

The computer device can execute the phishing website detection method in the embodiment of the application based on the acquired URL of the website to be detected, so that the phishing website detection method described in conjunction with fig. 1 is realized.

In addition, in combination with the phishing website detection method of the above embodiments, an embodiment of the present application may provide a computer-readable storage medium for implementing it. The computer-readable storage medium stores computer program instructions which, when executed by a processor, implement any one of the phishing website detection methods of the above embodiments.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A phishing website detection method is characterized by comprising the following steps:

acquiring a plurality of characteristic information of a website to be tested;

acquiring a confidence value and a weight value corresponding to each feature information;

determining a weighted confidence value of the website to be tested according to the confidence values and the weight values corresponding to the characteristic information respectively;

and under the condition that the weighted confidence value is larger than a preset threshold value, determining that the website to be tested is a phishing website.

2. A phishing website detection method as claimed in claim 1, wherein said plurality of feature information includes first feature information; the method for acquiring the characteristic information of the website to be tested comprises the following steps:

acquiring the URL of the website;

acquiring the first characteristic information according to the URL, wherein the first characteristic information comprises at least one of the following: IP, domain name, file Hash value of executable file, Whois information.

3. A phishing website detection method as claimed in claim 2, wherein said plurality of feature information further comprises second feature information; the acquiring of the plurality of feature information of the website to be tested further comprises at least one of the following steps:

acquiring domain name information associated with the IP from an intelligence library, and acquiring IP information associated with the domain name, wherein the second characteristic information comprises the domain name information associated with the IP and the IP information associated with the domain name;

acquiring IP and domain name information associated with the Whois information by a Whois back-check technology, wherein the second characteristic information comprises the IP and domain name information associated with the Whois information;

and acquiring, through an executable file dynamic analysis technology, IP and domain name information that the executable file connects back to, wherein the second characteristic information comprises the IP and domain name information that the executable file connects back to.

4. A phishing website detection method as claimed in claim 3, wherein obtaining a confidence value and a weight value corresponding to each feature information comprises:

matching the plurality of characteristic information of the website to be tested with preset characteristic information in the intelligence library;

and obtaining a confidence value and a weight value corresponding to each feature information in the website to be tested according to the matching result.

5. The method for detecting the phishing website as claimed in claim 4, wherein the obtaining the confidence value and the weight value corresponding to each feature information further comprises:

determining third feature information from the plurality of feature information, wherein the third feature information comprises at least one of: network transmission protocol information and port information acquired from the URL;

and acquiring a confidence value and a weight value corresponding to the third characteristic information according to the third characteristic information.

6. A phishing website detection method as claimed in claim 3, wherein after determining the weighted confidence value of the website to be detected according to the confidence value and the weight value corresponding to each of the plurality of feature information, the method further comprises:

judging whether the weighted confidence value is larger than a first preset threshold value or not; determining the website to be tested as a phishing website and refusing to access the website to be tested under the condition that the weighted confidence value is judged to be larger than a first preset threshold value;

judging whether the weighted confidence value is larger than a second preset threshold value or not; and under the condition that the weighted confidence value is larger than a second preset threshold value, determining that the website to be detected is a suspected phishing website, and sending alarm information for indicating that the website to be detected is the suspected phishing website.

7. The method for detecting phishing websites of claim 6, wherein after determining that the website to be detected is a phishing website or a suspected phishing website, the method further comprises:

and recording the URL of the website to be tested and a plurality of characteristic information in the website to be tested into the intelligence library.

8. A phishing website detection apparatus, comprising:

the first acquisition module is used for acquiring a plurality of characteristic information of the website to be detected;

the second acquisition module is used for acquiring a confidence value and a weight value corresponding to each piece of feature information;

the first determining module is used for determining the weighted confidence value of the to-be-detected website according to the confidence values and the weight values respectively corresponding to the plurality of characteristic information;

and the second determining module is used for determining the website to be tested as a phishing website under the condition that the weighted confidence value is greater than a preset threshold value.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of detecting phishing websites of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing a method of detecting a phishing website as claimed in any one of claims 1 to 7.