CN115001763B

CN115001763B - Phishing website attack detection method and device, electronic equipment and storage medium

Info

Publication number: CN115001763B
Application number: CN202210553089.9A
Authority: CN
Inventors: 杨鹤; 徐自全
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2024-03-19
Anticipated expiration: 2042-05-20
Also published as: CN115001763A

Abstract

Embodiments of the disclosure provide a phishing website attack detection method and device, electronic equipment, and a storage medium. The phishing website attack detection method comprises the following steps: acquiring URL data information; respectively extracting URL data, webpage content data and third party information data from the URL data information; extracting features based on the URL data, the webpage content data and the third party information data to obtain extracted features; obtaining algorithm disturbance features based on the extracted features; fusing the extracted features with the algorithm disturbance features to obtain a feature data set; and detecting a phishing website URL based on the feature data set and a machine learning detection engine. The method increases feature characterization capability so that a phishing website can be characterized comprehensively, which prevents phishing websites from bypassing detection and achieves comprehensive detection.

Description

Phishing website attack detection method and device, electronic equipment and storage medium

Technical Field

The disclosure relates to the field of network security, and in particular relates to a phishing website attack detection method, a phishing website attack detection device, electronic equipment and a storage medium.

Background

With the popularization and development of services such as mobile payment and online banking, people's lives depend more and more on the Internet. Network attacks aimed at user information have become more varied and extremely harmful, threatening the information security of network users. Phishing, a form of social engineering attack, is highly deceptive and well disguised, so victims unknowingly leak important personal information such as bank account passwords.

Phishing generally refers to messages, sent by threat actors, that contain hyperlinks directed to malicious websites; the threat actors attempt to obtain sensitive information from users who click the links. Although most end users remain careful about suspicious links, a small number of end users still cannot distinguish phishing websites, which results in potential data loss. Phishing has become one of the highest-risk attacks. Despite the various anti-phishing approaches, phishing attacks have still caused significant losses in the last few years, and phishing detection remains a hot topic in cyberspace security.

Phishing website detection currently follows roughly three directions: blacklist and whitelist detection, visual similarity detection, and machine learning algorithm detection.

Blacklist and whitelist detection: security personnel formulate blacklist and whitelist detection rules according to the characteristic patterns of data such as phishing website URLs. In blacklist detection, URL data that matches a blacklist rule is phishing website attack data, and URL data that matches no blacklist rule is normal data. Whitelist detection is the opposite: URL data that matches a whitelist rule is normal data, and URL data that matches no whitelist rule is phishing website attack data.

Visual similarity detection: phishing websites mainly alter webpage content by replacing text with pictures, replacing webpages with screenshots, and the like. A visual similarity detection algorithm calculates a fuzzy similarity index to evaluate the similarity between the current page and the phishing websites in an existing blacklist, and thereby judges whether the current webpage exhibits phishing website attack behavior.

Machine learning algorithm detection: features are extracted and then classified using machine learning models such as decision trees, support vector machines, Bayesian classifiers, and multi-layer neural networks.

The following problems are found in the prior art in practicing embodiments of the present disclosure by the inventors:

Phishing website detection based on blacklists and whitelists depends on the size, scope, update rate and frequency, accuracy, and other characteristics of the lists. However, many 0-day phishing websites survive only a very short time and have already profited and vanished before they are blacklisted. List-based detection therefore lags considerably, which hinders timely detection of phishing websites and the reduction of losses.

Vision-based detection is slow, which hinders promptly warning the user.

Machine learning algorithms detect sensitive features of phishing website URLs well and can greatly improve detection accuracy, but they require experienced security experts to select features manually. The features used in the prior art are not comprehensive enough, so a phishing website can still bypass detection and carry out an attack. Security experts must also update and add features in response to application system upgrades, service updates, or new security vulnerabilities and then retrain the machine learning model, which consumes a great deal of manpower and leads to problems such as lagging model updates.

Disclosure of Invention

In view of the above, embodiments of the present disclosure provide a phishing website attack detection method, apparatus, electronic device, and storage medium, which at least partially solve the problem in the prior art that the extraction features are not comprehensive enough, so that the phishing website can bypass detection to implement attack.

In a first aspect, an embodiment of the present disclosure provides a phishing website attack detection method, including:

acquiring URL data information;

respectively extracting URL data, webpage content data and third party information data from the URL data information;

extracting features based on the URL data, the webpage content data and the third party information data to obtain extracted features;

obtaining algorithm disturbance features based on the extracted features;

fusing the extracted features with algorithm disturbance features to obtain a feature data set;

a phishing website URL is detected based on the feature dataset and a machine learning detection engine.

According to a specific implementation of an embodiment of the disclosure, the extracted features include: URL features, third party information features, and web page features.

According to a specific implementation manner of the embodiment of the disclosure, the URL feature includes:

IP address feature, URL length feature, URL shortening service feature, function symbol feature, HTTPS connection feature, port number feature, and HTTPS token feature.

According to a specific implementation manner of the embodiment of the disclosure, the third party information feature includes:

domain name expiration time feature, domain name age feature, domain name record feature, domain name access volume feature, page rank feature, access feature, URL statistics feature, and URL identity feature.

According to a specific implementation manner of the embodiment of the present disclosure, the web page features include:

icon features, request URL features, anchor link features, server form features, mailbox features, redirect features, trigger features, pop-up features, set response features, inline frame Iframe features, and point to web page features.

According to a specific implementation manner of the embodiment of the present disclosure, obtaining an algorithm disturbance feature based on the extracted feature includes:

based on the URL features, the third-party information features and the webpage features, 3 algorithm disturbance features are calculated respectively by using a GBDT algorithm, an XGBoost algorithm and a LightGBM algorithm in a Stacking strategy.

According to a specific implementation manner of the embodiment of the present disclosure, after the step of detecting the phishing website URL based on the feature data set and the machine learning detection engine, the method further includes:

and performing backtracking analysis based on URL data information of the phishing website, wherein the backtracking analysis comprises the steps of determining endpoint information, an attacked object, an attack step, an attack range and a damage degree.

And visualizing the phishing website URL data, and automatically isolating, repairing and remedying the security threat caused by the phishing website aiming at the phishing website URL data information of the retrospective analysis.

According to a specific implementation manner of the embodiment of the disclosure, the machine learning detection engine is obtained through model training;

the model training comprises:

acquiring phishing website URLs to obtain phishing website URL samples;

running phishing website URL samples in the sandboxes to obtain URL sample data of the phishing websites;

obtaining normal website URL sample data based on the obtained normal website URL data;

extracting features of URL sample data of the phishing website and URL sample data of a normal website to obtain training features;

processing the training features through a Stacking strategy to obtain training algorithm features;

obtaining a training characteristic data set based on the training algorithm characteristics;

and carrying out random forest algorithm processing on the training characteristic data set to obtain the phishing website detection classifier.

According to a specific implementation manner of the embodiment of the disclosure, the random forest algorithm training process includes:

drawing m samples from an original training set by random sampling with replacement, and repeating the sampling n_tree times to generate n_tree training sets;

training n_tree decision tree models based on the n_tree training sets;

splitting each of the n_tree decision tree models on the best feature selected according to the Gini index until all training samples of a node belong to the same class, to obtain a plurality of split decision tree models;

and forming a random forest by a plurality of split decision tree models.

According to a specific implementation manner of the embodiment of the present disclosure, the Gini index is:

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

wherein $p_k$ is the probability that a sample belongs to the k-th class, K is the total number of classes, and Gini(p) is the Gini index.

storing the detected phishing website URL;

and updating and training the machine learning detection engine based on the stored detected phishing website URL.

In a second aspect, an embodiment of the present disclosure further provides a phishing website attack detection apparatus, including:

the acquisition module is used for acquiring URL data information;

the extraction module is used for respectively extracting URL data, webpage content data and third party information data from the URL data information;

The feature module is used for extracting features based on the URL data, the webpage content data and the third party information data to obtain extracted features;

the algorithm module is used for obtaining algorithm disturbance characteristics based on the extracted characteristics;

the fusion module is used for fusing the extracted features with the algorithm disturbance features to obtain a feature data set;

and the detection module is used for detecting phishing website URLs based on the characteristic data set and a machine learning detection engine.

In a third aspect, embodiments of the present disclosure further provide an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the phishing website attack detection method of any of the first aspects.

In a fourth aspect, embodiments of the present disclosure further provide a computer-readable storage medium storing computer instructions for causing a computer to perform the phishing website attack detection method of any one of the first aspects.

According to the phishing website attack detection method and device, the electronic equipment, and the computer-readable storage medium, URL data, webpage content data and third party information data are respectively extracted from URL data information; features are extracted based on these three kinds of data to obtain extracted features; algorithm disturbance features are obtained based on the extracted features; and the extracted features are fused with the algorithm disturbance features to obtain a feature data set. This increases feature characterization capability, so that a phishing website can be characterized comprehensively, phishing websites are prevented from bypassing detection, and comprehensive detection is achieved.

Furthermore, a Stacking strategy integrates a plurality of base learners with excellent classification performance into one high-performance model. The input features and the prediction results of the first stage of the Stacking algorithm are used together as the input features of the second stage, that is, as the feature data set on which the machine learning detection engine is trained. This adds model disturbance features and improves model performance and generalization capability while making full use of the high precision, high speed, and other advantages of each model.

The machine learning detection engine can self-learn based on the detected phishing websites, which addresses problems such as the need for extensive manual participation and lagging model updates.

The foregoing is only an overview of the technical solutions of the present disclosure. So that the technical means of the disclosure can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages can be more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a flowchart of a phishing website attack detection method provided in an embodiment of the present disclosure;

FIG. 2 is a schematic block diagram of phishing website attack detection provided by an embodiment of the present disclosure;

FIG. 3 is a schematic block diagram of machine learning detection engine training in phishing website attack detection provided by an embodiment of the present disclosure;

FIG. 4 is a flowchart of phishing website attack detection in an application scenario provided in an embodiment of the present disclosure;

FIG. 5 is a schematic block diagram of a phishing website attack detection apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

It should be appreciated that the following specific embodiments of the disclosure are described in order to provide a better understanding of the present disclosure, and that other advantages and effects will be apparent to those skilled in the art from the present disclosure. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.

It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.

It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.

In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.

An Endpoint Detection and Response (EDR) system is an active endpoint security solution. Compared with static defense technology that protects endpoints with a preset security policy, an EDR system enhances threat detection and response forensics capabilities and can rapidly detect, identify, monitor and handle endpoint events, so that threats are detected and blocked before they cause harm.

For ease of understanding, as shown in FIG. 1 and FIG. 2, this embodiment discloses a phishing website attack detection method, which includes:

step S101: acquiring URL data information;

In a specific application scenario, the detection method of this embodiment is deployed in an EDR system, and the EDR system is used to extract URL data information from enterprise network data.

A user installs an EDR endpoint system on a terminal. When the user accesses and browses a webpage through a URL on the terminal, the EDR endpoint system extracts and records the relevant URL data information and sends it to the EDR cloud system.

Step S102: respectively extracting URL data, webpage content data and third party information data from the URL data information;

In a specific application scenario, after receiving the URL data information, the EDR cloud system extracts the URL data, the webpage content data, and the third party information data from the URL data information.

Step S103: extracting features based on the URL data, the webpage content data and the third party information data to obtain extracted features;

The extracted features include URL features, third party information features, and web page features.

The URL features include: IP address feature, URL length feature, URL shortening service feature, function symbol features, HTTPS connection feature, port number feature, and HTTPS token feature. The function symbol features include the '@' symbol feature, the '//' symbol feature, the '-' symbol feature, the '.' symbol feature, and the like.

Specifically:

(1) IP address feature: whether an IP address is used in the URL in place of the domain name.

(2) URL length feature: whether the URL is long enough to hide the actual domain of the web page.

(3) URL shortening service feature: whether the attacker uses a URL shortening service such as TinyURL to hide the real domain of the URL.

(4) '@' symbol feature: whether the '@' symbol is present. When '@' appears in a URL, the browser is directed to the field after the symbol and ignores the field before it.

(5) '//' symbol feature: the number of '//' symbols. An attacker may use '//' to redirect to a phishing page.

(6) '-' symbol feature: whether '-' is present. Attackers use it to join two words so that the result mimics a legitimate domain.

(7) '.' symbol feature: the number of '.' symbols, which indicates the number of subdomains in the URL.

(8) HTTPS connection feature: whether an HTTPS connection is used. HTTPS provides additional protection for data transferred between the client and the server in both directions.

(9) Port number feature: whether the URL uses a non-standard port number to carry fake content.

(10) HTTPS token feature: whether there is an HTTPS token in the domain portion of the URL.
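The URL features listed above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `extract_url_features`, the length threshold, the shortener host list and the feature encodings are assumptions, since the disclosure does not specify them.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical values: the disclosure gives no threshold or shortener list.
LONG_URL_THRESHOLD = 54
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly", "t.co"}
STANDARD_PORTS = {None, 80, 443}  # None means no explicit port in the URL


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Compute the ten URL features as simple numeric values."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "ip_address": int(host_is_ip(host)),
        "url_length": int(len(url) >= LONG_URL_THRESHOLD),
        "shortening_service": int(host in SHORTENER_HOSTS),
        "at_symbol": int("@" in url),
        "double_slash_count": url.count("//"),
        "hyphen_in_host": int("-" in host),
        "dot_count_in_host": host.count("."),
        "https_connection": int(parts.scheme == "https"),
        "nonstandard_port": int(parts.port not in STANDARD_PORTS),
        "https_token_in_host": int("https" in host),
    }


# Example call on a URL that hides the real host behind an '@' symbol.
features = extract_url_features("http://paypal.com@evil-site.example//a")
```

A real deployment would tune the threshold and the shortener list on labelled data; the values here only make the sketch runnable.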

The third party information features include: domain name expiration time feature, domain name age feature, domain name record feature, domain name access volume feature, page rank feature, access feature, URL statistics feature, and URL identity feature. The access feature may be a Baidu access feature or an access feature of another search engine, such as Sogou. This embodiment takes Baidu as an example, without limitation.

Specifically:

(1) Domain name expiration time feature: the length of time remaining before the domain name expires.

(2) Domain name age feature: the length of time the domain name has been in use.

(3) Domain name record feature: whether the domain name has a record in the WHOIS database.

(4) Domain name access volume feature: the website's access-volume ranking on Alexa.

(5) Page rank feature: PageRank, a measure of the importance of a web page on the Internet.

(6) Baidu access feature: whether the URL can be accessed through Baidu.

(7) URL statistics feature: whether the URL appears in the phishing website lists published by organizations such as PhishTank and StopBadware.

(8) URL identity feature: whether the website identity, which can be extracted from the WHOIS database, is part of the URL.
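Several of the third party information features reduce to date arithmetic and list lookups once the external data has been fetched. The sketch below assumes the WHOIS record has already been retrieved and parsed into a plain dictionary with `created` and `expires` dates; that record shape, the function name and the feature names are hypothetical and are not taken from the disclosure.

```python
from datetime import date
from typing import Optional


def third_party_features(host: str,
                         whois_record: Optional[dict],
                         published_phishing_hosts: set,
                         today: date) -> dict:
    """Derive record, age, expiry and list-membership features for a host."""
    has_record = whois_record is not None
    if has_record:
        # Days the domain has existed and days left before it expires.
        age_days = (today - whois_record["created"]).days
        days_to_expiry = (whois_record["expires"] - today).days
    else:
        # No WHOIS record: treat the domain as brand new (an assumption).
        age_days = 0
        days_to_expiry = 0
    return {
        "domain_record": int(has_record),
        "domain_age_days": age_days,
        "days_to_expiry": days_to_expiry,
        "in_published_phishing_list": int(host in published_phishing_hosts),
    }


record = {"created": date(2022, 1, 1), "expires": date(2023, 1, 1)}
feats = third_party_features("evil.example", record,
                             {"evil.example"}, date(2022, 5, 20))
# feats["domain_age_days"] is 139 and feats["days_to_expiry"] is 226
```

Short-lived domains are a common phishing signal, which is why age and remaining lifetime are kept as raw day counts here rather than thresholded.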

The web page features include: icon features, request URL features, anchor link features, server form features, mailbox features, redirect features, trigger features, pop-up features, set response features, inline frame Iframe features, and point to web page features.

The trigger feature may concern a mouse action, a touch action on a touch screen, or another human-computer interaction trigger; this embodiment takes mouse triggering as an example. The set response feature determines whether certain trigger actions are disabled; in this embodiment, the action is right-clicking the mouse.

Specifically:

(1) Icon feature: whether the address from which the icon is loaded matches the web page domain name.

(2) Request URL feature: whether external objects contained in the web page, such as images, videos, and sounds, are loaded from another domain.

(3) Anchor link feature: whether the anchor links contained in the web page are loaded from another domain.

(4) Other anchor link feature: whether the links associated with the HTML document, client scripts, and other network resources in the web page are loaded from another domain.

(5) Server form feature: whether the server form in the web page contains an empty string or "about:blank", and whether the web page domain name is the same as the domain name of the form server.

(6) Mailbox feature: whether the server form uses the mail() or mailto: function.

(7) Redirect feature: the number of redirect links in the web page.

(8) Mouse feature: whether the address displayed in the status bar when the mouse moves over a link is in the same domain as the URL.

(9) Pop-up window feature: whether a pop-up window of the web page contains a text field that requires input.

(10) Mouse right-click feature: whether JavaScript is used to disable the right-click function so that the user cannot view and save the web page source code.

(11) Inline frame Iframe feature: whether the web page uses an inline frame (Iframe).

(12) Point to web page feature: the number of links pointing to the web page.
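A few of the web page features above can be computed from static HTML alone. The following sketch uses Python's standard `html.parser` to count inline frames, measure how many objects and anchors load from another domain, and detect a `mailto:` form action. The class name, function name, tag selection and ratio encoding are assumptions; features that depend on script behaviour (pop-ups, right-click disabling, status-bar changes) are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PageFeatureParser(HTMLParser):
    """Collect counts needed for a subset of the web page features."""

    def __init__(self, page_host: str):
        super().__init__()
        self.page_host = page_host
        self.iframe_count = 0
        self.objects = [0, 0]   # [external, total] for img/video/audio
        self.anchors = [0, 0]   # [external, total] for <a href=...>
        self.mail_form = False

    def _is_external(self, link: str) -> bool:
        host = urlsplit(link).hostname
        return host is not None and host != self.page_host

    def _count(self, counter: list, link) -> None:
        if link:
            counter[1] += 1
            counter[0] += int(self._is_external(link))

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "iframe":
            self.iframe_count += 1
        elif tag in ("img", "video", "audio"):
            self._count(self.objects, attr.get("src"))
        elif tag == "a":
            self._count(self.anchors, attr.get("href"))
        elif tag == "form":
            action = (attr.get("action") or "").lower()
            self.mail_form = self.mail_form or action.startswith("mailto:")


def _ratio(external: int, total: int) -> float:
    return external / total if total else 0.0


def extract_page_features(html: str, page_url: str) -> dict:
    parser = PageFeatureParser(urlsplit(page_url).hostname or "")
    parser.feed(html)
    parser.close()
    return {
        "iframe_count": parser.iframe_count,
        "external_object_ratio": _ratio(*parser.objects),
        "external_anchor_ratio": _ratio(*parser.anchors),
        "mail_form": int(parser.mail_form),
    }
```

Relative links such as `/local.png` have no host and are therefore counted as internal, which matches the intent of comparing against the page's own domain.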

Step S104: obtaining algorithm disturbance characteristics based on the extracted characteristics;

In a specific application scenario, this embodiment uses a Stacking strategy: based on the URL features, third party information features and web page features, 3 algorithm disturbance features are calculated by using a GBDT algorithm, an XGBoost algorithm and a LightGBM algorithm in the Stacking strategy.

The gradient boosting decision tree (GBDT), an improvement on the AdaBoost algorithm, consists of a gradient boosting algorithm and a decision tree algorithm. Its core is to reduce residuals: each new decision tree is generated in the negative gradient direction to reduce the residual of the previous iteration. The GBDT algorithm iteratively reduces the loss function while building the model, so that the model improves in the direction of continuous optimization.

The extreme gradient boosting (XGBoost) algorithm is a Boosting algorithm based on the gradient boosting algorithm. It improves how the objective function is computed: the optimization of the objective function is converted into the minimization of a quadratic function, and decision tree models are trained using the second-derivative information of the loss function, which speeds up training and reduces the running time of the model. In addition, the complexity of the tree is added to the objective function as a regularization term, which improves the generalization performance of the model.
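The second-order treatment of the objective mentioned above can be written out as follows. This is a sketch of the commonly published XGBoost formulation, not a formula taken from this disclosure; the symbols $g_i$, $h_i$, $f_t$, $T$, $w_j$, $\gamma$ and $\lambda$ are assumptions introduced here for illustration.

```latex
% Objective at boosting round t, after a second-order Taylor expansion
% of the loss l around the previous prediction \hat{y}_i^{(t-1)}:
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}
  \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l(y_i, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2}

% Tree complexity used as the regularization term
% (T leaves with leaf weights w_j):
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```

Because the approximated objective is quadratic in each leaf weight, every leaf has a closed-form minimizer, which is what makes the "minimization of a quadratic function" view computationally cheap.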

The light gradient boosting machine (LightGBM) algorithm is a newer GBDT algorithm. To find the optimal split node as fast as possible and with the smallest memory overhead, LightGBM uses a histogram algorithm, gradient-based one-side sampling (GOSS), and exclusive feature bundling (EFB). Its core is a leaf-wise growth strategy: among the current leaf nodes, the one with the largest gain is selected for splitting. To prevent overfitting, the depth of the tree is limited, which also shortens the time needed to find the optimal tree depth.

Step S105: fusing the extracted features with algorithm disturbance features to obtain a feature data set;

In a specific application scenario, the 3 algorithm disturbance features computed from the URL features, third party information features and web page features are fused with the 10 URL features, the 8 third party information features and the 12 web page features to form a new feature set of 33 features.

That is, the original features and the algorithm disturbance features are combined into a new feature data set, which is sent to the machine learning detection engine.
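The fusion of original features with algorithm disturbance features amounts to appending each first-stage learner's prediction to the original feature vector. The sketch below shows only that step. The three base learners are trivial stand-in functions, not trained GBDT, XGBoost or LightGBM models, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

# A base learner maps a feature vector to one score (e.g. a phishing probability).
BaseLearner = Callable[[Sequence[float]], float]


def fuse_features(original: Sequence[float],
                  base_learners: Sequence[BaseLearner]) -> List[float]:
    """Append one disturbance feature per base learner to the original features."""
    disturbance = [learner(original) for learner in base_learners]
    return list(original) + disturbance


# Stand-ins for the three first-stage models (assumptions for illustration only).
def stub_gbdt(x: Sequence[float]) -> float:
    return min(1.0, sum(x) / len(x))


def stub_xgboost(x: Sequence[float]) -> float:
    return float(max(x) > 0)


def stub_lightgbm(x: Sequence[float]) -> float:
    return float(x[0])


fused = fuse_features([1, 0, 1, 0], [stub_gbdt, stub_xgboost, stub_lightgbm])
# fused has the 4 original values followed by 3 disturbance values
```

In a real Stacking setup the first-stage predictions used for training would normally be out-of-fold predictions, so that the second stage does not see scores produced on data the base learners were fitted on.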

Step S106: phishing website URLs are detected based on the feature data set and the machine learning detection engine.

In a specific application scenario, the machine learning detection engine classifies each URL as either normal URL data or phishing website URL data by voting over the classification results calculated by the decision tree models in the self-learning random forest classification algorithm.
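The voting step can be illustrated with a simple majority vote over per-tree labels. This is a minimal sketch; the label strings and the function name are assumptions, and the disclosure does not say how ties are resolved.

```python
from collections import Counter
from typing import Sequence


def majority_vote(tree_predictions: Sequence[str]) -> str:
    """Return the label predicted by the most decision trees.

    On a tie, Counter.most_common keeps first-seen order, so the label that
    appeared first among the tied labels wins (an arbitrary choice here).
    """
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


label = majority_vote(["phishing", "normal", "phishing"])
# label == "phishing"
```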

In this embodiment, after the step of detecting the phishing website URL based on the feature data set and the machine learning detection engine, the method further includes:

Based on the obtained phishing website URL data information, the EDR system supports backtracking analysis, which includes determining endpoint information (physical location, IP, process/thread, file information, and the like), the attacked object, the attack steps, the attack range, the damage degree, and the like.

The normal URL data and the phishing website URL data are visualized, and the security threat is automatically isolated, repaired and remedied according to the phishing website URL data information obtained from the backtracking analysis. Specific measures include endpoint and network isolation, malicious code cleaning, patch repair, software upgrade, and the like.

storing the detected phishing website URL;

And storing the phishing website URL data information into a malicious URL library and/or a threat information center.

And retraining and updating a machine learning detection engine in the EDR system according to the quantity of information stored in the malicious URL library and/or the threat intelligence center.

The machine learning detection engine may use a random forest algorithm to train a classifier on the feature data set, finally obtaining a classifier for phishing website detection.

As shown in fig. 3, the machine learning detection engine is obtained through model training;

the model training comprises:

acquiring phishing website URLs to obtain phishing website URL samples;

m samples are drawn from the original training set by random sampling with replacement using the bootstrap method, and the sampling is performed n_tree times to generate n_tree training sets.

n_tree decision tree models are trained on the n_tree training sets respectively.
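The bootstrap sampling described above can be sketched as follows. The function name and the `seed` parameter are assumptions added for reproducibility; `m` and `n_tree` follow the text.

```python
import random
from typing import List, Optional, Sequence


def bootstrap_training_sets(training_set: Sequence,
                            m: int,
                            n_tree: int,
                            seed: Optional[int] = None) -> List[list]:
    """Draw n_tree training sets of m samples each, sampling with replacement."""
    rng = random.Random(seed)
    # random.Random.choices samples with replacement, so duplicates are expected.
    return [rng.choices(training_set, k=m) for _ in range(n_tree)]


samples = ["s1", "s2", "s3", "s4", "s5"]
training_sets = bootstrap_training_sets(samples, m=5, n_tree=3, seed=0)
# training_sets holds 3 lists of 5 samples; one decision tree is trained on each
```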

for a single decision tree model, assuming that the number of training sample features is n, the best feature is selected for splitting according to the base index at each splitting. For a general decision tree, if there are K total classes, the probability that the sample belongs to the kth class is: p is p _k 。

Each tree continues to split in this way until all training examples at a node belong to the same class. No pruning is required during the splitting of the decision tree.
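Split selection by Gini index can be sketched as below, assuming numeric features and binary threshold splits. The helper names `gini` and `best_split` are hypothetical, and the exhaustive search over every feature and threshold is a simplification for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the purest binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must leave samples on both sides
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


rows = [[0, 5], [0, 7], [1, 6], [1, 8]]
labels = [0, 0, 1, 1]
split = best_split(rows, labels)
```

In this toy data the first feature separates the two classes perfectly, so the chosen split has a weighted Gini index of zero and the resulting child nodes need no further splitting.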

The generated decision trees together form a random forest.

The random forest algorithm model is rooted in statistical learning theory. It takes the decision tree model as its base classifier, and the final classification result is determined by voting over the outputs of the individual decision tree models. The random forest algorithm model has the following advantages: it tolerates noise and outliers well; it is much less prone to the overfitting seen in a single decision tree model; and it scales and parallelizes well when classifying high-dimensional data. In addition, the random forest is a data-driven non-parametric classification method that requires no prior knowledge of the classification; it only needs to learn the classification rules from the given samples.

As shown in FIG. 3, the EDR system-based encrypted traffic detection system mainly comprises a threat information engine, a phishing website URL library, a sandbox, URL webpage content, a third party information library, a Stacking strategy feature processing module and a machine learning engine module.

The threat information engine periodically crawls the latest published malicious URL samples and the related information of the samples, classifies the samples through descriptions of the malicious URL samples, and forms a phishing website URL sample library.

The malicious URL samples in the phishing website URL sample library are executed dynamically in a sandbox, so that they are activated and connect to an external server. This generates malicious URL webpage data and third party information, which are stored in the malicious URL webpage data and third party information library.

Normal URLs, together with their webpage data and third party information, are captured in the enterprise network and marked as normal URLs. Features are then extracted from both the phishing website samples and the normal samples, using the malicious URL webpage data and third party information library.

The Stacking strategy feature processing module obtains the original features through feature extraction, uses them as the input features of the GBDT algorithm, the XGBoost algorithm and the LightGBM algorithm, and outputs new algorithm disturbance features. The training process is as follows:

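GBDT algorithm:

Input: training data set

$T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$;

Loss function:

$L(y,f(x))$;

Initialization:

$f_0(x)=\arg\min_{c}\sum_{i=1}^{N}L(y_i,c)$

This estimates the constant value that minimizes the loss function.

For $m=1,2,\ldots,M$:

(a) For $i=1,2,\ldots,N$, calculate:

$r_{mi}=-\left[\dfrac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$

The negative gradient is taken as an estimate of the residual;

(b) Fit a regression tree to $r_{mi}$ to obtain the leaf node regions of the m-th tree

$R_{mj},\ j=1,2,\ldots,J$;

(c) For $j=1,2,\ldots,J$, calculate:

$c_{mj}=\arg\min_{c}\sum_{x_i\in R_{mj}}L(y_i,f_{m-1}(x_i)+c)$

The value of each leaf node region is estimated by line search so as to minimize the loss function;

(d) Update the regression tree:

$f_m(x)=f_{m-1}(x)+\sum_{j=1}^{J}c_{mj}I(x\in R_{mj})$

Output: regression tree:

$\hat{f}(x)=f_M(x)=f_0(x)+\sum_{m=1}^{M}\sum_{j=1}^{J}c_{mj}I(x\in R_{mj})$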
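The GBDT procedure above can be illustrated with a minimal sketch under simplifying assumptions that are not taken from the text: a single numeric feature, the squared loss L = (y - f)^2 / 2, and two-leaf regression trees (J = 2). Under this loss the negative gradient equals the ordinary residual, and the loss-minimizing leaf value is the mean residual in the leaf. The names `fit_stump`, `gbdt_fit` and `gbdt_predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree by least squares on one feature.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds keep both leaves non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        c_left = sum(left) / len(left)      # leaf value minimizing squared loss
        c_right = sum(right) / len(right)
        sse = (sum((r - c_left) ** 2 for r in left)
               + sum((r - c_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, c_left, c_right)
    return best[1], best[2], best[3]


def gbdt_fit(xs, ys, n_rounds=20):
    """Boost two-leaf trees on the residuals for n_rounds iterations (M = n_rounds)."""
    f0 = sum(ys) / len(ys)                  # constant that minimizes the squared loss
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient
        t, c_left, c_right = fit_stump(xs, residuals)
        stumps.append((t, c_left, c_right))
        preds = [p + (c_left if x <= t else c_right) for x, p in zip(xs, preds)]
    return f0, stumps


def gbdt_predict(model, x):
    """Evaluate f_M(x) = f_0 + the sum of the leaf values selected by x."""
    f0, stumps = model
    return f0 + sum(c_left if x <= t else c_right for t, c_left, c_right in stumps)


model = gbdt_fit([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0], n_rounds=5)
```

For a classification task such as phishing detection a different loss (for example the logistic loss) would normally be used, in which case the leaf values are no longer simple means.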

XGBoost algorithm:

The XGBoost algorithm differs from GBDT in its objective function, which uses both the first and second derivatives of the loss function together with a term θ(f_t) that describes the structure of the t-th tree.
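For reference, the second-order objective usually given for XGBoost at the t-th boosting round can be written as below. This is the textbook form and is not confirmed by the text; the symbols g_i, h_i and the previous-round prediction ŷ_i^(t-1) are assumptions introduced here, and θ(f_t) is assumed to play the role of the tree-structure (regularization) term.

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{N}\Big[\, g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Big] + \theta(f_t),
\qquad
g_i = \frac{\partial L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}
```

Compared with GBDT, which fits each tree to the first-order negative gradient only, the second-order term h_i also enters the choice of leaf values.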

LightGBM algorithm:

For a given training set, the goal of LightGBM is to find an approximation $\hat{f}(x)$ of a certain function $f^{*}(x)$ such that the expectation of the loss function $L(y,f(x))$ is minimized, as shown in the following equation:

$\hat{f}=\arg\min_{f}E_{y,X}L(y,f(x))$,

where $E_{y,X}L(y,f(x))$ denotes the expected value of the loss function; that is, the function $f$ that minimizes this expected value is taken as the approximation of $f^{*}(x)$.

As shown in fig. 2 and 4, the detection method is mainly as follows:

URL information features, webpage content features and third party information features are extracted from the enterprise network, feature processing is performed through the Stacking strategy, and the newly generated feature data set is sent to the machine learning detection engine.
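The fusion step can be sketched as follows, assuming 30 extracted feature values and three already-trained base models, so that the fused vector has 33 entries, matching the count given in the claims. The three scoring functions are hypothetical stand-ins for trained GBDT, XGBoost and LightGBM models, not calls into those libraries.

```python
from typing import Callable, List, Sequence

BaseModel = Callable[[Sequence[float]], float]


def fuse_features(extracted: Sequence[float],
                  base_models: Sequence[BaseModel]) -> List[float]:
    """Append one algorithm disturbance feature per base model to the extracted features."""
    disturbance = [model(extracted) for model in base_models]
    return list(extracted) + disturbance


# Hypothetical stand-ins for the outputs of trained GBDT, XGBoost and LightGBM models.
def gbdt_score(x: Sequence[float]) -> float:
    return sum(x) / len(x)


def xgboost_score(x: Sequence[float]) -> float:
    return max(x)


def lightgbm_score(x: Sequence[float]) -> float:
    return min(x)


extracted = [0.0] * 30          # URL, third party and webpage feature values
extracted[0] = 1.0
fused = fuse_features(extracted, [gbdt_score, xgboost_score, lightgbm_score])
```

The fused vector is what would be passed on to the random forest classifier. In a typical stacking setup the base model outputs used for training are produced on held-out folds, which is an assumption here and is not stated in the text.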

The machine learning detection engine applies the trained random forest classifier to the feature data set, takes a vote over the classification results of the individual decision tree models in the random forest, and finally determines from the number of votes whether the URL data represents phishing website behavior.
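The voting step can be sketched as a simple majority count over the per-tree results. The label strings and the function name `forest_vote` are hypothetical, and tie handling is not specified in the text; here a tie resolves to the label encountered first.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the label with the most votes and its vote count."""
    counts = Counter(tree_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes


label, votes = forest_vote(["phishing", "normal", "phishing", "phishing", "normal"])
```

Using an odd number of trees avoids ties in the two-class case.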

Experiments show that, on the basis of the EDR system, extracting features from the URL, the third party information, the page content and the like, combining them with the algorithm disturbance features calculated by the GBDT, XGBoost and LightGBM algorithms, and detecting phishing websites with a self-learning random forest algorithm (Random Forest) can effectively reduce the false alarm rate and improve detection efficiency and accuracy with limited training samples.

In one specific application scenario:

1. The user installs the EDR endpoint system on a terminal (e.g., a computer).

2. When the user accesses a web page on the terminal (e.g., a computer) via a URL, the EDR endpoint system extracts and records the relevant URL data information and sends it to the EDR cloud system.

3. After receiving the URL data information, the EDR cloud system extracts the URL data features, the URL webpage features and the third party data features from it, calculates the Stacking strategy disturbance features, and fuses the features.

4. The EDR cloud system uses the random forest algorithm to detect whether the webpage exhibits phishing website behavior. If the webpage is normal, the user can access it normally; if the webpage exhibits phishing behavior, the EDR cloud system provides defensive measures such as alarming and interception, and stores the URL data information in the malicious URL library and the threat intelligence center.

5. The EDR cloud system retrains and updates the random forest algorithm used with the Stacking strategy according to the quantity of stored URLs and the elapsed time.
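As an illustration of the URL data feature extraction in step 3, the sketch below derives a few raw values from a URL string using the Python standard library. The feature names follow the claims (IP address, URL length, HTTPS connection, port number, HTTPS token), but the dictionary keys and the exact definition of each value are assumptions; the text does not give thresholds or encodings.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute a few raw URL features; the definitions are illustrative assumptions."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True            # the host is a literal IP address
    except ValueError:
        host_is_ip = False
    return {
        "host_is_ip": host_is_ip,
        "url_length": len(url),
        "uses_https": parts.scheme == "https",
        "explicit_port": parts.port,             # None when no port is given
        "https_token_in_host": "https" in host,  # "https" used inside the domain name
    }


ip_example = url_features("http://192.168.0.1:8080/login")
token_example = url_features("http://https-secure.example.com/")
```

Webpage and third party features would require fetching the page and querying external services, which is outside the scope of this sketch.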

The improved Stacking strategy self-learning machine learning phishing website attack detection method based on the EDR system calculates algorithm disturbance features from the URL data features, the URL webpage features and the third party data features using the algorithms in the Stacking strategy, combines the original features with the algorithm disturbance features to form a new feature data set, and detects phishing websites with a self-learning random forest algorithm (Random Forest). The method addresses the problems of detection based on black-and-white list rules and of traditional machine learning detection, such as heavy manual involvement, high maintenance cost, slow updating and low detection efficiency. At the same time, the algorithm disturbance features of the Stacking strategy improve the generalization capability of the detection model, and the accuracy of phishing website detection is effectively improved.

As shown in fig. 5, the present embodiment further discloses a phishing website attack detection device, including:

the acquisition module is used for acquiring URL data information;

An electronic device of an embodiment of the present disclosure includes a memory and a processor. The memory is for storing non-transitory computer readable instructions. In particular, the memory may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, and the like.

The processor may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device to perform the desired functions. In one embodiment of the present disclosure, the processor is configured to execute the computer readable instructions stored in the memory, so that the electronic device performs all or part of the steps of the phishing website attack detection method of the embodiments of the present disclosure described above.

It should be understood by those skilled in the art that, in order to solve the technical problem of how to obtain a good user experience effect, the present embodiment may also include well-known structures such as a communication bus, an interface, and the like, and these well-known structures are also included in the protection scope of the present disclosure.

Fig. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 6, the electronic device may include a processing means (e.g., a central processing unit, a graphic processor, etc.) that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage means into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the electronic device are also stored. The processing device, ROM and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.

In general, the following devices may be connected to the I/O interface: input means including, for example, sensors or visual information gathering devices; output devices including, for example, display screens and the like; storage devices including, for example, magnetic tape, hard disk, etc.; a communication device. The communication means may allow the electronic device to communicate wirelessly or by wire with other devices, such as edge computing devices, to exchange data. While fig. 6 shows an electronic device having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device, or installed from a storage device, or installed from ROM. All or part of the steps of the phishing website attack detection method of the embodiments of the present disclosure are performed when the computer program is executed by the processing device.

The detailed description of the present embodiment may refer to the corresponding description in the foregoing embodiments, and will not be repeated herein.

The computer-readable storage medium of embodiments of the present disclosure has non-transitory computer-readable instructions stored thereon. When executed by a processor, the non-transitory computer-readable instructions perform all or part of the steps of the phishing website attack detection method of the various embodiments of the present disclosure described previously.

The computer-readable storage medium described above includes, but is not limited to: optical storage media (e.g., CD-ROM and DVD), magneto-optical storage media (e.g., MO), magnetic storage media (e.g., magnetic tape or removable hard disk), media with built-in rewritable non-volatile memory (e.g., memory card), and media with built-in ROM (e.g., ROM cartridge).

The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.

Claims

1. A phishing website attack detection method is characterized by comprising the following steps:

acquiring URL data information;

and extracting features based on the URL data, the webpage content data and the third party information data to obtain extracted features, wherein the extracted features comprise: URL features, third party information features, and web page features; the URL feature includes: IP address feature, URL length feature, shorten service feature, four function compliance features, HTTPS connection feature, port number feature, and HTTPS token feature; the third party information feature includes: domain name expiration time feature, domain name age feature, domain name record feature, domain name access volume feature, page rank feature, access feature, URL statistics feature, and URL identity feature; the webpage features include: icon features, request URL features, anchor link features, other anchor link features, server form features, mailbox features, redirection features, trigger features, popup features, set response features, inline frame Iframe features, and point to web page features;

detecting phishing website URLs based on the feature data set and a machine learning detection engine;

obtaining an algorithm disturbance characteristic based on the extracted characteristic, including:

based on the URL features, the third party information features and the webpage features, 3 algorithm disturbance features are calculated respectively by using a GBDT algorithm, an XGBoost algorithm and a LightGBM algorithm in a Stacking strategy;

fusing the extracted features with algorithm disturbance features to obtain a feature data set, wherein the feature data set comprises:

the URL features, the third party information features and the web page features are combined with 3 algorithm disturbance features to obtain 33 new feature sets, and the 33 new feature sets form a feature data set;

the machine learning detection engine is obtained through model training;

the model training comprises:

acquiring phishing website URLs to obtain phishing website URL samples;

2. The phishing website attack detection method of claim 1, further comprising, after the step of detecting a phishing website URL based on the feature data set and a machine learning detection engine:

3. The phishing website attack detection method of claim 2, further comprising, after the step of detecting a phishing website URL based on the feature data set and a machine learning detection engine:

4. The phishing website attack detection method of claim 1, wherein the random forest algorithm training process comprises:

training n_tree decision tree models based on the n_tree training sets;

each decision tree model in the n_tree decision tree models is split according to the best feature selected by the Gini index until all training samples of the nodes belong to the same class, and a plurality of split decision tree models are obtained;

and forming a random forest by a plurality of split decision tree models.

5. The phishing website attack detection method of claim 4, wherein the Gini index is:

6. The phishing website attack detection method of claim 1, further comprising, after the step of detecting a phishing website URL based on the feature data set and a machine learning detection engine:

storing the detected phishing website URL;

7. A phishing website attack detection device, comprising:

The acquisition module is used for acquiring URL data information;

a detection module for detecting phishing website URLs based on the feature data set and a machine learning detection engine;

the extracting features includes: URL features, third party information features, and web page features; the URL feature includes: IP address feature, URL length feature, shorten service feature, function compliance feature, HTTPS connect feature, port number feature, and HTTPS token feature; the third party information feature includes: domain name expiration time feature, domain name age feature, domain name record feature, domain name access volume feature, page rank feature, access feature, URL statistics feature, and URL identity feature; the webpage features include: icon features, request URL features, anchor link features, server form features, mailbox features, redirect features, trigger features, pop-up features, set response features, inline frame Iframe features, and point to web page features;

fusing each of the URL feature, the third party information feature and the webpage feature with 3 algorithm disturbance features respectively to obtain a plurality of new feature sets, wherein the plurality of new feature sets form a feature data set;

the machine learning detection engine is obtained through model training;

the model training comprises:

acquiring phishing website URLs to obtain phishing website URL samples;

8. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the phishing website attack detection method of any of claims 1-6.

9. A computer-readable storage medium storing computer instructions for causing a computer to perform the phishing website attack detection method of any of claims 1-6.