WO2019109743A1 - URL attack detection method, apparatus, and electronic device - Google Patents

URL attack detection method, apparatus, and electronic device

Info

Publication number
WO2019109743A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
domain name
access request
sample
url access
Prior art date
Application number
PCT/CN2018/110769
Other languages
English (en)
French (fr)
Inventor
李龙飞
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2019109743A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Definitions

  • the present specification relates to the field of computer applications, and in particular, to a URL attack detection method, apparatus, and electronic device.
  • This specification proposes a URL attack detection method, which includes:
  • the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm
  • the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
  • the method further includes:
  • Extracting domain name features of several dimensions from domain name information carried in a plurality of URL access request samples, wherein the plurality of URL access request samples include URL access request samples marked with a sample tag and unlabeled URL access request samples;
  • the sample tag characterizes the URL access request sample as a URL attack request;
  • the URL attack detection model is obtained by training the plurality of URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm.
  • The loss functions corresponding to the URL access request samples marked with the sample tag and to the unlabeled URL access request samples are each configured with a cost-sensitive weight, wherein the cost-sensitive weight of the loss function corresponding to the samples marked with the sample tag is greater than the cost-sensitive weight of the loss function corresponding to the unlabeled samples.
  • the URL attack detection model includes multiple machine learning models trained based on a PU-Learning machine learning algorithm
  • the extracted domain name features of the several dimensions include a combination of multiple of the following domain name features:
  • The total number of characters in the domain name information, the total number of letters in the domain name information, the total number of digits in the domain name information, the total number of symbols in the domain name information, the number of different characters in the domain name information, the number of different letters in the domain name information, the number of different digits in the domain name information, and the number of different symbols in the domain name information.
  • the present specification also proposes a URL attack detecting device, the device comprising:
  • the first extraction module extracts domain name features of several dimensions from the domain name information carried in the URL access request;
  • the prediction module inputs the domain name features into a preset URL attack detection model for predictive calculation and obtains a risk score of the URL access request, wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
  • a determining module determines whether the URL access request is a URL attack request based on the risk score.
  • the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
  • the device further includes:
  • the second extraction module extracts domain name features of several dimensions from the domain name information carried in a plurality of URL access request samples, wherein the plurality of URL access request samples include URL access request samples marked with a sample tag and unlabeled URL access request samples; the sample tag characterizes the URL access request sample as a URL attack request;
  • the training module trains the plurality of URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
  • The loss functions corresponding to the URL access request samples marked with the sample tag and to the unlabeled URL access request samples are each configured with a cost-sensitive weight, wherein the cost-sensitive weight of the loss function corresponding to the samples marked with the sample tag is greater than the cost-sensitive weight of the loss function corresponding to the unlabeled samples.
  • the URL attack detection model includes multiple machine learning models trained based on a PU-Learning machine learning algorithm
  • the prediction module is further:
  • the extracted domain name features of the several dimensions include a combination of multiple of the following domain name features:
  • The total number of characters in the domain name information, the total number of letters in the domain name information, the total number of digits in the domain name information, the total number of symbols in the domain name information, the number of different characters in the domain name information, the number of different letters in the domain name information, the number of different digits in the domain name information, and the number of different symbols in the domain name information.
  • the present specification also proposes an electronic device comprising:
  • a memory for storing machine executable instructions
  • the processor is caused to:
  • In the technical solution provided by the embodiments of the present specification, domain name features extracted from the domain name information carried in a URL access request are input into a URL attack detection model trained based on the PU-Learning machine learning algorithm for predictive calculation, so as to perform attack detection on the URL access request. Potential URL attacks can thereby be discovered in advance, which helps to protect against potentially abnormal URL access in a timely manner.
  • FIG. 1 is a flowchart of a URL attack detection method according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of constructing a training sample set training PU-Learning model according to an embodiment of the present specification
  • FIG. 3 is a hardware structural diagram of an electronic device carrying a URL attack detecting apparatus according to an embodiment of the present disclosure
  • FIG. 4 is a logic block diagram of the URL attack detecting apparatus according to an embodiment of the present disclosure.
  • Machine learning is usually divided into three categories according to whether the training samples carry label information: supervised learning, unsupervised learning, and semi-supervised learning.
  • Semi-supervised learning means that, among the training samples used to train the machine learning model, only some are labeled samples while the rest are unlabeled samples, and the unlabeled samples are used to assist the learning process on the labeled samples.
  • labeled training samples are usually divided into marked positive and negative samples;
  • the labeled samples in the training samples collected by the modeling party are likely to contain only one category of markers; for example, there may be only a small number of labeled positive samples, and the remaining samples are unlabeled samples.
  • Machine learning for this scenario is often referred to as PU Learning (Positive and Unlabeled Learning), a machine learning process for labeled positive and unlabeled samples.
  • The present specification proposes a technical solution in which the PU-Learning machine learning algorithm is used to train on URL access request samples that include both samples marked as URL attacks and unmarked samples, so as to construct a URL attack detection model, and the URL attack detection model is then used to perform attack detection on normal URL access requests in order to discover potential URL attacks.
  • A number of URL access request samples may be prepared in advance, including both positive samples marked as URL attacks and unmarked samples. The URL access request samples may then be segmented and the domain name information carried in them extracted; for example, the primary domain name and the corresponding domain name suffix carried in the URL access request.
  • domain name features of several dimensions may be extracted from the domain name information, and the domain name features are normalized, and then the normalized domain name features are used as modeling features to construct training samples.
  • the training samples can be trained based on the PU-Learning machine learning algorithm to construct a URL attack detection model; for example, the training samples can be trained using a cost-sensitive PU-Learning machine learning algorithm.
  • Domain name features of several dimensions can be extracted in the same manner from the domain name information carried in the URL access request that needs attack detection, and a prediction sample is constructed based on the extracted domain name features. The constructed prediction sample is input into the above URL attack detection model for predictive calculation to obtain a risk score of the URL access request (for example, the probability that the URL access request is a URL attack request), and whether the URL access request is a URL attack request is then determined based on the risk score.
  • Attack detection is thus performed on the URL access request by inputting the domain name features extracted from the domain name information carried in the URL access request into the URL attack detection model trained with the PU-Learning machine learning algorithm, so that potential URL attacks can be discovered in advance, which helps to protect against potentially abnormal URL access in a timely manner.
  • FIG. 1 shows a URL attack detection method according to an embodiment of the present specification, which performs the following steps:
  • Step 102: Extract domain name features of several dimensions from the domain name information carried in the URL access request.
  • Step 104: Input the domain name features into a preset URL attack detection model for predictive calculation, and obtain a risk score of the URL access request.
  • The URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm.
  • Step 106: Determine, according to the risk score, whether the URL access request is a URL attack request.
  • The modeler can collect in advance a large number of URL access requests marked as URL attacks as positive samples and a large number of unmarked URL access requests as unlabeled samples, construct a training sample set based on the collected URL access request samples, and then train on the training sample set based on the PU-Learning machine learning algorithm to construct the above URL attack detection model.
  • FIG. 2 is a flowchart of constructing a training sample set training PU-Learning model according to the present specification.
  • The collected original URL access request samples may each be segmented, and the domain name information carried in the URL access request samples extracted.
  • The domain name information may specifically include the primary domain name carried in the URL access request and the domain name suffix corresponding to the primary domain name.
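  • As a minimal sketch of this extraction step, the snippet below splits a URL and takes the last two labels of the host name as the primary domain name and its suffix. The function name `extract_domain_info` and the naive last-label rule are assumptions made for illustration; the specification does not state how the segmentation is done, and multi-label suffixes (such as `co.uk`) would need a public-suffix list.

```python
from typing import Tuple
from urllib.parse import urlsplit


def extract_domain_info(url: str) -> Tuple[str, str]:
    """Return (primary_domain, suffix) using a naive last-two-labels rule.

    Hypothetical helper: treats the final label as the suffix and the label
    before it as the primary domain name.
    """
    host = urlsplit(url).hostname or ""
    labels = [part for part in host.split(".") if part]
    if len(labels) < 2:
        # No dot in the host name: there is no separate suffix to report.
        return host, ""
    return labels[-2], labels[-1]


primary, suffix = extract_domain_info("http://www.example.com/index.html?id=1")
print(primary, suffix)
```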
  • The extracted domain name information may then be filtered to select the domain name information used to construct the machine learning model; that is, the domain name information that best characterizes URL attack requests is selected to participate in the modeling.
  • this part of the domain name information can be filtered.
  • domain name features of several dimensions may be extracted from the domain name information respectively as modeling features.
  • The domain name features extracted from the domain name information are not specifically limited in this specification. In practical applications, any domain name feature that can characterize the form and regularity of the domain name information carried in URL attack requests may be selected as a modeling feature.
  • Those skilled in the art who participate in modeling can, based on experience, extract domain name features of several dimensions from the parameter values corresponding to the domain name information, attempt modeling based on these domain name features, and evaluate the modeling results to select the domain name features of the several dimensions that contribute most to the model as modeling features.
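  • The domain name features extracted from the domain name information may include eight dimensions: the total number of characters of the domain name information, the total number of letters of the domain name information, the total number of digits of the domain name information, the total number of symbols of the domain name information, the number of different characters of the domain name information, the number of different letters of the domain name information, the number of different digits of the domain name information, and the number of different symbols of the domain name information.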
  • those skilled in the art may combine the above eight dimensions as modeling features, or further select multiple dimensions from the above eight dimensions to be combined as modeling features.
  • The domain name features of the eight dimensions shown above are merely exemplary; obviously, in practical applications, those skilled in the art may also extract domain name features of dimensions other than the above eight from the domain name information as modeling features, which are not enumerated one by one in this specification.
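  • The eight counting features listed above can be sketched as follows. The function name `domain_features` and the rule that a "symbol" is any character that is neither a letter nor a digit are assumptions for illustration; the specification does not define the character classes precisely.

```python
from typing import List


def domain_features(domain: str) -> List[int]:
    """Return the eight count features for a domain name string.

    Order: total characters, letters, digits, symbols, then the number of
    distinct characters, letters, digits, symbols.
    """
    letters = [c for c in domain if c.isalpha()]
    digits = [c for c in domain if c.isdigit()]
    # Assumption: anything that is not a letter or digit counts as a symbol.
    symbols = [c for c in domain if not c.isalnum()]
    return [
        len(domain), len(letters), len(digits), len(symbols),
        len(set(domain)), len(set(letters)), len(set(digits)), len(set(symbols)),
    ]


print(domain_features("abc-123.com"))
```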
  • The domain name features of these dimensions may also be normalized.
  • Normalization maps the value ranges of the different domain name features to a uniform numerical interval, thereby eliminating the influence of differing value ranges on modeling accuracy.
  • a corresponding feature vector may be separately created as a training sample for each URL access request sample based on the domain name features extracted from the domain name information carried by each URL access request sample;
  • the dimension of the created feature vector is the same as the dimension of the extracted domain name feature.
  • A target matrix can be created based on the feature vectors constructed for the URL access request samples; for example, assuming that a total of N URL access request samples are collected and an M-dimensional domain name feature is extracted from each URL access request sample, the target matrix may specifically be an N×M matrix.
  • the target matrix created is the training sample set that ultimately participates in the machine learning model training.
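  • A small sketch of how per-sample feature vectors might be stacked into an N×M matrix and normalized column by column. Min-max scaling to [0, 1] is an assumed choice here, since the specification only says the features are mapped to a uniform numerical interval; the name `min_max_normalize` is hypothetical.

```python
from typing import List


def min_max_normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Scale each column of an N x M matrix to [0, 1].

    Columns whose values are all equal are mapped to 0.0 to avoid dividing
    by zero.
    """
    if not matrix:
        return []
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


# Three samples (N = 3) with three features each (M = 3).
raw = [[11, 6, 3], [7, 7, 0], [15, 8, 3]]
print(min_max_normalize(raw))
```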
  • the training samples can be trained based on the PU-Learning machine learning algorithm to construct the above URL attack detection model.
  • The PU-Learning machine learning algorithm can usually include a plurality of machine learning strategies; for example, typical strategies include the two-stage strategy and the cost-sensitive strategy.
  • In the so-called two-stage strategy, the algorithm first mines potential reliable negative samples (Reliable Negatives) from the unlabeled samples based on the known positive samples and the unlabeled samples, and then, based on the known positive samples and the mined reliable negative samples, transforms the problem into a traditional supervised machine learning process to train the classification model.
  • In the cost-sensitive strategy, the algorithm assumes that the proportion of positive samples among the unlabeled samples is extremely low, so the unlabeled samples can be treated directly as negative samples, and a higher cost-sensitive weight is set for positive samples than for negative samples; for example, in the objective equation of the cost-sensitive PU-Learning machine learning algorithm, a higher cost-sensitive weight is usually set for the loss function corresponding to the positive samples.
  • As a result, for the finally trained classification model, the cost of misclassifying a positive sample is far greater than the cost of misclassifying a negative sample, so that a cost-sensitive classifier can be learned directly from the positive samples and the unlabeled samples and used to classify unknown samples.
  • the cost-sensitive PU-Learning machine learning algorithm can be selected as a modeling algorithm, and the above training sample set is trained to construct the URL attack detection model described above.
  • Training on the training sample set with the cost-sensitive PU-Learning machine learning algorithm is merely exemplary and is not intended to be limiting; obviously, those skilled in the art, in light of the technical details disclosed in this specification, may also use other machine learning strategies of the PU-Learning machine learning algorithm (such as the two-stage strategy) when implementing the scheme of this specification, which are not detailed here.
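  • To make the cost-sensitive strategy concrete, the following is a minimal sketch rather than the patented implementation: a linear model trained by gradient descent on a weighted log-loss, where unlabeled samples are treated as negatives (label -1) and positive samples carry the larger weight. The linear form of the model, the L2 regularizer, and all names and default values (`train_cost_sensitive`, `c_pos`, `c_neg`, `lam`, `lr`, `epochs`) are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_cost_sensitive(
    X: Sequence[Sequence[float]], y: Sequence[int],
    c_pos: float = 5.0, c_neg: float = 1.0,
    lam: float = 0.01, lr: float = 0.1, epochs: int = 1000,
) -> Tuple[List[float], float]:
    """Fit g(x) = w.x + b with weighted log-loss; y is +1 (attack) or -1 (unlabeled)."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for xi, yi in zip(X, y):
            g = sum(wj * xj for wj, xj in zip(w, xi)) + b
            cost = c_pos if yi == 1 else c_neg
            # Derivative of cost * log(1 + exp(-y * g)) with respect to g.
            d = -yi * sigmoid(-yi * g) * cost
            for j in range(m):
                grad_w[j] += d * xi[j]
            grad_b += d
        # L2 regularization on w (R(w) = 0.5 * ||w||^2) scaled by lam.
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def risk_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Map the model output to a score in (0, 1); higher means more attack-like."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy normalized features: two labeled attacks, three unlabeled requests.
X = [[1.0, 0.9], [0.9, 1.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1]]
y = [1, 1, -1, -1, -1]
w, b = train_cost_sensitive(X, y)
print(risk_score(w, b, [1.0, 1.0]) > risk_score(w, b, [0.0, 0.0]))
```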
  • A positive sample in the above training sample set is labeled 1, indicating that the URL access request corresponding to the training sample is a known URL attack request; an unlabeled sample is treated as a negative sample and labeled -1, indicating that the URL access request corresponding to the training sample is a normal URL access request.
  • l(y_i, g(x_i)) with y_i = -1 represents the loss function corresponding to a negative sample labeled -1, and characterizes the error loss between the model's prediction g(x_i) for the negative sample and the true label y_i = -1; typically, the greater the difference between g(x_i) and y_i, the greater the loss.
  • C+ represents the cost-sensitive weight configured for the loss function corresponding to the positive samples.
  • C- represents the cost-sensitive weight configured for the loss function corresponding to the negative samples.
  • The value of C+ is greater than that of C-.
  • The value of C+ can be set much larger than C-, indicating that, for the finally trained URL attack detection model, the cost of misclassifying a positive sample is far greater than the cost of misclassifying a negative sample.
  • The specific type of the above loss function l(y_i, g(x_i)) is not particularly limited in the present specification; for example, in practical applications, a common log-loss function or
  • R(w) is a regularization term for controlling the complexity of the model; for example, the regularization term can usually be expressed by an L1 norm or an L2 norm. λ is the regularization hyperparameter and can be set flexibly in practical applications.
  • the objective equation of the cost-sensitive PU-Learning machine learning algorithm can be expressed in the form of the following formula:
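  • The formula does not appear in this text. A form consistent with the symbols the surrounding paragraphs define (the per-sample loss l(y_i, g(x_i)), the weights C+ and C-, the regularization term R(w), and the regularization hyperparameter λ) would be the following; this is a reconstruction under that assumption, not a quotation of the original:

```latex
\min_{w}\;
C_{+}\sum_{i:\,y_i = 1} l\bigl(y_i,\, g(x_i)\bigr)
\;+\;
C_{-}\sum_{i:\,y_i = -1} l\bigl(y_i,\, g(x_i)\bigr)
\;+\;
\lambda\, R(w),
\qquad C_{+} > C_{-}
```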
  • The domain name information may be extracted from the URL access request that needs attack detection according to the same feature extraction method as shown in FIG. 2, and the extracted domain name information filtered. Domain name features of several dimensions (consistent with the domain name features used in the model training stage) are then extracted from the filtered domain name information, a prediction sample is constructed based on the extracted domain name features, and the prediction sample is input into the above URL attack detection model for predictive calculation to obtain a risk score for the URL access request.
  • Whether the URL access request is a URL attack request may then be determined based on the risk score;
  • After the prediction sample is input into the URL attack detection model, the model usually outputs a risk score corresponding to the positive sample (i.e., a URL attack request) and a risk score corresponding to the negative sample (i.e., a normal URL access request);
  • The URL access request may be determined to be a URL attack request or a normal URL access request by comparing the two risk scores; if the risk score corresponding to the positive sample is greater than the risk score corresponding to the negative sample, the URL access request is a URL attack request; otherwise, the URL access request is a normal URL access request.
  • The two risk scores may also be compared with a preset risk threshold to determine the specific type of the URL access request: if the risk score corresponding to the positive sample is greater than the risk score corresponding to the negative sample, and the risk score corresponding to the positive sample is greater than the preset risk threshold, the URL access request is a URL attack request;
  • if the risk score corresponding to the negative sample is greater than the risk score corresponding to the positive sample, and the risk score corresponding to the negative sample is greater than the preset risk threshold, the URL access request is a normal URL access request.
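  • The threshold-based decision rule can be sketched as below. The function name `classify_request`, the default threshold of 0.5, and the "undetermined" outcome for cases the text does not cover are assumptions added for illustration.

```python
def classify_request(pos_score: float, neg_score: float, threshold: float = 0.5) -> str:
    """Decide the request type from the two risk scores and a preset threshold."""
    if pos_score > neg_score and pos_score > threshold:
        return "attack"
    if neg_score > pos_score and neg_score > threshold:
        return "normal"
    # Assumption: the text does not say what happens when neither rule fires.
    return "undetermined"


print(classify_request(0.9, 0.1))
print(classify_request(0.2, 0.8))
```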
  • An ensemble learning approach may be used to train multiple URL attack detection models, and after the multiple URL attack detection models have been trained, the specific type of the URL access request is determined by integrating (merging) the risk scores output by the multiple attack detection models.
  • The training samples used for each model, and the modeling features included in those training samples, may differ to some extent; those skilled in the art can flexibly control this based on actual modeling needs.
  • The training sample set may be divided into a plurality of training sample subsets by sampling the modeling features included in the training samples in the training sample set, and each training sample subset is then trained separately to construct the above URL attack detection models.
  • the training sample set may not be sampled, but a plurality of URL attack detection models that need to be trained share a training sample set, which is not limited in this specification.
  • The objective equations of the individual URL attack detection models may also differ. Taking the objective equation of the cost-sensitive PU-Learning machine learning algorithm shown above as an example, the type of loss function and the regularization term used in the objective equation of each URL attack detection model may differ from one another; for example, when two URL attack detection models are trained through ensemble learning, the objective equation of the first attack detection model may use a log-loss function while that of the second uses a hinge-loss function; or the regularization term in the objective equation of the first attack detection model may use an L1 norm while that of the second uses an L2 norm, and so on.
  • Multiple prediction samples can be constructed in the same manner and then input into the multiple URL attack detection models respectively for predictive calculation, to obtain a plurality of risk scores corresponding to the URL access request.
  • the weighted calculation may be performed on the plurality of risk scores, and then the weighted calculation result is used as the final risk score of the URL access request to further determine the specific type of the URL access request.
  • For example, using a weighted average, the weight of each risk score may be set to 0.5 and the weighted scores summed, so that the average of the risk scores obtained from the plurality of URL attack detection models is used as the final risk score of the URL access request.
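  • The weighted combination of risk scores from several models can be sketched as follows. The function name `combine_scores` and the default of equal weights are assumptions; with two models the default reproduces the 0.5 weighting mentioned above.

```python
from typing import Optional, Sequence


def combine_scores(scores: Sequence[float],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Return the weighted sum of per-model risk scores.

    With no weights given, every model gets weight 1/len(scores), which
    yields the plain average.
    """
    if not scores:
        raise ValueError("at least one risk score is required")
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Two models, each weighted 0.5 by default.
final_score = combine_scores([0.8, 0.6])
print(final_score)
```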
  • A large number of URL access requests marked as URL attacks and a large number of unmarked URL access requests, which generally accumulate in a security system, are used as positive samples and unlabeled samples respectively to train and construct a URL attack detection model with the PU-Learning machine learning algorithm, and the URL attack detection model is used to perform attack detection on URL access requests. Compared with a conventional URL attack detection approach that relies on detection rules manually configured by security personnel in the security system, this can discover potential URL attacks in advance and helps to protect against potentially abnormal URL access in a timely manner.
  • Because the URL attack detection model can detect potential threats in advance in daily URL access requests, it can guide security personnel to improve and supplement the detection rules in the security system in a timely manner, thereby improving the security system.
  • the present specification also provides an embodiment of a URL attack detecting apparatus.
  • the embodiment of the URL attack detecting device of the present specification can be applied to an electronic device.
  • the device embodiment may be implemented by software, or may be implemented by hardware or a combination of hardware and software.
  • The processor of the electronic device in which the apparatus is located reads the corresponding computer program instructions from the non-volatile memory into the memory.
  • FIG. 3 is a hardware structure diagram of the electronic device in which the URL attack detecting device of the present specification is located. In addition to the processor, the memory, the network interface, and the non-volatile memory shown in FIG. 3, the electronic device in which the device is located in the embodiment may also include other hardware according to the actual function of the electronic device, and details are not described herein.
  • FIG. 4 is a block diagram of a URL attack detecting apparatus shown in an exemplary embodiment of the present specification.
  • the URL attack detection device 40 can be applied to the electronic device shown in FIG. 3, and includes: a first extraction module 401, a prediction module 402, and a determination module 403.
  • the first extraction module 401 extracts domain name features of several dimensions from the domain name information carried in the URL access request.
  • the prediction module 402 inputs the domain name features into a preset URL attack detection model for predictive calculation and obtains a risk score of the URL access request, wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
  • the determining module 403 determines whether the URL access request is a URL attack request based on the risk score.
  • the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
  • the device 40 further includes:
  • a second extraction module 404 (not shown in FIG. 4) extracts domain name features of several dimensions from domain name information carried in a plurality of URL access request samples, wherein the plurality of URL access request samples include URL access request samples marked with a sample tag and unlabeled URL access request samples; the sample tag characterizes the URL access request sample as a URL attack request;
  • a building module 405 constructs a training sample based on the extracted domain name features
  • a training module 406 (not shown in FIG. 4) trains the plurality of URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
  • The loss functions corresponding to the URL access request samples marked with the sample tag and to the unlabeled URL access request samples are each configured with a cost-sensitive weight, wherein the cost-sensitive weight of the loss function corresponding to the samples marked with the sample tag is greater than the cost-sensitive weight of the loss function corresponding to the unlabeled samples.
  • the URL attack detection model includes a plurality of machine learning models obtained by training a plurality of URL access request samples based on a PU-Learning machine learning algorithm;
  • the prediction module 402 is further:
  • the extracted domain name features of the several dimensions include a combination of multiple of the following domain name features:
  • The total number of characters in the domain name information, the total number of letters in the domain name information, the total number of digits in the domain name information, the total number of symbols in the domain name information, the number of different characters in the domain name information, the number of different letters in the domain name information, the number of different digits in the domain name information, and the number of different symbols in the domain name information.
  • For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the relevant parts of the description of the method embodiment.
  • The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the present specification. Those of ordinary skill in the art can understand and implement this without any creative effort.
  • the system, device, module or unit illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
  • a typical implementation device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, and a game control.
  • the present specification also provides an embodiment of an electronic device.
  • the electronic device includes a processor and a memory for storing machine executable instructions; wherein the processor and the memory are typically interconnected by an internal bus.
  • the device may also include an external interface to enable communication with other devices or components.
  • by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is caused to: extract domain name features of several dimensions from the domain name information carried in a URL access request; input the domain name features into a preset URL attack detection model to obtain a risk score for the URL access request; and determine, based on the risk score, whether the URL access request is a URL attack request.
  • the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm.
  • the URL attack detection model is a machine learning model obtained by training on several URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm.
  • by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is further caused to:
  • extract domain name features of several dimensions from the domain name information carried in the URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label, and the sample label indicates that the URL access request sample is a URL attack request;
  • construct training samples based on the extracted domain name features, and train on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
  • the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
  • the URL attack detection model includes multiple machine learning models obtained by training on several URL access request samples based on a PU-Learning machine learning algorithm;
  • by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is further caused to input the domain name features into each of the machine learning models to obtain multiple risk scores, and to perform a weighted calculation on those risk scores to obtain the risk score of the URL access request.
  • the extracted domain name features of several dimensions include a combination of several of the following domain name features:
  • the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

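This specification provides a URL attack detection method, comprising: extracting domain name features of several dimensions from the domain name information carried in a URL access request; inputting the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request, wherein the URL attack detection model is a machine learning model obtained by training on several URL access request samples based on a PU-Learning machine learning algorithm; and determining, based on the risk score, whether the URL access request is a URL attack request.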

Description

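URL attack detection method, apparatus, and electronic device
Technical Field
This specification relates to the field of computer applications, and in particular to a URL attack detection method, apparatus, and electronic device.
Background
In Internet application scenarios, a large number of URL access requests to web addresses are generated every day. Among them are URL attacks that malicious actors launch through illegitimate URL access requests; common examples include Trojan attacks, SQL injection attacks, and cross-site scripting (XSS) attacks. Such illegitimate URL access requests usually differ in some way from ordinary URL access requests. Therefore, when building an online system, quickly identifying and detecting URL attacks launched by illegitimate users through security measures is a problem that cannot be ignored.
Summary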
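This specification proposes a URL attack detection method, the method comprising:
extracting domain name features of several dimensions from the domain name information carried in a URL access request;
inputting the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
determining, based on the risk score, whether the URL access request is a URL attack request.
Optionally, the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
Optionally, the method further comprises:
extracting domain name features of several dimensions from the domain name information carried in several URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label; the sample label indicates that the URL access request sample is a URL attack request;
constructing training samples based on the extracted domain name features;
training on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
Optionally, the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; wherein the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
Optionally, the URL attack detection model includes multiple machine learning models trained based on a PU-Learning machine learning algorithm;
inputting the domain name features into the preset URL attack detection model for prediction to obtain the risk score of the URL access request comprises:
inputting the domain name features into each of the machine learning models to obtain multiple risk scores; and performing a weighted calculation on the multiple risk scores to obtain the risk score of the URL access request.
Optionally, the extracted domain name features of several dimensions include a combination of several of the following domain name features:
the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.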
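This specification also proposes a URL attack detection apparatus, the apparatus comprising:
a first extraction module, which extracts domain name features of several dimensions from the domain name information carried in a URL access request;
a prediction module, which inputs the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
a determination module, which determines, based on the risk score, whether the URL access request is a URL attack request.
Optionally, the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
Optionally, the apparatus further comprises:
a second extraction module, which extracts domain name features of several dimensions from the domain name information carried in each of several URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label; the sample label indicates that the URL access request sample is a URL attack request;
a building module, which constructs training samples based on the extracted domain name features;
a training module, which trains on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
Optionally, the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; wherein the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
Optionally, the URL attack detection model includes multiple machine learning models trained based on a PU-Learning machine learning algorithm;
the prediction module further:
inputs the domain name features into each of the machine learning models to obtain multiple risk scores; and performs a weighted calculation on the multiple risk scores to obtain the risk score of the URL access request.
Optionally, the extracted domain name features of several dimensions include a combination of several of the following domain name features:
the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.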
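This specification also proposes an electronic device, comprising:
a processor;
a memory for storing machine-executable instructions;
wherein, by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is caused to:
extract domain name features of several dimensions from the domain name information carried in a URL access request;
input the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model obtained by training on several URL access request samples based on a PU-Learning machine learning algorithm;
determine, based on the risk score, whether the URL access request is a URL attack request.
In the technical solution provided by the embodiments of this specification, the domain name features extracted from the domain name information carried in a URL access request are input into a URL attack detection model trained with a PU-Learning machine learning algorithm, and the model's prediction is used to detect attacks. Potential URL attacks can thus be discovered in advance, which helps to apply security protection to potentially abnormal URL accesses in a timely manner.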
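Brief Description of the Drawings
FIG. 1 is a flowchart of a URL attack detection method according to an embodiment of this specification;
FIG. 2 is a flowchart of building a training sample set and training a PU-Learning model according to an embodiment of this specification;
FIG. 3 is a hardware structure diagram of an electronic device hosting a URL attack detection apparatus according to an embodiment of this specification;
FIG. 4 is a logical block diagram of the URL attack detection apparatus according to an embodiment of this specification.
Detailed Description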
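Machine learning is usually divided into three categories according to whether the training samples carry label information: supervised learning, unsupervised learning, and semi-supervised learning. In semi-supervised learning, only some of the training samples are labeled and the rest are unlabeled; the unlabeled samples are used to assist learning from the labeled ones.
Traditional semi-supervised learning usually involves labeled samples of several classes; for example, in the widely used binary classification problem, the labeled training samples are usually divided into labeled positive samples and labeled negative samples.
In practice, however, the labeled samples collected by the modeling party may well carry labels of only one class; for example, there may be only a small number of labeled positive samples, with all remaining samples unlabeled. Machine learning for this scenario is usually called PU Learning (Positive and Unlabeled Learning), that is, learning from labeled positive samples and unlabeled samples.
A traditional security system for URL attack detection usually accumulates a large number of URL access requests labeled as URL attacks, along with a large number of unlabeled URL access requests. Using these labeled and unlabeled requests to discover potential URL attacks (such as URL Trojan attacks) in advance through machine learning is therefore of great significance for identifying, detecting, and promptly defending against URL attacks.
In view of this, this specification proposes a technical solution in which a PU-Learning machine learning algorithm is trained on URL access request samples that include both a large number of samples labeled as URL attacks and a large number of unlabeled samples, in order to build a URL attack detection model; the model is then used to run attack detection on incoming URL access requests and discover potential URL attacks.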
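In implementation, several URL access request samples can be prepared in advance; these samples include both positive samples labeled as URL attacks and unlabeled samples. The samples can then be split to extract the domain name information they carry, for example the main domain name carried in the URL access request and the corresponding domain name suffix.
Next, domain name features of several dimensions can be extracted from the domain name information and normalized, and the normalized features can be used as modeling features to construct training samples. Once the training samples are constructed, they can be trained on with a PU-Learning machine learning algorithm to build the URL attack detection model; for example, a cost-sensitive PU-Learning machine learning algorithm can be used.
Finally, once the URL attack detection model is trained, domain name features of several dimensions can be extracted in the same way from the domain name information carried in a URL access request that needs to be checked. A prediction sample is constructed from the extracted features and input into the URL attack detection model to obtain the risk score of the URL access request (for example, the probability that the request is a URL attack request), and the risk score is then used to determine whether the request is a URL attack request.
With the above technical solution, the domain name features extracted from the domain name information carried in a URL access request are input into a URL attack detection model trained with a PU-Learning machine learning algorithm, and the model's prediction is used to detect attacks. Potential URL attacks can thus be discovered in advance, which helps to apply security protection to potentially abnormal URL accesses in a timely manner.
This specification is described below through specific embodiments in combination with specific application scenarios.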
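Referring to FIG. 1, FIG. 1 shows a URL attack detection method provided by an embodiment of this specification, which performs the following steps:
Step 102: extract domain name features of several dimensions from the domain name information carried in a URL access request;
Step 104: input the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
Step 106: determine, based on the risk score, whether the URL access request is a URL attack request.
In this specification, the modeling party can collect in advance a large number of URL access requests labeled as URL attacks as positive samples and a large number of unlabeled URL access requests as unlabeled samples, build a training sample set from them, and then train on that set with a PU-Learning machine learning algorithm to build the URL attack detection model.
Referring to FIG. 2, FIG. 2 is a flowchart of building a training sample set and training a PU-Learning model as shown in this specification.
As shown in FIG. 2, the collected raw URL access request samples can first be split individually to extract the domain name information they carry; for example, in implementation, the domain name information may include the main domain name carried in the URL access request and the domain name suffix corresponding to the main domain name.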
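The splitting step above can be sketched in a few lines. This is only a minimal illustration: the function name `extract_domain_info` is hypothetical, and treating the last host label as the suffix and the label before it as the main domain is a simplifying assumption (multi-label suffixes such as `co.uk` would need a public suffix list, which the specification does not mention).

```python
from urllib.parse import urlsplit

def extract_domain_info(url):
    """Return (main_domain, suffix) for a URL using a naive last-label split."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        # No dot in the host (or no host at all): there is no separate suffix.
        return host, ""
    # Assumption: suffix = last label, main domain = the label before it.
    return labels[-2], labels[-1]

print(extract_domain_info("http://www.example.com/index?id=1"))
```

Note that `urlsplit` only recognises the host when the URL carries a scheme (or starts with `//`); a bare string such as `example.com/path` yields an empty host here.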
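After the domain name information carried in the URL access request samples has been extracted, the portion that is relatively common among known URL attack requests can be selected from it and used to build the machine learning model. In other words, the domain name information that best characterizes URL attack requests is selected to take part in modeling.
For example, in practice, special domain name information that appears in only a few individual URL access requests does not truly reflect the characteristics of URL attack requests, and including it in modeling would interfere with the model's results; such domain name information can therefore be filtered out.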
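One simple way to realise this filtering is a frequency cut-off. The sketch below is an assumption for illustration only: the specification does not say how "appears in only a few requests" is measured, and both the function name `filter_rare_domains` and the `min_count` parameter are hypothetical.

```python
from collections import Counter

def filter_rare_domains(domains, min_count=2):
    """Keep only domain strings that occur at least min_count times."""
    counts = Counter(domains)
    # Preserve the original order and multiplicity of the retained entries.
    return [d for d in domains if counts[d] >= min_count]

samples = ["a.com", "b.com", "a.com", "c.com"]
print(filter_rare_domains(samples))
```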
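Further, domain name features of several dimensions can be extracted from the selected domain name information to serve as modeling features.
The domain name features extracted from the domain name information are not specifically limited in this specification; in practice, any domain name feature that characterizes the properties and patterns of the domain name information carried in URL attack requests can be selected as a modeling feature.
For example, in practice, those skilled in the art who take part in modeling can, based on experience, extract domain name features of several dimensions from the parameter values corresponding to the domain name information, build trial models on these features, evaluate the results, and select the dimensions that contribute most to the model as modeling features.
In one illustrated implementation, the domain name features extracted from the domain name information may include eight dimensions: the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information. In practice, those skilled in the art can combine all eight dimensions as modeling features, or select several of the eight and combine them as modeling features.
The eight dimensions shown above are of course only examples; in practice, those skilled in the art can also extract domain name features of other dimensions from the domain name information as modeling features, which are not enumerated one by one in this specification.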
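The eight counts listed above can be computed directly from the domain string. The following is a minimal sketch; the function name `domain_features`, the lower-casing of the input, and the definition of "symbol" as any character that is neither a letter nor a digit are assumptions made for illustration.

```python
def domain_features(domain):
    """Return the eight count features for one domain string."""
    domain = domain.lower()  # assumption: domain names are compared case-insensitively
    letters = [c for c in domain if c.isalpha()]
    digits = [c for c in domain if c.isdigit()]
    # Assumption: everything that is not a letter or a digit counts as a symbol.
    symbols = [c for c in domain if not (c.isalpha() or c.isdigit())]
    return [
        len(domain),        # total characters
        len(letters),       # total letters
        len(digits),        # total digits
        len(symbols),       # total symbols
        len(set(domain)),   # distinct characters
        len(set(letters)),  # distinct letters
        len(set(digits)),   # distinct digits
        len(set(symbols)),  # distinct symbols
    ]

print(domain_features("a1-b2.example.com"))
```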
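Referring again to FIG. 2, after domain name features of several dimensions have been extracted from the selected domain name information, the value ranges of different features may not be uniform. The features can therefore be normalized so that their value ranges fall into one common numeric interval, which removes the effect of differing value ranges on modeling accuracy.
After normalization, a feature vector can be created for each URL access request sample from the domain name features extracted from the domain name information it carries, and used as a training sample; the dimension of the feature vector equals the number of extracted domain name feature dimensions.
Once a feature vector has been built for each URL access request sample, a target matrix can be created from these vectors; for example, if N URL access request samples were collected and M-dimensional domain name features were extracted from each, the target matrix is an N×M matrix.
This target matrix is the training sample set that finally takes part in training the machine learning model.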
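As a sketch of the normalization and matrix-building steps, the code below applies min-max scaling per feature column. Min-max scaling to [0, 1] is an assumed choice (the specification only requires a common numeric interval), and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(rows):
    """Scale each column of an N x M list of feature rows into [0, 1].

    Returns the normalized N x M matrix plus the per-column minima and maxima,
    so the same scaling can be reused for prediction samples later.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    matrix = [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant columns map to 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
    return matrix, lows, highs

raw = [[10, 2], [20, 2], [30, 4]]  # N = 3 samples, M = 2 features
matrix, lows, highs = min_max_normalize(raw)
print(matrix)
```

Keeping `lows` and `highs` matters because prediction samples should be scaled with the statistics of the training set rather than their own.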
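Referring again to FIG. 2, once the training sample set has been constructed, the training samples can be trained on with a PU-Learning machine learning algorithm to build the URL attack detection model.
A PU-Learning machine learning algorithm can generally follow one of several learning strategies; typical strategies fall into two classes, the two-stage strategy and the cost-sensitive strategy.
In the two-stage strategy, the algorithm first uses the known positive samples and the unlabeled samples to mine potential reliable negative samples (Reliable Negative) from the unlabeled samples; then, using the known positive samples and the mined reliable negatives, it turns the problem into a traditional supervised learning process to train a classification model.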
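To make the first stage concrete, here is a deliberately simple sketch of mining reliable negatives. The heuristic used (treat the unlabeled samples farthest from the centroid of the positives as reliable negatives) is an assumption chosen for brevity, not the method of the specification; `reliable_negatives` and `fraction` are hypothetical names.

```python
import math

def reliable_negatives(positives, unlabeled, fraction=0.5):
    """Pick the unlabeled samples farthest from the positive centroid."""
    dim = len(positives[0])
    centroid = [sum(p[j] for p in positives) / len(positives) for j in range(dim)]

    def distance(x):
        return math.sqrt(sum((a - c) ** 2 for a, c in zip(x, centroid)))

    # Sort by decreasing distance and keep the requested fraction (at least one).
    ranked = sorted(unlabeled, key=distance, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

pos = [[1.0, 1.0], [0.8, 1.0]]
unl = [[0.9, 0.9], [0.0, 0.0], [0.1, 0.2], [1.0, 1.0]]
print(reliable_negatives(pos, unl))
```

The second stage would then train any ordinary supervised classifier on the positives plus the samples returned here.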
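In the cost-sensitive strategy, the algorithm assumes that the proportion of positive samples among the unlabeled samples is extremely low, so the unlabeled samples can be treated directly as negative samples, with the positive samples given a higher cost-sensitive weight than the negative samples; for example, in the objective function of a cost-sensitive PU-Learning machine learning algorithm, a higher cost-sensitive weight is usually set for the loss function corresponding to the positive samples.
Giving the positive samples a higher cost-sensitive weight makes the cost of misclassifying a positive sample far greater than the cost of misclassifying a negative sample in the trained classification model. In this way, a cost-sensitive classifier can be learned directly from the positive samples and the unlabeled samples (treated as negative samples) and used to classify unknown samples.
In one illustrated implementation, the cost-sensitive PU-Learning machine learning algorithm can be chosen as the modeling algorithm to train on the training sample set and build the URL attack detection model.
The process of training on the training sample set with the cost-sensitive PU-Learning machine learning algorithm is described in detail below.
It should be noted that training on the training sample set with the cost-sensitive PU-Learning machine learning algorithm is only an example in this specification and is not limiting; when implementing the solution with the technical details disclosed here, those skilled in the art can also use other learning strategies of the PU-Learning machine learning algorithm (such as the two-stage strategy), which are not detailed one by one in this specification.
In this specification, it is assumed that the positive samples in the training sample set are labeled 1, indicating that the URL access request corresponding to the training sample is a known URL attack request; the unlabeled samples are treated as negative samples and labeled -1, indicating that the URL access request corresponding to the training sample is a normal URL access request.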
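The objective function of the cost-sensitive PU-Learning machine learning algorithm can be expressed by the following formula (formula image PCTCN2018110769-appb-000001):
$$\min_{w,b}\; C_+ \sum_{y_i=1} l(y_i, g(x_i)) + C_- \sum_{y_i=-1} l(y_i, g(x_i)) + \lambda R(w)$$
In the formula above:
$$g(x_i) = w^T x_i + b$$
(formula image PCTCN2018110769-appb-000002) is the functional form of the machine learning model to be trained and represents the learned model's prediction for training sample $x_i$ (that is, the risk score finally output). $w^T$ and $b$ are the model parameters to be learned.
$$\sum_{y_i=1} l(y_i, g(x_i))$$
(formula image PCTCN2018110769-appb-000003) is the loss function corresponding to the positive samples labeled 1; it characterizes the error loss between the model's prediction $g(x_i)$ for a positive sample and the true label $y_i = 1$.
$$\sum_{y_i=-1} l(y_i, g(x_i))$$
(formula image PCTCN2018110769-appb-000004) is the loss function corresponding to the negative samples labeled -1; it characterizes the error loss between the model's prediction $g(x_i)$ for a negative sample and the true label $y_i = -1$. In general, the larger the difference between $g(x_i)$ and $y_i$, the larger the loss.
$C_+$ is the cost-sensitive weight configured for the loss function corresponding to the positive samples.
$C_-$ is the cost-sensitive weight configured for the loss function corresponding to the negative samples. $C_+$ is greater than $C_-$; in practice, $C_+$ can be set to a value far greater than $C_-$, meaning that for the trained URL attack detection model the cost of misclassifying a positive sample is greater than the cost of misclassifying a negative sample.
The specific type of the loss function $l(y_i, g(x_i))$ is not specifically limited in this specification; for example, in practice, the common log-loss function or hinge-loss function can be chosen.
When the loss function $l(y_i, g(x_i))$ is the log-loss function, its expression is:
$$l(y_i, g(x_i)) = \log\left(1 + \exp(-y_i\, g(x_i))\right)$$
When the loss function $l(y_i, g(x_i))$ is the hinge-loss function, its expression is:
$$l(y_i, g(x_i)) = \max\{0,\; 1 - y_i\, g(x_i)\}$$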
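The two candidate loss functions are short enough to write out. The sketch below follows the expressions above with labels in {1, -1}; the function names are hypothetical, and the branch in `log_loss` is only a standard numerical-stability rearrangement of log(1 + exp(-y·g)), not something the specification prescribes.

```python
import math

def log_loss(y, g):
    """log(1 + exp(-y * g)), evaluated without overflow for large |g|."""
    z = -y * g
    if z > 0:
        # log(1 + e^z) = z + log(1 + e^-z) keeps the exponent non-positive.
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def hinge_loss(y, g):
    """max(0, 1 - y * g)."""
    return max(0.0, 1.0 - y * g)

# A confident correct score costs little; the same score with the wrong label costs a lot.
print(log_loss(1, 3.0), log_loss(-1, 3.0))
print(hinge_loss(1, 3.0), hinge_loss(-1, 3.0))
```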
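In the term $\lambda R(w)$, $R(w)$ is the regularization term, used to control the complexity of the model; for example, the regularization term can usually be expressed with the L1 norm or the L2 norm. $\lambda$ is the regularization hyperparameter and can be set flexibly in practice.
For example, if the log-loss function is chosen as the loss function and the L2 norm as the regularization term, the objective function of the cost-sensitive PU-Learning machine learning algorithm can be written in the following form (formula image PCTCN2018110769-appb-000005):
$$\min_{w,b}\; C_+ \sum_{y_i=1} \log\left(1 + \exp(-y_i\, g(x_i))\right) + C_- \sum_{y_i=-1} \log\left(1 + \exp(-y_i\, g(x_i))\right) + \lambda \lVert w \rVert_2^2$$
In this specification, the training samples in the training sample set can be fed into the objective function above for iterative computation, to find the model parameters $w^T$ and $b$ at which the error loss between the prediction $g(x_i)$ and the true labels $y_i = 1$ and $y_i = -1$ is smallest. The specific iterative computation and solution procedure are not detailed in this specification; when putting the technical solution into practice, those skilled in the art can refer to the related art.
Once the model parameters $w^T$ and $b$ that minimize the error loss between the prediction $g(x_i)$ and the true labels $y_i = 1$ and $y_i = -1$ have been computed, the optimization of the objective function has converged and training of the URL attack detection model is complete.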
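Since the specification leaves the iterative solution to the related art, the following is only one possible sketch: plain batch gradient descent on the log-loss objective with an L2 regularizer and a linear model g(x) = w·x + b. All names and default values (`train_cost_sensitive_pu`, `c_pos`, `c_neg`, `lam`, `lr`, `epochs`) are hypothetical, and squashing g(x) through a sigmoid to get a probability-like risk score is an added assumption.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_cost_sensitive_pu(X, y, c_pos=10.0, c_neg=1.0, lam=0.01, lr=0.3, epochs=1000):
    """Minimise C+ * sum_pos log-loss + C- * sum_neg log-loss + lam * ||w||^2."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w = [2.0 * lam * wj for wj in w]  # gradient of the L2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            g = sum(wj * xj for wj, xj in zip(w, xi)) + b
            cost = c_pos if yi == 1 else c_neg
            # d/dg log(1 + exp(-y g)) = -y * sigmoid(-y g), scaled by the class weight.
            coeff = -cost * yi * sigmoid(-yi * g)
            for j in range(m):
                grad_w[j] += coeff * xi[j]
            grad_b += coeff
        # Dividing by n only rescales the step size.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def risk_score(w, b, x):
    """Assumed mapping of g(x) to a probability-like score in (0, 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Labeled attacks are 1; unlabeled samples are treated as -1.
X = [[1.0, 0.9], [0.9, 1.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1], [0.1, 0.1]]
y = [1, 1, -1, -1, -1, -1]
w, b = train_cost_sensitive_pu(X, y)
print(risk_score(w, b, [1.0, 1.0]), risk_score(w, b, [0.0, 0.0]))
```

The inputs are assumed to be normalized feature vectors; with unnormalized features the fixed learning rate used here may need to be reduced.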
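In this specification, once the URL attack detection model has been trained, the same feature extraction procedure shown in FIG. 2 can be applied to a URL access request that needs to be checked: extract the domain name information from the request, filter the extracted domain name information, and extract domain name features of several dimensions from the filtered domain name information (consistent with the features used in the training stage). A prediction sample is then constructed from the extracted features and input into the URL attack detection model to obtain the risk score of the URL access request.
After the model has predicted the risk score of the URL access request, the risk score can be used to determine whether the request is a URL attack request.
For example, after the prediction sample is input into the URL attack detection model, the model usually outputs one risk score for the positive class (URL attack request) and one for the negative class (normal URL access request).
In one implementation, the two risk scores can be compared to determine whether the URL access request is a URL attack request or an ordinary normal URL access request: if the risk score for the positive class is greater than the risk score for the negative class, the request is a URL attack request; otherwise, it is a normal URL access request.
In another implementation, to improve the accuracy of the decision, the two risk scores can also be compared against a preset risk threshold in addition to being compared with each other: if the risk score for the positive class is greater than the risk score for the negative class and also greater than the preset risk threshold, the request is a URL attack request; conversely, if the risk score for the negative class is greater than the risk score for the positive class and also greater than the preset risk threshold, the request is a normal URL access request.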
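Both decision rules fit in one small function. This is a sketch under stated assumptions: the function name `classify` and the string labels are hypothetical, and returning "undetermined" when neither score clears the threshold is an assumption, since the specification does not say what happens in that case.

```python
def classify(score_attack, score_normal, threshold=None):
    """Decide the request type from the positive-class and negative-class scores."""
    if threshold is None:
        # First rule: compare the two scores directly.
        return "attack" if score_attack > score_normal else "normal"
    # Second rule: the winning score must also exceed the preset risk threshold.
    if score_attack > score_normal and score_attack > threshold:
        return "attack"
    if score_normal > score_attack and score_normal > threshold:
        return "normal"
    return "undetermined"  # assumption: left open by the specification

print(classify(0.8, 0.2))
print(classify(0.55, 0.45, threshold=0.7))
```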
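Referring again to FIG. 2, in this specification, to improve the stability and predictive power of the trained URL attack detection model, ensemble learning can be used to train multiple URL attack detection models; once they are trained, the risk scores they output are integrated (fused) to determine the specific type of the URL access request.
When multiple URL attack detection models are trained through ensemble learning, the training samples used, the modeling features contained in the training samples, and the objective functions can all differ between models; in practice, those skilled in the art can control this flexibly according to actual modeling needs.
For example, in one implementation, the modeling features contained in the training samples of the training sample set can be sampled, so that the training sample set is divided by modeling feature into multiple training sample subsets; each subset is then trained on separately to build a URL attack detection model. In practice, it is of course also possible not to sample the training sample set and instead have the multiple URL attack detection models share one training sample set; this is not specifically limited in this specification.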
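One way to read "sampling the modeling features" is to draw a random subset of feature columns for each model in the ensemble. The sketch below assumes exactly that; `feature_subsets`, `project`, and the fixed `seed` are hypothetical choices made so the example is reproducible.

```python
import random

def feature_subsets(num_features, num_models, subset_size, seed=0):
    """Draw one sorted list of feature column indices per model."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_features), subset_size))
            for _ in range(num_models)]

def project(rows, columns):
    """Restrict every feature row to the chosen columns."""
    return [[row[j] for j in columns] for row in rows]

subsets = feature_subsets(num_features=8, num_models=3, subset_size=5)
training_set = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]
for cols in subsets:
    print(cols, project(training_set, cols))
```

At prediction time each model must receive the same columns it was trained on, so the index lists need to be stored alongside the models.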
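In another implementation, when multiple URL attack detection models are trained through ensemble learning, the objective functions of the models can also differ. Taking the objective function of the cost-sensitive PU-Learning machine learning algorithm shown above as an example, the type of loss function and the regularization term used in the objective function of each model can be different. For instance, when two URL attack detection models are trained through ensemble learning, the objective function of the first model can use the log-loss function while that of the second uses the hinge-loss function; or the regularization term of the first model's objective function can use the L1 norm while that of the second uses the L2 norm, and so on.
In this case, when the trained URL attack detection models are used to check a URL access request, multiple prediction samples can be constructed in the same way and input into the respective models to obtain multiple risk scores for the request. A weighted calculation can then be performed on these risk scores, and the result is taken as the final risk score of the URL access request, which is used to determine its specific type.
The specific way of weighting the risk scores is not specifically limited in this specification. For example, in one implementation a weighted average can be used: with two models, each risk score is given a weight of 0.5 and the weighted scores are summed, so that the average of the risk scores predicted by the URL attack detection models serves as the final risk score of the URL access request.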
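The weighted fusion of risk scores can be sketched as follows. The function name `fuse_scores` is hypothetical, and defaulting to equal weights (which gives 0.5 each for two models, as in the example above) is an assumption for the case where no weights are supplied.

```python
def fuse_scores(scores, weights=None):
    """Combine per-model risk scores into one final score by weighted sum."""
    if weights is None:
        # Equal weights turn the weighted sum into a plain average.
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per score")
    return sum(w * s for w, s in zip(weights, scores))

print(fuse_scores([0.9, 0.5]))                 # two models, weight 0.5 each
print(fuse_scores([0.9, 0.5], [0.75, 0.25]))   # trust the first model more
```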
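As the above embodiments show, in this specification the large number of URL access requests labeled as URL attacks and the large number of unlabeled URL access requests accumulated in a security system are used as positive samples and unlabeled samples, a URL attack detection model is built by training with a PU-Learning machine learning algorithm, and the model is used to run attack detection on URL access requests. Compared with the traditional approach of detecting URL attacks with rules configured manually by security personnel, this can discover potential URL attacks in advance, which helps to apply security protection to potentially abnormal URL accesses in a timely manner. Moreover, because the URL attack detection model can discover potential threats among everyday URL access requests ahead of time, it can guide security personnel in promptly refining and supplementing the detection rules in the security system, raising the security level of the system as a whole.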
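Corresponding to the method embodiments above, this specification also provides an embodiment of a URL attack detection apparatus. The apparatus embodiment can be applied to an electronic device and can be implemented in software, in hardware, or in a combination of the two. Taking a software implementation as an example, the apparatus in the logical sense is formed when the processor of the electronic device on which it resides reads the corresponding computer program instructions from non-volatile storage into memory and runs them. At the hardware level, FIG. 3 shows a hardware structure diagram of the electronic device on which the URL attack detection apparatus of this specification resides; besides the processor, memory, network interface, and non-volatile storage shown in FIG. 3, the electronic device may include other hardware depending on its actual functions, which is not described further here.
FIG. 4 is a block diagram of a URL attack detection apparatus according to an exemplary embodiment of this specification.
Referring to FIG. 4, the URL attack detection apparatus 40 can be applied to the electronic device shown in FIG. 3 and includes a first extraction module 401, a prediction module 402, and a determination module 403.
The first extraction module 401 extracts domain name features of several dimensions from the domain name information carried in a URL access request;
the prediction module 402 inputs the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
the determination module 403 determines, based on the risk score, whether the URL access request is a URL attack request.
In this embodiment, the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
In this embodiment, the apparatus 40 further includes:
a second extraction module 404 (not shown in FIG. 4), which extracts domain name features of several dimensions from the domain name information carried in several URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label; the sample label indicates that the URL access request sample is a URL attack request;
a building module 405 (not shown in FIG. 4), which constructs training samples based on the extracted domain name features;
a training module 406 (not shown in FIG. 4), which trains on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
In this embodiment, the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; wherein the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
In this embodiment, the URL attack detection model includes multiple machine learning models obtained by training on several URL access request samples based on a PU-Learning machine learning algorithm;
the prediction module 402 further:
inputs the domain name features into each of the machine learning models to obtain multiple risk scores; and performs a weighted calculation on the multiple risk scores to obtain the risk score of the URL access request.
In this embodiment, the extracted domain name features of several dimensions include a combination of several of the following domain name features:
the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.
For the implementation of the functions and roles of each module in the apparatus, see the implementation of the corresponding steps in the method above; details are not repeated here.
Since the apparatus embodiment basically corresponds to the method embodiment, refer to the description of the method embodiment for the relevant details. The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this specification. Those of ordinary skill in the art can understand and implement this without creative effort.
The system, apparatus, module, or unit described in the above embodiments can be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.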
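Corresponding to the method embodiments above, this specification also provides an embodiment of an electronic device. The electronic device includes a processor and a memory for storing machine-executable instructions; the processor and the memory are typically interconnected by an internal bus. In other possible implementations, the device may also include an external interface to enable communication with other devices or components.
In this embodiment, by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is caused to:
extract domain name features of several dimensions from the domain name information carried in a URL access request;
input the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
determine, based on the risk score, whether the URL access request is a URL attack request.
In this embodiment, the URL attack detection model is a machine learning model obtained by training on several URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm.
In this embodiment, by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is further caused to:
extract domain name features of several dimensions from the domain name information carried in the URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label; the sample label indicates that the URL access request sample is a URL attack request;
construct training samples based on the extracted domain name features;
train on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
In this embodiment, the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; wherein the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
In this embodiment, the URL attack detection model includes multiple machine learning models obtained by training on several URL access request samples based on a PU-Learning machine learning algorithm;
in this embodiment, by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is further caused to:
input the domain name features into each of the machine learning models to obtain multiple risk scores; and perform a weighted calculation on the multiple risk scores to obtain the risk score of the URL access request.
In this embodiment, the extracted domain name features of several dimensions include a combination of several of the following domain name features:
the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.
Other embodiments of this specification will readily occur to those skilled in the art upon considering the specification and practicing the invention disclosed here. This specification is intended to cover any variations, uses, or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this specification. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this specification indicated by the claims below.
It should be understood that this specification is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this specification is limited only by the appended claims.
The above are merely preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.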

Claims (13)

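  1. A URL attack detection method, the method comprising:
    extracting domain name features of several dimensions from the domain name information carried in a URL access request;
    inputting the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
    determining, based on the risk score, whether the URL access request is a URL attack request.
  2. The method according to claim 1, wherein the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
  3. The method according to claim 1, the method further comprising:
    extracting domain name features of several dimensions from the domain name information carried in several URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label; the sample label indicates that the URL access request sample is a URL attack request;
    constructing training samples based on the extracted domain name features;
    training on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
  4. The method according to claim 3, wherein the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; wherein the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
  5. The method according to claim 1, wherein the URL attack detection model includes multiple machine learning models trained based on a PU-Learning machine learning algorithm;
    inputting the domain name features into the preset URL attack detection model for prediction to obtain the risk score of the URL access request comprises:
    inputting the domain name features into each of the machine learning models to obtain multiple risk scores; and performing a weighted calculation on the multiple risk scores to obtain the risk score of the URL access request.
  6. The method according to claim 1 or 3, wherein the extracted domain name features of several dimensions include a combination of several of the following domain name features:
    the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.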
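  7. A URL attack detection apparatus, the apparatus comprising:
    a first extraction module, which extracts domain name features of several dimensions from the domain name information carried in a URL access request;
    a prediction module, which inputs the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model trained based on a PU-Learning machine learning algorithm;
    a determination module, which determines, based on the risk score, whether the URL access request is a URL attack request.
  8. The apparatus according to claim 7, wherein the URL attack detection model is a machine learning model trained based on a cost-sensitive PU-Learning machine learning algorithm.
  9. The apparatus according to claim 7, the apparatus further comprising:
    a second extraction module, which extracts domain name features of several dimensions from the domain name information carried in each of several URL access request samples; wherein the URL access request samples include samples labeled with a sample label and samples without a sample label; the sample label indicates that the URL access request sample is a URL attack request;
    a building module, which constructs training samples based on the extracted domain name features;
    a training module, which trains on the URL access request samples based on a cost-sensitive PU-Learning machine learning algorithm to obtain the URL attack detection model.
  10. The apparatus according to claim 9, wherein the loss functions corresponding to the labeled URL access request samples and to the unlabeled URL access request samples are each configured with a cost-sensitive weight; wherein the cost-sensitive weight of the loss function for the labeled samples is greater than the cost-sensitive weight of the loss function for the unlabeled samples.
  11. The apparatus according to claim 7, wherein the URL attack detection model includes multiple machine learning models trained based on a PU-Learning machine learning algorithm;
    the prediction module further:
    inputs the domain name features into each of the machine learning models to obtain multiple risk scores; and performs a weighted calculation on the multiple risk scores to obtain the risk score of the URL access request.
  12. The apparatus according to claim 7 or 9, wherein the extracted domain name features of several dimensions include a combination of several of the following domain name features:
    the total number of characters, the total number of letters, the total number of digits, the total number of symbols, the number of distinct characters, the number of distinct letters, the number of distinct digits, and the number of distinct symbols in the domain name information.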
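  13. An electronic device, comprising:
    a processor;
    a memory for storing machine-executable instructions;
    wherein, by reading and executing the machine-executable instructions stored in the memory that correspond to the control logic of URL attack detection, the processor is caused to:
    extract domain name features of several dimensions from the domain name information carried in a URL access request;
    input the domain name features into a preset URL attack detection model for prediction to obtain a risk score for the URL access request; wherein the URL attack detection model is a machine learning model obtained by training on several URL access request samples based on a PU-Learning machine learning algorithm;
    determine, based on the risk score, whether the URL access request is a URL attack request.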
PCT/CN2018/110769 2017-12-07 2018-10-18 URL attack detection method, apparatus, and electronic device WO2019109743A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711283758.0A CN108111489B (zh) 2017-12-07 2017-12-07 URL attack detection method, apparatus, and electronic device
CN201711283758.0 2017-12-07

Publications (1)

Publication Number Publication Date
WO2019109743A1 true WO2019109743A1 (zh) 2019-06-13

Family

ID=62209372

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/110769 WO2019109743A1 (zh) 2017-12-07 2018-10-18 URL attack detection method, apparatus, and electronic device

Country Status (3)

Country Link
CN (1) CN108111489B (zh)
TW (1) TWI673625B (zh)
WO (1) WO2019109743A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110912917A (zh) * 2019-11-29 2020-03-24 深圳市任子行科技开发有限公司 恶意url检测方法及系统
CN113949526A (zh) * 2021-09-07 2022-01-18 中云网安科技有限公司 一种访问控制方法、装置、存储介质及电子设备
CN114070819A (zh) * 2021-10-09 2022-02-18 北京邮电大学 恶意域名检测方法、设备、电子设备及存储介质
CN114363025A (zh) * 2021-12-27 2022-04-15 中国电信股份有限公司 域名检测方法、装置、设备及存储介质
CN114866344A (zh) * 2022-07-05 2022-08-05 佛山市承林科技有限公司 信息系统数据安全防护方法、系统及云平台
CN116455640A (zh) * 2023-04-20 2023-07-18 云盾智慧安全科技有限公司 一种网站安全防护方法及装置
CN118296390A (zh) * 2024-06-06 2024-07-05 齐鲁工业大学(山东省科学院) 可穿戴行为识别模型的训练方法、行为识别方法及系统

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108111489B (zh) * 2017-12-07 2020-06-30 阿里巴巴集团控股有限公司 Url攻击检测方法、装置以及电子设备
CN108229156A (zh) 2017-12-28 2018-06-29 阿里巴巴集团控股有限公司 Url攻击检测方法、装置以及电子设备
CN108366071B (zh) 2018-03-06 2020-06-23 阿里巴巴集团控股有限公司 Url异常定位方法、装置、服务器及存储介质
CN109145030B (zh) * 2018-06-26 2022-07-22 创新先进技术有限公司 一种异常数据访问的检测方法和装置
CN109688110A (zh) * 2018-11-22 2019-04-26 顺丰科技有限公司 Dga域名检测模型构建方法、装置、服务器及存储介质
CN111600919B (zh) * 2019-02-21 2023-04-07 北京金睛云华科技有限公司 智能网络应用防护系统模型的构建方法和装置
US11368486B2 (en) * 2019-03-12 2022-06-21 Fortinet, Inc. Determining a risk probability of a URL using machine learning of URL segments
CN109936582B (zh) * 2019-04-24 2020-04-28 第四范式(北京)技术有限公司 构建基于pu学习的恶意流量检测模型的方法及装置
CN111181756B (zh) * 2019-07-11 2021-12-14 腾讯科技(深圳)有限公司 一种域名安全性判定方法、装置、设备及介质
CN110933105B (zh) * 2019-12-13 2021-10-22 中国电子科技网络信息安全有限公司 一种Web攻击检测方法、系统、介质和设备
CN113158182A (zh) * 2020-01-07 2021-07-23 深信服科技股份有限公司 一种web攻击检测方法、装置及电子设备和存储介质
CN111314291A (zh) * 2020-01-15 2020-06-19 北京小米移动软件有限公司 网址安全性检测方法及装置、存储介质
CN113395237A (zh) * 2020-03-12 2021-09-14 中国电信股份有限公司 攻击检测方法及装置、计算机可存储介质
CN113537262B (zh) * 2020-04-20 2024-05-28 深信服科技股份有限公司 数据分析方法、装置、设备和可读存储介质
CN114553496B (zh) * 2022-01-28 2022-11-15 中国科学院信息工程研究所 基于半监督学习的恶意域名检测方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102176698A (zh) * 2010-12-20 2011-09-07 北京邮电大学 一种基于迁移学习的用户异常行为检测方法
CN105677900A (zh) * 2016-02-04 2016-06-15 南京理工大学 恶意用户检测方法及装置
CN107426199A (zh) * 2017-07-05 2017-12-01 浙江鹏信信息科技股份有限公司 一种网络异常行为检测与分析的方法及系统
CN107577945A (zh) * 2017-09-28 2018-01-12 阿里巴巴集团控股有限公司 Url攻击检测方法、装置以及电子设备
CN108111489A (zh) * 2017-12-07 2018-06-01 阿里巴巴集团控股有限公司 Url攻击检测方法、装置以及电子设备

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI310919B (en) * 2002-01-11 2009-06-11 Sap Ag Context-aware and real-time item tracking system architecture and scenariors
TW200919210A (en) * 2007-07-18 2009-05-01 Steven Kays Adaptive electronic design
TW200926033A (en) * 2007-07-18 2009-06-16 Steven Kays Adaptive electronic design
US8869271B2 (en) * 2010-02-02 2014-10-21 Mcafee, Inc. System and method for risk rating and detecting redirection activities
CN102790762A (zh) * 2012-06-18 2012-11-21 东南大学 基于url分类的钓鱼网站检测方法
CN105357221A (zh) * 2015-12-04 2016-02-24 北京奇虎科技有限公司 识别钓鱼网站的方法及装置
CN106131016B (zh) * 2016-07-13 2019-05-03 北京知道创宇信息技术有限公司 恶意url检测干预方法、系统及装置
CN106789888B (zh) * 2016-11-18 2020-08-04 重庆邮电大学 一种多特征融合的钓鱼网页检测方法
CN106713303A (zh) * 2016-12-19 2017-05-24 北京启明星辰信息安全技术有限公司 一种恶意域名检测方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YA-LIN ZHANG: "POSTER: A PU Learning based System for Potential Malicious URL Detection", CCS ' 17 PROCEEDINGS OF THE 2017 ACM SIGSAC CONFERENCE ON COMPUTER AND COMMU- NICATIONS SECURITY, 30 October 2017 (2017-10-30) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110912917A (zh) * 2019-11-29 2020-03-24 深圳市任子行科技开发有限公司 恶意url检测方法及系统
CN113949526A (zh) * 2021-09-07 2022-01-18 中云网安科技有限公司 一种访问控制方法、装置、存储介质及电子设备
CN114070819A (zh) * 2021-10-09 2022-02-18 北京邮电大学 恶意域名检测方法、设备、电子设备及存储介质
CN114363025A (zh) * 2021-12-27 2022-04-15 中国电信股份有限公司 域名检测方法、装置、设备及存储介质
CN114866344A (zh) * 2022-07-05 2022-08-05 佛山市承林科技有限公司 信息系统数据安全防护方法、系统及云平台
CN114866344B (zh) * 2022-07-05 2022-09-27 佛山市承林科技有限公司 信息系统数据安全防护方法、系统及云平台
CN116455640A (zh) * 2023-04-20 2023-07-18 云盾智慧安全科技有限公司 一种网站安全防护方法及装置
CN118296390A (zh) * 2024-06-06 2024-07-05 齐鲁工业大学(山东省科学院) 可穿戴行为识别模型的训练方法、行为识别方法及系统

Also Published As

Publication number Publication date
CN108111489B (zh) 2020-06-30
TWI673625B (zh) 2019-10-01
CN108111489A (zh) 2018-06-01
TW201926106A (zh) 2019-07-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18885795

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18885795

Country of ref document: EP

Kind code of ref document: A1