CN112182578A - Model training method, URL detection method and device - Google Patents

Model training method, URL detection method and device Download PDF

Info

Publication number
CN112182578A
Authority
CN
China
Prior art keywords
parameter
url
parameters
determining
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011120753.8A
Other languages
Chinese (zh)
Inventor
张雅淋 (Zhang Yalin)
李龙飞 (Li Longfei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN202011120753.8A priority Critical patent/CN112182578A/en
Publication of CN112182578A publication Critical patent/CN112182578A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 - Detecting local intrusion or implementing counter-measures
    • G06F21/56 - Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 - Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/51 - Monitoring users, programs or devices to maintain the integrity of platforms at application loading time, e.g. accepting, rejecting, starting or inhibiting executable software based on integrity or source reliability
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 - Detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 - Traffic logging, e.g. anomaly detection
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 - Countermeasures against malicious traffic
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 - Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 - Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements
    • G06F2221/2119 - Authenticating web pages, e.g. with suspicious links

Abstract

The embodiments of this specification disclose a model training method, a URL detection method and corresponding apparatuses. In these embodiments, several URLs are obtained, the parameters in each URL are determined, a feature vector corresponding to each parameter is obtained, and an isolation forest model is then constructed from the feature vectors corresponding to the parameters.

Description

Model training method, URL detection method and device
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a model training method, a method for detecting a URL, and an apparatus for detecting a URL.
Background
In the internet era, network security is particularly important. Hackers often exploit network security vulnerabilities to attack servers through Uniform Resource Locators (URLs), performing illegal operations such as Structured Query Language (SQL) injection attacks, cross-site scripting attacks, and the like. Taking an SQL injection attack as an example, a hacker may add an illegal field to a parameter of a URL, so that when the server parses the received URL, the illegal field is mistaken for executable code and executed, thereby threatening the security of data on the server.
In practical applications, a person responsible for network security usually sets security rules based on business experience (for example, a URL containing the XX field cannot pass detection), so that the server checks whether a received URL conforms to the security rules and parses only the URLs that do, thereby avoiding attacks.
Based on the prior art, a more secure and reliable method for detecting a URL is needed.
Disclosure of Invention
The embodiments of this specification provide a model training method, a URL detection method and corresponding apparatuses, to address the limited security of existing URL detection methods.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
the model training method provided by the embodiment of the specification comprises the following steps:
acquiring a plurality of Uniform Resource Locators (URLs);
for each URL, extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
and constructing an Isolation Forest model according to the characteristic vectors corresponding to the parameters respectively, wherein the Isolation Forest model is used for detecting whether the URL is abnormal or not.
The method for detecting the URL provided by the embodiment of the specification comprises the following steps:
acquiring a URL;
extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model so as to perform anomaly detection on the URL; the isolated forest model is constructed according to the model training method.
The model training device provided by the embodiment of the specification comprises:
the acquisition module acquires a plurality of Uniform Resource Locators (URLs);
the extraction module is used for extracting parameters in the URL aiming at each URL;
the determining module is used for determining a feature vector corresponding to each extracted parameter;
and the processing module is used for constructing an isolation forest model according to the characteristic vectors corresponding to the parameters respectively, and the isolation forest model is used for detecting whether the URL is abnormal or not.
An apparatus for detecting a URL provided in an embodiment of the present specification includes:
the acquisition module acquires the URL;
the extraction module extracts the parameters in the URL;
the determining module is used for determining a feature vector corresponding to each extracted parameter;
the anomaly detection module is used for inputting the feature vectors corresponding to the parameters into a pre-constructed Isolation Forest model so as to perform anomaly detection on the URL; the isolation forest model is constructed according to the model training method.
A model training apparatus provided in an embodiment of the present specification includes one or more processors and a memory, where the memory stores a program and is configured to be executed by the one or more processors to perform the following steps:
acquiring a plurality of Uniform Resource Locators (URLs);
for each URL, extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
and constructing an isolation forest model according to the characteristic vectors corresponding to the parameters respectively, wherein the isolation forest model is used for detecting whether the URL is abnormal or not.
An apparatus for detecting a URL provided in an embodiment of the present specification includes one or more processors and a memory, where the memory stores a program and is configured to be executed by the one or more processors to perform the following steps:
acquiring a URL;
extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model so as to perform anomaly detection on the URL; the isolated forest model is constructed according to the model training method.
As can be seen from the technical solutions provided in the embodiments of the present specification, a plurality of URLs are obtained, parameters in each URL are determined, a feature vector corresponding to each parameter is obtained, and then an isolated forest model is constructed according to the feature vectors corresponding to each parameter. The isolated forest model may be used to detect whether a URL is abnormal. In general, the abnormal URL is often the URL sent by the hacker, and the server may refuse to parse the abnormal URL, so as to avoid being attacked by the hacker.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.
FIG. 1 is a flow chart of a model training method provided by an embodiment of the present disclosure;
FIGS. 2 a-c are schematic diagrams of normal point and abnormal point distributions provided in the embodiments of the present disclosure;
FIG. 3 is a flowchart of a method for detecting a URL according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a model training apparatus provided in an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an apparatus for detecting a URL according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a model training apparatus provided in an embodiment of the present disclosure;
fig. 7 is a schematic diagram of an apparatus for detecting a URL according to an embodiment of the present disclosure.
Detailed Description
The existing method for detecting the URL is to detect the URL by the server according to a safety rule set manually. On one hand, however, means for hackers to attack the network by using the URL vary widely, and manually established security rules are difficult to cover all the attacking means; on the other hand, manually established security rules often lag behind emerging means of attack.
Therefore, in one or more embodiments of the present specification, a plurality of URLs are obtained, parameters in the URLs are extracted, a feature vector corresponding to each parameter is determined, and an Isolation Forest model is constructed according to the feature vectors corresponding to the parameters. As is well known to those skilled in the art, the isolation forest model is an anomaly detection model and can be used to detect whether a URL is anomalous; an anomalous URL is often one sent by a hacker, and the server can refuse to parse it, thereby avoiding attacks by the hacker.
It should be noted that, an isolated forest model can be constructed according to the feature vectors corresponding to the parameters in the URLs, because in practice, the main means for hackers to attack the server by using the URLs is to add illegal fields in the parameters of the URLs. That is, there is a significant difference between the feature vector of the parameter in the normal URL and the feature vector of the parameter in the abnormal URL. The characteristics of parameters in an abnormal URL are often rare and clearly distinguished from the characteristics of parameters in a normal URL.
Based on this, the core idea of the technical scheme described in this specification is to use feature vectors of parameters in a plurality of known URLs as data samples to construct an isolated forest model. The isolated forest model can judge whether the URL is abnormal or not according to the characteristic vector of the parameter in the URL to be detected.
In order to make the technical solutions in the present specification better understood, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive step through the embodiments of the present description shall fall within the scope of protection of the present description.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a model training method provided in an embodiment of the present specification, including the following steps:
s100: several URLs are obtained.
In the embodiments of the present specification, the execution subject may be a server or other device having data processing capability, and the execution subject is a server as an example hereinafter.
It is well known that for a URL, the parameters in the URL may contain some information that the user (possibly a hacker) enters.
For example, "http:// server/path/documentname1 value1 and name2 value 2" is a typical structure of a URL, "? The "latter data is the parameter. More than one parameter may be contained in a URL, with different parameters typically separated by "&", each parameter having a parameter name and parameter values. The parameter values are typically entered by a user. In this example, the URL contains two parameters, "name 1 value 1" indicates that the parameter value of the parameter with the parameter name1 is value 1; "name 2 value 2" indicates that the parameter value of the parameter with the parameter name of 2 is value 2.
Hackers sometimes add an illegal field to the parameters of a URL to attack the server. For example, when a legitimate user logs in to the server, the normal URL sent is as follows:
http://server/path/document?name1=user1&name2=password1, in which the value of the first parameter is the user name "user1" and the value of the second parameter is the password "password1". The server parses the URL and, after the user name and password are verified, the user logs in to the server.
When a hacker wants to impersonate the user "user1" and log in to the server, the hacker can use an SQL injection attack and send the following abnormal URL to the server:
http://server/path/document?name1=user1&name2=or 1=1, in which the value of the first parameter is the user name "user1", while the value of the second parameter is not the password corresponding to that user name but the illegal field "or 1=1". Owing to the characteristics of SQL syntax, when the server cannot verify the user's password against this illegal field, the field may be parsed as executable code and executed by the server, so the hacker can log in to the account of user "user1" without the password and operate on that user's data.
In step S100, the URLs obtained by the server generally include both normal URLs and abnormal URLs. Since abnormal URLs are rare, they account for only a low proportion of the obtained URLs.
S102: for each URL, parameters in the URL are extracted.
In the embodiments of this specification, extracting the parameters in a URL may mean extracting both the parameter names and the parameter values contained in the URL, or extracting only the parameter values of the parameters in the URL.
In addition, the server may extract all parameters in each URL, or may extract some parameters in the URL.
In practical applications, some parameter names occur with low probability, and hackers rarely add illegal fields to the parameter values corresponding to such rarely occurring parameter names, so the server may choose not to extract the parameter values corresponding to parameter names with low occurrence probability.
Specifically, the server may determine, for each URL, a parameter whose parameter name satisfies a specified condition, among parameters contained in the URL; for each determined parameter, a parameter value for that parameter is extracted. Wherein the specified condition may be that the occurrence probability of the parameter name is greater than a specified probability value. Therefore, the parameters with lower occurrence probability can be filtered, and the burden of the server on processing data in the subsequent steps is reduced.
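As a rough sketch of this filtering step, the snippet below keeps only the parameter values whose parameter names occur with more than an assumed minimum frequency across the collected URLs; the threshold and helper names are illustrative assumptions, not values fixed by this specification.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qsl

def extract_frequent_parameter_values(urls, min_frequency=0.01):
    """Extract (name, value) pairs only for parameter names whose occurrence
    probability over all collected parameters exceeds min_frequency."""
    per_url = [parse_qsl(urlparse(u).query, keep_blank_values=True) for u in urls]
    counts = Counter(name for params in per_url for name, _ in params)
    total = sum(counts.values()) or 1
    frequent = {name for name, c in counts.items() if c / total > min_frequency}
    return [(name, value)
            for params in per_url
            for name, value in params
            if name in frequent]
```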
S104: and determining a feature vector corresponding to each extracted parameter.
In this embodiment of the present specification, for each extracted parameter, an N-dimensional feature vector corresponding to the parameter may be determined according to a parameter value of the parameter; n is a natural number greater than 0.
The dimension of the parameter corresponding to the feature vector may include at least one of a total number of characters, a total number of letters, a total number of numbers, a total number of specific symbols, a number of different characters, a number of different letters, a number of different numbers, and a number of different specific symbols included in a parameter value of the parameter.
Taking URL "http:// server/path/documentname1 ═ user1 ═ name2 ═ password1 as an example, the parameter value of the parameter name1 in the URL is user1, and the parameter value contains the total number of characters 5, the total number of letters 4, the total number of numbers 1, the total number of specific symbols 0, the number of different characters 5, the number of different letters 4, the number of different numbers 1, and the number of different specific symbols 0. Then, the feature vector corresponding to the parameter name1 may be (5, 4, 1, 0, 5, 4, 1, 0).
Further, the value of each dimension of the feature vector may be normalized. Continuing the above example, the 8 feature values corresponding to parameter name1 can be normalized according to the formula y = x / z, where x is a feature value, z is the total number of characters contained in the value of parameter name1, and y is the normalized value obtained from x. The feature vector of parameter name1 then becomes (5/5, 4/5, 1/5, 0/5, 5/5, 4/5, 1/5, 0/5), i.e., (1, 0.8, 0.2, 0, 1, 0.8, 0.2, 0).
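A minimal sketch of the feature extraction in step S104 and the normalization above is given below; the set of "specific symbols" is an assumption, since the specification does not enumerate it.

```python
# Assumed set of "specific symbols" (the specification does not enumerate them).
SPECIFIC_SYMBOLS = set("'\"<>=&%;()-_+")

def feature_vector(value, normalize=True):
    """8-dimensional feature vector of a parameter value: total characters,
    letters, digits and specific symbols, followed by the distinct counts
    of each of the four; optionally normalized by the total character count."""
    letters = [c for c in value if c.isalpha()]
    digits = [c for c in value if c.isdigit()]
    symbols = [c for c in value if c in SPECIFIC_SYMBOLS]
    vec = [len(value), len(letters), len(digits), len(symbols),
           len(set(value)), len(set(letters)), len(set(digits)), len(set(symbols))]
    if normalize and value:
        vec = [x / len(value) for x in vec]
    return vec

print(feature_vector("user1"))  # [1.0, 0.8, 0.2, 0.0, 1.0, 0.8, 0.2, 0.0]
```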
S106: and constructing an isolated forest model according to the characteristic vectors corresponding to the parameters respectively.
In the embodiments of this specification, the Isolation Forest algorithm is used to construct an isolation forest model from the feature vectors corresponding to the parameters, and the model is used to detect whether a URL is abnormal. The feature vectors corresponding to the parameters do not need to be labeled as normal or abnormal.
The idea of the isolation forest algorithm is briefly introduced here. Referring to fig. 2a, the 10 dots shown in fig. 2a include hollow dots and solid dots; the hollow dots are many (8 dots) and concentrated, while the solid dots are few (2 dots) and dispersed. The hollow dots may be regarded as normal points and the solid dots as abnormal points. That is, abnormal points are few and lie far from the rest. The following operations are then carried out:
Division 1: a randomly drawn line divides the points in fig. 2a into part A and part B, yielding fig. 2b.
Division 2: for part A, another randomly drawn line divides the points in part A into part C and part D; likewise, for part B, a randomly drawn line divides the points in part B into part E and part F, as in fig. 2c.
The division continues until the plane shown in fig. 2a is divided into 10 sections, each containing only 1 point, i.e. each point is divided into a dedicated section (if a section contains only one point, that section is the dedicated section of this point). Obviously, the solid dots are divided into dedicated sections more easily and more quickly; as shown in fig. 2c, the solid dot in the upper right corner has already been divided into its dedicated section (section F). In other words, the more easily a point ends up in a dedicated section, the more abnormal that point is.
Based on this idea, the isolation forest algorithm builds S classification trees (which may be binary trees). For each binary tree, the points shown in fig. 2a are placed at the root node, and the condition of each split from the root downward is random (i.e., each split corresponds to a randomly drawn line); the earlier a point falls into a leaf node of the binary tree, the more likely that point is to be abnormal.
Taking the isolation forest algorithm as an example, the construction of the isolation forest model from the feature vectors corresponding to the parameters in step S106 is briefly described below.
The isolation forest includes S binary trees (iTrees). For each iTree, the training process can be described as follows:
First, randomly select M feature vectors from all feature vectors and put them into the root node of the iTree;
Second, randomly select one of the N dimensions of the feature vectors (the designated dimension), and randomly select a cut value in that dimension; the cut value lies between the minimum and the maximum of the designated dimension over the M feature vectors;
Third, divide the M feature vectors into two parts according to the cut value: one part contains the feature vectors whose value in the designated dimension is not less than the cut value, and the other part contains those whose value is less than the cut value;
Fourth, recursively execute the second and third steps until the iTree reaches a specified height or every leaf node of the iTree holds only one feature vector. The specified height can be set as needed, typically log2(M).
Through the four steps, an iTree can be trained.
When training the next iTree, the M feature vectors in the first step may again be randomly selected from all feature vectors, or randomly selected from the feature vectors that have not yet been selected.
The above four steps are executed repeatedly to obtain S trained iTrees, which together form the isolation forest model.
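The four training steps can be sketched as follows; the node representation, the sample size M and the number of trees S are illustrative assumptions rather than values prescribed by this specification.

```python
import math
import random

def build_itree(vectors, height, max_height):
    """Recursively build one iTree (steps two to four)."""
    if height >= max_height or len(vectors) <= 1:
        return {"size": len(vectors)}                    # leaf node
    dim = random.randrange(len(vectors[0]))              # step two: designated dimension
    lo, hi = min(v[dim] for v in vectors), max(v[dim] for v in vectors)
    if lo == hi:                                         # dimension cannot separate the points
        return {"size": len(vectors)}
    cut = random.uniform(lo, hi)                         # step two: cut value in (min, max)
    left = [v for v in vectors if v[dim] < cut]          # step three: split by the cut value
    right = [v for v in vectors if v[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, height + 1, max_height),
            "right": build_itree(right, height + 1, max_height)}

def train_isolation_forest(all_vectors, num_trees=100, sample_size=256):
    """Step one repeated: train S iTrees, each on M randomly chosen feature vectors."""
    m = min(sample_size, len(all_vectors))
    max_height = math.ceil(math.log2(max(m, 2)))         # specified height, roughly log2(M)
    return [build_itree(random.sample(all_vectors, m), 0, max_height)
            for _ in range(num_trees)]
```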
Fig. 3 is a flowchart of a method for detecting a URL according to an embodiment of the present disclosure, including the following steps:
s300: and acquiring the URL.
S302: and extracting the parameters in the URL.
S304: and determining a feature vector corresponding to each extracted parameter.
S306: and inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model so as to perform anomaly detection on the URL.
The URL in fig. 3 is a URL to be detected. For the description of steps S300 to S304, refer to steps S100 to S104, and are not described again.
In step S306, the feature vectors corresponding to the parameters may be input into the isolated forest model to obtain model output results corresponding to the parameters, and whether there is an abnormal parameter in the parameters is determined according to the model output results corresponding to the parameters.
Further, for each parameter, the feature vector corresponding to the parameter is input into the isolation forest model, the feature vector is classified by each classification tree in the model, and the average height of the leaf nodes into which the feature vector falls across the classification trees is determined as the model output result corresponding to the parameter. Then, for each parameter, if the model output result corresponding to the parameter is smaller than a specified threshold, the parameter is determined to be abnormal; if it is not smaller than the specified threshold, the parameter is determined to be normal. When any parameter is determined to be abnormal, it is determined that an abnormal parameter exists among the parameters; when all parameters are determined to be normal, it is determined that no abnormal parameter exists.
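Continuing the training sketch above, the following snippet computes the average leaf-node height of a parameter's feature vector across the iTrees and flags the parameter as abnormal when that average is below a specified threshold; the threshold value here is an assumption.

```python
def leaf_height(tree, vector, height=0):
    """Height of the leaf node into which the feature vector falls in one iTree."""
    if "size" in tree:                       # reached a leaf
        return height
    branch = tree["left"] if vector[tree["dim"]] < tree["cut"] else tree["right"]
    return leaf_height(branch, vector, height + 1)

def detect_abnormal_parameters(forest, param_vectors, threshold=4.0):
    """Return the names of parameters whose average leaf height is below the threshold."""
    abnormal = []
    for name, vector in param_vectors:
        avg = sum(leaf_height(tree, vector) for tree in forest) / len(forest)
        if avg < threshold:                  # isolated quickly -> likely abnormal
            abnormal.append(name)
    return abnormal
```

If the returned list is non-empty, the URL is judged abnormal; otherwise it is judged normal, matching the decision rule described above.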
Through the methods shown in fig. 1 and fig. 3, an isolation forest model is constructed from the feature vectors of the parameters in URLs, so that the server can detect a received URL through the isolation forest model; if the received URL is determined to be abnormal, the server can refuse to parse it, thereby avoiding hacker attacks and improving network security.
In addition, through the embodiments of this specification, potential network attack techniques can be discovered. Specifically, whether a URL is abnormal can be determined through the isolation forest model; if the URL is abnormal, some parameter value in it is abnormal, and this abnormal parameter value can prompt staff to analyze the attack technique adopted by the hacker and to improve the security rules accordingly.
Based on the model training method shown in fig. 1, an embodiment of the present specification further provides a model training apparatus, as shown in fig. 4, including:
an obtaining module 401, which obtains a plurality of Uniform Resource Locators (URLs);
an extracting module 402, for each URL, extracting parameters in the URL;
a determining module 403, configured to determine, for each extracted parameter, a feature vector corresponding to the parameter;
and the processing module 404 is configured to construct an isolation forest model according to the feature vectors corresponding to the parameters, wherein the isolation forest model is used for detecting whether the URL is abnormal.
The extraction module is used for determining parameters of which parameter names meet specified conditions in the parameters contained in each URL; for each determined parameter, a parameter value for that parameter is extracted.
The determining module is used for determining an N-dimensional feature vector corresponding to each extracted parameter according to the parameter value of the parameter; n is a natural number greater than 0.
The dimensions of the N-dimensional feature vector specifically include: the parameter value of the parameter contains at least one of a total number of characters, a total number of letters, a total number of numerals, a total number of symbols, a number of different characters, a number of different letters, a number of different numerals, and a number of different symbols.
Based on the method for detecting a URL shown in fig. 3, an embodiment of the present specification further provides an apparatus for detecting a URL, as shown in fig. 5, including:
the obtaining module 501 obtains a URL;
an extracting module 502, which extracts the parameters in the URL;
a determining module 503, configured to determine, for each extracted parameter, a feature vector corresponding to the parameter;
the anomaly detection module is used for inputting the feature vectors corresponding to the parameters into a pre-constructed Isolation Forest model so as to perform anomaly detection on the URL; the isolation forest model is constructed according to the model training method.
The anomaly detection module inputs the feature vectors corresponding to the parameters into the pre-constructed isolation forest model to obtain model output results corresponding to the parameters; judges, according to the model output results corresponding to the parameters, whether an abnormal parameter exists among the parameters; if yes, determines that the URL is abnormal; otherwise, determines that the URL is normal.
The anomaly detection module inputs the feature vector corresponding to each parameter into a pre-constructed isolated forest model, classifies the feature vector corresponding to the parameter through each classification tree in the isolated forest model, and determines the average height of leaf nodes, into which the feature vector corresponding to the parameter falls, in each classification tree as a model output result corresponding to the parameter; and for each parameter, if the model output result corresponding to the parameter is smaller than a specified threshold value, determining that the parameter is abnormal, and if the model output result corresponding to the parameter is not smaller than the specified threshold value, determining that the parameter is normal.
Based on the model training method shown in fig. 1, the present specification further provides a model training apparatus, as shown in fig. 6, including one or more processors and a memory, where the memory stores a program and is configured to be executed by the one or more processors to perform the following steps:
acquiring a plurality of Uniform Resource Locators (URLs);
for each URL, extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
and constructing an Isolation Forest model according to the characteristic vectors corresponding to the parameters respectively, wherein the Isolation Forest model is used for detecting whether the URL is abnormal or not.
Based on the method for detecting a URL shown in fig. 3, the present specification embodiment further provides an apparatus for detecting a URL, as shown in fig. 7, including one or more processors and a memory, where the memory stores a program and is configured to be executed by the one or more processors to perform the following steps:
acquiring a URL;
extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
inputting the characteristic vectors corresponding to the parameters into a pre-constructed Isolation Forest model to perform anomaly detection on the URL; the isolated forest model is constructed according to the model training method.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatuses shown in fig. 6 and 7, since they are substantially similar to the method embodiments, the description is simple, and in relation to the description, reference may be made to part of the description of the method embodiments.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology advances, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development, and the source code to be compiled is written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. The memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as structures within the hardware component. Or even the means for performing the functions may be regarded as being both software modules for performing the method and structures within a hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (14)

1. A model training method, comprising:
acquiring a plurality of Uniform Resource Locators (URLs);
for each URL, extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter; the dimensionality of the feature vector corresponding to the parameter comprises at least one of the total number of characters, the total number of letters, the total number of numbers, the total number of specific symbols, the number of different characters, the number of different letters, the number of different numbers and the number of different specific symbols contained in the parameter value of the parameter;
and constructing an Isolation Forest model according to the characteristic vectors corresponding to the parameters respectively, wherein the Isolation Forest model is used for detecting whether the URL is abnormal or not.
2. The method according to claim 1, wherein for each URL, extracting parameters from the URL includes:
for each URL, determining parameters of which parameter names meet specified conditions in the parameters contained in the URL;
for each determined parameter, a parameter value for that parameter is extracted.
3. The method according to claim 2, wherein for each extracted parameter, determining a feature vector corresponding to the parameter specifically includes:
aiming at each extracted parameter, determining an N-dimensional feature vector corresponding to the parameter according to the parameter value of the parameter; n is a natural number greater than 0.
4. A method of detecting a URL, comprising:
acquiring a URL;
extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model so as to perform anomaly detection on the URL; the isolated forest model is constructed according to the method of any one of claims 1-3.
5. The method according to claim 4, wherein the feature vectors corresponding to the parameters are input into a pre-constructed isolated forest model to detect the abnormality of the URL, and specifically comprises:
inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model to obtain model output results corresponding to the parameters;
judging whether abnormal parameters exist in the parameters according to the model output results corresponding to the parameters respectively;
if yes, determining that the URL is abnormal;
otherwise, determining that the URL is normal.
6. The method according to claim 5, wherein the feature vectors corresponding to the parameters are input into a pre-constructed isolated forest model to obtain model output results corresponding to the parameters, and the method specifically comprises the following steps:
inputting the feature vector corresponding to each parameter into a pre-constructed isolated forest model, classifying the feature vector corresponding to the parameter through each classification tree in the isolated forest model, determining the average height of leaf nodes of the feature vector corresponding to the parameter falling into each classification tree, and taking the average height as a model output result corresponding to the parameter;
judging whether abnormal parameters exist in the parameters according to the model output results corresponding to the parameters respectively, and specifically comprising the following steps:
and for each parameter, if the model output result corresponding to the parameter is smaller than a specified threshold value, determining that the parameter is abnormal, and if the model output result corresponding to the parameter is not smaller than the specified threshold value, determining that the parameter is normal.
7. A model training apparatus comprising:
the acquisition module acquires a plurality of Uniform Resource Locators (URLs);
the extraction module is used for extracting parameters in the URL aiming at each URL;
the determining module is used for determining a feature vector corresponding to each extracted parameter; the dimensionality of the feature vector corresponding to the parameter comprises at least one of the total number of characters, the total number of letters, the total number of numbers, the total number of specific symbols, the number of different characters, the number of different letters, the number of different numbers and the number of different specific symbols contained in the parameter value of the parameter;
and the processing module is used for constructing an isolation forest model according to the characteristic vectors corresponding to the parameters respectively, and the isolation forest model is used for detecting whether the URL is abnormal or not.
8. The apparatus according to claim 7, wherein the extraction module determines, for each URL, a parameter whose parameter name satisfies a specified condition among parameters included in the URL; for each determined parameter, a parameter value for that parameter is extracted.
9. The device of claim 8, wherein the determining module determines, for each extracted parameter, an N-dimensional feature vector corresponding to the parameter according to a parameter value of the parameter; n is a natural number greater than 0.
10. An apparatus to detect a URL, comprising:
the acquisition module acquires the URL;
the extraction module extracts the parameters in the URL;
the determining module is used for determining a feature vector corresponding to each extracted parameter;
the anomaly detection module is used for inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model so as to carry out anomaly detection on the URL; the isolated forest model is constructed according to the method of any one of claims 1-3.
11. The device according to claim 10, wherein the anomaly detection module inputs the eigenvectors corresponding to the parameters respectively into a pre-constructed isolated forest model to obtain model output results corresponding to the parameters respectively; judging whether abnormal parameters exist in the parameters according to the model output results corresponding to the parameters respectively; if yes, determining that the URL is abnormal; otherwise, determining that the URL is normal.
12. The apparatus according to claim 11, wherein the anomaly detection module inputs, for each parameter, a feature vector corresponding to the parameter into a pre-constructed isolated forest model, so as to classify the feature vector corresponding to the parameter through each classification tree in the isolated forest model, and determine an average height of leaf nodes into which the feature vector corresponding to the parameter falls in each classification tree as a model output result corresponding to the parameter; and for each parameter, if the model output result corresponding to the parameter is smaller than a specified threshold value, determining that the parameter is abnormal, and if the model output result corresponding to the parameter is not smaller than the specified threshold value, determining that the parameter is normal.
13. A model training apparatus comprising one or more processors and memory, the memory storing a program and configured to perform the following steps by the one or more processors:
acquiring a plurality of Uniform Resource Locators (URLs);
for each URL, extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter; the dimensionality of the feature vector corresponding to the parameter comprises at least one of the total number of characters, the total number of letters, the total number of numbers, the total number of specific symbols, the number of different characters, the number of different letters, the number of different numbers and the number of different specific symbols contained in the parameter value of the parameter;
and constructing an isolation forest model according to the characteristic vectors corresponding to the parameters respectively, wherein the isolation forest model is used for detecting whether the URL is abnormal or not.
14. An apparatus for detecting a URL comprising one or more processors and memory, the memory storing a program and configured to perform the following steps by the one or more processors:
acquiring a URL;
extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
inputting the characteristic vectors corresponding to the parameters into a pre-constructed isolated forest model so as to perform anomaly detection on the URL; the isolated forest model is constructed according to the method of any one of claims 1-4.
CN202011120753.8A 2017-10-24 2017-10-24 Model training method, URL detection method and device Pending CN112182578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011120753.8A CN112182578A (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011120753.8A CN112182578A (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device
CN201710998117.7A CN107992741B (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710998117.7A Division CN107992741B (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device

Publications (1)

Publication Number Publication Date
CN112182578A true CN112182578A (en) 2021-01-05

Family

ID=62030610

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011120753.8A Pending CN112182578A (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device
CN201710998117.7A Active CN107992741B (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710998117.7A Active CN107992741B (en) 2017-10-24 2017-10-24 Model training method, URL detection method and device

Country Status (3)

Country Link
CN (2) CN112182578A (en)
TW (1) TWI696090B (en)
WO (1) WO2019080660A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182578A (en) * 2017-10-24 2021-01-05 创新先进技术有限公司 Model training method, URL detection method and device
CN108229156A (en) 2017-12-28 2018-06-29 阿里巴巴集团控股有限公司 URL attack detection methods, device and electronic equipment
CN110086749A (en) * 2018-01-25 2019-08-02 阿里巴巴集团控股有限公司 Data processing method and device
CN108366071B (en) 2018-03-06 2020-06-23 阿里巴巴集团控股有限公司 URL (Uniform resource locator) abnormity positioning method and device, server and storage medium
CN108984376B (en) * 2018-05-31 2021-11-19 创新先进技术有限公司 System anomaly detection method, device and equipment
CN108777873B (en) * 2018-06-04 2021-03-02 江南大学 Wireless sensor network abnormal data detection method based on weighted mixed isolated forest
CN110032881B (en) * 2018-12-28 2023-09-22 创新先进技术有限公司 Data processing method, device, equipment and medium
CN109815566A (en) * 2019-01-09 2019-05-28 同济大学 A kind of method for detecting abnormality of the go AI chess manual file of SGF format
CN110399268B (en) * 2019-07-26 2023-09-26 创新先进技术有限公司 Abnormal data detection method, device and equipment
CN110958222A (en) * 2019-10-31 2020-04-03 苏州浪潮智能科技有限公司 Server log anomaly detection method and system based on isolated forest algorithm
CN113065610B (en) * 2019-12-12 2022-05-17 支付宝(杭州)信息技术有限公司 Isolated forest model construction and prediction method and device based on federal learning
CN114095391B (en) * 2021-11-12 2024-01-12 上海斗象信息科技有限公司 Data detection method, baseline model construction method and electronic equipment
CN116776135B (en) * 2023-08-24 2023-12-19 之江实验室 Physical field data prediction method and device based on neural network model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544210A (en) * 2013-09-02 2014-01-29 烟台中科网络技术研究所 System and method for identifying webpage types
CN104077396A (en) * 2014-07-01 2014-10-01 清华大学深圳研究生院 Method and device for detecting phishing website
CN106131071A (en) * 2016-08-26 2016-11-16 北京奇虎科技有限公司 A kind of Web method for detecting abnormality and device
CN106846806A (en) * 2017-03-07 2017-06-13 北京工业大学 Urban highway traffic method for detecting abnormality based on Isolation Forest
WO2017107965A1 (en) * 2015-12-25 2017-06-29 北京奇虎科技有限公司 Web anomaly detection method and apparatus
CN107196953A (en) * 2017-06-14 2017-09-22 上海丁牛信息科技有限公司 A kind of anomaly detection method based on user behavior analysis

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7082426B2 (en) * 1993-06-18 2006-07-25 Cnet Networks, Inc. Content aggregation method and apparatus for an on-line product catalog
US9412024B2 (en) * 2013-09-13 2016-08-09 Interra Systems, Inc. Visual descriptors based video quality assessment using outlier model
JP6276417B2 (en) * 2013-11-04 2018-02-07 イルミオ, インコーポレイテッドIllumio,Inc. Automatic generation of label-based access control rules
CN104899508B (en) * 2015-06-17 2018-12-07 中国互联网络信息中心 A kind of multistage detection method for phishing site and system
KR20170108330A (en) * 2016-03-17 2017-09-27 한국전자통신연구원 Apparatus and method for detecting malware code
CN105956472B (en) * 2016-05-12 2019-10-18 宝利九章(北京)数据技术有限公司 Identify webpage in whether include hostile content method and system
CN106960040B (en) * 2017-03-27 2019-09-17 北京神州绿盟信息安全科技股份有限公司 A kind of classification of URL determines method and device
CN112182578A (en) * 2017-10-24 2021-01-05 创新先进技术有限公司 Model training method, URL detection method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544210A (en) * 2013-09-02 2014-01-29 烟台中科网络技术研究所 System and method for identifying webpage types
CN104077396A (en) * 2014-07-01 2014-10-01 清华大学深圳研究生院 Method and device for detecting phishing website
WO2017107965A1 (en) * 2015-12-25 2017-06-29 北京奇虎科技有限公司 Web anomaly detection method and apparatus
CN106131071A (en) * 2016-08-26 2016-11-16 北京奇虎科技有限公司 A kind of Web method for detecting abnormality and device
CN106846806A (en) * 2017-03-07 2017-06-13 北京工业大学 Urban highway traffic method for detecting abnormality based on Isolation Forest
CN107196953A (en) * 2017-06-14 2017-09-22 上海丁牛信息科技有限公司 A kind of anomaly detection method based on user behavior analysis

Also Published As

Publication number Publication date
CN107992741B (en) 2020-08-28
TW201917618A (en) 2019-05-01
TWI696090B (en) 2020-06-11
WO2019080660A1 (en) 2019-05-02
CN107992741A (en) 2018-05-04

Similar Documents

Publication Publication Date Title
CN107992741B (en) Model training method, URL detection method and device
CN106055574B (en) Method and device for identifying illegal uniform resource identifier (URL)
CN109246064B (en) Method, device and equipment for generating security access control and network access rule
KR101874373B1 (en) A method and apparatus for detecting malicious scripts of obfuscated scripts
CN108924118B (en) Method and system for detecting database collision behavior
WO2021027831A1 (en) Malicious file detection method and apparatus, electronic device and storage medium
CN111159697B (en) Key detection method and device and electronic equipment
CN111538869A (en) Method, device and equipment for detecting transaction abnormal group
Brown et al. Detection of mobile malware: an artificial immunity approach
CN114024761B (en) Network threat data detection method and device, storage medium and electronic equipment
US9646157B1 (en) Systems and methods for identifying repackaged files
CN111414621B (en) Malicious webpage file identification method and device
CN112464297A (en) Hardware Trojan horse detection method and device and storage medium
CN107562703B (en) Dictionary tree reconstruction method and system
CN113971284A (en) JavaScript-based malicious webpage detection method and device and computer-readable storage medium
CN109726554B (en) Malicious program detection method and device
CN108334775B (en) Method and device for detecting jail-crossing plug-in
CN110502902A (en) A kind of vulnerability classification method, device and equipment
US9235639B2 (en) Filter regular expression
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN114491528A (en) Malicious software detection method, device and equipment
JP2019175334A (en) Information processing device, control method, and program
CN112491816A (en) Service data processing method and device
CN113111346A (en) Multi-engine WebShell script file detection method and system
CN105279434A (en) Naming method and device of malicious program sample family

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043463

Country of ref document: HK