CN111556042B - Malicious URL detection method and device, computer equipment and storage medium - Google Patents

Malicious URL detection method and device, computer equipment and storage medium

Info

Publication number
CN111556042B
Authority
CN
China
Prior art keywords
url
samples
detected
malicious
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010325183.XA
Other languages
Chinese (zh)
Other versions
CN111556042A (en
Inventor
张宁波
范渊
刘博�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd filed Critical DBAPPSecurity Co Ltd
Priority to CN202010325183.XA priority Critical patent/CN111556042B/en
Publication of CN111556042A publication Critical patent/CN111556042A/en
Application granted granted Critical
Publication of CN111556042B publication Critical patent/CN111556042B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/02Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227Filtering policies
    • H04L63/0236Filtering by address, protocol, port number or service, e.g. IP-address or URL
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a malicious URL detection method and device, computer equipment, and a storage medium. The detection method comprises: obtaining a URL to be detected; extracting feature values of a plurality of features of the URL to be detected; selecting, according to the feature values of the plurality of features, at least one URL sample from a URL sample library in descending order of feature similarity; and counting the number of URL samples of each URL type among the at least one URL sample, and taking the URL type with the largest number of URL samples as the URL type of the URL to be detected. The method and device solve the problem in the related art that malicious URLs outside the sample library and the feature library cannot be identified: malicious URL types that are not in the intelligence base are judged through machine classification learning, which reduces the rate of missed malicious URLs.

Description

Malicious URL detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of network information security technologies, and in particular, to a method and an apparatus for detecting a malicious URL, a computer device, and a storage medium.
Background
At present, there are two main approaches to malicious URL detection. One is collision detection against an intelligence base; the other is rule-based judgment using attack keywords.
In collision detection against an intelligence base, the URL to be detected is extracted and compared against a known threat intelligence base to judge whether it is malicious or normal. The advantage of this method is that detection is accurate and false alarms do not occur. The disadvantage is that existing threat intelligence bases are limited in size and cannot be updated in real time, so a new malicious URL that is not yet in the intelligence base goes unreported.
Moreover, both detection methods are static matching against known intelligence or feature values, so neither can identify malicious URLs outside the sample library and the feature library.
At present, no effective solution has been proposed for the problem in the related art that malicious URLs outside the sample library and the feature library cannot be identified.
Disclosure of Invention
The embodiments of the present application provide a method and a device for detecting a malicious URL, computer equipment, and a storage medium, so as to at least solve the problem in the related art that malicious URLs outside the sample library and the feature library cannot be identified.
In a first aspect, an embodiment of the present application provides a method for detecting a malicious URL, including:
acquiring a URL to be detected;
extracting feature values of a plurality of features of the URL to be detected;
selecting, according to the feature values of the plurality of features, at least one URL sample from a URL sample library in descending order of feature similarity, wherein the URL sample library comprises URL samples of a plurality of URL types, and the URL types comprise normal URL types and malicious URL types;
and counting the number of URL samples of each URL type among the at least one URL sample, and taking the URL type with the largest number of URL samples as the URL type of the URL to be detected.
In some of these embodiments, the feature similarity is measured by Euclidean distance; matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features comprises:
calculating Euclidean distances between the URL to be detected and the URL samples in the URL sample library in a multidimensional vector space with the plurality of features as dimensions;
and taking the URL samples whose Euclidean distance from the URL to be detected is smaller than a preset distance as the at least one URL sample.
In some of these embodiments, the feature similarity is measured by Euclidean distance; matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features comprises:
calculating Euclidean distances between the URL to be detected and the URL samples in the URL sample library in a multidimensional vector space with the plurality of features as dimensions;
and selecting a preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance.
In some embodiments, in a case where at least two URL types are tied for the largest number of URL samples, taking the URL type with the largest number of URL samples as the URL type of the URL to be detected includes:
in a multidimensional vector space with the plurality of features as dimensions, calculating, for each of the at least two URL types, the average Euclidean distance between the URL samples of that type and the URL to be detected, to obtain at least two average Euclidean distances;
and taking the URL type corresponding to the smallest of the at least two average Euclidean distances as the URL type of the URL to be detected.
In some embodiments, taking the URL samples whose Euclidean distance from the URL to be detected is smaller than a preset distance as the at least one URL sample includes:
increasing the preset distance in a case where at least two URL types are tied for the largest number of URL samples;
and taking the URL samples whose Euclidean distance from the URL to be detected is smaller than the increased preset distance as the at least one URL sample.
In some embodiments, selecting a preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance includes:
increasing the preset number in a case where at least two URL types are tied for the largest number of URL samples;
and selecting the increased preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance.
In some embodiments, matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features comprises:
performing collision comparison of the URL to be detected against a malicious URL intelligence library and a malicious URL keyword library, respectively;
in a case where the collision comparison with the malicious URL intelligence library or the malicious URL keyword library succeeds, determining that the URL to be detected is a malicious URL, and determining that its malicious URL type is the malicious URL type obtained through the comparison;
and in a case where the collision comparison with both the malicious URL intelligence library and the malicious URL keyword library fails, matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features.
In a second aspect, an embodiment of the present application provides an apparatus for detecting a malicious URL, including:
the acquisition module is used for acquiring the URL to be detected;
the extraction module is used for extracting the characteristic values of the characteristics of the URL to be detected;
the matching module is used for selecting at least one URL sample from a URL sample library according to the characteristic values of the characteristics and the sequence of characteristic similarity from large to small, wherein the URL sample library comprises URL samples of various URL types, and the various URL types comprise types of normal URLs and types of malicious URLs;
and the processing module is used for counting the URL sample number of each URL type in the at least one URL sample and taking the URL type with the largest URL sample number as the URL type of the URL to be detected.
In some of these embodiments, the matching module comprises:
the first calculation unit is used for calculating Euclidean distances between the URL to be detected and URL samples in the URL sample library in a multidimensional vector space with the plurality of features as dimensions;
and the first processing unit is coupled with the first calculation unit and is used for taking the URL samples whose Euclidean distance from the URL to be detected is smaller than the preset distance as the at least one URL sample.
In some of these embodiments, the matching module further comprises:
the second calculation unit is used for calculating the Euclidean distance between the URL to be detected and the URL sample in the URL sample library in the multidimensional vector space with the plurality of features as dimensions;
and the second processing unit is coupled with the second calculating unit and used for selecting a preset number of URL samples from the URL sample library as the at least one URL sample according to the sequence of the Euclidean distance from small to large.
In some of these embodiments, the processing module comprises:
a third calculating unit, configured to, when the URL types with the largest number of URL samples include at least two URL types, respectively calculate average euclidean distances between URL samples of the at least two URL types and the URL to be detected in a multidimensional vector space with the multiple features as dimensions, to obtain at least two average euclidean distances;
and the first confirmation component is coupled and connected with the third calculation unit and is used for taking the URL type corresponding to the minimum average Euclidean distance in the at least two average Euclidean distances as the URL type of the URL to be detected.
In some of these embodiments, the first processing unit comprises:
a first processing component that increases the preset distance in a case where the URL type having the largest number of URL samples includes at least two URL types;
and the second confirmation component is coupled with the first processing component and is used for taking the URL samples whose Euclidean distance from the URL to be detected is smaller than the increased preset distance as the at least one URL sample.
In some of these embodiments, the second processing unit further comprises:
a second processing component for increasing the preset number in case that the URL type having the largest number of URL samples includes at least two URL types;
and the third confirmation component is coupled with the second processing component and is used for selecting the increased preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance.
In some of these embodiments, the matching module further comprises:
the comparison unit is used for performing collision comparison of the URL to be detected against a malicious URL intelligence library and a malicious URL keyword library, respectively;
the first confirmation unit is coupled with the comparison unit and is used for, in a case where the collision comparison with the malicious URL intelligence library or the malicious URL keyword library succeeds, determining that the URL to be detected is a malicious URL and that its malicious URL type is the malicious URL type obtained through the comparison;
and the second confirmation unit is coupled with the comparison unit and is used for, in a case where the collision comparison with both the malicious URL intelligence library and the malicious URL keyword library fails, matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for detecting a malicious URL according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for detecting a malicious URL according to the first aspect.
Compared with the related art, the malicious URL detection method and device, computer equipment, and storage medium provided by the embodiments of the present application acquire a URL to be detected; extract feature values of a plurality of features of the URL to be detected; match at least one URL sample from a URL sample library by feature similarity according to the feature values of the plurality of features; and count the number of URL samples of each URL type among the at least one URL sample, taking the URL type with the largest number of URL samples as the URL type of the URL to be detected. This solves the problem in the related art that malicious URLs outside the sample library and the feature library cannot be identified: malicious URL types that are not in the intelligence base are judged through machine classification learning, and the rate of missed malicious URLs is reduced.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 is a flowchart of a method of detecting malicious URLs according to an embodiment of the present application;
Fig. 2 is a detailed flowchart of malicious URL detection according to an embodiment of the present application;
Fig. 3 is a graph of the distribution of URLs of various types in a URL sample library during similarity measurement in an embodiment of the present application;
Fig. 4 is a block diagram of a detection apparatus of a malicious URL according to an embodiment of the present application;
Fig. 5 is an internal structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents used in describing the application (including in the singular) are not limiting in number and may indicate either the singular or the plural. The terms "including," "comprising," "having," and any variations thereof, as used in the present application, are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.
Various techniques described herein may be used for URL detection in the field of information security. The URL indicates the location of the resource and the protocol used to access it. Each file on the internet has a unique URL that contains information indicating the location of the file and how the browser should handle it.
The URL contains the protocol used to access the resource, the location of the server (as an IP address or a domain name), the port number on the server (optional), the location of the resource in the server's directory structure, and a fragment identifier (optional). URL stands for Uniform Resource Locator; a URL is also known as a web address.
A URL is a type of Uniform Resource Identifier (URI).
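As a brief illustration of the URL components listed above, the following sketch splits a URL with Python's standard `urllib.parse` module. The sample URL is made up for illustration and does not come from the application.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL used only for illustration.
url = "https://www.example.com:8443/docs/api/index.html?id=42&lang=en#intro"

parts = urlsplit(url)
components = {
    "protocol": parts.scheme,        # protocol used to access the resource
    "host": parts.hostname,          # server location (domain name or IP address)
    "port": parts.port,              # optional port number
    "path": parts.path,              # location in the server directory structure
    "query": parse_qs(parts.query),  # parameters, as name -> list of values
    "fragment": parts.fragment,      # optional fragment identifier
}
print(components)
```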
Before describing and explaining embodiments of the present application, a description will be given of the related art used in the present application as follows:
In information security, data analysis, and data mining, it is often necessary to know how much individuals (samples) differ from one another in order to evaluate their similarity and category. The most common cases are correlation analysis in data analysis and classification and clustering algorithms in data mining, such as K-Nearest Neighbors (KNN) and K-Means. Measuring the difference between individuals mainly means measuring their similarity. Most commonly, a distance measure is used to measure how far apart individuals are in space; a greater distance indicates a greater difference between individuals. A common distance metric is the Euclidean distance, which represents the absolute difference between the numerical features of individuals, so it is mostly used in analyses that need to reflect differences in the numerical values of each dimension, such as using user behavior indexes to analyze the similarity or difference of user value. In the Euclidean distance calculation, the positions of sample X and sample Y in the vector space are represented as X = (x_1, x_2, x_3, …, x_n) and Y = (y_1, y_2, y_3, …, y_n). The distance between the two samples X and Y in the vector space is calculated by the following Euclidean distance formula:
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Since the Euclidean distance is computed from the absolute value of each dimension's feature, the Euclidean metric requires the indexes of all dimensions to be on the same scale level.
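A minimal sketch of the distance calculation and of one possible way to bring all dimensions to the same scale is given below. The application does not say which scaling method is used; min-max scaling is an assumption chosen here for illustration, and the function names are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def min_max_scale(vectors):
    """Rescale every dimension to [0, 1] (assumed scaling method).

    A constant dimension is mapped to 0.0 to avoid division by zero.
    """
    lows = [min(column) for column in zip(*vectors)]
    highs = [max(column) for column in zip(*vectors)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vector, lows, highs)]
        for vector in vectors
    ]

print(euclidean_distance([0, 0], [3, 4]))
print(min_max_scale([[0, 10], [5, 20], [10, 30]]))
```

Without rescaling, a feature with a large numeric range (for example, URL length) could dominate the distance over a feature with a small range (for example, path depth).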
It should be noted that, in the embodiments of the present application and the related art, the URL to be detected and the URL samples have features of the same dimensions and the same kinds of feature values. In the embodiments of the present application, the URL to be detected and the URL samples each have at least the following 9 feature dimensions: URL length, parameter name length, path depth, domain name length, parameter number, domain name number, number of digits in URI, number of letters in URI, and number of special characters in URI. Feature similarity (Euclidean distance) is calculated in the multidimensional vector space spanned by these 9 features, so as to match the category of the URL to be detected.
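The nine features named above could be extracted roughly as follows. The application lists only the feature names, so every definition in this sketch is an assumption: "parameter name length" is taken as the total length of all query parameter names, "domain name number" as the number of dot-separated labels in the host, and "URI" as the path plus the query string. The function name `extract_features` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def extract_features(url):
    """Return a 9-dimensional feature vector (all definitions are assumptions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Assumed meaning of "URI": the path plus the query string.
    uri = parts.path + ("?" + parts.query if parts.query else "")
    return [
        len(url),                                            # URL length
        sum(len(name) for name, _ in params),                # parameter name length
        len([seg for seg in parts.path.split("/") if seg]),  # path depth
        len(host),                                           # domain name length
        len(params),                                         # parameter number
        len([label for label in host.split(".") if label]),  # domain name number
        sum(ch.isdigit() for ch in uri),                     # digits in URI
        sum(ch.isalpha() for ch in uri),                     # letters in URI
        sum(not ch.isalnum() for ch in uri),                 # special characters in URI
    ]

print(extract_features("http://a.example.com/x/y.php?id=1&name=bob"))
```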
The embodiment provides a method for detecting a malicious URL. Fig. 1 is a flowchart of a method for detecting a malicious URL according to an embodiment of the present application, where as shown in fig. 1, the flowchart includes the following steps:
Step S101, acquiring the URL to be detected;
Step S102, extracting feature values of a plurality of features of the URL to be detected;
Step S103, selecting, according to the feature values of the plurality of features, at least one URL sample from a URL sample library in descending order of feature similarity, wherein the URL sample library comprises URL samples of a plurality of URL types, and the plurality of URL types comprise normal URL types and malicious URL types;
Step S104, counting the number of URL samples of each URL type among the at least one URL sample, and taking the URL type with the largest number of URL samples as the URL type of the URL to be detected.
Through steps S101 to S104, the URL to be detected is obtained; feature values of a plurality of features of the URL to be detected are extracted; at least one URL sample is selected from the URL sample library in descending order of feature similarity according to those feature values; and the number of URL samples of each URL type among the at least one URL sample is counted, with the URL type having the largest number of URL samples taken as the URL type of the URL to be detected. This solves the problem in the related art that malicious URLs outside the known malicious URL intelligence sample library and the malicious keyword library cannot be identified. The type of the URL to be detected is judged by measuring the feature similarity between the URL to be detected and the URL samples in the URL sample library, counting the URL samples that satisfy the feature similarity condition, and taking the type with the most samples. Malicious URL types that are not in the known malicious URL intelligence sample library and the malicious keyword library are thus judged through machine classification learning, which reduces the rate of missed malicious URLs.
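The flow of steps S103 and S104 can be sketched end to end as follows, assuming the feature vector of step S102 has already been computed. The sample library, its two-dimensional vectors, the type labels, and the parameter `k` are hypothetical values chosen to keep the example small; they are not taken from the application.

```python
import math
from collections import Counter

def classify(feature_vector, sample_library, k=3):
    """Return (predicted URL type, per-type counts) for one feature vector."""
    if not sample_library:
        return None, {}
    # S103: rank samples from most to least similar (smallest distance first).
    ranked = sorted(sample_library,
                    key=lambda sample: math.dist(feature_vector, sample[0]))
    # S104: count the selected samples per URL type and take the largest count.
    counts = Counter(url_type for _, url_type in ranked[:k])
    return counts.most_common(1)[0][0], dict(counts)

# Hypothetical library of (feature vector, URL type) pairs.
sample_library = [
    ([1.0, 1.0], "normal"),
    ([1.2, 0.9], "normal"),
    ([0.8, 1.1], "normal"),
    ([5.0, 5.0], "malicious"),
    ([5.1, 4.8], "malicious"),
]
print(classify([1.0, 1.0], sample_library, k=3))
```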
The embodiments of the present application are described and illustrated below by way of preferred embodiments.
In some of these embodiments, the feature similarity is measured by Euclidean distance; in step S103, selecting at least one URL sample from the URL sample library in descending order of feature similarity according to the feature values of the plurality of features is implemented by the following steps:
Step S103-4, calculating Euclidean distances between the URL to be detected and the URL samples in the URL sample library in a multidimensional vector space with the plurality of features as dimensions;
Step S103-5, taking the URL samples whose Euclidean distance from the URL to be detected is smaller than a preset distance as the at least one URL sample.
Through steps S103-4 to S103-5, Euclidean distances between the URL to be detected and the URL samples in the URL sample library are calculated in a multidimensional vector space with the plurality of features as dimensions, and the URL samples whose Euclidean distance from the URL to be detected is smaller than the preset distance are taken as the at least one URL sample. This solves the problem in the related art that the accuracy of the detection result cannot be guaranteed because the number of malicious URL samples is not considered.
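Steps S103-4 and S103-5 amount to a fixed-radius neighbor search. A minimal sketch is shown below; the library contents, the type labels, and the radius value are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def select_within_radius(query, library, preset_distance):
    """Return (distance, url_type) pairs for samples closer than preset_distance."""
    selected = []
    for vector, url_type in library:
        distance = math.dist(query, vector)  # S103-4: Euclidean distance
        if distance < preset_distance:       # S103-5: keep samples inside the radius
            selected.append((distance, url_type))
    return sorted(selected)

# Hypothetical two-dimensional library; type labels are illustrative only.
library = [
    ([0.0, 0.0], "normal"),
    ([1.0, 0.0], "sql_injection"),
    ([0.0, 2.0], "sql_injection"),
    ([5.0, 5.0], "xss"),
]
neighbors = select_within_radius([0.0, 1.0], library, preset_distance=1.5)
print([url_type for _, url_type in neighbors])
```

With a fixed radius the number of selected samples varies from query to query, and it can be zero when the URL to be detected lies far from every sample.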
In some of these embodiments, the feature similarity is measured by Euclidean distance; matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features in step S103 may also be implemented by the following steps:
Step S103-6, calculating Euclidean distances between the URL to be detected and the URL samples in the URL sample library in a multidimensional vector space with the plurality of features as dimensions;
Step S103-7, selecting a preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance.
Through steps S103-6 to S103-7, Euclidean distances between the URL to be detected and the URL samples in the URL sample library are calculated in a multidimensional vector space with the plurality of features as dimensions, and a preset number of URL samples are selected from the URL sample library as the at least one URL sample in ascending order of Euclidean distance. This further solves the problem in the related art that the accuracy of the detection result cannot be guaranteed because the number of malicious URL samples is not considered.
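Steps S103-6 and S103-7 correspond to selecting the k nearest samples. The sketch below uses `heapq.nsmallest` so that the whole library does not have to be sorted; this is an implementation choice made here, not something the application specifies. The library and labels are hypothetical.

```python
import heapq
import math

def select_k_nearest(query, library, preset_number):
    """Return the preset_number closest samples as (distance, url_type) pairs."""
    scored = ((math.dist(query, vector), url_type) for vector, url_type in library)
    # nsmallest returns the results in ascending order of distance.
    return heapq.nsmallest(preset_number, scored)

# Hypothetical two-dimensional library; type labels are illustrative only.
library = [
    ([0.0, 0.0], "normal"),
    ([1.0, 0.0], "sql_injection"),
    ([0.0, 2.0], "sql_injection"),
    ([5.0, 5.0], "xss"),
]
nearest = select_k_nearest([0.0, 0.5], library, preset_number=3)
print([url_type for _, url_type in nearest])
```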
In some embodiments, in a case where at least two URL types are tied for the largest number of URL samples, taking the URL type with the largest number of URL samples as the URL type of the URL to be detected in step S104 is implemented by the following steps:
Step S104-1, in a multidimensional vector space with the plurality of features as dimensions, calculating, for each of the at least two URL types, the average Euclidean distance between the URL samples of that type and the URL to be detected, to obtain at least two average Euclidean distances;
Step S104-2, taking the URL type corresponding to the smallest of the at least two average Euclidean distances as the URL type of the URL to be detected.
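The tie-breaking rule of steps S104-1 and S104-2 can be sketched as below. The input is assumed to be the already selected samples as `(distance, url_type)` pairs; the function name and the sample values are hypothetical.

```python
from collections import Counter, defaultdict

def classify_with_tie_break(neighbors):
    """Majority vote; ties go to the type with the smallest average distance."""
    if not neighbors:
        return None
    counts = Counter(url_type for _, url_type in neighbors)
    top_count = max(counts.values())
    tied = [url_type for url_type, n in counts.items() if n == top_count]
    if len(tied) == 1:
        return tied[0]
    # S104-1: average distance of the selected samples of each tied type.
    distances = defaultdict(list)
    for distance, url_type in neighbors:
        if url_type in tied:
            distances[url_type].append(distance)
    # S104-2: the tied type with the smallest average distance wins.
    return min(tied, key=lambda t: sum(distances[t]) / len(distances[t]))

tied_neighbors = [(0.5, "normal"), (2.0, "normal"),
                  (1.0, "malicious"), (1.2, "malicious")]
print(classify_with_tie_break(tied_neighbors))
```

In the example, both types have two samples; the average distance is 1.25 for "normal" and 1.1 for "malicious", so the tie is resolved in favor of "malicious".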
In some embodiments, taking the URL samples whose Euclidean distance from the URL to be detected is smaller than the preset distance as the at least one URL sample in step S103-5 includes the following steps:
Step S103-51, increasing the preset distance in a case where at least two URL types are tied for the largest number of URL samples;
Step S103-52, taking the URL samples whose Euclidean distance from the URL to be detected is smaller than the increased preset distance as the at least one URL sample.
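One way to read steps S103-51 and S103-52 is as a loop that widens the radius until the vote is no longer tied. The sketch below follows that reading. The increment `step`, the upper bound `max_distance`, and the choice to also widen the radius when no sample is selected are assumptions made for illustration; the application does not specify them.

```python
import math
from collections import Counter

def classify_growing_radius(query, library, preset_distance, step, max_distance):
    """Widen the radius until one URL type has strictly the most samples."""
    radius = preset_distance
    while radius <= max_distance:
        types = [t for vector, t in library if math.dist(query, vector) < radius]
        ranked = Counter(types).most_common(2)
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            return ranked[0][0], radius
        radius += step  # S103-51: increase the preset distance on a tie
    return None, radius  # no clear winner within the assumed upper bound

# Hypothetical library; the labels are illustrative only.
library = [([0.0, 0.0], "normal"), ([2.0, 0.0], "malicious"), ([0.0, 3.0], "malicious")]
print(classify_growing_radius([1.0, 0.0], library,
                              preset_distance=1.5, step=1.0, max_distance=5.0))
```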
In some embodiments, selecting a preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance in step S103-7 is implemented by the following steps:
Step S103-71, increasing the preset number in a case where at least two URL types are tied for the largest number of URL samples;
Step S103-72, selecting the increased preset number of URL samples from the URL sample library as the at least one URL sample in ascending order of Euclidean distance.
In some embodiments, selecting at least one URL sample from the URL sample library in descending order of feature similarity according to the feature values of the plurality of features in step S103 may further be implemented by the following steps:
step S103-1, performing collision comparison of the URL to be detected against a malicious URL information library and a malicious URL keyword library, respectively;
step S103-2, in the case that the collision comparison with the malicious URL information library or the malicious URL keyword library succeeds, determining that the URL to be detected is a malicious URL, and determining that the malicious URL type of the URL to be detected is the malicious URL type obtained by the comparison;
and step S103-3, in the case that the collision comparison with both the malicious URL information library and the malicious URL keyword library fails, matching at least one URL sample from the URL sample library by feature similarity according to the feature values of the plurality of features.
Through steps S103-1 to S103-3, the URL to be detected is either determined to be a malicious URL by collision comparison against the malicious URL information library and the malicious URL keyword library, or is passed on to the feature-similarity matching of step S103 until its URL type is determined. This solves the problem that computer resources are wasted when feature-similarity matching is performed on a URL whose type could already be identified from the known malicious URL information library or malicious URL keyword library.
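The pre-filtering of steps S103-1 to S103-3 may be sketched as below. The embodiment does not define "collision comparison" in detail, so this sketch assumes an exact lookup against the information library and a case-insensitive substring match against the keyword library; the library contents and the function name are hypothetical.

```python
def prefilter(url, info_library, keyword_library):
    """Return a malicious URL type if either library matches, else None.

    info_library: dict mapping known malicious URLs to their type.
    keyword_library: dict mapping a malicious type to a list of keywords.
    """
    if url in info_library:
        return info_library[url]
    lowered = url.lower()
    for url_type, keywords in keyword_library.items():
        if any(keyword in lowered for keyword in keywords):
            return url_type
    return None  # fall through to feature-similarity matching (step S103-3)

# Hypothetical library contents, for illustration only.
info_library = {"http://bad.example/x": "other_attack"}
keyword_library = {
    "sql_injection": ["union", "sqlmap"],
    "directory_traversal": ["../"],
}
hit = prefilter("http://a.example/item?id=1 UNION select", info_library, keyword_library)
miss = prefilter("http://a.example/index.html", info_library, keyword_library)
```

A plain substring test would flag many normal URLs if short keywords such as "and" or "or" were used as-is, so a practical matcher would probably need token boundaries or context; that refinement is outside what the embodiment describes.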
Fig. 2 is a detailed flowchart of malicious URL detection according to an embodiment of the present application, and fig. 3 is a distribution diagram of the various types of URLs in the URL sample library during similarity measurement in the embodiment of the present application. As shown in fig. 2 to fig. 3, the process of the embodiment of the present application is as follows:
In the embodiment of the present application, the URL to be detected and each URL sample have at least the following 9 feature dimensions. Length features: URL length, parameter name length, path depth, and domain name length. Quantity features: the number of parameters, the number of domain names, the number of digits in the URI, the number of letters in the URI, and the number of special characters in the URI.
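One possible reading of the 9 features is sketched below. The embodiment names the features but does not define them precisely, so the following are assumptions: parameter name length is the total length of all query parameter names, path depth is the number of non-empty path segments, the number of domain names is the number of dot-separated host labels, and the URI is the path plus the query string.

```python
from urllib.parse import urlsplit, parse_qsl

def extract_features(url):
    """Return a 9-tuple of length and quantity features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    params = parse_qsl(parts.query, keep_blank_values=True)
    uri = parts.path + ("?" + parts.query if parts.query else "")
    return (
        len(url),                                            # URL length
        sum(len(name) for name, _ in params),                # parameter name length
        len([seg for seg in parts.path.split("/") if seg]),  # path depth
        len(host),                                           # domain name length
        len(params),                                         # number of parameters
        len([label for label in host.split(".") if label]),  # number of domain names
        sum(ch.isdigit() for ch in uri),                     # digits in the URI
        sum(ch.isalpha() for ch in uri),                     # letters in the URI
        sum(not ch.isalnum() for ch in uri),                 # special characters in the URI
    )

features = extract_features("http://shop.example.com/a/b.php?id=12&name=x")
```

Because the features have very different ranges (URL length versus parameter count), a real implementation might scale them before computing distances; the embodiment does not mention scaling.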
The method is based on the K-nearest neighbor classification algorithm (KNN algorithm). A machine learning URL sample library is established from a known malicious URL information library and a normal URL database. Principal component parameters (the feature values of the plurality of features) are extracted from the URL to be detected, and the spatial distance between the URL to be detected and each URL sample in the URL sample library is calculated in the multidimensional vector space using the Euclidean distance formula. The N known samples nearest to the URL to be detected are taken, their types are counted, and the URL to be detected is assigned to the category that occurs most often among the N nearest samples, according to the majority voting rule.
A sample library is established from classified malicious URL sample data and normal URL sample data. The N known samples closest to the URL to be detected in the 9-dimensional vector space are obtained by calculation, the number of URLs of each type among the N known samples is counted, and the URL type with the most samples is assigned to the URL to be detected.
As shown in fig. 3, N = 10 is selected. Among the 10 known URLs closest to the URL to be detected in the multidimensional vector space, the most numerous type is the SQL injection attack URL, so the URL to be detected is determined to be an SQL injection attack malicious URL. It should be noted that the types of malicious URLs include: SQL injection attack URLs, XSS attack URLs, sensitive file attack URLs, directory traversal attack URLs, and other attack URLs. Specifically, the keywords of the various types of malicious URLs are as follows:
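For reference, the Euclidean distance and the majority vote described above can be written in standard notation as follows. The symbols are introduced here for illustration only: x is the feature vector of the URL to be detected, s is that of a URL sample, n is the number of features (9 in this embodiment), and \mathcal{N}_N(x) is the set of the N nearest samples, with c_j the type of sample j.

```latex
d(x, s) = \sqrt{\sum_{i=1}^{n} (x_i - s_i)^2}, \qquad n = 9

\hat{c}(x) = \arg\max_{c} \sum_{j \in \mathcal{N}_N(x)} \mathbf{1}\left[ c_j = c \right]
```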
SQL injection attack: and, or, xp_, substr, utl, benchmark, shutdown, hex, sqlmap, md5, union, drop, delete, concat, orderby, exec;
sensitive file attack: access_log, text/play, phpinfo, proc/self/cmdline, /fcckeditor/, web.xml;
directory traversal attack: ../, ..\\;
other attacks: base64, wget, curl, redict, upload, ping, shal, java.
In order to ensure the accuracy of the detection result, the following measures are further taken in the process of detecting the URL to be detected:
a sufficiently large sample space is required, and the number of samples of each type is kept balanced, so that no single type has so many samples that it dominates and distorts the detection result. In the URL detection process of this embodiment, 100,000 samples are selected for each malicious URL type and for normal URLs, giving 600,000 samples in total;
N = 10 is taken for the first detection, that is, the 10 URLs nearest to the URL to be detected in the multidimensional vector space are taken for counting, and the most numerous URL type is assigned to the URL to be detected;
when 2 or more types tie for the largest count, the average distance between the known samples of each tied type and the URL to be detected is calculated, and the type with the smallest average distance is assigned to the URL to be detected;
when 2 or more types tie for the largest count and their average distances to the URL to be detected are also equal, the value of N is increased by 5 and the above steps are repeated;
given the detection characteristics of the KNN algorithm, and in order to ensure the accuracy of the statistical result, the value of N does not exceed 20 during calculation; if the type of the URL to be detected still cannot be determined when N = 20, the URL is classified as an other-attack URL.
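The procedure above (N = 10, tie-break by average distance, N increased by 5, cap at 20, fallback to the other-attack type) may be sketched as follows. The function name, the one-dimensional example vectors, the label strings, and the use of `math.isclose` to test for equal averages are assumptions made for illustration.

```python
import math
from collections import Counter

def classify(query, library, n=10, step=5, n_max=20, fallback="other_attack"):
    """KNN vote with the tie handling described in the embodiment."""
    ranked = sorted(
        ((math.dist(query, vec), label) for vec, label in library),
        key=lambda pair: pair[0],
    )
    if not ranked:
        return fallback
    while n <= n_max:
        nearest = ranked[:n]
        votes = Counter(label for _, label in nearest)
        top = max(votes.values())
        tied = [label for label, count in votes.items() if count == top]
        if len(tied) == 1:
            return tied[0]
        # Tie on count: compare average distance of each tied type.
        means = {
            label: sum(d for d, l in nearest if l == label) / top
            for label in tied
        }
        best = min(means.values())
        closest = [label for label in tied if math.isclose(means[label], best)]
        if len(closest) == 1:
            return closest[0]
        n += step  # averages also equal: widen the neighborhood
    return fallback

clear = classify((0.0,), [((1.0,), "sql_injection")] * 6 + [((2.0,), "normal")] * 4)
by_mean = classify((0.0,), [((1.0,), "xss"), ((2.0,), "normal")])
unresolved = classify((0.0,), [((1.0,), "xss"), ((-1.0,), "normal")])
```

`math.dist` requires Python 3.8 or later. The three calls exercise, in order, a clear majority, a count tie resolved by average distance, and a tie that survives up to N = 20 and therefore falls back.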
The present embodiment further provides an apparatus for detecting a malicious URL. The apparatus is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a detection apparatus for malicious URLs according to an embodiment of the present disclosure, where as shown in fig. 4, the apparatus includes:
an obtaining module 41, configured to obtain a URL to be detected;
the extracting module 42 is coupled with the obtaining module 41 and is used for extracting characteristic values of a plurality of characteristics of the URL to be detected;
a matching module 43, coupled to the extracting module 42, configured to select at least one URL sample from the URL sample library according to a characteristic value of the plurality of characteristics in an order from a large characteristic similarity to a small characteristic similarity, where the URL sample library includes URL samples of a plurality of URL types, and the plurality of URL types include a normal URL type and a malicious URL type;
and the processing module 44 is coupled to the matching module 43 and configured to count the number of URL samples of each URL type in the at least one URL sample, and use the URL type with the largest number of URL samples as the URL type of the URL to be detected.
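As a rough illustration of how the four modules could be chained in software, the sketch below wires them together as methods of one class. The class name, the method names, and the two-feature toy extractor are hypothetical, and the tie handling described elsewhere in this embodiment is omitted for brevity.

```python
import math
from collections import Counter

class Detector:
    """Illustrative wiring of modules 41 to 44; all names are hypothetical."""

    def __init__(self, library, extract, k):
        self.library = library  # list of (feature_vector, url_type) pairs
        self.extract = extract  # extraction module: url -> feature vector
        self.k = k              # preset number of nearest samples

    def obtain(self, raw):
        # Obtaining module: here it only trims surrounding whitespace.
        return raw.strip()

    def match(self, features):
        # Matching module: the k samples nearest by Euclidean distance.
        ranked = sorted(self.library, key=lambda s: math.dist(features, s[0]))
        return ranked[: self.k]

    def process(self, matched):
        # Processing module: count samples per type and return the most common.
        votes = Counter(url_type for _, url_type in matched)
        return votes.most_common(1)[0][0]

    def detect(self, raw):
        url = self.obtain(raw)
        return self.process(self.match(self.extract(url)))

def toy_extract(url):
    # Hypothetical 2-feature extractor: URL length and digit count.
    return (len(url), sum(ch.isdigit() for ch in url))

detector = Detector(
    library=[((10, 0), "normal"), ((30, 5), "sql_injection"), ((12, 1), "normal")],
    extract=toy_extract,
    k=3,
)
result = detector.detect("  http://a.b/c  ")
```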
In some of these embodiments, the matching module 43 includes:
the first calculation unit is used for calculating the Euclidean distance between the URL to be detected and the URL sample in the URL sample library in the multidimensional vector space with a plurality of features as dimensions;
and the first processing unit is coupled with the first calculation unit and is used for taking the URL sample with the Euclidean distance to the URL to be detected smaller than the preset distance as at least one URL sample.
In some of these embodiments, the matching module 43 further includes:
the second calculation unit is used for calculating the Euclidean distance between the URL to be detected and the URL sample in the URL sample library in the multidimensional vector space with the plurality of characteristics as dimensions;
and the second processing unit is coupled with the second calculation unit and used for selecting a preset number of URL samples from the URL sample library as at least one URL sample in ascending order of Euclidean distance.
In some of these embodiments, the processing module 44 includes:
the third calculation unit is used for respectively calculating the average Euclidean distances between the URL samples of the at least two URL types and the URL to be detected in the multidimensional vector space with the plurality of characteristics as dimensions under the condition that the URL types with the most URL sample numbers comprise at least two URL types to obtain at least two average Euclidean distances;
and the first confirmation component is coupled and connected with the third calculation unit and used for taking the URL type corresponding to the minimum average Euclidean distance in at least two average Euclidean distances as the URL type of the URL to be detected.
In some of these embodiments, the first processing unit comprises:
a first processing component that increases a preset distance in a case where the URL type having the largest number of URL samples includes at least two URL types;
and the second confirmation component is coupled with the first processing component and is used for taking the URL sample whose Euclidean distance to the URL to be detected is smaller than the increased preset distance as at least one URL sample.
In some of these embodiments, the second processing unit further comprises:
a second processing component for increasing the preset number in case that the URL type having the largest number of URL samples includes at least two URL types;
and the third confirming component is coupled and connected with the second processing component and used for selecting the increased preset number of URL samples from the URL sample library as at least one URL sample according to the sequence of the Euclidean distance from small to large.
In some of these embodiments, the matching module 43 further includes:
the comparison unit is used for respectively carrying out collision comparison on the URL to be detected with the malicious URL information library and the malicious URL keyword library;
the first confirmation unit is coupled with the comparison unit and used for, under the condition that the collision comparison with the malicious URL information library or the malicious URL keyword library is successful, determining that the URL to be detected is a malicious URL and determining that the malicious URL type of the URL to be detected is the malicious URL type obtained by the collision comparison;
and the second confirmation unit is coupled with the comparison unit and used for matching at least one URL sample from the URL sample library according to the feature similarity according to the feature values of the plurality of features under the condition that the collision comparison with the malicious URL information library and the malicious URL keyword library fails.
It should be noted that the above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the above modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
In addition, the method for detecting the malicious URL according to the embodiment of the present application described in conjunction with fig. 1 may be implemented by a computer device. Fig. 5 is a hardware structure diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 51 and a memory 52 in which computer program instructions are stored.
Specifically, the processor 51 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 52 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 52 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 52 may include removable or non-removable (or fixed) media, where appropriate. The memory 52 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 52 is a non-volatile memory. In certain embodiments, memory 52 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Out Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory 52 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 51.
The processor 51 reads and executes the computer program instructions stored in the memory 52 to implement any one of the malicious URL detection methods in the above embodiments.
In some of these embodiments, the computer device may also include a communication interface 53 and a bus 50. As shown in fig. 5, the processor 51, the memory 52, and the communication interface 53 are connected to each other via the bus 50 to complete communication therebetween.
The communication interface 53 is used for implementing communication between the various modules, apparatuses, units and/or devices in the embodiments of the present application. The communication interface 53 may also implement data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, an image/data processing workstation, and the like.
Bus 50 includes hardware, software, or both, coupling the components of the computer device to each other. Bus 50 includes, but is not limited to, at least one of the following: a Data Bus, an Address Bus, a Control Bus, an Expansion Bus, and a Local Bus. By way of example and not limitation, bus 50 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or a combination of two or more of these suitable buses. Bus 50 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the present application, any suitable buses or interconnects are contemplated by the present application.
The computer device may execute the method for detecting a malicious URL in the embodiment of the present application on an obtained URL of unknown type in an information security network, thereby implementing the method for detecting a malicious URL described in conjunction with fig. 1.
In addition, in combination with the method for detecting a malicious URL in the foregoing embodiments, an embodiment of the present application may provide a computer-readable storage medium for implementing the method. The computer-readable storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement the method for detecting a malicious URL in any of the above embodiments.
The technical features of the above embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of the present disclosure.
The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for detecting a malicious URL, comprising:
acquiring a URL to be detected;
extracting characteristic values of a plurality of characteristics of the URL to be detected;
selecting a plurality of URL samples from a URL sample library according to the characteristic values of the plurality of characteristics in a sequence of characteristic similarity from large to small, wherein the URL sample library comprises URL samples of a plurality of URL types, the plurality of URL types comprise types of normal URLs and types of malicious URLs, and the characteristic similarity is determined by Euclidean distance;
and counting the URL sample number of each URL type in the plurality of URL samples, and taking the URL type with the largest URL sample number as the URL type of the URL to be detected.
2. The method according to claim 1, wherein matching a plurality of URL samples according to feature similarity from a URL sample library based on the feature values of the plurality of features comprises:
calculating Euclidean distances between the URL to be detected and URL samples in the URL sample library in a multi-dimensional vector space with the characteristics as dimensions;
and taking the URL samples with Euclidean distance to the URL to be detected smaller than a preset distance as the plurality of URL samples.
3. The method according to claim 1, wherein matching a plurality of URL samples according to feature similarity from a URL sample library based on feature values of the plurality of features comprises:
calculating Euclidean distances between the URL to be detected and URL samples in the URL sample library in a multidimensional vector space with the plurality of characteristics as dimensions;
and selecting a preset number of URL samples from the URL sample library as the plurality of URL samples according to the sequence of the Euclidean distances from small to large.
4. The method according to claim 1, wherein, in a case where the URL type having the largest number of URL samples includes at least two URL types, the determining, as the URL type of the URL to be detected, the URL type having the largest number of URL samples includes:
in a multidimensional vector space with the plurality of characteristics as dimensions, respectively calculating average Euclidean distances between the URL samples of the at least two URL types and the URL to be detected to obtain at least two average Euclidean distances;
and taking the URL type corresponding to the minimum average Euclidean distance in the at least two average Euclidean distances as the URL type of the URL to be detected.
5. The method according to claim 2, wherein the step of using, as the plurality of URL samples, URL samples having a euclidean distance to the URL to be detected smaller than a preset distance comprises:
increasing the preset distance in a case where the URL type having the largest number of URL samples includes at least two URL types;
and taking the URL samples with Euclidean distance of the URL to be detected smaller than the increased preset distance as the plurality of URL samples.
6. The method according to claim 3, wherein selecting a preset number of URL samples from the URL sample library as the plurality of URL samples according to the sequence of Euclidean distances from small to large comprises:
increasing the preset number in a case where the URL type having the largest number of URL samples includes at least two URL types;
and selecting the increased URL samples with preset number from the URL sample library as the plurality of URL samples according to the sequence of the Euclidean distance from small to large.
7. The method according to claim 1, wherein selecting the plurality of URL samples from the URL sample library in order of decreasing feature similarity according to the feature values of the plurality of features comprises:
respectively carrying out collision comparison on the URL to be detected and a malicious URL information library and a malicious URL keyword library;
under the condition that the comparison with the malicious URL information library or the malicious URL keyword library is successful, determining that the URL to be detected is a malicious URL, and determining that the type of the malicious URL of the URL to be detected is the type of the malicious URL obtained by the comparison;
and matching a plurality of URL samples from a URL sample library according to the feature similarity according to the feature values of the features under the condition that the collision comparison with the malicious URL information library and the malicious URL keyword library fails.
8. An apparatus for detecting a malicious URL, comprising:
the acquisition module is used for acquiring the URL to be detected;
the extraction module is used for extracting the characteristic values of the characteristics of the URL to be detected;
a matching module, configured to match multiple URL samples according to feature similarity from a URL sample library according to feature values of the multiple features, where the URL sample library includes URL samples of multiple URL types, the multiple URL types include a type of a normal URL and a type of a malicious URL, and the feature similarity is determined by an euclidean distance;
and the processing module is used for counting the number of URL samples of each URL type in the plurality of URL samples and taking the URL types of which the number is more than the preset number as the URL types of the URL to be detected.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of detecting a malicious URL according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of detecting a malicious URL according to any one of claims 1 to 7.
CN202010325183.XA 2020-04-23 2020-04-23 Malicious URL detection method and device, computer equipment and storage medium Active CN111556042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010325183.XA CN111556042B (en) 2020-04-23 2020-04-23 Malicious URL detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010325183.XA CN111556042B (en) 2020-04-23 2020-04-23 Malicious URL detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111556042A CN111556042A (en) 2020-08-18
CN111556042B true CN111556042B (en) 2022-12-20

Family

ID=72002539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010325183.XA Active CN111556042B (en) 2020-04-23 2020-04-23 Malicious URL detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111556042B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112671747B (en) * 2020-12-17 2022-08-30 赛尔网络有限公司 Overseas malicious URL statistical method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557695A (en) * 2015-09-25 2017-04-05 卓望数码技术(深圳)有限公司 A kind of malicious application detection method and system
WO2018077035A1 (en) * 2016-10-31 2018-05-03 腾讯科技(深圳)有限公司 Malicious resource address detecting method and apparatus, and storage medium
CN108282450A (en) * 2017-01-06 2018-07-13 阿里巴巴集团控股有限公司 The detection method and device of abnormal domain name
CN107438083A (en) * 2017-09-06 2017-12-05 安徽大学 Detection method for phishing site and its detecting system under a kind of Android environment
CN108718298A (en) * 2018-04-28 2018-10-30 北京奇安信科技有限公司 Connect flow rate testing methods and device outside a kind of malice
CN110414223A (en) * 2019-07-08 2019-11-05 新华三信息安全技术有限公司 A kind of attack detection method and device
CN110851828A (en) * 2019-09-30 2020-02-28 光通天下网络科技股份有限公司 Malicious URL monitoring method and device based on multi-dimensional features and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
URL-based malicious access detection method; Li Mengyu et al.; Journal on Communications (《通信学报》); 20180930; full text *

Also Published As

Publication number Publication date
CN111556042A (en) 2020-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant