CN115270120A

CN115270120A - Malicious URL blocking method

Info

Publication number: CN115270120A
Application number: CN202210746821.4A
Authority: CN
Inventors: 张广兴; 姜海洋; 景阳; 梁帅; 夏可强; 涂楚; 何旭
Original assignee: Jiangsu Future Networks Innovation Institute
Current assignee: Jiangsu Future Networks Innovation Institute
Priority date: 2022-06-28
Filing date: 2022-06-28
Publication date: 2022-11-01

Abstract

The invention provides a malicious URL blocking method comprising the following steps: S1, extracting a character string of several bytes from each URL in a malicious URL library as a feature, and building a URL feature string set; S2, computing the complement of the URL feature string set; S3, storing the complement of the URL feature set in a hash table and querying this hash table before querying the malicious URL library, so that non-malicious URLs are released quickly; S4, if a URL is not matched in step S3, querying the full malicious URL library. The method quickly releases most non-malicious URLs, reduces the number of URLs that must be checked against the performance-costly malicious URL library, and avoids the performance loss caused by checking a large number of non-malicious URLs.

Description

Malicious URL blocking method

Technical Field

The invention relates to the field of data communication, in particular to a malicious URL blocking method.

Background

With the rapid development of internet technology, people use the internet more and more widely in work and life. Maintaining network security and providing network users with a secure network environment is an important issue for network operation and maintenance personnel and for security equipment manufacturers. Network security is a broad topic that touches many fields, but what affects the browsing experience most is the problem of website access. For example, when a user browses a web page or downloads a resource, advertisement pages and unknown links often pop up. Clicking such a link inadvertently may cause the user to unknowingly download and install plug-ins, leave history records containing harmful information, or even suffer serious failures such as system paralysis. These are the effects of accessing a malicious URL (Uniform Resource Locator). The URL format is: scheme://domain-name:port/path-to-resource?parameters#anchor. For example: http://www.example.com:80/path/to/myfile.html?key1=value1&key2=value2#SomewhereInTheDocument.
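The URL components named above can be separated with Python's standard `urllib.parse` module. The following is only a minimal illustration of the format; it is not part of the patented method, and the variable names are chosen here for the example.

```python
from urllib.parse import urlsplit

# The example URL from the background section, split into its components.
url = ("http://www.example.com:80/path/to/myfile.html"
       "?key1=value1&key2=value2#SomewhereInTheDocument")
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 80
print(parts.path)      # /path/to/myfile.html
print(parts.query)     # key1=value1&key2=value2
print(parts.fragment)  # SomewhereInTheDocument
```

Only the domain name (`hostname`) is used later when URL features are extracted.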

Most malicious URLs are pushed by websites or are accessed by users by mistake, and some of the pushed information spreads within the local area network. A network administrator may therefore expect the security device to intercept malicious URLs. Most security devices implement malicious URL blocking as follows: a malicious URL list is maintained in the device, the URL of each HTTP message is looked up in it, and if a malicious URL in the list is hit, the message is discarded, which realizes the blocking function. The malicious URL list is maintained and periodically updated by professional technicians, and is then published on a designated website where developers can obtain it. Malicious URL libraries typically contain hundreds of thousands to millions of entries and keep growing. URLs are usually relatively long character strings, with an average length typically over 100 bytes. HTTP messages account for about 30%-40% of current network traffic; most of them access normal URLs, and only a few requests access malicious URLs.

In summary, the problems faced in implementing malicious URL blocking are: 1) the malicious URL library to be queried is large and keeps growing; 2) a URL is a long character string, so query comparison is costly; 3) in the current network environment the volume of HTTP request messages, that is, of messages carrying a URL, is large, but the volume of messages carrying a malicious URL is small, and most messages access normal URLs.

At present, most malicious URL blocking functions are implemented as follows: a URL is first extracted from each HTTP request message, and the URL is then looked up in the malicious URL library to see whether it is malicious. The biggest problem of this general method is its low query efficiency, because most of the URLs sent to the malicious URL library are non-malicious and only a few are malicious. When the library is large and a single query is slow, the large number of queries for non-malicious URLs consumes performance and does useless work.

Disclosure of Invention

The invention aims to provide a malicious URL blocking method that quickly releases most non-malicious URLs, reduces the number of URLs that must be checked against the performance-costly malicious URL library, and avoids the performance loss caused by checking a large number of non-malicious URLs.

In general, most traffic on the internet is benign. Based on this, most benign traffic can be filtered out in advance by means of a whitelist, so that it does not need to enter the time-consuming full malicious URL library for detection. The basic idea of the technical scheme of the invention is therefore as follows: because a lookup in a general malicious URL library is expensive, a whitelist entry with high lookup efficiency is set up first, and all URLs to be checked are quickly pre-classified. This efficient pre-classification filters out part of the benign traffic, and only the remaining part is matched against the performance-costly full malicious URL library. In other words, most of the performance-costly URL queries are replaced by efficient whitelist matching, which improves performance.

In order to achieve the purpose, the invention provides the following technical scheme:

A malicious URL blocking method, characterized by comprising the following specific steps:

S1, extracting a character string of several bytes from each URL in a malicious URL library as a feature, and building a URL feature string set;

S2, computing the complement of the URL feature string set;

S3, storing the complement of the URL feature set in a hash table, querying the hash table before querying the malicious URL library, and quickly releasing non-malicious URLs;

S4, if a URL is not matched in step S3, querying the full malicious URL library for that URL.

The step S1 specifically includes the following contents:

The technical scheme of the invention is described in detail below using the extraction of 3-byte URL features as an example. The 3-byte feature string is selected from the domain name of a malicious URL. Most domain name information consists of digits, letters and words, such as 'www.baidu.com', 'test.find-phone.in' and '210.39.38.5'. In many of them the first-level domain name (here, the leftmost label of the host name) is 'www', which is not representative, so the feature is selected as follows:

if the first-level domain name is 'www', 3 consecutive characters are taken from the second-level domain name as the feature string;

if the first-level domain name is not 'www', 3 consecutive characters are taken from the first-level domain name as the feature string.

The fewer feature characters are taken from the URL (coarser features), the smaller the complement and the faster the query, but the fewer non-malicious URLs are filtered out. Conversely, the more feature characters are taken (finer features), the larger the complement and the slower the query, but the more non-malicious URLs are filtered out. The number of feature characters is therefore chosen according to the type of device and the traffic scenario.

The invention targets IPS equipment at the enterprise gateway level, where traffic is small and the hardware resources of the equipment are modest, so 3 characters are selected as the feature; equipment of other levels can adjust the number of characters as appropriate.
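The feature selection rule of step S1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names `extract_feature` and `build_feature_set` are hypothetical, and a label shorter than three characters simply lets the slice run into the following characters of the host name.

```python
def extract_feature(host: str, n: int = 3) -> str:
    """Return n consecutive characters starting at the first label of the
    host name, or at the second label when the first label is exactly 'www'."""
    host = host.strip().lower()
    if host.startswith("www."):
        host = host[len("www."):]   # skip the non-representative 'www' label
    return host[:n]


def build_feature_set(malicious_hosts, n: int = 3) -> set:
    """Collect the features of all malicious host names into one set."""
    return {extract_feature(h, n) for h in malicious_hosts}


# The three example domain names mentioned in the description.
features = build_feature_set(["www.baidu.com", "test.find-phone.in", "210.39.38.5"])
print(sorted(features))  # ['210', 'bai', 'tes']
```

Raising `n` gives finer features at the cost of a larger complement, which is the trade-off described above.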

The step S2 specifically includes the following contents:

To take the complement of the feature string set, the concept of the full set of feature strings is explained first. Taking 3-character feature strings as an example, the full set is the totality of all 3-byte character strings that can form a feature. In the invention, the 3-character feature string is the first 3 characters of the first-level or second-level domain name of a URL, and the possible combinations are as follows: the first character is one of the 26 lowercase letters or the ten digits 0-9 (36 choices); the second and third characters are each one of the 26 lowercase letters, the ten digits 0-9, or one additional symbol (37 choices). The full set therefore contains 36 × 37 × 37 = 49,284 strings, and the complement is the set of strings that remain after the URL feature set is removed from the full set.

The process of building the complement of the URL feature set is therefore: each character string in the full set is compared with the URL feature set, and every string that matches is removed; all the strings without a match form the complement of the URL feature set.
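The full set and its complement can be computed in a few lines. The sketch below is illustrative only. The identity of the single extra symbol allowed in the second and third positions is not legible in this text, so the hyphen used for `SYMBOL` is purely an assumption, and the sample features are the ones derived from the description's example domain names.

```python
from itertools import product
import string

FIRST = string.ascii_lowercase + string.digits   # 36 choices for character 1
SYMBOL = "-"                                     # ASSUMPTION: the one extra symbol
REST = FIRST + SYMBOL                            # 37 choices for characters 2 and 3


def full_set() -> set:
    """All 3-character strings that can form a feature."""
    return {a + b + c for a, b, c in product(FIRST, REST, REST)}


def complement(features) -> set:
    """Strings of the full set that are not features of any malicious URL."""
    return full_set() - set(features)


corpus = full_set()
malicious_features = {"bai", "tes", "210"}
allow = complement(malicious_features)

print(len(corpus))  # 49284
print(len(allow))   # 49281
```

Set difference gives the same result as comparing every string of the full set against the feature set and dropping the matches.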

The step S3 specifically includes the following contents:

Compared with the existing query method, the query method of the invention adds a fast lookup entry, namely a hash table built from the complement of the URL feature set. In the most extreme case the complement contains at most 36 × 37 × 37 = 49,284 three-byte character strings, which is a relatively small number, so lookups are fast. The hash table is therefore queried first to filter out part of the non-malicious URLs, and only the remaining URLs are checked further. Specifically, the first 3 consecutive characters of the first-level or second-level domain name of the URL parsed from the HTTP request message (the second-level domain name if the first-level domain name is 'www', otherwise the first-level domain name) are looked up in the hash table. A URL that is found is non-malicious and can be released directly. A URL that is not found may be either malicious or non-malicious and needs further verification. Because non-malicious URLs account for a large share of traffic, separating out part of them with a fast and efficient lookup, and sending only the rest to the performance-costly query, improves the overall query efficiency.

The method for making and using the URL feature complement hash table in the step S3 is summarized as follows:

(1) Construction: several characters are extracted from the domain name of each URL in the malicious URL library as the feature of that URL, the features of all malicious URLs are collected into a set, and the complement of this set is taken. The complement serves as a whitelist subset: each of its elements is only a few bytes long and the total number of elements is small, so it can be made into a table entry with high query efficiency. This entry is placed before the malicious URL library query to pre-classify all URLs that need to be queried.

(2) Usage: a URL that matches the table entry is non-malicious and can be released directly. URLs that do not match include malicious URLs as well as non-malicious ones, so the full malicious URL library must be queried to check whether such a URL is malicious.
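The construction and use of the complement hash table can be sketched as below. This is a hedged illustration, not the device implementation: Python's built-in `frozenset` stands in for the hash table, the helper names `feature`, `build_allow_table` and `prequery` are hypothetical, the host names are invented examples, and the hyphen as the extra symbol is an assumption.

```python
from itertools import product
import string

FIRST = string.ascii_lowercase + string.digits
REST = FIRST + "-"   # ASSUMPTION: the one extra symbol is a hyphen


def feature(host: str) -> str:
    """First 3 characters of the host name, skipping a leading 'www' label."""
    host = host.lower()
    return (host[4:] if host.startswith("www.") else host)[:3]


def build_allow_table(malicious_hosts) -> frozenset:
    """Complement of the malicious feature set, stored in a hash-based set."""
    corpus = {a + b + c for a, b, c in product(FIRST, REST, REST)}
    return frozenset(corpus - {feature(h) for h in malicious_hosts})


def prequery(host: str, allow_table: frozenset) -> bool:
    """True:  the feature is in the complement, so the URL can be released.
    False: undecided, the full malicious URL library must still be queried."""
    return feature(host) in allow_table


allow_table = build_allow_table(["www.malware.example", "evil.example"])
print(prequery("www.example.com", allow_table))        # True
print(prequery("www.malt-shop.example", allow_table))  # False
```

The second call returns `False` even though the host is not in the malicious list, because it shares the feature `mal` with a malicious host; such URLs are the ones that still need the full query.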

The step S4 specifically includes the following contents:

The URLs filtered out in step S3 (matched) are all non-malicious. Of the URLs that are not filtered out (not matched), some are non-malicious and some are malicious, so each of them must still be looked up in the full malicious URL library to determine whether it is malicious. This query process is the same as the common existing implementation and is not described in detail. A URL that matches in this query is malicious and must be blocked; a URL that does not match is not malicious and is released directly.
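The combined decision of steps S3 and S4 can be illustrated with the sketch below. It assumes that the full malicious URL library is an in-memory set of exact URL strings, which is a simplification made here; the names `check`, `ALLOW_TABLE` and `MALICIOUS_URLS`, the sample URLs and the hyphen as the extra symbol are all hypothetical.

```python
from itertools import product
from urllib.parse import urlsplit
import string

FIRST = string.ascii_lowercase + string.digits
REST = FIRST + "-"   # ASSUMPTION: the one extra symbol is a hyphen


def feature(url: str) -> str:
    """3-character feature of the URL's host name, skipping a leading 'www'."""
    host = (urlsplit(url).hostname or "").lower()
    return (host[4:] if host.startswith("www.") else host)[:3]


# Stand-in for the full malicious URL library (exact-match set for brevity).
MALICIOUS_URLS = {"http://evil.example/dropper.exe",
                  "http://www.malware.example/login"}
CORPUS = {a + b + c for a, b, c in product(FIRST, REST, REST)}
ALLOW_TABLE = CORPUS - {feature(u) for u in MALICIOUS_URLS}


def check(url: str, stats: dict) -> str:
    """Return 'release' or 'block', counting which path was taken."""
    if feature(url) in ALLOW_TABLE:          # step S3: fast pre-query
        stats["fast_release"] += 1
        return "release"
    stats["full_lookup"] += 1                # step S4: full library query
    return "block" if url in MALICIOUS_URLS else "release"


stats = {"fast_release": 0, "full_lookup": 0}
urls = ["http://www.example.com/index.html",
        "http://evil.example/dropper.exe",
        "http://evident.example/about",
        "http://docs.example.org/"]
verdicts = [check(u, stats) for u in urls]
print(verdicts)  # ['release', 'block', 'release', 'release']
print(stats)     # {'fast_release': 2, 'full_lookup': 2}
```

In this toy run, two of the four URLs never reach the full library; the third one does, because it shares the feature `evi` with a malicious host, and is released only after the full query.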

Compared with the prior art, the invention has the following beneficial effects:

First, a complement of the URL feature set is constructed. Its elements are only 3 bytes long and, in the extreme case, number at most 49,284, so lookups are fast. Then, before the performance-costly full matching query against the malicious URL library, most non-malicious URLs are filtered out by the complement, and only the remaining URLs undergo the full query. This greatly reduces the performance-costly operations and improves the overall query efficiency.
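A rough cost model may help to see when the pre-query pays off. The symbols are assumptions introduced here for illustration and do not appear in the original text: $c_h$ is the cost of one hash table lookup, $c_f$ the cost of one full library query, and $p$ the fraction of URLs released by the pre-query.

```latex
C_{\text{conventional}} = c_f, \qquad
C_{\text{proposed}} = c_h + (1 - p)\,c_f, \qquad
C_{\text{proposed}} < C_{\text{conventional}} \iff c_h < p\,c_f
```

Under this simplified model, the average cost per URL drops whenever the hash lookup is cheaper than the share of full queries it saves.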

Drawings

FIG. 1 is a diagram comparing a conventional URL matching method with the URL matching method of the present invention;

FIG. 2 is a flow chart of the present invention for extracting 3-byte URL features;

FIG. 3 is a flow chart of taking the complement of the URL feature set according to the present invention;

FIG. 4 is a flow chart of URL matching according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to clarify technical problems, technical solutions, implementation processes and performance displays. It should be understood that the specific embodiments described herein are for illustrative purposes only. The present invention is not limited to the above embodiments. Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Example 1

S2, computing the complement of the URL feature string set;

As shown in fig. 2, the step S1 specifically includes the following steps:

As shown in fig. 3, the step S2 specifically includes the following steps:

As shown in fig. 4, the step S3 specifically includes the following steps:

The method for making and using the URL feature complementary set hash table in the step S3 is summarized as follows:

(1) Construction: several characters are extracted from the domain name of each URL in the malicious URL library as the feature of that URL, the features of all malicious URLs are collected into a set, and the complement of this set is taken. The complement serves as a whitelist subset: each of its elements is only a few bytes long and the total number of elements is small, so it can be made into a table entry with high query efficiency. This entry is placed before the malicious URL library query to pre-classify all URLs that need to be queried.

The step S4 specifically includes the following contents:

The URLs filtered out in step S3 (matched) are all non-malicious. Of the URLs that are not filtered out (not matched), some are non-malicious and some are malicious, so each of them must still be looked up in the full malicious URL library to determine whether it is malicious. This query process is the same as the common existing implementation and is not described in detail. A URL that matches in this query is malicious and must be blocked; a URL that does not match is not malicious and is released directly.

As shown in FIG. 1, panel (a) is the conventional URL matching method, in which every URL is looked up directly in the full malicious URL library. Panel (b) is the method provided by the invention. Compared with panel (a), it adds a pre-query table (shown in green), the URL feature complement hash table. This table is the whitelist entry with high query efficiency: it filters out most benign traffic in advance, and the remaining URLs are matched against the full malicious URL library to check whether they are malicious URLs that need to be blocked.

The method can be implemented according to the following steps:

Step 1: Obtain the complete malicious URL library, extract the domain name fields, and store them in a URL-domain file.

Step 2: Extract the 3-byte URL features; the flow chart is shown in FIG. 2. Traverse the domain names line by line and check whether the first 3 characters are "www". If so, skip the first-level domain name and take the first three characters of the second-level domain name; if not, take the first 3 characters of the first-level domain name. Store all the 3-character features obtained in 3byte-url.txt.

Step 3: Take the complement of the 3-byte URL feature set generated in step 2; the flow chart is shown in FIG. 3. 1) Generate the full set of features from all possible combinations of the selected 3-byte character strings. Because the first 3 characters of the first-level or second-level domain name of the URL are selected, the possible first characters are the lowercase letters a-z and the digits 0-9; the possible second and third characters are the lowercase letters a-z, the digits 0-9, and one symbol. The full set consists of all 3-byte combinations of these letters, digits and the symbol, and its size is 36 × 37 × 37 = 49,284. 2) Remove from the full set every string that already appears in 3byte-url.txt; the strings that remain are the complement of the URL feature set. 3) Store the complement of the URL feature set in negative-3byte-url.txt.

Step 4: On the equipment that needs to support the malicious URL blocking function, read the 3-byte character strings from the negative-3byte-url.txt file and generate a hash table suitable for character string lookup.

Step 5: Parse the URL from the HTTP request message and perform the pre-query to decide whether it can be released at once. The URL parsed from the message is first looked up in the hash table generated in the previous step. A URL that is found is definitely non-malicious and can be released directly; a URL that is not found may be either non-malicious or malicious.

Step 6: URLs that are not found in the previous step need a further query against the full malicious URL library. A URL that is found there is malicious and must be blocked; a URL that is not found is non-malicious and is released.

The flow chart of steps 5 and 6 is shown in FIG. 4.
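Steps 1 to 6 can be strung together as in the sketch below, which works on files in a temporary directory and on raw HTTP/1.x request bytes. It is an illustration under several assumptions and not the device implementation: the file name `url-domain.txt`, all function names, the sample URLs, the hyphen as the extra symbol, the exact-match lookup in the full library, and the `http://` scheme used to rebuild the URL are all choices made for this example.

```python
import string
import tempfile
from itertools import product
from pathlib import Path
from urllib.parse import urlsplit

FIRST = string.ascii_lowercase + string.digits
REST = FIRST + "-"   # ASSUMPTION: the one extra symbol is a hyphen


def feature(host: str) -> str:
    host = host.lower()
    return (host[4:] if host.startswith("www.") else host)[:3]


def build_files(malicious_urls, workdir: Path) -> None:
    """Steps 1-3: write the domain, feature and complement files."""
    domains = [urlsplit(u).hostname or "" for u in malicious_urls]
    (workdir / "url-domain.txt").write_text("\n".join(domains))      # step 1
    feats = sorted({feature(d) for d in domains})
    (workdir / "3byte-url.txt").write_text("\n".join(feats))         # step 2
    corpus = {a + b + c for a, b, c in product(FIRST, REST, REST)}
    negative = sorted(corpus - set(feats))
    (workdir / "negative-3byte-url.txt").write_text("\n".join(negative))  # step 3


def load_table(workdir: Path) -> set:
    """Step 4: read the complement file into a hash-based set."""
    return set((workdir / "negative-3byte-url.txt").read_text().split())


def host_of_request(raw: bytes) -> str:
    """Return the Host header of a raw HTTP/1.x request, without any port."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    for line in head.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().split(":")[0]
    return ""


def handle(raw: bytes, table: set, malicious_urls: set) -> str:
    """Steps 5-6: pre-query first, full library query only on a miss."""
    host = host_of_request(raw)
    if feature(host) in table:
        return "pass"
    path = raw.split(b" ", 2)[1].decode("latin-1")
    return "block" if f"http://{host}{path}" in malicious_urls else "pass"


malicious = {"http://evil.example/dropper.exe", "http://www.malware.example/login"}
workdir = Path(tempfile.mkdtemp())
build_files(malicious, workdir)
table = load_table(workdir)

good = b"GET / HTTP/1.1\r\nHost: www.example.com:80\r\n\r\n"
bad = b"GET /dropper.exe HTTP/1.1\r\nHost: evil.example\r\n\r\n"
print(handle(good, table, malicious))  # pass
print(handle(bad, table, malicious))   # block
```

Keeping the complement in a file separates the offline preparation (steps 1 to 3) from the device-side loading and lookup (steps 4 to 6), which mirrors the split described in the steps above.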

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A malicious URL blocking method is characterized by comprising the following specific steps:

s2, making a complement of the URL characteristic character string set;

2. The malicious URL blocking method according to claim 1, wherein the step S1 specifically includes the following steps:

The invention targets IPS equipment at the enterprise gateway level, where traffic is small and the hardware resources of the equipment are modest, so 3 characters are selected as the feature; equipment of other levels can adjust the number of characters as appropriate.

3. The malicious URL blocking method according to claim 1, wherein the step S2 specifically includes the following steps:

Therefore, the process of making the URL feature set complement is: each character string in the full set is compared with the URL feature set, and the match is removed, so that all the character strings without match are complementary sets of the URL feature set.

4. The malicious URL blocking method according to claim 1, wherein the step S3 specifically comprises the following steps:

Compared with the existing query method, the query method of the invention adds a fast lookup entry, namely a hash table built from the complement of the URL feature set. In the most extreme case the complement contains at most 36 × 37 × 37 = 49,284 three-byte character strings, which is a relatively small number, so lookups are fast. The hash table is therefore queried first to filter out part of the non-malicious URLs, and only the remaining URLs are checked further. Specifically, the first 3 consecutive characters of the first-level or second-level domain name of the URL parsed from the HTTP request message (the second-level domain name if the first-level domain name is 'www', otherwise the first-level domain name) are looked up in the hash table. A URL that is found is non-malicious and can be released directly. A URL that is not found may be either malicious or non-malicious and needs further verification. Because non-malicious URLs account for a large share of traffic, separating out part of them with a fast and efficient lookup, and sending only the rest to the performance-costly query, improves the overall query efficiency.

5. The malicious URL blocking method according to claim 4, wherein the method for making and using the URL feature complement hash table in the step S3 is summarized as follows:

6. The malicious URL blocking method according to claim 1, wherein the step S4 specifically includes the following steps: