CN117201194B

CN117201194B - URL classification method, device and system based on character string similarity calculation

Info

Publication number: CN117201194B
Application number: CN202311461558.5A
Authority: CN
Inventors: 周丽娟; 洪剑珂; 刘恋; 严格知; 张洁卉; 章勇
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2023-11-06
Filing date: 2023-11-06
Publication date: 2024-01-05
Anticipated expiration: 2043-11-06
Also published as: CN117201194A

Abstract

The invention discloses a URL classification method, device and system based on string similarity calculation, belonging to the field of network security. The method uses a unified algorithm to convert the composition rules of URLs within the same website/information system into a more regular string form, and then uses string similarity to judge whether two URLs belong to the same class. This reduces the false positives and false negatives of current methods that compute string similarity directly on the raw URLs; for example, the URL length becomes only a small part of the converted string, so its influence on the overall string comparison is greatly reduced. In particular, unknown attacks can be guarded against in advance without any prior attack knowledge.

Description

URL classification method, device and system based on character string similarity calculation

Technical Field

The invention belongs to the field of network security, and particularly relates to a URL classification method, device and system based on character string similarity calculation.

Background

In recent years, network security problems have had an increasingly serious impact on society. In particular, with supply chain attacks based on source code auditing techniques, 0day vulnerabilities emerge one after another and are the most difficult to defend against. Chinese patent publication CN 111259279A discloses an attack URL detection method based on dynamic feature extraction, which uses a recurrent neural network to extract features from attack URLs, thereby learning the features that attack URLs share, and then manually writes matching rules for matching. Chinese patent publication CN 108965336A discloses an attack detection method and device in which an attack feature decision tree is pre-configured; the decision tree consists of several sequentially connected layers, each layer consists of several sequentially connected nodes, and each node in the tree is traversed to check whether a network message contains the attack feature stored by that node. In addition, many other malicious URL detection or classification techniques exist, such as static and dynamic recognition techniques; static recognition techniques fall mainly into three kinds: blacklist-based, rule-matching-based and machine-learning-based.

These methods require a large amount of prior knowledge about known attacks, and find other known or unknown attacks only after features have been summarized from it. As a result, the URLs of some brand-new 0day attacks cannot be detected, and false positives can also occur because of the diversity of websites.

Disclosure of Invention

In view of the above defects or improvement needs of the prior art, the invention provides a URL classification method, device and system based on string similarity calculation. The path part of the URL to be classified is patterned, and the similarity between the resulting patterned string and the patterned strings of whitelist URLs is computed, so that URLs can be classified accurately without any knowledge of known attacks.

To achieve the above object, according to a first aspect of the present invention, there is provided a URL classification method based on string similarity calculation, including:

S1, acquiring and storing access logs of a target website or information system in real time;

S2, extracting the access URL to be classified from the access log, deleting its parameter part, and patterning its path part to obtain the patterned string of the access URL;

wherein the patterning includes: dividing the path part into a plurality of strings, using the path separator and the dots within each path level as delimiters; preprocessing each string except the last one as a target string to obtain a corresponding patterned substring; and, if the length of the last string is smaller than a length threshold, appending the last string to the end of the preceding patterned substring, otherwise preprocessing the last string as a target string to obtain its corresponding patterned substring;

the preprocessing is: setting the beginning of the patterned substring, and appending the length of the target string to the end of the patterned substring; when the target string contains uppercase letters, lowercase letters, digits, hyphens and other characters, correspondingly appending a first, second, third, fourth and fifth string identifier to the end of the patterned substring; after preprocessing, every patterned substring has the same beginning;

S3, deleting the parameter part of each whitelist URL of the target website or information system, and applying the patterning to its path part to obtain the patterned string of each whitelist URL;

S4, if the similarity between the patterned string of the access URL and the patterned string of every whitelist URL is smaller than a set threshold, the access URL is an abnormal URL; otherwise the access URL is a normal URL.

According to a second aspect of the present invention, there is provided a URL classification apparatus based on string similarity calculation, comprising:

the log acquisition module is used for acquiring and storing access logs of the target website or the information system in real time;

the first processing module is used for extracting the access URL to be classified from the access log, deleting its parameter part, and patterning its path part to obtain the patterned string of the access URL;

the second processing module is used for deleting the parameter part of each whitelist URL of the target website or information system, and applying the patterning to its path part to obtain the patterned string of each whitelist URL;

and the classification module is used for judging the access URL to be an abnormal URL if the similarity between the patterned string of the access URL and the patterned string of every whitelist URL is smaller than a set threshold, and otherwise judging the access URL to be a normal URL.

According to a third aspect of the present invention, there is provided a URL classification system based on string similarity calculation, comprising: a computer readable storage medium and a processor;

the computer-readable storage medium is for storing executable instructions;

the processor is configured to read executable instructions stored in the computer readable storage medium and perform the method according to the first aspect.

According to a fourth aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to perform the method according to the first aspect.

In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:

The invention provides a URL classification method based on string similarity calculation. It relates to the identification of abnormal URL accesses in Web security, and is a classification method based on comparison with a known set of normal URLs of a website/information system, i.e., whitelist URLs. The method uses a unified algorithm to convert the composition rules of URLs within the same website/information system into a more regular string form, and then uses string similarity to judge whether two URLs belong to the same class. This reduces the false positives and false negatives of current methods that compute string similarity directly on the raw URLs; for example, the URL length becomes only a small part of the converted string, so its influence on the overall string comparison is greatly reduced. In particular, unknown attacks can be guarded against in advance without any prior attack knowledge.

Drawings

Fig. 1 is a schematic flow chart of a URL classification method based on character string similarity calculation according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of an access log structure according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of URL patterning processing according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.

Currently, the identification of unknown attack URLs for web applications generally relies on machine learning over known attack features, in the hope of finding common features from which rules or patterns for identifying unknown attacks can be generated. However, such methods must have collected enough known attacks in advance; otherwise attacks are missed, and some brand-new 0day attacks cannot be identified.

Moreover, because website/information system structures and implementations are complex, the composition of their URLs can vary greatly, for example when random strings, Chinese pinyin combined with digits, or combinations of symbols are used as directory or file names. If a machine learning algorithm is trained on the whitelist URLs for classification, good training results cannot be obtained, because several different composition features and rules may appear in the whitelist at the same time, and the classification effect is poor. Computing string similarity is more direct and effective, but for the same reasons an ordinary, unmodified string similarity algorithm still produces many misclassifications. For URLs with similar and long paths, an attack URL is likely to be wrongly classified as a whitelist URL. For URLs containing Chinese characters, Chinese pinyin or random strings, URLs that clearly follow a unified composition rule are likely to be classified as different because they differ somewhat in length and composition. The more such misclassifications there are, the more manual analysis and verification is required, which makes the classification method inefficient.

To solve the above problems, the embodiment of the invention adds a patterning stage in front of a conventional string similarity algorithm. The macro features of how a URL string is composed become the object of the similarity calculation, which reduces the influence of differences in URL length and in the details of its composition. An access URL that is not similar to any whitelist URL is therefore either an attack URL, or an interface or link of the website/system that is rarely accessed in normal use. In either case it deserves analysis from a network security perspective, which is how "zero trust" detection is realized.

It should be noted that a URL generally consists of a protocol, a domain name or IP, a port number, a path and parameters. Unless specifically stated otherwise, the URLs referred to in the present invention are URLs of the same website/information system, that is, their protocol, domain name or IP, and port number are the same, so the object of the string similarity calculation and classification below is the path part of the URL. Furthermore, URLs that have already been explicitly identified and intercepted by security devices are outside the scope of the present invention.
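As a minimal sketch of isolating the path part, the following uses `urlsplit` from Python's standard library. The helper name `extract_path` and the sample URLs are illustrative assumptions, not part of the described method.

```python
from urllib.parse import urlsplit


def extract_path(url: str) -> str:
    """Return only the path part of a URL.

    The scheme, host, port, query parameters and fragment are dropped,
    which also covers the "delete the parameter part" operation.
    """
    return urlsplit(url).path


# Works for full URLs and for the relative form often found in access logs.
print(extract_path("https://example.com:8443/Admin/aaaa/bbb-123/ccd(e).html?id=7&page=2"))
print(extract_path("/Admin/index.html?id=1"))
```

Only the returned path would then be passed on to the patterning stage.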

The URL classification method based on character string similarity calculation provided by the embodiment of the invention, as shown in fig. 1, includes:

s1, acquiring and storing access logs of a target website or an information system in real time.

Specifically, all accesses to a website or information system are collected and saved.

Preferably, as shown in fig. 2, the log content includes at least the following elements: date and time, source IP address (hereinafter source IP), destination IP address (hereinafter destination IP), destination port, application protocol (http or https), host part of the http request header (hereinafter host), complete URL including parameters (hereinafter URL), and http response code (hereinafter response code).

Log collection and storage run continuously and gather the data needed for statistical analysis, which covers the necessary parts of each website access request and its response. Because these functions are essentially the same as the general logging functions of a website system, it is usually sufficient to enable the website's logging, collect the log content in real time, and store it on a server for subsequent unified analysis.
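The log elements listed above can be held in a small record type. The sketch below is only an illustration: the class name `AccessLogRecord`, the field names, and the tab-separated line layout are assumptions, since the text does not prescribe a storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessLogRecord:
    """One access log entry with the elements listed for the log content."""
    timestamp: str
    source_ip: str
    dest_ip: str
    dest_port: int
    protocol: str        # "http" or "https"
    host: str            # host part of the http request header
    url: str             # complete URL, including parameters
    response_code: int


def parse_log_line(line: str) -> AccessLogRecord:
    """Parse one tab-separated line (hypothetical layout) into a record."""
    ts, src, dst, port, proto, host, url, code = line.rstrip("\n").split("\t")
    return AccessLogRecord(ts, src, dst, int(port), proto, host, url, int(code))


record = parse_log_line(
    "2023-11-06 10:15:00\t10.0.0.5\t10.0.0.80\t443\thttps\t"
    "example.com\t/Admin/index.html?id=1\t200"
)
print(record.url, record.response_code)
```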

S2, extracting the access URLs to be classified from the access log, deleting the parameter parts of the access URLs, and carrying out modeling processing on the path parts of the access URLs to obtain the modeling character strings of the access URLs.

As a further preferred aspect of the present invention, the preprocessing further comprises: when the target string contains Chinese characters, English words and blank characters, correspondingly appending a sixth, seventh and eighth string identifier to the end of the patterned substring.

As a further preferred aspect of the present invention, the preprocessing further comprises: after the string identifiers have been added, appending a tail identifier to the end of the patterned substring.

Specifically, the URLs of normal accesses (i.e., excluding accesses that were intercepted by a security device or whose target does not exist) are extracted from the access log at a preset period. For each such access URL, the parameter part is deleted and the URL is patterned to obtain its patterned string.

The patterning flow is as follows. First, the parameter part of the URL is removed. During patterning, the overall format and structure of the URL must be kept unchanged, that is, the path separator itself is not patterned. The URL is divided into several path levels using the path separator as the delimiter, and the string of each level is a string to be patterned. The string of each level is preprocessed to obtain a corresponding patterned substring, and the patterned substrings are joined in the order of the path levels (i.e., their order before splitting) to obtain the patterned string.
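One possible way to pick out the normal-access URLs is to filter log entries by response code. This is a sketch under a stated assumption: the text does not say how intercepted or non-existent accesses are marked, so the codes 403 (intercepted) and 404 (not existing) and the function name `select_normal_urls` are hypothetical.

```python
def select_normal_urls(records, excluded_codes=frozenset({403, 404})):
    """Return the distinct URLs, in first-seen order, of non-excluded accesses.

    `records` is an iterable of (url, response_code) pairs. The default
    excluded codes are an assumption about how the deployment reports
    intercepted (403) and non-existent (404) requests.
    """
    seen = set()
    urls = []
    for url, code in records:
        if code in excluded_codes or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


sample = [
    ("/Admin/index.html?id=1", 200),
    ("/shell.php", 404),
    ("/Admin/index.html?id=1", 200),
    ("/blocked/path", 403),
    ("/user/list", 302),
]
print(select_normal_urls(sample))
```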

It should be noted that the present invention is not limited to the order of adding the above character string length and the first to eighth character string identifiers, that is: the character string length may be added first, and then the first to eighth character string identifiers may be added; the first to eighth character string identifiers may be added first, and then the character string length may be added; it is also possible to add the first to second string identifiers first, then add the string length, and finally add the third to eighth string identifiers, etc. However, the addition of the tail identifier is performed after the addition of the character string length and the first to eighth character string identifiers is completed.

It is understood that the other characters are any characters other than uppercase letters, lowercase letters, digits and hyphens. Any character may be used for the first to eighth string identifiers and the tail identifier, as long as they are all different from one another.

As shown in fig. 3, the flow of the above-described modeling process is described below as a specific example:

1) Remove the log records of accesses to files such as pictures, videos, documents, css and js.

2) Parse the URL: divide the path part of the URL into several path levels using the path separator "/" as the delimiter, store the levels in their original order, then go to step 3) and pattern the string of each level in turn as follows.

3) Initialize the patterned string: add the fixed beginning "mod-" to mark the start of the patterned string for a path level.

4) Determine whether the path level contains the dot symbol ".". If so, split the level into several strings using the dot as the delimiter, store them in their original order, then go to step 5) and pattern them in turn as follows; if not, jump to step 6).

5) Determine whether the current string is the last of the split strings. If it is, compare its length with the threshold 5 (the suffixes of currently common file types do not exceed 5 characters): if the length is less than 5, append the string directly to the end of the patterned string and go to step 13); otherwise go to step 6). If the current string is not the last split string, go to step 6).

It can be understood that a dot in a URL usually introduces a suffix indicating the file type. Common file suffixes are therefore not patterned but copied directly into the patterned string, whereas unusual (e.g., custom) suffixes are patterned.

6) Append the length of the string to the end of the patterned string.

7) Determine whether the string contains lowercase letters "a-z". If so, append the letter "a" to the end of the patterned string as the first string identifier; if not, proceed to the next step.

8) Determine whether the string contains uppercase letters "A-Z". If so, append the letter "A" to the end of the patterned string as the second string identifier; if not, proceed to the next step.

9) Determine whether the string contains digits "0-9". If so, append the letter "d" to the end of the patterned string as the third string identifier; if not, proceed to the next step.

10) Determine whether the string contains the hyphen "-" or the underscore "_". If so, append the letter "D" to the end of the patterned string as the fourth string identifier; if not, proceed to the next step.

11) Determine whether the string contains any character other than uppercase and lowercase letters, digits, "-" and "_". If so, append the letter "T" to the end of the patterned string as the fifth string identifier; if not, proceed to the next step.

12) Append the symbol "-" to the end of the patterned string as the tail identifier.

13) Determine whether all strings generated in step 4) have been processed. If so, go to the next step; otherwise jump to step 5).

14) Determine whether all path levels generated in step 2) have been processed. If so, go to the next step; otherwise jump to step 3).

15) Rejoin the patterned substrings of the path levels with the path separator "/", in their order before splitting, and output the patterned URL.
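Steps 2) to 15) can be sketched in Python as follows. This is one reading of the flow, not a definitive implementation: the function names are hypothetical, empty path levels (such as the one before the leading "/") are assumed to be kept empty, and the "mod-" beginning is added once per path level as in steps 3) and 14). The method summary instead describes each dot-separated string as having its own beginning; for levels with at most one dot both readings give the same result.

```python
import re

PREFIX = "mod-"        # fixed beginning, step 3)
TAIL = "-"             # tail identifier, step 12)
SUFFIX_THRESHOLD = 5   # length threshold for the last dot-separated string, step 5)

# (identifier, regular expression) pairs for steps 7) to 11), in that order.
CHECKS = [
    ("a", re.compile(r"[a-z]")),            # lowercase letters
    ("A", re.compile(r"[A-Z]")),            # uppercase letters
    ("d", re.compile(r"[0-9]")),            # digits
    ("D", re.compile(r"[-_]")),             # hyphen or underscore
    ("T", re.compile(r"[^A-Za-z0-9_-]")),   # any other character
]


def pattern_level(level: str) -> str:
    """Pattern the string of one path level (steps 3) to 13))."""
    out = PREFIX
    parts = level.split(".")
    for i, part in enumerate(parts):
        is_last_of_split = len(parts) > 1 and i == len(parts) - 1
        if is_last_of_split and len(part) < SUFFIX_THRESHOLD:
            out += part                      # short suffix is copied unchanged
            continue
        out += str(len(part))                # step 6)
        out += "".join(ident for ident, rx in CHECKS if rx.search(part))
        out += TAIL
    return out


def pattern_path(path: str) -> str:
    """Pattern a URL path whose parameter part has already been removed."""
    levels = path.split("/")                 # step 2)
    # Assumption: empty levels stay empty so the "/" structure is preserved.
    return "/".join(pattern_level(lv) if lv else "" for lv in levels)


print(pattern_path("/Admin/aaaa/bbb-123/ccd(e).html"))
# -> /mod-5aA-/mod-4a-/mod-7adD-/mod-6aT-html
```

The printed result matches the patterned string given for the same path in the specification's own example.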

The main purpose of the above steps is to extract the composition-rule features of the URL string and then re-express the URL in a uniform, regular form. The representation is still a string and does not destroy the path structure of the original URL, so the subsequent string similarity calculation is less affected by specific lengths and contents while preserving, as far as possible, the similarity of the URLs in overall structure and content composition. Further composition features can be extracted as the situation requires, for example whether the string contains Chinese characters, English words, blank characters and so on.
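The additional features mentioned here (Chinese characters, English words, blank characters) could be detected along the following lines. Everything concrete in this sketch is an assumption: the identifier letters "C", "W" and "S", the small word list, and the Unicode range used for Chinese characters are chosen only for illustration, since the text requires only that all identifiers differ from one another.

```python
import re

# Hypothetical identifier letters for the sixth, seventh and eighth identifiers.
CHINESE_ID, WORD_ID, BLANK_ID = "C", "W", "S"

# Tiny stand-in vocabulary; a real deployment would need a proper word list.
ENGLISH_WORDS = {"admin", "login", "user", "index"}


def extra_identifiers(target: str) -> str:
    """Return the extra identifiers that apply to one target string."""
    out = ""
    if re.search(r"[\u4e00-\u9fff]", target):          # common CJK ideographs
        out += CHINESE_ID
    tokens = re.findall(r"[A-Za-z]+", target)          # runs of ASCII letters
    if any(tok.lower() in ENGLISH_WORDS for tok in tokens):
        out += WORD_ID
    if re.search(r"\s", target):                       # any whitespace character
        out += BLANK_ID
    return out


print(extra_identifiers("用户 admin"))
print(extra_identifiers("x9k2"))
```

In a full implementation these identifiers would be appended before the tail identifier, alongside the first to fifth identifiers.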

The URL patterning process using the above steps is exemplified as follows:

Example: the URL (i.e., the path part of the URL field in the corresponding website access log, with the parameter part removed) is:

/Admin/aaaa/bbb-123/ccd(e).html

After patterning, the patterned string corresponding to the URL is:

/mod-5aA-/mod-4a-/mod-7adD-/mod-6aT-html

Specifically, for the string "ccd(e).html": because it contains a dot, it is split at the dot into the strings "ccd(e)" and "html", which are processed in order. "ccd(e)" is patterned first: the beginning of the patterned substring is set to "mod-"; the string has length 6, so 6 is appended and the patterned substring becomes "mod-6"; the string contains lowercase letters, so the first string identifier "a" is appended, giving "mod-6a"; the string contains the characters "(" and ")", which are not letters, digits, "-" or "_", so the fifth string identifier "T" is appended, giving "mod-6aT"; all string identifiers have now been added, so the tail identifier "-" is appended, giving "mod-6aT-", and the patterning of "ccd(e)" ends. Processing then moves to the next string, "html". Because it is the last string, its length is first compared with the length threshold 5; its length is 4, which is less than 5, so it is copied directly to the end of the patterned substring "mod-6aT-" of the preceding string "ccd(e)", which becomes "mod-6aT-html", and the patterning ends. The strings of the other path levels are patterned in the same way.

S3, deleting the parameter part of each whitelist URL in the whitelist URL list of the target website or information system, and applying the patterning to its path part to obtain the patterned string of each whitelist URL.

As a further preferred aspect of the present invention, the whitelist URL is periodically extracted from the access log.

Specifically, the whitelist URLs are periodically extracted from the above access log; URLs with a large access count and a wide access range are recommended.
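A possible selection rule for whitelist candidates is sketched below. The text gives no concrete criteria, so this is an assumption throughout: "access count" is taken as the number of log entries per path, "access range" as the number of distinct source IPs, and the function name and threshold values are hypothetical.

```python
from collections import defaultdict


def select_whitelist(records, min_hits=100, min_sources=10):
    """Pick paths that are accessed often and from many different sources.

    `records` is an iterable of (source_ip, path) pairs taken from the
    access log, with parameters already removed from the path.
    """
    hits = defaultdict(int)
    sources = defaultdict(set)
    for source_ip, path in records:
        hits[path] += 1
        sources[path].add(source_ip)
    return sorted(
        path for path in hits
        if hits[path] >= min_hits and len(sources[path]) >= min_sources
    )


sample = [
    ("1.1.1.1", "/a"), ("2.2.2.2", "/a"), ("1.1.1.1", "/a"),
    ("1.1.1.1", "/b"), ("1.1.1.1", "/b"), ("1.1.1.1", "/b"),
]
print(select_whitelist(sample, min_hits=3, min_sources=2))
```

Here "/b" is accessed as often as "/a" but only from one source, so only "/a" is selected.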

After the white list URL set of the target website or the information system is prepared, deleting parameter parts of each white list URL, and carrying out modeling processing on all the white list URLs to obtain a modeling character string set of all the white list URLs.

For the classification in step S4, the similarity between the patterned string of the access URL and the patterned string of each whitelist URL is calculated. If the similarity with the patterned string of any whitelist URL is greater than the set threshold, the access URL is judged to belong to the same class as that whitelist URL; it can be treated as a whitelist URL, that is, a normal URL of the website/system, and carries no risk. If the similarity with the patterned string of every whitelist URL is smaller than the set threshold, the access URL belongs to the same class as none of the whitelist URLs; it is not a normal URL of the website or information system and is risky.

As a further preferred aspect of the present invention, if the visited URL is a normal URL, it is added to the white list URL set.

Specifically, if the access URL is classified as a whitelist URL, the URL and its patterned string are added to the whitelist URL set and the patterned string set of the website/system, respectively.

Various common string similarity algorithms may be used to calculate the similarity between the patterned string of the access URL and the patterned strings of the whitelist URLs, for example those provided by Python's difflib module. The threshold for deciding the class from the computed similarity can be set according to the algorithm used; as an example, a value of 0.98 or higher is adopted. With a threshold of 0.98, an access URL whose similarity to a whitelist URL is higher than 0.98 is judged to be in the same class as that whitelist URL.

It should be noted that the technical solution set forth above classifies the URLs of web accesses to a single website/information system, and its main purpose is to distinguish normal from abnormal web accesses, so a strict comparison of the details of the URL strings is not required.
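A minimal sketch of the similarity check and the classification decision, using `difflib.SequenceMatcher` from the Python standard library, is given below. The choice of `SequenceMatcher.ratio()` as the similarity measure, the function names, and the sample paths are assumptions; the patterned strings are written out by hand from the patterning steps so that the block stands alone.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] as computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def classify(patterned_access, patterned_whitelist, threshold=0.98):
    """Return ("normal" | "abnormal", best similarity).

    A URL is abnormal only if its similarity to every whitelist entry is
    below the threshold, as in step S4.
    """
    best = max(
        (similarity(patterned_access, w) for w in patterned_whitelist),
        default=0.0,
    )
    return ("normal" if best >= threshold else "abnormal"), best


# Raw paths /user/abcd/index.html and /user/wxyz/index.html differ in content
# but share one patterned string.
raw_score = similarity("/user/abcd/index.html", "/user/wxyz/index.html")
whitelist = ["/mod-4a-/mod-4a-/mod-5a-html"]

label_same, score_same = classify("/mod-4a-/mod-4a-/mod-5a-html", whitelist)
label_other, score_other = classify("/mod-4a-/mod-9adT-/mod-3a-php", whitelist)

print(raw_score < 0.98, label_same, label_other)
```

On the raw paths the two URLs would fall below a 0.98 threshold, while their patterned strings are identical and therefore classified as the same class.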

In summary, the method provided by the invention relates to the identification of abnormal URL accesses in Web security, and is a classification method based on comparison with a known set of normal URLs of a website/information system, i.e., whitelist URLs. The method takes some of the URLs of the website/information system as the whitelist, for example URLs with a large access count and a wide access range, converts their composition rules into a more regular string form by patterning, and then calculates and compares string similarity to identify other URLs of the website that follow the same composition rules, even if those URLs do not have a large access count or a wide access range.

The composition rules mentioned above refer to the personal style and characteristics that website developers show when naming the files, directories, functions and parameters of the website. The special characters, words and symbols contained in URLs with obvious attack characteristics do not appear in the URLs of the normal website.

Likewise, hidden URLs discovered through source code auditing, or uploaded webshells, differ markedly in composition from the URLs through which the website normally provides service.

URLs processed by the specially designed patterning algorithm reduce the false positives and false negatives of current methods that compute string similarity directly and classify on that basis; in particular, unknown attacks can be guarded against in advance without any prior attack knowledge.

The URL classification device based on string similarity calculation provided by the invention is described below; the device described below and the method described above may be referred to in correspondence with each other.

The embodiment of the invention provides a URL classification device based on character string similarity calculation, which comprises:

the acquisition module is used for acquiring and storing access logs of the target websites or the information systems in real time;

The embodiment of the invention provides a URL classification system based on string similarity calculation, comprising: a computer readable storage medium and a processor;

the computer-readable storage medium is for storing executable instructions;

the processor is configured to read executable instructions stored in the computer readable storage medium and perform a method as in any of the embodiments described above.

Embodiments of the present invention provide a computer readable storage medium storing computer instructions for causing a processor to perform a method as described in any of the embodiments above.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A URL classification method based on string similarity calculation, comprising:

2. The method of claim 1, wherein the preprocessing further comprises: when the target string contains Chinese characters, English words and blank characters, correspondingly appending a sixth, seventh and eighth string identifier to the end of the patterned substring.

3. The method of claim 1 or 2, wherein the preprocessing further comprises: after the string identifiers have been added, appending a tail identifier to the end of the patterned substring.

4. The method of claim 1, wherein if the visited URL is a normal URL, it is added to a whitelist URL set.

5. The method of claim 1, wherein the access URL to be categorized is an access URL that is not intercepted by a security device.

6. The method of claim 1, wherein the access log comprises a date and time, a source IP address, a destination port, an application protocol, the host part of the http request header, a complete URL, and an http response code.

7. The method of claim 1, wherein the whitelist URLs are periodically extracted from an access log.

8. A URL classification apparatus based on string similarity calculation, comprising:

9. A URL classification system based on string similarity calculation, comprising: a computer readable storage medium and a processor;

the computer-readable storage medium is for storing executable instructions;

the processor is configured to read executable instructions stored in the computer readable storage medium and perform the method of any one of claims 1-7.

10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-7.