WO2018001078A1 - URL matching method, apparatus, and storage medium - Google Patents


Info

Publication number
WO2018001078A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
target
segment
matching
access
Prior art date
Application number
PCT/CN2017/087815
Other languages
English (en)
French (fr)
Inventor
邱文昌
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2018001078A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming

Definitions

  • the present disclosure relates to the field of network technologies, and in particular, to a URL matching method, apparatus, and storage medium.
  • A URL (Uniform Resource Locator) is the address of a standard resource on the Internet. Each resource on the Internet has a unique URL that indicates the location of the resource and how the browser should handle it. According to the protocol, the format of a URL is as follows:
  • Hostname (host domain name) and Path (resource path) are mandatory contents, and port information or parameter information in "[]" are optional.
  • URL-based matching algorithms are mainly used for URL management. They are widely used in access control fields such as intrusion detection, filtering of harmful information, and network data flow control, and are one of the core components of a firewall system.
  • URL matching is mainly applied to the rule policy configuration and website response filtering in the network management system.
  • Rule policy configuration mainly adjusts how already-developed web pages are displayed, for example by configuring advertisements that match the page content on a specific type of web page. Configuring advertisements here does not mean that the page developer adds advertisements during development; it means that network administrators add advertisements to certain pages as needed after page development is complete.
  • Website response filtering refers to filtering and blocking the pages of bad websites.
  • The URLs of bad websites are stored and recorded. After a URL entered by the user is received, the Hostname field in the HTTP header of the access request is matched against the records. If the URL currently being accessed exists in the records, the system blocks the page corresponding to the URL and returns a prompt message to the user.
  • The number of URLs is very large, which makes URL management increasingly cumbersome and complicated.
  • A URL blacklist may contain millions or even tens of millions of entries.
  • When the user enters a URL, it must be matched against the millions or even tens of millions of URLs in the blacklist. Only if the matching finds that the hostname of the URL is not in the blacklist is the user provided with the resource corresponding to the entered URL. This matching process can take considerable time, which significantly affects the user experience. Therefore, how to perform large-scale URL matching efficiently is an urgent problem in network management.
  • The URL matching method, apparatus, and storage medium provided by the embodiments of the present disclosure mainly address the following technical problem: providing a new URL matching solution, different from the related art, to solve the low matching efficiency and long matching time of the related art in large-scale URL matching.
  • an embodiment of the present disclosure provides a URL matching method, including:
  • the embodiment of the present disclosure further provides a URL matching apparatus, including:
  • a segmentation module configured to segment at least one of the domain name unit and the resource path unit of the obtained access URL according to a preset division rule to obtain access URL segments, where each access URL segment includes at least the complete content between two separators in the URL;
  • a mapping module configured to map each of the access URL segments to a one-to-one corresponding access key value according to a preset mapping rule;
  • a matching module configured to match, step by step and according to a preset matching order, each of the access key values against the target key values corresponding to the stored target URL segments;
  • an obtaining module configured to acquire a processing policy for the access URL when each of the access URL segments of the access URL matches the corresponding target URL segment.
  • the embodiment of the present disclosure further provides a storage medium in which computer executable instructions are stored, the computer executable instructions being used to execute the foregoing URL matching method.
  • In the URL matching method, apparatus, and storage medium, at least one of the domain name unit and the resource path unit of the acquired access URL is segmented according to a preset division rule, and each resulting access URL segment is then mapped to a corresponding key value.
  • The access key value corresponding to each access URL segment is then matched step by step against the target key values corresponding to the stored target URL segments. If an access key value equals a target key value, the access URL segment has successfully matched the target URL segment. If all access URL segments in the access URL match the corresponding target URL segments, the access URL is one of the URLs that need to be managed and controlled.
  • By segmenting the URL, this matching scheme can improve matching accuracy and precisely manage the URLs that need to be controlled.
  • Mapping the URL segments to key values and matching the key values directly can effectively improve matching efficiency, reduce the user's waiting time, and improve the user experience.
  • FIG. 1 is a flowchart of a URL matching method according to Embodiment 1 of the present disclosure
  • FIG. 2 is a schematic diagram of a storage structure for storing a target URL segment according to Embodiment 1 of the present disclosure
  • FIG. 3 is a schematic flowchart of processing a target URL according to Embodiment 2 of the present disclosure
  • FIG. 4 is a schematic structural diagram of a URL matching apparatus according to Embodiment 3 of the present disclosure.
  • FIG. 5 is another schematic structural diagram of a URL matching apparatus according to Embodiment 3 of the present disclosure.
  • Embodiment 1:
  • The URLs in this embodiment fall into two types. One is a URL that has been determined to need management and control; such a URL is a target URL. The other is a URL that the user enters and wants to access, which we call the access URL.
  • The URL corresponding to a page containing bad information is a target URL that needs to be controlled.
  • Likewise, if the network administrator wants to manage certain pages specifically, for example to place promotional information for a car brand on pages that display car information, the URLs corresponding to those pages are target URLs that need to be managed.
  • Before an access URL is matched, it undergoes two processes, segmentation and mapping, and it is then matched against the pre-stored target URLs. Therefore, before any access URL is matched, the same processing should be performed on the target URLs. In this embodiment, segmentation and mapping can be regarded as the "preprocessing" that every URL undergoes.
  • The target URL should be processed in the same way as the access URL; that is, the preprocessing rules for access URLs should be consistent with those for target URLs. Because target URLs are preprocessed first, the preprocessing of the target URLs determines the processing rules for access URLs.
  • For brevity, this embodiment does not distinguish between target URLs and access URLs when describing URL preprocessing.
  • a URL includes at least a host domain name Hostname and a resource path Path.
  • the Hostname is a domain name unit
  • the Path is a resource path unit.
  • Using "." as the separator, the Hostname is divided from right to left into the top-level domain name, second-level domain name, third-level domain name, and registered domain name. Top-level domain names are divided into national top-level domain names, such as ".cn", ".uk", and ".de", and international top-level domain names, such as ".com", ".net", and ".org".
  • The content between two adjacent "." separators is the smallest domain name segment.
  • In the resource path, "/" is used as the separator, and the content between two adjacent "/" separators is the smallest resource path segment.
  • The minimum domain name segment and the minimum resource path segment are the basic constituent units of a URL. Therefore, after either the domain name unit or the resource path unit of a URL is segmented, each resulting URL segment should contain at least one complete minimum constituent unit; that is, each URL segment includes at least the complete content between two separators in the URL.
  • the two separators mentioned here can be any two separators in a URL.
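  • The division rule described above can be sketched as follows. This is only an illustration, not the patent's implementation: the function name `split_url` is hypothetical, and the input is assumed to be a URL with no scheme, port, parameters, or fragment. Domain name segments are listed right to left so that the top-level domain name comes first.

```python
def split_url(url):
    """Split a scheme-less URL into minimum domain name and resource path segments."""
    # Everything before the first "/" is the domain name unit (Hostname);
    # everything after it is the resource path unit (Path).
    hostname, _, path = url.partition("/")
    # "." separates domain name segments; reverse so the top-level domain is first.
    domain_segments = hostname.split(".")[::-1]
    # "/" separates resource path segments; drop empty pieces.
    path_segments = [piece for piece in path.split("/") if piece]
    return domain_segments, path_segments


domain, path = split_url("news.example.com/sports/football/index.html")
print(domain)
print(path)
```

  • Coarser segments (for example, keeping the whole resource path as one segment) would also satisfy the rule that each segment contains at least one complete minimum unit.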
  • The mapping rule refers to the correspondence between each URL segment and its key value, as shown in Table 1 and Table 1 (continued):
  • the key value of the URL segment is 2, and if the URL segment is ".org", the corresponding key value should be 5.
  • Pre-setting a key value for every URL segment in the resource path unit is almost impossible. To solve this problem, a predetermined algorithm can be used to compute on each URL segment, and the computed result is used as the key value of that URL segment.
  • Although key values are not configured for each URL segment before the result is computed, the algorithm is uniform, so each URL segment still has a one-to-one corresponding key value under this fixed algorithm.
  • the hash algorithm can map binary values of arbitrary length into shorter fixed-length binary values. This small binary value is called a hash value.
  • a hash value is a unique and extremely compact numerical representation of a piece of data. If you hash a plaintext and even change only one letter of the paragraph, subsequent hashes will produce different values. Therefore, in this embodiment, the hash algorithm may be used to process the URL segment, and the processing result is used as the key value corresponding to the URL segment.
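  • As a rough illustration of mapping URL segments to key values with a hash algorithm, the sketch below uses CRC32 from Python's standard `zlib` module. The patent does not name a specific hash algorithm, so CRC32 and the function name `key_value` are assumptions made only for this example.

```python
import zlib


def key_value(segment):
    """Map a URL segment to a fixed-length (32-bit) integer key value."""
    # CRC32 is an assumed stand-in for the unspecified hash algorithm.
    return zlib.crc32(segment.encode("utf-8"))


segments = ["com", "example", "news"]
keys = [key_value(segment) for segment in segments]
print(keys)
```

  • Any fixed-length hash can, in principle, map two different segments to the same value, so a practical implementation would probably keep the original segment next to its key value, as the storage scheme in this embodiment does, and compare it when a key matches.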
  • After each target URL is segmented and each target URL segment is mapped to its target key value, the target URL segments and their target key values are stored according to a preset matching order, so that management control can be performed in the subsequent process when an access URL entered by the user is determined to be a target URL. The preset matching order can be set by the administrator.
  • The storage may be performed level by level according to the preset matching order, so that matching can proceed step by step in the subsequent matching process.
  • For the URL segments in the resource path unit, the order is from left to right.
  • The top-level domain name can be given the highest storage level; the second-level domain name is stored under the top-level domain name, followed in order by the third-level domain name, the registered domain name, and then the resource path unit.
  • The target URL segments of the resource path unit are stored in order, starting from the leftmost target URL segment and ending with the rightmost.
  • When the target URL segments are stored, a tree storage structure may be used: each target URL segment and its corresponding target key value are stored according to the preset matching order to obtain a storage tree. A target URL segment that is matched earlier serves as the parent node of the target URL segment that is matched later, and the later-matched target URL segment serves as a child node of the earlier-matched one.
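  • A minimal sketch of such a storage tree is given below. The names `Node`, `insert`, and `key_value`, the CRC32 hash, and the policy strings are all assumptions for illustration; segments are assumed to be already listed in the preset matching order.

```python
import zlib


def key_value(segment):
    # Assumed hash: CRC32 stands in for the unspecified mapping algorithm.
    return zlib.crc32(segment.encode("utf-8"))


class Node:
    """One target URL segment in the storage tree."""

    def __init__(self, segment=""):
        self.segment = segment   # the target URL segment itself
        self.children = {}       # target key value -> child Node
        self.policy = None       # processing policy, set on the last segment


def insert(root, segments, policy):
    """Store segments level by level; earlier segments become parent nodes."""
    node = root
    for segment in segments:
        key = key_value(segment)
        if key not in node.children:
            node.children[key] = Node(segment)
        node = node.children[key]
    node.policy = policy


root = Node()
insert(root, ["com", "example", "ads"], "block")
insert(root, ["com", "example", "car"], "add_advert")
```

  • Target URLs that share leading segments (here "com" and "example") share the same parent nodes, so each level only holds the segments that actually differ.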
  • each target URL segment under the domain name unit and each target URL segment under the resource path unit are not distinguished, that is, each target URL segment in the target URL is stored in a storage tree. However, if there are too many target URLs for management control, storing the URL segments under the domain name unit and the resource path unit in one storage tree may cause a storage tree to be too large.
  • This embodiment further provides another way of storing the target URL segments: each target URL segment in the domain name unit of the target URL and its target key value are stored according to the tree storage structure to obtain a domain name storage tree; each target URL segment in the resource path unit and its target key value are stored according to the tree storage structure to obtain a resource path storage tree; and a leaf node of the domain name storage tree points to the root node of the resource path storage tree. In other words, the target URL segments in the domain name unit and the target URL segments in the resource path unit are stored in two storage trees that are connected by pointers.
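  • One possible reading of the two-tree layout is sketched below: the last node reached in the domain name storage tree holds a pointer (`path_root`, a hypothetical attribute name) to the root of a resource path storage tree. For brevity the sketch keys children by the raw segment rather than by a hashed key value; that simplification, like the names `Node` and `store`, is an assumption.

```python
class Node:
    def __init__(self):
        self.children = {}      # segment -> child Node
        self.path_root = None   # on a domain leaf: root of the resource path tree
        self.policy = None      # processing policy on the final path node


def store(domain_root, domain_segments, path_segments, policy):
    # Walk (and extend) the domain name storage tree first.
    node = domain_root
    for segment in domain_segments:
        node = node.children.setdefault(segment, Node())
    # The domain leaf points to the root of a separate resource path tree.
    if node.path_root is None:
        node.path_root = Node()
    node = node.path_root
    for segment in path_segments:
        node = node.children.setdefault(segment, Node())
    node.policy = policy


domain_root = Node()
store(domain_root, ["com", "example"], ["news", "bad.html"], "block")
store(domain_root, ["com", "example"], ["cars"], "add_advert")
```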
  • S106 is the matching process for an access URL; that is, the obtained URL is the URL of a page that the user wants to access. After segmentation and mapping are performed on the access URL, each resulting access URL segment is matched step by step against the stored target URL segments.
  • Step-by-step matching means matching an access URL segment against each target URL segment at the same level, by comparing the access key value of the access URL segment with the target key value of each of those target URL segments. After a match succeeds, the next access URL segment, according to the preset matching order, is matched against the child nodes of the successfully matched target URL segment. This process repeats until all access URL segments of the URL have matched successfully, or until some access URL segment fails to match any corresponding target URL segment, and a prompt is then returned.
  • the domain name units of the three target URLs are stored in the storage tree shown in FIG. 2, which are "abccd”, “ga12.cd”, and “d.13.cd”, respectively.
  • Next, "12" is matched against "12" and "13". The first two matching steps succeed, and matching fails only when "f" is matched against "a".
  • This indicates that the access URL entered by the user does not require management control, so the page resources the user requested can be returned directly.
  • If every access URL segment matches the corresponding target URL segment, the processing policy for the URL should be obtained, for example blocking the page corresponding to the URL or adding certain advertisement information.
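  • The step-by-step matching can be sketched as a walk down the storage tree, one key value per level. The example is loosely modeled on the segments mentioned above ("cd", "12", "13", "a", "f"); the exact target URLs, the nested-dictionary tree, the CRC32 hash, and the names `store` and `match` are assumptions for illustration.

```python
import zlib

POLICY = "policy"  # reserved dictionary key holding the processing policy


def key_value(segment):
    # Assumed hash: CRC32 stands in for the unspecified mapping algorithm.
    return zlib.crc32(segment.encode("utf-8"))


def store(tree, segments, policy):
    """Store target segments level by level in nested dictionaries."""
    node = tree
    for segment in segments:
        node = node.setdefault(key_value(segment), {})
    node[POLICY] = policy


def match(tree, segments):
    """Return the processing policy if every segment matches, else None."""
    node = tree
    for segment in segments:
        child = node.get(key_value(segment))
        if child is None:
            return None  # this level failed to match: no management control
        node = child     # continue with the children of the matched segment
    return node.get(POLICY)


tree = {}
store(tree, ["cd", "12", "a", "g"], "block")
store(tree, ["cd", "13", "d"], "block")

print(match(tree, ["cd", "13", "d"]))
print(match(tree, ["cd", "12", "f"]))
```

  • In the second lookup, "cd" and "12" match, and the walk stops when "f" finds no match among the children of "12", so no policy is returned.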
  • By performing the same segmentation and mapping on target URLs and access URLs, the URL matching method provided by this embodiment of the present disclosure represents a URL as a combination of multiple key values. Matching URLs then essentially means matching a few simple values in order, which converts the matching of a complex URL into simple value matching. This improves matching efficiency, shortens the user's waiting time, and improves the user experience.
  • Because the matching method proposed in this embodiment also segments the resource path, web pages whose domain name unit is normal but whose resource path contains bad information can be managed at a fine granularity, making management more refined, more effective, and more accurate.
  • Embodiment 2:
  • Extracting the domain name unit and the resource path unit essentially means extracting the useful part of the URL.
  • The parts of the target URL that have no use value, namely the parameter part and the fragment part, can be removed first.
  • The identifier of the parameter part is "?" and the identifier of the fragment part is "#"; that is, the contents after "?" and "#" can be deleted directly.
  • Because the contents in "[]" are optional according to the protocol, these parts can also be removed directly.
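  • The removal of the parts without use value can be sketched as below. The function name `strip_unused_parts` is hypothetical, the input is assumed to have no scheme prefix, and treating the port as the optional content to drop is an assumption based on the description of the URL format.

```python
def strip_unused_parts(url):
    """Keep only the domain name unit and the resource path unit."""
    # Drop the fragment part (after "#") and the parameter part (after "?").
    url = url.split("#", 1)[0]
    url = url.split("?", 1)[0]
    # Drop an optional port from the hostname (assumed form "host:port").
    hostname, slash, path = url.partition("/")
    hostname = hostname.split(":", 1)[0]
    return hostname + slash + path


print(strip_unused_parts("example.com:8080/a/b?id=3#top"))
```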
  • the meaning of a wildcard is that the part can be in any form.
  • S303 may be performed or S304 may be performed.
  • the URL is directly mapped and stored.
  • mapping process can be performed directly according to the hash algorithm. It is worth noting that when storing such a URL without a wildcard, a hash table can be used instead of the hash storage tree.
  • The domain name unit and the resource path unit need not both be segmented: only one of them may be segmented to obtain the corresponding target URL segments, while the other is left unsegmented and used directly as a single target URL segment.
  • The processing of every URL should be consistent. If URLs without wildcards are mapped using a hash algorithm, the hash algorithm should also be used for the segmented target URLs, and the access URL segments should be hashed with the same algorithm in the subsequent process, so that key values and URL segments truly correspond.
  • When the target URL segments obtained by segmentation and their target key values are stored, the method of Embodiment 1 is still used: the target URL segments and their target key values are stored in sequence according to the preset matching order using the tree storage structure, forming a storage tree.
  • each URL segment with a prefix wildcard and a domain name unit with a suffix wildcard can be stored separately.
  • Each URL segment with a prefix wildcard and a resource path unit with a suffix wildcard may also be stored separately.
  • For an access URL, the parameter part and the fragment part, which have no use value, may be removed first, and the hash value of the remaining access URL is calculated. The calculated hash value is first matched against the hash values stored in the hash table. If the match succeeds, the corresponding processing policy is obtained. If the match fails, the access URL may be segmented and then matched in the manner described in Embodiment 1, which is not repeated here.
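  • The two-stage lookup can be sketched as follows: a hash table of whole target URLs without wildcards is tried first, and segment-by-segment matching is used only as a fallback. The names `build_table` and `lookup`, the CRC32 hash, and the `segment_match` callback (standing in for the Embodiment 1 matcher) are assumptions for illustration; the stored URL is compared as well, to guard against hash collisions.

```python
import zlib


def key_value(text):
    # Assumed hash: CRC32 stands in for the unspecified hash algorithm.
    return zlib.crc32(text.encode("utf-8"))


def strip_unused(url):
    # Remove the fragment part ("#...") and the parameter part ("?...").
    return url.split("#", 1)[0].split("?", 1)[0]


def build_table(targets):
    """Hash table for wildcard-free target URLs: hash -> (URL, policy)."""
    return {key_value(url): (url, policy) for url, policy in targets.items()}


def lookup(url, table, segment_match):
    cleaned = strip_unused(url)
    entry = table.get(key_value(cleaned))
    if entry is not None and entry[0] == cleaned:
        return entry[1]            # whole-URL hash match succeeded
    return segment_match(cleaned)  # fall back to segment-by-segment matching


table = build_table({"bad.example.com/page": "block"})
no_match = lambda cleaned: None    # placeholder fallback for this example

print(lookup("bad.example.com/page?id=7#top", table, no_match))
```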
  • The related matching method directly compares the domain name unit and resource path unit of the access URL with the domain name unit and resource path unit of each target URL. Its time complexity is O(n^3), where n is the number of target URLs. In contrast, the matching in the URL matching method provided in this embodiment is independent of the number of target URLs, and the time complexity of the algorithm is O(L). The method provided in this embodiment therefore greatly reduces the complexity of the matching process and improves matching efficiency.
  • Embodiment 3:
  • the present embodiment provides a URL matching apparatus.
  • the URL matching method provided in the first embodiment and the second embodiment can be implemented by using the URL matching apparatus provided in this embodiment.
  • the URL matching device 40 includes a segmentation module 402, a mapping module 404, a matching module 406, and an acquisition module 408.
  • the segmentation module 402 is configured to perform segmentation processing on at least one of the domain name unit and the resource path unit of the acquired access URL according to a preset division rule to obtain an access URL segment.
  • the mapping module 404 is configured to map each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule.
  • the matching module 406 is configured to perform stepwise matching of each access key value and each target key value corresponding to each stored target URL segment according to a preset matching order.
  • the obtaining module 408 is configured to obtain a processing strategy for the access URL when each access URL segment of the access URL matches the corresponding target URL segment.
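  • The division of labor among the four modules might be pictured as one class with a method per module. This is only a sketch under assumptions: the class name `UrlMatcher`, the method names, the `add_target` helper, the nested-dictionary storage, and the CRC32 hash are all hypothetical and not taken from the patent.

```python
import zlib


class UrlMatcher:
    def __init__(self):
        self.tree = {}  # stored target key values, level by level

    def segment(self, url):
        """Segmentation module 402: domain right to left, then path left to right."""
        hostname, _, path = url.partition("/")
        return hostname.split(".")[::-1] + [p for p in path.split("/") if p]

    def map_keys(self, segments):
        """Mapping module 404: one key value per URL segment (assumed CRC32)."""
        return [zlib.crc32(s.encode("utf-8")) for s in segments]

    def add_target(self, url, policy):
        """Preprocess and store a target URL with its processing policy."""
        node = self.tree
        for key in self.map_keys(self.segment(url)):
            node = node.setdefault(key, {})
        node["policy"] = policy

    def match(self, keys):
        """Matching module 406: step-by-step comparison of key values."""
        node = self.tree
        for key in keys:
            node = node.get(key)
            if node is None:
                return None
        return node

    def obtain_policy(self, url):
        """Obtaining module 408: processing policy when every segment matches."""
        node = self.match(self.map_keys(self.segment(url)))
        return None if node is None else node.get("policy")


matcher = UrlMatcher()
matcher.add_target("ads.example.com/banner", "block")
print(matcher.obtain_policy("ads.example.com/banner"))
```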
  • The URLs in this embodiment fall into two types. One is a URL that has been determined to need management and control; such a URL is a target URL. The other is a URL that the user enters and wants to access, which we call the access URL.
  • The URL corresponding to a page containing bad information is a target URL that needs to be controlled.
  • Likewise, if the network administrator wants to manage certain pages specifically, for example to place promotional information for a car brand on pages that display car information, the URLs corresponding to those pages are target URLs that need to be managed.
  • Before an access URL is matched, it is segmented by the segmentation module 402 and mapped by the mapping module 404, and it is then matched against the pre-stored target URLs. Therefore, before any access URL is matched, the segmentation module 402 and the mapping module 404 should first perform the same processing on the target URLs. In this embodiment, the segmentation performed by the segmentation module 402 and the mapping performed by the mapping module 404 can be regarded as the "preprocessing" that every URL undergoes.
  • The segmentation module 402 and the mapping module 404 should process the target URL in the same way as the access URL; that is, the preprocessing rules for access URLs should be consistent with those for target URLs. Because target URLs are preprocessed first, the preprocessing of the target URLs determines the processing rules for access URLs. For brevity, this embodiment does not distinguish between target URLs and access URLs when describing URL preprocessing.
  • a URL includes at least a host domain name Hostname and a resource path Path.
  • the Hostname is a domain name unit
  • the Path is a resource path unit.
  • The segmentation module 402 uses "." as the separator and divides the Hostname from right to left into the top-level domain name, second-level domain name, third-level domain name, and registered domain name. Top-level domain names are divided into national top-level domain names, such as ".cn", ".uk", and ".de", and international top-level domain names, such as ".com", ".net", and ".org".
  • The content between two adjacent "." separators is the smallest domain name segment.
  • For the resource path, the segmentation module 402 uses "/" as the separator, and the content between two adjacent "/" separators is the smallest resource path segment.
  • The minimum domain name segment and the minimum resource path segment are the basic constituent units of a URL. Therefore, after the segmentation module 402 segments either the domain name unit or the resource path unit of a URL, each resulting URL segment should contain at least one complete minimum constituent unit; that is, each URL segment includes at least the complete content between two separators in the URL.
  • the two separators mentioned here can be any two separators in a URL.
  • The mapping module 404 maps each URL segment to a one-to-one corresponding key value according to a preset mapping rule.
  • the mapping module 404 can perform calculation processing on each URL segment by using a preset algorithm, and use the calculated result as a key value of the URL segment.
  • Although key values are not configured for each URL segment before the result is computed, the algorithm is uniform, so each URL segment still has a one-to-one corresponding key value under this fixed algorithm.
  • When the mapping module 404 maps URL segments to key values by using a preset algorithm, it preferably ensures that different URLs correspond to different key values, so that no conflict arises when URLs are subsequently matched.
  • the hash algorithm can map binary values of arbitrary length into shorter fixed-length binary values. This small binary value is called a hash value.
  • a hash value is a unique and extremely compact numerical representation of a piece of data. If you hash a plaintext and even change only one letter of the paragraph, subsequent hashes will produce different values. Therefore, in this embodiment, the mapping module 404 may use a hash algorithm to process the URL segment, and use the processing result as the key value corresponding to the URL segment.
  • This embodiment also provides another URL matching device. Please refer to FIG. 5:
  • The URL matching device 40 includes a segmentation module 402, a mapping module 404, a matching module 406, an obtaining module 408, and a storage module 410.
  • The storage module 410 is configured to store each target URL segment obtained by segmentation and its corresponding target key value level by level according to the preset matching order, so that management control can be performed in the subsequent process when the access URL entered by the user is determined to be a target URL. The preset matching order may be set by the administrator.
  • The storage function of the storage module 410 applies to target URLs. If the obtained URL is a target URL, then after segmentation by the segmentation module 402 and mapping by the mapping module 404, the storage module 410 stores each target URL segment together with its target key value.
  • the storage module 410 can store the target URL segments and their corresponding target key values in a step-by-step manner according to a preset matching order, so that the matching can be performed step by step in the subsequent matching process.
  • As for the preset matching order, a common practice is to match the URL segments in the domain name unit first and then the URL segments in the resource path unit. When the URL segments in the domain name unit are matched, matching proceeds from right to left; that is, the top-level domain name is matched first, followed by the second-level domain name, the third-level domain name, and the registered domain name.
  • When the URL segments in the resource path unit are matched, the order is from left to right.
  • The top-level domain name can be given the highest storage level; the second-level domain name is stored under the top-level domain name, followed in order by the third-level domain name, the registered domain name, and then the resource path unit.
  • The target URL segments of the resource path unit are stored in order, starting from the leftmost target URL segment and ending with the rightmost.
  • When the storage module 410 stores the target URL segments, it may use a tree storage structure: the target URL segments and their corresponding target key values are stored according to the preset matching order to obtain a storage tree. A target URL segment that is matched earlier serves as the parent node of the target URL segment that is matched later, and the later-matched target URL segment serves as a child node of the earlier-matched one.
  • each target URL segment under the domain name unit and each target URL segment under the resource path unit are not distinguished, that is, each target URL segment in the target URL is stored in a storage tree. However, if there are too many target URLs for management control, storing the URL segments under the domain name unit and the resource path unit in one storage tree may cause a storage tree to be too large.
  • This embodiment further provides another way for the storage module 410 to store the target URL segments: each target URL segment in the domain name unit of the target URL and its target key value are stored according to the tree storage structure to obtain a domain name storage tree; each target URL segment in the resource path unit of the target URL and its target key value are stored according to the tree storage structure to obtain a resource path storage tree; and a leaf node of the domain name storage tree points to the root node of the resource path storage tree. In other words, the target URL segments in the domain name unit and the target URL segments in the resource path unit are stored in two storage trees that are connected by pointers.
  • the matching module 406 matches each access key value corresponding to each access URL segment obtained by the segment processing to each target key value corresponding to each stored target URL segment in a preset matching order.
  • When the obtained URL is the URL of a page that the user wants to access, after segmentation and mapping are performed on the access URL, the matching module 406 matches each resulting access URL segment step by step against the stored target URL segments.
  • Step-by-step matching means that an access URL segment is matched against each target URL segment at the same level: the matching module 406 compares the access key value of the access URL segment with the target key value corresponding to each of those target URL segments.
  • After a match succeeds, the matching module 406 selects, according to the preset matching order, the next access URL segment and matches it against the child nodes of the successfully matched target URL segment. This process loops until all access URL segments of the URL have matched successfully, or until some access URL segment fails to match any corresponding target URL segment, and a prompt is then returned.
  • the obtaining module 408 is configured to obtain a processing policy for the access URL when each of the access URL segments of the access URL matches the corresponding target URL segment.
  • after it is determined that the access URL entered by the user needs management and control, the obtaining module 408 should obtain a processing policy for the URL, for example, blocking the page corresponding to the URL or adding certain advertising information.
  • the URL matching device 40 provided in this embodiment may be deployed on a server, where the segmentation module 402, the mapping module 404, the matching module 406, and the obtaining module 408 may all be implemented by a processor in the server, and the storage module 410 may be implemented by the processor together with the memory.
  • the URL matching apparatus performs the same segmentation and mapping on the target URL and the access URL, so that a complex URL is represented by simple values, which reduces the matching difficulty, improves the matching efficiency, and shortens the user's waiting time, improving the user experience.
  • because the matching device proposed in this embodiment segments the resource path, web pages whose domain name unit is normal but whose resource path contains harmful information can be managed at a fine granularity, making management more fine-grained, more effective, and more accurate.
  • Embodiments of the present disclosure also provide a storage medium.
  • the foregoing storage medium may be configured to store program code for performing the following steps:
  • the storage medium is further arranged to store program code for performing the following steps:
  • S1 The URL segment is processed according to a hash algorithm, and the processing result is used as a key value corresponding to the URL segment.
  • the storage medium is further arranged to store program code for performing the following steps:
  • the target URL segments obtained by segmentation and their corresponding target key values are stored level by level in the preset matching order, with target URL segments that are matched earlier, and their target key values, stored at higher levels.
  • the storage medium is further arranged to store program code for performing the following steps:
  • the leaf node of the domain name unit storage tree points to a root node of the resource path unit storage tree.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
  • the processor executes the method steps described in the foregoing embodiments according to the stored program code in the storage medium.
  • the modules or steps of the above embodiments of the present disclosure may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of multiple computing devices.
  • they may be implemented by program code executable by the computing device, so that they may be stored in a storage medium (ROM/RAM, magnetic disk, optical disc) and executed by the computing device. In some cases the steps shown or described may be performed in an order different from that herein, or they may be separately fabricated into individual integrated circuit modules, or a plurality of the modules or steps may be implemented as a single integrated circuit module. Therefore, the present disclosure is not limited to any specific combination of hardware and software.
  • at least one of the domain name unit and the resource path unit of the obtained access URL is segmented according to a preset division rule, and each resulting access URL segment is then mapped to a corresponding key value;
  • the access key value of each access URL segment is matched level by level against the stored target key values of the target URL segments. If an access key value equals a target key value, the access URL segment matches the target URL segment; if every access URL segment of the access URL matches a corresponding target URL segment, the access URL is one of the URLs that need to be managed.
  • by segmenting the URL, this matching scheme improves matching accuracy and allows the URLs that need control to be managed precisely.
  • mapping the URL segments to key values and matching the key values directly improves matching efficiency, reduces the user's waiting time, and improves the user experience.

Abstract

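Embodiments of the present disclosure provide a URL matching method, apparatus, and storage medium. At least one of the domain name unit and the resource path unit of an obtained access URL is segmented according to a preset division rule, and each resulting access URL segment is mapped to a corresponding key value. The access key value of each access URL segment is then matched level by level against the stored target key values of the target URL segments: if an access key value equals a target key value, the access URL segment matches the target URL segment, and if every access URL segment of the access URL matches a corresponding target URL segment, the access URL is one of the URLs that need to be managed and controlled. Segmenting the URL improves matching accuracy and allows the URLs that need control to be managed precisely; mapping the URL segments to key values and matching the key values directly improves matching efficiency, reduces the user's waiting time, and improves the user experience.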

Description

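URL Matching Method, Apparatus, and Storage Medium
Technical Field
The present disclosure relates to the field of network technologies, and in particular to a URL matching method, apparatus, and storage medium.
Background
A URL (Uniform Resource Locator) is the address of a standard resource on the Internet. Every resource on the Internet has a unique URL, which indicates the location of the network resource and how a browser should handle it. According to the protocol, the format of a URL is as follows: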
protocol://Hostname[:port]/Path/[;parameters][?query]#fragment
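Here, Hostname (the host domain name) and Path (the resource path) are mandatory, while the port or parameter information enclosed in "[]" is optional.
URL-based matching algorithms are mainly used for URL management. They are widely used in access control fields such as intrusion detection, filtering of harmful information, and network traffic control, and are one of the core components of a firewall system. In a network management system, URL matching is mainly applied in two areas: rule policy configuration and website response filtering. Rule policy configuration processes the display of some web pages that have already been developed, for example by placing advertisements that match the page content on pages of a particular type. Placing advertisements here does not mean that page developers add advertisements during development; it means that, after the pages have been developed, network administrators add advertisements to some pages as needed. Website response filtering means blocking the pages of undesirable websites. In the related art, the URLs of undesirable websites are stored as records. After a URL entered by a user is received, the Hostname field in the HTTP header information carried by the access request is matched against the records; if the URL the user is currently accessing exists in the records, the system blocks the page corresponding to that URL and returns a prompt to the user.
Today the number of web addresses is enormous, so URL management is becoming increasingly heavy and complex. For website response filtering, for example, a URL blacklist may contain millions or tens of millions of entries. After a user enters a URL, it has to be matched against those millions or tens of millions of URLs in the blacklist, and the resource corresponding to the entered URL is provided to the user only after matching finishes and the Hostname of the URL is found not to be in the blacklist. This matching process may take considerable time, which greatly affects the user experience. How to match URLs efficiently at large scale is therefore a pressing problem in network management.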
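Summary
Embodiments of the present disclosure provide a URL matching method, apparatus, and storage medium. The main technical problem addressed is to provide a new URL matching scheme, different from the related art, that overcomes the low matching efficiency and long matching time of the related art when matching large numbers of URLs.
To solve the above technical problem, an embodiment of the present disclosure provides a URL matching method, including:
segmenting at least one of a domain name unit and a resource path unit of an obtained access URL according to a preset division rule to obtain access URL segments, where each access URL segment includes at least the complete content between two separators in the access URL;
mapping each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule;
matching the access key values level by level, in a preset matching order, against the target key values corresponding to the stored target URL segments;
obtaining a processing policy for the access URL when every access URL segment of the access URL matches a corresponding target URL segment.
An embodiment of the present disclosure further provides a URL matching apparatus, including:
a segmentation module, configured to segment at least one of a domain name unit and a resource path unit of an obtained access URL according to a preset division rule to obtain access URL segments, where each access URL segment includes at least the complete content between two separators in the access URL;
a mapping module, configured to map each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule;
a matching module, configured to match the access key values level by level, in a preset matching order, against the target key values corresponding to the stored target URL segments;
an obtaining module, configured to obtain a processing policy for the access URL when every access URL segment of the access URL matches a corresponding target URL segment.
An embodiment of the present disclosure further provides a storage medium storing computer-executable instructions for performing the foregoing URL matching method.
The beneficial effects of the present disclosure are as follows:
With the URL matching method, apparatus, and storage medium provided in the embodiments of the present disclosure, at least one of the domain name unit and the resource path unit of an obtained access URL is segmented according to a preset division rule, and each resulting access URL segment is mapped to a corresponding key value. The access key value of each access URL segment is then matched level by level against the stored target key values of the target URL segments. If an access key value equals a target key value, the access URL segment matches the target URL segment; if every access URL segment of the access URL matches a corresponding target URL segment, the access URL is one of the URLs that need to be managed and controlled. Segmenting the URL improves matching accuracy and allows the URLs that need control to be managed precisely; mapping the URL segments to key values and matching the key values directly improves matching efficiency, reduces the user's waiting time, and improves the user experience.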
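Brief Description of the Drawings
FIG. 1 is a flowchart of the URL matching method provided in Embodiment 1 of the present disclosure;
FIG. 2 is a schematic diagram of the storage structure used to store target URL segments in Embodiment 1 of the present disclosure;
FIG. 3 is a schematic flowchart of processing a target URL provided in Embodiment 2 of the present disclosure;
FIG. 4 is a schematic structural diagram of the URL matching apparatus provided in Embodiment 3 of the present disclosure;
FIG. 5 is another schematic structural diagram of the URL matching apparatus provided in Embodiment 3 of the present disclosure.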
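Detailed Description
The embodiments of the present disclosure are described in detail below through optional implementations with reference to the accompanying drawings.
Embodiment 1:
In the related art, matching a user's access URL against a large number of URLs that require management and control is inefficient and time-consuming, which degrades the user experience. To solve this technical problem, this embodiment provides a URL matching method; see FIG. 1:
URLs in this embodiment fall into two types. One is a URL that has been determined to need management and control; such a URL is a target URL. The other is a URL that a user enters and wants to access, which we call an access URL.
When a network administrator finds, through user reports or complaints or through the administrator's own inspection, pages that contain harmful information and need to be blocked, the URLs of those pages are target URLs that need to be controlled. In addition, if the network administrator wants to manage certain pages in a special way, for example to place promotional information for a certain car brand on pages that display car information, the URLs of those pages are also target URLs that need to be managed.
When a user clicks a link on a web page or enters a web address directly to access a page, the access URL of that page needs to be matched to determine whether it is one of the target URLs that need management and control. If it is, the page corresponding to the access URL is managed and controlled according to the corresponding processing policy; if not, the resource the user needs can be obtained directly from the access URL and presented to the user.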
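S102: Segment at least one of the domain name unit and the resource path unit of the obtained access URL according to a preset division rule to obtain access URL segments.
S104: Map each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule.
Before an access URL is matched, it undergoes two processes, segmentation and mapping, and it is then matched against target URLs stored in advance. The target URLs should therefore be processed in the same way before any access URL is matched. In this embodiment, segmentation and mapping can be regarded as a "preprocessing" stage that every URL goes through.
Target URLs should be processed in the same way as access URLs; in other words, the preprocessing rules for access URLs should be consistent with those for target URLs. Because target URLs are preprocessed first, their preprocessing determines the rules for processing access URLs. For brevity, the description of URL preprocessing in this embodiment does not distinguish between target URLs and access URLs.
As described above, a URL includes at least two parts, the host domain name Hostname and the resource path Path. In this embodiment, Hostname is the domain name unit and Path is the resource path unit. In the domain name unit, Hostname is divided at "." from right to left into the top-level domain, the second-level domain, the third-level domain, and the registered domain. Top-level domains are divided into country top-level domains, such as ".cn", ".uk", and ".de", and international top-level domains, such as ".com", ".net", and ".org". The content between two "." characters is a minimal domain name segment. The resource path is divided at "/", and the content between two "/" characters is a minimal resource path segment.
As can be seen from the above, the minimal domain name segment and the minimal resource path segment are the basic building blocks of a URL. Therefore, after either the domain name unit or the resource path unit of a URL is segmented, each resulting URL segment should be at least one complete minimal unit; that is, each URL segment includes at least the complete content between two separators in the URL. The two separators referred to here may be any two separators in a URL.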
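The division rule described above can be sketched as follows. This is only a minimal illustration, assuming the hostname and path have already been extracted from the URL; the function names `split_domain` and `split_path` are hypothetical and do not come from the disclosure.

```python
def split_domain(hostname: str) -> list[str]:
    """Split a domain name unit at '.'; each item is a minimal domain name segment."""
    return [seg for seg in hostname.split(".") if seg]


def split_path(path: str) -> list[str]:
    """Split a resource path unit at '/'; each item is a minimal resource path segment."""
    return [seg for seg in path.split("/") if seg]


# Hypothetical example values, not taken from the disclosure.
domain_segments = split_domain("news.example.com")
path_segments = split_path("/auto/2017/index.html")
```

Each returned item is the complete content between two separators, which is the smallest segment the rule allows; a coarser rule could join several adjacent items into one segment.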
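Before URL segments are mapped, a mapping rule can be preset. The mapping rule is the correspondence between URL segments and key values, as shown in Table 1:
Table 1
URL segment Key value
.com 1
.cn 2
.org 5
If a URL segment obtained by segmentation is ".cn", its key value is 2; if the URL segment is ".org", the corresponding key value is 5.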
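A preset mapping rule of this kind amounts to a lookup table. The sketch below encodes the three rows of Table 1; the names `PRESET_KEY_VALUES` and `lookup_key_value` are illustrative assumptions.

```python
from typing import Optional

# The three rows of Table 1: URL segment -> preset key value.
PRESET_KEY_VALUES = {".com": 1, ".cn": 2, ".org": 5}


def lookup_key_value(segment: str) -> Optional[int]:
    """Return the preset key value of a segment, or None if none was configured."""
    return PRESET_KEY_VALUES.get(segment)
```

A segment that is missing from the table, such as ".net" here, has no key value, which is the limitation that motivates computing key values with a fixed algorithm instead.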
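Understandably, there are now so many websites, and countless web pages, that presetting a key value for every URL segment is not realistic. For the massive number of resource paths in particular, presetting a key value for every URL segment of every resource path unit is practically impossible. To solve this problem, a preset algorithm can be applied to each URL segment, and the computed result is used as the key value of the URL segment. In this case no key value is configured for a URL segment before the computation, but because the same algorithm is used throughout, each URL segment still obtains a one-to-one corresponding key value from this fixed algorithm.
Mapping URL segments to key values with a preset algorithm should ideally ensure that different URL segments correspond to different key values, so that no conflicts arise in subsequent matching. A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, called the hash value. A hash value is an extremely compact numeric representation of a piece of data that is, for practical purposes, unique: if a piece of plaintext is hashed and even a single letter of it is changed, the subsequent hash is expected to produce a different value. Therefore, in this embodiment, a hash algorithm can be used to process a URL segment, and the result is used as the key value corresponding to that URL segment.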
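As a hedged illustration of the hash-based mapping rule, the sketch below derives a 64-bit key value from a URL segment. The disclosure does not name a specific hash algorithm; BLAKE2b from Python's `hashlib`, the 8-byte digest size, and the function name `key_value` are all assumptions made for this example.

```python
import hashlib


def key_value(segment: str) -> int:
    """Map a URL segment of any length to a fixed-length 64-bit key value.

    BLAKE2b is an assumed choice; the text only requires "a hash algorithm".
    """
    digest = hashlib.blake2b(segment.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Because the same function is applied to target and access URL segments, equal segments always yield equal key values. Distinct segments can in principle collide, so an implementation that must rule out false matches could also compare the stored segment text when two key values are equal.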
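After target URLs and access URLs have been segmented and mapped, the two types of URL are handled differently:
For a target URL, after the target URL segments and their one-to-one corresponding target key values are obtained, the target URL segments and their target key values are stored in the preset matching order, so that management and control can be applied later when an access URL entered by a user is that target URL. The preset matching order can be set by the administrator.
When the target URL segments and their target key values are stored, they can be stored level by level in the preset matching order, so that matching can later also proceed level by level. To improve matching efficiency, the usual practice is to match the URL segments of the domain name unit first and then those of the resource path unit. The URL segments of the domain name unit are matched from right to left: the top-level domain first, then the second-level domain, the third-level domain, and the registered domain. The URL segments of the resource path unit are matched from left to right. Accordingly, during storage the top-level domain can be given the highest storage level, with the second-level domain stored below it, followed by the third-level domain and the registered domain, then the leftmost target URL segment of the resource path unit, and so on in order until the rightmost target URL segment has been stored.
In this embodiment, target URL segments can be stored in a tree structure: the target URL segments and their target key values are stored in the preset matching order to obtain a storage tree, in which a target URL segment matched earlier is the parent node of the target URL segment matched after it, and the segment matched later is its child node. This storage method does not distinguish between target URL segments of the domain name unit and those of the resource path unit; all target URL segments of a target URL are stored in one storage tree. However, if there are too many target URLs to manage, storing the URL segments of both the domain name unit and the resource path unit in a single tree may make that tree too large.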
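The single-tree storage described above could look roughly like the following sketch. It assumes each child is indexed by its key value, reuses an assumed BLAKE2b-based `key_value` function, and attaches the processing policy to the node that ends a target URL; `Node`, `ordered_segments`, `store_target`, and the example hostnames are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def key_value(segment: str) -> int:
    """Assumed mapping rule: a 64-bit BLAKE2b hash of the segment."""
    digest = hashlib.blake2b(segment.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


@dataclass
class Node:
    segment: str = ""
    children: Dict[int, "Node"] = field(default_factory=dict)  # key value -> child
    policy: Optional[str] = None  # set on the node that ends a target URL


def ordered_segments(hostname: str, path: str) -> List[str]:
    """Domain segments right to left, then path segments left to right."""
    domain = [s for s in hostname.split(".") if s][::-1]
    return domain + [s for s in path.split("/") if s]


def store_target(root: Node, hostname: str, path: str, policy: str) -> None:
    """Store one target URL so that earlier-matched segments are parents."""
    node = root
    for seg in ordered_segments(hostname, path):
        node = node.children.setdefault(key_value(seg), Node(segment=seg))
    node.policy = policy


root = Node()
store_target(root, "bad.example.com", "/ads/banner", "block")
store_target(root, "cars.example.com", "/list", "insert_ad")
```

Targets that share a suffix of the domain name, here "example.com", share the upper levels of the tree, so those segments are stored once.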
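This embodiment also provides another way to store target URL segments: the target URL segments of the domain name unit of a target URL, together with their target key values, are stored in a tree structure to obtain a domain name storage tree; the target URL segments of the resource path unit, together with their target key values, are stored in a tree structure to obtain a resource path storage tree; and a leaf node of the domain name storage tree points to the root node of the resource path storage tree. In other words, the target URL segments of the domain name unit and those of the resource path unit are stored in two separate storage trees that are linked by pointers.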
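A possible shape for the two linked trees is sketched below. For brevity the children are keyed by the segment text itself, whereas the scheme described above would key them by the segment's key value. The class names, the `path_root` attribute, and attaching the pointer to whichever node ends the domain name unit are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PathNode:
    children: Dict[str, "PathNode"] = field(default_factory=dict)
    policy: Optional[str] = None  # set on the node that ends a target URL


@dataclass
class DomainNode:
    children: Dict[str, "DomainNode"] = field(default_factory=dict)
    path_root: Optional[PathNode] = None  # pointer into a resource path tree


def store_target(domain_root: DomainNode, hostname: str, path: str, policy: str) -> None:
    """Store the domain unit in one tree and the path unit in a linked tree."""
    d = domain_root
    for seg in reversed(hostname.split(".")):      # right to left
        d = d.children.setdefault(seg, DomainNode())
    if d.path_root is None:                        # link the two trees
        d.path_root = PathNode()
    p = d.path_root
    for seg in (s for s in path.split("/") if s):  # left to right
        p = p.children.setdefault(seg, PathNode())
    p.policy = policy


domain_tree = DomainNode()
store_target(domain_tree, "a.example.com", "/x/y", "block")
```

Keeping the path segments in their own tree lets the domain name tree stay small even when one host has a very large number of managed paths.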
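S106: Match the access key values level by level, in the preset matching order, against the target key values corresponding to the stored target URL segments.
S106 is the matching process for an access URL; that is, the obtained URL is the URL of the page the user wants to access. After the access URL has been segmented and mapped, each resulting access URL segment is matched level by level against the stored target URL segments. Level-by-level matching means matching an access URL segment against every target URL segment at the same level; what is actually compared is the access key value of the access URL segment and the target key value of each target URL segment. After a successful match, the next access URL segment in the preset matching order is matched against the child nodes of the matched target URL segment. This process repeats until all access URL segments of the access URL have matched, or until some access URL segment cannot be matched to a target URL segment, at which point a prompt is returned.
As shown in FIG. 2, the storage tree stores the domain name units of three target URLs: "a.b.ccd", "g.a.12.cd", and "d.13.cd". Suppose the user enters the access URL "f.12.cd". In level-by-level matching, the access URL segment "cd" is first matched against the target URL segments "ccd" and "cd". After that match succeeds, "12" is matched against "12" and "13". These first two matches succeed, but matching "f" against "a" fails. This indicates that the access URL entered by the user does not need management and control, so the requested page resource can be returned to the user directly. If, for an access URL entered by the user, every URL segment of the domain name unit matches a corresponding target URL segment and every URL segment of the resource path also matches, the access URL is one of the URLs that need to be managed and controlled.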
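The FIG. 2 walk-through can be reproduced with a small sketch. It stores only the three domain name units from the example in a nested dictionary keyed by key value and counts how many levels an access hostname matches. The BLAKE2b-based `key_value` function and the helper names are assumptions, not part of the disclosure.

```python
import hashlib


def key_value(segment: str) -> int:
    """Assumed mapping rule: a 64-bit BLAKE2b hash of the segment."""
    digest = hashlib.blake2b(segment.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def store_domain(tree: dict, hostname: str) -> None:
    """Store a domain name unit right to left, one level per segment."""
    node = tree
    for seg in reversed(hostname.split(".")):
        node = node.setdefault(key_value(seg), {})


def matched_levels(tree: dict, hostname: str) -> int:
    """Return how many segments matched before the walk stopped."""
    node = tree
    levels = 0
    for seg in reversed(hostname.split(".")):
        child = node.get(key_value(seg))  # compare key values at this level
        if child is None:
            break
        node = child
        levels += 1
    return levels


tree: dict = {}
for target in ("a.b.ccd", "g.a.12.cd", "d.13.cd"):
    store_domain(tree, target)

# "cd" and "12" match; "f" has no counterpart under "12", so the walk stops.
levels = matched_levels(tree, "f.12.cd")
```

Here `levels` is 2 out of 3 segments, so the access URL is not a managed URL. A complete implementation would also mark which nodes end a stored target, so that an access URL matching only a proper prefix of a longer target is not reported as a full match.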
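S108: When every access URL segment of the access URL matches a corresponding target URL segment, obtain the processing policy for the access URL.
Once it is determined that the access URL entered by the user needs management and control, the processing policy for that URL should be obtained, for example blocking the page corresponding to the URL or adding certain advertising information.
The URL matching method provided in this embodiment applies the same segmentation and mapping to target URLs and access URLs, so that a URL can be represented as a combination of several key values. Matching a URL then amounts to matching some simple values in order. This turns the matching of a complex URL into the matching of simple values, which improves matching efficiency, shortens the user's waiting time, and improves the user experience. In addition, because the matching method proposed in this embodiment segments the resource path, web pages whose domain name unit is normal but whose resource path contains harmful information can be managed at a fine granularity, making management more fine-grained, more effective, and more precise.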
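Embodiment 2:
In the URL matching method of Embodiment 1, every target URL is segmented regardless of its form. This embodiment proposes a different way of processing; see FIG. 3:
S301: Extract the domain name unit and the resource path unit from the target URL.
Extracting the domain name unit and the resource path unit essentially means extracting the valuable parts; conversely, the parts of no value can be removed. First, the parameter and fragment parts of the target URL, which have no value for matching, can be removed. The parameter part is marked by "?" and the fragment part by "#", so everything after "?" and "#" can simply be deleted. In addition, because the protocol specifies that the content in "[]" is optional, those parts can also be removed directly.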
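One way to carry out this extraction is with the standard library's `urllib.parse.urlparse`, which already separates the port, the ";parameters" part, the "?query" part, and the "#fragment" part. The helper name `extract_units` and the example URL are assumptions for illustration.

```python
from urllib.parse import urlparse


def extract_units(url: str) -> tuple[str, str]:
    """Return (domain name unit, resource path unit) of a URL.

    The port, ";parameters", "?query", and "#fragment" parts are dropped.
    """
    parts = urlparse(url)
    return parts.hostname or "", parts.path


# Hypothetical URL with every optional part present.
units = extract_units("http://www.example.com:8080/news/a.html;v=1?id=7#top")
print(units)  # ('www.example.com', '/news/a.html')
```

Note that `urlparse` lowercases the hostname, which is convenient here because domain names are case-insensitive.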
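S302: Determine whether a wildcard exists in the domain name unit or the resource path unit.
A wildcard means that the part it stands for may take any form. If neither the domain name unit nor the resource path unit contains a wildcard, S303 is performed; otherwise S304 is performed.
S303: Map the URL directly and store it.
The mapping can be performed directly with a hash algorithm. Note that when such wildcard-free URLs are stored, a hash table can be used instead of a hash storage tree.
S304: Segment the domain name unit and the resource path unit.
For the segmentation process, see Embodiment 1; details are not repeated here. It will be understood that the domain name unit and the resource path unit need not both be segmented: only one of them may be segmented to obtain the corresponding target URL segments, while the other is not segmented and is instead used directly as a single target URL segment.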
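The option of segmenting only one of the two units can be sketched as below. The function name `to_segments` and its two flags are hypothetical; the ordering (domain name segments right to left, then path segments left to right) follows the matching order of Embodiment 1.

```python
def to_segments(hostname: str, path: str,
                split_domain: bool = True, split_path: bool = True) -> list[str]:
    """Segment either unit, or keep it whole as a single target URL segment."""
    domain = hostname.split(".")[::-1] if split_domain else [hostname]
    path_parts = [s for s in path.split("/") if s] if split_path else [path]
    return domain + path_parts


# Only the domain name unit is segmented; the path stays one segment.
mixed = to_segments("a.example.com", "/x/y", split_path=False)
```

Whichever choice is made for target URLs has to be applied to access URLs as well, otherwise the resulting key values cannot line up.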
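S305: Map the target URL segments obtained by segmentation and store them.
It should be understood that mapping must be performed consistently for all URLs. For example, in this embodiment, wildcard-free URLs are mapped with a hash algorithm, so the URL segments of a resource path unit or domain name unit containing wildcards should also be mapped with the hash algorithm, and access URL segments should likewise be mapped with the hash algorithm later on, so that key values and URL segments truly correspond.
In this embodiment, the target URL segments obtained by segmentation and their target key values are still stored as in Embodiment 1: they are stored in a tree structure in the preset matching order to form a storage tree.
Depending on where the wildcard is located, wildcards can be divided into prefix wildcards and suffix wildcards. For convenient matching, the URL segments of domain name units with prefix wildcards and those of domain name units with suffix wildcards can be stored separately; likewise, the URL segments of resource path units with prefix wildcards and those with suffix wildcards can be stored separately.
When a user's access URL is received, the parameter and fragment parts of no value can first be removed from the access URL, and the hash value of the remaining access URL is computed. This hash value is first matched against the hash values stored in the hash table. If the match succeeds, the corresponding processing policy is obtained; if it fails, the access URL can be segmented and then matched as described in Embodiment 1, which is not repeated here.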
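The two-stage lookup of this embodiment (hash table first, storage tree as a fallback) might be sketched as follows. The disclosure does not specify the wildcard symbol or how a wildcard node is matched, so the "*" symbol, the rule that a wildcard segment matches exactly one access segment, the absence of backtracking, and the `Matcher` class itself are all assumptions of this sketch.

```python
import hashlib
from typing import Dict, List, Optional

WILDCARD = "*"  # assumed wildcard symbol


def key_value(text: str) -> int:
    """Assumed mapping rule: a 64-bit BLAKE2b hash."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def ordered_segments(hostname: str, path: str) -> List[str]:
    """Domain segments right to left, then path segments left to right."""
    return hostname.split(".")[::-1] + [s for s in path.split("/") if s]


class Matcher:
    def __init__(self) -> None:
        self.exact: Dict[int, str] = {}  # hash table for wildcard-free targets
        self.tree: dict = {}             # storage tree for targets with wildcards

    def add_target(self, hostname: str, path: str, policy: str) -> None:
        if WILDCARD not in hostname and WILDCARD not in path:
            self.exact[key_value(hostname + path)] = policy  # S303
            return
        node = self.tree                                     # S304 and S305
        for seg in ordered_segments(hostname, path):
            node = node.setdefault(key_value(seg), {})
        node["policy"] = policy

    def lookup(self, hostname: str, path: str) -> Optional[str]:
        policy = self.exact.get(key_value(hostname + path))  # stage 1
        if policy is not None:
            return policy
        node = self.tree                                     # stage 2
        for seg in ordered_segments(hostname, path):
            nxt = node.get(key_value(seg))
            if nxt is None:
                nxt = node.get(key_value(WILDCARD))          # assumed semantics
            if nxt is None:
                return None
            node = nxt
        return node.get("policy")


matcher = Matcher()
matcher.add_target("bad.example.com", "/x", "block")
matcher.add_target("*.ads.example.com", "/banner", "block_ads")
```

With these two targets, `matcher.lookup("bad.example.com", "/x")` is answered by the hash table alone, while `matcher.lookup("eu.ads.example.com", "/banner")` misses the table and is resolved by walking the tree.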
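In the related matching method, the domain name unit and resource path unit of the access URL are matched directly against those of the target URLs, and the time complexity is O(n^3), where n is the number of target URLs. By contrast, matching with the URL matching method of this embodiment does not depend on the number of target URLs, and the time complexity of its algorithm is O(L). The method of this embodiment therefore greatly reduces the complexity of the matching process and improves matching efficiency.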
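Embodiment 3:
This embodiment provides a URL matching apparatus that can implement the URL matching methods of Embodiments 1 and 2. The apparatus is described below; see FIG. 4:
The URL matching apparatus 40 includes a segmentation module 402, a mapping module 404, a matching module 406, and an obtaining module 408.
The segmentation module 402 is configured to segment at least one of the domain name unit and the resource path unit of an obtained access URL according to a preset division rule to obtain access URL segments. The mapping module 404 is configured to map each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule. The matching module 406 is configured to match the access key values level by level, in a preset matching order, against the target key values corresponding to the stored target URL segments. The obtaining module 408 is configured to obtain a processing policy for the access URL when every access URL segment of the access URL matches a corresponding target URL segment.
URLs in this embodiment fall into two types. One is a URL that has been determined to need management and control; such a URL is a target URL. The other is a URL that a user enters and wants to access, which we call an access URL.
When a network administrator finds, through user reports or complaints or through the administrator's own inspection, pages that contain harmful information and need to be blocked, the URLs of those pages are target URLs that need to be controlled. In addition, if the network administrator wants to manage certain pages in a special way, for example to place promotional information for a certain car brand on pages that display car information, the URLs of those pages are also target URLs that need to be managed.
When a user clicks a link on a web page or enters a web address directly to access a page, the access URL of that page needs to be matched to determine whether it is one of the target URLs that need management and control. If it is, the page corresponding to the access URL is managed and controlled according to the corresponding processing policy; if not, the resource the user needs can be obtained directly from the access URL and presented to the user.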
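Before an access URL is matched, it is segmented by the segmentation module 402 and mapped by the mapping module 404, and it is then matched against target URLs stored in advance. The segmentation module 402 and the mapping module 404 should therefore process the target URLs in the same way before any access URL is matched. In this embodiment, the segmentation performed by the segmentation module 402 and the mapping performed by the mapping module 404 can be regarded as a "preprocessing" stage that every URL goes through.
The segmentation module 402 and the mapping module 404 should process target URLs in the same way as access URLs; in other words, the preprocessing rules for access URLs should be consistent with those for target URLs. Because target URLs are preprocessed first, their preprocessing determines the rules for processing access URLs. For brevity, the description of URL preprocessing in this embodiment does not distinguish between target URLs and access URLs.
As described above, a URL includes at least two parts, the host domain name Hostname and the resource path Path. In this embodiment, Hostname is the domain name unit and Path is the resource path unit. In the domain name unit, the segmentation module 402 divides Hostname at "." from right to left into the top-level domain, the second-level domain, the third-level domain, and the registered domain. Top-level domains are divided into country top-level domains, such as ".cn", ".uk", and ".de", and international top-level domains, such as ".com", ".net", and ".org". The content between two "." characters is a minimal domain name segment. In the resource path, the segmentation module 402 divides at "/", and the content between two "/" characters is a minimal resource path segment.
As can be seen from the above, the minimal domain name segment and the minimal resource path segment are the basic building blocks of a URL. Therefore, after the segmentation module 402 segments either the domain name unit or the resource path unit of a URL, each resulting URL segment should be at least one complete minimal unit; that is, each URL segment includes at least the complete content between two separators in the URL. The two separators referred to here may be any two separators in a URL.
The mapping module 404 maps each URL segment to a one-to-one corresponding key value according to the preset mapping rule.
Understandably, there are now so many websites, and countless web pages, that presetting a key value for every URL segment is not realistic. For the massive number of resource paths in particular, presetting a key value for every URL segment of every resource path unit is practically impossible. To solve this problem, the mapping module 404 can apply a preset algorithm to each URL segment and use the computed result as the key value of the URL segment. In this case no key value is configured for a URL segment before the computation, but because the same algorithm is used throughout, each URL segment still obtains a one-to-one corresponding key value from this fixed algorithm.
When the mapping module 404 maps URL segments to key values with a preset algorithm, it should ideally ensure that different URL segments correspond to different key values, so that no conflicts arise in subsequent matching. A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, called the hash value. A hash value is an extremely compact numeric representation of a piece of data that is, for practical purposes, unique: if a piece of plaintext is hashed and even a single letter of it is changed, the subsequent hash is expected to produce a different value. Therefore, in this embodiment, the mapping module 404 can use a hash algorithm to process a URL segment and use the result as the key value corresponding to that URL segment.
After target URLs and access URLs have been segmented and mapped, the two types of URL are handled differently. This embodiment also provides another URL matching apparatus; see FIG. 5: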
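In addition to the segmentation module 402, the mapping module 404, the matching module 406, and the obtaining module 408, the URL matching apparatus 40 further includes a storage module 410. The storage module 410 is configured to store the target URL segments obtained by segmentation and their corresponding target key values level by level in the preset matching order, so that management and control can be applied later when an access URL entered by a user is that target URL. The preset matching order can be set by the administrator.
The storage module 410 operates on target URLs. If the obtained URL is a target URL, then after segmentation by the segmentation module 402 and mapping by the mapping module 404, the storage module 410 can store the target URL segments and their corresponding target key values, so that management and control can be applied later when the access URL entered by a user is determined to be that target URL.
When the storage module 410 stores the target URL segments and their target key values, it can store them level by level in the preset matching order, so that matching can later also proceed level by level. To improve matching efficiency, the usual practice is to match the URL segments of the domain name unit first and then those of the resource path unit. The URL segments of the domain name unit are matched from right to left: the top-level domain first, then the second-level domain, the third-level domain, and the registered domain. The URL segments of the resource path unit are matched from left to right. Accordingly, during storage the top-level domain can be given the highest storage level, with the second-level domain stored below it, followed by the third-level domain and the registered domain, then the leftmost target URL segment of the resource path unit, and so on in order until the rightmost target URL segment has been stored.
In this embodiment, the storage module 410 can store target URL segments in a tree structure: the target URL segments and their target key values are stored in the preset matching order to obtain a storage tree, in which a target URL segment matched earlier is the parent node of the target URL segment matched after it, and the segment matched later is its child node. This storage method does not distinguish between target URL segments of the domain name unit and those of the resource path unit; all target URL segments of a target URL are stored in one storage tree. However, if there are too many target URLs to manage, storing the URL segments of both the domain name unit and the resource path unit in a single tree may make that tree too large.
This embodiment also provides another way for the storage module 410 to store target URL segments: the target URL segments of the domain name unit of a target URL, together with their target key values, are stored in a tree structure to obtain a domain name storage tree; the target URL segments of the resource path unit, together with their target key values, are stored in a tree structure to obtain a resource path storage tree; and a leaf node of the domain name storage tree points to the root node of the resource path storage tree. In other words, the target URL segments of the domain name unit and those of the resource path unit are stored in two separate storage trees that are linked by pointers.
The matching module 406 matches the access key values of the access URL segments obtained by segmentation level by level, in the preset matching order, against the target key values corresponding to the stored target URL segments.
When the obtained URL is the URL of the page the user wants to access, after the access URL has been segmented and mapped, the matching module 406 matches each resulting access URL segment level by level against the stored target URL segments. Level-by-level matching means matching an access URL segment against every target URL segment at the same level; what the matching module 406 actually compares is the access key value of the access URL segment and the target key value of each target URL segment. After a successful match, the matching module 406 matches the next access URL segment in the preset matching order against the child nodes of the matched target URL segment. This process repeats until all access URL segments of the access URL have matched, or until some access URL segment cannot be matched to a target URL segment and a prompt is returned.
The obtaining module 408 is configured to obtain the processing policy for the access URL when every access URL segment of the access URL matches a corresponding target URL segment.
Once it is determined that the access URL entered by the user needs management and control, the obtaining module 408 should obtain the processing policy for that URL, for example blocking the page corresponding to the URL or adding certain advertising information.
The URL matching apparatus 40 provided in this embodiment can be deployed on a server. The segmentation module 402, the mapping module 404, the matching module 406, and the obtaining module 408 can all be implemented by a processor in the server, and the storage module 410 can be implemented by the processor together with a memory.
The URL matching apparatus provided in this embodiment applies the same segmentation and mapping to target URLs and access URLs, so that a complex URL is represented by simple values. This reduces the difficulty of matching, improves matching efficiency, shortens the user's waiting time, and improves the user experience. In addition, because the matching apparatus proposed in this embodiment segments the resource path, web pages whose domain name unit is normal but whose resource path contains harmful information can be managed at a fine granularity, making management more fine-grained, more effective, and more precise.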
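Embodiment 4:
An embodiment of the present disclosure further provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps:
S1: segment at least one of the domain name unit and the resource path unit of an obtained access URL according to a preset division rule to obtain access URL segments, where each access URL segment includes at least the complete content between two separators in the access URL;
S2: map each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule;
S3: match the access key values level by level, in a preset matching order, against the target key values corresponding to the stored target URL segments;
S4: if every access URL segment of the access URL matches a corresponding target URL segment, the access URL needs to be managed and controlled, and a processing policy for the access URL is obtained.
Optionally, the storage medium is further configured to store program code for performing the following step:
S1: process a URL segment with a hash algorithm and use the result as the key value corresponding to the URL segment.
Optionally, the storage medium is further configured to store program code for performing the following step:
S1: determine that at least one of the domain name unit and the resource path unit of the target URL includes a wildcard.
Optionally, the storage medium is further configured to store program code for performing the following steps:
S1: match the URL segments of the domain name unit first, and then the URL segments of the resource path unit;
S2: match the URL segments of the domain name unit from right to left;
S3: match the URL segments of the resource path unit from left to right.
Optionally, the storage medium is further configured to store program code for performing the following steps:
S1: segment at least one of the domain name unit and the resource path unit of an obtained target URL according to the preset division rule to obtain target URL segments, where the target URL is a URL that needs to be managed and controlled, and each target URL segment includes at least the complete content between two separators in the target URL;
S2: map each target URL segment to a one-to-one corresponding target key value according to the preset mapping rule;
S3: store the target URL segments obtained by segmentation and their corresponding target key values level by level in the preset matching order, with target URL segments that are matched earlier, and their target key values, stored at higher levels.
Optionally, the storage medium is further configured to store program code for performing the following step:
S1: store the target URL segments and their corresponding target key values in a tree structure to obtain a storage tree, in which a target URL segment matched earlier is the parent node of the target URL segment matched after it, and the segment matched later is its child node.
Optionally, the storage medium is further configured to store program code for performing the following steps:
S1: store the target URL segments of the domain name unit of the target URL and their target key values in a tree structure to obtain a corresponding domain name storage tree;
S2: store the target URL segments of the resource path unit of the target URL and their target key values in a tree structure to obtain a corresponding resource path storage tree;
S3: a leaf node of the domain name storage tree points to the root node of the resource path storage tree.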
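Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
Optionally, in this embodiment, a processor performs the method steps described in the foregoing embodiments according to the program code stored in the storage medium.
Optionally, for examples in this embodiment, refer to the examples described in the foregoing embodiments and optional implementations; details are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the above embodiments of the present disclosure may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they can be stored in a storage medium (ROM/RAM, magnetic disk, optical disc) and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described here, or they may be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. The present disclosure is therefore not limited to any specific combination of hardware and software.
The above is a detailed description of the embodiments of the present disclosure in connection with optional implementations, and the optional implementations of the present disclosure are not to be regarded as limited to these descriptions. For a person of ordinary skill in the art to which the present disclosure belongs, any simple deductions or substitutions made without departing from the concept of the present disclosure shall be regarded as falling within the scope of protection of the present disclosure.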
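Industrial Applicability
In the embodiments of the present disclosure, at least one of the domain name unit and the resource path unit of an obtained access URL is segmented according to a preset division rule, and each resulting access URL segment is mapped to a corresponding key value. The access key value of each access URL segment is then matched level by level against the stored target key values of the target URL segments. If an access key value equals a target key value, the access URL segment matches the target URL segment; if every access URL segment of the access URL matches a corresponding target URL segment, the access URL is one of the URLs that need to be managed and controlled. Segmenting the URL improves matching accuracy and allows the URLs that need control to be managed precisely; mapping the URL segments to key values and matching the key values directly improves matching efficiency, reduces the user's waiting time, and improves the user experience.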

Claims (12)

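  1. A URL matching method, comprising:
    segmenting at least one of a domain name unit and a resource path unit of an obtained access URL according to a preset division rule to obtain access URL segments, wherein each access URL segment includes at least the complete content between two separators in the access URL;
    mapping each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule;
    matching the access key values level by level, in a preset matching order, against the target key values corresponding to stored target URL segments;
    if every access URL segment of the access URL matches a corresponding target URL segment, determining that the access URL needs to be managed and controlled, and obtaining a processing policy for the access URL.
  2. The URL matching method according to claim 1, wherein the preset mapping rule is: processing a URL segment with a hash algorithm and using the result as the key value corresponding to the URL segment.
  3. The URL matching method according to claim 1, wherein before at least one of the domain name unit and the resource path unit of a target URL is segmented according to the preset division rule, the method further comprises: determining that at least one of the domain name unit and the resource path unit of the target URL includes a wildcard.
  4. The URL matching method according to claim 1, wherein the preset matching order is:
    matching the URL segments of the domain name unit first, and then the URL segments of the resource path unit;
    matching the URL segments of the domain name unit from right to left;
    matching the URL segments of the resource path unit from left to right.
  5. The URL matching method according to any one of claims 1 to 4, wherein before the access key values are matched level by level, in the preset matching order, against the target key values corresponding to the stored target URL segments, the method further comprises:
    segmenting at least one of the domain name unit and the resource path unit of an obtained target URL according to the preset division rule to obtain target URL segments, wherein the target URL is a URL that needs to be managed and controlled, and each target URL segment includes at least the complete content between two separators in the target URL;
    mapping each target URL segment to a one-to-one corresponding target key value according to the preset mapping rule;
    storing the target URL segments obtained by segmentation and their corresponding target key values level by level in the preset matching order, with target URL segments that are matched earlier, and their target key values, stored at higher levels.
  6. The URL matching method according to claim 5, wherein storing the target URL segments obtained by segmentation and their corresponding target key values level by level in the preset matching order comprises:
    storing the target URL segments and their corresponding target key values in a tree structure to obtain a storage tree, in which a target URL segment matched earlier is the parent node of the target URL segment matched after it, and the target URL segment matched later is its child node.
  7. The URL matching method according to claim 5, wherein storing the target URL segments obtained by segmentation and their corresponding target key values level by level in the preset matching order comprises:
    storing the target URL segments of the domain name unit of the target URL and their target key values in a tree structure to obtain a corresponding domain name storage tree;
    storing the target URL segments of the resource path unit of the target URL and their target key values in a tree structure to obtain a corresponding resource path storage tree;
    wherein a leaf node of the domain name storage tree points to the root node of the resource path storage tree.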
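  8. A URL matching apparatus, comprising:
    a segmentation module, configured to segment at least one of a domain name unit and a resource path unit of an obtained access URL according to a preset division rule to obtain access URL segments, wherein each access URL segment includes at least the complete content between two separators in the access URL;
    a mapping module, configured to map each access URL segment to a one-to-one corresponding access key value according to a preset mapping rule;
    a matching module, configured to match the access key values level by level, in a preset matching order, against the target key values corresponding to stored target URL segments;
    an obtaining module, configured to obtain a processing policy for the access URL when every access URL segment of the access URL matches a corresponding target URL segment and the access URL therefore needs to be managed and controlled.
  9. The URL matching apparatus according to claim 8, wherein
    the segmentation module is further configured to segment at least one of the domain name unit and the resource path unit of an obtained target URL according to the preset division rule to obtain target URL segments, wherein the target URL is a URL that needs to be managed and controlled, and each target URL segment includes at least the complete content between two separators in the target URL;
    the mapping module is further configured to map each target URL segment to a one-to-one corresponding target key value according to the preset mapping rule;
    the URL matching apparatus further comprises:
    a storage module, configured to, when the obtained URL is a target URL, store the target URL segments obtained by segmentation and their corresponding target key values level by level in the preset matching order, with target URL segments that are matched earlier, and their target key values, stored at higher levels.
  10. The URL matching apparatus according to claim 9, wherein the storage module is configured to: store the target URL segments and their corresponding target key values in a tree structure to obtain a storage tree, in which a target URL segment matched earlier is the parent node of the target URL segment matched after it, and the target URL segment matched later is its child node.
  11. The URL matching apparatus according to claim 9, wherein the storage module is configured to:
    store the target URL segments of the domain name unit of the target URL and their target key values in a tree structure to obtain a corresponding domain name storage tree;
    store the target URL segments of the resource path unit of the target URL and their target key values in a tree structure to obtain a corresponding resource path storage tree;
    wherein a leaf node of the domain name storage tree points to the root node of the resource path storage tree.
  12. A storage medium, comprising a stored program, wherein the program, when run, performs the method according to any one of claims 1 to 7.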
PCT/CN2017/087815 2016-06-29 2017-06-09 URL matching method, apparatus, and storage medium WO2018001078A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610494455.2 2016-06-29
CN201610494455.2A CN107547671A (zh) 2016-06-29 2016-06-29 URL matching method and apparatus

Publications (1)

Publication Number Publication Date
WO2018001078A1 true WO2018001078A1 (zh) 2018-01-04

Family

ID=60786210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/087815 WO2018001078A1 (zh) 2016-06-29 2017-06-09 URL matching method, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN107547671A (zh)
WO (1) WO2018001078A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
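CN110858852A (zh) * 2018-08-23 2020-03-03 北京国双科技有限公司 Method and apparatus for obtaining a registered domain name
CN110874443A (zh) * 2018-08-31 2020-03-10 北京搜狗科技发展有限公司 URL pattern acquisition method and apparatus, electronic device, and readable storage medium
CN110874444A (zh) * 2018-08-31 2020-03-10 北京搜狗科技发展有限公司 Method and apparatus for establishing a URL conversion model, and electronic device
CN113839940A (zh) * 2021-09-18 2021-12-24 北京知道创宇信息技术股份有限公司 Defense method and apparatus based on a URL pattern tree, electronic device, and readable storage medium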

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
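CN108875086B (zh) * 2018-07-18 2023-03-28 山东中创软件商用中间件股份有限公司 Method and system for matching URI path resources
CN112416875B (zh) * 2020-11-24 2024-04-09 平安消费金融有限公司 Log management method and apparatus, computer device, and storage medium
CN112650955B (zh) * 2020-12-30 2024-04-12 中国农业银行股份有限公司 Method and apparatus for processing a uniform resource locator (URL)
CN112804373B (zh) * 2020-12-30 2022-10-14 微医云(杭州)控股有限公司 Interface domain name determination method and apparatus, electronic device, and storage medium
CN113312549B (zh) * 2021-05-25 2024-01-26 北京天空卫士网络安全技术有限公司 Domain name processing method and apparatus
CN113849373A (zh) * 2021-09-27 2021-12-28 中国电信股份有限公司 Server supervision method and apparatus, and storage medium
CN114006774A (zh) * 2021-12-31 2022-02-01 北京微步在线科技有限公司 Traffic information detection method and apparatus, electronic device, and storage medium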

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030093517A1 (en) * 2001-10-31 2003-05-15 Tarquini Richard P. System and method for uniform resource locator filtering
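CN101605129A (zh) * 2009-06-23 2009-12-16 北京理工大学 URL lookup method for a URL filtering system
CN102045360A (zh) * 2010-12-27 2011-05-04 成都市华为赛门铁克科技有限公司 Method and apparatus for processing a malicious URL library
CN102110132A (zh) * 2010-12-08 2011-06-29 北京星网锐捷网络技术有限公司 Uniform resource locator matching and lookup method and apparatus, and network-side device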

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9411793B2 (en) * 2010-07-13 2016-08-09 Motionpoint Corporation Dynamic language translation of web site content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU DONGLIANG: "Research on High performance online Pattern Matching Algorithm", CHINA DOCTORAL DISSERTATIONS, 15 March 2016 (2016-03-15), pages 73 - 81 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
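CN110858852A (zh) * 2018-08-23 2020-03-03 北京国双科技有限公司 Method and apparatus for obtaining a registered domain name
CN110858852B (zh) * 2018-08-23 2022-05-10 北京国双科技有限公司 Method and apparatus for obtaining a registered domain name
CN110874443A (zh) * 2018-08-31 2020-03-10 北京搜狗科技发展有限公司 URL pattern acquisition method and apparatus, electronic device, and readable storage medium
CN110874444A (zh) * 2018-08-31 2020-03-10 北京搜狗科技发展有限公司 Method and apparatus for establishing a URL conversion model, and electronic device
CN110874444B (zh) * 2018-08-31 2023-10-31 北京搜狗科技发展有限公司 Method and apparatus for establishing a URL conversion model, and electronic device
CN113839940A (zh) * 2021-09-18 2021-12-24 北京知道创宇信息技术股份有限公司 Defense method and apparatus based on a URL pattern tree, electronic device, and readable storage medium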

Also Published As

Publication number Publication date
CN107547671A (zh) 2018-01-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17819075

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17819075

Country of ref document: EP

Kind code of ref document: A1