WO2020006908A1 - URL deduplication method and device - Google Patents

URL deduplication method and device

Info

Publication number
WO2020006908A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
threshold
repeated set
equal
downloaded
Application number
PCT/CN2018/108708
Other languages
English (en)
French (fr)
Inventor
Xiong Qingchang
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2020006908A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security

Definitions

  • the present application relates to the field of Internet technologies, and in particular, to a URL deduplication method and device.
  • a URL crawler works as follows: the crawler system first selects a subset of web pages from the Internet, uses the link addresses of these pages as seed URLs, and places the seeds in a queue of URLs to be crawled. The crawler reads URLs from this queue in turn, resolves each URL through the domain name system (DNS) to convert the link address into the IP address of the corresponding web server, and then hands the address and the relative path of the page to a page downloader, which downloads the page.
  • pages downloaded to the local machine are, on the one hand, stored in a page library to await indexing and other subsequent processing; on the other hand, the URL of each downloaded page is placed in a crawled queue, which records the URLs of pages the crawler system has already downloaded so that the system does not crawl them again.
  • URL deduplication means removing URLs that would be crawled repeatedly, so that the same page is not fetched multiple times. For example, each given URL is mapped to some physical address; to detect whether a given URL is a duplicate, one only needs to check whether its physical address already exists. If it exists, the URL has already been downloaded and the download is abandoned; otherwise the given URL is placed in the to-be-crawled queue to await download. However, many URLs within a website differ only in their parameter parts, and such URLs have very likely been downloaded already, yet the physical addresses obtained by mapping them are different. In that case, judging duplication only by the physical address of each URL leads to crawling many duplicate URLs, which harms the scanning efficiency of the web vulnerability scanning system.
  • the embodiments of the present application provide a URL deduplication method and device, which can reduce the number of repeated URLs downloaded, thereby improving the scanning efficiency of the WEB vulnerability scanning system.
  • an embodiment of the present application provides a URL deduplication method, which includes: performing a first generalization process on an original URL to obtain a first URL, the first generalization process being used to replace multiple consecutive characters of the same type in the original URL with a single character; if the first URL does not belong to a first repeated set, performing a first parameter reduction process on the first URL to obtain a second URL, the first parameter reduction process being used to reduce parameter variables in the first URL, the first repeated set including the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record; and if the second URL belongs to a second repeated set, detecting whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, downloading the original URL, the second repeated set including the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record.
  • an embodiment of the present application provides a URL deduplication device.
  • the device includes:
  • a first generalization processing module, configured to perform a first generalization process on the original URL to obtain a first URL, the first generalization process being used to replace multiple consecutive characters of the same type in the original URL with a single character;
  • a first parameter reduction processing module, configured to perform a first parameter reduction process on the first URL to obtain a second URL when the first URL does not belong to the first repeated set, the first parameter reduction process being used to reduce parameter variables in the first URL, the first repeated set including the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record;
  • a download module, configured to detect, when the second URL belongs to the second repeated set, whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, download the original URL, the second repeated set including the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record.
  • an embodiment of the present application provides a terminal including a processor and a memory connected to each other, where the memory is used to store a computer program that supports the terminal in executing the foregoing method, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the URL deduplication method of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to execute the URL deduplication method of the first aspect.
  • the embodiments of the present application can reduce the number of duplicate URLs downloaded by hierarchically deduplicating URLs, thereby improving the scanning efficiency of the WEB vulnerability scanning system.
  • FIG. 1 is a schematic flowchart of a URL deduplication method according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the relationship between the first repeated set and the number of occurrences;
  • FIG. 3 is another schematic flowchart of a URL deduplication method according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the relationship between the second repeated set and the number of occurrences;
  • FIG. 5 is a schematic block diagram of a URL deduplication device according to an embodiment of the present application;
  • FIG. 6 is a schematic block diagram of a terminal according to an embodiment of the present application.
  • the question mark "?" separates the file name part from the parameter part of a URL.
  • the ampersand "&" separates the individual parameters specified in a URL.
  • the URL deduplication method may include steps:
  • the terminal may sort the parameter part of the original URL by parameter name and perform a first generalization process on the parameter values of that parameter part to obtain the first URL.
  • the first generalization process is used to replace multiple consecutive characters of the same type in the original URL with a single character. For example, consecutive digits such as 145 can be replaced with the digit 1, consecutive letters such as FK, aj, and dgA can all be replaced with the letter A, and special characters can all be replaced with the symbol %. Here, special characters are characters other than digits and letters, such as the question mark "?" and the exclamation mark "!".
  • the terminal may detect whether a URL identical to the first URL exists in the first repeated set. If not, the first URL does not belong to the first repeated set, and a first parameter reduction process may be performed on the parameter part of the first URL to obtain a second URL. The first parameter reduction process is used to reduce the parameter variables of the parameter part of the first URL, for example by removing the parameter values and keeping only the parameter names. If such a URL exists, the first URL belongs to the first repeated set, and the terminal may increment the occurrence count of the first URL in the first repeated set by 1 to obtain a second occurrence count of the first URL in the first repeated set, and may determine whether that count is less than or equal to a second threshold: if so, the original URL is downloaded; if not, it is discarded.
  • by determining whether the first URL obtained through the first generalization process has already been downloaded, the terminal determines whether the original URL needs to be downloaded.
  • this filters out original URLs whose parameter values share the same data format, which reduces the number of duplicate URLs downloaded; and because only some of the variables in the original URL's parameter values are reduced, the accuracy of deduplication is preserved.
  • the first repeated set may include the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record, i.e. the first URLs corresponding to the downloaded URLs.
  • the downloaded URL may be a URL that has not undergone the first generalization process.
  • the terminal checks whether a URL identical to URL1 exists in the first repeated set URL4, URL6, and URL7, i.e. whether URL1 exists in the first repeated set; since URL1 does not exist in the first repeated set, URL1 does not belong to it.
  • the terminal removes the parameter values from URL1's parameter part while keeping the parameter names, obtaining the second URL: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2.
  • the terminal may use the message-digest algorithm 5 (MD5) to compute a hash value of the first URL and may detect whether that hash value exists in the first repeated set. If it does not, the hash value of the first URL does not belong to the first repeated set, and the parameter part of the first URL may be subjected to the first parameter reduction process to obtain a second URL.
  • the first parameter reduction process is used to reduce the parameter variables of the parameter part of the first URL, for example by removing the parameter values and keeping only the parameter names.
  • if the hash value does exist, the occurrence count of the first URL's hash value in the first repeated set may be incremented by 1 to obtain the second occurrence count of that hash value in the first repeated set, and it can be determined whether the second occurrence count is less than or equal to the second threshold: if so (i.e. the second occurrence count is less than or equal to the second threshold), the original URL is downloaded; if not (i.e. the second occurrence count is greater than the second threshold), the original URL is discarded, i.e. not downloaded.
  • the first repeated set may include the hash values obtained by applying the first generalization process and then MD5 to the URLs already downloaded in the history record, i.e. the hash values of the first URLs corresponding to the downloaded URLs.
  • the second threshold is an integer greater than 0. Because a hash function converts data of any size into data of a fixed size, and the downloaded set (the first repeated set) stores URL hash values rather than complete URLs, storage space can be reduced (a complete URL has many characters, while a hash value has a fixed size), and processing efficiency can be improved when detecting whether the first URL belongs to the first repeated set.
  • referring to FIG. 2, which is a schematic diagram of the relationship between the first repeated set and the number of occurrences:
  • suppose the second threshold is 10 and the elements in the first repeated set are 01, 03, and 06.
  • the first URL is URL1.
  • the terminal computes the hash value of URL1 as 03.
  • the terminal detects that the hash value of URL1 is in the first repeated set, so URL1 belongs to the first repeated set.
  • the terminal increments the occurrence count 3 of URL1's hash value 03 in the first repeated set by 1, obtaining a second occurrence count of 4.
  • since the second occurrence count 4 is less than the second threshold 10, the terminal downloads the original URL.
  • when the terminal detects that the first URL does not belong to the first repeated set, it may add the first URL to the first repeated set to form a new first repeated set, i.e. update the first repeated set.
  • the occurrence count of the first URL in the first repeated set may be set to 1.
  • the first URL may be subjected to the first parameter reduction process to obtain a second URL.
  • after downloading or discarding the original URL, the terminal uses the latest first repeated set to detect whether the next original URL has been downloaded.
  • by continuously updating the first repeated set, the terminal can filter out duplicate URLs more accurately and can further improve scanning efficiency.
  • for example, the first repeated set includes URL4, URL6, and URL7, and the first URL is URL1.
  • URL1 does not belong to the first repeated set.
  • the terminal may add URL1 to the first repeated set.
  • the updated first repeated set then includes URL1, URL4, URL6, and URL7.
  • the terminal may also set the occurrence count of URL1 in the first repeated set to 1.
  • the first repeated set may be empty (because there are no downloaded URLs at this time).
  • if the second URL belongs to the second repeated set, it is detected whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold, and if so, the original URL is downloaded.
  • the terminal can detect whether a URL identical to the second URL exists in the second repeated set. If so, the second URL belongs to the second repeated set; the occurrence count of the second URL in the second repeated set is incremented by 1 to obtain the first occurrence count of the second URL in the second repeated set, and it can be detected whether the first occurrence count is less than or equal to the first threshold: if so (i.e. the first occurrence count is less than or equal to the first threshold), the original URL is downloaded; if not (i.e. the first occurrence count is greater than the first threshold), the original URL is discarded, i.e. not downloaded.
  • if the second URL does not belong to the second repeated set, the terminal can download the original URL directly.
  • the second repeated set may include the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record, i.e. the second URLs corresponding to the downloaded URLs.
  • the downloaded URL may be a URL that has not undergone the first parameter reduction process.
  • by determining whether the second URL obtained through the first parameter reduction process has already been downloaded, the terminal determines whether the original URL needs to be downloaded; this filters out original URLs whose parameter values differ, further reducing the number of duplicate URLs downloaded and improving scanning efficiency while preserving accuracy.
  • the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization process only reduces the variables in the original URL's parameter values, the first URL still contains many variables, so fewer URLs are filtered out, more URLs are downloaded, and the first URL appears more often in the first repeated set. Making the first threshold smaller than the second threshold therefore ensures that unfiltered URLs can be filtered out after the first parameter reduction process, achieving hierarchical deduplication.
  • the terminal may use MD5 to compute a hash value of the second URL and may detect whether that hash value exists in the second repeated set.
  • if it exists, the hash value of the second URL belongs to the second repeated set, and the occurrence count of the second URL's hash value in the second repeated set can be incremented by 1 to obtain the first occurrence count of that hash value in the second repeated set; it can then be determined whether the first occurrence count is less than or equal to the first threshold, and if so, the original URL is downloaded.
  • the second repeated set may include the hash values obtained by applying the first parameter reduction process and then MD5 to the original URLs already downloaded in the history record, i.e. the hash values of the second URLs corresponding to the downloaded original URLs.
  • because a hash function converts data of any size into data of a fixed size, and the downloaded set (the second repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the second URL belongs to the second repeated set.
  • when the terminal detects that the second URL does not belong to the second repeated set, it may add the second URL to the second repeated set to form a new second repeated set, i.e. update the second repeated set; the occurrence count of the second URL in the second repeated set can be set to 1, and the original URL can be downloaded. After downloading or discarding the original URL, the terminal uses the latest second repeated set to detect whether the next original URL has been downloaded. By continuously updating the second repeated set, the terminal can filter out duplicate URLs more accurately and can further improve scanning efficiency.
  • the second repeated set may be empty (because there is no downloaded URL at this time).
  • a first generalization process is performed on the original URL, and it is determined whether the first URL belongs to the first repeated set. If the first URL does not belong to the first repeated set, a first parameter reduction process is performed on the first URL to obtain a second URL, and it is determined whether the second URL belongs to the second repeated set. If the second URL belongs to the second repeated set, it is detected whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold: if so, the original URL is downloaded; if not, the original URL is discarded, i.e. not downloaded.
  • in this way the number of duplicate URLs downloaded can be reduced, which improves the scanning efficiency of the WEB vulnerability scanning system.
  • the URL deduplication method may include steps:
  • if the second URL belongs to the second repeated set, it is detected whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold, and if so, the original URL is downloaded.
  • the terminal can detect whether a URL identical to the second URL exists in the second repeated set. If so, the second URL belongs to the second repeated set; the occurrence count of the second URL in the second repeated set is incremented by 1 to obtain the first occurrence count, and it is detected whether the first occurrence count is less than or equal to the first threshold: if so, the original URL is downloaded; if not, the original URL is discarded, i.e. not downloaded.
  • if the second URL does not belong to the second repeated set, the second URL may be subjected to a second parameter reduction process to obtain a third URL.
  • the second parameter reduction process is used to reduce parameter variables in the second URL, for example by removing the parameter part (both parameter values and parameter names) of the second URL.
  • the second repeated set may include the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record, i.e. the second URLs corresponding to the downloaded URLs.
  • the downloaded URL may be a URL that has not undergone the first parameter reduction process.
  • by determining whether the second URL obtained through the first parameter reduction process has already been downloaded, the terminal determines whether the original URL needs to be downloaded; this filters out original URLs whose parameter values differ, further reducing the number of duplicate URLs downloaded.
  • scanning efficiency is improved while accuracy is preserved.
  • the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization process only reduces the variables in the original URL's parameter values, the first URL still contains many variables, so fewer URLs are filtered out, more URLs are downloaded, and the first URL appears more often in the first repeated set. Making the first threshold smaller than the second threshold therefore ensures that URLs not filtered out in steps S301 to S302 can be filtered out after the first parameter reduction process, achieving hierarchical deduplication.
  • referring to FIG. 4, which is a schematic diagram of the relationship between the second repeated set and the number of occurrences:
  • suppose the first threshold is 7.
  • the second repeated set includes URL2 and URL5.
  • the second URL is URL2.
  • the terminal detects that URL2 is in the second repeated set, so URL2 belongs to the second repeated set.
  • the terminal increments the occurrence count 1 of URL2 in the second repeated set by 1, obtaining a first occurrence count of 2 for URL2 in the second repeated set.
  • since the first occurrence count 2 is less than the first threshold 7, the original URL is downloaded.
  • the terminal may use MD5 to compute a hash value of the second URL and may detect whether that hash value exists in the second repeated set.
  • if it exists, the hash value of the second URL belongs to the second repeated set, and the occurrence count of the second URL's hash value in the second repeated set can be incremented by 1 to obtain the first occurrence count of that hash value in the second repeated set; it can then be determined whether the first occurrence count is less than or equal to the first threshold: if so, the original URL is downloaded.
  • if not, the original URL is discarded, i.e. not downloaded. If the hash value does not exist, it does not belong to the second repeated set, and the second URL may be subjected to a second parameter reduction process to obtain a third URL.
  • the second parameter reduction process is used to reduce parameter variables in the second URL, for example by removing the parameter part (both parameter values and parameter names) of the second URL.
  • the second repeated set may include the hash values obtained by applying the first parameter reduction process and then MD5 to the URLs already downloaded in the history record, i.e. the hash values of the second URLs corresponding to the downloaded URLs.
  • because a hash function converts data of any size into data of a fixed size, and the downloaded set (the second repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the second URL belongs to the second repeated set.
  • the terminal computes the hash value of URL2 as 04 and detects that hash value 04 is not in the second repeated set, so URL2 does not belong to the second repeated set; the terminal removes the parameter part of URL2 to obtain the third URL: http://xxx.pingan.com/cgi-bin/index1.html.
  • when the terminal detects that the second URL does not belong to the second repeated set, it may add the second URL to the second repeated set to form a new second repeated set, i.e. update the second repeated set; the occurrence count of the second URL in the second repeated set may be set to 1, and the second URL may be subjected to the second parameter reduction process to obtain a third URL.
  • after the terminal downloads or discards the original URL, it uses the latest second repeated set to detect whether the next original URL has been downloaded.
  • the second repeated set may be empty (because there is no downloaded URL at this time).
  • the terminal may detect whether a URL identical to the third URL exists in the third repeated set. If so, the third URL belongs to the third repeated set.
  • the occurrence count of the third URL in the third repeated set is incremented by 1 to obtain the third occurrence count of the third URL in the third repeated set, and it can be detected whether the third occurrence count is less than or equal to a third threshold.
  • if so (i.e. the third occurrence count is less than or equal to the third threshold), the original URL is downloaded; if not (i.e. the third occurrence count is greater than the third threshold), the original URL is discarded, i.e. not downloaded.
  • if the third URL does not belong to the third repeated set, the terminal may perform a second generalization process on the file name part of the third URL to obtain a fourth URL.
  • the second generalization process is used to replace at least one character of a target type in the file name part of the third URL with a target character, for example replacing one or more digits in the file name part of the third URL with the preset single digit "1".
  • the third repeated set may include the URLs obtained by applying the second parameter reduction process to the URLs already downloaded in the history record, i.e. the third URLs corresponding to the downloaded URLs.
  • the downloaded URL may be a URL that has not undergone the second parameter reduction process.
  • by determining whether the third URL obtained through the second parameter reduction process has already been downloaded, the terminal determines whether the original URL needs to be downloaded.
  • this filters out original URLs whose parameter parts differ; because more variables are removed from the original URL, more original URLs are discarded, which further reduces the number of duplicate URLs downloaded and further improves scanning efficiency while preserving accuracy.
  • the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the third threshold may be an integer greater than 0.
  • making the third threshold smaller than the first threshold ensures that URLs not filtered out in steps S303 to S304 can be filtered out after the second parameter reduction process, achieving hierarchical deduplication.
  • the terminal may use MD5 to compute the hash value of the third URL and may detect whether that hash value exists in the third repeated set.
  • if it exists, the hash value of the third URL belongs to the third repeated set, and the occurrence count of the third URL's hash value in the third repeated set can be incremented by 1 to obtain the third occurrence count of that hash value in the third repeated set; it can then be determined whether the third occurrence count is less than or equal to the third threshold: if so, the original URL is downloaded.
  • if not, the original URL is discarded, i.e. not downloaded. If the hash value does not exist, it does not belong to the third repeated set, and a second generalization process may be performed on the third URL to obtain a fourth URL.
  • the second generalization process is used to replace at least one character of a target type in the file name part of the third URL with a target character, for example replacing one or more digits in the file name part of the third URL with the preset single digit "1".
  • the third repeated set may include the hash values obtained by applying the second parameter reduction process and then MD5 to the URLs already downloaded in the history record, i.e. the hash values of the third URLs corresponding to the downloaded URLs. Because a hash function converts data of any size into data of a fixed size, and the downloaded set (the third repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the third URL belongs to the third repeated set.
  • when the terminal detects that the third URL does not belong to the third repeated set, it may add the third URL to the third repeated set to form a new third repeated set, i.e. update the third repeated set.
  • the occurrence count of the third URL in the third repeated set can be set to 1, and a second generalization process can be performed on the third URL to obtain a fourth URL.
  • after downloading or discarding the original URL, the terminal uses the latest third repeated set to detect whether the next original URL has been downloaded. By continuously updating the third repeated set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
  • the third repeated set may be empty (because there is no downloaded URL at this time).
  • if the fourth URL belongs to the fourth repeated set, it is detected whether the fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold, and if so, the original URL is downloaded.
  • the terminal can detect whether a URL identical to the fourth URL exists in the fourth repeated set. If so, the fourth URL belongs to the fourth repeated set; the occurrence count of the fourth URL in the fourth repeated set is incremented by 1 to obtain the fourth occurrence count of the fourth URL in the fourth repeated set, and it can be detected whether the fourth occurrence count is less than or equal to the fourth threshold.
  • if so (i.e. the fourth occurrence count is less than or equal to the fourth threshold), the original URL is downloaded; if not (i.e. the fourth occurrence count is greater than the fourth threshold), the original URL is discarded, i.e. not downloaded.
  • if the terminal detects that the fourth URL does not exist in the fourth repeated set, the fourth URL does not belong to the fourth repeated set, and the terminal can download the original URL directly.
  • the fourth repeated set may include the URLs obtained by applying the first and second parameter reduction processes and the second generalization process to the URLs already downloaded in the history record, i.e. the fourth URLs corresponding to the downloaded URLs.
  • by determining whether the fourth URL obtained through the second generalization process has already been downloaded, the terminal determines whether the original URL needs to be downloaded.
  • this filters out original URLs whose file name parts differ; because the variables in the original URL are reduced, more original URLs are discarded and fewer duplicate URLs are downloaded, which improves scanning efficiency while preserving accuracy.
  • the fourth threshold is less than or equal to the third threshold.
  • the third threshold may be less than or equal to the first threshold.
  • the first threshold is less than or equal to the second threshold.
  • the fourth threshold is an integer greater than or equal to 0.
  • making the fourth threshold smaller than the third threshold ensures that URLs not filtered out in steps S305 to S306 can be filtered out after the second generalization process, achieving hierarchical deduplication.
  • the terminal may use MD5 to compute a hash value of the fourth URL and may detect whether that hash value exists in the fourth repeated set.
  • if it exists, the hash value of the fourth URL belongs to the fourth repeated set, and the occurrence count of the fourth URL's hash value in the fourth repeated set can be incremented by 1 to obtain the fourth occurrence count of that hash value in the fourth repeated set; it can then be determined whether the fourth occurrence count is less than or equal to the fourth threshold: if so, the original URL is downloaded.
  • the fourth repeated set may include the hash values obtained by applying the first and second parameter reduction processes and the second generalization process, followed by MD5, to the URLs already downloaded in the history record, i.e. the hash values of the fourth URLs corresponding to the downloaded URLs.
  • because a hash function converts data of any size into data of a fixed size, and the downloaded set (the fourth repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced.
  • processing efficiency can be further improved when detecting whether the fourth URL belongs to the fourth repeated set.
  • when the terminal detects that the fourth URL does not belong to the fourth repeated set, it may add the fourth URL to the fourth repeated set to form a new fourth repeated set, i.e. update the fourth repeated set; at the same time, the occurrence count of the fourth URL in the fourth repeated set can be set to 1, and the original URL can be downloaded.
  • after the terminal downloads or discards the original URL, it uses the latest fourth repeated set to detect whether the next original URL has been downloaded. By continuously updating the fourth repeated set, the terminal can filter out duplicate URLs more accurately and can further improve scanning efficiency.
  • the fourth repeated set may be empty (because there are no downloaded URLs at this time).
  • the original URL is deduplicated hierarchically: the variables in the original URL are reduced level by level, and whether the original URL has been downloaded is judged from the URL obtained after each reduction. If a level determines that the original URL has already been downloaded, the original URL is discarded; if a level determines that it has not been downloaded, the original URL is downloaded; if a level cannot determine whether the original URL has been downloaded, the judgment proceeds to the next level, until the original URL is determined to have been downloaded or not.
  • this finer-grained hierarchical scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of the WEB vulnerability scanning system, but also improves the accuracy of deduplication.
  • the URL deduplication device provided in the embodiment of the present application includes:
  • a first generalization processing module 10 configured to perform a first generalization process on an original URL to obtain a first URL, where the first generalization process is used to replace multiple consecutive characters of the same type in the original URL with a single character;
  • a first parameter reduction processing module 20, configured to perform a first parameter reduction process on the first URL to obtain a second URL when the first URL does not belong to the first repeated set, the first parameter reduction process being used to reduce parameter variables in the first URL, the first repeated set including the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record;
  • a download module 30, configured to detect, when the second URL belongs to the second repeated set, whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, download the original URL,
  • the second repeated set including the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record.
  • the apparatus further includes an obtaining module 40, configured to obtain, when the first URL belongs to the first repeated set, the second occurrence count of the first URL in the first repeated set; the download module 30 is further configured to download the original URL when the second occurrence count is less than or equal to a second threshold, the second threshold being greater than or equal to the first threshold.
  • the first parameter reduction processing module 20 includes a calculation unit 201, a detection unit 202, and a first parameter reduction processing unit 203.
  • the calculation unit 201 is configured to calculate a hash value of the first URL;
  • the detection unit 202 is configured to detect whether the hash value of the first URL belongs to the first repeated set;
  • the first parameter reduction processing unit 203 is configured to perform the first parameter reduction process on the first URL to obtain a second URL when the hash value of the first URL does not belong to the first repeated set.
  • the first repeated set includes the hash values of the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record.
  • the apparatus further includes a second parameter reduction processing module 50.
  • the second parameter reduction processing module 50 is configured to perform a second parameter reduction process on the second URL to obtain a third URL when the second URL does not belong to the second repeated set; the download module 30 is further configured to detect, when the third URL belongs to a third repeated set, whether the third occurrence count of the third URL in the third repeated set is less than or equal to a third threshold, and if so, download the original URL.
  • the third repeated set includes the URLs obtained by applying the second parameter reduction process to the URLs already downloaded in the history record, the second parameter reduction process is used to reduce parameter variables in the second URL, and the third threshold is less than or equal to the first threshold.
  • the apparatus further includes a second generalization processing module 60.
  • the second generalization processing module 60 is configured to perform a second generalization process on the third URL to obtain a fourth URL when the third URL does not belong to the third repeated set; the download module 30 is further configured to detect, when the fourth URL belongs to a fourth repeated set, whether the fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold, and if so, download the original URL.
  • the fourth repeated set includes the URLs obtained by applying the first and second parameter reduction processes and the second generalization process to the URLs already downloaded in the history record.
  • the second generalization process is used to replace at least one character of a target type in the third URL with a target character, and the fourth threshold is less than or equal to the third threshold.
  • the downloading module 30 is further configured to download the original URL when the fourth URL does not belong to the fourth repeated set.
  • the download module 30 is further configured to, after the second URL is found to belong to the second repeated set and before it is detected whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold, increment the occurrence count of the second URL in the second repeated set by 1 to obtain the first occurrence count.
  • in specific implementations, the URL deduplication device may use the above modules to carry out the implementations provided by the steps of FIG. 1 or FIG. 3 and realize the functions realized in the foregoing embodiments; for details, refer to the corresponding descriptions of the steps in the method embodiments shown in FIG. 1 or FIG. 3, which are not repeated here.
  • the URL deduplication device may deduplicate the original URL hierarchically, reducing the variables in the original URL level by level and judging from the reduced URL whether the original URL has been downloaded. If a level determines that the original URL has already been downloaded, the original URL is discarded; if a level determines that it has not been downloaded, the original URL is downloaded; if a level cannot determine whether it has been downloaded, the judgment proceeds to the next level, until the original URL is determined to have been downloaded or not. This finer-grained hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of the WEB vulnerability scanning system, but also improves the accuracy of deduplication.
  • the terminal in this embodiment may include one or more processors 601 and a memory 602.
  • the processor 601 and the memory 602 are connected through a bus 603.
  • the memory 602 is configured to store a computer program, where the computer program includes program instructions, and the processor 601 is configured to execute the program instructions stored in the memory 602.
  • the processor 601 is configured to call the program instruction for execution:
  • performing a first generalization process on an original URL to obtain a first URL, the first generalization process being used to replace multiple consecutive characters of the same type in the original URL with a single character;
  • if the first URL does not belong to a first repeated set, performing a first parameter reduction process on the first URL to obtain a second URL, the first parameter reduction process being used to reduce parameter variables in the first URL, the first repeated set including the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record;
  • if the second URL belongs to a second repeated set, detecting whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, downloading the original URL, the second repeated set including the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record.
  • the processor 601 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 602 may include a read-only memory and a random access memory, and provides instructions and data to the processor 601. A part of the memory 602 may further include a non-volatile random access memory. For example, the memory 602 may also store device type information.
  • the processor 601 described in the embodiments of the present application may carry out the implementations described for the URL deduplication method provided in the embodiments of the present application, and may also carry out the implementations of the URL deduplication device described in the embodiments of the present application, which are not repeated here.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions that, when executed by a processor, implement the URL deduplication method shown in FIG. 1 or FIG. 3. For details, refer to the description of the embodiments shown in FIG. 1 or FIG. 3, which is not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present application disclose a URL deduplication method and device. The method includes: performing a first generalization process on an original URL, and determining whether the resulting first URL belongs to a first repeated set; if the first URL does not belong to the first repeated set, performing a first parameter reduction process on the first URL to obtain a second URL, and determining whether the second URL belongs to a second repeated set; if the second URL belongs to the second repeated set, detecting whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, downloading the original URL, and if not, discarding the original URL, i.e. not downloading it. The embodiments of the present application can reduce the number of duplicate URLs downloaded and thus improve the scanning efficiency of a WEB vulnerability scanning system.

Description

URL deduplication method and device
This application claims priority to Chinese patent application No. 2018107353055, filed with the Chinese Patent Office on July 5, 2018 and entitled "Uniform resource locator (URL) deduplication method and device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of Internet technologies, and in particular to a URL deduplication method and device.
Background
Many web (world wide web, WEB) vulnerability scanning systems in the industry today have their own uniform resource locator (URL) crawler and their own URL deduplication method. A URL crawler works as follows: the crawler system first selects a subset of web pages from the Internet, uses the link addresses of these pages as seed URLs, and places the seeds in a queue of URLs to be crawled. The crawler reads URLs from this queue in turn, resolves each URL through the domain name system (DNS) to convert the link address into the IP address of the corresponding web server, and then hands the address and the relative path of the page to a page downloader, which downloads the page. Each page downloaded to the local machine is, on the one hand, stored in a page library to await indexing and other subsequent processing; on the other hand, its URL is placed in a crawled queue, which records the URLs of pages the crawler system has already downloaded so that the system does not crawl them again.
URL deduplication means removing URLs that would be crawled repeatedly, so that the same page is not fetched multiple times. For example, each given URL is mapped to some physical address; to detect whether a given URL is a duplicate, one only needs to check whether its physical address already exists. If it exists, the URL has already been downloaded and the download is abandoned; otherwise the given URL is placed in the to-be-crawled queue to await download. However, many URLs within a website differ only in their parameter parts, and such URLs have very likely been downloaded already, yet the physical addresses obtained by mapping them are different. In that case, judging duplication only by the physical address of each URL leads to crawling many duplicate URLs, which harms the scanning efficiency of the WEB vulnerability scanning system.
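To make this limitation concrete, the following is a minimal sketch (not part of the original application) of the conventional mapping-based deduplication described above, with MD5 standing in for the unspecified URL-to-physical-address mapping; it shows why URLs that differ only in parameter values all pass the check:

```python
import hashlib

seen_addresses = set()  # "physical addresses" of URLs already downloaded

def naive_should_download(url: str) -> bool:
    """Conventional dedup: map the full URL to an address, test membership.

    URLs differing only in parameter values (e.g. ...?id=1 vs ...?id=2)
    map to different addresses, so they are all downloaded even though
    they usually point to the same page template.
    """
    addr = hashlib.md5(url.encode("utf-8")).hexdigest()
    if addr in seen_addresses:
        return False          # already downloaded, abandon the download
    seen_addresses.add(addr)  # otherwise queue it for download
    return True
```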
Summary
Embodiments of the present application provide a URL deduplication method and device, which can reduce the number of duplicate URLs downloaded and thus improve the scanning efficiency of a WEB vulnerability scanning system.
In a first aspect, an embodiment of the present application provides a URL deduplication method, which includes:
performing a first generalization process on an original URL to obtain a first URL, where the first generalization process is used to replace multiple consecutive characters of the same type in the original URL with a single character;
if the first URL does not belong to a first repeated set, performing a first parameter reduction process on the first URL to obtain a second URL, where the first parameter reduction process is used to reduce the parameter variables in the first URL, and the first repeated set includes the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record;
if the second URL belongs to a second repeated set, detecting whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, downloading the original URL, where the second repeated set includes the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record.
In a second aspect, an embodiment of the present application provides a URL deduplication device, which includes:
a first generalization processing module, configured to perform a first generalization process on an original URL to obtain a first URL, where the first generalization process is used to replace multiple consecutive characters of the same type in the original URL with a single character;
a first parameter reduction processing module, configured to perform, when the first URL does not belong to a first repeated set, a first parameter reduction process on the first URL to obtain a second URL, where the first parameter reduction process is used to reduce the parameter variables in the first URL, and the first repeated set includes the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record;
a download module, configured to detect, when the second URL belongs to a second repeated set, whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, download the original URL, where the second repeated set includes the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record.
In a third aspect, an embodiment of the present application provides a terminal including a processor and a memory connected to each other, where the memory is used to store a computer program that supports the terminal in executing the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the URL deduplication method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to execute the URL deduplication method of the first aspect.
By deduplicating URLs hierarchically, the embodiments of the present application can reduce the number of duplicate URLs downloaded and thus improve the scanning efficiency of a WEB vulnerability scanning system.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a URL deduplication method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the relationship between the first repeated set and the number of occurrences;
FIG. 3 is another schematic flowchart of a URL deduplication method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the relationship between the second repeated set and the number of occurrences;
FIG. 5 is a schematic block diagram of a URL deduplication device according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of a terminal according to an embodiment of the present application.
Detailed description
Before the embodiments of the present application are introduced, the data structure of a URL is described. A URL typically has the structure "protocol://server name (IP address)/path/file name?parameters", for example: http://xxx.pingan.com/cgi-bin/index1.html?param1=value1&param2=value2, where param1=value1&param2=value2 is the parameter part of this URL. The parameter part consists of parameter names and parameter values: param1 and param2 are parameter names, and value1 and value2 are parameter values. A parameter value may be a digit, a letter (uppercase or lowercase), a special character (any character other than digits and letters), or a combination of these. The question mark "?" separates the file name part from the parameter part of the URL, and the ampersand "&" separates the individual parameters specified in the URL.
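As an illustration (not part of the original application), the parts of such a URL can be separated with Python's standard urllib.parse:

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://xxx.pingan.com/cgi-bin/index1.html?param1=value1&param2=value2"
parts = urlsplit(url)
print(parts.scheme)            # http
print(parts.netloc)            # xxx.pingan.com
print(parts.path)              # /cgi-bin/index1.html  (path + file name part)
print(parse_qsl(parts.query))  # [('param1', 'value1'), ('param2', 'value2')]
```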
The URL deduplication method and device provided by the embodiments of the present application are described below with reference to FIG. 1 to FIG. 6.
Referring to FIG. 1, which is a schematic flowchart of a URL deduplication method according to an embodiment of the present application: as shown in FIG. 1, the URL deduplication method may include the following steps.
S101: Perform a first generalization process on an original URL to obtain a first URL.
In some feasible implementations, the terminal may sort the parameter part of the original URL by parameter name and perform a first generalization process on the parameter values of that parameter part to obtain the first URL. The first generalization process is used to replace multiple consecutive characters of the same type in the original URL with a single character. For example, consecutive digits such as 145 may be replaced with the digit 1, consecutive letters such as FK, aj, and dgA may all be replaced with the letter A, and special characters may all be replaced with the symbol %. Here, special characters are characters other than digits and letters, such as the question mark "?" and the exclamation mark "!".
For example, suppose the original URL crawled by the terminal is http://xxx.pingan.com/cgi-bin/index1.html?param1=v167!ABD&param2=val_ue2. The terminal may sort the parameter part of the original URL by parameter name so that the parameters are in order, and then replace consecutive digits in the parameter values with the preset single digit "1", consecutive letters with the preset single letter "A", and special characters with the preset single special character "%", obtaining the first URL http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2.
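The following sketch reproduces this first generalization process in Python. The regular-expression rules are inferred from the worked example above (runs of two or more digits or letters collapse to the preset "1" or "A", and every special character becomes "%"), so details such as the handling of percent-encoded input are assumptions:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl

def first_generalization(url: str) -> str:
    """Sort parameters by name, then generalize each parameter value."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))

    def generalize(value: str) -> str:
        value = re.sub(r"\d{2,}", "1", value)        # 145 -> 1
        value = re.sub(r"[A-Za-z]{2,}", "A", value)  # FK, aj, dgA -> A
        value = re.sub(r"[^0-9A-Za-z]", "%", value)  # !, _ and so on -> %
        return value

    query = "&".join(f"{name}={generalize(v)}" for name, v in params)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

# ...?param1=v167!ABD&param2=val_ue2 -> ...?param1=v1%A&param2=A%A2
```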
S102: If the first URL does not belong to a first repeated set, perform a first parameter reduction process on the first URL to obtain a second URL.
In some feasible implementations, the terminal may detect whether a URL identical to the first URL exists in the first repeated set. If not, the first URL does not belong to the first repeated set, and a first parameter reduction process may be performed on the parameter part of the first URL to obtain a second URL. The first parameter reduction process is used to reduce the parameter variables of the parameter part of the first URL, for example by removing the parameter values and keeping only the parameter names. If such a URL does exist, the first URL belongs to the first repeated set, and the terminal may increment the occurrence count of the first URL in the first repeated set by 1 to obtain a second occurrence count of the first URL in the first repeated set, and may then determine whether this second occurrence count is less than or equal to a second threshold. If so (i.e. the second occurrence count is less than or equal to the second threshold), the original URL is downloaded; if not (i.e. the second occurrence count is greater than the second threshold), the original URL is discarded, i.e. not downloaded. By determining whether the first URL obtained through the first generalization process has already been downloaded, the terminal determines whether the original URL needs to be downloaded. This filters out original URLs whose parameter values share the same data format, reducing the number of duplicate URLs downloaded; and because only some of the variables in the original URL's parameter values are reduced, the accuracy of deduplication is preserved. Here, the first repeated set may include the URLs obtained by applying the first generalization process to the URLs already downloaded in the history record, i.e. the first URLs corresponding to the downloaded URLs, where a downloaded URL may be a URL that has not undergone the first generalization process.
For example, suppose the first repeated set includes URL4, URL6, and URL7, and the first URL is URL1: http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2. The terminal checks whether a URL identical to URL1 exists among URL4, URL6, and URL7, i.e. whether URL1 exists in the first repeated set. Since URL1 does not exist in the first repeated set, URL1 does not belong to it, so the terminal removes the parameter values from URL1's parameter part while keeping the parameter names, obtaining the second URL: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2.
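A minimal sketch of this first parameter reduction process (the function name is illustrative, not from the application):

```python
from urllib.parse import urlsplit, parse_qsl

def first_parameter_reduction(url: str) -> str:
    """Drop every parameter value, keeping only the parameter names:
    ...?param1=v1%A&param2=A%A2 -> ...?param1&param2"""
    parts = urlsplit(url)
    names = [name for name, _ in parse_qsl(parts.query, keep_blank_values=True)]
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return f"{base}?{'&'.join(names)}" if names else base
```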
In some feasible implementations, the terminal may use the message-digest algorithm 5 (MD5) to compute the hash value of the first URL and may detect whether this hash value exists in the first repeated set. If it does not, the hash value of the first URL does not belong to the first repeated set, and the first parameter reduction process may be performed on the parameter part of the first URL to obtain the second URL; the first parameter reduction process is used to reduce the parameter variables of the parameter part of the first URL, for example by removing the parameter values and keeping only the parameter names. If it does exist, the hash value of the first URL belongs to the first repeated set, and the occurrence count of this hash value in the first repeated set may be incremented by 1 to obtain the second occurrence count of the first URL's hash value in the first repeated set; it may then be determined whether this second occurrence count is less than or equal to the second threshold. If so (i.e. the second occurrence count is less than or equal to the second threshold), the original URL is downloaded; if not (i.e. the second occurrence count is greater than the second threshold), the original URL is discarded, i.e. not downloaded. Here, the first repeated set may include the hash values obtained by applying the first generalization process and then the MD5 computation to the URLs already downloaded in the history record, i.e. the hash values of the first URLs corresponding to the downloaded URLs. The second threshold is an integer greater than 0. Because a hash function converts data of any size into data of a fixed size, and the downloaded set (the first repeated set) stores the hash values of URLs rather than the complete URLs, storage space can be reduced (a complete URL has many characters, while a hash value is data of a fixed size), and processing efficiency can be improved when detecting whether the first URL belongs to the first repeated set.
For example, as shown in FIG. 2, which is a schematic diagram of the relationship between the first repeated set and the number of occurrences: suppose the second threshold is 10, the elements of the first repeated set are 01, 03, and 06, and the first URL is URL1. The terminal computes the hash value of URL1 as 03 and detects that this hash value is in the first repeated set, so URL1 belongs to the first repeated set. The terminal increments the occurrence count 3 of hash value 03 in the first repeated set by 1, obtaining a second occurrence count of 4 for URL1's hash value 03. Since the second occurrence count 4 is less than the second threshold 10, the terminal downloads the original URL.
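The MD5-plus-occurrence-count bookkeeping described above can be sketched as follows; the class and method names (RepeatedSet, check) are illustrative, not from the application:

```python
import hashlib

class RepeatedSet:
    """A downloaded-set of URL hashes with per-hash occurrence counts,
    storing MD5 digests rather than full URLs, as the text describes."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.counts: dict[str, int] = {}

    @staticmethod
    def digest(url: str) -> str:
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    def check(self, url: str):
        """Return (seen_before, within_threshold). On a hit the count is
        incremented first, matching the text; on a miss the URL is added
        with count 1, i.e. the set is updated for the next original URL."""
        h = self.digest(url)
        if h in self.counts:
            self.counts[h] += 1
            return True, self.counts[h] <= self.threshold
        self.counts[h] = 1
        return False, True
```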
In some feasible implementations, when the terminal detects that the first URL does not belong to the first repeated set, the terminal may add the first URL to the first repeated set to form a new first repeated set, i.e. update the first repeated set, set the occurrence count of the first URL in the first repeated set to 1, and perform the first parameter reduction process on the first URL to obtain the second URL. After downloading or discarding the original URL, the terminal uses the latest first repeated set to detect whether the next original URL has been downloaded. By continuously updating the first repeated set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency. For example, if the first repeated set includes URL4, URL6, and URL7 and the first URL is URL1, then URL1 does not belong to the first repeated set and the terminal may add URL1 to it; the updated first repeated set then includes URL1, URL4, URL6, and URL7. At the same time, the terminal may set the occurrence count of URL1 in the first repeated set to 1.
In some feasible implementations, if the original URL is the first URL crawled from a website, the first repeated set may be empty (because there are no downloaded URLs yet).
S103: If the second URL belongs to a second repeated set, detect whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, download the original URL.
In some feasible implementations, the terminal may detect whether a URL identical to the second URL exists in the second repeated set. If so, the second URL belongs to the second repeated set, and the occurrence count of the second URL in the second repeated set may be incremented by 1 to obtain the first occurrence count of the second URL in the second repeated set; it may then be detected whether this first occurrence count is less than or equal to the first threshold. If so (i.e. the first occurrence count is less than or equal to the first threshold), the original URL is downloaded; if not (i.e. the first occurrence count is greater than the first threshold), the original URL is discarded, i.e. not downloaded. If the terminal detects that the second URL does not exist in the second repeated set, the second URL does not belong to the second repeated set, and the original URL may be downloaded directly. Here, the second repeated set may include the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record, i.e. the second URLs corresponding to the downloaded URLs, where a downloaded URL may be a URL that has not undergone the first parameter reduction process. By determining whether the second URL obtained through the first parameter reduction process has already been downloaded, the terminal determines whether the original URL needs to be downloaded; this filters out original URLs whose parameter values differ, further reducing the number of duplicate URLs downloaded and improving scanning efficiency while preserving accuracy. It should be noted that the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization process only reduces the variables in the original URL's parameter values, the first URL still contains many variables at this point, so fewer URLs are filtered out, more URLs are downloaded, and the first URL appears more often in the first repeated set. Making the first threshold smaller than the second threshold therefore ensures that unfiltered URLs can be filtered out after the first parameter reduction process, achieving hierarchical deduplication.
In some feasible implementations, the terminal may use MD5 to compute the hash value of the second URL and may detect whether that hash value exists in the second repeated set. If it does, the hash value of the second URL belongs to the second repeated set; the occurrence count of the second URL's hash value in the second repeated set may be incremented by 1 to obtain the first occurrence count of that hash value in the second repeated set, and it may be determined whether the first occurrence count is less than or equal to the first threshold. If so (i.e. the first occurrence count is less than or equal to the first threshold), the original URL is downloaded; if not (i.e. the first occurrence count is greater than the first threshold), the original URL is discarded, i.e. not downloaded. If the hash value does not exist, it does not belong to the second repeated set, and the original URL may be downloaded. Here, the second repeated set may include the hash values obtained by applying the first parameter reduction process and then MD5 to the original URLs already downloaded in the history record, i.e. the hash values of the second URLs corresponding to the downloaded original URLs. Because a hash function converts data of any size into data of a fixed size, and the downloaded set (the second repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the second URL belongs to the second repeated set.
In some feasible implementations, when the terminal detects that the second URL does not belong to the second repeated set, it may add the second URL to the second repeated set to form a new second repeated set, i.e. update the second repeated set, set the occurrence count of the second URL in the second repeated set to 1, and download the original URL. After downloading or discarding the original URL, the terminal uses the latest second repeated set to detect whether the next original URL has been downloaded. By continuously updating the second repeated set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
In some feasible implementations, if the original URL is the first URL crawled from a website, the second repeated set may be empty (because there are no downloaded URLs yet).
In the embodiments of the present application, a first generalization process is performed on the original URL, and it is determined whether the resulting first URL belongs to the first repeated set. If the first URL does not belong to the first repeated set, a first parameter reduction process is performed on the first URL to obtain a second URL, and it is determined whether the second URL belongs to the second repeated set. If the second URL belongs to the second repeated set, it is detected whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold: if so, the original URL is downloaded; if not, the original URL is discarded, i.e. not downloaded. In this way the number of duplicate URLs downloaded can be reduced, which improves the scanning efficiency of the WEB vulnerability scanning system.
Referring to FIG. 3, which is another schematic flowchart of a URL deduplication method according to an embodiment of the present application: as shown in FIG. 3, the URL deduplication method may include the following steps.
S301: Perform a first generalization process on an original URL to obtain a first URL.
S302: If the first URL does not belong to a first repeated set, perform a first parameter reduction process on the first URL to obtain a second URL.
For the implementation of steps S301 and S302 in this embodiment, refer to the implementation of steps S101 and S102 in the embodiment shown in FIG. 1, which is not repeated here.
S303: If the second URL belongs to a second repeated set, detect whether the first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, download the original URL.
S304: If the second URL does not belong to the second repeated set, perform a second parameter reduction process on the second URL to obtain a third URL.
In some feasible implementations, the terminal may detect whether a URL identical to the second URL exists in the second repeated set. If so, the second URL belongs to the second repeated set; the occurrence count of the second URL in the second repeated set may be incremented by 1 to obtain the first occurrence count of the second URL in the second repeated set, and it may be detected whether this first occurrence count is less than or equal to the first threshold. If so (i.e. the first occurrence count is less than or equal to the first threshold), the original URL is downloaded; if not (i.e. the first occurrence count is greater than the first threshold), the original URL is discarded, i.e. not downloaded. If the terminal detects that the second URL does not exist in the second repeated set, the second URL does not belong to the second repeated set, and a second parameter reduction process may be performed on the second URL to obtain a third URL. The second parameter reduction process is used to reduce the parameter variables in the second URL, for example by removing the parameter part (both parameter values and parameter names) of the second URL. Here, the second repeated set may include the URLs obtained by applying the first parameter reduction process to the URLs already downloaded in the history record, i.e. the second URLs corresponding to the downloaded URLs, where a downloaded URL may be a URL that has not undergone the first parameter reduction process. By determining whether the second URL obtained through the first parameter reduction process has already been downloaded, the terminal determines whether the original URL needs to be downloaded; this filters out original URLs whose parameter values differ, further reducing the number of duplicate URLs downloaded and improving scanning efficiency while preserving accuracy. It should be noted that the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization process only reduces the variables in the original URL's parameter values, the first URL still contains many variables at this point, so fewer URLs are filtered out, more URLs are downloaded, and the first URL appears more often in the first repeated set. Making the first threshold smaller than the second threshold therefore ensures that URLs not filtered out in steps S301 to S302 can be filtered out after the first parameter reduction process, achieving hierarchical deduplication.
For example, as shown in FIG. 4, which is a schematic diagram of the relationship between the second repeated set and the number of occurrences: suppose the first threshold is 7, the second repeated set includes URL2 and URL5, and the second URL is URL2. The terminal detects that URL2 is in the second repeated set, so URL2 belongs to the second repeated set; the terminal increments the occurrence count 1 of URL2 in the second repeated set by 1, obtaining a first occurrence count of 2 for URL2 in the second repeated set. Since the first occurrence count 2 is less than the first threshold 7, the original URL is downloaded.
In some feasible implementations, the terminal may use MD5 to compute the hash value of the second URL and may detect whether that hash value exists in the second repeated set. If it does, the hash value of the second URL belongs to the second repeated set; the occurrence count of the second URL's hash value in the second repeated set may be incremented by 1 to obtain the first occurrence count of that hash value in the second repeated set, and it may be determined whether the first occurrence count is less than or equal to the first threshold. If so (i.e. the first occurrence count is less than or equal to the first threshold), the original URL is downloaded; if not (i.e. the first occurrence count is greater than the first threshold), the original URL is discarded, i.e. not downloaded. If the hash value does not exist, it does not belong to the second repeated set, and a second parameter reduction process may be performed on the second URL to obtain a third URL. The second parameter reduction process is used to reduce the parameter variables in the second URL, for example by removing the parameter part (both parameter values and parameter names) of the second URL. Here, the second repeated set may include the hash values obtained by applying the first parameter reduction process and then MD5 to the URLs already downloaded in the history record, i.e. the hash values of the second URLs corresponding to the downloaded URLs. Because a hash function converts data of any size into data of a fixed size, and the downloaded set (the second repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the second URL belongs to the second repeated set.
For example, suppose the elements of the second repeated set are 07 and 09, and the second URL is URL2: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2. The terminal computes the hash value of URL2 as 04 and detects that hash value 04 is not in the second repeated set, so URL2 does not belong to the second repeated set. The terminal removes the parameter part of URL2 to obtain the third URL: http://xxx.pingan.com/cgi-bin/index1.html.
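A minimal sketch of the second parameter reduction process, which drops the parameter part entirely:

```python
from urllib.parse import urlsplit

def second_parameter_reduction(url: str) -> str:
    """Remove the parameter part (names and values) entirely:
    ...index1.html?param1&param2 -> ...index1.html"""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"
```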
In some feasible implementations, when the terminal detects that the second URL does not belong to the second repeated set, it may add the second URL to the second repeated set to form a new second repeated set, i.e. update the second repeated set, set the occurrence count of the second URL in the second repeated set to 1, and perform the second parameter reduction process on the second URL to obtain a third URL. After downloading or discarding the original URL, the terminal uses the latest second repeated set to detect whether the next original URL has been downloaded. By continuously updating the second repeated set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
In some feasible implementations, if the original URL is the first URL crawled from a website, the second repeated set may be empty (because there are no downloaded URLs yet).
S305: If the third URL belongs to a third repeated set, detect whether the third occurrence count of the third URL in the third repeated set is less than or equal to a third threshold, and if so, download the original URL.
S306: If the third URL does not belong to the third repeated set, perform a second generalization process on the third URL to obtain a fourth URL.
In some feasible implementations, the terminal may detect whether a URL identical to the third URL exists in the third repeated set. If so, the third URL belongs to the third repeated set; the occurrence count of the third URL in the third repeated set may be incremented by 1 to obtain the third occurrence count of the third URL in the third repeated set, and it may be detected whether this third occurrence count is less than or equal to the third threshold. If so (i.e. the third occurrence count is less than or equal to the third threshold), the original URL is downloaded; if not (i.e. the third occurrence count is greater than the third threshold), the original URL is discarded, i.e. not downloaded. If the terminal detects that the third URL does not exist in the third repeated set, the third URL does not belong to the third repeated set, and a second generalization process may be performed on the file name part of the third URL to obtain a fourth URL. The second generalization process is used to replace at least one character of a target type in the file name part of the third URL with a target character, for example replacing one or more digits in the file name part of the third URL with the preset single digit "1". Here, the third repeated set may include the URLs obtained by applying the second parameter reduction process to the URLs already downloaded in the history record, i.e. the third URLs corresponding to the downloaded URLs, where a downloaded URL may be a URL that has not undergone the second parameter reduction process. By determining whether the third URL obtained through the second parameter reduction process has already been downloaded, the terminal determines whether the original URL needs to be downloaded; this filters out original URLs whose parameter parts differ, i.e. more variables in the original URL are reduced and more original URLs are discarded, which further reduces the number of duplicate URLs downloaded and further improves scanning efficiency while preserving accuracy. It should be noted that the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the third threshold may be an integer greater than 0. Because the first parameter reduction process only reduces the variables of the original URL's parameter part, the second URL still contains many variables at this point, so fewer URLs are filtered out, more URLs are downloaded, and the second URL appears more often in the second repeated set. Making the third threshold smaller than the first threshold therefore ensures that URLs not filtered out in steps S303 to S304 can be filtered out after the second parameter reduction process, achieving hierarchical deduplication.
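A sketch of the second generalization process, under the same assumptions as before (runs of digits in the file name part are replaced with the preset digit "1"):

```python
import re
from urllib.parse import urlsplit

def second_generalization(url: str) -> str:
    """Replace each run of digits in the file name part with the preset
    digit '1', so e.g. /cgi-bin/index237.html and /cgi-bin/index8.html
    both become /cgi-bin/index1.html."""
    parts = urlsplit(url)
    head, _, filename = parts.path.rpartition("/")
    filename = re.sub(r"\d+", "1", filename)
    return f"{parts.scheme}://{parts.netloc}{head}/{filename}"
```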
在一些可行的实施方式中,终端可以利用MD5计算上述第三URL的哈希(hash)值,并可以检测该第三URL的hash值是否存在该第三重复集合中,若存在,说明上述第三URL的hash值属于该第三重复集合,可以将该第三URL的hash值在该第三重复集合中的出现次数加1,得到该第三URL的hash值在该第三重复集合中的第三出现次数,并可以判断该第三出现次数是否小于或等于第三阈值,若是(即该第三出现次数小于或等于该第三阈值),则下载上述原始URL,若否(即该第三出现次数大于该第三阈值),则丢弃上述原始URL,即不下载该原始URL。若不存在,说明上述第三URL的hash值不属于该第三重复集合,则可以对该第三URL进行第二泛化处理得到第四URL。该第二泛化处理用于将第三URL文件名部分中目标类型的至少一个字符替换为目标字符,如将该第三URL的文件名部分中的一个或多个数字替换为预设的单个数字“1”。其中,第三重复集合可以包括历史记录中已下载的URL经过上述第二减参处理后得到的URL,再经过MD5计算后得到的hash值,即已下载的URL对应的第三URL的hash值。因为hash函数是将任意大小的数据转换成特定大小的数据的函数,且已下载集合(第三重复集合)中存储的是URL的hash值,而不是完整的URL,可以进一步减少存储空间,同时在检测第三URL是否属于第三重复集合时,可以进一步提高处理效率。
在一些可行的实施方式中,终端在检测出该第三URL不属于该第三重复集合时,终端可以将该第三URL加入该第三重复集合形成新的第三重复集合,即更新该第三重复集合,同时可以将该第三URL在该第三重复集合中的出现次数置为1,并可以对该第三URL进行第二泛化处理得到第四URL。终端在下载上述原始URL或丢弃上述原始URL之后,终端利用最新的第三重复集合检测下一个原始URL是否被下载过。终端通过不断更新第三重复集合,可以更准确地分级过滤掉重复的URL,并且可以进一步提高扫描效率。
In some feasible implementations, if the original URL is the first URL crawled from a given website, the third repeated set may be empty (because no URL has been downloaded yet).
S307: if the fourth URL belongs to a fourth repeated set, detect whether the fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold; if so, download the original URL.
S308: if the fourth URL does not belong to the fourth repeated set, download the original URL.
In some feasible implementations, the terminal may detect whether the fourth repeated set contains a URL identical to the fourth URL. If it does, the fourth URL belongs to the fourth repeated set; the terminal may add 1 to the occurrence count of the fourth URL in the fourth repeated set to obtain the fourth occurrence count, and may detect whether the fourth occurrence count is less than or equal to the fourth threshold. If so (the fourth occurrence count is less than or equal to the fourth threshold), the original URL is downloaded; if not (the fourth occurrence count is greater than the fourth threshold), the original URL is discarded, that is, not downloaded. If the terminal detects that the fourth URL is not in the fourth repeated set, the fourth URL does not belong to the fourth repeated set, and the terminal may directly download the original URL. The fourth repeated set may include the URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history, that is, the fourth URLs corresponding to the downloaded URLs. By judging whether the fourth URL obtained through the second generalization processing has already been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs whose file-name parts differ, reducing the variables in the original URL, so more original URLs are discarded and fewer duplicate URLs are downloaded, further improving scanning efficiency while preserving accuracy. It should be noted that the fourth threshold is less than or equal to the third threshold, the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the fourth threshold is an integer greater than or equal to 0. Making the fourth threshold smaller than the third threshold ensures that URLs not filtered out in steps S305-S306 can be filtered out after the second generalization processing, achieving hierarchical deduplication.
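The threshold ordering stated here can be captured as a simple configuration invariant; the following Python values are assumptions for illustration only:

# Assumed example values honoring: fourth <= third <= first <= second.
SECOND_THRESHOLD = 10   # loosest level: the first repeated set retains the most variables
FIRST_THRESHOLD = 7
THIRD_THRESHOLD = 5
FOURTH_THRESHOLD = 2    # strictest level: may be any integer >= 0
assert FOURTH_THRESHOLD <= THIRD_THRESHOLD <= FIRST_THRESHOLD <= SECOND_THRESHOLD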
In some feasible implementations, the terminal may compute the hash value of the fourth URL using MD5 and detect whether that hash value is in the fourth repeated set. If it is, the hash value of the fourth URL belongs to the fourth repeated set; the terminal may add 1 to the occurrence count of that hash value in the fourth repeated set to obtain the fourth occurrence count, and may judge whether the fourth occurrence count is less than or equal to the fourth threshold. If so (the fourth occurrence count is less than or equal to the fourth threshold), the original URL is downloaded; if not (the fourth occurrence count is greater than the fourth threshold), the original URL is discarded, that is, not downloaded. If the hash value is not present, the hash value of the fourth URL does not belong to the fourth repeated set, and the terminal may directly download the original URL. Here the fourth repeated set may include the hash values obtained by applying the first and second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history and then computing MD5, that is, the hash values of the fourth URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the fourth repeated set) stores URL hash values rather than complete URLs, storage space can be further reduced, and detecting whether the fourth URL belongs to the fourth repeated set can be made more efficient.
In some feasible implementations, when the terminal detects that the fourth URL does not belong to the fourth repeated set, the terminal may add the fourth URL to the fourth repeated set to form a new fourth repeated set, that is, update the fourth repeated set, set the occurrence count of the fourth URL in the fourth repeated set to 1, and download the original URL. After downloading or discarding the original URL, the terminal uses the latest fourth repeated set to detect whether the next original URL has been downloaded. By continuously updating the fourth repeated set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
In some feasible implementations, if the original URL is the first URL crawled from a given website, the fourth repeated set may be empty (because no URL has been downloaded yet).
In the embodiments of the present application, hierarchical deduplication is performed on an original URL: the variables in the original URL are reduced level by level, and whether the original URL has been downloaded is judged based on the URL with variables reduced level by level. If a level determines that the original URL has been downloaded, the original URL is discarded; if a level determines that the original URL has not been downloaded, the original URL is downloaded; if a level cannot determine whether the original URL has been downloaded, judgment proceeds to the next level, until it is determined that the original URL has or has not been downloaded. This finer-grained hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of a web vulnerability scanning system, but also improves deduplication accuracy.
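Putting the four levels together, the following end-to-end Python sketch illustrates the hierarchical flow of steps S301-S308. It is an editorial sketch under assumptions, not the disclosed implementation: the concrete transforms (collapsing digit runs, stripping parameter values, stripping the query part, generalizing the file name) are one plausible instantiation of the processing described above, and every identifier and threshold value is invented for illustration.

import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

def md5(url):
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def first_generalize(url):
    # First generalization: collapse runs of same-type characters into one;
    # sketched here for digit runs only (e.g. "id=2018" -> "id=2").
    return re.sub(r"(\d)\d+", r"\1", url)

def first_reduce(url):
    # First parameter-reduction: drop parameter values, keep parameter names
    # ("?a=1&b=2" -> "?a&b", matching the form of URL2 in the examples above).
    return re.sub(r"=[^&#]*", "", url)

def second_reduce(url):
    # Second parameter-reduction: drop the whole query part.
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc, p.path, "", ""))

def second_generalize(url):
    # Second generalization: digit runs in the file name -> preset '1'.
    p = urlsplit(url)
    head, _, tail = p.path.rpartition("/")
    return urlunsplit((p.scheme, p.netloc,
                       head + "/" + re.sub(r"\d+", "1", tail), "", ""))

# (repeated set name, threshold checked at that level, transform to next level)
LEVELS = [
    ("first_set", 10, first_reduce),       # first URLs, second threshold
    ("second_set", 7, second_reduce),      # second URLs, first threshold
    ("third_set", 5, second_generalize),   # third URLs, third threshold
    ("fourth_set", 2, None),               # fourth URLs, fourth threshold
]
SETS = {name: {} for name, _, _ in LEVELS}  # md5(url) -> occurrence count

def should_download(original_url):
    url = first_generalize(original_url)    # step S301
    for set_name, threshold, next_transform in LEVELS:
        digest, bucket = md5(url), SETS[set_name]
        if digest in bucket:                # seen at this level: threshold decides
            bucket[digest] += 1
            return bucket[digest] <= threshold
        bucket[digest] = 1                  # new at this level: record it
        if next_transform is None:
            return True                     # new even at the fourth level: download
        url = next_transform(url)           # descend to the next, stricter level

A crawler would call should_download on each crawled URL before enqueueing it and skip any URL for which it returns False.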
Referring to FIG. 5, which is a schematic block diagram of a URL deduplication device provided by an embodiment of the present application. The URL deduplication device provided by this embodiment includes:
a first generalization processing module 10, configured to perform first generalization processing on an original URL to obtain a first URL, the first generalization processing being used to replace multiple consecutive characters of the same type in the original URL with a single character;
a first parameter-reduction processing module 20, configured to, when the first URL does not belong to a first repeated set, perform first parameter-reduction processing on the first URL to obtain a second URL, the first parameter-reduction processing being used to reduce the parameter variables in the first URL, the first repeated set including URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
a download module 30, configured to, when the second URL belongs to a second repeated set, detect whether a first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, download the original URL, the second repeated set including URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the history.
In some feasible implementations, the device further includes an acquisition module 40, configured to, when the first URL belongs to the first repeated set, acquire a second occurrence count of the first URL in the first repeated set; the download module 30 is further configured to download the original URL when the second occurrence count is less than or equal to a second threshold, where the second threshold is greater than or equal to the first threshold.
In some feasible implementations, the first parameter-reduction processing module 20 includes a computing unit 201, a detection unit 202, and a first parameter-reduction processing unit 203. The computing unit 201 is configured to compute a hash value of the first URL; the detection unit 202 is configured to detect whether the hash value of the first URL belongs to the first repeated set; the first parameter-reduction processing unit 203 is configured to, when the hash value of the first URL does not belong to the first repeated set, perform the first parameter-reduction processing on the first URL to obtain the second URL. Here, the first repeated set includes the hash values of URLs obtained by applying the first generalization processing to URLs already downloaded in the history.
In some feasible implementations, the device further includes a second parameter-reduction processing module 50, configured to, when the second URL does not belong to the second repeated set, perform second parameter-reduction processing on the second URL to obtain a third URL; the download module 30 is further configured to, when the third URL belongs to a third repeated set, detect whether a third occurrence count of the third URL in the third repeated set is less than or equal to a third threshold, and if so, download the original URL. Here, the third repeated set includes URLs obtained by applying the second parameter-reduction processing to URLs already downloaded in the history, the second parameter-reduction processing is used to reduce the parameter variables in the second URL, and the third threshold is less than or equal to the first threshold.
In some feasible implementations, the device further includes a second generalization processing module 60, configured to, when the third URL does not belong to the third repeated set, perform second generalization processing on the third URL to obtain a fourth URL; the download module 30 is further configured to, when the fourth URL belongs to a fourth repeated set, detect whether a fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold, and if so, download the original URL. Here, the fourth repeated set includes URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to URLs already downloaded in the history, the second generalization processing is used to replace at least one character of a target type in the third URL with a target character, and the fourth threshold is less than or equal to the third threshold.
In some feasible implementations, the download module 30 is further configured to download the original URL when the fourth URL does not belong to the fourth repeated set.
In some feasible implementations, the download module 30 is further configured to, after the second URL belongs to the second repeated set and before detecting whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold, add 1 to the occurrence count of the second URL in the second repeated set to obtain the first occurrence count.
In specific implementation, the URL deduplication device may, through the above modules, execute the implementations provided by the steps of FIG. 1 or FIG. 3, realizing the functions realized in the above embodiments; for details, refer to the corresponding descriptions of the steps in the method embodiments shown in FIG. 1 or FIG. 3, which are not repeated here.
In the embodiments of the present application, the URL deduplication device can perform hierarchical deduplication on an original URL, reduce the variables in the original URL level by level, and judge, based on the URL with variables reduced level by level, whether the original URL has been downloaded. If a level determines that the original URL has been downloaded, the original URL is discarded; if a level determines that the original URL has not been downloaded, the original URL is downloaded; if a level cannot determine whether the original URL has been downloaded, judgment proceeds to the next level, until it is determined that the original URL has or has not been downloaded. This finer-grained hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of a web vulnerability scanning system, but also improves deduplication accuracy.
Referring to FIG. 6, which is a schematic block diagram of a terminal provided by an embodiment of the present application. As shown in FIG. 6, the terminal in this embodiment may include one or more processors 601 and a memory 602, connected by a bus 603. The memory 602 is configured to store a computer program comprising program instructions, and the processor 601 is configured to execute the program instructions stored in the memory 602. The processor 601 is configured to invoke the program instructions to execute:
performing first generalization processing on an original URL to obtain a first URL, the first generalization processing being used to replace multiple consecutive characters of the same type in the original URL with a single character;
if the first URL does not belong to a first repeated set, performing first parameter-reduction processing on the first URL to obtain a second URL, the first parameter-reduction processing being used to reduce the parameter variables in the first URL, the first repeated set including URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
if the second URL belongs to a second repeated set, detecting whether a first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if so, downloading the original URL, the second repeated set including URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the history.
It should be understood that, in some feasible implementations, the processor 601 may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 602 may include a read-only memory and a random access memory, and provides instructions and data to the processor 601. A part of the memory 602 may further include a non-volatile random access memory. For example, the memory 602 may also store information about the device type.
In specific implementation, the processor 601 described in the embodiments of the present application may execute the implementations described in the URL deduplication method provided by the embodiments of the present application, and may also execute the implementations of the URL deduplication device described in the embodiments of the present application, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, implement the URL deduplication method shown in FIG. 1 or FIG. 3; for details, refer to the description of the embodiments shown in FIG. 1 or FIG. 3, which is not repeated here.
The above are only specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

  1. A URL deduplication method, comprising:
    performing first generalization processing on an original URL to obtain a first URL, wherein the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
    if the first URL does not belong to a first repeated set, performing first parameter-reduction processing on the first URL to obtain a second URL, wherein the first parameter-reduction processing is used to reduce parameter variables in the first URL, and the first repeated set comprises URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
    if the second URL belongs to a second repeated set, detecting whether a first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if the first occurrence count is less than or equal to the first threshold, downloading the original URL, wherein the second repeated set comprises URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the history.
  2. The method according to claim 1, further comprising:
    if the first URL belongs to the first repeated set, acquiring a second occurrence count of the first URL in the first repeated set;
    if the second occurrence count is less than or equal to a second threshold, downloading the original URL;
    wherein the second threshold is greater than or equal to the first threshold.
  3. The method according to claim 1, wherein if the first URL does not belong to the first repeated set, performing the first parameter-reduction processing on the first URL to obtain the second URL comprises:
    computing a hash value of the first URL;
    detecting whether the hash value of the first URL belongs to the first repeated set, wherein the first repeated set comprises hash values of URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
    if the hash value of the first URL does not belong to the first repeated set, performing the first parameter-reduction processing on the first URL to obtain the second URL.
  4. The method according to any one of claims 1-3, further comprising:
    if the second URL does not belong to the second repeated set, performing second parameter-reduction processing on the second URL to obtain a third URL, wherein the second parameter-reduction processing is used to reduce parameter variables in the second URL;
    if the third URL belongs to a third repeated set, detecting whether a third occurrence count of the third URL in the third repeated set is less than or equal to a third threshold, and if the third occurrence count is less than or equal to the third threshold, downloading the original URL, wherein the third repeated set comprises URLs obtained by applying the second parameter-reduction processing to URLs already downloaded in the history;
    wherein the third threshold is less than or equal to the first threshold.
  5. The method according to claim 4, further comprising:
    if the third URL does not belong to the third repeated set, performing second generalization processing on the third URL to obtain a fourth URL, wherein the second generalization processing is used to replace at least one character of a target type in the third URL with a target character;
    if the fourth URL belongs to a fourth repeated set, detecting whether a fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold, and if the fourth occurrence count is less than or equal to the fourth threshold, downloading the original URL, wherein the fourth repeated set comprises URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to URLs already downloaded in the history;
    wherein the fourth threshold is less than or equal to the third threshold.
  6. The method according to claim 5, further comprising:
    if the fourth URL does not belong to the fourth repeated set, downloading the original URL.
  7. The method according to any one of claims 1-6, wherein after the second URL belongs to the second repeated set and before detecting whether the first occurrence count of the second URL in the second repeated set is less than or equal to the first threshold, the method further comprises:
    adding 1 to the occurrence count of the second URL in the second repeated set to obtain the first occurrence count.
  8. A URL deduplication device, comprising:
    a first generalization processing module, configured to perform first generalization processing on an original URL to obtain a first URL, wherein the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
    a first parameter-reduction processing module, configured to, when the first URL does not belong to a first repeated set, perform first parameter-reduction processing on the first URL to obtain a second URL, wherein the first parameter-reduction processing is used to reduce parameter variables in the first URL, and the first repeated set comprises URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
    a download module, configured to, when the second URL belongs to a second repeated set, detect whether a first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if the first occurrence count is less than or equal to the first threshold, download the original URL, wherein the second repeated set comprises URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the history.
  9. The device according to claim 8, further comprising:
    an acquisition module, configured to, when the first URL belongs to the first repeated set, acquire a second occurrence count of the first URL in the first repeated set;
    the download module being further configured to download the original URL when the second occurrence count is less than or equal to a second threshold;
    wherein the second threshold is greater than or equal to the first threshold.
  10. The device according to claim 8, wherein the first parameter-reduction processing module comprises:
    a computing unit, configured to compute a hash value of the first URL;
    a detection unit, configured to detect whether the hash value of the first URL belongs to the first repeated set, wherein the first repeated set comprises hash values of URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
    a first parameter-reduction processing unit, configured to, when the hash value of the first URL does not belong to the first repeated set, perform the first parameter-reduction processing on the first URL to obtain the second URL.
  11. The device according to any one of claims 8-10, further comprising:
    a second parameter-reduction processing module, configured to, when the second URL does not belong to the second repeated set, perform second parameter-reduction processing on the second URL to obtain a third URL, wherein the second parameter-reduction processing is used to reduce parameter variables in the second URL;
    the download module being further configured to, when the third URL belongs to a third repeated set, detect whether a third occurrence count of the third URL in the third repeated set is less than or equal to a third threshold, and if the third occurrence count is less than or equal to the third threshold, download the original URL, wherein the third repeated set comprises URLs obtained by applying the second parameter-reduction processing to URLs already downloaded in the history;
    wherein the third threshold is less than or equal to the first threshold.
  12. The device according to claim 11, further comprising:
    a second generalization processing module, configured to, when the third URL does not belong to the third repeated set, perform second generalization processing on the third URL to obtain a fourth URL, wherein the second generalization processing is used to replace at least one character of a target type in the third URL with a target character;
    the download module being further configured to, when the fourth URL belongs to a fourth repeated set, detect whether a fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold, and if the fourth occurrence count is less than or equal to the fourth threshold, download the original URL, wherein the fourth repeated set comprises URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to URLs already downloaded in the history;
    wherein the fourth threshold is less than or equal to the third threshold.
  13. The device according to claim 12, wherein the download module is further configured to download the original URL when the fourth URL does not belong to the fourth repeated set.
  14. The device according to any one of claims 8-13, wherein the download module is further configured to add 1 to the occurrence count of the second URL in the second repeated set to obtain the first occurrence count.
  15. A terminal, comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to execute:
    performing first generalization processing on an original URL to obtain a first URL, wherein the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
    if the first URL does not belong to a first repeated set, performing first parameter-reduction processing on the first URL to obtain a second URL, wherein the first parameter-reduction processing is used to reduce parameter variables in the first URL, and the first repeated set comprises URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
    if the second URL belongs to a second repeated set, detecting whether a first occurrence count of the second URL in the second repeated set is less than or equal to a first threshold, and if the first occurrence count is less than or equal to the first threshold, downloading the original URL, wherein the second repeated set comprises URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the history.
  16. The terminal according to claim 15, wherein the processor is further configured to:
    if the first URL belongs to the first repeated set, acquire a second occurrence count of the first URL in the first repeated set;
    if the second occurrence count is less than or equal to a second threshold, download the original URL;
    wherein the second threshold is greater than or equal to the first threshold.
  17. The terminal according to claim 15, wherein the processor is specifically configured to:
    compute a hash value of the first URL;
    detect whether the hash value of the first URL belongs to the first repeated set, wherein the first repeated set comprises hash values of URLs obtained by applying the first generalization processing to URLs already downloaded in the history;
    if the hash value of the first URL does not belong to the first repeated set, perform the first parameter-reduction processing on the first URL to obtain the second URL.
  18. The terminal according to any one of claims 15-17, wherein the processor is further configured to:
    if the second URL does not belong to the second repeated set, perform second parameter-reduction processing on the second URL to obtain a third URL, wherein the second parameter-reduction processing is used to reduce parameter variables in the second URL;
    if the third URL belongs to a third repeated set, detect whether a third occurrence count of the third URL in the third repeated set is less than or equal to a third threshold, and if the third occurrence count is less than or equal to the third threshold, download the original URL, wherein the third repeated set comprises URLs obtained by applying the second parameter-reduction processing to URLs already downloaded in the history;
    wherein the third threshold is less than or equal to the first threshold.
  19. The terminal according to claim 18, wherein the processor is further configured to:
    if the third URL does not belong to the third repeated set, perform second generalization processing on the third URL to obtain a fourth URL, wherein the second generalization processing is used to replace at least one character of a target type in the third URL with a target character;
    if the fourth URL belongs to a fourth repeated set, detect whether a fourth occurrence count of the fourth URL in the fourth repeated set is less than or equal to a fourth threshold, and if the fourth occurrence count is less than or equal to the fourth threshold, download the original URL, wherein the fourth repeated set comprises URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to URLs already downloaded in the history;
    wherein the fourth threshold is less than or equal to the third threshold.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1-7.
PCT/CN2018/108708 2018-07-05 2018-09-29 URL deduplication method and device WO2020006908A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810735305.5 2018-07-05
CN201810735305.5A CN108984703B (zh) 2018-07-05 2018-07-05 Uniform resource locator (URL) deduplication method and device

Publications (1)

Publication Number Publication Date
WO2020006908A1 (zh)

Family

ID=64536296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108708 WO2020006908A1 (zh) 2018-07-05 2018-09-29 URL deduplication method and device

Country Status (2)

Country Link
CN (1) CN108984703B (zh)
WO (1) WO2020006908A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008419B (zh) * 2019-03-11 2023-07-14 Advanced New Technologies Co., Ltd. Web page deduplication method, device and equipment
CN110825947B (zh) * 2019-10-31 2024-03-08 Shenzhen Qianhai WeBank Co., Ltd. URL deduplication method, device, equipment and computer-readable storage medium
CN112906005A (zh) * 2021-02-02 2021-06-04 Zhejiang Dahua Technology Co., Ltd. Web vulnerability scanning method, device and system, electronic device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102185741B (zh) * 2011-06-10 2013-06-26 Zhejiang University Method for estimating processor requirements of transactions under a multi-layer architecture
CA2824977C (en) * 2012-08-30 2019-03-19 Accenture Global Services Limited Online content collection
CN106815247B (zh) * 2015-11-30 2020-05-22 Beijing Gridsum Technology Co., Ltd. Uniform resource locator acquisition method and device
CN106844389B (zh) * 2015-12-07 2021-05-04 Alibaba Group Holding Limited Method and device for processing network resource address URL

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013143363A1 (en) * 2012-03-29 2013-10-03 Tencent Technology (Shenzhen) Company Limited A method and apparatus for data storage and downloading
CN104933056A (zh) * 2014-03-18 2015-09-23 Tencent Technology (Shenzhen) Company Limited Uniform resource locator deduplication method and device
CN106547764A (zh) * 2015-09-18 2017-03-29 Beijing Gridsum Technology Co., Ltd. Method and device for deduplicating web page data
CN106919570A (zh) * 2015-12-24 2017-07-04 Academy of Broadcasting Science, State Administration of Press, Publication, Radio, Film and Television Page link deduplication scanning method and device for online new media
CN107885820A (zh) * 2017-11-07 2018-04-06 Beijing Xiaodu Huyu Technology Co., Ltd. Breadth-traversal directed crawling method based on a crawler system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259282A (zh) * 2020-02-13 2020-06-09 Shenzhen Tencent Computer Systems Co., Ltd. URL deduplication method and device, electronic device, and computer-readable storage medium
CN111259282B (zh) * 2020-02-13 2023-08-29 Shenzhen Tencent Computer Systems Co., Ltd. URL deduplication method and device, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN108984703B (zh) 2023-04-18
CN108984703A (zh) 2018-12-11

Similar Documents

Publication Publication Date Title
WO2020006908A1 (zh) URL deduplication method and device
US20190334948A1 (en) Webshell detection method and apparatus
RU2551820C2 (ru) Способ и устройство для проверки файловой системы на наличие вирусов
CN107241296B (zh) 一种Webshell的检测方法及装置
US10855701B2 (en) Methods and devices for automatically detecting attack signatures and generating attack signature identifications
CN109246064B (zh) 安全访问控制、网络访问规则的生成方法、装置及设备
US20140325596A1 (en) Authentication of ip source addresses
WO2012089005A1 (zh) 钓鱼网页检测方法及设备
WO2018001078A1 (zh) 一种url匹配方法、装置及存储介质
JP2017510894A (ja) エンドポイントからのネットワークリクエストに言語分析を適用するマルウェアに感染しているマシンを識別するためのシステム
CN104978526A (zh) 病毒特征的提取方法及装置
CN106570025B (zh) 一种数据过滤的方法及装置
US9749295B2 (en) Systems and methods for internet traffic analysis
WO2020006909A1 (zh) URL deduplication method and device
CN109743309B (zh) 一种非法请求识别方法、装置及电子设备
IL243418B (en) System and method for monitoring computer network security
CN109542462B (zh) 一种系统环境部署方法、存储介质和服务器
CN110868419A (zh) Web后门攻击事件的检测方法、装置及电子设备
CN113660215A (zh) 基于Web应用防火墙的攻击行为检测方法以及装置
CN106339372B (zh) 搜索引擎优化的方法和装置
CN109361658B (zh) 基于工控行业的异常流量信息存储方法、装置及电子设备
CN110717036B (zh) 一种统一资源定位符的去重方法、装置及电子设备
CN109361674B (zh) 旁路接入的流式数据检测方法、装置以及电子设备
CN107844702B (zh) 基于云防护环境下网站木马后门检测方法及装置
US20170185772A1 (en) Information processing system, information processing method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18925671

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18925671

Country of ref document: EP

Kind code of ref document: A1