CN108984703A - Uniform resource locator (URL) deduplication method and device - Google Patents

Uniform resource locator (URL) deduplication method and device

Info

Publication number
CN108984703A
Authority
CN
China
Prior art keywords
url
repeats
original
gather
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810735305.5A
Other languages
Chinese (zh)
Other versions
CN108984703B (en)
Inventor
熊庆昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810735305.5A priority Critical patent/CN108984703B/en
Priority to PCT/CN2018/108708 priority patent/WO2020006908A1/en
Publication of CN108984703A publication Critical patent/CN108984703A/en
Application granted granted Critical
Publication of CN108984703B publication Critical patent/CN108984703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Abstract

An embodiment of the present application discloses a uniform resource locator (URL) deduplication method and device. The method includes: performing first generalization processing on an original URL to obtain a first URL, and determining whether the first URL belongs to a first duplicate set; if the first URL does not belong to the first duplicate set, performing first parameter-reduction processing on the first URL to obtain a second URL, and determining whether the second URL belongs to a second duplicate set; if the second URL belongs to the second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold; if so, downloading the original URL; if not, discarding the original URL without downloading it. The embodiments of the present application can reduce the number of duplicate URLs that are downloaded, thereby improving the scanning efficiency of a WEB vulnerability scanning system.

Description

Uniform resource locator (URL) deduplication method and device
Technical field
The present application relates to the field of Internet technologies, and in particular to a uniform resource locator (URL) deduplication method and device.
Background
Many World Wide Web (WEB) vulnerability scanning systems in the industry have their own uniform resource locator (URL) crawler and their own URL deduplication method. A URL crawler works as follows. The crawler system first carefully selects a subset of web pages from the Internet and uses their link addresses as seed URLs, which are placed in a to-be-crawled URL queue. The crawler reads URLs from this queue in turn and resolves each one through the domain name system (DNS), converting the link address into the IP address of the corresponding website server. The IP address and the relative path of the web page are then handed to a page downloader, which is responsible for downloading the page. A page that has been downloaded locally is, on the one hand, stored in a page library to await subsequent processing such as indexing; on the other hand, its URL is placed in a crawled queue, which records the URLs of the pages the crawler system has already downloaded so that the system does not crawl them again.
URL deduplication means removing URLs that would otherwise be crawled repeatedly, so that the same web page is not crawled more than once. For example, each given URL is mapped to a physical address. To detect whether a given URL is a duplicate, it is only necessary to determine whether the physical address corresponding to that URL already exists: if it does, the URL has already been downloaded and the download is abandoned; otherwise, the URL is placed in the to-be-crawled queue to await downloading. However, many URLs within a website differ only in their parameter part. Such URLs have very likely already been downloaded, yet the physical addresses obtained by mapping them are different. Judging duplication solely by the physical address of a URL therefore causes many duplicate URLs to be crawled, which reduces the scanning efficiency of the WEB vulnerability scanning system.
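The limitation described above can be illustrated with a minimal sketch. It assumes that the "physical address" is simply a key derived from the full URL string (MD5 is used here as a stand-in, and the `example.com` URLs are hypothetical); the background does not specify the actual mapping.

```python
import hashlib


def exact_key(url: str) -> str:
    """Map a URL to a fixed-size key (assumed stand-in for the 'physical address')."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


seen = set()      # keys of URLs that have already been queued
to_fetch = []     # URLs that pass exact-match deduplication

for url in [
    "http://example.com/item?id=1",
    "http://example.com/item?id=2",  # differs only in the parameter value
    "http://example.com/item?id=1",  # exact repeat
]:
    key = exact_key(url)
    if key in seen:
        continue  # already known: abandon the download
    seen.add(key)
    to_fetch.append(url)
```

Only the exact repeat is filtered; the two URLs that differ in nothing but the value of `id` both remain in `to_fetch`, which is the inefficiency the application sets out to reduce.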
Summary of the invention
Embodiments of the present application provide a uniform resource locator (URL) deduplication method and device that can reduce the number of duplicate URLs that are downloaded, thereby improving the scanning efficiency of a WEB vulnerability scanning system.
In a first aspect, an embodiment of the present application provides a URL deduplication method, the method including:
performing first generalization processing on an original URL to obtain a first URL, where the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
if the first URL does not belong to a first duplicate set, performing first parameter-reduction processing on the first URL to obtain a second URL, where the first parameter-reduction processing is used to reduce the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record;
if the second URL belongs to a second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and if so, downloading the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter-reduction processing to the URLs already downloaded in the history record.
With reference to the first aspect, in a possible implementation, the method further includes:
if the first URL belongs to the first duplicate set, obtaining a second occurrence count of the first URL in the first duplicate set;
if the second occurrence count is less than or equal to a second threshold, downloading the original URL;
where the second threshold is greater than or equal to the first threshold.
With reference to the first aspect, in a possible implementation, performing the first parameter-reduction processing on the first URL to obtain the second URL if the first URL does not belong to the first duplicate set includes:
calculating a hash value of the first URL;
detecting whether the hash value of the first URL belongs to the first duplicate set, where the first duplicate set includes the hash values of the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record;
if the hash value of the first URL does not belong to the first duplicate set, performing the first parameter-reduction processing on the first URL to obtain the second URL.
With reference to the first aspect, in a possible implementation, the method further includes:
if the second URL does not belong to the second duplicate set, performing second parameter-reduction processing on the second URL to obtain a third URL, where the second parameter-reduction processing is used to reduce the parameter variables in the second URL;
if the third URL belongs to a third duplicate set, detecting whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold, and if so, downloading the original URL, where the third duplicate set includes the URLs obtained by applying the second parameter-reduction processing to the URLs already downloaded in the history record;
where the third threshold is less than or equal to the first threshold.
With reference to the first aspect, in a possible implementation, the method further includes:
if the third URL does not belong to the third duplicate set, performing second generalization processing on the third URL to obtain a fourth URL, where the second generalization processing is used to replace at least one character of a target type in the third URL with a target character;
if the fourth URL belongs to a fourth duplicate set, detecting whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold, and if so, downloading the original URL, where the fourth duplicate set includes the URLs obtained by applying the first parameter-reduction processing, the second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history record;
where the fourth threshold is less than or equal to the third threshold.
With reference to the first aspect, in a possible implementation, the method further includes:
if the fourth URL does not belong to the fourth duplicate set, downloading the original URL.
In a second aspect, an embodiment of the present application provides a URL deduplication device, the device including:
a first generalization processing module, configured to perform first generalization processing on an original URL to obtain a first URL, where the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
a first parameter-reduction processing module, configured to perform first parameter-reduction processing on the first URL to obtain a second URL when the first URL does not belong to a first duplicate set, where the first parameter-reduction processing is used to reduce the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record;
a download module, configured to detect, when the second URL belongs to a second duplicate set, whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and if so, download the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter-reduction processing to the URLs already downloaded in the history record.
With reference to the second aspect, in a possible implementation, the device further includes:
an obtaining module, configured to obtain a second occurrence count of the first URL in the first duplicate set when the first URL belongs to the first duplicate set;
the download module is further configured to download the original URL when the second occurrence count is less than or equal to a second threshold;
where the second threshold is greater than or equal to the first threshold.
With reference to the second aspect, in a possible implementation, the first parameter-reduction processing module includes:
a calculating unit, configured to calculate a hash value of the first URL;
a detecting unit, configured to detect whether the hash value of the first URL belongs to the first duplicate set, where the first duplicate set includes the hash values of the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record;
a first parameter-reduction processing unit, configured to perform the first parameter-reduction processing on the first URL to obtain the second URL when the hash value of the first URL does not belong to the first duplicate set.
With reference to the second aspect, in a possible implementation, the device further includes:
a second parameter-reduction processing module, configured to perform second parameter-reduction processing on the second URL to obtain a third URL when the second URL does not belong to the second duplicate set, where the second parameter-reduction processing is used to reduce the parameter variables in the second URL;
the download module is further configured to detect, when the third URL belongs to a third duplicate set, whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold, and if so, download the original URL, where the third duplicate set includes the URLs obtained by applying the second parameter-reduction processing to the URLs already downloaded in the history record;
where the third threshold is less than or equal to the first threshold.
With reference to the second aspect, in a possible implementation, the device further includes:
a second generalization processing module, configured to perform second generalization processing on the third URL to obtain a fourth URL when the third URL does not belong to the third duplicate set, where the second generalization processing is used to replace at least one character of a target type in the third URL with a target character;
the download module is further configured to detect, when the fourth URL belongs to a fourth duplicate set, whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold, and if so, download the original URL, where the fourth duplicate set includes the URLs obtained by applying the first parameter-reduction processing, the second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history record;
where the fourth threshold is less than or equal to the third threshold.
With reference to the second aspect, in a possible implementation, the download module is further configured to download the original URL when the fourth URL does not belong to the fourth duplicate set.
In a third aspect, an embodiment of the present application provides a terminal including a processor and a memory that are connected to each other, where the memory is used to store a computer program that supports the terminal in executing the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the URL deduplication method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to execute the URL deduplication method of the first aspect.
In the embodiments of the present application, first generalization processing is performed on an original URL to obtain a first URL, and it is determined whether the first URL belongs to a first duplicate set. If the first URL does not belong to the first duplicate set, first parameter-reduction processing is performed on the first URL to obtain a second URL, and it is determined whether the second URL belongs to a second duplicate set. If the second URL belongs to the second duplicate set, it is detected whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold; if so, the original URL is downloaded, and if not, the original URL is discarded without being downloaded. This reduces the number of duplicate URLs that are downloaded and thereby improves the scanning efficiency of the WEB vulnerability scanning system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a URL deduplication method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the relationship between the first duplicate set and occurrence counts;
Fig. 3 is a schematic flowchart of another URL deduplication method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of the relationship between the second duplicate set and occurrence counts;
Fig. 5 is a schematic block diagram of a URL deduplication device provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of a terminal provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. The described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that the terms "first", "second", "third", "fourth" and the like in the specification, claims and drawings of the present application are used to distinguish different objects rather than to describe a particular order. In addition, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
It should also be understood that a reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
It should be further understood that the term "and/or" used in the specification and the appended claims of the present application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Before the embodiments of the present application are introduced, the data structure of a URL is described. The structure of a URL is usually "protocol://server name (or IP address)/path/filename?parameters", for example: http://xxx.pingan.com/cgi-bin/index1.html?param1=value1&param2=value2, where param1=value1&param2=value2 is the parameter part of this URL. The parameter part of a URL consists of parameter names and parameter values: param1 and param2 are parameter names, and value1 and value2 are parameter values. A parameter value may be made up of digits, letters (both upper and lower case), special characters (characters other than digits and letters) and/or any combination of these. The question mark "?" character separates the filename part of the URL from the parameter part, and the ampersand "&" character is the separator between the parameters specified in the URL.
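The URL structure described above can be made concrete with a small sketch. It uses Python's standard `urllib.parse.urlsplit`; the helper name `split_url` is hypothetical, and the query string is split by hand on "&" and "=" so that parameter values are kept exactly as written rather than percent-decoded.

```python
from urllib.parse import urlsplit


def split_url(url: str):
    """Return (protocol, server, path, [(parameter name, parameter value), ...])."""
    parts = urlsplit(url)
    params = []
    for pair in parts.query.split("&"):
        if not pair:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = pair.partition("=")
        params.append((name, value))
    return parts.scheme, parts.netloc, parts.path, params


protocol, server, path, params = split_url(
    "http://xxx.pingan.com/cgi-bin/index1.html?param1=value1&param2=value2"
)
```

Here `params` holds the parameter part as name/value pairs, which is the portion of the URL that the generalization and parameter-reduction steps operate on.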
The URL deduplication method and device provided by the embodiments of the present application are described below with reference to Fig. 1 to Fig. 6.
Referring to Fig. 1, which is a schematic flowchart of a URL deduplication method provided by an embodiment of the present application, the URL deduplication method may include the following steps:
S101: Perform first generalization processing on an original URL to obtain a first URL.
In some possible implementations, the terminal may sort the parameter part of the original URL by parameter name and perform the first generalization processing on the parameter values in the parameter part of the original URL to obtain the first URL. The first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character. For example, consecutive digits such as 145 may be replaced with the digit 1; consecutive letters such as FK, aj or dgA may be replaced with the letter A; and special characters may all be replaced with the symbol %. Here, special characters are characters other than digits and letters, such as the question mark "?" and the exclamation mark "!".
For example, suppose the original URL crawled by the terminal is http://xxx.pingan.com/cgi-bin/index1.html?param1=v167!ABD&param2=val_ue2. The terminal may sort the parameter part of the original URL by parameter name so that the parameters appear in a fixed order, and then, in the parameter values of the original URL, replace consecutive digits with the preset single digit "1", consecutive letters with the preset single letter "A", and special characters with the preset single special character "%", obtaining the first URL http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2.
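The example above can be reproduced with the following sketch. The function name `first_generalize` is hypothetical, and the exact rule is an assumption inferred from the worked example: runs of two or more digits become "1", runs of two or more letters become "A", every run of special characters becomes "%", and a single isolated digit or letter (such as the "v" and the trailing "2") is left unchanged.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Runs of 2+ digits, runs of 2+ ASCII letters, or any run of special characters.
_RUNS = re.compile(r"[0-9]{2,}|[A-Za-z]{2,}|[^0-9A-Za-z]+")


def _collapse(match: re.Match) -> str:
    """Replace a matched run with the preset single character for its type."""
    first = match.group(0)[0]
    if first in string.digits:
        return "1"
    if first in string.ascii_letters:
        return "A"
    return "%"


def first_generalize(url: str) -> str:
    """Sort parameters by name and generalize each parameter value."""
    parts = urlsplit(url)
    pairs = [p.partition("=") for p in parts.query.split("&") if p]
    pairs.sort(key=lambda item: item[0])  # order by parameter name
    query = "&".join(name + sep + _RUNS.sub(_collapse, value)
                     for name, sep, value in pairs)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


first_url = first_generalize(
    "http://xxx.pingan.com/cgi-bin/index1.html?param1=v167!ABD&param2=val_ue2"
)
```

With this rule `first_url` matches the first URL given in the text. Only parameter values are rewritten; the protocol, server, path and parameter names are left as they are.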
S102: If the first URL does not belong to the first duplicate set, perform first parameter-reduction processing on the first URL to obtain a second URL.
In some possible implementations, the terminal may detect whether a URL identical to the first URL exists in the first duplicate set. If not, the first URL does not belong to the first duplicate set, and the terminal may perform the first parameter-reduction processing on the parameter part of the first URL to obtain the second URL. The first parameter-reduction processing is used to reduce the parameter variables in the parameter part of the first URL, for example by removing the parameter values from the parameter part of the first URL while keeping the parameter names. If such a URL exists, the first URL belongs to the first duplicate set; the terminal may then add 1 to the occurrence count of the first URL in the first duplicate set to obtain the second occurrence count of the first URL in the first duplicate set, and determine whether the second occurrence count is less than or equal to the second threshold. If so (that is, the second occurrence count is less than or equal to the second threshold), the original URL is downloaded; if not (that is, the second occurrence count is greater than the second threshold), the original URL is discarded and not downloaded. By checking whether the first URL obtained through the first generalization processing has already been downloaded in order to decide whether the original URL needs to be downloaded, the terminal can filter out original URLs whose parameter values share the same data format, reducing the number of duplicate URLs that are downloaded. Because only some of the variables in the parameter values of the original URL are reduced, the accuracy of deduplication is preserved. The first duplicate set may include the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record, that is, the first URLs corresponding to the downloaded URLs. A downloaded URL may be a URL that has not undergone the first generalization processing.
For example, suppose the first duplicate set includes URL4, URL6 and URL7, and the first URL is URL1: http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2. The terminal detects whether a URL identical to URL1 exists among URL4, URL6 and URL7, that is, whether URL1 exists in the first duplicate set. Because URL1 does not exist in the first duplicate set, URL1 does not belong to it, and the terminal removes the parameter values from the parameter part of URL1 while keeping the parameter names, obtaining the second URL: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2.
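The parameter-reduction step in this example can be sketched as follows. The function name `first_reduce` is hypothetical; the sketch implements only the variant named in the text, dropping every parameter value and keeping the parameter names.

```python
from urllib.parse import urlsplit, urlunsplit


def first_reduce(url: str) -> str:
    """Remove parameter values from the parameter part, keeping parameter names."""
    parts = urlsplit(url)
    names = [pair.partition("=")[0] for pair in parts.query.split("&") if pair]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       "&".join(names), parts.fragment))


second_url = first_reduce(
    "http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2"
)
```

After this step every URL that targets the same page with the same set of parameter names collapses to the same second URL, regardless of the parameter values.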
In some possible implementations, the terminal may use Message-Digest algorithm 5 (MD5) to calculate the hash value of the first URL and detect whether the hash value of the first URL exists in the first duplicate set. If not, the hash value of the first URL does not belong to the first duplicate set, and the terminal may perform the first parameter-reduction processing on the parameter part of the first URL to obtain the second URL. The first parameter-reduction processing is used to reduce the parameter variables in the parameter part of the first URL, for example by removing the parameter values from the parameter part of the first URL while keeping the parameter names. If the hash value exists, the hash value of the first URL belongs to the first duplicate set; the terminal may then add 1 to the occurrence count of the hash value of the first URL in the first duplicate set to obtain the second occurrence count of the hash value of the first URL in the first duplicate set, and determine whether the second occurrence count is less than or equal to the second threshold. If so (that is, the second occurrence count is less than or equal to the second threshold), the original URL is downloaded; if not (that is, the second occurrence count is greater than the second threshold), the original URL is discarded and not downloaded. Here, the first duplicate set may include the hash values, calculated with MD5, of the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record, that is, the hash values of the first URLs corresponding to the downloaded URLs. The second threshold is an integer greater than 0. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the first duplicate set) stores the hash values of URLs rather than the complete URLs, storage space can be reduced (a complete URL contains many characters, whereas a hash value has a fixed size), and the processing efficiency of detecting whether the first URL belongs to the first duplicate set can be improved.
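As a minimal sketch of storing hash values instead of complete URLs, the following uses `hashlib.md5` from the Python standard library. The helper name `url_key`, the UTF-8 encoding and the hexadecimal digest form are assumptions; the text only states that MD5 is used.

```python
import hashlib


def url_key(url: str) -> str:
    """Return the MD5 hash of a URL as a 32-character hexadecimal string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# The duplicate set maps each hash value to its occurrence count.
first_duplicate_set = {}
first_url = "http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2"

key = url_key(first_url)
already_known = key in first_duplicate_set  # membership test on fixed-size keys
```

Every key has the same length however long the URL is, which is the storage and lookup benefit the paragraph above describes.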
For example, Fig. 2 is a schematic diagram of the relationship between the first duplicate set and occurrence counts. Suppose the second threshold is 10, the elements of the first duplicate set are 01, 03 and 06, and the first URL is URL1. The terminal calculates that the hash value of URL1 is 03 and detects that this hash value is in the first duplicate set, so URL1 belongs to the first duplicate set. The terminal adds 1 to the occurrence count 3 of hash value 03 in the first duplicate set, obtaining a second occurrence count of 4 for the hash value of URL1 in the first duplicate set. The terminal determines that the second occurrence count 4 is less than the second threshold 10 and downloads the original URL.
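The counting logic in the Fig. 2 example can be sketched with a dictionary that maps each hash value to its occurrence count. The function name `check_known_key` is hypothetical, and the counts shown for 01 and 06 are made-up placeholders; only the count 3 for hash value 03 and the threshold 10 come from the text.

```python
def check_known_key(counts: dict, key: str, threshold: int):
    """Return None if key is not in the set; otherwise add 1 to its count and
    return True (download) when the new count is within the threshold."""
    if key not in counts:
        return None
    counts[key] += 1
    return counts[key] <= threshold


# First duplicate set from the example; counts for "01" and "06" are hypothetical.
counts = {"01": 1, "03": 3, "06": 2}
decision = check_known_key(counts, "03", threshold=10)
```

After the call the count for 03 is 4, which is within the threshold of 10, so `decision` indicates that the original URL should be downloaded.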
In some possible embodiments, terminal is when detecting that the first URL is not belonging to the first repetition set, eventually Holding the first URL, which can be added, first repetition gathers the first repetition set formed newly, that is, updates first repetition and gather, The first URL first can be repeated the frequency of occurrence in set simultaneously and be set to 1 at this, and the can be carried out to the first URL One subtracts and considers and handles reason and obtain the 2nd URL.After downloading above-mentioned original URL or abandoning above-mentioned original URL, terminal utilizes newest terminal First repeat set and detect whether next original URL was downloaded.Terminal repeats to gather by constantly updating first, can be with Duplicate URL is more accurately filtered out, and can be further improved scan efficiency.For example, first repeat set include URL4, URL6 and URL7, the first URL are URL1, repeat to gather at this point, URL1 is not belonging to first, which can be added the by terminal One repeats to gather, and updated first repeats to gather including URL1, URL4, URL6 and URL7 at this time.Meanwhile terminal can also be with URL1 is repeated into the frequency of occurrence in set first and is also set to 1.
In some possible embodiments, if above-mentioned original URL is first URL crawled in some websites, that First repetition set can be empty (because the URL not downloaded at this time).
S103 repeats to gather if the 2nd URL belongs to second, and the 2nd URL of detection, which repeats first in set second, to be occurred Whether number is less than or equal to first threshold, if so, downloading original URL.
In some possible implementations, the terminal may detect whether a URL identical to the second URL exists in the second duplicate set. If one exists, the second URL belongs to the second duplicate set; the terminal may then add 1 to the occurrence count of the second URL in the second duplicate set to obtain the first occurrence count of the second URL in the second duplicate set, and may detect whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), the terminal discards the original URL and does not download it. If the terminal detects that the second URL does not exist in the second duplicate set, the second URL does not belong to the second duplicate set, and the terminal may download the original URL directly. The second duplicate set may include the URLs obtained after the downloaded URLs in the history record undergo the first parameter reduction processing, that is, the second URLs corresponding to the downloaded URLs. A downloaded URL here may be a URL that has not undergone the first parameter reduction processing. By judging whether the second URL obtained through the first parameter reduction processing has been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their parameter values, further reduces the number of duplicate URLs downloaded, and improves scanning efficiency while maintaining accuracy.
It should be noted that the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization processing only reduces the variables in the parameter values of the original URL, the first URL still contains many variables; few URLs are filtered out, more URLs are downloaded, and the first URL occurs more often in the first duplicate set. Setting the first threshold below the second threshold therefore ensures that URLs that were not filtered out earlier can be filtered out after the first parameter reduction processing, which achieves hierarchical deduplication.
In some possible implementations, the terminal may use MD5 to calculate the hash value of the second URL and may detect whether the hash value of the second URL exists in the second duplicate set. If it exists, the hash value of the second URL belongs to the second duplicate set; the terminal may add 1 to the occurrence count of the hash value of the second URL in the second duplicate set to obtain the first occurrence count of the hash value of the second URL in the second duplicate set, and may judge whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), the terminal discards the original URL and does not download it. If it does not exist, the hash value of the second URL does not belong to the second duplicate set, and the terminal may download the original URL. The second duplicate set may include the hash values, calculated using MD5, of the URLs obtained after the downloaded original URLs in the history record undergo the first parameter reduction processing, that is, the hash values of the second URLs corresponding to the downloaded original URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the second duplicate set) stores the hash values of URLs rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the second URL belongs to the second duplicate set.
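The hash-based lookup described above can be sketched as follows. This is a minimal illustration, assuming the duplicate set is held as an in-memory dictionary from hash value to occurrence count; the helper names `url_key` and `seen_before` are hypothetical and do not come from the application.

```python
import hashlib


def url_key(url: str) -> str:
    """Return the 32-character MD5 hex digest of a URL (hypothetical helper)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def seen_before(url: str, duplicate_set: dict) -> bool:
    """Check membership by hash value, so the set never stores the full URL."""
    return url_key(url) in duplicate_set
```

Every key has the same length regardless of how long the URL is, which is the storage saving the paragraph refers to.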
In some possible implementations, when the terminal detects that the second URL does not belong to the second duplicate set, the terminal may add the second URL to the second duplicate set to form a new second duplicate set, that is, update the second duplicate set. At the same time, the terminal may set the occurrence count of the second URL in the second duplicate set to 1 and may download the original URL. After downloading or discarding the original URL, the terminal uses the latest second duplicate set to detect whether the next original URL has been downloaded. By continually updating the second duplicate set, the terminal can filter out duplicate URLs more accurately and can further improve scanning efficiency.
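Taken together, the check in step S103 and the update of the second duplicate set might look like the sketch below. The function name and the dictionary representation (URL mapped to occurrence count) are assumptions made for illustration only.

```python
from typing import Dict


def s103_should_download(second_url: str,
                         second_set: Dict[str, int],
                         first_threshold: int) -> bool:
    """Decide whether to download the original URL at step S103 (illustrative)."""
    if second_url in second_set:
        # Already seen: add 1 to the occurrence count, then compare with the threshold.
        second_set[second_url] += 1
        return second_set[second_url] <= first_threshold
    # Not seen: record it with an occurrence count of 1 and download.
    second_set[second_url] = 1
    return True
```

Because the set is updated in place, each call sees the latest state when the next original URL is examined.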
In some possible implementations, if the original URL is the first URL crawled from a website, the second duplicate set may be empty (because no URL has been downloaded at this point).
In this embodiment of the present application, the first generalization processing is performed on the original URL to obtain the first URL, and it is judged whether the first URL belongs to the first duplicate set. If the first URL does not belong to the first duplicate set, the first parameter reduction processing is performed on the first URL to obtain the second URL, and it is judged whether the second URL belongs to the second duplicate set. If the second URL belongs to the second duplicate set, it is detected whether the first occurrence count of the second URL in the second duplicate set is less than or equal to the first threshold; if so, the original URL is downloaded, and if not, the original URL is discarded and not downloaded. This reduces the number of duplicate URLs downloaded and thereby improves the scanning efficiency of a WEB vulnerability scanning system.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of another uniform resource locator URL deduplication method provided by an embodiment of the present application. As shown in Fig. 3, the URL deduplication method may include:
S301: perform first generalization processing on an original URL to obtain a first URL.
S302: if the first URL does not belong to a first duplicate set, perform first parameter reduction processing on the first URL to obtain a second URL.
For the implementation of steps S301 to S302 in this embodiment of the present application, refer to the implementation provided for steps S101 to S102 of the embodiment shown in Fig. 1; details are not repeated here.
S303: if the second URL belongs to a second duplicate set, detect whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold; if so, download the original URL.
S304: if the second URL does not belong to the second duplicate set, perform second parameter reduction processing on the second URL to obtain a third URL.
In some possible implementations, the terminal may detect whether a URL identical to the second URL exists in the second duplicate set. If one exists, the second URL belongs to the second duplicate set; the terminal may then add 1 to the occurrence count of the second URL in the second duplicate set to obtain the first occurrence count of the second URL in the second duplicate set, and may detect whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), the terminal discards the original URL and does not download it. If the terminal detects that the second URL does not exist in the second duplicate set, the second URL does not belong to the second duplicate set, and the terminal may perform the second parameter reduction processing on the second URL to obtain the third URL. The second parameter reduction processing is used to reduce the parameter variables in the second URL, for example by removing the parameter part of the second URL (including parameter values and parameter names). The second duplicate set may include the URLs obtained after the downloaded URLs in the history record undergo the first parameter reduction processing, that is, the second URLs corresponding to the downloaded URLs. A downloaded URL here may be a URL that has not undergone the first parameter reduction processing.
By judging whether the second URL obtained through the first parameter reduction processing has been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their parameter values, further reduces the number of duplicate URLs downloaded, and improves scanning efficiency while maintaining accuracy. It should be noted that the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization processing only reduces the variables in the parameter values of the original URL, the first URL still contains many variables; few URLs are filtered out, more URLs are downloaded, and the first URL occurs more often in the first duplicate set. Setting the first threshold below the second threshold therefore ensures that URLs not filtered out in steps S301 to S302 can be filtered out after the first parameter reduction processing, which achieves hierarchical deduplication.
For example, Fig. 4 is a schematic diagram of the relationship between the second duplicate set and the occurrence counts. Assume that the first threshold is 7, the second duplicate set includes URL2 and URL5, and the second URL is URL2. The terminal detects that URL2 is in the second duplicate set, so URL2 belongs to the second duplicate set. The terminal adds 1 to the occurrence count 1 of URL2 in the second duplicate set to obtain the first occurrence count 2 of URL2 in the second duplicate set. The terminal determines that the first occurrence count 2 is less than the first threshold 7 and therefore downloads the original URL.
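The arithmetic of this example can be traced in a few lines, and extended to show when the threshold starts to reject. This is only a sketch: the occurrence count given to URL5 is an assumed placeholder, since the text does not state it, and the repeated arrivals of URL2 beyond the first are added for illustration.

```python
FIRST_THRESHOLD = 7

# Second duplicate set as URL -> occurrence count.
# URL2 starts at 1 as in the example; URL5's count is an assumed placeholder.
second_set = {"URL2": 1, "URL5": 1}

decisions = []
for _ in range(7):  # seven further original URLs that all reduce to URL2
    second_set["URL2"] += 1
    decisions.append(second_set["URL2"] <= FIRST_THRESHOLD)
```

The first entry of `decisions` corresponds to the case in the example (count 2, download). Under these assumptions the next five arrivals are also downloaded, and the seventh, which brings the count to 8, is discarded.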
In some possible implementations, the terminal may use MD5 to calculate the hash value of the second URL and may detect whether the hash value of the second URL exists in the second duplicate set. If it exists, the hash value of the second URL belongs to the second duplicate set; the terminal may add 1 to the occurrence count of the hash value of the second URL in the second duplicate set to obtain the first occurrence count of the hash value of the second URL in the second duplicate set, and may judge whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), the terminal discards the original URL, that is, does not download it. If it does not exist, the hash value of the second URL does not belong to the second duplicate set, and the terminal may perform the second parameter reduction processing on the second URL to obtain the third URL. The second parameter reduction processing is used to reduce the parameter variables in the second URL, for example by removing the parameter part of the second URL (including parameter values and parameter names). The second duplicate set may include the hash values, calculated using MD5, of the URLs obtained after the downloaded URLs in the history record undergo the first parameter reduction processing, that is, the hash values of the second URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the second duplicate set) stores the hash values of URLs rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the second URL belongs to the second duplicate set.
For example, assume that the elements in the second duplicate set are 07 and 09, and the second URL is URL2: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2. The terminal calculates the hash value of URL2 as 04 and detects that the hash value 04 of URL2 is not in the second duplicate set, so URL2 does not belong to the second duplicate set. The terminal removes the parameter part of URL2 to obtain the third URL: http://xxx.pingan.com/cgi-bin/index1.html.
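The removal of the parameter part in this example can be sketched with Python's standard `urllib.parse` module. The function name `strip_parameters` is hypothetical, and the two-digit hash values are the illustrative placeholders from the example rather than real MD5 digests.

```python
from urllib.parse import urlsplit


def strip_parameters(url: str) -> str:
    """Second parameter reduction (illustrative): drop the whole query string."""
    return urlsplit(url)._replace(query="").geturl()


second_set = {"07", "09"}  # illustrative hash values from the example
url2 = "http://xxx.pingan.com/cgi-bin/index1.html?param1&param2"
url2_hash = "04"           # illustrative value from the example, not a real digest

# URL2 is not in the second duplicate set, so it moves on to the next level.
third_url = strip_parameters(url2) if url2_hash not in second_set else None
```

With these inputs `third_url` is the URL given in the example, with everything from the question mark onward removed.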
In some possible implementations, when the terminal detects that the second URL does not belong to the second duplicate set, the terminal may add the second URL to the second duplicate set to form a new second duplicate set, that is, update the second duplicate set. At the same time, the terminal may set the occurrence count of the second URL in the second duplicate set to 1 and may perform the second parameter reduction processing on the second URL to obtain the third URL. After downloading or discarding the original URL, the terminal uses the latest second duplicate set to detect whether the next original URL has been downloaded. By continually updating the second duplicate set, the terminal can filter out duplicate URLs more accurately and can further improve scanning efficiency.
In some possible implementations, if the original URL is the first URL crawled from a website, the second duplicate set may be empty (because no URL has been downloaded at this point).
S305: if the third URL belongs to a third duplicate set, detect whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold; if so, download the original URL.
S306: if the third URL does not belong to the third duplicate set, perform second generalization processing on the third URL to obtain a fourth URL.
In some possible implementations, the terminal may detect whether a URL identical to the third URL exists in the third duplicate set. If one exists, the third URL belongs to the third duplicate set; the terminal may then add 1 to the occurrence count of the third URL in the third duplicate set to obtain the third occurrence count of the third URL in the third duplicate set, and may detect whether the third occurrence count is less than or equal to the third threshold. If so, the terminal downloads the original URL; if not (that is, the third occurrence count is greater than the third threshold), the terminal discards the original URL and does not download it. If the terminal detects that the third URL does not exist in the third duplicate set, the third URL does not belong to the third duplicate set, and the terminal may perform the second generalization processing on the file name part of the third URL to obtain the fourth URL. The second generalization processing is used to replace at least one character of a target type in the file name part of the third URL with a target character, for example by replacing one or more digits in the file name part of the third URL with the preset single digit "1". The third duplicate set may include the URLs obtained after the downloaded URLs in the history record undergo the second parameter reduction processing, that is, the third URLs corresponding to the downloaded URLs. A downloaded URL here may be a URL that has not undergone the second parameter reduction processing. By judging whether the third URL obtained through the second parameter reduction processing has been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their parameter part: the variables in the original URL are reduced, more original URLs are discarded, fewer duplicate URLs are downloaded, and scanning efficiency is further improved while accuracy is maintained. It should be noted that the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the third threshold may be an integer greater than 0. Because the first parameter reduction processing only reduces the variables in the parameter part of the original URL, the second URL still contains many variables; few URLs are filtered out, more URLs are downloaded, and the second URL occurs more often in the second duplicate set. Setting the third threshold below the first threshold therefore ensures that URLs not filtered out in steps S303 to S304 can be filtered out after the second parameter reduction processing, which achieves hierarchical deduplication.
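The second generalization processing described above might be sketched as follows. This is one possible reading, in which every run of consecutive digits in the file name part (the last path segment) is replaced with the single digit "1"; the function name `generalize_file_name` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def generalize_file_name(url: str) -> str:
    """Replace each run of digits in the file name part with '1' (illustrative)."""
    parts = urlsplit(url)
    if "/" not in parts.path:
        return url  # no file name part to generalize
    directory, _, file_name = parts.path.rpartition("/")
    file_name = re.sub(r"\d+", "1", file_name)
    return parts._replace(path=f"{directory}/{file_name}").geturl()
```

Under this reading, pages such as `index2.html` and `index123.html` in the same directory both map to `index1.html`, so they share one entry in the fourth duplicate set.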
In some possible implementations, the terminal may use MD5 to calculate the hash value of the third URL and may detect whether the hash value of the third URL exists in the third duplicate set. If it exists, the hash value of the third URL belongs to the third duplicate set; the terminal may add 1 to the occurrence count of the hash value of the third URL in the third duplicate set to obtain the third occurrence count of the hash value of the third URL in the third duplicate set, and may judge whether the third occurrence count is less than or equal to the third threshold. If so, the terminal downloads the original URL; if not (that is, the third occurrence count is greater than the third threshold), the terminal discards the original URL, that is, does not download it. If it does not exist, the hash value of the third URL does not belong to the third duplicate set, and the terminal may perform the second generalization processing on the third URL to obtain the fourth URL. The second generalization processing is used to replace at least one character of a target type in the file name part of the third URL with a target character, for example by replacing one or more digits in the file name part of the third URL with the preset single digit "1". The third duplicate set may include the hash values, calculated using MD5, of the URLs obtained after the downloaded URLs in the history record undergo the second parameter reduction processing, that is, the hash values of the third URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the third duplicate set) stores the hash values of URLs rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the third URL belongs to the third duplicate set.
In some possible implementations, when the terminal detects that the third URL does not belong to the third duplicate set, the terminal may add the third URL to the third duplicate set to form a new third duplicate set, that is, update the third duplicate set. At the same time, the terminal may set the occurrence count of the third URL in the third duplicate set to 1 and may perform the second generalization processing on the third URL to obtain the fourth URL. After downloading or discarding the original URL, the terminal uses the latest third duplicate set to detect whether the next original URL has been downloaded. By continually updating the third duplicate set, the terminal can filter out duplicate URLs level by level more accurately and can further improve scanning efficiency.
In some possible implementations, if the original URL is the first URL crawled from a website, the third duplicate set may be empty (because no URL has been downloaded at this point).
S307: if the fourth URL belongs to a fourth duplicate set, detect whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold; if so, download the original URL.
S308: if the fourth URL does not belong to the fourth duplicate set, download the original URL.
In some possible implementations, the terminal may detect whether a URL identical to the fourth URL exists in the fourth duplicate set. If one exists, the fourth URL belongs to the fourth duplicate set; the terminal may then add 1 to the occurrence count of the fourth URL in the fourth duplicate set to obtain the fourth occurrence count of the fourth URL in the fourth duplicate set, and may detect whether the fourth occurrence count is less than or equal to the fourth threshold. If so, the terminal downloads the original URL; if not (that is, the fourth occurrence count is greater than the fourth threshold), the terminal discards the original URL and does not download it. If the terminal detects that the fourth URL does not exist in the fourth duplicate set, the fourth URL does not belong to the fourth duplicate set, and the terminal may download the original URL directly. The fourth duplicate set may include the URLs obtained after the downloaded URLs in the history record undergo the first and second parameter reduction processing and the second generalization processing, that is, the fourth URLs corresponding to the downloaded URLs. By judging whether the fourth URL obtained through the second generalization processing has been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their file name part: the variables in the original URL are reduced, more original URLs are discarded, fewer duplicate URLs are downloaded, and scanning efficiency is further improved while accuracy is maintained. It should be noted that the fourth threshold is less than or equal to the third threshold, the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the fourth threshold is an integer greater than or equal to 0. Setting the fourth threshold below the third threshold ensures that URLs not filtered out in steps S305 to S306 can be filtered out after the second generalization processing, which achieves hierarchical deduplication.
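The threshold constraints stated for the four levels can be collected into a single chain. The symbols $T_1$ to $T_4$ for the first to fourth thresholds are introduced here only for illustration and do not appear in the application.

```latex
T_4 \le T_3 \le T_1 \le T_2,
\qquad T_1,\, T_3 \in \mathbb{Z}_{>0},
\qquad T_4 \in \mathbb{Z}_{\ge 0}
```

Read in processing order, the levels use $T_2$, $T_1$, $T_3$, $T_4$, so the tolerance for repeats never increases as the URL is reduced further.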
In some possible implementations, the terminal may use MD5 to calculate the hash value of the fourth URL and may detect whether the hash value of the fourth URL exists in the fourth duplicate set. If it exists, the hash value of the fourth URL belongs to the fourth duplicate set; the terminal may add 1 to the occurrence count of the hash value of the fourth URL in the fourth duplicate set to obtain the fourth occurrence count of the hash value of the fourth URL in the fourth duplicate set, and may judge whether the fourth occurrence count is less than or equal to the fourth threshold. If so, the terminal downloads the original URL; if not (that is, the fourth occurrence count is greater than the fourth threshold), the terminal discards the original URL, that is, does not download it. If it does not exist, the hash value of the fourth URL does not belong to the fourth duplicate set, and the terminal may download the original URL directly. The fourth duplicate set may include the hash values, calculated using MD5, of the URLs obtained after the downloaded URLs in the history record undergo the first and second parameter reduction processing and the second generalization processing, that is, the hash values of the fourth URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the fourth duplicate set) stores the hash values of URLs rather than complete URLs, storage space can be further reduced, and processing efficiency can be further improved when detecting whether the fourth URL belongs to the fourth duplicate set.
In some possible implementations, when the terminal detects that the fourth URL does not belong to the fourth duplicate set, the terminal may add the fourth URL to the fourth duplicate set to form a new fourth duplicate set, that is, update the fourth duplicate set. At the same time, the terminal may set the occurrence count of the fourth URL in the fourth duplicate set to 1 and may download the original URL. After downloading or discarding the original URL, the terminal uses the latest fourth duplicate set to detect whether the next original URL has been downloaded. By continually updating the fourth duplicate set, the terminal can filter out duplicate URLs more accurately and can further improve scanning efficiency.
In some possible implementations, if the original URL is the first URL crawled from a website, the fourth duplicate set may be empty (because no URL has been downloaded at this point).
In this embodiment of the present application, hierarchical deduplication is performed on the original URL: the variables in the original URL are reduced level by level, and whether the original URL has been downloaded is judged from the URL obtained at each level. If a level indicates that the original URL has been downloaded, the original URL is discarded; if a level determines that the original URL has not been downloaded, the original URL is downloaded; if a level cannot determine whether the original URL has been downloaded, the judgment passes to the next level, until it is determined whether the original URL has been downloaded. This more refined hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of a WEB vulnerability scanning system, but also improves the accuracy of deduplication.
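The four levels of steps S301 to S308 can be put together in one sketch. Several details below are assumptions made for illustration and are not stated in the application: the exact replacement rules of the first generalization processing (letter runs become "a", digit runs become "1", in parameter values only), the reading of the first parameter reduction as keeping parameter names without values, the threshold values 4, 3, 2, 1, the behavior when the first URL is already in the first duplicate set, and all function names.

```python
import re
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlsplit


def generalize_values(url: str) -> str:
    """Level 1 (assumed rule): collapse letter/digit runs in parameter values."""
    parts = urlsplit(url)
    items = []
    for item in parts.query.split("&") if parts.query else []:
        name, sep, value = item.partition("=")
        value = re.sub(r"\d+", "1", re.sub(r"[A-Za-z]+", "a", value))
        items.append(name + sep + value)
    return parts._replace(query="&".join(items)).geturl()


def drop_values(url: str) -> str:
    """Level 2 (assumed reading): keep parameter names, drop their values."""
    parts = urlsplit(url)
    names = [item.partition("=")[0] for item in parts.query.split("&") if item]
    return parts._replace(query="&".join(names)).geturl()


def drop_query(url: str) -> str:
    """Level 3: remove the whole parameter part."""
    return urlsplit(url)._replace(query="").geturl()


def generalize_file_name(url: str) -> str:
    """Level 4: replace digit runs in the file name part with '1'."""
    parts = urlsplit(url)
    if "/" not in parts.path:
        return url
    directory, _, file_name = parts.path.rpartition("/")
    file_name = re.sub(r"\d+", "1", file_name)
    return parts._replace(path=f"{directory}/{file_name}").geturl()


# (transform, threshold) in processing order: second, first, third, fourth
# threshold. The values are assumed; they only respect the stated ordering.
LEVELS: List[Tuple[Callable[[str], str], int]] = [
    (generalize_values, 4),
    (drop_values, 3),
    (drop_query, 2),
    (generalize_file_name, 1),
]


def should_download(original_url: str, sets: List[Dict[str, int]]) -> bool:
    """Run the cascade; `sets` holds one URL -> count dictionary per level."""
    url = original_url
    for (transform, threshold), seen in zip(LEVELS, sets):
        url = transform(url)
        if url in seen:
            seen[url] += 1            # hit: count it and decide at this level
            return seen[url] <= threshold
        seen[url] = 1                 # miss: record it and go to the next level
    return True                       # missed at every level: download
```

A short usage trace: starting from four empty dictionaries, the original URLs `http://example.com/a/p1.html?id=10`, `?id=22`, `?id=33` and `?id=44` are all downloaded, `?id=55` is discarded at the first level because its count reaches 5, and `http://example.com/a/p2.html?id=10` is discarded at the fourth level because its generalized file name has already been seen once.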
Referring to Fig. 5, Fig. 5 is a schematic block diagram of a URL deduplication device provided by an embodiment of the present application. The URL deduplication device provided by this embodiment of the present application includes:
a first generalization processing module 10, configured to perform first generalization processing on an original URL to obtain a first URL, where the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
a first parameter reduction processing module 20, configured to, when the first URL does not belong to a first duplicate set, perform first parameter reduction processing on the first URL to obtain a second URL, where the first parameter reduction processing is used to reduce the parameter variables in the first URL, and the first duplicate set includes the URLs obtained after the downloaded URLs in the history record undergo the first generalization processing;
a download module 30, configured to, when the second URL belongs to a second duplicate set, detect whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold and, if so, download the original URL, where the second duplicate set includes the URLs obtained after the downloaded URLs in the history record undergo the first parameter reduction processing.
In some possible implementations, the device further includes an obtaining module 40. The obtaining module 40 is configured to, when the first URL belongs to the first duplicate set, obtain a second occurrence count of the first URL in the first duplicate set. The download module 30 is further configured to download the original URL when the second occurrence count is less than or equal to a second threshold. The second threshold is greater than or equal to the first threshold.
In some possible implementations, the first parameter reduction processing module 20 includes a computing unit 201, a detection unit 202 and a first parameter reduction processing unit 203. The computing unit 201 is configured to calculate the hash value of the first URL. The detection unit 202 is configured to detect whether the hash value of the first URL belongs to the first duplicate set. The first parameter reduction processing unit 203 is configured to, when the hash value of the first URL does not belong to the first duplicate set, perform the first parameter reduction processing on the first URL to obtain the second URL. The first duplicate set includes the hash values of the URLs obtained after the downloaded URLs in the history record undergo the first generalization processing.
In some possible implementations, the device further includes a second parameter reduction processing module 50. The second parameter reduction processing module 50 is configured to, when the second URL does not belong to the second duplicate set, perform second parameter reduction processing on the second URL to obtain a third URL. The download module 30 is further configured to, when the third URL belongs to a third duplicate set, detect whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold and, if so, download the original URL. The third duplicate set includes the URLs obtained after the downloaded URLs in the history record undergo the second parameter reduction processing, the second parameter reduction processing is used to reduce the parameter variables in the second URL, and the third threshold is less than or equal to the first threshold.
In some possible implementations, the device further includes a second generalization processing module 60. The second generalization processing module 60 is configured to, when the third URL does not belong to the third duplicate set, perform second generalization processing on the third URL to obtain a fourth URL. The download module 30 is further configured to, when the fourth URL belongs to a fourth duplicate set, detect whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold and, if so, download the original URL. The fourth duplicate set includes the URLs obtained after the downloaded URLs in the history record undergo the first parameter reduction processing, the second parameter reduction processing and the second generalization processing; the second generalization processing is used to replace at least one character of a target type in the third URL with a target character; and the fourth threshold is less than or equal to the third threshold.
In some possible implementations, the download module 30 is further configured to download the original URL when the fourth URL does not belong to the fourth duplicate set.
In a specific implementation, the URL deduplication device may use the above modules to execute the implementations provided by the steps in Fig. 1 or Fig. 3 and to realize the functions realized in the above embodiments. For details, refer to the corresponding descriptions of the steps in the method embodiments shown in Fig. 1 or Fig. 3; they are not repeated here.
In this embodiment of the present application, the URL deduplication device performs hierarchical deduplication on the original URL: it reduces the variables in the original URL level by level and judges whether the original URL has been downloaded from the URL obtained at each level. If a level indicates that the original URL has been downloaded, the original URL is discarded; if a level determines that the original URL has not been downloaded, the original URL is downloaded; if a level cannot determine whether the original URL has been downloaded, the judgment passes to the next level, until it is determined whether the original URL has been downloaded. This more refined hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of a WEB vulnerability scanning system, but also improves the accuracy of deduplication.
Refer to Fig. 6, which is a schematic block diagram of a terminal provided by an embodiment of the present application. As shown in Fig. 6, the terminal in this embodiment may include one or more processors 601 and a memory 602. The processor 601 and the memory 602 are connected by a bus 603. The memory 602 is used to store a computer program that includes program instructions, and the processor 601 is used to execute the program instructions stored in the memory 602. The processor 601 is configured to call the program instructions to execute:
Performing first generalization processing on an original URL to obtain a first URL, where the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
If the first URL does not belong to a first duplicate set, performing first parameter-reduction processing on the first URL to obtain a second URL, where the first parameter-reduction processing is used to reduce the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization processing to the URLs downloaded in the history record;
If the second URL belongs to a second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and if so, downloading the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter-reduction processing to the URLs downloaded in the history record.
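The two transformations named in the steps above might look like the following sketch. The text here does not fix the details, so several choices are assumptions made only for illustration: only digits are treated as a character "type", each run of digits collapses to the placeholder `0`, the collapsing is limited to the path and query so that host names and ports stay intact, and parameter reduction keeps the sorted parameter names while dropping their values. The function names are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def first_generalization(url: str) -> str:
    """Collapse every run of digits in the path and query to a single '0' (assumed rule)."""
    parts = urlsplit(url)

    def collapse(text: str) -> str:
        return re.sub(r"\d+", "0", text)

    return urlunsplit(parts._replace(path=collapse(parts.path),
                                     query=collapse(parts.query)))


def first_parameter_reduction(url: str) -> str:
    """Keep only the sorted, de-duplicated parameter names (assumed rule)."""
    parts = urlsplit(url)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return urlunsplit(parts._replace(query="&".join(names)))
```

Under these assumptions, `http://example.com/item/12345?page=678` and `http://example.com/item/9?page=1` generalize to the same first URL, which is what lets a single history entry stand in for a whole family of pages.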
It should be understood that, in some feasible embodiments, the processor 601 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 602 may include read-only memory and random access memory, and provides instructions and data to the processor 601. A part of the memory 602 may also include non-volatile random access memory. For example, the memory 602 may also store device type information.
In a specific implementation, the processor 601 described in the embodiments of the present application can execute the implementations described for the uniform resource locator (URL) deduplication method provided by the embodiments of the present application, and can also execute the implementations of the URL deduplication device described in the embodiments of the present application, which are not repeated here.
The embodiments of the present application also provide a computer-readable storage medium that stores a computer program. The computer program includes program instructions which, when executed by a processor, implement the URL deduplication method shown in Fig. 1 or Fig. 3. For details, refer to the description of the embodiments shown in Fig. 1 or Fig. 3, which is not repeated here.
The above computer-readable storage medium may be an internal storage unit of the URL deduplication device or electronic device described in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device. Further, the computer-readable storage medium may include both the internal storage unit of the electronic device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
The present application is described with reference to flowcharts and/or block diagrams of the method, apparatus (terminal), and computer program product of the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction device, the instruction device implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although the present application has been described in connection with specific features and embodiments, it is clear that various modifications and combinations can be made without departing from the spirit and scope of the present application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.

Claims (10)

1. A uniform resource locator (URL) deduplication method, characterized by comprising:
performing first generalization processing on an original URL to obtain a first URL, the first generalization processing being used to replace multiple consecutive characters of the same type in the original URL with a single character;
if the first URL does not belong to a first duplicate set, performing first parameter-reduction processing on the first URL to obtain a second URL, the first parameter-reduction processing being used to reduce the parameter variables in the first URL, and the first duplicate set including the URLs obtained by applying the first generalization processing to the URLs downloaded in the history record;
if the second URL belongs to a second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and if the first occurrence count is less than or equal to the first threshold, downloading the original URL, the second duplicate set including the URLs obtained by applying the first parameter-reduction processing to the URLs downloaded in the history record.
2. The method according to claim 1, characterized in that the method further comprises:
if the first URL belongs to the first duplicate set, obtaining a second occurrence count of the first URL in the first duplicate set;
if the second occurrence count is less than or equal to a second threshold, downloading the original URL;
wherein the second threshold is greater than or equal to the first threshold.
3. The method according to claim 1, characterized in that, if the first URL does not belong to the first duplicate set, performing the first parameter-reduction processing on the first URL to obtain the second URL comprises:
calculating a hash value of the first URL;
detecting whether the hash value of the first URL belongs to the first duplicate set, the first duplicate set including the hash values of the URLs obtained by applying the first generalization processing to the URLs downloaded in the history record;
if the hash value of the first URL does not belong to the first duplicate set, performing the first parameter-reduction processing on the first URL to obtain the second URL.
4. The method according to any one of claims 1-3, characterized in that the method further comprises:
if the second URL does not belong to the second duplicate set, performing second parameter-reduction processing on the second URL to obtain a third URL, the second parameter-reduction processing being used to reduce the parameter variables in the second URL;
if the third URL belongs to a third duplicate set, detecting whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold, and if the third occurrence count is less than or equal to the third threshold, downloading the original URL, the third duplicate set including the URLs obtained by applying the second parameter-reduction processing to the URLs downloaded in the history record;
wherein the third threshold is less than or equal to the first threshold.
5. The method according to claim 4, characterized in that the method further comprises:
if the third URL does not belong to the third duplicate set, performing second generalization processing on the third URL to obtain a fourth URL, the second generalization processing being used to replace at least one character of a target type in the third URL with a target character;
if the fourth URL belongs to a fourth duplicate set, detecting whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold, and if the fourth occurrence count is less than or equal to the fourth threshold, downloading the original URL, the fourth duplicate set including the URLs obtained by applying the first parameter-reduction processing, the second parameter-reduction processing, and the second generalization processing to the URLs downloaded in the history record;
wherein the fourth threshold is less than or equal to the third threshold.
6. The method according to claim 5, characterized in that the method further comprises:
if the fourth URL does not belong to the fourth duplicate set, downloading the original URL.
7. A URL deduplication device, characterized by comprising:
a first generalization processing module, configured to perform first generalization processing on an original URL to obtain a first URL, the first generalization processing being used to replace multiple consecutive characters of the same type in the original URL with a single character;
a first parameter-reduction processing module, configured to perform first parameter-reduction processing on the first URL to obtain a second URL when the first URL does not belong to a first duplicate set, the first parameter-reduction processing being used to reduce the parameter variables in the first URL, and the first duplicate set including the URLs obtained by applying the first generalization processing to the URLs downloaded in the history record;
a download module, configured to detect, when the second URL belongs to a second duplicate set, whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and to download the original URL if the first occurrence count is less than or equal to the first threshold, the second duplicate set including the URLs obtained by applying the first parameter-reduction processing to the URLs downloaded in the history record.
8. The device according to claim 7, characterized in that the device further comprises:
an obtaining module, configured to obtain a second occurrence count of the first URL in the first duplicate set when the first URL belongs to the first duplicate set;
the download module being further configured to download the original URL when the second occurrence count is less than or equal to a second threshold;
wherein the second threshold is greater than or equal to the first threshold.
9. A terminal, characterized by comprising a processor and a memory that are connected to each other, wherein the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1-6.
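As a hedged illustration of the hash-based membership check recited in claim 3, the sketch below stores only hash values of the transformed URLs together with their occurrence counts. The claims say only "hash value"; the use of SHA-256 and the names `url_digest` and `HashedHistory` are assumptions for illustration.

```python
import hashlib
from collections import Counter


def url_digest(url: str) -> str:
    """Hash a transformed URL (SHA-256 is an assumed choice, not taken from the claims)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class HashedHistory:
    """A duplicate set that keeps hash values and occurrence counts, not raw URLs."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, transformed_url: str) -> None:
        # Called once for each downloaded URL after it has been transformed.
        self._counts[url_digest(transformed_url)] += 1

    def __contains__(self, transformed_url: str) -> bool:
        return url_digest(transformed_url) in self._counts

    def occurrences(self, transformed_url: str) -> int:
        # Counter returns 0 for a missing key without inserting it.
        return self._counts[url_digest(transformed_url)]
```

Keeping fixed-length digests instead of full URLs bounds the memory per entry, at the cost of a (very small, for a cryptographic hash) chance that two different URLs share an entry.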
CN201810735305.5A 2018-07-05 2018-07-05 Uniform Resource Locator (URL) duplicate removal method and device Active CN108984703B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810735305.5A CN108984703B (en) 2018-07-05 2018-07-05 Uniform Resource Locator (URL) duplicate removal method and device
PCT/CN2018/108708 WO2020006908A1 (en) 2018-07-05 2018-09-29 Url de-duplication method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810735305.5A CN108984703B (en) 2018-07-05 2018-07-05 Uniform Resource Locator (URL) duplicate removal method and device

Publications (2)

Publication Number Publication Date
CN108984703A true CN108984703A (en) 2018-12-11
CN108984703B CN108984703B (en) 2023-04-18

Family

ID=64536296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810735305.5A Active CN108984703B (en) 2018-07-05 2018-07-05 Uniform Resource Locator (URL) duplicate removal method and device

Country Status (2)

Country Link
CN (1) CN108984703B (en)
WO (1) WO2020006908A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008419A (en) * 2019-03-11 2019-07-12 阿里巴巴集团控股有限公司 Removing duplicate webpages method, device and equipment
CN110825947A (en) * 2019-10-31 2020-02-21 深圳前海微众银行股份有限公司 URL duplicate removal method, device, equipment and computer readable storage medium
CN112906005A (en) * 2021-02-02 2021-06-04 浙江大华技术股份有限公司 Web vulnerability scanning method, device, system, electronic device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259282B (en) * 2020-02-13 2023-08-29 深圳市腾讯计算机系统有限公司 URL (Uniform resource locator) duplication removing method, device, electronic equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102185741A (en) * 2011-06-10 2011-09-14 浙江大学 Method for estimating needs of transaction in processor in multi-tier architecture
CN103823825A (en) * 2012-08-30 2014-05-28 埃森哲环球服务有限公司 Online content collection
CN106815247A (en) * 2015-11-30 2017-06-09 北京国双科技有限公司 URL acquisition methods and device
CN106844389A (en) * 2015-12-07 2017-06-13 阿里巴巴集团控股有限公司 The treating method and apparatus of network resources address URL

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365865B (en) * 2012-03-29 2017-07-11 腾讯科技(深圳)有限公司 Date storage method, data download method and its device
CN104933056B (en) * 2014-03-18 2019-08-13 腾讯科技(深圳)有限公司 Uniform resource locator De-weight method and device
CN106547764A (en) * 2015-09-18 2017-03-29 北京国双科技有限公司 The method and device of web data duplicate removal
CN106919570B (en) * 2015-12-24 2020-12-22 国家新闻出版广电总局广播科学研究院 Page link duplication removal scanning method and device for new network media
CN107885820A (en) * 2017-11-07 2018-04-06 北京小度互娱科技有限公司 Breadth traversal orientation grasping means based on crawler system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008419A (en) * 2019-03-11 2019-07-12 阿里巴巴集团控股有限公司 Removing duplicate webpages method, device and equipment
CN110825947A (en) * 2019-10-31 2020-02-21 深圳前海微众银行股份有限公司 URL duplicate removal method, device, equipment and computer readable storage medium
WO2021082938A1 (en) * 2019-10-31 2021-05-06 深圳前海微众银行股份有限公司 Url deduplication method, apparatus, device and computer-readable storage medium
CN110825947B (en) * 2019-10-31 2024-03-08 深圳前海微众银行股份有限公司 URL deduplication method, device, equipment and computer readable storage medium
CN112906005A (en) * 2021-02-02 2021-06-04 浙江大华技术股份有限公司 Web vulnerability scanning method, device, system, electronic device and storage medium

Also Published As

Publication number Publication date
CN108984703B (en) 2023-04-18
WO2020006908A1 (en) 2020-01-09

Similar Documents

Publication Publication Date Title
CN108984703A (en) A kind of uniform resource position mark URL De-weight method and device
CN104978526B (en) The extracting method and device of virus characteristic
CN103392169B (en) Sort method and system
CN109246210B (en) Internet of things communication method and device
CN109271359A (en) Log information processing method, device, electronic equipment and readable storage medium storing program for executing
CN107391571A (en) The processing method and processing device of sensing data
CN108920668A (en) A kind of uniform resource position mark URL De-weight method and device
CN110189165A (en) Channel abnormal user and abnormal channel recognition methods and device
CN108959359A (en) A kind of uniform resource locator semanteme De-weight method, device, equipment and medium
CN109302383A (en) A kind of URL monitoring method and device
CN110019205A (en) A kind of data storage, restoring method, device and computer equipment
CN108509440A (en) A kind of data processing method and device
CN106598747A (en) Network data package parallel processing method and device
CN112350912B (en) Data acquisition method, system and device based on Modbus protocol
CN110213073A (en) Data flow variation, electronic equipment, calculate node and storage medium
CN108463813B (en) Method and device for processing data
CN108551485A (en) A kind of streaming medium content caching method, device and computer storage media
CN108595685A (en) A kind of data processing method and device
CN105095387A (en) Method and device for POI data collection based on user comment information
CN107391627A (en) EMS memory occupation analysis method, device and the server of data
CN113472681A (en) Flow rate limiting method and device
CN110430140A (en) Path processing method, device, equipment and storage medium
CN109002544A (en) A kind of data processing method, device and computer-readable medium
CN105095382A (en) Method and device for sample distributed clustering calculation
CN109643307A (en) Stream processing system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant