CN108984703A - Uniform resource locator (URL) deduplication method and device - Google Patents
Uniform resource locator (URL) deduplication method and device
- Publication number
- CN108984703A CN201810735305.5A
- Authority
- CN
- China
- Prior art keywords
- url
- repeats
- original
- gather
- occurrence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
Abstract
An embodiment of the present application discloses a uniform resource locator (URL) deduplication method and device. The method includes: performing a first generalization on an original URL to obtain a first URL, and determining whether the first URL belongs to a first duplicate set; if the first URL does not belong to the first duplicate set, performing a first parameter reduction on the first URL to obtain a second URL, and determining whether the second URL belongs to a second duplicate set; if the second URL belongs to the second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold; if so, downloading the original URL; if not, discarding the original URL without downloading it. The embodiments of the present application can reduce the number of duplicate URLs that are downloaded and thereby improve the scanning efficiency of a WEB vulnerability scanning system.
Description
Technical field
The present application relates to the field of Internet technologies, and in particular to a uniform resource locator (URL) deduplication method and device.
Background
Many World Wide Web (WEB) vulnerability scanning systems in the industry today have their own uniform resource locator (URL) crawler and their own URL deduplication method. A URL crawler works as follows. The crawler system first carefully selects a set of web pages from the Internet and uses their link addresses as seed URLs, which are placed in a to-be-crawled URL queue. The crawler reads URLs from this queue in turn and resolves each one through the domain name system (DNS), converting the link address into the IP address of the website server. The IP address and the relative path of the web page are then handed to a page downloader, which is responsible for downloading the page. A page downloaded to local storage is, on the one hand, stored in a page repository to await subsequent processing such as indexing; on the other hand, its URL is placed in a crawled queue, which records the URLs of the pages the crawler system has already downloaded so that the system does not crawl them again.
URL deduplication means removing URLs that would be crawled repeatedly, so that the same web page is not crawled more than once. For example, each given URL is mapped to a physical address. To detect whether a given URL is a duplicate, it is only necessary to determine whether the physical address corresponding to that URL already exists. If it does, the URL has already been downloaded and the download is abandoned; otherwise the URL is placed in the to-be-crawled queue to wait for download. However, many URLs within a website differ only in their parameter part, and URLs that differ only in their parameter part are very likely to have been downloaded already, yet the physical addresses obtained by mapping them are not the same. Judging whether such URLs are duplicates solely by their physical addresses therefore causes many duplicate URLs to be crawled, which reduces the scanning efficiency of the WEB vulnerability scanning system.
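The limitation described above can be illustrated with a minimal sketch. It assumes the "physical address" mapping behaves like exact-match lookup, modeled here with a Python set; the example.com URLs and the function name `is_new_exact` are hypothetical and do not come from the application.

```python
# Exact-match deduplication: a URL counts as seen only if the identical
# string was recorded before. The set stands in for the address mapping.
seen = set()

def is_new_exact(url):
    """Return True (and remember the URL) if this exact string is unseen."""
    if url in seen:
        return False
    seen.add(url)
    return True

# Hypothetical URLs that differ only in one parameter value.
first = is_new_exact("http://example.com/item.html?id=1001")
second = is_new_exact("http://example.com/item.html?id=1002")
repeat = is_new_exact("http://example.com/item.html?id=1001")
```

Under these assumptions, both `first` and `second` are treated as new URLs even though they differ only in a parameter value; only the byte-identical `repeat` is rejected.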
Summary of the invention
The embodiments of the present application provide a uniform resource locator (URL) deduplication method and device that can reduce the number of duplicate URLs downloaded and thereby improve the scanning efficiency of a WEB vulnerability scanning system.
In a first aspect, an embodiment of the present application provides a uniform resource locator (URL) deduplication method, the method including:
Performing a first generalization on an original URL to obtain a first URL, where the first generalization replaces multiple consecutive characters of the same type in the original URL with a single character;
If the first URL does not belong to a first duplicate set, performing a first parameter reduction on the first URL to obtain a second URL, where the first parameter reduction reduces the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization to the URLs already downloaded in the historical record;
If the second URL belongs to a second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and if so, downloading the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter reduction to the URLs already downloaded in the historical record.
With reference to the first aspect, in a possible implementation, the method further includes:
If the first URL belongs to the first duplicate set, obtaining a second occurrence count of the first URL in the first duplicate set;
If the second occurrence count is less than or equal to a second threshold, downloading the original URL;
Wherein the second threshold is greater than or equal to the first threshold.
With reference to the first aspect, in a possible implementation, performing the first parameter reduction on the first URL to obtain the second URL if the first URL does not belong to the first duplicate set includes:
Calculating a hash value of the first URL;
Detecting whether the hash value of the first URL belongs to the first duplicate set, where the first duplicate set includes the hash values of the URLs obtained by applying the first generalization to the URLs already downloaded in the historical record;
If the hash value of the first URL does not belong to the first duplicate set, performing the first parameter reduction on the first URL to obtain the second URL.
With reference to the first aspect, in a possible implementation, the method further includes:
If the second URL does not belong to the second duplicate set, performing a second parameter reduction on the second URL to obtain a third URL, where the second parameter reduction reduces the parameter variables in the second URL;
If the third URL belongs to a third duplicate set, detecting whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold, and if so, downloading the original URL, where the third duplicate set includes the URLs obtained by applying the second parameter reduction to the URLs already downloaded in the historical record;
Wherein the third threshold is less than or equal to the first threshold.
With reference to the first aspect, in a possible implementation, the method further includes:
If the third URL does not belong to the third duplicate set, performing a second generalization on the third URL to obtain a fourth URL, where the second generalization replaces at least one character of a target type in the third URL with a target character;
If the fourth URL belongs to a fourth duplicate set, detecting whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold, and if so, downloading the original URL, where the fourth duplicate set includes the URLs obtained by applying the first parameter reduction, the second parameter reduction, and the second generalization to the URLs already downloaded in the historical record;
Wherein the fourth threshold is less than or equal to the third threshold.
With reference to the first aspect, in a possible implementation, the method further includes:
If the fourth URL does not belong to the fourth duplicate set, downloading the original URL.
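The staged decision described in the first aspect can be sketched as a loop over stages, each with its own transform, occurrence counts, and threshold. This is a minimal sketch, not the claimed implementation: the function name `should_download`, the stand-in transforms `collapse_digits` and `drop_query`, the thresholds, and the example.com URLs are all assumptions made for illustration, and the choice to increment the count on a hit and record a miss with count 1 follows the embodiments described later in the text.

```python
import re

def should_download(original_url, stages):
    """Decide whether to download original_url.

    stages is a list of (transform, counts, threshold) tuples applied in
    order; counts maps a transformed URL to its occurrence count.
    """
    url = original_url
    for transform, counts, threshold in stages:
        url = transform(url)
        if url in counts:
            # Hit: bump the count and compare it with this stage's threshold.
            counts[url] += 1
            return counts[url] <= threshold
        # Miss: record the URL with count 1 and fall through to the next stage.
        counts[url] = 1
    # No stage recognized the URL, so it is downloaded.
    return True

# Hypothetical stand-ins for a generalization and a parameter reduction.
def collapse_digits(url):
    return re.sub(r"[0-9]+", "1", url)

def drop_query(url):
    return url.partition("?")[0]

# Thresholds are illustrative; the later stage uses the smaller threshold.
stages = [(collapse_digits, {}, 2), (drop_query, {}, 1)]
results = [
    should_download("http://example.com/a?id=10", stages),
    should_download("http://example.com/a?id=27", stages),
    should_download("http://example.com/a?id=35", stages),
    should_download("http://example.com/a?name=bob", stages),
]
```

In this sketch the third URL is discarded by the first stage once its count exceeds 2, and the fourth URL, which passes the first stage as unseen, is discarded by the stricter second stage.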
In a second aspect, an embodiment of the present application provides a URL deduplication device, the device including:
A first generalization module, configured to perform a first generalization on an original URL to obtain a first URL, where the first generalization replaces multiple consecutive characters of the same type in the original URL with a single character;
A first parameter reduction module, configured to perform, when the first URL does not belong to a first duplicate set, a first parameter reduction on the first URL to obtain a second URL, where the first parameter reduction reduces the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization to the URLs already downloaded in the historical record;
A download module, configured to detect, when the second URL belongs to a second duplicate set, whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and if so, download the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter reduction to the URLs already downloaded in the historical record.
With reference to the second aspect, in a possible implementation, the device further includes:
An obtaining module, configured to obtain, when the first URL belongs to the first duplicate set, a second occurrence count of the first URL in the first duplicate set;
The download module is further configured to download the original URL when the second occurrence count is less than or equal to a second threshold;
Wherein the second threshold is greater than or equal to the first threshold.
With reference to the second aspect, in a possible implementation, the first parameter reduction module includes:
A calculation unit, configured to calculate a hash value of the first URL;
A detection unit, configured to detect whether the hash value of the first URL belongs to the first duplicate set, where the first duplicate set includes the hash values of the URLs obtained by applying the first generalization to the URLs already downloaded in the historical record;
A first parameter reduction unit, configured to perform, when the hash value of the first URL does not belong to the first duplicate set, the first parameter reduction on the first URL to obtain the second URL.
With reference to the second aspect, in a possible implementation, the device further includes:
A second parameter reduction module, configured to perform, when the second URL does not belong to the second duplicate set, a second parameter reduction on the second URL to obtain a third URL, where the second parameter reduction reduces the parameter variables in the second URL;
The download module is further configured to detect, when the third URL belongs to a third duplicate set, whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold, and if so, download the original URL, where the third duplicate set includes the URLs obtained by applying the second parameter reduction to the URLs already downloaded in the historical record;
Wherein the third threshold is less than or equal to the first threshold.
With reference to the second aspect, in a possible implementation, the device further includes:
A second generalization module, configured to perform, when the third URL does not belong to the third duplicate set, a second generalization on the third URL to obtain a fourth URL, where the second generalization replaces at least one character of a target type in the third URL with a target character;
The download module is further configured to detect, when the fourth URL belongs to a fourth duplicate set, whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold, and if so, download the original URL, where the fourth duplicate set includes the URLs obtained by applying the first parameter reduction, the second parameter reduction, and the second generalization to the URLs already downloaded in the historical record;
Wherein the fourth threshold is less than or equal to the third threshold.
With reference to the second aspect, in a possible implementation, the download module is further configured to download the original URL when the fourth URL does not belong to the fourth duplicate set.
In a third aspect, an embodiment of the present application provides a terminal including a processor and a memory that are connected to each other, where the memory is configured to store a computer program that supports the terminal in executing the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the URL deduplication method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to execute the URL deduplication method of the first aspect.
In the embodiments of the present application, a first generalization is performed on an original URL to obtain a first URL, and it is determined whether the first URL belongs to a first duplicate set. If the first URL does not belong to the first duplicate set, a first parameter reduction is performed on the first URL to obtain a second URL, and it is determined whether the second URL belongs to a second duplicate set. If the second URL belongs to the second duplicate set, it is detected whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold; if so, the original URL is downloaded, and if not, the original URL is discarded and not downloaded. This reduces the number of duplicate URLs downloaded and thereby improves the scanning efficiency of the WEB vulnerability scanning system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a uniform resource locator (URL) deduplication method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the relationship between the first duplicate set and occurrence counts;
Fig. 3 is a schematic flowchart of another uniform resource locator (URL) deduplication method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of the relationship between the second duplicate set and occurrence counts;
Fig. 5 is a schematic block diagram of a URL deduplication device provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of a terminal provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Evidently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the protection scope of the present application.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the description, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
It should also be understood that reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the description does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
It should be further understood that the term "and/or" used in the description of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Before the embodiments of the present application are introduced, the data structure of a URL is described. A URL usually has the structure "protocol://server name (IP address)/path/file name?parameters", for example: http://xxx.pingan.com/cgi-bin/index1.html?param1=value1&param2=value2, where param1=value1&param2=value2 is the parameter part of the URL. The parameter part consists of parameter names and parameter values: param1 and param2 are parameter names, and value1 and value2 are parameter values. A parameter value may consist of digits, letters (upper and lower case), special characters (characters other than digits and letters), and/or combinations of these. The question mark "?" separates the file name part from the parameter part of the URL, and the ampersand "&" is the separator between the parameters specified in the URL.
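As a small illustration of this structure, the example URL can be split into its components with Python's standard `urllib.parse` module. The use of Python and of this module is an assumption made for illustration; the application does not name an implementation language.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://xxx.pingan.com/cgi-bin/index1.html?param1=value1&param2=value2"

# urlsplit separates protocol, server name, path, and parameter part.
parts = urlsplit(url)

# parse_qsl splits the parameter part on "&" and each pair on "=".
params = parse_qsl(parts.query)
```

Here `parts.path` holds the path and file name, `parts.query` holds the parameter part after the "?", and `params` is a list of (parameter name, parameter value) pairs.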
The uniform resource locator (URL) deduplication method and device provided by the embodiments of the present application are described below with reference to Fig. 1 to Fig. 6.
Refer to Fig. 1, which is a schematic flowchart of a uniform resource locator (URL) deduplication method provided by an embodiment of the present application. As shown in Fig. 1, the URL deduplication method may include:
S101: Perform the first generalization on the original URL to obtain the first URL.
In some possible implementations, the terminal may sort the parameter part of the original URL by parameter name and perform the first generalization on the parameter values of the parameter part to obtain the first URL. The first generalization replaces multiple consecutive characters of the same type in the original URL with a single character. For example, consecutive digits such as 145 may be replaced with the digit 1, consecutive letters such as FK, aj, or dgA may be replaced with the letter A, and every special character may be replaced with the symbol %. Here, special characters are characters other than digits and letters, such as the question mark "?" and the exclamation mark "!".
For example, suppose the original URL crawled by the terminal is http://xxx.pingan.com/cgi-bin/index1.html?param1=v167!ABD&param2=val_ue2. The terminal may sort the parameter part of the original URL by parameter name so that the parameters are in order, then replace consecutive digits in the parameter values with the preset single digit "1", consecutive letters with the preset single letter "A", and special characters with the preset single special character "%", obtaining the first URL http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2.
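The worked example for S101 can be reproduced with a short sketch. It assumes, based only on that example, that a run must contain at least two digits or two letters to be collapsed (so the lone "v" and the lone "2" are kept) and that every special character becomes "%". The function names `generalize_value` and `first_generalize` are hypothetical.

```python
import re

# Runs of 2+ digits, runs of 2+ ASCII letters, or any single special character.
_RUN = re.compile(r"[0-9]{2,}|[A-Za-z]{2,}|[^0-9A-Za-z]")

def _replace_run(match):
    ch = match.group(0)[0]
    if "0" <= ch <= "9":
        return "1"          # consecutive digits collapse to "1"
    if ("a" <= ch <= "z") or ("A" <= ch <= "Z"):
        return "A"          # consecutive letters collapse to "A"
    return "%"              # each special character becomes "%"

def generalize_value(value):
    """Apply the replacement rules to a single parameter value."""
    return _RUN.sub(_replace_run, value)

def first_generalize(url):
    """Sort parameters by name and generalize each parameter value."""
    base, sep, query = url.partition("?")
    if not sep:
        return url          # no parameter part, nothing to generalize
    pairs = sorted((p.partition("=") for p in query.split("&")),
                   key=lambda p: p[0])
    return base + "?" + "&".join(
        name + eq + generalize_value(value) for name, eq, value in pairs)

first_url = first_generalize(
    "http://xxx.pingan.com/cgi-bin/index1.html?param1=v167!ABD&param2=val_ue2")
```

Under these assumptions `first_url` matches the first URL given in the example, and two original URLs whose parameter values share the same data format map to the same string.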
S102: If the first URL does not belong to the first duplicate set, perform the first parameter reduction on the first URL to obtain the second URL.
In some possible implementations, the terminal may detect whether a URL identical to the first URL exists in the first duplicate set. If not, the first URL does not belong to the first duplicate set, and the terminal may perform the first parameter reduction on the parameter part of the first URL to obtain the second URL. The first parameter reduction reduces the parameter variables in the parameter part of the first URL, for example by removing the parameter values and keeping the parameter names. If such a URL exists, the first URL belongs to the first duplicate set; the terminal may then add 1 to the occurrence count of the first URL in the first duplicate set to obtain the second occurrence count of the first URL in the first duplicate set, and determine whether the second occurrence count is less than or equal to the second threshold. If so, the terminal downloads the original URL; if not (that is, the second occurrence count is greater than the second threshold), the terminal discards the original URL and does not download it. By deciding whether the original URL needs to be downloaded according to whether the first URL obtained through the first generalization has already been downloaded, the terminal can filter out original URLs whose parameter values have the same data format, reducing the number of duplicate URLs downloaded; and because only some of the variables in the parameter values of the original URL are reduced, the accuracy of deduplication is preserved. The first duplicate set may include the URLs obtained by applying the first generalization to the URLs already downloaded in the historical record, that is, the first URLs corresponding to the downloaded URLs. The downloaded URLs may be URLs that have not undergone the first generalization.
For example, suppose the first duplicate set includes URL4, URL6, and URL7, and the first URL is URL1: http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2. The terminal detects whether a URL identical to URL1 exists among URL4, URL6, and URL7 in the first duplicate set, that is, whether URL1 exists in the first duplicate set. Because URL1 does not exist in the first duplicate set, URL1 does not belong to it, and the terminal removes the parameter values from the parameter part of URL1 while keeping the parameter names, obtaining the second URL: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2.
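The parameter reduction in this example, which drops parameter values and keeps parameter names, can be sketched as follows. The function name `first_reduce` is hypothetical, and other ways of reducing parameter variables are possible.

```python
def first_reduce(url):
    """Remove every parameter value and keep only the parameter names."""
    base, sep, query = url.partition("?")
    if not sep:
        return url  # no parameter part, nothing to reduce
    names = [pair.partition("=")[0] for pair in query.split("&")]
    return base + "?" + "&".join(names)

second_url = first_reduce(
    "http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2")
```

With this reduction, URLs that share a path and a set of parameter names map to the same second URL regardless of their parameter values.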
In some possible implementations, the terminal may use message-digest algorithm 5 (MD5) to calculate the hash value of the first URL and detect whether that hash value exists in the first duplicate set. If not, the hash value of the first URL does not belong to the first duplicate set, and the terminal may perform the first parameter reduction on the parameter part of the first URL to obtain the second URL. The first parameter reduction reduces the parameter variables in the parameter part of the first URL, for example by removing the parameter values and keeping the parameter names. If the hash value exists, it belongs to the first duplicate set; the terminal may add 1 to the occurrence count of the hash value of the first URL in the first duplicate set to obtain the second occurrence count of that hash value in the first duplicate set, and determine whether the second occurrence count is less than or equal to the second threshold. If so, the terminal downloads the original URL; if not (that is, the second occurrence count is greater than the second threshold), the terminal discards the original URL and does not download it. In this case the first duplicate set may include the hash values, calculated using MD5, of the URLs obtained by applying the first generalization to the URLs already downloaded in the historical record, that is, the hash values of the first URLs corresponding to the downloaded URLs. The second threshold is an integer greater than 0. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the first duplicate set) stores the hash values of URLs rather than the complete URLs, storage space can be reduced (a complete URL has many characters, whereas a hash value has a fixed size), and processing efficiency can be improved when detecting whether the first URL belongs to the first duplicate set.
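A minimal sketch of the hash-keyed lookup, using MD5 from Python's standard `hashlib` module. The function name `url_key`, the UTF-8 encoding, and the hexadecimal digest form are assumptions; the text only states that MD5 is used.

```python
import hashlib

def url_key(url):
    """Return the MD5 digest of a URL as a fixed-length hex string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# The duplicate set stores digests and occurrence counts, not full URLs.
first_duplicate_set = {}
key = url_key(
    "http://xxx.pingan.com/cgi-bin/index1.html?param1=v1%A&param2=A%A2")
first_duplicate_set[key] = 1
```

Every key has the same length however long the URL is, which is the storage advantage the paragraph above describes.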
For example, Fig. 2 is a schematic diagram of the relationship between the first duplicate set and occurrence counts. Suppose the second threshold is 10, the elements of the first duplicate set are 01, 03, and 06, and the first URL is URL1. The terminal calculates the hash value of URL1 as 03 and detects that the hash value 03 is in the first duplicate set, so URL1 belongs to the first duplicate set. The terminal adds 1 to the occurrence count 3 of the hash value 03 in the first duplicate set, obtaining a second occurrence count of 4. The terminal determines that the second occurrence count 4 is less than the second threshold 10 and downloads the original URL.
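The Fig. 2 example can be traced with a small sketch in which the first duplicate set is a dictionary from hash value to occurrence count. The function name `check_hit` is hypothetical, and the counts shown for 01 and 06 are placeholders because the text gives only the count for 03.

```python
def check_hit(counts, key, threshold):
    """Increment the count of a key already in the set; True means download."""
    counts[key] += 1
    return counts[key] <= threshold

# Count for "03" is from the example; counts for "01" and "06" are placeholders.
first_duplicate_set = {"01": 1, "03": 3, "06": 1}
second_threshold = 10

download = check_hit(first_duplicate_set, "03", second_threshold)
```

After the call the count for 03 is 4, which does not exceed the threshold of 10, so the original URL is downloaded.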
In some possible implementations, when the terminal detects that the first URL does not belong to the first duplicate set, the terminal may add the first URL to the first duplicate set to form a new first duplicate set, that is, update the first duplicate set; at the same time, it may set the occurrence count of the first URL in the first duplicate set to 1 and perform the first parameter reduction on the first URL to obtain the second URL. After downloading or discarding the original URL, the terminal uses the latest first duplicate set to detect whether the next original URL has been downloaded. By continually updating the first duplicate set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency. For example, suppose the first duplicate set includes URL4, URL6, and URL7, and the first URL is URL1. URL1 does not belong to the first duplicate set, so the terminal may add URL1 to it; the updated first duplicate set then includes URL1, URL4, URL6, and URL7. The terminal may also set the occurrence count of URL1 in the first duplicate set to 1.
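The update on a miss can be sketched in a few lines. The function name `record_miss` is hypothetical, and the counts for URL4, URL6, and URL7 are placeholders because the text does not state them.

```python
def record_miss(counts, url):
    """Add a URL that was not in the duplicate set, with occurrence count 1."""
    counts[url] = 1

# Placeholder counts for the URLs already in the set (not given in the text).
first_duplicate_set = {"URL4": 1, "URL6": 1, "URL7": 1}

if "URL1" not in first_duplicate_set:
    record_miss(first_duplicate_set, "URL1")
```

The next original URL is then checked against the updated set, so a later URL that generalizes to URL1 is counted as a repeat.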
In some possible implementations, if the original URL is the first URL crawled in a website, the first duplicate set may be empty (because no URL has been downloaded at that point).
S103: If the second URL belongs to the second duplicate set, detect whether the first occurrence count of the second URL in the second duplicate set is less than or equal to the first threshold, and if so, download the original URL.
In some possible implementations, the terminal may detect whether a URL identical to the second URL exists in the second duplicate set. If so, the second URL belongs to the second duplicate set; the terminal may add 1 to the occurrence count of the second URL in the second duplicate set to obtain the first occurrence count of the second URL in the second duplicate set, and detect whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), the terminal discards the original URL and does not download it. If the terminal detects that the second URL does not exist in the second duplicate set, the second URL does not belong to the second duplicate set, and the terminal may download the original URL directly. The second duplicate set may include the URLs obtained by applying the first parameter reduction to the URLs already downloaded in the historical record, that is, the second URLs corresponding to the downloaded URLs. The downloaded URLs may be URLs that have not undergone the first parameter reduction. By deciding whether the original URL needs to be downloaded according to whether the second URL obtained through the first parameter reduction has already been downloaded, the terminal can filter out original URLs that differ only in their parameter values, further reducing the number of duplicate URLs downloaded and improving scanning efficiency while preserving accuracy. Note that the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization reduces only the variables in the parameter values of the original URL, the first URL still contains many variables, so few URLs are filtered out, many URLs are downloaded, and the first URL occurs many times in the first duplicate set. Making the first threshold smaller than the second threshold therefore ensures that URLs not filtered out by the first generalization can be filtered out after the first parameter reduction, achieving graded deduplication.
In some possible embodiments, the terminal may use MD5 to calculate the hash value of the second URL and detect whether that hash value exists in the second repetition set. If it exists, the hash value of the second URL belongs to the second repetition set; the terminal may add 1 to the occurrence count of the hash value in the second repetition set to obtain the first occurrence count, and judge whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), it discards the original URL without downloading it. If the hash value does not exist in the set, the hash value of the second URL does not belong to the second repetition set, and the terminal can download the original URL. Here the second repetition set may include the MD5 hash values of the URLs obtained by applying the first parameter-reduction processing to the original URLs already downloaded in the history record, that is, the hash values of the second URLs corresponding to the downloaded original URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the second repetition set) stores URL hash values rather than complete URLs, storage space is further reduced, and processing efficiency is further improved when detecting whether the second URL belongs to the second repetition set.
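The hash-based membership check described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: the names `md5_key` and `check_level`, the dictionary used as the repetition set, the three string outcomes, and the example URLs are all hypothetical.

```python
import hashlib


def md5_key(url: str) -> str:
    """Return the fixed-size MD5 hex digest that is stored instead of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def check_level(url: str, repetition_set: dict, threshold: int) -> str:
    """Check one URL against one repetition set (hypothetical helper).

    Returns "miss" when the hash is not in the set. Otherwise the
    occurrence count is incremented and compared with the threshold.
    """
    key = md5_key(url)
    if key not in repetition_set:
        return "miss"
    repetition_set[key] += 1
    return "download" if repetition_set[key] <= threshold else "discard"


# Assumed example data: one second URL already recorded with count 1.
seen_url = "http://example.com/list.html?page&sort"
second_set = {md5_key(seen_url): 1}
```

Every stored key is a 32-character hexadecimal string regardless of the URL length, which is the storage saving the paragraph above refers to.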
In some possible embodiments, when the terminal detects that the second URL does not belong to the second repetition set, it may add the second URL to the second repetition set to form a new second repetition set (that is, update the second repetition set), set the occurrence count of the second URL in the second repetition set to 1, and download the original URL. After downloading or discarding the original URL, the terminal uses the latest second repetition set to detect whether the next original URL has been downloaded. By continuously updating the second repetition set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
In some possible embodiments, if the original URL is the first URL crawled from a website, the second repetition set may be empty (because no URL has been downloaded yet).
In the embodiment of the present application, the first generalization processing is performed on the original URL to obtain the first URL, and whether the first URL belongs to the first repetition set is judged. If the first URL does not belong to the first repetition set, the first parameter-reduction processing is performed on the first URL to obtain the second URL, and whether the second URL belongs to the second repetition set is judged. If the second URL belongs to the second repetition set, it is detected whether the first occurrence count of the second URL in the second repetition set is less than or equal to the first threshold; if so, the original URL is downloaded, and if not, the original URL is discarded and not downloaded. This reduces the number of duplicate URLs downloaded and thereby improves the scanning efficiency of the web vulnerability scanning system.
Fig. 3 is a schematic flowchart of another uniform resource locator (URL) deduplication method provided by an embodiment of the present application. As shown in Fig. 3, the URL deduplication method may include:
S301: Perform the first generalization processing on the original URL to obtain the first URL.
S302: If the first URL does not belong to the first repetition set, perform the first parameter-reduction processing on the first URL to obtain the second URL.
For the implementation of steps S301 and S302 in this embodiment, refer to the implementation provided by steps S101 and S102 of the embodiment shown in Fig. 1; details are not repeated here.
S303: If the second URL belongs to the second repetition set, detect whether the first occurrence count of the second URL in the second repetition set is less than or equal to the first threshold; if so, download the original URL.
S304: If the second URL does not belong to the second repetition set, perform the second parameter-reduction processing on the second URL to obtain the third URL.
In some possible embodiments, the terminal may detect whether a URL identical to the second URL exists in the second repetition set. If one exists, the second URL belongs to the second repetition set; the terminal may add 1 to the occurrence count of the second URL in the second repetition set to obtain the first occurrence count, and detect whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), it discards the original URL without downloading it. If the terminal detects that the second URL is not present in the second repetition set, the second URL does not belong to the second repetition set, and the terminal may perform the second parameter-reduction processing on the second URL to obtain the third URL. The second parameter-reduction processing reduces the parameter variables in the second URL, for example by removing the parameter part of the second URL (including parameter values and parameter names). The second repetition set may include the URLs obtained by applying the first parameter-reduction processing to the URLs already downloaded in the history record, that is, the second URLs corresponding to the downloaded URLs. The downloaded URLs themselves may be URLs that have not undergone the first parameter-reduction processing. By judging whether the second URL obtained through the first parameter-reduction processing has already been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their parameter values, further reduces the number of duplicate URLs downloaded, and improves scanning efficiency while maintaining accuracy. It should be noted that the first threshold is less than or equal to the second threshold, and the first threshold may be an integer greater than 0. Because the first generalization processing only reduces the variables in the parameter values of the original URL, the first URL still contains many variables; fewer URLs are filtered out, more URLs are downloaded, and the first URL appears more often in the first repetition set. Setting the first threshold below the second threshold therefore ensures that URLs not filtered out in steps S301 and S302 can still be filtered after the first parameter-reduction processing, achieving hierarchical deduplication.
For example, Fig. 4 is a schematic diagram of the relationship between the second repetition set and occurrence counts. Assume that the first threshold is 7, the second repetition set includes URL2 and URL5, and the second URL is URL2. The terminal detects that URL2 is in the second repetition set, so URL2 belongs to the second repetition set. The terminal adds 1 to the occurrence count of URL2 in the second repetition set, which is 1, and obtains a first occurrence count of 2. The terminal judges that the first occurrence count 2 is less than the first threshold 7 and therefore downloads the original URL.
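The Fig. 4 example can be traced in a few lines of code. This is a sketch only: the helper name `hit_decision` is hypothetical, the repetition set is modeled as a dictionary of counts, and the count stored for URL5 is an assumed placeholder because the text does not state it.

```python
FIRST_THRESHOLD = 7

# URL2's count of 1 comes from the example; URL5's count is an assumption.
second_set = {"URL2": 1, "URL5": 1}


def hit_decision(url, counts, threshold):
    """Return None on a miss; otherwise increment the count and
    return True (download) or False (discard)."""
    if url not in counts:
        return None
    counts[url] += 1
    return counts[url] <= threshold


download = hit_decision("URL2", second_set, FIRST_THRESHOLD)
```

After the call, the count stored for URL2 is 2, which does not exceed the threshold of 7, so the original URL is downloaded.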
In some possible embodiments, the terminal may use MD5 to calculate the hash value of the second URL and detect whether that hash value exists in the second repetition set. If it exists, the hash value of the second URL belongs to the second repetition set; the terminal may add 1 to the occurrence count of the hash value in the second repetition set to obtain the first occurrence count, and judge whether the first occurrence count is less than or equal to the first threshold. If so, the terminal downloads the original URL; if not (that is, the first occurrence count is greater than the first threshold), it discards the original URL without downloading it. If the hash value does not exist in the set, the hash value of the second URL does not belong to the second repetition set, and the terminal may perform the second parameter-reduction processing on the second URL to obtain the third URL. The second parameter-reduction processing reduces the parameter variables in the second URL, for example by removing the parameter part of the second URL (including parameter values and parameter names). Here the second repetition set may include the MD5 hash values of the URLs obtained by applying the first parameter-reduction processing to the URLs already downloaded in the history record, that is, the hash values of the second URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the second repetition set) stores URL hash values rather than complete URLs, storage space is further reduced, and processing efficiency is further improved when detecting whether the second URL belongs to the second repetition set.
For example, assume that the elements of the second repetition set are 07 and 09, and the second URL is URL2: http://xxx.pingan.com/cgi-bin/index1.html?param1&param2. The terminal calculates the hash value of URL2 as 04 and detects that the hash value 04 is not in the second repetition set, so URL2 does not belong to the second repetition set. The terminal removes the parameter part of URL2 to obtain the third URL: http://xxx.pingan.com/cgi-bin/index1.html.
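The example above can be reproduced with the Python standard library. This is a sketch under assumptions: the function name `strip_parameter_part` is hypothetical, the hash values 07, 09 and 04 are the toy values from the example rather than real MD5 digests, and the counts stored in the set are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_parameter_part(url: str) -> str:
    """Second parameter-reduction: drop the whole query string,
    that is, both parameter names and parameter values."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))


# Toy hash values taken from the example; counts are assumed placeholders.
second_repetition_set = {"07": 1, "09": 1}
url2 = "http://xxx.pingan.com/cgi-bin/index1.html?param1&param2"
url2_hash = "04"  # toy value from the example, not a real MD5 digest

third_url = None
if url2_hash not in second_repetition_set:
    # Miss at the second level: reduce further and continue to the third level.
    third_url = strip_parameter_part(url2)
```

Here `third_url` ends up as `http://xxx.pingan.com/cgi-bin/index1.html`, matching the third URL given in the example.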
In some possible embodiments, when the terminal detects that the second URL does not belong to the second repetition set, it may add the second URL to the second repetition set to form a new second repetition set (that is, update the second repetition set), set the occurrence count of the second URL in the second repetition set to 1, and perform the second parameter-reduction processing on the second URL to obtain the third URL. After downloading or discarding the original URL, the terminal uses the latest second repetition set to detect whether the next original URL has been downloaded. By continuously updating the second repetition set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
In some possible embodiments, if the original URL is the first URL crawled from a website, the second repetition set may be empty (because no URL has been downloaded yet).
S305: If the third URL belongs to the third repetition set, detect whether the third occurrence count of the third URL in the third repetition set is less than or equal to the third threshold; if so, download the original URL.
S306: If the third URL does not belong to the third repetition set, perform the second generalization processing on the third URL to obtain the fourth URL.
In some possible embodiments, the terminal may detect whether a URL identical to the third URL exists in the third repetition set. If one exists, the third URL belongs to the third repetition set; the terminal may add 1 to the occurrence count of the third URL in the third repetition set to obtain the third occurrence count, and detect whether the third occurrence count is less than or equal to the third threshold. If so, the terminal downloads the original URL; if not (that is, the third occurrence count is greater than the third threshold), it discards the original URL without downloading it. If the terminal detects that the third URL is not present in the third repetition set, the third URL does not belong to the third repetition set, and the terminal may perform the second generalization processing on the file name part of the third URL to obtain the fourth URL. The second generalization processing replaces at least one character of a target type in the file name part of the third URL with a target character, for example by replacing one or more digits in the file name part of the third URL with the preset single digit "1". The third repetition set may include the URLs obtained by applying the second parameter-reduction processing to the URLs already downloaded in the history record, that is, the third URLs corresponding to the downloaded URLs. The downloaded URLs themselves may be URLs that have not undergone the second parameter-reduction processing. By judging whether the third URL obtained through the second parameter-reduction processing has already been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their parameter part; because the variables in the original URL are reduced, more original URLs are discarded, the number of duplicate URLs downloaded falls, and scanning efficiency is further improved while accuracy is maintained. It should be noted that the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the third threshold may be an integer greater than 0. Because the first parameter-reduction processing only reduces the variables in the parameter part of the original URL, the second URL still contains many variables; fewer URLs are filtered out, more URLs are downloaded, and the second URL appears more often in the second repetition set. Setting the third threshold below the first threshold therefore ensures that URLs not filtered out in steps S303 and S304 can still be filtered after the second parameter-reduction processing, achieving hierarchical deduplication.
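The second generalization processing on the file name part can be sketched as below. The function name `generalize_file_name` is hypothetical, and treating the last path segment as the file name part and collapsing each run of digits into a single "1" is one assumed reading of the description, not a rule confirmed by the application.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def generalize_file_name(url: str) -> str:
    """Replace every run of digits in the last path segment with "1".

    Directories earlier in the path are left untouched.
    """
    parts = urlsplit(url)
    head, sep, name = parts.path.rpartition("/")
    name = re.sub(r"\d+", "1", name)
    return urlunsplit(parts._replace(path=head + sep + name))


fourth_url = generalize_file_name("http://xxx.pingan.com/cgi-bin/index123.html")
```

With this rule, pages such as `index1.html`, `index22.html` and `index123.html` all map to the same fourth URL, so they share one entry in the fourth repetition set.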
In some possible embodiments, the terminal may use MD5 to calculate the hash value of the third URL and detect whether that hash value exists in the third repetition set. If it exists, the hash value of the third URL belongs to the third repetition set; the terminal may add 1 to the occurrence count of the hash value in the third repetition set to obtain the third occurrence count, and judge whether the third occurrence count is less than or equal to the third threshold. If so, the terminal downloads the original URL; if not (that is, the third occurrence count is greater than the third threshold), it discards the original URL without downloading it. If the hash value does not exist in the set, the hash value of the third URL does not belong to the third repetition set, and the terminal may perform the second generalization processing on the third URL to obtain the fourth URL. The second generalization processing replaces at least one character of a target type in the file name part of the third URL with a target character, for example by replacing one or more digits in the file name part of the third URL with the preset single digit "1". Here the third repetition set may include the MD5 hash values of the URLs obtained by applying the second parameter-reduction processing to the URLs already downloaded in the history record, that is, the hash values of the third URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the third repetition set) stores URL hash values rather than complete URLs, storage space is further reduced, and processing efficiency is further improved when detecting whether the third URL belongs to the third repetition set.
In some possible embodiments, when the terminal detects that the third URL does not belong to the third repetition set, it may add the third URL to the third repetition set to form a new third repetition set (that is, update the third repetition set), set the occurrence count of the third URL in the third repetition set to 1, and perform the second generalization processing on the third URL to obtain the fourth URL. After downloading or discarding the original URL, the terminal uses the latest third repetition set to detect whether the next original URL has been downloaded. By continuously updating the third repetition set, the terminal can filter out duplicate URLs level by level more accurately and further improve scanning efficiency.
In some possible embodiments, if the original URL is the first URL crawled from a website, the third repetition set may be empty (because no URL has been downloaded yet).
S307: If the fourth URL belongs to the fourth repetition set, detect whether the fourth occurrence count of the fourth URL in the fourth repetition set is less than or equal to the fourth threshold; if so, download the original URL.
S308: If the fourth URL does not belong to the fourth repetition set, download the original URL.
In some possible embodiments, the terminal may detect whether a URL identical to the fourth URL exists in the fourth repetition set. If one exists, the fourth URL belongs to the fourth repetition set; the terminal may add 1 to the occurrence count of the fourth URL in the fourth repetition set to obtain the fourth occurrence count, and detect whether the fourth occurrence count is less than or equal to the fourth threshold. If so, the terminal downloads the original URL; if not (that is, the fourth occurrence count is greater than the fourth threshold), it discards the original URL without downloading it. If the terminal detects that the fourth URL is not present in the fourth repetition set, the fourth URL does not belong to the fourth repetition set, and the terminal can download the original URL directly. The fourth repetition set may include the URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history record, that is, the fourth URLs corresponding to the downloaded URLs. By judging whether the fourth URL obtained through the second generalization processing has already been downloaded, the terminal decides whether the original URL needs to be downloaded. This filters out original URLs that differ only in their file name part; because the variables in the original URL are reduced, more original URLs are discarded, fewer duplicate URLs are downloaded, and scanning efficiency is further improved while accuracy is maintained. It should be noted that the fourth threshold is less than or equal to the third threshold, the third threshold may be less than or equal to the first threshold, the first threshold is less than or equal to the second threshold, and the fourth threshold is an integer greater than or equal to 0. Setting the fourth threshold below the third threshold ensures that URLs not filtered out in steps S305 and S306 can still be filtered after the second generalization processing, achieving hierarchical deduplication.
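The ordering constraints on the four thresholds stated above can be collected into a single check. The function name `thresholds_are_consistent` and the example values are hypothetical; only the inequalities and the lower bounds come from the text.

```python
def thresholds_are_consistent(first: int, second: int, third: int, fourth: int) -> bool:
    """Check fourth <= third <= first <= second, with the first and third
    thresholds greater than 0 and the fourth threshold at least 0."""
    return (
        fourth >= 0
        and third > 0
        and first > 0
        and fourth <= third <= first <= second
    )


ok = thresholds_are_consistent(first=7, second=10, third=3, fourth=0)
```

Note that the second threshold applies to the first (least reduced) level, which is why it is the largest: the more aggressively a URL has been reduced, the fewer repeats are tolerated.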
In some possible embodiments, the terminal may use MD5 to calculate the hash value of the fourth URL and detect whether that hash value exists in the fourth repetition set. If it exists, the hash value of the fourth URL belongs to the fourth repetition set; the terminal may add 1 to the occurrence count of the hash value in the fourth repetition set to obtain the fourth occurrence count, and judge whether the fourth occurrence count is less than or equal to the fourth threshold. If so, the terminal downloads the original URL; if not (that is, the fourth occurrence count is greater than the fourth threshold), it discards the original URL without downloading it. If the hash value does not exist in the set, the hash value of the fourth URL does not belong to the fourth repetition set, and the terminal can download the original URL directly. Here the fourth repetition set may include the MD5 hash values of the URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history record, that is, the hash values of the fourth URLs corresponding to the downloaded URLs. Because a hash function converts data of arbitrary size into data of a fixed size, and the downloaded set (the fourth repetition set) stores URL hash values rather than complete URLs, storage space is further reduced, and processing efficiency is further improved when detecting whether the fourth URL belongs to the fourth repetition set.
In some possible embodiments, when the terminal detects that the fourth URL does not belong to the fourth repetition set, it may add the fourth URL to the fourth repetition set to form a new fourth repetition set (that is, update the fourth repetition set), set the occurrence count of the fourth URL in the fourth repetition set to 1, and download the original URL. After downloading or discarding the original URL, the terminal uses the latest fourth repetition set to detect whether the next original URL has been downloaded. By continuously updating the fourth repetition set, the terminal can filter out duplicate URLs more accurately and further improve scanning efficiency.
In some possible embodiments, if the original URL is the first URL crawled from a website, the fourth repetition set may be empty (because no URL has been downloaded yet).
The embodiment of the present application performs hierarchical deduplication on the original URL: it reduces the variables in the original URL level by level and judges, from the URL obtained at each level, whether the original URL has been downloaded. If a level indicates that the original URL has been downloaded, the original URL is discarded; if a level determines that the original URL has not been downloaded, the original URL is downloaded; and if a level cannot decide, the judgment passes to the next level, until the original URL is determined to have been downloaded or not. This finer-grained hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of the web vulnerability scanning system, but also improves the accuracy of deduplication.
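The four-level flow of steps S301 to S308 can be sketched end to end as follows. This is an illustration under assumptions, not the implementation of the application: all names are hypothetical, the concrete rules used for the first generalization (collapsing digit and letter runs inside parameter values) and for the first parameter-reduction (keeping parameter names and dropping values) are assumed readings of the description, and each repetition set is a dictionary from MD5 digest to occurrence count that is updated on every miss.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit


def _key(url):
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def _query_items(url):
    parts = urlsplit(url)
    return parts, [item for item in parts.query.split("&") if item]


def generalize_values(url):
    """Level 1 (assumed rule): collapse digit and letter runs in parameter values."""
    parts, items = _query_items(url)
    out = []
    for item in items:
        name, sep, value = item.partition("=")
        value = re.sub(r"[A-Za-z]+", "a", re.sub(r"\d+", "1", value))
        out.append(name + sep + value)
    return urlunsplit(parts._replace(query="&".join(out)))


def drop_values(url):
    """Level 2 (assumed rule): keep parameter names, drop their values."""
    parts, items = _query_items(url)
    names = [item.partition("=")[0] for item in items]
    return urlunsplit(parts._replace(query="&".join(names)))


def drop_query(url):
    """Level 3: remove the whole parameter part."""
    return urlunsplit(urlsplit(url)._replace(query=""))


def generalize_file_name(url):
    """Level 4: replace digit runs in the file name part with "1"."""
    parts = urlsplit(url)
    head, sep, name = parts.path.rpartition("/")
    return urlunsplit(parts._replace(path=head + sep + re.sub(r"\d+", "1", name)))


LEVELS = (generalize_values, drop_values, drop_query, generalize_file_name)


class CascadeDeduper:
    def __init__(self, thresholds):
        # Level order: (second, first, third, fourth) threshold in the
        # source's naming, so the values should be non-increasing.
        self.thresholds = thresholds
        self.sets = [{} for _ in LEVELS]

    def should_download(self, original_url):
        url = original_url
        for transform, seen, limit in zip(LEVELS, self.sets, self.thresholds):
            url = transform(url)
            key = _key(url)
            if key in seen:
                # Hit: this level decides, based on the updated count.
                seen[key] += 1
                return seen[key] <= limit
            # Miss: record the URL at this level and try the next level.
            seen[key] = 1
        # Miss at every level: the URL is new, so download it.
        return True
```

For instance, with thresholds `(2, 1, 1, 1)` the first two URLs that share a parameter pattern are downloaded, a third one is discarded at level 1, and a URL that differs only in the digits of its file name is discarded at level 4.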
Fig. 5 is a schematic block diagram of a URL deduplication device provided by an embodiment of the present application. The URL deduplication device provided by the embodiment of the present application includes:
A first generalization processing module 10, configured to perform the first generalization processing on an original URL to obtain a first URL, where the first generalization processing is used to replace multiple consecutive characters of the same type in the original URL with a single character;
A first parameter-reduction processing module 20, configured to, when the first URL does not belong to the first repetition set, perform the first parameter-reduction processing on the first URL to obtain a second URL, where the first parameter-reduction processing is used to reduce the parameter variables in the first URL, and the first repetition set includes the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record;
A download module 30, configured to, when the second URL belongs to the second repetition set, detect whether the first occurrence count of the second URL in the second repetition set is less than or equal to the first threshold and, if so, download the original URL, where the second repetition set includes the URLs obtained by applying the first parameter-reduction processing to the URLs already downloaded in the history record.
In some possible embodiments, the device further includes an acquisition module 40. The acquisition module 40 is configured to, when the first URL belongs to the first repetition set, obtain the second occurrence count of the first URL in the first repetition set. The download module 30 is further configured to download the original URL when the second occurrence count is less than or equal to the second threshold, where the second threshold is greater than or equal to the first threshold.
In some possible embodiments, the first parameter-reduction processing module 20 includes a computing unit 201, a detection unit 202 and a first parameter-reduction processing unit 203. The computing unit 201 is configured to calculate the hash value of the first URL. The detection unit 202 is configured to detect whether the hash value of the first URL belongs to the first repetition set. The first parameter-reduction processing unit 203 is configured to, when the hash value of the first URL does not belong to the first repetition set, perform the first parameter-reduction processing on the first URL to obtain the second URL. Here the first repetition set includes the hash values of the URLs obtained by applying the first generalization processing to the URLs already downloaded in the history record.
In some possible embodiments, the device further includes a second parameter-reduction processing module 50. The second parameter-reduction processing module 50 is configured to, when the second URL does not belong to the second repetition set, perform the second parameter-reduction processing on the second URL to obtain a third URL. The download module 30 is further configured to, when the third URL belongs to the third repetition set, detect whether the third occurrence count of the third URL in the third repetition set is less than or equal to the third threshold and, if so, download the original URL. Here the third repetition set includes the URLs obtained by applying the second parameter-reduction processing to the URLs already downloaded in the history record, the second parameter-reduction processing is used to reduce the parameter variables in the second URL, and the third threshold is less than or equal to the first threshold.
In some possible embodiments, the device further includes a second generalization processing module 60. The second generalization processing module 60 is configured to, when the third URL does not belong to the third repetition set, perform the second generalization processing on the third URL to obtain a fourth URL. The download module 30 is further configured to, when the fourth URL belongs to the fourth repetition set, detect whether the fourth occurrence count of the fourth URL in the fourth repetition set is less than or equal to the fourth threshold and, if so, download the original URL. Here the fourth repetition set includes the URLs obtained by applying the first and second parameter-reduction processing and the second generalization processing to the URLs already downloaded in the history record, the second generalization processing is used to replace at least one character of a target type in the third URL with a target character, and the fourth threshold is less than or equal to the third threshold.
In some possible embodiments, the download module 30 is further configured to download the original URL when the fourth URL does not belong to the fourth repetition set.
In a specific implementation, the URL deduplication device can use the above modules to execute the implementations provided by the steps of Fig. 1 or Fig. 3 and realize the functions realized in the above embodiments. For details, refer to the corresponding descriptions of the steps in the method embodiments shown in Fig. 1 or Fig. 3, which are not repeated here.
In the embodiment of the present application, the URL deduplication device performs hierarchical deduplication on the original URL: it reduces the variables in the original URL level by level and judges, from the URL obtained at each level, whether the original URL has been downloaded. If a level indicates that the original URL has been downloaded, the original URL is discarded; if a level determines that the original URL has not been downloaded, the original URL is downloaded; and if a level cannot decide, the judgment passes to the next level, until the original URL is determined to have been downloaded or not. This finer-grained hierarchical deduplication scheme not only reduces the number of duplicate URLs downloaded and improves the scanning efficiency of the web vulnerability scanning system, but also improves the accuracy of deduplication.
Fig. 6 is a schematic block diagram of a terminal provided by an embodiment of the present application. As shown in Fig. 6, the terminal in this embodiment may include one or more processors 601 and a memory 602. The processor 601 and the memory 602 are connected by a bus 603. The memory 602 is configured to store a computer program that includes program instructions, and the processor 601 is configured to execute the program instructions stored in the memory 602. Specifically, the processor 601 is configured to call the program instructions to execute:
performing first generalization processing on an original URL to obtain a first URL, where the first generalization processing replaces multiple consecutive characters of the same type in the original URL with a single character;
if the first URL does not belong to a first duplicate set, performing first parameter-reduction processing on the first URL to obtain a second URL, where the first parameter-reduction processing reduces the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization processing to URLs already downloaded in the historical record;
if the second URL belongs to a second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold and, if so, downloading the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the historical record.
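The three steps above can be sketched as follows. The text does not fix the exact transformations, so this sketch makes two labeled assumptions: "characters of the same type" is taken to mean digits, with each run of digits replaced by a single `0`, and "reducing the parameter variables" is taken to mean dropping query values while keeping parameter names. The function names `generalize`, `reduce_params`, and `should_download` are hypothetical.

```python
import re
from collections import Counter
from typing import Optional
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def generalize(url: str) -> str:
    # Assumption: only digit runs are collapsed, each into a single "0".
    return re.sub(r"\d+", "0", url)


def reduce_params(url: str) -> str:
    # Assumption: keep the sorted query parameter names and drop their values.
    parts = urlsplit(url)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "&".join(names), ""))


def should_download(original: str, first_set: Counter, second_set: Counter,
                    first_threshold: int) -> Optional[bool]:
    """True: download; False: discard; None: not decided by these steps."""
    first = generalize(original)
    if first in first_set:
        return None  # branch not covered by the three steps above
    second = reduce_params(first)
    if second in second_set:
        # Download only while this URL shape has been seen rarely enough.
        return second_set[second] <= first_threshold
    return None  # a later tier would have to decide
```

For example, `http://example.com/item/12345?id=678&page=9` generalizes to `http://example.com/item/0?id=0&page=0` and reduces to `http://example.com/item/0?id&page` under these assumptions. Note that this simple `generalize` also collapses digits in the host or port, which a real implementation would probably want to avoid.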
It should be understood that, in some possible embodiments, the processor 601 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 602 may include read-only memory and random access memory, and provides instructions and data to the processor 601. A part of the memory 602 may also include non-volatile random access memory. For example, the memory 602 may also store information about the device type.
In a specific implementation, the processor 601 described in the embodiments of the present application can execute the implementations described in the uniform resource locator (URL) deduplication method provided by the embodiments of the present application, and can also execute the implementations of the URL deduplication device described in the embodiments of the present application; details are not repeated here.
An embodiment of the present application also provides a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, implement the uniform resource locator (URL) deduplication method shown in Fig. 1 or Fig. 3. For details, refer to the description of the embodiments shown in Fig. 1 or Fig. 3, which is not repeated here.
The computer-readable storage medium may be an internal storage unit of the URL deduplication device or electronic device described in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card fitted to the electronic device. Further, the computer-readable storage medium may include both an internal storage unit of the electronic device and an external storage device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the electronic device. It may also be used to temporarily store data that has been output or is about to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.
The present application is described with reference to flowcharts and/or block diagrams of the method, apparatus (terminal), and computer program product according to the embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and any combination of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction device, where the instruction device implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Although the present application has been described in conjunction with specific features and embodiments, it is evident that various modifications and combinations can be made without departing from the spirit and scope of the present application. Accordingly, the specification and drawings are merely exemplary illustrations of the application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (10)
1. A uniform resource locator (URL) deduplication method, characterized by comprising:
performing first generalization processing on an original URL to obtain a first URL, where the first generalization processing replaces multiple consecutive characters of the same type in the original URL with a single character;
if the first URL does not belong to a first duplicate set, performing first parameter-reduction processing on the first URL to obtain a second URL, where the first parameter-reduction processing reduces the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization processing to URLs already downloaded in the historical record;
if the second URL belongs to a second duplicate set, detecting whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and, if the first occurrence count is less than or equal to the first threshold, downloading the original URL, where the second duplicate set includes the URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the historical record.
2. The method according to claim 1, characterized in that the method further comprises:
if the first URL belongs to the first duplicate set, obtaining a second occurrence count of the first URL in the first duplicate set;
if the second occurrence count is less than or equal to a second threshold, downloading the original URL;
where the second threshold is greater than or equal to the first threshold.
3. The method according to claim 1, characterized in that, if the first URL does not belong to the first duplicate set, performing the first parameter-reduction processing on the first URL to obtain the second URL comprises:
calculating a hash value of the first URL;
detecting whether the hash value of the first URL belongs to the first duplicate set, where the first duplicate set includes the hash values of the URLs obtained by applying the first generalization processing to URLs already downloaded in the historical record;
if the hash value of the first URL does not belong to the first duplicate set, performing the first parameter-reduction processing on the first URL to obtain the second URL.
4. The method according to any one of claims 1-3, characterized in that the method further comprises:
if the second URL does not belong to the second duplicate set, performing second parameter-reduction processing on the second URL to obtain a third URL, where the second parameter-reduction processing reduces the parameter variables in the second URL;
if the third URL belongs to a third duplicate set, detecting whether a third occurrence count of the third URL in the third duplicate set is less than or equal to a third threshold, and, if the third occurrence count is less than or equal to the third threshold, downloading the original URL, where the third duplicate set includes the URLs obtained by applying the second parameter-reduction processing to URLs already downloaded in the historical record;
where the third threshold is less than or equal to the first threshold.
5. The method according to claim 4, characterized in that the method further comprises:
if the third URL does not belong to the third duplicate set, performing second generalization processing on the third URL to obtain a fourth URL, where the second generalization processing replaces at least one character of a target type in the third URL with a target character;
if the fourth URL belongs to a fourth duplicate set, detecting whether a fourth occurrence count of the fourth URL in the fourth duplicate set is less than or equal to a fourth threshold, and, if the fourth occurrence count is less than or equal to the fourth threshold, downloading the original URL, where the fourth duplicate set includes the URLs obtained by applying the first parameter-reduction processing, the second parameter-reduction processing, and the second generalization processing to URLs already downloaded in the historical record;
where the fourth threshold is less than or equal to the third threshold.
6. The method according to claim 5, characterized in that the method further comprises:
if the fourth URL does not belong to the fourth duplicate set, downloading the original URL.
7. A URL deduplication device, characterized by comprising:
a first generalization processing module, configured to perform first generalization processing on an original URL to obtain a first URL, where the first generalization processing replaces multiple consecutive characters of the same type in the original URL with a single character;
a first parameter-reduction processing module, configured to perform first parameter-reduction processing on the first URL to obtain a second URL when the first URL does not belong to a first duplicate set, where the first parameter-reduction processing reduces the parameter variables in the first URL, and the first duplicate set includes the URLs obtained by applying the first generalization processing to URLs already downloaded in the historical record;
a download module, configured to detect, when the second URL belongs to a second duplicate set, whether a first occurrence count of the second URL in the second duplicate set is less than or equal to a first threshold, and to download the original URL if the first occurrence count is less than or equal to the first threshold, where the second duplicate set includes the URLs obtained by applying the first parameter-reduction processing to URLs already downloaded in the historical record.
8. The device according to claim 7, characterized in that the device further comprises:
an obtaining module, configured to obtain a second occurrence count of the first URL in the first duplicate set when the first URL belongs to the first duplicate set;
the download module is further configured to download the original URL when the second occurrence count is less than or equal to a second threshold;
where the second threshold is greater than or equal to the first threshold.
9. A terminal, characterized by comprising a processor and a memory that are connected to each other, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1-6.
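As an illustration of the hash-based membership test in claim 3 and the threshold ordering stated in claims 2, 4, and 5, the following is a minimal sketch under stated assumptions. The claims do not name a hash function, so SHA-256 is an assumption; the `DuplicateSet` class and the `thresholds_consistent` helper are hypothetical names introduced here.

```python
import hashlib
from collections import Counter


class DuplicateSet:
    """Occurrence counts keyed by the hash value of a processed URL."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    @staticmethod
    def _digest(url: str) -> str:
        # Assumption: SHA-256; the claims only require "a hash value".
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def record(self, url: str) -> None:
        # Called once for each downloaded URL that maps to this processed form.
        self._counts[self._digest(url)] += 1

    def __contains__(self, url: str) -> bool:
        return self._digest(url) in self._counts

    def occurrences(self, url: str) -> int:
        return self._counts.get(self._digest(url), 0)


def thresholds_consistent(second: int, first: int, third: int, fourth: int) -> bool:
    # Claims 2, 4 and 5: second >= first, third <= first, fourth <= third.
    return second >= first >= third >= fourth
```

Storing fixed-length digests instead of full URLs keeps each entry the same size regardless of URL length. The decreasing thresholds mean that the more a URL has been reduced, the fewer repeats of its reduced form are tolerated before the original is discarded.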
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810735305.5A CN108984703B (en) | 2018-07-05 | 2018-07-05 | Uniform Resource Locator (URL) duplicate removal method and device |
PCT/CN2018/108708 WO2020006908A1 (en) | 2018-07-05 | 2018-09-29 | Url de-duplication method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810735305.5A CN108984703B (en) | 2018-07-05 | 2018-07-05 | Uniform Resource Locator (URL) duplicate removal method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108984703A (en) | 2018-12-11 |
CN108984703B (en) | 2023-04-18 |
Family ID: 64536296
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810735305.5A Active CN108984703B (en) | 2018-07-05 | 2018-07-05 | Uniform Resource Locator (URL) duplicate removal method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108984703B (en) |
WO (1) | WO2020006908A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008419A (en) * | 2019-03-11 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Removing duplicate webpages method, device and equipment |
CN110825947A (en) * | 2019-10-31 | 2020-02-21 | 深圳前海微众银行股份有限公司 | URL duplicate removal method, device, equipment and computer readable storage medium |
CN112906005A (en) * | 2021-02-02 | 2021-06-04 | 浙江大华技术股份有限公司 | Web vulnerability scanning method, device, system, electronic device and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259282B (en) * | 2020-02-13 | 2023-08-29 | 深圳市腾讯计算机系统有限公司 | URL (Uniform resource locator) duplication removing method, device, electronic equipment and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102185741A (en) * | 2011-06-10 | 2011-09-14 | 浙江大学 | Method for estimating needs of transaction in processor in multi-tier architecture |
CN103823825A (en) * | 2012-08-30 | 2014-05-28 | 埃森哲环球服务有限公司 | Online content collection |
CN106815247A (en) * | 2015-11-30 | 2017-06-09 | 北京国双科技有限公司 | URL acquisition methods and device |
CN106844389A (en) * | 2015-12-07 | 2017-06-13 | 阿里巴巴集团控股有限公司 | The treating method and apparatus of network resources address URL |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103365865B (en) * | 2012-03-29 | 2017-07-11 | 腾讯科技(深圳)有限公司 | Date storage method, data download method and its device |
CN104933056B (en) * | 2014-03-18 | 2019-08-13 | 腾讯科技(深圳)有限公司 | Uniform resource locator De-weight method and device |
CN106547764A (en) * | 2015-09-18 | 2017-03-29 | 北京国双科技有限公司 | The method and device of web data duplicate removal |
CN106919570B (en) * | 2015-12-24 | 2020-12-22 | 国家新闻出版广电总局广播科学研究院 | Page link duplication removal scanning method and device for new network media |
CN107885820A (en) * | 2017-11-07 | 2018-04-06 | 北京小度互娱科技有限公司 | Breadth traversal orientation grasping means based on crawler system |
-
2018
- 2018-07-05 CN CN201810735305.5A patent/CN108984703B/en active Active
- 2018-09-29 WO PCT/CN2018/108708 patent/WO2020006908A1/en active Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008419A (en) * | 2019-03-11 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Removing duplicate webpages method, device and equipment |
CN110825947A (en) * | 2019-10-31 | 2020-02-21 | 深圳前海微众银行股份有限公司 | URL duplicate removal method, device, equipment and computer readable storage medium |
WO2021082938A1 (en) * | 2019-10-31 | 2021-05-06 | 深圳前海微众银行股份有限公司 | Url deduplication method, apparatus, device and computer-readable storage medium |
CN110825947B (en) * | 2019-10-31 | 2024-03-08 | 深圳前海微众银行股份有限公司 | URL deduplication method, device, equipment and computer readable storage medium |
CN112906005A (en) * | 2021-02-02 | 2021-06-04 | 浙江大华技术股份有限公司 | Web vulnerability scanning method, device, system, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108984703B (en) | 2023-04-18 |
WO2020006908A1 (en) | 2020-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984703A (en) | A kind of uniform resource position mark URL De-weight method and device | |
CN104978526B (en) | The extracting method and device of virus characteristic | |
CN103392169B (en) | Sort method and system | |
CN109246210B (en) | Internet of things communication method and device | |
CN109271359A (en) | Log information processing method, device, electronic equipment and readable storage medium storing program for executing | |
CN107391571A (en) | The processing method and processing device of sensing data | |
CN108920668A (en) | A kind of uniform resource position mark URL De-weight method and device | |
CN110189165A (en) | Channel abnormal user and abnormal channel recognition methods and device | |
CN108959359A (en) | A kind of uniform resource locator semanteme De-weight method, device, equipment and medium | |
CN109302383A (en) | A kind of URL monitoring method and device | |
CN110019205A (en) | A kind of data storage, restoring method, device and computer equipment | |
CN108509440A (en) | A kind of data processing method and device | |
CN106598747A (en) | Network data package parallel processing method and device | |
CN112350912B (en) | Data acquisition method, system and device based on Modbus protocol | |
CN110213073A (en) | Data flow variation, electronic equipment, calculate node and storage medium | |
CN108463813B (en) | Method and device for processing data | |
CN108551485A (en) | A kind of streaming medium content caching method, device and computer storage media | |
CN108595685A (en) | A kind of data processing method and device | |
CN105095387A (en) | Method and device for POI data collection based on user comment information | |
CN107391627A (en) | EMS memory occupation analysis method, device and the server of data | |
CN113472681A (en) | Flow rate limiting method and device | |
CN110430140A (en) | Path processing method, device, equipment and storage medium | |
CN109002544A (en) | A kind of data processing method, device and computer-readable medium | |
CN105095382A (en) | Method and device for sample distributed clustering calculation | |
CN109643307A (en) | Stream processing system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||