CN104933056B - Uniform resource locator deduplication method and device - Google Patents
- Publication number
- CN104933056B (application number CN201410100765.2A)
- Authority
- CN
- China
- Prior art keywords
- duplicate removal
- resource locator
- uniform resource
- rule
- removal rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The present invention provides a uniform resource locator (URL) deduplication method and device. The method includes: presetting a deduplication rule base according to the structure of uniform resource locators; obtaining the URL data to be deduplicated from website access data; matching the URLs to be deduplicated against the deduplication rules in the rule base according to the structure and segment parameters of each URL; and filtering the matched URLs that correspond to the same deduplication rule, retaining one URL for each rule. With the method and device of the embodiments, massive URL data can be filtered and deduplicated through deduplication rules, so that a security vulnerability scanner does not scan the same CGI repeatedly during URL security vulnerability detection, which improves the efficiency of vulnerability detection.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a uniform resource locator deduplication method and device and a corresponding deduplication rule generation method and device.
Background art
URL rewriting (URL Rewrite) is an internet technique for rewriting URLs (Uniform Resource Locators). It first obtains the URL request that a user terminal sends to access a website and then rewrites it into another URL that the website can process; the user receives the content returned for the processed URL.
For example, the news on many news websites falls into many categories, such as sports and technology. New items are published every day, so the news is also classified by date, and each day's items carry a news index ID. When a user terminal accesses a news page in the sports category, the website server, on receiving the access request, can form an intermediate URL through CGI (Common Gateway Interface) to access the back-end data in a uniform way, as shown in formula (1):
http://www.qq.com/news/getNews?type=sports&date=20131120&id=1 (1)
An intermediate URL structure like formula (1) has many disadvantages: it is hard to remember, hard to read, and inconvenient to share on mobile terminals such as mobile phones and tablet computers. Many website servers therefore use URL rewriting to rewrite the intermediate URL structure into the form of formula (2):
http://www.qq.com/news/sports/20131120/1.html (2)
The URL structure of formula (2) overcomes the disadvantages of the intermediate URL of formula (1): the URL is shorter, easier to remember and read, and easier to share.
However, while URL rewriting brings many benefits, it also causes the following phenomenon: one dynamic CGI is rewritten into multiple URLs, so the number of URLs of a website increases. For example, the segment parameters in formula (1) include "getNews?type", "date" and "id". Now suppose a user terminal accesses a technology news page and the CGI forms the intermediate URL of formula (3):
http://www.qq.com/news/getNews?type=science&date=20131121&id=2 (3)
The segment parameters in formula (3) likewise include "getNews?type", "date" and "id"; only the value of each segment parameter differs. The two intermediate URLs of formula (3) and formula (1) are therefore generated by the same dynamic CGI. Formula (4) is the URL obtained after formula (3) is rewritten by URL rewriting:
http://www.qq.com/news/science/20131121/2.html (4)
The URL of formula (4) and the URL of formula (2) are two different URLs, so the addresses generated by one dynamic CGI can be rewritten into multiple URLs. This phenomenon causes the following problem:
During URL security vulnerability detection, multiple URLs may have been rewritten from a single dynamic CGI, so detecting the vulnerabilities of these URLs in fact detects the same dynamic CGI behind the rewriting, and one CGI is scanned repeatedly. Moreover, a large website has a very large number of URLs, and one URL may carry multiple parameters, so the URL security vulnerability scanner ends up detecting vulnerabilities in the company's services with very low efficiency. Repeatedly scanning the same dynamic CGI also imposes unnecessary performance loss and operating cost on the company's websites.
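The phenomenon described above can be sketched in a few lines of Python. This is only an illustration: the regular expression is a hypothetical rewrite pattern inferred from formulas (1) to (4), and the function names are not part of the patent. The sketch maps each rewritten URL back to its intermediate URL and shows that both resolve to the same CGI.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rewrite pattern for the news example: /news/TYPE/DATE/ID.html
REWRITTEN = re.compile(r"^/news/(?P<type>[^/]+)/(?P<date>\d{8})/(?P<id>\d+)\.html$")


def to_intermediate(url: str) -> str:
    """Map a rewritten news URL back to the intermediate CGI URL it stands for."""
    parts = urlsplit(url)
    m = REWRITTEN.match(parts.path)
    if m is None:
        return url  # not a rewritten news URL; leave it unchanged
    return (f"{parts.scheme}://{parts.netloc}/news/getNews"
            f"?type={m['type']}&date={m['date']}&id={m['id']}")


def cgi_endpoint(url: str) -> str:
    """Return host + path of the intermediate URL, i.e. the CGI that serves it."""
    parts = urlsplit(to_intermediate(url))
    return parts.netloc + parts.path


url2 = "http://www.qq.com/news/sports/20131120/1.html"   # formula (2)
url4 = "http://www.qq.com/news/science/20131121/2.html"  # formula (4)
print(to_intermediate(url2))  # the intermediate URL of formula (1)
print(cgi_endpoint(url2) == cgi_endpoint(url4))  # True: one CGI behind both
```

A scanner that treats formula (2) and formula (4) as unrelated targets would therefore test the same getNews CGI twice.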
Summary of the invention
The purpose of the embodiments of the present invention is to provide a uniform resource locator deduplication method and device, to solve the problem that URL rewriting reduces the efficiency of URL security vulnerability detection, causes performance loss on the website server, and increases operating cost.
An embodiment of the present invention provides a uniform resource locator deduplication method, comprising:
presetting a deduplication rule base according to the structure of uniform resource locators, where the deduplication rule base stores a plurality of deduplication rules, each deduplication rule corresponds to a different uniform resource locator structure, and each deduplication rule is provided with a rewrite mark indicating the rewritten segment parameters in the corresponding uniform resource locator;
obtaining the uniform resource locator data to be deduplicated from website access data;
matching the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the uniform resource locators; and
filtering the matched uniform resource locators that correspond to the same deduplication rule, and retaining one uniform resource locator for each deduplication rule.
An embodiment of the present invention further provides a deduplication rule generation method, comprising:
obtaining the uniform resource locator data under the domain name for which deduplication rules are to be generated;
clustering the obtained uniform resource locators;
splitting the clustered uniform resource locators into a domain name parameter part, a suffix part, a segment count part and a segment parameter part, to form a plurality of pieces of statistical information;
obtaining the pieces of statistical information that have the same structure after splitting; and
replacing the corresponding segment parameters whose values differ among statistical information of the same structure with the rewrite mark, and generating a new deduplication rule from the statistical information in which the rewrite mark has been substituted.
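The generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the clustering step is reduced to grouping by (host, suffix, segment count), the rule is a plain tuple, "*" is used as the rewrite mark, and the names `split_url` and `generate_rules` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

REWRITE_MARK = "*"


def split_url(url):
    """Split a URL into (host, suffix, segment count, segment parameters)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        # Separate "1.html" into the segment "1" and the suffix "html".
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return parts.netloc, suffix, len(segments), segments


def generate_rules(urls):
    """Build (host, suffix, sections, rule_path) rules from same-structure URLs."""
    groups = defaultdict(list)
    for url in urls:
        host, suffix, sections, segments = split_url(url)
        groups[(host, suffix, sections)].append(segments)

    rules = []
    for (host, suffix, sections), rows in groups.items():
        if len(rows) < 2:
            continue  # a single URL gives no evidence of varying segments
        # Compare the segments position by position; differing values get the mark.
        rule_path = [col[0] if len(set(col)) == 1 else REWRITE_MARK
                     for col in zip(*rows)]
        if REWRITE_MARK in rule_path:
            rules.append((host, suffix, sections, "/".join(rule_path)))
    return rules


rules = generate_rules([
    "http://www.qq.com/news/sports/20131120/1.html",
    "http://www.qq.com/news/science/20131121/2.html",
])
print(rules)  # [('www.qq.com', 'html', 4, 'news/*/*/*')]
```

A production version would need more evidence than two URLs before marking a segment as rewritten; the sketch only shows the position-wise comparison.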
An embodiment of the present invention provides another deduplication rule generation method, comprising:
obtaining an existing deduplication rule from a preset deduplication rule base, where the structure of the deduplication rule includes a domain name parameter part, a suffix part, a segment count part and a rewriting rule part;
obtaining a plurality of uniform resource locators under the domain name for which deduplication rules are to be generated;
matching the plurality of uniform resource locators under that domain name against the suffix part and the rewriting rule part of the existing deduplication rule; and
when the number of matched uniform resource locators is greater than a set threshold, replacing the domain name parameter part of the corresponding deduplication rule with the domain name for which deduplication rules are to be generated, and generating a new deduplication rule.
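A minimal sketch of this second generation method follows. The tuple layout, the "*" rewrite mark, the host `news.example.com`, and the function names are illustrative assumptions; the threshold value is chosen by the caller, since the text above does not fix one.

```python
from urllib.parse import urlsplit


def path_shape(url):
    """Return (host, suffix, segment parameters) for a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return parts.netloc, suffix, segments


def fits(url, new_host, suffix, rule_path):
    """Check a URL of the new domain against the suffix and rewriting rule parts."""
    host, url_suffix, segments = path_shape(url)
    pattern = rule_path.split("/")
    return (host == new_host
            and url_suffix == suffix
            and len(segments) == len(pattern)
            and all(p == "*" or p == s for p, s in zip(pattern, segments)))


def transplant_rule(rule, new_host, urls, threshold):
    """Copy an existing rule to new_host if more than `threshold` URLs fit it."""
    _, suffix, sections, rule_path = rule
    matched = sum(1 for u in urls if fits(u, new_host, suffix, rule_path))
    if matched > threshold:
        return (new_host, suffix, sections, rule_path)
    return None


existing = ("www.qq.com", "html", 4, "news/*/*/*")
candidates = [
    "http://news.example.com/news/sports/20131120/1.html",
    "http://news.example.com/news/science/20131121/2.html",
    "http://news.example.com/news/finance/20131122/3.html",
    "http://news.example.com/about.html",
]
print(transplant_rule(existing, "news.example.com", candidates, threshold=2))
```

The threshold guards against copying a rule on the strength of one coincidental match.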
An embodiment of the present invention provides a uniform resource locator deduplication device, comprising:
a deduplication rule base setting module, configured to preset a deduplication rule base according to the structure of uniform resource locators, where the deduplication rule base stores a plurality of deduplication rules, each deduplication rule corresponds to a different uniform resource locator structure, and each deduplication rule is provided with a rewrite mark indicating the rewritten segment parameters in the corresponding uniform resource locator;
a uniform resource locator capture module, configured to obtain the uniform resource locator data to be deduplicated from website access data;
a matching module, configured to match the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the uniform resource locators; and
a deduplication module, configured to filter the matched uniform resource locators that correspond to the same deduplication rule and to retain one uniform resource locator for each deduplication rule.
An embodiment of the present invention further provides a deduplication rule generation device, comprising:
a uniform resource locator acquisition module, configured to obtain the uniform resource locator data under the domain name for which deduplication rules are to be generated;
a clustering module, configured to cluster the obtained uniform resource locators;
a splitting module, configured to split the clustered uniform resource locators into a domain name parameter part, a suffix part, a segment count part and a segment parameter part, to form a plurality of pieces of statistical information;
a statistical information acquisition module, configured to obtain the pieces of statistical information that have the same structure after splitting; and
a segment parameter replacement module, configured to replace the corresponding segment parameters whose values differ among statistical information of the same structure with the rewrite mark, and to generate a new deduplication rule from the statistical information in which the rewrite mark has been substituted.
An embodiment of the present invention provides another deduplication rule generation device, comprising:
a deduplication rule acquisition module, configured to obtain an existing deduplication rule from a preset deduplication rule base, where the structure of the deduplication rule includes a domain name parameter part, a suffix part, a segment count part and a rewriting rule part;
a uniform resource locator acquisition module, configured to obtain a plurality of uniform resource locators under the domain name for which deduplication rules are to be generated;
a suffix and rewriting rule matching module, configured to match the plurality of uniform resource locators under that domain name against the suffix part and the rewriting rule part of the existing deduplication rule; and
a domain name parameter replacement module, configured to, when the number of matched uniform resource locators is greater than a set threshold, replace the domain name parameter part of the corresponding deduplication rule with the domain name for which deduplication rules are to be generated, and generate a new deduplication rule.
Compared with the prior art, the beneficial effects of the present invention are as follows: with the method and device of the embodiments, massive URL data can be filtered and deduplicated through deduplication rules, and the small number of dynamic CGIs behind the rewriting can be recovered from a large amount of URL data. This prevents the security vulnerability scanner from repeatedly scanning the same CGI during URL security vulnerability detection and so improves the efficiency of vulnerability detection.
Brief description of the drawings
Fig. 1 is a schematic diagram of the operating environment of the method and device of an embodiment of the present invention;
Fig. 2 is a structural diagram of the website server in Fig. 1;
Fig. 3 is a flow chart of a uniform resource locator deduplication method of an embodiment of the present invention;
Fig. 4 is a flow chart of matching a URL to be deduplicated against deduplication rules in an embodiment of the present invention;
Fig. 5 is a flow chart of the first deduplication rule generation method of an embodiment of the present invention;
Fig. 6 is another flow chart of the first deduplication rule generation method of an embodiment of the present invention;
Fig. 7 is a flow chart of the second deduplication rule generation method of an embodiment of the present invention;
Fig. 8 is another flow chart of the second deduplication rule generation method of an embodiment of the present invention;
Fig. 9 is a structural diagram of a uniform resource locator deduplication device of an embodiment of the present invention;
Fig. 10 is a structural diagram of a matching module of an embodiment of the present invention;
Fig. 11 is a structural diagram of the first deduplication rule generation device of an embodiment of the present invention;
Fig. 12 is another structural diagram of the first deduplication rule generation device of an embodiment of the present invention;
Fig. 13 is a structural diagram of a rewrite filtering module of an embodiment of the present invention;
Fig. 14 is a structural diagram of a verification module of an embodiment of the present invention;
Fig. 15 is a structural diagram of the second deduplication rule generation device of an embodiment of the present invention;
Fig. 16 is another structural diagram of the second deduplication rule generation device of an embodiment of the present invention.
Detailed description of the embodiments
The foregoing and other technical contents, features and effects of the present invention will be clearly presented in the following detailed description of preferred embodiments with reference to the drawings. Through the description of the specific embodiments, the technical means adopted by the invention to achieve its intended purpose, and their effects, can be understood more deeply and concretely. The accompanying drawings are provided for reference and illustration only and are not intended to limit the present invention.
The embodiments of the present invention provide a uniform resource locator deduplication method and device and a corresponding deduplication rule generation method and device, which deduplicate and filter rewritten URLs and generate the deduplication rules with which URLs are deduplicated. Refer to Fig. 1, a schematic diagram of the operating environment of the above methods and devices. One or more user terminals 100 can be connected to one or more website servers 200 (only one is shown in Fig. 1) through a network 300. The user terminal 100 may be a smart device with network capability, such as a tablet computer, mobile phone, e-reader, remote control, PC, laptop, in-vehicle device, network TV or wearable device. The network 300 may be, for example, the Internet, a local area network or an intranet.
Refer further to Fig. 2, a structural block diagram of one embodiment of the website server 200. As shown in Fig. 2, the website server 200 includes a memory 102, a storage controller 104, one or more processors 106 (only one is shown), a peripheral interface 108 and a network controller 112. The structure shown in Fig. 2 is only illustrative and does not limit the structure of the website server 200. For example, the website server 200 may include more or fewer components than shown in Fig. 2, or have a configuration different from that shown in Fig. 2.
The memory 102 can be used to store software programs and modules, such as the program instructions/modules corresponding to the uniform resource locator deduplication method and device of the embodiments of the present invention. The processor 106 runs the software programs and modules stored in the memory 102 to execute various functional applications and data processing, thereby implementing the above method.
The memory 102 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 102 may further include memory located remotely from the processor 106; such remote memory can be connected to the website server 200 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof. Access to the memory 102 by the processor 106 and other possible components can be carried out under the control of the storage controller 104.
The peripheral interface 108 couples various input/output devices to the processor 106. The processor 106 runs the software and instructions in the memory 102 and performs data processing. In some embodiments, the peripheral interface 108, the processor 106 and the storage controller 104 can be implemented in a single chip; in other examples, each can be implemented by a separate chip.
The network controller 112 is used to receive and transmit network signals, which may include wireless or wired signals. In one example, the network signal is a wired network signal, in which case the network controller 112 may include elements such as a processor, random access memory, a converter and a crystal oscillator.
The above software programs and modules include an operating system 122 and a website server module 224. The operating system 122 may be, for example, LINUX, UNIX or WINDOWS; it may include various software components and/or drivers for managing system tasks (such as memory management, storage device control and power management) and can communicate with various hardware or software components to provide the operating environment for other software components. The website server module 224 runs on top of the operating system 122, listens for web access requests from the network through the network services of the operating system 122, completes the corresponding data processing according to each request, and returns the resulting web page or data in another format to the user terminal 100. The website server module 224 may include, for example, dynamic web page scripts and a script interpreter. The script interpreter may be, for example, the Apache website server program, which processes dynamic web page scripts into a format the client can accept, such as Hypertext Markup Language (HTML) or Extensible Markup Language (XML).
Embodiment one
Refer to Fig. 3, a flow chart of a uniform resource locator deduplication method of an embodiment of the present invention, which includes the following steps:
S301: preset a deduplication rule base according to the structure of uniform resource locators. The rule base stores a plurality of deduplication rules, each deduplication rule corresponds to a different uniform resource locator structure, and each rule is provided with a rewrite mark indicating the rewritten segment parameters in the corresponding uniform resource locator.
S302: obtain the uniform resource locator data to be deduplicated from website access data.
S303: match the uniform resource locators to be deduplicated against the deduplication rules in the rule base according to the structure and segment parameters of the uniform resource locators.
S304: filter the matched uniform resource locators that correspond to the same deduplication rule, and retain one uniform resource locator for each deduplication rule.
According to the W3C (World Wide Web Consortium) standard, the structure of a URL can be decomposed into three main parts: the domain name parameter part (also called the "host part" in the embodiments), the path parameter part (also called the "path part"), and the query parameter part (also called the "query part"). For example, for the URL http://www.qq.com/nba/lbj.php?age=30&sex=1, the host part is "www.qq.com", the path part is "/nba/lbj.php", and the query part is "age=30&sex=1", where the combination of parameter names in the query part is "age&sex". URLs can also be divided into three classes by the suffix of the path part: first, URLs without a suffix; second, URLs with a suffix such as php, jsp, html or htm; third, URLs whose suffix is "/".
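The decomposition described above can be reproduced with Python's standard `urllib.parse` module. The sketch below is illustrative only; the function name `decompose`, the dictionary keys, and the convention of returning "" for no suffix and "/" for the third class are assumptions made for the example.

```python
from urllib.parse import urlsplit, parse_qsl


def decompose(url):
    """Split a URL into host, path and query parts and classify its suffix."""
    parts = urlsplit(url)
    names = [name for name, _ in parse_qsl(parts.query, keep_blank_values=True)]
    last = parts.path.rsplit("/", 1)[-1]  # text after the final "/"
    if parts.path.endswith("/"):
        suffix = "/"                      # third class: suffix is "/"
    elif "." in last:
        suffix = last.rsplit(".", 1)[1]   # second class: php, jsp, html, ...
    else:
        suffix = ""                       # first class: no suffix
    return {
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "param_names": "&".join(names),
        "suffix": suffix,
    }


info = decompose("http://www.qq.com/nba/lbj.php?age=30&sex=1")
print(info["host"], info["path"], info["query"], info["param_names"], info["suffix"])
# www.qq.com /nba/lbj.php age=30&sex=1 age&sex php
```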
In step S301, the deduplication rule is used in the embodiments to identify whether a URL has been rewritten. Building on the above W3C division of the structural features of non-rewritten URLs, and in view of the structural relationship of a URL before and after rewriting, the deduplication rule is preferably set to include a domain name parameter part, a suffix part, a segment count part and a rewriting rule part. For ease of description, this embodiment writes a deduplication rule as {host, suffix, sections, rule_path}. The rule thus consists of four parts: host (domain name parameter part), suffix (suffix part), sections (segment count part) and rule_path (rewriting rule part).
The host part corresponds to the host of the rewritten URL. The suffix part corresponds to the suffix of the path part of the rewritten URL. The sections part corresponds to the number of segments obtained by splitting the path of the rewritten URL on the "/" symbol; each resulting string is called a segment parameter. rule_path is the rewriting rule of the path part of the rewritten URL: a rewritten segment parameter is replaced with the rewrite mark (the embodiments use "*" as the rewrite mark), indicating that the segment is a rewritten segment parameter, while segment parameters that were not rewritten are kept unchanged in rule_path.
Take the URL of formula (2), http://www.qq.com/news/sports/20131120/1.html, as an example. In this URL, the parameter values "sports", "20131120" and "1" have all been rewritten from the parameters of the dynamic CGI, so they are replaced with "*"; the parameter "news" was not rewritten, so it is kept in the rule. The deduplication rule for this URL can therefore be expressed as {www.qq.com, html, 4, news/*/*/*}.
A deduplication rule thus corresponds to the structure of a URL, and the rewrite marks in it indicate whether the URL has been rewritten.
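As an illustration, the four-part rule can be represented as a small record. In the sketch below the class name `DedupRule`, the helper `rule_for`, and the idea of passing the rewritten segment positions explicitly are assumptions made for the example; the patent does not prescribe a data type.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class DedupRule(NamedTuple):
    host: str        # domain name parameter part
    suffix: str      # suffix part
    sections: int    # segment count part
    rule_path: str   # rewriting rule part, with "*" as the rewrite mark


def rule_for(url, rewritten_positions):
    """Build the rule for a rewritten URL, given the indexes of rewritten segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    marked = ["*" if i in rewritten_positions else s
              for i, s in enumerate(segments)]
    return DedupRule(parts.netloc, suffix, len(segments), "/".join(marked))


# Segments 1, 2 and 3 ("sports", "20131120", "1") carry rewritten CGI parameters.
rule = rule_for("http://www.qq.com/news/sports/20131120/1.html", {1, 2, 3})
print(rule)
# DedupRule(host='www.qq.com', suffix='html', sections=4, rule_path='news/*/*/*')
```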
In step S302, the URL data of the website server can mainly be captured in two ways: first, obtaining the URL data to be deduplicated with a web crawler; second, filtering the URL data to be deduplicated out of the raw logs of web page accesses.
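The second approach, extracting URLs from raw access logs, might look like the sketch below. The log layout is an assumption (a request line of the form "GET /path HTTP/1.1" inside double quotes, as in common web server access logs), as are the function name and the choice to prefix a caller-supplied host; the patent does not specify a log format.

```python
import re

# Assumed request-line layout inside each log line: "METHOD /path HTTP/x.y"
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def urls_from_log(lines, host):
    """Collect the distinct requested URLs from access-log lines, in first-seen order."""
    seen, urls = set(), []
    for line in lines:
        m = REQUEST.search(line)
        if m is None:
            continue  # skip lines that carry no recognisable request
        url = f"http://{host}{m.group(1)}"
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = [
    '1.2.3.4 - - [20/Nov/2013:10:00:00 +0800] "GET /news/sports/20131120/1.html HTTP/1.1" 200 512',
    '1.2.3.4 - - [20/Nov/2013:10:00:05 +0800] "GET /news/sports/20131120/1.html HTTP/1.1" 200 512',
    '5.6.7.8 - - [21/Nov/2013:09:30:00 +0800] "GET /news/science/20131121/2.html HTTP/1.1" 200 640',
    'malformed line',
]
print(urls_from_log(sample, "www.qq.com"))
```

This removes only exact duplicates; the rule-based matching in step S303 is what collapses different URLs served by the same CGI.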
In step S303, when the URLs to be deduplicated are matched against the deduplication rules, each URL can be matched one by one against the rules in the rule base. If a URL matches a deduplication rule, the URL has been rewritten and may need to be deduplicated and filtered. Judging whether a URL has been rewritten also requires checking whether the segment parameters of the rule_path carry a rewrite mark; the URL is determined to have been rewritten only if a rewrite mark is present.
Furthermore, URLs that match no deduplication rule are filtered out while URLs are matched against the rules. Referring to Fig. 4, the matching may specifically include the following steps:
S3031: according to the domain name parameter part of the deduplication rules, filter out the URLs to be deduplicated whose domain name has no corresponding rule. For example, if the domain name of a URL is "www.sina.com" and no deduplication rule in the rule base has www.sina.com as its host part, the URL is filtered out.
S3032: according to the suffix part and the segment count part of the deduplication rules, filter out the URLs to be deduplicated whose structure corresponds to none of the deduplication rules. That is, if the suffix part and the number of segment parameters of a URL do not correspond to the suffix and sections parts of any deduplication rule in the rule base, the URL is filtered out.
S3033: according to the rewriting rule part of the deduplication rules, filter out the URLs to be deduplicated that have not been rewritten.
To further illustrate how a URL is matched against the deduplication rules, this embodiment discloses a matching algorithm, shown in Table 1:
Table 1
If the match succeeds, the matched deduplication rule is returned, which means the URL is a rewritten URL; otherwise an empty string is returned, which means the URL is not a rewritten URL. Steps 1 and 2 describe the input and output. Step 3 marks the start of the algorithm. Step 4 judges whether there is a deduplication rule with the same domain name parameter as the input URL; if not, the input URL is filtered out. Steps 5 and 6 judge whether there is a deduplication rule with the same suffix part and segment count part as the input URL; if not, the input URL is filtered out. Steps 8, 9 and 10 indicate that, once at least one deduplication rule has been found whose domain name parameter part, suffix part and segment count part are the same as those of the input URL, the input URL is checked in turn against the rewriting rule part of each of these rules. Step 11 checks, for each rule found, whether the number of segments in its rewriting rule part equals the value of its segment count part. Steps 13, 14 and 15 indicate that if a deduplication rule corresponding to the input URL is found, that rule is output; if none can be found, the input URL is filtered out.
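Since Table 1 itself is not reproduced here, the following is a minimal sketch reconstructed from the step-by-step description above, not the original algorithm listing. Rules are assumed to be (host, suffix, sections, rule_path) tuples with "*" as the rewrite mark, and the function name `match_rule` is hypothetical.

```python
from urllib.parse import urlsplit


def match_rule(url, rules):
    """Return the deduplication rule matched by url, or "" if none matches."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)

    # Filter by the domain name parameter part (S3031).
    same_host = [r for r in rules if r[0] == parts.netloc]
    if not same_host:
        return ""

    # Filter by the suffix part and the segment count part (S3032).
    same_shape = [r for r in same_host if r[1] == suffix and r[2] == len(segments)]
    if not same_shape:
        return ""

    # Compare against the rewriting rule part of each remaining rule (S3033).
    for rule in same_shape:
        pattern = rule[3].split("/")
        if len(pattern) != rule[2]:
            continue  # malformed rule: rule_path length disagrees with sections
        if "*" not in pattern:
            continue  # no rewrite mark, so the rule cannot show a rewrite
        if all(p == "*" or p == s for p, s in zip(pattern, segments)):
            return rule
    return ""


rule_base = [("www.qq.com", "html", 4, "news/*/*/*")]
print(match_rule("http://www.qq.com/news/sports/20131120/1.html", rule_base))
print(repr(match_rule("http://www.sina.com/news/sports/20131120/1.html", rule_base)))
```

Filtering on host and shape first keeps the per-segment comparison for the few rules that can still match.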
In step S304, after all the URLs to be deduplicated have been matched, the URLs corresponding to each matched deduplication rule are collected. Because the URLs corresponding to the same deduplication rule are rewritten from intermediate URLs generated by the same CGI, the URLs corresponding to the same rule are filtered and only one URL is retained for each rule, for subsequent vulnerability detection or other data analysis.
For example, the two URLs "http://www.qq.com/news/sports/20131120/1.html" and "http://www.qq.com/news/science/20131121/2.html" both match the deduplication rule {www.qq.com, html, 4, news/*/*/*}. This shows that both URLs have been rewritten, that is, the URLs before rewriting were generated by the same CGI, so either one of the two URLs can be filtered out.
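The filtering in step S304 can be sketched as keeping the first URL seen for each matched rule. The tuple rule layout, the "*" rewrite mark and the function names are illustrative assumptions; which URL is retained is arbitrary, and the sketch keeps the first one.

```python
from urllib.parse import urlsplit


def matches(url, rule):
    """Check whether a URL has the structure described by a deduplication rule."""
    host, suffix, sections, rule_path = rule
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    url_suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], url_suffix = segments[-1].rsplit(".", 1)
    pattern = rule_path.split("/")
    return (parts.netloc == host
            and url_suffix == suffix
            and len(segments) == sections == len(pattern)
            and all(p == "*" or p == s for p, s in zip(pattern, segments)))


def deduplicate(urls, rules):
    """Return {rule: representative URL}; URLs matching no rule are dropped."""
    kept = {}
    for url in urls:
        rule = next((r for r in rules if matches(url, r)), None)
        if rule is not None and rule not in kept:
            kept[rule] = url  # first URL seen for this rule is retained
    return kept


rule_base = [("www.qq.com", "html", 4, "news/*/*/*")]
result = deduplicate(
    [
        "http://www.qq.com/news/sports/20131120/1.html",
        "http://www.qq.com/news/science/20131121/2.html",
    ],
    rule_base,
)
print(list(result.values()))  # ['http://www.qq.com/news/sports/20131120/1.html']
```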
With the method of this embodiment, massive URL data can be filtered and deduplicated through deduplication rules, and the small number of dynamic CGIs behind the rewriting can be recovered from a large amount of URL data. This prevents the security vulnerability scanner from repeatedly scanning the same CGI during URL security vulnerability detection and so improves the efficiency of vulnerability detection.
In addition, when the traffic of web access requests is analyzed, the page request traffic that is counted is the traffic of the rewritten URLs, and the traffic of the real CGI cannot be counted. In a defense system against website CC attacks (Challenge Collapsar, a common website attack in which the attacker controls a number of hosts to keep sending large numbers of packets to the target server until its resources are exhausted and it crashes), what must be counted and audited is the traffic of the CGI behind the rewritten URLs. The method of this embodiment can therefore also improve the efficiency of data analysis for CC attacks.
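The traffic-analysis use can be sketched by counting requests per deduplication rule, so that all rewritten URLs of one CGI add up to a single figure. In this sketch each rule is compiled to a regular expression; that compilation, the tuple rule layout and the function names are assumptions for illustration, and the regular expression is a simplification that does not enforce every structural check described earlier.

```python
import re
from collections import Counter


def rule_regex(rule):
    """Compile a (host, suffix, sections, rule_path) rule into a URL regex."""
    host, suffix, _sections, rule_path = rule
    segs = [r"[^/?#]+" if s == "*" else re.escape(s) for s in rule_path.split("/")]
    tail = r"\." + re.escape(suffix) if suffix else ""
    return re.compile(
        r"^https?://" + re.escape(host) + "/" + "/".join(segs) + tail + r"(?:\?.*)?$"
    )


def traffic_by_rule(requests, rules):
    """Count requests per rule; requests matching no rule are counted under None."""
    compiled = [(rule, rule_regex(rule)) for rule in rules]
    counts = Counter()
    for url in requests:
        rule = next((r for r, rx in compiled if rx.match(url)), None)
        counts[rule] += 1
    return counts


news_rule = ("www.qq.com", "html", 4, "news/*/*/*")
counts = traffic_by_rule(
    [
        "http://www.qq.com/news/sports/20131120/1.html",
        "http://www.qq.com/news/science/20131121/2.html",
        "http://www.qq.com/news/sports/20131120/1.html",
        "http://www.qq.com/index.html",
    ],
    [news_rule],
)
print(counts[news_rule], counts[None])  # 3 1
```

A sudden rise in the count for one rule then points at the CGI under load, however many distinct rewritten URLs the requests used.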
The embodiments of the present invention also propose methods for generating duplicate removal rules. Rule generation and URL duplicate removal can be carried out in an iterative loop. For example, during URL duplicate removal matching, the URLs that are filtered out without having been deduplicated can serve as the basis for generating duplicate removal rules, which improves the duplicate removal rule base and expands the range of URL duplicate removal. The embodiments disclose two methods of generating duplicate removal rules: one builds rules from the data structure of URLs, and the other applies existing duplicate removal rules to a new domain name to form new rules. Both are described in detail below:
Embodiment two
Referring to Fig. 5, which is a flow chart of the first duplicate removal rule generating method of the embodiment of the present invention, the method comprises the following steps:
S501: obtain the uniform resource locator data under the domain name for which duplicate removal rules are to be generated. That is, URLs with the same host part are read from the data center of the website server, and the URLs read must not have been rewritten.
S502: cluster the obtained uniform resource locators. Specifically, the URLs read in step S501 are clustered according to length and character lexicographic order. Clustering the URLs speeds up subsequent processing and improves the efficiency of duplicate removal rule generation.
The length refers to the number of sections of a URL, in other words the number of sections obtained by splitting the URL on the "/" symbol. For example, "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.qq.com/news/getNews?type=science&date=20131121&id=2" each contain two "/" symbols after the domain name, so the two URLs are considered to have the same length.
The character lexicographic order refers to the dictionary order of the characters in a URL. For example, the domain name parts of "http://www.aa.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.aa.com/news/getNews?type=science&date=20131121&id=2" both contain "aa", so these two URLs can be placed in the same class. Likewise, the domain name parts of "http://www.bb.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.bb.com/news/getNews?type=science&date=20131121&id=2" both contain "bb", so these two URLs can be placed in the same class.
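One way to read the clustering criteria is sketched below. It is an assumption for illustration only: URLs are sorted lexicographically and then grouped by host and by the number of "/"-separated path sections. The names `segment_count` and `cluster` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def segment_count(url):
    """Number of non-empty path sections, i.e. the URL 'length' in the text."""
    return len([s for s in urlsplit(url).path.split("/") if s])

def cluster(urls):
    """Group lexicographically sorted URLs by (host, section count)."""
    clusters = defaultdict(list)
    for url in sorted(urls):
        clusters[(urlsplit(url).hostname, segment_count(url))].append(url)
    return dict(clusters)
```

Sorting first means that URLs from the same host end up adjacent, so each cluster is filled in a single pass.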
S503: split the clustered uniform resource locators into a domain name parameter part, a suffix part, a division number part, and a segmentation parameter part, forming multiple pieces of statistical information.
Specifically, this step performs structural analysis on the clustered URLs and forms statistical information of the form {host, suffix, sections, index, path}. The statistical information can be stored in the memory of the website server or in the system cache. Here, index gives the index positions of the parts of path.
For example, the statistical information of the URL "http://www.qq.com/news/sports/20131120/1.html" is {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1}. The index indicates the storage positions of the parts of path, that is, "0/1/2/3" and "news/sports/20131120/1" correspond one to one.
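The structural analysis of step S503 can be sketched as a small parsing function. The function name `statistic` and the tuple representation are assumptions made for illustration; only the five fields {host, suffix, sections, index, path} come from the text.

```python
from urllib.parse import urlsplit

def statistic(url):
    """Build (host, suffix, sections, index, path) for one URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        # Move the file suffix out of the last path segment.
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    # index lists the position of each path segment, joined with "/".
    index = "/".join(str(i) for i in range(len(segments)))
    return (parts.hostname, suffix, len(segments), index, "/".join(segments))
```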
S504: obtain the pieces of statistical information that have the same structure after splitting.
For example, the statistical information of the URL "http://www.qq.com/news/sports/20131120/1.html" is {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1}, and the statistical information of the URL "http://www.qq.com/news/science/20131121/2.html" is {www.qq.com, html, 4, 0/1/2/3, news/science/20131121/2}. The statistical information of these two URLs is therefore considered to have the same structure.
S505: replace the corresponding segmentation parameter values that differ among statistical information of the same structure with a rewriting mark, and generate a new duplicate removal rule from the statistical information in which the rewriting mark has been substituted.
For example, for the two pieces of statistical information {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1} and {www.qq.com, html, 4, 0/1/2/3, news/science/20131121/2}, each segmentation parameter of the path part is compared position by position according to the index values. If the values of corresponding segmentation parameters differ, they are replaced with the rewriting mark. After replacement, the two pieces of statistical information yield {www.qq.com, html, 4, 0/1/2/3, news/*/*/*}. The index part is then removed from the statistical information, forming the new duplicate removal rule {www.qq.com, html, 4, news/*/*/*}. The newly generated duplicate removal rule is then stored in the duplicate removal rule base for lookup during URL duplicate removal matching.
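The replacement step of S505 can be sketched as follows, assuming statistical information is held as a tuple (host, suffix, sections, index, path) and "*" is used as the rewriting mark. The function name `generate_rule` is hypothetical.

```python
def generate_rule(stat_a, stat_b):
    """Merge two same-structure statistics into a rule, or return None."""
    host_a, suffix_a, sections_a, index_a, path_a = stat_a
    host_b, suffix_b, sections_b, index_b, path_b = stat_b
    # Only statistics with identical host, suffix, count and index are comparable.
    if (host_a, suffix_a, sections_a, index_a) != (host_b, suffix_b, sections_b, index_b):
        return None
    merged = [
        a if a == b else "*"  # differing values become the rewriting mark
        for a, b in zip(path_a.split("/"), path_b.split("/"))
    ]
    # The index part is dropped when the rule is formed.
    return (host_a, suffix_a, sections_a, "/".join(merged))
```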
Referring to Fig. 6, which is another flow chart of the first duplicate removal rule generating method of the embodiment of the present invention, the method comprises the following steps:
S601: obtain the uniform resource locator data under the domain name for which duplicate removal rules are to be generated.
S602: filter out the rewritten uniform resource locators from the obtained uniform resource locator data. When statistical information is generated, the structure, segmentation parameters, and parameter values of each URL are analyzed, but the segmentation parameters of a rewritten URL cannot be identified. To ensure that processing does not go wrong, this step further filters the obtained URLs and removes those that have been rewritten.
Specifically, the URLs obtained in step S601 can be matched against the existing rules in the duplicate removal rule base using the matching algorithm described in embodiment one (the duplicate removal matching algorithm of Table 1). If a URL matches a corresponding duplicate removal rule, the URL has been rewritten and is filtered out.
S603: cluster the obtained uniform resource locators according to length and character lexicographic order.
S604: split the clustered uniform resource locators into a domain name parameter part, a suffix part, a division number part, and a segmentation parameter part, forming multiple pieces of statistical information.
S605: obtain the pieces of statistical information that have the same structure after splitting.
S606: replace the corresponding segmentation parameter values that differ among statistical information of the same structure with a rewriting mark, and generate a new duplicate removal rule from the statistical information in which the rewriting mark has been substituted.
S607: verify the new duplicate removal rule by matching it against all uniform resource locators of the same structure under the same domain name. The purpose of this step is to further verify the usability of the newly generated duplicate removal rule.
Specifically, verification can again use the matching algorithm described in embodiment one (the duplicate removal matching algorithm of Table 1). First, obtain the URLs that have the same domain name parameter, suffix, and division number as the new duplicate removal rule. Second, match the new duplicate removal rule against the obtained URLs. When the number of matched URLs exceeds a set threshold, the newly generated rule can be used to identify whether a URL has been rewritten, and verification passes.
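The verification in S607 can be sketched as a count against a threshold. This is a minimal sketch under assumptions: statistics are tuples (host, suffix, sections, index, path), "*" is the rewriting mark, and the names `path_matches` and `verify` are hypothetical. It is not the Table 1 algorithm.

```python
def path_matches(rule_path, path):
    """Compare a rule pattern with a path, treating '*' as a wildcard segment."""
    pattern, segments = rule_path.split("/"), path.split("/")
    return len(pattern) == len(segments) and all(
        p == "*" or p == s for p, s in zip(pattern, segments)
    )

def verify(rule, stats, threshold):
    """Pass when more than `threshold` same-structure URLs match the rule."""
    host, suffix, sections, rule_path = rule
    # Step 1: select statistics with the same host, suffix and section count.
    candidates = [s for s in stats if (s[0], s[1], s[2]) == (host, suffix, sections)]
    # Step 2: count how many of them match the rewriting rule part.
    hits = sum(1 for s in candidates if path_matches(rule_path, s[4]))
    return hits > threshold
```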
S608: set a pending mark on the new duplicate removal rule. To further guard against processing errors that would affect the subsequent automatic generation of duplicate removal rules and the automatic running of the URL duplicate removal main control program, each newly generated duplicate removal rule is given a pending mark so that it can be checked manually. Only after the manual check passes is the pending mark removed and the duplicate removal rule formally put into use.
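The pending mark can be pictured as a flag on each rule that keeps it out of use until a reviewer clears it. The class `Rule` and the helpers `approve` and `active_rules` below are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    host: str
    suffix: str
    sections: int
    rule_path: str
    pending: bool = True  # newly generated rules start with the pending mark set

def approve(rule):
    """Clear the pending mark after a manual check has passed."""
    rule.pending = False
    return rule

def active_rules(rules):
    """Only rules without the pending mark take part in duplicate removal."""
    return [r for r in rules if not r.pending]
```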
The method of this embodiment can generate duplicate removal rules automatically and further improve the duplicate removal rule base, so that the duplicate removal method of embodiment one deduplicates URLs more efficiently and accurately. This in turn further improves the detection efficiency for URL security vulnerabilities and the data analysis efficiency when counting CGI traffic.
Embodiment three
Referring to Fig. 7, which is a flow chart of the second duplicate removal rule generating method of the embodiment of the present invention, the method comprises the following steps:
S701: obtain the existing duplicate removal rules in the preset duplicate removal rule base. The structure of a duplicate removal rule includes a domain name parameter part, a suffix part, a division number part, and a rewriting rule part.
S702: obtain multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated. The number of URLs obtained must not be too small; in general it should not be less than 5000.
S703: match the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated, using the suffix part and rewriting rule part of an existing duplicate removal rule.
S704: when the number of matched uniform resource locators is greater than a set threshold, substitute the domain name for which the rule is to be generated for the domain name parameter part of the corresponding duplicate removal rule, generating a new duplicate removal rule.
For example, suppose {www.qq.com, html, 4, news/*/*/*} is an existing rule in the duplicate removal rule base. Its suffix part and rewriting rule part are "html" and "news/*/*/*" respectively. These two parts are matched against the URLs for which a duplicate removal rule is to be generated; the matching can use the matching algorithm described in embodiment one (the duplicate removal matching algorithm of Table 1). If the number of matched URLs exceeds the set threshold (for example, 100), the matched URLs have been rewritten, which also shows that "html" and "news/*/*/*" are suitable for deduplicating these URLs. The domain name of the URLs to be deduplicated is then substituted directly for the domain name parameter part of the corresponding duplicate removal rule. If the domain name of the URLs to be deduplicated is "www.sina.com", the new duplicate removal rule generated after replacement is {www.sina.com, html, 4, news/*/*/*}.
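Steps S703 and S704 can be sketched as follows. The sketch assumes a rule is a tuple (host, suffix, sections, rule_path), that "*" is the rewriting mark, and that matching uses only the suffix part and rewriting rule part. The names `split_path`, `fits`, and `transplant` are hypothetical, and the matching shown is not the Table 1 algorithm.

```python
from urllib.parse import urlsplit

def split_path(url):
    """Return (suffix, path segments) of a URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return suffix, segments

def fits(rule, url):
    """Match a URL using only the rule's suffix and rewriting rule parts."""
    _, suffix, _, rule_path = rule
    u_suffix, segments = split_path(url)
    pattern = rule_path.split("/")
    return (
        u_suffix == suffix
        and len(pattern) == len(segments)
        and all(p == "*" or p == s for p, s in zip(pattern, segments))
    )

def transplant(rule, new_host, urls, threshold):
    """Create a rule for new_host when more than `threshold` URLs match."""
    hits = sum(1 for url in urls if fits(rule, url))
    if hits > threshold:
        return (new_host,) + tuple(rule[1:])
    return None
```

The source suggests a threshold such as 100 and at least 5000 sample URLs; much smaller numbers are enough to exercise the sketch.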
Referring to Fig. 8, which is another flow chart of the second duplicate removal rule generating method of the embodiment of the present invention, the method comprises the following steps:
S801: obtain the existing duplicate removal rules in the preset duplicate removal rule base. The structure of a duplicate removal rule includes a domain name parameter part, a suffix part, a division number part, and a rewriting rule part.
S802: among the existing duplicate removal rules, group the rules whose suffix part and rewriting rule part are identical, and sort the groups by the number of different domain names.
S803: according to the sorting result, obtain the suffix parts and rewriting rule parts of the set number of duplicate removal rules that have the most different domain names.
The number of duplicate removal rules in the duplicate removal rule base is usually large. If every existing duplicate removal rule were matched once, the efficiency of rule generation would inevitably be very low. To improve efficiency, step S802 first sorts the existing duplicate removal rules that share the same suffix part and rule_path part, and step S803 then picks, from the sorting result, the set number of duplicate removal rules with the most different domain names.
For example, it is assumed that including 6 duplicate removal rules in duplicate removal rule base:
{ www.aa.com, html, 4, news/*/*/* }
{ www.bb.com, html, 4, news/*/*/* }
{ www.cc.com, html, 4, news/*/*/* }
{ www.aa.com, html, 3, news/*/* }
{ www.bb.com, html, 3, news/*/* }
{ www.aa.com, html, 3, news/* }
After sorting, 3 different domain names contain "html" and "news/*/*/*", namely "www.aa.com", "www.bb.com", and "www.cc.com"; 2 different domain names contain "html" and "news/*/*", namely "www.aa.com" and "www.bb.com"; and 1 domain name contains "html" and "news/*", namely "www.aa.com". Now suppose two groups of suffix part and rewriting rule part are to be chosen. According to the sorting result, "html" with "news/*/*/*" and "html" with "news/*/*" correspond to the most different domain names, so these two groups of suffix part and rewriting rule part are extracted. "html" with "news/*" corresponds to only one domain name, which shows that it applies to few domain names, that is, its generality is low, so it is filtered out.
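The sorting and selection of steps S802 and S803 can be sketched as a count of distinct domain names per (suffix, rewriting rule) pair. The function name `top_shapes` is hypothetical, and rules are assumed to be tuples (host, suffix, sections, rule_path).

```python
from collections import defaultdict

def top_shapes(rules, n):
    """Return the n (suffix, rule_path) pairs shared by the most domain names."""
    domains = defaultdict(set)
    for host, suffix, _sections, rule_path in rules:
        domains[(suffix, rule_path)].add(host)
    # Sort pairs by the number of distinct domain names, largest first.
    ranked = sorted(domains, key=lambda key: len(domains[key]), reverse=True)
    return ranked[:n]
```

Applied to the six rules of the example with n = 2, this selects "html" with "news/*/*/*" and "html" with "news/*/*".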
S804: obtain multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated.
S805: match the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated, using the suffix parts and rewriting rule parts of the obtained set number of duplicate removal rules with the most different domain names. The matching can likewise use the matching algorithm described in embodiment one (the duplicate removal matching algorithm of Table 1).
S806: when the number of matched uniform resource locators is greater than a set threshold, substitute the domain name for which the rule is to be generated for the domain name parameter part of the corresponding duplicate removal rule, generating a new duplicate removal rule.
The method of this embodiment can generate duplicate removal rules automatically and further improve the duplicate removal rule base, so that the duplicate removal method of embodiment one deduplicates URLs more efficiently and accurately. This in turn further improves the detection efficiency for URL security vulnerabilities and the data analysis efficiency when counting CGI traffic.
Embodiment four
Corresponding to the method for previous embodiment one, the embodiment of the present invention also proposes that a kind of uniform resource locator goes to reset
It sets, refers to Fig. 9, which includes: duplicate removal rule base setup module 901, uniform resource locator handling module 902, matching
Module 903 and deduplication module 904.
The duplicate removal rule base setup module 901 is used to preset a duplicate removal rule base according to the structure of uniform resource locators. Multiple duplicate removal rules are stored in the duplicate removal rule base, each duplicate removal rule corresponds to a different uniform resource locator structure, and each duplicate removal rule is provided with a rewriting mark that indicates the rewritten segmentation parameters in the corresponding uniform resource locator. The structure of a duplicate removal rule may include a domain name parameter part, a suffix part, a division number part, and a rewriting rule part. The rewriting mark may be set in the rewriting rule part.
The uniform resource locator grabbing module 902 is used to obtain, from website visit data, the uniform resource locator data to be deduplicated. The uniform resource locator grabbing module 902 obtains the uniform resource locator data to be deduplicated by means of a web crawler, or from the original logs of web page access.
The matching module 903 is used to match the uniform resource locators to be deduplicated with the duplicate removal rules in the duplicate removal rule base, according to the structure and segmentation parameters of the uniform resource locators.
The deduplication module 904 is used to filter the uniform resource locators that the matching module 903 has matched to the same duplicate removal rule, retaining one uniform resource locator for each duplicate removal rule.
Furthermore, while matching URLs with duplicate removal rules, the matching module 903 filters out the URLs that match no duplicate removal rule. Referring to Figure 10, the matching module 903 may further include: a first filter unit 9031, a second filter unit 9032, and a third filter unit 9033.
The first filter unit 9031 is used to filter out, according to the domain name parameter part of the duplicate removal rules, those uniform resource locators to be deduplicated whose domain name does not correspond.
The second filter unit 9032 is used to filter out, according to the suffix part and division number part of the duplicate removal rules, those uniform resource locators to be deduplicated whose structure corresponds to none of the duplicate removal rules.
The third filter unit 9033 is used to filter out, according to the rewriting rule part of the duplicate removal rules, those uniform resource locators to be deduplicated that have not been rewritten.
With the device of this embodiment, massive URL data can be filtered and deduplicated by means of duplicate removal rules, so that the very small number of dynamic CGIs behind the rewritten URLs is recovered from a large volume of URL data. This prevents a security scanner from repeatedly scanning the same CGI during URL security vulnerability detection, and thereby improves the detection efficiency for security vulnerabilities.
Embodiment five
Corresponding to embodiment two above, the embodiment of the present invention also proposes a first duplicate removal rule generating device. Referring to Figure 11, the device includes: a uniform resource locator obtaining module 1101, a cluster module 1102, a segmentation module 1103, a statistical information obtaining module 1104, and a segmentation parameter replacement module 1105.
The uniform resource locator obtaining module 1101 is used to obtain the uniform resource locator data under the domain name for which duplicate removal rules are to be generated.
The cluster module 1102 is used to cluster the obtained uniform resource locators. The cluster module 1102 can cluster the obtained uniform resource locators according to length and character lexicographic order.
The segmentation module 1103 is used to split the clustered uniform resource locators into a domain name parameter part, a suffix part, a division number part, and a segmentation parameter part, forming multiple pieces of statistical information.
The statistical information obtaining module 1104 is used to obtain the pieces of statistical information that have the same structure after splitting.
The segmentation parameter replacement module 1105 is used to replace the corresponding segmentation parameter values that differ among statistical information of the same structure with a rewriting mark, and to generate a new duplicate removal rule from the statistical information in which the rewriting mark has been substituted.
Referring to Figure 12, which is another structure chart of the first duplicate removal rule generating device of the embodiment of the present invention, the device includes, in addition to the uniform resource locator obtaining module 1101, the cluster module 1102, the segmentation module 1103, the statistical information obtaining module 1104, and the segmentation parameter replacement module 1105: a rewriting filtering module 1106, an authentication module 1107, and an audit mark setting module 1108.
The rewriting filtering module 1106 is used to filter out the rewritten uniform resource locators from the uniform resource locator data obtained by the uniform resource locator obtaining module 1101, after that module has obtained the uniform resource locator data under the domain name for which duplicate removal rules are to be generated.
The authentication module 1107 is used to verify the new duplicate removal rule, after the segmentation parameter replacement module 1105 has generated it, by matching it against all uniform resource locators of the same structure under the same domain name.
The audit mark setting module 1108 is used to set a pending mark on the new duplicate removal rule after the segmentation parameter replacement module 1105 has generated it.
Referring to Figure 13, which is a structure chart of a rewriting filtering module of the embodiment of the present invention, the rewriting filtering module 1106 includes: a duplicate removal rule matching unit 11061 and a uniform resource locator data filter unit 11062.
The duplicate removal rule matching unit 11061 is used to match the uniform resource locator data obtained by the uniform resource locator obtaining module 1101 with the existing duplicate removal rules.
The filter unit 11062 is used to filter out the uniform resource locators that the duplicate removal rule matching unit 11061 has matched to a corresponding duplicate removal rule.
Referring to Figure 14, which is a structure chart of an authentication module of the embodiment of the present invention, the authentication module includes: a selection unit 11071 and a matching judgement unit 11072.
The selection unit 11071 is used to obtain all uniform resource locators that have the same domain name parameter, suffix, and division number as the new duplicate removal rule.
The matching judgement unit 11072 is used to match the new duplicate removal rule against the uniform resource locators obtained by the selection unit 11071. When the number of matched uniform resource locators exceeds a set threshold, verification passes.
The device of this embodiment can generate duplicate removal rules automatically and further improve the duplicate removal rule base, so that the duplicate removal method of embodiment one and the duplicate removal device of embodiment four deduplicate URLs more efficiently and accurately. This in turn further improves the detection efficiency for URL security vulnerabilities and the data analysis efficiency when counting CGI traffic.
Embodiment six
Corresponding to embodiment three above, the embodiment of the present invention also proposes a second duplicate removal rule generating device. Referring to Figure 15, the device includes: a duplicate removal rule acquisition module 1501, a uniform resource locator obtaining module 1502, a suffix and rewriting rule matching module 1503, and a domain name parameter replacement module 1504.
The duplicate removal rule acquisition module 1501 is used to obtain the existing duplicate removal rules in the preset duplicate removal rule base. The structure of a duplicate removal rule includes a domain name parameter part, a suffix part, a division number part, and a rewriting rule part.
The uniform resource locator obtaining module 1502 is used to obtain multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated.
The suffix and rewriting rule matching module 1503 is used to match the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated, using the suffix part and rewriting rule part of an existing duplicate removal rule.
The domain name parameter replacement module 1504 is used, when the number of matched uniform resource locators is greater than a set threshold, to substitute the domain name for which the rule is to be generated for the domain name parameter part of the corresponding duplicate removal rule, generating a new duplicate removal rule.
Referring to Figure 16, which is another structure chart of the second duplicate removal rule generating device of the embodiment of the present invention, the device includes, in addition to the duplicate removal rule acquisition module 1501, the uniform resource locator obtaining module 1502, the suffix and rewriting rule matching module 1503, and the domain name parameter replacement module 1504: a sorting module 1505 and a duplicate removal rule screening module 1506.
The sorting module 1505 is used to sort, by the number of different domain names, those existing duplicate removal rules obtained by the duplicate removal rule acquisition module 1501 whose suffix part and rewriting rule part are identical.
The duplicate removal rule screening module 1506 is used to obtain, according to the sorting result of the sorting module 1505, the suffix parts and rewriting rule parts of the set number of duplicate removal rules that have the most different domain names.
The suffix and rewriting rule matching module 1503 matches the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated, using the suffix parts and rewriting rule parts of the obtained set number of duplicate removal rules with the most different domain names.
The device of this embodiment can generate duplicate removal rules automatically and further improve the duplicate removal rule base, so that the duplicate removal method of embodiment one and the duplicate removal device of embodiment four deduplicate URLs more efficiently and accurately. This in turn further improves the detection efficiency for URL security vulnerabilities and the data analysis efficiency when counting CGI traffic.
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of the present invention can be implemented by hardware, or by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the embodiments of the present invention can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution, still falls within the scope of the technical solution of the present invention.
Claims (28)
1. A uniform resource locator duplicate removal method, characterized by comprising:
presetting a duplicate removal rule base according to the structure of uniform resource locators, wherein multiple duplicate removal rules are stored in the duplicate removal rule base, each duplicate removal rule corresponds to a different uniform resource locator structure, and each duplicate removal rule is provided with a rewriting mark indicating the rewritten segmentation parameters in the corresponding uniform resource locator;
obtaining, from website visit data, the uniform resource locator data to be deduplicated;
matching, according to the structure and segmentation parameters of the uniform resource locators, the uniform resource locators to be deduplicated with the duplicate removal rules in the duplicate removal rule base; and
filtering the matched uniform resource locators that correspond to the same duplicate removal rule, and retaining one uniform resource locator for each duplicate removal rule.
2. The uniform resource locator duplicate removal method as claimed in claim 1, characterized in that the structure of the duplicate removal rule includes a domain name parameter part, a suffix part, a division number part, and a rewriting rule part.
3. The uniform resource locator duplicate removal method as claimed in claim 2, characterized in that the rewriting rule part of the duplicate removal rule is provided with the rewriting mark of the rewritten segmentation parameters in the corresponding uniform resource locator.
4. The uniform resource locator duplicate removal method as claimed in claim 3, characterized in that the step of matching the uniform resource locators to be deduplicated with the duplicate removal rules in the duplicate removal rule base includes:
filtering out, according to the domain name parameter part of the duplicate removal rules, those uniform resource locators to be deduplicated whose domain name does not correspond;
filtering out, according to the suffix part and division number part of the duplicate removal rules, those uniform resource locators to be deduplicated whose structure corresponds to none of the duplicate removal rules; and
filtering out, according to the rewriting rule part of the duplicate removal rules, those uniform resource locators to be deduplicated that have not been rewritten.
5. The uniform resource locator duplicate removal method as claimed in claim 1, characterized in that the step of obtaining, from website visit data, the uniform resource locator data to be deduplicated includes: obtaining the uniform resource locator data to be deduplicated by means of a web crawler, or obtaining the uniform resource locator data to be deduplicated from the original logs of web page access.
6. The uniform resource locator duplicate removal method as claimed in any one of claims 1 to 5, characterized in that the method of generating the duplicate removal rules includes:
obtaining the uniform resource locator data under the domain name for which duplicate removal rules are to be generated;
clustering the obtained uniform resource locators;
splitting the clustered uniform resource locators into a domain name parameter part, a suffix part, a division number part, and a segmentation parameter part, forming multiple pieces of statistical information;
obtaining the pieces of statistical information that have the same structure after splitting; and
replacing the corresponding segmentation parameter values that differ among statistical information of the same structure with a rewriting mark, and generating a new duplicate removal rule from the statistical information in which the rewriting mark has been substituted.
7. uniform resource locator De-weight method as claimed in claim 6, which is characterized in that the unification to the acquisition
The step of Resource Locator is clustered include: to the uniform resource locator of the acquisition according to length and character lexcographical order into
Row cluster.
8. The uniform resource locator deduplication method as claimed in claim 6, characterized in that, after the step of obtaining the uniform resource locator data under the domain name for which a duplicate removal rule is to be generated, the method further comprises: filtering out the rewritten uniform resource locators from the obtained uniform resource locator data.
9. The uniform resource locator deduplication method as claimed in claim 8, characterized in that the step of filtering out the rewritten uniform resource locators from the obtained uniform resource locator data comprises:
Matching the obtained uniform resource locator data against the existing duplicate removal rules; and
Filtering out the uniform resource locators that match a corresponding duplicate removal rule.
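The pre-filtering of claims 8 and 9 can be sketched as below: URLs already covered by an existing rule are dropped before new rules are generated. The rule tuple `(domain, suffix, segment_count, pattern)`, the `*` marker, and the helpers `split_url`, `matches` and `drop_already_covered` are all assumptions made for illustration.

```python
from urllib.parse import urlsplit
import posixpath

REWRITE_MARK = "*"  # hypothetical rewrite marker


def split_url(url):
    """Assumed split: (domain, suffix, segment_count, segments)."""
    parts = urlsplit(url)
    path, suffix = posixpath.splitext(parts.path)
    segments = tuple(s for s in path.split("/") if s)
    return parts.netloc, suffix, len(segments), segments


def matches(rule, url):
    """True when the URL has the rule's structure and agrees on every fixed segment."""
    domain, suffix, count, pattern = rule
    u_domain, u_suffix, u_count, segments = split_url(url)
    if (u_domain, u_suffix, u_count) != (domain, suffix, count):
        return False
    return all(p == REWRITE_MARK or p == s for p, s in zip(pattern, segments))


def drop_already_covered(urls, existing_rules):
    """Keep only the URLs that no existing rule matches."""
    return [u for u in urls if not any(matches(r, u) for r in existing_rules)]
```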
10. The uniform resource locator deduplication method as claimed in claim 6, characterized in that, after the step of generating the new duplicate removal rule, the method further comprises: performing matching verification of the new duplicate removal rule against all uniform resource locators of the same structure under the same domain name.
11. The uniform resource locator deduplication method as claimed in claim 10, characterized in that the step of performing matching verification of the new duplicate removal rule against all uniform resource locators of the same structure under the same domain name comprises:
Obtaining all uniform resource locators that have the same domain name parameter, suffix and segment count as the new duplicate removal rule;
Matching the new duplicate removal rule against the obtained uniform resource locators; and
Passing the verification when the number of matched uniform resource locators exceeds a set threshold.
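The three verification steps of claim 11 map onto a short function. The sketch below assumes URLs have already been split into `(domain, suffix, segment_count, segments)` records and that a rule uses the same layout with `*` as the rewrite marker; both conventions and the name `verify_rule` are hypothetical.

```python
REWRITE_MARK = "*"  # hypothetical rewrite marker


def verify_rule(rule, records, threshold):
    """rule and records are (domain, suffix, segment_count, segments) tuples."""
    domain, suffix, count, pattern = rule
    # Step 1: select every record with the rule's domain, suffix and segment count.
    same_structure = [r for r in records if r[:3] == (domain, suffix, count)]
    # Step 2: match the rule against each selected record.
    matched = sum(
        all(p == REWRITE_MARK or p == s for p, s in zip(pattern, r[3]))
        for r in same_structure
    )
    # Step 3: the rule passes only when the match count exceeds the threshold.
    return matched > threshold
```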
12. The uniform resource locator deduplication method as claimed in claim 6, characterized in that, after the step of generating the new duplicate removal rule, the method further comprises: setting a pending-review mark on the new duplicate removal rule.
13. The uniform resource locator deduplication method as claimed in any one of claims 1 to 5, characterized in that the method for generating the duplicate removal rule comprises:
Obtaining the existing duplicate removal rules in a preset duplicate removal rule base, the structure of each duplicate removal rule comprising a domain name parameter part, a suffix part, a segment count part and a rewriting rule part;
Obtaining multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated;
Matching the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated using the suffix part and rewriting rule part of an existing duplicate removal rule; and
When the number of matched uniform resource locators is greater than a set threshold, replacing the domain name parameter part of the corresponding duplicate removal rule with the domain name for which a duplicate removal rule is to be generated, thereby generating a new duplicate removal rule.
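Claim 13 reuses rules from other domains: if an existing rule's suffix and rewrite pattern fit enough URLs under a new domain, the rule is copied with the domain swapped. The sketch below is one possible reading, assuming the new domain's URLs are already split into `(suffix, segments)` pairs; the function name, the tuple layouts and the `*` marker are hypothetical.

```python
REWRITE_MARK = "*"  # hypothetical rewrite marker


def derive_rules_for_domain(existing_rules, new_domain, records, threshold):
    """existing_rules: (domain, suffix, segment_count, pattern) tuples.
    records: (suffix, segments) pairs for URLs under new_domain.
    """
    new_rules = []
    for _old_domain, suffix, count, pattern in existing_rules:
        # Count the new domain's URLs that fit this rule's suffix and pattern.
        hits = sum(
            r_suffix == suffix
            and len(segments) == len(pattern)
            and all(p == REWRITE_MARK or p == s for p, s in zip(pattern, segments))
            for r_suffix, segments in records
        )
        candidate = (new_domain, suffix, count, pattern)
        # Only a match count above the threshold justifies copying the rule.
        if hits > threshold and candidate not in new_rules:
            new_rules.append(candidate)
    return new_rules
```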
14. The uniform resource locator deduplication method as claimed in claim 13, characterized in that, after the step of obtaining the existing duplicate removal rules in the preset duplicate removal rule base, the method further comprises:
Sorting the duplicate removal rules that share the same suffix part and rewriting rule part among the existing duplicate removal rules according to the number of different domain names;
Obtaining, according to the sorting result, the suffix parts and rewriting rule parts of a set number of duplicate removal rules having the largest numbers of different domain names;
The step of matching the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated using the suffix part and rewriting rule part of an existing duplicate removal rule comprises: matching the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated using the obtained suffix parts and rewriting rule parts of the set number of duplicate removal rules having the largest numbers of different domain names.
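The ranking in claim 14 amounts to counting, for each (suffix, rewrite pattern) pair, how many distinct domains already use it, and keeping the most widely shared pairs. A minimal sketch follows; the rule tuple layout and the name `top_shared_patterns` are assumptions, and ties are broken by the pair's own sort order only to make the result deterministic.

```python
from collections import defaultdict


def top_shared_patterns(rules, top_n):
    """rules: (domain, suffix, segment_count, pattern) tuples.
    Returns up to top_n (suffix, pattern) pairs used by the most distinct domains.
    """
    domains = defaultdict(set)
    for domain, suffix, _count, pattern in rules:
        domains[(suffix, pattern)].add(domain)
    # Most distinct domains first; ties fall back to the pair's natural order.
    ranked = sorted(domains, key=lambda key: (-len(domains[key]), key))
    return ranked[:top_n]
```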
15. A uniform resource locator deduplication device, characterized by comprising:
A duplicate removal rule base setting module, configured to preset a duplicate removal rule base according to the structure of uniform resource locators, the duplicate removal rule base storing multiple duplicate removal rules, each duplicate removal rule corresponding to a different uniform resource locator structure, and each duplicate removal rule being provided with a rewrite marker indicating the segment parameter that is rewritten in the corresponding uniform resource locator;
A uniform resource locator capture module, configured to obtain, from website access data, the uniform resource locator data to be deduplicated;
A matching module, configured to determine, according to the structure and segment parameters of the uniform resource locators, the uniform resource locators to be deduplicated that match the duplicate removal rules in the duplicate removal rule base; and
A deduplication module, configured to filter the matched uniform resource locators that correspond to the same duplicate removal rule, retaining one uniform resource locator for each duplicate removal rule.
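The matching and deduplication modules of claim 15 can be sketched together: each URL is matched to a rule, and only the first URL seen for each rule is retained. In this sketch, URLs that match no rule are passed through unchanged, which is an assumption since the claim does not say what happens to them. The helpers, the rule tuple `(domain, suffix, segment_count, pattern)` and the `*` marker are likewise hypothetical.

```python
from urllib.parse import urlsplit
import posixpath

REWRITE_MARK = "*"  # hypothetical rewrite marker


def split_url(url):
    """Assumed split: (domain, suffix, segment_count, segments)."""
    parts = urlsplit(url)
    path, suffix = posixpath.splitext(parts.path)
    segments = tuple(s for s in path.split("/") if s)
    return parts.netloc, suffix, len(segments), segments


def matches(rule, url):
    domain, suffix, count, pattern = rule
    u_domain, u_suffix, u_count, segments = split_url(url)
    if (u_domain, u_suffix, u_count) != (domain, suffix, count):
        return False
    return all(p == REWRITE_MARK or p == s for p, s in zip(pattern, segments))


def deduplicate(urls, rules):
    """Retain one URL per matched rule, preserving input order."""
    kept, seen_rules = [], set()
    for url in urls:
        rule = next((r for r in rules if matches(r, url)), None)
        if rule is None:
            kept.append(url)  # assumption: unmatched URLs are kept as-is
        elif rule not in seen_rules:
            seen_rules.add(rule)
            kept.append(url)  # first URL for this rule is the retained one
    return kept
```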
16. The uniform resource locator deduplication device as claimed in claim 15, characterized in that the structure of the duplicate removal rule comprises a domain name parameter part, a suffix part, a segment count part and a rewriting rule part.
17. The uniform resource locator deduplication device as claimed in claim 16, characterized in that the rewriting rule part of the duplicate removal rule is provided with the rewrite marker of the segment parameter that is rewritten in the corresponding uniform resource locator.
18. The uniform resource locator deduplication device as claimed in claim 17, characterized in that the matching module further comprises:
A first filter unit, configured to filter out, according to the domain name parameter part in the duplicate removal rule, the uniform resource locators whose domain name does not correspond, from among the uniform resource locators to be deduplicated;
A second filter unit, configured to filter out, according to the suffix part and segment count part in the duplicate removal rule, the uniform resource locators that do not correspond to the structure of any duplicate removal rule, from among the uniform resource locators to be deduplicated; and
A third filter unit, configured to filter out, according to the rewriting rule part in the duplicate removal rule, the uniform resource locators that have not been rewritten, from among the uniform resource locators to be deduplicated.
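The three filter units of claim 18 can be pictured as successive narrowing passes. The sketch below simplifies by applying them for a single rule, whereas the claim's second filter considers the structures of all rules. It also interprets "not rewritten" as "a fixed segment disagrees with the rule's pattern". The `UrlRecord` type, the rule tuple layout and the `*` marker are hypothetical.

```python
from typing import NamedTuple, Tuple

REWRITE_MARK = "*"  # hypothetical rewrite marker


class UrlRecord(NamedTuple):
    url: str
    domain: str
    suffix: str
    segments: Tuple[str, ...]


def filter_candidates(rule, records):
    """Return the URLs that survive all three filters for one rule."""
    domain, suffix, count, pattern = rule
    # First filter: drop records whose domain does not correspond to the rule.
    stage1 = [r for r in records if r.domain == domain]
    # Second filter: drop records whose suffix or segment count differs.
    stage2 = [r for r in stage1 if r.suffix == suffix and len(r.segments) == count]
    # Third filter: drop records whose fixed segments disagree with the pattern.
    stage3 = [
        r for r in stage2
        if all(p == REWRITE_MARK or p == s for p, s in zip(pattern, r.segments))
    ]
    return [r.url for r in stage3]
```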
19. The uniform resource locator deduplication device as claimed in claim 15, characterized in that the uniform resource locator capture module obtains the uniform resource locator data to be deduplicated through a web crawler, or obtains the uniform resource locator data to be deduplicated from the raw logs of web page access.
20. The uniform resource locator deduplication device as claimed in any one of claims 15 to 19, characterized in that the uniform resource locator deduplication device comprises a duplicate removal rule generating device, and the duplicate removal rule generating device comprises:
A uniform resource locator obtaining module, configured to obtain the uniform resource locator data under the domain name for which a duplicate removal rule is to be generated;
A clustering module, configured to cluster the obtained uniform resource locators;
A splitting module, configured to split the clustered uniform resource locators into a domain name parameter part, a suffix part, a segment count part and a segment parameter part, forming multiple pieces of statistical information;
A statistical information obtaining module, configured to obtain the pieces of statistical information that have the same structure after splitting; and
A segment parameter replacement module, configured to replace with a rewrite marker the segment parameters whose values differ among the statistical information of the same structure, and to generate a new duplicate removal rule from the statistical information in which the rewrite marker has been substituted.
21. The uniform resource locator deduplication device as claimed in claim 20, characterized in that the clustering module clusters the obtained uniform resource locators according to length and lexicographic character order.
22. The uniform resource locator deduplication device as claimed in claim 20, characterized in that the duplicate removal rule generating device further comprises:
A rewrite filtering module, configured to filter out, after the uniform resource locator obtaining module obtains the uniform resource locator data under the domain name for which a duplicate removal rule is to be generated, the rewritten uniform resource locators from the uniform resource locator data obtained by the uniform resource locator obtaining module.
23. The uniform resource locator deduplication device as claimed in claim 22, characterized in that the rewrite filtering module further comprises:
A duplicate removal rule matching unit, configured to match the uniform resource locator data obtained by the uniform resource locator obtaining module against the existing duplicate removal rules; and
A uniform resource locator data filtering unit, configured to filter out the uniform resource locators for which the duplicate removal rule matching unit has found a corresponding duplicate removal rule.
24. The uniform resource locator deduplication device as claimed in claim 20, characterized in that the duplicate removal rule generating device further comprises:
A verification module, configured to perform, after the segment parameter replacement module generates the new duplicate removal rule, matching verification of the new duplicate removal rule against all uniform resource locators of the same structure under the same domain name.
25. The uniform resource locator deduplication device as claimed in claim 24, characterized in that the verification module further comprises:
A selection unit, configured to obtain all uniform resource locators that have the same domain name parameter, suffix and segment count as the new duplicate removal rule; and
A matching judgment unit, configured to match the new duplicate removal rule against the obtained uniform resource locators, and to pass the verification when the number of matched uniform resource locators exceeds a set threshold.
26. The uniform resource locator deduplication device as claimed in claim 20, characterized in that the duplicate removal rule generating device further comprises:
A review mark setting module, configured to set a pending-review mark on the new duplicate removal rule after the segment parameter replacement module generates the new duplicate removal rule.
27. The uniform resource locator deduplication device as claimed in any one of claims 15 to 19, characterized in that the uniform resource locator deduplication device comprises a duplicate removal rule generating device, and the duplicate removal rule generating device comprises:
A duplicate removal rule obtaining module, configured to obtain the existing duplicate removal rules in a preset duplicate removal rule base, the structure of each duplicate removal rule comprising a domain name parameter part, a suffix part, a segment count part and a rewriting rule part;
A uniform resource locator obtaining module, configured to obtain multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated;
A suffix and rewriting rule matching module, configured to match the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated using the suffix part and rewriting rule part of an existing duplicate removal rule; and
A domain name parameter replacement module, configured to replace, when the number of matched uniform resource locators is greater than a set threshold, the domain name parameter part of the corresponding duplicate removal rule with the domain name for which a duplicate removal rule is to be generated, thereby generating a new duplicate removal rule.
28. The uniform resource locator deduplication device as claimed in claim 27, characterized in that the duplicate removal rule generating device further comprises:
A sorting module, configured to sort, according to the number of different domain names, the duplicate removal rules that share the same suffix part and rewriting rule part among the existing duplicate removal rules obtained by the duplicate removal rule obtaining module;
A duplicate removal rule screening module, configured to obtain, according to the sorting result of the sorting module, the suffix parts and rewriting rule parts of a set number of duplicate removal rules having the largest numbers of different domain names;
The suffix and rewriting rule matching module matches the multiple uniform resource locators under the domain name for which a duplicate removal rule is to be generated using the obtained suffix parts and rewriting rule parts of the set number of duplicate removal rules having the largest numbers of different domain names.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410100765.2A CN104933056B (en) | 2014-03-18 | 2014-03-18 | Uniform resource locator De-weight method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104933056A | 2015-09-23 |
CN104933056B | 2019-08-13 |
Family
ID=54120224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410100765.2A Active CN104933056B (en) | 2014-03-18 | 2014-03-18 | Uniform resource locator De-weight method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104933056B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106844389B (en) * | 2015-12-07 | 2021-05-04 | 阿里巴巴集团控股有限公司 | Method and device for processing URL (Uniform resource locator) |
CN106919570B (en) * | 2015-12-24 | 2020-12-22 | 国家新闻出版广电总局广播科学研究院 | Page link duplication removal scanning method and device for new network media |
CN105630983A (en) * | 2015-12-28 | 2016-06-01 | 努比亚技术有限公司 | Resource obtaining and optimizing device and method |
CN105975599B (en) * | 2016-05-11 | 2020-01-07 | 北京京东金融科技控股有限公司 | Method and device for monitoring page embedded points of website |
CN108268508A (en) * | 2016-12-30 | 2018-07-10 | 北京国双科技有限公司 | URL De-weight methods and device |
CN108694184B (en) * | 2017-04-06 | 2022-03-11 | 北京国双科技有限公司 | Exposure URL processing method and device |
CN109561163B (en) * | 2017-09-27 | 2022-03-15 | 阿里巴巴集团控股有限公司 | Method and device for generating uniform resource locator rewriting rule |
CN107766167A (en) * | 2017-10-23 | 2018-03-06 | 郑州云海信息技术有限公司 | A kind of fault log repeats to report an error the method for merger |
CN109861317B (en) * | 2017-11-30 | 2022-06-10 | 南京泉峰科技有限公司 | Adapter, portable power supply system and control method |
CN108449355A (en) * | 2018-04-04 | 2018-08-24 | 上海有云信息技术有限公司 | A kind of vulnerability scanning method and system |
CN110413866B (en) * | 2018-04-27 | 2024-02-02 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN108959359B (en) * | 2018-05-16 | 2022-10-11 | 顺丰科技有限公司 | Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium |
CN108984703B (en) * | 2018-07-05 | 2023-04-18 | 平安科技(深圳)有限公司 | Uniform Resource Locator (URL) duplicate removal method and device |
CN108920668B (en) * | 2018-07-05 | 2023-04-18 | 平安科技(深圳)有限公司 | Uniform Resource Locator (URL) duplicate removal method and device |
CN110717036B (en) * | 2018-07-11 | 2023-11-10 | 阿里巴巴集团控股有限公司 | Method and device for removing duplication of uniform resource locator and electronic equipment |
CN109359250B (en) * | 2018-08-31 | 2022-05-31 | 创新先进技术有限公司 | Uniform resource locator processing method, device, server and readable storage medium |
CN110008419B (en) * | 2019-03-11 | 2023-07-14 | 创新先进技术有限公司 | Webpage deduplication method, device and equipment |
CN110825947B (en) * | 2019-10-31 | 2024-03-08 | 深圳前海微众银行股份有限公司 | URL deduplication method, device, equipment and computer readable storage medium |
CN111259282B (en) * | 2020-02-13 | 2023-08-29 | 深圳市腾讯计算机系统有限公司 | URL (Uniform resource locator) duplication removing method, device, electronic equipment and computer readable storage medium |
CN114650152B (en) * | 2020-12-17 | 2023-06-20 | 中国科学院计算机网络信息中心 | Super computing center vulnerability detection method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103377260A (en) * | 2012-04-28 | 2013-10-30 | 阿里巴巴集团控股有限公司 | Analysis method and device of URLs (Uniform Resource Locator) of weblog |
CN103530337A (en) * | 2013-09-30 | 2014-01-22 | 北京奇虎科技有限公司 | Device and method for recognizing invalid parameters in URL |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7185088B1 (en) * | 2003-03-31 | 2007-02-27 | Microsoft Corporation | Systems and methods for removing duplicate search engine results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20240108. Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors. Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.; TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd. Address before: 2, 518044, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District. Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |