CN104933056B - Uniform resource locator deduplication method and device - Google Patents

Info

Publication number
CN104933056B
Authority
CN
China
Prior art keywords
duplicate removal
resource locator
uniform resource
rule
removal rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410100765.2A
Other languages
Chinese (zh)
Other versions
CN104933056A (en)
Inventor
何双宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410100765.2A
Publication of CN104933056A
Application granted
Publication of CN104933056B

Abstract

The present invention provides a uniform resource locator (URL) deduplication method and device. The URL deduplication method includes: presetting a deduplication rule base according to the structure of URLs; obtaining the URL data to be deduplicated from website access data; matching the URLs to be deduplicated against the deduplication rules in the rule base according to the structure and segment parameters of each URL; and filtering the URLs that match the same deduplication rule so that one URL is retained for each rule. With the method and device of the embodiments of the present invention, massive URL data can be filtered and deduplicated by means of deduplication rules, so that a security vulnerability scanner does not repeatedly scan the same CGI during URL security vulnerability detection, which improves the efficiency of vulnerability detection.

Description

Uniform resource locator deduplication method and device
Technical field
The present invention relates to the field of network technology, and in particular to a uniform resource locator deduplication method and device, and a corresponding deduplication rule generation method and device.
Background
URL rewriting (URL Rewrite) is a technique for rewriting URLs (Uniform Resource Locators) on the Internet. It first obtains the URL request for accessing a website that a user terminal sends, then rewrites it into another URL that the website can process; what the user receives is the content returned by the processed URL address.
For example, the news on many news websites falls into many categories, such as sports and science and technology, and new articles are published every day, so the news is also classified by date, and each day's articles carry a news index ID. When a user terminal accesses a news page in the sports category, the website server, after receiving the access request, can form an intermediate URL address through CGI (Common Gateway Interface) to access back-end data uniformly, as shown in formula (1):
http://www.qq.com/news/getNews?type=sports&date=20131120&id=1 (1)
An intermediate URL structure such as formula (1) has many disadvantages: it is hard to remember, hard to read, and inconvenient to share on mobile terminals such as mobile phones and tablet computers. Therefore, many website servers use URL rewriting to rewrite the intermediate URL structure into a form such as formula (2):
http://www.qq.com/news/sports/20131120/1.html (2)
The URL structure of formula (2) overcomes the disadvantages of the intermediate URL of formula (1): the URL is shorter, easier to remember and read, and easier to share.
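As a rough illustration of the rewriting described here, the following Python sketch maps an intermediate URL of the form of formula (1) to the form of formula (2). The function name `rewrite_to_friendly` and the assumed parameter order (type, date, id) are hypothetical and do not come from the original text; it is only a sketch of the idea, not the website server's actual mechanism.

```python
from urllib.parse import urlsplit, parse_qs


def rewrite_to_friendly(url: str) -> str:
    """Map an intermediate CGI URL like formula (1) to the path form of formula (2)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Drop the CGI name (last path segment) and keep the fixed prefix, e.g. "/news".
    prefix = parts.path.rsplit("/", 1)[0]
    # Assumption: the rewritten path lists the values in the order type, date, id.
    news_type, date, news_id = (query[name][0] for name in ("type", "date", "id"))
    return f"{parts.scheme}://{parts.netloc}{prefix}/{news_type}/{date}/{news_id}.html"


print(rewrite_to_friendly(
    "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1"
))
# prints http://www.qq.com/news/sports/20131120/1.html
```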
However, while URL rewriting brings many benefits, it also causes the following phenomenon: one dynamic CGI is rewritten into multiple URLs, which increases the number of URLs of a website. For example, the segment parameters in formula (1) include "getNews?type", "date" and "id". Now suppose the user terminal accesses a science and technology news page, and the CGI forms the intermediate URL address of formula (3):
http://www.qq.com/news/getNews?type=science&date=20131121&id=2 (3)
It can be seen that the segment parameters in formula (3) likewise include "getNews?type", "date" and "id"; only the values of the segment parameters differ. The two intermediate URL addresses of formula (3) and formula (1) are therefore essentially generated by the same dynamic CGI. Formula (4) is the URL address obtained after formula (3) is rewritten by URL rewriting:
http://www.qq.com/news/science/20131121/2.html (4)
As can be seen, the URL address of formula (4) and the URL address of formula (2) are two different URL addresses, so addresses generated by one dynamic CGI can be rewritten into multiple URL addresses. This phenomenon causes the following problem:
During URL security vulnerability detection, because URL rewriting may map multiple URLs to one dynamic CGI, detecting the security vulnerabilities of these URLs actually detects the same dynamic CGI behind the rewriting, so one CGI is scanned repeatedly. Moreover, for a large website the number of URLs is very large and one URL may carry multiple parameters, so the URL security vulnerability scanner ends up detecting the security vulnerabilities of the company's business with very low efficiency. At the same time, repeatedly scanning the same dynamic CGI also imposes unnecessary performance loss and operating cost on the business website.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a uniform resource locator deduplication method and device, to solve the problem that URL rewriting reduces the efficiency of URL security vulnerability detection, causes performance loss to the website server and increases operating cost.
An embodiment of the present invention provides a uniform resource locator deduplication method, comprising:
presetting a deduplication rule base according to the structure of uniform resource locators, wherein the deduplication rule base stores multiple deduplication rules, each deduplication rule corresponds to a different uniform resource locator structure, and each deduplication rule contains a rewrite marker indicating the segment parameters that have been rewritten in the corresponding uniform resource locator;
obtaining the uniform resource locator data to be deduplicated from website access data;
matching the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the uniform resource locators; and
filtering the uniform resource locators that match the same deduplication rule, and retaining one uniform resource locator for each deduplication rule.
An embodiment of the present invention also provides a deduplication rule generation method, comprising:
obtaining the uniform resource locator data under the domain name for which deduplication rules are to be generated;
clustering the obtained uniform resource locators;
splitting the clustered uniform resource locators into a domain name parameter part, a suffix part, a segment count part and a segment parameter part, to form multiple pieces of statistical information;
obtaining the pieces of statistical information that have the same structure after splitting; and
replacing the segment parameter values that differ among the statistical information of the same structure with a rewrite marker, and generating a new deduplication rule from the statistical information in which the rewrite marker has been substituted.
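The last step of this generation method can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: each piece of statistical information is modeled as a hypothetical tuple `(host, suffix, sections, segments)`, the function name `generate_rule` is invented for the example, and the rewrite marker is taken to be "*".

```python
REWRITE_MARK = "*"


def generate_rule(stats):
    """Build one rule (host, suffix, sections, rule_path) from same-structure statistics.

    Each item of `stats` is a hypothetical tuple (host, suffix, sections, segments),
    where `segments` is the list of path segments with the suffix already removed.
    """
    host, suffix, sections, _ = stats[0]
    if any((h, sfx, n) != (host, suffix, sections) for h, sfx, n, _ in stats):
        raise ValueError("statistics must share host, suffix and segment count")
    # Compare the segments position by position across all statistics.
    columns = zip(*(segments for _, _, _, segments in stats))
    # A position whose values differ is treated as a rewritten segment parameter.
    rule_path = "/".join(
        col[0] if len(set(col)) == 1 else REWRITE_MARK for col in columns
    )
    return (host, suffix, sections, rule_path)


stats = [
    ("www.qq.com", "html", 4, ["news", "sports", "20131120", "1"]),
    ("www.qq.com", "html", 4, ["news", "science", "20131121", "2"]),
]
print(generate_rule(stats))
# prints ('www.qq.com', 'html', 4, 'news/*/*/*')
```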
An embodiment of the present invention further provides another deduplication rule generation method, comprising:
obtaining the existing deduplication rules in a preset deduplication rule base, wherein the structure of a deduplication rule includes a domain name parameter part, a suffix part, a segment count part and a rewrite rule part;
obtaining multiple uniform resource locators under the domain name for which deduplication rules are to be generated;
matching the multiple uniform resource locators under the domain name for which deduplication rules are to be generated against the suffix part and the rewrite rule part of the existing deduplication rules; and
when the number of matched uniform resource locators is greater than a set threshold, replacing the domain name parameter part of the corresponding deduplication rule with the domain name for which deduplication rules are to be generated, to generate a new deduplication rule.
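The steps of this second generation method can be sketched as follows. The helper names `path_matches` and `transfer_rule`, the tuple layout `(host, suffix, sections, rule_path)`, the domain `www.example.com` and the threshold values are all assumptions made for illustration; the original text does not specify them.

```python
from urllib.parse import urlsplit

REWRITE_MARK = "*"


def path_matches(url, suffix, rule_path):
    """Check a URL against the suffix part and rewrite rule part of a rule."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    url_suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], url_suffix = segments[-1].rsplit(".", 1)
    rule_segments = rule_path.split("/")
    if url_suffix != suffix or len(segments) != len(rule_segments):
        return False
    return all(r == REWRITE_MARK or r == s for r, s in zip(rule_segments, segments))


def transfer_rule(existing_rule, new_host, urls, threshold):
    """Return the rule re-targeted at new_host if enough URLs match, else None."""
    _, suffix, sections, rule_path = existing_rule
    hits = sum(
        1
        for url in urls
        if urlsplit(url).hostname == new_host and path_matches(url, suffix, rule_path)
    )
    if hits > threshold:
        return (new_host, suffix, sections, rule_path)
    return None


existing = ("www.qq.com", "html", 4, "news/*/*/*")
urls = [
    "http://www.example.com/news/sports/20131120/1.html",
    "http://www.example.com/news/science/20131121/2.html",
    "http://www.example.com/news/finance/20131122/3.html",
    "http://www.example.com/about",
]
print(transfer_rule(existing, "www.example.com", urls, threshold=2))
# prints ('www.example.com', 'html', 4, 'news/*/*/*')
```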
An embodiment of the present invention provides a uniform resource locator deduplication device, comprising:
a deduplication rule base setup module, configured to preset a deduplication rule base according to the structure of uniform resource locators, wherein the deduplication rule base stores multiple deduplication rules, each deduplication rule corresponds to a different uniform resource locator structure, and each deduplication rule contains a rewrite marker indicating the segment parameters that have been rewritten in the corresponding uniform resource locator;
a uniform resource locator capture module, configured to obtain the uniform resource locator data to be deduplicated from website access data;
a matching module, configured to match the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the uniform resource locators; and
a deduplication module, configured to filter the uniform resource locators that match the same deduplication rule, and to retain one uniform resource locator for each deduplication rule.
An embodiment of the present invention also provides a deduplication rule generation device, comprising:
a uniform resource locator obtaining module, configured to obtain the uniform resource locator data under the domain name for which deduplication rules are to be generated;
a clustering module, configured to cluster the obtained uniform resource locators;
a splitting module, configured to split the clustered uniform resource locators into a domain name parameter part, a suffix part, a segment count part and a segment parameter part, to form multiple pieces of statistical information;
a statistical information obtaining module, configured to obtain the pieces of statistical information that have the same structure after splitting; and
a segment parameter replacement module, configured to replace the segment parameter values that differ among the statistical information of the same structure with a rewrite marker, and to generate a new deduplication rule from the statistical information in which the rewrite marker has been substituted.
An embodiment of the present invention further provides another deduplication rule generation device, comprising:
a deduplication rule obtaining module, configured to obtain the existing deduplication rules in a preset deduplication rule base, wherein the structure of a deduplication rule includes a domain name parameter part, a suffix part, a segment count part and a rewrite rule part;
a uniform resource locator obtaining module, configured to obtain multiple uniform resource locators under the domain name for which deduplication rules are to be generated;
a suffix and rewrite rule matching module, configured to match the multiple uniform resource locators under the domain name for which deduplication rules are to be generated against the suffix part and the rewrite rule part of the existing deduplication rules; and
a domain name parameter replacement module, configured to, when the number of matched uniform resource locators is greater than a set threshold, replace the domain name parameter part of the corresponding deduplication rule with the domain name for which deduplication rules are to be generated, and to generate a new deduplication rule.
Compared with the prior art, the beneficial effects of the present invention are as follows: with the method and device of the embodiments of the present invention, massive URL data can be filtered and deduplicated by means of deduplication rules, recovering from a large volume of URL data the very small number of dynamic CGIs behind the rewriting. This prevents a security vulnerability scanner from repeatedly scanning the same CGI during URL security vulnerability detection, and thereby improves the efficiency of security vulnerability detection.
Brief description of the drawings
Fig. 1 is a schematic diagram of the running environment of the method and device of the embodiment of the present invention;
Fig. 2 is a structural diagram of the website server in Fig. 1;
Fig. 3 is a flowchart of a uniform resource locator deduplication method of the embodiment of the present invention;
Fig. 4 is a flowchart of matching the URLs to be deduplicated against the deduplication rules in the embodiment of the present invention;
Fig. 5 is a flowchart of the first deduplication rule generation method of the embodiment of the present invention;
Fig. 6 is another flowchart of the first deduplication rule generation method of the embodiment of the present invention;
Fig. 7 is a flowchart of the second deduplication rule generation method of the embodiment of the present invention;
Fig. 8 is another flowchart of the second deduplication rule generation method of the embodiment of the present invention;
Fig. 9 is a structural diagram of a uniform resource locator deduplication device of the embodiment of the present invention;
Fig. 10 is a structural diagram of a matching module of the embodiment of the present invention;
Fig. 11 is a structural diagram of the first deduplication rule generation device of the embodiment of the present invention;
Fig. 12 is another structural diagram of the first deduplication rule generation device of the embodiment of the present invention;
Fig. 13 is a structural diagram of a rewrite filtering module of the embodiment of the present invention;
Fig. 14 is a structural diagram of a verification module of the embodiment of the present invention;
Fig. 15 is a structural diagram of the second deduplication rule generation device of the embodiment of the present invention;
Fig. 16 is another structural diagram of the second deduplication rule generation device of the embodiment of the present invention.
Detailed description of the embodiments
The foregoing and other technical contents, features and effects of the present invention will be clearly presented in the following detailed description of preferred embodiments with reference to the drawings. Through the description of the specific embodiments, the technical means adopted by the present invention to achieve its intended purpose, and the effects obtained, can be understood more deeply and concretely; the accompanying drawings, however, are provided for reference and illustration only and are not intended to limit the present invention.
The embodiments of the present invention provide a uniform resource locator deduplication method and device, and a corresponding deduplication rule generation method and device, which are used to filter out duplicate rewritten URL addresses and to generate the deduplication rules with which URL addresses are deduplicated. Refer to Fig. 1, which is a schematic diagram of the running environment of the above method and device. One or more user terminals 100 can be connected to one or more website servers 200 (only one is shown in Fig. 1) through a network 300. The user terminal 100 can be a smart device with network functions, such as a tablet computer, mobile phone, e-reader, remote control, PC, laptop, in-vehicle device, Internet TV or wearable device. The network 300 may be, for example, the Internet, a local area network or an intranet.
Turn to Fig. 2, which is a structural block diagram of one embodiment of the above website server 200. As shown in Fig. 2, the website server 200 includes a memory 102, a storage controller 104, one or more processors 106 (only one is shown in the figure), a peripheral interface 108 and a network controller 112. It can be understood that the structure shown in Fig. 2 is only illustrative and does not limit the structure of the website server 200. For example, the website server 200 may include more or fewer components than shown in Fig. 2, or have a configuration different from that shown in Fig. 2.
The memory 102 can be used to store software programs and modules, such as the program instructions/modules corresponding to the uniform resource locator deduplication method and device in the embodiments of the present invention. The processor 106 executes various functional applications and data processing by running the software programs and modules stored in the memory 102, thereby implementing the above method.
The memory 102 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 102 may further include memory located remotely from the processor 106, and such remote memory can be connected to the website server 200 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof. Access to the memory 102 by the processor 106 and other possible components can be carried out under the control of the storage controller 104.
The peripheral interface 108 couples various input/output devices to the processor 106. The processor 106 runs the software and instructions in the memory 102 and performs data processing. In some embodiments, the peripheral interface 108, the processor 106 and the storage controller 104 can be implemented in a single chip. In other examples, they can each be implemented by a separate chip.
The network controller 112 is used to receive and transmit network signals. The network signals may include wireless signals or wired signals. In one example, the network signal is a wired network signal. In this case, the network controller 112 may include elements such as a processor, random access memory, a converter and a crystal oscillator.
The above software programs and modules include an operating system 122 and a website server module 224. The operating system 122 may be, for example, LINUX, UNIX or WINDOWS; it may include various software components and/or drivers for managing system tasks (such as memory management, storage device control and power management) and can communicate with various hardware or software components, thereby providing a running environment for other software components. The website server module 224 runs on top of the operating system 122, listens for web access requests from the network through the network services of the operating system 122, completes the corresponding data processing according to each request, and returns the resulting web page or data in another format to the user terminal 100. The website server module 224 may include, for example, dynamic web page scripts and a script interpreter. The script interpreter may be, for example, the Apache web server program, which processes dynamic web page scripts into a format acceptable to the client, such as Hypertext Markup Language (HTML) format or Extensible Markup Language (XML) format.
Embodiment one
Refer to Fig. 3, which is a flowchart of a uniform resource locator deduplication method of the embodiment of the present invention, comprising the following steps:
S301: preset a deduplication rule base according to the structure of uniform resource locators. The deduplication rule base stores multiple deduplication rules, each deduplication rule corresponds to a different uniform resource locator structure, and each deduplication rule contains a rewrite marker indicating the segment parameters that have been rewritten in the corresponding uniform resource locator.
S302: obtain the uniform resource locator data to be deduplicated from website access data.
S303: match the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the uniform resource locators.
S304: filter the uniform resource locators that match the same deduplication rule, and retain one uniform resource locator for each deduplication rule.
According to the W3C (World Wide Web Consortium) standard, the structure of a URL can be decomposed into three main parts: the domain name parameter part (also called the "host part" in the embodiments of the present invention), the path parameter part (also called the "path part"), and the query parameter part (also called the "query part"). For example, for a URL of the form http://www.qq.com/nba/lbj.php?age=30&sex=1, the host part is "www.qq.com", the path part is "/nba/lbj.php", and the query part is "age=30&sex=1", where the combination of parameter names of the query part is "age&sex". URLs can also be divided into three classes according to the suffix of the path part: first, URLs with no suffix; second, URLs with a suffix such as php, jsp, html or htm; third, URLs whose suffix is "/".
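The decomposition described in this paragraph can be reproduced with Python's standard `urllib.parse` module. The sketch below is illustrative only: the function names `decompose` and `suffix_class`, and the use of an empty string for "no suffix", are assumptions of this example rather than anything defined in the original text.

```python
from urllib.parse import urlsplit, parse_qsl


def decompose(url):
    """Split a URL into its host, path and query parts plus the query parameter names."""
    parts = urlsplit(url)
    names = [name for name, _ in parse_qsl(parts.query, keep_blank_values=True)]
    return {
        "host": parts.hostname,
        "path": parts.path,
        "query": parts.query,
        "param_names": "&".join(names),
    }


def suffix_class(path):
    """Return "/" for a trailing slash, the file suffix if present, else ""."""
    if path.endswith("/"):
        return "/"
    last = path.rsplit("/", 1)[-1]
    return last.rsplit(".", 1)[1] if "." in last else ""


info = decompose("http://www.qq.com/nba/lbj.php?age=30&sex=1")
print(info["host"], info["path"], info["query"], info["param_names"])
# prints www.qq.com /nba/lbj.php age=30&sex=1 age&sex
print(suffix_class(info["path"]))
# prints php
```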
In step S301, the deduplication rule is used in the embodiments of the present invention to identify whether a URL has been rewritten. Building on the W3C standard's division of the structure of URLs that have not been rewritten, and considering the structural relationship between a URL before and after rewriting, the embodiments preferably define a deduplication rule as comprising a domain name parameter part, a suffix part, a segment count part and a rewrite rule part. For ease of description, this embodiment writes a deduplication rule in the form {host, suffix, sections, rule_path}. The rule thus consists of four parts: host (domain name parameter part), suffix (suffix part), sections (segment count part) and rule_path (rewrite rule part).
The host part corresponds to the host of the rewritten URL. The suffix part corresponds to the suffix that follows the path part of the rewritten URL. The sections part corresponds to the number of segments into which the path of the rewritten URL is split by the "/" symbol; the strings obtained by the split are called segment parameters. rule_path is the rewrite rule of the path part of the rewritten URL: each rewritten segment parameter is replaced by a rewrite marker (the embodiments of the present invention use "*" as the rewrite marker) to indicate that the segment is a rewritten segment parameter, while segment parameters that are not rewritten are kept unchanged in rule_path.
Take the URL of formula (2) as an example: http://www.qq.com/news/sports/20131120/1.html. In this URL, the parameter values "sports", "20131120" and "1" are all fields that are rewritten into parameters of the dynamic CGI, so they are replaced by "*", whereas the segment "news" is not rewritten and is therefore kept in the deduplication rule. The deduplication rule of this URL can thus be expressed as {www.qq.com, html, 4, news/*/*/*}.
As can be seen, a deduplication rule corresponds to the structure of a URL, and the rewrite markers in it indicate whether the URL has been rewritten.
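To make the rule form concrete, the following sketch derives the rule {www.qq.com, html, 4, news/*/*/*} from the URL of formula (2). The `DedupRule` type, the `build_rule` function and the way rewritten positions are passed in as a set of indices are assumptions introduced for this example; URLs whose suffix is "/" are not handled by this sketch.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

REWRITE_MARK = "*"


class DedupRule(NamedTuple):
    host: str
    suffix: str
    sections: int
    rule_path: str


def build_rule(url, rewritten_positions):
    """Build a rule from a rewritten URL, given the 0-based indices of rewritten segments."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    last = segments[-1]
    suffix = last.rsplit(".", 1)[1] if "." in last else ""
    # Rewritten segments become the rewrite marker; the others are kept verbatim.
    rule_segments = [
        REWRITE_MARK if i in rewritten_positions else seg
        for i, seg in enumerate(segments)
    ]
    return DedupRule(parts.hostname, suffix, len(segments), "/".join(rule_segments))


rule = build_rule("http://www.qq.com/news/sports/20131120/1.html", {1, 2, 3})
print(rule)
# prints DedupRule(host='www.qq.com', suffix='html', sections=4, rule_path='news/*/*/*')
```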
In step S302, the URL data of a website server can be captured mainly in two ways: first, the uniform resource locator data to be deduplicated is obtained by a web crawler; second, it is obtained by filtering the raw logs of web page access.
In step S303, when the URLs to be deduplicated are matched against the deduplication rules, each URL can be matched in turn against the rules in the deduplication rule base. If a URL matches a deduplication rule, the URL has been rewritten and may need to be filtered as a duplicate. Moreover, to judge whether a URL has been rewritten, it is also necessary to check whether the segment parameters of the rule_path contain a rewrite marker; only if a rewrite marker is present is the URL determined to have been rewritten.
Furthermore, during matching, URLs that match no deduplication rule are filtered out. Referring to Fig. 4, this may specifically include the following steps:
S3031: according to the domain name parameter part of the deduplication rules, filter out those uniform resource locators to be deduplicated whose domain name has no corresponding rule. For example, if the domain name of a URL is "www.sina.com" and no deduplication rule in the rule base has www.sina.com as its host part, the URL is filtered out.
S3032: according to the suffix part and the segment count part of the deduplication rules, filter out those uniform resource locators to be deduplicated whose structure corresponds to none of the deduplication rules. That is, if the suffix part and the segment parameter part of a URL do not correspond to the suffix and sections of any deduplication rule in the rule base, the URL is filtered out.
S3033: according to the rewrite rule part of the deduplication rules, filter out those uniform resource locators to be deduplicated that have not been rewritten.
To further illustrate the process of matching a URL against deduplication rules, this embodiment discloses a matching algorithm, shown in Table 1:
Table 1
If the match succeeds, the matched deduplication rule is returned, indicating that the URL is a rewritten URL; otherwise an empty string is returned, indicating that the URL is not a rewritten URL. Steps 1 and 2 describe the input and output. Step 3 marks the start of the algorithm. Step 4 checks whether any deduplication rule has the same domain name parameter as the input URL; if not, the input URL is filtered out. Steps 5 and 6 check whether any deduplication rule has the same suffix part and segment count part as the input URL; if not, the input URL is filtered out. Steps 8, 9 and 10 indicate that when at least one deduplication rule is found whose domain name parameter part, suffix part and segment count part are the same as those of the input URL, the input URL is checked in turn against the rewrite rule part of each such rule. Step 11 checks, for each such rule in turn, whether the number of segments in its rewrite rule part equals the value of its segment count part. Steps 13, 14 and 15 indicate that if a deduplication rule corresponding to the input URL is found, that rule is output; if none is found, the input URL is filtered out.
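Because the body of Table 1 is not reproduced here, the following Python sketch is a reconstruction from the step descriptions above rather than the algorithm of Table 1 itself. Rules are modeled as hypothetical tuples `(host, suffix, sections, rule_path)`, and the names `split_url` and `match_rule` are invented for the example.

```python
from urllib.parse import urlsplit

REWRITE_MARK = "*"


def split_url(url):
    """Return (host, suffix, segments); the suffix is stripped from the last segment."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return parts.hostname, suffix, segments


def match_rule(url, rules):
    """Return the first matching rule, or "" when the URL is not a rewritten URL."""
    host, suffix, segments = split_url(url)
    # Filter by the domain name parameter part.
    candidates = [r for r in rules if r[0] == host]
    # Filter by the suffix part and the segment count part.
    candidates = [r for r in candidates if r[1] == suffix and r[2] == len(segments)]
    for rule in candidates:
        rule_segments = rule[3].split("/")
        # The rule must be self-consistent and contain at least one rewrite marker.
        if len(rule_segments) != rule[2] or REWRITE_MARK not in rule_segments:
            continue
        # Segments that are not rewritten must equal the URL's segments exactly.
        if all(r == REWRITE_MARK or r == s for r, s in zip(rule_segments, segments)):
            return rule
    return ""


rules = [("www.qq.com", "html", 4, "news/*/*/*")]
print(match_rule("http://www.qq.com/news/sports/20131120/1.html", rules))
# prints ('www.qq.com', 'html', 4, 'news/*/*/*')
```

A URL from another domain, or an intermediate URL such as formula (1), yields the empty string with this rule base.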
In step S304, after all the URLs to be deduplicated have been matched, the uniform resource locators corresponding to each matched deduplication rule are collected. Because the URLs corresponding to the same deduplication rule are rewritten from intermediate URLs generated by the same CGI, the URLs corresponding to the same rule are filtered and only one URL is retained for each deduplication rule, for subsequent vulnerability detection or other data analysis.
For example, the two URLs "http://www.qq.com/news/sports/20131120/1.html" and "http://www.qq.com/news/science/20131121/2.html" both match the deduplication rule {www.qq.com, html, 4, news/*/*/*}. This shows that both URLs have been rewritten, that is, the URLs before rewriting were generated by the same CGI, so either one of the two URLs can be filtered out.
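The filtering of step S304 can be sketched as follows. The function `deduplicate` keeps the first URL seen for each matched rule, which is one arbitrary choice among those the text allows; the stand-in matcher `match_news`, built on a regular expression for the single rule of this example, is an assumption used only to keep the sketch self-contained.

```python
import re

NEWS_RULE = ("www.qq.com", "html", 4, "news/*/*/*")
# Stand-in for rule matching: a regular expression equivalent to NEWS_RULE.
NEWS_PATTERN = re.compile(r"^http://www\.qq\.com/news/[^/]+/[^/]+/[^/]+\.html$")


def match_news(url):
    """Return the matched rule, or "" if the URL matches no rule."""
    return NEWS_RULE if NEWS_PATTERN.match(url) else ""


def deduplicate(urls, match):
    """Keep one URL per matched rule; URLs that match no rule are dropped."""
    kept = {}
    for url in urls:
        rule = match(url)
        if rule and rule not in kept:
            kept[rule] = url  # the first URL seen for this rule is retained
    return list(kept.values())


urls = [
    "http://www.qq.com/news/sports/20131120/1.html",
    "http://www.qq.com/news/science/20131121/2.html",
    "http://www.qq.com/about",
]
print(deduplicate(urls, match_news))
# prints ['http://www.qq.com/news/sports/20131120/1.html']
```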
With the method of this embodiment, massive URL data can be filtered and deduplicated by means of deduplication rules, recovering from a large volume of URL data the very small number of dynamic CGIs behind the rewriting. This prevents a security vulnerability scanner from repeatedly scanning the same CGI during URL security vulnerability detection, and thereby improves the efficiency of security vulnerability detection.
In addition, in web access request traffic analysis, the page request traffic that is counted is the traffic of rewritten URLs, and the traffic of the real CGI cannot be counted. In a defense system against website CC attacks (Challenge Collapsar, a common website attack method in which an attacker controls certain hosts to continuously send large numbers of packets to the target server, exhausting server resources until it crashes), however, what needs to be counted and audited is the traffic of the CGI behind the rewriting. The method of this embodiment can therefore also improve the efficiency of data analysis for CC attacks.
The embodiments of the present invention also propose methods for generating deduplication rules. Rule generation and URL deduplication can be carried out in an iterative loop; for example, URLs that are filtered out during matching without being deduplicated can serve as the basis for generating deduplication rules, so as to improve the deduplication rule base and widen the range of URL deduplication. The embodiments disclose two methods of generating deduplication rules: one constructs rules from the data structure of URLs, and the other applies existing deduplication rules to a new domain name to form new rules. They are described in detail below:
Embodiment two
Referring to Fig. 5, which is a flow chart of the first deduplication rule generation method of an embodiment of the present invention, the method includes the following steps:
S501: obtain the URL data under the domain name for which deduplication rules are to be generated. That is, URLs with the same host part are read from the data center of the website server, and the URLs read must be ones that have not been rewritten.
S502: cluster the obtained URLs. Specifically, the URLs read in step S501 are clustered according to length and character lexicographic order. Clustering the URLs speeds up subsequent processing and improves the efficiency of rule generation.
The length refers to the number of sections of a URL, that is, the number of sections obtained by splitting the URL at the "/" symbol. For example, the two URLs "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.qq.com/news/getNews?type=science&date=20131121&id=2" each contain two "/" symbols after the domain name, so the two URLs are considered to have the same length.
The character lexicographic order refers to the dictionary order of the characters in a URL. For example, the domain name parts of the two URLs "http://www.aa.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.aa.com/news/getNews?type=science&date=20131121&id=2" both contain "aa", so the two URLs can be placed in the same class; the domain name parts of the two URLs "http://www.bb.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.bb.com/news/getNews?type=science&date=20131121&id=2" both contain "bb", so these two URLs can likewise be placed in the same class.
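One possible reading of the clustering in step S502 is to group URLs by domain name and section count, and to order each group lexicographically. The sketch below does this. The names `cluster_key` and `cluster` are hypothetical, and grouping by the pair (host, section count) is an assumption, since the text does not fix the exact clustering procedure.

```python
from itertools import groupby
from urllib.parse import urlsplit


def cluster_key(url):
    """Key a URL by its host and by the number of "/"-separated path sections."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return (parts.hostname or "", len(segments))


def cluster(urls):
    """Group URLs with the same host and length; each group is sorted lexicographically."""
    ordered = sorted(urls, key=lambda u: (cluster_key(u), u))
    return {key: list(group) for key, group in groupby(ordered, key=cluster_key)}


urls = [
    "http://www.bb.com/news/getNews?type=sports&date=20131120&id=1",
    "http://www.aa.com/news/getNews?type=science&date=20131121&id=2",
    "http://www.aa.com/news/getNews?type=sports&date=20131120&id=1",
]
for key, group in cluster(urls).items():
    print(key, len(group))
```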
S503: split the clustered URLs into a domain name parameter part, a suffix part, a segment count part, and a segment parameter part, forming multiple pieces of statistical information.
Specifically, this step performs structural analysis on the clustered URLs and forms statistical information of the form {host, suffix, sections, index, path}. The statistical information can be stored in the memory of the website server or in the system cache. Here, index gives the index positions of the segments of the path part.
For example, the statistical information of the URL "http://www.qq.com/news/sports/20131120/1.html" is {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1}, where index indicates the storage positions of the segments of the path part; that is, "0/1/2/3" and "news/sports/20131120/1" correspond one to one.
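The splitting of a URL into {host, suffix, sections, index, path} can be sketched as below. The function name `url_stats` and the use of a dictionary are illustrative assumptions; the text specifies only the five fields.

```python
from urllib.parse import urlsplit


def url_stats(url):
    """Split a URL into the fields host, suffix, sections, index and path."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments:
        # The suffix is whatever follows the last "." in the final segment.
        name, dot, ext = segments[-1].rpartition(".")
        if dot:
            segments[-1], suffix = name, ext
    return {
        "host": parts.hostname or "",
        "suffix": suffix,
        "sections": len(segments),
        "index": "/".join(str(i) for i in range(len(segments))),
        "path": "/".join(segments),
    }


print(url_stats("http://www.qq.com/news/sports/20131120/1.html"))
```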
S504: obtain the pieces of statistical information that have the same structure after splitting.
For example, the statistical information of the URL "http://www.qq.com/news/sports/20131120/1.html" is {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1}, and the statistical information of the URL "http://www.qq.com/news/science/20131121/2.html" is {www.qq.com, html, 4, 0/1/2/3, news/science/20131121/2}; the statistical information of the two URLs is therefore considered to have the same structure.
S505: replace the corresponding segment parameter values that differ between pieces of statistical information with the same structure with a rewrite marker, and generate a new deduplication rule from the statistical information after replacement.
For example, for the two pieces of statistical information {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1} and {www.qq.com, html, 4, 0/1/2/3, news/science/20131121/2}, the segment parameters of the path part are compared one by one according to the index values. Wherever the values of corresponding segment parameters differ, they are replaced with the rewrite marker, so that after replacement the two pieces of statistical information yield {www.qq.com, html, 4, 0/1/2/3, news/*/*/*}. The index part is then removed, which gives the new deduplication rule {www.qq.com, html, 4, news/*/*/*}. The newly generated rule is stored in the deduplication rule base, where it can be looked up during URL deduplication matching.
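The replacement step S505 can be sketched as a segment-by-segment comparison. The function name `generalize` and the dictionary and tuple representations are hypothetical; "*" is used as the rewrite marker, as in the example rule above.

```python
def generalize(stat_a, stat_b):
    """Merge two same-structure statistics into a rule, or return None."""
    same_structure = all(
        stat_a[k] == stat_b[k] for k in ("host", "suffix", "sections")
    )
    if not same_structure:
        return None
    # Compare path segments position by position; differing values
    # are replaced with the rewrite marker "*".
    merged = [
        a if a == b else "*"
        for a, b in zip(stat_a["path"].split("/"), stat_b["path"].split("/"))
    ]
    # The index field is dropped when the rule is formed.
    return (stat_a["host"], stat_a["suffix"], stat_a["sections"], "/".join(merged))


a = {"host": "www.qq.com", "suffix": "html", "sections": 4,
     "index": "0/1/2/3", "path": "news/sports/20131120/1"}
b = {"host": "www.qq.com", "suffix": "html", "sections": 4,
     "index": "0/1/2/3", "path": "news/science/20131121/2"}
print(generalize(a, b))
```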
Referring to Fig. 6, which is another flow chart of the first deduplication rule generation method of an embodiment of the present invention, the method includes the following steps:
S601: obtain the URL data under the domain name for which deduplication rules are to be generated.
S602: filter out the rewritten URLs from the obtained URL data. When statistical information is generated, the structure, segment parameters, and parameter values of each URL are analyzed, but the segment parameters of a rewritten URL cannot be identified. To ensure that processing does not go wrong, this step therefore further filters the obtained URLs and removes those that have been rewritten.
Specifically, the URLs obtained in step S601 can be matched against the existing deduplication rules in the rule base using the matching algorithm described in embodiment one (the deduplication matching algorithm of Table 1). If a URL matches a deduplication rule, the URL has been rewritten and is filtered out.
S603: cluster the obtained URLs according to length and character lexicographic order.
S604: split the clustered URLs into a domain name parameter part, a suffix part, a segment count part, and a segment parameter part, forming multiple pieces of statistical information.
S605: obtain the pieces of statistical information that have the same structure after splitting.
S606: replace the corresponding segment parameter values that differ between pieces of statistical information with the same structure with a rewrite marker, and generate a new deduplication rule from the statistical information after replacement.
S607: verify the new deduplication rule by matching it against all URLs of the same structure under the same domain name. The purpose of this step is to further verify the usability of the newly generated rule.
Specifically, verification can again use the matching algorithm described in embodiment one (the deduplication matching algorithm of Table 1). First, the URLs that have the same domain name parameter, suffix, and segment count as the new rule are obtained. Next, the new rule is matched against these URLs. If the number of matching URLs exceeds a set threshold, the newly generated rule can be used to identify whether a URL has been rewritten, and verification passes.
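The verification step can be sketched as a count against a threshold. The sketch assumes the URLs have already been split into statistical information dictionaries with the fields host, suffix, sections and path; the function name `validate_rule` and the segment-wise comparison are illustrative stand-ins for the Table 1 algorithm, not a reproduction of it.

```python
def validate_rule(rule, stats, threshold):
    """Return True if more than `threshold` same-structure URLs match the rule."""
    host, suffix, sections, rule_path = rule
    pattern = rule_path.split("/")
    # First keep only URLs with the same domain name, suffix and segment count.
    candidates = [
        s for s in stats
        if (s["host"], s["suffix"], s["sections"]) == (host, suffix, sections)
    ]
    # Then count how many of them fit the rewriting rule part.
    hits = sum(
        all(p == "*" or p == seg for p, seg in zip(pattern, s["path"].split("/")))
        for s in candidates
    )
    return hits > threshold
```

The threshold value itself is left to the caller, as the text does not specify one for this step.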
S608: set a pending-review flag on the new deduplication rule. To further guard against processing errors that would affect the subsequent automatic generation of rules and the automatic running of the URL deduplication control program, each newly generated rule is given a pending-review flag so that it can be checked manually. Once the manual check passes, the flag is removed and the rule is formally put into use.
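A minimal sketch of how such a flag might be kept alongside the rules is shown below. The class `RuleBase` and its methods are purely hypothetical; the text does not describe how the rule base stores the flag.

```python
from dataclasses import dataclass, field


@dataclass
class RuleBase:
    # Maps each rule tuple to True while it still awaits manual review.
    pending: dict = field(default_factory=dict)

    def add_generated(self, rule):
        """Store a newly generated rule with the pending-review flag set."""
        self.pending.setdefault(rule, True)

    def approve(self, rule):
        """Clear the flag once a human reviewer has checked the rule."""
        if rule not in self.pending:
            raise KeyError(rule)
        self.pending[rule] = False

    def active_rules(self):
        """Rules that have passed review and may be used for matching."""
        return [r for r, waiting in self.pending.items() if not waiting]
```

With this arrangement, only rules returned by `active_rules` would take part in deduplication matching.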
The method of this embodiment can generate deduplication rules automatically and thereby further improve the rule base, so that the deduplication method of embodiment one deduplicates URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
Embodiment three
Referring to Fig. 7, which is a flow chart of the second deduplication rule generation method of an embodiment of the present invention, the method includes the following steps:
S701: obtain the existing deduplication rules in the preset rule base, where the structure of a rule includes a domain name parameter part, a suffix part, a segment count part, and a rewriting rule part.
S702: obtain multiple URLs under the domain name for which a rule is to be generated. The number of URLs obtained should not be too small; in general it should be no fewer than 5000.
S703: match the URLs under the domain name for which a rule is to be generated against the suffix part and rewriting rule part of an existing rule.
S704: when the number of matched URLs is greater than a set threshold, replace the domain name parameter part of the corresponding rule with the domain name for which a rule is to be generated, producing a new rule.
As an example, suppose {www.qq.com, html, 4, news/*/*/*} is an existing rule in the rule base, whose suffix part and rewriting rule part are "html" and "news/*/*/*" respectively. These two parts are matched against the URLs for which a rule is to be generated, using the matching algorithm described in embodiment one (the deduplication matching algorithm of Table 1). If the number of matched URLs exceeds the set threshold (for example, 100), the matched URLs have been rewritten, and "html" and "news/*/*/*" are applicable for deduplicating them. The domain name of the URLs to be deduplicated is then substituted directly into the domain name parameter part of the rule. If that domain name is "www.sina.com", the new rule generated after replacement is {www.sina.com, html, 4, news/*/*/*}.
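The second generation method can be sketched as follows. The function name `transfer_rule` is hypothetical, and the suffix and path comparison is a simplified stand-in for the Table 1 matching algorithm; only the suffix part and rewriting rule part of the existing rule are used, and the domain name is swapped in when enough URLs match.

```python
from urllib.parse import urlsplit


def transfer_rule(rule, target_host, target_urls, threshold):
    """Re-home an existing rule onto target_host if enough of its URLs match."""
    _, suffix, sections, rule_path = rule
    pattern = rule_path.split("/")
    hits = 0
    for url in target_urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if not segments:
            continue
        # Separate the suffix from the last path segment.
        name, dot, url_suffix = segments[-1].rpartition(".")
        if dot:
            segments[-1] = name
        else:
            url_suffix = ""
        if url_suffix != suffix or len(segments) != len(pattern):
            continue
        if all(p == "*" or p == s for p, s in zip(pattern, segments)):
            hits += 1
    if hits > threshold:
        # Only the domain name parameter part changes.
        return (target_host, suffix, sections, rule_path)
    return None
```

For instance, calling `transfer_rule(("www.qq.com", "html", 4, "news/*/*/*"), "www.sina.com", urls, 100)` with a sufficiently large list of matching `urls` would return the rule with "www.sina.com" as its domain name part.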
Referring to Fig. 8, which is another flow chart of the second deduplication rule generation method of an embodiment of the present invention, the method includes the following steps:
S801: obtain the existing deduplication rules in the preset rule base, where the structure of a rule includes a domain name parameter part, a suffix part, a segment count part, and a rewriting rule part.
S802: among the existing rules, group those whose suffix part and rewriting rule part are identical, and sort the groups by the number of different domain names.
S803: according to the sorting result, obtain the suffix parts and rewriting rule parts of a set number of rules that have the largest numbers of different domain names.
The number of rules in the rule base is usually very large. If every existing rule were matched once, the efficiency of rule generation would inevitably be very low. Therefore, to improve efficiency, step S802 first sorts the existing rules that share the same suffix part and rule_pathe part, and step S803 then selects, from the sorting result, the set number of rules with the largest numbers of different domain names.
For example, suppose the rule base contains 6 deduplication rules:
{ www.aa.com, html, 4, news/*/*/* }
{ www.bb.com, html, 4, news/*/*/* }
{ www.cc.com, html, 4, news/*/*/* }
{ www.aa.com, html, 3, news/*/* }
{ www.bb.com, html, 3, news/*/* }
{ www.aa.com, html, 3, news/* }
After sorting, there are 3 different domain names with "html" and "news/*/*/*", namely "www.aa.com", "www.bb.com", and "www.cc.com"; 2 different domain names with "html" and "news/*/*", namely "www.aa.com" and "www.bb.com"; and 1 domain name with "html" and "news/*", namely "www.aa.com". Suppose two groups of suffix part and rewriting rule part are to be chosen. According to the sorting result, "html" with "news/*/*/*" and "html" with "news/*/*" correspond to the largest numbers of different domain names, so these two groups are extracted. "html" with "news/*" corresponds to only one domain name, which shows that it applies to few domain names, that is, its generality is low, so it is filtered out.
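The ranking in steps S802 and S803 can be sketched as a count of distinct domain names per (suffix, rewriting rule) pair. The function name `top_patterns` is hypothetical, and the rules are modeled as tuples for illustration.

```python
from collections import defaultdict


def top_patterns(rules, n):
    """Return the n (suffix, rule_path) pairs shared by the most distinct hosts."""
    hosts = defaultdict(set)
    for host, suffix, _sections, rule_path in rules:
        hosts[(suffix, rule_path)].add(host)
    # Pairs used by more distinct domain names are considered more general.
    ranked = sorted(hosts, key=lambda pair: len(hosts[pair]), reverse=True)
    return ranked[:n]


rules = [
    ("www.aa.com", "html", 4, "news/*/*/*"),
    ("www.bb.com", "html", 4, "news/*/*/*"),
    ("www.cc.com", "html", 4, "news/*/*/*"),
    ("www.aa.com", "html", 3, "news/*/*"),
    ("www.bb.com", "html", 3, "news/*/*"),
    ("www.aa.com", "html", 3, "news/*"),
]
print(top_patterns(rules, 2))
```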
S804: obtain multiple URLs under the domain name for which a rule is to be generated.
S805: match the URLs under the domain name for which a rule is to be generated against the suffix parts and rewriting rule parts of the obtained set number of rules that have the largest numbers of different domain names. The matching algorithm described in embodiment one (the deduplication matching algorithm of Table 1) can likewise be used here.
S806: when the number of matched URLs is greater than a set threshold, replace the domain name parameter part of the corresponding rule with the domain name for which a rule is to be generated, producing a new rule.
The method of this embodiment can generate deduplication rules automatically and thereby further improve the rule base, so that the deduplication method of embodiment one deduplicates URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
Embodiment four
Corresponding to the method for previous embodiment one, the embodiment of the present invention also proposes that a kind of uniform resource locator goes to reset It sets, refers to Fig. 9, which includes: duplicate removal rule base setup module 901, uniform resource locator handling module 902, matching Module 903 and deduplication module 904.
Duplicate removal rule base setup module 901 is used to preset duplicate removal rule base according to the structure of uniform resource locator, described Multiple duplicate removal rules are stored in duplicate removal rule base, each duplicate removal rule corresponds to the different structure of uniform resource locator, and described The rewriting mark for the segmentation parameter for indicating writing in corresponding uniform resource locator is provided in duplicate removal rule.It is described to remove weight-normality Structure then may include domain name parameters part, suffix portion, division number part and rewriting rule part.The rewriting mark Knowledge can be set in the rewriting rule part.
Uniform resource locator handling module 902, which is used to obtain from website visitation data, wants the unified resource of duplicate removal to position Accord with data.The uniform resource locator handling module 902 described in web crawlers acquisition by wanting the unified resource of duplicate removal to position Data are accorded with, or want the uniform resource locator data of duplicate removal described in acquisition from the original log of web page access.
Matching module 903 is used for structure and segmentation parameter according to uniform resource locator, by the unification for wanting duplicate removal Resource Locator is matched with the duplicate removal rule in the duplicate removal rule base.
The unified resource corresponding with identical duplicate removal rule that deduplication module 904 is used to match matching module 903 positions Symbol is filtered, and corresponding each duplicate removal rule retains a uniform resource locator.
Further, while matching URLs against rules, the matching module 903 filters out the URLs that match no rule. Referring to Fig. 10, the matching module 903 may further include: a first filter unit 9031, a second filter unit 9032, and a third filter unit 9033.
The first filter unit 9031 is configured to filter out, according to the domain name parameter part of the rules, those URLs to be deduplicated whose domain name does not correspond.
The second filter unit 9032 is configured to filter out, according to the suffix part and segment count part of the rules, those URLs to be deduplicated whose structure does not correspond to that of any rule.
The third filter unit 9033 is configured to filter out, according to the rewriting rule part of the rules, those URLs to be deduplicated that have not been rewritten.
With the device of this embodiment, massive URL data can be filtered and deduplicated by means of deduplication rules, so that the very small number of dynamic CGIs behind the rewritten URLs is recovered from a large volume of URL data. This prevents the security scanner from repeatedly scanning the same CGI during URL security vulnerability detection, and thereby improves the efficiency of vulnerability detection.
Embodiment five
Corresponding to embodiment two, an embodiment of the present invention also provides a first deduplication rule generation device. Referring to Fig. 11, the device includes: a URL acquisition module 1101, a cluster module 1102, a splitting module 1103, a statistical information acquisition module 1104, and a segment parameter replacement module 1105.
The URL acquisition module 1101 is configured to obtain the URL data under the domain name for which deduplication rules are to be generated.
The cluster module 1102 is configured to cluster the obtained URLs. The cluster module 1102 may cluster the obtained URLs according to length and character lexicographic order.
The splitting module 1103 is configured to split the clustered URLs into a domain name parameter part, a suffix part, a segment count part, and a segment parameter part, forming multiple pieces of statistical information.
The statistical information acquisition module 1104 is configured to obtain the pieces of statistical information that have the same structure after splitting.
The segment parameter replacement module 1105 is configured to replace the corresponding segment parameter values that differ between pieces of statistical information with the same structure with a rewrite marker, and to generate a new deduplication rule from the statistical information after replacement.
Referring to Fig. 12, which is another structural diagram of the first deduplication rule generation device of an embodiment of the present invention, the device includes, in addition to the URL acquisition module 1101, the cluster module 1102, the splitting module 1103, the statistical information acquisition module 1104, and the segment parameter replacement module 1105: a rewrite filter module 1106, a verification module 1107, and a review flag setting module 1108.
The rewrite filter module 1106 is configured to filter out, after the URL acquisition module 1101 has obtained the URL data under the domain name for which deduplication rules are to be generated, the rewritten URLs in the URL data obtained by the URL acquisition module.
The verification module 1107 is configured to verify, after the segment parameter replacement module 1105 has generated a new deduplication rule, the new rule by matching it against all URLs of the same structure under the same domain name.
The review flag setting module 1108 is configured to set, after the segment parameter replacement module 1105 has generated a new deduplication rule, a pending-review flag on the new rule.
Referring to Fig. 13, which is a structural diagram of a rewrite filter module of an embodiment of the present invention, the rewrite filter module 1106 includes: a deduplication rule matching unit 11061 and a URL data filter unit 11062.
The deduplication rule matching unit 11061 is configured to match the URL data obtained by the URL acquisition module 1101 against the existing deduplication rules.
The filter unit 11062 is configured to filter out the URLs that the deduplication rule matching unit 11061 has matched to a deduplication rule.
Referring to Fig. 14, which is a structural diagram of a verification module of an embodiment of the present invention, the verification module 1107 includes: a selection unit 11071 and a matching judgment unit 11072.
The selection unit 11071 is configured to obtain all URLs that have the same domain name parameter, suffix, and segment count as the new deduplication rule.
The matching judgment unit 11072 is configured to match the new deduplication rule against the URLs obtained by the selection unit 11071; when the number of matching URLs exceeds a set threshold, verification passes.
The device of this embodiment can generate deduplication rules automatically and thereby further improve the rule base, so that the deduplication method of embodiment one and the deduplication device of embodiment four deduplicate URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
Embodiment six
Corresponding to embodiment three, an embodiment of the present invention also provides a second deduplication rule generation device. Referring to Fig. 15, the device includes: a deduplication rule acquisition module 1501, a URL acquisition module 1502, a suffix and rewriting rule matching module 1503, and a domain name parameter replacement module 1504.
The deduplication rule acquisition module 1501 is configured to obtain the existing deduplication rules in the preset rule base, where the structure of a rule includes a domain name parameter part, a suffix part, a segment count part, and a rewriting rule part.
The URL acquisition module 1502 is configured to obtain multiple URLs under the domain name for which a rule is to be generated.
The suffix and rewriting rule matching module 1503 is configured to match the URLs under the domain name for which a rule is to be generated against the suffix part and rewriting rule part of an existing rule.
The domain name parameter replacement module 1504 is configured to replace, when the number of matched URLs is greater than a set threshold, the domain name parameter part of the corresponding rule with the domain name for which a rule is to be generated, producing a new rule.
Referring to Fig. 16, which is another structural diagram of the second deduplication rule generation device of an embodiment of the present invention, the device includes, in addition to the deduplication rule acquisition module 1501, the URL acquisition module 1502, the suffix and rewriting rule matching module 1503, and the domain name parameter replacement module 1504: a sorting module 1505 and a deduplication rule screening module 1506.
The sorting module 1505 is configured to sort, by the number of different domain names, those of the existing rules obtained by the deduplication rule acquisition module 1501 whose suffix part and rewriting rule part are identical.
The deduplication rule screening module 1506 is configured to obtain, according to the sorting result of the sorting module 1505, the suffix parts and rewriting rule parts of a set number of rules that have the largest numbers of different domain names.
The suffix and rewriting rule matching module 1503 matches the URLs under the domain name for which a rule is to be generated against the suffix parts and rewriting rule parts of the obtained set number of rules that have the largest numbers of different domain names.
The device of this embodiment can generate deduplication rules automatically and thereby further improve the rule base, so that the deduplication method of embodiment one and the deduplication device of embodiment four deduplicate URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
From the above description of the embodiments, those skilled in the art will clearly understand that the embodiments of the present invention can be implemented in hardware, or in software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the embodiments of the present invention can be embodied in the form of a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims (28)

1. A uniform resource locator deduplication method, characterized by comprising:
presetting a deduplication rule base according to the structure of uniform resource locators, wherein multiple deduplication rules are stored in the deduplication rule base, each deduplication rule corresponds to a different uniform resource locator structure, and each deduplication rule is provided with a rewrite marker indicating the rewritten segment parameters in the corresponding uniform resource locators;
obtaining the uniform resource locator data to be deduplicated from website access data;
matching, according to the structure and segment parameters of the uniform resource locators, the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base; and
filtering the uniform resource locators that are matched to the same deduplication rule, so that one uniform resource locator is retained for each deduplication rule.
2. The uniform resource locator deduplication method as claimed in claim 1, characterized in that the structure of the deduplication rule includes a domain name parameter part, a suffix part, a segment count part, and a rewriting rule part.
3. The uniform resource locator deduplication method as claimed in claim 2, characterized in that the rewriting rule part of the deduplication rule is provided with the rewrite marker of the rewritten segment parameters in the corresponding uniform resource locators.
4. The uniform resource locator deduplication method as claimed in claim 3, characterized in that the step of matching the uniform resource locators to be deduplicated against the deduplication rules in the deduplication rule base comprises:
filtering out, according to the domain name parameter part of the deduplication rules, those uniform resource locators to be deduplicated whose domain name does not correspond;
filtering out, according to the suffix part and segment count part of the deduplication rules, those uniform resource locators to be deduplicated whose structure does not correspond to that of any deduplication rule; and
filtering out, according to the rewriting rule part of the deduplication rules, those uniform resource locators to be deduplicated that have not been rewritten.
5. The uniform resource locator deduplication method as claimed in claim 1, characterized in that the step of obtaining the uniform resource locator data to be deduplicated from website access data comprises: obtaining the uniform resource locator data to be deduplicated by means of a web crawler, or obtaining the uniform resource locator data to be deduplicated from the raw logs of web page accesses.
6. The uniform resource locator deduplication method as claimed in any one of claims 1 to 5, characterized in that the method of generating the deduplication rules comprises:
obtaining the uniform resource locator data under the domain name for which deduplication rules are to be generated;
clustering the obtained uniform resource locators;
splitting the clustered uniform resource locators into a domain name parameter part, a suffix part, a segment count part, and a segment parameter part, forming multiple pieces of statistical information;
obtaining the pieces of statistical information that have the same structure after splitting; and
replacing the corresponding segment parameter values that differ between pieces of statistical information with the same structure with a rewrite marker, and generating a new deduplication rule from the statistical information after replacement.
7. uniform resource locator De-weight method as claimed in claim 6, which is characterized in that the unification to the acquisition The step of Resource Locator is clustered include: to the uniform resource locator of the acquisition according to length and character lexcographical order into Row cluster.
8. uniform resource locator De-weight method as claimed in claim 6, which is characterized in that the acquisition will generate weight-normality It include: the uniform resource locator number for filtering out acquisition after the step of uniform resource locator data under domain name then The writing uniform resource locator in.
9. uniform resource locator De-weight method as claimed in claim 8, which is characterized in that described to filter out the described of acquisition The step of writing uniform resource locator, includes: in uniform resource locator data
The uniform resource locator data that will acquire are matched with existing duplicate removal rule;And
Filter out the uniform resource locator for matching corresponding duplicate removal rule.
10. uniform resource locator De-weight method as claimed in claim 6, which is characterized in that described generate new removes weight-normality It include: by new duplicate removal rule and mutually isostructural uniform resource locator progress all under same domain name after then the step of With verifying.
11. The uniform resource locator (URL) deduplication method as claimed in claim 10, characterized in that the step of performing matching verification of the new deduplication rule against all URLs of the same structure under the same domain name comprises:
Obtaining all URLs that have the same domain name parameter, suffix and segment count as the new deduplication rule;
Matching the new deduplication rule against the obtained URLs; and
Determining that verification has passed when the number of matched URLs exceeds a set threshold.
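The verification step amounts to counting how many same-structure URLs the new rule matches and comparing that count with a threshold. The sketch below assumes URLs have already been split into `(domain, suffix, segments)` tuples and that a rule is a `(domain, suffix, segment count, pattern)` tuple with `*` as the rewrite marker; these representations and the name `verify_rule` are hypothetical.

```python
REWRITE = "*"  # hypothetical rewrite marker


def verify_rule(rule, split_urls, threshold):
    """Return True when the rule matches more than `threshold` same-structure URLs."""
    domain, suffix, count, pattern = rule
    # Select URLs with the same domain, suffix and segment count as the rule.
    candidates = [
        segments
        for url_domain, url_suffix, segments in split_urls
        if (url_domain, url_suffix, len(segments)) == (domain, suffix, count)
    ]
    # Count the candidates whose fixed segments agree with the rule's pattern.
    hits = sum(
        all(p == REWRITE or p == s for p, s in zip(pattern, segments))
        for segments in candidates
    )
    return hits > threshold
```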
12. The uniform resource locator (URL) deduplication method as claimed in claim 6, characterized in that, after the step of generating a new deduplication rule, the method further comprises: setting a pending-review flag on the new deduplication rule.
13. A uniform resource locator (URL) deduplication method as claimed in any one of claims 1 to 5, characterized in that the method for generating the deduplication rule comprises:
Obtaining the existing deduplication rules in the preset deduplication rule base, the structure of each deduplication rule comprising a domain name parameter part, a suffix part, a segment count part and a rewrite rule part;
Acquiring multiple URL data items under the domain name for which a deduplication rule is to be generated;
Matching the multiple URLs under the domain name for which a deduplication rule is to be generated against the suffix part and the rewrite rule part of the existing deduplication rules; and
When the number of matched URLs is greater than a set threshold, replacing the domain name parameter part of the corresponding deduplication rule with the domain name for which a deduplication rule is to be generated, thereby generating a new deduplication rule.
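This second generation method reuses rules learned on other domains. A minimal sketch, assuming rules are `(domain, suffix, segment count, pattern)` tuples, that the URLs of the new domain are already split into `(suffix, segments)` pairs, and that `*` is the rewrite marker (all hypothetical choices), could look like this:

```python
REWRITE = "*"  # hypothetical rewrite marker


def pattern_matches(pattern, segments):
    """Check the rewrite rule part: same length, fixed segments must be equal."""
    return len(pattern) == len(segments) and all(
        p == REWRITE or p == s for p, s in zip(pattern, segments)
    )


def transfer_rules(existing_rules, new_domain, split_urls, threshold):
    """Copy a rule to `new_domain` when it matches more than `threshold` of its URLs."""
    new_rules = []
    tried = set()
    for _old_domain, suffix, count, pattern in existing_rules:
        if (suffix, pattern) in tried:
            continue  # the same template was already tested via another domain
        tried.add((suffix, pattern))
        hits = sum(
            1
            for url_suffix, segments in split_urls
            if url_suffix == suffix and pattern_matches(pattern, segments)
        )
        if hits > threshold:
            # Only the domain name parameter part changes in the new rule.
            new_rules.append((new_domain, suffix, count, pattern))
    return new_rules
```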
14. The uniform resource locator (URL) deduplication method as claimed in claim 13, characterized in that, after the step of obtaining the existing deduplication rules in the preset deduplication rule base, the method further comprises:
Sorting the existing deduplication rules whose suffix part and rewrite rule part are identical according to the number of different domain names they cover;
According to the sorting result, obtaining the suffix parts and rewrite rule parts of a set quantity of deduplication rules that cover the largest numbers of different domain names;
The step of matching the multiple URLs under the domain name for which a deduplication rule is to be generated against the suffix part and the rewrite rule part of the existing deduplication rules comprises: matching the multiple URLs under the domain name for which a deduplication rule is to be generated against the suffix parts and rewrite rule parts of the obtained set quantity of deduplication rules that cover the largest numbers of different domain names.
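The ranking step can be sketched by counting, for each (suffix, rewrite pattern) pair, how many distinct domains share it and keeping the most widely shared pairs. The tuple layout of a rule and the name `top_shared_templates` are assumptions for illustration only.

```python
from collections import defaultdict


def top_shared_templates(rules, quantity):
    """Return the `quantity` (suffix, pattern) pairs used by the most distinct domains."""
    domains_by_template = defaultdict(set)
    for domain, suffix, _count, pattern in rules:
        domains_by_template[(suffix, pattern)].add(domain)
    ranked = sorted(
        domains_by_template,
        key=lambda template: len(domains_by_template[template]),
        reverse=True,
    )
    return ranked[:quantity]
```

Restricting the candidates this way limits the matching work for a new domain to templates that are common across many sites.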
15. A uniform resource locator (URL) deduplication device, characterized by comprising:
A deduplication rule base setup module, configured to preset a deduplication rule base according to the structure of URLs, wherein the deduplication rule base stores multiple deduplication rules, each deduplication rule corresponds to a different URL structure, and each deduplication rule is provided with a rewrite marker indicating the segment parameter that is rewritten in the corresponding URLs;
A URL capture module, configured to obtain the URL data to be deduplicated from website access data;
A matching module, configured to match the URLs to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the URLs; and
A deduplication module, configured to filter the matched URLs that correspond to the same deduplication rule, retaining one URL for each deduplication rule.
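The interplay of the matching module and the deduplication module can be sketched as a single function that keeps the first URL seen for each matched rule. The rule layout, the `*` marker, the path separators, and the decision to return unmatched URLs separately are assumptions; the claim itself does not say what happens to URLs that match no rule.

```python
import posixpath
import re
from urllib.parse import urlsplit

REWRITE = "*"  # hypothetical rewrite marker


def split_url(url):
    """Split a URL into (domain, suffix, segments); the separators are an assumption."""
    parts = urlsplit(url)
    stem, suffix = posixpath.splitext(parts.path.lstrip("/"))
    segments = tuple(s for s in re.split(r"[/_-]", stem) if s)
    return parts.netloc, suffix.lstrip("."), segments


def matches(rule, url):
    """Return True if the URL has the rule's structure and fits its pattern."""
    domain, suffix, count, pattern = rule
    url_domain, url_suffix, segments = split_url(url)
    if (url_domain, url_suffix, len(segments)) != (domain, suffix, count):
        return False
    return all(p == REWRITE or p == s for p, s in zip(pattern, segments))


def deduplicate(urls, rules):
    """Return (one representative URL per matched rule, URLs matching no rule)."""
    kept = {}
    unmatched = []
    for url in urls:
        rule = next((r for r in rules if matches(r, url)), None)
        if rule is None:
            unmatched.append(url)
        else:
            kept.setdefault(rule, url)  # retain only the first URL for each rule
    return list(kept.values()), unmatched
```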
16. The uniform resource locator (URL) deduplication device as claimed in claim 15, characterized in that the structure of the deduplication rule comprises a domain name parameter part, a suffix part, a segment count part and a rewrite rule part.
17. The uniform resource locator (URL) deduplication device as claimed in claim 16, characterized in that the rewrite rule part of the deduplication rule is provided with the rewrite marker of the segment parameter that is rewritten in the corresponding URLs.
18. The uniform resource locator (URL) deduplication device as claimed in claim 17, characterized in that the matching module further comprises:
A first filter unit, configured to filter out, according to the domain name parameter part of the deduplication rules, the URLs to be deduplicated whose domain name does not correspond;
A second filter unit, configured to filter out, according to the suffix part and the segment count part of the deduplication rules, the URLs to be deduplicated that do not correspond to the structure of any deduplication rule; and
A third filter unit, configured to filter out, according to the rewrite rule part of the deduplication rules, the URLs to be deduplicated that have not been rewritten.
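The three filter units narrow the candidate set in stages, from the cheapest check to the most specific one. A minimal sketch, assuming URLs are already split into `(domain, suffix, segments)` records and rules are `(domain, suffix, segment count, pattern)` tuples with `*` as the rewrite marker (all hypothetical), might be:

```python
REWRITE = "*"  # hypothetical rewrite marker


def three_stage_filter(records, rules):
    """Return the records surviving the domain, structure and rewrite-rule stages."""
    # Stage 1: keep records whose domain appears in some rule.
    domains = {rule[0] for rule in rules}
    stage1 = [rec for rec in records if rec[0] in domains]

    # Stage 2: keep records whose (domain, suffix, segment count) fits some rule.
    shapes = {(d, s, c) for d, s, c, _pattern in rules}
    stage2 = [rec for rec in stage1 if (rec[0], rec[1], len(rec[2])) in shapes]

    # Stage 3: keep records whose segments fit the rewrite rule part of a rule.
    def is_rewritten(rec):
        return any(
            (d, s, c) == (rec[0], rec[1], len(rec[2]))
            and all(p == REWRITE or p == v for p, v in zip(pattern, rec[2]))
            for d, s, c, pattern in rules
        )

    stage3 = [rec for rec in stage2 if is_rewritten(rec)]
    return stage1, stage2, stage3
```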
19. The uniform resource locator (URL) deduplication device as claimed in claim 15, characterized in that the URL capture module obtains the URL data to be deduplicated by means of a web crawler, or obtains the URL data to be deduplicated from the raw logs of web page access.
20. A uniform resource locator (URL) deduplication device as claimed in any one of claims 15 to 19, characterized in that the URL deduplication device comprises a deduplication rule generating device, and the deduplication rule generating device comprises:
A URL acquisition module, configured to acquire the URL data under the domain name for which a deduplication rule is to be generated;
A clustering module, configured to cluster the acquired URLs;
A splitting module, configured to split each clustered URL into a domain name parameter part, a suffix part, a segment count part and a segment parameter part, thereby forming multiple pieces of statistical information;
A statistical information acquisition module, configured to obtain the pieces of statistical information that have the same structure after splitting; and
A segment parameter replacement module, configured to replace, among the pieces of statistical information with the same structure, the corresponding segment parameter values that differ with a rewrite marker, and to generate a new deduplication rule from the statistical information in which the rewrite marker has been substituted.
21. The uniform resource locator (URL) deduplication device as claimed in claim 20, characterized in that the clustering module clusters the acquired URLs according to length and lexicographic character order.
22. The uniform resource locator (URL) deduplication device as claimed in claim 20, characterized in that the deduplication rule generating device further comprises:
A rewrite filtering module, configured to filter out, after the URL acquisition module has acquired the URL data under the domain name for which a deduplication rule is to be generated, the rewritten URLs from the URL data acquired by the URL acquisition module.
23. The uniform resource locator (URL) deduplication device as claimed in claim 22, characterized in that the rewrite filtering module further comprises:
A deduplication rule matching unit, configured to match the URL data acquired by the URL acquisition module against the existing deduplication rules; and
A URL data filtering unit, configured to filter out the URLs that the deduplication rule matching unit has matched to a corresponding deduplication rule.
24. The uniform resource locator (URL) deduplication device as claimed in claim 20, characterized in that the deduplication rule generating device further comprises:
A verification module, configured to perform, after the segment parameter replacement module has generated a new deduplication rule, matching verification of the new deduplication rule against all URLs of the same structure under the same domain name.
25. The uniform resource locator (URL) deduplication device as claimed in claim 24, characterized in that the verification module further comprises:
A selection unit, configured to obtain all URLs that have the same domain name parameter, suffix and segment count as the new deduplication rule; and
A matching judgment unit, configured to match the new deduplication rule against the obtained URLs, and to determine that verification has passed when the number of matched URLs exceeds a set threshold.
26. The uniform resource locator (URL) deduplication device as claimed in claim 20, characterized in that the deduplication rule generating device further comprises:
A review flag setting module, configured to set a pending-review flag on the new deduplication rule after the segment parameter replacement module has generated it.
27. A uniform resource locator (URL) deduplication device as claimed in any one of claims 15 to 19, characterized in that the URL deduplication device comprises a deduplication rule generating device, and the deduplication rule generating device comprises:
A deduplication rule acquisition module, configured to obtain the existing deduplication rules in the preset deduplication rule base, the structure of each deduplication rule comprising a domain name parameter part, a suffix part, a segment count part and a rewrite rule part;
A URL acquisition module, configured to acquire multiple URL data items under the domain name for which a deduplication rule is to be generated;
A suffix and rewrite rule matching module, configured to match the multiple URLs under the domain name for which a deduplication rule is to be generated against the suffix part and the rewrite rule part of the existing deduplication rules; and
A domain name parameter replacement module, configured to replace, when the number of matched URLs is greater than a set threshold, the domain name parameter part of the corresponding deduplication rule with the domain name for which a deduplication rule is to be generated, thereby generating a new deduplication rule.
28. The uniform resource locator (URL) deduplication device as claimed in claim 27, characterized in that the deduplication rule generating device further comprises:
A sorting module, configured to sort, among the existing deduplication rules obtained by the deduplication rule acquisition module, the deduplication rules whose suffix part and rewrite rule part are identical according to the number of different domain names they cover;
A deduplication rule screening module, configured to obtain, according to the sorting result of the sorting module, the suffix parts and rewrite rule parts of a set quantity of deduplication rules that cover the largest numbers of different domain names;
Wherein the suffix and rewrite rule matching module matches the multiple URLs under the domain name for which a deduplication rule is to be generated against the suffix parts and rewrite rule parts of the obtained set quantity of deduplication rules that cover the largest numbers of different domain names.
CN201410100765.2A 2014-03-18 2014-03-18 Uniform resource locator De-weight method and device Active CN104933056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410100765.2A CN104933056B (en) 2014-03-18 2014-03-18 Uniform resource locator De-weight method and device

Publications (2)

Publication Number Publication Date
CN104933056A (en) 2015-09-23
CN104933056B (en) 2019-08-13

Family

ID=54120224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410100765.2A Active CN104933056B (en) 2014-03-18 2014-03-18 Uniform resource locator De-weight method and device

Country Status (1)

Country Link
CN (1) CN104933056B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377260A (en) * 2012-04-28 2013-10-30 阿里巴巴集团控股有限公司 Analysis method and device of URLs (Uniform Resource Locator) of weblog
CN103530337A (en) * 2013-09-30 2014-01-22 北京奇虎科技有限公司 Device and method for recognizing invalid parameters in URL

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7185088B1 (en) * 2003-03-31 2007-02-27 Microsoft Corporation Systems and methods for removing duplicate search engine results

Also Published As

Publication number Publication date
CN104933056A (en) 2015-09-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240108

Address after: 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong 518057

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: Room 403, East Block 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518044

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

TR01 Transfer of patent right