CN104933056A - Uniform resource locator (URL) de-duplication method and device - Google Patents

Info

Publication number
CN104933056A
Authority
CN
China
Prior art keywords
duplicate removal
url
rule
uniform resource
resource locator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410100765.2A
Other languages
Chinese (zh)
Other versions
CN104933056B (en)
Inventor
何双宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410100765.2A
Publication of CN104933056A
Publication of CN104933056B

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a uniform resource locator (URL) de-duplication method and device. The method comprises: presetting a de-duplication rule base according to URL structures; acquiring the URL data to be de-duplicated from website access data; matching the URLs to be de-duplicated against the de-duplication rules in the rule base according to the structures and segmentation parameters of the URLs; and filtering the matched URLs that correspond to the same de-duplication rule, retaining one URL for each rule. With the method and device, massive URL data can be filtered and de-duplicated by means of the de-duplication rules, so that a security vulnerability scanner does not repeatedly scan the same common gateway interface (CGI) during URL vulnerability detection, which improves detection efficiency.

Description

Uniform resource locator (URL) de-duplication method and device
Technical field
The present invention relates to the field of network technology, and in particular to a uniform resource locator (URL) de-duplication method and device, and to corresponding de-duplication rule generation methods and devices.
Background
URL rewriting is a technique for rewriting URLs (Uniform Resource Locators) on the Internet. The server first obtains the URL request that the user side sends to access the website, then rewrites it into another URL that the website can process; what the user receives is the content returned for the processed URL.
For example, the news on many news websites falls into many categories, such as sports and technology, and new items are published every day. The news is therefore also classified by date, and each day's items carry a news index ID. When the user side accesses a sports news page, the website server, after receiving the access request, can use CGI (Common Gateway Interface) to form an intermediate URL that gives uniform access to the back-end data, as shown in formula (1):
http://www.qq.com/news/getNews?type=sports&date=20131120&id=1 (1)
An intermediate URL structured like formula (1) has several drawbacks: it is hard to remember, hard to read, and inconvenient to share on mobile terminals such as mobile phones and tablet computers. Many website servers therefore use URL rewriting to turn this intermediate URL structure into the form of formula (2):
http://www.qq.com/news/sports/20131120/1.html (2)
The URL structure of formula (2) overcomes the drawbacks of the intermediate URL of formula (1): the URL is shorter, easier to read and remember, and more convenient to share.
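The relationship between formulas (2) and (1) can be illustrated with a minimal Python sketch. The regular expression and the function name `to_cgi_path` are assumptions made for illustration only; the patent does not specify how the website server implements its rewrite rule.

```python
import re
from typing import Optional

# Hypothetical rewrite rule: maps the readable path of formula (2)
# back to the intermediate CGI path of formula (1).
PRETTY_NEWS = re.compile(
    r"^/news/(?P<type>[^/]+)/(?P<date>\d{8})/(?P<id>\d+)\.html$"
)

def to_cgi_path(pretty_path: str) -> Optional[str]:
    """Return the intermediate CGI path, or None if the path is not rewritten."""
    m = PRETTY_NEWS.match(pretty_path)
    if m is None:
        return None
    return "/news/getNews?type={type}&date={date}&id={id}".format(**m.groupdict())
```

Under this assumed rule, every path of the form /news/<type>/<date>/<id>.html resolves to the same getNews CGI, with only the parameter values differing.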
However, while URL rewriting brings many benefits, it also leads to the following phenomenon: one dynamic CGI can be the target of many rewritten URLs, so the number of URLs under a website grows. For example, the segmentation parameters in formula (1) are "getNews?type", "date", and "id". Now suppose the user side accesses a technology news page; formula (3) is the intermediate URL formed by the CGI:
http://www.qq.com/news/getNews?type=science&date=20131121&id=2 (3)
As can be seen, the segmentation parameters in formula (3) are likewise "getNews?type", "date", and "id"; only the values of the parameters differ. The two URLs of formula (3) and formula (1) are in fact generated by the same dynamic CGI. Formula (4) is the URL obtained after formula (3) is rewritten:
http://www.qq.com/news/science/20131121/2.html (4)
The URL of formula (4) and the URL of formula (2) are thus two different URLs, so the addresses generated by one dynamic CGI can be rewritten into many URLs. This phenomenon causes the following problem:
During URL security vulnerability detection, because URL rewriting may map many URLs to one dynamic CGI, detecting vulnerabilities in these URLs actually tests the same underlying dynamic CGI, so one CGI is scanned repeatedly. Moreover, for a large website, the number of URLs is very large and a single URL may carry several parameters, so the URL vulnerability scanner becomes extremely inefficient at detecting security vulnerabilities in the company's services. Repeatedly scanning the same dynamic CGI also imposes unnecessary performance loss and operating cost on the company's websites.
Summary of the invention
The object of the embodiments of the present invention is to provide a uniform resource locator (URL) de-duplication method and device, to solve the problem that URL rewriting reduces the efficiency of URL security vulnerability detection, causes performance loss on the website server, and increases operating cost.
An embodiment of the present invention proposes a URL de-duplication method, comprising:
presetting a de-duplication rule base according to URL structures, wherein the rule base stores multiple de-duplication rules, each rule corresponds to a different URL structure, and each rule contains rewrite marks that represent the rewritten segmentation parameters of the corresponding URL;
obtaining the URL data to be de-duplicated from website access data;
matching the URLs to be de-duplicated against the de-duplication rules in the rule base according to the structures and segmentation parameters of the URLs; and
filtering the matched URLs that correspond to the same de-duplication rule, retaining one URL for each de-duplication rule.
An embodiment of the present invention also proposes a de-duplication rule generation method, comprising:
obtaining the URL data under the domain name for which de-duplication rules are to be generated;
clustering the obtained URLs;
splitting the clustered URLs into a domain name parameter part, a suffix part, a section count part, and a segmentation parameter part, thereby forming multiple statistical records;
obtaining the statistical records that have the same structure after splitting; and
replacing with a rewrite mark each segmentation parameter whose value differs among the same-structure statistical records, and generating a new de-duplication rule from the statistical record in which the rewrite marks have been substituted.
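The generation steps above can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the function name `generate_rules`, the tuple layout, and the choice to skip groups with fewer than two URLs are all assumptions, and the clustering step is reduced to grouping by structure.

```python
from collections import defaultdict
from urllib.parse import urlsplit

REWRITE_MARK = "*"  # the rewrite mark used in the embodiments

def generate_rules(urls):
    """Group URLs by (host, suffix, sections) and generalise differing segments."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segs = [s for s in parts.path.split("/") if s]
        last = segs[-1] if segs else ""
        suffix = last.rsplit(".", 1)[1] if "." in last else ""
        groups[(parts.hostname or "", suffix, len(segs))].append(segs)

    rules = []
    for (host, suffix, sections), rows in groups.items():
        if len(rows) < 2:  # assumption: one URL alone reveals no differing values
            continue
        # A position whose values differ across the group is taken as rewritten.
        pattern = [REWRITE_MARK if len(set(col)) > 1 else col[0]
                   for col in zip(*rows)]
        if REWRITE_MARK in pattern:
            rules.append((host, suffix, sections, "/".join(pattern)))
    return rules
```

Applied to the URLs of formulas (2) and (4), this sketch yields the single rule ("www.qq.com", "html", 4, "news/*/*/*").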
An embodiment of the present invention further proposes another de-duplication rule generation method, comprising:
obtaining the existing de-duplication rules in a preset rule base, wherein the structure of each rule comprises a domain name parameter part, a suffix part, a section count part, and a rewrite rule part;
obtaining multiple URLs under the domain name for which de-duplication rules are to be generated;
matching the multiple URLs under that domain name against the suffix part and the rewrite rule part of the existing rules; and
when the number of matched URLs exceeds a set threshold, replacing the domain name parameter part of the corresponding rule with the domain name for which rules are to be generated, thereby generating a new de-duplication rule.
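A minimal Python sketch of this second method follows. The function name `derive_rule_for_domain`, the tuple layout (host, suffix, sections, rule_path), and the literal segment comparison are assumptions for illustration; the patent leaves the threshold value open.

```python
from urllib.parse import urlsplit

def derive_rule_for_domain(existing_rule, new_host, urls, threshold):
    """Reuse an existing rule for new_host if enough of its URLs fit the rule."""
    _, suffix, sections, rule_path = existing_rule
    pattern = rule_path.split("/")
    hits = 0
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname != new_host:
            continue
        segs = [s for s in parts.path.split("/") if s]
        last = segs[-1] if segs else ""
        url_suffix = last.rsplit(".", 1)[1] if "." in last else ""
        # Match on suffix and rewrite rule only; the old host is ignored.
        if (url_suffix == suffix and len(segs) == len(pattern)
                and all(p in ("*", s) for p, s in zip(pattern, segs))):
            hits += 1
    if hits > threshold:
        # Swap in the new domain name; the rest of the rule is unchanged.
        return (new_host, suffix, sections, rule_path)
    return None
```

The host names used to exercise this sketch would be placeholders; only the suffix and rule_path of the existing rule take part in the match, as the steps above describe.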
An embodiment of the present invention proposes a URL de-duplication device, comprising:
a rule base setting module, configured to preset a de-duplication rule base according to URL structures, wherein the rule base stores multiple de-duplication rules, each rule corresponds to a different URL structure, and each rule contains rewrite marks that represent the rewritten segmentation parameters of the corresponding URL;
a URL capture module, configured to obtain the URL data to be de-duplicated from website access data;
a matching module, configured to match the URLs to be de-duplicated against the de-duplication rules in the rule base according to the structures and segmentation parameters of the URLs; and
a de-duplication module, configured to filter the matched URLs that correspond to the same de-duplication rule, retaining one URL for each de-duplication rule.
An embodiment of the present invention also proposes a de-duplication rule generation device, comprising:
a URL acquisition module, configured to obtain the URL data under the domain name for which de-duplication rules are to be generated;
a clustering module, configured to cluster the obtained URLs;
a splitting module, configured to split the clustered URLs into a domain name parameter part, a suffix part, a section count part, and a segmentation parameter part, thereby forming multiple statistical records;
a statistical record acquisition module, configured to obtain the statistical records that have the same structure after splitting; and
a segmentation parameter replacement module, configured to replace with a rewrite mark each segmentation parameter whose value differs among the same-structure statistical records, and to generate a new de-duplication rule from the statistical record in which the rewrite marks have been substituted.
An embodiment of the present invention further proposes another de-duplication rule generation device, comprising:
a rule acquisition module, configured to obtain the existing de-duplication rules in a preset rule base, wherein the structure of each rule comprises a domain name parameter part, a suffix part, a section count part, and a rewrite rule part;
a URL acquisition module, configured to obtain multiple URLs under the domain name for which de-duplication rules are to be generated;
a suffix and rewrite rule matching module, configured to match the multiple URLs under that domain name against the suffix part and the rewrite rule part of the existing rules; and
a domain name parameter replacement module, configured to replace, when the number of matched URLs exceeds a set threshold, the domain name parameter part of the corresponding rule with the domain name for which rules are to be generated, thereby generating a new de-duplication rule.
Compared with the prior art, the beneficial effect of the present invention is as follows: with the method and device of the embodiments, massive URL data can be filtered and de-duplicated by means of the de-duplication rules, and the very small number of dynamic CGIs behind the rewritten URLs can be recovered from a large volume of URL data. This prevents the security vulnerability scanner from repeatedly scanning the same CGI during URL vulnerability detection, and so improves the efficiency of vulnerability detection.
Brief description of the drawings
Fig. 1 is a schematic diagram of the operating environment of a login address recording method and device according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a website server in Fig. 1;
Fig. 3 is a flowchart of a URL de-duplication method according to an embodiment of the present invention;
Fig. 4 is a flowchart of matching the URLs to be de-duplicated against the de-duplication rules according to an embodiment of the present invention;
Fig. 5 is a flowchart of the first de-duplication rule generation method according to an embodiment of the present invention;
Fig. 6 is another flowchart of the first de-duplication rule generation method according to an embodiment of the present invention;
Fig. 7 is a flowchart of the second de-duplication rule generation method according to an embodiment of the present invention;
Fig. 8 is another flowchart of the second de-duplication rule generation method according to an embodiment of the present invention;
Fig. 9 is a structural diagram of a URL de-duplication device according to an embodiment of the present invention;
Fig. 10 is a structural diagram of a matching module according to an embodiment of the present invention;
Fig. 11 is a structural diagram of the first de-duplication rule generation device according to an embodiment of the present invention;
Fig. 12 is another structural diagram of the first de-duplication rule generation device according to an embodiment of the present invention;
Fig. 13 is a structural diagram of a rewrite filtering module according to an embodiment of the present invention;
Fig. 14 is a structural diagram of a verification module according to an embodiment of the present invention;
Fig. 15 is a structural diagram of the second de-duplication rule generation device according to an embodiment of the present invention;
Fig. 16 is another structural diagram of the second de-duplication rule generation device according to an embodiment of the present invention.
Detailed description
The foregoing and other technical contents, features, and effects of the present invention are presented clearly in the following detailed description of preferred embodiments with reference to the drawings. Through the description of the embodiments, the technical means adopted by the invention to achieve its intended purpose, and their effects, can be understood more deeply and concretely. The accompanying drawings, however, are provided for reference and illustration only and are not intended to limit the invention.
The embodiments of the present invention propose a URL de-duplication method and device, and corresponding de-duplication rule generation methods and devices, which filter out duplicate rewritten URLs and generate de-duplication rules for that purpose. Refer to Fig. 1, a schematic diagram of the operating environment of the above methods and devices. One or more user sides 100 are connected through a network 300 to one or more website servers 200 (only one is shown in Fig. 1). The user side 100 may be any smart device with network capability, such as a tablet computer, mobile phone, e-reader, remote control, PC, notebook computer, in-vehicle device, Internet TV, or wearable device. The network 300 may be, for example, the Internet, a local area network, or an intranet.
Refer further to Fig. 2, a structural block diagram of one embodiment of the website server 200. As shown in Fig. 2, the website server 200 comprises a memory 102, a memory controller 104, one or more processors 106 (only one is shown), a peripheral interface 108, and a network controller 112. The structure shown in Fig. 2 is only illustrative and does not limit the structure of the website server 200. For example, the website server 200 may include more or fewer components than shown in Fig. 2, or have a different configuration.
The memory 102 can store software programs and modules, such as the program instructions and modules corresponding to the URL de-duplication method and device of the embodiments. The processor 106 runs the software programs and modules stored in the memory 102 to perform various functional applications and data processing, thereby implementing the above method.
The memory 102 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 102 may further include memory located remotely from the processor 106, and such remote memory may be connected to the website server 200 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. Access to the memory 102 by the processor 106 and other possible components takes place under the control of the memory controller 104.
The peripheral interface 108 couples various input/output devices to the processor 106. The processor 106 runs the software and instructions in the memory 102 and performs data processing. In some embodiments, the peripheral interface 108, the processor 106, and the memory controller 104 may be implemented in a single chip. In other examples, they may each be implemented in separate chips.
The network controller 112 receives and sends network signals, which may be wireless or wired. In one example, the network signal is a wired network signal, in which case the network controller 112 may include elements such as a processor, random access memory, a converter, and a crystal oscillator.
The software programs and modules include an operating system 122 and a website server module 224. The operating system 122 may be, for example, LINUX, UNIX, or WINDOWS. It may include various software components and/or drivers for managing system tasks (such as memory management, storage device control, and power management) and can communicate with various hardware or software components, thereby providing the runtime environment for other software components. The website server module 224 runs on top of the operating system 122. It listens for web access requests from the network through the network services of the operating system 122, performs the corresponding data processing for each request, and returns the resulting web page or data in another format to the user side 100. The website server module 224 may include, for example, dynamic web page scripts and a script interpreter. The script interpreter may be, for example, the Apache web server program, which processes dynamic web page scripts into a format acceptable to the client, such as HyperText Markup Language (HTML) or Extensible Markup Language (XML).
Embodiment one
Refer to Fig. 3, a flowchart of a URL de-duplication method according to an embodiment of the present invention, which comprises the following steps:
S301: preset a de-duplication rule base according to URL structures. The rule base stores multiple de-duplication rules; each rule corresponds to a different URL structure and contains rewrite marks that represent the rewritten segmentation parameters of the corresponding URL.
S302: obtain the URL data to be de-duplicated from website access data.
S303: match the URLs to be de-duplicated against the de-duplication rules in the rule base according to the structures and segmentation parameters of the URLs.
S304: filter the matched URLs that correspond to the same de-duplication rule, retaining one URL for each de-duplication rule.
According to the W3C (World Wide Web Consortium) standard, the structure of a URL can be decomposed into three main parts: the domain name parameter part (called the "host part" in the embodiments), the path parameter part (the "path part"), and the query parameter part (the "query part"). For example, for the URL http://www.qq.com/nba/lbj.php?age=30&sex=1, the host part is "www.qq.com", the path part is "/nba/lbj.php", and the query part is "age=30&sex=1", where the combination of parameter names in the query part is "age&sex". URLs can also be divided into three classes by the suffix of the path part: first, URLs with no suffix; second, URLs with a suffix such as php, jsp, html, or htm; third, URLs whose suffix is "/".
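The decomposition described above can be reproduced with Python's standard `urllib.parse.urlsplit`. The helper name `split_url` and the way the suffix class is computed are illustrative assumptions.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return (host, path, query, parameter-name combination, suffix)."""
    parts = urlsplit(url)
    host, path, query = parts.hostname or "", parts.path, parts.query
    # Combination of parameter names in the query part, e.g. "age&sex".
    names = "&".join(pair.split("=", 1)[0] for pair in query.split("&") if pair)
    last = path.rsplit("/", 1)[-1]
    if path.endswith("/"):
        suffix = "/"                      # third class: suffix is "/"
    elif "." in last:
        suffix = last.rsplit(".", 1)[1]   # second class: php, jsp, html, ...
    else:
        suffix = ""                       # first class: no suffix
    return host, path, query, names, suffix
```

For the example URL, this returns the host "www.qq.com", the path "/nba/lbj.php", the query "age=30&sex=1", the name combination "age&sex", and the suffix "php".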
In step S301, the de-duplication rules are used to identify whether a URL has been rewritten. Because the W3C division above does not capture the structural features of rewritten URLs, the embodiments, based on the structural relationship of a URL before and after rewriting, preferably define a de-duplication rule as comprising a domain name parameter part, a suffix part, a section count part, and a rewrite rule part. For ease of description, this embodiment writes a de-duplication rule in the form {host, suffix, sections, rule_path}. The rule thus consists of four parts: host (domain name parameter part), suffix (suffix part), sections (section count part), and rule_path (rewrite rule part).
The host part corresponds to the host of the rewritten URL. The suffix part corresponds to the suffix of the path part of the rewritten URL. The sections part corresponds to the number of segments obtained by splitting the path of the rewritten URL on the "/" symbol; each resulting string is called a segmentation parameter. The rule_path part is the rewrite rule for the path part of the rewritten URL: each segmentation parameter that has been rewritten is replaced with a rewrite mark (represented by "*" in the embodiments), which indicates that the segment is a rewritten segmentation parameter, while segmentation parameters that have not been rewritten are kept unchanged in rule_path.
Take the URL of formula (2) as an example: http://www.qq.com/news/sports/20131120/1.html. In this URL, the values "sports", "20131120", and "1" have all been rewritten into parameters of the dynamic CGI, so each is replaced with "*". The segment "news" has not been rewritten, so it is retained in the rule. The de-duplication rule for this URL can therefore be expressed as {www.qq.com, html, 4, news/*/*/*}.
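As an illustration of the {host, suffix, sections, rule_path} form, the following Python sketch models a rule as a small immutable record and derives it from a URL. The class name `DedupRule`, the function `rule_from_url`, and the `rewritten_positions` argument (the zero-based indices of the rewritten segments, which must be known in advance here) are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

REWRITE_MARK = "*"

@dataclass(frozen=True)
class DedupRule:
    host: str       # domain name parameter part
    suffix: str     # suffix of the path part
    sections: int   # number of "/"-separated path segments
    rule_path: str  # path with rewritten segments replaced by "*"

def rule_from_url(url, rewritten_positions):
    """Build a rule from a URL, given which segment indices were rewritten."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    suffix = last.rsplit(".", 1)[1] if "." in last else ""
    pattern = [REWRITE_MARK if i in rewritten_positions else seg
               for i, seg in enumerate(segments)]
    return DedupRule(parts.hostname or "", suffix, len(segments), "/".join(pattern))
```

Calling `rule_from_url("http://www.qq.com/news/sports/20131120/1.html", {1, 2, 3})` gives the rule {www.qq.com, html, 4, news/*/*/*} from the example above.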
A de-duplication rule thus corresponds to the structure of a URL, and its rewrite marks indicate whether the URL has been rewritten.
In step S302, the URL data of the website server can be captured mainly in two ways: first, by obtaining the URL data to be de-duplicated with a web crawler; second, by filtering the URL data to be de-duplicated out of the raw web access logs.
In step S303, the URLs to be de-duplicated can be matched one by one against the de-duplication rules in the rule base. If a URL matches a rule, the URL has been rewritten and may need to be filtered as a duplicate. Determining whether a URL has been rewritten also requires checking whether the segmentation parameters of the rule's rule_path contain a rewrite mark; if they do, the URL is determined to have been rewritten.
Furthermore, URLs that match no de-duplication rule are filtered out during matching. Referring to Fig. 4, the matching may specifically comprise the following steps:
S3031: according to the domain name parameter part of the de-duplication rules, filter out the URLs to be de-duplicated whose domain name has no corresponding rule. For example, if the domain name of a URL is "www.sina.com" and no rule in the rule base has www.sina.com as its host part, the URL is filtered out.
S3032: according to the suffix part and the section count part of the de-duplication rules, filter out the URLs to be de-duplicated whose structure corresponds to none of the rules. That is, if the suffix and the number of segments of a URL do not correspond to the suffix part and sections part of any rule in the rule base, the URL is filtered out.
S3033: according to the rewrite rule part of the de-duplication rules, filter out the URLs to be de-duplicated that have not been rewritten.
To further explain how URLs are matched against de-duplication rules, this embodiment discloses a matching algorithm, shown in Table 1:
Table 1
If the match succeeds, the algorithm returns the matched de-duplication rule, indicating that the URL is a rewritten URL; otherwise it returns an empty string, indicating that the URL is not a rewritten URL. Steps 1 and 2 describe the input and output. Step 3 marks the start of the algorithm. Step 4 checks whether a rule exists whose domain name parameter is the same as that of the input URL; if not, the input URL is filtered out. Steps 5 and 6 check whether a rule exists whose suffix part and section count part are the same as those of the input URL; if not, the input URL is filtered out. Steps 8, 9, and 10 apply when at least one rule is found whose domain name parameter part, suffix part, and section count part are the same as those of the input URL; the input URL is then checked against the rewrite rule part of each such rule in turn. Step 11 checks, for each such rule in turn, whether the number of segments in its rewrite rule part equals the value of its section count part. Steps 13, 14, and 15 output the rule if one corresponding to the input URL is found, and filter out the input URL otherwise.
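Only the step-by-step description of Table 1 is available here, so the following Python sketch is one possible reading of it rather than the algorithm as published. Rules are assumed to be (host, suffix, sections, rule_path) tuples, the function name `match_rule` is hypothetical, segments are compared literally, and the sketch returns None instead of an empty string when no rule matches.

```python
from urllib.parse import urlsplit

def match_rule(url, rules):
    """Return the first rule matching url, or None if the URL is filtered out."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    suffix = last.rsplit(".", 1)[1] if "." in last else ""

    # Domain name filter (S3031).
    candidates = [r for r in rules if r[0] == host]
    # Suffix and section count filter (S3032).
    candidates = [r for r in candidates
                  if r[1] == suffix and r[2] == len(segments)]
    # Rewrite rule filter (S3033).
    for rule in candidates:
        pattern = rule[3].split("/")
        if len(pattern) != rule[2]:   # rule_path must have `sections` segments
            continue
        if "*" not in pattern:        # no rewrite mark: not a rewritten URL
            continue
        if all(p == "*" or p == s for p, s in zip(pattern, segments)):
            return rule
    return None
```

The three filters run from cheapest to most specific, so most non-matching URLs are rejected before any segment-by-segment comparison is made.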
In step S304, after all the URLs to be de-duplicated have been matched, the URLs corresponding to each matched rule are collected. Because URLs that correspond to the same rule are rewritten from intermediate URLs produced by the same CGI, the URLs corresponding to the same rule are filtered, and only one URL is retained for each rule, for subsequent vulnerability detection or other data analysis.
For example, the two URLs "http://www.qq.com/news/sports/20131120/1.html" and "http://www.qq.com/news/science/20131121/2.html" both match the rule {www.qq.com, html, 4, news/*/*/*}. This shows that both URLs have been rewritten and that their pre-rewrite URLs were generated by the same CGI, so either one of the two can be filtered out.
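The filtering of step S304 can be sketched in Python as follows. This is an illustration under stated assumptions: rules are (host, suffix, sections, rule_path) tuples, `find_rule` is a compact stand-in for the matching step, and keeping the first URL seen for each rule is an arbitrary choice, since the text allows either URL to be kept.

```python
from urllib.parse import urlsplit

def find_rule(url, rules):
    """Compact matcher: return the first rule the URL fits, else None."""
    parts = urlsplit(url)
    segs = [s for s in parts.path.split("/") if s]
    suffix = segs[-1].rsplit(".", 1)[1] if segs and "." in segs[-1] else ""
    for host, suf, sections, rule_path in rules:
        pat = rule_path.split("/")
        if ((host, suf, sections) == (parts.hostname, suffix, len(segs))
                and len(pat) == len(segs)
                and all(p in ("*", s) for p, s in zip(pat, segs))):
            return (host, suf, sections, rule_path)
    return None

def dedupe(urls, rules):
    """Keep one URL per matched rule; also return the URLs that matched no rule."""
    kept, unmatched = {}, []
    for url in urls:
        rule = find_rule(url, rules)
        if rule is None:
            unmatched.append(url)
        elif rule not in kept:
            kept[rule] = url          # first URL seen represents the rule
    return list(kept.values()), unmatched
```

The unmatched list is returned separately so that those URLs remain available for other processing instead of being silently dropped.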
With the method of this embodiment, massive URL data can be filtered and de-duplicated by means of the de-duplication rules, and the very small number of dynamic CGIs behind the rewritten URLs can be recovered from a large volume of URL data. This prevents the security vulnerability scanner from repeatedly scanning the same CGI during URL vulnerability detection, and so improves the efficiency of vulnerability detection.
In addition, when web access request traffic is analyzed, the page request traffic that is counted is the traffic of the rewritten URLs, and the traffic of the actual CGIs cannot be counted. In a defense system against website CC attacks (Challenge Collapsar, a common website attack in which the attacker controls a number of hosts that continuously send large numbers of packets to the target server, exhausting its resources until it crashes), the traffic that needs to be counted and audited is that of the CGIs behind the rewritten URLs. The method of this embodiment can therefore also improve the efficiency of data analysis for CC attacks.
The embodiments of the present invention also propose methods for generating de-duplication rules. Rule generation and URL de-duplication can be run in an iterative loop: for example, the URLs that are filtered out during matching without having been de-duplicated can serve as the basis for generating new rules, which improves the rule base and extends the de-duplication coverage. The embodiments disclose two methods for generating rules: one builds rules from the data structure of the URLs; the other applies existing rules to a new domain name to form new rules. Both are described in detail below:
Embodiment two
Referring to Fig. 5, which is a flowchart of the first de-duplication rule generation method of the embodiment of the present invention, the method comprises the following steps:
S501: obtain the URL data under the domain name for which a de-duplication rule is to be generated. That is, URLs with the same host part are read from the data center of the website server; the URLs read must be URLs that have not been rewritten.
S502: cluster the obtained URLs. Specifically, the URLs read in step S501 are clustered according to length and lexicographic (dictionary) order. Clustering speeds up the subsequent computation and improves the efficiency of rule generation.
Here, length refers to the number of segments of a URL, in other words the number of segments delimited by the "/" symbol. For example, the paths of the two URLs "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.qq.com/news/getNews?type=science&date=20131121&id=2" each contain two "/" symbols, so the two URLs are considered to have the same length.
Lexicographic order refers to the dictionary order of the characters in a URL. For example, the domain name parts of the two URLs "http://www.aa.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.aa.com/news/getNews?type=science&date=20131121&id=2" both contain "aa", so these two URLs can be placed in the same class; likewise, the domain name parts of the two URLs "http://www.bb.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.bb.com/news/getNews?type=science&date=20131121&id=2" both contain "bb", so these two URLs can be placed in the same class.
S503: split the clustered URLs into a domain name part, a suffix part, a segment count part and a segmentation parameter part, and form multiple pieces of statistical information.
Specifically, this step analyzes the structure of the clustered URLs and forms statistical information of the form {host, suffix, sections, index, path}. The statistical information can be stored in the memory of the website server or in the system cache. Here, index is the index position of each segment of the path part.
For example, the statistical information of the URL "http://www.qq.com/news/sports/20131120/1.html" is {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1}. The index represents the storage positions of the path part, that is, "0/1/2/3" corresponds one-to-one with "news/sports/20131120/1".
S504: obtain the pieces of statistical information that have the same structure after splitting.
For example, the statistical information of the URL "http://www.qq.com/news/sports/20131120/1.html" is {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1}, and the statistical information of the URL "http://www.qq.com/news/science/20131121/2.html" is {www.qq.com, html, 4, 0/1/2/3, news/science/20131121/2}. The statistical information of these two URLs is therefore considered to have the same structure.
S505: in statistical information of the same structure, replace the corresponding segmentation parameters whose values differ with a rewrite marker, and generate a new de-duplication rule from the statistical information after replacement.
For example, for the two pieces of statistical information {www.qq.com, html, 4, 0/1/2/3, news/sports/20131120/1} and {www.qq.com, html, 4, 0/1/2/3, news/science/20131121/2}, each segmentation parameter of the path part is compared position by position according to the index values. If the values of corresponding segmentation parameters differ, they are replaced with the rewrite marker, so the two pieces of statistical information become {www.qq.com, html, 4, 0/1/2/3, news/*/*/*}. The index part is then removed, which yields the new de-duplication rule {www.qq.com, html, 4, news/*/*/*}. Finally, the newly generated rule is stored in the de-duplication rule base, where it can be queried during URL de-duplication matching.
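Steps S502 to S505 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the names `url_stats`, `cluster`, `structure` and `generate_rules` are hypothetical, statistical information is modelled as a dictionary, and every group of same-structure URLs is folded into a single rule.

```python
from itertools import groupby
from urllib.parse import urlsplit


def url_stats(url):
    """S503: split a URL into {host, suffix, sections, index, path}."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return {
        "host": parts.hostname,
        "suffix": suffix,
        "sections": len(segments),
        "index": "/".join(str(i) for i in range(len(segments))),
        "path": "/".join(segments),
    }


def cluster(urls):
    """S502: order URLs by segment count, then in dictionary order."""
    return sorted(urls, key=lambda u: (url_stats(u)["sections"], u))


def structure(stats):
    """Records with equal host, suffix, sections and index share a structure."""
    return (stats["host"], stats["suffix"], stats["sections"], stats["index"])


def generate_rules(urls):
    """S503 to S505: build one rule per group of same-structure records."""
    records = sorted((url_stats(u) for u in cluster(urls)), key=structure)
    rules = []
    for (host, suffix, sections, _index), group in groupby(records, key=structure):
        paths = [record["path"].split("/") for record in group]
        if len(paths) < 2:
            continue  # a single record gives nothing to compare against
        # S505: a position whose values differ becomes the rewrite marker '*'.
        rule_path = "/".join(
            column[0] if len(set(column)) == 1 else "*" for column in zip(*paths)
        )
        rules.append((host, suffix, sections, rule_path))
    return rules
```

Folding a whole same-structure group into one rule is a simplification; a production system would likely subdivide groups further before comparing segments.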
Referring to Fig. 6, which is another flowchart of the first de-duplication rule generation method of the embodiment of the present invention, the method comprises the following steps:
S601: obtain the URL data under the domain name for which a de-duplication rule is to be generated.
S602: filter out the rewritten URLs from the obtained URL data. Generating statistical information requires analyzing the structure, segmentation parameters and parameter values of each URL, and the segmentation parameters of a rewritten URL cannot be identified. To ensure that the computation does not go wrong, this step therefore filters the obtained URLs further and removes the rewritten ones.
Specifically, the URLs obtained in step S601 can be matched against the existing de-duplication rules in the rule base using the matching algorithm described in embodiment one (the de-duplication matching algorithm of Table 1). If a URL matches a de-duplication rule, the URL has been rewritten and is filtered out.
S603: cluster the obtained URLs according to length and lexicographic order.
S604: split the clustered URLs into a domain name part, a suffix part, a segment count part and a segmentation parameter part, and form multiple pieces of statistical information.
S605: obtain the pieces of statistical information that have the same structure after splitting.
S606: in statistical information of the same structure, replace the corresponding segmentation parameters whose values differ with a rewrite marker, and generate a new de-duplication rule from the statistical information after replacement.
S607: verify the new de-duplication rule by matching it against all URLs of the same structure under the same domain name. The purpose of this step is to further verify the usability of the newly generated rule.
Specifically, verification can again use the matching algorithm described in embodiment one (the de-duplication matching algorithm of Table 1). First, the URLs that have the same domain name parameter, suffix and segment count as the new rule are obtained. Next, the new rule is matched against the obtained URLs. When the number of matching URLs exceeds a set threshold, the new rule can be used to identify whether a URL has been rewritten, and verification passes.
S608: set a pending-review flag on the new de-duplication rule. To further prevent computation errors from affecting the automatic generation of subsequent rules and the automatic operation of the URL de-duplication program, a pending-review flag is attached to each newly generated rule so that the rule can be reviewed manually. Only after the review is passed is the flag removed and the rule formally put into use.
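Steps S607 and S608 can be sketched as follows. The sketch assumes that "*" matches any single segment and that the rule base is a dictionary keyed by rule; the names `rule_matches`, `verify_rule` and `register_rule` and the `pending` field are hypothetical and do not come from the patent.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (host, suffix, segments) for a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return parts.hostname, suffix, segments


def rule_matches(rule, url):
    """True when the URL has the rule's domain, suffix, segment count and path pattern."""
    host, suffix, sections, rule_path = rule
    u_host, u_suffix, segments = split_url(url)
    if (u_host, u_suffix, len(segments)) != (host, suffix, sections):
        return False
    return all(p == "*" or p == s for p, s in zip(rule_path.split("/"), segments))


def verify_rule(rule, urls, threshold):
    """S607: the rule passes when more than `threshold` URLs match it."""
    hits = sum(1 for url in urls if rule_matches(rule, url))
    return hits > threshold


def register_rule(rule_base, rule, urls, threshold):
    """S607 and S608: store a verified rule with a pending-review flag set."""
    if not verify_rule(rule, urls, threshold):
        return False
    # The flag stays set until a human reviewer approves the rule.
    rule_base[rule] = {"pending": True}
    return True
```

A reviewer would later clear the `pending` entry, after which the rule would be treated as in use.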
The method of this embodiment generates de-duplication rules automatically and thereby further improves the de-duplication rule base, so that the de-duplication method of embodiment one de-duplicates URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
Embodiment three
Referring to Fig. 7, which is a flowchart of the second de-duplication rule generation method of the embodiment of the present invention, the method comprises the following steps:
S701: obtain the existing de-duplication rules in the preset de-duplication rule base, where the structure of a de-duplication rule comprises a domain name part, a suffix part, a segment count part and a rewrite rule part.
S702: obtain multiple URLs under the domain name for which a de-duplication rule is to be generated. The number of URLs obtained should not be too small; in general it should be no fewer than 5000.
S703: match the multiple URLs under the domain name for which a rule is to be generated against the suffix part and the rewrite rule part of an existing de-duplication rule.
S704: when the number of matching URLs is greater than a set threshold, replace the domain name part of the corresponding de-duplication rule with the domain name for which a rule is to be generated, thereby generating a new de-duplication rule.
For example, suppose {www.qq.com, html, 4, news/*/*/*} is an existing rule in the de-duplication rule base. Its suffix part and rewrite rule part are "html" and "news/*/*/*" respectively, and these two parts are matched against the URLs for which a rule is to be generated; the matching can use the matching algorithm described in embodiment one (the de-duplication matching algorithm of Table 1). If the number of matching URLs exceeds the set threshold (for example 100), the matching URLs have been rewritten; in other words, the URLs to be de-duplicated can be de-duplicated using the "html" and "news/*/*/*" parts. The domain name of these URLs is then substituted directly into the domain name part of the corresponding rule. If the domain name of the URLs to be de-duplicated is "www.sina.com", the new rule generated after the replacement is {www.sina.com, html, 4, news/*/*/*}.
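The transfer of an existing rule to a new domain name (S703 and S704) can be sketched as follows. This is an illustrative sketch only: `transfer_rule` and `split_url` are hypothetical names, and "*" is assumed to match any single segment.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (host, suffix, segments) for a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        segments[-1], suffix = segments[-1].rsplit(".", 1)
    return parts.hostname, suffix, segments


def transfer_rule(rule, new_host, urls, threshold):
    """Return the rule re-targeted at new_host, or None if too few URLs match."""
    _old_host, suffix, sections, rule_path = rule
    pattern = rule_path.split("/")
    hits = 0
    for url in urls:
        host, u_suffix, segments = split_url(url)
        # S703: only the suffix part and the rewrite rule part are compared.
        if host != new_host or u_suffix != suffix or len(segments) != len(pattern):
            continue
        if all(p == "*" or p == s for p, s in zip(pattern, segments)):
            hits += 1
    # S704: replace the domain name part once enough URLs match.
    if hits > threshold:
        return (new_host, suffix, sections, rule_path)
    return None
```

With the rule {www.qq.com, html, 4, news/*/*/*} and enough matching URLs under "www.sina.com", the function returns the rule with "www.sina.com" as its domain name part.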
Referring to Fig. 8, which is another flowchart of the second de-duplication rule generation method of the embodiment of the present invention, the method comprises the following steps:
S801: obtain the existing de-duplication rules in the preset de-duplication rule base, where the structure of a de-duplication rule comprises a domain name part, a suffix part, a segment count part and a rewrite rule part.
S802: among the existing de-duplication rules, sort the rules whose suffix part and rewrite rule part are identical according to the number of different domain names they cover.
S803: according to the sorting result, obtain the suffix part and rewrite rule part of a set quantity of de-duplication rules that cover the largest numbers of different domain names.
The de-duplication rule base usually contains a large number of rules. If every existing rule were matched once, rule generation would be very inefficient. To improve computational efficiency, the existing rules with identical suffix part and rewrite rule part are therefore first sorted in step S802, and the set quantity of rules covering the largest numbers of different domain names is then selected from the sorting result in step S803.
For example, suppose the de-duplication rule base contains 6 de-duplication rules:
{www.aa.com,html,4,news/*/*/*}
{www.bb.com,html,4,news/*/*/*}
{www.cc.com,html,4,news/*/*/*}
{www.aa.com,html,3,news/*/*}
{www.bb.com,html,3,news/*/*}
{www.aa.com,html,3,news/*}
After sorting, 3 different domain names share "html" and "news/*/*/*", namely "www.aa.com", "www.bb.com" and "www.cc.com"; 2 different domain names share "html" and "news/*/*", namely "www.aa.com" and "www.bb.com"; and 1 domain name uses "html" and "news/*", namely "www.aa.com". Now suppose that two groups of suffix part and rewrite rule part are to be selected. According to the sorting result, the pairs "html" and "news/*/*/*" and "html" and "news/*/*" correspond to the largest numbers of different domain names, so these two groups of suffix part and rewrite rule part are extracted. The pair "html" and "news/*" corresponds to only one domain name, which indicates that it applies to few domain names, that is, its generality is low, so it is filtered out.
S804: obtain multiple URLs under the domain name for which a de-duplication rule is to be generated.
S805: match the multiple URLs under the domain name for which a rule is to be generated against the suffix part and rewrite rule part of the selected set quantity of de-duplication rules that cover the largest numbers of different domain names. The matching algorithm described in embodiment one (the de-duplication matching algorithm of Table 1) can likewise be used here.
S806: when the number of matching URLs is greater than a set threshold, replace the domain name part of the corresponding de-duplication rule with the domain name for which a rule is to be generated, thereby generating a new de-duplication rule.
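The sorting and selection of steps S802 and S803 can be sketched as follows, using the six example rules above. The function name `top_patterns` and the representation of a rule as a (host, suffix, sections, rule_path) tuple are assumptions made for illustration.

```python
from collections import defaultdict


def top_patterns(rules, quantity):
    """Return the `quantity` (suffix, rule_path) pairs shared by the most domain names."""
    domains = defaultdict(set)
    for host, suffix, _sections, rule_path in rules:
        # S802: group rules whose suffix part and rewrite rule part are identical.
        domains[(suffix, rule_path)].add(host)
    # S803: rank the groups by the number of different domain names, largest first.
    ranked = sorted(domains, key=lambda pair: len(domains[pair]), reverse=True)
    return ranked[:quantity]


rules = [
    ("www.aa.com", "html", 4, "news/*/*/*"),
    ("www.bb.com", "html", 4, "news/*/*/*"),
    ("www.cc.com", "html", 4, "news/*/*/*"),
    ("www.aa.com", "html", 3, "news/*/*"),
    ("www.bb.com", "html", 3, "news/*/*"),
    ("www.aa.com", "html", 3, "news/*"),
]
selected = top_patterns(rules, 2)
```

Only the selected pairs are then matched against the URLs of the new domain name in step S805, which avoids trying every rule in the rule base.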
The method of this embodiment generates de-duplication rules automatically and thereby further improves the de-duplication rule base, so that the de-duplication method of embodiment one de-duplicates URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
Embodiment four
Corresponding to the method for previous embodiment one, the embodiment of the present invention also proposes a kind of URL(uniform resource locator) duplicate removal device, refer to Fig. 9, this device comprises: duplicate removal rule base arranges module 901, URL(uniform resource locator) handling module 902, matching module 903 and duplicate removal module 904.
Duplicate removal rule base arranges module 901 for presetting duplicate removal rule base according to the structure of URL(uniform resource locator), multiple duplicate removal rule is deposited in described duplicate removal rule base, the different structure of the corresponding URL(uniform resource locator) of each duplicate removal rule, and the rewriting mark being provided with the segmentation parameter representing writing in corresponding URL(uniform resource locator) in described duplicate removal rule.The structure of described duplicate removal rule can comprise domain name parameters part, suffix portion, division number part and rewriting rule part.Described rewriting mark can be arranged on described rewriting rule part.
URL(uniform resource locator) handling module 902 for obtaining the URL(uniform resource locator) data wanting duplicate removal from website visitation data.Want the URL(uniform resource locator) data of duplicate removal described in described URL(uniform resource locator) handling module 902 is obtained by web crawlers, or described in obtaining from the original log of web page access, want the URL(uniform resource locator) data of duplicate removal.
The described URL(uniform resource locator) of duplicate removal of wanting, for according to the structure of URL(uniform resource locator) and segmentation parameter, is mated with the duplicate removal rule in described duplicate removal rule base by matching module 903.
Duplicate removal module 904 is filtered for the URL(uniform resource locator) corresponding with identical duplicate removal rule matched by matching module 903, and corresponding each duplicate removal rule retains a URL(uniform resource locator).
Furthermore, in URL and duplicate removal rule match process, coupling will be fallen less than the url filtering of duplicate removal rule by matching module 903, and refer to Figure 10, matching module 903 can also comprise further: the first filter element 9031, second filter element 9032 and the 3rd filter element 9033.
First filter element 9031, for according to the domain name parameters part in described duplicate removal rule, wants the URL(uniform resource locator) that in the URL(uniform resource locator) of duplicate removal, domain name is not corresponding described in filtering out.
Second filter element 9032, for according to the suffix portion in described duplicate removal rule and division number part, is wanted in the URL(uniform resource locator) of duplicate removal, with the URL(uniform resource locator) that all duplicate removal regular textures are not corresponding described in filtering out.
3rd filter element 9033, for according to the rewriting rule part in described duplicate removal rule, wants do not have writing URL(uniform resource locator) in the URL(uniform resource locator) of duplicate removal described in filtering out.
By the device of the present embodiment, filtration duplicate removal can be carried out to magnanimity url data by duplicate removal rule, the dynamic CGI after their rewritings of minute quantity is restored from a large amount of url datas, avoid when URL security breaches detect, security scan thinks highly of the same CGI of multiple scanning, thus improves the detection efficiency of security breaches.
Embodiment five
Corresponding to embodiment two, the embodiment of the present invention also provides a first de-duplication rule generation device. Referring to Fig. 11, the device comprises: a URL acquisition module 1101, a clustering module 1102, a splitting module 1103, a statistical information acquisition module 1104 and a segmentation parameter replacement module 1105.
The URL acquisition module 1101 is configured to obtain the URL data under the domain name for which a de-duplication rule is to be generated.
The clustering module 1102 is configured to cluster the obtained URLs. The clustering module 1102 may cluster the obtained URLs according to length and lexicographic order.
The splitting module 1103 is configured to split the clustered URLs into a domain name part, a suffix part, a segment count part and a segmentation parameter part, and to form multiple pieces of statistical information.
The statistical information acquisition module 1104 is configured to obtain the pieces of statistical information that have the same structure after splitting.
The segmentation parameter replacement module 1105 is configured to replace, in statistical information of the same structure, the corresponding segmentation parameters whose values differ with a rewrite marker, and to generate a new de-duplication rule from the statistical information after replacement.
Referring to Fig. 12, which is another structural diagram of the first de-duplication rule generation device of the embodiment of the present invention, in addition to the URL acquisition module 1101, the clustering module 1102, the splitting module 1103, the statistical information acquisition module 1104 and the segmentation parameter replacement module 1105, the device further comprises: a rewrite filter module 1106, a verification module 1107 and a review flag setting module 1108.
The rewrite filter module 1106 is configured to filter out the rewritten URLs from the URL data obtained by the URL acquisition module 1101, after the URL acquisition module 1101 has obtained the URL data under the domain name for which a de-duplication rule is to be generated.
The verification module 1107 is configured to verify the new de-duplication rule, after the segmentation parameter replacement module 1105 has generated it, by matching it against all URLs of the same structure under the same domain name.
The review flag setting module 1108 is configured to set a pending-review flag on the new de-duplication rule after the segmentation parameter replacement module 1105 has generated it.
Referring to Fig. 13, which is a structural diagram of the rewrite filter module of the embodiment of the present invention, the rewrite filter module 1106 comprises: a de-duplication rule matching unit 11061 and a URL data filter unit 11062.
The de-duplication rule matching unit 11061 is configured to match the URL data obtained by the URL acquisition module 1101 against the existing de-duplication rules.
The URL data filter unit 11062 is configured to filter out the URLs that the de-duplication rule matching unit 11061 has matched to a de-duplication rule.
Referring to Fig. 14, which is a structural diagram of the verification module of the embodiment of the present invention, the verification module 1107 comprises: a selection unit 11071 and a match determination unit 11072.
The selection unit 11071 is configured to obtain all URLs that have the same domain name parameter, suffix and segment count as the new de-duplication rule.
The match determination unit 11072 is configured to match the new de-duplication rule against the URLs obtained by the selection unit 11071; when the number of matching URLs exceeds a set threshold, verification passes.
The device of this embodiment generates de-duplication rules automatically and thereby further improves the de-duplication rule base, so that the de-duplication method of embodiment one and the de-duplication device of embodiment four de-duplicate URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
Embodiment six
Corresponding to embodiment three, the embodiment of the present invention also provides a second de-duplication rule generation device. Referring to Fig. 15, the device comprises: a de-duplication rule acquisition module 1501, a URL acquisition module 1502, a suffix and rewrite rule matching module 1503 and a domain name parameter replacement module 1504.
The de-duplication rule acquisition module 1501 is configured to obtain the existing de-duplication rules in the preset de-duplication rule base, where the structure of a de-duplication rule comprises a domain name part, a suffix part, a segment count part and a rewrite rule part.
The URL acquisition module 1502 is configured to obtain multiple URLs under the domain name for which a de-duplication rule is to be generated.
The suffix and rewrite rule matching module 1503 is configured to match the multiple URLs under the domain name for which a rule is to be generated against the suffix part and rewrite rule part of an existing de-duplication rule.
The domain name parameter replacement module 1504 is configured to replace, when the number of matching URLs is greater than a set threshold, the domain name part of the corresponding de-duplication rule with the domain name for which a rule is to be generated, thereby generating a new de-duplication rule.
Referring to Fig. 16, which is another structural diagram of the second de-duplication rule generation device of the embodiment of the present invention, in addition to the de-duplication rule acquisition module 1501, the URL acquisition module 1502, the suffix and rewrite rule matching module 1503 and the domain name parameter replacement module 1504, the device further comprises: a sorting module 1505 and a de-duplication rule screening module 1506.
The sorting module 1505 is configured to sort, among the existing de-duplication rules obtained by the de-duplication rule acquisition module 1501, the rules whose suffix part and rewrite rule part are identical according to the number of different domain names they cover.
The de-duplication rule screening module 1506 is configured to obtain, according to the sorting result of the sorting module 1505, the suffix part and rewrite rule part of a set quantity of de-duplication rules that cover the largest numbers of different domain names.
The suffix and rewrite rule matching module 1503 then matches the multiple URLs under the domain name for which a rule is to be generated against the suffix part and rewrite rule part of the selected set quantity of de-duplication rules that cover the largest numbers of different domain names.
The device of this embodiment generates de-duplication rules automatically and thereby further improves the de-duplication rule base, so that the de-duplication method of embodiment one and the de-duplication device of embodiment four de-duplicate URLs more efficiently and accurately. This in turn further improves the efficiency of URL security vulnerability detection and the efficiency of data analysis when CGI traffic is counted.
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of the present invention can be implemented in hardware, or in software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the embodiments of the present invention can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a portable hard disk) and comprises a number of instructions that cause a computer device (such as a personal computer, a server or a network device) to perform the methods described in the implementation scenarios of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims (28)

1. A uniform resource locator (URL) de-duplication method, characterized by comprising:
presetting a de-duplication rule base according to the structures of URLs, wherein the de-duplication rule base stores multiple de-duplication rules, each de-duplication rule corresponds to a different URL structure, and each de-duplication rule is provided with a rewrite marker representing the rewritten segmentation parameters in the corresponding URLs;
obtaining URL data to be de-duplicated from website access data;
matching the URLs to be de-duplicated against the de-duplication rules in the de-duplication rule base according to the structures and segmentation parameters of the URLs; and
filtering the matched URLs that correspond to the same de-duplication rule, and retaining one URL for each de-duplication rule.
2. The URL de-duplication method according to claim 1, characterized in that the structure of the de-duplication rule comprises a domain name part, a suffix part, a segment count part and a rewrite rule part.
3. The URL de-duplication method according to claim 2, characterized in that the rewrite rule part of the de-duplication rule is provided with the rewrite marker of the rewritten segmentation parameters in the corresponding URLs.
4. The URL de-duplication method according to claim 3, characterized in that the step of matching the URLs to be de-duplicated against the de-duplication rules in the de-duplication rule base comprises:
filtering out, according to the domain name part of the de-duplication rules, those URLs to be de-duplicated whose domain name does not correspond;
filtering out, according to the suffix part and segment count part of the de-duplication rules, those URLs to be de-duplicated whose structure corresponds to none of the de-duplication rules; and
filtering out, according to the rewrite rule part of the de-duplication rules, those URLs to be de-duplicated that have not been rewritten.
5. The URL de-duplication method according to claim 1, characterized in that the step of obtaining URL data to be de-duplicated from website access data comprises: obtaining the URL data to be de-duplicated by means of a web crawler, or obtaining the URL data to be de-duplicated from the raw logs of web page access.
6. the duplicate removal rule generating method as described in any one of Claims 1 to 5, is characterized in that, comprising:
Acquisition will generate the URL(uniform resource locator) data under the domain name of duplicate removal rule;
Cluster is carried out to the URL(uniform resource locator) of described acquisition;
URL(uniform resource locator) after cluster is split according to domain name parameters part, suffix portion, division number part and segmentation parameter part, and forms many statistical informations;
Obtain mutually isostructural statistical information after over-segmentation; And
Corresponding segments parameter values different for mutually isostructural statistical information intermediate value is replaced with and rewrites mark, and generated new duplicate removal rule by replacing the statistical information rewriteeing mark.
7. duplicate removal rule generating method as claimed in claim 6, it is characterized in that, the step that the described URL(uniform resource locator) to described acquisition carries out cluster comprises: carry out cluster to the URL(uniform resource locator) of described acquisition according to length and character words canonical ordering.
8. duplicate removal rule generating method as claimed in claim 6, it is characterized in that, described acquisition comprises after will generating the step of the URL(uniform resource locator) data under the domain name of duplicate removal rule: filter out URL(uniform resource locator) writing in the described URL(uniform resource locator) data of acquisition.
9. The de-duplication rule generation method as claimed in claim 8, characterized in that the step of filtering out the already-rewritten URLs from the acquired URL data comprises:
Matching the acquired URL data against the existing de-duplication rules; and
Filtering out the URLs that match a corresponding de-duplication rule.
10. The de-duplication rule generation method as claimed in claim 6, characterized in that, after the step of generating the new de-duplication rule, the method comprises: verifying the new de-duplication rule by matching it against all URLs of the same structure under the same domain name.
11. The de-duplication rule generation method as claimed in claim 10, characterized in that the step of verifying the new de-duplication rule by matching it against all URLs of the same structure under the same domain name comprises:
Obtaining all URLs that have the same domain name parameter, suffix, and segment count as the new de-duplication rule;
Matching the new de-duplication rule against the obtained URLs; and
When the number of matching URLs exceeds a set threshold, determining that the verification has passed.
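The verification of claim 11 might be sketched as follows. The rule is assumed to be a tuple `(domain, suffix, segment_count, pattern)` with `*` as the rewrite mark; that layout and the names `split_url`, `matches`, and `verify_rule` are illustrative assumptions, not taken from the patent.

```python
from urllib.parse import urlsplit

REWRITE_MARK = "*"  # hypothetical symbol for a rewritten segment


def split_url(url):
    """Split a URL into (domain, suffix, segment count, path segments)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    suffix = last.rsplit(".", 1)[1] if "." in last else ""
    return parts.netloc, suffix, len(segments), tuple(segments)


def matches(rule, url):
    """True if the URL has the rule's structure and fits its pattern."""
    domain, suffix, count, segments = split_url(url)
    if (domain, suffix, count) != tuple(rule[:3]):
        return False
    return all(p in (REWRITE_MARK, s) for p, s in zip(rule[3], segments))


def verify_rule(rule, urls, threshold):
    """Pass when more than `threshold` same-structure URLs match the rule."""
    candidates = [u for u in urls if split_url(u)[:3] == tuple(rule[:3])]
    matched = sum(1 for u in candidates if matches(rule, u))
    return matched > threshold
```

The threshold value is left to the caller, since the claim only requires that it be set in advance.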
12. The de-duplication rule generation method as claimed in claim 6, characterized in that, after the step of generating the new de-duplication rule, the method comprises: setting a pending-review mark on the new de-duplication rule.
13. A de-duplication rule generation method as claimed in any one of claims 1 to 5, characterized by comprising:
Obtaining the existing de-duplication rules in a preset de-duplication rule base, wherein the structure of each de-duplication rule comprises a domain name parameter part, a suffix part, a segment-count part, and a rewriting rule part;
Acquiring multiple URLs under the domain name for which a de-duplication rule is to be generated;
Matching the multiple URLs under that domain name against the suffix part and the rewriting rule part of the existing de-duplication rules; and
When the number of matched URLs is greater than a set threshold, replacing the domain name parameter part of the corresponding de-duplication rule with the domain name for which the rule is to be generated, thereby generating a new de-duplication rule.
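Claim 13 reuses rules learned on one domain for another. A minimal sketch, assuming rules are tuples `(domain, suffix, segment_count, pattern)` with `*` as the rewrite mark; the names `split_url`, `fits_pattern`, and `transfer_rules` are hypothetical.

```python
from urllib.parse import urlsplit

REWRITE_MARK = "*"  # hypothetical symbol for a rewritten segment


def split_url(url):
    """Split a URL into (domain, suffix, segment count, path segments)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    suffix = last.rsplit(".", 1)[1] if "." in last else ""
    return parts.netloc, suffix, len(segments), tuple(segments)


def fits_pattern(rule, url):
    """Match on suffix and rewriting pattern only, ignoring the domain."""
    _, suffix, _, segments = split_url(url)
    pattern = rule[3]
    if suffix != rule[1] or len(segments) != len(pattern):
        return False
    return all(p in (REWRITE_MARK, s) for p, s in zip(pattern, segments))


def transfer_rules(existing_rules, new_domain, urls, threshold):
    """Copy a rule to `new_domain` when enough of its URLs fit the pattern."""
    new_rules = []
    for rule in existing_rules:
        hits = sum(1 for u in urls if fits_pattern(rule, u))
        if hits > threshold:
            # Only the domain name parameter part is replaced.
            new_rules.append((new_domain,) + tuple(rule[1:]))
    return new_rules
```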
14. The de-duplication rule generation method as claimed in claim 13, characterized in that, after the step of obtaining the existing de-duplication rules in the preset de-duplication rule base, the method further comprises:
Sorting the existing de-duplication rules that share the same suffix part and rewriting rule part according to the number of different domain names they cover;
According to the sorting result, obtaining the suffix parts and rewriting rule parts of a set number of de-duplication rules having the largest numbers of different domain names;
Wherein the step of matching the multiple URLs under the domain name for which a de-duplication rule is to be generated against the suffix part and the rewriting rule part of the existing de-duplication rules comprises: matching those URLs against the obtained suffix parts and rewriting rule parts of the set number of de-duplication rules having the largest numbers of different domain names.
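The ranking step of claim 14 can be sketched as a count of distinct domains per (suffix, rewriting pattern) pair. The tuple layout `(domain, suffix, segment_count, pattern)` and the name `top_shared_patterns` are assumptions for illustration.

```python
from collections import defaultdict


def top_shared_patterns(rules, limit):
    """Return the `limit` (suffix, pattern) pairs used by the most domains."""
    domains = defaultdict(set)
    for domain, suffix, _count, pattern in rules:
        domains[(suffix, pattern)].add(domain)
    # Most widely shared patterns first; ties keep first-seen order.
    ranked = sorted(domains, key=lambda key: len(domains[key]), reverse=True)
    return ranked[:limit]
```

Restricting the matching of claim 13 to these widely shared patterns keeps the number of candidate rules small when the rule base is large.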
15. A URL de-duplication device, characterized by comprising:
A de-duplication rule base setting module, configured to preset a de-duplication rule base according to the structure of URLs, wherein multiple de-duplication rules are stored in the de-duplication rule base, each de-duplication rule corresponds to a different URL structure, and each de-duplication rule is provided with a rewrite mark representing the rewritten segmentation parameter in the corresponding URL;
A URL capture module, configured to acquire the URL data to be de-duplicated from website visit data;
A matching module, configured to match the URLs to be de-duplicated against the de-duplication rules in the de-duplication rule base according to the structure and segmentation parameters of the URLs; and
A de-duplication module, configured to filter the matched URLs that correspond to the same de-duplication rule, retaining one URL for each de-duplication rule.
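The matching and de-duplication modules of claim 15 might behave like the sketch below. Rules are assumed to be tuples `(domain, suffix, segment_count, pattern)` with `*` as the rewrite mark; returning unmatched URLs separately is also an assumption, since the claim does not say what happens to them.

```python
from urllib.parse import urlsplit

REWRITE_MARK = "*"  # hypothetical symbol for a rewritten segment


def split_url(url):
    """Split a URL into (domain, suffix, segment count, path segments)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    suffix = last.rsplit(".", 1)[1] if "." in last else ""
    return parts.netloc, suffix, len(segments), tuple(segments)


def matches(rule, url):
    """True if the URL has the rule's structure and fits its pattern."""
    domain, suffix, count, segments = split_url(url)
    if (domain, suffix, count) != tuple(rule[:3]):
        return False
    return all(p in (REWRITE_MARK, s) for p, s in zip(rule[3], segments))


def deduplicate(urls, rules):
    """Keep the first URL seen for each rule; collect unmatched URLs apart."""
    kept = {}
    unmatched = []
    for url in urls:
        rule = next((r for r in rules if matches(r, url)), None)
        if rule is None:
            unmatched.append(url)
        elif rule not in kept:
            kept[rule] = url  # later URLs for the same rule are dropped
    return list(kept.values()), unmatched
```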
16. The URL de-duplication device as claimed in claim 15, characterized in that the structure of the de-duplication rule comprises a domain name parameter part, a suffix part, a segment-count part, and a rewriting rule part.
17. The URL de-duplication device as claimed in claim 16, characterized in that the rewriting rule part of the de-duplication rule is provided with the rewrite mark of the rewritten segmentation parameter in the corresponding URL.
18. The URL de-duplication device as claimed in claim 17, characterized in that the matching module further comprises:
A first filter unit, configured to filter out, according to the domain name parameter part of the de-duplication rules, the URLs to be de-duplicated whose domain name does not correspond;
A second filter unit, configured to filter out, according to the suffix part and the segment-count part of the de-duplication rules, the URLs to be de-duplicated whose structure corresponds to none of the de-duplication rules; and
A third filter unit, configured to filter out, according to the rewriting rule part of the de-duplication rules, the URLs to be de-duplicated that have not been rewritten.
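The three filter units of claim 18 can be read as successive narrowing passes. The sketch below follows that reading and returns the survivors of each pass; the rule layout `(domain, suffix, segment_count, pattern)`, the `*` mark, and the name `staged_match` are assumptions.

```python
from urllib.parse import urlsplit

REWRITE_MARK = "*"  # hypothetical symbol for a rewritten segment


def split_url(url):
    """Split a URL into (domain, suffix, segment count, path segments)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    suffix = last.rsplit(".", 1)[1] if "." in last else ""
    return parts.netloc, suffix, len(segments), tuple(segments)


def staged_match(urls, rules):
    """Apply the domain, structure, and rewriting-rule filters in turn."""
    domains = {rule[0] for rule in rules}
    structures = {tuple(rule[:3]) for rule in rules}
    # First filter: drop URLs whose domain has no rule.
    stage1 = [u for u in urls if split_url(u)[0] in domains]
    # Second filter: drop URLs whose suffix and segment count fit no rule.
    stage2 = [u for u in stage1 if split_url(u)[:3] in structures]
    # Third filter: drop URLs that fit no rewriting pattern.
    stage3 = [
        u for u in stage2
        if any(
            tuple(rule[:3]) == split_url(u)[:3]
            and all(p in (REWRITE_MARK, s) for p, s in zip(rule[3], split_url(u)[3]))
            for rule in rules
        )
    ]
    return stage1, stage2, stage3
```

Running the cheap domain and structure checks first means the per-segment comparison is only applied to URLs that could plausibly match.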
19. The URL de-duplication device as claimed in claim 15, characterized in that the URL capture module acquires the URL data to be de-duplicated through a web crawler, or acquires the URL data to be de-duplicated from the raw logs of web page accesses.
20. A de-duplication rule generation device as claimed in any one of claims 15 to 19, characterized by comprising:
A URL acquisition module, configured to acquire the URL data under the domain name for which a de-duplication rule is to be generated;
A clustering module, configured to cluster the acquired URLs;
A splitting module, configured to split the clustered URLs into a domain name parameter part, a suffix part, a segment-count part, and a segmentation parameter part, and to form multiple pieces of statistical information;
A statistical information acquisition module, configured to obtain the pieces of statistical information that have the same structure after splitting; and
A segmentation parameter replacement module, configured to replace the corresponding segmentation parameters whose values differ among the statistical information of the same structure with a rewrite mark, and to generate a new de-duplication rule from the statistical information in which the rewrite mark has been substituted.
21. The de-duplication rule generation device as claimed in claim 20, characterized in that the clustering module clusters the acquired URLs according to their length and the lexicographic order of their characters.
22. The de-duplication rule generation device as claimed in claim 20, characterized in that the de-duplication rule generation device further comprises:
A rewrite filtering module, configured to filter out the already-rewritten URLs from the URL data acquired by the URL acquisition module, after the URL acquisition module has acquired the URL data under the domain name for which a de-duplication rule is to be generated.
23. The de-duplication rule generation device as claimed in claim 22, characterized in that the rewrite filtering module further comprises:
A de-duplication rule matching unit, configured to match the URL data acquired by the URL acquisition module against the existing de-duplication rules; and
A URL data filtering unit, configured to filter out the URLs that the de-duplication rule matching unit has matched to a corresponding de-duplication rule.
24. The de-duplication rule generation device as claimed in claim 20, characterized in that the de-duplication rule generation device further comprises:
A verification module, configured to verify the new de-duplication rule, after the segmentation parameter replacement module has generated it, by matching it against all URLs of the same structure under the same domain name.
25. The de-duplication rule generation device as claimed in claim 24, characterized in that the verification module further comprises:
A selection unit, configured to obtain all URLs that have the same domain name parameter, suffix, and segment count as the new de-duplication rule; and
A match judging unit, configured to match the new de-duplication rule against the obtained URLs and to determine that the verification has passed when the number of matching URLs exceeds a set threshold.
26. The de-duplication rule generation device as claimed in claim 20, characterized in that the de-duplication rule generation device further comprises:
A review mark setting module, configured to set a pending-review mark on the new de-duplication rule after the segmentation parameter replacement module has generated it.
27. A de-duplication rule generation device as claimed in any one of claims 15 to 19, characterized by comprising:
A de-duplication rule acquisition module, configured to obtain the existing de-duplication rules in a preset de-duplication rule base, wherein the structure of each de-duplication rule comprises a domain name parameter part, a suffix part, a segment-count part, and a rewriting rule part;
A URL acquisition module, configured to acquire multiple URLs under the domain name for which a de-duplication rule is to be generated;
A suffix and rewriting rule matching module, configured to match the multiple URLs under that domain name against the suffix part and the rewriting rule part of the existing de-duplication rules; and
A domain name parameter replacement module, configured to, when the number of matched URLs is greater than a set threshold, replace the domain name parameter part of the corresponding de-duplication rule with the domain name for which the rule is to be generated, thereby generating a new de-duplication rule.
28. The de-duplication rule generation device as claimed in claim 27, characterized in that the de-duplication rule generation device further comprises:
A sorting module, configured to sort the existing de-duplication rules obtained by the de-duplication rule acquisition module that share the same suffix part and rewriting rule part, according to the number of different domain names they cover;
A de-duplication rule screening module, configured to obtain, according to the sorting result of the sorting module, the suffix parts and rewriting rule parts of a set number of de-duplication rules having the largest numbers of different domain names;
Wherein the suffix and rewriting rule matching module matches the multiple URLs under the domain name for which a de-duplication rule is to be generated against the obtained suffix parts and rewriting rule parts of the set number of de-duplication rules having the largest numbers of different domain names.
CN201410100765.2A 2014-03-18 2014-03-18 Uniform resource locator de-duplication method and device Active CN104933056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410100765.2A CN104933056B (en) 2014-03-18 2014-03-18 Uniform resource locator de-duplication method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410100765.2A CN104933056B (en) 2014-03-18 2014-03-18 Uniform resource locator de-duplication method and device

Publications (2)

Publication Number Publication Date
CN104933056A true CN104933056A (en) 2015-09-23
CN104933056B CN104933056B (en) 2019-08-13

Family

ID=54120224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410100765.2A Active CN104933056B (en) 2014-03-18 2014-03-18 Uniform resource locator de-duplication method and device

Country Status (1)

Country Link
CN (1) CN104933056B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630983A (en) * 2015-12-28 2016-06-01 努比亚技术有限公司 Resource obtaining and optimizing device and method
CN105975599A (en) * 2016-05-11 2016-09-28 北京京东尚博广益投资管理有限公司 Method and device monitoring website page event tracking
CN106844389A (en) * 2015-12-07 2017-06-13 阿里巴巴集团控股有限公司 The treating method and apparatus of network resources address URL
CN106919570A (en) * 2015-12-24 2017-07-04 国家新闻出版广电总局广播科学研究院 The page link duplicate removal scan method and device of a kind of network-oriented new media
CN107766167A (en) * 2017-10-23 2018-03-06 郑州云海信息技术有限公司 A kind of fault log repeats to report an error the method for merger
CN108268508A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 URL De-weight methods and device
CN108449355A (en) * 2018-04-04 2018-08-24 上海有云信息技术有限公司 A kind of vulnerability scanning method and system
CN108694184A (en) * 2017-04-06 2018-10-23 北京国双科技有限公司 Expose URL processing method and processing devices
CN108959359A (en) * 2018-05-16 2018-12-07 顺丰科技有限公司 A kind of uniform resource locator semanteme De-weight method, device, equipment and medium
CN109359250A (en) * 2018-08-31 2019-02-19 阿里巴巴集团控股有限公司 Uniform resource locator processing method, device, server and readable storage medium storing program for executing
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule
CN109861963A (en) * 2017-11-30 2019-06-07 南京德朔实业有限公司 Portable power source system and data transmission method for the system
CN110008419A (en) * 2019-03-11 2019-07-12 阿里巴巴集团控股有限公司 Removing duplicate webpages method, device and equipment
CN110413866A (en) * 2018-04-27 2019-11-05 北京搜狗科技发展有限公司 Data processing method and device, the device for data processing
WO2020006909A1 (en) * 2018-07-05 2020-01-09 平安科技(深圳)有限公司 Method and device for deduplicating urls
WO2020006908A1 (en) * 2018-07-05 2020-01-09 平安科技(深圳)有限公司 Url de-duplication method and device
CN110717036A (en) * 2018-07-11 2020-01-21 阿里巴巴集团控股有限公司 Method and device for removing duplication of uniform resource locator and electronic equipment
CN110825947A (en) * 2019-10-31 2020-02-21 深圳前海微众银行股份有限公司 URL duplicate removal method, device, equipment and computer readable storage medium
CN111259282A (en) * 2020-02-13 2020-06-09 深圳市腾讯计算机系统有限公司 URL duplicate removal method and device, electronic equipment and computer readable storage medium
CN114650152A (en) * 2020-12-17 2022-06-21 中国科学院计算机网络信息中心 Method and system for detecting vulnerability of super computing center

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070112960A1 (en) * 2003-03-31 2007-05-17 Microsoft Corporation Systems and methods for removing duplicate search engine results
CN103377260A (en) * 2012-04-28 2013-10-30 阿里巴巴集团控股有限公司 Analysis method and device of URLs (Uniform Resource Locator) of weblog
CN103530337A (en) * 2013-09-30 2014-01-22 北京奇虎科技有限公司 Device and method for recognizing invalid parameters in URL

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844389B (en) * 2015-12-07 2021-05-04 阿里巴巴集团控股有限公司 Method and device for processing URL (Uniform resource locator)
CN106844389A (en) * 2015-12-07 2017-06-13 阿里巴巴集团控股有限公司 The treating method and apparatus of network resources address URL
CN106919570A (en) * 2015-12-24 2017-07-04 国家新闻出版广电总局广播科学研究院 The page link duplicate removal scan method and device of a kind of network-oriented new media
CN105630983A (en) * 2015-12-28 2016-06-01 努比亚技术有限公司 Resource obtaining and optimizing device and method
CN105975599A (en) * 2016-05-11 2016-09-28 北京京东尚博广益投资管理有限公司 Method and device monitoring website page event tracking
CN105975599B (en) * 2016-05-11 2020-01-07 北京京东金融科技控股有限公司 Method and device for monitoring page embedded points of website
CN108268508A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 URL De-weight methods and device
CN108694184B (en) * 2017-04-06 2022-03-11 北京国双科技有限公司 Exposure URL processing method and device
CN108694184A (en) * 2017-04-06 2018-10-23 北京国双科技有限公司 Expose URL processing method and processing devices
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule
CN109561163B (en) * 2017-09-27 2022-03-15 阿里巴巴集团控股有限公司 Method and device for generating uniform resource locator rewriting rule
CN107766167A (en) * 2017-10-23 2018-03-06 郑州云海信息技术有限公司 A kind of fault log repeats to report an error the method for merger
CN109861963A (en) * 2017-11-30 2019-06-07 南京德朔实业有限公司 Portable power source system and data transmission method for the system
CN108449355A (en) * 2018-04-04 2018-08-24 上海有云信息技术有限公司 A kind of vulnerability scanning method and system
CN110413866B (en) * 2018-04-27 2024-02-02 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110413866A (en) * 2018-04-27 2019-11-05 北京搜狗科技发展有限公司 Data processing method and device, the device for data processing
CN108959359A (en) * 2018-05-16 2018-12-07 顺丰科技有限公司 A kind of uniform resource locator semanteme De-weight method, device, equipment and medium
CN108959359B (en) * 2018-05-16 2022-10-11 顺丰科技有限公司 Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium
WO2020006908A1 (en) * 2018-07-05 2020-01-09 平安科技(深圳)有限公司 Url de-duplication method and device
WO2020006909A1 (en) * 2018-07-05 2020-01-09 平安科技(深圳)有限公司 Method and device for deduplicating urls
CN110717036B (en) * 2018-07-11 2023-11-10 阿里巴巴集团控股有限公司 Method and device for removing duplication of uniform resource locator and electronic equipment
CN110717036A (en) * 2018-07-11 2020-01-21 阿里巴巴集团控股有限公司 Method and device for removing duplication of uniform resource locator and electronic equipment
CN109359250B (en) * 2018-08-31 2022-05-31 创新先进技术有限公司 Uniform resource locator processing method, device, server and readable storage medium
CN109359250A (en) * 2018-08-31 2019-02-19 阿里巴巴集团控股有限公司 Uniform resource locator processing method, device, server and readable storage medium storing program for executing
CN110008419A (en) * 2019-03-11 2019-07-12 阿里巴巴集团控股有限公司 Removing duplicate webpages method, device and equipment
CN110825947A (en) * 2019-10-31 2020-02-21 深圳前海微众银行股份有限公司 URL duplicate removal method, device, equipment and computer readable storage medium
WO2021082938A1 (en) * 2019-10-31 2021-05-06 深圳前海微众银行股份有限公司 Url deduplication method, apparatus, device and computer-readable storage medium
CN110825947B (en) * 2019-10-31 2024-03-08 深圳前海微众银行股份有限公司 URL deduplication method, device, equipment and computer readable storage medium
CN111259282B (en) * 2020-02-13 2023-08-29 深圳市腾讯计算机系统有限公司 URL (Uniform resource locator) duplication removing method, device, electronic equipment and computer readable storage medium
CN111259282A (en) * 2020-02-13 2020-06-09 深圳市腾讯计算机系统有限公司 URL duplicate removal method and device, electronic equipment and computer readable storage medium
CN114650152B (en) * 2020-12-17 2023-06-20 中国科学院计算机网络信息中心 Super computing center vulnerability detection method and system
CN114650152A (en) * 2020-12-17 2022-06-21 中国科学院计算机网络信息中心 Method and system for detecting vulnerability of super computing center

Also Published As

Publication number Publication date
CN104933056B (en) 2019-08-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240108

Address after: 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong Province 518057

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: Room 403, East, Building 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518044

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
