CN104778164A

CN104778164A - Method and device for detecting repeated URL (Uniform Resource Locator)

Info

Publication number: CN104778164A
Application number: CN201410009241.2A
Authority: CN
Inventors: 冯亮; 尹亚伟; 费志军
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2014-01-09
Filing date: 2014-01-09
Publication date: 2015-07-15
Anticipated expiration: 2034-01-09
Also published as: CN104778164B

Abstract

The invention relates to a method and a device for detecting repeated URLs (Uniform Resource Locators). The method comprises the following steps: grouping the URL addresses in a first URL address set; for each group, generalizing the first characteristic part of each URL address to form a second URL address set; for the second URL address set, generalizing the second characteristic part of each URL address contained in each of its elements to form a third URL address set; for each element of the third URL address set, extracting the common part of the URL addresses it contains to form a fourth URL address set; and, if a URL address to be downloaded, obtained by a web crawler, matches any element of the fourth URL address set, judging that the web page corresponding to that URL address has already been downloaded. The method prevents web pages from being downloaded repeatedly and improves the efficiency of the web crawler.

Description

Method and device for detecting repeated URLs

Technical field

The present invention relates to the field of network application technology and, more particularly, to a method and a device for detecting repeated URLs.

Background art

In recent years, e-commerce websites have flourished and have become the main entry point for online shopping and consumption. The pages of these websites contain a large amount of product information and user reviews. Collecting these data is the basis for e-commerce applications such as personalized recommendation, product marketing analysis, and sentiment analysis.

A web crawler is a program that automatically fetches web pages. It downloads network resources by traversal and is a common means of collecting the pages of specified websites. It works as follows: starting from one or more initially configured URLs, the crawler fetches the corresponding web pages; while fetching, it continually extracts new URLs from the current page and analyzes whether each is relevant to the topic the user is interested in; irrelevant URLs are filtered out, relevant URLs are placed in an access queue and downloaded in turn, and the process then repeats.

On the Internet, however, duplication of data resources is commonplace. On many websites, multiple different URL addresses point to web pages whose content is identical or nearly identical (such URLs are referred to below as repeated URLs). This not only causes web pages to be downloaded repeatedly and increases the crawler's workload, but also complicates subsequent data processing (such as indexing, retrieval, and ranking). Detecting repeated URLs has therefore become a necessary part of how a web crawler processes URLs.

In one existing method for detecting repeated URLs, the web crawler places all downloaded URL addresses in a storage area and, when a new URL is extracted, searches the storage area for that URL address; if it is found, the URL is judged to be a repeated URL. However, as the number of downloaded URLs grows ever larger, the storage space cannot grow without limit, and the time needed to search the massive volume of URL address data becomes intolerable.

The above method is therefore subject to many limitations, and researchers wish to obtain a more efficient and reliable method for detecting repeated URLs.

Summary of the invention

One object of the present invention is to provide a method for detecting repeated URLs.

To achieve the above object, the invention provides the following technical solution:

A method for detecting repeated URLs, used to detect whether the web page corresponding to a URL address obtained by a web crawler is identical or similar in content to a web page the crawler has already downloaded, the method comprising the following steps:

a) Grouping step: group the URL addresses in a first URL address set so that the degree of difference between the web pages corresponding to the URL addresses in the same group is less than a first set threshold.

b) URL pattern extraction step: for each group, generalize the first characteristic part of each URL address in the group, and take the generalized URL addresses together as one element of a second URL address set, thereby forming the second URL address set; the first characteristic part is the part that distinguishes each URL address in a group from the other URL addresses.

c) URL schema creation step: for the second URL address set, generalize the second characteristic part of each URL address contained in each of its elements, and take the generalized URL addresses together as one element of a third URL address set, thereby forming the third URL address set; the second characteristic part is the part that distinguishes the URL addresses contained in one element of the second URL address set from the URL addresses contained in the other elements.

d) Main URL mode construction step: for each element of the third URL address set, extract the common part of the URL addresses it contains, and take the common part as one element of a fourth URL address set, thereby forming the fourth URL address set.

e) Repeated URL detection step: for a URL address to be downloaded that the web crawler has obtained, if this URL address matches any element of the fourth URL address set, judge that the corresponding web page has been downloaded; otherwise, judge that the corresponding web page has not been downloaded.

Preferably, before step a), the method further comprises a URL address normalization step, so that each URL address in the first URL address set conforms to RFC 3986.

Preferably, the uppercase letters contained in each URL address in the first URL address set are replaced with the corresponding lowercase letters; the lowercase letters in the percent-encodings contained in each URL address are replaced with the corresponding uppercase letters; and the default port number contained in each URL address is removed.

Preferably, the degree of difference between any two web pages is measured by the number of differing bits between the SimHash codes of the two web pages.

Preferably, after step c) and before step d), the method further comprises a step of merging elements within the URL address set: traverse the third URL address set; instantiate a first element and a second element (filling in concrete characteristic parts) to form a first and a second URL address instance, respectively; compare the degree of difference between the web pages corresponding to the first and second URL address instances; and, if the degree of difference is less than a second set threshold, merge the first and second elements. Here, the first and second elements are any two different elements of the third URL address set.

Another object of the present invention is to provide a device for detecting repeated URLs.

To achieve the above object, the invention provides another technical solution as follows:

A device for detecting repeated URLs, used in conjunction with a web crawler, comprising:

a grouping unit, which groups the URL addresses in a first URL address set so that the degree of difference between the web pages corresponding to the URL addresses in the same group is less than a first set threshold;

a URL extracting unit, which receives the output of the grouping unit and, for each group, generalizes the first characteristic part of each URL address in the group, takes the generalized URL addresses together as one element of a second URL address set, and merges and outputs them to form the second URL address set; the first characteristic part is the part that distinguishes each URL address in a group from the other URL addresses;

a URL schema creation unit, which receives the output of the URL extracting unit and, for the second URL address set, generalizes the second characteristic part of each URL address contained in each of its elements, and takes the generalized URL addresses together as one element of a third URL address set, thereby forming the third URL address set; the second characteristic part is the part that distinguishes the URL addresses contained in one element of the second URL address set from the URL addresses contained in the other elements;

a main URL mode construction unit, which receives the output of the URL schema creation unit and, for each element of the third URL address set, extracts the common part of the URL addresses it contains and takes the common part as one element of a fourth URL address set, thereby forming the fourth URL address set;

a repeated URL detecting unit, which receives the output of the main URL mode construction unit and a URL address to be downloaded obtained by the web crawler; if this URL address matches any element of the fourth URL address set, it judges that the corresponding web page has been downloaded; otherwise, it judges that the corresponding web page has not been downloaded.

The method for detecting repeated URLs provided by the invention analyzes the similarity between URL addresses and applies data mining to them. The massive number of URL addresses in the first set is first grouped by the similarity of the corresponding web pages; then, through processing steps such as generalization and common-part extraction, the groups are consolidated into a number of main URL patterns. Each main URL pattern may correspond to a large number of repeated URLs that have identical or similar web pages. It is therefore only necessary to store these main URL patterns and to match each URL address newly obtained by the web crawler against them, so that repeated URLs can be detected efficiently and reliably. The method not only avoids repeated downloading of web pages and improves the crawler's efficiency, but also saves duplicated work in subsequent data processing steps (such as indexing, retrieval, and ranking). In addition, using regular expressions to generalize the characteristic parts of the URL addresses gives the main URL patterns greater generality. The training data set used by the method measures the similarity between web pages with SimHash codes and requires no manual labeling; moreover, as the training data expand, main URL patterns can be added autonomously, improving the coverage and accuracy of detection. The method is simple to implement and suitable for adoption in industry.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for detecting repeated URLs provided by the first embodiment of the invention;

Fig. 2 is a schematic flowchart of the method for detecting repeated URLs provided by the second embodiment of the invention;

Fig. 3 is a schematic structural diagram of the device for detecting repeated URLs provided by the third embodiment of the invention;

Fig. 4 is a schematic structural diagram of the device for detecting repeated URLs provided by the fourth embodiment of the invention;

Fig. 5-1 is a schematic diagram of one implementation of the URL pattern extraction step;

Fig. 5-2 is a schematic diagram of one implementation of the URL schema creation step;

Fig. 5-3 is a schematic diagram of one implementation of the main URL mode construction step.

Detailed description of the embodiments

It should be noted that, in the embodiments of the invention, the first URL address set comprises multiple URL addresses; after it is grouped, each group contains at least one URL address. The elements of the second and third URL address sets correspond one-to-one to these groups, and each element of the second and third URL address sets is itself a lower-level URL address set containing one or more URL addresses. Each element of the fourth URL address set contains only one URL address.

In the embodiments of the invention, the following definitions may be made:

Definition 1.1 (URL repetition): given two URL addresses u₁ and u₂, if the corresponding web page contents doc(u₁) and doc(u₂) are identical or nearly identical, then u₁ and u₂ are said to be repeated.

Definition 1.2 (URL pattern): a URL pattern is a generalization of a particular class of URLs. If a URL instance u₁ satisfies URL pattern r₁, then u₁ is said to match r₁. The URL instance set corresponding to URL pattern r₁ is defined as S(r₁) = {u₁, u₂, u₃, …, uₙ | uᵢ matches r₁, 0 < i ≤ n}, where n is the number of URL instances that match r₁.

Definition 1.3 (URL pattern repetition): given URL patterns r₁ and r₂ and their corresponding instance sets S(r₁) and S(r₂), if for any URL instance u₁ taken from S(r₁) a URL instance u₂ that repeats it can be found in S(r₂), then r₁ and r₂ are said to be repeated.

Definition 1.4 (repeated URL pattern set): given a set of URL patterns, if any two URL patterns in it are repeated, then the set is a repeated URL pattern set.

It follows from the above definitions that URL repetition and URL pattern repetition are transitive; for example, if u₁ repeats u₂ and u₁ repeats u₃, then u₂ also repeats u₃.

The above explanations and definitions apply only to the preferred embodiments of the invention and are not intended to limit its scope. Those skilled in the art may make various modifications without departing from the idea of the invention and the appended claims.

As shown in Fig. 1, the first embodiment of the invention provides a method for detecting repeated URLs, used to detect whether the web page corresponding to a URL address obtained by a web crawler is identical or similar in content to a web page the crawler has already downloaded. The method comprises the following steps:

Step S10 (grouping step): group the URL addresses in the first URL address set so that the degree of difference between the web pages corresponding to the URL addresses in the same group is less than a first set threshold.

In this step, the URL addresses corresponding to identical or nearly identical web pages are placed in the same bucket (a data structure representing a group); that is, each bucket contains several URL addresses, and the URL addresses in the same bucket are repeated URLs (see Definition 1.1). The data structure of a bucket may be defined as b = {b.u₁, b.u₂, b.u₃, …, b.uₙ}, where each element b.u₁, b.u₂, and so on is a URL address and n is the size of the group. A bucket pool stores multiple buckets and corresponds to the first URL address set after grouping.
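The bucket structure can be illustrated with a minimal Python sketch. It assumes that each URL already has an integer fingerprint of its page and that the degree of difference is the number of differing bits between two fingerprints (the embodiment uses SimHash codes for this purpose). The function name `group_into_buckets`, the greedy comparison against the first member of each bucket, the default threshold, and the example URLs in the test data are illustrative assumptions, not details given in the text.

```python
from typing import Dict, List


def group_into_buckets(fingerprints: Dict[str, int], threshold: int = 3) -> List[List[str]]:
    """Greedily place each URL in the first bucket whose representative is close enough."""
    buckets: List[List[str]] = []
    for url, code in fingerprints.items():
        for bucket in buckets:
            representative = fingerprints[bucket[0]]
            # Degree of difference = number of differing bits between the two codes.
            if bin(code ^ representative).count("1") <= threshold:
                bucket.append(url)
                break
        else:
            # No existing bucket is close enough: open a new one.
            buckets.append([url])
    return buckets
```

Comparing only against the first member keeps the sketch short; it bounds each member's distance to that representative, not the distance between every pair of URLs in a bucket.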

The first URL address set may be provided by the historical data of web pages downloaded by the web crawler, and serves as the training data for the method for detecting repeated URLs provided by this embodiment.

Those skilled in the art will appreciate that a web page is in essence a text. Many algorithms currently exist for computing the similarity between texts, including cosine similarity, Dice similarity, and Jaccard similarity based on the vector space model. These methods are fairly accurate but bring greater computational complexity.

The embodiments of the invention use the SimHash algorithm. SimHash is a locality-sensitive hashing algorithm; an n-bit SimHash algorithm generates an n-bit binary string for a document, namely its n-bit SimHash code. The degree of difference between two documents equals the number of differing bits between their SimHash codes. For example, the 4-bit SimHash codes 0011 and 0010 differ in 1 bit. The more bits differ, the more the two documents differ. Two identical documents have identical SimHash codes, and their degree of difference is 0.

Web pages of e-commerce websites usually include dynamic advertisements, so pages downloaded from the same address at different times show slight differences. A 64-bit SimHash algorithm may therefore be used; for example, when the degree of difference between documents is less than or equal to 3, the web pages are judged to be identical or nearly identical.
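As a rough illustration of the comparison described above, the following Python sketch computes the number of differing bits between two codes and applies the example threshold of 3. The text does not say how the SimHash code itself is computed; the `simhash64` function below is one common formulation (whitespace tokens, equal weights, a 64-bit BLAKE2b token hash) and is an assumption, as are all function names.

```python
import hashlib


def simhash64(text: str) -> int:
    """A simple 64-bit SimHash: each token votes +1 or -1 on every bit position."""
    votes = [0] * 64
    for token in text.split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    code = 0
    for bit in range(64):
        if votes[bit] > 0:
            code |= 1 << bit
    return code


def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two SimHash codes differ."""
    return bin(code_a ^ code_b).count("1")


def is_near_duplicate(code_a: int, code_b: int, threshold: int = 3) -> bool:
    """Pages count as identical or nearly identical when at most `threshold` bits differ."""
    return hamming_distance(code_a, code_b) <= threshold


print(hamming_distance(0b0011, 0b0010))  # 1, the example given in the text
```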

Step S11 (URL pattern extraction step): for each group, generalize the first characteristic part of each URL address in the group, and take the generalized URL addresses together as one element of the second URL address set, thereby forming the second URL address set.

The basic idea of the URL pattern extraction step is to compare the URL addresses in the same group (bucket), find their largest common part as far as possible, and then generalize their respective first characteristic parts; multiple different URL addresses may thereby be generalized into a single URL address. In this step, for example, regular expressions may be used to represent the result of the generalization.

Here, the first characteristic part is the part that distinguishes each URL address in a group from the other URL addresses.

The main work of the URL pattern extraction step is thus to generate the second URL address set from the first URL address set by generalizing the first characteristic part of each URL address in each group. Each element of the second URL address set is a lower-level URL address set that may contain one or more URL addresses (that is, URL patterns; see Definition 1.2), and the URL patterns contained in each element of the second URL address set satisfy URL pattern repetition (see Definition 1.3).
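One way to picture this step is the Python sketch below, which generalizes the path segments that differ among the URLs of one bucket. It is a simplification under stated assumptions: URLs are compared segment by segment after splitting on "/", URLs with different segment counts yield separate patterns, and the placeholder `[^/]+` stands for the generalized first characteristic part. The function name, the placeholder, and the example URLs are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

WILDCARD = "[^/]+"  # assumed regular-expression placeholder for a generalized segment


def extract_patterns(bucket: List[str]) -> List[str]:
    """Return one regular-expression pattern per segment count found in the bucket."""
    by_length: Dict[int, List[List[str]]] = defaultdict(list)
    for url in bucket:
        segments = url.split("/")
        by_length[len(segments)].append(segments)

    patterns = []
    for rows in by_length.values():
        pattern = []
        for column in zip(*rows):
            if len(set(column)) == 1:
                pattern.append(re.escape(column[0]))  # common part, kept literally
            else:
                pattern.append(WILDCARD)              # differing part, generalized
        patterns.append("/".join(pattern))
    return patterns


bucket = [
    "shop.example.com/item/1001/ref/home",
    "shop.example.com/item/1001/ref/search",
]
print(extract_patterns(bucket))
```

The returned list plays the role of one element of the second URL address set: a lower-level set holding one or more URL patterns for the bucket.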

Step S12 (URL schema creation step): for the second URL address set, generalize the second characteristic part of each URL address contained in each of its elements, and take the generalized URL addresses together as one element of the third URL address set, thereby forming the third URL address set.

Here, the second characteristic part is the part that distinguishes the URL addresses contained in one element of the second URL address set from the URL addresses contained in the other elements.

On e-commerce websites, product detail pages and review pages usually account for a large share of all pages, and this is where the repeated-URL problem occurs most often. To distinguish the pages of different products, a URL address generally contains a product identifier field. The URL schema creation step takes this product identifier as the second characteristic part. Its main work is, for the second URL address set, to find the position at which the product identifier occurs in the one or more URL addresses contained in each element and to generalize the product identifier; multiple different URL addresses may thereby be generalized into a single URL address, and the third URL address set is generated from the second URL address set. Each element of the third URL address set is a lower-level URL address set that may contain one or more URL addresses (that is, URL patterns; see Definition 1.2), and the URL patterns contained in each element of the third URL address set satisfy URL pattern repetition (see Definition 1.3).
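The generalization of the product identifier might look like the following Python sketch. The text does not specify how the identifier's position is located; here it is simply assumed to be any all-digit path segment, which is replaced by `\d+`. The function names and sample patterns are hypothetical.

```python
import re
from typing import List

# Assumption: a product identifier is a path segment made up only of digits.
ID_SEGMENT = re.compile(r"(?<=/)\d+(?=/|$)")


def generalize_identifier(pattern: str) -> str:
    """Replace every all-digit path segment with the regular expression \\d+."""
    return ID_SEGMENT.sub(lambda match: r"\d+", pattern)


def create_schemas(second_set: List[List[str]]) -> List[List[str]]:
    """Map each element of the second set to an element of the third set."""
    third_set = []
    for element in second_set:
        generalized: List[str] = []
        for pattern in element:
            schema = generalize_identifier(pattern)
            if schema not in generalized:  # several patterns may collapse into one
                generalized.append(schema)
        third_set.append(generalized)
    return third_set


second_set = [
    [r"shop\.example\.com/item/1001/ref/[^/]+"],
    [r"shop\.example\.com/item/2002/ref/[^/]+"],
]
print(create_schemas(second_set))
```

In this sketch the elements of the third set stay in one-to-one correspondence with those of the second set, so two elements can end up holding the same pattern.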

Step S13 (main URL mode construction step): for each element of the third URL address set, extract the common part of the URL addresses it contains, and take the common part as one element of the fourth URL address set, thereby forming the fourth URL address set.

The elements of the fourth URL address set correspond one-to-one to the elements of the third URL address set. Each element of the fourth URL address set is formed by extracting the common part from the corresponding element of the third URL address set, and this common part is defined as the main URL pattern; each element of the fourth URL address set therefore contains only one URL address.

For simplicity and speed, the common part of the URL addresses may be taken to be the one with the shortest character length among those URL addresses. The common part may also be obtained by other common-part extraction methods provided in the prior art.

Specifically, a URL pattern index may represent the correspondence between the third URL address set and the fourth URL address set. The URL pattern index is a simple and efficient data structure, for example a vector of length m of two-tuples <string, pointer>, in which the string stores a URL address contained in an element of the third URL address set obtained in the URL schema creation step, and the pointer points to an element of the fourth URL address set (defined as the main URL pattern). It will be appreciated that all the URL addresses contained in any one element of the third URL address set point to the same element of the fourth URL address set; that is, they share a common main URL pattern.
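The shortest-pattern rule and the URL pattern index can be sketched in a few lines of Python. A dictionary stands in for the vector of <string, pointer> pairs described above, which is a substitution made for brevity; the function name and the sample patterns are hypothetical.

```python
from typing import Dict, List


def build_pattern_index(third_set: List[List[str]]) -> Dict[str, str]:
    """Map every URL pattern to the main URL pattern of the element that contains it."""
    index: Dict[str, str] = {}
    for element in third_set:
        # Simple rule from the text: the shortest pattern serves as the common part.
        main_pattern = min(element, key=len)
        for pattern in element:
            index[pattern] = main_pattern
    return index


third_set = [
    [r"shop\.example\.com/item/\d+/ref/[^/]+", r"shop\.example\.com/item/\d+"],
    [r"shop\.example\.com/review/\d+"],
]
index = build_pattern_index(third_set)
fourth_set = sorted(set(index.values()))  # one main URL pattern per element
```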

Step S14 (repeated URL detection step): for a URL address to be downloaded that the web crawler has obtained, if this URL address matches any element of the fourth URL address set, judge that the corresponding web page has been downloaded; otherwise, judge that the corresponding web page has not been downloaded.

In one specific implementation of this step, for a new URL address u obtained by the web crawler, its first and second characteristic parts are first extracted to obtain its corresponding URL pattern. Then, if a matching algorithm finds the corresponding main URL pattern in the fourth URL address set, URL address u is judged to have been downloaded by the web crawler; otherwise, it is judged not to have been downloaded.
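A minimal Python sketch of the matching step follows. It simplifies the implementation described above by assuming that the main URL patterns are stored as regular expressions and that the incoming URL is matched against them directly, rather than first being reduced to its own URL pattern. The function names and sample data are hypothetical.

```python
import re
from typing import Iterable, List


def compile_main_patterns(main_patterns: Iterable[str]) -> List[re.Pattern]:
    """Compile the elements of the fourth URL address set once, up front."""
    return [re.compile(pattern) for pattern in main_patterns]


def is_downloaded(url: str, compiled: List[re.Pattern]) -> bool:
    """Judge a URL as already downloaded if any main URL pattern matches it fully."""
    return any(pattern.fullmatch(url) for pattern in compiled)


compiled = compile_main_patterns([r"shop\.example\.com/item/\d+/ref/[^/]+"])
print(is_downloaded("shop.example.com/item/777/ref/mail", compiled))
print(is_downloaded("shop.example.com/cart", compiled))
```

Only the main URL patterns need to be held in memory, which is the storage saving the method aims for compared with keeping every downloaded URL.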

To verify this result further, and to add URL addresses that have not been downloaded to the first URL address set for subsequent training, URL address u may be converted into a new URL address u′ by means of its main URL pattern and its first and second characteristic parts. The system then checks whether the web crawler has downloaded the web page corresponding to URL address u′. If there is no download record, the system returns the message "the web page corresponding to this URL address has not yet been accessed or downloaded", and URL address u may be stored in the first URL address set while the web page is downloaded.

In a further improvement of the first embodiment, a preprocessing step is included after step S10 and before step S11: each URL address in the first URL address set is expressed in the form of a data structure, so that the method for detecting repeated URLs provided by this embodiment can be implemented by a computer program and the data stored in a database.

Specifically, the preprocessing step may comprise the following sub-steps:

i) Delete the transport-protocol part of each URL address, for example "http://" and "https://";

ii) Identify the address string and the parameter string contained in any URL address in the first URL address set, for example by using the "?" character to split the URL address into two parts, namely the address string and the parameter string;

iii) Split the address string at "/" characters into at least one address substring, forming an address list; split the parameter string at "&" characters into at least one parameter substring, so that the parameter string becomes a parameter list containing at least one parameter item. A parameter substring is, for example, "pcode=123"; a parameter item comprises a parameter index and a parameter value (in the parameter item corresponding to this example, pcode is the parameter index and 123 is the parameter value);

iv) Extract the address substrings to form an address array and the parameter substrings to form a parameter array; the address array and the parameter array together form the data structure corresponding to the URL address. For example, a URL address may be expressed as a two-tuple (u.a, u.p), where u.a denotes the address list of the URL address, formally defined as u.a = {u.a₁, u.a₂, …, u.aₙ}, with n the number of address items in the address list, and u.p denotes the parameter list of the URL address, formally defined as u.p = {u.p₁, u.p₂, …, u.pₘ}, with m the number of parameter items in the parameter list; the parameter items in the parameter list may be sorted by parameter index;

v) Repeat steps ii), iii), and iv) above to rewrite each URL address in the first URL address set as its corresponding data structure.
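The sub-steps above can be sketched in Python as follows. The type name `ParsedUrl`, the function name `preprocess`, and the example URL are illustrative assumptions; the two-tuple (u.a, u.p) is represented as a named tuple holding the address list and the parameter list sorted by parameter index.

```python
from typing import List, NamedTuple, Tuple


class ParsedUrl(NamedTuple):
    address: List[str]             # u.a: address substrings
    params: List[Tuple[str, str]]  # u.p: (parameter index, parameter value) items


def preprocess(url: str) -> ParsedUrl:
    # i) Delete the transport-protocol prefix.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    # ii) Split into address string and parameter string at the first "?".
    address_part, _, param_part = url.partition("?")
    # iii) Split the address string at "/" and the parameter string at "&".
    address = [segment for segment in address_part.split("/") if segment]
    params = []
    for item in param_part.split("&"):
        if item:
            key, _, value = item.partition("=")
            params.append((key, value))
    # iv) Sort parameter items by parameter index and build the two-tuple.
    params.sort(key=lambda item: item[0])
    return ParsedUrl(address, params)


print(preprocess("http://www.example.com/item/detail?pcode=123&color=red"))
```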

Fig. 5-1 shows a schematic implementation of the URL pattern extraction step after the above preprocessing step has been carried out; Fig. 5-2 shows a schematic implementation of the URL schema creation step after the above preprocessing step has been carried out; Fig. 5-3 shows a schematic implementation of the main URL mode construction step after the above preprocessing step has been carried out.

The method for detecting repeated URLs provided by the first embodiment of the invention not only avoids repeated downloading of web pages and improves the crawler's efficiency, but also saves duplicated work in subsequent data processing steps (such as indexing, retrieval, and ranking). In addition, using regular expressions to generalize the characteristic parts of the URL addresses gives the main URL patterns greater generality. The training data set used by the method measures the similarity between web pages with SimHash codes and requires no manual labeling; moreover, as the training data expand, the method can update its own training data and thereby add new main URL patterns autonomously, which helps improve the coverage and accuracy of repeated-URL detection. The method is simple to implement and suitable for adoption in industry.

As shown in Fig. 2, the second embodiment of the invention provides a method for detecting repeated URLs, likewise used to detect whether the web page corresponding to a URL address obtained by a web crawler is identical or similar in content to a web page the crawler has already downloaded. It comprises:

Step S20 (URL address normalization step): normalize each URL address in the first URL address set.

Specifically, URL normalization means modifying URL addresses so that each is expressed in a unified standard form; preferably, each URL address is made to conform to the URL technical standard RFC 3986, so as to avoid differences in form between different URL addresses. For example, the URL address normalization step may specifically comprise any one or more of the following sub-steps:

1) convert all uppercase letters in each URL address to the corresponding lowercase letters;

2) replace the lowercase letters in percent-encodings with the corresponding uppercase letters; for example, http://www.example.com/a%c2%b1b is rewritten as http://www.example.com/a%C2%B1b;

3) decode percent-encoded unreserved characters; for example, http://www.example.com/%7Eusername/ is rewritten as http://www.example.com/~username/;

4) remove the default port number where present;

5) remove all "." and ".." segments, adjusting the URL address according to the resulting path change.
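The five sub-steps can be combined into a short Python sketch using the standard library. It follows sub-step 1 literally and lowercases the whole URL, although RFC 3986 itself treats only the scheme and host as case-insensitive. The function names and the default-port table are assumptions, the sketch drops any userinfo and does not handle IPv6 host literals, and `posixpath.normpath` also collapses repeated slashes, which goes slightly beyond sub-step 5.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed default-port table
UNRESERVED = set("abcdefghijklmnopqrstuvwxyz0123456789-._~")


def _fix_percent(match):
    """Decode an unreserved character; otherwise uppercase the hex digits."""
    char = chr(int(match.group(1), 16)).lower()
    if char in UNRESERVED:
        return char
    return match.group(0).upper()


def normalize(url: str) -> str:
    url = url.lower()                                    # sub-step 1
    url = re.sub(r"%([0-9a-f]{2})", _fix_percent, url)   # sub-steps 2 and 3
    parts = urlsplit(url)
    netloc = parts.hostname or ""
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        netloc = f"{netloc}:{parts.port}"                # sub-step 4: keep only non-default ports
    path = parts.path
    if path:
        trailing_slash = path.endswith("/")
        path = posixpath.normpath(path)                  # sub-step 5: drop "." and ".."
        if trailing_slash and not path.endswith("/"):
            path += "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(normalize("http://www.example.com/%7Eusername/"))
```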

Step S21: group the URL addresses in the first URL address set.

Step S22: for each group, generalize the first characteristic part of each URL address in the group, and take the generalized URL addresses as one element of the second URL address set, thereby forming the second URL address set.

Step S23: for the second URL address set, generalize the second characteristic part of each URL address contained in each of its elements, and take the generalized URL addresses as one element of the third URL address set, thereby forming the third URL address set.

Step S24 (step of merging elements within the URL address set): traverse the third URL address set; instantiate a first element and a second element (filling in concrete characteristic parts) to form a first and a second URL address instance, respectively; if the web pages corresponding to the first and second URL address instances are similar, merge the first and second elements. Here, the first and second elements are any two different elements of the third URL address set.

Carrying out step S21 alone cannot completely rule out that different elements of the third URL address set actually form a repeated URL pattern set (see Definition 1.4). In this step, therefore, the degree of difference between the web pages corresponding to the first and second URL address instances (measured by the number of differing bits between the corresponding SimHash codes) is compared, and if the degree of difference is less than a second set threshold, the first and second elements are merged. The second set threshold may be greater than the first set threshold.

One specific implementation of this element merging step is as follows:

from each element of the third URL address set, arbitrarily select one URL address, forming a fifth URL address set;

arbitrarily select one second characteristic part obtained in step S23;

form a first characteristic part from an arbitrary character or character string;

insert this first characteristic part and second characteristic part into each URL address in the fifth URL address set;

traverse the fifth URL address set; if the degree of difference between the web page corresponding to one URL address and the web page corresponding to another URL address in the fifth URL address set is less than the second set threshold, merge the element of the third URL address set corresponding to the one URL address with the element of the third URL address set corresponding to the other URL address.
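The final traversal can be sketched in Python as a pairwise comparison followed by a merge. The sketch assumes that the page fingerprint of one instantiated URL per element has already been obtained (fetching the pages is outside its scope) and uses a union-find structure so that merges chain transitively; the function name, the union-find choice, and the sample data are assumptions rather than details from the text.

```python
from typing import Dict, List


def merge_elements(elements: List[List[str]], fingerprints: List[int],
                   threshold: int) -> List[List[str]]:
    """Merge elements whose representative pages differ in fewer than `threshold` bits."""
    parent = list(range(len(elements)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            difference = bin(fingerprints[i] ^ fingerprints[j]).count("1")
            if difference < threshold:
                parent[find(j)] = find(i)

    merged: Dict[int, List[str]] = {}
    for i, element in enumerate(elements):
        merged.setdefault(find(i), []).extend(element)
    return list(merged.values())


elements = [["pattern-1"], ["pattern-2"], ["pattern-3"]]
fingerprints = [0b00000000, 0b00000011, 0b11111111]
print(merge_elements(elements, fingerprints, threshold=4))
```

The pairwise loop is quadratic in the number of elements, which is tolerable here because the third URL address set holds patterns rather than individual URLs.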

Step S25: for each element of the third URL address set, extract the common part of the URL addresses it contains, and take the common part as one element of the fourth URL address set, thereby forming the fourth URL address set.

Step S26: for a URL address to be downloaded that the web crawler has obtained, if this URL address matches any element of the fourth URL address set, judge that the corresponding web page has been downloaded.

Further, the implementation of step S23 may also include: traverse the third URL address set established so far; if a third element and a fourth element contain only a third URL address and a fourth URL address (that is, URL patterns), respectively, the third and fourth URL addresses differ only in their respective second characteristic parts, and the second characteristic parts corresponding to the third and fourth URL addresses each occur fewer than 5 times in the other elements of the third URL address set, then merge the third and fourth elements directly. Here, the third and fourth elements are any two different elements of the third URL address set.

As shown in Figure 3, the third embodiment of the invention provides a device for detecting repeated URLs, used in conjunction with a web crawler. It comprises: grouping unit 202, URL extraction unit 203, URL pattern generation unit 204, main URL pattern construction unit 205 and repeated URL detection unit 206.

Grouping unit 202 groups the URL addresses in a first URL address set so that the difference between the webpages corresponding to the URL addresses in the same group is less than a first set threshold. URL extraction unit 203 receives the output of grouping unit 202 and, for each group, generalizes the first characteristic part of each URL address in the group, takes the generalized URL addresses together as one element of a second URL address set, and merges and outputs these elements to form the second URL address set; the first characteristic part distinguishes each URL address in a group from the other URL addresses. URL pattern generation unit 204 receives the output of URL extraction unit 203 and, for the second URL address set, generalizes the second characteristic part of each URL address contained in each of its elements, and takes the generalized URL addresses together as one element of a third URL address set, thereby forming the third URL address set; the second characteristic part distinguishes the URL addresses contained in each element of the second URL address set from the URL addresses contained in the other elements. Main URL pattern construction unit 205 receives the output of URL pattern generation unit 204 and, for each element of the third URL address set, extracts the common part of the URL addresses it contains and takes that common part as one element of a fourth URL address set, thereby forming the fourth URL address set. Repeated URL detection unit 206 receives the output of main URL pattern construction unit 205 and a URL address to be downloaded that is obtained by the web crawler; if the URL address to be downloaded matches any element of the fourth URL address set, it judges that the corresponding webpage has already been downloaded; otherwise, it judges that the corresponding webpage has not been downloaded.

This device for detecting repeated URLs is used in conjunction with a web crawler. It may be a standalone device, or it may be combined with the web crawler into a computer-based control system. The device avoids repeated downloading of webpages and effectively improves the working efficiency of the web crawler.

As shown in Figure 4, the device for detecting repeated URLs provided by the fourth embodiment of the invention is an improvement on the device of the third embodiment. It comprises: database 301, grouping unit 302, URL extraction unit 303, URL pattern generation unit 304, main URL pattern construction unit 305 and repeated URL detection unit 306.

Grouping unit 302, URL extraction unit 303, URL pattern generation unit 304, main URL pattern construction unit 305 and repeated URL detection unit 306 have the same data processing methods and functions as the corresponding components of the third embodiment.

The difference is that the first URL address set is stored in database 301. This set may be the historical data of the web crawler's webpage downloads, and it serves as the training data for the device. Database 301 also stores the webpages in one-to-one correspondence with the elements of the first URL address set, so that the webpage content corresponding to a URL address can be obtained quickly.

After judging that the webpage corresponding to a URL address to be downloaded has not been downloaded, repeated URL detection unit 306 may store that URL address in the first URL address set in database 301. At the same time, the web crawler downloads the corresponding webpage, which is also stored in database 301, so that the training data used by the device is updated. Once the updates have accumulated to a certain amount, or after a certain period of time, a new learning process can be started using the method for detecting repeated URLs provided by the first or second embodiment of the invention. In this way, the device can add main URL patterns on its own, which improves the coverage and accuracy of detection.

Claims

1. A method for detecting repeated URLs, used to detect whether the webpage corresponding to a URL address obtained by a web crawler is identical or similar in content to a webpage that the web crawler has already downloaded, the method comprising the following steps:

a) grouping step: grouping the URL addresses in a first URL address set so that the difference between the webpages corresponding to the URL addresses in the same group is less than a first set threshold;

b) URL pattern extraction step: for each group, generalizing the first characteristic part of each URL address in the group, and taking the generalized URL addresses together as one element of a second URL address set, thereby forming the second URL address set; wherein the first characteristic part distinguishes each URL address in a group from the other URL addresses;

c) URL pattern generation step: for the second URL address set, generalizing the second characteristic part of each URL address contained in each of its elements, and taking the generalized URL addresses together as one element of a third URL address set, thereby forming the third URL address set; wherein the second characteristic part distinguishes the URL addresses contained in each element of the second URL address set from the URL addresses contained in the other elements;

d) main URL pattern construction step: for each element of the third URL address set, extracting the common part of the URL addresses it contains, and taking the common part as one element of a fourth URL address set, thereby forming the fourth URL address set;

e) repeated URL detection step: for a URL address to be downloaded that is obtained by the web crawler, if the URL address to be downloaded matches any element of the fourth URL address set, judging that the webpage corresponding to the URL address to be downloaded has been downloaded; otherwise, judging that the webpage corresponding to the URL address to be downloaded has not been downloaded.

2. The method according to claim 1, characterized in that, before step a), the method further comprises a URL address normalization step, so that each URL address in the first URL address set conforms to the RFC 3986 document formulated by the World Wide Web Consortium.

3. The method according to claim 2, characterized in that the URL address normalization step comprises at least one of the following sub-steps:

replacing the uppercase letters contained in each URL address in the first URL address set with the corresponding lowercase letters;

replacing the lowercase letters in the percent-encodings contained in each URL address in the first URL address set with the corresponding uppercase letters;

removing the default port number contained in each URL address in the first URL address set.
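By way of illustration only, the three normalization sub-steps might be sketched in Python as follows. The function name `normalize` and the `DEFAULT_PORTS` table are assumptions introduced for this sketch. Note one deliberate narrowing: the sketch lowercases only the scheme and host, which are the components RFC 3986 treats as case-insensitive, whereas the wording of the claim is broader. User information and IPv6 host literals are not handled.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed table of default ports; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}

_PERCENT = re.compile(r"%[0-9a-fA-F]{2}")


def _upper_percent(text: str) -> str:
    """Uppercase the hex digits of every percent-encoding in text."""
    return _PERCENT.sub(lambda m: m.group(0).upper(), text)


def normalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""          # urlsplit returns the host lowercased
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    else:
        netloc = host
    return urlunsplit((
        scheme,
        netloc,
        _upper_percent(parts.path),
        _upper_percent(parts.query),
        _upper_percent(parts.fragment),
    ))
```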

4. The method according to claim 1, characterized in that the difference between any two webpages is measured by the number of differing bits between the SimHash codes corresponding to the two webpages.
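The measure in this claim, the number of differing bits between two SimHash codes, can be sketched as follows. This is a simplified, unweighted SimHash written for illustration: the tokenization by whitespace, the use of MD5 as the per-token hash and the 64-bit width are all assumptions and are not specified in the patent.

```python
import hashlib


def simhash(tokens, bits: int = 64) -> int:
    """Unweighted SimHash of an iterable of string tokens."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    # Bit i of the result is set when the votes for that bit are positive.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def page_difference(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two SimHash codes differ."""
    return bin(code_a ^ code_b).count("1")
```

Two pages would then be placed in the same group when `page_difference(simhash(a.split()), simhash(b.split()))` is below the first set threshold.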

5. The method according to claim 1, characterized in that, after step a) and before step b), the method further comprises a preprocessing step: expressing each URL address in the first URL address set in the form of a data structure.

6. The method according to claim 5, characterized in that the preprocessing step specifically comprises:

i) deleting the transfer protocol part of the character string of each URL address in the first URL address set;

ii) identifying the address character string and the parameter character string contained in any URL address in the first URL address set;

iii) dividing the address character string into at least one address substring, with the "/" character as the separator; and dividing the parameter character string into at least one parameter substring, with the "&" character as the separator;

iv) extracting each address substring to form an address array, and extracting each parameter substring to form a parameter array, the address array and the parameter array together forming the data structure corresponding to this URL address;

v) repeating steps ii), iii) and iv), so that each URL address in the first URL address set is rewritten as its corresponding data structure.
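Steps i) to v) can be sketched in Python as follows. The names `UrlRecord` and `to_record` are hypothetical, and the sketch assumes that the address string and the parameter string are separated by the first "?" character, which the claim itself does not state.

```python
from typing import List, NamedTuple


class UrlRecord(NamedTuple):
    address: List[str]   # substrings of the address string, split on "/"
    params: List[str]    # substrings of the parameter string, split on "&"


def to_record(url: str) -> UrlRecord:
    # i) delete the transfer protocol part, if present
    _, sep, rest = url.partition("://")
    if not sep:
        rest = url
    # ii) identify the address string and the parameter string
    address_str, _, param_str = rest.partition("?")
    # iii) and iv) split each string and collect the non-empty pieces
    address = [s for s in address_str.split("/") if s]
    params = [s for s in param_str.split("&") if s]
    return UrlRecord(address, params)


def preprocess(urls) -> List[UrlRecord]:
    # v) repeat for every URL address in the set
    return [to_record(u) for u in urls]
```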

7. The method according to claim 1, characterized in that, after step c) and before step d), the method further comprises a URL address set element merging step:

traversing the third URL address set, applying a characterized expression to a first element and a second element therein to form a first URL address instance and a second URL address instance respectively, and comparing the difference between the webpages corresponding to the first and second URL address instances; if the difference is less than a second set threshold, merging the first and second elements; wherein the first and second elements are any two different elements of the third URL address set.

8. The method according to claim 7, characterized in that the URL address set element merging step specifically comprises:

from each element of the third URL address set, arbitrarily selecting one URL address, so as to form a fifth URL address set;

arbitrarily selecting one of the second characteristic parts obtained in step c);

forming a first characteristic part from an arbitrary character or character string;

inserting these first and second characteristic parts into each URL address in the fifth URL address set;

traversing the fifth URL address set; if the difference between the webpage corresponding to one URL address and the webpage corresponding to another URL address in the fifth URL address set is less than the second set threshold, merging the element of the third URL address set that corresponds to the one URL address with the element of the third URL address set that corresponds to the other URL address.

9. The method according to claim 1, characterized in that, in step b), the first characteristic part of each URL address is generalized by means of a regular expression.
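One possible reading of this generalization is sketched below: within a group, path segments that are identical across all URL addresses are kept literally, and segments that vary are replaced by a regular expression. The function name `generalize`, the choice of `[^/]+` as the replacement and the requirement that all URL addresses have the same number of segments are assumptions made for this sketch, not details taken from the patent.

```python
import re


def generalize(urls) -> str:
    """Build one regular expression covering a group of similar URLs."""
    split = [u.split("/") for u in urls]
    if len({len(segments) for segments in split}) != 1:
        raise ValueError("sketch assumes equal segment counts within a group")
    pieces = []
    for column in zip(*split):
        if len(set(column)) == 1:
            pieces.append(re.escape(column[0]))   # constant segment
        else:
            pieces.append("[^/]+")                # varying (first characteristic) part
    return "/".join(pieces)
```

For a group such as `shop.example.com/item/1.html` and `shop.example.com/item/22.html`, the resulting pattern also matches other URL addresses of the same shape, for example `shop.example.com/item/999.html`.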

10. The method according to claim 1, characterized in that, in step c), the second characteristic part is the field of the URL address that contains a commodity identifier.

11. The method according to claim 1, characterized in that, in step d), the common part of the URL addresses contained in an element of the third URL address set is the one with the shortest character length among those URL addresses.

12. The method according to claim 1, characterized in that, in step e), the URL address to be downloaded is matched against any element of the fourth URL address set by means of a greedy algorithm.

13. The method according to any one of claims 1 to 12, characterized in that the first URL address set comes from the historical data of the web crawler's webpage downloads.

14. A device for detecting repeated URLs, used in conjunction with a web crawler, comprising:

a grouping unit, which groups the URL addresses in a first URL address set so that the difference between the webpages corresponding to the URL addresses in the same group is less than a first set threshold;

a URL extraction unit, which receives the output of the grouping unit and, for each group, generalizes the first characteristic part of each URL address in the group, takes the generalized URL addresses together as one element of a second URL address set, and merges and outputs these elements to form the second URL address set; wherein the first characteristic part distinguishes each URL address in a group from the other URL addresses;

a URL pattern generation unit, which receives the output of the URL extraction unit and, for the second URL address set, generalizes the second characteristic part of each URL address contained in each of its elements, and takes the generalized URL addresses together as one element of a third URL address set, thereby forming the third URL address set; wherein the second characteristic part distinguishes the URL addresses contained in each element of the second URL address set from the URL addresses contained in the other elements;

a main URL pattern construction unit, which receives the output of the URL pattern generation unit and, for each element of the third URL address set, extracts the common part of the URL addresses it contains and takes the common part as one element of a fourth URL address set, thereby forming the fourth URL address set;

a repeated URL detection unit, which receives the output of the main URL pattern construction unit and a URL address to be downloaded that is obtained by the web crawler; if the URL address to be downloaded matches any element of the fourth URL address set, it judges that the webpage corresponding to the URL address to be downloaded has been downloaded; otherwise, it judges that the webpage corresponding to the URL address to be downloaded has not been downloaded.

15. The device for detecting repeated URLs according to claim 14, characterized in that it further comprises a database, the database storing the first URL address set and the webpages in one-to-one correspondence with the elements of the first URL address set; after judging that the webpage corresponding to the URL address to be downloaded has not been downloaded, the repeated URL detection unit stores the URL address to be downloaded in the database.