CN101102316A

CN101102316A - A method and system for removing duplicate webpages

Info

Publication number: CN101102316A
Application number: CNA2007101230528A
Authority: CN
Inventors: 文勖
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2007-06-22
Filing date: 2007-06-22
Publication date: 2008-01-09

Abstract

The method comprises: selecting a preset number of words from a candidate webpage; selecting, from a webpage collection, the webpage that contains the largest number of those words as the reference webpage; and, if the number of those words contained in the reference webpage is greater than a set threshold, treating the candidate webpage as a duplicate webpage. The invention also provides a corresponding system.

Description

A method and system for removing duplicate webpages

Technical field

The present invention relates to the field of webpage processing, and in particular to a method and system for removing duplicate webpages.

Background technology

With the rapid development of Internet technology, the number of webpages on the Internet keeps growing. According to statistics, the number of Chinese webpages has exceeded 10 billion, of which nearly 70% are duplicates, so duplicate webpages account for a very large share. How to remove duplicates effectively from such an enormous number of webpages is therefore a difficult problem for search engines. At present, duplicates are identified and removed by selecting a feature code in each webpage and comparing the feature codes.

Referring to Fig. 1, which is a flowchart of an existing method for removing duplicate webpages, the specific steps are as follows.

Step S101, choose a full stop in the reference webpage as the anchor point.

Because the body text of a webpage contains many full stops, one of them can be selected as the anchor point by a positioning rule.

Step S102, take a certain number of Chinese characters on both sides of the anchor point as the feature code.

For example, take 5 Chinese characters on each side of the anchor point to form the feature code.

Step S103, obtain a feature code from the candidate webpage in the same way.

Locate the anchor point in the candidate webpage in the same way, and take 5 Chinese characters on each side of that anchor point to form the feature code.

Step S104, if the feature code of the candidate webpage is identical to that of the reference webpage, judge the candidate webpage to be a duplicate.

If the feature code of the candidate webpage is identical to that of the reference webpage, the candidate webpage is judged to be a duplicate and the flow proceeds to step S105; if the feature codes differ, the candidate webpage is judged not to be a duplicate.

Step S105, delete the duplicate candidate webpage.

The above method removes duplicates effectively when the content of two webpages is exactly the same. However, duplicate webpages include not only webpages with identical content, but also webpages that merely add information of no substantive meaning or that differ only in words of no substantive meaning. If the candidate webpage happens to add words of no substantive meaning within the few Chinese characters near the anchor point, the two feature codes differ and the method treats the candidate webpage as a non-duplicate, so the accuracy of duplicate removal is low. Conversely, if the candidate webpage happens to match the reference webpage only in the few Chinese characters near the anchor point while the rest of its content is substantially different, the two feature codes are identical and the method deletes the candidate webpage as a duplicate, so the misjudgment rate of duplicate removal is too high.
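The first weakness described above can be reproduced with a minimal sketch of the feature-code scheme. The function name `feature_code`, the choice of the first full stop as the anchor point, and the sample sentences are assumptions made for illustration; the text only states that 5 Chinese characters are taken on each side of a chosen full stop.

```python
from typing import Optional


def feature_code(text: str, anchor_index: int = 0, width: int = 5,
                 stop: str = "。") -> Optional[str]:
    """Join `width` characters on each side of the chosen full stop.

    Assumption: the anchor is the `anchor_index`-th full stop in the text.
    """
    positions = [i for i, ch in enumerate(text) if ch == stop]
    if anchor_index >= len(positions):
        return None  # no such full stop, so no feature code
    p = positions[anchor_index]
    left = text[max(0, p - width):p]
    right = text[p + 1:p + 1 + width]
    return left + right


original = "今天天气非常好。我们一起去公园散步吧。"
# The same page with one filler character added just before the anchor.
near_copy = "今天天气非常好啊。我们一起去公园散步吧。"

# The two pages are near-identical, yet their feature codes differ.
same = feature_code(original) == feature_code(near_copy)
```

In this sketch `same` evaluates to `False`, so a single filler character next to the anchor is enough for the near-copy to be kept as a non-duplicate.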

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for removing duplicate webpages that effectively improves the accuracy of duplicate removal and reduces its misjudgment rate.

Another object of the present invention is to provide a system for removing duplicate webpages that effectively improves the accuracy of duplicate removal and reduces its misjudgment rate.

A method for removing duplicate webpages according to the present invention comprises: choosing a preset number of words in a candidate webpage; choosing, in a webpage collection, the webpage that contains the largest number of those words as the reference webpage; and, if the number of those words contained in the reference webpage is greater than a set threshold, treating the candidate webpage as a duplicate webpage.

Preferably, the method further comprises: if the number of those words contained in the reference webpage is less than the set threshold, adding the candidate webpage to the webpage collection.

Preferably, before the preset number of words is chosen in the candidate webpage, the method further comprises: labeling the attribute of each word in the candidate webpage, and filtering out the words whose attribute is stop word or function word.

Preferably, the preset number of words is chosen in the candidate webpage in descending order of weight.

Preferably, before the preset number of words is chosen in the candidate webpage, the method further comprises: taking the logarithm of the quotient of each word's training text count divided by the total training text count, and multiplying the result by the number of times the word occurs in the text of the candidate webpage, to obtain the weight of each word in the candidate webpage.

Preferably, the webpage containing the largest number of those words is chosen from the webpage collection as the reference webpage by the following steps: using the preset number of words as a query string to search the webpage collection; sorting the retrieved webpages in descending order of the number of those words they contain; and taking the first-ranked webpage as the reference webpage.

Preferably, before the candidate webpage is treated as a duplicate, the method further comprises: choosing the second-ranked webpage as the reference webpage; comparing the number of those words contained in this reference webpage with the set threshold; and, if that number is greater than the set threshold, confirming that the candidate webpage is a duplicate.

Preferably, before the candidate webpage is confirmed to be a duplicate, the method further comprises: choosing the subsequently ranked webpages in turn as the reference webpage; comparing the number of those words contained in this reference webpage with the set threshold; and, if that number is greater than the set threshold, confirming that the candidate webpage is a duplicate and treating it accordingly.

A system for removing duplicate webpages according to the present invention comprises a word selection unit, a reference webpage selection unit, a comparison unit, and a processing unit. The word selection unit is configured to choose a preset number of words in a candidate webpage; the reference webpage selection unit is configured to choose, in a webpage collection, the webpage containing the largest number of those words as the reference webpage; the comparison unit is configured to start the processing unit when the number of those words contained in the reference webpage is greater than a set threshold; and the processing unit is configured to treat the candidate webpage as a duplicate webpage.

Preferably, the system further comprises a weight calculation unit, configured to calculate the weight of each word and to send the result to the word selection unit; the word selection unit chooses the preset number of words in the candidate webpage in descending order of weight.

Compared with prior art, the present invention has the following advantages:

The present invention chooses a preset number of words in the candidate webpage, chooses the webpage in the webpage collection that contains the largest number of those words as the reference webpage, and, if the number of those words contained in the reference webpage is greater than the set threshold, treats the candidate webpage as a duplicate. By suitably increasing the preset number, more words take part in the comparison and the element of chance in the comparison is reduced, so that when duplicates whose content is not exactly identical are removed, the accuracy of duplicate removal is effectively improved and the misjudgment rate is reduced. The present invention can also tune the accuracy and the misjudgment rate of duplicate removal by raising or lowering the set threshold: raising the set threshold raises the accuracy, and lowering the set threshold raises the misjudgment rate. Therefore, compared with the prior art, which removes duplicates by simply comparing feature codes, the present invention can effectively improve the accuracy of duplicate removal and reduce the misjudgment rate by suitably adjusting the set threshold and the preset number.

The present invention chooses the preset number of words in the candidate webpage in descending order of weight. A large weight indicates that the word is highly relevant to the topic of the webpage and is therefore more representative. When duplicates whose content is not exactly identical are removed, comparing and judging on the basis of high-weight words further improves the accuracy of duplicate removal and reduces the misjudgment rate.

Description of drawings

Fig. 1 is a flowchart of an existing method for removing duplicate webpages;

Fig. 2 is a flowchart of the duplicate webpage removal method provided by the first embodiment of the present invention;

Fig. 3 is a flowchart of the duplicate webpage removal method provided by the second embodiment of the present invention;

Fig. 4 is a flowchart of the duplicate webpage removal method provided by the third embodiment of the present invention;

Fig. 5 is a schematic diagram of the duplicate webpage removal system provided by the fourth embodiment of the present invention;

Fig. 6 is a schematic diagram of the duplicate webpage removal system provided by the fifth embodiment of the present invention.

Embodiment

To make the above objects, features, and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

The present invention chooses a preset number of words in the candidate webpage, chooses the webpage in the webpage collection that contains the largest number of those words as the reference webpage, and, if the number of those words contained in the reference webpage is greater than the set threshold, treats the candidate webpage as a duplicate.

Referring to Fig. 2, which is a flowchart of the duplicate webpage removal method provided by the first embodiment of the present invention, the specific steps are as follows.

Step S201, choose a preset number of words in the candidate webpage.

In the duplicate removal process, the existing webpages first serve as the webpage collection; based on this collection, each webpage obtained afterwards is judged as to whether it is a duplicate. A webpage obtained afterwards is the candidate webpage. According to the accuracy required for duplicate removal, a preset number of words is chosen from the text of the candidate webpage. The preset number may range from 1 to 100.

For example, the existing webpage collection is {webpage A, webpage B, webpage C} and the candidate webpage is webpage D. Three words, a, b, and c, are chosen in webpage D.

Step S202, choose the webpage in the webpage collection that contains the largest number of those words as the reference webpage.

The webpage containing the largest number of those words is chosen from the webpage collection by comparison, search, or similar means, and serves as the reference webpage.

For example, the text of webpage A contains none of the words a, b, and c; the text of webpage B contains the two words a and b; the text of webpage C contains the three words a, b, and c. Because webpage C contains more of those words than webpage B and webpage A, webpage C is taken as the reference webpage.

Step S203, compare the number of those words contained in the reference webpage with the set threshold.

Obtain the number of those words contained in the reference webpage and compare it with the set threshold. The set threshold can be configured according to the accuracy required for duplicate removal, and may range from half of the preset number to the preset number.

For example, webpage C contains 3 of those words, and the set threshold is 2.

Of course, step S203 may instead calculate the ratio of the number of those words contained in the reference webpage to the preset number, and compare this ratio with the set threshold. In that case, the set threshold may range from 50% to 100%.
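The ratio variant of step S203 amounts to one division and one comparison. The following is a minimal sketch; the function names `hit_ratio` and `exceeds_ratio` are hypothetical, and the use of a strict "greater than" comparison follows the wording of step S204.

```python
def hit_ratio(hit_count: int, preset_number: int) -> float:
    """Share of the chosen words that the reference webpage contains."""
    if preset_number <= 0:
        raise ValueError("preset_number must be positive")
    return hit_count / preset_number


def exceeds_ratio(hit_count: int, preset_number: int,
                  ratio_threshold: float) -> bool:
    """True when the ratio is strictly greater than the threshold."""
    return hit_ratio(hit_count, preset_number) > ratio_threshold


# Reference webpage C contains 3 of the 3 chosen words; threshold is 50%.
flagged = exceeds_ratio(3, 3, 0.5)
```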

Step S204, if the number of those words contained in the reference webpage is greater than the set threshold, treat the candidate webpage as a duplicate.

If the number of those words contained in the reference webpage is greater than the set threshold, the candidate webpage is treated as a duplicate; if it is less than the set threshold, the candidate webpage is added to the webpage collection.

For example, webpage C contains 3 of those words, which is greater than the set threshold of 2, so the candidate webpage D is treated as a duplicate and deleted.
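Steps S201 to S204 can be sketched as follows, using the example of webpages A, B, C, and D. Modeling each webpage as a set of words, the function names, and breaking ties in favor of the first webpage encountered are assumptions of this sketch. The text does not say what happens when the count equals the threshold; here that case is treated as a non-duplicate.

```python
from typing import Dict, List, Optional, Set, Tuple


def count_hits(selected: List[str], page_words: Set[str]) -> int:
    """Number of the chosen words that appear in a webpage."""
    return sum(1 for word in selected if word in page_words)


def pick_reference(selected: List[str],
                   collection: Dict[str, Set[str]]) -> Tuple[Optional[str], int]:
    """Step S202: the webpage containing the most chosen words."""
    best_name, best_hits = None, -1
    for name, words in collection.items():
        hits = count_hits(selected, words)
        if hits > best_hits:  # ties keep the earlier webpage (assumption)
            best_name, best_hits = name, hits
    return best_name, max(best_hits, 0)


def is_duplicate(selected: List[str], collection: Dict[str, Set[str]],
                 threshold: int) -> bool:
    """Steps S203 and S204: compare the hit count with the threshold."""
    _, hits = pick_reference(selected, collection)
    return hits > threshold


collection = {"A": set(), "B": {"a", "b"}, "C": {"a", "b", "c"}}
selected = ["a", "b", "c"]  # words chosen in candidate webpage D
reference, hits = pick_reference(selected, collection)
duplicate = is_duplicate(selected, collection, threshold=2)
```

With these inputs the reference webpage is C with 3 hits, and webpage D is flagged as a duplicate, matching the worked example.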

By suitably increasing the preset number, the present invention lets more words take part in the comparison and reduces the element of chance in the comparison, so that when duplicates whose content is not exactly identical are removed, the accuracy of duplicate removal is effectively improved and the misjudgment rate is reduced. The present invention can also tune the accuracy and the misjudgment rate of duplicate removal by raising or lowering the set threshold: raising the set threshold raises the accuracy, and lowering the set threshold raises the misjudgment rate. Therefore, by suitably adjusting the set threshold and the preset number, the present invention can effectively improve the accuracy of duplicate removal and reduce the misjudgment rate.

To further improve the accuracy of duplicate removal and reduce its misjudgment rate, the present invention can choose, in the candidate webpage, the words that are most relevant to the topic of the webpage.

Referring to Fig. 3, which is a flowchart of the duplicate webpage removal method provided by the second embodiment of the present invention, the method comprises the following steps.

Step S301, label the attribute of each word in the candidate webpage, and filter out the words whose attribute is stop word or function word.

The attribute of each word in the candidate webpage is labeled. In the present invention, the attribute of a word is one of keyword, stop word, or function word. A keyword is a word with substantive meaning, such as "computer", "purchase", or "post"; stop words and function words are words without substantive meaning. The present invention has a built-in database that stores stop words and function words. The words in the text of the candidate webpage are compared with the words stored in the database, and any word in the text that matches a stored word is filtered out.
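The filtering in step S301 can be sketched as a set lookup. The contents of `STOP_AND_FUNCTION_WORDS` below are hypothetical English placeholders standing in for the built-in database, and the function name `filter_words` is likewise an assumption.

```python
from typing import Iterable, List, Set

# Hypothetical stand-in for the built-in stop word / function word database.
STOP_AND_FUNCTION_WORDS: Set[str] = {"the", "of", "a", "and", "is"}


def filter_words(words: Iterable[str],
                 stop_db: Set[str] = STOP_AND_FUNCTION_WORDS) -> List[str]:
    """Keep only the words that are not stored in the database."""
    return [word for word in words if word not in stop_db]


keywords = filter_words(["the", "computer", "of", "purchase", "post"])
```

Only the keywords survive, in their original order, and are passed on to the weighting step.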

Step S302, calculate the weight of each word in the candidate webpage.

The weight of a word represents how relevant the word is to the topic; a word with a high weight can be regarded as highly relevant to the topic of the webpage. The present invention calculates the weight of a word by the formula weight = TF × IDF, that is:

$$w(f_i, d) = TF(f_i, d) \times IDF(f_i) = N(f_i, d) \times \log\left(\frac{N(f_i)}{N}\right)$$

where $w(f_i, d)$ denotes the weight of word $f_i$ in the candidate webpage text $d$; $N(f_i, d)$ denotes the number of times word $f_i$ occurs in the candidate webpage text $d$; $N(f_i)$ denotes the number of training texts containing word $f_i$; and $N$ denotes the total number of training texts. $IDF(f_i)$ is a fixed value corresponding to word $f_i$: the number $N(f_i)$ of texts containing word $f_i$ is found by searching the $N$ training texts, and $IDF(f_i)$ is then computed as $\log(N(f_i)/N)$. $TF(f_i, d)$ is obtained by counting the number of times word $f_i$ occurs in the candidate webpage text $d$.
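The weight formula can be sketched directly. The function `weight_as_written` implements the formula exactly as stated in the text; note that the logarithm of N(f_i)/N is zero or negative whenever N(f_i) ≤ N, so weights computed this way are never positive. The conventional inverse document frequency uses the reciprocal, log(N / N(f_i)); `weight_conventional` is included only as a labeled alternative and is an assumption of this sketch, not something the text specifies. The natural logarithm is also an assumption, since the base is not given.

```python
import math


def weight_as_written(tf: int, doc_freq: int, total_docs: int) -> float:
    """w = N(f_i, d) * log(N(f_i) / N), as stated in the text."""
    return tf * math.log(doc_freq / total_docs)


def weight_conventional(tf: int, doc_freq: int, total_docs: int) -> float:
    """Assumed alternative: w = N(f_i, d) * log(N / N(f_i))."""
    return tf * math.log(total_docs / doc_freq)


# A word occurring 3 times in the page and found in 10 of 100 training texts.
w_written = weight_as_written(3, 10, 100)
w_conventional = weight_conventional(3, 10, 100)
```

The two values have the same magnitude and opposite sign, so the ranking of words by "largest weight" depends on which form is used.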

Step S303, choose the preset number of words in the candidate webpage in descending order of weight.

The words in the webpage text are sorted by weight, and the preset number of words is then chosen in descending order of weight.

For example, if the preset number is 5, the five words with the largest weights, a, b, c, d, and e, are chosen in the candidate webpage.
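Step S303 is a top-k selection. In the sketch below, `idf` is a hypothetical precomputed table of per-word IDF values, words missing from it get an IDF of 0.0, and ties are broken alphabetically; all three choices are assumptions, since the text does not cover them.

```python
from collections import Counter
from typing import Dict, List


def select_top_words(words: List[str], idf: Dict[str, float],
                     preset_number: int) -> List[str]:
    """Return the `preset_number` words with the largest TF * IDF weight."""
    tf = Counter(words)  # number of occurrences of each word in the page
    weights = {word: count * idf.get(word, 0.0) for word, count in tf.items()}
    # Sort by descending weight, then alphabetically for a stable result.
    ranked = sorted(weights, key=lambda word: (-weights[word], word))
    return ranked[:preset_number]


page_words = ["a", "a", "a", "b", "b", "c"]
idf_table = {"a": 1.0, "b": 1.0, "c": 1.0}  # hypothetical IDF values
top_two = select_top_words(page_words, idf_table, preset_number=2)
```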

Step S304, choose the webpage in the webpage collection that contains the largest number of those words as the reference webpage.

For example, in the webpage collection, the text of webpage A contains none of the words a, b, c, d, and e; the text of webpage B contains the three words a, b, and c; and the text of webpage C contains the four words a, b, c, and d. Because webpage C contains more of those words than webpage B and webpage A, webpage C is taken as the reference webpage.

Step S305, compare the number of those words contained in the reference webpage with the set threshold.

Step S306, if the number of those words contained in the reference webpage is greater than the set threshold, treat the candidate webpage as a duplicate; if it is less than the set threshold, add the candidate webpage to the webpage collection.

After judging that the candidate webpage is a duplicate, the present invention can also use other webpages in the webpage collection for further confirmation, so as to improve the accuracy of duplicate removal and reduce its misjudgment rate.

Referring to Fig. 4, which is a flowchart of the duplicate webpage removal method provided by the third embodiment of the present invention, the specific steps are as follows.

Step S401, choose a preset number of words in the candidate webpage.

For example, the ten words a, b, c, d, e, f, g, h, i, and j are chosen in the candidate webpage.

Step S402, use the preset number of words as a query string to search the webpage collection.

For example, the ten words a, b, c, d, e, f, g, h, i, and j are used as a query string to search the webpage collection.

Step S403, sort the retrieved webpages in descending order of the number of those words they contain.

For example, in the webpage collection, the text of webpage A contains none of those words, the text of webpage B contains the eight words a, b, c, d, e, f, g, and h, and the text of webpage C contains the nine words a, b, c, d, e, f, g, h, and i. The retrieved webpages are webpage B and webpage C; because webpage C contains more of those words than webpage B, the order is webpage C, webpage B.

Step S404, choose the first-ranked webpage as the reference webpage.

For example, webpage C is chosen as the reference webpage.
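Steps S402 to S404 can be sketched as a search followed by a sort. Modeling each webpage as a set of words and dropping webpages with zero matches from the result are assumptions of this sketch; the function name `rank_pages` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_pages(query_words: List[str],
               collection: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Return (webpage, hit count) pairs in descending order of hit count."""
    query = set(query_words)
    scored = []
    for name, words in collection.items():
        hits = len(query & words)
        if hits > 0:  # a webpage with no match is not retrieved
            scored.append((name, hits))
    # sorted() is stable, so equal counts keep their collection order.
    return sorted(scored, key=lambda item: -item[1])


collection = {
    "A": set(),
    "B": set("abcdefgh"),   # contains eight of the ten words
    "C": set("abcdefghi"),  # contains nine of the ten words
}
ranking = rank_pages(list("abcdefghij"), collection)
reference = ranking[0][0]  # step S404: first-ranked webpage
```

Webpage A is not retrieved, the order is C then B, and C becomes the reference webpage, as in the example.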

Step S405, compare the number of those words contained in the reference webpage with the set threshold.

For example, the set threshold is 7, and the reference webpage contains 9 of those words.

Step S406, if the number of those words contained in the reference webpage is greater than the set threshold, judge the candidate webpage to be a duplicate and proceed to step S407; if it is less than the set threshold, add the candidate webpage to the webpage collection.

For example, 9 > 7, so the candidate webpage is judged to be a duplicate.

Step S407, choose the second-ranked webpage as the new reference webpage.

For example, webpage B is selected as the reference webpage.

Step S408, compare the number of those words contained in the reference webpage with the set threshold.

For example, the set threshold is 7, and the reference webpage contains 8 of those words.

Step S409, if the number of those words contained in the reference webpage is greater than the set threshold, confirm that the candidate webpage is a duplicate; if it is less than the set threshold, add the candidate webpage to the webpage collection.

For example, 8 > 7, so the candidate webpage is confirmed to be a duplicate.

In the above steps, the present invention first judges whether the candidate webpage is a duplicate using the webpage in the collection that contains the most of those words, and then further confirms that the candidate webpage is a duplicate using the webpage that contains the second-largest number of those words, ensuring high accuracy and a low misjudgment rate in duplicate removal.

Of course, the present invention can also use the webpages in the collection that rank third, fourth, fifth, and so on by the number of those words they contain, taking each in turn as the reference webpage to further confirm that the candidate webpage is a duplicate, ensuring high accuracy and a low misjudgment rate in duplicate removal.
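The staged confirmation of steps S405 to S409, and its extension to further ranks, can be sketched as a check over the first few entries of the ranking. The parameter `depth` (how many ranked webpages must pass) and the handling of a ranking shorter than `depth` are assumptions: this sketch checks only the webpages that exist and returns `False` for an empty ranking.

```python
from typing import List


def confirm_duplicate(ranked_hits: List[int], threshold: int,
                      depth: int = 2) -> bool:
    """True when each of the first `depth` ranked webpages exceeds the threshold.

    `ranked_hits` holds hit counts in descending order, one per webpage.
    """
    checked = ranked_hits[:depth]
    return bool(checked) and all(hits > threshold for hits in checked)


# Webpage C contains 9 of the words and webpage B contains 8; threshold 7.
confirmed = confirm_duplicate([9, 8], threshold=7)
```

Because the counts are in descending order, the check fails as soon as one ranked webpage falls to or below the threshold, in which case the candidate webpage would be added to the collection rather than deleted.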

Based on the above method for removing duplicate webpages, the present invention also provides a system for removing duplicate webpages that effectively improves the accuracy of duplicate removal and reduces its misjudgment rate.

Referring to Fig. 5, which is a schematic diagram of the duplicate webpage removal system provided by the fourth embodiment of the present invention, the system comprises a word selection unit 51, a reference webpage selection unit 52, a comparison unit 53, and a processing unit 54.

The word selection unit 51 chooses a preset number of words in the candidate webpage and sends those words to the reference webpage selection unit 52. The preset number may range from 1 to 100. The word selection unit 51 may have a built-in database storing stop words and function words; it compares the words in the text of the candidate webpage with the words stored in the database and filters out any word in the text that matches a stored word.

The reference webpage selection unit 52 chooses the webpage in the webpage collection that contains the largest number of those words as the reference webpage, and sends this reference webpage to the comparison unit 53.

The comparison unit 53 obtains the number of those words contained in the reference webpage and compares it with the set threshold; when that number is greater than the set threshold, it starts the processing unit 54. The set threshold may range from half of the preset number to the preset number.

Of course, the comparison unit 53 may instead calculate the ratio of the number of those words contained in the reference webpage to the preset number and compare this ratio with the set threshold, starting the processing unit 54 when the ratio is greater than the set threshold. In that case, the set threshold may range from 50% to 100%.

The processing unit 54 treats the candidate webpage as a duplicate webpage.
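The cooperation of units 51 to 54 can be sketched as one small class with a method per unit. The class name `DedupSystem`, the method names, and the choice of the first distinct words as the "preset number of words" are assumptions of this sketch. Adding a non-duplicate webpage to the collection comes from the method embodiments rather than from the description of this system.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class DedupSystem:
    collection: Dict[str, Set[str]]  # existing webpages as sets of words
    preset_number: int
    threshold: int

    def select_words(self, page_words: List[str]) -> List[str]:
        """Unit 51: choose the first `preset_number` distinct words (assumption)."""
        distinct = list(dict.fromkeys(page_words))  # keeps first-seen order
        return distinct[:self.preset_number]

    def best_hit_count(self, selected: List[str]) -> int:
        """Unit 52: hit count of the webpage containing the most chosen words."""
        chosen = set(selected)
        return max((len(chosen & words) for words in self.collection.values()),
                   default=0)

    def exceeds_threshold(self, hits: int) -> bool:
        """Unit 53: decide whether the processing unit should start."""
        return hits > self.threshold

    def process(self, name: str, page_words: List[str]) -> str:
        """Unit 54 plus the add-to-collection branch of the method."""
        selected = self.select_words(page_words)
        if self.exceeds_threshold(self.best_hit_count(selected)):
            return "duplicate"
        self.collection[name] = set(page_words)
        return "added"


system = DedupSystem({"C": {"a", "b", "c"}}, preset_number=3, threshold=2)
first = system.process("D", ["a", "b", "c", "x"])
second = system.process("E", ["x", "y", "z"])
```

In this run webpage D is reported as a duplicate of the collection, while webpage E is added to it.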

By suitably increasing the preset number, this system lets more words take part in the comparison and reduces the element of chance in the comparison, so that when duplicates whose content is not exactly identical are removed, the accuracy of duplicate removal is effectively improved and the misjudgment rate is reduced.

Referring to Fig. 6, which is a schematic diagram of the duplicate webpage removal system provided by the fifth embodiment of the present invention, the system comprises a word selection unit 51, a reference webpage selection unit 52, a comparison unit 53, a processing unit 54, and a weight calculation unit 55.

The weight calculation unit 55 calculates the weight of each word in the text of the candidate webpage and sends the result to the word selection unit 51.

The formula for calculating the weight is:

$$w(f_i, d) = TF(f_i, d) \times IDF(f_i) = N(f_i, d) \times \log\left(\frac{N(f_i)}{N}\right)$$

where $w(f_i, d)$ denotes the weight of word $f_i$ in the candidate webpage text $d$; $N(f_i, d)$ denotes the number of times word $f_i$ occurs in the candidate webpage text $d$; $N(f_i)$ denotes the number of training texts containing word $f_i$; and $N$ denotes the total number of training texts.

The word selection unit 51 chooses the preset number of words in the candidate webpage in descending order of weight.

The functions and effects of the reference webpage selection unit 52, the comparison unit 53, and the processing unit 54 in this embodiment are the same as in the fourth embodiment shown in Fig. 5 and are not repeated here.

The method and system for removing duplicate webpages provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A method for removing duplicate webpages, characterized by comprising:

choosing a preset number of words in a candidate webpage;

choosing, in a webpage collection, the webpage that contains the largest number of those words as the reference webpage;

if the number of those words contained in the reference webpage is greater than a set threshold, treating the candidate webpage as a duplicate webpage.

2. The method according to claim 1, characterized by further comprising:

if the number of those words contained in the reference webpage is less than the set threshold, adding the candidate webpage to the webpage collection.

3. The method according to claim 1, characterized in that, before the preset number of words is chosen in the candidate webpage, the method further comprises:

labeling the attribute of each word in the candidate webpage, and filtering out the words whose attribute is stop word or function word.

4. The method according to claim 1, 2, or 3, characterized in that the preset number of words is chosen in the candidate webpage in descending order of weight.

5. The method according to claim 4, characterized in that, before the preset number of words is chosen in the candidate webpage, the method further comprises:

taking the logarithm of the quotient of each word's training text count divided by the total training text count, and multiplying the result by the number of times the word occurs in the text of the candidate webpage, to obtain the weight of each word in the candidate webpage.

6. The method according to claim 1, characterized in that the webpage containing the largest number of those words is chosen in the webpage collection as the reference webpage by the following steps:

using the preset number of words as a query string to search the webpage collection;

sorting the retrieved webpages in descending order of the number of those words they contain;

taking the first-ranked webpage as the reference webpage.

7. The method according to claim 6, characterized in that, before the candidate webpage is treated as a duplicate webpage, the method further comprises:

choosing the second-ranked webpage as the reference webpage;

comparing the number of those words contained in this reference webpage with the set threshold;

if the number of those words contained in this reference webpage is greater than the set threshold, confirming that the candidate webpage is a duplicate webpage.

8. The method according to claim 6 or 7, characterized in that, before the candidate webpage is confirmed to be a duplicate webpage, the method further comprises:

choosing the subsequently ranked webpages in turn as the reference webpage;

if the number of those words contained in this reference webpage is greater than the set threshold, confirming that the candidate webpage is a duplicate webpage and treating it accordingly.

9. A system for removing duplicate webpages, characterized by comprising a word selection unit, a reference webpage selection unit, a comparison unit, and a processing unit:

the word selection unit is configured to choose a preset number of words in a candidate webpage;

the reference webpage selection unit is configured to choose, in a webpage collection, the webpage that contains the largest number of those words as the reference webpage;

the comparison unit is configured to start the processing unit when the number of those words contained in the reference webpage is greater than a set threshold;

the processing unit is configured to treat the candidate webpage as a duplicate webpage.

10. The system according to claim 9, characterized by further comprising a weight calculation unit, configured to calculate the weight of each word and to send the result to the word selection unit;

the word selection unit chooses the preset number of words in the candidate webpage in descending order of weight.