CN101102316A - A method and system for removing duplicate webpages - Google Patents

Publication number
CN101102316A
Authority
CN
China
Prior art keywords
webpage
word
benchmark
alternative
mentioned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007101230528A
Other languages
Chinese (zh)
Inventor
文勖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNA2007101230528A
Publication of CN101102316A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The method comprises: choosing a predetermined number of words from an alternative webpage; choosing, from a webpage collection, the webpage that contains the largest number of those words as the benchmark webpage; and, if the number of those words contained in the benchmark webpage is greater than a set threshold, processing the alternative webpage as a duplicate page. The invention also provides a corresponding system.

Description

A method and system for removing duplicate webpages
Technical field
The present invention relates to the field of webpage processing, and in particular to a method and system for removing duplicate webpages.
Background
With the rapid development of Internet technology, the number of webpages on the Internet keeps growing. According to statistics, there are more than 10 billion Chinese webpages, of which nearly 70% are duplicate pages, a very large proportion. How to remove duplicate pages effectively from such an enormous number of webpages is therefore a difficult problem for search engines. At present, duplicate pages are identified and removed by selecting a feature code in each webpage and comparing the feature codes.
Referring to Fig. 1, a flowchart of the existing method for removing duplicate webpages, the specific steps are as follows.
Step S101: choose a full stop in the benchmark webpage as the anchor point.
Because several full stops occur in the body text of a webpage, one of them can be selected as the anchor point by a positioning rule.
Step S102: choose a certain number of Chinese characters on both sides of the anchor point as the feature code.
For example, choose 5 Chinese characters on each side of the anchor point to form the feature code.
Step S103: obtain the feature code of the alternative webpage in the same way.
The anchor point is located in the alternative webpage in the same way, and 5 Chinese characters on each side of it form the feature code.
Step S104: if the feature code of the alternative webpage is identical to that of the benchmark webpage, judge the alternative webpage to be a duplicate page.
If the two feature codes are identical, the alternative webpage is judged to be a duplicate page and the method proceeds to step S105; if they differ, the alternative webpage is judged not to be a duplicate page.
Step S105: delete the duplicate alternative webpage.
The above method removes duplicate pages effectively when the contents of the two webpages are completely identical. However, duplicate pages include not only pages with completely identical content but also pages that merely add information of no substantive meaning, or that differ only in words of no substantive meaning. If the alternative webpage merely adds a word of no substantive meaning among the few Chinese characters near the anchor point, the feature codes of the two pages differ and the method treats the alternative webpage as a non-duplicate page, so the accuracy of duplicate removal is low. Conversely, if the alternative webpage happens to share the few Chinese characters near the anchor point with the benchmark webpage while the rest of its content is substantially different, the feature codes are identical and the method deletes the alternative webpage as a duplicate page, so the misjudgment rate is too high.
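The two failure modes of the existing method can be reproduced with a minimal Python sketch of the feature-code comparison. The function names, the choice of the first full stop as the anchor point, and the ASCII test strings are assumptions made for illustration; the text only specifies a full stop as the anchor point and a fixed number of characters on each side.

```python
from typing import Optional


def feature_code(text: str, stop: str = "。", width: int = 5) -> Optional[str]:
    """Return `width` characters on each side of the first full stop.

    Assumption: the first full stop is the anchor point; the text leaves
    the positioning rule open.
    """
    pos = text.find(stop)
    if pos == -1:
        return None  # no anchor point available
    left = text[max(0, pos - width):pos]
    right = text[pos + len(stop):pos + len(stop) + width]
    return left + right


def same_feature_code(benchmark: str, alternative: str, stop: str = "。") -> bool:
    """Prior-art test: duplicate if and only if the feature codes match."""
    code = feature_code(benchmark, stop)
    return code is not None and code == feature_code(alternative, stop)


benchmark = "abcdefghij.klmnopqrst"
# The same page with one filler character inserted near the anchor point.
near_copy = "abcdefghXij.klmnopqrst"
# A different page that happens to share the text around the anchor point.
unrelated = "zzzzzfghij.klmnoyyyyy"

print(same_feature_code(benchmark, near_copy, stop="."))  # duplicate is missed
print(same_feature_code(benchmark, unrelated, stop="."))  # unrelated page matches
```

A single inserted character is enough to change the feature code, while an unrelated page with the same ten characters around the anchor point is indistinguishable from a true duplicate.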
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for removing duplicate webpages that effectively improves the accuracy of duplicate removal and reduces its misjudgment rate.
Another object of the present invention is to provide a system for removing duplicate webpages that effectively improves the accuracy of duplicate removal and reduces its misjudgment rate.
A method for removing duplicate webpages according to the present invention comprises: choosing a predetermined number of words in an alternative webpage; choosing, in a webpage collection, the webpage that contains the largest number of those words as the benchmark webpage; and, if the number of those words contained in the benchmark webpage is greater than a set threshold, processing the alternative webpage as a duplicate page.
Preferably, the method further comprises: if the number of those words contained in the benchmark webpage is less than the set threshold, adding the alternative webpage to the webpage collection.
Preferably, before the predetermined number of words is chosen in the alternative webpage, the method further comprises: labeling the attribute of each word in the alternative webpage, and filtering out the words whose attribute is stop word or function word.
Preferably, the predetermined number of words is chosen in the alternative webpage in descending order of weight.
Preferably, before the predetermined number of words is chosen in the alternative webpage, the method further comprises: taking the logarithm of the quotient of each word's training text number divided by the total training text number, and multiplying the result by the number of times the word occurs in the text of the alternative webpage, to obtain the weight of each word in the alternative webpage.
Preferably, the webpage containing the largest number of those words is chosen in the webpage collection as the benchmark webpage by the following steps: using the predetermined number of words as a query string to search the webpage collection; sorting the retrieved webpages in descending order of the number of those words they contain; and taking the first-ranked webpage as the benchmark webpage.
Preferably, before the alternative webpage is processed as a duplicate page, the method further comprises: choosing the second-ranked webpage as the benchmark webpage; comparing the number of those words contained in this benchmark webpage with the set threshold; and, if that number is greater than the set threshold, confirming that the alternative webpage is a duplicate page.
Preferably, before the alternative webpage is confirmed to be a duplicate page, the method further comprises: choosing the lower-ranked webpages in turn as the benchmark webpage; comparing the number of those words contained in this benchmark webpage with the set threshold; and, if that number is greater than the set threshold, confirming that the alternative webpage is a duplicate page and processing it accordingly.
A system for removing duplicate webpages according to the present invention comprises a word selection unit, a benchmark webpage selection unit, a comparison unit, and a processing unit. The word selection unit is configured to choose a predetermined number of words in an alternative webpage. The benchmark webpage selection unit is configured to choose, in a webpage collection, the webpage containing the largest number of those words as the benchmark webpage. The comparison unit is configured to start the processing unit when the number of those words contained in the benchmark webpage is greater than a set threshold. The processing unit is configured to process the alternative webpage as a duplicate page.
Preferably, the system further comprises a weight calculation unit configured to calculate the weight of each of those words and to send the result to the word selection unit; the word selection unit chooses the predetermined number of words in the alternative webpage in descending order of weight.
Compared with the prior art, the present invention has the following advantages:
The present invention chooses a predetermined number of words in the alternative webpage, chooses in the webpage collection the webpage containing the largest number of those words as the benchmark webpage, and, if the number of those words contained in the benchmark webpage is greater than the set threshold, processes the alternative webpage as a duplicate page. By suitably increasing the predetermined number, the invention increases the number of words taking part in the comparison and reduces the role of chance, so that when removing duplicate pages whose content is not completely identical it effectively improves accuracy and reduces the misjudgment rate. The accuracy and the misjudgment rate can also be tuned by raising or lowering the set threshold: raising the set threshold improves accuracy, while lowering it raises the misjudgment rate. Whereas the prior art removes duplicate webpages by a simple comparison of feature codes, the present invention can effectively improve accuracy and reduce the misjudgment rate by suitably adjusting the set threshold and the predetermined number.
The present invention chooses the predetermined number of words in the alternative webpage in descending order of weight. A large weight indicates that a word is highly relevant to the topic of the webpage and is therefore more representative. When removing duplicate pages whose content is not completely identical, comparing and judging on high-weight words further improves accuracy and reduces the misjudgment rate.
Brief description of the drawings
Fig. 1 is a flowchart of the existing method for removing duplicate webpages;
Fig. 2 is a flowchart of the method for removing duplicate webpages provided by the first embodiment of the present invention;
Fig. 3 is a flowchart of the method for removing duplicate webpages provided by the second embodiment of the present invention;
Fig. 4 is a flowchart of the method for removing duplicate webpages provided by the third embodiment of the present invention;
Fig. 5 is a schematic diagram of the system for removing duplicate webpages provided by the fourth embodiment of the present invention;
Fig. 6 is a schematic diagram of the system for removing duplicate webpages provided by the fifth embodiment of the present invention.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention chooses a predetermined number of words in the alternative webpage, chooses in the webpage collection the webpage containing the largest number of those words as the benchmark webpage, and, if the number of those words contained in the benchmark webpage is greater than the set threshold, processes the alternative webpage as a duplicate page.
Referring to Fig. 2, a flowchart of the method for removing duplicate webpages provided by the first embodiment of the present invention, the specific steps are as follows.
Step S201: choose a predetermined number of words in the alternative webpage.
In the duplicate-removal process, the existing webpages first serve as the webpage collection; each webpage obtained afterwards is then judged against this collection to determine whether it is a duplicate. A webpage obtained afterwards is the alternative webpage. A predetermined number of words is chosen from the text of the alternative webpage according to the accuracy required of duplicate removal. The predetermined number may range from 1 to 100.
For example, the existing webpage collection is {webpage A, webpage B, webpage C} and the alternative webpage is webpage D. Three words a, b, and c are chosen in webpage D.
Step S202: choose, in the webpage collection, the webpage containing the largest number of those words as the benchmark webpage.
The webpage containing the largest number of those words is found in the webpage collection by comparison, searching, or similar means, and serves as the benchmark webpage.
For example, the text of webpage A contains none of the words a, b, and c; the text of webpage B contains the two words a and b; the text of webpage C contains all three words a, b, and c. Because webpage C contains more of the words than webpage B or webpage A, webpage C is taken as the benchmark webpage.
Step S203: compare the number of those words contained in the benchmark webpage with the set threshold.
The number of those words contained in the benchmark webpage is extracted and compared with the set threshold. The set threshold can be chosen according to the accuracy required of duplicate removal, and may range from half the predetermined number to the predetermined number.
For example, webpage C contains 3 of the words and the set threshold is 2.
Alternatively, step S203 may calculate the ratio of the number of those words contained in the benchmark webpage to the predetermined number and compare this ratio with the set threshold. In that case the set threshold may range from 50% to 100%.
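The two ways of applying the threshold in step S203 can be sketched in Python as follows. The function names are hypothetical, and treating a value equal to the threshold as not exceeding it is an assumption, since the text only covers the greater-than and less-than cases.

```python
def exceeds_count_threshold(matched: int, threshold: int) -> bool:
    """Absolute form: compare the number of matched words with the threshold."""
    return matched > threshold


def exceeds_ratio_threshold(matched: int, predetermined_number: int,
                            threshold_ratio: float) -> bool:
    """Ratio form: compare matched / predetermined_number with the threshold."""
    if predetermined_number <= 0:
        raise ValueError("predetermined_number must be positive")
    return matched / predetermined_number > threshold_ratio


# Worked example from the text: 3 words chosen, webpage C contains all 3.
print(exceeds_count_threshold(3, 2))       # threshold given as a count
print(exceeds_ratio_threshold(3, 3, 0.5))  # threshold given as a ratio
```

The ratio form keeps the threshold meaningful when the predetermined number is changed, because a setting such as 50% does not need to be recomputed.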
Step S204: if the number of those words contained in the benchmark webpage is greater than the set threshold, process the alternative webpage as a duplicate page.
If the number of those words contained in the benchmark webpage is greater than the set threshold, the alternative webpage is processed as a duplicate page; if it is less than the set threshold, the alternative webpage is added to the webpage collection.
For example, webpage C contains 3 of the words, which is greater than the set threshold of 2, so the alternative webpage D is treated as a duplicate page and deleted.
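Steps S201 to S204 can be sketched in Python using the worked example of webpages A, B, C, and D. This is a minimal sketch under stated assumptions: each page is represented as a set of already segmented words, the function names are hypothetical, and a count equal to the threshold is treated as a non-duplicate because the text does not cover that case.

```python
from typing import Dict, List, Optional, Set, Tuple


def choose_benchmark(words: List[str],
                     collection: Dict[str, Set[str]]) -> Tuple[Optional[str], int]:
    """Step S202: find the page containing the most of the chosen words."""
    best_name, best_count = None, 0
    for name, page_words in collection.items():
        count = sum(1 for w in set(words) if w in page_words)
        if best_name is None or count > best_count:
            best_name, best_count = name, count
    return best_name, best_count


def process_alternative(name: str, page_words: Set[str], chosen_words: List[str],
                        collection: Dict[str, Set[str]], threshold: int) -> str:
    """Steps S203 and S204: compare with the threshold and act on the result."""
    _, count = choose_benchmark(chosen_words, collection)
    if count > threshold:
        return "duplicate"         # the alternative webpage is dropped
    collection[name] = page_words  # assumption: ties count as non-duplicates
    return "added"


collection = {
    "A": {"x", "y", "z"},
    "B": {"a", "b", "x"},
    "C": {"a", "b", "c"},
}
chosen = ["a", "b", "c"]  # words chosen in webpage D (step S201)
print(choose_benchmark(chosen, collection))
print(process_alternative("D", {"a", "b", "c", "k"}, chosen, collection, threshold=2))
```

Webpage C is selected as the benchmark webpage with 3 matching words, which exceeds the threshold of 2, so webpage D is reported as a duplicate and the collection is left unchanged.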
By suitably increasing the predetermined number, the present invention increases the number of words taking part in the comparison and reduces the role of chance, so that when removing duplicate pages whose content is not completely identical it effectively improves accuracy and reduces the misjudgment rate. The accuracy and the misjudgment rate can also be tuned by raising or lowering the set threshold: raising the set threshold improves accuracy, while lowering it raises the misjudgment rate. The present invention can therefore effectively improve accuracy and reduce the misjudgment rate by suitably adjusting the set threshold and the predetermined number.
To further improve the accuracy of duplicate removal and reduce its misjudgment rate, the present invention can choose in the alternative webpage the words that are most relevant to the topic of the webpage.
Referring to Fig. 3, a flowchart of the method for removing duplicate webpages provided by the second embodiment of the present invention, the method comprises the following steps.
Step S301: label the attribute of each word in the alternative webpage, and filter out the words whose attribute is stop word or function word.
The attribute of each word in the alternative webpage is labeled. In the present invention, word attributes fall into keyword, stop word, and function word. A keyword is a word with substantive meaning, such as "computer", "purchase", or "post"; stop words and function words are words with no substantive meaning. The invention has a built-in database of stop words and function words; each word in the text of the alternative webpage is compared with the words stored in the database, and any word that matches a stored word is filtered out.
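The filtering in step S301 amounts to a set-membership test against the built-in database. The following Python sketch assumes the page text has already been segmented into words; the English word list is a hypothetical stand-in for the database of stop words and function words, whose contents the text does not give.

```python
from typing import Iterable, List, Set

# Hypothetical stand-in for the built-in database of stop words and function words.
STOP_AND_FUNCTION_WORDS: Set[str] = {"the", "of", "and", "a", "to"}


def filter_words(words: Iterable[str],
                 database: Set[str] = STOP_AND_FUNCTION_WORDS) -> List[str]:
    """Step S301: drop every word that matches an entry in the database."""
    return [w for w in words if w not in database]


page = ["purchase", "of", "a", "computer", "and", "the", "post"]
print(filter_words(page))
```

Only the keywords remain after filtering, so the later weighting and selection steps never spend part of the predetermined number on words that appear in almost every page.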
Step S302: calculate the weight of each word in the alternative webpage.
The weight of a word represents its relevance to the topic; a word with a high weight can be regarded as highly relevant to the topic of the webpage. The present invention calculates the weight of a word by the formula weight = TF × IDF, that is:
$w(f_i, d) = TF(f_i, d) \times IDF(f_i) = N(f_i, d) \times \log\bigl(N(f_i)/N\bigr)$
where $w(f_i, d)$ is the weight of word $f_i$ in the text $d$ of the alternative webpage; $N(f_i, d)$ is the number of times word $f_i$ occurs in $d$; $N(f_i)$ is the training text number of word $f_i$, that is, the number of training texts that contain it; and $N$ is the total number of training texts. $IDF(f_i)$ is a fixed value for word $f_i$: the number $N(f_i)$ of texts containing $f_i$ is counted among the $N$ training texts, and the value is then obtained from $\log(N(f_i)/N)$. $TF(f_i, d)$ is obtained by counting the number of times $f_i$ occurs in $d$.
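The weight formula of step S302 can be evaluated directly, as in the following Python sketch. It follows the formula exactly as written in the text. The function name, the use of the natural logarithm, and the skipping of words that occur in no training text are assumptions. Note that because $N(f_i) \le N$, the logarithm as written is never positive; the more common IDF convention uses the inverse ratio $\log(N / N(f_i))$.

```python
import math
from collections import Counter
from typing import Dict, List


def word_weights(page_words: List[str], training_counts: Dict[str, int],
                 total_training_texts: int) -> Dict[str, float]:
    """w(f_i, d) = N(f_i, d) * log(N(f_i) / N), as written in step S302."""
    occurrences = Counter(page_words)  # N(f_i, d) for every word in the page
    weights = {}
    for word, n_in_page in occurrences.items():
        n_texts = training_counts.get(word, 0)  # N(f_i)
        if n_texts == 0:
            continue  # assumption: skip words absent from the training texts
        weights[word] = n_in_page * math.log(n_texts / total_training_texts)
    return weights


# "a" occurs twice and appears in 10 of 100 training texts;
# "b" occurs once and appears in all 100 training texts.
weights = word_weights(["a", "a", "b"], {"a": 10, "b": 100}, 100)
print(weights)
```

Because the IDF term depends only on the training texts, it can be computed once per word and reused for every alternative webpage, which matches the text's description of it as a fixed value.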
Step S303: choose the predetermined number of words in the alternative webpage in descending order of weight.
The words in the text of the webpage are sorted by weight, and the predetermined number of words is then chosen in descending order.
For example, if the predetermined number is 5, the five words a, b, c, d, and e with the largest weights are chosen in the alternative webpage.
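The selection in step S303 is a sort followed by a cut at the predetermined number. In the Python sketch below, the function name and the weight values are illustrative assumptions, and ties are broken alphabetically only to keep the result deterministic.

```python
from typing import Dict, List


def choose_words(weights: Dict[str, float], predetermined_number: int) -> List[str]:
    """Step S303: take the words with the largest weights."""
    # Assumption: equal weights are ordered alphabetically.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:predetermined_number]


# Illustrative weights for seven words in an alternative webpage.
weights = {"a": 9.0, "b": 7.5, "c": 6.0, "d": 4.2, "e": 3.1, "f": 0.8, "g": 0.3}
print(choose_words(weights, 5))
```

With a predetermined number of 5, the words f and g are left out of the comparison because their weights are the smallest.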
Step S304: choose, in the webpage collection, the webpage containing the largest number of those words as the benchmark webpage.
For example, in the webpage collection the text of webpage A contains none of the words a, b, c, d, and e; the text of webpage B contains the three words a, b, and c; the text of webpage C contains the four words a, b, c, and d. Because webpage C contains more of the words than webpage B or webpage A, webpage C is taken as the benchmark webpage.
Step S305: compare the number of those words contained in the benchmark webpage with the set threshold.
Step S306: if the number of those words contained in the benchmark webpage is greater than the set threshold, process the alternative webpage as a duplicate page; if it is less than the set threshold, add the alternative webpage to the webpage collection.
The present invention chooses the predetermined number of words in the alternative webpage in descending order of weight. A large weight indicates that a word is highly relevant to the topic of the webpage and is therefore more representative. When removing duplicate pages whose content is not completely identical, comparing and judging on high-weight words further improves accuracy and reduces the misjudgment rate.
After judging that the alternative webpage is a duplicate page, the present invention can also use other webpages in the webpage collection for further confirmation, improving the accuracy of duplicate removal and reducing its misjudgment rate.
Referring to Fig. 4, a flowchart of the method for removing duplicate webpages provided by the third embodiment of the present invention, the specific steps are as follows.
Step S401: choose a predetermined number of words in the alternative webpage.
For example, the ten words a, b, c, d, e, f, g, h, i, and j are chosen in the alternative webpage.
Step S402: use the predetermined number of words as a query string to search the webpage collection.
For example, the ten words a, b, c, d, e, f, g, h, i, and j are used as a query string to search the webpage collection.
Step S403: sort the retrieved webpages in descending order of the number of those words they contain.
For example, in the webpage collection the text of webpage A contains none of those words, the text of webpage B contains the eight words a, b, c, d, e, f, g, and h, and the text of webpage C contains the nine words a, b, c, d, e, f, g, h, and i. The retrieved webpages are webpage B and webpage C; because webpage C contains more of the words than webpage B, the order is webpage C, webpage B.
Step S404: choose the first-ranked webpage as the benchmark webpage.
For example, webpage C is chosen as the benchmark webpage.
Step S405: compare the number of those words contained in the benchmark webpage with the set threshold.
For example, the set threshold is 7 and the benchmark webpage contains 9 of the words.
Step S406: if the number of those words contained in the benchmark webpage is greater than the set threshold, judge the alternative webpage to be a duplicate page and proceed to step S407; if it is less than the set threshold, add the alternative webpage to the webpage collection.
For example, 9 > 7, so the alternative webpage is judged to be a duplicate page.
Step S407: choose the second-ranked webpage as the benchmark webpage.
For example, webpage B is chosen as the benchmark webpage.
Step S408: compare the number of those words contained in the benchmark webpage with the set threshold.
For example, the set threshold is 7 and the benchmark webpage contains 8 of the words.
Step S409: if the number of those words contained in the benchmark webpage is greater than the set threshold, confirm that the alternative webpage is a duplicate page; if it is less than the set threshold, add the alternative webpage to the webpage collection.
For example, 8 > 7, so the alternative webpage is confirmed to be a duplicate page.
In the above steps, the present invention first judges whether the alternative webpage is a duplicate page using the webpage in the collection that contains the most of those words, and then confirms the judgment using the webpage that contains the second most, ensuring high accuracy and a low misjudgment rate.
The present invention can also take the webpages ranked third, fourth, fifth, and so on by the number of those words they contain as the benchmark webpage in turn, to further confirm that the alternative webpage is a duplicate page and ensure high accuracy and a low misjudgment rate.
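The retrieval, ranking, and successive confirmation of the third embodiment can be sketched in Python with the example of webpages A, B, and C. The function names and the `confirmations` parameter are hypothetical. Two behaviors are assumptions because the text does not cover them: when fewer webpages are retrieved than confirmations requested, the retrieved webpages alone decide, and when nothing is retrieved the alternative webpage is not a duplicate.

```python
from typing import Dict, List, Set, Tuple


def retrieve_and_rank(query: List[str],
                      collection: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Steps S402 and S403: retrieve pages containing any query word, then rank."""
    hits = []
    for name, page_words in collection.items():
        count = len(set(query) & page_words)
        if count > 0:
            hits.append((name, count))
    # Descending by count; the page name breaks ties (an assumption).
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


def is_confirmed_duplicate(query: List[str], collection: Dict[str, Set[str]],
                           threshold: int, confirmations: int = 2) -> bool:
    """Steps S404 to S409: each of the top-ranked pages must exceed the threshold."""
    top = retrieve_and_rank(query, collection)[:confirmations]
    return bool(top) and all(count > threshold for _, count in top)


query = list("abcdefghij")       # the ten words chosen in step S401
collection = {
    "A": {"x", "y", "z"},        # contains none of the words
    "B": set("abcdefgh"),        # contains eight of the words
    "C": set("abcdefghi"),       # contains nine of the words
}
print(retrieve_and_rank(query, collection))
print(is_confirmed_duplicate(query, collection, threshold=7))
```

Raising `confirmations` corresponds to also checking the third, fourth, and later ranked webpages; each extra check makes a duplicate verdict harder to reach.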
Based on the above method, the present invention also provides a system for removing duplicate webpages that effectively improves the accuracy of duplicate removal and reduces its misjudgment rate.
Referring to Fig. 5, a schematic diagram of the system for removing duplicate webpages provided by the fourth embodiment of the present invention, the system comprises a word selection unit 51, a benchmark webpage selection unit 52, a comparison unit 53, and a processing unit 54.
The word selection unit 51 chooses a predetermined number of words in the alternative webpage and sends those words to the benchmark webpage selection unit 52. The predetermined number may range from 1 to 100. The word selection unit 51 may have a built-in database of stop words and function words; it compares each word in the text of the alternative webpage with the words stored in the database and filters out any word that matches.
The benchmark webpage selection unit 52 chooses, in the webpage collection, the webpage containing the largest number of those words as the benchmark webpage and sends this benchmark webpage to the comparison unit 53.
The comparison unit 53 extracts the number of those words contained in the benchmark webpage, compares it with the set threshold, and starts the processing unit 54 when the number is greater than the set threshold. The set threshold may range from half the predetermined number to the predetermined number.
Alternatively, the comparison unit 53 may calculate the ratio of the number of those words contained in the benchmark webpage to the predetermined number, compare this ratio with the set threshold, and start the processing unit 54 when the ratio is greater than the set threshold. In that case the set threshold may range from 50% to 100%.
The processing unit 54 processes the alternative webpage as a duplicate page.
By suitably increasing the predetermined number, the system increases the number of words taking part in the comparison and reduces the role of chance, so that when removing duplicate pages whose content is not completely identical it effectively improves accuracy and reduces the misjudgment rate.
Referring to Fig. 6, a schematic diagram of the system for removing duplicate webpages provided by the fifth embodiment of the present invention, the system comprises a word selection unit 51, a benchmark webpage selection unit 52, a comparison unit 53, a processing unit 54, and a weight calculation unit 55.
The weight calculation unit 55 calculates the weight of each word in the text of the alternative webpage and sends the result to the word selection unit 51.
The formula for calculating the weight is:
$w(f_i, d) = TF(f_i, d) \times IDF(f_i) = N(f_i, d) \times \log\bigl(N(f_i)/N\bigr)$
where $w(f_i, d)$ is the weight of word $f_i$ in the text $d$ of the alternative webpage; $N(f_i, d)$ is the number of times word $f_i$ occurs in $d$; $N(f_i)$ is the training text number of word $f_i$; and $N$ is the total number of training texts.
The word selection unit 51 chooses the predetermined number of words in the alternative webpage in descending order of weight.
The benchmark webpage selection unit 52, the comparison unit 53, and the processing unit 54 have the same functions and effects in this embodiment as in the fourth embodiment described above, and are not described again.
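The division of work among units 51 to 55 can be sketched as a set of small Python classes. The class and method names are hypothetical, the weight calculation unit 55 is modeled as an optional callable passed to the word selection unit, and pages are assumed to be already segmented into words.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Role of the weight calculation unit 55: map page words to weights.
WeightFn = Callable[[List[str]], Dict[str, float]]


class WordSelectionUnit:  # unit 51
    def __init__(self, predetermined_number: int,
                 stop_words: Optional[Set[str]] = None,
                 weight_fn: Optional[WeightFn] = None) -> None:
        self.predetermined_number = predetermined_number
        self.stop_words = stop_words or set()
        self.weight_fn = weight_fn

    def select(self, page_words: List[str]) -> List[str]:
        # Keep first occurrences only, dropping stop words and function words.
        words = [w for w in dict.fromkeys(page_words) if w not in self.stop_words]
        if self.weight_fn is not None:
            weights = self.weight_fn(page_words)
            words.sort(key=lambda w: weights.get(w, float("-inf")), reverse=True)
        return words[:self.predetermined_number]


class BenchmarkSelectionUnit:  # unit 52
    def select(self, words: List[str],
               collection: Dict[str, Set[str]]) -> Tuple[Optional[str], int]:
        counts = [(name, len(set(words) & page)) for name, page in collection.items()]
        return max(counts, key=lambda item: item[1], default=(None, 0))


class ProcessingUnit:  # unit 54
    def __init__(self) -> None:
        self.duplicates: List[str] = []

    def handle(self, name: str) -> None:
        self.duplicates.append(name)  # record the page as a duplicate


class ComparisonUnit:  # unit 53
    def __init__(self, threshold: int, processing_unit: ProcessingUnit) -> None:
        self.threshold = threshold
        self.processing_unit = processing_unit

    def compare(self, name: str, count: int) -> bool:
        if count > self.threshold:
            self.processing_unit.handle(name)  # start the processing unit
            return True
        return False


selector = WordSelectionUnit(3, stop_words={"the"})
chooser = BenchmarkSelectionUnit()
processor = ProcessingUnit()
comparer = ComparisonUnit(2, processor)

collection = {"A": {"x"}, "B": {"a", "b"}, "C": {"a", "b", "c"}}
words = selector.select(["the", "a", "b", "c", "a"])
_, count = chooser.select(words, collection)
print(comparer.compare("D", count), processor.duplicates)
```

Passing no `weight_fn` gives the behavior of the fourth embodiment; supplying one adds the weight-ordered selection of the fifth embodiment without changing units 52 to 54.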
The method and system for removing duplicate webpages provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims (10)

1, a kind of method of removing duplicate webpages is characterized in that, comprising:
In alternative webpage, choose the word of predetermined number;
In collections of web pages, choose and contain the maximum webpage of above-mentioned word quantity as the benchmark webpage;
The quantity that comprises above-mentioned word in the benchmark webpage is then handled described alternative webpage greater than setting threshold as repeated pages as described.
2, the method for claim 1 is characterized in that, also comprises:
The quantity that comprises above-mentioned word in the benchmark webpage then adds described collections of web pages with described alternative webpage less than setting threshold as described.
3, the method for claim 1 is characterized in that, chooses in alternative webpage before the word of predetermined number, also comprises:
Attribute to word in the described alternative webpage marks, and filter attribute is the word of stop words and function word.
4, as claim 1,2 or 3 described methods, it is characterized in that, in described alternative webpage, choose the word of predetermined number according to weights order from big to small.
5, method as claimed in claim 4 is characterized in that, in described alternative webpage, choose the word of predetermined number before, also comprise:
Merchant divided by total training text number takes the logarithm with the training text number of each word, and the numerical value of acquisition multiply by the number of times that occurs this word in the described alternative web page text again, obtains the weights of each word in the described alternative webpage.
6, the method for claim 1 is characterized in that, by following step, chooses in collections of web pages and contains the maximum webpage of above-mentioned word quantity as the benchmark webpage;
The word of above-mentioned predetermined number as query string, is retrieved in described collections of web pages;
From big to small the webpage that retrieves is sorted according to the quantity that comprises above-mentioned word;
With ordering first webpage as the benchmark webpage.
7, method as claimed in claim 6 is characterized in that, before described alternative webpage is handled as repeated pages, also comprises:
The webpage of choosing ordering second is as the benchmark webpage;
The quantity and the setting threshold of the above-mentioned word that comprised in this benchmark webpage are compared;
Greater than setting threshold, determine that described alternative webpage is a repeated pages as the quantity of the above-mentioned word that comprised in this benchmark webpage.
8, as claim 6 or 7 described methods, it is characterized in that, determine that described alternative webpage is before the repeated pages, also comprises:
Choose successively ordering after webpage as the benchmark webpage;
The quantity and the setting threshold of the above-mentioned word that comprised in this benchmark webpage are compared;
Greater than setting threshold, determine that alternative webpage is that repeated pages is handled as the quantity of the above-mentioned word that comprised in this benchmark webpage.
9. A system for removing duplicate webpages, characterized in that it comprises a word selection unit, a benchmark webpage selection unit, a comparison unit, and a processing unit:
The word selection unit is used to choose a predetermined number of words from an alternative webpage;
The benchmark webpage selection unit is used to choose, from a webpage collection, the webpage containing the largest number of the above-mentioned words as the benchmark webpage;
The comparison unit is used to start the processing unit when the number of the above-mentioned words contained in the benchmark webpage is greater than a set threshold;
The processing unit is used to handle the alternative webpage as a repeated page.
10. The system of claim 9, characterized in that it further comprises a weight calculation unit, used to calculate the weight of each of the above-mentioned words and to send the calculation result to the word selection unit;
The word selection unit chooses the predetermined number of words from the alternative webpage in descending order of weight.
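One way to picture how the units of claims 9 and 10 fit together is the sketch below. The class `DedupSystem`, its field names, and the `handle` callback that plays the role of the processing unit are all hypothetical; the claims name the units but not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class DedupSystem:
    weigh: Callable[[List[str]], Dict[str, float]]  # weight calculation unit
    collection: Dict[str, Set[str]]                 # page id -> words on page
    predetermined_number: int
    threshold: int

    def select_words(self, page_words):
        """Word selection unit: top words in descending order of weight."""
        weights = self.weigh(page_words)
        ordered = sorted(weights, key=weights.get, reverse=True)
        return ordered[: self.predetermined_number]

    def select_benchmark(self, words):
        """Benchmark webpage selection unit: page with the most selected words."""
        best = None
        for page_id, page in self.collection.items():
            count = len(set(words) & page)
            if count > 0 and (best is None or count > best[1]):
                best = (page_id, count)
        return best

    def compare(self, benchmark):
        """Comparison unit: true when the word count exceeds the threshold."""
        return benchmark is not None and benchmark[1] > self.threshold

    def process(self, page_words, handle):
        """Run the units in order; `handle` acts as the processing unit."""
        words = self.select_words(page_words)
        benchmark = self.select_benchmark(words)
        if self.compare(benchmark):
            handle(benchmark[0])
            return True
        return False
```

Passing the weight function and the handler in as callables keeps each unit replaceable, which mirrors the claim's separation of the units.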
CNA2007101230528A 2007-06-22 2007-06-22 A method and system for removing duplicate webpages Pending CN101102316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101230528A CN101102316A (en) 2007-06-22 2007-06-22 A method and system for removing duplicate webpages

Publications (1)

Publication Number Publication Date
CN101102316A true CN101102316A (en) 2008-01-09

Family

ID=39036407

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007101230528A Pending CN101102316A (en) 2007-06-22 2007-06-22 A method and system for removing duplicate webpages

Country Status (1)

Country Link
CN (1) CN101102316A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714147B (en) * 2008-10-06 2012-10-17 易搜比控股公司 Method for filtering same or similar files
CN101645082B (en) * 2009-04-17 2011-04-20 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN102375813A (en) * 2010-08-09 2012-03-14 腾讯科技(深圳)有限公司 Duplicate detection system and method for search engines
CN102375813B (en) * 2010-08-09 2016-12-21 深圳市世纪光速信息技术有限公司 Search engine re-scheduling system and method
WO2014000508A1 (en) * 2012-06-30 2014-01-03 华为技术有限公司 Duplicated web page deletion method and device
US10346257B2 (en) 2012-06-30 2019-07-09 Huawei Technologies Co., Ltd. Method and device for deduplicating web page
CN104572720A (en) * 2013-10-21 2015-04-29 腾讯科技(深圳)有限公司 Webpage information duplicate eliminating method and device and computer-readable storage medium
CN104572720B (en) * 2013-10-21 2019-07-16 腾讯科技(深圳)有限公司 A kind of method, apparatus and computer readable storage medium of webpage information re-scheduling
WO2016066043A1 (en) * 2014-10-30 2016-05-06 阿里巴巴集团控股有限公司 Web page deduplication method and apparatus
CN105630802A (en) * 2014-10-30 2016-06-01 阿里巴巴集团控股有限公司 Webpage duplication removal method and apparatus
US10691769B2 (en) 2014-10-30 2020-06-23 Alibaba Group Holding Limited Methods and apparatus for removing a duplicated web page
CN107729489A (en) * 2017-10-17 2018-02-23 北京京东尚科信息技术有限公司 Advertisement text recognition methods and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20080109