CN102622365B - Judging system and judging method for web page repeating - Google Patents


Info

Publication number
CN102622365B
Authority
CN
China
Prior art keywords
signature
webpage
web page
sentence
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110031636.9A
Other languages
Chinese (zh)
Other versions
CN102622365A (en
Inventor
吴一璞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110031636.9A priority Critical patent/CN102622365B/en
Publication of CN102622365A publication Critical patent/CN102622365A/en
Application granted granted Critical
Publication of CN102622365B publication Critical patent/CN102622365B/en

Abstract

The invention discloses a system and a method for judging webpage duplication. The method includes the following steps: obtaining multiple web pages; extracting the text of each web page; extracting one or more sentences from the text of each web page, and computing a sentence signature of the text from the one or more sentences; clustering the web pages according to the sentence signatures of their texts; computing additional signatures for the web pages in the same cluster; and judging, according to the additional signatures, whether the web pages in the same cluster are duplicates. With this system and method, whether web pages are duplicates can be judged effectively and quickly on the basis of multi-dimensional signatures that include the sentence signatures of the web page texts.

Description

System and Method for Judging Webpage Duplication
[Technical Field]
The present invention relates to the internet field, and in particular to a system and a method for judging webpage duplication.
[Background Art]
In this era of highly developed science and technology, the internet has become the main channel through which people obtain information. However, today's internet is flooded with duplicated content, which greatly hampers users' access to information. Service providers therefore need to judge whether webpages are duplicates and, among duplicated webpages, select only a few high-quality ones for users to browse.
In the prior art, the similarity of two pages is generally determined by comparing their content and nodes. This method is relatively accurate, but its time complexity is too high and the computation is very time-consuming. By contrast, signing certain important information in a page and then comparing the signatures of two pages to compute their similarity is simple and efficient, and its computation is fast, which makes it well suited to the massive volume of information on the internet.
[Summary of the Invention]
The technical problem to be solved by the present invention is to provide a system and a method for judging webpage duplication, so as to judge effectively and quickly whether webpages are duplicates.
The technical solution adopted by the present invention to solve this problem is to provide a method for judging webpage duplication, comprising: a. obtaining multiple webpages; b. extracting the body text of each webpage; c. extracting one or more sentences from the body text, and calculating a body-text sentence signature from the one or more sentences; d. clustering the multiple webpages according to the body-text sentence signature; e. for the webpages in each cluster, calculating additional signatures of the webpages; f. judging, according to the additional signatures, whether the webpages in each cluster are duplicates.
According to a preferred embodiment of the present invention, step b further comprises: b1. dividing the webpage into blocks; b2. filtering the blocks of the divided webpage to obtain the content block containing the body text; b3. extracting the body text from the content block.
According to a preferred embodiment of the present invention, step c further comprises: c1. segmenting the body text into sentences; c2. filtering and converting the segmented body text; c3. extracting the longest one or more sentences from the filtered and converted body text; c4. performing a hash signature operation on the one or more sentences to obtain the body-text sentence signature.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a body-text signature, which is obtained by performing a simhash signature operation on the body text.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a real-title signature, which is obtained by performing a hash signature operation on the real title of the webpage.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a tag-title signature, which is obtained by performing a hash signature operation on the tag title of the webpage.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a summary signature, which is obtained by performing a hash signature operation on the summary of the webpage.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a page-content signature, which is obtained by performing a hash signature operation on the page content of the webpage.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a page-position signature, which is obtained by performing a hash signature operation on the position information of the webpage within the current site.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a comment-block signature, which is obtained by performing a hash signature operation on the comment information of the webpage.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a resource signature, which is obtained by performing a hash signature operation on the URLs of the picture, audio, video, or download-link resources in the webpage.
According to a preferred embodiment of the present invention, in step e, the additional signatures comprise a URL-filename signature, which is obtained by performing a hash signature operation on the filename in the URL of the webpage.
According to a preferred embodiment of the present invention, the method further comprises: g. performing a similarity calculation on the webpages judged to be duplicates in step f, to further judge whether the webpages are duplicates.
According to a preferred embodiment of the present invention, in step g, a decision tree algorithm is used to perform the similarity calculation, and the webpages are considered duplicates if a single additional signature of high importance is identical.
According to a preferred embodiment of the present invention, in step g, a decision tree algorithm is used to perform the similarity calculation, and the webpages are considered duplicates if multiple additional signatures of lower importance are identical.
The technical solution adopted by the present invention to solve the technical problem also provides a system for judging webpage duplication, comprising: a webpage acquisition device for obtaining multiple webpages; an extraction device for extracting the body text of each webpage; a sentence signature calculation device for extracting one or more sentences from the body text and calculating a body-text sentence signature from the one or more sentences; a clustering device for clustering the multiple webpages according to the body-text sentence signature; an additional signature calculation device for calculating, for the webpages in each cluster, additional signatures of the webpages; and a judgment device for judging, according to the additional signatures, whether the webpages in each cluster are duplicates.
According to a preferred embodiment of the present invention, the extraction device further comprises: a webpage blocking module for dividing the webpage into blocks; a block filtering module for filtering the blocks of the divided webpage to obtain the content block containing the body text; and a body-text extraction module for extracting the body text from the content block.
According to a preferred embodiment of the present invention, the sentence signature calculation device further comprises: a sentence segmentation module for segmenting the body text into sentences; a filtering and conversion module for filtering and converting the segmented body text; a sentence extraction module for extracting the longest one or more sentences from the filtered and converted body text; and a sentence signature calculation module for performing a hash signature operation on the one or more sentences to obtain the body-text sentence signature.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a body-text signature calculation module, which obtains the body-text signature by performing a simhash signature operation on the body text.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a real-title signature calculation module, which obtains the real-title signature by performing a hash signature operation on the real title of the webpage.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a tag-title signature calculation module, which obtains the tag-title signature by performing a hash signature operation on the tag title of the webpage.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a summary signature calculation module, which obtains the summary signature by performing a hash signature operation on the summary of the webpage.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a page-content signature calculation module, which obtains the page-content signature by performing a hash signature operation on the page content of the webpage.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a page-position signature calculation module, which obtains the page-position signature by performing a hash signature operation on the position information of the webpage within the current site.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a comment-block signature calculation module, which obtains the comment-block signature by performing a hash signature operation on the comment information of the webpage.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a resource signature calculation module, which obtains the resource signature by performing a hash signature operation on the URLs of the picture, audio, video, or download-link resources in the webpage.
According to a preferred embodiment of the present invention, the additional signature calculation device comprises a URL-filename signature calculation module, which obtains the URL-filename signature by performing a hash signature operation on the filename in the URL of the webpage.
According to a preferred embodiment of the present invention, the system further comprises: a similarity calculation module for performing a similarity calculation on the webpages judged to be duplicates by the judgment device, to further judge whether the webpages are duplicates.
According to a preferred embodiment of the present invention, the similarity calculation module uses a decision tree algorithm to perform the similarity calculation, and the webpages are considered duplicates if a single additional signature of high importance is identical.
According to a preferred embodiment of the present invention, the similarity calculation module uses a decision tree algorithm to perform the similarity calculation, and the webpages are considered duplicates if multiple additional signatures of lower importance are identical.
As can be seen from the above technical solutions, the method and system of the present invention for judging webpage duplication judge effectively and quickly whether webpages are duplicates by means of multi-dimensional signatures that include the body-text sentence signature.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the method of the present invention for judging webpage duplication.
Fig. 2 is a sub-flowchart of step 11 in Fig. 1.
Fig. 3 is a sub-flowchart of step 12 in Fig. 1.
Fig. 4 is a schematic diagram of page content according to the present invention.
Fig. 5 is a schematic block diagram of the system of the present invention for judging webpage duplication.
[Detailed Description]
The present invention is described in detail below with reference to the drawings and embodiments.
Fig. 1 is a flowchart of the method of the present invention for judging webpage duplication.
In step 10, multiple webpages are obtained. In this step, a web crawler (spider) can be used to fetch a large number of webpages from the internet.
In step 11, the body text of each webpage is extracted. Many methods can be used to extract the body text from a webpage; a specific embodiment of step 11 is described below with reference to Fig. 2.
Fig. 2 is a sub-flowchart of step 11 in Fig. 1.
In step 111, the webpage is divided into blocks. In this step, as shown in Fig. 4, the page content displayed by the browser can be divided into multiple content blocks, including a navigation block, a page-position block, a real-title block, a body-text block, an advertisement block, and a comment block. The navigation block is located at the top of the page content and leads to the corresponding page in response to the user's click. The page-position block records the position of the displayed webpage within the current site and is generally located below the navigation block. The real-title block records the real title of the body text; it is generally located below the page-position block and is usually highlighted, for example by enlarged or bold type. The body-text block records the body text. The body text is the main part of the webpage, represents the actual content that the webpage expresses, and is the core of the webpage; it is generally located below the real title and usually contains a large amount of descriptive text together with resources such as pictures, video, and sound. The advertisement block is generally located on one or both sides of the body-text block and provides advertisements or other content unrelated to the body text. The comment block is generally located below the body-text block and records the comments entered by viewers. In the present embodiment, the page content can be divided into blocks in various ways, for example based on templates, based on HTML tags, or based on visual information. These methods are well known to those skilled in the art and are not described in detail here.
In step 112, the blocks of the divided webpage are filtered to obtain the content block containing the body text. The specific block filtering methods are well known to those skilled in the art and are not described in detail here.
In step 113, the body text is extracted from the content block containing the body text.
In step 12, one or more sentences are extracted from the body text, and the body-text sentence signature is calculated from the one or more sentences. A specific embodiment of step 12 is described below with reference to Fig. 3.
Fig. 3 is a sub-flowchart of step 12 in Fig. 1.
In step 121, the body text is segmented into sentences. In this step, punctuation marks that indicate the end of a sentence, such as periods, question marks, and exclamation marks, can be used to segment the body text. The body text can also be segmented using its visual information. For example, in the webpage example shown in Fig. 4, the following sentences can be obtained from the body text:
Sentence 1: July 6 news: according to the Economic Observer Online, Huang Hongsheng, controlling shareholder and board chairman of the former Skyworth Group, has now been released from prison;
Sentence 2: Huang Hongsheng was arrested in Hong Kong in November 2004 in the "Tiger Mountain" operation of the Independent Commission Against Corruption (ICAC), was convicted in July 2006 on 4 charges including conspiracy to steal and defrauding a listed company of its assets, and was sentenced to 6 years' imprisonment;
Sentence 3: Under the relevant Hong Kong law, after holidays are deducted, Huang Hongsheng's 6-year sentence will expire next year;
Sentence 4: On the afternoon of July 13, 2006, Huang Hongsheng, former chairman of Skyworth Digital Holdings Ltd (Skyworth), and his younger brother Huang Peisheng, former executive director of Skyworth, were convicted on 4 charges including conspiracy to steal and conspiracy to defraud Skyworth of more than HKD 50 million, and each was sentenced by the Hong Kong District Court to 6 years' imprisonment;
Sentence 5: On August 11 of the same year, Skyworth Digital publicly announced that Huang Hongsheng had resigned as non-executive chairman and non-executive director of the company;
Sentence 6: As early as April this year, at a press conference for a Skyworth new-product launch, a senior executive of the company disclosed that "at the end of 2009, Skyworth founder Huang Hongsheng is expected to be released from prison early on bail";
Sentence 7: Skyworth Digital closed at HKD 1.94 at today's midday close, up 13.45% and a 52-week high. Over the past 52 weeks, the stock's lowest price was HKD 0.28.
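The punctuation-based segmentation of step 121 might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_sentences` and the exact set of sentence-ending marks are hypothetical, and the ASCII period is deliberately left out so that decimal numbers such as 1.94 are not split.

```python
import re

# Assumed set of sentence-ending marks: full-width and ASCII forms.
# The ASCII period is omitted so that numbers like "1.94" stay intact.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text):
    """Split body text into sentences at sentence-ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    print(split_sentences("今天下雨。你好吗？很好！"))
```

Segmentation by visual information, which the description also allows, would need layout data and is not covered by this sketch.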
In step 122, the segmented body text is filtered and converted. In this step, numeric information, copyright information, and other information that is not decisive for judging webpage duplication are first filtered out of the sentences. The sentences are then converted, for example by full-width/half-width conversion or traditional/simplified Chinese conversion, so that the converted sentences have a uniform format.
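One possible shape for the filtering and conversion of step 122 is sketched below. It assumes that Unicode NFKC normalization is an acceptable stand-in for full-width/half-width conversion and that "numeric information" means runs of digits; the name `normalize_sentence` is hypothetical. Copyright filtering and traditional/simplified conversion are left out because they need rules or mapping tables that the description does not specify.

```python
import re
import unicodedata


def normalize_sentence(sentence):
    """Return a sentence with a uniform format for signing."""
    # NFKC folds full-width letters, digits, and punctuation to half-width.
    s = unicodedata.normalize("NFKC", sentence)
    # Drop numeric information (assumed here to be runs of digits).
    s = re.sub(r"\d+", "", s)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", s).strip()


if __name__ == "__main__":
    print(normalize_sentence("ＡＢＣ １２３ abc"))
```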
In step 123, the longest one or more sentences are extracted from the filtered and converted body text. In this step, either the single longest sentence or the longest combination of a predetermined number (for example, 3) of consecutive sentences is extracted. For example, in the webpage example shown in Fig. 4, sentence 4 is the longest after filtering and conversion and is far longer than the other sentences, so sentence 4 can be selected as the body-text sentence, or the longest combination of consecutive sentences, namely sentences 4, 5, and 6, can be selected as the body-text sentences.
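The two selection options of step 123 can be illustrated with a short sketch. The function names and the use of character count as the measure of length are assumptions made for illustration.

```python
def longest_sentence(sentences):
    """Return the single longest sentence (first one on ties)."""
    return max(sentences, key=len)


def longest_window(sentences, k=3):
    """Return the k consecutive sentences with the greatest total length."""
    if len(sentences) <= k:
        return list(sentences)
    best_start = max(
        range(len(sentences) - k + 1),
        key=lambda i: sum(len(s) for s in sentences[i:i + k]),
    )
    return sentences[best_start:best_start + k]


if __name__ == "__main__":
    demo = ["aa", "bbbbbbbb", "ccc", "dddd", "e"]
    print(longest_sentence(demo))
    print(longest_window(demo, k=3))
```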
In step 124, a hash signature operation is performed on the extracted one or more sentences to obtain the body-text sentence signature. The specific hash signature algorithm can be any of the hash algorithms well known in the art.
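Since the description leaves the hash algorithm open, the sketch below simply picks SHA-256 truncated to 64 bits. Both that choice and the name `sentence_signature` are assumptions; any stable hash would serve the same purpose.

```python
import hashlib


def sentence_signature(sentences):
    """Hash the selected sentences into a 64-bit integer signature."""
    joined = "\n".join(sentences).encode("utf-8")
    digest = hashlib.sha256(joined).digest()
    # Keep the first 8 bytes; the width is an arbitrary choice for this sketch.
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    print(hex(sentence_signature(["example sentence"])))
```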
In step 13, the multiple webpages are clustered according to the body-text sentence signature. In this step, webpages whose body-text sentence signatures are identical are gathered into the same cluster.
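Because the clustering of step 13 groups pages by exact signature equality, it can be sketched as a dictionary keyed by signature. The input format, a sequence of `(url, signature)` pairs, is an assumption made for illustration.

```python
from collections import defaultdict


def cluster_by_signature(pages):
    """Group page URLs by identical body-text sentence signature.

    pages: iterable of (url, sentence_signature) pairs (assumed format).
    """
    clusters = defaultdict(list)
    for url, signature in pages:
        clusters[signature].append(url)
    return dict(clusters)


if __name__ == "__main__":
    print(cluster_by_signature([("u1", 7), ("u2", 9), ("u3", 7)]))
```

Only pages that land in the same cluster need their additional signatures computed and compared, which keeps the more expensive comparisons away from unrelated pages.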
In step 14, for the webpages in each cluster, the additional signatures of the webpages are calculated. In this step, the additional signatures can comprise one or a combination of the body-text signature, real-title signature, tag-title signature, summary signature, page-content signature, page-position signature, comment-block signature, resource signature, and URL-filename signature.
Body-text signature: the body-text signature is obtained by performing a simhash signature operation on the above body text.
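For readers unfamiliar with simhash, a minimal unweighted version is sketched below. The description does not say how the body text is tokenized or weighted, so the caller-supplied token list, the 64-bit width, and the per-token SHA-256 hash are all assumptions.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute an unweighted simhash over an iterable of string tokens."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i, w in enumerate(weights):
        if w > 0:
            signature |= 1 << i
    return signature


if __name__ == "__main__":
    print(hex(simhash(["web", "page", "text"])))
```

Unlike an ordinary hash, texts that share most of their tokens tend to produce signatures that differ in only a few bits, which is why this signature is compared by differing bit count rather than by equality.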
Real-title signature: the real-title signature is obtained by performing a hash signature operation on the real title of the webpage. The real title of a webpage is generally the most prominent title in the webpage and expresses its main content. As shown in Fig. 4, the real title can be extracted from the real-title block obtained by dividing the page content into blocks.
Page-content signature: the page-content signature is obtained by performing a hash signature operation on the page content of the webpage. Everything displayed in the webpage is page content. As shown in Fig. 4, the page content includes the body text, the real title, the page position, and other content.
Page-position signature: the page-position signature is obtained by performing a hash signature operation on the position information of the webpage within the current site. As shown in Fig. 4, the position information is extracted from the page-position block obtained by dividing the page content into blocks.
Comment-block signature: the comment-block signature is obtained by performing a hash signature operation on the comment information of the webpage. As shown in Fig. 4, the comment information can be extracted from the comment block obtained by dividing the page content into blocks.
Resource signature: the resource signature is obtained by performing a hash signature operation on the URLs of the picture, audio, video, or download-link resources in the webpage. As shown in Fig. 4, the body text contains a picture resource. However, because many webpages are plain text, not every webpage contains such resources, so a resource signature is not available for every webpage.
URL-filename signature: the URL-filename signature is obtained by performing a hash signature operation on the filename in the URL of the webpage.
For example, for the URL "http://www.zinenet.cn/newsDetails.asp?id=823", the corresponding URL filename is "newsDetails.asp".
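The filename in the example above can be obtained with the Python standard library, as in the following sketch. The function name `url_filename` is hypothetical, and treating the last path segment as the filename is an assumption consistent with the example.

```python
import posixpath
from urllib.parse import urlsplit


def url_filename(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)


if __name__ == "__main__":
    print(url_filename("http://www.zinenet.cn/newsDetails.asp?id=823"))
```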
In addition to the above signatures, the additional signatures can further include a tag-title signature and a summary signature, where:
Tag-title signature: the tag-title signature is obtained by performing a hash signature operation on the tag title of the webpage.
Summary signature: the summary signature is obtained by performing a hash signature operation on the summary of the webpage.
The tag title and the summary can each be obtained from the source code of the webpage. Two source-code examples, one for the tag title and one for the summary, are given below:
Tag title: <TITLE>Skyworth founder Huang Hongsheng released from prison early; share price rises sharply</TITLE>;
Summary: <meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>.
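Reading the tag title and the meta elements out of the page source could look like the sketch below, which uses the standard-library `html.parser`. The class name `HeadInfoParser` is hypothetical, and the sketch only collects the attributes of every meta element; which meta element serves as the summary is left to the caller, since the description does not specify it.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the <title> text and the attributes of every <meta> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metas = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            self.metas.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


if __name__ == "__main__":
    parser = HeadInfoParser()
    parser.feed('<TITLE>Example page</TITLE>'
                '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>')
    print(parser.title, parser.metas)
```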
In step 15, whether the webpages in each cluster are duplicates is judged according to the additional signatures. In this step, the additional signatures of the webpages in the same cluster are compared to see whether they are identical or similar, and duplication is judged on that basis. The additional signatures can be one or a combination of the above signatures, and the judgment is made with a decision tree. Specifically, when body-text signatures obtained by the simhash signature operation are compared, the number of bits in which the two signatures differ is counted; the fewer the differing bits, the higher the likelihood that the webpages are duplicates. When the other additional signatures are compared, equal signatures indicate that the webpages are duplicates in that dimension. The confidence of the different signatures must be fully considered in the judgment. For example, the body-text signature and the page-content signature carry more information, so their confidence is relatively high; the other signatures carry less information, so their confidence is relatively low.
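The number of differing bits between two simhash signatures is their Hamming distance. A minimal sketch, assuming the signatures are held as non-negative Python integers:

```python
def hamming_distance(sig_a, sig_b):
    """Count the bit positions in which two integer signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


if __name__ == "__main__":
    print(hamming_distance(0b1011, 0b0010))
```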
The method of the present invention for judging webpage duplication further comprises step 16. In step 16, a similarity calculation is performed on the webpages judged to be duplicates in step 15, to further judge whether the webpages are duplicates. In this step, performing the similarity calculation on the webpages as a whole makes it possible to further filter out falsely duplicated webpages.
Specifically, a decision tree algorithm can generally be used to perform the similarity calculation. In the decision tree algorithm, the webpages are considered duplicates if a single additional signature of high importance is identical, or if multiple additional signatures of lower importance are identical. For example, in the present embodiment, when judging webpage duplication, two webpages are considered true duplicates if they satisfy any one of the following:
1. The real-title signatures of the two webpages are identical.
2. The page-content signatures of the two webpages are identical.
3. The body-text signatures of the two webpages differ in fewer than 6 bits.
4. The page-position signatures of the two webpages are identical, and their URL-filename signatures are identical.
5. Three of the comment-block signature, resource signature, tag-title signature, summary signature, and URL-filename signature are identical.
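The five conditions above can be written as one predicate, as in the following sketch. The dictionary keys are hypothetical names for the signatures, a missing signature is represented by `None`, and condition 5 is read as "at least three", which is an interpretation rather than something the description states.

```python
def is_true_duplicate(a, b, max_diff_bits=6):
    """Apply the five example conditions to two pages' signature dicts.

    a, b: dicts mapping a signature name to an integer, or None if absent.
    """
    def same(key):
        return a.get(key) is not None and a.get(key) == b.get(key)

    # Conditions 1 and 2: one high-importance signature matches.
    if same("real_title") or same("page_content"):
        return True
    # Condition 3: simhash body-text signatures differ in fewer than 6 bits.
    sa, sb = a.get("body_simhash"), b.get("body_simhash")
    if sa is not None and sb is not None:
        if bin(sa ^ sb).count("1") < max_diff_bits:
            return True
    # Condition 4: page position and URL filename both match.
    if same("page_position") and same("url_filename"):
        return True
    # Condition 5: three of the lower-importance signatures match.
    weak = ("comment_block", "resource", "title_tag", "summary", "url_filename")
    return sum(same(key) for key in weak) >= 3


if __name__ == "__main__":
    page_a = {"real_title": 11, "body_simhash": 0b1010}
    page_b = {"real_title": 11, "body_simhash": 0b0101}
    print(is_true_duplicate(page_a, page_b))
```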
By comparing the pages pairwise, a set of truly duplicated URLs can be obtained. In general, if the number of webpages in this set of truly duplicated URLs divided by the number of webpages in the whole webpage set is greater than 30%, the whole webpage set is considered truly duplicated; otherwise it is considered falsely duplicated.
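The 30% rule reduces to a single ratio test, sketched below. The function and parameter names are hypothetical, and URLs are de-duplicated before counting as a precaution.

```python
def page_set_is_true_duplicate(true_duplicate_urls, all_urls, threshold=0.30):
    """Return True if the truly duplicated share of the set exceeds the threshold."""
    total = len(set(all_urls))
    if total == 0:
        return False
    return len(set(true_duplicate_urls)) / total > threshold


if __name__ == "__main__":
    urls = ["u%d" % i for i in range(10)]
    print(page_set_is_true_duplicate(urls[:4], urls))
```

The comparison is strict, so a share of exactly 30% is treated as a false duplicate, matching the "greater than 30%" wording.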
Fig. 5 is a schematic block diagram of the system of the present invention for judging webpage duplication.
The system of the present invention for judging webpage duplication comprises: webpage acquisition device 20, extraction device 21, sentence signature calculation device 22, clustering device 23, additional signature calculation device 24, judgment device 25, and similarity calculation device 26.
Webpage acquisition device 20 obtains multiple webpages. In this device, a web crawler can be used to fetch a large number of webpages from the internet.
Extraction device 21 extracts the body text of each webpage. Many methods can be used to extract the body text from a webpage; for example, in the present embodiment, extraction device 21 further comprises webpage blocking module 211, block filtering module 212, and body-text extraction module 213.
Webpage blocking module 211 divides the webpage into blocks. As described above, webpage blocking module 211 can divide the webpage example shown in Fig. 4 into multiple content blocks. The page content can be divided into blocks in various ways, for example based on templates, based on HTML tags, or based on visual information. These methods are well known to those skilled in the art and are not described in detail here.
Block filtering module 212 filters the blocks of the divided webpage to obtain the content block containing the body text. The specific block filtering methods are well known to those skilled in the art and are not described in detail here.
Body-text extraction module 213 extracts the body text from the content block containing the body text.
Sentence signature calculation device 22 extracts one or more sentences from the body text and calculates the body-text sentence signature from the one or more sentences. In the present embodiment, sentence signature calculation device 22 further comprises: sentence segmentation module 221, filtering and conversion module 222, sentence extraction module 223, and sentence signature calculation module 224.
Sentence segmentation module 221 segments the body text into sentences. In this module, punctuation marks that indicate the end of a sentence, such as periods, question marks, and exclamation marks, can be used to segment the body text. The body text can also be segmented using its visual information.
Filtering and conversion module 222 filters and converts the segmented body text. In this module, numeric information, copyright information, and other information that is not decisive for judging webpage duplication are first filtered out of the sentences. The sentences are then converted, for example by full-width/half-width conversion or traditional/simplified Chinese conversion, so that the converted sentences have a uniform format.
Sentence extraction module 223 extracts the longest one or more sentences from the filtered and converted body text. In this module, either the single longest sentence or the longest combination of a predetermined number (for example, 3) of consecutive sentences is extracted from the filtered and converted body text.
Sentence signature calculation module 224 performs a hash signature operation on the extracted one or more sentences to obtain the body-text sentence signature. The specific hash signature algorithm can be any of the hash algorithms well known in the art.
Clustering device 23 clusters the multiple webpages according to the body-text sentence signature. In this device, webpages whose body-text sentence signatures are identical are gathered into the same cluster.
The additional signature calculation device 24 calculates the additional signatures of the web pages in each class. The additional signature calculation device 24 further comprises a web page text signature calculation module 241, a real title signature calculation module 242, a tag title signature calculation module 243, a summary signature calculation module 244, a web page content signature calculation module 245, a web page position signature calculation module 246, a comment block signature calculation module 247, a resource signature calculation module 248, and a url filename signature calculation module 249.
The web page text signature calculation module 241 obtains the web page text signature by performing a simhash signature operation on the web page text described above. The web page text can be obtained from the text extraction module 213.
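The source names simhash but does not specify its parameters. The following is a sketch of a common simhash formulation, assuming a 64-bit fingerprint, whitespace tokenization, term frequency as the token weight, and MD5 as the per-token hash; Chinese text would need a word segmenter or character n-grams in place of `split()`.

```python
import hashlib
from collections import Counter

def simhash(text, bits=64):
    """Compute a simhash fingerprint of `text` as an integer."""
    weights = Counter(text.split())  # token -> frequency (assumed weighting)
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    # A bit of the fingerprint is set where the weighted vote is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)
```

Unlike an ordinary hash, similar inputs tend to produce fingerprints that differ in only a few bits, which is what makes a bitwise comparison of web page text signatures meaningful.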
The real title signature calculation module 242 obtains the real title signature by performing a hash signature operation on the real title of the web page. The real title of a web page is generally the most prominent title in the page and expresses its main content. The real title can be extracted from the real title block obtained when the web page blocking module 211 divides the web page content into blocks.
The tag title signature calculation module 243 obtains the tag title signature by performing a hash signature operation on the tag title of the web page. The tag title can be obtained from the source code of the web page acquired by the web page acquisition device 20.
The summary signature calculation module 244 obtains the summary signature by performing a hash signature operation on the summary of the web page. The summary can likewise be obtained from the source code of the web page acquired by the web page acquisition device 20.
The web page content signature calculation module 245 obtains the web page content signature by performing a hash signature operation on the web page content of the web page. Everything displayed in the web page is web page content. As shown in Figure 4, the web page content includes the web page text, the real title, the web page position, and other content.
The web page position signature calculation module 246 obtains the web page position signature by performing a hash signature operation on the position information of the web page within the current site. The position information can be extracted from the web page position block obtained when the web page blocking module 211 divides the web page content into blocks.
The comment block signature calculation module 247 obtains the comment block signature by performing a hash signature operation on the comment information of the web page. The comment information can be extracted from the web page position block obtained when the web page blocking module 211 divides the web page content into blocks.
The resource signature calculation module 248 obtains the resource signature by performing a hash signature operation on the urls of the picture resources, audio resources, video resources, or download link resources in the web page. As shown in Figure 4, the web page text contains a picture resource. However, because many web pages are plain text, not every web page contains such resources, so a resource signature is not available for every web page.
The url filename signature calculation module 249 obtains the url filename signature by performing a hash signature operation on the filename in the url of the web page.
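As an illustration of one of these hash-based additional signatures, the url filename signature can be sketched as below. The function name is hypothetical and MD5 is an assumed stand-in for the unspecified hash; the filename is taken to be the last path component of the url, ignoring the query string.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit

def url_filename_signature(url):
    """Hash the filename component of a web page url."""
    filename = posixpath.basename(urlsplit(url).path)
    return hashlib.md5(filename.encode("utf-8")).hexdigest()
```

Under this sketch, `http://a.example/news/123.html?x=1` and `http://b.example/archive/123.html` share a signature because both end in `123.html`. The other hash-based signatures (real title, tag title, summary, and so on) would follow the same pattern with a different input string.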
The judging device 25 judges, according to the additional signatures, whether the web pages in each class are duplicates. In this device, the additional signatures of the web pages are compared, and the web pages are judged to be duplicates or not according to whether those signatures are identical or similar. Specifically, when comparing web page text signatures obtained by the simhash signature operation, the number of differing bits between the signatures is counted; the fewer the differing bits, the more likely the web pages are duplicates. When comparing the other additional signatures, equal signatures indicate that the web pages are duplicates in that dimension. The confidence of the different signatures must be taken fully into account during judgment. For example, the web page text signature and the web page content signature carry a large amount of information, so their confidence is relatively high, whereas the other signatures carry less information, so their confidence is relatively low.
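The two comparison modes can be sketched as follows. The signature names, the dictionary layout, and the threshold of 3 differing bits are assumptions for illustration; the source gives no concrete threshold.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")

def compare_pages(sigs_a, sigs_b, max_differing_bits=3):
    """Compare two pages dimension by dimension.

    `sigs_a` and `sigs_b` map a signature name to its value. The
    'text_simhash' entry (a hypothetical name) is an integer compared by
    Hamming distance; all other entries are compared for equality.
    Returns a dict of signature name -> bool (True means "matches").
    """
    matches = {}
    for name in sigs_a.keys() & sigs_b.keys():
        if name == "text_simhash":
            distance = hamming_distance(sigs_a[name], sigs_b[name])
            matches[name] = distance <= max_differing_bits
        else:
            matches[name] = sigs_a[name] == sigs_b[name]
    return matches
```

Dimensions missing from either page, such as a resource signature on a plain-text page, are simply skipped rather than counted as mismatches.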
The system for judging web page duplication of the present invention may further comprise a similarity calculation device 26. The similarity calculation device 26 performs a similarity calculation on the web pages that the judging device 25 has judged to be duplicates, in order to further judge whether they are duplicates. By performing a similarity calculation on the web pages as a whole, falsely duplicated web pages can be further filtered out. Generally, a decision tree algorithm can be used for the similarity calculation. In the decision tree algorithm, web pages are considered duplicates if a single additional signature of high importance is identical, or if multiple additional signatures of lower importance are identical.
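The rule stated for the decision tree can be sketched as a hand-written two-branch check. This is not a trained decision tree; the set of high-importance signature names and the minimum of 3 matching low-importance signatures are assumptions chosen for illustration, guided by the description's remark that the web page text and web page content signatures carry the most information.

```python
# Hypothetical names for the high-importance dimensions.
HIGH_IMPORTANCE = {"text_simhash", "content"}

def is_duplicate(matches, min_low_matches=3):
    """Decide duplication from per-signature match results.

    `matches` maps a signature name to True/False. A page pair is a
    duplicate if any high-importance signature matches, or if at least
    `min_low_matches` lower-importance signatures match.
    """
    if any(matches.get(name, False) for name in HIGH_IMPORTANCE):
        return True
    low_matches = sum(
        1 for name, matched in matches.items()
        if matched and name not in HIGH_IMPORTANCE
    )
    return low_matches >= min_low_matches
```

For example, a pair that matches only on url filename is rejected, while a pair that matches on real title, tag title, and summary is accepted even though its text signatures differ.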
In this way, the system and method for judging web page duplication provided by the present invention can judge effectively and quickly whether web pages are duplicates, using multi-dimensional signatures that include the web page text sentence signature.
The above embodiments describe the present invention by way of example only. After reading this patent application, those skilled in the art can make various modifications to the present invention without departing from its spirit and scope.

Claims (28)

1. A method for judging web page duplication, characterized in that the method comprises:
a. obtaining multiple web pages;
b. extracting the web page text of each of the web pages;
c. extracting the longest sentence, or a combination of the longest multiple sentences, from the web page text, and calculating a web page text sentence signature from the longest sentence or the combination of the longest multiple sentences;
d. clustering the multiple web pages according to the web page text sentence signature;
e. for the web pages in each class, calculating additional signatures of the web pages;
f. judging, according to the additional signatures, whether the web pages in each class are duplicates;
wherein step c further comprises:
c1. splitting the web page text into sentences;
c2. filtering and converting the segmented web page text;
c3. extracting the longest sentence, or the combination of the longest multiple sentences, from the filtered and converted web page text;
c4. performing a hash signature operation on the sentence or the combination of multiple sentences to obtain the web page text sentence signature.
2. The method for judging web page duplication as claimed in claim 1, characterized in that step b further comprises:
b1. dividing the web page into blocks;
b2. performing block filtering on the blocked web page to obtain the content block containing the web page text;
b3. extracting the web page text from the content block.
3. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a web page text signature, and the web page text signature is obtained by performing a simhash signature operation on the web page text.
4. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a real title signature, and the real title signature is obtained by performing a hash signature operation on the real title of the web page.
5. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a tag title signature, and the tag title signature is obtained by performing a hash signature operation on the tag title of the web page.
6. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a summary signature, and the summary signature is obtained by performing a hash signature operation on the summary of the web page.
7. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a web page content signature, and the web page content signature is obtained by performing a hash signature operation on the web page content of the web page.
8. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a web page position signature, and the web page position signature is obtained by performing a hash signature operation on the position information of the web page within the current site.
9. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a comment block signature, and the comment block signature is obtained by performing a hash signature operation on the comment information of the web page.
10. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a resource signature, and the resource signature is obtained by performing a hash signature operation on the urls of the picture resources, audio resources, video resources, or download link resources in the web page.
11. The method for judging web page duplication as claimed in claim 1, characterized in that, in step e, the additional signatures comprise a url filename signature, and the url filename signature is obtained by performing a hash signature operation on the filename in the url of the web page.
12. The method for judging web page duplication as claimed in claim 1, characterized in that the method further comprises:
g. performing a similarity calculation on the web pages judged to be duplicates in step f, to further judge whether the web pages are duplicates.
13. The method for judging web page duplication as claimed in claim 12, characterized in that, in step g, a decision tree algorithm is used for the similarity calculation, and the web pages are considered duplicates if a single additional signature of high importance is identical.
14. The method for judging web page duplication as claimed in claim 12, characterized in that, in step g, a decision tree algorithm is used for the similarity calculation, and the web pages are considered duplicates if multiple additional signatures of lower importance are identical.
15. A system for judging web page duplication, characterized in that the system comprises:
a web page acquisition device for obtaining multiple web pages;
an extraction device for extracting the web page text of each of the web pages;
a sentence signature calculation device for extracting the longest sentence, or a combination of the longest multiple sentences, from the web page text, and calculating a web page text sentence signature from the longest sentence or the combination of the longest multiple sentences;
a clustering device for clustering the multiple web pages according to the web page text sentence signature;
an additional signature calculation device for calculating, for the web pages in each class, additional signatures of the web pages;
a judging device for judging, according to the additional signatures, whether the web pages in each class are duplicates;
wherein the sentence signature calculation device further comprises:
a sentence segmentation module for splitting the web page text into sentences;
a filter and conversion module for filtering and converting the segmented web page text;
a sentence extraction module for extracting the longest sentence, or the combination of the longest multiple sentences, from the filtered and converted web page text;
a sentence signature calculation module for performing a hash signature operation on the sentence or the combination of multiple sentences to obtain the web page text sentence signature.
16. The system for judging web page duplication as claimed in claim 15, characterized in that the extraction device further comprises:
a web page blocking module for dividing the web page into blocks;
a web page filtering module for performing block filtering on the blocked web page to obtain the content block containing the web page text;
a text extraction module for extracting the web page text from the content block.
17. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a web page text signature calculation module, and the web page text signature calculation module obtains a web page text signature by performing a simhash signature operation on the web page text.
18. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a real title signature calculation module, and the real title signature calculation module obtains a real title signature by performing a hash signature operation on the real title of the web page.
19. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a tag title signature calculation module, and the tag title signature calculation module obtains a tag title signature by performing a hash signature operation on the tag title of the web page.
20. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a summary signature calculation module, and the summary signature calculation module obtains a summary signature by performing a hash signature operation on the summary of the web page.
21. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a web page content signature calculation module, and the web page content signature calculation module obtains a web page content signature by performing a hash signature operation on the web page content of the web page.
22. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a web page position signature calculation module, and the web page position signature calculation module obtains a web page position signature by performing a hash signature operation on the position information of the web page within the current site.
23. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a comment block signature calculation module, and the comment block signature calculation module obtains a comment block signature by performing a hash signature operation on the comment information of the web page.
24. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a resource signature calculation module, and the resource signature calculation module obtains a resource signature by performing a hash signature operation on the urls of the picture resources, audio resources, video resources, or download link resources in the web page.
25. The system for judging web page duplication as claimed in claim 15, characterized in that the additional signature calculation device comprises a url filename signature calculation module, and the url filename signature calculation module obtains a url filename signature by performing a hash signature operation on the filename in the url of the web page.
26. The system for judging web page duplication as claimed in claim 15, characterized in that the system further comprises:
a similarity calculation module for performing a similarity calculation on the web pages judged to be duplicates by the judging device, to further judge whether the web pages are duplicates.
27. The system for judging web page duplication as claimed in claim 26, characterized in that, in the similarity calculation module, a decision tree algorithm is used for the similarity calculation, and the web pages are considered duplicates if a single additional signature of high importance is identical.
28. The system for judging web page duplication as claimed in claim 26, characterized in that, in the similarity calculation module, a decision tree algorithm is used for the similarity calculation, and the web pages are considered duplicates if multiple additional signatures of lower importance are identical.
CN201110031636.9A 2011-01-28 2011-01-28 Judging system and judging method for web page repeating Active CN102622365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110031636.9A CN102622365B (en) 2011-01-28 2011-01-28 Judging system and judging method for web page repeating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110031636.9A CN102622365B (en) 2011-01-28 2011-01-28 Judging system and judging method for web page repeating

Publications (2)

Publication Number Publication Date
CN102622365A CN102622365A (en) 2012-08-01
CN102622365B true CN102622365B (en) 2015-04-29

Family

ID=46562288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110031636.9A Active CN102622365B (en) 2011-01-28 2011-01-28 Judging system and judging method for web page repeating

Country Status (1)

Country Link
CN (1) CN102622365B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678373B (en) * 2012-09-17 2017-11-17 腾讯科技(深圳)有限公司 A kind of garbage template article recognition methods and equipment
CN103189867B (en) * 2012-10-30 2016-05-25 华为技术有限公司 Repeating data search method and equipment
CN104021124B (en) 2013-02-28 2017-11-03 国际商业机器公司 Methods, devices and systems for handling web data
CN104079559B (en) * 2014-06-05 2017-07-25 腾讯科技(深圳)有限公司 A kind of website safety detection method, device and server
CN104615714B (en) * 2015-02-05 2019-05-24 北京中搜云商网络技术有限公司 Blog article rearrangement based on text similarity and microblog channel feature
CN104809256A (en) * 2015-05-22 2015-07-29 数据堂(北京)科技股份有限公司 Data deduplication method and data deduplication method
CN106371988A (en) * 2016-08-22 2017-02-01 浪潮(北京)电子信息产业有限公司 Automatic interface test method and device
CN107169011B (en) * 2017-03-31 2021-06-11 百度在线网络技术(北京)有限公司 Webpage originality identification method and device based on artificial intelligence and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093485A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for filtering out repeated contents on web page
CN101226533A (en) * 2007-12-28 2008-07-23 腾讯科技(北京)有限公司 Method and system for arranging web page again
CN101694658A (en) * 2009-10-20 2010-04-14 浙江大学 Method for constructing webpage crawler based on repeated removal of news

Also Published As

Publication number Publication date
CN102622365A (en) 2012-08-01

Similar Documents

Publication Publication Date Title
CN102622365B (en) Judging system and judging method for web page repeating
CN102156737B (en) Method for extracting subject content of Chinese webpage
CN101661513B (en) Detection method of network focus and public sentiment
CN103778200B (en) A kind of message information source abstracting method and its system
CN106557513A (en) Event information method for pushing and event information pusher
CN101609399B (en) Intelligent website development system based on modeling and method thereof
CN103544176A (en) Method and device for generating page structure template corresponding to multiple pages
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
CN104951961A (en) Method, terminal, server and system for pushing contents
CN101561802A (en) Web page structural data extraction method and system
CN102270206A (en) Method and device for capturing valid web page contents
CN102682082B (en) Network Flash searching system and network Flash searching method based on content structure characteristics
CN104462532B (en) The method and apparatus that Web page text is extracted
CN101779201A (en) Methods and apparatus to monitor content distributed by the internet
CN102314494B (en) Method and equipment for processing webpage contents
CN105589922A (en) Page display method, device and system and page display assisting method and device
CN109522410A (en) Document clustering method and platform, server and computer-readable medium
Henrys Importance of web scraping in e-commerce and e-marketing
CN103942285A (en) Recommendation method and system for dynamic page element
CN101625695B (en) Method and system for extracting complex named entities from Web video pages
CN103761257A (en) Webpage handling method and system based on mobile browser
CN104008213B (en) A kind of more new discovery of info web and the method and apparatus of statistics
CN103034655A (en) Collection method and system of user behavior information and related equipment
CN110134854A (en) A kind of crawler acquisition method based on user's incentive mechanism
CN101661471A (en) Method and device for displaying web page

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant