CN101620616A - Chinese similar web page de-emphasis method based on microcosmic characteristic - Google Patents
- Publication number
- CN101620616A (application CN200910083711A)
- Authority
- CN
- China
- Prior art keywords
- pos
- document
- keyword
- vector
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for deduplicating Chinese web pages with similar content based on small-world characteristics, in order to solve the problem of automatically detecting Chinese web pages whose content is approximately the same. The method takes both the syntactic and the semantic information of a web page into account and comprises the following steps: first, a term co-occurrence graph is built from the effective information extracted from the web page; second, document feature vectors are extracted, each consisting of keyword terms and keyword position information; finally, an inverted index of document keywords is built, making full use of the retrieval system and the classification information, and feature-vector matching against the inverted index is used to detect and examine similar web pages. The method effectively reduces the harmful effect of noise on the accuracy of the algorithm, considers both the content and the structural information of the page text, fully exploits the advantages of the retrieval and classification systems, achieves a deduplication precision above 90 percent and an average recall above 80 percent, and is particularly suitable for large-scale web page deduplication.
Description
Technical field
The present invention relates to a method for deduplicating Chinese web pages with similar content and belongs to the technical field of intelligent information retrieval over computer networks.
Technical background
With the unprecedented development of Internet technology and scale, the Internet has become one of the main channels for obtaining information; a survey of July 2007 counted more than 125 million websites in total. Thanks to their convenient search capability, search engines have become the main tool with which network users retrieve information, and the quality and efficiency of that retrieval directly affect the overall performance of a search engine. According to the statistical report released by CNNIC in July 2005, when users answered the question "what is the biggest problem encountered when retrieving information", 44.6% chose the option "too much duplicate information", ranking it first. Faced with massive amounts of information, users do not want to see piles of identical or nearly identical material. Helping users obtain the information they need more quickly and accurately is therefore a new problem for network information services. In recent years, much research has been carried out on detecting similar web pages, for example detection of similar page structure, of similar hyperlinks and of similar page content.
Documents with identical wording and structure are usually regarded as exact duplicates, and removing them is easily done with traditional plagiarism-detection techniques; detecting documents whose content is merely similar is much harder. Similar web pages are pages whose body content is essentially the same, regardless of whether their wording and structure are fully identical. For detecting similar page content, text copy-detection methods can be used; they fall into two classes: syntax-based methods (Shingle-based methods) and semantics-based methods (Term-based methods).
(1) Shingle-based methods
A Shingle is a contiguous sequence of words in a document. Shingle-based methods select a series of Shingles from each document and map every Shingle to a hash value in a hash table; the number or proportion of identical Shingles in the hash table then serves as the basis for judging text similarity. To make detection feasible on large document collections, researchers have adopted various sampling strategies to reduce the number of Shingles that take part in the comparison.
Heintze of Bell Laboratories, in "Scalable document fingerprinting", proposed keeping the N Shingles with the smallest hash values and removing Shingles that occur very frequently. Bharat of the Google research center, in "A comparison of techniques to find mirrored hosts on the WWW", proposed selecting the Shingles whose hash values are multiples of 25, with at most 400 Shingles per document. Broder of the Digital Systems Research Center, in "Syntactic clustering of the web", proposed joining several Shingles into a Supershingle and computing document similarity from the hash values of the Supershingles; although the Supershingle algorithm requires less computation, Broder found it unsuitable for short documents. Fetterly of Microsoft Research, in "On the evolution of clusters of near-duplicate web pages", proposed treating every 5 consecutive words as a Shingle, sampling 84 Shingles per document and combining them into 6 Supershingles; documents sharing 2 identical Supershingles are regarded as having similar content. Wu Pingbo of Tsinghua University and others, in "Research on fast deduplication of large-scale Chinese web pages based on feature strings", exploited the fact that punctuation marks appear in almost every web page text and used the five Chinese characters on each side of every full stop as Shingles to represent a page uniquely.
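As an illustration of the Shingle family of techniques summarized above, the sketch below builds word 5-grams, hashes them, and keeps the shingles whose hash is a multiple of 25 (Bharat-style sampling); the function names, the Jaccard comparison and the use of Python's built-in hash are illustrative assumptions of this sketch, not details taken from the cited papers.

```python
import re

def shingles(text, k=5):
    """Return the set of word k-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def sampled_fingerprint(text, k=5, modulus=25, max_shingles=400):
    """Hash every shingle and keep those whose hash is a multiple of `modulus`,
    capped at `max_shingles` per document (Bharat-style sampling)."""
    hashes = sorted(hash(s) for s in shingles(text, k))
    return set([h for h in hashes if h % modulus == 0][:max_shingles])

def resemblance(doc_a, doc_b):
    """Jaccard overlap of the sampled shingle fingerprints of two documents."""
    fa, fb = sampled_fingerprint(doc_a), sampled_fingerprint(doc_b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only sampled hashes are compared, two pages must share exact shingles to score highly, which is precisely the limitation of Shingle-based methods discussed below.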
(2) Term-based methods
Term-based methods basically use single terms as the elementary unit of computation and obtain the similarity of two documents from the cosine of their feature vectors, without considering the positions or the order in which the terms occur. Because they rely on feature-extraction techniques (in particular the selection of the feature vector), Term-based methods are more complex than Shingle-based algorithms.
Chowdhury's I-Match algorithm decides which words to select for the feature vector by computing the inverse document frequency, IDF = log(N/n), where N is the number of documents in the collection and n is the number of documents containing the word. Starting from the inference that "words occurring frequently across the collection add no semantic information to a document", I-Match removes words with small IDF values and thereby obtains a better document representation. The filtered keywords, sorted in descending order, form the "fingerprint" of the document, and documents with identical fingerprints are regarded as similar. In the worst case (all documents are similar), the time complexity of I-Match is O(n log n).
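A minimal sketch of the I-Match idea described above, assuming the corpus is available as plain Python token lists; the IDF cut-off value and the MD5 fingerprint are illustrative choices of this sketch, not details fixed by Chowdhury's paper.

```python
import hashlib
import math
from collections import Counter

def idf_table(corpus):
    """IDF = log(N / n) for every word, where N is the corpus size and
    n the number of documents containing the word."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / n) for w, n in df.items()}

def imatch_fingerprint(doc, idf, min_idf=1.0):
    """Drop low-IDF words, sort the survivors in descending IDF order and
    hash the result; identical fingerprints mark near-duplicate documents."""
    kept = sorted({w for w in doc if idf.get(w, 0.0) >= min_idf},
                  key=lambda w: -idf[w])
    return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()
```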
These existing detection methods have the following shortcomings: Shingle-based methods require exact matching and in effect detect only complete duplicates, so documents whose content is merely similar are missed; Term-based methods that rely on keywords alone are also insufficient, because web documents with different content may share the same keywords, which can lead to false judgments, so keywords by themselves are not enough for detecting document similarity.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and, in order to solve the problem of automatically detecting Chinese web pages with similar content and to help users obtain the information they need more quickly and accurately, to propose a method for deduplicating similar Chinese web pages based on the small-world characteristic. The method takes both the syntactic and the semantic information of a web page into account: it builds a term co-occurrence graph from the text, uses the small-world characteristic of the text to extract a feature vector that represents the document, and makes full use of the retrieval system and the classification information to detect and examine similar web pages.
To achieve the above object, the technical scheme of the method of the invention is as follows.
The method of the present invention for deduplicating similar Chinese web pages based on the small-world characteristic comprises the following steps:
Step 1: for each newly input web page, extract the effective information of the page to obtain its effective text.
Advertisements contained in a web page, navigation links to other pages and similar elements all interfere with retrieval of the page content. Therefore, before the page content is indexed, the effective text is extracted from it.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph.
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph.
The small-world phenomenon originates from the research on tracing shortest paths in American social networks carried out by the sociologist Milgram in 1967. The study showed that any pair of Americans can be connected through a chain of no more than six people who know each other, the famous "six degrees of separation" problem. The paper "Collective dynamics of small-world networks", published by Watts in Nature in 1998, studied the phenomenon in depth and showed that small-world networks combine a high clustering coefficient with short path lengths. In recent years the small-world theory has been applied to the study of all kinds of complex networks (transportation networks, power transmission, Internet control, etc.). Yutaka Matsuo, in "Extracting Keywords in a Document as a Small World", and Ramon Ferrer, in "The small world of human language", pointed out that human language, and likewise the term co-occurrence graph built from a document, exhibit the small-world characteristic. Therefore, by treating the keywords of a document as key nodes and extracting them according to the small-world characteristic of the co-occurrence graph, i.e. as the key concepts the document explains, document detection no longer requires exact matching, which prevents documents with similar content from being missed.
In the term co-occurrence graph G_L, the keywords of the document under test are the key nodes of G_L. Let d be the characteristic path length of G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, and let d_i be the average path length of CN_i. The contribution of node t_i to the small-world character of G_L is defined as CB_i = d_i − d; the larger this value, the more crucial the word represented by node t_i is to holding the structure of the whole document together. Key nodes build bridges between the concepts of the document by providing "shortcuts"; once they are lost, the document splits by topic into small, unrelated networks and its structure becomes loose.
(1) Obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L.
Once the co-occurrence graph G_L has been constructed, its two basic characteristics, the clustering coefficient C and the characteristic path length d, can be computed.
For a node t_i ∈ T_L, define its neighbour set as Γ_i = { j | ξ_{i,j} = 1 }. With k the number of neighbours and E_i the number of edges that actually exist between those neighbours (out of at most k(k − 1)/2), the clustering coefficient of t_i is
C_i = 2 · E_i / (k · (k − 1)),
and the clustering coefficient C of the co-occurrence graph G_L is the average of C_i over all L nodes.
For two given nodes t_i, t_j ∈ T_L, d_min(i, j) denotes the length of the shortest path between them. The average path length of node t_i is the mean of d_min(i, j) over all other nodes j, and the characteristic path length d of the co-occurrence graph G_L is the mean of these per-node averages.
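The two graph statistics just defined can be computed directly from an adjacency representation. The sketch below is not part of the patent; it assumes the co-occurrence graph is given as a dict mapping each node to its set of neighbours and uses breadth-first search for the shortest paths.

```python
from collections import deque

def clustering_coefficient(graph, node):
    """C_i = 2 * E_i / (k * (k - 1)), with E_i the number of edges among the
    k neighbours of `node`; 0 when the node has fewer than 2 neighbours."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in graph[a])
    return 2.0 * links / (k * (k - 1))

def shortest_path_lengths(graph, source):
    """Breadth-first-search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def characteristic_path_length(graph):
    """Mean of the per-node average shortest-path lengths d_min(i, j)."""
    per_node = []
    for u in graph:
        dist = shortest_path_lengths(graph, u)
        others = [length for v, length in dist.items() if v != u]
        if others:
            per_node.append(sum(others) / len(others))
    return sum(per_node) / len(per_node) if per_node else 0.0

def global_clustering(graph):
    """Mean clustering coefficient C of the whole co-occurrence graph."""
    return sum(clustering_coefficient(graph, u) for u in graph) / len(graph)
```

With these helpers, C and d of a document graph g are `global_clustering(g)` and `characteristic_path_length(g)`.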
(2) According to CB_i = d_i − d, obtain the contribution CB_i of each node t_i to the small-world character of the co-occurrence graph G_L.
(3) Sort the contributions CB_i obtained in (2) in descending order and select the N nodes with the largest CB_i values as the document keyword sequence Ti; the value of N is chosen by the user. Keywords alone, however, are not enough, because the traditional cosine similarity is insufficient for judging document similarity; the position information Pos of the keywords must also be recorded.
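A sketch of the keyword-selection step just described, reusing the helpers sketched above: each node is removed in turn, the characteristic path length of the reduced graph is recomputed, and the N nodes with the largest CB_i = d_i − d are kept. The brute-force recomputation is a shortcut of this sketch, not something prescribed by the patent.

```python
def contribution_rates(graph):
    """CB_i = d_i - d, where d_i is the characteristic path length of the
    graph with node i removed and d that of the full graph."""
    d_full = characteristic_path_length(graph)
    rates = {}
    for node in graph:
        reduced = {u: nbrs - {node} for u, nbrs in graph.items() if u != node}
        rates[node] = characteristic_path_length(reduced) - d_full
    return rates

def top_keywords(graph, n):
    """The N nodes whose removal stretches the graph the most."""
    rates = contribution_rates(graph)
    return sorted(rates, key=rates.get, reverse=True)[:n]
```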
The position information Pos of a keyword, i.e. the positions at which the feature term occurs in the document, is also very important for detecting similar documents. A list of vectors V_p = (Lp_1, …, Lp_i, …, Lp_N), with Lp_i = (Pos_i1, …, Pos_ij, …, Pos_in), records the positions of the feature terms, where Pos_ij is the position of the j-th occurrence of the i-th term in the document. That is, a document is represented by its N keywords and their position information Pos_ij; V_p is the matrix that stores the keyword position information, and Lp_i is the i-th row of V_p, i.e. the position vector of the i-th keyword.
The position information Pos of the keywords, together with the keyword terms, constitutes the text feature vector Va.
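One possible in-memory layout for the text feature vector Va described above, pairing each keyword with its list of occurrence positions; the class name and fields are illustrative assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FeatureVector:
    """Text feature vector Va: the selected keywords with the position list
    Lp_i of every keyword (token offsets of each occurrence)."""
    doc_id: str
    category: str                                   # classification label Ca
    positions: Dict[str, List[int]] = field(default_factory=dict)

    def term_frequency(self, keyword: str) -> int:
        return len(self.positions.get(keyword, []))

def build_feature_vector(doc_id, category, text_tokens, keywords):
    """Record, for every selected keyword, the offsets at which it occurs."""
    fv = FeatureVector(doc_id, category)
    for pos, token in enumerate(text_tokens):
        if token in keywords:
            fv.positions.setdefault(token, []).append(pos)
    return fv
```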
Step 4: build the inverted index of document keywords and complete the matching of document feature vectors against the inverted index. The steps are as follows:
(1) If the page is the first document, read its classification label Ca and build an inverted index over its keywords; the indexes of all pages with classification label Ca form the feature-vector index library IDXV_Ca.
To access feature vectors quickly, an index over the feature terms must be established. An inverted index is relatively simple to implement, fast to query and readily supports synonym queries; building an inverted index over the feature terms therefore significantly improves retrieval efficiency.
(2) If the page is not the first document, look up the first m terms of the keyword sequence Ti obtained in step 3 in the keyword inverted index of the same class, where m ≤ N and the value of m is chosen by the user. If the lookup returns no result, the judgment "no page with similar content detected" is made, and the keywords of the page's text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification label Ca of the page. If the lookup returns k matches Vdi (i = 1, 2, …, k, k > 0), meaning that all m keywords appear in those k documents, the similarity ξ between the text feature vector Va and the term vector of each Vdi is computed with the cosine formula
ξ = Σ_i d1(i) · d2(i) / ( √(Σ_i d1(i)²) · √(Σ_i d2(i)²) ),
where d1 denotes the web document under test, d2 denotes the document among the k matches currently compared with d1, and d1(i), d2(i) are term frequencies. If ξ ≤ a preset threshold (the threshold represents the required similarity and does not exceed 1), the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to IDXV_Ca according to the classification information of the page. If ξ > the preset threshold, the mean S of the mean square deviations of the position vectors of Va and Vdi is computed as follows:
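A sketch of the term-vector similarity just described, computed from keyword frequencies and assuming the FeatureVector layout sketched earlier:

```python
import math

def cosine_similarity(fv_a, fv_b):
    """xi = sum(d1(i)*d2(i)) / (sqrt(sum d1(i)^2) * sqrt(sum d2(i)^2)),
    with d1(i), d2(i) the frequencies of keyword i in the two documents."""
    keys = set(fv_a.positions) | set(fv_b.positions)
    dot = sum(fv_a.term_frequency(k) * fv_b.term_frequency(k) for k in keys)
    norm_a = math.sqrt(sum(fv_a.term_frequency(k) ** 2 for k in keys))
    norm_b = math.sqrt(sum(fv_b.term_frequency(k) ** 2 for k in keys))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```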
First, form the distance matrix of the feature terms of documents d1 and d2 as δP = V_p1 − V_p2, where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in d1 and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in d2; δP_ij is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents.
Then compute, for each row of the matrix V_p1 − V_p2, the mean square deviation S_i. First the mean distance of the i-th keyword,
AVG_i = (1/r) · Σ_{j=1..r} δP_ij,
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2. The distribution of the distances of the i-th keyword is then expressed by the mean square deviation
S_i = √( (1/r) · Σ_{j=1..r} (δP_ij − AVG_i)² ),
and the distance distribution over the N keywords of the whole document is expressed by the mean of the mean square deviations,
S = (1/N) · Σ_{i=1..N} S_i.
If S < a preset distance threshold, the page under test and the page corresponding to Vdi are judged to be pages with similar content; otherwise, the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification information of the page.
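A sketch of the position check described above: for each shared keyword the per-occurrence position differences δP_ij are taken, their mean square deviation S_i is computed, and S is their average over the keywords. Treating missing occurrences as position 0 is inferred from the worked example later in the text and is an assumption of this sketch.

```python
import math

def position_deviation(fv_a, fv_b):
    """Mean S of the per-keyword mean square deviations of the position
    differences deltaP_ij between two documents."""
    shared = set(fv_a.positions) & set(fv_b.positions)
    if not shared:
        return float("inf")
    s_values = []
    for kw in shared:
        pa, pb = fv_a.positions[kw], fv_b.positions[kw]
        r = max(len(pa), len(pb))                  # the larger occurrence count
        pa = pa + [0] * (r - len(pa))              # pad the shorter list with 0,
        pb = pb + [0] * (r - len(pb))              # as in the worked example
        deltas = [abs(x - y) for x, y in zip(pa, pb)]
        avg = sum(deltas) / r                      # AVG_i
        s_i = math.sqrt(sum((d - avg) ** 2 for d in deltas) / r)
        s_values.append(s_i)
    return sum(s_values) / len(s_values)
```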
This completes the detection and deduplication of similar Chinese web pages.
Beneficial effect
The main factor affecting deduplication accuracy is web page noise, and the method of the invention effectively reduces the harmful effect of noise on the accuracy of the algorithm. Document keywords are extracted using the small-world characteristic of the text, so both the content and the structural information of the page text are taken into account, while the advantages of the retrieval and classification systems are fully exploited; the method achieves a deduplication precision above 90% and an average recall above 80%. It has approximately linear time behaviour and good space efficiency, and comparing feature terms requires only a single scan of the feature-vector index library, so the method is particularly suitable for large-scale web page deduplication. Moreover, pages that are frequently reposted are of greater importance, and their retrieval ranking should be raised accordingly to reflect it; timely discovery of similar pages therefore also helps improve the retrieval quality of a search engine system.
Description of drawings
Fig. 1 is the flow chart of the method of the invention;
Fig. 2 is the flow chart of extracting the document feature vector;
Fig. 3 is the flow chart of completing the matching of document feature vectors according to the inverted index file and the keyword position vectors;
Fig. 4 is the term co-occurrence graph of the web page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails" in the embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawings and the embodiment.
As shown in Fig. 1, the method of the present invention for deduplicating similar Chinese web pages based on the small-world characteristic comprises the following steps:
Step 1: for each newly input web page, extract the effective information of the page to obtain its effective text.
Advertisements contained in a web page, navigation links to other pages and similar elements all interfere with retrieval of the page content. Therefore, before the page content is indexed, the effective text is extracted from it.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph.
First, the effective text is preprocessed: sentence segmentation, word segmentation and stop-word removal are applied in turn to obtain the processed document. The words t_i whose occurrence frequency in the document satisfies f > f_thr (f_thr is a preset threshold, here set to 2) are chosen as the nodes of the document's term co-occurrence graph.
For each pair of nodes t_i and t_j, a co-occurrence measure is computed from the number of sentences that contain both t_i and t_j and the number of sentences that contain t_i or t_j; if it exceeds the preset threshold J_thr (usually set to 1.2), an edge is added between nodes t_i and t_j.
The term co-occurrence graph of the document is thus defined as G_L = (T_L, E_L), where T_L = {t_i} is the set of nodes, L is the total number of nodes, E_L = {{t_i, t_j}} is the set of edges, and ξ_{i,j} indicates whether an edge exists between nodes t_i and t_j: ξ_{i,j} = 1 if the edge exists and ξ_{i,j} = 0 otherwise.
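A sketch of the graph-construction step, assuming the sentences are already segmented into word lists and producing the dict-of-neighbour-sets representation assumed by the earlier sketches. Because the patent gives the exact edge criterion only as a figure, the sketch substitutes a simple sentence co-occurrence count compared against a threshold; that stand-in, and the default threshold values, are assumptions, not the patent's formula.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(sentences, f_thr=2, edge_thr=1):
    """Nodes: words occurring more than f_thr times in the document.
    Edges: word pairs co-occurring in more than edge_thr sentences
    (stand-in edge criterion for this sketch)."""
    freq = Counter(w for sent in sentences for w in sent)
    nodes = {w for w, f in freq.items() if f > f_thr}
    pair_count = Counter()
    for sent in sentences:
        present = sorted(nodes & set(sent))
        for a, b in combinations(present, 2):
            pair_count[(a, b)] += 1
    graph = {w: set() for w in nodes}
    for (a, b), c in pair_count.items():
        if c > edge_thr:
            graph[a].add(b)
            graph[b].add(a)
    return graph
```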
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph, as shown in Fig. 2.
The small-world phenomenon originates from the research on tracing shortest paths in American social networks carried out by the sociologist Milgram in 1967. The study showed that any pair of Americans can be connected through a chain of no more than six people who know each other, the famous "six degrees of separation" problem. The paper "Collective dynamics of small-world networks", published by Watts in Nature in 1998, studied the phenomenon in depth and showed that small-world networks combine a high clustering coefficient with short path lengths. In recent years the small-world theory has been applied to the study of all kinds of complex networks (transportation networks, power transmission, Internet control, etc.). Yutaka Matsuo, in "Extracting Keywords in a Document as a Small World", and Ramon Ferrer, in "The small world of human language", pointed out that human language, and likewise the term co-occurrence graph built from a document, exhibit the small-world characteristic. Therefore, by treating the keywords of a document as key nodes and extracting them according to the small-world characteristic of the co-occurrence graph, i.e. as the key concepts the document explains, document detection no longer requires exact matching, which prevents documents with similar content from being missed.
In the term co-occurrence graph G_L, the keywords of the document under test are the key nodes of G_L. Let d be the characteristic path length of G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, and let d_i be the average path length of CN_i. The contribution of node t_i to the small-world character of G_L is defined as CB_i = d_i − d; the larger this value, the more crucial the word represented by node t_i is to holding the structure of the whole document together. Key nodes build bridges between the concepts of the document by providing "shortcuts"; once they are lost, the document splits by topic into small, unrelated networks and its structure becomes loose.
(1) Obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L.
Once the co-occurrence graph G_L has been constructed, its two basic characteristics, the clustering coefficient C and the characteristic path length d, can be computed.
For a node t_i ∈ T_L, define its neighbour set as Γ_i = { j | ξ_{i,j} = 1 }. With k the number of neighbours and E_i the number of edges that actually exist between those neighbours (out of at most k(k − 1)/2), the clustering coefficient of t_i is
C_i = 2 · E_i / (k · (k − 1)),
and the clustering coefficient C of the co-occurrence graph G_L is the average of C_i over all L nodes.
For two given nodes t_i, t_j ∈ T_L, d_min(i, j) denotes the length of the shortest path between them. The average path length of node t_i is the mean of d_min(i, j) over all other nodes j, and the characteristic path length d of the co-occurrence graph G_L is the mean of these per-node averages.
(2) According to CB_i = d_i − d, obtain the contribution CB_i of each node t_i to the small-world character of the co-occurrence graph G_L.
(3) Sort the contributions CB_i obtained in (2) in descending order and select the N nodes with the largest CB_i values as the document keyword sequence Ti; the value of N is chosen by the user, but should be no fewer than 6. Keywords alone, however, are not enough, because the traditional cosine similarity is insufficient for judging document similarity; the position information Pos of the keywords must also be recorded.
The position information Pos of a keyword, i.e. the positions at which the feature term occurs in the document, is also very important for detecting similar documents. A list of vectors V_p = (Lp_1, …, Lp_i, …, Lp_N), with Lp_i = (Pos_i1, …, Pos_ij, …, Pos_in), records the positions of the feature terms, where Pos_ij is the position of the j-th occurrence of the i-th term in the document. That is, a document is represented by its N keywords and their position information Pos_ij; V_p is the matrix that stores the keyword position information, and Lp_i is the i-th row of V_p, i.e. the position vector of the i-th keyword.
The position information Pos of the keywords, together with the keyword terms, constitutes the text feature vector Va.
Step 4: build the inverted index of document keywords and complete the matching of document feature vectors against the inverted index. The steps are as follows:
(1) If the page is the first document, build an inverted index over its keywords; all the index files form the feature-vector index library IDXV_Ca.
To access feature vectors quickly, an index over the feature terms must be established. An inverted index is relatively simple to implement, fast to query and readily supports synonym queries; building an inverted index over the feature terms therefore significantly improves retrieval efficiency.
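A sketch of the per-class feature-vector index library described here, combining an inverted index from keyword to document ids with the stored feature vectors (assuming the FeatureVector layout sketched earlier); the class and method names are illustrative.

```python
from collections import defaultdict

class FeatureVectorIndex:
    """Per-class feature-vector index library IDXV_Ca: an inverted index
    from keyword to document ids plus the stored feature vectors."""
    def __init__(self):
        self.inverted = defaultdict(set)   # keyword -> {doc_id, ...}
        self.vectors = {}                  # doc_id  -> FeatureVector

    def add(self, fv):
        """Incrementally build the document's keywords into the index."""
        self.vectors[fv.doc_id] = fv
        for keyword in fv.positions:
            self.inverted[keyword].add(fv.doc_id)

    def candidates(self, keywords, m):
        """Documents whose index entries contain all of the first m keywords."""
        sets = [self.inverted.get(k, set()) for k in list(keywords)[:m]]
        return set.intersection(*sets) if sets else set()
```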
(2) If the page is not the first document, look up the first m terms of the keyword sequence Ti obtained in step 3 in the keyword inverted index of the same class, where m ≤ N and the value of m is chosen by the user. If the lookup returns no result, the judgment "no page with similar content detected" is made, and the keywords of the page's text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification information of the page. If the lookup returns k matches Vdi (i = 1, 2, …, k, k > 0), meaning that all m keywords appear in those k documents, the similarity ξ between the text feature vector Va and the term vector of each Vdi is computed with the cosine formula
ξ = Σ_i d1(i) · d2(i) / ( √(Σ_i d1(i)²) · √(Σ_i d2(i)²) ),
where d1 denotes the web document under test, d2 denotes the document among the k matches currently compared with d1, and d1(i), d2(i) are term frequencies. If ξ ≤ a preset threshold (the threshold represents the required similarity and does not exceed 1), the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to IDXV_Ca according to the classification information of the page. If ξ > the preset threshold, the mean S of the mean square deviations of the position vectors of Va and Vdi is computed as follows:
First, form the distance matrix of the feature terms of documents d1 and d2 as δP = V_p1 − V_p2, where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in d1 and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in d2; δP_ij is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents.
Then compute, for each row of the matrix V_p1 − V_p2, the mean square deviation S_i. First the mean distance of the i-th keyword,
AVG_i = (1/r) · Σ_{j=1..r} δP_ij,
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2. The distribution of the distances of the i-th keyword is then expressed by the mean square deviation
S_i = √( (1/r) · Σ_{j=1..r} (δP_ij − AVG_i)² ),
and the distance distribution over the N keywords of the whole document is expressed by the mean of the mean square deviations,
S = (1/N) · Σ_{i=1..N} S_i.
If S < a preset distance threshold, the page under test and the page corresponding to Vdi are judged to be pages with similar content; otherwise, the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification information of the page.
This completes the detection and deduplication of similar Chinese web pages.
To evaluate the correctness and efficiency of the method, a series of experiments was designed.
Correctness is the life of an algorithm. Two evaluation criteria are used here: the recall of duplicate pages (Recall) and the precision of deduplication (Precision). Precision is the proportion of pages reported as duplicates that really are duplicate pages, and Recall is the proportion of the actual duplicate pages that are detected.
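A minimal sketch of the two evaluation measures, assuming the detected and actual duplicate page sets are available as Python sets of page ids:

```python
def precision_recall(detected, actual_duplicates):
    """Precision: fraction of reported duplicates that are real duplicates.
    Recall: fraction of the real duplicates that were reported."""
    true_pos = len(detected & actual_duplicates)
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(actual_duplicates) if actual_duplicates else 0.0
    return precision, recall
```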
To test the performance of DDW, 72 query words were selected from three fields, military affairs, medicine and computing, and each query was submitted to Google. From each group of retrieval results, pages with identical or similar content were chosen, 5835 pages in total, and these similar pages were inserted into an existing document collection (containing 1,028,568 pages). The I-Match algorithm (likewise choosing 20 feature words) and the DDW algorithm were then run in parallel to detect similar pages.
1. For the 23 queries submitted in the military field, the experimental results are shown in Table 1:
Table 1: precision and recall statistics for the military-field test samples
2. For the 28 queries submitted in the medical field, of which 20 groups correspond to introductory knowledge pages and 8 groups to news pages, the experimental results are shown in Table 2:
Table 2: precision and recall statistics for the medical-field test samples
3. For the 21 queries submitted in the computing field, all corresponding to news pages, the experimental results are shown in Table 3:
Table 3: precision and recall statistics for the computing-field test samples
The above experimental results show that, compared with the prior art, the method of the invention achieves higher precision and recall.
Embodiment
For example, for the page at the URL "http://cs.taoyuan.gov.cn/news/ReadNews.asp?NewsID=4727", which reports the failed test launch of Russia's "Bulava" new intercontinental ballistic missile, detect whether pages with similar content exist.
Step 1: for the newly input page, extract the effective information to obtain the effective text, which is as follows:
Russia's "Bulava" new intercontinental ballistic missile test launch fails
Date issued: October 26, 2006
Source: Rednet
Editor's typing: to the flaw
Photo: the Russian army's "Bulava" (also known as the "Mace") intercontinental ballistic missile
Photo: the "Dmitry Donskoy" nuclear submarine used to launch the "Mace" sea-based intercontinental ballistic missile
Xinhuanet, Moscow, October 25 (reporter Yue Lianguo): the Russian Navy's news and public-relations office announced on the 25th that a "Bulava" (also known as the "Mace") new intercontinental ballistic missile test-fired by the Russian military that day veered off its trajectory and crashed into the sea, so that the test launch failed. According to Russian media reports, the missile was launched under water at 17:05 Moscow time on the 25th (21:05 Beijing time) from the White Sea by the Northern Fleet strategic nuclear submarine "Dmitry Donskoy". According to plan, the warhead was to hit a designated target on a test range on the Kamchatka Peninsula in Russia's Far East, but the missile veered off its trajectory and fell into the sea a few minutes after launch. A special commission made up of representatives of the Russian Ministry of Defence and of the missile's design and production units will investigate the cause of the failed launch. The fifth test launch of the new "Bulava" intercontinental ballistic missile, carried out by the Russian military on September 7 of this year, also ended in failure: that missile likewise veered off its trajectory and fell into the sea a few minutes after launch. The first four test launches of the missile were all successful. The "Bulava" can carry 10 independently targeted nuclear warheads and has a range of up to 8000 kilometres. According to Solomonov, chief designer at the Moscow Institute of Thermal Technology, which is responsible for developing the missile, the "Bulava" and the "Topol-M" intercontinental ballistic missiles will form the backbone of Russia's future strategic nuclear forces. According to the Russian military's plan, the "Bulava" will enter service with the Russian Navy in 2008, and several more test launches of the missile will be carried out before then.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph, as shown in Fig. 4.
Step 3: extract the document feature vector according to the small-world characteristic of the co-occurrence graph, as shown in Table 4.
Table 4: feature vector of the web page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails"
| No. | Keyword | Frequency | Position vector |
|---|---|---|---|
| 1 | missile | 12 | {421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150} |
| 2 | test launch | 7 | {33, 309, 385, 715, 775, 866, 1166} |
| 3 | Bulava | 8 | {9, 11, 112, 323, 767, 888, 1006, 1090} |
| 4 | intercontinental ballistic missile | 5 | {21, 134, 348, 776, 1029} |
| 5 | launch | 4 | {160, 529, 537, 623, 823} |
| 6 | Moscow | 3 | {227, 477, 958} |
| 7 | the 25th | 3 | {237, 283, 487} |
| 8 | Mace | 3 | {125, 165, 336} |
| 9 | new | 3 | {16, 345, 772} |
| 10 | 05 minutes | 2 | {495, 513} |
| 11 | October | 3 | {58, 229, 904} |
| 12 | Dmitry Donskoy | 2 | {192, 442} |
| 13 | Russian Navy | 2 | {427, 1113} |
| 14 | carry out | 4 | {547, 728, 756, 1155} |
| 15 | failure | 4 | {36, 392, 718, 802} |
| 16 | veer off | 3 | {361, 635, 834} |
| 17 | trajectory | 3 | {365, 639, 838} |
| 18 | nuclear submarine | 2 | {214, 468} |
| 19 | Russia | 2 | {1, 257} |
| 20 | sea-based intercontinental ballistic missile | 1 | {171} |
Step 4: build the document keyword inverted index and complete the matching of document feature vectors against the inverted index.
Read the classification label Ca of the page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails" and, in the document keyword index file IDXV_Ca, look up the leading terms of its keyword sequence {missile, test launch, Bulava, intercontinental ballistic missile, launch, Moscow, the 25th, Mace, new, 05 minutes, October, Dmitry Donskoy}. One of the matches found corresponds to the keywords {missile, test launch, strategic, Bulava, intercontinental ballistic missile, new, Moscow, launch, the 25th, Mace, 05 minutes, October, failure, Dmitry Donskoy, Russian Navy, carry out, sea-based strategic missile, Russia, trajectory, fleet}. The corresponding feature vector is shown in Table 5:
Table 5: feature vector of the web page corresponding to the match
| No. | Keyword | Frequency | Position vector |
|---|---|---|---|
| 1 | missile | 19 | {421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150, 1133, 1645, 1675, 1745, 1947, 2168, 2262} |
| 2 | test launch | 13 | {33, 309, 385, 715, 775, 866, 1166, 1275, 1451, 1504, 1573, 1583, 1651} |
| 3 | strategic | 10 | {465, 1054, 1247, 1563, 1707, 1923, 1961, 1987, 2050, 2198} |
| 4 | Bulava | 11 | {9, 11, 110, 321, 765, 886, 1004, 1088, 1287, 1460, 1551, 1699} |
| 5 | intercontinental ballistic missile | 11 | {20, 134, 349, 777, 1030, 1315, 1472, 1761, 1869, 2128, 2400} |
| 6 | new | 10 | {17, 345, 773, 1311, 1468, 1541, 1803, 1939, 2124, 2396} |
| 7 | Moscow | 6 | {226, 447, 957, 1422, 1717, 2330} |
| 8 | launch | 5 | {160, 529, 537, 623, 823} |
| 9 | the 25th | 5 | {237, 283, 487, 1165, 1184} |
| 10 | Mace | 6 | {126, 166, 337, 1303, 1863, 2414} |
| 11 | 05 minutes | 2 | {495, 513} |
| 12 | October | 4 | {58, 229, 1180, 1361} |
| 13 | failure | 6 | {37, 393, 719, 803, 1512, 2218} |
| 14 | Dmitry Donskoy | 2 | {192, 443} |
| 15 | Russian Navy | 3 | {427, 1114, 1518} |
| 16 | carry out | 6 | {547, 729, 757, 1156, 1537, 1821} |
| 17 | sea-based strategic missile | 3 | {1559, 1957, 2046} |
| 18 | Russia | 3 | {1, 257, 1205, 1408} |
| 19 | trajectory | 4 | {365, 639, 839, 1488} |
| 20 | fleet | 3 | {437, 1241, 1418} |
The similarity is computed with the cosine formula (4):
ξ = 81.51% > 80% (the threshold), so the two pages are judged possibly to have similar content. The keyword distance variance S of the two documents is therefore computed next.
Taking the word "missile" as an example, its position vectors in the two articles are respectively:
{421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150}
and
{421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150, 1133, 1645, 1675, 1745, 1947, 2168, 2262}
δP_{1,1} = |421 − 421| = 0, …, δP_{1,19} = |0 − 2262| = 2262
The distance distribution S over the N keywords of the whole document is then computed from these values.
The length of the page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails" is 1165 bytes, and the distance threshold is set to 10% of the page length, i.e. about 117 bytes. In this case S > the distance threshold, so the two documents are judged not to be pages with similar content. According to the classification information of the document "Russia's 'Bulava' new intercontinental ballistic missile test launch fails", i.e. its classification label Ca, the keywords of the document are incrementally added to the feature-vector index library IDXV_Ca, which completes the detection for this page.
Claims (2)
1. A method for deduplicating similar Chinese web pages based on the small-world characteristic, characterized by comprising the following steps:
Step 1: for each newly input web page, extracting the effective information of the page to obtain its effective text;
Step 2: processing the effective text extracted in step 1 and constructing the term co-occurrence graph;
Step 3: extracting the document feature vector according to the small-world characteristic of the term co-occurrence graph, the implementation being as follows:
let d be the characteristic path length of the term co-occurrence graph G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, let d_i be the average path length of CN_i, and let the contribution of node t_i to the small-world character of G_L be CB_i = d_i − d;
(1) obtaining the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L: for a node t_i ∈ T_L, its neighbour set is defined as Γ_i = { j | ξ_{i,j} = 1 }; with k the number of neighbours and E_i the number of edges actually existing between those neighbours, the clustering coefficient of t_i is C_i = 2 · E_i / (k · (k − 1)), and the clustering coefficient C of the co-occurrence graph G_L is the average of C_i over all nodes;
for two given nodes t_i, t_j ∈ T_L, d_min(i, j) is the length of the shortest path between them; the average path length of node t_i is the mean of d_min(i, j) over all other nodes j, and the characteristic path length d of the co-occurrence graph G_L is the mean of these per-node averages;
(2) obtaining, according to CB_i = d_i − d, the contribution of each node t_i to the small-world character of the co-occurrence graph G_L;
(3) sorting the contributions CB_i obtained in (2) in descending order and selecting the N nodes with the largest CB_i values as the document keyword sequence Ti, the value of N being chosen by the user;
afterwards, recording the position information Pos of the keywords: a list of vectors V_p = (Lp_1, …, Lp_i, …, Lp_N), with Lp_i = (Pos_i1, …, Pos_ij, …, Pos_in), records the positions of the feature terms, where Pos_ij is the position of the j-th occurrence of the i-th term in the document; that is, a document is represented by its N keywords and their position information Pos_ij, V_p is the matrix storing the keyword position information, and Lp_i is the i-th row of V_p, i.e. the position vector of the i-th keyword; the position information Pos of the keywords, together with the keyword terms, constitutes the text feature vector Va;
Step 4: building the inverted index of document keywords and completing the matching of document feature vectors against the inverted index, the detailed process being as follows:
(1) if the page is the first document, reading its classification label Ca and building an inverted index over its keywords, the indexes of all pages with classification label Ca forming the feature-vector index library IDXV_Ca;
(2) if the page is not the first document, looking up the first m terms of the keyword sequence Ti obtained in step 3 in the keyword inverted index of the same class, where m ≤ N and the value of m is chosen by the user; if the lookup returns no result, making the judgment "no page with similar content detected" and incrementally adding the keywords of the page's text feature vector Va to the feature-vector index library IDXV_Ca according to the classification label Ca of the page; if the lookup returns k matches Vdi (i = 1, 2, …, k, k > 0), computing for every matched document the similarity ξ between the text feature vector Va and the term vector of Vdi; if ξ ≤ a preset threshold, making the judgment "no page with similar content detected" and incrementally adding the keywords of Va to IDXV_Ca according to the classification label Ca of the page; if ξ > the preset threshold, computing the mean S of the mean square deviations of the position vectors of Va and Vdi as follows:
first, forming the distance matrix of the feature terms of documents d1 and d2 as V_p1 − V_p2, where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in d1 and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in d2, δP_ij being the difference between the positions of the j-th occurrence of the i-th keyword in the two documents;
then, computing for each row of the matrix V_p1 − V_p2 the mean square deviation S_i: first the mean distance AVG_i of the i-th keyword, where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2; the distribution of the distances of the i-th keyword is then expressed by the mean square deviation S_i, and the distance distribution over the N keywords of the whole document is expressed by the mean S of the mean square deviations; if S < a preset distance threshold, judging that the page under test and the page corresponding to Vdi are pages with similar content; otherwise, making the judgment "no page with similar content detected" and incrementally adding the keywords of the text feature vector Va to the feature-vector index library IDXV_Ca according to the classification information of the page.
2. The method for deduplicating similar Chinese web pages based on the small-world characteristic according to claim 1, characterized in that the similarity ξ between the text feature vector Va and the term vector of Vdi is computed as
ξ = Σ_i d1(i) · d2(i) / ( √(Σ_i d1(i)²) · √(Σ_i d2(i)²) ),
where d1 denotes the web document under test, d2 denotes the document among the k matched documents currently being compared with d1, and d1(i), d2(i) are term frequencies.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910083711A CN101620616A (en) | 2009-05-07 | 2009-05-07 | Chinese similar web page de-emphasis method based on microcosmic characteristic |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101620616A true CN101620616A (en) | 2010-01-06 |
Family
ID=41513855
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910083711A Pending CN101620616A (en) | 2009-05-07 | 2009-05-07 | Chinese similar web page de-emphasis method based on microcosmic characteristic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101620616A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102314418A (en) * | 2011-10-09 | 2012-01-11 | 北京航空航天大学 | Method for comparing Chinese similarity based on context relation |
CN103123685B (en) * | 2011-11-18 | 2016-03-02 | 江南大学 | Text mode recognition method |
CN103123685A (en) * | 2011-11-18 | 2013-05-29 | 江南大学 | Text mode recognition method |
CN102663093A (en) * | 2012-04-10 | 2012-09-12 | 中国科学院计算机网络信息中心 | Method and device for detecting bad website |
CN102663093B (en) * | 2012-04-10 | 2014-07-09 | 中国科学院计算机网络信息中心 | Method and device for detecting bad website |
CN102722526A (en) * | 2012-05-16 | 2012-10-10 | 成都信息工程学院 | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
CN102722526B (en) * | 2012-05-16 | 2014-04-30 | 成都信息工程学院 | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
CN103778163A (en) * | 2012-10-26 | 2014-05-07 | 广州市邦富软件有限公司 | Rapid webpage de-weight algorithm based on fingerprints |
CN103246640A (en) * | 2013-04-23 | 2013-08-14 | 北京十分科技有限公司 | Duplicated text detection method and device |
CN103246640B (en) * | 2013-04-23 | 2016-08-03 | 北京酷云互动科技有限公司 | A kind of method and device detecting repeated text |
CN104636319B (en) * | 2013-11-11 | 2018-09-28 | 腾讯科技(北京)有限公司 | A kind of text De-weight method and device |
CN104636319A (en) * | 2013-11-11 | 2015-05-20 | 腾讯科技(北京)有限公司 | Text duplicate removal method and device |
CN103761477A (en) * | 2014-01-07 | 2014-04-30 | 北京奇虎科技有限公司 | Method and equipment for acquiring virus program samples |
CN104123272A (en) * | 2014-05-21 | 2014-10-29 | 山东省科学院情报研究所 | Document classification method based on variance |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN104615714B (en) * | 2015-02-05 | 2019-05-24 | 北京中搜云商网络技术有限公司 | Blog article rearrangement based on text similarity and microblog channel feature |
WO2017063525A1 (en) * | 2015-10-12 | 2017-04-20 | 广州神马移动信息科技有限公司 | Query processing method, device and apparatus |
CN105550170A (en) * | 2015-12-14 | 2016-05-04 | 北京锐安科技有限公司 | Chinese word segmentation method and apparatus |
CN105550170B (en) * | 2015-12-14 | 2018-10-12 | 北京锐安科技有限公司 | A kind of Chinese word cutting method and device |
US11379422B2 (en) | 2017-04-07 | 2022-07-05 | Tencent Technology (Shenzhen) Company Limited | Text deduplication method and apparatus, and storage medium |
WO2018184588A1 (en) * | 2017-04-07 | 2018-10-11 | 腾讯科技(深圳)有限公司 | Text deduplication method and device and storage medium |
CN108536753A (en) * | 2018-03-13 | 2018-09-14 | 腾讯科技(深圳)有限公司 | The determination method and relevant apparatus of duplicate message |
CN111859896A (en) * | 2019-04-01 | 2020-10-30 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN111859896B (en) * | 2019-04-01 | 2022-11-25 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN110716533A (en) * | 2019-10-29 | 2020-01-21 | 山东师范大学 | Key subsystem identification method and system influencing reliability of numerical control equipment |
CN112883704A (en) * | 2021-04-29 | 2021-06-01 | 南京视察者智能科技有限公司 | Big data similar text duplicate removal preprocessing method and device and terminal equipment |
CN112883704B (en) * | 2021-04-29 | 2021-07-16 | 南京视察者智能科技有限公司 | Big data similar text duplicate removal preprocessing method and device and terminal equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20100106 |