CN101620616A - Chinese similar web page de-emphasis method based on microcosmic characteristic - Google Patents
- Publication number
- CN101620616A (application CN200910083711A)
- Authority
- CN
- China
- Prior art keywords
- pos
- document
- keyword
- vector
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for deduplicating Chinese web pages with similar content based on small-world characteristics, in order to solve the problem of automatically detecting Chinese web pages whose content is approximately the same. The method takes both the syntactic and the semantic information of a web page into account and comprises the following steps: first, a term co-occurrence graph is built from the effective information extracted from the web page; second, document feature vectors are extracted, each consisting of keyword terms and keyword position information; finally, an inverted index of document keywords is built, making full use of the retrieval system and the classification information, and feature-vector matching against the inverted index is used to detect and examine similar web pages. The method effectively reduces the harmful effect of noise on the accuracy of the algorithm, considers both the content and the structural information of the page text, fully exploits the advantages of the retrieval and classification systems, achieves a deduplication precision above 90 percent and an average recall above 80 percent, and is particularly suitable for large-scale web page deduplication.
Description
Technical field
The present invention relates to a method for deduplicating Chinese web pages with similar content and belongs to the technical field of intelligent information retrieval over computer networks.
Technical background
With the unprecedented development of Internet technology and scale, the Internet has become one of the main channels for obtaining information; a survey of July 2007 counted more than 125 million websites in total. Thanks to their convenient search capability, search engines have become the main tool with which network users retrieve information, and the quality and efficiency of that retrieval directly affect the overall performance of a search engine. According to the statistical report released by CNNIC in July 2005, when users answered the question "what is the biggest problem encountered when retrieving information", 44.6% chose the option "too much duplicate information", ranking it first. Faced with massive amounts of information, users do not want to see piles of identical or nearly identical material. Helping users obtain the information they need more quickly and accurately is therefore a new problem for network information services. In recent years, much research has been carried out on detecting similar web pages, for example detection of similar page structure, of similar hyperlinks and of similar page content.
Documents with identical wording and structure are usually regarded as exact duplicates, and removing them is easily done with traditional plagiarism-detection techniques; detecting documents whose content is merely similar is much harder. Similar web pages are pages whose body content is essentially the same, regardless of whether their wording and structure are fully identical. For detecting similar page content, text copy-detection methods can be used; they fall into two classes: syntax-based methods (Shingle-based methods) and semantics-based methods (Term-based methods).
(1) Shingle-based methods
A Shingle is a contiguous sequence of words in a document. Shingle-based methods select a series of Shingles from each document and map every Shingle to a hash value in a hash table; the number or proportion of identical Shingles in the hash table then serves as the basis for judging text similarity. To make detection feasible on large document collections, researchers have adopted various sampling strategies to reduce the number of Shingles that take part in the comparison.
Heintze of Bell Laboratories, in "Scalable document fingerprinting", proposed keeping the N Shingles with the smallest hash values and removing Shingles that occur very frequently. Bharat of the Google research center, in "A comparison of techniques to find mirrored hosts on the WWW", proposed selecting the Shingles whose hash values are multiples of 25, with at most 400 Shingles per document. Broder of the Digital Systems Research Center, in "Syntactic clustering of the web", proposed joining several Shingles into a Supershingle and computing document similarity from the hash values of the Supershingles; although the Supershingle algorithm requires less computation, Broder found it unsuitable for short documents. Fetterly of Microsoft Research, in "On the evolution of clusters of near-duplicate web pages", proposed treating every 5 consecutive words as a Shingle, sampling 84 Shingles per document and combining them into 6 Supershingles; documents sharing 2 identical Supershingles are regarded as having similar content. Wu Pingbo of Tsinghua University and others, in "Research on fast deduplication of large-scale Chinese web pages based on feature strings", exploited the fact that punctuation marks appear in almost every web page text and used the five Chinese characters on each side of every full stop as Shingles to represent a page uniquely.
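As an illustration of the Shingle family of techniques summarized above, the sketch below builds word 5-grams, hashes them, and keeps the shingles whose hash is a multiple of 25 (Bharat-style sampling); the function names, the Jaccard comparison and the use of Python's built-in hash are illustrative assumptions of this sketch, not details taken from the cited papers.

```python
import re

def shingles(text, k=5):
    """Return the set of word k-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def sampled_fingerprint(text, k=5, modulus=25, max_shingles=400):
    """Hash every shingle and keep those whose hash is a multiple of `modulus`,
    capped at `max_shingles` per document (Bharat-style sampling)."""
    hashes = sorted(hash(s) for s in shingles(text, k))
    return set([h for h in hashes if h % modulus == 0][:max_shingles])

def resemblance(doc_a, doc_b):
    """Jaccard overlap of the sampled shingle fingerprints of two documents."""
    fa, fb = sampled_fingerprint(doc_a), sampled_fingerprint(doc_b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only sampled hashes are compared, two pages must share exact shingles to score highly, which is precisely the limitation of Shingle-based methods discussed below.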
(2) Term-based methods
Term-based methods basically use single terms as the elementary unit of computation and obtain the similarity of two documents from the cosine of their feature vectors, without considering the positions or the order in which the terms occur. Because they rely on feature-extraction techniques (in particular the selection of the feature vector), Term-based methods are more complex than Shingle-based algorithms.
Chowdhury's I-Match algorithm decides which words to select for the feature vector by computing the inverse document frequency, IDF = log(N/n), where N is the number of documents in the collection and n is the number of documents containing the word. Starting from the inference that "words occurring frequently across the collection add no semantic information to a document", I-Match removes words with small IDF values and thereby obtains a better document representation. The filtered keywords, sorted in descending order, form the "fingerprint" of the document, and documents with identical fingerprints are regarded as similar. In the worst case (all documents are similar), the time complexity of I-Match is O(n log n).
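A minimal sketch of the I-Match idea described above, assuming the corpus is available as plain Python token lists; the IDF cut-off value and the MD5 fingerprint are illustrative choices of this sketch, not details fixed by Chowdhury's paper.

```python
import hashlib
import math
from collections import Counter

def idf_table(corpus):
    """IDF = log(N / n) for every word, where N is the corpus size and
    n the number of documents containing the word."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / n) for w, n in df.items()}

def imatch_fingerprint(doc, idf, min_idf=1.0):
    """Drop low-IDF words, sort the survivors in descending IDF order and
    hash the result; identical fingerprints mark near-duplicate documents."""
    kept = sorted({w for w in doc if idf.get(w, 0.0) >= min_idf},
                  key=lambda w: -idf[w])
    return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()
```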
These existing detection methods have the following shortcomings: Shingle-based methods require exact matching and in effect detect only complete duplicates, so documents whose content is merely similar are missed; Term-based methods that rely on keywords alone are also insufficient, because web documents with different content may share the same keywords, which can lead to false judgments, so keywords by themselves are not enough for detecting document similarity.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and, in order to solve the problem of automatically detecting Chinese web pages with similar content and to help users obtain the information they need more quickly and accurately, to propose a method for deduplicating similar Chinese web pages based on the small-world characteristic. The method takes both the syntactic and the semantic information of a web page into account: it builds a term co-occurrence graph from the text, uses the small-world characteristic of the text to extract a feature vector that represents the document, and makes full use of the retrieval system and the classification information to detect and examine similar web pages.
To achieve the above object, the technical scheme of the method of the invention is as follows.
The method of the present invention for deduplicating similar Chinese web pages based on the small-world characteristic comprises the following steps:
Step 1: for each newly input web page, extract the effective information of the page to obtain its effective text.
Advertisements contained in a web page, navigation links to other pages and similar elements all interfere with retrieval of the page content. Therefore, before the page content is indexed, the effective text is extracted from it.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph.
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph.
The small-world phenomenon originates from the research on tracing shortest paths in American social networks carried out by the sociologist Milgram in 1967. The study showed that any pair of Americans can be connected through a chain of no more than six people who know each other, the famous "six degrees of separation" problem. The paper "Collective dynamics of small-world networks", published by Watts in Nature in 1998, studied the phenomenon in depth and showed that small-world networks combine a high clustering coefficient with short path lengths. In recent years the small-world theory has been applied to the study of all kinds of complex networks (transportation networks, power transmission, Internet control, etc.). Yutaka Matsuo, in "Extracting Keywords in a Document as a Small World", and Ramon Ferrer, in "The small world of human language", pointed out that human language, and likewise the term co-occurrence graph built from a document, exhibit the small-world characteristic. Therefore, by treating the keywords of a document as key nodes and extracting them according to the small-world characteristic of the co-occurrence graph, i.e. as the key concepts the document explains, document detection no longer requires exact matching, which prevents documents with similar content from being missed.
In the term co-occurrence graph G_L, the keywords of the document under test are the key nodes of G_L. Let d be the characteristic path length of G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, and let d_i be the average path length of CN_i. The contribution of node t_i to the small-world character of G_L is defined as CB_i = d_i − d; the larger this value, the more crucial the word represented by node t_i is to holding the structure of the whole document together. Key nodes build bridges between the concepts of the document by providing "shortcuts"; once they are lost, the document splits by topic into small, unrelated networks and its structure becomes loose.
(1) Obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L.
Once the co-occurrence graph G_L has been constructed, its two basic characteristics, the clustering coefficient C and the characteristic path length d, can be computed.
For a node t_i ∈ T_L, define its neighbour set as Γ_i = { j | ξ_{i,j} = 1 }. With k the number of neighbours and E_i the number of edges that actually exist between those neighbours (out of at most k(k − 1)/2), the clustering coefficient of t_i is
C_i = 2 · E_i / (k · (k − 1)),
and the clustering coefficient C of the co-occurrence graph G_L is the average of C_i over all L nodes.
For two given nodes t_i, t_j ∈ T_L, d_min(i, j) denotes the length of the shortest path between them. The average path length of node t_i is the mean of d_min(i, j) over all other nodes j, and the characteristic path length d of the co-occurrence graph G_L is the mean of these per-node averages.
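The two graph statistics just defined can be computed directly from an adjacency representation. The sketch below is not part of the patent; it assumes the co-occurrence graph is given as a dict mapping each node to its set of neighbours and uses breadth-first search for the shortest paths.

```python
from collections import deque

def clustering_coefficient(graph, node):
    """C_i = 2 * E_i / (k * (k - 1)), with E_i the number of edges among the
    k neighbours of `node`; 0 when the node has fewer than 2 neighbours."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in graph[a])
    return 2.0 * links / (k * (k - 1))

def shortest_path_lengths(graph, source):
    """Breadth-first-search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def characteristic_path_length(graph):
    """Mean of the per-node average shortest-path lengths d_min(i, j)."""
    per_node = []
    for u in graph:
        dist = shortest_path_lengths(graph, u)
        others = [length for v, length in dist.items() if v != u]
        if others:
            per_node.append(sum(others) / len(others))
    return sum(per_node) / len(per_node) if per_node else 0.0

def global_clustering(graph):
    """Mean clustering coefficient C of the whole co-occurrence graph."""
    return sum(clustering_coefficient(graph, u) for u in graph) / len(graph)
```

With these helpers, C and d of a document graph g are `global_clustering(g)` and `characteristic_path_length(g)`.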
(2) According to CB_i = d_i − d, obtain the contribution CB_i of each node t_i to the small-world character of the co-occurrence graph G_L.
(3) Sort the contributions CB_i obtained in (2) in descending order and select the N nodes with the largest CB_i values as the document keyword sequence Ti; the value of N is chosen by the user. Keywords alone, however, are not enough, because the traditional cosine similarity is insufficient for judging document similarity; the position information Pos of the keywords must also be recorded.
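A sketch of the keyword-selection step just described, reusing the helpers sketched above: each node is removed in turn, the characteristic path length of the reduced graph is recomputed, and the N nodes with the largest CB_i = d_i − d are kept. The brute-force recomputation is a shortcut of this sketch, not something prescribed by the patent.

```python
def contribution_rates(graph):
    """CB_i = d_i - d, where d_i is the characteristic path length of the
    graph with node i removed and d that of the full graph."""
    d_full = characteristic_path_length(graph)
    rates = {}
    for node in graph:
        reduced = {u: nbrs - {node} for u, nbrs in graph.items() if u != node}
        rates[node] = characteristic_path_length(reduced) - d_full
    return rates

def top_keywords(graph, n):
    """The N nodes whose removal stretches the graph the most."""
    rates = contribution_rates(graph)
    return sorted(rates, key=rates.get, reverse=True)[:n]
```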
The position information Pos of a keyword, i.e. the positions at which the feature term occurs in the document, is also very important for detecting similar documents. A list of vectors V_p = (Lp_1, …, Lp_i, …, Lp_N), with Lp_i = (Pos_i1, …, Pos_ij, …, Pos_in), records the positions of the feature terms, where Pos_ij is the position of the j-th occurrence of the i-th term in the document. That is, a document is represented by its N keywords and their position information Pos_ij; V_p is the matrix that stores the keyword position information, and Lp_i is the i-th row of V_p, i.e. the position vector of the i-th keyword.
The position information Pos of the keywords, together with the keyword terms, constitutes the text feature vector Va.
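One possible in-memory layout for the text feature vector Va described above, pairing each keyword with its list of occurrence positions; the class name and fields are illustrative assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FeatureVector:
    """Text feature vector Va: the selected keywords with the position list
    Lp_i of every keyword (token offsets of each occurrence)."""
    doc_id: str
    category: str                                   # classification label Ca
    positions: Dict[str, List[int]] = field(default_factory=dict)

    def term_frequency(self, keyword: str) -> int:
        return len(self.positions.get(keyword, []))

def build_feature_vector(doc_id, category, text_tokens, keywords):
    """Record, for every selected keyword, the offsets at which it occurs."""
    fv = FeatureVector(doc_id, category)
    for pos, token in enumerate(text_tokens):
        if token in keywords:
            fv.positions.setdefault(token, []).append(pos)
    return fv
```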
Step 4: build the inverted index of document keywords and complete the matching of document feature vectors against the inverted index. The steps are as follows:
(1) If the page is the first document, read its classification label Ca and build an inverted index over its keywords; the indexes of all pages with classification label Ca form the feature-vector index library IDXV_Ca.
To access feature vectors quickly, an index over the feature terms must be established. An inverted index is relatively simple to implement, fast to query and readily supports synonym queries; building an inverted index over the feature terms therefore significantly improves retrieval efficiency.
(2) If the page is not the first document, look up the first m terms of the keyword sequence Ti obtained in step 3 in the keyword inverted index of the same class, where m ≤ N and the value of m is chosen by the user. If the lookup returns no result, the judgment "no page with similar content detected" is made, and the keywords of the page's text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification label Ca of the page. If the lookup returns k matches Vdi (i = 1, 2, …, k, k > 0), meaning that all m keywords appear in those k documents, the similarity ξ between the text feature vector Va and the term vector of each Vdi is computed with the cosine formula
ξ = Σ_i d1(i) · d2(i) / ( √(Σ_i d1(i)²) · √(Σ_i d2(i)²) ),
where d1 denotes the web document under test, d2 denotes the document among the k matches currently compared with d1, and d1(i), d2(i) are term frequencies. If ξ ≤ a preset threshold (the threshold represents the required similarity and does not exceed 1), the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to IDXV_Ca according to the classification information of the page. If ξ > the preset threshold, the mean S of the mean square deviations of the position vectors of Va and Vdi is computed as follows:
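A sketch of the term-vector similarity just described, computed from keyword frequencies and assuming the FeatureVector layout sketched earlier:

```python
import math

def cosine_similarity(fv_a, fv_b):
    """xi = sum(d1(i)*d2(i)) / (sqrt(sum d1(i)^2) * sqrt(sum d2(i)^2)),
    with d1(i), d2(i) the frequencies of keyword i in the two documents."""
    keys = set(fv_a.positions) | set(fv_b.positions)
    dot = sum(fv_a.term_frequency(k) * fv_b.term_frequency(k) for k in keys)
    norm_a = math.sqrt(sum(fv_a.term_frequency(k) ** 2 for k in keys))
    norm_b = math.sqrt(sum(fv_b.term_frequency(k) ** 2 for k in keys))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```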
First, form the distance matrix of the feature terms of documents d1 and d2 as δP = V_p1 − V_p2, where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in d1 and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in d2; δP_ij is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents.
Then compute, for each row of the matrix V_p1 − V_p2, the mean square deviation S_i. First the mean distance of the i-th keyword,
AVG_i = (1/r) · Σ_{j=1..r} δP_ij,
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2. The distribution of the distances of the i-th keyword is then expressed by the mean square deviation
S_i = √( (1/r) · Σ_{j=1..r} (δP_ij − AVG_i)² ),
and the distance distribution over the N keywords of the whole document is expressed by the mean of the mean square deviations,
S = (1/N) · Σ_{i=1..N} S_i.
If S < a preset distance threshold, the page under test and the page corresponding to Vdi are judged to be pages with similar content; otherwise, the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification information of the page.
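A sketch of the position check described above: for each shared keyword the per-occurrence position differences δP_ij are taken, their mean square deviation S_i is computed, and S is their average over the keywords. Treating missing occurrences as position 0 is inferred from the worked example later in the text and is an assumption of this sketch.

```python
import math

def position_deviation(fv_a, fv_b):
    """Mean S of the per-keyword mean square deviations of the position
    differences deltaP_ij between two documents."""
    shared = set(fv_a.positions) & set(fv_b.positions)
    if not shared:
        return float("inf")
    s_values = []
    for kw in shared:
        pa, pb = fv_a.positions[kw], fv_b.positions[kw]
        r = max(len(pa), len(pb))                  # the larger occurrence count
        pa = pa + [0] * (r - len(pa))              # pad the shorter list with 0,
        pb = pb + [0] * (r - len(pb))              # as in the worked example
        deltas = [abs(x - y) for x, y in zip(pa, pb)]
        avg = sum(deltas) / r                      # AVG_i
        s_i = math.sqrt(sum((d - avg) ** 2 for d in deltas) / r)
        s_values.append(s_i)
    return sum(s_values) / len(s_values)
```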
This completes the detection and deduplication of similar Chinese web pages.
Beneficial effect
The main factor affecting deduplication accuracy is web page noise, and the method of the invention effectively reduces the harmful effect of noise on the accuracy of the algorithm. Document keywords are extracted using the small-world characteristic of the text, so both the content and the structural information of the page text are taken into account, while the advantages of the retrieval and classification systems are fully exploited; the method achieves a deduplication precision above 90% and an average recall above 80%. It has approximately linear time behaviour and good space efficiency, and comparing feature terms requires only a single scan of the feature-vector index library, so the method is particularly suitable for large-scale web page deduplication. Moreover, pages that are frequently reposted are of greater importance, and their retrieval ranking should be raised accordingly to reflect it; timely discovery of similar pages therefore also helps improve the retrieval quality of a search engine system.
Description of drawings
Fig. 1 is the flow chart of the method of the invention;
Fig. 2 is the flow chart of extracting the document feature vector;
Fig. 3 is the flow chart of completing the matching of document feature vectors according to the inverted index file and the keyword position vectors;
Fig. 4 is the term co-occurrence graph of the web page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails" in the embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawings and the embodiment.
As shown in Fig. 1, the method of the present invention for deduplicating similar Chinese web pages based on the small-world characteristic comprises the following steps:
Step 1: for each newly input web page, extract the effective information of the page to obtain its effective text.
Advertisements contained in a web page, navigation links to other pages and similar elements all interfere with retrieval of the page content. Therefore, before the page content is indexed, the effective text is extracted from it.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph.
First, the effective text is preprocessed: sentence segmentation, word segmentation and stop-word removal are applied in turn to obtain the processed document. The words t_i whose occurrence frequency in the document satisfies f > f_thr (f_thr is a preset threshold, here set to 2) are chosen as the nodes of the document's term co-occurrence graph.
For each pair of nodes t_i and t_j, a co-occurrence measure is computed from the number of sentences that contain both t_i and t_j and the number of sentences that contain t_i or t_j; if it exceeds the preset threshold J_thr (usually set to 1.2), an edge is added between nodes t_i and t_j.
The term co-occurrence graph of the document is thus defined as G_L = (T_L, E_L), where T_L = {t_i} is the set of nodes, L is the total number of nodes, E_L = {{t_i, t_j}} is the set of edges, and ξ_{i,j} indicates whether an edge exists between nodes t_i and t_j: ξ_{i,j} = 1 if the edge exists and ξ_{i,j} = 0 otherwise.
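A sketch of the graph-construction step, assuming the sentences are already segmented into word lists and producing the dict-of-neighbour-sets representation assumed by the earlier sketches. Because the patent gives the exact edge criterion only as a figure, the sketch substitutes a simple sentence co-occurrence count compared against a threshold; that stand-in, and the default threshold values, are assumptions, not the patent's formula.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(sentences, f_thr=2, edge_thr=1):
    """Nodes: words occurring more than f_thr times in the document.
    Edges: word pairs co-occurring in more than edge_thr sentences
    (stand-in edge criterion for this sketch)."""
    freq = Counter(w for sent in sentences for w in sent)
    nodes = {w for w, f in freq.items() if f > f_thr}
    pair_count = Counter()
    for sent in sentences:
        present = sorted(nodes & set(sent))
        for a, b in combinations(present, 2):
            pair_count[(a, b)] += 1
    graph = {w: set() for w in nodes}
    for (a, b), c in pair_count.items():
        if c > edge_thr:
            graph[a].add(b)
            graph[b].add(a)
    return graph
```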
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph, as shown in Fig. 2.
The small-world phenomenon originates from the research on tracing shortest paths in American social networks carried out by the sociologist Milgram in 1967. The study showed that any pair of Americans can be connected through a chain of no more than six people who know each other, the famous "six degrees of separation" problem. The paper "Collective dynamics of small-world networks", published by Watts in Nature in 1998, studied the phenomenon in depth and showed that small-world networks combine a high clustering coefficient with short path lengths. In recent years the small-world theory has been applied to the study of all kinds of complex networks (transportation networks, power transmission, Internet control, etc.). Yutaka Matsuo, in "Extracting Keywords in a Document as a Small World", and Ramon Ferrer, in "The small world of human language", pointed out that human language, and likewise the term co-occurrence graph built from a document, exhibit the small-world characteristic. Therefore, by treating the keywords of a document as key nodes and extracting them according to the small-world characteristic of the co-occurrence graph, i.e. as the key concepts the document explains, document detection no longer requires exact matching, which prevents documents with similar content from being missed.
In the term co-occurrence graph G_L, the keywords of the document under test are the key nodes of G_L. Let d be the characteristic path length of G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, and let d_i be the average path length of CN_i. The contribution of node t_i to the small-world character of G_L is defined as CB_i = d_i − d; the larger this value, the more crucial the word represented by node t_i is to holding the structure of the whole document together. Key nodes build bridges between the concepts of the document by providing "shortcuts"; once they are lost, the document splits by topic into small, unrelated networks and its structure becomes loose.
(1) Obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L.
Once the co-occurrence graph G_L has been constructed, its two basic characteristics, the clustering coefficient C and the characteristic path length d, can be computed.
For a node t_i ∈ T_L, define its neighbour set as Γ_i = { j | ξ_{i,j} = 1 }. With k the number of neighbours and E_i the number of edges that actually exist between those neighbours (out of at most k(k − 1)/2), the clustering coefficient of t_i is
C_i = 2 · E_i / (k · (k − 1)),
and the clustering coefficient C of the co-occurrence graph G_L is the average of C_i over all L nodes.
For two given nodes t_i, t_j ∈ T_L, d_min(i, j) denotes the length of the shortest path between them. The average path length of node t_i is the mean of d_min(i, j) over all other nodes j, and the characteristic path length d of the co-occurrence graph G_L is the mean of these per-node averages.
(2) According to CB_i = d_i − d, obtain the contribution CB_i of each node t_i to the small-world character of the co-occurrence graph G_L.
(3) Sort the contributions CB_i obtained in (2) in descending order and select the N nodes with the largest CB_i values as the document keyword sequence Ti; the value of N is chosen by the user, but should be no fewer than 6. Keywords alone, however, are not enough, because the traditional cosine similarity is insufficient for judging document similarity; the position information Pos of the keywords must also be recorded.
The position information Pos of a keyword, i.e. the positions at which the feature term occurs in the document, is also very important for detecting similar documents. A list of vectors V_p = (Lp_1, …, Lp_i, …, Lp_N), with Lp_i = (Pos_i1, …, Pos_ij, …, Pos_in), records the positions of the feature terms, where Pos_ij is the position of the j-th occurrence of the i-th term in the document. That is, a document is represented by its N keywords and their position information Pos_ij; V_p is the matrix that stores the keyword position information, and Lp_i is the i-th row of V_p, i.e. the position vector of the i-th keyword.
The position information Pos of the keywords, together with the keyword terms, constitutes the text feature vector Va.
Step 4: build the inverted index of document keywords and complete the matching of document feature vectors against the inverted index. The steps are as follows:
(1) If the page is the first document, build an inverted index over its keywords; all the index files form the feature-vector index library IDXV_Ca.
To access feature vectors quickly, an index over the feature terms must be established. An inverted index is relatively simple to implement, fast to query and readily supports synonym queries; building an inverted index over the feature terms therefore significantly improves retrieval efficiency.
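A sketch of the per-class feature-vector index library described here, combining an inverted index from keyword to document ids with the stored feature vectors (assuming the FeatureVector layout sketched earlier); the class and method names are illustrative.

```python
from collections import defaultdict

class FeatureVectorIndex:
    """Per-class feature-vector index library IDXV_Ca: an inverted index
    from keyword to document ids plus the stored feature vectors."""
    def __init__(self):
        self.inverted = defaultdict(set)   # keyword -> {doc_id, ...}
        self.vectors = {}                  # doc_id  -> FeatureVector

    def add(self, fv):
        """Incrementally build the document's keywords into the index."""
        self.vectors[fv.doc_id] = fv
        for keyword in fv.positions:
            self.inverted[keyword].add(fv.doc_id)

    def candidates(self, keywords, m):
        """Documents whose index entries contain all of the first m keywords."""
        sets = [self.inverted.get(k, set()) for k in list(keywords)[:m]]
        return set.intersection(*sets) if sets else set()
```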
(2) If the page is not the first document, look up the first m terms of the keyword sequence Ti obtained in step 3 in the keyword inverted index of the same class, where m ≤ N and the value of m is chosen by the user. If the lookup returns no result, the judgment "no page with similar content detected" is made, and the keywords of the page's text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification information of the page. If the lookup returns k matches Vdi (i = 1, 2, …, k, k > 0), meaning that all m keywords appear in those k documents, the similarity ξ between the text feature vector Va and the term vector of each Vdi is computed with the cosine formula
ξ = Σ_i d1(i) · d2(i) / ( √(Σ_i d1(i)²) · √(Σ_i d2(i)²) ),
where d1 denotes the web document under test, d2 denotes the document among the k matches currently compared with d1, and d1(i), d2(i) are term frequencies. If ξ ≤ a preset threshold (the threshold represents the required similarity and does not exceed 1), the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to IDXV_Ca according to the classification information of the page. If ξ > the preset threshold, the mean S of the mean square deviations of the position vectors of Va and Vdi is computed as follows:
First, form the distance matrix of the feature terms of documents d1 and d2 as δP = V_p1 − V_p2, where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in d1 and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in d2; δP_ij is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents.
Then compute, for each row of the matrix V_p1 − V_p2, the mean square deviation S_i. First the mean distance of the i-th keyword,
AVG_i = (1/r) · Σ_{j=1..r} δP_ij,
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2. The distribution of the distances of the i-th keyword is then expressed by the mean square deviation
S_i = √( (1/r) · Σ_{j=1..r} (δP_ij − AVG_i)² ),
and the distance distribution over the N keywords of the whole document is expressed by the mean of the mean square deviations,
S = (1/N) · Σ_{i=1..N} S_i.
If S < a preset distance threshold, the page under test and the page corresponding to Vdi are judged to be pages with similar content; otherwise, the judgment "no page with similar content detected" is made, and the keywords of the text feature vector Va are incrementally added to the feature-vector index library IDXV_Ca according to the classification information of the page.
This completes the detection and deduplication of similar Chinese web pages.
To evaluate the correctness and efficiency of the method, a series of experiments was designed.
Correctness is the life of an algorithm. Two evaluation criteria are used here: the recall of duplicate pages (Recall) and the precision of deduplication (Precision). Precision is the proportion of pages reported as duplicates that really are duplicate pages, and Recall is the proportion of the actual duplicate pages that are detected.
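A minimal sketch of the two evaluation measures, assuming the detected and actual duplicate page sets are available as Python sets of page ids:

```python
def precision_recall(detected, actual_duplicates):
    """Precision: fraction of reported duplicates that are real duplicates.
    Recall: fraction of the real duplicates that were reported."""
    true_pos = len(detected & actual_duplicates)
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(actual_duplicates) if actual_duplicates else 0.0
    return precision, recall
```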
To test the performance of DDW, 72 query words were selected from three fields, military affairs, medicine and computing, and each query was submitted to Google. From each group of retrieval results, pages with identical or similar content were chosen, 5835 pages in total, and these similar pages were inserted into an existing document collection (containing 1,028,568 pages). The I-Match algorithm (likewise choosing 20 feature words) and the DDW algorithm were then run in parallel to detect similar pages.
1. For the 23 queries submitted in the military field, the experimental results are shown in Table 1:
Table 1: precision and recall statistics for the military-field test samples
2. For the 28 queries submitted in the medical field, of which 20 groups correspond to introductory knowledge pages and 8 groups to news pages, the experimental results are shown in Table 2:
Table 2: precision and recall statistics for the medical-field test samples
3. For the 21 queries submitted in the computing field, all corresponding to news pages, the experimental results are shown in Table 3:
Table 3: precision and recall statistics for the computing-field test samples
The above experimental results show that, compared with the prior art, the method of the invention achieves higher precision and recall.
Embodiment
For example, for the page at the URL "http://cs.taoyuan.gov.cn/news/ReadNews.asp?NewsID=4727", which reports the failed test launch of Russia's "Bulava" new intercontinental ballistic missile, detect whether pages with similar content exist.
Step 1: for the newly input page, extract the effective information to obtain the effective text, which is as follows:
Russia's "Bulava" new intercontinental ballistic missile test launch fails
Date issued: October 26, 2006
Source: Rednet
Editor's typing: to the flaw
Photo: the Russian army's "Bulava" (also known as the "Mace") intercontinental ballistic missile
Photo: the "Dmitry Donskoy" nuclear submarine used to launch the "Mace" sea-based intercontinental ballistic missile
Xinhuanet, Moscow, October 25 (reporter Yue Lianguo): the Russian Navy's news and public-relations office announced on the 25th that a "Bulava" (also known as the "Mace") new intercontinental ballistic missile test-fired by the Russian military that day veered off its trajectory and crashed into the sea, so that the test launch failed. According to Russian media reports, the missile was launched under water at 17:05 Moscow time on the 25th (21:05 Beijing time) from the White Sea by the Northern Fleet strategic nuclear submarine "Dmitry Donskoy". According to plan, the warhead was to hit a designated target on a test range on the Kamchatka Peninsula in Russia's Far East, but the missile veered off its trajectory and fell into the sea a few minutes after launch. A special commission made up of representatives of the Russian Ministry of Defence and of the missile's design and production units will investigate the cause of the failed launch. The fifth test launch of the new "Bulava" intercontinental ballistic missile, carried out by the Russian military on September 7 of this year, also ended in failure: that missile likewise veered off its trajectory and fell into the sea a few minutes after launch. The first four test launches of the missile were all successful. The "Bulava" can carry 10 independently targeted nuclear warheads and has a range of up to 8000 kilometres. According to Solomonov, chief designer at the Moscow Institute of Thermal Technology, which is responsible for developing the missile, the "Bulava" and the "Topol-M" intercontinental ballistic missiles will form the backbone of Russia's future strategic nuclear forces. According to the Russian military's plan, the "Bulava" will enter service with the Russian Navy in 2008, and several more test launches of the missile will be carried out before then.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph, as shown in Fig. 4.
Step 3: extract the document feature vector according to the small-world characteristic of the co-occurrence graph, as shown in Table 4.
Table 4: feature vector of the web page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails"
| No. | Keyword | Frequency | Position vector |
|---|---|---|---|
| 1 | missile | 12 | {421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150} |
| 2 | test launch | 7 | {33, 309, 385, 715, 775, 866, 1166} |
| 3 | Bulava | 8 | {9, 11, 112, 323, 767, 888, 1006, 1090} |
| 4 | intercontinental ballistic missile | 5 | {21, 134, 348, 776, 1029} |
| 5 | launch | 4 | {160, 529, 537, 623, 823} |
| 6 | Moscow | 3 | {227, 477, 958} |
| 7 | the 25th | 3 | {237, 283, 487} |
| 8 | Mace | 3 | {125, 165, 336} |
| 9 | new | 3 | {16, 345, 772} |
| 10 | 05 minutes | 2 | {495, 513} |
| 11 | October | 3 | {58, 229, 904} |
| 12 | Dmitry Donskoy | 2 | {192, 442} |
| 13 | Russian Navy | 2 | {427, 1113} |
| 14 | carry out | 4 | {547, 728, 756, 1155} |
| 15 | failure | 4 | {36, 392, 718, 802} |
| 16 | veer off | 3 | {361, 635, 834} |
| 17 | trajectory | 3 | {365, 639, 838} |
| 18 | nuclear submarine | 2 | {214, 468} |
| 19 | Russia | 2 | {1, 257} |
| 20 | sea-based intercontinental ballistic missile | 1 | {171} |
Step 4: build the document keyword inverted index and complete the matching of document feature vectors against the inverted index.
Read the classification label Ca of the page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails" and, in the document keyword index file IDXV_Ca, look up the leading terms of its keyword sequence {missile, test launch, Bulava, intercontinental ballistic missile, launch, Moscow, the 25th, Mace, new, 05 minutes, October, Dmitry Donskoy}. One of the matches found corresponds to the keywords {missile, test launch, strategic, Bulava, intercontinental ballistic missile, new, Moscow, launch, the 25th, Mace, 05 minutes, October, failure, Dmitry Donskoy, Russian Navy, carry out, sea-based strategic missile, Russia, trajectory, fleet}. The corresponding feature vector is shown in Table 5:
Table 5: feature vector of the web page corresponding to the match
| No. | Keyword | Frequency | Position vector |
|---|---|---|---|
| 1 | missile | 19 | {421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150, 1133, 1645, 1675, 1745, 1947, 2168, 2262} |
| 2 | test launch | 13 | {33, 309, 385, 715, 775, 866, 1166, 1275, 1451, 1504, 1573, 1583, 1651} |
| 3 | strategic | 10 | {465, 1054, 1247, 1563, 1707, 1923, 1961, 1987, 2050, 2198} |
| 4 | Bulava | 11 | {9, 11, 110, 321, 765, 886, 1004, 1088, 1287, 1460, 1551, 1699} |
| 5 | intercontinental ballistic missile | 11 | {20, 134, 349, 777, 1030, 1315, 1472, 1761, 1869, 2128, 2400} |
| 6 | new | 10 | {17, 345, 773, 1311, 1468, 1541, 1803, 1939, 2124, 2396} |
| 7 | Moscow | 6 | {226, 447, 957, 1422, 1717, 2330} |
| 8 | launch | 5 | {160, 529, 537, 623, 823} |
| 9 | the 25th | 5 | {237, 283, 487, 1165, 1184} |
| 10 | Mace | 6 | {126, 166, 337, 1303, 1863, 2414} |
| 11 | 05 minutes | 2 | {495, 513} |
| 12 | October | 4 | {58, 229, 1180, 1361} |
| 13 | failure | 6 | {37, 393, 719, 803, 1512, 2218} |
| 14 | Dmitry Donskoy | 2 | {192, 443} |
| 15 | Russian Navy | 3 | {427, 1114, 1518} |
| 16 | carry out | 6 | {547, 729, 757, 1156, 1537, 1821} |
| 17 | sea-based strategic missile | 3 | {1559, 1957, 2046} |
| 18 | Russia | 3 | {1, 257, 1205, 1408} |
| 19 | trajectory | 4 | {365, 639, 839, 1488} |
| 20 | fleet | 3 | {437, 1241, 1418} |
The similarity is computed with the cosine formula (4):
ξ = 81.51% > 80% (the threshold), so the two pages are judged possibly to have similar content. The keyword distance variance S of the two documents is therefore computed next.
Taking the word "missile" as an example, its position vectors in the two articles are respectively:
{421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150}
and
{421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150, 1133, 1645, 1675, 1745, 1947, 2168, 2262}
δP_{1,1} = |421 − 421| = 0, …, δP_{1,19} = |0 − 2262| = 2262
The distance distribution S over the N keywords of the whole document is then computed from these values.
The length of the page "Russia's 'Bulava' new intercontinental ballistic missile test launch fails" is 1165 bytes, and the distance threshold is set to 10% of the page length, i.e. about 117 bytes. In this case S > the distance threshold, so the two documents are judged not to be pages with similar content. According to the classification information of the document "Russia's 'Bulava' new intercontinental ballistic missile test launch fails", i.e. its classification label Ca, the keywords of the document are incrementally added to the feature-vector index library IDXV_Ca, which completes the detection for this page.
Claims (2)
1. A method for deduplicating similar Chinese web pages based on the small-world characteristic, characterized by comprising the following steps:
Step 1: for each newly input web page, extracting the effective information of the page to obtain its effective text;
Step 2: processing the effective text extracted in step 1 and constructing the term co-occurrence graph;
Step 3: extracting the document feature vector according to the small-world characteristic of the term co-occurrence graph, the implementation being as follows:
let d be the characteristic path length of the term co-occurrence graph G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, let d_i be the average path length of CN_i, and let the contribution of node t_i to the small-world character of G_L be CB_i = d_i − d;
(1) obtaining the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L: for a node t_i ∈ T_L, its neighbour set is defined as Γ_i = { j | ξ_{i,j} = 1 }; with k the number of neighbours and E_i the number of edges actually existing between those neighbours, the clustering coefficient of t_i is C_i = 2 · E_i / (k · (k − 1)), and the clustering coefficient C of the co-occurrence graph G_L is the average of C_i over all nodes;
for two given nodes t_i, t_j ∈ T_L, d_min(i, j) is the length of the shortest path between them; the average path length of node t_i is the mean of d_min(i, j) over all other nodes j, and the characteristic path length d of the co-occurrence graph G_L is the mean of these per-node averages;
(2) obtaining, according to CB_i = d_i − d, the contribution of each node t_i to the small-world character of the co-occurrence graph G_L;
(3) sorting the contributions CB_i obtained in (2) in descending order and selecting the N nodes with the largest CB_i values as the document keyword sequence Ti, the value of N being chosen by the user;
afterwards, recording the position information Pos of the keywords: a list of vectors V_p = (Lp_1, …, Lp_i, …, Lp_N), with Lp_i = (Pos_i1, …, Pos_ij, …, Pos_in), records the positions of the feature terms, where Pos_ij is the position of the j-th occurrence of the i-th term in the document; that is, a document is represented by its N keywords and their position information Pos_ij, V_p is the matrix storing the keyword position information, and Lp_i is the i-th row of V_p, i.e. the position vector of the i-th keyword; the position information Pos of the keywords, together with the keyword terms, constitutes the text feature vector Va;
Step 4: building the inverted index of document keywords and completing the matching of document feature vectors against the inverted index, the detailed process being as follows:
(1) if the page is the first document, reading its classification label Ca and building an inverted index over its keywords, the indexes of all pages with classification label Ca forming the feature-vector index library IDXV_Ca;
(2) if the page is not the first document, looking up the first m terms of the keyword sequence Ti obtained in step 3 in the keyword inverted index of the same class, where m ≤ N and the value of m is chosen by the user; if the lookup returns no result, making the judgment "no page with similar content detected" and incrementally adding the keywords of the page's text feature vector Va to the feature-vector index library IDXV_Ca according to the classification label Ca of the page; if the lookup returns k matches Vdi (i = 1, 2, …, k, k > 0), computing for every matched document the similarity ξ between the text feature vector Va and the term vector of Vdi; if ξ ≤ a preset threshold, making the judgment "no page with similar content detected" and incrementally adding the keywords of Va to IDXV_Ca according to the classification label Ca of the page; if ξ > the preset threshold, computing the mean S of the mean square deviations of the position vectors of Va and Vdi as follows:
first, forming the distance matrix of the feature terms of documents d1 and d2 as V_p1 − V_p2, where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in d1 and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in d2, δP_ij being the difference between the positions of the j-th occurrence of the i-th keyword in the two documents;
then, computing for each row of the matrix V_p1 − V_p2 the mean square deviation S_i: first the mean distance AVG_i of the i-th keyword, where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2; the distribution of the distances of the i-th keyword is then expressed by the mean square deviation S_i, and the distance distribution over the N keywords of the whole document is expressed by the mean S of the mean square deviations; if S < a preset distance threshold, judging that the page under test and the page corresponding to Vdi are pages with similar content; otherwise, making the judgment "no page with similar content detected" and incrementally adding the keywords of the text feature vector Va to the feature-vector index library IDXV_Ca according to the classification information of the page.
2. The method for deduplicating similar Chinese web pages based on the small-world characteristic according to claim 1, characterized in that the similarity ξ between the text feature vector Va and the term vector of Vdi is computed as
ξ = Σ_i d1(i) · d2(i) / ( √(Σ_i d1(i)²) · √(Σ_i d2(i)²) ),
where d1 denotes the web document under test, d2 denotes the document among the k matched documents currently being compared with d1, and d1(i), d2(i) are term frequencies.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910083711A CN101620616A (en) | 2009-05-07 | 2009-05-07 | Chinese similar web page de-emphasis method based on microcosmic characteristic |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101620616A true CN101620616A (en) | 2010-01-06 |
Family
ID=41513855
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910083711A Pending CN101620616A (en) | 2009-05-07 | 2009-05-07 | Chinese similar web page de-emphasis method based on microcosmic characteristic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101620616A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102314418A (en) * | 2011-10-09 | 2012-01-11 | 北京航空航天大学 | Method for comparing Chinese similarity based on context relation |
CN103123685B (en) * | 2011-11-18 | 2016-03-02 | 江南大学 | Text mode recognition method |
CN103123685A (en) * | 2011-11-18 | 2013-05-29 | 江南大学 | Text mode recognition method |
CN102663093A (en) * | 2012-04-10 | 2012-09-12 | 中国科学院计算机网络信息中心 | Method and device for detecting bad website |
CN102663093B (en) * | 2012-04-10 | 2014-07-09 | 中国科学院计算机网络信息中心 | Method and device for detecting bad website |
CN102722526A (en) * | 2012-05-16 | 2012-10-10 | 成都信息工程学院 | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
CN102722526B (en) * | 2012-05-16 | 2014-04-30 | 成都信息工程学院 | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
CN103778163A (en) * | 2012-10-26 | 2014-05-07 | 广州市邦富软件有限公司 | Rapid webpage de-weight algorithm based on fingerprints |
CN103246640A (en) * | 2013-04-23 | 2013-08-14 | 北京十分科技有限公司 | Duplicated text detection method and device |
CN103246640B (en) * | 2013-04-23 | 2016-08-03 | 北京酷云互动科技有限公司 | A kind of method and device detecting repeated text |
CN104636319B (en) * | 2013-11-11 | 2018-09-28 | 腾讯科技(北京)有限公司 | A kind of text De-weight method and device |
CN104636319A (en) * | 2013-11-11 | 2015-05-20 | 腾讯科技(北京)有限公司 | Text duplicate removal method and device |
CN103761477A (en) * | 2014-01-07 | 2014-04-30 | 北京奇虎科技有限公司 | Method and equipment for acquiring virus program samples |
CN104123272A (en) * | 2014-05-21 | 2014-10-29 | 山东省科学院情报研究所 | Document classification method based on variance |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN104615714B (en) * | 2015-02-05 | 2019-05-24 | 北京中搜云商网络技术有限公司 | Blog article rearrangement based on text similarity and microblog channel feature |
WO2017063525A1 (en) * | 2015-10-12 | 2017-04-20 | 广州神马移动信息科技有限公司 | Query processing method, device and apparatus |
CN105550170A (en) * | 2015-12-14 | 2016-05-04 | 北京锐安科技有限公司 | Chinese word segmentation method and apparatus |
CN105550170B (en) * | 2015-12-14 | 2018-10-12 | 北京锐安科技有限公司 | A kind of Chinese word cutting method and device |
US11379422B2 (en) | 2017-04-07 | 2022-07-05 | Tencent Technology (Shenzhen) Company Limited | Text deduplication method and apparatus, and storage medium |
WO2018184588A1 (en) * | 2017-04-07 | 2018-10-11 | 腾讯科技(深圳)有限公司 | Text deduplication method and device and storage medium |
CN108536753A (en) * | 2018-03-13 | 2018-09-14 | 腾讯科技(深圳)有限公司 | The determination method and relevant apparatus of duplicate message |
CN111859896A (en) * | 2019-04-01 | 2020-10-30 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN111859896B (en) * | 2019-04-01 | 2022-11-25 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN110716533A (en) * | 2019-10-29 | 2020-01-21 | 山东师范大学 | Key subsystem identification method and system influencing reliability of numerical control equipment |
CN112883704A (en) * | 2021-04-29 | 2021-06-01 | 南京视察者智能科技有限公司 | Big data similar text duplicate removal preprocessing method and device and terminal equipment |
CN112883704B (en) * | 2021-04-29 | 2021-07-16 | 南京视察者智能科技有限公司 | Big data similar text duplicate removal preprocessing method and device and terminal equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20100106 |