CN101620616A - Chinese similar web page de-emphasis method based on microcosmic characteristic - Google Patents

Chinese similar web page de-emphasis method based on microcosmic characteristic Download PDF

Info

Publication number
CN101620616A
CN101620616A (application number CN200910083711A)
Authority
CN
China
Prior art keywords
pos
document
keyword
vector
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910083711A
Other languages
Chinese (zh)
Inventor
曹玉娟
牛振东
赵堃
赵育民
江鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN200910083711A priority Critical patent/CN101620616A/en
Publication of CN101620616A publication Critical patent/CN101620616A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese near-duplicate web page deduplication method based on microcosmic (small-world) characteristics, in order to solve the problem of automatically detecting Chinese web pages with similar content. The method takes both the syntactic and the semantic information of a web page into account and comprises the following steps: first, a term co-occurrence graph is built from the effective information extracted from the web page; second, a document feature vector is extracted, comprising the keyword terms and the keyword position information; finally, a document keyword inverted-index file is built that makes full use of the retrieval system and the classification information, document feature vector matching is performed against the inverted index, and near-duplicate web pages are thereby detected and removed. The method effectively reduces the harmful effect of noise information on the accuracy of the algorithm, considers both the content and the structural information of the web page text, makes full use of the advantages of the retrieval and classification systems, achieves a deduplication precision greater than 90% and an average recall greater than 80%, and is particularly suitable for large-scale web page deduplication.

Description

A Chinese near-duplicate web page deduplication method based on the small-world characteristic
Technical field
The present invention relates to a Chinese near-duplicate web page deduplication method and belongs to the technical field of intelligent information retrieval in computer networks.
Technical background
With the unprecedented growth of Internet technology and scale, the Internet has become one of the main channels for obtaining information. A survey in July 2007 counted more than 125 million web sites in total. Search engines, with their convenient search functions, have become the main tool with which network users retrieve information, and the quality and efficiency of information retrieval directly affect the overall performance of a search engine. A statistical report issued by CNNIC in July 2005 shows that when users were asked "what is the biggest problem encountered when retrieving information", 44.6% chose the option "too much duplicate information", ranking it first. Faced with massive amounts of information, users do not want to see piles of identical or nearly identical data. How to help users obtain the information they need more quickly and accurately is a new problem facing network information services. In recent years, much research has been carried out on the detection of near-duplicate web pages, for example detection of structural similarity, detection of hyperlink similarity, and detection of content similarity.
Usually, documents with identical wording and structure are regarded as duplicate documents. Removing duplicate documents is easy with traditional plagiarism-detection techniques, but detecting documents whose content is merely similar is not so simple. Near-duplicate web pages are pages whose main body content is essentially the same, regardless of whether their wording and structure are completely identical. For detecting content similarity between web pages, text copy-detection methods can be used, which fall into two classes: syntax-based methods (Shingle-based methods) and semantics-based methods (Term-based methods).
(1) Shingle-based methods
A Shingle is a contiguous sequence of words in a document. Shingle-based methods select a series of Shingles from a document and map each Shingle to a hash value in a hash table. Finally, the number or proportion of identical Shingles in the hash table is used as the criterion for judging text similarity. To make detection feasible on large document collections, researchers have adopted different sampling strategies to reduce the number of Shingles that take part in the comparison.
Dr. Heintze of Bell Laboratories, in "Scalable document fingerprinting", proposed selecting the N Shingles with the smallest hash values and removing frequently occurring Shingles. Bharat of the Google research centre, in "A comparison of techniques to find mirrored hosts on the WWW", proposed selecting the Shingles whose hash values are multiples of 25 and taking at most 400 Shingles per document. Broder of the Digital Systems Research Center, in "Syntactic clustering of the web", proposed joining several Shingles into a Supershingle and computing document similarity by comparing the hash values of the Supershingles. Although the Supershingle algorithm requires less computation, Broder found that it is not suitable for detecting short documents. Dr. Fetterly of Microsoft Research, in "On the evolution of clusters of near-duplicate web pages", proposed treating 5 consecutive words as a Shingle, sampling 84 Shingles per document and combining them into 6 Supershingles; documents sharing 2 identical Supershingles are regarded as having similar content. Wu Pingbo et al. of Tsinghua University, in "Research on fast duplicate removal for large-scale Chinese web pages based on feature strings", exploited the fact that punctuation marks mostly appear in the main text of a web page and used the five Chinese characters on each side of every full stop as Shingles that uniquely represent the page.
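To make the Shingle idea concrete, the following minimal sketch computes a sampled shingle set and a resemblance score; the shingle length of 5 words and the "hash value is a multiple of 25, at most 400 per document" rule follow the descriptions above, while the hash function and whitespace tokenisation are assumptions made purely for the illustration.

```python
import hashlib

def shingle_hashes(tokens, k=5):
    """Hash every contiguous k-token shingle of a token list."""
    hashes = set()
    for i in range(len(tokens) - k + 1):
        text = " ".join(tokens[i:i + k])
        # 64-bit integer hash of the shingle text (an arbitrary choice for the sketch)
        hashes.add(int(hashlib.md5(text.encode("utf-8")).hexdigest()[:16], 16))
    return hashes

def sampled_shingles(tokens, k=5, modulus=25, cap=400):
    """Keep only shingles whose hash is a multiple of `modulus`, at most `cap` of them."""
    kept = sorted(h for h in shingle_hashes(tokens, k) if h % modulus == 0)
    return set(kept[:cap])

def resemblance(tokens_a, tokens_b):
    """Jaccard resemblance of two documents over their sampled shingle sets."""
    a, b = sampled_shingles(tokens_a), sampled_shingles(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```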
(2) Term-based methods
Term-based methods basically use single terms as the elementary unit of computation and obtain the similarity of two documents from the cosine of their document feature vectors, without considering the positions or the order in which the terms occur. Because many feature-extraction techniques are involved (in particular the selection of the feature vector), Term-based methods are more complicated than Shingle-based algorithms.
The I-Match algorithm of Chowdhury decides which words to select for the feature vector by computing the inverse document frequency (IDF): IDF = log(N/n), where N is the number of documents in the collection and n is the number of documents containing the keyword. I-Match is based on the inference that "frequently occurring words add no semantic information to a document within a collection"; it removes the words with small IDF values and thereby obtains a better document representation. The keywords that pass the filter are sorted in descending order to form the document's fingerprint, and documents with identical fingerprints are regarded as near-duplicates. In the worst case (all documents are near-duplicates), the time complexity of I-Match is O(n log n).
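A minimal sketch of the I-Match fingerprinting idea described above; the IDF cut-off value and the SHA-1 fingerprint are illustrative assumptions, not details taken from the cited work.

```python
import hashlib
import math

def imatch_fingerprint(tokens, doc_freq, n_docs, idf_min=1.5):
    """Filter out low-IDF words, sort the survivors, and hash them into a fingerprint.

    tokens   : tokens of one document
    doc_freq : dict token -> number of documents containing it
    n_docs   : total number of documents in the collection
    """
    kept = sorted({t for t in tokens
                   if math.log(n_docs / max(doc_freq.get(t, 1), 1)) > idf_min})
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

def group_near_duplicates(docs, doc_freq, n_docs):
    """Documents whose fingerprints collide are reported as near-duplicates."""
    groups = {}
    for doc_id, tokens in docs.items():
        groups.setdefault(imatch_fingerprint(tokens, doc_freq, n_docs), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```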
These existing detection methods have the following defects and shortcomings. Shingle-based methods require exact matches when detecting fully duplicated documents, so documents whose content is merely similar are missed. Term-based methods that use only the key terms are insufficient: web documents with different content may sometimes share the same keywords, which can lead to wrong judgements, so keywords alone are not enough for detecting document similarity.
Summary of the invention
The objective of the present invention is to overcome the deficiencies of the prior art, to solve the problem of automatically detecting Chinese web pages with similar content, and to help users obtain the information they need more quickly and accurately, by providing a Chinese near-duplicate web page deduplication method based on the small-world characteristic. The method takes both the syntactic and the semantic information of a web page into account: it builds a term co-occurrence graph based on the small-world characteristic of the text, extracts a feature vector to represent the document, and makes full use of the retrieval system and the classification information to detect and remove near-duplicate web pages.
To achieve the above objective, the technical solution of the method of the present invention is as follows.
The Chinese near-duplicate web page deduplication method based on the small-world characteristic of the present invention comprises the following steps:
Step 1: for a newly input web page, extract the effective information of the page to obtain its effective text.
Advertising information contained in a web page, navigation information linking to other pages, and the like all interfere with retrieval of the page's content. Therefore, before the content of the page is indexed, the effective text is extracted from it.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph.
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph.
The small-world phenomenon originates from the research on tracing shortest paths in American social networks carried out by the sociologist Milgram in 1967. The study showed that for any pair of Americans, a chain of no more than six mutual acquaintances can generally be found connecting them; this is the famous "six degrees of separation" problem. The paper "Collective dynamics of small-world networks", published by Watts in Nature in 1998, investigated the small-world phenomenon in depth and proposed that small-world networks have a high clustering coefficient and short path lengths. In recent years, applications of small-world theory to the study of various complex networks (transportation networks, transmission, Internet control, and so on) have continued to emerge. Yutaka Matsuo, in "Extracting Keywords in a Document as a Small World", and Ramon Ferrer, in "The small world of human language", pointed out that human language, and likewise the term co-occurrence graph formed from a document, have the small-world characteristic. Therefore, by regarding the keywords of a document as key nodes and extracting those key nodes, i.e., the key concepts the document explains, according to the small-world characteristic of the term co-occurrence graph, exact matching is no longer required when comparing documents, which prevents documents with similar content from being missed.
In the term co-occurrence graph G_L, the keywords of the document under test are the key nodes of G_L. Let d be the characteristic path length of G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, and let d_i be the average path length of CN_i. The contribution of node t_i to the small-world character of G_L is defined as CB_i = d_i - d; the larger CB_i is, the more critical the word represented by node t_i is to holding the structure of the whole document together. The key nodes bridge the concepts of the document by providing "shortcuts" between them. Once they are lost, the document breaks up into small, unrelated networks segmented by topic, and its structure becomes loose.
(1) Obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L.
After the term co-occurrence graph G_L has been constructed, its two fundamental characteristics, the clustering coefficient C and the characteristic path length d, can be computed.
For a node t_i ∈ T_L, define its set of neighbour nodes as Γ_i = { j | ξ_{i,j} = 1 }. The clustering coefficient of t_i is then
C_i = |E(Γ_i)| / ( k(k-1)/2 )
where k is the number of neighbour nodes and |E(Γ_i)| is the number of edges actually present between the neighbours, so that |E(Γ_i)| ≤ k(k-1)/2. The clustering coefficient of the term co-occurrence graph G_L is thus
C = (1/L) Σ_{i=1}^{L} C_i
For two given nodes t_i, t_j ∈ T_L, let d_min(i, j) be the length of the shortest path between them. The average path length of node t_i is d_i = (1/(L-1)) Σ_{j=1, j≠i}^{L} d_min(i, j), and the characteristic path length of the term co-occurrence graph G_L is d = (1/L) Σ_{i=1}^{L} d_i.
(2) According to CB_i = d_i - d, obtain the contribution CB_i of each node t_i to the small-world character of the term co-occurrence graph G_L.
(3) Sort the contributions CB_i obtained in (2) in descending order and select the top N nodes with the largest CB_i values as the document keyword sequence Ti; the value of N is chosen by the user. Using the keywords alone, however, is not sufficient, because the traditional cosine similarity computed from them is not enough for judging document similarity, so the position information Pos of the keywords must also be recorded.
The position information Pos of the keywords, i.e., the positions of the feature terms in the document, is also very important for the detection of near-duplicate documents. A list of vectors V_p = (Lp_1, ..., Lp_i, ..., Lp_N), with Lp_i = (Pos_{i,1}, ..., Pos_{i,j}, ..., Pos_{i,n}), records the positions of the feature terms, where Pos_{i,j} is the position of the j-th occurrence of the i-th term in the document. That is, a document is represented by its N keywords and their position information Pos_{i,j}; V_p is the matrix that stores the keyword position information Pos_{i,j}, and Lp_i is the i-th row vector of V_p, i.e., the position vector of the i-th keyword:
V_p = [ Lp_1; Lp_2; ...; Lp_N ] = [ [Pos_{1,1}, ..., Pos_{1,m}]; [Pos_{2,1}, ..., Pos_{2,n}]; ...; [Pos_{N,1}, ..., Pos_{N,k}] ]
The position information Pos of the keywords, together with the key terms themselves, constitutes the text feature vector Va.
Step 4: construct the document keyword inverted-index file and perform document feature vector matching against the inverted-index file. The steps are as follows:
(1) If the web page is the first document, read its page classification label Ca and build an inverted index over its keywords; the indexes of all pages whose classification label is Ca form the feature vector index library IDXV_Ca.
To access feature vectors quickly, an index mechanism must be built over the feature terms. An inverted index has the advantages of being relatively simple to implement, fast to query, and easy to extend to synonym queries; building an inverted-index file over the feature terms significantly improves retrieval efficiency.
(2) If the web page is not the first document, look up the first m items of the keyword sequence Ti obtained in step 3 in the key-term inverted-index file of the same category, where m ≤ N and the value of m is chosen by the user. If the retrieval result is empty, make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification label Ca of the document, to the feature vector index library IDXV_Ca. If the retrieval returns k matches Vdi (i = 1, 2, ..., k, k > 0), meaning that all m keywords appear in those k documents, then for every matched document compute the similarity ξ between the text feature vector Va and the term vector of Vdi, as follows:
ξ = (d1 · d2) / (‖d1‖ × ‖d2‖) = Σ_{i=1}^{m} d1(i)·d2(i) / ( √(Σ_{i=1}^{m} d1(i)²) × √(Σ_{i=1}^{m} d2(i)²) )
Here d1 denotes the web document under test, d2 denotes the document among the k matched documents that is currently being compared with d1, and d1(i), d2(i) are term frequencies. If ξ ≤ a preset threshold (the threshold represents the required similarity; its value does not exceed 1), make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification information of the document, to the feature vector index library IDXV_Ca. If ξ > the preset threshold, further compute the mean S of the mean square deviations of the position vectors of Va and Vdi, as follows.
First, obtain the distance matrix of the feature terms of documents d1 and d2:
V_p1 - V_p2 = [ [Pos_{1,1}^1 - Pos_{1,1}^2, ..., Pos_{1,m}^1 - Pos_{1,m}^2]; [Pos_{2,1}^1 - Pos_{2,1}^2, ..., Pos_{2,n}^1 - Pos_{2,n}^2]; ...; [Pos_{N,1}^1 - Pos_{N,1}^2, ..., Pos_{N,k}^1 - Pos_{N,k}^2] ]
= [ δP_{11}, δP_{12}, ..., δP_{1m}; ...; δP_{N1}, δP_{N2}, ..., δP_{Nm} ]
Here Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in document d1, and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in document d2; δP_{ij} is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents.
Next, obtain the mean square deviation S_i of each row of the matrix V_p1 - V_p2. First compute the mean distance AVG_i of the i-th keyword:
AVG_i = Σ_{j=1}^{r} |Pos_{ij}^1 - Pos_{ij}^2| / r = Σ_{j=1}^{r} |δP_{ij}| / r
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2. The distribution of the distances of the i-th keyword is then expressed by its mean square deviation S_i:
S_i = √( Σ_{j=1}^{r} (δP_{ij} - AVG_i)² / r )
The distance distribution of the N keywords of the whole document is expressed by the mean S of the mean square deviations:
S = Σ_{i=1}^{N} S_i / N
If S < a preset distance threshold, the web page under test and the web page corresponding to Vdi are judged to be content-similar web pages; otherwise, the judgement "no content-similar web page detected" is made, and the keywords in the document's text feature vector Va are added incrementally, according to the classification information of the document, to the feature vector index library IDXV_Ca.
At this point, the detection and deduplication of Chinese near-duplicate web pages is complete.
Beneficial effect
The main factor affecting the accuracy of web page deduplication is web page noise, and the method of the present invention effectively reduces the harmful effect of noise information on the accuracy of the algorithm. Document keywords are extracted using the small-world characteristic of the text. The method considers not only the content but also the structural information of the web page text, and at the same time makes full use of the advantages of the retrieval and classification systems, achieving a deduplication precision greater than 90% and an average recall greater than 80%. The method has approximately linear time behaviour and good space efficiency; comparing feature terms requires only a single scan of the feature vector index library, so the method is particularly suitable for large-scale web page deduplication. In addition, web pages that are frequently reposted are more important, and their retrieval ranking values deserve to be raised to reflect that importance; timely discovery of near-duplicate pages therefore also helps improve the retrieval quality of a search-engine system.
Description of drawings
Fig. 1 is the flow chart of the method of the present invention;
Fig. 2 is the flow chart of extracting the document feature vector;
Fig. 3 is the flow chart of performing document feature vector matching according to the inverted-index file and the keyword position vectors;
Fig. 4 is the term co-occurrence graph of the web page "Russia's 'Bulava' new intercontinental ballistic missile fails in test launch" in the embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawings and embodiments.
As shown in Fig. 1, the Chinese near-duplicate web page deduplication method based on the small-world characteristic of the present invention comprises the following steps:
Step 1: for a newly input web page, extract the effective information of the page to obtain its effective text.
Advertising information contained in a web page, navigation information linking to other pages, and the like all interfere with retrieval of the page's content. Therefore, before the content of the page is indexed, the effective text is extracted from it.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph.
First, the effective text is preprocessed by sentence segmentation, word segmentation and stop-word removal in turn, yielding the processed document. Words t_i whose frequency of occurrence in the document satisfies f > f_thr (f_thr is a preset threshold, set here to 2) are chosen as the nodes of the term co-occurrence graph of the document, as sketched below.
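A minimal preprocessing sketch under the assumptions just stated; the jieba segmenter and the tiny stop-word set are placeholders chosen for the illustration and are not prescribed by the method.

```python
import re
from collections import Counter

import jieba  # third-party Chinese word-segmentation library (an assumed choice)

STOP_WORDS = {"的", "了", "在", "是", "和", "对", "将"}  # placeholder stop-word list

def split_sentences(text):
    """Split Chinese text into sentences at common end-of-sentence punctuation."""
    return [s for s in re.split(r"[。！？；\n]+", text) if s.strip()]

def preprocess(text, f_thr=2):
    """Return tokenised sentences and the candidate node words with frequency f > f_thr."""
    sentences = [[w for w in jieba.lcut(s) if w.strip() and w not in STOP_WORDS]
                 for s in split_sentences(text)]
    freq = Counter(w for sent in sentences for w in sent)
    nodes = {w for w, f in freq.items() if f > f_thr}
    return sentences, nodes
```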
Then, for each word pair (t_i, t_j), its Jaccard coefficient is computed as
J_{t_i,t_j} = n_{t_i,t_j} / ( n_{t_i} + n_{t_j} - n_{t_i,t_j} )
where n_{t_i,t_j} is the number of sentences containing both t_i and t_j, and n_{t_i} and n_{t_j} are the numbers of sentences containing t_i and t_j respectively.
If J_{t_i,t_j} > J_thr (J_thr is a preset threshold, usually set to 1.2), an edge is added between nodes t_i and t_j. The term co-occurrence graph of the document is thereby determined as G_L = (T_L, E_L), where T_L = {t_i} is the set of nodes, L is the total number of nodes, E_L = {{t_i, t_j}} is the set of edges, and ξ_{i,j} indicates whether an edge exists between nodes t_i and t_j: ξ_{i,j} = 1 if the edge {t_i, t_j} exists, and ξ_{i,j} = 0 otherwise.
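A sketch of the edge construction just described, operating on the output of the preprocessing sketch above. The graph is stored as an adjacency dictionary, which is an implementation convenience; the default threshold is an illustrative value, since a plain Jaccard coefficient never exceeds 1 and the value 1.2 quoted above therefore presumably refers to a scaled variant.

```python
from itertools import combinations

def build_cooccurrence_graph(sentences, nodes, j_thr=0.2):
    """Build the term co-occurrence graph G_L as an adjacency dict {word: set of words}."""
    # sentence ids containing each candidate node word
    contains = {w: {i for i, sent in enumerate(sentences) if w in sent} for w in nodes}
    graph = {w: set() for w in nodes}
    for ti, tj in combinations(nodes, 2):
        n_i, n_j = len(contains[ti]), len(contains[tj])
        n_ij = len(contains[ti] & contains[tj])
        union = n_i + n_j - n_ij
        if union and n_ij / union > j_thr:  # Jaccard coefficient exceeds J_thr: add edge
            graph[ti].add(tj)
            graph[tj].add(ti)
    return graph
```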
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph, as shown in Fig. 2.
The small-world phenomenon originates from the research on tracing shortest paths in American social networks carried out by the sociologist Milgram in 1967. The study showed that for any pair of Americans, a chain of no more than six mutual acquaintances can generally be found connecting them; this is the famous "six degrees of separation" problem. The paper "Collective dynamics of small-world networks", published by Watts in Nature in 1998, investigated the small-world phenomenon in depth and proposed that small-world networks have a high clustering coefficient and short path lengths. In recent years, applications of small-world theory to the study of various complex networks (transportation networks, transmission, Internet control, and so on) have continued to emerge. Yutaka Matsuo, in "Extracting Keywords in a Document as a Small World", and Ramon Ferrer, in "The small world of human language", pointed out that human language, and likewise the term co-occurrence graph formed from a document, have the small-world characteristic. Therefore, by regarding the keywords of a document as key nodes and extracting those key nodes, i.e., the key concepts the document explains, according to the small-world characteristic of the term co-occurrence graph, exact matching is no longer required when comparing documents, which prevents documents with similar content from being missed.
In the term co-occurrence graph G_L, the keywords of the document under test are the key nodes of G_L. Let d be the characteristic path length of G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, and let d_i be the average path length of CN_i. The contribution of node t_i to the small-world character of G_L is defined as CB_i = d_i - d; the larger CB_i is, the more critical the word represented by node t_i is to holding the structure of the whole document together. The key nodes bridge the concepts of the document by providing "shortcuts" between them. Once they are lost, the document breaks up into small, unrelated networks segmented by topic, and its structure becomes loose.
(1) Obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L.
After the term co-occurrence graph G_L has been constructed, its two fundamental characteristics, the clustering coefficient C and the characteristic path length d, can be computed.
For a node t_i ∈ T_L, define its set of neighbour nodes as Γ_i = { j | ξ_{i,j} = 1 }. The clustering coefficient of t_i is then
C_i = |E(Γ_i)| / ( k(k-1)/2 )
where k is the number of neighbour nodes and |E(Γ_i)| is the number of edges actually present between the neighbours, so that |E(Γ_i)| ≤ k(k-1)/2. The clustering coefficient of the term co-occurrence graph G_L is thus
C = (1/L) Σ_{i=1}^{L} C_i
For two given nodes t_i, t_j ∈ T_L, let d_min(i, j) be the length of the shortest path between them. The average path length of node t_i is d_i = (1/(L-1)) Σ_{j=1, j≠i}^{L} d_min(i, j), and the characteristic path length of the term co-occurrence graph G_L is d = (1/L) Σ_{i=1}^{L} d_i.
(2) According to CB_i = d_i - d, obtain the contribution CB_i of each node t_i to the small-world character of the term co-occurrence graph G_L.
(3) Sort the contributions CB_i obtained in (2) in descending order and select the top N nodes with the largest CB_i values as the document keywords Ti; the value of N is chosen by the user but is no less than 6 (see the sketch below). Using the keywords alone, however, is not sufficient, because the traditional cosine similarity computed from them is not enough for judging document similarity, so the position information Pos of the keywords must also be recorded.
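A sketch of the computations in (1)-(3), operating on the adjacency-dictionary graph of step 2. Breadth-first search is used for shortest paths and unreachable node pairs are simply skipped; this handling of disconnected graphs is an assumption of the sketch, not something the method spells out.

```python
from collections import deque

def clustering_coefficient(graph):
    """Clustering coefficient C: average over nodes of 2*e_i / (k*(k-1))."""
    total, counted = 0.0, 0
    for node, nbrs in graph.items():
        k = len(nbrs)
        counted += 1
        if k < 2:
            continue
        edges = sum(1 for a in nbrs for b in nbrs if a < b and b in graph[a])
        total += 2.0 * edges / (k * (k - 1))
    return total / counted if counted else 0.0

def _bfs_distances(graph, source):
    """Shortest-path lengths from source to every reachable node."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def characteristic_path_length(graph):
    """Characteristic path length d: mean of the per-node average shortest-path lengths d_i."""
    per_node = []
    for node in graph:
        dist = _bfs_distances(graph, node)
        others = [d for v, d in dist.items() if v != node]
        if others:
            per_node.append(sum(others) / len(others))
    return sum(per_node) / len(per_node) if per_node else 0.0

def select_keywords(graph, n_keywords):
    """Rank nodes by CB_i = d_i - d, i.e. how much removing the node lengthens the paths."""
    d = characteristic_path_length(graph)
    scored = []
    for node in graph:
        reduced = {u: nbrs - {node} for u, nbrs in graph.items() if u != node}
        scored.append((characteristic_path_length(reduced) - d, node))
    scored.sort(reverse=True)
    return [node for _, node in scored[:n_keywords]]
```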
The position information Pos of the keywords, i.e., the positions of the feature terms in the document, is also very important for the detection of near-duplicate documents. A list of vectors V_p = (Lp_1, ..., Lp_i, ..., Lp_N), with Lp_i = (Pos_{i,1}, ..., Pos_{i,j}, ..., Pos_{i,n}), records the positions of the feature terms, where Pos_{i,j} is the position of the j-th occurrence of the i-th term in the document. That is, a document is represented by its N keywords and their position information Pos_{i,j}; V_p is the matrix that stores the keyword position information Pos_{i,j}, and Lp_i is the i-th row vector of V_p, i.e., the position vector of the i-th keyword:
V_p = [ Lp_1; Lp_2; ...; Lp_N ] = [ [Pos_{1,1}, ..., Pos_{1,m}]; [Pos_{2,1}, ..., Pos_{2,n}]; ...; [Pos_{N,1}, ..., Pos_{N,k}] ]
The position information Pos of the keywords, together with the key terms themselves, constitutes the text feature vector Va.
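A sketch of building the position matrix V_p for the selected keywords. Character offsets are used as the positions, which is an assumption consistent with the byte-offset positions in the worked example later in this description.

```python
def position_vectors(text, keywords):
    """V_p: for each keyword, the list of offsets of its occurrences in the text."""
    v_p = {}
    for kw in keywords:
        positions, start = [], 0
        while True:
            idx = text.find(kw, start)
            if idx == -1:
                break
            positions.append(idx)
            start = idx + 1
        v_p[kw] = positions
    return v_p

# The text feature vector Va then pairs each keyword with its position vector:
# va = [(kw, v_p[kw]) for kw in keywords]
```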
Step 4: construct the document keyword inverted-index file and perform document feature vector matching against the inverted-index file. The steps are as follows:
(1) If the web page is the first document, build an inverted-index file over its keywords; all the index files together form the feature vector index library IDXV_Ca.
To access feature vectors quickly, an index mechanism must be built over the feature terms. An inverted index has the advantages of being relatively simple to implement, fast to query, and easy to extend to synonym queries; building an inverted-index file over the feature terms significantly improves retrieval efficiency, as sketched below.
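A minimal sketch of the per-category keyword inverted index IDXV_Ca. An in-memory dictionary stands in for the index file; the class and method names are invented for the illustration.

```python
from collections import defaultdict

class InvertedIndex:
    """Keyword -> document ids, plus the stored feature vectors, for one category Ca."""

    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> {doc_id, ...}
        self.features = {}                # doc_id -> (keywords, position vectors)

    def add(self, doc_id, keywords, positions):
        """Incrementally index a document's keywords (the 'increment build' in the text)."""
        self.features[doc_id] = (keywords, positions)
        for kw in keywords:
            self.postings[kw].add(doc_id)

    def lookup(self, query_keywords):
        """Ids of already-indexed documents containing all of the first m query keywords."""
        sets = [self.postings.get(kw, set()) for kw in query_keywords]
        return set.intersection(*sets) if sets else set()
```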
(2) If the web page is not the first document, look up the first m items of the keyword sequence Ti obtained in step 3 in the key-term inverted-index file of the same category, where m ≤ N and the value of m is chosen by the user. If the retrieval result is empty, make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification information of the document, to the feature vector index library IDXV_Ca. If the retrieval returns k matches Vdi (i = 1, 2, ..., k, k > 0), meaning that all m keywords appear in those k documents, then for every matched document compute the similarity ξ between the text feature vector Va and the term vector of Vdi, as follows:
ξ = (d1 · d2) / (‖d1‖ × ‖d2‖) = Σ_{i=1}^{m} d1(i)·d2(i) / ( √(Σ_{i=1}^{m} d1(i)²) × √(Σ_{i=1}^{m} d2(i)²) )
Here d1 denotes the web document under test, d2 denotes the document among the k matched documents that is currently being compared with d1, and d1(i), d2(i) are term frequencies. If ξ ≤ a preset threshold (the threshold represents the required similarity; its value does not exceed 1), make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification information of the document, to the feature vector index library IDXV_Ca. If ξ > the preset threshold, further compute the mean S of the mean square deviations of the position vectors of Va and Vdi, as follows.
First, obtain the distance matrix of the feature terms of documents d1 and d2:
V_p1 - V_p2 = [ [Pos_{1,1}^1 - Pos_{1,1}^2, ..., Pos_{1,m}^1 - Pos_{1,m}^2]; [Pos_{2,1}^1 - Pos_{2,1}^2, ..., Pos_{2,n}^1 - Pos_{2,n}^2]; ...; [Pos_{N,1}^1 - Pos_{N,1}^2, ..., Pos_{N,k}^1 - Pos_{N,k}^2] ]
= [ δP_{11}, δP_{12}, ..., δP_{1m}; ...; δP_{N1}, δP_{N2}, ..., δP_{Nm} ]
Here Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in document d1, and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in document d2; δP_{ij} is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents.
Next, obtain the mean square deviation S_i of each row of the matrix V_p1 - V_p2. First compute the mean distance AVG_i of the i-th keyword:
AVG_i = Σ_{j=1}^{r} |Pos_{ij}^1 - Pos_{ij}^2| / r = Σ_{j=1}^{r} |δP_{ij}| / r
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2. The distribution of the distances of the i-th keyword is then expressed by its mean square deviation S_i:
S_i = √( Σ_{j=1}^{r} (δP_{ij} - AVG_i)² / r )
The distance distribution of the N keywords of the whole document is expressed by the mean S of the mean square deviations:
S = Σ_{i=1}^{N} S_i / N
If S < a preset distance threshold, the web page under test and the web page corresponding to Vdi are judged to be content-similar web pages; otherwise, the judgement "no content-similar web page detected" is made, and the keywords in the document's text feature vector Va are added incrementally, according to the classification information of the document, to the feature vector index library IDXV_Ca.
At this point, the detection and deduplication of Chinese near-duplicate web pages is complete.
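The two-stage comparison of step 4(2) can be sketched as follows: a cosine similarity ξ over the first m keywords, followed by the mean S of the position mean square deviations. Pairing occurrences by index and treating a missing occurrence as position 0 follows the worked example below rather than an explicit rule in the text, and the default thresholds are the example's values (80% and 117 bytes).

```python
import math

def cosine_similarity(tf1, tf2, query_keywords):
    """ξ: cosine similarity of two documents over the first m keywords (term frequencies)."""
    num = sum(tf1.get(k, 0) * tf2.get(k, 0) for k in query_keywords)
    n1 = math.sqrt(sum(tf1.get(k, 0) ** 2 for k in query_keywords))
    n2 = math.sqrt(sum(tf2.get(k, 0) ** 2 for k in query_keywords))
    return num / (n1 * n2) if n1 and n2 else 0.0

def position_deviation(pos1, pos2, keywords):
    """S: mean over the keywords of the mean square deviation of position differences."""
    s_values = []
    for kw in keywords:
        a, b = list(pos1.get(kw, [])), list(pos2.get(kw, []))
        r = max(len(a), len(b))
        if r == 0:
            continue
        a += [0] * (r - len(a))  # missing occurrences treated as position 0,
        b += [0] * (r - len(b))  # as in the worked example below
        deltas = [abs(x - y) for x, y in zip(a, b)]          # δP_ij
        avg = sum(deltas) / r                                # AVG_i
        s_values.append(math.sqrt(sum((dp - avg) ** 2 for dp in deltas) / r))  # S_i
    return sum(s_values) / len(s_values) if s_values else float("inf")

def is_near_duplicate(tf1, pos1, tf2, pos2, query_keywords, all_keywords,
                      sim_thr=0.80, dist_thr=117.0):
    """Decision rule of step 4(2) with the worked example's thresholds as defaults."""
    if cosine_similarity(tf1, tf2, query_keywords) <= sim_thr:
        return False
    return position_deviation(pos1, pos2, all_keywords) < dist_thr
```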
To evaluate the correctness and efficiency of this method, a series of experiments was designed.
Correctness is the life of an algorithm. Two evaluation criteria are used here, the duplicate-page recall rate (Recall) and the deduplication precision (Precision), defined as follows:
Recall = (number of near-duplicate pages correctly detected) / (total number of near-duplicate pages in the collection)
Precision = (number of near-duplicate pages correctly detected) / (total number of pages reported as near-duplicates)
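Computed in the obvious way (a trivial sketch; the variable names are illustrative):

```python
def recall_and_precision(detected_ids, true_duplicate_ids):
    """Both arguments are sets of page identifiers."""
    true_positives = len(detected_ids & true_duplicate_ids)
    recall = true_positives / len(true_duplicate_ids) if true_duplicate_ids else 0.0
    precision = true_positives / len(detected_ids) if detected_ids else 0.0
    return recall, precision
```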
To test the performance of DDW, 72 query terms were selected in three domains (military affairs, medicine and computing) and submitted to Google. From each group of retrieval results, web pages with identical or similar content were chosen, 5835 pages in total, and these near-duplicate pages were inserted into an existing document collection containing 1,028,568 web pages. The I-Match algorithm (likewise using 20 feature words) and the DDW algorithm were then both run to detect near-duplicate pages.
1. For the 23 queries entered in the military domain, the experimental results are shown in Table 1:
Table 1. Precision and recall statistics for the military-domain test samples
2. For the 28 queries entered in the medical domain, 20 groups of which correspond to knowledge-introduction web pages and 8 groups to news pages, the experimental results are shown in Table 2:
Table 2. Precision and recall statistics for the medical-domain test samples
3. For the 21 queries entered in the computing domain, all corresponding to news pages, the experimental results are shown in Table 3:
Table 3. Precision and recall statistics for the computing-domain test samples
The above experimental results show that, compared with the prior art, the method of the present invention achieves higher precision and recall.
Embodiment
For example, for the web page with URL "http://cs.taoyuan.gov.cn/news/ReadNews.asp?NewsID=4727", which reports the failed test launch of Russia's "Bulava" new intercontinental ballistic missile, detect whether a content-similar web page exists.
Step 1: for the newly input web page, extract the effective information of the page to obtain its effective text, which is as follows:
Russia's "Bulava" new intercontinental ballistic missile fails in test launch
Date issued: October 26, 2006
Source: Rednet
Entered by editor
Photo caption: the Russian army's "Bulava" (also called the "Mace") intercontinental ballistic missile
Photo caption: the "Dmitry Donskoy" nuclear submarine used to launch the sea-based "Bulava" intercontinental ballistic missile
Xinhuanet, Moscow, October 25 (reporter Yue Lianguo): the Russian Navy's press and public relations office confirmed on the 25th that a new "Bulava" (also called the "Mace") intercontinental ballistic missile test-fired by the Russian military that day deviated from its trajectory and fell into the sea, so that the test launch failed. According to Russian media reports, the ballistic missile was launched underwater at 17:05 Moscow time on the 25th (21:05 Beijing time) from the Russian Northern Fleet strategic nuclear submarine "Dmitry Donskoy" in the White Sea. According to plan, the warhead was to hit a designated target on the Kamchatka Peninsula in the Russian Far East, but the missile deviated from its trajectory and fell into the sea a few minutes after launch. A special commission made up of representatives of the Russian Ministry of Defence and of the missile's design and production organisations will investigate the cause of the failed launch. The fifth test launch of the "Bulava" missile, carried out by the Russian military on 7 September this year, also ended in failure; that missile likewise deviated from its trajectory and fell into the sea a few minutes after launch. The first four test launches of the missile were all successful. The "Bulava" can carry 10 independently targeted nuclear warheads and has a range of up to 8,000 kilometres. According to Solomonov, chief designer of the Moscow Institute of Thermal Technology, which is responsible for developing the missile, the "Bulava" and the "Topol-M" intercontinental ballistic missiles will form the backbone of Russia's future strategic nuclear forces. According to the Russian military's plan, the "Bulava" will enter service with the Russian Navy in 2008, and several more test launches will be carried out before then.
Step 2: process the effective text extracted in step 1 and construct the term co-occurrence graph, as shown in Fig. 4.
Step 3: extract the document feature vector according to the small-world characteristic of the term co-occurrence graph, as shown in Table 4.
Table 4. Feature vector of the web page "Russia's 'Bulava' new intercontinental ballistic missile fails in test launch"
No. | Keyword | Term frequency | Position vector
1 | missile | 12 | {421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150}
2 | test launch | 7 | {33, 309, 385, 715, 775, 866, 1166}
3 | Bulava | 8 | {9, 11, 112, 323, 767, 888, 1006, 1090}
4 | intercontinental ballistic missile | 5 | {21, 134, 348, 776, 1029}
5 | launch | 4 | {160, 529, 537, 623, 823}
6 | Moscow | 3 | {227, 477, 958}
7 | 25th | 3 | {237, 283, 487}
8 | Mace | 3 | {125, 165, 336}
9 | new-type | 3 | {16, 345, 772}
10 | 05 minutes | 2 | {495, 513}
11 | October | 3 | {58, 229, 904}
12 | Dmitry Donskoy | 2 | {192, 442}
13 | Russian Navy | 2 | {427, 1113}
14 | carry out | 4 | {547, 728, 756, 1155}
15 | failure | 4 | {36, 392, 718, 802}
16 | deviate | 3 | {361, 635, 834}
17 | trajectory | 3 | {365, 639, 838}
18 | nuclear submarine | 2 | {214, 468}
19 | Russia | 2 | {1, 257}
20 | sea-based intercontinental ballistic missile | 1 | {171}
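For illustration, the first few rows of Table 4 as they would be held in the Va and V_p structures of step 3; the keyword strings are the English glosses used in the table and the position values are copied from it.

```python
# Feature vector Va of the page in Table 4: (keyword, term frequency, position vector)
va_example = [
    ("missile", 12, [421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150]),
    ("test launch", 7, [33, 309, 385, 715, 775, 866, 1166]),
    ("Bulava", 8, [9, 11, 112, 323, 767, 888, 1006, 1090]),
    ("intercontinental ballistic missile", 5, [21, 134, 348, 776, 1029]),
]

# The corresponding rows of the position matrix V_p
v_p_example = {kw: positions for kw, _, positions in va_example}
```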
Step 4: construct the document keyword inverted-index file and perform document feature vector matching against the inverted-index file.
The page classification label Ca of "Russia's 'Bulava' new intercontinental ballistic missile fails in test launch" is read, and the first eight entries of its keyword sequence {missile, test launch, Bulava, intercontinental ballistic missile, launch, Moscow, 25th, Mace, new-type, 05 minutes, October, Dmitry Donskoy} are looked up in the document keyword index file IDXV_Ca. One match is found, whose keywords are {missile, test launch, strategic, Bulava, intercontinental ballistic missile, new-type, Moscow, launch, 25th, Mace, 05 minutes, October, failure, Dmitry Donskoy, Russian Navy, carry out, sea-based strategic missile, Russia, trajectory, fleet}. The corresponding web page feature vector is shown in Table 5:
Table 5. Feature vector of the web page corresponding to the match
No. | Keyword | Term frequency | Position vector
1 | missile | 19 | {421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150, 1133, 1645, 1675, 1745, 1947, 2168, 2262}
2 | test launch | 13 | {33, 309, 385, 715, 775, 866, 1166, 1275, 1451, 1504, 1573, 1583, 1651}
3 | strategic | 10 | {465, 1054, 1247, 1563, 1707, 1923, 1961, 1987, 2050, 2198}
4 | Bulava | 11 | {9, 11, 110, 321, 765, 886, 1004, 1088, 1287, 1460, 1551, 1699}
5 | intercontinental ballistic missile | 11 | {20, 134, 349, 777, 1030, 1315, 1472, 1761, 1869, 2128, 2400}
6 | new-type | 10 | {17, 345, 773, 1311, 1468, 1541, 1803, 1939, 2124, 2396}
7 | Moscow | 6 | {226, 447, 957, 1422, 1717, 2330}
8 | launch | 5 | {160, 529, 537, 623, 823}
9 | 25th | 5 | {237, 283, 487, 1165, 1184}
10 | Mace | 6 | {126, 166, 337, 1303, 1863, 2414}
11 | 05 minutes | 2 | {495, 513}
12 | October | 4 | {58, 229, 1180, 1361}
13 | failure | 6 | {37, 393, 719, 803, 1512, 2218}
14 | Dmitry Donskoy | 2 | {192, 443}
15 | Russian Navy | 3 | {427, 1114, 1518}
16 | carry out | 6 | {547, 729, 757, 1156, 1537, 1821}
17 | sea-based strategic missile | 3 | {1559, 1957, 2046}
18 | Russia | 3 | {1, 257, 1205, 1408}
19 | trajectory | 4 | {365, 639, 839, 1488}
20 | fleet | 3 | {437, 1241, 1418}
The similarity is computed using the cosine formula of step 4:
ξ = 81.51% > 80% (the preset threshold), so the two web pages are judged to be possibly content-similar, and the keyword distance mean square deviation S of the two documents is computed next.
Taking the word "missile" as an example, its position vectors in the two articles are respectively
{421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150}
and
{421, 561, 617, 669, 711, 813, 857, 894, 952, 1012, 1096, 1150, 1133, 1645, 1675, 1745, 1947, 2168, 2262}
AVG_1 = Σ_{j=1}^{r} |Pos_{1j}^1 - Pos_{1j}^2| / r
= ( |421-421| + |561-561| + |617-617| + |669-669| + |711-711| + |813-813| + |857-857| + |894-894| + |952-952| + |1012-1012| + |1096-1096| + |1150-1150| + |0-1133| + |0-1645| + |0-1675| + |0-1745| + |0-1947| + |0-2168| + |0-2262| ) ÷ 19
= 661
δP_{1,1} = |421 - 421| = 0, ..., δP_{1,19} = |0 - 2262| = 2262
S_1 = √( Σ_{j=1}^{r} (δP_{1j} - AVG_1)² / r ) = 197.6
The distance distribution of the N keywords of the whole document is
S = Σ_{i=1}^{N} S_i / N = 264
The length of the web page "Russia's 'Bulava' new intercontinental ballistic missile fails in test launch" is 1165 bytes, and the distance threshold is set to 10% of the page length, i.e., 117 bytes. Since S > the distance threshold, the two documents are judged not to be content-similar web pages. According to the classification information of the document, i.e., its page classification label Ca, the keywords of the document are added incrementally to the feature vector index library IDXV_Ca, which completes the processing of the document "Russia's 'Bulava' new intercontinental ballistic missile fails in test launch".

Claims (2)

1. A Chinese near-duplicate web page deduplication method based on the small-world characteristic, characterized by comprising the following steps:
step 1: for a newly input web page, extracting the effective information of the page to obtain its effective text;
step 2: processing the effective text extracted in step 1 and constructing a term co-occurrence graph;
step 3: extracting the document feature vector according to the small-world characteristic of the term co-occurrence graph, the implementation being as follows:
let d be the characteristic path length of the term co-occurrence graph G_L, let CN_i be the co-occurrence graph obtained by removing the i-th node, let d_i be the average path length of CN_i, and let the contribution of node t_i to the small-world character of G_L be CB_i = d_i - d;
(1) obtain the clustering coefficient C and the characteristic path length d of the term co-occurrence graph G_L;
for a node t_i ∈ T_L, define its set of neighbour nodes as Γ_i = { j | ξ_{i,j} = 1 }; the clustering coefficient of t_i is then
C_i = |E(Γ_i)| / ( k(k-1)/2 )
where k is the number of neighbour nodes and |E(Γ_i)| is the number of edges actually present between the neighbours, so that |E(Γ_i)| ≤ k(k-1)/2; the clustering coefficient of the term co-occurrence graph G_L is thus
C = (1/L) Σ_{i=1}^{L} C_i
for two given nodes t_i, t_j ∈ T_L, d_min(i, j) is the length of the shortest path between them; the average path length of node t_i is d_i = (1/(L-1)) Σ_{j=1, j≠i}^{L} d_min(i, j), and the characteristic path length of the term co-occurrence graph G_L is d = (1/L) Σ_{i=1}^{L} d_i;
(2) according to CB_i = d_i - d, obtain the contribution CB_i of each node t_i to the small-world character of the term co-occurrence graph G_L;
(3) sort the contributions CB_i obtained in (2) in descending order and select the top N nodes with the largest CB_i values as the document keyword sequence Ti, the value of N being chosen by the user;
afterwards, record the position information Pos of the keywords: a list of vectors V_p = (Lp_1, ..., Lp_i, ..., Lp_N), with Lp_i = (Pos_{i,1}, ..., Pos_{i,j}, ..., Pos_{i,n}), records the positions of the feature terms, where Pos_{i,j} is the position of the j-th occurrence of the i-th term in the document; that is, a document is represented by its N keywords and their position information Pos_{i,j}, V_p being the matrix that stores the keyword position information Pos_{i,j} and Lp_i being the i-th row vector of V_p, i.e., the position vector of the i-th keyword:
V_p = [ Lp_1; Lp_2; ...; Lp_N ] = [ [Pos_{1,1}, ..., Pos_{1,m}]; [Pos_{2,1}, ..., Pos_{2,n}]; ...; [Pos_{N,1}, ..., Pos_{N,k}] ]
the position information Pos of the keywords, together with the key terms, constitutes the text feature vector Va;
step 4: constructing the document keyword inverted-index file and performing document feature vector matching against the inverted-index file, the detailed process being as follows:
(1) if the web page is the first document, read its page classification label Ca and build an inverted index over its keywords; the indexes of all pages whose classification label is Ca form the feature vector index library IDXV_Ca;
(2) if the web page is not the first document, look up the first m items of the keyword sequence Ti obtained in step 3 in the key-term inverted-index file of the same category, where m ≤ N and the value of m is chosen by the user; if the retrieval result is empty, make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification label Ca of the document, to the feature vector index library IDXV_Ca; if the retrieval returns k matches Vdi (i = 1, 2, ..., k, k > 0), compute for every matched document the similarity ξ between the text feature vector Va and the term vector of Vdi; if ξ ≤ a preset threshold, make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification label Ca of the document, to the feature vector index library IDXV_Ca; if ξ > the preset threshold, further compute the mean S of the mean square deviations of the position vectors of Va and Vdi, as follows:
first, obtain the distance matrix of the feature terms of documents d1 and d2:
V_p1 - V_p2 = [ [Pos_{1,1}^1 - Pos_{1,1}^2, ..., Pos_{1,m}^1 - Pos_{1,m}^2]; [Pos_{2,1}^1 - Pos_{2,1}^2, ..., Pos_{2,n}^1 - Pos_{2,n}^2]; ...; [Pos_{N,1}^1 - Pos_{N,1}^2, ..., Pos_{N,k}^1 - Pos_{N,k}^2] ]
= [ δP_{11}, δP_{12}, ..., δP_{1m}; ...; δP_{N1}, δP_{N2}, ..., δP_{Nm} ]
where Pos_{1,1}^1 denotes the position of the 1st occurrence of the 1st keyword in document d1, Pos_{1,1}^2 the position of the 1st occurrence of the 1st keyword in document d2, Pos_{1,m}^1 the position of the m-th occurrence of the 1st keyword in document d1, and Pos_{1,m}^2 the position of the m-th occurrence of the 1st keyword in document d2; δP_{ij} is the difference between the positions of the j-th occurrence of the i-th keyword in the two documents;
next, obtain the mean square deviation S_i of each row of the matrix V_p1 - V_p2: first compute the mean distance AVG_i of the i-th keyword:
AVG_i = Σ_{j=1}^{r} |Pos_{ij}^1 - Pos_{ij}^2| / r = Σ_{j=1}^{r} |δP_{ij}| / r
where r is the larger of the numbers of occurrences of the i-th keyword in documents d1 and d2; the distribution of the distances of the i-th keyword is then expressed by its mean square deviation S_i:
S_i = √( Σ_{j=1}^{r} (δP_{ij} - AVG_i)² / r )
the distance distribution of the N keywords of the whole document is expressed by the mean S of the mean square deviations:
S = Σ_{i=1}^{N} S_i / N
if S < a preset distance threshold, judge that the web page under test and the web page corresponding to Vdi are content-similar web pages; otherwise, make the judgement "no content-similar web page detected" and add the keywords in the document's text feature vector Va incrementally, according to the classification information of the document, to the feature vector index library IDXV_Ca.
2. The Chinese near-duplicate web page deduplication method based on the small-world characteristic according to claim 1, characterized in that the similarity ξ between the text feature vector Va and the term vector of Vdi is computed as follows:
ξ = (d1 · d2) / (‖d1‖ × ‖d2‖) = Σ_{i=1}^{m} d1(i)·d2(i) / ( √(Σ_{i=1}^{m} d1(i)²) × √(Σ_{i=1}^{m} d2(i)²) )
where d1 denotes the web document under test, d2 denotes the document among the k matched documents that is currently being compared with d1, and d1(i), d2(i) are term frequencies.
CN200910083711A 2009-05-07 2009-05-07 Chinese similar web page de-emphasis method based on microcosmic characteristic Pending CN101620616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910083711A CN101620616A (en) 2009-05-07 2009-05-07 Chinese similar web page de-emphasis method based on microcosmic characteristic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910083711A CN101620616A (en) 2009-05-07 2009-05-07 Chinese similar web page de-emphasis method based on microcosmic characteristic

Publications (1)

Publication Number Publication Date
CN101620616A true CN101620616A (en) 2010-01-06

Family

ID=41513855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910083711A Pending CN101620616A (en) 2009-05-07 2009-05-07 Chinese similar web page de-emphasis method based on microcosmic characteristic

Country Status (1)

Country Link
CN (1) CN101620616A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314418A (en) * 2011-10-09 2012-01-11 北京航空航天大学 Method for comparing Chinese similarity based on context relation
CN102663093A (en) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 Method and device for detecting bad website
CN102722526A (en) * 2012-05-16 2012-10-10 成都信息工程学院 Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN103123685A (en) * 2011-11-18 2013-05-29 江南大学 Text mode recognition method
CN103246640A (en) * 2013-04-23 2013-08-14 北京十分科技有限公司 Duplicated text detection method and device
CN103761477A (en) * 2014-01-07 2014-04-30 北京奇虎科技有限公司 Method and equipment for acquiring virus program samples
CN103778163A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Rapid webpage de-weight algorithm based on fingerprints
CN104123272A (en) * 2014-05-21 2014-10-29 山东省科学院情报研究所 Document classification method based on variance
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
CN104636319A (en) * 2013-11-11 2015-05-20 腾讯科技(北京)有限公司 Text duplicate removal method and device
CN105550170A (en) * 2015-12-14 2016-05-04 北京锐安科技有限公司 Chinese word segmentation method and apparatus
WO2017063525A1 (en) * 2015-10-12 2017-04-20 广州神马移动信息科技有限公司 Query processing method, device and apparatus
CN108536753A (en) * 2018-03-13 2018-09-14 腾讯科技(深圳)有限公司 The determination method and relevant apparatus of duplicate message
WO2018184588A1 (en) * 2017-04-07 2018-10-11 腾讯科技(深圳)有限公司 Text deduplication method and device and storage medium
CN110716533A (en) * 2019-10-29 2020-01-21 山东师范大学 Key subsystem identification method and system influencing reliability of numerical control equipment
CN111859896A (en) * 2019-04-01 2020-10-30 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN112883704A (en) * 2021-04-29 2021-06-01 南京视察者智能科技有限公司 Big data similar text duplicate removal preprocessing method and device and terminal equipment

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314418A (en) * 2011-10-09 2012-01-11 北京航空航天大学 Method for comparing Chinese similarity based on context relation
CN103123685B (en) * 2011-11-18 2016-03-02 江南大学 Text mode recognition method
CN103123685A (en) * 2011-11-18 2013-05-29 江南大学 Text mode recognition method
CN102663093A (en) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 Method and device for detecting bad website
CN102663093B (en) * 2012-04-10 2014-07-09 中国科学院计算机网络信息中心 Method and device for detecting bad website
CN102722526A (en) * 2012-05-16 2012-10-10 成都信息工程学院 Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN102722526B (en) * 2012-05-16 2014-04-30 成都信息工程学院 Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN103778163A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Rapid webpage de-weight algorithm based on fingerprints
CN103246640A (en) * 2013-04-23 2013-08-14 北京十分科技有限公司 Duplicated text detection method and device
CN103246640B (en) * 2013-04-23 2016-08-03 北京酷云互动科技有限公司 A kind of method and device detecting repeated text
CN104636319B (en) * 2013-11-11 2018-09-28 腾讯科技(北京)有限公司 A kind of text De-weight method and device
CN104636319A (en) * 2013-11-11 2015-05-20 腾讯科技(北京)有限公司 Text duplicate removal method and device
CN103761477A (en) * 2014-01-07 2014-04-30 北京奇虎科技有限公司 Method and equipment for acquiring virus program samples
CN104123272A (en) * 2014-05-21 2014-10-29 山东省科学院情报研究所 Document classification method based on variance
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
CN104615714B (en) * 2015-02-05 2019-05-24 北京中搜云商网络技术有限公司 Blog article rearrangement based on text similarity and microblog channel feature
WO2017063525A1 (en) * 2015-10-12 2017-04-20 广州神马移动信息科技有限公司 Query processing method, device and apparatus
CN105550170A (en) * 2015-12-14 2016-05-04 北京锐安科技有限公司 Chinese word segmentation method and apparatus
CN105550170B (en) * 2015-12-14 2018-10-12 北京锐安科技有限公司 A kind of Chinese word cutting method and device
US11379422B2 (en) 2017-04-07 2022-07-05 Tencent Technology (Shenzhen) Company Limited Text deduplication method and apparatus, and storage medium
WO2018184588A1 (en) * 2017-04-07 2018-10-11 腾讯科技(深圳)有限公司 Text deduplication method and device and storage medium
CN108536753A (en) * 2018-03-13 2018-09-14 腾讯科技(深圳)有限公司 The determination method and relevant apparatus of duplicate message
CN111859896A (en) * 2019-04-01 2020-10-30 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN111859896B (en) * 2019-04-01 2022-11-25 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN110716533A (en) * 2019-10-29 2020-01-21 山东师范大学 Key subsystem identification method and system influencing reliability of numerical control equipment
CN112883704A (en) * 2021-04-29 2021-06-01 南京视察者智能科技有限公司 Big data similar text duplicate removal preprocessing method and device and terminal equipment
CN112883704B (en) * 2021-04-29 2021-07-16 南京视察者智能科技有限公司 Big data similar text duplicate removal preprocessing method and device and terminal equipment

Similar Documents

Publication Publication Date Title
CN101620616A (en) Chinese similar web page de-emphasis method based on microcosmic characteristic
Francis-Landau et al. Capturing semantic similarity for entity linking with convolutional neural networks
CN105488196B (en) A kind of hot topic automatic mining system based on interconnection corpus
Zaragoza et al. Ranking very many typed entities on wikipedia
Pereira et al. Using web information for author name disambiguation
Jiang et al. Mining ontological knowledge from domain-specific text documents
Yin et al. Facto: a fact lookup engine based on web tables
KR100847376B1 (en) Method and apparatus for searching information using automatic query creation
CN100435145C (en) Multiple file summarization method based on sentence relation graph
Srinivas et al. A weighted tag similarity measure based on a collaborative weight model
Verma et al. Exploring Keyphrase Extraction and IPC Classification Vectors for Prior Art Search.
Cheng et al. MISDA: web services discovery approach based on mining interface semantics
Mastropavlos et al. Automatic acquisition of bilingual language resources
Gey et al. Cross-language retrieval for the CLEF collections—comparing multiple methods of retrieval
Dumani et al. Fine and coarse granular argument classification before clustering
Bechikh Ali et al. Multi-word terms selection for information retrieval
Zhang et al. A preprocessing framework and approach for web applications
Schilit et al. Exploring a digital library through key ideas
Arefin et al. BAENPD: A Bilingual Plagiarism Detector.
Chu et al. Chuweb21D: A Deduped English Document Collection for Web Search Tasks
Chen et al. Chinese named entity abbreviation generation using first-order logic
Zheng et al. Research on domain term extraction based on conditional random fields
Lee et al. Bvideoqa: Online English/Chinese bilingual video question answering
Huang et al. Learning to find comparable entities on the web
Gu et al. Towards efficient similar sentences extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100106