CN102024065A - SIMD optimization-based webpage duplication elimination and concurrency method - Google Patents

Publication number
CN102024065A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110021002
Other languages
Chinese (zh)
Other versions
CN102024065B (en)
Inventor
龙军
张祖平
袁鑫攀
罗跃逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Application filed by Central South University
Priority to CN 201110021002
Publication of CN102024065A
Application granted
Publication of CN102024065B
Legal status: Expired - Fee Related

Abstract

The invention discloses a webpage deduplication and parallelization method based on SIMD (single instruction, multiple data) optimization, comprising the following steps: 1, extracting the text information of webpages, that is, their effective content; 2, extracting shingles, that is, extracting webpage features and generating a Shingles set; 3, clustering, to reduce the number of comparisons and the time and space complexity; and 4, comparing fingerprints to find similar webpages and remove them. The method maintains precision and recall while effectively increasing the speed of webpage similarity detection.

Description

Parallel webpage deduplication method based on SIMD optimization
Technical field
The invention belongs to the field of computer application technology and relates to a parallel webpage deduplication method based on SIMD optimization. SIMD (Single Instruction, Multiple Data) is a technique in which one controller drives multiple processing elements that simultaneously perform the same operation on each element of a group of data (also called a "data vector"), thereby achieving spatial parallelism. In a microprocessor, SIMD means that one controller drives multiple parallel processing units, as in Intel's MMX and SSE and AMD's 3DNow! technology.
Technical background
With the rapid development of computer science and network technology, the Internet has become an important channel through which people obtain information. The greatest difficulty search engines currently face is that the returned result sets contain a large amount of duplicated information. This duplication not only wastes users' time and adds to their burden, but also consumes large amounts of storage and bandwidth and reduces indexing efficiency. Classifying search results or removing duplicate webpages has therefore become an important step in improving retrieval efficiency.
Deduplication algorithms based on "approximate fingerprints" map the strings of a text to a set of hash values, turning string matching into numeric comparison; they are fast and suit large-scale computation. However, choosing the size and number of text blocks is difficult. The most complete text block is the full text treated as a single block, but such a comparison can only detect verbatim copies, that is, duplicates in which not a word differs. Similarity detection algorithms based on "Shingle" segment the text into words, extract shingle features, and compute similarity from the number of shingles two documents share. Such an algorithm must consider how parameters such as the similarity threshold, the shingle sliding-window size, the shingle weight coefficients and document attributes affect the precision and recall of deduplication, and must avoid setting the similarity threshold blindly.
Streaming SIMD Extensions SSE4.2 [3] is Intel's largest upgrade of its ISA extension instruction set since SSE2. The new SSE4.2 instructions address two areas: STTNI (String and Text New Instructions) for string and character processing, and ATA (Application Targeted Accelerators) for accelerating specific applications. The new instructions improve performance in applications ranging from multimedia to high-performance computing, and also use dedicated circuitry to accelerate particular applications. The invention uses inline-assembly SSE coding, optimized for the architecture of the "Intel Core i7" series processors, so that a whole 128-bit fingerprint is compared in a single operation. Analysis, experiments and practical use show that the algorithm effectively increases the speed of document similarity detection without sacrificing precision or recall.
Summary of the invention
The objective of the invention is to provide a parallel webpage deduplication method based on SIMD optimization that effectively increases the speed of webpage similarity detection while keeping precision and recall high.
The technical solution of the invention is as follows:
A parallel webpage deduplication method based on SIMD optimization, characterized by comprising the following steps:
Step 1: webpage text information extraction, which extracts the effective information of a webpage;
Step 2: Shingle extraction, which extracts webpage features and generates the Shingles set;
Step 3: clustering, which reduces the number of comparisons and the time and space complexity;
Step 4: fingerprint comparison, which finds similar webpages and rejects them.
The concrete steps of step 1 are:
Files in the HTML, XHTML and XML webpage formats are scanned; the tag information of the webpage is used to extract the text and its title, and information unrelated to the body text is filtered out.
The concrete steps of step 2 are:
First, forward maximum matching word segmentation is applied to the extracted webpage text to generate a set of words;
Then, a stop-word list is built and used to filter noise out of the webpage, and the Shingles set is generated with the chosen window size. Noise means the meaningless words present in the set of words.
The main flow of the forward maximum matching segmentation algorithm is as follows. Suppose the longest entry in the segmentation dictionary contains MAX Chinese characters. Take the first MAX characters of the text to be processed as the matching field and look it up in the dictionary. If the dictionary contains this MAX-character word, the match succeeds and the matching field is cut off as a word. If not, the match fails; the last character of the matching field is removed and the lookup is repeated until a match succeeds. This completes one round of matching, which yields one word. The same procedure is then repeated on the remaining text until all words have been segmented.
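The matching loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name forward_max_match, the toy dictionary and the rule that an unmatched single character becomes a word of its own are all hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation against a word list."""
    if max_len is None:
        # MAX in the description: length of the longest dictionary entry.
        max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then drop the last character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single unmatched character is kept as a word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, used only for illustration.
toy_dictionary = {"网页", "去重", "相似", "检测", "相似检测"}
print(forward_max_match("网页去重相似检测", toy_dictionary))
```

Because the longest candidate is tried first, "相似检测" is cut off as one word even though "相似" and "检测" are also dictionary entries.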
In terms of precision and recall, the smaller the window the better; in terms of display effect, the larger the window the better. A window size of 2 to 4 is generally appropriate.
The concrete steps of step 3 are:
First, for the generated Shingle set of size L, one Shingle is selected out of every L/n Shingles to form the sampling list Sample_Shingle_List;
Then, M different, independent random permutation hash functions are applied to Sample_Shingle_List. Each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the minimum fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage;
Finally, the fingerprints generated for the N webpages are clustered: webpages that have an identical fingerprint are merged into one class, which yields the clustered collections of webpages.
M is an integer between 7 and 10.
With M independent random permutation hash functions $\pi_1, \pi_2, \ldots, \pi_M$, the Sample_Shingle_List set $S_d$ of any document d is converted to Sample_Finger_List:
$$\text{Sample\_Finger\_List} = \left(\min\{\pi_1(S_d)\},\ \min\{\pi_2(S_d)\},\ \ldots,\ \min\{\pi_M(S_d)\}\right)$$
Example:
$\Omega = \{1,2,3,4,5\}$, $S_1 = \{1,2,3\}$, $S_2 = \{1,2,4\}$, where $\Omega$ denotes the universal set.
$\pi_1: \{1,2,3,4,5\} \to \{3,2,1,5,4\}$, where $\pi_1$ is one of the M independent random permutation hash functions.
$\pi_2: \{1,2,3,4,5\} \to \{2,3,5,4,1\}$
...
$\pi_M: \{1,2,3,4,5\} \to \{5,3,1,2,4\}$
$\pi_1(S_1) = \{3,2,1\}$; $\pi_2(S_1) = \{2,3,5\}$; $\pi_M(S_1) = \{5,3,1\}$;
$\pi_1(S_2) = \{3,2,5\}$; $\pi_2(S_2) = \{2,3,4\}$; $\pi_M(S_2) = \{5,3,2\}$;
Here $\pi_1: \{1,2,3,4,5\} \to \{3,2,1,5,4\}$ means
$1 \to 3$, $2 \to 2$, $3 \to 1$, $4 \to 5$, $5 \to 4$, so $\pi_1(S_1) = \pi_1(\{1,2,3\}) = \{3,2,1\}$.
The others follow by analogy.
$\min(\pi(S_1)) = \text{Sample\_Finger\_List}(S_1) = \{1,2,1\}$
$\min(\pi(S_1))$ denotes the collection formed by taking the minimum of each of the sets $\pi_1(S_1), \pi_2(S_1), \ldots, \pi_M(S_1)$.
$\min(\pi(S_2)) = \text{Sample\_Finger\_List}(S_2) = \{2,2,2\}$
Both lists contain the fingerprint 2, so the two documents are placed in the same class.
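The worked example can be checked mechanically. The sketch below writes the three permutations as lookup tables and recomputes both fingerprint lists. The names PERMUTATIONS, sample_finger_list and share_fingerprint are hypothetical, and comparing the lists position by position (that is, under the same permutation) is an assumption about how "identical fingerprint" is meant.

```python
# The three permutations of the example, written as lookup tables.
PERMUTATIONS = [
    {1: 3, 2: 2, 3: 1, 4: 5, 5: 4},  # pi_1
    {1: 2, 2: 3, 3: 5, 4: 4, 5: 1},  # pi_2
    {1: 5, 2: 3, 3: 1, 4: 2, 5: 4},  # pi_M
]


def sample_finger_list(shingle_ids):
    """Minimum image of the set under each permutation."""
    return [min(pi[x] for x in shingle_ids) for pi in PERMUTATIONS]


def share_fingerprint(fingers_a, fingers_b):
    """True if the lists agree under at least one permutation (assumption)."""
    return any(a == b for a, b in zip(fingers_a, fingers_b))


f1 = sample_finger_list({1, 2, 3})  # S1
f2 = sample_finger_list({1, 2, 4})  # S2
print(f1, f2, share_fingerprint(f1, f2))
```

The two lists agree in the second position, the minimum under the second permutation, which is the shared fingerprint 2 of the example.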
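The concrete steps of step 4 are:
Method 1:
According to the clustering result, the webpage IDs in each class are taken out. Suppose the class contains n webpages; the fingerprint set of all the webpages is
$$\text{Matrix}_{\text{finger}} = \begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{21} & \cdots & \mathrm{finger}_{n1} \\ \mathrm{finger}_{12} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{1M} & \mathrm{finger}_{2M} & \cdots & \mathrm{finger}_{nM} \end{bmatrix};$$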
Definition
Figure BDA0000044294720000042
Define the function $v(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$;
Consider the M fingerprints of each webpage. First create a 128-dimensional vector V with every component initialized to 0. Then examine each bit of each of these fingerprints: if the bit is 1, add 1 to the corresponding component of V; if the bit is 0, subtract 1 from it. After all the fingerprints have been accumulated, positive components of V are set to 1 and negative components to 0;
$$V_1 = \left[\, v\!\left(\mathrm{finger}_{1,1}(1) + \mathrm{finger}_{1,2}(1) + \cdots + \mathrm{finger}_{1,M}(1)\right),\ v\!\left(\mathrm{finger}_{1,1}(2) + \mathrm{finger}_{1,2}(2) + \cdots + \mathrm{finger}_{1,M}(2)\right),\ \ldots,\ v\!\left(\mathrm{finger}_{1,1}(128) + \mathrm{finger}_{1,2}(128) + \cdots + \mathrm{finger}_{1,M}(128)\right) \right]$$
$$V_2 = \left[\, v\!\left(\mathrm{finger}_{2,1}(1) + \mathrm{finger}_{2,2}(1) + \cdots + \mathrm{finger}_{2,M}(1)\right),\ v\!\left(\mathrm{finger}_{2,1}(2) + \mathrm{finger}_{2,2}(2) + \cdots + \mathrm{finger}_{2,M}(2)\right),\ \ldots,\ v\!\left(\mathrm{finger}_{2,1}(128) + \mathrm{finger}_{2,2}(128) + \cdots + \mathrm{finger}_{2,M}(128)\right) \right];$$
...
$$V_n = \left[\, v\!\left(\mathrm{finger}_{n,1}(1) + \mathrm{finger}_{n,2}(1) + \cdots + \mathrm{finger}_{n,M}(1)\right),\ v\!\left(\mathrm{finger}_{n,1}(2) + \mathrm{finger}_{n,2}(2) + \cdots + \mathrm{finger}_{n,M}(2)\right),\ \ldots,\ v\!\left(\mathrm{finger}_{n,1}(128) + \mathrm{finger}_{n,2}(128) + \cdots + \mathrm{finger}_{n,M}(128)\right) \right]$$
$$\text{Matrix}_{\text{finger}} = \begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{21} & \cdots & \mathrm{finger}_{n1} \\ \mathrm{finger}_{12} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{1M} & \mathrm{finger}_{2M} & \cdots & \mathrm{finger}_{nM} \end{bmatrix} = [V_1, V_2, \ldots, V_n];$$
The vector $V_i$ ($i = 1, 2, \ldots, n$) represents the fingerprint of the i-th of the n documents; $V_i$ is a 128-bit long integer that characterizes the fingerprint of that webpage;
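The per-bit accumulation can be illustrated with a short Python sketch in which each fingerprint is an integer. It assumes that every fingerprint contributes +1 for a 1 bit and -1 for a 0 bit, as the text describes, and that a zero sum maps to 1, as the definition of v(x) implies. The function name combine_fingerprints and the choice of numbering bits from the least significant end are assumptions made for illustration.

```python
FINGERPRINT_BITS = 128


def combine_fingerprints(fingerprints, bits=FINGERPRINT_BITS):
    """Collapse several integer fingerprints into one by per-bit voting."""
    votes = [0] * bits
    for fp in fingerprints:
        for k in range(bits):
            # +1 when bit k is set, -1 when it is clear.
            votes[k] += 1 if (fp >> k) & 1 else -1
    combined = 0
    for k, total in enumerate(votes):
        if total >= 0:  # v(x) = 1 for x >= 0, otherwise 0
            combined |= 1 << k
    return combined


# Small 4-bit illustration: only the top bit is set in a majority of inputs.
print(bin(combine_fingerprints([0b1100, 0b1010, 0b1000], bits=4)))
```

With an even number of fingerprints a tie is possible, and under v(x) a tied position yields a 1 bit.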
The concrete fingerprint comparison flow is as follows:
Step a: let x, y and z denote three 128-bit XMM registers, let a denote the 128-bit fingerprint of webpage A and b the 128-bit fingerprint of webpage B, and load a and b into registers x and y respectively;
Step b: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value Mask of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b;
Step c: use _mm_cvtsd_f64 to obtain the low 64 bits of Mask, Mask_low, and use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits of Mask, Mask_high;
Step d: use _popcnt64 to count the 1 bits in Mask_low and Mask_high respectively, and assign their sum to Count;
Step e: when Count > Dx, webpage A and webpage B are considered dissimilar; otherwise they are similar. Dx is the first distance threshold;
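Steps a to e can be modelled in Python, with arbitrary-precision integers standing in for the 128-bit XMM registers. This is only a scalar model of the data flow, not SIMD code: the XOR plays the role of _mm_xor_si128, the two masked halves correspond to Mask_low and Mask_high, and bin(...).count("1") replaces the population-count instruction. The names hamming128 and is_similar are hypothetical.

```python
MASK64 = (1 << 64) - 1


def hamming128(a, b):
    """Hamming distance of two 128-bit fingerprints held as integers."""
    mask = a ^ b                        # step b: XOR of the two fingerprints
    mask_low = mask & MASK64            # step c: low 64 bits
    mask_high = (mask >> 64) & MASK64   # step c: high 64 bits
    # step d: count the 1 bits of each half and add the counts
    return bin(mask_low).count("1") + bin(mask_high).count("1")


def is_similar(a, b, dx):
    """Step e: dissimilar when the count exceeds the threshold dx."""
    return hamming128(a, b) <= dx


page_a = (1 << 127) | 0b1011
page_b = 0b0001
print(hamming128(page_a, page_b), is_similar(page_a, page_b, dx=6))
```

The two fingerprints above differ in three bit positions, one in the high half and two in the low half, so they count as similar for dx = 6 and as dissimilar for dx = 2.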
Method 2:
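Consider the fingerprint set:
$$\text{Matrix}_{\text{finger}} = \begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{21} & \cdots & \mathrm{finger}_{n1} \\ \mathrm{finger}_{12} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{1M} & \mathrm{finger}_{2M} & \cdots & \mathrm{finger}_{nM} \end{bmatrix},$$
For two webpages, webpage i and webpage j, the fingerprint sets are: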
Figure BDA0000044294720000054
Fingerprints are compared row by row and the distances between them are summed. When the sum of distances is greater than the second distance threshold Dy, webpage i and webpage j are considered dissimilar; otherwise they are similar.
Research and experimental results generally indicate that if the first fingerprint distance dx is greater than 6, the pages are not duplicates.
Dy is taken as 0.8M.
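Method 2 can be sketched in the same scalar style: the M fingerprints of two webpages are compared pairwise, the Hamming distances are summed, and the sum is checked against Dy = 0.8M. The function names and the short 4-bit sample fingerprints below are hypothetical and serve only to keep the arithmetic easy to follow.

```python
def total_distance(fingers_i, fingers_j):
    """Sum of Hamming distances between corresponding fingerprints."""
    return sum(bin(a ^ b).count("1") for a, b in zip(fingers_i, fingers_j))


def similar_by_sum(fingers_i, fingers_j, dy):
    """Dissimilar when the summed distance exceeds the threshold dy."""
    return total_distance(fingers_i, fingers_j) <= dy


# Hypothetical 4-bit fingerprints with M = 3, so Dy = 0.8 * 3 = 2.4.
page_i = [0b1111, 0b0000, 0b1010]
page_j = [0b1110, 0b0001, 0b1010]
dy = 0.8 * len(page_i)
print(total_distance(page_i, page_j), similar_by_sum(page_i, page_j, dy))
```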
Beneficial effects:
The invention addresses the shortcomings of existing webpage deduplication algorithms, such as low speed, unremarkable search results and low recall, by proposing a parallel deduplication algorithm based on SIMD optimization. Applied in the development of a webpage deduplication system, the algorithm effectively increases the speed of webpage similarity detection without sacrificing precision or recall.
To this end, the execution time of each stage of the deduplication system was analysed. The main subprocesses of the deduplication algorithm, namely feature extraction, fingerprint generation and fingerprint comparison, have high inherent SIMD parallelism. Statistical analysis with Intel(R) VTune(TM) Performance Analyzer [4] showed that the fingerprint comparison subprocess accounts for 58.6% of the total running time of the system, so optimizing fingerprint comparison yields the largest improvement in running speed.
Based on this analysis, shingle comparison was optimized with a SIMD-based webpage similarity algorithm. By organizing the data in SSE data types and parallelizing the main subprocesses with SIMD, a deduplication algorithm optimized for SSE was designed. It overcomes the low speed, low recall and unremarkable search results of general deduplication algorithms, and it runs fast, gives clear results and is easy to implement.
SSE uses 128-bit storage units, each of which can hold four 32-bit floating-point numbers; an SSE operation therefore processes four floating-point numbers at once, and this batch processing improves efficiency. In addition, the SSE4.2 instruction set adds two groups of optimization instructions, STTNI (String and Text New Instructions) and ATA (Application Targeted Accelerators). STTNI mainly optimizes XML document and data processing, bringing performance in this area to 3.8 times that of the previous product generation. The main subprocesses of the deduplication algorithm (feature extraction, fingerprint generation and fingerprint comparison) chiefly involve XML document and data processing and string processing. In theory, therefore, using SSE can markedly increase the speed of webpage deduplication.
Description of drawings
Fig. 1 is a main flow chart of the present invention;
Fig. 2 is the flow chart of webpage information extraction;
Fig. 3 is the flow chart of Shingle extraction;
Fig. 4 is the flow chart of clustering;
Fig. 5 is the flow chart of fingerprint comparison method one;
Fig. 6 is the flow chart of fingerprint comparison method two.
Embodiment
The invention is described in further detail below with reference to the drawings and specific embodiments:
Embodiment 1:
The main steps of the SIMD-based webpage deduplication algorithm in this example are as follows:
1. Webpage text information extraction. This process extracts the effective information of the webpage;
2. Shingle extraction. This process extracts webpage features;
3. Clustering. This process reduces the number of comparisons and the time and space complexity;
4. Fingerprint comparison. This process finds similar webpages and rejects them.
Each step is detailed as follows:
1. Webpage text information extraction
The text structure of a webpage comprises a physical structure and a logical structure. The physical structure is the composition of the webpage, mainly its tags, page title, page content, article title, advertisements and similar information. The logical structure is mainly the structure between paragraphs in the page content, the organizational style of the text and the logical manner of expression.
This process handles common webpage formats such as HTML, XHTML and XML. It removes noise such as advertisement bars, navigation bars, site logos and pictures, and extracts key information such as the title, body text and author. It does so by scanning the HTML, XHTML or XML, using the tag information of the webpage to extract the text and its title while filtering out information unrelated to the body text.
At present, HTML, XHTML and XML are mainly parsed, both in China and abroad, by building a DOM syntax tree over the webpage text data. SSE4.2 adds four new instructions dedicated to strings and characters, which can accelerate many string- and text-related applications, for example string pattern matching and string comparison. This step uses these new SSE4.2 string instructions to accelerate existing mainstream webpage information extraction algorithms.
The webpage information extraction flow is as follows:
101 Define the common HTML, XHTML and XML tags, such as the common text tags <body>, </body>, <title>, </title> and <div>, and use the _mm_set_epi8 instruction to preload the binary codes of these tags into 128-bit SSE variables. The main function of _mm_set_epi8 is to load 16 bytes, that is, 128 bits of binary code, into a 128-bit SSE variable.
102 Use SSE4.2 functions to match the HTML, XHTML and XML tags. This process exploits the parallel computing capability of SSE4.2 processors to speed up information extraction. Instructions such as _mm_cmpestri, _mm_cvtsi128_si32 and _mm_cmpestrm are used to compare the webpage content with the preloaded tags (128-bit SSE variables). _mm_cmpestri is specific to SSE4.2; its main function is to compare 128-bit variables, and if the two strings are identical it returns 0, which indicates that a tag has been matched. The position of the tag is recorded, and the process is repeated until all HTML tags have been found. The _mm_cvtsi128_si32 instruction converts the low 32 bits of a 128-bit variable into a 32-bit int.
103 Information conversion and filtering. This process converts the extracted title and text information: irrelevant information, irrelevant markup and hyperlink information in the text are filtered out, necessary conversions are carried out, and necessary structural information such as placeholders, text structure and paragraph information is retained. Filtering works by reading the webpage content continuously and matching it with regular expressions; a learning method is used at the same time to refine the regular expressions so that irrelevant information is filtered out.
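The effect of steps 101 to 103, keeping the title and body text while dropping markup and noise, can be illustrated with Python's standard html.parser module. This sketch deliberately swaps in a different technique: it does not use the SSE4.2 string instructions or regular expressions described above. The class name TextExtractor, the helper extract and the decision to treat script and style content as noise are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects title text and body text, skipping script and style."""

    SKIPPED = {"script", "style"}  # assumption: treated as noise

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIPPED & set(self._stack):
            return
        if "title" in self._stack:
            self.title_parts.append(text)
        elif "body" in self._stack:
            self.body_parts.append(text)


def extract(html):
    """Return (title, body text) of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)


sample = ("<html><head><title>Demo page</title><style>p{}</style></head>"
          "<body><div>Hello</div><script>var x=1;</script><p>world</p></body></html>")
print(extract(sample))
```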
2. Shingle extraction
Webpage features are of great importance for deduplication; they are a key factor affecting the precision and recall of the whole system. In English webpages, words are separated by spaces, so the features can be obtained simply by splitting the text. Chinese webpages have no natural separator between words, so deciding which type of text feature (character, word, phrase or sentence) to extract is particularly important.
The webpage text is regarded as a sequence of words (tokens), and a Shingle is a contiguous subsequence of this word sequence. A webpage is characterized by its sequence of shingles. For a given document D and window size ω, the set S(D, ω) is the set of all contiguous word strings of length ω in D.
For example, take the sentence: { Shingle is widely applied in large-scale similar text detection technology }.
After forward maximum matching segmentation, the word sequence is: { "Shingle", "widely", "applied", "large-scale", "similar", "text", "detection", "technology" }
With window ω = 2, the set S(D, 2) is: { "Shingle widely", "widely applied", "applied large-scale", "large-scale similar", "similar text", "text detection", "detection technology" }
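Sliding a window of ω words over the token sequence takes one line of Python. The sketch below uses an English rendering of the eight tokens of the example; the function name shingles and the space-joined string representation of a shingle are assumptions.

```python
def shingles(tokens, window):
    """All contiguous runs of `window` tokens, in document order."""
    return [" ".join(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)]


tokens = ["Shingle", "widely", "applied", "large-scale",
          "similar", "text", "detection", "technology"]
for shingle in shingles(tokens, 2):
    print(shingle)
```

Eight tokens with a window of 2 give 8 - 2 + 1 = 7 shingles; calling set() on the result gives S(D, 2).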
When the window ω is 4 (the value usually used in practical systems), let $S_1(D, \omega)$ and $S_2(D, \omega)$ be the shingle sets of two documents, and let $a = |S_1 \cap S_2|$ be the number of shingles they have in common. The resemblance R of $S_1$ and $S_2$ is then:
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \qquad f_1 = |S_1|,\ f_2 = |S_2|,\ a = |S_1 \cap S_2|$$
R is the ratio of the number of shingles the two documents share to the total number of distinct shingles they contain; the closer R is to 1, the more similar the two documents are.
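The resemblance formula translates directly into set operations. In the sketch below the function name resemblance and the sample shingle sets are hypothetical, and returning 0.0 for two empty sets is an assumption, since the formula is undefined in that case.

```python
def resemblance(s1, s2):
    """R = a / (f1 + f2 - a) for two shingle sets."""
    a = len(s1 & s2)            # shingles common to both documents
    f1, f2 = len(s1), len(s2)
    union_size = f1 + f2 - a    # equals |S1 union S2|
    return a / union_size if union_size else 0.0  # assumption for empty sets


doc1 = {"a b", "b c", "c d"}
doc2 = {"b c", "c d", "d e"}
print(resemblance(doc1, doc2))
```

Here a = 2 and f1 = f2 = 3, so R = 2 / (3 + 3 - 2) = 0.5.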
201 Apply forward maximum matching segmentation to the webpage text extracted in section 1 to generate the set of words (tokens). The main flow of the forward maximum matching segmentation algorithm is as follows. Suppose the longest entry in the segmentation dictionary contains MAX Chinese characters. Take the first MAX characters of the text to be processed as the matching field and look it up in the dictionary. If the dictionary contains this MAX-character word, the match succeeds and the matching field is cut off as a word. If not, the match fails; the last character of the matching field is removed and the lookup is repeated until a match succeeds. This completes one round of matching, which yields one word. The same procedure is then repeated on the remaining text until all words have been segmented.
202 Count the frequencies of the words in the system and filter out high-frequency function words of little meaning as stop words to build the initial stop-word list. New words that occur repeatedly are identified from the statistics and added to the dictionary database. (The dictionary database is the data source of the forward maximum matching algorithm; only by continually refining it can segmentation become more accurate.)
203 Use the stop-word list built in step 202 to filter meaningless noise out of the webpage, then generate the Shingles set with the chosen window size. (In terms of precision and recall, the smaller the window the better; in terms of display effect, the larger the better. A window size of 2 to 4 is generally appropriate.)
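Steps 202 and 203 can be sketched as a frequency count followed by filtering and windowing. The sketch simplifies one point and says so: it takes the top_k most frequent tokens as stop words, whereas the text selects high-frequency function words of little meaning, which frequency alone cannot identify. The names build_stop_words, filtered_shingles and top_k are hypothetical.

```python
from collections import Counter


def build_stop_words(token_lists, top_k):
    """Assumption: the top_k most frequent tokens stand in for stop words."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(top_k)}


def filtered_shingles(tokens, stop_words, window):
    """Drop stop words, then slide a window over what remains."""
    kept = [tok for tok in tokens if tok not in stop_words]
    return {tuple(kept[i:i + window])
            for i in range(len(kept) - window + 1)}


corpus = [["the", "cat", "sat"],
          ["the", "dog", "sat", "the"],
          ["the", "bird"]]
stop_words = build_stop_words(corpus, top_k=1)
print(stop_words, filtered_shingles(["the", "cat", "the", "sat"], stop_words, 2))
```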
3. Clustering
Clustering is the process of grouping a set of concrete or abstract objects into classes of similar objects. Its guiding principle is that similar or close objects fall into the same class and very different objects fall into different classes. Each resulting cluster is a set of data objects that are similar to one another within the cluster and dissimilar to the objects in other clusters.
In the Shingle computation, consider a collection of N webpages in which each webpage contains n words and the Shingle length is m. Each webpage then yields fewer than n-m+1 Shingles (some stop words having been removed), so the time complexity of computing the Shingles of all webpages is O(N*n). The space complexity of the features of one webpage is O(n*m), so the space complexity for all webpages is O(N*n*m); the computation also produces many intermediate results, which likewise consume a large amount of storage. For example, for a collection of 100,000 webpages of 100 words each and a window size of 2, the time complexity of the Shingle algorithm is O(10^7) and the space complexity is O(2*10^7). Clearly, deduplicating webpages directly with the Shingle algorithm costs a great deal of both time and space.
Therefore, the webpage texts are first divided into collections of possibly similar webpages. Webpages that cannot be similar are excluded from comparison altogether, and subsequent operations are carried out only when two webpages may be similar, which greatly reduces the number of comparisons each document needs.
According to Andrei Broder's min-wise hashing principle, for a hash function π:
$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R$$
The practical meaning of this formula is as follows. When a hash function maps the Shingle sets $S_1$ and $S_2$ to fingerprint sets, the probability that the minimum fingerprints of the two sets are equal is the resemblance of $S_1$ and $S_2$. In other words, for webpages A and B with Shingle sets $S_1$ and $S_2$, applying the hash function π to both sets gives minimum fingerprints that coincide with probability equal to the similarity of the two webpages.
However, because of the randomness of hashing, a single fingerprint tends to miss many documents that may be similar. Several fingerprints should therefore be used and the possibly similar documents found by each fingerprint merged, which gives better precision and recall.
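The min-wise property can be checked empirically by drawing many random permutations and counting how often the two minima coincide. The sketch below uses Python's random module with a fixed seed; the function name estimate_collision_rate, the sample sets and the number of trials are hypothetical choices for illustration.

```python
import random


def estimate_collision_rate(s1, s2, universe, trials, seed=0):
    """Fraction of random permutations under which the two minima agree."""
    rng = random.Random(seed)
    items = list(universe)
    hits = 0
    for _ in range(trials):
        ranks = items[:]
        rng.shuffle(ranks)
        pi = dict(zip(items, ranks))  # one random permutation of the universe
        if min(pi[x] for x in s1) == min(pi[x] for x in s2):
            hits += 1
    return hits / trials


set_a = set(range(0, 10))
set_b = set(range(5, 15))
true_r = len(set_a & set_b) / len(set_a | set_b)  # 5 / 15
estimate = estimate_collision_rate(set_a, set_b, range(20), trials=5000)
print(true_r, estimate)
```

The estimate should settle close to the resemblance R of the two sets as the number of trials grows, which is why several independent permutations give a more reliable similarity signal than a single one.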
Let $R_0$ be the similarity threshold, and define precision and recall as:
Figure BDA0000044294720000102
Figure BDA0000044294720000103
Based on the above, the invention designs a clustering method that generates fingerprints for the webpages and clusters them. The concrete steps are as follows:
301 For the Shingle set generated in step 2, of size L, select one Shingle out of every L/n Shingles to form the sample Sample_Shingle_List. On a machine with good performance n = 1 can be used; on a poorly configured machine n can be increased appropriately. In short, the smaller n is, the better the deduplication result.
302 Apply M (7 to 10) different, independent random permutation hash functions to Sample_Shingle_List. Each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the minimum fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage. With the M independent random permutation hash functions, the Sample_Shingle_List set $S_d$ of any document d is thus converted to Sample_Finger_List:
$$\text{Sample\_Finger\_List} = \left(\min\{\pi_1(S_d)\},\ \min\{\pi_2(S_d)\},\ \ldots,\ \min\{\pi_M(S_d)\}\right)$$
Example:
$\Omega = \{1,2,3,4,5\}$, $S_1 = \{1,2,3\}$, $S_2 = \{1,2,4\}$
$\pi_1: \{1,2,3,4,5\} \to \{3,2,1,5,4\}$
$\pi_2: \{1,2,3,4,5\} \to \{2,3,5,4,1\}$
...
$\pi_M: \{1,2,3,4,5\} \to \{5,3,1,2,4\}$
$\pi_1(S_1) = \{3,2,1\}$; $\pi_2(S_1) = \{2,3,5\}$; $\pi_M(S_1) = \{5,3,1\}$;
$\pi_1(S_2) = \{3,2,5\}$; $\pi_2(S_2) = \{2,3,4\}$; $\pi_M(S_2) = \{5,3,2\}$;
Set of $\min(\pi(S_1)) = \text{Sample\_Finger\_List}(S_1) = \{1,2,1\}$
Set of $\min(\pi(S_2)) = \text{Sample\_Finger\_List}(S_2) = \{2,2,2\}$
Both lists contain the fingerprint element 2, so the two documents are placed in the same class.
For n webpages, we have:
Figure BDA0000044294720000112
303 Cluster the fingerprints generated for the webpages: webpages that have an identical fingerprint are merged into one class. The concrete method is as follows:
According to the clustering result, the webpage IDs in each class are taken out. Suppose the class contains n webpages; the fingerprint set of all the webpages is
Figure BDA0000044294720000113
Figure BDA0000044294720000121
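The clustering rule of step 303, where pages that share an identical fingerprint fall into one class, can be sketched with a small union-find structure. Two points are assumptions rather than statements of the source: fingerprints are matched per hash function (same position in the list), and classes are merged transitively, so that A and C end up together if A shares a fingerprint with B and B shares one with C. The function name and the page IDs are hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_fingerprint(finger_lists):
    """finger_lists maps page ID to [fingerprint under pi_1, ..., pi_M]."""
    parent = {page: page for page in finger_lists}

    def find(page):
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    first_seen = {}  # (hash index, fingerprint value) -> first page seen
    for page, fingers in finger_lists.items():
        for index, value in enumerate(fingers):
            key = (index, value)
            if key in first_seen:
                # Same fingerprint under the same hash function: merge.
                parent[find(page)] = find(first_seen[key])
            else:
                first_seen[key] = page

    clusters = defaultdict(list)
    for page in finger_lists:
        clusters[find(page)].append(page)
    return sorted(sorted(members) for members in clusters.values())


pages = {"A": [1, 2, 1], "B": [2, 2, 2], "C": [9, 8, 7]}
print(cluster_by_shared_fingerprint(pages))
```

Pages A and B reuse the fingerprint lists of the worked example and land in one class, while C shares nothing and stays alone. Only pages inside the same class need to go through fingerprint comparison.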
4. Fingerprint comparison
Each collection of webpages produced by the clustering in section 3 is processed with one of the following two methods:
Method 1:
According to the clustering result, the webpage IDs in each class are taken out. Suppose the class contains k webpages; the fingerprint set of the webpage IDs is
$$\begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{12} & \cdots & \mathrm{finger}_{1M} \\ \mathrm{finger}_{21} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{k1} & \mathrm{finger}_{k2} & \cdots & \mathrm{finger}_{kM} \end{bmatrix};$$
The fingerprint matrix is mapped to vectors in an m-dimensional vector space. Consider the fingerprints in each row: first create a 128-dimensional vector V with every component initialized to 0; then examine each bit of each fingerprint in the row, adding 1 to the corresponding component of V if the bit is 1 and subtracting 1 if the bit is 0. After accumulation, positive components of V are set to 1 and negative components to 0. Converting V into a 128-bit long integer gives the fingerprint of that row. The fingerprints, as a vector in the m-dimensional space, are $[V_1\ V_2\ \ldots\ V_n]^T$.
The fingerprint comparison idiographic flow is as follows:
410 Load the fingerprints: let x, y, z denote three 128-bit xmm registers, let a denote the 128-bit fingerprint of webpage A, and let b denote the 128-bit fingerprint of webpage B. Load a and b into registers x and y.
410 Compute the fingerprint distance: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value (Mask) of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b.
410 Obtain the high 64 bits and the low 64 bits of the fingerprint: use _mm_cvtsd_f64 to obtain the low 64 bits, Mask_low, and use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits, Mask_high.
410 Count the differing fingerprint bits: use __popcnt64 to count the 1 bits in Mask_low and in Mask_high respectively, add the two counts, and assign the sum to Count.
410 Similar-webpage judgment: when Count > Dx (the distance threshold), the webpages are considered dissimilar; otherwise they are similar.
With the SIMD fingerprint comparison, comparing two webpages requires only the four instructions _mm_xor_si128, _mm_unpacklo_epi64, _mm_cvtsd_f64 and __popcnt64; the whole comparison of two webpages is completed with only 6 instruction calls in total. With conventional processing, a 128-bit fingerprint requires 128 comparisons, far more than 6 instructions. Therefore, for massive data, the SIMD technique reduces the amount of CPU work and significantly improves system performance.
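The arithmetic of Method 1 can be checked with a short Python model. It is not SIMD code and does not call the intrinsics named above: Python integers stand in for the 128-bit registers, and the function names and the example threshold values are hypothetical.

```python
MASK64 = (1 << 64) - 1  # selects the low 64 bits of an integer


def hamming_distance_128(a, b):
    """Hamming distance between two 128-bit fingerprints."""
    mask = a ^ b                        # XOR value (Mask)
    mask_low = mask & MASK64            # low 64 bits (Mask_low)
    mask_high = (mask >> 64) & MASK64   # high 64 bits (Mask_high)
    # Count the 1 bits of each half and add them, as in the popcount step.
    return bin(mask_low).count("1") + bin(mask_high).count("1")


def is_similar(a, b, dx):
    """Webpages are dissimilar when Count > Dx, otherwise similar."""
    return hamming_distance_128(a, b) <= dx
```

Two fingerprints that differ only in bit 2 and bit 127 have one differing bit in each 64-bit half, so the distance is 2; they count as similar for Dx = 2 and dissimilar for Dx = 1.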
Method 2:
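Consider the fingerprint set:
$$\begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{12} & \cdots & \mathrm{finger}_{1M} \\ \mathrm{finger}_{21} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{k1} & \mathrm{finger}_{k2} & \cdots & \mathrm{finger}_{kM} \end{bmatrix},$$
The following second fingerprint comparison method can also be used:
For two webpages i and j, their fingerprint sets are: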
Figure BDA0000044294720000132
Compare the fingerprints column by column, then compute the sum of the distances between the fingerprints.
410 Initialization: define K=0, Cnt=0;
410 Load the fingerprints: each webpage has M fingerprints. Let x, y, z denote three 128-bit xmm registers, and load the K-th column fingerprints of webpages i and j into registers x and y respectively.
410 Compute the distance of each fingerprint pair: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value (Mask) of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b.
410 Compute the total distance: K++ (that is, K is incremented by 1), and Cnt is increased by the fingerprint distance of the current column; if K>M, go to 409; otherwise go to 407.
410 Similar-webpage judgment: when Cnt > Dy (the distance threshold), the webpages are considered dissimilar; otherwise they are similar.
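The loop of Method 2 can be sketched in Python as a plain accumulation over the M fingerprint pairs. As before, integers stand in for the 128-bit registers; the function names and the example threshold are hypothetical, and the sketch assumes both webpages carry the same number of fingerprints.

```python
def total_fingerprint_distance(fps_i, fps_j):
    """Sum of the Hamming distances between corresponding fingerprints
    of webpage i and webpage j."""
    if len(fps_i) != len(fps_j):
        raise ValueError("both webpages must have the same number of fingerprints")
    cnt = 0
    for a, b in zip(fps_i, fps_j):
        mask = a ^ b                  # XOR value of the current pair
        cnt += bin(mask).count("1")   # add this pair's Hamming distance
    return cnt


def is_similar_method2(fps_i, fps_j, dy):
    """Webpages are dissimilar when the total distance exceeds Dy."""
    return total_fingerprint_distance(fps_i, fps_j) <= dy
```

Unlike Method 1, no per-webpage collapse is needed here: the M fingerprints are compared pairwise and only the accumulated distance is tested against Dy.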

Claims (5)

1. An SIMD optimization-based webpage duplication elimination and concurrency method, characterized in that it comprises the following steps:
Step 1: webpage text information extraction step: this process is used to extract the effective information of webpages;
Step 2: Shingle extraction step: this process is used to extract webpage features and generate the Shingles set;
Step 3: clustering step: this process is used to reduce the number of comparisons and to reduce time and space complexity;
Step 4: fingerprint comparison step: this process is used to find similar webpages and remove them.
2. The SIMD optimization-based webpage duplication elimination and concurrency method according to claim 1, characterized in that the specific steps of step 1 are:
Scan files in the HTML, XHTML and XML webpage formats, use the tag information of the webpage to extract the title of the text, and at the same time filter out information unrelated to the text.
3. The SIMD optimization-based webpage duplication elimination and concurrency method according to claim 1, characterized in that the specific steps of step 2 are:
First, apply forward maximum matching word segmentation to the extracted webpage text information to generate a set of words;
Then, build a stop-word list and use it to filter out the noise in the webpage, and generate the Shingles set according to the configured window size; the noise is the meaningless words present in the set of words.
4. The SIMD optimization-based webpage duplication elimination and concurrency method according to claim 1, characterized in that the specific steps of step 3 are:
First, for the generated Shingle set, let the size of the Shingle set be L; select 1 Shingle out of every L/n Shingles in the Shingle set to form its sampling list Sample_Shingle_List;
Then, apply M different, independent random permutation Hash functions to Sample_Shingle_List; each Hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the minimum fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage;
Finally, cluster the fingerprints generated for the N webpages; during clustering, webpages that have identical fingerprints are merged into one class, finally yielding the clustered webpage sets.
5. The SIMD optimization-based webpage duplication elimination and concurrency method according to any one of claims 1-4, characterized in that the specific steps of step 4 are:
Adopt either of the following two methods:
Method 1:
According to the clustering result, take out the webpage IDs in each class. Suppose the class has n webpages; the fingerprint set of all these webpages is
$$\mathrm{Matrix}_{\mathrm{finger}} = \begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{21} & \cdots & \mathrm{finger}_{n1} \\ \mathrm{finger}_{12} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{1M} & \mathrm{finger}_{2M} & \cdots & \mathrm{finger}_{nM} \end{bmatrix};$$
Define
Figure FDA0000044294710000022
Define the function $$v(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases};$$
Consider the fingerprints of each column: first generate a 128-dimensional vector V with every component initialized to 0. Then examine each bit of each fingerprint in this column: if the bit is 1, add 1 to the corresponding component of V; if the bit is 0, subtract 1 from the corresponding component of V. After accumulation, every positive component of V is set to 1 and every negative component is set to 0;
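$$V_1 = \left[\, v\!\left(\mathrm{finger}_{1,1}(1) + \mathrm{finger}_{1,2}(1) + \cdots + \mathrm{finger}_{1,M}(1)\right),\ v\!\left(\mathrm{finger}_{1,1}(2) + \mathrm{finger}_{1,2}(2) + \cdots + \mathrm{finger}_{1,M}(2)\right),\ \ldots,\ v\!\left(\mathrm{finger}_{1,1}(128) + \mathrm{finger}_{1,2}(128) + \cdots + \mathrm{finger}_{1,M}(128)\right) \right]$$
$$V_2 = \left[\, v\!\left(\mathrm{finger}_{2,1}(1) + \mathrm{finger}_{2,2}(1) + \cdots + \mathrm{finger}_{2,M}(1)\right),\ v\!\left(\mathrm{finger}_{2,1}(2) + \mathrm{finger}_{2,2}(2) + \cdots + \mathrm{finger}_{2,M}(2)\right),\ \ldots,\ v\!\left(\mathrm{finger}_{2,1}(128) + \mathrm{finger}_{2,2}(128) + \cdots + \mathrm{finger}_{2,M}(128)\right) \right];$$
...
$$V_n = \left[\, v\!\left(\mathrm{finger}_{n,1}(1) + \mathrm{finger}_{n,2}(1) + \cdots + \mathrm{finger}_{n,M}(1)\right),\ v\!\left(\mathrm{finger}_{n,1}(2) + \mathrm{finger}_{n,2}(2) + \cdots + \mathrm{finger}_{n,M}(2)\right),\ \ldots,\ v\!\left(\mathrm{finger}_{n,1}(128) + \mathrm{finger}_{n,2}(128) + \cdots + \mathrm{finger}_{n,M}(128)\right) \right]$$
$$\mathrm{Matrix}_{\mathrm{finger}} = \begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{21} & \cdots & \mathrm{finger}_{n1} \\ \mathrm{finger}_{12} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{1M} & \mathrm{finger}_{2M} & \cdots & \mathrm{finger}_{nM} \end{bmatrix} = [V_1, V_2, \ldots, V_n];$$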
The vectors $V_i$, where i = 1, 2, ..., n, are used to represent the fingerprints of the n documents; $V_i$ is a 128-bit long number that characterizes the fingerprint of the webpage;
The specific fingerprint comparison flow is as follows:
Step a: let x, y, z denote three 128-bit xmm registers (general-purpose registers) respectively, let a denote the 128-bit fingerprint of webpage A, let b denote the 128-bit fingerprint of webpage B, and load a and b into registers x and y respectively;
Step b: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value Mask of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b;
Step c: use _mm_cvtsd_f64 to obtain the low 64 bits of Mask, Mask_low, and use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits of Mask, Mask_high;
Step d: use __popcnt64 to count the 1 bits in Mask_low and in Mask_high respectively, add the two counts, and assign the sum to Count;
Step e: when Count > Dx, webpage A and webpage B are considered dissimilar; otherwise they are similar; Dx is the first distance threshold;
Method 2:
Consider the fingerprint set:
$$\mathrm{Matrix}_{\mathrm{finger}} = \begin{bmatrix} \mathrm{finger}_{11} & \mathrm{finger}_{21} & \cdots & \mathrm{finger}_{n1} \\ \mathrm{finger}_{12} & \mathrm{finger}_{22} & \cdots & \mathrm{finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{finger}_{1M} & \mathrm{finger}_{2M} & \cdots & \mathrm{finger}_{nM} \end{bmatrix},$$
For two webpages, webpage i and webpage j, their fingerprint sets are:
Figure FDA0000044294710000034
Compare the fingerprints row by row, then compute the sum of the distances between the fingerprints; when the distance sum is greater than the second distance threshold Dy, webpage i and webpage j are considered dissimilar; otherwise they are similar.
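The Shingle generation of claim 3 and the sampling and minimum-fingerprint selection of claim 4 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the input is taken to be an already segmented word list (forward maximum matching segmentation is not shown), salted BLAKE2b digests from the standard `hashlib` module stand in for the M independent random permutation Hash functions, and the function names and parameters are hypothetical.

```python
import hashlib


def make_shingles(words, window, stop_words=frozenset()):
    """Filter stop words, then slide a window of `window` words over the rest."""
    kept = [w for w in words if w not in stop_words]
    return [" ".join(kept[i:i + window]) for i in range(len(kept) - window + 1)]


def sample_shingles(shingles, n):
    """Keep 1 shingle out of every L/n shingles, where L = len(shingles)."""
    step = max(1, len(shingles) // n)
    return shingles[::step]


def min_fingerprints(sampled, m):
    """For each of m salted hash functions, hash every sampled shingle to a
    128-bit integer and keep the minimum as one fingerprint of the webpage."""
    fingerprints = []
    for k in range(m):
        salt = k.to_bytes(8, "little")  # a different salt per hash function
        values = [
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=16, salt=salt).digest(),
                "big",
            )
            for s in sampled
        ]
        fingerprints.append(min(values))
    return fingerprints
```

Two webpages whose sampled shingles coincide produce the same M fingerprints under this scheme, which is what allows the clustering step to group them before any pairwise comparison.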
CN 201110021002 2011-01-18 2011-01-18 SIMD optimization-based webpage duplication elimination and concurrency method Expired - Fee Related CN102024065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110021002 CN102024065B (en) 2011-01-18 2011-01-18 SIMD optimization-based webpage duplication elimination and concurrency method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110021002 CN102024065B (en) 2011-01-18 2011-01-18 SIMD optimization-based webpage duplication elimination and concurrency method

Publications (2)

Publication Number Publication Date
CN102024065A true CN102024065A (en) 2011-04-20
CN102024065B CN102024065B (en) 2013-01-02

Family

ID=43865362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110021002 Expired - Fee Related CN102024065B (en) 2011-01-18 2011-01-18 SIMD optimization-based webpage duplication elimination and concurrency method

Country Status (1)

Country Link
CN (1) CN102024065B (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831198A (en) * 2012-08-07 2012-12-19 人民搜索网络股份公司 Similar document identifying device and similar document identifying method based on document signature technology
CN103036697A (en) * 2011-10-08 2013-04-10 阿里巴巴集团控股有限公司 Multi-dimensional data duplicate removal method and system
CN103116760A (en) * 2013-02-18 2013-05-22 人民搜索网络股份公司 Method and device for identifying text-missing web pages
CN103123618A (en) * 2011-11-21 2013-05-29 北京新媒传信科技有限公司 Text similarity obtaining method and device
CN103164698A (en) * 2013-03-29 2013-06-19 华为技术有限公司 Method and device of generating fingerprint database and method and device of fingerprint matching of text to be tested
CN103294671A (en) * 2012-02-22 2013-09-11 腾讯科技(深圳)有限公司 Document detection method and system
CN103559259A (en) * 2013-11-04 2014-02-05 同济大学 Method for eliminating similar-duplicate webpage on the basis of cloud platform
CN103745012A (en) * 2014-01-28 2014-04-23 广州一呼百应网络技术有限公司 Method and system for intelligently matching and showing recommended information of web page according to product title
CN103778163A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Rapid webpage de-weight algorithm based on fingerprints
CN104636319A (en) * 2013-11-11 2015-05-20 腾讯科技(北京)有限公司 Text duplicate removal method and device
CN105095162A (en) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 Text similarity determining method and device, electronic equipment and system
CN105160014A (en) * 2015-09-24 2015-12-16 四川师范大学 Data processing method and apparatus
CN105677661A (en) * 2014-09-30 2016-06-15 华东师范大学 Method for detecting repetition data of social media
CN106407195A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Method and system for eliminating duplication of webpage
CN106446148A (en) * 2016-09-21 2017-02-22 中国运载火箭技术研究院 Cluster-based text duplicate checking method
CN106446124A (en) * 2016-09-19 2017-02-22 成都知道创宇信息技术有限公司 Website classification method based on network relation graph
CN106547764A (en) * 2015-09-18 2017-03-29 北京国双科技有限公司 The method and device of web data duplicate removal
WO2017080320A1 (en) * 2015-11-09 2017-05-18 北京奇虎科技有限公司 Method of mining and cleaning up similar books in book database, and device utilizing same
CN106815226A (en) * 2015-11-27 2017-06-09 阿里巴巴集团控股有限公司 Text matching technique and device
CN107004221A (en) * 2014-11-28 2017-08-01 Bc卡有限公司 For predict using industry card use pattern analysis method and perform its server
CN107273175A (en) * 2016-04-06 2017-10-20 龙芯中科技术有限公司 Program optimization method and device
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN109165307A (en) * 2018-09-19 2019-01-08 腾讯科技(深圳)有限公司 A kind of characteristic key method, apparatus and storage medium
CN110532795A (en) * 2019-07-11 2019-12-03 西安交通大学 A kind of repeated data detection method calculated based on rabin fingerprint and exclusive or
CN111079403A (en) * 2019-12-10 2020-04-28 深圳市兴之佳科技有限公司 Page comparison method and device
CN111159996A (en) * 2019-12-31 2020-05-15 福建福诺移动通信技术有限公司 Short text set similarity comparison method and system based on improved text fingerprint algorithm
CN116383334A (en) * 2023-06-05 2023-07-04 长沙丹渥智能科技有限公司 Method, device, computer equipment and medium for removing duplicate report

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
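《计算机工程》 (Computer Engineering) 20080131 张祖平 et al. Distributed computing system based on biological computing, pp. 86-88 1-5 Vol. 34, No. 2 2 *
《计算机系统应用》 (Computer Systems & Applications) 20101031 龙军 et al. Heterogeneous data conversion technology for an expert information semantic model, pp. 57-62 1-5 Vol. 19, No. 10 2 *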

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103036697A (en) * 2011-10-08 2013-04-10 阿里巴巴集团控股有限公司 Multi-dimensional data duplicate removal method and system
CN103036697B (en) * 2011-10-08 2015-07-15 阿里巴巴集团控股有限公司 Multi-dimensional data duplicate removal method and system
CN103123618A (en) * 2011-11-21 2013-05-29 北京新媒传信科技有限公司 Text similarity obtaining method and device
CN103123618B (en) * 2011-11-21 2016-09-14 北京新媒传信科技有限公司 Text similarity acquisition methods and device
CN103294671A (en) * 2012-02-22 2013-09-11 腾讯科技(深圳)有限公司 Document detection method and system
CN103294671B (en) * 2012-02-22 2018-04-27 深圳市世纪光速信息技术有限公司 The detection method and system of document
CN102831198A (en) * 2012-08-07 2012-12-19 人民搜索网络股份公司 Similar document identifying device and similar document identifying method based on document signature technology
CN103778163A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Rapid webpage de-weight algorithm based on fingerprints
CN103116760A (en) * 2013-02-18 2013-05-22 人民搜索网络股份公司 Method and device for identifying text-missing web pages
CN103164698B (en) * 2013-03-29 2016-01-27 华为技术有限公司 Text fingerprints library generating method and device, text fingerprints matching process and device
CN103164698A (en) * 2013-03-29 2013-06-19 华为技术有限公司 Method and device of generating fingerprint database and method and device of fingerprint matching of text to be tested
CN103559259A (en) * 2013-11-04 2014-02-05 同济大学 Method for eliminating similar-duplicate webpage on the basis of cloud platform
CN104636319A (en) * 2013-11-11 2015-05-20 腾讯科技(北京)有限公司 Text duplicate removal method and device
CN104636319B (en) * 2013-11-11 2018-09-28 腾讯科技(北京)有限公司 A kind of text De-weight method and device
CN103745012A (en) * 2014-01-28 2014-04-23 广州一呼百应网络技术有限公司 Method and system for intelligently matching and showing recommended information of web page according to product title
CN105095162A (en) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 Text similarity determining method and device, electronic equipment and system
CN105677661A (en) * 2014-09-30 2016-06-15 华东师范大学 Method for detecting repetition data of social media
CN107004221A (en) * 2014-11-28 2017-08-01 Bc卡有限公司 For predict using industry card use pattern analysis method and perform its server
CN106407195A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Method and system for eliminating duplication of webpage
CN106547764A (en) * 2015-09-18 2017-03-29 北京国双科技有限公司 The method and device of web data duplicate removal
CN105160014A (en) * 2015-09-24 2015-12-16 四川师范大学 Data processing method and apparatus
WO2017080320A1 (en) * 2015-11-09 2017-05-18 北京奇虎科技有限公司 Method of mining and cleaning up similar books in book database, and device utilizing same
CN106815226A (en) * 2015-11-27 2017-06-09 阿里巴巴集团控股有限公司 Text matching technique and device
CN107273175A (en) * 2016-04-06 2017-10-20 龙芯中科技术有限公司 Program optimization method and device
CN106446124A (en) * 2016-09-19 2017-02-22 成都知道创宇信息技术有限公司 Website classification method based on network relation graph
CN106446124B (en) * 2016-09-19 2019-11-15 成都知道创宇信息技术有限公司 A kind of Website classification method based on cyberrelationship figure
CN106446148A (en) * 2016-09-21 2017-02-22 中国运载火箭技术研究院 Cluster-based text duplicate checking method
CN106446148B (en) * 2016-09-21 2019-08-09 中国运载火箭技术研究院 A kind of text duplicate checking method based on cluster
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN109165307B (en) * 2018-09-19 2021-02-02 腾讯科技(深圳)有限公司 Feature retrieval method, device and storage medium
CN109165307A (en) * 2018-09-19 2019-01-08 腾讯科技(深圳)有限公司 A kind of characteristic key method, apparatus and storage medium
CN110532795B (en) * 2019-07-11 2021-02-19 西安交通大学 Repeating data detection method based on rabin fingerprint and XOR calculation
CN110532795A (en) * 2019-07-11 2019-12-03 西安交通大学 A kind of repeated data detection method calculated based on rabin fingerprint and exclusive or
CN111079403A (en) * 2019-12-10 2020-04-28 深圳市兴之佳科技有限公司 Page comparison method and device
CN111079403B (en) * 2019-12-10 2023-08-08 深圳市兴之佳科技有限公司 Page comparison method and device
CN111159996A (en) * 2019-12-31 2020-05-15 福建福诺移动通信技术有限公司 Short text set similarity comparison method and system based on improved text fingerprint algorithm
CN116383334A (en) * 2023-06-05 2023-07-04 长沙丹渥智能科技有限公司 Method, device, computer equipment and medium for removing duplicate report
CN116383334B (en) * 2023-06-05 2023-08-08 长沙丹渥智能科技有限公司 Method, device, computer equipment and medium for removing duplicate report

Also Published As

Publication number Publication date
CN102024065B (en) 2013-01-02

Similar Documents

Publication Publication Date Title
CN102024065B (en) SIMD optimization-based webpage duplication elimination and concurrency method
Navarro Indexing highly repetitive string collections, part II: Compressed indexes
Adelfio et al. Schema extraction for tabular data on the web
Menai Detection of plagiarism in Arabic documents
Wang et al. Document zone content classification and its performance evaluation
JP2000231563A (en) Document retrieving method and its system and computer readable recording medium for recording document retrieval program
Abdelaziz et al. A large vocabulary system for Arabic online handwriting recognition
CN109145260A (en) A kind of text information extraction method
Bellare et al. Learning extractors from unlabeled text using relevant databases
Mäkinen et al. Linear time construction of indexable founder block graphs
CN106528524A (en) Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
CN109344403A (en) A kind of document representation method of enhancing semantic feature insertion
Zhu et al. Webpage understanding: an integrated approach
Navarro Indexing highly repetitive string collections
Augsten et al. Approximate joins for data-centric XML
Navarro Computing MEMs and Relatives on Repetitive Text Collections
CN110019674A (en) A kind of text plagiarizes detection method and system
JP2010182238A (en) Citation detection device, device and method for creating original document database, program and recording medium
Dölek et al. A deep learning model for Ottoman OCR
Celebi et al. Segmenting hashtags using automatically created training data
CN113297844B (en) Method for detecting repeatability data based on doc2vec model and minimum editing distance
Katsura et al. Permuted pattern matching on multi-track strings
CN112765940A (en) Novel webpage duplicate removal method based on subject characteristics and content semantics
CN113609246A (en) Webpage similarity detection method and system
Maheswari et al. Rule based morphological variation removable stemming algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130102

Termination date: 20150118

EXPY Termination of patent right or utility model