CN102024065A - SIMD optimization-based webpage duplication elimination and concurrency method - Google Patents
- Publication number
- CN102024065A, CN 201110021002 A
- Authority
- CN
- China
- Prior art keywords
- finger
- webpage
- fingerprint
- shingle
- webpages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a parallel webpage de-duplication method based on SIMD (single instruction, multiple data) optimization, comprising the following steps: 1, extracting webpage text information, namely extracting the effective information of the webpages; 2, extracting shingles, namely extracting webpage features and generating a Shingles set; 3, clustering, to reduce the number of comparisons and the time and space complexity; and 4, comparing fingerprints to find similar webpages and remove them. The method maintains precision and recall while effectively increasing the speed of webpage similarity detection.
Description
Technical field
The invention belongs to the field of computer application technology and relates to a parallel webpage de-duplication method based on SIMD optimization. SIMD (Single Instruction Multiple Data) is a technique in which one controller drives multiple processors so that the same operation is performed simultaneously on each element of a group of data (also called a "data vector"), thereby achieving spatial parallelism. In a microprocessor, SIMD means that one controller controls multiple parallel processing elements, for example Intel's MMX or SSE and AMD's 3DNow! technology.
Technical background
With the rapid development of computer science and network technology, the network has become an important channel through which people obtain information. The greatest difficulty search engines currently face is that the returned result set contains a large amount of repeated information. This repeated information not only wastes users' time and adds to their burden, but also occupies a large amount of storage space and bandwidth and reduces indexing efficiency. How to classify search engine result sets or remove duplicate webpages has therefore become an important step in improving search engine retrieval efficiency.
Webpage de-duplication algorithms based on "approximate fingerprints" map the character strings of a text to a set of hash values, turning string matching into numeric comparison; computation is fast and suits large-scale processing. However, choosing the size and number of text blocks is difficult. The most complete text block treats the full text as a single block, but such a comparison can only detect copies that are identical character for character, so it only solves the problem of verbatim duplication. Similarity detection algorithms based on "Shingle" segment the text into words, extract shingle features, and compute similarity by comparing the number of shared shingles. Such an algorithm must consider how parameters such as the similarity threshold, the shingle sliding-window size, the shingle weight coefficients and document attributes affect the precision and recall of webpage de-duplication, and must avoid setting the similarity threshold blindly.
Streaming SIMD Extensions SSE4.2 [3] is Intel's largest upgrade to its ISA extension instruction set since SSE2. The new SSE4.2 instructions target two areas: STTNI (String and Text New Instructions) for string and character processing, and ATA (Application Targeted Accelerators) for accelerating specific applications. The new instructions improve performance for applications ranging from multimedia to high-performance computing, and also use dedicated circuitry to accelerate particular applications. The present invention uses embedded-assembly SSE coding, with code optimized for the architecture of the "Intel Core i7" series processors, so that a 128-bit fingerprint can be compared in a single operation. Analysis, experiments and practical application show that this algorithm effectively increases the speed of document similarity detection without sacrificing precision or recall.
Summary of the invention
The object of the present invention is to propose a parallel webpage de-duplication method based on SIMD optimization that effectively increases the speed of webpage similarity detection while maintaining high precision and high recall.
The technical solution of the present invention is as follows:
A parallel webpage de-duplication method based on SIMD optimization, characterized by comprising the following steps:
Step 1: webpage text information extraction step: this process extracts the effective information of the webpage;
Step 2: Shingle extraction step: this process extracts webpage features and generates the Shingles set;
Step 3: clustering step: this process reduces the number of comparisons and the time and space complexity;
Step 4: fingerprint comparison step: this process finds similar webpages and removes them.
The specific steps of step 1 are:
Files in the HTML, XHTML and XML webpage formats are scanned, the tag information of the webpage is used to extract the title of the text, and information unrelated to the body text is filtered out.
The specific steps of step 2 are:
First, forward maximum matching word segmentation is applied to the extracted webpage text information to generate a set of words;
Then, a stop-word list is built and used to filter out noise in the webpage, and the Shingles set is generated according to the configured window size; here noise means the meaningless words present in the set of words.
The main flow of the forward maximum matching segmentation algorithm is as follows: suppose the longest entry in the segmentation dictionary contains MAX Chinese characters. The first MAX characters of the text to be processed are taken as the matching field and looked up in the dictionary. If the dictionary contains such a MAX-character word, the match succeeds and the matching field is cut out as a word. If no such word is found, the match fails, the last character of the matching field is removed, and the lookup is repeated until a match succeeds. This completes one round of matching, that is, one word has been segmented. The same steps are then repeated until every word in the text has been segmented.
In terms of precision and recall, the smaller the window the better; in terms of display effect, the larger the window the better. In general a window of 2 to 4 is appropriate.
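The forward maximum matching flow described above can be sketched in Python as follows. This is only an illustrative sketch: the function name, the toy dictionary and the rule that a single unmatched character is emitted as its own word (so that the loop always terminates) are assumptions, not details given in the patent.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by forward maximum matching against a word dictionary."""
    if max_len is None:
        # MAX: number of characters in the longest dictionary entry
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest matching field first; drop the last character on each miss
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Assumption: a single character is always accepted as a word
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical toy dictionary, for illustration only
toy_dictionary = {"网页", "去重", "并行", "方法", "并行方法"}
print(forward_max_match("网页去重并行方法", toy_dictionary))
```

Because the longest candidate is tried first, "并行方法" is kept as one word even though "并行" and "方法" are also dictionary entries.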
The specific steps of step 3 are:
First, for the generated Shingle set, let its size be L; one Shingle is selected every L/n Shingles from the set to form its sample list Sample_Shingle_List;
Then, M different, independent random permutation hash functions are applied to Sample_Shingle_List; each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the smallest fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage;
Finally, the fingerprints generated for the N webpages are clustered: webpages that have an identical fingerprint are merged into one class, yielding the clustered webpage sets.
M is an integer between 7 and 10.
Using M independent random permutation hash functions $\pi_1, \pi_2, \ldots, \pi_M$, the Sample_Shingle_List set $S_d$ of any document $d$ is converted to Sample_Finger_List:
$$\text{Sample\_Finger\_List} = (\min\{\pi_1(S_d)\}, \min\{\pi_2(S_d)\}, \ldots, \min\{\pi_M(S_d)\})$$
For example:
Ω={1,2,3,4,5}, S1={1,2,3}, S2={1,2,4}, where Ω denotes the universal set.
π1{1,2,3,4,5}->{3,2,1,5,4}, where π1 denotes one of the M independent random permutation hash functions.
π2{1,2,3,4,5}->{2,3,5,4,1}
...
πM{1,2,3,4,5}->{5,3,1,2,4}
π1(S1)={3,2,1};π2(S1)={2,3,5};πM(S1)={5,3,1};
π1(S2)={3,2,5};π2(S2)={2,3,4};πM(S2)={5,3,2};
That is, 1->3, 2->2, 3->1, 4->5, 5->4, so π1(S1)=π1({1,2,3})={3,2,1}.
The others follow by analogy.
min(π(S1))=Sample_Finger_List(S1)={1,2,1}
min(π(S1)) denotes the set formed by taking the minimum value of each of the sets π1(S1), π2(S1), ..., πM(S1).
min(π(S2))=Sample_Finger_List(S2)={2,2,2}
Both lists contain the identical fingerprint element {2}, so the two documents are placed in the same class.
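The worked example above can be reproduced with a short Python sketch. The permutation tables are copied from the example; the helper names and the reading of "identical fingerprint" as "equal minimum under the same permutation" are assumptions made for illustration.

```python
# The three permutations of the worked example, written as lookup tables
PERMUTATIONS = [
    {1: 3, 2: 2, 3: 1, 4: 5, 5: 4},  # pi_1
    {1: 2, 2: 3, 3: 5, 4: 4, 5: 1},  # pi_2
    {1: 5, 2: 3, 3: 1, 4: 2, 5: 4},  # pi_M
]


def sample_finger_list(shingle_ids, permutations=PERMUTATIONS):
    """Return the minimum permuted value of the set under each permutation."""
    return tuple(min(pi[s] for s in shingle_ids) for pi in permutations)


def share_fingerprint(fingers_a, fingers_b):
    """Assumption: two pages match if any same-position minimum is equal."""
    return any(x == y for x, y in zip(fingers_a, fingers_b))


s1 = {1, 2, 3}
s2 = {1, 2, 4}
f1 = sample_finger_list(s1)
f2 = sample_finger_list(s2)
print(f1, f2, share_fingerprint(f1, f2))  # (1, 2, 1) (2, 2, 2) True
```

The two lists agree in their second position, the minimum under π2, which is why the example places both documents in one class.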
The specific steps of step 4 are:
According to the clustering result, the webpage IDs in each class are taken out; supposing the class contains n webpages, the fingerprint set of all the webpages is
Define the function
Consider the fingerprints of each row: first generate a 128-dimensional vector V with every element initialized to 0. Then examine each bit of each fingerprint in the row: if the bit is 1, the corresponding element of V is incremented by 1; if the bit is 0, the corresponding element of V is decremented by 1. After accumulation, each positive element of V is set to 1 and each negative element is set to 0;
...
The vectors $V_i$, $i = 1, 2, \ldots, n$, represent the fingerprints of the n documents; each $V_i$ is a 128-bit long number that characterizes the fingerprint of the corresponding webpage;
The specific fingerprint comparison flow is as follows:
Step a: let x, y and z denote three 128-bit XMM registers, let a denote the 128-bit fingerprint of webpage A and b the 128-bit fingerprint of webpage B, and load a and b into registers x and y respectively;
Step b: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value Mask of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b;
Step c: use _mm_cvtsd_f64 to obtain the low 64 bits Mask_low of Mask, and use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits Mask_high of Mask;
Step d: use _popcnt64 to count the 1 bits in Mask_low and Mask_high respectively, and assign their sum to Count;
Step e: when Count > Dx, webpage A and webpage B are considered dissimilar; otherwise they are similar. Dx is the first distance threshold;
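Steps a to e can be mirrored in plain Python to make the arithmetic explicit. This sketch does not use SSE at all: Python integers stand in for the 128-bit XMM registers, and a bit-string count stands in for the popcount instruction. The function names are hypothetical, and the default threshold of 6 is taken from the value the description gives for Dx.

```python
MASK_64 = (1 << 64) - 1
MASK_128 = (1 << 128) - 1


def popcount64(value):
    """Count the 1 bits in the low 64 bits of value."""
    return bin(value & MASK_64).count("1")


def hamming_distance_128(a, b):
    """Hamming distance of two 128-bit fingerprints held in Python ints."""
    mask = (a ^ b) & MASK_128      # step b: XOR of the two fingerprints
    mask_low = mask & MASK_64      # step c: low 64 bits
    mask_high = mask >> 64         # step c: high 64 bits
    return popcount64(mask_low) + popcount64(mask_high)  # step d


def is_similar(a, b, dx=6):
    """Step e: pages are dissimilar when the distance exceeds Dx."""
    return hamming_distance_128(a, b) <= dx


page_a = (1 << 127) | 0b1011
page_b = 0b0001
print(hamming_distance_128(page_a, page_b), is_similar(page_a, page_b))
```

Splitting the mask into two 64-bit halves matches the flow in the text, where each half is counted separately and the two counts are added.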
Method 2:
Consider the fingerprint set:
The fingerprints are compared column by column and the distances between the fingerprints are then summed; when the sum of distances is greater than the second distance threshold Dy, webpage i and webpage j are considered dissimilar; otherwise they are similar.
General research and experimental results indicate that if the first fingerprint distance Dx is greater than 6, the pages are not duplicates.
Dy is taken as 0.8M.
Beneficial effects:
The present invention addresses the shortcomings of existing webpage de-duplication algorithms, namely low speed, poor search results and low recall, by proposing a parallel webpage de-duplication algorithm based on SIMD optimization. Applied in the development of a webpage de-duplication system, the algorithm effectively increases the speed of webpage similarity detection without sacrificing precision or recall.
To achieve this object, the execution time of each stage of the webpage de-duplication system was analyzed. The algorithm mainly comprises sub-processes such as feature extraction, fingerprint extraction and fingerprint comparison, which have high inherent SIMD parallelism. Statistical analysis with Intel(R) VTune(TM) Performance Analyzer [4] showed that the fingerprint comparison sub-process accounts for 58.6% of the total system time, so optimizing fingerprint comparison yields the largest improvement in system running speed.
Based on this analysis, a SIMD-based webpage similarity algorithm is adopted to optimize the shingle comparison. By organizing the data in SSE data types and applying SIMD parallelization to the main sub-processes, a webpage de-duplication algorithm based on SSE optimization was designed. This algorithm overcomes the low speed, low recall and poor search results of general webpage de-duplication algorithms, and is fast, effective and easy to implement.
SSE uses 128-bit storage units, each of which holds four 32-bit floating-point numbers; in other words, every SSE computation operates on four floating-point numbers at once, and this batch processing improves efficiency. In addition, the SSE4.2 instruction set adds two major groups of optimized instructions, STTNI (String and Text New Instructions) and ATA (Application Targeted Accelerators). STTNI is mainly aimed at optimizing XML document and data processing, raising performance in this area to 3.8 times that of the previous generation. The sub-processes of the webpage de-duplication algorithm, such as feature extraction, fingerprint extraction and fingerprint comparison, mainly involve XML document and data processing and string processing. In theory, therefore, the SSE technique can markedly increase the speed of webpage de-duplication.
Description of drawings
Fig. 1 is a main flow chart of the present invention;
Fig. 2 is the flow chart of webpage information extraction;
Fig. 3 is the flow chart of Shingle extraction;
Fig. 4 is the flow chart of clustering;
Fig. 5 is the flow chart of fingerprint comparison method 1;
Fig. 6 is the flow chart of fingerprint comparison method 2.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments:
Embodiment 1:
The main steps of the SIMD-based webpage de-duplication algorithm in this example are as follows:
1, Webpage text information extraction. This process mainly extracts the effective information of the webpage;
2, Shingle extraction. This process mainly extracts webpage features;
3, Clustering. This process mainly reduces the number of comparisons and the time and space complexity;
4, Fingerprint comparison. This process mainly finds similar webpages and removes them.
Each step is detailed below:
1 Webpage text information extraction
The text structure of a webpage mainly comprises a physical structure and a logical structure. The physical structure refers to the composition of the webpage, mainly including webpage tags, webpage title, webpage content, article title, advertisements and similar information; the logical structure mainly refers to the structure between paragraphs in the webpage content, the organization style of the text and the logical manner of expression.
This process handles common webpage formats such as HTML, XHTML and XML. It mainly removes noise such as advertisement bars, navigation bars, site logos and pictures, and extracts key information such as the title, body text and author. It does so by scanning the HTML, XHTML or XML, using the webpage's tag information to extract the title of the text, and filtering out information unrelated to the body text.
At present, HTML, XHTML and XML are mostly parsed, in China and abroad, by building a DOM syntax tree over the webpage text data. SSE4.2 adds four instructions specific to strings and characters, which can accelerate many string- and text-related applications, for example string pattern matching and string comparison. This step uses the new SSE4.2 string instructions to accelerate existing mainstream webpage information extraction algorithms and so speed up extraction.
The webpage information extraction flow is as follows:
101 Define the common HTML, XHTML and XML tags, such as <body>, </body>, <title>, </title>, <div> and other common text-information tags, and use the _mm_set_epi8 instruction to preload the binary codes of these tags into 128-bit SSE variables. The main function of _mm_set_epi8 is to load 16 bytes, that is 128 binary bits, into a 128-bit SSE variable.
102 Use SSE4.2 functions to match the HTML, XHTML and XML tags. This process relies on the parallel computing capability of SSE4.2 processors to speed up information extraction. Instructions such as _mm_cmpestri, _mm_cvtsi128_si32 and _mm_cmpestrm compare the webpage information with the preloaded tags (128-bit SSE variables). _mm_cmpestri is an instruction specific to SSE4.2 whose main function is to compare 128-bit variables; if the two strings are identical it returns 0, which indicates that a tag has been matched. The position of that tag is recorded, and the process is repeated until all HTML tags have been found. _mm_cvtsi128_si32 converts the low 32 bits of a 128-bit variable into a 32-bit int.
103 Information conversion and filtering. This process converts the extracted title and body text: irrelevant information, irrelevant markup and hyperlink information in the text are filtered out, and necessary conversions are performed while preserving structural information such as necessary placeholders, text structure and paragraph information. Filtering works by reading the webpage information continuously and matching it against regular expressions; a learning method is used at the same time to refine the regular expressions so that irrelevant information is filtered out.
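The tag-driven extraction in steps 101 to 103 can be illustrated without SSE. The sketch below uses the HTMLParser class from Python's standard library in place of the SSE4.2 string instructions, so it shows only the logic (locate title and body text, drop unrelated content), not the acceleration. The class name, the set of skipped tags and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect title text and body text, skipping script and style content."""

    SKIPPED = {"script", "style"}  # assumption: treated as unrelated noise

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIPPED for t in self._stack):
            return
        if "title" in self._stack:
            self.title_parts.append(text)
        elif "body" in self._stack:
            self.body_parts.append(text)


def extract(html):
    """Return (title, body_text) for an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)


page = ("<html><head><title>Demo page</title><style>p{}</style></head>"
        "<body><div>Hello <b>world</b></div><script>var x = 1;</script>"
        "</body></html>")
print(extract(page))
```

A production extractor would also need the advertisement, navigation and hyperlink filtering the text describes; those rules are site-specific and are left out here.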
2 Shingle extraction
Webpage features are of great significance for webpage de-duplication and are a key factor affecting the precision and recall of the whole system. In English webpages, words are separated by spaces, so the features can be obtained simply by splitting the text on spaces. Chinese webpages have no natural separators, so the choice of text feature type (character, word, phrase, sentence) is particularly important in feature extraction.
The webpage text is regarded as a sequence of words (tokens), and a Shingle is a contiguous subsequence of that word sequence. A webpage is characterized by its sequence of shingles. For a given document D, S(D, ω) is the set of contiguous subsequences of D of window size ω.
For example, take the sentence: {Shingle is widely applied in large-scale similar text detection technology}.
After forward maximum matching segmentation, the word sequence is: {"Shingle", "widely", "applied", "large-scale", "similar", "text", "detection", "technology"}
With window ω = 2, the set S(D, 2) is: {"Shingle widely", "widely applied", "applied large-scale", "large-scale similar", "similar text", "text detection", "detection technology"}
When the window ω is 4 (the value usually used in real systems), define the sets of token strings of two documents as $S_1(D, \omega)$ and $S_2(D, \omega)$, and let $a = |S_1 \cap S_2|$ denote the number of sequences common to $S_1$ and $S_2$. The resemblance of $S_1$ and $S_2$ is then denoted by R:
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$
The concrete meaning of R is the ratio of the number of shingles the two documents share to the total number of shingles they contain; the closer R is to 1, the more similar the two documents are.
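A minimal Python sketch of the shingle set S(D, ω) and the ratio R described above is given here. It assumes R is the size of the intersection divided by the size of the union of the two shingle sets, which is how the text describes it; the function names and the second example document are hypothetical.

```python
def shingles(tokens, window):
    """Return the set of contiguous token subsequences of length window."""
    return {tuple(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)}


def resemblance(s1, s2):
    """Shared shingles divided by all distinct shingles of both documents."""
    union = s1 | s2
    # Assumption: two empty documents are treated as identical
    return len(s1 & s2) / len(union) if union else 1.0


doc_1 = ["Shingle", "widely", "applied", "large-scale",
         "similar", "text", "detection", "technology"]
# Hypothetical second document that differs only in its last word
doc_2 = doc_1[:-1] + ["method"]

s1 = shingles(doc_1, 2)
s2 = shingles(doc_2, 2)
print(len(s1), resemblance(s1, s2))  # 7 0.75
```

With a window of 2, the eight-word example yields seven shingles; changing one word at the end replaces one shingle, so six of the eight distinct shingles are shared.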
201 Apply forward maximum matching word segmentation to the webpage text information extracted in section 1 to generate the set of words (tokens). The main flow of the forward maximum matching segmentation algorithm is as follows: suppose the longest entry in the segmentation dictionary contains MAX Chinese characters. The first MAX characters of the text to be processed are taken as the matching field and looked up in the dictionary. If the dictionary contains such a MAX-character word, the match succeeds and the matching field is cut out as a word. If no such word is found, the match fails, the last character of the matching field is removed, and the lookup is repeated until a match succeeds. This completes one round of matching, that is, one word has been segmented. The same steps are then repeated until every word in the text has been segmented.
202 Count the word frequencies across the system and filter out high-frequency, low-meaning function words as stop words to build the initial stop-word list. New words that occur repeatedly are identified from the statistics and added to the dictionary database. (The dictionary database is the data source of the forward maximum matching algorithm; only by continually refining it can the segmentation become more accurate.)
203 Use the stop-word list built in step 202 to filter out meaningless noise in the webpage, then generate the Shingles set according to the configured window size. (In terms of precision and recall, the smaller the window the better; in terms of display effect, the larger the window the better. In general a window of 2 to 4 is appropriate.)
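Steps 202 and 203 can be sketched as follows. The patent does not say how many high-frequency words become stop words or how "low meaning" is judged, so this sketch simply takes the top_k most frequent tokens; that cutoff, the function names and the sample data are assumptions.

```python
from collections import Counter


def build_stop_words(token_lists, top_k):
    """Assumption: the top_k most frequent tokens serve as the stop-word list."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(top_k)}


def filtered_shingles(tokens, stop_words, window):
    """Drop stop words, then slide a window over the remaining tokens."""
    kept = [t for t in tokens if t not in stop_words]
    return [tuple(kept[i:i + window])
            for i in range(len(kept) - window + 1)]


docs = [
    ["the", "page", "is", "the", "copy"],
    ["the", "copy", "of", "the", "page"],
]
stop_words = build_stop_words(docs, top_k=1)
print(stop_words, filtered_shingles(docs[0], stop_words, 2))
```

In practice the frequency list would be reviewed so that frequent but meaningful words are not discarded, which is the "low meaning" condition in the text.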
3 Clustering
Clustering is the process of grouping a set of concrete or abstract objects into classes of similar objects. Its basic principle is to place similar or close objects in one class and objects that differ greatly in different classes. Each resulting cluster is a collection of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.
In the Shingle computation, for a webpage set of N webpages in which each webpage contains n words and the Shingle length is m, the number of Shingles per webpage is less than n-m+1 (some stop words having been removed). The time complexity of computing the Shingles of all webpages is therefore O(N*n). The space complexity of the features characterizing one webpage is O(n*m), so the space complexity for all webpages is O(N*n*m); a large number of intermediate results are also produced during computation and likewise occupy much storage. For example, for a webpage set of 100,000 webpages, each containing 100 words, the Shingle algorithm with window size 2 has time complexity O(10^8) and space complexity O(2*10^8). Clearly, de-duplicating webpages directly with the Shingle algorithm takes a great deal of both time and space.
Therefore, the webpage text set is divided into sets of possibly similar webpages, and webpages that need no comparison at all are excluded; subsequent operations are carried out only when two webpages may be similar, which greatly reduces the number of comparisons each document requires.
According to the min-wise hashing principle of Andrei Broder, for a hash function $\pi$ we have
$$\Pr[\min\{\pi(S_1)\} = \min\{\pi(S_2)\}] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R$$
The practical meaning of this formula is: when the Shingle sets $S_1$ and $S_2$ are mapped by a hash function to sets of fingerprints, the probability that the minimum fingerprints of the two sets are equal equals the resemblance of $S_1$ and $S_2$. In other words, for webpages A and B with Shingle sets $S_1$ and $S_2$, applying the hash function $\pi$ to $S_1$ and $S_2$ respectively, the probability that their minimum fingerprints are equal is exactly the similarity of the two webpages.
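The min-wise property stated above can be checked exhaustively on a small universe. The sketch below enumerates every permutation of the universe and counts how often the two minima coincide; it is a verification aid under the assumption that π is drawn uniformly from all permutations, not part of the patented method. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def min_collision_probability(universe, s1, s2):
    """Fraction of all permutations under which min pi(S1) == min pi(S2)."""
    universe = sorted(universe)
    hits = total = 0
    for image in permutations(universe):
        pi = dict(zip(universe, image))  # one permutation as a lookup table
        total += 1
        if min(pi[x] for x in s1) == min(pi[x] for x in s2):
            hits += 1
    return Fraction(hits, total)


def resemblance(s1, s2):
    """Exact ratio of shared elements to all distinct elements."""
    return Fraction(len(s1 & s2), len(s1 | s2))


omega = {1, 2, 3, 4, 5}
s1 = {1, 2, 3}
s2 = {1, 2, 4}
print(min_collision_probability(omega, s1, s2), resemblance(s1, s2))
```

Both values come out as 1/2 for these sets: the element of the union that receives the smallest image is equally likely to be any of its four members, and the minima agree exactly when that element lies in the intersection.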
However, because of the nature of hashing, using only a single fingerprint tends to miss many documents that may be similar. Multiple fingerprints should therefore be used, and the possibly similar documents found by each fingerprint merged, which achieves better precision and recall.
Define $R_0$ as the similarity threshold, and define precision and recall:
Based on the above, the present invention designs a clustering method that generates fingerprints for the webpages and clusters them. The specific steps are as follows:
301 For the Shingle set generated in step 2, let the set size be L and select one Shingle every L/n Shingles as its sample Sample_Shingle_List. On a machine with good performance n=1 can be used, and on a poorly configured machine n can be raised as appropriate; in short, the smaller n is, the better the de-duplication result.
302 Apply M (7 to 10) different, independent random permutation hash functions to Sample_Shingle_List. Each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the smallest fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage. With the M independent random permutation hash functions, the Sample_Shingle_List set $S_d$ of any document $d$ is thus converted to
$$\text{Sample\_Finger\_List} = (\min\{\pi_1(S_d)\}, \min\{\pi_2(S_d)\}, \ldots, \min\{\pi_M(S_d)\})$$
For example:
Ω={1,2,3,4,5},S1={1,2,3},S2={1,2,4}
π1{1,2,3,4,5}->{3,2,1,5,4}
π2{1,2,3,4,5}->{2,3,5,4,1}
...
πM{1,2,3,4,5}->{5,3,1,2,4}
π1(S1)={3,2,1};π2(S1)={2,3,5};πM(S1)={5,3,1};
π1(S2)={3,2,5};π2(S2)={2,3,4};πM(S2)={5,3,2};
Set of min(π(S1))=Sample_Finger_List(S1)={1,2,1}
Set of min(π(S2))=Sample_Finger_List(S2)={2,2,2}
Both contain the identical fingerprint element {2}, so they are placed in the same class.
For n webpages, we have
303 Cluster the fingerprints generated for the webpages: any webpages that share an identical fingerprint are merged into one class. The specific method is as follows:
According to the clustering result, the webpage IDs in each class are taken out; supposing the class contains n webpages, the fingerprint set of all the webpages is
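The merging rule of step 303 can be sketched with a small union-find structure. Two points are assumptions rather than statements from the patent: fingerprints are compared position by position (the same hash function on both pages), and merging is transitive, so pages A and C end up together if each shares a fingerprint with B. The function name and page IDs are hypothetical.

```python
from collections import defaultdict


def cluster_by_fingerprint(finger_lists):
    """Group page IDs that share a fingerprint at the same position.

    finger_lists maps page_id -> tuple of M min-hash fingerprints.
    """
    parent = {pid: pid for pid in finger_lists}

    def find(x):
        # Follow parent links to the class representative (path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (position, fingerprint) -> first page that produced it
    for pid, fingers in finger_lists.items():
        for position, value in enumerate(fingers):
            key = (position, value)
            if key in first_seen:
                parent[find(pid)] = find(first_seen[key])  # merge the classes
            else:
                first_seen[key] = pid

    clusters = defaultdict(list)
    for pid in finger_lists:
        clusters[find(pid)].append(pid)
    return sorted(sorted(members) for members in clusters.values())


pages = {"A": (1, 2, 1), "B": (2, 2, 2), "C": (9, 9, 9)}
print(cluster_by_fingerprint(pages))  # [['A', 'B'], ['C']]
```

Indexing fingerprints in a dictionary avoids comparing every pair of pages, which is the purpose the text gives for the clustering step.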
4 Fingerprint comparison
Each of the clustered webpage sets obtained in section 3 is processed with one of the following two methods:
Method 1:
According to the clustering result, the webpage IDs in each class are taken out; supposing the class contains k webpages, for the fingerprint set of each webpage ID
the fingerprint matrix is vector-mapped into an m-dimensional vector space. Consider the fingerprints of each row: first generate a 128-dimensional vector V with every element initialized to 0. Then examine each bit of each fingerprint in the row: if the bit is 1, the corresponding element of V is incremented by 1; if the bit is 0, it is decremented by 1. After accumulation, each positive element of V is set to 1 and each negative element is set to 0; converting V to a 128-bit long array then gives the fingerprint of this row. The fingerprints, as the m-dimensional vector space, are $[V_1\ V_2\ \ldots\ V_n]^T$.
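The row-wise vector mapping described above, which folds a webpage's fingerprints into one 128-bit value by bit voting, can be sketched as follows. The text only covers positive and negative totals; treating a total of exactly zero as a 0 bit is an assumption, as are the function name and the 4-bit demonstration values.

```python
def combine_fingerprints(fingerprints, bits=128):
    """Fold several fingerprints into one by per-bit voting."""
    votes = [0] * bits  # the vector V, one counter per bit position
    for fp in fingerprints:
        for i in range(bits):
            # +1 when the bit is 1, -1 when the bit is 0
            votes[i] += 1 if (fp >> i) & 1 else -1
    combined = 0
    for i, vote in enumerate(votes):
        # Positive totals become 1; negative (and, by assumption, zero) become 0
        if vote > 0:
            combined |= 1 << i
    return combined


# 4-bit demonstration so the result is easy to check by hand
row = [0b0110, 0b0111, 0b0011]
print(bin(combine_fingerprints(row, bits=4)))  # 0b111
```

Each output bit is the majority bit of that column, so the combined value stays close in Hamming distance to fingerprints that mostly agree.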
The specific fingerprint comparison flow is as follows:
401 Load the fingerprints: let x, y and z denote three 128-bit XMM registers, let a denote the 128-bit fingerprint of webpage A and b the 128-bit fingerprint of webpage B, and load a and b into registers x and y.
402 Compute the fingerprint distance: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value (Mask) of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b.
403 Obtain the high and low 64 bits of the fingerprint: use _mm_cvtsd_f64 to obtain the low 64 bits Mask_low, and use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits Mask_high.
404 Count the differing fingerprint bits: use _popcnt64 to count the 1 bits in Mask_low and Mask_high respectively, and assign their sum to Count.
405 Judge similarity: when Count > Dx (the distance threshold), the webpages are considered dissimilar; otherwise they are similar.
With SIMD fingerprint comparison, two webpages can be compared using only the four instructions _mm_xor_si128, _mm_unpacklo_epi64, _mm_cvtsd_f64 and __popcnt64, which are called six times in total. A general approach, by contrast, needs 128 comparisons for a 128-bit fingerprint, far more than six instructions. For massive data, therefore, the SIMD technique reduces CPU usage and substantially improves system performance.
Method 2:
Consider the fingerprint set:
The following fingerprint comparison method 2 can also be used:
For two webpages i and j, their fingerprint sets are:
The fingerprints are compared column by column, and the distances between the fingerprints are then summed.
406 Initialize: define K=0, Cnt=0;
407 Load the fingerprints: each webpage has M fingerprints; let x, y and z denote three 128-bit XMM registers, and load the K-th column fingerprints of i and j into registers x and y respectively.
408 Compute the distance of each fingerprint pair: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value (Mask) of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b.
409 Accumulate the total distance: Cnt is increased by the fingerprint distance of the current column and K is incremented by 1 (K++); if K > M, go to the similarity judgment in step 410, otherwise return to step 407.
410 Judge similarity: when Cnt > Dy (the distance threshold), the webpages are considered dissimilar; otherwise they are similar.
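The column-by-column loop of method 2 can be written compactly in Python. As with the earlier sketches, Python integers stand in for the XMM registers and no SSE instruction is issued. The default threshold follows the value 0.8M that the description gives for Dy; the function names and test data are hypothetical.

```python
def hamming_distance(a, b):
    """Number of differing bits between two fingerprints held as ints."""
    return bin(a ^ b).count("1")


def is_similar_method2(fingers_i, fingers_j, dy=None):
    """Sum the per-column Hamming distances and compare the total with Dy."""
    if len(fingers_i) != len(fingers_j):
        raise ValueError("both pages must have the same number of fingerprints")
    m = len(fingers_i)
    if dy is None:
        dy = 0.8 * m  # Dy = 0.8M, as stated in the description
    cnt = 0
    for k in range(m):  # one iteration per fingerprint column
        cnt += hamming_distance(fingers_i[k], fingers_j[k])
    return cnt <= dy  # dissimilar when Cnt > Dy


page_i = [0] * 10
page_j = [0b111] + [0] * 9  # differs from page_i in 3 bits overall
print(is_similar_method2(page_i, page_j))
```

With M = 10 the default threshold is 8, so a total of 3 differing bits counts as similar while 9 or more would not.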
Claims (5)
1. a removing duplicate webpages parallel method of optimizing based on SIMD is characterized in that, may further comprise the steps:
Step 1: web page text information extraction step: this process is used for the webpage effective information is extracted;
Step 2:Shingle extraction step: this process is used to extract web page characteristics, and generates the Shingles set;
Step 3: cluster step: this process is used for reducing the comparison number of times, reduces time and space complexity;
Step 4: fingerprint comparison step: this process is used to seek out similar web page, and similar webpage is rejected.
2. The parallel webpage deduplication method based on SIMD optimization according to claim 1, characterized in that step 1 specifically comprises:
Scanning files in HTML, XHTML, or XML webpage format, using the tag information of the webpage to extract the title of the body text, and at the same time filtering out information unrelated to the body text.
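A tag-driven extraction of this kind could look like the following Python sketch, which uses the standard-library html.parser module. It is an illustration only: the class name TextExtractor, the function extract, and the choice of treating script and style content as "unrelated information" are assumptions made here, not details given in the claim.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> text and the visible text of a page."""

    SKIP = {"script", "style"}  # assumed examples of content unrelated to the text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._skip_depth == 0:
            self.text_parts.append(text)


def extract(html: str):
    """Return (title, body_text) for an HTML document given as a string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)
```

A production extractor would also need to handle navigation bars, advertisements and similar page furniture, which this sketch does not attempt.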
3. The parallel webpage deduplication method based on SIMD optimization according to claim 1, characterized in that step 2 specifically comprises:
First, applying forward maximum matching word segmentation to the extracted webpage text to generate a set of words;
Then, building a stop-word list and using it to filter noise out of the webpage, and generating the Shingles set according to the configured window size; noise refers to the meaningless words present in the set of words.
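The two sub-steps of this claim can be sketched in Python as below. The sketch assumes a caller-supplied dictionary (vocab) and stop-word set; the function names forward_max_match and make_shingles, the max_len parameter, and the representation of a shingle as a tuple of consecutive words are illustrative choices, not taken from the patent.

```python
def forward_max_match(text: str, vocab: set, max_len: int = 4) -> list:
    """Forward maximum matching: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


def make_shingles(words: list, stop_words: set, window: int) -> list:
    """Drop stop words, then slide a window of `window` words over the rest."""
    kept = [w for w in words if w not in stop_words]
    return [tuple(kept[k:k + window]) for k in range(len(kept) - window + 1)]
```

With the toy dictionary {"ab", "abc", "d"}, the text "abcd" is segmented into ["abc", "d"] because the longest match is preferred at each position. If fewer than `window` words survive the stop-word filter, the shingle list is empty.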
4. The parallel webpage deduplication method based on SIMD optimization according to claim 1, characterized in that step 3 specifically comprises:
First, for the generated Shingle set of size L, selecting one Shingle out of every L/n Shingles in the Shingle set to form its sampling list Sample_Shingle_List;
Then, applying M different, independent random permutation hash functions to Sample_Shingle_List; each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the smallest fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage;
Finally, clustering the fingerprints generated for the N webpages; during clustering, webpages with identical fingerprints are assigned to the same class, which finally yields the clustered webpage collections.
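A minimal Python sketch of the sampling, min-hash and clustering steps follows. Several points are assumptions made for illustration: salted BLAKE2b with a 16-byte digest stands in for the M independent random permutation hash functions (the claim does not name a hash), shingles are plain strings, and "identical fingerprints" is read as an identical tuple of all M minimum fingerprints. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def sample_shingles(shingles: list, n: int) -> list:
    """Take one shingle out of every L/n shingles, where L = len(shingles)."""
    step = max(1, len(shingles) // n)
    return shingles[::step]


def hash128(shingle: str, index: int) -> int:
    """The index-th 128-bit hash function (salted BLAKE2b, an assumed stand-in)."""
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=16, salt=salt).digest()
    return int.from_bytes(digest, "big")


def page_fingerprints(shingles: list, n: int, m: int) -> tuple:
    """For each of the m hash functions keep the minimum 128-bit fingerprint."""
    sample = sample_shingles(shingles, n)  # must be non-empty
    return tuple(min(hash128(s, i) for s in sample) for i in range(m))


def cluster_pages(pages: dict, n: int, m: int) -> list:
    """Group page IDs whose fingerprint tuples are identical."""
    clusters = defaultdict(list)
    for page_id, shingles in pages.items():
        clusters[page_fingerprints(shingles, n, m)].append(page_id)
    return list(clusters.values())
```

Only pages that land in the same class need to be compared in the later fingerprint comparison step, which is how clustering reduces the number of comparisons.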
5. The parallel webpage deduplication method based on SIMD optimization according to any one of claims 1-4, characterized in that step 4 specifically comprises:
Using either of the following two methods:
Method 1:
According to the clustering result, the webpage IDs in each class are taken out; supposing the class contains n webpages, the fingerprint set of all the webpages is
Define the function
For the fingerprints of each row, first generate a 128-dimensional vector V with every element initialized to 0; then examine each bit of each fingerprint in the row: if the bit is 1, add 1 to the corresponding element of V, and if the bit is 0, subtract 1 from the corresponding element of V; after accumulation, each positive element of V is set to 1 and each negative element is set to 0;
...
The vectors V_i represent the fingerprints of the n documents, where i = 1, 2, ..., n; each V_i is a 128-bit long number that characterizes the fingerprint of the corresponding webpage;
The specific fingerprint comparison flow is as follows:
Step a: let x, y, z denote three 128-bit xmm registers (general-purpose registers), let a denote the 128-bit fingerprint of webpage A and b the 128-bit fingerprint of webpage B, and load a and b into registers x and y respectively;
Step b: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value Mask of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b;
Step c: use _mm_cvtsd_f64 to obtain the low 64 bits Mask_low of Mask, and use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits Mask_high of Mask;
Step d: use _popcnt64 to count the 1 bits in Mask_low and Mask_high respectively, and assign their sum to Count;
Step e: when Count > Dx, webpage A and webpage B are considered dissimilar; otherwise they are considered similar; Dx is the first distance threshold;
Method 2:
Consider the fingerprint set:
Compare the fingerprints column by column, then compute the sum of the distances between the fingerprints; when the sum of distances is greater than the second distance threshold Dy, webpage i and webpage j are considered dissimilar; otherwise they are considered similar.
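The bit-voting step of Method 1 in claim 5, which collapses the fingerprints of one row into a single 128-bit value, can be sketched in Python as follows. The function name combine_fingerprints is hypothetical, and mapping an element that ends at exactly 0 to the bit 0 is an assumption, since the claim only specifies positive and negative elements.

```python
def combine_fingerprints(fingerprints: list, bits: int = 128) -> int:
    """Collapse several fingerprints into one by per-bit voting."""
    v = [0] * bits                      # the vector V, initialized to 0
    for fp in fingerprints:
        for pos in range(bits):
            if (fp >> pos) & 1:
                v[pos] += 1             # bit is 1: add 1 to the element
            else:
                v[pos] -= 1             # bit is 0: subtract 1 from the element
    result = 0
    for pos, weight in enumerate(v):
        if weight > 0:                  # positive -> 1; negative (and 0, assumed) -> 0
            result |= 1 << pos
    return result
```

For the three fingerprints 0b110, 0b100 and 0b101, only bit 2 is set in a majority of them, so the combined value is 0b100.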
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110021002 CN102024065B (en) | 2011-01-18 | 2011-01-18 | SIMD optimization-based webpage duplication elimination and concurrency method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102024065A (en) | 2011-04-20 |
CN102024065B (en) | 2013-01-02 |
Family
ID=43865362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110021002 Expired - Fee Related CN102024065B (en) | 2011-01-18 | 2011-01-18 | SIMD optimization-based webpage duplication elimination and concurrency method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102024065B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710333A (en) * | 2009-11-26 | 2010-05-19 | 西北工业大学 | Network text segmenting method based on genetic algorithm |
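Non-Patent Citations (2)
Title |
---|
"Computer Engineering" (《计算机工程》), 2008-01-31, Zhang Zuping et al., "Distributed computing system based on biological computing", pp. 86-88, 1-5, vol. 34, no. 2, 2 * |
"Computer Systems & Applications" (《计算机系统应用》), 2010-10-31, Long Jun et al., "Heterogeneous data conversion technology for an expert information semantic model", pp. 57-62, 1-5, vol. 19, no. 10, 2 * |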
Also Published As
Publication number | Publication date |
---|---|
CN102024065B (en) | 2013-01-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130102; Termination date: 20150118 |
EXPY | Termination of patent right or utility model | ||